[rocprofiler-compute] metrics generator (#1199)

2025-10-22 15:17:43 -04:00
@@ -7,12 +7,23 @@ repos:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
-    # Python import sorting and formatting
+
+  # Python import sorting and formatting
  - repo: https://github.com/astral-sh/ruff-pre-commit
-    # Ruff version. Check https://github.com/astral-sh/ruff-pre-commit#version-compatibility,
+    # Ruff version. Check https://github.com/astral-sh/ruff-pre-commit#version-compatibility
    # for the latest ruff version supported by the hook.
    rev: v0.12.12
    hooks:
      - id: ruff-check
-        args: [--fix, --exit-non-zero-on-fix]
-      - id: ruff-format
+        args: [--fix]
+      - id: ruff-format
+
+  # Local hook: hash consistency check
+  - repo: local
+    hooks:
+      - id: hash-check
+        name: Hash consistency check
+        entry: bash -lc 'cd projects/rocprofiler-compute && python3 tools/config_management/hash_checker.py'
+        language: system
+        pass_filenames: false
+        stages: [pre-commit]
@@ -5,8 +5,12 @@ Full documentation for ROCm Compute Profiler is available at [https://rocm.docs.
 ## Unreleased

 ### Added
+* Added the `--list-blocks <arch>` general option, which lists the IP blocks available on the specified arch (similar to `--list-metrics`). It cannot be used with `--block`.
+* Added `config_delta/gfx950_diff.yaml` to the analysis config YAMLs to track the differences between a gfx9 architecture and the latest supported architecture, gfx950.

 ### Changed
+* `-b/--block` now accepts block alias(es). List the available aliases with the command-line option `--list-blocks <arch>`.
+* Analysis config YAMLs are now managed with the new config management workflow in `tools/config_management/`.

 ### Removed

@@ -400,18 +400,6 @@ add_test(
    WORKING_DIRECTORY ${PROJECT_SOURCE_DIR}
 )

-# ---------------------------
-# DB Connector tests
-# ---------------------------
-
-add_test(
-    NAME test_db_connector
-    COMMAND
-        ${Python3_EXECUTABLE} -m pytest --junitxml=tests/test_db_connector.xml
-        ${COV_OPTION} ${PROJECT_SOURCE_DIR}/tests/test_db_connector.py
-    WORKING_DIRECTORY ${PROJECT_SOURCE_DIR}
-)
-
 # ---------------------------
 # Utils tests
 # ---------------------------
@@ -547,6 +535,13 @@ install(
    COMPONENT main
    PATTERN "__pycache__" EXCLUDE
 )
+# tools/config_management
+install(
+    DIRECTORY tools/config_management
+    DESTINATION ${CMAKE_INSTALL_LIBEXECDIR}/${PROJECT_NAME}
+    COMPONENT main
+    PATTERN "__pycache__" EXCLUDE
+)
 # grafana assets
 install(
    DIRECTORY grafana
@@ -586,10 +581,10 @@ install(
 add_custom_target(
    license
    COMMAND
-        ${PROJECT_SOURCE_DIR}/utils/update_license.py --source ${PROJECT_SOURCE_DIR}/src
+        ${PROJECT_SOURCE_DIR}/tools/update_license.py --source ${PROJECT_SOURCE_DIR}/src
        --license ${PROJECT_SOURCE_DIR}/LICENSE.md --extension '.py'
    COMMAND
-        ${PROJECT_SOURCE_DIR}/utils/update_license.py --source ${PROJECT_SOURCE_DIR}
+        ${PROJECT_SOURCE_DIR}/tools/update_license.py --source ${PROJECT_SOURCE_DIR}
        --license ${PROJECT_SOURCE_DIR}/LICENSE.md --file
        "src/${PACKAGE_NAME},cmake/Dockerfile,cmake/rocm_install.sh,docker/docker-entrypoint.sh,src/rocprof_compute_analyze/convertor/mongodb/convert"
 )
@@ -190,4 +190,13 @@ Any future contributions should adhere to these guidelines:

 ### Build and test documentation changes

-For instructions on how to build and test documentation changes (files under docs folder), please see https://rocm.docs.amd.com/en/latest/contribute/contributing.html
+For instructions on how to build and test documentation changes (files under docs folder), please see https://rocm.docs.amd.com/en/latest/contribute/contributing.html
+
+
+## Metrics Management
+
+If your PR touches **metric configs** (panel YAMLs under `src/rocprof_compute_soc/analysis_configs/gfx<arch>/*.yaml`, config deltas, or metric descriptions in `docs/data/metrics_description.yaml`), please follow the metric management workflow summarized here:
+- Edit the panel YAMLs and, when appropriate, generate/apply a delta and (optionally) promote a new architecture using the [workflow script](./tools/config_management/master_config_workflow_script.py).
+- Verify hashes are updated and CI tests pass.
+
+For full details, see the [metric config management README](./tools/config_management/README.md).
@@ -13,7 +13,7 @@ monorepo/
 │       ├── CMakeLists.txt
 │       ├── coverage/
 │       │   └── coverage-latest.xml  # committed coverage file
-│       ├── utils/
+│       ├── tools/
 │       │   ├── update_coverage.sh  # coverage generation/update script
 │       │   └── run-ci.py             # CDash upload script
 │       └── ...
@@ -31,7 +31,7 @@ Run this periodically to update the coverage baseline:
 ```bash
 # From monorepo root
 cd projects/rocprofiler-compute
-./utils/update_coverage.sh
+./tools/update_coverage.sh

 # This will:
 # - Build with coverage enabled
@@ -74,4 +74,4 @@ pip install coverage pytest pytest-cov
 #verify tests can run
 cd projects/rocprofiler-compute/build
 ctest --verbose
-```
+```
@@ -19,7 +19,7 @@ This section provides an overview of ROCm Compute Profiler's CLI analysis featur
 * :ref:`Filtering <cli-analysis-options>`: Hone in on a particular kernel,
  GPU ID, or dispatch ID via post-process filtering.

-* :ref:`Per-kernel roofline analysis <per-kernel-roofline>`: Detailed arithmetic 
+* :ref:`Per-kernel roofline analysis <per-kernel-roofline>`: Detailed arithmetic
   intensity and performance analysis for individual kernels.

 Run ``rocprof-compute analyze -h`` for more details.
@@ -214,6 +214,90 @@ There are three high-level GPU analysis views:
      │ 2.1.28  │ Instr Fetch Latency       │ 21.729248046875       │ Cycles           │                    │                        │
      ╘═════════╧═══════════════════════════╧═══════════════════════╧══════════════════╧════════════════════╧════════════════════════╛

+   Alternatively, pass one or more block aliases to the ``-b`` (or ``--block``) option.
+   The following snippet generates a report containing only metric 2 by using its alias, ``sol``:
+
+   .. code-block:: shell-session
+
+      $ rocprof-compute analyze -p workloads/vcopy/MI200/ -b sol
+
+      --------
+      Analyze
+      --------
+
+      --------------------------------------------------------------------------------
+      1. Top Stat
+      ╒════╤══════════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
+      │    │ KernelName                               │   Count │   Sum(ns) │   Mean(ns) │   Median(ns) │    Pct │
+      ╞════╪══════════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╡
+      │  0 │ vecCopy(double*, double*, double*, int,  │       1 │  20000.00 │   20000.00 │     20000.00 │ 100.00 │
+      │    │ int) [clone .kd]                         │         │           │            │              │        │
+      ╘════╧══════════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╛
+
+
+      --------------------------------------------------------------------------------
+      2. System Speed-of-Light
+      ╒═════════╤═══════════════════════════╤═══════════════════════╤══════════════════╤════════════════════╤════════════════════════╕
+      │ Index   │ Metric                    │ Value                 │ Unit             │ Peak               │ PoP                    │
+      ╞═════════╪═══════════════════════════╪═══════════════════════╪══════════════════╪════════════════════╪════════════════════════╡
+      │ 2.1.0   │ VALU FLOPs                │ 0.0                   │ Gflop            │ 22630.4            │ 0.0                    │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.1   │ VALU IOPs                 │ 367.0016              │ Giop             │ 22630.4            │ 1.6217194570135745     │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.2   │ MFMA FLOPs (BF16)         │ 0.0                   │ Gflop            │ 90521.6            │ 0.0                    │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.3   │ MFMA FLOPs (F16)          │ 0.0                   │ Gflop            │ 181043.2           │ 0.0                    │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.4   │ MFMA FLOPs (F32)          │ 0.0                   │ Gflop            │ 45260.8            │ 0.0                    │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.5   │ MFMA FLOPs (F64)          │ 0.0                   │ Gflop            │ 45260.8            │ 0.0                    │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.6   │ MFMA IOPs (Int8)          │ 0.0                   │ Giop             │ 181043.2           │ 0.0                    │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.7   │ Active CUs                │ 74                    │ Cus              │ 104                │ 71.15384615384616      │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.8   │ SALU Util                 │ 4.016057506716307     │ Pct              │ 100                │ 4.016057506716307      │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.9   │ VALU Util                 │ 5.737225009594725     │ Pct              │ 100                │ 5.737225009594725      │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.10  │ MFMA Util                 │ 0.0                   │ Pct              │ 100                │ 0.0                    │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.11  │ VALU Active Threads/Wave  │ 64.0                  │ Threads          │ 64                 │ 100.0                  │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.12  │ IPC - Issue               │ 1.0                   │ Instr/cycle      │ 5                  │ 20.0                   │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.13  │ LDS BW                    │ 0.0                   │ Gb/sec           │ 22630.4            │ 0.0                    │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.14  │ LDS Bank Conflict         │                       │ Conflicts/access │ 32                 │                        │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.15  │ Instr Cache Hit Rate      │ 99.91306912556854     │ Pct              │ 100                │ 99.91306912556854      │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.16  │ Instr Cache BW            │ 209.7152              │ Gb/s             │ 6092.8             │ 3.442016806722689      │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.17  │ Scalar L1D Cache Hit Rate │ 99.81986908342313     │ Pct              │ 100                │ 99.81986908342313      │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.18  │ Scalar L1D Cache BW       │ 209.7152              │ Gb/s             │ 6092.8             │ 3.442016806722689      │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.19  │ Vector L1D Cache Hit Rate │ 50.0                  │ Pct              │ 100                │ 50.0                   │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.20  │ Vector L1D Cache BW       │ 1677.7216             │ Gb/s             │ 11315.199999999999 │ 14.82714932126697      │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.21  │ L2 Cache Hit Rate         │ 35.55067615693325     │ Pct              │ 100                │ 35.55067615693325      │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.22  │ L2-Fabric Read BW         │ 419.8496              │ Gb/s             │ 1638.4             │ 25.6255859375          │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.23  │ L2-Fabric Write BW        │ 293.9456              │ Gb/s             │ 1638.4             │ 17.941015625           │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.24  │ L2-Fabric Read Latency    │ 256.6482321288385     │ Cycles           │                    │                        │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.25  │ L2-Fabric Write Latency   │ 317.2264255699014     │ Cycles           │                    │                        │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.26  │ Wave Occupancy            │ 1821.723057333852     │ Wavefronts       │ 3328               │ 54.73927455931046      │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.27  │ Instr Fetch BW            │ 4.174722306564298e-08 │ Gb/s             │ 3046.4             │ 1.3703789084047721e-09 │
+      ├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
+      │ 2.1.28  │ Instr Fetch Latency       │ 21.729248046875       │ Cycles           │                    │                        │
+      ╘═════════╧═══════════════════════════╧═══════════════════════╧══════════════════╧════════════════════╧════════════════════════╛
   .. note::

      Some cells may be blank indicating a missing or unavailable hardware
@@ -245,6 +329,11 @@ List metrics

     $ rocprof-compute analyze -p workloads/vcopy/MI200/  --list-metrics gfx90a

+List IP blocks
+  .. code-block:: shell
+
+     $ rocprof-compute analyze -p workloads/vcopy/MI200/  --list-blocks gfx90a
+
 Show Description column which is excluded by default in cli output
  .. code-block:: shell

@@ -261,7 +261,7 @@ detailed description of profiling filters available when using ROCm Compute Prof
 Filtering options
 -----------------

-``-b``, ``--block <block-name>``
+``-b``, ``--block <block-id|block-alias|metric-id>``
   Allows system profiling on one or more selected analysis report blocks to speed
   up the profiling process. See :ref:`profiling-hw-component-filtering`.
   Note that this option cannot be used with ``--roof-only`` or ``--set``.
@@ -70,6 +70,13 @@ to view the metrics for current system architecture:
   $ rocprof-compute --list-metrics <sys_arch>
   $ rocprof-compute profile --list-available-metrics

+To view the available aliases for each hardware block, use the ``--list-blocks``
+option with a system architecture argument:
+
+.. code-block:: shell
+
+   $ rocprof-compute --list-blocks <sys_arch>
+
 .. _basic-analyze-cli:

 Analyze in the command line
@@ -25,13 +25,30 @@

 import argparse
 import os
-import re
 from pathlib import Path
 from typing import Optional

+from utils.utils import METRIC_ID_RE

-def print_avail_arch(avail_arch: list[str]) -> str:
-    ret_str = "List all available metrics for analysis on specified arch:"
+
+def validate_block(value: str) -> str:
+    if METRIC_ID_RE.match(value):
+        return value
+    raise argparse.ArgumentTypeError(f"Invalid metric id: {value}")
+
+
+def block_token_or_alias(s: str) -> str:
+    try:
+        return validate_block(s)
+    except argparse.ArgumentTypeError:
+        s = (s or "").strip()
+        if not s:
+            raise argparse.ArgumentTypeError("empty token for --block")
+        return s
+
+
+def print_avail_arch(avail_arch: list[str], args: str) -> str:
+    ret_str = f"List all available {args} for analysis on specified arch:"
    for arch in avail_arch:
        ret_str += f"\n   {arch}"
    return ret_str
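
The two helpers added in this hunk can be exercised in isolation. The following is a minimal sketch, not part of the patch. It assumes `METRIC_ID_RE` is the same pattern as the inline regex that this commit removes from `validate_block`; the real constant lives in `utils.utils` and may differ. The parser below is a throwaway stand-in for the profile group.

```python
import argparse
import re

# Assumption: stand-in for utils.utils.METRIC_ID_RE, copied from the inline
# regex that this commit deletes (I, I.I or I.I.I with 1-2 digits each).
METRIC_ID_RE = re.compile(r"^\d{1,2}(?:\.\d{1,2}){0,2}$")


def validate_block(value: str) -> str:
    if METRIC_ID_RE.match(value):
        return value
    raise argparse.ArgumentTypeError(f"Invalid metric id: {value}")


def block_token_or_alias(s: str) -> str:
    # Numeric ids pass through unchanged; anything else is treated as an
    # alias candidate after stripping whitespace.
    try:
        return validate_block(s)
    except argparse.ArgumentTypeError:
        s = (s or "").strip()
        if not s:
            raise argparse.ArgumentTypeError("empty token for --block")
        return s


# Throwaway parser mirroring the -b/--block wiring in the diff.
parser = argparse.ArgumentParser()
parser.add_argument(
    "-b",
    "--block",
    dest="filter_blocks",
    nargs="+",
    type=block_token_or_alias,
    default=[],
)

args = parser.parse_args(["-b", "12", "12.1.1", "lds", " sl1d "])
print(args.filter_blocks)
```

Note that a malformed id such as `12.1.1.1` no longer fails at parse time: it falls through the `except` branch and is returned as an alias candidate, so any rejection has to happen wherever aliases are resolved.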
@@ -66,7 +83,14 @@ def add_general_group(
        dest="list_metrics",
        metavar="",
        choices=supported_archs.keys(),  # ["gfx908", "gfx90a"],
-        help=print_avail_arch(list(supported_archs.keys())),
+        help=print_avail_arch(list(supported_archs.keys()), "metrics"),
+    )
+    general_group.add_argument(
+        "--list-blocks",
+        dest="list_blocks",
+        metavar="",
+        choices=supported_archs.keys(),  # ["gfx908", "gfx90a"],
+        help=print_avail_arch(list(supported_archs.keys()), "blocks"),
    )
    general_group.add_argument(
        "--config-dir",
@@ -234,12 +258,6 @@ Examples:
        help="\t\t\tDispatch ID filtering.",
    )

-    def validate_block(value: str) -> str:
-        # Metric id is of the form I or I.I or I.I.I where I is two digit number.
-        if re.compile(r"^\d{1,2}(?:\.\d{1,2}){0,2}$").match(value):
-            return value
-        raise argparse.ArgumentTypeError(f"Invalid metric id: {value}")
-
    profile_group.add_argument(
        "--list-available-metrics",
        dest="list_available_metrics",
@@ -249,15 +267,19 @@ Examples:
    profile_group.add_argument(
        "-b",
        "--block",
-        type=validate_block,
        dest="filter_blocks",
        metavar="",
        nargs="+",
+        type=block_token_or_alias,
        required=False,
        default=[],
        help=(
            "\t\t\tSpecify metric id(s) from --list-metrics for filtering "
            "(e.g. 12, 12.1, 12.1.1).\n"
+            "\t\t\tAlternatively, specify block id(s) for filtering "
+            "(e.g. 12, 13, 14).\n"
+            "\t\t\tAlternatively, specify block alias(es) for filtering "
+            "(e.g. lds, l1i, sl1d).\n"
            "\t\t\tCan provide multiple space separated arguments.\n"
            "\t\t\tCannot be used with --set or --roof-only"
        ),
@@ -656,6 +678,7 @@ Examples:
        dest="filter_metrics",
        metavar="",
        nargs="+",
+        type=block_token_or_alias,
        help="\t\tSpecify metric id(s) from --list-metrics for filtering.",
    )
    analyze_group.add_argument(
@@ -45,7 +45,12 @@ from utils.logger import (
    console_warning,
    demarcate,
 )
-from utils.utils import get_uuid, is_workload_empty, merge_counters_spatial_multiplex
+from utils.utils import (
+    get_panel_alias,
+    get_uuid,
+    is_workload_empty,
+    merge_counters_spatial_multiplex,
+)

 # the build-in config to list kernel names purpose only
 TOP_STATS_BUILD_IN_CONFIG: OrderedDict[int, dict[str, Any]] = OrderedDict([
@@ -160,21 +165,41 @@ class OmniAnalyze_Base:
        }
        for key, value in self._arch_configs[arch].metric_list.items():
            dot_count = str(key).count(".")
-            if dot_count == 0:
-                prefix = ""
-            elif dot_count == 1:
-                prefix = "\t"
-            else:
-                prefix = "\t\t"
+            indent = "\t" * min(dot_count, 2)

-            description = metric_descriptions.get(key, "") if dot_count > 1 else ""
+            print(f"{indent}{key} -> {value}\n")

-            print(f"{prefix}{key} -> {value}\n")
-            if description:
-                formatted_desc = f"\n{prefix}".join(
-                    textwrap.wrap(description, width=40)
-                )
-                print(f"{prefix}{formatted_desc}\n")
+            if dot_count > 1:
+                description = metric_descriptions.get(key, "")
+                if description:
+                    wrapped = textwrap.wrap(description, width=40)
+                    print(f"{indent}" + f"\n{indent}".join(wrapped) + "\n")
+
+        sys.exit(0)
+
+    @demarcate
+    def list_blocks(self) -> None:
+        args = self.get_args()
+        arch = args.list_blocks
+
+        if arch not in self.__supported_archs:
+            console_error("analysis", "Unsupported arch")
+        if arch not in self._arch_configs:
+            sys_info = file_io.load_sys_info(f"{args.path[0][0]}/sysinfo.csv")
+            self.generate_configs(
+                arch,
+                args.config_dir,
+                args.list_stats,
+                args.filter_metrics,
+                sys_info.iloc[0],
+            )
+
+        panel_alias_dict = get_panel_alias()
+        print(f"{'INDEX':<8} {'BLOCK ALIAS':<16} {'BLOCK NAME'}")
+        for key, value in self._arch_configs[arch].metric_list.items():
+            if key.count(".") > 0:
+                continue
+            print(f"{key:<8} {panel_alias_dict[value]:<16} {value}")

        sys.exit(0)
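
The listing in `list_blocks` reduces to a filter over top-level keys plus fixed-width formatting. The sketch below illustrates that shape only. The `metric_list` contents and the alias map are hypothetical stand-ins: in the patch they come from the parsed panel YAMLs and from `get_panel_alias()`. The `sol` alias appears in the docs hunk of this commit, while the LDS panel title is invented for illustration.

```python
from collections import OrderedDict

# Hypothetical data, for illustration only.
metric_list = OrderedDict([
    ("2", "System Speed-of-Light"),
    ("2.1", "System Speed-of-Light"),
    ("2.1.0", "VALU FLOPs"),
    ("12", "Local Data Share (LDS)"),
])
panel_alias = {
    "System Speed-of-Light": "sol",
    "Local Data Share (LDS)": "lds",
}


def format_block_table(metric_list, panel_alias):
    """Return one header line plus one line per top-level block."""
    lines = [f"{'INDEX':<8} {'BLOCK ALIAS':<16} {'BLOCK NAME'}"]
    for key, name in metric_list.items():
        # Keys containing a dot are tables (I.I) or metrics (I.I.I); skip them.
        if key.count(".") > 0:
            continue
        lines.append(f"{key:<8} {panel_alias[name]:<16} {name}")
    return lines


for line in format_block_table(metric_list, panel_alias):
    print(line)
```

A panel title missing from the alias map raises `KeyError` here, as it would with the dictionary indexing used in the patch.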

@@ -208,6 +233,9 @@ class OmniAnalyze_Base:
        if args.list_metrics:
            self.list_metrics()

+        if args.list_blocks:
+            self.list_blocks()
+
        def get_sysinfo_path(data_path: str) -> Optional[str]:
            return (
                data_path
@@ -49,6 +49,7 @@ from utils.mi_gpu_spec import mi_gpu_specs
 from utils.specs import MachineSpecs, generate_machine_specs
 from utils.utils import (
    detect_rocprof,
+    get_panel_alias,
    get_submodules,
    get_version,
    get_version_display,
@@ -142,6 +143,8 @@ class RocProfCompute:

        if self.__args.list_metrics is not None and block:
            console_error("Cannot use --list-metrics with --blocks")
+        if self.__args.list_blocks is not None and block:
+            console_error("Cannot use --list-blocks with --block")
        if (
            hasattr(self.__args, "list_available_metrics")
            and self.__args.list_available_metrics
@@ -194,6 +197,9 @@ class RocProfCompute:
            elif self.__args.list_metrics is not None:
                self.list_metrics()
                sys.exit(0)
+            elif self.__args.list_blocks is not None:
+                self.list_blocks()
+                sys.exit(0)
            elif self.__args.config_dir:
                parser.print_help(sys.stderr)
                console_error(
@@ -250,6 +256,34 @@ class RocProfCompute:
        else:
            console_error("Unsupported arch")

+    @demarcate
+    def list_blocks(self) -> None:
+        for_current_arch = getattr(self.__args, "list_available_metrics", False)
+
+        arch = (
+            self.__mspec.gpu_arch
+            if (for_current_arch or self.__args.list_blocks is None)
+            else self.__args.list_blocks
+        )
+        if arch in self.__supported_archs.keys():
+            ac = schema.ArchConfig()
+            ac.panel_configs = file_io.load_panel_configs([
+                str(Path(self.__args.config_dir) / arch)
+            ])
+            sys_info = (
+                self.__mspec.get_class_members().iloc[0] if for_current_arch else None
+            )
+            parser.build_dfs(arch_configs=ac, filter_metrics=[], sys_info=sys_info)
+
+            print(f"{'INDEX':<8} {'BLOCK ALIAS':<16} {'BLOCK NAME'}")
+            for key, value in ac.metric_list.items():
+                if key.count(".") > 0:
+                    continue
+                print(f"{key:<8} {get_panel_alias()[value]:<16} {value}")
+            sys.exit(0)
+        else:
+            console_error("Unsupported arch")
+
    @demarcate
    def list_sets(self) -> None:
        sets_info = parse_sets_yaml(self.__mspec.gpu_arch)
@@ -505,6 +505,6 @@ class RocProfCompute_Base:
        # PC sampling data is only collected when block "21" is specified
        if not (
-            "21" in args.filter_blocks
+            ("21" in args.filter_blocks or "pc_sampling" in args.filter_blocks)
            and self.__profiler in ("rocprofv3", "rocprofiler-sdk")
        ):
            return
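
As a sketch of the gate's truth table, assuming the intent is that either the numeric id `21` or the alias `pc_sampling` enables PC sampling (requiring both at once would make an alias-only invocation return early). The helper name `wants_pc_sampling` and the constant are hypothetical and not part of the patch.

```python
# Profilers named in the condition shown in the diff.
PC_SAMPLING_PROFILERS = ("rocprofv3", "rocprofiler-sdk")


def wants_pc_sampling(filter_blocks, profiler):
    """True when PC sampling was requested and the profiler supports it."""
    # Assumption: block id "21" and alias "pc_sampling" are interchangeable.
    requested = "21" in filter_blocks or "pc_sampling" in filter_blocks
    return requested and profiler in PC_SAMPLING_PROFILERS


print(wants_pc_sampling(["21"], "rocprofv3"))
print(wants_pc_sampling(["pc_sampling"], "rocprofiler-sdk"))
print(wants_pc_sampling(["lds"], "rocprofv3"))
```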
@@ -2,7 +2,6 @@
 Panel Config:
  id: 0
  title: Top Stats
-  metrics_description: {}
  data source:
  - raw_csv_table:
      id: 1
@@ -12,3 +11,4 @@ Panel Config:
      id: 2
      title: Dispatch List
      source: pmc_dispatch_info.csv
+  metrics_description: {}
@@ -2,10 +2,10 @@
 Panel Config:
  id: 100
  title: System Info
-  metrics_description: {}
  data source:
  - raw_csv_table:
      id: 101
      title: System Info
      source: sysinfo.csv
      columnwise: true
+  metrics_description: {}
@@ -2,124 +2,6 @@
 Panel Config:
  id: 200
  title: System Speed-of-Light
-  metrics_description:
-    VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
-      This is also presented as a percent of the peak theoretical FLOPs achievable
-      on the specific accelerator. Note: this does not include any floating-point
-      operations from MFMA instructions.'
-    VALU IOPs: 'The total integer operations executed per second on the VALU. This
-      is also presented as a percent of the peak theoretical IOPs achievable on the
-      specific accelerator. Note: this does not include any integer operations from
-      MFMA instructions.'
-    MFMA FLOPs (F8): The total number of 8-bit brain floating point MFMA operations
-      executed per second. This does not include any 16-bit brain floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F8 MFMA operations achievable on the specific accelerator. It is supported on
-      AMD Instinct MI300 series and later only.
-    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
-      executed per second. Note: this does not include any 16-bit brain floating point
-      operations from VALU instructions. This is also presented as a percent of the
-      peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
-      per second. Note: this does not include any 16-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F16 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
-      per second. Note: this does not include any 32-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F32 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
-      per second. Note: this does not include any 64-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F64 MFMA operations achievable on the specific accelerator.'
-    MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
-      per second. Note: this does not include any 8-bit integer operations from VALU
-      instructions. This is also presented as a percent of the peak theoretical INT8
-      MFMA operations achievable on the specific accelerator.'
-    Active CUs: Total number of active compute units (CUs) on the accelerator during
-      the kernel execution.
-    SALU Utilization: Indicates what percent of the kernel's duration the SALU was
-      busy executing instructions. Computed as the ratio of the total number of cycles
-      spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
-    VALU Utilization: Indicates what percent of the kernel's duration the VALU was
-      busy executing instructions. Does not include VMEM operations. Computed as the
-      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
-      over the total CU cycles.
-    MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
-      was busy executing instructions. Computed as the ratio of the total number of
-      cycles the MFMA was busy over the total CU cycles.
-    VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
-      was busy executing instructions, including both global/generic and spill/scratch
-      operations (see the VMEM instruction count metrics) for more detail). Does not
-      include VALU operations. Computed as the ratio of the total number of cycles
-      spent by the scheduler issuing VMEM instructions over the total CU cycles.
-    Branch Utilization: Indicates what percent of the kernel's duration the branch
-      unit was busy executing instructions. Computed as the ratio of the total number
-      of cycles spent by the scheduler issuing branch instructions over the total
-      CU cycles
-    VALU Active Threads: Indicates the average level of divergence within a wavefront
-      over the lifetime of the kernel. The number of work-items that were active in
-      a wavefront during execution of each VALU instruction, time-averaged over all
-      VALU instructions run on all wavefronts in the kernel.
-    IPC: The ratio of the total number of instructions executed on the CU over the
-      total active CU cycles. This is also presented as a percent of the peak theoretical
-      bandwidth achievable on the specific accelerator.
-    Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
-      over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
-      kernels (less than 1ms). This is also presented as a percent of the peak theoretical
-      occupancy achievable on the specific accelerator.'
-    Theoretical LDS Bandwidth: Indicates the maximum amount of bytes that could have
-      been loaded from, stored to, or atomically updated in the LDS per unit time
-      (see LDS Bandwidth example for more detail). This is also presented as a percent
-      of the peak theoretical F64 MFMA operations achievable on the specific accelerator.
-    LDS Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS
-      scheduler due to bank conflicts (as determined by the conflict resolution hardware)
-      to the base number of cycles that would be spent in the LDS scheduler in a completely
-      uncontended case. This is also presented in normalized form (i.e., the Bank
-      Conflict Rate).
-    vL1D Cache Hit Rate: The ratio of the number of vL1D cache line requests that
-      hit in vL1D cache over the total number of cache line requests to the vL1D cache
-      RAM.
-    vL1D Cache BW: The number of bytes looked up in the vL1D cache as a result of
-      VMEM instructions per unit time. The number of bytes is calculated as the number
-      of cache lines requested multiplied by the cache line size. This value does
-      not consider partial requests, so e.g., if only a single value is requested
-      in a cache line, the data movement will still be counted as a full cache line.
-      This is also presented as a percent of the peak theoretical bandwidth achievable
-      on the specific accelerator.
-    L2 Cache Hit Rate: The ratio of the number of L2 cache line requests that hit
-      in the L2 cache over the total number of incoming cache line requests to the
-      L2 cache.
-    L2 Cache BW: The number of bytes looked up in the L2 cache per unit time. The
-      number of bytes is calculated as the number of cache lines requested multiplied
-      by the cache line size. This value does not consider partial requests, so e.g.,
-      if only a single value is requested in a cache line, the data movement will
-      still be counted as a full cache line. This is also presented as a percent of
-      the peak theoretical bandwidth achievable on the specific accelerator.
-    L2-Fabric Read BW: "The number of bytes read by the L2 over the Infinity Fabric\u2122\
-      \ interface per unit time. This is also presented as a percent of the peak theoretical\
-      \ bandwidth achievable on the specific accelerator."
-    L2-Fabric Write BW: The number of bytes sent by the L2 over the Infinity Fabric
-      interface by write and atomic operations per unit time. This is also presented
-      as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
-    L2-Fabric Read Latency: The time-averaged number of cycles read requests spent
-      in Infinity Fabric before data was returned to the L2.
-    L2-Fabric Write Latency: The time-averaged number of cycles write requests spent
-      in Infinity Fabric before a completion acknowledgement was returned to the L2.
-    sL1D Cache Hit Rate: The percent of sL1D requests that hit on a previously loaded
-      line the cache. Calculated as the ratio of the number of sL1D requests that
-      hit over the number of all sL1D requests.
-    sL1D Cache BW: The number of bytes looked up in the sL1D cache per unit time.
-      This is also presented as a percent of the peak theoretical bandwidth achievable
-      on the specific accelerator.
-    L1I Hit Rate: The number of bytes looked up in the L1I cache per unit time. This
-      is also presented as a percent of the peak theoretical bandwidth achievable
-      on the specific accelerator.
-    L1I BW: The percent of L1I requests that hit on a previously loaded line the cache.
-      Calculated as the ratio of the number of L1I requests that hit over the number
-      of all L1I requests.
-    L1I Fetch Latency: The average number of cycles spent to fetch instructions to
-      a CU.
  data source:
  - metric_table:
      id: 201
@@ -317,3 +199,125 @@ Panel Config:
          peak: None
          pop: None
          coll_level: SQ_IFETCH_LEVEL
+  metrics_description:
+    VALU FLOPs: |-
+      The total floating-point operations executed per second on the VALU.
+      This is also presented as a percent of the peak theoretical FLOPs achievable
+      on the specific accelerator. Note: this does not include any floating-point
+      operations from MFMA instructions.
+    VALU IOPs: |-
+      The total integer operations executed per second on the VALU. This is
+      also presented as a percent of the peak theoretical IOPs achievable on the
+      specific accelerator. Note: this does not include any integer operations from
+      MFMA instructions.
+    MFMA FLOPs (BF16): |-
+      The total number of 16-bit brain floating point MFMA operations executed
+      per second. Note: this does not include any 16-bit brain floating point operations
+      from VALU instructions. This is also presented as a percent of the peak theoretical
+      BF16 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F16): |-
+      The total number of 16-bit floating point MFMA operations executed per
+      second. Note: this does not include any 16-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F16 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F32): |-
+      The total number of 32-bit floating point MFMA operations executed per
+      second. Note: this does not include any 32-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F32 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F64): |-
+      The total number of 64-bit floating point MFMA operations executed per
+      second. Note: this does not include any 64-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F64 MFMA operations achievable on the specific accelerator.
+    MFMA IOPs (Int8): |-
+      The total number of 8-bit integer MFMA operations executed per second.
+      Note: this does not include any 8-bit integer operations from VALU instructions.
+      This is also presented as a percent of the peak theoretical INT8 MFMA operations
+      achievable on the specific accelerator.
+    Active CUs: Total number of active compute units (CUs) on the accelerator during
+      the kernel execution.
+    SALU Utilization: Indicates what percent of the kernel's duration the SALU was
+      busy executing instructions. Computed as the ratio of the total number of cycles
+      spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
+    VALU Utilization: Indicates what percent of the kernel's duration the VALU was
+      busy executing instructions. Does not include VMEM operations. Computed as the
+      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
+      over the total CU cycles.
+    MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
+      was busy executing instructions. Computed as the ratio of the total number of
+      cycles the MFMA was busy over the total CU cycles.
+    VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
+      was busy executing instructions, including both global/generic and spill/scratch
+      operations (see the VMEM instruction count metrics for more detail). Does not
+      include VALU operations. Computed as the ratio of the total number of cycles
+      spent by the scheduler issuing VMEM instructions over the total CU cycles.
+    Branch Utilization: Indicates what percent of the kernel's duration the branch
+      unit was busy executing instructions. Computed as the ratio of the total number
+      of cycles spent by the scheduler issuing branch instructions over the total
+      CU cycles.
+    VALU Active Threads: Indicates the average level of divergence within a wavefront
+      over the lifetime of the kernel. The number of work-items that were active in
+      a wavefront during execution of each VALU instruction, time-averaged over all
+      VALU instructions run on all wavefronts in the kernel.
+    IPC: The ratio of the total number of instructions executed on the CU over the
+      total active CU cycles. This is also presented as a percent of the peak theoretical
+      IPC achievable on the specific accelerator.
+    Wavefront Occupancy: |-
+      The time-averaged number of wavefronts resident on the accelerator over
+      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
+      kernels (less than 1ms). This is also presented as a percent of the peak theoretical
+      occupancy achievable on the specific accelerator.
+    Theoretical LDS Bandwidth: Indicates the maximum amount of bytes that could have
+      been loaded from, stored to, or atomically updated in the LDS per unit time
+      (see LDS Bandwidth example for more detail). This is also presented as a percent
+      of the peak theoretical LDS bandwidth achievable on the specific accelerator.
+    LDS Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS
+      scheduler due to bank conflicts (as determined by the conflict resolution hardware)
+      to the base number of cycles that would be spent in the LDS scheduler in a completely
+      uncontended case. This is also presented in normalized form (i.e., the Bank
+      Conflict Rate).
+    vL1D Cache Hit Rate: The ratio of the number of vL1D cache line requests that
+      hit in vL1D cache over the total number of cache line requests to the vL1D cache
+      RAM.
+    vL1D Cache BW: The number of bytes looked up in the vL1D cache as a result of
+      VMEM instructions per unit time. The number of bytes is calculated as the number
+      of cache lines requested multiplied by the cache line size. This value does
+      not consider partial requests, so e.g., if only a single value is requested
+      in a cache line, the data movement will still be counted as a full cache line.
+      This is also presented as a percent of the peak theoretical bandwidth achievable
+      on the specific accelerator.
+    L2 Cache Hit Rate: The ratio of the number of L2 cache line requests that hit
+      in the L2 cache over the total number of incoming cache line requests to the
+      L2 cache.
+    L2 Cache BW: The number of bytes looked up in the L2 cache per unit time. The
+      number of bytes is calculated as the number of cache lines requested multiplied
+      by the cache line size. This value does not consider partial requests, so e.g.,
+      if only a single value is requested in a cache line, the data movement will
+      still be counted as a full cache line. This is also presented as a percent of
+      the peak theoretical bandwidth achievable on the specific accelerator.
+    L2-Fabric Read BW: |-
+      The number of bytes read by the L2 over the Infinity Fabric™ interface
+      per unit time. This is also presented as a percent of the peak theoretical
+      bandwidth achievable on the specific accelerator.
+    L2-Fabric Write BW: The number of bytes sent by the L2 over the Infinity Fabric
+      interface by write and atomic operations per unit time. This is also presented
+      as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
+    L2-Fabric Read Latency: The time-averaged number of cycles read requests spent
+      in Infinity Fabric before data was returned to the L2.
+    L2-Fabric Write Latency: The time-averaged number of cycles write requests spent
+      in Infinity Fabric before a completion acknowledgement was returned to the L2.
+    sL1D Cache Hit Rate: The percent of sL1D requests that hit on a previously loaded
+      line in the cache. Calculated as the ratio of the number of sL1D requests that
+      hit over the number of all sL1D requests.
+    sL1D Cache BW: The number of bytes looked up in the sL1D cache per unit time.
+      This is also presented as a percent of the peak theoretical bandwidth achievable
+      on the specific accelerator.
+    L1I Hit Rate: The percent of L1I requests that hit on a previously loaded line
+      in the cache. Calculated as the ratio of the number of L1I requests that hit
+      over the number of all L1I requests.
+    L1I BW: The number of bytes looked up in the L1I cache per unit time. This is
+      also presented as a percent of the peak theoretical bandwidth achievable on
+      the specific accelerator.
+    L1I Fetch Latency: The average number of cycles spent to fetch instructions to
+      a CU.
@@ -2,122 +2,6 @@
 Panel Config:
  id: 300
  title: Memory Chart
-  metrics_description:
-    Wavefront Occupancy: Wavefronts per active CU.
-    Wave Life: Average number of cycles executing a wave.
-    SALU: Total Number of SALU (Scalar ALU) instructions issued per normalization
-      unit.
-    SMEM: Total number of SMEM (Scalar Memory Read) instructions issued normalization
-      unit.
-    VALU: The number of VALU (Vector ALU) instructions issued per normalization unit.
-    MFMA: Total number of MFMA (Matrix-Fused-Multiply-Add) instructions issued per
-      normalization unit.
-    VMEM: The number of VMEM (GPU Memory) read instructions issued (including FLAT/scratch
-      memory) per normalization unit.
-    LDS: The total number of LDS instructions (including, but not limited to, read/write/atomics
-      and HIP's __shfl instructions) executed per normalization unit.
-    GWS: Total number of GDS (global data sync) instructions issued per normalization
-      unit.
-    BR: Total number of BRANCH instructions issued per normalization unit.
-    Active CUs: Total number of active compute units (CUs) on the accelerator during
-      the kernel execution.
-    Num CUs: Total number of compute units (CUs) on the accelerator.
-    VGPR: 'The number of architected vector general-purpose registers allocated for
-      the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
-      by the compiler due to allocation granularity.'
-    SGPR: 'The number of scalar general-purpose registers allocated for the kernel,
-      see SALU. Note: this may not exactly match the number of SGPRs requested by
-      the compiler due to allocation granularity.'
-    LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
-      for this kernel. Note: This may also be larger than what was requested at compile
-      time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
-    Scratch Allocation: The number of bytes of scratch memory requested per work-item
-      for this kernel. Scratch memory is used for stack memory on the accelerator,
-      as well as for register spills and restores.
-    Wavefronts: The total number of wavefronts, summed over all workgroups, forming
-      this kernel launch.
-    Workgroups: The total number of workgroups forming this kernel launch.
-    LDS Req: The total number of LDS instructions (including, but not limited to,
-      read/write/atomics and HIP's __shfl instructions) executed per normalization
-      unit.
-    LDS Util: Indicates what percent of the kernel's duration the LDS was actively
-      executing instructions (including, but not limited to, load, store, atomic and
-      HIP's __shfl operations). Calculated as the ratio of the total number of cycles
-      LDS was active over the total CU cycles.
-    LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
-      / acknowledgment) required for an LDS instruction to complete.
-    VL1 Rd: The total number of incoming read requests from the address processing
-      unit after coalescing per normalization unit
-    VL1 Wr: The total number of incoming write requests from the address processing
-      unit after coalescing per normalization unit
-    VL1 Atomic: The total number of incoming atomic requests from the address processing
-      unit after coalescing per normalization unit
-    VL1 Hit: The ratio of the number of vL1D cache line requests that hit in vL1D
-      cache over the total number of cache line requests to the vL1D Cache RAM.
-    VL1 Lat: Calculated as the average number of cycles that a vL1D cache line request
-      spent in the vL1D cache pipeline.
-    VL1 Coalesce: Indicates how well memory instructions were coalesced by the address
-      processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
-      as the average number of thread-requests generated per instruction divided by
-      the ideal number of thread-requests per instruction.
-    VL1 Stall: The ratio of the number of cycles where the vL1D is stalled waiting
-      to issue a request for data to the L2 cache divided by the number of cycles
-      where the vL1D is active.
-    VL1_L2 Rd: The number of read requests for a vL1D cache line that were not satisfied
-      by the vL1D and must be retrieved from the to the L2 Cache per normalization
-      unit.
-    VL1_L2 Wr: The number of write requests to a vL1D cache line that were sent through
-      the vL1D to the L2 cache, per normalization unit.
-    VL1_L2 Atomic: The number of atomic requests that are sent through the vL1D to
-      the L2 cache, per normalization unit. This includes requests for atomics with,
-      and without return.
-    sL1D Rd: The total number of requests, of any size or type, made to the sL1D per
-      normalization unit.
-    sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
-      line, per normalization unit.
-    sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
-      unit.
-    sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
-      unit. Typically unused on current CDNA accelerators.
-    sL1D_L2 Atomic: The total number of atomic requests from sL1D to the L2, per normalization
-      unit. Typically unused on current CDNA accelerators.
-    IL1 Fetch: The total number of requests made to the L1I per normalization-unit.
-    IL1 Hit: The percent of L1I requests that hit on a previously loaded line the
-      cache. Calculated as the ratio of the number of L1I requests that hit over the
-      number of all L1I requests.
-    IL1 Lat: The average number of cycles spent to fetch instructions to a CU.
-    IL1_L2 Rd: The total number of requests across the L1I - L2 interface per normalization-unit.
-    L2 Rd: The total number of read requests to the L2 from all clients.
-    L2 Wr: The total number of write requests to the L2 from all clients.
-    L2 Atomic: The total number of atomic requests (with and without return) to the
-      L2 from all clients.
-    L2 Hit: The ratio of the number of L2 cache line requests that hit in the L2 cache
-      over the total number of incoming cache line requests to the L2 cache.
-    L2 Rd Lat: Calculated as the average number of cycles that the vL1D cache took
-      to issue and receive read requests from the L2 Cache. This number also includes
-      requests for atomics with return values.
-    L2 Wr Lat: Calculated as the average number of cycles that the vL1D cache took
-      to issue and receive acknowledgement of a write request to the L2 Cache. This
-      number also includes requests for atomics without return values.
-    Fabric_L2 Rd: Number of L2 cache - Infinity Fabric read requests (either 32-byte
-      or 64-byte) summed over TCC instances per normalization unit.
-    Fabric_L2 Wr: Number of L2 cache - Infinity Fabric write requests (either 32-byte
-      or 64-byte) summed over TCC instances per normalization unit.
-    Fabric_L2 Atomic: Number of L2 cache - Infinity Fabric write requests (either
-      32-byte or 64-byte) that are actually atomic requests summed over TCC instances
-      per normalization unit.
-    Fabric Rd Lat: The time-averaged number of cycles read requests spent in Infinity
-      Fabric before data was returned to the L2.
-    Fabric Wr Lat: The time-averaged number of cycles write requests spent in Infinity
-      Fabric before a completion acknowledgement was returned to the L2.
-    Fabric Atomic Lat: The time-averaged number of cycles atomic requests spent in
-      Infinity Fabric before a completion acknowledgement (atomic without return value)
-      or data (atomic with return value) was returned to the L2.
-    HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
-      of data from the accelerator's local HBM, per normalization unit.
-    HBM Wr: 'The total number of L2 requests to Infinity Fabric to write or atomically
-      update 32B or 64B of data in the accelerator''s local HBM, per normalization
-      unit. '
  data source:
  - metric_table:
      id: 301
@@ -252,13 +136,13 @@ Panel Config:
          value: ROUND(AVG((TCC_EA0_ATOMIC_sum / $denom)), 0)
        Fabric Rd Lat:
          value: ROUND(AVG(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
-            != 0) else  0)), 0)
+            != 0) else 0)), 0)
        Fabric Wr Lat:
          value: ROUND(AVG(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
-            != 0) else  0)), 0)
+            != 0) else 0)), 0)
        Fabric Atomic Lat:
          value: ROUND(AVG(((TCC_EA0_ATOMIC_LEVEL_sum / TCC_EA0_ATOMIC_sum) if (TCC_EA0_ATOMIC_sum
-            != 0) else  0)), 0)
+            != 0) else 0)), 0)
        HBM Rd:
          value: ROUND(AVG((TCC_EA0_RDREQ_DRAM_sum / $denom)), 0)
        HBM Wr:
@@ -266,3 +150,123 @@ Panel Config:
      comparable: false
      cli_style: mem_chart
      tui_style: mem_chart
+  metrics_description:
+    Wavefront Occupancy: Wavefronts per active CU.
+    Wave Life: Average number of cycles executing a wave.
+    SALU: Total number of SALU (Scalar ALU) instructions issued per normalization
+      unit.
+    SMEM: Total number of SMEM (Scalar Memory Read) instructions issued per normalization
+      unit.
+    VALU: The number of VALU (Vector ALU) instructions issued per normalization unit.
+    MFMA: Total number of MFMA (Matrix-Fused-Multiply-Add) instructions issued per
+      normalization unit.
+    VMEM: The number of VMEM (GPU Memory) read instructions issued (including FLAT/scratch
+      memory) per normalization unit.
+    LDS: The total number of LDS instructions (including, but not limited to, read/write/atomics
+      and HIP's __shfl instructions) executed per normalization unit.
+    GWS: Total number of GDS (global data sync) instructions issued per normalization
+      unit.
+    BR: Total number of BRANCH instructions issued per normalization unit.
+    Active CUs: Total number of active compute units (CUs) on the accelerator during
+      the kernel execution.
+    Num CUs: Total number of compute units (CUs) on the accelerator.
+    VGPR: |-
+      The number of architected vector general-purpose registers allocated
+      for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
+      requested by the compiler due to allocation granularity.
+    SGPR: |-
+      The number of scalar general-purpose registers allocated for the kernel,
+      see SALU. Note: this may not exactly match the number of SGPRs requested by
+      the compiler due to allocation granularity.
+    LDS Allocation: |-
+      The number of bytes of LDS memory (or, shared memory) allocated for
+      this kernel. Note: This may also be larger than what was requested at compile
+      time due to both allocation granularity and dynamic per-dispatch LDS allocations.
+    Scratch Allocation: The number of bytes of scratch memory requested per work-item
+      for this kernel. Scratch memory is used for stack memory on the accelerator,
+      as well as for register spills and restores.
+    Wavefronts: The total number of wavefronts, summed over all workgroups, forming
+      this kernel launch.
+    Workgroups: The total number of workgroups forming this kernel launch.
+    LDS Req: The total number of LDS instructions (including, but not limited to,
+      read/write/atomics and HIP's __shfl instructions) executed per normalization
+      unit.
+    LDS Util: Indicates what percent of the kernel's duration the LDS was actively
+      executing instructions (including, but not limited to, load, store, atomic and
+      HIP's __shfl operations). Calculated as the ratio of the total number of cycles
+      LDS was active over the total CU cycles.
+    LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
+      / acknowledgment) required for an LDS instruction to complete.
+    VL1 Rd: The total number of incoming read requests from the address processing
+      unit after coalescing, per normalization unit.
+    VL1 Wr: The total number of incoming write requests from the address processing
+      unit after coalescing, per normalization unit.
+    VL1 Atomic: The total number of incoming atomic requests from the address processing
+      unit after coalescing, per normalization unit.
+    VL1 Hit: The ratio of the number of vL1D cache line requests that hit in vL1D
+      cache over the total number of cache line requests to the vL1D Cache RAM.
+    VL1 Lat: Calculated as the average number of cycles that a vL1D cache line request
+      spent in the vL1D cache pipeline.
+    VL1 Coalesce: Indicates how well memory instructions were coalesced by the address
+      processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
+      as the average number of thread-requests generated per instruction divided by
+      the ideal number of thread-requests per instruction.
+    VL1 Stall: The ratio of the number of cycles where the vL1D is stalled waiting
+      to issue a request for data to the L2 cache divided by the number of cycles
+      where the vL1D is active.
+    VL1_L2 Rd: The number of read requests for a vL1D cache line that were not satisfied
+      by the vL1D and must be retrieved from the L2 cache, per normalization
+      unit.
+    VL1_L2 Wr: The number of write requests to a vL1D cache line that were sent through
+      the vL1D to the L2 cache, per normalization unit.
+    VL1_L2 Atomic: The number of atomic requests that are sent through the vL1D to
+      the L2 cache, per normalization unit. This includes requests for atomics with
+      and without return.
+    sL1D Rd: The total number of requests, of any size or type, made to the sL1D per
+      normalization unit.
+    sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
+      line, per normalization unit.
+    sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
+      unit.
+    sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
+      unit. Typically unused on current CDNA accelerators.
+    sL1D_L2 Atomic: The total number of atomic requests from sL1D to the L2, per normalization
+      unit. Typically unused on current CDNA accelerators.
+    IL1 Fetch: The total number of requests made to the L1I per normalization unit.
+    IL1 Hit: The percent of L1I requests that hit on a previously loaded line in the
+      cache. Calculated as the ratio of the number of L1I requests that hit over the
+      number of all L1I requests.
+    IL1 Lat: The average number of cycles spent to fetch instructions to a CU.
+    IL1_L2 Rd: The total number of requests across the L1I - L2 interface per normalization unit.
+    L2 Rd: The total number of read requests to the L2 from all clients.
+    L2 Wr: The total number of write requests to the L2 from all clients.
+    L2 Atomic: The total number of atomic requests (with and without return) to the
+      L2 from all clients.
+    L2 Hit: The ratio of the number of L2 cache line requests that hit in the L2 cache
+      over the total number of incoming cache line requests to the L2 cache.
+    L2 Rd Lat: Calculated as the average number of cycles that the vL1D cache took
+      to issue and receive read requests from the L2 Cache. This number also includes
+      requests for atomics with return values.
+    L2 Wr Lat: Calculated as the average number of cycles that the vL1D cache took
+      to issue and receive acknowledgement of a write request to the L2 Cache. This
+      number also includes requests for atomics without return values.
+    Fabric_L2 Rd: Number of L2 cache - Infinity Fabric read requests (either 32-byte
+      or 64-byte) summed over TCC instances per normalization unit.
+    Fabric_L2 Wr: Number of L2 cache - Infinity Fabric write requests (either 32-byte
+      or 64-byte) summed over TCC instances per normalization unit.
+    Fabric_L2 Atomic: Number of L2 cache - Infinity Fabric write requests (either
+      32-byte or 64-byte) that are actually atomic requests summed over TCC instances
+      per normalization unit.
+    Fabric Rd Lat: The time-averaged number of cycles read requests spent in Infinity
+      Fabric before data was returned to the L2.
+    Fabric Wr Lat: The time-averaged number of cycles write requests spent in Infinity
+      Fabric before a completion acknowledgement was returned to the L2.
+    Fabric Atomic Lat: The time-averaged number of cycles atomic requests spent in
+      Infinity Fabric before a completion acknowledgement (atomic without return value)
+      or data (atomic with return value) was returned to the L2.
+    HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
+      of data from the accelerator's local HBM, per normalization unit.
+    HBM Wr: |-
+      The total number of L2 requests to Infinity Fabric to write or atomically
+      update 32B or 64B of data in the accelerator's local HBM, per normalization
+      unit.
@@ -2,85 +2,6 @@
 Panel Config:
  id: 400
  title: Roofline
-  metrics_description:
-    VALU FLOPs (F16): 'The total 16-bit floating-point operations executed per second
-      on the VALU. This is presented with the value of the peak empirical F16 FLOPs
-      achievable on the specific accelerator. Note: this does not include any F16
-      operations from MFMA instructions.'
-    VALU FLOPs (F32): 'The total 32-bit floating-point operations executed per second
-      on the VALU. This is presented with the value of the peak empirical F32 FLOPs
-      achievable on the specific accelerator. Note: this does not include any F32
-      operations from MFMA instructions.'
-    VALU FLOPs (F64): 'The total 64-bit floating-point operations executed per second
-      on the VALU. This is presented with the value of the peak empirical F64 FLOPs
-      achievable on the specific accelerator. Note: this does not include any F64
-      operations from MFMA instructions.'
-    MFMA FLOPs (F8): The total number of 8-bit brain floating point MFMA operations
-      executed per second. This does not include any 16-bit brain floating point operations
-      from VALU instructions. The peak empirically measured F8 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison. It is supported
-      on AMD Instinct MI300 series and later only.
-    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
-      executed per second. Note: this does not include any 16-bit brain floating point
-      operations from VALU instructions. The peak empirically measured BF16 MFMA operations
-      achievable on the specific accelerator is displayed alongside for comparison.'
-    MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
-      per second. Note: this does not include any 16-bit floating point operations
-      from VALU instructions. The peak empirically measured F16 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison.'
-    MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
-      per second. Note: this does not include any 32-bit floating point operations
-      from VALU instructions. The peak empirically measured F32 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison.'
-    MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
-      per second. Note: this does not include any 64-bit floating point operations
-      from VALU instructions. The peak empirically measured F64 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison.'
-    MFMA FLOPs (F6F4): 'The total number of 4-bit and 6-bit floating point MFMA operations
-      executed per second. Note: this does not include any floating point operations
-      from VALU instructions. The peak empirically measured F6F4 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison. It is supported
-      on AMD Instinct MI350 series (gfx950) and later only.'
-    MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
-      per second. Note: this does not include any 8-bit integer operations from VALU
-      instructions. The peak empirically measured INT8 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison.'
-    HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth
-      Memory (HBM) per second. The peak empirically measured bandwidth achievable
-      on the specific accelerator is displayed alongside for comparison.
-    L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
-      The number of bytes is calculated as the number of cache lines requested multiplied
-      by the cache line size. This value does not consider partial requests, so e.g.,
-      if only a single value is requested in a cache line, the data movement will
-      still be counted as a full cache line. The peak empirically measured bandwidth
-      achievable on the specific accelerator is displayed alongside for comparison.
-    L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
-      of VMEM instructions per unit time. The number of bytes is calculated as the
-      number of cache lines requested multiplied by the cache line size. This value
-      does not consider partial requests, so e.g., if only a single value is requested
-      in a cache line, the data movement will still be counted as a full cache line.
-      The peak empirically measured bandwidth achievable on the specific accelerator
-      is displayed alongside for comparison.
-    LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded
-      from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
-      example for more detail). The peak empirically measured LDS bandwidth achievable
-      on the specific accelerator is displayed alongside for comparison.
-    AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
-      of total floating-point operations (FLOPs) to total bytes transferred between
-      the L1 cache and the processing units. This value is used as the x-coordinate
-      for the L1 roofline.
-    AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
-      of total floating-point operations (FLOPs) to total bytes transferred between
-      the L2 cache and the L1 cache. This value is used as the x-coordinate for the
-      L2 roofline.
-    AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
-      It is the ratio of total floating-point operations (FLOPs) to total bytes transferred
-      between HBM and the L2 cache. This value is used as the x-coordinate for the
-      HBM roofline.
-    Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs
-      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
-      operations divided by the total execution time. This value is used as the y-coordinate
-      for the kernel's point on the Roofline plot.
  data source:
  - metric_table:
      id: 401
@@ -212,3 +133,86 @@ Panel Config:
            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / (SUM(End_Timestamp - Start_Timestamp)
            / 1e9) ) / 1e9
          unit: GFLOP/s
+  metrics_description:
+    VALU FLOPs (F16): |-
+      The total 16-bit floating-point operations executed per second on the VALU.
+      This is presented with the value of the peak empirical F16 FLOPs achievable
+      on the specific accelerator. Note: this does not include any F16 operations
+      from MFMA instructions.
+    VALU FLOPs (F32): |-
+      The total 32-bit floating-point operations executed per second on the VALU.
+      This is presented with the value of the peak empirical F32 FLOPs achievable
+      on the specific accelerator. Note: this does not include any F32 operations
+      from MFMA instructions.
+    VALU FLOPs (F64): |-
+      The total 64-bit floating-point operations executed per second on the VALU.
+      This is presented with the value of the peak empirical F64 FLOPs achievable
+      on the specific accelerator. Note: this does not include any F64 operations
+      from MFMA instructions.
+    MFMA FLOPs (BF16): |-
+      The total number of 16-bit brain floating point MFMA operations executed
+      per second. Note: this does not include any 16-bit brain floating point
+      operations from VALU instructions. The peak empirically measured BF16 MFMA
+      operations achievable on the specific accelerator is displayed alongside
+      for comparison.
+    MFMA FLOPs (F16): |-
+      The total number of 16-bit floating point MFMA operations executed per
+      second. Note: this does not include any 16-bit floating point operations from
+      VALU instructions. The peak empirically measured F16 MFMA operations
+      achievable on the specific accelerator is displayed alongside for comparison.
+    MFMA FLOPs (F32): |-
+      The total number of 32-bit floating point MFMA operations executed per
+      second. Note: this does not include any 32-bit floating point operations from
+      VALU instructions. The peak empirically measured F32 MFMA operations
+      achievable on the specific accelerator is displayed alongside for comparison.
+    MFMA FLOPs (F64): |-
+      The total number of 64-bit floating point MFMA operations executed per
+      second. Note: this does not include any 64-bit floating point operations from
+      VALU instructions. The peak empirically measured F64 MFMA operations
+      achievable on the specific accelerator is displayed alongside for comparison.
+    MFMA IOPs (Int8): |-
+      The total number of 8-bit integer MFMA operations executed per second.
+      Note: this does not include any 8-bit integer operations from VALU instructions.
+      The peak empirically measured INT8 MFMA operations achievable on the specific
+      accelerator is displayed alongside for comparison.
+    HBM Bandwidth: |-
+      The total number of bytes read from and written to High-Bandwidth
+      Memory (HBM) per second. The peak empirically measured bandwidth achievable
+      on the specific accelerator is displayed alongside for comparison.
+    L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
+      The number of bytes is calculated as the number of cache lines requested multiplied
+      by the cache line size. This value does not consider partial requests, so e.g.,
+      if only a single value is requested in a cache line, the data movement will
+      still be counted as a full cache line. The peak empirically measured bandwidth
+      achievable on the specific accelerator is displayed alongside for comparison.
+    L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
+      of VMEM instructions per unit time. The number of bytes is calculated as the
+      number of cache lines requested multiplied by the cache line size. This value
+      does not consider partial requests, so e.g., if only a single value is requested
+      in a cache line, the data movement will still be counted as a full cache line.
+      The peak empirically measured bandwidth achievable on the specific accelerator
+      is displayed alongside for comparison.
+    LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded
+      from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
+      example for more detail). The peak empirically measured LDS bandwidth achievable
+      on the specific accelerator is displayed alongside for comparison.
+    AI L1: |-
+      The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
+      of total floating-point operations (FLOPs) to total bytes transferred between
+      the L1 cache and the processing units. This value is used as the x-coordinate
+      for the L1 roofline.
+    AI L2: |-
+      The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
+      of total floating-point operations (FLOPs) to total bytes transferred between
+      the L2 cache and the L1 cache. This value is used as the x-coordinate for
+      the L2 roofline.
+    AI HBM: |-
+      The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
+      It is the ratio of total floating-point operations (FLOPs) to total bytes
+      transferred between HBM and the L2 cache. This value is used as the x-coordinate
+      for the HBM roofline.
+    Performance (GFLOPs): |-
+      The overall achieved performance, measured in GigaFLOPs
+      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
+      operations divided by the total execution time. This value is used as the y-coordinate
+      for the kernel's point on the Roofline plot.
@@ -2,30 +2,6 @@
 Panel Config:
  id: 500
  title: Command Processor (CPC/CPF)
-  metrics_description:
-    CPF Utilization: Percent of total cycles where the CPF was busy actively doing
-      any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
-    CPF Stall: Percent of CPF busy cycles where the CPF was stalled for any reason.
-    CPF-L2 Utilization: Percent of total cycles counted by the CPF-L2 interface where
-      the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
-      over total cycles counted by the CPF-L2.
-    CPF-L2 Stall: Percent of CPF-L2 L2 busy cycles where the CPF-L2 interface was
-      stalled for any reason.
-    CPF-UTCL1 Stall: Percent of CPF busy cycles where the CPF was stalled by address
-      translation.
-    CPC Utilization: Percent of total cycles where the CPC was busy actively doing
-      any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
-    CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
-    CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
-      for processing.
-    CPC-Workgroup Manager Utilization: Percent of CPC busy cycles spent dispatching
-      workgroups to the workgroup manager.
-    CPC-L2 Utilization: Percent of total cycles counted by the CPC-L2 interface where
-      the CPC-L2 interface was active doing any work.
-    CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
-      translation
-    CPC-UTCL2 Utilization: 'Percent of total cycles counted by the CPC''s L2 address
-      translation interface where the CPC was busy doing address translation work.  '
  data source:
  - metric_table:
      id: 501
@@ -143,3 +119,28 @@ Panel Config:
          max: MAX((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
            if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
          unit: pct
+  metrics_description:
+    CPF Utilization: Percent of total cycles where the CPF was busy actively doing
+      any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
+    CPF Stall: Percent of CPF busy cycles where the CPF was stalled for any reason.
+    CPF-L2 Utilization: Percent of total cycles counted by the CPF-L2 interface where
+      the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
+      over total cycles counted by the CPF-L2.
+    CPF-L2 Stall: Percent of CPF-L2 busy cycles where the CPF-L2 interface was
+      stalled for any reason.
+    CPF-UTCL1 Stall: Percent of CPF busy cycles where the CPF was stalled by address
+      translation.
+    CPC Utilization: Percent of total cycles where the CPC was busy actively doing
+      any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
+    CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
+    CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
+      for processing.
+    CPC-Workgroup Manager Utilization: Percent of CPC busy cycles spent dispatching
+      workgroups to the workgroup manager.
+    CPC-L2 Utilization: Percent of total cycles counted by the CPC-L2 interface where
+      the CPC-L2 interface was active doing any work.
+    CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
+      translation.
+    CPC-UTCL2 Utilization: |-
+      Percent of total cycles counted by the CPC's L2 address translation
+      interface where the CPC was busy doing address translation work.
@@ -2,61 +2,6 @@
 Panel Config:
  id: 600
  title: Workgroup Manager (SPI)
-  metrics_description:
-    Accelerator Utilization: The percent of cycles in the kernel where the accelerator
-      was actively doing any work.
-    Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
-      kernel where the scheduler-pipes were actively doing any work.
-    Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
-      manager was actively doing any work.
-    Shader Engine Utilization: The percent of total shader engine cycles in the kernel
-      where any CU in a shader-engine was actively doing any work, normalized over
-      all shader-engines. Low values (e.g., << 100%) indicate that the accelerator
-      was not fully saturated by the kernel, or a potential load-imbalance issue.
-    SIMD Utilization: The percent of total SIMD cycles in the kernel where any SIMD
-      on a CU was actively doing any work, summed over all CUs. Low values (less than
-      100%) indicate that the accelerator was not fully saturated by the kernel, or
-      a potential load-imbalance issue.
-    Dispatched Workgroups: The total number of workgroups forming this kernel launch.
-    Dispatched Wavefronts: The total number of wavefronts, summed over all workgroups,
-      forming this kernel launch.
-    VGPR Writes: The average number of cycles spent initializing VGPRs at wave creation.
-    SGPR Writes: The average number of cycles spent initializing SGPRs at wave creation.
-    Not-scheduled Rate (Workgroup Manager): The percent of total scheduler-pipe cycles
-      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
-      within the workgroup manager rather than a lack of a CU or SIMD with sufficient
-      resources.
-    Not-scheduled Rate (Scheduler-Pipe): 'The percent of total scheduler-pipe cycles
-      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
-      within the scheduler-pipes rather than a lack of a CU or SIMD with sufficient
-      resources. '
-    Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
-      where a workgroup could not be scheduled to a CU due to occupancy limitations
-      (like a lack of a CU or SIMD with sufficient resources).
-    Scratch Stall Rate: The percent of total shader-engine cycles in the kernel where
-      a workgroup could not be scheduled to a CU due to lack of private (a.k.a., scratch)
-      memory slots. While this can reach up to 100%, note that the actual occupancy
-      limitations on a kernel using private memory are typically quite small (for
-      example, less than 1% of the total number of waves that can be scheduled to
-      an accelerator).
-    Insufficient SIMD Waveslots: The percent of total SIMD cycles in the kernel where
-      a workgroup could not be scheduled to a SIMD due to lack of available waveslots.
-    Insufficient SIMD VGPRs: The percent of total SIMD cycles in the kernel where
-      a workgroup could not be scheduled to a SIMD due to lack of available VGPRs.
-    Insufficient SIMD SGPRs: The percent of total SIMD cycles in the kernel where
-      a workgroup could not be scheduled to a SIMD due to lack of available SGPRs.
-    Insufficient CU LDS: The percent of total CU cycles in the kernel where a workgroup
-      could not be scheduled to a CU due to lack of available LDS.
-    Insufficient CU Barriers: The percent of total CU cycles in the kernel where a
-      workgroup could not be scheduled to a CU due to lack of available barriers.
-    Reached CU Workgroup Limit: The percent of total CU cycles in the kernel where
-      a workgroup could not be scheduled to a CU due to limits within the workgroup
-      manager. This is expected to be always be zero on CDNA2 or newer accelerators
-      (and small for previous accelerators).
-    Reached CU Wavefront Limit: The percent of total CU cycles in the kernel where
-      a wavefront could not be scheduled to a CU due to limits within the workgroup
-      manager. This is expected to be always be zero on CDNA2 or newer accelerators
-      (and small for previous accelerators).
  data source:
  - metric_table:
      id: 601
@@ -199,3 +144,58 @@ Panel Config:
          min: MIN(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
+  metrics_description:
+    Accelerator Utilization: The percent of cycles in the kernel where the accelerator
+      was actively doing any work.
+    Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
+      kernel where the scheduler-pipes were actively doing any work.
+    Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
+      manager was actively doing any work.
+    Shader Engine Utilization: The percent of total shader engine cycles in the kernel
+      where any CU in a shader-engine was actively doing any work, normalized over
+      all shader-engines. Low values (e.g., << 100%) indicate that the accelerator
+      was not fully saturated by the kernel, or a potential load-imbalance issue.
+    SIMD Utilization: The percent of total SIMD cycles in the kernel where any SIMD
+      on a CU was actively doing any work, summed over all CUs. Low values (less than
+      100%) indicate that the accelerator was not fully saturated by the kernel, or
+      a potential load-imbalance issue.
+    Dispatched Workgroups: The total number of workgroups forming this kernel launch.
+    Dispatched Wavefronts: The total number of wavefronts, summed over all workgroups,
+      forming this kernel launch.
+    VGPR Writes: The average number of cycles spent initializing VGPRs at wave creation.
+    SGPR Writes: The average number of cycles spent initializing SGPRs at wave creation.
+    Not-scheduled Rate (Workgroup Manager): The percent of total scheduler-pipe cycles
+      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
+      within the workgroup manager rather than a lack of a CU or SIMD with sufficient
+      resources.
+    Not-scheduled Rate (Scheduler-Pipe): |-
+      The percent of total scheduler-pipe cycles in the kernel where a workgroup
+      could not be scheduled to a CU due to a bottleneck within the scheduler-pipes
+      rather than a lack of a CU or SIMD with sufficient resources.
+    Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
+      where a workgroup could not be scheduled to a CU due to occupancy limitations
+      (like a lack of a CU or SIMD with sufficient resources).
+    Scratch Stall Rate: The percent of total shader-engine cycles in the kernel where
+      a workgroup could not be scheduled to a CU due to lack of private (a.k.a., scratch)
+      memory slots. While this can reach up to 100%, note that the actual occupancy
+      limitations on a kernel using private memory are typically quite small (for
+      example, less than 1% of the total number of waves that can be scheduled to
+      an accelerator).
+    Insufficient SIMD Waveslots: The percent of total SIMD cycles in the kernel where
+      a workgroup could not be scheduled to a SIMD due to lack of available waveslots.
+    Insufficient SIMD VGPRs: The percent of total SIMD cycles in the kernel where
+      a workgroup could not be scheduled to a SIMD due to lack of available VGPRs.
+    Insufficient SIMD SGPRs: The percent of total SIMD cycles in the kernel where
+      a workgroup could not be scheduled to a SIMD due to lack of available SGPRs.
+    Insufficient CU LDS: The percent of total CU cycles in the kernel where a workgroup
+      could not be scheduled to a CU due to lack of available LDS.
+    Insufficient CU Barriers: The percent of total CU cycles in the kernel where a
+      workgroup could not be scheduled to a CU due to lack of available barriers.
+    Reached CU Workgroup Limit: The percent of total CU cycles in the kernel where
+      a workgroup could not be scheduled to a CU due to limits within the workgroup
+      manager. This is expected to always be zero on CDNA2 or newer accelerators
+      (and small for previous accelerators).
+    Reached CU Wavefront Limit: The percent of total CU cycles in the kernel where
+      a wavefront could not be scheduled to a CU due to limits within the workgroup
+      manager. This is expected to always be zero on CDNA2 or newer accelerators
+      (and small for previous accelerators).
@@ -2,63 +2,6 @@
 Panel Config:
  id: 700
  title: Wavefront
-  metrics_description:
-    Grid Size: The total number of work-items (or, threads) launched as a part of
-      the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied
-      by the total workgroup (or, block) size.
-    Workgroup Size: The total number of work-items (or, threads) in each workgroup
-      (or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
-      to the total block size.
-    Total Wavefronts: "The total number of wavefronts launched as part of the kernel\
-      \ dispatch. On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs,\
-      \ the wavefront size is always 64 work-items. Thus, the total number of wavefronts\
-      \ should be equivalent to the ceiling of grid size divided by 64."
-    Saved Wavefronts: The total number of wavefronts saved at a context-save.
-    Restored Wavefronts: The total number of wavefronts restored from a context-save.
-    VGPRs: 'The number of architected vector general-purpose registers allocated for
-      the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
-      by the compiler due to allocation granularity.'
-    AGPRs: 'The number of accumulation vector general-purpose registers allocated
-      for the kernel, see AGPRs. Note: this may not exactly match the number of AGPRs
-      requested by the compiler due to allocation granularity.'
-    SGPRs: 'The number of scalar general-purpose registers allocated for the kernel,
-      see SALU. Note: this may not exactly match the number of SGPRs requested by
-      the compiler due to allocation granularity.'
-    LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
-      for this kernel. Note: This may also be larger than what was requested at compile
-      time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
-    Scratch Allocation: The number of bytes of scratch memory requested per work-item
-      for this kernel. Scratch memory is used for stack memory on the accelerator,
-      as well as for register spills and restores.
-    Kernel Time: The total duration of the executed kernel.
-    Kernel Time (Cycles): The total duration of the executed kernel in cycles.
-    Instructions per wavefront: The average number of instructions (of all types)
-      executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.
-    Wave Cycles: The number of cycles a wavefront in the kernel dispatch spent resident
-      on a compute unit per normalization unit. This is averaged over all wavefronts
-      in a kernel dispatch.
-    Dependency Wait Cycles: The number of cycles a wavefront in the kernel dispatch
-      spent resident on a compute unit per normalization unit. This is averaged over
-      all wavefronts in a kernel dispatch.
-    Issue Wait Cycles: The number of cycles a wavefront in the kernel dispatch was
-      unable to issue an instruction for any reason (e.g., execution pipe back-pressure,
-      arbitration loss, etc.) per normalization unit. This counter is incremented
-      at every cycle by all wavefronts on a CU unable to issue an instruction. As
-      such, it is most useful to get a sense of how waves were spending their time,
-      rather than identification of a precise limiter because another wave could be
-      actively executing while a wave is issue stalled. The sum of this metric, Dependency
-      Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
-    Active Cycles: The average number of cycles a wavefront in the kernel dispatch
-      was actively executing instructions per normalization unit. This measurement
-      is made on a per-wavefront basis, and may include cycles that another wavefront
-      spent actively executing (on another execution unit, for example) or was stalled.
-      As such, it is most useful to get a sense of how waves were spending their time,
-      rather than identification of a precise limiter. The sum of this metric, Issue
-      Wait Cycles and Active Wait Cycles should be equal to the total Wave Cycles
-      metric.
-    Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
-      over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
-      kernels (less than 1ms).'
  data source:
  - metric_table:
      id: 701
@@ -171,3 +114,66 @@ Panel Config:
          max: MAX((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
          unit: Wavefronts
          coll_level: SQ_LEVEL_WAVES
+  metrics_description:
+    Grid Size: The total number of work-items (or, threads) launched as a part of
+      the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied
+      by the total workgroup (or, block) size.
+    Workgroup Size: The total number of work-items (or, threads) in each workgroup
+      (or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
+      to the total block size.
+    Total Wavefronts: |-
+      The total number of wavefronts launched as part of the kernel dispatch.
+      On AMD Instinct™ CDNA™ accelerators and GCN™ GPUs, the wavefront
+      size is always 64 work-items. Thus, the total number of wavefronts should
+      be equivalent to the ceiling of grid size divided by 64.
+    Saved Wavefronts: The total number of wavefronts saved at a context-save.
+    Restored Wavefronts: The total number of wavefronts restored from a context-save.
+    VGPRs: |-
+      The number of architected vector general-purpose registers allocated
+      for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
+      requested by the compiler due to allocation granularity.
+    AGPRs: |-
+      The number of accumulation vector general-purpose registers allocated
+      for the kernel, see AGPRs. Note: this may not exactly match the number of
+      AGPRs requested by the compiler due to allocation granularity.
+    SGPRs: |-
+      The number of scalar general-purpose registers allocated for the kernel,
+      see SALU. Note: this may not exactly match the number of SGPRs requested by
+      the compiler due to allocation granularity.
+    LDS Allocation: |-
+      The number of bytes of LDS memory (or, shared memory) allocated for
+      this kernel. Note: This may also be larger than what was requested at compile
+      time due to both allocation granularity and dynamic per-dispatch LDS allocations.
+    Scratch Allocation: The number of bytes of scratch memory requested per work-item
+      for this kernel. Scratch memory is used for stack memory on the accelerator,
+      as well as for register spills and restores.
+    Kernel Time: The total duration of the executed kernel.
+    Kernel Time (Cycles): The total duration of the executed kernel in cycles.
+    Instructions per wavefront: The average number of instructions (of all types)
+      executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.
+    Wave Cycles: The number of cycles a wavefront in the kernel dispatch spent resident
+      on a compute unit per normalization unit. This is averaged over all wavefronts
+      in a kernel dispatch.
+    Dependency Wait Cycles: The number of cycles a wavefront in the kernel dispatch
+      stalled waiting on memory of any kind (e.g., instruction fetch, vector or scalar
+      memory, etc.) per normalization unit.
+    Issue Wait Cycles: The number of cycles a wavefront in the kernel dispatch was
+      unable to issue an instruction for any reason (e.g., execution pipe back-pressure,
+      arbitration loss, etc.) per normalization unit. This counter is incremented
+      at every cycle by all wavefronts on a CU unable to issue an instruction. As
+      such, it is most useful to get a sense of how waves were spending their time,
+      rather than identification of a precise limiter because another wave could be
+      actively executing while a wave is issue stalled. The sum of this metric, Dependency
+      Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
+    Active Cycles: The average number of cycles a wavefront in the kernel dispatch
+      was actively executing instructions per normalization unit. This measurement
+      is made on a per-wavefront basis, and may include cycles that another wavefront
+      spent actively executing (on another execution unit, for example) or was stalled.
+      As such, it is most useful to get a sense of how waves were spending their time,
+      rather than identification of a precise limiter. The sum of this metric, Issue
+      Wait Cycles and Dependency Wait Cycles should be equal to the total Wave Cycles
+      metric.
+    Wavefront Occupancy: |-
+      The time-averaged number of wavefronts resident on the accelerator over
+      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
+      kernels (less than 1 ms).
@@ -2,90 +2,6 @@
 Panel Config:
  id: 1000
  title: Compute Units - Instruction Mix
-  metrics_description:
-    VALU: The total number of vector arithmetic logic unit (VALU) operations issued.
-      These are the workhorses of the compute unit, and are used to execute a wide
-      range of instruction types including floating point operations, non-uniform
-      address calculations, transcendental operations, integer operations, shifts,
-      conditional evaluation, etc.
-    VMEM: The total number of vector memory operations issued. These include most
-      loads, stores and atomic operations and all accesses to generic, global, private
-      and texture memory.
-    LDS: The total number of LDS (also known as shared memory) operations issued.
-      These include loads, stores, atomics, and HIP's __shfl operations.
-    MFMA: The total number of matrix fused multiply-add instructions issued.
-    SALU: The total number of scalar arithmetic logic unit (SALU) operations issued.
-      Typically these are used for address calculations, literal constants, and other
-      operations that are provably uniform across a wavefront. Although scalar memory
-      (SMEM) operations are issued by the SALU, they are counted separately in this
-      section.
-    SMEM: The total number of scalar memory (SMEM) operations issued. These are typically
-      used for loading kernel arguments, base-pointers and loads from HIP's __constant__
-      memory.
-    Branch: The total number of branch operations issued. These typically consist
-      of jump or branch operations and are used to implement control flow.
-    INT32: The total number of instructions operating on 32-bit integer operands issued
-      to the VALU per normalization unit.
-    INT64: The total number of instructions operating on 64-bit integer operands issued
-      to the VALU per normalization unit.
-    F16-ADD: The total number of addition instructions operating on 16-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F16-MUL: The total number of multiplication instructions operating on 16-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F16-FMA: The total number of fused multiply-add instructions operating on 16-bit
-      floating-point operands issued to the VALU per normalization unit.
-    F16-Trans: The total number of transcendental instructions (e.g., sqrt) operating
-      on 16-bit floating-point operands issued to the VALU per normalization unit.
-    F32-ADD: The total number of addition instructions operating on 32-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F32-MUL: The total number of multiplication instructions operating on 32-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F32-FMA: The total number of fused multiply-add instructions operating on 32-bit
-      floating-point operands issued to the VALU per normalization unit.
-    F32-Trans: The total number of transcendental instructions (such as sqrt) operating
-      on 32-bit floating-point operands issued to the VALU per normalization unit.
-    F64-ADD: The total number of addition instructions operating on 64-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F64-MUL: The total number of multiplication instructions operating on 64-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F64-FMA: The total number of fused multiply-add instructions operating on 64-bit
-      floating-point operands issued to the VALU per normalization unit.
-    F64-Trans: The total number of transcendental instructions (such as sqrt) operating
-      on 64-bit floating-point operands issued to the VALU per normalization unit.
-    Conversion: "The total number of type conversion instructions (such as converting\
-      \ data to or from F32\u2194F64) issued to the VALU per normalization unit."
-    Global/Generic Instr: The total number of global & generic memory instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Global/Generic Read: The total number of global & generic memory read instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Global/Generic Write: The total number of global & generic memory write instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Global/Generic Atomic: The total number of global & generic memory atomic (with
-      and without return) instructions executed on all compute units on the accelerator,
-      per normalization unit.
-    Spill/Stack Instr: The total number of spill/stack memory instructions executed
-      on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Read: The total number of spill/stack memory read instructions executed
-      on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Write: The total number of spill/stack memory write instructions executed
-      on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Atomic: The total number of spill/stack memory atomic (with and without
-      return) instructions executed on all compute units on the accelerator, per normalization
-      unit. Typically unused as these memory operations are typically used to implement
-      thread-local storage.
-    MFMA-I8: The total number of 8-bit integer MFMA instructions issued per normalization
-      unit.
-    MFMA-F8: The total number of 8-bit floating point MFMA instructions issued per
-      normalization unit. This is supported in AMD Instinct MI300 series and later
-      only.
-    MFMA-F16: The total number of 16-bit floating point MFMA instructions issued per
-      normalization unit.
-    MFMA-BF16: The total number of 16-bit brain floating point MFMA instructions issued
-      per normalization unit.
-    MFMA-F32: The total number of 32-bit floating-point MFMA instructions issued per
-      normalization unit.
-    MFMA-F64: The total number of 64-bit floating-point MFMA instructions issued per
-      normalization unit.
  data source:
  - metric_table:
      id: 1001
@@ -187,3 +103,35 @@ Panel Config:
        max: Max
        unit: Unit
      metric: {}
+  metrics_description:
+    LDS: The total number of LDS (also known as shared memory) operations issued.
+      These include loads, stores, atomics, and HIP's __shfl operations.
+    SALU: The total number of scalar arithmetic logic unit (SALU) operations issued.
+      Typically these are used for address calculations, literal constants, and other
+      operations that are provably uniform across a wavefront. Although scalar memory
+      (SMEM) operations are issued by the SALU, they are counted separately in this
+      section.
+    SMEM: The total number of scalar memory (SMEM) operations issued. These are typically
+      used for loading kernel arguments, base-pointers and loads from HIP's __constant__
+      memory.
+    Branch: The total number of branch operations issued. These typically consist
+      of jump or branch operations and are used to implement control flow.
+    Global/Generic Instr: The total number of global & generic memory instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Global/Generic Read: The total number of global & generic memory read instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Global/Generic Write: The total number of global & generic memory write instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Global/Generic Atomic: The total number of global & generic memory atomic (with
+      and without return) instructions executed on all compute units on the accelerator,
+      per normalization unit.
+    Spill/Stack Instr: The total number of spill/stack memory instructions executed
+      on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Read: The total number of spill/stack memory read instructions executed
+      on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Write: The total number of spill/stack memory write instructions executed
+      on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Atomic: The total number of spill/stack memory atomic (with and without
+      return) instructions executed on all compute units on the accelerator, per normalization
+      unit. Typically unused, as these memory operations are generally used to implement
+      thread-local storage.
@@ -2,84 +2,6 @@
 Panel Config:
  id: 1100
  title: Compute Units - Compute Pipeline
-  metrics_description:
-    VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
-      This is also presented as a percent of the peak theoretical FLOPs achievable
-      on the specific accelerator. Note: this does not include any floating-point
-      operations from MFMA instructions.'
-    VALU IOPs: 'The total integer operations executed per second on the VALU. This
-      is also presented as a percent of the peak theoretical IOPs achievable on the
-      specific accelerator. Note: this does not include any integer operations from
-      MFMA instructions.'
-    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
-      executed per second. Note: this does not include any 16-bit brain floating point
-      operations from VALU instructions. This is also presented as a percent of the
-      peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
-      per second. Note: this does not include any 16-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F16 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
-      per second. Note: this does not include any 32-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F32 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
-      per second. Note: this does not include any 64-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F64 MFMA operations achievable on the specific accelerator.'
-    MFMA IOPs (INT8): 'The total number of 8-bit integer MFMA operations executed
-      per second. Note: this does not include any 8-bit integer operations from VALU
-      instructions. This is also presented as a percent of the peak theoretical INT8
-      MFMA operations achievable on the specific accelerator.'
-    IPC: The ratio of the total number of instructions executed on the CU over the
-      total active CU cycles.
-    IPC (Issued): The ratio of the total number of (non-internal) instructions issued
-      over the number of cycles where the scheduler was actively working on issuing
-      instructions.
-    SALU Utilization: Indicates what percent of the kernel's duration the SALU was
-      busy executing instructions. Computed as the ratio of the total number of cycles
-      spent by the scheduler issuing SALU / SMEM instructions over the total CU cycles.
-    VALU Utilization: Indicates what percent of the kernel's duration the VALU was
-      busy executing instructions. Does not include VMEM operations. Computed as the
-      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
-      over the total CU cycles.
-    VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
-      was busy executing instructions, including both global/generic and spill/scratch
-      operations (see the VMEM instruction count metrics for more detail). Does not
-      include VALU operations. Computed as the ratio of the total number of cycles
-      spent by the scheduler issuing VMEM instructions over the total CU cycles.
-    Branch Utilization: Indicates what percent of the kernel's duration the branch
-      unit was busy executing instructions. Computed as the ratio of the total number
-      of cycles spent by the scheduler issuing branch instructions over the total
-      CU cycles.
-    VALU Active Threads: Indicates the average level of divergence within a wavefront
-      over the lifetime of the kernel. The number of work-items that were active in
-      a wavefront during execution of each VALU instruction, time-averaged over all
-      VALU instructions run on all wavefronts in the kernel
-    MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
-      was busy executing instructions. Computed as the ratio of the total number of
-      cycles spent by the MFMA was busy over the total CU cycles.
-    MFMA Instruction Cycles: The average duration of MFMA instructions in this kernel
-      in cycles. Computed as the ratio of the total number of cycles the MFMA unit
-      was busy over the total number of MFMA instructions.
-    VMEM Latency: The average number of round-trip cycles (that is, from issue to
-      data return / acknowledgment) required for a VMEM instruction to complete.
-    SMEM Latency: The average number of round-trip cycles (that is, from issue to
-      data return / acknowledgment) required for a SMEM instruction to complete.
-    FLOPs (Total): The total number of floating-point operations executed on either
-      the VALU or MFMA units, per normalization unit.
-    IOPs (Total): The total number of integer operations executed on either the VALU
-      or MFMA units, per normalization unit.
-    F16 OPs: The total number of 16-bit floating-point operations executed on either
-      the VALU or MFMA units, per normalization unit.
-    BF16 OPs: The total number of 16-bit brain floating-point operations executed
-      on either the VALU or MFMA units, per normalization unit.
-    F32 OPs: The total number of 32-bit floating-point operations executed on either
-      the VALU or MFMA units, per normalization unit.
-    F64 OPs: The total number of 64-bit floating-point operations executed on either
-      the VALU or MFMA units, per normalization unit.
-    INT8 OPs: The total number of 8-bit integer operations executed on either the
-      VALU or MFMA units, per normalization unit.
  data source:
  - metric_table:
      id: 1101
@@ -108,13 +30,13 @@ Panel Config:
          unit: Instr/cycle
        IPC (Issued):
          avg: AVG(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
-            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED  + SQ_INSTS_LDS)
+            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
            / SQ_ACTIVE_INST_ANY))
          min: MIN(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
            / SQ_ACTIVE_INST_ANY))
          max: MAX(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
-            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED  + SQ_INSTS_LDS)
+            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
            / SQ_ACTIVE_INST_ANY))
          unit: Instr/cycle
        SALU Utilization:
@@ -145,3 +67,20 @@ Panel Config:
        max: Max
        unit: Unit
      metric: {}
+  metrics_description:
+    IPC: The ratio of the total number of instructions executed on the CU over the
+      total active CU cycles.
+    IPC (Issued): The ratio of the total number of (non-internal) instructions issued
+      over the number of cycles where the scheduler was actively working on issuing
+      instructions.
+    SALU Utilization: Indicates what percent of the kernel's duration the SALU was
+      busy executing instructions. Computed as the ratio of the total number of cycles
+      spent by the scheduler issuing SALU / SMEM instructions over the total CU cycles.
+    VALU Utilization: Indicates what percent of the kernel's duration the VALU was
+      busy executing instructions. Does not include VMEM operations. Computed as the
+      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
+      over the total CU cycles.
+    VALU Active Threads: Indicates the average level of divergence within a wavefront
+      over the lifetime of the kernel. The number of work-items that were active in
+      a wavefront during execution of each VALU instruction, time-averaged over all
+      VALU instructions run on all wavefronts in the kernel.
@@ -2,51 +2,6 @@
 Panel Config:
  id: 1200
  title: Local Data Share (LDS)
-  metrics_description:
-    Utilization: Indicates what percent of the kernel's duration the LDS was actively
-      executing instructions (including, but not limited to, load, store, atomic and
-      HIP's __shfl operations). Calculated as the ratio of the total number of cycles
-      LDS was active over the total CU cycles.
-    Access Rate: Indicates the percentage of SIMDs in the VALU actively issuing LDS
-      instructions, averaged over the lifetime of the kernel. Calculated as the ratio
-      of the total number of cycles spent by the scheduler issuing LDS instructions
-      over the total CU cycles.
-    Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
-      could have been loaded from, stored to, or atomically updated in the LDS divided
-      as percentage of theoretical peak. Does not take into account the execution
-      mask of the wavefront when the instruction was executed.
-    Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
-      loaded from, stored to, or atomically updated in the LDS divided by total duration.
-      Does not take into account the execution mask of the wavefront when the instruction
-      was executed.
-    Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
-      servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing
-      bank conflicts over the number of LDS cycles that would have been required to
-      move the same amount of data in an uncontended access.
-    LDS Instructions: The total number of LDS instructions (including, but not limited
-      to, read/write/atomics and HIP's __shfl instructions) executed per normalization
-      unit.
-    LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
-      / acknowledgment) required for an LDS instruction to complete.
-    Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS scheduler
-      due to bank conflicts (as determined by the conflict resolution hardware) to
-      the base number of cycles that would be spent in the LDS scheduler in a completely
-      uncontended case. This is the unnormalized form of the Bank Conflict Rate.
-    Index Accesses: The total number of cycles spent in the LDS scheduler over all
-      operations per normalization unit.
-    Atomic Return Cycles: The total number of cycles spent on LDS atomics with return
-      per normalization unit.
-    Bank Conflict: The total number of cycles spent in the LDS scheduler due to bank
-      conflicts (as determined by the conflict resolution hardware) per normalization
-      unit.
-    Addr Conflict: The total number of cycles spent in the LDS scheduler due to address
-      conflicts (as determined by the conflict resolution hardware) per normalization
-      unit.
-    Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
-      stalls from non-dword aligned addresses per normalization unit.
-    Mem Violations: "The total number of out-of-bounds accesses made to the LDS, per\
-      \ normalization unit. This is unused and expected to be zero in most configurations\
-      \ for modern CDNA\u2122 accelerators."
  data source:
  - metric_table:
      id: 1201
@@ -87,7 +42,7 @@ Panel Config:
          avg: AVG((SQ_INSTS_LDS / $denom))
          min: MIN((SQ_INSTS_LDS / $denom))
          max: MAX((SQ_INSTS_LDS / $denom))
-          unit: (Instr  + $normUnit)
+          unit: (Instr + $normUnit)
        Theoretical Bandwidth:
          avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
            / (End_Timestamp - Start_Timestamp)))
@@ -117,29 +72,75 @@ Panel Config:
          avg: AVG((SQ_LDS_IDX_ACTIVE / $denom))
          min: MIN((SQ_LDS_IDX_ACTIVE / $denom))
          max: MAX((SQ_LDS_IDX_ACTIVE / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Atomic Return Cycles:
          avg: AVG((SQ_LDS_ATOMIC_RETURN / $denom))
          min: MIN((SQ_LDS_ATOMIC_RETURN / $denom))
          max: MAX((SQ_LDS_ATOMIC_RETURN / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Bank Conflict:
          avg: AVG((SQ_LDS_BANK_CONFLICT / $denom))
          min: MIN((SQ_LDS_BANK_CONFLICT / $denom))
          max: MAX((SQ_LDS_BANK_CONFLICT / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Addr Conflict:
          avg: AVG((SQ_LDS_ADDR_CONFLICT / $denom))
          min: MIN((SQ_LDS_ADDR_CONFLICT / $denom))
          max: MAX((SQ_LDS_ADDR_CONFLICT / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Unaligned Stall:
          avg: AVG((SQ_LDS_UNALIGNED_STALL / $denom))
          min: MIN((SQ_LDS_UNALIGNED_STALL / $denom))
          max: MAX((SQ_LDS_UNALIGNED_STALL / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Mem Violations:
          avg: AVG((SQ_LDS_MEM_VIOLATIONS / $denom))
          min: MIN((SQ_LDS_MEM_VIOLATIONS / $denom))
          max: MAX((SQ_LDS_MEM_VIOLATIONS / $denom))
          unit: (Accesses + $normUnit)
+  metrics_description:
+    Utilization: Indicates what percent of the kernel's duration the LDS was actively
+      executing instructions (including, but not limited to, load, store, atomic and
+      HIP's __shfl operations). Calculated as the ratio of the total number of cycles
+      LDS was active over the total CU cycles.
+    Access Rate: Indicates the percentage of SIMDs in the VALU actively issuing LDS
+      instructions, averaged over the lifetime of the kernel. Calculated as the ratio
+      of the total number of cycles spent by the scheduler issuing LDS instructions
+      over the total CU cycles.
+    Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
+      could have been loaded from, stored to, or atomically updated in the LDS, expressed
+      as a percentage of the theoretical peak. Does not take into account the execution
+      mask of the wavefront when the instruction was executed.
+    Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
+      loaded from, stored to, or atomically updated in the LDS divided by total duration.
+      Does not take into account the execution mask of the wavefront when the instruction
+      was executed.
+    Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
+      servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing
+      bank conflicts over the number of LDS cycles that would have been required to
+      move the same amount of data in an uncontended access.
+    LDS Instructions: The total number of LDS instructions (including, but not limited
+      to, read/write/atomics and HIP's __shfl instructions) executed per normalization
+      unit.
+    LDS Latency: The average number of round-trip cycles (i.e., from issue to data return
+      / acknowledgment) required for an LDS instruction to complete.
+    Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS scheduler
+      due to bank conflicts (as determined by the conflict resolution hardware) to
+      the base number of cycles that would be spent in the LDS scheduler in a completely
+      uncontended case. This is the unnormalized form of the Bank Conflict Rate.
+    Index Accesses: The total number of cycles spent in the LDS scheduler over all
+      operations per normalization unit.
+    Atomic Return Cycles: The total number of cycles spent on LDS atomics with return
+      per normalization unit.
+    Bank Conflict: The total number of cycles spent in the LDS scheduler due to bank
+      conflicts (as determined by the conflict resolution hardware) per normalization
+      unit.
+    Addr Conflict: The total number of cycles spent in the LDS scheduler due to address
+      conflicts (as determined by the conflict resolution hardware) per normalization
+      unit.
+    Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
+      stalls from non-dword aligned addresses per normalization unit.
+    Mem Violations: |-
+      The total number of out-of-bounds accesses made to the LDS, per normalization
+      unit. This is unused and expected to be zero in most configurations for
+      modern CDNA™ accelerators.
@@ -2,28 +2,6 @@
 Panel Config:
  id: 1300
  title: Instruction Cache
-  metrics_description:
-    Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
-      of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
-      the total L1I cycles.
-    Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
-      loaded line the cache. Calculated as the ratio of the number of L1I requests
-      that hit over the number of all L1I requests.
-    L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\
-      \ L2 cache request bandwidth achieved. Calculated as the ratio of the total\
-      \ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\
-      \ cycles."
-    L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
-      divided by total duration.
-    Req: The total number of requests made to the L1I per normalization-unit
-    Hits: The total number of L1I requests that hit on a previously loaded cache line,
-      per normalization-unit.
-    Misses - Non Duplicated: The total number of L1I requests that missed on a cache
-      line that were not already pending due to another request, per normalization-unit.
-    Misses - Duplicated: The total number of L1I requests that missed on a cache line
-      that were already pending due to another request, per normalization-unit.
-    Instruction Fetch Latency: The average number of cycles spent to fetch instructions
-      to a CU.
  data source:
  - metric_table:
      id: 1301
@@ -62,22 +40,22 @@ Panel Config:
          avg: AVG((SQC_ICACHE_REQ / $denom))
          min: MIN((SQC_ICACHE_REQ / $denom))
          max: MAX((SQC_ICACHE_REQ / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Hits:
          avg: AVG((SQC_ICACHE_HITS / $denom))
          min: MIN((SQC_ICACHE_HITS / $denom))
          max: MAX((SQC_ICACHE_HITS / $denom))
-          unit: (Hits  + $normUnit)
+          unit: (Hits + $normUnit)
        Misses - Non Duplicated:
          avg: AVG((SQC_ICACHE_MISSES / $denom))
          min: MIN((SQC_ICACHE_MISSES / $denom))
          max: MAX((SQC_ICACHE_MISSES / $denom))
-          unit: (Misses  + $normUnit)
+          unit: (Misses + $normUnit)
        Misses - Duplicated:
          avg: AVG((SQC_ICACHE_MISSES_DUPLICATE / $denom))
          min: MIN((SQC_ICACHE_MISSES_DUPLICATE / $denom))
          max: MAX((SQC_ICACHE_MISSES_DUPLICATE / $denom))
-          unit: (Misses  + $normUnit)
+          unit: (Misses + $normUnit)
        Cache Hit Rate:
          avg: AVG(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
            + SQC_ICACHE_MISSES_DUPLICATE)))
@@ -107,3 +85,25 @@ Panel Config:
          min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
          max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
          unit: Gbps
+  metrics_description:
+    Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
+      of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
+      the total L1I cycles.
+    Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
+      loaded line in the cache. Calculated as the ratio of the number of L1I requests
+      that hit over the number of all L1I requests.
+    L1I-L2 Bandwidth Utilization: |-
+      The percent of the peak theoretical L1I → L2 cache request bandwidth
+      achieved. Calculated as the ratio of the total number of requests from the
+      L1I to the L2 cache over the total L1I-L2 interface cycles.
+    L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
+      divided by total duration.
+    Req: The total number of requests made to the L1I per normalization-unit.
+    Hits: The total number of L1I requests that hit on a previously loaded cache line,
+      per normalization-unit.
+    Misses - Non Duplicated: The total number of L1I requests that missed on a cache
+      line that was not already pending due to another request, per normalization-unit.
+    Misses - Duplicated: The total number of L1I requests that missed on a cache line
+      that was already pending due to another request, per normalization-unit.
+    Instruction Fetch Latency: The average number of cycles spent to fetch instructions
+      to a CU.
@@ -2,49 +2,6 @@
 Panel Config:
  id: 1400
  title: Scalar L1 Data Cache
-  metrics_description:
-    Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
-      of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
-      over the total sL1D cycles.
-    Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
-      loaded line the cache. The ratio of the number of sL1D requests that hit over
-      the number of all sL1D requests.
-    sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
-      bandwidth acheived.\ \ Caclulated as total number of bytes read from, written
-      to, or atomically updated\ \ across the sL1D - L2 interface.
-    sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
-      \ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\
-      \ writes and atomics are typically unused on current CDNA accelerators, so in\
-      \ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
-    Req: The total number of requests, of any size or type, made to the sL1D per normalization
-      unit.
-    Hits: The total number of sL1D requests that hit on a previously loaded cache
-      line, per normalization unit.
-    Misses - Non Duplicated: 'The total number of sL1D requests that missed on a cache
-      line that was not already pending due to another request, per normalization
-      unit. '
-    Misses- Duplicated: The total number of sL1D requests that missed on a cache line
-      that was already pending due to another request, per normalization unit.
-    Read Req (Total): The total number of sL1D read requests of any size, per normalization
-      unit.
-    Atomic Req: The total number of atomic requests from sL1D to the L2, per normalization
-      unit. Typically unused on current CDNA accelerators.
-    Read Req (1 DWord): The total number of sL1D read requests made for a single dword
-      of data (4B), per normalization unit.
-    Read Req (2 DWord): The total number of sL1D read requests made for a two dwords
-      of data (8B), per normalization unit.
-    Read Req (4 DWord): The total number of sL1D read requests made for a four dwords
-      of data (16B), per normalization unit.
-    Read Req (8 DWord): The total number of sL1D read requests made for a eight dwords
-      of data (32B), per normalization unit.
-    Read Req (16 DWord): The total number of sL1D read requests made for a sixteen
-      dwords of data (64B), per normalization unit.
-    Read Req: The total number of read requests from sL1D to the L2 per normalization
-      unit.
-    Write Req: The total number of write requests from sL1D to the L2, per normalization
-      unit. Typically unused on current CDNA accelerators.
-    Stall Cycles: "The total number of cycles the sL1D\u2194L2 interface was stalled,\
-      \ per normalization unit."
  data source:
  - metric_table:
      id: 1401
@@ -84,22 +41,22 @@ Panel Config:
          avg: AVG((SQC_DCACHE_REQ / $denom))
          min: MIN((SQC_DCACHE_REQ / $denom))
          max: MAX((SQC_DCACHE_REQ / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Hits:
          avg: AVG((SQC_DCACHE_HITS / $denom))
          min: MIN((SQC_DCACHE_HITS / $denom))
          max: MAX((SQC_DCACHE_HITS / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Misses - Non Duplicated:
          avg: AVG((SQC_DCACHE_MISSES / $denom))
          min: MIN((SQC_DCACHE_MISSES / $denom))
          max: MAX((SQC_DCACHE_MISSES / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Misses- Duplicated:
          avg: AVG((SQC_DCACHE_MISSES_DUPLICATE / $denom))
          min: MIN((SQC_DCACHE_MISSES_DUPLICATE / $denom))
          max: MAX((SQC_DCACHE_MISSES_DUPLICATE / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Cache Hit Rate:
          avg: AVG((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
            + SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
@@ -118,37 +75,37 @@ Panel Config:
            + SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
          max: MAX((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
            + SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Atomic Req:
          avg: AVG((SQC_DCACHE_ATOMIC / $denom))
          min: MIN((SQC_DCACHE_ATOMIC / $denom))
          max: MAX((SQC_DCACHE_ATOMIC / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read Req (1 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_1 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_1 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_1 / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read Req (2 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_2 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_2 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_2 / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read Req (4 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_4 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_4 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_4 / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read Req (8 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_8 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_8 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_8 / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read Req (16 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_16 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_16 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_16 / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
  - metric_table:
      id: 1403
      title: Scalar L1D Cache - L2 Interface
@@ -171,19 +128,65 @@ Panel Config:
          avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
          min: MIN((SQC_TC_DATA_READ_REQ / $denom))
          max: MAX((SQC_TC_DATA_READ_REQ / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Write Req:
          avg: AVG((SQC_TC_DATA_WRITE_REQ / $denom))
          min: MIN((SQC_TC_DATA_WRITE_REQ / $denom))
          max: MAX((SQC_TC_DATA_WRITE_REQ / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Atomic Req:
          avg: AVG((SQC_TC_DATA_ATOMIC_REQ / $denom))
          min: MIN((SQC_TC_DATA_ATOMIC_REQ / $denom))
          max: MAX((SQC_TC_DATA_ATOMIC_REQ / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Stall Cycles:
          avg: AVG((SQC_TC_STALL / $denom))
          min: MIN((SQC_TC_STALL / $denom))
          max: MAX((SQC_TC_STALL / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
+  metrics_description:
+    Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
+      of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
+      over the total sL1D cycles.
+    Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
+      loaded line in the cache. The ratio of the number of sL1D requests that hit over
+      the number of all sL1D requests.
+    sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
+      bandwidth achieved. Calculated as the total number of bytes read from, written to,
+      or atomically updated across the sL1D - L2 interface.
+    sL1D-L2 BW: |-
+      The total number of bytes read from, written to, or atomically updated
+      across the sL1D↔L2 interface, divided by total duration. Note that sL1D
+      writes and atomics are typically unused on current CDNA accelerators, so
+      in the majority of cases this can be interpreted as an sL1D→L2 read
+      bandwidth.
+    Req: The total number of requests, of any size or type, made to the sL1D per normalization
+      unit.
+    Hits: The total number of sL1D requests that hit on a previously loaded cache
+      line, per normalization unit.
+    Misses - Non Duplicated: |-
+      The total number of sL1D requests that missed on a cache line that was
+      not already pending due to another request, per normalization unit.
+    Misses- Duplicated: The total number of sL1D requests that missed on a cache line
+      that was already pending due to another request, per normalization unit.
+    Read Req (Total): The total number of sL1D read requests of any size, per normalization
+      unit.
+    Atomic Req: The total number of atomic requests from sL1D to the L2, per normalization
+      unit. Typically unused on current CDNA accelerators.
+    Read Req (1 DWord): The total number of sL1D read requests made for a single dword
+      of data (4B), per normalization unit.
+    Read Req (2 DWord): The total number of sL1D read requests made for two dwords
+      of data (8B), per normalization unit.
+    Read Req (4 DWord): The total number of sL1D read requests made for four dwords
+      of data (16B), per normalization unit.
+    Read Req (8 DWord): The total number of sL1D read requests made for eight dwords
+      of data (32B), per normalization unit.
+    Read Req (16 DWord): The total number of sL1D read requests made for sixteen
+      dwords of data (64B), per normalization unit.
+    Read Req: The total number of read requests from sL1D to the L2, per normalization
+      unit.
+    Write Req: The total number of write requests from sL1D to the L2, per normalization
+      unit. Typically unused on current CDNA accelerators.
+    Stall Cycles: |-
+      The total number of cycles the sL1D↔L2 interface was stalled, per
+      normalization unit.
@@ -2,70 +2,6 @@
 Panel Config:
  id: 1500
  title: Address Processing Unit and Data Return Path (TA/TD)
-  metrics_description:
-    Address Processing Unit Busy: Percent of the total CU cycles the address processor
-      was busy
-    Address Stall: Percent of the total CU cycles the address processor was stalled
-      from sending address requests further into the vL1D pipeline.
-    Data Stall: Percent of the total CU cycles the address processor was stalled from
-      sending write/atomic data further into the vL1D pipeline.
-    "Data-Processor \u2192 Address Stall": Percent of total CU cycles the address
-      processor was stalled waiting to send command data to the data processor.
-    Total Instructions: The total number of memory instructions executed by the address
-      processer over all compute units on the accelerator, per normalization unit.
-    Global/Generic Instructions: The total number of global & generic memory instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Global/Generic Read Instructions: The total number of global & generic memory
-      read instructions executed on all compute units on the accelerator, per normalization
-      unit.
-    Global/Generic Write Instructions: The total number of global & generic memory
-      write instructions executed on all compute units on the accelerator, per normalization
-      unit.
-    Global/Generic Atomic Instructions: The total number of global & generic memory
-      atomic (with and without return) instructions executed on all compute units
-      on the accelerator, per normalization unit.
-    Spill/Stack Instructions: The total number of spill/stack memory instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Read Instructions: The total number of spill/stack memory read instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Write Instructions: The total number of spill/stack memory write instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Atomic Instructions: The total number of spill/stack memory atomic
-      (with and without return) instructions executed on all compute units on the
-      accelerator, per normalization unit. Typically unused as these memory operations
-      are typically used to implement thread-local storage.
-    Spill/Stack Total Cycles: The number of cycles the address processing unit spent
-      working on spill/stack instructions, per normalization unit.
-    Spill/Stack Coalesced Read: The number of cycles the address processing unit spent
-      working on coalesced spill/stack read instructions, per normalization unit.
-    Spill/Stack Coalesced Write: The number of cycles the address processing unit
-      spent working on coalesced spill/stack write instructions, per normalization
-      unit.
-    Data-Return Busy: Percent of the total CU cycles the data-return unit was busy
-      processing or waiting on data to return to the CU.
-    "Cache RAM \u2192 Data-Return Stall": Percent of the total CU cycles the data-return
-      unit was stalled on data to be returned from the vL1D Cache RAM.
-    "Workgroup manager \u2192 Data-Return Stall": Percent of the total CU cycles the
-      data-return unit was stalled by the workgroup manager due to initialization
-      of registers as a part of launching new workgroups.
-    Coalescable Instructions: The number of instructions submitted to the data-return
-      unit by the address processor that were found to be coalescable, per normalization
-      unit.
-    Read Instructions: The number of read instructions submitted to the data-return
-      unit by the address processor summed over all compute units on the accelerator,
-      per normalization unit. This is expected to be the sum of global/generic and
-      spill/stack reads in the address processor.
-    Write Instructions: The number of store instructions submitted to the data-return
-      unit by the address processor summed over all compute units on the accelerator,
-      per normalization unit. This is expected to be the sum of global/generic and
-      spill/stack stores in the address processor.
-    Atomic Instructions: The number of atomic instructions submitted to the data-return
-      unit by the address processor summed over all compute units on the accelerator,
-      per normalization unit. This is expected to be the sum of global/generic and
-      spill/stack atomics in the address processor.
-    Write Ack Instructions: The total number of write acknowledgements submitted by
-      data-return unit to SQ, summed over all compute units on the accelerator, per
-      normalization unit.
  data source:
  - metric_table:
      id: 1501
@@ -120,47 +56,47 @@ Panel Config:
          avg: AVG((TA_TOTAL_WAVEFRONTS_sum / $denom))
          min: MIN((TA_TOTAL_WAVEFRONTS_sum / $denom))
          max: MAX((TA_TOTAL_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Global/Generic Instructions:
          avg: AVG((TA_FLAT_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Global/Generic Read Instructions:
          avg: AVG((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Global/Generic Write Instructions:
          avg: AVG((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Global/Generic Atomic Instructions:
          avg: AVG((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Spill/Stack Instructions:
          avg: AVG((TA_BUFFER_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Spill/Stack Read Instructions:
          avg: AVG((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Spill/Stack Write Instructions:
          avg: AVG((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Spill/Stack Atomic Instructions:
          avg: AVG((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
  - metric_table:
      id: 1503
      title: Spill and stack metrics
@@ -175,17 +111,17 @@ Panel Config:
          avg: AVG((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
          min: MIN((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
          max: MAX((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Spill/Stack Coalesced Read:
          avg: AVG((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
          min: MIN((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
          max: MAX((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Spill/Stack Coalesced Write:
          avg: AVG((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
          min: MIN((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
          max: MAX((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
  - metric_table:
      id: 1504
      title: Vector L1 data-return path or Texture Data (TD)
@@ -210,7 +146,7 @@ Panel Config:
          avg: AVG((TD_COALESCABLE_WAVEFRONT_sum / $denom))
          min: MIN((TD_COALESCABLE_WAVEFRONT_sum / $denom))
          max: MAX((TD_COALESCABLE_WAVEFRONT_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Read Instructions:
          avg: AVG((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
            / $denom))
@@ -218,14 +154,72 @@ Panel Config:
            / $denom))
          max: MAX((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
            / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Write Instructions:
          avg: AVG((TD_STORE_WAVEFRONT_sum / $denom))
          min: MIN((TD_STORE_WAVEFRONT_sum / $denom))
          max: MAX((TD_STORE_WAVEFRONT_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Atomic Instructions:
          avg: AVG((TD_ATOMIC_WAVEFRONT_sum / $denom))
          min: MIN((TD_ATOMIC_WAVEFRONT_sum / $denom))
          max: MAX((TD_ATOMIC_WAVEFRONT_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
+  metrics_description:
+    Address Processing Unit Busy: Percent of the total CU cycles the address processor
+      was busy.
+    Address Stall: Percent of the total CU cycles the address processor was stalled
+      from sending address requests further into the vL1D pipeline.
+    Data Stall: Percent of the total CU cycles the address processor was stalled from
+      sending write/atomic data further into the vL1D pipeline.
+    "Data-Processor \u2192 Address Stall": Percent of total CU cycles the address
+      processor was stalled waiting to send command data to the data processor.
+    Total Instructions: The total number of memory instructions executed by the address
+      processor over all compute units on the accelerator, per normalization unit.
+    Global/Generic Instructions: The total number of global & generic memory instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Global/Generic Read Instructions: The total number of global & generic memory
+      read instructions executed on all compute units on the accelerator, per normalization
+      unit.
+    Global/Generic Write Instructions: The total number of global & generic memory
+      write instructions executed on all compute units on the accelerator, per normalization
+      unit.
+    Global/Generic Atomic Instructions: The total number of global & generic memory
+      atomic (with and without return) instructions executed on all compute units
+      on the accelerator, per normalization unit.
+    Spill/Stack Instructions: The total number of spill/stack memory instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Read Instructions: The total number of spill/stack memory read instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Write Instructions: The total number of spill/stack memory write instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Atomic Instructions: The total number of spill/stack memory atomic
+      (with and without return) instructions executed on all compute units on the
+      accelerator, per normalization unit. Typically unused, as these memory operations
+      are mainly used to implement thread-local storage.
+    Spill/Stack Total Cycles: The number of cycles the address processing unit spent
+      working on spill/stack instructions, per normalization unit.
+    Spill/Stack Coalesced Read: The number of cycles the address processing unit spent
+      working on coalesced spill/stack read instructions, per normalization unit.
+    Spill/Stack Coalesced Write: The number of cycles the address processing unit
+      spent working on coalesced spill/stack write instructions, per normalization
+      unit.
+    Data-Return Busy: Percent of the total CU cycles the data-return unit was busy
+      processing or waiting on data to return to the CU.
+    "Cache RAM \u2192 Data-Return Stall": Percent of the total CU cycles the data-return
+      unit was stalled on data to be returned from the vL1D Cache RAM.
+    Coalescable Instructions: The number of instructions submitted to the data-return
+      unit by the address processor that were found to be coalescable, per normalization
+      unit.
+    Read Instructions: The number of read instructions submitted to the data-return
+      unit by the address processor summed over all compute units on the accelerator,
+      per normalization unit. This is expected to be the sum of global/generic and
+      spill/stack reads in the address processor.
+    Write Instructions: The number of store instructions submitted to the data-return
+      unit by the address processor summed over all compute units on the accelerator,
+      per normalization unit. This is expected to be the sum of global/generic and
+      spill/stack stores in the address processor.
+    Atomic Instructions: The number of atomic instructions submitted to the data-return
+      unit by the address processor summed over all compute units on the accelerator,
+      per normalization unit. This is expected to be the sum of global/generic and
+      spill/stack atomics in the address processor.
@@ -2,117 +2,6 @@
 Panel Config:
  id: 1600
  title: Vector L1 Data Cache
-  metrics_description:
-    Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
-      cache over the total number of cache line requests to the vL1D Cache RAM.
-    Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
-      of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
-      on the specific accelerator. The number of bytes is calculated as the number
-      of cache lines requested multiplied by the cache line size. This value does
-      not consider partial requests, so for instance, if only a single value is requested
-      in a cache line, the data movement will still be counted as a full cache line.
-    Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
-      The number of cycles where the vL1D Cache RAM is actively processing any request
-      divided by the number of cycles where the vL1D is active.
-    Coalescing: Indicates how well memory instructions were coalesced by the address
-      processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
-      as the average number of thread-requests generated per instruction divided by
-      the ideal number of thread-requests per instruction.
-    Stalled on L2 Data: The ratio of the number of cycles where the vL1D is stalled
-      waiting for requested data to return from the L2 cache divided by the number
-      of cycles where the vL1D is active.
-    Stalled on L2 Req: The ratio of the number of cycles where the vL1D is stalled
-      waiting to issue a request for data to the L2 cache divided by the number of
-      cycles where the vL1D is active.
-    Tag RAM Stall (Read): The ratio of the number of cycles where the vL1D is stalled
-      due to Read requests with conflicting tags being looked up concurrently, divided
-      by the number of cycles where the vL1D is active.
-    Tag RAM Stall (Write): The ratio of the number of cycles where the vL1D is stalled
-      due to Write requests with conflicting tags being looked up concurrently, divided
-      by the number of cycles where the vL1D is active.
-    Tag RAM Stall (Atomic): The ratio of the number of cycles where the vL1D is stalled
-      due to Atomic requests with conflicting tags being looked up concurrently, divided
-      by the number of cycles where the vL1D is active.
-    Total Req: The total number of incoming requests from the address processing unit
-      after coalescing.
-    Read Req: The total number of incoming read requests from the address processing
-      unit after coalescing per normalization unit.
-    Write Req: The total number of incoming write requests from the address processing
-      unit after coalescing per normalization unit.
-    Atomic Req: The total number of incoming atomic requests from the address processing
-      unit after coalescing per normalization unit.
-    Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
-      instructions divided by total duration. The number of bytes is calculated as
-      the number of cache lines requested multiplied by the cache line size.  This
-      value does not consider partial requests, so for instance, if only a single
-      value is requested in a cache line, the data movement will still be counted
-      as a full cache line.
-    Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
-      vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
-    Cache Accesses: The total number of cache line lookups in the vL1D.
-    Cache Hits: The number of cache accesses minus the number of outgoing requests
-      to the L2 cache, that is, the number of cache line requests serviced by the
-      vL1D Cache RAM per normalization unit.
-    Invalidations: The number of times the vL1D was issued a write-back invalidate
-      command during the kernel's execution per normalization unit. This may be triggered
-      by, for instance, the buffer_wbinvl1 instruction.
-    L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
-      of VMEM instructions, divided by total duration. The number of bytes is calculated
-      as the number of cache lines requested multiplied by the cache line size. This
-      value does not consider partial requests, so for instance, if only a single
-      value is requested in a cache line, the data movement will still be counted
-      as a full cache line.
-    L1-L2 Read: The number of read requests for a vL1D cache line that were not satisfied
-      by the vL1D and must be retrieved from the to the L2 Cache per normalization
-      unit.
-    L1-L2 Write: The number of write requests to a vL1D cache line that were sent
-      through the vL1D to the L2 cache, per normalization unit.
-    L1-L2 Atomic: The number of atomic requests that are sent through the vL1D to
-      the L2 cache, per normalization unit. This includes requests for atomics with,
-      and without return.
-    L1 Access Latency: Calculated as the average number of cycles that a vL1D cache
-      line request spent in the vL1D cache pipeline.
-    L1-L2 Read Latency: Calculated as the average number of cycles that the vL1D cache
-      took to issue and receive read requests from the L2 Cache. This number also
-      includes requests for atomics with return values.
-    L1-L2 Write Latency: Calculated as the average number of cycles that the vL1D
-      cache took to issue and receive acknowledgement of a write request to the L2
-      Cache. This number also includes requests for atomics without return values.
-    NC - Read: Total read requests with NC mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    UC - Read: Total read requests with UC mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    CC - Read: Total read requests with CC mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    RW - Read: Total read requests with RW mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    RW - Write: Total write requests with RW mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    NC - Write: Total write requests with NC mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    UC - Write: Total write requests with UC mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    CC - Write: Total write requests with CC mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    NC - Atomic: Total atomic requests with NC mtype from this TCP to all TCCs Sum
-      over TCP instances per normalization unit.
-    UC - Atomic: Total atomic requests with UC mtype from this TCP to all TCCs Sum
-      over TCP instances per normalization unit.
-    CC - Atomic: Total atomic requests with CC mtype from this TCP to all TCCs Sum
-      over TCP instances per normalization unit.
-    RW - Atomic: Total atomic requests with RW mtype from this TCP to all TCCs Sum
-      over TCP instances per normalization unit.
-    Req: The number of translation requests made to the UTCL1 per normalization unit.
-    Hit Ratio: The ratio of the number of translation requests that hit in the UTCL1
-      divided by the total number of translation requests made to the UTCL1.
-    Hits: The number of translation requests that hit in the UTCL1, and could be reused,
-      per normalization unit.
-    Translation Misses: The total number of translation requests that missed in the
-      UTCL1 due to  translation not being present in the cache, per normalization
-      unit.
-    Permission Misses: "The total number of translation requests that missed in the\
-      \ UTCL1 due to a permission error, per normalization unit. This is unused and\
-      \ expected to be zero in most configurations for modern CDNA\u2122 accelerators."
  data source:
  - metric_table:
      id: 1601
@@ -181,17 +70,17 @@ Panel Config:
          avg: AVG((TCP_TOTAL_ACCESSES_sum / $denom))
          min: MIN((TCP_TOTAL_ACCESSES_sum / $denom))
          max: MAX((TCP_TOTAL_ACCESSES_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read Req:
          avg: AVG((TCP_TOTAL_READ_sum / $denom))
          min: MIN((TCP_TOTAL_READ_sum / $denom))
          max: MAX((TCP_TOTAL_READ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Write Req:
          avg: AVG((TCP_TOTAL_WRITE_sum / $denom))
          min: MIN((TCP_TOTAL_WRITE_sum / $denom))
          max: MAX((TCP_TOTAL_WRITE_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Atomic Req:
          avg: AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
            / $denom))
@@ -199,7 +88,7 @@ Panel Config:
            / $denom))
          max: MAX(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
            / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Cache BW:
          avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
          min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
@@ -223,7 +112,7 @@ Panel Config:
          avg: AVG((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
          min: MIN((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
          max: MAX((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Cache Hits:
          avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
@@ -234,7 +123,7 @@ Panel Config:
          max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Invalidations:
          avg: AVG((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
          min: MIN((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
@@ -252,12 +141,12 @@ Panel Config:
          avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_READ_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        L1-L2 Write:
          avg: AVG((TCP_TCC_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_WRITE_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        L1-L2 Atomic:
          avg: AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
            / $denom))
@@ -265,7 +154,7 @@ Panel Config:
            / $denom))
          max: MAX(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
            / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        L1 Access Latency:
          avg: AVG(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
            != 0) else None))
@@ -314,84 +203,84 @@ Panel Config:
          avg: AVG((TCP_TCC_NC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_NC_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_NC_READ_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        UC - Read:
          xfer: Read
          coherency: UC
          avg: AVG((TCP_TCC_UC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_UC_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_UC_READ_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        CC - Read:
          xfer: Read
          coherency: CC
          avg: AVG((TCP_TCC_CC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_CC_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_CC_READ_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        RW - Read:
          xfer: Read
          coherency: RW
          avg: AVG((TCP_TCC_RW_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_RW_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_RW_READ_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        RW - Write:
          xfer: Write
          coherency: RW
          avg: AVG((TCP_TCC_RW_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_RW_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_RW_WRITE_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        NC - Write:
          xfer: Write
          coherency: NC
          avg: AVG((TCP_TCC_NC_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_NC_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_NC_WRITE_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        UC - Write:
          xfer: Write
          coherency: UC
          avg: AVG((TCP_TCC_UC_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_UC_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_UC_WRITE_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        CC - Write:
          xfer: Write
          coherency: CC
          avg: AVG((TCP_TCC_CC_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_CC_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_CC_WRITE_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        NC - Atomic:
          xfer: Atomic
          coherency: NC
          avg: AVG((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
          min: MIN((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
          max: MAX((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        UC - Atomic:
          xfer: Atomic
          coherency: UC
          avg: AVG((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
          min: MIN((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
          max: MAX((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        CC - Atomic:
          xfer: Atomic
          coherency: CC
          avg: AVG((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
          min: MIN((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
          max: MAX((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        RW - Atomic:
          xfer: Atomic
          coherency: RW
          avg: AVG((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
          min: MIN((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
          max: MAX((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
  - metric_table:
      id: 1605
      title: L1 Unified Translation Cache (UTCL1)
@@ -440,3 +329,114 @@ Panel Config:
        max: Max
        units: Unit
      metric: {}
+  metrics_description:
+    Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
+      cache over the total number of cache line requests to the vL1D Cache RAM.
+    Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
+      of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
+      on the specific accelerator. The number of bytes is calculated as the number
+      of cache lines requested multiplied by the cache line size. This value does
+      not consider partial requests, so for instance, if only a single value is requested
+      in a cache line, the data movement will still be counted as a full cache line.
+    Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
+      The number of cycles where the vL1D Cache RAM is actively processing any request
+      divided by the number of cycles where the vL1D is active.
+    Coalescing: Indicates how well memory instructions were coalesced by the address
+      processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
+      as the average number of thread-requests generated per instruction divided by
+      the ideal number of thread-requests per instruction.
+    Stalled on L2 Data: The ratio of the number of cycles where the vL1D is stalled
+      waiting for requested data to return from the L2 cache divided by the number
+      of cycles where the vL1D is active.
+    Stalled on L2 Req: The ratio of the number of cycles where the vL1D is stalled
+      waiting to issue a request for data to the L2 cache divided by the number of
+      cycles where the vL1D is active.
+    Tag RAM Stall (Read): The ratio of the number of cycles where the vL1D is stalled
+      due to Read requests with conflicting tags being looked up concurrently, divided
+      by the number of cycles where the vL1D is active.
+    Tag RAM Stall (Write): The ratio of the number of cycles where the vL1D is stalled
+      due to Write requests with conflicting tags being looked up concurrently, divided
+      by the number of cycles where the vL1D is active.
+    Tag RAM Stall (Atomic): The ratio of the number of cycles where the vL1D is stalled
+      due to Atomic requests with conflicting tags being looked up concurrently, divided
+      by the number of cycles where the vL1D is active.
+    Total Req: The total number of incoming requests from the address processing unit
+      after coalescing.
+    Read Req: The total number of incoming read requests from the address processing
+      unit after coalescing per normalization unit.
+    Write Req: The total number of incoming write requests from the address processing
+      unit after coalescing per normalization unit.
+    Atomic Req: The total number of incoming atomic requests from the address processing
+      unit after coalescing per normalization unit.
+    Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
+      instructions divided by total duration. The number of bytes is calculated as
+      the number of cache lines requested multiplied by the cache line size. This
+      value does not consider partial requests, so for instance, if only a single
+      value is requested in a cache line, the data movement will still be counted
+      as a full cache line.
+    Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
+      vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
+    Cache Accesses: The total number of cache line lookups in the vL1D.
+    Cache Hits: The number of cache accesses minus the number of outgoing requests
+      to the L2 cache, that is, the number of cache line requests serviced by the
+      vL1D Cache RAM per normalization unit.
+    Invalidations: The number of times the vL1D was issued a write-back invalidate
+      command during the kernel's execution per normalization unit. This may be triggered
+      by, for instance, the buffer_wbinvl1 instruction.
+    L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
+      of VMEM instructions, divided by total duration. The number of bytes is calculated
+      as the number of cache lines requested multiplied by the cache line size. This
+      value does not consider partial requests, so for instance, if only a single
+      value is requested in a cache line, the data movement will still be counted
+      as a full cache line.
+    L1-L2 Read: The number of read requests for a vL1D cache line that were not satisfied
+      by the vL1D and must be retrieved from the L2 cache, per normalization
+      unit.
+    L1-L2 Write: The number of write requests to a vL1D cache line that were sent
+      through the vL1D to the L2 cache, per normalization unit.
+    L1-L2 Atomic: The number of atomic requests that are sent through the vL1D to
+      the L2 cache, per normalization unit. This includes requests for atomics with
+      and without return values.
+    L1 Access Latency: Calculated as the average number of cycles that a vL1D cache
+      line request spent in the vL1D cache pipeline.
+    L1-L2 Read Latency: Calculated as the average number of cycles that the vL1D cache
+      took to issue and receive read requests from the L2 Cache. This number also
+      includes requests for atomics with return values.
+    L1-L2 Write Latency: Calculated as the average number of cycles that the vL1D
+      cache took to issue and receive acknowledgement of a write request to the L2
+      Cache. This number also includes requests for atomics without return values.
+    NC - Read: Total read requests with NC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    UC - Read: Total read requests with UC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    CC - Read: Total read requests with CC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    RW - Read: Total read requests with RW mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    RW - Write: Total write requests with RW mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    NC - Write: Total write requests with NC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    UC - Write: Total write requests with UC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    CC - Write: Total write requests with CC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    NC - Atomic: Total atomic requests with NC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    UC - Atomic: Total atomic requests with UC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    CC - Atomic: Total atomic requests with CC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    RW - Atomic: Total atomic requests with RW mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    Req: The number of translation requests made to the UTCL1 per normalization unit.
+    Hit Ratio: The ratio of the number of translation requests that hit in the UTCL1
+      divided by the total number of translation requests made to the UTCL1.
+    Hits: The number of translation requests that hit in the UTCL1, and could be reused,
+      per normalization unit.
+    Translation Misses: The total number of translation requests that missed in the
+      UTCL1 due to translation not being present in the cache, per normalization unit.
+    Permission Misses: |-
+      The total number of translation requests that missed in the UTCL1 due
+      to a permission error, per normalization unit. This is unused and expected
+      to be zero in most configurations for modern CDNA™ accelerators.
@@ -2,6 +2,350 @@
 Panel Config:
  id: 1700
  title: L2 Cache
+  data source:
+  - metric_table:
+      id: 1701
+      title: L2 Speed-of-Light
+      header:
+        metric: Metric
+        value: Avg
+        unit: Unit
+      metric:
+        Utilization:
+          value: AVG(((TCC_BUSY_sum * 100) / (TO_INT($total_l2_chan) * $GRBM_GUI_ACTIVE_PER_XCD)))
+          unit: pct
+        Peak Bandwidth:
+          value: ((100 * AVG(((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))))
+            / ((($max_sclk / 1000) * 64) * TO_INT($total_l2_chan)))
+          unit: pct
+        Hit Rate:
+          value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
+            + TCC_MISS_sum) != 0) else 0))
+          unit: pct
+        L2-Fabric Read BW:
+          value: AVG((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
+            * 64)) / (End_Timestamp - Start_Timestamp)))
+          unit: GB/s
+        L2-Fabric Write and Atomic BW:
+          value: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
+            * 32)) / (End_Timestamp - Start_Timestamp)))
+          unit: GB/s
+        HBM Bandwidth:
+          value: $hbmBandwidth
+          unit: GB/s
+  - metric_table:
+      id: 1702
+      title: L2-Fabric interface metrics
+      header:
+        metric: Metric
+        avg: Avg
+        min: Min
+        max: Max
+        unit: Unit
+      metric:
+        Read BW:
+          avg: AVG((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
+            * 64)) / $denom))
+          min: MIN((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
+            * 64)) / $denom))
+          max: MAX((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
+            * 64)) / $denom))
+          unit: (Bytes + $normUnit)
+        HBM Read Traffic:
+          avg: AVG((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
+            != 0) else None))
+          min: MIN((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
+            != 0) else None))
+          max: MAX((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
+            != 0) else None))
+          unit: pct
+        Remote Read Traffic:
+          avg: AVG((100 * ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum) / TCC_EA0_RDREQ_sum)
+            if (TCC_EA0_RDREQ_sum != 0) else None))
+          min: MIN((100 * ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum) / TCC_EA0_RDREQ_sum)
+            if (TCC_EA0_RDREQ_sum != 0) else None))
+          max: MAX((100 * ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum) / TCC_EA0_RDREQ_sum)
+            if (TCC_EA0_RDREQ_sum != 0) else None))
+          unit: pct
+        Uncached Read Traffic:
+          avg: AVG((100 * (TCC_EA0_RD_UNCACHED_32B_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
+            != 0) else None))
+          min: MIN((100 * (TCC_EA0_RD_UNCACHED_32B_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
+            != 0) else None))
+          max: MAX((100 * (TCC_EA0_RD_UNCACHED_32B_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
+            != 0) else None))
+          unit: pct
+        Write and Atomic BW:
+          avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
+            * 32)) / $denom))
+          min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
+            * 32)) / $denom))
+          max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
+            * 32)) / $denom))
+          unit: (Bytes + $normUnit)
+        HBM Write and Atomic Traffic:
+          avg: AVG((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
+            != 0) else None))
+          min: MIN((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
+            != 0) else None))
+          max: MAX((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
+            != 0) else None))
+          unit: pct
+        Remote Write and Atomic Traffic:
+          avg: AVG((100 * ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum) / TCC_EA0_WRREQ_sum)
+            if (TCC_EA0_WRREQ_sum != 0) else None))
+          min: MIN((100 * ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum) / TCC_EA0_WRREQ_sum)
+            if (TCC_EA0_WRREQ_sum != 0) else None))
+          max: MAX((100 * ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum) / TCC_EA0_WRREQ_sum)
+            if (TCC_EA0_WRREQ_sum != 0) else None))
+          unit: pct
+        Atomic Traffic:
+          avg: AVG((100 * (TCC_EA0_ATOMIC_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
+            != 0) else None))
+          min: MIN((100 * (TCC_EA0_ATOMIC_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
+            != 0) else None))
+          max: MAX((100 * (TCC_EA0_ATOMIC_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
+            != 0) else None))
+          unit: pct
+        Uncached Write and Atomic Traffic:
+          avg: AVG((100 * (TCC_EA0_WR_UNCACHED_32B_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
+            != 0) else None))
+          min: MIN((100 * (TCC_EA0_WR_UNCACHED_32B_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
+            != 0) else None))
+          max: MAX((100 * (TCC_EA0_WR_UNCACHED_32B_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
+            != 0) else None))
+          unit: pct
+        Read Latency:
+          avg: AVG(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
+            != 0) else None))
+          min: MIN(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
+            != 0) else None))
+          max: MAX(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
+            != 0) else None))
+          unit: Cycles
+        Write and Atomic Latency:
+          avg: AVG(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
+            != 0) else None))
+          min: MIN(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
+            != 0) else None))
+          max: MAX(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
+            != 0) else None))
+          unit: Cycles
+        Atomic Latency:
+          avg: AVG(((TCC_EA0_ATOMIC_LEVEL_sum / TCC_EA0_ATOMIC_sum) if (TCC_EA0_ATOMIC_sum
+            != 0) else None))
+          min: MIN(((TCC_EA0_ATOMIC_LEVEL_sum / TCC_EA0_ATOMIC_sum) if (TCC_EA0_ATOMIC_sum
+            != 0) else None))
+          max: MAX(((TCC_EA0_ATOMIC_LEVEL_sum / TCC_EA0_ATOMIC_sum) if (TCC_EA0_ATOMIC_sum
+            != 0) else None))
+          unit: Cycles
+  - metric_table:
+      id: 1703
+      title: L2 Cache Accesses
+      header:
+        metric: Metric
+        avg: Avg
+        min: Min
+        max: Max
+        unit: Unit
+      metric:
+        Bandwidth:
+          avg: AVG((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))
+          min: MIN((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))
+          max: MAX((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))
+          unit: GB/s
+        Req:
+          avg: AVG((TCC_REQ_sum / $denom))
+          min: MIN((TCC_REQ_sum / $denom))
+          max: MAX((TCC_REQ_sum / $denom))
+          unit: (Req + $normUnit)
+        Read Req:
+          avg: AVG((TCC_READ_sum / $denom))
+          min: MIN((TCC_READ_sum / $denom))
+          max: MAX((TCC_READ_sum / $denom))
+          unit: (Req + $normUnit)
+        Write Req:
+          avg: AVG((TCC_WRITE_sum / $denom))
+          min: MIN((TCC_WRITE_sum / $denom))
+          max: MAX((TCC_WRITE_sum / $denom))
+          unit: (Req + $normUnit)
+        Atomic Req:
+          avg: AVG((TCC_ATOMIC_sum / $denom))
+          min: MIN((TCC_ATOMIC_sum / $denom))
+          max: MAX((TCC_ATOMIC_sum / $denom))
+          unit: (Req + $normUnit)
+        Streaming Req:
+          avg: AVG((TCC_STREAMING_REQ_sum / $denom))
+          min: MIN((TCC_STREAMING_REQ_sum / $denom))
+          max: MAX((TCC_STREAMING_REQ_sum / $denom))
+          unit: (Req + $normUnit)
+        Probe Req:
+          avg: AVG((TCC_PROBE_sum / $denom))
+          min: MIN((TCC_PROBE_sum / $denom))
+          max: MAX((TCC_PROBE_sum / $denom))
+          unit: (Req + $normUnit)
+        Cache Hit:
+          avg: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
+            + TCC_MISS_sum) != 0) else None))
+          min: MIN((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
+            + TCC_MISS_sum) != 0) else None))
+          max: MAX((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
+            + TCC_MISS_sum) != 0) else None))
+          unit: pct
+        Hits:
+          avg: AVG((TCC_HIT_sum / $denom))
+          min: MIN((TCC_HIT_sum / $denom))
+          max: MAX((TCC_HIT_sum / $denom))
+          unit: (Hits + $normUnit)
+        Misses:
+          avg: AVG((TCC_MISS_sum / $denom))
+          min: MIN((TCC_MISS_sum / $denom))
+          max: MAX((TCC_MISS_sum / $denom))
+          unit: (Misses + $normUnit)
+        Writeback:
+          avg: AVG((TCC_WRITEBACK_sum / $denom))
+          min: MIN((TCC_WRITEBACK_sum / $denom))
+          max: MAX((TCC_WRITEBACK_sum / $denom))
+          unit: (Cachelines + $normUnit)
+        Writeback (Internal):
+          avg: AVG((TCC_NORMAL_WRITEBACK_sum / $denom))
+          min: MIN((TCC_NORMAL_WRITEBACK_sum / $denom))
+          max: MAX((TCC_NORMAL_WRITEBACK_sum / $denom))
+          unit: (Cachelines + $normUnit)
+        Writeback (vL1D Req):
+          avg: AVG((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
+          min: MIN((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
+          max: MAX((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
+          unit: (Cachelines + $normUnit)
+        Evict (Internal):
+          avg: AVG((TCC_NORMAL_EVICT_sum / $denom))
+          min: MIN((TCC_NORMAL_EVICT_sum / $denom))
+          max: MAX((TCC_NORMAL_EVICT_sum / $denom))
+          unit: (Cachelines + $normUnit)
+        Evict (vL1D Req):
+          avg: AVG((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
+          min: MIN((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
+          max: MAX((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
+          unit: (Cachelines + $normUnit)
+        NC Req:
+          avg: AVG((TCC_NC_REQ_sum / $denom))
+          min: MIN((TCC_NC_REQ_sum / $denom))
+          max: MAX((TCC_NC_REQ_sum / $denom))
+          unit: (Req + $normUnit)
+        UC Req:
+          avg: AVG((TCC_UC_REQ_sum / $denom))
+          min: MIN((TCC_UC_REQ_sum / $denom))
+          max: MAX((TCC_UC_REQ_sum / $denom))
+          unit: (Req + $normUnit)
+        CC Req:
+          avg: AVG((TCC_CC_REQ_sum / $denom))
+          min: MIN((TCC_CC_REQ_sum / $denom))
+          max: MAX((TCC_CC_REQ_sum / $denom))
+          unit: (Req + $normUnit)
+        RW Req:
+          avg: AVG((TCC_RW_REQ_sum / $denom))
+          min: MIN((TCC_RW_REQ_sum / $denom))
+          max: MAX((TCC_RW_REQ_sum / $denom))
+          unit: (Req + $normUnit)
+  - metric_table:
+      id: 1704
+      title: L2 Cache Stalls
+      header:
+        metric: Metric
+        avg: Avg
+        min: Min
+        max: Max
+        unit: Unit
+      metric: {}
+  - metric_table:
+      id: 1705
+      title: L2 - Fabric Interface stalls
+      header:
+        metric: Metric
+        type: Type
+        transaction: Transaction
+        avg: Avg
+        min: Min
+        max: Max
+        unit: Unit
+      style:
+        type: simple_multi_bar
+      metric:
+        Write - Credit Starvation:
+          type: Credit Starvation
+          transaction: Write
+          avg: AVG(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
+            != 0) else None))
+          min: MIN(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
+            != 0) else None))
+          max: MAX(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
+            != 0) else None))
+          unit: pct
+  - metric_table:
+      id: 1706
+      title: L2 - Fabric interface detailed metrics
+      header:
+        metric: Metric
+        avg: Avg
+        min: Min
+        max: Max
+        unit: Unit
+      metric:
+        Read (32B):
+          avg: AVG((TCC_EA0_RDREQ_32B_sum / $denom))
+          min: MIN((TCC_EA0_RDREQ_32B_sum / $denom))
+          max: MAX((TCC_EA0_RDREQ_32B_sum / $denom))
+          unit: (Req + $normUnit)
+        Read (64B):
+          avg: AVG(((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) / $denom))
+          min: MIN(((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) / $denom))
+          max: MAX(((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) / $denom))
+          unit: (Req + $normUnit)
+        Read (Uncached):
+          avg: AVG((TCC_EA0_RD_UNCACHED_32B_sum / $denom))
+          min: MIN((TCC_EA0_RD_UNCACHED_32B_sum / $denom))
+          max: MAX((TCC_EA0_RD_UNCACHED_32B_sum / $denom))
+          unit: (Req + $normUnit)
+        HBM Read:
+          avg: AVG((TCC_EA0_RDREQ_DRAM_sum / $denom))
+          min: MIN((TCC_EA0_RDREQ_DRAM_sum / $denom))
+          max: MAX((TCC_EA0_RDREQ_DRAM_sum / $denom))
+          unit: (Req + $normUnit)
+        Remote Read:
+          avg: AVG((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
+          min: MIN((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
+          max: MAX((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
+          unit: (Req + $normUnit)
+        Write and Atomic (32B):
+          avg: AVG(MAX(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom), 0))
+          min: MIN(MAX(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom), 0))
+          max: MAX(MAX(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom), 0))
+          unit: (Req + $normUnit)
+        Write and Atomic (Uncached):
+          avg: AVG((TCC_EA0_WR_UNCACHED_32B_sum / $denom))
+          min: MIN((TCC_EA0_WR_UNCACHED_32B_sum / $denom))
+          max: MAX((TCC_EA0_WR_UNCACHED_32B_sum / $denom))
+          unit: (Req + $normUnit)
+        Write and Atomic (64B):
+          avg: AVG((TCC_EA0_WRREQ_64B_sum / $denom))
+          min: MIN((TCC_EA0_WRREQ_64B_sum / $denom))
+          max: MAX((TCC_EA0_WRREQ_64B_sum / $denom))
+          unit: (Req + $normUnit)
+        HBM Write and Atomic:
+          avg: AVG((TCC_EA0_WRREQ_DRAM_sum / $denom))
+          min: MIN((TCC_EA0_WRREQ_DRAM_sum / $denom))
+          max: MAX((TCC_EA0_WRREQ_DRAM_sum / $denom))
+          unit: (Req + $normUnit)
+        Remote Write and Atomic:
+          avg: AVG((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
+          min: MIN((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
+          max: MAX((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
+          unit: (Req + $normUnit)
+        Atomic:
+          avg: AVG((TCC_EA0_ATOMIC_sum / $denom))
+          min: MIN((TCC_EA0_ATOMIC_sum / $denom))
+          max: MAX((TCC_EA0_ATOMIC_sum / $denom))
+          unit: (Req + $normUnit)
  metrics_description:
    Utilization: The ratio of the number of cycles an L2 channel was active, summed
      over all L2 channels on the accelerator over the total L2 cycles.
@@ -87,12 +431,6 @@ Panel Config:
      by the cache line size. This value does not consider partial requests, so for
      example, if only a single value is requested in a cache line, the data movement
      will still be counted as a full cache line.
-    Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests,
-      divided by total duration.
-    Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests,
-      divided by total duration.
-    Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests,
-      divided by total duration.
    Req: The total number of incoming requests to the L2 from all clients for all
      request types, per normalization unit.
    Read Req: The total number of read requests to the L2 from all clients.
@@ -149,12 +487,6 @@ Panel Config:
    Remote Read: The total number of L2 requests to Infinity Fabric to read 32B or
      64B of data from any source other than the accelerator's local HBM, per normalization
      unit.
-    Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe
-      traffic, divided by total duration.
-    "Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read
-      requests due to Infinity Fabric traffic, divided by total duration.
-    Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM
-      traffic, divided by total duration.
    Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
      write or atomically update 32B of data to any memory location, per normalization
      unit.
@@ -170,391 +502,9 @@ Panel Config:
    Remote Write and Atomic: The total number of L2 requests to Infinity Fabric to
      write or atomically update 32B or 64B of data in any memory location other than
      the accelerator's local HBM, per normalization unit.
-    Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to
-      PCIe traffic, divided by total duration.
-    "Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write
-      requests due to Infinity Fabric traffic, divided by total duration.
-    Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM
-      traffic, divided by total duration.
-    Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to
-      PCIe traffic, divided by total duration.
-    "Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic
-      requests due to Infinity Fabric traffic, divided by total duration.
-    Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
-      HBM traffic, divided by total duration.
    Atomic: The total number of L2 requests to Infinity Fabric to atomically update
      32B or 64B of data in any memory location, per normalization unit. See Request
      flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
      requests are only considered atomic by Infinity Fabric if they are targeted
      at non-write-cacheable memory, such as fine-grained memory allocations or uncached
      memory allocations on the MI2XX.
-    Read Stall: "The ratio of the total number of cycles the L2-Fabric interface was\
-      \ stalled on a read request to any destination (local HBM, remote PCIe\xAE connected\
-      \ accelerator or CPU, or remote Infinity Fabric connected accelerator or CPU)\
-      \ over the total active L2 cycles."
-    Write Stall: The ratio of the total number of cycles the L2-Fabric interface was
-      stalled on a write or atomic request to any destination (local HBM, remote accelerator
-      or CPU, PCIe connected accelerator or CPU, or remote Infinity Fabric connected
-      accelerator or CPU) over the total active L2 cycles.
-    Read - PCIe Stall: The number of cycles the L2-Fabric interface was stalled on
-      read requests to remote PCIe connected accelerators or CPUs as a percent of
-      the total active L2 cycles.
-    Read - Infinity Fabric Stall: The number of cycles the L2-Fabric interface was
-      stalled on read requests to remote Infinity Fabric connected accelerators or
-      CPUs as a percent of the total active L2 cycles.
-    Read - HBM Stall: The number of cycles the L2-Fabric interface was stalled on
-      read requests to the accelerator's local HBM as a percent of the total active
-      L2 cycles.
-    Write - PCIe Stall: The number of cycles the L2-Fabric interface was stalled on
-      write or atomic requests to remote PCIe connected accelerators or CPUs as a
-      percent of the total active L2 cycles.
-    Write - Infinity Fabric Stall: The number of cycles the L2-Fabric interface was
-      stalled on write or atomic requests to remote Infinity Fabric connected accelerators
-      or CPUs as a percent of the total active L2 cycles.
-    Write - HBM Stall: The number of cycles the L2-Fabric interface was stalled on
-      write or atomic requests to accelerator's local HBM as a percent of the total
-      active L2 cycles.
-  data source:
-  - metric_table:
-      id: 1701
-      title: L2 Speed-of-Light
-      header:
-        metric: Metric
-        value: Avg
-        unit: Unit
-      metric:
-        Utilization:
-          value: AVG(((TCC_BUSY_sum * 100) / (TO_INT($total_l2_chan) * $GRBM_GUI_ACTIVE_PER_XCD)))
-          unit: pct
-        Peak Bandwidth:
-          value: ((100 * AVG(((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))))
-            / ((($max_sclk / 1000) * 64) * TO_INT($total_l2_chan)))
-          unit: pct
-        Hit Rate:
-          value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
-            + TCC_MISS_sum) != 0) else 0))
-          unit: pct
-        L2-Fabric Read BW:
-          value: AVG((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
-            * 64)) / (End_Timestamp - Start_Timestamp)))
-          unit: GB/s
-        L2-Fabric Write and Atomic BW:
-          value: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
-            * 32)) / (End_Timestamp - Start_Timestamp)))
-          unit: GB/s
-        HBM Bandwidth:
-          value: $hbmBandwidth
-          unit: GB/s
-  - metric_table:
-      id: 1702
-      title: L2-Fabric interface metrics
-      header:
-        metric: Metric
-        avg: Avg
-        min: Min
-        max: Max
-        unit: Unit
-      metric:
-        Read BW:
-          avg: AVG((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
-            * 64)) / $denom))
-          min: MIN((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
-            * 64)) / $denom))
-          max: MAX((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
-            * 64)) / $denom))
-          unit: (Bytes  + $normUnit)
-        HBM Read Traffic:
-          avg: AVG((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
-            != 0) else None))
-          min: MIN((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
-            != 0) else None))
-          max: MAX((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
-            != 0) else None))
-          unit: pct
-        Remote Read Traffic:
-          avg: AVG((100 * ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum) / TCC_EA0_RDREQ_sum)
-            if (TCC_EA0_RDREQ_sum != 0) else None))
-          min: MIN((100 * ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum) / TCC_EA0_RDREQ_sum)
-            if (TCC_EA0_RDREQ_sum != 0) else None))
-          max: MAX((100 * ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum) / TCC_EA0_RDREQ_sum)
-            if (TCC_EA0_RDREQ_sum != 0) else None))
-          unit: pct
-        Uncached Read Traffic:
-          avg: AVG((100 * (TCC_EA0_RD_UNCACHED_32B_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
-            != 0) else None))
-          min: MIN((100 * (TCC_EA0_RD_UNCACHED_32B_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
-            != 0) else None))
-          max: MAX((100 * (TCC_EA0_RD_UNCACHED_32B_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
-            != 0) else None))
-          unit: pct
-        Write and Atomic BW:
-          avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
-            * 32)) / $denom))
-          min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
-            * 32)) / $denom))
-          max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
-            * 32)) / $denom))
-          unit: (Bytes  + $normUnit)
-        HBM Write and Atomic Traffic:
-          avg: AVG((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
-            != 0) else None))
-          min: MIN((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
-            != 0) else None))
-          max: MAX((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
-            != 0) else None))
-          unit: pct
-        Remote Write and Atomic Traffic:
-          avg: AVG((100 * ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum) / TCC_EA0_WRREQ_sum)
-            if (TCC_EA0_WRREQ_sum != 0) else None))
-          min: MIN((100 * ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum) / TCC_EA0_WRREQ_sum)
-            if (TCC_EA0_WRREQ_sum != 0) else None))
-          max: MAX((100 * ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum) / TCC_EA0_WRREQ_sum)
-            if (TCC_EA0_WRREQ_sum != 0) else None))
-          unit: pct
-        Atomic Traffic:
-          avg: AVG((100 * (TCC_EA0_ATOMIC_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
-            != 0) else None))
-          min: MIN((100 * (TCC_EA0_ATOMIC_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
-            != 0) else None))
-          max: MAX((100 * (TCC_EA0_ATOMIC_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
-            != 0) else None))
-          unit: pct
-        Uncached Write and Atomic Traffic:
-          avg: AVG((100 * (TCC_EA0_WR_UNCACHED_32B_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
-            != 0) else None))
-          min: MIN((100 * (TCC_EA0_WR_UNCACHED_32B_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
-            != 0) else None))
-          max: MAX((100 * (TCC_EA0_WR_UNCACHED_32B_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
-            != 0) else None))
-          unit: pct
-        Read Latency:
-          avg: AVG(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
-            != 0) else None))
-          min: MIN(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
-            != 0) else None))
-          max: MAX(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
-            != 0) else None))
-          unit: Cycles
-        Write and Atomic Latency:
-          avg: AVG(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
-            != 0) else None))
-          min: MIN(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
-            != 0) else None))
-          max: MAX(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
-            != 0) else None))
-          unit: Cycles
-        Atomic Latency:
-          avg: AVG(((TCC_EA0_ATOMIC_LEVEL_sum / TCC_EA0_ATOMIC_sum) if (TCC_EA0_ATOMIC_sum
-            != 0) else None))
-          min: MIN(((TCC_EA0_ATOMIC_LEVEL_sum / TCC_EA0_ATOMIC_sum) if (TCC_EA0_ATOMIC_sum
-            != 0) else None))
-          max: MAX(((TCC_EA0_ATOMIC_LEVEL_sum / TCC_EA0_ATOMIC_sum) if (TCC_EA0_ATOMIC_sum
-            != 0) else None))
-          unit: Cycles
-  - metric_table:
-      id: 1703
-      title: L2 Cache Accesses
-      header:
-        metric: Metric
-        avg: Avg
-        min: Min
-        max: Max
-        unit: Unit
-      metric:
-        Bandwidth:
-          avg: AVG((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))
-          min: MIN((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))
-          max: MAX((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))
-          unit: Gbps
-        Req:
-          avg: AVG((TCC_REQ_sum / $denom))
-          min: MIN((TCC_REQ_sum / $denom))
-          max: MAX((TCC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
-        Read Req:
-          avg: AVG((TCC_READ_sum / $denom))
-          min: MIN((TCC_READ_sum / $denom))
-          max: MAX((TCC_READ_sum / $denom))
-          unit: (Req  + $normUnit)
-        Write Req:
-          avg: AVG((TCC_WRITE_sum / $denom))
-          min: MIN((TCC_WRITE_sum / $denom))
-          max: MAX((TCC_WRITE_sum / $denom))
-          unit: (Req  + $normUnit)
-        Atomic Req:
-          avg: AVG((TCC_ATOMIC_sum / $denom))
-          min: MIN((TCC_ATOMIC_sum / $denom))
-          max: MAX((TCC_ATOMIC_sum / $denom))
-          unit: (Req  + $normUnit)
-        Streaming Req:
-          avg: AVG((TCC_STREAMING_REQ_sum / $denom))
-          min: MIN((TCC_STREAMING_REQ_sum / $denom))
-          max: MAX((TCC_STREAMING_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
-        Probe Req:
-          avg: AVG((TCC_PROBE_sum / $denom))
-          min: MIN((TCC_PROBE_sum / $denom))
-          max: MAX((TCC_PROBE_sum / $denom))
-          unit: (Req  + $normUnit)
-        Cache Hit:
-          avg: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
-            + TCC_MISS_sum) != 0) else None))
-          min: MIN((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
-            + TCC_MISS_sum) != 0) else None))
-          max: MAX((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
-            + TCC_MISS_sum) != 0) else None))
-          unit: pct
-        Hits:
-          avg: AVG((TCC_HIT_sum / $denom))
-          min: MIN((TCC_HIT_sum / $denom))
-          max: MAX((TCC_HIT_sum / $denom))
-          unit: (Hits  + $normUnit)
-        Misses:
-          avg: AVG((TCC_MISS_sum / $denom))
-          min: MIN((TCC_MISS_sum / $denom))
-          max: MAX((TCC_MISS_sum / $denom))
-          unit: (Misses  + $normUnit)
-        Writeback:
-          avg: AVG((TCC_WRITEBACK_sum / $denom))
-          min: MIN((TCC_WRITEBACK_sum / $denom))
-          max: MAX((TCC_WRITEBACK_sum / $denom))
-          unit: (Cachelines + $normUnit)
-        Writeback (Internal):
-          avg: AVG((TCC_NORMAL_WRITEBACK_sum / $denom))
-          min: MIN((TCC_NORMAL_WRITEBACK_sum / $denom))
-          max: MAX((TCC_NORMAL_WRITEBACK_sum / $denom))
-          unit: (Cachelines + $normUnit)
-        Writeback (vL1D Req):
-          avg: AVG((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
-          min: MIN((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
-          max: MAX((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
-          unit: (Cachelines + $normUnit)
-        Evict (Internal):
-          avg: AVG((TCC_NORMAL_EVICT_sum / $denom))
-          min: MIN((TCC_NORMAL_EVICT_sum / $denom))
-          max: MAX((TCC_NORMAL_EVICT_sum / $denom))
-          unit: (Cachelines + $normUnit)
-        Evict (vL1D Req):
-          avg: AVG((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
-          min: MIN((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
-          max: MAX((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
-          unit: (Cachelines + $normUnit)
-        NC Req:
-          avg: AVG((TCC_NC_REQ_sum / $denom))
-          min: MIN((TCC_NC_REQ_sum / $denom))
-          max: MAX((TCC_NC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
-        UC Req:
-          avg: AVG((TCC_UC_REQ_sum / $denom))
-          min: MIN((TCC_UC_REQ_sum / $denom))
-          max: MAX((TCC_UC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
-        CC Req:
-          avg: AVG((TCC_CC_REQ_sum / $denom))
-          min: MIN((TCC_CC_REQ_sum / $denom))
-          max: MAX((TCC_CC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
-        RW Req:
-          avg: AVG((TCC_RW_REQ_sum / $denom))
-          min: MIN((TCC_RW_REQ_sum / $denom))
-          max: MAX((TCC_RW_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
-  - metric_table:
-      id: 1704
-      title: L2 Cache Stalls
-      header:
-        metric: Metric
-        avg: Avg
-        min: Min
-        max: Max
-        unit: Unit
-      metric: {}
-  - metric_table:
-      id: 1705
-      title: L2 - Fabric Interface stalls
-      header:
-        metric: Metric
-        type: Type
-        transaction: Transaction
-        avg: Avg
-        min: Min
-        max: Max
-        unit: Unit
-      style:
-        type: simple_multi_bar
-      metric:
-        Write - Credit Starvation:
-          type: Credit Starvation
-          transaction: Write
-          avg: AVG(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
-            != 0) else None))
-          min: MIN(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
-            != 0) else None))
-          max: MAX(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
-            != 0) else None))
-          unit: pct
-  - metric_table:
-      id: 1706
-      title: L2 - Fabric interface detailed metrics
-      header:
-        metric: Metric
-        avg: Avg
-        min: Min
-        max: Max
-        unit: Unit
-      metric:
-        Read (32B):
-          avg: AVG((TCC_EA0_RDREQ_32B_sum / $denom))
-          min: MIN((TCC_EA0_RDREQ_32B_sum / $denom))
-          max: MAX((TCC_EA0_RDREQ_32B_sum / $denom))
-          unit: (Req  + $normUnit)
-        Read (64B):
-          avg: AVG(((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) / $denom))
-          min: MIN(((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) / $denom))
-          max: MAX(((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) / $denom))
-          unit: (Req  + $normUnit)
-        Read (Uncached):
-          avg: AVG((TCC_EA0_RD_UNCACHED_32B_sum / $denom))
-          min: MIN((TCC_EA0_RD_UNCACHED_32B_sum / $denom))
-          max: MAX((TCC_EA0_RD_UNCACHED_32B_sum / $denom))
-          unit: (Req  + $normUnit)
-        HBM Read:
-          avg: AVG((TCC_EA0_RDREQ_DRAM_sum / $denom))
-          min: MIN((TCC_EA0_RDREQ_DRAM_sum / $denom))
-          max: MAX((TCC_EA0_RDREQ_DRAM_sum / $denom))
-          unit: (Req  + $normUnit)
-        Remote Read:
-          avg: AVG((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
-          min: MIN((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
-          max: MAX((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
-          unit: (Req  + $normUnit)
-        Write and Atomic (32B):
-          avg: AVG(MAX(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom), 0))
-          min: MIN(MAX(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom), 0))
-          max: MAX(MAX(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom), 0))
-          unit: (Req  + $normUnit)
-        Write and Atomic (Uncached):
-          avg: AVG((TCC_EA0_WR_UNCACHED_32B_sum / $denom))
-          min: MIN((TCC_EA0_WR_UNCACHED_32B_sum / $denom))
-          max: MAX((TCC_EA0_WR_UNCACHED_32B_sum / $denom))
-          unit: (Req  + $normUnit)
-        Write and Atomic (64B):
-          avg: AVG((TCC_EA0_WRREQ_64B_sum / $denom))
-          min: MIN((TCC_EA0_WRREQ_64B_sum / $denom))
-          max: MAX((TCC_EA0_WRREQ_64B_sum / $denom))
-          unit: (Req  + $normUnit)
-        HBM Write and Atomic:
-          avg: AVG((TCC_EA0_WRREQ_DRAM_sum / $denom))
-          min: MIN((TCC_EA0_WRREQ_DRAM_sum / $denom))
-          max: MAX((TCC_EA0_WRREQ_DRAM_sum / $denom))
-          unit: (Req  + $normUnit)
-        Remote Write and Atomic:
-          avg: AVG((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
-          min: MIN((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
-          max: MAX((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
-          unit: (Req  + $normUnit)
-        Atomic:
-          avg: AVG((TCC_EA0_ATOMIC_sum / $denom))
-          min: MIN((TCC_EA0_ATOMIC_sum / $denom))
-          max: MAX((TCC_EA0_ATOMIC_sum / $denom))
-          unit: (Req  + $normUnit)
@@ -2,10 +2,6 @@
 Panel Config:
  id: 1800
  title: L2 Cache (per Channel)
-  metrics_description:
-    L2 Cache Hit Rate: The percent of total number of requests to the L2 from all
-      clients that hit in the cache. As noted in the Speed-of-Light section, this
-      includes hit-on-miss requests.
  data source:
  - metric_table:
      id: 1801
@@ -321,3 +317,7 @@ Panel Config:
          ::_1: $total_l2_chan
      cli_style: simple_box
      tui_style: simple_box
+  metrics_description:
+    L2 Cache Hit Rate: The percentage of all requests to the L2, from all clients,
+      that hit in the cache. As noted in the Speed-of-Light section, this
+      includes hit-on-miss requests.
@@ -2,10 +2,10 @@
 Panel Config:
  id: 2100
  title: PC Sampling
-  metrics_description: {}
  data source:
  - pc_sampling_table:
      id: 2101
      title: PC Sampling
      source: ps_file
      comparable: false
+  metrics_description: {}
@@ -2,7 +2,6 @@
 Panel Config:
  id: 0
  title: Top Stats
-  metrics_description: {}
  data source:
  - raw_csv_table:
      id: 1
@@ -12,3 +11,4 @@ Panel Config:
      id: 2
      title: Dispatch List
      source: pmc_dispatch_info.csv
+  metrics_description: {}
@@ -2,10 +2,10 @@
 Panel Config:
  id: 100
  title: System Info
-  metrics_description: {}
  data source:
  - raw_csv_table:
      id: 101
      title: System Info
      source: sysinfo.csv
      columnwise: true
+  metrics_description: {}
@@ -2,124 +2,6 @@
 Panel Config:
  id: 200
  title: System Speed-of-Light
-  metrics_description:
-    VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
-      This is also presented as a percent of the peak theoretical FLOPs achievable
-      on the specific accelerator. Note: this does not include any floating-point
-      operations from MFMA instructions.'
-    VALU IOPs: 'The total integer operations executed per second on the VALU. This
-      is also presented as a percent of the peak theoretical IOPs achievable on the
-      specific accelerator. Note: this does not include any integer operations from
-      MFMA instructions.'
-    MFMA FLOPs (F8): The total number of 8-bit brain floating point MFMA operations
-      executed per second. This does not include any 16-bit brain floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F8 MFMA operations achievable on the specific accelerator. It is supported on
-      AMD Instinct MI300 series and later only.
-    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
-      executed per second. Note: this does not include any 16-bit brain floating point
-      operations from VALU instructions. This is also presented as a percent of the
-      peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
-      per second. Note: this does not include any 16-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F16 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
-      per second. Note: this does not include any 32-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F32 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
-      per second. Note: this does not include any 64-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F64 MFMA operations achievable on the specific accelerator.'
-    MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
-      per second. Note: this does not include any 8-bit integer operations from VALU
-      instructions. This is also presented as a percent of the peak theoretical INT8
-      MFMA operations achievable on the specific accelerator.'
-    Active CUs: Total number of active compute units (CUs) on the accelerator during
-      the kernel execution.
-    SALU Utilization: Indicates what percent of the kernel's duration the SALU was
-      busy executing instructions. Computed as the ratio of the total number of cycles
-      spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
-    VALU Utilization: Indicates what percent of the kernel's duration the VALU was
-      busy executing instructions. Does not include VMEM operations. Computed as the
-      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
-      over the total CU cycles.
-    MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
-      was busy executing instructions. Computed as the ratio of the total number of
-      cycles the MFMA was busy over the total CU cycles.
-    VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
-      was busy executing instructions, including both global/generic and spill/scratch
-      operations (see the VMEM instruction count metrics) for more detail). Does not
-      include VALU operations. Computed as the ratio of the total number of cycles
-      spent by the scheduler issuing VMEM instructions over the total CU cycles.
-    Branch Utilization: Indicates what percent of the kernel's duration the branch
-      unit was busy executing instructions. Computed as the ratio of the total number
-      of cycles spent by the scheduler issuing branch instructions over the total
-      CU cycles
-    VALU Active Threads: Indicates the average level of divergence within a wavefront
-      over the lifetime of the kernel. The number of work-items that were active in
-      a wavefront during execution of each VALU instruction, time-averaged over all
-      VALU instructions run on all wavefronts in the kernel.
-    IPC: The ratio of the total number of instructions executed on the CU over the
-      total active CU cycles. This is also presented as a percent of the peak theoretical
-      bandwidth achievable on the specific accelerator.
-    Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
-      over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
-      kernels (less than 1ms). This is also presented as a percent of the peak theoretical
-      occupancy achievable on the specific accelerator.'
-    Theoretical LDS Bandwidth: Indicates the maximum amount of bytes that could have
-      been loaded from, stored to, or atomically updated in the LDS per unit time
-      (see LDS Bandwidth example for more detail). This is also presented as a percent
-      of the peak theoretical F64 MFMA operations achievable on the specific accelerator.
-    LDS Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS
-      scheduler due to bank conflicts (as determined by the conflict resolution hardware)
-      to the base number of cycles that would be spent in the LDS scheduler in a completely
-      uncontended case. This is also presented in normalized form (i.e., the Bank
-      Conflict Rate).
-    vL1D Cache Hit Rate: The ratio of the number of vL1D cache line requests that
-      hit in vL1D cache over the total number of cache line requests to the vL1D cache
-      RAM.
-    vL1D Cache BW: The number of bytes looked up in the vL1D cache as a result of
-      VMEM instructions per unit time. The number of bytes is calculated as the number
-      of cache lines requested multiplied by the cache line size. This value does
-      not consider partial requests, so e.g., if only a single value is requested
-      in a cache line, the data movement will still be counted as a full cache line.
-      This is also presented as a percent of the peak theoretical bandwidth achievable
-      on the specific accelerator.
-    L2 Cache Hit Rate: The ratio of the number of L2 cache line requests that hit
-      in the L2 cache over the total number of incoming cache line requests to the
-      L2 cache.
-    L2 Cache BW: The number of bytes looked up in the L2 cache per unit time. The
-      number of bytes is calculated as the number of cache lines requested multiplied
-      by the cache line size. This value does not consider partial requests, so e.g.,
-      if only a single value is requested in a cache line, the data movement will
-      still be counted as a full cache line. This is also presented as a percent of
-      the peak theoretical bandwidth achievable on the specific accelerator.
-    L2-Fabric Read BW: "The number of bytes read by the L2 over the Infinity Fabric\u2122\
-      \ interface per unit time. This is also presented as a percent of the peak theoretical\
-      \ bandwidth achievable on the specific accelerator."
-    L2-Fabric Write BW: The number of bytes sent by the L2 over the Infinity Fabric
-      interface by write and atomic operations per unit time. This is also presented
-      as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
-    L2-Fabric Read Latency: The time-averaged number of cycles read requests spent
-      in Infinity Fabric before data was returned to the L2.
-    L2-Fabric Write Latency: The time-averaged number of cycles write requests spent
-      in Infinity Fabric before a completion acknowledgement was returned to the L2.
-    sL1D Cache Hit Rate: The percent of sL1D requests that hit on a previously loaded
-      line the cache. Calculated as the ratio of the number of sL1D requests that
-      hit over the number of all sL1D requests.
-    sL1D Cache BW: The number of bytes looked up in the sL1D cache per unit time.
-      This is also presented as a percent of the peak theoretical bandwidth achievable
-      on the specific accelerator.
-    L1I Hit Rate: The number of bytes looked up in the L1I cache per unit time. This
-      is also presented as a percent of the peak theoretical bandwidth achievable
-      on the specific accelerator.
-    L1I BW: The percent of L1I requests that hit on a previously loaded line the cache.
-      Calculated as the ratio of the number of L1I requests that hit over the number
-      of all L1I requests.
-    L1I Fetch Latency: The average number of cycles spent to fetch instructions to
-      a CU.
  data source:
  - metric_table:
      id: 201
@@ -335,3 +217,125 @@ Panel Config:
          peak: None
          pop: None
          coll_level: SQ_IFETCH_LEVEL
+  metrics_description:
+    VALU FLOPs: |-
+      The total floating-point operations executed per second on the VALU.
+      This is also presented as a percent of the peak theoretical FLOPs achievable
+      on the specific accelerator. Note: this does not include any floating-point
+      operations from MFMA instructions.
+    VALU IOPs: |-
+      The total integer operations executed per second on the VALU. This is
+      also presented as a percent of the peak theoretical IOPs achievable on the
+      specific accelerator. Note: this does not include any integer operations from
+      MFMA instructions.
+    MFMA FLOPs (BF16): |-
+      The total number of 16-bit brain floating point MFMA operations executed
+      per second. Note: this does not include any 16-bit brain floating point operations
+      from VALU instructions. This is also presented as a percent of the peak theoretical
+      BF16 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F16): |-
+      The total number of 16-bit floating point MFMA operations executed per
+      second. Note: this does not include any 16-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F16 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F32): |-
+      The total number of 32-bit floating point MFMA operations executed per
+      second. Note: this does not include any 32-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F32 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F64): |-
+      The total number of 64-bit floating point MFMA operations executed per
+      second. Note: this does not include any 64-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F64 MFMA operations achievable on the specific accelerator.
+    MFMA IOPs (Int8): |-
+      The total number of 8-bit integer MFMA operations executed per second.
+      Note: this does not include any 8-bit integer operations from VALU instructions.
+      This is also presented as a percent of the peak theoretical INT8 MFMA operations
+      achievable on the specific accelerator.
+    Active CUs: Total number of active compute units (CUs) on the accelerator during
+      the kernel execution.
+    SALU Utilization: Indicates what percent of the kernel's duration the SALU was
+      busy executing instructions. Computed as the ratio of the total number of cycles
+      spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
+    VALU Utilization: Indicates what percent of the kernel's duration the VALU was
+      busy executing instructions. Does not include VMEM operations. Computed as the
+      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
+      over the total CU cycles.
+    MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
+      was busy executing instructions. Computed as the ratio of the total number of
+      cycles the MFMA was busy over the total CU cycles.
+    VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
+      was busy executing instructions, including both global/generic and spill/scratch
+      operations (see the VMEM instruction count metrics for more detail). Does not
+      include VALU operations. Computed as the ratio of the total number of cycles
+      spent by the scheduler issuing VMEM instructions over the total CU cycles.
+    Branch Utilization: Indicates what percent of the kernel's duration the branch
+      unit was busy executing instructions. Computed as the ratio of the total number
+      of cycles spent by the scheduler issuing branch instructions over the total
+      CU cycles.
+    VALU Active Threads: Indicates the average level of divergence within a wavefront
+      over the lifetime of the kernel. The number of work-items that were active in
+      a wavefront during execution of each VALU instruction, time-averaged over all
+      VALU instructions run on all wavefronts in the kernel.
+    IPC: The ratio of the total number of instructions executed on the CU over the
+      total active CU cycles. This is also presented as a percent of the peak theoretical
+      IPC achievable on the specific accelerator.
+    Wavefront Occupancy: |-
+      The time-averaged number of wavefronts resident on the accelerator over
+      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
+      kernels (less than 1 ms). This is also presented as a percent of the peak theoretical
+      occupancy achievable on the specific accelerator.
+    Theoretical LDS Bandwidth: Indicates the maximum amount of bytes that could have
+      been loaded from, stored to, or atomically updated in the LDS per unit time
+      (see LDS Bandwidth example for more detail). This is also presented as a percent
+      of the peak theoretical LDS bandwidth achievable on the specific accelerator.
+    LDS Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS
+      scheduler due to bank conflicts (as determined by the conflict resolution hardware)
+      to the base number of cycles that would be spent in the LDS scheduler in a completely
+      uncontended case. This is also presented in normalized form (i.e., the Bank
+      Conflict Rate).
+    vL1D Cache Hit Rate: The ratio of the number of vL1D cache line requests that
+      hit in vL1D cache over the total number of cache line requests to the vL1D cache
+      RAM.
+    vL1D Cache BW: The number of bytes looked up in the vL1D cache as a result of
+      VMEM instructions per unit time. The number of bytes is calculated as the number
+      of cache lines requested multiplied by the cache line size. This value does
+      not consider partial requests, so e.g., if only a single value is requested
+      in a cache line, the data movement will still be counted as a full cache line.
+      This is also presented as a percent of the peak theoretical bandwidth achievable
+      on the specific accelerator.
+    L2 Cache Hit Rate: The ratio of the number of L2 cache line requests that hit
+      in the L2 cache over the total number of incoming cache line requests to the
+      L2 cache.
+    L2 Cache BW: The number of bytes looked up in the L2 cache per unit time. The
+      number of bytes is calculated as the number of cache lines requested multiplied
+      by the cache line size. This value does not consider partial requests, so e.g.,
+      if only a single value is requested in a cache line, the data movement will
+      still be counted as a full cache line. This is also presented as a percent of
+      the peak theoretical bandwidth achievable on the specific accelerator.
+    L2-Fabric Read BW: |-
+      The number of bytes read by the L2 over the Infinity Fabric™ interface
+      per unit time. This is also presented as a percent of the peak theoretical
+      bandwidth achievable on the specific accelerator.
+    L2-Fabric Write BW: The number of bytes sent by the L2 over the Infinity Fabric
+      interface by write and atomic operations per unit time. This is also presented
+      as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
+    L2-Fabric Read Latency: The time-averaged number of cycles read requests spent
+      in Infinity Fabric before data was returned to the L2.
+    L2-Fabric Write Latency: The time-averaged number of cycles write requests spent
+      in Infinity Fabric before a completion acknowledgement was returned to the L2.
+    sL1D Cache Hit Rate: The percent of sL1D requests that hit on a previously loaded
+      line in the cache. Calculated as the ratio of the number of sL1D requests that
+      hit over the number of all sL1D requests.
+    sL1D Cache BW: The number of bytes looked up in the sL1D cache per unit time.
+      This is also presented as a percent of the peak theoretical bandwidth achievable
+      on the specific accelerator.
+    L1I Hit Rate: The percent of L1I requests that hit on a previously loaded line
+      in the cache. Calculated as the ratio of the number of L1I requests that hit
+      over the number of all L1I requests.
+    L1I BW: The number of bytes looked up in the L1I cache per unit time. This is
+      also presented as a percent of the peak theoretical bandwidth achievable on
+      the specific accelerator.
+    L1I Fetch Latency: The average number of cycles spent to fetch instructions to
+      a CU.
@@ -2,122 +2,6 @@
 Panel Config:
  id: 300
  title: Memory Chart
-  metrics_description:
-    Wavefront Occupancy: Wavefronts per active CU.
-    Wave Life: Average number of cycles executing a wave.
-    SALU: Total Number of SALU (Scalar ALU) instructions issued per normalization
-      unit.
-    SMEM: Total number of SMEM (Scalar Memory Read) instructions issued normalization
-      unit.
-    VALU: The number of VALU (Vector ALU) instructions issued per normalization unit.
-    MFMA: Total number of MFMA (Matrix-Fused-Multiply-Add) instructions issued per
-      normalization unit.
-    VMEM: The number of VMEM (GPU Memory) read instructions issued (including FLAT/scratch
-      memory) per normalization unit.
-    LDS: The total number of LDS instructions (including, but not limited to, read/write/atomics
-      and HIP's __shfl instructions) executed per normalization unit.
-    GWS: Total number of GDS (global data sync) instructions issued per normalization
-      unit.
-    BR: Total number of BRANCH instructions issued per normalization unit.
-    Active CUs: Total number of active compute units (CUs) on the accelerator during
-      the kernel execution.
-    Num CUs: Total number of compute units (CUs) on the accelerator.
-    VGPR: 'The number of architected vector general-purpose registers allocated for
-      the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
-      by the compiler due to allocation granularity.'
-    SGPR: 'The number of scalar general-purpose registers allocated for the kernel,
-      see SALU. Note: this may not exactly match the number of SGPRs requested by
-      the compiler due to allocation granularity.'
-    LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
-      for this kernel. Note: This may also be larger than what was requested at compile
-      time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
-    Scratch Allocation: The number of bytes of scratch memory requested per work-item
-      for this kernel. Scratch memory is used for stack memory on the accelerator,
-      as well as for register spills and restores.
-    Wavefronts: The total number of wavefronts, summed over all workgroups, forming
-      this kernel launch.
-    Workgroups: The total number of workgroups forming this kernel launch.
-    LDS Req: The total number of LDS instructions (including, but not limited to,
-      read/write/atomics and HIP's __shfl instructions) executed per normalization
-      unit.
-    LDS Util: Indicates what percent of the kernel's duration the LDS was actively
-      executing instructions (including, but not limited to, load, store, atomic and
-      HIP's __shfl operations). Calculated as the ratio of the total number of cycles
-      LDS was active over the total CU cycles.
-    LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
-      / acknowledgment) required for an LDS instruction to complete.
-    VL1 Rd: The total number of incoming read requests from the address processing
-      unit after coalescing per normalization unit
-    VL1 Wr: The total number of incoming write requests from the address processing
-      unit after coalescing per normalization unit
-    VL1 Atomic: The total number of incoming atomic requests from the address processing
-      unit after coalescing per normalization unit
-    VL1 Hit: The ratio of the number of vL1D cache line requests that hit in vL1D
-      cache over the total number of cache line requests to the vL1D Cache RAM.
-    VL1 Lat: Calculated as the average number of cycles that a vL1D cache line request
-      spent in the vL1D cache pipeline.
-    VL1 Coalesce: Indicates how well memory instructions were coalesced by the address
-      processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
-      as the average number of thread-requests generated per instruction divided by
-      the ideal number of thread-requests per instruction.
-    VL1 Stall: The ratio of the number of cycles where the vL1D is stalled waiting
-      to issue a request for data to the L2 cache divided by the number of cycles
-      where the vL1D is active.
-    VL1_L2 Rd: The number of read requests for a vL1D cache line that were not satisfied
-      by the vL1D and must be retrieved from the to the L2 Cache per normalization
-      unit.
-    VL1_L2 Wr: The number of write requests to a vL1D cache line that were sent through
-      the vL1D to the L2 cache, per normalization unit.
-    VL1_L2 Atomic: The number of atomic requests that are sent through the vL1D to
-      the L2 cache, per normalization unit. This includes requests for atomics with,
-      and without return.
-    sL1D Rd: The total number of requests, of any size or type, made to the sL1D per
-      normalization unit.
-    sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
-      line, per normalization unit.
-    sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
-      unit.
-    sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
-      unit. Typically unused on current CDNA accelerators.
-    sL1D_L2 Atomic: The total number of atomic requests from sL1D to the L2, per normalization
-      unit. Typically unused on current CDNA accelerators.
-    IL1 Fetch: The total number of requests made to the L1I per normalization-unit.
-    IL1 Hit: The percent of L1I requests that hit on a previously loaded line the
-      cache. Calculated as the ratio of the number of L1I requests that hit over the
-      number of all L1I requests.
-    IL1 Lat: The average number of cycles spent to fetch instructions to a CU.
-    IL1_L2 Rd: The total number of requests across the L1I - L2 interface per normalization-unit.
-    L2 Rd: The total number of read requests to the L2 from all clients.
-    L2 Wr: The total number of write requests to the L2 from all clients.
-    L2 Atomic: The total number of atomic requests (with and without return) to the
-      L2 from all clients.
-    L2 Hit: The ratio of the number of L2 cache line requests that hit in the L2 cache
-      over the total number of incoming cache line requests to the L2 cache.
-    L2 Rd Lat: Calculated as the average number of cycles that the vL1D cache took
-      to issue and receive read requests from the L2 Cache. This number also includes
-      requests for atomics with return values.
-    L2 Wr Lat: Calculated as the average number of cycles that the vL1D cache took
-      to issue and receive acknowledgement of a write request to the L2 Cache. This
-      number also includes requests for atomics without return values.
-    Fabric_L2 Rd: Number of L2 cache - Infinity Fabric read requests (either 32-byte
-      or 64-byte) summed over TCC instances per normalization unit.
-    Fabric_L2 Wr: Number of L2 cache - Infinity Fabric write requests (either 32-byte
-      or 64-byte) summed over TCC instances per normalization unit.
-    Fabric_L2 Atomic: Number of L2 cache - Infinity Fabric write requests (either
-      32-byte or 64-byte) that are actually atomic requests summed over TCC instances
-      per normalization unit.
-    Fabric Rd Lat: The time-averaged number of cycles read requests spent in Infinity
-      Fabric before data was returned to the L2.
-    Fabric Wr Lat: The time-averaged number of cycles write requests spent in Infinity
-      Fabric before a completion acknowledgement was returned to the L2.
-    Fabric Atomic Lat: The time-averaged number of cycles atomic requests spent in
-      Infinity Fabric before a completion acknowledgement (atomic without return value)
-      or data (atomic with return value) was returned to the L2.
-    HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
-      of data from the accelerator's local HBM, per normalization unit.
-    HBM Wr: 'The total number of L2 requests to Infinity Fabric to write or atomically
-      update 32B or 64B of data in the accelerator''s local HBM, per normalization
-      unit. '
  data source:
  - metric_table:
      id: 301
@@ -252,13 +136,13 @@ Panel Config:
          value: ROUND(AVG((TCC_EA_ATOMIC_sum / $denom)), 0)
        Fabric Rd Lat:
          value: ROUND(AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
-            != 0) else  0)), 0)
+            != 0) else 0)), 0)
        Fabric Wr Lat:
          value: ROUND(AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
-            != 0) else  0)), 0)
+            != 0) else 0)), 0)
        Fabric Atomic Lat:
          value: ROUND(AVG(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
-            != 0) else  0)), 0)
+            != 0) else 0)), 0)
        HBM Rd:
          value: ROUND(AVG((TCC_EA_RDREQ_DRAM_sum / $denom)), 0)
        HBM Wr:
@@ -266,3 +150,123 @@ Panel Config:
      comparable: false
      cli_style: mem_chart
      tui_style: mem_chart
+  metrics_description:
+    Wavefront Occupancy: Wavefronts per active CU.
+    Wave Life: Average number of cycles executing a wave.
+    SALU: Total number of SALU (Scalar ALU) instructions issued per normalization
+      unit.
+    SMEM: Total number of SMEM (Scalar Memory Read) instructions issued per normalization
+      unit.
+    VALU: The number of VALU (Vector ALU) instructions issued per normalization unit.
+    MFMA: Total number of MFMA (Matrix-Fused-Multiply-Add) instructions issued per
+      normalization unit.
+    VMEM: The number of VMEM (GPU Memory) read instructions issued (including FLAT/scratch
+      memory) per normalization unit.
+    LDS: The total number of LDS instructions (including, but not limited to, read/write/atomics
+      and HIP's __shfl instructions) executed per normalization unit.
+    GWS: Total number of GDS (global data sync) instructions issued per normalization
+      unit.
+    BR: Total number of BRANCH instructions issued per normalization unit.
+    Active CUs: Total number of active compute units (CUs) on the accelerator during
+      the kernel execution.
+    Num CUs: Total number of compute units (CUs) on the accelerator.
+    VGPR: |-
+      The number of architected vector general-purpose registers allocated
+      for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
+      requested by the compiler due to allocation granularity.
+    SGPR: |-
+      The number of scalar general-purpose registers allocated for the kernel,
+      see SALU. Note: this may not exactly match the number of SGPRs requested by
+      the compiler due to allocation granularity.
+    LDS Allocation: |-
+      The number of bytes of LDS memory (or, shared memory) allocated for
+      this kernel. Note: This may also be larger than what was requested at compile
+      time due to both allocation granularity and dynamic per-dispatch LDS allocations.
+    Scratch Allocation: The number of bytes of scratch memory requested per work-item
+      for this kernel. Scratch memory is used for stack memory on the accelerator,
+      as well as for register spills and restores.
+    Wavefronts: The total number of wavefronts, summed over all workgroups, forming
+      this kernel launch.
+    Workgroups: The total number of workgroups forming this kernel launch.
+    LDS Req: The total number of LDS instructions (including, but not limited to,
+      read/write/atomics and HIP's __shfl instructions) executed per normalization
+      unit.
+    LDS Util: Indicates what percent of the kernel's duration the LDS was actively
+      executing instructions (including, but not limited to, load, store, atomic and
+      HIP's __shfl operations). Calculated as the ratio of the total number of cycles
+      LDS was active over the total CU cycles.
+    LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
+      / acknowledgment) required for an LDS instruction to complete.
+    VL1 Rd: The total number of incoming read requests from the address processing
+      unit after coalescing, per normalization unit.
+    VL1 Wr: The total number of incoming write requests from the address processing
+      unit after coalescing, per normalization unit.
+    VL1 Atomic: The total number of incoming atomic requests from the address processing
+      unit after coalescing, per normalization unit.
+    VL1 Hit: The ratio of the number of vL1D cache line requests that hit in vL1D
+      cache over the total number of cache line requests to the vL1D Cache RAM.
+    VL1 Lat: Calculated as the average number of cycles that a vL1D cache line request
+      spent in the vL1D cache pipeline.
+    VL1 Coalesce: Indicates how well memory instructions were coalesced by the address
+      processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
+      as the average number of thread-requests generated per instruction divided by
+      the ideal number of thread-requests per instruction.
+    VL1 Stall: The ratio of the number of cycles where the vL1D is stalled waiting
+      to issue a request for data to the L2 cache divided by the number of cycles
+      where the vL1D is active.
+    VL1_L2 Rd: The number of read requests for a vL1D cache line that were not satisfied
+      by the vL1D and must be retrieved from the L2 cache, per normalization
+      unit.
+    VL1_L2 Wr: The number of write requests to a vL1D cache line that were sent through
+      the vL1D to the L2 cache, per normalization unit.
+    VL1_L2 Atomic: The number of atomic requests that are sent through the vL1D to
+      the L2 cache, per normalization unit. This includes requests for atomics with
+      and without return.
+    sL1D Rd: The total number of requests, of any size or type, made to the sL1D per
+      normalization unit.
+    sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
+      line, per normalization unit.
+    sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
+      unit.
+    sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
+      unit. Typically unused on current CDNA accelerators.
+    sL1D_L2 Atomic: The total number of atomic requests from sL1D to the L2, per normalization
+      unit. Typically unused on current CDNA accelerators.
+    IL1 Fetch: The total number of requests made to the L1I per normalization-unit.
+    IL1 Hit: The percent of L1I requests that hit on a previously loaded line in the
+      cache. Calculated as the ratio of the number of L1I requests that hit over the
+      number of all L1I requests.
+    IL1 Lat: The average number of cycles spent to fetch instructions to a CU.
+    IL1_L2 Rd: The total number of requests across the L1I - L2 interface per normalization-unit.
+    L2 Rd: The total number of read requests to the L2 from all clients.
+    L2 Wr: The total number of write requests to the L2 from all clients.
+    L2 Atomic: The total number of atomic requests (with and without return) to the
+      L2 from all clients.
+    L2 Hit: The ratio of the number of L2 cache line requests that hit in the L2 cache
+      over the total number of incoming cache line requests to the L2 cache.
+    L2 Rd Lat: Calculated as the average number of cycles that the vL1D cache took
+      to issue and receive read requests from the L2 Cache. This number also includes
+      requests for atomics with return values.
+    L2 Wr Lat: Calculated as the average number of cycles that the vL1D cache took
+      to issue and receive acknowledgement of a write request to the L2 Cache. This
+      number also includes requests for atomics without return values.
+    Fabric_L2 Rd: Number of L2 cache - Infinity Fabric read requests (either 32-byte
+      or 64-byte) summed over TCC instances per normalization unit.
+    Fabric_L2 Wr: Number of L2 cache - Infinity Fabric write requests (either 32-byte
+      or 64-byte) summed over TCC instances per normalization unit.
+    Fabric_L2 Atomic: Number of L2 cache - Infinity Fabric write requests (either
+      32-byte or 64-byte) that are actually atomic requests summed over TCC instances
+      per normalization unit.
+    Fabric Rd Lat: The time-averaged number of cycles read requests spent in Infinity
+      Fabric before data was returned to the L2.
+    Fabric Wr Lat: The time-averaged number of cycles write requests spent in Infinity
+      Fabric before a completion acknowledgement was returned to the L2.
+    Fabric Atomic Lat: The time-averaged number of cycles atomic requests spent in
+      Infinity Fabric before a completion acknowledgement (atomic without return value)
+      or data (atomic with return value) was returned to the L2.
+    HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
+      of data from the accelerator's local HBM, per normalization unit.
+    HBM Wr: |-
+      The total number of L2 requests to Infinity Fabric to write or atomically
+      update 32B or 64B of data in the accelerator's local HBM, per normalization
+      unit.
@@ -2,85 +2,6 @@
 Panel Config:
  id: 400
  title: Roofline
-  metrics_description:
-    VALU FLOPs (F16): 'The total 16-bit floating-point operations executed per second
-      on the VALU. This is presented with the value of the peak empirical F16 FLOPs
-      achievable on the specific accelerator. Note: this does not include any F16
-      operations from MFMA instructions.'
-    VALU FLOPs (F32): 'The total 32-bit floating-point operations executed per second
-      on the VALU. This is presented with the value of the peak empirical F32 FLOPs
-      achievable on the specific accelerator. Note: this does not include any F32
-      operations from MFMA instructions.'
-    VALU FLOPs (F64): 'The total 64-bit floating-point operations executed per second
-      on the VALU. This is presented with the value of the peak empirical F64 FLOPs
-      achievable on the specific accelerator. Note: this does not include any F64
-      operations from MFMA instructions.'
-    MFMA FLOPs (F8): The total number of 8-bit brain floating point MFMA operations
-      executed per second. This does not include any 16-bit brain floating point operations
-      from VALU instructions. The peak empirically measured F8 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison. It is supported
-      on AMD Instinct MI300 series and later only.
-    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
-      executed per second. Note: this does not include any 16-bit brain floating point
-      operations from VALU instructions. The peak empirically measured BF16 MFMA operations
-      achievable on the specific accelerator is displayed alongside for comparison.'
-    MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
-      per second. Note: this does not include any 16-bit floating point operations
-      from VALU instructions. The peak empirically measured F16 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison.'
-    MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
-      per second. Note: this does not include any 32-bit floating point operations
-      from VALU instructions. The peak empirically measured F32 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison.'
-    MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
-      per second. Note: this does not include any 64-bit floating point operations
-      from VALU instructions. The peak empirically measured F64 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison.'
-    MFMA FLOPs (F6F4): 'The total number of 4-bit and 6-bit floating point MFMA operations
-      executed per second. Note: this does not include any floating point operations
-      from VALU instructions. The peak empirically measured F6F4 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison. It is supported
-      on AMD Instinct MI350 series (gfx950) and later only.'
-    MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
-      per second. Note: this does not include any 8-bit integer operations from VALU
-      instructions. The peak empirically measured INT8 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison.'
-    HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth
-      Memory (HBM) per second. The peak empirically measured bandwidth achievable
-      on the specific accelerator is displayed alongside for comparison.
-    L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
-      The number of bytes is calculated as the number of cache lines requested multiplied
-      by the cache line size. This value does not consider partial requests, so e.g.,
-      if only a single value is requested in a cache line, the data movement will
-      still be counted as a full cache line. The peak empirically measured bandwidth
-      achievable on the specific accelerator is displayed alongside for comparison.
-    L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
-      of VMEM instructions per unit time. The number of bytes is calculated as the
-      number of cache lines requested multiplied by the cache line size. This value
-      does not consider partial requests, so e.g., if only a single value is requested
-      in a cache line, the data movement will still be counted as a full cache line.
-      The peak empirically measured bandwidth achievable on the specific accelerator
-      is displayed alongside for comparison.
-    LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded
-      from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
-      example for more detail). The peak empirically measured LDS bandwidth achievable
-      on the specific accelerator is displayed alongside for comparison.
-    AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
-      of total floating-point operations (FLOPs) to total bytes transferred between
-      the L1 cache and the processing units. This value is used as the x-coordinate
-      for the L1 roofline.
-    AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
-      of total floating-point operations (FLOPs) to total bytes transferred between
-      the L2 cache and the L1 cache. This value is used as the x-coordinate for the
-      L2 roofline.
-    AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
-      It is the ratio of total floating-point operations (FLOPs) to total bytes transferred
-      between HBM and the L2 cache. This value is used as the x-coordinate for the
-      HBM roofline.
-    Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs
-      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
-      operations divided by the total execution time. This value is used as the y-coordinate
-      for the kernel's point on the Roofline plot.
  data source:
  - metric_table:
      id: 401
@@ -210,3 +131,86 @@ Panel Config:
            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / (SUM(End_Timestamp - Start_Timestamp)
            / 1e9) ) / 1e9
          unit: GFLOP/s
+  metrics_description:
+    VALU FLOPs (F16): |-
+      The total 16-bit floating-point operations executed per second on the VALU.
+      This is presented with the value of the peak empirical F16 FLOPs achievable
+      on the specific accelerator. Note: this does not include any F16 operations
+      from MFMA instructions.
+    VALU FLOPs (F32): |-
+      The total 32-bit floating-point operations executed per second on the VALU.
+      This is presented with the value of the peak empirical F32 FLOPs achievable
+      on the specific accelerator. Note: this does not include any F32 operations
+      from MFMA instructions.
+    VALU FLOPs (F64): |-
+      The total 64-bit floating-point operations executed per second on the VALU.
+      This is presented with the value of the peak empirical F64 FLOPs achievable
+      on the specific accelerator. Note: this does not include any F64 operations
+      from MFMA instructions.
+    MFMA FLOPs (BF16): |-
+      The total number of 16-bit brain floating point MFMA operations executed
+      per second. Note: this does not include any 16-bit brain floating point
+      operations from VALU instructions. The peak empirically measured BF16 MFMA
+      operations achievable on the specific accelerator is displayed alongside
+      for comparison.
+    MFMA FLOPs (F16): |-
+      The total number of 16-bit floating point MFMA operations executed per
+      second. Note: this does not include any 16-bit floating point operations from
+      VALU instructions. The peak empirically measured F16 MFMA operations
+      achievable on the specific accelerator is displayed alongside for comparison.
+    MFMA FLOPs (F32): |-
+      The total number of 32-bit floating point MFMA operations executed per
+      second. Note: this does not include any 32-bit floating point operations from
+      VALU instructions. The peak empirically measured F32 MFMA operations
+      achievable on the specific accelerator is displayed alongside for comparison.
+    MFMA FLOPs (F64): |-
+      The total number of 64-bit floating point MFMA operations executed per
+      second. Note: this does not include any 64-bit floating point operations from
+      VALU instructions. The peak empirically measured F64 MFMA operations
+      achievable on the specific accelerator is displayed alongside for comparison.
+    MFMA IOPs (Int8): |-
+      The total number of 8-bit integer MFMA operations executed per second.
+      Note: this does not include any 8-bit integer operations from VALU instructions.
+      The peak empirically measured INT8 MFMA operations achievable on the specific
+      accelerator is displayed alongside for comparison.
+    HBM Bandwidth: |-
+      The total number of bytes read from and written to High-Bandwidth
+      Memory (HBM) per second. The peak empirically measured bandwidth achievable
+      on the specific accelerator is displayed alongside for comparison.
+    L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
+      The number of bytes is calculated as the number of cache lines requested multiplied
+      by the cache line size. This value does not consider partial requests, so e.g.,
+      if only a single value is requested in a cache line, the data movement will
+      still be counted as a full cache line. The peak empirically measured bandwidth
+      achievable on the specific accelerator is displayed alongside for comparison.
+    L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
+      of VMEM instructions per unit time. The number of bytes is calculated as the
+      number of cache lines requested multiplied by the cache line size. This value
+      does not consider partial requests, so e.g., if only a single value is requested
+      in a cache line, the data movement will still be counted as a full cache line.
+      The peak empirically measured bandwidth achievable on the specific accelerator
+      is displayed alongside for comparison.
+    LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded
+      from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
+      example for more detail). The peak empirically measured LDS bandwidth achievable
+      on the specific accelerator is displayed alongside for comparison.
+    AI L1: |-
+      The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
+      of total floating-point operations (FLOPs) to total bytes transferred between
+      the L1 cache and the processing units. This value is used as the x-coordinate
+      for the L1 roofline.
+    AI L2: |-
+      The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
+      of total floating-point operations (FLOPs) to total bytes transferred between
+      the L2 cache and the L1 cache. This value is used as the x-coordinate for
+      the L2 roofline.
+    AI HBM: |-
+      The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
+      It is the ratio of total floating-point operations (FLOPs) to total bytes
+      transferred between HBM and the L2 cache. This value is used as the x-coordinate
+      for the HBM roofline.
+    Performance (GFLOPs): |-
+      The overall achieved performance, measured in GigaFLOPs
+      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
+      operations divided by the total execution time. This value is used as the y-coordinate
+      for the kernel's point on the Roofline plot.
@@ -2,30 +2,6 @@
 Panel Config:
  id: 500
  title: Command Processor (CPC/CPF)
-  metrics_description:
-    CPF Utilization: Percent of total cycles where the CPF was busy actively doing
-      any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
-    CPF Stall: Percent of CPF busy cycles where the CPF was stalled for any reason.
-    CPF-L2 Utilization: Percent of total cycles counted by the CPF-L2 interface where
-      the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
-      over total cycles counted by the CPF-L2.
-    CPF-L2 Stall: Percent of CPF-L2 L2 busy cycles where the CPF-L2 interface was
-      stalled for any reason.
-    CPF-UTCL1 Stall: Percent of CPF busy cycles where the CPF was stalled by address
-      translation.
-    CPC Utilization: Percent of total cycles where the CPC was busy actively doing
-      any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
-    CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
-    CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
-      for processing.
-    CPC-Workgroup Manager Utilization: Percent of CPC busy cycles spent dispatching
-      workgroups to the workgroup manager.
-    CPC-L2 Utilization: Percent of total cycles counted by the CPC-L2 interface where
-      the CPC-L2 interface was active doing any work.
-    CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
-      translation
-    CPC-UTCL2 Utilization: 'Percent of total cycles counted by the CPC''s L2 address
-      translation interface where the CPC was busy doing address translation work.  '
  data source:
  - metric_table:
      id: 501
@@ -143,3 +119,28 @@ Panel Config:
          max: MAX((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
            if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
          unit: pct
+  metrics_description:
+    CPF Utilization: Percent of total cycles where the CPF was busy actively doing
+      any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
+    CPF Stall: Percent of CPF busy cycles where the CPF was stalled for any reason.
+    CPF-L2 Utilization: Percent of total cycles counted by the CPF-L2 interface where
+      the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
+      over total cycles counted by the CPF-L2.
+    CPF-L2 Stall: Percent of CPF-L2 busy cycles where the CPF-L2 interface was
+      stalled for any reason.
+    CPF-UTCL1 Stall: Percent of CPF busy cycles where the CPF was stalled by address
+      translation.
+    CPC Utilization: Percent of total cycles where the CPC was busy actively doing
+      any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
+    CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
+    CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
+      for processing.
+    CPC-Workgroup Manager Utilization: Percent of CPC busy cycles spent dispatching
+      workgroups to the workgroup manager.
+    CPC-L2 Utilization: Percent of total cycles counted by the CPC-L2 interface where
+      the CPC-L2 interface was active doing any work.
+    CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
+      translation.
+    CPC-UTCL2 Utilization: |-
+      Percent of total cycles counted by the CPC's L2 address translation
+      interface where the CPC was busy doing address translation work.
@@ -2,61 +2,6 @@
 Panel Config:
  id: 600
  title: Workgroup Manager (SPI)
-  metrics_description:
-    Accelerator Utilization: The percent of cycles in the kernel where the accelerator
-      was actively doing any work.
-    Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
-      kernel where the scheduler-pipes were actively doing any work.
-    Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
-      manager was actively doing any work.
-    Shader Engine Utilization: The percent of total shader engine cycles in the kernel
-      where any CU in a shader-engine was actively doing any work, normalized over
-      all shader-engines. Low values (e.g., << 100%) indicate that the accelerator
-      was not fully saturated by the kernel, or a potential load-imbalance issue.
-    SIMD Utilization: The percent of total SIMD cycles in the kernel where any SIMD
-      on a CU was actively doing any work, summed over all CUs. Low values (less than
-      100%) indicate that the accelerator was not fully saturated by the kernel, or
-      a potential load-imbalance issue.
-    Dispatched Workgroups: The total number of workgroups forming this kernel launch.
-    Dispatched Wavefronts: The total number of wavefronts, summed over all workgroups,
-      forming this kernel launch.
-    VGPR Writes: The average number of cycles spent initializing VGPRs at wave creation.
-    SGPR Writes: The average number of cycles spent initializing SGPRs at wave creation.
-    Not-scheduled Rate (Workgroup Manager): The percent of total scheduler-pipe cycles
-      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
-      within the workgroup manager rather than a lack of a CU or SIMD with sufficient
-      resources.
-    Not-scheduled Rate (Scheduler-Pipe): 'The percent of total scheduler-pipe cycles
-      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
-      within the scheduler-pipes rather than a lack of a CU or SIMD with sufficient
-      resources. '
-    Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
-      where a workgroup could not be scheduled to a CU due to occupancy limitations
-      (like a lack of a CU or SIMD with sufficient resources).
-    Scratch Stall Rate: The percent of total shader-engine cycles in the kernel where
-      a workgroup could not be scheduled to a CU due to lack of private (a.k.a., scratch)
-      memory slots. While this can reach up to 100%, note that the actual occupancy
-      limitations on a kernel using private memory are typically quite small (for
-      example, less than 1% of the total number of waves that can be scheduled to
-      an accelerator).
-    Insufficient SIMD Waveslots: The percent of total SIMD cycles in the kernel where
-      a workgroup could not be scheduled to a SIMD due to lack of available waveslots.
-    Insufficient SIMD VGPRs: The percent of total SIMD cycles in the kernel where
-      a workgroup could not be scheduled to a SIMD due to lack of available VGPRs.
-    Insufficient SIMD SGPRs: The percent of total SIMD cycles in the kernel where
-      a workgroup could not be scheduled to a SIMD due to lack of available SGPRs.
-    Insufficient CU LDS: The percent of total CU cycles in the kernel where a workgroup
-      could not be scheduled to a CU due to lack of available LDS.
-    Insufficient CU Barriers: The percent of total CU cycles in the kernel where a
-      workgroup could not be scheduled to a CU due to lack of available barriers.
-    Reached CU Workgroup Limit: The percent of total CU cycles in the kernel where
-      a workgroup could not be scheduled to a CU due to limits within the workgroup
-      manager. This is expected to be always be zero on CDNA2 or newer accelerators
-      (and small for previous accelerators).
-    Reached CU Wavefront Limit: The percent of total CU cycles in the kernel where
-      a wavefront could not be scheduled to a CU due to limits within the workgroup
-      manager. This is expected to be always be zero on CDNA2 or newer accelerators
-      (and small for previous accelerators).
  data source:
  - metric_table:
      id: 601
@@ -199,3 +144,58 @@ Panel Config:
          min: MIN(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
+  metrics_description:
+    Accelerator Utilization: The percent of cycles in the kernel where the accelerator
+      was actively doing any work.
+    Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
+      kernel where the scheduler-pipes were actively doing any work.
+    Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
+      manager was actively doing any work.
+    Shader Engine Utilization: The percent of total shader engine cycles in the kernel
+      where any CU in a shader-engine was actively doing any work, normalized over
+      all shader-engines. Low values (e.g., << 100%) indicate that the accelerator
+      was not fully saturated by the kernel, or a potential load-imbalance issue.
+    SIMD Utilization: The percent of total SIMD cycles in the kernel where any SIMD
+      on a CU was actively doing any work, summed over all CUs. Low values (less than
+      100%) indicate that the accelerator was not fully saturated by the kernel, or
+      a potential load-imbalance issue.
+    Dispatched Workgroups: The total number of workgroups forming this kernel launch.
+    Dispatched Wavefronts: The total number of wavefronts, summed over all workgroups,
+      forming this kernel launch.
+    VGPR Writes: The average number of cycles spent initializing VGPRs at wave creation.
+    SGPR Writes: The average number of cycles spent initializing SGPRs at wave creation.
+    Not-scheduled Rate (Workgroup Manager): The percent of total scheduler-pipe cycles
+      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
+      within the workgroup manager rather than a lack of a CU or SIMD with sufficient
+      resources.
+    Not-scheduled Rate (Scheduler-Pipe): |-
+      The percent of total scheduler-pipe cycles in the kernel where a workgroup
+      could not be scheduled to a CU due to a bottleneck within the scheduler-pipes
+      rather than a lack of a CU or SIMD with sufficient resources.
+    Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
+      where a workgroup could not be scheduled to a CU due to occupancy limitations
+      (like a lack of a CU or SIMD with sufficient resources).
+    Scratch Stall Rate: The percent of total shader-engine cycles in the kernel where
+      a workgroup could not be scheduled to a CU due to lack of private (a.k.a., scratch)
+      memory slots. While this can reach up to 100%, note that the actual occupancy
+      limitations on a kernel using private memory are typically quite small (for
+      example, less than 1% of the total number of waves that can be scheduled to
+      an accelerator).
+    Insufficient SIMD Waveslots: The percent of total SIMD cycles in the kernel where
+      a workgroup could not be scheduled to a SIMD due to lack of available waveslots.
+    Insufficient SIMD VGPRs: The percent of total SIMD cycles in the kernel where
+      a workgroup could not be scheduled to a SIMD due to lack of available VGPRs.
+    Insufficient SIMD SGPRs: The percent of total SIMD cycles in the kernel where
+      a workgroup could not be scheduled to a SIMD due to lack of available SGPRs.
+    Insufficient CU LDS: The percent of total CU cycles in the kernel where a workgroup
+      could not be scheduled to a CU due to lack of available LDS.
+    Insufficient CU Barriers: The percent of total CU cycles in the kernel where a
+      workgroup could not be scheduled to a CU due to lack of available barriers.
+    Reached CU Workgroup Limit: The percent of total CU cycles in the kernel where
+      a workgroup could not be scheduled to a CU due to limits within the workgroup
+      manager. This is expected to always be zero on CDNA2 or newer accelerators
+      (and small for previous accelerators).
+    Reached CU Wavefront Limit: The percent of total CU cycles in the kernel where
+      a wavefront could not be scheduled to a CU due to limits within the workgroup
+      manager. This is expected to always be zero on CDNA2 or newer accelerators
+      (and small for previous accelerators).
@@ -2,63 +2,6 @@
 Panel Config:
  id: 700
  title: Wavefront
-  metrics_description:
-    Grid Size: The total number of work-items (or, threads) launched as a part of
-      the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied
-      by the total workgroup (or, block) size.
-    Workgroup Size: The total number of work-items (or, threads) in each workgroup
-      (or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
-      to the total block size.
-    Total Wavefronts: "The total number of wavefronts launched as part of the kernel\
-      \ dispatch. On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs,\
-      \ the wavefront size is always 64 work-items. Thus, the total number of wavefronts\
-      \ should be equivalent to the ceiling of grid size divided by 64."
-    Saved Wavefronts: The total number of wavefronts saved at a context-save.
-    Restored Wavefronts: The total number of wavefronts restored from a context-save.
-    VGPRs: 'The number of architected vector general-purpose registers allocated for
-      the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
-      by the compiler due to allocation granularity.'
-    AGPRs: 'The number of accumulation vector general-purpose registers allocated
-      for the kernel, see AGPRs. Note: this may not exactly match the number of AGPRs
-      requested by the compiler due to allocation granularity.'
-    SGPRs: 'The number of scalar general-purpose registers allocated for the kernel,
-      see SALU. Note: this may not exactly match the number of SGPRs requested by
-      the compiler due to allocation granularity.'
-    LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
-      for this kernel. Note: This may also be larger than what was requested at compile
-      time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
-    Scratch Allocation: The number of bytes of scratch memory requested per work-item
-      for this kernel. Scratch memory is used for stack memory on the accelerator,
-      as well as for register spills and restores.
-    Kernel Time: The total duration of the executed kernel.
-    Kernel Time (Cycles): The total duration of the executed kernel in cycles.
-    Instructions per wavefront: The average number of instructions (of all types)
-      executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.
-    Wave Cycles: The number of cycles a wavefront in the kernel dispatch spent resident
-      on a compute unit per normalization unit. This is averaged over all wavefronts
-      in a kernel dispatch.
-    Dependency Wait Cycles: The number of cycles a wavefront in the kernel dispatch
-      spent resident on a compute unit per normalization unit. This is averaged over
-      all wavefronts in a kernel dispatch.
-    Issue Wait Cycles: The number of cycles a wavefront in the kernel dispatch was
-      unable to issue an instruction for any reason (e.g., execution pipe back-pressure,
-      arbitration loss, etc.) per normalization unit. This counter is incremented
-      at every cycle by all wavefronts on a CU unable to issue an instruction. As
-      such, it is most useful to get a sense of how waves were spending their time,
-      rather than identification of a precise limiter because another wave could be
-      actively executing while a wave is issue stalled. The sum of this metric, Dependency
-      Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
-    Active Cycles: The average number of cycles a wavefront in the kernel dispatch
-      was actively executing instructions per normalization unit. This measurement
-      is made on a per-wavefront basis, and may include cycles that another wavefront
-      spent actively executing (on another execution unit, for example) or was stalled.
-      As such, it is most useful to get a sense of how waves were spending their time,
-      rather than identification of a precise limiter. The sum of this metric, Issue
-      Wait Cycles and Active Wait Cycles should be equal to the total Wave Cycles
-      metric.
-    Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
-      over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
-      kernels (less than 1ms).'
  data source:
  - metric_table:
      id: 701
@@ -171,3 +114,66 @@ Panel Config:
          max: MAX((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
          unit: Wavefronts
          coll_level: SQ_LEVEL_WAVES
+  metrics_description:
+    Grid Size: The total number of work-items (or, threads) launched as a part of
+      the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied
+      by the total workgroup (or, block) size.
+    Workgroup Size: The total number of work-items (or, threads) in each workgroup
+      (or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
+      to the total block size.
+    Total Wavefronts: |-
+      The total number of wavefronts launched as part of the kernel dispatch.
+      On AMD Instinct™ CDNA™ accelerators and GCN™ GPUs, the wavefront
+      size is always 64 work-items. Thus, the total number of wavefronts should
+      be equivalent to the ceiling of grid size divided by 64.
+    Saved Wavefronts: The total number of wavefronts saved at a context-save.
+    Restored Wavefronts: The total number of wavefronts restored from a context-save.
+    VGPRs: |-
+      The number of architected vector general-purpose registers allocated
+      for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
+      requested by the compiler due to allocation granularity.
+    AGPRs: |-
+      The number of accumulation vector general-purpose registers allocated
+      for the kernel, see AGPRs. Note: this may not exactly match the number of
+      AGPRs requested by the compiler due to allocation granularity.
+    SGPRs: |-
+      The number of scalar general-purpose registers allocated for the kernel,
+      see SALU. Note: this may not exactly match the number of SGPRs requested by
+      the compiler due to allocation granularity.
+    LDS Allocation: |-
+      The number of bytes of LDS memory (or, shared memory) allocated for
+      this kernel. Note: This may also be larger than what was requested at compile
+      time due to both allocation granularity and dynamic per-dispatch LDS allocations.
+    Scratch Allocation: The number of bytes of scratch memory requested per work-item
+      for this kernel. Scratch memory is used for stack memory on the accelerator,
+      as well as for register spills and restores.
+    Kernel Time: The total duration of the executed kernel.
+    Kernel Time (Cycles): The total duration of the executed kernel in cycles.
+    Instructions per wavefront: The average number of instructions (of all types)
+      executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.
+    Wave Cycles: The number of cycles a wavefront in the kernel dispatch spent resident
+      on a compute unit per normalization unit. This is averaged over all wavefronts
+      in a kernel dispatch.
+    Dependency Wait Cycles: The number of cycles a wavefront in the kernel dispatch
+      stalled waiting on memory of any kind (e.g., instruction fetch, vector or scalar
+      memory) per normalization unit. This is averaged over all wavefronts in a dispatch.
+    Issue Wait Cycles: The number of cycles a wavefront in the kernel dispatch was
+      unable to issue an instruction for any reason (e.g., execution pipe back-pressure,
+      arbitration loss, etc.) per normalization unit. This counter is incremented
+      at every cycle by all wavefronts on a CU unable to issue an instruction. As
+      such, it is most useful to get a sense of how waves were spending their time,
+      rather than identification of a precise limiter because another wave could be
+      actively executing while a wave is issue stalled. The sum of this metric, Dependency
+      Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
+    Active Cycles: The average number of cycles a wavefront in the kernel dispatch
+      was actively executing instructions per normalization unit. This measurement
+      is made on a per-wavefront basis, and may include cycles that another wavefront
+      spent actively executing (on another execution unit, for example) or was stalled.
+      As such, it is most useful to get a sense of how waves were spending their time,
+      rather than identification of a precise limiter. The sum of this metric, Issue
+      Wait Cycles and Dependency Wait Cycles should be equal to the total Wave Cycles
+      metric.
+    Wavefront Occupancy: |-
+      The time-averaged number of wavefronts resident on the accelerator over
+      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
+      kernels (less than 1 ms).
@@ -2,90 +2,6 @@
 Panel Config:
  id: 1000
  title: Compute Units - Instruction Mix
-  metrics_description:
-    VALU: The total number of vector arithmetic logic unit (VALU) operations issued.
-      These are the workhorses of the compute unit, and are used to execute a wide
-      range of instruction types including floating point operations, non-uniform
-      address calculations, transcendental operations, integer operations, shifts,
-      conditional evaluation, etc.
-    VMEM: The total number of vector memory operations issued. These include most
-      loads, stores and atomic operations and all accesses to generic, global, private
-      and texture memory.
-    LDS: The total number of LDS (also known as shared memory) operations issued.
-      These include loads, stores, atomics, and HIP's __shfl operations.
-    MFMA: The total number of matrix fused multiply-add instructions issued.
-    SALU: The total number of scalar arithmetic logic unit (SALU) operations issued.
-      Typically these are used for address calculations, literal constants, and other
-      operations that are provably uniform across a wavefront. Although scalar memory
-      (SMEM) operations are issued by the SALU, they are counted separately in this
-      section.
-    SMEM: The total number of scalar memory (SMEM) operations issued. These are typically
-      used for loading kernel arguments, base-pointers and loads from HIP's __constant__
-      memory.
-    Branch: The total number of branch operations issued. These typically consist
-      of jump or branch operations and are used to implement control flow.
-    INT32: The total number of instructions operating on 32-bit integer operands issued
-      to the VALU per normalization unit.
-    INT64: The total number of instructions operating on 64-bit integer operands issued
-      to the VALU per normalization unit.
-    F16-ADD: The total number of addition instructions operating on 16-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F16-MUL: The total number of multiplication instructions operating on 16-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F16-FMA: The total number of fused multiply-add instructions operating on 16-bit
-      floating-point operands issued to the VALU per normalization unit.
-    F16-Trans: The total number of transcendental instructions (e.g., sqrt) operating
-      on 16-bit floating-point operands issued to the VALU per normalization unit.
-    F32-ADD: The total number of addition instructions operating on 32-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F32-MUL: The total number of multiplication instructions operating on 32-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F32-FMA: The total number of fused multiply-add instructions operating on 32-bit
-      floating-point operands issued to the VALU per normalization unit.
-    F32-Trans: The total number of transcendental instructions (such as sqrt) operating
-      on 32-bit floating-point operands issued to the VALU per normalization unit.
-    F64-ADD: The total number of addition instructions operating on 64-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F64-MUL: The total number of multiplication instructions operating on 64-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F64-FMA: The total number of fused multiply-add instructions operating on 64-bit
-      floating-point operands issued to the VALU per normalization unit.
-    F64-Trans: The total number of transcendental instructions (such as sqrt) operating
-      on 64-bit floating-point operands issued to the VALU per normalization unit.
-    Conversion: "The total number of type conversion instructions (such as converting\
-      \ data to or from F32\u2194F64) issued to the VALU per normalization unit."
-    Global/Generic Instr: The total number of global & generic memory instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Global/Generic Read: The total number of global & generic memory read instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Global/Generic Write: The total number of global & generic memory write instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Global/Generic Atomic: The total number of global & generic memory atomic (with
-      and without return) instructions executed on all compute units on the accelerator,
-      per normalization unit.
-    Spill/Stack Instr: The total number of spill/stack memory instructions executed
-      on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Read: The total number of spill/stack memory read instructions executed
-      on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Write: The total number of spill/stack memory write instructions executed
-      on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Atomic: The total number of spill/stack memory atomic (with and without
-      return) instructions executed on all compute units on the accelerator, per normalization
-      unit. Typically unused as these memory operations are typically used to implement
-      thread-local storage.
-    MFMA-I8: The total number of 8-bit integer MFMA instructions issued per normalization
-      unit.
-    MFMA-F8: The total number of 8-bit floating point MFMA instructions issued per
-      normalization unit. This is supported in AMD Instinct MI300 series and later
-      only.
-    MFMA-F16: The total number of 16-bit floating point MFMA instructions issued per
-      normalization unit.
-    MFMA-BF16: The total number of 16-bit brain floating point MFMA instructions issued
-      per normalization unit.
-    MFMA-F32: The total number of 32-bit floating-point MFMA instructions issued per
-      normalization unit.
-    MFMA-F64: The total number of 64-bit floating-point MFMA instructions issued per
-      normalization unit.
  data source:
  - metric_table:
      id: 1001
@@ -302,3 +218,85 @@ Panel Config:
          min: MIN((SQ_INSTS_VALU_MFMA_F64 / $denom))
          max: MAX((SQ_INSTS_VALU_MFMA_F64 / $denom))
          unit: (instr + $normUnit)
+  metrics_description:
+    VALU: The total number of vector arithmetic logic unit (VALU) operations issued.
+      These are the workhorses of the compute unit, and are used to execute a wide
+      range of instruction types including floating point operations, non-uniform
+      address calculations, transcendental operations, integer operations, shifts,
+      conditional evaluation, etc.
+    VMEM: The total number of vector memory operations issued. These include most
+      loads, stores and atomic operations and all accesses to generic, global, private
+      and texture memory.
+    LDS: The total number of LDS (also known as shared memory) operations issued.
+      These include loads, stores, atomics, and HIP's __shfl operations.
+    MFMA: The total number of matrix fused multiply-add instructions issued.
+    SALU: The total number of scalar arithmetic logic unit (SALU) operations issued.
+      Typically these are used for address calculations, literal constants, and other
+      operations that are provably uniform across a wavefront. Although scalar memory
+      (SMEM) operations are issued by the SALU, they are counted separately in this
+      section.
+    SMEM: The total number of scalar memory (SMEM) operations issued. These are typically
+      used for loading kernel arguments, base-pointers and loads from HIP's __constant__
+      memory.
+    Branch: The total number of branch operations issued. These typically consist
+      of jump or branch operations and are used to implement control flow.
+    INT32: The total number of instructions operating on 32-bit integer operands issued
+      to the VALU per normalization unit.
+    INT64: The total number of instructions operating on 64-bit integer operands issued
+      to the VALU per normalization unit.
+    F16-ADD: The total number of addition instructions operating on 16-bit floating-point
+      operands issued to the VALU per normalization unit.
+    F16-MUL: The total number of multiplication instructions operating on 16-bit floating-point
+      operands issued to the VALU per normalization unit.
+    F16-FMA: The total number of fused multiply-add instructions operating on 16-bit
+      floating-point operands issued to the VALU per normalization unit.
+    F16-Trans: The total number of transcendental instructions (e.g., sqrt) operating
+      on 16-bit floating-point operands issued to the VALU per normalization unit.
+    F32-ADD: The total number of addition instructions operating on 32-bit floating-point
+      operands issued to the VALU per normalization unit.
+    F32-MUL: The total number of multiplication instructions operating on 32-bit floating-point
+      operands issued to the VALU per normalization unit.
+    F32-FMA: The total number of fused multiply-add instructions operating on 32-bit
+      floating-point operands issued to the VALU per normalization unit.
+    F32-Trans: The total number of transcendental instructions (such as sqrt) operating
+      on 32-bit floating-point operands issued to the VALU per normalization unit.
+    F64-ADD: The total number of addition instructions operating on 64-bit floating-point
+      operands issued to the VALU per normalization unit.
+    F64-MUL: The total number of multiplication instructions operating on 64-bit floating-point
+      operands issued to the VALU per normalization unit.
+    F64-FMA: The total number of fused multiply-add instructions operating on 64-bit
+      floating-point operands issued to the VALU per normalization unit.
+    F64-Trans: The total number of transcendental instructions (such as sqrt) operating
+      on 64-bit floating-point operands issued to the VALU per normalization unit.
+    Conversion: |-
+      The total number of type conversion instructions (such as converting
+      data to or from F32↔F64) issued to the VALU per normalization unit.
+    Global/Generic Instr: The total number of global & generic memory instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Global/Generic Read: The total number of global & generic memory read instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Global/Generic Write: The total number of global & generic memory write instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Global/Generic Atomic: The total number of global & generic memory atomic (with
+      and without return) instructions executed on all compute units on the accelerator,
+      per normalization unit.
+    Spill/Stack Instr: The total number of spill/stack memory instructions executed
+      on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Read: The total number of spill/stack memory read instructions executed
+      on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Write: The total number of spill/stack memory write instructions executed
+      on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Atomic: The total number of spill/stack memory atomic (with and without
+      return) instructions executed on all compute units on the accelerator, per normalization
+      unit. Typically unused, as these memory operations mainly implement
+      thread-local storage.
+    MFMA-I8: The total number of 8-bit integer MFMA instructions issued per normalization
+      unit.
+    MFMA-F16: The total number of 16-bit floating point MFMA instructions issued per
+      normalization unit.
+    MFMA-BF16: The total number of 16-bit brain floating point MFMA instructions issued
+      per normalization unit.
+    MFMA-F32: The total number of 32-bit floating-point MFMA instructions issued per
+      normalization unit.
+    MFMA-F64: The total number of 64-bit floating-point MFMA instructions issued per
+      normalization unit.
@@ -2,84 +2,6 @@
 Panel Config:
  id: 1100
  title: Compute Units - Compute Pipeline
-  metrics_description:
-    VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
-      This is also presented as a percent of the peak theoretical FLOPs achievable
-      on the specific accelerator. Note: this does not include any floating-point
-      operations from MFMA instructions.'
-    VALU IOPs: 'The total integer operations executed per second on the VALU. This
-      is also presented as a percent of the peak theoretical IOPs achievable on the
-      specific accelerator. Note: this does not include any integer operations from
-      MFMA instructions.'
-    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
-      executed per second. Note: this does not include any 16-bit brain floating point
-      operations from VALU instructions. This is also presented as a percent of the
-      peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
-      per second. Note: this does not include any 16-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F16 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
-      per second. Note: this does not include any 32-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F32 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
-      per second. Note: this does not include any 64-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F64 MFMA operations achievable on the specific accelerator.'
-    MFMA IOPs (INT8): 'The total number of 8-bit integer MFMA operations executed
-      per second. Note: this does not include any 8-bit integer operations from VALU
-      instructions. This is also presented as a percent of the peak theoretical INT8
-      MFMA operations achievable on the specific accelerator.'
-    IPC: The ratio of the total number of instructions executed on the CU over the
-      total active CU cycles.
-    IPC (Issued): The ratio of the total number of (non-internal) instructions issued
-      over the number of cycles where the scheduler was actively working on issuing
-      instructions.
-    SALU Utilization: Indicates what percent of the kernel's duration the SALU was
-      busy executing instructions. Computed as the ratio of the total number of cycles
-      spent by the scheduler issuing SALU / SMEM instructions over the total CU cycles.
-    VALU Utilization: Indicates what percent of the kernel's duration the VALU was
-      busy executing instructions. Does not include VMEM operations. Computed as the
-      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
-      over the total CU cycles.
-    VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
-      was busy executing instructions, including both global/generic and spill/scratch
-      operations (see the VMEM instruction count metrics for more detail). Does not
-      include VALU operations. Computed as the ratio of the total number of cycles
-      spent by the scheduler issuing VMEM instructions over the total CU cycles.
-    Branch Utilization: Indicates what percent of the kernel's duration the branch
-      unit was busy executing instructions. Computed as the ratio of the total number
-      of cycles spent by the scheduler issuing branch instructions over the total
-      CU cycles.
-    VALU Active Threads: Indicates the average level of divergence within a wavefront
-      over the lifetime of the kernel. The number of work-items that were active in
-      a wavefront during execution of each VALU instruction, time-averaged over all
-      VALU instructions run on all wavefronts in the kernel
-    MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
-      was busy executing instructions. Computed as the ratio of the total number of
-      cycles spent by the MFMA was busy over the total CU cycles.
-    MFMA Instruction Cycles: The average duration of MFMA instructions in this kernel
-      in cycles. Computed as the ratio of the total number of cycles the MFMA unit
-      was busy over the total number of MFMA instructions.
-    VMEM Latency: The average number of round-trip cycles (that is, from issue to
-      data return / acknowledgment) required for a VMEM instruction to complete.
-    SMEM Latency: The average number of round-trip cycles (that is, from issue to
-      data return / acknowledgment) required for a SMEM instruction to complete.
-    FLOPs (Total): The total number of floating-point operations executed on either
-      the VALU or MFMA units, per normalization unit.
-    IOPs (Total): The total number of integer operations executed on either the VALU
-      or MFMA units, per normalization unit.
-    F16 OPs: The total number of 16-bit floating-point operations executed on either
-      the VALU or MFMA units, per normalization unit.
-    BF16 OPs: The total number of 16-bit brain floating-point operations executed
-      on either the VALU or MFMA units, per normalization unit.
-    F32 OPs: The total number of 32-bit floating-point operations executed on either
-      the VALU or MFMA units, per normalization unit.
-    F64 OPs: The total number of 64-bit floating-point operations executed on either
-      the VALU or MFMA units, per normalization unit.
-    INT8 OPs: The total number of 8-bit integer operations executed on either the
-      VALU or MFMA units, per normalization unit.
  data source:
  - metric_table:
      id: 1101
@@ -159,13 +81,13 @@ Panel Config:
          unit: Instr/cycle
        IPC (Issued):
          avg: AVG(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
-            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED  + SQ_INSTS_LDS)
+            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
            / SQ_ACTIVE_INST_ANY))
          min: MIN(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
            / SQ_ACTIVE_INST_ANY))
          max: MAX(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
-            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED  + SQ_INSTS_LDS)
+            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
            / SQ_ACTIVE_INST_ANY))
          unit: Instr/cycle
        SALU Utilization:
@@ -262,7 +184,7 @@ Panel Config:
            * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32)) + (64 * (((SQ_INSTS_VALU_ADD_F64
            + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64) + (SQ_INSTS_VALU_FMA_F64
            * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64)) / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        IOPs (Total):
          avg: AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8
            * 512)) / $denom)
@@ -270,7 +192,7 @@ Panel Config:
            * 512)) / $denom)
          max: MAX(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8
            * 512)) / $denom)
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        F16 OPs:
          avg: AVG(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16))
            + (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512
@@ -281,12 +203,12 @@ Panel Config:
          max: MAX(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16))
            + (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512
            * SQ_INSTS_VALU_MFMA_MOPS_F16)) / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        BF16 OPs:
          avg: AVG(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
          min: MIN(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
          max: MAX(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        F32 OPs:
          avg: AVG((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
            + (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
@@ -297,7 +219,7 @@ Panel Config:
          max: MAX((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
            + (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
            / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        F64 OPs:
          avg: AVG((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
            + (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
@@ -308,9 +230,94 @@ Panel Config:
          max: MAX((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
            + (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
            / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        INT8 OPs:
          avg: AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
          min: MIN(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
          max: MAX(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
+  metrics_description:
+    VALU FLOPs: |-
+      The total floating-point operations executed per second on the VALU.
+      This is also presented as a percent of the peak theoretical FLOPs achievable
+      on the specific accelerator. Note: this does not include any floating-point
+      operations from MFMA instructions.
+    VALU IOPs: |-
+      The total integer operations executed per second on the VALU. This is
+      also presented as a percent of the peak theoretical IOPs achievable on the
+      specific accelerator. Note: this does not include any integer operations from
+      MFMA instructions.
+    MFMA FLOPs (BF16): |-
+      The total number of 16-bit brain floating point MFMA operations executed
+      per second. Note: this does not include any 16-bit brain floating point operations
+      from VALU instructions. This is also presented as a percent of the peak theoretical
+      BF16 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F16): |-
+      The total number of 16-bit floating point MFMA operations executed per
+      second. Note: this does not include any 16-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F16 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F32): |-
+      The total number of 32-bit floating point MFMA operations executed per
+      second. Note: this does not include any 32-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F32 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F64): |-
+      The total number of 64-bit floating point MFMA operations executed per
+      second. Note: this does not include any 64-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F64 MFMA operations achievable on the specific accelerator.
+    MFMA IOPs (INT8): |-
+      The total number of 8-bit integer MFMA operations executed per second.
+      Note: this does not include any 8-bit integer operations from VALU instructions.
+      This is also presented as a percent of the peak theoretical INT8 MFMA operations
+      achievable on the specific accelerator.
+    IPC: The ratio of the total number of instructions executed on the CU over the
+      total active CU cycles.
+    IPC (Issued): The ratio of the total number of (non-internal) instructions issued
+      over the number of cycles where the scheduler was actively working on issuing
+      instructions.
+    SALU Utilization: Indicates what percent of the kernel's duration the SALU was
+      busy executing instructions. Computed as the ratio of the total number of cycles
+      spent by the scheduler issuing SALU / SMEM instructions over the total CU cycles.
+    VALU Utilization: Indicates what percent of the kernel's duration the VALU was
+      busy executing instructions. Does not include VMEM operations. Computed as the
+      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
+      over the total CU cycles.
+    VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
+      was busy executing instructions, including both global/generic and spill/scratch
+      operations (see the VMEM instruction count metrics for more detail). Does not
+      include VALU operations. Computed as the ratio of the total number of cycles
+      spent by the scheduler issuing VMEM instructions over the total CU cycles.
+    Branch Utilization: Indicates what percent of the kernel's duration the branch
+      unit was busy executing instructions. Computed as the ratio of the total number
+      of cycles spent by the scheduler issuing branch instructions over the total
+      CU cycles.
+    VALU Active Threads: Indicates the average level of divergence within a wavefront
+      over the lifetime of the kernel. The number of work-items that were active in
+      a wavefront during execution of each VALU instruction, time-averaged over all
+      VALU instructions run on all wavefronts in the kernel.
+    MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
+      was busy executing instructions. Computed as the ratio of the total number of
+      cycles the MFMA unit was busy over the total CU cycles.
+    MFMA Instruction Cycles: The average duration of MFMA instructions in this kernel
+      in cycles. Computed as the ratio of the total number of cycles the MFMA unit
+      was busy over the total number of MFMA instructions.
+    VMEM Latency: The average number of round-trip cycles (that is, from issue to
+      data return / acknowledgment) required for a VMEM instruction to complete.
+    SMEM Latency: The average number of round-trip cycles (that is, from issue to
+      data return / acknowledgment) required for a SMEM instruction to complete.
+    FLOPs (Total): The total number of floating-point operations executed on either
+      the VALU or MFMA units, per normalization unit.
+    IOPs (Total): The total number of integer operations executed on either the VALU
+      or MFMA units, per normalization unit.
+    F16 OPs: The total number of 16-bit floating-point operations executed on either
+      the VALU or MFMA units, per normalization unit.
+    BF16 OPs: The total number of 16-bit brain floating-point operations executed
+      on either the VALU or MFMA units, per normalization unit.
+    F32 OPs: The total number of 32-bit floating-point operations executed on either
+      the VALU or MFMA units, per normalization unit.
+    F64 OPs: The total number of 64-bit floating-point operations executed on either
+      the VALU or MFMA units, per normalization unit.
+    INT8 OPs: The total number of 8-bit integer operations executed on either the
+      VALU or MFMA units, per normalization unit.
@@ -2,51 +2,6 @@
 Panel Config:
  id: 1200
  title: Local Data Share (LDS)
-  metrics_description:
-    Utilization: Indicates what percent of the kernel's duration the LDS was actively
-      executing instructions (including, but not limited to, load, store, atomic and
-      HIP's __shfl operations). Calculated as the ratio of the total number of cycles
-      LDS was active over the total CU cycles.
-    Access Rate: Indicates the percentage of SIMDs in the VALU actively issuing LDS
-      instructions, averaged over the lifetime of the kernel. Calculated as the ratio
-      of the total number of cycles spent by the scheduler issuing LDS instructions
-      over the total CU cycles.
-    Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
-      could have been loaded from, stored to, or atomically updated in the LDS divided
-      as percentage of theoretical peak. Does not take into account the execution
-      mask of the wavefront when the instruction was executed.
-    Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
-      loaded from, stored to, or atomically updated in the LDS divided by total duration.
-      Does not take into account the execution mask of the wavefront when the instruction
-      was executed.
-    Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
-      servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing
-      bank conflicts over the number of LDS cycles that would have been required to
-      move the same amount of data in an uncontended access.
-    LDS Instructions: The total number of LDS instructions (including, but not limited
-      to, read/write/atomics and HIP's __shfl instructions) executed per normalization
-      unit.
-    LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
-      / acknowledgment) required for an LDS instruction to complete.
-    Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS scheduler
-      due to bank conflicts (as determined by the conflict resolution hardware) to
-      the base number of cycles that would be spent in the LDS scheduler in a completely
-      uncontended case. This is the unnormalized form of the Bank Conflict Rate.
-    Index Accesses: The total number of cycles spent in the LDS scheduler over all
-      operations per normalization unit.
-    Atomic Return Cycles: The total number of cycles spent on LDS atomics with return
-      per normalization unit.
-    Bank Conflict: The total number of cycles spent in the LDS scheduler due to bank
-      conflicts (as determined by the conflict resolution hardware) per normalization
-      unit.
-    Addr Conflict: The total number of cycles spent in the LDS scheduler due to address
-      conflicts (as determined by the conflict resolution hardware) per normalization
-      unit.
-    Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
-      stalls from non-dword aligned addresses per normalization unit.
-    Mem Violations: "The total number of out-of-bounds accesses made to the LDS, per\
-      \ normalization unit. This is unused and expected to be zero in most configurations\
-      \ for modern CDNA\u2122 accelerators."
  data source:
  - metric_table:
      id: 1201
@@ -87,7 +42,7 @@ Panel Config:
          avg: AVG((SQ_INSTS_LDS / $denom))
          min: MIN((SQ_INSTS_LDS / $denom))
          max: MAX((SQ_INSTS_LDS / $denom))
-          unit: (Instr  + $normUnit)
+          unit: (Instr + $normUnit)
        Theoretical Bandwidth:
          avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
            / (End_Timestamp - Start_Timestamp)))
@@ -117,29 +72,75 @@ Panel Config:
          avg: AVG((SQ_LDS_IDX_ACTIVE / $denom))
          min: MIN((SQ_LDS_IDX_ACTIVE / $denom))
          max: MAX((SQ_LDS_IDX_ACTIVE / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Atomic Return Cycles:
          avg: AVG((SQ_LDS_ATOMIC_RETURN / $denom))
          min: MIN((SQ_LDS_ATOMIC_RETURN / $denom))
          max: MAX((SQ_LDS_ATOMIC_RETURN / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Bank Conflict:
          avg: AVG((SQ_LDS_BANK_CONFLICT / $denom))
          min: MIN((SQ_LDS_BANK_CONFLICT / $denom))
          max: MAX((SQ_LDS_BANK_CONFLICT / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Addr Conflict:
          avg: AVG((SQ_LDS_ADDR_CONFLICT / $denom))
          min: MIN((SQ_LDS_ADDR_CONFLICT / $denom))
          max: MAX((SQ_LDS_ADDR_CONFLICT / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Unaligned Stall:
          avg: AVG((SQ_LDS_UNALIGNED_STALL / $denom))
          min: MIN((SQ_LDS_UNALIGNED_STALL / $denom))
          max: MAX((SQ_LDS_UNALIGNED_STALL / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Mem Violations:
          avg: AVG((SQ_LDS_MEM_VIOLATIONS / $denom))
          min: MIN((SQ_LDS_MEM_VIOLATIONS / $denom))
          max: MAX((SQ_LDS_MEM_VIOLATIONS / $denom))
          unit: (Accesses + $normUnit)
+  metrics_description:
+    Utilization: Indicates what percent of the kernel's duration the LDS was actively
+      executing instructions (including, but not limited to, load, store, atomic and
+      HIP's __shfl operations). Calculated as the ratio of the total number of cycles
+      LDS was active over the total CU cycles.
+    Access Rate: Indicates the percentage of SIMDs in the VALU actively issuing LDS
+      instructions, averaged over the lifetime of the kernel. Calculated as the ratio
+      of the total number of cycles spent by the scheduler issuing LDS instructions
+      over the total CU cycles.
+    Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
+      could have been loaded from, stored to, or atomically updated in the LDS, expressed
+      as a percentage of the theoretical peak. Does not take into account the execution
+      mask of the wavefront when the instruction was executed.
+    Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
+      loaded from, stored to, or atomically updated in the LDS divided by total duration.
+      Does not take into account the execution mask of the wavefront when the instruction
+      was executed.
+    Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
+      servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing
+      bank conflicts over the number of LDS cycles that would have been required to
+      move the same amount of data in an uncontended access.
+    LDS Instructions: The total number of LDS instructions (including, but not limited
+      to, read/write/atomics and HIP's __shfl instructions) executed per normalization
+      unit.
+    LDS Latency: The average number of round-trip cycles (i.e., from issue to data return /
+      acknowledgment) required for an LDS instruction to complete.
+    Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS scheduler
+      due to bank conflicts (as determined by the conflict resolution hardware) to
+      the base number of cycles that would be spent in the LDS scheduler in a completely
+      uncontended case. This is the unnormalized form of the Bank Conflict Rate.
+    Index Accesses: The total number of cycles spent in the LDS scheduler over all
+      operations per normalization unit.
+    Atomic Return Cycles: The total number of cycles spent on LDS atomics with return
+      per normalization unit.
+    Bank Conflict: The total number of cycles spent in the LDS scheduler due to bank
+      conflicts (as determined by the conflict resolution hardware) per normalization
+      unit.
+    Addr Conflict: The total number of cycles spent in the LDS scheduler due to address
+      conflicts (as determined by the conflict resolution hardware) per normalization
+      unit.
+    Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
+      stalls from non-dword aligned addresses per normalization unit.
+    Mem Violations: |-
+      The total number of out-of-bounds accesses made to the LDS, per normalization
+      unit. This is unused and expected to be zero in most configurations for
+      modern CDNA™ accelerators.
@@ -2,28 +2,6 @@
 Panel Config:
  id: 1300
  title: Instruction Cache
-  metrics_description:
-    Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
-      of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
-      the total L1I cycles.
-    Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
-      loaded line the cache. Calculated as the ratio of the number of L1I requests
-      that hit over the number of all L1I requests.
-    L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\
-      \ L2 cache request bandwidth achieved. Calculated as the ratio of the total\
-      \ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\
-      \ cycles."
-    L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
-      divided by total duration.
-    Req: The total number of requests made to the L1I per normalization-unit
-    Hits: The total number of L1I requests that hit on a previously loaded cache line,
-      per normalization-unit.
-    Misses - Non Duplicated: The total number of L1I requests that missed on a cache
-      line that were not already pending due to another request, per normalization-unit.
-    Misses - Duplicated: The total number of L1I requests that missed on a cache line
-      that were already pending due to another request, per normalization-unit.
-    Instruction Fetch Latency: The average number of cycles spent to fetch instructions
-      to a CU.
  data source:
  - metric_table:
      id: 1301
@@ -62,22 +40,22 @@ Panel Config:
          avg: AVG((SQC_ICACHE_REQ / $denom))
          min: MIN((SQC_ICACHE_REQ / $denom))
          max: MAX((SQC_ICACHE_REQ / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Hits:
          avg: AVG((SQC_ICACHE_HITS / $denom))
          min: MIN((SQC_ICACHE_HITS / $denom))
          max: MAX((SQC_ICACHE_HITS / $denom))
-          unit: (Hits  + $normUnit)
+          unit: (Hits + $normUnit)
        Misses - Non Duplicated:
          avg: AVG((SQC_ICACHE_MISSES / $denom))
          min: MIN((SQC_ICACHE_MISSES / $denom))
          max: MAX((SQC_ICACHE_MISSES / $denom))
-          unit: (Misses  + $normUnit)
+          unit: (Misses + $normUnit)
        Misses - Duplicated:
          avg: AVG((SQC_ICACHE_MISSES_DUPLICATE / $denom))
          min: MIN((SQC_ICACHE_MISSES_DUPLICATE / $denom))
          max: MAX((SQC_ICACHE_MISSES_DUPLICATE / $denom))
-          unit: (Misses  + $normUnit)
+          unit: (Misses + $normUnit)
        Cache Hit Rate:
          avg: AVG(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
            + SQC_ICACHE_MISSES_DUPLICATE)))
@@ -107,3 +85,25 @@ Panel Config:
          min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
          max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
          unit: Gbps
+  metrics_description:
+    Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
+      of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
+      the total L1I cycles.
+    Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
+      loaded line in the cache. Calculated as the ratio of the number of L1I requests
+      that hit over the number of all L1I requests.
+    L1I-L2 Bandwidth Utilization: |-
+      The percent of the peak theoretical L1I → L2 cache request bandwidth
+      achieved. Calculated as the ratio of the total number of requests from the
+      L1I to the L2 cache over the total L1I-L2 interface cycles.
+    L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
+      divided by total duration.
+    Req: The total number of requests made to the L1I per normalization-unit.
+    Hits: The total number of L1I requests that hit on a previously loaded cache line,
+      per normalization-unit.
+    Misses - Non Duplicated: The total number of L1I requests that missed on a cache
+      line that was not already pending due to another request, per normalization-unit.
+    Misses - Duplicated: The total number of L1I requests that missed on a cache line
+      that was already pending due to another request, per normalization-unit.
+    Instruction Fetch Latency: The average number of cycles spent to fetch instructions
+      to a CU.
@@ -2,49 +2,6 @@
 Panel Config:
  id: 1400
  title: Scalar L1 Data Cache
-  metrics_description:
-    Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
-      of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
-      over the total sL1D cycles.
-    Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
-      loaded line the cache. The ratio of the number of sL1D requests that hit over
-      the number of all sL1D requests.
-    sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
-      bandwidth acheived.\ \ Caclulated as total number of bytes read from, written
-      to, or atomically updated\ \ across the sL1D - L2 interface.
-    sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
-      \ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\
-      \ writes and atomics are typically unused on current CDNA accelerators, so in\
-      \ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
-    Req: The total number of requests, of any size or type, made to the sL1D per normalization
-      unit.
-    Hits: The total number of sL1D requests that hit on a previously loaded cache
-      line, per normalization unit.
-    Misses - Non Duplicated: 'The total number of sL1D requests that missed on a cache
-      line that was not already pending due to another request, per normalization
-      unit. '
-    Misses- Duplicated: The total number of sL1D requests that missed on a cache line
-      that was already pending due to another request, per normalization unit.
-    Read Req (Total): The total number of sL1D read requests of any size, per normalization
-      unit.
-    Atomic Req: The total number of atomic requests from sL1D to the L2, per normalization
-      unit. Typically unused on current CDNA accelerators.
-    Read Req (1 DWord): The total number of sL1D read requests made for a single dword
-      of data (4B), per normalization unit.
-    Read Req (2 DWord): The total number of sL1D read requests made for a two dwords
-      of data (8B), per normalization unit.
-    Read Req (4 DWord): The total number of sL1D read requests made for a four dwords
-      of data (16B), per normalization unit.
-    Read Req (8 DWord): The total number of sL1D read requests made for a eight dwords
-      of data (32B), per normalization unit.
-    Read Req (16 DWord): The total number of sL1D read requests made for a sixteen
-      dwords of data (64B), per normalization unit.
-    Read Req: The total number of read requests from sL1D to the L2 per normalization
-      unit.
-    Write Req: The total number of write requests from sL1D to the L2, per normalization
-      unit. Typically unused on current CDNA accelerators.
-    Stall Cycles: "The total number of cycles the sL1D\u2194L2 interface was stalled,\
-      \ per normalization unit."
  data source:
  - metric_table:
      id: 1401
@@ -84,22 +41,22 @@ Panel Config:
          avg: AVG((SQC_DCACHE_REQ / $denom))
          min: MIN((SQC_DCACHE_REQ / $denom))
          max: MAX((SQC_DCACHE_REQ / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Hits:
          avg: AVG((SQC_DCACHE_HITS / $denom))
          min: MIN((SQC_DCACHE_HITS / $denom))
          max: MAX((SQC_DCACHE_HITS / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Misses - Non Duplicated:
          avg: AVG((SQC_DCACHE_MISSES / $denom))
          min: MIN((SQC_DCACHE_MISSES / $denom))
          max: MAX((SQC_DCACHE_MISSES / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Misses- Duplicated:
          avg: AVG((SQC_DCACHE_MISSES_DUPLICATE / $denom))
          min: MIN((SQC_DCACHE_MISSES_DUPLICATE / $denom))
          max: MAX((SQC_DCACHE_MISSES_DUPLICATE / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Cache Hit Rate:
          avg: AVG((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
            + SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
@@ -118,37 +75,37 @@ Panel Config:
            + SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
          max: MAX((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
            + SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Atomic Req:
          avg: AVG((SQC_DCACHE_ATOMIC / $denom))
          min: MIN((SQC_DCACHE_ATOMIC / $denom))
          max: MAX((SQC_DCACHE_ATOMIC / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read Req (1 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_1 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_1 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_1 / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read Req (2 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_2 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_2 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_2 / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read Req (4 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_4 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_4 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_4 / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read Req (8 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_8 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_8 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_8 / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read Req (16 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_16 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_16 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_16 / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
  - metric_table:
      id: 1403
      title: Scalar L1D Cache - L2 Interface
@@ -171,19 +128,65 @@ Panel Config:
          avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
          min: MIN((SQC_TC_DATA_READ_REQ / $denom))
          max: MAX((SQC_TC_DATA_READ_REQ / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Write Req:
          avg: AVG((SQC_TC_DATA_WRITE_REQ / $denom))
          min: MIN((SQC_TC_DATA_WRITE_REQ / $denom))
          max: MAX((SQC_TC_DATA_WRITE_REQ / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Atomic Req:
          avg: AVG((SQC_TC_DATA_ATOMIC_REQ / $denom))
          min: MIN((SQC_TC_DATA_ATOMIC_REQ / $denom))
          max: MAX((SQC_TC_DATA_ATOMIC_REQ / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Stall Cycles:
          avg: AVG((SQC_TC_STALL / $denom))
          min: MIN((SQC_TC_STALL / $denom))
          max: MAX((SQC_TC_STALL / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
+  metrics_description:
+    Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
+      of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
+      over the total sL1D cycles.
+    Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
+      loaded line in the cache. The ratio of the number of sL1D requests that hit over
+      the number of all sL1D requests.
+    sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
+      bandwidth achieved. Calculated as total number of bytes read from, written to,
+      or atomically updated across the sL1D - L2 interface.
+    sL1D-L2 BW: |-
+      The total number of bytes read from, written to, or atomically updated
+      across the sL1D↔L2 interface, divided by total duration. Note that sL1D
+      writes and atomics are typically unused on current CDNA accelerators, so
+      in the majority of cases this can be interpreted as an sL1D→L2 read
+      bandwidth.
+    Req: The total number of requests, of any size or type, made to the sL1D per normalization
+      unit.
+    Hits: The total number of sL1D requests that hit on a previously loaded cache
+      line, per normalization unit.
+    Misses - Non Duplicated: |-
+      The total number of sL1D requests that missed on a cache line that was
+      not already pending due to another request, per normalization unit.
+    Misses- Duplicated: The total number of sL1D requests that missed on a cache line
+      that was already pending due to another request, per normalization unit.
+    Read Req (Total): The total number of sL1D read requests of any size, per normalization
+      unit.
+    Atomic Req: The total number of atomic requests from sL1D to the L2, per normalization
+      unit. Typically unused on current CDNA accelerators.
+    Read Req (1 DWord): The total number of sL1D read requests made for a single dword
+      of data (4B), per normalization unit.
+    Read Req (2 DWord): The total number of sL1D read requests made for two dwords
+      of data (8B), per normalization unit.
+    Read Req (4 DWord): The total number of sL1D read requests made for four dwords
+      of data (16B), per normalization unit.
+    Read Req (8 DWord): The total number of sL1D read requests made for eight dwords
+      of data (32B), per normalization unit.
+    Read Req (16 DWord): The total number of sL1D read requests made for sixteen
+      dwords of data (64B), per normalization unit.
+    Read Req: The total number of read requests from sL1D to the L2 per normalization
+      unit.
+    Write Req: The total number of write requests from sL1D to the L2, per normalization
+      unit. Typically unused on current CDNA accelerators.
+    Stall Cycles: |-
+      The total number of cycles the sL1D↔L2 interface was stalled, per
+      normalization unit.
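The Cache Hit Rate description above (hits over all sL1D requests) corresponds to the expression visible in the context lines of this file: `100 * SQC_DCACHE_HITS` divided by the sum of hits, misses, and duplicated misses, guarded by a condition on that sum. The guard's fallback value is cut off in the hunk, so the sketch below assumes it returns `None` when no requests were counted. The function name is hypothetical.

```python
def sl1d_cache_hit_rate(hits, misses, misses_duplicate):
    """Percent of sL1D requests that hit, following the YAML expression.

    Returns None when the denominator is zero; that fallback is an
    assumption, since the else branch is not visible in the hunk.
    """
    total = (hits + misses) + misses_duplicate
    if total == 0:
        return None
    return (100 * hits) / total


# Made-up counter values, for illustration only.
rate = sl1d_cache_hit_rate(hits=90, misses=8, misses_duplicate=2)
empty = sl1d_cache_hit_rate(hits=0, misses=0, misses_duplicate=0)
```

Note that duplicated misses are part of the denominator, so a kernel with many requests pending on the same line reports a lower hit rate than hits over hits plus unique misses would suggest.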
@@ -2,70 +2,6 @@
 Panel Config:
  id: 1500
  title: Address Processing Unit and Data Return Path (TA/TD)
-  metrics_description:
-    Address Processing Unit Busy: Percent of the total CU cycles the address processor
-      was busy
-    Address Stall: Percent of the total CU cycles the address processor was stalled
-      from sending address requests further into the vL1D pipeline.
-    Data Stall: Percent of the total CU cycles the address processor was stalled from
-      sending write/atomic data further into the vL1D pipeline.
-    "Data-Processor \u2192 Address Stall": Percent of total CU cycles the address
-      processor was stalled waiting to send command data to the data processor.
-    Total Instructions: The total number of memory instructions executed by the address
-      processer over all compute units on the accelerator, per normalization unit.
-    Global/Generic Instructions: The total number of global & generic memory instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Global/Generic Read Instructions: The total number of global & generic memory
-      read instructions executed on all compute units on the accelerator, per normalization
-      unit.
-    Global/Generic Write Instructions: The total number of global & generic memory
-      write instructions executed on all compute units on the accelerator, per normalization
-      unit.
-    Global/Generic Atomic Instructions: The total number of global & generic memory
-      atomic (with and without return) instructions executed on all compute units
-      on the accelerator, per normalization unit.
-    Spill/Stack Instructions: The total number of spill/stack memory instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Read Instructions: The total number of spill/stack memory read instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Write Instructions: The total number of spill/stack memory write instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Atomic Instructions: The total number of spill/stack memory atomic
-      (with and without return) instructions executed on all compute units on the
-      accelerator, per normalization unit. Typically unused as these memory operations
-      are typically used to implement thread-local storage.
-    Spill/Stack Total Cycles: The number of cycles the address processing unit spent
-      working on spill/stack instructions, per normalization unit.
-    Spill/Stack Coalesced Read: The number of cycles the address processing unit spent
-      working on coalesced spill/stack read instructions, per normalization unit.
-    Spill/Stack Coalesced Write: The number of cycles the address processing unit
-      spent working on coalesced spill/stack write instructions, per normalization
-      unit.
-    Data-Return Busy: Percent of the total CU cycles the data-return unit was busy
-      processing or waiting on data to return to the CU.
-    "Cache RAM \u2192 Data-Return Stall": Percent of the total CU cycles the data-return
-      unit was stalled on data to be returned from the vL1D Cache RAM.
-    "Workgroup manager \u2192 Data-Return Stall": Percent of the total CU cycles the
-      data-return unit was stalled by the workgroup manager due to initialization
-      of registers as a part of launching new workgroups.
-    Coalescable Instructions: The number of instructions submitted to the data-return
-      unit by the address processor that were found to be coalescable, per normalization
-      unit.
-    Read Instructions: The number of read instructions submitted to the data-return
-      unit by the address processor summed over all compute units on the accelerator,
-      per normalization unit. This is expected to be the sum of global/generic and
-      spill/stack reads in the address processor.
-    Write Instructions: The number of store instructions submitted to the data-return
-      unit by the address processor summed over all compute units on the accelerator,
-      per normalization unit. This is expected to be the sum of global/generic and
-      spill/stack stores in the address processor.
-    Atomic Instructions: The number of atomic instructions submitted to the data-return
-      unit by the address processor summed over all compute units on the accelerator,
-      per normalization unit. This is expected to be the sum of global/generic and
-      spill/stack atomics in the address processor.
-    Write Ack Instructions: The total number of write acknowledgements submitted by
-      data-return unit to SQ, summed over all compute units on the accelerator, per
-      normalization unit.
  data source:
  - metric_table:
      id: 1501
@@ -135,47 +71,47 @@ Panel Config:
          avg: AVG((TA_TOTAL_WAVEFRONTS_sum / $denom))
          min: MIN((TA_TOTAL_WAVEFRONTS_sum / $denom))
          max: MAX((TA_TOTAL_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Global/Generic Instructions:
          avg: AVG((TA_FLAT_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Global/Generic Read Instructions:
          avg: AVG((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Global/Generic Write Instructions:
          avg: AVG((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Global/Generic Atomic Instructions:
          avg: AVG((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Spill/Stack Instructions:
          avg: AVG((TA_BUFFER_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Spill/Stack Read Instructions:
          avg: AVG((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Spill/Stack Write Instructions:
          avg: AVG((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Spill/Stack Atomic Instructions:
          avg: AVG((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
          max: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
  - metric_table:
      id: 1503
      title: Spill and stack metrics
@@ -190,17 +126,17 @@ Panel Config:
          avg: AVG((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
          min: MIN((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
          max: MAX((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Spill/Stack Coalesced Read:
          avg: AVG((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
          min: MIN((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
          max: MAX((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Spill/Stack Coalesced Write:
          avg: AVG((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
          min: MIN((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
          max: MAX((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
  - metric_table:
      id: 1504
      title: Vector L1 data-return path or Texture Data (TD)
@@ -230,7 +166,7 @@ Panel Config:
          avg: AVG((TD_COALESCABLE_WAVEFRONT_sum / $denom))
          min: MIN((TD_COALESCABLE_WAVEFRONT_sum / $denom))
          max: MAX((TD_COALESCABLE_WAVEFRONT_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Read Instructions:
          avg: AVG((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
            / $denom))
@@ -238,14 +174,75 @@ Panel Config:
            / $denom))
          max: MAX((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
            / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Write Instructions:
          avg: AVG((TD_STORE_WAVEFRONT_sum / $denom))
          min: MIN((TD_STORE_WAVEFRONT_sum / $denom))
          max: MAX((TD_STORE_WAVEFRONT_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Atomic Instructions:
          avg: AVG((TD_ATOMIC_WAVEFRONT_sum / $denom))
          min: MIN((TD_ATOMIC_WAVEFRONT_sum / $denom))
          max: MAX((TD_ATOMIC_WAVEFRONT_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
+  metrics_description:
+    Address Processing Unit Busy: Percent of the total CU cycles the address processor
+      was busy.
+    Address Stall: Percent of the total CU cycles the address processor was stalled
+      from sending address requests further into the vL1D pipeline.
+    Data Stall: Percent of the total CU cycles the address processor was stalled from
+      sending write/atomic data further into the vL1D pipeline.
+    "Data-Processor \u2192 Address Stall": Percent of total CU cycles the address
+      processor was stalled waiting to send command data to the data processor.
+    Total Instructions: The total number of memory instructions executed by the address
+      processor over all compute units on the accelerator, per normalization unit.
+    Global/Generic Instructions: The total number of global & generic memory instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Global/Generic Read Instructions: The total number of global & generic memory
+      read instructions executed on all compute units on the accelerator, per normalization
+      unit.
+    Global/Generic Write Instructions: The total number of global & generic memory
+      write instructions executed on all compute units on the accelerator, per normalization
+      unit.
+    Global/Generic Atomic Instructions: The total number of global & generic memory
+      atomic (with and without return) instructions executed on all compute units
+      on the accelerator, per normalization unit.
+    Spill/Stack Instructions: The total number of spill/stack memory instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Read Instructions: The total number of spill/stack memory read instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Write Instructions: The total number of spill/stack memory write instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Atomic Instructions: The total number of spill/stack memory atomic
+      (with and without return) instructions executed on all compute units on the
+      accelerator, per normalization unit. Typically unused, as these memory operations
+      are generally used to implement thread-local storage.
+    Spill/Stack Total Cycles: The number of cycles the address processing unit spent
+      working on spill/stack instructions, per normalization unit.
+    Spill/Stack Coalesced Read: The number of cycles the address processing unit spent
+      working on coalesced spill/stack read instructions, per normalization unit.
+    Spill/Stack Coalesced Write: The number of cycles the address processing unit
+      spent working on coalesced spill/stack write instructions, per normalization
+      unit.
+    Data-Return Busy: Percent of the total CU cycles the data-return unit was busy
+      processing or waiting on data to return to the CU.
+    "Cache RAM \u2192 Data-Return Stall": Percent of the total CU cycles the data-return
+      unit was stalled on data to be returned from the vL1D Cache RAM.
+    "Workgroup manager \u2192 Data-Return Stall": Percent of the total CU cycles the
+      data-return unit was stalled by the workgroup manager due to initialization
+      of registers as a part of launching new workgroups.
+    Coalescable Instructions: The number of instructions submitted to the data-return
+      unit by the address processor that were found to be coalescable, per normalization
+      unit.
+    Read Instructions: The number of read instructions submitted to the data-return
+      unit by the address processor summed over all compute units on the accelerator,
+      per normalization unit. This is expected to be the sum of global/generic and
+      spill/stack reads in the address processor.
+    Write Instructions: The number of store instructions submitted to the data-return
+      unit by the address processor summed over all compute units on the accelerator,
+      per normalization unit. This is expected to be the sum of global/generic and
+      spill/stack stores in the address processor.
+    Atomic Instructions: The number of atomic instructions submitted to the data-return
+      unit by the address processor summed over all compute units on the accelerator,
+      per normalization unit. This is expected to be the sum of global/generic and
+      spill/stack atomics in the address processor.
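The data-return instruction metrics above are derived from three counters in the context lines of this file: Read Instructions is `TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum - TD_ATOMIC_WAVEFRONT_sum`, while Write and Atomic Instructions use their counters directly, each divided by `$denom`. A small Python sketch of that breakdown follows. The function name and the plain numeric `denom` argument standing in for `$denom` are illustrative assumptions.

```python
def td_instruction_mix(load, store, atomic, denom=1):
    """Split data-return instructions into reads, writes, and atomics.

    Mirrors the YAML expressions: reads are obtained by subtracting the
    store and atomic counts from the load counter.
    """
    return {
        "read": ((load - store) - atomic) / denom,
        "write": store / denom,
        "atomic": atomic / denom,
    }


# Made-up counter values, for illustration only.
mix = td_instruction_mix(load=100, store=30, atomic=10, denom=2)
```

Because reads are computed by subtraction, the three values always add up to `load / denom`, which is a quick consistency check when comparing against the address processor's instruction counts.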
@@ -2,117 +2,6 @@
 Panel Config:
  id: 1600
  title: Vector L1 Data Cache
-  metrics_description:
-    Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
-      cache over the total number of cache line requests to the vL1D Cache RAM.
-    Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
-      of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
-      on the specific accelerator. The number of bytes is calculated as the number
-      of cache lines requested multiplied by the cache line size. This value does
-      not consider partial requests, so for instance, if only a single value is requested
-      in a cache line, the data movement will still be counted as a full cache line.
-    Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
-      The number of cycles where the vL1D Cache RAM is actively processing any request
-      divided by the number of cycles where the vL1D is active.
-    Coalescing: Indicates how well memory instructions were coalesced by the address
-      processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
-      as the average number of thread-requests generated per instruction divided by
-      the ideal number of thread-requests per instruction.
-    Stalled on L2 Data: The ratio of the number of cycles where the vL1D is stalled
-      waiting for requested data to return from the L2 cache divided by the number
-      of cycles where the vL1D is active.
-    Stalled on L2 Req: The ratio of the number of cycles where the vL1D is stalled
-      waiting to issue a request for data to the L2 cache divided by the number of
-      cycles where the vL1D is active.
-    Tag RAM Stall (Read): The ratio of the number of cycles where the vL1D is stalled
-      due to Read requests with conflicting tags being looked up concurrently, divided
-      by the number of cycles where the vL1D is active.
-    Tag RAM Stall (Write): The ratio of the number of cycles where the vL1D is stalled
-      due to Write requests with conflicting tags being looked up concurrently, divided
-      by the number of cycles where the vL1D is active.
-    Tag RAM Stall (Atomic): The ratio of the number of cycles where the vL1D is stalled
-      due to Atomic requests with conflicting tags being looked up concurrently, divided
-      by the number of cycles where the vL1D is active.
-    Total Req: The total number of incoming requests from the address processing unit
-      after coalescing.
-    Read Req: The total number of incoming read requests from the address processing
-      unit after coalescing per normalization unit.
-    Write Req: The total number of incoming write requests from the address processing
-      unit after coalescing per normalization unit.
-    Atomic Req: The total number of incoming atomic requests from the address processing
-      unit after coalescing per normalization unit.
-    Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
-      instructions divided by total duration. The number of bytes is calculated as
-      the number of cache lines requested multiplied by the cache line size.  This
-      value does not consider partial requests, so for instance, if only a single
-      value is requested in a cache line, the data movement will still be counted
-      as a full cache line.
-    Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
-      vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
-    Cache Accesses: The total number of cache line lookups in the vL1D.
-    Cache Hits: The number of cache accesses minus the number of outgoing requests
-      to the L2 cache, that is, the number of cache line requests serviced by the
-      vL1D Cache RAM per normalization unit.
-    Invalidations: The number of times the vL1D was issued a write-back invalidate
-      command during the kernel's execution per normalization unit. This may be triggered
-      by, for instance, the buffer_wbinvl1 instruction.
-    L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
-      of VMEM instructions, divided by total duration. The number of bytes is calculated
-      as the number of cache lines requested multiplied by the cache line size. This
-      value does not consider partial requests, so for instance, if only a single
-      value is requested in a cache line, the data movement will still be counted
-      as a full cache line.
-    L1-L2 Read: The number of read requests for a vL1D cache line that were not satisfied
-      by the vL1D and must be retrieved from the to the L2 Cache per normalization
-      unit.
-    L1-L2 Write: The number of write requests to a vL1D cache line that were sent
-      through the vL1D to the L2 cache, per normalization unit.
-    L1-L2 Atomic: The number of atomic requests that are sent through the vL1D to
-      the L2 cache, per normalization unit. This includes requests for atomics with,
-      and without return.
-    L1 Access Latency: Calculated as the average number of cycles that a vL1D cache
-      line request spent in the vL1D cache pipeline.
-    L1-L2 Read Latency: Calculated as the average number of cycles that the vL1D cache
-      took to issue and receive read requests from the L2 Cache. This number also
-      includes requests for atomics with return values.
-    L1-L2 Write Latency: Calculated as the average number of cycles that the vL1D
-      cache took to issue and receive acknowledgement of a write request to the L2
-      Cache. This number also includes requests for atomics without return values.
-    NC - Read: Total read requests with NC mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    UC - Read: Total read requests with UC mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    CC - Read: Total read requests with CC mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    RW - Read: Total read requests with RW mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    RW - Write: Total write requests with RW mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    NC - Write: Total write requests with NC mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    UC - Write: Total write requests with UC mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    CC - Write: Total write requests with CC mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    NC - Atomic: Total atomic requests with NC mtype from this TCP to all TCCs Sum
-      over TCP instances per normalization unit.
-    UC - Atomic: Total atomic requests with UC mtype from this TCP to all TCCs Sum
-      over TCP instances per normalization unit.
-    CC - Atomic: Total atomic requests with CC mtype from this TCP to all TCCs Sum
-      over TCP instances per normalization unit.
-    RW - Atomic: Total atomic requests with RW mtype from this TCP to all TCCs Sum
-      over TCP instances per normalization unit.
-    Req: The number of translation requests made to the UTCL1 per normalization unit.
-    Hit Ratio: The ratio of the number of translation requests that hit in the UTCL1
-      divided by the total number of translation requests made to the UTCL1.
-    Hits: The number of translation requests that hit in the UTCL1, and could be reused,
-      per normalization unit.
-    Translation Misses: The total number of translation requests that missed in the
-      UTCL1 due to  translation not being present in the cache, per normalization
-      unit.
-    Permission Misses: "The total number of translation requests that missed in the\
-      \ UTCL1 due to a permission error, per normalization unit. This is unused and\
-      \ expected to be zero in most configurations for modern CDNA\u2122 accelerators."
  data source:
  - metric_table:
      id: 1601
@@ -181,17 +70,17 @@ Panel Config:
          avg: AVG((TCP_TOTAL_ACCESSES_sum / $denom))
          min: MIN((TCP_TOTAL_ACCESSES_sum / $denom))
          max: MAX((TCP_TOTAL_ACCESSES_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read Req:
          avg: AVG((TCP_TOTAL_READ_sum / $denom))
          min: MIN((TCP_TOTAL_READ_sum / $denom))
          max: MAX((TCP_TOTAL_READ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Write Req:
          avg: AVG((TCP_TOTAL_WRITE_sum / $denom))
          min: MIN((TCP_TOTAL_WRITE_sum / $denom))
          max: MAX((TCP_TOTAL_WRITE_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Atomic Req:
          avg: AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
            / $denom))
@@ -199,7 +88,7 @@ Panel Config:
            / $denom))
          max: MAX(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
            / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Cache BW:
          avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
          min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
@@ -223,7 +112,7 @@ Panel Config:
          avg: AVG((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
          min: MIN((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
          max: MAX((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Cache Hits:
          avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
@@ -234,7 +123,7 @@ Panel Config:
          max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Invalidations:
          avg: AVG((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
          min: MIN((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
@@ -252,12 +141,12 @@ Panel Config:
          avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_READ_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        L1-L2 Write:
          avg: AVG((TCP_TCC_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_WRITE_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        L1-L2 Atomic:
          avg: AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
            / $denom))
@@ -265,7 +154,7 @@ Panel Config:
            / $denom))
          max: MAX(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
            / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        L1 Access Latency:
          avg: AVG(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
            != 0) else None))
@@ -314,84 +203,84 @@ Panel Config:
          avg: AVG((TCP_TCC_NC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_NC_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_NC_READ_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        UC - Read:
          xfer: Read
          coherency: UC
          avg: AVG((TCP_TCC_UC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_UC_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_UC_READ_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        CC - Read:
          xfer: Read
          coherency: CC
          avg: AVG((TCP_TCC_CC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_CC_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_CC_READ_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        RW - Read:
          xfer: Read
          coherency: RW
          avg: AVG((TCP_TCC_RW_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_RW_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_RW_READ_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        RW - Write:
          xfer: Write
          coherency: RW
          avg: AVG((TCP_TCC_RW_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_RW_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_RW_WRITE_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        NC - Write:
          xfer: Write
          coherency: NC
          avg: AVG((TCP_TCC_NC_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_NC_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_NC_WRITE_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        UC - Write:
          xfer: Write
          coherency: UC
          avg: AVG((TCP_TCC_UC_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_UC_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_UC_WRITE_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        CC - Write:
          xfer: Write
          coherency: CC
          avg: AVG((TCP_TCC_CC_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_CC_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_CC_WRITE_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        NC - Atomic:
          xfer: Atomic
          coherency: NC
          avg: AVG((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
          min: MIN((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
          max: MAX((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        UC - Atomic:
          xfer: Atomic
          coherency: UC
          avg: AVG((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
          min: MIN((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
          max: MAX((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        CC - Atomic:
          xfer: Atomic
          coherency: CC
          avg: AVG((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
          min: MIN((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
          max: MAX((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        RW - Atomic:
          xfer: Atomic
          coherency: RW
          avg: AVG((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
          min: MIN((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
          max: MAX((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
  - metric_table:
      id: 1605
      title: L1 Unified Translation Cache (UTCL1)
@@ -440,3 +329,114 @@ Panel Config:
        max: Max
        units: Unit
      metric: {}
+  metrics_description:
+    Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
+      cache over the total number of cache line requests to the vL1D Cache RAM.
+    Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
+      of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
+      on the specific accelerator. The number of bytes is calculated as the number
+      of cache lines requested multiplied by the cache line size. This value does
+      not consider partial requests, so for instance, if only a single value is requested
+      in a cache line, the data movement will still be counted as a full cache line.
+    Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
+      The number of cycles where the vL1D Cache RAM is actively processing any request
+      divided by the number of cycles where the vL1D is active.
+    Coalescing: Indicates how well memory instructions were coalesced by the address
+      processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
+      as the average number of thread-requests generated per instruction divided by
+      the ideal number of thread-requests per instruction.
+    Stalled on L2 Data: The ratio of the number of cycles where the vL1D is stalled
+      waiting for requested data to return from the L2 cache divided by the number
+      of cycles where the vL1D is active.
+    Stalled on L2 Req: The ratio of the number of cycles where the vL1D is stalled
+      waiting to issue a request for data to the L2 cache divided by the number of
+      cycles where the vL1D is active.
+    Tag RAM Stall (Read): The ratio of the number of cycles where the vL1D is stalled
+      due to Read requests with conflicting tags being looked up concurrently, divided
+      by the number of cycles where the vL1D is active.
+    Tag RAM Stall (Write): The ratio of the number of cycles where the vL1D is stalled
+      due to Write requests with conflicting tags being looked up concurrently, divided
+      by the number of cycles where the vL1D is active.
+    Tag RAM Stall (Atomic): The ratio of the number of cycles where the vL1D is stalled
+      due to Atomic requests with conflicting tags being looked up concurrently, divided
+      by the number of cycles where the vL1D is active.
+    Total Req: The total number of incoming requests from the address processing unit
+      after coalescing.
+    Read Req: The total number of incoming read requests from the address processing
+      unit after coalescing per normalization unit.
+    Write Req: The total number of incoming write requests from the address processing
+      unit after coalescing per normalization unit.
+    Atomic Req: The total number of incoming atomic requests from the address processing
+      unit after coalescing per normalization unit.
+    Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
+      instructions divided by total duration. The number of bytes is calculated as
+      the number of cache lines requested multiplied by the cache line size. This
+      value does not consider partial requests, so for instance, if only a single
+      value is requested in a cache line, the data movement will still be counted
+      as a full cache line.
+    Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
+      vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
+    Cache Accesses: The total number of cache line lookups in the vL1D.
+    Cache Hits: The number of cache accesses minus the number of outgoing requests
+      to the L2 cache, that is, the number of cache line requests serviced by the
+      vL1D Cache RAM per normalization unit.
+    Invalidations: The number of times the vL1D was issued a write-back invalidate
+      command during the kernel's execution per normalization unit. This may be triggered
+      by, for instance, the buffer_wbinvl1 instruction.
+    L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
+      of VMEM instructions, divided by total duration. The number of bytes is calculated
+      as the number of cache lines requested multiplied by the cache line size. This
+      value does not consider partial requests, so for instance, if only a single
+      value is requested in a cache line, the data movement will still be counted
+      as a full cache line.
+    L1-L2 Read: The number of read requests for a vL1D cache line that were not satisfied
+      by the vL1D and must be retrieved from the L2 Cache, per normalization
+      unit.
+    L1-L2 Write: The number of write requests to a vL1D cache line that were sent
+      through the vL1D to the L2 cache, per normalization unit.
+    L1-L2 Atomic: The number of atomic requests that are sent through the vL1D to
+      the L2 cache, per normalization unit. This includes requests for atomics with
+      and without return values.
+    L1 Access Latency: Calculated as the average number of cycles that a vL1D cache
+      line request spent in the vL1D cache pipeline.
+    L1-L2 Read Latency: Calculated as the average number of cycles that the vL1D cache
+      took to issue and receive read requests from the L2 Cache. This number also
+      includes requests for atomics with return values.
+    L1-L2 Write Latency: Calculated as the average number of cycles that the vL1D
+      cache took to issue and receive acknowledgement of a write request to the L2
+      Cache. This number also includes requests for atomics without return values.
+    NC - Read: Total read requests with NC mtype from this TCP to all TCCs, summed over
+      TCP instances, per normalization unit.
+    UC - Read: Total read requests with UC mtype from this TCP to all TCCs, summed over
+      TCP instances, per normalization unit.
+    CC - Read: Total read requests with CC mtype from this TCP to all TCCs, summed over
+      TCP instances, per normalization unit.
+    RW - Read: Total read requests with RW mtype from this TCP to all TCCs, summed over
+      TCP instances, per normalization unit.
+    RW - Write: Total write requests with RW mtype from this TCP to all TCCs, summed over
+      TCP instances, per normalization unit.
+    NC - Write: Total write requests with NC mtype from this TCP to all TCCs, summed over
+      TCP instances, per normalization unit.
+    UC - Write: Total write requests with UC mtype from this TCP to all TCCs, summed over
+      TCP instances, per normalization unit.
+    CC - Write: Total write requests with CC mtype from this TCP to all TCCs, summed over
+      TCP instances, per normalization unit.
+    NC - Atomic: Total atomic requests with NC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    UC - Atomic: Total atomic requests with UC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    CC - Atomic: Total atomic requests with CC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    RW - Atomic: Total atomic requests with RW mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    Req: The number of translation requests made to the UTCL1 per normalization unit.
+    Hit Ratio: The ratio of the number of translation requests that hit in the UTCL1
+      divided by the total number of translation requests made to the UTCL1.
+    Hits: The number of translation requests that hit in the UTCL1, and could be reused,
+      per normalization unit.
+    Translation Misses: The total number of translation requests that missed in the
+      UTCL1 due to translation not being present in the cache, per normalization unit.
+    Permission Misses: |-
+      The total number of translation requests that missed in the UTCL1 due
+      to a permission error, per normalization unit. This is unused and expected
+      to be zero in most configurations for modern CDNA™ accelerators.
@@ -2,6 +2,350 @@
 Panel Config:
  id: 1700
  title: L2 Cache
+  data source:
+  - metric_table:
+      id: 1701
+      title: L2 Speed-of-Light
+      header:
+        metric: Metric
+        value: Avg
+        unit: Unit
+      metric:
+        Utilization:
+          value: AVG(((TCC_BUSY_sum * 100) / (TO_INT($total_l2_chan) * $GRBM_GUI_ACTIVE_PER_XCD)))
+          unit: pct
+        Peak Bandwidth:
+          value: ((100 * AVG(((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))))
+            / ((($max_sclk / 1000) * 128) * TO_INT($total_l2_chan)))
+          unit: pct
+        Hit Rate:
+          value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
+            + TCC_MISS_sum) != 0) else 0))
+          unit: pct
+        L2-Fabric Read BW:
+          value: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
+            * 64)) / (End_Timestamp - Start_Timestamp)))
+          unit: GB/s
+        L2-Fabric Write and Atomic BW:
+          value: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
+            * 32)) / (End_Timestamp - Start_Timestamp)))
+          unit: GB/s
+        HBM Bandwidth:
+          value: $hbmBandwidth
+          unit: GB/s
+  - metric_table:
+      id: 1702
+      title: L2-Fabric interface metrics
+      header:
+        metric: Metric
+        avg: Avg
+        min: Min
+        max: Max
+        unit: Unit
+      metric:
+        Read BW:
+          avg: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
+            * 64)) / (End_Timestamp - Start_Timestamp)))
+          min: MIN((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
+            * 64)) / (End_Timestamp - Start_Timestamp)))
+          max: MAX((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
+            * 64)) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
+        HBM Read Traffic:
+          avg: AVG((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
+            != 0) else None))
+          min: MIN((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
+            != 0) else None))
+          max: MAX((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
+            != 0) else None))
+          unit: pct
+        Remote Read Traffic:
+          avg: AVG((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum)
+            if (TCC_EA_RDREQ_sum != 0) else None))
+          min: MIN((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum)
+            if (TCC_EA_RDREQ_sum != 0) else None))
+          max: MAX((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum)
+            if (TCC_EA_RDREQ_sum != 0) else None))
+          unit: pct
+        Uncached Read Traffic:
+          avg: AVG((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
+            != 0) else None))
+          min: MIN((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
+            != 0) else None))
+          max: MAX((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
+            != 0) else None))
+          unit: pct
+        Write and Atomic BW:
+          avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
+            * 32)) / $denom))
+          min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
+            * 32)) / $denom))
+          max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
+            * 32)) / $denom))
+          unit: (Bytes + $normUnit)
+        HBM Write and Atomic Traffic:
+          avg: AVG((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
+            != 0) else None))
+          min: MIN((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
+            != 0) else None))
+          max: MAX((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
+            != 0) else None))
+          unit: pct
+        Remote Write and Atomic Traffic:
+          avg: AVG((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum)
+            if (TCC_EA_WRREQ_sum != 0) else None))
+          min: MIN((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum)
+            if (TCC_EA_WRREQ_sum != 0) else None))
+          max: MAX((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum)
+            if (TCC_EA_WRREQ_sum != 0) else None))
+          unit: pct
+        Atomic Traffic:
+          avg: AVG((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
+            != 0) else None))
+          min: MIN((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
+            != 0) else None))
+          max: MAX((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
+            != 0) else None))
+          unit: pct
+        Uncached Write and Atomic Traffic:
+          avg: AVG((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
+            != 0) else None))
+          min: MIN((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
+            != 0) else None))
+          max: MAX((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
+            != 0) else None))
+          unit: pct
+        Read Latency:
+          avg: AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
+            != 0) else None))
+          min: MIN(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
+            != 0) else None))
+          max: MAX(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
+            != 0) else None))
+          unit: Cycles
+        Write and Atomic Latency:
+          avg: AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
+            != 0) else None))
+          min: MIN(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
+            != 0) else None))
+          max: MAX(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
+            != 0) else None))
+          unit: Cycles
+        Atomic Latency:
+          avg: AVG(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
+            != 0) else None))
+          min: MIN(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
+            != 0) else None))
+          max: MAX(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
+            != 0) else None))
+          unit: Cycles
+  - metric_table:
+      id: 1703
+      title: L2 Cache Accesses
+      header:
+        metric: Metric
+        avg: Avg
+        min: Min
+        max: Max
+        unit: Unit
+      metric:
+        Bandwidth:
+          avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
+          min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
+          max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
+          unit: Gbps
+        Req:
+          avg: AVG((TCC_REQ_sum / $denom))
+          min: MIN((TCC_REQ_sum / $denom))
+          max: MAX((TCC_REQ_sum / $denom))
+          unit: (Req + $normUnit)
+        Read Req:
+          avg: AVG((TCC_READ_sum / $denom))
+          min: MIN((TCC_READ_sum / $denom))
+          max: MAX((TCC_READ_sum / $denom))
+          unit: (Req + $normUnit)
+        Write Req:
+          avg: AVG((TCC_WRITE_sum / $denom))
+          min: MIN((TCC_WRITE_sum / $denom))
+          max: MAX((TCC_WRITE_sum / $denom))
+          unit: (Req + $normUnit)
+        Atomic Req:
+          avg: AVG((TCC_ATOMIC_sum / $denom))
+          min: MIN((TCC_ATOMIC_sum / $denom))
+          max: MAX((TCC_ATOMIC_sum / $denom))
+          unit: (Req + $normUnit)
+        Streaming Req:
+          avg: AVG((TCC_STREAMING_REQ_sum / $denom))
+          min: MIN((TCC_STREAMING_REQ_sum / $denom))
+          max: MAX((TCC_STREAMING_REQ_sum / $denom))
+          unit: (Req + $normUnit)
+        Probe Req:
+          avg: AVG((TCC_PROBE_sum / $denom))
+          min: MIN((TCC_PROBE_sum / $denom))
+          max: MAX((TCC_PROBE_sum / $denom))
+          unit: (Req + $normUnit)
+        Cache Hit:
+          avg: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
+            + TCC_MISS_sum) != 0) else None))
+          min: MIN((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
+            + TCC_MISS_sum) != 0) else None))
+          max: MAX((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
+            + TCC_MISS_sum) != 0) else None))
+          unit: pct
+        Hits:
+          avg: AVG((TCC_HIT_sum / $denom))
+          min: MIN((TCC_HIT_sum / $denom))
+          max: MAX((TCC_HIT_sum / $denom))
+          unit: (Hits + $normUnit)
+        Misses:
+          avg: AVG((TCC_MISS_sum / $denom))
+          min: MIN((TCC_MISS_sum / $denom))
+          max: MAX((TCC_MISS_sum / $denom))
+          unit: (Misses + $normUnit)
+        Writeback:
+          avg: AVG((TCC_WRITEBACK_sum / $denom))
+          min: MIN((TCC_WRITEBACK_sum / $denom))
+          max: MAX((TCC_WRITEBACK_sum / $denom))
+          unit: (Cachelines + $normUnit)
+        Writeback (Internal):
+          avg: AVG((TCC_NORMAL_WRITEBACK_sum / $denom))
+          min: MIN((TCC_NORMAL_WRITEBACK_sum / $denom))
+          max: MAX((TCC_NORMAL_WRITEBACK_sum / $denom))
+          unit: (Cachelines + $normUnit)
+        Writeback (vL1D Req):
+          avg: AVG((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
+          min: MIN((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
+          max: MAX((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
+          unit: (Cachelines + $normUnit)
+        Evict (Internal):
+          avg: AVG((TCC_NORMAL_EVICT_sum / $denom))
+          min: MIN((TCC_NORMAL_EVICT_sum / $denom))
+          max: MAX((TCC_NORMAL_EVICT_sum / $denom))
+          unit: (Cachelines + $normUnit)
+        Evict (vL1D Req):
+          avg: AVG((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
+          min: MIN((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
+          max: MAX((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
+          unit: (Cachelines + $normUnit)
+        NC Req:
+          avg: AVG((TCC_NC_REQ_sum / $denom))
+          min: MIN((TCC_NC_REQ_sum / $denom))
+          max: MAX((TCC_NC_REQ_sum / $denom))
+          unit: (Req + $normUnit)
+        UC Req:
+          avg: AVG((TCC_UC_REQ_sum / $denom))
+          min: MIN((TCC_UC_REQ_sum / $denom))
+          max: MAX((TCC_UC_REQ_sum / $denom))
+          unit: (Req + $normUnit)
+        CC Req:
+          avg: AVG((TCC_CC_REQ_sum / $denom))
+          min: MIN((TCC_CC_REQ_sum / $denom))
+          max: MAX((TCC_CC_REQ_sum / $denom))
+          unit: (Req + $normUnit)
+        RW Req:
+          avg: AVG((TCC_RW_REQ_sum / $denom))
+          min: MIN((TCC_RW_REQ_sum / $denom))
+          max: MAX((TCC_RW_REQ_sum / $denom))
+          unit: (Req + $normUnit)
+  - metric_table:
+      id: 1704
+      title: L2 Cache Stalls
+      header:
+        metric: Metric
+        avg: Avg
+        min: Min
+        max: Max
+        unit: Unit
+      metric: {}
+  - metric_table:
+      id: 1705
+      title: L2 - Fabric Interface stalls
+      header:
+        metric: Metric
+        type: Type
+        transaction: Transaction
+        avg: Avg
+        min: Min
+        max: Max
+        unit: Unit
+      style:
+        type: simple_multi_bar
+      metric:
+        Write - Credit Starvation:
+          type: Credit Starvation
+          transaction: Write
+          avg: AVG(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
+            != 0) else None))
+          min: MIN(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
+            != 0) else None))
+          max: MAX(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
+            != 0) else None))
+          unit: pct
+  - metric_table:
+      id: 1706
+      title: L2 - Fabric interface detailed metrics
+      header:
+        metric: Metric
+        avg: Avg
+        min: Min
+        max: Max
+        unit: Unit
+      metric:
+        Read (32B):
+          avg: AVG((TCC_EA_RDREQ_32B_sum / $denom))
+          min: MIN((TCC_EA_RDREQ_32B_sum / $denom))
+          max: MAX((TCC_EA_RDREQ_32B_sum / $denom))
+          unit: (Req + $normUnit)
+        Read (64B):
+          avg: AVG(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
+          min: MIN(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
+          max: MAX(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
+          unit: (Req + $normUnit)
+        Read (Uncached):
+          avg: AVG((TCC_EA_RD_UNCACHED_32B_sum / $denom))
+          min: MIN((TCC_EA_RD_UNCACHED_32B_sum / $denom))
+          max: MAX((TCC_EA_RD_UNCACHED_32B_sum / $denom))
+          unit: (Req + $normUnit)
+        HBM Read:
+          avg: AVG((TCC_EA_RDREQ_DRAM_sum / $denom))
+          min: MIN((TCC_EA_RDREQ_DRAM_sum / $denom))
+          max: MAX((TCC_EA_RDREQ_DRAM_sum / $denom))
+          unit: (Req + $normUnit)
+        Remote Read:
+          avg: AVG((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
+          min: MIN((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
+          max: MAX((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
+          unit: (Req + $normUnit)
+        Write and Atomic (32B):
+          avg: AVG(MAX(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom), 0))
+          min: MIN(MAX(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom), 0))
+          max: MAX(MAX(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom), 0))
+          unit: (Req + $normUnit)
+        Write and Atomic (Uncached):
+          avg: AVG((TCC_EA_WR_UNCACHED_32B_sum / $denom))
+          min: MIN((TCC_EA_WR_UNCACHED_32B_sum / $denom))
+          max: MAX((TCC_EA_WR_UNCACHED_32B_sum / $denom))
+          unit: (Req + $normUnit)
+        Write and Atomic (64B):
+          avg: AVG((TCC_EA_WRREQ_64B_sum / $denom))
+          min: MIN((TCC_EA_WRREQ_64B_sum / $denom))
+          max: MAX((TCC_EA_WRREQ_64B_sum / $denom))
+          unit: (Req + $normUnit)
+        HBM Write and Atomic:
+          avg: AVG((TCC_EA_WRREQ_DRAM_sum / $denom))
+          min: MIN((TCC_EA_WRREQ_DRAM_sum / $denom))
+          max: MAX((TCC_EA_WRREQ_DRAM_sum / $denom))
+          unit: (Req + $normUnit)
+        Remote Write and Atomic:
+          avg: AVG((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
+          min: MIN((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
+          max: MAX((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
+          unit: (Req + $normUnit)
+        Atomic:
+          avg: AVG((TCC_EA_ATOMIC_sum / $denom))
+          min: MIN((TCC_EA_ATOMIC_sum / $denom))
+          max: MAX((TCC_EA_ATOMIC_sum / $denom))
+          unit: (Req + $normUnit)
  metrics_description:
    Utilization: The ratio of the number of cycles an L2 channel was active, summed
      over all L2 channels on the accelerator over the total L2 cycles.
@@ -87,12 +431,6 @@ Panel Config:
      by the cache line size. This value does not consider partial requests, so for
      example, if only a single value is requested in a cache line, the data movement
      will still be counted as a full cache line.
-    Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests,
-      divided by total duration.
-    Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests,
-      divided by total duration.
-    Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests,
-      divided by total duration.
    Req: The total number of incoming requests to the L2 from all clients for all
      request types, per normalization unit.
    Read Req: The total number of read requests to the L2 from all clients.
@@ -149,12 +487,6 @@ Panel Config:
    Remote Read: The total number of L2 requests to Infinity Fabric to read 32B or
      64B of data from any source other than the accelerator's local HBM, per normalization
      unit.
-    Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe
-      traffic, divided by total duration.
-    "Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read
-      requests due to Infinity Fabric traffic, divided by total duration.
-    Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM
-      traffic, divided by total duration.
    Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
      write or atomically update 32B of data to any memory location, per normalization
      unit.
@@ -170,391 +502,9 @@ Panel Config:
    Remote Write and Atomic: The total number of L2 requests to Infinity Fabric to
      write or atomically update 32B or 64B of data in any memory location other than
      the accelerator's local HBM, per normalization unit.
-    Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to
-      PCIe traffic, divided by total duration.
-    "Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write
-      requests due to Infinity Fabric traffic, divided by total duration.
-    Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM
-      traffic, divided by total duration.
-    Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to
-      PCIe traffic, divided by total duration.
-    "Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic
-      requests due to Infinity Fabric traffic, divided by total duration.
-    Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
-      HBM traffic, divided by total duration.
    Atomic: The total number of L2 requests to Infinity Fabric to atomically update
      32B or 64B of data in any memory location, per normalization unit. See Request
      flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
      requests are only considered atomic by Infinity Fabric if they are targeted
      at non-write-cacheable memory, such as fine-grained memory allocations or uncached
      memory allocations on the MI2XX.
-    Read Stall: "The ratio of the total number of cycles the L2-Fabric interface was\
-      \ stalled on a read request to any destination (local HBM, remote PCIe\xAE connected\
-      \ accelerator or CPU, or remote Infinity Fabric connected accelerator or CPU)\
-      \ over the total active L2 cycles."
-    Write Stall: The ratio of the total number of cycles the L2-Fabric interface was
-      stalled on a write or atomic request to any destination (local HBM, remote accelerator
-      or CPU, PCIe connected accelerator or CPU, or remote Infinity Fabric connected
-      accelerator or CPU) over the total active L2 cycles.
-    Read - PCIe Stall: The number of cycles the L2-Fabric interface was stalled on
-      read requests to remote PCIe connected accelerators or CPUs as a percent of
-      the total active L2 cycles.
-    Read - Infinity Fabric Stall: The number of cycles the L2-Fabric interface was
-      stalled on read requests to remote Infinity Fabric connected accelerators or
-      CPUs as a percent of the total active L2 cycles.
-    Read - HBM Stall: The number of cycles the L2-Fabric interface was stalled on
-      read requests to the accelerator's local HBM as a percent of the total active
-      L2 cycles.
-    Write - PCIe Stall: The number of cycles the L2-Fabric interface was stalled on
-      write or atomic requests to remote PCIe connected accelerators or CPUs as a
-      percent of the total active L2 cycles.
-    Write - Infinity Fabric Stall: The number of cycles the L2-Fabric interface was
-      stalled on write or atomic requests to remote Infinity Fabric connected accelerators
-      or CPUs as a percent of the total active L2 cycles.
-    Write - HBM Stall: The number of cycles the L2-Fabric interface was stalled on
-      write or atomic requests to accelerator's local HBM as a percent of the total
-      active L2 cycles.
-  data source:
-  - metric_table:
-      id: 1701
-      title: L2 Speed-of-Light
-      header:
-        metric: Metric
-        value: Avg
-        unit: Unit
-      metric:
-        Utilization:
-          value: AVG(((TCC_BUSY_sum * 100) / (TO_INT($total_l2_chan) * $GRBM_GUI_ACTIVE_PER_XCD)))
-          unit: pct
-        Peak Bandwidth:
-          value: ((100 * AVG(((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))))
-            / ((($max_sclk / 1000) * 128) * TO_INT($total_l2_chan)))
-          unit: pct
-        Hit Rate:
-          value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
-            + TCC_MISS_sum) != 0) else 0))
-          unit: pct
-        L2-Fabric Read BW:
-          value: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
-            * 64)) / (End_Timestamp - Start_Timestamp)))
-          unit: GB/s
-        L2-Fabric Write and Atomic BW:
-          value: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
-            * 32)) / (End_Timestamp - Start_Timestamp)))
-          unit: GB/s
-        HBM Bandwidth:
-          value: $hbmBandwidth
-          unit: GB/s
-  - metric_table:
-      id: 1702
-      title: L2-Fabric interface metrics
-      header:
-        metric: Metric
-        avg: Avg
-        min: Min
-        max: Max
-        unit: Unit
-      metric:
-        Read BW:
-          avg: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
-            * 64)) / (End_Timestamp - Start_Timestamp)))
-          min: MIN((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
-            * 64)) / (End_Timestamp - Start_Timestamp)))
-          max: MAX((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
-            * 64)) / (End_Timestamp - Start_Timestamp)))
-          unit: Gbps
-        HBM Read Traffic:
-          avg: AVG((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
-            != 0) else None))
-          min: MIN((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
-            != 0) else None))
-          max: MAX((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
-            != 0) else None))
-          unit: pct
-        Remote Read Traffic:
-          avg: AVG((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum)
-            if (TCC_EA_RDREQ_sum != 0) else None))
-          min: MIN((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum)
-            if (TCC_EA_RDREQ_sum != 0) else None))
-          max: MAX((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum)
-            if (TCC_EA_RDREQ_sum != 0) else None))
-          unit: pct
-        Uncached Read Traffic:
-          avg: AVG((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
-            != 0) else None))
-          min: MIN((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
-            != 0) else None))
-          max: MAX((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
-            != 0) else None))
-          unit: pct
-        Write and Atomic BW:
-          avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
-            * 32)) / $denom))
-          min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
-            * 32)) / $denom))
-          max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
-            * 32)) / $denom))
-          unit: (Bytes  + $normUnit)
-        HBM Write and Atomic Traffic:
-          avg: AVG((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
-            != 0) else None))
-          min: MIN((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
-            != 0) else None))
-          max: MAX((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
-            != 0) else None))
-          unit: pct
-        Remote Write and Atomic Traffic:
-          avg: AVG((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum)
-            if (TCC_EA_WRREQ_sum != 0) else None))
-          min: MIN((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum)
-            if (TCC_EA_WRREQ_sum != 0) else None))
-          max: MAX((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum)
-            if (TCC_EA_WRREQ_sum != 0) else None))
-          unit: pct
-        Atomic Traffic:
-          avg: AVG((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
-            != 0) else None))
-          min: MIN((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
-            != 0) else None))
-          max: MAX((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
-            != 0) else None))
-          unit: pct
-        Uncached Write and Atomic Traffic:
-          avg: AVG((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
-            != 0) else None))
-          min: MIN((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
-            != 0) else None))
-          max: MAX((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
-            != 0) else None))
-          unit: pct
-        Read Latency:
-          avg: AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
-            != 0) else None))
-          min: MIN(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
-            != 0) else None))
-          max: MAX(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
-            != 0) else None))
-          unit: Cycles
-        Write and Atomic Latency:
-          avg: AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
-            != 0) else None))
-          min: MIN(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
-            != 0) else None))
-          max: MAX(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
-            != 0) else None))
-          unit: Cycles
-        Atomic Latency:
-          avg: AVG(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
-            != 0) else None))
-          min: MIN(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
-            != 0) else None))
-          max: MAX(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
-            != 0) else None))
-          unit: Cycles
-  - metric_table:
-      id: 1703
-      title: L2 Cache Accesses
-      header:
-        metric: Metric
-        avg: Avg
-        min: Min
-        max: Max
-        unit: Unit
-      metric:
-        Bandwidth:
-          avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
-          min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
-          max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
-          unit: Gbps
-        Req:
-          avg: AVG((TCC_REQ_sum / $denom))
-          min: MIN((TCC_REQ_sum / $denom))
-          max: MAX((TCC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
-        Read Req:
-          avg: AVG((TCC_READ_sum / $denom))
-          min: MIN((TCC_READ_sum / $denom))
-          max: MAX((TCC_READ_sum / $denom))
-          unit: (Req  + $normUnit)
-        Write Req:
-          avg: AVG((TCC_WRITE_sum / $denom))
-          min: MIN((TCC_WRITE_sum / $denom))
-          max: MAX((TCC_WRITE_sum / $denom))
-          unit: (Req  + $normUnit)
-        Atomic Req:
-          avg: AVG((TCC_ATOMIC_sum / $denom))
-          min: MIN((TCC_ATOMIC_sum / $denom))
-          max: MAX((TCC_ATOMIC_sum / $denom))
-          unit: (Req  + $normUnit)
-        Streaming Req:
-          avg: AVG((TCC_STREAMING_REQ_sum / $denom))
-          min: MIN((TCC_STREAMING_REQ_sum / $denom))
-          max: MAX((TCC_STREAMING_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
-        Probe Req:
-          avg: AVG((TCC_PROBE_sum / $denom))
-          min: MIN((TCC_PROBE_sum / $denom))
-          max: MAX((TCC_PROBE_sum / $denom))
-          unit: (Req  + $normUnit)
-        Cache Hit:
-          avg: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
-            + TCC_MISS_sum) != 0) else None))
-          min: MIN((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
-            + TCC_MISS_sum) != 0) else None))
-          max: MAX((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
-            + TCC_MISS_sum) != 0) else None))
-          unit: pct
-        Hits:
-          avg: AVG((TCC_HIT_sum / $denom))
-          min: MIN((TCC_HIT_sum / $denom))
-          max: MAX((TCC_HIT_sum / $denom))
-          unit: (Hits  + $normUnit)
-        Misses:
-          avg: AVG((TCC_MISS_sum / $denom))
-          min: MIN((TCC_MISS_sum / $denom))
-          max: MAX((TCC_MISS_sum / $denom))
-          unit: (Misses  + $normUnit)
-        Writeback:
-          avg: AVG((TCC_WRITEBACK_sum / $denom))
-          min: MIN((TCC_WRITEBACK_sum / $denom))
-          max: MAX((TCC_WRITEBACK_sum / $denom))
-          unit: (Cachelines  + $normUnit)
-        Writeback (Internal):
-          avg: AVG((TCC_NORMAL_WRITEBACK_sum / $denom))
-          min: MIN((TCC_NORMAL_WRITEBACK_sum / $denom))
-          max: MAX((TCC_NORMAL_WRITEBACK_sum / $denom))
-          unit: (Cachelines + $normUnit)
-        Writeback (vL1D Req):
-          avg: AVG((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
-          min: MIN((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
-          max: MAX((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
-          unit: (Cachelines + $normUnit)
-        Evict (Internal):
-          avg: AVG((TCC_NORMAL_EVICT_sum / $denom))
-          min: MIN((TCC_NORMAL_EVICT_sum / $denom))
-          max: MAX((TCC_NORMAL_EVICT_sum / $denom))
-          unit: (Cachelines + $normUnit)
-        Evict (vL1D Req):
-          avg: AVG((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
-          min: MIN((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
-          max: MAX((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
-          unit: (Cachelines + $normUnit)
-        NC Req:
-          avg: AVG((TCC_NC_REQ_sum / $denom))
-          min: MIN((TCC_NC_REQ_sum / $denom))
-          max: MAX((TCC_NC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
-        UC Req:
-          avg: AVG((TCC_UC_REQ_sum / $denom))
-          min: MIN((TCC_UC_REQ_sum / $denom))
-          max: MAX((TCC_UC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
-        CC Req:
-          avg: AVG((TCC_CC_REQ_sum / $denom))
-          min: MIN((TCC_CC_REQ_sum / $denom))
-          max: MAX((TCC_CC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
-        RW Req:
-          avg: AVG((TCC_RW_REQ_sum / $denom))
-          min: MIN((TCC_RW_REQ_sum / $denom))
-          max: MAX((TCC_RW_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
-  - metric_table:
-      id: 1704
-      title: L2 Cache Stalls
-      header:
-        metric: Metric
-        avg: Avg
-        min: Min
-        max: Max
-        unit: Unit
-      metric: {}
-  - metric_table:
-      id: 1705
-      title: L2 - Fabric Interface stalls
-      header:
-        metric: Metric
-        type: Type
-        transaction: Transaction
-        avg: Avg
-        min: Min
-        max: Max
-        unit: Unit
-      style:
-        type: simple_multi_bar
-      metric:
-        Write - Credit Starvation:
-          type: Credit Starvation
-          transaction: Write
-          avg: AVG(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
-            != 0) else None))
-          min: MIN(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
-            != 0) else None))
-          max: MAX(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
-            != 0) else None))
-          unit: pct
-  - metric_table:
-      id: 1706
-      title: L2 - Fabric interface detailed metrics
-      header:
-        metric: Metric
-        avg: Avg
-        min: Min
-        max: Max
-        unit: Unit
-      metric:
-        Read (32B):
-          avg: AVG((TCC_EA_RDREQ_32B_sum / $denom))
-          min: MIN((TCC_EA_RDREQ_32B_sum / $denom))
-          max: MAX((TCC_EA_RDREQ_32B_sum / $denom))
-          unit: (Req  + $normUnit)
-        Read (64B):
-          avg: AVG(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
-          min: MIN(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
-          max: MAX(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
-          unit: (Req  + $normUnit)
-        Read (Uncached):
-          avg: AVG((TCC_EA_RD_UNCACHED_32B_sum / $denom))
-          min: MIN((TCC_EA_RD_UNCACHED_32B_sum / $denom))
-          max: MAX((TCC_EA_RD_UNCACHED_32B_sum / $denom))
-          unit: (Req  + $normUnit)
-        HBM Read:
-          avg: AVG((TCC_EA_RDREQ_DRAM_sum / $denom))
-          min: MIN((TCC_EA_RDREQ_DRAM_sum / $denom))
-          max: MAX((TCC_EA_RDREQ_DRAM_sum / $denom))
-          unit: (Req  + $normUnit)
-        Remote Read:
-          avg: AVG((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
-          min: MIN((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
-          max: MAX((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
-          unit: (Req  + $normUnit)
-        Write and Atomic (32B):
-          avg: AVG(MAX(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom), 0))
-          min: MIN(MAX(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom), 0))
-          max: MAX(MAX(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom), 0))
-          unit: (Req  + $normUnit)
-        Write and Atomic (Uncached):
-          avg: AVG((TCC_EA_WR_UNCACHED_32B_sum / $denom))
-          min: MIN((TCC_EA_WR_UNCACHED_32B_sum / $denom))
-          max: MAX((TCC_EA_WR_UNCACHED_32B_sum / $denom))
-          unit: (Req  + $normUnit)
-        Write and Atomic (64B):
-          avg: AVG((TCC_EA_WRREQ_64B_sum / $denom))
-          min: MIN((TCC_EA_WRREQ_64B_sum / $denom))
-          max: MAX((TCC_EA_WRREQ_64B_sum / $denom))
-          unit: (Req  + $normUnit)
-        HBM Write and Atomic:
-          avg: AVG((TCC_EA_WRREQ_DRAM_sum / $denom))
-          min: MIN((TCC_EA_WRREQ_DRAM_sum / $denom))
-          max: MAX((TCC_EA_WRREQ_DRAM_sum / $denom))
-          unit: (Req  + $normUnit)
-        Remote Write and Atomic:
-          avg: AVG((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
-          min: MIN((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
-          max: MAX((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
-          unit: (Req  + $normUnit)
-        Atomic:
-          avg: AVG((TCC_EA_ATOMIC_sum / $denom))
-          min: MIN((TCC_EA_ATOMIC_sum / $denom))
-          max: MAX((TCC_EA_ATOMIC_sum / $denom))
-          unit: (Req  + $normUnit)
@@ -2,10 +2,6 @@
 Panel Config:
  id: 1800
  title: L2 Cache (per Channel)
-  metrics_description:
-    L2 Cache Hit Rate: The percent of total number of requests to the L2 from all
-      clients that hit in the cache. As noted in the Speed-of-Light section, this
-      includes hit-on-miss requests.
  data source:
  - metric_table:
      id: 1801
@@ -321,3 +317,7 @@ Panel Config:
          ::_1: $total_l2_chan
      cli_style: simple_box
      tui_style: simple_box
+  metrics_description:
+    L2 Cache Hit Rate: The percent of the total number of requests to the L2 from all
+      clients that hit in the cache. As noted in the Speed-of-Light section, this
+      includes hit-on-miss requests.
@@ -2,10 +2,10 @@
 Panel Config:
  id: 2100
  title: PC Sampling
-  metrics_description: {}
  data source:
  - pc_sampling_table:
      id: 2101
      title: PC Sampling
      source: ps_file
      comparable: false
+  metrics_description: {}
@@ -2,7 +2,6 @@
 Panel Config:
  id: 0
  title: Top Stats
-  metrics_description: {}
  data source:
  - raw_csv_table:
      id: 1
@@ -12,3 +11,4 @@ Panel Config:
      id: 2
      title: Dispatch List
      source: pmc_dispatch_info.csv
+  metrics_description: {}
@@ -2,10 +2,10 @@
 Panel Config:
  id: 100
  title: System Info
-  metrics_description: {}
  data source:
  - raw_csv_table:
      id: 101
      title: System Info
      source: sysinfo.csv
      columnwise: true
+  metrics_description: {}
@@ -2,124 +2,6 @@
 Panel Config:
  id: 200
  title: System Speed-of-Light
-  metrics_description:
-    VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
-      This is also presented as a percent of the peak theoretical FLOPs achievable
-      on the specific accelerator. Note: this does not include any floating-point
-      operations from MFMA instructions.'
-    VALU IOPs: 'The total integer operations executed per second on the VALU. This
-      is also presented as a percent of the peak theoretical IOPs achievable on the
-      specific accelerator. Note: this does not include any integer operations from
-      MFMA instructions.'
-    MFMA FLOPs (F8): The total number of 8-bit brain floating point MFMA operations
-      executed per second. This does not include any 16-bit brain floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F8 MFMA operations achievable on the specific accelerator. It is supported on
-      AMD Instinct MI300 series and later only.
-    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
-      executed per second. Note: this does not include any 16-bit brain floating point
-      operations from VALU instructions. This is also presented as a percent of the
-      peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
-      per second. Note: this does not include any 16-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F16 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
-      per second. Note: this does not include any 32-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F32 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
-      per second. Note: this does not include any 64-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F64 MFMA operations achievable on the specific accelerator.'
-    MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
-      per second. Note: this does not include any 8-bit integer operations from VALU
-      instructions. This is also presented as a percent of the peak theoretical INT8
-      MFMA operations achievable on the specific accelerator.'
-    Active CUs: Total number of active compute units (CUs) on the accelerator during
-      the kernel execution.
-    SALU Utilization: Indicates what percent of the kernel's duration the SALU was
-      busy executing instructions. Computed as the ratio of the total number of cycles
-      spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
-    VALU Utilization: Indicates what percent of the kernel's duration the VALU was
-      busy executing instructions. Does not include VMEM operations. Computed as the
-      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
-      over the total CU cycles.
-    MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
-      was busy executing instructions. Computed as the ratio of the total number of
-      cycles the MFMA was busy over the total CU cycles.
-    VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
-      was busy executing instructions, including both global/generic and spill/scratch
-      operations (see the VMEM instruction count metrics) for more detail). Does not
-      include VALU operations. Computed as the ratio of the total number of cycles
-      spent by the scheduler issuing VMEM instructions over the total CU cycles.
-    Branch Utilization: Indicates what percent of the kernel's duration the branch
-      unit was busy executing instructions. Computed as the ratio of the total number
-      of cycles spent by the scheduler issuing branch instructions over the total
-      CU cycles
-    VALU Active Threads: Indicates the average level of divergence within a wavefront
-      over the lifetime of the kernel. The number of work-items that were active in
-      a wavefront during execution of each VALU instruction, time-averaged over all
-      VALU instructions run on all wavefronts in the kernel.
-    IPC: The ratio of the total number of instructions executed on the CU over the
-      total active CU cycles. This is also presented as a percent of the peak theoretical
-      bandwidth achievable on the specific accelerator.
-    Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
-      over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
-      kernels (less than 1ms). This is also presented as a percent of the peak theoretical
-      occupancy achievable on the specific accelerator.'
-    Theoretical LDS Bandwidth: Indicates the maximum amount of bytes that could have
-      been loaded from, stored to, or atomically updated in the LDS per unit time
-      (see LDS Bandwidth example for more detail). This is also presented as a percent
-      of the peak theoretical F64 MFMA operations achievable on the specific accelerator.
-    LDS Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS
-      scheduler due to bank conflicts (as determined by the conflict resolution hardware)
-      to the base number of cycles that would be spent in the LDS scheduler in a completely
-      uncontended case. This is also presented in normalized form (i.e., the Bank
-      Conflict Rate).
-    vL1D Cache Hit Rate: The ratio of the number of vL1D cache line requests that
-      hit in vL1D cache over the total number of cache line requests to the vL1D cache
-      RAM.
-    vL1D Cache BW: The number of bytes looked up in the vL1D cache as a result of
-      VMEM instructions per unit time. The number of bytes is calculated as the number
-      of cache lines requested multiplied by the cache line size. This value does
-      not consider partial requests, so e.g., if only a single value is requested
-      in a cache line, the data movement will still be counted as a full cache line.
-      This is also presented as a percent of the peak theoretical bandwidth achievable
-      on the specific accelerator.
-    L2 Cache Hit Rate: The ratio of the number of L2 cache line requests that hit
-      in the L2 cache over the total number of incoming cache line requests to the
-      L2 cache.
-    L2 Cache BW: The number of bytes looked up in the L2 cache per unit time. The
-      number of bytes is calculated as the number of cache lines requested multiplied
-      by the cache line size. This value does not consider partial requests, so e.g.,
-      if only a single value is requested in a cache line, the data movement will
-      still be counted as a full cache line. This is also presented as a percent of
-      the peak theoretical bandwidth achievable on the specific accelerator.
-    L2-Fabric Read BW: "The number of bytes read by the L2 over the Infinity Fabric\u2122\
-      \ interface per unit time. This is also presented as a percent of the peak theoretical\
-      \ bandwidth achievable on the specific accelerator."
-    L2-Fabric Write BW: The number of bytes sent by the L2 over the Infinity Fabric
-      interface by write and atomic operations per unit time. This is also presented
-      as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
-    L2-Fabric Read Latency: The time-averaged number of cycles read requests spent
-      in Infinity Fabric before data was returned to the L2.
-    L2-Fabric Write Latency: The time-averaged number of cycles write requests spent
-      in Infinity Fabric before a completion acknowledgement was returned to the L2.
-    sL1D Cache Hit Rate: The percent of sL1D requests that hit on a previously loaded
-      line the cache. Calculated as the ratio of the number of sL1D requests that
-      hit over the number of all sL1D requests.
-    sL1D Cache BW: The number of bytes looked up in the sL1D cache per unit time.
-      This is also presented as a percent of the peak theoretical bandwidth achievable
-      on the specific accelerator.
-    L1I Hit Rate: The number of bytes looked up in the L1I cache per unit time. This
-      is also presented as a percent of the peak theoretical bandwidth achievable
-      on the specific accelerator.
-    L1I BW: The percent of L1I requests that hit on a previously loaded line the cache.
-      Calculated as the ratio of the number of L1I requests that hit over the number
-      of all L1I requests.
-    L1I Fetch Latency: The average number of cycles spent to fetch instructions to
-      a CU.
  data source:
  - metric_table:
      id: 201
@@ -344,3 +226,130 @@ Panel Config:
          peak: None
          pop: None
          coll_level: SQ_IFETCH_LEVEL
+  metrics_description:
+    VALU FLOPs: |-
+      The total floating-point operations executed per second on the VALU.
+      This is also presented as a percent of the peak theoretical FLOPs achievable
+      on the specific accelerator. Note: this does not include any floating-point
+      operations from MFMA instructions.
+    VALU IOPs: |-
+      The total integer operations executed per second on the VALU. This is
+      also presented as a percent of the peak theoretical IOPs achievable on the
+      specific accelerator. Note: this does not include any integer operations from
+      MFMA instructions.
+    MFMA FLOPs (F8): The total number of 8-bit brain floating point MFMA operations
+      executed per second. This does not include any 8-bit brain floating point operations
+      from VALU instructions. This is also presented as a percent of the peak theoretical
+      F8 MFMA operations achievable on the specific accelerator. It is supported on
+      AMD Instinct MI300 series and later only.
+    MFMA FLOPs (BF16): |-
+      The total number of 16-bit brain floating point MFMA operations executed
+      per second. Note: this does not include any 16-bit brain floating point operations
+      from VALU instructions. This is also presented as a percent of the peak theoretical
+      BF16 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F16): |-
+      The total number of 16-bit floating point MFMA operations executed per
+      second. Note: this does not include any 16-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F16 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F32): |-
+      The total number of 32-bit floating point MFMA operations executed per
+      second. Note: this does not include any 32-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F32 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F64): |-
+      The total number of 64-bit floating point MFMA operations executed per
+      second. Note: this does not include any 64-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F64 MFMA operations achievable on the specific accelerator.
+    MFMA IOPs (Int8): |-
+      The total number of 8-bit integer MFMA operations executed per second.
+      Note: this does not include any 8-bit integer operations from VALU instructions.
+      This is also presented as a percent of the peak theoretical INT8 MFMA operations
+      achievable on the specific accelerator.
+    Active CUs: Total number of active compute units (CUs) on the accelerator during
+      the kernel execution.
+    SALU Utilization: Indicates what percent of the kernel's duration the SALU was
+      busy executing instructions. Computed as the ratio of the total number of cycles
+      spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
+    VALU Utilization: Indicates what percent of the kernel's duration the VALU was
+      busy executing instructions. Does not include VMEM operations. Computed as the
+      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
+      over the total CU cycles.
+    MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
+      was busy executing instructions. Computed as the ratio of the total number of
+      cycles the MFMA was busy over the total CU cycles.
+    VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
+      was busy executing instructions, including both global/generic and spill/scratch
+      operations (see the VMEM instruction count metrics for more detail). Does not
+      include VALU operations. Computed as the ratio of the total number of cycles
+      spent by the scheduler issuing VMEM instructions over the total CU cycles.
+    Branch Utilization: Indicates what percent of the kernel's duration the branch
+      unit was busy executing instructions. Computed as the ratio of the total number
+      of cycles spent by the scheduler issuing branch instructions over the total
+      CU cycles.
+    VALU Active Threads: Indicates the average level of divergence within a wavefront
+      over the lifetime of the kernel. The number of work-items that were active in
+      a wavefront during execution of each VALU instruction, time-averaged over all
+      VALU instructions run on all wavefronts in the kernel.
+    IPC: The ratio of the total number of instructions executed on the CU over the
+      total active CU cycles. This is also presented as a percent of the peak theoretical
+      bandwidth achievable on the specific accelerator.
+    Wavefront Occupancy: |-
+      The time-averaged number of wavefronts resident on the accelerator over
+      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
+      kernels (less than 1ms). This is also presented as a percent of the peak theoretical
+      occupancy achievable on the specific accelerator.
+    Theoretical LDS Bandwidth: Indicates the maximum amount of bytes that could have
+      been loaded from, stored to, or atomically updated in the LDS per unit time
+      (see LDS Bandwidth example for more detail). This is also presented as a percent
+      of the peak theoretical LDS bandwidth achievable on the specific accelerator.
+    LDS Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS
+      scheduler due to bank conflicts (as determined by the conflict resolution hardware)
+      to the base number of cycles that would be spent in the LDS scheduler in a completely
+      uncontended case. This is also presented in normalized form (i.e., the Bank
+      Conflict Rate).
+    vL1D Cache Hit Rate: The ratio of the number of vL1D cache line requests that
+      hit in vL1D cache over the total number of cache line requests to the vL1D cache
+      RAM.
+    vL1D Cache BW: The number of bytes looked up in the vL1D cache as a result of
+      VMEM instructions per unit time. The number of bytes is calculated as the number
+      of cache lines requested multiplied by the cache line size. This value does
+      not consider partial requests, so e.g., if only a single value is requested
+      in a cache line, the data movement will still be counted as a full cache line.
+      This is also presented as a percent of the peak theoretical bandwidth achievable
+      on the specific accelerator.
+    L2 Cache Hit Rate: The ratio of the number of L2 cache line requests that hit
+      in the L2 cache over the total number of incoming cache line requests to the
+      L2 cache.
+    L2 Cache BW: The number of bytes looked up in the L2 cache per unit time. The
+      number of bytes is calculated as the number of cache lines requested multiplied
+      by the cache line size. This value does not consider partial requests, so e.g.,
+      if only a single value is requested in a cache line, the data movement will
+      still be counted as a full cache line. This is also presented as a percent of
+      the peak theoretical bandwidth achievable on the specific accelerator.
+    L2-Fabric Read BW: |-
+      The number of bytes read by the L2 over the Infinity Fabric™ interface
+      per unit time. This is also presented as a percent of the peak theoretical
+      bandwidth achievable on the specific accelerator.
+    L2-Fabric Write BW: The number of bytes sent by the L2 over the Infinity Fabric
+      interface by write and atomic operations per unit time. This is also presented
+      as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
+    L2-Fabric Read Latency: The time-averaged number of cycles read requests spent
+      in Infinity Fabric before data was returned to the L2.
+    L2-Fabric Write Latency: The time-averaged number of cycles write requests spent
+      in Infinity Fabric before a completion acknowledgement was returned to the L2.
+    sL1D Cache Hit Rate: The percent of sL1D requests that hit on a previously loaded
+      line in the cache. Calculated as the ratio of the number of sL1D requests that
+      hit over the number of all sL1D requests.
+    sL1D Cache BW: The number of bytes looked up in the sL1D cache per unit time.
+      This is also presented as a percent of the peak theoretical bandwidth achievable
+      on the specific accelerator.
+    L1I Hit Rate: The percent of L1I requests that hit on a previously loaded line in
+      the cache. Calculated as the ratio of the number of L1I requests that hit over
+      the number of all L1I requests.
+    L1I BW: The number of bytes looked up in the L1I cache per unit time. This is also
+      presented as a percent of the peak theoretical bandwidth achievable on the specific
+      accelerator.
+    L1I Fetch Latency: The average number of cycles spent to fetch instructions to
+      a CU.
@@ -2,122 +2,6 @@
 Panel Config:
  id: 300
  title: Memory Chart
-  metrics_description:
-    Wavefront Occupancy: Wavefronts per active CU.
-    Wave Life: Average number of cycles executing a wave.
-    SALU: Total Number of SALU (Scalar ALU) instructions issued per normalization
-      unit.
-    SMEM: Total number of SMEM (Scalar Memory Read) instructions issued normalization
-      unit.
-    VALU: The number of VALU (Vector ALU) instructions issued per normalization unit.
-    MFMA: Total number of MFMA (Matrix-Fused-Multiply-Add) instructions issued per
-      normalization unit.
-    VMEM: The number of VMEM (GPU Memory) read instructions issued (including FLAT/scratch
-      memory) per normalization unit.
-    LDS: The total number of LDS instructions (including, but not limited to, read/write/atomics
-      and HIP's __shfl instructions) executed per normalization unit.
-    GWS: Total number of GDS (global data sync) instructions issued per normalization
-      unit.
-    BR: Total number of BRANCH instructions issued per normalization unit.
-    Active CUs: Total number of active compute units (CUs) on the accelerator during
-      the kernel execution.
-    Num CUs: Total number of compute units (CUs) on the accelerator.
-    VGPR: 'The number of architected vector general-purpose registers allocated for
-      the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
-      by the compiler due to allocation granularity.'
-    SGPR: 'The number of scalar general-purpose registers allocated for the kernel,
-      see SALU. Note: this may not exactly match the number of SGPRs requested by
-      the compiler due to allocation granularity.'
-    LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
-      for this kernel. Note: This may also be larger than what was requested at compile
-      time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
-    Scratch Allocation: The number of bytes of scratch memory requested per work-item
-      for this kernel. Scratch memory is used for stack memory on the accelerator,
-      as well as for register spills and restores.
-    Wavefronts: The total number of wavefronts, summed over all workgroups, forming
-      this kernel launch.
-    Workgroups: The total number of workgroups forming this kernel launch.
-    LDS Req: The total number of LDS instructions (including, but not limited to,
-      read/write/atomics and HIP's __shfl instructions) executed per normalization
-      unit.
-    LDS Util: Indicates what percent of the kernel's duration the LDS was actively
-      executing instructions (including, but not limited to, load, store, atomic and
-      HIP's __shfl operations). Calculated as the ratio of the total number of cycles
-      LDS was active over the total CU cycles.
-    LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
-      / acknowledgment) required for an LDS instruction to complete.
-    VL1 Rd: The total number of incoming read requests from the address processing
-      unit after coalescing per normalization unit
-    VL1 Wr: The total number of incoming write requests from the address processing
-      unit after coalescing per normalization unit
-    VL1 Atomic: The total number of incoming atomic requests from the address processing
-      unit after coalescing per normalization unit
-    VL1 Hit: The ratio of the number of vL1D cache line requests that hit in vL1D
-      cache over the total number of cache line requests to the vL1D Cache RAM.
-    VL1 Lat: Calculated as the average number of cycles that a vL1D cache line request
-      spent in the vL1D cache pipeline.
-    VL1 Coalesce: Indicates how well memory instructions were coalesced by the address
-      processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
-      as the average number of thread-requests generated per instruction divided by
-      the ideal number of thread-requests per instruction.
-    VL1 Stall: The ratio of the number of cycles where the vL1D is stalled waiting
-      to issue a request for data to the L2 cache divided by the number of cycles
-      where the vL1D is active.
-    VL1_L2 Rd: The number of read requests for a vL1D cache line that were not satisfied
-      by the vL1D and must be retrieved from the to the L2 Cache per normalization
-      unit.
-    VL1_L2 Wr: The number of write requests to a vL1D cache line that were sent through
-      the vL1D to the L2 cache, per normalization unit.
-    VL1_L2 Atomic: The number of atomic requests that are sent through the vL1D to
-      the L2 cache, per normalization unit. This includes requests for atomics with,
-      and without return.
-    sL1D Rd: The total number of requests, of any size or type, made to the sL1D per
-      normalization unit.
-    sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
-      line, per normalization unit.
-    sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
-      unit.
-    sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
-      unit. Typically unused on current CDNA accelerators.
-    sL1D_L2 Atomic: The total number of atomic requests from sL1D to the L2, per normalization
-      unit. Typically unused on current CDNA accelerators.
-    IL1 Fetch: The total number of requests made to the L1I per normalization-unit.
-    IL1 Hit: The percent of L1I requests that hit on a previously loaded line the
-      cache. Calculated as the ratio of the number of L1I requests that hit over the
-      number of all L1I requests.
-    IL1 Lat: The average number of cycles spent to fetch instructions to a CU.
-    IL1_L2 Rd: The total number of requests across the L1I - L2 interface per normalization-unit.
-    L2 Rd: The total number of read requests to the L2 from all clients.
-    L2 Wr: The total number of write requests to the L2 from all clients.
-    L2 Atomic: The total number of atomic requests (with and without return) to the
-      L2 from all clients.
-    L2 Hit: The ratio of the number of L2 cache line requests that hit in the L2 cache
-      over the total number of incoming cache line requests to the L2 cache.
-    L2 Rd Lat: Calculated as the average number of cycles that the vL1D cache took
-      to issue and receive read requests from the L2 Cache. This number also includes
-      requests for atomics with return values.
-    L2 Wr Lat: Calculated as the average number of cycles that the vL1D cache took
-      to issue and receive acknowledgement of a write request to the L2 Cache. This
-      number also includes requests for atomics without return values.
-    Fabric_L2 Rd: Number of L2 cache - Infinity Fabric read requests (either 32-byte
-      or 64-byte) summed over TCC instances per normalization unit.
-    Fabric_L2 Wr: Number of L2 cache - Infinity Fabric write requests (either 32-byte
-      or 64-byte) summed over TCC instances per normalization unit.
-    Fabric_L2 Atomic: Number of L2 cache - Infinity Fabric write requests (either
-      32-byte or 64-byte) that are actually atomic requests summed over TCC instances
-      per normalization unit.
-    Fabric Rd Lat: The time-averaged number of cycles read requests spent in Infinity
-      Fabric before data was returned to the L2.
-    Fabric Wr Lat: The time-averaged number of cycles write requests spent in Infinity
-      Fabric before a completion acknowledgement was returned to the L2.
-    Fabric Atomic Lat: The time-averaged number of cycles atomic requests spent in
-      Infinity Fabric before a completion acknowledgement (atomic without return value)
-      or data (atomic with return value) was returned to the L2.
-    HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
-      of data from the accelerator's local HBM, per normalization unit.
-    HBM Wr: 'The total number of L2 requests to Infinity Fabric to write or atomically
-      update 32B or 64B of data in the accelerator''s local HBM, per normalization
-      unit. '
  data source:
  - metric_table:
      id: 301
@@ -244,13 +128,13 @@ Panel Config:
          value: ROUND(AVG((TCC_EA0_ATOMIC_sum / $denom)), 0)
        Fabric Rd Lat:
          value: ROUND(AVG(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
-            != 0) else  0)), 0)
+            != 0) else 0)), 0)
        Fabric Wr Lat:
          value: ROUND(AVG(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
-            != 0) else  0)), 0)
+            != 0) else 0)), 0)
        Fabric Atomic Lat:
          value: ROUND(AVG(((TCC_EA0_ATOMIC_LEVEL_sum / TCC_EA0_ATOMIC_sum) if (TCC_EA0_ATOMIC_sum
-            != 0) else  0)), 0)
+            != 0) else 0)), 0)
        HBM Rd:
          value: ROUND(AVG((TCC_EA0_RDREQ_DRAM_sum / $denom)), 0)
        HBM Wr:
@@ -258,3 +142,117 @@ Panel Config:
      comparable: false
      cli_style: mem_chart
      tui_style: mem_chart
+  metrics_description:
+    Wavefront Occupancy: Wavefronts per active CU.
+    Wave Life: Average number of cycles executing a wave.
+    SALU: Total Number of SALU (Scalar ALU) instructions issued per normalization
+      unit.
+    SMEM: Total number of SMEM (Scalar Memory Read) instructions issued per normalization
+      unit.
+    VALU: The number of VALU (Vector ALU) instructions issued per normalization unit.
+    MFMA: Total number of MFMA (Matrix-Fused-Multiply-Add) instructions issued per
+      normalization unit.
+    VMEM: The number of VMEM (GPU Memory) read instructions issued (including FLAT/scratch
+      memory) per normalization unit.
+    LDS: The total number of LDS instructions (including, but not limited to, read/write/atomics
+      and HIP's __shfl instructions) executed per normalization unit.
+    GWS: Total number of GDS (global data sync) instructions issued per normalization
+      unit.
+    BR: Total number of BRANCH instructions issued per normalization unit.
+    Active CUs: Total number of active compute units (CUs) on the accelerator during
+      the kernel execution.
+    Num CUs: Total number of compute units (CUs) on the accelerator.
+    VGPR: |-
+      The number of architected vector general-purpose registers allocated
+      for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
+      requested by the compiler due to allocation granularity.
+    SGPR: |-
+      The number of scalar general-purpose registers allocated for the kernel,
+      see SALU. Note: this may not exactly match the number of SGPRs requested by
+      the compiler due to allocation granularity.
+    LDS Allocation: |-
+      The number of bytes of LDS memory (or, shared memory) allocated for
+      this kernel. Note: This may also be larger than what was requested at compile
+      time due to both allocation granularity and dynamic per-dispatch LDS allocations.
+    Scratch Allocation: The number of bytes of scratch memory requested per work-item
+      for this kernel. Scratch memory is used for stack memory on the accelerator,
+      as well as for register spills and restores.
+    Wavefronts: The total number of wavefronts, summed over all workgroups, forming
+      this kernel launch.
+    Workgroups: The total number of workgroups forming this kernel launch.
+    LDS Req: The total number of LDS instructions (including, but not limited to,
+      read/write/atomics and HIP's __shfl instructions) executed per normalization
+      unit.
+    LDS Util: Indicates what percent of the kernel's duration the LDS was actively
+      executing instructions (including, but not limited to, load, store, atomic and
+      HIP's __shfl operations). Calculated as the ratio of the total number of cycles
+      LDS was active over the total CU cycles.
+    LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
+      / acknowledgment) required for an LDS instruction to complete.
+    VL1 Rd: The total number of incoming read requests from the address processing
+      unit after coalescing per normalization unit.
+    VL1 Wr: The total number of incoming write requests from the address processing
+      unit after coalescing per normalization unit.
+    VL1 Atomic: The total number of incoming atomic requests from the address processing
+      unit after coalescing per normalization unit.
+    VL1 Hit: The ratio of the number of vL1D cache line requests that hit in vL1D
+      cache over the total number of cache line requests to the vL1D Cache RAM.
+    VL1 Lat: Calculated as the average number of cycles that a vL1D cache line request
+      spent in the vL1D cache pipeline.
+    VL1 Coalesce: Indicates how well memory instructions were coalesced by the address
+      processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
+      as the average number of thread-requests generated per instruction divided by
+      the ideal number of thread-requests per instruction.
+    VL1 Stall: The ratio of the number of cycles where the vL1D is stalled waiting
+      to issue a request for data to the L2 cache divided by the number of cycles
+      where the vL1D is active.
+    VL1_L2 Rd: The number of read requests for a vL1D cache line that were not satisfied
+      by the vL1D and must be retrieved from the L2 cache per normalization
+      unit.
+    VL1_L2 Wr: The number of write requests to a vL1D cache line that were sent through
+      the vL1D to the L2 cache, per normalization unit.
+    VL1_L2 Atomic: The number of atomic requests that are sent through the vL1D to
+      the L2 cache, per normalization unit. This includes requests for atomics with
+      and without return.
+    sL1D Rd: The total number of requests, of any size or type, made to the sL1D per
+      normalization unit.
+    sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
+      line, per normalization unit.
+    sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
+      unit.
+    sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
+      unit. Typically unused on current CDNA accelerators.
+    sL1D_L2 Atomic: The total number of atomic requests from sL1D to the L2, per normalization
+      unit. Typically unused on current CDNA accelerators.
+    IL1 Fetch: The total number of requests made to the L1I per normalization-unit.
+    IL1 Hit: The percent of L1I requests that hit on a previously loaded line in the
+      cache. Calculated as the ratio of the number of L1I requests that hit over the
+      number of all L1I requests.
+    IL1 Lat: The average number of cycles spent to fetch instructions to a CU.
+    IL1_L2 Rd: The total number of requests across the L1I - L2 interface per normalization-unit.
+    L2 Rd: The total number of read requests to the L2 from all clients.
+    L2 Wr: The total number of write requests to the L2 from all clients.
+    L2 Atomic: The total number of atomic requests (with and without return) to the
+      L2 from all clients.
+    L2 Hit: The ratio of the number of L2 cache line requests that hit in the L2 cache
+      over the total number of incoming cache line requests to the L2 cache.
+    Fabric_L2 Rd: Number of L2 cache - Infinity Fabric read requests (either 32-byte
+      or 64-byte) summed over TCC instances per normalization unit.
+    Fabric_L2 Wr: Number of L2 cache - Infinity Fabric write requests (either 32-byte
+      or 64-byte) summed over TCC instances per normalization unit.
+    Fabric_L2 Atomic: Number of L2 cache - Infinity Fabric write requests (either
+      32-byte or 64-byte) that are actually atomic requests summed over TCC instances
+      per normalization unit.
+    Fabric Rd Lat: The time-averaged number of cycles read requests spent in Infinity
+      Fabric before data was returned to the L2.
+    Fabric Wr Lat: The time-averaged number of cycles write requests spent in Infinity
+      Fabric before a completion acknowledgement was returned to the L2.
+    Fabric Atomic Lat: The time-averaged number of cycles atomic requests spent in
+      Infinity Fabric before a completion acknowledgement (atomic without return value)
+      or data (atomic with return value) was returned to the L2.
+    HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
+      of data from the accelerator's local HBM, per normalization unit.
+    HBM Wr: |-
+      The total number of L2 requests to Infinity Fabric to write or atomically
+      update 32B or 64B of data in the accelerator's local HBM, per normalization
+      unit.
@@ -2,85 +2,6 @@
 Panel Config:
  id: 400
  title: Roofline
-  metrics_description:
-    VALU FLOPs (F16): 'The total 16-bit floating-point operations executed per second
-      on the VALU. This is presented with the value of the peak empirical F16 FLOPs
-      achievable on the specific accelerator. Note: this does not include any F16
-      operations from MFMA instructions.'
-    VALU FLOPs (F32): 'The total 32-bit floating-point operations executed per second
-      on the VALU. This is presented with the value of the peak empirical F32 FLOPs
-      achievable on the specific accelerator. Note: this does not include any F32
-      operations from MFMA instructions.'
-    VALU FLOPs (F64): 'The total 64-bit floating-point operations executed per second
-      on the VALU. This is presented with the value of the peak empirical F64 FLOPs
-      achievable on the specific accelerator. Note: this does not include any F64
-      operations from MFMA instructions.'
-    MFMA FLOPs (F8): The total number of 8-bit brain floating point MFMA operations
-      executed per second. This does not include any 16-bit brain floating point operations
-      from VALU instructions. The peak empirically measured F8 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison. It is supported
-      on AMD Instinct MI300 series and later only.
-    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
-      executed per second. Note: this does not include any 16-bit brain floating point
-      operations from VALU instructions. The peak empirically measured BF16 MFMA operations
-      achievable on the specific accelerator is displayed alongside for comparison.'
-    MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
-      per second. Note: this does not include any 16-bit floating point operations
-      from VALU instructions. The peak empirically measured F16 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison.'
-    MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
-      per second. Note: this does not include any 32-bit floating point operations
-      from VALU instructions. The peak empirically measured F32 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison.'
-    MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
-      per second. Note: this does not include any 64-bit floating point operations
-      from VALU instructions. The peak empirically measured F64 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison.'
-    MFMA FLOPs (F6F4): 'The total number of 4-bit and 6-bit floating point MFMA operations
-      executed per second. Note: this does not include any floating point operations
-      from VALU instructions. The peak empirically measured F6F4 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison. It is supported
-      on AMD Instinct MI350 series (gfx950) and later only.'
-    MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
-      per second. Note: this does not include any 8-bit integer operations from VALU
-      instructions. The peak empirically measured INT8 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison.'
-    HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth
-      Memory (HBM) per second. The peak empirically measured bandwidth achievable
-      on the specific accelerator is displayed alongside for comparison.
-    L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
-      The number of bytes is calculated as the number of cache lines requested multiplied
-      by the cache line size. This value does not consider partial requests, so e.g.,
-      if only a single value is requested in a cache line, the data movement will
-      still be counted as a full cache line. The peak empirically measured bandwidth
-      achievable on the specific accelerator is displayed alongside for comparison.
-    L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
-      of VMEM instructions per unit time. The number of bytes is calculated as the
-      number of cache lines requested multiplied by the cache line size. This value
-      does not consider partial requests, so e.g., if only a single value is requested
-      in a cache line, the data movement will still be counted as a full cache line.
-      The peak empirically measured bandwidth achievable on the specific accelerator
-      is displayed alongside for comparison.
-    LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded
-      from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
-      example for more detail). The peak empirically measured LDS bandwidth achievable
-      on the specific accelerator is displayed alongside for comparison.
-    AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
-      of total floating-point operations (FLOPs) to total bytes transferred between
-      the L1 cache and the processing units. This value is used as the x-coordinate
-      for the L1 roofline.
-    AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
-      of total floating-point operations (FLOPs) to total bytes transferred between
-      the L2 cache and the L1 cache. This value is used as the x-coordinate for the
-      L2 roofline.
-    AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
-      It is the ratio of total floating-point operations (FLOPs) to total bytes transferred
-      between HBM and the L2 cache. This value is used as the x-coordinate for the
-      HBM roofline.
-    Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs
-      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
-      operations divided by the total execution time. This value is used as the y-coordinate
-      for the kernel's point on the Roofline plot.
  data source:
  - metric_table:
      id: 401
@@ -218,3 +139,91 @@ Panel Config:
            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
            * 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9
          unit: GFLOP/s
+  metrics_description:
+    VALU FLOPs (F16): |-
+      The total 16-bit floating-point operations executed per second on the VALU.
+      This is presented with the value of the peak empirical F16 FLOPs achievable
+      on the specific accelerator. Note: this does not include any F16 operations
+      from MFMA instructions.
+    VALU FLOPs (F32): |-
+      The total 32-bit floating-point operations executed per second on the VALU.
+      This is presented with the value of the peak empirical F32 FLOPs achievable
+      on the specific accelerator. Note: this does not include any F32 operations
+      from MFMA instructions.
+    VALU FLOPs (F64): |-
+      The total 64-bit floating-point operations executed per second on the VALU.
+      This is presented with the value of the peak empirical F64 FLOPs achievable
+      on the specific accelerator. Note: this does not include any F64 operations
+      from MFMA instructions.
+    MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
+      executed per second. This does not include any 8-bit floating point operations
+      from VALU instructions. The peak empirically measured F8 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison. It is supported
+      on AMD Instinct MI300 series and later only.
+    MFMA FLOPs (BF16): |-
+      The total number of 16-bit brain floating point MFMA operations executed
+      per second. Note: this does not include any 16-bit brain floating point
+      operations from VALU instructions. The peak empirically measured BF16 MFMA
+      operations achievable on the specific accelerator is displayed alongside
+      for comparison.
+    MFMA FLOPs (F16): |-
+      The total number of 16-bit floating point MFMA operations executed per
+      second. Note: this does not include any 16-bit floating point operations from
+      VALU instructions. The peak empirically measured F16 MFMA operations
+      achievable on the specific accelerator is displayed alongside for comparison.
+    MFMA FLOPs (F32): |-
+      The total number of 32-bit floating point MFMA operations executed per
+      second. Note: this does not include any 32-bit floating point operations from
+      VALU instructions. The peak empirically measured F32 MFMA operations
+      achievable on the specific accelerator is displayed alongside for comparison.
+    MFMA FLOPs (F64): |-
+      The total number of 64-bit floating point MFMA operations executed per
+      second. Note: this does not include any 64-bit floating point operations from
+      VALU instructions. The peak empirically measured F64 MFMA operations
+      achievable on the specific accelerator is displayed alongside for comparison.
+    MFMA IOPs (Int8): |-
+      The total number of 8-bit integer MFMA operations executed per second.
+      Note: this does not include any 8-bit integer operations from VALU instructions.
+      The peak empirically measured INT8 MFMA operations achievable on the specific
+      accelerator is displayed alongside for comparison.
+    HBM Bandwidth: |-
+      The total number of bytes read from and written to High-Bandwidth
+      Memory (HBM) per second. The peak empirically measured bandwidth achievable
+      on the specific accelerator is displayed alongside for comparison.
+    L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
+      The number of bytes is calculated as the number of cache lines requested multiplied
+      by the cache line size. This value does not consider partial requests, so e.g.,
+      if only a single value is requested in a cache line, the data movement will
+      still be counted as a full cache line. The peak empirically measured bandwidth
+      achievable on the specific accelerator is displayed alongside for comparison.
+    L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
+      of VMEM instructions per unit time. The number of bytes is calculated as the
+      number of cache lines requested multiplied by the cache line size. This value
+      does not consider partial requests, so e.g., if only a single value is requested
+      in a cache line, the data movement will still be counted as a full cache line.
+      The peak empirically measured bandwidth achievable on the specific accelerator
+      is displayed alongside for comparison.
+    LDS Bandwidth: Indicates the maximum number of bytes that could have been loaded
+      from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
+      example for more detail). The peak empirically measured LDS bandwidth achievable
+      on the specific accelerator is displayed alongside for comparison.
+    AI L1: |-
+      The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
+      of total floating-point operations (FLOPs) to total bytes transferred between
+      the L1 cache and the processing units. This value is used as the x-coordinate
+      for the L1 roofline.
+    AI L2: |-
+      The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
+      of total floating-point operations (FLOPs) to total bytes transferred between
+      the L2 cache and the L1 cache. This value is used as the x-coordinate for
+      the L2 roofline.
+    AI HBM: |-
+      The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
+      It is the ratio of total floating-point operations (FLOPs) to total bytes
+      transferred between HBM and the L2 cache. This value is used as the x-coordinate
+      for the HBM roofline.
+    Performance (GFLOPs): |-
+      The overall achieved performance, measured in GigaFLOPs
+      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
+      operations divided by the total execution time. This value is used as the y-coordinate
+      for the kernel's point on the Roofline plot.
@@ -2,30 +2,6 @@
 Panel Config:
  id: 500
  title: Command Processor (CPC/CPF)
-  metrics_description:
-    CPF Utilization: Percent of total cycles where the CPF was busy actively doing
-      any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
-    CPF Stall: Percent of CPF busy cycles where the CPF was stalled for any reason.
-    CPF-L2 Utilization: Percent of total cycles counted by the CPF-L2 interface where
-      the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
-      over total cycles counted by the CPF-L2.
-    CPF-L2 Stall: Percent of CPF-L2 L2 busy cycles where the CPF-L2 interface was
-      stalled for any reason.
-    CPF-UTCL1 Stall: Percent of CPF busy cycles where the CPF was stalled by address
-      translation.
-    CPC Utilization: Percent of total cycles where the CPC was busy actively doing
-      any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
-    CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
-    CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
-      for processing.
-    CPC-Workgroup Manager Utilization: Percent of CPC busy cycles spent dispatching
-      workgroups to the workgroup manager.
-    CPC-L2 Utilization: Percent of total cycles counted by the CPC-L2 interface where
-      the CPC-L2 interface was active doing any work.
-    CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
-      translation
-    CPC-UTCL2 Utilization: 'Percent of total cycles counted by the CPC''s L2 address
-      translation interface where the CPC was busy doing address translation work.  '
  data source:
  - metric_table:
      id: 501
@@ -143,3 +119,28 @@ Panel Config:
          max: MAX((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
            if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
          unit: pct
+  metrics_description:
+    CPF Utilization: Percent of total cycles where the CPF was busy actively doing
+      any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
+    CPF Stall: Percent of CPF busy cycles where the CPF was stalled for any reason.
+    CPF-L2 Utilization: Percent of total cycles counted by the CPF-L2 interface where
+      the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
+      over total cycles counted by the CPF-L2.
+    CPF-L2 Stall: Percent of CPF-L2 busy cycles where the CPF-L2 interface was
+      stalled for any reason.
+    CPF-UTCL1 Stall: Percent of CPF busy cycles where the CPF was stalled by address
+      translation.
+    CPC Utilization: Percent of total cycles where the CPC was busy actively doing
+      any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
+    CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
+    CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
+      for processing.
+    CPC-Workgroup Manager Utilization: Percent of CPC busy cycles spent dispatching
+      workgroups to the workgroup manager.
+    CPC-L2 Utilization: Percent of total cycles counted by the CPC-L2 interface where
+      the CPC-L2 interface was active doing any work.
+    CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
+      translation.
+    CPC-UTCL2 Utilization: |-
+      Percent of total cycles counted by the CPC's L2 address translation
+      interface where the CPC was busy doing address translation work.
@@ -2,61 +2,6 @@
 Panel Config:
  id: 600
  title: Workgroup Manager (SPI)
-  metrics_description:
-    Accelerator Utilization: The percent of cycles in the kernel where the accelerator
-      was actively doing any work.
-    Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
-      kernel where the scheduler-pipes were actively doing any work.
-    Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
-      manager was actively doing any work.
-    Shader Engine Utilization: The percent of total shader engine cycles in the kernel
-      where any CU in a shader-engine was actively doing any work, normalized over
-      all shader-engines. Low values (e.g., << 100%) indicate that the accelerator
-      was not fully saturated by the kernel, or a potential load-imbalance issue.
-    SIMD Utilization: The percent of total SIMD cycles in the kernel where any SIMD
-      on a CU was actively doing any work, summed over all CUs. Low values (less than
-      100%) indicate that the accelerator was not fully saturated by the kernel, or
-      a potential load-imbalance issue.
-    Dispatched Workgroups: The total number of workgroups forming this kernel launch.
-    Dispatched Wavefronts: The total number of wavefronts, summed over all workgroups,
-      forming this kernel launch.
-    VGPR Writes: The average number of cycles spent initializing VGPRs at wave creation.
-    SGPR Writes: The average number of cycles spent initializing SGPRs at wave creation.
-    Not-scheduled Rate (Workgroup Manager): The percent of total scheduler-pipe cycles
-      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
-      within the workgroup manager rather than a lack of a CU or SIMD with sufficient
-      resources.
-    Not-scheduled Rate (Scheduler-Pipe): 'The percent of total scheduler-pipe cycles
-      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
-      within the scheduler-pipes rather than a lack of a CU or SIMD with sufficient
-      resources. '
-    Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
-      where a workgroup could not be scheduled to a CU due to occupancy limitations
-      (like a lack of a CU or SIMD with sufficient resources).
-    Scratch Stall Rate: The percent of total shader-engine cycles in the kernel where
-      a workgroup could not be scheduled to a CU due to lack of private (a.k.a., scratch)
-      memory slots. While this can reach up to 100%, note that the actual occupancy
-      limitations on a kernel using private memory are typically quite small (for
-      example, less than 1% of the total number of waves that can be scheduled to
-      an accelerator).
-    Insufficient SIMD Waveslots: The percent of total SIMD cycles in the kernel where
-      a workgroup could not be scheduled to a SIMD due to lack of available waveslots.
-    Insufficient SIMD VGPRs: The percent of total SIMD cycles in the kernel where
-      a workgroup could not be scheduled to a SIMD due to lack of available VGPRs.
-    Insufficient SIMD SGPRs: The percent of total SIMD cycles in the kernel where
-      a workgroup could not be scheduled to a SIMD due to lack of available SGPRs.
-    Insufficient CU LDS: The percent of total CU cycles in the kernel where a workgroup
-      could not be scheduled to a CU due to lack of available LDS.
-    Insufficient CU Barriers: The percent of total CU cycles in the kernel where a
-      workgroup could not be scheduled to a CU due to lack of available barriers.
-    Reached CU Workgroup Limit: The percent of total CU cycles in the kernel where
-      a workgroup could not be scheduled to a CU due to limits within the workgroup
-      manager. This is expected to be always be zero on CDNA2 or newer accelerators
-      (and small for previous accelerators).
-    Reached CU Wavefront Limit: The percent of total CU cycles in the kernel where
-      a wavefront could not be scheduled to a CU due to limits within the workgroup
-      manager. This is expected to be always be zero on CDNA2 or newer accelerators
-      (and small for previous accelerators).
  data source:
  - metric_table:
      id: 601
@@ -199,3 +144,58 @@ Panel Config:
          min: MIN(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
+  metrics_description:
+    Accelerator Utilization: The percent of cycles in the kernel where the accelerator
+      was actively doing any work.
+    Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
+      kernel where the scheduler-pipes were actively doing any work.
+    Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
+      manager was actively doing any work.
+    Shader Engine Utilization: The percent of total shader engine cycles in the kernel
+      where any CU in a shader-engine was actively doing any work, normalized over
+      all shader-engines. Low values (e.g., << 100%) indicate that the accelerator
+      was not fully saturated by the kernel, or a potential load-imbalance issue.
+    SIMD Utilization: The percent of total SIMD cycles in the kernel where any SIMD
+      on a CU was actively doing any work, summed over all CUs. Low values (less than
+      100%) indicate that the accelerator was not fully saturated by the kernel, or
+      a potential load-imbalance issue.
+    Dispatched Workgroups: The total number of workgroups forming this kernel launch.
+    Dispatched Wavefronts: The total number of wavefronts, summed over all workgroups,
+      forming this kernel launch.
+    VGPR Writes: The average number of cycles spent initializing VGPRs at wave creation.
+    SGPR Writes: The average number of cycles spent initializing SGPRs at wave creation.
+    Not-scheduled Rate (Workgroup Manager): The percent of total scheduler-pipe cycles
+      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
+      within the workgroup manager rather than a lack of a CU or SIMD with sufficient
+      resources.
+    Not-scheduled Rate (Scheduler-Pipe): |-
+      The percent of total scheduler-pipe cycles in the kernel where a workgroup
+      could not be scheduled to a CU due to a bottleneck within the scheduler-pipes
+      rather than a lack of a CU or SIMD with sufficient resources.
+    Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
+      where a workgroup could not be scheduled to a CU due to occupancy limitations
+      (like a lack of a CU or SIMD with sufficient resources).
+    Scratch Stall Rate: The percent of total shader-engine cycles in the kernel where
+      a workgroup could not be scheduled to a CU due to lack of private (a.k.a., scratch)
+      memory slots. While this can reach up to 100%, note that the actual occupancy
+      limitations on a kernel using private memory are typically quite small (for
+      example, less than 1% of the total number of waves that can be scheduled to
+      an accelerator).
+    Insufficient SIMD Waveslots: The percent of total SIMD cycles in the kernel where
+      a workgroup could not be scheduled to a SIMD due to lack of available waveslots.
+    Insufficient SIMD VGPRs: The percent of total SIMD cycles in the kernel where
+      a workgroup could not be scheduled to a SIMD due to lack of available VGPRs.
+    Insufficient SIMD SGPRs: The percent of total SIMD cycles in the kernel where
+      a workgroup could not be scheduled to a SIMD due to lack of available SGPRs.
+    Insufficient CU LDS: The percent of total CU cycles in the kernel where a workgroup
+      could not be scheduled to a CU due to lack of available LDS.
+    Insufficient CU Barriers: The percent of total CU cycles in the kernel where a
+      workgroup could not be scheduled to a CU due to lack of available barriers.
+    Reached CU Workgroup Limit: The percent of total CU cycles in the kernel where
+      a workgroup could not be scheduled to a CU due to limits within the workgroup
+      manager. This is expected to always be zero on CDNA2 or newer accelerators
+      (and small for previous accelerators).
+    Reached CU Wavefront Limit: The percent of total CU cycles in the kernel where
+      a wavefront could not be scheduled to a CU due to limits within the workgroup
+      manager. This is expected to always be zero on CDNA2 or newer accelerators
+      (and small for previous accelerators).
@@ -2,63 +2,6 @@
 Panel Config:
  id: 700
  title: Wavefront
-  metrics_description:
-    Grid Size: The total number of work-items (or, threads) launched as a part of
-      the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied
-      by the total workgroup (or, block) size.
-    Workgroup Size: The total number of work-items (or, threads) in each workgroup
-      (or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
-      to the total block size.
-    Total Wavefronts: "The total number of wavefronts launched as part of the kernel\
-      \ dispatch. On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs,\
-      \ the wavefront size is always 64 work-items. Thus, the total number of wavefronts\
-      \ should be equivalent to the ceiling of grid size divided by 64."
-    Saved Wavefronts: The total number of wavefronts saved at a context-save.
-    Restored Wavefronts: The total number of wavefronts restored from a context-save.
-    VGPRs: 'The number of architected vector general-purpose registers allocated for
-      the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
-      by the compiler due to allocation granularity.'
-    AGPRs: 'The number of accumulation vector general-purpose registers allocated
-      for the kernel, see AGPRs. Note: this may not exactly match the number of AGPRs
-      requested by the compiler due to allocation granularity.'
-    SGPRs: 'The number of scalar general-purpose registers allocated for the kernel,
-      see SALU. Note: this may not exactly match the number of SGPRs requested by
-      the compiler due to allocation granularity.'
-    LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
-      for this kernel. Note: This may also be larger than what was requested at compile
-      time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
-    Scratch Allocation: The number of bytes of scratch memory requested per work-item
-      for this kernel. Scratch memory is used for stack memory on the accelerator,
-      as well as for register spills and restores.
-    Kernel Time: The total duration of the executed kernel.
-    Kernel Time (Cycles): The total duration of the executed kernel in cycles.
-    Instructions per wavefront: The average number of instructions (of all types)
-      executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.
-    Wave Cycles: The number of cycles a wavefront in the kernel dispatch spent resident
-      on a compute unit per normalization unit. This is averaged over all wavefronts
-      in a kernel dispatch.
-    Dependency Wait Cycles: The number of cycles a wavefront in the kernel dispatch
-      spent resident on a compute unit per normalization unit. This is averaged over
-      all wavefronts in a kernel dispatch.
-    Issue Wait Cycles: The number of cycles a wavefront in the kernel dispatch was
-      unable to issue an instruction for any reason (e.g., execution pipe back-pressure,
-      arbitration loss, etc.) per normalization unit. This counter is incremented
-      at every cycle by all wavefronts on a CU unable to issue an instruction. As
-      such, it is most useful to get a sense of how waves were spending their time,
-      rather than identification of a precise limiter because another wave could be
-      actively executing while a wave is issue stalled. The sum of this metric, Dependency
-      Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
-    Active Cycles: The average number of cycles a wavefront in the kernel dispatch
-      was actively executing instructions per normalization unit. This measurement
-      is made on a per-wavefront basis, and may include cycles that another wavefront
-      spent actively executing (on another execution unit, for example) or was stalled.
-      As such, it is most useful to get a sense of how waves were spending their time,
-      rather than identification of a precise limiter. The sum of this metric, Issue
-      Wait Cycles and Active Wait Cycles should be equal to the total Wave Cycles
-      metric.
-    Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
-      over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
-      kernels (less than 1ms).'
  data source:
  - metric_table:
      id: 701
@@ -171,3 +114,66 @@ Panel Config:
          max: MAX((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
          unit: Wavefronts
          coll_level: SQ_LEVEL_WAVES
+  metrics_description:
+    Grid Size: The total number of work-items (or, threads) launched as a part of
+      the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied
+      by the total workgroup (or, block) size.
+    Workgroup Size: The total number of work-items (or, threads) in each workgroup
+      (or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
+      to the total block size.
+    Total Wavefronts: |-
+      The total number of wavefronts launched as part of the kernel dispatch.
+      On AMD Instinct™ CDNA™ accelerators and GCN™ GPUs, the wavefront
+      size is always 64 work-items. Thus, the total number of wavefronts should
+      be equivalent to the ceiling of grid size divided by 64.
+    Saved Wavefronts: The total number of wavefronts saved at a context-save.
+    Restored Wavefronts: The total number of wavefronts restored from a context-save.
+    VGPRs: |-
+      The number of architected vector general-purpose registers allocated
+      for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
+      requested by the compiler due to allocation granularity.
+    AGPRs: |-
+      The number of accumulation vector general-purpose registers allocated
+      for the kernel, see AGPRs. Note: this may not exactly match the number of
+      AGPRs requested by the compiler due to allocation granularity.
+    SGPRs: |-
+      The number of scalar general-purpose registers allocated for the kernel,
+      see SALU. Note: this may not exactly match the number of SGPRs requested by
+      the compiler due to allocation granularity.
+    LDS Allocation: |-
+      The number of bytes of LDS memory (or, shared memory) allocated for
+      this kernel. Note: This may also be larger than what was requested at compile
+      time due to both allocation granularity and dynamic per-dispatch LDS allocations.
+    Scratch Allocation: The number of bytes of scratch memory requested per work-item
+      for this kernel. Scratch memory is used for stack memory on the accelerator,
+      as well as for register spills and restores.
+    Kernel Time: The total duration of the executed kernel.
+    Kernel Time (Cycles): The total duration of the executed kernel in cycles.
+    Instructions per wavefront: The average number of instructions (of all types)
+      executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.
+    Wave Cycles: The number of cycles a wavefront in the kernel dispatch spent resident
+      on a compute unit per normalization unit. This is averaged over all wavefronts
+      in a kernel dispatch.
+    Dependency Wait Cycles: The number of cycles a wavefront in the kernel dispatch
+      stalled waiting on memory of any kind (e.g., instruction fetch, vector or scalar
+      memory) per normalization unit. This is averaged over all wavefronts in a kernel dispatch.
+    Issue Wait Cycles: The number of cycles a wavefront in the kernel dispatch was
+      unable to issue an instruction for any reason (e.g., execution pipe back-pressure,
+      arbitration loss, etc.) per normalization unit. This counter is incremented
+      at every cycle by all wavefronts on a CU unable to issue an instruction. As
+      such, it is most useful to get a sense of how waves were spending their time,
+      rather than identification of a precise limiter because another wave could be
+      actively executing while a wave is issue stalled. The sum of this metric, Dependency
+      Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
+    Active Cycles: The average number of cycles a wavefront in the kernel dispatch
+      was actively executing instructions per normalization unit. This measurement
+      is made on a per-wavefront basis, and may include cycles that another wavefront
+      spent actively executing (on another execution unit, for example) or was stalled.
+      As such, it is most useful to get a sense of how waves were spending their time,
+      rather than identification of a precise limiter. The sum of this metric, Issue
+      Wait Cycles and Dependency Wait Cycles should be equal to the total Wave Cycles
+      metric.
+    Wavefront Occupancy: |-
+      The time-averaged number of wavefronts resident on the accelerator over
+      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
+      kernels (less than 1 ms).
@@ -2,90 +2,6 @@
 Panel Config:
  id: 1000
  title: Compute Units - Instruction Mix
-  metrics_description:
-    VALU: The total number of vector arithmetic logic unit (VALU) operations issued.
-      These are the workhorses of the compute unit, and are used to execute a wide
-      range of instruction types including floating point operations, non-uniform
-      address calculations, transcendental operations, integer operations, shifts,
-      conditional evaluation, etc.
-    VMEM: The total number of vector memory operations issued. These include most
-      loads, stores and atomic operations and all accesses to generic, global, private
-      and texture memory.
-    LDS: The total number of LDS (also known as shared memory) operations issued.
-      These include loads, stores, atomics, and HIP's __shfl operations.
-    MFMA: The total number of matrix fused multiply-add instructions issued.
-    SALU: The total number of scalar arithmetic logic unit (SALU) operations issued.
-      Typically these are used for address calculations, literal constants, and other
-      operations that are provably uniform across a wavefront. Although scalar memory
-      (SMEM) operations are issued by the SALU, they are counted separately in this
-      section.
-    SMEM: The total number of scalar memory (SMEM) operations issued. These are typically
-      used for loading kernel arguments, base-pointers and loads from HIP's __constant__
-      memory.
-    Branch: The total number of branch operations issued. These typically consist
-      of jump or branch operations and are used to implement control flow.
-    INT32: The total number of instructions operating on 32-bit integer operands issued
-      to the VALU per normalization unit.
-    INT64: The total number of instructions operating on 64-bit integer operands issued
-      to the VALU per normalization unit.
-    F16-ADD: The total number of addition instructions operating on 16-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F16-MUL: The total number of multiplication instructions operating on 16-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F16-FMA: The total number of fused multiply-add instructions operating on 16-bit
-      floating-point operands issued to the VALU per normalization unit.
-    F16-Trans: The total number of transcendental instructions (e.g., sqrt) operating
-      on 16-bit floating-point operands issued to the VALU per normalization unit.
-    F32-ADD: The total number of addition instructions operating on 32-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F32-MUL: The total number of multiplication instructions operating on 32-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F32-FMA: The total number of fused multiply-add instructions operating on 32-bit
-      floating-point operands issued to the VALU per normalization unit.
-    F32-Trans: The total number of transcendental instructions (such as sqrt) operating
-      on 32-bit floating-point operands issued to the VALU per normalization unit.
-    F64-ADD: The total number of addition instructions operating on 64-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F64-MUL: The total number of multiplication instructions operating on 64-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F64-FMA: The total number of fused multiply-add instructions operating on 64-bit
-      floating-point operands issued to the VALU per normalization unit.
-    F64-Trans: The total number of transcendental instructions (such as sqrt) operating
-      on 64-bit floating-point operands issued to the VALU per normalization unit.
-    Conversion: "The total number of type conversion instructions (such as converting\
-      \ data to or from F32\u2194F64) issued to the VALU per normalization unit."
-    Global/Generic Instr: The total number of global & generic memory instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Global/Generic Read: The total number of global & generic memory read instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Global/Generic Write: The total number of global & generic memory write instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Global/Generic Atomic: The total number of global & generic memory atomic (with
-      and without return) instructions executed on all compute units on the accelerator,
-      per normalization unit.
-    Spill/Stack Instr: The total number of spill/stack memory instructions executed
-      on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Read: The total number of spill/stack memory read instructions executed
-      on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Write: The total number of spill/stack memory write instructions executed
-      on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Atomic: The total number of spill/stack memory atomic (with and without
-      return) instructions executed on all compute units on the accelerator, per normalization
-      unit. Typically unused as these memory operations are typically used to implement
-      thread-local storage.
-    MFMA-I8: The total number of 8-bit integer MFMA instructions issued per normalization
-      unit.
-    MFMA-F8: The total number of 8-bit floating point MFMA instructions issued per
-      normalization unit. This is supported in AMD Instinct MI300 series and later
-      only.
-    MFMA-F16: The total number of 16-bit floating point MFMA instructions issued per
-      normalization unit.
-    MFMA-BF16: The total number of 16-bit brain floating point MFMA instructions issued
-      per normalization unit.
-    MFMA-F32: The total number of 32-bit floating-point MFMA instructions issued per
-      normalization unit.
-    MFMA-F64: The total number of 64-bit floating-point MFMA instructions issued per
-      normalization unit.
  data source:
  - metric_table:
      id: 1001
@@ -307,3 +223,88 @@ Panel Config:
          min: MIN((SQ_INSTS_VALU_MFMA_F64 / $denom))
          max: MAX((SQ_INSTS_VALU_MFMA_F64 / $denom))
          unit: (instr + $normUnit)
+  metrics_description:
+    VALU: The total number of vector arithmetic logic unit (VALU) operations issued.
+      These are the workhorses of the compute unit, and are used to execute a wide
+      range of instruction types including floating point operations, non-uniform
+      address calculations, transcendental operations, integer operations, shifts,
+      conditional evaluation, etc.
+    VMEM: The total number of vector memory operations issued. These include most
+      loads, stores and atomic operations and all accesses to generic, global, private
+      and texture memory.
+    LDS: The total number of LDS (also known as shared memory) operations issued.
+      These include loads, stores, atomics, and HIP's __shfl operations.
+    MFMA: The total number of matrix fused multiply-add instructions issued.
+    SALU: The total number of scalar arithmetic logic unit (SALU) operations issued.
+      Typically these are used for address calculations, literal constants, and other
+      operations that are provably uniform across a wavefront. Although scalar memory
+      (SMEM) operations are issued by the SALU, they are counted separately in this
+      section.
+    SMEM: The total number of scalar memory (SMEM) operations issued. These are typically
+      used for loading kernel arguments and base pointers, and for loads from HIP's
+      __constant__ memory.
+    Branch: The total number of branch operations issued. These typically consist
+      of jump or branch operations and are used to implement control flow.
+    INT32: The total number of instructions operating on 32-bit integer operands issued
+      to the VALU per normalization unit.
+    INT64: The total number of instructions operating on 64-bit integer operands issued
+      to the VALU per normalization unit.
+    F16-ADD: The total number of addition instructions operating on 16-bit floating-point
+      operands issued to the VALU per normalization unit.
+    F16-MUL: The total number of multiplication instructions operating on 16-bit floating-point
+      operands issued to the VALU per normalization unit.
+    F16-FMA: The total number of fused multiply-add instructions operating on 16-bit
+      floating-point operands issued to the VALU per normalization unit.
+    F16-Trans: The total number of transcendental instructions (such as sqrt) operating
+      on 16-bit floating-point operands issued to the VALU per normalization unit.
+    F32-ADD: The total number of addition instructions operating on 32-bit floating-point
+      operands issued to the VALU per normalization unit.
+    F32-MUL: The total number of multiplication instructions operating on 32-bit floating-point
+      operands issued to the VALU per normalization unit.
+    F32-FMA: The total number of fused multiply-add instructions operating on 32-bit
+      floating-point operands issued to the VALU per normalization unit.
+    F32-Trans: The total number of transcendental instructions (such as sqrt) operating
+      on 32-bit floating-point operands issued to the VALU per normalization unit.
+    F64-ADD: The total number of addition instructions operating on 64-bit floating-point
+      operands issued to the VALU per normalization unit.
+    F64-MUL: The total number of multiplication instructions operating on 64-bit floating-point
+      operands issued to the VALU per normalization unit.
+    F64-FMA: The total number of fused multiply-add instructions operating on 64-bit
+      floating-point operands issued to the VALU per normalization unit.
+    F64-Trans: The total number of transcendental instructions (such as sqrt) operating
+      on 64-bit floating-point operands issued to the VALU per normalization unit.
+    Conversion: |-
+      The total number of type conversion instructions (such as F32↔F64
+      conversions) issued to the VALU per normalization unit.
+    Global/Generic Instr: The total number of global & generic memory instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Global/Generic Read: The total number of global & generic memory read instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Global/Generic Write: The total number of global & generic memory write instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Global/Generic Atomic: The total number of global & generic memory atomic (with
+      and without return) instructions executed on all compute units on the accelerator,
+      per normalization unit.
+    Spill/Stack Instr: The total number of spill/stack memory instructions executed
+      on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Read: The total number of spill/stack memory read instructions executed
+      on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Write: The total number of spill/stack memory write instructions executed
+      on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Atomic: The total number of spill/stack memory atomic (with and without
+      return) instructions executed on all compute units on the accelerator, per normalization
+      unit. Typically unused, as these memory operations are generally used to implement
+      thread-local storage.
+    MFMA-I8: The total number of 8-bit integer MFMA instructions issued per normalization
+      unit.
+    MFMA-F8: The total number of 8-bit floating-point MFMA instructions issued per
+      normalization unit. Supported only on AMD Instinct MI300 series and later
+      accelerators.
+    MFMA-F16: The total number of 16-bit floating point MFMA instructions issued per
+      normalization unit.
+    MFMA-BF16: The total number of 16-bit brain floating point MFMA instructions issued
+      per normalization unit.
+    MFMA-F32: The total number of 32-bit floating-point MFMA instructions issued per
+      normalization unit.
+    MFMA-F64: The total number of 64-bit floating-point MFMA instructions issued per
+      normalization unit.
@@ -2,84 +2,6 @@
 Panel Config:
  id: 1100
  title: Compute Units - Compute Pipeline
-  metrics_description:
-    VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
-      This is also presented as a percent of the peak theoretical FLOPs achievable
-      on the specific accelerator. Note: this does not include any floating-point
-      operations from MFMA instructions.'
-    VALU IOPs: 'The total integer operations executed per second on the VALU. This
-      is also presented as a percent of the peak theoretical IOPs achievable on the
-      specific accelerator. Note: this does not include any integer operations from
-      MFMA instructions.'
-    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
-      executed per second. Note: this does not include any 16-bit brain floating point
-      operations from VALU instructions. This is also presented as a percent of the
-      peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
-      per second. Note: this does not include any 16-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F16 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
-      per second. Note: this does not include any 32-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F32 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
-      per second. Note: this does not include any 64-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F64 MFMA operations achievable on the specific accelerator.'
-    MFMA IOPs (INT8): 'The total number of 8-bit integer MFMA operations executed
-      per second. Note: this does not include any 8-bit integer operations from VALU
-      instructions. This is also presented as a percent of the peak theoretical INT8
-      MFMA operations achievable on the specific accelerator.'
-    IPC: The ratio of the total number of instructions executed on the CU over the
-      total active CU cycles.
-    IPC (Issued): The ratio of the total number of (non-internal) instructions issued
-      over the number of cycles where the scheduler was actively working on issuing
-      instructions.
-    SALU Utilization: Indicates what percent of the kernel's duration the SALU was
-      busy executing instructions. Computed as the ratio of the total number of cycles
-      spent by the scheduler issuing SALU / SMEM instructions over the total CU cycles.
-    VALU Utilization: Indicates what percent of the kernel's duration the VALU was
-      busy executing instructions. Does not include VMEM operations. Computed as the
-      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
-      over the total CU cycles.
-    VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
-      was busy executing instructions, including both global/generic and spill/scratch
-      operations (see the VMEM instruction count metrics for more detail). Does not
-      include VALU operations. Computed as the ratio of the total number of cycles
-      spent by the scheduler issuing VMEM instructions over the total CU cycles.
-    Branch Utilization: Indicates what percent of the kernel's duration the branch
-      unit was busy executing instructions. Computed as the ratio of the total number
-      of cycles spent by the scheduler issuing branch instructions over the total
-      CU cycles.
-    VALU Active Threads: Indicates the average level of divergence within a wavefront
-      over the lifetime of the kernel. The number of work-items that were active in
-      a wavefront during execution of each VALU instruction, time-averaged over all
-      VALU instructions run on all wavefronts in the kernel
-    MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
-      was busy executing instructions. Computed as the ratio of the total number of
-      cycles spent by the MFMA was busy over the total CU cycles.
-    MFMA Instruction Cycles: The average duration of MFMA instructions in this kernel
-      in cycles. Computed as the ratio of the total number of cycles the MFMA unit
-      was busy over the total number of MFMA instructions.
-    VMEM Latency: The average number of round-trip cycles (that is, from issue to
-      data return / acknowledgment) required for a VMEM instruction to complete.
-    SMEM Latency: The average number of round-trip cycles (that is, from issue to
-      data return / acknowledgment) required for a SMEM instruction to complete.
-    FLOPs (Total): The total number of floating-point operations executed on either
-      the VALU or MFMA units, per normalization unit.
-    IOPs (Total): The total number of integer operations executed on either the VALU
-      or MFMA units, per normalization unit.
-    F16 OPs: The total number of 16-bit floating-point operations executed on either
-      the VALU or MFMA units, per normalization unit.
-    BF16 OPs: The total number of 16-bit brain floating-point operations executed
-      on either the VALU or MFMA units, per normalization unit.
-    F32 OPs: The total number of 32-bit floating-point operations executed on either
-      the VALU or MFMA units, per normalization unit.
-    F64 OPs: The total number of 64-bit floating-point operations executed on either
-      the VALU or MFMA units, per normalization unit.
-    INT8 OPs: The total number of 8-bit integer operations executed on either the
-      VALU or MFMA units, per normalization unit.
  data source:
  - metric_table:
      id: 1101
@@ -165,13 +87,13 @@ Panel Config:
          unit: Instr/cycle
        IPC (Issued):
          avg: AVG(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
-            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED  + SQ_INSTS_LDS)
+            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
            / SQ_ACTIVE_INST_ANY))
          min: MIN(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
            / SQ_ACTIVE_INST_ANY))
          max: MAX(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
-            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED  + SQ_INSTS_LDS)
+            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
            / SQ_ACTIVE_INST_ANY))
          unit: Instr/cycle
        SALU Utilization:
@@ -271,7 +193,7 @@ Panel Config:
            + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
            + (SQ_INSTS_VALU_FMA_F64 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
            / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        IOPs (Total):
          avg: AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8
            * 512)) / $denom)
@@ -279,12 +201,12 @@ Panel Config:
            * 512)) / $denom)
          max: MAX(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8
            * 512)) / $denom)
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        F8 OPs:
          avg: AVG(((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) / $denom))
          min: MIN(((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) / $denom))
          max: MAX(((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        F16 OPs:
          avg: AVG(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16))
            + (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512
@@ -295,12 +217,12 @@ Panel Config:
          max: MAX(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16))
            + (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512
            * SQ_INSTS_VALU_MFMA_MOPS_F16)) / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        BF16 OPs:
          avg: AVG(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
          min: MIN(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
          max: MAX(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        F32 OPs:
          avg: AVG((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
            + (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
@@ -311,7 +233,7 @@ Panel Config:
          max: MAX((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
            + (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
            / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        F64 OPs:
          avg: AVG((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
            + (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
@@ -322,9 +244,94 @@ Panel Config:
          max: MAX((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
            + (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
            / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        INT8 OPs:
          avg: AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
          min: MIN(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
          max: MAX(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
+  metrics_description:
+    VALU FLOPs: |-
+      The total floating-point operations executed per second on the VALU.
+      This is also presented as a percent of the peak theoretical FLOPs achievable
+      on the specific accelerator. Note: this does not include any floating-point
+      operations from MFMA instructions.
+    VALU IOPs: |-
+      The total integer operations executed per second on the VALU. This is
+      also presented as a percent of the peak theoretical IOPs achievable on the
+      specific accelerator. Note: this does not include any integer operations from
+      MFMA instructions.
+    MFMA FLOPs (BF16): |-
+      The total number of 16-bit brain floating point MFMA operations executed
+      per second. Note: this does not include any 16-bit brain floating point operations
+      from VALU instructions. This is also presented as a percent of the peak theoretical
+      BF16 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F16): |-
+      The total number of 16-bit floating point MFMA operations executed per
+      second. Note: this does not include any 16-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F16 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F32): |-
+      The total number of 32-bit floating point MFMA operations executed per
+      second. Note: this does not include any 32-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F32 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F64): |-
+      The total number of 64-bit floating point MFMA operations executed per
+      second. Note: this does not include any 64-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F64 MFMA operations achievable on the specific accelerator.
+    MFMA IOPs (INT8): |-
+      The total number of 8-bit integer MFMA operations executed per second.
+      Note: this does not include any 8-bit integer operations from VALU instructions.
+      This is also presented as a percent of the peak theoretical INT8 MFMA operations
+      achievable on the specific accelerator.
+    IPC: The ratio of the total number of instructions executed on the CU over the
+      total active CU cycles.
+    IPC (Issued): The ratio of the total number of (non-internal) instructions issued
+      over the number of cycles where the scheduler was actively working on issuing
+      instructions.
+    SALU Utilization: Indicates what percent of the kernel's duration the SALU was
+      busy executing instructions. Computed as the ratio of the total number of cycles
+      spent by the scheduler issuing SALU / SMEM instructions over the total CU cycles.
+    VALU Utilization: Indicates what percent of the kernel's duration the VALU was
+      busy executing instructions. Does not include VMEM operations. Computed as the
+      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
+      over the total CU cycles.
+    VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
+      was busy executing instructions, including both global/generic and spill/scratch
+      operations (see the VMEM instruction count metrics for more detail). Does not
+      include VALU operations. Computed as the ratio of the total number of cycles
+      spent by the scheduler issuing VMEM instructions over the total CU cycles.
+    Branch Utilization: Indicates what percent of the kernel's duration the branch
+      unit was busy executing instructions. Computed as the ratio of the total number
+      of cycles spent by the scheduler issuing branch instructions over the total
+      CU cycles.
+    VALU Active Threads: Indicates the average level of divergence within a wavefront
+      over the lifetime of the kernel. The number of work-items that were active in
+      a wavefront during execution of each VALU instruction, time-averaged over all
+      VALU instructions run on all wavefronts in the kernel.
+    MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
+      was busy executing instructions. Computed as the ratio of the total number of
+      cycles the MFMA unit was busy over the total CU cycles.
+    MFMA Instruction Cycles: The average duration of MFMA instructions in this kernel
+      in cycles. Computed as the ratio of the total number of cycles the MFMA unit
+      was busy over the total number of MFMA instructions.
+    VMEM Latency: The average number of round-trip cycles (that is, from issue to
+      data return / acknowledgment) required for a VMEM instruction to complete.
+    SMEM Latency: The average number of round-trip cycles (that is, from issue to
+      data return / acknowledgment) required for a SMEM instruction to complete.
+    FLOPs (Total): The total number of floating-point operations executed on either
+      the VALU or MFMA units, per normalization unit.
+    IOPs (Total): The total number of integer operations executed on either the VALU
+      or MFMA units, per normalization unit.
+    F16 OPs: The total number of 16-bit floating-point operations executed on either
+      the VALU or MFMA units, per normalization unit.
+    BF16 OPs: The total number of 16-bit brain floating-point operations executed
+      on either the VALU or MFMA units, per normalization unit.
+    F32 OPs: The total number of 32-bit floating-point operations executed on either
+      the VALU or MFMA units, per normalization unit.
+    F64 OPs: The total number of 64-bit floating-point operations executed on either
+      the VALU or MFMA units, per normalization unit.
+    INT8 OPs: The total number of 8-bit integer operations executed on either the
+      VALU or MFMA units, per normalization unit.
@@ -2,51 +2,6 @@
 Panel Config:
  id: 1200
  title: Local Data Share (LDS)
-  metrics_description:
-    Utilization: Indicates what percent of the kernel's duration the LDS was actively
-      executing instructions (including, but not limited to, load, store, atomic and
-      HIP's __shfl operations). Calculated as the ratio of the total number of cycles
-      LDS was active over the total CU cycles.
-    Access Rate: Indicates the percentage of SIMDs in the VALU actively issuing LDS
-      instructions, averaged over the lifetime of the kernel. Calculated as the ratio
-      of the total number of cycles spent by the scheduler issuing LDS instructions
-      over the total CU cycles.
-    Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
-      could have been loaded from, stored to, or atomically updated in the LDS divided
-      as percentage of theoretical peak. Does not take into account the execution
-      mask of the wavefront when the instruction was executed.
-    Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
-      loaded from, stored to, or atomically updated in the LDS divided by total duration.
-      Does not take into account the execution mask of the wavefront when the instruction
-      was executed.
-    Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
-      servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing
-      bank conflicts over the number of LDS cycles that would have been required to
-      move the same amount of data in an uncontended access.
-    LDS Instructions: The total number of LDS instructions (including, but not limited
-      to, read/write/atomics and HIP's __shfl instructions) executed per normalization
-      unit.
-    LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
-      / acknowledgment) required for an LDS instruction to complete.
-    Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS scheduler
-      due to bank conflicts (as determined by the conflict resolution hardware) to
-      the base number of cycles that would be spent in the LDS scheduler in a completely
-      uncontended case. This is the unnormalized form of the Bank Conflict Rate.
-    Index Accesses: The total number of cycles spent in the LDS scheduler over all
-      operations per normalization unit.
-    Atomic Return Cycles: The total number of cycles spent on LDS atomics with return
-      per normalization unit.
-    Bank Conflict: The total number of cycles spent in the LDS scheduler due to bank
-      conflicts (as determined by the conflict resolution hardware) per normalization
-      unit.
-    Addr Conflict: The total number of cycles spent in the LDS scheduler due to address
-      conflicts (as determined by the conflict resolution hardware) per normalization
-      unit.
-    Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
-      stalls from non-dword aligned addresses per normalization unit.
-    Mem Violations: "The total number of out-of-bounds accesses made to the LDS, per\
-      \ normalization unit. This is unused and expected to be zero in most configurations\
-      \ for modern CDNA\u2122 accelerators."
  data source:
  - metric_table:
      id: 1201
@@ -87,7 +42,7 @@ Panel Config:
          avg: AVG((SQ_INSTS_LDS / $denom))
          min: MIN((SQ_INSTS_LDS / $denom))
          max: MAX((SQ_INSTS_LDS / $denom))
-          unit: (Instr  + $normUnit)
+          unit: (Instr + $normUnit)
        Theoretical Bandwidth:
          avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
            / (End_Timestamp - Start_Timestamp)))
@@ -117,29 +72,75 @@ Panel Config:
          avg: AVG((SQ_LDS_IDX_ACTIVE / $denom))
          min: MIN((SQ_LDS_IDX_ACTIVE / $denom))
          max: MAX((SQ_LDS_IDX_ACTIVE / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Atomic Return Cycles:
          avg: AVG((SQ_LDS_ATOMIC_RETURN / $denom))
          min: MIN((SQ_LDS_ATOMIC_RETURN / $denom))
          max: MAX((SQ_LDS_ATOMIC_RETURN / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Bank Conflict:
          avg: AVG((SQ_LDS_BANK_CONFLICT / $denom))
          min: MIN((SQ_LDS_BANK_CONFLICT / $denom))
          max: MAX((SQ_LDS_BANK_CONFLICT / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Addr Conflict:
          avg: AVG((SQ_LDS_ADDR_CONFLICT / $denom))
          min: MIN((SQ_LDS_ADDR_CONFLICT / $denom))
          max: MAX((SQ_LDS_ADDR_CONFLICT / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Unaligned Stall:
          avg: AVG((SQ_LDS_UNALIGNED_STALL / $denom))
          min: MIN((SQ_LDS_UNALIGNED_STALL / $denom))
          max: MAX((SQ_LDS_UNALIGNED_STALL / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Mem Violations:
          avg: AVG((SQ_LDS_MEM_VIOLATIONS / $denom))
          min: MIN((SQ_LDS_MEM_VIOLATIONS / $denom))
          max: MAX((SQ_LDS_MEM_VIOLATIONS / $denom))
          unit: (Accesses + $normUnit)
+  metrics_description:
+    Utilization: Indicates what percent of the kernel's duration the LDS was actively
+      executing instructions (including, but not limited to, load, store, atomic and
+      HIP's __shfl operations). Calculated as the ratio of the total number of cycles
+      LDS was active over the total CU cycles.
+    Access Rate: Indicates the percentage of SIMDs in the VALU actively issuing LDS
+      instructions, averaged over the lifetime of the kernel. Calculated as the ratio
+      of the total number of cycles spent by the scheduler issuing LDS instructions
+      over the total CU cycles.
+    Theoretical Bandwidth Utilization: Indicates the maximum number of bytes that
+      could have been loaded from, stored to, or atomically updated in the LDS, expressed
+      as a percentage of the theoretical peak. Does not take into account the execution
+      mask of the wavefront when the instruction was executed.
+    Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
+      loaded from, stored to, or atomically updated in the LDS divided by total duration.
+      Does not take into account the execution mask of the wavefront when the instruction
+      was executed.
+    Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
+      servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing
+      bank conflicts over the number of LDS cycles that would have been required to
+      move the same amount of data in an uncontended access.
+    LDS Instructions: The total number of LDS instructions (including, but not limited
+      to, read/write/atomics and HIP's __shfl instructions) executed per normalization
+      unit.
+    LDS Latency: The average number of round-trip cycles (that is, from issue to data
+      return / acknowledgment) required for an LDS instruction to complete.
+    Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS scheduler
+      due to bank conflicts (as determined by the conflict resolution hardware) to
+      the base number of cycles that would be spent in the LDS scheduler in a completely
+      uncontended case. This is the unnormalized form of the Bank Conflict Rate.
+    Index Accesses: The total number of cycles spent in the LDS scheduler over all
+      operations per normalization unit.
+    Atomic Return Cycles: The total number of cycles spent on LDS atomics with return
+      per normalization unit.
+    Bank Conflict: The total number of cycles spent in the LDS scheduler due to bank
+      conflicts (as determined by the conflict resolution hardware) per normalization
+      unit.
+    Addr Conflict: The total number of cycles spent in the LDS scheduler due to address
+      conflicts (as determined by the conflict resolution hardware) per normalization
+      unit.
+    Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
+      stalls from non-dword aligned addresses per normalization unit.
+    Mem Violations: |-
+      The total number of out-of-bounds accesses made to the LDS, per normalization
+      unit. This is unused and expected to be zero in most configurations for
+      modern CDNA™ accelerators.
@@ -2,28 +2,6 @@
 Panel Config:
  id: 1300
  title: Instruction Cache
-  metrics_description:
-    Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
-      of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
-      the total L1I cycles.
-    Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
-      loaded line the cache. Calculated as the ratio of the number of L1I requests
-      that hit over the number of all L1I requests.
-    L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\
-      \ L2 cache request bandwidth achieved. Calculated as the ratio of the total\
-      \ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\
-      \ cycles."
-    L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
-      divided by total duration.
-    Req: The total number of requests made to the L1I per normalization-unit
-    Hits: The total number of L1I requests that hit on a previously loaded cache line,
-      per normalization-unit.
-    Misses - Non Duplicated: The total number of L1I requests that missed on a cache
-      line that were not already pending due to another request, per normalization-unit.
-    Misses - Duplicated: The total number of L1I requests that missed on a cache line
-      that were already pending due to another request, per normalization-unit.
-    Instruction Fetch Latency: The average number of cycles spent to fetch instructions
-      to a CU.
  data source:
  - metric_table:
      id: 1301
@@ -62,22 +40,22 @@ Panel Config:
          avg: AVG((SQC_ICACHE_REQ / $denom))
          min: MIN((SQC_ICACHE_REQ / $denom))
          max: MAX((SQC_ICACHE_REQ / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Hits:
          avg: AVG((SQC_ICACHE_HITS / $denom))
          min: MIN((SQC_ICACHE_HITS / $denom))
          max: MAX((SQC_ICACHE_HITS / $denom))
-          unit: (Hits  + $normUnit)
+          unit: (Hits + $normUnit)
        Misses - Non Duplicated:
          avg: AVG((SQC_ICACHE_MISSES / $denom))
          min: MIN((SQC_ICACHE_MISSES / $denom))
          max: MAX((SQC_ICACHE_MISSES / $denom))
-          unit: (Misses  + $normUnit)
+          unit: (Misses + $normUnit)
        Misses - Duplicated:
          avg: AVG((SQC_ICACHE_MISSES_DUPLICATE / $denom))
          min: MIN((SQC_ICACHE_MISSES_DUPLICATE / $denom))
          max: MAX((SQC_ICACHE_MISSES_DUPLICATE / $denom))
-          unit: (Misses  + $normUnit)
+          unit: (Misses + $normUnit)
        Cache Hit Rate:
          avg: AVG(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
            + SQC_ICACHE_MISSES_DUPLICATE)))
@@ -107,3 +85,25 @@ Panel Config:
          min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
          max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
          unit: Gbps
+  metrics_description:
+    Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
+      of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
+      the total L1I cycles.
+    Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
+      loaded line in the cache. Calculated as the ratio of the number of L1I requests
+      that hit over the number of all L1I requests.
+    L1I-L2 Bandwidth Utilization: |-
+      The percent of the peak theoretical L1I → L2 cache request bandwidth
+      achieved. Calculated as the ratio of the total number of requests from the
+      L1I to the L2 cache over the total L1I-L2 interface cycles.
+    L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
+      divided by total duration.
+    Req: The total number of requests made to the L1I per normalization-unit.
+    Hits: The total number of L1I requests that hit on a previously loaded cache line,
+      per normalization-unit.
+    Misses - Non Duplicated: The total number of L1I requests that missed on a cache
+      line that were not already pending due to another request, per normalization-unit.
+    Misses - Duplicated: The total number of L1I requests that missed on a cache line
+      that were already pending due to another request, per normalization-unit.
+    Instruction Fetch Latency: The average number of cycles spent to fetch instructions
+      to a CU.
@@ -2,49 +2,6 @@
 Panel Config:
  id: 1400
  title: Scalar L1 Data Cache
-  metrics_description:
-    Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
-      of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
-      over the total sL1D cycles.
-    Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
-      loaded line the cache. The ratio of the number of sL1D requests that hit over
-      the number of all sL1D requests.
-    sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
-      bandwidth acheived.\ \ Caclulated as total number of bytes read from, written
-      to, or atomically updated\ \ across the sL1D - L2 interface.
-    sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
-      \ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\
-      \ writes and atomics are typically unused on current CDNA accelerators, so in\
-      \ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
-    Req: The total number of requests, of any size or type, made to the sL1D per normalization
-      unit.
-    Hits: The total number of sL1D requests that hit on a previously loaded cache
-      line, per normalization unit.
-    Misses - Non Duplicated: 'The total number of sL1D requests that missed on a cache
-      line that was not already pending due to another request, per normalization
-      unit. '
-    Misses- Duplicated: The total number of sL1D requests that missed on a cache line
-      that was already pending due to another request, per normalization unit.
-    Read Req (Total): The total number of sL1D read requests of any size, per normalization
-      unit.
-    Atomic Req: The total number of atomic requests from sL1D to the L2, per normalization
-      unit. Typically unused on current CDNA accelerators.
-    Read Req (1 DWord): The total number of sL1D read requests made for a single dword
-      of data (4B), per normalization unit.
-    Read Req (2 DWord): The total number of sL1D read requests made for a two dwords
-      of data (8B), per normalization unit.
-    Read Req (4 DWord): The total number of sL1D read requests made for a four dwords
-      of data (16B), per normalization unit.
-    Read Req (8 DWord): The total number of sL1D read requests made for a eight dwords
-      of data (32B), per normalization unit.
-    Read Req (16 DWord): The total number of sL1D read requests made for a sixteen
-      dwords of data (64B), per normalization unit.
-    Read Req: The total number of read requests from sL1D to the L2 per normalization
-      unit.
-    Write Req: The total number of write requests from sL1D to the L2, per normalization
-      unit. Typically unused on current CDNA accelerators.
-    Stall Cycles: "The total number of cycles the sL1D\u2194L2 interface was stalled,\
-      \ per normalization unit."
  data source:
  - metric_table:
      id: 1401
@@ -84,22 +41,22 @@ Panel Config:
          avg: AVG((SQC_DCACHE_REQ / $denom))
          min: MIN((SQC_DCACHE_REQ / $denom))
          max: MAX((SQC_DCACHE_REQ / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Hits:
          avg: AVG((SQC_DCACHE_HITS / $denom))
          min: MIN((SQC_DCACHE_HITS / $denom))
          max: MAX((SQC_DCACHE_HITS / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Misses - Non Duplicated:
          avg: AVG((SQC_DCACHE_MISSES / $denom))
          min: MIN((SQC_DCACHE_MISSES / $denom))
          max: MAX((SQC_DCACHE_MISSES / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Misses- Duplicated:
          avg: AVG((SQC_DCACHE_MISSES_DUPLICATE / $denom))
          min: MIN((SQC_DCACHE_MISSES_DUPLICATE / $denom))
          max: MAX((SQC_DCACHE_MISSES_DUPLICATE / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Cache Hit Rate:
          avg: AVG((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
            + SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
@@ -118,37 +75,37 @@ Panel Config:
            + SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
          max: MAX((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
            + SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Atomic Req:
          avg: AVG((SQC_DCACHE_ATOMIC / $denom))
          min: MIN((SQC_DCACHE_ATOMIC / $denom))
          max: MAX((SQC_DCACHE_ATOMIC / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read Req (1 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_1 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_1 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_1 / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read Req (2 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_2 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_2 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_2 / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read Req (4 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_4 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_4 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_4 / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read Req (8 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_8 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_8 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_8 / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read Req (16 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_16 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_16 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_16 / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
  - metric_table:
      id: 1403
      title: Scalar L1D Cache - L2 Interface
@@ -171,19 +128,65 @@ Panel Config:
          avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
          min: MIN((SQC_TC_DATA_READ_REQ / $denom))
          max: MAX((SQC_TC_DATA_READ_REQ / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Write Req:
          avg: AVG((SQC_TC_DATA_WRITE_REQ / $denom))
          min: MIN((SQC_TC_DATA_WRITE_REQ / $denom))
          max: MAX((SQC_TC_DATA_WRITE_REQ / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Atomic Req:
          avg: AVG((SQC_TC_DATA_ATOMIC_REQ / $denom))
          min: MIN((SQC_TC_DATA_ATOMIC_REQ / $denom))
          max: MAX((SQC_TC_DATA_ATOMIC_REQ / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Stall Cycles:
          avg: AVG((SQC_TC_STALL / $denom))
          min: MIN((SQC_TC_STALL / $denom))
          max: MAX((SQC_TC_STALL / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
+  metrics_description:
+    Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
+      of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
+      over the total sL1D cycles.
+    Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
+      loaded line in the cache. The ratio of the number of sL1D requests that hit over
+      the number of all sL1D requests.
+    sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
+      bandwidth achieved. Calculated as total number of bytes read from, written to,
+      or atomically updated across the sL1D - L2 interface.
+    sL1D-L2 BW: |-
+      The total number of bytes read from, written to, or atomically updated
+      across the sL1D↔L2 interface, divided by total duration. Note that sL1D
+      writes and atomics are typically unused on current CDNA accelerators, so
+      in the majority of cases this can be interpreted as an sL1D→L2 read
+      bandwidth.
+    Req: The total number of requests, of any size or type, made to the sL1D per normalization
+      unit.
+    Hits: The total number of sL1D requests that hit on a previously loaded cache
+      line, per normalization unit.
+    Misses - Non Duplicated: |-
+      The total number of sL1D requests that missed on a cache line that was
+      not already pending due to another request, per normalization unit.
+    Misses- Duplicated: The total number of sL1D requests that missed on a cache line
+      that was already pending due to another request, per normalization unit.
+    Read Req (Total): The total number of sL1D read requests of any size, per normalization
+      unit.
+    Atomic Req: The total number of atomic requests from sL1D to the L2, per normalization
+      unit. Typically unused on current CDNA accelerators.
+    Read Req (1 DWord): The total number of sL1D read requests made for a single dword
+      of data (4B), per normalization unit.
+    Read Req (2 DWord): The total number of sL1D read requests made for two dwords
+      of data (8B), per normalization unit.
+    Read Req (4 DWord): The total number of sL1D read requests made for four dwords
+      of data (16B), per normalization unit.
+    Read Req (8 DWord): The total number of sL1D read requests made for eight dwords
+      of data (32B), per normalization unit.
+    Read Req (16 DWord): The total number of sL1D read requests made for sixteen
+      dwords of data (64B), per normalization unit.
+    Read Req: The total number of read requests from sL1D to the L2 per normalization
+      unit.
+    Write Req: The total number of write requests from sL1D to the L2, per normalization
+      unit. Typically unused on current CDNA accelerators.
+    Stall Cycles: |-
+      The total number of cycles the sL1D↔L2 interface was stalled, per
+      normalization unit.
@@ -2,70 +2,6 @@
 Panel Config:
  id: 1500
  title: Address Processing Unit and Data Return Path (TA/TD)
-  metrics_description:
-    Address Processing Unit Busy: Percent of the total CU cycles the address processor
-      was busy
-    Address Stall: Percent of the total CU cycles the address processor was stalled
-      from sending address requests further into the vL1D pipeline.
-    Data Stall: Percent of the total CU cycles the address processor was stalled from
-      sending write/atomic data further into the vL1D pipeline.
-    "Data-Processor \u2192 Address Stall": Percent of total CU cycles the address
-      processor was stalled waiting to send command data to the data processor.
-    Total Instructions: The total number of memory instructions executed by the address
-      processer over all compute units on the accelerator, per normalization unit.
-    Global/Generic Instructions: The total number of global & generic memory instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Global/Generic Read Instructions: The total number of global & generic memory
-      read instructions executed on all compute units on the accelerator, per normalization
-      unit.
-    Global/Generic Write Instructions: The total number of global & generic memory
-      write instructions executed on all compute units on the accelerator, per normalization
-      unit.
-    Global/Generic Atomic Instructions: The total number of global & generic memory
-      atomic (with and without return) instructions executed on all compute units
-      on the accelerator, per normalization unit.
-    Spill/Stack Instructions: The total number of spill/stack memory instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Read Instructions: The total number of spill/stack memory read instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Write Instructions: The total number of spill/stack memory write instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Atomic Instructions: The total number of spill/stack memory atomic
-      (with and without return) instructions executed on all compute units on the
-      accelerator, per normalization unit. Typically unused as these memory operations
-      are typically used to implement thread-local storage.
-    Spill/Stack Total Cycles: The number of cycles the address processing unit spent
-      working on spill/stack instructions, per normalization unit.
-    Spill/Stack Coalesced Read: The number of cycles the address processing unit spent
-      working on coalesced spill/stack read instructions, per normalization unit.
-    Spill/Stack Coalesced Write: The number of cycles the address processing unit
-      spent working on coalesced spill/stack write instructions, per normalization
-      unit.
-    Data-Return Busy: Percent of the total CU cycles the data-return unit was busy
-      processing or waiting on data to return to the CU.
-    "Cache RAM \u2192 Data-Return Stall": Percent of the total CU cycles the data-return
-      unit was stalled on data to be returned from the vL1D Cache RAM.
-    "Workgroup manager \u2192 Data-Return Stall": Percent of the total CU cycles the
-      data-return unit was stalled by the workgroup manager due to initialization
-      of registers as a part of launching new workgroups.
-    Coalescable Instructions: The number of instructions submitted to the data-return
-      unit by the address processor that were found to be coalescable, per normalization
-      unit.
-    Read Instructions: The number of read instructions submitted to the data-return
-      unit by the address processor summed over all compute units on the accelerator,
-      per normalization unit. This is expected to be the sum of global/generic and
-      spill/stack reads in the address processor.
-    Write Instructions: The number of store instructions submitted to the data-return
-      unit by the address processor summed over all compute units on the accelerator,
-      per normalization unit. This is expected to be the sum of global/generic and
-      spill/stack stores in the address processor.
-    Atomic Instructions: The number of atomic instructions submitted to the data-return
-      unit by the address processor summed over all compute units on the accelerator,
-      per normalization unit. This is expected to be the sum of global/generic and
-      spill/stack atomics in the address processor.
-    Write Ack Instructions: The total number of write acknowledgements submitted by
-      data-return unit to SQ, summed over all compute units on the accelerator, per
-      normalization unit.
  data source:
  - metric_table:
      id: 1501
@@ -135,47 +71,47 @@ Panel Config:
          avg: AVG((TA_TOTAL_WAVEFRONTS_sum / $denom))
          min: MIN((TA_TOTAL_WAVEFRONTS_sum / $denom))
          max: MAX((TA_TOTAL_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Global/Generic Instructions:
          avg: AVG((TA_FLAT_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Global/Generic Read Instructions:
          avg: AVG((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Global/Generic Write Instructions:
          avg: AVG((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Global/Generic Atomic Instructions:
          avg: AVG((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Spill/Stack Instructions:
          avg: AVG((TA_BUFFER_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Spill/Stack Read Instructions:
          avg: AVG((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Spill/Stack Write Instructions:
          avg: AVG((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Spill/Stack Atomic Instructions:
          avg: AVG((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
  - metric_table:
      id: 1503
      title: Spill and stack metrics
@@ -190,17 +126,17 @@ Panel Config:
          avg: AVG((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
          min: MIN((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
          max: MAX((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Spill/Stack Coalesced Read:
          avg: AVG((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
          min: MIN((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
          max: MAX((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Spill/Stack Coalesced Write:
          avg: AVG((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
          min: MIN((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
          max: MAX((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
  - metric_table:
      id: 1504
      title: Vector L1 data-return path or Texture Data (TD)
@@ -230,7 +166,7 @@ Panel Config:
          avg: AVG((TD_COALESCABLE_WAVEFRONT_sum / $denom))
          min: MIN((TD_COALESCABLE_WAVEFRONT_sum / $denom))
          max: MAX((TD_COALESCABLE_WAVEFRONT_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Read Instructions:
          avg: AVG((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
            / $denom))
@@ -238,14 +174,75 @@ Panel Config:
            / $denom))
          max: MAX((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
            / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Write Instructions:
          avg: AVG((TD_STORE_WAVEFRONT_sum / $denom))
          min: MIN((TD_STORE_WAVEFRONT_sum / $denom))
          max: MAX((TD_STORE_WAVEFRONT_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Atomic Instructions:
          avg: AVG((TD_ATOMIC_WAVEFRONT_sum / $denom))
          min: MIN((TD_ATOMIC_WAVEFRONT_sum / $denom))
          max: MAX((TD_ATOMIC_WAVEFRONT_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
+  metrics_description:
+    Address Processing Unit Busy: Percent of the total CU cycles the address processor
+      was busy.
+    Address Stall: Percent of the total CU cycles the address processor was stalled
+      from sending address requests further into the vL1D pipeline.
+    Data Stall: Percent of the total CU cycles the address processor was stalled from
+      sending write/atomic data further into the vL1D pipeline.
+    "Data-Processor \u2192 Address Stall": Percent of total CU cycles the address
+      processor was stalled waiting to send command data to the data processor.
+    Total Instructions: The total number of memory instructions executed by the address
+      processor over all compute units on the accelerator, per normalization unit.
+    Global/Generic Instructions: The total number of global & generic memory instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Global/Generic Read Instructions: The total number of global & generic memory
+      read instructions executed on all compute units on the accelerator, per normalization
+      unit.
+    Global/Generic Write Instructions: The total number of global & generic memory
+      write instructions executed on all compute units on the accelerator, per normalization
+      unit.
+    Global/Generic Atomic Instructions: The total number of global & generic memory
+      atomic (with and without return) instructions executed on all compute units
+      on the accelerator, per normalization unit.
+    Spill/Stack Instructions: The total number of spill/stack memory instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Read Instructions: The total number of spill/stack memory read instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Write Instructions: The total number of spill/stack memory write instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Atomic Instructions: The total number of spill/stack memory atomic
+      (with and without return) instructions executed on all compute units on the
+      accelerator, per normalization unit. Typically unused, as these memory operations
+      are generally used to implement thread-local storage.
+    Spill/Stack Total Cycles: The number of cycles the address processing unit spent
+      working on spill/stack instructions, per normalization unit.
+    Spill/Stack Coalesced Read: The number of cycles the address processing unit spent
+      working on coalesced spill/stack read instructions, per normalization unit.
+    Spill/Stack Coalesced Write: The number of cycles the address processing unit
+      spent working on coalesced spill/stack write instructions, per normalization
+      unit.
+    Data-Return Busy: Percent of the total CU cycles the data-return unit was busy
+      processing or waiting on data to return to the CU.
+    "Cache RAM \u2192 Data-Return Stall": Percent of the total CU cycles the data-return
+      unit was stalled on data to be returned from the vL1D Cache RAM.
+    "Workgroup manager \u2192 Data-Return Stall": Percent of the total CU cycles the
+      data-return unit was stalled by the workgroup manager due to initialization
+      of registers as a part of launching new workgroups.
+    Coalescable Instructions: The number of instructions submitted to the data-return
+      unit by the address processor that were found to be coalescable, per normalization
+      unit.
+    Read Instructions: The number of read instructions submitted to the data-return
+      unit by the address processor summed over all compute units on the accelerator,
+      per normalization unit. This is expected to be the sum of global/generic and
+      spill/stack reads in the address processor.
+    Write Instructions: The number of store instructions submitted to the data-return
+      unit by the address processor summed over all compute units on the accelerator,
+      per normalization unit. This is expected to be the sum of global/generic and
+      spill/stack stores in the address processor.
+    Atomic Instructions: The number of atomic instructions submitted to the data-return
+      unit by the address processor summed over all compute units on the accelerator,
+      per normalization unit. This is expected to be the sum of global/generic and
+      spill/stack atomics in the address processor.
@@ -2,117 +2,6 @@
 Panel Config:
  id: 1600
  title: Vector L1 Data Cache
-  metrics_description:
-    Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
-      cache over the total number of cache line requests to the vL1D Cache RAM.
-    Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
-      of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
-      on the specific accelerator. The number of bytes is calculated as the number
-      of cache lines requested multiplied by the cache line size. This value does
-      not consider partial requests, so for instance, if only a single value is requested
-      in a cache line, the data movement will still be counted as a full cache line.
-    Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
-      The number of cycles where the vL1D Cache RAM is actively processing any request
-      divided by the number of cycles where the vL1D is active.
-    Coalescing: Indicates how well memory instructions were coalesced by the address
-      processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
-      as the average number of thread-requests generated per instruction divided by
-      the ideal number of thread-requests per instruction.
-    Stalled on L2 Data: The ratio of the number of cycles where the vL1D is stalled
-      waiting for requested data to return from the L2 cache divided by the number
-      of cycles where the vL1D is active.
-    Stalled on L2 Req: The ratio of the number of cycles where the vL1D is stalled
-      waiting to issue a request for data to the L2 cache divided by the number of
-      cycles where the vL1D is active.
-    Tag RAM Stall (Read): The ratio of the number of cycles where the vL1D is stalled
-      due to Read requests with conflicting tags being looked up concurrently, divided
-      by the number of cycles where the vL1D is active.
-    Tag RAM Stall (Write): The ratio of the number of cycles where the vL1D is stalled
-      due to Write requests with conflicting tags being looked up concurrently, divided
-      by the number of cycles where the vL1D is active.
-    Tag RAM Stall (Atomic): The ratio of the number of cycles where the vL1D is stalled
-      due to Atomic requests with conflicting tags being looked up concurrently, divided
-      by the number of cycles where the vL1D is active.
-    Total Req: The total number of incoming requests from the address processing unit
-      after coalescing.
-    Read Req: The total number of incoming read requests from the address processing
-      unit after coalescing per normalization unit.
-    Write Req: The total number of incoming write requests from the address processing
-      unit after coalescing per normalization unit.
-    Atomic Req: The total number of incoming atomic requests from the address processing
-      unit after coalescing per normalization unit.
-    Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
-      instructions divided by total duration. The number of bytes is calculated as
-      the number of cache lines requested multiplied by the cache line size.  This
-      value does not consider partial requests, so for instance, if only a single
-      value is requested in a cache line, the data movement will still be counted
-      as a full cache line.
-    Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
-      vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
-    Cache Accesses: The total number of cache line lookups in the vL1D.
-    Cache Hits: The number of cache accesses minus the number of outgoing requests
-      to the L2 cache, that is, the number of cache line requests serviced by the
-      vL1D Cache RAM per normalization unit.
-    Invalidations: The number of times the vL1D was issued a write-back invalidate
-      command during the kernel's execution per normalization unit. This may be triggered
-      by, for instance, the buffer_wbinvl1 instruction.
-    L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
-      of VMEM instructions, divided by total duration. The number of bytes is calculated
-      as the number of cache lines requested multiplied by the cache line size. This
-      value does not consider partial requests, so for instance, if only a single
-      value is requested in a cache line, the data movement will still be counted
-      as a full cache line.
-    L1-L2 Read: The number of read requests for a vL1D cache line that were not satisfied
-      by the vL1D and must be retrieved from the to the L2 Cache per normalization
-      unit.
-    L1-L2 Write: The number of write requests to a vL1D cache line that were sent
-      through the vL1D to the L2 cache, per normalization unit.
-    L1-L2 Atomic: The number of atomic requests that are sent through the vL1D to
-      the L2 cache, per normalization unit. This includes requests for atomics with,
-      and without return.
-    L1 Access Latency: Calculated as the average number of cycles that a vL1D cache
-      line request spent in the vL1D cache pipeline.
-    L1-L2 Read Latency: Calculated as the average number of cycles that the vL1D cache
-      took to issue and receive read requests from the L2 Cache. This number also
-      includes requests for atomics with return values.
-    L1-L2 Write Latency: Calculated as the average number of cycles that the vL1D
-      cache took to issue and receive acknowledgement of a write request to the L2
-      Cache. This number also includes requests for atomics without return values.
-    NC - Read: Total read requests with NC mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    UC - Read: Total read requests with UC mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    CC - Read: Total read requests with CC mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    RW - Read: Total read requests with RW mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    RW - Write: Total write requests with RW mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    NC - Write: Total write requests with NC mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    UC - Write: Total write requests with UC mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    CC - Write: Total write requests with CC mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    NC - Atomic: Total atomic requests with NC mtype from this TCP to all TCCs Sum
-      over TCP instances per normalization unit.
-    UC - Atomic: Total atomic requests with UC mtype from this TCP to all TCCs Sum
-      over TCP instances per normalization unit.
-    CC - Atomic: Total atomic requests with CC mtype from this TCP to all TCCs Sum
-      over TCP instances per normalization unit.
-    RW - Atomic: Total atomic requests with RW mtype from this TCP to all TCCs Sum
-      over TCP instances per normalization unit.
-    Req: The number of translation requests made to the UTCL1 per normalization unit.
-    Hit Ratio: The ratio of the number of translation requests that hit in the UTCL1
-      divided by the total number of translation requests made to the UTCL1.
-    Hits: The number of translation requests that hit in the UTCL1, and could be reused,
-      per normalization unit.
-    Translation Misses: The total number of translation requests that missed in the
-      UTCL1 due to  translation not being present in the cache, per normalization
-      unit.
-    Permission Misses: "The total number of translation requests that missed in the\
-      \ UTCL1 due to a permission error, per normalization unit. This is unused and\
-      \ expected to be zero in most configurations for modern CDNA\u2122 accelerators."
  data source:
  - metric_table:
      id: 1601
@@ -181,17 +70,17 @@ Panel Config:
          avg: AVG((TCP_TOTAL_ACCESSES_sum / $denom))
          min: MIN((TCP_TOTAL_ACCESSES_sum / $denom))
          max: MAX((TCP_TOTAL_ACCESSES_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read Req:
          avg: AVG((TCP_TOTAL_READ_sum / $denom))
          min: MIN((TCP_TOTAL_READ_sum / $denom))
          max: MAX((TCP_TOTAL_READ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Write Req:
          avg: AVG((TCP_TOTAL_WRITE_sum / $denom))
          min: MIN((TCP_TOTAL_WRITE_sum / $denom))
          max: MAX((TCP_TOTAL_WRITE_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Atomic Req:
          avg: AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
            / $denom))
@@ -199,7 +88,7 @@ Panel Config:
            / $denom))
          max: MAX(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
            / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Cache BW:
          avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
          min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
@@ -223,7 +112,7 @@ Panel Config:
          avg: AVG((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
          min: MIN((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
          max: MAX((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Cache Hits:
          avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
@@ -234,7 +123,7 @@ Panel Config:
          max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Invalidations:
          avg: AVG((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
          min: MIN((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
@@ -252,12 +141,12 @@ Panel Config:
          avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_READ_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        L1-L2 Write:
          avg: AVG((TCP_TCC_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_WRITE_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        L1-L2 Atomic:
          avg: AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
            / $denom))
@@ -265,7 +154,7 @@ Panel Config:
            / $denom))
          max: MAX(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
            / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
  - metric_table:
      id: 1604
      title: L1D - L2 Transactions
@@ -284,84 +173,84 @@ Panel Config:
          avg: AVG((TCP_TCC_NC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_NC_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_NC_READ_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        UC - Read:
          xfer: Read
          coherency: UC
          avg: AVG((TCP_TCC_UC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_UC_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_UC_READ_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        CC - Read:
          xfer: Read
          coherency: CC
          avg: AVG((TCP_TCC_CC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_CC_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_CC_READ_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        RW - Read:
          xfer: Read
          coherency: RW
          avg: AVG((TCP_TCC_RW_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_RW_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_RW_READ_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        RW - Write:
          xfer: Write
          coherency: RW
          avg: AVG((TCP_TCC_RW_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_RW_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_RW_WRITE_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        NC - Write:
          xfer: Write
          coherency: NC
          avg: AVG((TCP_TCC_NC_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_NC_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_NC_WRITE_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        UC - Write:
          xfer: Write
          coherency: UC
          avg: AVG((TCP_TCC_UC_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_UC_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_UC_WRITE_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        CC - Write:
          xfer: Write
          coherency: CC
          avg: AVG((TCP_TCC_CC_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_CC_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_CC_WRITE_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        NC - Atomic:
          xfer: Atomic
          coherency: NC
          avg: AVG((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
          min: MIN((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
          max: MAX((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        UC - Atomic:
          xfer: Atomic
          coherency: UC
          avg: AVG((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
          min: MIN((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
          max: MAX((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        CC - Atomic:
          xfer: Atomic
          coherency: CC
          avg: AVG((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
          min: MIN((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
          max: MAX((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        RW - Atomic:
          xfer: Atomic
          coherency: RW
          avg: AVG((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
          min: MIN((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
          max: MAX((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
  - metric_table:
      id: 1605
      title: L1 Unified Translation Cache (UTCL1)
@@ -410,3 +299,106 @@ Panel Config:
        max: Max
        units: Unit
      metric: {}
+  metrics_description:
+    Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
+      cache over the total number of cache line requests to the vL1D Cache RAM.
+    Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
+      of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
+      on the specific accelerator. The number of bytes is calculated as the number
+      of cache lines requested multiplied by the cache line size. This value does
+      not consider partial requests, so for instance, if only a single value is requested
+      in a cache line, the data movement will still be counted as a full cache line.
+    Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
+      The number of cycles where the vL1D Cache RAM is actively processing any request
+      divided by the number of cycles where the vL1D is active.
+    Coalescing: Indicates how well memory instructions were coalesced by the address
+      processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
+      as the average number of thread-requests generated per instruction divided by
+      the ideal number of thread-requests per instruction.
+    Stalled on L2 Data: The ratio of the number of cycles where the vL1D is stalled
+      waiting for requested data to return from the L2 cache divided by the number
+      of cycles where the vL1D is active.
+    Stalled on L2 Req: The ratio of the number of cycles where the vL1D is stalled
+      waiting to issue a request for data to the L2 cache divided by the number of
+      cycles where the vL1D is active.
+    Tag RAM Stall (Read): The ratio of the number of cycles where the vL1D is stalled
+      due to Read requests with conflicting tags being looked up concurrently, divided
+      by the number of cycles where the vL1D is active.
+    Tag RAM Stall (Write): The ratio of the number of cycles where the vL1D is stalled
+      due to Write requests with conflicting tags being looked up concurrently, divided
+      by the number of cycles where the vL1D is active.
+    Tag RAM Stall (Atomic): The ratio of the number of cycles where the vL1D is stalled
+      due to Atomic requests with conflicting tags being looked up concurrently, divided
+      by the number of cycles where the vL1D is active.
+    Total Req: The total number of incoming requests from the address processing unit
+      after coalescing, per normalization unit.
+    Read Req: The total number of incoming read requests from the address processing
+      unit after coalescing per normalization unit.
+    Write Req: The total number of incoming write requests from the address processing
+      unit after coalescing per normalization unit.
+    Atomic Req: The total number of incoming atomic requests from the address processing
+      unit after coalescing per normalization unit.
+    Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
+      instructions divided by total duration. The number of bytes is calculated as
+      the number of cache lines requested multiplied by the cache line size. This
+      value does not consider partial requests, so for instance, if only a single
+      value is requested in a cache line, the data movement will still be counted
+      as a full cache line.
+    Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
+      vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
+    Cache Accesses: The total number of cache line lookups in the vL1D.
+    Cache Hits: The number of cache accesses minus the number of outgoing requests
+      to the L2 cache, that is, the number of cache line requests serviced by the
+      vL1D Cache RAM per normalization unit.
+    Invalidations: The number of times the vL1D was issued a write-back invalidate
+      command during the kernel's execution per normalization unit. This may be triggered
+      by, for instance, the buffer_wbinvl1 instruction.
+    L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
+      of VMEM instructions, divided by total duration. The number of bytes is calculated
+      as the number of cache lines requested multiplied by the cache line size. This
+      value does not consider partial requests, so for instance, if only a single
+      value is requested in a cache line, the data movement will still be counted
+      as a full cache line.
+    L1-L2 Read: The number of read requests for a vL1D cache line that were not satisfied
+      by the vL1D and must be retrieved from the L2 cache, per normalization
+      unit.
+    L1-L2 Write: The number of write requests to a vL1D cache line that were sent
+      through the vL1D to the L2 cache, per normalization unit.
+    L1-L2 Atomic: The number of atomic requests that are sent through the vL1D to
+      the L2 cache, per normalization unit. This includes requests for atomics with
+      and without return values.
+    NC - Read: Total read requests with NC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    UC - Read: Total read requests with UC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    CC - Read: Total read requests with CC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    RW - Read: Total read requests with RW mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    RW - Write: Total write requests with RW mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    NC - Write: Total write requests with NC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    UC - Write: Total write requests with UC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    CC - Write: Total write requests with CC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    NC - Atomic: Total atomic requests with NC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    UC - Atomic: Total atomic requests with UC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    CC - Atomic: Total atomic requests with CC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    RW - Atomic: Total atomic requests with RW mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    Req: The number of translation requests made to the UTCL1 per normalization unit.
+    Hit Ratio: The ratio of the number of translation requests that hit in the UTCL1
+      divided by the total number of translation requests made to the UTCL1.
+    Hits: The number of translation requests that hit in the UTCL1, and could be reused,
+      per normalization unit.
+    Translation Misses: The total number of translation requests that missed in the
+      UTCL1 due to the translation not being present in the cache, per normalization unit.
+    Permission Misses: |-
+      The total number of translation requests that missed in the UTCL1 due
+      to a permission error, per normalization unit. This is unused and expected
+      to be zero in most configurations for modern CDNA™ accelerators.
@@ -2,218 +2,6 @@
 Panel Config:
  id: 1700
  title: L2 Cache
-  metrics_description:
-    Utilization: The ratio of the number of cycles an L2 channel was active, summed
-      over all L2 channels on the accelerator over the total L2 cycles.
-    Peak Bandwidth: The number of bytes looked up in the L2 cache, as a percent of
-      the peak theoretical bandwidth achievable on the specific accelerator. The number
-      of bytes is calculated as the number of cache lines requested multiplied by
-      the cache line size. This value does not consider partial requests, so e.g.,
-      if only a single value is requested in a cache line, the data movement will
-      still be counted as a full cache line.
-    Hit Rate: The ratio of the number of L2 cache line requests that hit in the L2
-      cache over the total number of incoming cache line requests to the L2 cache.
-    L2-Fabric Read BW: The number of bytes read by the L2 over the Infinity Fabric
-      interface per unit time.
-    L2-Fabric Write and Atomic BW: The number of bytes sent by the L2 over the Infinity
-      Fabric interface by write and atomic operations per unit time.
-    HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
-      memory (HBM) per unit time. This value is calculated as the number of HBM channels
-      multiplied by the HBM channel width multiplied by the HBM clock frequency.
-    Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided
-      by total duration.
-    HBM Read Traffic: The percent of read requests generated by the L2 cache that
-      are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
-      does not consider the size of the request (meaning that 32B and 64B requests
-      are both counted as a single request), so this metric only approximates the
-      percent of the L2-Fabric Read bandwidth directed to the local HBM.
-    Remote Read Traffic: The percent of read requests generated by the L2 cache that
-      are routed to any memory location other than the accelerator's local high-bandwidth
-      memory (HBM) - for example, the CPU's DRAM or a remote accelerator's HBM. This
-      breakdown does not consider the size of the request (meaning that 32B and 64B
-      requests are both counted as a single request), so this metric only approximates
-      the percent of the L2-Fabric Read bandwidth directed to a remote location.
-    Uncached Read Traffic: The percent of read requests generated by the L2 cache
-      that are reading from an uncached memory allocation. Note, as described in the
-      request flow section, a single 64B read request is typically counted as two
-      uncached read requests. So, it is possible for the Uncached Read Traffic to
-      reach up to 200% of the total number of read requests. This breakdown does not
-      consider the size of the request (i.e., 32B and 64B requests are both counted
-      as a single request), so this metric only approximates the percent of the L2-Fabric
-      read bandwidth directed to an uncached memory location.
-    Write and Atomic BW: The total number of bytes written by the L2 over Infinity
-      Fabric by write and atomic operations divided by total duration. Note that on
-      current CDNA accelerators, such as the MI2XX, requests are only considered atomic
-      by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
-      fine-grained memory allocations or uncached memory allocations on the MI2XX.
-    HBM Write and Atomic Traffic: The percent of write and atomic requests generated
-      by the L2 cache that are routed to the accelerator's local high-bandwidth memory
-      (HBM). This breakdown does not consider the size of the request (meaning that
-      32B and 64B requests are both counted as a single request), so this metric only
-      approximates the percent of the L2-Fabric Write and Atomic bandwidth directed
-      to the local HBM. Note that on current CDNA accelerators, such as the MI2XX,
-      requests are only considered atomic by Infinity Fabric if they are targeted
-      at fine-grained memory allocations or uncached memory allocations.
-    Remote Write and Atomic Traffic: The percent of read requests generated by the
-      L2 cache that are routed to any memory location other than the accelerator's
-      local high-bandwidth memory (HBM) - for example, the CPU's DRAM or a remote
-      accelerator's HBM. This breakdown does not consider the size of the request
-      (meaning that 32B and 64B requests are both counted as a single request), so
-      this metric only approximates the percent of the L2-Fabric Read bandwidth directed
-      to a remote location. Note that on current CDNA accelerators, such as the MI2XX,
-      requests are only considered atomic by Infinity Fabric if they are targeted
-      at fine-grained memory allocations or uncached memory allocations.
-    Atomic Traffic: The percent of write requests generated by the L2 cache that are
-      atomic requests to any memory location. This breakdown does not consider the
-      size of the request (meaning that 32B and 64B requests are both counted as a
-      single request), so this metric only approximates the percent of the L2-Fabric
-      Read bandwidth directed to a remote location. Note that on current CDNA accelerators,
-      such as the MI2XX, requests are only considered atomic by Infinity Fabric if
-      they are targeted at fine-grained memory allocations or uncached memory allocations.
-    Uncached Write and Atomic Traffic: The percent of write and atomic requests generated
-      by the L2 cache that are targeting uncached memory allocations. This breakdown
-      does not consider the size of the request (meaning that 32B and 64B requests
-      are both counted as a single request), so this metric only approximates the
-      percent of the L2-Fabric read bandwidth directed to uncached memory allocations.
-    Read Latency: The time-averaged number of cycles read requests spent in Infinity
-      Fabric before data was returned to the L2.
-    Write and Atomic Latency: The time-averaged number of cycles write requests spent
-      in Infinity Fabric before a completion acknowledgement was returned to the L2.
-    Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
-      Fabric before a completion acknowledgement (atomic without return value) or
-      data (atomic with return value) was returned to the L2.
-    Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration.
-      The number of bytes is calculated as the number of cache lines requested multiplied
-      by the cache line size. This value does not consider partial requests, so for
-      example, if only a single value is requested in a cache line, the data movement
-      will still be counted as a full cache line.
-    Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests,
-      divided by total duration.
-    Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests,
-      divided by total duration.
-    Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests,
-      divided by total duration.
-    Req: The total number of incoming requests to the L2 from all clients for all
-      request types, per normalization unit.
-    Read Req: The total number of read requests to the L2 from all clients.
-    Write Req: The total number of write requests to the L2 from all clients.
-    Atomic Req: The total number of atomic requests (with and without return) to the
-      L2 from all clients.
-    Streaming Req: The total number of incoming requests to the L2 that are marked
-      as streaming. The exact meaning of this may differ depending on the targeted
-      accelerator, however on an MI2XX this corresponds to non-temporal load or stores.
-      The L2 cache attempts to evict streaming requests before normal requests when
-      the L2 is at capacity.
-    Probe Req: The number of coherence probe requests made to the L2 cache from outside
-      the accelerator. On an MI2XX, probe requests may be generated by, for example,
-      writes to fine-grained device memory or by writes to coarse-grained device memory.
-    Cache Hit: The ratio of the number of L2 cache line requests that hit in the L2
-      cache over the total number of incoming cache line requests to the L2 cache.
-    Hits: The total number of requests to the L2 from all clients that hit in the
-      cache. As noted in the Speed-of-Light section, this includes hit-on-miss requests.
-    Misses: The total number of requests to the L2 from all clients that miss in the
-      cache. As noted in the Speed-of-Light section, these do not include hit-on-miss
-      requests.
-    Writeback: The total number of L2 cache lines written back to memory for any reason.
-      Write-backs may occur due to user code (such as HIP kernel calls to _threadfence_system
-      or atomic built-ins) by the command processor's memory acquire/release fences,
-      or for other internal hardware reasons.
-    Writeback (Internal): The total number of L2 cache lines written back to memory
-      for internal hardware reasons, per normalization unit.
-    Writeback (vL1D Req): The total number of L2 cache lines written back to memory
-      due to requests initiated by the vL1D cache, per normalization unit.
-    Evict (Internal): The total number of L2 cache lines evicted from the cache due
-      to capacity limits, per normalization unit.
-    Evict (vL1D Req): The total number of L2 cache lines evicted from the cache due
-      to invalidation requests initiated by the vL1D cache, per normalization unit.
-    NC Req: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory
-      allocations, per normalization unit.
-    UC Req: The total number of requests to the L2 that go to Uncached (UC) memory
-      allocations.
-    CC Req: The total number of requests to the L2 that go to Coherently Cacheable
-      (CC) memory allocations.
-    RW Req: The total number of requests to the L2 that go to Read-Write coherent
-      memory (RW) allocations.
-    Write - Credit Starvation: The number of cycles the L2-Fabric interface was stalled
-      on write or atomic requests to any memory location because too many write/atomic
-      requests were currently in flight, as a percent of the total active L2 cycles.
-    Read (32B): The total number of L2 requests to Infinity Fabric to read 32B of
-      data from any memory location, per normalization unit.
-    Read (64B): The total number of L2 requests to Infinity Fabric to read 64B of
-      data from any memory location, per normalization unit.
-    Read (Uncached): The total number of L2 requests to Infinity Fabric to read uncached
-      data from any memory location, per normalization unit. 64B requests for uncached
-      data are counted as two 32B uncached data requests.
-    HBM Read: The total number of L2 requests to Infinity Fabric to read 32B or 64B
-      of data from the accelerator's local HBM, per normalization unit.
-    Remote Read: The total number of L2 requests to Infinity Fabric to read 32B or
-      64B of data from any source other than the accelerator's local HBM, per normalization
-      unit.
-    Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe
-      traffic, divided by total duration.
-    "Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read
-      requests due to Infinity Fabric traffic, divided by total duration.
-    Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM
-      traffic, divided by total duration.
-    Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
-      write or atomically update 32B of data to any memory location, per normalization
-      unit.
-    Write and Atomic (Uncached): The total number of L2 requests to Infinity Fabric
-      to write or atomically update 32B or 64B of uncached data, per normalization
-      unit.
-    Write and Atomic (64B): The total number of L2 requests to Infinity Fabric to
-      write or atomically update 64B of data in any memory location, per normalization
-      unit.
-    HBM Write and Atomic: The total number of L2 requests to Infinity Fabric to write
-      or atomically update 32B or 64B of data in the accelerator's local HBM, per
-      normalization unit.
-    Remote Write and Atomic: The total number of L2 requests to Infinity Fabric to
-      write or atomically update 32B or 64B of data in any memory location other than
-      the accelerator's local HBM, per normalization unit.
-    Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to
-      PCIe traffic, divided by total duration.
-    "Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write
-      requests due to Infinity Fabric traffic, divided by total duration.
-    Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM
-      traffic, divided by total duration.
-    Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to
-      PCIe traffic, divided by total duration.
-    "Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic
-      requests due to Infinity Fabric traffic, divided by total duration.
-    Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
-      HBM traffic, divided by total duration.
-    Atomic: The total number of L2 requests to Infinity Fabric to atomically update
-      32B or 64B of data in any memory location, per normalization unit. See Request
-      flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
-      requests are only considered atomic by Infinity Fabric if they are targeted
-      at non-write-cacheable memory, such as fine-grained memory allocations or uncached
-      memory allocations on the MI2XX.
-    Read Stall: "The ratio of the total number of cycles the L2-Fabric interface was\
-      \ stalled on a read request to any destination (local HBM, remote PCIe\xAE connected\
-      \ accelerator or CPU, or remote Infinity Fabric connected accelerator or CPU)\
-      \ over the total active L2 cycles."
-    Write Stall: The ratio of the total number of cycles the L2-Fabric interface was
-      stalled on a write or atomic request to any destination (local HBM, remote accelerator
-      or CPU, PCIe connected accelerator or CPU, or remote Infinity Fabric connected
-      accelerator or CPU) over the total active L2 cycles.
-    Read - PCIe Stall: The number of cycles the L2-Fabric interface was stalled on
-      read requests to remote PCIe connected accelerators or CPUs as a percent of
-      the total active L2 cycles.
-    Read - Infinity Fabric Stall: The number of cycles the L2-Fabric interface was
-      stalled on read requests to remote Infinity Fabric connected accelerators or
-      CPUs as a percent of the total active L2 cycles.
-    Read - HBM Stall: The number of cycles the L2-Fabric interface was stalled on
-      read requests to the accelerator's local HBM as a percent of the total active
-      L2 cycles.
-    Write - PCIe Stall: The number of cycles the L2-Fabric interface was stalled on
-      write or atomic requests to remote PCIe connected accelerators or CPUs as a
-      percent of the total active L2 cycles.
-    Write - Infinity Fabric Stall: The number of cycles the L2-Fabric interface was
-      stalled on write or atomic requests to remote Infinity Fabric connected accelerators
-      or CPUs as a percent of the total active L2 cycles.
-    Write - HBM Stall: The number of cycles the L2-Fabric interface was stalled on
-      write or atomic requests to accelerator's local HBM as a percent of the total
-      active L2 cycles.
  data source:
  - metric_table:
      id: 1701
@@ -370,32 +158,32 @@ Panel Config:
          avg: AVG((TCC_REQ_sum / $denom))
          min: MIN((TCC_REQ_sum / $denom))
          max: MAX((TCC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read Req:
          avg: AVG((TCC_READ_sum / $denom))
          min: MIN((TCC_READ_sum / $denom))
          max: MAX((TCC_READ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Write Req:
          avg: AVG((TCC_WRITE_sum / $denom))
          min: MIN((TCC_WRITE_sum / $denom))
          max: MAX((TCC_WRITE_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Atomic Req:
          avg: AVG((TCC_ATOMIC_sum / $denom))
          min: MIN((TCC_ATOMIC_sum / $denom))
          max: MAX((TCC_ATOMIC_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Streaming Req:
          avg: AVG((TCC_STREAMING_REQ_sum / $denom))
          min: MIN((TCC_STREAMING_REQ_sum / $denom))
          max: MAX((TCC_STREAMING_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Probe Req:
          avg: AVG((TCC_PROBE_sum / $denom))
          min: MIN((TCC_PROBE_sum / $denom))
          max: MAX((TCC_PROBE_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Cache Hit:
          avg: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
            + TCC_MISS_sum) != 0) else None))
@@ -408,17 +196,17 @@ Panel Config:
          avg: AVG((TCC_HIT_sum / $denom))
          min: MIN((TCC_HIT_sum / $denom))
          max: MAX((TCC_HIT_sum / $denom))
-          unit: (Hits  + $normUnit)
+          unit: (Hits + $normUnit)
        Misses:
          avg: AVG((TCC_MISS_sum / $denom))
          min: MIN((TCC_MISS_sum / $denom))
          max: MAX((TCC_MISS_sum / $denom))
-          unit: (Misses  + $normUnit)
+          unit: (Misses + $normUnit)
        Writeback:
          avg: AVG((TCC_WRITEBACK_sum / $denom))
          min: MIN((TCC_WRITEBACK_sum / $denom))
          max: MAX((TCC_WRITEBACK_sum / $denom))
-          unit: (Cachelines  + $normUnit)
+          unit: (Cachelines + $normUnit)
        Writeback (Internal):
          avg: AVG((TCC_NORMAL_WRITEBACK_sum / $denom))
          min: MIN((TCC_NORMAL_WRITEBACK_sum / $denom))
@@ -443,22 +231,22 @@ Panel Config:
          avg: AVG((TCC_NC_REQ_sum / $denom))
          min: MIN((TCC_NC_REQ_sum / $denom))
          max: MAX((TCC_NC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        UC Req:
          avg: AVG((TCC_UC_REQ_sum / $denom))
          min: MIN((TCC_UC_REQ_sum / $denom))
          max: MAX((TCC_UC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        CC Req:
          avg: AVG((TCC_CC_REQ_sum / $denom))
          min: MIN((TCC_CC_REQ_sum / $denom))
          max: MAX((TCC_CC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        RW Req:
          avg: AVG((TCC_RW_REQ_sum / $denom))
          min: MIN((TCC_RW_REQ_sum / $denom))
          max: MAX((TCC_RW_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
  - metric_table:
      id: 1704
      title: L2 Cache Stalls
@@ -507,54 +295,216 @@ Panel Config:
          avg: AVG((TCC_EA0_RDREQ_32B_sum / $denom))
          min: MIN((TCC_EA0_RDREQ_32B_sum / $denom))
          max: MAX((TCC_EA0_RDREQ_32B_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read (64B):
          avg: AVG(((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) / $denom))
          min: MIN(((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) / $denom))
          max: MAX(((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read (Uncached):
          avg: AVG((TCC_EA0_RD_UNCACHED_32B_sum / $denom))
          min: MIN((TCC_EA0_RD_UNCACHED_32B_sum / $denom))
          max: MAX((TCC_EA0_RD_UNCACHED_32B_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        HBM Read:
          avg: AVG((TCC_EA0_RDREQ_DRAM_sum / $denom))
          min: MIN((TCC_EA0_RDREQ_DRAM_sum / $denom))
          max: MAX((TCC_EA0_RDREQ_DRAM_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Remote Read:
          avg: AVG((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
          min: MIN((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
          max: MAX((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Write and Atomic (32B):
          avg: AVG(MAX(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom), 0))
          min: MIN(MAX(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom), 0))
          max: MAX(MAX(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom), 0))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Write and Atomic (Uncached):
          avg: AVG((TCC_EA0_WR_UNCACHED_32B_sum / $denom))
          min: MIN((TCC_EA0_WR_UNCACHED_32B_sum / $denom))
          max: MAX((TCC_EA0_WR_UNCACHED_32B_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Write and Atomic (64B):
          avg: AVG((TCC_EA0_WRREQ_64B_sum / $denom))
          min: MIN((TCC_EA0_WRREQ_64B_sum / $denom))
          max: MAX((TCC_EA0_WRREQ_64B_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        HBM Write and Atomic:
          avg: AVG((TCC_EA0_WRREQ_DRAM_sum / $denom))
          min: MIN((TCC_EA0_WRREQ_DRAM_sum / $denom))
          max: MAX((TCC_EA0_WRREQ_DRAM_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Remote Write and Atomic:
          avg: AVG((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
          min: MIN((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
          max: MAX((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Atomic:
          avg: AVG((TCC_EA0_ATOMIC_sum / $denom))
          min: MIN((TCC_EA0_ATOMIC_sum / $denom))
          max: MAX((TCC_EA0_ATOMIC_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
+  metrics_description:
+    Utilization: The ratio of the number of cycles an L2 channel was active, summed
+      over all L2 channels on the accelerator over the total L2 cycles.
+    Peak Bandwidth: The number of bytes looked up in the L2 cache, as a percent of
+      the peak theoretical bandwidth achievable on the specific accelerator. The number
+      of bytes is calculated as the number of cache lines requested multiplied by
+      the cache line size. This value does not consider partial requests, so e.g.,
+      if only a single value is requested in a cache line, the data movement will
+      still be counted as a full cache line.
+    Hit Rate: The ratio of the number of L2 cache line requests that hit in the L2
+      cache over the total number of incoming cache line requests to the L2 cache.
+    L2-Fabric Read BW: The number of bytes read by the L2 over the Infinity Fabric
+      interface per unit time.
+    L2-Fabric Write and Atomic BW: The number of bytes sent by the L2 over the Infinity
+      Fabric interface by write and atomic operations per unit time.
+    HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
+      memory (HBM) per unit time. This value is calculated as the number of HBM channels
+      multiplied by the HBM channel width multiplied by the HBM clock frequency.
+    Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided
+      by total duration.
+    HBM Read Traffic: The percent of read requests generated by the L2 cache that
+      are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
+      does not consider the size of the request (meaning that 32B and 64B requests
+      are both counted as a single request), so this metric only approximates the
+      percent of the L2-Fabric Read bandwidth directed to the local HBM.
+    Remote Read Traffic: The percent of read requests generated by the L2 cache that
+      are routed to any memory location other than the accelerator's local high-bandwidth
+      memory (HBM) - for example, the CPU's DRAM or a remote accelerator's HBM. This
+      breakdown does not consider the size of the request (meaning that 32B and 64B
+      requests are both counted as a single request), so this metric only approximates
+      the percent of the L2-Fabric Read bandwidth directed to a remote location.
+    Uncached Read Traffic: The percent of read requests generated by the L2 cache
+      that are reading from an uncached memory allocation. Note, as described in the
+      request flow section, a single 64B read request is typically counted as two
+      uncached read requests. So, it is possible for the Uncached Read Traffic to
+      reach up to 200% of the total number of read requests. This breakdown does not
+      consider the size of the request (i.e., 32B and 64B requests are both counted
+      as a single request), so this metric only approximates the percent of the L2-Fabric
+      read bandwidth directed to an uncached memory location.
+    Write and Atomic BW: The total number of bytes written by the L2 over Infinity
+      Fabric by write and atomic operations divided by total duration. Note that on
+      current CDNA accelerators, such as the MI2XX, requests are only considered atomic
+      by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
+      fine-grained memory allocations or uncached memory allocations on the MI2XX.
+    HBM Write and Atomic Traffic: The percent of write and atomic requests generated
+      by the L2 cache that are routed to the accelerator's local high-bandwidth memory
+      (HBM). This breakdown does not consider the size of the request (meaning that
+      32B and 64B requests are both counted as a single request), so this metric only
+      approximates the percent of the L2-Fabric Write and Atomic bandwidth directed
+      to the local HBM. Note that on current CDNA accelerators, such as the MI2XX,
+      requests are only considered atomic by Infinity Fabric if they are targeted
+      at fine-grained memory allocations or uncached memory allocations.
+    Remote Write and Atomic Traffic: The percent of write and atomic requests generated
+      by the L2 cache that are routed to any memory location other than the accelerator's
+      local high-bandwidth memory (HBM) - for example, the CPU's DRAM or a remote
+      accelerator's HBM. This breakdown does not consider the size of the request
+      (meaning that 32B and 64B requests are both counted as a single request), so
+      this metric only approximates the percent of the L2-Fabric Write and Atomic bandwidth
+      directed to a remote location. Note that on current CDNA accelerators, such as the MI2XX,
+      requests are only considered atomic by Infinity Fabric if they are targeted
+      at fine-grained memory allocations or uncached memory allocations.
+    Atomic Traffic: The percent of write and atomic requests generated by the L2 cache
+      that are atomic requests to any memory location. This breakdown does not consider
+      the size of the request (meaning that 32B and 64B requests are both counted as
+      a single request), so this metric only approximates the percent of the L2-Fabric
+      Write and Atomic bandwidth due to atomic requests. Note that on current CDNA accelerators,
+      such as the MI2XX, requests are only considered atomic by Infinity Fabric if
+      they are targeted at fine-grained memory allocations or uncached memory allocations.
+    Uncached Write and Atomic Traffic: The percent of write and atomic requests generated
+      by the L2 cache that are targeting uncached memory allocations. This breakdown
+      does not consider the size of the request (meaning that 32B and 64B requests
+      are both counted as a single request), so this metric only approximates the
+      percent of the L2-Fabric Write and Atomic bandwidth directed to uncached memory allocations.
+    Read Latency: The time-averaged number of cycles read requests spent in Infinity
+      Fabric before data was returned to the L2.
+    Write and Atomic Latency: The time-averaged number of cycles write requests spent
+      in Infinity Fabric before a completion acknowledgement was returned to the L2.
+    Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
+      Fabric before a completion acknowledgement (atomic without return value) or
+      data (atomic with return value) was returned to the L2.
+    Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration.
+      The number of bytes is calculated as the number of cache lines requested multiplied
+      by the cache line size. This value does not consider partial requests, so for
+      example, if only a single value is requested in a cache line, the data movement
+      will still be counted as a full cache line.
+    Req: The total number of incoming requests to the L2 from all clients for all
+      request types, per normalization unit.
+    Read Req: The total number of read requests to the L2 from all clients.
+    Write Req: The total number of write requests to the L2 from all clients.
+    Atomic Req: The total number of atomic requests (with and without return) to the
+      L2 from all clients.
+    Streaming Req: The total number of incoming requests to the L2 that are marked
+      as streaming. The exact meaning of this may differ depending on the targeted
+      accelerator; however, on an MI2XX this corresponds to non-temporal loads or stores.
+      The L2 cache attempts to evict streaming requests before normal requests when
+      the L2 is at capacity.
+    Probe Req: The number of coherence probe requests made to the L2 cache from outside
+      the accelerator. On an MI2XX, probe requests may be generated by, for example,
+      writes to fine-grained device memory or by writes to coarse-grained device memory.
+    Cache Hit: The ratio of the number of L2 cache line requests that hit in the L2
+      cache over the total number of incoming cache line requests to the L2 cache.
+    Hits: The total number of requests to the L2 from all clients that hit in the
+      cache. As noted in the Speed-of-Light section, this includes hit-on-miss requests.
+    Misses: The total number of requests to the L2 from all clients that miss in the
+      cache. As noted in the Speed-of-Light section, these do not include hit-on-miss
+      requests.
+    Writeback: The total number of L2 cache lines written back to memory for any reason.
+      Write-backs may occur due to user code (such as HIP kernel calls to __threadfence_system
+      or atomic built-ins), by the command processor's memory acquire/release fences,
+      or for other internal hardware reasons.
+    Writeback (Internal): The total number of L2 cache lines written back to memory
+      for internal hardware reasons, per normalization unit.
+    Writeback (vL1D Req): The total number of L2 cache lines written back to memory
+      due to requests initiated by the vL1D cache, per normalization unit.
+    Evict (Internal): The total number of L2 cache lines evicted from the cache due
+      to capacity limits, per normalization unit.
+    Evict (vL1D Req): The total number of L2 cache lines evicted from the cache due
+      to invalidation requests initiated by the vL1D cache, per normalization unit.
+    NC Req: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory
+      allocations, per normalization unit.
+    UC Req: The total number of requests to the L2 that go to Uncached (UC) memory
+      allocations.
+    CC Req: The total number of requests to the L2 that go to Coherently Cacheable
+      (CC) memory allocations.
+    RW Req: The total number of requests to the L2 that go to Read-Write coherent
+      memory (RW) allocations.
+    Write - Credit Starvation: The number of cycles the L2-Fabric interface was stalled
+      on write or atomic requests to any memory location because too many write/atomic
+      requests were currently in flight, as a percent of the total active L2 cycles.
+    Read (32B): The total number of L2 requests to Infinity Fabric to read 32B of
+      data from any memory location, per normalization unit.
+    Read (64B): The total number of L2 requests to Infinity Fabric to read 64B of
+      data from any memory location, per normalization unit.
+    Read (Uncached): The total number of L2 requests to Infinity Fabric to read uncached
+      data from any memory location, per normalization unit. 64B requests for uncached
+      data are counted as two 32B uncached data requests.
+    HBM Read: The total number of L2 requests to Infinity Fabric to read 32B or 64B
+      of data from the accelerator's local HBM, per normalization unit.
+    Remote Read: The total number of L2 requests to Infinity Fabric to read 32B or
+      64B of data from any source other than the accelerator's local HBM, per normalization
+      unit.
+    Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
+      write or atomically update 32B of data to any memory location, per normalization
+      unit.
+    Write and Atomic (Uncached): The total number of L2 requests to Infinity Fabric
+      to write or atomically update 32B or 64B of uncached data, per normalization
+      unit.
+    Write and Atomic (64B): The total number of L2 requests to Infinity Fabric to
+      write or atomically update 64B of data in any memory location, per normalization
+      unit.
+    HBM Write and Atomic: The total number of L2 requests to Infinity Fabric to write
+      or atomically update 32B or 64B of data in the accelerator's local HBM, per
+      normalization unit.
+    Remote Write and Atomic: The total number of L2 requests to Infinity Fabric to
+      write or atomically update 32B or 64B of data in any memory location other than
+      the accelerator's local HBM, per normalization unit.
+    Atomic: The total number of L2 requests to Infinity Fabric to atomically update
+      32B or 64B of data in any memory location, per normalization unit. See Request
+      flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
+      requests are only considered atomic by Infinity Fabric if they are targeted
+      at non-write-cacheable memory, such as fine-grained memory allocations or uncached
+      memory allocations on the MI2XX.
@@ -2,10 +2,6 @@
 Panel Config:
  id: 1800
  title: L2 Cache (per Channel)
-  metrics_description:
-    L2 Cache Hit Rate: The percent of total number of requests to the L2 from all
-      clients that hit in the cache. As noted in the Speed-of-Light section, this
-      includes hit-on-miss requests.
  data source:
  - metric_table:
      id: 1801
@@ -249,3 +245,7 @@ Panel Config:
          ::_1: $total_l2_chan
      cli_style: simple_box
      tui_style: simple_box
+  metrics_description:
+    L2 Cache Hit Rate: The percent of total number of requests to the L2 from all
+      clients that hit in the cache. As noted in the Speed-of-Light section, this
+      includes hit-on-miss requests.
@@ -2,10 +2,10 @@
 Panel Config:
  id: 2100
  title: PC Sampling
-  metrics_description: {}
  data source:
  - pc_sampling_table:
      id: 2101
      title: PC Sampling
      source: ps_file
      comparable: false
+  metrics_description: {}
@@ -0,0 +1,755 @@
+# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated by tools/config_management/generate_config_deltas.py
+Addition:
+  - Panel Config:
+      id: 200
+      title: System Speed-of-Light
+    metric_tables:
+      - metric_table:
+          id: 201
+          title: System Speed-of-Light
+          metrics:
+            - MFMA FLOPs (F6F4):
+                value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) / (End_Timestamp - Start_Timestamp)))
+                unit: GFLOP/s
+                peak: ((($max_sclk * $cu_per_gpu) * 16834) / 1000)
+                pop: |
+                  ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 16834) / 1000))
+  - Panel Config:
+      id: 300
+      title: Memory Chart
+    metric_tables:
+      - metric_table:
+          id: 301
+          title: Memory Chart
+          metrics:
+            - L2 Wr Lat:
+                value: |
+                  ROUND(AVG(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0) else None)), 0)
+            - L2 Rd Lat:
+                value: |
+                  ROUND(AVG(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)) if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else None)), 0)
+  - Panel Config:
+      id: 400
+      title: Roofline
+    metric_tables:
+      - metric_table:
+          id: 401
+          title: Roofline Performance Rates
+          metrics:
+            - MFMA FLOPs (F6F4):
+                value: |
+                  AVG((((SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+                unit: GFLOP/s
+                peak: $MFMA_FLOPs_F6F4_empirical_peak
+  - Panel Config:
+      id: 500
+      title: Command Processor (CPC/CPF)
+    metric_tables:
+      - metric_table:
+          id: 502
+          title: Command processor packet processor (CPC)
+          metrics:
+            - CPC SYNC FIFO Full Rate:
+                avg: |
+                  AVG((100 * CPC_SYNC_FIFO_FULL) / CPC_SYNC_WRREQ_FIFO_BUSY if (CPC_SYNC_WRREQ_FIFO_BUSY != 0) else None)
+                min: |
+                  MIN((100 * CPC_SYNC_FIFO_FULL) / CPC_SYNC_WRREQ_FIFO_BUSY if (CPC_SYNC_WRREQ_FIFO_BUSY != 0) else None)
+                max: |
+                  MAX((100 * CPC_SYNC_FIFO_FULL) / CPC_SYNC_WRREQ_FIFO_BUSY if (CPC_SYNC_WRREQ_FIFO_BUSY != 0) else None)
+                unit: pct
+            - CPC CANE Stall Rate:
+                avg: AVG((100 * CPC_CANE_STALL) / CPC_CANE_BUSY if (CPC_CANE_BUSY != 0) else None)
+                min: MIN((100 * CPC_CANE_STALL) / CPC_CANE_BUSY if (CPC_CANE_BUSY != 0) else None)
+                max: MAX((100 * CPC_CANE_STALL) / CPC_CANE_BUSY if (CPC_CANE_BUSY != 0) else None)
+                unit: pct
+            - CPC ADC Utilization:
+                avg: AVG((100 * CPC_TG_SEND) / CPC_GD_BUSY if (CPC_GD_BUSY != 0) else None)
+                min: MIN((100 * CPC_TG_SEND) / CPC_GD_BUSY if (CPC_GD_BUSY != 0) else None)
+                max: MAX((100 * CPC_TG_SEND) / CPC_GD_BUSY if (CPC_GD_BUSY != 0) else None)
+                unit: pct
+  - Panel Config:
+      id: 600
+      title: Workgroup Manager (SPI)
+    metric_tables:
+      - metric_table:
+          id: 601
+          title: Workgroup manager utilizations
+          metrics:
+            - Scheduler-Pipe Wave Utilization:
+                avg: |
+                  AVG(100 * (SPI_CSC_WAVE_CNT_BUSY) / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
+                min: |
+                  MIN(100 * (SPI_CSC_WAVE_CNT_BUSY) / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
+                max: |
+                  MAX(100 * (SPI_CSC_WAVE_CNT_BUSY) / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
+                unit: Pct
+            - Schedule-Pipe Wave Occupancy:
+                avg: |
+                  AVG(SPI_CSQ_P0_OCCUPANCY + SPI_CSQ_P1_OCCUPANCY + SPI_CSQ_P2_OCCUPANCY + SPI_CSQ_P3_OCCUPANCY)
+                min: |
+                  MIN(SPI_CSQ_P0_OCCUPANCY + SPI_CSQ_P1_OCCUPANCY + SPI_CSQ_P2_OCCUPANCY + SPI_CSQ_P3_OCCUPANCY)
+                max: |
+                  MAX(SPI_CSQ_P0_OCCUPANCY + SPI_CSQ_P1_OCCUPANCY + SPI_CSQ_P2_OCCUPANCY + SPI_CSQ_P3_OCCUPANCY)
+                unit: Wave
+      - metric_table:
+          id: 602
+          title: Workgroup Manager - Resource Allocation
+          metrics:
+            - Scheduler-Pipe FIFO Full Rate:
+                avg: |
+                  AVG((100 * (SPI_CS0_CRAWLER_STALL + SPI_CS1_CRAWLER_STALL + SPI_CS2_CRAWLER_STALL + SPI_CS3_CRAWLER_STALL) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
+                min: |
+                  MIN((100 * (SPI_CS0_CRAWLER_STALL + SPI_CS1_CRAWLER_STALL + SPI_CS2_CRAWLER_STALL + SPI_CS3_CRAWLER_STALL) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
+                max: |
+                  MAX((100 * (SPI_CS0_CRAWLER_STALL + SPI_CS1_CRAWLER_STALL + SPI_CS2_CRAWLER_STALL + SPI_CS3_CRAWLER_STALL) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
+                unit: Pct
+  - Panel Config:
+      id: 1000
+      title: Compute Units - Instruction Mix
+    metric_tables:
+      - metric_table:
+          id: 1003
+          title: VMEM Instruction Mix
+          metrics:
+            - Spill/Stack Coalesceable Instr:
+                avg: AVG((TA_BUFFER_COALESCEABLE_WAVEFRONTS_sum / $denom))
+                min: MIN((TA_BUFFER_COALESCEABLE_WAVEFRONTS_sum / $denom))
+                max: MAX((TA_BUFFER_COALESCEABLE_WAVEFRONTS_sum / $denom))
+                unit: (instr + $normUnit)
+      - metric_table:
+          id: 1004
+          title: MFMA Arithmetic Instruction Mix
+          metrics:
+            - MFMA-F6F4:
+                avg: AVG((SQ_INSTS_VALU_MFMA_F6F4 / $denom))
+                min: MIN((SQ_INSTS_VALU_MFMA_F6F4 / $denom))
+                max: MAX((SQ_INSTS_VALU_MFMA_F6F4 / $denom))
+                unit: (instr + $normUnit)
+  - Panel Config:
+      id: 1100
+      title: Compute Units - Compute Pipeline
+    metric_tables:
+      - metric_table:
+          id: 1101
+          title: Compute Speed-of-Light
+          metrics:
+            - MFMA FLOPs (F6F4):
+                value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) / (End_Timestamp - Start_Timestamp)))
+                unit: GFLOP
+                peak: ((($max_sclk * $cu_per_gpu) * 16834) / 1000)
+                pop: |
+                  ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 16834) / 1000))
+      - metric_table:
+          id: 1102
+          title: Pipeline Statistics
+          metrics:
+            - VALU Co-Issue Efficiency:
+                avg: AVG((100 * SQ_ACTIVE_INST_VALU2) / (SQ_ACTIVE_INST_VALU - SQ_ACTIVE_INST_VALU2))
+                min: MIN((100 * SQ_ACTIVE_INST_VALU2) / (SQ_ACTIVE_INST_VALU - SQ_ACTIVE_INST_VALU2))
+                max: MAX((100 * SQ_ACTIVE_INST_VALU2) / (SQ_ACTIVE_INST_VALU - SQ_ACTIVE_INST_VALU2))
+                unit: pct
+      - metric_table:
+          id: 1103
+          title: Arithmetic Operations
+          metrics:
+            - F6F4 OPs:
+                avg: AVG((512 * SQ_INSTS_VALU_MFMA_MOPS_F6F4) / $denom)
+                min: MIN((512 * SQ_INSTS_VALU_MFMA_MOPS_F6F4) / $denom)
+                max: MAX((512 * SQ_INSTS_VALU_MFMA_MOPS_F6F4) / $denom)
+                unit: (OPs + $normUnit)
+  - Panel Config:
+      id: 1200
+      title: Local Data Share (LDS)
+    metric_tables:
+      - metric_table:
+          id: 1202
+          title: LDS Statistics
+          metrics:
+            - LDS STORE:
+                avg: AVG((SQ_INSTS_LDS_STORE / $denom))
+                min: MIN((SQ_INSTS_LDS_STORE / $denom))
+                max: MAX((SQ_INSTS_LDS_STORE / $denom))
+                unit: (instr + $normUnit)
+            - LDS LOAD Bandwidth:
+                avg: AVG(64 * SQ_INSTS_LDS_LOAD_BANDWIDTH / (End_Timestamp - Start_Timestamp))
+                min: MIN(64 * SQ_INSTS_LDS_LOAD_BANDWIDTH / (End_Timestamp - Start_Timestamp))
+                max: MAX(64 * SQ_INSTS_LDS_LOAD_BANDWIDTH / (End_Timestamp - Start_Timestamp))
+                units: Gbps
+            - LDS ATOMIC:
+                avg: AVG((SQ_INSTS_LDS_ATOMIC / $denom))
+                min: MIN((SQ_INSTS_LDS_ATOMIC / $denom))
+                max: MAX((SQ_INSTS_LDS_ATOMIC / $denom))
+                unit: (instr + $normUnit)
+            - LDS STORE Bandwidth:
+                avg: AVG(64 * SQ_INSTS_LDS_STORE_BANDWIDTH / (End_Timestamp - Start_Timestamp))
+                min: MIN(64 * SQ_INSTS_LDS_STORE_BANDWIDTH / (End_Timestamp - Start_Timestamp))
+                max: MAX(64 * SQ_INSTS_LDS_STORE_BANDWIDTH / (End_Timestamp - Start_Timestamp))
+                units: Gbps
+            - LDS Command FIFO Full Rate:
+                avg: AVG((SQ_LDS_CMD_FIFO_FULL / $denom))
+                min: MIN((SQ_LDS_CMD_FIFO_FULL / $denom))
+                max: MAX((SQ_LDS_CMD_FIFO_FULL / $denom))
+                unit: (Cycles + $normUnit)
+            - LDS LOAD:
+                avg: AVG((SQ_INSTS_LDS_LOAD / $denom))
+                min: MIN((SQ_INSTS_LDS_LOAD / $denom))
+                max: MAX((SQ_INSTS_LDS_LOAD / $denom))
+                unit: (instr + $normUnit)
+            - LDS ATOMIC Bandwidth:
+                avg: AVG(64 * SQ_INSTS_LDS_ATOMIC_BANDWIDTH / (End_Timestamp - Start_Timestamp))
+                min: MIN(64 * SQ_INSTS_LDS_ATOMIC_BANDWIDTH / (End_Timestamp - Start_Timestamp))
+                max: MAX(64 * SQ_INSTS_LDS_ATOMIC_BANDWIDTH / (End_Timestamp - Start_Timestamp))
+                units: Gbps
+            - LDS Data FIFO Full Rate:
+                avg: AVG((SQ_LDS_DATA_FIFO_FULL / $denom))
+                min: MIN((SQ_LDS_DATA_FIFO_FULL / $denom))
+                max: MAX((SQ_LDS_DATA_FIFO_FULL / $denom))
+                unit: (Cycles + $normUnit)
+  - Panel Config:
+      id: 1500
+      title: Address Processing Unit and Data Return Path (TA/TD)
+    metric_tables:
+      - metric_table:
+          id: 1504
+          title: Vector L1 data-return path or Texture Data (TD)
+          metrics:
+            - Write Ack Instructions:
+                avg: AVG((TD_WRITE_ACKT_WAVEFRONT_sum / $denom))
+                min: MIN((TD_WRITE_ACKT_WAVEFRONT_sum / $denom))
+                max: MAX((TD_WRITE_ACKT_WAVEFRONT_sum / $denom))
+                unit: (Instructions + $normUnit)
+      - metric_table:
+          id: 1502
+          title: Instruction counts
+          metrics:
+            - Spill/Stack Read Instructions for LDS:
+                avg: AVG((TA_BUFFER_READ_LDS_WAVEFRONTS_sum / $denom))
+                min: MIN((TA_BUFFER_READ_LDS_WAVEFRONTS_sum / $denom))
+                max: MAX((TA_BUFFER_READ_LDS_WAVEFRONTS_sum / $denom))
+                unit: (Instructions + $normUnit)
+            - Global/Generic Read Instructions for LDS:
+                avg: AVG((TA_FLAT_READ_LDS_WAVEFRONTS_sum / $denom))
+                min: MIN((TA_FLAT_READ_LDS_WAVEFRONTS_sum / $denom))
+                max: MAX((TA_FLAT_READ_LDS_WAVEFRONTS_sum / $denom))
+                unit: (Instructions + $normUnit)
+  - Panel Config:
+      id: 1600
+      title: Vector L1 Data Cache
+    metric_tables:
+      - metric_table:
+          id: 1602
+          title: vL1D cache stall metrics
+          metrics:
+            - Stalled on Address:
+                expr: |
+                  (((100 * TCP_TCP_TA_ADDR_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum != 0) else None)
+            - Stalled on Read Return:
+                expr: |
+                  (((100 * TCP_TCR_RDRET_STALL_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum != 0) else None)
+            - Stalled on Request FIFO:
+                expr: |
+                  (((100 * TCP_RFIFO_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum != 0) else None)
+            - Stalled on Data:
+                expr: |
+                  (((100 * TCP_TCP_TA_DATA_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum != 0) else None)
+            - Stalled on Latency FIFO:
+                expr: |
+                  (((100 * TCP_LFIFO_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum != 0) else None)
+      - metric_table:
+          id: 1603
+          title: vL1D cache access metrics
+          metrics:
+            - Tag RAM 3 Req:
+                avg: AVG((TCP_TAGRAM3_REQ_sum / $denom))
+                min: MIN((TCP_TAGRAM3_REQ_sum / $denom))
+                max: MAX((TCP_TAGRAM3_REQ_sum / $denom))
+                unit: (Req + $normUnit)
+            - L1-L2 Read Latency:
+                avg: AVG((TCP_TCC_READ_REQ_LATENCY_sum / $denom))
+                min: MIN((TCP_TCC_READ_REQ_LATENCY_sum / $denom))
+                max: MAX((TCP_TCC_READ_REQ_LATENCY_sum / $denom))
+                unit: (Cycles + $normUnit)
+            - Tag RAM 2 Req:
+                avg: AVG((TCP_TAGRAM2_REQ_sum / $denom))
+                min: MIN((TCP_TAGRAM2_REQ_sum / $denom))
+                max: MAX((TCP_TAGRAM2_REQ_sum / $denom))
+                unit: (Req + $normUnit)
+            - Tag RAM 0 Req:
+                avg: AVG((TCP_TAGRAM0_REQ_sum / $denom))
+                min: MIN((TCP_TAGRAM0_REQ_sum / $denom))
+                max: MAX((TCP_TAGRAM0_REQ_sum / $denom))
+                unit: (Req + $normUnit)
+            - L1-L2 Write Latency:
+                avg: AVG((TCP_TCC_WRITE_REQ_LATENCY_sum / $denom))
+                min: MIN((TCP_TCC_WRITE_REQ_LATENCY_sum / $denom))
+                max: MAX((TCP_TCC_WRITE_REQ_LATENCY_sum / $denom))
+                unit: (Cycles + $normUnit)
+            - L1 Access Latency:
+                avg: AVG((TCP_TCP_LATENCY_sum / $denom))
+                min: MIN((TCP_TCP_LATENCY_sum / $denom))
+                max: MAX((TCP_TCP_LATENCY_sum / $denom))
+                unit: (Cycles + $normUnit)
+            - Tag RAM 1 Req:
+                avg: AVG((TCP_TAGRAM1_REQ_sum / $denom))
+                min: MIN((TCP_TAGRAM1_REQ_sum / $denom))
+                max: MAX((TCP_TAGRAM1_REQ_sum / $denom))
+                unit: (Req + $normUnit)
+      - metric_table:
+          id: 1605
+          title: L1 Unified Translation Cache (UTCL1)
+          metrics:
+            - Misses under Translation Miss:
+                avg: AVG((TCP_UTCL1_TRANSLATION_MISS_UNDER_MISS_sum / $denom))
+                min: MIN((TCP_UTCL1_TRANSLATION_MISS_UNDER_MISS_sum / $denom))
+                max: MAX((TCP_UTCL1_TRANSLATION_MISS_UNDER_MISS_sum / $denom))
+                units: (Req + $normUnit)
+            - Inflight Req:
+                avg: AVG((TCP_CLIENT_UTCL1_INFLIGHT_sum / $denom))
+                min: MIN((TCP_CLIENT_UTCL1_INFLIGHT_sum / $denom))
+                max: MAX((TCP_CLIENT_UTCL1_INFLIGHT_sum / $denom))
+                units: (Req + $normUnit)
+      - metric_table:
+          id: 1606
+          title: L1D Addr Translation Stalls
+          metrics:
+            - Latency FIFO Stall:
+                avg: AVG((TCP_UTCL1_LFIFO_FULL_sum / $denom))
+                min: MIN((TCP_UTCL1_LFIFO_FULL_sum / $denom))
+                max: MAX((TCP_UTCL1_LFIFO_FULL_sum / $denom))
+                units: (Cycles + $normUnit)
+            - Serialization Stall:
+                avg: AVG((TCP_UTCL1_SERIALIZATION_STALL_sum / $denom))
+                min: MIN((TCP_UTCL1_SERIALIZATION_STALL_sum / $denom))
+                max: MAX((TCP_UTCL1_SERIALIZATION_STALL_sum / $denom))
+                units: (Cycles + $normUnit)
+            - Cache Full Stall:
+                avg: AVG((TCP_UTCL1_STALL_INFLIGHT_MAX_sum / $denom))
+                min: MIN((TCP_UTCL1_STALL_INFLIGHT_MAX_sum / $denom))
+                max: MAX((TCP_UTCL1_STALL_INFLIGHT_MAX_sum / $denom))
+                units: (Cycles + $normUnit)
+            - UTCL2 Stall:
+                avg: AVG((TCP_UTCL1_STALL_UTCL2_REQ_OUT_OF_CREDITS_sum / $denom))
+                min: MIN((TCP_UTCL1_STALL_UTCL2_REQ_OUT_OF_CREDITS_sum / $denom))
+                max: MAX((TCP_UTCL1_STALL_UTCL2_REQ_OUT_OF_CREDITS_sum / $denom))
+                units: (Cycles + $normUnit)
+            - Cache Miss Stall:
+                avg: AVG((TCP_UTCL1_STALL_MULTI_MISS_sum / $denom))
+                min: MIN((TCP_UTCL1_STALL_MULTI_MISS_sum / $denom))
+                max: MAX((TCP_UTCL1_STALL_MULTI_MISS_sum / $denom))
+                units: (Cycles + $normUnit)
+            - Resident Page Full Stall:
+                avg: AVG((TCP_UTCL1_STALL_LFIFO_NO_RES_sum / $denom))
+                min: MIN((TCP_UTCL1_STALL_LFIFO_NO_RES_sum / $denom))
+                max: MAX((TCP_UTCL1_STALL_LFIFO_NO_RES_sum / $denom))
+                units: (Cycles + $normUnit)
+            - Thrashing Stall:
+                avg: AVG((TCP_UTCL1_THRASHING_STALL_sum / $denom))
+                min: MIN((TCP_UTCL1_THRASHING_STALL_sum / $denom))
+                max: MAX((TCP_UTCL1_THRASHING_STALL_sum / $denom))
+                units: (Cycles + $normUnit)
+  - Panel Config:
+      id: 1700
+      title: L2 Cache
+    metric_tables:
+      - metric_table:
+          id: 1702
+          title: L2-Fabric interface metrics
+          metrics:
+            - Write Stall:
+                avg: |
+                  AVG(((100 * (TCC_EA0_WRREQ_STALL_sum) / TCC_BUSY_sum) if (TCC_BUSY_sum != 0) else None))
+                min: |
+                  MIN(((100 * (TCC_EA0_WRREQ_STALL_sum) / TCC_BUSY_sum) if (TCC_BUSY_sum != 0) else None))
+                max: |
+                  MAX(((100 * (TCC_EA0_WRREQ_STALL_sum) / TCC_BUSY_sum) if (TCC_BUSY_sum != 0) else None))
+                unit: pct
+            - Read Stall:
+                avg: |
+                  AVG((((100 * ((TCC_EA0_RDREQ_IO_CREDIT_STALL_sum + TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum) + TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum)) / TCC_BUSY_sum) if (TCC_BUSY_sum != 0) else None))
+                min: |
+                  MIN((((100 * ((TCC_EA0_RDREQ_IO_CREDIT_STALL_sum + TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum) + TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum)) / TCC_BUSY_sum) if (TCC_BUSY_sum != 0) else None))
+                max: |
+                  MAX((((100 * ((TCC_EA0_RDREQ_IO_CREDIT_STALL_sum + TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum) + TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum)) / TCC_BUSY_sum) if (TCC_BUSY_sum != 0) else None))
+                unit: pct
+      - metric_table:
+          id: 1703
+          title: L2 Cache Accesses
+          metrics:
+            - Input Buffer Req:
+                avg: AVG((TCC_IB_REQ_sum / $denom))
+                min: MIN((TCC_IB_REQ_sum / $denom))
+                max: MAX((TCC_IB_REQ_sum / $denom))
+                unit: (Req + $normUnit)
+            - Bypass Req:
+                avg: AVG((TCC_BYPASS_REQ_sum / $denom))
+                min: MIN((TCC_BYPASS_REQ_sum / $denom))
+                max: MAX((TCC_BYPASS_REQ_sum / $denom))
+                unit: (Req + $normUnit)
+            - Atomic Bandwidth:
+                avg: AVG(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+                min: MIN(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+                max: MAX(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+            - Write Bandwidth:
+                avg: AVG(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+                min: MIN(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+                max: MAX(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+            - Read Bandwidth:
+                avg: AVG(TCC_READ_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+                min: MIN(TCC_READ_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+                max: MAX(TCC_READ_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+      - metric_table:
+          id: 1704
+          title: L2 Cache Stalls
+          metrics:
+            - Input Buffer Stalled on L2:
+                avg: AVG(TCC_IB_STALL_sum / $denom)
+                min: MIN(TCC_IB_STALL_sum / $denom)
+                max: MAX(TCC_IB_STALL_sum / $denom)
+                unit: (Cycles + $normUnit)
+            - Stalled on Latency FIFO:
+                avg: AVG(TCC_LATENCY_FIFO_FULL_sum / $denom)
+                min: MIN(TCC_LATENCY_FIFO_FULL_sum / $denom)
+                max: MAX(TCC_LATENCY_FIFO_FULL_sum / $denom)
+                unit: (Cycles + $normUnit)
+            - Stalled on Write Data FIFO:
+                avg: AVG(TCC_SRC_FIFO_FULL_sum / $denom)
+                min: MIN(TCC_SRC_FIFO_FULL_sum / $denom)
+                max: MAX(TCC_SRC_FIFO_FULL_sum / $denom)
+                unit: (Cycles + $normUnit)
+      - metric_table:
+          id: 1705
+          title: L2 - Fabric Interface stalls
+          metrics:
+            - Write - HBM Stall:
+                type: HBM Stall
+                transaction: Write
+                avg: |
+                  AVG(((100 * (TCC_EA0_WRREQ_DRAM_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                min: |
+                  MIN(((100 * (TCC_EA0_WRREQ_DRAM_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                max: |
+                  MAX(((100 * (TCC_EA0_WRREQ_DRAM_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                unit: pct
+            - Read - HBM Stall:
+                type: HBM Stall
+                transaction: Read
+                avg: |
+                  AVG(((100 * (TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                min: |
+                  MIN(((100 * (TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                max: |
+                  MAX(((100 * (TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                unit: pct
+            - Write - PCIe Stall:
+                type: PCIe Stall
+                transaction: Write
+                avg: |
+                  AVG(((100 * (TCC_EA0_WRREQ_IO_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                min: |
+                  MIN(((100 * (TCC_EA0_WRREQ_IO_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                max: |
+                  MAX(((100 * (TCC_EA0_WRREQ_IO_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                unit: pct
+            - Write - Infinity Fabric Stall:
+                type: Infinity Fabric™ Stall
+                transaction: Write
+                avg: |
+                  AVG(((100 * (TCC_EA0_WRREQ_GMI_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                min: |
+                  MIN(((100 * (TCC_EA0_WRREQ_GMI_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                max: |
+                  MAX(((100 * (TCC_EA0_WRREQ_GMI_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                unit: pct
+            - Read - Infinity Fabric Stall:
+                type: Infinity Fabric™ Stall
+                transaction: Read
+                avg: |
+                  AVG(((100 * (TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                min: |
+                  MIN(((100 * (TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                max: |
+                  MAX(((100 * (TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                unit: pct
+            - Read - PCIe Stall:
+                type: PCIe Stall
+                transaction: Read
+                avg: |
+                  AVG(((100 * (TCC_EA0_RDREQ_IO_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                min: |
+                  MIN(((100 * (TCC_EA0_RDREQ_IO_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                max: |
+                  MAX(((100 * (TCC_EA0_RDREQ_IO_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                unit: pct
+      - metric_table:
+          id: 1706
+          title: L2 - Fabric interface detailed metrics
+          metrics:
+            - Read Bandwidth - PCIe:
+                avg: AVG(TCC_EA0_RDREQ_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                min: MIN(TCC_EA0_RDREQ_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                max: MAX(TCC_EA0_RDREQ_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+            - Write Bandwidth - Infinity Fabric™:
+                avg: AVG(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                min: MIN(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                max: MAX(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+            - Atomic Bandwidth - HBM:
+                avg: AVG(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                min: MIN(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                max: MAX(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+            - Atomic - HBM:
+                avg: AVG((TCC_EA0_WRREQ_ATOMIC_DRAM_sum / $denom))
+                min: MIN((TCC_EA0_WRREQ_ATOMIC_DRAM_sum / $denom))
+                max: MAX((TCC_EA0_WRREQ_ATOMIC_DRAM_sum / $denom))
+                unit: (Req + $normUnit)
+            - Read Bandwidth - HBM:
+                avg: AVG(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                min: MIN(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                max: MAX(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+            - Atomic Bandwidth - Infinity Fabric™:
+                avg: AVG(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                min: MIN(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                max: MAX(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+            - Write Bandwidth - HBM:
+                avg: AVG(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                min: MIN(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                max: MAX(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+            - Atomic Bandwidth - PCIe:
+                avg: AVG(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                min: MIN(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                max: MAX(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+            - Read (128B):
+                avg: AVG((TCC_EA0_RDREQ_128B_sum / $denom))
+                min: MIN((TCC_EA0_RDREQ_128B_sum / $denom))
+                max: MAX((TCC_EA0_RDREQ_128B_sum / $denom))
+                unit: (Req + $normUnit)
+            - Read Bandwidth - Infinity Fabric™:
+                avg: AVG(TCC_EA0_RDREQ_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                min: MIN(TCC_EA0_RDREQ_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                max: MAX(TCC_EA0_RDREQ_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+            - Write Bandwidth - PCIe:
+                avg: AVG(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                min: MIN(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                max: MAX(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+
+Deletion:
+  []
+
+Modification:
+  - Panel Config:
+      id: 200
+      title: System Speed-of-Light
+    metric_tables:
+      - metric_table:
+          id: 201
+          title: System Speed-of-Light
+          metrics:
+            - MFMA FLOPs (F8):
+                pop: |
+                  ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 8192) / 1000))
+                peak: ((($max_sclk * $cu_per_gpu) * 8192) / 1000)
+            - MFMA FLOPs (F64):
+                pop: |
+                  ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 128) / 1000))
+                peak: ((($max_sclk * $cu_per_gpu) * 128) / 1000)
+            - MFMA IOPs (Int8):
+                pop: |
+                  ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 8192) / 1000))
+                peak: ((($max_sclk * $cu_per_gpu) * 8192) / 1000)
+            - MFMA FLOPs (F16):
+                pop: |
+                  ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 4096) / 1000))
+                peak: ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
+            - MFMA FLOPs (BF16):
+                pop: |
+                  ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 4096) / 1000))
+                peak: ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
+  - Panel Config:
+      id: 300
+      title: Memory Chart
+    metric_tables:
+      - metric_table:
+          id: 301
+          title: Memory Chart
+          metrics:
+            - Wavefronts:
+                value: ROUND(AVG(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE), 0)
+            - Workgroups:
+                value: |
+                  ROUND(AVG(SPI_CS0_NUM_THREADGROUPS + SPI_CS1_NUM_THREADGROUPS + SPI_CS2_NUM_THREADGROUPS + SPI_CS3_NUM_THREADGROUPS), 0)
+  - Panel Config:
+      id: 400
+      title: Roofline
+    metric_tables:
+      - metric_table:
+          id: 402
+          title: Roofline Plot Points
+          metrics:
+            - Performance (GFLOPs):
+                value: |
+                  ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9
+            - AI L2:
+                value: |
+                  ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / SUM( (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64 ) )
+            - AI L1:
+                value: |
+                  ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / SUM(TCP_TOTAL_CACHE_ACCESSES_sum * 64) )
+            - AI HBM:
+                value: |
+                  ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / SUM( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum * 64) ) )
+  - Panel Config:
+      id: 600
+      title: Workgroup Manager (SPI)
+    metric_tables:
+      - metric_table:
+          id: 601
+          title: Workgroup manager utilizations
+          metrics:
+            - SGPR Writes:
+                max: |
+                  MAX((((1 * SPI_SWC_CSC_WR) / (SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)) if ((SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE) != 0) else None))
+                min: |
+                  MIN((((1 * SPI_SWC_CSC_WR) / (SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)) if ((SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE) != 0) else None))
+                avg: |
+                  AVG((((1 * SPI_SWC_CSC_WR) / (SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)) if ((SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE) != 0) else None))
+            - Dispatched Wavefronts:
+                max: MAX(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)
+                min: MIN(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)
+                avg: AVG(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)
+            - Dispatched Workgroups:
+                max: |
+                  MAX(SPI_CS0_NUM_THREADGROUPS + SPI_CS1_NUM_THREADGROUPS + SPI_CS2_NUM_THREADGROUPS + SPI_CS3_NUM_THREADGROUPS)
+                min: |
+                  MIN(SPI_CS0_NUM_THREADGROUPS + SPI_CS1_NUM_THREADGROUPS + SPI_CS2_NUM_THREADGROUPS + SPI_CS3_NUM_THREADGROUPS)
+                avg: |
+                  AVG(SPI_CS0_NUM_THREADGROUPS + SPI_CS1_NUM_THREADGROUPS + SPI_CS2_NUM_THREADGROUPS + SPI_CS3_NUM_THREADGROUPS)
+            - Scheduler-Pipe Utilization:
+                max: |
+                  MAX(100 * (SPI_CS0_BUSY + SPI_CS1_BUSY + SPI_CS2_BUSY + SPI_CS3_BUSY) / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
+                min: |
+                  MIN(100 * (SPI_CS0_BUSY + SPI_CS1_BUSY + SPI_CS2_BUSY + SPI_CS3_BUSY) / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
+                avg: |
+                  AVG(100 * (SPI_CS0_BUSY + SPI_CS1_BUSY + SPI_CS2_BUSY + SPI_CS3_BUSY) / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
+            - VGPR Writes:
+                max: |
+                  MAX((((SPI_VWC0_VDATA_VALID_WR + SPI_VWC1_VDATA_VALID_WR) / (SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)) if ((SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE) != 0) else None))
+                min: |
+                  MIN((((SPI_VWC0_VDATA_VALID_WR + SPI_VWC1_VDATA_VALID_WR) / (SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)) if ((SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE) != 0) else None))
+                avg: |
+                  AVG((((SPI_VWC0_VDATA_VALID_WR + SPI_VWC1_VDATA_VALID_WR) / (SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)) if ((SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE) != 0) else None))
+  - Panel Config:
+      id: 700
+      title: Wavefront
+    metric_tables:
+      - metric_table:
+          id: 701
+          title: Wavefront Launch Stats
+          metrics:
+            - Total Wavefronts:
+                max: MAX(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)
+                min: MIN(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)
+                avg: AVG(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)
+  - Panel Config:
+      id: 1100
+      title: Compute Units - Compute Pipeline
+    metric_tables:
+      - metric_table:
+          id: 1101
+          title: Compute Speed-of-Light
+          metrics:
+            - MFMA FLOPs (F16):
+                pop: |
+                  ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 4096) / 1000))
+                peak: ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
+            - MFMA FLOPs (F64):
+                pop: |
+                  ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 128) / 1000))
+                peak: ((($max_sclk * $cu_per_gpu) * 128) / 1000)
+            - MFMA IOPs (INT8):
+                pop: |
+                  ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 8192) / 1000))
+                peak: ((($max_sclk * $cu_per_gpu) * 8192) / 1000)
+            - MFMA FLOPs (BF16):
+                pop: |
+                  ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 4096) / 1000))
+                peak: ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
+            - MFMA FLOPs (F8):
+                pop: |
+                  ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 8192) / 1000))
+                peak: ((($max_sclk * $cu_per_gpu) * 8192) / 1000)
+      - metric_table:
+          id: 1103
+          title: Arithmetic Operations
+          metrics:
+            - FLOPs (Total):
+                max: |
+                  MAX((((((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_FMA_F16 * 2))) + ((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F16) + (512 * SQ_INSTS_VALU_MFMA_MOPS_BF16))) + (64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_FMA_F32 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32)) + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64) + (SQ_INSTS_VALU_FMA_F64 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F6F4)) / $denom))
+                min: |
+                  MIN((((((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_FMA_F16 * 2))) + ((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F16) + (512 * SQ_INSTS_VALU_MFMA_MOPS_BF16))) + (64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_FMA_F32 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32)) + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64) + (SQ_INSTS_VALU_FMA_F64 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F6F4)) / $denom))
+                avg: |
+                  AVG((((((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_FMA_F16 * 2))) + ((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F16) + (512 * SQ_INSTS_VALU_MFMA_MOPS_BF16))) + (64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_FMA_F32 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32)) + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64) + (SQ_INSTS_VALU_FMA_F64 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F6F4)) / $denom))
+  - Panel Config:
+      id: 1700
+      title: L2 Cache
+    metric_tables:
+      - metric_table:
+          id: 1701
+          title: L2 Speed-of-Light
+          metrics:
+            - L2-Fabric Read BW:
+                value: |
+                  AVG((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) + (TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp)))
+      - metric_table:
+          id: 1702
+          title: L2-Fabric interface metrics
+          metrics:
+            - Read BW:
+                max: |
+                  MAX((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) + (TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp)))
+                min: |
+                  MIN((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) + (TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp)))
+                avg: |
+                  AVG((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) + (TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp)))
+      - metric_table:
+          id: 1706
+          title: L2 - Fabric interface detailed metrics
+          metrics:
+            - Read (64B):
+                max: MAX((TCC_EA0_RDREQ_64B_sum / $denom))
+                min: MIN((TCC_EA0_RDREQ_64B_sum / $denom))
+                avg: AVG((TCC_EA0_RDREQ_64B_sum / $denom))
+            - HBM Write and Atomic:
+                max: MAX((TCC_EA0_WRREQ_WRITE_DRAM_sum / $denom))
+                min: MIN((TCC_EA0_WRREQ_WRITE_DRAM_sum / $denom))
+                avg: AVG((TCC_EA0_WRREQ_WRITE_DRAM_sum / $denom))
+  - Panel Config:
+      id: 1800
+      title: L2 Cache (per Channel)
+    metric_tables:
+      - metric_table:
+          id: 1809
+          title: L2-Fabric Read Stall (Cycles per normUnit)
+          metrics:
+            - ::_1:
+                ea read stall - pcie: AVG((TO_INT(TCC_EA0_RDREQ_IO_CREDIT_STALL[::_1]) / $denom))
+                ea read stall - hbm: AVG((TO_INT(TCC_EA0_RDREQ_DRAM_CREDIT_STALL[::_1]) / $denom))
+                ea read stall - if: AVG((TO_INT(TCC_EA0_RDREQ_GMI_CREDIT_STALL[::_1]) / $denom))
+      - metric_table:
+          id: 1810
+          title: L2-Fabric Write and Atomic Stall (Cycles per normUnit)
+          metrics:
+            - ::_1:
+                ea write stall - hbm: AVG((TO_INT(TCC_EA0_WRREQ_DRAM_CREDIT_STALL[::_1]) / $denom))
+                ea write stall - pcie: AVG((TO_INT(TCC_EA0_WRREQ_IO_CREDIT_STALL[::_1]) / $denom))
+                ea write stall - if: AVG((TO_INT(TCC_EA0_WRREQ_GMI_CREDIT_STALL[::_1]) / $denom))
@@ -2,7 +2,6 @@
 Panel Config:
  id: 0
  title: Top Stats
-  metrics_description: {}
  data source:
  - raw_csv_table:
      id: 1
@@ -12,3 +11,4 @@ Panel Config:
      id: 2
      title: Dispatch List
      source: pmc_dispatch_info.csv
+  metrics_description: {}
@@ -2,10 +2,10 @@
 Panel Config:
  id: 100
  title: System Info
-  metrics_description: {}
  data source:
  - raw_csv_table:
      id: 101
      title: System Info
      source: sysinfo.csv
      columnwise: true
+  metrics_description: {}
@@ -2,124 +2,6 @@
 Panel Config:
  id: 200
  title: System Speed-of-Light
-  metrics_description:
-    VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
-      This is also presented as a percent of the peak theoretical FLOPs achievable
-      on the specific accelerator. Note: this does not include any floating-point
-      operations from MFMA instructions.'
-    VALU IOPs: 'The total integer operations executed per second on the VALU. This
-      is also presented as a percent of the peak theoretical IOPs achievable on the
-      specific accelerator. Note: this does not include any integer operations from
-      MFMA instructions.'
-    MFMA FLOPs (F8): The total number of 8-bit brain floating point MFMA operations
-      executed per second. This does not include any 16-bit brain floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F8 MFMA operations achievable on the specific accelerator. It is supported on
-      AMD Instinct MI300 series and later only.
-    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
-      executed per second. Note: this does not include any 16-bit brain floating point
-      operations from VALU instructions. This is also presented as a percent of the
-      peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
-      per second. Note: this does not include any 16-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F16 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
-      per second. Note: this does not include any 32-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F32 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
-      per second. Note: this does not include any 64-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F64 MFMA operations achievable on the specific accelerator.'
-    MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
-      per second. Note: this does not include any 8-bit integer operations from VALU
-      instructions. This is also presented as a percent of the peak theoretical INT8
-      MFMA operations achievable on the specific accelerator.'
-    Active CUs: Total number of active compute units (CUs) on the accelerator during
-      the kernel execution.
-    SALU Utilization: Indicates what percent of the kernel's duration the SALU was
-      busy executing instructions. Computed as the ratio of the total number of cycles
-      spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
-    VALU Utilization: Indicates what percent of the kernel's duration the VALU was
-      busy executing instructions. Does not include VMEM operations. Computed as the
-      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
-      over the total CU cycles.
-    MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
-      was busy executing instructions. Computed as the ratio of the total number of
-      cycles the MFMA was busy over the total CU cycles.
-    VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
-      was busy executing instructions, including both global/generic and spill/scratch
-      operations (see the VMEM instruction count metrics) for more detail). Does not
-      include VALU operations. Computed as the ratio of the total number of cycles
-      spent by the scheduler issuing VMEM instructions over the total CU cycles.
-    Branch Utilization: Indicates what percent of the kernel's duration the branch
-      unit was busy executing instructions. Computed as the ratio of the total number
-      of cycles spent by the scheduler issuing branch instructions over the total
-      CU cycles
-    VALU Active Threads: Indicates the average level of divergence within a wavefront
-      over the lifetime of the kernel. The number of work-items that were active in
-      a wavefront during execution of each VALU instruction, time-averaged over all
-      VALU instructions run on all wavefronts in the kernel.
-    IPC: The ratio of the total number of instructions executed on the CU over the
-      total active CU cycles. This is also presented as a percent of the peak theoretical
-      bandwidth achievable on the specific accelerator.
-    Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
-      over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
-      kernels (less than 1ms). This is also presented as a percent of the peak theoretical
-      occupancy achievable on the specific accelerator.'
-    Theoretical LDS Bandwidth: Indicates the maximum amount of bytes that could have
-      been loaded from, stored to, or atomically updated in the LDS per unit time
-      (see LDS Bandwidth example for more detail). This is also presented as a percent
-      of the peak theoretical F64 MFMA operations achievable on the specific accelerator.
-    LDS Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS
-      scheduler due to bank conflicts (as determined by the conflict resolution hardware)
-      to the base number of cycles that would be spent in the LDS scheduler in a completely
-      uncontended case. This is also presented in normalized form (i.e., the Bank
-      Conflict Rate).
-    vL1D Cache Hit Rate: The ratio of the number of vL1D cache line requests that
-      hit in vL1D cache over the total number of cache line requests to the vL1D cache
-      RAM.
-    vL1D Cache BW: The number of bytes looked up in the vL1D cache as a result of
-      VMEM instructions per unit time. The number of bytes is calculated as the number
-      of cache lines requested multiplied by the cache line size. This value does
-      not consider partial requests, so e.g., if only a single value is requested
-      in a cache line, the data movement will still be counted as a full cache line.
-      This is also presented as a percent of the peak theoretical bandwidth achievable
-      on the specific accelerator.
-    L2 Cache Hit Rate: The ratio of the number of L2 cache line requests that hit
-      in the L2 cache over the total number of incoming cache line requests to the
-      L2 cache.
-    L2 Cache BW: The number of bytes looked up in the L2 cache per unit time. The
-      number of bytes is calculated as the number of cache lines requested multiplied
-      by the cache line size. This value does not consider partial requests, so e.g.,
-      if only a single value is requested in a cache line, the data movement will
-      still be counted as a full cache line. This is also presented as a percent of
-      the peak theoretical bandwidth achievable on the specific accelerator.
-    L2-Fabric Read BW: "The number of bytes read by the L2 over the Infinity Fabric\u2122\
-      \ interface per unit time. This is also presented as a percent of the peak theoretical\
-      \ bandwidth achievable on the specific accelerator."
-    L2-Fabric Write BW: The number of bytes sent by the L2 over the Infinity Fabric
-      interface by write and atomic operations per unit time. This is also presented
-      as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
-    L2-Fabric Read Latency: The time-averaged number of cycles read requests spent
-      in Infinity Fabric before data was returned to the L2.
-    L2-Fabric Write Latency: The time-averaged number of cycles write requests spent
-      in Infinity Fabric before a completion acknowledgement was returned to the L2.
-    sL1D Cache Hit Rate: The percent of sL1D requests that hit on a previously loaded
-      line the cache. Calculated as the ratio of the number of sL1D requests that
-      hit over the number of all sL1D requests.
-    sL1D Cache BW: The number of bytes looked up in the sL1D cache per unit time.
-      This is also presented as a percent of the peak theoretical bandwidth achievable
-      on the specific accelerator.
-    L1I Hit Rate: The number of bytes looked up in the L1I cache per unit time. This
-      is also presented as a percent of the peak theoretical bandwidth achievable
-      on the specific accelerator.
-    L1I BW: The percent of L1I requests that hit on a previously loaded line the cache.
-      Calculated as the ratio of the number of L1I requests that hit over the number
-      of all L1I requests.
-    L1I Fetch Latency: The average number of cycles spent to fetch instructions to
-      a CU.
  data source:
  - metric_table:
      id: 201
@@ -344,3 +226,130 @@ Panel Config:
          peak: None
          pop: None
          coll_level: SQ_IFETCH_LEVEL
+  metrics_description:
+    VALU FLOPs: |-
+      The total floating-point operations executed per second on the VALU.
+      This is also presented as a percent of the peak theoretical FLOPs achievable
+      on the specific accelerator. Note: this does not include any floating-point
+      operations from MFMA instructions.
+    VALU IOPs: |-
+      The total integer operations executed per second on the VALU. This is
+      also presented as a percent of the peak theoretical IOPs achievable on the
+      specific accelerator. Note: this does not include any integer operations from
+      MFMA instructions.
+    MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
+      executed per second. This does not include any 8-bit floating point operations
+      from VALU instructions. This is also presented as a percent of the peak theoretical
+      F8 MFMA operations achievable on the specific accelerator. It is supported on
+      AMD Instinct MI300 series and later only.
+    MFMA FLOPs (BF16): |-
+      The total number of 16-bit brain floating point MFMA operations executed
+      per second. Note: this does not include any 16-bit brain floating point operations
+      from VALU instructions. This is also presented as a percent of the peak theoretical
+      BF16 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F16): |-
+      The total number of 16-bit floating point MFMA operations executed per
+      second. Note: this does not include any 16-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F16 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F32): |-
+      The total number of 32-bit floating point MFMA operations executed per
+      second. Note: this does not include any 32-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F32 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F64): |-
+      The total number of 64-bit floating point MFMA operations executed per
+      second. Note: this does not include any 64-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F64 MFMA operations achievable on the specific accelerator.
+    MFMA IOPs (Int8): |-
+      The total number of 8-bit integer MFMA operations executed per second.
+      Note: this does not include any 8-bit integer operations from VALU instructions.
+      This is also presented as a percent of the peak theoretical INT8 MFMA operations
+      achievable on the specific accelerator.
+    Active CUs: Total number of active compute units (CUs) on the accelerator during
+      the kernel execution.
+    SALU Utilization: Indicates what percent of the kernel's duration the SALU was
+      busy executing instructions. Computed as the ratio of the total number of cycles
+      spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
+    VALU Utilization: Indicates what percent of the kernel's duration the VALU was
+      busy executing instructions. Does not include VMEM operations. Computed as the
+      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
+      over the total CU cycles.
+    MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
+      was busy executing instructions. Computed as the ratio of the total number of
+      cycles the MFMA was busy over the total CU cycles.
+    VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
+      was busy executing instructions, including both global/generic and spill/scratch
+      operations (see the VMEM instruction count metrics for more detail). Does not
+      include VALU operations. Computed as the ratio of the total number of cycles
+      spent by the scheduler issuing VMEM instructions over the total CU cycles.
+    Branch Utilization: Indicates what percent of the kernel's duration the branch
+      unit was busy executing instructions. Computed as the ratio of the total number
+      of cycles spent by the scheduler issuing branch instructions over the total
+      CU cycles.
+    VALU Active Threads: Indicates the average level of divergence within a wavefront
+      over the lifetime of the kernel. The number of work-items that were active in
+      a wavefront during execution of each VALU instruction, time-averaged over all
+      VALU instructions run on all wavefronts in the kernel.
+    IPC: The ratio of the total number of instructions executed on the CU over the
+      total active CU cycles. This is also presented as a percent of the peak theoretical
+      IPC achievable on the specific accelerator.
+    Wavefront Occupancy: |-
+      The time-averaged number of wavefronts resident on the accelerator over
+      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
+      kernels (less than 1ms). This is also presented as a percent of the peak theoretical
+      occupancy achievable on the specific accelerator.
+    Theoretical LDS Bandwidth: Indicates the maximum amount of bytes that could have
+      been loaded from, stored to, or atomically updated in the LDS per unit time
+      (see LDS Bandwidth example for more detail). This is also presented as a percent
+      of the peak theoretical bandwidth achievable on the specific accelerator.
+    LDS Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS
+      scheduler due to bank conflicts (as determined by the conflict resolution hardware)
+      to the base number of cycles that would be spent in the LDS scheduler in a completely
+      uncontended case. This is also presented in normalized form (i.e., the Bank
+      Conflict Rate).
+    vL1D Cache Hit Rate: The ratio of the number of vL1D cache line requests that
+      hit in vL1D cache over the total number of cache line requests to the vL1D cache
+      RAM.
+    vL1D Cache BW: The number of bytes looked up in the vL1D cache as a result of
+      VMEM instructions per unit time. The number of bytes is calculated as the number
+      of cache lines requested multiplied by the cache line size. This value does
+      not consider partial requests, so e.g., if only a single value is requested
+      in a cache line, the data movement will still be counted as a full cache line.
+      This is also presented as a percent of the peak theoretical bandwidth achievable
+      on the specific accelerator.
+    L2 Cache Hit Rate: The ratio of the number of L2 cache line requests that hit
+      in the L2 cache over the total number of incoming cache line requests to the
+      L2 cache.
+    L2 Cache BW: The number of bytes looked up in the L2 cache per unit time. The
+      number of bytes is calculated as the number of cache lines requested multiplied
+      by the cache line size. This value does not consider partial requests, so e.g.,
+      if only a single value is requested in a cache line, the data movement will
+      still be counted as a full cache line. This is also presented as a percent of
+      the peak theoretical bandwidth achievable on the specific accelerator.
+    L2-Fabric Read BW: |-
+      The number of bytes read by the L2 over the Infinity Fabric™ interface
+      per unit time. This is also presented as a percent of the peak theoretical
+      bandwidth achievable on the specific accelerator.
+    L2-Fabric Write BW: The number of bytes sent by the L2 over the Infinity Fabric
+      interface by write and atomic operations per unit time. This is also presented
+      as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
+    L2-Fabric Read Latency: The time-averaged number of cycles read requests spent
+      in Infinity Fabric before data was returned to the L2.
+    L2-Fabric Write Latency: The time-averaged number of cycles write requests spent
+      in Infinity Fabric before a completion acknowledgement was returned to the L2.
+    sL1D Cache Hit Rate: The percent of sL1D requests that hit on a previously loaded
+      line in the cache. Calculated as the ratio of the number of sL1D requests that
+      hit over the number of all sL1D requests.
+    sL1D Cache BW: The number of bytes looked up in the sL1D cache per unit time.
+      This is also presented as a percent of the peak theoretical bandwidth achievable
+      on the specific accelerator.
+    L1I Hit Rate: The percent of L1I requests that hit on a previously loaded line
+      in the cache. Calculated as the ratio of the number of L1I requests that hit
+      over the number of all L1I requests.
+    L1I BW: The number of bytes looked up in the L1I cache per unit time. This
+      is also presented as a percent of the peak theoretical bandwidth achievable
+      on the specific accelerator.
+    L1I Fetch Latency: The average number of cycles spent to fetch instructions to
+      a CU.
@@ -2,122 +2,6 @@
 Panel Config:
  id: 300
  title: Memory Chart
-  metrics_description:
-    Wavefront Occupancy: Wavefronts per active CU.
-    Wave Life: Average number of cycles executing a wave.
-    SALU: Total Number of SALU (Scalar ALU) instructions issued per normalization
-      unit.
-    SMEM: Total number of SMEM (Scalar Memory Read) instructions issued normalization
-      unit.
-    VALU: The number of VALU (Vector ALU) instructions issued per normalization unit.
-    MFMA: Total number of MFMA (Matrix-Fused-Multiply-Add) instructions issued per
-      normalization unit.
-    VMEM: The number of VMEM (GPU Memory) read instructions issued (including FLAT/scratch
-      memory) per normalization unit.
-    LDS: The total number of LDS instructions (including, but not limited to, read/write/atomics
-      and HIP's __shfl instructions) executed per normalization unit.
-    GWS: Total number of GDS (global data sync) instructions issued per normalization
-      unit.
-    BR: Total number of BRANCH instructions issued per normalization unit.
-    Active CUs: Total number of active compute units (CUs) on the accelerator during
-      the kernel execution.
-    Num CUs: Total number of compute units (CUs) on the accelerator.
-    VGPR: 'The number of architected vector general-purpose registers allocated for
-      the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
-      by the compiler due to allocation granularity.'
-    SGPR: 'The number of scalar general-purpose registers allocated for the kernel,
-      see SALU. Note: this may not exactly match the number of SGPRs requested by
-      the compiler due to allocation granularity.'
-    LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
-      for this kernel. Note: This may also be larger than what was requested at compile
-      time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
-    Scratch Allocation: The number of bytes of scratch memory requested per work-item
-      for this kernel. Scratch memory is used for stack memory on the accelerator,
-      as well as for register spills and restores.
-    Wavefronts: The total number of wavefronts, summed over all workgroups, forming
-      this kernel launch.
-    Workgroups: The total number of workgroups forming this kernel launch.
-    LDS Req: The total number of LDS instructions (including, but not limited to,
-      read/write/atomics and HIP's __shfl instructions) executed per normalization
-      unit.
-    LDS Util: Indicates what percent of the kernel's duration the LDS was actively
-      executing instructions (including, but not limited to, load, store, atomic and
-      HIP's __shfl operations). Calculated as the ratio of the total number of cycles
-      LDS was active over the total CU cycles.
-    LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
-      / acknowledgment) required for an LDS instruction to complete.
-    VL1 Rd: The total number of incoming read requests from the address processing
-      unit after coalescing per normalization unit
-    VL1 Wr: The total number of incoming write requests from the address processing
-      unit after coalescing per normalization unit
-    VL1 Atomic: The total number of incoming atomic requests from the address processing
-      unit after coalescing per normalization unit
-    VL1 Hit: The ratio of the number of vL1D cache line requests that hit in vL1D
-      cache over the total number of cache line requests to the vL1D Cache RAM.
-    VL1 Lat: Calculated as the average number of cycles that a vL1D cache line request
-      spent in the vL1D cache pipeline.
-    VL1 Coalesce: Indicates how well memory instructions were coalesced by the address
-      processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
-      as the average number of thread-requests generated per instruction divided by
-      the ideal number of thread-requests per instruction.
-    VL1 Stall: The ratio of the number of cycles where the vL1D is stalled waiting
-      to issue a request for data to the L2 cache divided by the number of cycles
-      where the vL1D is active.
-    VL1_L2 Rd: The number of read requests for a vL1D cache line that were not satisfied
-      by the vL1D and must be retrieved from the to the L2 Cache per normalization
-      unit.
-    VL1_L2 Wr: The number of write requests to a vL1D cache line that were sent through
-      the vL1D to the L2 cache, per normalization unit.
-    VL1_L2 Atomic: The number of atomic requests that are sent through the vL1D to
-      the L2 cache, per normalization unit. This includes requests for atomics with,
-      and without return.
-    sL1D Rd: The total number of requests, of any size or type, made to the sL1D per
-      normalization unit.
-    sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
-      line, per normalization unit.
-    sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
-      unit.
-    sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
-      unit. Typically unused on current CDNA accelerators.
-    sL1D_L2 Atomic: The total number of atomic requests from sL1D to the L2, per normalization
-      unit. Typically unused on current CDNA accelerators.
-    IL1 Fetch: The total number of requests made to the L1I per normalization-unit.
-    IL1 Hit: The percent of L1I requests that hit on a previously loaded line the
-      cache. Calculated as the ratio of the number of L1I requests that hit over the
-      number of all L1I requests.
-    IL1 Lat: The average number of cycles spent to fetch instructions to a CU.
-    IL1_L2 Rd: The total number of requests across the L1I - L2 interface per normalization-unit.
-    L2 Rd: The total number of read requests to the L2 from all clients.
-    L2 Wr: The total number of write requests to the L2 from all clients.
-    L2 Atomic: The total number of atomic requests (with and without return) to the
-      L2 from all clients.
-    L2 Hit: The ratio of the number of L2 cache line requests that hit in the L2 cache
-      over the total number of incoming cache line requests to the L2 cache.
-    L2 Rd Lat: Calculated as the average number of cycles that the vL1D cache took
-      to issue and receive read requests from the L2 Cache. This number also includes
-      requests for atomics with return values.
-    L2 Wr Lat: Calculated as the average number of cycles that the vL1D cache took
-      to issue and receive acknowledgement of a write request to the L2 Cache. This
-      number also includes requests for atomics without return values.
-    Fabric_L2 Rd: Number of L2 cache - Infinity Fabric read requests (either 32-byte
-      or 64-byte) summed over TCC instances per normalization unit.
-    Fabric_L2 Wr: Number of L2 cache - Infinity Fabric write requests (either 32-byte
-      or 64-byte) summed over TCC instances per normalization unit.
-    Fabric_L2 Atomic: Number of L2 cache - Infinity Fabric write requests (either
-      32-byte or 64-byte) that are actually atomic requests summed over TCC instances
-      per normalization unit.
-    Fabric Rd Lat: The time-averaged number of cycles read requests spent in Infinity
-      Fabric before data was returned to the L2.
-    Fabric Wr Lat: The time-averaged number of cycles write requests spent in Infinity
-      Fabric before a completion acknowledgement was returned to the L2.
-    Fabric Atomic Lat: The time-averaged number of cycles atomic requests spent in
-      Infinity Fabric before a completion acknowledgement (atomic without return value)
-      or data (atomic with return value) was returned to the L2.
-    HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
-      of data from the accelerator's local HBM, per normalization unit.
-    HBM Wr: 'The total number of L2 requests to Infinity Fabric to write or atomically
-      update 32B or 64B of data in the accelerator''s local HBM, per normalization
-      unit. '
  data source:
  - metric_table:
      id: 301
@@ -244,13 +128,13 @@ Panel Config:
          value: ROUND(AVG((TCC_EA0_ATOMIC_sum / $denom)), 0)
        Fabric Rd Lat:
          value: ROUND(AVG(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
-            != 0) else  0)), 0)
+            != 0) else 0)), 0)
        Fabric Wr Lat:
          value: ROUND(AVG(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
-            != 0) else  0)), 0)
+            != 0) else 0)), 0)
        Fabric Atomic Lat:
          value: ROUND(AVG(((TCC_EA0_ATOMIC_LEVEL_sum / TCC_EA0_ATOMIC_sum) if (TCC_EA0_ATOMIC_sum
-            != 0) else  0)), 0)
+            != 0) else 0)), 0)
        HBM Rd:
          value: ROUND(AVG((TCC_EA0_RDREQ_DRAM_sum / $denom)), 0)
        HBM Wr:
@@ -258,3 +142,117 @@ Panel Config:
      comparable: false
      cli_style: mem_chart
      tui_style: mem_chart
+  metrics_description:
+    Wavefront Occupancy: Wavefronts per active CU.
+    Wave Life: Average number of cycles executing a wave.
+    SALU: Total number of SALU (Scalar ALU) instructions issued per normalization
+      unit.
+    SMEM: Total number of SMEM (Scalar Memory Read) instructions issued per normalization
+      unit.
+    VALU: The number of VALU (Vector ALU) instructions issued per normalization unit.
+    MFMA: Total number of MFMA (Matrix-Fused-Multiply-Add) instructions issued per
+      normalization unit.
+    VMEM: The number of VMEM (GPU Memory) read instructions issued (including FLAT/scratch
+      memory) per normalization unit.
+    LDS: The total number of LDS instructions (including, but not limited to, read/write/atomics
+      and HIP's __shfl instructions) executed per normalization unit.
+    GWS: Total number of GDS (global data sync) instructions issued per normalization
+      unit.
+    BR: Total number of BRANCH instructions issued per normalization unit.
+    Active CUs: Total number of active compute units (CUs) on the accelerator during
+      the kernel execution.
+    Num CUs: Total number of compute units (CUs) on the accelerator.
+    VGPR: |-
+      The number of architected vector general-purpose registers allocated
+      for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
+      requested by the compiler due to allocation granularity.
+    SGPR: |-
+      The number of scalar general-purpose registers allocated for the kernel,
+      see SALU. Note: this may not exactly match the number of SGPRs requested by
+      the compiler due to allocation granularity.
+    LDS Allocation: |-
+      The number of bytes of LDS memory (or, shared memory) allocated for
+      this kernel. Note: This may also be larger than what was requested at compile
+      time due to both allocation granularity and dynamic per-dispatch LDS allocations.
+    Scratch Allocation: The number of bytes of scratch memory requested per work-item
+      for this kernel. Scratch memory is used for stack memory on the accelerator,
+      as well as for register spills and restores.
+    Wavefronts: The total number of wavefronts, summed over all workgroups, forming
+      this kernel launch.
+    Workgroups: The total number of workgroups forming this kernel launch.
+    LDS Req: The total number of LDS instructions (including, but not limited to,
+      read/write/atomics and HIP's __shfl instructions) executed per normalization
+      unit.
+    LDS Util: Indicates what percent of the kernel's duration the LDS was actively
+      executing instructions (including, but not limited to, load, store, atomic and
+      HIP's __shfl operations). Calculated as the ratio of the total number of cycles
+      LDS was active over the total CU cycles.
+    LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
+      / acknowledgment) required for an LDS instruction to complete.
+    VL1 Rd: The total number of incoming read requests from the address processing
+      unit after coalescing, per normalization unit.
+    VL1 Wr: The total number of incoming write requests from the address processing
+      unit after coalescing, per normalization unit.
+    VL1 Atomic: The total number of incoming atomic requests from the address processing
+      unit after coalescing, per normalization unit.
+    VL1 Hit: The ratio of the number of vL1D cache line requests that hit in vL1D
+      cache over the total number of cache line requests to the vL1D Cache RAM.
+    VL1 Lat: Calculated as the average number of cycles that a vL1D cache line request
+      spent in the vL1D cache pipeline.
+    VL1 Coalesce: Indicates how well memory instructions were coalesced by the address
+      processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
+      as the average number of thread-requests generated per instruction divided by
+      the ideal number of thread-requests per instruction.
+    VL1 Stall: The ratio of the number of cycles where the vL1D is stalled waiting
+      to issue a request for data to the L2 cache divided by the number of cycles
+      where the vL1D is active.
+    VL1_L2 Rd: The number of read requests for a vL1D cache line that were not satisfied
+      by the vL1D and must be retrieved from the L2 cache, per normalization
+      unit.
+    VL1_L2 Wr: The number of write requests to a vL1D cache line that were sent through
+      the vL1D to the L2 cache, per normalization unit.
+    VL1_L2 Atomic: The number of atomic requests that are sent through the vL1D to
+      the L2 cache, per normalization unit. This includes requests for atomics with
+      and without return.
+    sL1D Rd: The total number of requests, of any size or type, made to the sL1D per
+      normalization unit.
+    sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
+      line, per normalization unit.
+    sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
+      unit.
+    sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
+      unit. Typically unused on current CDNA accelerators.
+    sL1D_L2 Atomic: The total number of atomic requests from sL1D to the L2, per normalization
+      unit. Typically unused on current CDNA accelerators.
+    IL1 Fetch: The total number of requests made to the L1I per normalization unit.
+    IL1 Hit: The percent of L1I requests that hit on a previously loaded line in the
+      cache. Calculated as the ratio of the number of L1I requests that hit over the
+      number of all L1I requests.
+    IL1 Lat: The average number of cycles spent to fetch instructions to a CU.
+    IL1_L2 Rd: The total number of requests across the L1I - L2 interface per normalization unit.
+    L2 Rd: The total number of read requests to the L2 from all clients.
+    L2 Wr: The total number of write requests to the L2 from all clients.
+    L2 Atomic: The total number of atomic requests (with and without return) to the
+      L2 from all clients.
+    L2 Hit: The ratio of the number of L2 cache line requests that hit in the L2 cache
+      over the total number of incoming cache line requests to the L2 cache.
+    Fabric_L2 Rd: Number of L2 cache - Infinity Fabric read requests (either 32-byte
+      or 64-byte) summed over TCC instances per normalization unit.
+    Fabric_L2 Wr: Number of L2 cache - Infinity Fabric write requests (either 32-byte
+      or 64-byte) summed over TCC instances per normalization unit.
+    Fabric_L2 Atomic: Number of L2 cache - Infinity Fabric write requests (either
+      32-byte or 64-byte) that are actually atomic requests summed over TCC instances
+      per normalization unit.
+    Fabric Rd Lat: The time-averaged number of cycles read requests spent in Infinity
+      Fabric before data was returned to the L2.
+    Fabric Wr Lat: The time-averaged number of cycles write requests spent in Infinity
+      Fabric before a completion acknowledgement was returned to the L2.
+    Fabric Atomic Lat: The time-averaged number of cycles atomic requests spent in
+      Infinity Fabric before a completion acknowledgement (atomic without return value)
+      or data (atomic with return value) was returned to the L2.
+    HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
+      of data from the accelerator's local HBM, per normalization unit.
+    HBM Wr: |-
+      The total number of L2 requests to Infinity Fabric to write or atomically
+      update 32B or 64B of data in the accelerator's local HBM, per normalization
+      unit.
@@ -2,85 +2,6 @@
 Panel Config:
  id: 400
  title: Roofline
-  metrics_description:
-    VALU FLOPs (F16): 'The total 16-bit floating-point operations executed per second
-      on the VALU. This is presented with the value of the peak empirical F16 FLOPs
-      achievable on the specific accelerator. Note: this does not include any F16
-      operations from MFMA instructions.'
-    VALU FLOPs (F32): 'The total 32-bit floating-point operations executed per second
-      on the VALU. This is presented with the value of the peak empirical F32 FLOPs
-      achievable on the specific accelerator. Note: this does not include any F32
-      operations from MFMA instructions.'
-    VALU FLOPs (F64): 'The total 64-bit floating-point operations executed per second
-      on the VALU. This is presented with the value of the peak empirical F64 FLOPs
-      achievable on the specific accelerator. Note: this does not include any F64
-      operations from MFMA instructions.'
-    MFMA FLOPs (F8): The total number of 8-bit brain floating point MFMA operations
-      executed per second. This does not include any 16-bit brain floating point operations
-      from VALU instructions. The peak empirically measured F8 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison. It is supported
-      on AMD Instinct MI300 series and later only.
-    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
-      executed per second. Note: this does not include any 16-bit brain floating point
-      operations from VALU instructions. The peak empirically measured BF16 MFMA operations
-      achievable on the specific accelerator is displayed alongside for comparison.'
-    MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
-      per second. Note: this does not include any 16-bit floating point operations
-      from VALU instructions. The peak empirically measured F16 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison.'
-    MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
-      per second. Note: this does not include any 32-bit floating point operations
-      from VALU instructions. The peak empirically measured F32 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison.'
-    MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
-      per second. Note: this does not include any 64-bit floating point operations
-      from VALU instructions. The peak empirically measured F64 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison.'
-    MFMA FLOPs (F6F4): 'The total number of 4-bit and 6-bit floating point MFMA operations
-      executed per second. Note: this does not include any floating point operations
-      from VALU instructions. The peak empirically measured F6F4 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison. It is supported
-      on AMD Instinct MI350 series (gfx950) and later only.'
-    MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
-      per second. Note: this does not include any 8-bit integer operations from VALU
-      instructions. The peak empirically measured INT8 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison.'
-    HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth
-      Memory (HBM) per second. The peak empirically measured bandwidth achievable
-      on the specific accelerator is displayed alongside for comparison.
-    L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
-      The number of bytes is calculated as the number of cache lines requested multiplied
-      by the cache line size. This value does not consider partial requests, so e.g.,
-      if only a single value is requested in a cache line, the data movement will
-      still be counted as a full cache line. The peak empirically measured bandwidth
-      achievable on the specific accelerator is displayed alongside for comparison.
-    L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
-      of VMEM instructions per unit time. The number of bytes is calculated as the
-      number of cache lines requested multiplied by the cache line size. This value
-      does not consider partial requests, so e.g., if only a single value is requested
-      in a cache line, the data movement will still be counted as a full cache line.
-      The peak empirically measured bandwidth achievable on the specific accelerator
-      is displayed alongside for comparison.
-    LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded
-      from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
-      example for more detail). The peak empirically measured LDS bandwidth achievable
-      on the specific accelerator is displayed alongside for comparison.
-    AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
-      of total floating-point operations (FLOPs) to total bytes transferred between
-      the L1 cache and the processing units. This value is used as the x-coordinate
-      for the L1 roofline.
-    AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
-      of total floating-point operations (FLOPs) to total bytes transferred between
-      the L2 cache and the L1 cache. This value is used as the x-coordinate for the
-      L2 roofline.
-    AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
-      It is the ratio of total floating-point operations (FLOPs) to total bytes transferred
-      between HBM and the L2 cache. This value is used as the x-coordinate for the
-      HBM roofline.
-    Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs
-      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
-      operations divided by the total execution time. This value is used as the y-coordinate
-      for the kernel's point on the Roofline plot.
  data source:
  - metric_table:
      id: 401
@@ -218,3 +139,91 @@ Panel Config:
            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
            * 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9
          unit: GFLOP/s
+  metrics_description:
+    VALU FLOPs (F16): |-
+      The total 16-bit floating-point operations executed per second on the VALU.
+      This is presented with the value of the peak empirical F16 FLOPs achievable
+      on the specific accelerator. Note: this does not include any F16 operations
+      from MFMA instructions.
+    VALU FLOPs (F32): |-
+      The total 32-bit floating-point operations executed per second on the VALU.
+      This is presented with the value of the peak empirical F32 FLOPs achievable
+      on the specific accelerator. Note: this does not include any F32 operations
+      from MFMA instructions.
+    VALU FLOPs (F64): |-
+      The total 64-bit floating-point operations executed per second on the VALU.
+      This is presented with the value of the peak empirical F64 FLOPs achievable
+      on the specific accelerator. Note: this does not include any F64 operations
+      from MFMA instructions.
+    MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
+      executed per second. This does not include any 8-bit floating point operations
+      from VALU instructions. The peak empirically measured F8 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison. It is supported
+      on AMD Instinct MI300 series and later only.
+    MFMA FLOPs (BF16): |-
+      The total number of 16-bit brain floating point MFMA operations executed
+      per second. Note: this does not include any 16-bit brain floating point
+      operations from VALU instructions. The peak empirically measured BF16 MFMA
+      operations achievable on the specific accelerator is displayed alongside
+      for comparison.
+    MFMA FLOPs (F16): |-
+      The total number of 16-bit floating point MFMA operations executed per
+      second. Note: this does not include any 16-bit floating point operations from
+      VALU instructions. The peak empirically measured F16 MFMA operations
+      achievable on the specific accelerator is displayed alongside for comparison.
+    MFMA FLOPs (F32): |-
+      The total number of 32-bit floating point MFMA operations executed per
+      second. Note: this does not include any 32-bit floating point operations from
+      VALU instructions. The peak empirically measured F32 MFMA operations
+      achievable on the specific accelerator is displayed alongside for comparison.
+    MFMA FLOPs (F64): |-
+      The total number of 64-bit floating point MFMA operations executed per
+      second. Note: this does not include any 64-bit floating point operations from
+      VALU instructions. The peak empirically measured F64 MFMA operations
+      achievable on the specific accelerator is displayed alongside for comparison.
+    MFMA IOPs (Int8): |-
+      The total number of 8-bit integer MFMA operations executed per second.
+      Note: this does not include any 8-bit integer operations from VALU instructions.
+      The peak empirically measured INT8 MFMA operations achievable on the specific
+      accelerator is displayed alongside for comparison.
+    HBM Bandwidth: |-
+      The total number of bytes read from and written to High-Bandwidth
+      Memory (HBM) per second. The peak empirically measured bandwidth achievable
+      on the specific accelerator is displayed alongside for comparison.
+    L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
+      The number of bytes is calculated as the number of cache lines requested multiplied
+      by the cache line size. This value does not consider partial requests, so e.g.,
+      if only a single value is requested in a cache line, the data movement will
+      still be counted as a full cache line. The peak empirically measured bandwidth
+      achievable on the specific accelerator is displayed alongside for comparison.
+    L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
+      of VMEM instructions per unit time. The number of bytes is calculated as the
+      number of cache lines requested multiplied by the cache line size. This value
+      does not consider partial requests, so e.g., if only a single value is requested
+      in a cache line, the data movement will still be counted as a full cache line.
+      The peak empirically measured bandwidth achievable on the specific accelerator
+      is displayed alongside for comparison.
+    LDS Bandwidth: Indicates the maximum number of bytes that could have been loaded
+      from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
+      example for more detail). The peak empirically measured LDS bandwidth achievable
+      on the specific accelerator is displayed alongside for comparison.
+    AI L1: |-
+      The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
+      of total floating-point operations (FLOPs) to total bytes transferred between
+      the L1 cache and the processing units. This value is used as the x-coordinate
+      for the L1 roofline.
+    AI L2: |-
+      The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
+      of total floating-point operations (FLOPs) to total bytes transferred between
+      the L2 cache and the L1 cache. This value is used as the x-coordinate for
+      the L2 roofline.
+    AI HBM: |-
+      The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
+      It is the ratio of total floating-point operations (FLOPs) to total bytes
+      transferred between HBM and the L2 cache. This value is used as the x-coordinate
+      for the HBM roofline.
+    Performance (GFLOPs): |-
+      The overall achieved performance, measured in GigaFLOPs
+      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
+      operations divided by the total execution time. This value is used as the y-coordinate
+      for the kernel's point on the Roofline plot.
@@ -2,30 +2,6 @@
 Panel Config:
  id: 500
  title: Command Processor (CPC/CPF)
-  metrics_description:
-    CPF Utilization: Percent of total cycles where the CPF was busy actively doing
-      any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
-    CPF Stall: Percent of CPF busy cycles where the CPF was stalled for any reason.
-    CPF-L2 Utilization: Percent of total cycles counted by the CPF-L2 interface where
-      the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
-      over total cycles counted by the CPF-L2.
-    CPF-L2 Stall: Percent of CPF-L2 L2 busy cycles where the CPF-L2 interface was
-      stalled for any reason.
-    CPF-UTCL1 Stall: Percent of CPF busy cycles where the CPF was stalled by address
-      translation.
-    CPC Utilization: Percent of total cycles where the CPC was busy actively doing
-      any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
-    CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
-    CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
-      for processing.
-    CPC-Workgroup Manager Utilization: Percent of CPC busy cycles spent dispatching
-      workgroups to the workgroup manager.
-    CPC-L2 Utilization: Percent of total cycles counted by the CPC-L2 interface where
-      the CPC-L2 interface was active doing any work.
-    CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
-      translation
-    CPC-UTCL2 Utilization: 'Percent of total cycles counted by the CPC''s L2 address
-      translation interface where the CPC was busy doing address translation work.  '
  data source:
  - metric_table:
      id: 501
@@ -143,3 +119,28 @@ Panel Config:
          max: MAX((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
            if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
          unit: pct
+  metrics_description:
+    CPF Utilization: Percent of total cycles where the CPF was busy actively doing
+      any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
+    CPF Stall: Percent of CPF busy cycles where the CPF was stalled for any reason.
+    CPF-L2 Utilization: Percent of total cycles counted by the CPF-L2 interface where
+      the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
+      over total cycles counted by the CPF-L2.
+    CPF-L2 Stall: Percent of CPF-L2 busy cycles where the CPF-L2 interface was
+      stalled for any reason.
+    CPF-UTCL1 Stall: Percent of CPF busy cycles where the CPF was stalled by address
+      translation.
+    CPC Utilization: Percent of total cycles where the CPC was busy actively doing
+      any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
+    CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
+    CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
+      for processing.
+    CPC-Workgroup Manager Utilization: Percent of CPC busy cycles spent dispatching
+      workgroups to the workgroup manager.
+    CPC-L2 Utilization: Percent of total cycles counted by the CPC-L2 interface where
+      the CPC-L2 interface was active doing any work.
+    CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
+      translation.
+    CPC-UTCL2 Utilization: |-
+      Percent of total cycles counted by the CPC's L2 address translation
+      interface where the CPC was busy doing address translation work.
@@ -2,61 +2,6 @@
 Panel Config:
  id: 600
  title: Workgroup Manager (SPI)
-  metrics_description:
-    Accelerator Utilization: The percent of cycles in the kernel where the accelerator
-      was actively doing any work.
-    Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
-      kernel where the scheduler-pipes were actively doing any work.
-    Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
-      manager was actively doing any work.
-    Shader Engine Utilization: The percent of total shader engine cycles in the kernel
-      where any CU in a shader-engine was actively doing any work, normalized over
-      all shader-engines. Low values (e.g., << 100%) indicate that the accelerator
-      was not fully saturated by the kernel, or a potential load-imbalance issue.
-    SIMD Utilization: The percent of total SIMD cycles in the kernel where any SIMD
-      on a CU was actively doing any work, summed over all CUs. Low values (less than
-      100%) indicate that the accelerator was not fully saturated by the kernel, or
-      a potential load-imbalance issue.
-    Dispatched Workgroups: The total number of workgroups forming this kernel launch.
-    Dispatched Wavefronts: The total number of wavefronts, summed over all workgroups,
-      forming this kernel launch.
-    VGPR Writes: The average number of cycles spent initializing VGPRs at wave creation.
-    SGPR Writes: The average number of cycles spent initializing SGPRs at wave creation.
-    Not-scheduled Rate (Workgroup Manager): The percent of total scheduler-pipe cycles
-      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
-      within the workgroup manager rather than a lack of a CU or SIMD with sufficient
-      resources.
-    Not-scheduled Rate (Scheduler-Pipe): 'The percent of total scheduler-pipe cycles
-      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
-      within the scheduler-pipes rather than a lack of a CU or SIMD with sufficient
-      resources. '
-    Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
-      where a workgroup could not be scheduled to a CU due to occupancy limitations
-      (like a lack of a CU or SIMD with sufficient resources).
-    Scratch Stall Rate: The percent of total shader-engine cycles in the kernel where
-      a workgroup could not be scheduled to a CU due to lack of private (a.k.a., scratch)
-      memory slots. While this can reach up to 100%, note that the actual occupancy
-      limitations on a kernel using private memory are typically quite small (for
-      example, less than 1% of the total number of waves that can be scheduled to
-      an accelerator).
-    Insufficient SIMD Waveslots: The percent of total SIMD cycles in the kernel where
-      a workgroup could not be scheduled to a SIMD due to lack of available waveslots.
-    Insufficient SIMD VGPRs: The percent of total SIMD cycles in the kernel where
-      a workgroup could not be scheduled to a SIMD due to lack of available VGPRs.
-    Insufficient SIMD SGPRs: The percent of total SIMD cycles in the kernel where
-      a workgroup could not be scheduled to a SIMD due to lack of available SGPRs.
-    Insufficient CU LDS: The percent of total CU cycles in the kernel where a workgroup
-      could not be scheduled to a CU due to lack of available LDS.
-    Insufficient CU Barriers: The percent of total CU cycles in the kernel where a
-      workgroup could not be scheduled to a CU due to lack of available barriers.
-    Reached CU Workgroup Limit: The percent of total CU cycles in the kernel where
-      a workgroup could not be scheduled to a CU due to limits within the workgroup
-      manager. This is expected to be always be zero on CDNA2 or newer accelerators
-      (and small for previous accelerators).
-    Reached CU Wavefront Limit: The percent of total CU cycles in the kernel where
-      a wavefront could not be scheduled to a CU due to limits within the workgroup
-      manager. This is expected to be always be zero on CDNA2 or newer accelerators
-      (and small for previous accelerators).
  data source:
  - metric_table:
      id: 601
@@ -199,3 +144,58 @@ Panel Config:
          min: MIN(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
+  metrics_description:
+    Accelerator Utilization: The percent of cycles in the kernel where the accelerator
+      was actively doing any work.
+    Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
+      kernel where the scheduler-pipes were actively doing any work.
+    Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
+      manager was actively doing any work.
+    Shader Engine Utilization: The percent of total shader engine cycles in the kernel
+      where any CU in a shader-engine was actively doing any work, normalized over
+      all shader-engines. Low values (e.g., << 100%) indicate that the accelerator
+      was not fully saturated by the kernel, or a potential load-imbalance issue.
+    SIMD Utilization: The percent of total SIMD cycles in the kernel where any SIMD
+      on a CU was actively doing any work, summed over all CUs. Low values (less than
+      100%) indicate that the accelerator was not fully saturated by the kernel, or
+      a potential load-imbalance issue.
+    Dispatched Workgroups: The total number of workgroups forming this kernel launch.
+    Dispatched Wavefronts: The total number of wavefronts, summed over all workgroups,
+      forming this kernel launch.
+    VGPR Writes: The average number of cycles spent initializing VGPRs at wave creation.
+    SGPR Writes: The average number of cycles spent initializing SGPRs at wave creation.
+    Not-scheduled Rate (Workgroup Manager): The percent of total scheduler-pipe cycles
+      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
+      within the workgroup manager rather than a lack of a CU or SIMD with sufficient
+      resources.
+    Not-scheduled Rate (Scheduler-Pipe): |-
+      The percent of total scheduler-pipe cycles in the kernel where a workgroup
+      could not be scheduled to a CU due to a bottleneck within the scheduler-pipes
+      rather than a lack of a CU or SIMD with sufficient resources.
+    Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
+      where a workgroup could not be scheduled to a CU due to occupancy limitations
+      (like a lack of a CU or SIMD with sufficient resources).
+    Scratch Stall Rate: The percent of total shader-engine cycles in the kernel where
+      a workgroup could not be scheduled to a CU due to lack of private (a.k.a., scratch)
+      memory slots. While this can reach up to 100%, note that the actual occupancy
+      limitations on a kernel using private memory are typically quite small (for
+      example, less than 1% of the total number of waves that can be scheduled to
+      an accelerator).
+    Insufficient SIMD Waveslots: The percent of total SIMD cycles in the kernel where
+      a workgroup could not be scheduled to a SIMD due to lack of available waveslots.
+    Insufficient SIMD VGPRs: The percent of total SIMD cycles in the kernel where
+      a workgroup could not be scheduled to a SIMD due to lack of available VGPRs.
+    Insufficient SIMD SGPRs: The percent of total SIMD cycles in the kernel where
+      a workgroup could not be scheduled to a SIMD due to lack of available SGPRs.
+    Insufficient CU LDS: The percent of total CU cycles in the kernel where a workgroup
+      could not be scheduled to a CU due to lack of available LDS.
+    Insufficient CU Barriers: The percent of total CU cycles in the kernel where a
+      workgroup could not be scheduled to a CU due to lack of available barriers.
+    Reached CU Workgroup Limit: The percent of total CU cycles in the kernel where
+      a workgroup could not be scheduled to a CU due to limits within the workgroup
+      manager. This is expected to always be zero on CDNA2 or newer accelerators
+      (and small for previous accelerators).
+    Reached CU Wavefront Limit: The percent of total CU cycles in the kernel where
+      a wavefront could not be scheduled to a CU due to limits within the workgroup
+      manager. This is expected to always be zero on CDNA2 or newer accelerators
+      (and small for previous accelerators).
@@ -2,63 +2,6 @@
 Panel Config:
  id: 700
  title: Wavefront
-  metrics_description:
-    Grid Size: The total number of work-items (or, threads) launched as a part of
-      the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied
-      by the total workgroup (or, block) size.
-    Workgroup Size: The total number of work-items (or, threads) in each workgroup
-      (or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
-      to the total block size.
-    Total Wavefronts: "The total number of wavefronts launched as part of the kernel\
-      \ dispatch. On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs,\
-      \ the wavefront size is always 64 work-items. Thus, the total number of wavefronts\
-      \ should be equivalent to the ceiling of grid size divided by 64."
-    Saved Wavefronts: The total number of wavefronts saved at a context-save.
-    Restored Wavefronts: The total number of wavefronts restored from a context-save.
-    VGPRs: 'The number of architected vector general-purpose registers allocated for
-      the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
-      by the compiler due to allocation granularity.'
-    AGPRs: 'The number of accumulation vector general-purpose registers allocated
-      for the kernel, see AGPRs. Note: this may not exactly match the number of AGPRs
-      requested by the compiler due to allocation granularity.'
-    SGPRs: 'The number of scalar general-purpose registers allocated for the kernel,
-      see SALU. Note: this may not exactly match the number of SGPRs requested by
-      the compiler due to allocation granularity.'
-    LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
-      for this kernel. Note: This may also be larger than what was requested at compile
-      time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
-    Scratch Allocation: The number of bytes of scratch memory requested per work-item
-      for this kernel. Scratch memory is used for stack memory on the accelerator,
-      as well as for register spills and restores.
-    Kernel Time: The total duration of the executed kernel.
-    Kernel Time (Cycles): The total duration of the executed kernel in cycles.
-    Instructions per wavefront: The average number of instructions (of all types)
-      executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.
-    Wave Cycles: The number of cycles a wavefront in the kernel dispatch spent resident
-      on a compute unit per normalization unit. This is averaged over all wavefronts
-      in a kernel dispatch.
-    Dependency Wait Cycles: The number of cycles a wavefront in the kernel dispatch
-      spent resident on a compute unit per normalization unit. This is averaged over
-      all wavefronts in a kernel dispatch.
-    Issue Wait Cycles: The number of cycles a wavefront in the kernel dispatch was
-      unable to issue an instruction for any reason (e.g., execution pipe back-pressure,
-      arbitration loss, etc.) per normalization unit. This counter is incremented
-      at every cycle by all wavefronts on a CU unable to issue an instruction. As
-      such, it is most useful to get a sense of how waves were spending their time,
-      rather than identification of a precise limiter because another wave could be
-      actively executing while a wave is issue stalled. The sum of this metric, Dependency
-      Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
-    Active Cycles: The average number of cycles a wavefront in the kernel dispatch
-      was actively executing instructions per normalization unit. This measurement
-      is made on a per-wavefront basis, and may include cycles that another wavefront
-      spent actively executing (on another execution unit, for example) or was stalled.
-      As such, it is most useful to get a sense of how waves were spending their time,
-      rather than identification of a precise limiter. The sum of this metric, Issue
-      Wait Cycles and Active Wait Cycles should be equal to the total Wave Cycles
-      metric.
-    Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
-      over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
-      kernels (less than 1ms).'
  data source:
  - metric_table:
      id: 701
@@ -171,3 +114,66 @@ Panel Config:
          max: MAX((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
          unit: Wavefronts
          coll_level: SQ_LEVEL_WAVES
+  metrics_description:
+    Grid Size: The total number of work-items (or, threads) launched as a part of
+      the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied
+      by the total workgroup (or, block) size.
+    Workgroup Size: The total number of work-items (or, threads) in each workgroup
+      (or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
+      to the total block size.
+    Total Wavefronts: |-
+      The total number of wavefronts launched as part of the kernel dispatch.
+      On AMD Instinct™ CDNA™ accelerators and GCN™ GPUs, the wavefront
+      size is always 64 work-items. Thus, the total number of wavefronts should
+      be equivalent to the ceiling of grid size divided by 64.
+    Saved Wavefronts: The total number of wavefronts saved at a context-save.
+    Restored Wavefronts: The total number of wavefronts restored from a context-save.
+    VGPRs: |-
+      The number of architected vector general-purpose registers allocated
+      for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
+      requested by the compiler due to allocation granularity.
+    AGPRs: |-
+      The number of accumulation vector general-purpose registers allocated
+      for the kernel, see AGPRs. Note: this may not exactly match the number of
+      AGPRs requested by the compiler due to allocation granularity.
+    SGPRs: |-
+      The number of scalar general-purpose registers allocated for the kernel,
+      see SALU. Note: this may not exactly match the number of SGPRs requested by
+      the compiler due to allocation granularity.
+    LDS Allocation: |-
+      The number of bytes of LDS memory (or, shared memory) allocated for
+      this kernel. Note: This may also be larger than what was requested at compile
+      time due to both allocation granularity and dynamic per-dispatch LDS allocations.
+    Scratch Allocation: The number of bytes of scratch memory requested per work-item
+      for this kernel. Scratch memory is used for stack memory on the accelerator,
+      as well as for register spills and restores.
+    Kernel Time: The total duration of the executed kernel.
+    Kernel Time (Cycles): The total duration of the executed kernel in cycles.
+    Instructions per wavefront: The average number of instructions (of all types)
+      executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.
+    Wave Cycles: The number of cycles a wavefront in the kernel dispatch spent resident
+      on a compute unit per normalization unit. This is averaged over all wavefronts
+      in a kernel dispatch.
+    Dependency Wait Cycles: The number of cycles a wavefront in the kernel dispatch
+      stalled waiting on memory of any kind (e.g., instruction fetch, vector or scalar memory)
+      per normalization unit. This is averaged over all wavefronts in a kernel dispatch.
+    Issue Wait Cycles: The number of cycles a wavefront in the kernel dispatch was
+      unable to issue an instruction for any reason (e.g., execution pipe back-pressure,
+      arbitration loss, etc.) per normalization unit. This counter is incremented
+      at every cycle by all wavefronts on a CU unable to issue an instruction. As
+      such, it is most useful to get a sense of how waves were spending their time,
+      rather than identification of a precise limiter because another wave could be
+      actively executing while a wave is issue stalled. The sum of this metric, Dependency
+      Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
+    Active Cycles: The average number of cycles a wavefront in the kernel dispatch
+      was actively executing instructions per normalization unit. This measurement
+      is made on a per-wavefront basis, and may include cycles that another wavefront
+      spent actively executing (on another execution unit, for example) or was stalled.
+      As such, it is most useful to get a sense of how waves were spending their time,
+      rather than identification of a precise limiter. The sum of this metric, Issue
+      Wait Cycles and Dependency Wait Cycles should be equal to the total Wave Cycles
+      metric.
+    Wavefront Occupancy: |-
+      The time-averaged number of wavefronts resident on the accelerator over
+      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
+      kernels (less than 1ms).
@@ -2,90 +2,6 @@
 Panel Config:
  id: 1000
  title: Compute Units - Instruction Mix
-  metrics_description:
-    VALU: The total number of vector arithmetic logic unit (VALU) operations issued.
-      These are the workhorses of the compute unit, and are used to execute a wide
-      range of instruction types including floating point operations, non-uniform
-      address calculations, transcendental operations, integer operations, shifts,
-      conditional evaluation, etc.
-    VMEM: The total number of vector memory operations issued. These include most
-      loads, stores and atomic operations and all accesses to generic, global, private
-      and texture memory.
-    LDS: The total number of LDS (also known as shared memory) operations issued.
-      These include loads, stores, atomics, and HIP's __shfl operations.
-    MFMA: The total number of matrix fused multiply-add instructions issued.
-    SALU: The total number of scalar arithmetic logic unit (SALU) operations issued.
-      Typically these are used for address calculations, literal constants, and other
-      operations that are provably uniform across a wavefront. Although scalar memory
-      (SMEM) operations are issued by the SALU, they are counted separately in this
-      section.
-    SMEM: The total number of scalar memory (SMEM) operations issued. These are typically
-      used for loading kernel arguments, base-pointers and loads from HIP's __constant__
-      memory.
-    Branch: The total number of branch operations issued. These typically consist
-      of jump or branch operations and are used to implement control flow.
-    INT32: The total number of instructions operating on 32-bit integer operands issued
-      to the VALU per normalization unit.
-    INT64: The total number of instructions operating on 64-bit integer operands issued
-      to the VALU per normalization unit.
-    F16-ADD: The total number of addition instructions operating on 16-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F16-MUL: The total number of multiplication instructions operating on 16-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F16-FMA: The total number of fused multiply-add instructions operating on 16-bit
-      floating-point operands issued to the VALU per normalization unit.
-    F16-Trans: The total number of transcendental instructions (e.g., sqrt) operating
-      on 16-bit floating-point operands issued to the VALU per normalization unit.
-    F32-ADD: The total number of addition instructions operating on 32-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F32-MUL: The total number of multiplication instructions operating on 32-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F32-FMA: The total number of fused multiply-add instructions operating on 32-bit
-      floating-point operands issued to the VALU per normalization unit.
-    F32-Trans: The total number of transcendental instructions (such as sqrt) operating
-      on 32-bit floating-point operands issued to the VALU per normalization unit.
-    F64-ADD: The total number of addition instructions operating on 64-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F64-MUL: The total number of multiplication instructions operating on 64-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F64-FMA: The total number of fused multiply-add instructions operating on 64-bit
-      floating-point operands issued to the VALU per normalization unit.
-    F64-Trans: The total number of transcendental instructions (such as sqrt) operating
-      on 64-bit floating-point operands issued to the VALU per normalization unit.
-    Conversion: "The total number of type conversion instructions (such as converting\
-      \ data to or from F32\u2194F64) issued to the VALU per normalization unit."
-    Global/Generic Instr: The total number of global & generic memory instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Global/Generic Read: The total number of global & generic memory read instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Global/Generic Write: The total number of global & generic memory write instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Global/Generic Atomic: The total number of global & generic memory atomic (with
-      and without return) instructions executed on all compute units on the accelerator,
-      per normalization unit.
-    Spill/Stack Instr: The total number of spill/stack memory instructions executed
-      on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Read: The total number of spill/stack memory read instructions executed
-      on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Write: The total number of spill/stack memory write instructions executed
-      on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Atomic: The total number of spill/stack memory atomic (with and without
-      return) instructions executed on all compute units on the accelerator, per normalization
-      unit. Typically unused as these memory operations are typically used to implement
-      thread-local storage.
-    MFMA-I8: The total number of 8-bit integer MFMA instructions issued per normalization
-      unit.
-    MFMA-F8: The total number of 8-bit floating point MFMA instructions issued per
-      normalization unit. This is supported in AMD Instinct MI300 series and later
-      only.
-    MFMA-F16: The total number of 16-bit floating point MFMA instructions issued per
-      normalization unit.
-    MFMA-BF16: The total number of 16-bit brain floating point MFMA instructions issued
-      per normalization unit.
-    MFMA-F32: The total number of 32-bit floating-point MFMA instructions issued per
-      normalization unit.
-    MFMA-F64: The total number of 64-bit floating-point MFMA instructions issued per
-      normalization unit.
  data source:
  - metric_table:
      id: 1001
@@ -307,3 +223,88 @@ Panel Config:
          min: MIN((SQ_INSTS_VALU_MFMA_F64 / $denom))
          max: MAX((SQ_INSTS_VALU_MFMA_F64 / $denom))
          unit: (instr + $normUnit)
+  metrics_description:
+    VALU: The total number of vector arithmetic logic unit (VALU) operations issued.
+      These are the workhorses of the compute unit, and are used to execute a wide
+      range of instruction types including floating point operations, non-uniform
+      address calculations, transcendental operations, integer operations, shifts,
+      conditional evaluation, etc.
+    VMEM: The total number of vector memory operations issued. These include most
+      loads, stores and atomic operations and all accesses to generic, global, private
+      and texture memory.
+    LDS: The total number of LDS (also known as shared memory) operations issued.
+      These include loads, stores, atomics, and HIP's __shfl operations.
+    MFMA: The total number of matrix fused multiply-add instructions issued.
+    SALU: The total number of scalar arithmetic logic unit (SALU) operations issued.
+      Typically these are used for address calculations, literal constants, and other
+      operations that are provably uniform across a wavefront. Although scalar memory
+      (SMEM) operations are issued by the SALU, they are counted separately in this
+      section.
+    SMEM: The total number of scalar memory (SMEM) operations issued. These are typically
+      used for loading kernel arguments, base-pointers and loads from HIP's __constant__
+      memory.
+    Branch: The total number of branch operations issued. These typically consist
+      of jump or branch operations and are used to implement control flow.
+    INT32: The total number of instructions operating on 32-bit integer operands issued
+      to the VALU per normalization unit.
+    INT64: The total number of instructions operating on 64-bit integer operands issued
+      to the VALU per normalization unit.
+    F16-ADD: The total number of addition instructions operating on 16-bit floating-point
+      operands issued to the VALU per normalization unit.
+    F16-MUL: The total number of multiplication instructions operating on 16-bit floating-point
+      operands issued to the VALU per normalization unit.
+    F16-FMA: The total number of fused multiply-add instructions operating on 16-bit
+      floating-point operands issued to the VALU per normalization unit.
+    F16-Trans: The total number of transcendental instructions (e.g., sqrt) operating
+      on 16-bit floating-point operands issued to the VALU per normalization unit.
+    F32-ADD: The total number of addition instructions operating on 32-bit floating-point
+      operands issued to the VALU per normalization unit.
+    F32-MUL: The total number of multiplication instructions operating on 32-bit floating-point
+      operands issued to the VALU per normalization unit.
+    F32-FMA: The total number of fused multiply-add instructions operating on 32-bit
+      floating-point operands issued to the VALU per normalization unit.
+    F32-Trans: The total number of transcendental instructions (such as sqrt) operating
+      on 32-bit floating-point operands issued to the VALU per normalization unit.
+    F64-ADD: The total number of addition instructions operating on 64-bit floating-point
+      operands issued to the VALU per normalization unit.
+    F64-MUL: The total number of multiplication instructions operating on 64-bit floating-point
+      operands issued to the VALU per normalization unit.
+    F64-FMA: The total number of fused multiply-add instructions operating on 64-bit
+      floating-point operands issued to the VALU per normalization unit.
+    F64-Trans: The total number of transcendental instructions (such as sqrt) operating
+      on 64-bit floating-point operands issued to the VALU per normalization unit.
+    Conversion: |-
+      The total number of type conversion instructions (such as converting
+      data to or from F32↔F64) issued to the VALU per normalization unit.
+    Global/Generic Instr: The total number of global & generic memory instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Global/Generic Read: The total number of global & generic memory read instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Global/Generic Write: The total number of global & generic memory write instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Global/Generic Atomic: The total number of global & generic memory atomic (with
+      and without return) instructions executed on all compute units on the accelerator,
+      per normalization unit.
+    Spill/Stack Instr: The total number of spill/stack memory instructions executed
+      on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Read: The total number of spill/stack memory read instructions executed
+      on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Write: The total number of spill/stack memory write instructions executed
+      on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Atomic: The total number of spill/stack memory atomic (with and without
+      return) instructions executed on all compute units on the accelerator, per normalization
+      unit. Typically unused, as these memory operations mainly implement
+      thread-local storage.
+    MFMA-I8: The total number of 8-bit integer MFMA instructions issued per normalization
+      unit.
+    MFMA-F8: The total number of 8-bit floating point MFMA instructions issued per
+      normalization unit. This is supported in AMD Instinct MI300 series and later
+      only.
+    MFMA-F16: The total number of 16-bit floating point MFMA instructions issued per
+      normalization unit.
+    MFMA-BF16: The total number of 16-bit brain floating point MFMA instructions issued
+      per normalization unit.
+    MFMA-F32: The total number of 32-bit floating-point MFMA instructions issued per
+      normalization unit.
+    MFMA-F64: The total number of 64-bit floating-point MFMA instructions issued per
+      normalization unit.
@@ -2,84 +2,6 @@
 Panel Config:
  id: 1100
  title: Compute Units - Compute Pipeline
-  metrics_description:
-    VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
-      This is also presented as a percent of the peak theoretical FLOPs achievable
-      on the specific accelerator. Note: this does not include any floating-point
-      operations from MFMA instructions.'
-    VALU IOPs: 'The total integer operations executed per second on the VALU. This
-      is also presented as a percent of the peak theoretical IOPs achievable on the
-      specific accelerator. Note: this does not include any integer operations from
-      MFMA instructions.'
-    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
-      executed per second. Note: this does not include any 16-bit brain floating point
-      operations from VALU instructions. This is also presented as a percent of the
-      peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
-      per second. Note: this does not include any 16-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F16 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
-      per second. Note: this does not include any 32-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F32 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
-      per second. Note: this does not include any 64-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F64 MFMA operations achievable on the specific accelerator.'
-    MFMA IOPs (INT8): 'The total number of 8-bit integer MFMA operations executed
-      per second. Note: this does not include any 8-bit integer operations from VALU
-      instructions. This is also presented as a percent of the peak theoretical INT8
-      MFMA operations achievable on the specific accelerator.'
-    IPC: The ratio of the total number of instructions executed on the CU over the
-      total active CU cycles.
-    IPC (Issued): The ratio of the total number of (non-internal) instructions issued
-      over the number of cycles where the scheduler was actively working on issuing
-      instructions.
-    SALU Utilization: Indicates what percent of the kernel's duration the SALU was
-      busy executing instructions. Computed as the ratio of the total number of cycles
-      spent by the scheduler issuing SALU / SMEM instructions over the total CU cycles.
-    VALU Utilization: Indicates what percent of the kernel's duration the VALU was
-      busy executing instructions. Does not include VMEM operations. Computed as the
-      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
-      over the total CU cycles.
-    VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
-      was busy executing instructions, including both global/generic and spill/scratch
-      operations (see the VMEM instruction count metrics for more detail). Does not
-      include VALU operations. Computed as the ratio of the total number of cycles
-      spent by the scheduler issuing VMEM instructions over the total CU cycles.
-    Branch Utilization: Indicates what percent of the kernel's duration the branch
-      unit was busy executing instructions. Computed as the ratio of the total number
-      of cycles spent by the scheduler issuing branch instructions over the total
-      CU cycles.
-    VALU Active Threads: Indicates the average level of divergence within a wavefront
-      over the lifetime of the kernel. The number of work-items that were active in
-      a wavefront during execution of each VALU instruction, time-averaged over all
-      VALU instructions run on all wavefronts in the kernel
-    MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
-      was busy executing instructions. Computed as the ratio of the total number of
-      cycles spent by the MFMA was busy over the total CU cycles.
-    MFMA Instruction Cycles: The average duration of MFMA instructions in this kernel
-      in cycles. Computed as the ratio of the total number of cycles the MFMA unit
-      was busy over the total number of MFMA instructions.
-    VMEM Latency: The average number of round-trip cycles (that is, from issue to
-      data return / acknowledgment) required for a VMEM instruction to complete.
-    SMEM Latency: The average number of round-trip cycles (that is, from issue to
-      data return / acknowledgment) required for a SMEM instruction to complete.
-    FLOPs (Total): The total number of floating-point operations executed on either
-      the VALU or MFMA units, per normalization unit.
-    IOPs (Total): The total number of integer operations executed on either the VALU
-      or MFMA units, per normalization unit.
-    F16 OPs: The total number of 16-bit floating-point operations executed on either
-      the VALU or MFMA units, per normalization unit.
-    BF16 OPs: The total number of 16-bit brain floating-point operations executed
-      on either the VALU or MFMA units, per normalization unit.
-    F32 OPs: The total number of 32-bit floating-point operations executed on either
-      the VALU or MFMA units, per normalization unit.
-    F64 OPs: The total number of 64-bit floating-point operations executed on either
-      the VALU or MFMA units, per normalization unit.
-    INT8 OPs: The total number of 8-bit integer operations executed on either the
-      VALU or MFMA units, per normalization unit.
  data source:
  - metric_table:
      id: 1101
@@ -165,13 +87,13 @@ Panel Config:
          unit: Instr/cycle
        IPC (Issued):
          avg: AVG(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
-            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED  + SQ_INSTS_LDS)
+            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
            / SQ_ACTIVE_INST_ANY))
          min: MIN(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
            / SQ_ACTIVE_INST_ANY))
          max: MAX(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
-            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED  + SQ_INSTS_LDS)
+            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
            / SQ_ACTIVE_INST_ANY))
          unit: Instr/cycle
        SALU Utilization:
@@ -271,7 +193,7 @@ Panel Config:
            + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
            + (SQ_INSTS_VALU_FMA_F64 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
            / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        IOPs (Total):
          avg: AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8
            * 512)) / $denom)
@@ -279,12 +201,12 @@ Panel Config:
            * 512)) / $denom)
          max: MAX(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8
            * 512)) / $denom)
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        F8 OPs:
          avg: AVG(((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) / $denom))
          min: MIN(((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) / $denom))
          max: MAX(((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        F16 OPs:
          avg: AVG(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16))
            + (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512
@@ -295,12 +217,12 @@ Panel Config:
          max: MAX(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16))
            + (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512
            * SQ_INSTS_VALU_MFMA_MOPS_F16)) / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        BF16 OPs:
          avg: AVG(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
          min: MIN(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
          max: MAX(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        F32 OPs:
          avg: AVG((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
            + (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
@@ -311,7 +233,7 @@ Panel Config:
          max: MAX((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
            + (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
            / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        F64 OPs:
          avg: AVG((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
            + (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
@@ -322,9 +244,94 @@ Panel Config:
          max: MAX((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
            + (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
            / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        INT8 OPs:
          avg: AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
          min: MIN(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
          max: MAX(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
+  metrics_description:
+    VALU FLOPs: |-
+      The total floating-point operations executed per second on the VALU.
+      This is also presented as a percent of the peak theoretical FLOPs achievable
+      on the specific accelerator. Note: this does not include any floating-point
+      operations from MFMA instructions.
+    VALU IOPs: |-
+      The total integer operations executed per second on the VALU. This is
+      also presented as a percent of the peak theoretical IOPs achievable on the
+      specific accelerator. Note: this does not include any integer operations from
+      MFMA instructions.
+    MFMA FLOPs (BF16): |-
+      The total number of 16-bit brain floating point MFMA operations executed
+      per second. Note: this does not include any 16-bit brain floating point operations
+      from VALU instructions. This is also presented as a percent of the peak theoretical
+      BF16 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F16): |-
+      The total number of 16-bit floating point MFMA operations executed per
+      second. Note: this does not include any 16-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F16 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F32): |-
+      The total number of 32-bit floating point MFMA operations executed per
+      second. Note: this does not include any 32-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F32 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F64): |-
+      The total number of 64-bit floating point MFMA operations executed per
+      second. Note: this does not include any 64-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F64 MFMA operations achievable on the specific accelerator.
+    MFMA IOPs (INT8): |-
+      The total number of 8-bit integer MFMA operations executed per second.
+      Note: this does not include any 8-bit integer operations from VALU instructions.
+      This is also presented as a percent of the peak theoretical INT8 MFMA operations
+      achievable on the specific accelerator.
+    IPC: The ratio of the total number of instructions executed on the CU over the
+      total active CU cycles.
+    IPC (Issued): The ratio of the total number of (non-internal) instructions issued
+      over the number of cycles where the scheduler was actively working on issuing
+      instructions.
+    SALU Utilization: Indicates what percent of the kernel's duration the SALU was
+      busy executing instructions. Computed as the ratio of the total number of cycles
+      spent by the scheduler issuing SALU / SMEM instructions over the total CU cycles.
+    VALU Utilization: Indicates what percent of the kernel's duration the VALU was
+      busy executing instructions. Does not include VMEM operations. Computed as the
+      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
+      over the total CU cycles.
+    VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
+      was busy executing instructions, including both global/generic and spill/scratch
+      operations (see the VMEM instruction count metrics for more detail). Does not
+      include VALU operations. Computed as the ratio of the total number of cycles
+      spent by the scheduler issuing VMEM instructions over the total CU cycles.
+    Branch Utilization: Indicates what percent of the kernel's duration the branch
+      unit was busy executing instructions. Computed as the ratio of the total number
+      of cycles spent by the scheduler issuing branch instructions over the total
+      CU cycles.
+    VALU Active Threads: Indicates the average level of divergence within a wavefront
+      over the lifetime of the kernel. The number of work-items that were active in
+      a wavefront during execution of each VALU instruction, time-averaged over all
+      VALU instructions run on all wavefronts in the kernel.
+    MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
+      was busy executing instructions. Computed as the ratio of the total number of
+      cycles the MFMA unit was busy over the total CU cycles.
+    MFMA Instruction Cycles: The average duration of MFMA instructions in this kernel
+      in cycles. Computed as the ratio of the total number of cycles the MFMA unit
+      was busy over the total number of MFMA instructions.
+    VMEM Latency: The average number of round-trip cycles (that is, from issue to
+      data return / acknowledgment) required for a VMEM instruction to complete.
+    SMEM Latency: The average number of round-trip cycles (that is, from issue to
+      data return / acknowledgment) required for a SMEM instruction to complete.
+    FLOPs (Total): The total number of floating-point operations executed on either
+      the VALU or MFMA units, per normalization unit.
+    IOPs (Total): The total number of integer operations executed on either the VALU
+      or MFMA units, per normalization unit.
+    F16 OPs: The total number of 16-bit floating-point operations executed on either
+      the VALU or MFMA units, per normalization unit.
+    BF16 OPs: The total number of 16-bit brain floating-point operations executed
+      on either the VALU or MFMA units, per normalization unit.
+    F32 OPs: The total number of 32-bit floating-point operations executed on either
+      the VALU or MFMA units, per normalization unit.
+    F64 OPs: The total number of 64-bit floating-point operations executed on either
+      the VALU or MFMA units, per normalization unit.
+    INT8 OPs: The total number of 8-bit integer operations executed on either the
+      VALU or MFMA units, per normalization unit.
@@ -2,51 +2,6 @@
 Panel Config:
  id: 1200
  title: Local Data Share (LDS)
-  metrics_description:
-    Utilization: Indicates what percent of the kernel's duration the LDS was actively
-      executing instructions (including, but not limited to, load, store, atomic and
-      HIP's __shfl operations). Calculated as the ratio of the total number of cycles
-      LDS was active over the total CU cycles.
-    Access Rate: Indicates the percentage of SIMDs in the VALU actively issuing LDS
-      instructions, averaged over the lifetime of the kernel. Calculated as the ratio
-      of the total number of cycles spent by the scheduler issuing LDS instructions
-      over the total CU cycles.
-    Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
-      could have been loaded from, stored to, or atomically updated in the LDS divided
-      as percentage of theoretical peak. Does not take into account the execution
-      mask of the wavefront when the instruction was executed.
-    Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
-      loaded from, stored to, or atomically updated in the LDS divided by total duration.
-      Does not take into account the execution mask of the wavefront when the instruction
-      was executed.
-    Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
-      servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing
-      bank conflicts over the number of LDS cycles that would have been required to
-      move the same amount of data in an uncontended access.
-    LDS Instructions: The total number of LDS instructions (including, but not limited
-      to, read/write/atomics and HIP's __shfl instructions) executed per normalization
-      unit.
-    LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
-      / acknowledgment) required for an LDS instruction to complete.
-    Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS scheduler
-      due to bank conflicts (as determined by the conflict resolution hardware) to
-      the base number of cycles that would be spent in the LDS scheduler in a completely
-      uncontended case. This is the unnormalized form of the Bank Conflict Rate.
-    Index Accesses: The total number of cycles spent in the LDS scheduler over all
-      operations per normalization unit.
-    Atomic Return Cycles: The total number of cycles spent on LDS atomics with return
-      per normalization unit.
-    Bank Conflict: The total number of cycles spent in the LDS scheduler due to bank
-      conflicts (as determined by the conflict resolution hardware) per normalization
-      unit.
-    Addr Conflict: The total number of cycles spent in the LDS scheduler due to address
-      conflicts (as determined by the conflict resolution hardware) per normalization
-      unit.
-    Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
-      stalls from non-dword aligned addresses per normalization unit.
-    Mem Violations: "The total number of out-of-bounds accesses made to the LDS, per\
-      \ normalization unit. This is unused and expected to be zero in most configurations\
-      \ for modern CDNA\u2122 accelerators."
  data source:
  - metric_table:
      id: 1201
@@ -87,7 +42,7 @@ Panel Config:
          avg: AVG((SQ_INSTS_LDS / $denom))
          min: MIN((SQ_INSTS_LDS / $denom))
          max: MAX((SQ_INSTS_LDS / $denom))
-          unit: (Instr  + $normUnit)
+          unit: (Instr + $normUnit)
        Theoretical Bandwidth:
          avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
            / (End_Timestamp - Start_Timestamp)))
@@ -117,29 +72,75 @@ Panel Config:
          avg: AVG((SQ_LDS_IDX_ACTIVE / $denom))
          min: MIN((SQ_LDS_IDX_ACTIVE / $denom))
          max: MAX((SQ_LDS_IDX_ACTIVE / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Atomic Return Cycles:
          avg: AVG((SQ_LDS_ATOMIC_RETURN / $denom))
          min: MIN((SQ_LDS_ATOMIC_RETURN / $denom))
          max: MAX((SQ_LDS_ATOMIC_RETURN / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Bank Conflict:
          avg: AVG((SQ_LDS_BANK_CONFLICT / $denom))
          min: MIN((SQ_LDS_BANK_CONFLICT / $denom))
          max: MAX((SQ_LDS_BANK_CONFLICT / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Addr Conflict:
          avg: AVG((SQ_LDS_ADDR_CONFLICT / $denom))
          min: MIN((SQ_LDS_ADDR_CONFLICT / $denom))
          max: MAX((SQ_LDS_ADDR_CONFLICT / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Unaligned Stall:
          avg: AVG((SQ_LDS_UNALIGNED_STALL / $denom))
          min: MIN((SQ_LDS_UNALIGNED_STALL / $denom))
          max: MAX((SQ_LDS_UNALIGNED_STALL / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Mem Violations:
          avg: AVG((SQ_LDS_MEM_VIOLATIONS / $denom))
          min: MIN((SQ_LDS_MEM_VIOLATIONS / $denom))
          max: MAX((SQ_LDS_MEM_VIOLATIONS / $denom))
          unit: (Accesses + $normUnit)
+  metrics_description:
+    Utilization: Indicates what percent of the kernel's duration the LDS was actively
+      executing instructions (including, but not limited to, load, store, atomic and
+      HIP's __shfl operations). Calculated as the ratio of the total number of cycles
+      LDS was active over the total CU cycles.
+    Access Rate: Indicates the percentage of SIMDs in the VALU actively issuing LDS
+      instructions, averaged over the lifetime of the kernel. Calculated as the ratio
+      of the total number of cycles spent by the scheduler issuing LDS instructions
+      over the total CU cycles.
+    Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
+      could have been loaded from, stored to, or atomically updated in the LDS, expressed
+      as a percentage of the theoretical peak. Does not take into account the execution
+      mask of the wavefront when the instruction was executed.
+    Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
+      loaded from, stored to, or atomically updated in the LDS divided by total duration.
+      Does not take into account the execution mask of the wavefront when the instruction
+      was executed.
+    Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
+      servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing
+      bank conflicts over the number of LDS cycles that would have been required to
+      move the same amount of data in an uncontended access.
+    LDS Instructions: The total number of LDS instructions (including, but not limited
+      to, read/write/atomics and HIP's __shfl instructions) executed per normalization
+      unit.
+    LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
+      / acknowledgment) required for an LDS instruction to complete.
+    Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS scheduler
+      due to bank conflicts (as determined by the conflict resolution hardware) to
+      the base number of cycles that would be spent in the LDS scheduler in a completely
+      uncontended case. This is the unnormalized form of the Bank Conflict Rate.
+    Index Accesses: The total number of cycles spent in the LDS scheduler over all
+      operations per normalization unit.
+    Atomic Return Cycles: The total number of cycles spent on LDS atomics with return
+      per normalization unit.
+    Bank Conflict: The total number of cycles spent in the LDS scheduler due to bank
+      conflicts (as determined by the conflict resolution hardware) per normalization
+      unit.
+    Addr Conflict: The total number of cycles spent in the LDS scheduler due to address
+      conflicts (as determined by the conflict resolution hardware) per normalization
+      unit.
+    Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
+      stalls from non-dword aligned addresses per normalization unit.
+    Mem Violations: |-
+      The total number of out-of-bounds accesses made to the LDS, per normalization
+      unit. This is unused and expected to be zero in most configurations for
+      modern CDNA™ accelerators.
@@ -2,28 +2,6 @@
 Panel Config:
  id: 1300
  title: Instruction Cache
-  metrics_description:
-    Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
-      of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
-      the total L1I cycles.
-    Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
-      loaded line the cache. Calculated as the ratio of the number of L1I requests
-      that hit over the number of all L1I requests.
-    L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\
-      \ L2 cache request bandwidth achieved. Calculated as the ratio of the total\
-      \ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\
-      \ cycles."
-    L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
-      divided by total duration.
-    Req: The total number of requests made to the L1I per normalization-unit
-    Hits: The total number of L1I requests that hit on a previously loaded cache line,
-      per normalization-unit.
-    Misses - Non Duplicated: The total number of L1I requests that missed on a cache
-      line that were not already pending due to another request, per normalization-unit.
-    Misses - Duplicated: The total number of L1I requests that missed on a cache line
-      that were already pending due to another request, per normalization-unit.
-    Instruction Fetch Latency: The average number of cycles spent to fetch instructions
-      to a CU.
  data source:
  - metric_table:
      id: 1301
@@ -62,22 +40,22 @@ Panel Config:
          avg: AVG((SQC_ICACHE_REQ / $denom))
          min: MIN((SQC_ICACHE_REQ / $denom))
          max: MAX((SQC_ICACHE_REQ / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Hits:
          avg: AVG((SQC_ICACHE_HITS / $denom))
          min: MIN((SQC_ICACHE_HITS / $denom))
          max: MAX((SQC_ICACHE_HITS / $denom))
-          unit: (Hits  + $normUnit)
+          unit: (Hits + $normUnit)
        Misses - Non Duplicated:
          avg: AVG((SQC_ICACHE_MISSES / $denom))
          min: MIN((SQC_ICACHE_MISSES / $denom))
          max: MAX((SQC_ICACHE_MISSES / $denom))
-          unit: (Misses  + $normUnit)
+          unit: (Misses + $normUnit)
        Misses - Duplicated:
          avg: AVG((SQC_ICACHE_MISSES_DUPLICATE / $denom))
          min: MIN((SQC_ICACHE_MISSES_DUPLICATE / $denom))
          max: MAX((SQC_ICACHE_MISSES_DUPLICATE / $denom))
-          unit: (Misses  + $normUnit)
+          unit: (Misses + $normUnit)
        Cache Hit Rate:
          avg: AVG(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
            + SQC_ICACHE_MISSES_DUPLICATE)))
@@ -107,3 +85,25 @@ Panel Config:
          min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
          max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
          unit: Gbps
+  metrics_description:
+    Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
+      of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
+      the total L1I cycles.
+    Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
+      loaded line in the cache. Calculated as the ratio of the number of L1I requests
+      that hit over the number of all L1I requests.
+    L1I-L2 Bandwidth Utilization: |-
+      The percent of the peak theoretical L1I → L2 cache request bandwidth
+      achieved. Calculated as the ratio of the total number of requests from the
+      L1I to the L2 cache over the total L1I-L2 interface cycles.
+    L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
+      divided by total duration.
+    Req: The total number of requests made to the L1I per normalization-unit.
+    Hits: The total number of L1I requests that hit on a previously loaded cache line,
+      per normalization-unit.
+    Misses - Non Duplicated: The total number of L1I requests that missed on a cache
+      line that was not already pending due to another request, per normalization-unit.
+    Misses - Duplicated: The total number of L1I requests that missed on a cache line
+      that was already pending due to another request, per normalization-unit.
+    Instruction Fetch Latency: The average number of cycles spent to fetch instructions
+      to a CU.
@@ -2,49 +2,6 @@
 Panel Config:
  id: 1400
  title: Scalar L1 Data Cache
-  metrics_description:
-    Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
-      of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
-      over the total sL1D cycles.
-    Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
-      loaded line the cache. The ratio of the number of sL1D requests that hit over
-      the number of all sL1D requests.
-    sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
-      bandwidth acheived.\ \ Caclulated as total number of bytes read from, written
-      to, or atomically updated\ \ across the sL1D - L2 interface.
-    sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
-      \ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\
-      \ writes and atomics are typically unused on current CDNA accelerators, so in\
-      \ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
-    Req: The total number of requests, of any size or type, made to the sL1D per normalization
-      unit.
-    Hits: The total number of sL1D requests that hit on a previously loaded cache
-      line, per normalization unit.
-    Misses - Non Duplicated: 'The total number of sL1D requests that missed on a cache
-      line that was not already pending due to another request, per normalization
-      unit. '
-    Misses- Duplicated: The total number of sL1D requests that missed on a cache line
-      that was already pending due to another request, per normalization unit.
-    Read Req (Total): The total number of sL1D read requests of any size, per normalization
-      unit.
-    Atomic Req: The total number of atomic requests from sL1D to the L2, per normalization
-      unit. Typically unused on current CDNA accelerators.
-    Read Req (1 DWord): The total number of sL1D read requests made for a single dword
-      of data (4B), per normalization unit.
-    Read Req (2 DWord): The total number of sL1D read requests made for a two dwords
-      of data (8B), per normalization unit.
-    Read Req (4 DWord): The total number of sL1D read requests made for a four dwords
-      of data (16B), per normalization unit.
-    Read Req (8 DWord): The total number of sL1D read requests made for a eight dwords
-      of data (32B), per normalization unit.
-    Read Req (16 DWord): The total number of sL1D read requests made for a sixteen
-      dwords of data (64B), per normalization unit.
-    Read Req: The total number of read requests from sL1D to the L2 per normalization
-      unit.
-    Write Req: The total number of write requests from sL1D to the L2, per normalization
-      unit. Typically unused on current CDNA accelerators.
-    Stall Cycles: "The total number of cycles the sL1D\u2194L2 interface was stalled,\
-      \ per normalization unit."
  data source:
  - metric_table:
      id: 1401
@@ -84,22 +41,22 @@ Panel Config:
          avg: AVG((SQC_DCACHE_REQ / $denom))
          min: MIN((SQC_DCACHE_REQ / $denom))
          max: MAX((SQC_DCACHE_REQ / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Hits:
          avg: AVG((SQC_DCACHE_HITS / $denom))
          min: MIN((SQC_DCACHE_HITS / $denom))
          max: MAX((SQC_DCACHE_HITS / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Misses - Non Duplicated:
          avg: AVG((SQC_DCACHE_MISSES / $denom))
          min: MIN((SQC_DCACHE_MISSES / $denom))
          max: MAX((SQC_DCACHE_MISSES / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Misses- Duplicated:
          avg: AVG((SQC_DCACHE_MISSES_DUPLICATE / $denom))
          min: MIN((SQC_DCACHE_MISSES_DUPLICATE / $denom))
          max: MAX((SQC_DCACHE_MISSES_DUPLICATE / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Cache Hit Rate:
          avg: AVG((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
            + SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
@@ -118,37 +75,37 @@ Panel Config:
            + SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
          max: MAX((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
            + SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Atomic Req:
          avg: AVG((SQC_DCACHE_ATOMIC / $denom))
          min: MIN((SQC_DCACHE_ATOMIC / $denom))
          max: MAX((SQC_DCACHE_ATOMIC / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read Req (1 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_1 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_1 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_1 / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read Req (2 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_2 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_2 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_2 / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read Req (4 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_4 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_4 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_4 / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read Req (8 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_8 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_8 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_8 / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read Req (16 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_16 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_16 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_16 / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
  - metric_table:
      id: 1403
      title: Scalar L1D Cache - L2 Interface
@@ -171,19 +128,65 @@ Panel Config:
          avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
          min: MIN((SQC_TC_DATA_READ_REQ / $denom))
          max: MAX((SQC_TC_DATA_READ_REQ / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Write Req:
          avg: AVG((SQC_TC_DATA_WRITE_REQ / $denom))
          min: MIN((SQC_TC_DATA_WRITE_REQ / $denom))
          max: MAX((SQC_TC_DATA_WRITE_REQ / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Atomic Req:
          avg: AVG((SQC_TC_DATA_ATOMIC_REQ / $denom))
          min: MIN((SQC_TC_DATA_ATOMIC_REQ / $denom))
          max: MAX((SQC_TC_DATA_ATOMIC_REQ / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Stall Cycles:
          avg: AVG((SQC_TC_STALL / $denom))
          min: MIN((SQC_TC_STALL / $denom))
          max: MAX((SQC_TC_STALL / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
+  metrics_description:
+    Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
+      of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
+      over the total sL1D cycles.
+    Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
+      loaded line in the cache. The ratio of the number of sL1D requests that hit over
+      the number of all sL1D requests.
+    sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
+      bandwidth achieved. Calculated as total number of bytes read from, written to,
+      or atomically updated across the sL1D - L2 interface.
+    sL1D-L2 BW: |-
+      The total number of bytes read from, written to, or atomically updated
+      across the sL1D↔L2 interface, divided by total duration. Note that sL1D
+      writes and atomics are typically unused on current CDNA accelerators, so
+      in the majority of cases this can be interpreted as an sL1D→L2 read
+      bandwidth.
+    Req: The total number of requests, of any size or type, made to the sL1D per normalization
+      unit.
+    Hits: The total number of sL1D requests that hit on a previously loaded cache
+      line, per normalization unit.
+    Misses - Non Duplicated: |-
+      The total number of sL1D requests that missed on a cache line that was
+      not already pending due to another request, per normalization unit.
+    Misses- Duplicated: The total number of sL1D requests that missed on a cache line
+      that was already pending due to another request, per normalization unit.
+    Read Req (Total): The total number of sL1D read requests of any size, per normalization
+      unit.
+    Atomic Req: The total number of atomic requests from sL1D to the L2, per normalization
+      unit. Typically unused on current CDNA accelerators.
+    Read Req (1 DWord): The total number of sL1D read requests made for a single dword
+      of data (4B), per normalization unit.
+    Read Req (2 DWord): The total number of sL1D read requests made for two dwords
+      of data (8B), per normalization unit.
+    Read Req (4 DWord): The total number of sL1D read requests made for four dwords
+      of data (16B), per normalization unit.
+    Read Req (8 DWord): The total number of sL1D read requests made for eight dwords
+      of data (32B), per normalization unit.
+    Read Req (16 DWord): The total number of sL1D read requests made for sixteen
+      dwords of data (64B), per normalization unit.
+    Read Req: The total number of read requests from sL1D to the L2 per normalization
+      unit.
+    Write Req: The total number of write requests from sL1D to the L2, per normalization
+      unit. Typically unused on current CDNA accelerators.
+    Stall Cycles: |-
+      The total number of cycles the sL1D↔L2 interface was stalled, per
+      normalization unit.
@@ -2,70 +2,6 @@
 Panel Config:
  id: 1500
  title: Address Processing Unit and Data Return Path (TA/TD)
-  metrics_description:
-    Address Processing Unit Busy: Percent of the total CU cycles the address processor
-      was busy
-    Address Stall: Percent of the total CU cycles the address processor was stalled
-      from sending address requests further into the vL1D pipeline.
-    Data Stall: Percent of the total CU cycles the address processor was stalled from
-      sending write/atomic data further into the vL1D pipeline.
-    "Data-Processor \u2192 Address Stall": Percent of total CU cycles the address
-      processor was stalled waiting to send command data to the data processor.
-    Total Instructions: The total number of memory instructions executed by the address
-      processer over all compute units on the accelerator, per normalization unit.
-    Global/Generic Instructions: The total number of global & generic memory instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Global/Generic Read Instructions: The total number of global & generic memory
-      read instructions executed on all compute units on the accelerator, per normalization
-      unit.
-    Global/Generic Write Instructions: The total number of global & generic memory
-      write instructions executed on all compute units on the accelerator, per normalization
-      unit.
-    Global/Generic Atomic Instructions: The total number of global & generic memory
-      atomic (with and without return) instructions executed on all compute units
-      on the accelerator, per normalization unit.
-    Spill/Stack Instructions: The total number of spill/stack memory instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Read Instructions: The total number of spill/stack memory read instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Write Instructions: The total number of spill/stack memory write instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Atomic Instructions: The total number of spill/stack memory atomic
-      (with and without return) instructions executed on all compute units on the
-      accelerator, per normalization unit. Typically unused as these memory operations
-      are typically used to implement thread-local storage.
-    Spill/Stack Total Cycles: The number of cycles the address processing unit spent
-      working on spill/stack instructions, per normalization unit.
-    Spill/Stack Coalesced Read: The number of cycles the address processing unit spent
-      working on coalesced spill/stack read instructions, per normalization unit.
-    Spill/Stack Coalesced Write: The number of cycles the address processing unit
-      spent working on coalesced spill/stack write instructions, per normalization
-      unit.
-    Data-Return Busy: Percent of the total CU cycles the data-return unit was busy
-      processing or waiting on data to return to the CU.
-    "Cache RAM \u2192 Data-Return Stall": Percent of the total CU cycles the data-return
-      unit was stalled on data to be returned from the vL1D Cache RAM.
-    "Workgroup manager \u2192 Data-Return Stall": Percent of the total CU cycles the
-      data-return unit was stalled by the workgroup manager due to initialization
-      of registers as a part of launching new workgroups.
-    Coalescable Instructions: The number of instructions submitted to the data-return
-      unit by the address processor that were found to be coalescable, per normalization
-      unit.
-    Read Instructions: The number of read instructions submitted to the data-return
-      unit by the address processor summed over all compute units on the accelerator,
-      per normalization unit. This is expected to be the sum of global/generic and
-      spill/stack reads in the address processor.
-    Write Instructions: The number of store instructions submitted to the data-return
-      unit by the address processor summed over all compute units on the accelerator,
-      per normalization unit. This is expected to be the sum of global/generic and
-      spill/stack stores in the address processor.
-    Atomic Instructions: The number of atomic instructions submitted to the data-return
-      unit by the address processor summed over all compute units on the accelerator,
-      per normalization unit. This is expected to be the sum of global/generic and
-      spill/stack atomics in the address processor.
-    Write Ack Instructions: The total number of write acknowledgements submitted by
-      data-return unit to SQ, summed over all compute units on the accelerator, per
-      normalization unit.
  data source:
  - metric_table:
      id: 1501
@@ -135,47 +71,47 @@ Panel Config:
          avg: AVG((TA_TOTAL_WAVEFRONTS_sum / $denom))
          min: MIN((TA_TOTAL_WAVEFRONTS_sum / $denom))
          max: MAX((TA_TOTAL_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Global/Generic Instructions:
          avg: AVG((TA_FLAT_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Global/Generic Read Instructions:
          avg: AVG((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Global/Generic Write Instructions:
          avg: AVG((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Global/Generic Atomic Instructions:
          avg: AVG((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Spill/Stack Instructions:
          avg: AVG((TA_BUFFER_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Spill/Stack Read Instructions:
          avg: AVG((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Spill/Stack Write Instructions:
          avg: AVG((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Spill/Stack Atomic Instructions:
          avg: AVG((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
  - metric_table:
      id: 1503
      title: Spill and stack metrics
@@ -190,17 +126,17 @@ Panel Config:
          avg: AVG((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
          min: MIN((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
          max: MAX((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Spill/Stack Coalesced Read:
          avg: AVG((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
          min: MIN((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
          max: MAX((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Spill/Stack Coalesced Write:
          avg: AVG((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
          min: MIN((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
          max: MAX((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
  - metric_table:
      id: 1504
      title: Vector L1 data-return path or Texture Data (TD)
@@ -230,7 +166,7 @@ Panel Config:
          avg: AVG((TD_COALESCABLE_WAVEFRONT_sum / $denom))
          min: MIN((TD_COALESCABLE_WAVEFRONT_sum / $denom))
          max: MAX((TD_COALESCABLE_WAVEFRONT_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Read Instructions:
          avg: AVG((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
            / $denom))
@@ -238,14 +174,75 @@ Panel Config:
            / $denom))
          max: MAX((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
            / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Write Instructions:
          avg: AVG((TD_STORE_WAVEFRONT_sum / $denom))
          min: MIN((TD_STORE_WAVEFRONT_sum / $denom))
          max: MAX((TD_STORE_WAVEFRONT_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
        Atomic Instructions:
          avg: AVG((TD_ATOMIC_WAVEFRONT_sum / $denom))
          min: MIN((TD_ATOMIC_WAVEFRONT_sum / $denom))
          max: MAX((TD_ATOMIC_WAVEFRONT_sum / $denom))
-          unit: (Instructions  + $normUnit)
+          unit: (Instructions + $normUnit)
+  metrics_description:
+    Address Processing Unit Busy: Percent of the total CU cycles the address processor
+      was busy.
+    Address Stall: Percent of the total CU cycles the address processor was stalled
+      from sending address requests further into the vL1D pipeline.
+    Data Stall: Percent of the total CU cycles the address processor was stalled from
+      sending write/atomic data further into the vL1D pipeline.
+    "Data-Processor \u2192 Address Stall": Percent of total CU cycles the address
+      processor was stalled waiting to send command data to the data processor.
+    Total Instructions: The total number of memory instructions executed by the address
+      processor over all compute units on the accelerator, per normalization unit.
+    Global/Generic Instructions: The total number of global & generic memory instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Global/Generic Read Instructions: The total number of global & generic memory
+      read instructions executed on all compute units on the accelerator, per normalization
+      unit.
+    Global/Generic Write Instructions: The total number of global & generic memory
+      write instructions executed on all compute units on the accelerator, per normalization
+      unit.
+    Global/Generic Atomic Instructions: The total number of global & generic memory
+      atomic (with and without return) instructions executed on all compute units
+      on the accelerator, per normalization unit.
+    Spill/Stack Instructions: The total number of spill/stack memory instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Read Instructions: The total number of spill/stack memory read instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Write Instructions: The total number of spill/stack memory write instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Atomic Instructions: The total number of spill/stack memory atomic
+      (with and without return) instructions executed on all compute units on the
+      accelerator, per normalization unit. Typically unused, as these memory operations
+      are generally used to implement thread-local storage.
+    Spill/Stack Total Cycles: The number of cycles the address processing unit spent
+      working on spill/stack instructions, per normalization unit.
+    Spill/Stack Coalesced Read: The number of cycles the address processing unit spent
+      working on coalesced spill/stack read instructions, per normalization unit.
+    Spill/Stack Coalesced Write: The number of cycles the address processing unit
+      spent working on coalesced spill/stack write instructions, per normalization
+      unit.
+    Data-Return Busy: Percent of the total CU cycles the data-return unit was busy
+      processing or waiting on data to return to the CU.
+    "Cache RAM \u2192 Data-Return Stall": Percent of the total CU cycles the data-return
+      unit was stalled on data to be returned from the vL1D Cache RAM.
+    "Workgroup manager \u2192 Data-Return Stall": Percent of the total CU cycles the
+      data-return unit was stalled by the workgroup manager due to initialization
+      of registers as a part of launching new workgroups.
+    Coalescable Instructions: The number of instructions submitted to the data-return
+      unit by the address processor that were found to be coalescable, per normalization
+      unit.
+    Read Instructions: The number of read instructions submitted to the data-return
+      unit by the address processor summed over all compute units on the accelerator,
+      per normalization unit. This is expected to be the sum of global/generic and
+      spill/stack reads in the address processor.
+    Write Instructions: The number of store instructions submitted to the data-return
+      unit by the address processor summed over all compute units on the accelerator,
+      per normalization unit. This is expected to be the sum of global/generic and
+      spill/stack stores in the address processor.
+    Atomic Instructions: The number of atomic instructions submitted to the data-return
+      unit by the address processor summed over all compute units on the accelerator,
+      per normalization unit. This is expected to be the sum of global/generic and
+      spill/stack atomics in the address processor.
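The expressions in this panel all follow one pattern: a summed hardware counter is divided by `$denom`, then reduced with `AVG`, `MIN`, and `MAX`. The sketch below illustrates that pattern in plain Python. It assumes one counter sample per kernel dispatch and a single scalar `denom`; the function names and sample values are hypothetical and are not part of the profiler's code.

```python
from statistics import mean


def per_norm_unit(counter_values, denom):
    """Scale each sample by denom and return (avg, min, max).

    Mirrors the shape of AVG((X_sum / $denom)), MIN(...), MAX(...).
    Assumption: denom is one non-zero scalar shared by all samples.
    """
    if denom == 0:
        raise ValueError("denom must be non-zero")
    scaled = [value / denom for value in counter_values]
    return mean(scaled), min(scaled), max(scaled)


def td_read_instructions(load_wavefronts, store_wavefronts, atomic_wavefronts):
    """Read count as written in the 'Read Instructions' expression:
    TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum - TD_ATOMIC_WAVEFRONT_sum.
    """
    return load_wavefronts - store_wavefronts - atomic_wavefronts


if __name__ == "__main__":
    # Hypothetical TA_FLAT_WAVEFRONTS_sum samples from three dispatches.
    print(per_norm_unit([100, 200, 300], denom=10))
    print(td_read_instructions(50, 20, 5))
```

The subtraction in `td_read_instructions` follows the expression in the config, which implies that the load counter also includes store and atomic wavefronts.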
@@ -2,117 +2,6 @@
 Panel Config:
  id: 1600
  title: Vector L1 Data Cache
-  metrics_description:
-    Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
-      cache over the total number of cache line requests to the vL1D Cache RAM.
-    Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
-      of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
-      on the specific accelerator. The number of bytes is calculated as the number
-      of cache lines requested multiplied by the cache line size. This value does
-      not consider partial requests, so for instance, if only a single value is requested
-      in a cache line, the data movement will still be counted as a full cache line.
-    Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
-      The number of cycles where the vL1D Cache RAM is actively processing any request
-      divided by the number of cycles where the vL1D is active.
-    Coalescing: Indicates how well memory instructions were coalesced by the address
-      processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
-      as the average number of thread-requests generated per instruction divided by
-      the ideal number of thread-requests per instruction.
-    Stalled on L2 Data: The ratio of the number of cycles where the vL1D is stalled
-      waiting for requested data to return from the L2 cache divided by the number
-      of cycles where the vL1D is active.
-    Stalled on L2 Req: The ratio of the number of cycles where the vL1D is stalled
-      waiting to issue a request for data to the L2 cache divided by the number of
-      cycles where the vL1D is active.
-    Tag RAM Stall (Read): The ratio of the number of cycles where the vL1D is stalled
-      due to Read requests with conflicting tags being looked up concurrently, divided
-      by the number of cycles where the vL1D is active.
-    Tag RAM Stall (Write): The ratio of the number of cycles where the vL1D is stalled
-      due to Write requests with conflicting tags being looked up concurrently, divided
-      by the number of cycles where the vL1D is active.
-    Tag RAM Stall (Atomic): The ratio of the number of cycles where the vL1D is stalled
-      due to Atomic requests with conflicting tags being looked up concurrently, divided
-      by the number of cycles where the vL1D is active.
-    Total Req: The total number of incoming requests from the address processing unit
-      after coalescing.
-    Read Req: The total number of incoming read requests from the address processing
-      unit after coalescing per normalization unit.
-    Write Req: The total number of incoming write requests from the address processing
-      unit after coalescing per normalization unit.
-    Atomic Req: The total number of incoming atomic requests from the address processing
-      unit after coalescing per normalization unit.
-    Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
-      instructions divided by total duration. The number of bytes is calculated as
-      the number of cache lines requested multiplied by the cache line size.  This
-      value does not consider partial requests, so for instance, if only a single
-      value is requested in a cache line, the data movement will still be counted
-      as a full cache line.
-    Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
-      vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
-    Cache Accesses: The total number of cache line lookups in the vL1D.
-    Cache Hits: The number of cache accesses minus the number of outgoing requests
-      to the L2 cache, that is, the number of cache line requests serviced by the
-      vL1D Cache RAM per normalization unit.
-    Invalidations: The number of times the vL1D was issued a write-back invalidate
-      command during the kernel's execution per normalization unit. This may be triggered
-      by, for instance, the buffer_wbinvl1 instruction.
-    L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
-      of VMEM instructions, divided by total duration. The number of bytes is calculated
-      as the number of cache lines requested multiplied by the cache line size. This
-      value does not consider partial requests, so for instance, if only a single
-      value is requested in a cache line, the data movement will still be counted
-      as a full cache line.
-    L1-L2 Read: The number of read requests for a vL1D cache line that were not satisfied
-      by the vL1D and must be retrieved from the to the L2 Cache per normalization
-      unit.
-    L1-L2 Write: The number of write requests to a vL1D cache line that were sent
-      through the vL1D to the L2 cache, per normalization unit.
-    L1-L2 Atomic: The number of atomic requests that are sent through the vL1D to
-      the L2 cache, per normalization unit. This includes requests for atomics with,
-      and without return.
-    L1 Access Latency: Calculated as the average number of cycles that a vL1D cache
-      line request spent in the vL1D cache pipeline.
-    L1-L2 Read Latency: Calculated as the average number of cycles that the vL1D cache
-      took to issue and receive read requests from the L2 Cache. This number also
-      includes requests for atomics with return values.
-    L1-L2 Write Latency: Calculated as the average number of cycles that the vL1D
-      cache took to issue and receive acknowledgement of a write request to the L2
-      Cache. This number also includes requests for atomics without return values.
-    NC - Read: Total read requests with NC mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    UC - Read: Total read requests with UC mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    CC - Read: Total read requests with CC mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    RW - Read: Total read requests with RW mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    RW - Write: Total write requests with RW mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    NC - Write: Total write requests with NC mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    UC - Write: Total write requests with UC mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    CC - Write: Total write requests with CC mtype from this TCP to all TCCs Sum over
-      TCP instances per normalization unit.
-    NC - Atomic: Total atomic requests with NC mtype from this TCP to all TCCs Sum
-      over TCP instances per normalization unit.
-    UC - Atomic: Total atomic requests with UC mtype from this TCP to all TCCs Sum
-      over TCP instances per normalization unit.
-    CC - Atomic: Total atomic requests with CC mtype from this TCP to all TCCs Sum
-      over TCP instances per normalization unit.
-    RW - Atomic: Total atomic requests with RW mtype from this TCP to all TCCs Sum
-      over TCP instances per normalization unit.
-    Req: The number of translation requests made to the UTCL1 per normalization unit.
-    Hit Ratio: The ratio of the number of translation requests that hit in the UTCL1
-      divided by the total number of translation requests made to the UTCL1.
-    Hits: The number of translation requests that hit in the UTCL1, and could be reused,
-      per normalization unit.
-    Translation Misses: The total number of translation requests that missed in the
-      UTCL1 due to  translation not being present in the cache, per normalization
-      unit.
-    Permission Misses: "The total number of translation requests that missed in the\
-      \ UTCL1 due to a permission error, per normalization unit. This is unused and\
-      \ expected to be zero in most configurations for modern CDNA\u2122 accelerators."
  data source:
  - metric_table:
      id: 1601
@@ -181,17 +70,17 @@ Panel Config:
          avg: AVG((TCP_TOTAL_ACCESSES_sum / $denom))
          min: MIN((TCP_TOTAL_ACCESSES_sum / $denom))
          max: MAX((TCP_TOTAL_ACCESSES_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read Req:
          avg: AVG((TCP_TOTAL_READ_sum / $denom))
          min: MIN((TCP_TOTAL_READ_sum / $denom))
          max: MAX((TCP_TOTAL_READ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Write Req:
          avg: AVG((TCP_TOTAL_WRITE_sum / $denom))
          min: MIN((TCP_TOTAL_WRITE_sum / $denom))
          max: MAX((TCP_TOTAL_WRITE_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Atomic Req:
          avg: AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
            / $denom))
@@ -199,7 +88,7 @@ Panel Config:
            / $denom))
          max: MAX(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
            / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Cache BW:
          avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
          min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
@@ -223,7 +112,7 @@ Panel Config:
          avg: AVG((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
          min: MIN((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
          max: MAX((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Cache Hits:
          avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
@@ -234,7 +123,7 @@ Panel Config:
          max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Invalidations:
          avg: AVG((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
          min: MIN((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
@@ -252,12 +141,12 @@ Panel Config:
          avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_READ_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        L1-L2 Write:
          avg: AVG((TCP_TCC_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_WRITE_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        L1-L2 Atomic:
          avg: AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
            / $denom))
@@ -265,7 +154,7 @@ Panel Config:
            / $denom))
          max: MAX(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
            / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
  - metric_table:
      id: 1604
      title: L1D - L2 Transactions
@@ -284,84 +173,84 @@ Panel Config:
          avg: AVG((TCP_TCC_NC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_NC_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_NC_READ_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        UC - Read:
          xfer: Read
          coherency: UC
          avg: AVG((TCP_TCC_UC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_UC_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_UC_READ_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        CC - Read:
          xfer: Read
          coherency: CC
          avg: AVG((TCP_TCC_CC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_CC_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_CC_READ_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        RW - Read:
          xfer: Read
          coherency: RW
          avg: AVG((TCP_TCC_RW_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_RW_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_RW_READ_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        RW - Write:
          xfer: Write
          coherency: RW
          avg: AVG((TCP_TCC_RW_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_RW_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_RW_WRITE_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        NC - Write:
          xfer: Write
          coherency: NC
          avg: AVG((TCP_TCC_NC_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_NC_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_NC_WRITE_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        UC - Write:
          xfer: Write
          coherency: UC
          avg: AVG((TCP_TCC_UC_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_UC_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_UC_WRITE_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        CC - Write:
          xfer: Write
          coherency: CC
          avg: AVG((TCP_TCC_CC_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_CC_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_CC_WRITE_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        NC - Atomic:
          xfer: Atomic
          coherency: NC
          avg: AVG((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
          min: MIN((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
          max: MAX((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        UC - Atomic:
          xfer: Atomic
          coherency: UC
          avg: AVG((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
          min: MIN((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
          max: MAX((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        CC - Atomic:
          xfer: Atomic
          coherency: CC
          avg: AVG((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
          min: MIN((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
          max: MAX((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        RW - Atomic:
          xfer: Atomic
          coherency: RW
          avg: AVG((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
          min: MIN((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
          max: MAX((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
  - metric_table:
      id: 1605
      title: L1 Unified Translation Cache (UTCL1)
@@ -410,3 +299,106 @@ Panel Config:
        max: Max
        units: Unit
      metric: {}
+  metrics_description:
+    Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
+      cache over the total number of cache line requests to the vL1D Cache RAM.
+    Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
+      of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
+      on the specific accelerator. The number of bytes is calculated as the number
+      of cache lines requested multiplied by the cache line size. This value does
+      not consider partial requests, so for instance, if only a single value is requested
+      in a cache line, the data movement will still be counted as a full cache line.
+    Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
+      The number of cycles where the vL1D Cache RAM is actively processing any request
+      divided by the number of cycles where the vL1D is active.
+    Coalescing: Indicates how well memory instructions were coalesced by the address
+      processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
+      as the average number of thread-requests generated per instruction divided by
+      the ideal number of thread-requests per instruction.
+    Stalled on L2 Data: The ratio of the number of cycles where the vL1D is stalled
+      waiting for requested data to return from the L2 cache divided by the number
+      of cycles where the vL1D is active.
+    Stalled on L2 Req: The ratio of the number of cycles where the vL1D is stalled
+      waiting to issue a request for data to the L2 cache divided by the number of
+      cycles where the vL1D is active.
+    Tag RAM Stall (Read): The ratio of the number of cycles where the vL1D is stalled
+      due to Read requests with conflicting tags being looked up concurrently, divided
+      by the number of cycles where the vL1D is active.
+    Tag RAM Stall (Write): The ratio of the number of cycles where the vL1D is stalled
+      due to Write requests with conflicting tags being looked up concurrently, divided
+      by the number of cycles where the vL1D is active.
+    Tag RAM Stall (Atomic): The ratio of the number of cycles where the vL1D is stalled
+      due to Atomic requests with conflicting tags being looked up concurrently, divided
+      by the number of cycles where the vL1D is active.
+    Total Req: The total number of incoming requests from the address processing unit
+      after coalescing, per normalization unit.
+    Read Req: The total number of incoming read requests from the address processing
+      unit after coalescing per normalization unit.
+    Write Req: The total number of incoming write requests from the address processing
+      unit after coalescing per normalization unit.
+    Atomic Req: The total number of incoming atomic requests from the address processing
+      unit after coalescing per normalization unit.
+    Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
+      instructions divided by total duration. The number of bytes is calculated as
+      the number of cache lines requested multiplied by the cache line size. This
+      value does not consider partial requests, so for instance, if only a single
+      value is requested in a cache line, the data movement will still be counted
+      as a full cache line.
+    Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
+      vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
+    Cache Accesses: The total number of cache line lookups in the vL1D.
+    Cache Hits: The number of cache accesses minus the number of outgoing requests
+      to the L2 cache, that is, the number of cache line requests serviced by the
+      vL1D Cache RAM per normalization unit.
+    Invalidations: The number of times the vL1D was issued a write-back invalidate
+      command during the kernel's execution per normalization unit. This may be triggered
+      by, for instance, the buffer_wbinvl1 instruction.
+    L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
+      of VMEM instructions, divided by total duration. The number of bytes is calculated
+      as the number of cache lines requested multiplied by the cache line size. This
+      value does not consider partial requests, so for instance, if only a single
+      value is requested in a cache line, the data movement will still be counted
+      as a full cache line.
+    L1-L2 Read: The number of read requests for a vL1D cache line that were not satisfied
+      by the vL1D and must be retrieved from the L2 cache, per normalization
+      unit.
+    L1-L2 Write: The number of write requests to a vL1D cache line that were sent
+      through the vL1D to the L2 cache, per normalization unit.
+    L1-L2 Atomic: The number of atomic requests that are sent through the vL1D to
+      the L2 cache, per normalization unit. This includes requests for atomics with
+      and without return.
+    NC - Read: Total read requests with NC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    UC - Read: Total read requests with UC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    CC - Read: Total read requests with CC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    RW - Read: Total read requests with RW mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    RW - Write: Total write requests with RW mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    NC - Write: Total write requests with NC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    UC - Write: Total write requests with UC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    CC - Write: Total write requests with CC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    NC - Atomic: Total atomic requests with NC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    UC - Atomic: Total atomic requests with UC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    CC - Atomic: Total atomic requests with CC mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    RW - Atomic: Total atomic requests with RW mtype from this TCP to all TCCs, summed
+      over TCP instances, per normalization unit.
+    Req: The number of translation requests made to the UTCL1 per normalization unit.
+    Hit Ratio: The ratio of the number of translation requests that hit in the UTCL1
+      divided by the total number of translation requests made to the UTCL1.
+    Hits: The number of translation requests that hit in the UTCL1, and could be reused,
+      per normalization unit.
+    Translation Misses: The total number of translation requests that missed in the
+      UTCL1 because the translation was not present in the cache, per normalization unit.
+    Permission Misses: |-
+      The total number of translation requests that missed in the UTCL1 due
+      to a permission error, per normalization unit. This is unused and expected
+      to be zero in most configurations for modern CDNA™ accelerators.
@@ -2,218 +2,6 @@
 Panel Config:
  id: 1700
  title: L2 Cache
-  metrics_description:
-    Utilization: The ratio of the number of cycles an L2 channel was active, summed
-      over all L2 channels on the accelerator over the total L2 cycles.
-    Peak Bandwidth: The number of bytes looked up in the L2 cache, as a percent of
-      the peak theoretical bandwidth achievable on the specific accelerator. The number
-      of bytes is calculated as the number of cache lines requested multiplied by
-      the cache line size. This value does not consider partial requests, so e.g.,
-      if only a single value is requested in a cache line, the data movement will
-      still be counted as a full cache line.
-    Hit Rate: The ratio of the number of L2 cache line requests that hit in the L2
-      cache over the total number of incoming cache line requests to the L2 cache.
-    L2-Fabric Read BW: The number of bytes read by the L2 over the Infinity Fabric
-      interface per unit time.
-    L2-Fabric Write and Atomic BW: The number of bytes sent by the L2 over the Infinity
-      Fabric interface by write and atomic operations per unit time.
-    HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
-      memory (HBM) per unit time. This value is calculated as the number of HBM channels
-      multiplied by the HBM channel width multiplied by the HBM clock frequency.
-    Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided
-      by total duration.
-    HBM Read Traffic: The percent of read requests generated by the L2 cache that
-      are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
-      does not consider the size of the request (meaning that 32B and 64B requests
-      are both counted as a single request), so this metric only approximates the
-      percent of the L2-Fabric Read bandwidth directed to the local HBM.
-    Remote Read Traffic: The percent of read requests generated by the L2 cache that
-      are routed to any memory location other than the accelerator's local high-bandwidth
-      memory (HBM) - for example, the CPU's DRAM or a remote accelerator's HBM. This
-      breakdown does not consider the size of the request (meaning that 32B and 64B
-      requests are both counted as a single request), so this metric only approximates
-      the percent of the L2-Fabric Read bandwidth directed to a remote location.
-    Uncached Read Traffic: The percent of read requests generated by the L2 cache
-      that are reading from an uncached memory allocation. Note, as described in the
-      request flow section, a single 64B read request is typically counted as two
-      uncached read requests. So, it is possible for the Uncached Read Traffic to
-      reach up to 200% of the total number of read requests. This breakdown does not
-      consider the size of the request (i.e., 32B and 64B requests are both counted
-      as a single request), so this metric only approximates the percent of the L2-Fabric
-      read bandwidth directed to an uncached memory location.
-    Write and Atomic BW: The total number of bytes written by the L2 over Infinity
-      Fabric by write and atomic operations divided by total duration. Note that on
-      current CDNA accelerators, such as the MI2XX, requests are only considered atomic
-      by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
-      fine-grained memory allocations or uncached memory allocations on the MI2XX.
-    HBM Write and Atomic Traffic: The percent of write and atomic requests generated
-      by the L2 cache that are routed to the accelerator's local high-bandwidth memory
-      (HBM). This breakdown does not consider the size of the request (meaning that
-      32B and 64B requests are both counted as a single request), so this metric only
-      approximates the percent of the L2-Fabric Write and Atomic bandwidth directed
-      to the local HBM. Note that on current CDNA accelerators, such as the MI2XX,
-      requests are only considered atomic by Infinity Fabric if they are targeted
-      at fine-grained memory allocations or uncached memory allocations.
-    Remote Write and Atomic Traffic: The percent of read requests generated by the
-      L2 cache that are routed to any memory location other than the accelerator's
-      local high-bandwidth memory (HBM) - for example, the CPU's DRAM or a remote
-      accelerator's HBM. This breakdown does not consider the size of the request
-      (meaning that 32B and 64B requests are both counted as a single request), so
-      this metric only approximates the percent of the L2-Fabric Read bandwidth directed
-      to a remote location. Note that on current CDNA accelerators, such as the MI2XX,
-      requests are only considered atomic by Infinity Fabric if they are targeted
-      at fine-grained memory allocations or uncached memory allocations.
-    Atomic Traffic: The percent of write requests generated by the L2 cache that are
-      atomic requests to any memory location. This breakdown does not consider the
-      size of the request (meaning that 32B and 64B requests are both counted as a
-      single request), so this metric only approximates the percent of the L2-Fabric
-      Read bandwidth directed to a remote location. Note that on current CDNA accelerators,
-      such as the MI2XX, requests are only considered atomic by Infinity Fabric if
-      they are targeted at fine-grained memory allocations or uncached memory allocations.
-    Uncached Write and Atomic Traffic: The percent of write and atomic requests generated
-      by the L2 cache that are targeting uncached memory allocations. This breakdown
-      does not consider the size of the request (meaning that 32B and 64B requests
-      are both counted as a single request), so this metric only approximates the
-      percent of the L2-Fabric read bandwidth directed to uncached memory allocations.
-    Read Latency: The time-averaged number of cycles read requests spent in Infinity
-      Fabric before data was returned to the L2.
-    Write and Atomic Latency: The time-averaged number of cycles write requests spent
-      in Infinity Fabric before a completion acknowledgement was returned to the L2.
-    Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
-      Fabric before a completion acknowledgement (atomic without return value) or
-      data (atomic with return value) was returned to the L2.
-    Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration.
-      The number of bytes is calculated as the number of cache lines requested multiplied
-      by the cache line size. This value does not consider partial requests, so for
-      example, if only a single value is requested in a cache line, the data movement
-      will still be counted as a full cache line.
-    Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests,
-      divided by total duration.
-    Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests,
-      divided by total duration.
-    Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests,
-      divided by total duration.
-    Req: The total number of incoming requests to the L2 from all clients for all
-      request types, per normalization unit.
-    Read Req: The total number of read requests to the L2 from all clients.
-    Write Req: The total number of write requests to the L2 from all clients.
-    Atomic Req: The total number of atomic requests (with and without return) to the
-      L2 from all clients.
-    Streaming Req: The total number of incoming requests to the L2 that are marked
-      as streaming. The exact meaning of this may differ depending on the targeted
-      accelerator, however on an MI2XX this corresponds to non-temporal load or stores.
-      The L2 cache attempts to evict streaming requests before normal requests when
-      the L2 is at capacity.
-    Probe Req: The number of coherence probe requests made to the L2 cache from outside
-      the accelerator. On an MI2XX, probe requests may be generated by, for example,
-      writes to fine-grained device memory or by writes to coarse-grained device memory.
-    Cache Hit: The ratio of the number of L2 cache line requests that hit in the L2
-      cache over the total number of incoming cache line requests to the L2 cache.
-    Hits: The total number of requests to the L2 from all clients that hit in the
-      cache. As noted in the Speed-of-Light section, this includes hit-on-miss requests.
-    Misses: The total number of requests to the L2 from all clients that miss in the
-      cache. As noted in the Speed-of-Light section, these do not include hit-on-miss
-      requests.
-    Writeback: The total number of L2 cache lines written back to memory for any reason.
-      Write-backs may occur due to user code (such as HIP kernel calls to _threadfence_system
-      or atomic built-ins) by the command processor's memory acquire/release fences,
-      or for other internal hardware reasons.
-    Writeback (Internal): The total number of L2 cache lines written back to memory
-      for internal hardware reasons, per normalization unit.
-    Writeback (vL1D Req): The total number of L2 cache lines written back to memory
-      due to requests initiated by the vL1D cache, per normalization unit.
-    Evict (Internal): The total number of L2 cache lines evicted from the cache due
-      to capacity limits, per normalization unit.
-    Evict (vL1D Req): The total number of L2 cache lines evicted from the cache due
-      to invalidation requests initiated by the vL1D cache, per normalization unit.
-    NC Req: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory
-      allocations, per normalization unit.
-    UC Req: The total number of requests to the L2 that go to Uncached (UC) memory
-      allocations.
-    CC Req: The total number of requests to the L2 that go to Coherently Cacheable
-      (CC) memory allocations.
-    RW Req: The total number of requests to the L2 that go to Read-Write coherent
-      memory (RW) allocations.
-    Write - Credit Starvation: The number of cycles the L2-Fabric interface was stalled
-      on write or atomic requests to any memory location because too many write/atomic
-      requests were currently in flight, as a percent of the total active L2 cycles.
-    Read (32B): The total number of L2 requests to Infinity Fabric to read 32B of
-      data from any memory location, per normalization unit.
-    Read (64B): The total number of L2 requests to Infinity Fabric to read 64B of
-      data from any memory location, per normalization unit.
-    Read (Uncached): The total number of L2 requests to Infinity Fabric to read uncached
-      data from any memory location, per normalization unit. 64B requests for uncached
-      data are counted as two 32B uncached data requests.
-    HBM Read: The total number of L2 requests to Infinity Fabric to read 32B or 64B
-      of data from the accelerator's local HBM, per normalization unit.
-    Remote Read: The total number of L2 requests to Infinity Fabric to read 32B or
-      64B of data from any source other than the accelerator's local HBM, per normalization
-      unit.
-    Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe
-      traffic, divided by total duration.
-    "Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read
-      requests due to Infinity Fabric traffic, divided by total duration.
-    Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM
-      traffic, divided by total duration.
-    Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
-      write or atomically update 32B of data to any memory location, per normalization
-      unit.
-    Write and Atomic (Uncached): The total number of L2 requests to Infinity Fabric
-      to write or atomically update 32B or 64B of uncached data, per normalization
-      unit.
-    Write and Atomic (64B): The total number of L2 requests to Infinity Fabric to
-      write or atomically update 64B of data in any memory location, per normalization
-      unit.
-    HBM Write and Atomic: The total number of L2 requests to Infinity Fabric to write
-      or atomically update 32B or 64B of data in the accelerator's local HBM, per
-      normalization unit.
-    Remote Write and Atomic: The total number of L2 requests to Infinity Fabric to
-      write or atomically update 32B or 64B of data in any memory location other than
-      the accelerator's local HBM, per normalization unit.
-    Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to
-      PCIe traffic, divided by total duration.
-    "Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write
-      requests due to Infinity Fabric traffic, divided by total duration.
-    Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM
-      traffic, divided by total duration.
-    Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to
-      PCIe traffic, divided by total duration.
-    "Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic
-      requests due to Infinity Fabric traffic, divided by total duration.
-    Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
-      HBM traffic, divided by total duration.
-    Atomic: The total number of L2 requests to Infinity Fabric to atomically update
-      32B or 64B of data in any memory location, per normalization unit. See Request
-      flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
-      requests are only considered atomic by Infinity Fabric if they are targeted
-      at non-write-cacheable memory, such as fine-grained memory allocations or uncached
-      memory allocations on the MI2XX.
-    Read Stall: "The ratio of the total number of cycles the L2-Fabric interface was\
-      \ stalled on a read request to any destination (local HBM, remote PCIe\xAE connected\
-      \ accelerator or CPU, or remote Infinity Fabric connected accelerator or CPU)\
-      \ over the total active L2 cycles."
-    Write Stall: The ratio of the total number of cycles the L2-Fabric interface was
-      stalled on a write or atomic request to any destination (local HBM, remote accelerator
-      or CPU, PCIe connected accelerator or CPU, or remote Infinity Fabric connected
-      accelerator or CPU) over the total active L2 cycles.
-    Read - PCIe Stall: The number of cycles the L2-Fabric interface was stalled on
-      read requests to remote PCIe connected accelerators or CPUs as a percent of
-      the total active L2 cycles.
-    Read - Infinity Fabric Stall: The number of cycles the L2-Fabric interface was
-      stalled on read requests to remote Infinity Fabric connected accelerators or
-      CPUs as a percent of the total active L2 cycles.
-    Read - HBM Stall: The number of cycles the L2-Fabric interface was stalled on
-      read requests to the accelerator's local HBM as a percent of the total active
-      L2 cycles.
-    Write - PCIe Stall: The number of cycles the L2-Fabric interface was stalled on
-      write or atomic requests to remote PCIe connected accelerators or CPUs as a
-      percent of the total active L2 cycles.
-    Write - Infinity Fabric Stall: The number of cycles the L2-Fabric interface was
-      stalled on write or atomic requests to remote Infinity Fabric connected accelerators
-      or CPUs as a percent of the total active L2 cycles.
-    Write - HBM Stall: The number of cycles the L2-Fabric interface was stalled on
-      write or atomic requests to accelerator's local HBM as a percent of the total
-      active L2 cycles.
  data source:
  - metric_table:
      id: 1701
@@ -370,32 +158,32 @@ Panel Config:
          avg: AVG((TCC_REQ_sum / $denom))
          min: MIN((TCC_REQ_sum / $denom))
          max: MAX((TCC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read Req:
          avg: AVG((TCC_READ_sum / $denom))
          min: MIN((TCC_READ_sum / $denom))
          max: MAX((TCC_READ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Write Req:
          avg: AVG((TCC_WRITE_sum / $denom))
          min: MIN((TCC_WRITE_sum / $denom))
          max: MAX((TCC_WRITE_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Atomic Req:
          avg: AVG((TCC_ATOMIC_sum / $denom))
          min: MIN((TCC_ATOMIC_sum / $denom))
          max: MAX((TCC_ATOMIC_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Streaming Req:
          avg: AVG((TCC_STREAMING_REQ_sum / $denom))
          min: MIN((TCC_STREAMING_REQ_sum / $denom))
          max: MAX((TCC_STREAMING_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Probe Req:
          avg: AVG((TCC_PROBE_sum / $denom))
          min: MIN((TCC_PROBE_sum / $denom))
          max: MAX((TCC_PROBE_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Cache Hit:
          avg: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
            + TCC_MISS_sum) != 0) else None))
@@ -408,17 +196,17 @@ Panel Config:
          avg: AVG((TCC_HIT_sum / $denom))
          min: MIN((TCC_HIT_sum / $denom))
          max: MAX((TCC_HIT_sum / $denom))
-          unit: (Hits  + $normUnit)
+          unit: (Hits + $normUnit)
        Misses:
          avg: AVG((TCC_MISS_sum / $denom))
          min: MIN((TCC_MISS_sum / $denom))
          max: MAX((TCC_MISS_sum / $denom))
-          unit: (Misses  + $normUnit)
+          unit: (Misses + $normUnit)
        Writeback:
          avg: AVG((TCC_WRITEBACK_sum / $denom))
          min: MIN((TCC_WRITEBACK_sum / $denom))
          max: MAX((TCC_WRITEBACK_sum / $denom))
-          unit: (Cachelines  + $normUnit)
+          unit: (Cachelines + $normUnit)
        Writeback (Internal):
          avg: AVG((TCC_NORMAL_WRITEBACK_sum / $denom))
          min: MIN((TCC_NORMAL_WRITEBACK_sum / $denom))
@@ -443,22 +231,22 @@ Panel Config:
          avg: AVG((TCC_NC_REQ_sum / $denom))
          min: MIN((TCC_NC_REQ_sum / $denom))
          max: MAX((TCC_NC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        UC Req:
          avg: AVG((TCC_UC_REQ_sum / $denom))
          min: MIN((TCC_UC_REQ_sum / $denom))
          max: MAX((TCC_UC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        CC Req:
          avg: AVG((TCC_CC_REQ_sum / $denom))
          min: MIN((TCC_CC_REQ_sum / $denom))
          max: MAX((TCC_CC_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        RW Req:
          avg: AVG((TCC_RW_REQ_sum / $denom))
          min: MIN((TCC_RW_REQ_sum / $denom))
          max: MAX((TCC_RW_REQ_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
  - metric_table:
      id: 1704
      title: L2 Cache Stalls
@@ -507,54 +295,216 @@ Panel Config:
          avg: AVG((TCC_EA0_RDREQ_32B_sum / $denom))
          min: MIN((TCC_EA0_RDREQ_32B_sum / $denom))
          max: MAX((TCC_EA0_RDREQ_32B_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read (64B):
          avg: AVG(((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) / $denom))
          min: MIN(((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) / $denom))
          max: MAX(((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Read (Uncached):
          avg: AVG((TCC_EA0_RD_UNCACHED_32B_sum / $denom))
          min: MIN((TCC_EA0_RD_UNCACHED_32B_sum / $denom))
          max: MAX((TCC_EA0_RD_UNCACHED_32B_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        HBM Read:
          avg: AVG((TCC_EA0_RDREQ_DRAM_sum / $denom))
          min: MIN((TCC_EA0_RDREQ_DRAM_sum / $denom))
          max: MAX((TCC_EA0_RDREQ_DRAM_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Remote Read:
          avg: AVG((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
          min: MIN((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
          max: MAX((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Write and Atomic (32B):
          avg: AVG(MAX(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom), 0))
          min: MIN(MAX(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom), 0))
          max: MAX(MAX(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom), 0))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Write and Atomic (Uncached):
          avg: AVG((TCC_EA0_WR_UNCACHED_32B_sum / $denom))
          min: MIN((TCC_EA0_WR_UNCACHED_32B_sum / $denom))
          max: MAX((TCC_EA0_WR_UNCACHED_32B_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Write and Atomic (64B):
          avg: AVG((TCC_EA0_WRREQ_64B_sum / $denom))
          min: MIN((TCC_EA0_WRREQ_64B_sum / $denom))
          max: MAX((TCC_EA0_WRREQ_64B_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        HBM Write and Atomic:
          avg: AVG((TCC_EA0_WRREQ_DRAM_sum / $denom))
          min: MIN((TCC_EA0_WRREQ_DRAM_sum / $denom))
          max: MAX((TCC_EA0_WRREQ_DRAM_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Remote Write and Atomic:
          avg: AVG((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
          min: MIN((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
          max: MAX((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
        Atomic:
          avg: AVG((TCC_EA0_ATOMIC_sum / $denom))
          min: MIN((TCC_EA0_ATOMIC_sum / $denom))
          max: MAX((TCC_EA0_ATOMIC_sum / $denom))
-          unit: (Req  + $normUnit)
+          unit: (Req + $normUnit)
+  metrics_description:
+    Utilization: The ratio of the number of cycles an L2 channel was active, summed
+      over all L2 channels on the accelerator over the total L2 cycles.
+    Peak Bandwidth: The number of bytes looked up in the L2 cache, as a percent of
+      the peak theoretical bandwidth achievable on the specific accelerator. The number
+      of bytes is calculated as the number of cache lines requested multiplied by
+      the cache line size. This value does not consider partial requests, so e.g.,
+      if only a single value is requested in a cache line, the data movement will
+      still be counted as a full cache line.
+    Hit Rate: The ratio of the number of L2 cache line requests that hit in the L2
+      cache over the total number of incoming cache line requests to the L2 cache.
+    L2-Fabric Read BW: The number of bytes read by the L2 over the Infinity Fabric
+      interface per unit time.
+    L2-Fabric Write and Atomic BW: The number of bytes sent by the L2 over the Infinity
+      Fabric interface by write and atomic operations per unit time.
+    HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
+      memory (HBM) per unit time. This value is calculated as the number of HBM channels
+      multiplied by the HBM channel width multiplied by the HBM clock frequency.
+    Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided
+      by total duration.
+    HBM Read Traffic: The percent of read requests generated by the L2 cache that
+      are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
+      does not consider the size of the request (meaning that 32B and 64B requests
+      are both counted as a single request), so this metric only approximates the
+      percent of the L2-Fabric Read bandwidth directed to the local HBM.
+    Remote Read Traffic: The percent of read requests generated by the L2 cache that
+      are routed to any memory location other than the accelerator's local high-bandwidth
+      memory (HBM) - for example, the CPU's DRAM or a remote accelerator's HBM. This
+      breakdown does not consider the size of the request (meaning that 32B and 64B
+      requests are both counted as a single request), so this metric only approximates
+      the percent of the L2-Fabric Read bandwidth directed to a remote location.
+    Uncached Read Traffic: The percent of read requests generated by the L2 cache
+      that are reading from an uncached memory allocation. Note, as described in the
+      request flow section, a single 64B read request is typically counted as two
+      uncached read requests. So, it is possible for the Uncached Read Traffic to
+      reach up to 200% of the total number of read requests. This breakdown does not
+      consider the size of the request (i.e., 32B and 64B requests are both counted
+      as a single request), so this metric only approximates the percent of the L2-Fabric
+      read bandwidth directed to an uncached memory location.
+    Write and Atomic BW: The total number of bytes written by the L2 over Infinity
+      Fabric by write and atomic operations divided by total duration. Note that on
+      current CDNA accelerators, such as the MI2XX, requests are only considered atomic
+      by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
+      fine-grained memory allocations or uncached memory allocations on the MI2XX.
+    HBM Write and Atomic Traffic: The percent of write and atomic requests generated
+      by the L2 cache that are routed to the accelerator's local high-bandwidth memory
+      (HBM). This breakdown does not consider the size of the request (meaning that
+      32B and 64B requests are both counted as a single request), so this metric only
+      approximates the percent of the L2-Fabric Write and Atomic bandwidth directed
+      to the local HBM. Note that on current CDNA accelerators, such as the MI2XX,
+      requests are only considered atomic by Infinity Fabric if they are targeted
+      at fine-grained memory allocations or uncached memory allocations.
+    Remote Write and Atomic Traffic: The percent of write and atomic requests generated
+      by the L2 cache that are routed to any memory location other than the accelerator's
+      local high-bandwidth memory (HBM) - for example, the CPU's DRAM or a remote
+      accelerator's HBM. This breakdown does not consider the size of the request
+      (meaning that 32B and 64B requests are both counted as a single request), so
+      this metric only approximates the percent of the L2-Fabric Write and Atomic
+      bandwidth directed to a remote location. Note that on current CDNA accelerators,
+      such as the MI2XX, requests are only considered atomic by Infinity Fabric if
+      they are targeted at fine-grained memory allocations or uncached memory allocations.
+    Atomic Traffic: The percent of write requests generated by the L2 cache that are
+      atomic requests to any memory location. This breakdown does not consider the
+      size of the request (meaning that 32B and 64B requests are both counted as a
+      single request), so this metric only approximates the percent of the L2-Fabric
+      Write and Atomic bandwidth used by atomic requests. Note that on current CDNA
+      accelerators, such as the MI2XX, requests are only considered atomic by Infinity Fabric
+      if they are targeted at fine-grained memory allocations or uncached memory allocations.
+    Uncached Write and Atomic Traffic: The percent of write and atomic requests generated
+      by the L2 cache that are targeting uncached memory allocations. This breakdown
+      does not consider the size of the request (meaning that 32B and 64B requests
+      are both counted as a single request), so this metric only approximates the
+      percent of the L2-Fabric Write and Atomic bandwidth directed to uncached memory allocations.
+    Read Latency: The time-averaged number of cycles read requests spent in Infinity
+      Fabric before data was returned to the L2.
+    Write and Atomic Latency: The time-averaged number of cycles write requests spent
+      in Infinity Fabric before a completion acknowledgement was returned to the L2.
+    Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
+      Fabric before a completion acknowledgement (atomic without return value) or
+      data (atomic with return value) was returned to the L2.
+    Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration.
+      The number of bytes is calculated as the number of cache lines requested multiplied
+      by the cache line size. This value does not consider partial requests, so for
+      example, if only a single value is requested in a cache line, the data movement
+      will still be counted as a full cache line.
+    Req: The total number of incoming requests to the L2 from all clients for all
+      request types, per normalization unit.
+    Read Req: The total number of read requests to the L2 from all clients.
+    Write Req: The total number of write requests to the L2 from all clients.
+    Atomic Req: The total number of atomic requests (with and without return) to the
+      L2 from all clients.
+    Streaming Req: The total number of incoming requests to the L2 that are marked
+      as streaming. The exact meaning of this may differ depending on the targeted
+      accelerator; however, on an MI2XX this corresponds to non-temporal loads or stores.
+      The L2 cache attempts to evict streaming requests before normal requests when
+      the L2 is at capacity.
+    Probe Req: The number of coherence probe requests made to the L2 cache from outside
+      the accelerator. On an MI2XX, probe requests may be generated by, for example,
+      writes to fine-grained device memory or by writes to coarse-grained device memory.
+    Cache Hit: The ratio of the number of L2 cache line requests that hit in the L2
+      cache over the total number of incoming cache line requests to the L2 cache.
+    Hits: The total number of requests to the L2 from all clients that hit in the
+      cache. As noted in the Speed-of-Light section, this includes hit-on-miss requests.
+    Misses: The total number of requests to the L2 from all clients that miss in the
+      cache. As noted in the Speed-of-Light section, these do not include hit-on-miss
+      requests.
+    Writeback: The total number of L2 cache lines written back to memory for any reason.
+      Write-backs may occur due to user code (such as HIP kernel calls to __threadfence_system
+      or atomic built-ins), by the command processor's memory acquire/release fences,
+      or for other internal hardware reasons.
+    Writeback (Internal): The total number of L2 cache lines written back to memory
+      for internal hardware reasons, per normalization unit.
+    Writeback (vL1D Req): The total number of L2 cache lines written back to memory
+      due to requests initiated by the vL1D cache, per normalization unit.
+    Evict (Internal): The total number of L2 cache lines evicted from the cache due
+      to capacity limits, per normalization unit.
+    Evict (vL1D Req): The total number of L2 cache lines evicted from the cache due
+      to invalidation requests initiated by the vL1D cache, per normalization unit.
+    NC Req: The total number of requests to the L2 that go to Not-hardware-Coherent
+      (NC) memory allocations, per normalization unit.
+    UC Req: The total number of requests to the L2 that go to Uncached (UC) memory
+      allocations.
+    CC Req: The total number of requests to the L2 that go to Coherently Cacheable
+      (CC) memory allocations.
+    RW Req: The total number of requests to the L2 that go to Read-Write (RW) coherent
+      memory allocations.
+    Write - Credit Starvation: The number of cycles the L2-Fabric interface was stalled
+      on write or atomic requests to any memory location because too many write/atomic
+      requests were currently in flight, as a percent of the total active L2 cycles.
+    Read (32B): The total number of L2 requests to Infinity Fabric to read 32B of
+      data from any memory location, per normalization unit.
+    Read (64B): The total number of L2 requests to Infinity Fabric to read 64B of
+      data from any memory location, per normalization unit.
+    Read (Uncached): The total number of L2 requests to Infinity Fabric to read uncached
+      data from any memory location, per normalization unit. 64B requests for uncached
+      data are counted as two 32B uncached data requests.
+    HBM Read: The total number of L2 requests to Infinity Fabric to read 32B or 64B
+      of data from the accelerator's local HBM, per normalization unit.
+    Remote Read: The total number of L2 requests to Infinity Fabric to read 32B or
+      64B of data from any source other than the accelerator's local HBM, per normalization
+      unit.
+    Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
+      write or atomically update 32B of data to any memory location, per normalization
+      unit.
+    Write and Atomic (Uncached): The total number of L2 requests to Infinity Fabric
+      to write or atomically update 32B or 64B of uncached data, per normalization
+      unit.
+    Write and Atomic (64B): The total number of L2 requests to Infinity Fabric to
+      write or atomically update 64B of data in any memory location, per normalization
+      unit.
+    HBM Write and Atomic: The total number of L2 requests to Infinity Fabric to write
+      or atomically update 32B or 64B of data in the accelerator's local HBM, per
+      normalization unit.
+    Remote Write and Atomic: The total number of L2 requests to Infinity Fabric to
+      write or atomically update 32B or 64B of data in any memory location other than
+      the accelerator's local HBM, per normalization unit.
+    Atomic: The total number of L2 requests to Infinity Fabric to atomically update
+      32B or 64B of data in any memory location, per normalization unit. See Request
+      flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
+      requests are only considered atomic by Infinity Fabric if they are targeted
+      at non-write-cacheable memory, such as fine-grained memory allocations or uncached
+      memory allocations on the MI2XX.
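The Bandwidth and Cache Hit descriptions above can be read as simple arithmetic. The following is a minimal Python sketch of that reading, not the profiler's implementation: the function names, the 128-byte line size, and the use of hits plus misses as the request total are all assumptions made for illustration.

```python
from typing import Optional

# Assumed L2 cache line size in bytes; the real value depends on the accelerator.
ASSUMED_LINE_BYTES = 128


def l2_bandwidth(cache_lines_requested: int, duration_ns: float,
                 line_bytes: int = ASSUMED_LINE_BYTES) -> float:
    """Bytes looked up in the L2 per nanosecond of total duration.

    Every requested cache line is counted in full, even if the kernel
    only needed a single value from it (partial requests are ignored).
    """
    return cache_lines_requested * line_bytes / duration_ns


def l2_hit_ratio(hits: int, misses: int) -> Optional[float]:
    """Hit ratio over all incoming cache line requests.

    Assumes every request is either a hit (including hit-on-miss) or a miss.
    Returns None when there were no requests, mirroring the
    `... if (x != 0) else None` guards used in the metric expressions.
    """
    total = hits + misses
    return hits / total if total != 0 else None


if __name__ == "__main__":
    # One 4-byte load still counts as one full (assumed 128-byte) line.
    print(l2_bandwidth(cache_lines_requested=1, duration_ns=1.0))
    print(l2_hit_ratio(hits=3, misses=1))
```

Under these assumptions, a kernel that touches one value per cache line reports the same L2 bandwidth as one that consumes every byte of each line, which is why this metric alone cannot distinguish efficient from wasteful access patterns.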
@@ -2,10 +2,6 @@
 Panel Config:
  id: 1800
  title: L2 Cache (per Channel)
-  metrics_description:
-    L2 Cache Hit Rate: The percent of total number of requests to the L2 from all
-      clients that hit in the cache. As noted in the Speed-of-Light section, this
-      includes hit-on-miss requests.
  data source:
  - metric_table:
      id: 1801
@@ -249,3 +245,7 @@ Panel Config:
          ::_1: $total_l2_chan
      cli_style: simple_box
      tui_style: simple_box
+  metrics_description:
+    L2 Cache Hit Rate: The percent of total number of requests to the L2 from all
+      clients that hit in the cache. As noted in the Speed-of-Light section, this
+      includes hit-on-miss requests.
@@ -2,10 +2,10 @@
 Panel Config:
  id: 2100
  title: PC Sampling
-  metrics_description: {}
  data source:
  - pc_sampling_table:
      id: 2101
      title: PC Sampling
      source: ps_file
      comparable: false
+  metrics_description: {}
@@ -0,0 +1,763 @@
+# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated by tools/config_management/generate_config_deltas.py
+Addition:
+  - Panel Config:
+      id: 200
+      title: System Speed-of-Light
+    metric_tables:
+      - metric_table:
+          id: 201
+          title: System Speed-of-Light
+          metrics:
+            - MFMA FLOPs (F6F4):
+                value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) / (End_Timestamp - Start_Timestamp)))
+                unit: GFLOP/s
+                peak: ((($max_sclk * $cu_per_gpu) * 16384) / 1000)
+                pop: |
+                  ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 16384) / 1000))
+  - Panel Config:
+      id: 300
+      title: Memory Chart
+    metric_tables:
+      - metric_table:
+          id: 301
+          title: Memory Chart
+          metrics:
+            - L2 Rd Lat:
+                value: |
+                  ROUND(AVG(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)) if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else None)), 0)
+            - L2 Wr Lat:
+                value: |
+                  ROUND(AVG(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0) else None)), 0)
+  - Panel Config:
+      id: 400
+      title: Roofline
+    metric_tables:
+      - metric_table:
+          id: 401
+          title: Roofline Performance Rates
+          metrics:
+            - MFMA FLOPs (F6F4):
+                value: |
+                  AVG((((SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+                unit: GFLOP/s
+                peak: $MFMA_FLOPs_F6F4_empirical_peak
+  - Panel Config:
+      id: 500
+      title: Command Processor (CPC/CPF)
+    metric_tables:
+      - metric_table:
+          id: 502
+          title: Command processor packet processor (CPC)
+          metrics:
+            - CPC SYNC FIFO Full Rate:
+                avg: |
+                  AVG((100 * CPC_SYNC_FIFO_FULL) / CPC_SYNC_WRREQ_FIFO_BUSY if (CPC_SYNC_WRREQ_FIFO_BUSY != 0) else None)
+                min: |
+                  MIN((100 * CPC_SYNC_FIFO_FULL) / CPC_SYNC_WRREQ_FIFO_BUSY if (CPC_SYNC_WRREQ_FIFO_BUSY != 0) else None)
+                max: |
+                  MAX((100 * CPC_SYNC_FIFO_FULL) / CPC_SYNC_WRREQ_FIFO_BUSY if (CPC_SYNC_WRREQ_FIFO_BUSY != 0) else None)
+                unit: pct
+            - CPC ADC Utilization:
+                avg: AVG((100 * CPC_TG_SEND) / CPC_GD_BUSY if (CPC_GD_BUSY != 0) else None)
+                min: MIN((100 * CPC_TG_SEND) / CPC_GD_BUSY if (CPC_GD_BUSY != 0) else None)
+                max: MAX((100 * CPC_TG_SEND) / CPC_GD_BUSY if (CPC_GD_BUSY != 0) else None)
+                unit: pct
+            - CPC CANE Stall Rate:
+                avg: AVG((100 * CPC_CANE_STALL) / CPC_CANE_BUSY if (CPC_CANE_BUSY != 0) else None)
+                min: MIN((100 * CPC_CANE_STALL) / CPC_CANE_BUSY if (CPC_CANE_BUSY != 0) else None)
+                max: MAX((100 * CPC_CANE_STALL) / CPC_CANE_BUSY if (CPC_CANE_BUSY != 0) else None)
+                unit: pct
+  - Panel Config:
+      id: 600
+      title: Workgroup Manager (SPI)
+    metric_tables:
+      - metric_table:
+          id: 601
+          title: Workgroup manager utilizations
+          metrics:
+            - Scheduler-Pipe Wave Occupancy:
+                avg: |
+                  AVG(SPI_CSQ_P0_OCCUPANCY + SPI_CSQ_P1_OCCUPANCY + SPI_CSQ_P2_OCCUPANCY + SPI_CSQ_P3_OCCUPANCY)
+                min: |
+                  MIN(SPI_CSQ_P0_OCCUPANCY + SPI_CSQ_P1_OCCUPANCY + SPI_CSQ_P2_OCCUPANCY + SPI_CSQ_P3_OCCUPANCY)
+                max: |
+                  MAX(SPI_CSQ_P0_OCCUPANCY + SPI_CSQ_P1_OCCUPANCY + SPI_CSQ_P2_OCCUPANCY + SPI_CSQ_P3_OCCUPANCY)
+                unit: Wave
+            - Scheduler-Pipe Wave Utilization:
+                avg: |
+                  AVG(100 * (SPI_CSC_WAVE_CNT_BUSY) / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
+                min: |
+                  MIN(100 * (SPI_CSC_WAVE_CNT_BUSY) / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
+                max: |
+                  MAX(100 * (SPI_CSC_WAVE_CNT_BUSY) / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
+                unit: Pct
+      - metric_table:
+          id: 602
+          title: Workgroup Manager - Resource Allocation
+          metrics:
+            - Scheduler-Pipe FIFO Full Rate:
+                avg: |
+                  AVG((100 * (SPI_CS0_CRAWLER_STALL + SPI_CS1_CRAWLER_STALL + SPI_CS2_CRAWLER_STALL + SPI_CS3_CRAWLER_STALL) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
+                min: |
+                  MIN((100 * (SPI_CS0_CRAWLER_STALL + SPI_CS1_CRAWLER_STALL + SPI_CS2_CRAWLER_STALL + SPI_CS3_CRAWLER_STALL) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
+                max: |
+                  MAX((100 * (SPI_CS0_CRAWLER_STALL + SPI_CS1_CRAWLER_STALL + SPI_CS2_CRAWLER_STALL + SPI_CS3_CRAWLER_STALL) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
+                unit: Pct
+  - Panel Config:
+      id: 1000
+      title: Compute Units - Instruction Mix
+    metric_tables:
+      - metric_table:
+          id: 1003
+          title: VMEM Instruction Mix
+          metrics:
+            - Spill/Stack Coalesceable Instr:
+                avg: AVG((TA_BUFFER_COALESCEABLE_WAVEFRONTS_sum / $denom))
+                min: MIN((TA_BUFFER_COALESCEABLE_WAVEFRONTS_sum / $denom))
+                max: MAX((TA_BUFFER_COALESCEABLE_WAVEFRONTS_sum / $denom))
+                unit: (instr + $normUnit)
+      - metric_table:
+          id: 1004
+          title: MFMA Arithmetic Instruction Mix
+          metrics:
+            - MFMA-F6F4:
+                avg: AVG((SQ_INSTS_VALU_MFMA_F6F4 / $denom))
+                min: MIN((SQ_INSTS_VALU_MFMA_F6F4 / $denom))
+                max: MAX((SQ_INSTS_VALU_MFMA_F6F4 / $denom))
+                unit: (instr + $normUnit)
+  - Panel Config:
+      id: 1100
+      title: Compute Units - Compute Pipeline
+    metric_tables:
+      - metric_table:
+          id: 1101
+          title: Compute Speed-of-Light
+          metrics:
+            - MFMA FLOPs (F6F4):
+                value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) / (End_Timestamp - Start_Timestamp)))
+                unit: GFLOP
+                peak: ((($max_sclk * $cu_per_gpu) * 16384) / 1000)
+                pop: |
+                  ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 16384) / 1000))
+      - metric_table:
+          id: 1102
+          title: Pipeline Statistics
+          metrics:
+            - VALU Co-Issue Efficiency:
+                avg: AVG((100 * SQ_ACTIVE_INST_VALU2) / (SQ_ACTIVE_INST_VALU - SQ_ACTIVE_INST_VALU2))
+                min: MIN((100 * SQ_ACTIVE_INST_VALU2) / (SQ_ACTIVE_INST_VALU - SQ_ACTIVE_INST_VALU2))
+                max: MAX((100 * SQ_ACTIVE_INST_VALU2) / (SQ_ACTIVE_INST_VALU - SQ_ACTIVE_INST_VALU2))
+                unit: pct
+      - metric_table:
+          id: 1103
+          title: Arithmetic Operations
+          metrics:
+            - F6F4 OPs:
+                avg: AVG((512 * SQ_INSTS_VALU_MFMA_MOPS_F6F4) / $denom)
+                min: MIN((512 * SQ_INSTS_VALU_MFMA_MOPS_F6F4) / $denom)
+                max: MAX((512 * SQ_INSTS_VALU_MFMA_MOPS_F6F4) / $denom)
+                unit: (OPs + $normUnit)
+  - Panel Config:
+      id: 1200
+      title: Local Data Share (LDS)
+    metric_tables:
+      - metric_table:
+          id: 1202
+          title: LDS Statistics
+          metrics:
+            - LDS ATOMIC Bandwidth:
+                avg: AVG(64 * SQ_INSTS_LDS_ATOMIC_BANDWIDTH / (End_Timestamp - Start_Timestamp))
+                min: MIN(64 * SQ_INSTS_LDS_ATOMIC_BANDWIDTH / (End_Timestamp - Start_Timestamp))
+                max: MAX(64 * SQ_INSTS_LDS_ATOMIC_BANDWIDTH / (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+            - LDS LOAD:
+                avg: AVG((SQ_INSTS_LDS_LOAD / $denom))
+                min: MIN((SQ_INSTS_LDS_LOAD / $denom))
+                max: MAX((SQ_INSTS_LDS_LOAD / $denom))
+                unit: (instr + $normUnit)
+            - LDS STORE:
+                avg: AVG((SQ_INSTS_LDS_STORE / $denom))
+                min: MIN((SQ_INSTS_LDS_STORE / $denom))
+                max: MAX((SQ_INSTS_LDS_STORE / $denom))
+                unit: (instr + $normUnit)
+            - LDS STORE Bandwidth:
+                avg: AVG(64 * SQ_INSTS_LDS_STORE_BANDWIDTH / (End_Timestamp - Start_Timestamp))
+                min: MIN(64 * SQ_INSTS_LDS_STORE_BANDWIDTH / (End_Timestamp - Start_Timestamp))
+                max: MAX(64 * SQ_INSTS_LDS_STORE_BANDWIDTH / (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+            - LDS LOAD Bandwidth:
+                avg: AVG(64 * SQ_INSTS_LDS_LOAD_BANDWIDTH / (End_Timestamp - Start_Timestamp))
+                min: MIN(64 * SQ_INSTS_LDS_LOAD_BANDWIDTH / (End_Timestamp - Start_Timestamp))
+                max: MAX(64 * SQ_INSTS_LDS_LOAD_BANDWIDTH / (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+            - LDS Command FIFO Full Rate:
+                avg: AVG((SQ_LDS_CMD_FIFO_FULL / $denom))
+                min: MIN((SQ_LDS_CMD_FIFO_FULL / $denom))
+                max: MAX((SQ_LDS_CMD_FIFO_FULL / $denom))
+                unit: (Cycles + $normUnit)
+            - LDS ATOMIC:
+                avg: AVG((SQ_INSTS_LDS_ATOMIC / $denom))
+                min: MIN((SQ_INSTS_LDS_ATOMIC / $denom))
+                max: MAX((SQ_INSTS_LDS_ATOMIC / $denom))
+                unit: (instr + $normUnit)
+            - LDS Data FIFO Full Rate:
+                avg: AVG((SQ_LDS_DATA_FIFO_FULL / $denom))
+                min: MIN((SQ_LDS_DATA_FIFO_FULL / $denom))
+                max: MAX((SQ_LDS_DATA_FIFO_FULL / $denom))
+                unit: (Cycles + $normUnit)
+  - Panel Config:
+      id: 1500
+      title: Address Processing Unit and Data Return Path (TA/TD)
+    metric_tables:
+      - metric_table:
+          id: 1504
+          title: Vector L1 data-return path or Texture Data (TD)
+          metrics:
+            - Write Ack Instructions:
+                avg: AVG((TD_WRITE_ACKT_WAVEFRONT_sum / $denom))
+                min: MIN((TD_WRITE_ACKT_WAVEFRONT_sum / $denom))
+                max: MAX((TD_WRITE_ACKT_WAVEFRONT_sum / $denom))
+                unit: (Instructions + $normUnit)
+      - metric_table:
+          id: 1502
+          title: Instruction counts
+          metrics:
+            - Global/Generic Read Instructions for LDS:
+                avg: AVG((TA_FLAT_READ_LDS_WAVEFRONTS_sum / $denom))
+                min: MIN((TA_FLAT_READ_LDS_WAVEFRONTS_sum / $denom))
+                max: MAX((TA_FLAT_READ_LDS_WAVEFRONTS_sum / $denom))
+                unit: (Instructions + $normUnit)
+            - Spill/Stack Read Instructions for LDS:
+                avg: AVG((TA_BUFFER_READ_LDS_WAVEFRONTS_sum / $denom))
+                min: MIN((TA_BUFFER_READ_LDS_WAVEFRONTS_sum / $denom))
+                max: MAX((TA_BUFFER_READ_LDS_WAVEFRONTS_sum / $denom))
+                unit: (Instructions + $normUnit)
+  - Panel Config:
+      id: 1600
+      title: Vector L1 Data Cache
+    metric_tables:
+      - metric_table:
+          id: 1602
+          title: vL1D cache stall metrics
+          metrics:
+            - Stalled on Request FIFO:
+                expr: |
+                  (((100 * TCP_RFIFO_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum != 0) else None)
+            - Stalled on Latency FIFO:
+                expr: |
+                  (((100 * TCP_LFIFO_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum != 0) else None)
+            - Stalled on Address:
+                expr: |
+                  (((100 * TCP_TCP_TA_ADDR_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum != 0) else None)
+            - Stalled on Read Return:
+                expr: |
+                  (((100 * TCP_TCR_RDRET_STALL_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum != 0) else None)
+            - Stalled on Data:
+                expr: |
+                  (((100 * TCP_TCP_TA_DATA_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum != 0) else None)
+      - metric_table:
+          id: 1603
+          title: vL1D cache access metrics
+          metrics:
+            - Tag RAM 2 Req:
+                avg: AVG((TCP_TAGRAM2_REQ_sum / $denom))
+                min: MIN((TCP_TAGRAM2_REQ_sum / $denom))
+                max: MAX((TCP_TAGRAM2_REQ_sum / $denom))
+                unit: (Req + $normUnit)
+            - Tag RAM 0 Req:
+                avg: AVG((TCP_TAGRAM0_REQ_sum / $denom))
+                min: MIN((TCP_TAGRAM0_REQ_sum / $denom))
+                max: MAX((TCP_TAGRAM0_REQ_sum / $denom))
+                unit: (Req + $normUnit)
+            - Tag RAM 3 Req:
+                avg: AVG((TCP_TAGRAM3_REQ_sum / $denom))
+                min: MIN((TCP_TAGRAM3_REQ_sum / $denom))
+                max: MAX((TCP_TAGRAM3_REQ_sum / $denom))
+                unit: (Req + $normUnit)
+            - Tag RAM 1 Req:
+                avg: AVG((TCP_TAGRAM1_REQ_sum / $denom))
+                min: MIN((TCP_TAGRAM1_REQ_sum / $denom))
+                max: MAX((TCP_TAGRAM1_REQ_sum / $denom))
+                unit: (Req + $normUnit)
+            - L1 Access Latency:
+                avg: AVG((TCP_TCP_LATENCY_sum / $denom))
+                min: MIN((TCP_TCP_LATENCY_sum / $denom))
+                max: MAX((TCP_TCP_LATENCY_sum / $denom))
+                unit: (Cycles + $normUnit)
+            - L1-L2 Read Latency:
+                avg: AVG((TCP_TCC_READ_REQ_LATENCY_sum / $denom))
+                min: MIN((TCP_TCC_READ_REQ_LATENCY_sum / $denom))
+                max: MAX((TCP_TCC_READ_REQ_LATENCY_sum / $denom))
+                unit: (Cycles + $normUnit)
+            - L1-L2 Write Latency:
+                avg: AVG((TCP_TCC_WRITE_REQ_LATENCY_sum / $denom))
+                min: MIN((TCP_TCC_WRITE_REQ_LATENCY_sum / $denom))
+                max: MAX((TCP_TCC_WRITE_REQ_LATENCY_sum / $denom))
+                unit: (Cycles + $normUnit)
+      - metric_table:
+          id: 1605
+          title: L1 Unified Translation Cache (UTCL1)
+          metrics:
+            - Misses under Translation Miss:
+                avg: AVG((TCP_UTCL1_TRANSLATION_MISS_UNDER_MISS_sum / $denom))
+                min: MIN((TCP_UTCL1_TRANSLATION_MISS_UNDER_MISS_sum / $denom))
+                max: MAX((TCP_UTCL1_TRANSLATION_MISS_UNDER_MISS_sum / $denom))
+                units: (Req + $normUnit)
+            - Inflight Req:
+                avg: AVG((TCP_CLIENT_UTCL1_INFLIGHT_sum / $denom))
+                min: MIN((TCP_CLIENT_UTCL1_INFLIGHT_sum / $denom))
+                max: MAX((TCP_CLIENT_UTCL1_INFLIGHT_sum / $denom))
+                units: (Req + $normUnit)
+      - metric_table:
+          id: 1606
+          title: L1D Addr Translation Stalls
+          metrics:
+            - Serialization Stall:
+                avg: AVG((TCP_UTCL1_SERIALIZATION_STALL_sum / $denom))
+                min: MIN((TCP_UTCL1_SERIALIZATION_STALL_sum / $denom))
+                max: MAX((TCP_UTCL1_SERIALIZATION_STALL_sum / $denom))
+                units: (Cycles + $normUnit)
+            - Cache Full Stall:
+                avg: AVG((TCP_UTCL1_STALL_INFLIGHT_MAX_sum / $denom))
+                min: MIN((TCP_UTCL1_STALL_INFLIGHT_MAX_sum / $denom))
+                max: MAX((TCP_UTCL1_STALL_INFLIGHT_MAX_sum / $denom))
+                units: (Cycles + $normUnit)
+            - Resident Page Full Stall:
+                avg: AVG((TCP_UTCL1_STALL_LFIFO_NO_RES_sum / $denom))
+                min: MIN((TCP_UTCL1_STALL_LFIFO_NO_RES_sum / $denom))
+                max: MAX((TCP_UTCL1_STALL_LFIFO_NO_RES_sum / $denom))
+                units: (Cycles + $normUnit)
+            - UTCL2 Stall:
+                avg: AVG((TCP_UTCL1_STALL_UTCL2_REQ_OUT_OF_CREDITS_sum / $denom))
+                min: MIN((TCP_UTCL1_STALL_UTCL2_REQ_OUT_OF_CREDITS_sum / $denom))
+                max: MAX((TCP_UTCL1_STALL_UTCL2_REQ_OUT_OF_CREDITS_sum / $denom))
+                units: (Cycles + $normUnit)
+            - Latency FIFO Stall:
+                avg: AVG((TCP_UTCL1_LFIFO_FULL_sum / $denom))
+                min: MIN((TCP_UTCL1_LFIFO_FULL_sum / $denom))
+                max: MAX((TCP_UTCL1_LFIFO_FULL_sum / $denom))
+                units: (Cycles + $normUnit)
+            - Thrashing Stall:
+                avg: AVG((TCP_UTCL1_THRASHING_STALL_sum / $denom))
+                min: MIN((TCP_UTCL1_THRASHING_STALL_sum / $denom))
+                max: MAX((TCP_UTCL1_THRASHING_STALL_sum / $denom))
+                units: (Cycles + $normUnit)
+            - Cache Miss Stall:
+                avg: AVG((TCP_UTCL1_STALL_MULTI_MISS_sum / $denom))
+                min: MIN((TCP_UTCL1_STALL_MULTI_MISS_sum / $denom))
+                max: MAX((TCP_UTCL1_STALL_MULTI_MISS_sum / $denom))
+                units: (Cycles + $normUnit)
+  - Panel Config:
+      id: 1700
+      title: L2 Cache
+    metric_tables:
+      - metric_table:
+          id: 1702
+          title: L2-Fabric interface metrics
+          metrics:
+            - Read Stall:
+                avg: |
+                  AVG((((100 * ((TCC_EA0_RDREQ_IO_CREDIT_STALL_sum + TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum) + TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum)) / TCC_BUSY_sum) if (TCC_BUSY_sum != 0) else None))
+                min: |
+                  MIN((((100 * ((TCC_EA0_RDREQ_IO_CREDIT_STALL_sum + TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum) + TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum)) / TCC_BUSY_sum) if (TCC_BUSY_sum != 0) else None))
+                max: |
+                  MAX((((100 * ((TCC_EA0_RDREQ_IO_CREDIT_STALL_sum + TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum) + TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum)) / TCC_BUSY_sum) if (TCC_BUSY_sum != 0) else None))
+                unit: pct
+            - Write Stall:
+                avg: |
+                  AVG(((100 * (TCC_EA0_WRREQ_STALL_sum) / TCC_BUSY_sum) if (TCC_BUSY_sum != 0) else None))
+                min: |
+                  MIN(((100 * (TCC_EA0_WRREQ_STALL_sum) / TCC_BUSY_sum) if (TCC_BUSY_sum != 0) else None))
+                max: |
+                  MAX(((100 * (TCC_EA0_WRREQ_STALL_sum) / TCC_BUSY_sum) if (TCC_BUSY_sum != 0) else None))
+                unit: pct
+      - metric_table:
+          id: 1703
+          title: L2 Cache Accesses
+          metrics:
+            - Atomic Bandwidth:
+                avg: AVG(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+                min: MIN(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+                max: MAX(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+            - Input Buffer Req:
+                avg: AVG((TCC_IB_REQ_sum / $denom))
+                min: MIN((TCC_IB_REQ_sum / $denom))
+                max: MAX((TCC_IB_REQ_sum / $denom))
+                unit: (Req + $normUnit)
+            - Write Bandwidth:
+                avg: AVG(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+                min: MIN(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+                max: MAX(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+            - Read Bandwidth:
+                avg: AVG(TCC_READ_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+                min: MIN(TCC_READ_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+                max: MAX(TCC_READ_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+            - Bypass Req:
+                avg: AVG((TCC_BYPASS_REQ_sum / $denom))
+                min: MIN((TCC_BYPASS_REQ_sum / $denom))
+                max: MAX((TCC_BYPASS_REQ_sum / $denom))
+                unit: (Req + $normUnit)
+      - metric_table:
+          id: 1704
+          title: L2 Cache Stalls
+          metrics:
+            - Input Buffer Stalled on L2:
+                avg: AVG(TCC_IB_STALL_sum / $denom)
+                min: MIN(TCC_IB_STALL_sum / $denom)
+                max: MAX(TCC_IB_STALL_sum / $denom)
+                unit: (Cycles + $normUnit)
+            - Stalled on Latency FIFO:
+                avg: AVG(TCC_LATENCY_FIFO_FULL_sum / $denom)
+                min: MIN(TCC_LATENCY_FIFO_FULL_sum / $denom)
+                max: MAX(TCC_LATENCY_FIFO_FULL_sum / $denom)
+                unit: (Cycles + $normUnit)
+            - Stalled on Write Data FIFO:
+                avg: AVG(TCC_SRC_FIFO_FULL_sum / $denom)
+                min: MIN(TCC_SRC_FIFO_FULL_sum / $denom)
+                max: MAX(TCC_SRC_FIFO_FULL_sum / $denom)
+                unit: (Cycles + $normUnit)
+      - metric_table:
+          id: 1705
+          title: L2 - Fabric Interface stalls
+          metrics:
+            - Read - HBM Stall:
+                type: HBM Stall
+                transaction: Read
+                avg: |
+                  AVG(((100 * (TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                min: |
+                  MIN(((100 * (TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                max: |
+                  MAX(((100 * (TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                unit: pct
+            - Read - Infinity Fabric Stall:
+                type: Infinity Fabric™ Stall
+                transaction: Read
+                avg: |
+                  AVG(((100 * (TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                min: |
+                  MIN(((100 * (TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                max: |
+                  MAX(((100 * (TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                unit: pct
+            - Write - PCIe Stall:
+                type: PCIe Stall
+                transaction: Write
+                avg: |
+                  AVG(((100 * (TCC_EA0_WRREQ_IO_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                min: |
+                  MIN(((100 * (TCC_EA0_WRREQ_IO_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                max: |
+                  MAX(((100 * (TCC_EA0_WRREQ_IO_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                unit: pct
+            - Read - PCIe Stall:
+                type: PCIe Stall
+                transaction: Read
+                avg: |
+                  AVG(((100 * (TCC_EA0_RDREQ_IO_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                min: |
+                  MIN(((100 * (TCC_EA0_RDREQ_IO_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                max: |
+                  MAX(((100 * (TCC_EA0_RDREQ_IO_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                unit: pct
+            - Write - Infinity Fabric Stall:
+                type: Infinity Fabric™ Stall
+                transaction: Write
+                avg: |
+                  AVG(((100 * (TCC_EA0_WRREQ_GMI_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                min: |
+                  MIN(((100 * (TCC_EA0_WRREQ_GMI_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                max: |
+                  MAX(((100 * (TCC_EA0_WRREQ_GMI_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                unit: pct
+            - Write - HBM Stall:
+                type: HBM Stall
+                transaction: Write
+                avg: |
+                  AVG(((100 * (TCC_EA0_WRREQ_DRAM_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                min: |
+                  MIN(((100 * (TCC_EA0_WRREQ_DRAM_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                max: |
+                  MAX(((100 * (TCC_EA0_WRREQ_DRAM_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
+                unit: pct
+      - metric_table:
+          id: 1706
+          title: L2 - Fabric interface detailed metrics
+          metrics:
+            - Write Bandwidth - HBM:
+                avg: AVG(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                min: MIN(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                max: MAX(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+            - Read (128B):
+                avg: AVG((TCC_EA0_RDREQ_128B_sum / $denom))
+                min: MIN((TCC_EA0_RDREQ_128B_sum / $denom))
+                max: MAX((TCC_EA0_RDREQ_128B_sum / $denom))
+                unit: (Req + $normUnit)
+            - Atomic - HBM:
+                avg: AVG((TCC_EA0_WRREQ_ATOMIC_DRAM_sum / $denom))
+                min: MIN((TCC_EA0_WRREQ_ATOMIC_DRAM_sum / $denom))
+                max: MAX((TCC_EA0_WRREQ_ATOMIC_DRAM_sum / $denom))
+                unit: (Req + $normUnit)
+            - Read Bandwidth - PCIe:
+                avg: AVG(TCC_EA0_RDREQ_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                min: MIN(TCC_EA0_RDREQ_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                max: MAX(TCC_EA0_RDREQ_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+            - Atomic Bandwidth - HBM:
+                avg: AVG(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                min: MIN(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                max: MAX(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+            - Read Bandwidth - Infinity Fabric™:
+                avg: AVG(TCC_EA0_RDREQ_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                min: MIN(TCC_EA0_RDREQ_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                max: MAX(TCC_EA0_RDREQ_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+            - Write Bandwidth - PCIe:
+                avg: AVG(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                min: MIN(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                max: MAX(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+            - Atomic Bandwidth - PCIe:
+                avg: AVG(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                min: MIN(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                max: MAX(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+            - Write Bandwidth - Infinity Fabric™:
+                avg: AVG(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                min: MIN(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                max: MAX(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+            - Atomic Bandwidth - Infinity Fabric™:
+                avg: AVG(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                min: MIN(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                max: MAX(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+            - Read Bandwidth - HBM:
+                avg: AVG(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                min: MIN(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                max: MAX(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+                unit: Gbps
+
+Deletion:
+  []
+
+Modification:
+  - Panel Config:
+      id: 200
+      title: System Speed-of-Light
+    metric_tables:
+      - metric_table:
+          id: 201
+          title: System Speed-of-Light
+          metrics:
+            - MFMA IOPs (Int8):
+                peak: ((($max_sclk * $cu_per_gpu) * 8192) / 1000)
+                pop: |
+                  ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 8192) / 1000))
+            - MFMA FLOPs (F16):
+                peak: ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
+                pop: |
+                  ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 4096) / 1000))
+            - MFMA FLOPs (F8):
+                peak: ((($max_sclk * $cu_per_gpu) * 8192) / 1000)
+                unit: GFLOP/s
+                pop: |
+                  ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 8192) / 1000))
+            - MFMA FLOPs (F64):
+                peak: ((($max_sclk * $cu_per_gpu) * 128) / 1000)
+                pop: |
+                  ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 128) / 1000))
+            - MFMA FLOPs (BF16):
+                peak: ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
+                pop: |
+                  ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 4096) / 1000))
+  - Panel Config:
+      id: 300
+      title: Memory Chart
+    metric_tables:
+      - metric_table:
+          id: 301
+          title: Memory Chart
+          metrics:
+            - Workgroups:
+                value: |
+                  ROUND(AVG(SPI_CS0_NUM_THREADGROUPS + SPI_CS1_NUM_THREADGROUPS + SPI_CS2_NUM_THREADGROUPS + SPI_CS3_NUM_THREADGROUPS), 0)
+            - Wavefronts:
+                value: ROUND(AVG(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE), 0)
+  - Panel Config:
+      id: 400
+      title: Roofline
+    metric_tables:
+      - metric_table:
+          id: 402
+          title: Roofline Plot Points
+          metrics:
+            - AI L2:
+                value: |
+                  ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / SUM( (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64 ) )
+            - AI HBM:
+                value: |
+                  ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / SUM( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum * 64) ) )
+            - AI L1:
+                value: |
+                  ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / SUM(TCP_TOTAL_CACHE_ACCESSES_sum * 64) )
+            - Performance (GFLOPs):
+                value: |
+                  ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9
+  - Panel Config:
+      id: 600
+      title: Workgroup Manager (SPI)
+    metric_tables:
+      - metric_table:
+          id: 601
+          title: Workgroup manager utilizations
+          metrics:
+            - Dispatched Workgroups:
+                max: |
+                  MAX(SPI_CS0_NUM_THREADGROUPS + SPI_CS1_NUM_THREADGROUPS + SPI_CS2_NUM_THREADGROUPS + SPI_CS3_NUM_THREADGROUPS)
+                avg: |
+                  AVG(SPI_CS0_NUM_THREADGROUPS + SPI_CS1_NUM_THREADGROUPS + SPI_CS2_NUM_THREADGROUPS + SPI_CS3_NUM_THREADGROUPS)
+                min: |
+                  MIN(SPI_CS0_NUM_THREADGROUPS + SPI_CS1_NUM_THREADGROUPS + SPI_CS2_NUM_THREADGROUPS + SPI_CS3_NUM_THREADGROUPS)
+            - VGPR Writes:
+                max: |
+                  MAX((((SPI_VWC0_VDATA_VALID_WR + SPI_VWC1_VDATA_VALID_WR) / (SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)) if ((SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE) != 0) else None))
+                avg: |
+                  AVG((((SPI_VWC0_VDATA_VALID_WR + SPI_VWC1_VDATA_VALID_WR) / (SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)) if ((SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE) != 0) else None))
+                min: |
+                  MIN((((SPI_VWC0_VDATA_VALID_WR + SPI_VWC1_VDATA_VALID_WR) / (SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)) if ((SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE) != 0) else None))
+            - Scheduler-Pipe Utilization:
+                max: |
+                  MAX(100 * (SPI_CS0_BUSY + SPI_CS1_BUSY + SPI_CS2_BUSY + SPI_CS3_BUSY) / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
+                avg: |
+                  AVG(100 * (SPI_CS0_BUSY + SPI_CS1_BUSY + SPI_CS2_BUSY + SPI_CS3_BUSY) / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
+                min: |
+                  MIN(100 * (SPI_CS0_BUSY + SPI_CS1_BUSY + SPI_CS2_BUSY + SPI_CS3_BUSY) / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
+            - SGPR Writes:
+                max: |
+                  MAX((((1 * SPI_SWC_CSC_WR) / (SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)) if ((SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE) != 0) else None))
+                avg: |
+                  AVG((((1 * SPI_SWC_CSC_WR) / (SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)) if ((SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE) != 0) else None))
+                min: |
+                  MIN((((1 * SPI_SWC_CSC_WR) / (SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)) if ((SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE) != 0) else None))
+            - Dispatched Wavefronts:
+                max: MAX(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)
+                avg: AVG(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)
+                min: MIN(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)
+  - Panel Config:
+      id: 700
+      title: Wavefront
+    metric_tables:
+      - metric_table:
+          id: 701
+          title: Wavefront Launch Stats
+          metrics:
+            - Total Wavefronts:
+                max: MAX(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)
+                avg: AVG(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)
+                min: MIN(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)
+  - Panel Config:
+      id: 1100
+      title: Compute Units - Compute Pipeline
+    metric_tables:
+      - metric_table:
+          id: 1101
+          title: Compute Speed-of-Light
+          metrics:
+            - MFMA FLOPs (F8):
+                peak: ((($max_sclk * $cu_per_gpu) * 8192) / 1000)
+                pop: |
+                  ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 8192) / 1000))
+            - MFMA FLOPs (F64):
+                peak: ((($max_sclk * $cu_per_gpu) * 128) / 1000)
+                pop: |
+                  ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 128) / 1000))
+            - MFMA FLOPs (BF16):
+                peak: ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
+                pop: |
+                  ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 4096) / 1000))
+            - MFMA IOPs (INT8):
+                peak: ((($max_sclk * $cu_per_gpu) * 8192) / 1000)
+                pop: |
+                  ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 8192) / 1000))
+            - MFMA FLOPs (F16):
+                peak: ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
+                pop: |
+                  ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 4096) / 1000))
+      - metric_table:
+          id: 1103
+          title: Arithmetic Operations
+          metrics:
+            - FLOPs (Total):
+                max: |
+                  MAX((((((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_FMA_F16 * 2))) + ((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F16) + (512 * SQ_INSTS_VALU_MFMA_MOPS_BF16))) + (64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_FMA_F32 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32)) + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64) + (SQ_INSTS_VALU_FMA_F64 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F6F4)) / $denom))
+                avg: |
+                  AVG((((((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_FMA_F16 * 2))) + ((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F16) + (512 * SQ_INSTS_VALU_MFMA_MOPS_BF16))) + (64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_FMA_F32 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32)) + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64) + (SQ_INSTS_VALU_FMA_F64 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F6F4)) / $denom))
+                min: |
+                  MIN((((((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_FMA_F16 * 2))) + ((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F16) + (512 * SQ_INSTS_VALU_MFMA_MOPS_BF16))) + (64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_FMA_F32 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32)) + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64) + (SQ_INSTS_VALU_FMA_F64 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F6F4)) / $denom))
+  - Panel Config:
+      id: 1700
+      title: L2 Cache
+    metric_tables:
+      - metric_table:
+          id: 1701
+          title: L2 Speed-of-Light
+          metrics:
+            - L2-Fabric Read BW:
+                value: |
+                  AVG((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) + (TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp)))
+      - metric_table:
+          id: 1702
+          title: L2-Fabric interface metrics
+          metrics:
+            - Read BW:
+                max: |
+                  MAX((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) + (TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp)))
+                avg: |
+                  AVG((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) + (TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp)))
+                min: |
+                  MIN((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) + (TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp)))
+            - Remote Read Traffic:
+                max: |
+                  MAX((100 * (MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum != 0) else None))
+                avg: |
+                  AVG((100 * (MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum != 0) else None))
+                min: |
+                  MIN((100 * (MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum != 0) else None))
+      - metric_table:
+          id: 1706
+          title: L2 - Fabric interface detailed metrics
+          metrics:
+            - HBM Write and Atomic:
+                max: MAX((TCC_EA0_WRREQ_WRITE_DRAM_sum / $denom))
+                avg: AVG((TCC_EA0_WRREQ_WRITE_DRAM_sum / $denom))
+                min: MIN((TCC_EA0_WRREQ_WRITE_DRAM_sum / $denom))
+            - Read (64B):
+                max: MAX((TCC_EA0_RDREQ_64B_sum / $denom))
+                avg: AVG((TCC_EA0_RDREQ_64B_sum / $denom))
+                min: MIN((TCC_EA0_RDREQ_64B_sum / $denom))
+  - Panel Config:
+      id: 1800
+      title: L2 Cache (per Channel)
+    metric_tables:
+      - metric_table:
+          id: 1809
+          title: L2-Fabric Read Stall (Cycles per normUnit)
+          metrics:
+            - ::_1:
+                ea read stall - pcie: AVG((TO_INT(TCC_EA0_RDREQ_IO_CREDIT_STALL[::_1]) / $denom))
+                ea read stall - hbm: AVG((TO_INT(TCC_EA0_RDREQ_DRAM_CREDIT_STALL[::_1]) / $denom))
+                ea read stall - if: AVG((TO_INT(TCC_EA0_RDREQ_GMI_CREDIT_STALL[::_1]) / $denom))
+      - metric_table:
+          id: 1810
+          title: L2-Fabric Write and Atomic Stall (Cycles per normUnit)
+          metrics:
+            - ::_1:
+                ea write stall - pcie: AVG((TO_INT(TCC_EA0_WRREQ_IO_CREDIT_STALL[::_1]) / $denom))
+                ea write stall - if: AVG((TO_INT(TCC_EA0_WRREQ_GMI_CREDIT_STALL[::_1]) / $denom))
+                ea write stall - hbm: AVG((TO_INT(TCC_EA0_WRREQ_DRAM_CREDIT_STALL[::_1]) / $denom))
@@ -2,7 +2,6 @@
 Panel Config:
  id: 0
  title: Top Stats
-  metrics_description: {}
  data source:
  - raw_csv_table:
      id: 1
@@ -12,3 +11,4 @@ Panel Config:
      id: 2
      title: Dispatch List
      source: pmc_dispatch_info.csv
+  metrics_description: {}
@@ -2,10 +2,10 @@
 Panel Config:
  id: 100
  title: System Info
-  metrics_description: {}
  data source:
  - raw_csv_table:
      id: 101
      title: System Info
      source: sysinfo.csv
      columnwise: true
+  metrics_description: {}
@@ -2,124 +2,6 @@
 Panel Config:
  id: 200
  title: System Speed-of-Light
-  metrics_description:
-    VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
-      This is also presented as a percent of the peak theoretical FLOPs achievable
-      on the specific accelerator. Note: this does not include any floating-point
-      operations from MFMA instructions.'
-    VALU IOPs: 'The total integer operations executed per second on the VALU. This
-      is also presented as a percent of the peak theoretical IOPs achievable on the
-      specific accelerator. Note: this does not include any integer operations from
-      MFMA instructions.'
-    MFMA FLOPs (F8): The total number of 8-bit brain floating point MFMA operations
-      executed per second. This does not include any 16-bit brain floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F8 MFMA operations achievable on the specific accelerator. It is supported on
-      AMD Instinct MI300 series and later only.
-    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
-      executed per second. Note: this does not include any 16-bit brain floating point
-      operations from VALU instructions. This is also presented as a percent of the
-      peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
-      per second. Note: this does not include any 16-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F16 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
-      per second. Note: this does not include any 32-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F32 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
-      per second. Note: this does not include any 64-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F64 MFMA operations achievable on the specific accelerator.'
-    MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
-      per second. Note: this does not include any 8-bit integer operations from VALU
-      instructions. This is also presented as a percent of the peak theoretical INT8
-      MFMA operations achievable on the specific accelerator.'
-    Active CUs: Total number of active compute units (CUs) on the accelerator during
-      the kernel execution.
-    SALU Utilization: Indicates what percent of the kernel's duration the SALU was
-      busy executing instructions. Computed as the ratio of the total number of cycles
-      spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
-    VALU Utilization: Indicates what percent of the kernel's duration the VALU was
-      busy executing instructions. Does not include VMEM operations. Computed as the
-      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
-      over the total CU cycles.
-    MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
-      was busy executing instructions. Computed as the ratio of the total number of
-      cycles the MFMA was busy over the total CU cycles.
-    VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
-      was busy executing instructions, including both global/generic and spill/scratch
-      operations (see the VMEM instruction count metrics) for more detail). Does not
-      include VALU operations. Computed as the ratio of the total number of cycles
-      spent by the scheduler issuing VMEM instructions over the total CU cycles.
-    Branch Utilization: Indicates what percent of the kernel's duration the branch
-      unit was busy executing instructions. Computed as the ratio of the total number
-      of cycles spent by the scheduler issuing branch instructions over the total
-      CU cycles
-    VALU Active Threads: Indicates the average level of divergence within a wavefront
-      over the lifetime of the kernel. The number of work-items that were active in
-      a wavefront during execution of each VALU instruction, time-averaged over all
-      VALU instructions run on all wavefronts in the kernel.
-    IPC: The ratio of the total number of instructions executed on the CU over the
-      total active CU cycles. This is also presented as a percent of the peak theoretical
-      bandwidth achievable on the specific accelerator.
-    Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
-      over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
-      kernels (less than 1ms). This is also presented as a percent of the peak theoretical
-      occupancy achievable on the specific accelerator.'
-    Theoretical LDS Bandwidth: Indicates the maximum amount of bytes that could have
-      been loaded from, stored to, or atomically updated in the LDS per unit time
-      (see LDS Bandwidth example for more detail). This is also presented as a percent
-      of the peak theoretical F64 MFMA operations achievable on the specific accelerator.
-    LDS Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS
-      scheduler due to bank conflicts (as determined by the conflict resolution hardware)
-      to the base number of cycles that would be spent in the LDS scheduler in a completely
-      uncontended case. This is also presented in normalized form (i.e., the Bank
-      Conflict Rate).
-    vL1D Cache Hit Rate: The ratio of the number of vL1D cache line requests that
-      hit in vL1D cache over the total number of cache line requests to the vL1D cache
-      RAM.
-    vL1D Cache BW: The number of bytes looked up in the vL1D cache as a result of
-      VMEM instructions per unit time. The number of bytes is calculated as the number
-      of cache lines requested multiplied by the cache line size. This value does
-      not consider partial requests, so e.g., if only a single value is requested
-      in a cache line, the data movement will still be counted as a full cache line.
-      This is also presented as a percent of the peak theoretical bandwidth achievable
-      on the specific accelerator.
-    L2 Cache Hit Rate: The ratio of the number of L2 cache line requests that hit
-      in the L2 cache over the total number of incoming cache line requests to the
-      L2 cache.
-    L2 Cache BW: The number of bytes looked up in the L2 cache per unit time. The
-      number of bytes is calculated as the number of cache lines requested multiplied
-      by the cache line size. This value does not consider partial requests, so e.g.,
-      if only a single value is requested in a cache line, the data movement will
-      still be counted as a full cache line. This is also presented as a percent of
-      the peak theoretical bandwidth achievable on the specific accelerator.
-    L2-Fabric Read BW: "The number of bytes read by the L2 over the Infinity Fabric\u2122\
-      \ interface per unit time. This is also presented as a percent of the peak theoretical\
-      \ bandwidth achievable on the specific accelerator."
-    L2-Fabric Write BW: The number of bytes sent by the L2 over the Infinity Fabric
-      interface by write and atomic operations per unit time. This is also presented
-      as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
-    L2-Fabric Read Latency: The time-averaged number of cycles read requests spent
-      in Infinity Fabric before data was returned to the L2.
-    L2-Fabric Write Latency: The time-averaged number of cycles write requests spent
-      in Infinity Fabric before a completion acknowledgement was returned to the L2.
-    sL1D Cache Hit Rate: The percent of sL1D requests that hit on a previously loaded
-      line the cache. Calculated as the ratio of the number of sL1D requests that
-      hit over the number of all sL1D requests.
-    sL1D Cache BW: The number of bytes looked up in the sL1D cache per unit time.
-      This is also presented as a percent of the peak theoretical bandwidth achievable
-      on the specific accelerator.
-    L1I Hit Rate: The number of bytes looked up in the L1I cache per unit time. This
-      is also presented as a percent of the peak theoretical bandwidth achievable
-      on the specific accelerator.
-    L1I BW: The percent of L1I requests that hit on a previously loaded line the cache.
-      Calculated as the ratio of the number of L1I requests that hit over the number
-      of all L1I requests.
-    L1I Fetch Latency: The average number of cycles spent to fetch instructions to
-      a CU.
  data source:
  - metric_table:
      id: 201
@@ -344,3 +226,130 @@ Panel Config:
          peak: None
          pop: None
          coll_level: SQ_IFETCH_LEVEL
+  metrics_description:
+    VALU FLOPs: |-
+      The total floating-point operations executed per second on the VALU.
+      This is also presented as a percent of the peak theoretical FLOPs achievable
+      on the specific accelerator. Note: this does not include any floating-point
+      operations from MFMA instructions.
+    VALU IOPs: |-
+      The total integer operations executed per second on the VALU. This is
+      also presented as a percent of the peak theoretical IOPs achievable on the
+      specific accelerator. Note: this does not include any integer operations from
+      MFMA instructions.
+    MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
+      executed per second. This does not include any 8-bit floating point operations
+      from VALU instructions. This is also presented as a percent of the peak theoretical
+      F8 MFMA operations achievable on the specific accelerator. It is supported on
+      AMD Instinct MI300 series and later only.
+    MFMA FLOPs (BF16): |-
+      The total number of 16-bit brain floating point MFMA operations executed
+      per second. Note: this does not include any 16-bit brain floating point operations
+      from VALU instructions. This is also presented as a percent of the peak theoretical
+      BF16 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F16): |-
+      The total number of 16-bit floating point MFMA operations executed per
+      second. Note: this does not include any 16-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F16 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F32): |-
+      The total number of 32-bit floating point MFMA operations executed per
+      second. Note: this does not include any 32-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F32 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F64): |-
+      The total number of 64-bit floating point MFMA operations executed per
+      second. Note: this does not include any 64-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F64 MFMA operations achievable on the specific accelerator.
+    MFMA IOPs (Int8): |-
+      The total number of 8-bit integer MFMA operations executed per second.
+      Note: this does not include any 8-bit integer operations from VALU instructions.
+      This is also presented as a percent of the peak theoretical INT8 MFMA operations
+      achievable on the specific accelerator.
+    Active CUs: Total number of active compute units (CUs) on the accelerator during
+      the kernel execution.
+    SALU Utilization: Indicates what percent of the kernel's duration the SALU was
+      busy executing instructions. Computed as the ratio of the total number of cycles
+      spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
+    VALU Utilization: Indicates what percent of the kernel's duration the VALU was
+      busy executing instructions. Does not include VMEM operations. Computed as the
+      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
+      over the total CU cycles.
+    MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
+      was busy executing instructions. Computed as the ratio of the total number of
+      cycles the MFMA was busy over the total CU cycles.
+    VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
+      was busy executing instructions, including both global/generic and spill/scratch
+      operations (see the VMEM instruction count metrics for more detail). Does not
+      include VALU operations. Computed as the ratio of the total number of cycles
+      spent by the scheduler issuing VMEM instructions over the total CU cycles.
+    Branch Utilization: Indicates what percent of the kernel's duration the branch
+      unit was busy executing instructions. Computed as the ratio of the total number
+      of cycles spent by the scheduler issuing branch instructions over the total
+      CU cycles.
+    VALU Active Threads: Indicates the average level of divergence within a wavefront
+      over the lifetime of the kernel. The number of work-items that were active in
+      a wavefront during execution of each VALU instruction, time-averaged over all
+      VALU instructions run on all wavefronts in the kernel.
+    IPC: The ratio of the total number of instructions executed on the CU over the
+      total active CU cycles. This is also presented as a percent of the peak theoretical
+      IPC achievable on the specific accelerator.
+    Wavefront Occupancy: |-
+      The time-averaged number of wavefronts resident on the accelerator over
+      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
+      kernels (less than 1ms). This is also presented as a percent of the peak theoretical
+      occupancy achievable on the specific accelerator.
+    Theoretical LDS Bandwidth: Indicates the maximum amount of bytes that could have
+      been loaded from, stored to, or atomically updated in the LDS per unit time
+      (see LDS Bandwidth example for more detail). This is also presented as a percent
+      of the peak theoretical bandwidth achievable on the specific accelerator.
+    LDS Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS
+      scheduler due to bank conflicts (as determined by the conflict resolution hardware)
+      to the base number of cycles that would be spent in the LDS scheduler in a completely
+      uncontended case. This is also presented in normalized form (i.e., the Bank
+      Conflict Rate).
+    vL1D Cache Hit Rate: The ratio of the number of vL1D cache line requests that
+      hit in vL1D cache over the total number of cache line requests to the vL1D cache
+      RAM.
+    vL1D Cache BW: The number of bytes looked up in the vL1D cache as a result of
+      VMEM instructions per unit time. The number of bytes is calculated as the number
+      of cache lines requested multiplied by the cache line size. This value does
+      not consider partial requests, so e.g., if only a single value is requested
+      in a cache line, the data movement will still be counted as a full cache line.
+      This is also presented as a percent of the peak theoretical bandwidth achievable
+      on the specific accelerator.
+    L2 Cache Hit Rate: The ratio of the number of L2 cache line requests that hit
+      in the L2 cache over the total number of incoming cache line requests to the
+      L2 cache.
+    L2 Cache BW: The number of bytes looked up in the L2 cache per unit time. The
+      number of bytes is calculated as the number of cache lines requested multiplied
+      by the cache line size. This value does not consider partial requests, so e.g.,
+      if only a single value is requested in a cache line, the data movement will
+      still be counted as a full cache line. This is also presented as a percent of
+      the peak theoretical bandwidth achievable on the specific accelerator.
+    L2-Fabric Read BW: |-
+      The number of bytes read by the L2 over the Infinity Fabric™ interface
+      per unit time. This is also presented as a percent of the peak theoretical
+      bandwidth achievable on the specific accelerator.
+    L2-Fabric Write BW: The number of bytes sent by the L2 over the Infinity Fabric
+      interface by write and atomic operations per unit time. This is also presented
+      as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
+    L2-Fabric Read Latency: The time-averaged number of cycles read requests spent
+      in Infinity Fabric before data was returned to the L2.
+    L2-Fabric Write Latency: The time-averaged number of cycles write requests spent
+      in Infinity Fabric before a completion acknowledgement was returned to the L2.
+    sL1D Cache Hit Rate: The percent of sL1D requests that hit on a previously loaded
+      line in the cache. Calculated as the ratio of the number of sL1D requests that
+      hit over the number of all sL1D requests.
+    sL1D Cache BW: The number of bytes looked up in the sL1D cache per unit time.
+      This is also presented as a percent of the peak theoretical bandwidth achievable
+      on the specific accelerator.
+    L1I Hit Rate: The percent of L1I requests that hit on a previously loaded line in
+      the cache. Calculated as the ratio of the number of L1I requests that hit over
+      the number of all L1I requests.
+    L1I BW: The number of bytes looked up in the L1I cache per unit time. This is also
+      presented as a percent of the peak theoretical bandwidth achievable on the
+      specific accelerator.
+    L1I Fetch Latency: The average number of cycles spent to fetch instructions to
+      a CU.
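The hit-rate and bandwidth descriptions above reduce to two small formulas. The following is a minimal sketch of them, not the profiler's implementation: the function names, the nanosecond duration parameter, and the 128-byte line size in the usage note are assumptions made for illustration only.

```python
def hit_rate(hits: int, requests: int) -> float:
    """Percent of requests that hit, guarding against an idle cache."""
    return 100.0 * hits / requests if requests else 0.0


def cache_bandwidth(lines_requested: int, line_size_bytes: int, duration_ns: float) -> float:
    """Bytes per second looked up in the cache.

    Every request is counted as a full cache line, so a single value read
    from a line still contributes line_size_bytes to the total.
    """
    if duration_ns <= 0:
        return 0.0
    return lines_requested * line_size_bytes / (duration_ns / 1e9)
```

For example, `cache_bandwidth(1000, 128, 1e6)` counts 1000 requests of a hypothetical 128-byte line over one millisecond.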
@@ -2,122 +2,6 @@
 Panel Config:
  id: 300
  title: Memory Chart
-  metrics_description:
-    Wavefront Occupancy: Wavefronts per active CU.
-    Wave Life: Average number of cycles executing a wave.
-    SALU: Total Number of SALU (Scalar ALU) instructions issued per normalization
-      unit.
-    SMEM: Total number of SMEM (Scalar Memory Read) instructions issued normalization
-      unit.
-    VALU: The number of VALU (Vector ALU) instructions issued per normalization unit.
-    MFMA: Total number of MFMA (Matrix-Fused-Multiply-Add) instructions issued per
-      normalization unit.
-    VMEM: The number of VMEM (GPU Memory) read instructions issued (including FLAT/scratch
-      memory) per normalization unit.
-    LDS: The total number of LDS instructions (including, but not limited to, read/write/atomics
-      and HIP's __shfl instructions) executed per normalization unit.
-    GWS: Total number of GDS (global data sync) instructions issued per normalization
-      unit.
-    BR: Total number of BRANCH instructions issued per normalization unit.
-    Active CUs: Total number of active compute units (CUs) on the accelerator during
-      the kernel execution.
-    Num CUs: Total number of compute units (CUs) on the accelerator.
-    VGPR: 'The number of architected vector general-purpose registers allocated for
-      the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
-      by the compiler due to allocation granularity.'
-    SGPR: 'The number of scalar general-purpose registers allocated for the kernel,
-      see SALU. Note: this may not exactly match the number of SGPRs requested by
-      the compiler due to allocation granularity.'
-    LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
-      for this kernel. Note: This may also be larger than what was requested at compile
-      time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
-    Scratch Allocation: The number of bytes of scratch memory requested per work-item
-      for this kernel. Scratch memory is used for stack memory on the accelerator,
-      as well as for register spills and restores.
-    Wavefronts: The total number of wavefronts, summed over all workgroups, forming
-      this kernel launch.
-    Workgroups: The total number of workgroups forming this kernel launch.
-    LDS Req: The total number of LDS instructions (including, but not limited to,
-      read/write/atomics and HIP's __shfl instructions) executed per normalization
-      unit.
-    LDS Util: Indicates what percent of the kernel's duration the LDS was actively
-      executing instructions (including, but not limited to, load, store, atomic and
-      HIP's __shfl operations). Calculated as the ratio of the total number of cycles
-      LDS was active over the total CU cycles.
-    LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
-      / acknowledgment) required for an LDS instruction to complete.
-    VL1 Rd: The total number of incoming read requests from the address processing
-      unit after coalescing per normalization unit
-    VL1 Wr: The total number of incoming write requests from the address processing
-      unit after coalescing per normalization unit
-    VL1 Atomic: The total number of incoming atomic requests from the address processing
-      unit after coalescing per normalization unit
-    VL1 Hit: The ratio of the number of vL1D cache line requests that hit in vL1D
-      cache over the total number of cache line requests to the vL1D Cache RAM.
-    VL1 Lat: Calculated as the average number of cycles that a vL1D cache line request
-      spent in the vL1D cache pipeline.
-    VL1 Coalesce: Indicates how well memory instructions were coalesced by the address
-      processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
-      as the average number of thread-requests generated per instruction divided by
-      the ideal number of thread-requests per instruction.
-    VL1 Stall: The ratio of the number of cycles where the vL1D is stalled waiting
-      to issue a request for data to the L2 cache divided by the number of cycles
-      where the vL1D is active.
-    VL1_L2 Rd: The number of read requests for a vL1D cache line that were not satisfied
-      by the vL1D and must be retrieved from the to the L2 Cache per normalization
-      unit.
-    VL1_L2 Wr: The number of write requests to a vL1D cache line that were sent through
-      the vL1D to the L2 cache, per normalization unit.
-    VL1_L2 Atomic: The number of atomic requests that are sent through the vL1D to
-      the L2 cache, per normalization unit. This includes requests for atomics with,
-      and without return.
-    sL1D Rd: The total number of requests, of any size or type, made to the sL1D per
-      normalization unit.
-    sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
-      line, per normalization unit.
-    sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
-      unit.
-    sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
-      unit. Typically unused on current CDNA accelerators.
-    sL1D_L2 Atomic: The total number of atomic requests from sL1D to the L2, per normalization
-      unit. Typically unused on current CDNA accelerators.
-    IL1 Fetch: The total number of requests made to the L1I per normalization-unit.
-    IL1 Hit: The percent of L1I requests that hit on a previously loaded line the
-      cache. Calculated as the ratio of the number of L1I requests that hit over the
-      number of all L1I requests.
-    IL1 Lat: The average number of cycles spent to fetch instructions to a CU.
-    IL1_L2 Rd: The total number of requests across the L1I - L2 interface per normalization-unit.
-    L2 Rd: The total number of read requests to the L2 from all clients.
-    L2 Wr: The total number of write requests to the L2 from all clients.
-    L2 Atomic: The total number of atomic requests (with and without return) to the
-      L2 from all clients.
-    L2 Hit: The ratio of the number of L2 cache line requests that hit in the L2 cache
-      over the total number of incoming cache line requests to the L2 cache.
-    L2 Rd Lat: Calculated as the average number of cycles that the vL1D cache took
-      to issue and receive read requests from the L2 Cache. This number also includes
-      requests for atomics with return values.
-    L2 Wr Lat: Calculated as the average number of cycles that the vL1D cache took
-      to issue and receive acknowledgement of a write request to the L2 Cache. This
-      number also includes requests for atomics without return values.
-    Fabric_L2 Rd: Number of L2 cache - Infinity Fabric read requests (either 32-byte
-      or 64-byte) summed over TCC instances per normalization unit.
-    Fabric_L2 Wr: Number of L2 cache - Infinity Fabric write requests (either 32-byte
-      or 64-byte) summed over TCC instances per normalization unit.
-    Fabric_L2 Atomic: Number of L2 cache - Infinity Fabric write requests (either
-      32-byte or 64-byte) that are actually atomic requests summed over TCC instances
-      per normalization unit.
-    Fabric Rd Lat: The time-averaged number of cycles read requests spent in Infinity
-      Fabric before data was returned to the L2.
-    Fabric Wr Lat: The time-averaged number of cycles write requests spent in Infinity
-      Fabric before a completion acknowledgement was returned to the L2.
-    Fabric Atomic Lat: The time-averaged number of cycles atomic requests spent in
-      Infinity Fabric before a completion acknowledgement (atomic without return value)
-      or data (atomic with return value) was returned to the L2.
-    HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
-      of data from the accelerator's local HBM, per normalization unit.
-    HBM Wr: 'The total number of L2 requests to Infinity Fabric to write or atomically
-      update 32B or 64B of data in the accelerator''s local HBM, per normalization
-      unit. '
  data source:
  - metric_table:
      id: 301
@@ -244,13 +128,13 @@ Panel Config:
          value: ROUND(AVG((TCC_EA0_ATOMIC_sum / $denom)), 0)
        Fabric Rd Lat:
          value: ROUND(AVG(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
-            != 0) else  0)), 0)
+            != 0) else 0)), 0)
        Fabric Wr Lat:
          value: ROUND(AVG(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
-            != 0) else  0)), 0)
+            != 0) else 0)), 0)
        Fabric Atomic Lat:
          value: ROUND(AVG(((TCC_EA0_ATOMIC_LEVEL_sum / TCC_EA0_ATOMIC_sum) if (TCC_EA0_ATOMIC_sum
-            != 0) else  0)), 0)
+            != 0) else 0)), 0)
        HBM Rd:
          value: ROUND(AVG((TCC_EA0_RDREQ_DRAM_sum / $denom)), 0)
        HBM Wr:
@@ -258,3 +142,117 @@ Panel Config:
      comparable: false
      cli_style: mem_chart
      tui_style: mem_chart
+  metrics_description:
+    Wavefront Occupancy: Wavefronts per active CU.
+    Wave Life: Average number of cycles executing a wave.
+    SALU: Total Number of SALU (Scalar ALU) instructions issued per normalization
+      unit.
+    SMEM: Total number of SMEM (Scalar Memory Read) instructions issued per normalization
+      unit.
+    VALU: The number of VALU (Vector ALU) instructions issued per normalization unit.
+    MFMA: Total number of MFMA (Matrix-Fused-Multiply-Add) instructions issued per
+      normalization unit.
+    VMEM: The number of VMEM (GPU Memory) read instructions issued (including FLAT/scratch
+      memory) per normalization unit.
+    LDS: The total number of LDS instructions (including, but not limited to, read/write/atomics
+      and HIP's __shfl instructions) executed per normalization unit.
+    GWS: Total number of GDS (global data sync) instructions issued per normalization
+      unit.
+    BR: Total number of BRANCH instructions issued per normalization unit.
+    Active CUs: Total number of active compute units (CUs) on the accelerator during
+      the kernel execution.
+    Num CUs: Total number of compute units (CUs) on the accelerator.
+    VGPR: |-
+      The number of architected vector general-purpose registers allocated
+      for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
+      requested by the compiler due to allocation granularity.
+    SGPR: |-
+      The number of scalar general-purpose registers allocated for the kernel,
+      see SALU. Note: this may not exactly match the number of SGPRs requested by
+      the compiler due to allocation granularity.
+    LDS Allocation: |-
+      The number of bytes of LDS memory (or, shared memory) allocated for
+      this kernel. Note: This may also be larger than what was requested at compile
+      time due to both allocation granularity and dynamic per-dispatch LDS allocations.
+    Scratch Allocation: The number of bytes of scratch memory requested per work-item
+      for this kernel. Scratch memory is used for stack memory on the accelerator,
+      as well as for register spills and restores.
+    Wavefronts: The total number of wavefronts, summed over all workgroups, forming
+      this kernel launch.
+    Workgroups: The total number of workgroups forming this kernel launch.
+    LDS Req: The total number of LDS instructions (including, but not limited to,
+      read/write/atomics and HIP's __shfl instructions) executed per normalization
+      unit.
+    LDS Util: Indicates what percent of the kernel's duration the LDS was actively
+      executing instructions (including, but not limited to, load, store, atomic and
+      HIP's __shfl operations). Calculated as the ratio of the total number of cycles
+      LDS was active over the total CU cycles.
+    LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
+      / acknowledgment) required for an LDS instruction to complete.
+    VL1 Rd: The total number of incoming read requests from the address processing
+      unit after coalescing per normalization unit.
+    VL1 Wr: The total number of incoming write requests from the address processing
+      unit after coalescing per normalization unit.
+    VL1 Atomic: The total number of incoming atomic requests from the address processing
+      unit after coalescing per normalization unit.
+    VL1 Hit: The ratio of the number of vL1D cache line requests that hit in vL1D
+      cache over the total number of cache line requests to the vL1D Cache RAM.
+    VL1 Lat: Calculated as the average number of cycles that a vL1D cache line request
+      spent in the vL1D cache pipeline.
+    VL1 Coalesce: Indicates how well memory instructions were coalesced by the address
+      processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
+      as the average number of thread-requests generated per instruction divided by
+      the ideal number of thread-requests per instruction.
+    VL1 Stall: The ratio of the number of cycles where the vL1D is stalled waiting
+      to issue a request for data to the L2 cache divided by the number of cycles
+      where the vL1D is active.
+    VL1_L2 Rd: The number of read requests for a vL1D cache line that were not satisfied
+      by the vL1D and must be retrieved from the L2 cache per normalization
+      unit.
+    VL1_L2 Wr: The number of write requests to a vL1D cache line that were sent through
+      the vL1D to the L2 cache, per normalization unit.
+    VL1_L2 Atomic: The number of atomic requests that are sent through the vL1D to
+      the L2 cache, per normalization unit. This includes requests for atomics with
+      and without return.
+    sL1D Rd: The total number of requests, of any size or type, made to the sL1D per
+      normalization unit.
+    sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
+      line, per normalization unit.
+    sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
+      unit.
+    sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
+      unit. Typically unused on current CDNA accelerators.
+    sL1D_L2 Atomic: The total number of atomic requests from sL1D to the L2, per normalization
+      unit. Typically unused on current CDNA accelerators.
+    IL1 Fetch: The total number of requests made to the L1I per normalization-unit.
+    IL1 Hit: The percent of L1I requests that hit on a previously loaded line in the
+      cache. Calculated as the ratio of the number of L1I requests that hit over the
+      number of all L1I requests.
+    IL1 Lat: The average number of cycles spent to fetch instructions to a CU.
+    IL1_L2 Rd: The total number of requests across the L1I - L2 interface per normalization-unit.
+    L2 Rd: The total number of read requests to the L2 from all clients.
+    L2 Wr: The total number of write requests to the L2 from all clients.
+    L2 Atomic: The total number of atomic requests (with and without return) to the
+      L2 from all clients.
+    L2 Hit: The ratio of the number of L2 cache line requests that hit in the L2 cache
+      over the total number of incoming cache line requests to the L2 cache.
+    Fabric_L2 Rd: Number of L2 cache - Infinity Fabric read requests (either 32-byte
+      or 64-byte) summed over TCC instances per normalization unit.
+    Fabric_L2 Wr: Number of L2 cache - Infinity Fabric write requests (either 32-byte
+      or 64-byte) summed over TCC instances per normalization unit.
+    Fabric_L2 Atomic: Number of L2 cache - Infinity Fabric write requests (either
+      32-byte or 64-byte) that are actually atomic requests summed over TCC instances
+      per normalization unit.
+    Fabric Rd Lat: The time-averaged number of cycles read requests spent in Infinity
+      Fabric before data was returned to the L2.
+    Fabric Wr Lat: The time-averaged number of cycles write requests spent in Infinity
+      Fabric before a completion acknowledgement was returned to the L2.
+    Fabric Atomic Lat: The time-averaged number of cycles atomic requests spent in
+      Infinity Fabric before a completion acknowledgement (atomic without return value)
+      or data (atomic with return value) was returned to the L2.
+    HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
+      of data from the accelerator's local HBM, per normalization unit.
+    HBM Wr: |-
+      The total number of L2 requests to Infinity Fabric to write or atomically
+      update 32B or 64B of data in the accelerator's local HBM, per normalization
+      unit.
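The Fabric latency entries in this panel are computed as a guarded ratio (`LEVEL_sum / REQ_sum` if the request count is nonzero, else 0), then averaged and rounded. A minimal sketch of that pattern follows. It assumes `AVG` averages over the profiled dispatches, and the function name and input shape are hypothetical.

```python
from statistics import mean


def fabric_read_latency(samples):
    """Average cycles per read request across dispatches.

    samples: list of (rdreq_level_sum, rdreq_sum) pairs, one per dispatch,
    mirroring TCC_EA0_RDREQ_LEVEL_sum and TCC_EA0_RDREQ_sum.
    """
    # Guard each dispatch separately so an idle dispatch contributes 0
    # instead of raising ZeroDivisionError.
    per_dispatch = [level / reqs if reqs != 0 else 0 for level, reqs in samples]
    return round(mean(per_dispatch)) if per_dispatch else 0
```

The same shape would apply to the write and atomic latencies by substituting the corresponding counter pairs.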
@@ -2,85 +2,6 @@
 Panel Config:
  id: 400
  title: Roofline
-  metrics_description:
-    VALU FLOPs (F16): 'The total 16-bit floating-point operations executed per second
-      on the VALU. This is presented with the value of the peak empirical F16 FLOPs
-      achievable on the specific accelerator. Note: this does not include any F16
-      operations from MFMA instructions.'
-    VALU FLOPs (F32): 'The total 32-bit floating-point operations executed per second
-      on the VALU. This is presented with the value of the peak empirical F32 FLOPs
-      achievable on the specific accelerator. Note: this does not include any F32
-      operations from MFMA instructions.'
-    VALU FLOPs (F64): 'The total 64-bit floating-point operations executed per second
-      on the VALU. This is presented with the value of the peak empirical F64 FLOPs
-      achievable on the specific accelerator. Note: this does not include any F64
-      operations from MFMA instructions.'
-    MFMA FLOPs (F8): The total number of 8-bit brain floating point MFMA operations
-      executed per second. This does not include any 16-bit brain floating point operations
-      from VALU instructions. The peak empirically measured F8 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison. It is supported
-      on AMD Instinct MI300 series and later only.
-    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
-      executed per second. Note: this does not include any 16-bit brain floating point
-      operations from VALU instructions. The peak empirically measured BF16 MFMA operations
-      achievable on the specific accelerator is displayed alongside for comparison.'
-    MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
-      per second. Note: this does not include any 16-bit floating point operations
-      from VALU instructions. The peak empirically measured F16 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison.'
-    MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
-      per second. Note: this does not include any 32-bit floating point operations
-      from VALU instructions. The peak empirically measured F32 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison.'
-    MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
-      per second. Note: this does not include any 64-bit floating point operations
-      from VALU instructions. The peak empirically measured F64 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison.'
-    MFMA FLOPs (F6F4): 'The total number of 4-bit and 6-bit floating point MFMA operations
-      executed per second. Note: this does not include any floating point operations
-      from VALU instructions. The peak empirically measured F6F4 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison. It is supported
-      on AMD Instinct MI350 series (gfx950) and later only.'
-    MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
-      per second. Note: this does not include any 8-bit integer operations from VALU
-      instructions. The peak empirically measured INT8 MFMA operations achievable
-      on the specific accelerator is displayed alongside for comparison.'
-    HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth
-      Memory (HBM) per second. The peak empirically measured bandwidth achievable
-      on the specific accelerator is displayed alongside for comparison.
-    L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
-      The number of bytes is calculated as the number of cache lines requested multiplied
-      by the cache line size. This value does not consider partial requests, so e.g.,
-      if only a single value is requested in a cache line, the data movement will
-      still be counted as a full cache line. The peak empirically measured bandwidth
-      achievable on the specific accelerator is displayed alongside for comparison.
-    L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
-      of VMEM instructions per unit time. The number of bytes is calculated as the
-      number of cache lines requested multiplied by the cache line size. This value
-      does not consider partial requests, so e.g., if only a single value is requested
-      in a cache line, the data movement will still be counted as a full cache line.
-      The peak empirically measured bandwidth achievable on the specific accelerator
-      is displayed alongside for comparison.
-    LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded
-      from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
-      example for more detail). The peak empirically measured LDS bandwidth achievable
-      on the specific accelerator is displayed alongside for comparison.
-    AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
-      of total floating-point operations (FLOPs) to total bytes transferred between
-      the L1 cache and the processing units. This value is used as the x-coordinate
-      for the L1 roofline.
-    AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
-      of total floating-point operations (FLOPs) to total bytes transferred between
-      the L2 cache and the L1 cache. This value is used as the x-coordinate for the
-      L2 roofline.
-    AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
-      It is the ratio of total floating-point operations (FLOPs) to total bytes transferred
-      between HBM and the L2 cache. This value is used as the x-coordinate for the
-      HBM roofline.
-    Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs
-      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
-      operations divided by the total execution time. This value is used as the y-coordinate
-      for the kernel's point on the Roofline plot.
  data source:
  - metric_table:
      id: 401
@@ -218,3 +139,91 @@ Panel Config:
            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
            * 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9
          unit: GFLOP/s
+  metrics_description:
+    VALU FLOPs (F16): |-
+      The total 16-bit floating-point operations executed per second on the VALU.
+      This is presented with the value of the peak empirical F16 FLOPs achievable
+      on the specific accelerator. Note: this does not include any F16 operations
+      from MFMA instructions.
+    VALU FLOPs (F32): |-
+      The total 32-bit floating-point operations executed per second on the VALU.
+      This is presented with the value of the peak empirical F32 FLOPs achievable
+      on the specific accelerator. Note: this does not include any F32 operations
+      from MFMA instructions.
+    VALU FLOPs (F64): |-
+      The total 64-bit floating-point operations executed per second on the VALU.
+      This is presented with the value of the peak empirical F64 FLOPs achievable
+      on the specific accelerator. Note: this does not include any F64 operations
+      from MFMA instructions.
+    MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
+      executed per second. This does not include any 8-bit floating point operations
+      from VALU instructions. The peak empirically measured F8 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison. It is supported
+      on AMD Instinct MI300 series and later only.
+    MFMA FLOPs (BF16): |-
+      The total number of 16-bit brain floating point MFMA operations executed
+      per second. Note: this does not include any 16-bit brain floating point
+      operations from VALU instructions. The peak empirically measured BF16 MFMA
+      operations achievable on the specific accelerator is displayed alongside
+      for comparison.
+    MFMA FLOPs (F16): |-
+      The total number of 16-bit floating point MFMA operations executed per
+      second. Note: this does not include any 16-bit floating point operations from
+      VALU instructions. The peak empirically measured F16 MFMA operations
+      achievable on the specific accelerator is displayed alongside for comparison.
+    MFMA FLOPs (F32): |-
+      The total number of 32-bit floating point MFMA operations executed per
+      second. Note: this does not include any 32-bit floating point operations from
+      VALU instructions. The peak empirically measured F32 MFMA operations
+      achievable on the specific accelerator is displayed alongside for comparison.
+    MFMA FLOPs (F64): |-
+      The total number of 64-bit floating point MFMA operations executed per
+      second. Note: this does not include any 64-bit floating point operations from
+      VALU instructions. The peak empirically measured F64 MFMA operations
+      achievable on the specific accelerator is displayed alongside for comparison.
+    MFMA IOPs (Int8): |-
+      The total number of 8-bit integer MFMA operations executed per second.
+      Note: this does not include any 8-bit integer operations from VALU instructions.
+      The peak empirically measured INT8 MFMA operations achievable on the specific
+      accelerator is displayed alongside for comparison.
+    HBM Bandwidth: |-
+      The total number of bytes read from and written to High-Bandwidth
+      Memory (HBM) per second. The peak empirically measured bandwidth achievable
+      on the specific accelerator is displayed alongside for comparison.
+    L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
+      The number of bytes is calculated as the number of cache lines requested multiplied
+      by the cache line size. This value does not consider partial requests, so e.g.,
+      if only a single value is requested in a cache line, the data movement will
+      still be counted as a full cache line. The peak empirically measured bandwidth
+      achievable on the specific accelerator is displayed alongside for comparison.
+    L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
+      of VMEM instructions per unit time. The number of bytes is calculated as the
+      number of cache lines requested multiplied by the cache line size. This value
+      does not consider partial requests, so e.g., if only a single value is requested
+      in a cache line, the data movement will still be counted as a full cache line.
+      The peak empirically measured bandwidth achievable on the specific accelerator
+      is displayed alongside for comparison.
+    LDS Bandwidth: Indicates the maximum number of bytes that could have been loaded
+      from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
+      example for more detail). The peak empirically measured LDS bandwidth achievable
+      on the specific accelerator is displayed alongside for comparison.
+    AI L1: |-
+      The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
+      of total floating-point operations (FLOPs) to total bytes transferred between
+      the L1 cache and the processing units. This value is used as the x-coordinate
+      for the L1 roofline.
+    AI L2: |-
+      The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
+      of total floating-point operations (FLOPs) to total bytes transferred between
+      the L2 cache and the L1 cache. This value is used as the x-coordinate for
+      the L2 roofline.
+    AI HBM: |-
+      The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
+      It is the ratio of total floating-point operations (FLOPs) to total bytes
+      transferred between HBM and the L2 cache. This value is used as the x-coordinate
+      for the HBM roofline.
+    Performance (GFLOPs): |-
+      The overall achieved performance, measured in GigaFLOPs
+      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
+      operations divided by the total execution time. This value is used as the y-coordinate
+      for the kernel's point on the Roofline plot.
@@ -2,30 +2,6 @@
 Panel Config:
  id: 500
  title: Command Processor (CPC/CPF)
-  metrics_description:
-    CPF Utilization: Percent of total cycles where the CPF was busy actively doing
-      any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
-    CPF Stall: Percent of CPF busy cycles where the CPF was stalled for any reason.
-    CPF-L2 Utilization: Percent of total cycles counted by the CPF-L2 interface where
-      the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
-      over total cycles counted by the CPF-L2.
-    CPF-L2 Stall: Percent of CPF-L2 L2 busy cycles where the CPF-L2 interface was
-      stalled for any reason.
-    CPF-UTCL1 Stall: Percent of CPF busy cycles where the CPF was stalled by address
-      translation.
-    CPC Utilization: Percent of total cycles where the CPC was busy actively doing
-      any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
-    CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
-    CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
-      for processing.
-    CPC-Workgroup Manager Utilization: Percent of CPC busy cycles spent dispatching
-      workgroups to the workgroup manager.
-    CPC-L2 Utilization: Percent of total cycles counted by the CPC-L2 interface where
-      the CPC-L2 interface was active doing any work.
-    CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
-      translation
-    CPC-UTCL2 Utilization: 'Percent of total cycles counted by the CPC''s L2 address
-      translation interface where the CPC was busy doing address translation work.  '
  data source:
  - metric_table:
      id: 501
@@ -143,3 +119,28 @@ Panel Config:
          max: MAX((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
            if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
          unit: pct
+  metrics_description:
+    CPF Utilization: Percent of total cycles where the CPF was busy actively doing
+      any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
+    CPF Stall: Percent of CPF busy cycles where the CPF was stalled for any reason.
+    CPF-L2 Utilization: Percent of total cycles counted by the CPF-L2 interface where
+      the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
+      over total cycles counted by the CPF-L2.
+    CPF-L2 Stall: Percent of CPF-L2 busy cycles where the CPF-L2 interface was
+      stalled for any reason.
+    CPF-UTCL1 Stall: Percent of CPF busy cycles where the CPF was stalled by address
+      translation.
+    CPC Utilization: Percent of total cycles where the CPC was busy actively doing
+      any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
+    CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
+    CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
+      for processing.
+    CPC-Workgroup Manager Utilization: Percent of CPC busy cycles spent dispatching
+      workgroups to the workgroup manager.
+    CPC-L2 Utilization: Percent of total cycles counted by the CPC-L2 interface where
+      the CPC-L2 interface was active doing any work.
+    CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
+      translation.
+    CPC-UTCL2 Utilization: |-
+      Percent of total cycles counted by the CPC's L2 address translation
+      interface where the CPC was busy doing address translation work.
@@ -2,61 +2,6 @@
 Panel Config:
  id: 600
  title: Workgroup Manager (SPI)
-  metrics_description:
-    Accelerator Utilization: The percent of cycles in the kernel where the accelerator
-      was actively doing any work.
-    Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
-      kernel where the scheduler-pipes were actively doing any work.
-    Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
-      manager was actively doing any work.
-    Shader Engine Utilization: The percent of total shader engine cycles in the kernel
-      where any CU in a shader-engine was actively doing any work, normalized over
-      all shader-engines. Low values (e.g., << 100%) indicate that the accelerator
-      was not fully saturated by the kernel, or a potential load-imbalance issue.
-    SIMD Utilization: The percent of total SIMD cycles in the kernel where any SIMD
-      on a CU was actively doing any work, summed over all CUs. Low values (less than
-      100%) indicate that the accelerator was not fully saturated by the kernel, or
-      a potential load-imbalance issue.
-    Dispatched Workgroups: The total number of workgroups forming this kernel launch.
-    Dispatched Wavefronts: The total number of wavefronts, summed over all workgroups,
-      forming this kernel launch.
-    VGPR Writes: The average number of cycles spent initializing VGPRs at wave creation.
-    SGPR Writes: The average number of cycles spent initializing SGPRs at wave creation.
-    Not-scheduled Rate (Workgroup Manager): The percent of total scheduler-pipe cycles
-      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
-      within the workgroup manager rather than a lack of a CU or SIMD with sufficient
-      resources.
-    Not-scheduled Rate (Scheduler-Pipe): 'The percent of total scheduler-pipe cycles
-      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
-      within the scheduler-pipes rather than a lack of a CU or SIMD with sufficient
-      resources. '
-    Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
-      where a workgroup could not be scheduled to a CU due to occupancy limitations
-      (like a lack of a CU or SIMD with sufficient resources).
-    Scratch Stall Rate: The percent of total shader-engine cycles in the kernel where
-      a workgroup could not be scheduled to a CU due to lack of private (a.k.a., scratch)
-      memory slots. While this can reach up to 100%, note that the actual occupancy
-      limitations on a kernel using private memory are typically quite small (for
-      example, less than 1% of the total number of waves that can be scheduled to
-      an accelerator).
-    Insufficient SIMD Waveslots: The percent of total SIMD cycles in the kernel where
-      a workgroup could not be scheduled to a SIMD due to lack of available waveslots.
-    Insufficient SIMD VGPRs: The percent of total SIMD cycles in the kernel where
-      a workgroup could not be scheduled to a SIMD due to lack of available VGPRs.
-    Insufficient SIMD SGPRs: The percent of total SIMD cycles in the kernel where
-      a workgroup could not be scheduled to a SIMD due to lack of available SGPRs.
-    Insufficient CU LDS: The percent of total CU cycles in the kernel where a workgroup
-      could not be scheduled to a CU due to lack of available LDS.
-    Insufficient CU Barriers: The percent of total CU cycles in the kernel where a
-      workgroup could not be scheduled to a CU due to lack of available barriers.
-    Reached CU Workgroup Limit: The percent of total CU cycles in the kernel where
-      a workgroup could not be scheduled to a CU due to limits within the workgroup
-      manager. This is expected to be always be zero on CDNA2 or newer accelerators
-      (and small for previous accelerators).
-    Reached CU Wavefront Limit: The percent of total CU cycles in the kernel where
-      a wavefront could not be scheduled to a CU due to limits within the workgroup
-      manager. This is expected to be always be zero on CDNA2 or newer accelerators
-      (and small for previous accelerators).
  data source:
  - metric_table:
      id: 601
@@ -199,3 +144,58 @@ Panel Config:
          min: MIN(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
+  metrics_description:
+    Accelerator Utilization: The percent of cycles in the kernel where the accelerator
+      was actively doing any work.
+    Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
+      kernel where the scheduler-pipes were actively doing any work.
+    Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
+      manager was actively doing any work.
+    Shader Engine Utilization: The percent of total shader engine cycles in the kernel
+      where any CU in a shader-engine was actively doing any work, normalized over
+      all shader-engines. Low values (e.g., << 100%) indicate that the accelerator
+      was not fully saturated by the kernel, or a potential load-imbalance issue.
+    SIMD Utilization: The percent of total SIMD cycles in the kernel where any SIMD
+      on a CU was actively doing any work, summed over all CUs. Low values (less than
+      100%) indicate that the accelerator was not fully saturated by the kernel, or
+      a potential load-imbalance issue.
+    Dispatched Workgroups: The total number of workgroups forming this kernel launch.
+    Dispatched Wavefronts: The total number of wavefronts, summed over all workgroups,
+      forming this kernel launch.
+    VGPR Writes: The average number of cycles spent initializing VGPRs at wave creation.
+    SGPR Writes: The average number of cycles spent initializing SGPRs at wave creation.
+    Not-scheduled Rate (Workgroup Manager): The percent of total scheduler-pipe cycles
+      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
+      within the workgroup manager rather than a lack of a CU or SIMD with sufficient
+      resources.
+    Not-scheduled Rate (Scheduler-Pipe): |-
+      The percent of total scheduler-pipe cycles in the kernel where a workgroup
+      could not be scheduled to a CU due to a bottleneck within the scheduler-pipes
+      rather than a lack of a CU or SIMD with sufficient resources.
+    Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
+      where a workgroup could not be scheduled to a CU due to occupancy limitations
+      (like a lack of a CU or SIMD with sufficient resources).
+    Scratch Stall Rate: The percent of total shader-engine cycles in the kernel where
+      a workgroup could not be scheduled to a CU due to lack of private (a.k.a., scratch)
+      memory slots. While this can reach up to 100%, note that the actual occupancy
+      limitations on a kernel using private memory are typically quite small (for
+      example, less than 1% of the total number of waves that can be scheduled to
+      an accelerator).
+    Insufficient SIMD Waveslots: The percent of total SIMD cycles in the kernel where
+      a workgroup could not be scheduled to a SIMD due to lack of available waveslots.
+    Insufficient SIMD VGPRs: The percent of total SIMD cycles in the kernel where
+      a workgroup could not be scheduled to a SIMD due to lack of available VGPRs.
+    Insufficient SIMD SGPRs: The percent of total SIMD cycles in the kernel where
+      a workgroup could not be scheduled to a SIMD due to lack of available SGPRs.
+    Insufficient CU LDS: The percent of total CU cycles in the kernel where a workgroup
+      could not be scheduled to a CU due to lack of available LDS.
+    Insufficient CU Barriers: The percent of total CU cycles in the kernel where a
+      workgroup could not be scheduled to a CU due to lack of available barriers.
+    Reached CU Workgroup Limit: The percent of total CU cycles in the kernel where
+      a workgroup could not be scheduled to a CU due to limits within the workgroup
+      manager. This is expected to always be zero on CDNA2 or newer accelerators
+      (and small for previous accelerators).
+    Reached CU Wavefront Limit: The percent of total CU cycles in the kernel where
+      a wavefront could not be scheduled to a CU due to limits within the workgroup
+      manager. This is expected to always be zero on CDNA2 or newer accelerators
+      (and small for previous accelerators).
@@ -2,63 +2,6 @@
 Panel Config:
  id: 700
  title: Wavefront
-  metrics_description:
-    Grid Size: The total number of work-items (or, threads) launched as a part of
-      the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied
-      by the total workgroup (or, block) size.
-    Workgroup Size: The total number of work-items (or, threads) in each workgroup
-      (or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
-      to the total block size.
-    Total Wavefronts: "The total number of wavefronts launched as part of the kernel\
-      \ dispatch. On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs,\
-      \ the wavefront size is always 64 work-items. Thus, the total number of wavefronts\
-      \ should be equivalent to the ceiling of grid size divided by 64."
-    Saved Wavefronts: The total number of wavefronts saved at a context-save.
-    Restored Wavefronts: The total number of wavefronts restored from a context-save.
-    VGPRs: 'The number of architected vector general-purpose registers allocated for
-      the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
-      by the compiler due to allocation granularity.'
-    AGPRs: 'The number of accumulation vector general-purpose registers allocated
-      for the kernel, see AGPRs. Note: this may not exactly match the number of AGPRs
-      requested by the compiler due to allocation granularity.'
-    SGPRs: 'The number of scalar general-purpose registers allocated for the kernel,
-      see SALU. Note: this may not exactly match the number of SGPRs requested by
-      the compiler due to allocation granularity.'
-    LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
-      for this kernel. Note: This may also be larger than what was requested at compile
-      time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
-    Scratch Allocation: The number of bytes of scratch memory requested per work-item
-      for this kernel. Scratch memory is used for stack memory on the accelerator,
-      as well as for register spills and restores.
-    Kernel Time: The total duration of the executed kernel.
-    Kernel Time (Cycles): The total duration of the executed kernel in cycles.
-    Instructions per wavefront: The average number of instructions (of all types)
-      executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.
-    Wave Cycles: The number of cycles a wavefront in the kernel dispatch spent resident
-      on a compute unit per normalization unit. This is averaged over all wavefronts
-      in a kernel dispatch.
-    Dependency Wait Cycles: The number of cycles a wavefront in the kernel dispatch
-      spent resident on a compute unit per normalization unit. This is averaged over
-      all wavefronts in a kernel dispatch.
-    Issue Wait Cycles: The number of cycles a wavefront in the kernel dispatch was
-      unable to issue an instruction for any reason (e.g., execution pipe back-pressure,
-      arbitration loss, etc.) per normalization unit. This counter is incremented
-      at every cycle by all wavefronts on a CU unable to issue an instruction. As
-      such, it is most useful to get a sense of how waves were spending their time,
-      rather than identification of a precise limiter because another wave could be
-      actively executing while a wave is issue stalled. The sum of this metric, Dependency
-      Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
-    Active Cycles: The average number of cycles a wavefront in the kernel dispatch
-      was actively executing instructions per normalization unit. This measurement
-      is made on a per-wavefront basis, and may include cycles that another wavefront
-      spent actively executing (on another execution unit, for example) or was stalled.
-      As such, it is most useful to get a sense of how waves were spending their time,
-      rather than identification of a precise limiter. The sum of this metric, Issue
-      Wait Cycles and Active Wait Cycles should be equal to the total Wave Cycles
-      metric.
-    Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
-      over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
-      kernels (less than 1ms).'
  data source:
  - metric_table:
      id: 701
@@ -171,3 +114,66 @@ Panel Config:
          max: MAX((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
          unit: Wavefronts
          coll_level: SQ_LEVEL_WAVES
+  metrics_description:
+    Grid Size: The total number of work-items (or, threads) launched as a part of
+      the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied
+      by the total workgroup (or, block) size.
+    Workgroup Size: The total number of work-items (or, threads) in each workgroup
+      (or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
+      to the total block size.
+    Total Wavefronts: |-
+      The total number of wavefronts launched as part of the kernel dispatch.
+      On AMD Instinct™ CDNA™ accelerators and GCN™ GPUs, the wavefront
+      size is always 64 work-items. Thus, the total number of wavefronts should
+      be equivalent to the ceiling of grid size divided by 64.
+    Saved Wavefronts: The total number of wavefronts saved at a context-save.
+    Restored Wavefronts: The total number of wavefronts restored from a context-save.
+    VGPRs: |-
+      The number of architected vector general-purpose registers allocated
+      for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
+      requested by the compiler due to allocation granularity.
+    AGPRs: |-
+      The number of accumulation vector general-purpose registers allocated
+      for the kernel, see AGPRs. Note: this may not exactly match the number of
+      AGPRs requested by the compiler due to allocation granularity.
+    SGPRs: |-
+      The number of scalar general-purpose registers allocated for the kernel,
+      see SALU. Note: this may not exactly match the number of SGPRs requested by
+      the compiler due to allocation granularity.
+    LDS Allocation: |-
+      The number of bytes of LDS memory (or, shared memory) allocated for
+      this kernel. Note: This may also be larger than what was requested at compile
+      time due to both allocation granularity and dynamic per-dispatch LDS allocations.
+    Scratch Allocation: The number of bytes of scratch memory requested per work-item
+      for this kernel. Scratch memory is used for stack memory on the accelerator,
+      as well as for register spills and restores.
+    Kernel Time: The total duration of the executed kernel.
+    Kernel Time (Cycles): The total duration of the executed kernel in cycles.
+    Instructions per wavefront: The average number of instructions (of all types)
+      executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.
+    Wave Cycles: The number of cycles a wavefront in the kernel dispatch spent resident
+      on a compute unit per normalization unit. This is averaged over all wavefronts
+      in a kernel dispatch.
+    Dependency Wait Cycles: The number of cycles a wavefront in the kernel dispatch
+      was stalled waiting on memory of any kind (e.g., instruction fetch, vector or scalar
+      memory) per normalization unit. This is averaged over all wavefronts in a kernel dispatch.
+    Issue Wait Cycles: The number of cycles a wavefront in the kernel dispatch was
+      unable to issue an instruction for any reason (e.g., execution pipe back-pressure,
+      arbitration loss, etc.) per normalization unit. This counter is incremented
+      at every cycle by all wavefronts on a CU unable to issue an instruction. As
+      such, it is most useful to get a sense of how waves were spending their time,
+      rather than identification of a precise limiter because another wave could be
+      actively executing while a wave is issue stalled. The sum of this metric, Dependency
+      Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
+    Active Cycles: The average number of cycles a wavefront in the kernel dispatch
+      was actively executing instructions per normalization unit. This measurement
+      is made on a per-wavefront basis, and may include cycles that another wavefront
+      spent actively executing (on another execution unit, for example) or was stalled.
+      As such, it is most useful to get a sense of how waves were spending their time,
+      rather than identification of a precise limiter. The sum of this metric, Issue
+      Wait Cycles and Dependency Wait Cycles should be equal to the total Wave Cycles
+      metric.
+    Wavefront Occupancy: |-
+      The time-averaged number of wavefronts resident on the accelerator over
+      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
+      kernels (less than 1ms).
@@ -2,90 +2,6 @@
 Panel Config:
  id: 1000
  title: Compute Units - Instruction Mix
-  metrics_description:
-    VALU: The total number of vector arithmetic logic unit (VALU) operations issued.
-      These are the workhorses of the compute unit, and are used to execute a wide
-      range of instruction types including floating point operations, non-uniform
-      address calculations, transcendental operations, integer operations, shifts,
-      conditional evaluation, etc.
-    VMEM: The total number of vector memory operations issued. These include most
-      loads, stores and atomic operations and all accesses to generic, global, private
-      and texture memory.
-    LDS: The total number of LDS (also known as shared memory) operations issued.
-      These include loads, stores, atomics, and HIP's __shfl operations.
-    MFMA: The total number of matrix fused multiply-add instructions issued.
-    SALU: The total number of scalar arithmetic logic unit (SALU) operations issued.
-      Typically these are used for address calculations, literal constants, and other
-      operations that are provably uniform across a wavefront. Although scalar memory
-      (SMEM) operations are issued by the SALU, they are counted separately in this
-      section.
-    SMEM: The total number of scalar memory (SMEM) operations issued. These are typically
-      used for loading kernel arguments, base-pointers and loads from HIP's __constant__
-      memory.
-    Branch: The total number of branch operations issued. These typically consist
-      of jump or branch operations and are used to implement control flow.
-    INT32: The total number of instructions operating on 32-bit integer operands issued
-      to the VALU per normalization unit.
-    INT64: The total number of instructions operating on 64-bit integer operands issued
-      to the VALU per normalization unit.
-    F16-ADD: The total number of addition instructions operating on 16-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F16-MUL: The total number of multiplication instructions operating on 16-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F16-FMA: The total number of fused multiply-add instructions operating on 16-bit
-      floating-point operands issued to the VALU per normalization unit.
-    F16-Trans: The total number of transcendental instructions (e.g., sqrt) operating
-      on 16-bit floating-point operands issued to the VALU per normalization unit.
-    F32-ADD: The total number of addition instructions operating on 32-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F32-MUL: The total number of multiplication instructions operating on 32-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F32-FMA: The total number of fused multiply-add instructions operating on 32-bit
-      floating-point operands issued to the VALU per normalization unit.
-    F32-Trans: The total number of transcendental instructions (such as sqrt) operating
-      on 32-bit floating-point operands issued to the VALU per normalization unit.
-    F64-ADD: The total number of addition instructions operating on 64-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F64-MUL: The total number of multiplication instructions operating on 64-bit floating-point
-      operands issued to the VALU per normalization unit.
-    F64-FMA: The total number of fused multiply-add instructions operating on 64-bit
-      floating-point operands issued to the VALU per normalization unit.
-    F64-Trans: The total number of transcendental instructions (such as sqrt) operating
-      on 64-bit floating-point operands issued to the VALU per normalization unit.
-    Conversion: "The total number of type conversion instructions (such as converting\
-      \ data to or from F32\u2194F64) issued to the VALU per normalization unit."
-    Global/Generic Instr: The total number of global & generic memory instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Global/Generic Read: The total number of global & generic memory read instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Global/Generic Write: The total number of global & generic memory write instructions
-      executed on all compute units on the accelerator, per normalization unit.
-    Global/Generic Atomic: The total number of global & generic memory atomic (with
-      and without return) instructions executed on all compute units on the accelerator,
-      per normalization unit.
-    Spill/Stack Instr: The total number of spill/stack memory instructions executed
-      on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Read: The total number of spill/stack memory read instructions executed
-      on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Write: The total number of spill/stack memory write instructions executed
-      on all compute units on the accelerator, per normalization unit.
-    Spill/Stack Atomic: The total number of spill/stack memory atomic (with and without
-      return) instructions executed on all compute units on the accelerator, per normalization
-      unit. Typically unused as these memory operations are typically used to implement
-      thread-local storage.
-    MFMA-I8: The total number of 8-bit integer MFMA instructions issued per normalization
-      unit.
-    MFMA-F8: The total number of 8-bit floating point MFMA instructions issued per
-      normalization unit. This is supported in AMD Instinct MI300 series and later
-      only.
-    MFMA-F16: The total number of 16-bit floating point MFMA instructions issued per
-      normalization unit.
-    MFMA-BF16: The total number of 16-bit brain floating point MFMA instructions issued
-      per normalization unit.
-    MFMA-F32: The total number of 32-bit floating-point MFMA instructions issued per
-      normalization unit.
-    MFMA-F64: The total number of 64-bit floating-point MFMA instructions issued per
-      normalization unit.
  data source:
  - metric_table:
      id: 1001
@@ -307,3 +223,88 @@ Panel Config:
          min: MIN((SQ_INSTS_VALU_MFMA_F64 / $denom))
          max: MAX((SQ_INSTS_VALU_MFMA_F64 / $denom))
          unit: (instr + $normUnit)
+  metrics_description:
+    VALU: The total number of vector arithmetic logic unit (VALU) operations issued.
+      These are the workhorses of the compute unit, and are used to execute a wide
+      range of instruction types including floating point operations, non-uniform
+      address calculations, transcendental operations, integer operations, shifts,
+      conditional evaluation, etc.
+    VMEM: The total number of vector memory operations issued. These include most
+      loads, stores and atomic operations and all accesses to generic, global, private
+      and texture memory.
+    LDS: The total number of LDS (also known as shared memory) operations issued.
+      These include loads, stores, atomics, and HIP's __shfl operations.
+    MFMA: The total number of matrix fused multiply-add instructions issued.
+    SALU: The total number of scalar arithmetic logic unit (SALU) operations issued.
+      Typically these are used for address calculations, literal constants, and other
+      operations that are provably uniform across a wavefront. Although scalar memory
+      (SMEM) operations are issued by the SALU, they are counted separately in this
+      section.
+    SMEM: The total number of scalar memory (SMEM) operations issued. These are typically
+      used for loading kernel arguments, base-pointers and loads from HIP's __constant__
+      memory.
+    Branch: The total number of branch operations issued. These typically consist
+      of jump or branch operations and are used to implement control flow.
+    INT32: The total number of instructions operating on 32-bit integer operands issued
+      to the VALU per normalization unit.
+    INT64: The total number of instructions operating on 64-bit integer operands issued
+      to the VALU per normalization unit.
+    F16-ADD: The total number of addition instructions operating on 16-bit floating-point
+      operands issued to the VALU per normalization unit.
+    F16-MUL: The total number of multiplication instructions operating on 16-bit floating-point
+      operands issued to the VALU per normalization unit.
+    F16-FMA: The total number of fused multiply-add instructions operating on 16-bit
+      floating-point operands issued to the VALU per normalization unit.
+    F16-Trans: The total number of transcendental instructions (e.g., sqrt) operating
+      on 16-bit floating-point operands issued to the VALU per normalization unit.
+    F32-ADD: The total number of addition instructions operating on 32-bit floating-point
+      operands issued to the VALU per normalization unit.
+    F32-MUL: The total number of multiplication instructions operating on 32-bit floating-point
+      operands issued to the VALU per normalization unit.
+    F32-FMA: The total number of fused multiply-add instructions operating on 32-bit
+      floating-point operands issued to the VALU per normalization unit.
+    F32-Trans: The total number of transcendental instructions (such as sqrt) operating
+      on 32-bit floating-point operands issued to the VALU per normalization unit.
+    F64-ADD: The total number of addition instructions operating on 64-bit floating-point
+      operands issued to the VALU per normalization unit.
+    F64-MUL: The total number of multiplication instructions operating on 64-bit floating-point
+      operands issued to the VALU per normalization unit.
+    F64-FMA: The total number of fused multiply-add instructions operating on 64-bit
+      floating-point operands issued to the VALU per normalization unit.
+    F64-Trans: The total number of transcendental instructions (such as sqrt) operating
+      on 64-bit floating-point operands issued to the VALU per normalization unit.
+    Conversion: |-
+      The total number of type conversion instructions (such as F32↔F64
+      conversions) issued to the VALU per normalization unit.
+    Global/Generic Instr: The total number of global & generic memory instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Global/Generic Read: The total number of global & generic memory read instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Global/Generic Write: The total number of global & generic memory write instructions
+      executed on all compute units on the accelerator, per normalization unit.
+    Global/Generic Atomic: The total number of global & generic memory atomic (with
+      and without return) instructions executed on all compute units on the accelerator,
+      per normalization unit.
+    Spill/Stack Instr: The total number of spill/stack memory instructions executed
+      on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Read: The total number of spill/stack memory read instructions executed
+      on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Write: The total number of spill/stack memory write instructions executed
+      on all compute units on the accelerator, per normalization unit.
+    Spill/Stack Atomic: The total number of spill/stack memory atomic (with and without
+      return) instructions executed on all compute units on the accelerator, per normalization
+      unit. Typically unused, as these memory operations are mainly used to implement
+      thread-local storage.
+    MFMA-I8: The total number of 8-bit integer MFMA instructions issued per normalization
+      unit.
+    MFMA-F8: The total number of 8-bit floating point MFMA instructions issued per
+      normalization unit. This is supported in AMD Instinct MI300 series and later
+      only.
+    MFMA-F16: The total number of 16-bit floating point MFMA instructions issued per
+      normalization unit.
+    MFMA-BF16: The total number of 16-bit brain floating point MFMA instructions issued
+      per normalization unit.
+    MFMA-F32: The total number of 32-bit floating-point MFMA instructions issued per
+      normalization unit.
+    MFMA-F64: The total number of 64-bit floating-point MFMA instructions issued per
+      normalization unit.
@@ -2,84 +2,6 @@
 Panel Config:
  id: 1100
  title: Compute Units - Compute Pipeline
-  metrics_description:
-    VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
-      This is also presented as a percent of the peak theoretical FLOPs achievable
-      on the specific accelerator. Note: this does not include any floating-point
-      operations from MFMA instructions.'
-    VALU IOPs: 'The total integer operations executed per second on the VALU. This
-      is also presented as a percent of the peak theoretical IOPs achievable on the
-      specific accelerator. Note: this does not include any integer operations from
-      MFMA instructions.'
-    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
-      executed per second. Note: this does not include any 16-bit brain floating point
-      operations from VALU instructions. This is also presented as a percent of the
-      peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
-      per second. Note: this does not include any 16-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F16 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
-      per second. Note: this does not include any 32-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F32 MFMA operations achievable on the specific accelerator.'
-    MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
-      per second. Note: this does not include any 64-bit floating point operations
-      from VALU instructions. This is also presented as a percent of the peak theoretical
-      F64 MFMA operations achievable on the specific accelerator.'
-    MFMA IOPs (INT8): 'The total number of 8-bit integer MFMA operations executed
-      per second. Note: this does not include any 8-bit integer operations from VALU
-      instructions. This is also presented as a percent of the peak theoretical INT8
-      MFMA operations achievable on the specific accelerator.'
-    IPC: The ratio of the total number of instructions executed on the CU over the
-      total active CU cycles.
-    IPC (Issued): The ratio of the total number of (non-internal) instructions issued
-      over the number of cycles where the scheduler was actively working on issuing
-      instructions.
-    SALU Utilization: Indicates what percent of the kernel's duration the SALU was
-      busy executing instructions. Computed as the ratio of the total number of cycles
-      spent by the scheduler issuing SALU / SMEM instructions over the total CU cycles.
-    VALU Utilization: Indicates what percent of the kernel's duration the VALU was
-      busy executing instructions. Does not include VMEM operations. Computed as the
-      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
-      over the total CU cycles.
-    VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
-      was busy executing instructions, including both global/generic and spill/scratch
-      operations (see the VMEM instruction count metrics for more detail). Does not
-      include VALU operations. Computed as the ratio of the total number of cycles
-      spent by the scheduler issuing VMEM instructions over the total CU cycles.
-    Branch Utilization: Indicates what percent of the kernel's duration the branch
-      unit was busy executing instructions. Computed as the ratio of the total number
-      of cycles spent by the scheduler issuing branch instructions over the total
-      CU cycles.
-    VALU Active Threads: Indicates the average level of divergence within a wavefront
-      over the lifetime of the kernel. The number of work-items that were active in
-      a wavefront during execution of each VALU instruction, time-averaged over all
-      VALU instructions run on all wavefronts in the kernel
-    MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
-      was busy executing instructions. Computed as the ratio of the total number of
-      cycles spent by the MFMA was busy over the total CU cycles.
-    MFMA Instruction Cycles: The average duration of MFMA instructions in this kernel
-      in cycles. Computed as the ratio of the total number of cycles the MFMA unit
-      was busy over the total number of MFMA instructions.
-    VMEM Latency: The average number of round-trip cycles (that is, from issue to
-      data return / acknowledgment) required for a VMEM instruction to complete.
-    SMEM Latency: The average number of round-trip cycles (that is, from issue to
-      data return / acknowledgment) required for a SMEM instruction to complete.
-    FLOPs (Total): The total number of floating-point operations executed on either
-      the VALU or MFMA units, per normalization unit.
-    IOPs (Total): The total number of integer operations executed on either the VALU
-      or MFMA units, per normalization unit.
-    F16 OPs: The total number of 16-bit floating-point operations executed on either
-      the VALU or MFMA units, per normalization unit.
-    BF16 OPs: The total number of 16-bit brain floating-point operations executed
-      on either the VALU or MFMA units, per normalization unit.
-    F32 OPs: The total number of 32-bit floating-point operations executed on either
-      the VALU or MFMA units, per normalization unit.
-    F64 OPs: The total number of 64-bit floating-point operations executed on either
-      the VALU or MFMA units, per normalization unit.
-    INT8 OPs: The total number of 8-bit integer operations executed on either the
-      VALU or MFMA units, per normalization unit.
  data source:
  - metric_table:
      id: 1101
@@ -165,13 +87,13 @@ Panel Config:
          unit: Instr/cycle
        IPC (Issued):
          avg: AVG(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
-            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED  + SQ_INSTS_LDS)
+            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
            / SQ_ACTIVE_INST_ANY))
          min: MIN(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
            / SQ_ACTIVE_INST_ANY))
          max: MAX(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
-            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED  + SQ_INSTS_LDS)
+            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
            / SQ_ACTIVE_INST_ANY))
          unit: Instr/cycle
        SALU Utilization:
@@ -271,7 +193,7 @@ Panel Config:
            + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
            + (SQ_INSTS_VALU_FMA_F64 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
            / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        IOPs (Total):
          avg: AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8
            * 512)) / $denom)
@@ -279,12 +201,12 @@ Panel Config:
            * 512)) / $denom)
          max: MAX(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8
            * 512)) / $denom)
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        F8 OPs:
          avg: AVG(((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) / $denom))
          min: MIN(((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) / $denom))
          max: MAX(((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        F16 OPs:
          avg: AVG(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16))
            + (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512
@@ -295,12 +217,12 @@ Panel Config:
          max: MAX(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16))
            + (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512
            * SQ_INSTS_VALU_MFMA_MOPS_F16)) / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        BF16 OPs:
          avg: AVG(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
          min: MIN(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
          max: MAX(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        F32 OPs:
          avg: AVG((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
            + (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
@@ -311,7 +233,7 @@ Panel Config:
          max: MAX((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
            + (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
            / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        F64 OPs:
          avg: AVG((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
            + (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
@@ -322,9 +244,94 @@ Panel Config:
          max: MAX((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
            + (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
            / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
        INT8 OPs:
          avg: AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
          min: MIN(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
          max: MAX(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
-          unit: (OPs  + $normUnit)
+          unit: (OPs + $normUnit)
+  metrics_description:
+    VALU FLOPs: |-
+      The total floating-point operations executed per second on the VALU.
+      This is also presented as a percent of the peak theoretical FLOPs achievable
+      on the specific accelerator. Note: this does not include any floating-point
+      operations from MFMA instructions.
+    VALU IOPs: |-
+      The total integer operations executed per second on the VALU. This is
+      also presented as a percent of the peak theoretical IOPs achievable on the
+      specific accelerator. Note: this does not include any integer operations from
+      MFMA instructions.
+    MFMA FLOPs (BF16): |-
+      The total number of 16-bit brain floating point MFMA operations executed
+      per second. Note: this does not include any 16-bit brain floating point operations
+      from VALU instructions. This is also presented as a percent of the peak theoretical
+      BF16 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F16): |-
+      The total number of 16-bit floating point MFMA operations executed per
+      second. Note: this does not include any 16-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F16 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F32): |-
+      The total number of 32-bit floating point MFMA operations executed per
+      second. Note: this does not include any 32-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F32 MFMA operations achievable on the specific accelerator.
+    MFMA FLOPs (F64): |-
+      The total number of 64-bit floating point MFMA operations executed per
+      second. Note: this does not include any 64-bit floating point operations from
+      VALU instructions. This is also presented as a percent of the peak theoretical
+      F64 MFMA operations achievable on the specific accelerator.
+    MFMA IOPs (INT8): |-
+      The total number of 8-bit integer MFMA operations executed per second.
+      Note: this does not include any 8-bit integer operations from VALU instructions.
+      This is also presented as a percent of the peak theoretical INT8 MFMA operations
+      achievable on the specific accelerator.
+    IPC: The ratio of the total number of instructions executed on the CU over the
+      total active CU cycles.
+    IPC (Issued): The ratio of the total number of (non-internal) instructions issued
+      over the number of cycles where the scheduler was actively working on issuing
+      instructions.
+    SALU Utilization: Indicates what percent of the kernel's duration the SALU was
+      busy executing instructions. Computed as the ratio of the total number of cycles
+      spent by the scheduler issuing SALU / SMEM instructions over the total CU cycles.
+    VALU Utilization: Indicates what percent of the kernel's duration the VALU was
+      busy executing instructions. Does not include VMEM operations. Computed as the
+      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
+      over the total CU cycles.
+    VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
+      was busy executing instructions, including both global/generic and spill/scratch
+      operations (see the VMEM instruction count metrics for more detail). Does not
+      include VALU operations. Computed as the ratio of the total number of cycles
+      spent by the scheduler issuing VMEM instructions over the total CU cycles.
+    Branch Utilization: Indicates what percent of the kernel's duration the branch
+      unit was busy executing instructions. Computed as the ratio of the total number
+      of cycles spent by the scheduler issuing branch instructions over the total
+      CU cycles.
+    VALU Active Threads: Indicates the average level of divergence within a wavefront
+      over the lifetime of the kernel. Computed as the number of work-items active in
+      a wavefront during execution of each VALU instruction, time-averaged over all
+      VALU instructions run on all wavefronts in the kernel.
+    MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
+      was busy executing instructions. Computed as the ratio of the total number of
+      cycles the MFMA unit was busy over the total CU cycles.
+    MFMA Instruction Cycles: The average duration of MFMA instructions in this kernel
+      in cycles. Computed as the ratio of the total number of cycles the MFMA unit
+      was busy over the total number of MFMA instructions.
+    VMEM Latency: The average number of round-trip cycles (that is, from issue to
+      data return / acknowledgment) required for a VMEM instruction to complete.
+    SMEM Latency: The average number of round-trip cycles (that is, from issue to
+      data return / acknowledgment) required for a SMEM instruction to complete.
+    FLOPs (Total): The total number of floating-point operations executed on either
+      the VALU or MFMA units, per normalization unit.
+    IOPs (Total): The total number of integer operations executed on either the VALU
+      or MFMA units, per normalization unit.
+    F16 OPs: The total number of 16-bit floating-point operations executed on either
+      the VALU or MFMA units, per normalization unit.
+    BF16 OPs: The total number of 16-bit brain floating-point operations executed
+      on either the VALU or MFMA units, per normalization unit.
+    F32 OPs: The total number of 32-bit floating-point operations executed on either
+      the VALU or MFMA units, per normalization unit.
+    F64 OPs: The total number of 64-bit floating-point operations executed on either
+      the VALU or MFMA units, per normalization unit.
+    INT8 OPs: The total number of 8-bit integer operations executed on either the
+      VALU or MFMA units, per normalization unit.
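As a reading aid for the `F32 OPs` expression in this panel, here is a minimal Python sketch of the same arithmetic. The function name `f32_ops`, the dictionary layout, and the sample counter values are hypothetical. Only the counter names and the factors 64, 2, and 512 are taken from the config. The `AVG`/`MIN`/`MAX` aggregation across kernels is not modeled.

```python
# Sketch of the per-kernel "F32 OPs" arithmetic from the panel config.
# The function name and input layout are illustrative assumptions, not
# part of rocprofiler-compute.

def f32_ops(counters: dict, denom: float = 1.0) -> float:
    """Return F32 operations per normalization unit for one kernel."""
    # VALU part: ADD, MUL and TRANS count once, FMA counts twice,
    # and the sum is scaled by 64 as in the config expression.
    valu = 64 * (
        counters["SQ_INSTS_VALU_ADD_F32"]
        + counters["SQ_INSTS_VALU_MUL_F32"]
        + counters["SQ_INSTS_VALU_TRANS_F32"]
        + 2 * counters["SQ_INSTS_VALU_FMA_F32"]
    )
    # MFMA part: the MOPS counter is scaled by 512 as in the config.
    mfma = 512 * counters["SQ_INSTS_VALU_MFMA_MOPS_F32"]
    return (valu + mfma) / denom


# Made-up counter values, for illustration only.
sample = {
    "SQ_INSTS_VALU_ADD_F32": 10,
    "SQ_INSTS_VALU_MUL_F32": 5,
    "SQ_INSTS_VALU_TRANS_F32": 1,
    "SQ_INSTS_VALU_FMA_F32": 4,
    "SQ_INSTS_VALU_MFMA_MOPS_F32": 2,
}

print(f32_ops(sample))           # 64 * 24 + 512 * 2 = 2560.0
print(f32_ops(sample, denom=2))  # 1280.0
```

The `denom` argument stands in for `$denom`, whose value depends on the selected normalization unit.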
@@ -2,51 +2,6 @@
 Panel Config:
  id: 1200
  title: Local Data Share (LDS)
-  metrics_description:
-    Utilization: Indicates what percent of the kernel's duration the LDS was actively
-      executing instructions (including, but not limited to, load, store, atomic and
-      HIP's __shfl operations). Calculated as the ratio of the total number of cycles
-      LDS was active over the total CU cycles.
-    Access Rate: Indicates the percentage of SIMDs in the VALU actively issuing LDS
-      instructions, averaged over the lifetime of the kernel. Calculated as the ratio
-      of the total number of cycles spent by the scheduler issuing LDS instructions
-      over the total CU cycles.
-    Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
-      could have been loaded from, stored to, or atomically updated in the LDS divided
-      as percentage of theoretical peak. Does not take into account the execution
-      mask of the wavefront when the instruction was executed.
-    Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
-      loaded from, stored to, or atomically updated in the LDS divided by total duration.
-      Does not take into account the execution mask of the wavefront when the instruction
-      was executed.
-    Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
-      servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing
-      bank conflicts over the number of LDS cycles that would have been required to
-      move the same amount of data in an uncontended access.
-    LDS Instructions: The total number of LDS instructions (including, but not limited
-      to, read/write/atomics and HIP's __shfl instructions) executed per normalization
-      unit.
-    LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
-      / acknowledgment) required for an LDS instruction to complete.
-    Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS scheduler
-      due to bank conflicts (as determined by the conflict resolution hardware) to
-      the base number of cycles that would be spent in the LDS scheduler in a completely
-      uncontended case. This is the unnormalized form of the Bank Conflict Rate.
-    Index Accesses: The total number of cycles spent in the LDS scheduler over all
-      operations per normalization unit.
-    Atomic Return Cycles: The total number of cycles spent on LDS atomics with return
-      per normalization unit.
-    Bank Conflict: The total number of cycles spent in the LDS scheduler due to bank
-      conflicts (as determined by the conflict resolution hardware) per normalization
-      unit.
-    Addr Conflict: The total number of cycles spent in the LDS scheduler due to address
-      conflicts (as determined by the conflict resolution hardware) per normalization
-      unit.
-    Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
-      stalls from non-dword aligned addresses per normalization unit.
-    Mem Violations: "The total number of out-of-bounds accesses made to the LDS, per\
-      \ normalization unit. This is unused and expected to be zero in most configurations\
-      \ for modern CDNA\u2122 accelerators."
  data source:
  - metric_table:
      id: 1201
@@ -87,7 +42,7 @@ Panel Config:
          avg: AVG((SQ_INSTS_LDS / $denom))
          min: MIN((SQ_INSTS_LDS / $denom))
          max: MAX((SQ_INSTS_LDS / $denom))
-          unit: (Instr  + $normUnit)
+          unit: (Instr + $normUnit)
        Theoretical Bandwidth:
          avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
            / (End_Timestamp - Start_Timestamp)))
@@ -117,29 +72,75 @@ Panel Config:
          avg: AVG((SQ_LDS_IDX_ACTIVE / $denom))
          min: MIN((SQ_LDS_IDX_ACTIVE / $denom))
          max: MAX((SQ_LDS_IDX_ACTIVE / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Atomic Return Cycles:
          avg: AVG((SQ_LDS_ATOMIC_RETURN / $denom))
          min: MIN((SQ_LDS_ATOMIC_RETURN / $denom))
          max: MAX((SQ_LDS_ATOMIC_RETURN / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Bank Conflict:
          avg: AVG((SQ_LDS_BANK_CONFLICT / $denom))
          min: MIN((SQ_LDS_BANK_CONFLICT / $denom))
          max: MAX((SQ_LDS_BANK_CONFLICT / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Addr Conflict:
          avg: AVG((SQ_LDS_ADDR_CONFLICT / $denom))
          min: MIN((SQ_LDS_ADDR_CONFLICT / $denom))
          max: MAX((SQ_LDS_ADDR_CONFLICT / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Unaligned Stall:
          avg: AVG((SQ_LDS_UNALIGNED_STALL / $denom))
          min: MIN((SQ_LDS_UNALIGNED_STALL / $denom))
          max: MAX((SQ_LDS_UNALIGNED_STALL / $denom))
-          unit: (Cycles  + $normUnit)
+          unit: (Cycles + $normUnit)
        Mem Violations:
          avg: AVG((SQ_LDS_MEM_VIOLATIONS / $denom))
          min: MIN((SQ_LDS_MEM_VIOLATIONS / $denom))
          max: MAX((SQ_LDS_MEM_VIOLATIONS / $denom))
          unit: (Accesses + $normUnit)
+  metrics_description:
+    Utilization: Indicates what percent of the kernel's duration the LDS was actively
+      executing instructions (including, but not limited to, load, store, atomic and
+      HIP's __shfl operations). Calculated as the ratio of the total number of cycles
+      LDS was active over the total CU cycles.
+    Access Rate: Indicates the percentage of SIMDs in the VALU actively issuing LDS
+      instructions, averaged over the lifetime of the kernel. Calculated as the ratio
+      of the total number of cycles spent by the scheduler issuing LDS instructions
+      over the total CU cycles.
+    Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
+      could have been loaded from, stored to, or atomically updated in the LDS, expressed
+      as a percentage of the theoretical peak. Does not take into account the execution
+      mask of the wavefront when the instruction was executed.
+    Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
+      loaded from, stored to, or atomically updated in the LDS divided by total duration.
+      Does not take into account the execution mask of the wavefront when the instruction
+      was executed.
+    Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
+      servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing
+      bank conflicts over the number of LDS cycles that would have been required to
+      move the same amount of data in an uncontended access.
+    LDS Instructions: The total number of LDS instructions (including, but not limited
+      to, read/write/atomics and HIP's __shfl instructions) executed per normalization
+      unit.
+    LDS Latency: The average number of round-trip cycles (that is, from issue to data
+      return / acknowledgment) required for an LDS instruction to complete.
+    Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS scheduler
+      due to bank conflicts (as determined by the conflict resolution hardware) to
+      the base number of cycles that would be spent in the LDS scheduler in a completely
+      uncontended case. This is the unnormalized form of the Bank Conflict Rate.
+    Index Accesses: The total number of cycles spent in the LDS scheduler over all
+      operations per normalization unit.
+    Atomic Return Cycles: The total number of cycles spent on LDS atomics with return
+      per normalization unit.
+    Bank Conflict: The total number of cycles spent in the LDS scheduler due to bank
+      conflicts (as determined by the conflict resolution hardware) per normalization
+      unit.
+    Addr Conflict: The total number of cycles spent in the LDS scheduler due to address
+      conflicts (as determined by the conflict resolution hardware) per normalization
+      unit.
+    Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
+      stalls from non-dword aligned addresses per normalization unit.
+    Mem Violations: |-
+      The total number of out-of-bounds accesses made to the LDS, per normalization
+      unit. This is unused and expected to be zero in most configurations for
+      modern CDNA™ accelerators.
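The `Theoretical Bandwidth` expression in this panel can be read as the following minimal Python sketch. The function name, argument names, and sample values are hypothetical. The counter roles and the factor 4 come from the config expression. The unit of the result depends on the timestamp unit, which this excerpt does not state.

```python
# Sketch of the per-kernel LDS "Theoretical Bandwidth" arithmetic.
# Names and sample values are illustrative assumptions only.

def lds_theoretical_bandwidth(
    idx_active: int,
    bank_conflict: int,
    lds_banks_per_cu: int,
    start_timestamp: int,
    end_timestamp: int,
) -> float:
    """Return bytes moved per timestamp unit, ignoring the execution mask."""
    # Cycles not spent on bank conflicts, as in
    # (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT).
    useful_cycles = idx_active - bank_conflict
    # Factor 4 and the per-CU bank count are taken from the config.
    bytes_moved = useful_cycles * 4 * int(lds_banks_per_cu)
    return bytes_moved / (end_timestamp - start_timestamp)


# Made-up values: 800 useful cycles * 4 * 32 banks = 102400 bytes over 1024 units.
print(lds_theoretical_bandwidth(1000, 200, 32, 0, 1024))  # 100.0
```

Subtracting the conflict cycles first is what makes this an upper bound on the data that could have moved, rather than a measurement of what did.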