[rocprof-compute] Generalize Roofline (#325)

* per kernel analysis Roofline
* added per-kernel eval_metric calculation with display
* fixed typo
* updated tty.py show_all()
* formatting
* fixed ctest failures and updated equations
* formatting
* updated metric descriptions
* review tweaks
* update docs
* added roofline gui analysis
* updated GUI docs
* updated print statement
* comment tweaks and ran ruff formatting
2025-08-20 09:58:08 -04:00
parent 71b725f307
commit 5840940caa
26 changed files with 2612 additions and 158 deletions
@@ -19,6 +19,9 @@ This section provides an overview of ROCm Compute Profiler's CLI analysis featur
 * :ref:`Filtering <cli-analysis-options>`: Hone in on a particular kernel,
  GPU ID, or dispatch ID via post-process filtering.

+* :ref:`Per-kernel roofline analysis <per-kernel-roofline>`: Detailed arithmetic
+  intensity and performance analysis for individual kernels.
+
 Run ``rocprof-compute analyze -h`` for more details.

 .. _cli-walkthrough:
@@ -32,7 +35,7 @@ There are three high-level GPU analysis views:

 * System Speed-of-Light: Key GPU performance metrics to show overall GPU performance and utilization.
 * Memory chart: Shows memory transactions and throughput on each cache hierarchical level.
-* Empirical hierarchical roofline: Roofline model that compares achieved throughput with attainable peak hardware limits, more specifically peak compute throughput and memory bandwidth (on L1/LDS/L2/HBM).
+* Empirical hierarchical roofline: Roofline model that compares achieved throughput with attainable peak hardware limits, more specifically peak compute throughput and memory bandwidth (on L1/LDS/L2/HBM). When combined with kernel filtering, provides detailed per-kernel arithmetic intensity analysis and performance breakdowns.

 **System Speed-of-Light:**

@@ -67,7 +70,7 @@ There are three high-level GPU analysis views:
 .. note::
   * Visualized memory chart and Roofline chart are only supported in single run analysis. In multiple runs comparison mode, both are switched back to basic table view.
   * Visualized memory chart requires the width of the terminal output to be greater than or equal to 234 to display the whole chart properly.
-   * Visualized Roofline chart is adapted to the initial terminal size only. If it is not clear, you may need to adjust the terminal size and regenerate it to check the display effect.
+   * Visualized Roofline chart is adapted to the initial terminal size only. If it is not clear, you may need to adjust the terminal size and regenerate it to check the display effect. Roofline analysis provides detailed, structured table output with measured empirical peak values for comparison.

 .. _cli-list-metrics:

@@ -309,6 +312,67 @@ Filter kernels
  You should see your filtered kernels indicated by an asterisk in the **Top
  Stats** table.

+.. _per-kernel-roofline:
+
+Per-kernel roofline analysis
+  When analyzing specific kernels, the roofline analysis provides detailed metrics for each filtered kernel:
+
+  .. code-block:: shell-session
+
+     $ rocprof-compute analyze -p workloads/vcopy/MI200/ -k 0 -b 4
+
+  This generates enhanced roofline output showing per-kernel performance rates and arithmetic intensity calculations:
+
+  .. code-block:: text
+
+   ================================================================================
+   4. Roofline
+   ================================================================================
+   (4.1) Per-Kernel Roofline Metrics and (4.2) AI Plot Points
+   --------------------------------------------------------------------------------
+   Kernel 0: vecCopy(double*, double*, double*, int, int) (100.0%)
+      |
+      ├─ 4.1 Roofline Rate Metrics:
+      |   ╒═════════════╤════════════════════╤═══════════════════╤═════════╤════════════════════╕
+      |   │ Metric_ID   │ Metric             │ Value             │ Unit    │   Peak (Empirical) │
+      |   ╞═════════════╪════════════════════╪═══════════════════╪═════════╪════════════════════╡
+      |   │ 4.1.0       │ VALU FLOPs         │                   │ Gflop/s │           61286.40 │
+      |   ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      |   │ 4.1.1       │ MFMA FLOPs (F64)   │                   │ Gflop/s │          108544.33 │
+      |   ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      |   │ 4.1.2       │ MFMA FLOPs (F32)   │                   │ Gflop/s │          104531.42 │
+      |   ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      |   │ 4.1.3       │ MFMA FLOPs (F16)   │                   │ Gflop/s │          709169.38 │
+      |   ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      |   │ 4.1.4       │ MFMA FLOPs (BF16)  │ 0.0               │ Gflop/s │          388161.09 │
+      |   ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      |   │ 4.1.5       │ MFMA FLOPs (F8)    │ 0.0               │ Gflop/s │         1446089.60 │
+      |   ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      |   │ 4.1.6       │ MFMA IOPs (Int8)   │                   │ Giop/s  │          737317.94 │
+      |   ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      |   │ 4.1.7       │ HBM Bandwidth      │                   │ Gb/s    │            3231.95 │
+      |   ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      |   │ 4.1.8       │ L2 Cache Bandwidth │                   │ Gb/s    │           19096.81 │
+      |   ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      |   │ 4.1.9       │ L1 Cache Bandwidth │ 3880.358726762844 │ Gb/s    │           25006.24 │
+      |   ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      |   │ 4.1.10      │ LDS Bandwidth      │                   │ Gb/s    │           54920.88 │
+      |   ╘═════════════╧════════════════════╧═══════════════════╧═════════╧════════════════════╛
+      ├─ 4.2 Roofline AI Plot Points:
+      |   ╒═════════════╤══════════════════════╤═════════╤════════════╕
+      |   │ Metric_ID   │ Metric               │ Value   │ Unit       │
+      |   ╞═════════════╪══════════════════════╪═════════╪════════════╡
+      |   │ 4.2.0       │ AI HBM               │         │ Flops/byte │
+      |   ├─────────────┼──────────────────────┼─────────┼────────────┤
+      |   │ 4.2.1       │ AI L2                │         │ Flops/byte │
+      |   ├─────────────┼──────────────────────┼─────────┼────────────┤
+      |   │ 4.2.2       │ AI L1                │         │ Flops/byte │
+      |   ├─────────────┼──────────────────────┼─────────┼────────────┤
+      |   │ 4.2.3       │ Performance (GFLOPs) │         │ Gflop/s    │
+      |   ╘═════════════╧══════════════════════╧═════════╧════════════╛
+
+  The per-kernel values are computed by evaluating the metric expressions defined in the Roofline panel's YAML configuration.
+
+  Analyze multiple kernels for comparison:
+
+  .. code-block:: shell-session
+
+     $ rocprof-compute analyze -p workloads/vcopy/MI200/ -k 0 1 2 -b 4

 Baseline comparison
  .. code-block:: shell
@@ -83,6 +83,7 @@ application's profiling data:
 #. Top Stats (Top Kernel Statistics)
 #. System Info
 #. System Speed-of-Light
+#. Roofline AI Data Metrics

 To dive deeper, use the dropdown menus at the top of the screen to isolate
 particular kernels or dispatches. You should see the web page update with
@@ -307,7 +307,7 @@ Examples:
            "\t\t\t  For stochastic sampling, the interval is in cycles.\n"
            "\t\t\t  For host_trap sampling, the interval is in microsecond "
            "(DEFAULT: 1048576)."
-        )
+        ),
    )

    profile_group.add_argument(
@@ -32,6 +32,6 @@ PROJECT_NAME = "rocprofiler-compute"
 HIDDEN_COLUMNS = ["coll_level"]
 HIDDEN_COLUMNS_CLI = ["Description", "coll_level"]
 HIDDEN_COLUMNS_TUI = ["Description", "coll_level"]
-HIDDEN_SECTIONS = [400, 1900, 2000]
+HIDDEN_SECTIONS = [1900, 2000]

 TIME_UNITS = {"s": 10**9, "ms": 10**6, "us": 10**3, "ns": 1}
@@ -30,8 +30,16 @@ from abc import abstractmethod
 from collections import OrderedDict
 from pathlib import Path

+import pandas as pd
+
 from utils import file_io, parser, schema
-from utils.logger import console_debug, console_error, console_log, demarcate
+from utils.logger import (
+    console_debug,
+    console_error,
+    console_log,
+    console_warning,
+    demarcate,
+)
 from utils.utils import is_workload_empty, merge_counters_spatial_multiplex


@@ -189,6 +197,21 @@ class OmniAnalyze_Base:
                else file_io.find_1st_sub_dir(d[0])
            )
            w.sys_info = file_io.load_sys_info(sysinfo_path.joinpath("sysinfo.csv"))
+
+            if not getattr(self.get_args(), "no_roof", False):
+                try:
+                    roofline_path = sysinfo_path.joinpath("roofline.csv")
+                    roofline_df = pd.read_csv(roofline_path)
+
+                    # use original column names from roofline.csv directly
+                    w.roofline_peaks = roofline_df
+
+                except FileNotFoundError:
+                    console_warning("roofline.csv not found.")
+                    w.roofline_peaks = pd.DataFrame()
+            else:
+                w.roofline_peaks = pd.DataFrame()
+
            arch = w.sys_info.iloc[0]["gpu_arch"]
            mspec = self.get_socs()[arch]._mspec
            if self.__args.specs_correction:
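The fallback added in this hunk (load ``roofline.csv`` when present, otherwise continue with an empty table) can be tried in isolation. The following is a minimal sketch, not the project's code: it swaps ``pd.read_csv`` for the standard-library ``csv`` module so it runs without dependencies, and the helper name ``load_roofline_peaks`` and the ``HBMBw`` column name are hypothetical, chosen only for illustration.

```python
import csv
import tempfile
from pathlib import Path


def load_roofline_peaks(sysinfo_dir, no_roof=False):
    """Return roofline.csv rows as a list of dicts, or [] when unavailable."""
    if no_roof:
        # Mirrors the no_roof branch of the hunk: do not touch the file.
        return []
    try:
        with open(Path(sysinfo_dir) / "roofline.csv", newline="") as handle:
            # Keep the column names exactly as they appear in the CSV header.
            return list(csv.DictReader(handle))
    except FileNotFoundError:
        # Same recovery as the hunk: warn and fall back to an empty table.
        print("roofline.csv not found.")
        return []


with tempfile.TemporaryDirectory() as workdir:
    missing = load_roofline_peaks(workdir)  # file does not exist yet
    # Hypothetical single-column file, for illustration only.
    (Path(workdir) / "roofline.csv").write_text("HBMBw\n3231.95\n")
    loaded = load_roofline_peaks(workdir)
    skipped = load_roofline_peaks(workdir, no_roof=True)
```

As in the hunk, only a missing file is handled; a malformed or empty file would still raise in the pandas version and would need its own handling if that case matters.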
@@ -40,8 +40,9 @@ class cli_analysis(OmniAnalyze_Base):
        if self.get_args().random_port:
            console_error("--gui flag is required to enable --random-port")
        for d in self.get_args().path:
+            workload = self._runs[d[0]]
            # create 'mega dataframe'
-            self._runs[d[0]].raw_pmc = file_io.create_df_pmc(
+            workload.raw_pmc = file_io.create_df_pmc(
                d[0],
                self.get_args().nodes,
                self.get_args().spatial_multiplexing,
@@ -51,29 +52,27 @@ class cli_analysis(OmniAnalyze_Base):
            )

            if self.get_args().spatial_multiplexing:
-                self._runs[d[0]].raw_pmc = self.spatial_multiplex_merge_counters(
-                    self._runs[d[0]].raw_pmc
+                workload.raw_pmc = self.spatial_multiplex_merge_counters(
+                    workload.raw_pmc
                )

            file_io.create_df_kernel_top_stats(
-                df_in=self._runs[d[0]].raw_pmc,
+                df_in=workload.raw_pmc,
                raw_data_dir=d[0],
-                filter_gpu_ids=self._runs[d[0]].filter_gpu_ids,
-                filter_dispatch_ids=self._runs[d[0]].filter_dispatch_ids,
-                filter_nodes=self._runs[d[0]].filter_nodes,
+                filter_gpu_ids=workload.filter_gpu_ids,
+                filter_dispatch_ids=workload.filter_dispatch_ids,
+                filter_nodes=workload.filter_nodes,
                time_unit=self.get_args().time_unit,
                max_stat_num=self.get_args().max_stat_num,
                kernel_verbose=self.get_args().kernel_verbose,
            )

            # demangle and overwrite original 'Kernel_Name'
-            kernel_name_shortener(
-                self._runs[d[0]].raw_pmc, self.get_args().kernel_verbose
-            )
+            kernel_name_shortener(workload.raw_pmc, self.get_args().kernel_verbose)

            # create the loaded table
            parser.load_table_data(
-                workload=self._runs[d[0]],
+                workload=workload,
                dir=d[0],
                is_gui=False,
                args=self.get_args(),
@@ -85,42 +84,41 @@ class cli_analysis(OmniAnalyze_Base):
        """Run CLI analysis."""
        super().run_analysis()

+        workload_path = self.get_args().path[0][0]
+        workload = self._runs[workload_path]
+        gpu_arch = workload.sys_info.iloc[0]["gpu_arch"]
+        arch_config = self._arch_configs[gpu_arch]
+
        if self.get_args().list_stats:
            tty.show_kernel_stats(
                self.get_args(),
                self._runs,
-                self._arch_configs[
-                    self._runs[self.get_args().path[0][0]].sys_info.iloc[0]["gpu_arch"]
-                ],
+                arch_config,
                self._output,
            )
        else:
            roof_plot = None
            # 1. check if not baseline && compatible soc:
-            if (len(self.get_args().path)) == 1 and self._runs[
-                self.get_args().path[0][0]
-            ].sys_info.iloc[0]["gpu_arch"] in [
-                "gfx90a",
-                "gfx940",
-                "gfx941",
-                "gfx942",
-                "gfx950",
-            ]:
-                # add roofline plot to cli output
-                roof_obj = self.get_socs()[
-                    self._runs[self.get_args().path[0][0]].sys_info.iloc[0]["gpu_arch"]
-                ].roofline_obj
+            if (len(self.get_args().path)) == 1:
+                if gpu_arch in ["gfx90a", "gfx940", "gfx941", "gfx942", "gfx950"]:
+                    roof_obj = self.get_socs()[gpu_arch].roofline_obj

-                if roof_obj:
-                    # NOTE: using default data type
-                    roof_plot = roof_obj.cli_generate_plot(roof_obj.get_dtype()[0])
+                    if roof_obj:
+                        # store path in workload for calc_ai_analyze
+                        workload.path = workload_path
+
+                        # NOTE: using default data type
+                        roof_plot = roof_obj.cli_generate_plot(
+                            dtype=roof_obj.get_dtype()[0],
+                            workload=workload,
+                            config=self._profiling_config,
+                            arch_config=arch_config,
+                        )

            tty.show_all(
                self.get_args(),
                self._runs,
-                self._arch_configs[
-                    self._runs[self.get_args().path[0][0]].sys_info.iloc[0]["gpu_arch"]
-                ],
+                arch_config,
                self._output,
                self._profiling_config,
                roof_plot=roof_plot,
@@ -48,7 +48,7 @@ class webui_analysis(OmniAnalyze_Base):
        self.dest_dir = str(Path(args.path[0][0]).absolute().resolve())
        self.arch = None

-        self.__hidden_sections = ["Memory Chart", "Roofline"]
+        self.__hidden_sections = ["Memory Chart"]
        self.__hidden_columns = HIDDEN_COLUMNS
        # define different types of bar charts
        self.__barchart_elements = {
@@ -151,7 +151,7 @@ class webui_analysis(OmniAnalyze_Base):
            # Only display basic metrics if no filters are applied
            if not (disp_filt or kernel_filter or gcd_filter):
                temp = {}
-                keep = [1, 2, 101, 201, 301, 401]
+                keep = [1, 2, 101, 201, 301, 401, 402]
                for key in base_data[base_run].dfs:
                    if keep.count(key) != 0:
                        temp[key] = base_data[base_run].dfs[key]
@@ -219,7 +219,6 @@ class webui_analysis(OmniAnalyze_Base):
                    .lower()
                )
                html_section = []
-
                if panel["title"] not in self.__hidden_sections:
                    # Iterate over each table per section
                    for data_source in panel["data source"]:
@@ -289,7 +289,7 @@ class RocProfCompute:
        if sets_info:
            first_set = next(iter(sets_info.keys()))
            print(f"  rocprof-compute profile --set {first_set}  # Profile this set")
-        print(f"  rocprof-compute profile --list-sets        # Show this help")
+        print("  rocprof-compute profile --list-sets        # Show this help")
        print()

        sys.exit(0)
@@ -2,8 +2,191 @@
 Panel Config:
  id: 400
  title: Roofline
-  metrics_description: {}
+  metrics_description:
+    VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
+      This is also presented as a percent of the peak theoretical FLOPs achievable
+      on the specific accelerator. Note: this does not include any floating-point
+      operations from MFMA instructions.'
+    MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
+      executed per second. This does not include any 8-bit floating point operations
+      from VALU instructions. The peak empirically measured F8 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison. It is supported
+      on AMD Instinct MI300 series and later only.
+    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
+      executed per second. Note: this does not include any 16-bit brain floating point
+      operations from VALU instructions. The peak empirically measured BF16 MFMA operations
+      achievable on the specific accelerator is displayed alongside for comparison.'
+    MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
+      per second. Note: this does not include any 16-bit floating point operations
+      from VALU instructions. The peak empirically measured F16 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison.'
+    MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
+      per second. Note: this does not include any 32-bit floating point operations
+      from VALU instructions. The peak empirically measured F32 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison.'
+    MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
+      per second. Note: this does not include any 64-bit floating point operations
+      from VALU instructions. The peak empirically measured F64 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison.'
+    MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
+      per second. Note: this does not include any 8-bit integer operations from VALU
+      instructions. The peak empirically measured INT8 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison.'
+    HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth
+      Memory (HBM) per second. The peak empirically measured bandwidth achievable
+      on the specific accelerator is displayed alongside for comparison.
+    L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
+      The number of bytes is calculated as the number of cache lines requested multiplied
+      by the cache line size. This value does not consider partial requests, so e.g.,
+      if only a single value is requested in a cache line, the data movement will
+      still be counted as a full cache line. The peak empirically measured bandwidth
+      achievable on the specific accelerator is displayed alongside for comparison.
+    L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
+      of VMEM instructions per unit time. The number of bytes is calculated as the
+      number of cache lines requested multiplied by the cache line size. This value
+      does not consider partial requests, so e.g., if only a single value is requested
+      in a cache line, the data movement will still be counted as a full cache line.
+      The peak empirically measured bandwidth achievable on the specific accelerator
+      is displayed alongside for comparison.
+    LDS Bandwidth: Indicates the maximum number of bytes that could have been loaded
+      from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
+      example for more detail). The peak empirically measured LDS bandwidth achievable
+      on the specific accelerator is displayed alongside for comparison.
+    AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
+      of total floating-point operations (FLOPs) to total bytes transferred between
+      the L1 cache and the processing units. This value is used as the x-coordinate
+      for the L1 roofline.
+    AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
+      of total floating-point operations (FLOPs) to total bytes transferred between
+      the L2 cache and the L1 cache. This value is used as the x-coordinate for the
+      L2 roofline.
+    AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
+      It is the ratio of total floating-point operations (FLOPs) to total bytes transferred
+      between HBM and the L2 cache. This value is used as the x-coordinate for the
+      HBM roofline.
+    Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs
+      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
+      operations divided by the total execution time. This value is used as the y-coordinate
+      for the kernel's point on the Roofline plot.
  data source:
-  - None:
+  - metric_table:
      id: 401
-      title: Roofline
+      title: Roofline Performance Rates
+      cli_style: Roofline
+      tui_style: Roofline
+      header:
+        metric: Metric
+        value: Value
+        unit: Unit
+        peak: Peak (Empirical)
+      metric:
+        VALU FLOPs:
+          value: AVG((($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) / ((End_Timestamp - Start_Timestamp) / 1e9))
+            / 1e9)
+          unit: GFLOP/s
+          peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
+        MFMA FLOPs (F64):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMAF64Flops_empirical_peak
+        MFMA FLOPs (F32):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMAF32Flops_empirical_peak
+        MFMA FLOPs (F16):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMAF16Flops_empirical_peak
+        MFMA FLOPs (BF16):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMABF16Flops_empirical_peak
+        MFMA IOPs (Int8):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GIOP/s
+          peak: $MFMAI8Ops_empirical_peak
+        HBM Bandwidth:
+          value: AVG((( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum
+            - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + ((TCC_EA0_WRREQ_sum
+            - TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp
+            - Start_Timestamp) / 1e9)) / 1e9)
+          unit: GB/s
+          peak: $HBMBw_empirical_peak
+        L2 Cache Bandwidth:
+          value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
+            TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64)) / ((End_Timestamp
+            - Start_Timestamp) / 1e9)) / 1e9)
+          unit: GB/s
+          peak: $L2Bw_empirical_peak
+        L1 Cache Bandwidth:
+          value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GB/s
+          peak: $L1Bw_empirical_peak
+        LDS Bandwidth:
+          value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4 * $lds_banks_per_cu))
+            / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+          unit: GB/s
+          peak: $LDSBw_empirical_peak
+  - metric_table:
+      id: 402
+      title: Roofline Plot Points
+      cli_style: Roofline
+      tui_style: Roofline
+      header:
+        metric: Metric
+        value: Value
+        unit: Unit
+      metric:
+        AI HBM:
+          value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+            (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / SUM( (TCC_BUBBLE_sum *
+            128) + (TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum
+            - TCC_EA0_RDREQ_32B_sum) * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
+            * 32) + (TCC_EA0_WRREQ_64B_sum * 64) ) )
+          unit: FLOPs/Byte
+        AI L2:
+          value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+            (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / SUM( (TCP_TCC_WRITE_REQ_sum
+            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum
+            + TCP_TCC_READ_REQ_sum) * 64 ) )
+          unit: FLOPs/Byte
+        AI L1:
+          value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+            (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / SUM(TCP_TOTAL_CACHE_ACCESSES_sum
+            * 64) )
+          unit: FLOPs/Byte
+        Performance (GFLOPs):
+          value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+            (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / (SUM(End_Timestamp - Start_Timestamp)
+            / 1e9) ) / 1e9
+          unit: GFLOP/s
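The plot-point expressions in table 402 are easier to check when written out as ordinary code. The sketch below re-implements the ``AI L1`` and ``Performance (GFLOPs)`` expressions for a list of per-dispatch counter dictionaries. It is an illustration under stated assumptions, not the tool's evaluator: ``$wave_size`` is assumed to be 64, timestamps are assumed to be in nanoseconds (which the ``/ 1e9`` in the expressions suggests), and the function names and counter values are made up.

```python
FLOAT_TYPES = ("F16", "F32", "F64")
MFMA_TYPES = ("F16", "BF16", "F32", "F64")


def valu_flops(counters, wave_size=64):
    """VALU FLOPs: ADD, MUL and TRANS count once, FMA counts twice."""
    per_lane = sum(
        counters.get(f"SQ_INSTS_VALU_ADD_{t}", 0)
        + counters.get(f"SQ_INSTS_VALU_MUL_{t}", 0)
        + 2 * counters.get(f"SQ_INSTS_VALU_FMA_{t}", 0)
        + counters.get(f"SQ_INSTS_VALU_TRANS_{t}", 0)
        for t in FLOAT_TYPES
    )
    return wave_size * per_lane


def mfma_flops(counters):
    """MFMA FLOPs: each MOPS counter is scaled by 512, as in the YAML."""
    return 512 * sum(
        counters.get(f"SQ_INSTS_VALU_MFMA_MOPS_{t}", 0) for t in MFMA_TYPES
    )


def roofline_point(dispatches, wave_size=64):
    """Return (AI L1 in FLOPs/byte, performance in GFLOP/s)."""
    total_flops = sum(valu_flops(d, wave_size) + mfma_flops(d) for d in dispatches)
    # L1 traffic: cache-line accesses times the 64-byte factor from the YAML.
    l1_bytes = sum(d.get("TCP_TOTAL_CACHE_ACCESSES_sum", 0) * 64 for d in dispatches)
    seconds = sum(d["End_Timestamp"] - d["Start_Timestamp"] for d in dispatches) / 1e9
    return total_flops / l1_bytes, total_flops / seconds / 1e9


# Made-up counters for two dispatches of the same kernel.
dispatches = [
    {"SQ_INSTS_VALU_ADD_F32": 500, "SQ_INSTS_VALU_FMA_F32": 1000,
     "TCP_TOTAL_CACHE_ACCESSES_sum": 2000,
     "Start_Timestamp": 0, "End_Timestamp": 1000},
    {"SQ_INSTS_VALU_MFMA_MOPS_F32": 10,
     "TCP_TOTAL_CACHE_ACCESSES_sum": 1000,
     "Start_Timestamp": 2000, "End_Timestamp": 3000},
]
ai_l1, gflops = roofline_point(dispatches)
print(f"AI L1 = {ai_l1:.2f} FLOPs/byte, performance = {gflops:.2f} GFLOP/s")
```

For these inputs the total is 165120 FLOPs over 192000 bytes and 2 microseconds, giving an arithmetic intensity of 0.86 FLOPs/byte and 82.56 GFLOP/s. The ``AI L2`` and ``AI HBM`` variants differ only in the byte count used as the denominator.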
@@ -2,8 +2,189 @@
 Panel Config:
  id: 400
  title: Roofline
-  metrics_description: {}
+  metrics_description:
+    VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
+      This is also presented as a percent of the peak theoretical FLOPs achievable
+      on the specific accelerator. Note: this does not include any floating-point
+      operations from MFMA instructions.'
+    MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
+      executed per second. This does not include any 8-bit floating point operations
+      from VALU instructions. The peak empirically measured F8 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison. It is supported
+      on AMD Instinct MI300 series and later only.
+    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
+      executed per second. Note: this does not include any 16-bit brain floating point
+      operations from VALU instructions. The peak empirically measured BF16 MFMA operations
+      achievable on the specific accelerator is displayed alongside for comparison.'
+    MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
+      per second. Note: this does not include any 16-bit floating point operations
+      from VALU instructions. The peak empirically measured F16 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison.'
+    MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
+      per second. Note: this does not include any 32-bit floating point operations
+      from VALU instructions. The peak empirically measured F32 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison.'
+    MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
+      per second. Note: this does not include any 64-bit floating point operations
+      from VALU instructions. The peak empirically measured F64 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison.'
+    MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
+      per second. Note: this does not include any 8-bit integer operations from VALU
+      instructions. The peak empirically measured INT8 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison.'
+    HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth
+      Memory (HBM) per second. The peak empirically measured bandwidth achievable
+      on the specific accelerator is displayed alongside for comparison.
+    L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
+      The number of bytes is calculated as the number of cache lines requested multiplied
+      by the cache line size. This value does not consider partial requests, so e.g.,
+      if only a single value is requested in a cache line, the data movement will
+      still be counted as a full cache line. The peak empirically measured bandwidth
+      achievable on the specific accelerator is displayed alongside for comparison.
+    L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
+      of VMEM instructions per unit time. The number of bytes is calculated as the
+      number of cache lines requested multiplied by the cache line size. This value
+      does not consider partial requests, so e.g., if only a single value is requested
+      in a cache line, the data movement will still be counted as a full cache line.
+      The peak empirically measured bandwidth achievable on the specific accelerator
+      is displayed alongside for comparison.
+    LDS Bandwidth: Indicates the maximum number of bytes that could have been loaded
+      from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
+      example for more detail). The peak empirically measured LDS bandwidth achievable
+      on the specific accelerator is displayed alongside for comparison.
+    AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
+      of total floating-point operations (FLOPs) to total bytes transferred between
+      the L1 cache and the processing units. This value is used as the x-coordinate
+      for the L1 roofline.
+    AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
+      of total floating-point operations (FLOPs) to total bytes transferred between
+      the L2 cache and the L1 cache. This value is used as the x-coordinate for the
+      L2 roofline.
+    AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
+      It is the ratio of total floating-point operations (FLOPs) to total bytes transferred
+      between HBM and the L2 cache. This value is used as the x-coordinate for the
+      HBM roofline.
+    Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs
+      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
+      operations divided by the total execution time. This value is used as the y-coordinate
+      for the kernel's point on the Roofline plot.
  data source:
-  - None:
+  - metric_table:
      id: 401
-      title: Roofline
+      title: Roofline Performance Rates
+      cli_style: Roofline
+      tui_style: Roofline
+      header:
+        metric: Metric
+        value: Value
+        unit: Unit
+        peak: Peak (Empirical)
+      metric:
+        VALU FLOPs:
+          value: AVG((($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) / ((End_Timestamp - Start_Timestamp) / 1e9))
+            / 1e9)
+          unit: GFLOP/s
+          peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
+        MFMA FLOPs (F64):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMAF64Flops_empirical_peak
+        MFMA FLOPs (F32):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMAF32Flops_empirical_peak
+        MFMA FLOPs (F16):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMAF16Flops_empirical_peak
+        MFMA FLOPs (BF16):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMABF16Flops_empirical_peak
+        MFMA IOPs (Int8):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GIOP/s
+          peak: $MFMAI8Ops_empirical_peak
+        HBM Bandwidth:
+          value: AVG((( (TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
+            * 64) + (TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
+            * 32) ) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+          unit: GB/s
+          peak: $HBMBw_empirical_peak
+        L2 Cache Bandwidth:
+          value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
+            TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64)) / ((End_Timestamp
+            - Start_Timestamp) / 1e9)) / 1e9)
+          unit: GB/s
+          peak: $L2Bw_empirical_peak
+        L1 Cache Bandwidth:
+          value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GB/s
+          peak: $L1Bw_empirical_peak
+        LDS Bandwidth:
+          value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4 * $lds_banks_per_cu))
+            / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+          unit: GB/s
+          peak: $LDSBw_empirical_peak
+  - metric_table:
+      id: 402
+      title: Roofline Plot Points
+      cli_style: Roofline
+      tui_style: Roofline
+      header:
+        metric: Metric
+        value: Value
+        unit: Unit
+      metric:
+        AI HBM:
+          value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+            (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / SUM( (TCC_EA_RDREQ_32B_sum
+            * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) * 64) + (TCC_EA_WRREQ_64B_sum
+            * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) * 32) ) )
+          unit: FLOPs/Byte
+        AI L2:
+          value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+            (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / SUM( (TCP_TCC_WRITE_REQ_sum
+            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum
+            + TCP_TCC_READ_REQ_sum) * 64 ) )
+          unit: FLOPs/Byte
+        AI L1:
+          value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+            (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / SUM(TCP_TOTAL_CACHE_ACCESSES_sum
+            * 64) )
+          unit: FLOPs/Byte
+        Performance GFLOPs:
+          value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+            (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / (SUM(End_Timestamp - Start_Timestamp)
+            / 1e9) ) / 1e9
+          unit: GFLOP/s
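The plot-point formulas in metric table 402 above can be checked by hand. The following is a minimal Python sketch of the `AI HBM` and `Performance GFLOPs` arithmetic for a single dispatch. The counter values, the wave size of 64, and the helper names `total_flops` and `hbm_bytes` are assumptions made for illustration; they are not part of rocprof-compute.

```python
# Hypothetical counter values for one dispatch (illustration only).
counters = {
    "SQ_INSTS_VALU_ADD_F32": 1_000,
    "SQ_INSTS_VALU_MUL_F32": 500,
    "SQ_INSTS_VALU_FMA_F32": 2_000,
    "SQ_INSTS_VALU_TRANS_F32": 0,
    "SQ_INSTS_VALU_MFMA_MOPS_F32": 100,
    "TCC_EA_RDREQ_sum": 4_000,
    "TCC_EA_RDREQ_32B_sum": 1_000,
    "TCC_EA_WRREQ_sum": 2_000,
    "TCC_EA_WRREQ_64B_sum": 1_500,
}
WAVE_SIZE = 64        # assumed value of $wave_size
DURATION_NS = 50_000  # End_Timestamp - Start_Timestamp, in nanoseconds


def total_flops(c, wave_size):
    """VALU FLOPs (FMA counted twice) plus 512 FLOPs per MFMA MOPS count."""
    valu_insts = sum(
        c.get(f"SQ_INSTS_VALU_ADD_{t}", 0)
        + c.get(f"SQ_INSTS_VALU_MUL_{t}", 0)
        + 2 * c.get(f"SQ_INSTS_VALU_FMA_{t}", 0)
        + c.get(f"SQ_INSTS_VALU_TRANS_{t}", 0)
        for t in ("F16", "F32", "F64")
    )
    mfma_mops = sum(
        c.get(f"SQ_INSTS_VALU_MFMA_MOPS_{t}", 0)
        for t in ("F16", "BF16", "F32", "F64")
    )
    return wave_size * valu_insts + 512 * mfma_mops


def hbm_bytes(c):
    """Bytes moved to and from HBM: 32B/64B reads plus 64B/32B writes."""
    rd32 = c["TCC_EA_RDREQ_32B_sum"]
    wr64 = c["TCC_EA_WRREQ_64B_sum"]
    return (
        rd32 * 32
        + (c["TCC_EA_RDREQ_sum"] - rd32) * 64
        + wr64 * 64
        + (c["TCC_EA_WRREQ_sum"] - wr64) * 32
    )


flops = total_flops(counters, WAVE_SIZE)
ai_hbm = flops / hbm_bytes(counters)             # FLOPs/Byte
perf_gflops = flops / (DURATION_NS / 1e9) / 1e9  # GFLOP/s
print(flops, hbm_bytes(counters), ai_hbm, perf_gflops)
```

With these inputs the sketch yields 403,200 FLOPs over 336,000 bytes, an arithmetic intensity of 1.2 FLOPs/Byte, and roughly 8.064 GFLOP/s.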
@@ -2,8 +2,197 @@
 Panel Config:
  id: 400
  title: Roofline
-  metrics_description: {}
+  metrics_description:
+    VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
+      This is also presented as a percent of the peak theoretical FLOPs achievable
+      on the specific accelerator. Note: this does not include any floating-point
+      operations from MFMA instructions.'
+    MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
+      executed per second. This does not include any 8-bit floating point operations
+      from VALU instructions. The peak empirically measured F8 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison. This metric
+      is supported only on AMD Instinct MI300 series and later.
+    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
+      executed per second. Note: this does not include any 16-bit brain floating point
+      operations from VALU instructions. The peak empirically measured BF16 MFMA operations
+      achievable on the specific accelerator is displayed alongside for comparison.'
+    MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
+      per second. Note: this does not include any 16-bit floating point operations
+      from VALU instructions. The peak empirically measured F16 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison.'
+    MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
+      per second. Note: this does not include any 32-bit floating point operations
+      from VALU instructions. The peak empirically measured F32 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison.'
+    MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
+      per second. Note: this does not include any 64-bit floating point operations
+      from VALU instructions. The peak empirically measured F64 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison.'
+    MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
+      per second. Note: this does not include any 8-bit integer operations from VALU
+      instructions. The peak empirically measured INT8 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison.'
+    HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth
+      Memory (HBM) per second. The peak empirically measured bandwidth achievable
+      on the specific accelerator is displayed alongside for comparison.
+    L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
+      The number of bytes is calculated as the number of cache lines requested multiplied
+      by the cache line size. This value does not consider partial requests, so e.g.,
+      if only a single value is requested in a cache line, the data movement will
+      still be counted as a full cache line. The peak empirically measured bandwidth
+      achievable on the specific accelerator is displayed alongside for comparison.
+    L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
+      of VMEM instructions per unit time. The number of bytes is calculated as the
+      number of cache lines requested multiplied by the cache line size. This value
+      does not consider partial requests, so e.g., if only a single value is requested
+      in a cache line, the data movement will still be counted as a full cache line.
+      The peak empirically measured bandwidth achievable on the specific accelerator
+      is displayed alongside for comparison.
+    LDS Bandwidth: Indicates the maximum number of bytes that could have been loaded
+      from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
+      example for more detail). The peak empirically measured LDS bandwidth achievable
+      on the specific accelerator is displayed alongside for comparison.
+    AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
+      of total floating-point operations (FLOPs) to total bytes transferred between
+      the L1 cache and the processing units. This value is used as the x-coordinate
+      for the L1 roofline.
+    AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
+      of total floating-point operations (FLOPs) to total bytes transferred between
+      the L2 cache and the L1 cache. This value is used as the x-coordinate for the
+      L2 roofline.
+    AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
+      It is the ratio of total floating-point operations (FLOPs) to total bytes transferred
+      between HBM and the L2 cache. This value is used as the x-coordinate for the
+      HBM roofline.
+    Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs
+      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
+      operations divided by the total execution time. This value is used as the y-coordinate
+      for the kernel's point on the Roofline plot.
  data source:
-  - None:
+  - metric_table:
      id: 401
-      title: Roofline
+      title: Roofline Performance Rates
+      cli_style: Roofline
+      tui_style: Roofline
+      header:
+        metric: Metric
+        value: Value
+        unit: Unit
+        peak: Peak (Empirical)
+      metric:
+        VALU FLOPs:
+          value: AVG(($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) ) / ((End_Timestamp - Start_Timestamp) / 1e9))
+            / 1e9)
+          unit: GFLOP/s
+          peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
+        MFMA FLOPs (F64):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMAF64Flops_empirical_peak
+        MFMA FLOPs (F32):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMAF32Flops_empirical_peak
+        MFMA FLOPs (F16):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMAF16Flops_empirical_peak
+        MFMA FLOPs (BF16):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMABF16Flops_empirical_peak
+        MFMA FLOPs (F8):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMAF8Flops_empirical_peak
+        MFMA IOPs (Int8):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GIOP/s
+          peak: $MFMAI8Ops_empirical_peak
+        HBM Bandwidth:
+          value: AVG((( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum
+            - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + ((TCC_EA0_WRREQ_sum
+            - TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp
+            - Start_Timestamp) / 1e9)) / 1e9)
+          unit: GB/s
+          peak: $HBMBw_empirical_peak
+        L2 Cache Bandwidth:
+          value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
+            TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64)) / ((End_Timestamp
+            - Start_Timestamp) / 1e9)) / 1e9)
+          unit: GB/s
+          peak: $L2Bw_empirical_peak
+        L1 Cache Bandwidth:
+          value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GB/s
+          peak: $L1Bw_empirical_peak
+        LDS Bandwidth:
+          value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4 * $lds_banks_per_cu))
+            / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+          unit: GB/s
+          peak: $LDSBw_empirical_peak
+  - metric_table:
+      id: 402
+      title: Roofline Plot Points
+      cli_style: Roofline
+      tui_style: Roofline
+      header:
+        metric: Metric
+        value: Value
+        unit: Unit
+      metric:
+        AI HBM:
+          value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+            (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
+            * 512) ) / SUM( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32)
+            + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64)
+            + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum
+            * 64) ) )
+          unit: FLOPs/Byte
+        AI L2:
+          value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+            (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
+            * 512) ) / SUM( (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
+            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64 ) )
+          unit: FLOPs/Byte
+        AI L1:
+          value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+            (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
+            * 512) ) / SUM(TCP_TOTAL_CACHE_ACCESSES_sum * 64) )
+          unit: FLOPs/Byte
+        Performance GFLOPs:
+          value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+            (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
+            * 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9
+          unit: GFLOP/s
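The `HBM Bandwidth` expression in this file differs from the previous one: it uses the `TCC_EA0_*` counters and adds a `TCC_BUBBLE_sum` term weighted at 128 bytes, subtracting those requests from the 64-byte read count. A small Python sketch of that byte accounting follows. The counter values and the function name `hbm_bytes_with_bubble` are hypothetical and serve only to illustrate the formula.

```python
def hbm_bytes_with_bubble(bubble, rd, rd32, wr, wr64):
    """Byte accounting mirroring the HBM Bandwidth formula in this hunk.

    bubble: TCC_BUBBLE_sum        (weighted at 128 bytes each)
    rd:     TCC_EA0_RDREQ_sum     (all read requests)
    rd32:   TCC_EA0_RDREQ_32B_sum (32-byte reads)
    wr:     TCC_EA0_WRREQ_sum     (all write requests)
    wr64:   TCC_EA0_WRREQ_64B_sum (64-byte writes)
    """
    return (
        bubble * 128
        + rd32 * 32
        + (rd - bubble - rd32) * 64  # remaining reads counted as 64 bytes
        + (wr - wr64) * 32           # remaining writes counted as 32 bytes
        + wr64 * 64
    )


# Hypothetical counter values for one dispatch lasting 50,000 ns.
total_bytes = hbm_bytes_with_bubble(bubble=100, rd=4_000, rd32=1_000, wr=2_000, wr64=1_500)
duration_ns = 50_000
bandwidth_gbs = total_bytes / (duration_ns / 1e9) / 1e9  # GB/s
print(total_bytes, bandwidth_gbs)
```

For these inputs the total is 342,400 bytes, or about 6.848 GB/s.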
@@ -2,8 +2,197 @@
 Panel Config:
  id: 400
  title: Roofline
-  metrics_description: {}
+  metrics_description:
+    VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
+      This is also presented as a percent of the peak theoretical FLOPs achievable
+      on the specific accelerator. Note: this does not include any floating-point
+      operations from MFMA instructions.'
+    MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
+      executed per second. This does not include any 8-bit floating point operations
+      from VALU instructions. The peak empirically measured F8 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison. This metric
+      is supported only on AMD Instinct MI300 series and later.
+    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
+      executed per second. Note: this does not include any 16-bit brain floating point
+      operations from VALU instructions. The peak empirically measured BF16 MFMA operations
+      achievable on the specific accelerator is displayed alongside for comparison.'
+    MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
+      per second. Note: this does not include any 16-bit floating point operations
+      from VALU instructions. The peak empirically measured F16 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison.'
+    MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
+      per second. Note: this does not include any 32-bit floating point operations
+      from VALU instructions. The peak empirically measured F32 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison.'
+    MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
+      per second. Note: this does not include any 64-bit floating point operations
+      from VALU instructions. The peak empirically measured F64 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison.'
+    MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
+      per second. Note: this does not include any 8-bit integer operations from VALU
+      instructions. The peak empirically measured INT8 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison.'
+    HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth
+      Memory (HBM) per second. The peak empirically measured bandwidth achievable
+      on the specific accelerator is displayed alongside for comparison.
+    L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
+      The number of bytes is calculated as the number of cache lines requested multiplied
+      by the cache line size. This value does not consider partial requests, so e.g.,
+      if only a single value is requested in a cache line, the data movement will
+      still be counted as a full cache line. The peak empirically measured bandwidth
+      achievable on the specific accelerator is displayed alongside for comparison.
+    L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
+      of VMEM instructions per unit time. The number of bytes is calculated as the
+      number of cache lines requested multiplied by the cache line size. This value
+      does not consider partial requests, so e.g., if only a single value is requested
+      in a cache line, the data movement will still be counted as a full cache line.
+      The peak empirically measured bandwidth achievable on the specific accelerator
+      is displayed alongside for comparison.
+    LDS Bandwidth: Indicates the maximum number of bytes that could have been loaded
+      from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
+      example for more detail). The peak empirically measured LDS bandwidth achievable
+      on the specific accelerator is displayed alongside for comparison.
+    AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
+      of total floating-point operations (FLOPs) to total bytes transferred between
+      the L1 cache and the processing units. This value is used as the x-coordinate
+      for the L1 roofline.
+    AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
+      of total floating-point operations (FLOPs) to total bytes transferred between
+      the L2 cache and the L1 cache. This value is used as the x-coordinate for the
+      L2 roofline.
+    AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
+      It is the ratio of total floating-point operations (FLOPs) to total bytes transferred
+      between HBM and the L2 cache. This value is used as the x-coordinate for the
+      HBM roofline.
+    Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs
+      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
+      operations divided by the total execution time. This value is used as the y-coordinate
+      for the kernel's point on the Roofline plot.
  data source:
-  - None:
+  - metric_table:
      id: 401
-      title: Roofline
+      title: Roofline Performance Rates
+      cli_style: Roofline
+      tui_style: Roofline
+      header:
+        metric: Metric
+        value: Value
+        unit: Unit
+        peak: Peak (Empirical)
+      metric:
+        VALU FLOPs:
+          value: AVG(($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) ) / ((End_Timestamp - Start_Timestamp) / 1e9))
+            / 1e9)
+          unit: GFLOP/s
+          peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
+        MFMA FLOPs (F64):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMAF64Flops_empirical_peak
+        MFMA FLOPs (F32):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMAF32Flops_empirical_peak
+        MFMA FLOPs (F16):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMAF16Flops_empirical_peak
+        MFMA FLOPs (BF16):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMABF16Flops_empirical_peak
+        MFMA FLOPs (F8):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMAF8Flops_empirical_peak
+        MFMA IOPs (Int8):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GIOP/s
+          peak: $MFMAI8Ops_empirical_peak
+        HBM Bandwidth:
+          value: AVG((( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum
+            - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + ((TCC_EA0_WRREQ_sum
+            - TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp
+            - Start_Timestamp) / 1e9)) / 1e9)
+          unit: GB/s
+          peak: $HBMBw_empirical_peak
+        L2 Cache Bandwidth:
+          value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
+            TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64)) / ((End_Timestamp
+            - Start_Timestamp) / 1e9)) / 1e9)
+          unit: GB/s
+          peak: $L2Bw_empirical_peak
+        L1 Cache Bandwidth:
+          value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GB/s
+          peak: $L1Bw_empirical_peak
+        LDS Bandwidth:
+          value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4 * $lds_banks_per_cu))
+            / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+          unit: GB/s
+          peak: $LDSBw_empirical_peak
+  - metric_table:
+      id: 402
+      title: Roofline Plot Points
+      cli_style: Roofline
+      tui_style: Roofline
+      header:
+        metric: Metric
+        value: Value
+        unit: Unit
+      metric:
+        AI HBM:
+          value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+            (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
+            * 512) ) / SUM( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32)
+            + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64)
+            + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum
+            * 64) ) )
+          unit: FLOPs/Byte
+        AI L2:
+          value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+            (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
+            * 512) ) / SUM( (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
+            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64 ) )
+          unit: FLOPs/Byte
+        AI L1:
+          value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+            (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
+            * 512) ) / SUM(TCP_TOTAL_CACHE_ACCESSES_sum * 64) )
+          unit: FLOPs/Byte
+        Performance GFLOPs:
+          value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+            (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
+            * 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9
+          unit: GFLOP/s
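The two metric tables aggregate differently: table 401 wraps each rate in `AVG(...)`, while table 402 divides one `SUM(...)` by another. Assuming both aggregate over the dispatches selected for analysis, the results generally differ when dispatch durations vary. The Python sketch below uses two made-up dispatches to show the gap; the data and the helper name `gflops` are illustrative only.

```python
# Two hypothetical dispatches with equal work but different durations.
dispatches = [
    {"flops": 1_000_000, "duration_ns": 1_000},
    {"flops": 1_000_000, "duration_ns": 4_000},
]


def gflops(flops, duration_ns):
    """FLOPs divided by seconds, scaled to GFLOP/s, as in the YAML formulas."""
    return flops / (duration_ns / 1e9) / 1e9


# Table 401 style: average of the per-dispatch rates.
avg_of_rates = sum(
    gflops(d["flops"], d["duration_ns"]) for d in dispatches
) / len(dispatches)

# Table 402 style: total FLOPs over total time.
rate_of_sums = gflops(
    sum(d["flops"] for d in dispatches),
    sum(d["duration_ns"] for d in dispatches),
)

print(avg_of_rates, rate_of_sums)
```

Here the average of rates is about 625 GFLOP/s, while total work over total time is about 400 GFLOP/s, because the time-weighted figure gives the slower dispatch more influence.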
@@ -2,8 +2,197 @@
 Panel Config:
  id: 400
  title: Roofline
-  metrics_description: {}
+  metrics_description:
+    VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
+      This is also presented as a percent of the peak theoretical FLOPs achievable
+      on the specific accelerator. Note: this does not include any floating-point
+      operations from MFMA instructions.'
+    MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
+      executed per second. This does not include any 8-bit floating point operations
+      from VALU instructions. The peak empirically measured F8 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison. It is supported
+      on AMD Instinct MI300 series and later only.
+    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
+      executed per second. Note: this does not include any 16-bit brain floating point
+      operations from VALU instructions. The peak empirically measured BF16 MFMA operations
+      achievable on the specific accelerator is displayed alongside for comparison.'
+    MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
+      per second. Note: this does not include any 16-bit floating point operations
+      from VALU instructions. The peak empirically measured F16 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison.'
+    MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
+      per second. Note: this does not include any 32-bit floating point operations
+      from VALU instructions. The peak empirically measured F32 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison.'
+    MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
+      per second. Note: this does not include any 64-bit floating point operations
+      from VALU instructions. The peak empirically measured F64 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison.'
+    MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
+      per second. Note: this does not include any 8-bit integer operations from VALU
+      instructions. The peak empirically measured INT8 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison.'
+    HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth
+      Memory (HBM) per second. The peak empirically measured bandwidth achievable
+      on the specific accelerator is displayed alongside for comparison.
+    L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
+      The number of bytes is calculated as the number of cache lines requested multiplied
+      by the cache line size. This value does not consider partial requests, so e.g.,
+      if only a single value is requested in a cache line, the data movement will
+      still be counted as a full cache line. The peak empirically measured bandwidth
+      achievable on the specific accelerator is displayed alongside for comparison.
+    L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
+      of VMEM instructions per unit time. The number of bytes is calculated as the
+      number of cache lines requested multiplied by the cache line size. This value
+      does not consider partial requests, so e.g., if only a single value is requested
+      in a cache line, the data movement will still be counted as a full cache line.
+      The peak empirically measured bandwidth achievable on the specific accelerator
+      is displayed alongside for comparison.
+    LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded
+      from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
+      example for more detail). The peak empirically measured LDS bandwidth achievable
+      on the specific accelerator is displayed alongside for comparison.
+    AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
+      of total floating-point operations (FLOPs) to total bytes transferred between
+      the L1 cache and the processing units. This value is used as the x-coordinate
+      for the L1 roofline.
+    AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
+      of total floating-point operations (FLOPs) to total bytes transferred between
+      the L2 cache and the L1 cache. This value is used as the x-coordinate for the
+      L2 roofline.
+    AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
+      It is the ratio of total floating-point operations (FLOPs) to total bytes transferred
+      between HBM and the L2 cache. This value is used as the x-coordinate for the
+      HBM roofline.
+    Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs
+      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
+      operations divided by the total execution time. This value is used as the y-coordinate
+      for the kernel's point on the Roofline plot.
  data source:
-  - None:
+  - metric_table:
      id: 401
-      title: Roofline
+      title: Roofline Performance Rates
+      cli_style: Roofline
+      tui_style: Roofline
+      header:
+        metric: Metric
+        value: Value
+        unit: Unit
+        peak: Peak (Empirical)
+      metric:
+        VALU FLOPs:
+          value: AVG((($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) / ((End_Timestamp - Start_Timestamp) / 1e9))
+            / 1e9)
+          unit: GFLOP/s
+          peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
+        MFMA FLOPs (F64):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMAF64Flops_empirical_peak
+        MFMA FLOPs (F32):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMAF32Flops_empirical_peak
+        MFMA FLOPs (F16):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMAF16Flops_empirical_peak
+        MFMA FLOPs (BF16):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMABF16Flops_empirical_peak
+        MFMA FLOPs (F8):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMAF8Flops_empirical_peak
+        MFMA IOPs (Int8):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GIOP/s
+          peak: $MFMAI8Ops_empirical_peak
+        HBM Bandwidth:
+          value: AVG((( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum
+            - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + ((TCC_EA0_WRREQ_sum
+            - TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp
+            - Start_Timestamp) / 1e9)) / 1e9)
+          unit: GB/s
+          peak: $HBMBw_empirical_peak
+        L2 Cache Bandwidth:
+          value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
+            TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64)) / ((End_Timestamp
+            - Start_Timestamp) / 1e9)) / 1e9)
+          unit: GB/s
+          peak: $L2Bw_empirical_peak
+        L1 Cache Bandwidth:
+          value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GB/s
+          peak: $L1Bw_empirical_peak
+        LDS Bandwidth:
+          value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4 * $lds_banks_per_cu))
+            / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+          unit: GB/s
+          peak: $LDSBw_empirical_peak
+  - metric_table:
+      id: 402
+      title: Roofline Plot Points
+      cli_style: Roofline
+      tui_style: Roofline
+      header:
+        metric: Metric
+        value: Value
+        unit: Unit
+      metric:
+        AI HBM:
+          value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+            (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
+            * 512) ) / SUM( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32)
+            + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64)
+            + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum
+            * 64) ) )
+          unit: FLOPs/Byte
+        AI L2:
+          value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+            (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
+            * 512) ) / SUM( (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
+            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64 ) )
+          unit: FLOPs/Byte
+        AI L1:
+          value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+            (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
+            * 512) ) / SUM( TCP_TOTAL_CACHE_ACCESSES_sum * 64 ) )
+          unit: FLOPs/Byte
+        Performance (GFLOPs):
+          value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+            (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
+            * 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9
+          unit: GFLOP/s
@@ -2,8 +2,205 @@
 Panel Config:
  id: 400
  title: Roofline
-  metrics_description: {}
+  metrics_description:
+    VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
+      This is also presented as a percent of the peak theoretical FLOPs achievable
+      on the specific accelerator. Note: this does not include any floating-point
+      operations from MFMA instructions.'
+    MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
+      executed per second. This does not include any 8-bit floating point operations
+      from VALU instructions. The peak empirically measured F8 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison. It is supported
+      on AMD Instinct MI300 series and later only.
+    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
+      executed per second. Note: this does not include any 16-bit brain floating point
+      operations from VALU instructions. The peak empirically measured BF16 MFMA operations
+      achievable on the specific accelerator is displayed alongside for comparison.'
+    MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
+      per second. Note: this does not include any 16-bit floating point operations
+      from VALU instructions. The peak empirically measured F16 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison.'
+    MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
+      per second. Note: this does not include any 32-bit floating point operations
+      from VALU instructions. The peak empirically measured F32 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison.'
+    MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
+      per second. Note: this does not include any 64-bit floating point operations
+      from VALU instructions. The peak empirically measured F64 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison.'
+    MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
+      per second. Note: this does not include any 8-bit integer operations from VALU
+      instructions. The peak empirically measured INT8 MFMA operations achievable
+      on the specific accelerator is displayed alongside for comparison.'
+    HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth
+      Memory (HBM) per second. The peak empirically measured bandwidth achievable
+      on the specific accelerator is displayed alongside for comparison.
+    L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
+      The number of bytes is calculated as the number of cache lines requested multiplied
+      by the cache line size. This value does not consider partial requests, so e.g.,
+      if only a single value is requested in a cache line, the data movement will
+      still be counted as a full cache line. The peak empirically measured bandwidth
+      achievable on the specific accelerator is displayed alongside for comparison.
+    L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
+      of VMEM instructions per unit time. The number of bytes is calculated as the
+      number of cache lines requested multiplied by the cache line size. This value
+      does not consider partial requests, so e.g., if only a single value is requested
+      in a cache line, the data movement will still be counted as a full cache line.
+      The peak empirically measured bandwidth achievable on the specific accelerator
+      is displayed alongside for comparison.
+    LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded
+      from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
+      example for more detail). The peak empirically measured LDS bandwidth achievable
+      on the specific accelerator is displayed alongside for comparison.
+    AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
+      of total floating-point operations (FLOPs) to total bytes transferred between
+      the L1 cache and the processing units. This value is used as the x-coordinate
+      for the L1 roofline.
+    AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
+      of total floating-point operations (FLOPs) to total bytes transferred between
+      the L2 cache and the L1 cache. This value is used as the x-coordinate for the
+      L2 roofline.
+    AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
+      It is the ratio of total floating-point operations (FLOPs) to total bytes transferred
+      between HBM and the L2 cache. This value is used as the x-coordinate for the
+      HBM roofline.
+    Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs
+      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
+      operations divided by the total execution time. This value is used as the y-coordinate
+      for the kernel's point on the Roofline plot.
  data source:
-  - None:
+  - metric_table:
      id: 401
-      title: Roofline
+      title: Roofline Performance Rates
+      cli_style: Roofline
+      tui_style: Roofline
+      header:
+        metric: Metric
+        value: Value
+        unit: Unit
+        peak: Peak (Empirical)
+      metric:
+        VALU FLOPs:
+          value: AVG((($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) / ((End_Timestamp - Start_Timestamp) / 1e9))
+            / 1e9)
+          unit: GFLOP/s
+          peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
+        MFMA FLOPs (F64):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMAF64Flops_empirical_peak
+        MFMA FLOPs (F32):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMAF32Flops_empirical_peak
+        MFMA FLOPs (F16):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMAF16Flops_empirical_peak
+        MFMA FLOPs (BF16):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMABF16Flops_empirical_peak
+        MFMA FLOPs (F8):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMAF8Flops_empirical_peak
+        MFMA FLOPs (F6F4):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GFLOP/s
+          peak: $MFMA_FLOPs_F6F4_empirical_peak
+        MFMA IOPs (Int8):
+          value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GIOP/s
+          peak: $MFMAI8Ops_empirical_peak
+        HBM Bandwidth:
+          value: AVG((( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum
+            - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + ((TCC_EA0_WRREQ_sum
+            - TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp
+            - Start_Timestamp) / 1e9)) / 1e9)
+          unit: GB/s
+          peak: $HBMBw_empirical_peak
+        L2 Cache Bandwidth:
+          value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
+            TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64)) / ((End_Timestamp
+            - Start_Timestamp) / 1e9)) / 1e9)
+          unit: GB/s
+          peak: $L2Bw_empirical_peak
+        L1 Cache Bandwidth:
+          value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp)
+            / 1e9)) / 1e9)
+          unit: GB/s
+          peak: $L1Bw_empirical_peak
+        LDS Bandwidth:
+          value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4 * $lds_banks_per_cu))
+            / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+          unit: GB/s
+          peak: $LDSBw_empirical_peak
+  - metric_table:
+      id: 402
+      title: Roofline Plot Points
+      cli_style: Roofline
+      tui_style: Roofline
+      header:
+        metric: Metric
+        value: Value
+        unit: Unit
+      metric:
+        AI HBM:
+          value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+            (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
+            * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / SUM( (TCC_BUBBLE_sum
+            * 128) + (TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum
+            - TCC_EA0_RDREQ_32B_sum) * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
+            * 32) + (TCC_EA0_WRREQ_64B_sum * 64) ) )
+          unit: FLOPs/Byte
+        AI L2:
+          value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+            (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
+            * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / SUM( (TCP_TCC_WRITE_REQ_sum
+            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum
+            + TCP_TCC_READ_REQ_sum) * 64 ) )
+          unit: FLOPs/Byte
+        AI L1:
+          value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+            (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
+            * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / SUM(TCP_TOTAL_CACHE_ACCESSES_sum
+            * 64) )
+          unit: FLOPs/Byte
+        Performance GFLOPs:
+          value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+            + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+            + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+            + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+            + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+            (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+            512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
+            * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / (SUM(End_Timestamp -
+            Start_Timestamp) / 1e9) ) / 1e9
+          unit: GFLOP/s
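Reviewer note: the "Roofline Plot Points" expressions above divide a SUM of FLOPs over all dispatches by a SUM of bytes (or of elapsed time), rather than averaging per-dispatch ratios as the AVG-based rate metrics do. The sketch below restates that arithmetic in plain Python for the L1 case only. It is an illustration under stated assumptions, not the tool's implementation: the helper names `total_flops` and `l1_plot_point` are hypothetical, each dispatch is assumed to be a dict of counter values, missing counters are assumed to be 0, and the F6F4 term is omitted.

```python
# Hypothetical helpers that mirror the YAML expressions; not part of rocprof-compute.
VALU_TYPES = ("F16", "F32", "F64")
MFMA_TYPES = ("F16", "BF16", "F32", "F64", "F8")


def total_flops(dispatch, wave_size=64):
    """FLOPs for one dispatch: wave_size * VALU ops + 512 * MFMA MOPS."""
    valu = 0
    for t in VALU_TYPES:
        valu += dispatch.get(f"SQ_INSTS_VALU_ADD_{t}", 0)
        valu += dispatch.get(f"SQ_INSTS_VALU_MUL_{t}", 0)
        valu += 2 * dispatch.get(f"SQ_INSTS_VALU_FMA_{t}", 0)  # FMA counts as 2 ops
        valu += dispatch.get(f"SQ_INSTS_VALU_TRANS_{t}", 0)
    mfma = sum(dispatch.get(f"SQ_INSTS_VALU_MFMA_MOPS_{t}", 0) for t in MFMA_TYPES)
    return wave_size * valu + 512 * mfma


def l1_plot_point(dispatches, wave_size=64):
    """Return (AI L1 in FLOPs/Byte, performance in GFLOP/s) for a set of dispatches."""
    flops = sum(total_flops(d, wave_size) for d in dispatches)
    # 64 bytes per access, as in the AI L1 expression
    l1_bytes = sum(d["TCP_TOTAL_CACHE_ACCESSES_sum"] * 64 for d in dispatches)
    # Timestamps are in nanoseconds; convert the summed duration to seconds
    seconds = sum(d["End_Timestamp"] - d["Start_Timestamp"] for d in dispatches) / 1e9
    return flops / l1_bytes, flops / seconds / 1e9


example = [
    {
        "SQ_INSTS_VALU_FMA_F32": 100,
        "TCP_TOTAL_CACHE_ACCESSES_sum": 50,
        "Start_Timestamp": 0,
        "End_Timestamp": 1000,
    }
]
ai_l1, gflops = l1_plot_point(example)
print(ai_l1, gflops)
```

Summing numerator and denominator separately weights each dispatch by how much work it did, so a short dispatch with an extreme ratio does not move the plotted point as much as it would under a per-dispatch average.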
@@ -377,14 +377,10 @@ class OmniSoC_Base:
            if counter_name.startswith("TCC") and counter_name.endswith("["):
                counters.remove(counter_name)
                counter_name = counter_name.split("[")[0]
-                counters = counters.union(
-                    {
-                        f"{counter_name}[{i}]"
-                        for i in range(
-                            num_xcd_for_pmc_file * int(self._mspec._l2_banks)
-                        )
-                    }
-                )
+                counters = counters.union({
+                    f"{counter_name}[{i}]"
+                    for i in range(num_xcd_for_pmc_file * int(self._mspec._l2_banks))
+                })

        return counters

@@ -48,7 +48,8 @@ from utils.roofline_calc import (
    MFMA_DATATYPES,
    PEAK_OPS_DATATYPES,
    SUPPORTED_DATATYPES,
-    calc_ai,
+    calc_ai_analyze,
+    calc_ai_profile,
    constuct_roof,
 )
 from utils.utils import mibench
@@ -182,10 +183,9 @@ class Roofline:
        console_debug(
            "roofline", "Path: %s" % self.__run_parameters.get("workload_dir")
        )
-        self.__ai_data = calc_ai(
+        self.__ai_data = calc_ai_profile(
            self.__mspec, self.__run_parameters.get("sort_type"), ret_df
        )
-
        msg = "AI at each mem level:"
        for i in self.__ai_data:
            msg += "\n\t%s -> %s" % (i, self.__ai_data[i])
@@ -620,7 +620,7 @@ class Roofline:

        return fig

-    def cli_generate_plot(self, dtype):
+    def cli_generate_plot(self, dtype, workload=None, config=None, arch_config=None):
        """
        Plot CLI mode roofline analysis in terminal using plotext

@@ -668,11 +668,43 @@ class Roofline:
        else:
            # workload_dir is a string
            base_dir = workload_dir
-        self.roof_setup()
-
        # Convert to Path object for easier manipulation
        base_path = Path(base_dir)

+        roofline_csv = base_path / "roofline.csv"
+        if not roofline_csv.is_file():
+            console_log("roofline", "{} does not exist".format(roofline_csv))
+            return
+
+        # If a workload is provided, use the Roofline YAMLs; otherwise fall back to calc_ai_profile
+        if workload is not None:
+            self.__ai_data = calc_ai_analyze(
+                workload=workload,
+                mspec=self.__mspec,
+                sort_type=self.__run_parameters.get("sort_type"),
+                config=config,
+                arch_config=arch_config,
+            )
+
+        else:
+            pmc_perf_csv = base_path / "pmc_perf.csv"
+            if not pmc_perf_csv.is_file():
+                console_error("roofline", "{} does not exist".format(pmc_perf_csv))
+            t_df = OrderedDict()
+            t_df["pmc_perf"] = pd.read_csv(pmc_perf_csv)
+
+            self.__ai_data = calc_ai_profile(
+                self.__mspec, self.__run_parameters["sort_type"], t_df
+            )
+
+        self.__ceiling_data = constuct_roof(
+            roofline_parameters=self.__run_parameters, dtype=dtype
+        )
+        console_debug(f"AI data: {self.__ai_data}")
+        console_debug(f"Kernel names: {self.__ai_data.get('kernelNames', [])}")
+
+        self.roof_setup()
+
        # Check proper datatype input - takes single str
        if not isinstance(dtype, str):
            console_error("Unsupported datatype input - must be str")
@@ -682,16 +714,6 @@ class Roofline:
            self.__run_parameters["mem_level"].remove("vL1D")
            self.__run_parameters["mem_level"].append("L1")

-        roofline_csv = base_path / "roofline.csv"
-        if not roofline_csv.is_file():
-            console_log("roofline", "{} does not exist".format(roofline_csv))
-            return
-
-        pmc_perf_csv = base_path / "pmc_perf.csv"
-        if not pmc_perf_csv.is_file():
-            console_error("roofline", "{} does not exist".format(pmc_perf_csv))
-        t_df = OrderedDict()
-        t_df["pmc_perf"] = pd.read_csv(pmc_perf_csv)
        profiling_config = file_io.load_profiling_config(self.__args.path[0][0])
        if profiling_config.get("format_rocprof_output") == "rocpd":
            t_df["pmc_perf"] = rocpd_data.process_rocpd_csv(t_df["pmc_perf"])
@@ -714,12 +736,6 @@ class Roofline:
            5: "atom",
        }

-        self.__ceiling_data = constuct_roof(
-            roofline_parameters=self.__run_parameters,
-            dtype=dtype,
-        )
-        self.__ai_data = calc_ai(self.__mspec, self.__run_parameters["sort_type"], t_df)
-
        plt.clf()
        plt.plotsize(plt.tw(), plt.th())

@@ -103,6 +103,7 @@ supported_call = {
    "STD": "to_std",
    # functions apply to whole column of df or a single value
    "TO_INT": "to_int",
+    "SUM": "to_sum",
    # Support the below with 2 inputs
    "ROUND": "to_round",
    "QUANTILE": "to_quantile",
@@ -196,6 +197,19 @@ def to_int(a):
        raise Exception("to_int: unsupported type.")


+def to_sum(a):
+    if str(type(a)) == "<class 'NoneType'>":
+        return np.nan
+    elif np.isnan(a).all():
+        return np.nan
+    elif a.empty:
+        return np.nan
+    elif isinstance(a, pd.core.series.Series):
+        return a.sum()
+    else:
+        raise Exception("to_sum: unsupported type.")
+
+
 def to_round(a, b):
    if isinstance(a, pd.core.series.Series):
        return a.round(b)
@@ -755,7 +769,7 @@ def build_metric_value_string(dfs, dfs_type, normal_unit, profiling_config):


@demarcate
-def eval_metric(dfs, dfs_type, sys_info, raw_pmc_df, debug, config):
+def eval_metric(dfs, dfs_type, sys_info, empirical_peaks_df, raw_pmc_df, debug, config):
    """
    Execute the expr string for each metric in the df.
    """
@@ -860,6 +874,30 @@ def eval_metric(dfs, dfs_type, sys_info, raw_pmc_df, debug, config):
            "wave_size is not available in sysinfo.csv, please provide the correct "
            "value using --specs-correction"
        )
+    if not empirical_peaks_df.empty:
+        peak_data_row = empirical_peaks_df.iloc[0]
+        for metric_name in empirical_peaks_df.columns:
+            var_name = f"ammolite__{metric_name}_empirical_peak"
+            locals()[var_name] = peak_data_row[metric_name]
+    else:
+        default_peaks = [
+            "MFMAF64Flops",
+            "MFMAF32Flops",
+            "MFMAF16Flops",
+            "MFMABF16Flops",
+            "MFMAF8Flops",
+            "MFMAI8Ops",
+            "HBMBw",
+            "L2Bw",
+            "L1Bw",
+            "LDSBw",
+            "MFMA_FLOPs_F6F4",
+        ]
+        # set values to 0 if no empirical peaks from roofline.csv are provided
+        for peak_name in default_peaks:
+            var_name = f"ammolite__{peak_name}_empirical_peak"
+            exec(f"{var_name} = 0", globals(), locals())
+
    # TODO: fix all $normUnit in Unit column or title

    # build and eval all derived build-in global variables
@@ -958,8 +996,7 @@ def eval_metric(dfs, dfs_type, sys_info, raw_pmc_df, debug, config):
                                    except TypeError:
                                        console_warning(
                                            "Skipping entry. Encountered a missing "
-                                            "counter\n{} has been assigned to None\n{}"
-                                            .format(
+                                            "counter\n{} has been assigned to None\n{}".format(
                                                expr,
                                                np.nan,
                                            )
@@ -984,8 +1021,14 @@ def eval_metric(dfs, dfs_type, sys_info, raw_pmc_df, debug, config):
                                        row[expr] = ""
                                    else:
                                        row[expr] = out
-                                except TypeError:
-                                    row[expr] = ""
+                                except (TypeError, NameError) as e:
+                                    if "empirical_peak" in str(e):
+                                        console_warning(
+                                            f"Missing empirical peak data: {e}. Using empty value."
+                                        )
+                                        row[expr] = ""
+                                    else:
+                                        row[expr] = ""
                                except AttributeError as ae:
                                    if (
                                        str(ae)
@@ -1043,8 +1086,7 @@ def apply_filters(workload, dir, is_gui, debug):
            for kernel_id in workload.filter_kernel_ids:
                if kernel_id >= len(kernels_df["Kernel_Name"]):
                    console_error(
-                        "{} is an invalid kernel id. Please enter an id between 0-{}"
-                        .format(
+                        "{} is an invalid kernel id. Please enter an id between 0-{}".format(
                            kernel_id,
                            len(kernels_df["Kernel_Name"]) - 1,
                        )
@@ -1579,6 +1621,7 @@ def load_table_data(workload, dir, is_gui, args, config, skipKernelTop=False):
        workload.dfs,
        workload.dfs_type,
        workload.sys_info.iloc[0],
+        workload.roofline_peaks,
        apply_filters(workload, dir, is_gui, args.debug),
        args.debug,
        config,
@@ -23,11 +23,15 @@

 ##############################################################################

+
 import csv
 from dataclasses import dataclass
 from pathlib import Path

+import pandas as pd
+
-from utils.logger import console_debug
+from utils.logger import console_debug, console_warning
+from utils.parser import apply_filters, eval_metric

 ################################################
 # Global vars
@@ -154,8 +158,7 @@ def get_color(catagory):
 #                           Plot BW at each cache level
 # -------------------------------------------------------------------------------------
 def calc_ceilings(roofline_parameters, dtype, benchmark_data):
-    """Given benchmarking data, calculate ceilings
-    (or peak performance) for empirical roofline"""
+    """Given benchmarking data, calculate ceilings (or peak performance) for empirical roofline"""
    # TODO: This is where filtering by memory level will need to occur for standalone
    graphPoints = {"hbm": [], "l2": [], "l1": [], "lds": [], "valu": [], "mfma": []}

@@ -186,7 +189,7 @@ def calc_ceilings(roofline_parameters, dtype, benchmark_data):

        if dtype in PEAK_OPS_DATATYPES:
            x2 = peakOps / peakBw
-            y2 = peakOps  # noqa: F841
+            y2 = peakOps

            # Plot MFMA lines (NOTE: Assuming MI200 soc)
            x1_mfma = peakOps / peakBw
@@ -220,9 +223,9 @@ def calc_ceilings(roofline_parameters, dtype, benchmark_data):
        graphPoints[cacheHierarchy[i].lower()].append([y1, peakY])
        graphPoints[cacheHierarchy[i].lower()].append(peakBw)

-    # ---------------------------------------------------------------------------------
+    # -------------------------------------------------------------------------------------
    #                                     Plot computing roof
-    # ---------------------------------------------------------------------------------
+    # -------------------------------------------------------------------------------------
    if dtype in PEAK_OPS_DATATYPES:
        # Plot FMA roof
        x0 = XMAX
@@ -254,9 +257,151 @@ def calc_ceilings(roofline_parameters, dtype, benchmark_data):
 #                              Overlay application performance
 # -------------------------------------------------------------------------------------
 # Calculate relevant metrics for ai calculation
-def calc_ai(mspec, sort_type, ret_df):
-    """Given counter data, calculate arithmetic intensity
-    for each kernel in the application."""
+def calc_ai_analyze(workload, mspec, sort_type, config, arch_config):
+    """
+    Calculate per-kernel metrics and AI points with Roofline yamls using eval_metric.
+    """
+    console_debug("calc_ai_analyze: Starting calc_ai analysis using Roofline yamls")
+    plot_points = {
+        "ai_l1": [[], []],
+        "ai_l2": [[], []],
+        "ai_hbm": [[], []],
+        "kernelNames": [],
+    }
+
+    workload.roofline_metrics = {}
+    filtered_pmc = apply_filters(workload, workload.path, is_gui=False, debug=False)
+
+    kernel_ids_to_process = []
+    kernel_top_table_id = 1
+
+    if workload.filter_kernel_ids:
+        kernel_ids_to_process = workload.filter_kernel_ids
+    else:
+        if kernel_top_table_id in workload.dfs:
+            kernel_top_df = workload.dfs[kernel_top_table_id]
+            kernel_ids_to_process = kernel_top_df.index.tolist()
+            console_debug(
+                "roofline", f"Found {len(kernel_ids_to_process)} kernels to process"
+            )
+
+    if not kernel_ids_to_process:
+        console_warning("No kernels found to process for roofline")
+        return plot_points
+
+    for kernel_id in kernel_ids_to_process:
+        if kernel_top_table_id in workload.dfs:
+            kernel_top_df = workload.dfs[kernel_top_table_id]
+            if kernel_id in kernel_top_df.index:
+                kernel_name = kernel_top_df.loc[kernel_id, "Kernel_Name"]
+            else:
+                continue
+        else:
+            continue
+
+        console_debug("roofline", f"Processing kernel {kernel_id}: {kernel_name[:50]}")
+
+        # filter PMC data for specific kernel
+        kernel_pmc_df = filtered_pmc[
+            filtered_pmc["pmc_perf"]["Kernel_Name"] == kernel_name
+        ]
+
+        if kernel_pmc_df.empty:
+            console_debug("roofline", f"No PMC data for kernel {kernel_id}")
+            continue
+
+        kernel_only_data = {"pmc_perf": kernel_pmc_df["pmc_perf"]}
+
+        kernel_dfs = {}
+        kernel_dfs_type = {}
+
+        for table_id in [401, 402]:
+            if table_id in arch_config.dfs:
+                kernel_dfs[table_id] = arch_config.dfs[table_id].copy()
+                kernel_dfs_type[table_id] = arch_config.dfs_type[table_id]
+
+        # eval metrics for single kernel only
+        eval_metric(
+            kernel_dfs,
+            kernel_dfs_type,
+            workload.sys_info.iloc[0],
+            workload.roofline_peaks,
+            kernel_only_data,
+            debug=False,
+            config=config,
+        )
+
+        # DEBUG
+        if 402 in kernel_dfs:
+            console_debug("roofline", f"Table 402 for kernel {kernel_id}:")
+            for idx, row in kernel_dfs[402].iterrows():
+                console_debug(
+                    "roofline", f"  {row.get('Metric', '')}: {row.get('Value', '')}"
+                )
+
+        ai_hbm = ai_l2 = ai_l1 = performance = 0
+
+        if 402 in kernel_dfs:
+            for idx, row in kernel_dfs[402].iterrows():
+                metric = row.get("Metric", "")
+                value = row.get("Value", 0)
+                if metric == "AI HBM":
+                    ai_hbm = value if value and value != "" else 0
+                elif metric == "AI L2":
+                    ai_l2 = value if value and value != "" else 0
+                elif metric == "AI L1":
+                    ai_l1 = value if value and value != "" else 0
+                elif metric == "Performance (GFLOPs)":
+                    performance = value if value and value != "" else 0
+
+        console_debug(
+            "roofline",
+            f"Kernel {kernel_id}: AI_HBM={ai_hbm:.2f}, AI_L2={ai_l2:.2f}, AI_L1={ai_l1:.2f}, Performance={performance:.2e} GFLOP/s",
+        )
+
+        # add to plot points if we have valid data
+        if performance > 0:
+            if ai_hbm > 0:
+                plot_points["ai_hbm"][0].append(ai_hbm)
+                plot_points["ai_hbm"][1].append(performance)
+            if ai_l2 > 0:
+                plot_points["ai_l2"][0].append(ai_l2)
+                plot_points["ai_l2"][1].append(performance)
+            if ai_l1 > 0:
+                plot_points["ai_l1"][0].append(ai_l1)
+                plot_points["ai_l1"][1].append(performance)
+
+            plot_points["kernelNames"].append(f"K{kernel_id}")
+            console_debug("roofline", f"Added kernel {kernel_id} to plot points")
+        else:
+            console_debug(
+                "roofline", f"Skipping kernel {kernel_id} - no performance data"
+            )
+
+        # store metrics for display
+        workload.roofline_metrics[kernel_id] = {
+            "name": kernel_name,
+            "ai_table": kernel_dfs.get(401, pd.DataFrame()),
+            "calc_table": kernel_dfs.get(402, pd.DataFrame()),
+        }
+
+    console_debug(
+        "roofline", f"Generated {len(plot_points['kernelNames'])} plot points"
+    )
+    console_debug("roofline", f"Plot points: {plot_points}")
+    return plot_points
+
+
+def calc_ai_profile(mspec, sort_type, ret_df):
+    """Given counter data, calculate arithmetic intensity for each kernel in the application.
+    Leverage hard-coded equations to calculate AI values.
+
+    Used during profiling stage to generate roofline PDF, since Roofline yamls are not available
+    in the profiling stage."""
+
+    console_debug(
+        "calc_ai_profile: Starting legacy roofline calculation (from roofline_calc)"
+    )
    df = ret_df["pmc_perf"]
    # Sort by top kernels or top dispatches?
    df = df.sort_values(by=["Kernel_Name"])
@@ -463,7 +608,9 @@ def calc_ai(mspec, sort_type, ret_df):

        calls += 1

-        if sort_type == "kernels" and (at_end or (kernelName != next_kernelName)):
+        if sort_type == "kernels" and (
+            at_end or (kernelName != next_kernelName)
+        ):
            myList.append(
                AI_Data(
                    kernelName,
@@ -538,8 +685,9 @@ def calc_ai(mspec, sort_type, ret_df):
    while i < TOP_N and i != len(myList):
        if myList[i].total_flops == 0:
            console_debug(
-                "No flops counted for {}, arithmetic intensities will not "
-                "display on plots.".format(myList[i].KernelName)
+                "No flops counted for {}, arithmetic intensities will not display on plots.".format(
+                    myList[i].KernelName
+                )
            )

        kernelNames.append(myList[i].KernelName)
@@ -548,40 +696,28 @@ def calc_ai(mspec, sort_type, ret_df):
            if myList[i].L1cache_data
            else intensities["ai_l1"].append(0)
        )
-        # print(
-        #     "cur_ai_L1",
-        #     myList[i].total_flops / myList[i].L1cache_data
-        # ) if myList[i].L1cache_data else print("null")
+        # print("cur_ai_L1", myList[i].total_flops/myList[i].L1cache_data) if myList[i].L1cache_data else print("null")
        # print()
        (
            intensities["ai_l2"].append(myList[i].total_flops / myList[i].L2cache_data)
            if myList[i].L2cache_data
            else intensities["ai_l2"].append(0)
        )
-        # print(
-        #     "cur_ai_L2",
-        #     myList[i].total_flops / myList[i].L2cache_data
-        # ) if myList[i].L2cache_data else print("null")
+        # print("cur_ai_L2", myList[i].total_flops/myList[i].L2cache_data) if myList[i].L2cache_data else print("null")
        # print()
        (
            intensities["ai_hbm"].append(myList[i].total_flops / myList[i].hbm_data)
            if myList[i].hbm_data
            else intensities["ai_hbm"].append(0)
        )
-        # print(
-        #     "cur_ai_hbm",
-        #     myList[i].total_flops / myList[i].hbm_data
-        # ) if myList[i].hbm_data else print("null")
+        # print("cur_ai_hbm", myList[i].total_flops/myList[i].hbm_data) if myList[i].hbm_data else print("null")
        # print()
        (
            curr_perf.append(myList[i].total_flops / myList[i].avgDuration)
            if myList[i].avgDuration
            else curr_perf.append(0)
        )
-        # print(
-        #     "cur_perf",
-        #     myList[i].total_flops / myList[i].avgDuration
-        # ) if myList[i].avgDuration else print("null")
+        # print("cur_perf", myList[i].total_flops/myList[i].avgDuration) if myList[i].avgDuration else print("null")

        i += 1

@@ -590,7 +726,7 @@ def calc_ai(mspec, sort_type, ret_df):
    for i in intensities:
        values = intensities[i]

-        color = get_color(i)  # noqa: F841
+        color = get_color(i)
        x = []
        y = []
        for entryIndx in range(0, len(values)):
@@ -622,8 +758,7 @@ def constuct_roof(roofline_parameters, dtype):
    # -----------------------------------------------------
    # Initialize roofline data dictionary from roofline.csv
    # -----------------------------------------------------
-    # TODO: consider changing this to an ordered dict for consistency over py versions
-    benchmark_data = {}
+    benchmark_data = {}  # TODO: consider changing this to an ordered dict for consistency over py versions
    headers = []
    try:
        with open(benchmark_results, "r") as csvfile:
@@ -641,7 +776,7 @@ def constuct_roof(roofline_parameters, dtype):

                rowCount += 1
        csvfile.close()
-    except Exception:
+    except:
        graphPoints = {
            "hbm": [None, None, None],
            "l2": [None, None, None],
@@ -83,6 +83,7 @@ supported_field = [
    "Avg",
    "Pct of Peak",
    "Peak",
+    "Peak (Empirical)",
    "Count",
    "Mean",
    "Pct",
@@ -32,6 +32,7 @@ from tabulate import tabulate

 import config
 from utils import mem_chart, parser
+from utils.kernel_name_shortener import kernel_name_shortener
 from utils.logger import console_error, console_log, console_warning
 from utils.utils import convert_metric_id_to_panel_info

@@ -146,6 +147,108 @@ def show_all(args, runs, archConfigs, output, profiling_config, roof_plot=None):
            continue
        ss = ""  # store content of all data_source from one panel

+        if panel_id == 400:
+            has_roofline_style = any(
+                data_source.get(type, {}).get("cli_style") == "Roofline"
+                for data_source in panel["data source"]
+                for type in data_source
+            )
+
+            if has_roofline_style and (
+                not args.filter_metrics or "4" in args.filter_metrics
+            ):
+                print("\n" + "=" * 80, file=output)
+                print("4. Roofline", file=output)
+                print("=" * 80, file=output)
+
+                for run_path, workload in runs.items():
+                    if (
+                        hasattr(workload, "roofline_metrics")
+                        and workload.roofline_metrics
+                    ):
+                        print(
+                            "\n(4.1) Per-Kernel Roofline Metrics and (4.2) AI Plot Points",
+                            file=output,
+                        )
+                        print("-" * 80, file=output)
+
+                        kernel_top_df = workload.dfs.get(1, pd.DataFrame())
+                        if not kernel_top_df.empty:
+                            kernel_name_shortener(kernel_top_df, args.kernel_verbose)
+
+                        for i, (kernel_id, metrics) in enumerate(
+                            workload.roofline_metrics.items()
+                        ):
+                            if (
+                                not kernel_top_df.empty
+                                and kernel_id in kernel_top_df.index
+                            ):
+                                kernel_name = kernel_top_df.loc[
+                                    kernel_id, "Kernel_Name"
+                                ]
+                                kernel_pct = (
+                                    kernel_top_df.loc[kernel_id, "Pct"]
+                                    if "Pct" in kernel_top_df.columns
+                                    else 0
+                                )
+                            else:
+                                kernel_name = metrics.get("name", f"Kernel {kernel_id}")
+                                kernel_pct = 0
+
+                            display_name = (
+                                kernel_name[:80] + "..."
+                                if len(kernel_name) > 80
+                                else kernel_name
+                            )
+                            print(
+                                f"\nKernel {kernel_id}: {display_name} ({kernel_pct:.1f}%)",
+                                file=output,
+                            )
+
+                            base_indent = "  "
+                            table_indent_prefix = f"{base_indent}|   "
+
+                            tables = {
+                                401: (
+                                    "4.1 Roofline Rate Metrics:",
+                                    metrics.get("ai_table", pd.DataFrame()),
+                                ),
+                                402: (
+                                    "4.2 Roofline AI Plot Points:",
+                                    metrics.get("calc_table", pd.DataFrame()),
+                                ),
+                            }
+
+                            print(f"{base_indent}|", file=output)
+
+                            for table_id, (table_name, df) in tables.items():
+                                if df.empty:
+                                    continue
+
+                                print(f"{base_indent}├─ {table_name}", file=output)
+
+                                display_df = df.copy()
+
+                                for col in hidden_cols:
+                                    if col in display_df.columns:
+                                        display_df = display_df.drop(columns=[col])
+
+                                table_string = get_table_string(
+                                    display_df, transpose=False, decimal=args.decimal
+                                )
+                                indented_table_string = textwrap.indent(
+                                    table_string, table_indent_prefix
+                                )
+                                print(indented_table_string, file=output)
+
+                    else:
+                        print("\nNo per-kernel metrics available", file=output)
+
+                # Show the roofline plot
+                if roof_plot:
+                    show_roof_plot(roof_plot)
+                continue
+
        for data_source in panel["data source"]:
            for type, table_config in data_source.items():
                # If block filtering was used during analysis, then don't use profiling
@@ -172,16 +275,6 @@ def show_all(args, runs, archConfigs, output, profiling_config, roof_plot=None):
                    )
                    continue

-                # Show roofline
-                # Check if we have filter_metrics for analyze stage:
-                # no filter_metrics = show all,
-                # filter_metrics containing "4" = user requesting roofline chart
-                if panel_id == 400 and (
-                    not args.filter_metrics or "4" in args.filter_metrics
-                ):
-                    show_roof_plot(roof_plot)
-                    continue
-
                # Metrics baseline comparison mode
                # We cannot guarantee that all runs have the same metrics.
                # Only show common metrics.
@@ -454,7 +547,7 @@ def show_roof_plot(roof_plot):
    # TODO: short term solution to display roofline plot
    print("\n" + "-" * 80)
    print("4. Roofline")
-    print("4.1 Roofline")
+    print("4.3 Roofline Plot")
    if roof_plot:
        print(roof_plot)
    else:
@@ -745,7 +745,7 @@ def run_prof(
            config.rocprof_compute_home
            / "rocprof_compute_soc"
            / "profile_configs"
-            / f"counter_defs.yaml",
+            / "counter_defs.yaml",
            "r",
        ) as file:
            counter_defs = yaml.safe_load(file)
@@ -1676,9 +1676,9 @@ class TestSetsIntegration:

        memory_metrics = ["16.1.2", "17.1.0"]
        for metric_id in memory_metrics:
-            assert (
-                metric_id in open(Path(workload_dir) / "log.txt", "r").read()
-            ), f"Expected memory metric {metric_id} not found"
+            assert metric_id in open(Path(workload_dir) / "log.txt", "r").read(), (
+                f"Expected memory metric {metric_id} not found"
+            )

        test_utils.clean_output_dir(config["cleanup"], workload_dir)

@@ -1745,7 +1745,9 @@ class TestSetsIntegration:
        assert returncode == 1
        test_utils.clean_output_dir(config["cleanup"], workload_dir)

-    def test_set_and_block_mutual_exclusion(self, binary_handler_profile_rocprof_compute):
+    def test_set_and_block_mutual_exclusion(
+        self, binary_handler_profile_rocprof_compute
+    ):
        options = ["--set", "compute_thruput_util", "--block", "12"]
        workload_dir = test_utils.get_output_dir()

@@ -30,18 +30,17 @@ import json
 import locale
 import logging
 import os
-import tempfile
 import pathlib
 import re
 import shutil
 import subprocess
+import tempfile
 from pathlib import Path
 from types import SimpleNamespace
 from unittest import mock

 import pandas as pd
 import pytest
-import yaml

 import utils.utils as utils

@@ -23,12 +23,12 @@ src/rocprof_compute_soc/analysis_configs/gfx940/0300_memory_chart.yaml: cff5509a
 src/rocprof_compute_soc/analysis_configs/gfx941/0300_memory_chart.yaml: cff5509ac8502bad6dbd75e3058159fe429aece5d93279c66b2a6a8c887b43b6
 src/rocprof_compute_soc/analysis_configs/gfx942/0300_memory_chart.yaml: cff5509ac8502bad6dbd75e3058159fe429aece5d93279c66b2a6a8c887b43b6
 src/rocprof_compute_soc/analysis_configs/gfx950/0300_memory_chart.yaml: 643b31ffa43bc3613d6f90b0c23d95093d0d0aa5bc8e72d9a0fbc1b739a08b67
-src/rocprof_compute_soc/analysis_configs/gfx908/0400_roofline.yaml: 57f95dcd487dfcdf24e1c2d8eb16d14dc3462df83d11a08e7de2b06343b48c3e
-src/rocprof_compute_soc/analysis_configs/gfx90a/0400_roofline.yaml: 57f95dcd487dfcdf24e1c2d8eb16d14dc3462df83d11a08e7de2b06343b48c3e
-src/rocprof_compute_soc/analysis_configs/gfx940/0400_roofline.yaml: 57f95dcd487dfcdf24e1c2d8eb16d14dc3462df83d11a08e7de2b06343b48c3e
-src/rocprof_compute_soc/analysis_configs/gfx941/0400_roofline.yaml: 57f95dcd487dfcdf24e1c2d8eb16d14dc3462df83d11a08e7de2b06343b48c3e
-src/rocprof_compute_soc/analysis_configs/gfx942/0400_roofline.yaml: 57f95dcd487dfcdf24e1c2d8eb16d14dc3462df83d11a08e7de2b06343b48c3e
-src/rocprof_compute_soc/analysis_configs/gfx950/0400_roofline.yaml: 57f95dcd487dfcdf24e1c2d8eb16d14dc3462df83d11a08e7de2b06343b48c3e
+src/rocprof_compute_soc/analysis_configs/gfx908/0400_roofline.yaml: 6406ce67cd55064f0d2db2a3511c6536cc1625314ddb31366900fbf3c60ed523
+src/rocprof_compute_soc/analysis_configs/gfx90a/0400_roofline.yaml: 100d555cf9e70b892e22f92ddd9c0a5d1f914d07077c4a8d35941e8ad62b5b30
+src/rocprof_compute_soc/analysis_configs/gfx940/0400_roofline.yaml: f8bf66f43c9afede4fd1f17c279050cc27cc6fbc1cdb53a71ae8ceb0eb84dc37
+src/rocprof_compute_soc/analysis_configs/gfx941/0400_roofline.yaml: 6fae04dcf4bcabe4a71f5d9eefc379a38d30cdf05fbb14e2c276e1c272fdb3f6
+src/rocprof_compute_soc/analysis_configs/gfx942/0400_roofline.yaml: c8dfe7df24f94dfa229ffa2035b802c6833ce98f7710e0889bc5710f2167d4c0
+src/rocprof_compute_soc/analysis_configs/gfx950/0400_roofline.yaml: 734fdfa818bfd8a87e01a0dd795c502a567c72158ca9b7bfe01e99451e8aa537
 src/rocprof_compute_soc/analysis_configs/gfx908/0500_command_processor_cpc_cpf.yaml: da1c2997d42d66da2aa069caa741cf9eade124357c56e4290333de2f3e0412bb
 src/rocprof_compute_soc/analysis_configs/gfx90a/0500_command_processor_cpc_cpf.yaml: da1c2997d42d66da2aa069caa741cf9eade124357c56e4290333de2f3e0412bb
 src/rocprof_compute_soc/analysis_configs/gfx940/0500_command_processor_cpc_cpf.yaml: da1c2997d42d66da2aa069caa741cf9eade124357c56e4290333de2f3e0412bb
@@ -87,7 +87,9 @@ def update_analysis_config():
                        data_source_config["metric_table"]["metric"],
                        gfx_version,
                    )
-                new_panel_config["Panel Config"]["data source"].append(data_source_config)
+                new_panel_config["Panel Config"]["data source"].append(
+                    data_source_config
+                )
            # Write panel config to file
            filename = Path(
                TARGET_DIR.joinpath(gfx_version, f"{panel_id}_{panel_title}.yaml")
@@ -134,9 +136,9 @@ def update_sets_config():
            }

            for metric_id in sets["metric"][gfx_version]:
-                current_set["metric"].append(
-                    {metric_id: METRIC_ID_TO_NAME_MAP[gfx_version][str(metric_id)]}
-                )
+                current_set["metric"].append({
+                    metric_id: METRIC_ID_TO_NAME_MAP[gfx_version][str(metric_id)]
+                })

            new_sets["sets"].append(current_set)

@@ -2801,9 +2801,963 @@ panels:
 - id: 400
  title: Roofline
  data source:
-  - None:
+  - metric_table:
      id: 401
-      title: Roofline
+      title: Roofline Performance Rates
+      cli_style: Roofline
+      tui_style: Roofline
+      header:
+        metric: Metric
+        value: Value
+        unit: Unit
+        peak: Peak (Empirical)
+      metric:
+        gfx90a:
+          VALU FLOPs:
+            value: AVG((($wave_size * (
+              (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
+              (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
+              (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+              )) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
+          MFMA FLOPs (F64):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMAF64Flops_empirical_peak
+          MFMA FLOPs (F32):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMAF32Flops_empirical_peak
+          MFMA FLOPs (F16):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMAF16Flops_empirical_peak
+          MFMA FLOPs (BF16):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMABF16Flops_empirical_peak
+          MFMA IOPs (Int8):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GIOP/s
+            peak: $MFMAI8Ops_empirical_peak
+          HBM Bandwidth:
+            value: AVG(((
+              (TCC_EA_RDREQ_32B_sum * 32) +
+              ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) * 64) +
+              (TCC_EA_WRREQ_64B_sum * 64) +
+              ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) * 32)
+              ) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GB/s
+            peak: $HBMBw_empirical_peak
+          L2 Cache Bandwidth:
+            value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
+              TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) *
+              64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GB/s
+            peak: $L2Bw_empirical_peak
+          L1 Cache Bandwidth:
+            value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GB/s
+            peak: $L1Bw_empirical_peak
+          LDS Bandwidth:
+            value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) *
+              4 * $lds_banks_per_cu)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GB/s
+            peak: $LDSBw_empirical_peak
+        gfx908:
+          VALU FLOPs:
+            value: AVG((($wave_size * (
+              (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
+              (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
+              (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+              )) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
+          MFMA FLOPs (F64):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMAF64Flops_empirical_peak
+          MFMA FLOPs (F32):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMAF32Flops_empirical_peak
+          MFMA FLOPs (F16):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMAF16Flops_empirical_peak
+          MFMA FLOPs (BF16):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMABF16Flops_empirical_peak
+          MFMA IOPs (Int8):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GIOP/s
+            peak: $MFMAI8Ops_empirical_peak
+          HBM Bandwidth:
+            value: AVG(((
+              (TCC_BUBBLE_sum * 128) +
+              (TCC_EA0_RDREQ_32B_sum * 32) +
+              ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) +
+              ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) +
+              (TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GB/s
+            peak: $HBMBw_empirical_peak
+          L2 Cache Bandwidth:
+            value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
+              TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) *
+              64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GB/s
+            peak: $L2Bw_empirical_peak
+          L1 Cache Bandwidth:
+            value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GB/s
+            peak: $L1Bw_empirical_peak
+          LDS Bandwidth:
+            value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) *
+              4 * $lds_banks_per_cu)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GB/s
+            peak: $LDSBw_empirical_peak
+        gfx940:
+          VALU FLOPs:
+            value: AVG(($wave_size * (
+              (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
+              (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
+              (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+              ) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
+          MFMA FLOPs (F64):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMAF64Flops_empirical_peak
+          MFMA FLOPs (F32):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMAF32Flops_empirical_peak
+          MFMA FLOPs (F16):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMAF16Flops_empirical_peak
+          MFMA FLOPs (BF16):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMABF16Flops_empirical_peak
+          MFMA FLOPs (F8):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMAF8Flops_empirical_peak
+          MFMA IOPs (Int8):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GIOP/s
+            peak: $MFMAI8Ops_empirical_peak
+          HBM Bandwidth:
+            value: AVG(((
+              (TCC_BUBBLE_sum * 128) +
+              (TCC_EA0_RDREQ_32B_sum * 32) +
+              ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) +
+              ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) +
+              (TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GB/s
+            peak: $HBMBw_empirical_peak
+          L2 Cache Bandwidth:
+            value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
+              TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) *
+              64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GB/s
+            peak: $L2Bw_empirical_peak
+          L1 Cache Bandwidth:
+            value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GB/s
+            peak: $L1Bw_empirical_peak
+          LDS Bandwidth:
+            value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) *
+              4 * $lds_banks_per_cu)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GB/s
+            peak: $LDSBw_empirical_peak
+        gfx941:
+          VALU FLOPs:
+            value: AVG(($wave_size * (
+              (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
+              (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
+              (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+              ) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
+          MFMA FLOPs (F64):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMAF64Flops_empirical_peak
+          MFMA FLOPs (F32):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMAF32Flops_empirical_peak
+          MFMA FLOPs (F16):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMAF16Flops_empirical_peak
+          MFMA FLOPs (BF16):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMABF16Flops_empirical_peak
+          MFMA FLOPs (F8):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMAF8Flops_empirical_peak
+          MFMA IOPs (Int8):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GIOP/s
+            peak: $MFMAI8Ops_empirical_peak
+          HBM Bandwidth:
+            value: AVG(((
+              (TCC_BUBBLE_sum * 128) +
+              (TCC_EA0_RDREQ_32B_sum * 32) +
+              ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) +
+              ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) +
+              (TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GB/s
+            peak: $HBMBw_empirical_peak
+          L2 Cache Bandwidth:
+            value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
+              TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) *
+              64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GB/s
+            peak: $L2Bw_empirical_peak
+          L1 Cache Bandwidth:
+            value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GB/s
+            peak: $L1Bw_empirical_peak
+          LDS Bandwidth:
+            value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) *
+              4 * $lds_banks_per_cu)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GB/s
+            peak: $LDSBw_empirical_peak
+        gfx942:
+          VALU FLOPs:
+            value: AVG((($wave_size * (
+              (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
+              (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
+              (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+              )) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
+          MFMA FLOPs (F64):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMAF64Flops_empirical_peak
+          MFMA FLOPs (F32):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMAF32Flops_empirical_peak
+          MFMA FLOPs (F16):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMAF16Flops_empirical_peak
+          MFMA FLOPs (BF16):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMABF16Flops_empirical_peak
+          MFMA FLOPs (F8):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMAF8Flops_empirical_peak
+          MFMA IOPs (Int8):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GIOP/s
+            peak: $MFMAI8Ops_empirical_peak
+          HBM Bandwidth:
+            value: AVG(((
+              (TCC_BUBBLE_sum * 128) +
+              (TCC_EA0_RDREQ_32B_sum * 32) +
+              ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) +
+              ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) +
+              (TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GB/s
+            peak: $HBMBw_empirical_peak
+          L2 Cache Bandwidth:
+            value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
+              TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) *
+              64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GB/s
+            peak: $L2Bw_empirical_peak
+          L1 Cache Bandwidth:
+            value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GB/s
+            peak: $L1Bw_empirical_peak
+          LDS Bandwidth:
+            value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) *
+              4 * $lds_banks_per_cu)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GB/s
+            peak: $LDSBw_empirical_peak
+        gfx950:
+          VALU FLOPs:
+            value: AVG((($wave_size * (
+              (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
+              (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
+              (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+              )) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
+          MFMA FLOPs (F64):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMAF64Flops_empirical_peak
+          MFMA FLOPs (F32):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMAF32Flops_empirical_peak
+          MFMA FLOPs (F16):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMAF16Flops_empirical_peak
+          MFMA FLOPs (BF16):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMABF16Flops_empirical_peak
+          MFMA FLOPs (F8):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMAF8Flops_empirical_peak
+          MFMA FLOPs (F6F4):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GFLOP/s
+            peak: $MFMA_FLOPs_F6F4_empirical_peak
+          MFMA IOPs (Int8):
+            value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GIOP/s
+            peak: $MFMAI8Ops_empirical_peak
+          HBM Bandwidth:
+            value: AVG(((
+              (TCC_BUBBLE_sum * 128) +
+              (TCC_EA0_RDREQ_32B_sum * 32) +
+              ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) +
+              ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) +
+              (TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GB/s
+            peak: $HBMBw_empirical_peak
+          L2 Cache Bandwidth:
+            value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
+              TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) *
+              64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GB/s
+            peak: $L2Bw_empirical_peak
+          L1 Cache Bandwidth:
+            value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GB/s
+            peak: $L1Bw_empirical_peak
+          LDS Bandwidth:
+            value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) *
+              4 * $lds_banks_per_cu)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+            unit: GB/s
+            peak: $LDSBw_empirical_peak
+  - metric_table:
+      id: 402
+      title: Roofline Plot Points
+      cli_style: Roofline
+      tui_style: Roofline
+      header:
+        metric: Metric
+        value: Value
+        unit: Unit
+      metric:
+        gfx90a:
+          AI HBM:
+            value: (
+              SUM(
+                ($wave_size * (
+                  (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
+                  (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
+                  (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+                )) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)
+              ) /
+              SUM(
+                (TCC_EA_RDREQ_32B_sum * 32) +
+                ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) * 64) +
+                (TCC_EA_WRREQ_64B_sum * 64) +
+                ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) * 32)
+              )
+              )
+            unit: FLOPs/Byte
+          AI L2:
+            value: (
+              SUM(
+                ($wave_size * (
+                  (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
+                  (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
+                  (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+                )) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)
+              ) /
+              SUM(
+                (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
+                TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64
+              )
+              )
+            unit: FLOPs/Byte
+          AI L1:
+            value: (
+              SUM(
+                ($wave_size * (
+                  (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
+                  (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
+                  (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+                )) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)
+              ) /
+              SUM(TCP_TOTAL_CACHE_ACCESSES_sum * 64)
+              )
+            unit: FLOPs/Byte
+          Performance GFLOPs:
+            value: (
+              SUM(
+                ($wave_size * (
+                  (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
+                  (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
+                  (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+                )) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)
+              ) /
+              (SUM(End_Timestamp - Start_Timestamp) / 1e9)
+              ) / 1e9
+            unit: GFLOP/s
+        gfx908:
+          AI HBM:
+            value: (
+              SUM(
+                ($wave_size * (
+                  (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
+                  (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
+                  (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+                )) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)
+              ) /
+              SUM(
+                (TCC_BUBBLE_sum * 128) +
+                (TCC_EA0_RDREQ_32B_sum * 32) +
+                ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) +
+                ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) +
+                (TCC_EA0_WRREQ_64B_sum * 64)
+              )
+              )
+            unit: FLOPs/Byte
+          AI L2:
+            value: (
+              SUM(
+                ($wave_size * (
+                  (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
+                  (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
+                  (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+                )) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)
+              ) /
+              SUM(
+                (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
+                TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64
+              )
+              )
+            unit: FLOPs/Byte
+          AI L1:
+            value: (
+              SUM(
+                ($wave_size * (
+                  (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
+                  (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
+                  (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+                )) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)
+              ) /
+              SUM(TCP_TOTAL_CACHE_ACCESSES_sum * 64)
+              )
+            unit: FLOPs/Byte
+          Performance GFLOPs:
+            value: (
+              SUM(
+                ($wave_size * (
+                  (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
+                  (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
+                  (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+                )) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)
+              ) /
+              (SUM(End_Timestamp - Start_Timestamp) / 1e9)
+              ) / 1e9
+            unit: GFLOP/s
+        gfx940:
+          AI HBM:
+            value: (
+              SUM(
+                ($wave_size * (
+                  (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
+                  (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
+                  (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+                )) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)
+              ) /
+              SUM(
+                (TCC_BUBBLE_sum * 128) +
+                (TCC_EA0_RDREQ_32B_sum * 32) +
+                ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) +
+                ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) +
+                (TCC_EA0_WRREQ_64B_sum * 64)
+              )
+              )
+            unit: FLOPs/Byte
+          AI L2:
+            value: (
+              SUM(
+                ($wave_size * (
+                  (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
+                  (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
+                  (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+                )) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)
+              ) /
+              SUM(
+                (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
+                TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64
+              )
+              )
+            unit: FLOPs/Byte
+          AI L1:
+            value: (
+              SUM(
+                ($wave_size * (
+                  (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
+                  (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
+                  (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+                )) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)
+              ) /
+              SUM(TCP_TOTAL_CACHE_ACCESSES_sum * 64)
+              )
+            unit: FLOPs/Byte
+          Performance GFLOPs:
+            value: (
+              SUM(
+                ($wave_size * (
+                  (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
+                  (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
+                  (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+                )) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)
+              ) /
+              (SUM(End_Timestamp - Start_Timestamp) / 1e9)
+              ) / 1e9
+            unit: GFLOP/s
+        gfx941:
+          AI HBM:
+            value: (
+              SUM(
+                ($wave_size * (
+                  (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
+                  (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
+                  (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+                )) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)
+              ) /
+              SUM(
+                (TCC_BUBBLE_sum * 128) +
+                (TCC_EA0_RDREQ_32B_sum * 32) +
+                ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) +
+                ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) +
+                (TCC_EA0_WRREQ_64B_sum * 64)
+              )
+              )
+            unit: FLOPs/Byte
+          AI L2:
+            value: (
+              SUM(
+                ($wave_size * (
+                  (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
+                  (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
+                  (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+                )) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)
+              ) /
+              SUM(
+                (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
+                TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64
+              )
+              )
+            unit: FLOPs/Byte
+          AI L1:
+            value: (
+              SUM(
+                ($wave_size * (
+                  (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
+                  (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
+                  (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+                )) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)
+              ) /
+              SUM(TCP_TOTAL_CACHE_ACCESSES_sum * 64)
+              )
+            unit: FLOPs/Byte
+          Performance GFLOPs:
+            value: (
+              SUM(
+                ($wave_size * (
+                  (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
+                  (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
+                  (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+                )) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)
+              ) /
+              (SUM(End_Timestamp - Start_Timestamp) / 1e9)
+              ) / 1e9
+            unit: GFLOP/s
+        gfx942:
+          AI HBM:
+            value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+              + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+              + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+              + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+              + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+              (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+              512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
+              * 512) ) / SUM( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32)
+              + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64)
+              + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum
+              * 64) ) )
+            unit: FLOPs/Byte
+          AI L2:
+            value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+              + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+              + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+              + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+              + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+              (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+              512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
+              * 512) ) / SUM( (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
+              TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64 ) )
+            unit: FLOPs/Byte
+          AI L1:
+            value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+              + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+              + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+              + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+              + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+              (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+              512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
+              * 512) ) / SUM( TCP_TOTAL_CACHE_ACCESSES_sum * 64 ) )
+            unit: FLOPs/Byte
+          Performance (GFLOPs):
+            value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+              + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+              + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+              + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+              + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+              (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
+              512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
+              * 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9
+            unit: GFLOP/s
+        gfx950:
+          AI HBM:
+            value: (
+              SUM(
+                ($wave_size * (
+                  (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
+                  (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
+                  (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+                )) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512)
+              ) /
+              SUM(
+                (TCC_BUBBLE_sum * 128) +
+                (TCC_EA0_RDREQ_32B_sum * 32) +
+                ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) +
+                ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) +
+                (TCC_EA0_WRREQ_64B_sum * 64)
+              )
+              )
+            unit: FLOPs/Byte
+          AI L2:
+            value: (
+              SUM(
+                ($wave_size * (
+                  (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
+                  (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
+                  (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+                )) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512)
+              ) /
+              SUM(
+                (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + 
+                TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64
+              )
+              )
+            unit: FLOPs/Byte
+          AI L1:
+            value: (
+              SUM(
+                ($wave_size * (
+                  (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
+                  (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
+                  (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+                )) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512)
+              ) /
+              SUM(TCP_TOTAL_CACHE_ACCESSES_sum * 64)
+              )
+            unit: FLOPs/Byte
+          Performance (GFLOPs):
+            value: (
+              SUM(
+                ($wave_size * (
+                  (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
+                  (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
+                  (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+                )) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) +
+                (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512)
+              ) /
+              (SUM(End_Timestamp - Start_Timestamp) / 1e9)
+              ) / 1e9
+            unit: GFLOP/s
+  metrics_description:
+      VALU FLOPs:
+        plain: 'The total floating-point operations executed per second on the VALU.
+          This is also presented as a percent of the peak theoretical FLOPs achievable
+          on the specific accelerator. Note: this does not include any floating-point
+          operations from MFMA instructions.'
+        rst: 'The total floating-point operations executed per second on the :ref:`VALU
+          <desc-valu>`. This is also presented as a percent of the peak theoretical
+          FLOPs achievable on the specific accelerator. Note: this does not include
+          any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.'
+        unit: GFLOPs
+      MFMA FLOPs (F8):
+        plain: The total number of 8-bit floating point MFMA operations executed
+          per second. This does not include any 8-bit floating point operations
+          from VALU instructions. The peak empirically measured F8 MFMA operations
+          achievable on the specific accelerator is displayed alongside for comparison.
+          This metric is supported only on AMD Instinct MI300 series and later.
+        rst: 'The total number of 8-bit floating point :ref:`MFMA <desc-mfma>`
+          operations executed per second. Note: this does not include any 8-bit
+          floating point operations from :ref:`VALU <desc-valu>` instructions. The
+          peak empirically measured F8 MFMA operations achievable on the specific
+          accelerator is displayed alongside for comparison. This metric is supported
+          only on AMD Instinct MI300 series and later.'
+        unit: GFLOPs
+      MFMA FLOPs (BF16):
+        plain: 'The total number of 16-bit brain floating point MFMA operations executed
+          per second. Note: this does not include any 16-bit brain floating point
+          operations from VALU instructions. The peak empirically measured BF16 MFMA
+          operations achievable on the specific accelerator is displayed alongside
+          for comparison.'
+        rst: 'The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
+          operations executed per second. Note: this does not include any 16-bit brain
+          floating point operations from :ref:`VALU <desc-valu>` instructions. The
+          peak empirically measured BF16 MFMA operations achievable on the specific
+          accelerator is displayed alongside for comparison.'
+        unit: GFLOPs
+      MFMA FLOPs (F16):
+        plain: 'The total number of 16-bit floating point MFMA operations executed per
+          second. Note: this does not include any 16-bit floating point operations from
+          VALU instructions. The peak empirically measured F16 MFMA operations
+          achievable on the specific accelerator is displayed alongside for comparison.'
+        rst: 'The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
+          executed per second. Note: this does not include any 16-bit floating point
+          operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
+          measured F16 MFMA operations achievable on the specific accelerator is
+          displayed alongside for comparison.'
+        unit: GFLOPs
+      MFMA FLOPs (F32):
+        plain: 'The total number of 32-bit floating point MFMA operations executed per
+          second. Note: this does not include any 32-bit floating point operations from
+          VALU instructions. The peak empirically measured F32 MFMA operations
+          achievable on the specific accelerator is displayed alongside for comparison.'
+        rst: 'The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
+          executed per second. Note: this does not include any 32-bit floating point
+          operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
+          measured F32 MFMA operations achievable on the specific accelerator is
+          displayed alongside for comparison.'
+        unit: GFLOPs
+      MFMA FLOPs (F64):
+        plain: 'The total number of 64-bit floating point MFMA operations executed per
+          second. Note: this does not include any 64-bit floating point operations from
+          VALU instructions. The peak empirically measured F64 MFMA operations
+          achievable on the specific accelerator is displayed alongside for comparison.'
+        rst: 'The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
+          executed per second. Note: this does not include any 64-bit floating point
+          operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
+          measured F64 MFMA operations achievable on the specific accelerator is
+          displayed alongside for comparison.'
+        unit: GFLOPs
+      MFMA IOPs (Int8):
+        plain: 'The total number of 8-bit integer MFMA operations executed per second.
+          Note: this does not include any 8-bit integer operations from VALU instructions.
+          The peak empirically measured INT8 MFMA operations achievable on the specific
+          accelerator is displayed alongside for comparison.'
+        rst: 'The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
+          per second. Note: this does not include any 8-bit integer operations from
+          :ref:`VALU <desc-valu>` instructions. The peak empirically measured INT8 MFMA
+          operations achievable on the specific accelerator is displayed alongside
+          for comparison.'
+        unit: GIOPs
+      HBM Bandwidth:
+        plain: 'The total number of bytes read from and written to High-Bandwidth
+            Memory (HBM) per second. The peak empirically measured bandwidth achievable
+            on the specific accelerator is displayed alongside for comparison.'
+        rst: 'The total number of bytes read from and written to High-Bandwidth
+            Memory (HBM) per second. The peak empirically measured bandwidth achievable
+            on the specific accelerator is displayed alongside for comparison.'
+        unit: GB/s
+      L2 Cache Bandwidth:
+        plain: The number of bytes looked up in the L2 cache per unit time. The number
+          of bytes is calculated as the number of cache lines requested multiplied by
+          the cache line size. This value does not consider partial requests; for example,
+          if only a single value is requested in a cache line, the data movement will
+          still be counted as a full cache line. The peak empirically measured bandwidth
+          achievable on the specific accelerator is displayed alongside for comparison.
+        rst: The number of bytes looked up in the L2 cache per unit time. The number of
+          bytes is calculated as the number of cache lines requested multiplied by
+          the cache line size. This value does not consider partial requests; for example,
+          if only a single value is requested in a cache line, the data movement will
+          still be counted as a full cache line. The peak empirically measured
+          bandwidth achievable on the specific accelerator is displayed alongside
+          for comparison.
+        unit: GB/s
+      L1 Cache Bandwidth:
+        plain: The number of bytes looked up in the vL1D cache as a result of VMEM
+          instructions per unit time. The number of bytes is calculated as the number
+          of cache lines requested multiplied by the cache line size. This value does
+          not consider partial requests; for example, if only a single value is requested
+          in a cache line, the data movement will still be counted as a full cache line.
+          The peak empirically measured bandwidth achievable on the specific accelerator
+          is displayed alongside for comparison.
+        rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM
+          <desc-vmem>` instructions per unit time. The number of bytes is calculated
+          as the number of cache lines requested multiplied by the cache line size.
+          This value does not consider partial requests; for example, if only a single
+          value is requested in a cache line, the data movement will still be counted
+          as a full cache line. The peak empirically measured bandwidth achievable on
+          the specific accelerator is displayed alongside for comparison.
+        unit: GB/s
+      LDS Bandwidth:
+        plain: Indicates the maximum number of bytes that could have been loaded from,
+          stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
+          example for more detail). The peak empirically measured LDS bandwidth
+          achievable on the specific accelerator is displayed alongside for comparison.
+        rst: Indicates the maximum number of bytes that could have been loaded from,
+          stored to, or atomically updated in the LDS per unit time (see :ref:`LDS
+          Bandwidth <lds-bandwidth>` example for more detail). The peak empirically
+          measured LDS bandwidth achievable on the specific accelerator is displayed
+          alongside for comparison.
+        unit: GB/s
+      AI L1:
+        plain: 'The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
+          of total floating-point operations (FLOPs) to total bytes transferred between
+          the L1 cache and the processing units. This value is used as the x-coordinate
+          for the L1 roofline.'
+        rst: 'The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
+          of total floating-point operations (FLOPs) to total bytes transferred between
+          the L1 cache and the processing units. This value is used as the x-coordinate
+          for the L1 roofline.'
+        unit: FLOPs/Byte
+      AI L2:
+        plain: 'The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
+          of total floating-point operations (FLOPs) to total bytes transferred between
+          the L2 cache and the L1 cache. This value is used as the x-coordinate for
+          the L2 roofline.'
+        rst: 'The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
+          of total floating-point operations (FLOPs) to total bytes transferred between
+          the L2 cache and the L1 cache. This value is used as the x-coordinate for
+          the L2 roofline.'
+        unit: FLOPs/Byte
+      AI HBM:
+        plain: 'The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
+          It is the ratio of total floating-point operations (FLOPs) to total bytes
+          transferred between HBM and the L2 cache. This value is used as the x-coordinate
+          for the HBM roofline.'
+        rst: 'The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
+          It is the ratio of total floating-point operations (FLOPs) to total bytes
+          transferred between HBM and the L2 cache. This value is used as the x-coordinate
+          for the HBM roofline.'
+        unit: FLOPs/Byte
+      Performance (GFLOPs):
+        plain: 'The overall achieved performance, measured in GigaFLOPs
+          per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
+          operations divided by the total execution time. This value is used as the y-coordinate
+          for the kernel''s point on the Roofline plot.'
+        rst: 'The overall achieved performance, measured in GigaFLOPs
+          per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
+          operations divided by the total execution time. This value is used as the y-coordinate
+          for the kernel''s point on the Roofline plot.'
+        unit: GFLOP/s
 - id: 500
  title: Command Processor (CPC/CPF)
  data source: