Update Unit of Bandwidth metrics to Gbps (#96)

* Add Utilization to metric name for Bandwidth related metrics whose Unit is Percent
* Update Unit of Bandwidth metrics to Gbps
* Update metric Formula to use total duration as denominator instead of normalization unit.
* Update metric Description
* Update metric Unit
* Update CHANGELOG
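
The formula change can be sketched as follows. This is a minimal illustration, not the profiler's implementation: `bandwidth_from_duration` is a hypothetical helper, and the timestamps are assumed to be in nanoseconds, so bytes per nanosecond equals 10^9 bytes per second (the value reported under the `Gbps` unit label).

```python
from typing import Optional


def bandwidth_from_duration(
    total_bytes: float, start_timestamp_ns: float, end_timestamp_ns: float
) -> Optional[float]:
    """Divide bytes moved by total kernel duration (hypothetical helper).

    Mirrors the shape of the updated formulas, e.g.
    (SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp).
    Assumes nanosecond timestamps; that unit is an assumption here.
    """
    duration_ns = end_timestamp_ns - start_timestamp_ns
    if duration_ns <= 0:
        # No meaningful rate for an empty or inverted interval.
        return None
    # bytes / ns is numerically equal to 1e9 bytes / s.
    return total_bytes / duration_ns


# Illustrative numbers only: 1,000,000 requests of 64 bytes over 2 ms.
example = bandwidth_from_duration(1_000_000 * 64, 0, 2_000_000)
```

Unlike the previous `$denom` form, the result no longer depends on the chosen normalization unit.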
2025-08-06 18:39:50 -04:00
parent a10d897a69
commit 89c74ac3d3
34 changed files with 1088 additions and 988 deletions
@@ -27,6 +27,26 @@ Full documentation for ROCm Compute Profiler is available at [https://rocm.docs.

 * Change the basic view of TUI from aggregated analysis data to individual kernel analysis data

+* Update `Unit` of the following `Bandwidth` related metrics to `Gbps` instead of `Bytes per Normalization Unit`
+  * Theoretical Bandwidth (section 1202)
+  * L1I-L2 Bandwidth (section 1303)
+  * sL1D-L2 BW (section 1403)
+  * Cache BW (section 1603)
+  * L1-L2 BW (section 1603)
+  * Read BW (section 1702)
+  * Write and Atomic BW (section 1702)
+  * Bandwidth (section 1703)
+  * Atomic/Read/Write Bandwidth (section 1703)
+  * Atomic/Read/Write Bandwidth - (HBM/PCIe/Infinity Fabric) (section 1706)
+
+* Add `Utilization` to metric name for the following `Bandwidth` related metrics whose `Unit` is `Percent`
+  * Theoretical Bandwidth Utilization (section 1201)
+  * L1I-L2 Bandwidth Utilization (section 1301)
+  * Bandwidth Utilization (section 1301)
+  * Bandwidth Utilization (section 1401)
+  * sL1D-L2 BW Utilization (section 1401)
+  * Bandwidth Utilization (section 1601)
+
 ### Resolved issues

 * Fixed not detecting memory clock issue when using amd-smi
@@ -397,13 +397,13 @@ LDS Speed-of-Light:
      over the number of LDS cycles that would have been  required to move the same
      amount of data in an uncontended access. [#lds-bank-conflict]_
    unit: Percent
-  Theoretical Bandwidth:
+  Theoretical Bandwidth Utilization:
    rst: Indicates the maximum amount of bytes that could have been loaded from,  stored
-      to, or atomically updated in the LDS per  :ref:`normalization unit <normalization-units>`.
+      to, or atomically updated in the LDS, as a percentage of the theoretical peak.
      Does *not* take into  account the execution mask of the wavefront when the instruction
      was  executed. See the  :ref:`LDS bandwidth example <lds-bandwidth>` for more
      detail.
-    unit: Bytes per normalization unit
+    unit: Percent
  Utilization:
    rst: Indicates what percent of the kernel's duration the :ref:`LDS <desc-lds>`  was
      actively executing instructions (including, but not limited to, load,  store,
@@ -450,17 +450,16 @@ LDS Statistics:
    unit: Accesses per normalization unit
  Theoretical Bandwidth:
    rst: Indicates the maximum amount of bytes that could have been loaded from,  stored
-      to, or atomically updated in the LDS per  :ref:`normalization unit <normalization-units>`.
-      Does *not* take into  account the execution mask of the wavefront when the instruction
-      was  executed. See the  :ref:`LDS bandwidth example <lds-bandwidth>` for more
-      detail.
-    unit: Bytes per normalization unit
+      to, or atomically updated in the LDS divided by total duration. Does *not* take
+      into  account the execution mask of the wavefront when the instruction was  executed.
+      See the  :ref:`LDS bandwidth example <lds-bandwidth>` for more detail.
+    unit: Gbps
  Unaligned Stall:
    rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>`  due
      to stalls from non-dword aligned addresses per  :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
 vL1D Speed-of-Light:
-  Bandwidth:
+  Bandwidth Utilization:
    rst: The number of bytes looked up in the vL1D cache as a result of  :ref:`VMEM
      <desc-vmem>` instructions, as a percent of the peak  theoretical bandwidth achievable
      on the specific accelerator. The number  of bytes is calculated as the number
@@ -614,13 +613,13 @@ vL1D cache access metrics:
    rst: The total number of cache line lookups in the vL1D.
    unit: Cache lines
  Cache BW:
-    rst: The number of bytes looked up in the vL1D cache as a result of  :ref:`VMEM
-      <desc-vmem>` instructions per  :ref:`normalization unit <normalization-units>`.  The
-      number of bytes is  calculated as the number of cache lines requested multiplied
-      by the cache  line size.  This value does not consider partial requests, so
-      for  instance, if only a single value is requested in a cache line, the data  movement
-      will still be counted as a full cache line.
-    unit: Bytes per normalization unit
+    rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM
+      <desc-vmem>` instructions divided by total duration. The number of bytes is
+      calculated as the number of cache lines requested multiplied by the cache line
+      size. This value does not consider partial requests, so for  instance, if only
+      a single value is requested in a cache line, the data movement will still be
+      counted as a full cache line.
+    unit: Gbps
  Cache Hit Rate:
    rst: The ratio of the number of vL1D cache line requests that hit in vL1D  cache
      over the total number of cache line requests to the  :ref:`vL1D Cache RAM <desc-tc>`.
@@ -646,12 +645,12 @@ vL1D cache access metrics:
    unit: Requests per normalization unit
  L1-L2 BW:
    rst: The number of bytes transferred across the vL1D-L2 interface as a result  of
-      :ref:`VMEM <desc-vmem>` instructions, per  :ref:`normalization unit <normalization-units>`.
-      The number of bytes is  calculated as the number of cache lines requested multiplied
-      by the cache  line size. This value does not consider partial requests, so for  instance,
+      :ref:`VMEM <desc-vmem>` instructions, divided by total duration. The number
+      of bytes is  calculated as the number of cache lines requested multiplied by
+      the cache  line size. This value does not consider partial requests, so for  instance,
      if only a single value is requested in a cache line, the data  movement will
      still be counted as a full cache line.
-    unit: Bytes per normalization unit
+    unit: Gbps
  L1-L2 Read:
    rst: The number of read requests for a vL1D cache line that were not satisfied  by
      the vL1D and must be retrieved from the  :doc:`L2 Cache <l2-cache>` per  :ref:`normalization
@@ -761,20 +760,20 @@ L2 Speed-of-Light:
    unit: Percent
 L2 cache accesses:
  Atomic Bandwidth:
-    rst: Total number of bytes looked up in the L2 cache for atomic requests, per
-      :ref:`normalization unit <normalization-units>`.
-    unit: Bytes per normalization unit
+    rst: Total number of bytes looked up in the L2 cache for atomic requests, divided
+      by total duration.
+    unit: Gbps
  Atomic Req:
    rst: The total number of atomic requests (with and without return) to the L2 from
      all clients.
    unit: Requests per normalization unit
  Bandwidth:
-    rst: The number of bytes looked up in the L2 cache, per  :ref:`normalization unit
-      <normalization-units>`.  The number of bytes is  calculated as the number of
-      cache lines requested multiplied by the cache  line size. This value does not
-      consider partial requests, so for example,  if only a single value is requested
-      in a cache line, the data movement  will still be counted as a full cache line.
-    unit: Bytes per normalization unit
+    rst: The number of bytes looked up in the L2 cache, divided by total duration.
+      The number of bytes is  calculated as the number of cache lines requested multiplied
+      by the cache line size. This value does not consider partial requests, so for
+      example, if only a single value is requested in a cache line, the data movement  will
+      still be counted as a full cache line.
+    unit: Gbps
  CC Req:
    rst: The total number of requests to the L2 that go to Coherently Cacheable (CC)  memory
      allocations. See the :ref:`memory-type` for more information.
@@ -818,9 +817,9 @@ L2 cache accesses:
      allocations. See the :ref:`memory-type` for more information.
    unit: Requests per normalization unit
  Read Bandwidth:
-    rst: Total number of bytes looked up in the L2 cache for read requests, per :ref:`normalization
-      unit <normalization-units>`.
-    unit: Bytes per normalization unit
+    rst: Total number of bytes looked up in the L2 cache for read requests, divided
+      by total duration.
+    unit: Gbps
  Read Req:
    rst: 'The total number of read requests to the L2 from all clients.  '
    unit: Requests per normalization unit
@@ -841,9 +840,9 @@ L2 cache accesses:
      See the :ref:`memory-type` for more information.
    unit: Requests per normalization unit
  Write Bandwidth:
-    rst: Total number of bytes looked up in the L2 cache for write requests, per :ref:`normalization
-      unit <normalization-units>`.
-    unit: Bytes per normalization unit
+    rst: Total number of bytes looked up in the L2 cache for write requests, divided
+      by total duration.
+    unit: Gbps
  Write Req:
    rst: The total number of write requests to the L2 from all clients.
    unit: Requests per normalization unit
@@ -896,9 +895,9 @@ L2-Fabric interface metrics:
      memory <memory-type>` allocations.
    unit: Percent
  Read BW:
-    rst: The total number of bytes read by the L2 cache from Infinity Fabric per  :ref:`normalization
-      unit <normalization-units>`.
-    unit: Bytes per normalization unit
+    rst: The total number of bytes read by the L2 cache from Infinity Fabric divided
+      by total duration.
+    unit: Gbps
  Read Latency:
    rst: The time-averaged number of cycles read requests spent in Infinity Fabric  before
      data was returned to the L2.
@@ -954,12 +953,12 @@ L2-Fabric interface metrics:
    unit: Percent
  Write and Atomic BW:
    rst: The total number of bytes written by the L2 over Infinity Fabric by write  and
-      atomic operations per  :ref:`normalization unit <normalization-units>`. Note
-      that on current  CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests
-      are  only considered *atomic* by Infinity Fabric if they are targeted at  non-write-cacheable
-      memory, for example,  :ref:`fine-grained memory <memory-type>` allocations or  :ref:`uncached
+      atomic operations divided by total duration. Note that on current  CDNA accelerators,
+      such as the :ref:`MI2XX <mixxx-note>`, requests are  only considered *atomic*
+      by Infinity Fabric if they are targeted at  non-write-cacheable memory, for
+      example,  :ref:`fine-grained memory <memory-type>` allocations or  :ref:`uncached
      memory <memory-type>` allocations on the  MI2XX.
-    unit: Bytes per normalization unit
+    unit: Gbps
  Write and Atomic Latency:
    rst: The time-averaged number of cycles write requests spent in Infinity Fabric
      before a completion acknowledgement was returned to the L2.
@@ -975,17 +974,17 @@ L2 - Fabric interface detailed metrics:
      memory <memory-type>` allocations on the MI2XX.
    unit: Requests per normalization unit
  Atomic Bandwidth - HBM:
-    rst: Total number of bytes due to L2 atomic requests due to HBM traffic, per normalization
-      unit.
-    unit: Bytes per normalization unit
+    rst: Total number of bytes due to L2 atomic requests due to HBM traffic, divided
+      by total duration.
+    unit: Gbps
  "Atomic Bandwidth - Infinity Fabric\u2122":
    rst: Total number of bytes due to L2 atomic requests due to Infinity Fabric traffic,
-      per normalization unit.
-    unit: Bytes per normalization unit
+      divided by total duration.
+    unit: Gbps
  Atomic Bandwidth - PCIe:
-    rst: Total number of bytes due to L2 atomic requests due to PCIe traffic, per
-      normalization unit.
-    unit: Bytes per normalization unit
+    rst: Total number of bytes due to L2 atomic requests due to PCIe traffic, divided
+      by total duration.
+    unit: Gbps
  HBM Read:
    rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of  data
      from the accelerator's local HBM, per  :ref:`normalization unit <normalization-units>`.
@@ -1013,17 +1012,17 @@ L2 - Fabric interface detailed metrics:
      uncached data requests. See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  Read Bandwidth - HBM:
-    rst: Total number of bytes due to L2 read requests due to HBM traffic, per normalization
-      unit.
-    unit: Bytes per normalization unit
+    rst: Total number of bytes due to L2 read requests due to HBM traffic, divided
+      by total duration.
+    unit: Gbps
  "Read Bandwidth - Infinity Fabric\u2122":
    rst: Total number of bytes due to L2 read requests due to Infinity Fabric traffic,
-      per normalization unit.
-    unit: Bytes per normalization unit
+      divided by total duration.
+    unit: Gbps
  Read Bandwidth - PCIe:
-    rst: Total number of bytes due to L2 read requests due to PCIe traffic, per normalization
-      unit.
-    unit: Bytes per normalization unit
+    rst: Total number of bytes due to L2 read requests due to PCIe traffic, divided
+      by total duration.
+    unit: Gbps
  Remote Read:
    rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of  data
      from any source other than the accelerator's local HBM, per  :ref:`normalization
@@ -1036,17 +1035,17 @@ L2 - Fabric interface detailed metrics:
      for more detail.
    unit: Requests per normalization unit
  Write Bandwidth - HBM:
-    rst: Total number of bytes due to L2 write requests due to HBM traffic, per normalization
-      unit.
-    unit: Bytes per normalization unit
+    rst: Total number of bytes due to L2 write requests due to HBM traffic, divided
+      by total duration.
+    unit: Gbps
  "Write Bandwidth - Infinity Fabric\u2122":
    rst: Total number of bytes due to L2 write requests due to Infinity Fabric traffic,
-      per normalization unit.
-    unit: Bytes per normalization unit
+      divided by total duration.
+    unit: Gbps
  Write Bandwidth - PCIe:
-    rst: Total number of bytes due to L2 write requests due to PCIe traffic, per normalization
-      unit.
-    unit: Bytes per normalization unit
+    rst: Total number of bytes due to L2 write requests due to PCIe traffic, divided
+      by total duration.
+    unit: Gbps
  Write and Atomic (32B):
    rst: The total number of L2 requests to Infinity Fabric to write or atomically  update
      32B of data to any memory location, per  :ref:`normalization unit <normalization-units>`.
@@ -1098,7 +1097,7 @@ L2 - Fabric Interface stalls:
      of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
 Scalar L1D Speed-of-Light:
-  Bandwidth:
+  Bandwidth Utilization:
    rst: The number of bytes looked up in the sL1D cache, as a percent of the peak  theoretical
      bandwidth. Calculated as the ratio of sL1D requests over the  :ref:`total sL1D
      cycles <total-sl1d-cycles>`.
@@ -1108,13 +1107,11 @@ Scalar L1D Speed-of-Light:
      the cache. The ratio of the number of sL1D requests that hit  [#sl1d-cache]_
      over the number of all sL1D requests.
    unit: Percent
-  sL1D-L2 BW:
-    rst: "The total number of bytes read from, written to, or atomically updated \
-      \ across the sL1D\u2194:doc:`L2 <l2-cache>` interface, per  :ref:`normalization\
-      \ unit <normalization-units>`. Note that sL1D writes  and atomics are typically\
-      \ unused on current CDNA accelerators, so in the  majority of cases this can\
-      \ be interpreted as an sL1D\u2192L2 read bandwidth."
-    unit: Bytes per normalization unit
+  sL1D-L2 BW Utilization:
+    rst: The percentage of the peak theoretical sL1D - L2 interface bandwidth achieved.
+      Calculated as the total number of bytes read from, written to, or atomically updated
+      across the sL1D - L2 interface.
+    unit: Percent
 Scalar L1D cache accesses:
  Atomic Req:
    rst: The total number of atomic requests from sL1D to the  :doc:`L2 <l2-cache>`,
@@ -1189,13 +1186,13 @@ Scalar L1D Cache - L2 Interface:
    unit: Requests per normalization unit
  sL1D-L2 BW:
    rst: "The total number of bytes read from, written to, or atomically updated \
-      \ across the sL1D\u2194:doc:`L2 <l2-cache>` interface, per  :ref:`normalization\
-      \ unit <normalization-units>`. Note that sL1D writes  and atomics are typically\
-      \ unused on current CDNA accelerators, so in the  majority of cases this can\
-      \ be interpreted as an sL1D\u2192L2 read bandwidth."
-    unit: Bytes per normalization unit
+      \ across the sL1D\u2194:doc:`L2 <l2-cache>` interface, divided by total duration.\
+      \ Note that sL1D writes and atomics are typically unused on current CDNA accelerators,\
+      \ so in the  majority of cases this can be interpreted as an sL1D\u2192L2 read\
+      \ bandwidth."
+    unit: Gbps
 L1I Speed-of-Light:
-  Bandwidth:
+  Bandwidth Utilization:
    rst: The number of bytes looked up in the L1I cache, as a percent of the peak  theoretical
      bandwidth. Calculated as the ratio of L1I requests over the  :ref:`total L1I
      cycles <total-l1i-cycles>`.
@@ -1205,7 +1202,7 @@ L1I Speed-of-Light:
      the cache. Calculated as the ratio of the number of L1I requests  that hit over
      the number of all L1I requests.
    unit: Percent
-  L1I-L2 Bandwidth:
+  L1I-L2 Bandwidth Utilization:
    rst: "The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth\
      \  achieved. Calculated as the ratio of the total number of requests from  the\
      \ L1I to the L2 cache over the  :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`."
@@ -1238,10 +1235,9 @@ L1I cache accesses:
    unit: Requests per normalization unit
 L1I <-> L2 interface:
  L1I-L2 Bandwidth:
-    rst: "The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth\
-      \  achieved. Calculated as the ratio of the total number of requests from  the\
-      \ L1I to the L2 cache over the  :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`."
-    unit: Percent
+    rst: Total number of bytes transferred across L1I - L2 interface divided by total
+      duration.
+    unit: Gbps
 Workgroup manager utilizations:
  Accelerator Utilization:
    rst: The percent of cycles in the kernel where the accelerator was actively doing
@@ -11,8 +11,12 @@ Panel Config:
      instructions, averaged over the lifetime of the kernel. Calculated as the ratio
      of the total number of cycles spent by the scheduler issuing LDS instructions
      over the total CU cycles.
+    Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
+      could have been loaded from, stored to, or atomically updated in the LDS, as
+      a percentage of the theoretical peak. Does not take into account the execution
+      mask of the wavefront when the instruction was executed.
    Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
-      loaded from, stored to, or atomically updated in the LDS per normalization unit.
+      loaded from, stored to, or atomically updated in the LDS divided by total duration.
      Does not take into account the execution mask of the wavefront when the instruction
      was executed.
    Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
@@ -58,7 +62,7 @@ Panel Config:
        Access Rate:
          value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          unit: Pct of Peak
-        Theoretical Bandwidth:
+        Theoretical Bandwidth Utilization:
          value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
            / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
          unit: Pct of Peak
@@ -86,12 +90,12 @@ Panel Config:
          unit: (Instr  + $normUnit)
        Theoretical Bandwidth:
          avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-            / $denom))
+            / (End_Timestamp - Start_Timestamp)))
          min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-            / $denom))
+            / (End_Timestamp - Start_Timestamp)))
          max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-            / $denom))
-          unit: (Bytes  + $normUnit)
+            / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        LDS Latency:
          avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
            None))
@@ -3,15 +3,18 @@ Panel Config:
  id: 1300
  title: Instruction Cache
  metrics_description:
-    Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the
-      peak theoretical bandwidth. Calculated as the ratio of L1I requests over the
-      total L1I cycles.
+    Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
+      of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
+      the total L1I cycles.
    Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
      loaded line in the cache. Calculated as the ratio of the number of L1I requests
      that hit over the number of all L1I requests.
-    L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\
-      \ bandwidth achieved. Calculated as the ratio of the total number of requests\
-      \ from the L1I to the L2 cache over the total L1I-L2 interface cycles."
+    L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\
+      \ L2 cache request bandwidth achieved. Calculated as the ratio of the total\
+      \ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\
+      \ cycles."
+    L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
+      divided by total duration.
    Req: The total number of requests made to the L1I per normalization-unit
    Hits: The total number of L1I requests that hit on a previously loaded cache line,
      per normalization-unit.
@@ -30,7 +33,7 @@ Panel Config:
        value: Avg
        unit: Unit
      metric:
-        Bandwidth:
+        Bandwidth Utilization:
          value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
            - Start_Timestamp))))
          unit: Pct of Peak
@@ -38,7 +41,7 @@ Panel Config:
          value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
            + SQC_ICACHE_MISSES_DUPLICATE)))
          unit: Pct of Peak
-        L1I-L2 Bandwidth:
+        L1I-L2 Bandwidth Utilization:
          value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu)
            * (End_Timestamp - Start_Timestamp))))
          unit: Pct of Peak
@@ -100,7 +103,7 @@ Panel Config:
        unit: Unit
      metric:
        L1I-L2 Bandwidth:
-          avg: AVG(((SQC_TC_INST_REQ * 64) / $denom))
-          min: MIN(((SQC_TC_INST_REQ * 64) / $denom))
-          max: MAX(((SQC_TC_INST_REQ * 64) / $denom))
-          unit: (Bytes + $normUnit)
+          avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+          min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+          max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
@@ -3,14 +3,17 @@ Panel Config:
  id: 1400
  title: Scalar L1 Data Cache
  metrics_description:
-    Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the
-      peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the
-      total sL1D cycles.
+    Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
+      of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
+      over the total sL1D cycles.
    Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
      loaded line in the cache. The ratio of the number of sL1D requests that hit over
      the number of all sL1D requests.
+    sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
+      bandwidth achieved. Calculated as the total number of bytes read from, written
+      to, or atomically updated across the sL1D - L2 interface.
    sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
-      \ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\
+      \ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\
      \ writes and atomics are typically unused on current CDNA accelerators, so in\
      \ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
    Req: The total number of requests, of any size or type, made to the sL1D per normalization
@@ -51,7 +54,7 @@ Panel Config:
        value: Avg
        unit: Unit
      metric:
-        Bandwidth:
+        Bandwidth Utilization:
          value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
            - Start_Timestamp))))
          unit: Pct of Peak
@@ -60,7 +63,7 @@ Panel Config:
            + SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES
            + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
          unit: Pct of Peak
-        sL1D-L2 BW:
+        sL1D-L2 BW Utilization:
          value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
            * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))
          unit: Pct of Peak
@@ -158,12 +161,12 @@ Panel Config:
      metric:
        sL1D-L2 BW:
          avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-            * 64)) / $denom))
+            * 64)) / (End_Timestamp - Start_Timestamp)))
          min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-            * 64)) / $denom))
+            * 64)) / (End_Timestamp - Start_Timestamp)))
          max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-            * 64)) / $denom))
-          unit: (Bytes + $normUnit)
+            * 64)) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        Read Req:
          avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
          min: MIN((SQC_TC_DATA_READ_REQ / $denom))
@@ -5,12 +5,12 @@ Panel Config:
  metrics_description:
    Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
      cache over the total number of cache line requests to the vL1D Cache RAM.
-    Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM
-      instructions, as a percent of the peak theoretical bandwidth achievable on the
-      specific accelerator. The number of bytes is calculated as the number of cache
-      lines requested multiplied by the cache line size. This value does not consider
-      partial requests, so for instance, if only a single value is requested in a
-      cache line, the data movement will still be counted as a full cache line.
+    Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
+      of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
+      on the specific accelerator. The number of bytes is calculated as the number
+      of cache lines requested multiplied by the cache line size. This value does
+      not consider partial requests, so for instance, if only a single value is requested
+      in a cache line, the data movement will still be counted as a full cache line.
    Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
      The number of cycles where the vL1D Cache RAM is actively processing any request
      divided by the number of cycles where the vL1D is active.
@@ -42,11 +42,11 @@ Panel Config:
    Atomic Req: The total number of incoming atomic requests from the address processing
      unit after coalescing per normalization unit.
    Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
-      instructions per normalization unit. The number of bytes is calculated as the
-      number of cache lines requested multiplied by the cache line size.  This value
-      does not consider partial requests, so for instance, if only a single value
-      is requested in a cache line, the data movement will still be counted as a full
-      cache line.
+      instructions divided by total duration. The number of bytes is calculated as
+      the number of cache lines requested multiplied by the cache line size.  This
+      value does not consider partial requests, so for instance, if only a single
+      value is requested in a cache line, the data movement will still be counted
+      as a full cache line.
    Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
      vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
    Cache Accesses: The total number of cache line lookups in the vL1D.
@@ -57,7 +57,7 @@ Panel Config:
      command during the kernel's execution per normalization unit. This may be triggered
      by, for instance, the buffer_wbinvl1 instruction.
    L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
-      of VMEM instructions, per normalization unit. The number of bytes is calculated
+      of VMEM instructions, divided by total duration. The number of bytes is calculated
      as the number of cache lines requested multiplied by the cache line size. This
      value does not consider partial requests, so for instance, if only a single
      value is requested in a cache line, the data movement will still be counted
@@ -128,7 +128,7 @@ Panel Config:
            / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
            else None))
          unit: Pct of Peak
-        Bandwidth:
+        Bandwidth Utilization:
          value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp
            - Start_Timestamp)))) / ((($max_sclk / 1000) * 64) * $cu_per_gpu))
          unit: Pct of Peak
@@ -201,10 +201,10 @@ Panel Config:
            / $denom))
          unit: (Req  + $normUnit)
        Cache BW:
-          avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
-          min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
-          max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
-          unit: (Bytes + $normUnit)
+          avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
+          min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
+          max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        Cache Hit Rate:
          avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
@@ -242,12 +242,12 @@ Panel Config:
          unit: (Req + $normUnit)
        L1-L2 BW:
          avg: AVG(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
-            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
+            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
          min: MIN(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
-            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
+            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
          max: MAX(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
-            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
-          unit: (Bytes + $normUnit)
+            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        L1-L2 Read:
          avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_READ_REQ_sum / $denom))
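Reviewer note: the vL1D `Bandwidth Utilization` expression above averages the per-dispatch bandwidth first and only then divides by the peak. The sketch below mirrors that order of operations. It assumes that timestamps are in nanoseconds, that `$max_sclk` is in MHz, and that the peak is 64 bytes per cycle per CU (inferred from the constant `64` in the denominator). The function name and sample values are hypothetical.

```python
from statistics import mean


def vl1d_bandwidth_utilization(dispatches, max_sclk_mhz, cu_per_gpu):
    """Percent of peak vL1D bandwidth, following the config expression.

    `dispatches` is a list of (cache_accesses, start_ns, end_ns) tuples,
    one per kernel dispatch (hypothetical layout, for illustration only).
    """
    # Bytes per nanosecond for each dispatch; one cache line is 64 bytes.
    per_dispatch_bw = [
        (accesses * 64) / (end_ns - start_ns)
        for accesses, start_ns, end_ns in dispatches
    ]
    # Assumed peak: 64 bytes per cycle per CU, with MHz / 1000 = cycles per ns.
    peak_bw = (max_sclk_mhz / 1000) * 64 * cu_per_gpu
    return (100 * mean(per_dispatch_bw)) / peak_bw


# Hypothetical counters for two dispatches of 1 ms each.
sample = [(1_000_000, 0, 1_000_000), (3_000_000, 0, 1_000_000)]
utilization = vl1d_bandwidth_utilization(sample, max_sclk_mhz=2000, cu_per_gpu=100)
```

Averaging before normalizing gives the same result as normalizing each dispatch first, because the peak is a constant for a given device.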
@@ -20,8 +20,8 @@ Panel Config:
    HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
      memory (HBM) per unit time. This value is calculated as the number of HBM channels
      multiplied by the HBM channel width multiplied by the HBM clock frequency.
-    Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per
-      normalization unit.
+    Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided
+      by total duration.
    HBM Read Traffic: The percent of read requests generated by the L2 cache that
      are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
      does not consider the size of the request (meaning that 32B and 64B requests
@@ -42,9 +42,9 @@ Panel Config:
      as a single request), so this metric only approximates the percent of the L2-Fabric
      read bandwidth directed to an uncached memory location.
    Write and Atomic BW: The total number of bytes written by the L2 over Infinity
-      Fabric by write and atomic operations per normalization unit. Note that on current
-      CDNA accelerators, such as the MI2XX, requests are only considered atomic by
-      Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
+      Fabric by write and atomic operations divided by total duration. Note that on
+      current CDNA accelerators, such as the MI2XX, requests are only considered atomic
+      by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
      fine-grained memory allocations or uncached memory allocations on the MI2XX.
    HBM Write and Atomic Traffic: The percent of write and atomic requests generated
      by the L2 cache that are routed to the accelerator's local high-bandwidth memory
@@ -82,17 +82,17 @@ Panel Config:
    Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
      Fabric before a completion acknowledgement (atomic without return value) or
      data (atomic with return value) was returned to the L2.
-    Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit.
+    Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration.
      The number of bytes is calculated as the number of cache lines requested multiplied
      by the cache line size. This value does not consider partial requests, so for
      example, if only a single value is requested in a cache line, the data movement
      will still be counted as a full cache line.
    Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests,
-      per normalization unit.
+      divided by total duration.
    Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests,
-      per normalization unit.
+      divided by total duration.
    Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests,
-      per normalization unit.
+      divided by total duration.
    Req: The total number of incoming requests to the L2 from all clients for all
      request types, per normalization unit.
    Read Req: The total number of read requests to the L2 from all clients.
@@ -150,11 +150,11 @@ Panel Config:
      64B of data from any source other than the accelerator's local HBM, per normalization
      unit.
    Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe
-      traffic, per normalization unit.
+      traffic, divided by total duration.
    "Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read
-      requests due to Infinity Fabric traffic, per normalization unit.
+      requests due to Infinity Fabric traffic, divided by total duration.
    Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM
-      traffic, per normalization unit.
+      traffic, divided by total duration.
    Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
      write or atomically update 32B of data to any memory location, per normalization
      unit.
@@ -171,17 +171,17 @@ Panel Config:
      write or atomically update 32B or 64B of data in any memory location other than
      the accelerator's local HBM, per normalization unit.
    Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to
-      PCIe traffic, per normalization unit.
+      PCIe traffic, divided by total duration.
    "Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write
-      requests due to Infinity Fabric traffic, per normalization unit.
+      requests due to Infinity Fabric traffic, divided by total duration.
    Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM
-      traffic, per normalization unit.
+      traffic, divided by total duration.
    Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to
-      PCIe traffic, per normalization unit.
+      PCIe traffic, divided by total duration.
    "Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic
-      requests due to Infinity Fabric traffic, per normalization unit.
+      requests due to Infinity Fabric traffic, divided by total duration.
    Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
-      HBM traffic, per normalization unit.
+      HBM traffic, divided by total duration.
    Atomic: The total number of L2 requests to Infinity Fabric to atomically update
      32B or 64B of data in any memory location, per normalization unit. See Request
      flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
@@ -257,12 +257,12 @@ Panel Config:
      metric:
        Read BW:
          avg: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
-            * 64)) / $denom))
+            * 64)) / (End_Timestamp - Start_Timestamp)))
          min: MIN((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
-            * 64)) / $denom))
+            * 64)) / (End_Timestamp - Start_Timestamp)))
          max: MAX((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
-            * 64)) / $denom))
-          unit: (Bytes  + $normUnit)
+            * 64)) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        HBM Read Traffic:
          avg: AVG((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
            != 0) else None))
@@ -289,12 +289,12 @@ Panel Config:
          unit: pct
        Write and Atomic BW:
          avg: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
-            * 32)) / $denom))
+            * 32)) / (End_Timestamp - Start_Timestamp)))
          min: MIN((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
-            * 32)) / $denom))
+            * 32)) / (End_Timestamp - Start_Timestamp)))
          max: MAX((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
-            * 32)) / $denom))
-          unit: (Bytes  + $normUnit)
+            * 32)) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        HBM Write and Atomic Traffic:
          avg: AVG((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
@@ -362,10 +362,10 @@ Panel Config:
        unit: Unit
      metric:
        Bandwidth:
-          avg: AVG((TCC_REQ_sum * 64) / $denom)
-          min: MIN((TCC_REQ_sum * 64) / $denom)
-          max: MAX((TCC_REQ_sum * 64) / $denom)
-          unit: (Bytes + $normUnit)
+          avg: AVG((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))
+          min: MIN((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))
+          max: MAX((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))
+          unit: Gbps
        Req:
          avg: AVG((TCC_REQ_sum / $denom))
          min: MIN((TCC_REQ_sum / $denom))
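Reviewer note: a minimal sketch of what the new `Read BW` expression computes per dispatch, assuming `Start_Timestamp` and `End_Timestamp` are in nanoseconds. Under that assumption the quotient is bytes per nanosecond, which is numerically gigabytes per second; the diff does not say whether the `Gbps` label means gigabytes or gigabits per second. The function name and sample counter values are hypothetical.

```python
def read_bw_bytes_per_ns(rdreq_total, rdreq_32b, start_ns, end_ns):
    """Bytes read by the L2 over Infinity Fabric, divided by total duration.

    Requests flagged as 32B count for 32 bytes; all other read requests
    count for 64 bytes, as in the config expression.
    """
    duration_ns = end_ns - start_ns
    if duration_ns <= 0:
        # The config expression has no such guard; this is added for safety.
        return None
    bytes_read = (rdreq_32b * 32) + ((rdreq_total - rdreq_32b) * 64)
    return bytes_read / duration_ns


# Hypothetical counters: 1M read requests, a quarter of them 32B, over 0.5 ms.
bw = read_bw_bytes_per_ns(
    rdreq_total=1_000_000, rdreq_32b=250_000, start_ns=0, end_ns=500_000
)
```

Unlike the previous `$denom` form, the result no longer changes with the selected normalization unit.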
@@ -11,8 +11,12 @@ Panel Config:
      instructions, averaged over the lifetime of the kernel. Calculated as the ratio
      of the total number of cycles spent by the scheduler issuing LDS instructions
      over the total CU cycles.
+    Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
+      could have been loaded from, stored to, or atomically updated in the LDS,
+      as a percentage of the theoretical peak. Does not take into account the execution
+      mask of the wavefront when the instruction was executed.
    Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
-      loaded from, stored to, or atomically updated in the LDS per normalization unit.
+      loaded from, stored to, or atomically updated in the LDS divided by total duration.
      Does not take into account the execution mask of the wavefront when the instruction
      was executed.
    Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
@@ -58,7 +62,7 @@ Panel Config:
        Access Rate:
          value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          unit: Pct of Peak
-        Theoretical Bandwidth:
+        Theoretical Bandwidth Utilization:
          value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
            / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
          unit: Pct of Peak
@@ -86,12 +90,12 @@ Panel Config:
          unit: (Instr  + $normUnit)
        Theoretical Bandwidth:
          avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-            / $denom))
+            / (End_Timestamp - Start_Timestamp)))
          min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-            / $denom))
+            / (End_Timestamp - Start_Timestamp)))
          max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-            / $denom))
-          unit: (Bytes  + $normUnit)
+            / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        LDS Latency:
          avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
            None))
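Reviewer note: the constant `0.00128` in `Theoretical Bandwidth Utilization` is easier to check when the expression is split into a bandwidth and a peak. The sketch below assumes nanosecond timestamps, `$max_sclk` in MHz, and a peak of 128 bytes per cycle per CU; that peak is inferred from the constant rather than stated in the diff. All names and sample values are hypothetical.

```python
def lds_theoretical_bw(idx_active, bank_conflict, lds_banks_per_cu, duration_ns):
    """Bytes per nanosecond: conflict-free LDS cycles * 4 bytes * banks."""
    return ((idx_active - bank_conflict) * 4 * lds_banks_per_cu) / duration_ns


def lds_bw_utilization(bw_bytes_per_ns, max_sclk_mhz, cu_per_gpu):
    """Percent of peak, written out step by step."""
    # Assumed peak: 128 B/cycle/CU * (MHz * 1e6 cycles/s) / 1e9 = 0.128 * MHz per CU.
    peak_bytes_per_ns = 0.128 * max_sclk_mhz * cu_per_gpu
    return 100 * bw_bytes_per_ns / peak_bytes_per_ns


def lds_bw_utilization_config_form(bw_bytes_per_ns, max_sclk_mhz, cu_per_gpu):
    """The same value in the folded form used by the config expression."""
    return bw_bytes_per_ns / ((max_sclk_mhz * cu_per_gpu) * 0.00128)


# Hypothetical counters over a 1 ms dispatch.
lds_bw = lds_theoretical_bw(1_100_000, 100_000, 32, 1_000_000)
step_by_step = lds_bw_utilization(lds_bw, max_sclk_mhz=1000, cu_per_gpu=100)
folded = lds_bw_utilization_config_form(lds_bw, max_sclk_mhz=1000, cu_per_gpu=100)
```

Both forms agree because dividing by `0.00128` is the same as multiplying by 100 and dividing by `0.128`.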
@@ -3,15 +3,18 @@ Panel Config:
  id: 1300
  title: Instruction Cache
  metrics_description:
-    Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the
-      peak theoretical bandwidth. Calculated as the ratio of L1I requests over the
-      total L1I cycles.
+    Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
+      of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
+      the total L1I cycles.
    Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
      loaded line the cache. Calculated as the ratio of the number of L1I requests
      that hit over the number of all L1I requests.
-    L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\
-      \ bandwidth achieved. Calculated as the ratio of the total number of requests\
-      \ from the L1I to the L2 cache over the total L1I-L2 interface cycles."
+    L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\
+      \ L2 cache request bandwidth achieved. Calculated as the ratio of the total\
+      \ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\
+      \ cycles."
+    L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
+      divided by total duration.
    Req: The total number of requests made to the L1I per normalization-unit
    Hits: The total number of L1I requests that hit on a previously loaded cache line,
      per normalization-unit.
@@ -30,7 +33,7 @@ Panel Config:
        value: Avg
        unit: Unit
      metric:
-        Bandwidth:
+        Bandwidth Utilization:
          value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
            - Start_Timestamp))))
          unit: Pct of Peak
@@ -38,7 +41,7 @@ Panel Config:
          value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
            + SQC_ICACHE_MISSES_DUPLICATE)))
          unit: Pct of Peak
-        L1I-L2 Bandwidth:
+        L1I-L2 Bandwidth Utilization:
          value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu)
            * (End_Timestamp - Start_Timestamp))))
          unit: Pct of Peak
@@ -100,7 +103,7 @@ Panel Config:
        unit: Unit
      metric:
        L1I-L2 Bandwidth:
-          avg: AVG(((SQC_TC_INST_REQ * 64) / $denom))
-          min: MIN(((SQC_TC_INST_REQ * 64) / $denom))
-          max: MAX(((SQC_TC_INST_REQ * 64) / $denom))
-          unit: (Bytes + $normUnit)
+          avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+          min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+          max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
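Reviewer note: the `* 100000` factor in the L1I `Bandwidth Utilization` expression follows from unit conversion. The sketch below derives it, assuming `$max_sclk` is in MHz, timestamps are in nanoseconds, and each SQC can service at most one request per cycle; these assumptions are inferred from the expression, not stated in the diff. Function names and sample values are hypothetical.

```python
def l1i_utilization_step_by_step(icache_req, max_sclk_mhz, sqc_per_gpu, duration_ns):
    """Percent of peak L1I request rate, with the unit conversion spelled out."""
    # Cycles elapsed per SQC: (MHz * 1e6 cycles/s) * (ns * 1e-9 s) = MHz * ns / 1000.
    cycles_per_sqc = max_sclk_mhz * duration_ns / 1000
    # Assumed peak: one request per cycle on every SQC.
    peak_requests = cycles_per_sqc * sqc_per_gpu
    return 100 * icache_req / peak_requests


def l1i_utilization_config_form(icache_req, max_sclk_mhz, sqc_per_gpu, duration_ns):
    """The folded form used in the config: 100 * 1000 becomes 100000."""
    return (icache_req * 100000) / ((max_sclk_mhz * sqc_per_gpu) * duration_ns)


# Hypothetical counters over a 1 ms dispatch.
args = dict(icache_req=400_000, max_sclk_mhz=2000, sqc_per_gpu=20, duration_ns=1_000_000)
explicit = l1i_utilization_step_by_step(**args)
folded_form = l1i_utilization_config_form(**args)
```

The `L1I-L2 Bandwidth Utilization` expression has the same shape with an extra factor of 2 in the denominator.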
@@ -3,14 +3,17 @@ Panel Config:
  id: 1400
  title: Scalar L1 Data Cache
  metrics_description:
-    Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the
-      peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the
-      total sL1D cycles.
+    Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
+      of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
+      over the total sL1D cycles.
    Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
      loaded line the cache. The ratio of the number of sL1D requests that hit over
      the number of all sL1D requests.
+    sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
+      bandwidth achieved. Calculated from the total number of bytes read from, written
+      to, or atomically updated across the sL1D - L2 interface.
    sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
-      \ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\
+      \ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\
      \ writes and atomics are typically unused on current CDNA accelerators, so in\
      \ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
    Req: The total number of requests, of any size or type, made to the sL1D per normalization
@@ -51,7 +54,7 @@ Panel Config:
        value: Avg
        unit: Unit
      metric:
-        Bandwidth:
+        Bandwidth Utilization:
          value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
            - Start_Timestamp))))
          unit: Pct of Peak
@@ -60,7 +63,7 @@ Panel Config:
            + SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES
            + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
          unit: Pct of Peak
-        sL1D-L2 BW:
+        sL1D-L2 BW Utilization:
          value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
            * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))
          unit: Pct of Peak
@@ -158,12 +161,12 @@ Panel Config:
      metric:
        sL1D-L2 BW:
          avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-            * 64)) / $denom))
+            * 64)) / (End_Timestamp - Start_Timestamp)))
          min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-            * 64)) / $denom))
+            * 64)) / (End_Timestamp - Start_Timestamp)))
          max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-            * 64)) / $denom))
-          unit: (Bytes + $normUnit)
+            * 64)) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        Read Req:
          avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
          min: MIN((SQC_TC_DATA_READ_REQ / $denom))
@@ -5,12 +5,12 @@ Panel Config:
  metrics_description:
    Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
      cache over the total number of cache line requests to the vL1D Cache RAM.
-    Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM
-      instructions, as a percent of the peak theoretical bandwidth achievable on the
-      specific accelerator. The number of bytes is calculated as the number of cache
-      lines requested multiplied by the cache line size. This value does not consider
-      partial requests, so for instance, if only a single value is requested in a
-      cache line, the data movement will still be counted as a full cache line.
+    Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
+      of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
+      on the specific accelerator. The number of bytes is calculated as the number
+      of cache lines requested multiplied by the cache line size. This value does
+      not consider partial requests, so for instance, if only a single value is requested
+      in a cache line, the data movement will still be counted as a full cache line.
    Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
      The number of cycles where the vL1D Cache RAM is actively processing any request
      divided by the number of cycles where the vL1D is active.
@@ -42,11 +42,11 @@ Panel Config:
    Atomic Req: The total number of incoming atomic requests from the address processing
      unit after coalescing per normalization unit.
    Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
-      instructions per normalization unit. The number of bytes is calculated as the
-      number of cache lines requested multiplied by the cache line size.  This value
-      does not consider partial requests, so for instance, if only a single value
-      is requested in a cache line, the data movement will still be counted as a full
-      cache line.
+      instructions divided by total duration. The number of bytes is calculated as
+      the number of cache lines requested multiplied by the cache line size.  This
+      value does not consider partial requests, so for instance, if only a single
+      value is requested in a cache line, the data movement will still be counted
+      as a full cache line.
    Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
      vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
    Cache Accesses: The total number of cache line lookups in the vL1D.
@@ -57,7 +57,7 @@ Panel Config:
      command during the kernel's execution per normalization unit. This may be triggered
      by, for instance, the buffer_wbinvl1 instruction.
    L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
-      of VMEM instructions, per normalization unit. The number of bytes is calculated
+      of VMEM instructions, divided by total duration. The number of bytes is calculated
      as the number of cache lines requested multiplied by the cache line size. This
      value does not consider partial requests, so for instance, if only a single
      value is requested in a cache line, the data movement will still be counted
@@ -128,7 +128,7 @@ Panel Config:
            / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
            else None))
          unit: Pct of Peak
-        Bandwidth:
+        Bandwidth Utilization:
          value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp
            - Start_Timestamp)))) / ((($max_sclk / 1000) * 64) * $cu_per_gpu))
          unit: Pct of Peak
@@ -201,10 +201,10 @@ Panel Config:
            / $denom))
          unit: (Req  + $normUnit)
        Cache BW:
-          avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
-          min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
-          max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
-          unit: (Bytes + $normUnit)
+          avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
+          min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
+          max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        Cache Hit Rate:
          avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
@@ -242,12 +242,12 @@ Panel Config:
          unit: (Req + $normUnit)
        L1-L2 BW:
          avg: AVG(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
-            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
+            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
          min: MIN(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
-            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
+            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
          max: MAX(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
-            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
-          unit: (Bytes + $normUnit)
+            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        L1-L2 Read:
          avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_READ_REQ_sum / $denom))
@@ -20,8 +20,8 @@ Panel Config:
    HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
      memory (HBM) per unit time. This value is calculated as the number of HBM channels
      multiplied by the HBM channel width multiplied by the HBM clock frequency.
-    Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per
-      normalization unit.
+    Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided
+      by total duration.
    HBM Read Traffic: The percent of read requests generated by the L2 cache that
      are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
      does not consider the size of the request (meaning that 32B and 64B requests
@@ -42,9 +42,9 @@ Panel Config:
      as a single request), so this metric only approximates the percent of the L2-Fabric
      read bandwidth directed to an uncached memory location.
    Write and Atomic BW: The total number of bytes written by the L2 over Infinity
-      Fabric by write and atomic operations per normalization unit. Note that on current
-      CDNA accelerators, such as the MI2XX, requests are only considered atomic by
-      Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
+      Fabric by write and atomic operations divided by total duration. Note that on
+      current CDNA accelerators, such as the MI2XX, requests are only considered atomic
+      by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
      fine-grained memory allocations or uncached memory allocations on the MI2XX.
    HBM Write and Atomic Traffic: The percent of write and atomic requests generated
      by the L2 cache that are routed to the accelerator's local high-bandwidth memory
@@ -82,17 +82,17 @@ Panel Config:
    Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
      Fabric before a completion acknowledgement (atomic without return value) or
      data (atomic with return value) was returned to the L2.
-    Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit.
+    Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration.
      The number of bytes is calculated as the number of cache lines requested multiplied
      by the cache line size. This value does not consider partial requests, so for
      example, if only a single value is requested in a cache line, the data movement
      will still be counted as a full cache line.
    Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests,
-      per normalization unit.
+      divided by total duration.
    Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests,
-      per normalization unit.
+      divided by total duration.
    Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests,
-      per normalization unit.
+      divided by total duration.
    Req: The total number of incoming requests to the L2 from all clients for all
      request types, per normalization unit.
    Read Req: The total number of read requests to the L2 from all clients.
@@ -150,11 +150,11 @@ Panel Config:
      64B of data from any source other than the accelerator's local HBM, per normalization
      unit.
    Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe
-      traffic, per normalization unit.
+      traffic, divided by total duration.
    "Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read
-      requests due to Infinity Fabric traffic, per normalization unit.
+      requests due to Infinity Fabric traffic, divided by total duration.
    Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM
-      traffic, per normalization unit.
+      traffic, divided by total duration.
    Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
      write or atomically update 32B of data to any memory location, per normalization
      unit.
@@ -171,17 +171,17 @@ Panel Config:
      write or atomically update 32B or 64B of data in any memory location other than
      the accelerator's local HBM, per normalization unit.
    Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to
-      PCIe traffic, per normalization unit.
+      PCIe traffic, divided by total duration.
    "Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write
-      requests due to Infinity Fabric traffic, per normalization unit.
+      requests due to Infinity Fabric traffic, divided by total duration.
    Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM
-      traffic, per normalization unit.
+      traffic, divided by total duration.
    Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to
-      PCIe traffic, per normalization unit.
+      PCIe traffic, divided by total duration.
    "Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic
-      requests due to Infinity Fabric traffic, per normalization unit.
+      requests due to Infinity Fabric traffic, divided by total duration.
    Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
-      HBM traffic, per normalization unit.
+      HBM traffic, divided by total duration.
    Atomic: The total number of L2 requests to Infinity Fabric to atomically update
      32B or 64B of data in any memory location, per normalization unit. See Request
      flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
@@ -257,12 +257,12 @@ Panel Config:
      metric:
        Read BW:
          avg: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
-            * 64)) / $denom))
+            * 64)) / (End_Timestamp - Start_Timestamp)))
          min: MIN((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
-            * 64)) / $denom))
+            * 64)) / (End_Timestamp - Start_Timestamp)))
          max: MAX((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
-            * 64)) / $denom))
-          unit: (Bytes  + $normUnit)
+            * 64)) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        HBM Read Traffic:
          avg: AVG((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
            != 0) else None))
@@ -289,12 +289,12 @@ Panel Config:
          unit: pct
        Write and Atomic BW:
          avg: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
-            * 32)) / $denom))
+            * 32)) / (End_Timestamp - Start_Timestamp)))
          min: MIN((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
-            * 32)) / $denom))
+            * 32)) / (End_Timestamp - Start_Timestamp)))
          max: MAX((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
-            * 32)) / $denom))
-          unit: (Bytes  + $normUnit)
+            * 32)) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        HBM Write and Atomic Traffic:
          avg: AVG((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
@@ -362,10 +362,10 @@ Panel Config:
        unit: Unit
      metric:
        Bandwidth:
-          avg: AVG((TCC_REQ_sum * 128) / $denom)
-          min: MIN((TCC_REQ_sum * 128) / $denom)
-          max: MAX((TCC_REQ_sum * 128) / $denom)
-          unit: (Bytes + $normUnit)
+          avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
+          min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
+          max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
+          unit: Gbps
        Req:
          avg: AVG((TCC_REQ_sum / $denom))
          min: MIN((TCC_REQ_sum / $denom))
@@ -11,8 +11,12 @@ Panel Config:
      instructions, averaged over the lifetime of the kernel. Calculated as the ratio
      of the total number of cycles spent by the scheduler issuing LDS instructions
      over the total CU cycles.
+    Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
+      could have been loaded from, stored to, or atomically updated in the LDS,
+      as a percentage of the theoretical peak. Does not take into account the execution
+      mask of the wavefront when the instruction was executed.
    Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
-      loaded from, stored to, or atomically updated in the LDS per normalization unit.
+      loaded from, stored to, or atomically updated in the LDS divided by total duration.
      Does not take into account the execution mask of the wavefront when the instruction
      was executed.
    Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
@@ -58,7 +62,7 @@ Panel Config:
        Access Rate:
          value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          unit: Pct of Peak
-        Theoretical Bandwidth (% of Peak):
+        Theoretical Bandwidth Utilization:
          value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
            / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
          unit: Pct of Peak
@@ -86,12 +90,12 @@ Panel Config:
          unit: (Instr  + $normUnit)
        Theoretical Bandwidth:
          avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-            / $denom))
+            / (End_Timestamp - Start_Timestamp)))
          min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-            / $denom))
+            / (End_Timestamp - Start_Timestamp)))
          max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-            / $denom))
-          unit: (Bytes  + $normUnit)
+            / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        LDS Latency:
          avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
            None))
@@ -3,15 +3,18 @@ Panel Config:
  id: 1300
  title: Instruction Cache
  metrics_description:
-    Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the
-      peak theoretical bandwidth. Calculated as the ratio of L1I requests over the
-      total L1I cycles.
+    Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
+      of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
+      the total L1I cycles.
    Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
      loaded line the cache. Calculated as the ratio of the number of L1I requests
      that hit over the number of all L1I requests.
-    L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\
-      \ bandwidth achieved. Calculated as the ratio of the total number of requests\
-      \ from the L1I to the L2 cache over the total L1I-L2 interface cycles."
+    L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\
+      \ L2 cache request bandwidth achieved. Calculated as the ratio of the total\
+      \ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\
+      \ cycles."
+    L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
+      divided by total duration.
    Req: The total number of requests made to the L1I per normalization-unit
    Hits: The total number of L1I requests that hit on a previously loaded cache line,
      per normalization-unit.
@@ -30,7 +33,7 @@ Panel Config:
        value: Avg
        unit: Unit
      metric:
-        Bandwidth:
+        Bandwidth Utilization:
          value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
            - Start_Timestamp))))
          unit: Pct of Peak
@@ -38,7 +41,7 @@ Panel Config:
          value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
            + SQC_ICACHE_MISSES_DUPLICATE)))
          unit: Pct of Peak
-        L1I-L2 Bandwidth:
+        L1I-L2 Bandwidth Utilization:
          value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu)
            * (End_Timestamp - Start_Timestamp))))
          unit: Pct of Peak
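Review note: the two `Pct of Peak` formulas in this hunk can be read as "requests over available cycles". The sketch below reproduces them in plain Python. It assumes `$max_sclk` is in MHz and the timestamps are in nanoseconds (neither unit is stated in this diff), in which case the constant `100000` is 100 (percent) times 1000 (MHz x ns to cycles). The function and parameter names are hypothetical and are not part of the profiler.

```python
def l1i_bandwidth_utilization_pct(icache_req, max_sclk_mhz, sqc_per_gpu,
                                  start_ns, end_ns):
    """Mirror of the 'Bandwidth Utilization' expression for one kernel."""
    duration_ns = end_ns - start_ns
    if duration_ns <= 0:
        raise ValueError("end_ns must be greater than start_ns")
    # Assumption: max_sclk in MHz, timestamps in ns.
    return (icache_req * 100000) / ((max_sclk_mhz * sqc_per_gpu) * duration_ns)


def l1i_l2_bandwidth_utilization_pct(tc_inst_req, max_sclk_mhz, sqc_per_gpu,
                                     start_ns, end_ns):
    """Mirror of the 'L1I-L2 Bandwidth Utilization' expression.

    The extra factor of 2 in the denominator is copied from the config.
    """
    duration_ns = end_ns - start_ns
    if duration_ns <= 0:
        raise ValueError("end_ns must be greater than start_ns")
    return (tc_inst_req * 100000) / (
        2 * (max_sclk_mhz * sqc_per_gpu) * duration_ns
    )
```

Under these assumptions, a single SQC clocked at 1000 MHz runs 1000 cycles in 1000 ns, so 1000 L1I requests in that window correspond to 100 percent.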
@@ -100,7 +103,7 @@ Panel Config:
        unit: Unit
      metric:
        L1I-L2 Bandwidth:
-          avg: AVG(((SQC_TC_INST_REQ * 64) / $denom))
-          min: MIN(((SQC_TC_INST_REQ * 64) / $denom))
-          max: MAX(((SQC_TC_INST_REQ * 64) / $denom))
-          unit: (Bytes + $normUnit)
+          avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+          min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+          max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
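Review note: the new bandwidth formulas all share one shape, bytes moved divided by `(End_Timestamp - Start_Timestamp)`. The sketch below shows that shape for the `L1I-L2 Bandwidth` case (64 bytes per request). It assumes the timestamps are in nanoseconds, which this diff does not state; if so, bytes per nanosecond is numerically gigabytes per second, which is the quantity shown under the `Gbps` label. The helper name and parameters are hypothetical.

```python
def bandwidth_bytes_per_ns(requests, bytes_per_request, start_ns, end_ns):
    """Bytes moved divided by kernel duration.

    With timestamps in ns (an assumption), the result is numerically
    equal to gigabytes per second (1 byte/ns = 1e9 bytes/s).
    """
    duration_ns = end_ns - start_ns
    if duration_ns <= 0:
        raise ValueError("end_ns must be greater than start_ns")
    return (requests * bytes_per_request) / duration_ns


# Example: 1000 L1I-to-L2 requests of 64 bytes over a 2000 ns kernel.
l1i_l2_bw = bandwidth_bytes_per_ns(1000, 64, start_ns=0, end_ns=2000)
```

Unlike the previous `$denom` form, this value does not change with the selected normalization unit, because the denominator is always the kernel duration.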
@@ -3,14 +3,17 @@ Panel Config:
  id: 1400
  title: Scalar L1 Data Cache
  metrics_description:
-    Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the
-      peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the
-      total sL1D cycles.
+    Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
+      of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
+      over the total sL1D cycles.
    Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
      loaded line the cache. The ratio of the number of sL1D requests that hit over
      the number of all sL1D requests.
+    sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
+      bandwidth achieved. Calculated as the total number of bytes read from, written
+      to, or atomically updated across the sL1D - L2 interface.
    sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
-      \ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\
+      \ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\
      \ writes and atomics are typically unused on current CDNA accelerators, so in\
      \ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
    Req: The total number of requests, of any size or type, made to the sL1D per normalization
@@ -51,7 +54,7 @@ Panel Config:
        value: Avg
        unit: Unit
      metric:
-        Bandwidth:
+        Bandwidth Utilization:
          value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
            - Start_Timestamp))))
          unit: Pct of Peak
@@ -60,7 +63,7 @@ Panel Config:
            + SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES
            + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
          unit: Pct of Peak
-        sL1D-L2 BW:
+        sL1D-L2 BW Utilization:
          value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
            * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))
          unit: Pct of Peak
@@ -158,12 +161,12 @@ Panel Config:
      metric:
        sL1D-L2 BW:
          avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-            * 64)) / $denom))
+            * 64)) / (End_Timestamp - Start_Timestamp)))
          min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-            * 64)) / $denom))
+            * 64)) / (End_Timestamp - Start_Timestamp)))
          max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-            * 64)) / $denom))
-          unit: (Bytes + $normUnit)
+            * 64)) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        Read Req:
          avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
          min: MIN((SQC_TC_DATA_READ_REQ / $denom))
@@ -5,12 +5,12 @@ Panel Config:
  metrics_description:
    Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
      cache over the total number of cache line requests to the vL1D Cache RAM.
-    Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM
-      instructions, as a percent of the peak theoretical bandwidth achievable on the
-      specific accelerator. The number of bytes is calculated as the number of cache
-      lines requested multiplied by the cache line size. This value does not consider
-      partial requests, so for instance, if only a single value is requested in a
-      cache line, the data movement will still be counted as a full cache line.
+    Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
+      of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
+      on the specific accelerator. The number of bytes is calculated as the number
+      of cache lines requested multiplied by the cache line size. This value does
+      not consider partial requests, so for instance, if only a single value is requested
+      in a cache line, the data movement will still be counted as a full cache line.
    Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
      The number of cycles where the vL1D Cache RAM is actively processing any request
      divided by the number of cycles where the vL1D is active.
@@ -42,11 +42,11 @@ Panel Config:
    Atomic Req: The total number of incoming atomic requests from the address processing
      unit after coalescing per normalization unit.
    Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
-      instructions per normalization unit. The number of bytes is calculated as the
-      number of cache lines requested multiplied by the cache line size.  This value
-      does not consider partial requests, so for instance, if only a single value
-      is requested in a cache line, the data movement will still be counted as a full
-      cache line.
+      instructions divided by total duration. The number of bytes is calculated as
+      the number of cache lines requested multiplied by the cache line size.  This
+      value does not consider partial requests, so for instance, if only a single
+      value is requested in a cache line, the data movement will still be counted
+      as a full cache line.
    Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
      vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
    Cache Accesses: The total number of cache line lookups in the vL1D.
@@ -57,7 +57,7 @@ Panel Config:
      command during the kernel's execution per normalization unit. This may be triggered
      by, for instance, the buffer_wbinvl1 instruction.
    L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
-      of VMEM instructions, per normalization unit. The number of bytes is calculated
+      of VMEM instructions, divided by total duration. The number of bytes is calculated
      as the number of cache lines requested multiplied by the cache line size. This
      value does not consider partial requests, so for instance, if only a single
      value is requested in a cache line, the data movement will still be counted
@@ -128,7 +128,7 @@ Panel Config:
            / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
            else None))
          unit: Pct of Peak
-        Bandwidth:
+        Bandwidth Utilization:
          value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp
            - Start_Timestamp)))) / ((($max_sclk / 1000) * 128) * $cu_per_gpu))
          unit: Pct of Peak
@@ -201,10 +201,10 @@ Panel Config:
            / $denom))
          unit: (Req  + $normUnit)
        Cache BW:
-          avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
-          min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
-          max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
-          unit: (Bytes + $normUnit)
+          avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
+          min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
+          max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        Cache Hit Rate:
          avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
@@ -242,12 +242,12 @@ Panel Config:
          unit: (Req + $normUnit)
        L1-L2 BW:
          avg: AVG(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
-            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
+            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
          min: MIN(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
-            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
+            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
          max: MAX(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
-            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
-          unit: (Bytes + $normUnit)
+            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        L1-L2 Read:
          avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_READ_REQ_sum / $denom))
@@ -20,8 +20,8 @@ Panel Config:
    HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
      memory (HBM) per unit time. This value is calculated as the number of HBM channels
      multiplied by the HBM channel width multiplied by the HBM clock frequency.
-    Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per
-      normalization unit.
+    Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided
+      by total duration.
    HBM Read Traffic: The percent of read requests generated by the L2 cache that
      are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
      does not consider the size of the request (meaning that 32B and 64B requests
@@ -42,9 +42,9 @@ Panel Config:
      as a single request), so this metric only approximates the percent of the L2-Fabric
      read bandwidth directed to an uncached memory location.
    Write and Atomic BW: The total number of bytes written by the L2 over Infinity
-      Fabric by write and atomic operations per normalization unit. Note that on current
-      CDNA accelerators, such as the MI2XX, requests are only considered atomic by
-      Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
+      Fabric by write and atomic operations divided by total duration. Note that on
+      current CDNA accelerators, such as the MI2XX, requests are only considered atomic
+      by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
      fine-grained memory allocations or uncached memory allocations on the MI2XX.
    HBM Write and Atomic Traffic: The percent of write and atomic requests generated
      by the L2 cache that are routed to the accelerator's local high-bandwidth memory
@@ -82,17 +82,17 @@ Panel Config:
    Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
      Fabric before a completion acknowledgement (atomic without return value) or
      data (atomic with return value) was returned to the L2.
-    Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit.
+    Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration.
      The number of bytes is calculated as the number of cache lines requested multiplied
      by the cache line size. This value does not consider partial requests, so for
      example, if only a single value is requested in a cache line, the data movement
      will still be counted as a full cache line.
    Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests,
-      per normalization unit.
+      divided by total duration.
    Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests,
-      per normalization unit.
+      divided by total duration.
    Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests,
-      per normalization unit.
+      divided by total duration.
    Req: The total number of incoming requests to the L2 from all clients for all
      request types, per normalization unit.
    Read Req: The total number of read requests to the L2 from all clients.
@@ -150,11 +150,11 @@ Panel Config:
      64B of data from any source other than the accelerator's local HBM, per normalization
      unit.
    Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe
-      traffic, per normalization unit.
+      traffic, divided by total duration.
    "Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read
-      requests due to Infinity Fabric traffic, per normalization unit.
+      requests due to Infinity Fabric traffic, divided by total duration.
    Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM
-      traffic, per normalization unit.
+      traffic, divided by total duration.
    Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
      write or atomically update 32B of data to any memory location, per normalization
      unit.
@@ -171,17 +171,17 @@ Panel Config:
      write or atomically update 32B or 64B of data in any memory location other than
      the accelerator's local HBM, per normalization unit.
    Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to
-      PCIe traffic, per normalization unit.
+      PCIe traffic, divided by total duration.
    "Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write
-      requests due to Infinity Fabric traffic, per normalization unit.
+      requests due to Infinity Fabric traffic, divided by total duration.
    Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM
-      traffic, per normalization unit.
+      traffic, divided by total duration.
    Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to
-      PCIe traffic, per normalization unit.
+      PCIe traffic, divided by total duration.
    "Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic
-      requests due to Infinity Fabric traffic, per normalization unit.
+      requests due to Infinity Fabric traffic, divided by total duration.
    Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
-      HBM traffic, per normalization unit.
+      HBM traffic, divided by total duration.
    Atomic: The total number of L2 requests to Infinity Fabric to atomically update
      32B or 64B of data in any memory location, per normalization unit. See Request
      flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
@@ -257,12 +257,12 @@ Panel Config:
      metric:
        Read BW:
          avg: AVG((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
-            * 64)) / $denom))
+            * 64)) / (End_Timestamp - Start_Timestamp)))
          min: MIN((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
-            * 64)) / $denom))
+            * 64)) / (End_Timestamp - Start_Timestamp)))
          max: MAX((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
-            * 64)) / $denom))
-          unit: (Bytes  + $normUnit)
+            * 64)) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        HBM Read Traffic:
          avg: AVG((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
            != 0) else None))
@@ -289,12 +289,12 @@ Panel Config:
          unit: pct
        Write and Atomic BW:
          avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
-            * 32)) / $denom))
+            * 32)) / (End_Timestamp - Start_Timestamp)))
          min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
-            * 32)) / $denom))
+            * 32)) / (End_Timestamp - Start_Timestamp)))
          max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
-            * 32)) / $denom))
-          unit: (Bytes  + $normUnit)
+            * 32)) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        HBM Write and Atomic Traffic:
          avg: AVG((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
            != 0) else None))
@@ -362,10 +362,10 @@ Panel Config:
        unit: Unit
      metric:
        Bandwidth:
-          avg: AVG((TCC_REQ_sum * 128) / $denom)
-          min: MIN((TCC_REQ_sum * 128) / $denom)
-          max: MAX((TCC_REQ_sum * 128) / $denom)
-          unit: (Bytes + $normUnit)
+          avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
+          min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
+          max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
+          unit: Gbps
        Req:
          avg: AVG((TCC_REQ_sum / $denom))
          min: MIN((TCC_REQ_sum / $denom))
@@ -11,8 +11,12 @@ Panel Config:
      instructions, averaged over the lifetime of the kernel. Calculated as the ratio
      of the total number of cycles spent by the scheduler issuing LDS instructions
      over the total CU cycles.
+    Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
+      could have been loaded from, stored to, or atomically updated in the LDS, expressed
+      as a percentage of the theoretical peak. Does not take into account the execution
+      mask of the wavefront when the instruction was executed.
    Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
-      loaded from, stored to, or atomically updated in the LDS per normalization unit.
+      loaded from, stored to, or atomically updated in the LDS divided by total duration.
      Does not take into account the execution mask of the wavefront when the instruction
      was executed.
    Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
@@ -58,7 +62,7 @@ Panel Config:
        Access Rate:
          value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          unit: Pct of Peak
-        Theoretical Bandwidth (% of Peak):
+        Theoretical Bandwidth Utilization:
          value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
            / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
          unit: Pct of Peak
@@ -86,12 +90,12 @@ Panel Config:
          unit: (Instr  + $normUnit)
        Theoretical Bandwidth:
          avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-            / $denom))
+            / (End_Timestamp - Start_Timestamp)))
          min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-            / $denom))
+            / (End_Timestamp - Start_Timestamp)))
          max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-            / $denom))
-          unit: (Bytes  + $normUnit)
+            / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        LDS Latency:
          avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
            None))
@@ -3,15 +3,18 @@ Panel Config:
  id: 1300
  title: Instruction Cache
  metrics_description:
-    Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the
-      peak theoretical bandwidth. Calculated as the ratio of L1I requests over the
-      total L1I cycles.
+    Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
+      of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
+      the total L1I cycles.
    Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
      loaded line the cache. Calculated as the ratio of the number of L1I requests
      that hit over the number of all L1I requests.
-    L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\
-      \ bandwidth achieved. Calculated as the ratio of the total number of requests\
-      \ from the L1I to the L2 cache over the total L1I-L2 interface cycles."
+    L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\
+      \ L2 cache request bandwidth achieved. Calculated as the ratio of the total\
+      \ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\
+      \ cycles."
+    L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
+      divided by total duration.
    Req: The total number of requests made to the L1I per normalization-unit
    Hits: The total number of L1I requests that hit on a previously loaded cache line,
      per normalization-unit.
@@ -30,7 +33,7 @@ Panel Config:
        value: Avg
        unit: Unit
      metric:
-        Bandwidth:
+        Bandwidth Utilization:
          value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
            - Start_Timestamp))))
          unit: Pct of Peak
@@ -38,7 +41,7 @@ Panel Config:
          value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
            + SQC_ICACHE_MISSES_DUPLICATE)))
          unit: Pct of Peak
-        L1I-L2 Bandwidth:
+        L1I-L2 Bandwidth Utilization:
          value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu)
            * (End_Timestamp - Start_Timestamp))))
          unit: Pct of Peak
@@ -100,7 +103,7 @@ Panel Config:
        unit: Unit
      metric:
        L1I-L2 Bandwidth:
-          avg: AVG(((SQC_TC_INST_REQ * 64) / $denom))
-          min: MIN(((SQC_TC_INST_REQ * 64) / $denom))
-          max: MAX(((SQC_TC_INST_REQ * 64) / $denom))
-          unit: (Bytes + $normUnit)
+          avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+          min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+          max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
@@ -3,14 +3,17 @@ Panel Config:
  id: 1400
  title: Scalar L1 Data Cache
  metrics_description:
-    Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the
-      peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the
-      total sL1D cycles.
+    Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
+      of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
+      over the total sL1D cycles.
    Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
      loaded line the cache. The ratio of the number of sL1D requests that hit over
      the number of all sL1D requests.
+    sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
+      bandwidth achieved. Calculated as the total number of bytes read from, written
+      to, or atomically updated across the sL1D - L2 interface.
    sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
-      \ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\
+      \ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\
      \ writes and atomics are typically unused on current CDNA accelerators, so in\
      \ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
    Req: The total number of requests, of any size or type, made to the sL1D per normalization
@@ -51,7 +54,7 @@ Panel Config:
        value: Avg
        unit: Unit
      metric:
-        Bandwidth:
+        Bandwidth Utilization:
          value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
            - Start_Timestamp))))
          unit: Pct of Peak
@@ -60,7 +63,7 @@ Panel Config:
            + SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES
            + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
          unit: Pct of Peak
-        sL1D-L2 BW:
+        sL1D-L2 BW Utilization:
          value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
            * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))
          unit: Pct of Peak
@@ -158,12 +161,12 @@ Panel Config:
      metric:
        sL1D-L2 BW:
          avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-            * 64)) / $denom))
+            * 64)) / (End_Timestamp - Start_Timestamp)))
          min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-            * 64)) / $denom))
+            * 64)) / (End_Timestamp - Start_Timestamp)))
          max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-            * 64)) / $denom))
-          unit: (Bytes + $normUnit)
+            * 64)) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        Read Req:
          avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
          min: MIN((SQC_TC_DATA_READ_REQ / $denom))
@@ -5,12 +5,12 @@ Panel Config:
  metrics_description:
    Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
      cache over the total number of cache line requests to the vL1D Cache RAM.
-    Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM
-      instructions, as a percent of the peak theoretical bandwidth achievable on the
-      specific accelerator. The number of bytes is calculated as the number of cache
-      lines requested multiplied by the cache line size. This value does not consider
-      partial requests, so for instance, if only a single value is requested in a
-      cache line, the data movement will still be counted as a full cache line.
+    Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
+      of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
+      on the specific accelerator. The number of bytes is calculated as the number
+      of cache lines requested multiplied by the cache line size. This value does
+      not consider partial requests, so for instance, if only a single value is requested
+      in a cache line, the data movement will still be counted as a full cache line.
    Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
      The number of cycles where the vL1D Cache RAM is actively processing any request
      divided by the number of cycles where the vL1D is active.
@@ -42,11 +42,11 @@ Panel Config:
    Atomic Req: The total number of incoming atomic requests from the address processing
      unit after coalescing per normalization unit.
    Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
-      instructions per normalization unit. The number of bytes is calculated as the
-      number of cache lines requested multiplied by the cache line size.  This value
-      does not consider partial requests, so for instance, if only a single value
-      is requested in a cache line, the data movement will still be counted as a full
-      cache line.
+      instructions divided by total duration. The number of bytes is calculated as
+      the number of cache lines requested multiplied by the cache line size.  This
+      value does not consider partial requests, so for instance, if only a single
+      value is requested in a cache line, the data movement will still be counted
+      as a full cache line.
    Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
      vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
    Cache Accesses: The total number of cache line lookups in the vL1D.
@@ -57,7 +57,7 @@ Panel Config:
      command during the kernel's execution per normalization unit. This may be triggered
      by, for instance, the buffer_wbinvl1 instruction.
    L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
-      of VMEM instructions, per normalization unit. The number of bytes is calculated
+      of VMEM instructions, divided by total duration. The number of bytes is calculated
      as the number of cache lines requested multiplied by the cache line size. This
      value does not consider partial requests, so for instance, if only a single
      value is requested in a cache line, the data movement will still be counted
@@ -128,7 +128,7 @@ Panel Config:
            / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
            else None))
          unit: Pct of Peak
-        Bandwidth:
+        Bandwidth Utilization:
          value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp
            - Start_Timestamp)))) / ((($max_sclk / 1000) * 128) * $cu_per_gpu))
          unit: Pct of Peak
@@ -201,10 +201,10 @@ Panel Config:
            / $denom))
          unit: (Req  + $normUnit)
        Cache BW:
-          avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
-          min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
-          max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
-          unit: (Bytes + $normUnit)
+          avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
+          min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
+          max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        Cache Hit Rate:
          avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
@@ -242,12 +242,12 @@ Panel Config:
          unit: (Req + $normUnit)
        L1-L2 BW:
          avg: AVG(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
-            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
+            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
          min: MIN(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
-            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
+            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
          max: MAX(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
-            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
-          unit: (Bytes + $normUnit)
+            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        L1-L2 Read:
          avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_READ_REQ_sum / $denom))
@@ -20,8 +20,8 @@ Panel Config:
    HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
      memory (HBM) per unit time. This value is calculated as the number of HBM channels
      multiplied by the HBM channel width multiplied by the HBM clock frequency.
-    Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per
-      normalization unit.
+    Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided
+      by total duration.
    HBM Read Traffic: The percent of read requests generated by the L2 cache that
      are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
      does not consider the size of the request (meaning that 32B and 64B requests
@@ -42,9 +42,9 @@ Panel Config:
      as a single request), so this metric only approximates the percent of the L2-Fabric
      read bandwidth directed to an uncached memory location.
    Write and Atomic BW: The total number of bytes written by the L2 over Infinity
-      Fabric by write and atomic operations per normalization unit. Note that on current
-      CDNA accelerators, such as the MI2XX, requests are only considered atomic by
-      Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
+      Fabric by write and atomic operations divided by total duration. Note that on
+      current CDNA accelerators, such as the MI2XX, requests are only considered atomic
+      by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
      fine-grained memory allocations or uncached memory allocations on the MI2XX.
    HBM Write and Atomic Traffic: The percent of write and atomic requests generated
      by the L2 cache that are routed to the accelerator's local high-bandwidth memory
@@ -82,17 +82,17 @@ Panel Config:
    Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
      Fabric before a completion acknowledgement (atomic without return value) or
      data (atomic with return value) was returned to the L2.
-    Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit.
+    Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration.
      The number of bytes is calculated as the number of cache lines requested multiplied
      by the cache line size. This value does not consider partial requests, so for
      example, if only a single value is requested in a cache line, the data movement
      will still be counted as a full cache line.
    Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests,
-      per normalization unit.
+      divided by total duration.
    Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests,
-      per normalization unit.
+      divided by total duration.
    Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests,
-      per normalization unit.
+      divided by total duration.
    Req: The total number of incoming requests to the L2 from all clients for all
      request types, per normalization unit.
    Read Req: The total number of read requests to the L2 from all clients.
@@ -150,11 +150,11 @@ Panel Config:
      64B of data from any source other than the accelerator's local HBM, per normalization
      unit.
    Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe
-      traffic, per normalization unit.
+      traffic, divided by total duration.
    "Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read
-      requests due to Infinity Fabric traffic, per normalization unit.
+      requests due to Infinity Fabric traffic, divided by total duration.
    Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM
-      traffic, per normalization unit.
+      traffic, divided by total duration.
    Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
      write or atomically update 32B of data to any memory location, per normalization
      unit.
@@ -171,17 +171,17 @@ Panel Config:
      write or atomically update 32B or 64B of data in any memory location other than
      the accelerator's local HBM, per normalization unit.
    Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to
-      PCIe traffic, per normalization unit.
+      PCIe traffic, divided by total duration.
    "Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write
-      requests due to Infinity Fabric traffic, per normalization unit.
+      requests due to Infinity Fabric traffic, divided by total duration.
    Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM
-      traffic, per normalization unit.
+      traffic, divided by total duration.
    Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to
-      PCIe traffic, per normalization unit.
+      PCIe traffic, divided by total duration.
    "Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic
-      requests due to Infinity Fabric traffic, per normalization unit.
+      requests due to Infinity Fabric traffic, divided by total duration.
    Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
-      HBM traffic, per normalization unit.
+      HBM traffic, divided by total duration.
    Atomic: The total number of L2 requests to Infinity Fabric to atomically update
      32B or 64B of data in any memory location, per normalization unit. See Request
      flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
@@ -257,12 +257,12 @@ Panel Config:
      metric:
        Read BW:
          avg: AVG((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
-            * 64)) / $denom))
+            * 64)) / (End_Timestamp - Start_Timestamp)))
          min: MIN((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
-            * 64)) / $denom))
+            * 64)) / (End_Timestamp - Start_Timestamp)))
          max: MAX((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
-            * 64)) / $denom))
-          unit: (Bytes  + $normUnit)
+            * 64)) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        HBM Read Traffic:
          avg: AVG((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
            != 0) else None))
@@ -289,12 +289,12 @@ Panel Config:
          unit: pct
        Write and Atomic BW:
          avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
-            * 32)) / $denom))
+            * 32)) / (End_Timestamp - Start_Timestamp)))
          min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
-            * 32)) / $denom))
+            * 32)) / (End_Timestamp - Start_Timestamp)))
          max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
-            * 32)) / $denom))
-          unit: (Bytes  + $normUnit)
+            * 32)) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        HBM Write and Atomic Traffic:
          avg: AVG((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
            != 0) else None))
@@ -362,10 +362,10 @@ Panel Config:
        unit: Unit
      metric:
        Bandwidth:
-          avg: AVG((TCC_REQ_sum * 128) / $denom)
-          min: MIN((TCC_REQ_sum * 128) / $denom)
-          max: MAX((TCC_REQ_sum * 128) / $denom)
-          unit: (Bytes + $normUnit)
+          avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
+          min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
+          max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
+          unit: Gbps
        Req:
          avg: AVG((TCC_REQ_sum / $denom))
          min: MIN((TCC_REQ_sum / $denom))
@@ -11,8 +11,12 @@ Panel Config:
      instructions, averaged over the lifetime of the kernel. Calculated as the ratio
      of the total number of cycles spent by the scheduler issuing LDS instructions
      over the total CU cycles.
+    Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
+      could have been loaded from, stored to, or atomically updated in the LDS, expressed
+      as a percentage of the theoretical peak. Does not take into account the execution
+      mask of the wavefront when the instruction was executed.
    Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
-      loaded from, stored to, or atomically updated in the LDS per normalization unit.
+      loaded from, stored to, or atomically updated in the LDS divided by total duration.
      Does not take into account the execution mask of the wavefront when the instruction
      was executed.
    Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
@@ -58,7 +62,7 @@ Panel Config:
        Access Rate:
          value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          unit: Pct of Peak
-        Theoretical Bandwidth (% of Peak):
+        Theoretical Bandwidth Utilization:
          value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
            / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
          unit: Pct of Peak
@@ -86,12 +90,12 @@ Panel Config:
          unit: (Instr  + $normUnit)
        Theoretical Bandwidth:
          avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-            / $denom))
+            / (End_Timestamp - Start_Timestamp)))
          min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-            / $denom))
+            / (End_Timestamp - Start_Timestamp)))
          max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-            / $denom))
-          unit: (Bytes  + $normUnit)
+            / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        LDS Latency:
          avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
            None))
@@ -3,15 +3,18 @@ Panel Config:
  id: 1300
  title: Instruction Cache
  metrics_description:
-    Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the
-      peak theoretical bandwidth. Calculated as the ratio of L1I requests over the
-      total L1I cycles.
+    Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
+      of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
+      the total L1I cycles.
    Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
      loaded line in the cache. Calculated as the ratio of the number of L1I requests
      that hit over the number of all L1I requests.
-    L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\
-      \ bandwidth achieved. Calculated as the ratio of the total number of requests\
-      \ from the L1I to the L2 cache over the total L1I-L2 interface cycles."
+    L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\
+      \ L2 cache request bandwidth achieved. Calculated as the ratio of the total\
+      \ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\
+      \ cycles."
+    L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
+      divided by total duration.
    Req: The total number of requests made to the L1I per normalization-unit
    Hits: The total number of L1I requests that hit on a previously loaded cache line,
      per normalization-unit.
@@ -30,7 +33,7 @@ Panel Config:
        value: Avg
        unit: Unit
      metric:
-        Bandwidth:
+        Bandwidth Utilization:
          value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
            - Start_Timestamp))))
          unit: Pct of Peak
@@ -38,7 +41,7 @@ Panel Config:
          value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
            + SQC_ICACHE_MISSES_DUPLICATE)))
          unit: Pct of Peak
-        L1I-L2 Bandwidth:
+        L1I-L2 Bandwidth Utilization:
          value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu)
            * (End_Timestamp - Start_Timestamp))))
          unit: Pct of Peak
@@ -100,7 +103,7 @@ Panel Config:
        unit: Unit
      metric:
        L1I-L2 Bandwidth:
-          avg: AVG(((SQC_TC_INST_REQ * 64) / $denom))
-          min: MIN(((SQC_TC_INST_REQ * 64) / $denom))
-          max: MAX(((SQC_TC_INST_REQ * 64) / $denom))
-          unit: (Bytes + $normUnit)
+          avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+          min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+          max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
@@ -3,14 +3,17 @@ Panel Config:
  id: 1400
  title: Scalar L1 Data Cache
  metrics_description:
-    Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the
-      peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the
-      total sL1D cycles.
+    Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
+      of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
+      over the total sL1D cycles.
    Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
      loaded line in the cache. The ratio of the number of sL1D requests that hit over
      the number of all sL1D requests.
+    sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
+      bandwidth achieved. Calculated as the total number of bytes read from, written
+      to, or atomically updated across the sL1D - L2 interface.
    sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
-      \ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\
+      \ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\
      \ writes and atomics are typically unused on current CDNA accelerators, so in\
      \ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
    Req: The total number of requests, of any size or type, made to the sL1D per normalization
@@ -51,7 +54,7 @@ Panel Config:
        value: Avg
        unit: Unit
      metric:
-        Bandwidth:
+        Bandwidth Utilization:
          value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
            - Start_Timestamp))))
          unit: Pct of Peak
@@ -60,7 +63,7 @@ Panel Config:
            + SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES
            + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
          unit: Pct of Peak
-        sL1D-L2 BW:
+        sL1D-L2 BW Utilization:
          value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
            * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))
          unit: Pct of Peak
@@ -158,12 +161,12 @@ Panel Config:
      metric:
        sL1D-L2 BW:
          avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-            * 64)) / $denom))
+            * 64)) / (End_Timestamp - Start_Timestamp)))
          min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-            * 64)) / $denom))
+            * 64)) / (End_Timestamp - Start_Timestamp)))
          max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-            * 64)) / $denom))
-          unit: (Bytes + $normUnit)
+            * 64)) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        Read Req:
          avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
          min: MIN((SQC_TC_DATA_READ_REQ / $denom))
@@ -5,12 +5,12 @@ Panel Config:
  metrics_description:
    Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
      cache over the total number of cache line requests to the vL1D Cache RAM.
-    Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM
-      instructions, as a percent of the peak theoretical bandwidth achievable on the
-      specific accelerator. The number of bytes is calculated as the number of cache
-      lines requested multiplied by the cache line size. This value does not consider
-      partial requests, so for instance, if only a single value is requested in a
-      cache line, the data movement will still be counted as a full cache line.
+    Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
+      of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
+      on the specific accelerator. The number of bytes is calculated as the number
+      of cache lines requested multiplied by the cache line size. This value does
+      not consider partial requests, so for instance, if only a single value is requested
+      in a cache line, the data movement will still be counted as a full cache line.
    Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
      The number of cycles where the vL1D Cache RAM is actively processing any request
      divided by the number of cycles where the vL1D is active.
@@ -42,11 +42,11 @@ Panel Config:
    Atomic Req: The total number of incoming atomic requests from the address processing
      unit after coalescing per normalization unit.
    Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
-      instructions per normalization unit. The number of bytes is calculated as the
-      number of cache lines requested multiplied by the cache line size.  This value
-      does not consider partial requests, so for instance, if only a single value
-      is requested in a cache line, the data movement will still be counted as a full
-      cache line.
+      instructions divided by total duration. The number of bytes is calculated as
+      the number of cache lines requested multiplied by the cache line size.  This
+      value does not consider partial requests, so for instance, if only a single
+      value is requested in a cache line, the data movement will still be counted
+      as a full cache line.
    Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
      vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
    Cache Accesses: The total number of cache line lookups in the vL1D.
@@ -57,7 +57,7 @@ Panel Config:
      command during the kernel's execution per normalization unit. This may be triggered
      by, for instance, the buffer_wbinvl1 instruction.
    L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
-      of VMEM instructions, per normalization unit. The number of bytes is calculated
+      of VMEM instructions, divided by total duration. The number of bytes is calculated
      as the number of cache lines requested multiplied by the cache line size. This
      value does not consider partial requests, so for instance, if only a single
      value is requested in a cache line, the data movement will still be counted
@@ -128,7 +128,7 @@ Panel Config:
            / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
            else None))
          unit: Pct of Peak
-        Bandwidth:
+        Bandwidth Utilization:
          value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp
            - Start_Timestamp)))) / ((($max_sclk / 1000) * 128) * $cu_per_gpu))
          unit: Pct of Peak
@@ -201,10 +201,10 @@ Panel Config:
            / $denom))
          unit: (Req  + $normUnit)
        Cache BW:
-          avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
-          min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
-          max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
-          unit: (Bytes + $normUnit)
+          avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
+          min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
+          max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        Cache Hit Rate:
          avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
@@ -242,12 +242,12 @@ Panel Config:
          unit: (Req + $normUnit)
        L1-L2 BW:
          avg: AVG(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
-            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
+            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
          min: MIN(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
-            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
+            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
          max: MAX(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
-            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
-          unit: (Bytes + $normUnit)
+            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        L1-L2 Read:
          avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_READ_REQ_sum / $denom))
@@ -20,8 +20,8 @@ Panel Config:
    HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
      memory (HBM) per unit time. This value is calculated as the number of HBM channels
      multiplied by the HBM channel width multiplied by the HBM clock frequency.
-    Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per
-      normalization unit.
+    Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided
+      by total duration.
    HBM Read Traffic: The percent of read requests generated by the L2 cache that
      are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
      does not consider the size of the request (meaning that 32B and 64B requests
@@ -42,9 +42,9 @@ Panel Config:
      as a single request), so this metric only approximates the percent of the L2-Fabric
      read bandwidth directed to an uncached memory location.
    Write and Atomic BW: The total number of bytes written by the L2 over Infinity
-      Fabric by write and atomic operations per normalization unit. Note that on current
-      CDNA accelerators, such as the MI2XX, requests are only considered atomic by
-      Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
+      Fabric by write and atomic operations divided by total duration. Note that on
+      current CDNA accelerators, such as the MI2XX, requests are only considered atomic
+      by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
      fine-grained memory allocations or uncached memory allocations on the MI2XX.
    HBM Write and Atomic Traffic: The percent of write and atomic requests generated
      by the L2 cache that are routed to the accelerator's local high-bandwidth memory
@@ -82,17 +82,17 @@ Panel Config:
    Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
      Fabric before a completion acknowledgement (atomic without return value) or
      data (atomic with return value) was returned to the L2.
-    Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit.
+    Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration.
      The number of bytes is calculated as the number of cache lines requested multiplied
      by the cache line size. This value does not consider partial requests, so for
      example, if only a single value is requested in a cache line, the data movement
      will still be counted as a full cache line.
    Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests,
-      per normalization unit.
+      divided by total duration.
    Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests,
-      per normalization unit.
+      divided by total duration.
    Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests,
-      per normalization unit.
+      divided by total duration.
    Req: The total number of incoming requests to the L2 from all clients for all
      request types, per normalization unit.
    Read Req: The total number of read requests to the L2 from all clients.
@@ -150,11 +150,11 @@ Panel Config:
      64B of data from any source other than the accelerator's local HBM, per normalization
      unit.
    Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe
-      traffic, per normalization unit.
+      traffic, divided by total duration.
    "Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read
-      requests due to Infinity Fabric traffic, per normalization unit.
+      requests due to Infinity Fabric traffic, divided by total duration.
    Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM
-      traffic, per normalization unit.
+      traffic, divided by total duration.
    Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
      write or atomically update 32B of data to any memory location, per normalization
      unit.
@@ -171,17 +171,17 @@ Panel Config:
      write or atomically update 32B or 64B of data in any memory location other than
      the accelerator's local HBM, per normalization unit.
    Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to
-      PCIe traffic, per normalization unit.
+      PCIe traffic, divided by total duration.
    "Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write
-      requests due to Infinity Fabric traffic, per normalization unit.
+      requests due to Infinity Fabric traffic, divided by total duration.
    Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM
-      traffic, per normalization unit.
+      traffic, divided by total duration.
    Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to
-      PCIe traffic, per normalization unit.
+      PCIe traffic, divided by total duration.
    "Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic
-      requests due to Infinity Fabric traffic, per normalization unit.
+      requests due to Infinity Fabric traffic, divided by total duration.
    Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
-      HBM traffic, per normalization unit.
+      HBM traffic, divided by total duration.
    Atomic: The total number of L2 requests to Infinity Fabric to atomically update
      32B or 64B of data in any memory location, per normalization unit. See Request
      flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
@@ -258,12 +258,15 @@ Panel Config:
      metric:
        Read BW:
          avg: AVG(((128 * TCC_BUBBLE_sum + 64 * (TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum
-            - TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / $denom))
+            - TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / (End_Timestamp
+            - Start_Timestamp)))
          min: MIN(((128 * TCC_BUBBLE_sum + 64 * (TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum
-            - TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / $denom))
+            - TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / (End_Timestamp
+            - Start_Timestamp)))
          max: MAX(((128 * TCC_BUBBLE_sum + 64 * (TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum
-            - TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / $denom))
-          unit: (Bytes  + $normUnit)
+            - TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / (End_Timestamp
+            - Start_Timestamp)))
+          unit: Gbps
        HBM Read Traffic:
          avg: AVG((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
            != 0) else None))
@@ -290,12 +293,12 @@ Panel Config:
          unit: pct
        Write and Atomic BW:
          avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
-            * 32)) / $denom))
+            * 32)) / (End_Timestamp - Start_Timestamp)))
          min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
-            * 32)) / $denom))
+            * 32)) / (End_Timestamp - Start_Timestamp)))
          max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
-            * 32)) / $denom))
-          unit: (Bytes  + $normUnit)
+            * 32)) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        HBM Write and Atomic Traffic:
          avg: AVG((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
            != 0) else None))
@@ -363,10 +366,10 @@ Panel Config:
        unit: Unit
      metric:
        Bandwidth:
-          avg: AVG((TCC_REQ_sum * 128) / $denom)
-          min: MIN((TCC_REQ_sum * 128) / $denom)
-          max: MAX((TCC_REQ_sum * 128) / $denom)
-          unit: (Bytes + $normUnit)
+          avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
+          min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
+          max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
+          unit: Gbps
        Req:
          avg: AVG((TCC_REQ_sum / $denom))
          min: MIN((TCC_REQ_sum / $denom))
@@ -11,8 +11,12 @@ Panel Config:
      instructions, averaged over the lifetime of the kernel. Calculated as the ratio
      of the total number of cycles spent by the scheduler issuing LDS instructions
      over the total CU cycles.
+    Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
+      could have been loaded from, stored to, or atomically updated in the LDS,
+      as a percentage of the theoretical peak. Does not take into account the execution
+      mask of the wavefront when the instruction was executed.
    Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
-      loaded from, stored to, or atomically updated in the LDS per normalization unit.
+      loaded from, stored to, or atomically updated in the LDS divided by total duration.
      Does not take into account the execution mask of the wavefront when the instruction
      was executed.
    Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
@@ -58,7 +62,7 @@ Panel Config:
        Access Rate:
          value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          unit: Pct of Peak
-        Theoretical Bandwidth (% of Peak):
+        Theoretical Bandwidth Utilization:
          value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
            / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
          unit: Pct of Peak
@@ -116,12 +120,12 @@ Panel Config:
          units: Gbps
        Theoretical Bandwidth:
          avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-            / $denom))
+            / (End_Timestamp - Start_Timestamp)))
          min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-            / $denom))
+            / (End_Timestamp - Start_Timestamp)))
          max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-            / $denom))
-          unit: (Bytes  + $normUnit)
+            / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        LDS Latency:
          avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
            None))
@@ -3,15 +3,18 @@ Panel Config:
  id: 1300
  title: Instruction Cache
  metrics_description:
-    Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the
-      peak theoretical bandwidth. Calculated as the ratio of L1I requests over the
-      total L1I cycles.
+    Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
+      of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
+      the total L1I cycles.
    Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
      loaded line the cache. Calculated as the ratio of the number of L1I requests
      that hit over the number of all L1I requests.
-    L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\
-      \ bandwidth achieved. Calculated as the ratio of the total number of requests\
-      \ from the L1I to the L2 cache over the total L1I-L2 interface cycles."
+    L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\
+      \ L2 cache request bandwidth achieved. Calculated as the ratio of the total\
+      \ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\
+      \ cycles."
+    L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
+      divided by total duration.
    Req: The total number of requests made to the L1I per normalization-unit
    Hits: The total number of L1I requests that hit on a previously loaded cache line,
      per normalization-unit.
@@ -30,7 +33,7 @@ Panel Config:
        value: Avg
        unit: Unit
      metric:
-        Bandwidth:
+        Bandwidth Utilization:
          value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
            - Start_Timestamp))))
          unit: Pct of Peak
@@ -38,7 +41,7 @@ Panel Config:
          value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
            + SQC_ICACHE_MISSES_DUPLICATE)))
          unit: Pct of Peak
-        L1I-L2 Bandwidth:
+        L1I-L2 Bandwidth Utilization:
          value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu)
            * (End_Timestamp - Start_Timestamp))))
          unit: Pct of Peak
@@ -100,7 +103,7 @@ Panel Config:
        unit: Unit
      metric:
        L1I-L2 Bandwidth:
-          avg: AVG(((SQC_TC_INST_REQ * 64) / $denom))
-          min: MIN(((SQC_TC_INST_REQ * 64) / $denom))
-          max: MAX(((SQC_TC_INST_REQ * 64) / $denom))
-          unit: (Bytes + $normUnit)
+          avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+          min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+          max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
@@ -3,14 +3,17 @@ Panel Config:
  id: 1400
  title: Scalar L1 Data Cache
  metrics_description:
-    Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the
-      peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the
-      total sL1D cycles.
+    Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
+      of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
+      over the total sL1D cycles.
    Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
      loaded line the cache. The ratio of the number of sL1D requests that hit over
      the number of all sL1D requests.
+    sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
+      bandwidth achieved. Calculated as the total number of bytes read from, written
+      to, or atomically updated across the sL1D - L2 interface.
    sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
-      \ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\
+      \ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\
      \ writes and atomics are typically unused on current CDNA accelerators, so in\
      \ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
    Req: The total number of requests, of any size or type, made to the sL1D per normalization
@@ -51,7 +54,7 @@ Panel Config:
        value: Avg
        unit: Unit
      metric:
-        Bandwidth:
+        Bandwidth Utilization:
          value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
            - Start_Timestamp))))
          unit: Pct of Peak
@@ -60,7 +63,7 @@ Panel Config:
            + SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES
            + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
          unit: Pct of Peak
-        sL1D-L2 BW:
+        sL1D-L2 BW Utilization:
          value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
            * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))
          unit: Pct of Peak
@@ -158,12 +161,12 @@ Panel Config:
      metric:
        sL1D-L2 BW:
          avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-            * 64)) / $denom))
+            * 64)) / (End_Timestamp - Start_Timestamp)))
          min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-            * 64)) / $denom))
+            * 64)) / (End_Timestamp - Start_Timestamp)))
          max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-            * 64)) / $denom))
-          unit: (Bytes + $normUnit)
+            * 64)) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        Read Req:
          avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
          min: MIN((SQC_TC_DATA_READ_REQ / $denom))
@@ -5,12 +5,12 @@ Panel Config:
  metrics_description:
    Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
      cache over the total number of cache line requests to the vL1D Cache RAM.
-    Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM
-      instructions, as a percent of the peak theoretical bandwidth achievable on the
-      specific accelerator. The number of bytes is calculated as the number of cache
-      lines requested multiplied by the cache line size. This value does not consider
-      partial requests, so for instance, if only a single value is requested in a
-      cache line, the data movement will still be counted as a full cache line.
+    Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
+      of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
+      on the specific accelerator. The number of bytes is calculated as the number
+      of cache lines requested multiplied by the cache line size. This value does
+      not consider partial requests, so for instance, if only a single value is requested
+      in a cache line, the data movement will still be counted as a full cache line.
    Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
      The number of cycles where the vL1D Cache RAM is actively processing any request
      divided by the number of cycles where the vL1D is active.
@@ -42,11 +42,11 @@ Panel Config:
    Atomic Req: The total number of incoming atomic requests from the address processing
      unit after coalescing per normalization unit.
    Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
-      instructions per normalization unit. The number of bytes is calculated as the
-      number of cache lines requested multiplied by the cache line size.  This value
-      does not consider partial requests, so for instance, if only a single value
-      is requested in a cache line, the data movement will still be counted as a full
-      cache line.
+      instructions divided by total duration. The number of bytes is calculated as
+      the number of cache lines requested multiplied by the cache line size.  This
+      value does not consider partial requests, so for instance, if only a single
+      value is requested in a cache line, the data movement will still be counted
+      as a full cache line.
    Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
      vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
    Cache Accesses: The total number of cache line lookups in the vL1D.
@@ -57,7 +57,7 @@ Panel Config:
      command during the kernel's execution per normalization unit. This may be triggered
      by, for instance, the buffer_wbinvl1 instruction.
    L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
-      of VMEM instructions, per normalization unit. The number of bytes is calculated
+      of VMEM instructions, divided by total duration. The number of bytes is calculated
      as the number of cache lines requested multiplied by the cache line size. This
      value does not consider partial requests, so for instance, if only a single
      value is requested in a cache line, the data movement will still be counted
@@ -128,7 +128,7 @@ Panel Config:
            / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
            else None))
          unit: Pct of Peak
-        Bandwidth:
+        Bandwidth Utilization:
          value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp
            - Start_Timestamp)))) / ((($max_sclk / 1000) * 128) * $cu_per_gpu))
          unit: Pct of Peak
@@ -216,10 +216,10 @@ Panel Config:
            / $denom))
          unit: (Req  + $normUnit)
        Cache BW:
-          avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
-          min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
-          max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
-          unit: (Bytes + $normUnit)
+          avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
+          min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
+          max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        Cache Hit Rate:
          avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
@@ -257,12 +257,12 @@ Panel Config:
          unit: (Req + $normUnit)
        L1-L2 BW:
          avg: AVG(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
-            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
+            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
          min: MIN(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
-            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
+            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
          max: MAX(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
-            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
-          unit: (Bytes + $normUnit)
+            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        Tag RAM 0 Req:
          avg: AVG((TCP_TAGRAM0_REQ_sum / $denom))
          min: MIN((TCP_TAGRAM0_REQ_sum / $denom))
@@ -20,8 +20,8 @@ Panel Config:
    HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
      memory (HBM) per unit time. This value is calculated as the number of HBM channels
      multiplied by the HBM channel width multiplied by the HBM clock frequency.
-    Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per
-      normalization unit.
+    Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided
+      by total duration.
    HBM Read Traffic: The percent of read requests generated by the L2 cache that
      are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
      does not consider the size of the request (meaning that 32B and 64B requests
@@ -42,9 +42,9 @@ Panel Config:
      as a single request), so this metric only approximates the percent of the L2-Fabric
      read bandwidth directed to an uncached memory location.
    Write and Atomic BW: The total number of bytes written by the L2 over Infinity
-      Fabric by write and atomic operations per normalization unit. Note that on current
-      CDNA accelerators, such as the MI2XX, requests are only considered atomic by
-      Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
+      Fabric by write and atomic operations divided by total duration. Note that on
+      current CDNA accelerators, such as the MI2XX, requests are only considered atomic
+      by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
      fine-grained memory allocations or uncached memory allocations on the MI2XX.
    HBM Write and Atomic Traffic: The percent of write and atomic requests generated
      by the L2 cache that are routed to the accelerator's local high-bandwidth memory
@@ -82,17 +82,17 @@ Panel Config:
    Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
      Fabric before a completion acknowledgement (atomic without return value) or
      data (atomic with return value) was returned to the L2.
-    Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit.
+    Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration.
      The number of bytes is calculated as the number of cache lines requested multiplied
      by the cache line size. This value does not consider partial requests, so for
      example, if only a single value is requested in a cache line, the data movement
      will still be counted as a full cache line.
    Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests,
-      per normalization unit.
+      divided by total duration.
    Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests,
-      per normalization unit.
+      divided by total duration.
    Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests,
-      per normalization unit.
+      divided by total duration.
    Req: The total number of incoming requests to the L2 from all clients for all
      request types, per normalization unit.
    Read Req: The total number of read requests to the L2 from all clients.
@@ -150,11 +150,11 @@ Panel Config:
      64B of data from any source other than the accelerator's local HBM, per normalization
      unit.
    Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe
-      traffic, per normalization unit.
+      traffic, divided by total duration.
    "Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read
-      requests due to Infinity Fabric traffic, per normalization unit.
+      requests due to Infinity Fabric traffic, divided by total duration.
    Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM
-      traffic, per normalization unit.
+      traffic, divided by total duration.
    Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
      write or atomically update 32B of data to any memory location, per normalization
      unit.
@@ -171,17 +171,17 @@ Panel Config:
      write or atomically update 32B or 64B of data in any memory location other than
      the accelerator's local HBM, per normalization unit.
    Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to
-      PCIe traffic, per normalization unit.
+      PCIe traffic, divided by total duration.
    "Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write
-      requests due to Infinity Fabric traffic, per normalization unit.
+      requests due to Infinity Fabric traffic, divided by total duration.
    Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM
-      traffic, per normalization unit.
+      traffic, divided by total duration.
    Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to
-      PCIe traffic, per normalization unit.
+      PCIe traffic, divided by total duration.
    "Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic
-      requests due to Infinity Fabric traffic, per normalization unit.
+      requests due to Infinity Fabric traffic, divided by total duration.
    Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
-      HBM traffic, per normalization unit.
+      HBM traffic, divided by total duration.
    Atomic: The total number of L2 requests to Infinity Fabric to atomically update
      32B or 64B of data in any memory location, per normalization unit. See Request
      flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
@@ -257,12 +257,12 @@ Panel Config:
      metric:
        Read BW:
          avg: AVG((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) +
-            (TCC_EA0_RDREQ_128B_sum * 128)) / $denom))
+            (TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp)))
          min: MIN((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) +
-            (TCC_EA0_RDREQ_128B_sum * 128)) / $denom))
+            (TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp)))
          max: MAX((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) +
-            (TCC_EA0_RDREQ_128B_sum * 128)) / $denom))
-          unit: (Bytes  + $normUnit)
+            (TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        HBM Read Traffic:
          avg: AVG((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
            != 0) else None))
@@ -289,12 +289,12 @@ Panel Config:
          unit: pct
        Write and Atomic BW:
          avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
-            * 32)) / $denom))
+            * 32)) / (End_Timestamp - Start_Timestamp)))
          min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
-            * 32)) / $denom))
+            * 32)) / (End_Timestamp - Start_Timestamp)))
          max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
-            * 32)) / $denom))
-          unit: (Bytes  + $normUnit)
+            * 32)) / (End_Timestamp - Start_Timestamp)))
+          unit: Gbps
        HBM Write and Atomic Traffic:
          avg: AVG((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
            != 0) else None))
@@ -381,25 +381,25 @@ Panel Config:
        unit: Unit
      metric:
        Bandwidth:
-          avg: AVG((TCC_REQ_sum * 128) / $denom)
-          min: MIN((TCC_REQ_sum * 128) / $denom)
-          max: MAX((TCC_REQ_sum * 128) / $denom)
-          unit: (Bytes + $normUnit)
+          avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
+          min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
+          max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
+          unit: Gbps
        Read Bandwidth:
-          avg: AVG(TCC_READ_SECTORS_sum * 32/ $denom)
-          min: MIN(TCC_READ_SECTORS_sum * 32/ $denom)
-          max: MAX(TCC_READ_SECTORS_sum * 32/ $denom)
-          unit: (Bytes + $normUnit)
+          avg: AVG(TCC_READ_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+          min: MIN(TCC_READ_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+          max: MAX(TCC_READ_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+          unit: Gbps
        Write Bandwidth:
-          avg: AVG(TCC_WRITE_SECTORS_sum * 32/ $denom)
-          min: MIN(TCC_WRITE_SECTORS_sum * 32/ $denom)
-          max: MAX(TCC_WRITE_SECTORS_sum * 32/ $denom)
-          unit: (Bytes + $normUnit)
+          avg: AVG(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+          min: MIN(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+          max: MAX(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+          unit: Gbps
        Atomic Bandwidth:
-          avg: AVG(TCC_ATOMIC_SECTORS_sum * 32/ $denom)
-          min: MIN(TCC_ATOMIC_SECTORS_sum * 32/ $denom)
-          max: MAX(TCC_ATOMIC_SECTORS_sum * 32/ $denom)
-          unit: (Bytes + $normUnit)
+          avg: AVG(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+          min: MIN(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+          max: MAX(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
+          unit: Gbps
        Req:
          avg: AVG((TCC_REQ_sum / $denom))
          min: MIN((TCC_REQ_sum / $denom))
@@ -653,20 +653,20 @@ Panel Config:
          max: MAX((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
          unit: (Req  + $normUnit)
        Read Bandwidth - PCIe:
-          avg: AVG(TCC_EA0_RDREQ_IO_32B_sum * 32/ $denom)
-          min: MIN(TCC_EA0_RDREQ_IO_32B_sum * 32/ $denom)
-          max: MAX(TCC_EA0_RDREQ_IO_32B_sum * 32/ $denom)
-          unit: (Bytes + $normUnit)
+          avg: AVG(TCC_EA0_RDREQ_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+          min: MIN(TCC_EA0_RDREQ_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+          max: MAX(TCC_EA0_RDREQ_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+          unit: Gbps
        "Read Bandwidth - Infinity Fabric\u2122":
-          avg: AVG(TCC_EA0_RDREQ_GMI_32B_sum * 32/ $denom)
-          min: MIN(TCC_EA0_RDREQ_GMI_32B_sum  * 32/ $denom)
-          max: MAX(TCC_EA0_RDREQ_GMI_32B_sum  * 32/ $denom)
-          unit: (Bytes + $normUnit)
+          avg: AVG(TCC_EA0_RDREQ_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+          min: MIN(TCC_EA0_RDREQ_GMI_32B_sum  * 32/ (End_Timestamp - Start_Timestamp))
+          max: MAX(TCC_EA0_RDREQ_GMI_32B_sum  * 32/ (End_Timestamp - Start_Timestamp))
+          unit: Gbps
        Read Bandwidth - HBM:
-          avg: AVG(TCC_EA0_RDREQ_DRAM_32B_sum  * 32/ $denom)
-          min: MIN(TCC_EA0_RDREQ_DRAM_32B_sum  * 32/ $denom)
-          max: MAX(TCC_EA0_RDREQ_DRAM_32B_sum  * 32/ $denom)
-          unit: (Bytes + $normUnit)
+          avg: AVG(TCC_EA0_RDREQ_DRAM_32B_sum  * 32/ (End_Timestamp - Start_Timestamp))
+          min: MIN(TCC_EA0_RDREQ_DRAM_32B_sum  * 32/ (End_Timestamp - Start_Timestamp))
+          max: MAX(TCC_EA0_RDREQ_DRAM_32B_sum  * 32/ (End_Timestamp - Start_Timestamp))
+          unit: Gbps
        Write and Atomic (32B):
          avg: AVG(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom))
          min: MIN(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom))
@@ -693,20 +693,20 @@ Panel Config:
          max: MAX((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
          unit: (Req  + $normUnit)
        Write Bandwidth - PCIe:
-          avg: AVG(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ $denom)
-          min: MIN(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ $denom)
-          max: MAX(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ $denom)
-          unit: (Bytes + $normUnit)
+          avg: AVG(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+          min: MIN(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+          max: MAX(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+          unit: Gbps
        "Write Bandwidth - Infinity Fabric\u2122":
-          avg: AVG(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ $denom)
-          min: MIN(TCC_EA0_WRREQ_WRITE_GMI_32B_sum  * 32/ $denom)
-          max: MAX(TCC_EA0_WRREQ_WRITE_GMI_32B_sum  * 32/ $denom)
-          unit: (Bytes + $normUnit)
+          avg: AVG(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
+          min: MIN(TCC_EA0_WRREQ_WRITE_GMI_32B_sum  * 32/ (End_Timestamp - Start_Timestamp))
+          max: MAX(TCC_EA0_WRREQ_WRITE_GMI_32B_sum  * 32/ (End_Timestamp - Start_Timestamp))
+          unit: Gbps
        Write Bandwidth - HBM:
-          avg: AVG(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum  * 32/ $denom)
-          min: MIN(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum  * 32/ $denom)
-          max: MAX(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum  * 32/ $denom)
-          unit: (Bytes + $normUnit)
+          avg: AVG(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum  * 32/ (End_Timestamp - Start_Timestamp))
+          min: MIN(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum  * 32/ (End_Timestamp - Start_Timestamp))
+          max: MAX(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum  * 32/ (End_Timestamp - Start_Timestamp))
+          unit: Gbps
        Atomic:
          avg: AVG((TCC_EA0_ATOMIC_sum / $denom))
          min: MIN((TCC_EA0_ATOMIC_sum / $denom))
@@ -718,17 +718,17 @@ Panel Config:
          max: MAX((TCC_EA0_WRREQ_ATOMIC_DRAM_sum / $denom))
          unit: (Req  + $normUnit)
        Atomic Bandwidth - PCIe:
-          avg: AVG(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum  * 32/ $denom)
-          min: MIN(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum  * 32/ $denom)
-          max: MAX(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum  * 32/ $denom)
-          unit: (Bytes + $normUnit)
+          avg: AVG(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum  * 32/ (End_Timestamp - Start_Timestamp))
+          min: MIN(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum  * 32/ (End_Timestamp - Start_Timestamp))
+          max: MAX(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum  * 32/ (End_Timestamp - Start_Timestamp))
+          unit: Gbps
        "Atomic Bandwidth - Infinity Fabric\u2122":
-          avg: AVG(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum  * 32/ $denom)
-          min: MIN(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum  * 32/ $denom)
-          max: MAX(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum  * 32/ $denom)
-          unit: (Bytes + $normUnit)
+          avg: AVG(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum  * 32/ (End_Timestamp - Start_Timestamp))
+          min: MIN(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum  * 32/ (End_Timestamp - Start_Timestamp))
+          max: MAX(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum  * 32/ (End_Timestamp - Start_Timestamp))
+          unit: Gbps
        Atomic Bandwidth - HBM:
-          avg: AVG(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum  * 32/ $denom)
-          min: MIN(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum  * 32/ $denom)
-          max: MAX(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum  * 32/ $denom)
-          unit: (Bytes + $normUnit)
+          avg: AVG(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum  * 32/ (End_Timestamp - Start_Timestamp))
+          min: MIN(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum  * 32/ (End_Timestamp - Start_Timestamp))
+          max: MAX(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum  * 32/ (End_Timestamp - Start_Timestamp))
+          unit: Gbps
@@ -59,42 +59,42 @@ src/rocprof_compute_soc/analysis_configs/gfx940/1100_compute_units_compute_pipel
 src/rocprof_compute_soc/analysis_configs/gfx941/1100_compute_units_compute_pipeline.yaml: 4a25b6abf24f4a622fde1a3cfe65fe7236cf1e626fc2444667883997564cea1e
 src/rocprof_compute_soc/analysis_configs/gfx942/1100_compute_units_compute_pipeline.yaml: 4a25b6abf24f4a622fde1a3cfe65fe7236cf1e626fc2444667883997564cea1e
 src/rocprof_compute_soc/analysis_configs/gfx950/1100_compute_units_compute_pipeline.yaml: 4ef656938f8a9667ae872db522855856469accff9cb42bc0444b469346760dfd
-src/rocprof_compute_soc/analysis_configs/gfx908/1200_local_data_share_lds.yaml: 80f3ca3ea15de009c5278ea20566d8c08d62e0087971e5f9aeae1c89df1dd898
-src/rocprof_compute_soc/analysis_configs/gfx90a/1200_local_data_share_lds.yaml: 80f3ca3ea15de009c5278ea20566d8c08d62e0087971e5f9aeae1c89df1dd898
-src/rocprof_compute_soc/analysis_configs/gfx940/1200_local_data_share_lds.yaml: 3bbf3928288990863cfe72fd00a28785fde0a36f103f5381df578aae2eb28be0
-src/rocprof_compute_soc/analysis_configs/gfx941/1200_local_data_share_lds.yaml: 3bbf3928288990863cfe72fd00a28785fde0a36f103f5381df578aae2eb28be0
-src/rocprof_compute_soc/analysis_configs/gfx942/1200_local_data_share_lds.yaml: 3bbf3928288990863cfe72fd00a28785fde0a36f103f5381df578aae2eb28be0
-src/rocprof_compute_soc/analysis_configs/gfx950/1200_local_data_share_lds.yaml: 505163510a3b0132ee487f9e024188de2deb97d0f72e3d729b95f86e7c3434b3
-src/rocprof_compute_soc/analysis_configs/gfx908/1300_instruction_cache.yaml: 2437e2f8191675c4116d0da1db291f3ad2715281ea812e9fdd6506cf213e5d1b
-src/rocprof_compute_soc/analysis_configs/gfx90a/1300_instruction_cache.yaml: 2437e2f8191675c4116d0da1db291f3ad2715281ea812e9fdd6506cf213e5d1b
-src/rocprof_compute_soc/analysis_configs/gfx940/1300_instruction_cache.yaml: 2437e2f8191675c4116d0da1db291f3ad2715281ea812e9fdd6506cf213e5d1b
-src/rocprof_compute_soc/analysis_configs/gfx941/1300_instruction_cache.yaml: 2437e2f8191675c4116d0da1db291f3ad2715281ea812e9fdd6506cf213e5d1b
-src/rocprof_compute_soc/analysis_configs/gfx942/1300_instruction_cache.yaml: 2437e2f8191675c4116d0da1db291f3ad2715281ea812e9fdd6506cf213e5d1b
-src/rocprof_compute_soc/analysis_configs/gfx950/1300_instruction_cache.yaml: 2437e2f8191675c4116d0da1db291f3ad2715281ea812e9fdd6506cf213e5d1b
-src/rocprof_compute_soc/analysis_configs/gfx908/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5
-src/rocprof_compute_soc/analysis_configs/gfx90a/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5
-src/rocprof_compute_soc/analysis_configs/gfx940/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5
-src/rocprof_compute_soc/analysis_configs/gfx941/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5
-src/rocprof_compute_soc/analysis_configs/gfx942/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5
-src/rocprof_compute_soc/analysis_configs/gfx950/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5
+src/rocprof_compute_soc/analysis_configs/gfx908/1200_local_data_share_lds.yaml: f3f7a74e8b2915fe27eec7948f006f218a6b0a96c91b95cdff9e624b2c484bb2
+src/rocprof_compute_soc/analysis_configs/gfx90a/1200_local_data_share_lds.yaml: f3f7a74e8b2915fe27eec7948f006f218a6b0a96c91b95cdff9e624b2c484bb2
+src/rocprof_compute_soc/analysis_configs/gfx940/1200_local_data_share_lds.yaml: f3f7a74e8b2915fe27eec7948f006f218a6b0a96c91b95cdff9e624b2c484bb2
+src/rocprof_compute_soc/analysis_configs/gfx941/1200_local_data_share_lds.yaml: f3f7a74e8b2915fe27eec7948f006f218a6b0a96c91b95cdff9e624b2c484bb2
+src/rocprof_compute_soc/analysis_configs/gfx942/1200_local_data_share_lds.yaml: f3f7a74e8b2915fe27eec7948f006f218a6b0a96c91b95cdff9e624b2c484bb2
+src/rocprof_compute_soc/analysis_configs/gfx950/1200_local_data_share_lds.yaml: 6333e18126bde83da4c66fd967531d394bd22e69c08358096b27168a9dc11a30
+src/rocprof_compute_soc/analysis_configs/gfx908/1300_instruction_cache.yaml: f60b9c657bece161e34219f3ada4041107dc5ca3d248590ee3b67e7bd400ff54
+src/rocprof_compute_soc/analysis_configs/gfx90a/1300_instruction_cache.yaml: f60b9c657bece161e34219f3ada4041107dc5ca3d248590ee3b67e7bd400ff54
+src/rocprof_compute_soc/analysis_configs/gfx940/1300_instruction_cache.yaml: f60b9c657bece161e34219f3ada4041107dc5ca3d248590ee3b67e7bd400ff54
+src/rocprof_compute_soc/analysis_configs/gfx941/1300_instruction_cache.yaml: f60b9c657bece161e34219f3ada4041107dc5ca3d248590ee3b67e7bd400ff54
+src/rocprof_compute_soc/analysis_configs/gfx942/1300_instruction_cache.yaml: f60b9c657bece161e34219f3ada4041107dc5ca3d248590ee3b67e7bd400ff54
+src/rocprof_compute_soc/analysis_configs/gfx950/1300_instruction_cache.yaml: f60b9c657bece161e34219f3ada4041107dc5ca3d248590ee3b67e7bd400ff54
+src/rocprof_compute_soc/analysis_configs/gfx908/1400_scalar_l1_data_cache.yaml: 29fac4ea38e4a018baffc4a27a720b47078fd890c10da307655d40f693e6f0e7
+src/rocprof_compute_soc/analysis_configs/gfx90a/1400_scalar_l1_data_cache.yaml: 29fac4ea38e4a018baffc4a27a720b47078fd890c10da307655d40f693e6f0e7
+src/rocprof_compute_soc/analysis_configs/gfx940/1400_scalar_l1_data_cache.yaml: 29fac4ea38e4a018baffc4a27a720b47078fd890c10da307655d40f693e6f0e7
+src/rocprof_compute_soc/analysis_configs/gfx941/1400_scalar_l1_data_cache.yaml: 29fac4ea38e4a018baffc4a27a720b47078fd890c10da307655d40f693e6f0e7
+src/rocprof_compute_soc/analysis_configs/gfx942/1400_scalar_l1_data_cache.yaml: 29fac4ea38e4a018baffc4a27a720b47078fd890c10da307655d40f693e6f0e7
+src/rocprof_compute_soc/analysis_configs/gfx950/1400_scalar_l1_data_cache.yaml: 29fac4ea38e4a018baffc4a27a720b47078fd890c10da307655d40f693e6f0e7
 src/rocprof_compute_soc/analysis_configs/gfx908/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 633d59aba82b3a495b7ba33fa4b2ae4da638b58632bcc37ff18be87af68ce4d4
 src/rocprof_compute_soc/analysis_configs/gfx90a/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 2bdb9d7b3bea1057b3baee29ba3b428b211808261063a97bc4b6b319f4a19fb3
 src/rocprof_compute_soc/analysis_configs/gfx940/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 3180c2f3266be0ff44e01d73d247ca43ae2ee18ecaf61765f58849e36c701b19
 src/rocprof_compute_soc/analysis_configs/gfx941/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 3180c2f3266be0ff44e01d73d247ca43ae2ee18ecaf61765f58849e36c701b19
 src/rocprof_compute_soc/analysis_configs/gfx942/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 3180c2f3266be0ff44e01d73d247ca43ae2ee18ecaf61765f58849e36c701b19
 src/rocprof_compute_soc/analysis_configs/gfx950/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 9e56cef5b066fb575a5c530bcf9400f1291dd8636b12c8a2244cdba1defafc9f
-src/rocprof_compute_soc/analysis_configs/gfx908/1600_vector_l1_data_cache.yaml: e6ec43014ce7b7cc072385d4eba072dd187b5de14979c169a3c1e9b8fc4c2762
-src/rocprof_compute_soc/analysis_configs/gfx90a/1600_vector_l1_data_cache.yaml: e6ec43014ce7b7cc072385d4eba072dd187b5de14979c169a3c1e9b8fc4c2762
-src/rocprof_compute_soc/analysis_configs/gfx940/1600_vector_l1_data_cache.yaml: 0e53921cc8d87a9adade250b9632fa42d33c825565152e37d6e56f45f83a3a28
-src/rocprof_compute_soc/analysis_configs/gfx941/1600_vector_l1_data_cache.yaml: 0e53921cc8d87a9adade250b9632fa42d33c825565152e37d6e56f45f83a3a28
-src/rocprof_compute_soc/analysis_configs/gfx942/1600_vector_l1_data_cache.yaml: 0e53921cc8d87a9adade250b9632fa42d33c825565152e37d6e56f45f83a3a28
-src/rocprof_compute_soc/analysis_configs/gfx950/1600_vector_l1_data_cache.yaml: cd21327c193d2af8c18066b9c13f67e3d5dfb44731777bc5a1b6a7738c902dd1
-src/rocprof_compute_soc/analysis_configs/gfx908/1700_l2_cache.yaml: 5b48c690b6069a5610d07cc0c2a5e1da65a52296205dcf48a3b6fa5e3df36e9b
-src/rocprof_compute_soc/analysis_configs/gfx90a/1700_l2_cache.yaml: a9b128267a069060e891533334c52586c706f145b1e813a4081cb21d425516ad
-src/rocprof_compute_soc/analysis_configs/gfx940/1700_l2_cache.yaml: b4eea39f0e23e501ad503cdd96db377109c7f0e212949828fe06102de7355349
-src/rocprof_compute_soc/analysis_configs/gfx941/1700_l2_cache.yaml: da0189cd7f6e1ab4b79d0c054c2cdc1f7a9c81972dae9e5285f2f3d9c30ca644
-src/rocprof_compute_soc/analysis_configs/gfx942/1700_l2_cache.yaml: b0802f923052eb584ce138210ebf2db70fb7883926896da1861a9e857d4abe81
-src/rocprof_compute_soc/analysis_configs/gfx950/1700_l2_cache.yaml: 58bdd965421d610567e461becd7094fa41d668b119eddab99054d2bd6dc12acf
+src/rocprof_compute_soc/analysis_configs/gfx908/1600_vector_l1_data_cache.yaml: 438d0f4a972dd341eb2485f51a47d6860fbb30a6169054cd8550b4b7226e199f
+src/rocprof_compute_soc/analysis_configs/gfx90a/1600_vector_l1_data_cache.yaml: 438d0f4a972dd341eb2485f51a47d6860fbb30a6169054cd8550b4b7226e199f
+src/rocprof_compute_soc/analysis_configs/gfx940/1600_vector_l1_data_cache.yaml: 6100b218f24de9f1433b39a093ed04b9bb9dfe656c5df77583c9db332c447230
+src/rocprof_compute_soc/analysis_configs/gfx941/1600_vector_l1_data_cache.yaml: 6100b218f24de9f1433b39a093ed04b9bb9dfe656c5df77583c9db332c447230
+src/rocprof_compute_soc/analysis_configs/gfx942/1600_vector_l1_data_cache.yaml: 6100b218f24de9f1433b39a093ed04b9bb9dfe656c5df77583c9db332c447230
+src/rocprof_compute_soc/analysis_configs/gfx950/1600_vector_l1_data_cache.yaml: 67054ec0a4c6ca147a5dd40cc91f0e8e81378e1affe7d479274747579ecc524a
+src/rocprof_compute_soc/analysis_configs/gfx908/1700_l2_cache.yaml: b1baa76f9dbfcc52d5e12cc1834102a0011ddf8bdece5be5fabc2945ab8971f4
+src/rocprof_compute_soc/analysis_configs/gfx90a/1700_l2_cache.yaml: 4d834a2066d7f2cb655a8e41fc17531282150b6fe64bbc9c5ff3a10acddee5af
+src/rocprof_compute_soc/analysis_configs/gfx940/1700_l2_cache.yaml: 78f9fee5dafc83d311da1c801200c1820e16a0678dd0548fafa8a966ec6a94d5
+src/rocprof_compute_soc/analysis_configs/gfx941/1700_l2_cache.yaml: 51fe6e3888975b805594c2ab2b3147e717ae5e015468ee592cbcddc389c689bc
+src/rocprof_compute_soc/analysis_configs/gfx942/1700_l2_cache.yaml: dc2dc9ff61b1747e492c28ef5ac76764fd75c18fd0827834130bc583f2afc619
+src/rocprof_compute_soc/analysis_configs/gfx950/1700_l2_cache.yaml: d181f753c3fff608c72b8015d1af30bfd8cf8cdfbc0a17c505f717ddaa3b1efc
 src/rocprof_compute_soc/analysis_configs/gfx908/1800_l2_cache_per_channel.yaml: a0c53202fe9f68d5e1fa689ce0643c471ced7d47e007d8ccc68fba294f7f6a05
 src/rocprof_compute_soc/analysis_configs/gfx90a/1800_l2_cache_per_channel.yaml: a0c53202fe9f68d5e1fa689ce0643c471ced7d47e007d8ccc68fba294f7f6a05
 src/rocprof_compute_soc/analysis_configs/gfx940/1800_l2_cache_per_channel.yaml: e184e3692eb0d641fb2e37fada0e58a6c4958553931d7c038b884e1e6986093f
@@ -113,4 +113,4 @@ src/rocprof_compute_soc/profile_configs/sets/gfx940_sets.yaml: 44cd2b32b050cafa7
 src/rocprof_compute_soc/profile_configs/sets/gfx941_sets.yaml: 44cd2b32b050cafa73d0ead5703b82836edf25a057c21699046b6b8b8918b242
 src/rocprof_compute_soc/profile_configs/sets/gfx942_sets.yaml: 44cd2b32b050cafa73d0ead5703b82836edf25a057c21699046b6b8b8918b242
 src/rocprof_compute_soc/profile_configs/sets/gfx950_sets.yaml: 238d9dc8a98cfead3fc904885bfe413e5bcb4f1af31e9820cd640388bcd1e1c2
-docs/data/metrics_description.yaml: 819c08a584ae8b418e6983aa51108b95e43eda4f3b7892eab336c61d844b20bf
+docs/data/metrics_description.yaml: c2ddad7ef7973b128c1612e56cc6286e49c2f59af829b1795dc64b38c0ecfd61