[rocprof-compute] update yamls for docs (#1887)

2025-11-19 10:46:02 -05:00
@@ -1,7 +1,6 @@
-# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Wavefront launch stats:
  AGPRs:
-    rst: |-
+    rst: >-
      The number of accumulation vector general-purpose registers allocated
      for the kernel, see :ref:`AGPRs <desc-agprs>`. Note: this may not exactly match
      the number of AGPRs requested by the compiler due to allocation granularity.
@@ -12,7 +11,7 @@ Wavefront launch stats:
      total workgroup (or, block) size.
    unit: Work-Items
  LDS Allocation:
-    rst: |-
+    rst: >-
      The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
      allocated for this kernel. Note: This may also be larger than what was requested
      at compile time due to both allocation granularity and dynamic per-dispatch
@@ -23,7 +22,7 @@ Wavefront launch stats:
      <https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
    unit: Wavefronts
  SGPRs:
-    rst: |-
+    rst: >-
      The number of scalar general-purpose registers allocated for the kernel, see
      :ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
      SGPRs requested by the compiler due to allocation granularity.
@@ -38,14 +37,14 @@ Wavefront launch stats:
      as well as for register spills and restores.
    unit: Bytes per work-item
  Total Wavefronts:
-    rst: |-
+    rst: >-
      The total number of wavefronts launched as part of the kernel dispatch.
      On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront
      size is always 64 work-items. Thus, the total number of wavefronts should
      be equivalent to the ceiling of grid size divided by 64.
    unit: Wavefronts
  VGPRs:
-    rst: |-
+    rst: >-
      The number of architected vector general-purpose registers allocated for the
      kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
      number of VGPRs requested by the compiler due to allocation granularity.
@@ -98,14 +97,14 @@ Wavefront runtime stats:
    rst: The total duration of the executed kernel in cycles.
    unit: Cycles
  Wave Cycles:
-    rst: |-
+    rst: >-
      The number of cycles a wavefront in the kernel dispatch spent resident
      on a compute unit per :ref:`normalization unit <normalization-units>`. This is
      averaged over all wavefronts in a kernel dispatch. Note: this should not
      be directly compared to the kernel cycles above.
    unit: Cycles per normalization unit
  Wavefront Occupancy:
-    rst: |-
+    rst: >-
      The time-averaged number of wavefronts resident on the accelerator over the
      lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms).
@@ -148,7 +147,7 @@ Overall instruction mix:
    unit: Instructions
 VALU arithmetic instruction mix:
  Conversion:
-    rst: |-
+    rst: >-
      The total number of type conversion instructions (such as converting data
      to or from F32\u2194F64) issued to the VALU per :ref:`normalization unit
      <normalization-units>`.
@@ -240,7 +239,7 @@ MFMA instruction mix:
    unit: Instructions per normalization unit
 Compute Speed-of-Light:
  MFMA FLOPs (BF16):
-    rst: |-
+    rst: >-
      The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit brain floating
      point operations from :ref:`VALU <desc-valu>` instructions. This is also
@@ -248,7 +247,7 @@ Compute Speed-of-Light:
      on the specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F16):
-    rst: |-
+    rst: >-
      The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -256,7 +255,7 @@ Compute Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F32):
-    rst: |-
+    rst: >-
      The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 32-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -264,7 +263,7 @@ Compute Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F64):
-    rst: |-
+    rst: >-
      The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 64-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -276,21 +275,21 @@ Compute Speed-of-Light:
      achievable on the specific accelerator.
    unit: GFLOPs
  MFMA IOPs (INT8):
-    rst: |-
+    rst: >-
      The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
      per second. Note: this does not include any 8-bit integer operations from
      :ref:`VALU <desc-valu>` instructions. This is also presented as a percent
      of the peak theoretical INT8 MFMA operations achievable on the specific accelerator.
    unit: GFLOPs
  VALU FLOPs:
-    rst: |-
+    rst: >-
      The total floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is also presented as a percent of the peak theoretical
      FLOPs achievable on the specific accelerator. Note: this does not include
      any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  VALU IOPs:
-    rst: |-
+    rst: >-
      The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
      This is also presented as a percent of the peak theoretical IOPs achievable
      on the specific accelerator. Note: this does not include any integer operations
@@ -361,7 +360,7 @@ Pipeline statistics:
    unit: Percent
 Arithmetic operations:
  BF16 OPs:
-    rst: |-
+    rst: >-
      The total number of 16-bit brain floating-point operations executed on
      either the :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
      unit <normalization-units>`. Note: on current CDNA accelerators, the VALU
@@ -388,7 +387,7 @@ Arithmetic operations:
      <normalization-units>`.
    unit: FLOP per normalization unit
  INT8 OPs:
-    rst: |-
+    rst: >-
      The total number of 8-bit integer operations executed on either the :ref:`VALU
      <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
      <normalization-units>`. Note: on current CDNA accelerators, the VALU has
@@ -460,7 +459,7 @@ LDS Statistics:
      acknowledgment) required for an LDS instruction to complete.
    unit: Cycles
  Mem Violations:
-    rst: |-
+    rst: >-
      The total number of out-of-bounds accesses made to the LDS, per :ref:`normalization
      unit <normalization-units>`. This is unused and expected to be zero in
      most configurations for modern CDNA\u2122 accelerators.
@@ -586,7 +585,7 @@ L1 Unified Translation Cache (UTCL1):
      per normalization unit.
    unit: Requests per normalization unit
  Permission Misses:
-    rst: |-
+    rst: >-
      The total number of translation requests that missed in the UTCL1 due
      to a permission error, per :ref:`normalization unit <normalization-units>`.
      This is unused and expected to be zero in most configurations for modern
@@ -925,7 +924,7 @@ L2-Fabric interface metrics:
      before data was returned to the L2.
    unit: Cycles
  Read Stall:
-    rst: |-
+    rst: >-
      The ratio of the total number of cycles the L2-Fabric interface was stalled
      on a read request to any destination (local HBM, remote PCIe\xAE connected
      accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_
@@ -1198,7 +1197,7 @@ Scalar L1D Cache - L2 Interface:
      per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Stall Cycles:
-    rst: |-
+    rst: >-
      The total number of cycles the sL1D\u2194 :doc:`L2 <l2-cache>` interface
      was stalled, per :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
@@ -1208,7 +1207,7 @@ Scalar L1D Cache - L2 Interface:
      CDNA accelerators.
    unit: Requests per normalization unit
  sL1D-L2 BW:
-    rst: |-
+    rst: >-
      The total number of bytes read from, written to, or atomically updated
      across the sL1D\u2194:doc:`L2 <l2-cache>` interface, divided by total duration.
      Note that sL1D writes and atomics are typically
@@ -1227,7 +1226,7 @@ L1I Speed-of-Light:
      over the number of all L1I requests.
    unit: Percent
  L1I-L2 Bandwidth Utilization:
-    rst: |-
+    rst: >-
      The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
      achieved. Calculated as the ratio of the total number of requests from
      the L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`.
@@ -1286,7 +1285,7 @@ Workgroup manager utilizations:
      not fully saturated by the kernel, or a potential load-imbalance issue.
    unit: Percent
  Scheduler-Pipe Utilization:
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where the scheduler-pipes were actively doing any work. Note: this
      value is expected to range between 0% and 25%. See :ref:`desc-spi`.
@@ -1332,7 +1331,7 @@ Workgroup Manager - Resource Allocation:
      lack of available :ref:`waveslots <desc-valu>`.
    unit: Percent
  Not-scheduled Rate (Scheduler-Pipe):
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
      due to a bottleneck within the scheduler-pipes rather than a lack of a CU
@@ -1341,7 +1340,7 @@ Workgroup Manager - Resource Allocation:
      description.
    unit: Percent
  Not-scheduled Rate (Workgroup Manager):
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
      due to a bottleneck within the workgroup manager rather than a lack of a
@@ -1362,7 +1361,7 @@ Workgroup Manager - Resource Allocation:
      or newer accelerators (and small for previous accelerators).
    unit: Percent
  Scheduler-Pipe Stall Rate:
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
      due to occupancy limitations (like a lack of a CU or :ref:`SIMD <desc-valu>`
@@ -1464,7 +1463,7 @@ System Speed-of-Light:
      over the total number of incoming cache line requests to the L2 cache.
    unit: Percent
  L2-Fabric Read BW:
-    rst: |-
+    rst: >-
      The number of bytes read by the L2 over the :ref:`Infinity Fabric\u2122
      interface <l2-fabric>` per unit time. This is also presented as a percent
      of the peak theoretical bandwidth achievable on the specific accelerator.
@@ -1490,7 +1489,7 @@ System Speed-of-Light:
      Conflict Rate).
    unit: Conflicts/Access
  MFMA FLOPs (BF16):
-    rst: |-
+    rst: >-
      The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
      operations executed per second. Note: this does not include any 16-bit brain
      floating point operations from :ref:`VALU <desc-valu>` instructions. This
@@ -1498,7 +1497,7 @@ System Speed-of-Light:
      achievable on the specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F16):
-    rst: |-
+    rst: >-
      The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1506,7 +1505,7 @@ System Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F32):
-    rst: |-
+    rst: >-
      The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 32-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1514,7 +1513,7 @@ System Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F64):
-    rst: |-
+    rst: >-
      The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 64-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1522,7 +1521,7 @@ System Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F8):
-    rst: |-
+    rst: >-
      The total number of 8-bit brain floating point :ref:`MFMA <desc-mfma>`
      operations executed per second. Note: this does not include any 16-bit brain
      floating point operations from :ref:`VALU <desc-valu>` instructions. This
@@ -1531,7 +1530,7 @@ System Speed-of-Light:
      series and later only.
    unit: GFLOPs
  MFMA IOPs (Int8):
-    rst: |-
+    rst: >-
      The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
      per second. Note: this does not include any 8-bit integer operations from
      :ref:`VALU <desc-valu>` instructions. This is also presented as a percent
@@ -1562,14 +1561,14 @@ System Speed-of-Light:
      time-averaged over all VALU instructions run on all wavefronts in the kernel.
    unit: Work-items
  VALU FLOPs:
-    rst: |-
+    rst: >-
      The total floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is also presented as a percent of the peak theoretical
      FLOPs achievable on the specific accelerator. Note: this does not include
      any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  VALU IOPs:
-    rst: |-
+    rst: >-
      The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
      This is also presented as a percent of the peak theoretical IOPs achievable
      on the specific accelerator. Note: this does not include any integer operations
@@ -1590,7 +1589,7 @@ System Speed-of-Light:
      issuing VMEM instructions over the :ref:`total CU cycles <total-cu-cycles>`.
    unit: Percent
  Wavefront Occupancy:
-    rst: |-
+    rst: >-
      The time-averaged number of wavefronts resident on the accelerator over
      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms). This is also presented as a percent of the peak theoretical
@@ -200,37 +200,37 @@ Panel Config:
          pop: None
          coll_level: SQ_IFETCH_LEVEL
  metrics_description:
-    VALU FLOPs: |-
+    VALU FLOPs: >-
      The total floating-point operations executed per second on the VALU.
      This is also presented as a percent of the peak theoretical FLOPs achievable
      on the specific accelerator. Note: this does not include any floating-point
      operations from MFMA instructions.
-    VALU IOPs: |-
+    VALU IOPs: >-
      The total integer operations executed per second on the VALU. This is
      also presented as a percent of the peak theoretical IOPs achievable on the
      specific accelerator. Note: this does not include any integer operations from
      MFMA instructions.
-    MFMA FLOPs (BF16): |-
+    MFMA FLOPs (BF16): >-
      The total number of 16-bit brain floating point MFMA operations executed
      per second. Note: this does not include any 16-bit brain floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      BF16 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F16): |-
+    MFMA FLOPs (F16): >-
      The total number of 16-bit floating point MFMA operations executed per
      second. Note: this does not include any 16-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F16 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F32): |-
+    MFMA FLOPs (F32): >-
      The total number of 32-bit floating point MFMA operations executed per
      second. Note: this does not include any 32-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F32 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F64): |-
+    MFMA FLOPs (F64): >-
      The total number of 64-bit floating point MFMA operations executed per
      second. Note: this does not include any 64-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F64 MFMA operations achievable on the specific accelerator.
-    MFMA IOPs (Int8): |-
+    MFMA IOPs (Int8): >-
      The total number of 8-bit integer MFMA operations executed per second.
      Note: this does not include any 8-bit integer operations from VALU instructions.
      This is also presented as a percent of the peak theoretical INT8 MFMA operations
@@ -263,7 +263,7 @@ Panel Config:
    IPC: The ratio of the total number of instructions executed on the CU over the
      total active CU cycles. This is also presented as a percent of the peak theoretical
      bandwidth achievable on the specific accelerator.
-    Wavefront Occupancy: |-
+    Wavefront Occupancy: >-
      The time-averaged number of wavefronts resident on the accelerator over
      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms). This is also presented as a percent of the peak theoretical
@@ -296,7 +296,7 @@ Panel Config:
      if only a single value is requested in a cache line, the data movement will
      still be counted as a full cache line. This is also presented as a percent of
      the peak theoretical bandwidth achievable on the specific accelerator.
-    L2-Fabric Read BW: |-
+    L2-Fabric Read BW: >-
      The number of bytes read by the L2 over the Infinity Fabric\u2122 interface
      per unit time. This is also presented as a percent of the peak theoretical
      bandwidth achievable on the specific accelerator.
@@ -170,15 +170,15 @@ Panel Config:
    Active CUs: Total number of active compute units (CUs) on the accelerator during
      the kernel execution.
    Num CUs: Total number of compute units (CUs) on the accelerator.
-    VGPR: |-
+    VGPR: >-
      The number of architected vector general-purpose registers allocated
      for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
      requested by the compiler due to allocation granularity.
-    SGPR: |-
+    SGPR: >-
      The number of scalar general-purpose registers allocated for the kernel,
      see SALU. Note: this may not exactly match the number of SGPRs requested by
      the compiler due to allocation granularity.
-    LDS Allocation: |-
+    LDS Allocation: >-
      The number of bytes of LDS memory (or, shared memory) allocated for
      this kernel. Note: This may also be larger than what was requested at compile
      time due to both allocation granularity and dynamic per-dispatch LDS allocations.
@@ -266,7 +266,7 @@ Panel Config:
      or data (atomic with return value) was returned to the L2.
    HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
      of data from the accelerator's local HBM, per normalization unit.
-    HBM Wr: |-
+    HBM Wr: >-
      The total number of L2 requests to Infinity Fabric to write or atomically
      update 32B or 64B of data in the accelerator's local HBM, per normalization
      unit.
@@ -134,48 +134,48 @@ Panel Config:
            / 1e9) ) / 1e9
          unit: GFLOP/s
  metrics_description:
-    VALU FLOPs (F16): |-
+    VALU FLOPs (F16): >-
      The total 16-bit floating-point operations executed per second on the VALU.
      This is presented with the value of the peak empirical F16 FLOPs achievable
      on the specific accelerator. Note: this does not include any F16 operations
      from MFMA instructions.
-    VALU FLOPs (F32): |-
+    VALU FLOPs (F32): >-
      The total 32-bit floating-point operations executed per second on the VALU.
      This is presented with the value of the peak empirical F32 FLOPs achievable
      on the specific accelerator. Note: this does not include any F32 operations
      from MFMA instructions.
-    VALU FLOPs (F64): |-
+    VALU FLOPs (F64): >-
      The total 64-bit floating-point operations executed per second on the VALU.
      This is presented with the value of the peak empirical F64 FLOPs achievable
      on the specific accelerator. Note: this does not include any F64 operations
      from MFMA instructions.
-    MFMA FLOPs (BF16): |-
+    MFMA FLOPs (BF16): >-
      The total number of 16-bit brain floating point MFMA operations executed
      per second. Note: this does not include any 16-bit brain floating point
      operations from VALU instructions. The peak empirically measured BF16 MFMA
      operations achievable on the specific accelerator is displayed alongside
      for comparison.
-    MFMA FLOPs (F16): |-
+    MFMA FLOPs (F16): >-
      The total number of 16-bit floating point MFMA operations executed per
      second. Note: this does not include any 16-bit floating point operations from
      VALU instructions. The peak empirically measured F16 MFMA operations
      achievable on the specific accelerator is displayed alongside for comparison.
-    MFMA FLOPs (F32): |-
+    MFMA FLOPs (F32): >-
      The total number of 32-bit floating point MFMA operations executed per
      second. Note: this does not include any 32-bit floating point operations from
      VALU instructions. The peak empirically measured F32 MFMA operations
      achievable on the specific accelerator is displayed alongside for comparison.
-    MFMA FLOPs (F64): |-
+    MFMA FLOPs (F64): >-
      The total number of 64-bit floating point MFMA operations executed per
      second. Note: this does not include any 64-bit floating point operations from
      VALU instructions. The peak empirically measured F64 MFMA operations
      achievable on the specific accelerator is displayed alongside for comparison.
-    MFMA IOPs (Int8): |-
+    MFMA IOPs (Int8): >-
      The total number of 8-bit integer MFMA operations executed per second.
      Note: this does not include any 8-bit integer operations from VALU instructions.
      The peak empirically measured INT8 MFMA operations achievable on the specific
      accelerator is displayed alongside for comparison.
-    HBM Bandwidth: |-
+    HBM Bandwidth: >-
      The total number of bytes read from and written to High-Bandwidth
      Memory (HBM) per second. The peak empirically measured bandwidth achievable
      on the specific accelerator is displayed alongside for comparison.
@@ -196,22 +196,22 @@ Panel Config:
      from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
      example for more detail). The peak empirically measured LDS bandwidth achievable
      on the specific accelerator is displayed alongside for comparison.
-    AI L1: |-
+    AI L1: >-
      The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
      of total floating-point operations (FLOPs) to total bytes transferred between
      the L1 cache and the processing units. This value is used as the x-coordinate
      for the L1 roofline.
-    AI L2: |-
+    AI L2: >-
      The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
      of total floating-point operations (FLOPs) to total bytes transferred between
      the L2 cache and the L1 cache. This value is used as the x-coordinate for
      the L2 roofline.
-    AI HBM: |-
+    AI HBM: >-
      The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
      It is the ratio of total floating-point operations (FLOPs) to total bytes
      transferred between HBM and the L2 cache. This value is used as the x-coordinate
      for the HBM roofline.
-    Performance (GFLOPs): |-
+    Performance (GFLOPs): >-
      The overall achieved performance, measured in GigaFLOPs
      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
      operations divided by the total execution time. This value is used as the y-coordinate
@@ -141,6 +141,6 @@ Panel Config:
      the CPC-L2 interface was active doing any work.
    CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
      translation
-    CPC-UTCL2 Utilization: |-
+    CPC-UTCL2 Utilization: >-
      Percent of total cycles counted by the CPC's L2 address translation
      interface where the CPC was busy doing address translation work.
@@ -168,7 +168,7 @@ Panel Config:
      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
      within the workgroup manager rather than a lack of a CU or SIMD with sufficient
      resources.
-    Not-scheduled Rate (Scheduler-Pipe): |-
+    Not-scheduled Rate (Scheduler-Pipe): >-
      The percent of total scheduler-pipe cycles in the kernel where a workgroup
      could not be scheduled to a CU due to a bottleneck within the scheduler-pipes
      rather than a lack of a CU or SIMD with sufficient resources.
@@ -121,26 +121,26 @@ Panel Config:
    Workgroup Size: The total number of work-items (or, threads) in each workgroup
      (or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
      to the total block size.
-    Total Wavefronts: |-
+    Total Wavefronts: >-
      The total number of wavefronts launched as part of the kernel dispatch.
      On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront
      size is always 64 work-items. Thus, the total number of wavefronts should
      be equivalent to the ceiling of grid size divided by 64.
    Saved Wavefronts: The total number of wavefronts saved at a context-save.
    Restored Wavefronts: The total number of wavefronts restored from a context-save.
-    VGPRs: |-
+    VGPRs: >-
      The number of architected vector general-purpose registers allocated
      for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
      requested by the compiler due to allocation granularity.
-    AGPRs: |-
+    AGPRs: >-
      The number of accumulation vector general-purpose registers allocated
      for the kernel, see AGPRs. Note: this may not exactly match the number of
      AGPRs requested by the compiler due to allocation granularity.
-    SGPRs: |-
+    SGPRs: >-
      The number of scalar general-purpose registers allocated for the kernel,
      see SALU. Note: this may not exactly match the number of SGPRs requested by
      the compiler due to allocation granularity.
-    LDS Allocation: |-
+    LDS Allocation: >-
      The number of bytes of LDS memory (or, shared memory) allocated for
      this kernel. Note: This may also be larger than what was requested at compile
      time due to both allocation granularity and dynamic per-dispatch LDS allocations.
@@ -173,7 +173,7 @@ Panel Config:
      rather than identification of a precise limiter. The sum of this metric, Issue
      Wait Cycles and Active Wait Cycles should be equal to the total Wave Cycles
      metric.
-    Wavefront Occupancy: |-
+    Wavefront Occupancy: >-
      The time-averaged number of wavefronts resident on the accelerator over
      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms).
@@ -140,7 +140,7 @@ Panel Config:
      unit.
    Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
      stalls from non-dword aligned addresses per normalization unit.
-    Mem Violations: |-
+    Mem Violations: >-
      The total number of out-of-bounds accesses made to the LDS, per normalization
      unit. This is unused and expected to be zero in most configurations for
      modern CDNA\u2122 accelerators.
@@ -92,7 +92,7 @@ Panel Config:
    Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
      loaded line the cache. Calculated as the ratio of the number of L1I requests
      that hit over the number of all L1I requests.
-    L1I-L2 Bandwidth Utilization: |-
+    L1I-L2 Bandwidth Utilization: >-
      The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
      achieved. Calculated as the ratio of the total number of requests from the
      L1I to the L2 cache over the total L1I-L2 interface cycles.
@@ -154,7 +154,7 @@ Panel Config:
    sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
      bandwidth acheived. Calculated as total number of bytes read from, written to,
      or atomically updated across the sL1D - L2 interface.
-    sL1D-L2 BW: |-
+    sL1D-L2 BW: >-
      The total number of bytes read from, written to, or atomically updated
      across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D
      writes and atomics are typically unused on current CDNA accelerators, so
@@ -164,7 +164,7 @@ Panel Config:
      unit.
    Hits: The total number of sL1D requests that hit on a previously loaded cache
      line, per normalization unit.
-    Misses - Non Duplicated: |-
+    Misses - Non Duplicated: >-
      The total number of sL1D requests that missed on a cache line that was
      not already pending due to another request, per normalization unit.
    Misses- Duplicated: The total number of sL1D requests that missed on a cache line
@@ -187,6 +187,6 @@ Panel Config:
      unit.
    Write Req: The total number of write requests from sL1D to the L2, per normalization
      unit. Typically unused on current CDNA accelerators.
-    Stall Cycles: |-
+    Stall Cycles: >-
      The total number of cycles the sL1D\u2194L2 interface was stalled, per
      normalization unit.
@@ -436,7 +436,7 @@ Panel Config:
      per normalization unit.
    Translation Misses: The total number of translation requests that missed in the
      UTCL1 due to translation not being present in the cache, per normalization unit.
-    Permission Misses: |-
+    Permission Misses: >-
      The total number of translation requests that missed in the UTCL1 due
      to a permission error, per normalization unit. This is unused and expected
      to be zero in most configurations for modern CDNA\u2122 accelerators.
@@ -218,37 +218,37 @@ Panel Config:
          pop: None
          coll_level: SQ_IFETCH_LEVEL
  metrics_description:
-    VALU FLOPs: |-
+    VALU FLOPs: >-
      The total floating-point operations executed per second on the VALU.
      This is also presented as a percent of the peak theoretical FLOPs achievable
      on the specific accelerator. Note: this does not include any floating-point
      operations from MFMA instructions.
-    VALU IOPs: |-
+    VALU IOPs: >-
      The total integer operations executed per second on the VALU. This is
      also presented as a percent of the peak theoretical IOPs achievable on the
      specific accelerator. Note: this does not include any integer operations from
      MFMA instructions.
-    MFMA FLOPs (BF16): |-
+    MFMA FLOPs (BF16): >-
      The total number of 16-bit brain floating point MFMA operations executed
      per second. Note: this does not include any 16-bit brain floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      BF16 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F16): |-
+    MFMA FLOPs (F16): >-
      The total number of 16-bit floating point MFMA operations executed per
      second. Note: this does not include any 16-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F16 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F32): |-
+    MFMA FLOPs (F32): >-
      The total number of 32-bit floating point MFMA operations executed per
      second. Note: this does not include any 32-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F32 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F64): |-
+    MFMA FLOPs (F64): >-
      The total number of 64-bit floating point MFMA operations executed per
      second. Note: this does not include any 64-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F64 MFMA operations achievable on the specific accelerator.
-    MFMA IOPs (Int8): |-
+    MFMA IOPs (Int8): >-
      The total number of 8-bit integer MFMA operations executed per second.
      Note: this does not include any 8-bit integer operations from VALU instructions.
      This is also presented as a percent of the peak theoretical INT8 MFMA operations
@@ -281,7 +281,7 @@ Panel Config:
    IPC: The ratio of the total number of instructions executed on the CU over the
      total active CU cycles. This is also presented as a percent of the peak theoretical
      bandwidth achievable on the specific accelerator.
-    Wavefront Occupancy: |-
+    Wavefront Occupancy: >-
      The time-averaged number of wavefronts resident on the accelerator over
      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms). This is also presented as a percent of the peak theoretical
@@ -314,7 +314,7 @@ Panel Config:
      if only a single value is requested in a cache line, the data movement will
      still be counted as a full cache line. This is also presented as a percent of
      the peak theoretical bandwidth achievable on the specific accelerator.
-    L2-Fabric Read BW: |-
+    L2-Fabric Read BW: >-
      The number of bytes read by the L2 over the Infinity Fabric\u2122 interface
      per unit time. This is also presented as a percent of the peak theoretical
      bandwidth achievable on the specific accelerator.
@@ -170,15 +170,15 @@ Panel Config:
    Active CUs: Total number of active compute units (CUs) on the accelerator during
      the kernel execution.
    Num CUs: Total number of compute units (CUs) on the accelerator.
-    VGPR: |-
+    VGPR: >-
      The number of architected vector general-purpose registers allocated
      for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
      requested by the compiler due to allocation granularity.
-    SGPR: |-
+    SGPR: >-
      The number of scalar general-purpose registers allocated for the kernel,
      see SALU. Note: this may not exactly match the number of SGPRs requested by
      the compiler due to allocation granularity.
-    LDS Allocation: |-
+    LDS Allocation: >-
      The number of bytes of LDS memory (or, shared memory) allocated for
      this kernel. Note: This may also be larger than what was requested at compile
      time due to both allocation granularity and dynamic per-dispatch LDS allocations.
@@ -266,7 +266,7 @@ Panel Config:
      or data (atomic with return value) was returned to the L2.
    HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
      of data from the accelerator's local HBM, per normalization unit.
-    HBM Wr: |-
+    HBM Wr: >-
      The total number of L2 requests to Infinity Fabric to write or atomically
      update 32B or 64B of data in the accelerator's local HBM, per normalization
      unit.
@@ -132,48 +132,48 @@ Panel Config:
            / 1e9) ) / 1e9
          unit: GFLOP/s
  metrics_description:
-    VALU FLOPs (F16): |-
+    VALU FLOPs (F16): >-
      The total 16-bit floating-point operations executed per second on the VALU.
      This is presented with the value of the peak empirical F16 FLOPs achievable
      on the specific accelerator. Note: this does not include any F16 operations
      from MFMA instructions.
-    VALU FLOPs (F32): |-
+    VALU FLOPs (F32): >-
      The total 32-bit floating-point operations executed per second on the VALU.
      This is presented with the value of the peak empirical F32 FLOPs achievable
      on the specific accelerator. Note: this does not include any F32 operations
      from MFMA instructions.
-    VALU FLOPs (F64): |-
+    VALU FLOPs (F64): >-
      The total 64-bit floating-point operations executed per second on the VALU.
      This is presented with the value of the peak empirical F64 FLOPs achievable
      on the specific accelerator. Note: this does not include any F64 operations
      from MFMA instructions.
-    MFMA FLOPs (BF16): |-
+    MFMA FLOPs (BF16): >-
      The total number of 16-bit brain floating point MFMA operations executed
      per second. Note: this does not include any 16-bit brain floating point
      operations from VALU instructions. The peak empirically measured BF16 MFMA
      operations achievable on the specific accelerator is displayed alongside
      for comparison.
-    MFMA FLOPs (F16): |-
+    MFMA FLOPs (F16): >-
      The total number of 16-bit floating point MFMA operations executed per
      second. Note: this does not include any 16-bit floating point operations from
      VALU instructions. The peak empirically measured F16 MFMA operations
      achievable on the specific accelerator is displayed alongside for comparison.
-    MFMA FLOPs (F32): |-
+    MFMA FLOPs (F32): >-
      The total number of 32-bit floating point MFMA operations executed per
      second. Note: this does not include any 32-bit floating point operations from
      VALU instructions. The peak empirically measured F32 MFMA operations
      achievable on the specific accelerator is displayed alongside for comparison.
-    MFMA FLOPs (F64): |-
+    MFMA FLOPs (F64): >-
      The total number of 64-bit floating point MFMA operations executed per
      second. Note: this does not include any 64-bit floating point operations from
      VALU instructions. The peak empirically measured F64 MFMA operations
      achievable on the specific accelerator is displayed alongside for comparison.
-    MFMA IOPs (Int8): |-
+    MFMA IOPs (Int8): >-
      The total number of 8-bit integer MFMA operations executed per second.
      Note: this does not include any 8-bit integer operations from VALU instructions.
      The peak empirically measured INT8 MFMA operations achievable on the specific
      accelerator is displayed alongside for comparison.
-    HBM Bandwidth: |-
+    HBM Bandwidth: >-
      The total number of bytes read from and written to High-Bandwidth
      Memory (HBM) per second. The peak empirically measured bandwidth achievable
      on the specific accelerator is displayed alongside for comparison.
@@ -194,22 +194,22 @@ Panel Config:
      from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
      example for more detail). The peak empirically measured LDS bandwidth achievable
      on the specific accelerator is displayed alongside for comparison.
-    AI L1: |-
+    AI L1: >-
      The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
      of total floating-point operations (FLOPs) to total bytes transferred between
      the L1 cache and the processing units. This value is used as the x-coordinate
      for the L1 roofline.
-    AI L2: |-
+    AI L2: >-
      The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
      of total floating-point operations (FLOPs) to total bytes transferred between
      the L2 cache and the L1 cache. This value is used as the x-coordinate for
      the L2 roofline.
-    AI HBM: |-
+    AI HBM: >-
      The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
      It is the ratio of total floating-point operations (FLOPs) to total bytes
      transferred between HBM and the L2 cache. This value is used as the x-coordinate
      for the HBM roofline.
-    Performance (GFLOPs): |-
+    Performance (GFLOPs): >-
      The overall achieved performance, measured in GigaFLOPs
      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
      operations divided by the total execution time. This value is used as the y-coordinate
@@ -141,6 +141,6 @@ Panel Config:
      the CPC-L2 interface was active doing any work.
    CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
      translation
-    CPC-UTCL2 Utilization: |-
+    CPC-UTCL2 Utilization: >-
      Percent of total cycles counted by the CPC's L2 address translation
      interface where the CPC was busy doing address translation work.
@@ -168,7 +168,7 @@ Panel Config:
      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
      within the workgroup manager rather than a lack of a CU or SIMD with sufficient
      resources.
-    Not-scheduled Rate (Scheduler-Pipe): |-
+    Not-scheduled Rate (Scheduler-Pipe): >-
      The percent of total scheduler-pipe cycles in the kernel where a workgroup
      could not be scheduled to a CU due to a bottleneck within the scheduler-pipes
      rather than a lack of a CU or SIMD with sufficient resources.
@@ -121,26 +121,26 @@ Panel Config:
    Workgroup Size: The total number of work-items (or, threads) in each workgroup
      (or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
      to the total block size.
-    Total Wavefronts: |-
+    Total Wavefronts: >-
      The total number of wavefronts launched as part of the kernel dispatch.
      On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront
      size is always 64 work-items. Thus, the total number of wavefronts should
      be equivalent to the ceiling of grid size divided by 64.
    Saved Wavefronts: The total number of wavefronts saved at a context-save.
    Restored Wavefronts: The total number of wavefronts restored from a context-save.
-    VGPRs: |-
+    VGPRs: >-
      The number of architected vector general-purpose registers allocated
      for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
      requested by the compiler due to allocation granularity.
-    AGPRs: |-
+    AGPRs: >-
      The number of accumulation vector general-purpose registers allocated
      for the kernel, see AGPRs. Note: this may not exactly match the number of
      AGPRs requested by the compiler due to allocation granularity.
-    SGPRs: |-
+    SGPRs: >-
      The number of scalar general-purpose registers allocated for the kernel,
      see SALU. Note: this may not exactly match the number of SGPRs requested by
      the compiler due to allocation granularity.
-    LDS Allocation: |-
+    LDS Allocation: >-
      The number of bytes of LDS memory (or, shared memory) allocated for
      this kernel. Note: This may also be larger than what was requested at compile
      time due to both allocation granularity and dynamic per-dispatch LDS allocations.
@@ -173,7 +173,7 @@ Panel Config:
      rather than identification of a precise limiter. The sum of this metric, Issue
      Wait Cycles and Active Wait Cycles should be equal to the total Wave Cycles
      metric.
-    Wavefront Occupancy: |-
+    Wavefront Occupancy: >-
      The time-averaged number of wavefronts resident on the accelerator over
      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms).
@@ -268,7 +268,7 @@ Panel Config:
      floating-point operands issued to the VALU per normalization unit.
    F64-Trans: The total number of transcendental instructions (such as sqrt) operating
      on 64-bit floating-point operands issued to the VALU per normalization unit.
-    Conversion: |-
+    Conversion: >-
      The total number of type conversion instructions (such as converting
      data to or from F32\u2194F64) issued to the VALU per normalization unit.
    Global/Generic Instr: The total number of global & generic memory instructions
@@ -237,37 +237,37 @@ Panel Config:
          max: MAX(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
          unit: (OPs + $normUnit)
  metrics_description:
-    VALU FLOPs: |-
+    VALU FLOPs: >-
      The total floating-point operations executed per second on the VALU.
      This is also presented as a percent of the peak theoretical FLOPs achievable
      on the specific accelerator. Note: this does not include any floating-point
      operations from MFMA instructions.
-    VALU IOPs: |-
+    VALU IOPs: >-
      The total integer operations executed per second on the VALU. This is
      also presented as a percent of the peak theoretical IOPs achievable on the
      specific accelerator. Note: this does not include any integer operations from
      MFMA instructions.
-    MFMA FLOPs (BF16): |-
+    MFMA FLOPs (BF16): >-
      The total number of 16-bit brain floating point MFMA operations executed
      per second. Note: this does not include any 16-bit brain floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      BF16 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F16): |-
+    MFMA FLOPs (F16): >-
      The total number of 16-bit floating point MFMA operations executed per
      second. Note: this does not include any 16-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F16 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F32): |-
+    MFMA FLOPs (F32): >-
      The total number of 32-bit floating point MFMA operations executed per
      second. Note: this does not include any 32-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F32 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F64): |-
+    MFMA FLOPs (F64): >-
      The total number of 64-bit floating point MFMA operations executed per
      second. Note: this does not include any 64-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F64 MFMA operations achievable on the specific accelerator.
-    MFMA IOPs (INT8): |-
+    MFMA IOPs (INT8): >-
      The total number of 8-bit integer MFMA operations executed per second.
      Note: this does not include any 8-bit integer operations from VALU instructions.
      This is also presented as a percent of the peak theoretical INT8 MFMA operations
@@ -140,7 +140,7 @@ Panel Config:
      unit.
    Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
      stalls from non-dword aligned addresses per normalization unit.
-    Mem Violations: |-
+    Mem Violations: >-
      The total number of out-of-bounds accesses made to the LDS, per normalization
      unit. This is unused and expected to be zero in most configurations for
      modern CDNA\u2122 accelerators.
@@ -92,7 +92,7 @@ Panel Config:
    Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
      loaded line the cache. Calculated as the ratio of the number of L1I requests
      that hit over the number of all L1I requests.
-    L1I-L2 Bandwidth Utilization: |-
+    L1I-L2 Bandwidth Utilization: >-
      The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
      achieved. Calculated as the ratio of the total number of requests from the
      L1I to the L2 cache over the total L1I-L2 interface cycles.
@@ -154,7 +154,7 @@ Panel Config:
    sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
      bandwidth acheived. Calculated as total number of bytes read from, written to,
      or atomically updated across the sL1D - L2 interface.
-    sL1D-L2 BW: |-
+    sL1D-L2 BW: >-
      The total number of bytes read from, written to, or atomically updated
      across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D
      writes and atomics are typically unused on current CDNA accelerators, so
@@ -164,7 +164,7 @@ Panel Config:
      unit.
    Hits: The total number of sL1D requests that hit on a previously loaded cache
      line, per normalization unit.
-    Misses - Non Duplicated: |-
+    Misses - Non Duplicated: >-
      The total number of sL1D requests that missed on a cache line that was
      not already pending due to another request, per normalization unit.
    Misses- Duplicated: The total number of sL1D requests that missed on a cache line
@@ -187,6 +187,6 @@ Panel Config:
      unit.
    Write Req: The total number of write requests from sL1D to the L2, per normalization
      unit. Typically unused on current CDNA accelerators.
-    Stall Cycles: |-
+    Stall Cycles: >-
      The total number of cycles the sL1D\u2194L2 interface was stalled, per
      normalization unit.
@@ -436,7 +436,7 @@ Panel Config:
      per normalization unit.
    Translation Misses: The total number of translation requests that missed in the
      UTCL1 due to translation not being present in the cache, per normalization unit.
-    Permission Misses: |-
+    Permission Misses: >-
      The total number of translation requests that missed in the UTCL1 due
      to a permission error, per normalization unit. This is unused and expected
      to be zero in most configurations for modern CDNA\u2122 accelerators.
@@ -227,12 +227,12 @@ Panel Config:
          pop: None
          coll_level: SQ_IFETCH_LEVEL
  metrics_description:
-    VALU FLOPs: |-
+    VALU FLOPs: >-
      The total floating-point operations executed per second on the VALU.
      This is also presented as a percent of the peak theoretical FLOPs achievable
      on the specific accelerator. Note: this does not include any floating-point
      operations from MFMA instructions.
-    VALU IOPs: |-
+    VALU IOPs: >-
      The total integer operations executed per second on the VALU. This is
      also presented as a percent of the peak theoretical IOPs achievable on the
      specific accelerator. Note: this does not include any integer operations from
@@ -242,27 +242,27 @@ Panel Config:
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F8 MFMA operations achievable on the specific accelerator. It is supported on
      AMD Instinct MI300 series and later only.
-    MFMA FLOPs (BF16): |-
+    MFMA FLOPs (BF16): >-
      The total number of 16-bit brain floating point MFMA operations executed
      per second. Note: this does not include any 16-bit brain floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      BF16 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F16): |-
+    MFMA FLOPs (F16): >-
      The total number of 16-bit floating point MFMA operations executed per
      second. Note: this does not include any 16-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F16 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F32): |-
+    MFMA FLOPs (F32): >-
      The total number of 32-bit floating point MFMA operations executed per
      second. Note: this does not include any 32-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F32 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F64): |-
+    MFMA FLOPs (F64): >-
      The total number of 64-bit floating point MFMA operations executed per
      second. Note: this does not include any 64-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F64 MFMA operations achievable on the specific accelerator.
-    MFMA IOPs (Int8): |-
+    MFMA IOPs (Int8): >-
      The total number of 8-bit integer MFMA operations executed per second.
      Note: this does not include any 8-bit integer operations from VALU instructions.
      This is also presented as a percent of the peak theoretical INT8 MFMA operations
@@ -295,7 +295,7 @@ Panel Config:
    IPC: The ratio of the total number of instructions executed on the CU over the
      total active CU cycles. This is also presented as a percent of the peak theoretical
      bandwidth achievable on the specific accelerator.
-    Wavefront Occupancy: |-
+    Wavefront Occupancy: >-
      The time-averaged number of wavefronts resident on the accelerator over
      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms). This is also presented as a percent of the peak theoretical
@@ -328,7 +328,7 @@ Panel Config:
      if only a single value is requested in a cache line, the data movement will
      still be counted as a full cache line. This is also presented as a percent of
      the peak theoretical bandwidth achievable on the specific accelerator.
-    L2-Fabric Read BW: |-
+    L2-Fabric Read BW: >-
      The number of bytes read by the L2 over the Infinity Fabric\u2122 interface
      per unit time. This is also presented as a percent of the peak theoretical
      bandwidth achievable on the specific accelerator.
@@ -162,15 +162,15 @@ Panel Config:
    Active CUs: Total number of active compute units (CUs) on the accelerator during
      the kernel execution.
    Num CUs: Total number of compute units (CUs) on the accelerator.
-    VGPR: |-
+    VGPR: >-
      The number of architected vector general-purpose registers allocated
      for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
      requested by the compiler due to allocation granularity.
-    SGPR: |-
+    SGPR: >-
      The number of scalar general-purpose registers allocated for the kernel,
      see SALU. Note: this may not exactly match the number of SGPRs requested by
      the compiler due to allocation granularity.
-    LDS Allocation: |-
+    LDS Allocation: >-
      The number of bytes of LDS memory (or, shared memory) allocated for
      this kernel. Note: This may also be larger than what was requested at compile
      time due to both allocation granularity and dynamic per-dispatch LDS allocations.
@@ -252,7 +252,7 @@ Panel Config:
      or data (atomic with return value) was returned to the L2.
    HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
      of data from the accelerator's local HBM, per normalization unit.
-    HBM Wr: |-
+    HBM Wr: >-
      The total number of L2 requests to Infinity Fabric to write or atomically
      update 32B or 64B of data in the accelerator's local HBM, per normalization
      unit.
@@ -140,17 +140,17 @@ Panel Config:
            * 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9
          unit: GFLOP/s
  metrics_description:
-    VALU FLOPs (F16): |-
+    VALU FLOPs (F16): >-
      The total 16-bit floating-point operations executed per second on the VALU.
      This is presented with the value of the peak empirical F16 FLOPs achievable
      on the specific accelerator. Note: this does not include any F16 operations
      from MFMA instructions.
-    VALU FLOPs (F32): |-
+    VALU FLOPs (F32): >-
      The total 32-bit floating-point operations executed per second on the VALU.
      This is presented with the value of the peak empirical F32 FLOPs achievable
      on the specific accelerator. Note: this does not include any F32 operations
      from MFMA instructions.
-    VALU FLOPs (F64): |-
+    VALU FLOPs (F64): >-
      The total 64-bit floating-point operations executed per second on the VALU.
      This is presented with the value of the peak empirical F64 FLOPs achievable
      on the specific accelerator. Note: this does not include any F64 operations
@@ -160,33 +160,33 @@ Panel Config:
      from VALU instructions. The peak empirically measured F8 MFMA operations achievable
      on the specific accelerator is displayed alongside for comparison. It is supported
      on AMD Instinct MI300 series and later only.
-    MFMA FLOPs (BF16): |-
+    MFMA FLOPs (BF16): >-
      The total number of 16-bit brain floating point MFMA operations executed
      per second. Note: this does not include any 16-bit brain floating point
      operations from VALU instructions. The peak empirically measured BF16 MFMA
      operations achievable on the specific accelerator is displayed alongside
      for comparison.
-    MFMA FLOPs (F16): |-
+    MFMA FLOPs (F16): >-
      The total number of 16-bit floating point MFMA operations executed per
      second. Note: this does not include any 16-bit floating point operations from
      VALU instructions. The peak empirically measured F16 MFMA operations
      achievable on the specific accelerator is displayed alongside for comparison.
-    MFMA FLOPs (F32): |-
+    MFMA FLOPs (F32): >-
      The total number of 32-bit floating point MFMA operations executed per
      second. Note: this does not include any 32-bit floating point operations from
      VALU instructions. The peak empirically measured F32 MFMA operations
      achievable on the specific accelerator is displayed alongside for comparison.
-    MFMA FLOPs (F64): |-
+    MFMA FLOPs (F64): >-
      The total number of 64-bit floating point MFMA operations executed per
      second. Note: this does not include any 64-bit floating point operations from
      VALU instructions. The peak empirically measured F64 MFMA operations
      achievable on the specific accelerator is displayed alongside for comparison.
-    MFMA IOPs (Int8): |-
+    MFMA IOPs (Int8): >-
      The total number of 8-bit integer MFMA operations executed per second.
      Note: this does not include any 8-bit integer operations from VALU instructions.
      The peak empirically measured INT8 MFMA operations achievable on the specific
      accelerator is displayed alongside for comparison.
-    HBM Bandwidth: |-
+    HBM Bandwidth: >-
      The total number of bytes read from and written to High-Bandwidth
      Memory (HBM) per second. The peak empirically measured bandwidth achievable
      on the specific accelerator is displayed alongside for comparison.
@@ -207,22 +207,22 @@ Panel Config:
      from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
      example for more detail). The peak empirically measured LDS bandwidth achievable
      on the specific accelerator is displayed alongside for comparison.
-    AI L1: |-
+    AI L1: >-
      The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
      of total floating-point operations (FLOPs) to total bytes transferred between
      the L1 cache and the processing units. This value is used as the x-coordinate
      for the L1 roofline.
-    AI L2: |-
+    AI L2: >-
      The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
      of total floating-point operations (FLOPs) to total bytes transferred between
      the L2 cache and the L1 cache. This value is used as the x-coordinate for
      the L2 roofline.
-    AI HBM: |-
+    AI HBM: >-
      The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
      It is the ratio of total floating-point operations (FLOPs) to total bytes
      transferred between HBM and the L2 cache. This value is used as the x-coordinate
      for the HBM roofline.
-    Performance (GFLOPs): |-
+    Performance (GFLOPs): >-
      The overall achieved performance, measured in GigaFLOPs
      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
      operations divided by the total execution time. This value is used as the y-coordinate
@@ -141,6 +141,6 @@ Panel Config:
      the CPC-L2 interface was active doing any work.
    CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
      translation
-    CPC-UTCL2 Utilization: |-
+    CPC-UTCL2 Utilization: >-
      Percent of total cycles counted by the CPC's L2 address translation
      interface where the CPC was busy doing address translation work.
@@ -168,7 +168,7 @@ Panel Config:
      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
      within the workgroup manager rather than a lack of a CU or SIMD with sufficient
      resources.
-    Not-scheduled Rate (Scheduler-Pipe): |-
+    Not-scheduled Rate (Scheduler-Pipe): >-
      The percent of total scheduler-pipe cycles in the kernel where a workgroup
      could not be scheduled to a CU due to a bottleneck within the scheduler-pipes
      rather than a lack of a CU or SIMD with sufficient resources.
@@ -121,26 +121,26 @@ Panel Config:
    Workgroup Size: The total number of work-items (or, threads) in each workgroup
      (or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
      to the total block size.
-    Total Wavefronts: |-
+    Total Wavefronts: >-
      The total number of wavefronts launched as part of the kernel dispatch.
      On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront
      size is always 64 work-items. Thus, the total number of wavefronts should
      be equivalent to the ceiling of grid size divided by 64.
    Saved Wavefronts: The total number of wavefronts saved at a context-save.
    Restored Wavefronts: The total number of wavefronts restored from a context-save.
-    VGPRs: |-
+    VGPRs: >-
      The number of architected vector general-purpose registers allocated
      for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
      requested by the compiler due to allocation granularity.
-    AGPRs: |-
+    AGPRs: >-
      The number of accumulation vector general-purpose registers allocated
      for the kernel, see AGPRs. Note: this may not exactly match the number of
      AGPRs requested by the compiler due to allocation granularity.
-    SGPRs: |-
+    SGPRs: >-
      The number of scalar general-purpose registers allocated for the kernel,
      see SALU. Note: this may not exactly match the number of SGPRs requested by
      the compiler due to allocation granularity.
-    LDS Allocation: |-
+    LDS Allocation: >-
      The number of bytes of LDS memory (or, shared memory) allocated for
      this kernel. Note: This may also be larger than what was requested at compile
      time due to both allocation granularity and dynamic per-dispatch LDS allocations.
@@ -173,7 +173,7 @@ Panel Config:
      rather than identification of a precise limiter. The sum of this metric, Issue
      Wait Cycles and Active Wait Cycles should be equal to the total Wave Cycles
      metric.
-    Wavefront Occupancy: |-
+    Wavefront Occupancy: >-
      The time-averaged number of wavefronts resident on the accelerator over
      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms).
@@ -273,7 +273,7 @@ Panel Config:
      floating-point operands issued to the VALU per normalization unit.
    F64-Trans: The total number of transcendental instructions (such as sqrt) operating
      on 64-bit floating-point operands issued to the VALU per normalization unit.
-    Conversion: |-
+    Conversion: >-
      The total number of type conversion instructions (such as converting
      data to or from F32\u2194F64) issued to the VALU per normalization unit.
    Global/Generic Instr: The total number of global & generic memory instructions
@@ -251,37 +251,37 @@ Panel Config:
          max: MAX(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
          unit: (OPs + $normUnit)
  metrics_description:
-    VALU FLOPs: |-
+    VALU FLOPs: >-
      The total floating-point operations executed per second on the VALU.
      This is also presented as a percent of the peak theoretical FLOPs achievable
      on the specific accelerator. Note: this does not include any floating-point
      operations from MFMA instructions.
-    VALU IOPs: |-
+    VALU IOPs: >-
      The total integer operations executed per second on the VALU. This is
      also presented as a percent of the peak theoretical IOPs achievable on the
      specific accelerator. Note: this does not include any integer operations from
      MFMA instructions.
-    MFMA FLOPs (BF16): |-
+    MFMA FLOPs (BF16): >-
      The total number of 16-bit brain floating point MFMA operations executed
      per second. Note: this does not include any 16-bit brain floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      BF16 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F16): |-
+    MFMA FLOPs (F16): >-
      The total number of 16-bit floating point MFMA operations executed per
      second. Note: this does not include any 16-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F16 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F32): |-
+    MFMA FLOPs (F32): >-
      The total number of 32-bit floating point MFMA operations executed per
      second. Note: this does not include any 32-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F32 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F64): |-
+    MFMA FLOPs (F64): >-
      The total number of 64-bit floating point MFMA operations executed per
      second. Note: this does not include any 64-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F64 MFMA operations achievable on the specific accelerator.
-    MFMA IOPs (INT8): |-
+    MFMA IOPs (INT8): >-
      The total number of 8-bit integer MFMA operations executed per second.
      Note: this does not include any 8-bit integer operations from VALU instructions.
      This is also presented as a percent of the peak theoretical INT8 MFMA operations
@@ -140,7 +140,7 @@ Panel Config:
      unit.
    Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
      stalls from non-dword aligned addresses per normalization unit.
-    Mem Violations: |-
+    Mem Violations: >-
      The total number of out-of-bounds accesses made to the LDS, per normalization
      unit. This is unused and expected to be zero in most configurations for
      modern CDNA\u2122 accelerators.
@@ -92,7 +92,7 @@ Panel Config:
    Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
      loaded line the cache. Calculated as the ratio of the number of L1I requests
      that hit over the number of all L1I requests.
-    L1I-L2 Bandwidth Utilization: |-
+    L1I-L2 Bandwidth Utilization: >-
      The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
      achieved. Calculated as the ratio of the total number of requests from the
      L1I to the L2 cache over the total L1I-L2 interface cycles.
@@ -154,7 +154,7 @@ Panel Config:
    sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
      bandwidth acheived. Calculated as total number of bytes read from, written to,
      or atomically updated across the sL1D - L2 interface.
-    sL1D-L2 BW: |-
+    sL1D-L2 BW: >-
      The total number of bytes read from, written to, or atomically updated
      across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D
      writes and atomics are typically unused on current CDNA accelerators, so
@@ -164,7 +164,7 @@ Panel Config:
      unit.
    Hits: The total number of sL1D requests that hit on a previously loaded cache
      line, per normalization unit.
-    Misses - Non Duplicated: |-
+    Misses - Non Duplicated: >-
      The total number of sL1D requests that missed on a cache line that was
      not already pending due to another request, per normalization unit.
    Misses- Duplicated: The total number of sL1D requests that missed on a cache line
@@ -187,6 +187,6 @@ Panel Config:
      unit.
    Write Req: The total number of write requests from sL1D to the L2, per normalization
      unit. Typically unused on current CDNA accelerators.
-    Stall Cycles: |-
+    Stall Cycles: >-
      The total number of cycles the sL1D\u2194L2 interface was stalled, per
      normalization unit.
@@ -398,7 +398,7 @@ Panel Config:
      per normalization unit.
    Translation Misses: The total number of translation requests that missed in the
      UTCL1 due to translation not being present in the cache, per normalization unit.
-    Permission Misses: |-
+    Permission Misses: >-
      The total number of translation requests that missed in the UTCL1 due
      to a permission error, per normalization unit. This is unused and expected
      to be zero in most configurations for modern CDNA\u2122 accelerators.
@@ -227,12 +227,12 @@ Panel Config:
          pop: None
          coll_level: SQ_IFETCH_LEVEL
  metrics_description:
-    VALU FLOPs: |-
+    VALU FLOPs: >-
      The total floating-point operations executed per second on the VALU.
      This is also presented as a percent of the peak theoretical FLOPs achievable
      on the specific accelerator. Note: this does not include any floating-point
      operations from MFMA instructions.
-    VALU IOPs: |-
+    VALU IOPs: >-
      The total integer operations executed per second on the VALU. This is
      also presented as a percent of the peak theoretical IOPs achievable on the
      specific accelerator. Note: this does not include any integer operations from
@@ -242,27 +242,27 @@ Panel Config:
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F8 MFMA operations achievable on the specific accelerator. It is supported on
      AMD Instinct MI300 series and later only.
-    MFMA FLOPs (BF16): |-
+    MFMA FLOPs (BF16): >-
      The total number of 16-bit brain floating point MFMA operations executed
      per second. Note: this does not include any 16-bit brain floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      BF16 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F16): |-
+    MFMA FLOPs (F16): >-
      The total number of 16-bit floating point MFMA operations executed per
      second. Note: this does not include any 16-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F16 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F32): |-
+    MFMA FLOPs (F32): >-
      The total number of 32-bit floating point MFMA operations executed per
      second. Note: this does not include any 32-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F32 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F64): |-
+    MFMA FLOPs (F64): >-
      The total number of 64-bit floating point MFMA operations executed per
      second. Note: this does not include any 64-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F64 MFMA operations achievable on the specific accelerator.
-    MFMA IOPs (Int8): |-
+    MFMA IOPs (Int8): >-
      The total number of 8-bit integer MFMA operations executed per second.
      Note: this does not include any 8-bit integer operations from VALU instructions.
      This is also presented as a percent of the peak theoretical INT8 MFMA operations
@@ -295,7 +295,7 @@ Panel Config:
    IPC: The ratio of the total number of instructions executed on the CU over the
      total active CU cycles. This is also presented as a percent of the peak theoretical
      bandwidth achievable on the specific accelerator.
-    Wavefront Occupancy: |-
+    Wavefront Occupancy: >-
      The time-averaged number of wavefronts resident on the accelerator over
      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms). This is also presented as a percent of the peak theoretical
@@ -328,7 +328,7 @@ Panel Config:
      if only a single value is requested in a cache line, the data movement will
      still be counted as a full cache line. This is also presented as a percent of
      the peak theoretical bandwidth achievable on the specific accelerator.
-    L2-Fabric Read BW: |-
+    L2-Fabric Read BW: >-
      The number of bytes read by the L2 over the Infinity Fabric\u2122 interface
      per unit time. This is also presented as a percent of the peak theoretical
      bandwidth achievable on the specific accelerator.
@@ -162,15 +162,15 @@ Panel Config:
    Active CUs: Total number of active compute units (CUs) on the accelerator during
      the kernel execution.
    Num CUs: Total number of compute units (CUs) on the accelerator.
-    VGPR: |-
+    VGPR: >-
      The number of architected vector general-purpose registers allocated
      for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
      requested by the compiler due to allocation granularity.
-    SGPR: |-
+    SGPR: >-
      The number of scalar general-purpose registers allocated for the kernel,
      see SALU. Note: this may not exactly match the number of SGPRs requested by
      the compiler due to allocation granularity.
-    LDS Allocation: |-
+    LDS Allocation: >-
      The number of bytes of LDS memory (or, shared memory) allocated for
      this kernel. Note: This may also be larger than what was requested at compile
      time due to both allocation granularity and dynamic per-dispatch LDS allocations.
@@ -252,7 +252,7 @@ Panel Config:
      or data (atomic with return value) was returned to the L2.
    HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
      of data from the accelerator's local HBM, per normalization unit.
-    HBM Wr: |-
+    HBM Wr: >-
      The total number of L2 requests to Infinity Fabric to write or atomically
      update 32B or 64B of data in the accelerator's local HBM, per normalization
      unit.
@@ -140,17 +140,17 @@ Panel Config:
            * 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9
          unit: GFLOP/s
  metrics_description:
-    VALU FLOPs (F16): |-
+    VALU FLOPs (F16): >-
      The total 16-bit floating-point operations executed per second on the VALU.
      This is presented with the value of the peak empirical F16 FLOPs achievable
      on the specific accelerator. Note: this does not include any F16 operations
      from MFMA instructions.
-    VALU FLOPs (F32): |-
+    VALU FLOPs (F32): >-
      The total 32-bit floating-point operations executed per second on the VALU.
      This is presented with the value of the peak empirical F32 FLOPs achievable
      on the specific accelerator. Note: this does not include any F32 operations
      from MFMA instructions.
-    VALU FLOPs (F64): |-
+    VALU FLOPs (F64): >-
      The total 64-bit floating-point operations executed per second on the VALU.
      This is presented with the value of the peak empirical F64 FLOPs achievable
      on the specific accelerator. Note: this does not include any F64 operations
@@ -160,33 +160,33 @@ Panel Config:
      from VALU instructions. The peak empirically measured F8 MFMA operations achievable
      on the specific accelerator is displayed alongside for comparison. It is supported
      on AMD Instinct MI300 series and later only.
-    MFMA FLOPs (BF16): |-
+    MFMA FLOPs (BF16): >-
      The total number of 16-bit brain floating point MFMA operations executed
      per second. Note: this does not include any 16-bit brain floating point
      operations from VALU instructions. The peak empirically measured BF16 MFMA
      operations achievable on the specific accelerator is displayed alongside
      for comparison.
-    MFMA FLOPs (F16): |-
+    MFMA FLOPs (F16): >-
      The total number of 16-bit floating point MFMA operations executed per
      second. Note: this does not include any 16-bit floating point operations from
      VALU instructions. The peak empirically measured F16 MFMA operations
      achievable on the specific accelerator is displayed alongside for comparison.
-    MFMA FLOPs (F32): |-
+    MFMA FLOPs (F32): >-
      The total number of 32-bit floating point MFMA operations executed per
      second. Note: this does not include any 32-bit floating point operations from
      VALU instructions. The peak empirically measured F32 MFMA operations
      achievable on the specific accelerator is displayed alongside for comparison.
-    MFMA FLOPs (F64): |-
+    MFMA FLOPs (F64): >-
      The total number of 64-bit floating point MFMA operations executed per
      second. Note: this does not include any 64-bit floating point operations from
      VALU instructions. The peak empirically measured F64 MFMA operations
      achievable on the specific accelerator is displayed alongside for comparison.
-    MFMA IOPs (Int8): |-
+    MFMA IOPs (Int8): >-
      The total number of 8-bit integer MFMA operations executed per second.
      Note: this does not include any 8-bit integer operations from VALU instructions.
      The peak empirically measured INT8 MFMA operations achievable on the specific
      accelerator is displayed alongside for comparison.
-    HBM Bandwidth: |-
+    HBM Bandwidth: >-
      The total number of bytes read from and written to High-Bandwidth
      Memory (HBM) per second. The peak empirically measured bandwidth achievable
      on the specific accelerator is displayed alongside for comparison.
@@ -207,22 +207,22 @@ Panel Config:
      from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
      example for more detail). The peak empirically measured LDS bandwidth achievable
      on the specific accelerator is displayed alongside for comparison.
-    AI L1: |-
+    AI L1: >-
      The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
      of total floating-point operations (FLOPs) to total bytes transferred between
      the L1 cache and the processing units. This value is used as the x-coordinate
      for the L1 roofline.
-    AI L2: |-
+    AI L2: >-
      The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
      of total floating-point operations (FLOPs) to total bytes transferred between
      the L2 cache and the L1 cache. This value is used as the x-coordinate for
      the L2 roofline.
-    AI HBM: |-
+    AI HBM: >-
      The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
      It is the ratio of total floating-point operations (FLOPs) to total bytes
      transferred between HBM and the L2 cache. This value is used as the x-coordinate
      for the HBM roofline.
-    Performance (GFLOPs): |-
+    Performance (GFLOPs): >-
      The overall achieved performance, measured in GigaFLOPs
      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
      operations divided by the total execution time. This value is used as the y-coordinate
@@ -141,6 +141,6 @@ Panel Config:
      the CPC-L2 interface was active doing any work.
    CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
      translation
-    CPC-UTCL2 Utilization: |-
+    CPC-UTCL2 Utilization: >-
      Percent of total cycles counted by the CPC's L2 address translation
      interface where the CPC was busy doing address translation work.
@@ -168,7 +168,7 @@ Panel Config:
      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
      within the workgroup manager rather than a lack of a CU or SIMD with sufficient
      resources.
-    Not-scheduled Rate (Scheduler-Pipe): |-
+    Not-scheduled Rate (Scheduler-Pipe): >-
      The percent of total scheduler-pipe cycles in the kernel where a workgroup
      could not be scheduled to a CU due to a bottleneck within the scheduler-pipes
      rather than a lack of a CU or SIMD with sufficient resources.
@@ -121,26 +121,26 @@ Panel Config:
    Workgroup Size: The total number of work-items (or, threads) in each workgroup
      (or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
      to the total block size.
-    Total Wavefronts: |-
+    Total Wavefronts: >-
      The total number of wavefronts launched as part of the kernel dispatch.
      On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront
      size is always 64 work-items. Thus, the total number of wavefronts should
      be equivalent to the ceiling of grid size divided by 64.
    Saved Wavefronts: The total number of wavefronts saved at a context-save.
    Restored Wavefronts: The total number of wavefronts restored from a context-save.
-    VGPRs: |-
+    VGPRs: >-
      The number of architected vector general-purpose registers allocated
      for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
      requested by the compiler due to allocation granularity.
-    AGPRs: |-
+    AGPRs: >-
      The number of accumulation vector general-purpose registers allocated
      for the kernel, see AGPRs. Note: this may not exactly match the number of
      AGPRs requested by the compiler due to allocation granularity.
-    SGPRs: |-
+    SGPRs: >-
      The number of scalar general-purpose registers allocated for the kernel,
      see SALU. Note: this may not exactly match the number of SGPRs requested by
      the compiler due to allocation granularity.
-    LDS Allocation: |-
+    LDS Allocation: >-
      The number of bytes of LDS memory (or, shared memory) allocated for
      this kernel. Note: This may also be larger than what was requested at compile
      time due to both allocation granularity and dynamic per-dispatch LDS allocations.
@@ -173,7 +173,7 @@ Panel Config:
      rather than identification of a precise limiter. The sum of this metric, Issue
      Wait Cycles and Active Wait Cycles should be equal to the total Wave Cycles
      metric.
-    Wavefront Occupancy: |-
+    Wavefront Occupancy: >-
      The time-averaged number of wavefronts resident on the accelerator over
      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms).
@@ -273,7 +273,7 @@ Panel Config:
      floating-point operands issued to the VALU per normalization unit.
    F64-Trans: The total number of transcendental instructions (such as sqrt) operating
      on 64-bit floating-point operands issued to the VALU per normalization unit.
-    Conversion: |-
+    Conversion: >-
      The total number of type conversion instructions (such as converting
      data to or from F32\u2194F64) issued to the VALU per normalization unit.
    Global/Generic Instr: The total number of global & generic memory instructions
@@ -251,37 +251,37 @@ Panel Config:
          max: MAX(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
          unit: (OPs + $normUnit)
  metrics_description:
-    VALU FLOPs: |-
+    VALU FLOPs: >-
      The total floating-point operations executed per second on the VALU.
      This is also presented as a percent of the peak theoretical FLOPs achievable
      on the specific accelerator. Note: this does not include any floating-point
      operations from MFMA instructions.
-    VALU IOPs: |-
+    VALU IOPs: >-
      The total integer operations executed per second on the VALU. This is
      also presented as a percent of the peak theoretical IOPs achievable on the
      specific accelerator. Note: this does not include any integer operations from
      MFMA instructions.
-    MFMA FLOPs (BF16): |-
+    MFMA FLOPs (BF16): >-
      The total number of 16-bit brain floating point MFMA operations executed
      per second. Note: this does not include any 16-bit brain floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      BF16 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F16): |-
+    MFMA FLOPs (F16): >-
      The total number of 16-bit floating point MFMA operations executed per
      second. Note: this does not include any 16-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F16 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F32): |-
+    MFMA FLOPs (F32): >-
      The total number of 32-bit floating point MFMA operations executed per
      second. Note: this does not include any 32-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F32 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F64): |-
+    MFMA FLOPs (F64): >-
      The total number of 64-bit floating point MFMA operations executed per
      second. Note: this does not include any 64-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F64 MFMA operations achievable on the specific accelerator.
-    MFMA IOPs (INT8): |-
+    MFMA IOPs (INT8): >-
      The total number of 8-bit integer MFMA operations executed per second.
      Note: this does not include any 8-bit integer operations from VALU instructions.
      This is also presented as a percent of the peak theoretical INT8 MFMA operations
@@ -140,7 +140,7 @@ Panel Config:
      unit.
    Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
      stalls from non-dword aligned addresses per normalization unit.
-    Mem Violations: |-
+    Mem Violations: >-
      The total number of out-of-bounds accesses made to the LDS, per normalization
      unit. This is unused and expected to be zero in most configurations for
      modern CDNA\u2122 accelerators.
@@ -92,7 +92,7 @@ Panel Config:
    Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
      loaded line the cache. Calculated as the ratio of the number of L1I requests
      that hit over the number of all L1I requests.
-    L1I-L2 Bandwidth Utilization: |-
+    L1I-L2 Bandwidth Utilization: >-
      The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
      achieved. Calculated as the ratio of the total number of requests from the
      L1I to the L2 cache over the total L1I-L2 interface cycles.
@@ -154,7 +154,7 @@ Panel Config:
    sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
      bandwidth acheived. Calculated as total number of bytes read from, written to,
      or atomically updated across the sL1D - L2 interface.
-    sL1D-L2 BW: |-
+    sL1D-L2 BW: >-
      The total number of bytes read from, written to, or atomically updated
      across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D
      writes and atomics are typically unused on current CDNA accelerators, so
@@ -164,7 +164,7 @@ Panel Config:
      unit.
    Hits: The total number of sL1D requests that hit on a previously loaded cache
      line, per normalization unit.
-    Misses - Non Duplicated: |-
+    Misses - Non Duplicated: >-
      The total number of sL1D requests that missed on a cache line that was
      not already pending due to another request, per normalization unit.
    Misses- Duplicated: The total number of sL1D requests that missed on a cache line
@@ -187,6 +187,6 @@ Panel Config:
      unit.
    Write Req: The total number of write requests from sL1D to the L2, per normalization
      unit. Typically unused on current CDNA accelerators.
-    Stall Cycles: |-
+    Stall Cycles: >-
      The total number of cycles the sL1D\u2194L2 interface was stalled, per
      normalization unit.
@@ -398,7 +398,7 @@ Panel Config:
      per normalization unit.
    Translation Misses: The total number of translation requests that missed in the
      UTCL1 due to translation not being present in the cache, per normalization unit.
-    Permission Misses: |-
+    Permission Misses: >-
      The total number of translation requests that missed in the UTCL1 due
      to a permission error, per normalization unit. This is unused and expected
      to be zero in most configurations for modern CDNA\u2122 accelerators.
@@ -227,12 +227,12 @@ Panel Config:
          pop: None
          coll_level: SQ_IFETCH_LEVEL
  metrics_description:
-    VALU FLOPs: |-
+    VALU FLOPs: >-
      The total floating-point operations executed per second on the VALU.
      This is also presented as a percent of the peak theoretical FLOPs achievable
      on the specific accelerator. Note: this does not include any floating-point
      operations from MFMA instructions.
-    VALU IOPs: |-
+    VALU IOPs: >-
      The total integer operations executed per second on the VALU. This is
      also presented as a percent of the peak theoretical IOPs achievable on the
      specific accelerator. Note: this does not include any integer operations from
@@ -242,27 +242,27 @@ Panel Config:
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F8 MFMA operations achievable on the specific accelerator. It is supported on
      AMD Instinct MI300 series and later only.
-    MFMA FLOPs (BF16): |-
+    MFMA FLOPs (BF16): >-
      The total number of 16-bit brain floating point MFMA operations executed
      per second. Note: this does not include any 16-bit brain floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      BF16 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F16): |-
+    MFMA FLOPs (F16): >-
      The total number of 16-bit floating point MFMA operations executed per
      second. Note: this does not include any 16-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F16 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F32): |-
+    MFMA FLOPs (F32): >-
      The total number of 32-bit floating point MFMA operations executed per
      second. Note: this does not include any 32-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F32 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F64): |-
+    MFMA FLOPs (F64): >-
      The total number of 64-bit floating point MFMA operations executed per
      second. Note: this does not include any 64-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F64 MFMA operations achievable on the specific accelerator.
-    MFMA IOPs (Int8): |-
+    MFMA IOPs (Int8): >-
      The total number of 8-bit integer MFMA operations executed per second.
      Note: this does not include any 8-bit integer operations from VALU instructions.
      This is also presented as a percent of the peak theoretical INT8 MFMA operations
@@ -295,7 +295,7 @@ Panel Config:
    IPC: The ratio of the total number of instructions executed on the CU over the
      total active CU cycles. This is also presented as a percent of the peak theoretical
      bandwidth achievable on the specific accelerator.
-    Wavefront Occupancy: |-
+    Wavefront Occupancy: >-
      The time-averaged number of wavefronts resident on the accelerator over
      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms). This is also presented as a percent of the peak theoretical
@@ -328,7 +328,7 @@ Panel Config:
      if only a single value is requested in a cache line, the data movement will
      still be counted as a full cache line. This is also presented as a percent of
      the peak theoretical bandwidth achievable on the specific accelerator.
-    L2-Fabric Read BW: |-
+    L2-Fabric Read BW: >-
      The number of bytes read by the L2 over the Infinity Fabric\u2122 interface
      per unit time. This is also presented as a percent of the peak theoretical
      bandwidth achievable on the specific accelerator.
@@ -162,15 +162,15 @@ Panel Config:
    Active CUs: Total number of active compute units (CUs) on the accelerator during
      the kernel execution.
    Num CUs: Total number of compute units (CUs) on the accelerator.
-    VGPR: |-
+    VGPR: >-
      The number of architected vector general-purpose registers allocated
      for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
      requested by the compiler due to allocation granularity.
-    SGPR: |-
+    SGPR: >-
      The number of scalar general-purpose registers allocated for the kernel,
      see SALU. Note: this may not exactly match the number of SGPRs requested by
      the compiler due to allocation granularity.
-    LDS Allocation: |-
+    LDS Allocation: >-
      The number of bytes of LDS memory (or, shared memory) allocated for
      this kernel. Note: This may also be larger than what was requested at compile
      time due to both allocation granularity and dynamic per-dispatch LDS allocations.
@@ -252,7 +252,7 @@ Panel Config:
      or data (atomic with return value) was returned to the L2.
    HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
      of data from the accelerator's local HBM, per normalization unit.
-    HBM Wr: |-
+    HBM Wr: >-
      The total number of L2 requests to Infinity Fabric to write or atomically
      update 32B or 64B of data in the accelerator's local HBM, per normalization
      unit.
@@ -140,17 +140,17 @@ Panel Config:
            * 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9
          unit: GFLOP/s
  metrics_description:
-    VALU FLOPs (F16): |-
+    VALU FLOPs (F16): >-
      The total 16-bit floating-point operations executed per second on the VALU.
      This is presented with the value of the peak empirical F16 FLOPs achievable
      on the specific accelerator. Note: this does not include any F16 operations
      from MFMA instructions.
-    VALU FLOPs (F32): |-
+    VALU FLOPs (F32): >-
      The total 32-bit floating-point operations executed per second on the VALU.
      This is presented with the value of the peak empirical F32 FLOPs achievable
      on the specific accelerator. Note: this does not include any F32 operations
      from MFMA instructions.
-    VALU FLOPs (F64): |-
+    VALU FLOPs (F64): >-
      The total 64-bit floating-point operations executed per second on the VALU.
      This is presented with the value of the peak empirical F64 FLOPs achievable
      on the specific accelerator. Note: this does not include any F64 operations
@@ -160,33 +160,33 @@ Panel Config:
      from VALU instructions. The peak empirically measured F8 MFMA operations achievable
      on the specific accelerator is displayed alongside for comparison. It is supported
      on AMD Instinct MI300 series and later only.
-    MFMA FLOPs (BF16): |-
+    MFMA FLOPs (BF16): >-
      The total number of 16-bit brain floating point MFMA operations executed
      per second. Note: this does not include any 16-bit brain floating point
      operations from VALU instructions. The peak empirically measured BF16 MFMA
      operations achievable on the specific accelerator is displayed alongside
      for comparison.
-    MFMA FLOPs (F16): |-
+    MFMA FLOPs (F16): >-
      The total number of 16-bit floating point MFMA operations executed per
      second. Note: this does not include any 16-bit floating point operations from
      VALU instructions. The peak empirically measured F16 MFMA operations
      achievable on the specific accelerator is displayed alongside for comparison.
-    MFMA FLOPs (F32): |-
+    MFMA FLOPs (F32): >-
      The total number of 32-bit floating point MFMA operations executed per
      second. Note: this does not include any 32-bit floating point operations from
      VALU instructions. The peak empirically measured F32 MFMA operations
      achievable on the specific accelerator is displayed alongside for comparison.
-    MFMA FLOPs (F64): |-
+    MFMA FLOPs (F64): >-
      The total number of 64-bit floating point MFMA operations executed per
      second. Note: this does not include any 64-bit floating point operations from
      VALU instructions. The peak empirically measured F64 MFMA operations
      achievable on the specific accelerator is displayed alongside for comparison.
-    MFMA IOPs (Int8): |-
+    MFMA IOPs (Int8): >-
      The total number of 8-bit integer MFMA operations executed per second.
      Note: this does not include any 8-bit integer operations from VALU instructions.
      The peak empirically measured INT8 MFMA operations achievable on the specific
      accelerator is displayed alongside for comparison.
-    HBM Bandwidth: |-
+    HBM Bandwidth: >-
      The total number of bytes read from and written to High-Bandwidth
      Memory (HBM) per second. The peak empirically measured bandwidth achievable
      on the specific accelerator is displayed alongside for comparison.
@@ -207,22 +207,22 @@ Panel Config:
      from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
      example for more detail). The peak empirically measured LDS bandwidth achievable
      on the specific accelerator is displayed alongside for comparison.
-    AI L1: |-
+    AI L1: >-
      The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
      of total floating-point operations (FLOPs) to total bytes transferred between
      the L1 cache and the processing units. This value is used as the x-coordinate
      for the L1 roofline.
-    AI L2: |-
+    AI L2: >-
      The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
      of total floating-point operations (FLOPs) to total bytes transferred between
      the L2 cache and the L1 cache. This value is used as the x-coordinate for
      the L2 roofline.
-    AI HBM: |-
+    AI HBM: >-
      The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
      It is the ratio of total floating-point operations (FLOPs) to total bytes
      transferred between HBM and the L2 cache. This value is used as the x-coordinate
      for the HBM roofline.
-    Performance (GFLOPs): |-
+    Performance (GFLOPs): >-
      The overall achieved performance, measured in GigaFLOPs
      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
      operations divided by the total execution time. This value is used as the y-coordinate
@@ -141,6 +141,6 @@ Panel Config:
      the CPC-L2 interface was active doing any work.
    CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
      translation
-    CPC-UTCL2 Utilization: |-
+    CPC-UTCL2 Utilization: >-
      Percent of total cycles counted by the CPC's L2 address translation
      interface where the CPC was busy doing address translation work.
@@ -168,7 +168,7 @@ Panel Config:
      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
      within the workgroup manager rather than a lack of a CU or SIMD with sufficient
      resources.
-    Not-scheduled Rate (Scheduler-Pipe): |-
+    Not-scheduled Rate (Scheduler-Pipe): >-
      The percent of total scheduler-pipe cycles in the kernel where a workgroup
      could not be scheduled to a CU due to a bottleneck within the scheduler-pipes
      rather than a lack of a CU or SIMD with sufficient resources.
@@ -121,26 +121,26 @@ Panel Config:
    Workgroup Size: The total number of work-items (or, threads) in each workgroup
      (or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
      to the total block size.
-    Total Wavefronts: |-
+    Total Wavefronts: >-
      The total number of wavefronts launched as part of the kernel dispatch.
      On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront
      size is always 64 work-items. Thus, the total number of wavefronts should
      be equivalent to the ceiling of grid size divided by 64.
    Saved Wavefronts: The total number of wavefronts saved at a context-save.
    Restored Wavefronts: The total number of wavefronts restored from a context-save.
-    VGPRs: |-
+    VGPRs: >-
      The number of architected vector general-purpose registers allocated
      for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
      requested by the compiler due to allocation granularity.
-    AGPRs: |-
+    AGPRs: >-
      The number of accumulation vector general-purpose registers allocated
      for the kernel, see AGPRs. Note: this may not exactly match the number of
      AGPRs requested by the compiler due to allocation granularity.
-    SGPRs: |-
+    SGPRs: >-
      The number of scalar general-purpose registers allocated for the kernel,
      see SALU. Note: this may not exactly match the number of SGPRs requested by
      the compiler due to allocation granularity.
-    LDS Allocation: |-
+    LDS Allocation: >-
      The number of bytes of LDS memory (or, shared memory) allocated for
      this kernel. Note: This may also be larger than what was requested at compile
      time due to both allocation granularity and dynamic per-dispatch LDS allocations.
@@ -173,7 +173,7 @@ Panel Config:
      rather than identification of a precise limiter. The sum of this metric, Issue
      Wait Cycles and Active Wait Cycles should be equal to the total Wave Cycles
      metric.
-    Wavefront Occupancy: |-
+    Wavefront Occupancy: >-
      The time-averaged number of wavefronts resident on the accelerator over
      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms).
@@ -273,7 +273,7 @@ Panel Config:
      floating-point operands issued to the VALU per normalization unit.
    F64-Trans: The total number of transcendental instructions (such as sqrt) operating
      on 64-bit floating-point operands issued to the VALU per normalization unit.
-    Conversion: |-
+    Conversion: >-
      The total number of type conversion instructions (such as converting
      data to or from F32\u2194F64) issued to the VALU per normalization unit.
    Global/Generic Instr: The total number of global & generic memory instructions
@@ -251,37 +251,37 @@ Panel Config:
          max: MAX(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
          unit: (OPs + $normUnit)
  metrics_description:
-    VALU FLOPs: |-
+    VALU FLOPs: >-
      The total floating-point operations executed per second on the VALU.
      This is also presented as a percent of the peak theoretical FLOPs achievable
      on the specific accelerator. Note: this does not include any floating-point
      operations from MFMA instructions.
-    VALU IOPs: |-
+    VALU IOPs: >-
      The total integer operations executed per second on the VALU. This is
      also presented as a percent of the peak theoretical IOPs achievable on the
      specific accelerator. Note: this does not include any integer operations from
      MFMA instructions.
-    MFMA FLOPs (BF16): |-
+    MFMA FLOPs (BF16): >-
      The total number of 16-bit brain floating point MFMA operations executed
      per second. Note: this does not include any 16-bit brain floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      BF16 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F16): |-
+    MFMA FLOPs (F16): >-
      The total number of 16-bit floating point MFMA operations executed per
      second. Note: this does not include any 16-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F16 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F32): |-
+    MFMA FLOPs (F32): >-
      The total number of 32-bit floating point MFMA operations executed per
      second. Note: this does not include any 32-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F32 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F64): |-
+    MFMA FLOPs (F64): >-
      The total number of 64-bit floating point MFMA operations executed per
      second. Note: this does not include any 64-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F64 MFMA operations achievable on the specific accelerator.
-    MFMA IOPs (INT8): |-
+    MFMA IOPs (INT8): >-
      The total number of 8-bit integer MFMA operations executed per second.
      Note: this does not include any 8-bit integer operations from VALU instructions.
      This is also presented as a percent of the peak theoretical INT8 MFMA operations
@@ -140,7 +140,7 @@ Panel Config:
      unit.
    Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
      stalls from non-dword aligned addresses per normalization unit.
-    Mem Violations: |-
+    Mem Violations: >-
      The total number of out-of-bounds accesses made to the LDS, per normalization
      unit. This is unused and expected to be zero in most configurations for
      modern CDNA\u2122 accelerators.
@@ -92,7 +92,7 @@ Panel Config:
    Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
      loaded line the cache. Calculated as the ratio of the number of L1I requests
      that hit over the number of all L1I requests.
-    L1I-L2 Bandwidth Utilization: |-
+    L1I-L2 Bandwidth Utilization: >-
      The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
      achieved. Calculated as the ratio of the total number of requests from the
      L1I to the L2 cache over the total L1I-L2 interface cycles.
@@ -154,7 +154,7 @@ Panel Config:
    sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
      bandwidth acheived. Calculated as total number of bytes read from, written to,
      or atomically updated across the sL1D - L2 interface.
-    sL1D-L2 BW: |-
+    sL1D-L2 BW: >-
      The total number of bytes read from, written to, or atomically updated
      across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D
      writes and atomics are typically unused on current CDNA accelerators, so
@@ -164,7 +164,7 @@ Panel Config:
      unit.
    Hits: The total number of sL1D requests that hit on a previously loaded cache
      line, per normalization unit.
-    Misses - Non Duplicated: |-
+    Misses - Non Duplicated: >-
      The total number of sL1D requests that missed on a cache line that was
      not already pending due to another request, per normalization unit.
    Misses- Duplicated: The total number of sL1D requests that missed on a cache line
@@ -187,6 +187,6 @@ Panel Config:
      unit.
    Write Req: The total number of write requests from sL1D to the L2, per normalization
      unit. Typically unused on current CDNA accelerators.
-    Stall Cycles: |-
+    Stall Cycles: >-
      The total number of cycles the sL1D\u2194L2 interface was stalled, per
      normalization unit.
@@ -398,7 +398,7 @@ Panel Config:
      per normalization unit.
    Translation Misses: The total number of translation requests that missed in the
      UTCL1 due to translation not being present in the cache, per normalization unit.
-    Permission Misses: |-
+    Permission Misses: >-
      The total number of translation requests that missed in the UTCL1 due
      to a permission error, per normalization unit. This is unused and expected
      to be zero in most configurations for modern CDNA\u2122 accelerators.
@@ -233,12 +233,12 @@ Panel Config:
          pop: None
          coll_level: SQ_IFETCH_LEVEL
  metrics_description:
-    VALU FLOPs: |-
+    VALU FLOPs: >-
      The total floating-point operations executed per second on the VALU.
      This is also presented as a percent of the peak theoretical FLOPs achievable
      on the specific accelerator. Note: this does not include any floating-point
      operations from MFMA instructions.
-    VALU IOPs: |-
+    VALU IOPs: >-
      The total integer operations executed per second on the VALU. This is
      also presented as a percent of the peak theoretical IOPs achievable on the
      specific accelerator. Note: this does not include any integer operations from
@@ -248,27 +248,27 @@ Panel Config:
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F8 MFMA operations achievable on the specific accelerator. It is supported on
      AMD Instinct MI300 series and later only.
-    MFMA FLOPs (BF16): |-
+    MFMA FLOPs (BF16): >-
      The total number of 16-bit brain floating point MFMA operations executed
      per second. Note: this does not include any 16-bit brain floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      BF16 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F16): |-
+    MFMA FLOPs (F16): >-
      The total number of 16-bit floating point MFMA operations executed per
      second. Note: this does not include any 16-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F16 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F32): |-
+    MFMA FLOPs (F32): >-
      The total number of 32-bit floating point MFMA operations executed per
      second. Note: this does not include any 32-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F32 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F64): |-
+    MFMA FLOPs (F64): >-
      The total number of 64-bit floating point MFMA operations executed per
      second. Note: this does not include any 64-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F64 MFMA operations achievable on the specific accelerator.
-    MFMA IOPs (Int8): |-
+    MFMA IOPs (Int8): >-
      The total number of 8-bit integer MFMA operations executed per second.
      Note: this does not include any 8-bit integer operations from VALU instructions.
      This is also presented as a percent of the peak theoretical INT8 MFMA operations
@@ -301,7 +301,7 @@ Panel Config:
    IPC: The ratio of the total number of instructions executed on the CU over the
      total active CU cycles. This is also presented as a percent of the peak theoretical
      bandwidth achievable on the specific accelerator.
-    Wavefront Occupancy: |-
+    Wavefront Occupancy: >-
      The time-averaged number of wavefronts resident on the accelerator over
      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms). This is also presented as a percent of the peak theoretical
@@ -334,7 +334,7 @@ Panel Config:
      if only a single value is requested in a cache line, the data movement will
      still be counted as a full cache line. This is also presented as a percent of
      the peak theoretical bandwidth achievable on the specific accelerator.
-    L2-Fabric Read BW: |-
+    L2-Fabric Read BW: >-
      The number of bytes read by the L2 over the Infinity Fabric\u2122 interface
      per unit time. This is also presented as a percent of the peak theoretical
      bandwidth achievable on the specific accelerator.
@@ -172,15 +172,15 @@ Panel Config:
    Active CUs: Total number of active compute units (CUs) on the accelerator during
      the kernel execution.
    Num CUs: Total number of compute units (CUs) on the accelerator.
-    VGPR: |-
+    VGPR: >-
      The number of architected vector general-purpose registers allocated
      for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
      requested by the compiler due to allocation granularity.
-    SGPR: |-
+    SGPR: >-
      The number of scalar general-purpose registers allocated for the kernel,
      see SALU. Note: this may not exactly match the number of SGPRs requested by
      the compiler due to allocation granularity.
-    LDS Allocation: |-
+    LDS Allocation: >-
      The number of bytes of LDS memory (or, shared memory) allocated for
      this kernel. Note: This may also be larger than what was requested at compile
      time due to both allocation granularity and dynamic per-dispatch LDS allocations.
@@ -268,7 +268,7 @@ Panel Config:
      or data (atomic with return value) was returned to the L2.
    HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
      of data from the accelerator's local HBM, per normalization unit.
-    HBM Wr: |-
+    HBM Wr: >-
      The total number of L2 requests to Infinity Fabric to write or atomically
      update 32B or 64B of data in the accelerator's local HBM, per normalization
      unit.
@@ -148,17 +148,17 @@ Panel Config:
            Start_Timestamp) / 1e9) ) / 1e9
          unit: GFLOP/s
  metrics_description:
-    VALU FLOPs (F16): |-
+    VALU FLOPs (F16): >-
      The total 16-bit floating-point operations executed per second on the VALU.
      This is presented with the value of the peak empirical F16 FLOPs achievable
      on the specific accelerator. Note: this does not include any F16 operations
      from MFMA instructions.
-    VALU FLOPs (F32): |-
+    VALU FLOPs (F32): >-
      The total 32-bit floating-point operations executed per second on the VALU.
      This is presented with the value of the peak empirical F32 FLOPs achievable
      on the specific accelerator. Note: this does not include any F32 operations
      from MFMA instructions.
-    VALU FLOPs (F64): |-
+    VALU FLOPs (F64): >-
      The total 64-bit floating-point operations executed per second on the VALU.
      This is presented with the value of the peak empirical F64 FLOPs achievable
      on the specific accelerator. Note: this does not include any F64 operations
@@ -168,39 +168,39 @@ Panel Config:
      from VALU instructions. The peak empirically measured F8 MFMA operations achievable
      on the specific accelerator is displayed alongside for comparison. It is supported
      on AMD Instinct MI300 series and later only.
-    MFMA FLOPs (BF16): |-
+    MFMA FLOPs (BF16): >-
      The total number of 16-bit brain floating point MFMA operations executed
      per second. Note: this does not include any 16-bit brain floating point
      operations from VALU instructions. The peak empirically measured BF16 MFMA
      operations achievable on the specific accelerator is displayed alongside
      for comparison.
-    MFMA FLOPs (F16): |-
+    MFMA FLOPs (F16): >-
      The total number of 16-bit floating point MFMA operations executed per
      second. Note: this does not include any 16-bit floating point operations from
      VALU instructions. The peak empirically measured F16 MFMA operations
      achievable on the specific accelerator is displayed alongside for comparison.
-    MFMA FLOPs (F32): |-
+    MFMA FLOPs (F32): >-
      The total number of 32-bit floating point MFMA operations executed per
      second. Note: this does not include any 32-bit floating point operations from
      VALU instructions. The peak empirically measured F32 MFMA operations
      achievable on the specific accelerator is displayed alongside for comparison.
-    MFMA FLOPs (F64): |-
+    MFMA FLOPs (F64): >-
      The total number of 64-bit floating point MFMA operations executed per
      second. Note: this does not include any 64-bit floating point operations from
      VALU instructions. The peak empirically measured F64 MFMA operations
      achievable on the specific accelerator is displayed alongside for comparison.
-    MFMA FLOPs (F6F4): |-
+    MFMA FLOPs (F6F4): >-
      The total number of 4-bit and 6-bit floating point MFMA operations executed
      per second. Note: this does not include any floating point operations from
      VALU instructions. The peak empirically measured F6F4 MFMA operations
      achievable on the specific accelerator is displayed alongside for comparison.
      It is supported on AMD Instinct MI350 series (gfx950) and later only.
-    MFMA IOPs (Int8): |-
+    MFMA IOPs (Int8): >-
      The total number of 8-bit integer MFMA operations executed per second.
      Note: this does not include any 8-bit integer operations from VALU instructions.
      The peak empirically measured INT8 MFMA operations achievable on the specific
      accelerator is displayed alongside for comparison.
-    HBM Bandwidth: |-
+    HBM Bandwidth: >-
      The total number of bytes read from and written to High-Bandwidth
      Memory (HBM) per second. The peak empirically measured bandwidth achievable
      on the specific accelerator is displayed alongside for comparison.
@@ -221,22 +221,22 @@ Panel Config:
      from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
      example for more detail). The peak empirically measured LDS bandwidth achievable
      on the specific accelerator is displayed alongside for comparison.
-    AI L1: |-
+    AI L1: >-
      The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
      of total floating-point operations (FLOPs) to total bytes transferred between
      the L1 cache and the processing units. This value is used as the x-coordinate
      for the L1 roofline.
-    AI L2: |-
+    AI L2: >-
      The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
      of total floating-point operations (FLOPs) to total bytes transferred between
      the L2 cache and the L1 cache. This value is used as the x-coordinate for
      the L2 roofline.
-    AI HBM: |-
+    AI HBM: >-
      The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
      It is the ratio of total floating-point operations (FLOPs) to total bytes
      transferred between HBM and the L2 cache. This value is used as the x-coordinate
      for the HBM roofline.
-    Performance (GFLOPs): |-
+    Performance (GFLOPs): >-
      The overall achieved performance, measured in GigaFLOPs
      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
      operations divided by the total execution time. This value is used as the y-coordinate
@@ -162,6 +162,6 @@ Panel Config:
      the CPC-L2 interface was active doing any work.
    CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
      translation
-    CPC-UTCL2 Utilization: |-
+    CPC-UTCL2 Utilization: >-
      Percent of total cycles counted by the CPC's L2 address translation
      interface where the CPC was busy doing address translation work.
@@ -204,7 +204,7 @@ Panel Config:
      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
      within the workgroup manager rather than a lack of a CU or SIMD with sufficient
      resources.
-    Not-scheduled Rate (Scheduler-Pipe): |-
+    Not-scheduled Rate (Scheduler-Pipe): >-
      The percent of total scheduler-pipe cycles in the kernel where a workgroup
      could not be scheduled to a CU due to a bottleneck within the scheduler-pipes
      rather than a lack of a CU or SIMD with sufficient resources.
@@ -121,26 +121,26 @@ Panel Config:
    Workgroup Size: The total number of work-items (or, threads) in each workgroup
      (or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
      to the total block size.
-    Total Wavefronts: |-
+    Total Wavefronts: >-
      The total number of wavefronts launched as part of the kernel dispatch.
      On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront
      size is always 64 work-items. Thus, the total number of wavefronts should
      be equivalent to the ceiling of grid size divided by 64.
    Saved Wavefronts: The total number of wavefronts saved at a context-save.
    Restored Wavefronts: The total number of wavefronts restored from a context-save.
-    VGPRs: |-
+    VGPRs: >-
      The number of architected vector general-purpose registers allocated
      for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
      requested by the compiler due to allocation granularity.
-    AGPRs: |-
+    AGPRs: >-
      The number of accumulation vector general-purpose registers allocated
      for the kernel, see AGPRs. Note: this may not exactly match the number of
      AGPRs requested by the compiler due to allocation granularity.
-    SGPRs: |-
+    SGPRs: >-
      The number of scalar general-purpose registers allocated for the kernel,
      see SALU. Note: this may not exactly match the number of SGPRs requested by
      the compiler due to allocation granularity.
-    LDS Allocation: |-
+    LDS Allocation: >-
      The number of bytes of LDS memory (or, shared memory) allocated for
      this kernel. Note: This may also be larger than what was requested at compile
      time due to both allocation granularity and dynamic per-dispatch LDS allocations.
@@ -173,7 +173,7 @@ Panel Config:
      rather than identification of a precise limiter. The sum of this metric, Issue
      Wait Cycles and Active Wait Cycles should be equal to the total Wave Cycles
      metric.
-    Wavefront Occupancy: |-
+    Wavefront Occupancy: >-
      The time-averaged number of wavefronts resident on the accelerator over
      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms).
@@ -283,7 +283,7 @@ Panel Config:
      floating-point operands issued to the VALU per normalization unit.
    F64-Trans: The total number of transcendental instructions (such as sqrt) operating
      on 64-bit floating-point operands issued to the VALU per normalization unit.
-    Conversion: |-
+    Conversion: >-
      The total number of type conversion instructions (such as converting
      data to or from F32\u2194F64) issued to the VALU per normalization unit.
    Global/Generic Instr: The total number of global & generic memory instructions
@@ -267,37 +267,37 @@ Panel Config:
          max: MAX(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
          unit: (OPs + $normUnit)
  metrics_description:
-    VALU FLOPs: |-
+    VALU FLOPs: >-
      The total floating-point operations executed per second on the VALU.
      This is also presented as a percent of the peak theoretical FLOPs achievable
      on the specific accelerator. Note: this does not include any floating-point
      operations from MFMA instructions.
-    VALU IOPs: |-
+    VALU IOPs: >-
      The total integer operations executed per second on the VALU. This is
      also presented as a percent of the peak theoretical IOPs achievable on the
      specific accelerator. Note: this does not include any integer operations from
      MFMA instructions.
-    MFMA FLOPs (BF16): |-
+    MFMA FLOPs (BF16): >-
      The total number of 16-bit brain floating point MFMA operations executed
      per second. Note: this does not include any 16-bit brain floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      BF16 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F16): |-
+    MFMA FLOPs (F16): >-
      The total number of 16-bit floating point MFMA operations executed per
      second. Note: this does not include any 16-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F16 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F32): |-
+    MFMA FLOPs (F32): >-
      The total number of 32-bit floating point MFMA operations executed per
      second. Note: this does not include any 32-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F32 MFMA operations achievable on the specific accelerator.
-    MFMA FLOPs (F64): |-
+    MFMA FLOPs (F64): >-
      The total number of 64-bit floating point MFMA operations executed per
      second. Note: this does not include any 64-bit floating point operations from
      VALU instructions. This is also presented as a percent of the peak theoretical
      F64 MFMA operations achievable on the specific accelerator.
-    MFMA IOPs (INT8): |-
+    MFMA IOPs (INT8): >-
      The total number of 8-bit integer MFMA operations executed per second.
      Note: this does not include any 8-bit integer operations from VALU instructions.
      This is also presented as a percent of the peak theoretical INT8 MFMA operations
@@ -180,7 +180,7 @@ Panel Config:
      unit.
    Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
      stalls from non-dword aligned addresses per normalization unit.
-    Mem Violations: |-
+    Mem Violations: >-
      The total number of out-of-bounds accesses made to the LDS, per normalization
      unit. This is unused and expected to be zero in most configurations for
      modern CDNA\u2122 accelerators.
@@ -92,7 +92,7 @@ Panel Config:
    Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
      loaded line the cache. Calculated as the ratio of the number of L1I requests
      that hit over the number of all L1I requests.
-    L1I-L2 Bandwidth Utilization: |-
+    L1I-L2 Bandwidth Utilization: >-
      The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
      achieved. Calculated as the ratio of the total number of requests from the
      L1I to the L2 cache over the total L1I-L2 interface cycles.
@@ -154,7 +154,7 @@ Panel Config:
    sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
      bandwidth acheived. Calculated as total number of bytes read from, written to,
      or atomically updated across the sL1D - L2 interface.
-    sL1D-L2 BW: |-
+    sL1D-L2 BW: >-
      The total number of bytes read from, written to, or atomically updated
      across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D
      writes and atomics are typically unused on current CDNA accelerators, so
@@ -164,7 +164,7 @@ Panel Config:
      unit.
    Hits: The total number of sL1D requests that hit on a previously loaded cache
      line, per normalization unit.
-    Misses - Non Duplicated: |-
+    Misses - Non Duplicated: >-
      The total number of sL1D requests that missed on a cache line that was
      not already pending due to another request, per normalization unit.
    Misses- Duplicated: The total number of sL1D requests that missed on a cache line
@@ -187,6 +187,6 @@ Panel Config:
      unit.
    Write Req: The total number of write requests from sL1D to the L2, per normalization
      unit. Typically unused on current CDNA accelerators.
-    Stall Cycles: |-
+    Stall Cycles: >-
      The total number of cycles the sL1D\u2194L2 interface was stalled, per
      normalization unit.
@@ -501,7 +501,7 @@ Panel Config:
      per normalization unit.
    Translation Misses: The total number of translation requests that missed in the
      UTCL1 due to translation not being present in the cache, per normalization unit.
-    Permission Misses: |-
+    Permission Misses: >-
      The total number of translation requests that missed in the UTCL1 due
      to a permission error, per normalization unit. This is unused and expected
      to be zero in most configurations for modern CDNA\u2122 accelerators.
@@ -706,7 +706,7 @@ Panel Config:
      requests are only considered atomic by Infinity Fabric if they are targeted
      at non-write-cacheable memory, such as fine-grained memory allocations or uncached
      memory allocations on the MI2XX.
-    Read Stall: |-
+    Read Stall: >-
      The ratio of the total number of cycles the L2-Fabric interface was
      stalled on a read request to any destination (local HBM, remote PCIe\xAE
      connected accelerator or CPU, or remote Infinity Fabric connected accelerator
@@ -1,116 +1,2 @@
-# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
-src/rocprof_compute_soc/analysis_configs/gfx908/0000_top_stats.yaml: ad7818c680acb0d4e3cb624e0f14f79d44fa7efe14531c5643f47ac96266c91d
-src/rocprof_compute_soc/analysis_configs/gfx90a/0000_top_stats.yaml: ad7818c680acb0d4e3cb624e0f14f79d44fa7efe14531c5643f47ac96266c91d
-src/rocprof_compute_soc/analysis_configs/gfx940/0000_top_stats.yaml: ad7818c680acb0d4e3cb624e0f14f79d44fa7efe14531c5643f47ac96266c91d
-src/rocprof_compute_soc/analysis_configs/gfx941/0000_top_stats.yaml: ad7818c680acb0d4e3cb624e0f14f79d44fa7efe14531c5643f47ac96266c91d
-src/rocprof_compute_soc/analysis_configs/gfx942/0000_top_stats.yaml: ad7818c680acb0d4e3cb624e0f14f79d44fa7efe14531c5643f47ac96266c91d
-src/rocprof_compute_soc/analysis_configs/gfx950/0000_top_stats.yaml: ad7818c680acb0d4e3cb624e0f14f79d44fa7efe14531c5643f47ac96266c91d
-src/rocprof_compute_soc/analysis_configs/gfx908/0100_system_info.yaml: d95eea137c439cc2aa4ac5273f06ac6a05037a74550bc23a095162ee366d39cb
-src/rocprof_compute_soc/analysis_configs/gfx90a/0100_system_info.yaml: d95eea137c439cc2aa4ac5273f06ac6a05037a74550bc23a095162ee366d39cb
-src/rocprof_compute_soc/analysis_configs/gfx940/0100_system_info.yaml: d95eea137c439cc2aa4ac5273f06ac6a05037a74550bc23a095162ee366d39cb
-src/rocprof_compute_soc/analysis_configs/gfx941/0100_system_info.yaml: d95eea137c439cc2aa4ac5273f06ac6a05037a74550bc23a095162ee366d39cb
-src/rocprof_compute_soc/analysis_configs/gfx942/0100_system_info.yaml: d95eea137c439cc2aa4ac5273f06ac6a05037a74550bc23a095162ee366d39cb
-src/rocprof_compute_soc/analysis_configs/gfx950/0100_system_info.yaml: d95eea137c439cc2aa4ac5273f06ac6a05037a74550bc23a095162ee366d39cb
-src/rocprof_compute_soc/analysis_configs/gfx908/0200_system_speed_of_light.yaml: aa60b7a75e46196195675a1c8d6aa65211483ace8dfe346ed0228056586bc8a5
-src/rocprof_compute_soc/analysis_configs/gfx90a/0200_system_speed_of_light.yaml: 54d0ef58f8222463516984d3b9153806f5185de9e719d1903537af4c8344a4f4
-src/rocprof_compute_soc/analysis_configs/gfx940/0200_system_speed_of_light.yaml: a6a5d78d76eb39471249c4c55ccea2e8084a5136c01d29aaeb87d308cce05d2e
-src/rocprof_compute_soc/analysis_configs/gfx941/0200_system_speed_of_light.yaml: 352d4702fbebd8550883b777b875893a8404a7909d83c74cdd50c1b713452c81
-src/rocprof_compute_soc/analysis_configs/gfx942/0200_system_speed_of_light.yaml: a6a5d78d76eb39471249c4c55ccea2e8084a5136c01d29aaeb87d308cce05d2e
-src/rocprof_compute_soc/analysis_configs/gfx950/0200_system_speed_of_light.yaml: 1a164dfbb551e4b0a8a55a843d776738d90406cdbe2930e0f474b77a075a7353
-src/rocprof_compute_soc/analysis_configs/gfx908/0300_memory_chart.yaml: ff5fd164694f454a95ccd52c8c0bfa20aebfa476908cab2ac03215fb33e48598
-src/rocprof_compute_soc/analysis_configs/gfx90a/0300_memory_chart.yaml: 332c1965f462e75a479ddf3270294e1cf723701eb08b60c6cea550eb3bc192e7
-src/rocprof_compute_soc/analysis_configs/gfx940/0300_memory_chart.yaml: 92dc15222a707fff79ce2084172ae2068465bfe064b89538ca7e83359422dfc8
-src/rocprof_compute_soc/analysis_configs/gfx941/0300_memory_chart.yaml: 92dc15222a707fff79ce2084172ae2068465bfe064b89538ca7e83359422dfc8
-src/rocprof_compute_soc/analysis_configs/gfx942/0300_memory_chart.yaml: 92dc15222a707fff79ce2084172ae2068465bfe064b89538ca7e83359422dfc8
-src/rocprof_compute_soc/analysis_configs/gfx950/0300_memory_chart.yaml: d3a2e085061068ff8cff0b80f6944dc866ec3e748cf1e4c0cfcd76e1e14d21f8
-src/rocprof_compute_soc/analysis_configs/gfx908/0400_roofline.yaml: e91988af6d99a03e2a19593155447f79abe64dc128a83a170a5037ab466b238c
-src/rocprof_compute_soc/analysis_configs/gfx90a/0400_roofline.yaml: 0807c87d20faed19f2ef9470e9277715f2287e687aa831a328dcab4915a38812
-src/rocprof_compute_soc/analysis_configs/gfx940/0400_roofline.yaml: f5f35d1ae9a35fe83bcdf572aa788401c14cc6718761c4cf8e4dddcf249c3548
-src/rocprof_compute_soc/analysis_configs/gfx941/0400_roofline.yaml: 760ecef9947fa31d3a0fb5c45d653060d06213d8d9f216c19cbb1b1ce29942b6
-src/rocprof_compute_soc/analysis_configs/gfx942/0400_roofline.yaml: e037ce1a2cf8ba08e2317e322b56954caace6ec2427a966acbabf2135cd89855
-src/rocprof_compute_soc/analysis_configs/gfx950/0400_roofline.yaml: f53b2a92b3ac051290eff9b1f63343c30e6cd223b9cbf9d30a93ef4a5ff158b3
-src/rocprof_compute_soc/analysis_configs/gfx908/0500_command_processor_cpc_cpf.yaml: 649bec27b9ccee34c96520c1f6bc0977779a8c4f8a58ee21ff59d61207962966
-src/rocprof_compute_soc/analysis_configs/gfx90a/0500_command_processor_cpc_cpf.yaml: 649bec27b9ccee34c96520c1f6bc0977779a8c4f8a58ee21ff59d61207962966
-src/rocprof_compute_soc/analysis_configs/gfx940/0500_command_processor_cpc_cpf.yaml: 649bec27b9ccee34c96520c1f6bc0977779a8c4f8a58ee21ff59d61207962966
-src/rocprof_compute_soc/analysis_configs/gfx941/0500_command_processor_cpc_cpf.yaml: 649bec27b9ccee34c96520c1f6bc0977779a8c4f8a58ee21ff59d61207962966
-src/rocprof_compute_soc/analysis_configs/gfx942/0500_command_processor_cpc_cpf.yaml: 649bec27b9ccee34c96520c1f6bc0977779a8c4f8a58ee21ff59d61207962966
-src/rocprof_compute_soc/analysis_configs/gfx950/0500_command_processor_cpc_cpf.yaml: 1e4c1bc1158398df8966d24e56b7d434458ce10ade9e13f168887d9a0d9abaef
-src/rocprof_compute_soc/analysis_configs/gfx908/0600_workgroup_manager_spi.yaml: 2bcb7045609e8ff023c9bfa384e63f6a2cc926ff3261f3eab6737f89a899809e
-src/rocprof_compute_soc/analysis_configs/gfx90a/0600_workgroup_manager_spi.yaml: 2bcb7045609e8ff023c9bfa384e63f6a2cc926ff3261f3eab6737f89a899809e
-src/rocprof_compute_soc/analysis_configs/gfx940/0600_workgroup_manager_spi.yaml: 2bcb7045609e8ff023c9bfa384e63f6a2cc926ff3261f3eab6737f89a899809e
-src/rocprof_compute_soc/analysis_configs/gfx941/0600_workgroup_manager_spi.yaml: 2bcb7045609e8ff023c9bfa384e63f6a2cc926ff3261f3eab6737f89a899809e
-src/rocprof_compute_soc/analysis_configs/gfx942/0600_workgroup_manager_spi.yaml: 2bcb7045609e8ff023c9bfa384e63f6a2cc926ff3261f3eab6737f89a899809e
-src/rocprof_compute_soc/analysis_configs/gfx950/0600_workgroup_manager_spi.yaml: 6d97f3ebf3bef1d164255d4c4979e43d7f313f1eda067324aad9be06be98f090
-src/rocprof_compute_soc/analysis_configs/gfx908/0700_wavefront.yaml: da9fb740f9dfafa43c8d0401d22082915d1c04021e07fb8003ac1f31005e282b
-src/rocprof_compute_soc/analysis_configs/gfx90a/0700_wavefront.yaml: da9fb740f9dfafa43c8d0401d22082915d1c04021e07fb8003ac1f31005e282b
-src/rocprof_compute_soc/analysis_configs/gfx940/0700_wavefront.yaml: da9fb740f9dfafa43c8d0401d22082915d1c04021e07fb8003ac1f31005e282b
-src/rocprof_compute_soc/analysis_configs/gfx941/0700_wavefront.yaml: da9fb740f9dfafa43c8d0401d22082915d1c04021e07fb8003ac1f31005e282b
-src/rocprof_compute_soc/analysis_configs/gfx942/0700_wavefront.yaml: da9fb740f9dfafa43c8d0401d22082915d1c04021e07fb8003ac1f31005e282b
-src/rocprof_compute_soc/analysis_configs/gfx950/0700_wavefront.yaml: a6012921ec2e5984861d34ebfca416703b00f3b2cd4cb07541378a285a58b778
-src/rocprof_compute_soc/analysis_configs/gfx908/1000_compute_units_instruction_mix.yaml: 82ef2f27395f2887d1385a33b1d4bcb7cb646ece11146fe1238af2a2fc49108f
-src/rocprof_compute_soc/analysis_configs/gfx90a/1000_compute_units_instruction_mix.yaml: e58c1dff540e06ec3021ae4e852cec5a116e978f00f3e0902b74b5d86f1b88ac
-src/rocprof_compute_soc/analysis_configs/gfx940/1000_compute_units_instruction_mix.yaml: c74ada0b2cd9eda1e1115679267343e7afad9c9638b3a54b3f98193ae9637e09
-src/rocprof_compute_soc/analysis_configs/gfx941/1000_compute_units_instruction_mix.yaml: c74ada0b2cd9eda1e1115679267343e7afad9c9638b3a54b3f98193ae9637e09
-src/rocprof_compute_soc/analysis_configs/gfx942/1000_compute_units_instruction_mix.yaml: c74ada0b2cd9eda1e1115679267343e7afad9c9638b3a54b3f98193ae9637e09
-src/rocprof_compute_soc/analysis_configs/gfx950/1000_compute_units_instruction_mix.yaml: a0fe88305b0972c0702e542558c0d491eac26438577660e58817e988b7b1f0d4
-src/rocprof_compute_soc/analysis_configs/gfx908/1100_compute_units_compute_pipeline.yaml: e815205890d9c815f7f53cdaa64eeef6219bce83054b92fa2be25e240093bdb0
-src/rocprof_compute_soc/analysis_configs/gfx90a/1100_compute_units_compute_pipeline.yaml: b44f500ee07856ec8c59afa1ebb0a204d8b5f3247a43725ba16782484fef6ad1
-src/rocprof_compute_soc/analysis_configs/gfx940/1100_compute_units_compute_pipeline.yaml: e493741974eae65d88afd4fa98b6b3089fb483900b17af2630be18160964d80c
-src/rocprof_compute_soc/analysis_configs/gfx941/1100_compute_units_compute_pipeline.yaml: e493741974eae65d88afd4fa98b6b3089fb483900b17af2630be18160964d80c
-src/rocprof_compute_soc/analysis_configs/gfx942/1100_compute_units_compute_pipeline.yaml: e493741974eae65d88afd4fa98b6b3089fb483900b17af2630be18160964d80c
-src/rocprof_compute_soc/analysis_configs/gfx950/1100_compute_units_compute_pipeline.yaml: 4797cd3052fdb37278aa9a28572287c1a9a7228f05a77ce22c0eb4786cbbd404
-src/rocprof_compute_soc/analysis_configs/gfx908/1200_local_data_share_lds.yaml: 307733f9fee02c620558e2ee4ca3978954f62c3fab26cc98766511b93e96d54c
-src/rocprof_compute_soc/analysis_configs/gfx90a/1200_local_data_share_lds.yaml: 307733f9fee02c620558e2ee4ca3978954f62c3fab26cc98766511b93e96d54c
-src/rocprof_compute_soc/analysis_configs/gfx940/1200_local_data_share_lds.yaml: 307733f9fee02c620558e2ee4ca3978954f62c3fab26cc98766511b93e96d54c
-src/rocprof_compute_soc/analysis_configs/gfx941/1200_local_data_share_lds.yaml: 307733f9fee02c620558e2ee4ca3978954f62c3fab26cc98766511b93e96d54c
-src/rocprof_compute_soc/analysis_configs/gfx942/1200_local_data_share_lds.yaml: 307733f9fee02c620558e2ee4ca3978954f62c3fab26cc98766511b93e96d54c
-src/rocprof_compute_soc/analysis_configs/gfx950/1200_local_data_share_lds.yaml: 35c98741e9b5afd2f7638d2675b22138f5854168e15bc4633112857ed94edbc1
-src/rocprof_compute_soc/analysis_configs/gfx908/1300_instruction_cache.yaml: d2b0a455e9f28d66e6cef701d598072285c58eeebea2d08e1864a8602cdd797a
-src/rocprof_compute_soc/analysis_configs/gfx90a/1300_instruction_cache.yaml: d2b0a455e9f28d66e6cef701d598072285c58eeebea2d08e1864a8602cdd797a
-src/rocprof_compute_soc/analysis_configs/gfx940/1300_instruction_cache.yaml: d2b0a455e9f28d66e6cef701d598072285c58eeebea2d08e1864a8602cdd797a
-src/rocprof_compute_soc/analysis_configs/gfx941/1300_instruction_cache.yaml: d2b0a455e9f28d66e6cef701d598072285c58eeebea2d08e1864a8602cdd797a
-src/rocprof_compute_soc/analysis_configs/gfx942/1300_instruction_cache.yaml: d2b0a455e9f28d66e6cef701d598072285c58eeebea2d08e1864a8602cdd797a
-src/rocprof_compute_soc/analysis_configs/gfx950/1300_instruction_cache.yaml: d2b0a455e9f28d66e6cef701d598072285c58eeebea2d08e1864a8602cdd797a
-src/rocprof_compute_soc/analysis_configs/gfx908/1400_scalar_l1_data_cache.yaml: ccbe9a1309177db8760727f256cd14a7612708833828068dd2bede73ad319d75
-src/rocprof_compute_soc/analysis_configs/gfx90a/1400_scalar_l1_data_cache.yaml: ccbe9a1309177db8760727f256cd14a7612708833828068dd2bede73ad319d75
-src/rocprof_compute_soc/analysis_configs/gfx940/1400_scalar_l1_data_cache.yaml: ccbe9a1309177db8760727f256cd14a7612708833828068dd2bede73ad319d75
-src/rocprof_compute_soc/analysis_configs/gfx941/1400_scalar_l1_data_cache.yaml: ccbe9a1309177db8760727f256cd14a7612708833828068dd2bede73ad319d75
-src/rocprof_compute_soc/analysis_configs/gfx942/1400_scalar_l1_data_cache.yaml: ccbe9a1309177db8760727f256cd14a7612708833828068dd2bede73ad319d75
-src/rocprof_compute_soc/analysis_configs/gfx950/1400_scalar_l1_data_cache.yaml: ccbe9a1309177db8760727f256cd14a7612708833828068dd2bede73ad319d75
-src/rocprof_compute_soc/analysis_configs/gfx908/1500_address_processing_unit_and_data_return_path_ta_td.yaml: b98a800c31da0275704e076e561468dccdaf0b8bff1cc8d74a4e6bf9c7be2973
-src/rocprof_compute_soc/analysis_configs/gfx90a/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 58834a04fc4fb6f9eb648a6b8944f737ce4a8c9d4a6c5f75104d9fd528f520a6
-src/rocprof_compute_soc/analysis_configs/gfx940/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 7f37bcd01557a45aa5ed9009962a9f2499ad924a6a07d7d25a3af97138f360f8
-src/rocprof_compute_soc/analysis_configs/gfx941/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 7f37bcd01557a45aa5ed9009962a9f2499ad924a6a07d7d25a3af97138f360f8
-src/rocprof_compute_soc/analysis_configs/gfx942/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 7f37bcd01557a45aa5ed9009962a9f2499ad924a6a07d7d25a3af97138f360f8
-src/rocprof_compute_soc/analysis_configs/gfx950/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 5c6555a93b01c057f01e0b0cef3169eeb324ca8c256c42f5f9fc0d1ea131486b
-src/rocprof_compute_soc/analysis_configs/gfx908/1600_vector_l1_data_cache.yaml: 4fcb618450366a29c09e428368e1a9afd29a0b80ec3f03a5b3d55a2111bd5704
-src/rocprof_compute_soc/analysis_configs/gfx90a/1600_vector_l1_data_cache.yaml: 4fcb618450366a29c09e428368e1a9afd29a0b80ec3f03a5b3d55a2111bd5704
-src/rocprof_compute_soc/analysis_configs/gfx940/1600_vector_l1_data_cache.yaml: 1c25e20d701aff1ab9276a29cfd5f219b24c621b534aa5b86d1b78d2ae2f300a
-src/rocprof_compute_soc/analysis_configs/gfx941/1600_vector_l1_data_cache.yaml: 1c25e20d701aff1ab9276a29cfd5f219b24c621b534aa5b86d1b78d2ae2f300a
-src/rocprof_compute_soc/analysis_configs/gfx942/1600_vector_l1_data_cache.yaml: 1c25e20d701aff1ab9276a29cfd5f219b24c621b534aa5b86d1b78d2ae2f300a
-src/rocprof_compute_soc/analysis_configs/gfx950/1600_vector_l1_data_cache.yaml: 3cec51c5a848c4f513c4c0a74aa35a5657289148a67179f8db4ea3e55bdb6ac3
-src/rocprof_compute_soc/analysis_configs/gfx908/1700_l2_cache.yaml: e37693ef03caf3d77ae7b91c3c166d033fa0732880cc50a21b8c06a4e79b1f38
-src/rocprof_compute_soc/analysis_configs/gfx90a/1700_l2_cache.yaml: 3314a1e473b1cfc95b742b1a8cfbc47d4602061ca89d7a4ac89ea7cc15908962
-src/rocprof_compute_soc/analysis_configs/gfx940/1700_l2_cache.yaml: cb8922a41dd2088e8e2b0c1e82c7b95fa55304cf90435b217da128234805d77a
-src/rocprof_compute_soc/analysis_configs/gfx941/1700_l2_cache.yaml: 2187f141480a2c57b271ded46255735510de5197441de830cf1efa9345e5566a
-src/rocprof_compute_soc/analysis_configs/gfx942/1700_l2_cache.yaml: 7ce34989a66b8f8750cf1bf76f5cdaf59bf662a7205355f6fe12cace796d4ceb
-src/rocprof_compute_soc/analysis_configs/gfx950/1700_l2_cache.yaml: a3a8db0f555cd1069a61dfc3b89df83e9423d4a0200f1401c7612942ff75152e
-src/rocprof_compute_soc/analysis_configs/gfx908/1800_l2_cache_per_channel.yaml: fd32454bf9f0d3027c77a85ea6be308e92f6815d0ea732c6bafacc8e0f32a25f
-src/rocprof_compute_soc/analysis_configs/gfx90a/1800_l2_cache_per_channel.yaml: 23e9a258ab541d24d29cde2237f9445db695e7a4d17d5974cb4fd5ff9a9869c0
-src/rocprof_compute_soc/analysis_configs/gfx940/1800_l2_cache_per_channel.yaml: eb0823823506bfe0d40931fd69c435baab4979d2dfee158dc33c3651721f9f33
-src/rocprof_compute_soc/analysis_configs/gfx941/1800_l2_cache_per_channel.yaml: eb0823823506bfe0d40931fd69c435baab4979d2dfee158dc33c3651721f9f33
-src/rocprof_compute_soc/analysis_configs/gfx942/1800_l2_cache_per_channel.yaml: eb0823823506bfe0d40931fd69c435baab4979d2dfee158dc33c3651721f9f33
-src/rocprof_compute_soc/analysis_configs/gfx950/1800_l2_cache_per_channel.yaml: b6336ab78a97fb9750e2f925893a5acc4e66e43ac60472c20225e56c440983d7
-src/rocprof_compute_soc/analysis_configs/gfx908/2100_pc_sampling.yaml: efa16d3aadc3363bcc067895a978b82c1e06fa90882dfe33e03315e2c0425d36
-src/rocprof_compute_soc/analysis_configs/gfx90a/2100_pc_sampling.yaml: efa16d3aadc3363bcc067895a978b82c1e06fa90882dfe33e03315e2c0425d36
-src/rocprof_compute_soc/analysis_configs/gfx940/2100_pc_sampling.yaml: efa16d3aadc3363bcc067895a978b82c1e06fa90882dfe33e03315e2c0425d36
-src/rocprof_compute_soc/analysis_configs/gfx941/2100_pc_sampling.yaml: efa16d3aadc3363bcc067895a978b82c1e06fa90882dfe33e03315e2c0425d36
-src/rocprof_compute_soc/analysis_configs/gfx942/2100_pc_sampling.yaml: efa16d3aadc3363bcc067895a978b82c1e06fa90882dfe33e03315e2c0425d36
-src/rocprof_compute_soc/analysis_configs/gfx950/2100_pc_sampling.yaml: efa16d3aadc3363bcc067895a978b82c1e06fa90882dfe33e03315e2c0425d36
-src/rocprof_compute_soc/profile_configs/sets/gfx908_sets.yaml: ee28989e70d0537db8b0f0a4bc5499444b44ff0e73d3e7f2926943be11d0aeda
-src/rocprof_compute_soc/profile_configs/sets/gfx90a_sets.yaml: 9c9533174a3f7bd5c8e09ec998743c7bb2642c4ce3f818b546673be9cafc40a8
-src/rocprof_compute_soc/profile_configs/sets/gfx940_sets.yaml: 44cd2b32b050cafa73d0ead5703b82836edf25a057c21699046b6b8b8918b242
-src/rocprof_compute_soc/profile_configs/sets/gfx941_sets.yaml: 44cd2b32b050cafa73d0ead5703b82836edf25a057c21699046b6b8b8918b242
-src/rocprof_compute_soc/profile_configs/sets/gfx942_sets.yaml: 44cd2b32b050cafa73d0ead5703b82836edf25a057c21699046b6b8b8918b242
-src/rocprof_compute_soc/profile_configs/sets/gfx950_sets.yaml: 238d9dc8a98cfead3fc904885bfe413e5bcb4f1af31e9820cd640388bcd1e1c2
-docs/data/metrics_description.yaml: 12164b43dab4a1088f90763a80ffc8feb38aa82fd7b767edf8f65bd304f22162
+# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from tools/unified_config.yaml. Generated by tools/split_config.py
+{}
@@ -5,19 +5,19 @@
      "files": {
        "0000_top_stats.yaml": "2819d96f5b1c3704f2ac50868a246a7f",
        "0100_system_info.yaml": "cefae2b10db8cf4b0d3a971cff5e82c8",
-        "0200_system_speed_of_light.yaml": "c54676a8a385c02be50fcf09a721bef6",
-        "0300_memory_chart.yaml": "f952fe7de6d86cb22f6f8ce34867905f",
-        "0400_roofline.yaml": "02ca6cf3583f2718ab371bbbfdd8cfef",
-        "0500_command_processor_cpc_cpf.yaml": "93174ba73bf014c143e179719c110db6",
-        "0600_workgroup_manager_spi.yaml": "7364c8431929891d587e6f9b96ddce10",
-        "0700_wavefront.yaml": "5cc88d7743cba8c638491d97725f6778",
+        "0200_system_speed_of_light.yaml": "c4878ac57b7b7b4b5711672cb2f6dffc",
+        "0300_memory_chart.yaml": "221c6d2bb50a4f4177585b9988f88c7b",
+        "0400_roofline.yaml": "bad8d851694ff9a140e29a148a35fa50",
+        "0500_command_processor_cpc_cpf.yaml": "d8f424ec3fcfa4b2fcee2ad5e6456531",
+        "0600_workgroup_manager_spi.yaml": "8b6a89de516bed5821a9849627ad634a",
+        "0700_wavefront.yaml": "84ecb62efa87d87288bb63f7e3871028",
        "1000_compute_units_instruction_mix.yaml": "e96eccdcb0e5d28b292107c0f68ec845",
        "1100_compute_units_compute_pipeline.yaml": "8f61973d0d08bf49895b5dfe32d05c09",
-        "1200_local_data_share_lds.yaml": "97be647681c51e762e774eb91e8283fc",
-        "1300_instruction_cache.yaml": "dcab979ce17e30f2d48fd2734bf08e74",
-        "1400_scalar_l1_data_cache.yaml": "e90a514e7bb597ec1d22e238650c81d9",
+        "1200_local_data_share_lds.yaml": "14da923095596b4f0eeb82fd24bfd67c",
+        "1300_instruction_cache.yaml": "4b7696d75c93e55f7877e07770beda2d",
+        "1400_scalar_l1_data_cache.yaml": "ea6d0cdb6c34f574248f09554e976599",
        "1500_address_processing_unit_and_data_return_path_ta_td.yaml": "645eb10a440eed62c6250a0f5a2407f3",
-        "1600_vector_l1_data_cache.yaml": "e3b8d1787003094ab7b8372da818ff1e",
+        "1600_vector_l1_data_cache.yaml": "1daa7d96605e8cdf4116bf3b10fb9969",
        "1700_l2_cache.yaml": "38e7db4c404007c471864251dff30570",
        "1800_l2_cache_per_channel.yaml": "7193043cd8eee47501cd8c0ae02b51e9",
        "2100_pc_sampling.yaml": "8049866f25214544f1e53a9e2f08399b"
@@ -28,19 +28,19 @@
      "files": {
        "0000_top_stats.yaml": "2819d96f5b1c3704f2ac50868a246a7f",
        "0100_system_info.yaml": "cefae2b10db8cf4b0d3a971cff5e82c8",
-        "0200_system_speed_of_light.yaml": "747b14ab50dd4d7689af7c268569b32a",
-        "0300_memory_chart.yaml": "0d6d094ad24cebf6e583e643beaae06e",
-        "0400_roofline.yaml": "632b16e1d251e57de0cf7237d3a89766",
-        "0500_command_processor_cpc_cpf.yaml": "93174ba73bf014c143e179719c110db6",
-        "0600_workgroup_manager_spi.yaml": "7364c8431929891d587e6f9b96ddce10",
-        "0700_wavefront.yaml": "5cc88d7743cba8c638491d97725f6778",
-        "1000_compute_units_instruction_mix.yaml": "af6304cce1fe38c119b1d17fa635265c",
-        "1100_compute_units_compute_pipeline.yaml": "c38ece6032d757f394c83ad9f93e0dce",
-        "1200_local_data_share_lds.yaml": "97be647681c51e762e774eb91e8283fc",
-        "1300_instruction_cache.yaml": "dcab979ce17e30f2d48fd2734bf08e74",
-        "1400_scalar_l1_data_cache.yaml": "e90a514e7bb597ec1d22e238650c81d9",
+        "0200_system_speed_of_light.yaml": "dc6a6e1a8513e2d32aecc055a958c639",
+        "0300_memory_chart.yaml": "a61f219fe063c4c4b0b9cbaf96389a8b",
+        "0400_roofline.yaml": "da1d514ed19ca2466c167e983bdb4f13",
+        "0500_command_processor_cpc_cpf.yaml": "d8f424ec3fcfa4b2fcee2ad5e6456531",
+        "0600_workgroup_manager_spi.yaml": "8b6a89de516bed5821a9849627ad634a",
+        "0700_wavefront.yaml": "84ecb62efa87d87288bb63f7e3871028",
+        "1000_compute_units_instruction_mix.yaml": "84bd6a22a29335a4851bba675614e103",
+        "1100_compute_units_compute_pipeline.yaml": "39429cd6af68f91f1b20630c1bab8cc7",
+        "1200_local_data_share_lds.yaml": "14da923095596b4f0eeb82fd24bfd67c",
+        "1300_instruction_cache.yaml": "4b7696d75c93e55f7877e07770beda2d",
+        "1400_scalar_l1_data_cache.yaml": "ea6d0cdb6c34f574248f09554e976599",
        "1500_address_processing_unit_and_data_return_path_ta_td.yaml": "8005b28532601a759ace2f653d10da56",
-        "1600_vector_l1_data_cache.yaml": "e3b8d1787003094ab7b8372da818ff1e",
+        "1600_vector_l1_data_cache.yaml": "1daa7d96605e8cdf4116bf3b10fb9969",
        "1700_l2_cache.yaml": "1630ae8fc504ea056e91bb19909d5629",
        "1800_l2_cache_per_channel.yaml": "5ee4fd9c849670c301c4afee257acddd",
        "2100_pc_sampling.yaml": "8049866f25214544f1e53a9e2f08399b"
@@ -51,19 +51,19 @@
      "files": {
        "0000_top_stats.yaml": "2819d96f5b1c3704f2ac50868a246a7f",
        "0100_system_info.yaml": "cefae2b10db8cf4b0d3a971cff5e82c8",
-        "0200_system_speed_of_light.yaml": "74482aebb54b6d7b429c9ca605cb9951",
-        "0300_memory_chart.yaml": "e28ebc1340d2db1948c68225a6e008ff",
-        "0400_roofline.yaml": "1f3888778245e7eb05e769bda605588a",
-        "0500_command_processor_cpc_cpf.yaml": "93174ba73bf014c143e179719c110db6",
-        "0600_workgroup_manager_spi.yaml": "7364c8431929891d587e6f9b96ddce10",
-        "0700_wavefront.yaml": "5cc88d7743cba8c638491d97725f6778",
-        "1000_compute_units_instruction_mix.yaml": "ac290954de96988004b2a4be345a3a25",
-        "1100_compute_units_compute_pipeline.yaml": "470e3093ce9d53211923d3400e7e7bd7",
-        "1200_local_data_share_lds.yaml": "97be647681c51e762e774eb91e8283fc",
-        "1300_instruction_cache.yaml": "dcab979ce17e30f2d48fd2734bf08e74",
-        "1400_scalar_l1_data_cache.yaml": "e90a514e7bb597ec1d22e238650c81d9",
+        "0200_system_speed_of_light.yaml": "8b413c47f06f2e94b3faa723daac8edd",
+        "0300_memory_chart.yaml": "3d6c88ab2704dc4bd72a63e99fd68cf7",
+        "0400_roofline.yaml": "d4650e008f2e3a7d28871e8518153575",
+        "0500_command_processor_cpc_cpf.yaml": "d8f424ec3fcfa4b2fcee2ad5e6456531",
+        "0600_workgroup_manager_spi.yaml": "8b6a89de516bed5821a9849627ad634a",
+        "0700_wavefront.yaml": "84ecb62efa87d87288bb63f7e3871028",
+        "1000_compute_units_instruction_mix.yaml": "89c03d53fd7563da965ff3e4a1698b02",
+        "1100_compute_units_compute_pipeline.yaml": "0c3c36f6c2fed1f14476966295338e74",
+        "1200_local_data_share_lds.yaml": "14da923095596b4f0eeb82fd24bfd67c",
+        "1300_instruction_cache.yaml": "4b7696d75c93e55f7877e07770beda2d",
+        "1400_scalar_l1_data_cache.yaml": "ea6d0cdb6c34f574248f09554e976599",
        "1500_address_processing_unit_and_data_return_path_ta_td.yaml": "12fe315acb3e06d4c16e4538f418f0ca",
-        "1600_vector_l1_data_cache.yaml": "006854a23925320b94727261f30680b7",
+        "1600_vector_l1_data_cache.yaml": "ebff7d80c601d03027476ae9fb16ecae",
        "1700_l2_cache.yaml": "0987e21ac2547134fea87499dee01847",
        "1800_l2_cache_per_channel.yaml": "ba5eeabcd749ecbb107c42de5ce69317",
        "2100_pc_sampling.yaml": "8049866f25214544f1e53a9e2f08399b"
@@ -74,19 +74,19 @@
      "files": {
        "0000_top_stats.yaml": "2819d96f5b1c3704f2ac50868a246a7f",
        "0100_system_info.yaml": "cefae2b10db8cf4b0d3a971cff5e82c8",
-        "0200_system_speed_of_light.yaml": "7ed2ceba47e232b4e39431228a254f7f",
-        "0300_memory_chart.yaml": "e28ebc1340d2db1948c68225a6e008ff",
-        "0400_roofline.yaml": "a80de496435c2c76eb4cfdc38d62155f",
-        "0500_command_processor_cpc_cpf.yaml": "93174ba73bf014c143e179719c110db6",
-        "0600_workgroup_manager_spi.yaml": "7364c8431929891d587e6f9b96ddce10",
-        "0700_wavefront.yaml": "5cc88d7743cba8c638491d97725f6778",
-        "1000_compute_units_instruction_mix.yaml": "ac290954de96988004b2a4be345a3a25",
-        "1100_compute_units_compute_pipeline.yaml": "470e3093ce9d53211923d3400e7e7bd7",
-        "1200_local_data_share_lds.yaml": "97be647681c51e762e774eb91e8283fc",
-        "1300_instruction_cache.yaml": "dcab979ce17e30f2d48fd2734bf08e74",
-        "1400_scalar_l1_data_cache.yaml": "e90a514e7bb597ec1d22e238650c81d9",
+        "0200_system_speed_of_light.yaml": "0ddeaefd245291c7f88674431efd74f6",
+        "0300_memory_chart.yaml": "3d6c88ab2704dc4bd72a63e99fd68cf7",
+        "0400_roofline.yaml": "c066a19bc0e00e692c34998e44c62387",
+        "0500_command_processor_cpc_cpf.yaml": "d8f424ec3fcfa4b2fcee2ad5e6456531",
+        "0600_workgroup_manager_spi.yaml": "8b6a89de516bed5821a9849627ad634a",
+        "0700_wavefront.yaml": "84ecb62efa87d87288bb63f7e3871028",
+        "1000_compute_units_instruction_mix.yaml": "89c03d53fd7563da965ff3e4a1698b02",
+        "1100_compute_units_compute_pipeline.yaml": "0c3c36f6c2fed1f14476966295338e74",
+        "1200_local_data_share_lds.yaml": "14da923095596b4f0eeb82fd24bfd67c",
+        "1300_instruction_cache.yaml": "4b7696d75c93e55f7877e07770beda2d",
+        "1400_scalar_l1_data_cache.yaml": "ea6d0cdb6c34f574248f09554e976599",
        "1500_address_processing_unit_and_data_return_path_ta_td.yaml": "12fe315acb3e06d4c16e4538f418f0ca",
-        "1600_vector_l1_data_cache.yaml": "006854a23925320b94727261f30680b7",
+        "1600_vector_l1_data_cache.yaml": "ebff7d80c601d03027476ae9fb16ecae",
        "1700_l2_cache.yaml": "05a86637744ad66f6491620c4ad659d2",
        "1800_l2_cache_per_channel.yaml": "ba5eeabcd749ecbb107c42de5ce69317",
        "2100_pc_sampling.yaml": "8049866f25214544f1e53a9e2f08399b"
@@ -97,19 +97,19 @@
      "files": {
        "0000_top_stats.yaml": "2819d96f5b1c3704f2ac50868a246a7f",
        "0100_system_info.yaml": "cefae2b10db8cf4b0d3a971cff5e82c8",
-        "0200_system_speed_of_light.yaml": "74482aebb54b6d7b429c9ca605cb9951",
-        "0300_memory_chart.yaml": "e28ebc1340d2db1948c68225a6e008ff",
-        "0400_roofline.yaml": "f94c87dad18f87e5582566276a5c0cfc",
-        "0500_command_processor_cpc_cpf.yaml": "93174ba73bf014c143e179719c110db6",
-        "0600_workgroup_manager_spi.yaml": "7364c8431929891d587e6f9b96ddce10",
-        "0700_wavefront.yaml": "5cc88d7743cba8c638491d97725f6778",
-        "1000_compute_units_instruction_mix.yaml": "ac290954de96988004b2a4be345a3a25",
-        "1100_compute_units_compute_pipeline.yaml": "470e3093ce9d53211923d3400e7e7bd7",
-        "1200_local_data_share_lds.yaml": "97be647681c51e762e774eb91e8283fc",
-        "1300_instruction_cache.yaml": "dcab979ce17e30f2d48fd2734bf08e74",
-        "1400_scalar_l1_data_cache.yaml": "e90a514e7bb597ec1d22e238650c81d9",
+        "0200_system_speed_of_light.yaml": "8b413c47f06f2e94b3faa723daac8edd",
+        "0300_memory_chart.yaml": "3d6c88ab2704dc4bd72a63e99fd68cf7",
+        "0400_roofline.yaml": "318c3e774d41a639628a7f72c2462375",
+        "0500_command_processor_cpc_cpf.yaml": "d8f424ec3fcfa4b2fcee2ad5e6456531",
+        "0600_workgroup_manager_spi.yaml": "8b6a89de516bed5821a9849627ad634a",
+        "0700_wavefront.yaml": "84ecb62efa87d87288bb63f7e3871028",
+        "1000_compute_units_instruction_mix.yaml": "89c03d53fd7563da965ff3e4a1698b02",
+        "1100_compute_units_compute_pipeline.yaml": "0c3c36f6c2fed1f14476966295338e74",
+        "1200_local_data_share_lds.yaml": "14da923095596b4f0eeb82fd24bfd67c",
+        "1300_instruction_cache.yaml": "4b7696d75c93e55f7877e07770beda2d",
+        "1400_scalar_l1_data_cache.yaml": "ea6d0cdb6c34f574248f09554e976599",
        "1500_address_processing_unit_and_data_return_path_ta_td.yaml": "12fe315acb3e06d4c16e4538f418f0ca",
-        "1600_vector_l1_data_cache.yaml": "006854a23925320b94727261f30680b7",
+        "1600_vector_l1_data_cache.yaml": "ebff7d80c601d03027476ae9fb16ecae",
        "1700_l2_cache.yaml": "96e49399b26d00d88ad534a35c95304b",
        "1800_l2_cache_per_channel.yaml": "ba5eeabcd749ecbb107c42de5ce69317",
        "2100_pc_sampling.yaml": "8049866f25214544f1e53a9e2f08399b"
@@ -120,20 +120,20 @@
      "files": {
        "0000_top_stats.yaml": "2819d96f5b1c3704f2ac50868a246a7f",
        "0100_system_info.yaml": "cefae2b10db8cf4b0d3a971cff5e82c8",
-        "0200_system_speed_of_light.yaml": "4a215bccc9378583a6e7e7733b601537",
-        "0300_memory_chart.yaml": "f19548711a687779df0c0b87a1df7a27",
-        "0400_roofline.yaml": "156c1a1d7a6c1e55aea25552334a84d5",
-        "0500_command_processor_cpc_cpf.yaml": "5b67ff80efbc2e1dffb7e3922499ca88",
-        "0600_workgroup_manager_spi.yaml": "63a7b6f7a4487fb87d67549214e08aac",
-        "0700_wavefront.yaml": "1ecfc3a91ec0cce6ed9eb94afae17aa9",
-        "1000_compute_units_instruction_mix.yaml": "7088fafcaa66a8ec48a9d3939cd7339a",
-        "1100_compute_units_compute_pipeline.yaml": "fce707e3f419ee2708676c8f7c325df5",
-        "1200_local_data_share_lds.yaml": "06bee89ddab210dbd122eaaedef0b29a",
-        "1300_instruction_cache.yaml": "dcab979ce17e30f2d48fd2734bf08e74",
-        "1400_scalar_l1_data_cache.yaml": "e90a514e7bb597ec1d22e238650c81d9",
+        "0200_system_speed_of_light.yaml": "a5ee49ce96bfab87128c856c827db870",
+        "0300_memory_chart.yaml": "e2401641a8f280fda308f87e5ad243df",
+        "0400_roofline.yaml": "2bd3b630b72d6d165c0d30cf481136a9",
+        "0500_command_processor_cpc_cpf.yaml": "3f7dab1663ad7a6fae3801aec2b1e8d0",
+        "0600_workgroup_manager_spi.yaml": "e6546a92d283fed5a5dc6df203efb670",
+        "0700_wavefront.yaml": "330468fd711057b422de9b952c5cfe69",
+        "1000_compute_units_instruction_mix.yaml": "c8bbdde1f29c9548a8e0ed7fcdd9ae04",
+        "1100_compute_units_compute_pipeline.yaml": "30e64960bbac4cc5626615a60240bd5f",
+        "1200_local_data_share_lds.yaml": "0e57c559dbcd5526e2e8006a47a69f4b",
+        "1300_instruction_cache.yaml": "4b7696d75c93e55f7877e07770beda2d",
+        "1400_scalar_l1_data_cache.yaml": "ea6d0cdb6c34f574248f09554e976599",
        "1500_address_processing_unit_and_data_return_path_ta_td.yaml": "355a0c6b9b113fcfb686a300b78be21a",
-        "1600_vector_l1_data_cache.yaml": "68382e45c7a3c578df861d6285024803",
-        "1700_l2_cache.yaml": "f70f23b93e97b99327b5db3907eb133e",
+        "1600_vector_l1_data_cache.yaml": "689aba850739a9cbd64ce1e816e95dff",
+        "1700_l2_cache.yaml": "067f8c8a7264762fdc58a41728b4382b",
        "1800_l2_cache_per_channel.yaml": "7e2a1809a9b7f70a088068d6689c8aa4",
        "2100_pc_sampling.yaml": "8049866f25214544f1e53a9e2f08399b"
      }
@@ -97,7 +97,7 @@ Addition:
    metric_descriptions:
      New Metric:
        plain: Description text
-        rst: |- # Optional
+        rst: >- # Optional
          Description with :ref:`RST markup <link>`

 Deletion:
@@ -231,7 +231,7 @@ Modification:
    metric_descriptions:
      Existing Metric:
        plain: Updated description
-        rst: |-
+        rst: >-
          Updated description with **RST**"
 ```

@@ -114,11 +114,9 @@ def merge_docs_rst_as_default(descs: dict, docs_file: Path) -> dict:
    for section, metrics in descs.items():
        docs_section = docs.get(section) or {}
        for metric_name, d in metrics.items():
-            # If panel didn't explicitly provide rst, inherit from docs
-            if not d.get("rst"):
-                doc_entry = docs_section.get(metric_name) or {}
-                if doc_entry.get("rst"):
-                    d["rst"] = doc_entry["rst"]
+            doc_entry = docs_section.get(metric_name) or {}
+            if doc_entry.get("rst"):
+                d["rst"] = doc_entry["rst"]
    return descs


@@ -129,10 +127,6 @@ def merge_units_as_default(descs: dict, docs_file: Path, per_arch_file: Path) ->
      2) else from docs file,
      3) else leave as-is (missing).
    """
-    per_arch: dict = {}
-    if per_arch_file.exists():
-        with open(per_arch_file, "r", encoding="utf-8") as f:
-            per_arch = yaml.safe_load(f) or {}

    docs: dict = {}
    if docs_file.exists():
@@ -140,18 +134,11 @@ def merge_units_as_default(descs: dict, docs_file: Path, per_arch_file: Path) ->
            docs = yaml.safe_load(f) or {}

    for section, metrics in descs.items():
-        psec = per_arch.get(section) or {}
        dsec = docs.get(section) or {}
        for metric, data in metrics.items():
-            # Only fill if panel did NOT explicitly set unit
-            if "unit" not in data or data["unit"] is None:
-                unit = None
-                if metric in psec and isinstance(psec[metric], dict):
-                    unit = psec[metric].get("unit")
-                if unit is None and metric in dsec and isinstance(dsec[metric], dict):
-                    unit = dsec[metric].get("unit")
-                if unit is not None:
-                    data["unit"] = unit
+            doc_entry = dsec.get(metric)
+            if doc_entry and "unit" in doc_entry:
+                data["unit"] = doc_entry["unit"]
    return descs


@@ -403,7 +390,7 @@ def sync_arch(
    update_per_arch_metrics_file(arch_name, descriptions, per_arch_metrics_dir)

    # 5) Only when latest: update docs, but overwrite 'rst' only for overrides
-    if is_latest:
+    if is_latest and (panel_rst_overrides or panel_unit_overrides):
        if not update_docs_metrics_file(
            descriptions,
            docs_metrics_file,
@@ -31,7 +31,7 @@ import yaml

 def str_representer(dumper, data):
    if "\n" in data:
-        return dumper.represent_scalar("tag:yaml.org,2002:str", data, style="|")
+        return dumper.represent_scalar("tag:yaml.org,2002:str", data, style=">")
    return dumper.represent_scalar("tag:yaml.org,2002:str", data)


@@ -1,20 +1,20 @@
 System Speed-of-Light:
  VALU FLOPs:
-    rst: |-
+    rst: >-
      The total floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is also presented as a percent of the peak theoretical
      FLOPs achievable on the specific accelerator. Note: this does not include
      any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  VALU IOPs:
-    rst: |-
+    rst: >-
      The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
      This is also presented as a percent of the peak theoretical IOPs achievable
      on the specific accelerator. Note: this does not include any integer operations
      from :ref:`MFMA <desc-mfma>` instructions.
    unit: GOIPs
  MFMA FLOPs (BF16):
-    rst: |-
+    rst: >-
      The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
      operations executed per second. Note: this does not include any 16-bit brain
      floating point operations from :ref:`VALU <desc-valu>` instructions. This
@@ -22,7 +22,7 @@ System Speed-of-Light:
      achievable on the specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F16):
-    rst: |-
+    rst: >-
      The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -30,7 +30,7 @@ System Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F32):
-    rst: |-
+    rst: >-
      The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 32-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -38,7 +38,7 @@ System Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F64):
-    rst: |-
+    rst: >-
      The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 64-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -46,7 +46,7 @@ System Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA IOPs (Int8):
-    rst: |-
+    rst: >-
      The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
      per second. Note: this does not include any 8-bit integer operations from
      :ref:`VALU <desc-valu>` instructions. This is also presented as a percent
@@ -99,7 +99,7 @@ System Speed-of-Light:
      over the :ref:`total active CU cycles <total-active-cu-cycles>`.
    unit: Instructions per-cycle
  Wavefront Occupancy:
-    rst: |-
+    rst: >-
      The time-averaged number of wavefronts resident on the accelerator over
      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms). This is also presented as a percent of the peak theoretical
@@ -144,7 +144,7 @@ System Speed-of-Light:
      peak theoretical bandwidth achievable on the specific accelerator.
    unit: GB/s
  L2-Fabric Read BW:
-    rst: |-
+    rst: >-
      The number of bytes read by the L2 over the :ref:`Infinity Fabric\u2122
      interface <l2-fabric>` per unit time. This is also presented as a percent
      of the peak theoretical bandwidth achievable on the specific accelerator.
@@ -229,19 +229,19 @@ Memory Chart:
    rst: Total number of compute units (CUs) on the accelerator.
    unit: CUs
  VGPR:
-    rst: |-
+    rst: >-
      The number of architected vector general-purpose registers allocated for the
      kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
      number of VGPRs requested by the compiler due to allocation granularity.
    unit: VGPRs
  SGPR:
-    rst: |-
+    rst: >-
      The number of scalar general-purpose registers allocated for the kernel, see
      :ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
      SGPRs requested by the compiler due to allocation granularity.
    unit: SGPRs
  LDS Allocation:
-    rst: |-
+    rst: >-
      The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
      allocated for this kernel. Note: This may also be larger than what was requested
      at compile time due to both allocation granularity and dynamic per-dispatch
@@ -420,28 +420,28 @@ Memory Chart:
    unit: Requests per normalization unit
 Roofline Performance Rates:
  VALU FLOPs (F16):
-    rst: |-
+    rst: >-
      The total 16-bit floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is presented with the value of the peak empirical F16 FLOPs achievable
      on the specific accelerator. Note: this does not include any F16 operations
      from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  VALU FLOPs (F32):
-    rst: |-
+    rst: >-
      The total 32-bit floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is presented with the value of the peak empirical F32 FLOPs achievable
      on the specific accelerator. Note: this does not include any F32 operations
      from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  VALU FLOPs (F64):
-    rst: |-
+    rst: >-
      The total 64-bit floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is presented with the value of the peak empirical F64 FLOPs achievable
      on the specific accelerator. Note: this does not include any F64 operations
      from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  MFMA FLOPs (F64):
-    rst: |-
+    rst: >-
      The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 64-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -449,7 +449,7 @@ Roofline Performance Rates:
      displayed alongside for comparison.
    unit: GFLOPs
  MFMA FLOPs (F32):
-    rst: |-
+    rst: >-
      The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 32-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -457,7 +457,7 @@ Roofline Performance Rates:
      displayed alongside for comparison.
    unit: GFLOPs
  MFMA FLOPs (F16):
-    rst: |-
+    rst: >-
      The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -465,7 +465,7 @@ Roofline Performance Rates:
      displayed alongside for comparison.
    unit: GFLOPs
  MFMA FLOPs (BF16):
-    rst: |-
+    rst: >-
      The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
      operations executed per second. Note: this does not include any 16-bit brain
      floating point operations from :ref:`VALU <desc-valu>` instructions. The
@@ -473,7 +473,7 @@ Roofline Performance Rates:
      accelerator is displayed alongside for comparison.
    unit: GFLOPs
  MFMA IOPs (Int8):
-    rst: |-
+    rst: >-
      The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
      per second. Note: this does not include any 8-bit integer operations from
      :ref:`VALU <desc-valu>` instructions. The peak empirically measured INT8 MFMA
@@ -481,7 +481,7 @@ Roofline Performance Rates:
      for comparison.
    unit: GIOPs
  HBM Bandwidth:
-    rst: |-
+    rst: >-
      The total number of bytes read from and written to High-Bandwidth
        Memory (HBM) per second. The peak empirically measured bandwidth achievable
        on the specific accelerator is displayed alongside for comparison.
@@ -512,28 +512,28 @@ Roofline Performance Rates:
    unit: GB/s
 Roofline Plot Points:
  AI HBM:
-    rst: |-
+    rst: >-
      The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
      It is the ratio of total floating-point operations (FLOPs) to total bytes
      transferred between HBM and the L2 cache. This value is used as the x-coordinate
      for the HBM roofline.
    unit: FLOPs/Byte
  AI L2:
-    rst: |-
+    rst: >-
      The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
      of total floating-point operations (FLOPs) to total bytes transferred between
      the L2 cache and the L1 cache. This value is used as the x-coordinate for
      the L2 roofline.
    unit: FLOPs/Byte
  AI L1:
-    rst: |-
+    rst: >-
      The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
      of total floating-point operations (FLOPs) to total bytes transferred between
      the L1 cache and the processing units. This value is used as the x-coordinate
      for the L1 roofline.
    unit: FLOPs/Byte
  Performance (GFLOPs):
-    rst: |-
+    rst: >-
      The overall achieved performance, measured in GigaFLOPs
      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
      operations divided by the total execution time. This value is used as the y-coordinate
@@ -591,7 +591,7 @@ Workgroup manager utilizations:
      any work.
    unit: Percent
  Scheduler-Pipe Utilization:
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where the scheduler-pipes were actively doing any work. Note: this
      value is expected to range between 0% and 25%. See :ref:`desc-spi`.
@@ -629,7 +629,7 @@ Workgroup manager utilizations:
    unit: Cycles/wave
 Workgroup Manager - Resource Allocation:
  Not-scheduled Rate (Workgroup Manager):
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
      due to a bottleneck within the workgroup manager rather than a lack of a
@@ -638,7 +638,7 @@ Workgroup Manager - Resource Allocation:
      description.
    unit: Percent
  Not-scheduled Rate (Scheduler-Pipe):
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
      due to a bottleneck within the scheduler-pipes rather than a lack of a CU
@@ -647,7 +647,7 @@ Workgroup Manager - Resource Allocation:
      description.
    unit: Percent
  Scheduler-Pipe Stall Rate:
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
      due to occupancy limitations (like a lack of a CU or :ref:`SIMD <desc-valu>`
@@ -711,7 +711,7 @@ Wavefront Launch Stats:
      block size.
    unit: Work-Items
  Total Wavefronts:
-    rst: |-
+    rst: >-
      The total number of wavefronts launched as part of the kernel dispatch.
      On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront
      size is always 64 work-items. Thus, the total number of wavefronts should
@@ -726,25 +726,25 @@ Wavefront Launch Stats:
      <https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
    unit: Wavefronts
  VGPRs:
-    rst: |-
+    rst: >-
      The number of architected vector general-purpose registers allocated for the
      kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
      number of VGPRs requested by the compiler due to allocation granularity.
    unit: VGPRs
  AGPRs:
-    rst: |-
+    rst: >-
      The number of accumulation vector general-purpose registers allocated
      for the kernel, see :ref:`AGPRs <desc-agprs>`. Note: this may not exactly match
      the number of AGPRs requested by the compiler due to allocation granularity.
    unit: AGPRs
  SGPRs:
-    rst: |-
+    rst: >-
      The number of scalar general-purpose registers allocated for the kernel, see
      :ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
      SGPRs requested by the compiler due to allocation granularity.
    unit: SGPRs
  LDS Allocation:
-    rst: |-
+    rst: >-
      The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
      allocated for this kernel. Note: This may also be larger than what was requested
      at compile time due to both allocation granularity and dynamic per-dispatch
@@ -767,7 +767,7 @@ Wavefront Runtime Stats:
      This is averaged over all wavefronts in a kernel dispatch.
    unit: Instructions per wavefront
  Wave Cycles:
-    rst: |-
+    rst: >-
      The number of cycles a wavefront in the kernel dispatch spent resident
      on a compute unit per :ref:`normalization unit <normalization-units>`. This is
      averaged over all wavefronts in a kernel dispatch. Note: this should not
@@ -805,7 +805,7 @@ Wavefront Runtime Stats:
      the total Wave Cycles metric.
    unit: Cycles per normalization unit
  Wavefront Occupancy:
-    rst: |-
+    rst: >-
      The time-averaged number of wavefronts resident on the accelerator over the
      lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms).
@@ -976,7 +976,7 @@ LDS Statistics:
      to stalls from non-dword aligned addresses per :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
  Mem Violations:
-    rst: |-
+    rst: >-
      The total number of out-of-bounds accesses made to the LDS, per :ref:`normalization
      unit <normalization-units>`. This is unused and expected to be zero in
      most configurations for modern CDNA\u2122 accelerators.
@@ -993,7 +993,7 @@ L1I Speed-of-Light:
      over the number of all L1I requests.
    unit: Percent
  L1I-L2 Bandwidth Utilization:
-    rst: |-
+    rst: >-
      The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
      achieved. Calculated as the ratio of the total number of requests from
      the L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`.
@@ -1100,7 +1100,7 @@ Scalar L1D cache accesses:
    unit: Requests per normalization unit
 Scalar L1D Cache - L2 Interface:
  sL1D-L2 BW:
-    rst: |-
+    rst: >-
      The total number of bytes read from, written to, or atomically updated
      across the sL1D\u2194:doc:`L2 <l2-cache>` interface, divided by total duration.
      Note that sL1D writes and atomics are typically
@@ -1122,7 +1122,7 @@ Scalar L1D Cache - L2 Interface:
      CDNA accelerators.
    unit: Requests per normalization unit
  Stall Cycles:
-    rst: |-
+    rst: >-
      The total number of cycles the sL1D\u2194 :doc:`L2 <l2-cache>` interface
      was stalled, per :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
@@ -1432,7 +1432,7 @@ L1 Unified Translation Cache (UTCL1):
      translation not being present in the cache, per :ref:`normalization unit <normalization-units>`.
    unit: unit
  Permission Misses:
-    rst: |-
+    rst: >-
      The total number of translation requests that missed in the UTCL1 due
      to a permission error, per :ref:`normalization unit <normalization-units>`.
      This is unused and expected to be zero in most configurations for modern
@@ -1,20 +1,20 @@
 System Speed-of-Light:
  VALU FLOPs:
-    rst: |-
+    rst: >-
      The total floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is also presented as a percent of the peak theoretical
      FLOPs achievable on the specific accelerator. Note: this does not include
      any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  VALU IOPs:
-    rst: |-
+    rst: >-
      The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
      This is also presented as a percent of the peak theoretical IOPs achievable
      on the specific accelerator. Note: this does not include any integer operations
      from :ref:`MFMA <desc-mfma>` instructions.
    unit: GOIPs
  MFMA FLOPs (BF16):
-    rst: |-
+    rst: >-
      The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
      operations executed per second. Note: this does not include any 16-bit brain
      floating point operations from :ref:`VALU <desc-valu>` instructions. This
@@ -22,7 +22,7 @@ System Speed-of-Light:
      achievable on the specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F16):
-    rst: |-
+    rst: >-
      The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -30,7 +30,7 @@ System Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F32):
-    rst: |-
+    rst: >-
      The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 32-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -38,7 +38,7 @@ System Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F64):
-    rst: |-
+    rst: >-
      The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 64-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -46,7 +46,7 @@ System Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA IOPs (Int8):
-    rst: |-
+    rst: >-
      The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
      per second. Note: this does not include any 8-bit integer operations from
      :ref:`VALU <desc-valu>` instructions. This is also presented as a percent
@@ -99,7 +99,7 @@ System Speed-of-Light:
      over the :ref:`total active CU cycles <total-active-cu-cycles>`.
    unit: Instructions per-cycle
  Wavefront Occupancy:
-    rst: |-
+    rst: >-
      The time-averaged number of wavefronts resident on the accelerator over
      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms). This is also presented as a percent of the peak theoretical
@@ -144,7 +144,7 @@ System Speed-of-Light:
      peak theoretical bandwidth achievable on the specific accelerator.
    unit: GB/s
  L2-Fabric Read BW:
-    rst: |-
+    rst: >-
      The number of bytes read by the L2 over the :ref:`Infinity Fabric\u2122
      interface <l2-fabric>` per unit time. This is also presented as a percent
      of the peak theoretical bandwidth achievable on the specific accelerator.
@@ -229,19 +229,19 @@ Memory Chart:
    rst: Total number of compute units (CUs) on the accelerator.
    unit: CUs
  VGPR:
-    rst: |-
+    rst: >-
      The number of architected vector general-purpose registers allocated for the
      kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
      number of VGPRs requested by the compiler due to allocation granularity.
    unit: VGPRs
  SGPR:
-    rst: |-
+    rst: >-
      The number of scalar general-purpose registers allocated for the kernel, see
      :ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
      SGPRs requested by the compiler due to allocation granularity.
    unit: SGPRs
  LDS Allocation:
-    rst: |-
+    rst: >-
      The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
      allocated for this kernel. Note: This may also be larger than what was requested
      at compile time due to both allocation granularity and dynamic per-dispatch
@@ -420,28 +420,28 @@ Memory Chart:
    unit: Requests per normalization unit
 Roofline Performance Rates:
  VALU FLOPs (F16):
-    rst: |-
+    rst: >-
      The total 16-bit floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is presented with the value of the peak empirical F16 FLOPs achievable
      on the specific accelerator. Note: this does not include any F16 operations
      from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  VALU FLOPs (F32):
-    rst: |-
+    rst: >-
      The total 32-bit floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is presented with the value of the peak empirical F32 FLOPs achievable
      on the specific accelerator. Note: this does not include any F32 operations
      from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  VALU FLOPs (F64):
-    rst: |-
+    rst: >-
      The total 64-bit floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is presented with the value of the peak empirical F64 FLOPs achievable
      on the specific accelerator. Note: this does not include any F64 operations
      from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  MFMA FLOPs (F64):
-    rst: |-
+    rst: >-
      The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 64-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -449,7 +449,7 @@ Roofline Performance Rates:
      displayed alongside for comparison.
    unit: GFLOPs
  MFMA FLOPs (F32):
-    rst: |-
+    rst: >-
      The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 32-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -457,7 +457,7 @@ Roofline Performance Rates:
      displayed alongside for comparison.
    unit: GFLOPs
  MFMA FLOPs (F16):
-    rst: |-
+    rst: >-
      The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -465,7 +465,7 @@ Roofline Performance Rates:
      displayed alongside for comparison.
    unit: GFLOPs
  MFMA FLOPs (BF16):
-    rst: |-
+    rst: >-
      The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
      operations executed per second. Note: this does not include any 16-bit brain
      floating point operations from :ref:`VALU <desc-valu>` instructions. The
@@ -473,7 +473,7 @@ Roofline Performance Rates:
      accelerator is displayed alongside for comparison.
    unit: GFLOPs
  MFMA IOPs (Int8):
-    rst: |-
+    rst: >-
      The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
      per second. Note: this does not include any 8-bit integer operations from
      :ref:`VALU <desc-valu>` instructions. The peak empirically measured INT8 MFMA
@@ -481,7 +481,7 @@ Roofline Performance Rates:
      for comparison.
    unit: GIOPs
  HBM Bandwidth:
-    rst: |-
+    rst: >-
      The total number of bytes read from and written to High-Bandwidth
        Memory (HBM) per second. The peak empirically measured bandwidth achievable
        on the specific accelerator is displayed alongside for comparison.
@@ -512,28 +512,28 @@ Roofline Performance Rates:
    unit: GB/s
 Roofline Plot Points:
  AI HBM:
-    rst: |-
+    rst: >-
      The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
      It is the ratio of total floating-point operations (FLOPs) to total bytes
      transferred between HBM and the L2 cache. This value is used as the x-coordinate
      for the HBM roofline.
    unit: FLOPs/Byte
  AI L2:
-    rst: |-
+    rst: >-
      The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
      of total floating-point operations (FLOPs) to total bytes transferred between
      the L2 cache and the L1 cache. This value is used as the x-coordinate for
      the L2 roofline.
    unit: FLOPs/Byte
  AI L1:
-    rst: |-
+    rst: >-
      The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
      of total floating-point operations (FLOPs) to total bytes transferred between
      the L1 cache and the processing units. This value is used as the x-coordinate
      for the L1 roofline.
    unit: FLOPs/Byte
  Performance (GFLOPs):
-    rst: |-
+    rst: >-
      The overall achieved performance, measured in GigaFLOPs
      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
      operations divided by the total execution time. This value is used as the y-coordinate
@@ -591,7 +591,7 @@ Workgroup manager utilizations:
      any work.
    unit: Percent
  Scheduler-Pipe Utilization:
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where the scheduler-pipes were actively doing any work. Note: this
      value is expected to range between 0% and 25%. See :ref:`desc-spi`.
@@ -629,7 +629,7 @@ Workgroup manager utilizations:
    unit: Cycles/wave
 Workgroup Manager - Resource Allocation:
  Not-scheduled Rate (Workgroup Manager):
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
      due to a bottleneck within the workgroup manager rather than a lack of a
@@ -638,7 +638,7 @@ Workgroup Manager - Resource Allocation:
      description.
    unit: Percent
  Not-scheduled Rate (Scheduler-Pipe):
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
      due to a bottleneck within the scheduler-pipes rather than a lack of a CU
@@ -647,7 +647,7 @@ Workgroup Manager - Resource Allocation:
      description.
    unit: Percent
  Scheduler-Pipe Stall Rate:
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
      due to occupancy limitations (like a lack of a CU or :ref:`SIMD <desc-valu>`
@@ -711,7 +711,7 @@ Wavefront Launch Stats:
      block size.
    unit: Work-Items
  Total Wavefronts:
-    rst: |-
+    rst: >-
      The total number of wavefronts launched as part of the kernel dispatch.
      On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront
      size is always 64 work-items. Thus, the total number of wavefronts should
@@ -726,25 +726,25 @@ Wavefront Launch Stats:
      <https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
    unit: Wavefronts
  VGPRs:
-    rst: |-
+    rst: >-
      The number of architected vector general-purpose registers allocated for the
      kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
      number of VGPRs requested by the compiler due to allocation granularity.
    unit: VGPRs
  AGPRs:
-    rst: |-
+    rst: >-
      The number of accumulation vector general-purpose registers allocated
      for the kernel, see :ref:`AGPRs <desc-agprs>`. Note: this may not exactly match
      the number of AGPRs requested by the compiler due to allocation granularity.
    unit: AGPRs
  SGPRs:
-    rst: |-
+    rst: >-
      The number of scalar general-purpose registers allocated for the kernel, see
      :ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
      SGPRs requested by the compiler due to allocation granularity.
    unit: SGPRs
  LDS Allocation:
-    rst: |-
+    rst: >-
      The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
      allocated for this kernel. Note: This may also be larger than what was requested
      at compile time due to both allocation granularity and dynamic per-dispatch
@@ -767,7 +767,7 @@ Wavefront Runtime Stats:
      This is averaged over all wavefronts in a kernel dispatch.
    unit: Instructions per wavefront
  Wave Cycles:
-    rst: |-
+    rst: >-
      The number of cycles a wavefront in the kernel dispatch spent resident
      on a compute unit per :ref:`normalization unit <normalization-units>`. This is
      averaged over all wavefronts in a kernel dispatch. Note: this should not
@@ -805,7 +805,7 @@ Wavefront Runtime Stats:
      the total Wave Cycles metric.
    unit: Cycles per normalization unit
  Wavefront Occupancy:
-    rst: |-
+    rst: >-
      The time-averaged number of wavefronts resident on the accelerator over the
      lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms).
@@ -907,7 +907,7 @@ VALU Arithmetic Instruction Mix:
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  Conversion:
-    rst: |-
+    rst: >-
      The total number of type conversion instructions (such as converting data
      to or from F32\u2194F64) issued to the VALU per :ref:`normalization unit
      <normalization-units>`.
@@ -976,21 +976,21 @@ MFMA Arithmetic Instruction Mix:
    unit: Instructions per normalization unit
 Compute Speed-of-Light:
  VALU FLOPs:
-    rst: |-
+    rst: >-
      The total floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is also presented as a percent of the peak theoretical
      FLOPs achievable on the specific accelerator. Note: this does not include
      any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  VALU IOPs:
-    rst: |-
+    rst: >-
      The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
      This is also presented as a percent of the peak theoretical IOPs achievable
      on the specific accelerator. Note: this does not include any integer operations
      from :ref:`MFMA <desc-mfma>` instructions.
    unit: GIOPs
  MFMA FLOPs (BF16):
-    rst: |-
+    rst: >-
      The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit brain floating
      point operations from :ref:`VALU <desc-valu>` instructions. This is also
@@ -998,7 +998,7 @@ Compute Speed-of-Light:
      on the specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F16):
-    rst: |-
+    rst: >-
      The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1006,7 +1006,7 @@ Compute Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F32):
-    rst: |-
+    rst: >-
      The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 32-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1014,7 +1014,7 @@ Compute Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F64):
-    rst: |-
+    rst: >-
      The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 64-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1026,7 +1026,7 @@ Compute Speed-of-Light:
      achievable on the specific accelerator.
    unit: GFLOPs
  MFMA IOPs (INT8):
-    rst: |-
+    rst: >-
      The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
      per second. Note: this does not include any 8-bit integer operations from
      :ref:`VALU <desc-valu>` instructions. This is also presented as a percent
@@ -1112,7 +1112,7 @@ Arithmetic Operations:
      unit <normalization-units>`.
    unit: FLOP per normalization unit
  BF16 OPs:
-    rst: |-
+    rst: >-
      The total number of 16-bit brain floating-point operations executed on
      either the :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
      unit <normalization-units>`. Note: on current CDNA accelerators, the VALU
@@ -1129,7 +1129,7 @@ Arithmetic Operations:
      unit <normalization-units>`.
    unit: FLOP per normalization unit
  INT8 OPs:
-    rst: |-
+    rst: >-
      The total number of 8-bit integer operations executed on either the :ref:`VALU
      <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
      <normalization-units>`. Note: on current CDNA accelerators, the VALU has
@@ -1206,7 +1206,7 @@ LDS Statistics:
      to stalls from non-dword aligned addresses per :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
  Mem Violations:
-    rst: |-
+    rst: >-
      The total number of out-of-bounds accesses made to the LDS, per :ref:`normalization
      unit <normalization-units>`. This is unused and expected to be zero in
      most configurations for modern CDNA\u2122 accelerators.
@@ -1223,7 +1223,7 @@ L1I Speed-of-Light:
      over the number of all L1I requests.
    unit: Percent
  L1I-L2 Bandwidth Utilization:
-    rst: |-
+    rst: >-
      The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
      achieved. Calculated as the ratio of the total number of requests from
      the L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`.
@@ -1330,7 +1330,7 @@ Scalar L1D cache accesses:
    unit: Requests per normalization unit
 Scalar L1D Cache - L2 Interface:
  sL1D-L2 BW:
-    rst: |-
+    rst: >-
      The total number of bytes read from, written to, or atomically updated
      across the sL1D\u2194:doc:`L2 <l2-cache>` interface, divided by total duration.
      Note that sL1D writes and atomics are typically
@@ -1352,7 +1352,7 @@ Scalar L1D Cache - L2 Interface:
      CDNA accelerators.
    unit: Requests per normalization unit
  Stall Cycles:
-    rst: |-
+    rst: >-
      The total number of cycles the sL1D\u2194 :doc:`L2 <l2-cache>` interface
      was stalled, per :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
@@ -1676,7 +1676,7 @@ L1 Unified Translation Cache (UTCL1):
      translation not being present in the cache, per :ref:`normalization unit <normalization-units>`.
    unit: unit
  Permission Misses:
-    rst: |-
+    rst: >-
      The total number of translation requests that missed in the UTCL1 due
      to a permission error, per :ref:`normalization unit <normalization-units>`.
      This is unused and expected to be zero in most configurations for modern
@@ -1,20 +1,20 @@
 System Speed-of-Light:
  VALU FLOPs:
-    rst: |-
+    rst: >-
      The total floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is also presented as a percent of the peak theoretical
      FLOPs achievable on the specific accelerator. Note: this does not include
      any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  VALU IOPs:
-    rst: |-
+    rst: >-
      The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
      This is also presented as a percent of the peak theoretical IOPs achievable
      on the specific accelerator. Note: this does not include any integer operations
      from :ref:`MFMA <desc-mfma>` instructions.
    unit: GOIPs
  MFMA FLOPs (F8):
-    rst: |-
+    rst: >-
      The total number of 8-bit brain floating point :ref:`MFMA <desc-mfma>`
      operations executed per second. Note: this does not include any 16-bit brain
      floating point operations from :ref:`VALU <desc-valu>` instructions. This
@@ -23,7 +23,7 @@ System Speed-of-Light:
      series and later only.
    unit: GFLOPs
  MFMA FLOPs (BF16):
-    rst: |-
+    rst: >-
      The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
      operations executed per second. Note: this does not include any 16-bit brain
      floating point operations from :ref:`VALU <desc-valu>` instructions. This
@@ -31,7 +31,7 @@ System Speed-of-Light:
      achievable on the specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F16):
-    rst: |-
+    rst: >-
      The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -39,7 +39,7 @@ System Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F32):
-    rst: |-
+    rst: >-
      The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 32-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -47,7 +47,7 @@ System Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F64):
-    rst: |-
+    rst: >-
      The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 64-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -55,7 +55,7 @@ System Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA IOPs (Int8):
-    rst: |-
+    rst: >-
      The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
      per second. Note: this does not include any 8-bit integer operations from
      :ref:`VALU <desc-valu>` instructions. This is also presented as a percent
@@ -108,7 +108,7 @@ System Speed-of-Light:
      over the :ref:`total active CU cycles <total-active-cu-cycles>`.
    unit: Instructions per-cycle
  Wavefront Occupancy:
-    rst: |-
+    rst: >-
      The time-averaged number of wavefronts resident on the accelerator over
      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms). This is also presented as a percent of the peak theoretical
@@ -153,7 +153,7 @@ System Speed-of-Light:
      peak theoretical bandwidth achievable on the specific accelerator.
    unit: GB/s
  L2-Fabric Read BW:
-    rst: |-
+    rst: >-
      The number of bytes read by the L2 over the :ref:`Infinity Fabric\u2122
      interface <l2-fabric>` per unit time. This is also presented as a percent
      of the peak theoretical bandwidth achievable on the specific accelerator.
@@ -238,19 +238,19 @@ Memory Chart:
    rst: Total number of compute units (CUs) on the accelerator.
    unit: CUs
  VGPR:
-    rst: |-
+    rst: >-
      The number of architected vector general-purpose registers allocated for the
      kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
      number of VGPRs requested by the compiler due to allocation granularity.
    unit: VGPRs
  SGPR:
-    rst: |-
+    rst: >-
      The number of scalar general-purpose registers allocated for the kernel, see
      :ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
      SGPRs requested by the compiler due to allocation granularity.
    unit: SGPRs
  LDS Allocation:
-    rst: |-
+    rst: >-
      The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
      allocated for this kernel. Note: This may also be larger than what was requested
      at compile time due to both allocation granularity and dynamic per-dispatch
@@ -419,28 +419,28 @@ Memory Chart:
    unit: Requests per normalization unit
 Roofline Performance Rates:
  VALU FLOPs (F16):
-    rst: |-
+    rst: >-
      The total 16-bit floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is presented with the value of the peak empirical F16 FLOPs achievable
      on the specific accelerator. Note: this does not include any F16 operations
      from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  VALU FLOPs (F32):
-    rst: |-
+    rst: >-
      The total 32-bit floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is presented with the value of the peak empirical F32 FLOPs achievable
      on the specific accelerator. Note: this does not include any F32 operations
      from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  VALU FLOPs (F64):
-    rst: |-
+    rst: >-
      The total 64-bit floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is presented with the value of the peak empirical F64 FLOPs achievable
      on the specific accelerator. Note: this does not include any F64 operations
      from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  MFMA FLOPs (F64):
-    rst: |-
+    rst: >-
      The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 64-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -448,7 +448,7 @@ Roofline Performance Rates:
      displayed alongside for comparison.
    unit: GFLOPs
  MFMA FLOPs (F32):
-    rst: |-
+    rst: >-
      The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 32-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -456,7 +456,7 @@ Roofline Performance Rates:
      displayed alongside for comparison.
    unit: GFLOPs
  MFMA FLOPs (F16):
-    rst: |-
+    rst: >-
      The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -464,7 +464,7 @@ Roofline Performance Rates:
      displayed alongside for comparison.
    unit: GFLOPs
  MFMA FLOPs (BF16):
-    rst: |-
+    rst: >-
      The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
      operations executed per second. Note: this does not include any 16-bit brain
      floating point operations from :ref:`VALU <desc-valu>` instructions. The
@@ -472,7 +472,7 @@ Roofline Performance Rates:
      accelerator is displayed alongside for comparison.
    unit: GFLOPs
  MFMA FLOPs (F8):
-    rst: |-
+    rst: >-
      The total number of 8-bit brain floating point :ref:`MFMA <desc-mfma>`
      operations executed per second. Note: this does not include any 16-bit brain
      floating point operations from :ref:`VALU <desc-valu>` instructions. The
@@ -481,7 +481,7 @@ Roofline Performance Rates:
      Instinct MI300 series and later only.
    unit: GFLOPs
  MFMA IOPs (Int8):
-    rst: |-
+    rst: >-
      The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
      per second. Note: this does not include any 8-bit integer operations from
      :ref:`VALU <desc-valu>` instructions. The peak empirically measured INT8 MFMA
@@ -489,7 +489,7 @@ Roofline Performance Rates:
      for comparison.
    unit: GIOPs
  HBM Bandwidth:
-    rst: |-
+    rst: >-
      The total number of bytes read from and written to High-Bandwidth
        Memory (HBM) per second. The peak empirically measured bandwidth achievable
        on the specific accelerator is displayed alongside for comparison.
@@ -520,28 +520,28 @@ Roofline Performance Rates:
    unit: GB/s
 Roofline Plot Points:
  AI HBM:
-    rst: |-
+    rst: >-
      The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
      It is the ratio of total floating-point operations (FLOPs) to total bytes
      transferred between HBM and the L2 cache. This value is used as the x-coordinate
      for the HBM roofline.
    unit: FLOPs/Byte
  AI L2:
-    rst: |-
+    rst: >-
      The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
      of total floating-point operations (FLOPs) to total bytes transferred between
      the L2 cache and the L1 cache. This value is used as the x-coordinate for
      the L2 roofline.
    unit: FLOPs/Byte
  AI L1:
-    rst: |-
+    rst: >-
      The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
      of total floating-point operations (FLOPs) to total bytes transferred between
      the L1 cache and the processing units. This value is used as the x-coordinate
      for the L1 roofline.
    unit: FLOPs/Byte
  Performance (GFLOPs):
-    rst: |-
+    rst: >-
      The overall achieved performance, measured in GigaFLOPs
      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
      operations divided by the total execution time. This value is used as the y-coordinate
@@ -599,7 +599,7 @@ Workgroup manager utilizations:
      any work.
    unit: Percent
  Scheduler-Pipe Utilization:
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where the scheduler-pipes were actively doing any work. Note: this
      value is expected to range between 0% and 25%. See :ref:`desc-spi`.
@@ -637,7 +637,7 @@ Workgroup manager utilizations:
    unit: Cycles/wave
 Workgroup Manager - Resource Allocation:
  Not-scheduled Rate (Workgroup Manager):
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
      due to a bottleneck within the workgroup manager rather than a lack of a
@@ -646,7 +646,7 @@ Workgroup Manager - Resource Allocation:
      description.
    unit: Percent
  Not-scheduled Rate (Scheduler-Pipe):
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
      due to a bottleneck within the scheduler-pipes rather than a lack of a CU
@@ -655,7 +655,7 @@ Workgroup Manager - Resource Allocation:
      description.
    unit: Percent
  Scheduler-Pipe Stall Rate:
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
      due to occupancy limitations (like a lack of a CU or :ref:`SIMD <desc-valu>`
@@ -719,7 +719,7 @@ Wavefront Launch Stats:
      block size.
    unit: Work-Items
  Total Wavefronts:
-    rst: |-
+    rst: >-
      The total number of wavefronts launched as part of the kernel dispatch.
      On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront
      size is always 64 work-items. Thus, the total number of wavefronts should
@@ -734,25 +734,25 @@ Wavefront Launch Stats:
      <https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
    unit: Wavefronts
  VGPRs:
-    rst: |-
+    rst: >-
      The number of architected vector general-purpose registers allocated for the
      kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
      number of VGPRs requested by the compiler due to allocation granularity.
    unit: VGPRs
  AGPRs:
-    rst: |-
+    rst: >-
      The number of accumulation vector general-purpose registers allocated
      for the kernel, see :ref:`AGPRs <desc-agprs>`. Note: this may not exactly match
      the number of AGPRs requested by the compiler due to allocation granularity.
    unit: AGPRs
  SGPRs:
-    rst: |-
+    rst: >-
      The number of scalar general-purpose registers allocated for the kernel, see
      :ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
      SGPRs requested by the compiler due to allocation granularity.
    unit: SGPRs
  LDS Allocation:
-    rst: |-
+    rst: >-
      The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
      allocated for this kernel. Note: This may also be larger than what was requested
      at compile time due to both allocation granularity and dynamic per-dispatch
@@ -775,7 +775,7 @@ Wavefront Runtime Stats:
      This is averaged over all wavefronts in a kernel dispatch.
    unit: Instructions per wavefront
  Wave Cycles:
-    rst: |-
+    rst: >-
      The number of cycles a wavefront in the kernel dispatch spent resident
      on a compute unit per :ref:`normalization unit <normalization-units>`. This is
      averaged over all wavefronts in a kernel dispatch. Note: this should not
@@ -813,7 +813,7 @@ Wavefront Runtime Stats:
      the total Wave Cycles metric.
    unit: Cycles per normalization unit
  Wavefront Occupancy:
-    rst: |-
+    rst: >-
      The time-averaged number of wavefronts resident on the accelerator over the
      lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms).
@@ -915,7 +915,7 @@ VALU Arithmetic Instruction Mix:
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  Conversion:
-    rst: |-
+    rst: >-
      The total number of type conversion instructions (such as converting data
      to or from F32\u2194F64) issued to the VALU per :ref:`normalization unit
      <normalization-units>`.
@@ -989,14 +989,14 @@ MFMA Arithmetic Instruction Mix:
    unit: Instructions per normalization unit
 Compute Speed-of-Light:
  VALU FLOPs:
-    rst: |-
+    rst: >-
      The total floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is also presented as a percent of the peak theoretical
      FLOPs achievable on the specific accelerator. Note: this does not include
      any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  VALU IOPs:
-    rst: |-
+    rst: >-
      The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
      This is also presented as a percent of the peak theoretical IOPs achievable
      on the specific accelerator. Note: this does not include any integer operations
@@ -1006,7 +1006,7 @@ Compute Speed-of-Light:
    rst: ''
    unit: Unknown
  MFMA FLOPs (BF16):
-    rst: |-
+    rst: >-
      The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit brain floating
      point operations from :ref:`VALU <desc-valu>` instructions. This is also
@@ -1014,7 +1014,7 @@ Compute Speed-of-Light:
      on the specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F16):
-    rst: |-
+    rst: >-
      The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1022,7 +1022,7 @@ Compute Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F32):
-    rst: |-
+    rst: >-
      The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 32-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1030,7 +1030,7 @@ Compute Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F64):
-    rst: |-
+    rst: >-
      The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 64-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1042,7 +1042,7 @@ Compute Speed-of-Light:
      achievable on the specific accelerator.
    unit: GFLOPs
  MFMA IOPs (INT8):
-    rst: |-
+    rst: >-
      The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
      per second. Note: this does not include any 8-bit integer operations from
      :ref:`VALU <desc-valu>` instructions. This is also presented as a percent
@@ -1131,7 +1131,7 @@ Arithmetic Operations:
      unit <normalization-units>`.
    unit: FLOP per normalization unit
  BF16 OPs:
-    rst: |-
+    rst: >-
      The total number of 16-bit brain floating-point operations executed on
      either the :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
      unit <normalization-units>`. Note: on current CDNA accelerators, the VALU
@@ -1148,7 +1148,7 @@ Arithmetic Operations:
      unit <normalization-units>`.
    unit: FLOP per normalization unit
  INT8 OPs:
-    rst: |-
+    rst: >-
      The total number of 8-bit integer operations executed on either the :ref:`VALU
      <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
      <normalization-units>`. Note: on current CDNA accelerators, the VALU has
@@ -1225,7 +1225,7 @@ LDS Statistics:
      to stalls from non-dword aligned addresses per :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
  Mem Violations:
-    rst: |-
+    rst: >-
      The total number of out-of-bounds accesses made to the LDS, per :ref:`normalization
      unit <normalization-units>`. This is unused and expected to be zero in
      most configurations for modern CDNA\u2122 accelerators.
@@ -1242,7 +1242,7 @@ L1I Speed-of-Light:
      over the number of all L1I requests.
    unit: Percent
  L1I-L2 Bandwidth Utilization:
-    rst: |-
+    rst: >-
      The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
      achieved. Calculated as the ratio of the total number of requests from
      the L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`.
@@ -1349,7 +1349,7 @@ Scalar L1D cache accesses:
    unit: Requests per normalization unit
 Scalar L1D Cache - L2 Interface:
  sL1D-L2 BW:
-    rst: |-
+    rst: >-
      The total number of bytes read from, written to, or atomically updated
      across the sL1D\u2194:doc:`L2 <l2-cache>` interface, divided by total duration.
      Note that sL1D writes and atomics are typically
@@ -1371,7 +1371,7 @@ Scalar L1D Cache - L2 Interface:
      CDNA accelerators.
    unit: Requests per normalization unit
  Stall Cycles:
-    rst: |-
+    rst: >-
      The total number of cycles the sL1D\u2194 :doc:`L2 <l2-cache>` interface
      was stalled, per :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
@@ -1681,7 +1681,7 @@ L1 Unified Translation Cache (UTCL1):
      translation not being present in the cache, per :ref:`normalization unit <normalization-units>`.
    unit: unit
  Permission Misses:
-    rst: |-
+    rst: >-
      The total number of translation requests that missed in the UTCL1 due
      to a permission error, per :ref:`normalization unit <normalization-units>`.
      This is unused and expected to be zero in most configurations for modern
@@ -1,20 +1,20 @@
 System Speed-of-Light:
  VALU FLOPs:
-    rst: |-
+    rst: >-
      The total floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is also presented as a percent of the peak theoretical
      FLOPs achievable on the specific accelerator. Note: this does not include
      any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  VALU IOPs:
-    rst: |-
+    rst: >-
      The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
      This is also presented as a percent of the peak theoretical IOPs achievable
      on the specific accelerator. Note: this does not include any integer operations
      from :ref:`MFMA <desc-mfma>` instructions.
    unit: GOIPs
  MFMA FLOPs (F8):
-    rst: |-
+    rst: >-
      The total number of 8-bit brain floating point :ref:`MFMA <desc-mfma>`
      operations executed per second. Note: this does not include any 16-bit brain
      floating point operations from :ref:`VALU <desc-valu>` instructions. This
@@ -23,7 +23,7 @@ System Speed-of-Light:
      series and later only.
    unit: GFLOPs
  MFMA FLOPs (BF16):
-    rst: |-
+    rst: >-
      The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
      operations executed per second. Note: this does not include any 16-bit brain
      floating point operations from :ref:`VALU <desc-valu>` instructions. This
@@ -31,7 +31,7 @@ System Speed-of-Light:
      achievable on the specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F16):
-    rst: |-
+    rst: >-
      The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -39,7 +39,7 @@ System Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F32):
-    rst: |-
+    rst: >-
      The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 32-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -47,7 +47,7 @@ System Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F64):
-    rst: |-
+    rst: >-
      The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 64-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -55,7 +55,7 @@ System Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA IOPs (Int8):
-    rst: |-
+    rst: >-
      The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
      per second. Note: this does not include any 8-bit integer operations from
      :ref:`VALU <desc-valu>` instructions. This is also presented as a percent
@@ -108,7 +108,7 @@ System Speed-of-Light:
      over the :ref:`total active CU cycles <total-active-cu-cycles>`.
    unit: Instructions per-cycle
  Wavefront Occupancy:
-    rst: |-
+    rst: >-
      The time-averaged number of wavefronts resident on the accelerator over
      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms). This is also presented as a percent of the peak theoretical
@@ -153,7 +153,7 @@ System Speed-of-Light:
      peak theoretical bandwidth achievable on the specific accelerator.
    unit: GB/s
  L2-Fabric Read BW:
-    rst: |-
+    rst: >-
      The number of bytes read by the L2 over the :ref:`Infinity Fabric\u2122
      interface <l2-fabric>` per unit time. This is also presented as a percent
      of the peak theoretical bandwidth achievable on the specific accelerator.
@@ -238,19 +238,19 @@ Memory Chart:
    rst: Total number of compute units (CUs) on the accelerator.
    unit: CUs
  VGPR:
-    rst: |-
+    rst: >-
      The number of architected vector general-purpose registers allocated for the
      kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
      number of VGPRs requested by the compiler due to allocation granularity.
    unit: VGPRs
  SGPR:
-    rst: |-
+    rst: >-
      The number of scalar general-purpose registers allocated for the kernel, see
      :ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
      SGPRs requested by the compiler due to allocation granularity.
    unit: SGPRs
  LDS Allocation:
-    rst: |-
+    rst: >-
      The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
      allocated for this kernel. Note: This may also be larger than what was requested
      at compile time due to both allocation granularity and dynamic per-dispatch
@@ -419,28 +419,28 @@ Memory Chart:
    unit: Requests per normalization unit
 Roofline Performance Rates:
  VALU FLOPs (F16):
-    rst: |-
+    rst: >-
      The total 16-bit floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is presented with the value of the peak empirical F16 FLOPs achievable
      on the specific accelerator. Note: this does not include any F16 operations
      from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  VALU FLOPs (F32):
-    rst: |-
+    rst: >-
      The total 32-bit floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is presented with the value of the peak empirical F32 FLOPs achievable
      on the specific accelerator. Note: this does not include any F32 operations
      from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  VALU FLOPs (F64):
-    rst: |-
+    rst: >-
      The total 64-bit floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is presented with the value of the peak empirical F64 FLOPs achievable
      on the specific accelerator. Note: this does not include any F64 operations
      from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  MFMA FLOPs (F64):
-    rst: |-
+    rst: >-
      The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 64-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -448,7 +448,7 @@ Roofline Performance Rates:
      displayed alongside for comparison.
    unit: GFLOPs
  MFMA FLOPs (F32):
-    rst: |-
+    rst: >-
      The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 32-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -456,7 +456,7 @@ Roofline Performance Rates:
      displayed alongside for comparison.
    unit: GFLOPs
  MFMA FLOPs (F16):
-    rst: |-
+    rst: >-
      The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -464,7 +464,7 @@ Roofline Performance Rates:
      displayed alongside for comparison.
    unit: GFLOPs
  MFMA FLOPs (BF16):
-    rst: |-
+    rst: >-
      The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
      operations executed per second. Note: this does not include any 16-bit brain
      floating point operations from :ref:`VALU <desc-valu>` instructions. The
@@ -472,7 +472,7 @@ Roofline Performance Rates:
      accelerator is displayed alongside for comparison.
    unit: GFLOPs
  MFMA FLOPs (F8):
-    rst: |-
+    rst: >-
      The total number of 8-bit brain floating point :ref:`MFMA <desc-mfma>`
      operations executed per second. Note: this does not include any 16-bit brain
      floating point operations from :ref:`VALU <desc-valu>` instructions. The
@@ -481,7 +481,7 @@ Roofline Performance Rates:
      Instinct MI300 series and later only.
    unit: GFLOPs
  MFMA IOPs (Int8):
-    rst: |-
+    rst: >-
      The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
      per second. Note: this does not include any 8-bit integer operations from
      :ref:`VALU <desc-valu>` instructions. The peak empirically measured INT8 MFMA
@@ -489,7 +489,7 @@ Roofline Performance Rates:
      for comparison.
    unit: GIOPs
  HBM Bandwidth:
-    rst: |-
+    rst: >-
      The total number of bytes read from and written to High-Bandwidth
        Memory (HBM) per second. The peak empirically measured bandwidth achievable
        on the specific accelerator is displayed alongside for comparison.
@@ -520,28 +520,28 @@ Roofline Performance Rates:
    unit: GB/s
 Roofline Plot Points:
  AI HBM:
-    rst: |-
+    rst: >-
      The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
      It is the ratio of total floating-point operations (FLOPs) to total bytes
      transferred between HBM and the L2 cache. This value is used as the x-coordinate
      for the HBM roofline.
    unit: FLOPs/Byte
  AI L2:
-    rst: |-
+    rst: >-
      The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
      of total floating-point operations (FLOPs) to total bytes transferred between
      the L2 cache and the L1 cache. This value is used as the x-coordinate for
      the L2 roofline.
    unit: FLOPs/Byte
  AI L1:
-    rst: |-
+    rst: >-
      The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
      of total floating-point operations (FLOPs) to total bytes transferred between
      the L1 cache and the processing units. This value is used as the x-coordinate
      for the L1 roofline.
    unit: FLOPs/Byte
  Performance (GFLOPs):
-    rst: |-
+    rst: >-
      The overall achieved performance, measured in GigaFLOPs
      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
      operations divided by the total execution time. This value is used as the y-coordinate
@@ -599,7 +599,7 @@ Workgroup manager utilizations:
      any work.
    unit: Percent
  Scheduler-Pipe Utilization:
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where the scheduler-pipes were actively doing any work. Note: this
      value is expected to range between 0% and 25%. See :ref:`desc-spi`.
@@ -637,7 +637,7 @@ Workgroup manager utilizations:
    unit: Cycles/wave
 Workgroup Manager - Resource Allocation:
  Not-scheduled Rate (Workgroup Manager):
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
      due to a bottleneck within the workgroup manager rather than a lack of a
@@ -646,7 +646,7 @@ Workgroup Manager - Resource Allocation:
      description.
    unit: Percent
  Not-scheduled Rate (Scheduler-Pipe):
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
      due to a bottleneck within the scheduler-pipes rather than a lack of a CU
@@ -655,7 +655,7 @@ Workgroup Manager - Resource Allocation:
      description.
    unit: Percent
  Scheduler-Pipe Stall Rate:
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
      due to occupancy limitations (like a lack of a CU or :ref:`SIMD <desc-valu>`
@@ -719,7 +719,7 @@ Wavefront Launch Stats:
      block size.
    unit: Work-Items
  Total Wavefronts:
-    rst: |-
+    rst: >-
      The total number of wavefronts launched as part of the kernel dispatch.
      On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront
      size is always 64 work-items. Thus, the total number of wavefronts should
@@ -734,25 +734,25 @@ Wavefront Launch Stats:
      <https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
    unit: Wavefronts
  VGPRs:
-    rst: |-
+    rst: >-
      The number of architected vector general-purpose registers allocated for the
      kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
      number of VGPRs requested by the compiler due to allocation granularity.
    unit: VGPRs
  AGPRs:
-    rst: |-
+    rst: >-
      The number of accumulation vector general-purpose registers allocated
      for the kernel, see :ref:`AGPRs <desc-agprs>`. Note: this may not exactly match
      the number of AGPRs requested by the compiler due to allocation granularity.
    unit: AGPRs
  SGPRs:
-    rst: |-
+    rst: >-
      The number of scalar general-purpose registers allocated for the kernel, see
      :ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
      SGPRs requested by the compiler due to allocation granularity.
    unit: SGPRs
  LDS Allocation:
-    rst: |-
+    rst: >-
      The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
      allocated for this kernel. Note: This may also be larger than what was requested
      at compile time due to both allocation granularity and dynamic per-dispatch
@@ -775,7 +775,7 @@ Wavefront Runtime Stats:
      This is averaged over all wavefronts in a kernel dispatch.
    unit: Instructions per wavefront
  Wave Cycles:
-    rst: |-
+    rst: >-
      The number of cycles a wavefront in the kernel dispatch spent resident
      on a compute unit per :ref:`normalization unit <normalization-units>`. This is
      averaged over all wavefronts in a kernel dispatch. Note: this should not
@@ -813,7 +813,7 @@ Wavefront Runtime Stats:
      the total Wave Cycles metric.
    unit: Cycles per normalization unit
  Wavefront Occupancy:
-    rst: |-
+    rst: >-
      The time-averaged number of wavefronts resident on the accelerator over the
      lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms).
@@ -915,7 +915,7 @@ VALU Arithmetic Instruction Mix:
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  Conversion:
-    rst: |-
+    rst: >-
      The total number of type conversion instructions (such as converting data
      to or from F32\u2194F64) issued to the VALU per :ref:`normalization unit
      <normalization-units>`.
@@ -989,14 +989,14 @@ MFMA Arithmetic Instruction Mix:
    unit: Instructions per normalization unit
 Compute Speed-of-Light:
  VALU FLOPs:
-    rst: |-
+    rst: >-
      The total floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is also presented as a percent of the peak theoretical
      FLOPs achievable on the specific accelerator. Note: this does not include
      any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  VALU IOPs:
-    rst: |-
+    rst: >-
      The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
      This is also presented as a percent of the peak theoretical IOPs achievable
      on the specific accelerator. Note: this does not include any integer operations
@@ -1006,7 +1006,7 @@ Compute Speed-of-Light:
    rst: ''
    unit: Unknown
  MFMA FLOPs (BF16):
-    rst: |-
+    rst: >-
      The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit brain floating
      point operations from :ref:`VALU <desc-valu>` instructions. This is also
@@ -1014,7 +1014,7 @@ Compute Speed-of-Light:
      on the specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F16):
-    rst: |-
+    rst: >-
      The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1022,7 +1022,7 @@ Compute Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F32):
-    rst: |-
+    rst: >-
      The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 32-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1030,7 +1030,7 @@ Compute Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F64):
-    rst: |-
+    rst: >-
      The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 64-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1042,7 +1042,7 @@ Compute Speed-of-Light:
      achievable on the specific accelerator.
    unit: GFLOPs
  MFMA IOPs (INT8):
-    rst: |-
+    rst: >-
      The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
      per second. Note: this does not include any 8-bit integer operations from
      :ref:`VALU <desc-valu>` instructions. This is also presented as a percent
@@ -1131,7 +1131,7 @@ Arithmetic Operations:
      unit <normalization-units>`.
    unit: FLOP per normalization unit
  BF16 OPs:
-    rst: |-
+    rst: >-
      The total number of 16-bit brain floating-point operations executed on
      either the :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
      unit <normalization-units>`. Note: on current CDNA accelerators, the VALU
@@ -1148,7 +1148,7 @@ Arithmetic Operations:
      unit <normalization-units>`.
    unit: FLOP per normalization unit
  INT8 OPs:
-    rst: |-
+    rst: >-
      The total number of 8-bit integer operations executed on either the :ref:`VALU
      <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
      <normalization-units>`. Note: on current CDNA accelerators, the VALU has
@@ -1225,7 +1225,7 @@ LDS Statistics:
      to stalls from non-dword aligned addresses per :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
  Mem Violations:
-    rst: |-
+    rst: >-
      The total number of out-of-bounds accesses made to the LDS, per :ref:`normalization
      unit <normalization-units>`. This is unused and expected to be zero in
      most configurations for modern CDNA\u2122 accelerators.
@@ -1242,7 +1242,7 @@ L1I Speed-of-Light:
      over the number of all L1I requests.
    unit: Percent
  L1I-L2 Bandwidth Utilization:
-    rst: |-
+    rst: >-
      The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
      achieved. Calculated as the ratio of the total number of requests from
      the L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`.
@@ -1349,7 +1349,7 @@ Scalar L1D cache accesses:
    unit: Requests per normalization unit
 Scalar L1D Cache - L2 Interface:
  sL1D-L2 BW:
-    rst: |-
+    rst: >-
      The total number of bytes read from, written to, or atomically updated
      across the sL1D\u2194:doc:`L2 <l2-cache>` interface, divided by total duration.
      Note that sL1D writes and atomics are typically
@@ -1371,7 +1371,7 @@ Scalar L1D Cache - L2 Interface:
      CDNA accelerators.
    unit: Requests per normalization unit
  Stall Cycles:
-    rst: |-
+    rst: >-
      The total number of cycles the sL1D\u2194 :doc:`L2 <l2-cache>` interface
      was stalled, per :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
@@ -1681,7 +1681,7 @@ L1 Unified Translation Cache (UTCL1):
      translation not being present in the cache, per :ref:`normalization unit <normalization-units>`.
    unit: unit
  Permission Misses:
-    rst: |-
+    rst: >-
      The total number of translation requests that missed in the UTCL1 due
      to a permission error, per :ref:`normalization unit <normalization-units>`.
      This is unused and expected to be zero in most configurations for modern
@@ -1,20 +1,20 @@
 System Speed-of-Light:
  VALU FLOPs:
-    rst: |-
+    rst: >-
      The total floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is also presented as a percent of the peak theoretical
      FLOPs achievable on the specific accelerator. Note: this does not include
      any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  VALU IOPs:
-    rst: |-
+    rst: >-
      The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
      This is also presented as a percent of the peak theoretical IOPs achievable
      on the specific accelerator. Note: this does not include any integer operations
      from :ref:`MFMA <desc-mfma>` instructions.
    unit: GOIPs
  MFMA FLOPs (F8):
-    rst: |-
+    rst: >-
      The total number of 8-bit brain floating point :ref:`MFMA <desc-mfma>`
      operations executed per second. Note: this does not include any 16-bit brain
      floating point operations from :ref:`VALU <desc-valu>` instructions. This
@@ -23,7 +23,7 @@ System Speed-of-Light:
      series and later only.
    unit: GFLOPs
  MFMA FLOPs (BF16):
-    rst: |-
+    rst: >-
      The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
      operations executed per second. Note: this does not include any 16-bit brain
      floating point operations from :ref:`VALU <desc-valu>` instructions. This
@@ -31,7 +31,7 @@ System Speed-of-Light:
      achievable on the specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F16):
-    rst: |-
+    rst: >-
      The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -39,7 +39,7 @@ System Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F32):
-    rst: |-
+    rst: >-
      The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 32-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -47,7 +47,7 @@ System Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F64):
-    rst: |-
+    rst: >-
      The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 64-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -55,7 +55,7 @@ System Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA IOPs (Int8):
-    rst: |-
+    rst: >-
      The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
      per second. Note: this does not include any 8-bit integer operations from
      :ref:`VALU <desc-valu>` instructions. This is also presented as a percent
@@ -108,7 +108,7 @@ System Speed-of-Light:
      over the :ref:`total active CU cycles <total-active-cu-cycles>`.
    unit: Instructions per-cycle
  Wavefront Occupancy:
-    rst: |-
+    rst: >-
      The time-averaged number of wavefronts resident on the accelerator over
      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms). This is also presented as a percent of the peak theoretical
@@ -153,7 +153,7 @@ System Speed-of-Light:
      peak theoretical bandwidth achievable on the specific accelerator.
    unit: GB/s
  L2-Fabric Read BW:
-    rst: |-
+    rst: >-
      The number of bytes read by the L2 over the :ref:`Infinity Fabric\u2122
      interface <l2-fabric>` per unit time. This is also presented as a percent
      of the peak theoretical bandwidth achievable on the specific accelerator.
@@ -238,19 +238,19 @@ Memory Chart:
    rst: Total number of compute units (CUs) on the accelerator.
    unit: CUs
  VGPR:
-    rst: |-
+    rst: >-
      The number of architected vector general-purpose registers allocated for the
      kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
      number of VGPRs requested by the compiler due to allocation granularity.
    unit: VGPRs
  SGPR:
-    rst: |-
+    rst: >-
      The number of scalar general-purpose registers allocated for the kernel, see
      :ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
      SGPRs requested by the compiler due to allocation granularity.
    unit: SGPRs
  LDS Allocation:
-    rst: |-
+    rst: >-
      The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
      allocated for this kernel. Note: This may also be larger than what was requested
      at compile time due to both allocation granularity and dynamic per-dispatch
@@ -419,28 +419,28 @@ Memory Chart:
    unit: Requests per normalization unit
 Roofline Performance Rates:
  VALU FLOPs (F16):
-    rst: |-
+    rst: >-
      The total 16-bit floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is presented with the value of the peak empirical F16 FLOPs achievable
      on the specific accelerator. Note: this does not include any F16 operations
      from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  VALU FLOPs (F32):
-    rst: |-
+    rst: >-
      The total 32-bit floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is presented with the value of the peak empirical F32 FLOPs achievable
      on the specific accelerator. Note: this does not include any F32 operations
      from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  VALU FLOPs (F64):
-    rst: |-
+    rst: >-
      The total 64-bit floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is presented with the value of the peak empirical F64 FLOPs achievable
      on the specific accelerator. Note: this does not include any F64 operations
      from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  MFMA FLOPs (F64):
-    rst: |-
+    rst: >-
      The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 64-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -448,7 +448,7 @@ Roofline Performance Rates:
      displayed alongside for comparison.
    unit: GFLOPs
  MFMA FLOPs (F32):
-    rst: |-
+    rst: >-
      The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 32-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -456,7 +456,7 @@ Roofline Performance Rates:
      displayed alongside for comparison.
    unit: GFLOPs
  MFMA FLOPs (F16):
-    rst: |-
+    rst: >-
      The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -464,7 +464,7 @@ Roofline Performance Rates:
      displayed alongside for comparison.
    unit: GFLOPs
  MFMA FLOPs (BF16):
-    rst: |-
+    rst: >-
      The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
      operations executed per second. Note: this does not include any 16-bit brain
      floating point operations from :ref:`VALU <desc-valu>` instructions. The
@@ -472,7 +472,7 @@ Roofline Performance Rates:
      accelerator is displayed alongside for comparison.
    unit: GFLOPs
  MFMA FLOPs (F8):
-    rst: |-
+    rst: >-
      The total number of 8-bit brain floating point :ref:`MFMA <desc-mfma>`
      operations executed per second. Note: this does not include any 16-bit brain
      floating point operations from :ref:`VALU <desc-valu>` instructions. The
@@ -481,7 +481,7 @@ Roofline Performance Rates:
      Instinct MI300 series and later only.
    unit: GFLOPs
  MFMA IOPs (Int8):
-    rst: |-
+    rst: >-
      The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
      per second. Note: this does not include any 8-bit integer operations from
      :ref:`VALU <desc-valu>` instructions. The peak empirically measured INT8 MFMA
@@ -489,7 +489,7 @@ Roofline Performance Rates:
      for comparison.
    unit: GIOPs
  HBM Bandwidth:
-    rst: |-
+    rst: >-
      The total number of bytes read from and written to High-Bandwidth
        Memory (HBM) per second. The peak empirically measured bandwidth achievable
        on the specific accelerator is displayed alongside for comparison.
@@ -520,28 +520,28 @@ Roofline Performance Rates:
    unit: GB/s
 Roofline Plot Points:
  AI HBM:
-    rst: |-
+    rst: >-
      The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
      It is the ratio of total floating-point operations (FLOPs) to total bytes
      transferred between HBM and the L2 cache. This value is used as the x-coordinate
      for the HBM roofline.
    unit: FLOPs/Byte
  AI L2:
-    rst: |-
+    rst: >-
      The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
      of total floating-point operations (FLOPs) to total bytes transferred between
      the L2 cache and the L1 cache. This value is used as the x-coordinate for
      the L2 roofline.
    unit: FLOPs/Byte
  AI L1:
-    rst: |-
+    rst: >-
      The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
      of total floating-point operations (FLOPs) to total bytes transferred between
      the L1 cache and the processing units. This value is used as the x-coordinate
      for the L1 roofline.
    unit: FLOPs/Byte
  Performance (GFLOPs):
-    rst: |-
+    rst: >-
      The overall achieved performance, measured in GigaFLOPs
      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
      operations divided by the total execution time. This value is used as the y-coordinate
@@ -599,7 +599,7 @@ Workgroup manager utilizations:
      any work.
    unit: Percent
  Scheduler-Pipe Utilization:
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where the scheduler-pipes were actively doing any work. Note: this
      value is expected to range between 0% and 25%. See :ref:`desc-spi`.
@@ -637,7 +637,7 @@ Workgroup manager utilizations:
    unit: Cycles/wave
 Workgroup Manager - Resource Allocation:
  Not-scheduled Rate (Workgroup Manager):
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
      due to a bottleneck within the workgroup manager rather than a lack of a
@@ -646,7 +646,7 @@ Workgroup Manager - Resource Allocation:
      description.
    unit: Percent
  Not-scheduled Rate (Scheduler-Pipe):
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
      due to a bottleneck within the scheduler-pipes rather than a lack of a CU
@@ -655,7 +655,7 @@ Workgroup Manager - Resource Allocation:
      description.
    unit: Percent
  Scheduler-Pipe Stall Rate:
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
      due to occupancy limitations (like a lack of a CU or :ref:`SIMD <desc-valu>`
@@ -719,7 +719,7 @@ Wavefront Launch Stats:
      block size.
    unit: Work-Items
  Total Wavefronts:
-    rst: |-
+    rst: >-
      The total number of wavefronts launched as part of the kernel dispatch.
      On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront
      size is always 64 work-items. Thus, the total number of wavefronts should
@@ -734,25 +734,25 @@ Wavefront Launch Stats:
      <https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
    unit: Wavefronts
  VGPRs:
-    rst: |-
+    rst: >-
      The number of architected vector general-purpose registers allocated for the
      kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
      number of VGPRs requested by the compiler due to allocation granularity.
    unit: VGPRs
  AGPRs:
-    rst: |-
+    rst: >-
      The number of accumulation vector general-purpose registers allocated
      for the kernel, see :ref:`AGPRs <desc-agprs>`. Note: this may not exactly match
      the number of AGPRs requested by the compiler due to allocation granularity.
    unit: AGPRs
  SGPRs:
-    rst: |-
+    rst: >-
      The number of scalar general-purpose registers allocated for the kernel, see
      :ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
      SGPRs requested by the compiler due to allocation granularity.
    unit: SGPRs
  LDS Allocation:
-    rst: |-
+    rst: >-
      The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
      allocated for this kernel. Note: This may also be larger than what was requested
      at compile time due to both allocation granularity and dynamic per-dispatch
@@ -775,7 +775,7 @@ Wavefront Runtime Stats:
      This is averaged over all wavefronts in a kernel dispatch.
    unit: Instructions per wavefront
  Wave Cycles:
-    rst: |-
+    rst: >-
      The number of cycles a wavefront in the kernel dispatch spent resident
      on a compute unit per :ref:`normalization unit <normalization-units>`. This is
      averaged over all wavefronts in a kernel dispatch. Note: this should not
@@ -813,7 +813,7 @@ Wavefront Runtime Stats:
      the total Wave Cycles metric.
    unit: Cycles per normalization unit
  Wavefront Occupancy:
-    rst: |-
+    rst: >-
      The time-averaged number of wavefronts resident on the accelerator over the
      lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms).
@@ -915,7 +915,7 @@ VALU Arithmetic Instruction Mix:
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  Conversion:
-    rst: |-
+    rst: >-
      The total number of type conversion instructions (such as converting data
      to or from F32\u2194F64) issued to the VALU per :ref:`normalization unit
      <normalization-units>`.
@@ -989,14 +989,14 @@ MFMA Arithmetic Instruction Mix:
    unit: Instructions per normalization unit
 Compute Speed-of-Light:
  VALU FLOPs:
-    rst: |-
+    rst: >-
      The total floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is also presented as a percent of the peak theoretical
      FLOPs achievable on the specific accelerator. Note: this does not include
      any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  VALU IOPs:
-    rst: |-
+    rst: >-
      The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
      This is also presented as a percent of the peak theoretical IOPs achievable
      on the specific accelerator. Note: this does not include any integer operations
@@ -1006,7 +1006,7 @@ Compute Speed-of-Light:
    rst: ''
    unit: Unknown
  MFMA FLOPs (BF16):
-    rst: |-
+    rst: >-
      The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit brain floating
      point operations from :ref:`VALU <desc-valu>` instructions. This is also
@@ -1014,7 +1014,7 @@ Compute Speed-of-Light:
      on the specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F16):
-    rst: |-
+    rst: >-
      The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1022,7 +1022,7 @@ Compute Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F32):
-    rst: |-
+    rst: >-
      The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 32-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1030,7 +1030,7 @@ Compute Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F64):
-    rst: |-
+    rst: >-
      The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 64-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1042,7 +1042,7 @@ Compute Speed-of-Light:
      achievable on the specific accelerator.
    unit: GFLOPs
  MFMA IOPs (INT8):
-    rst: |-
+    rst: >-
      The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
      per second. Note: this does not include any 8-bit integer operations from
      :ref:`VALU <desc-valu>` instructions. This is also presented as a percent
@@ -1131,7 +1131,7 @@ Arithmetic Operations:
      unit <normalization-units>`.
    unit: FLOP per normalization unit
  BF16 OPs:
-    rst: |-
+    rst: >-
      The total number of 16-bit brain floating-point operations executed on
      either the :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
      unit <normalization-units>`. Note: on current CDNA accelerators, the VALU
@@ -1148,7 +1148,7 @@ Arithmetic Operations:
      unit <normalization-units>`.
    unit: FLOP per normalization unit
  INT8 OPs:
-    rst: |-
+    rst: >-
      The total number of 8-bit integer operations executed on either the :ref:`VALU
      <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
      <normalization-units>`. Note: on current CDNA accelerators, the VALU has
@@ -1225,7 +1225,7 @@ LDS Statistics:
      to stalls from non-dword aligned addresses per :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
  Mem Violations:
-    rst: |-
+    rst: >-
      The total number of out-of-bounds accesses made to the LDS, per :ref:`normalization
      unit <normalization-units>`. This is unused and expected to be zero in
      most configurations for modern CDNA\u2122 accelerators.
@@ -1242,7 +1242,7 @@ L1I Speed-of-Light:
      over the number of all L1I requests.
    unit: Percent
  L1I-L2 Bandwidth Utilization:
-    rst: |-
+    rst: >-
      The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
      achieved. Calculated as the ratio of the total number of requests from
      the L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`.
@@ -1349,7 +1349,7 @@ Scalar L1D cache accesses:
    unit: Requests per normalization unit
 Scalar L1D Cache - L2 Interface:
  sL1D-L2 BW:
-    rst: |-
+    rst: >-
      The total number of bytes read from, written to, or atomically updated
      across the sL1D\u2194:doc:`L2 <l2-cache>` interface, divided by total duration.
      Note that sL1D writes and atomics are typically
@@ -1371,7 +1371,7 @@ Scalar L1D Cache - L2 Interface:
      CDNA accelerators.
    unit: Requests per normalization unit
  Stall Cycles:
-    rst: |-
+    rst: >-
      The total number of cycles the sL1D\u2194 :doc:`L2 <l2-cache>` interface
      was stalled, per :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
@@ -1681,7 +1681,7 @@ L1 Unified Translation Cache (UTCL1):
      translation not being present in the cache, per :ref:`normalization unit <normalization-units>`.
    unit: unit
  Permission Misses:
-    rst: |-
+    rst: >-
      The total number of translation requests that missed in the UTCL1 due
      to a permission error, per :ref:`normalization unit <normalization-units>`.
      This is unused and expected to be zero in most configurations for modern
@@ -1,20 +1,20 @@
 System Speed-of-Light:
  VALU FLOPs:
-    rst: |-
+    rst: >-
      The total floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is also presented as a percent of the peak theoretical
      FLOPs achievable on the specific accelerator. Note: this does not include
      any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  VALU IOPs:
-    rst: |-
+    rst: >-
      The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
      This is also presented as a percent of the peak theoretical IOPs achievable
      on the specific accelerator. Note: this does not include any integer operations
      from :ref:`MFMA <desc-mfma>` instructions.
    unit: GOIPs
  MFMA FLOPs (F8):
-    rst: |-
+    rst: >-
      The total number of 8-bit brain floating point :ref:`MFMA <desc-mfma>`
      operations executed per second. Note: this does not include any 16-bit brain
      floating point operations from :ref:`VALU <desc-valu>` instructions. This
@@ -23,7 +23,7 @@ System Speed-of-Light:
      series and later only.
    unit: GFLOPs
  MFMA FLOPs (BF16):
-    rst: |-
+    rst: >-
      The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
      operations executed per second. Note: this does not include any 16-bit brain
      floating point operations from :ref:`VALU <desc-valu>` instructions. This
@@ -31,7 +31,7 @@ System Speed-of-Light:
      achievable on the specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F16):
-    rst: |-
+    rst: >-
      The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -39,7 +39,7 @@ System Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F32):
-    rst: |-
+    rst: >-
      The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 32-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -47,7 +47,7 @@ System Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F64):
-    rst: |-
+    rst: >-
      The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 64-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -58,7 +58,7 @@ System Speed-of-Light:
    rst: ''
    unit: Unknown
  MFMA IOPs (Int8):
-    rst: |-
+    rst: >-
      The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
      per second. Note: this does not include any 8-bit integer operations from
      :ref:`VALU <desc-valu>` instructions. This is also presented as a percent
@@ -111,7 +111,7 @@ System Speed-of-Light:
      over the :ref:`total active CU cycles <total-active-cu-cycles>`.
    unit: Instructions per-cycle
  Wavefront Occupancy:
-    rst: |-
+    rst: >-
      The time-averaged number of wavefronts resident on the accelerator over
      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms). This is also presented as a percent of the peak theoretical
@@ -156,7 +156,7 @@ System Speed-of-Light:
      peak theoretical bandwidth achievable on the specific accelerator.
    unit: GB/s
  L2-Fabric Read BW:
-    rst: |-
+    rst: >-
      The number of bytes read by the L2 over the :ref:`Infinity Fabric\u2122
      interface <l2-fabric>` per unit time. This is also presented as a percent
      of the peak theoretical bandwidth achievable on the specific accelerator.
@@ -241,19 +241,19 @@ Memory Chart:
    rst: Total number of compute units (CUs) on the accelerator.
    unit: CUs
  VGPR:
-    rst: |-
+    rst: >-
      The number of architected vector general-purpose registers allocated for the
      kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
      number of VGPRs requested by the compiler due to allocation granularity.
    unit: VGPRs
  SGPR:
-    rst: |-
+    rst: >-
      The number of scalar general-purpose registers allocated for the kernel, see
      :ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
      SGPRs requested by the compiler due to allocation granularity.
    unit: SGPRs
  LDS Allocation:
-    rst: |-
+    rst: >-
      The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
      allocated for this kernel. Note: This may also be larger than what was requested
      at compile time due to both allocation granularity and dynamic per-dispatch
@@ -432,28 +432,28 @@ Memory Chart:
    unit: Requests per normalization unit
 Roofline Performance Rates:
  VALU FLOPs (F16):
-    rst: |-
+    rst: >-
      The total 16-bit floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is presented with the value of the peak empirical F16 FLOPs achievable
      on the specific accelerator. Note: this does not include any F16 operations
      from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  VALU FLOPs (F32):
-    rst: |-
+    rst: >-
      The total 32-bit floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is presented with the value of the peak empirical F32 FLOPs achievable
      on the specific accelerator. Note: this does not include any F32 operations
      from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  VALU FLOPs (F64):
-    rst: |-
+    rst: >-
      The total 64-bit floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is presented with the value of the peak empirical F64 FLOPs achievable
      on the specific accelerator. Note: this does not include any F64 operations
      from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  MFMA FLOPs (F64):
-    rst: |-
+    rst: >-
      The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 64-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -461,7 +461,7 @@ Roofline Performance Rates:
      displayed alongside for comparison.
    unit: GFLOPs
  MFMA FLOPs (F32):
-    rst: |-
+    rst: >-
      The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 32-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -469,7 +469,7 @@ Roofline Performance Rates:
      displayed alongside for comparison.
    unit: GFLOPs
  MFMA FLOPs (F16):
-    rst: |-
+    rst: >-
      The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -477,7 +477,7 @@ Roofline Performance Rates:
      displayed alongside for comparison.
    unit: GFLOPs
  MFMA FLOPs (BF16):
-    rst: |-
+    rst: >-
      The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
      operations executed per second. Note: this does not include any 16-bit brain
      floating point operations from :ref:`VALU <desc-valu>` instructions. The
@@ -485,7 +485,7 @@ Roofline Performance Rates:
      accelerator is displayed alongside for comparison.
    unit: GFLOPs
  MFMA FLOPs (F8):
-    rst: |-
+    rst: >-
      The total number of 8-bit brain floating point :ref:`MFMA <desc-mfma>`
      operations executed per second. Note: this does not include any 16-bit brain
      floating point operations from :ref:`VALU <desc-valu>` instructions. The
@@ -494,7 +494,7 @@ Roofline Performance Rates:
      Instinct MI300 series and later only.
    unit: GFLOPs
  MFMA FLOPs (F6F4):
-    rst: |-
+    rst: >-
      The total number of 4-bit and 6-bit floating point :ref:`MFMA <desc-mfma>`
      operations executed per second. Note: this does not include any floating point
      operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -503,7 +503,7 @@ Roofline Performance Rates:
      series (gfx950) and later only.
    unit: GFLOPs
  MFMA IOPs (Int8):
-    rst: |-
+    rst: >-
      The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
      per second. Note: this does not include any 8-bit integer operations from
      :ref:`VALU <desc-valu>` instructions. The peak empirically measured INT8 MFMA
@@ -511,7 +511,7 @@ Roofline Performance Rates:
      for comparison.
    unit: GIOPs
  HBM Bandwidth:
-    rst: |-
+    rst: >-
      The total number of bytes read from and written to High-Bandwidth
        Memory (HBM) per second. The peak empirically measured bandwidth achievable
        on the specific accelerator is displayed alongside for comparison.
@@ -542,28 +542,28 @@ Roofline Performance Rates:
    unit: GB/s
 Roofline Plot Points:
  AI HBM:
-    rst: |-
+    rst: >-
      The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
      It is the ratio of total floating-point operations (FLOPs) to total bytes
      transferred between HBM and the L2 cache. This value is used as the x-coordinate
      for the HBM roofline.
    unit: FLOPs/Byte
  AI L2:
-    rst: |-
+    rst: >-
      The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
      of total floating-point operations (FLOPs) to total bytes transferred between
      the L2 cache and the L1 cache. This value is used as the x-coordinate for
      the L2 roofline.
    unit: FLOPs/Byte
  AI L1:
-    rst: |-
+    rst: >-
      The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
      of total floating-point operations (FLOPs) to total bytes transferred between
      the L1 cache and the processing units. This value is used as the x-coordinate
      for the L1 roofline.
    unit: FLOPs/Byte
  Performance (GFLOPs):
-    rst: |-
+    rst: >-
      The overall achieved performance, measured in GigaFLOPs
      per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
      operations divided by the total execution time. This value is used as the y-coordinate
@@ -633,7 +633,7 @@ Workgroup manager utilizations:
      any work.
    unit: Percent
  Scheduler-Pipe Utilization:
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where the scheduler-pipes were actively doing any work. Note: this
      value is expected to range between 0% and 25%. See :ref:`desc-spi`.
@@ -674,7 +674,7 @@ Workgroup manager utilizations:
    unit: Cycles/wave
 Workgroup Manager - Resource Allocation:
  Not-scheduled Rate (Workgroup Manager):
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
      due to a bottleneck within the workgroup manager rather than a lack of a
@@ -683,7 +683,7 @@ Workgroup Manager - Resource Allocation:
      description.
    unit: Percent
  Not-scheduled Rate (Scheduler-Pipe):
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
      due to a bottleneck within the scheduler-pipes rather than a lack of a CU
@@ -695,7 +695,7 @@ Workgroup Manager - Resource Allocation:
    rst: ''
    unit: Unknown
  Scheduler-Pipe Stall Rate:
-    rst: |-
+    rst: >-
      The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
      in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
      due to occupancy limitations (like a lack of a CU or :ref:`SIMD <desc-valu>`
@@ -759,7 +759,7 @@ Wavefront Launch Stats:
      block size.
    unit: Work-Items
  Total Wavefronts:
-    rst: |-
+    rst: >-
      The total number of wavefronts launched as part of the kernel dispatch.
      On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront
      size is always 64 work-items. Thus, the total number of wavefronts should
@@ -774,25 +774,25 @@ Wavefront Launch Stats:
      <https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
    unit: Wavefronts
  VGPRs:
-    rst: |-
+    rst: >-
      The number of architected vector general-purpose registers allocated for the
      kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
      number of VGPRs requested by the compiler due to allocation granularity.
    unit: VGPRs
  AGPRs:
-    rst: |-
+    rst: >-
      The number of accumulation vector general-purpose registers allocated
      for the kernel, see :ref:`AGPRs <desc-agprs>`. Note: this may not exactly match
      the number of AGPRs requested by the compiler due to allocation granularity.
    unit: AGPRs
  SGPRs:
-    rst: |-
+    rst: >-
      The number of scalar general-purpose registers allocated for the kernel, see
      :ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
      SGPRs requested by the compiler due to allocation granularity.
    unit: SGPRs
  LDS Allocation:
-    rst: |-
+    rst: >-
      The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
      allocated for this kernel. Note: This may also be larger than what was requested
      at compile time due to both allocation granularity and dynamic per-dispatch
@@ -815,7 +815,7 @@ Wavefront Runtime Stats:
      This is averaged over all wavefronts in a kernel dispatch.
    unit: Instructions per wavefront
  Wave Cycles:
-    rst: |-
+    rst: >-
      The number of cycles a wavefront in the kernel dispatch spent resident
      on a compute unit per :ref:`normalization unit <normalization-units>`. This is
      averaged over all wavefronts in a kernel dispatch. Note: this should not
@@ -853,7 +853,7 @@ Wavefront Runtime Stats:
      the total Wave Cycles metric.
    unit: Cycles per normalization unit
  Wavefront Occupancy:
-    rst: |-
+    rst: >-
      The time-averaged number of wavefronts resident on the accelerator over the
      lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms).
@@ -955,7 +955,7 @@ VALU Arithmetic Instruction Mix:
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  Conversion:
-    rst: |-
+    rst: >-
      The total number of type conversion instructions (such as converting data
      to or from F32\u2194F64) issued to the VALU per :ref:`normalization unit
      <normalization-units>`.
@@ -1035,14 +1035,14 @@ MFMA Arithmetic Instruction Mix:
    unit: Unknown
 Compute Speed-of-Light:
  VALU FLOPs:
-    rst: |-
+    rst: >-
      The total floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is also presented as a percent of the peak theoretical
      FLOPs achievable on the specific accelerator. Note: this does not include
      any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  VALU IOPs:
-    rst: |-
+    rst: >-
      The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
      This is also presented as a percent of the peak theoretical IOPs achievable
      on the specific accelerator. Note: this does not include any integer operations
@@ -1052,7 +1052,7 @@ Compute Speed-of-Light:
    rst: ''
    unit: Unknown
  MFMA FLOPs (BF16):
-    rst: |-
+    rst: >-
      The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit brain floating
      point operations from :ref:`VALU <desc-valu>` instructions. This is also
@@ -1060,7 +1060,7 @@ Compute Speed-of-Light:
      on the specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F16):
-    rst: |-
+    rst: >-
      The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1068,7 +1068,7 @@ Compute Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F32):
-    rst: |-
+    rst: >-
      The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 32-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1076,7 +1076,7 @@ Compute Speed-of-Light:
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F64):
-    rst: |-
+    rst: >-
      The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 64-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1091,7 +1091,7 @@ Compute Speed-of-Light:
    rst: ''
    unit: Unknown
  MFMA IOPs (INT8):
-    rst: |-
+    rst: >-
      The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
      per second. Note: this does not include any 8-bit integer operations from
      :ref:`VALU <desc-valu>` instructions. This is also presented as a percent
@@ -1183,7 +1183,7 @@ Arithmetic Operations:
      unit <normalization-units>`.
    unit: FLOP per normalization unit
  BF16 OPs:
-    rst: |-
+    rst: >-
      The total number of 16-bit brain floating-point operations executed on
      either the :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
      unit <normalization-units>`. Note: on current CDNA accelerators, the VALU
@@ -1203,7 +1203,7 @@ Arithmetic Operations:
    rst: ''
    unit: Unknown
  INT8 OPs:
-    rst: |-
+    rst: >-
      The total number of 8-bit integer operations executed on either the :ref:`VALU
      <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
      <normalization-units>`. Note: on current CDNA accelerators, the VALU has
@@ -1298,7 +1298,7 @@ LDS Statistics:
      to stalls from non-dword aligned addresses per :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
  Mem Violations:
-    rst: |-
+    rst: >-
      The total number of out-of-bounds accesses made to the LDS, per :ref:`normalization
      unit <normalization-units>`. This is unused and expected to be zero in
      most configurations for modern CDNA\u2122 accelerators.
@@ -1321,7 +1321,7 @@ L1I Speed-of-Light:
      over the number of all L1I requests.
    unit: Percent
  L1I-L2 Bandwidth Utilization:
-    rst: |-
+    rst: >-
      The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
      achieved. Calculated as the ratio of the total number of requests from
      the L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`.
@@ -1428,7 +1428,7 @@ Scalar L1D cache accesses:
    unit: Requests per normalization unit
 Scalar L1D Cache - L2 Interface:
  sL1D-L2 BW:
-    rst: |-
+    rst: >-
      The total number of bytes read from, written to, or atomically updated
      across the sL1D\u2194:doc:`L2 <l2-cache>` interface, divided by total duration.
      Note that sL1D writes and atomics are typically
@@ -1450,7 +1450,7 @@ Scalar L1D Cache - L2 Interface:
      CDNA accelerators.
    unit: Requests per normalization unit
  Stall Cycles:
-    rst: |-
+    rst: >-
      The total number of cycles the sL1D\u2194 :doc:`L2 <l2-cache>` interface
      was stalled, per :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
@@ -1818,7 +1818,7 @@ L1 Unified Translation Cache (UTCL1):
    rst: ''
    unit: Unknown
  Permission Misses:
-    rst: |-
+    rst: >-
      The total number of translation requests that missed in the UTCL1 due
      to a permission error, per :ref:`normalization unit <normalization-units>`.
      This is unused and expected to be zero in most configurations for modern
@@ -1968,7 +1968,7 @@ L2-Fabric interface metrics:
      with return value) was returned to the L2.
    unit: Cycles
  Read Stall:
-    rst: |-
+    rst: >-
      The ratio of the total number of cycles the L2-Fabric interface was stalled
      on a read request to any destination (local HBM, remote PCIe\xAE connected
      accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_