# rocm-systems/projects/rocprofiler-compute/docs/data/metrics_description.yaml

# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
Wavefront launch stats:
  Grid Size:
    rst: The total number of work-items (or, threads) launched as a part of the kernel
      dispatch. In HIP, this is equivalent to the total grid size multiplied by the
      total workgroup (or, block) size.
    unit: Work-Items
  Workgroup Size:
    rst: The total number of work-items (or, threads) in each workgroup (or, block)
      launched as part of the kernel dispatch. In HIP, this is equivalent to the total
      block size.
    unit: Work-Items
  Total Wavefronts:
    rst: "The total number of wavefronts launched as part of the kernel dispatch.\
      \ On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront\
      \ size is always 64 work-items. Thus, the total number of wavefronts should\
      \ be equivalent to the ceiling of grid size divided by 64."
    unit: Wavefronts
  Saved Wavefronts:
    rst: The total number of wavefronts saved at a context-save. See  `cwsr_enable
      <https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
    unit: Wavefronts
  Restored Wavefronts:
    rst: The total number of wavefronts restored from a context-save. See  `cwsr_enable
      <https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
    unit: Wavefronts
  VGPRs:
    rst: 'The number of architected vector general-purpose registers allocated for  the
      kernel, see :ref:`VALU <desc-valu>`.  Note: this may not exactly  match the
      number of VGPRs requested by the compiler due to allocation  granularity.'
    unit: VGPRs
  AGPRs:
    rst: 'The number of accumulation vector general-purpose registers allocated for  the
      kernel, see :ref:`AGPRs <desc-agprs>`.  Note: this may not exactly  match the
      number of AGPRs requested by the compiler due to allocation  granularity.'
    unit: AGPRs
  SGPRs:
    rst: 'The number of scalar general-purpose registers allocated for the kernel,  see
      :ref:`SALU <desc-salu>`.  Note: this may not exactly match the number  of SGPRs
      requested by the compiler due to allocation granularity.'
    unit: SGPRs
  LDS Allocation:
    rst: 'The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared  memory)
      allocated for this kernel.  Note: This may also be larger than  what was requested
      at compile time due to both allocation granularity and  dynamic per-dispatch
      LDS allocations.'
    unit: Bytes per workgroup
  Scratch Allocation:
    rst: The number of bytes of :ref:`scratch memory <memory-spaces>` requested  per
      work-item for this kernel. Scratch memory is used for stack memory  on the accelerator,
      as well as for register spills and restores.
    unit: Bytes per work-item
  Kernel Time:
    rst: The total duration of the executed kernel.
    unit: Nanoseconds
  Kernel Time (Cycles):
    rst: The total duration of the executed kernel in cycles.
    unit: Cycles
  Instructions per wavefront:
    rst: The average number of instructions (of all types) executed per wavefront.
      This is averaged over all wavefronts in a kernel dispatch.
    unit: Instructions per wavefront
  Wave Cycles:
    rst: 'The number of cycles a wavefront in the kernel dispatch spent resident on  a
      compute unit per :ref:`normalization unit <normalization-units>`. This  is averaged
      over all wavefronts in a kernel dispatch.  Note: this should  not be directly
      compared to the kernel cycles above.'
    unit: Cycles per normalization unit
  Dependency Wait Cycles:
    rst: The number of cycles a wavefront in the kernel dispatch stalled waiting  on
      memory of any kind (e.g., instruction fetch, vector or scalar memory,  etc.)
      per :ref:`normalization unit <normalization-units>`. This counter  is incremented
      at every cycle by *all* wavefronts on a CU stalled at a  memory operation.  As
      such, it is most useful to get a sense of how waves  were spending their time,
      rather than identification of a precise limiter  because another wave could
      be actively executing while a wave is stalled.  The sum of this metric, Issue
      Wait Cycles and Active Cycles should be  equal to the total Wave Cycles metric.
    unit: Cycles per normalization unit
  Issue Wait Cycles:
    rst: The number of cycles a wavefront in the kernel dispatch was unable to  issue
      an instruction for any reason (e.g., execution pipe back-pressure,  arbitration
      loss, etc.) per  :ref:`normalization unit <normalization-units>`.  This counter
      is  incremented at every cycle by *all* wavefronts on a CU unable to issue an  instruction.  As
      such, it is most useful to get a sense of how waves were  spending their time,
      rather than identification of a precise limiter  because another wave could
      be actively executing while a wave is issue  stalled.  The sum of this metric,
      Dependency Wait Cycles and Active  Cycles should be equal to the total Wave
      Cycles metric.
    unit: Cycles per normalization unit
  Active Cycles:
    rst: The average number of cycles a wavefront in the kernel dispatch was  actively
      executing instructions per  :ref:`normalization unit <normalization-units>`.
      This measurement is made  on a per-wavefront basis, and may include cycles that
      another wavefront  spent actively executing (on another execution unit, for
      example) or was  stalled.  As such, it is most useful to get a sense of how
      waves were  spending their time, rather than identification of a precise limiter.
      The sum of this metric, Issue Wait Cycles and Dependency Wait Cycles should be equal
      to the total Wave Cycles metric.
    unit: Cycles per normalization unit
  Wavefront Occupancy:
    rst: 'The time-averaged number of wavefronts resident on the accelerator over  the
      lifetime of the kernel. Note: this metric may be inaccurate for  short-running
      kernels (less than 1 ms).'
    unit: Wavefronts
Wavefront runtime stats:
  Grid Size:
    rst: The total number of work-items (or, threads) launched as a part of the kernel
      dispatch. In HIP, this is equivalent to the total grid size multiplied by the
      total workgroup (or, block) size.
    unit: Work-Items
  Workgroup Size:
    rst: The total number of work-items (or, threads) in each workgroup (or, block)
      launched as part of the kernel dispatch. In HIP, this is equivalent to the total
      block size.
    unit: Work-Items
  Total Wavefronts:
    rst: "The total number of wavefronts launched as part of the kernel dispatch.\
      \ On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront\
      \ size is always 64 work-items. Thus, the total number of wavefronts should\
      \ be equivalent to the ceiling of grid size divided by 64."
    unit: Wavefronts
  Saved Wavefronts:
    rst: The total number of wavefronts saved at a context-save. See  `cwsr_enable
      <https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
    unit: Wavefronts
  Restored Wavefronts:
    rst: The total number of wavefronts restored from a context-save. See  `cwsr_enable
      <https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
    unit: Wavefronts
  VGPRs:
    rst: 'The number of architected vector general-purpose registers allocated for  the
      kernel, see :ref:`VALU <desc-valu>`.  Note: this may not exactly  match the
      number of VGPRs requested by the compiler due to allocation  granularity.'
    unit: VGPRs
  AGPRs:
    rst: 'The number of accumulation vector general-purpose registers allocated for  the
      kernel, see :ref:`AGPRs <desc-agprs>`.  Note: this may not exactly  match the
      number of AGPRs requested by the compiler due to allocation  granularity.'
    unit: AGPRs
  SGPRs:
    rst: 'The number of scalar general-purpose registers allocated for the kernel,  see
      :ref:`SALU <desc-salu>`.  Note: this may not exactly match the number  of SGPRs
      requested by the compiler due to allocation granularity.'
    unit: SGPRs
  LDS Allocation:
    rst: 'The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared  memory)
      allocated for this kernel.  Note: This may also be larger than  what was requested
      at compile time due to both allocation granularity and  dynamic per-dispatch
      LDS allocations.'
    unit: Bytes per workgroup
  Scratch Allocation:
    rst: The number of bytes of :ref:`scratch memory <memory-spaces>` requested  per
      work-item for this kernel. Scratch memory is used for stack memory  on the accelerator,
      as well as for register spills and restores.
    unit: Bytes per work-item
  Kernel Time:
    rst: The total duration of the executed kernel.
    unit: Nanoseconds
  Kernel Time (Cycles):
    rst: The total duration of the executed kernel in cycles.
    unit: Cycles
  Instructions per wavefront:
    rst: The average number of instructions (of all types) executed per wavefront.
      This is averaged over all wavefronts in a kernel dispatch.
    unit: Instructions per wavefront
  Wave Cycles:
    rst: 'The number of cycles a wavefront in the kernel dispatch spent resident on  a
      compute unit per :ref:`normalization unit <normalization-units>`. This  is averaged
      over all wavefronts in a kernel dispatch.  Note: this should  not be directly
      compared to the kernel cycles above.'
    unit: Cycles per normalization unit
  Dependency Wait Cycles:
    rst: The number of cycles a wavefront in the kernel dispatch stalled waiting  on
      memory of any kind (e.g., instruction fetch, vector or scalar memory,  etc.)
      per :ref:`normalization unit <normalization-units>`. This counter  is incremented
      at every cycle by *all* wavefronts on a CU stalled at a  memory operation.  As
      such, it is most useful to get a sense of how waves  were spending their time,
      rather than identification of a precise limiter  because another wave could
      be actively executing while a wave is stalled.  The sum of this metric, Issue
      Wait Cycles and Active Cycles should be  equal to the total Wave Cycles metric.
    unit: Cycles per normalization unit
  Issue Wait Cycles:
    rst: The number of cycles a wavefront in the kernel dispatch was unable to  issue
      an instruction for any reason (e.g., execution pipe back-pressure,  arbitration
      loss, etc.) per  :ref:`normalization unit <normalization-units>`.  This counter
      is  incremented at every cycle by *all* wavefronts on a CU unable to issue an  instruction.  As
      such, it is most useful to get a sense of how waves were  spending their time,
      rather than identification of a precise limiter  because another wave could
      be actively executing while a wave is issue  stalled.  The sum of this metric,
      Dependency Wait Cycles and Active  Cycles should be equal to the total Wave
      Cycles metric.
    unit: Cycles per normalization unit
  Active Cycles:
    rst: The average number of cycles a wavefront in the kernel dispatch was  actively
      executing instructions per  :ref:`normalization unit <normalization-units>`.
      This measurement is made  on a per-wavefront basis, and may include cycles that
      another wavefront  spent actively executing (on another execution unit, for
      example) or was  stalled.  As such, it is most useful to get a sense of how
      waves were  spending their time, rather than identification of a precise limiter.
      The sum of this metric, Issue Wait Cycles and Dependency Wait Cycles should be equal
      to the total Wave Cycles metric.
    unit: Cycles per normalization unit
  Wavefront Occupancy:
    rst: 'The time-averaged number of wavefronts resident on the accelerator over  the
      lifetime of the kernel. Note: this metric may be inaccurate for  short-running
      kernels (less than 1 ms).'
    unit: Wavefronts
Overall instruction mix:
  VALU:
    rst: The total number of vector arithmetic logic unit (VALU) operations  issued.
      These are the workhorses of the  :doc:`compute unit <compute-unit>`, and are
      used to execute a wide range of  instruction types including floating point
      operations, non-uniform  address calculations, transcendental operations, integer
      operations,  shifts, conditional evaluation, etc.
    unit: Instructions
  VMEM:
    rst: The total number of vector memory operations issued. These include most  loads,
      stores and atomic operations and all accesses to  :ref:`generic, global, private
      and texture <memory-spaces>` memory.
    unit: Instructions
  LDS:
    rst: The total number of LDS (also known as shared memory) operations issued.  These
      include loads, stores, atomics, and HIP's ``__shfl`` operations.
    unit: Instructions
  MFMA:
    rst: The total number of matrix fused multiply-add instructions issued.
    unit: Instructions
  SALU:
    rst: The total number of scalar arithmetic logic unit (SALU) operations issued.
      Typically these are used for address calculations, literal constants, and other
      operations that are provably uniform across a wavefront. Although scalar memory
      (SMEM) operations are issued by the SALU, they are counted separately in this
      section.
    unit: Instructions
  SMEM:
    rst: The total number of scalar memory (SMEM) operations issued. These are  typically
      used for loading kernel arguments, base-pointers and loads  from HIP's ``__constant__``
      memory.
    unit: Instructions
  Branch:
    rst: The total number of branch operations issued. These typically consist of  jump
      or branch operations and are used to implement control flow.
    unit: Instructions
  INT32:
    rst: The total number of instructions operating on 32-bit integer operands  issued
      to the VALU per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  INT64:
    rst: The total number of instructions operating on 64-bit integer operands  issued
      to the VALU per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F16-ADD:
    rst: The total number of addition instructions operating on 16-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F16-MUL:
    rst: The total number of multiplication instructions operating on 16-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F16-FMA:
    rst: The total number of fused multiply-add instructions operating on 16-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F16-Trans:
    rst: The total number of transcendental instructions (such as ``sqrt``) operating on
      16-bit floating-point operands issued to the VALU per  :ref:`normalization unit
      <normalization-units>`.
    unit: Instructions per normalization unit
  F32-ADD:
    rst: The total number of addition instructions operating on 32-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F32-MUL:
    rst: The total number of multiplication instructions operating on 32-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F32-FMA:
    rst: The total number of fused multiply-add instructions operating on 32-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F32-Trans:
    rst: The total number of transcendental instructions (such as ``sqrt``)  operating
      on 32-bit floating-point operands issued to the VALU per  :ref:`normalization
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  F64-ADD:
    rst: The total number of addition instructions operating on 64-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F64-MUL:
    rst: The total number of multiplication instructions operating on 64-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F64-FMA:
    rst: The total number of fused multiply-add instructions operating on 64-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F64-Trans:
    rst: The total number of transcendental instructions (such as ``sqrt``) operating
      on 64-bit floating-point operands issued to the VALU per  :ref:`normalization
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  Conversion:
    rst: "The total number of type conversion instructions (such as converting data\
      \ between F32 and F64) issued to the VALU per :ref:`normalization unit\
      \ <normalization-units>`."
    unit: Instructions per normalization unit
  Global/Generic Instr:
    rst: The total number of global & generic memory instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Global/Generic Read:
    rst: The total number of global & generic memory read instructions executed on  all
      :doc:`compute units <compute-unit>` on the accelerator, per  :ref:`normalization
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  Global/Generic Write:
    rst: The total number of global & generic memory write instructions executed  on
      all :doc:`compute units <compute-unit>` on the accelerator, per  :ref:`normalization
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  Global/Generic Atomic:
    rst: The total number of global & generic memory atomic (with and without  return)
      instructions executed on all :doc:`compute units <compute-unit>`  on the accelerator,
      per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Instr:
    rst: The total number of spill/stack memory instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Read:
    rst: The total number of spill/stack memory read instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Write:
    rst: The total number of spill/stack memory write instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Atomic:
    rst: The total number of spill/stack memory atomic (with and without return)  instructions
      executed on all :doc:`compute units <compute-unit>` on the  accelerator, per
      :ref:`normalization unit <normalization-units>`. Typically unused, because these
      memory operations mainly implement thread-local storage.
    unit: Instructions per normalization unit
  MFMA-I8:
    rst: The total number of 8-bit integer :ref:`MFMA <desc-mfma>` instructions  issued
      per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  MFMA-F8:
    rst: The total number of 8-bit floating point :ref:`MFMA <desc-mfma>` instructions  issued
      per :ref:`normalization unit <normalization-units>`. Supported only on AMD
      Instinct MI300 series and later.
    unit: Instructions per normalization unit
  MFMA-F16:
    rst: The total number of 16-bit floating point :ref:`MFMA <desc-mfma>`  instructions
      issued per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  MFMA-BF16:
    rst: The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`  instructions
      issued per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  MFMA-F32:
    rst: The total number of 32-bit floating-point :ref:`MFMA <desc-mfma>`  instructions
      issued per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  MFMA-F64:
    rst: The total number of 64-bit floating-point :ref:`MFMA <desc-mfma>`  instructions
      issued per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
VALU arithmetic instruction mix:
  VALU:
    rst: The total number of vector arithmetic logic unit (VALU) operations  issued.
      These are the workhorses of the  :doc:`compute unit <compute-unit>`, and are
      used to execute a wide range of  instruction types including floating point
      operations, non-uniform  address calculations, transcendental operations, integer
      operations,  shifts, conditional evaluation, etc.
    unit: Instructions
  VMEM:
    rst: The total number of vector memory operations issued. These include most  loads,
      stores and atomic operations and all accesses to  :ref:`generic, global, private
      and texture <memory-spaces>` memory.
    unit: Instructions
  LDS:
    rst: The total number of LDS (also known as shared memory) operations issued.  These
      include loads, stores, atomics, and HIP's ``__shfl`` operations.
    unit: Instructions
  MFMA:
    rst: The total number of matrix fused multiply-add instructions issued.
    unit: Instructions
  SALU:
    rst: The total number of scalar arithmetic logic unit (SALU) operations issued.
      Typically these are used for address calculations, literal constants, and other
      operations that are provably uniform across a wavefront. Although scalar memory
      (SMEM) operations are issued by the SALU, they are counted separately in this
      section.
    unit: Instructions
  SMEM:
    rst: The total number of scalar memory (SMEM) operations issued. These are  typically
      used for loading kernel arguments, base-pointers and loads  from HIP's ``__constant__``
      memory.
    unit: Instructions
  Branch:
    rst: The total number of branch operations issued. These typically consist of  jump
      or branch operations and are used to implement control flow.
    unit: Instructions
  INT32:
    rst: The total number of instructions operating on 32-bit integer operands  issued
      to the VALU per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  INT64:
    rst: The total number of instructions operating on 64-bit integer operands  issued
      to the VALU per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F16-ADD:
    rst: The total number of addition instructions operating on 16-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F16-MUL:
    rst: The total number of multiplication instructions operating on 16-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F16-FMA:
    rst: The total number of fused multiply-add instructions operating on 16-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F16-Trans:
    rst: The total number of transcendental instructions (such as ``sqrt``) operating on
      16-bit floating-point operands issued to the VALU per  :ref:`normalization unit
      <normalization-units>`.
    unit: Instructions per normalization unit
  F32-ADD:
    rst: The total number of addition instructions operating on 32-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F32-MUL:
    rst: The total number of multiplication instructions operating on 32-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F32-FMA:
    rst: The total number of fused multiply-add instructions operating on 32-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F32-Trans:
    rst: The total number of transcendental instructions (such as ``sqrt``)  operating
      on 32-bit floating-point operands issued to the VALU per  :ref:`normalization
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  F64-ADD:
    rst: The total number of addition instructions operating on 64-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F64-MUL:
    rst: The total number of multiplication instructions operating on 64-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F64-FMA:
    rst: The total number of fused multiply-add instructions operating on 64-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F64-Trans:
    rst: The total number of transcendental instructions (such as ``sqrt``) operating
      on 64-bit floating-point operands issued to the VALU per  :ref:`normalization
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  Conversion:
    rst: "The total number of type conversion instructions (such as converting data\
      \ between F32 and F64) issued to the VALU per :ref:`normalization unit\
      \ <normalization-units>`."
    unit: Instructions per normalization unit
  Global/Generic Instr:
    rst: The total number of global & generic memory instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Global/Generic Read:
    rst: The total number of global & generic memory read instructions executed on  all
      :doc:`compute units <compute-unit>` on the accelerator, per  :ref:`normalization
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  Global/Generic Write:
    rst: The total number of global & generic memory write instructions executed  on
      all :doc:`compute units <compute-unit>` on the accelerator, per  :ref:`normalization
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  Global/Generic Atomic:
    rst: The total number of global & generic memory atomic (with and without  return)
      instructions executed on all :doc:`compute units <compute-unit>`  on the accelerator,
      per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Instr:
    rst: The total number of spill/stack memory instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Read:
    rst: The total number of spill/stack memory read instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Write:
    rst: The total number of spill/stack memory write instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Atomic:
    rst: The total number of spill/stack memory atomic (with and without return)  instructions
      executed on all :doc:`compute units <compute-unit>` on the  accelerator, per
      :ref:`normalization unit <normalization-units>`. Typically unused, because these
      memory operations mainly implement thread-local storage.
    unit: Instructions per normalization unit
  MFMA-I8:
    rst: The total number of 8-bit integer :ref:`MFMA <desc-mfma>` instructions  issued
      per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  MFMA-F8:
    rst: The total number of 8-bit floating point :ref:`MFMA <desc-mfma>` instructions  issued
      per :ref:`normalization unit <normalization-units>`. Supported only on AMD
      Instinct MI300 series and later.
    unit: Instructions per normalization unit
  MFMA-F16:
    rst: The total number of 16-bit floating point :ref:`MFMA <desc-mfma>`  instructions
      issued per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  MFMA-BF16:
    rst: The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`  instructions
      issued per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  MFMA-F32:
    rst: The total number of 32-bit floating-point :ref:`MFMA <desc-mfma>`  instructions
      issued per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  MFMA-F64:
    rst: The total number of 64-bit floating-point :ref:`MFMA <desc-mfma>`  instructions
      issued per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
MFMA instruction mix:
  VALU:
    rst: The total number of vector arithmetic logic unit (VALU) operations  issued.
      These are the workhorses of the  :doc:`compute unit <compute-unit>`, and are
      used to execute a wide range of  instruction types including floating point
      operations, non-uniform  address calculations, transcendental operations, integer
      operations,  shifts, conditional evaluation, etc.
    unit: Instructions
  VMEM:
    rst: The total number of vector memory operations issued. These include most  loads,
      stores and atomic operations and all accesses to  :ref:`generic, global, private
      and texture <memory-spaces>` memory.
    unit: Instructions
  LDS:
    rst: The total number of LDS (also known as shared memory) operations issued.  These
      include loads, stores, atomics, and HIP's ``__shfl`` operations.
    unit: Instructions
  MFMA:
    rst: The total number of matrix fused multiply-add instructions issued.
    unit: Instructions
  SALU:
    rst: The total number of scalar arithmetic logic unit (SALU) operations issued.
      Typically these are used for address calculations, literal constants, and other
      operations that are provably uniform across a wavefront. Although scalar memory
      (SMEM) operations are issued by the SALU, they are counted separately in this
      section.
    unit: Instructions
  SMEM:
    rst: The total number of scalar memory (SMEM) operations issued. These are  typically
      used for loading kernel arguments, base-pointers and loads  from HIP's ``__constant__``
      memory.
    unit: Instructions
  Branch:
    rst: The total number of branch operations issued. These typically consist of  jump
      or branch operations and are used to implement control flow.
    unit: Instructions
  INT32:
    rst: The total number of instructions operating on 32-bit integer operands  issued
      to the VALU per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  INT64:
    rst: The total number of instructions operating on 64-bit integer operands  issued
      to the VALU per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F16-ADD:
    rst: The total number of addition instructions operating on 16-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F16-MUL:
    rst: The total number of multiplication instructions operating on 16-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F16-FMA:
    rst: The total number of fused multiply-add instructions operating on 16-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F16-Trans:
    rst: The total number of transcendental instructions (such as ``sqrt``)  operating
      on 16-bit floating-point operands issued to the VALU per  :ref:`normalization
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  F32-ADD:
    rst: The total number of addition instructions operating on 32-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F32-MUL:
    rst: The total number of multiplication instructions operating on 32-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F32-FMA:
    rst: The total number of fused multiply-add instructions operating on 32-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F32-Trans:
    rst: The total number of transcendental instructions (such as ``sqrt``)  operating
      on 32-bit floating-point operands issued to the VALU per  :ref:`normalization
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  F64-ADD:
    rst: The total number of addition instructions operating on 64-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F64-MUL:
    rst: The total number of multiplication instructions operating on 64-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F64-FMA:
    rst: The total number of fused multiply-add instructions operating on 64-bit  floating-point
      operands issued to the VALU per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F64-Trans:
    rst: The total number of transcendental instructions (such as ``sqrt``)  operating
      on 64-bit floating-point operands issued to the VALU per  :ref:`normalization
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  Conversion:
    rst: "The total number of type conversion instructions (such as converting data\
      \ between F32 and F64) issued to the VALU per  :ref:`normalization unit\
      \ <normalization-units>`."
    unit: Instructions per normalization unit
  Global/Generic Instr:
    rst: The total number of global & generic memory instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Global/Generic Read:
    rst: The total number of global & generic memory read instructions executed on  all
      :doc:`compute units <compute-unit>` on the accelerator, per  :ref:`normalization
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  Global/Generic Write:
    rst: The total number of global & generic memory write instructions executed  on
      all :doc:`compute units <compute-unit>` on the accelerator, per  :ref:`normalization
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  Global/Generic Atomic:
    rst: The total number of global & generic memory atomic (with and without  return)
      instructions executed on all :doc:`compute units <compute-unit>`  on the accelerator,
      per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Instr:
    rst: The total number of spill/stack memory instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Read:
    rst: The total number of spill/stack memory read instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Write:
    rst: The total number of spill/stack memory write instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Atomic:
    rst: The total number of spill/stack memory atomic (with and without return)  instructions
      executed on all :doc:`compute units <compute-unit>` on the  accelerator, per
      :ref:`normalization unit <normalization-units>`.  Typically unused, as these
      memory operations are generally used to implement thread-local storage.
    unit: Instructions per normalization unit
  MFMA-I8:
    rst: The total number of 8-bit integer :ref:`MFMA <desc-mfma>` instructions  issued
      per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  MFMA-F8:
    rst: The total number of 8-bit floating point :ref:`MFMA <desc-mfma>` instructions  issued
      per :ref:`normalization unit <normalization-units>`. This is supported only on
      AMD Instinct MI300 series and later.
    unit: Instructions per normalization unit
  MFMA-F16:
    rst: The total number of 16-bit floating point :ref:`MFMA <desc-mfma>`  instructions
      issued per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  MFMA-BF16:
    rst: The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`  instructions
      issued per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  MFMA-F32:
    rst: The total number of 32-bit floating-point :ref:`MFMA <desc-mfma>`  instructions
      issued per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  MFMA-F64:
    rst: The total number of 64-bit floating-point :ref:`MFMA <desc-mfma>`  instructions
      issued per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
Compute Speed-of-Light:
  VALU FLOPs:
    rst: 'The total floating-point operations executed per second on the  :ref:`VALU
      <desc-valu>`. This is also presented as a percent of the peak  theoretical FLOPs
      achievable on the specific accelerator. Note: this does  not include any floating-point
      operations from :ref:`MFMA <desc-mfma>`  instructions.'
    unit: GFLOPs
  VALU IOPs:
    rst: 'The total integer operations executed per second on the  :ref:`VALU <desc-valu>`.
      This is also presented as a percent of the peak  theoretical IOPs achievable
      on the specific accelerator. Note: this does  not include any integer operations
      from :ref:`MFMA <desc-mfma>`  instructions.'
    unit: GIOPs
  MFMA FLOPs (BF16):
    rst: 'The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`  operations
      executed per second. Note: this does not include any 16-bit  brain floating
      point operations from :ref:`VALU <desc-valu>`  instructions. This is also presented
      as a percent of the peak theoretical  BF16 MFMA operations achievable on the
      specific accelerator.'
    unit: GFLOPs
  MFMA FLOPs (F16):
    rst: 'The total number of 16-bit floating point :ref:`MFMA <desc-mfma>`  operations
      executed per second. Note: this does not include any 16-bit  floating point
      operations from :ref:`VALU <desc-valu>` instructions. This  is also presented
      as a percent of the peak theoretical F16 MFMA  operations achievable on the
      specific accelerator.'
    unit: GFLOPs
  MFMA FLOPs (F32):
    rst: 'The total number of 32-bit floating point :ref:`MFMA <desc-mfma>`  operations
      executed per second. Note: this does not include any 32-bit  floating point
      operations from :ref:`VALU <desc-valu>` instructions. This  is also presented
      as a percent of the peak theoretical F32 MFMA  operations achievable on the
      specific accelerator.'
    unit: GFLOPs
  MFMA FLOPs (F64):
    rst: 'The total number of 64-bit floating point :ref:`MFMA <desc-mfma>`  operations
      executed per second. Note: this does not include any 64-bit  floating point
      operations from :ref:`VALU <desc-valu>` instructions. This  is also presented
      as a percent of the peak theoretical F64 MFMA  operations achievable on the
      specific accelerator.'
    unit: GFLOPs
  MFMA IOPs (INT8):
    rst: 'The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations  executed
      per second. Note: this does not include any 8-bit integer  operations from :ref:`VALU
      <desc-valu>` instructions. This is also  presented as a percent of the peak
      theoretical INT8 MFMA operations  achievable on the specific accelerator.'
    unit: GIOPs
  IPC:
    rst: The ratio of the total number of instructions executed on the  :doc:`CU <compute-unit>`
      over the  :ref:`total active CU cycles <total-active-cu-cycles>`.
    unit: Instructions per cycle
  IPC (Issued):
    rst: The ratio of the total number of  (non-:ref:`internal <ipc-internal-instructions>`)
      instructions issued over  the number of cycles where the :ref:`scheduler <desc-scheduler>`
      was  actively working on issuing instructions. Refer to the  :ref:`Issued IPC
      <issued-ipc>` example for further detail.
    unit: Instructions per cycle
  SALU Utilization:
    rst: Indicates what percent of the kernel's duration the  :ref:`SALU <desc-salu>`
      was busy executing instructions. Computed as the  ratio of the total number
      of cycles spent by the  :ref:`scheduler <desc-scheduler>` issuing SALU / :ref:`SMEM
      <desc-smem>`  instructions over the :ref:`total CU cycles <total-cu-cycles>`.
    unit: Percent
  VALU Utilization:
    rst: Indicates what percent of the kernel's duration the  :ref:`VALU <desc-valu>`
      was busy executing instructions. Does not include  :ref:`VMEM <desc-vmem>` operations.
      Computed as the ratio of the total  number of cycles spent by the :ref:`scheduler
      <desc-scheduler>` issuing  VALU instructions over the :ref:`total CU cycles
      <total-cu-cycles>`.
    unit: Percent
  VMEM Utilization:
    rst: Indicates what percent of the kernel's duration the  :ref:`VMEM <desc-vmem>`
      unit was busy executing instructions, including  both global/generic and spill/scratch
      operations (see the  :ref:`VMEM instruction count metrics <ta-instruction-counts>`
      for more  detail).  Does not include :ref:`VALU <desc-valu>` operations. Computed  as
      the ratio of the total number of cycles spent by the  :ref:`scheduler <desc-scheduler>`
      issuing VMEM instructions over the  :ref:`total CU cycles <total-cu-cycles>`.
    unit: Percent
  Branch Utilization:
    rst: Indicates what percent of the kernel's duration the  :ref:`branch <desc-branch>`
      unit was busy executing instructions.  Computed as the ratio of the total number
      of cycles spent by the  :ref:`scheduler <desc-scheduler>` issuing branch instructions
      over the  :ref:`total CU cycles <total-cu-cycles>`.
    unit: Percent
  VALU Active Threads:
    rst: Indicates the average level of :ref:`divergence <desc-divergence>` within  a
      wavefront over the lifetime of the kernel. Computed as the number of work-items that were
      active in a wavefront during execution of each  :ref:`VALU <desc-valu>` instruction,
      time-averaged over all VALU  instructions run on all wavefronts in the kernel.
    unit: Work-items
  MFMA Utilization:
    rst: Indicates what percent of the kernel's duration the  :ref:`MFMA <desc-mfma>`
      unit was busy executing instructions. Computed as  the ratio of the total number
      of cycles the :ref:`MFMA <desc-mfma>` unit was busy over the  :ref:`total
      CU cycles <total-cu-cycles>`.
    unit: Percent
  MFMA Instruction Cycles:
    rst: The average duration of :ref:`MFMA <desc-mfma>` instructions in this  kernel
      in cycles. Computed as the ratio of the total number of cycles the  MFMA unit
      was busy over the total number of MFMA instructions. Compare  to, for example,
      the  `AMD Matrix Instruction Calculator <https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator>`_.
    unit: Cycles per instruction
  VMEM Latency:
    rst: The average number of round-trip cycles (that is, from issue to data  return
      / acknowledgment) required for a VMEM instruction to complete.
    unit: Cycles
  SMEM Latency:
    rst: The average number of round-trip cycles (that is, from issue to data  return
      / acknowledgment) required for a SMEM instruction to complete.
    unit: Cycles
  FLOPs (Total):
    rst: The total number of floating-point operations executed on either the  :ref:`VALU
      <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per  :ref:`normalization unit
      <normalization-units>`.
    unit: FLOP per normalization unit
  IOPs (Total):
    rst: The total number of integer operations executed on either the  :ref:`VALU
      <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per  :ref:`normalization unit
      <normalization-units>`.
    unit: IOP per normalization unit
  F16 OPs:
    rst: The total number of 16-bit floating-point operations executed on either the  :ref:`VALU
      <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per  :ref:`normalization unit
      <normalization-units>`.
    unit: FLOP per normalization unit
  BF16 OPs:
    rst: 'The total number of 16-bit brain floating-point operations executed on either
      the  :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per  :ref:`normalization
      unit <normalization-units>`. Note: on current CDNA  accelerators, the VALU has
      no native BF16 instructions.'
    unit: FLOP per normalization unit
  F32 OPs:
    rst: The total number of 32-bit floating-point operations executed on either  the
      :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per  :ref:`normalization
      unit <normalization-units>`.
    unit: FLOP per normalization unit
  F64 OPs:
    rst: The total number of 64-bit floating-point operations executed on either  the
      :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per  :ref:`normalization
      unit <normalization-units>`.
    unit: FLOP per normalization unit
  INT8 OPs:
    rst: 'The total number of 8-bit integer operations executed on either the  :ref:`VALU
      <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per  :ref:`normalization unit
      <normalization-units>`. Note: on current CDNA  accelerators, the VALU has no
      native INT8 instructions.'
    unit: IOP per normalization unit
Pipeline statistics:
  VALU FLOPs:
    rst: 'The total floating-point operations executed per second on the  :ref:`VALU
      <desc-valu>`. This is also presented as a percent of the peak  theoretical FLOPs
      achievable on the specific accelerator. Note: this does  not include any floating-point
      operations from :ref:`MFMA <desc-mfma>`  instructions.'
    unit: GFLOPs
  VALU IOPs:
    rst: 'The total integer operations executed per second on the  :ref:`VALU <desc-valu>`.
      This is also presented as a percent of the peak  theoretical IOPs achievable
      on the specific accelerator. Note: this does  not include any integer operations
      from :ref:`MFMA <desc-mfma>`  instructions.'
    unit: GIOPs
  MFMA FLOPs (BF16):
    rst: 'The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`  operations
      executed per second. Note: this does not include any 16-bit  brain floating
      point operations from :ref:`VALU <desc-valu>`  instructions. This is also presented
      as a percent of the peak theoretical  BF16 MFMA operations achievable on the
      specific accelerator.'
    unit: GFLOPs
  MFMA FLOPs (F16):
    rst: 'The total number of 16-bit floating point :ref:`MFMA <desc-mfma>`  operations
      executed per second. Note: this does not include any 16-bit  floating point
      operations from :ref:`VALU <desc-valu>` instructions. This  is also presented
      as a percent of the peak theoretical F16 MFMA  operations achievable on the
      specific accelerator.'
    unit: GFLOPs
  MFMA FLOPs (F32):
    rst: 'The total number of 32-bit floating point :ref:`MFMA <desc-mfma>`  operations
      executed per second. Note: this does not include any 32-bit  floating point
      operations from :ref:`VALU <desc-valu>` instructions. This  is also presented
      as a percent of the peak theoretical F32 MFMA  operations achievable on the
      specific accelerator.'
    unit: GFLOPs
  MFMA FLOPs (F64):
    rst: 'The total number of 64-bit floating point :ref:`MFMA <desc-mfma>`  operations
      executed per second. Note: this does not include any 64-bit  floating point
      operations from :ref:`VALU <desc-valu>` instructions. This  is also presented
      as a percent of the peak theoretical F64 MFMA  operations achievable on the
      specific accelerator.'
    unit: GFLOPs
  MFMA IOPs (INT8):
    rst: 'The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations  executed
      per second. Note: this does not include any 8-bit integer  operations from :ref:`VALU
      <desc-valu>` instructions. This is also  presented as a percent of the peak
      theoretical INT8 MFMA operations  achievable on the specific accelerator.'
    unit: GIOPs
  IPC:
    rst: The ratio of the total number of instructions executed on the  :doc:`CU <compute-unit>`
      over the  :ref:`total active CU cycles <total-active-cu-cycles>`.
    unit: Instructions per cycle
  IPC (Issued):
    rst: The ratio of the total number of  (non-:ref:`internal <ipc-internal-instructions>`)
      instructions issued over  the number of cycles where the :ref:`scheduler <desc-scheduler>`
      was  actively working on issuing instructions. Refer to the  :ref:`Issued IPC
      <issued-ipc>` example for further detail.
    unit: Instructions per cycle
  SALU Utilization:
    rst: Indicates what percent of the kernel's duration the  :ref:`SALU <desc-salu>`
      was busy executing instructions. Computed as the  ratio of the total number
      of cycles spent by the  :ref:`scheduler <desc-scheduler>` issuing SALU / :ref:`SMEM
      <desc-smem>`  instructions over the :ref:`total CU cycles <total-cu-cycles>`.
    unit: Percent
  VALU Utilization:
    rst: Indicates what percent of the kernel's duration the  :ref:`VALU <desc-valu>`
      was busy executing instructions. Does not include  :ref:`VMEM <desc-vmem>` operations.
      Computed as the ratio of the total  number of cycles spent by the :ref:`scheduler
      <desc-scheduler>` issuing  VALU instructions over the :ref:`total CU cycles
      <total-cu-cycles>`.
    unit: Percent
  VMEM Utilization:
    rst: Indicates what percent of the kernel's duration the  :ref:`VMEM <desc-vmem>`
      unit was busy executing instructions, including  both global/generic and spill/scratch
      operations (see the  :ref:`VMEM instruction count metrics <ta-instruction-counts>`
      for more  detail).  Does not include :ref:`VALU <desc-valu>` operations. Computed  as
      the ratio of the total number of cycles spent by the  :ref:`scheduler <desc-scheduler>`
      issuing VMEM instructions over the  :ref:`total CU cycles <total-cu-cycles>`.
    unit: Percent
  Branch Utilization:
    rst: Indicates what percent of the kernel's duration the  :ref:`branch <desc-branch>`
      unit was busy executing instructions.  Computed as the ratio of the total number
      of cycles spent by the  :ref:`scheduler <desc-scheduler>` issuing branch instructions
      over the  :ref:`total CU cycles <total-cu-cycles>`.
    unit: Percent
  VALU Active Threads:
    rst: Indicates the average level of :ref:`divergence <desc-divergence>` within  a
      wavefront over the lifetime of the kernel. Computed as the number of work-items that were
      active in a wavefront during execution of each  :ref:`VALU <desc-valu>` instruction,
      time-averaged over all VALU  instructions run on all wavefronts in the kernel.
    unit: Work-items
  MFMA Utilization:
    rst: Indicates what percent of the kernel's duration the  :ref:`MFMA <desc-mfma>`
      unit was busy executing instructions. Computed as  the ratio of the total number
      of cycles the :ref:`MFMA <desc-mfma>` unit was busy over the  :ref:`total
      CU cycles <total-cu-cycles>`.
    unit: Percent
  MFMA Instruction Cycles:
    rst: The average duration of :ref:`MFMA <desc-mfma>` instructions in this  kernel
      in cycles. Computed as the ratio of the total number of cycles the  MFMA unit
      was busy over the total number of MFMA instructions. Compare  to, for example,
      the  `AMD Matrix Instruction Calculator <https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator>`_.
    unit: Cycles per instruction
  VMEM Latency:
    rst: The average number of round-trip cycles (that is, from issue to data  return
      / acknowledgment) required for a VMEM instruction to complete.
    unit: Cycles
  SMEM Latency:
    rst: The average number of round-trip cycles (that is, from issue to data  return
      / acknowledgment) required for a SMEM instruction to complete.
    unit: Cycles
  FLOPs (Total):
    rst: The total number of floating-point operations executed on either the  :ref:`VALU
      <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per  :ref:`normalization unit
      <normalization-units>`.
    unit: FLOP per normalization unit
  IOPs (Total):
    rst: The total number of integer operations executed on either the  :ref:`VALU
      <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per  :ref:`normalization unit
      <normalization-units>`.
    unit: IOP per normalization unit
  F16 OPs:
    rst: The total number of 16-bit floating-point operations executed on either the  :ref:`VALU
      <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per  :ref:`normalization unit
      <normalization-units>`.
    unit: FLOP per normalization unit
  BF16 OPs:
    rst: 'The total number of 16-bit brain floating-point operations executed on either
      the  :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per  :ref:`normalization
      unit <normalization-units>`. Note: on current CDNA  accelerators, the VALU has
      no native BF16 instructions.'
    unit: FLOP per normalization unit
  F32 OPs:
    rst: The total number of 32-bit floating-point operations executed on either  the
      :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per  :ref:`normalization
      unit <normalization-units>`.
    unit: FLOP per normalization unit
  F64 OPs:
    rst: The total number of 64-bit floating-point operations executed on either  the
      :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per  :ref:`normalization
      unit <normalization-units>`.
    unit: FLOP per normalization unit
  INT8 OPs:
    rst: 'The total number of 8-bit integer operations executed on either the  :ref:`VALU
      <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per  :ref:`normalization unit
      <normalization-units>`. Note: on current CDNA  accelerators, the VALU has no
      native INT8 instructions.'
    unit: IOP per normalization unit
Arithmetic operations:
  VALU FLOPs:
    rst: 'The total floating-point operations executed per second on the  :ref:`VALU
      <desc-valu>`. This is also presented as a percent of the peak  theoretical FLOPs
      achievable on the specific accelerator. Note: this does  not include any floating-point
      operations from :ref:`MFMA <desc-mfma>`  instructions.'
    unit: GFLOPs
  VALU IOPs:
    rst: 'The total integer operations executed per second on the  :ref:`VALU <desc-valu>`.
      This is also presented as a percent of the peak  theoretical IOPs achievable
      on the specific accelerator. Note: this does  not include any integer operations
      from :ref:`MFMA <desc-mfma>`  instructions.'
    unit: GIOPs
  MFMA FLOPs (BF16):
    rst: 'The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`  operations
      executed per second. Note: this does not include any 16-bit  brain floating
      point operations from :ref:`VALU <desc-valu>`  instructions. This is also presented
      as a percent of the peak theoretical  BF16 MFMA operations achievable on the
      specific accelerator.'
    unit: GFLOPs
  MFMA FLOPs (F16):
    rst: 'The total number of 16-bit floating point :ref:`MFMA <desc-mfma>`  operations
      executed per second. Note: this does not include any 16-bit  floating point
      operations from :ref:`VALU <desc-valu>` instructions. This  is also presented
      as a percent of the peak theoretical F16 MFMA  operations achievable on the
      specific accelerator.'
    unit: GFLOPs
  MFMA FLOPs (F32):
    rst: 'The total number of 32-bit floating point :ref:`MFMA <desc-mfma>`  operations
      executed per second. Note: this does not include any 32-bit  floating point
      operations from :ref:`VALU <desc-valu>` instructions. This  is also presented
      as a percent of the peak theoretical F32 MFMA  operations achievable on the
      specific accelerator.'
    unit: GFLOPs
  MFMA FLOPs (F64):
    rst: 'The total number of 64-bit floating point :ref:`MFMA <desc-mfma>`  operations
      executed per second. Note: this does not include any 64-bit  floating point
      operations from :ref:`VALU <desc-valu>` instructions. This  is also presented
      as a percent of the peak theoretical F64 MFMA  operations achievable on the
      specific accelerator.'
    unit: GFLOPs
  MFMA IOPs (INT8):
    rst: 'The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations  executed
      per second. Note: this does not include any 8-bit integer  operations from :ref:`VALU
      <desc-valu>` instructions. This is also  presented as a percent of the peak
      theoretical INT8 MFMA operations  achievable on the specific accelerator.'
    unit: GIOPs
  IPC:
    rst: The ratio of the total number of instructions executed on the  :doc:`CU <compute-unit>`
      over the  :ref:`total active CU cycles <total-active-cu-cycles>`.
    unit: Instructions per cycle
  IPC (Issued):
    rst: The ratio of the total number of  (non-:ref:`internal <ipc-internal-instructions>`)
      instructions issued over  the number of cycles where the :ref:`scheduler <desc-scheduler>`
      was  actively working on issuing instructions. Refer to the  :ref:`Issued IPC
      <issued-ipc>` example for further detail.
    unit: Instructions per cycle
  SALU Utilization:
    rst: Indicates what percent of the kernel's duration the  :ref:`SALU <desc-salu>`
      was busy executing instructions. Computed as the  ratio of the total number
      of cycles spent by the  :ref:`scheduler <desc-scheduler>` issuing SALU / :ref:`SMEM
      <desc-smem>`  instructions over the :ref:`total CU cycles <total-cu-cycles>`.
    unit: Percent
  VALU Utilization:
    rst: Indicates what percent of the kernel's duration the  :ref:`VALU <desc-valu>`
      was busy executing instructions. Does not include  :ref:`VMEM <desc-vmem>` operations.
      Computed as the ratio of the total  number of cycles spent by the :ref:`scheduler
      <desc-scheduler>` issuing  VALU instructions over the :ref:`total CU cycles
      <total-cu-cycles>`.
    unit: Percent
  VMEM Utilization:
    rst: Indicates what percent of the kernel's duration the  :ref:`VMEM <desc-vmem>`
      unit was busy executing instructions, including  both global/generic and spill/scratch
      operations (see the  :ref:`VMEM instruction count metrics <ta-instruction-counts>`
      for more  detail).  Does not include :ref:`VALU <desc-valu>` operations. Computed  as
      the ratio of the total number of cycles spent by the  :ref:`scheduler <desc-scheduler>`
      issuing VMEM instructions over the  :ref:`total CU cycles <total-cu-cycles>`.
    unit: Percent
  Branch Utilization:
    rst: Indicates what percent of the kernel's duration the  :ref:`branch <desc-branch>`
      unit was busy executing instructions.  Computed as the ratio of the total number
      of cycles spent by the  :ref:`scheduler <desc-scheduler>` issuing branch instructions
      over the  :ref:`total CU cycles <total-cu-cycles>`.
    unit: Percent
  VALU Active Threads:
    rst: Indicates the average level of :ref:`divergence <desc-divergence>` within  a
      wavefront over the lifetime of the kernel. Computed as the number of work-items that were
      active in a wavefront during execution of each  :ref:`VALU <desc-valu>` instruction,
      time-averaged over all VALU  instructions run on all wavefronts in the kernel.
    unit: Work-items
  MFMA Utilization:
    rst: Indicates what percent of the kernel's duration the  :ref:`MFMA <desc-mfma>`
      unit was busy executing instructions. Computed as  the ratio of the total number
      of cycles the :ref:`MFMA <desc-mfma>` unit was busy over the  :ref:`total
      CU cycles <total-cu-cycles>`.
    unit: Percent
  MFMA Instruction Cycles:
    rst: The average duration of :ref:`MFMA <desc-mfma>` instructions in this  kernel
      in cycles. Computed as the ratio of the total number of cycles the  MFMA unit
      was busy over the total number of MFMA instructions. Compare  to, for example,
      the  `AMD Matrix Instruction Calculator <https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator>`_.
    unit: Cycles per instruction
  VMEM Latency:
    rst: The average number of round-trip cycles (that is, from issue to data  return
      / acknowledgment) required for a VMEM instruction to complete.
    unit: Cycles
  SMEM Latency:
    rst: The average number of round-trip cycles (that is, from issue to data  return
      / acknowledgment) required for an SMEM instruction to complete.
    unit: Cycles
  FLOPs (Total):
    rst: The total number of floating-point operations executed on either the  :ref:`VALU
      <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per  :ref:`normalization unit
      <normalization-units>`.
    unit: FLOP per normalization unit
  IOPs (Total):
    rst: The total number of integer operations executed on either the  :ref:`VALU
      <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per  :ref:`normalization unit
      <normalization-units>`.
    unit: IOP per normalization unit
  F16 OPs:
    rst: The total number of 16-bit floating-point operations executed on either the  :ref:`VALU
      <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per  :ref:`normalization unit
      <normalization-units>`.
    unit: FLOP per normalization unit
  BF16 OPs:
    rst: 'The total number of 16-bit brain floating-point operations executed on either
      the  :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per  :ref:`normalization
      unit <normalization-units>`. Note: on current CDNA  accelerators, the VALU has
      no native BF16 instructions.'
    unit: FLOP per normalization unit
  F32 OPs:
    rst: The total number of 32-bit floating-point operations executed on either  the
      :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per  :ref:`normalization
      unit <normalization-units>`.
    unit: FLOP per normalization unit
  F64 OPs:
    rst: The total number of 64-bit floating-point operations executed on either  the
      :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per  :ref:`normalization
      unit <normalization-units>`.
    unit: FLOP per normalization unit
  INT8 OPs:
    rst: 'The total number of 8-bit integer operations executed on either the  :ref:`VALU
      <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per  :ref:`normalization unit
      <normalization-units>`. Note: on current CDNA  accelerators, the VALU has no
      native INT8 instructions.'
    unit: IOP per normalization unit
LDS Speed-of-Light:
  Utilization:
    rst: Indicates what percent of the kernel's duration the :ref:`LDS <desc-lds>`  was
      actively executing instructions (including, but not limited to, load,  store,
      atomic and HIP's ``__shfl`` operations).  Calculated as the ratio  of the total
      number of cycles LDS was active over the  :ref:`total CU cycles <total-cu-cycles>`.
    unit: Percent
  Access Rate:
    rst: Indicates the percentage of SIMDs in the :ref:`VALU <desc-valu>` [#lds-workload]_
      actively issuing LDS instructions, averaged over the lifetime of the kernel.
      Calculated as the ratio of the total number of cycles spent by the :ref:`scheduler
      <desc-scheduler>` issuing :ref:`LDS <desc-lds>` instructions over the :ref:`total
      CU cycles <total-cu-cycles>`.
    unit: Percent
  Theoretical Bandwidth:
    rst: Indicates the maximum number of bytes that could have been loaded from,  stored
      to, or atomically updated in the LDS per  :ref:`normalization unit <normalization-units>`.
      Does *not* take into  account the execution mask of the wavefront when the instruction
      was  executed. See the  :ref:`LDS bandwidth example <lds-bandwidth>` for more
      detail.
    unit: Bytes per normalization unit
  Bank Conflict Rate:
    rst: Indicates the percentage of active LDS cycles that were spent servicing  bank
      conflicts. Calculated as the ratio of LDS cycles spent servicing  bank conflicts
      over the number of LDS cycles that would have been  required to move the same
      amount of data in an uncontended access. [#lds-bank-conflict]_
    unit: Percent
  LDS Instructions:
    rst: The total number of LDS instructions (including, but not limited to,  read/write/atomics
      and HIP's ``__shfl`` instructions) executed per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  LDS Latency:
    rst: The average number of round-trip cycles (that is, from issue to data return  /
      acknowledgment) required for an LDS instruction to complete.
    unit: Cycles
  Bank Conflicts/Access:
    rst: The ratio of the number of cycles spent in the  :ref:`LDS scheduler <desc-lds>`
      due to bank conflicts (as determined by  the conflict resolution hardware) to
      the base number of cycles that would  be spent in the LDS scheduler in a completely
      uncontended case. This is  the unnormalized form of the Bank Conflict Rate.
    unit: Conflicts per Access
  Index Accesses:
    rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>`  over
      all operations per :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
  Atomic Return Cycles:
    rst: The total number of cycles spent on LDS atomics with return per  :ref:`normalization
      unit <normalization-units>`.
    unit: Cycles per normalization unit
  Bank Conflict:
    rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>`  due
      to bank conflicts (as determined by the conflict resolution hardware)  per :ref:`normalization
      unit <normalization-units>`.
    unit: Cycles per normalization unit
  Addr Conflict:
    rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>`  due
      to address conflicts (as determined by the conflict resolution  hardware) per
      :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
  Unaligned Stall:
    rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>`  due
      to stalls from non-dword aligned addresses per  :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
  Mem Violations:
    rst: "The total number of out-of-bounds accesses made to the LDS, per  :ref:`normalization\
      \ unit <normalization-units>`. This is unused and  expected to be zero in most\
      \ configurations for modern CDNA\u2122 accelerators."
    unit: Accesses per normalization unit
LDS Statistics:
  Utilization:
    rst: Indicates what percent of the kernel's duration the :ref:`LDS <desc-lds>`  was
      actively executing instructions (including, but not limited to, load,  store,
      atomic and HIP's ``__shfl`` operations).  Calculated as the ratio  of the total
      number of cycles LDS was active over the  :ref:`total CU cycles <total-cu-cycles>`.
    unit: Percent
  Access Rate:
    rst: Indicates the percentage of SIMDs in the :ref:`VALU <desc-valu>` [#lds-workload]_
      actively issuing LDS instructions, averaged over the lifetime of the kernel.
      Calculated as the ratio of the total number of cycles spent by the :ref:`scheduler
      <desc-scheduler>` issuing :ref:`LDS <desc-lds>` instructions over the :ref:`total
      CU cycles <total-cu-cycles>`.
    unit: Percent
  Theoretical Bandwidth:
    rst: Indicates the maximum number of bytes that could have been loaded from,  stored
      to, or atomically updated in the LDS per  :ref:`normalization unit <normalization-units>`.
      Does *not* take into  account the execution mask of the wavefront when the instruction
      was  executed. See the  :ref:`LDS bandwidth example <lds-bandwidth>` for more
      detail.
    unit: Bytes per normalization unit
  Bank Conflict Rate:
    rst: Indicates the percentage of active LDS cycles that were spent servicing  bank
      conflicts. Calculated as the ratio of LDS cycles spent servicing  bank conflicts
      over the number of LDS cycles that would have been  required to move the same
      amount of data in an uncontended access. [#lds-bank-conflict]_
    unit: Percent
  LDS Instructions:
    rst: The total number of LDS instructions (including, but not limited to,  read/write/atomics
      and HIP's ``__shfl`` instructions) executed per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  LDS Latency:
    rst: The average number of round-trip cycles (that is, from issue to data return  /
      acknowledgment) required for an LDS instruction to complete.
    unit: Cycles
  Bank Conflicts/Access:
    rst: The ratio of the number of cycles spent in the  :ref:`LDS scheduler <desc-lds>`
      due to bank conflicts (as determined by  the conflict resolution hardware) to
      the base number of cycles that would  be spent in the LDS scheduler in a completely
      uncontended case. This is  the unnormalized form of the Bank Conflict Rate.
    unit: Conflicts per Access
  Index Accesses:
    rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>`  over
      all operations per :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
  Atomic Return Cycles:
    rst: The total number of cycles spent on LDS atomics with return per  :ref:`normalization
      unit <normalization-units>`.
    unit: Cycles per normalization unit
  Bank Conflict:
    rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>`  due
      to bank conflicts (as determined by the conflict resolution hardware)  per :ref:`normalization
      unit <normalization-units>`.
    unit: Cycles per normalization unit
  Addr Conflict:
    rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>`  due
      to address conflicts (as determined by the conflict resolution  hardware) per
      :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
  Unaligned Stall:
    rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>`  due
      to stalls from non-dword aligned addresses per  :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
  Mem Violations:
    rst: "The total number of out-of-bounds accesses made to the LDS, per  :ref:`normalization\
      \ unit <normalization-units>`. This is unused and  expected to be zero in most\
      \ configurations for modern CDNA\u2122 accelerators."
    unit: Accesses per normalization unit
vL1D Speed-of-Light:
  Hit rate:
    rst: The ratio of the number of vL1D cache line requests that hit [#vl1d-hit]_  in
      vL1D cache over the total number of cache line requests to the  :ref:`vL1D Cache
      RAM <desc-tc>`.
    unit: Percent
  Bandwidth:
    rst: The number of bytes looked up in the vL1D cache as a result of  :ref:`VMEM
      <desc-vmem>` instructions, as a percent of the peak  theoretical bandwidth achievable
      on the specific accelerator. The number  of bytes is calculated as the number
      of cache lines requested multiplied  by the cache line size. This value does
      not consider partial requests, so  for instance, if only a single value is requested
      in a cache line, the  data movement will still be counted as a full cache line.
    unit: Percent
  Utilization:
    rst: Indicates how busy the :ref:`vL1D Cache RAM <desc-tc>` was during the  kernel
      execution. The number of cycles where the vL1D Cache RAM is  actively processing
      any request divided by the number of cycles where the  vL1D is active [#vl1d-activity]_.
    unit: Percent
  Coalescing:
    rst: Indicates how well memory instructions were coalesced by the  :ref:`address
      processing unit <desc-ta>`, ranging from uncoalesced (25%)  to fully coalesced
      (100%). Calculated as the average number of  :ref:`thread-requests <thread-requests>`
      generated per instruction  divided by the ideal number of thread-requests per
      instruction.
    unit: Percent
  Stalled on L2 Data:
    rst: The ratio of the number of cycles where the vL1D is stalled waiting for  requested
      data to return from the :doc:`L2 cache <l2-cache>` divided by  the number of
      cycles where the vL1D is active [#vl1d-activity]_.
    unit: Percent
  Stalled on L2 Req:
    rst: The ratio of the number of cycles where the vL1D is stalled waiting to  issue
      a request for data to the :doc:`L2 cache <l2-cache>` divided by the  number
      of cycles where the vL1D is active [#vl1d-activity]_.
    unit: Percent
  Tag RAM Stall (Read):
    rst: The ratio of the number of cycles where the vL1D is stalled due to Read requests
      with conflicting tags being looked up  concurrently, divided by the number of
      cycles where the  vL1D is active [#vl1d-activity]_.
    unit: Percent
  Tag RAM Stall (Write):
    rst: The ratio of the number of cycles where the vL1D is stalled due to Write
      requests with conflicting tags being looked up  concurrently, divided by the
      number of cycles where the  vL1D is active [#vl1d-activity]_.
    unit: Percent
  Tag RAM Stall (Atomic):
    rst: The ratio of the number of cycles where the vL1D is stalled due to Atomic
      requests with conflicting tags being looked up  concurrently, divided by the
      number of cycles where the  vL1D is active [#vl1d-activity]_.
    unit: Percent
  Total Req:
    rst: The total number of incoming requests from the  :ref:`address processing
      unit <desc-ta>` after coalescing.
    unit: Requests
  Read Req:
    rst: The total number of incoming read requests from the  :ref:`address processing
      unit <desc-ta>` after coalescing per  :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Write Req:
    rst: The total number of incoming write requests from the  :ref:`address processing
      unit <desc-ta>` after coalescing per  :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Atomic Req:
    rst: The total number of incoming atomic requests from the  :ref:`address processing
      unit <desc-ta>` after coalescing per  :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Cache BW:
    rst: The number of bytes looked up in the vL1D cache as a result of  :ref:`VMEM
      <desc-vmem>` instructions per  :ref:`normalization unit <normalization-units>`.  The
      number of bytes is  calculated as the number of cache lines requested multiplied
      by the cache  line size.  This value does not consider partial requests, so
      for  instance, if only a single value is requested in a cache line, the data  movement
      will still be counted as a full cache line.
    unit: Bytes per normalization unit
  Cache Hit Rate:
    rst: The ratio of the number of vL1D cache line requests that hit in vL1D  cache
      over the total number of cache line requests to the  :ref:`vL1D Cache RAM <desc-tc>`.
    unit: Percent
  Cache Accesses:
    rst: The total number of cache line lookups in the vL1D.
    unit: Cache lines
  Cache Hits:
    rst: The number of cache accesses minus the number of outgoing requests to the  :doc:`L2
      cache <l2-cache>`, that is, the number of cache line requests  serviced by the
      :ref:`vL1D Cache RAM <desc-tc>` per  :ref:`normalization unit <normalization-units>`.
    unit: Cache lines per normalization unit
  Invalidations:
    rst: The number of times the vL1D was issued a write-back invalidate command  during
      the kernel's execution per  :ref:`normalization unit <normalization-units>`.  This
      may be triggered  by, for instance, the ``buffer_wbinvl1`` instruction.
    unit: Invalidations per normalization unit
  L1-L2 BW:
    rst: The number of bytes transferred across the vL1D-L2 interface as a result  of
      :ref:`VMEM <desc-vmem>` instructions, per  :ref:`normalization unit <normalization-units>`.
      The number of bytes is  calculated as the number of cache lines requested multiplied
      by the cache  line size. This value does not consider partial requests, so for  instance,
      if only a single value is requested in a cache line, the data  movement will
      still be counted as a full cache line.
    unit: Bytes per normalization unit
  L1-L2 Read:
    rst: The number of read requests for a vL1D cache line that were not satisfied  by
      the vL1D and must be retrieved from the  :doc:`L2 Cache <l2-cache>` per  :ref:`normalization
      unit <normalization-units>`.
    unit: Requests per normalization unit
  L1-L2 Write:
    rst: The number of write requests to a vL1D cache line that were sent through  the
      vL1D to the :doc:`L2 cache <l2-cache>`, per  :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  L1-L2 Atomic:
    rst: The number of atomic requests that are sent through the vL1D to the  :doc:`L2
      cache <l2-cache>`, per  :ref:`normalization unit <normalization-units>`. This
      includes requests  for atomics with and without return.
    unit: Requests per normalization unit
  L1 Access Latency:
    rst: Calculated as the average number of cycles that a vL1D cache line request
      spent in the vL1D cache pipeline.
    unit: Cycles
  L1-L2 Read Latency:
    rst: Calculated as the average number of cycles between the vL1D cache issuing
      a read request to the :doc:`L2 Cache <l2-cache>` and receiving the requested
      data. This number also includes requests for atomics with return values.
    unit: Cycles
  L1-L2 Write Latency:
    rst: Calculated as the average number of cycles that the vL1D cache took to  issue
      and receive acknowledgement of a write request to the  :doc:`L2 Cache <l2-cache>`.
      This number also includes requests for  atomics without return values.
    unit: Cycles
  NC - Read:
    rst: Total read requests with NC mtype from this TCP to all TCCs, summed over
      TCP instances, per normalization unit.
    unit: Requests per normalization unit
  UC - Read:
    rst: Total read requests with UC mtype from this TCP to all TCCs, summed over
      TCP instances, per normalization unit.
    unit: Requests per normalization unit
  CC - Read:
    rst: Total read requests with CC mtype from this TCP to all TCCs, summed over
      TCP instances, per normalization unit.
    unit: Requests per normalization unit
  RW - Read:
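    rst: Total read requests with RW mtype from this TCP to all TCCs, summed over
      TCP instances, per normalization unit.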
    unit: Requests per normalization unit
  RW - Write:
    rst: Total write requests with RW mtype from this TCP to all TCCs, summed over
      TCP instances, per normalization unit.
    unit: Requests per normalization unit
  NC - Write:
    rst: Total write requests with NC mtype from this TCP to all TCCs, summed over
      TCP instances, per normalization unit.
    unit: Requests per normalization unit
  UC - Write:
    rst: Total write requests with UC mtype from this TCP to all TCCs, summed over
      TCP instances, per normalization unit.
    unit: Requests per normalization unit
  CC - Write:
    rst: Total write requests with CC mtype from this TCP to all TCCs, summed over
      TCP instances, per normalization unit.
    unit: Requests per normalization unit
  NC - Atomic:
    rst: Total atomic requests with NC mtype from this TCP to all TCCs, summed over
      TCP instances, per normalization unit.
    unit: Requests per normalization unit
  UC - Atomic:
    rst: Total atomic requests with UC mtype from this TCP to all TCCs, summed over
      TCP instances, per normalization unit.
    unit: Requests per normalization unit
  CC - Atomic:
    rst: Total atomic requests with CC mtype from this TCP to all TCCs, summed over
      TCP instances, per normalization unit.
    unit: Requests per normalization unit
  RW - Atomic:
    rst: Total atomic requests with RW mtype from this TCP to all TCCs, summed over
      TCP instances, per normalization unit.
    unit: Requests per normalization unit
  Req:
    rst: The number of translation requests made to the UTCL1 per normalization unit.
    unit: Requests per normalization unit
  Hit Ratio:
    rst: The ratio of the number of translation requests that hit in the UTCL1 divided
      by the total number of translation requests made to the UTCL1.
    unit: Percent
  Hits:
    rst: The number of translation requests that hit in the UTCL1, and could be reused,
      per normalization unit.
    unit: Requests per normalization unit
  Translation Misses:
    rst: The total number of translation requests that missed in the UTCL1 due to  translation
      not being present in the cache, per  :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Permission Misses:
    rst: "The total number of translation requests that missed in the UTCL1 due to\
      \  a permission error, per :ref:`normalization unit <normalization-units>`.\
      \  This is unused and expected to be zero in most configurations for modern\
      \  CDNA\u2122 accelerators."
    unit: Requests per normalization unit
Busy / stall metrics:
  Address Processing Unit Busy:
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address  processor
      was busy.
    unit: Percent
  Address Stall:
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address  processor
      was stalled from sending address requests further into the vL1D  pipeline.
    unit: Percent
  Data Stall:
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address  processor
      was stalled from sending write/atomic data further into the  vL1D pipeline.
    unit: Percent
  "Data-Processor \u2192 Address Stall":
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address processor  was
      stalled waiting to send command data to the  :ref:`data processor <desc-td>`.
    unit: Percent
  Total Instructions:
    rst: The total number of memory instructions executed by the address processor
      over all compute units on the accelerator, per normalization unit.
    unit: Instructions per normalization unit
  Global/Generic Instructions:
    rst: The total number of global & generic memory instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Global/Generic Read Instructions:
    rst: The total number of global & generic memory read instructions executed on  all
      :doc:`compute units <compute-unit>` on the accelerator, per  :ref:`normalization
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  Global/Generic Write Instructions:
    rst: The total number of global & generic memory write instructions executed  on
      all :doc:`compute units <compute-unit>` on the accelerator, per  :ref:`normalization
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  Global/Generic Atomic Instructions:
    rst: The total number of global & generic memory atomic (with and without  return)
      instructions executed on all :doc:`compute units <compute-unit>`  on the accelerator,
      per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Instructions:
    rst: The total number of spill/stack memory instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Read Instructions:
    rst: The total number of spill/stack memory read instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Write Instructions:
    rst: The total number of spill/stack memory write instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Atomic Instructions:
    rst: The total number of spill/stack memory atomic (with and without return)  instructions
      executed on all :doc:`compute units <compute-unit>` on the  accelerator, per
      :ref:`normalization unit <normalization-units>`.  Typically unused as these
      memory operations are generally used to  implement thread-local storage.
    unit: Instructions per normalization unit
  Spill/Stack Total Cycles:
    rst: The number of cycles the address processing unit spent working on  spill/stack
      instructions, per  :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
  Spill/Stack Coalesced Read:
    rst: The number of cycles the address processing unit spent working on  coalesced
      spill/stack read instructions, per  :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
  Spill/Stack Coalesced Write:
    rst: The number of cycles the address processing unit spent working on  coalesced
      spill/stack write instructions, per  :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
  Data-Return Busy:
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return  unit
      was busy processing or waiting on data to return to the  :doc:`CU <compute-unit>`.
    unit: Percent
  "Cache RAM \u2192 Data-Return Stall":
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return  unit
      was stalled on data to be returned from the  :ref:`vL1D Cache RAM <desc-tc>`.
    unit: Percent
  "Workgroup manager \u2192 Data-Return Stall":
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return  unit
      was stalled by the :ref:`workgroup manager <desc-spi>` due to  initialization
      of registers as a part of launching new workgroups.
    unit: Percent
  Coalescable Instructions:
    rst: The number of instructions submitted to the  :ref:`data-return unit <desc-td>`
      by the  :ref:`address processor <desc-ta>` that were found to be coalescable,
      per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Read Instructions:
    rst: The number of read instructions submitted to the  :ref:`data-return unit
      <desc-td>` by the  :ref:`address processor <desc-ta>` summed over all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
      This is expected to be  the sum of global/generic and spill/stack reads in the  :ref:`address
      processor <desc-ta>`.
    unit: Instructions per normalization unit
  Write Instructions:
    rst: The number of store instructions submitted to the  :ref:`data-return unit
      <desc-td>` by the  :ref:`address processor <desc-ta>` summed over all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
      This is expected to be  the sum of global/generic and spill/stack stores counted
      by the  :ref:`vL1D cache-front-end <ta-instruction-counts>`.
    unit: Instructions per normalization unit
  Atomic Instructions:
    rst: The number of atomic instructions submitted to the  :ref:`data-return unit
      <desc-td>` by the  :ref:`address processor <desc-ta>` summed over all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
      This is expected to be  the sum of global/generic and spill/stack atomics in
      the  :ref:`address processor <desc-ta>`.
    unit: Instructions per normalization unit
Instruction counts:
  Address Processing Unit Busy:
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address  processor
      was busy.
    unit: Percent
  Address Stall:
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address  processor
      was stalled from sending address requests further into the vL1D  pipeline.
    unit: Percent
  Data Stall:
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address  processor
      was stalled from sending write/atomic data further into the  vL1D pipeline.
    unit: Percent
  "Data-Processor \u2192 Address Stall":
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address processor  was
      stalled waiting to send command data to the  :ref:`data processor <desc-td>`.
    unit: Percent
  Total Instructions:
    rst: The total number of memory instructions executed by the address processor
      over all compute units on the accelerator, per normalization unit.
    unit: Instructions per normalization unit
  Global/Generic Instructions:
    rst: The total number of global & generic memory instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Global/Generic Read Instructions:
    rst: The total number of global & generic memory read instructions executed on  all
      :doc:`compute units <compute-unit>` on the accelerator, per  :ref:`normalization
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  Global/Generic Write Instructions:
    rst: The total number of global & generic memory write instructions executed  on
      all :doc:`compute units <compute-unit>` on the accelerator, per  :ref:`normalization
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  Global/Generic Atomic Instructions:
    rst: The total number of global & generic memory atomic (with and without  return)
      instructions executed on all :doc:`compute units <compute-unit>`  on the accelerator,
      per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Instructions:
    rst: The total number of spill/stack memory instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Read Instructions:
    rst: The total number of spill/stack memory read instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Write Instructions:
    rst: The total number of spill/stack memory write instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Atomic Instructions:
    rst: The total number of spill/stack memory atomic (with and without return)  instructions
      executed on all :doc:`compute units <compute-unit>` on the  accelerator, per
      :ref:`normalization unit <normalization-units>`.  Typically unused as these
      memory operations are generally used to  implement thread-local storage.
    unit: Instructions per normalization unit
  Spill/Stack Total Cycles:
    rst: The number of cycles the address processing unit spent working on  spill/stack
      instructions, per  :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
  Spill/Stack Coalesced Read:
    rst: The number of cycles the address processing unit spent working on  coalesced
      spill/stack read instructions, per  :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
  Spill/Stack Coalesced Write:
    rst: The number of cycles the address processing unit spent working on  coalesced
      spill/stack write instructions, per  :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
  Data-Return Busy:
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return  unit
      was busy processing or waiting on data to return to the  :doc:`CU <compute-unit>`.
    unit: Percent
  "Cache RAM \u2192 Data-Return Stall":
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return  unit
      was stalled on data to be returned from the  :ref:`vL1D Cache RAM <desc-tc>`.
    unit: Percent
  "Workgroup manager \u2192 Data-Return Stall":
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return  unit
      was stalled by the :ref:`workgroup manager <desc-spi>` due to  initialization
      of registers as a part of launching new workgroups.
    unit: Percent
  Coalescable Instructions:
    rst: The number of instructions submitted to the  :ref:`data-return unit <desc-td>`
      by the  :ref:`address processor <desc-ta>` that were found to be coalescable,
      per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Read Instructions:
    rst: The number of read instructions submitted to the  :ref:`data-return unit
      <desc-td>` by the  :ref:`address processor <desc-ta>` summed over all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
      This is expected to be  the sum of global/generic and spill/stack reads in the  :ref:`address
      processor <desc-ta>`.
    unit: Instructions per normalization unit
  Write Instructions:
    rst: The number of store instructions submitted to the  :ref:`data-return unit
      <desc-td>` by the  :ref:`address processor <desc-ta>` summed over all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
      This is expected to be  the sum of global/generic and spill/stack stores counted
      by the  :ref:`vL1D cache-front-end <ta-instruction-counts>`.
    unit: Instructions per normalization unit
  Atomic Instructions:
    rst: The number of atomic instructions submitted to the  :ref:`data-return unit
      <desc-td>` by the  :ref:`address processor <desc-ta>` summed over all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
      This is expected to be  the sum of global/generic and spill/stack atomics in
      the  :ref:`address processor <desc-ta>`.
    unit: Instructions per normalization unit
Spill / stack metrics:
  Address Processing Unit Busy:
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address  processor
      was busy.
    unit: Percent
  Address Stall:
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address  processor
      was stalled from sending address requests further into the vL1D  pipeline.
    unit: Percent
  Data Stall:
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address  processor
      was stalled from sending write/atomic data further into the  vL1D pipeline.
    unit: Percent
  "Data-Processor \u2192 Address Stall":
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address processor  was
      stalled waiting to send command data to the  :ref:`data processor <desc-td>`.
    unit: Percent
  Total Instructions:
    rst: The total number of memory instructions executed by the address processor
      over all compute units on the accelerator, per normalization unit.
    unit: Instructions per normalization unit
  Global/Generic Instructions:
    rst: The total number of global & generic memory instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Global/Generic Read Instructions:
    rst: The total number of global & generic memory read instructions executed on  all
      :doc:`compute units <compute-unit>` on the accelerator, per  :ref:`normalization
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  Global/Generic Write Instructions:
    rst: The total number of global & generic memory write instructions executed  on
      all :doc:`compute units <compute-unit>` on the accelerator, per  :ref:`normalization
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  Global/Generic Atomic Instructions:
    rst: The total number of global & generic memory atomic (with and without  return)
      instructions executed on all :doc:`compute units <compute-unit>`  on the accelerator,
      per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Instructions:
    rst: The total number of spill/stack memory instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Read Instructions:
    rst: The total number of spill/stack memory read instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Write Instructions:
    rst: The total number of spill/stack memory write instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Atomic Instructions:
    rst: The total number of spill/stack memory atomic (with and without return)  instructions
      executed on all :doc:`compute units <compute-unit>` on the  accelerator, per
      :ref:`normalization unit <normalization-units>`.  Typically unused, as these
      memory operations are generally used to  implement thread-local storage.
    unit: Instructions per normalization unit
  Spill/Stack Total Cycles:
    rst: The number of cycles the address processing unit spent working on  spill/stack
      instructions, per  :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
  Spill/Stack Coalesced Read:
    rst: The number of cycles the address processing unit spent working on  coalesced
      spill/stack read instructions, per  :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
  Spill/Stack Coalesced Write:
    rst: The number of cycles the address processing unit spent working on  coalesced
      spill/stack write instructions, per  :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
  Data-Return Busy:
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return  unit
      was busy processing or waiting on data to return to the  :doc:`CU <compute-unit>`.
    unit: Percent
  "Cache RAM \u2192 Data-Return Stall":
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return  unit
      was stalled on data to be returned from the  :ref:`vL1D Cache RAM <desc-tc>`.
    unit: Percent
  "Workgroup manager \u2192 Data-Return Stall":
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return  unit
      was stalled by the :ref:`workgroup manager <desc-spi>` due to  initialization
      of registers as a part of launching new workgroups.
    unit: Percent
  Coalescable Instructions:
    rst: The number of instructions submitted to the  :ref:`data-return unit <desc-td>`
      by the  :ref:`address processor <desc-ta>` that were found to be coalescable,
      per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Read Instructions:
    rst: The number of read instructions submitted to the  :ref:`data-return unit
      <desc-td>` by the  :ref:`address processor <desc-ta>` summed over all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
      This is expected to be  the sum of global/generic and spill/stack reads in the  :ref:`address
      processor <desc-ta>`.
    unit: Instructions per normalization unit
  Write Instructions:
    rst: The number of store instructions submitted to the  :ref:`data-return unit
      <desc-td>` by the  :ref:`address processor <desc-ta>` summed over all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
      This is expected to be  the sum of global/generic and spill/stack stores counted
      by the  :ref:`vL1D cache-front-end <ta-instruction-counts>`.
    unit: Instructions per normalization unit
  Atomic Instructions:
    rst: The number of atomic instructions submitted to the  :ref:`data-return unit
      <desc-td>` by the  :ref:`address processor <desc-ta>` summed over all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
      This is expected to be  the sum of global/generic and spill/stack atomics in
      the  :ref:`address processor <desc-ta>`.
    unit: Instructions per normalization unit
L1 Unified Translation Cache (UTCL1):
  Hit rate:
    rst: The ratio of the number of vL1D cache line requests that hit [#vl1d-hit]_  in
      vL1D cache over the total number of cache line requests to the  :ref:`vL1D Cache
      RAM <desc-tc>`.
    unit: Percent
  Bandwidth:
    rst: The number of bytes looked up in the vL1D cache as a result of  :ref:`VMEM
      <desc-vmem>` instructions, as a percent of the peak  theoretical bandwidth achievable
      on the specific accelerator. The number  of bytes is calculated as the number
      of cache lines requested multiplied  by the cache line size. This value does
      not consider partial requests, so  for instance, if only a single value is requested
      in a cache line, the  data movement will still be counted as a full cache line.
    unit: Percent
  Utilization:
    rst: Indicates how busy the :ref:`vL1D Cache RAM <desc-tc>` was during the  kernel
      execution. The number of cycles where the vL1D Cache RAM is  actively processing
      any request divided by the number of cycles where the  vL1D is active [#vl1d-activity]_.
    unit: Percent
  Coalescing:
    rst: Indicates how well memory instructions were coalesced by the  :ref:`address
      processing unit <desc-ta>`, ranging from uncoalesced (25%)  to fully coalesced
      (100%). Calculated as the average number of  :ref:`thread-requests <thread-requests>`
      generated per instruction  divided by the ideal number of thread-requests per
      instruction.
    unit: Percent
  Stalled on L2 Data:
    rst: The ratio of the number of cycles where the vL1D is stalled waiting for  requested
      data to return from the :doc:`L2 cache <l2-cache>` divided by  the number of
      cycles where the vL1D is active [#vl1d-activity]_.
    unit: Percent
  Stalled on L2 Req:
    rst: The ratio of the number of cycles where the vL1D is stalled waiting to  issue
      a request for data to the :doc:`L2 cache <l2-cache>` divided by the  number
      of cycles where the vL1D is active [#vl1d-activity]_.
    unit: Percent
  Tag RAM Stall (Read):
    rst: The ratio of the number of cycles where the vL1D is stalled due to Read requests
      with conflicting tags being looked up  concurrently, divided by the number of
      cycles where the  vL1D is active [#vl1d-activity]_.
    unit: Percent
  Tag RAM Stall (Write):
    rst: The ratio of the number of cycles where the vL1D is stalled due to Write
      requests with conflicting tags being looked up  concurrently, divided by the
      number of cycles where the  vL1D is active [#vl1d-activity]_.
    unit: Percent
  Tag RAM Stall (Atomic):
    rst: The ratio of the number of cycles where the vL1D is stalled due to Atomic
      requests with conflicting tags being looked up  concurrently, divided by the
      number of cycles where the  vL1D is active [#vl1d-activity]_.
    unit: Percent
  Total Req:
    rst: The total number of incoming requests from the  :ref:`address processing
      unit <desc-ta>` after coalescing.
    unit: Requests
  Read Req:
    rst: The total number of incoming read requests from the  :ref:`address processing
      unit <desc-ta>` after coalescing, per  :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Write Req:
    rst: The total number of incoming write requests from the  :ref:`address processing
      unit <desc-ta>` after coalescing, per  :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Atomic Req:
    rst: The total number of incoming atomic requests from the  :ref:`address processing
      unit <desc-ta>` after coalescing, per  :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Cache BW:
    rst: The number of bytes looked up in the vL1D cache as a result of  :ref:`VMEM
      <desc-vmem>` instructions per  :ref:`normalization unit <normalization-units>`.  The
      number of bytes is  calculated as the number of cache lines requested multiplied
      by the cache  line size.  This value does not consider partial requests, so
      for  instance, if only a single value is requested in a cache line, the data  movement
      will still be counted as a full cache line.
    unit: Bytes per normalization unit
  Cache Hit Rate:
    rst: The ratio of the number of vL1D cache line requests that hit in vL1D  cache
      over the total number of cache line requests to the  :ref:`vL1D Cache RAM <desc-tc>`.
    unit: Percent
  Cache Accesses:
    rst: The total number of cache line lookups in the vL1D.
    unit: Cache lines
  Cache Hits:
    rst: The number of cache accesses minus the number of outgoing requests to the  :doc:`L2
      cache <l2-cache>`, that is, the number of cache line requests  serviced by the
      :ref:`vL1D Cache RAM <desc-tc>` per  :ref:`normalization unit <normalization-units>`.
    unit: Cache lines per normalization unit
  Invalidations:
    rst: The number of times the vL1D was issued a write-back invalidate command  during
      the kernel's execution per  :ref:`normalization unit <normalization-units>`.  This
      may be triggered  by, for instance, the ``buffer_wbinvl1`` instruction.
    unit: Invalidations per normalization unit
  L1-L2 BW:
    rst: The number of bytes transferred across the vL1D-L2 interface as a result  of
      :ref:`VMEM <desc-vmem>` instructions, per  :ref:`normalization unit <normalization-units>`.
      The number of bytes is  calculated as the number of cache lines requested multiplied
      by the cache  line size. This value does not consider partial requests, so for  instance,
      if only a single value is requested in a cache line, the data  movement will
      still be counted as a full cache line.
    unit: Bytes per normalization unit
  L1-L2 Read:
    rst: The number of read requests for a vL1D cache line that were not satisfied  by
      the vL1D and must be retrieved from the  :doc:`L2 Cache <l2-cache>` per  :ref:`normalization
      unit <normalization-units>`.
    unit: Requests per normalization unit
  L1-L2 Write:
    rst: The number of write requests to a vL1D cache line that were sent through  the
      vL1D to the :doc:`L2 cache <l2-cache>`, per  :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  L1-L2 Atomic:
    rst: The number of atomic requests that are sent through the vL1D to the  :doc:`L2
      cache <l2-cache>`, per  :ref:`normalization unit <normalization-units>`. This
      includes requests  for atomics with and without return.
    unit: Requests per normalization unit
  L1 Access Latency:
    rst: Calculated as the average number of cycles that a vL1D cache line request
      spent in the vL1D cache pipeline.
    unit: Cycles
  L1-L2 Read Latency:
    rst: Calculated as the average number of cycles that the vL1D cache took to  issue
      and receive read requests from the :doc:`L2 Cache <l2-cache>`. This  number
      also includes requests for atomics with return values.
    unit: Cycles
  L1-L2 Write Latency:
    rst: Calculated as the average number of cycles that the vL1D cache took to  issue
      and receive acknowledgement of a write request to the  :doc:`L2 Cache <l2-cache>`.
      This number also includes requests for  atomics without return values.
    unit: Cycles
  NC - Read:
    rst: Total read requests with NC mtype from this TCP to all TCCs, summed over TCP
      instances, per normalization unit.
    unit: Requests per normalization unit
  UC - Read:
    rst: Total read requests with UC mtype from this TCP to all TCCs, summed over TCP
      instances, per normalization unit.
    unit: Requests per normalization unit
  CC - Read:
    rst: Total read requests with CC mtype from this TCP to all TCCs, summed over TCP
      instances, per normalization unit.
    unit: Requests per normalization unit
  RW - Read:
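    rst: Total read requests with RW mtype from this TCP to all TCCs, summed over TCP
      instances, per normalization unit.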
    unit: Requests per normalization unit
  RW - Write:
    rst: Total write requests with RW mtype from this TCP to all TCCs, summed over TCP
      instances, per normalization unit.
    unit: Requests per normalization unit
  NC - Write:
    rst: Total write requests with NC mtype from this TCP to all TCCs, summed over TCP
      instances, per normalization unit.
    unit: Requests per normalization unit
  UC - Write:
    rst: Total write requests with UC mtype from this TCP to all TCCs, summed over TCP
      instances, per normalization unit.
    unit: Requests per normalization unit
  CC - Write:
    rst: Total write requests with CC mtype from this TCP to all TCCs, summed over TCP
      instances, per normalization unit.
    unit: Requests per normalization unit
  NC - Atomic:
    rst: Total atomic requests with NC mtype from this TCP to all TCCs, summed over TCP
      instances, per normalization unit.
    unit: Requests per normalization unit
  UC - Atomic:
    rst: Total atomic requests with UC mtype from this TCP to all TCCs, summed over TCP
      instances, per normalization unit.
    unit: Requests per normalization unit
  CC - Atomic:
    rst: Total atomic requests with CC mtype from this TCP to all TCCs, summed over TCP
      instances, per normalization unit.
    unit: Requests per normalization unit
  RW - Atomic:
    rst: Total atomic requests with RW mtype from this TCP to all TCCs, summed over TCP
      instances, per normalization unit.
    unit: Requests per normalization unit
  Req:
    rst: The number of translation requests made to the UTCL1 per normalization unit.
    unit: Requests per normalization unit
  Hit Ratio:
    rst: The ratio of the number of translation requests that hit in the UTCL1 divided
      by the total number of translation requests made to the UTCL1.
    unit: Percent
  Hits:
    rst: The number of translation requests that hit in the UTCL1, and could be reused,
      per normalization unit.
    unit: Requests per normalization unit
  Translation Misses:
    rst: The total number of translation requests that missed in the UTCL1 due to  translation
      not being present in the cache, per  :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Permission Misses:
    rst: "The total number of translation requests that missed in the UTCL1 due to\
      \  a permission error, per :ref:`normalization unit <normalization-units>`.\
      \  This is unused and expected to be zero in most configurations for modern\
      \  CDNA\u2122 accelerators."
    unit: Requests per normalization unit
vL1D cache stall metrics:
  Hit rate:
    rst: The ratio of the number of vL1D cache line requests that hit [#vl1d-hit]_  in
      vL1D cache over the total number of cache line requests to the  :ref:`vL1D Cache
      RAM <desc-tc>`.
    unit: Percent
  Bandwidth:
    rst: The number of bytes looked up in the vL1D cache as a result of  :ref:`VMEM
      <desc-vmem>` instructions, as a percent of the peak  theoretical bandwidth achievable
      on the specific accelerator. The number  of bytes is calculated as the number
      of cache lines requested multiplied  by the cache line size. This value does
      not consider partial requests, so  for instance, if only a single value is requested
      in a cache line, the  data movement will still be counted as a full cache line.
    unit: Percent
  Utilization:
    rst: Indicates how busy the :ref:`vL1D Cache RAM <desc-tc>` was during the  kernel
      execution. The number of cycles where the vL1D Cache RAM is  actively processing
      any request divided by the number of cycles where the  vL1D is active [#vl1d-activity]_.
    unit: Percent
  Coalescing:
    rst: Indicates how well memory instructions were coalesced by the  :ref:`address
      processing unit <desc-ta>`, ranging from uncoalesced (25%)  to fully coalesced
      (100%). Calculated as the average number of  :ref:`thread-requests <thread-requests>`
      generated per instruction  divided by the ideal number of thread-requests per
      instruction.
    unit: Percent
  Stalled on L2 Data:
    rst: The ratio of the number of cycles where the vL1D is stalled waiting for  requested
      data to return from the :doc:`L2 cache <l2-cache>` divided by  the number of
      cycles where the vL1D is active [#vl1d-activity]_.
    unit: Percent
  Stalled on L2 Req:
    rst: The ratio of the number of cycles where the vL1D is stalled waiting to  issue
      a request for data to the :doc:`L2 cache <l2-cache>` divided by the  number
      of cycles where the vL1D is active [#vl1d-activity]_.
    unit: Percent
  Tag RAM Stall (Read):
    rst: The ratio of the number of cycles where the vL1D is stalled due to Read requests
      with conflicting tags being looked up  concurrently, divided by the number of
      cycles where the  vL1D is active [#vl1d-activity]_.
    unit: Percent
  Tag RAM Stall (Write):
    rst: The ratio of the number of cycles where the vL1D is stalled due to Write
      requests with conflicting tags being looked up  concurrently, divided by the
      number of cycles where the  vL1D is active [#vl1d-activity]_.
    unit: Percent
  Tag RAM Stall (Atomic):
    rst: The ratio of the number of cycles where the vL1D is stalled due to Atomic
      requests with conflicting tags being looked up  concurrently, divided by the
      number of cycles where the  vL1D is active [#vl1d-activity]_.
    unit: Percent
  Total Req:
    rst: The total number of incoming requests from the  :ref:`address processing
      unit <desc-ta>` after coalescing.
    unit: Requests
  Read Req:
    rst: The total number of incoming read requests from the  :ref:`address processing
      unit <desc-ta>` after coalescing, per  :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Write Req:
    rst: The total number of incoming write requests from the  :ref:`address processing
      unit <desc-ta>` after coalescing, per  :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Atomic Req:
    rst: The total number of incoming atomic requests from the  :ref:`address processing
      unit <desc-ta>` after coalescing, per  :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Cache BW:
    rst: The number of bytes looked up in the vL1D cache as a result of  :ref:`VMEM
      <desc-vmem>` instructions per  :ref:`normalization unit <normalization-units>`.  The
      number of bytes is  calculated as the number of cache lines requested multiplied
      by the cache  line size.  This value does not consider partial requests, so
      for  instance, if only a single value is requested in a cache line, the data  movement
      will still be counted as a full cache line.
    unit: Bytes per normalization unit
  Cache Hit Rate:
    rst: The ratio of the number of vL1D cache line requests that hit in vL1D  cache
      over the total number of cache line requests to the  :ref:`vL1D Cache RAM <desc-tc>`.
    unit: Percent
  Cache Accesses:
    rst: The total number of cache line lookups in the vL1D.
    unit: Cache lines
  Cache Hits:
    rst: The number of cache accesses minus the number of outgoing requests to the  :doc:`L2
      cache <l2-cache>`, that is, the number of cache line requests  serviced by the
      :ref:`vL1D Cache RAM <desc-tc>` per  :ref:`normalization unit <normalization-units>`.
    unit: Cache lines per normalization unit
  Invalidations:
    rst: The number of times the vL1D was issued a write-back invalidate command  during
      the kernel's execution per  :ref:`normalization unit <normalization-units>`.  This
      may be triggered  by, for instance, the ``buffer_wbinvl1`` instruction.
    unit: Invalidations per normalization unit
  L1-L2 BW:
    rst: The number of bytes transferred across the vL1D-L2 interface as a result  of
      :ref:`VMEM <desc-vmem>` instructions, per  :ref:`normalization unit <normalization-units>`.
      The number of bytes is  calculated as the number of cache lines requested multiplied
      by the cache  line size. This value does not consider partial requests, so for  instance,
      if only a single value is requested in a cache line, the data  movement will
      still be counted as a full cache line.
    unit: Bytes per normalization unit
  L1-L2 Read:
    rst: The number of read requests for a vL1D cache line that were not satisfied  by
      the vL1D and must be retrieved from the  :doc:`L2 Cache <l2-cache>` per  :ref:`normalization
      unit <normalization-units>`.
    unit: Requests per normalization unit
  L1-L2 Write:
    rst: The number of write requests to a vL1D cache line that were sent through  the
      vL1D to the :doc:`L2 cache <l2-cache>`, per  :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  L1-L2 Atomic:
    rst: The number of atomic requests that are sent through the vL1D to the  :doc:`L2
      cache <l2-cache>`, per  :ref:`normalization unit <normalization-units>`. This
      includes requests  for atomics with and without return.
    unit: Requests per normalization unit
  L1 Access Latency:
    rst: Calculated as the average number of cycles that a vL1D cache line request
      spent in the vL1D cache pipeline.
    unit: Cycles
  L1-L2 Read Latency:
    rst: Calculated as the average number of cycles that the vL1D cache took to  issue
      and receive read requests from the :doc:`L2 Cache <l2-cache>`. This  number
      also includes requests for atomics with return values.
    unit: Cycles
  L1-L2 Write Latency:
    rst: Calculated as the average number of cycles that the vL1D cache took to  issue
      and receive acknowledgement of a write request to the  :doc:`L2 Cache <l2-cache>`.
      This number also includes requests for  atomics without return values.
    unit: Cycles
  NC - Read:
    rst: Total read requests with NC mtype from this TCP to all TCCs, summed over TCP
      instances, per normalization unit.
    unit: Requests per normalization unit
  UC - Read:
    rst: Total read requests with UC mtype from this TCP to all TCCs, summed over TCP
      instances, per normalization unit.
    unit: Requests per normalization unit
  CC - Read:
    rst: Total read requests with CC mtype from this TCP to all TCCs, summed over TCP
      instances, per normalization unit.
    unit: Requests per normalization unit
  RW - Read:
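    rst: Total read requests with RW mtype from this TCP to all TCCs, summed over TCP
      instances, per normalization unit.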
    unit: Requests per normalization unit
  RW - Write:
    rst: Total write requests with RW mtype from this TCP to all TCCs, summed over TCP
      instances, per normalization unit.
    unit: Requests per normalization unit
  NC - Write:
    rst: Total write requests with NC mtype from this TCP to all TCCs, summed over TCP
      instances, per normalization unit.
    unit: Requests per normalization unit
  UC - Write:
    rst: Total write requests with UC mtype from this TCP to all TCCs, summed over TCP
      instances, per normalization unit.
    unit: Requests per normalization unit
  CC - Write:
    rst: Total write requests with CC mtype from this TCP to all TCCs, summed over TCP
      instances, per normalization unit.
    unit: Requests per normalization unit
  NC - Atomic:
    rst: Total atomic requests with NC mtype from this TCP to all TCCs, summed over TCP
      instances, per normalization unit.
    unit: Requests per normalization unit
  UC - Atomic:
    rst: Total atomic requests with UC mtype from this TCP to all TCCs, summed over TCP
      instances, per normalization unit.
    unit: Requests per normalization unit
  CC - Atomic:
    rst: Total atomic requests with CC mtype from this TCP to all TCCs, summed over TCP
      instances, per normalization unit.
    unit: Requests per normalization unit
  RW - Atomic:
    rst: Total atomic requests with RW mtype from this TCP to all TCCs, summed over TCP
      instances, per normalization unit.
    unit: Requests per normalization unit
  Req:
    rst: The number of translation requests made to the UTCL1 per normalization unit.
    unit: Requests per normalization unit
  Hit Ratio:
    rst: The ratio of the number of translation requests that hit in the UTCL1 divided
      by the total number of translation requests made to the UTCL1.
    unit: Percent
  Hits:
    rst: The number of translation requests that hit in the UTCL1, and could be reused,
      per normalization unit.
    unit: Requests per normalization unit
  Translation Misses:
    rst: The total number of translation requests that missed in the UTCL1 due to  translation
      not being present in the cache, per  :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Permission Misses:
    rst: "The total number of translation requests that missed in the UTCL1 due to\
      \  a permission error, per :ref:`normalization unit <normalization-units>`.\
      \  This is unused and expected to be zero in most configurations for modern\
      \  CDNA\u2122 accelerators."
    unit: Requests per normalization unit
vL1D cache access metrics:
  Hit rate:
    rst: The ratio of the number of vL1D cache line requests that hit [#vl1d-hit]_  in
      vL1D cache over the total number of cache line requests to the  :ref:`vL1D Cache
      RAM <desc-tc>`.
    unit: Percent
  Bandwidth:
    rst: The number of bytes looked up in the vL1D cache as a result of  :ref:`VMEM
      <desc-vmem>` instructions, as a percent of the peak  theoretical bandwidth achievable
      on the specific accelerator. The number  of bytes is calculated as the number
      of cache lines requested multiplied  by the cache line size. This value does
      not consider partial requests, so  for instance, if only a single value is requested
      in a cache line, the  data movement will still be counted as a full cache line.
    unit: Percent
  Utilization:
    rst: Indicates how busy the :ref:`vL1D Cache RAM <desc-tc>` was during the  kernel
      execution. The number of cycles where the vL1D Cache RAM is  actively processing
      any request divided by the number of cycles where the  vL1D is active [#vl1d-activity]_.
    unit: Percent
  Coalescing:
    rst: Indicates how well memory instructions were coalesced by the  :ref:`address
      processing unit <desc-ta>`, ranging from uncoalesced (25%)  to fully coalesced
      (100%). Calculated as the average number of  :ref:`thread-requests <thread-requests>`
      generated per instruction  divided by the ideal number of thread-requests per
      instruction.
    unit: Percent
  Stalled on L2 Data:
    rst: The ratio of the number of cycles where the vL1D is stalled waiting for  requested
      data to return from the :doc:`L2 cache <l2-cache>` divided by  the number of
      cycles where the vL1D is active [#vl1d-activity]_.
    unit: Percent
  Stalled on L2 Req:
    rst: The ratio of the number of cycles where the vL1D is stalled waiting to  issue
      a request for data to the :doc:`L2 cache <l2-cache>` divided by the  number
      of cycles where the vL1D is active [#vl1d-activity]_.
    unit: Percent
  Tag RAM Stall (Read):
    rst: The ratio of the number of cycles where the vL1D is stalled due to Read requests
      with conflicting tags being looked up  concurrently, divided by the number of
      cycles where the  vL1D is active [#vl1d-activity]_.
    unit: Percent
  Tag RAM Stall (Write):
    rst: The ratio of the number of cycles where the vL1D is stalled due to Write
      requests with conflicting tags being looked up  concurrently, divided by the
      number of cycles where the  vL1D is active [#vl1d-activity]_.
    unit: Percent
  Tag RAM Stall (Atomic):
    rst: The ratio of the number of cycles where the vL1D is stalled due to Atomic
      requests with conflicting tags being looked up  concurrently, divided by the
      number of cycles where the  vL1D is active [#vl1d-activity]_.
    unit: Percent
  Total Req:
    rst: The total number of incoming requests from the  :ref:`address processing
      unit <desc-ta>` after coalescing.
    unit: Requests
  Read Req:
    rst: The total number of incoming read requests from the :ref:`address processing
      unit <desc-ta>` after coalescing, per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Write Req:
    rst: The total number of incoming write requests from the :ref:`address processing
      unit <desc-ta>` after coalescing, per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Atomic Req:
    rst: The total number of incoming atomic requests from the :ref:`address processing
      unit <desc-ta>` after coalescing, per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Cache BW:
    rst: The number of bytes looked up in the vL1D cache as a result of  :ref:`VMEM
      <desc-vmem>` instructions per  :ref:`normalization unit <normalization-units>`.  The
      number of bytes is  calculated as the number of cache lines requested multiplied
      by the cache  line size.  This value does not consider partial requests, so
      for  instance, if only a single value is requested in a cache line, the data  movement
      will still be counted as a full cache line.
    unit: Bytes per normalization unit
  Cache Hit Rate:
    rst: The ratio of the number of vL1D cache line requests that hit in vL1D  cache
      over the total number of cache line requests to the  :ref:`vL1D Cache RAM <desc-tc>`.
    unit: Percent
  Cache Accesses:
    rst: The total number of cache line lookups in the vL1D.
    unit: Cache lines
  Cache Hits:
    rst: The number of cache accesses minus the number of outgoing requests to the  :doc:`L2
      cache <l2-cache>`, that is, the number of cache line requests  serviced by the
      :ref:`vL1D Cache RAM <desc-tc>` per  :ref:`normalization unit <normalization-units>`.
    unit: Cache lines per normalization unit
  Invalidations:
    rst: The number of times the vL1D was issued a write-back invalidate command  during
      the kernel's execution per  :ref:`normalization unit <normalization-units>`.  This
      may be triggered  by, for instance, the ``buffer_wbinvl1`` instruction.
    unit: Invalidations per normalization unit
  L1-L2 BW:
    rst: The number of bytes transferred across the vL1D-L2 interface as a result  of
      :ref:`VMEM <desc-vmem>` instructions, per  :ref:`normalization unit <normalization-units>`.
      The number of bytes is  calculated as the number of cache lines requested multiplied
      by the cache  line size. This value does not consider partial requests, so for  instance,
      if only a single value is requested in a cache line, the data  movement will
      still be counted as a full cache line.
    unit: Bytes per normalization unit
  L1-L2 Read:
    rst: The number of read requests for a vL1D cache line that were not satisfied by
      the vL1D and must be retrieved from the :doc:`L2 Cache <l2-cache>`, per :ref:`normalization
      unit <normalization-units>`.
    unit: Requests per normalization unit
  L1-L2 Write:
    rst: The number of write requests to a vL1D cache line that were sent through  the
      vL1D to the :doc:`L2 cache <l2-cache>`, per  :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  L1-L2 Atomic:
    rst: The number of atomic requests that are sent through the vL1D to the  :doc:`L2
      cache <l2-cache>`, per  :ref:`normalization unit <normalization-units>`. This
      includes requests  for atomics with, and without return.
    unit: Requests per normalization unit
  L1 Access Latency:
    rst: Calculated as the average number of cycles that a vL1D cache line request
      spent in the vL1D cache pipeline.
    unit: Cycles
  L1-L2 Read Latency:
    rst: Calculated as the average number of cycles that the vL1D cache took to  issue
      and receive read requests from the :doc:`L2 Cache <l2-cache>`. This  number
      also includes requests for atomics with return values.
    unit: Cycles
  L1-L2 Write Latency:
    rst: Calculated as the average number of cycles that the vL1D cache took to  issue
      and receive acknowledgement of a write request to the  :doc:`L2 Cache <l2-cache>`.
      This number also includes requests for  atomics without return values.
    unit: Cycles
  NC - Read:
    rst: Total read requests with NC mtype from this TCP to all TCCs, summed over
      TCP instances, per normalization unit.
    unit: Requests per normalization unit
  UC - Read:
    rst: Total read requests with UC mtype from this TCP to all TCCs, summed over
      TCP instances, per normalization unit.
    unit: Requests per normalization unit
  CC - Read:
    rst: Total read requests with CC mtype from this TCP to all TCCs, summed over
      TCP instances, per normalization unit.
    unit: Requests per normalization unit
  RW - Read:
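    rst: Total read requests with RW mtype from this TCP to all TCCs, summed over
      TCP instances, per normalization unit.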
    unit: Requests per normalization unit
  RW - Write:
    rst: Total write requests with RW mtype from this TCP to all TCCs, summed over
      TCP instances, per normalization unit.
    unit: Requests per normalization unit
  NC - Write:
    rst: Total write requests with NC mtype from this TCP to all TCCs, summed over
      TCP instances, per normalization unit.
    unit: Requests per normalization unit
  UC - Write:
    rst: Total write requests with UC mtype from this TCP to all TCCs, summed over
      TCP instances, per normalization unit.
    unit: Requests per normalization unit
  CC - Write:
    rst: Total write requests with CC mtype from this TCP to all TCCs, summed over
      TCP instances, per normalization unit.
    unit: Requests per normalization unit
  NC - Atomic:
    rst: Total atomic requests with NC mtype from this TCP to all TCCs, summed over
      TCP instances, per normalization unit.
    unit: Requests per normalization unit
  UC - Atomic:
    rst: Total atomic requests with UC mtype from this TCP to all TCCs, summed over
      TCP instances, per normalization unit.
    unit: Requests per normalization unit
  CC - Atomic:
    rst: Total atomic requests with CC mtype from this TCP to all TCCs, summed over
      TCP instances, per normalization unit.
    unit: Requests per normalization unit
  RW - Atomic:
    rst: Total atomic requests with RW mtype from this TCP to all TCCs, summed over
      TCP instances, per normalization unit.
    unit: Requests per normalization unit
  Req:
    rst: The number of translation requests made to the UTCL1 per normalization unit.
    unit: Requests per normalization unit
  Hit Ratio:
    rst: The ratio of the number of translation requests that hit in the UTCL1 divided
      by the total number of translation requests made to the UTCL1.
    unit: Percent
  Hits:
    rst: The number of translation requests that hit in the UTCL1, and could be reused,
      per normalization unit.
    unit: Requests per normalization unit
  Translation Misses:
    rst: The total number of translation requests that missed in the UTCL1 due to  translation
      not being present in the cache, per  :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Permission Misses:
    rst: "The total number of translation requests that missed in the UTCL1 due to\
      \  a permission error, per :ref:`normalization unit <normalization-units>`.\
      \  This is unused and expected to be zero in most configurations for modern\
      \  CDNA\u2122 accelerators."
    unit: Requests per normalization unit
Vector L1 data-return path or Texture Data (TD):
  Address Processing Unit Busy:
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address processor
      was busy.
    unit: Percent
  Address Stall:
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address processor
      was stalled from sending address requests further into the vL1D pipeline.
    unit: Percent
  Data Stall:
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address processor
      was stalled from sending write/atomic data further into the vL1D pipeline.
    unit: Percent
  "Data-Processor \u2192 Address Stall":
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address processor
      was stalled waiting to send command data to the :ref:`data processor <desc-td>`.
    unit: Percent
  Total Instructions:
    rst: The total number of memory instructions executed by the address processor
      over all compute units on the accelerator, per normalization unit.
    unit: Instructions per normalization unit
  Global/Generic Instructions:
    rst: The total number of global & generic memory instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Global/Generic Read Instructions:
    rst: The total number of global & generic memory read instructions executed on  all
      :doc:`compute units <compute-unit>` on the accelerator, per  :ref:`normalization
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  Global/Generic Write Instructions:
    rst: The total number of global & generic memory write instructions executed  on
      all :doc:`compute units <compute-unit>` on the accelerator, per  :ref:`normalization
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  Global/Generic Atomic Instructions:
    rst: The total number of global & generic memory atomic (with and without  return)
      instructions executed on all :doc:`compute units <compute-unit>`  on the accelerator,
      per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Instructions:
    rst: The total number of spill/stack memory instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Read Instructions:
    rst: The total number of spill/stack memory read instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Write Instructions:
    rst: The total number of spill/stack memory write instructions executed on all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Spill/Stack Atomic Instructions:
    rst: The total number of spill/stack memory atomic (with and without return) instructions
      executed on all :doc:`compute units <compute-unit>` on the accelerator, per
      :ref:`normalization unit <normalization-units>`. Typically unused, as these
      memory operations are mainly used to implement thread-local storage.
    unit: Instructions per normalization unit
  Spill/Stack Total Cycles:
    rst: The number of cycles the address processing unit spent working on  spill/stack
      instructions, per  :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
  Spill/Stack Coalesced Read:
    rst: The number of cycles the address processing unit spent working on  coalesced
      spill/stack read instructions, per  :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
  Spill/Stack Coalesced Write:
    rst: The number of cycles the address processing unit spent working on  coalesced
      spill/stack write instructions, per  :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
  Data-Return Busy:
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return  unit
      was busy processing or waiting on data to return to the  :doc:`CU <compute-unit>`.
    unit: Percent
  "Cache RAM \u2192 Data-Return Stall":
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return  unit
      was stalled on data to be returned from the  :ref:`vL1D Cache RAM <desc-tc>`.
    unit: Percent
  "Workgroup manager \u2192 Data-Return Stall":
    rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return  unit
      was stalled by the :ref:`workgroup manager <desc-spi>` due to  initialization
      of registers as a part of launching new workgroups.
    unit: Percent
  Coalescable Instructions:
    rst: The number of instructions submitted to the  :ref:`data-return unit <desc-td>`
      by the  :ref:`address processor <desc-ta>` that were found to be coalescable,
      per  :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  Read Instructions:
    rst: The number of read instructions submitted to the  :ref:`data-return unit
      <desc-td>` by the  :ref:`address processor <desc-ta>` summed over all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
      This is expected to be  the sum of global/generic and spill/stack reads in the  :ref:`address
      processor <desc-ta>`.
    unit: Instructions per normalization unit
  Write Instructions:
    rst: The number of store instructions submitted to the  :ref:`data-return unit
      <desc-td>` by the  :ref:`address processor <desc-ta>` summed over all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
      This is expected to be  the sum of global/generic and spill/stack stores counted
      by the  :ref:`vL1D cache-front-end <ta-instruction-counts>`.
    unit: Instructions per normalization unit
  Atomic Instructions:
    rst: The number of atomic instructions submitted to the  :ref:`data-return unit
      <desc-td>` by the  :ref:`address processor <desc-ta>` summed over all  :doc:`compute
      units <compute-unit>` on the accelerator, per  :ref:`normalization unit <normalization-units>`.
      This is expected to be  the sum of global/generic and spill/stack atomics in
      the  :ref:`address processor <desc-ta>`.
    unit: Instructions per normalization unit
L2 Speed-of-Light:
  Utilization:
    rst: The ratio of the  :ref:`number of cycles an L2 channel was active, summed
      over all L2 channels on the accelerator <total-active-l2-cycles>`  over the
      :ref:`total L2 cycles <total-l2-cycles>`.
    unit: Percent
  Peak Bandwidth:
    rst: The number of bytes looked up in the L2 cache, as a percent of the peak  theoretical
      bandwidth achievable on the specific accelerator. The number  of bytes is calculated
      as the number of cache lines requested multiplied  by the cache line size. This
      value does not consider partial requests, so  e.g., if only a single value is
      requested in a cache line, the data  movement will still be counted as a full
      cache line.
    unit: Percent
  Hit Rate:
    rst: The ratio of the number of L2 cache line requests that hit in the L2  cache
      over the total number of incoming cache line requests to the L2 cache.
    unit: Percent
  L2-Fabric Read BW:
    rst: The number of bytes read by the L2 over the  :ref:`Infinity Fabric interface
      <l2-fabric>` per unit time.
    unit: GB/s
  L2-Fabric Write and Atomic BW:
    rst: The number of bytes sent by the L2 over the  :ref:`Infinity Fabric interface
      <l2-fabric>` by write and atomic  operations per unit time.
    unit: GB/s
  HBM Bandwidth:
    rst: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory
      (HBM) per unit time. This value is calculated as the number of HBM channels
      multiplied by the HBM channel width multiplied by the HBM clock frequency.
    unit: GB/s
  Read BW:
    rst: The total number of bytes read by the L2 cache from Infinity Fabric per  :ref:`normalization
      unit <normalization-units>`.
    unit: Bytes per normalization unit
  HBM Read Traffic:
    rst: The percent of read requests generated by the L2 cache that are routed to  the
      accelerator's local high-bandwidth memory (HBM). This breakdown does  not consider
      the *size* of the request (meaning that 32B and 64B requests  are both counted
      as a single request), so this metric only *approximates*  the percent of the
      L2-Fabric Read bandwidth directed to the local HBM.
    unit: Percent
  Remote Read Traffic:
    rst: The percent of read requests generated by the L2 cache that are routed to  any
      memory location other than the accelerator's local high-bandwidth  memory (HBM)
      -- for example, the CPU's DRAM or a remote accelerator's  HBM. This breakdown
      does not consider the *size* of the request (meaning  that 32B and 64B requests
      are both counted as a single request), so this  metric only *approximates* the
      percent of the L2-Fabric Read bandwidth  directed to a remote location.
    unit: Percent
  Uncached Read Traffic:
    rst: The percent of read requests generated by the L2 cache that are reading  from
      an :ref:`uncached memory allocation <memory-type>`. Note, as  described in the
      :ref:`request flow <l2-request-flow>` section, a single  64B read request is
      typically counted as two uncached read requests. So,  it is possible for the
      Uncached Read Traffic to reach up to 200% of the  total number of read requests.
      This breakdown does not consider the  *size* of the request (i.e., 32B and 64B
      requests are both counted as a  single request), so this metric only *approximates*
      the percent of the  L2-Fabric read bandwidth directed to an uncached memory
      location.
    unit: Percent
  Write and Atomic BW:
    rst: The total number of bytes written by the L2 over Infinity Fabric by write  and
      atomic operations per  :ref:`normalization unit <normalization-units>`. Note
      that on current  CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests
      are  only considered *atomic* by Infinity Fabric if they are targeted at  non-write-cacheable
      memory, for example,  :ref:`fine-grained memory <memory-type>` allocations or  :ref:`uncached
      memory <memory-type>` allocations on the  MI2XX.
    unit: Bytes per normalization unit
  HBM Write and Atomic Traffic:
    rst: The percent of write and atomic requests generated by the L2 cache that  are
      routed to the accelerator's local high-bandwidth memory (HBM). This  breakdown
      does not consider the *size* of the request (meaning that 32B  and 64B requests
      are both counted as a single request), so this metric  only *approximates* the
      percent of the L2-Fabric Write and Atomic  bandwidth directed to the local HBM.
      Note that on current CDNA  accelerators, such as the :ref:`MI2XX <mixxx-note>`,
      requests are only  considered *atomic* by Infinity Fabric if they are targeted
      at  :ref:`fine-grained memory <memory-type>` allocations or  :ref:`uncached
      memory <memory-type>` allocations.
    unit: Percent
  Remote Write and Atomic Traffic:
    rst: The percent of write and atomic requests generated by the L2 cache that are
      routed to any memory location other than the accelerator's local high-bandwidth
      memory (HBM) -- for example, the CPU's DRAM or a remote accelerator's HBM. This
      breakdown does not consider the *size* of the request (meaning that 32B and 64B
      requests are both counted as a single request), so this metric only *approximates*
      the percent of the L2-Fabric Write and Atomic bandwidth directed to a remote
      location. Note that on current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`,
      requests are only considered *atomic* by Infinity Fabric if they are targeted
      at :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached memory
      <memory-type>` allocations.
    unit: Percent
  Atomic Traffic:
    rst: The percent of write and atomic requests generated by the L2 cache that are
      atomic requests to *any* memory location. This breakdown does not consider the
      *size* of the request (meaning that 32B and 64B requests are both counted as
      a single request), so this metric only *approximates* the percent of the L2-Fabric
      Write and Atomic bandwidth used by atomic requests. Note that on current CDNA
      accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests are only considered
      *atomic* by Infinity Fabric if they are targeted at :ref:`fine-grained memory
      <memory-type>` allocations or :ref:`uncached memory <memory-type>` allocations.
    unit: Percent
  Uncached Write and Atomic Traffic:
    rst: The percent of write and atomic requests generated by the L2 cache that are
      targeting :ref:`uncached memory allocations <memory-type>`. This breakdown
      does not consider the *size* of the request (meaning that 32B and 64B requests
      are both counted as a single request), so this metric only *approximates* the
      percent of the L2-Fabric Write and Atomic bandwidth directed to uncached memory
      allocations.
    unit: Percent
  Read Latency:
    rst: The time-averaged number of cycles read requests spent in Infinity Fabric  before
      data was returned to the L2.
    unit: Cycles
  Write and Atomic Latency:
    rst: The time-averaged number of cycles write requests spent in Infinity Fabric
      before a completion acknowledgement was returned to the L2.
    unit: Cycles
  Atomic Latency:
    rst: The time-averaged number of cycles atomic requests spent in Infinity Fabric
      before a completion acknowledgement (atomic without return value) or data (atomic
      with return value) was returned to the L2.
    unit: Cycles
  Bandwidth:
    rst: The number of bytes looked up in the L2 cache, per  :ref:`normalization unit
      <normalization-units>`.  The number of bytes is  calculated as the number of
      cache lines requested multiplied by the cache  line size. This value does not
      consider partial requests, so for example,  if only a single value is requested
      in a cache line, the data movement  will still be counted as a full cache line.
    unit: Bytes per normalization unit
  Req:
    rst: The total number of incoming requests to the L2 from all clients for all  request
      types, per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Read Req:
    rst: The total number of read requests to the L2 from all clients.
    unit: Requests per normalization unit
  Write Req:
    rst: The total number of write requests to the L2 from all clients.
    unit: Requests per normalization unit
  Atomic Req:
    rst: The total number of atomic requests (with and without return) to the L2 from
      all clients.
    unit: Requests per normalization unit
  Streaming Req:
    rst: The total number of incoming requests to the L2 that are marked as *streaming*.
      The exact meaning of this may differ depending on the targeted accelerator.
      However, on an :ref:`MI2XX <mixxx-note>` this corresponds to `non-temporal
      loads or stores <https://clang.llvm.org/docs/LanguageExtensions.html#non-temporal-load-store-builtins>`_.
      The L2 cache attempts to evict *streaming* requests before normal requests when
      the L2 is at capacity.
    unit: Requests per normalization unit
  Probe Req:
    rst: The number of coherence probe requests made to the L2 cache from outside  the
      accelerator. On an :ref:`MI2XX <mixxx-note>`, probe requests may be  generated
      by, for example, writes to  :ref:`fine-grained device <memory-type>` memory
      or by writes to  :ref:`coarse-grained <memory-type>` device memory.
    unit: Requests per normalization unit
  Cache Hit:
    rst: The ratio of the number of L2 cache line requests that hit in the L2  cache
      over the total number of incoming cache line requests to the L2  cache.
    unit: Percent
  Hits:
    rst: The total number of requests to the L2 from all clients that hit in the  cache.
      As noted in the :ref:`Speed-of-Light <l2-sol>` section, this  includes hit-on-miss
      requests.
    unit: Requests per normalization unit
  Misses:
    rst: The total number of requests to the L2 from all clients that miss in the  cache.
      As noted in the :ref:`Speed-of-Light <l2-sol>` section, these do  not include
      hit-on-miss requests.
    unit: Requests per normalization unit
  Writeback:
    rst: The total number of L2 cache lines written back to memory for any reason.
      Write-backs may occur due to user code (such as HIP kernel calls to ``__threadfence_system``
      or atomic built-ins), due to the :doc:`command processor <command-processor>`'s
      memory acquire/release fences, or for other internal hardware reasons.
    unit: Cache lines per normalization unit
  Writeback (Internal):
    rst: The total number of L2 cache lines written back to memory for internal  hardware
      reasons, per :ref:`normalization unit <normalization-units>`.
    unit: Cache lines per normalization unit
  Writeback (vL1D Req):
    rst: The total number of L2 cache lines written back to memory due to requests  initiated
      by the :doc:`vL1D cache <vector-l1-cache>`, per  :ref:`normalization unit <normalization-units>`.
    unit: Cache lines per normalization unit
  Evict (Internal):
    rst: The total number of L2 cache lines evicted from the cache due to capacity  limits,
      per :ref:`normalization unit <normalization-units>`.
    unit: Cache lines per normalization unit
  Evict (vL1D Req):
    rst: The total number of L2 cache lines evicted from the cache due to  invalidation
      requests initiated by the  :doc:`vL1D cache <vector-l1-cache>`, per  :ref:`normalization
      unit <normalization-units>`.
    unit: Cache lines per normalization unit
  NC Req:
    rst: The total number of requests to the L2 to Not-hardware-Coherent (NC)  memory
      allocations, per :ref:`normalization unit <normalization-units>`.  See the :ref:`memory-type`
      for more information.
    unit: Requests per normalization unit
  UC Req:
    rst: The total number of requests to the L2 that go to Uncached (UC) memory  allocations.
      See the :ref:`memory-type` for more information.
    unit: Requests per normalization unit
  CC Req:
    rst: The total number of requests to the L2 that go to Coherently Cacheable (CC)  memory
      allocations. See the :ref:`memory-type` for more information.
    unit: Requests per normalization unit
  RW Req:
    rst: The total number of requests to the L2 that go to Read-Write coherent memory  (RW)
      allocations. See the :ref:`memory-type` for more information.
    unit: Requests per normalization unit
  Write - Credit Starvation:
    rst: The number of cycles the L2-Fabric interface was stalled on write or  atomic
      requests to any memory location because too many write/atomic  requests were
      currently in flight, as a percent of the  :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Read (32B):
    rst: The total number of L2 requests to Infinity Fabric to read 32B of data  from
      any memory location, per  :ref:`normalization unit <normalization-units>`. See  :ref:`l2-request-flow`
      for more detail. Typically unused on CDNA  accelerators.
    unit: Requests per normalization unit
  Read (64B):
    rst: The total number of L2 requests to Infinity Fabric to read 64B of data  from
      any memory location, per  :ref:`normalization unit <normalization-units>`. See  :ref:`l2-request-flow`
      for more detail.
    unit: Requests per normalization unit
  Read (Uncached):
    rst: The total number of L2 requests to Infinity Fabric to read  :ref:`uncached
      data <memory-type>` from any memory location, per  :ref:`normalization unit
      <normalization-units>`. 64B requests for  uncached data are counted as two 32B
      uncached data requests. See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  HBM Read:
    rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of  data
      from the accelerator's local HBM, per  :ref:`normalization unit <normalization-units>`.
      See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  Remote Read:
    rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of  data
      from any source other than the accelerator's local HBM, per  :ref:`normalization
      unit <normalization-units>`. See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  Write and Atomic (32B):
    rst: The total number of L2 requests to Infinity Fabric to write or atomically  update
      32B of data to any memory location, per  :ref:`normalization unit <normalization-units>`.
      See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  Write and Atomic (Uncached):
    rst: The total number of L2 requests to Infinity Fabric to write or atomically  update
      32B or 64B of :ref:`uncached data <memory-type>`, per  :ref:`normalization unit
      <normalization-units>`. See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  Write and Atomic (64B):
    rst: The total number of L2 requests to Infinity Fabric to write or atomically  update
      64B of data in any memory location, per  :ref:`normalization unit <normalization-units>`.
      See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  HBM Write and Atomic:
    rst: The total number of L2 requests to Infinity Fabric to write or atomically  update
      32B or 64B of data in the accelerator's local HBM, per  :ref:`normalization
      unit <normalization-units>`. See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  Remote Write and Atomic:
    rst: The total number of L2 requests to Infinity Fabric to write or atomically  update
      32B or 64B of data in any memory location other than the  accelerator's local
      HBM, per  :ref:`normalization unit <normalization-units>`. See  :ref:`l2-request-flow`
      for more detail.
    unit: Requests per normalization unit
  Atomic:
    rst: The total number of L2 requests to Infinity Fabric to atomically update  32B
      or 64B of data in any memory location, per  :ref:`normalization unit <normalization-units>`.
      See  :ref:`l2-request-flow` for more detail. Note that on current CDNA  accelerators,
      such as the :ref:`MI2XX <mixxx-note>`, requests are only  considered *atomic*
      by Infinity Fabric if they are targeted at  non-write-cacheable memory, such
      as  :ref:`fine-grained memory <memory-type>` allocations or  :ref:`uncached
      memory <memory-type>` allocations on the MI2XX.
    unit: Requests per normalization unit
  Read Stall:
    rst: "The ratio of the total number of cycles the L2-Fabric interface was stalled\
      \ on a read request to any destination (local HBM, remote PCIe\xAE connected\
      \ accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_\
      \ or CPU) over the :ref:`total active L2 cycles <total-active-l2-cycles>`."
    unit: Percent
  Write Stall:
    rst: The ratio of the total number of cycles the L2-Fabric interface was stalled
      on a write or atomic request to any destination (local HBM, remote PCIe connected
      accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_
      or CPU) over the :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Read - PCIe Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on read requests
      to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total
      active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Read - Infinity Fabric Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on read requests
      to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a percent
      of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Read - HBM Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on read requests
      to the accelerator's local HBM as a percent of the :ref:`total active L2 cycles
      <total-active-l2-cycles>`.
    unit: Percent
  Write - PCIe Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
      requests to remote PCIe connected accelerators [#inf]_ or CPUs as a percent
      of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Write - Infinity Fabric Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
      requests to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as
      a percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Write - HBM Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
      requests to the accelerator's local HBM as a percent of the :ref:`total active
      L2 cycles <total-active-l2-cycles>`.
    unit: Percent
L2 cache accesses:
  Utilization:
    rst: The ratio of the  :ref:`number of cycles an L2 channel was active, summed
      over all L2 channels on the accelerator <total-active-l2-cycles>`  over the
      :ref:`total L2 cycles <total-l2-cycles>`.
    unit: Percent
  Peak Bandwidth:
    rst: The number of bytes looked up in the L2 cache, as a percent of the peak  theoretical
      bandwidth achievable on the specific accelerator. The number  of bytes is calculated
      as the number of cache lines requested multiplied  by the cache line size. This
      value does not consider partial requests, so, for example, if only a single value is
      requested in a cache line, the data  movement will still be counted as a full
      cache line.
    unit: Percent
  Hit Rate:
    rst: The ratio of the number of L2 cache line requests that hit in the L2  cache
      over the total number of incoming cache line requests to the L2 cache.
    unit: Percent
  L2-Fabric Read BW:
    rst: The number of bytes read by the L2 over the  :ref:`Infinity Fabric interface
      <l2-fabric>` per unit time.
    unit: GB/s
  L2-Fabric Write and Atomic BW:
    rst: The number of bytes sent by the L2 over the  :ref:`Infinity Fabric interface
      <l2-fabric>` by write and atomic  operations per unit time.
    unit: GB/s
  HBM Bandwidth:
    rst: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory
      (HBM) per unit time. This value is calculated as the number of HBM channels
      multiplied by the HBM channel width multiplied by the HBM clock frequency.
    unit: GB/s
  Read BW:
    rst: The total number of bytes read by the L2 cache from Infinity Fabric per  :ref:`normalization
      unit <normalization-units>`.
    unit: Bytes per normalization unit
  HBM Read Traffic:
    rst: The percent of read requests generated by the L2 cache that are routed to  the
      accelerator's local high-bandwidth memory (HBM). This breakdown does  not consider
      the *size* of the request (meaning that 32B and 64B requests  are both counted
      as a single request), so this metric only *approximates*  the percent of the
      L2-Fabric Read bandwidth directed to the local HBM.
    unit: Percent
  Remote Read Traffic:
    rst: The percent of read requests generated by the L2 cache that are routed to  any
      memory location other than the accelerator's local high-bandwidth  memory (HBM)
      -- for example, the CPU's DRAM or a remote accelerator's  HBM. This breakdown
      does not consider the *size* of the request (meaning  that 32B and 64B requests
      are both counted as a single request), so this  metric only *approximates* the
      percent of the L2-Fabric Read bandwidth  directed to a remote location.
    unit: Percent
  Uncached Read Traffic:
    rst: The percent of read requests generated by the L2 cache that are reading  from
      an :ref:`uncached memory allocation <memory-type>`. Note, as  described in the
      :ref:`request flow <l2-request-flow>` section, a single  64B read request is
      typically counted as two uncached read requests. So,  it is possible for the
      Uncached Read Traffic to reach up to 200% of the  total number of read requests.
      This breakdown does not consider the  *size* of the request (i.e., 32B and 64B
      requests are both counted as a  single request), so this metric only *approximates*
      the percent of the  L2-Fabric read bandwidth directed to an uncached memory
      location.
    unit: Percent
  Write and Atomic BW:
    rst: The total number of bytes written by the L2 over Infinity Fabric by write  and
      atomic operations per  :ref:`normalization unit <normalization-units>`. Note
      that on current  CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests
      are  only considered *atomic* by Infinity Fabric if they are targeted at  non-write-cacheable
      memory, for example,  :ref:`fine-grained memory <memory-type>` allocations or  :ref:`uncached
      memory <memory-type>` allocations on the  MI2XX.
    unit: Bytes per normalization unit
  HBM Write and Atomic Traffic:
    rst: The percent of write and atomic requests generated by the L2 cache that  are
      routed to the accelerator's local high-bandwidth memory (HBM). This  breakdown
      does not consider the *size* of the request (meaning that 32B  and 64B requests
      are both counted as a single request), so this metric  only *approximates* the
      percent of the L2-Fabric Write and Atomic  bandwidth directed to the local HBM.
      Note that on current CDNA  accelerators, such as the :ref:`MI2XX <mixxx-note>`,
      requests are only  considered *atomic* by Infinity Fabric if they are targeted
      at  :ref:`fine-grained memory <memory-type>` allocations or  :ref:`uncached
      memory <memory-type>` allocations.
    unit: Percent
  Remote Write and Atomic Traffic:
    rst: The percent of write and atomic requests generated by the L2 cache that are
      routed to any memory location other than the accelerator's local high-bandwidth
      memory (HBM) -- for example, the CPU's DRAM or a remote accelerator's HBM. This
      breakdown does not consider the *size* of the request (meaning that 32B and 64B
      requests are both counted as a single request), so this metric only *approximates*
      the percent of the L2-Fabric Write and Atomic bandwidth directed to a remote
      location. Note that on current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`,
      requests are only considered *atomic* by Infinity Fabric if they are targeted
      at :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached
      memory <memory-type>` allocations.
    unit: Percent
  Atomic Traffic:
    rst: The percent of write and atomic requests generated by the L2 cache that are
      atomic requests to *any* memory location. This breakdown does not consider the
      *size* of the request (meaning that 32B and 64B requests are both counted as
      a single request), so this metric only *approximates* the percent of the L2-Fabric
      Write and Atomic bandwidth used by atomic requests. Note that on current CDNA
      accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests are only considered
      *atomic* by Infinity Fabric if they are targeted at :ref:`fine-grained memory
      <memory-type>` allocations or :ref:`uncached memory <memory-type>` allocations.
    unit: Percent
  Uncached Write and Atomic Traffic:
    rst: The percent of write and atomic requests generated by the L2 cache that  are
      targeting :ref:`uncached memory allocations <memory-type>`. This  breakdown
      does not consider the *size* of the request (meaning that 32B  and 64B requests
      are both counted as a single request), so this metric  only *approximates* the
      percent of the L2-Fabric Write and Atomic bandwidth directed to uncached memory allocations.
    unit: Percent
  Read Latency:
    rst: The time-averaged number of cycles read requests spent in Infinity Fabric  before
      data was returned to the L2.
    unit: Cycles
  Write and Atomic Latency:
    rst: The time-averaged number of cycles write requests spent in Infinity Fabric
      before a completion acknowledgement was returned to the L2.
    unit: Cycles
  Atomic Latency:
    rst: The time-averaged number of cycles atomic requests spent in Infinity Fabric
      before a completion acknowledgement (atomic without return value) or data (atomic
      with return value) was returned to the L2.
    unit: Cycles
  Bandwidth:
    rst: The number of bytes looked up in the L2 cache, per  :ref:`normalization unit
      <normalization-units>`.  The number of bytes is  calculated as the number of
      cache lines requested multiplied by the cache  line size. This value does not
      consider partial requests, so for example,  if only a single value is requested
      in a cache line, the data movement  will still be counted as a full cache line.
    unit: Bytes per normalization unit
  Req:
    rst: The total number of incoming requests to the L2 from all clients for all  request
      types, per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Read Req:
    rst: The total number of read requests to the L2 from all clients.
    unit: Requests per normalization unit
  Write Req:
    rst: The total number of write requests to the L2 from all clients.
    unit: Requests per normalization unit
  Atomic Req:
    rst: The total number of atomic requests (with and without return) to the L2 from
      all clients.
    unit: Requests per normalization unit
  Streaming Req:
    rst: The total number of incoming requests to the L2 that are marked as  *streaming*.
      The exact meaning of this may differ depending on the targeted accelerator;
      however, on an :ref:`MI2XX <mixxx-note>` this corresponds to `non-temporal
      loads or stores <https://clang.llvm.org/docs/LanguageExtensions.html#non-temporal-load-store-builtins>`_. The
      L2 cache attempts to evict *streaming* requests before normal requests when
      the L2 is at capacity.
    unit: Requests per normalization unit
  Probe Req:
    rst: The number of coherence probe requests made to the L2 cache from outside  the
      accelerator. On an :ref:`MI2XX <mixxx-note>`, probe requests may be  generated
      by, for example, writes to  :ref:`fine-grained device <memory-type>` memory
      or by writes to  :ref:`coarse-grained <memory-type>` device memory.
    unit: Requests per normalization unit
  Cache Hit:
    rst: The ratio of the number of L2 cache line requests that hit in the L2  cache
      over the total number of incoming cache line requests to the L2  cache.
    unit: Percent
  Hits:
    rst: The total number of requests to the L2 from all clients that hit in the  cache.
      As noted in the :ref:`Speed-of-Light <l2-sol>` section, this  includes hit-on-miss
      requests.
    unit: Requests per normalization unit
  Misses:
    rst: The total number of requests to the L2 from all clients that miss in the  cache.
      As noted in the :ref:`Speed-of-Light <l2-sol>` section, these do  not include
      hit-on-miss requests.
    unit: Requests per normalization unit
  Writeback:
    rst: The total number of L2 cache lines written back to memory for any reason.  Write-backs
      may occur due to user code (such as HIP kernel calls to  ``__threadfence_system``
      or atomic built-ins), by the  :doc:`command processor <command-processor>`'s
      memory acquire/release  fences, or for other internal hardware reasons.
    unit: Cache lines per normalization unit
  Writeback (Internal):
    rst: The total number of L2 cache lines written back to memory for internal  hardware
      reasons, per :ref:`normalization unit <normalization-units>`.
    unit: Cache lines per normalization unit
  Writeback (vL1D Req):
    rst: The total number of L2 cache lines written back to memory due to requests  initiated
      by the :doc:`vL1D cache <vector-l1-cache>`, per  :ref:`normalization unit <normalization-units>`.
    unit: Cache lines per normalization unit
  Evict (Internal):
    rst: The total number of L2 cache lines evicted from the cache due to capacity  limits,
      per :ref:`normalization unit <normalization-units>`.
    unit: Cache lines per normalization unit
  Evict (vL1D Req):
    rst: The total number of L2 cache lines evicted from the cache due to  invalidation
      requests initiated by the  :doc:`vL1D cache <vector-l1-cache>`, per  :ref:`normalization
      unit <normalization-units>`.
    unit: Cache lines per normalization unit
  NC Req:
    rst: The total number of requests to the L2 to Not-hardware-Coherent (NC)  memory
      allocations, per :ref:`normalization unit <normalization-units>`.  See the :ref:`memory-type`
      for more information.
    unit: Requests per normalization unit
  UC Req:
    rst: The total number of requests to the L2 that go to Uncached (UC) memory  allocations.
      See the :ref:`memory-type` for more information.
    unit: Requests per normalization unit
  CC Req:
    rst: The total number of requests to the L2 that go to Coherently Cacheable (CC)  memory
      allocations. See the :ref:`memory-type` for more information.
    unit: Requests per normalization unit
  RW Req:
    rst: The total number of requests to the L2 that go to Read-Write coherent memory  (RW)
      allocations. See the :ref:`memory-type` for more information.
    unit: Requests per normalization unit
  Write - Credit Starvation:
    rst: The number of cycles the L2-Fabric interface was stalled on write or  atomic
      requests to any memory location because too many write/atomic  requests were
      currently in flight, as a percent of the  :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Read (32B):
    rst: The total number of L2 requests to Infinity Fabric to read 32B of data  from
      any memory location, per  :ref:`normalization unit <normalization-units>`. See  :ref:`l2-request-flow`
      for more detail. Typically unused on CDNA  accelerators.
    unit: Requests per normalization unit
  Read (64B):
    rst: The total number of L2 requests to Infinity Fabric to read 64B of data  from
      any memory location, per  :ref:`normalization unit <normalization-units>`. See  :ref:`l2-request-flow`
      for more detail.
    unit: Requests per normalization unit
  Read (Uncached):
    rst: The total number of L2 requests to Infinity Fabric to read  :ref:`uncached
      data <memory-type>` from any memory location, per  :ref:`normalization unit
      <normalization-units>`. 64B requests for  uncached data are counted as two 32B
      uncached data requests. See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  HBM Read:
    rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of  data
      from the accelerator's local HBM, per  :ref:`normalization unit <normalization-units>`.
      See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  Remote Read:
    rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of  data
      from any source other than the accelerator's local HBM, per  :ref:`normalization
      unit <normalization-units>`. See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  Write and Atomic (32B):
    rst: The total number of L2 requests to Infinity Fabric to write or atomically  update
      32B of data to any memory location, per  :ref:`normalization unit <normalization-units>`.
      See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  Write and Atomic (Uncached):
    rst: The total number of L2 requests to Infinity Fabric to write or atomically  update
      32B or 64B of :ref:`uncached data <memory-type>`, per  :ref:`normalization unit
      <normalization-units>`. See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  Write and Atomic (64B):
    rst: The total number of L2 requests to Infinity Fabric to write or atomically  update
      64B of data in any memory location, per  :ref:`normalization unit <normalization-units>`.
      See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  HBM Write and Atomic:
    rst: The total number of L2 requests to Infinity Fabric to write or atomically  update
      32B or 64B of data in the accelerator's local HBM, per  :ref:`normalization
      unit <normalization-units>`. See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  Remote Write and Atomic:
    rst: The total number of L2 requests to Infinity Fabric to write or atomically  update
      32B or 64B of data in any memory location other than the  accelerator's local
      HBM, per  :ref:`normalization unit <normalization-units>`. See  :ref:`l2-request-flow`
      for more detail.
    unit: Requests per normalization unit
  Atomic:
    rst: The total number of L2 requests to Infinity Fabric to atomically update  32B
      or 64B of data in any memory location, per  :ref:`normalization unit <normalization-units>`.
      See  :ref:`l2-request-flow` for more detail. Note that on current CDNA  accelerators,
      such as the :ref:`MI2XX <mixxx-note>`, requests are only  considered *atomic*
      by Infinity Fabric if they are targeted at  non-write-cacheable memory, such
      as  :ref:`fine-grained memory <memory-type>` allocations or  :ref:`uncached
      memory <memory-type>` allocations on the MI2XX.
    unit: Requests per normalization unit
  Read Stall:
    rst: "The ratio of the total number of cycles the L2-Fabric interface was stalled\
      \ on a read request to any destination (local HBM, remote PCIe\xAE connected\
      \ accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_\
      \ or CPU) over the :ref:`total active L2 cycles <total-active-l2-cycles>`."
    unit: Percent
  Write Stall:
    rst: The ratio of the total number of cycles the L2-Fabric interface was stalled
      on a write or atomic request to any destination (local HBM, remote PCIe connected
      accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_
      or CPU) over the :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Read - PCIe Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on read requests
      to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total
      active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Read - Infinity Fabric Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on read requests
      to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a percent
      of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Read - HBM Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on read requests
      to the accelerator's local HBM as a percent of the :ref:`total active L2 cycles
      <total-active-l2-cycles>`.
    unit: Percent
  Write - PCIe Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
      requests to remote PCIe connected accelerators [#inf]_ or CPUs as a percent
      of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Write - Infinity Fabric Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
      requests to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as
      a percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Write - HBM Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
      requests to the accelerator's local HBM as a percent of the :ref:`total active
      L2 cycles <total-active-l2-cycles>`.
    unit: Percent
L2-Fabric interface metrics:
  Utilization:
    rst: The ratio of the  :ref:`number of cycles an L2 channel was active, summed
      over all L2 channels on the accelerator <total-active-l2-cycles>`  over the
      :ref:`total L2 cycles <total-l2-cycles>`.
    unit: Percent
  Peak Bandwidth:
    rst: The number of bytes looked up in the L2 cache, as a percent of the peak  theoretical
      bandwidth achievable on the specific accelerator. The number  of bytes is calculated
      as the number of cache lines requested multiplied  by the cache line size. This
      value does not consider partial requests, so, for example, if only a single value is
      requested in a cache line, the data  movement will still be counted as a full
      cache line.
    unit: Percent
  Hit Rate:
    rst: The ratio of the number of L2 cache line requests that hit in the L2  cache
      over the total number of incoming cache line requests to the L2 cache.
    unit: Percent
  L2-Fabric Read BW:
    rst: The number of bytes read by the L2 over the  :ref:`Infinity Fabric interface
      <l2-fabric>` per unit time.
    unit: GB/s
  L2-Fabric Write and Atomic BW:
    rst: The number of bytes sent by the L2 over the  :ref:`Infinity Fabric interface
      <l2-fabric>` by write and atomic  operations per unit time.
    unit: GB/s
  HBM Bandwidth:
    rst: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory
      (HBM) per unit time. This value is calculated as the number of HBM channels
      multiplied by the HBM channel width multiplied by the HBM clock frequency.
    unit: GB/s
  Read BW:
    rst: The total number of bytes read by the L2 cache from Infinity Fabric per  :ref:`normalization
      unit <normalization-units>`.
    unit: Bytes per normalization unit
  HBM Read Traffic:
    rst: The percent of read requests generated by the L2 cache that are routed to  the
      accelerator's local high-bandwidth memory (HBM). This breakdown does  not consider
      the *size* of the request (meaning that 32B and 64B requests  are both counted
      as a single request), so this metric only *approximates*  the percent of the
      L2-Fabric Read bandwidth directed to the local HBM.
    unit: Percent
  Remote Read Traffic:
    rst: The percent of read requests generated by the L2 cache that are routed to  any
      memory location other than the accelerator's local high-bandwidth  memory (HBM)
      -- for example, the CPU's DRAM or a remote accelerator's  HBM. This breakdown
      does not consider the *size* of the request (meaning  that 32B and 64B requests
      are both counted as a single request), so this  metric only *approximates* the
      percent of the L2-Fabric Read bandwidth  directed to a remote location.
    unit: Percent
  Uncached Read Traffic:
    rst: The percent of read requests generated by the L2 cache that are reading  from
      an :ref:`uncached memory allocation <memory-type>`. Note, as  described in the
      :ref:`request flow <l2-request-flow>` section, a single  64B read request is
      typically counted as two uncached read requests. So,  it is possible for the
      Uncached Read Traffic to reach up to 200% of the  total number of read requests.
      This breakdown does not consider the  *size* of the request (i.e., 32B and 64B
      requests are both counted as a  single request), so this metric only *approximates*
      the percent of the  L2-Fabric read bandwidth directed to an uncached memory
      location.
    unit: Percent
  Write and Atomic BW:
    rst: The total number of bytes written by the L2 over Infinity Fabric by write  and
      atomic operations per  :ref:`normalization unit <normalization-units>`. Note
      that on current  CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests
      are  only considered *atomic* by Infinity Fabric if they are targeted at  non-write-cacheable
      memory, for example,  :ref:`fine-grained memory <memory-type>` allocations or  :ref:`uncached
      memory <memory-type>` allocations on the  MI2XX.
    unit: Bytes per normalization unit
  HBM Write and Atomic Traffic:
    rst: The percent of write and atomic requests generated by the L2 cache that  are
      routed to the accelerator's local high-bandwidth memory (HBM). This  breakdown
      does not consider the *size* of the request (meaning that 32B  and 64B requests
      are both counted as a single request), so this metric  only *approximates* the
      percent of the L2-Fabric Write and Atomic  bandwidth directed to the local HBM.
      Note that on current CDNA  accelerators, such as the :ref:`MI2XX <mixxx-note>`,
      requests are only  considered *atomic* by Infinity Fabric if they are targeted
      at  :ref:`fine-grained memory <memory-type>` allocations or  :ref:`uncached
      memory <memory-type>` allocations.
    unit: Percent
  Remote Write and Atomic Traffic:
    rst: The percent of write and atomic requests generated by the L2 cache that are
      routed to any memory location other than the accelerator's local high-bandwidth
      memory (HBM) -- for example, the CPU's DRAM or a remote accelerator's HBM. This
      breakdown does not consider the *size* of the request (meaning that 32B and 64B
      requests are both counted as a single request), so this metric only *approximates*
      the percent of the L2-Fabric Write and Atomic bandwidth directed to a remote
      location. Note that on current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`,
      requests are only considered *atomic* by Infinity Fabric if they are targeted
      at :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached
      memory <memory-type>` allocations.
    unit: Percent
  Atomic Traffic:
    rst: The percent of write and atomic requests generated by the L2 cache that are
      atomic requests to *any* memory location. This breakdown does not consider the
      *size* of the request (meaning that 32B and 64B requests are both counted as
      a single request), so this metric only *approximates* the percent of the L2-Fabric
      Write and Atomic bandwidth used by atomic requests. Note that on current CDNA
      accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests are only considered
      *atomic* by Infinity Fabric if they are targeted at :ref:`fine-grained memory
      <memory-type>` allocations or :ref:`uncached memory <memory-type>` allocations.
    unit: Percent
  Uncached Write and Atomic Traffic:
    rst: The percent of write and atomic requests generated by the L2 cache that  are
      targeting :ref:`uncached memory allocations <memory-type>`. This  breakdown
      does not consider the *size* of the request (meaning that 32B  and 64B requests
      are both counted as a single request), so this metric  only *approximates* the
      percent of the L2-Fabric Write and Atomic bandwidth directed to uncached memory allocations.
    unit: Percent
  Read Latency:
    rst: The time-averaged number of cycles read requests spent in Infinity Fabric  before
      data was returned to the L2.
    unit: Cycles
  Write and Atomic Latency:
    rst: The time-averaged number of cycles write requests spent in Infinity Fabric
      before a completion acknowledgement was returned to the L2.
    unit: Cycles
  Atomic Latency:
    rst: The time-averaged number of cycles atomic requests spent in Infinity Fabric
      before a completion acknowledgement (atomic without return value) or data (atomic
      with return value) was returned to the L2.
    unit: Cycles
  Bandwidth:
    rst: The number of bytes looked up in the L2 cache, per  :ref:`normalization unit
      <normalization-units>`.  The number of bytes is  calculated as the number of
      cache lines requested multiplied by the cache  line size. This value does not
      consider partial requests, so for example,  if only a single value is requested
      in a cache line, the data movement  will still be counted as a full cache line.
    unit: Bytes per normalization unit
  Req:
    rst: The total number of incoming requests to the L2 from all clients for all  request
      types, per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Read Req:
    rst: The total number of read requests to the L2 from all clients.
    unit: Requests per normalization unit
  Write Req:
    rst: The total number of write requests to the L2 from all clients.
    unit: Requests per normalization unit
  Atomic Req:
    rst: The total number of atomic requests (with and without return) to the L2 from
      all clients.
    unit: Requests per normalization unit
  Streaming Req:
    rst: The total number of incoming requests to the L2 that are marked as  *streaming*.
      The exact meaning of this may differ depending on the targeted accelerator;
      however, on an :ref:`MI2XX <mixxx-note>` this corresponds to `non-temporal
      loads or stores <https://clang.llvm.org/docs/LanguageExtensions.html#non-temporal-load-store-builtins>`_. The
      L2 cache attempts to evict *streaming* requests before normal requests when
      the L2 is at capacity.
    unit: Requests per normalization unit
  Probe Req:
    rst: The number of coherence probe requests made to the L2 cache from outside  the
      accelerator. On an :ref:`MI2XX <mixxx-note>`, probe requests may be  generated
      by, for example, writes to  :ref:`fine-grained device <memory-type>` memory
      or by writes to  :ref:`coarse-grained <memory-type>` device memory.
    unit: Requests per normalization unit
  Cache Hit:
    rst: The ratio of the number of L2 cache line requests that hit in the L2  cache
      over the total number of incoming cache line requests to the L2  cache.
    unit: Percent
  Hits:
    rst: The total number of requests to the L2 from all clients that hit in the  cache.
      As noted in the :ref:`Speed-of-Light <l2-sol>` section, this  includes hit-on-miss
      requests.
    unit: Requests per normalization unit
  Misses:
    rst: The total number of requests to the L2 from all clients that miss in the  cache.
      As noted in the :ref:`Speed-of-Light <l2-sol>` section, these do  not include
      hit-on-miss requests.
    unit: Requests per normalization unit
  Writeback:
    rst: The total number of L2 cache lines written back to memory for any reason.  Write-backs
      may occur due to user code (such as HIP kernel calls to  ``__threadfence_system``
      or atomic built-ins), by the  :doc:`command processor <command-processor>`'s
      memory acquire/release  fences, or for other internal hardware reasons.
    unit: Cache lines per normalization unit
  Writeback (Internal):
    rst: The total number of L2 cache lines written back to memory for internal  hardware
      reasons, per :ref:`normalization unit <normalization-units>`.
    unit: Cache lines per normalization unit
  Writeback (vL1D Req):
    rst: The total number of L2 cache lines written back to memory due to requests  initiated
      by the :doc:`vL1D cache <vector-l1-cache>`, per  :ref:`normalization unit <normalization-units>`.
    unit: Cache lines per normalization unit
  Evict (Internal):
    rst: The total number of L2 cache lines evicted from the cache due to capacity  limits,
      per :ref:`normalization unit <normalization-units>`.
    unit: Cache lines per normalization unit
  Evict (vL1D Req):
    rst: The total number of L2 cache lines evicted from the cache due to  invalidation
      requests initiated by the  :doc:`vL1D cache <vector-l1-cache>`, per  :ref:`normalization
      unit <normalization-units>`.
    unit: Cache lines per normalization unit
  NC Req:
    rst: The total number of requests to the L2 to Not-hardware-Coherent (NC)  memory
      allocations, per :ref:`normalization unit <normalization-units>`.  See the :ref:`memory-type`
      for more information.
    unit: Requests per normalization unit
  UC Req:
    rst: The total number of requests to the L2 that go to Uncached (UC) memory  allocations.
      See the :ref:`memory-type` for more information.
    unit: Requests per normalization unit
  CC Req:
    rst: The total number of requests to the L2 that go to Coherently Cacheable (CC)  memory
      allocations. See the :ref:`memory-type` for more information.
    unit: Requests per normalization unit
  RW Req:
    rst: The total number of requests to the L2 that go to Read-Write coherent memory  (RW)
      allocations. See the :ref:`memory-type` for more information.
    unit: Requests per normalization unit
  Write - Credit Starvation:
    rst: The number of cycles the L2-Fabric interface was stalled on write or  atomic
      requests to any memory location because too many write/atomic  requests were
      currently in flight, as a percent of the  :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Read (32B):
    rst: The total number of L2 requests to Infinity Fabric to read 32B of data  from
      any memory location, per  :ref:`normalization unit <normalization-units>`. See  :ref:`l2-request-flow`
      for more detail. Typically unused on CDNA  accelerators.
    unit: Requests per normalization unit
  Read (64B):
    rst: The total number of L2 requests to Infinity Fabric to read 64B of data  from
      any memory location, per  :ref:`normalization unit <normalization-units>`. See  :ref:`l2-request-flow`
      for more detail.
    unit: Requests per normalization unit
  Read (Uncached):
    rst: The total number of L2 requests to Infinity Fabric to read  :ref:`uncached
      data <memory-type>` from any memory location, per  :ref:`normalization unit
      <normalization-units>`. 64B requests for  uncached data are counted as two 32B
      uncached data requests. See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  HBM Read:
    rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of  data
      from the accelerator's local HBM, per  :ref:`normalization unit <normalization-units>`.
      See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  Remote Read:
    rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of  data
      from any source other than the accelerator's local HBM, per  :ref:`normalization
      unit <normalization-units>`. See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  Write and Atomic (32B):
    rst: The total number of L2 requests to Infinity Fabric to write or atomically  update
      32B of data to any memory location, per  :ref:`normalization unit <normalization-units>`.
      See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  Write and Atomic (Uncached):
    rst: The total number of L2 requests to Infinity Fabric to write or atomically  update
      32B or 64B of :ref:`uncached data <memory-type>`, per  :ref:`normalization unit
      <normalization-units>`. See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  Write and Atomic (64B):
    rst: The total number of L2 requests to Infinity Fabric to write or atomically  update
      64B of data in any memory location, per  :ref:`normalization unit <normalization-units>`.
      See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  HBM Write and Atomic:
    rst: The total number of L2 requests to Infinity Fabric to write or atomically  update
      32B or 64B of data in the accelerator's local HBM, per  :ref:`normalization
      unit <normalization-units>`. See :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  Remote Write and Atomic:
    rst: The total number of L2 requests to Infinity Fabric to write or atomically  update
      32B or 64B of data in any memory location other than the  accelerator's local
      HBM, per  :ref:`normalization unit <normalization-units>`. See  :ref:`l2-request-flow`
      for more detail.
    unit: Requests per normalization unit
  Atomic:
    rst: The total number of L2 requests to Infinity Fabric to atomically update  32B
      or 64B of data in any memory location, per  :ref:`normalization unit <normalization-units>`.
      See  :ref:`l2-request-flow` for more detail. Note that on current CDNA  accelerators,
      such as the :ref:`MI2XX <mixxx-note>`, requests are only  considered *atomic*
      by Infinity Fabric if they are targeted at  non-write-cacheable memory, such
      as  :ref:`fine-grained memory <memory-type>` allocations or  :ref:`uncached
      memory <memory-type>` allocations on the MI2XX.
    unit: Requests per normalization unit
  Read Stall:
    rst: "The ratio of the total number of cycles the L2-Fabric interface was stalled\
      \ on a read request to any destination (local HBM, remote PCIe\xAE connected\
      \ accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_\
      \ or CPU) over the :ref:`total active L2 cycles <total-active-l2-cycles>`."
    unit: Percent
  Write Stall:
    rst: The ratio of the total number of cycles the L2-Fabric interface was stalled
      on a write or atomic request to any destination (local HBM, remote PCIe connected
      accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_
      or CPU) over the :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Read - PCIe Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on read requests
      to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total
      active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Read - Infinity Fabric Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on read requests
      to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a percent
      of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Read - HBM Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on read requests
      to the accelerator's local HBM as a percent of the :ref:`total active L2 cycles
      <total-active-l2-cycles>`.
    unit: Percent
  Write - PCIe Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
      requests to remote PCIe connected accelerators [#inf]_ or CPUs as a percent
      of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Write - Infinity Fabric Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
      requests to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as
      a percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Write - HBM Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
      requests to the accelerator's local HBM as a percent of the :ref:`total active
      L2 cycles <total-active-l2-cycles>`.
    unit: Percent
L2 - Fabric interface detailed metrics:
  Utilization:
    rst: The ratio of the  :ref:`number of cycles an L2 channel was active, summed
      over all L2 channels on the accelerator <total-active-l2-cycles>`  over the
      :ref:`total L2 cycles <total-l2-cycles>`.
    unit: Percent
  Peak Bandwidth:
    rst: The number of bytes looked up in the L2 cache, as a percent of the peak  theoretical
      bandwidth achievable on the specific accelerator. The number  of bytes is calculated
      as the number of cache lines requested multiplied  by the cache line size. This
      value does not consider partial requests, so, for example, if only a single value is
      requested in a cache line, the data  movement will still be counted as a full
      cache line.
    unit: Percent
  Hit Rate:
    rst: The ratio of the number of L2 cache line requests that hit in the L2  cache
      over the total number of incoming cache line requests to the L2 cache.
    unit: Percent
  L2-Fabric Read BW:
    rst: The number of bytes read by the L2 over the  :ref:`Infinity Fabric interface
      <l2-fabric>` per unit time.
    unit: GB/s
  L2-Fabric Write and Atomic BW:
    rst: The number of bytes sent by the L2 over the  :ref:`Infinity Fabric interface
      <l2-fabric>` by write and atomic  operations per unit time.
    unit: GB/s
  HBM Bandwidth:
    rst: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory
      (HBM) per unit time. This value is calculated as the number of HBM channels
      multiplied by the HBM channel width multiplied by the HBM clock frequency.
    unit: GB/s
  Read BW:
    rst: The total number of bytes read by the L2 cache from Infinity Fabric per  :ref:`normalization
      unit <normalization-units>`.
    unit: Bytes per normalization unit
  HBM Read Traffic:
    rst: The percent of read requests generated by the L2 cache that are routed to  the
      accelerator's local high-bandwidth memory (HBM). This breakdown does  not consider
      the *size* of the request (meaning that 32B and 64B requests  are both counted
      as a single request), so this metric only *approximates*  the percent of the
      L2-Fabric Read bandwidth directed to the local HBM.
    unit: Percent
  Remote Read Traffic:
    rst: The percent of read requests generated by the L2 cache that are routed to  any
      memory location other than the accelerator's local high-bandwidth  memory (HBM)
      -- for example, the CPU's DRAM or a remote accelerator's  HBM. This breakdown
      does not consider the *size* of the request (meaning  that 32B and 64B requests
      are both counted as a single request), so this  metric only *approximates* the
      percent of the L2-Fabric Read bandwidth  directed to a remote location.
    unit: Percent
  Uncached Read Traffic:
    rst: The percent of read requests generated by the L2 cache that are reading  from
      an :ref:`uncached memory allocation <memory-type>`. Note, as  described in the
      :ref:`request flow <l2-request-flow>` section, a single  64B read request is
      typically counted as two uncached read requests. So,  it is possible for the
      Uncached Read Traffic to reach up to 200% of the  total number of read requests.
      This breakdown does not consider the  *size* of the request (i.e., 32B and 64B
      requests are both counted as a  single request), so this metric only *approximates*
      the percent of the  L2-Fabric read bandwidth directed to an uncached memory
      location.
    unit: Percent
  Write and Atomic BW:
    rst: The total number of bytes written by the L2 over Infinity Fabric by write  and
      atomic operations per  :ref:`normalization unit <normalization-units>`. Note
      that on current  CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests
      are  only considered *atomic* by Infinity Fabric if they are targeted at  non-write-cacheable
      memory, for example,  :ref:`fine-grained memory <memory-type>` allocations or  :ref:`uncached
      memory <memory-type>` allocations on the  MI2XX.
    unit: Bytes per normalization unit
  HBM Write and Atomic Traffic:
    rst: The percent of write and atomic requests generated by the L2 cache that  are
      routed to the accelerator's local high-bandwidth memory (HBM). This  breakdown
      does not consider the *size* of the request (meaning that 32B  and 64B requests
      are both counted as a single request), so this metric  only *approximates* the
      percent of the L2-Fabric Write and Atomic  bandwidth directed to the local HBM.
      Note that on current CDNA  accelerators, such as the :ref:`MI2XX <mixxx-note>`,
      requests are only  considered *atomic* by Infinity Fabric if they are targeted
      at  :ref:`fine-grained memory <memory-type>` allocations or  :ref:`uncached
      memory <memory-type>` allocations.
    unit: Percent
  Remote Write and Atomic Traffic:
    rst: The percent of write and atomic requests generated by the L2 cache that are
      routed to any memory location other than the accelerator's local high-bandwidth
      memory (HBM) -- for example, the CPU's DRAM or a remote accelerator's HBM. This
      breakdown does not consider the *size* of the request (meaning that 32B and 64B
      requests are both counted as a single request), so this metric only *approximates*
      the percent of the L2-Fabric Write and Atomic bandwidth directed to a remote
      location. Note that on current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`,
      requests are only considered *atomic* by Infinity Fabric if they are targeted
      at :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached
      memory <memory-type>` allocations.
    unit: Percent
  Atomic Traffic:
    rst: The percent of write requests generated by the L2 cache that are atomic requests
      to *any* memory location. This breakdown does not consider the *size* of the
      request (meaning that 32B and 64B requests are both counted as a single request),
      so this metric only *approximates* the percent of the L2-Fabric Write and Atomic
      bandwidth consumed by atomic requests. Note that on current CDNA accelerators, such
      as the :ref:`MI2XX <mixxx-note>`, requests are only considered *atomic* by
      Infinity Fabric if they are targeted at :ref:`fine-grained memory <memory-type>`
      allocations or :ref:`uncached memory <memory-type>` allocations.
    unit: Percent
  Uncached Write and Atomic Traffic:
    rst: The percent of write and atomic requests generated by the L2 cache that  are
      targeting :ref:`uncached memory allocations <memory-type>`. This  breakdown
      does not consider the *size* of the request (meaning that 32B  and 64B requests
      are both counted as a single request), so this metric  only *approximates* the
      percent of the L2-Fabric Write and Atomic bandwidth directed to uncached memory allocations.
    unit: Percent
  Read Latency:
    rst: The time-averaged number of cycles read requests spent in Infinity Fabric  before
      data was returned to the L2.
    unit: Cycles
  Write and Atomic Latency:
    rst: The time-averaged number of cycles write requests spent in Infinity Fabric
      before a completion acknowledgement was returned to the L2.
    unit: Cycles
  Atomic Latency:
    rst: The time-averaged number of cycles atomic requests spent in Infinity Fabric
      before a completion acknowledgement (atomic without return value) or data (atomic
      with return value) was returned to the L2.
    unit: Cycles
  Bandwidth:
    rst: The number of bytes looked up in the L2 cache, per  :ref:`normalization unit
      <normalization-units>`.  The number of bytes is  calculated as the number of
      cache lines requested multiplied by the cache  line size. This value does not
      consider partial requests, so for example,  if only a single value is requested
      in a cache line, the data movement  will still be counted as a full cache line.
    unit: Bytes per normalization unit
  Req:
    rst: The total number of incoming requests to the L2 from all clients for all  request
      types, per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Read Req:
    rst: The total number of read requests to the L2 from all clients.
    unit: Requests per normalization unit
  Write Req:
    rst: The total number of write requests to the L2 from all clients.
    unit: Requests per normalization unit
  Atomic Req:
    rst: The total number of atomic requests (with and without return) to the L2 from
      all clients.
    unit: Requests per normalization unit
  Streaming Req:
    rst: The total number of incoming requests to the L2 that are marked as *streaming*.
      The exact meaning of this may differ depending on the targeted accelerator;
      however, on an :ref:`MI2XX <mixxx-note>` this corresponds to `non-temporal
      loads or stores <https://clang.llvm.org/docs/LanguageExtensions.html#non-temporal-load-store-builtins>`_. The
      L2 cache attempts to evict *streaming* requests before normal requests when
      the L2 is at capacity.
    unit: Requests per normalization unit
  Probe Req:
    rst: The number of coherence probe requests made to the L2 cache from outside  the
      accelerator. On an :ref:`MI2XX <mixxx-note>`, probe requests may be  generated
      by, for example, writes to  :ref:`fine-grained device <memory-type>` memory
      or by writes to  :ref:`coarse-grained <memory-type>` device memory.
    unit: Requests per normalization unit
  Cache Hit:
    rst: The ratio of the number of L2 cache line requests that hit in the L2  cache
      over the total number of incoming cache line requests to the L2  cache.
    unit: Percent
  Hits:
    rst: The total number of requests to the L2 from all clients that hit in the  cache.
      As noted in the :ref:`Speed-of-Light <l2-sol>` section, this  includes hit-on-miss
      requests.
    unit: Requests per normalization unit
  Misses:
    rst: The total number of requests to the L2 from all clients that miss in the  cache.
      As noted in the :ref:`Speed-of-Light <l2-sol>` section, these do  not include
      hit-on-miss requests.
    unit: Requests per normalization unit
  Writeback:
    rst: The total number of L2 cache lines written back to memory for any reason. Write-backs
      may occur due to user code (such as HIP kernel calls to ``__threadfence_system``
      or atomic built-ins), by the :doc:`command processor <command-processor>`'s
      memory acquire/release fences, or for other internal hardware reasons.
    unit: Cache lines per normalization unit
  Writeback (Internal):
    rst: The total number of L2 cache lines written back to memory for internal  hardware
      reasons, per :ref:`normalization unit <normalization-units>`.
    unit: Cache lines per normalization unit
  Writeback (vL1D Req):
    rst: The total number of L2 cache lines written back to memory due to requests  initiated
      by the :doc:`vL1D cache <vector-l1-cache>`, per  :ref:`normalization unit <normalization-units>`.
    unit: Cache lines per normalization unit
  Evict (Internal):
    rst: The total number of L2 cache lines evicted from the cache due to capacity  limits,
      per :ref:`normalization unit <normalization-units>`.
    unit: Cache lines per normalization unit
  Evict (vL1D Req):
    rst: The total number of L2 cache lines evicted from the cache due to  invalidation
      requests initiated by the  :doc:`vL1D cache <vector-l1-cache>`, per  :ref:`normalization
      unit <normalization-units>`.
    unit: Cache lines per normalization unit
  NC Req:
    rst: The total number of requests to the L2 to Not-hardware-Coherent (NC)  memory
      allocations, per :ref:`normalization unit <normalization-units>`.  See the :ref:`memory-type`
      for more information.
    unit: Requests per normalization unit
  UC Req:
    rst: The total number of requests to the L2 that go to Uncached (UC) memory  allocations.
      See the :ref:`memory-type` for more information.
    unit: Requests per normalization unit
  CC Req:
    rst: The total number of requests to the L2 that go to Coherently Cacheable (CC)  memory
      allocations. See the :ref:`memory-type` for more information.
    unit: Requests per normalization unit
  RW Req:
    rst: The total number of requests to the L2 that go to Read-Write coherent memory  (RW)
      allocations. See the :ref:`memory-type` for more information.
    unit: Requests per normalization unit
  Write - Credit Starvation:
    rst: The number of cycles the L2-Fabric interface was stalled on write or  atomic
      requests to any memory location because too many write/atomic  requests were
      currently in flight, as a percent of the  :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Read (32B):
    rst: The total number of L2 requests to Infinity Fabric to read 32B of data  from
      any memory location, per  :ref:`normalization unit <normalization-units>`. See  :ref:`l2-request-flow`
      for more detail. Typically unused on CDNA  accelerators.
    unit: Requests per normalization unit
  Read (64B):
    rst: The total number of L2 requests to Infinity Fabric to read 64B of data  from
      any memory location, per  :ref:`normalization unit <normalization-units>`. See  :ref:`l2-request-flow`
      for more detail.
    unit: Requests per normalization unit
  Read (Uncached):
    rst: The total number of L2 requests to Infinity Fabric to read  :ref:`uncached
      data <memory-type>` from any memory location, per  :ref:`normalization unit
      <normalization-units>`. 64B requests for  uncached data are counted as two 32B
      uncached data requests. See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  HBM Read:
    rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of  data
      from the accelerator's local HBM, per  :ref:`normalization unit <normalization-units>`.
      See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  Remote Read:
    rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of  data
      from any source other than the accelerator's local HBM, per  :ref:`normalization
      unit <normalization-units>`. See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  Write and Atomic (32B):
    rst: The total number of L2 requests to Infinity Fabric to write or atomically  update
      32B of data to any memory location, per  :ref:`normalization unit <normalization-units>`.
      See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  Write and Atomic (Uncached):
    rst: The total number of L2 requests to Infinity Fabric to write or atomically  update
      32B or 64B of :ref:`uncached data <memory-type>`, per  :ref:`normalization unit
      <normalization-units>`. See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  Write and Atomic (64B):
    rst: The total number of L2 requests to Infinity Fabric to write or atomically  update
      64B of data in any memory location, per  :ref:`normalization unit <normalization-units>`.
      See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  HBM Write and Atomic:
    rst: The total number of L2 requests to Infinity Fabric to write or atomically  update
      32B or 64B of data in the accelerator's local HBM, per  :ref:`normalization
      unit <normalization-units>`. See :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  Remote Write and Atomic:
    rst: The total number of L2 requests to Infinity Fabric to write or atomically  update
      32B or 64B of data in any memory location other than the  accelerator's local
      HBM, per  :ref:`normalization unit <normalization-units>`. See  :ref:`l2-request-flow`
      for more detail.
    unit: Requests per normalization unit
  Atomic:
    rst: The total number of L2 requests to Infinity Fabric to atomically update  32B
      or 64B of data in any memory location, per  :ref:`normalization unit <normalization-units>`.
      See  :ref:`l2-request-flow` for more detail. Note that on current CDNA  accelerators,
      such as the :ref:`MI2XX <mixxx-note>`, requests are only  considered *atomic*
      by Infinity Fabric if they are targeted at  non-write-cacheable memory, such
      as  :ref:`fine-grained memory <memory-type>` allocations or  :ref:`uncached
      memory <memory-type>` allocations on the MI2XX.
    unit: Requests per normalization unit
  Read Stall:
    rst: "The ratio of the total number of cycles the L2-Fabric interface was stalled\
      \ on a read request to any destination (local HBM, remote PCIe\xAE connected\
      \ accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_\
      \ or CPU) over the :ref:`total active L2 cycles <total-active-l2-cycles>`."
    unit: Percent
  Write Stall:
    rst: The ratio of the total number of cycles the L2-Fabric interface was stalled
      on a write or atomic request to any destination (local HBM, remote PCIe connected
      accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_
      or CPU) over the :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Read - PCIe Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on read requests
      to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total
      active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Read - Infinity Fabric Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on read requests
      to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a percent
      of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Read - HBM Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on read requests
      to the accelerator's local HBM as a percent of the :ref:`total active L2 cycles
      <total-active-l2-cycles>`.
    unit: Percent
  Write - PCIe Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
      requests to remote PCIe connected accelerators [#inf]_ or CPUs as a percent
      of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Write - Infinity Fabric Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
      requests to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as
      a percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Write - HBM Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
      requests to the accelerator's local HBM as a percent of the :ref:`total active
      L2 cycles <total-active-l2-cycles>`.
    unit: Percent
L2 - Fabric Interface stalls:
  Utilization:
    rst: The ratio of the  :ref:`number of cycles an L2 channel was active, summed
      over all L2 channels on the accelerator <total-active-l2-cycles>`  over the
      :ref:`total L2 cycles <total-l2-cycles>`.
    unit: Percent
  Peak Bandwidth:
    rst: The number of bytes looked up in the L2 cache, as a percent of the peak  theoretical
      bandwidth achievable on the specific accelerator. The number  of bytes is calculated
      as the number of cache lines requested multiplied  by the cache line size. This
      value does not consider partial requests, so, for example, if only a single value is
      requested in a cache line, the data  movement will still be counted as a full
      cache line.
    unit: Percent
  Hit Rate:
    rst: The ratio of the number of L2 cache line requests that hit in the L2  cache
      over the total number of incoming cache line requests to the L2 cache.
    unit: Percent
  L2-Fabric Read BW:
    rst: The number of bytes read by the L2 over the  :ref:`Infinity Fabric interface
      <l2-fabric>` per unit time.
    unit: GB/s
  L2-Fabric Write and Atomic BW:
    rst: The number of bytes sent by the L2 over the  :ref:`Infinity Fabric interface
      <l2-fabric>` by write and atomic  operations per unit time.
    unit: GB/s
  HBM Bandwidth:
    rst: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory
      (HBM) per unit time. This value is calculated as the number of HBM channels
      multiplied by the HBM channel width multiplied by the HBM clock frequency.
    unit: GB/s
  Read BW:
    rst: The total number of bytes read by the L2 cache from Infinity Fabric per  :ref:`normalization
      unit <normalization-units>`.
    unit: Bytes per normalization unit
  HBM Read Traffic:
    rst: The percent of read requests generated by the L2 cache that are routed to  the
      accelerator's local high-bandwidth memory (HBM). This breakdown does  not consider
      the *size* of the request (meaning that 32B and 64B requests  are both counted
      as a single request), so this metric only *approximates*  the percent of the
      L2-Fabric Read bandwidth directed to the local HBM.
    unit: Percent
  Remote Read Traffic:
    rst: The percent of read requests generated by the L2 cache that are routed to  any
      memory location other than the accelerator's local high-bandwidth  memory (HBM)
      -- for example, the CPU's DRAM or a remote accelerator's  HBM. This breakdown
      does not consider the *size* of the request (meaning  that 32B and 64B requests
      are both counted as a single request), so this  metric only *approximates* the
      percent of the L2-Fabric Read bandwidth  directed to a remote location.
    unit: Percent
  Uncached Read Traffic:
    rst: The percent of read requests generated by the L2 cache that are reading  from
      an :ref:`uncached memory allocation <memory-type>`. Note, as  described in the
      :ref:`request flow <l2-request-flow>` section, a single  64B read request is
      typically counted as two uncached read requests. So,  it is possible for the
      Uncached Read Traffic to reach up to 200% of the  total number of read requests.
      This breakdown does not consider the  *size* of the request (i.e., 32B and 64B
      requests are both counted as a  single request), so this metric only *approximates*
      the percent of the  L2-Fabric read bandwidth directed to an uncached memory
      location.
    unit: Percent
  Write and Atomic BW:
    rst: The total number of bytes written by the L2 over Infinity Fabric by write  and
      atomic operations per  :ref:`normalization unit <normalization-units>`. Note
      that on current  CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests
      are  only considered *atomic* by Infinity Fabric if they are targeted at  non-write-cacheable
      memory, for example,  :ref:`fine-grained memory <memory-type>` allocations or  :ref:`uncached
      memory <memory-type>` allocations on the  MI2XX.
    unit: Bytes per normalization unit
  HBM Write and Atomic Traffic:
    rst: The percent of write and atomic requests generated by the L2 cache that  are
      routed to the accelerator's local high-bandwidth memory (HBM). This  breakdown
      does not consider the *size* of the request (meaning that 32B  and 64B requests
      are both counted as a single request), so this metric  only *approximates* the
      percent of the L2-Fabric Write and Atomic  bandwidth directed to the local HBM.
      Note that on current CDNA  accelerators, such as the :ref:`MI2XX <mixxx-note>`,
      requests are only  considered *atomic* by Infinity Fabric if they are targeted
      at  :ref:`fine-grained memory <memory-type>` allocations or  :ref:`uncached
      memory <memory-type>` allocations.
    unit: Percent
  Remote Write and Atomic Traffic:
    rst: The percent of write and atomic requests generated by the L2 cache that are
      routed to any memory location other than the accelerator's local high-bandwidth
      memory (HBM) -- for example, the CPU's DRAM or a remote accelerator's HBM. This
      breakdown does not consider the *size* of the request (meaning that 32B and 64B
      requests are both counted as a single request), so this metric only *approximates*
      the percent of the L2-Fabric Write and Atomic bandwidth directed to a remote
      location. Note that on current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`,
      requests are only considered *atomic* by Infinity Fabric if they are targeted
      at :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached
      memory <memory-type>` allocations.
    unit: Percent
  Atomic Traffic:
    rst: The percent of write and atomic requests generated by the L2 cache that are atomic  requests
      to *any* memory location. This breakdown does not consider the  *size* of the
      request (meaning that 32B and 64B requests are both counted  as a single request),
      so this metric only *approximates* the percent of  the L2-Fabric Write and Atomic bandwidth
      used by atomic requests. Note that on  current CDNA accelerators, such
      as the :ref:`MI2XX <mixxx-note>`,  requests are only considered *atomic* by
      Infinity Fabric if they are  targeted at :ref:`fine-grained memory <memory-type>`
      allocations or  :ref:`uncached memory <memory-type>` allocations.
    unit: Percent
  Uncached Write and Atomic Traffic:
    rst: The percent of write and atomic requests generated by the L2 cache that  are
      targeting :ref:`uncached memory allocations <memory-type>`. This  breakdown
      does not consider the *size* of the request (meaning that 32B  and 64B requests
      are both counted as a single request), so this metric  only *approximates* the
      percent of the L2-Fabric Write and Atomic bandwidth directed  to uncached memory allocations.
    unit: Percent
  Read Latency:
    rst: The time-averaged number of cycles read requests spent in Infinity Fabric  before
      data was returned to the L2.
    unit: Cycles
  Write and Atomic Latency:
    rst: The time-averaged number of cycles write requests spent in Infinity Fabric
      before a completion acknowledgement was returned to the L2.
    unit: Cycles
  Atomic Latency:
    rst: The time-averaged number of cycles atomic requests spent in Infinity Fabric
      before a completion acknowledgement (atomic without return value) or data (atomic
      with return value) was returned to the L2.
    unit: Cycles
  Bandwidth:
    rst: The number of bytes looked up in the L2 cache, per  :ref:`normalization unit
      <normalization-units>`.  The number of bytes is  calculated as the number of
      cache lines requested multiplied by the cache  line size. This value does not
      consider partial requests, so for example,  if only a single value is requested
      in a cache line, the data movement  will still be counted as a full cache line.
    unit: Bytes per normalization unit
  Req:
    rst: The total number of incoming requests to the L2 from all clients for all  request
      types, per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Read Req:
    rst: The total number of read requests to the L2 from all clients.
    unit: Requests per normalization unit
  Write Req:
    rst: The total number of write requests to the L2 from all clients.
    unit: Requests per normalization unit
  Atomic Req:
    rst: The total number of atomic requests (with and without return) to the L2 from
      all clients.
    unit: Requests per normalization unit
  Streaming Req:
    rst: The total number of incoming requests to the L2 that are marked as  *streaming*.
      The exact meaning of this may differ depending on the  targeted accelerator,
      however on an :ref:`MI2XX <mixxx-note>` this  corresponds to  `non-temporal
      load or stores <https://clang.llvm.org/docs/LanguageExtensions.html#non-temporal-load-store-builtins>`_.  The
      L2 cache attempts to evict *streaming* requests before normal  requests when
      the L2 is at capacity.
    unit: Requests per normalization unit
  Probe Req:
    rst: The number of coherence probe requests made to the L2 cache from outside  the
      accelerator. On an :ref:`MI2XX <mixxx-note>`, probe requests may be  generated
      by, for example, writes to  :ref:`fine-grained device <memory-type>` memory
      or by writes to  :ref:`coarse-grained <memory-type>` device memory.
    unit: Requests per normalization unit
  Cache Hit:
    rst: The ratio of the number of L2 cache line requests that hit in the L2  cache
      over the total number of incoming cache line requests to the L2  cache.
    unit: Percent
  Hits:
    rst: The total number of requests to the L2 from all clients that hit in the  cache.
      As noted in the :ref:`Speed-of-Light <l2-sol>` section, this  includes hit-on-miss
      requests.
    unit: Requests per normalization unit
  Misses:
    rst: The total number of requests to the L2 from all clients that miss in the  cache.
      As noted in the :ref:`Speed-of-Light <l2-sol>` section, these do  not include
      hit-on-miss requests.
    unit: Requests per normalization unit
  Writeback:
    rst: The total number of L2 cache lines written back to memory for any reason.  Write-backs
      may occur due to user code (such as HIP kernel calls to  ``__threadfence_system``
      or atomic built-ins), by the  :doc:`command processor <command-processor>`'s
      memory acquire/release  fences, or for other internal hardware reasons.
    unit: Cache lines per normalization unit
  Writeback (Internal):
    rst: The total number of L2 cache lines written back to memory for internal  hardware
      reasons, per :ref:`normalization unit <normalization-units>`.
    unit: Cache lines per normalization unit
  Writeback (vL1D Req):
    rst: The total number of L2 cache lines written back to memory due to requests  initiated
      by the :doc:`vL1D cache <vector-l1-cache>`, per  :ref:`normalization unit <normalization-units>`.
    unit: Cache lines per normalization unit
  Evict (Internal):
    rst: The total number of L2 cache lines evicted from the cache due to capacity  limits,
      per :ref:`normalization unit <normalization-units>`.
    unit: Cache lines per normalization unit
  Evict (vL1D Req):
    rst: The total number of L2 cache lines evicted from the cache due to  invalidation
      requests initiated by the  :doc:`vL1D cache <vector-l1-cache>`, per  :ref:`normalization
      unit <normalization-units>`.
    unit: Cache lines per normalization unit
  NC Req:
    rst: The total number of requests to the L2 to Not-hardware-Coherent (NC)  memory
      allocations, per :ref:`normalization unit <normalization-units>`.  See the :ref:`memory-type`
      for more information.
    unit: Requests per normalization unit
  UC Req:
    rst: The total number of requests to the L2 that go to Uncached (UC) memory  allocations.
      See the :ref:`memory-type` for more information.
    unit: Requests per normalization unit
  CC Req:
    rst: The total number of requests to the L2 that go to Coherently Cacheable (CC)  memory
      allocations. See the :ref:`memory-type` for more information.
    unit: Requests per normalization unit
  RW Req:
    rst: The total number of requests to the L2 that go to Read-Write coherent memory  (RW)
      allocations. See the :ref:`memory-type` for more information.
    unit: Requests per normalization unit
  Write - Credit Starvation:
    rst: The number of cycles the L2-Fabric interface was stalled on write or  atomic
      requests to any memory location because too many write/atomic  requests were
      currently in flight, as a percent of the  :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Read (32B):
    rst: The total number of L2 requests to Infinity Fabric to read 32B of data  from
      any memory location, per  :ref:`normalization unit <normalization-units>`. See  :ref:`l2-request-flow`
      for more detail. Typically unused on CDNA  accelerators.
    unit: Requests per normalization unit
  Read (64B):
    rst: The total number of L2 requests to Infinity Fabric to read 64B of data  from
      any memory location, per  :ref:`normalization unit <normalization-units>`. See  :ref:`l2-request-flow`
      for more detail.
    unit: Requests per normalization unit
  Read (Uncached):
    rst: The total number of L2 requests to Infinity Fabric to read  :ref:`uncached
      data <memory-type>` from any memory location, per  :ref:`normalization unit
      <normalization-units>`. 64B requests for  uncached data are counted as two 32B
      uncached data requests. See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  HBM Read:
    rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of  data
      from the accelerator's local HBM, per  :ref:`normalization unit <normalization-units>`.
      See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  Remote Read:
    rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of  data
      from any source other than the accelerator's local HBM, per  :ref:`normalization
      unit <normalization-units>`. See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  Write and Atomic (32B):
    rst: The total number of L2 requests to Infinity Fabric to write or atomically  update
      32B of data to any memory location, per  :ref:`normalization unit <normalization-units>`.
      See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  Write and Atomic (Uncached):
    rst: The total number of L2 requests to Infinity Fabric to write or atomically  update
      32B or 64B of :ref:`uncached data <memory-type>`, per  :ref:`normalization unit
      <normalization-units>`. See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  Write and Atomic (64B):
    rst: The total number of L2 requests to Infinity Fabric to write or atomically  update
      64B of data in any memory location, per  :ref:`normalization unit <normalization-units>`.
      See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  HBM Write and Atomic:
    rst: The total number of L2 requests to Infinity Fabric to write or atomically  update
      32B or 64B of data in the accelerator's local HBM, per  :ref:`normalization
      unit <normalization-units>`. See  :ref:`l2-request-flow` for more detail.
    unit: Requests per normalization unit
  Remote Write and Atomic:
    rst: The total number of L2 requests to Infinity Fabric to write or atomically  update
      32B or 64B of data in any memory location other than the  accelerator's local
      HBM, per  :ref:`normalization unit <normalization-units>`. See  :ref:`l2-request-flow`
      for more detail.
    unit: Requests per normalization unit
  Atomic:
    rst: The total number of L2 requests to Infinity Fabric to atomically update  32B
      or 64B of data in any memory location, per  :ref:`normalization unit <normalization-units>`.
      See  :ref:`l2-request-flow` for more detail. Note that on current CDNA  accelerators,
      such as the :ref:`MI2XX <mixxx-note>`, requests are only  considered *atomic*
      by Infinity Fabric if they are targeted at  non-write-cacheable memory, such
      as  :ref:`fine-grained memory <memory-type>` allocations or  :ref:`uncached
      memory <memory-type>` allocations on the MI2XX.
    unit: Requests per normalization unit
  Read Stall:
    rst: "The ratio of the total number of cycles the L2-Fabric interface was stalled\
      \ on a read request to any destination (local HBM, remote PCIe\xAE connected\
      \ accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_\
      \ or CPU) over the :ref:`total active L2 cycles <total-active-l2-cycles>`."
    unit: Percent
  Write Stall:
    rst: The ratio of the total number of cycles the L2-Fabric interface was stalled
      on a write or atomic request to any destination (local HBM, remote PCIe connected
      accelerator or CPU, or remote Infinity Fabric connected
      accelerator [#inf]_ or CPU) over the :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Read - PCIe Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on read requests
      to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total
      active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Read - Infinity Fabric Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on read requests
      to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a percent
      of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Read - HBM Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on read requests
      to the accelerator's local HBM as a percent of the :ref:`total active L2 cycles
      <total-active-l2-cycles>`.
    unit: Percent
  Write - PCIe Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
      requests to remote PCIe connected accelerators [#inf]_ or CPUs as a percent
      of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Write - Infinity Fabric Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
      requests to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as
      a percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
  Write - HBM Stall:
    rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
      requests to the accelerator's local HBM as a percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
    unit: Percent
Scalar L1D Speed-of-Light:
  Bandwidth:
    rst: The number of bytes looked up in the sL1D cache, as a percent of the peak  theoretical
      bandwidth. Calculated as the ratio of sL1D requests over the  :ref:`total sL1D
      cycles <total-sl1d-cycles>`.
    unit: Percent
  Cache Hit Rate:
    rst: Indicates the percent of sL1D requests that hit on a previously loaded  line
      in the cache. The ratio of the number of sL1D requests that hit  [#sl1d-cache]_
      over the number of all sL1D requests.
    unit: Percent
  sL1D-L2 BW:
    rst: "The total number of bytes read from, written to, or atomically updated \
      \ across the sL1D\u2194:doc:`L2 <l2-cache>` interface, per  :ref:`normalization\
      \ unit <normalization-units>`. Note that sL1D writes  and atomics are typically\
      \ unused on current CDNA accelerators, so in the  majority of cases this can\
      \ be interpreted as an sL1D\u2192L2 read bandwidth."
    unit: Bytes per normalization unit
  Req:
    rst: The total number of requests, of any size or type, made to the sL1D per  :ref:`normalization
      unit <normalization-units>`.
    unit: Requests per normalization unit
  Hits:
    rst: The total number of sL1D requests that hit on a previously loaded cache  line,
      per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Misses - Non Duplicated:
    rst: The total number of sL1D requests that missed on a cache line that *was  not*
      already pending due to another request, per  :ref:`normalization unit <normalization-units>`.
      See :ref:`desc-sl1d-sol`  for more detail.
    unit: Requests per normalization unit
  Misses- Duplicated:
    rst: The total number of sL1D requests that missed on a cache line that *was*  already
      pending due to another request, per  :ref:`normalization unit <normalization-units>`.
      See  :ref:`desc-sl1d-sol` for more detail.
    unit: Requests per normalization unit
  Read Req (Total):
    rst: The total number of sL1D read requests of any size, per  :ref:`normalization
      unit <normalization-units>`.
    unit: Requests per normalization unit
  Atomic Req:
    rst: The total number of atomic requests from sL1D to the  :doc:`L2 <l2-cache>`,
      per  :ref:`normalization unit <normalization-units>`. Typically unused on  current
      CDNA accelerators.
    unit: Requests per normalization unit
  Read Req (1 DWord):
    rst: The total number of sL1D read requests made for a single dword of data  (4B),
      per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Read Req (2 DWord):
    rst: The total number of sL1D read requests made for two dwords of data  (8B),
      per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Read Req (4 DWord):
    rst: The total number of sL1D read requests made for four dwords of data  (16B),
      per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Read Req (8 DWord):
    rst: The total number of sL1D read requests made for eight dwords of data  (32B),
      per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Read Req (16 DWord):
    rst: The total number of sL1D read requests made for sixteen dwords of data  (64B),
      per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Read Req:
    rst: The total number of read requests from sL1D to the :doc:`L2 <l2-cache>`,  per
      :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Write Req:
    rst: The total number of write requests from sL1D to the :doc:`L2 <l2-cache>`,  per
      :ref:`normalization unit <normalization-units>`. Typically unused on  current
      CDNA accelerators.
    unit: Requests per normalization unit
  Stall Cycles:
    rst: "The total number of cycles the sL1D\u2194  :doc:`L2 <l2-cache>` interface\
      \ was stalled, per  :ref:`normalization unit <normalization-units>`."
    unit: Cycles per normalization unit
Scalar L1D cache accesses:
  Bandwidth:
    rst: The number of bytes looked up in the sL1D cache, as a percent of the peak  theoretical
      bandwidth. Calculated as the ratio of sL1D requests over the  :ref:`total sL1D
      cycles <total-sl1d-cycles>`.
    unit: Percent
  Cache Hit Rate:
    rst: Indicates the percent of sL1D requests that hit on a previously loaded  line
      in the cache. The ratio of the number of sL1D requests that hit  [#sl1d-cache]_
      over the number of all sL1D requests.
    unit: Percent
  sL1D-L2 BW:
    rst: "The total number of bytes read from, written to, or atomically updated \
      \ across the sL1D\u2194:doc:`L2 <l2-cache>` interface, per  :ref:`normalization\
      \ unit <normalization-units>`. Note that sL1D writes  and atomics are typically\
      \ unused on current CDNA accelerators, so in the  majority of cases this can\
      \ be interpreted as an sL1D\u2192L2 read bandwidth."
    unit: Bytes per normalization unit
  Req:
    rst: The total number of requests, of any size or type, made to the sL1D per  :ref:`normalization
      unit <normalization-units>`.
    unit: Requests per normalization unit
  Hits:
    rst: The total number of sL1D requests that hit on a previously loaded cache  line,
      per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Misses - Non Duplicated:
    rst: The total number of sL1D requests that missed on a cache line that *was  not*
      already pending due to another request, per  :ref:`normalization unit <normalization-units>`.
      See :ref:`desc-sl1d-sol`  for more detail.
    unit: Requests per normalization unit
  Misses- Duplicated:
    rst: The total number of sL1D requests that missed on a cache line that *was*  already
      pending due to another request, per  :ref:`normalization unit <normalization-units>`.
      See  :ref:`desc-sl1d-sol` for more detail.
    unit: Requests per normalization unit
  Read Req (Total):
    rst: The total number of sL1D read requests of any size, per  :ref:`normalization
      unit <normalization-units>`.
    unit: Requests per normalization unit
  Atomic Req:
    rst: The total number of atomic requests from sL1D to the  :doc:`L2 <l2-cache>`,
      per  :ref:`normalization unit <normalization-units>`. Typically unused on  current
      CDNA accelerators.
    unit: Requests per normalization unit
  Read Req (1 DWord):
    rst: The total number of sL1D read requests made for a single dword of data  (4B),
      per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Read Req (2 DWord):
    rst: The total number of sL1D read requests made for two dwords of data  (8B),
      per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Read Req (4 DWord):
    rst: The total number of sL1D read requests made for four dwords of data  (16B),
      per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Read Req (8 DWord):
    rst: The total number of sL1D read requests made for eight dwords of data  (32B),
      per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Read Req (16 DWord):
    rst: The total number of sL1D read requests made for sixteen dwords of data  (64B),
      per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Read Req:
    rst: The total number of read requests from sL1D to the :doc:`L2 <l2-cache>`,  per
      :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Write Req:
    rst: The total number of write requests from sL1D to the :doc:`L2 <l2-cache>`,  per
      :ref:`normalization unit <normalization-units>`. Typically unused on  current
      CDNA accelerators.
    unit: Requests per normalization unit
  Stall Cycles:
    rst: "The total number of cycles the sL1D\u2194  :doc:`L2 <l2-cache>` interface\
      \ was stalled, per  :ref:`normalization unit <normalization-units>`."
    unit: Cycles per normalization unit
Scalar L1D Cache - L2 Interface:
  Bandwidth:
    rst: The number of bytes looked up in the sL1D cache, as a percent of the peak  theoretical
      bandwidth. Calculated as the ratio of sL1D requests over the  :ref:`total sL1D
      cycles <total-sl1d-cycles>`.
    unit: Percent
  Cache Hit Rate:
    rst: Indicates the percent of sL1D requests that hit on a previously loaded  line
      in the cache. The ratio of the number of sL1D requests that hit  [#sl1d-cache]_
      over the number of all sL1D requests.
    unit: Percent
  sL1D-L2 BW:
    rst: "The total number of bytes read from, written to, or atomically updated \
      \ across the sL1D\u2194:doc:`L2 <l2-cache>` interface, per  :ref:`normalization\
      \ unit <normalization-units>`. Note that sL1D writes  and atomics are typically\
      \ unused on current CDNA accelerators, so in the  majority of cases this can\
      \ be interpreted as an sL1D\u2192L2 read bandwidth."
    unit: Bytes per normalization unit
  Req:
    rst: The total number of requests, of any size or type, made to the sL1D per  :ref:`normalization
      unit <normalization-units>`.
    unit: Requests per normalization unit
  Hits:
    rst: The total number of sL1D requests that hit on a previously loaded cache  line,
      per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Misses - Non Duplicated:
    rst: The total number of sL1D requests that missed on a cache line that *was  not*
      already pending due to another request, per  :ref:`normalization unit <normalization-units>`.
      See :ref:`desc-sl1d-sol`  for more detail.
    unit: Requests per normalization unit
  Misses- Duplicated:
    rst: The total number of sL1D requests that missed on a cache line that *was*  already
      pending due to another request, per  :ref:`normalization unit <normalization-units>`.
      See  :ref:`desc-sl1d-sol` for more detail.
    unit: Requests per normalization unit
  Read Req (Total):
    rst: The total number of sL1D read requests of any size, per  :ref:`normalization
      unit <normalization-units>`.
    unit: Requests per normalization unit
  Atomic Req:
    rst: The total number of atomic requests from sL1D to the  :doc:`L2 <l2-cache>`,
      per  :ref:`normalization unit <normalization-units>`. Typically unused on  current
      CDNA accelerators.
    unit: Requests per normalization unit
  Read Req (1 DWord):
    rst: The total number of sL1D read requests made for a single dword of data  (4B),
      per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Read Req (2 DWord):
    rst: The total number of sL1D read requests made for two dwords of data  (8B),
      per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Read Req (4 DWord):
    rst: The total number of sL1D read requests made for four dwords of data  (16B),
      per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Read Req (8 DWord):
    rst: The total number of sL1D read requests made for eight dwords of data  (32B),
      per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Read Req (16 DWord):
    rst: The total number of sL1D read requests made for sixteen dwords of data  (64B),
      per :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Read Req:
    rst: The total number of read requests from sL1D to the :doc:`L2 <l2-cache>`,  per
      :ref:`normalization unit <normalization-units>`.
    unit: Requests per normalization unit
  Write Req:
    rst: The total number of write requests from sL1D to the :doc:`L2 <l2-cache>`,  per
      :ref:`normalization unit <normalization-units>`. Typically unused on  current
      CDNA accelerators.
    unit: Requests per normalization unit
  Stall Cycles:
    rst: "The total number of cycles the sL1D\u2194  :doc:`L2 <l2-cache>` interface\
      \ was stalled, per  :ref:`normalization unit <normalization-units>`."
    unit: Cycles per normalization unit
L1I Speed-of-Light:
  Bandwidth:
    rst: The number of bytes looked up in the L1I cache, as a percent of the peak  theoretical
      bandwidth. Calculated as the ratio of L1I requests over the  :ref:`total L1I
      cycles <total-l1i-cycles>`.
    unit: Percent
  Cache Hit Rate:
    rst: The percent of L1I requests that hit [#l1i-cache]_ on a previously loaded  line
      in the cache. Calculated as the ratio of the number of L1I requests  that hit over
      the number of all L1I requests.
    unit: Percent
  L1I-L2 Bandwidth:
    rst: "The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth\
      \  achieved. Calculated as the ratio of the total number of requests from  the\
      \ L1I to the L2 cache over the  :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`."
    unit: Percent
  Req:
    rst: The total number of requests made to the L1I, per :ref:`normalization-unit <normalization-units>`.
    unit: Requests per normalization unit
  Hits:
    rst: The total number of L1I requests that hit on a previously loaded cache  line,
      per :ref:`normalization-unit <normalization-units>`.
    unit: Requests per normalization unit
  Misses - Non Duplicated:
    rst: The total number of L1I requests that missed on a cache line that  *was
      not* already pending due to another request, per  :ref:`normalization-unit <normalization-units>`.
      See note in  :ref:`desc-l1i-sol` for more detail.
    unit: Requests per normalization unit
  Misses - Duplicated:
    rst: The total number of L1I requests that missed on a cache line that *was*  already
      pending due to another request, per  :ref:`normalization-unit <normalization-units>`.
      See note in  :ref:`desc-l1i-sol` for more detail.
    unit: Requests per normalization unit
  Instruction Fetch Latency:
    rst: The average number of cycles spent to fetch instructions to a  :doc:`CU <compute-unit>`.
    unit: Cycles
L1I cache accesses:
  Bandwidth:
    rst: The number of bytes looked up in the L1I cache, as a percent of the peak  theoretical
      bandwidth. Calculated as the ratio of L1I requests over the  :ref:`total L1I
      cycles <total-l1i-cycles>`.
    unit: Percent
  Cache Hit Rate:
    rst: The percent of L1I requests that hit [#l1i-cache]_ on a previously loaded  line
      in the cache. Calculated as the ratio of the number of L1I requests  that hit over
      the number of all L1I requests.
    unit: Percent
  L1I-L2 Bandwidth:
    rst: "The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth\
      \  achieved. Calculated as the ratio of the total number of requests from  the\
      \ L1I to the L2 cache over the  :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`."
    unit: Percent
  Req:
    rst: The total number of requests made to the L1I, per :ref:`normalization-unit <normalization-units>`.
    unit: Requests per normalization unit
  Hits:
    rst: The total number of L1I requests that hit on a previously loaded cache  line,
      per :ref:`normalization-unit <normalization-units>`.
    unit: Requests per normalization unit
  Misses - Non Duplicated:
    rst: The total number of L1I requests that missed on a cache line that  *was
      not* already pending due to another request, per  :ref:`normalization-unit <normalization-units>`.
      See note in  :ref:`desc-l1i-sol` for more detail.
    unit: Requests per normalization unit
  Misses - Duplicated:
    rst: The total number of L1I requests that missed on a cache line that *was*  already
      pending due to another request, per  :ref:`normalization-unit <normalization-units>`.
      See note in  :ref:`desc-l1i-sol` for more detail.
    unit: Requests per normalization unit
  Instruction Fetch Latency:
    rst: The average number of cycles spent to fetch instructions to a  :doc:`CU <compute-unit>`.
    unit: Cycles
L1I <-> L2 interface:
  Bandwidth:
    rst: The number of bytes looked up in the L1I cache, as a percent of the peak  theoretical
      bandwidth. Calculated as the ratio of L1I requests over the  :ref:`total L1I
      cycles <total-l1i-cycles>`.
    unit: Percent
  Cache Hit Rate:
    rst: The percent of L1I requests that hit [#l1i-cache]_ on a previously loaded  line
      in the cache. Calculated as the ratio of the number of L1I requests  that hit over
      the number of all L1I requests.
    unit: Percent
  L1I-L2 Bandwidth:
    rst: "The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth\
      \  achieved. Calculated as the ratio of the total number of requests from  the\
      \ L1I to the L2 cache over the  :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`."
    unit: Percent
  Req:
    rst: The total number of requests made to the L1I, per :ref:`normalization-unit <normalization-units>`.
    unit: Requests per normalization unit
  Hits:
    rst: The total number of L1I requests that hit on a previously loaded cache  line,
      per :ref:`normalization-unit <normalization-units>`.
    unit: Requests per normalization unit
  Misses - Non Duplicated:
    rst: The total number of L1I requests that missed on a cache line that  *was
      not* already pending due to another request, per  :ref:`normalization-unit <normalization-units>`.
      See note in  :ref:`desc-l1i-sol` for more detail.
    unit: Requests per normalization unit
  Misses - Duplicated:
    rst: The total number of L1I requests that missed on a cache line that *was*  already
      pending due to another request, per  :ref:`normalization-unit <normalization-units>`.
      See note in  :ref:`desc-l1i-sol` for more detail.
    unit: Requests per normalization unit
  Instruction Fetch Latency:
    rst: The average number of cycles spent to fetch instructions to a  :doc:`CU <compute-unit>`.
    unit: Cycles
Workgroup manager utilizations:
  Accelerator Utilization:
    rst: The percent of cycles in the kernel where the accelerator was actively doing
      any work.
    unit: Percent
  Scheduler-Pipe Utilization:
    rst: 'The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in  the
      kernel where the scheduler-pipes were actively doing any work. Note:  this value
      is expected to range between 0% and 25%. See :ref:`desc-spi`.'
    unit: Percent
  Workgroup Manager Utilization:
    rst: The percent of cycles in the kernel where the workgroup manager was actively
      doing any work.
    unit: Percent
  Shader Engine Utilization:
    rst: The percent of :ref:`total shader engine cycles <total-se-cycles>` in the  kernel
      where any CU in a shader-engine was actively doing any work,  normalized over
      all shader-engines. Low values (e.g., << 100%) indicate  that the accelerator
      was not fully saturated by the kernel, or a  potential load-imbalance issue.
    unit: Percent
  SIMD Utilization:
    rst: The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel  where
      any :ref:`SIMD <desc-valu>` on a CU was actively doing any work,  summed over
      all CUs. Low values (less than 100%) indicate that the  accelerator was not
      fully saturated by the kernel, or a potential  load-imbalance issue.
    unit: Percent
  Dispatched Workgroups:
    rst: The total number of workgroups forming this kernel launch.
    unit: Workgroups
  Dispatched Wavefronts:
    rst: The total number of wavefronts, summed over all workgroups, forming this
      kernel launch.
    unit: Wavefronts
  VGPR Writes:
    rst: The average number of cycles spent initializing :ref:`VGPRs <desc-valu>`  at
      wave creation.
    unit: Cycles/wave
  SGPR Writes:
    rst: The average number of cycles spent initializing :ref:`SGPRs <desc-salu>`  at
      wave creation.
    unit: Cycles/wave
  Not-scheduled Rate (Workgroup Manager):
    rst: 'The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in  the
      kernel where a workgroup could not be scheduled to a  :doc:`CU <compute-unit>`
      due to a bottleneck within the workgroup manager  rather than a lack of a CU
      or :ref:`SIMD <desc-valu>` with sufficient  resources. Note: this value is expected
      to range between 0% and 25%. See the note in the :ref:`workgroup manager <desc-spi>` description.'
    unit: Percent
  Not-scheduled Rate (Scheduler-Pipe):
    rst: 'The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in  the
      kernel where a workgroup could not be scheduled to a  :doc:`CU <compute-unit>`
      due to a bottleneck within the scheduler-pipes  rather than a lack of a CU or
      :ref:`SIMD <desc-valu>` with sufficient  resources. Note: this value is expected
      to range between 0% and 25%. See the note in the :ref:`workgroup manager <desc-spi>` description.'
    unit: Percent
  Scheduler-Pipe Stall Rate:
    rst: 'The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in  the
      kernel where a workgroup could not be scheduled to a  :doc:`CU <compute-unit>`
      due to occupancy limitations (like a lack of a  CU or :ref:`SIMD <desc-valu>`
      with sufficient resources). Note: this value is expected to range between 0% and 25%.
      See the note in the :ref:`workgroup manager <desc-spi>` description.'
    unit: Percent
  Scratch Stall Rate:
    rst: The percent of :ref:`total shader-engine cycles <total-se-cycles>` in the  kernel
      where a workgroup could not be scheduled to a  :doc:`CU <compute-unit>` due
      to lack of  :ref:`private (a.k.a., scratch) memory <memory-type>` slots. While
      this  can reach up to 100%, note that the actual occupancy limitations on a  kernel
      using private memory are typically quite small (for example, less  than 1% of
      the total number of waves that can be scheduled to an  accelerator).
    unit: Percent
  Insufficient SIMD Waveslots:
    rst: The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel  where
      a workgroup could not be scheduled to a  :ref:`SIMD <desc-valu>`  due to lack
      of available :ref:`waveslots <desc-valu>`.
    unit: Percent
  Insufficient SIMD VGPRs:
    rst: The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel  where
      a workgroup could not be scheduled to a  :ref:`SIMD <desc-valu>`  due to lack
      of available :ref:`VGPRs <desc-valu>`.
    unit: Percent
  Insufficient SIMD SGPRs:
    rst: The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel  where
      a workgroup could not be scheduled to a :ref:`SIMD <desc-valu>`  due to lack
      of available :ref:`SGPRs <desc-salu>`.
    unit: Percent
  Insufficient CU LDS:
    rst: The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel  where
      a workgroup could not be scheduled to a :doc:`CU <compute-unit>`  due to lack
      of available :doc:`LDS <local-data-share>`.
    unit: Percent
  Insufficient CU Barriers:
    rst: The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel  where
      a workgroup could not be scheduled to a :doc:`CU <compute-unit>`  due to lack
      of available :ref:`barriers <desc-barrier>`.
    unit: Percent
  Reached CU Workgroup Limit:
    rst: The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel  where
      a workgroup could not be scheduled to a :doc:`CU <compute-unit>`  due to limits
      within the workgroup manager. This is expected to always be zero on CDNA2
      or newer accelerators (and small for previous  accelerators).
    unit: Percent
  Reached CU Wavefront Limit:
    rst: The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel  where
      a wavefront could not be scheduled to a :doc:`CU <compute-unit>`  due to limits
      within the workgroup manager. This is expected to always be zero on CDNA2
      or newer accelerators (and small for previous  accelerators).
    unit: Percent
Workgroup Manager - Resource Allocation:
  Accelerator Utilization:
    rst: The percent of cycles in the kernel where the accelerator was actively doing
      any work.
    unit: Percent
  Scheduler-Pipe Utilization:
    rst: 'The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in  the
      kernel where the scheduler-pipes were actively doing any work. Note:  this value
      is expected to range between 0% and 25%. See :ref:`desc-spi`.'
    unit: Percent
  Workgroup Manager Utilization:
    rst: The percent of cycles in the kernel where the workgroup manager was actively
      doing any work.
    unit: Percent
  Shader Engine Utilization:
    rst: The percent of :ref:`total shader engine cycles <total-se-cycles>` in the  kernel
      where any CU in a shader-engine was actively doing any work,  normalized over
      all shader-engines. Low values (e.g., << 100%) indicate  that the accelerator
      was not fully saturated by the kernel, or a  potential load-imbalance issue.
    unit: Percent
  SIMD Utilization:
    rst: The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel  where
      any :ref:`SIMD <desc-valu>` on a CU was actively doing any work,  summed over
      all CUs. Low values (less than 100%) indicate that the  accelerator was not
      fully saturated by the kernel, or a potential  load-imbalance issue.
    unit: Percent
  Dispatched Workgroups:
    rst: The total number of workgroups forming this kernel launch.
    unit: Workgroups
  Dispatched Wavefronts:
    rst: The total number of wavefronts, summed over all workgroups, forming this
      kernel launch.
    unit: Wavefronts
  VGPR Writes:
    rst: The average number of cycles spent initializing :ref:`VGPRs <desc-valu>`  at
      wave creation.
    unit: Cycles/wave
  SGPR Writes:
    rst: The average number of cycles spent initializing :ref:`SGPRs <desc-salu>`  at
      wave creation.
    unit: Cycles/wave
  Not-scheduled Rate (Workgroup Manager):
    rst: 'The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in  the
      kernel where a workgroup could not be scheduled to a  :doc:`CU <compute-unit>`
      due to a bottleneck within the workgroup manager  rather than a lack of a CU
      or :ref:`SIMD <desc-valu>` with sufficient  resources. Note: this value is expected
      to range between 0% and 25%. See the note in the :ref:`workgroup manager <desc-spi>` description.'
    unit: Percent
  Not-scheduled Rate (Scheduler-Pipe):
    rst: 'The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in  the
      kernel where a workgroup could not be scheduled to a  :doc:`CU <compute-unit>`
      due to a bottleneck within the scheduler-pipes  rather than a lack of a CU or
      :ref:`SIMD <desc-valu>` with sufficient  resources. Note: this value is expected
      to range between 0% and 25%. See the note in the :ref:`workgroup manager <desc-spi>` description.'
    unit: Percent
  Scheduler-Pipe Stall Rate:
    rst: 'The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in  the
      kernel where a workgroup could not be scheduled to a  :doc:`CU <compute-unit>`
      due to occupancy limitations (like a lack of a  CU or :ref:`SIMD <desc-valu>`
      with sufficient resources). Note: this value is expected to range between 0% and 25%.
      See the note in the :ref:`workgroup manager <desc-spi>` description.'
    unit: Percent
  Scratch Stall Rate:
    rst: The percent of :ref:`total shader-engine cycles <total-se-cycles>` in the  kernel
      where a workgroup could not be scheduled to a  :doc:`CU <compute-unit>` due
      to lack of  :ref:`private (a.k.a., scratch) memory <memory-type>` slots. While
      this  can reach up to 100%, note that the actual occupancy limitations on a  kernel
      using private memory are typically quite small (for example, less  than 1% of
      the total number of waves that can be scheduled to an  accelerator).
    unit: Percent
  Insufficient SIMD Waveslots:
    rst: The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel  where
      a workgroup could not be scheduled to a  :ref:`SIMD <desc-valu>`  due to lack
      of available :ref:`waveslots <desc-valu>`.
    unit: Percent
  Insufficient SIMD VGPRs:
    rst: The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel  where
      a workgroup could not be scheduled to a  :ref:`SIMD <desc-valu>`  due to lack
      of available :ref:`VGPRs <desc-valu>`.
    unit: Percent
  Insufficient SIMD SGPRs:
    rst: The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel  where
      a workgroup could not be scheduled to a :ref:`SIMD <desc-valu>`  due to lack
      of available :ref:`SGPRs <desc-salu>`.
    unit: Percent
  Insufficient CU LDS:
    rst: The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel  where
      a workgroup could not be scheduled to a :doc:`CU <compute-unit>`  due to lack
      of available :doc:`LDS <local-data-share>`.
    unit: Percent
  Insufficient CU Barriers:
    rst: The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel  where
      a workgroup could not be scheduled to a :doc:`CU <compute-unit>`  due to lack
      of available :ref:`barriers <desc-barrier>`.
    unit: Percent
  Reached CU Workgroup Limit:
    rst: The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel  where
      a workgroup could not be scheduled to a :doc:`CU <compute-unit>`  due to limits
      within the workgroup manager. This is expected to always be zero on CDNA2
      or newer accelerators (and small for previous  accelerators).
    unit: Percent
  Reached CU Wavefront Limit:
    rst: The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel  where
      a wavefront could not be scheduled to a :doc:`CU <compute-unit>`  due to limits
      within the workgroup manager. This is expected to always be zero on CDNA2
      or newer accelerators (and small for previous  accelerators).
    unit: Percent
Command processor fetcher (CPF):
  CPF Utilization:
    rst: Percent of total cycles where the CPF was busy actively doing any work. The
      ratio of CPF busy cycles over total cycles counted by the CPF.
    unit: Percent
  CPF Stall:
    rst: Percent of CPF busy cycles where the CPF was stalled for any reason.
    unit: Percent
  CPF-L2 Utilization:
    rst: Percent of total cycles counted by the CPF-:doc:`L2 <l2-cache>` interface  where
      the CPF-L2 interface was actively doing any work. The ratio of CPF-L2  busy cycles
      over total cycles counted by the CPF-L2.
    unit: Percent
  CPF-L2 Stall:
    rst: Percent of CPF-:doc:`L2 <l2-cache>` busy cycles where the CPF-L2 interface
      was stalled for any reason.
    unit: Percent
  CPF-UTCL1 Stall:
    rst: Percent of CPF busy cycles where the CPF was stalled by address translation.
    unit: Percent
  CPC Utilization:
    rst: Percent of total cycles where the CPC was busy actively doing any work. The
      ratio of CPC busy cycles over total cycles counted by the CPC.
    unit: Percent
  CPC Stall Rate:
    rst: Percent of CPC busy cycles where the CPC was stalled for any reason.
    unit: Percent
  CPC Packet Decoding Utilization:
    rst: Percent of CPC busy cycles spent decoding commands for processing.
    unit: Percent
  CPC-Workgroup Manager Utilization:
    rst: Percent of CPC busy cycles spent dispatching workgroups to the  :ref:`workgroup
      manager <desc-spi>`.
    unit: Percent
  CPC-L2 Utilization:
    rst: Percent of total cycles counted by the CPC-:doc:`L2 <l2-cache>` interface  where
      the CPC-L2 interface was actively doing any work.
    unit: Percent
  CPC-UTCL1 Stall:
    rst: Percent of CPC busy cycles where the CPC was stalled by address translation.
    unit: Percent
  CPC-UTCL2 Utilization:
    rst: Percent of total cycles counted by the CPC's :doc:`L2 <l2-cache>` address  translation
      interface where the CPC was busy doing address translation  work.
    unit: Percent
Command processor packet processor (CPC):
  CPF Utilization:
    rst: Percent of total cycles where the CPF was busy actively doing any work. The
      ratio of CPF busy cycles over total cycles counted by the CPF.
    unit: Percent
  CPF Stall:
    rst: Percent of CPF busy cycles where the CPF was stalled for any reason.
    unit: Percent
  CPF-L2 Utilization:
    rst: Percent of total cycles counted by the CPF-:doc:`L2 <l2-cache>` interface  where
      the CPF-L2 interface was actively doing any work. The ratio of CPF-L2  busy cycles
      over total cycles counted by the CPF-L2.
    unit: Percent
  CPF-L2 Stall:
    rst: Percent of CPF-:doc:`L2 <l2-cache>` busy cycles where the CPF-L2 interface
      was stalled for any reason.
    unit: Percent
  CPF-UTCL1 Stall:
    rst: Percent of CPF busy cycles where the CPF was stalled by address translation.
    unit: Percent
  CPC Utilization:
    rst: Percent of total cycles where the CPC was busy actively doing any work. The
      ratio of CPC busy cycles over total cycles counted by the CPC.
    unit: Percent
  CPC Stall Rate:
    rst: Percent of CPC busy cycles where the CPC was stalled for any reason.
    unit: Percent
  CPC Packet Decoding Utilization:
    rst: Percent of CPC busy cycles spent decoding commands for processing.
    unit: Percent
  CPC-Workgroup Manager Utilization:
    rst: Percent of CPC busy cycles spent dispatching workgroups to the  :ref:`workgroup
      manager <desc-spi>`.
    unit: Percent
  CPC-L2 Utilization:
    rst: Percent of total cycles counted by the CPC-:doc:`L2 <l2-cache>` interface  where
      the CPC-L2 interface was actively doing any work.
    unit: Percent
  CPC-UTCL1 Stall:
    rst: Percent of CPC busy cycles where the CPC was stalled by address translation.
    unit: Percent
  CPC-UTCL2 Utilization:
    rst: Percent of total cycles counted by the CPC's :doc:`L2 <l2-cache>` address  translation
      interface where the CPC was busy doing address translation  work.
    unit: Percent
System Speed-of-Light:
  VALU FLOPs:
    rst: 'The total floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is also presented as a percent of the peak theoretical FLOPs
      achievable on the specific accelerator. Note: this does not include any floating-point
      operations from :ref:`MFMA <desc-mfma>` instructions.'
    unit: GFLOPs
  VALU IOPs:
    rst: 'The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
      This is also presented as a percent of the peak theoretical IOPs achievable
      on the specific accelerator. Note: this does not include any integer operations
      from :ref:`MFMA <desc-mfma>` instructions.'
    unit: GIOPs
  MFMA FLOPs (F8):
    rst: 'The total number of 8-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 8-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
      as a percent of the peak theoretical F8 MFMA operations achievable on the specific
      accelerator. Supported only on AMD Instinct MI300 series and later accelerators.'
    unit: GFLOPs
  MFMA FLOPs (BF16):
    rst: 'The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
      operations executed per second. Note: this does not include any 16-bit brain
      floating point operations from :ref:`VALU <desc-valu>` instructions. This is
      also presented as a percent of the peak theoretical BF16 MFMA operations achievable
      on the specific accelerator.'
    unit: GFLOPs
  MFMA FLOPs (F16):
    rst: 'The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit floating point operations
      from :ref:`VALU <desc-valu>` instructions. This is also presented as a percent
      of the peak theoretical F16 MFMA operations achievable on the specific accelerator.'
    unit: GFLOPs
  MFMA FLOPs (F32):
    rst: 'The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 32-bit floating point operations
      from :ref:`VALU <desc-valu>` instructions. This is also presented as a percent
      of the peak theoretical F32 MFMA operations achievable on the specific accelerator.'
    unit: GFLOPs
  MFMA FLOPs (F64):
    rst: 'The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 64-bit floating point operations
      from :ref:`VALU <desc-valu>` instructions. This is also presented as a percent
      of the peak theoretical F64 MFMA operations achievable on the specific accelerator.'
    unit: GFLOPs
  MFMA IOPs (Int8):
    rst: 'The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
      per second. Note: this does not include any 8-bit integer operations from :ref:`VALU
      <desc-valu>` instructions. This is also presented as a percent of the peak theoretical
      INT8 MFMA operations achievable on the specific accelerator.'
    unit: GIOPs
  Active CUs:
    rst: Total number of active compute units (CUs) on the accelerator during the
      kernel execution.
    unit: Number
  SALU Utilization:
    rst: Indicates what percent of the kernel's duration the :ref:`SALU <desc-salu>`
      was busy executing instructions. Computed as the ratio of the total number of
      cycles spent by the :ref:`scheduler <desc-scheduler>` issuing SALU / :ref:`SMEM
      <desc-smem>` instructions over the :ref:`total CU cycles <total-cu-cycles>`.
    unit: Percent
  VALU Utilization:
    rst: Indicates what percent of the kernel's duration the :ref:`VALU <desc-valu>`
      was busy executing instructions. Does not include :ref:`VMEM <desc-vmem>` operations.
      Computed as the ratio of the total number of cycles spent by the :ref:`scheduler
      <desc-scheduler>` issuing VALU instructions over the :ref:`total CU cycles <total-cu-cycles>`.
    unit: Percent
  MFMA Utilization:
    rst: Indicates what percent of the kernel's duration the :ref:`MFMA <desc-mfma>`
      unit was busy executing instructions. Computed as the ratio of the total number
      of cycles the :ref:`MFMA <desc-mfma>` unit was busy over the :ref:`total
      CU cycles <total-cu-cycles>`.
    unit: Percent
  VMEM Utilization:
    rst: Indicates what percent of the kernel's duration the :ref:`VMEM <desc-vmem>`
      unit was busy executing instructions, including both global/generic and spill/scratch
      operations (see the :ref:`VMEM instruction count metrics <ta-instruction-counts>`
      for more detail).  Does not include :ref:`VALU <desc-valu>` operations. Computed
      as the ratio of the total number of cycles spent by the :ref:`scheduler <desc-scheduler>`
      issuing VMEM instructions over the :ref:`total CU cycles <total-cu-cycles>`.
    unit: Percent
  Branch Utilization:
    rst: Indicates what percent of the kernel's duration the :ref:`branch <desc-branch>`
      unit was busy executing instructions. Computed as the ratio of the total number
      of cycles spent by the :ref:`scheduler <desc-scheduler>` issuing branch instructions
      over the :ref:`total CU cycles <total-cu-cycles>`.
    unit: Percent
  VALU Active Threads:
    rst: Indicates the average level of :ref:`divergence <desc-divergence>` within
      a wavefront over the lifetime of the kernel. The number of work-items that were
      active in a wavefront during execution of each :ref:`VALU <desc-valu>` instruction,
      time-averaged over all VALU instructions run on all wavefronts in the kernel.
    unit: Work-items
  IPC:
    rst: The ratio of the total number of instructions executed on the :doc:`CU <compute-unit>`
      over the :ref:`total active CU cycles <total-active-cu-cycles>`.
    unit: Instructions per-cycle
  Wavefront Occupancy:
    rst: 'The time-averaged number of wavefronts resident on the accelerator over
      the lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1 ms). This is also presented as a percent of the peak theoretical
      occupancy achievable on the specific accelerator.'
    unit: Wavefronts
  Theoretical LDS Bandwidth:
    rst: Indicates the maximum number of bytes that could have been loaded from, stored
      to, or atomically updated in the LDS per unit time (see :ref:`LDS Bandwidth
      <lds-bandwidth>` example for more detail). This is also presented as a percent
      of the peak theoretical bandwidth achievable on the specific accelerator.
    unit: GB/s
  LDS Bank Conflicts/Access:
    rst: The ratio of the number of cycles spent in the  :doc:`LDS scheduler <local-data-share>`
      due to bank conflicts (as  determined by the conflict resolution hardware) to
      the base number of  cycles that would be spent in the LDS scheduler in a completely  uncontended
      case. This is also presented in normalized form (i.e., the  Bank Conflict Rate).
    unit: Conflicts/Access
  vL1D Cache Hit Rate:
    rst: The ratio of the number of vL1D cache line requests that hit in vL1D  cache
      over the total number of cache line requests to the  :ref:`vL1D cache RAM <desc-tc>`.
    unit: Percent
  vL1D Cache BW:
    rst: The number of bytes looked up in the vL1D cache as a result of  :ref:`VMEM
      <desc-vmem>` instructions per unit time. The number of bytes  is calculated
      as the number of cache lines requested multiplied by the  cache line size. This
      value does not consider partial requests, so e.g.,  if only a single value is
      requested in a cache line, the data movement  will still be counted as a full
      cache line. This is also presented as a  percent of the peak theoretical bandwidth
      achievable on the specific  accelerator.
    unit: GB/s
  L2 Cache Hit Rate:
    rst: The ratio of the number of L2 cache line requests that hit in the L2  cache
      over the total number of incoming cache line requests to the L2  cache.
    unit: Percent
  L2 Cache BW:
    rst: The number of bytes looked up in the L2 cache per unit time.  The number  of
      bytes is calculated as the number of cache lines requested multiplied  by the
      cache line size. This value does not consider partial requests, so  e.g., if
      only a single value is requested in a cache line, the data  movement will still
      be counted as a full cache line. This is also  presented as a percent of the
      peak theoretical bandwidth achievable on  the specific accelerator.
    unit: GB/s
  L2-Fabric Read BW:
    rst: "The number of bytes read by the L2 over the  :ref:`Infinity Fabric\u2122\
      \ interface <l2-fabric>` per unit time. This is also  presented as a percent\
      \ of the peak theoretical bandwidth achievable on  the specific accelerator."
    unit: GB/s
  L2-Fabric Write BW:
    rst: The number of bytes sent by the L2 over the  :ref:`Infinity Fabric interface
      <l2-fabric>` by write and atomic  operations per unit time. This is also presented
      as a percent of the peak  theoretical bandwidth achievable on the specific accelerator.
    unit: GB/s
  L2-Fabric Read Latency:
    rst: The time-averaged number of cycles read requests spent in Infinity Fabric  before
      data was returned to the L2.
    unit: Cycles
  L2-Fabric Write Latency:
    rst: The time-averaged number of cycles write requests spent in Infinity  Fabric
      before a completion acknowledgement was returned to the L2.
    unit: Cycles
  sL1D Cache Hit Rate:
    rst: The percent of sL1D requests that hit on a previously loaded line in the  cache.
      Calculated as the ratio of the number of sL1D requests that hit  over the number
      of all sL1D requests.
    unit: Percent
  sL1D Cache BW:
    rst: The number of bytes looked up in the sL1D cache per unit time. This is  also
      presented as a percent of the peak theoretical bandwidth achievable  on the
      specific accelerator.
    unit: GB/s
  L1I Hit Rate:
    rst: The percent of L1I requests that hit on a previously loaded line in the  cache.
      Calculated as the ratio of the number of L1I requests that hit  over the number
      of all L1I requests.
    unit: Percent
  L1I BW:
    rst: The number of bytes looked up in the L1I cache per unit time. This is  also
      presented as a percent of the peak theoretical bandwidth achievable  on the
      specific accelerator.
    unit: GB/s
  L1I Fetch Latency:
    rst: The average number of cycles spent to fetch instructions to a  :doc:`CU <compute-unit>`.
    unit: Cycles