# rocm-systems/projects/rocprofiler-compute/docs/data/metrics_description.yaml
Wavefront launch stats:
AGPRs:
rst: >-
The number of accumulation vector general-purpose registers allocated
for the kernel, see :ref:`AGPRs <desc-agprs>`. Note: this may not exactly match
the number of AGPRs requested by the compiler due to allocation granularity.
unit: AGPRs
Grid Size:
rst: The total number of work-items (or, threads) launched as a part of the kernel
dispatch. In HIP, this is equivalent to the total grid size multiplied by the
total workgroup (or, block) size.
unit: Work-Items
LDS Allocation:
rst: >-
The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
allocated for this kernel. Note: This may also be larger than what was requested
at compile time due to both allocation granularity and dynamic per-dispatch
LDS allocations.
unit: Bytes per workgroup
Restored Wavefronts:
rst: The total number of wavefronts restored from a context-save. See `cwsr_enable
<https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
unit: Wavefronts
SGPRs:
rst: >-
The number of scalar general-purpose registers allocated for the kernel, see
:ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
SGPRs requested by the compiler due to allocation granularity.
unit: SGPRs
Saved Wavefronts:
rst: The total number of wavefronts saved at a context-save. See `cwsr_enable
<https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
unit: Wavefronts
Scratch Allocation:
rst: The number of bytes of :ref:`scratch memory <memory-spaces>` requested per
work-item for this kernel. Scratch memory is used for stack memory on the accelerator,
as well as for register spills and restores.
unit: Bytes per work-item
Total Wavefronts:
rst: >-
The total number of wavefronts launched as part of the kernel dispatch.
On AMD Instinct™ CDNA™ accelerators and GCN™ GPUs, the wavefront
size is always 64 work-items. Thus, the total number of wavefronts should
be equivalent to the ceiling of grid size divided by 64.
unit: Wavefronts
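As a rough illustration of the relationship described for Total Wavefronts, the sketch below computes the ceiling of the grid size divided by the wavefront size. It assumes the fixed 64-work-item wavefront size stated in the description; the function name `expected_wavefronts` is hypothetical and is not part of rocprofiler-compute. The result is exact only when the workgroup size is a multiple of 64, since a wavefront does not span workgroups.

```python
WAVEFRONT_SIZE = 64  # work-items per wavefront, as stated in the description


def expected_wavefronts(grid_size: int) -> int:
    """Return ceil(grid_size / WAVEFRONT_SIZE) using integer arithmetic.

    grid_size is the total number of work-items in the dispatch
    (the "Grid Size" metric). Hypothetical helper for illustration only.
    """
    if grid_size < 0:
        raise ValueError("grid_size must be non-negative")
    return (grid_size + WAVEFRONT_SIZE - 1) // WAVEFRONT_SIZE
```

For example, a dispatch of 1000 work-items would be expected to launch 16 wavefronts under this formula, the last of which is only partially populated.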
VGPRs:
rst: >-
The number of architected vector general-purpose registers allocated for the
kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
number of VGPRs requested by the compiler due to allocation granularity.
unit: VGPRs
Workgroup Size:
rst: The total number of work-items (or, threads) in each workgroup (or, block)
launched as part of the kernel dispatch. In HIP, this is equivalent to the total
block size.
unit: Work-Items
Wavefront runtime stats:
Active Cycles:
rst: The average number of cycles a wavefront in the kernel dispatch was actively
executing instructions per :ref:`normalization unit <normalization-units>`.
This measurement is made on a per-wavefront basis, and may include cycles that
another wavefront spent actively executing (on another execution unit, for example)
or was stalled. As such, it is most useful to get a sense of how waves were
spending their time, rather than identification of a precise limiter. The sum
of this metric, Issue Wait Cycles and Dependency Wait Cycles should be equal to
the total Wave Cycles metric.
unit: Cycles per normalization unit
Dependency Wait Cycles:
rst: The number of cycles a wavefront in the kernel dispatch stalled waiting on
memory of any kind (e.g., instruction fetch, vector or scalar memory, etc.)
per :ref:`normalization unit <normalization-units>`. This counter is incremented
at every cycle by *all* wavefronts on a CU stalled at a memory operation. As
such, it is most useful to get a sense of how waves were spending their time,
rather than identification of a precise limiter because another wave could be
actively executing while a wave is stalled. The sum of this metric, Issue Wait
Cycles and Active Cycles should be equal to the total Wave Cycles metric.
unit: Cycles per normalization unit
Instructions per wavefront:
rst: The average number of instructions (of all types) executed per wavefront.
This is averaged over all wavefronts in a kernel dispatch.
unit: Instructions per wavefront
Issue Wait Cycles:
rst: The number of cycles a wavefront in the kernel dispatch was unable to issue
an instruction for any reason (e.g., execution pipe back-pressure, arbitration
loss, etc.) per :ref:`normalization unit <normalization-units>`. This counter
is incremented at every cycle by *all* wavefronts on a CU unable to issue an
instruction. As such, it is most useful to get a sense of how waves were spending
their time, rather than identification of a precise limiter because another
wave could be actively executing while a wave is issue stalled. The sum of this
metric, Dependency Wait Cycles and Active Cycles should be equal to the total
Wave Cycles metric.
unit: Cycles per normalization unit
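The descriptions above state that Active Cycles, Dependency Wait Cycles, and Issue Wait Cycles should sum to Wave Cycles. A minimal sketch of how that relationship could be used to express each component as a fraction of wave residency follows. The function and key names are hypothetical, and the inputs are assumed to share the same normalization unit.

```python
def wave_cycle_breakdown(active: float, dependency_wait: float, issue_wait: float) -> dict:
    """Express each cycle category as a fraction of their sum.

    Per the metric descriptions, the sum of the three inputs should equal
    the Wave Cycles metric. Hypothetical helper for illustration only.
    """
    total = active + dependency_wait + issue_wait
    if total == 0:
        # No resident cycles recorded: report an all-zero breakdown.
        return {"active": 0.0, "dependency_wait": 0.0, "issue_wait": 0.0}
    return {
        "active": active / total,
        "dependency_wait": dependency_wait / total,
        "issue_wait": issue_wait / total,
    }
```

Because these counters are incremented by all wavefronts on a CU, the fractions indicate how waves spent their time rather than identifying a precise limiter.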
Kernel Time:
rst: The total duration of the executed kernel.
unit: Nanoseconds
Kernel Time (Cycles):
rst: The total duration of the executed kernel in cycles.
unit: Cycles
Wave Cycles:
rst: >-
The number of cycles a wavefront in the kernel dispatch spent resident
on a compute unit per :ref:`normalization unit <normalization-units>`. This is
averaged over all wavefronts in a kernel dispatch. Note: this should not
be directly compared to the Kernel Time (Cycles) metric.
unit: Cycles per normalization unit
Wavefront Occupancy:
rst: >-
The time-averaged number of wavefronts resident on the accelerator over the
lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1 ms).
unit: Wavefronts
Overall instruction mix:
Branch:
rst: The total number of branch operations issued. These typically consist of
jump or branch operations and are used to implement control flow.
unit: Instructions
LDS:
rst: The total number of LDS (also known as shared memory) operations issued.
These include loads, stores, atomics, and HIP's ``__shfl`` operations.
unit: Instructions
MFMA:
rst: The total number of matrix fused multiply-add instructions issued.
unit: Instructions
SALU:
rst: The total number of scalar arithmetic logic unit (SALU) operations issued.
Typically these are used for address calculations, literal constants, and other
operations that are provably uniform across a wavefront. Although scalar memory
(SMEM) operations are issued by the SALU, they are counted separately in this
section.
unit: Instructions
SMEM:
rst: The total number of scalar memory (SMEM) operations issued. These are typically
used for loading kernel arguments, base-pointers and loads from HIP's ``__constant__``
memory.
unit: Instructions
VALU:
rst: The total number of vector arithmetic logic unit (VALU) operations issued.
These are the workhorses of the :doc:`compute unit <compute-unit>`, and are
used to execute a wide range of instruction types including floating point operations,
non-uniform address calculations, transcendental operations, integer operations,
shifts, conditional evaluation, etc.
unit: Instructions
VMEM:
rst: The total number of vector memory operations issued. These include most loads,
stores and atomic operations and all accesses to :ref:`generic, global, private
and texture <memory-spaces>` memory.
unit: Instructions
VALU arithmetic instruction mix:
Conversion:
rst: >-
The total number of type conversion instructions (such as converting data
between F32 and F64) issued to the VALU per :ref:`normalization unit
<normalization-units>`.
unit: Instructions per normalization unit
F16-ADD:
rst: The total number of addition instructions operating on 16-bit floating-point
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
unit: Instructions per normalization unit
F16-FMA:
rst: The total number of fused multiply-add instructions operating on 16-bit floating-point
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
unit: Instructions per normalization unit
F16-MUL:
rst: The total number of multiplication instructions operating on 16-bit floating-point
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
unit: Instructions per normalization unit
F16-Trans:
rst: The total number of transcendental instructions (such as ``sqrt``) operating
on 16-bit floating-point operands issued to the VALU per :ref:`normalization
unit <normalization-units>`.
unit: Instructions per normalization unit
F32-ADD:
rst: The total number of addition instructions operating on 32-bit floating-point
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
unit: Instructions per normalization unit
F32-FMA:
rst: The total number of fused multiply-add instructions operating on 32-bit floating-point
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
unit: Instructions per normalization unit
F32-MUL:
rst: The total number of multiplication instructions operating on 32-bit floating-point
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
unit: Instructions per normalization unit
F32-Trans:
rst: The total number of transcendental instructions (such as ``sqrt``) operating
on 32-bit floating-point operands issued to the VALU per :ref:`normalization
unit <normalization-units>`.
unit: Instructions per normalization unit
F64-ADD:
rst: The total number of addition instructions operating on 64-bit floating-point
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
unit: Instructions per normalization unit
F64-FMA:
rst: The total number of fused multiply-add instructions operating on 64-bit floating-point
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
unit: Instructions per normalization unit
F64-MUL:
rst: The total number of multiplication instructions operating on 64-bit floating-point
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
unit: Instructions per normalization unit
F64-Trans:
rst: The total number of transcendental instructions (such as ``sqrt``) operating
on 64-bit floating-point operands issued to the VALU per :ref:`normalization
unit <normalization-units>`.
unit: Instructions per normalization unit
INT32:
rst: The total number of instructions operating on 32-bit integer operands issued
to the VALU per :ref:`normalization unit <normalization-units>`.
unit: Instructions per normalization unit
INT64:
rst: The total number of instructions operating on 64-bit integer operands issued
to the VALU per :ref:`normalization unit <normalization-units>`.
unit: Instructions per normalization unit
MFMA instruction mix:
MFMA-BF16:
rst: The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>` instructions
issued per :ref:`normalization unit <normalization-units>`.
unit: Instructions per normalization unit
MFMA-F16:
rst: The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` instructions
issued per :ref:`normalization unit <normalization-units>`.
unit: Instructions per normalization unit
MFMA-F32:
rst: The total number of 32-bit floating-point :ref:`MFMA <desc-mfma>` instructions
issued per :ref:`normalization unit <normalization-units>`.
unit: Instructions per normalization unit
MFMA-F64:
rst: The total number of 64-bit floating-point :ref:`MFMA <desc-mfma>` instructions
issued per :ref:`normalization unit <normalization-units>`.
unit: Instructions per normalization unit
MFMA-F8:
rst: The total number of 8-bit floating point :ref:`MFMA <desc-mfma>` instructions
issued per :ref:`normalization unit <normalization-units>`. This is supported
in AMD Instinct MI300 series and later only.
unit: Instructions per normalization unit
MFMA-I8:
rst: The total number of 8-bit integer :ref:`MFMA <desc-mfma>` instructions issued
per :ref:`normalization unit <normalization-units>`.
unit: Instructions per normalization unit
Compute Speed-of-Light:
MFMA FLOPs (BF16):
rst: >-
The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit brain floating
point operations from :ref:`VALU <desc-valu>` instructions. This is also
presented as a percent of the peak theoretical BF16 MFMA operations achievable
on the specific accelerator.
unit: GFLOPs
MFMA FLOPs (F16):
rst: >-
The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
as a percent of the peak theoretical F16 MFMA operations achievable on the
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F32):
rst: >-
The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 32-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
as a percent of the peak theoretical F32 MFMA operations achievable on the
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F64):
rst: >-
The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 64-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
as a percent of the peak theoretical F64 MFMA operations achievable on the
specific accelerator.
unit: GFLOPs
MFMA IOPs (INT8):
rst: >-
The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
per second. Note: this does not include any 8-bit integer operations from
:ref:`VALU <desc-valu>` instructions. This is also presented as a percent
of the peak theoretical INT8 MFMA operations achievable on the specific accelerator.
unit: GIOPs
VALU FLOPs:
rst: >-
The total floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is also presented as a percent of the peak theoretical
FLOPs achievable on the specific accelerator. Note: this does not include
any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU IOPs:
rst: >-
The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
This is also presented as a percent of the peak theoretical IOPs achievable
on the specific accelerator. Note: this does not include any integer operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GIOPs
Pipeline statistics:
Branch Utilization:
rst: Indicates what percent of the kernel's duration the :ref:`branch <desc-branch>`
unit was busy executing instructions. Computed as the ratio of the total number
of cycles spent by the :ref:`scheduler <desc-scheduler>` issuing branch instructions
over the :ref:`total CU cycles <total-cu-cycles>`.
unit: Percent
IPC:
rst: The ratio of the total number of instructions executed on the :doc:`CU <compute-unit>`
over the :ref:`total active CU cycles <total-active-cu-cycles>`.
unit: Instructions per cycle
IPC (Issued):
rst: The ratio of the total number of (non-:ref:`internal <ipc-internal-instructions>`)
instructions issued over the number of cycles where the :ref:`scheduler <desc-scheduler>`
was actively working on issuing instructions. Refer to the :ref:`Issued IPC
<issued-ipc>` example for further detail.
unit: Instructions per cycle
MFMA Instruction Cycles:
rst: The average duration of :ref:`MFMA <desc-mfma>` instructions in this kernel
in cycles. Computed as the ratio of the total number of cycles the MFMA unit
was busy over the total number of MFMA instructions. Compare to, for example,
the `AMD Matrix Instruction Calculator <https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator>`_.
unit: Cycles per instruction
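The ratio described for MFMA Instruction Cycles can be sketched as a simple division of busy cycles by instruction count. The function name and parameter names below are hypothetical; they stand in for whatever raw counter values the profiler collects and do not refer to real rocprofiler-compute identifiers.

```python
def mfma_cycles_per_instruction(mfma_busy_cycles: int, mfma_instructions: int) -> float:
    """Average MFMA instruction duration in cycles.

    Computed as busy cycles over instruction count, following the metric
    description. Hypothetical helper for illustration only.
    """
    if mfma_instructions == 0:
        # A kernel with no MFMA instructions has no meaningful average.
        return 0.0
    return mfma_busy_cycles / mfma_instructions
```

The resulting average can then be compared against the per-instruction cycle counts reported by the AMD Matrix Instruction Calculator.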
MFMA Utilization:
rst: Indicates what percent of the kernel's duration the :ref:`MFMA <desc-mfma>`
unit was busy executing instructions. Computed as the ratio of the total number
of cycles the :ref:`MFMA <desc-mfma>` unit was busy over the :ref:`total
CU cycles <total-cu-cycles>`.
unit: Percent
SALU Utilization:
rst: Indicates what percent of the kernel's duration the :ref:`SALU <desc-salu>`
was busy executing instructions. Computed as the ratio of the total number of
cycles spent by the :ref:`scheduler <desc-scheduler>` issuing SALU / :ref:`SMEM
<desc-smem>` instructions over the :ref:`total CU cycles <total-cu-cycles>`.
unit: Percent
SMEM Latency:
rst: The average number of round-trip cycles (that is, from issue to data return
/ acknowledgment) required for a SMEM instruction to complete.
unit: Cycles
VALU Active Threads:
rst: Indicates the average level of :ref:`divergence <desc-divergence>` within
a wavefront over the lifetime of the kernel. The number of work-items that were
active in a wavefront during execution of each :ref:`VALU <desc-valu>` instruction,
time-averaged over all VALU instructions run on all wavefronts in the kernel.
unit: Work-items
VALU Utilization:
rst: Indicates what percent of the kernel's duration the :ref:`VALU <desc-valu>`
was busy executing instructions. Does not include :ref:`VMEM <desc-vmem>` operations.
Computed as the ratio of the total number of cycles spent by the :ref:`scheduler
<desc-scheduler>` issuing VALU instructions over the :ref:`total CU cycles <total-cu-cycles>`.
unit: Percent
VMEM Latency:
rst: The average number of round-trip cycles (that is, from issue to data return
/ acknowledgment) required for a VMEM instruction to complete.
unit: Cycles
VMEM Utilization:
rst: Indicates what percent of the kernel's duration the :ref:`VMEM <desc-vmem>`
unit was busy executing instructions, including both global/generic and spill/scratch
operations (see the :ref:`VMEM instruction count metrics <ta-instruction-counts>`
for more detail). Does not include :ref:`VALU <desc-valu>` operations. Computed
as the ratio of the total number of cycles spent by the :ref:`scheduler <desc-scheduler>`
issuing VMEM instructions over the :ref:`total CU cycles <total-cu-cycles>`.
unit: Percent
Arithmetic operations:
BF16 OPs:
rst: >-
The total number of 16-bit brain floating-point operations executed on
either the :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
unit <normalization-units>`. Note: on current CDNA accelerators, the VALU
has no native BF16 instructions.
unit: FLOP per normalization unit
F16 OPs:
rst: The total number of 16-bit floating-point operations executed on either the
:ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
unit <normalization-units>`.
unit: FLOP per normalization unit
F32 OPs:
rst: The total number of 32-bit floating-point operations executed on either the
:ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
unit <normalization-units>`.
unit: FLOP per normalization unit
F64 OPs:
rst: The total number of 64-bit floating-point operations executed on either the
:ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
unit <normalization-units>`.
unit: FLOP per normalization unit
FLOPs (Total):
rst: The total number of floating-point operations executed on either the :ref:`VALU
<desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
<normalization-units>`.
unit: FLOP per normalization unit
INT8 OPs:
rst: >-
The total number of 8-bit integer operations executed on either the :ref:`VALU
<desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
<normalization-units>`. Note: on current CDNA accelerators, the VALU has
no native INT8 instructions.
unit: IOP per normalization unit
IOPs (Total):
rst: The total number of integer operations executed on either the :ref:`VALU
<desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
<normalization-units>`.
unit: IOP per normalization unit
LDS Speed-of-Light:
Access Rate:
rst: Indicates the percentage of SIMDs in the :ref:`VALU <desc-valu>` [#lds-workload]_
actively issuing LDS instructions, averaged over the lifetime of the kernel.
Calculated as the ratio of the total number of cycles spent by the :ref:`scheduler
<desc-scheduler>` issuing :ref:`LDS <desc-lds>` instructions over the :ref:`total
CU cycles <total-cu-cycles>`.
unit: Percent
Bank Conflict Rate:
rst: Indicates the percentage of active LDS cycles that were spent servicing bank
conflicts. Calculated as the ratio of LDS cycles spent servicing bank conflicts
over the number of LDS cycles that would have been required to move the same
amount of data in an uncontended access. [#lds-bank-conflict]_
unit: Percent
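The Bank Conflict Rate calculation can be sketched as the ratio of conflict cycles to the cycles an uncontended access would have needed, expressed as a percentage. This is only an illustration of the stated ratio; the function and parameter names are hypothetical and do not correspond to real counter names.

```python
def bank_conflict_rate(conflict_cycles: float, uncontended_cycles: float) -> float:
    """Bank conflict rate as a percentage.

    conflict_cycles: LDS cycles spent servicing bank conflicts.
    uncontended_cycles: LDS cycles that would have been required to move
        the same amount of data without contention.
    Hypothetical helper for illustration only.
    """
    if uncontended_cycles == 0:
        # No LDS traffic: report zero rather than dividing by zero.
        return 0.0
    return 100.0 * conflict_cycles / uncontended_cycles
```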
Theoretical Bandwidth Utilization:
rst: Indicates the maximum number of bytes that could have been loaded from, stored
to, or atomically updated in the LDS, expressed as a percentage of the theoretical peak.
Does *not* take into account the execution mask of the wavefront when the instruction
was executed. See the :ref:`LDS bandwidth example <lds-bandwidth>` for more
detail.
unit: Percent
Utilization:
rst: Indicates what percent of the kernel's duration the :ref:`LDS <desc-lds>`
was actively executing instructions (including, but not limited to, load, store,
atomic and HIP's ``__shfl`` operations). Calculated as the ratio of the total
number of cycles LDS was active over the :ref:`total CU cycles <total-cu-cycles>`.
unit: Percent
LDS Statistics:
Addr Conflict:
rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>` due
to address conflicts (as determined by the conflict resolution hardware) per
:ref:`normalization unit <normalization-units>`.
unit: Cycles per normalization unit
Atomic Return Cycles:
rst: The total number of cycles spent on LDS atomics with return per :ref:`normalization
unit <normalization-units>`.
unit: Cycles per normalization unit
Bank Conflict:
rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>` due
to bank conflicts (as determined by the conflict resolution hardware) per :ref:`normalization
unit <normalization-units>`.
unit: Cycles per normalization unit
Bank Conflicts/Access:
rst: The ratio of the number of cycles spent in the :ref:`LDS scheduler <desc-lds>`
due to bank conflicts (as determined by the conflict resolution hardware) to
the base number of cycles that would be spent in the LDS scheduler in a completely
uncontended case. This is the unnormalized form of the Bank Conflict Rate.
unit: Conflicts per Access
Index Accesses:
rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>` over
all operations per :ref:`normalization unit <normalization-units>`.
unit: Cycles per normalization unit
LDS Instructions:
rst: The total number of LDS instructions (including, but not limited to, read/write/atomics
and HIP's ``__shfl`` instructions) executed per :ref:`normalization unit <normalization-units>`.
unit: Instructions per normalization unit
LDS Latency:
rst: The average number of round-trip cycles (i.e., from issue to data-return
acknowledgment) required for an LDS instruction to complete.
unit: Cycles
Mem Violations:
rst: >-
The total number of out-of-bounds accesses made to the LDS, per :ref:`normalization
unit <normalization-units>`. This is unused and expected to be zero in
most configurations for modern CDNA™ accelerators.
unit: Accesses per normalization unit
Theoretical Bandwidth:
rst: Indicates the maximum amount of bytes that could have been loaded from, stored
to, or atomically updated in the LDS divided by total duration. Does *not* take
into account the execution mask of the wavefront when the instruction was executed.
See the :ref:`LDS bandwidth example <lds-bandwidth>` for more detail.
unit: Gbps
Unaligned Stall:
rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>` due
to stalls from non-dword aligned addresses per :ref:`normalization unit <normalization-units>`.
unit: Cycles per normalization unit
vL1D Speed-of-Light:
Bandwidth Utilization:
rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM
<desc-vmem>` instructions, as a percent of the peak theoretical bandwidth achievable
on the specific accelerator. The number of bytes is calculated as the number
of cache lines requested multiplied by the cache line size. This value does
not consider partial requests, so for instance, if only a single value is requested
in a cache line, the data movement will still be counted as a full cache line.
unit: Percent
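To make the byte accounting for vL1D Bandwidth Utilization concrete, the sketch below multiplies the number of requested cache lines by the cache line size and expresses the result as a percentage of a peak value. The cache line size and the peak byte count are deliberately left as inputs, since both depend on the specific accelerator; all names here are hypothetical.

```python
def vl1d_bandwidth_utilization(
    cache_lines_requested: int,
    cache_line_bytes: int,
    peak_bytes: float,
) -> float:
    """Bytes looked up in the vL1D as a percentage of a theoretical peak.

    cache_line_bytes: cache line size of the target accelerator (assumed input).
    peak_bytes: bytes the vL1D could have served over the same interval at
        peak theoretical bandwidth (assumed input).
    Hypothetical helper for illustration only.
    """
    if peak_bytes <= 0:
        raise ValueError("peak_bytes must be positive")
    # Every request counts as a full cache line, even if only part is used.
    bytes_looked_up = cache_lines_requested * cache_line_bytes
    return 100.0 * bytes_looked_up / peak_bytes
```

Because partial requests are not considered, a load that touches a single value in a cache line contributes the full `cache_line_bytes` to the numerator.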
Coalescing:
rst: Indicates how well memory instructions were coalesced by the :ref:`address
processing unit <desc-ta>`, ranging from uncoalesced (25%) to fully coalesced
(100%). Calculated as the average number of :ref:`thread-requests <thread-requests>`
generated per instruction divided by the ideal number of thread-requests per
instruction.
unit: Percent
Hit rate:
rst: The ratio of the number of vL1D cache line requests that hit [#vl1d-hit]_
in vL1D cache over the total number of cache line requests to the :ref:`vL1D
Cache RAM <desc-tc>`.
unit: Percent
Utilization:
rst: Indicates how busy the :ref:`vL1D Cache RAM <desc-tc>` was during the kernel
execution. The number of cycles where the vL1D Cache RAM is actively processing
any request divided by the number of cycles where the vL1D is active [#vl1d-activity]_.
unit: Percent
Busy / stall metrics:
Address Processing Unit Busy:
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address processor
was busy.
unit: Percent
Address Stall:
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address processor
was stalled from sending address requests further into the vL1D pipeline.
unit: Percent
Data Stall:
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address processor
was stalled from sending write/atomic data further into the vL1D pipeline.
unit: Percent
"Data-Processor \u2192 Address Stall":
rst: Percent of :ref:`total CU cycles <total-cu-cycles>` the address processor
was stalled waiting to send command data to the :ref:`data processor <desc-td>`.
unit: Percent
Instruction counts:
Global/Generic Atomic Instructions:
rst: The total number of global & generic memory atomic (with and without return)
instructions executed on all :doc:`compute units <compute-unit>` on the accelerator,
per :ref:`normalization unit <normalization-units>`.
unit: Instructions per normalization unit
Global/Generic Instructions:
rst: The total number of global & generic memory instructions executed on all
:doc:`compute units <compute-unit>` on the accelerator, per :ref:`normalization
unit <normalization-units>`.
unit: Instructions per normalization unit
Global/Generic Read Instructions:
rst: The total number of global & generic memory read instructions executed on
all :doc:`compute units <compute-unit>` on the accelerator, per :ref:`normalization
unit <normalization-units>`.
unit: Instructions per normalization unit
Global/Generic Write Instructions:
rst: The total number of global & generic memory write instructions executed on
all :doc:`compute units <compute-unit>` on the accelerator, per :ref:`normalization
unit <normalization-units>`.
unit: Instructions per normalization unit
Spill/Stack Atomic Instructions:
rst: The total number of spill/stack memory atomic (with and without return) instructions
executed on all :doc:`compute units <compute-unit>` on the accelerator, per
:ref:`normalization unit <normalization-units>`. Rarely used, because spill/stack memory
operations typically implement thread-local storage.
unit: Instructions per normalization unit
Spill/Stack Instructions:
rst: The total number of spill/stack memory instructions executed on all :doc:`compute
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
unit: Instructions per normalization unit
Spill/Stack Read Instructions:
rst: The total number of spill/stack memory read instructions executed on all
:doc:`compute units <compute-unit>` on the accelerator, per :ref:`normalization
unit <normalization-units>`.
unit: Instructions per normalization unit
Spill/Stack Write Instructions:
rst: The total number of spill/stack memory write instructions executed on all
:doc:`compute units <compute-unit>` on the accelerator, per :ref:`normalization
unit <normalization-units>`.
unit: Instructions per normalization unit
Total Instructions:
rst: The total number of memory instructions executed by the address processor
over all compute units on the accelerator, per normalization unit.
unit: Instructions per normalization unit
Spill / stack metrics:
Spill/Stack Coalesced Read:
rst: The number of cycles the address processing unit spent working on coalesced
spill/stack read instructions, per :ref:`normalization unit <normalization-units>`.
unit: Cycles per normalization unit
Spill/Stack Coalesced Write:
rst: The number of cycles the address processing unit spent working on coalesced
spill/stack write instructions, per :ref:`normalization unit <normalization-units>`.
unit: Cycles per normalization unit
Spill/Stack Total Cycles:
rst: The number of cycles the address processing unit spent working on spill/stack
instructions, per :ref:`normalization unit <normalization-units>`.
unit: Cycles per normalization unit
L1 Unified Translation Cache (UTCL1):
Hit Ratio:
rst: The ratio of the number of translation requests that hit in the UTCL1 divided
by the total number of translation requests made to the UTCL1.
unit: Percent
Hits:
rst: The number of translation requests that hit in the UTCL1, and could be reused,
per normalization unit.
unit: Requests per normalization unit
Permission Misses:
rst: >-
The total number of translation requests that missed in the UTCL1 due
to a permission error, per :ref:`normalization unit <normalization-units>`.
This is unused and expected to be zero in most configurations for modern
CDNA™ accelerators.
unit: Requests per normalization unit
Req:
rst: The number of translation requests made to the UTCL1 per normalization unit.
unit: Requests per normalization unit
Translation Misses:
rst: The total number of translation requests that missed in the UTCL1 due to
translation not being present in the cache, per :ref:`normalization unit <normalization-units>`.
unit: Requests per normalization unit
vL1D cache stall metrics:
Stalled on L2 Data:
rst: The ratio of the number of cycles where the vL1D is stalled waiting for requested
data to return from the :doc:`L2 cache <l2-cache>` divided by the number of
cycles where the vL1D is active [#vl1d-activity]_.
unit: Percent
Stalled on L2 Req:
rst: The ratio of the number of cycles where the vL1D is stalled waiting to issue
a request for data to the :doc:`L2 cache <l2-cache>` divided by the number of
cycles where the vL1D is active [#vl1d-activity]_.
unit: Percent
Tag RAM Stall (Atomic):
rst: The ratio of the number of cycles where the vL1D is stalled due to Atomic
requests with conflicting tags being looked up concurrently, divided by the
number of cycles where the vL1D is active [#vl1d-activity]_.
unit: Percent
Tag RAM Stall (Read):
rst: The ratio of the number of cycles where the vL1D is stalled due to Read requests
with conflicting tags being looked up concurrently, divided by the number of
cycles where the vL1D is active [#vl1d-activity]_.
unit: Percent
Tag RAM Stall (Write):
rst: The ratio of the number of cycles where the vL1D is stalled due to Write
requests with conflicting tags being looked up concurrently, divided by the
number of cycles where the vL1D is active [#vl1d-activity]_.
unit: Percent
vL1D cache access metrics:
Atomic Req:
rst: The total number of incoming atomic requests from the :ref:`address processing
unit <desc-ta>` after coalescing, per :ref:`normalization unit <normalization-units>`.
unit: Requests per normalization unit
Cache Accesses:
rst: The total number of cache line lookups in the vL1D.
unit: Cache lines
Cache BW:
rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM
<desc-vmem>` instructions divided by total duration. The number of bytes is
calculated as the number of cache lines requested multiplied by the cache line
size. This value does not consider partial requests, so for instance, if only
a single value is requested in a cache line, the data movement will still be
counted as a full cache line.
unit: Gbps
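# Worked example for Cache BW (illustrative only; all numbers are hypothetical,
# and the 64-byte cache line size is an assumption that varies by accelerator):
#   cache lines requested = 1,000,000
#   cache line size       = 64 bytes
#   kernel duration       = 1 ms
#   bytes looked up       = 1,000,000 * 64 = 64,000,000 bytes
#   bandwidth             = 64,000,000 bytes / 0.001 s = 6.4e10 bytes per second
# A request for a single 4-byte value still counts as a full 64-byte line here.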
Cache Hit Rate:
rst: The ratio of the number of vL1D cache line requests that hit in vL1D cache
over the total number of cache line requests to the :ref:`vL1D Cache RAM <desc-tc>`.
unit: Percent
Cache Hits:
rst: The number of cache accesses minus the number of outgoing requests to the
:doc:`L2 cache <l2-cache>`, that is, the number of cache line requests serviced
by the :ref:`vL1D Cache RAM <desc-tc>` per :ref:`normalization unit <normalization-units>`.
unit: Cache lines per normalization unit
Invalidations:
rst: The number of times the vL1D was issued a write-back invalidate command during
the kernel's execution per :ref:`normalization unit <normalization-units>`.
This may be triggered by, for instance, the ``buffer_wbinvl1`` instruction.
unit: Invalidations per normalization unit
L1 Access Latency:
rst: Calculated as the average number of cycles that a vL1D cache line request
spent in the vL1D cache pipeline.
unit: Cycles
L1-L2 Atomic:
rst: The number of atomic requests that are sent through the vL1D to the :doc:`L2
cache <l2-cache>`, per :ref:`normalization unit <normalization-units>`. This
includes requests for atomics with, and without return.
unit: Requests per normalization unit
L1-L2 BW:
rst: The number of bytes transferred across the vL1D-L2 interface as a result
of :ref:`VMEM <desc-vmem>` instructions, divided by total duration. The number
of bytes is calculated as the number of cache lines requested multiplied by
the cache line size. This value does not consider partial requests, so for instance,
if only a single value is requested in a cache line, the data movement will
still be counted as a full cache line.
unit: Gbps
L1-L2 Read:
rst: The number of read requests for a vL1D cache line that were not satisfied
by the vL1D and must be retrieved from the :doc:`L2 Cache <l2-cache>`,
per :ref:`normalization unit <normalization-units>`.
unit: Requests per normalization unit
L1-L2 Read Latency:
rst: Calculated as the average number of cycles that the vL1D cache took to issue
and receive read requests from the :doc:`L2 Cache <l2-cache>`. This number also
includes requests for atomics with return values.
unit: Cycles
L1-L2 Write:
rst: The number of write requests to a vL1D cache line that were sent through
the vL1D to the :doc:`L2 cache <l2-cache>`, per :ref:`normalization unit <normalization-units>`.
unit: Requests per normalization unit
L1-L2 Write Latency:
rst: Calculated as the average number of cycles that the vL1D cache took to issue
and receive acknowledgement of a write request to the :doc:`L2 Cache <l2-cache>`.
This number also includes requests for atomics without return values.
unit: Cycles
Read Req:
rst: The total number of incoming read requests from the :ref:`address processing
unit <desc-ta>` after coalescing, per :ref:`normalization unit <normalization-units>`.
unit: Requests per normalization unit
Total Req:
rst: The total number of incoming requests from the :ref:`address processing unit
<desc-ta>` after coalescing.
unit: Requests
Write Req:
rst: The total number of incoming write requests from the :ref:`address processing
unit <desc-ta>` after coalescing, per :ref:`normalization unit <normalization-units>`.
unit: Requests per normalization unit
Vector L1 data-return path or Texture Data (TD):
Atomic Instructions:
rst: The number of atomic instructions submitted to the :ref:`data-return unit
<desc-td>` by the :ref:`address processor <desc-ta>` summed over all :doc:`compute
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
This is expected to be the sum of global/generic and spill/stack atomics in
the :ref:`address processor <desc-ta>`.
unit: Instructions per normalization unit
"Cache RAM \u2192 Data-Return Stall":
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return unit
was stalled on data to be returned from the :ref:`vL1D Cache RAM <desc-tc>`.
unit: Percent
Coalescable Instructions:
rst: The number of instructions submitted to the :ref:`data-return unit <desc-td>`
by the :ref:`address processor <desc-ta>` that were found to be coalescable,
per :ref:`normalization unit <normalization-units>`.
unit: Instructions per normalization unit
Data-Return Busy:
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return unit
was busy processing or waiting on data to return to the :doc:`CU <compute-unit>`.
unit: Percent
Read Instructions:
rst: The number of read instructions submitted to the :ref:`data-return unit <desc-td>`
by the :ref:`address processor <desc-ta>` summed over all :doc:`compute units
<compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
This is expected to be the sum of global/generic and spill/stack reads in the
:ref:`address processor <desc-ta>`.
unit: Instructions per normalization unit
"Workgroup manager \u2192 Data-Return Stall":
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return unit
was stalled by the :ref:`workgroup manager <desc-spi>` due to initialization
of registers as a part of launching new workgroups.
unit: Percent
Write Ack Instructions:
rst: The total number of write acknowledgements submitted by the :ref:`data-return
unit <desc-td>` to SQ, summed over all compute units on the accelerator, per
normalization unit.
unit: Instructions per normalization unit
Write Instructions:
rst: The number of store instructions submitted to the :ref:`data-return unit
<desc-td>` by the :ref:`address processor <desc-ta>` summed over all :doc:`compute
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
This is expected to be the sum of global/generic and spill/stack stores counted
by the :ref:`vL1D cache-front-end <ta-instruction-counts>`.
unit: Instructions per normalization unit
L2 Speed-of-Light:
HBM Bandwidth:
rst: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory
(HBM) per unit time. This value is calculated as the number of HBM channels
multiplied by the HBM channel width multiplied by the HBM clock frequency.
unit: GB/s
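# Worked example for HBM Bandwidth (a sketch with hypothetical values, not the
# specification of any particular accelerator):
#   HBM channels   = 32
#   channel width  = 16 bytes per clock
#   HBM clock      = 1.6e9 clocks per second
#   peak bandwidth = 32 * 16 * 1.6e9 = 8.192e11 bytes per second (819.2 GB/s)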
Hit Rate:
rst: The ratio of the number of L2 cache line requests that hit in the L2 cache
over the total number of incoming cache line requests to the L2 cache.
unit: Percent
L2-Fabric Read BW:
rst: The number of bytes read by the L2 over the :ref:`Infinity Fabric interface
<l2-fabric>` per unit time.
unit: GB/s
L2-Fabric Write and Atomic BW:
rst: The number of bytes sent by the L2 over the :ref:`Infinity Fabric interface
<l2-fabric>` by write and atomic operations per unit time.
unit: GB/s
Peak Bandwidth:
rst: The number of bytes looked up in the L2 cache, as a percent of the peak theoretical
bandwidth achievable on the specific accelerator. The number of bytes is calculated
as the number of cache lines requested multiplied by the cache line size. This
value does not consider partial requests, so e.g., if only a single value is
requested in a cache line, the data movement will still be counted as a full
cache line.
unit: Percent
Utilization:
rst: The ratio of the :ref:`number of cycles an L2 channel was active, summed
over all L2 channels on the accelerator <total-active-l2-cycles>` over the :ref:`total
L2 cycles <total-l2-cycles>`.
unit: Percent
L2 cache accesses:
Atomic Bandwidth:
rst: Total number of bytes looked up in the L2 cache for atomic requests, divided
by total duration.
unit: Gbps
Atomic Req:
rst: The total number of atomic requests (with and without return) to the L2 from
all clients.
unit: Requests per normalization unit
Bandwidth:
rst: The number of bytes looked up in the L2 cache, divided by total duration.
The number of bytes is calculated as the number of cache lines requested multiplied
by the cache line size. This value does not consider partial requests, so for
example, if only a single value is requested in a cache line, the data movement
will still be counted as a full cache line.
unit: Gbps
CC Req:
rst: The total number of requests to the L2 that go to Coherently Cacheable (CC)
memory allocations. See the :ref:`memory-type` for more information.
unit: Requests per normalization unit
Cache Hit:
rst: The ratio of the number of L2 cache line requests that hit in the L2 cache
over the total number of incoming cache line requests to the L2 cache.
unit: Percent
Evict (Internal):
rst: The total number of L2 cache lines evicted from the cache due to capacity
limits, per :ref:`normalization unit <normalization-units>`.
unit: Cache lines per normalization unit
Evict (vL1D Req):
rst: The total number of L2 cache lines evicted from the cache due to invalidation
requests initiated by the :doc:`vL1D cache <vector-l1-cache>`, per :ref:`normalization
unit <normalization-units>`.
unit: Cache lines per normalization unit
Hits:
rst: The total number of requests to the L2 from all clients that hit in the cache.
As noted in the :ref:`Speed-of-Light <l2-sol>` section, this includes hit-on-miss
requests.
unit: Requests per normalization unit
Misses:
rst: The total number of requests to the L2 from all clients that miss in the
cache. As noted in the :ref:`Speed-of-Light <l2-sol>` section, these do not
include hit-on-miss requests.
unit: Requests per normalization unit
NC Req:
rst: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory
allocations, per :ref:`normalization unit <normalization-units>`. See the :ref:`memory-type`
for more information.
unit: Requests per normalization unit
Probe Req:
rst: The number of coherence probe requests made to the L2 cache from outside
the accelerator. On an :ref:`MI2XX <mixxx-note>`, probe requests may be generated
by, for example, writes to :ref:`fine-grained device <memory-type>` memory or
by writes to :ref:`coarse-grained <memory-type>` device memory.
unit: Requests per normalization unit
RW Req:
rst: The total number of requests to the L2 that go to Read-Write coherent memory
(RW) allocations. See the :ref:`memory-type` for more information.
unit: Requests per normalization unit
Read Bandwidth:
rst: Total number of bytes looked up in the L2 cache for read requests, divided
by total duration.
unit: Gbps
Read Req:
rst: The total number of read requests to the L2 from all clients.
unit: Requests per normalization unit
Req:
rst: The total number of incoming requests to the L2 from all clients for all
request types, per :ref:`normalization unit <normalization-units>`.
unit: Requests per normalization unit
Streaming Req:
rst: The total number of incoming requests to the L2 that are marked as *streaming*.
The exact meaning of this may differ depending on the targeted accelerator;
on an :ref:`MI2XX <mixxx-note>`, however, this corresponds to `non-temporal loads
or stores <https://clang.llvm.org/docs/LanguageExtensions.html#non-temporal-load-store-builtins>`_.
The L2 cache attempts to evict *streaming* requests before normal requests when
the L2 is at capacity.
unit: Requests per normalization unit
UC Req:
rst: The total number of requests to the L2 that go to Uncached (UC) memory allocations.
See the :ref:`memory-type` for more information.
unit: Requests per normalization unit
Write Bandwidth:
rst: Total number of bytes looked up in the L2 cache for write requests, divided
by total duration.
unit: Gbps
Write Req:
rst: The total number of write requests to the L2 from all clients.
unit: Requests per normalization unit
Writeback:
rst: The total number of L2 cache lines written back to memory for any reason.
Write-backs may occur due to user code (such as HIP kernel calls to ``__threadfence_system``
or atomic built-ins), due to the :doc:`command processor <command-processor>`'s memory
acquire/release fences, or for other internal hardware reasons.
unit: Cache lines per normalization unit
Writeback (Internal):
rst: The total number of L2 cache lines written back to memory for internal hardware
reasons, per :ref:`normalization unit <normalization-units>`.
unit: Cache lines per normalization unit
Writeback (vL1D Req):
rst: The total number of L2 cache lines written back to memory due to requests
initiated by the :doc:`vL1D cache <vector-l1-cache>`, per :ref:`normalization
unit <normalization-units>`.
unit: Cache lines per normalization unit
L2-Fabric interface metrics:
Atomic Latency:
rst: The time-averaged number of cycles atomic requests spent in Infinity Fabric
before a completion acknowledgement (atomic without return value) or data (atomic
with return value) was returned to the L2.
unit: Cycles
Atomic Traffic:
rst: The percent of write requests generated by the L2 cache that are atomic requests
to *any* memory location. This breakdown does not consider the *size* of the
request (meaning that 32B and 64B requests are both counted as a single request),
so this metric only *approximates* the percent of the L2-Fabric Write and Atomic
bandwidth used by atomic requests. Note that on current CDNA accelerators, such
as the :ref:`MI2XX <mixxx-note>`, requests are only considered *atomic* by Infinity
Fabric if they are targeted at :ref:`fine-grained memory <memory-type>` allocations
or :ref:`uncached memory <memory-type>` allocations.
unit: Percent
HBM Read Traffic:
rst: The percent of read requests generated by the L2 cache that are routed to
the accelerator's local high-bandwidth memory (HBM). This breakdown does not
consider the *size* of the request (meaning that 32B and 64B requests are both
counted as a single request), so this metric only *approximates* the percent
of the L2-Fabric Read bandwidth directed to the local HBM.
unit: Percent
HBM Write and Atomic Traffic:
rst: The percent of write and atomic requests generated by the L2 cache that are
routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
does not consider the *size* of the request (meaning that 32B and 64B requests
are both counted as a single request), so this metric only *approximates* the
percent of the L2-Fabric Write and Atomic bandwidth directed to the local HBM.
Note that on current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`,
requests are only considered *atomic* by Infinity Fabric if they are targeted
at :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached memory
<memory-type>` allocations.
unit: Percent
Read BW:
rst: The total number of bytes read by the L2 cache from Infinity Fabric divided
by total duration.
unit: Gbps
Read Latency:
rst: The time-averaged number of cycles read requests spent in Infinity Fabric
before data was returned to the L2.
unit: Cycles
Read Stall:
rst: >-
The ratio of the total number of cycles the L2-Fabric interface was stalled
on a read request to any destination (local HBM, remote PCIe® connected
accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_
or CPU) over the :ref:`total active L2 cycles <total-active-l2-cycles>`.
unit: Percent
Remote Read Traffic:
rst: The percent of read requests generated by the L2 cache that are routed to
any memory location other than the accelerator's local high-bandwidth memory
(HBM) -- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown
does not consider the *size* of the request (meaning that 32B and 64B requests
are both counted as a single request), so this metric only *approximates* the
percent of the L2-Fabric Read bandwidth directed to a remote location.
unit: Percent
Remote Write and Atomic Traffic:
rst: The percent of write and atomic requests generated by the L2 cache that are routed to
any memory location other than the accelerator's local high-bandwidth memory
(HBM) -- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown
does not consider the *size* of the request (meaning that 32B and 64B requests
are both counted as a single request), so this metric only *approximates* the
percent of the L2-Fabric Write and Atomic bandwidth directed to a remote location. Note
that on current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests
are only considered *atomic* by Infinity Fabric if they are targeted at :ref:`fine-grained
memory <memory-type>` allocations or :ref:`uncached memory <memory-type>` allocations.
unit: Percent
Uncached Read Traffic:
rst: The percent of read requests generated by the L2 cache that are reading from
an :ref:`uncached memory allocation <memory-type>`. Note, as described in the
:ref:`request flow <l2-request-flow>` section, a single 64B read request is
typically counted as two uncached read requests. So, it is possible for the
Uncached Read Traffic to reach up to 200% of the total number of read requests.
This breakdown does not consider the *size* of the request (i.e., 32B and 64B
requests are both counted as a single request), so this metric only *approximates*
the percent of the L2-Fabric read bandwidth directed to an uncached memory location.
unit: Percent
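# Worked example for the 200% case (hypothetical counts): if the L2 issues 100
# read requests and all of them are 64B reads of uncached memory, each one is
# counted as two uncached read requests, so
#   Uncached Read Traffic = 200 / 100 = 200%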
Uncached Write and Atomic Traffic:
rst: The percent of write and atomic requests generated by the L2 cache that are
targeting :ref:`uncached memory allocations <memory-type>`. This breakdown does
not consider the *size* of the request (meaning that 32B and 64B requests are
both counted as a single request), so this metric only *approximates* the percent
of the L2-Fabric Write and Atomic bandwidth directed to uncached memory allocations.
unit: Percent
Write Stall:
rst: The ratio of the total number of cycles the L2-Fabric interface was stalled
on a write or atomic request to any destination (local HBM, remote PCIe connected
accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_ or CPU)
over the :ref:`total active L2 cycles <total-active-l2-cycles>`.
unit: Percent
Write and Atomic BW:
rst: The total number of bytes written by the L2 over Infinity Fabric by write
and atomic operations divided by total duration. Note that on current CDNA accelerators,
such as the :ref:`MI2XX <mixxx-note>`, requests are only considered *atomic*
by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
:ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached memory
<memory-type>` allocations on the MI2XX.
unit: Gbps
Write and Atomic Latency:
rst: The time-averaged number of cycles write requests spent in Infinity Fabric
before a completion acknowledgement was returned to the L2.
unit: Cycles
L2 - Fabric interface detailed metrics:
Atomic:
rst: The total number of L2 requests to Infinity Fabric to atomically update 32B
or 64B of data in any memory location, per :ref:`normalization unit <normalization-units>`.
See :ref:`l2-request-flow` for more detail. Note that on current CDNA accelerators,
such as the :ref:`MI2XX <mixxx-note>`, requests are only considered *atomic*
by Infinity Fabric if they are targeted at non-write-cacheable memory, such
as :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached memory
<memory-type>` allocations on the MI2XX.
unit: Requests per normalization unit
Atomic Bandwidth - HBM:
rst: Total number of bytes transferred by L2 atomic requests as HBM traffic, divided
by total duration.
unit: Gbps
"Atomic Bandwidth - Infinity Fabric\u2122":
rst: Total number of bytes transferred by L2 atomic requests as Infinity Fabric traffic,
divided by total duration.
unit: Gbps
Atomic Bandwidth - PCIe:
rst: Total number of bytes transferred by L2 atomic requests as PCIe traffic, divided
by total duration.
unit: Gbps
HBM Read:
rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of
data from the accelerator's local HBM, per :ref:`normalization unit <normalization-units>`.
See :ref:`l2-request-flow` for more detail.
unit: Requests per normalization unit
HBM Write and Atomic:
rst: The total number of L2 requests to Infinity Fabric to write or atomically
update 32B or 64B of data in the accelerator's local HBM, per :ref:`normalization
unit <normalization-units>`. See :ref:`l2-request-flow` for more detail.
unit: Requests per normalization unit
Read (32B):
rst: The total number of L2 requests to Infinity Fabric to read 32B of data from
any memory location, per :ref:`normalization unit <normalization-units>`. See
:ref:`l2-request-flow` for more detail. Typically unused on CDNA accelerators.
unit: Requests per normalization unit
Read (64B):
rst: The total number of L2 requests to Infinity Fabric to read 64B of data from
any memory location, per :ref:`normalization unit <normalization-units>`. See
:ref:`l2-request-flow` for more detail.
unit: Requests per normalization unit
Read (Uncached):
rst: The total number of L2 requests to Infinity Fabric to read :ref:`uncached
data <memory-type>` from any memory location, per :ref:`normalization unit <normalization-units>`.
64B requests for uncached data are counted as two 32B uncached data requests.
See :ref:`l2-request-flow` for more detail.
unit: Requests per normalization unit
Read Bandwidth - HBM:
rst: Total number of bytes transferred by L2 read requests as HBM traffic, divided
by total duration.
unit: Gbps
"Read Bandwidth - Infinity Fabric\u2122":
rst: Total number of bytes transferred by L2 read requests as Infinity Fabric traffic,
divided by total duration.
unit: Gbps
Read Bandwidth - PCIe:
rst: Total number of bytes transferred by L2 read requests as PCIe traffic, divided
by total duration.
unit: Gbps
Remote Read:
rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of
data from any source other than the accelerator's local HBM, per :ref:`normalization
unit <normalization-units>`. See :ref:`l2-request-flow` for more detail.
unit: Requests per normalization unit
Remote Write and Atomic:
rst: The total number of L2 requests to Infinity Fabric to write or atomically
update 32B or 64B of data in any memory location other than the accelerator's
local HBM, per :ref:`normalization unit <normalization-units>`. See :ref:`l2-request-flow`
for more detail.
unit: Requests per normalization unit
Write Bandwidth - HBM:
rst: Total number of bytes transferred by L2 write requests as HBM traffic, divided
by total duration.
unit: Gbps
"Write Bandwidth - Infinity Fabric\u2122":
rst: Total number of bytes transferred by L2 write requests as Infinity Fabric traffic,
divided by total duration.
unit: Gbps
Write Bandwidth - PCIe:
rst: Total number of bytes transferred by L2 write requests as PCIe traffic, divided
by total duration.
unit: Gbps
Write and Atomic (32B):
rst: The total number of L2 requests to Infinity Fabric to write or atomically
update 32B of data to any memory location, per :ref:`normalization unit <normalization-units>`.
See :ref:`l2-request-flow` for more detail.
unit: Requests per normalization unit
Write and Atomic (64B):
rst: The total number of L2 requests to Infinity Fabric to write or atomically
update 64B of data in any memory location, per :ref:`normalization unit <normalization-units>`.
See :ref:`l2-request-flow` for more detail.
unit: Requests per normalization unit
Write and Atomic (Uncached):
rst: The total number of L2 requests to Infinity Fabric to write or atomically
update 32B or 64B of :ref:`uncached data <memory-type>`, per :ref:`normalization
unit <normalization-units>`. See :ref:`l2-request-flow` for more detail.
unit: Requests per normalization unit
L2 - Fabric Interface stalls:
Read - HBM Stall:
rst: The number of cycles the L2-Fabric interface was stalled on read requests
to the accelerator's local HBM as a percent of the :ref:`total active L2 cycles
<total-active-l2-cycles>`.
unit: Percent
Read - Infinity Fabric Stall:
rst: The number of cycles the L2-Fabric interface was stalled on read requests
to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a percent
of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
unit: Percent
Read - PCIe Stall:
rst: The number of cycles the L2-Fabric interface was stalled on read requests
to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total
active L2 cycles <total-active-l2-cycles>`.
unit: Percent
Write - Credit Starvation:
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
requests to any memory location because too many write/atomic requests were
currently in flight, as a percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
unit: Percent
Write - HBM Stall:
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
requests to the accelerator's local HBM as a percent of the total active L2 cycles.
unit: Percent
Write - Infinity Fabric Stall:
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
requests to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as
a percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
unit: Percent
Write - PCIe Stall:
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
requests to remote PCIe connected accelerators [#inf]_ or CPUs as a percent
of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
unit: Percent
Scalar L1D Speed-of-Light:
Bandwidth Utilization:
rst: The number of bytes looked up in the sL1D cache, as a percent of the peak
theoretical bandwidth. Calculated as the ratio of sL1D requests over the :ref:`total
sL1D cycles <total-sl1d-cycles>`.
unit: Percent
Cache Hit Rate:
rst: Indicates the percent of sL1D requests that hit on a previously loaded line
in the cache. The ratio of the number of sL1D requests that hit [#sl1d-cache]_
over the number of all sL1D requests.
unit: Percent
sL1D-L2 BW Utilization:
rst: The percentage of the peak theoretical sL1D - L2 interface bandwidth achieved.
Calculated as total number of bytes read from, written to, or atomically updated
across the sL1D - L2 interface.
unit: Percent
Scalar L1D cache accesses:
Atomic Req:
rst: The total number of atomic requests from sL1D to the :doc:`L2 <l2-cache>`,
per :ref:`normalization unit <normalization-units>`. Typically unused on current
CDNA accelerators.
unit: Requests per normalization unit
Cache Hit Rate:
rst: Indicates the percent of sL1D requests that hit on a previously loaded line
in the cache. The ratio of the number of sL1D requests that hit [#sl1d-cache]_
over the number of all sL1D requests.
unit: Percent
Hits:
rst: The total number of sL1D requests that hit on a previously loaded cache line,
per :ref:`normalization unit <normalization-units>`.
unit: Requests per normalization unit
Misses - Non Duplicated:
rst: The total number of sL1D requests that missed on a cache line that *was not*
already pending due to another request, per :ref:`normalization unit <normalization-units>`.
See :ref:`desc-sl1d-sol` for more detail.
unit: Requests per normalization unit
Misses- Duplicated:
rst: The total number of sL1D requests that missed on a cache line that *was*
already pending due to another request, per :ref:`normalization unit <normalization-units>`.
See :ref:`desc-sl1d-sol` for more detail.
unit: Requests per normalization unit
Read Req (1 DWord):
rst: The total number of sL1D read requests made for a single dword of data (4B),
per :ref:`normalization unit <normalization-units>`.
unit: Requests per normalization unit
Read Req (16 DWord):
rst: The total number of sL1D read requests made for sixteen dwords of data
(64B), per :ref:`normalization unit <normalization-units>`.
unit: Requests per normalization unit
Read Req (2 DWord):
rst: The total number of sL1D read requests made for two dwords of data (8B),
per :ref:`normalization unit <normalization-units>`.
unit: Requests per normalization unit
Read Req (4 DWord):
rst: The total number of sL1D read requests made for four dwords of data (16B),
per :ref:`normalization unit <normalization-units>`.
unit: Requests per normalization unit
Read Req (8 DWord):
rst: The total number of sL1D read requests made for eight dwords of data (32B),
per :ref:`normalization unit <normalization-units>`.
unit: Requests per normalization unit
Read Req (Total):
rst: The total number of sL1D read requests of any size, per :ref:`normalization
unit <normalization-units>`.
unit: Requests per normalization unit
Req:
rst: The total number of requests, of any size or type, made to the sL1D per :ref:`normalization
unit <normalization-units>`.
unit: Requests per normalization unit
Scalar L1D Cache - L2 Interface:
Atomic Req:
rst: The total number of atomic requests from sL1D to the :doc:`L2 <l2-cache>`,
per :ref:`normalization unit <normalization-units>`. Typically unused on current
CDNA accelerators.
unit: Requests per normalization unit
Read Req:
rst: The total number of read requests from sL1D to the :doc:`L2 <l2-cache>`,
per :ref:`normalization unit <normalization-units>`.
unit: Requests per normalization unit
Stall Cycles:
rst: >-
The total number of cycles the sL1D ↔ :doc:`L2 <l2-cache>` interface
was stalled, per :ref:`normalization unit <normalization-units>`.
unit: Cycles per normalization unit
Write Req:
rst: The total number of write requests from sL1D to the :doc:`L2 <l2-cache>`,
per :ref:`normalization unit <normalization-units>`. Typically unused on current
CDNA accelerators.
unit: Requests per normalization unit
sL1D-L2 BW:
rst: >-
The total number of bytes read from, written to, or atomically updated
across the sL1D ↔ :doc:`L2 <l2-cache>` interface, divided by total duration.
Note that sL1D writes and atomics are typically
unused on current CDNA accelerators, so in the majority of cases this can
be interpreted as an sL1D → L2 read bandwidth.
unit: Gbps
L1I Speed-of-Light:
Bandwidth Utilization:
rst: The number of bytes looked up in the L1I cache, as a percent of the peak
theoretical bandwidth. Calculated as the ratio of L1I requests over the :ref:`total
L1I cycles <total-l1i-cycles>`.
unit: Percent
Cache Hit Rate:
rst: The percent of L1I requests that hit [#l1i-cache]_ on a previously loaded
line in the cache. Calculated as the ratio of the number of L1I requests that hit
over the number of all L1I requests.
unit: Percent
L1I-L2 Bandwidth Utilization:
rst: >-
The percent of the peak theoretical L1I → L2 cache request bandwidth
achieved. Calculated as the ratio of the total number of requests from
the L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`.
unit: Percent
L1I cache accesses:
Cache Hit Rate:
rst: The percent of L1I requests that hit [#l1i-cache]_ on a previously loaded
line in the cache. Calculated as the ratio of the number of L1I requests that hit
over the number of all L1I requests.
unit: Percent
Hits:
rst: The total number of L1I requests that hit on a previously loaded cache line,
per :ref:`normalization unit <normalization-units>`.
unit: Requests per normalization unit
Instruction Fetch Latency:
rst: The average number of cycles spent to fetch instructions to a :doc:`CU <compute-unit>`.
unit: Cycles
Misses - Duplicated:
rst: The total number of L1I requests that missed on a cache line that *was*
already pending due to another request, per :ref:`normalization unit <normalization-units>`.
See note in :ref:`desc-l1i-sol` for more detail.
unit: Requests per normalization unit
Misses - Non Duplicated:
rst: The total number of L1I requests that missed on a cache line that *was not*
already pending due to another request, per :ref:`normalization unit <normalization-units>`.
See note in :ref:`desc-l1i-sol` for more detail.
unit: Requests per normalization unit
Req:
rst: The total number of requests made to the L1I, per :ref:`normalization unit <normalization-units>`.
unit: Requests per normalization unit
L1I <-> L2 interface:
L1I-L2 Bandwidth:
rst: The total number of bytes transferred across the L1I - L2 interface, divided
by the total duration.
unit: Gbps
Workgroup manager utilizations:
Accelerator Utilization:
rst: The percent of cycles in the kernel where the accelerator was actively doing
any work.
unit: Percent
Dispatched Wavefronts:
rst: The total number of wavefronts, summed over all workgroups, forming this
kernel launch.
unit: Wavefronts
Dispatched Workgroups:
rst: The total number of workgroups forming this kernel launch.
unit: Workgroups
SGPR Writes:
rst: The average number of cycles spent initializing :ref:`SGPRs <desc-salu>`
at wave creation.
unit: Cycles/wave
SIMD Utilization:
rst: The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel
where any :ref:`SIMD <desc-valu>` on a CU was actively doing any work, summed
over all CUs. Low values (less than 100%) indicate that the accelerator was
not fully saturated by the kernel, or a potential load-imbalance issue.
unit: Percent
Scheduler-Pipe Utilization:
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where the scheduler-pipes were actively doing any work. Note: this
value is expected to range between 0% and 25%. See :ref:`desc-spi`.
unit: Percent
Shader Engine Utilization:
rst: The percent of :ref:`total shader engine cycles <total-se-cycles>` in the
kernel where any CU in a shader-engine was actively doing any work, normalized
over all shader-engines. Low values (e.g., << 100%) indicate that the accelerator
was not fully saturated by the kernel, or a potential load-imbalance issue.
unit: Percent
VGPR Writes:
rst: The average number of cycles spent initializing :ref:`VGPRs <desc-valu>`
at wave creation.
unit: Cycles/wave
Workgroup Manager Utilization:
rst: The percent of cycles in the kernel where the workgroup manager was actively
doing any work.
unit: Percent
Workgroup Manager - Resource Allocation:
Insufficient CU Barriers:
rst: The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel where
a workgroup could not be scheduled to a :doc:`CU <compute-unit>` due to lack
of available :ref:`barriers <desc-barrier>`.
unit: Percent
Insufficient CU LDS:
rst: The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel where
a workgroup could not be scheduled to a :doc:`CU <compute-unit>` due to lack
of available :doc:`LDS <local-data-share>`.
unit: Percent
Insufficient SIMD SGPRs:
rst: The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel
where a workgroup could not be scheduled to a :ref:`SIMD <desc-valu>` due to
lack of available :ref:`SGPRs <desc-salu>`.
unit: Percent
Insufficient SIMD VGPRs:
rst: The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel
where a workgroup could not be scheduled to a :ref:`SIMD <desc-valu>` due to
lack of available :ref:`VGPRs <desc-valu>`.
unit: Percent
Insufficient SIMD Waveslots:
rst: The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel
where a workgroup could not be scheduled to a :ref:`SIMD <desc-valu>` due to
lack of available :ref:`waveslots <desc-valu>`.
unit: Percent
Not-scheduled Rate (Scheduler-Pipe):
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
due to a bottleneck within the scheduler-pipes rather than a lack of a CU
or :ref:`SIMD <desc-valu>` with sufficient resources. Note: this value is
expected to range between 0% and 25%. See the note in the :ref:`workgroup manager <desc-spi>`
description.
unit: Percent
Not-scheduled Rate (Workgroup Manager):
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
due to a bottleneck within the workgroup manager rather than a lack of a
CU or :ref:`SIMD <desc-valu>` with sufficient resources. Note: this value
is expected to range between 0% and 25%. See the note in the :ref:`workgroup manager <desc-spi>`
description.
unit: Percent
Reached CU Wavefront Limit:
rst: The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel where
a wavefront could not be scheduled to a :doc:`CU <compute-unit>` due to limits
within the workgroup manager. This is expected to always be zero on CDNA2
or newer accelerators (and small for previous accelerators).
unit: Percent
Reached CU Workgroup Limit:
rst: The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel where
a workgroup could not be scheduled to a :doc:`CU <compute-unit>` due to limits
within the workgroup manager. This is expected to always be zero on CDNA2
or newer accelerators (and small for previous accelerators).
unit: Percent
Scheduler-Pipe Stall Rate:
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
due to occupancy limitations (like a lack of a CU or :ref:`SIMD <desc-valu>`
with sufficient resources). Note: this value is expected to range between
0% and 25%. See the note in the :ref:`workgroup manager <desc-spi>` description.
unit: Percent
Scratch Stall Rate:
rst: The percent of :ref:`total shader-engine cycles <total-se-cycles>` in the
kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
due to lack of :ref:`private (a.k.a., scratch) memory <memory-type>` slots.
While this can reach up to 100%, note that the actual occupancy limitations
on a kernel using private memory are typically quite small (for example, less
than 1% of the total number of waves that can be scheduled to an accelerator).
unit: Percent
Command processor fetcher (CPF):
CPF Stall:
rst: Percent of CPF busy cycles where the CPF was stalled for any reason.
unit: Percent
CPF Utilization:
rst: Percent of total cycles where the CPF was busy actively doing any work. The
ratio of CPF busy cycles over total cycles counted by the CPF.
unit: Percent
CPF-L2 Stall:
rst: Percent of CPF-:doc:`L2 <l2-cache>` busy cycles where the CPF-L2 interface
was stalled for any reason.
unit: Percent
CPF-L2 Utilization:
rst: Percent of total cycles counted by the CPF-:doc:`L2 <l2-cache>` interface
where the CPF-L2 interface was actively doing any work. The ratio of CPF-L2 busy
cycles over total cycles counted by the CPF-L2 interface.
unit: Percent
CPF-UTCL1 Stall:
rst: Percent of CPF busy cycles where the CPF was stalled by address translation.
unit: Percent
Command processor packet processor (CPC):
CPC Packet Decoding Utilization:
rst: Percent of CPC busy cycles spent decoding commands for processing.
unit: Percent
CPC Stall Rate:
rst: Percent of CPC busy cycles where the CPC was stalled for any reason.
unit: Percent
CPC Utilization:
rst: Percent of total cycles where the CPC was busy actively doing any work. The
ratio of CPC busy cycles over total cycles counted by the CPC.
unit: Percent
CPC-L2 Utilization:
rst: Percent of total cycles counted by the CPC-:doc:`L2 <l2-cache>` interface
where the CPC-L2 interface was actively doing any work.
unit: Percent
CPC-UTCL1 Stall:
rst: Percent of CPC busy cycles where the CPC was stalled by address translation.
unit: Percent
CPC-UTCL2 Utilization:
rst: Percent of total cycles counted by the CPC's :doc:`L2 <l2-cache>` address
translation interface where the CPC was busy doing address translation work.
unit: Percent
CPC-Workgroup Manager Utilization:
rst: Percent of CPC busy cycles spent dispatching workgroups to the :ref:`workgroup
manager <desc-spi>`.
unit: Percent
System Speed-of-Light:
Active CUs (deprecated):
rst: Total number of active compute units (CUs) on the accelerator during the
kernel execution. (Deprecated - See CU Utilization instead)
unit: Number
Branch Utilization:
rst: Indicates what percent of the kernel's duration the :ref:`branch <desc-branch>`
unit was busy executing instructions. Computed as the ratio of the total number
of cycles spent by the :ref:`scheduler <desc-scheduler>` issuing branch instructions
over the :ref:`total CU cycles <total-cu-cycles>`.
unit: Percent
IPC:
rst: The ratio of the total number of instructions executed on the :doc:`CU <compute-unit>`
over the :ref:`total active CU cycles <total-active-cu-cycles>`.
unit: Instructions per-cycle
L1I BW:
rst: The number of bytes looked up in the L1I cache per unit time. This is also
presented as a percent of the peak theoretical bandwidth achievable on the specific
accelerator.
unit: GB/s
L1I Fetch Latency:
rst: The average number of cycles spent to fetch instructions to a :doc:`CU <compute-unit>`.
unit: Cycles
L1I Hit Rate:
rst: The percent of L1I requests that hit on a previously loaded line in the cache.
Calculated as the ratio of the number of L1I requests that hit over the number
of all L1I requests.
unit: Percent
L2 Cache BW:
rst: The number of bytes looked up in the L2 cache per unit time. The number of
bytes is calculated as the number of cache lines requested multiplied by the
cache line size. This value does not consider partial requests, so e.g., if
only a single value is requested in a cache line, the data movement will still
be counted as a full cache line. This is also presented as a percent of the
peak theoretical bandwidth achievable on the specific accelerator.
unit: GB/s
L2 Cache Hit Rate:
rst: The ratio of the number of L2 cache line requests that hit in the L2 cache
over the total number of incoming cache line requests to the L2 cache.
unit: Percent
L2-Fabric Read BW:
rst: >-
The number of bytes read by the L2 over the :ref:`Infinity Fabric™
interface <l2-fabric>` per unit time. This is also presented as a percent
of the peak theoretical bandwidth achievable on the specific accelerator.
unit: GB/s
L2-Fabric Read Latency:
rst: The time-averaged number of cycles read requests spent in Infinity Fabric
before data was returned to the L2.
unit: Cycles
L2-Fabric Write BW:
rst: The number of bytes sent by the L2 over the :ref:`Infinity Fabric interface
<l2-fabric>` by write and atomic operations per unit time. This is also presented
as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
unit: GB/s
L2-Fabric Write Latency:
rst: The time-averaged number of cycles write requests spent in Infinity Fabric
before a completion acknowledgement was returned to the L2.
unit: Cycles
LDS Bank Conflicts/Access:
rst: The ratio of the number of cycles spent in the :doc:`LDS scheduler <local-data-share>`
due to bank conflicts (as determined by the conflict resolution hardware) to
the base number of cycles that would be spent in the LDS scheduler in a completely
uncontended case. This is also presented in normalized form (i.e., the Bank
Conflict Rate).
unit: Conflicts/Access
MFMA FLOPs (BF16):
rst: >-
The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any 16-bit brain
floating point operations from :ref:`VALU <desc-valu>` instructions. This
is also presented as a percent of the peak theoretical BF16 MFMA operations
achievable on the specific accelerator.
unit: GFLOPs
MFMA FLOPs (F16):
rst: >-
The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
as a percent of the peak theoretical F16 MFMA operations achievable on the
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F32):
rst: >-
The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 32-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
as a percent of the peak theoretical F32 MFMA operations achievable on the
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F64):
rst: >-
The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 64-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
as a percent of the peak theoretical F64 MFMA operations achievable on the
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F8):
rst: >-
The total number of 8-bit floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any 8-bit
floating point operations from :ref:`VALU <desc-valu>` instructions. This
is also presented as a percent of the peak theoretical F8 MFMA operations
achievable on the specific accelerator. It is supported on AMD Instinct MI300
series and later only.
unit: GFLOPs
MFMA IOPs (Int8):
rst: >-
The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
per second. Note: this does not include any 8-bit integer operations from
:ref:`VALU <desc-valu>` instructions. This is also presented as a percent
of the peak theoretical INT8 MFMA operations achievable on the specific accelerator.
unit: GIOPs
MFMA Utilization:
rst: Indicates what percent of the kernel's duration the :ref:`MFMA <desc-mfma>`
unit was busy executing instructions. Computed as the ratio of the total number
of cycles the :ref:`MFMA <desc-mfma>` unit was busy over the :ref:`total
CU cycles <total-cu-cycles>`.
unit: Percent
SALU Utilization:
rst: Indicates what percent of the kernel's duration the :ref:`SALU <desc-salu>`
was busy executing instructions. Computed as the ratio of the total number of
cycles spent by the :ref:`scheduler <desc-scheduler>` issuing SALU / :ref:`SMEM
<desc-smem>` instructions over the :ref:`total CU cycles <total-cu-cycles>`.
unit: Percent
Theoretical LDS Bandwidth:
rst: Indicates the maximum number of bytes that could have been loaded from, stored
to, or atomically updated in the LDS per unit time (see :ref:`LDS Bandwidth
<lds-bandwidth>` example for more detail). This is also presented as a percent
of the peak theoretical bandwidth achievable on the specific accelerator.
unit: GB/s
VALU Active Threads:
rst: Indicates the average level of :ref:`divergence <desc-divergence>` within
a wavefront over the lifetime of the kernel. The number of work-items that were
active in a wavefront during execution of each :ref:`VALU <desc-valu>` instruction,
time-averaged over all VALU instructions run on all wavefronts in the kernel.
unit: Work-items
VALU FLOPs:
rst: >-
The total floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is also presented as a percent of the peak theoretical
FLOPs achievable on the specific accelerator. Note: this does not include
any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU IOPs:
rst: >-
The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
This is also presented as a percent of the peak theoretical IOPs achievable
on the specific accelerator. Note: this does not include any integer operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GIOPs
VALU Utilization:
rst: Indicates what percent of the kernel's duration the :ref:`VALU <desc-valu>`
was busy executing instructions. Does not include :ref:`VMEM <desc-vmem>` operations.
Computed as the ratio of the total number of cycles spent by the :ref:`scheduler
<desc-scheduler>` issuing VALU instructions over the :ref:`total CU cycles <total-cu-cycles>`.
unit: Percent
Dual-issue VALU Utilization:
rst: Indicates what percent of the kernel's duration the :ref:`VALU <desc-valu>`
was busy executing dual-issued instructions. Computed as the ratio of the total number of
cycles spent by the scheduler co-issuing VALU instructions over the total
CU cycles.
unit: Percent
VMEM Utilization:
rst: Indicates what percent of the kernel's duration the :ref:`VMEM <desc-vmem>`
unit was busy executing instructions, including both global/generic and spill/scratch
operations (see the :ref:`VMEM instruction count metrics <ta-instruction-counts>`
for more detail). Does not include :ref:`VALU <desc-valu>` operations. Computed
as the ratio of the total number of cycles spent by the :ref:`scheduler <desc-scheduler>`
issuing VMEM instructions over the :ref:`total CU cycles <total-cu-cycles>`.
unit: Percent
Wavefront Occupancy:
rst: >-
The time-averaged number of wavefronts resident on the accelerator over
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1 ms). This is also presented as a percent of the peak theoretical
occupancy achievable on the specific accelerator.
unit: Wavefronts
sL1D Cache BW:
rst: The number of bytes looked up in the sL1D cache per unit time. This is also
presented as a percent of the peak theoretical bandwidth achievable on the specific
accelerator.
unit: GB/s
sL1D Cache Hit Rate:
rst: The percent of sL1D requests that hit on a previously loaded line in the cache.
Calculated as the ratio of the number of sL1D requests that hit over the number
of all sL1D requests.
unit: Percent
vL1D Cache BW:
rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM
<desc-vmem>` instructions per unit time. The number of bytes is calculated as
the number of cache lines requested multiplied by the cache line size. This
value does not consider partial requests, so e.g., if only a single value is
requested in a cache line, the data movement will still be counted as a full
cache line. This is also presented as a percent of the peak theoretical bandwidth
achievable on the specific accelerator.
unit: GB/s
vL1D Cache Hit Rate:
rst: The ratio of the number of vL1D cache line requests that hit in vL1D cache
over the total number of cache line requests to the :ref:`vL1D cache RAM <desc-tc>`.
unit: Percent
CU Utilization:
rst: The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel
where any :ref:`SIMD <desc-valu>` on a CU was actively doing any work, summed
over all CUs. Low values (less than 100%) indicate that the accelerator was
not fully saturated by the kernel, or a potential load-imbalance issue.
unit: Percent