Wavefront launch stats:
  AGPRs:
    rst: >-
      The number of accumulation vector general-purpose registers allocated
      for the kernel, see :ref:`AGPRs <desc-agprs>`. Note: this may not exactly match
      the number of AGPRs requested by the compiler due to allocation granularity.
    unit: AGPRs
  Grid Size:
    rst: The total number of work-items (or, threads) launched as part of the kernel
      dispatch. In HIP, this is equivalent to the total grid size (in blocks) multiplied
      by the total workgroup (or, block) size.
    unit: Work-Items
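As a rough illustration of the Grid Size description above, the following Python sketch computes the total work-item count from HIP-style launch dimensions. The function name and the 3-tuple representation of the grid and block dimensions are assumptions made for this example; they are not part of any profiler API.

```python
from math import prod


def total_work_items(grid_dim, block_dim):
    """Total work-items = (number of blocks) * (work-items per block).

    grid_dim and block_dim are hypothetical (x, y, z) tuples, mirroring
    the dimensions passed to a HIP kernel launch.
    """
    num_blocks = prod(grid_dim)   # blocks (workgroups) in the grid
    block_size = prod(block_dim)  # work-items in each block
    return num_blocks * block_size


# 4 x 2 x 1 blocks of 64 work-items each
print(total_work_items((4, 2, 1), (64, 1, 1)))
```

The result is what this file calls Grid Size, while `prod(block_dim)` alone corresponds to Workgroup Size.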
  LDS Allocation:
    rst: >-
      The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
      allocated for this kernel. Note: this may be larger than what was requested
      at compile time due to both allocation granularity and dynamic per-dispatch
      LDS allocations.
    unit: Bytes per workgroup
  Restored Wavefronts:
    rst: The total number of wavefronts restored from a context-save. See `cwsr_enable
      <https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
    unit: Wavefronts
  SGPRs:
    rst: >-
      The number of scalar general-purpose registers allocated for the kernel, see
      :ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
      SGPRs requested by the compiler due to allocation granularity.
    unit: SGPRs
  Saved Wavefronts:
    rst: The total number of wavefronts saved at a context-save. See `cwsr_enable
      <https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
    unit: Wavefronts
  Scratch Allocation:
    rst: The number of bytes of :ref:`scratch memory <memory-spaces>` requested per
      work-item for this kernel. Scratch memory is used for stack memory on the accelerator,
      as well as for register spills and restores.
    unit: Bytes per work-item
  Total Wavefronts:
    rst: >-
      The total number of wavefronts launched as part of the kernel dispatch.
      On AMD Instinct™ CDNA™ accelerators and GCN™ GPUs, the wavefront
      size is always 64 work-items. Thus, the total number of wavefronts should
      be equivalent to the ceiling of grid size divided by 64.
    unit: Wavefronts
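The relationship stated for Total Wavefronts can be checked with a few lines of Python. This is a minimal sketch of the formula as written above (ceiling of grid size divided by 64); the function name is hypothetical, and how partially filled workgroups are rounded in practice is not covered here.

```python
WAVEFRONT_SIZE = 64  # work-items per wavefront, per the description above


def expected_total_wavefronts(grid_size, wavefront_size=WAVEFRONT_SIZE):
    """Ceiling of grid_size / wavefront_size using integer arithmetic."""
    if grid_size < 0 or wavefront_size <= 0:
        raise ValueError("grid_size must be >= 0 and wavefront_size > 0")
    return -(-grid_size // wavefront_size)


print(expected_total_wavefronts(1024))
print(expected_total_wavefronts(1025))
```

Integer ceiling division is used instead of `math.ceil` on a float to avoid rounding surprises for very large grids.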
  VGPRs:
    rst: >-
      The number of architected vector general-purpose registers allocated for the
      kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
      number of VGPRs requested by the compiler due to allocation granularity.
    unit: VGPRs
  Workgroup Size:
    rst: The total number of work-items (or, threads) in each workgroup (or, block)
      launched as part of the kernel dispatch. In HIP, this is equivalent to the total
      block size.
    unit: Work-Items
Wavefront runtime stats:
  Active Cycles:
    rst: The average number of cycles a wavefront in the kernel dispatch was actively
      executing instructions per :ref:`normalization unit <normalization-units>`.
      This measurement is made on a per-wavefront basis, and may include cycles that
      another wavefront spent actively executing (on another execution unit, for example)
      or was stalled. As such, it is most useful to get a sense of how waves were
      spending their time, rather than identification of a precise limiter. The sum
      of this metric, Issue Wait Cycles and Dependency Wait Cycles should be equal to
      the total Wave Cycles metric.
    unit: Cycles per normalization unit
  Dependency Wait Cycles:
    rst: The number of cycles a wavefront in the kernel dispatch stalled waiting on
      memory of any kind (e.g., instruction fetch, vector or scalar memory, etc.)
      per :ref:`normalization unit <normalization-units>`. This counter is incremented
      at every cycle by *all* wavefronts on a CU stalled at a memory operation. As
      such, it is most useful to get a sense of how waves were spending their time,
      rather than identification of a precise limiter because another wave could be
      actively executing while a wave is stalled. The sum of this metric, Issue Wait
      Cycles and Active Cycles should be equal to the total Wave Cycles metric.
    unit: Cycles per normalization unit
  Instructions per wavefront:
    rst: The average number of instructions (of all types) executed per wavefront.
      This is averaged over all wavefronts in a kernel dispatch.
    unit: Instructions per wavefront
  Issue Wait Cycles:
    rst: The number of cycles a wavefront in the kernel dispatch was unable to issue
      an instruction for any reason (e.g., execution pipe back-pressure, arbitration
      loss, etc.) per :ref:`normalization unit <normalization-units>`. This counter
      is incremented at every cycle by *all* wavefronts on a CU unable to issue an
      instruction. As such, it is most useful to get a sense of how waves were spending
      their time, rather than identification of a precise limiter because another
      wave could be actively executing while a wave is issue stalled. The sum of this
      metric, Dependency Wait Cycles and Active Cycles should be equal to the total
      Wave Cycles metric.
    unit: Cycles per normalization unit
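The descriptions of Active Cycles, Dependency Wait Cycles and Issue Wait Cycles all state that the three metrics should sum to Wave Cycles. The Python sketch below shows one way to sanity-check that relationship and to turn the three values into fractions of wavefront time. The function names and the 5% tolerance are assumptions chosen for illustration only.

```python
import math


def wave_cycle_fractions(active, issue_wait, dependency_wait):
    """Return each component as a fraction of their sum."""
    total = active + issue_wait + dependency_wait
    if total == 0:
        return {"active": 0.0, "issue_wait": 0.0, "dependency_wait": 0.0}
    return {
        "active": active / total,
        "issue_wait": issue_wait / total,
        "dependency_wait": dependency_wait / total,
    }


def consistent_with_wave_cycles(active, issue_wait, dependency_wait,
                                wave_cycles, rel_tol=0.05):
    """Check that the three components add up to Wave Cycles.

    rel_tol is a hypothetical tolerance for counter sampling noise.
    """
    return math.isclose(active + issue_wait + dependency_wait,
                        wave_cycles, rel_tol=rel_tol)


print(wave_cycle_fractions(50, 30, 20))
print(consistent_with_wave_cycles(50, 30, 20, 100))
```

Because these counters describe how waves spent their time rather than a precise limiter, the fractions are best read as a qualitative breakdown.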
  Kernel Time:
    rst: The total duration of the executed kernel.
    unit: Nanoseconds
  Kernel Time (Cycles):
    rst: The total duration of the executed kernel in cycles.
    unit: Cycles
  Wave Cycles:
    rst: >-
      The number of cycles a wavefront in the kernel dispatch spent resident
      on a compute unit per :ref:`normalization unit <normalization-units>`. This is
      averaged over all wavefronts in a kernel dispatch. Note: this should not
      be directly compared to the Kernel Time (Cycles) metric above.
    unit: Cycles per normalization unit
  Wavefront Occupancy:
    rst: >-
      The time-averaged number of wavefronts resident on the accelerator over the
      lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1 ms).
    unit: Wavefronts
Overall instruction mix:
  Branch:
    rst: The total number of branch operations issued. These typically consist of
      jump or branch operations and are used to implement control flow.
    unit: Instructions
  LDS:
    rst: The total number of LDS (also known as shared memory) operations issued.
      These include loads, stores, atomics, and HIP's ``__shfl`` operations.
    unit: Instructions
  MFMA:
    rst: The total number of matrix fused multiply-add instructions issued.
    unit: Instructions
  SALU:
    rst: The total number of scalar arithmetic logic unit (SALU) operations issued.
      Typically these are used for address calculations, literal constants, and other
      operations that are provably uniform across a wavefront. Although scalar memory
      (SMEM) operations are issued by the SALU, they are counted separately in this
      section.
    unit: Instructions
  SMEM:
    rst: The total number of scalar memory (SMEM) operations issued. These are typically
      used for loading kernel arguments, base-pointers and loads from HIP's ``__constant__``
      memory.
    unit: Instructions
  VALU:
    rst: The total number of vector arithmetic logic unit (VALU) operations issued.
      These are the workhorses of the :doc:`compute unit <compute-unit>`, and are
      used to execute a wide range of instruction types including floating point operations,
      non-uniform address calculations, transcendental operations, integer operations,
      shifts, conditional evaluation, etc.
    unit: Instructions
  VMEM:
    rst: The total number of vector memory operations issued. These include most loads,
      stores and atomic operations and all accesses to :ref:`generic, global, private
      and texture <memory-spaces>` memory.
    unit: Instructions
VALU arithmetic instruction mix:
  Conversion:
    rst: >-
      The total number of type conversion instructions (such as converting data
      between F32 and F64) issued to the VALU per :ref:`normalization unit
      <normalization-units>`.
    unit: Instructions per normalization unit
  F16-ADD:
    rst: The total number of addition instructions operating on 16-bit floating-point
      operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F16-FMA:
    rst: The total number of fused multiply-add instructions operating on 16-bit floating-point
      operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F16-MUL:
    rst: The total number of multiplication instructions operating on 16-bit floating-point
      operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F16-Trans:
    rst: The total number of transcendental instructions (such as ``sqrt``) operating
      on 16-bit floating-point operands issued to the VALU per :ref:`normalization
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  F32-ADD:
    rst: The total number of addition instructions operating on 32-bit floating-point
      operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F32-FMA:
    rst: The total number of fused multiply-add instructions operating on 32-bit floating-point
      operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F32-MUL:
    rst: The total number of multiplication instructions operating on 32-bit floating-point
      operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F32-Trans:
    rst: The total number of transcendental instructions (such as ``sqrt``) operating
      on 32-bit floating-point operands issued to the VALU per :ref:`normalization
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  F64-ADD:
    rst: The total number of addition instructions operating on 64-bit floating-point
      operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F64-FMA:
    rst: The total number of fused multiply-add instructions operating on 64-bit floating-point
      operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F64-MUL:
    rst: The total number of multiplication instructions operating on 64-bit floating-point
      operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  F64-Trans:
    rst: The total number of transcendental instructions (such as ``sqrt``) operating
      on 64-bit floating-point operands issued to the VALU per :ref:`normalization
      unit <normalization-units>`.
    unit: Instructions per normalization unit
  INT32:
    rst: The total number of instructions operating on 32-bit integer operands issued
      to the VALU per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  INT64:
    rst: The total number of instructions operating on 64-bit integer operands issued
      to the VALU per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
MFMA instruction mix:
  MFMA-BF16:
    rst: The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>` instructions
      issued per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  MFMA-F16:
    rst: The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` instructions
      issued per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  MFMA-F32:
    rst: The total number of 32-bit floating-point :ref:`MFMA <desc-mfma>` instructions
      issued per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  MFMA-F64:
    rst: The total number of 64-bit floating-point :ref:`MFMA <desc-mfma>` instructions
      issued per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
  MFMA-F8:
    rst: The total number of 8-bit floating point :ref:`MFMA <desc-mfma>` instructions
      issued per :ref:`normalization unit <normalization-units>`. This is supported
      in AMD Instinct MI300 series and later only.
    unit: Instructions per normalization unit
  MFMA-I8:
    rst: The total number of 8-bit integer :ref:`MFMA <desc-mfma>` instructions issued
      per :ref:`normalization unit <normalization-units>`.
    unit: Instructions per normalization unit
Compute Speed-of-Light:
  MFMA FLOPs (BF16):
    rst: >-
      The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit brain floating
      point operations from :ref:`VALU <desc-valu>` instructions. This is also
      presented as a percent of the peak theoretical BF16 MFMA operations achievable
      on the specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F16):
    rst: >-
      The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 16-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
      as a percent of the peak theoretical F16 MFMA operations achievable on the
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F32):
    rst: >-
      The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 32-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
      as a percent of the peak theoretical F32 MFMA operations achievable on the
      specific accelerator.
    unit: GFLOPs
  MFMA FLOPs (F64):
    rst: >-
      The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
      executed per second. Note: this does not include any 64-bit floating point
      operations from :ref:`VALU <desc-valu>` instructions. This is also presented
      as a percent of the peak theoretical F64 MFMA operations achievable on the
      specific accelerator.
    unit: GFLOPs
  MFMA IOPs (INT8):
    rst: >-
      The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
      per second. Note: this does not include any 8-bit integer operations from
      :ref:`VALU <desc-valu>` instructions. This is also presented as a percent
      of the peak theoretical INT8 MFMA operations achievable on the specific accelerator.
    unit: GIOPs
  VALU FLOPs:
    rst: >-
      The total floating-point operations executed per second on the :ref:`VALU
      <desc-valu>`. This is also presented as a percent of the peak theoretical
      FLOPs achievable on the specific accelerator. Note: this does not include
      any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
    unit: GFLOPs
  VALU IOPs:
    rst: >-
      The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
      This is also presented as a percent of the peak theoretical IOPs achievable
      on the specific accelerator. Note: this does not include any integer operations
      from :ref:`MFMA <desc-mfma>` instructions.
    unit: GIOPs
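Each Compute Speed-of-Light metric above is a rate that is also reported as a percent of a theoretical peak. The following Python sketch shows the arithmetic under stated assumptions: the operation count and kernel time are taken as given inputs, and the peak value is a made-up placeholder, since real peaks depend on the specific accelerator. The function names are hypothetical.

```python
def giga_ops_per_second(total_ops, kernel_time_ns):
    """Operations per nanosecond are numerically equal to giga-operations per second."""
    if kernel_time_ns <= 0:
        raise ValueError("kernel_time_ns must be positive")
    return total_ops / kernel_time_ns


def percent_of_peak(achieved, peak):
    """Express an achieved rate as a percent of a theoretical peak (same units)."""
    if peak <= 0:
        raise ValueError("peak must be positive")
    return 100.0 * achieved / peak


# Placeholder inputs: 2,000,000 operations in 1,000 ns against a
# hypothetical peak of 8,000 giga-operations per second.
achieved = giga_ops_per_second(2_000_000, 1_000)
print(achieved, percent_of_peak(achieved, 8_000.0))
```

Using nanoseconds for the duration matches the unit of the Kernel Time metric, which keeps the conversion to giga-operations per second free of extra scale factors.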
Pipeline statistics:
  Branch Utilization:
    rst: Indicates what percent of the kernel's duration the :ref:`branch <desc-branch>`
      unit was busy executing instructions. Computed as the ratio of the total number
      of cycles spent by the :ref:`scheduler <desc-scheduler>` issuing branch instructions
      over the :ref:`total CU cycles <total-cu-cycles>`.
    unit: Percent
  IPC:
    rst: The ratio of the total number of instructions executed on the :doc:`CU <compute-unit>`
      over the :ref:`total active CU cycles <total-active-cu-cycles>`.
    unit: Instructions per cycle
  IPC (Issued):
    rst: The ratio of the total number of (non-:ref:`internal <ipc-internal-instructions>`)
      instructions issued over the number of cycles where the :ref:`scheduler <desc-scheduler>`
      was actively working on issuing instructions. Refer to the :ref:`Issued IPC
      <issued-ipc>` example for further detail.
    unit: Instructions per cycle
  MFMA Instruction Cycles:
    rst: The average duration of :ref:`MFMA <desc-mfma>` instructions in this kernel
      in cycles. Computed as the ratio of the total number of cycles the MFMA unit
      was busy over the total number of MFMA instructions. Compare to, for example,
      the `AMD Matrix Instruction Calculator <https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator>`_.
    unit: Cycles per instruction
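The ratio described for MFMA Instruction Cycles can be written out as a small Python helper. This is only a sketch of the stated formula; the parameter names stand in for whatever raw counters are available and are not real counter names.

```python
def mfma_cycles_per_instruction(mfma_busy_cycles, mfma_instructions):
    """Average MFMA instruction duration: busy cycles / instruction count."""
    if mfma_instructions == 0:
        return 0.0  # no MFMA instructions were issued
    return mfma_busy_cycles / mfma_instructions


# Hypothetical counter values
print(mfma_cycles_per_instruction(4096, 256))
```

A result close to the documented pass count of the instructions in use suggests the MFMA unit is not being held up; a much larger value is worth investigating.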
  MFMA Utilization:
    rst: Indicates what percent of the kernel's duration the :ref:`MFMA <desc-mfma>`
      unit was busy executing instructions. Computed as the ratio of the total number
      of cycles the :ref:`MFMA <desc-mfma>` unit was busy over the :ref:`total
      CU cycles <total-cu-cycles>`.
    unit: Percent
  SALU Utilization:
    rst: Indicates what percent of the kernel's duration the :ref:`SALU <desc-salu>`
      was busy executing instructions. Computed as the ratio of the total number of
      cycles spent by the :ref:`scheduler <desc-scheduler>` issuing SALU / :ref:`SMEM
      <desc-smem>` instructions over the :ref:`total CU cycles <total-cu-cycles>`.
    unit: Percent
  SMEM Latency:
    rst: The average number of round-trip cycles (that is, from issue to data return
      / acknowledgment) required for a SMEM instruction to complete.
    unit: Cycles
  VALU Active Threads:
    rst: Indicates the average level of :ref:`divergence <desc-divergence>` within
      a wavefront over the lifetime of the kernel. The number of work-items that were
      active in a wavefront during execution of each :ref:`VALU <desc-valu>` instruction,
      time-averaged over all VALU instructions run on all wavefronts in the kernel.
    unit: Work-items
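To make the VALU Active Threads description more concrete, the Python sketch below averages the number of active lanes over a list of execution masks, one mask per VALU instruction. It is a simplified model: every instruction is weighted equally, whereas the metric is described as time-averaged, and the list of 64-bit masks is a hypothetical input rather than something the profiler exposes.

```python
def average_active_threads(exec_masks):
    """Mean number of set bits across per-instruction execution masks.

    exec_masks: iterable of non-negative integers, where bit i set means
    work-item i was active for that VALU instruction (assumed input format).
    """
    masks = list(exec_masks)
    if not masks:
        return 0.0
    active_lanes = sum(bin(mask).count("1") for mask in masks)
    return active_lanes / len(masks)


full_wave = (1 << 64) - 1  # all 64 work-items active
half_wave = (1 << 32) - 1  # only the lower 32 work-items active
print(average_active_threads([full_wave, half_wave]))
```

A value near 64 indicates little divergence, while lower values mean that many VALU instructions ran with part of the wavefront masked off.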
  VALU Utilization:
    rst: Indicates what percent of the kernel's duration the :ref:`VALU <desc-valu>`
      was busy executing instructions. Does not include :ref:`VMEM <desc-vmem>` operations.
      Computed as the ratio of the total number of cycles spent by the :ref:`scheduler
      <desc-scheduler>` issuing VALU instructions over the :ref:`total CU cycles <total-cu-cycles>`.
    unit: Percent
  VMEM Latency:
    rst: The average number of round-trip cycles (that is, from issue to data return
      / acknowledgment) required for a VMEM instruction to complete.
    unit: Cycles
  VMEM Utilization:
    rst: Indicates what percent of the kernel's duration the :ref:`VMEM <desc-vmem>`
      unit was busy executing instructions, including both global/generic and spill/scratch
      operations (see the :ref:`VMEM instruction count metrics <ta-instruction-counts>`
      for more detail). Does not include :ref:`VALU <desc-valu>` operations. Computed
      as the ratio of the total number of cycles spent by the :ref:`scheduler <desc-scheduler>`
      issuing VMEM instructions over the :ref:`total CU cycles <total-cu-cycles>`.
    unit: Percent
Arithmetic operations:
  BF16 OPs:
    rst: >-
      The total number of 16-bit brain floating-point operations executed on
      either the :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
      unit <normalization-units>`. Note: on current CDNA accelerators, the VALU
      has no native BF16 instructions.
    unit: FLOP per normalization unit
  F16 OPs:
    rst: The total number of 16-bit floating-point operations executed on either the
      :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
      unit <normalization-units>`.
    unit: FLOP per normalization unit
  F32 OPs:
    rst: The total number of 32-bit floating-point operations executed on either the
      :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
      unit <normalization-units>`.
    unit: FLOP per normalization unit
  F64 OPs:
    rst: The total number of 64-bit floating-point operations executed on either the
      :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
      unit <normalization-units>`.
    unit: FLOP per normalization unit
  FLOPs (Total):
    rst: The total number of floating-point operations executed on either the :ref:`VALU
      <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
      <normalization-units>`.
    unit: FLOP per normalization unit
  INT8 OPs:
    rst: >-
      The total number of 8-bit integer operations executed on either the :ref:`VALU
      <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
      <normalization-units>`. Note: on current CDNA accelerators, the VALU has
      no native INT8 instructions.
    unit: IOP per normalization unit
  IOPs (Total):
    rst: The total number of integer operations executed on either the :ref:`VALU
      <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
      <normalization-units>`.
    unit: IOP per normalization unit
LDS Speed-of-Light:
  Access Rate:
    rst: Indicates the percentage of SIMDs in the :ref:`VALU <desc-valu>` [#lds-workload]_
      actively issuing LDS instructions, averaged over the lifetime of the kernel.
      Calculated as the ratio of the total number of cycles spent by the :ref:`scheduler
      <desc-scheduler>` issuing :ref:`LDS <desc-lds>` instructions over the :ref:`total
      CU cycles <total-cu-cycles>`.
    unit: Percent
  Bank Conflict Rate:
    rst: Indicates the percentage of active LDS cycles that were spent servicing bank
      conflicts. Calculated as the ratio of LDS cycles spent servicing bank conflicts
      over the number of LDS cycles that would have been required to move the same
      amount of data in an uncontended access. [#lds-bank-conflict]_
    unit: Percent
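The Bank Conflict Rate calculation described above is a simple ratio, sketched here in Python. The two inputs are hypothetical names for the cycle counts mentioned in the description (cycles spent servicing bank conflicts, and cycles an uncontended access would have needed); they are not actual hardware counter names.

```python
def bank_conflict_rate(conflict_cycles, uncontended_cycles):
    """Percent: conflict cycles over cycles needed in the uncontended case."""
    if uncontended_cycles == 0:
        return 0.0  # no LDS traffic, so no conflicts to report
    return 100.0 * conflict_cycles / uncontended_cycles


# Hypothetical cycle counts
print(bank_conflict_rate(25, 100))
```

The same ratio without the factor of 100 corresponds to what this file lists as Bank Conflicts/Access.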
  Theoretical Bandwidth Utilization:
    rst: Indicates the maximum number of bytes that could have been loaded from, stored
      to, or atomically updated in the LDS, expressed as a percentage of the theoretical peak.
      Does *not* take into account the execution mask of the wavefront when the instruction
      was executed. See the :ref:`LDS bandwidth example <lds-bandwidth>` for more
      detail.
    unit: Percent
  Utilization:
    rst: Indicates what percent of the kernel's duration the :ref:`LDS <desc-lds>`
      was actively executing instructions (including, but not limited to, load, store,
      atomic and HIP's ``__shfl`` operations). Calculated as the ratio of the total
      number of cycles LDS was active over the :ref:`total CU cycles <total-cu-cycles>`.
    unit: Percent
LDS Statistics:
  Addr Conflict:
    rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>` due
      to address conflicts (as determined by the conflict resolution hardware) per
      :ref:`normalization unit <normalization-units>`.
    unit: Cycles per normalization unit
  Atomic Return Cycles:
    rst: The total number of cycles spent on LDS atomics with return per :ref:`normalization
      unit <normalization-units>`.
    unit: Cycles per normalization unit
  Bank Conflict:
    rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>` due
      to bank conflicts (as determined by the conflict resolution hardware) per :ref:`normalization
      unit <normalization-units>`.
    unit: Cycles per normalization unit
|
2025-07-25 14:01:34 -04:00
|
|
|
Bank Conflicts/Access:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The ratio of the number of cycles spent in the :ref:`LDS scheduler <desc-lds>`
|
|
|
|
|
due to bank conflicts (as determined by the conflict resolution hardware) to
|
|
|
|
|
the base number of cycles that would be spent in the LDS scheduler in a completely
|
|
|
|
|
uncontended case. This is the unnormalized form of the Bank Conflict Rate.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Conflicts per Access
|
2025-08-01 13:56:29 -04:00
|
|
|
Index Accesses:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>` over
|
2025-08-01 13:56:29 -04:00
|
|
|
all operations per :ref:`normalization unit <normalization-units>`.
|
|
|
|
|
unit: Cycles per normalization unit
|
2025-08-01 13:33:58 -04:00
|
|
|
LDS Instructions:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of LDS instructions (including, but not limited to, read/write/atomics
|
|
|
|
|
and HIP's ``__shfl`` instructions) executed per :ref:`normalization unit <normalization-units>`.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Instructions per normalization unit
|
|
|
|
|
LDS Latency:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The average number of round-trip cycles (i.e., from issue to data-return
|
2025-08-01 13:33:58 -04:00
|
|
|
acknowledgment) required for an LDS instruction to complete.
|
|
|
|
|
unit: Cycles
|
2025-07-25 14:01:34 -04:00
|
|
|
Mem Violations:
|
2025-11-19 10:46:02 -05:00
|
|
|
rst: >-
|
2025-10-22 15:17:43 -04:00
|
|
|
The total number of out-of-bounds accesses made to the LDS, per :ref:`normalization
|
|
|
|
|
unit <normalization-units>`. This is unused and expected to be zero in
|
|
|
|
|
most configurations for modern CDNA™ accelerators.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Accesses per normalization unit
|
2025-08-01 13:56:29 -04:00
|
|
|
Theoretical Bandwidth:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: Indicates the maximum number of bytes that could have been loaded from, stored
|
2025-08-06 18:39:50 -04:00
|
|
|
to, or atomically updated in the LDS divided by total duration. Does *not* take
|
2025-10-22 15:17:43 -04:00
|
|
|
into account the execution mask of the wavefront when the instruction was executed.
|
|
|
|
|
See the :ref:`LDS bandwidth example <lds-bandwidth>` for more detail.
|
2025-08-06 18:39:50 -04:00
|
|
|
unit: Gbps
|
2025-08-01 13:56:29 -04:00
|
|
|
Unaligned Stall:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>` due
|
|
|
|
|
to stalls from non-dword aligned addresses per :ref:`normalization unit <normalization-units>`.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Cycles per normalization unit
|
|
|
|
|
vL1D Speed-of-Light:
|
2025-08-06 18:39:50 -04:00
|
|
|
Bandwidth Utilization:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM
|
|
|
|
|
<desc-vmem>` instructions, as a percent of the peak theoretical bandwidth achievable
|
|
|
|
|
on the specific accelerator. The number of bytes is calculated as the number
|
|
|
|
|
of cache lines requested multiplied by the cache line size. This value does
|
|
|
|
|
not consider partial requests, so for instance, if only a single value is requested
|
|
|
|
|
in a cache line, the data movement will still be counted as a full cache line.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:33:58 -04:00
|
|
|
Coalescing:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: Indicates how well memory instructions were coalesced by the :ref:`address
|
|
|
|
|
processing unit <desc-ta>`, ranging from uncoalesced (25%) to fully coalesced
|
|
|
|
|
(100%). Calculated as the average number of :ref:`thread-requests <thread-requests>`
|
|
|
|
|
generated per instruction divided by the ideal number of thread-requests per
|
2025-08-01 13:33:58 -04:00
|
|
|
instruction.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:56:29 -04:00
|
|
|
Hit rate:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The ratio of the number of vL1D cache line requests that hit [#vl1d-hit]_
|
|
|
|
|
in vL1D cache over the total number of cache line requests to the :ref:`vL1D
|
|
|
|
|
Cache RAM <desc-tc>`.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:33:58 -04:00
|
|
|
Utilization:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: Indicates how busy the :ref:`vL1D Cache RAM <desc-tc>` was during the kernel
|
|
|
|
|
execution. The number of cycles where the vL1D Cache RAM is actively processing
|
|
|
|
|
any request divided by the number of cycles where the vL1D is active [#vl1d-activity]_.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:33:58 -04:00
|
|
|
Busy / stall metrics:
|
|
|
|
|
Address Processing Unit Busy:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address processor
|
2025-08-01 13:33:58 -04:00
|
|
|
was busy.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:33:58 -04:00
|
|
|
Address Stall:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address processor
|
|
|
|
|
was stalled from sending address requests further into the vL1D pipeline.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:56:29 -04:00
|
|
|
Data Stall:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address processor
|
|
|
|
|
was stalled from sending write/atomic data further into the vL1D pipeline.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Percent
|
|
|
|
|
"Data-Processor \u2192 Address Stall":
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address processor
|
|
|
|
|
was stalled waiting to send command data to the :ref:`data processor <desc-td>`.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:33:58 -04:00
|
|
|
Instruction counts:
|
|
|
|
|
Global/Generic Atomic Instructions:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of global & generic memory atomic (with and without return)
|
|
|
|
|
instructions executed on all :doc:`compute units <compute-unit>` on the accelerator,
|
2025-08-01 13:33:58 -04:00
|
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
|
|
|
unit: Instructions per normalization unit
|
|
|
|
|
Global/Generic Instructions:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of global & generic memory instructions executed on all
|
|
|
|
|
:doc:`compute units <compute-unit>` on the accelerator, per :ref:`normalization
|
|
|
|
|
unit <normalization-units>`.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Instructions per normalization unit
|
2025-08-01 13:56:29 -04:00
|
|
|
Global/Generic Read Instructions:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of global & generic memory read instructions executed on
|
|
|
|
|
all :doc:`compute units <compute-unit>` on the accelerator, per :ref:`normalization
|
2025-08-01 13:56:29 -04:00
|
|
|
unit <normalization-units>`.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Instructions per normalization unit
|
|
|
|
|
Global/Generic Write Instructions:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of global & generic memory write instructions executed on
|
|
|
|
|
all :doc:`compute units <compute-unit>` on the accelerator, per :ref:`normalization
|
2025-08-01 13:33:58 -04:00
|
|
|
unit <normalization-units>`.
|
|
|
|
|
unit: Instructions per normalization unit
|
|
|
|
|
Spill/Stack Atomic Instructions:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of spill/stack memory atomic (with and without return) instructions
|
|
|
|
|
executed on all :doc:`compute units <compute-unit>` on the accelerator, per
|
|
|
|
|
:ref:`normalization unit <normalization-units>`. Generally unused, since these memory
|
|
|
|
|
operations are typically used to implement thread-local storage.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Instructions per normalization unit
|
|
|
|
|
Spill/Stack Instructions:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of spill/stack memory instructions executed on all :doc:`compute
|
|
|
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Instructions per normalization unit
|
2025-08-01 13:56:29 -04:00
|
|
|
Spill/Stack Read Instructions:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of spill/stack memory read instructions executed on all
|
|
|
|
|
:doc:`compute units <compute-unit>` on the accelerator, per :ref:`normalization
|
|
|
|
|
unit <normalization-units>`.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Instructions per normalization unit
|
|
|
|
|
Spill/Stack Write Instructions:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of spill/stack memory write instructions executed on all
|
|
|
|
|
:doc:`compute units <compute-unit>` on the accelerator, per :ref:`normalization
|
|
|
|
|
unit <normalization-units>`.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Instructions per normalization unit
|
2025-08-01 13:33:58 -04:00
|
|
|
Total Instructions:
|
|
|
|
|
rst: The total number of memory instructions executed by the address processor
|
|
|
|
|
over all compute units on the accelerator, per normalization unit.
|
|
|
|
|
unit: Instructions per normalization unit
|
|
|
|
|
Spill / stack metrics:
|
|
|
|
|
Spill/Stack Coalesced Read:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The number of cycles the address processing unit spent working on coalesced
|
|
|
|
|
spill/stack read instructions, per :ref:`normalization unit <normalization-units>`.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Cycles per normalization unit
|
|
|
|
|
Spill/Stack Coalesced Write:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The number of cycles the address processing unit spent working on coalesced
|
|
|
|
|
spill/stack write instructions, per :ref:`normalization unit <normalization-units>`.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Cycles per normalization unit
|
2025-08-01 13:56:29 -04:00
|
|
|
Spill/Stack Total Cycles:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The number of cycles the address processing unit spent working on spill/stack
|
|
|
|
|
instructions, per :ref:`normalization unit <normalization-units>`.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Cycles per normalization unit
|
2025-08-01 13:33:58 -04:00
|
|
|
L1 Unified Translation Cache (UTCL1):
|
|
|
|
|
Hit Ratio:
|
|
|
|
|
rst: The ratio of the number of translation requests that hit in the UTCL1 divided
|
|
|
|
|
by the total number of translation requests made to the UTCL1.
|
|
|
|
|
unit: Percent
|
|
|
|
|
Hits:
|
|
|
|
|
rst: The number of translation requests that hit in the UTCL1, and could be reused,
|
|
|
|
|
per normalization unit.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Requests per normalization unit
|
2025-08-01 13:56:29 -04:00
|
|
|
Permission Misses:
|
2025-11-19 10:46:02 -05:00
|
|
|
rst: >-
|
2025-10-22 15:17:43 -04:00
|
|
|
The total number of translation requests that missed in the UTCL1 due
|
|
|
|
|
to a permission error, per :ref:`normalization unit <normalization-units>`.
|
|
|
|
|
This is unused and expected to be zero in most configurations for modern
|
|
|
|
|
CDNA™ accelerators.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Requests per normalization unit
|
2025-08-01 13:33:58 -04:00
|
|
|
Req:
|
|
|
|
|
rst: The number of translation requests made to the UTCL1 per normalization unit.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Requests per normalization unit
|
2025-08-01 13:33:58 -04:00
|
|
|
Translation Misses:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of translation requests that missed in the UTCL1 due to
|
|
|
|
|
translation not being present in the cache, per :ref:`normalization unit <normalization-units>`.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Requests per normalization unit
|
|
|
|
|
vL1D cache stall metrics:
|
2025-08-01 13:56:29 -04:00
|
|
|
Stalled on L2 Data:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The ratio of the number of cycles where the vL1D is stalled waiting for requested
|
|
|
|
|
data to return from the :doc:`L2 cache <l2-cache>` divided by the number of
|
2025-08-01 13:56:29 -04:00
|
|
|
cycles where the vL1D is active [#vl1d-activity]_.
|
|
|
|
|
unit: Percent
|
2025-08-01 13:33:58 -04:00
|
|
|
Stalled on L2 Req:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The ratio of the number of cycles where the vL1D is stalled waiting to issue
|
|
|
|
|
a request for data to the :doc:`L2 cache <l2-cache>` divided by the number of
|
|
|
|
|
cycles where the vL1D is active [#vl1d-activity]_.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:33:58 -04:00
|
|
|
Tag RAM Stall (Atomic):
|
|
|
|
|
rst: The ratio of the number of cycles where the vL1D is stalled due to Atomic
|
2025-10-22 15:17:43 -04:00
|
|
|
requests with conflicting tags being looked up concurrently, divided by the
|
|
|
|
|
number of cycles where the vL1D is active [#vl1d-activity]_.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:56:29 -04:00
|
|
|
Tag RAM Stall (Read):
|
|
|
|
|
rst: The ratio of the number of cycles where the vL1D is stalled due to Read requests
|
2025-10-22 15:17:43 -04:00
|
|
|
with conflicting tags being looked up concurrently, divided by the number of
|
|
|
|
|
cycles where the vL1D is active [#vl1d-activity]_.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:33:58 -04:00
|
|
|
Tag RAM Stall (Write):
|
|
|
|
|
rst: The ratio of the number of cycles where the vL1D is stalled due to Write
|
2025-10-22 15:17:43 -04:00
|
|
|
requests with conflicting tags being looked up concurrently, divided by the
|
|
|
|
|
number of cycles where the vL1D is active [#vl1d-activity]_.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:33:58 -04:00
|
|
|
vL1D cache access metrics:
|
2025-08-01 13:56:29 -04:00
|
|
|
Atomic Req:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of incoming atomic requests from the :ref:`address processing
|
|
|
|
|
unit <desc-ta>` after coalescing per :ref:`normalization unit <normalization-units>`.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Requests per normalization unit
|
|
|
|
|
Cache Accesses:
|
|
|
|
|
rst: The total number of cache line lookups in the vL1D.
|
|
|
|
|
unit: Cache lines
|
2025-08-01 13:33:58 -04:00
|
|
|
Cache BW:
|
2025-08-06 18:39:50 -04:00
|
|
|
rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM
|
|
|
|
|
<desc-vmem>` instructions divided by total duration. The number of bytes is
|
|
|
|
|
calculated as the number of cache lines requested multiplied by the cache line
|
2025-10-22 15:17:43 -04:00
|
|
|
size. This value does not consider partial requests, so for instance, if only
|
2025-08-06 18:39:50 -04:00
|
|
|
a single value is requested in a cache line, the data movement will still be
|
|
|
|
|
counted as a full cache line.
|
|
|
|
|
unit: Gbps
|
2025-08-01 13:33:58 -04:00
|
|
|
Cache Hit Rate:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The ratio of the number of vL1D cache line requests that hit in vL1D cache
|
|
|
|
|
over the total number of cache line requests to the :ref:`vL1D Cache RAM <desc-tc>`.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:56:29 -04:00
|
|
|
Cache Hits:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The number of cache accesses minus the number of outgoing requests to the
|
|
|
|
|
:doc:`L2 cache <l2-cache>`, that is, the number of cache line requests serviced
|
|
|
|
|
by the :ref:`vL1D Cache RAM <desc-tc>` per :ref:`normalization unit <normalization-units>`.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Cache lines per normalization unit
|
|
|
|
|
Invalidations:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The number of times the vL1D was issued a write-back invalidate command during
|
|
|
|
|
the kernel's execution per :ref:`normalization unit <normalization-units>`.
|
|
|
|
|
This may be triggered by, for instance, the ``buffer_wbinvl1`` instruction.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Invalidations per normalization unit
|
|
|
|
|
L1 Access Latency:
|
|
|
|
|
rst: Calculated as the average number of cycles that a vL1D cache line request
|
|
|
|
|
spent in the vL1D cache pipeline.
|
|
|
|
|
unit: Cycles
|
|
|
|
|
L1-L2 Atomic:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The number of atomic requests that are sent through the vL1D to the :doc:`L2
|
|
|
|
|
cache <l2-cache>`, per :ref:`normalization unit <normalization-units>`. This
|
|
|
|
|
includes requests for atomics with, and without return.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Requests per normalization unit
|
|
|
|
|
L1-L2 BW:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The number of bytes transferred across the vL1D-L2 interface as a result
|
|
|
|
|
of :ref:`VMEM <desc-vmem>` instructions, divided by total duration. The number
|
|
|
|
|
of bytes is calculated as the number of cache lines requested multiplied by
|
|
|
|
|
the cache line size. This value does not consider partial requests, so for instance,
|
|
|
|
|
if only a single value is requested in a cache line, the data movement will
|
2025-08-01 13:33:58 -04:00
|
|
|
still be counted as a full cache line.
|
2025-08-06 18:39:50 -04:00
|
|
|
unit: Gbps
|
2025-08-01 13:56:29 -04:00
|
|
|
L1-L2 Read:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The number of read requests for a vL1D cache line that were not satisfied
|
|
|
|
|
by the vL1D and must be retrieved from the :doc:`L2 Cache <l2-cache>`
|
|
|
|
|
per :ref:`normalization unit <normalization-units>`.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Requests per normalization unit
|
2025-08-01 13:33:58 -04:00
|
|
|
L1-L2 Read Latency:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: Calculated as the average number of cycles that the vL1D cache took to issue
|
|
|
|
|
and receive read requests from the :doc:`L2 Cache <l2-cache>`. This number also
|
|
|
|
|
includes requests for atomics with return values.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Cycles
|
2025-08-01 13:56:29 -04:00
|
|
|
L1-L2 Write:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The number of write requests to a vL1D cache line that were sent through
|
|
|
|
|
the vL1D to the :doc:`L2 cache <l2-cache>`, per :ref:`normalization unit <normalization-units>`.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Requests per normalization unit
|
2025-08-01 13:33:58 -04:00
|
|
|
L1-L2 Write Latency:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: Calculated as the average number of cycles that the vL1D cache took to issue
|
|
|
|
|
and receive acknowledgement of a write request to the :doc:`L2 Cache <l2-cache>`.
|
|
|
|
|
This number also includes requests for atomics without return values.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Cycles
|
2025-08-01 13:56:29 -04:00
|
|
|
Read Req:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of incoming read requests from the :ref:`address processing
|
|
|
|
|
unit <desc-ta>` after coalescing per :ref:`normalization unit <normalization-units>`.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Requests per normalization unit
|
|
|
|
|
Total Req:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of incoming requests from the :ref:`address processing unit
|
|
|
|
|
<desc-ta>` after coalescing.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Requests
|
2025-08-01 13:33:58 -04:00
|
|
|
Write Req:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of incoming write requests from the :ref:`address processing
|
|
|
|
|
unit <desc-ta>` after coalescing per :ref:`normalization unit <normalization-units>`.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Requests per normalization unit
|
|
|
|
|
Vector L1 data-return path or Texture Data (TD):
|
2025-08-01 13:56:29 -04:00
|
|
|
Atomic Instructions:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The number of atomic instructions submitted to the :ref:`data-return unit
|
|
|
|
|
<desc-td>` by the :ref:`address processor <desc-ta>` summed over all :doc:`compute
|
|
|
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
|
|
|
This is expected to be the sum of global/generic and spill/stack atomics in
|
|
|
|
|
the :ref:`address processor <desc-ta>`.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Instructions per normalization unit
|
|
|
|
|
"Cache RAM \u2192 Data-Return Stall":
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return unit
|
|
|
|
|
was stalled on data to be returned from the :ref:`vL1D Cache RAM <desc-tc>`.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:33:58 -04:00
|
|
|
Coalescable Instructions:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The number of instructions submitted to the :ref:`data-return unit <desc-td>`
|
|
|
|
|
by the :ref:`address processor <desc-ta>` that were found to be coalescable,
|
|
|
|
|
per :ref:`normalization unit <normalization-units>`.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Instructions per normalization unit
|
2025-08-01 13:56:29 -04:00
|
|
|
Data-Return Busy:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return unit
|
|
|
|
|
was busy processing or waiting on data to return to the :doc:`CU <compute-unit>`.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:33:58 -04:00
|
|
|
Read Instructions:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The number of read instructions submitted to the :ref:`data-return unit <desc-td>`
|
|
|
|
|
by the :ref:`address processor <desc-ta>` summed over all :doc:`compute units
|
|
|
|
|
<compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
|
|
|
This is expected to be the sum of global/generic and spill/stack reads in the
|
|
|
|
|
:ref:`address processor <desc-ta>`.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Instructions per normalization unit
|
|
|
|
|
"Workgroup manager \u2192 Data-Return Stall":
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return unit
|
|
|
|
|
was stalled by the :ref:`workgroup manager <desc-spi>` due to initialization
|
2025-08-01 13:33:58 -04:00
|
|
|
of registers as a part of launching new workgroups.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
2025-08-01 14:00:46 -04:00
|
|
|
Write Ack Instructions:
|
|
|
|
|
rst: The total number of write acknowledgements submitted by the :ref:`data-return
|
|
|
|
|
unit <desc-td>` to SQ, summed over all compute units on the accelerator, per
|
|
|
|
|
normalization unit.
|
|
|
|
|
unit: Instructions per normalization unit
|
2025-08-01 13:33:58 -04:00
|
|
|
Write Instructions:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The number of store instructions submitted to the :ref:`data-return unit
|
|
|
|
|
<desc-td>` by the :ref:`address processor <desc-ta>` summed over all :doc:`compute
|
|
|
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
|
|
|
This is expected to be the sum of global/generic and spill/stack stores counted
|
|
|
|
|
by the :ref:`vL1D cache-front-end <ta-instruction-counts>`.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Instructions per normalization unit
|
|
|
|
|
L2 Speed-of-Light:
|
2025-08-01 13:56:29 -04:00
|
|
|
HBM Bandwidth:
|
|
|
|
|
rst: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory
|
|
|
|
|
(HBM) per unit time. This value is calculated as the number of HBM channels
|
|
|
|
|
multiplied by the HBM channel width multiplied by the HBM clock frequency.
|
|
|
|
|
unit: GB/s
|
|
|
|
|
Hit Rate:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The ratio of the number of L2 cache line requests that hit in the L2 cache
|
2025-08-01 13:56:29 -04:00
|
|
|
over the total number of incoming cache line requests to the L2 cache.
|
|
|
|
|
unit: Percent
|
|
|
|
|
L2-Fabric Read BW:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The number of bytes read by the L2 over the :ref:`Infinity Fabric interface
|
2025-08-01 13:56:29 -04:00
|
|
|
<l2-fabric>` per unit time.
|
|
|
|
|
unit: GB/s
|
|
|
|
|
L2-Fabric Write and Atomic BW:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The number of bytes sent by the L2 over the :ref:`Infinity Fabric interface
|
|
|
|
|
<l2-fabric>` by write and atomic operations per unit time.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: GB/s
|
2025-07-25 14:01:34 -04:00
|
|
|
Peak Bandwidth:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The number of bytes looked up in the L2 cache, as a percent of the peak theoretical
|
|
|
|
|
bandwidth achievable on the specific accelerator. The number of bytes is calculated
|
|
|
|
|
as the number of cache lines requested multiplied by the cache line size. This
|
|
|
|
|
value does not consider partial requests, so, for example, if only a single value is
|
|
|
|
|
requested in a cache line, the data movement will still be counted as a full
|
2025-07-25 14:01:34 -04:00
|
|
|
cache line.
|
|
|
|
|
unit: Percent
|
2025-08-01 13:33:58 -04:00
|
|
|
Utilization:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The ratio of the :ref:`number of cycles an L2 channel was active, summed
|
|
|
|
|
over all L2 channels on the accelerator <total-active-l2-cycles>` over the :ref:`total
|
|
|
|
|
L2 cycles <total-l2-cycles>`.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:33:58 -04:00
|
|
|
L2 cache accesses:
|
2025-08-01 14:00:46 -04:00
|
|
|
Atomic Bandwidth:
|
2025-08-06 18:39:50 -04:00
|
|
|
rst: Total number of bytes looked up in the L2 cache for atomic requests, divided
|
|
|
|
|
by total duration.
|
|
|
|
|
unit: Gbps
|
2025-08-01 13:56:29 -04:00
|
|
|
Atomic Req:
|
|
|
|
|
rst: The total number of atomic requests (with and without return) to the L2 from
|
|
|
|
|
all clients.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Requests per normalization unit
|
|
|
|
|
Bandwidth:
|
2025-08-06 18:39:50 -04:00
|
|
|
rst: The number of bytes looked up in the L2 cache, divided by total duration.
|
2025-10-22 15:17:43 -04:00
|
|
|
The number of bytes is calculated as the number of cache lines requested multiplied
|
2025-08-06 18:39:50 -04:00
|
|
|
by the cache line size. This value does not consider partial requests, so for
|
2025-10-22 15:17:43 -04:00
|
|
|
example, if only a single value is requested in a cache line, the data movement
|
|
|
|
|
will still be counted as a full cache line.
|
2025-08-06 18:39:50 -04:00
|
|
|
unit: Gbps
|
2025-08-01 13:56:29 -04:00
|
|
|
CC Req:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of requests to the L2 that go to Coherently Cacheable (CC)
|
|
|
|
|
memory allocations. See the :ref:`memory-type` for more information.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Requests per normalization unit
|
|
|
|
|
Cache Hit:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The ratio of the number of L2 cache line requests that hit in the L2 cache
|
|
|
|
|
over the total number of incoming cache line requests to the L2 cache.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Percent
|
|
|
|
|
Evict (Internal):
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of L2 cache lines evicted from the cache due to capacity
|
|
|
|
|
limits, per :ref:`normalization unit <normalization-units>`.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Cache lines per normalization unit
|
|
|
|
|
Evict (vL1D Req):
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of L2 cache lines evicted from the cache due to invalidation
|
|
|
|
|
requests initiated by the :doc:`vL1D cache <vector-l1-cache>`, per :ref:`normalization
|
2025-08-01 13:56:29 -04:00
|
|
|
unit <normalization-units>`.
|
|
|
|
|
unit: Cache lines per normalization unit
|
2025-07-25 14:01:34 -04:00
|
|
|
Hits:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of requests to the L2 from all clients that hit in the cache.
|
|
|
|
|
As noted in the :ref:`Speed-of-Light <l2-sol>` section, this includes hit-on-miss
|
2025-07-25 14:01:34 -04:00
|
|
|
requests.
|
|
|
|
|
unit: Requests per normalization unit
|
2025-08-01 13:56:29 -04:00
|
|
|
Misses:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of requests to the L2 from all clients that miss in the
|
|
|
|
|
cache. As noted in the :ref:`Speed-of-Light <l2-sol>` section, these do not
|
|
|
|
|
include hit-on-miss requests.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Requests per normalization unit
|
2025-08-01 13:33:58 -04:00
|
|
|
NC Req:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory
|
|
|
|
|
allocations, per :ref:`normalization unit <normalization-units>`. See the :ref:`memory-type`
|
2025-08-01 13:33:58 -04:00
|
|
|
for more information.
|
|
|
|
|
unit: Requests per normalization unit
|
|
|
|
|
Probe Req:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The number of coherence probe requests made to the L2 cache from outside
|
|
|
|
|
the accelerator. On an :ref:`MI2XX <mixxx-note>`, probe requests may be generated
|
|
|
|
|
by, for example, writes to :ref:`fine-grained device <memory-type>` memory or
|
|
|
|
|
by writes to :ref:`coarse-grained <memory-type>` device memory.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Requests per normalization unit
|
2025-08-01 13:56:29 -04:00
|
|
|
RW Req:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of requests to the L2 that go to Read-Write coherent memory
|
|
|
|
|
(RW) allocations. See the :ref:`memory-type` for more information.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Requests per normalization unit
|
2025-08-01 14:00:46 -04:00
|
|
|
Read Bandwidth:
|
2025-08-06 18:39:50 -04:00
|
|
|
rst: Total number of bytes looked up in the L2 cache for read requests, divided
|
|
|
|
|
by total duration.
|
|
|
|
|
unit: Gbps
|
2025-08-01 13:33:58 -04:00
|
|
|
Read Req:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of read requests to the L2 from all clients.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Requests per normalization unit
|
2025-08-01 13:56:29 -04:00
|
|
|
Req:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of incoming requests to the L2 from all clients for all
|
|
|
|
|
request types, per :ref:`normalization unit <normalization-units>`.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Requests per normalization unit
|
|
|
|
|
Streaming Req:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of incoming requests to the L2 that are marked as *streaming*.
|
|
|
|
|
The exact meaning of this may differ depending on the targeted accelerator,
|
|
|
|
|
however on an :ref:`MI2XX <mixxx-note>` this corresponds to `non-temporal load
|
|
|
|
|
or stores <https://clang.llvm.org/docs/LanguageExtensions.html#non-temporal-load-store-builtins>`_.
|
|
|
|
|
The L2 cache attempts to evict *streaming* requests before normal requests when
|
2025-08-01 13:56:29 -04:00
|
|
|
the L2 is at capacity.
|
|
|
|
|
unit: Requests per normalization unit
|
|
|
|
|
UC Req:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of requests to the L2 that go to Uncached (UC) memory allocations.
|
2025-08-01 13:56:29 -04:00
|
|
|
See the :ref:`memory-type` for more information.
|
|
|
|
|
unit: Requests per normalization unit
|
2025-08-01 14:00:46 -04:00
|
|
|
Write Bandwidth:
|
2025-08-06 18:39:50 -04:00
|
|
|
rst: Total number of bytes looked up in the L2 cache for write requests, divided
|
|
|
|
|
by total duration.
|
|
|
|
|
unit: Gbps
|
2025-08-01 13:56:29 -04:00
|
|
|
Write Req:
|
|
|
|
|
rst: The total number of write requests to the L2 from all clients.
|
|
|
|
|
unit: Requests per normalization unit
|
|
|
|
|
Writeback:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of L2 cache lines written back to memory for any reason.
|
|
|
|
|
Write-backs may occur due to user code (such as HIP kernel calls to ``__threadfence_system``
|
|
|
|
|
or atomic built-ins), by the :doc:`command processor <command-processor>`'s memory
|
|
|
|
|
acquire/release fences, or for other internal hardware reasons.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Cache lines per normalization unit
|
|
|
|
|
Writeback (Internal):
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of L2 cache lines written back to memory for internal hardware
|
2025-08-01 13:56:29 -04:00
|
|
|
reasons, per :ref:`normalization unit <normalization-units>`.
|
|
|
|
|
unit: Cache lines per normalization unit
|
|
|
|
|
Writeback (vL1D Req):
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of L2 cache lines written back to memory due to requests
|
|
|
|
|
initiated by the :doc:`vL1D cache <vector-l1-cache>`, per :ref:`normalization
|
|
|
|
|
unit <normalization-units>`.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Cache lines per normalization unit
|
2025-08-01 13:33:58 -04:00
|
|
|
L2-Fabric interface metrics:
|
2025-08-01 13:56:29 -04:00
|
|
|
Atomic Latency:
|
|
|
|
|
rst: The time-averaged number of cycles atomic requests spent in Infinity Fabric
|
|
|
|
|
before a completion acknowledgement (atomic without return value) or data (atomic
|
|
|
|
|
with return value) was returned to the L2.
|
|
|
|
|
unit: Cycles
|
2025-08-01 13:33:58 -04:00
|
|
|
Atomic Traffic:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The percent of write requests generated by the L2 cache that are atomic requests
|
|
|
|
|
to *any* memory location. This breakdown does not consider the *size* of the
|
|
|
|
|
request (meaning that 32B and 64B requests are both counted as a single request),
|
|
|
|
|
so this metric only *approximates* the percent of the L2-Fabric Write and Atomic bandwidth
|
|
|
|
|
used by atomic requests. Note that on current CDNA accelerators, such
|
|
|
|
|
as the :ref:`MI2XX <mixxx-note>`, requests are only considered *atomic* by Infinity
|
|
|
|
|
Fabric if they are targeted at :ref:`fine-grained memory <memory-type>` allocations
|
|
|
|
|
or :ref:`uncached memory <memory-type>` allocations.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Percent
|
|
|
|
|
HBM Read Traffic:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The percent of read requests generated by the L2 cache that are routed to
|
|
|
|
|
the accelerator's local high-bandwidth memory (HBM). This breakdown does not
|
|
|
|
|
consider the *size* of the request (meaning that 32B and 64B requests are both
|
|
|
|
|
counted as a single request), so this metric only *approximates* the percent
|
|
|
|
|
of the L2-Fabric Read bandwidth directed to the local HBM.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Percent
|
|
|
|
|
HBM Write and Atomic Traffic:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The percent of write and atomic requests generated by the L2 cache that are
|
|
|
|
|
routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
|
|
|
|
|
does not consider the *size* of the request (meaning that 32B and 64B requests
|
|
|
|
|
are both counted as a single request), so this metric only *approximates* the
|
|
|
|
|
percent of the L2-Fabric Write and Atomic bandwidth directed to the local HBM.
|
|
|
|
|
Note that on current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`,
|
|
|
|
|
requests are only considered *atomic* by Infinity Fabric if they are targeted
|
|
|
|
|
at :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached memory
|
|
|
|
|
<memory-type>` allocations.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Percent
|
|
|
|
|
Read BW:
|
2025-08-06 18:39:50 -04:00
|
|
|
rst: The total number of bytes read by the L2 cache from Infinity Fabric divided
|
|
|
|
|
by total duration.
|
|
|
|
|
unit: Gbps
|
2025-08-01 13:56:29 -04:00
|
|
|
Read Latency:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The time-averaged number of cycles read requests spent in Infinity Fabric
|
|
|
|
|
before data was returned to the L2.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Cycles
|
2025-08-01 13:56:29 -04:00
|
|
|
Read Stall:
|
2025-11-19 10:46:02 -05:00
|
|
|
rst: >-
|
2025-10-22 15:17:43 -04:00
|
|
|
The ratio of the total number of cycles the L2-Fabric interface was stalled
|
|
|
|
|
on a read request to any destination (local HBM, remote PCIe® connected
|
|
|
|
|
accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_
|
|
|
|
|
or CPU) over the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Percent
|
|
|
|
|
Remote Read Traffic:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The percent of read requests generated by the L2 cache that are routed to
|
|
|
|
|
any memory location other than the accelerator's local high-bandwidth memory
|
|
|
|
|
(HBM) -- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown
|
|
|
|
|
does not consider the *size* of the request (meaning that 32B and 64B requests
|
|
|
|
|
are both counted as a single request), so this metric only *approximates* the
|
|
|
|
|
percent of the L2-Fabric Read bandwidth directed to a remote location.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Percent
|
|
|
|
|
Remote Write and Atomic Traffic:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The percent of write and atomic requests generated by the L2 cache that are routed to
|
|
|
|
|
any memory location other than the accelerator's local high-bandwidth memory
|
|
|
|
|
(HBM) -- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown
|
|
|
|
|
does not consider the *size* of the request (meaning that 32B and 64B requests
|
|
|
|
|
are both counted as a single request), so this metric only *approximates* the
|
|
|
|
|
percent of the L2-Fabric Write and Atomic bandwidth directed to a remote location. Note
|
|
|
|
|
that on current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests
|
|
|
|
|
are only considered *atomic* by Infinity Fabric if they are targeted at :ref:`fine-grained
|
|
|
|
|
memory <memory-type>` allocations or :ref:`uncached memory <memory-type>` allocations.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Percent
|
|
|
|
|
Uncached Read Traffic:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The percent of read requests generated by the L2 cache that are reading from
|
|
|
|
|
an :ref:`uncached memory allocation <memory-type>`. Note, as described in the
|
|
|
|
|
:ref:`request flow <l2-request-flow>` section, a single 64B read request is
|
|
|
|
|
typically counted as two uncached read requests. So, it is possible for the
|
|
|
|
|
Uncached Read Traffic to reach up to 200% of the total number of read requests.
|
|
|
|
|
This breakdown does not consider the *size* of the request (i.e., 32B and 64B
|
|
|
|
|
requests are both counted as a single request), so this metric only *approximates*
|
|
|
|
|
the percent of the L2-Fabric read bandwidth directed to an uncached memory location.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:56:29 -04:00
|
|
|
Uncached Write and Atomic Traffic:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The percent of write and atomic requests generated by the L2 cache that are
|
|
|
|
|
targeting :ref:`uncached memory allocations <memory-type>`. This breakdown does
|
|
|
|
|
not consider the *size* of the request (meaning that 32B and 64B requests are
|
|
|
|
|
both counted as a single request), so this metric only *approximates* the percent
|
|
|
|
|
of the L2-Fabric write and atomic bandwidth directed to uncached memory allocations.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Percent
|
|
|
|
|
Write Stall:
|
|
|
|
|
rst: The ratio of the total number of cycles the L2-Fabric interface was stalled
|
|
|
|
|
on a write or atomic request to any destination (local HBM, remote PCIe connected
|
|
|
|
|
accelerator or CPU, or remote Infinity Fabric connected
|
|
|
|
|
accelerator [#inf]_ or CPU) over the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
|
|
|
unit: Percent
|
|
|
|
|
Write and Atomic BW:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of bytes written by the L2 over Infinity Fabric by write
|
|
|
|
|
and atomic operations divided by total duration. Note that on current CDNA accelerators,
|
|
|
|
|
such as the :ref:`MI2XX <mixxx-note>`, requests are only considered *atomic*
|
|
|
|
|
by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
|
|
|
|
|
:ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached memory
|
|
|
|
|
<memory-type>` allocations on the MI2XX.
|
2025-08-06 18:39:50 -04:00
|
|
|
unit: Gbps
|
2025-08-01 13:56:29 -04:00
|
|
|
Write and Atomic Latency:
|
|
|
|
|
rst: The time-averaged number of cycles write requests spent in Infinity Fabric
|
|
|
|
|
before a completion acknowledgement was returned to the L2.
|
|
|
|
|
unit: Cycles
|
2025-08-01 13:33:58 -04:00
|
|
|
L2 - Fabric interface detailed metrics:
|
2025-08-01 13:56:29 -04:00
|
|
|
Atomic:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of L2 requests to Infinity Fabric to atomically update 32B
|
|
|
|
|
or 64B of data in any memory location, per :ref:`normalization unit <normalization-units>`.
|
|
|
|
|
See :ref:`l2-request-flow` for more detail. Note that on current CDNA accelerators,
|
|
|
|
|
such as the :ref:`MI2XX <mixxx-note>`, requests are only considered *atomic*
|
|
|
|
|
by Infinity Fabric if they are targeted at non-write-cacheable memory, such
|
|
|
|
|
as :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached memory
|
|
|
|
|
<memory-type>` allocations on the MI2XX.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Requests per normalization unit
|
2025-08-01 14:00:46 -04:00
|
|
|
Atomic Bandwidth - HBM:
|
2025-08-06 18:39:50 -04:00
|
|
|
rst: Total number of bytes transferred by L2 atomic requests in HBM traffic, divided
|
|
|
|
|
by total duration.
|
|
|
|
|
unit: Gbps
|
2025-08-01 14:00:46 -04:00
|
|
|
"Atomic Bandwidth - Infinity Fabric\u2122":
|
|
|
|
|
rst: Total number of bytes transferred by L2 atomic requests in Infinity Fabric traffic,
|
2025-08-06 18:39:50 -04:00
|
|
|
divided by total duration.
|
|
|
|
|
unit: Gbps
|
2025-08-01 14:00:46 -04:00
|
|
|
Atomic Bandwidth - PCIe:
|
2025-08-06 18:39:50 -04:00
|
|
|
rst: Total number of bytes transferred by L2 atomic requests in PCIe traffic, divided
|
|
|
|
|
by total duration.
|
|
|
|
|
unit: Gbps
|
2025-07-25 14:01:34 -04:00
|
|
|
HBM Read:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of
|
|
|
|
|
data from the accelerator's local HBM, per :ref:`normalization unit <normalization-units>`.
|
|
|
|
|
See :ref:`l2-request-flow` for more detail.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Requests per normalization unit
|
|
|
|
|
HBM Write and Atomic:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically
|
|
|
|
|
update 32B or 64B of data in the accelerator's local HBM, per :ref:`normalization
|
|
|
|
|
unit <normalization-units>`. See :ref:`l2-request-flow` for more detail.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Requests per normalization unit
|
2025-08-01 13:56:29 -04:00
|
|
|
Read (32B):
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of L2 requests to Infinity Fabric to read 32B of data from
|
|
|
|
|
any memory location, per :ref:`normalization unit <normalization-units>`. See
|
|
|
|
|
:ref:`l2-request-flow` for more detail. Typically unused on CDNA accelerators.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Requests per normalization unit
|
|
|
|
|
Read (64B):
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of L2 requests to Infinity Fabric to read 64B of data from
|
|
|
|
|
any memory location, per :ref:`normalization unit <normalization-units>`. See
|
|
|
|
|
:ref:`l2-request-flow` for more detail.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Requests per normalization unit
|
2025-08-01 13:33:58 -04:00
|
|
|
Read (Uncached):
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of L2 requests to Infinity Fabric to read :ref:`uncached
|
|
|
|
|
data <memory-type>` from any memory location, per :ref:`normalization unit <normalization-units>`.
|
|
|
|
|
64B requests for uncached data are counted as two 32B uncached data requests.
|
|
|
|
|
See :ref:`l2-request-flow` for more detail.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Requests per normalization unit
|
2025-08-01 14:00:46 -04:00
|
|
|
Read Bandwidth - HBM:
|
2025-08-06 18:39:50 -04:00
|
|
|
rst: Total number of bytes transferred by L2 read requests in HBM traffic, divided
|
|
|
|
|
by total duration.
|
|
|
|
|
unit: Gbps
|
2025-08-01 14:00:46 -04:00
|
|
|
"Read Bandwidth - Infinity Fabric\u2122":
|
|
|
|
|
rst: Total number of bytes transferred by L2 read requests in Infinity Fabric traffic,
|
2025-08-06 18:39:50 -04:00
|
|
|
divided by total duration.
|
|
|
|
|
unit: Gbps
|
2025-08-01 14:00:46 -04:00
|
|
|
Read Bandwidth - PCIe:
|
2025-08-06 18:39:50 -04:00
|
|
|
rst: Total number of bytes transferred by L2 read requests in PCIe traffic, divided
|
|
|
|
|
by total duration.
|
|
|
|
|
unit: Gbps
|
2025-08-01 13:56:29 -04:00
|
|
|
Remote Read:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of
|
|
|
|
|
data from any source other than the accelerator's local HBM, per :ref:`normalization
|
|
|
|
|
unit <normalization-units>`. See :ref:`l2-request-flow` for more detail.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Requests per normalization unit
|
|
|
|
|
Remote Write and Atomic:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically
|
|
|
|
|
update 32B or 64B of data in any memory location other than the accelerator's
|
|
|
|
|
local HBM, per :ref:`normalization unit <normalization-units>`. See :ref:`l2-request-flow`
|
2025-08-01 13:56:29 -04:00
|
|
|
for more detail.
|
|
|
|
|
unit: Requests per normalization unit
|
2025-08-01 14:00:46 -04:00
|
|
|
Write Bandwidth - HBM:
|
2025-08-06 18:39:50 -04:00
|
|
|
rst: Total number of bytes transferred by L2 write requests in HBM traffic, divided
|
|
|
|
|
by total duration.
|
|
|
|
|
unit: Gbps
|
2025-08-01 14:00:46 -04:00
|
|
|
"Write Bandwidth - Infinity Fabric\u2122":
|
|
|
|
|
rst: Total number of bytes transferred by L2 write requests in Infinity Fabric traffic,
|
2025-08-06 18:39:50 -04:00
|
|
|
divided by total duration.
|
|
|
|
|
unit: Gbps
|
2025-08-01 14:00:46 -04:00
|
|
|
Write Bandwidth - PCIe:
|
2025-08-06 18:39:50 -04:00
|
|
|
rst: Total number of bytes transferred by L2 write requests in PCIe traffic, divided
|
|
|
|
|
by total duration.
|
|
|
|
|
unit: Gbps
|
2025-08-01 13:56:29 -04:00
|
|
|
Write and Atomic (32B):
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically
|
|
|
|
|
update 32B of data in any memory location, per :ref:`normalization unit <normalization-units>`.
|
|
|
|
|
See :ref:`l2-request-flow` for more detail.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Requests per normalization unit
|
|
|
|
|
Write and Atomic (64B):
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically
|
|
|
|
|
update 64B of data in any memory location, per :ref:`normalization unit <normalization-units>`.
|
|
|
|
|
See :ref:`l2-request-flow` for more detail.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Requests per normalization unit
|
2025-08-01 13:33:58 -04:00
|
|
|
Write and Atomic (Uncached):
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically
|
|
|
|
|
update 32B or 64B of :ref:`uncached data <memory-type>`, per :ref:`normalization
|
|
|
|
|
unit <normalization-units>`. See :ref:`l2-request-flow` for more detail.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Requests per normalization unit
|
|
|
|
|
L2 - Fabric Interface stalls:
|
|
|
|
|
Read - HBM Stall:
|
2025-07-25 14:01:34 -04:00
|
|
|
rst: The number of cycles the L2-Fabric interface was stalled on read requests
|
2025-08-01 13:33:58 -04:00
|
|
|
to the accelerator's local HBM as a percent of the :ref:`total active L2 cycles
|
|
|
|
|
<total-active-l2-cycles>`.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
|
|
|
|
Read - Infinity Fabric Stall:
|
|
|
|
|
rst: The number of cycles the L2-Fabric interface was stalled on read requests
|
|
|
|
|
to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a percent
|
|
|
|
|
of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
|
|
|
unit: Percent
|
2025-08-01 13:56:29 -04:00
|
|
|
Read - PCIe Stall:
|
|
|
|
|
rst: The number of cycles the L2-Fabric interface was stalled on read requests
|
|
|
|
|
to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total
|
|
|
|
|
active L2 cycles <total-active-l2-cycles>`.
|
|
|
|
|
unit: Percent
|
|
|
|
|
Write - Credit Starvation:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
|
|
|
|
|
requests to any memory location because too many write/atomic requests were
|
|
|
|
|
currently in flight, as a percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Percent
|
|
|
|
|
Write - HBM Stall:
|
2025-08-01 13:33:58 -04:00
|
|
|
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
|
2025-08-01 13:56:29 -04:00
|
|
|
requests to the accelerator's local HBM as a percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Percent
|
|
|
|
|
Write - Infinity Fabric Stall:
|
|
|
|
|
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
|
|
|
|
|
requests to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as
|
|
|
|
|
a percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
|
|
|
unit: Percent
|
2025-08-01 13:56:29 -04:00
|
|
|
Write - PCIe Stall:
|
2025-08-01 13:33:58 -04:00
|
|
|
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
|
2025-08-01 13:56:29 -04:00
|
|
|
requests to remote PCIe connected accelerators [#inf]_ or CPUs as a percent
|
|
|
|
|
of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:33:58 -04:00
|
|
|
Scalar L1D Speed-of-Light:
|
2025-08-06 18:39:50 -04:00
|
|
|
Bandwidth Utilization:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The number of bytes looked up in the sL1D cache, as a percent of the peak
|
|
|
|
|
theoretical bandwidth. Calculated as the ratio of sL1D requests over the :ref:`total
|
|
|
|
|
sL1D cycles <total-sl1d-cycles>`.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:56:29 -04:00
|
|
|
Cache Hit Rate:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: Indicates the percent of sL1D requests that hit on a previously loaded line
|
|
|
|
|
in the cache. Calculated as the ratio of the number of sL1D requests that hit [#sl1d-cache]_
|
2025-08-01 13:56:29 -04:00
|
|
|
over the number of all sL1D requests.
|
|
|
|
|
unit: Percent
|
2025-08-06 18:39:50 -04:00
|
|
|
sL1D-L2 BW Utilization:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The percentage of the peak theoretical sL1D - L2 interface bandwidth achieved.
|
|
|
|
|
Calculated as total number of bytes read from, written to, or atomically updated
|
|
|
|
|
across the sL1D - L2 interface.
|
2025-08-06 18:39:50 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:33:58 -04:00
|
|
|
Scalar L1D cache accesses:
|
2025-07-25 14:01:34 -04:00
|
|
|
Atomic Req:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of atomic requests from sL1D to the :doc:`L2 <l2-cache>`,
|
|
|
|
|
per :ref:`normalization unit <normalization-units>`. Typically unused on current
|
2025-07-25 14:01:34 -04:00
|
|
|
CDNA accelerators.
|
|
|
|
|
unit: Requests per normalization unit
|
|
|
|
|
Cache Hit Rate:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: Indicates the percent of sL1D requests that hit on a previously loaded line
|
|
|
|
|
in the cache. Calculated as the ratio of the number of sL1D requests that hit [#sl1d-cache]_
|
2025-07-25 14:01:34 -04:00
|
|
|
over the number of all sL1D requests.
|
|
|
|
|
unit: Percent
|
2025-08-01 13:56:29 -04:00
|
|
|
Hits:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of sL1D requests that hit on a previously loaded cache line,
|
2025-08-01 13:56:29 -04:00
|
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
|
|
|
unit: Requests per normalization unit
|
|
|
|
|
Misses - Non Duplicated:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of sL1D requests that missed on a cache line that *was not*
|
|
|
|
|
already pending due to another request, per :ref:`normalization unit <normalization-units>`.
|
|
|
|
|
See :ref:`desc-sl1d-sol` for more detail.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Requests per normalization unit
|
2025-07-25 14:01:34 -04:00
|
|
|
Misses- Duplicated:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of sL1D requests that missed on a cache line that *was*
|
|
|
|
|
already pending due to another request, per :ref:`normalization unit <normalization-units>`.
|
|
|
|
|
See :ref:`desc-sl1d-sol` for more detail.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Requests per normalization unit
|
2025-08-01 13:56:29 -04:00
|
|
|
Read Req (1 DWord):
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of sL1D read requests made for a single dword of data (4B),
|
2025-08-01 13:56:29 -04:00
|
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
|
|
|
unit: Requests per normalization unit
|
|
|
|
|
Read Req (16 DWord):
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of sL1D read requests made for sixteen dwords of data
|
|
|
|
|
(64B), per :ref:`normalization unit <normalization-units>`.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Requests per normalization unit
|
|
|
|
|
Read Req (2 DWord):
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of sL1D read requests made for two dwords of data (8B),
|
2025-08-01 13:56:29 -04:00
|
|
|
per :ref:`normalization unit <normalization-units>`.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Requests per normalization unit
|
|
|
|
|
Read Req (4 DWord):
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of sL1D read requests made for four dwords of data (16B),
|
2025-07-25 14:01:34 -04:00
|
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
|
|
|
unit: Requests per normalization unit
|
|
|
|
|
Read Req (8 DWord):
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of sL1D read requests made for eight dwords of data (32B),
|
2025-07-25 14:01:34 -04:00
|
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
|
|
|
unit: Requests per normalization unit
|
2025-08-01 13:33:58 -04:00
|
|
|
Read Req (Total):
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of sL1D read requests of any size, per :ref:`normalization
|
2025-08-01 13:33:58 -04:00
|
|
|
unit <normalization-units>`.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Requests per normalization unit
|
2025-08-01 13:56:29 -04:00
|
|
|
Req:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of requests, of any size or type, made to the sL1D, per :ref:`normalization
|
2025-08-01 13:56:29 -04:00
|
|
|
unit <normalization-units>`.
|
|
|
|
|
unit: Requests per normalization unit
|
2025-08-01 13:33:58 -04:00
|
|
|
Scalar L1D Cache - L2 Interface:
|
|
|
|
|
Atomic Req:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of atomic requests from sL1D to the :doc:`L2 <l2-cache>`,
|
|
|
|
|
per :ref:`normalization unit <normalization-units>`. Typically unused on current
|
2025-07-25 14:01:34 -04:00
|
|
|
CDNA accelerators.
|
|
|
|
|
unit: Requests per normalization unit
|
2025-08-01 13:56:29 -04:00
|
|
|
Read Req:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of read requests from sL1D to the :doc:`L2 <l2-cache>`,
|
|
|
|
|
per :ref:`normalization unit <normalization-units>`.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Requests per normalization unit
|
2025-07-25 14:01:34 -04:00
|
|
|
Stall Cycles:
|
2025-11-19 10:46:02 -05:00
|
|
|
rst: >-
|
2025-10-22 15:17:43 -04:00
|
|
|
The total number of cycles the sL1D ↔ :doc:`L2 <l2-cache>` interface
|
|
|
|
|
was stalled, per :ref:`normalization unit <normalization-units>`.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Cycles per normalization unit
|
2025-08-01 13:33:58 -04:00
|
|
|
Write Req:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of write requests from sL1D to the :doc:`L2 <l2-cache>`,
|
|
|
|
|
per :ref:`normalization unit <normalization-units>`. Typically unused on current
|
2025-08-01 13:33:58 -04:00
|
|
|
CDNA accelerators.
|
|
|
|
|
unit: Requests per normalization unit
|
2025-08-01 13:56:29 -04:00
|
|
|
sL1D-L2 BW:
|
2025-11-19 10:46:02 -05:00
|
|
|
rst: >-
|
2025-10-22 15:17:43 -04:00
|
|
|
The total number of bytes read from, written to, or atomically updated
|
|
|
|
|
across the sL1D ↔ :doc:`L2 <l2-cache>` interface, divided by total duration.
|
|
|
|
|
Note that sL1D writes and atomics are typically
|
|
|
|
|
unused on current CDNA accelerators, so in the majority of cases this can
|
|
|
|
|
be interpreted as an sL1D → L2 read bandwidth.
|
2025-08-06 18:39:50 -04:00
|
|
|
unit: Gbps
|
2025-07-25 14:01:34 -04:00
|
|
|
L1I Speed-of-Light:
|
2025-08-06 18:39:50 -04:00
|
|
|
Bandwidth Utilization:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The number of bytes looked up in the L1I cache, as a percent of the peak
|
|
|
|
|
theoretical bandwidth. Calculated as the ratio of L1I requests over the :ref:`total
|
|
|
|
|
L1I cycles <total-l1i-cycles>`.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:56:29 -04:00
|
|
|
Cache Hit Rate:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The percent of L1I requests that hit [#l1i-cache]_ on a previously loaded
|
|
|
|
|
line in the cache. Calculated as the ratio of the number of L1I requests that hit
|
|
|
|
|
over the number of all L1I requests.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Percent
|
2025-08-06 18:39:50 -04:00
|
|
|
L1I-L2 Bandwidth Utilization:
|
2025-11-19 10:46:02 -05:00
|
|
|
rst: >-
|
2025-10-22 15:17:43 -04:00
|
|
|
The percent of the peak theoretical L1I → L2 cache request bandwidth
|
|
|
|
|
achieved. Calculated as the ratio of the total number of requests from
|
|
|
|
|
the L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:33:58 -04:00
|
|
|
L1I cache accesses:
|
2025-08-01 13:56:29 -04:00
|
|
|
Cache Hit Rate:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The percent of L1I requests that hit [#l1i-cache]_ on a previously loaded
|
|
|
|
|
line in the cache. Calculated as the ratio of the number of L1I requests that hit
|
|
|
|
|
over the number of all L1I requests.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Percent
|
|
|
|
|
Hits:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of L1I requests that hit on a previously loaded cache line,
|
2025-08-01 13:56:29 -04:00
|
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
|
|
|
unit: Requests per normalization unit
|
2025-08-01 13:33:58 -04:00
|
|
|
Instruction Fetch Latency:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The average number of cycles spent to fetch instructions to a :doc:`CU <compute-unit>`.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Cycles
|
2025-07-25 14:01:34 -04:00
|
|
|
Misses - Duplicated:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of L1I requests that missed on a cache line that *was*
|
|
|
|
|
already pending due to another request, per :ref:`normalization unit <normalization-units>`.
|
|
|
|
|
See note in :ref:`desc-l1i-sol` for more detail.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Requests per normalization unit
|
|
|
|
|
Misses - Non Duplicated:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The total number of L1I requests that missed on a cache line that *was not*
|
|
|
|
|
already pending due to another request, per :ref:`normalization unit <normalization-units>`.
|
|
|
|
|
See note in :ref:`desc-l1i-sol` for more detail.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Requests per normalization unit
|
|
|
|
|
Req:
|
|
|
|
|
rst: The total number of requests made to the L1I, per :ref:`normalization unit <normalization-units>`.
|
|
|
|
|
unit: Requests per normalization unit
|
|
|
|
|
L1I <-> L2 interface:
|
|
|
|
|
L1I-L2 Bandwidth:
|
2025-08-06 18:39:50 -04:00
|
|
|
rst: Total number of bytes transferred across the L1I-L2 interface, divided by total
|
|
|
|
|
duration.
|
|
|
|
|
unit: Gbps
|
2025-08-01 13:33:58 -04:00
|
|
|
Workgroup manager utilizations:
|
|
|
|
|
Accelerator Utilization:
|
|
|
|
|
rst: The percent of cycles in the kernel where the accelerator was actively doing
|
|
|
|
|
any work.
|
|
|
|
|
unit: Percent
|
2025-07-25 14:01:34 -04:00
|
|
|
Dispatched Wavefronts:
|
|
|
|
|
rst: The total number of wavefronts, summed over all workgroups, forming this
|
|
|
|
|
kernel launch.
|
|
|
|
|
unit: Wavefronts
|
2025-08-01 13:56:29 -04:00
|
|
|
Dispatched Workgroups:
|
|
|
|
|
rst: The total number of workgroups forming this kernel launch.
|
|
|
|
|
unit: Workgroups
|
2025-07-25 14:01:34 -04:00
|
|
|
SGPR Writes:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The average number of cycles spent initializing :ref:`SGPRs <desc-salu>`
|
|
|
|
|
at wave creation.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Cycles/wave
|
2025-08-01 13:56:29 -04:00
|
|
|
SIMD Utilization:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel
|
|
|
|
|
where any :ref:`SIMD <desc-valu>` on a CU was actively doing any work, summed
|
|
|
|
|
over all CUs. Low values (less than 100%) indicate that the accelerator was
|
|
|
|
|
not fully saturated by the kernel, or a potential load-imbalance issue.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Percent
|
|
|
|
|
Scheduler-Pipe Utilization:
|
2025-11-19 10:46:02 -05:00
|
|
|
rst: >-
|
2025-10-22 15:17:43 -04:00
|
|
|
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
|
|
|
|
|
in the kernel where the scheduler-pipes were actively doing any work. Note: this
|
|
|
|
|
value is expected to range between 0% and 25%. See :ref:`desc-spi`.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:33:58 -04:00
|
|
|
Shader Engine Utilization:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The percent of :ref:`total shader engine cycles <total-se-cycles>` in the
|
|
|
|
|
kernel where any CU in a shader-engine was actively doing any work, normalized
|
|
|
|
|
over all shader-engines. Low values (much less than 100%) indicate that the accelerator
|
|
|
|
|
was not fully saturated by the kernel, or a potential load-imbalance issue.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:56:29 -04:00
|
|
|
VGPR Writes:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The average number of cycles spent initializing :ref:`VGPRs <desc-valu>`
|
|
|
|
|
at wave creation.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Cycles/wave
|
|
|
|
|
Workgroup Manager Utilization:
|
|
|
|
|
rst: The percent of cycles in the kernel where the workgroup manager was actively
|
|
|
|
|
doing any work.
|
|
|
|
|
unit: Percent
|
2025-08-01 13:33:58 -04:00
|
|
|
Workgroup Manager - Resource Allocation:
|
|
|
|
|
Insufficient CU Barriers:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel where
|
|
|
|
|
a workgroup could not be scheduled to a :doc:`CU <compute-unit>` due to lack
|
2025-08-01 13:33:58 -04:00
|
|
|
of available :ref:`barriers <desc-barrier>`.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:56:29 -04:00
|
|
|
Insufficient CU LDS:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel where
|
|
|
|
|
a workgroup could not be scheduled to a :doc:`CU <compute-unit>` due to lack
|
2025-08-01 13:56:29 -04:00
|
|
|
of available :doc:`LDS <local-data-share>`.
|
|
|
|
|
unit: Percent
|
2025-07-25 14:01:34 -04:00
|
|
|
Insufficient SIMD SGPRs:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel
|
|
|
|
|
where a workgroup could not be scheduled to a :ref:`SIMD <desc-valu>` due to
|
|
|
|
|
lack of available :ref:`SGPRs <desc-salu>`.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:56:29 -04:00
|
|
|
Insufficient SIMD VGPRs:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel
|
|
|
|
|
where a workgroup could not be scheduled to a :ref:`SIMD <desc-valu>` due to
|
|
|
|
|
lack of available :ref:`VGPRs <desc-valu>`.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:33:58 -04:00
|
|
|
Insufficient SIMD Waveslots:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel
|
|
|
|
|
where a workgroup could not be scheduled to a :ref:`SIMD <desc-valu>` due to
|
|
|
|
|
lack of available :ref:`waveslots <desc-valu>`.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
|
|
|
|
Not-scheduled Rate (Scheduler-Pipe):
|
2025-11-19 10:46:02 -05:00
|
|
|
rst: >-
|
2025-10-22 15:17:43 -04:00
|
|
|
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
|
|
|
|
|
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
|
|
|
|
|
due to a bottleneck within the scheduler-pipes rather than a lack of a CU
|
|
|
|
|
or :ref:`SIMD <desc-valu>` with sufficient resources. Note: this value is
|
|
|
|
|
expected to range from 0 to 25%. See the note in the :ref:`workgroup manager <desc-spi>`
|
|
|
|
|
description.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:33:58 -04:00
|
|
|
Not-scheduled Rate (Workgroup Manager):
|
2025-11-19 10:46:02 -05:00
|
|
|
rst: >-
|
2025-10-22 15:17:43 -04:00
|
|
|
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
|
|
|
|
|
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
|
|
|
|
|
due to a bottleneck within the workgroup manager rather than a lack of a
|
|
|
|
|
CU or :ref:`SIMD <desc-valu>` with sufficient resources. Note: this value
|
|
|
|
|
is expected to range between 0-25%. See note in :ref:`workgroup manager <desc-spi>`
|
|
|
|
|
description.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
|
|
|
|
Reached CU Wavefront Limit:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel where
|
|
|
|
|
a wavefront could not be scheduled to a :doc:`CU <compute-unit>` due to limits
|
|
|
|
|
within the workgroup manager. This is expected to always be zero on CDNA2
|
|
|
|
|
or newer accelerators (and small for previous accelerators).
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:56:29 -04:00
|
|
|
Reached CU Workgroup Limit:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel where
|
|
|
|
|
a workgroup could not be scheduled to a :doc:`CU <compute-unit>` due to limits
|
|
|
|
|
within the workgroup manager. This is expected to always be zero on CDNA2
|
|
|
|
|
or newer accelerators (and small for previous accelerators).
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Percent
|
|
|
|
|
Scheduler-Pipe Stall Rate:
|
2025-11-19 10:46:02 -05:00
|
|
|
rst: >-
|
2025-10-22 15:17:43 -04:00
|
|
|
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
|
|
|
|
|
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
|
|
|
|
|
due to occupancy limitations (like a lack of a CU or :ref:`SIMD <desc-valu>`
|
|
|
|
|
with sufficient resources). Note: this value is expected to range between
|
|
|
|
|
0-25%. See the note in the :ref:`workgroup manager <desc-spi>` description.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Percent
|
|
|
|
|
Scratch Stall Rate:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The percent of :ref:`total shader-engine cycles <total-se-cycles>` in the
|
|
|
|
|
kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
|
|
|
|
|
due to lack of :ref:`private (a.k.a., scratch) memory <memory-type>` slots.
|
|
|
|
|
While this can reach up to 100%, note that the actual occupancy limitations
|
|
|
|
|
on a kernel using private memory are typically quite small (for example, less
|
|
|
|
|
than 1% of the total number of waves that can be scheduled to an accelerator).
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Percent
|
2025-07-25 14:01:34 -04:00
|
|
|
Command processor fetcher (CPF):
|
|
|
|
|
CPF Stall:
|
|
|
|
|
rst: Percent of CPF busy cycles where the CPF was stalled for any reason.
|
|
|
|
|
unit: Percent
|
2025-08-01 13:56:29 -04:00
|
|
|
CPF Utilization:
|
|
|
|
|
rst: Percent of total cycles where the CPF was busy actively doing any work. The
|
|
|
|
|
ratio of CPF busy cycles over total cycles counted by the CPF.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
|
|
|
|
CPF-L2 Stall:
|
|
|
|
|
rst: Percent of CPF-:doc:`L2 <l2-cache>` busy cycles where the CPF-L2 interface
|
|
|
|
|
was stalled for any reason.
|
|
|
|
|
unit: Percent
|
2025-08-01 13:56:29 -04:00
|
|
|
CPF-L2 Utilization:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: Percent of total cycles counted by the CPF-:doc:`L2 <l2-cache>` interface
|
|
|
|
|
where the CPF-L2 interface was actively doing any work. The ratio of CPF-L2 busy
|
|
|
|
|
cycles over total cycles counted by the CPF-L2.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Percent
|
2025-07-25 14:01:34 -04:00
|
|
|
CPF-UTCL1 Stall:
|
|
|
|
|
rst: Percent of CPF busy cycles where the CPF was stalled by address translation.
|
|
|
|
|
unit: Percent
|
2025-08-01 13:33:58 -04:00
|
|
|
Command processor packet processor (CPC):
|
2025-07-25 14:01:34 -04:00
|
|
|
CPC Packet Decoding Utilization:
|
|
|
|
|
rst: Percent of CPC busy cycles spent decoding commands for processing.
|
|
|
|
|
unit: Percent
|
2025-08-01 13:56:29 -04:00
|
|
|
CPC Stall Rate:
|
|
|
|
|
rst: Percent of CPC busy cycles where the CPC was stalled for any reason.
|
|
|
|
|
unit: Percent
|
|
|
|
|
CPC Utilization:
|
|
|
|
|
rst: Percent of total cycles where the CPC was busy actively doing any work. The
|
|
|
|
|
ratio of CPC busy cycles over total cycles counted by the CPC.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
|
|
|
|
CPC-L2 Utilization:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: Percent of total cycles counted by the CPC-:doc:`L2 <l2-cache>` interface
|
|
|
|
|
where the CPC-L2 interface was actively doing any work.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:56:29 -04:00
|
|
|
CPC-UTCL1 Stall:
|
|
|
|
|
rst: Percent of CPC busy cycles where the CPC was stalled by address translation.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:56:29 -04:00
|
|
|
CPC-UTCL2 Utilization:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: Percent of total cycles counted by the CPC's :doc:`L2 <l2-cache>` address
|
|
|
|
|
translation interface where the CPC was busy doing address translation work.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Percent
|
|
|
|
|
CPC-Workgroup Manager Utilization:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: Percent of CPC busy cycles spent dispatching workgroups to the :ref:`workgroup
|
2025-08-01 13:56:29 -04:00
|
|
|
manager <desc-spi>`.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
|
|
|
|
System Speed-of-Light:
|
2025-11-28 22:02:25 +05:30
|
|
|
Active CUs (deprecated):
|
2025-08-01 13:56:29 -04:00
|
|
|
rst: Total number of active compute units (CUs) on the accelerator during the
|
2025-11-28 22:02:25 +05:30
|
|
|
kernel execution. (Deprecated; see CU Utilization instead.)
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Number
|
|
|
|
|
Branch Utilization:
|
|
|
|
|
rst: Indicates what percent of the kernel's duration the :ref:`branch <desc-branch>`
|
|
|
|
|
unit was busy executing instructions. Computed as the ratio of the total number
|
|
|
|
|
of cycles spent by the :ref:`scheduler <desc-scheduler>` issuing branch instructions
|
|
|
|
|
over the :ref:`total CU cycles <total-cu-cycles>`.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Percent
|
|
|
|
|
IPC:
|
|
|
|
|
rst: The ratio of the total number of instructions executed on the :doc:`CU <compute-unit>`
|
|
|
|
|
over the :ref:`total active CU cycles <total-active-cu-cycles>`.
|
|
|
|
|
unit: Instructions per-cycle
|
2025-08-01 13:56:29 -04:00
|
|
|
L1I BW:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The number of bytes looked up in the L1I cache per unit time. This is also
|
|
|
|
|
presented as a percent of the peak theoretical bandwidth achievable on the specific
|
|
|
|
|
accelerator.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: GB/s
|
2025-08-01 13:56:29 -04:00
|
|
|
L1I Fetch Latency:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The average number of cycles spent to fetch instructions to a :doc:`CU <compute-unit>`.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Cycles
|
2025-08-01 13:33:58 -04:00
|
|
|
L1I Hit Rate:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The percent of L1I requests that hit on a previously loaded line in the cache.
|
|
|
|
|
Calculated as the ratio of the number of L1I requests that hit over the number
|
2025-08-01 13:33:58 -04:00
|
|
|
of all L1I requests.
|
|
|
|
|
unit: Percent
|
2025-08-01 13:56:29 -04:00
|
|
|
L2 Cache BW:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The number of bytes looked up in the L2 cache per unit time. The number of
|
|
|
|
|
bytes is calculated as the number of cache lines requested multiplied by the
|
|
|
|
|
cache line size. This value does not consider partial requests, so, for example, if
|
|
|
|
|
only a single value is requested in a cache line, the data movement will still
|
|
|
|
|
be counted as a full cache line. This is also presented as a percent of the
|
|
|
|
|
peak theoretical bandwidth achievable on the specific accelerator.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: GB/s
|
|
|
|
|
L2 Cache Hit Rate:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The ratio of the number of L2 cache line requests that hit in the L2 cache
|
|
|
|
|
over the total number of incoming cache line requests to the L2 cache.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: Percent
|
2025-08-01 13:56:29 -04:00
|
|
|
L2-Fabric Read BW:
|
2025-11-19 10:46:02 -05:00
|
|
|
rst: >-
|
2025-10-22 15:17:43 -04:00
|
|
|
The number of bytes read by the L2 over the :ref:`Infinity Fabric™
|
|
|
|
|
interface <l2-fabric>` per unit time. This is also presented as a percent
|
|
|
|
|
of the peak theoretical bandwidth achievable on the specific accelerator.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: GB/s
|
|
|
|
|
L2-Fabric Read Latency:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The time-averaged number of cycles read requests spent in Infinity Fabric
|
|
|
|
|
before data was returned to the L2.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Cycles
|
|
|
|
|
L2-Fabric Write BW:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The number of bytes sent by the L2 over the :ref:`Infinity Fabric interface
|
|
|
|
|
<l2-fabric>` by write and atomic operations per unit time. This is also presented
|
|
|
|
|
as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: GB/s
|
|
|
|
|
L2-Fabric Write Latency:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The time-averaged number of cycles write requests spent in Infinity Fabric
|
2025-08-01 13:56:29 -04:00
|
|
|
before a completion acknowledgement was returned to the L2.
|
|
|
|
|
unit: Cycles
|
|
|
|
|
LDS Bank Conflicts/Access:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The ratio of the number of cycles spent in the :doc:`LDS scheduler <local-data-share>`
|
|
|
|
|
due to bank conflicts (as determined by the conflict resolution hardware) to
|
|
|
|
|
the base number of cycles that would be spent in the LDS scheduler in a completely
|
|
|
|
|
uncontended case. This is also presented in normalized form (i.e., the Bank
|
|
|
|
|
Conflict Rate).
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Conflicts/Access
|
2025-08-01 13:33:58 -04:00
|
|
|
MFMA FLOPs (BF16):
|
2025-11-19 10:46:02 -05:00
|
|
|
rst: >-
|
2025-10-22 15:17:43 -04:00
|
|
|
The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
|
2025-08-01 13:33:58 -04:00
|
|
|
operations executed per second. Note: this does not include any 16-bit brain
|
2025-10-22 15:17:43 -04:00
|
|
|
floating point operations from :ref:`VALU <desc-valu>` instructions. This
|
|
|
|
|
is also presented as a percent of the peak theoretical BF16 MFMA operations
|
|
|
|
|
achievable on the specific accelerator.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: GFLOPs
|
2025-08-01 13:56:29 -04:00
|
|
|
MFMA FLOPs (F16):
|
2025-11-19 10:46:02 -05:00
|
|
|
rst: >-
|
2025-10-22 15:17:43 -04:00
|
|
|
The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
|
|
|
|
|
executed per second. Note: this does not include any 16-bit floating point
|
|
|
|
|
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
|
|
|
|
|
as a percent of the peak theoretical F16 MFMA operations achievable on the
|
|
|
|
|
specific accelerator.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: GFLOPs
|
|
|
|
|
MFMA FLOPs (F32):
|
2025-11-19 10:46:02 -05:00
|
|
|
rst: >-
|
2025-10-22 15:17:43 -04:00
|
|
|
The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
|
|
|
|
|
executed per second. Note: this does not include any 32-bit floating point
|
|
|
|
|
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
|
|
|
|
|
as a percent of the peak theoretical F32 MFMA operations achievable on the
|
|
|
|
|
specific accelerator.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: GFLOPs
|
|
|
|
|
MFMA FLOPs (F64):
|
2025-11-19 10:46:02 -05:00
|
|
|
rst: >-
|
2025-10-22 15:17:43 -04:00
|
|
|
The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
|
|
|
|
|
executed per second. Note: this does not include any 64-bit floating point
|
|
|
|
|
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
|
|
|
|
|
as a percent of the peak theoretical F64 MFMA operations achievable on the
|
|
|
|
|
specific accelerator.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: GFLOPs
|
|
|
|
|
MFMA FLOPs (F8):
|
2025-11-19 10:46:02 -05:00
|
|
|
rst: >-
|
2025-10-22 15:17:43 -04:00
|
|
|
The total number of 8-bit floating point :ref:`MFMA <desc-mfma>`
|
|
|
|
|
operations executed per second. Note: this does not include any 8-bit
|
|
|
|
|
floating point operations from :ref:`VALU <desc-valu>` instructions. This
|
|
|
|
|
is also presented as a percent of the peak theoretical F8 MFMA operations
|
|
|
|
|
achievable on the specific accelerator. It is supported on AMD Instinct MI300
|
|
|
|
|
series and later only.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: GFLOPs
|
|
|
|
|
MFMA IOPs (Int8):
|
2025-11-19 10:46:02 -05:00
|
|
|
rst: >-
|
2025-10-22 15:17:43 -04:00
|
|
|
The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
|
|
|
|
|
per second. Note: this does not include any 8-bit integer operations from
|
|
|
|
|
:ref:`VALU <desc-valu>` instructions. This is also presented as a percent
|
|
|
|
|
of the peak theoretical INT8 MFMA operations achievable on the specific accelerator.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: GIOPs
|
|
|
|
|
MFMA Utilization:
|
|
|
|
|
rst: Indicates what percent of the kernel's duration the :ref:`MFMA <desc-mfma>`
|
2025-08-01 13:33:58 -04:00
|
|
|
unit was busy executing instructions. Computed as the ratio of the total number
|
2025-08-01 13:56:29 -04:00
|
|
|
of cycles the :ref:`MFMA <desc-mfma>` unit was busy over the :ref:`total
|
|
|
|
|
CU cycles <total-cu-cycles>`.
|
|
|
|
|
unit: Percent
|
|
|
|
|
SALU Utilization:
|
|
|
|
|
rst: Indicates what percent of the kernel's duration the :ref:`SALU <desc-salu>`
|
|
|
|
|
was busy executing instructions. Computed as the ratio of the total number of
|
|
|
|
|
cycles spent by the :ref:`scheduler <desc-scheduler>` issuing SALU / :ref:`SMEM
|
|
|
|
|
<desc-smem>` instructions over the :ref:`total CU cycles <total-cu-cycles>`.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: Percent
|
2025-07-25 14:01:34 -04:00
|
|
|
Theoretical LDS Bandwidth:
|
|
|
|
|
rst: Indicates the maximum number of bytes that could have been loaded from, stored
|
|
|
|
|
to, or atomically updated in the LDS per unit time (see :ref:`LDS Bandwidth
|
|
|
|
|
<lds-bandwidth>` example for more detail). This is also presented as a percent
|
|
|
|
|
of the peak theoretical bandwidth achievable on the specific accelerator.
|
|
|
|
|
unit: GB/s
|
2025-08-01 13:33:58 -04:00
|
|
|
VALU Active Threads:
|
|
|
|
|
rst: Indicates the average level of :ref:`divergence <desc-divergence>` within
|
|
|
|
|
a wavefront over the lifetime of the kernel. The number of work-items that were
|
|
|
|
|
active in a wavefront during execution of each :ref:`VALU <desc-valu>` instruction,
|
|
|
|
|
time-averaged over all VALU instructions run on all wavefronts in the kernel.
|
|
|
|
|
unit: Work-items
|
2025-08-01 13:56:29 -04:00
|
|
|
VALU FLOPs:
|
2025-11-19 10:46:02 -05:00
|
|
|
rst: >-
|
2025-10-22 15:17:43 -04:00
|
|
|
The total floating-point operations executed per second on the :ref:`VALU
|
|
|
|
|
<desc-valu>`. This is also presented as a percent of the peak theoretical
|
|
|
|
|
FLOPs achievable on the specific accelerator. Note: this does not include
|
|
|
|
|
any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: GFLOPs
|
2025-08-01 13:56:29 -04:00
|
|
|
VALU IOPs:
|
2025-11-19 10:46:02 -05:00
|
|
|
rst: >-
|
2025-10-22 15:17:43 -04:00
|
|
|
The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
|
2025-08-01 13:56:29 -04:00
|
|
|
This is also presented as a percent of the peak theoretical IOPs achievable
|
|
|
|
|
on the specific accelerator. Note: this does not include any integer operations
|
2025-10-22 15:17:43 -04:00
|
|
|
from :ref:`MFMA <desc-mfma>` instructions.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: GIOPs
|
|
|
|
|
VALU Utilization:
|
|
|
|
|
rst: Indicates what percent of the kernel's duration the :ref:`VALU <desc-valu>`
|
|
|
|
|
was busy executing instructions. Does not include :ref:`VMEM <desc-vmem>` operations.
|
|
|
|
|
Computed as the ratio of the total number of cycles spent by the :ref:`scheduler
|
|
|
|
|
<desc-scheduler>` issuing VALU instructions over the :ref:`total CU cycles <total-cu-cycles>`.
|
|
|
|
|
unit: Percent
|
2025-12-11 14:23:34 -05:00
|
|
|
Dual-issue VALU Utilization:
|
|
|
|
|
rst: Indicates what percent of the kernel's duration the :ref:`VALU <desc-valu>`
|
|
|
|
|
was busy executing dual-issued instructions. Computed as the ratio of the total number of
|
|
|
|
|
cycles spent by the scheduler co-issuing VALU instructions over the total
|
|
|
|
|
CU cycles.
|
|
|
|
|
unit: Percent
|
2025-08-01 13:56:29 -04:00
|
|
|
VMEM Utilization:
|
|
|
|
|
rst: Indicates what percent of the kernel's duration the :ref:`VMEM <desc-vmem>`
|
|
|
|
|
unit was busy executing instructions, including both global/generic and spill/scratch
|
|
|
|
|
operations (see the :ref:`VMEM instruction count metrics <ta-instruction-counts>`
|
2025-10-22 15:17:43 -04:00
|
|
|
for more detail). Does not include :ref:`VALU <desc-valu>` operations. Computed
|
2025-08-01 13:56:29 -04:00
|
|
|
as the ratio of the total number of cycles spent by the :ref:`scheduler <desc-scheduler>`
|
|
|
|
|
issuing VMEM instructions over the :ref:`total CU cycles <total-cu-cycles>`.
|
|
|
|
|
unit: Percent
|
|
|
|
|
Wavefront Occupancy:
|
2025-11-19 10:46:02 -05:00
|
|
|
rst: >-
|
2025-10-22 15:17:43 -04:00
|
|
|
The time-averaged number of wavefronts resident on the accelerator over
|
2025-08-01 13:56:29 -04:00
|
|
|
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
|
|
|
|
kernels (less than 1 ms). This is also presented as a percent of the peak theoretical
|
2025-10-22 15:17:43 -04:00
|
|
|
occupancy achievable on the specific accelerator.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Wavefronts
|
|
|
|
|
sL1D Cache BW:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The number of bytes looked up in the sL1D cache per unit time. This is also
|
|
|
|
|
presented as a percent of the peak theoretical bandwidth achievable on the specific
|
|
|
|
|
accelerator.
|
2025-07-25 14:01:34 -04:00
|
|
|
unit: GB/s
|
|
|
|
|
sL1D Cache Hit Rate:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The percent of sL1D requests that hit on a previously loaded line in the cache.
|
|
|
|
|
Calculated as the ratio of the number of sL1D requests that hit over the number
|
2025-07-25 14:01:34 -04:00
|
|
|
of all sL1D requests.
|
|
|
|
|
unit: Percent
|
2025-08-01 13:33:58 -04:00
|
|
|
vL1D Cache BW:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM
|
|
|
|
|
<desc-vmem>` instructions per unit time. The number of bytes is calculated as
|
|
|
|
|
the number of cache lines requested multiplied by the cache line size. This
|
|
|
|
|
value does not consider partial requests, so, for example, if only a single value is
|
|
|
|
|
requested in a cache line, the data movement will still be counted as a full
|
|
|
|
|
cache line. This is also presented as a percent of the peak theoretical bandwidth
|
|
|
|
|
achievable on the specific accelerator.
|
2025-08-01 13:33:58 -04:00
|
|
|
unit: GB/s
|
2025-08-01 13:56:29 -04:00
|
|
|
vL1D Cache Hit Rate:
|
2025-10-22 15:17:43 -04:00
|
|
|
rst: The ratio of the number of vL1D cache line requests that hit in vL1D cache
|
|
|
|
|
over the total number of cache line requests to the :ref:`vL1D cache RAM <desc-tc>`.
|
2025-08-01 13:56:29 -04:00
|
|
|
unit: Percent
|
2025-11-28 22:02:25 +05:30
|
|
|
CU Utilization:
|
|
|
|
|
rst: The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel
|
|
|
|
|
where any :ref:`SIMD <desc-valu>` on a CU was actively doing any work, summed
|
|
|
|
|
over all CUs. Low values (less than 100%) indicate that the accelerator was
|
|
|
|
|
not fully saturated by the kernel, or a potential load-imbalance issue.
|
|
|
|
|
unit: Percent
|