354fe5f52c
* Show description of metrics during analysis
* Use --include-cols Description show the Description column in analyze mode (this is hidden by default)
* Remove tips field from analysis config
* Align metric names in analysis config and documentation
* Add unified config utils/unified_config.yaml
* Add python script utils/split_config.py to auto generate analysis configuration and documentation metrics description
* Add test case to ensure unified config is older than auto-generated config
* Auto generate analysis config and documentation metrics description
* Update CONTRIBUTING.md to add instructions to build documentation assets
* Add docker image and compose file to build documentation
* Update CHANGELOG and Documentation
* Use jinja template instead of hardcoding metric tables in documentation
[ROCm/rocprofiler-compute commit: bb44e90b2d]
4915 خطوط
276 KiB
YAML
4915 خطوط
276 KiB
YAML
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
|
Wavefront launch stats:
|
|
Grid Size:
|
|
rst: The total number of work-items (or, threads) launched as a part of the kernel
|
|
dispatch. In HIP, this is equivalent to the total grid size multiplied by the
|
|
total workgroup (or, block) size.
|
|
unit: Work-Items
|
|
Workgroup Size:
|
|
rst: The total number of work-items (or, threads) in each workgroup (or, block)
|
|
launched as part of the kernel dispatch. In HIP, this is equivalent to the total
|
|
block size.
|
|
unit: Work-Items
|
|
Total Wavefronts:
|
|
rst: "The total number of wavefronts launched as part of the kernel dispatch.\
|
|
\ On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront\
|
|
\ size is always 64 work-items. Thus, the total number of wavefronts should\
|
|
\ be equivalent to the ceiling of grid size divided by 64."
|
|
unit: Wavefronts
|
|
Saved Wavefronts:
|
|
rst: The total number of wavefronts saved at a context-save. See `cwsr_enable
|
|
<https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
|
|
unit: Wavefronts
|
|
Restored Wavefronts:
|
|
rst: The total number of wavefronts restored from a context-save. See `cwsr_enable
|
|
<https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
|
|
unit: Wavefronts
|
|
VGPRs:
|
|
rst: 'The number of architected vector general-purpose registers allocated for the
|
|
kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
|
|
number of VGPRs requested by the compiler due to allocation granularity.'
|
|
unit: VGPRs
|
|
AGPRs:
|
|
rst: 'The number of accumulation vector general-purpose registers allocated for the
|
|
kernel, see :ref:`AGPRs <desc-agprs>`. Note: this may not exactly match the
|
|
number of AGPRs requested by the compiler due to allocation granularity.'
|
|
unit: AGPRs
|
|
SGPRs:
|
|
rst: 'The number of scalar general-purpose registers allocated for the kernel, see
|
|
:ref:`SALU <desc-salu>`. Note: this may not exactly match the number of SGPRs
|
|
requested by the compiler due to allocation granularity. plain'
|
|
unit: SGPRs
|
|
LDS Allocation:
|
|
rst: 'The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
|
|
allocated for this kernel. Note: This may also be larger than what was requested
|
|
at compile time due to both allocation granularity and dynamic per-dispatch
|
|
LDS allocations.'
|
|
unit: Bytes per workgroup
|
|
Scratch Allocation:
|
|
rst: The number of bytes of :ref:`scratch memory <memory-spaces>` requested per
|
|
work-item for this kernel. Scratch memory is used for stack memory on the accelerator,
|
|
as well as for register spills and restores.
|
|
unit: Bytes per work-item
|
|
Kernel Time:
|
|
rst: The total duration of the executed kernel.
|
|
unit: Nanoseconds
|
|
Kernel Time (Cycles):
|
|
rst: The total duration of the executed kernel in cycles.
|
|
unit: Cycles
|
|
Instructions per wavefront:
|
|
rst: The average number of instructions (of all types) executed per wavefront.
|
|
This is averaged over all wavefronts in a kernel dispatch.
|
|
unit: Instructions per wavefront
|
|
Wave Cycles:
|
|
rst: 'The number of cycles a wavefront in the kernel dispatch spent resident on a
|
|
compute unit per :ref:`normalization unit <normalization-units>`. This is averaged
|
|
over all wavefronts in a kernel dispatch. Note: this should not be directly
|
|
compared to the kernel cycles above.'
|
|
unit: Cycles per normalization unit
|
|
Dependency Wait Cycles:
|
|
rst: The number of cycles a wavefront in the kernel dispatch stalled waiting on
|
|
memory of any kind (e.g., instruction fetch, vector or scalar memory, etc.)
|
|
per :ref:`normalization unit <normalization-units>`. This counter is incremented
|
|
at every cycle by *all* wavefronts on a CU stalled at a memory operation. As
|
|
such, it is most useful to get a sense of how waves were spending their time,
|
|
rather than identification of a precise limiter because another wave could
|
|
be actively executing while a wave is stalled. The sum of this metric, Issue
|
|
Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
|
|
unit: Cycles per normalization unit
|
|
Issue Wait Cycles:
|
|
rst: The number of cycles a wavefront in the kernel dispatch was unable to issue
|
|
an instruction for any reason (e.g., execution pipe back-pressure, arbitration
|
|
loss, etc.) per :ref:`normalization unit <normalization-units>`. This counter
|
|
is incremented at every cycle by *all* wavefronts on a CU unable to issue an instruction. As
|
|
such, it is most useful to get a sense of how waves were spending their time,
|
|
rather than identification of a precise limiter because another wave could
|
|
be actively executing while a wave is issue stalled. The sum of this metric,
|
|
Dependency Wait Cycles and Active Cycles should be equal to the total Wave
|
|
Cycles metric.
|
|
unit: Cycles per normalization unit
|
|
Active Cycles:
|
|
rst: The average number of cycles a wavefront in the kernel dispatch was actively
|
|
executing instructions per :ref:`normalization unit <normalization-units>`.
|
|
This measurement is made on a per-wavefront basis, and may include cycles that
|
|
another wavefront spent actively executing (on another execution unit, for
|
|
example) or was stalled. As such, it is most useful to get a sense of how
|
|
waves were spending their time, rather than identification of a precise limiter.
|
|
The sum of this metric, Issue Wait Cycles and Active Wait Cycles should be equal
|
|
to the total Wave Cycles metric.
|
|
unit: Cycles per normalization unit
|
|
Wavefront Occupancy:
|
|
rst: 'The time-averaged number of wavefronts resident on the accelerator over the
|
|
lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
|
kernels (less than 1ms).'
|
|
unit: Wavefronts
|
|
Wavefront runtime stats:
|
|
Grid Size:
|
|
rst: The total number of work-items (or, threads) launched as a part of the kernel
|
|
dispatch. In HIP, this is equivalent to the total grid size multiplied by the
|
|
total workgroup (or, block) size.
|
|
unit: Work-Items
|
|
Workgroup Size:
|
|
rst: The total number of work-items (or, threads) in each workgroup (or, block)
|
|
launched as part of the kernel dispatch. In HIP, this is equivalent to the total
|
|
block size.
|
|
unit: Work-Items
|
|
Total Wavefronts:
|
|
rst: "The total number of wavefronts launched as part of the kernel dispatch.\
|
|
\ On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront\
|
|
\ size is always 64 work-items. Thus, the total number of wavefronts should\
|
|
\ be equivalent to the ceiling of grid size divided by 64."
|
|
unit: Wavefronts
|
|
Saved Wavefronts:
|
|
rst: The total number of wavefronts saved at a context-save. See `cwsr_enable
|
|
<https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
|
|
unit: Wavefronts
|
|
Restored Wavefronts:
|
|
rst: The total number of wavefronts restored from a context-save. See `cwsr_enable
|
|
<https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
|
|
unit: Wavefronts
|
|
VGPRs:
|
|
rst: 'The number of architected vector general-purpose registers allocated for the
|
|
kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
|
|
number of VGPRs requested by the compiler due to allocation granularity.'
|
|
unit: VGPRs
|
|
AGPRs:
|
|
rst: 'The number of accumulation vector general-purpose registers allocated for the
|
|
kernel, see :ref:`AGPRs <desc-agprs>`. Note: this may not exactly match the
|
|
number of AGPRs requested by the compiler due to allocation granularity.'
|
|
unit: AGPRs
|
|
SGPRs:
|
|
rst: 'The number of scalar general-purpose registers allocated for the kernel, see
|
|
:ref:`SALU <desc-salu>`. Note: this may not exactly match the number of SGPRs
|
|
requested by the compiler due to allocation granularity. plain'
|
|
unit: SGPRs
|
|
LDS Allocation:
|
|
rst: 'The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
|
|
allocated for this kernel. Note: This may also be larger than what was requested
|
|
at compile time due to both allocation granularity and dynamic per-dispatch
|
|
LDS allocations.'
|
|
unit: Bytes per workgroup
|
|
Scratch Allocation:
|
|
rst: The number of bytes of :ref:`scratch memory <memory-spaces>` requested per
|
|
work-item for this kernel. Scratch memory is used for stack memory on the accelerator,
|
|
as well as for register spills and restores.
|
|
unit: Bytes per work-item
|
|
Kernel Time:
|
|
rst: The total duration of the executed kernel.
|
|
unit: Nanoseconds
|
|
Kernel Time (Cycles):
|
|
rst: The total duration of the executed kernel in cycles.
|
|
unit: Cycles
|
|
Instructions per wavefront:
|
|
rst: The average number of instructions (of all types) executed per wavefront.
|
|
This is averaged over all wavefronts in a kernel dispatch.
|
|
unit: Instructions per wavefront
|
|
Wave Cycles:
|
|
rst: 'The number of cycles a wavefront in the kernel dispatch spent resident on a
|
|
compute unit per :ref:`normalization unit <normalization-units>`. This is averaged
|
|
over all wavefronts in a kernel dispatch. Note: this should not be directly
|
|
compared to the kernel cycles above.'
|
|
unit: Cycles per normalization unit
|
|
Dependency Wait Cycles:
|
|
rst: The number of cycles a wavefront in the kernel dispatch stalled waiting on
|
|
memory of any kind (e.g., instruction fetch, vector or scalar memory, etc.)
|
|
per :ref:`normalization unit <normalization-units>`. This counter is incremented
|
|
at every cycle by *all* wavefronts on a CU stalled at a memory operation. As
|
|
such, it is most useful to get a sense of how waves were spending their time,
|
|
rather than identification of a precise limiter because another wave could
|
|
be actively executing while a wave is stalled. The sum of this metric, Issue
|
|
Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
|
|
unit: Cycles per normalization unit
|
|
Issue Wait Cycles:
|
|
rst: The number of cycles a wavefront in the kernel dispatch was unable to issue
|
|
an instruction for any reason (e.g., execution pipe back-pressure, arbitration
|
|
loss, etc.) per :ref:`normalization unit <normalization-units>`. This counter
|
|
is incremented at every cycle by *all* wavefronts on a CU unable to issue an instruction. As
|
|
such, it is most useful to get a sense of how waves were spending their time,
|
|
rather than identification of a precise limiter because another wave could
|
|
be actively executing while a wave is issue stalled. The sum of this metric,
|
|
Dependency Wait Cycles and Active Cycles should be equal to the total Wave
|
|
Cycles metric.
|
|
unit: Cycles per normalization unit
|
|
Active Cycles:
|
|
rst: The average number of cycles a wavefront in the kernel dispatch was actively
|
|
executing instructions per :ref:`normalization unit <normalization-units>`.
|
|
This measurement is made on a per-wavefront basis, and may include cycles that
|
|
another wavefront spent actively executing (on another execution unit, for
|
|
example) or was stalled. As such, it is most useful to get a sense of how
|
|
waves were spending their time, rather than identification of a precise limiter.
|
|
The sum of this metric, Issue Wait Cycles and Active Wait Cycles should be equal
|
|
to the total Wave Cycles metric.
|
|
unit: Cycles per normalization unit
|
|
Wavefront Occupancy:
|
|
rst: 'The time-averaged number of wavefronts resident on the accelerator over the
|
|
lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
|
kernels (less than 1ms).'
|
|
unit: Wavefronts
|
|
Overall instruction mix:
|
|
VALU:
|
|
rst: The total number of vector arithmetic logic unit (VALU) operations issued.
|
|
These are the workhorses of the :doc:`compute unit <compute-unit>`, and are
|
|
used to execute a wide range of instruction types including floating point
|
|
operations, non-uniform address calculations, transcendental operations, integer
|
|
operations, shifts, conditional evaluation, etc.
|
|
unit: Instructions
|
|
VMEM:
|
|
rst: The total number of vector memory operations issued. These include most loads,
|
|
stores and atomic operations and all accesses to :ref:`generic, global, private
|
|
and texture <memory-spaces>` memory.
|
|
unit: Instructions
|
|
LDS:
|
|
rst: The total number of LDS (also known as shared memory) operations issued. These
|
|
include loads, stores, atomics, and HIP's ``__shfl`` operations.
|
|
unit: Instructions
|
|
MFMA:
|
|
rst: The total number of matrix fused multiply-add instructions issued.
|
|
unit: Instructions
|
|
SALU:
|
|
rst: The total number of scalar arithmetic logic unit (SALU) operations issued.
|
|
Typically these are used for address calculations, literal constants, and other
|
|
operations that are provably uniform across a wavefront. Although scalar memory
|
|
(SMEM) operations are issued by the SALU, they are counted separately in this
|
|
section.
|
|
unit: Instructions
|
|
SMEM:
|
|
rst: The total number of scalar memory (SMEM) operations issued. These are typically
|
|
used for loading kernel arguments, base-pointers and loads from HIP's ``__constant__``
|
|
memory.
|
|
unit: Instructions
|
|
Branch:
|
|
rst: The total number of branch operations issued. These typically consist of jump
|
|
or branch operations and are used to implement control flow.
|
|
unit: Instructions
|
|
INT32:
|
|
rst: The total number of instructions operating on 32-bit integer operands issued
|
|
to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
INT64:
|
|
rst: The total number of instructions operating on 64-bit integer operands issued
|
|
to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F16-ADD:
|
|
rst: The total number of addition instructions operating on 16-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F16-MUL:
|
|
rst: The total number of multiplication instructions operating on 16-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F16-FMA:
|
|
rst: The total number of fused multiply-add instructions operating on 16-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F16-Trans:
|
|
rst: The total number of transcendental instructions (e.g., `sqrt`) operating on
|
|
16-bit floating-point operands issued to the VALU per :ref:`normalization unit
|
|
<normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F32-ADD:
|
|
rst: The total number of addition instructions operating on 32-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F32-MUL:
|
|
rst: The total number of multiplication instructions operating on 32-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F32-FMA:
|
|
rst: The total number of fused multiply-add instructions operating on 32-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F32-Trans:
|
|
rst: The total number of transcendental instructions (such as ``sqrt``) operating
|
|
on 32-bit floating-point operands issued to the VALU per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F64-ADD:
|
|
rst: The total number of addition instructions operating on 64-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F64-MUL:
|
|
rst: The total number of multiplication instructions operating on 64-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F64-FMA:
|
|
rst: The total number of fused multiply-add instructions operating on 64-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F64-Trans:
|
|
rst: The total number of transcendental instructions (such as `sqrt`) operating
|
|
on 64-bit floating-point operands issued to the VALU per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Conversion:
|
|
rst: "The total number of type conversion instructions (such as converting data\
|
|
\ to or from F32\u2194F64) issued to the VALU per :ref:`normalization unit\
|
|
\ <normalization-units>`."
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Instr:
|
|
rst: The total number of global & generic memory instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Read:
|
|
rst: The total number of global & generic memory read instructions executed on all
|
|
:doc:`compute units <compute-unit>` on the accelerator, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Write:
|
|
rst: The total number of global & generic memory write instructions executed on
|
|
all :doc:`compute units <compute-unit>` on the accelerator, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Atomic:
|
|
rst: The total number of global & generic memory atomic (with and without return)
|
|
instructions executed on all :doc:`compute units <compute-unit>` on the accelerator,
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Instr:
|
|
rst: The total number of spill/stack memory instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Read:
|
|
rst: The total number of spill/stack memory read instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Write:
|
|
rst: The total number of spill/stack memory write instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Atomic:
|
|
rst: The total number of spill/stack memory atomic (with and without return) instructions
|
|
executed on all :doc:`compute units <compute-unit>` on the accelerator, per
|
|
:ref:`normalization unit <normalization-units>`. Typically unused as these
|
|
memory operations are typically used to implement thread-local storage.
|
|
unit: Instructions per normalization unit
|
|
MFMA-I8:
|
|
rst: The total number of 8-bit integer :ref:`MFMA <desc-mfma>` instructions issued
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
MFMA-F8:
|
|
rst: The total number of 8-bit floating point :ref:`MFMA <desc-mfma>` instructions issued
|
|
per :ref:`normalization unit <normalization-units>`. This is supported in AMD
|
|
Instinct MI300 series and later only.
|
|
unit: Instructions per normalization unit
|
|
MFMA-F16:
|
|
rst: The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` instructions
|
|
issued per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
MFMA-BF16:
|
|
rst: The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>` instructions
|
|
issued per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
MFMA-F32:
|
|
rst: The total number of 32-bit floating-point :ref:`MFMA <desc-mfma>` instructions
|
|
issued per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
MFMA-F64:
|
|
rst: The total number of 64-bit floating-point :ref:`MFMA <desc-mfma>` instructions
|
|
issued per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
VALU arithmetic instruction mix:
|
|
VALU:
|
|
rst: The total number of vector arithmetic logic unit (VALU) operations issued.
|
|
These are the workhorses of the :doc:`compute unit <compute-unit>`, and are
|
|
used to execute a wide range of instruction types including floating point
|
|
operations, non-uniform address calculations, transcendental operations, integer
|
|
operations, shifts, conditional evaluation, etc.
|
|
unit: Instructions
|
|
VMEM:
|
|
rst: The total number of vector memory operations issued. These include most loads,
|
|
stores and atomic operations and all accesses to :ref:`generic, global, private
|
|
and texture <memory-spaces>` memory.
|
|
unit: Instructions
|
|
LDS:
|
|
rst: The total number of LDS (also known as shared memory) operations issued. These
|
|
include loads, stores, atomics, and HIP's ``__shfl`` operations.
|
|
unit: Instructions
|
|
MFMA:
|
|
rst: The total number of matrix fused multiply-add instructions issued.
|
|
unit: Instructions
|
|
SALU:
|
|
rst: The total number of scalar arithmetic logic unit (SALU) operations issued.
|
|
Typically these are used for address calculations, literal constants, and other
|
|
operations that are provably uniform across a wavefront. Although scalar memory
|
|
(SMEM) operations are issued by the SALU, they are counted separately in this
|
|
section.
|
|
unit: Instructions
|
|
SMEM:
|
|
rst: The total number of scalar memory (SMEM) operations issued. These are typically
|
|
used for loading kernel arguments, base-pointers and loads from HIP's ``__constant__``
|
|
memory.
|
|
unit: Instructions
|
|
Branch:
|
|
rst: The total number of branch operations issued. These typically consist of jump
|
|
or branch operations and are used to implement control flow.
|
|
unit: Instructions
|
|
INT32:
|
|
rst: The total number of instructions operating on 32-bit integer operands issued
|
|
to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
INT64:
|
|
rst: The total number of instructions operating on 64-bit integer operands issued
|
|
to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F16-ADD:
|
|
rst: The total number of addition instructions operating on 16-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F16-MUL:
|
|
rst: The total number of multiplication instructions operating on 16-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F16-FMA:
|
|
rst: The total number of fused multiply-add instructions operating on 16-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F16-Trans:
|
|
rst: The total number of transcendental instructions (e.g., `sqrt`) operating on
|
|
16-bit floating-point operands issued to the VALU per :ref:`normalization unit
|
|
<normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F32-ADD:
|
|
rst: The total number of addition instructions operating on 32-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F32-MUL:
|
|
rst: The total number of multiplication instructions operating on 32-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F32-FMA:
|
|
rst: The total number of fused multiply-add instructions operating on 32-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F32-Trans:
|
|
rst: The total number of transcendental instructions (such as ``sqrt``) operating
|
|
on 32-bit floating-point operands issued to the VALU per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F64-ADD:
|
|
rst: The total number of addition instructions operating on 64-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F64-MUL:
|
|
rst: The total number of multiplication instructions operating on 64-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F64-FMA:
|
|
rst: The total number of fused multiply-add instructions operating on 64-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F64-Trans:
|
|
rst: The total number of transcendental instructions (such as `sqrt`) operating
|
|
on 64-bit floating-point operands issued to the VALU per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Conversion:
|
|
rst: "The total number of type conversion instructions (such as converting data\
|
|
\ to or from F32\u2194F64) issued to the VALU per :ref:`normalization unit\
|
|
\ <normalization-units>`."
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Instr:
|
|
rst: The total number of global & generic memory instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Read:
|
|
rst: The total number of global & generic memory read instructions executed on all
|
|
:doc:`compute units <compute-unit>` on the accelerator, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Write:
|
|
rst: The total number of global & generic memory write instructions executed on
|
|
all :doc:`compute units <compute-unit>` on the accelerator, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Atomic:
|
|
rst: The total number of global & generic memory atomic (with and without return)
|
|
instructions executed on all :doc:`compute units <compute-unit>` on the accelerator,
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Instr:
|
|
rst: The total number of spill/stack memory instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Read:
|
|
rst: The total number of spill/stack memory read instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Write:
|
|
rst: The total number of spill/stack memory write instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Atomic:
|
|
rst: The total number of spill/stack memory atomic (with and without return) instructions
|
|
executed on all :doc:`compute units <compute-unit>` on the accelerator, per
|
|
:ref:`normalization unit <normalization-units>`. Typically unused as these
|
|
memory operations are typically used to implement thread-local storage.
|
|
unit: Instructions per normalization unit
|
|
MFMA-I8:
|
|
rst: The total number of 8-bit integer :ref:`MFMA <desc-mfma>` instructions issued
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
MFMA-F8:
|
|
rst: The total number of 8-bit floating point :ref:`MFMA <desc-mfma>` instructions issued
|
|
per :ref:`normalization unit <normalization-units>`. This is supported in AMD
|
|
Instinct MI300 series and later only.
|
|
unit: Instructions per normalization unit
|
|
MFMA-F16:
|
|
rst: The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` instructions
|
|
issued per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
MFMA-BF16:
|
|
rst: The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>` instructions
|
|
issued per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
MFMA-F32:
|
|
rst: The total number of 32-bit floating-point :ref:`MFMA <desc-mfma>` instructions
|
|
issued per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
MFMA-F64:
|
|
rst: The total number of 64-bit floating-point :ref:`MFMA <desc-mfma>` instructions
|
|
issued per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
MFMA instruction mix:
|
|
VALU:
|
|
rst: The total number of vector arithmetic logic unit (VALU) operations issued.
|
|
These are the workhorses of the :doc:`compute unit <compute-unit>`, and are
|
|
used to execute a wide range of instruction types including floating point
|
|
operations, non-uniform address calculations, transcendental operations, integer
|
|
operations, shifts, conditional evaluation, etc.
|
|
unit: Instructions
|
|
VMEM:
|
|
rst: The total number of vector memory operations issued. These include most loads,
|
|
stores and atomic operations and all accesses to :ref:`generic, global, private
|
|
and texture <memory-spaces>` memory.
|
|
unit: Instructions
|
|
LDS:
|
|
rst: The total number of LDS (also known as shared memory) operations issued. These
|
|
include loads, stores, atomics, and HIP's ``__shfl`` operations.
|
|
unit: Instructions
|
|
MFMA:
|
|
rst: The total number of matrix fused multiply-add instructions issued.
|
|
unit: Instructions
|
|
SALU:
|
|
rst: The total number of scalar arithmetic logic unit (SALU) operations issued.
|
|
Typically these are used for address calculations, literal constants, and other
|
|
operations that are provably uniform across a wavefront. Although scalar memory
|
|
(SMEM) operations are issued by the SALU, they are counted separately in this
|
|
section.
|
|
unit: Instructions
|
|
SMEM:
|
|
rst: The total number of scalar memory (SMEM) operations issued. These are typically
|
|
used for loading kernel arguments, base-pointers and loads from HIP's ``__constant__``
|
|
memory.
|
|
unit: Instructions
|
|
Branch:
|
|
rst: The total number of branch operations issued. These typically consist of jump
|
|
or branch operations and are used to implement control flow.
|
|
unit: Instructions
|
|
INT32:
|
|
rst: The total number of instructions operating on 32-bit integer operands issued
|
|
to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
INT64:
|
|
rst: The total number of instructions operating on 64-bit integer operands issued
|
|
to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F16-ADD:
|
|
rst: The total number of addition instructions operating on 16-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F16-MUL:
|
|
rst: The total number of multiplication instructions operating on 16-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F16-FMA:
|
|
rst: The total number of fused multiply-add instructions operating on 16-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F16-Trans:
|
|
rst: The total number of transcendental instructions (e.g., `sqrt`) operating on
|
|
16-bit floating-point operands issued to the VALU per :ref:`normalization unit
|
|
<normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F32-ADD:
|
|
rst: The total number of addition instructions operating on 32-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F32-MUL:
|
|
rst: The total number of multiplication instructions operating on 32-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F32-FMA:
|
|
rst: The total number of fused multiply-add instructions operating on 32-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F32-Trans:
|
|
rst: The total number of transcendental instructions (such as ``sqrt``) operating
|
|
on 32-bit floating-point operands issued to the VALU per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F64-ADD:
|
|
rst: The total number of addition instructions operating on 64-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F64-MUL:
|
|
rst: The total number of multiplication instructions operating on 64-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F64-FMA:
|
|
rst: The total number of fused multiply-add instructions operating on 64-bit floating-point
|
|
operands issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
F64-Trans:
|
|
rst: The total number of transcendental instructions (such as `sqrt`) operating
|
|
on 64-bit floating-point operands issued to the VALU per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Conversion:
|
|
rst: "The total number of type conversion instructions (such as converting data\
|
|
\ to or from F32\u2194F64) issued to the VALU per :ref:`normalization unit\
|
|
\ <normalization-units>`."
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Instr:
|
|
rst: The total number of global & generic memory instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Read:
|
|
rst: The total number of global & generic memory read instructions executed on all
|
|
:doc:`compute units <compute-unit>` on the accelerator, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Write:
|
|
rst: The total number of global & generic memory write instructions executed on
|
|
all :doc:`compute units <compute-unit>` on the accelerator, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Atomic:
|
|
rst: The total number of global & generic memory atomic (with and without return)
|
|
instructions executed on all :doc:`compute units <compute-unit>` on the accelerator,
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Instr:
|
|
rst: The total number of spill/stack memory instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Read:
|
|
rst: The total number of spill/stack memory read instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Write:
|
|
rst: The total number of spill/stack memory write instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Atomic:
|
|
rst: The total number of spill/stack memory atomic (with and without return) instructions
|
|
executed on all :doc:`compute units <compute-unit>` on the accelerator, per
|
|
:ref:`normalization unit <normalization-units>`. Typically unused as these
|
|
memory operations are typically used to implement thread-local storage.
|
|
unit: Instructions per normalization unit
|
|
MFMA-I8:
|
|
rst: The total number of 8-bit integer :ref:`MFMA <desc-mfma>` instructions issued
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
MFMA-F8:
|
|
rst: The total number of 8-bit floating point :ref:`MFMA <desc-mfma>` instructions issued
|
|
per :ref:`normalization unit <normalization-units>`. This is supported in AMD
|
|
Instinct MI300 series and later only.
|
|
unit: Instructions per normalization unit
|
|
MFMA-F16:
|
|
rst: The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` instructions
|
|
issued per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
MFMA-BF16:
|
|
rst: The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>` instructions
|
|
issued per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
MFMA-F32:
|
|
rst: The total number of 32-bit floating-point :ref:`MFMA <desc-mfma>` instructions
|
|
issued per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
MFMA-F64:
|
|
rst: The total number of 64-bit floating-point :ref:`MFMA <desc-mfma>` instructions
|
|
issued per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Compute Speed-of-Light:
|
|
VALU FLOPs:
|
|
rst: 'The total floating-point operations executed per second on the :ref:`VALU
|
|
<desc-valu>`. This is also presented as a percent of the peak theoretical FLOPs
|
|
achievable on the specific accelerator. Note: this does not include any floating-point
|
|
operations from :ref:`MFMA <desc-mfma>` instructions.'
|
|
unit: GFLOPs
|
|
VALU IOPs:
|
|
rst: 'The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
|
|
This is also presented as a percent of the peak theoretical IOPs achievable
|
|
on the specific accelerator. Note: this does not include any integer operations
|
|
from :ref:`MFMA <desc-mfma>` instructions.'
|
|
unit: GIOPs
|
|
MFMA FLOPs (BF16):
|
|
rst: 'The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>` operations
|
|
executed per second. Note: this does not include any 16-bit brain floating
|
|
point operations from :ref:`VALU <desc-valu>` instructions. This is also presented
|
|
as a percent of the peak theoretical BF16 MFMA operations achievable on the
|
|
specific accelerator.'
|
|
unit: GFLOPs
|
|
MFMA FLOPs (F16):
|
|
rst: 'The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
|
|
executed per second. Note: this does not include any 16-bit floating point
|
|
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
|
|
as a percent of the peak theoretical F16 MFMA operations achievable on the
|
|
specific accelerator.'
|
|
unit: GFLOPs
|
|
MFMA FLOPs (F32):
|
|
rst: 'The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
|
|
executed per second. Note: this does not include any 32-bit floating point
|
|
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
|
|
as a percent of the peak theoretical F32 MFMA operations achievable on the
|
|
specific accelerator.'
|
|
unit: GFLOPs
|
|
MFMA FLOPs (F64):
|
|
rst: 'The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
|
|
executed per second. Note: this does not include any 64-bit floating point
|
|
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
|
|
as a percent of the peak theoretical F64 MFMA operations achievable on the
|
|
specific accelerator. The total number of 64-bit floating point :ref:`MFMA
|
|
<desc-mfma>` operations executed per second. Note: this does not include any
|
|
64-bit floating point operations from :ref:`VALU <desc-valu>` instructions.
|
|
This is also presented as a percent of the peak theoretical F64 MFMA operations
|
|
achievable on the specific accelerator.'
|
|
unit: GFLOPs
|
|
MFMA IOPs (INT8):
|
|
rst: 'The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
|
|
per second. Note: this does not include any 8-bit integer operations from :ref:`VALU
|
|
<desc-valu>` instructions. This is also presented as a percent of the peak
|
|
theoretical INT8 MFMA operations achievable on the specific accelerator.'
|
|
unit: GFLOPs
|
|
IPC:
|
|
rst: The ratio of the total number of instructions executed on the :doc:`CU <compute-unit>`
|
|
over the :ref:`total active CU cycles <total-active-cu-cycles>`.
|
|
unit: Instructions per cycle
|
|
IPC (Issued):
|
|
rst: The ratio of the total number of (non-:ref:`internal <ipc-internal-instructions>`)
|
|
instructions issued over the number of cycles where the :ref:`scheduler <desc-scheduler>`
|
|
was actively working on issuing instructions. Refer to the :ref:`Issued IPC
|
|
<issued-ipc>` example for further detail.
|
|
unit: Instructions per cycle
|
|
SALU Utilization:
|
|
rst: Indicates what percent of the kernel's duration the :ref:`SALU <desc-salu>`
|
|
was busy executing instructions. Computed as the ratio of the total number
|
|
of cycles spent by the :ref:`scheduler <desc-scheduler>` issuing SALU / :ref:`SMEM
|
|
<desc-smem>` instructions over the :ref:`total CU cycles <total-cu-cycles>`.
|
|
unit: Percent
|
|
VALU Utilization:
|
|
rst: Indicates what percent of the kernel's duration the :ref:`VALU <desc-valu>`
|
|
was busy executing instructions. Does not include :ref:`VMEM <desc-vmem>` operations.
|
|
Computed as the ratio of the total number of cycles spent by the :ref:`scheduler
|
|
<desc-scheduler>` issuing VALU instructions over the :ref:`total CU cycles
|
|
<total-cu-cycles>`.
|
|
unit: Percent
|
|
VMEM Utilization:
|
|
rst: Indicates what percent of the kernel's duration the :ref:`VMEM <desc-vmem>`
|
|
unit was busy executing instructions, including both global/generic and spill/scratch
|
|
operations (see the :ref:`VMEM instruction count metrics <ta-instruction-counts>`
|
|
for more detail). Does not include :ref:`VALU <desc-valu>` operations. Computed as
|
|
the ratio of the total number of cycles spent by the :ref:`scheduler <desc-scheduler>`
|
|
issuing VMEM instructions over the :ref:`total CU cycles <total-cu-cycles>`.
|
|
unit: Percent
|
|
Branch Utilization:
|
|
rst: Indicates what percent of the kernel's duration the :ref:`branch <desc-branch>`
|
|
unit was busy executing instructions. Computed as the ratio of the total number
|
|
of cycles spent by the :ref:`scheduler <desc-scheduler>` issuing branch instructions
|
|
over the :ref:`total CU cycles <total-cu-cycles>`.
|
|
unit: Percent
|
|
VALU Active Threads:
|
|
rst: Indicates the average level of :ref:`divergence <desc-divergence>` within a
|
|
wavefront over the lifetime of the kernel. The number of work-items that were
|
|
active in a wavefront during execution of each :ref:`VALU <desc-valu>` instruction,
|
|
time-averaged over all VALU instructions run on all wavefronts in the kernel.
|
|
unit: Work-items
|
|
MFMA Utilization:
|
|
rst: Indicates what percent of the kernel's duration the :ref:`MFMA <desc-mfma>`
|
|
unit was busy executing instructions. Computed as the ratio of the total number
|
|
of cycles spent by the :ref:`MFMA <desc-salu>` was busy over the :ref:`total
|
|
CU cycles <total-cu-cycles>`.
|
|
unit: Percent
|
|
MFMA Instruction Cycles:
|
|
rst: The average duration of :ref:`MFMA <desc-mfma>` instructions in this kernel
|
|
in cycles. Computed as the ratio of the total number of cycles the MFMA unit
|
|
was busy over the total number of MFMA instructions. Compare to, for example,
|
|
the `AMD Matrix Instruction Calculator <https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator>`_.
|
|
unit: Cycles per instruction
|
|
VMEM Latency:
|
|
rst: The average number of round-trip cycles (that is, from issue to data return
|
|
/ acknowledgment) required for a VMEM instruction to complete.
|
|
unit: Cycles
|
|
SMEM Latency:
|
|
rst: The average number of round-trip cycles (that is, from issue to data return
|
|
/ acknowledgment) required for a SMEM instruction to complete.
|
|
unit: Cycles
|
|
FLOPs (Total):
|
|
rst: The total number of floating-point operations executed on either the :ref:`VALU
|
|
<desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
|
|
<normalization-units>`.
|
|
unit: FLOP per normalization unit
|
|
IOPs (Total):
|
|
rst: The total number of integer operations executed on either the :ref:`VALU
|
|
<desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
|
|
<normalization-units>`.
|
|
unit: IOP per normalization unit
|
|
F16 OPs:
|
|
rst: The total number of 16-bit floating-point operations executed on either the :ref:`VALU
|
|
<desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
|
|
<normalization-units>`.
|
|
unit: FLOP per normalization unit
|
|
BF16 OPs:
|
|
rst: 'The total number of 16-bit brain floating-point operations executed on either
|
|
the :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
|
|
unit <normalization-units>`. Note: on current CDNA accelerators, the VALU has
|
|
no native BF16 instructions.'
|
|
unit: FLOP per normalization unit
|
|
F32 OPs:
|
|
rst: The total number of 32-bit floating-point operations executed on either the
|
|
:ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: FLOP per normalization unit
|
|
F64 OPs:
|
|
rst: The total number of 64-bit floating-point operations executed on either the
|
|
:ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: FLOP per normalization unit
|
|
INT8 OPs:
|
|
rst: 'The total number of 8-bit integer operations executed on either the :ref:`VALU
|
|
<desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
|
|
<normalization-units>`. Note: on current CDNA accelerators, the VALU has no
|
|
native INT8 instructions.'
|
|
unit: IOP per normalization unit
|
|
Pipeline statistics:
|
|
VALU FLOPs:
|
|
rst: 'The total floating-point operations executed per second on the :ref:`VALU
|
|
<desc-valu>`. This is also presented as a percent of the peak theoretical FLOPs
|
|
achievable on the specific accelerator. Note: this does not include any floating-point
|
|
operations from :ref:`MFMA <desc-mfma>` instructions.'
|
|
unit: GFLOPs
|
|
VALU IOPs:
|
|
rst: 'The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
|
|
This is also presented as a percent of the peak theoretical IOPs achievable
|
|
on the specific accelerator. Note: this does not include any integer operations
|
|
from :ref:`MFMA <desc-mfma>` instructions.'
|
|
unit: GIOPs
|
|
MFMA FLOPs (BF16):
|
|
rst: 'The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>` operations
|
|
executed per second. Note: this does not include any 16-bit brain floating
|
|
point operations from :ref:`VALU <desc-valu>` instructions. This is also presented
|
|
as a percent of the peak theoretical BF16 MFMA operations achievable on the
|
|
specific accelerator.'
|
|
unit: GFLOPs
|
|
MFMA FLOPs (F16):
|
|
rst: 'The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
|
|
executed per second. Note: this does not include any 16-bit floating point
|
|
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
|
|
as a percent of the peak theoretical F16 MFMA operations achievable on the
|
|
specific accelerator.'
|
|
unit: GFLOPs
|
|
MFMA FLOPs (F32):
|
|
rst: 'The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
|
|
executed per second. Note: this does not include any 32-bit floating point
|
|
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
|
|
as a percent of the peak theoretical F32 MFMA operations achievable on the
|
|
specific accelerator.'
|
|
unit: GFLOPs
|
|
MFMA FLOPs (F64):
|
|
rst: 'The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
|
|
executed per second. Note: this does not include any 64-bit floating point
|
|
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
|
|
as a percent of the peak theoretical F64 MFMA operations achievable on the
|
|
specific accelerator. The total number of 64-bit floating point :ref:`MFMA
|
|
<desc-mfma>` operations executed per second. Note: this does not include any
|
|
64-bit floating point operations from :ref:`VALU <desc-valu>` instructions.
|
|
This is also presented as a percent of the peak theoretical F64 MFMA operations
|
|
achievable on the specific accelerator.'
|
|
unit: GFLOPs
|
|
MFMA IOPs (INT8):
|
|
rst: 'The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
|
|
per second. Note: this does not include any 8-bit integer operations from :ref:`VALU
|
|
<desc-valu>` instructions. This is also presented as a percent of the peak
|
|
theoretical INT8 MFMA operations achievable on the specific accelerator.'
|
|
unit: GFLOPs
|
|
IPC:
|
|
rst: The ratio of the total number of instructions executed on the :doc:`CU <compute-unit>`
|
|
over the :ref:`total active CU cycles <total-active-cu-cycles>`.
|
|
unit: Instructions per cycle
|
|
IPC (Issued):
|
|
rst: The ratio of the total number of (non-:ref:`internal <ipc-internal-instructions>`)
|
|
instructions issued over the number of cycles where the :ref:`scheduler <desc-scheduler>`
|
|
was actively working on issuing instructions. Refer to the :ref:`Issued IPC
|
|
<issued-ipc>` example for further detail.
|
|
unit: Instructions per cycle
|
|
SALU Utilization:
|
|
rst: Indicates what percent of the kernel's duration the :ref:`SALU <desc-salu>`
|
|
was busy executing instructions. Computed as the ratio of the total number
|
|
of cycles spent by the :ref:`scheduler <desc-scheduler>` issuing SALU / :ref:`SMEM
|
|
<desc-smem>` instructions over the :ref:`total CU cycles <total-cu-cycles>`.
|
|
unit: Percent
|
|
VALU Utilization:
|
|
rst: Indicates what percent of the kernel's duration the :ref:`VALU <desc-valu>`
|
|
was busy executing instructions. Does not include :ref:`VMEM <desc-vmem>` operations.
|
|
Computed as the ratio of the total number of cycles spent by the :ref:`scheduler
|
|
<desc-scheduler>` issuing VALU instructions over the :ref:`total CU cycles
|
|
<total-cu-cycles>`.
|
|
unit: Percent
|
|
VMEM Utilization:
|
|
rst: Indicates what percent of the kernel's duration the :ref:`VMEM <desc-vmem>`
|
|
unit was busy executing instructions, including both global/generic and spill/scratch
|
|
operations (see the :ref:`VMEM instruction count metrics <ta-instruction-counts>`
|
|
for more detail). Does not include :ref:`VALU <desc-valu>` operations. Computed as
|
|
the ratio of the total number of cycles spent by the :ref:`scheduler <desc-scheduler>`
|
|
issuing VMEM instructions over the :ref:`total CU cycles <total-cu-cycles>`.
|
|
unit: Percent
|
|
Branch Utilization:
|
|
rst: Indicates what percent of the kernel's duration the :ref:`branch <desc-branch>`
|
|
unit was busy executing instructions. Computed as the ratio of the total number
|
|
of cycles spent by the :ref:`scheduler <desc-scheduler>` issuing branch instructions
|
|
over the :ref:`total CU cycles <total-cu-cycles>`.
|
|
unit: Percent
|
|
VALU Active Threads:
|
|
rst: Indicates the average level of :ref:`divergence <desc-divergence>` within a
|
|
wavefront over the lifetime of the kernel. The number of work-items that were
|
|
active in a wavefront during execution of each :ref:`VALU <desc-valu>` instruction,
|
|
time-averaged over all VALU instructions run on all wavefronts in the kernel.
|
|
unit: Work-items
|
|
MFMA Utilization:
|
|
rst: Indicates what percent of the kernel's duration the :ref:`MFMA <desc-mfma>`
|
|
unit was busy executing instructions. Computed as the ratio of the total number
|
|
of cycles spent by the :ref:`MFMA <desc-salu>` was busy over the :ref:`total
|
|
CU cycles <total-cu-cycles>`.
|
|
unit: Percent
|
|
MFMA Instruction Cycles:
|
|
rst: The average duration of :ref:`MFMA <desc-mfma>` instructions in this kernel
|
|
in cycles. Computed as the ratio of the total number of cycles the MFMA unit
|
|
was busy over the total number of MFMA instructions. Compare to, for example,
|
|
the `AMD Matrix Instruction Calculator <https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator>`_.
|
|
unit: Cycles per instruction
|
|
VMEM Latency:
|
|
rst: The average number of round-trip cycles (that is, from issue to data return
|
|
/ acknowledgment) required for a VMEM instruction to complete.
|
|
unit: Cycles
|
|
SMEM Latency:
|
|
rst: The average number of round-trip cycles (that is, from issue to data return
|
|
/ acknowledgment) required for a SMEM instruction to complete.
|
|
unit: Cycles
|
|
FLOPs (Total):
|
|
rst: The total number of floating-point operations executed on either the :ref:`VALU
|
|
<desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
|
|
<normalization-units>`.
|
|
unit: FLOP per normalization unit
|
|
IOPs (Total):
|
|
rst: The total number of integer operations executed on either the :ref:`VALU
|
|
<desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
|
|
<normalization-units>`.
|
|
unit: IOP per normalization unit
|
|
F16 OPs:
|
|
rst: The total number of 16-bit floating-point operations executed on either the :ref:`VALU
|
|
<desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
|
|
<normalization-units>`.
|
|
unit: FLOP per normalization unit
|
|
BF16 OPs:
|
|
rst: 'The total number of 16-bit brain floating-point operations executed on either
|
|
the :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
|
|
unit <normalization-units>`. Note: on current CDNA accelerators, the VALU has
|
|
no native BF16 instructions.'
|
|
unit: FLOP per normalization unit
|
|
F32 OPs:
|
|
rst: The total number of 32-bit floating-point operations executed on either the
|
|
:ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: FLOP per normalization unit
|
|
F64 OPs:
|
|
rst: The total number of 64-bit floating-point operations executed on either the
|
|
:ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: FLOP per normalization unit
|
|
INT8 OPs:
|
|
rst: 'The total number of 8-bit integer operations executed on either the :ref:`VALU
|
|
<desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
|
|
<normalization-units>`. Note: on current CDNA accelerators, the VALU has no
|
|
native INT8 instructions.'
|
|
unit: IOP per normalization unit
|
|
Arithmetic operations:
|
|
VALU FLOPs:
|
|
rst: 'The total floating-point operations executed per second on the :ref:`VALU
|
|
<desc-valu>`. This is also presented as a percent of the peak theoretical FLOPs
|
|
achievable on the specific accelerator. Note: this does not include any floating-point
|
|
operations from :ref:`MFMA <desc-mfma>` instructions.'
|
|
unit: GFLOPs
|
|
VALU IOPs:
|
|
rst: 'The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
|
|
This is also presented as a percent of the peak theoretical IOPs achievable
|
|
on the specific accelerator. Note: this does not include any integer operations
|
|
from :ref:`MFMA <desc-mfma>` instructions.'
|
|
unit: GIOPs
|
|
MFMA FLOPs (BF16):
|
|
rst: 'The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>` operations
|
|
executed per second. Note: this does not include any 16-bit brain floating
|
|
point operations from :ref:`VALU <desc-valu>` instructions. This is also presented
|
|
as a percent of the peak theoretical BF16 MFMA operations achievable on the
|
|
specific accelerator.'
|
|
unit: GFLOPs
|
|
MFMA FLOPs (F16):
|
|
rst: 'The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
|
|
executed per second. Note: this does not include any 16-bit floating point
|
|
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
|
|
as a percent of the peak theoretical F16 MFMA operations achievable on the
|
|
specific accelerator.'
|
|
unit: GFLOPs
|
|
MFMA FLOPs (F32):
|
|
rst: 'The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
|
|
executed per second. Note: this does not include any 32-bit floating point
|
|
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
|
|
as a percent of the peak theoretical F32 MFMA operations achievable on the
|
|
specific accelerator.'
|
|
unit: GFLOPs
|
|
MFMA FLOPs (F64):
|
|
rst: 'The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
|
|
executed per second. Note: this does not include any 64-bit floating point
|
|
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
|
|
as a percent of the peak theoretical F64 MFMA operations achievable on the
|
|
specific accelerator. The total number of 64-bit floating point :ref:`MFMA
|
|
<desc-mfma>` operations executed per second. Note: this does not include any
|
|
64-bit floating point operations from :ref:`VALU <desc-valu>` instructions.
|
|
This is also presented as a percent of the peak theoretical F64 MFMA operations
|
|
achievable on the specific accelerator.'
|
|
unit: GFLOPs
|
|
MFMA IOPs (INT8):
|
|
rst: 'The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
|
|
per second. Note: this does not include any 8-bit integer operations from :ref:`VALU
|
|
<desc-valu>` instructions. This is also presented as a percent of the peak
|
|
theoretical INT8 MFMA operations achievable on the specific accelerator.'
|
|
unit: GFLOPs
|
|
IPC:
|
|
rst: The ratio of the total number of instructions executed on the :doc:`CU <compute-unit>`
|
|
over the :ref:`total active CU cycles <total-active-cu-cycles>`.
|
|
unit: Instructions per cycle
|
|
IPC (Issued):
|
|
rst: The ratio of the total number of (non-:ref:`internal <ipc-internal-instructions>`)
|
|
instructions issued over the number of cycles where the :ref:`scheduler <desc-scheduler>`
|
|
was actively working on issuing instructions. Refer to the :ref:`Issued IPC
|
|
<issued-ipc>` example for further detail.
|
|
unit: Instructions per cycle
|
|
SALU Utilization:
|
|
rst: Indicates what percent of the kernel's duration the :ref:`SALU <desc-salu>`
|
|
was busy executing instructions. Computed as the ratio of the total number
|
|
of cycles spent by the :ref:`scheduler <desc-scheduler>` issuing SALU / :ref:`SMEM
|
|
<desc-smem>` instructions over the :ref:`total CU cycles <total-cu-cycles>`.
|
|
unit: Percent
|
|
VALU Utilization:
|
|
rst: Indicates what percent of the kernel's duration the :ref:`VALU <desc-valu>`
|
|
was busy executing instructions. Does not include :ref:`VMEM <desc-vmem>` operations.
|
|
Computed as the ratio of the total number of cycles spent by the :ref:`scheduler
|
|
<desc-scheduler>` issuing VALU instructions over the :ref:`total CU cycles
|
|
<total-cu-cycles>`.
|
|
unit: Percent
|
|
VMEM Utilization:
|
|
rst: Indicates what percent of the kernel's duration the :ref:`VMEM <desc-vmem>`
|
|
unit was busy executing instructions, including both global/generic and spill/scratch
|
|
operations (see the :ref:`VMEM instruction count metrics <ta-instruction-counts>`
|
|
for more detail). Does not include :ref:`VALU <desc-valu>` operations. Computed as
|
|
the ratio of the total number of cycles spent by the :ref:`scheduler <desc-scheduler>`
|
|
issuing VMEM instructions over the :ref:`total CU cycles <total-cu-cycles>`.
|
|
unit: Percent
|
|
Branch Utilization:
|
|
rst: Indicates what percent of the kernel's duration the :ref:`branch <desc-branch>`
|
|
unit was busy executing instructions. Computed as the ratio of the total number
|
|
of cycles spent by the :ref:`scheduler <desc-scheduler>` issuing branch instructions
|
|
over the :ref:`total CU cycles <total-cu-cycles>`.
|
|
unit: Percent
|
|
VALU Active Threads:
|
|
rst: Indicates the average level of :ref:`divergence <desc-divergence>` within a
|
|
wavefront over the lifetime of the kernel. The number of work-items that were
|
|
active in a wavefront during execution of each :ref:`VALU <desc-valu>` instruction,
|
|
time-averaged over all VALU instructions run on all wavefronts in the kernel.
|
|
unit: Work-items
|
|
MFMA Utilization:
|
|
rst: Indicates what percent of the kernel's duration the :ref:`MFMA <desc-mfma>`
|
|
unit was busy executing instructions. Computed as the ratio of the total number
|
|
of cycles spent by the :ref:`MFMA <desc-salu>` was busy over the :ref:`total
|
|
CU cycles <total-cu-cycles>`.
|
|
unit: Percent
|
|
MFMA Instruction Cycles:
|
|
rst: The average duration of :ref:`MFMA <desc-mfma>` instructions in this kernel
|
|
in cycles. Computed as the ratio of the total number of cycles the MFMA unit
|
|
was busy over the total number of MFMA instructions. Compare to, for example,
|
|
the `AMD Matrix Instruction Calculator <https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator>`_.
|
|
unit: Cycles per instruction
|
|
VMEM Latency:
|
|
rst: The average number of round-trip cycles (that is, from issue to data return
|
|
/ acknowledgment) required for a VMEM instruction to complete.
|
|
unit: Cycles
|
|
SMEM Latency:
|
|
rst: The average number of round-trip cycles (that is, from issue to data return
|
|
/ acknowledgment) required for a SMEM instruction to complete.
|
|
unit: Cycles
|
|
FLOPs (Total):
|
|
rst: The total number of floating-point operations executed on either the :ref:`VALU
|
|
<desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
|
|
<normalization-units>`.
|
|
unit: FLOP per normalization unit
|
|
IOPs (Total):
|
|
rst: The total number of integer operations executed on either the :ref:`VALU
|
|
<desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
|
|
<normalization-units>`.
|
|
unit: IOP per normalization unit
|
|
F16 OPs:
|
|
rst: The total number of 16-bit floating-point operations executed on either the :ref:`VALU
|
|
<desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
|
|
<normalization-units>`.
|
|
unit: FLOP per normalization unit
|
|
BF16 OPs:
|
|
rst: 'The total number of 16-bit brain floating-point operations executed on either
|
|
the :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
|
|
unit <normalization-units>`. Note: on current CDNA accelerators, the VALU has
|
|
no native BF16 instructions.'
|
|
unit: FLOP per normalization unit
|
|
F32 OPs:
|
|
rst: The total number of 32-bit floating-point operations executed on either the
|
|
:ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: FLOP per normalization unit
|
|
F64 OPs:
|
|
rst: The total number of 64-bit floating-point operations executed on either the
|
|
:ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: FLOP per normalization unit
|
|
INT8 OPs:
|
|
rst: 'The total number of 8-bit integer operations executed on either the :ref:`VALU
|
|
<desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
|
|
<normalization-units>`. Note: on current CDNA accelerators, the VALU has no
|
|
native INT8 instructions.'
|
|
unit: IOP per normalization unit
|
|
LDS Speed-of-Light:
|
|
Utilization:
|
|
rst: Indicates what percent of the kernel's duration the :ref:`LDS <desc-lds>` was
|
|
actively executing instructions (including, but not limited to, load, store,
|
|
atomic and HIP's ``__shfl`` operations). Calculated as the ratio of the total
|
|
number of cycles LDS was active over the :ref:`total CU cycles <total-cu-cycles>`.
|
|
unit: Percent
|
|
Access Rate:
|
|
rst: Indicates the percentage of SIMDs in the :ref:`VALU <desc-valu>` [#lds-workload]_
|
|
actively issuing LDS instructions, averaged over the lifetime of the kernel.
|
|
Calculated as the ratio of the total number of cycles spent by the :ref:`scheduler
|
|
<desc-scheduler>` issuing :ref:`LDS <desc-lds>` instructions over the :ref:`total
|
|
CU cycles <total-cu-cycles>`.
|
|
unit: Percent
|
|
Theoretical Bandwidth:
|
|
rst: Indicates the maximum amount of bytes that could have been loaded from, stored
|
|
to, or atomically updated in the LDS per :ref:`normalization unit <normalization-units>`.
|
|
Does *not* take into account the execution mask of the wavefront when the instruction
|
|
was executed. See the :ref:`LDS bandwidth example <lds-bandwidth>` for more
|
|
detail.
|
|
unit: Bytes per normalization unit
|
|
Bank Conflict Rate:
|
|
rst: Indicates the percentage of active LDS cycles that were spent servicing bank
|
|
conflicts. Calculated as the ratio of LDS cycles spent servicing bank conflicts
|
|
over the number of LDS cycles that would have been required to move the same
|
|
amount of data in an uncontended access. [#lds-bank-conflict]_
|
|
unit: Percent
|
|
LDS Instructions:
|
|
rst: The total number of LDS instructions (including, but not limited to, read/write/atomics
|
|
and HIP's ``__shfl`` instructions) executed per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
LDS Latency:
|
|
rst: The average number of round-trip cycles (i.e., from issue to data-return /
|
|
acknowledgment) required for an LDS instruction to complete.
|
|
unit: Cycles
|
|
Bank Conflicts/Access:
|
|
rst: The ratio of the number of cycles spent in the :ref:`LDS scheduler <desc-lds>`
|
|
due to bank conflicts (as determined by the conflict resolution hardware) to
|
|
the base number of cycles that would be spent in the LDS scheduler in a completely
|
|
uncontended case. This is the unnormalized form of the Bank Conflict Rate.
|
|
unit: Conflicts per Access
|
|
Index Accesses:
|
|
rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>` over
|
|
all operations per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cycles per normalization unit
|
|
Atomic Return Cycles:
|
|
rst: The total number of cycles spent on LDS atomics with return per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Cycles per normalization unit
|
|
Bank Conflict:
|
|
rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>` due
|
|
to bank conflicts (as determined by the conflict resolution hardware) per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Cycles per normalization unit
|
|
Addr Conflict:
|
|
rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>` due
|
|
to address conflicts (as determined by the conflict resolution hardware) per
|
|
:ref:`normalization unit <normalization-units>`.
|
|
unit: Cycles per normalization unit
|
|
Unaligned Stall:
|
|
rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>` due
|
|
to stalls from non-dword aligned addresses per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cycles per normalization unit
|
|
Mem Violations:
|
|
rst: "The total number of out-of-bounds accesses made to the LDS, per :ref:`normalization\
|
|
\ unit <normalization-units>`. This is unused and expected to be zero in most\
|
|
\ configurations for modern CDNA\u2122 accelerators."
|
|
unit: Accesses per normalization unit
|
|
LDS Statistics:
|
|
Utilization:
|
|
rst: Indicates what percent of the kernel's duration the :ref:`LDS <desc-lds>` was
|
|
actively executing instructions (including, but not limited to, load, store,
|
|
atomic and HIP's ``__shfl`` operations). Calculated as the ratio of the total
|
|
number of cycles LDS was active over the :ref:`total CU cycles <total-cu-cycles>`.
|
|
unit: Percent
|
|
Access Rate:
|
|
rst: Indicates the percentage of SIMDs in the :ref:`VALU <desc-valu>` [#lds-workload]_
|
|
actively issuing LDS instructions, averaged over the lifetime of the kernel.
|
|
Calculated as the ratio of the total number of cycles spent by the :ref:`scheduler
|
|
<desc-scheduler>` issuing :ref:`LDS <desc-lds>` instructions over the :ref:`total
|
|
CU cycles <total-cu-cycles>`.
|
|
unit: Percent
|
|
Theoretical Bandwidth:
|
|
rst: Indicates the maximum amount of bytes that could have been loaded from, stored
|
|
to, or atomically updated in the LDS per :ref:`normalization unit <normalization-units>`.
|
|
Does *not* take into account the execution mask of the wavefront when the instruction
|
|
was executed. See the :ref:`LDS bandwidth example <lds-bandwidth>` for more
|
|
detail.
|
|
unit: Bytes per normalization unit
|
|
Bank Conflict Rate:
|
|
rst: Indicates the percentage of active LDS cycles that were spent servicing bank
|
|
conflicts. Calculated as the ratio of LDS cycles spent servicing bank conflicts
|
|
over the number of LDS cycles that would have been required to move the same
|
|
amount of data in an uncontended access. [#lds-bank-conflict]_
|
|
unit: Percent
|
|
LDS Instructions:
|
|
rst: The total number of LDS instructions (including, but not limited to, read/write/atomics
|
|
and HIP's ``__shfl`` instructions) executed per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
LDS Latency:
|
|
rst: The average number of round-trip cycles (i.e., from issue to data-return /
|
|
acknowledgment) required for an LDS instruction to complete.
|
|
unit: Cycles
|
|
Bank Conflicts/Access:
|
|
rst: The ratio of the number of cycles spent in the :ref:`LDS scheduler <desc-lds>`
|
|
due to bank conflicts (as determined by the conflict resolution hardware) to
|
|
the base number of cycles that would be spent in the LDS scheduler in a completely
|
|
uncontended case. This is the unnormalized form of the Bank Conflict Rate.
|
|
unit: Conflicts per Access
|
|
Index Accesses:
|
|
rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>` over
|
|
all operations per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cycles per normalization unit
|
|
Atomic Return Cycles:
|
|
rst: The total number of cycles spent on LDS atomics with return per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Cycles per normalization unit
|
|
Bank Conflict:
|
|
rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>` due
|
|
to bank conflicts (as determined by the conflict resolution hardware) per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Cycles per normalization unit
|
|
Addr Conflict:
|
|
rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>` due
|
|
to address conflicts (as determined by the conflict resolution hardware) per
|
|
:ref:`normalization unit <normalization-units>`.
|
|
unit: Cycles per normalization unit
|
|
Unaligned Stall:
|
|
rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>` due
|
|
to stalls from non-dword aligned addresses per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cycles per normalization unit
|
|
Mem Violations:
|
|
rst: "The total number of out-of-bounds accesses made to the LDS, per :ref:`normalization\
|
|
\ unit <normalization-units>`. This is unused and expected to be zero in most\
|
|
\ configurations for modern CDNA\u2122 accelerators."
|
|
unit: Accesses per normalization unit
|
|
vL1D Speed-of-Light:
|
|
Hit rate:
|
|
rst: The ratio of the number of vL1D cache line requests that hit [#vl1d-hit]_ in
|
|
vL1D cache over the total number of cache line requests to the :ref:`vL1D Cache
|
|
RAM <desc-tc>`.
|
|
unit: Percent
|
|
Bandwidth:
|
|
rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM
|
|
<desc-vmem>` instructions, as a percent of the peak theoretical bandwidth achievable
|
|
on the specific accelerator. The number of bytes is calculated as the number
|
|
of cache lines requested multiplied by the cache line size. This value does
|
|
not consider partial requests, so for instance, if only a single value is requested
|
|
in a cache line, the data movement will still be counted as a full cache line.
|
|
unit: Percent
|
|
Utilization:
|
|
rst: Indicates how busy the :ref:`vL1D Cache RAM <desc-tc>` was during the kernel
|
|
execution. The number of cycles where the vL1D Cache RAM is actively processing
|
|
any request divided by the number of cycles where the vL1D is active [#vl1d-activity]_.
|
|
unit: Percent
|
|
Coalescing:
|
|
rst: Indicates how well memory instructions were coalesced by the :ref:`address
|
|
processing unit <desc-ta>`, ranging from uncoalesced (25%) to fully coalesced
|
|
(100%). Calculated as the average number of :ref:`thread-requests <thread-requests>`
|
|
generated per instruction divided by the ideal number of thread-requests per
|
|
instruction.
|
|
unit: Percent
|
|
Stalled on L2 Data:
|
|
rst: The ratio of the number of cycles where the vL1D is stalled waiting for requested
|
|
data to return from the :doc:`L2 cache <l2-cache>` divided by the number of
|
|
cycles where the vL1D is active [#vl1d-activity]_.
|
|
unit: Percent
|
|
Stalled on L2 Req:
|
|
rst: The ratio of the number of cycles where the vL1D is stalled waiting to issue
|
|
a request for data to the :doc:`L2 cache <l2-cache>` divided by the number
|
|
of cycles where the vL1D is active [#vl1d-activity]_.
|
|
unit: Percent
|
|
Tag RAM Stall (Read):
|
|
rst: The ratio of the number of cycles where the vL1D is stalled due to Read requests
|
|
with conflicting tags being looked up concurrently, divided by the number of
|
|
cycles where the vL1D is active [#vl1d-activity]_.
|
|
unit: Percent
|
|
Tag RAM Stall (Write):
|
|
rst: The ratio of the number of cycles where the vL1D is stalled due to Write
|
|
requests with conflicting tags being looked up concurrently, divided by the
|
|
number of cycles where the vL1D is active [#vl1d-activity]_.
|
|
unit: Percent
|
|
Tag RAM Stall (Atomic):
|
|
rst: The ratio of the number of cycles where the vL1D is stalled due to Atomic
|
|
requests with conflicting tags being looked up concurrently, divided by the
|
|
number of cycles where the vL1D is active [#vl1d-activity]_.
|
|
unit: Percent
|
|
Total Req:
|
|
rst: The total number of incoming requests from the :ref:`address processing
|
|
unit <desc-ta>` after coalescing.
|
|
unit: Requests
|
|
Read Req:
|
|
rst: The total number of incoming read requests from the :ref:`address processing
|
|
unit <desc-ta>` after coalescing per :ref:`normalization unit <normalization-units>`
|
|
unit: Requests per normalization unit
|
|
Write Req:
|
|
rst: The total number of incoming write requests from the :ref:`address processing
|
|
unit <desc-ta>` after coalescing per :ref:`normalization unit <normalization-units>`
|
|
unit: Requests per normalization unit
|
|
Atomic Req:
|
|
rst: The total number of incoming atomic requests from the :ref:`address processing
|
|
unit <desc-ta>` after coalescing per :ref:`normalization unit <normalization-units>`
|
|
unit: Requests per normalization unit
|
|
Cache BW:
|
|
rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM
|
|
<desc-vmem>` instructions per :ref:`normalization unit <normalization-units>`. The
|
|
number of bytes is calculated as the number of cache lines requested multiplied
|
|
by the cache line size. This value does not consider partial requests, so
|
|
for instance, if only a single value is requested in a cache line, the data movement
|
|
will still be counted as a full cache line.
|
|
unit: Bytes per normalization unit
|
|
Cache Hit Rate:
|
|
rst: The ratio of the number of vL1D cache line requests that hit in vL1D cache
|
|
over the total number of cache line requests to the :ref:`vL1D Cache RAM <desc-tc>`.
|
|
unit: Percent
|
|
Cache Accesses:
|
|
rst: The total number of cache line lookups in the vL1D.
|
|
unit: Cache lines
|
|
Cache Hits:
|
|
rst: The number of cache accesses minus the number of outgoing requests to the :doc:`L2
|
|
cache <l2-cache>`, that is, the number of cache line requests serviced by the
|
|
:ref:`vL1D Cache RAM <desc-tc>` per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cache lines per normalization unit
|
|
Invalidations:
|
|
rst: The number of times the vL1D was issued a write-back invalidate command during
|
|
the kernel's execution per :ref:`normalization unit <normalization-units>`. This
|
|
may be triggered by, for instance, the ``buffer_wbinvl1`` instruction.
|
|
unit: Invalidations per normalization unit
|
|
L1-L2 BW:
|
|
rst: The number of bytes transferred across the vL1D-L2 interface as a result of
|
|
:ref:`VMEM <desc-vmem>` instructions, per :ref:`normalization unit <normalization-units>`.
|
|
The number of bytes is calculated as the number of cache lines requested multiplied
|
|
by the cache line size. This value does not consider partial requests, so for instance,
|
|
if only a single value is requested in a cache line, the data movement will
|
|
still be counted as a full cache line.
|
|
unit: Bytes per normalization unit
|
|
L1-L2 Read:
|
|
rst: The number of read requests for a vL1D cache line that were not satisfied by
|
|
the vL1D and must be retrieved from the to the :doc:`L2 Cache <l2-cache>` per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
L1-L2 Write:
|
|
rst: The number of write requests to a vL1D cache line that were sent through the
|
|
vL1D to the :doc:`L2 cache <l2-cache>`, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
L1-L2 Atomic:
|
|
rst: The number of atomic requests that are sent through the vL1D to the :doc:`L2
|
|
cache <l2-cache>`, per :ref:`normalization unit <normalization-units>`. This
|
|
includes requests for atomics with, and without return.
|
|
unit: Requests per normalization unit
|
|
L1 Access Latency:
|
|
rst: Calculated as the average number of cycles that a vL1D cache line request
|
|
spent in the vL1D cache pipeline.
|
|
unit: Cycles
|
|
L1-L2 Read Latency:
|
|
rst: Calculated as the average number of cycles that the vL1D cache took to issue
|
|
and receive read requests from the :doc:`L2 Cache <l2-cache>`. This number
|
|
also includes requests for atomics with return values.
|
|
unit: Cycles
|
|
L1-L2 Write Latency:
|
|
rst: Calculated as the average number of cycles that the vL1D cache took to issue
|
|
and receive acknowledgement of a write request to the :doc:`L2 Cache <l2-cache>`.
|
|
This number also includes requests for atomics without return values.
|
|
unit: Cycles
|
|
NC - Read:
|
|
rst: Total read requests with NC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
UC - Read:
|
|
rst: Total read requests with UC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
CC - Read:
|
|
rst: Total read requests with CC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
RW - Read:
|
|
rst: ''
|
|
unit: Requests per normalization unit
|
|
RW - Write:
|
|
rst: Total write requests with RW mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
NC - Write:
|
|
rst: Total write requests with NC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
UC - Write:
|
|
rst: Total write requests with UC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
CC - Write:
|
|
rst: Total write requests with CC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
NC - Atomic:
|
|
rst: Total atomic requests with NC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
UC - Atomic:
|
|
rst: Total atomic requests with UC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
CC - Atomic:
|
|
rst: Total atomic requests with CC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
RW - Atomic:
|
|
rst: Total atomic requests with RW mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
Req:
|
|
rst: The number of translation requests made to the UTCL1 per normalization unit.
|
|
unit: Requests per normalization unit
|
|
Hit Ratio:
|
|
rst: The ratio of the number of translation requests that hit in the UTCL1 divided
|
|
by the total number of translation requests made to the UTCL1.
|
|
unit: Percent
|
|
Hits:
|
|
rst: The number of translation requests that hit in the UTCL1, and could be reused,
|
|
per normalization unit.
|
|
unit: Requests per normalization unit
|
|
Translation Misses:
|
|
rst: The total number of translation requests that missed in the UTCL1 due to translation
|
|
not being present in the cache, per :ref:`normalization unit <normalization-units>`.
|
|
unit: unit
|
|
Permission Misses:
|
|
rst: "The total number of translation requests that missed in the UTCL1 due to\
|
|
\ a permission error, per :ref:`normalization unit <normalization-units>`.\
|
|
\ This is unused and expected to be zero in most configurations for modern\
|
|
\ CDNA\u2122 accelerators."
|
|
unit: Requests per normalization unit
|
|
Busy / stall metrics:
|
|
Address Processing Unit Busy:
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address processor
|
|
was busy
|
|
unit: Percent
|
|
Address Stall:
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address processor
|
|
was stalled from sending address requests further into the vL1D pipeline
|
|
unit: Percent
|
|
Data Stall:
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address processor
|
|
was stalled from sending write/atomic data further into the vL1D pipeline
|
|
unit: Percent
|
|
"Data-Processor \u2192 Address Stall":
|
|
rst: Percent of :ref:`total CU cycles <total-cu-cycles>` the address processor was
|
|
stalled waiting to send command data to the :ref:`data processor <desc-td>`
|
|
unit: Percent
|
|
Total Instructions:
|
|
rst: The total number of memory instructions executed by the address processer
|
|
over all compute units on the accelerator, per normalization unit.
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Instructions:
|
|
rst: The total number of global & generic memory instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Read Instructions:
|
|
rst: The total number of global & generic memory read instructions executed on all
|
|
:doc:`compute units <compute-unit>` on the accelerator, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Write Instructions:
|
|
rst: The total number of global & generic memory write instructions executed on
|
|
all :doc:`compute units <compute-unit>` on the accelerator, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Atomic Instructions:
|
|
rst: The total number of global & generic memory atomic (with and without return)
|
|
instructions executed on all :doc:`compute units <compute-unit>` on the accelerator,
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Instructions:
|
|
rst: The total number of spill/stack memory instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Read Instructions:
|
|
rst: The total number of spill/stack memory read instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Write Instructions:
|
|
rst: The total number of spill/stack memory write instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Atomic Instructions:
|
|
rst: The total number of spill/stack memory atomic (with and without return) instructions
|
|
executed on all :doc:`compute units <compute-unit>` on the accelerator, per
|
|
:ref:`normalization unit <normalization-units>`. Typically unused as these
|
|
memory operations are typically used to implement thread-local storage.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Total Cycles:
|
|
rst: The number of cycles the address processing unit spent working on spill/stack
|
|
instructions, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cycles per normalization unit
|
|
Spill/Stack Coalesced Read:
|
|
rst: The number of cycles the address processing unit spent working on coalesced
|
|
spill/stack read instructions, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cycles per normalization unit
|
|
Spill/Stack Coalesced Write:
|
|
rst: The number of cycles the address processing unit spent working on coalesced
|
|
spill/stack write instructions, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cycles per normalization unit
|
|
Data-Return Busy:
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return unit
|
|
was busy processing or waiting on data to return to the :doc:`CU <compute-unit>`.
|
|
unit: Percent
|
|
"Cache RAM \u2192 Data-Return Stall":
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return unit
|
|
was stalled on data to be returned from the :ref:`vL1D Cache RAM <desc-tc>`.
|
|
unit: Percent
|
|
"Workgroup manager \u2192 Data-Return Stall":
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return unit
|
|
was stalled by the :ref:`workgroup manager <desc-spi>` due to initialization
|
|
of registers as a part of launching new workgroups.
|
|
unit: Percent
|
|
Coalescable Instructions:
|
|
rst: The number of instructions submitted to the :ref:`data-return unit <desc-td>`
|
|
by the :ref:`address processor <desc-ta>` that were found to be coalescable,
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Read Instructions:
|
|
rst: The number of read instructions submitted to the :ref:`data-return unit
|
|
<desc-td>` by the :ref:`address processor <desc-ta>` summed over all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
This is expected to be the sum of global/generic and spill/stack reads in the :ref:`address
|
|
processor <desc-ta>`.
|
|
unit: Instructions per normalization unit
|
|
Write Instructions:
|
|
rst: The number of store instructions submitted to the :ref:`data-return unit
|
|
<desc-td>` by the :ref:`address processor <desc-ta>` summed over all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
This is expected to be the sum of global/generic and spill/stack stores counted
|
|
by the :ref:`vL1D cache-front-end <ta-instruction-counts>`.
|
|
unit: Instructions per normalization unit
|
|
Atomic Instructions:
|
|
rst: The number of atomic instructions submitted to the :ref:`data-return unit
|
|
<desc-td>` by the :ref:`address processor <desc-ta>` summed over all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
This is expected to be the sum of global/generic and spill/stack atomics in
|
|
the :ref:`address processor <desc-ta>`.
|
|
unit: Instructions per normalization unit
|
|
Instruction counts:
|
|
Address Processing Unit Busy:
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address processor
|
|
was busy
|
|
unit: Percent
|
|
Address Stall:
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address processor
|
|
was stalled from sending address requests further into the vL1D pipeline
|
|
unit: Percent
|
|
Data Stall:
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address processor
|
|
was stalled from sending write/atomic data further into the vL1D pipeline
|
|
unit: Percent
|
|
"Data-Processor \u2192 Address Stall":
|
|
rst: Percent of :ref:`total CU cycles <total-cu-cycles>` the address processor was
|
|
stalled waiting to send command data to the :ref:`data processor <desc-td>`
|
|
unit: Percent
|
|
Total Instructions:
|
|
rst: The total number of memory instructions executed by the address processer
|
|
over all compute units on the accelerator, per normalization unit.
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Instructions:
|
|
rst: The total number of global & generic memory instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Read Instructions:
|
|
rst: The total number of global & generic memory read instructions executed on all
|
|
:doc:`compute units <compute-unit>` on the accelerator, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Write Instructions:
|
|
rst: The total number of global & generic memory write instructions executed on
|
|
all :doc:`compute units <compute-unit>` on the accelerator, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Atomic Instructions:
|
|
rst: The total number of global & generic memory atomic (with and without return)
|
|
instructions executed on all :doc:`compute units <compute-unit>` on the accelerator,
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Instructions:
|
|
rst: The total number of spill/stack memory instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Read Instructions:
|
|
rst: The total number of spill/stack memory read instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Write Instructions:
|
|
rst: The total number of spill/stack memory write instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Atomic Instructions:
|
|
rst: The total number of spill/stack memory atomic (with and without return) instructions
|
|
executed on all :doc:`compute units <compute-unit>` on the accelerator, per
|
|
:ref:`normalization unit <normalization-units>`. Typically unused as these
|
|
memory operations are typically used to implement thread-local storage.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Total Cycles:
|
|
rst: The number of cycles the address processing unit spent working on spill/stack
|
|
instructions, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cycles per normalization unit
|
|
Spill/Stack Coalesced Read:
|
|
rst: The number of cycles the address processing unit spent working on coalesced
|
|
spill/stack read instructions, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cycles per normalization unit
|
|
Spill/Stack Coalesced Write:
|
|
rst: The number of cycles the address processing unit spent working on coalesced
|
|
spill/stack write instructions, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cycles per normalization unit
|
|
Data-Return Busy:
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return unit
|
|
was busy processing or waiting on data to return to the :doc:`CU <compute-unit>`.
|
|
unit: Percent
|
|
"Cache RAM \u2192 Data-Return Stall":
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return unit
|
|
was stalled on data to be returned from the :ref:`vL1D Cache RAM <desc-tc>`.
|
|
unit: Percent
|
|
"Workgroup manager \u2192 Data-Return Stall":
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return unit
|
|
was stalled by the :ref:`workgroup manager <desc-spi>` due to initialization
|
|
of registers as a part of launching new workgroups.
|
|
unit: Percent
|
|
Coalescable Instructions:
|
|
rst: The number of instructions submitted to the :ref:`data-return unit <desc-td>`
|
|
by the :ref:`address processor <desc-ta>` that were found to be coalescable,
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Read Instructions:
|
|
rst: The number of read instructions submitted to the :ref:`data-return unit
|
|
<desc-td>` by the :ref:`address processor <desc-ta>` summed over all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
This is expected to be the sum of global/generic and spill/stack reads in the :ref:`address
|
|
processor <desc-ta>`.
|
|
unit: Instructions per normalization unit
|
|
Write Instructions:
|
|
rst: The number of store instructions submitted to the :ref:`data-return unit
|
|
<desc-td>` by the :ref:`address processor <desc-ta>` summed over all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
This is expected to be the sum of global/generic and spill/stack stores counted
|
|
by the :ref:`vL1D cache-front-end <ta-instruction-counts>`.
|
|
unit: Instructions per normalization unit
|
|
Atomic Instructions:
|
|
rst: The number of atomic instructions submitted to the :ref:`data-return unit
|
|
<desc-td>` by the :ref:`address processor <desc-ta>` summed over all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
This is expected to be the sum of global/generic and spill/stack atomics in
|
|
the :ref:`address processor <desc-ta>`.
|
|
unit: Instructions per normalization unit
|
|
Spill / stack metrics:
|
|
Address Processing Unit Busy:
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address processor
|
|
was busy
|
|
unit: Percent
|
|
Address Stall:
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address processor
|
|
was stalled from sending address requests further into the vL1D pipeline
|
|
unit: Percent
|
|
Data Stall:
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address processor
|
|
was stalled from sending write/atomic data further into the vL1D pipeline
|
|
unit: Percent
|
|
"Data-Processor \u2192 Address Stall":
|
|
rst: Percent of :ref:`total CU cycles <total-cu-cycles>` the address processor was
|
|
stalled waiting to send command data to the :ref:`data processor <desc-td>`
|
|
unit: Percent
|
|
Total Instructions:
|
|
rst: The total number of memory instructions executed by the address processer
|
|
over all compute units on the accelerator, per normalization unit.
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Instructions:
|
|
rst: The total number of global & generic memory instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Read Instructions:
|
|
rst: The total number of global & generic memory read instructions executed on all
|
|
:doc:`compute units <compute-unit>` on the accelerator, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Write Instructions:
|
|
rst: The total number of global & generic memory write instructions executed on
|
|
all :doc:`compute units <compute-unit>` on the accelerator, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Atomic Instructions:
|
|
rst: The total number of global & generic memory atomic (with and without return)
|
|
instructions executed on all :doc:`compute units <compute-unit>` on the accelerator,
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Instructions:
|
|
rst: The total number of spill/stack memory instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Read Instructions:
|
|
rst: The total number of spill/stack memory read instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Write Instructions:
|
|
rst: The total number of spill/stack memory write instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Atomic Instructions:
|
|
rst: The total number of spill/stack memory atomic (with and without return) instructions
|
|
executed on all :doc:`compute units <compute-unit>` on the accelerator, per
|
|
:ref:`normalization unit <normalization-units>`. Typically unused as these
|
|
memory operations are typically used to implement thread-local storage.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Total Cycles:
|
|
rst: The number of cycles the address processing unit spent working on spill/stack
|
|
instructions, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cycles per normalization unit
|
|
Spill/Stack Coalesced Read:
|
|
rst: The number of cycles the address processing unit spent working on coalesced
|
|
spill/stack read instructions, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cycles per normalization unit
|
|
Spill/Stack Coalesced Write:
|
|
rst: The number of cycles the address processing unit spent working on coalesced
|
|
spill/stack write instructions, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cycles per normalization unit
|
|
Data-Return Busy:
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return unit
|
|
was busy processing or waiting on data to return to the :doc:`CU <compute-unit>`.
|
|
unit: Percent
|
|
"Cache RAM \u2192 Data-Return Stall":
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return unit
|
|
was stalled on data to be returned from the :ref:`vL1D Cache RAM <desc-tc>`.
|
|
unit: Percent
|
|
"Workgroup manager \u2192 Data-Return Stall":
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return unit
|
|
was stalled by the :ref:`workgroup manager <desc-spi>` due to initialization
|
|
of registers as a part of launching new workgroups.
|
|
unit: Percent
|
|
Coalescable Instructions:
|
|
rst: The number of instructions submitted to the :ref:`data-return unit <desc-td>`
|
|
by the :ref:`address processor <desc-ta>` that were found to be coalescable,
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Read Instructions:
|
|
rst: The number of read instructions submitted to the :ref:`data-return unit
|
|
<desc-td>` by the :ref:`address processor <desc-ta>` summed over all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
This is expected to be the sum of global/generic and spill/stack reads in the :ref:`address
|
|
processor <desc-ta>`.
|
|
unit: Instructions per normalization unit
|
|
Write Instructions:
|
|
rst: The number of store instructions submitted to the :ref:`data-return unit
|
|
<desc-td>` by the :ref:`address processor <desc-ta>` summed over all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
This is expected to be the sum of global/generic and spill/stack stores counted
|
|
by the :ref:`vL1D cache-front-end <ta-instruction-counts>`.
|
|
unit: Instructions per normalization unit
|
|
Atomic Instructions:
|
|
rst: The number of atomic instructions submitted to the :ref:`data-return unit
|
|
<desc-td>` by the :ref:`address processor <desc-ta>` summed over all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
This is expected to be the sum of global/generic and spill/stack atomics in
|
|
the :ref:`address processor <desc-ta>`.
|
|
unit: Instructions per normalization unit
|
|
L1 Unified Translation Cache (UTCL1):
|
|
Hit rate:
|
|
rst: The ratio of the number of vL1D cache line requests that hit [#vl1d-hit]_ in
|
|
vL1D cache over the total number of cache line requests to the :ref:`vL1D Cache
|
|
RAM <desc-tc>`.
|
|
unit: Percent
|
|
Bandwidth:
|
|
rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM
|
|
<desc-vmem>` instructions, as a percent of the peak theoretical bandwidth achievable
|
|
on the specific accelerator. The number of bytes is calculated as the number
|
|
of cache lines requested multiplied by the cache line size. This value does
|
|
not consider partial requests, so for instance, if only a single value is requested
|
|
in a cache line, the data movement will still be counted as a full cache line.
|
|
unit: Percent
|
|
Utilization:
|
|
rst: Indicates how busy the :ref:`vL1D Cache RAM <desc-tc>` was during the kernel
|
|
execution. The number of cycles where the vL1D Cache RAM is actively processing
|
|
any request divided by the number of cycles where the vL1D is active [#vl1d-activity]_.
|
|
unit: Percent
|
|
Coalescing:
|
|
rst: Indicates how well memory instructions were coalesced by the :ref:`address
|
|
processing unit <desc-ta>`, ranging from uncoalesced (25%) to fully coalesced
|
|
(100%). Calculated as the average number of :ref:`thread-requests <thread-requests>`
|
|
generated per instruction divided by the ideal number of thread-requests per
|
|
instruction.
|
|
unit: Percent
|
|
Stalled on L2 Data:
|
|
rst: The ratio of the number of cycles where the vL1D is stalled waiting for requested
|
|
data to return from the :doc:`L2 cache <l2-cache>` divided by the number of
|
|
cycles where the vL1D is active [#vl1d-activity]_.
|
|
unit: Percent
|
|
Stalled on L2 Req:
|
|
rst: The ratio of the number of cycles where the vL1D is stalled waiting to issue
|
|
a request for data to the :doc:`L2 cache <l2-cache>` divided by the number
|
|
of cycles where the vL1D is active [#vl1d-activity]_.
|
|
unit: Percent
|
|
Tag RAM Stall (Read):
|
|
rst: The ratio of the number of cycles where the vL1D is stalled due to Read requests
|
|
with conflicting tags being looked up concurrently, divided by the number of
|
|
cycles where the vL1D is active [#vl1d-activity]_.
|
|
unit: Percent
|
|
Tag RAM Stall (Write):
|
|
rst: The ratio of the number of cycles where the vL1D is stalled due to Write
|
|
requests with conflicting tags being looked up concurrently, divided by the
|
|
number of cycles where the vL1D is active [#vl1d-activity]_.
|
|
unit: Percent
|
|
Tag RAM Stall (Atomic):
|
|
rst: The ratio of the number of cycles where the vL1D is stalled due to Atomic
|
|
requests with conflicting tags being looked up concurrently, divided by the
|
|
number of cycles where the vL1D is active [#vl1d-activity]_.
|
|
unit: Percent
|
|
Total Req:
|
|
rst: The total number of incoming requests from the :ref:`address processing
|
|
unit <desc-ta>` after coalescing.
|
|
unit: Requests
|
|
Read Req:
|
|
rst: The total number of incoming read requests from the :ref:`address processing
|
|
unit <desc-ta>` after coalescing per :ref:`normalization unit <normalization-units>`
|
|
unit: Requests per normalization unit
|
|
Write Req:
|
|
rst: The total number of incoming write requests from the :ref:`address processing
|
|
unit <desc-ta>` after coalescing per :ref:`normalization unit <normalization-units>`
|
|
unit: Requests per normalization unit
|
|
Atomic Req:
|
|
rst: The total number of incoming atomic requests from the :ref:`address processing
|
|
unit <desc-ta>` after coalescing per :ref:`normalization unit <normalization-units>`
|
|
unit: Requests per normalization unit
|
|
Cache BW:
|
|
rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM
|
|
<desc-vmem>` instructions per :ref:`normalization unit <normalization-units>`. The
|
|
number of bytes is calculated as the number of cache lines requested multiplied
|
|
by the cache line size. This value does not consider partial requests, so
|
|
for instance, if only a single value is requested in a cache line, the data movement
|
|
will still be counted as a full cache line.
|
|
unit: Bytes per normalization unit
|
|
Cache Hit Rate:
|
|
rst: The ratio of the number of vL1D cache line requests that hit in vL1D cache
|
|
over the total number of cache line requests to the :ref:`vL1D Cache RAM <desc-tc>`.
|
|
unit: Percent
|
|
Cache Accesses:
|
|
rst: The total number of cache line lookups in the vL1D.
|
|
unit: Cache lines
|
|
Cache Hits:
|
|
rst: The number of cache accesses minus the number of outgoing requests to the :doc:`L2
|
|
cache <l2-cache>`, that is, the number of cache line requests serviced by the
|
|
:ref:`vL1D Cache RAM <desc-tc>` per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cache lines per normalization unit
|
|
Invalidations:
|
|
rst: The number of times the vL1D was issued a write-back invalidate command during
|
|
the kernel's execution per :ref:`normalization unit <normalization-units>`. This
|
|
may be triggered by, for instance, the ``buffer_wbinvl1`` instruction.
|
|
unit: Invalidations per normalization unit
|
|
L1-L2 BW:
|
|
rst: The number of bytes transferred across the vL1D-L2 interface as a result of
|
|
:ref:`VMEM <desc-vmem>` instructions, per :ref:`normalization unit <normalization-units>`.
|
|
The number of bytes is calculated as the number of cache lines requested multiplied
|
|
by the cache line size. This value does not consider partial requests, so for instance,
|
|
if only a single value is requested in a cache line, the data movement will
|
|
still be counted as a full cache line.
|
|
unit: Bytes per normalization unit
|
|
L1-L2 Read:
|
|
rst: The number of read requests for a vL1D cache line that were not satisfied by
|
|
the vL1D and must be retrieved from the to the :doc:`L2 Cache <l2-cache>` per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
L1-L2 Write:
|
|
rst: The number of write requests to a vL1D cache line that were sent through the
|
|
vL1D to the :doc:`L2 cache <l2-cache>`, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
L1-L2 Atomic:
|
|
rst: The number of atomic requests that are sent through the vL1D to the :doc:`L2
|
|
cache <l2-cache>`, per :ref:`normalization unit <normalization-units>`. This
|
|
includes requests for atomics with, and without return.
|
|
unit: Requests per normalization unit
|
|
L1 Access Latency:
|
|
rst: Calculated as the average number of cycles that a vL1D cache line request
|
|
spent in the vL1D cache pipeline.
|
|
unit: Cycles
|
|
L1-L2 Read Latency:
|
|
rst: Calculated as the average number of cycles that the vL1D cache took to issue
|
|
and receive read requests from the :doc:`L2 Cache <l2-cache>`. This number
|
|
also includes requests for atomics with return values.
|
|
unit: Cycles
|
|
L1-L2 Write Latency:
|
|
rst: Calculated as the average number of cycles that the vL1D cache took to issue
|
|
and receive acknowledgement of a write request to the :doc:`L2 Cache <l2-cache>`.
|
|
This number also includes requests for atomics without return values.
|
|
unit: Cycles
|
|
NC - Read:
|
|
rst: Total read requests with NC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
UC - Read:
|
|
rst: Total read requests with UC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
CC - Read:
|
|
rst: Total read requests with CC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
RW - Read:
|
|
rst: ''
|
|
unit: Requests per normalization unit
|
|
RW - Write:
|
|
rst: Total write requests with RW mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
NC - Write:
|
|
rst: Total write requests with NC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
UC - Write:
|
|
rst: Total write requests with UC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
CC - Write:
|
|
rst: Total write requests with CC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
NC - Atomic:
|
|
rst: Total atomic requests with NC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
UC - Atomic:
|
|
rst: Total atomic requests with UC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
CC - Atomic:
|
|
rst: Total atomic requests with CC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
RW - Atomic:
|
|
rst: Total atomic requests with RW mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
Req:
|
|
rst: The number of translation requests made to the UTCL1 per normalization unit.
|
|
unit: Requests per normalization unit
|
|
Hit Ratio:
|
|
rst: The ratio of the number of translation requests that hit in the UTCL1 divided
|
|
by the total number of translation requests made to the UTCL1.
|
|
unit: Percent
|
|
Hits:
|
|
rst: The number of translation requests that hit in the UTCL1, and could be reused,
|
|
per normalization unit.
|
|
unit: Requests per normalization unit
|
|
Translation Misses:
|
|
rst: The total number of translation requests that missed in the UTCL1 due to translation
|
|
not being present in the cache, per :ref:`normalization unit <normalization-units>`.
|
|
unit: unit
|
|
Permission Misses:
|
|
rst: "The total number of translation requests that missed in the UTCL1 due to\
|
|
\ a permission error, per :ref:`normalization unit <normalization-units>`.\
|
|
\ This is unused and expected to be zero in most configurations for modern\
|
|
\ CDNA\u2122 accelerators."
|
|
unit: Requests per normalization unit
|
|
vL1D cache stall metrics:
|
|
Hit rate:
|
|
rst: The ratio of the number of vL1D cache line requests that hit [#vl1d-hit]_ in
|
|
vL1D cache over the total number of cache line requests to the :ref:`vL1D Cache
|
|
RAM <desc-tc>`.
|
|
unit: Percent
|
|
Bandwidth:
|
|
rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM
|
|
<desc-vmem>` instructions, as a percent of the peak theoretical bandwidth achievable
|
|
on the specific accelerator. The number of bytes is calculated as the number
|
|
of cache lines requested multiplied by the cache line size. This value does
|
|
not consider partial requests, so for instance, if only a single value is requested
|
|
in a cache line, the data movement will still be counted as a full cache line.
|
|
unit: Percent
|
|
Utilization:
|
|
rst: Indicates how busy the :ref:`vL1D Cache RAM <desc-tc>` was during the kernel
|
|
execution. The number of cycles where the vL1D Cache RAM is actively processing
|
|
any request divided by the number of cycles where the vL1D is active [#vl1d-activity]_.
|
|
unit: Percent
|
|
Coalescing:
|
|
rst: Indicates how well memory instructions were coalesced by the :ref:`address
|
|
processing unit <desc-ta>`, ranging from uncoalesced (25%) to fully coalesced
|
|
(100%). Calculated as the average number of :ref:`thread-requests <thread-requests>`
|
|
generated per instruction divided by the ideal number of thread-requests per
|
|
instruction.
|
|
unit: Percent
|
|
Stalled on L2 Data:
|
|
rst: The ratio of the number of cycles where the vL1D is stalled waiting for requested
|
|
data to return from the :doc:`L2 cache <l2-cache>` divided by the number of
|
|
cycles where the vL1D is active [#vl1d-activity]_.
|
|
unit: Percent
|
|
Stalled on L2 Req:
|
|
rst: The ratio of the number of cycles where the vL1D is stalled waiting to issue
|
|
a request for data to the :doc:`L2 cache <l2-cache>` divided by the number
|
|
of cycles where the vL1D is active [#vl1d-activity]_.
|
|
unit: Percent
|
|
Tag RAM Stall (Read):
|
|
rst: The ratio of the number of cycles where the vL1D is stalled due to Read requests
|
|
with conflicting tags being looked up concurrently, divided by the number of
|
|
cycles where the vL1D is active [#vl1d-activity]_.
|
|
unit: Percent
|
|
Tag RAM Stall (Write):
|
|
rst: The ratio of the number of cycles where the vL1D is stalled due to Write
|
|
requests with conflicting tags being looked up concurrently, divided by the
|
|
number of cycles where the vL1D is active [#vl1d-activity]_.
|
|
unit: Percent
|
|
Tag RAM Stall (Atomic):
|
|
rst: The ratio of the number of cycles where the vL1D is stalled due to Atomic
|
|
requests with conflicting tags being looked up concurrently, divided by the
|
|
number of cycles where the vL1D is active [#vl1d-activity]_.
|
|
unit: Percent
|
|
Total Req:
|
|
rst: The total number of incoming requests from the :ref:`address processing
|
|
unit <desc-ta>` after coalescing.
|
|
unit: Requests
|
|
Read Req:
|
|
rst: The total number of incoming read requests from the :ref:`address processing
|
|
unit <desc-ta>` after coalescing per :ref:`normalization unit <normalization-units>`
|
|
unit: Requests per normalization unit
|
|
Write Req:
|
|
rst: The total number of incoming write requests from the :ref:`address processing
|
|
unit <desc-ta>` after coalescing per :ref:`normalization unit <normalization-units>`
|
|
unit: Requests per normalization unit
|
|
Atomic Req:
|
|
rst: The total number of incoming atomic requests from the :ref:`address processing
|
|
unit <desc-ta>` after coalescing per :ref:`normalization unit <normalization-units>`
|
|
unit: Requests per normalization unit
|
|
Cache BW:
|
|
rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM
|
|
<desc-vmem>` instructions per :ref:`normalization unit <normalization-units>`. The
|
|
number of bytes is calculated as the number of cache lines requested multiplied
|
|
by the cache line size. This value does not consider partial requests, so
|
|
for instance, if only a single value is requested in a cache line, the data movement
|
|
will still be counted as a full cache line.
|
|
unit: Bytes per normalization unit
|
|
Cache Hit Rate:
|
|
rst: The ratio of the number of vL1D cache line requests that hit in vL1D cache
|
|
over the total number of cache line requests to the :ref:`vL1D Cache RAM <desc-tc>`.
|
|
unit: Percent
|
|
Cache Accesses:
|
|
rst: The total number of cache line lookups in the vL1D.
|
|
unit: Cache lines
|
|
Cache Hits:
|
|
rst: The number of cache accesses minus the number of outgoing requests to the :doc:`L2
|
|
cache <l2-cache>`, that is, the number of cache line requests serviced by the
|
|
:ref:`vL1D Cache RAM <desc-tc>` per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cache lines per normalization unit
|
|
Invalidations:
|
|
rst: The number of times the vL1D was issued a write-back invalidate command during
|
|
the kernel's execution per :ref:`normalization unit <normalization-units>`. This
|
|
may be triggered by, for instance, the ``buffer_wbinvl1`` instruction.
|
|
unit: Invalidations per normalization unit
|
|
L1-L2 BW:
|
|
rst: The number of bytes transferred across the vL1D-L2 interface as a result of
|
|
:ref:`VMEM <desc-vmem>` instructions, per :ref:`normalization unit <normalization-units>`.
|
|
The number of bytes is calculated as the number of cache lines requested multiplied
|
|
by the cache line size. This value does not consider partial requests, so for instance,
|
|
if only a single value is requested in a cache line, the data movement will
|
|
still be counted as a full cache line.
|
|
unit: Bytes per normalization unit
|
|
L1-L2 Read:
|
|
rst: The number of read requests for a vL1D cache line that were not satisfied by
|
|
the vL1D and must be retrieved from the to the :doc:`L2 Cache <l2-cache>` per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
L1-L2 Write:
|
|
rst: The number of write requests to a vL1D cache line that were sent through the
|
|
vL1D to the :doc:`L2 cache <l2-cache>`, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
L1-L2 Atomic:
|
|
rst: The number of atomic requests that are sent through the vL1D to the :doc:`L2
|
|
cache <l2-cache>`, per :ref:`normalization unit <normalization-units>`. This
|
|
includes requests for atomics with, and without return.
|
|
unit: Requests per normalization unit
|
|
L1 Access Latency:
|
|
rst: Calculated as the average number of cycles that a vL1D cache line request
|
|
spent in the vL1D cache pipeline.
|
|
unit: Cycles
|
|
L1-L2 Read Latency:
|
|
rst: Calculated as the average number of cycles that the vL1D cache took to issue
|
|
and receive read requests from the :doc:`L2 Cache <l2-cache>`. This number
|
|
also includes requests for atomics with return values.
|
|
unit: Cycles
|
|
L1-L2 Write Latency:
|
|
rst: Calculated as the average number of cycles that the vL1D cache took to issue
|
|
and receive acknowledgement of a write request to the :doc:`L2 Cache <l2-cache>`.
|
|
This number also includes requests for atomics without return values.
|
|
unit: Cycles
|
|
NC - Read:
|
|
rst: Total read requests with NC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
UC - Read:
|
|
rst: Total read requests with UC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
CC - Read:
|
|
rst: Total read requests with CC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
RW - Read:
|
|
rst: ''
|
|
unit: Requests per normalization unit
|
|
RW - Write:
|
|
rst: Total write requests with RW mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
NC - Write:
|
|
rst: Total write requests with NC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
UC - Write:
|
|
rst: Total write requests with UC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
CC - Write:
|
|
rst: Total write requests with CC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
NC - Atomic:
|
|
rst: Total atomic requests with NC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
UC - Atomic:
|
|
rst: Total atomic requests with UC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
CC - Atomic:
|
|
rst: Total atomic requests with CC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
RW - Atomic:
|
|
rst: Total atomic requests with RW mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
Req:
|
|
rst: The number of translation requests made to the UTCL1 per normalization unit.
|
|
unit: Requests per normalization unit
|
|
Hit Ratio:
|
|
rst: The ratio of the number of translation requests that hit in the UTCL1 divided
|
|
by the total number of translation requests made to the UTCL1.
|
|
unit: Percent
|
|
Hits:
|
|
rst: The number of translation requests that hit in the UTCL1, and could be reused,
|
|
per normalization unit.
|
|
unit: Requests per normalization unit
|
|
Translation Misses:
|
|
rst: The total number of translation requests that missed in the UTCL1 due to translation
|
|
not being present in the cache, per :ref:`normalization unit <normalization-units>`.
|
|
unit: unit
|
|
Permission Misses:
|
|
rst: "The total number of translation requests that missed in the UTCL1 due to\
|
|
\ a permission error, per :ref:`normalization unit <normalization-units>`.\
|
|
\ This is unused and expected to be zero in most configurations for modern\
|
|
\ CDNA\u2122 accelerators."
|
|
unit: Requests per normalization unit
|
|
vL1D cache access metrics:
|
|
Hit rate:
|
|
rst: The ratio of the number of vL1D cache line requests that hit [#vl1d-hit]_ in
|
|
vL1D cache over the total number of cache line requests to the :ref:`vL1D Cache
|
|
RAM <desc-tc>`.
|
|
unit: Percent
|
|
Bandwidth:
|
|
rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM
|
|
<desc-vmem>` instructions, as a percent of the peak theoretical bandwidth achievable
|
|
on the specific accelerator. The number of bytes is calculated as the number
|
|
of cache lines requested multiplied by the cache line size. This value does
|
|
not consider partial requests, so for instance, if only a single value is requested
|
|
in a cache line, the data movement will still be counted as a full cache line.
|
|
unit: Percent
|
|
Utilization:
|
|
rst: Indicates how busy the :ref:`vL1D Cache RAM <desc-tc>` was during the kernel
|
|
execution. The number of cycles where the vL1D Cache RAM is actively processing
|
|
any request divided by the number of cycles where the vL1D is active [#vl1d-activity]_.
|
|
unit: Percent
|
|
Coalescing:
|
|
rst: Indicates how well memory instructions were coalesced by the :ref:`address
|
|
processing unit <desc-ta>`, ranging from uncoalesced (25%) to fully coalesced
|
|
(100%). Calculated as the average number of :ref:`thread-requests <thread-requests>`
|
|
generated per instruction divided by the ideal number of thread-requests per
|
|
instruction.
|
|
unit: Percent
|
|
Stalled on L2 Data:
|
|
rst: The ratio of the number of cycles where the vL1D is stalled waiting for requested
|
|
data to return from the :doc:`L2 cache <l2-cache>` divided by the number of
|
|
cycles where the vL1D is active [#vl1d-activity]_.
|
|
unit: Percent
|
|
Stalled on L2 Req:
|
|
rst: The ratio of the number of cycles where the vL1D is stalled waiting to issue
|
|
a request for data to the :doc:`L2 cache <l2-cache>` divided by the number
|
|
of cycles where the vL1D is active [#vl1d-activity]_.
|
|
unit: Percent
|
|
Tag RAM Stall (Read):
|
|
rst: The ratio of the number of cycles where the vL1D is stalled due to Read requests
|
|
with conflicting tags being looked up concurrently, divided by the number of
|
|
cycles where the vL1D is active [#vl1d-activity]_.
|
|
unit: Percent
|
|
Tag RAM Stall (Write):
|
|
rst: The ratio of the number of cycles where the vL1D is stalled due to Write
|
|
requests with conflicting tags being looked up concurrently, divided by the
|
|
number of cycles where the vL1D is active [#vl1d-activity]_.
|
|
unit: Percent
|
|
Tag RAM Stall (Atomic):
|
|
rst: The ratio of the number of cycles where the vL1D is stalled due to Atomic
|
|
requests with conflicting tags being looked up concurrently, divided by the
|
|
number of cycles where the vL1D is active [#vl1d-activity]_.
|
|
unit: Percent
|
|
Total Req:
|
|
rst: The total number of incoming requests from the :ref:`address processing
|
|
unit <desc-ta>` after coalescing.
|
|
unit: Requests
|
|
Read Req:
|
|
rst: The total number of incoming read requests from the :ref:`address processing
|
|
unit <desc-ta>` after coalescing per :ref:`normalization unit <normalization-units>`
|
|
unit: Requests per normalization unit
|
|
Write Req:
|
|
rst: The total number of incoming write requests from the :ref:`address processing
|
|
unit <desc-ta>` after coalescing per :ref:`normalization unit <normalization-units>`
|
|
unit: Requests per normalization unit
|
|
Atomic Req:
|
|
rst: The total number of incoming atomic requests from the :ref:`address processing
|
|
unit <desc-ta>` after coalescing per :ref:`normalization unit <normalization-units>`
|
|
unit: Requests per normalization unit
|
|
Cache BW:
|
|
rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM
|
|
<desc-vmem>` instructions per :ref:`normalization unit <normalization-units>`. The
|
|
number of bytes is calculated as the number of cache lines requested multiplied
|
|
by the cache line size. This value does not consider partial requests, so
|
|
for instance, if only a single value is requested in a cache line, the data movement
|
|
will still be counted as a full cache line.
|
|
unit: Bytes per normalization unit
|
|
Cache Hit Rate:
|
|
rst: The ratio of the number of vL1D cache line requests that hit in vL1D cache
|
|
over the total number of cache line requests to the :ref:`vL1D Cache RAM <desc-tc>`.
|
|
unit: Percent
|
|
Cache Accesses:
|
|
rst: The total number of cache line lookups in the vL1D.
|
|
unit: Cache lines
|
|
Cache Hits:
|
|
rst: The number of cache accesses minus the number of outgoing requests to the :doc:`L2
|
|
cache <l2-cache>`, that is, the number of cache line requests serviced by the
|
|
:ref:`vL1D Cache RAM <desc-tc>` per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cache lines per normalization unit
|
|
Invalidations:
|
|
rst: The number of times the vL1D was issued a write-back invalidate command during
|
|
the kernel's execution per :ref:`normalization unit <normalization-units>`. This
|
|
may be triggered by, for instance, the ``buffer_wbinvl1`` instruction.
|
|
unit: Invalidations per normalization unit
|
|
L1-L2 BW:
|
|
rst: The number of bytes transferred across the vL1D-L2 interface as a result of
|
|
:ref:`VMEM <desc-vmem>` instructions, per :ref:`normalization unit <normalization-units>`.
|
|
The number of bytes is calculated as the number of cache lines requested multiplied
|
|
by the cache line size. This value does not consider partial requests, so for instance,
|
|
if only a single value is requested in a cache line, the data movement will
|
|
still be counted as a full cache line.
|
|
unit: Bytes per normalization unit
|
|
L1-L2 Read:
|
|
rst: The number of read requests for a vL1D cache line that were not satisfied by
|
|
the vL1D and must be retrieved from the to the :doc:`L2 Cache <l2-cache>` per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
L1-L2 Write:
|
|
rst: The number of write requests to a vL1D cache line that were sent through the
|
|
vL1D to the :doc:`L2 cache <l2-cache>`, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
L1-L2 Atomic:
|
|
rst: The number of atomic requests that are sent through the vL1D to the :doc:`L2
|
|
cache <l2-cache>`, per :ref:`normalization unit <normalization-units>`. This
|
|
includes requests for atomics with, and without return.
|
|
unit: Requests per normalization unit
|
|
L1 Access Latency:
|
|
rst: Calculated as the average number of cycles that a vL1D cache line request
|
|
spent in the vL1D cache pipeline.
|
|
unit: Cycles
|
|
L1-L2 Read Latency:
|
|
rst: Calculated as the average number of cycles that the vL1D cache took to issue
|
|
and receive read requests from the :doc:`L2 Cache <l2-cache>`. This number
|
|
also includes requests for atomics with return values.
|
|
unit: Cycles
|
|
L1-L2 Write Latency:
|
|
rst: Calculated as the average number of cycles that the vL1D cache took to issue
|
|
and receive acknowledgement of a write request to the :doc:`L2 Cache <l2-cache>`.
|
|
This number also includes requests for atomics without return values.
|
|
unit: Cycles
|
|
NC - Read:
|
|
rst: Total read requests with NC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
UC - Read:
|
|
rst: Total read requests with UC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
CC - Read:
|
|
rst: Total read requests with CC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
RW - Read:
|
|
rst: ''
|
|
unit: Requests per normalization unit
|
|
RW - Write:
|
|
rst: Total write requests with RW mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
NC - Write:
|
|
rst: Total write requests with NC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
UC - Write:
|
|
rst: Total write requests with UC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
CC - Write:
|
|
rst: Total write requests with CC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
NC - Atomic:
|
|
rst: Total atomic requests with NC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
UC - Atomic:
|
|
rst: Total atomic requests with UC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
CC - Atomic:
|
|
rst: Total atomic requests with CC mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
RW - Atomic:
|
|
rst: Total atomic requests with RW mtype from this TCP to all TCCs Sum over TCP
|
|
instances per normalization unit.
|
|
unit: Requests per normalization unit
|
|
Req:
|
|
rst: The number of translation requests made to the UTCL1 per normalization unit.
|
|
unit: Requests per normalization unit
|
|
Hit Ratio:
|
|
rst: The ratio of the number of translation requests that hit in the UTCL1 divided
|
|
by the total number of translation requests made to the UTCL1.
|
|
unit: Percent
|
|
Hits:
|
|
rst: The number of translation requests that hit in the UTCL1, and could be reused,
|
|
per normalization unit.
|
|
unit: Requests per normalization unit
|
|
Translation Misses:
|
|
rst: The total number of translation requests that missed in the UTCL1 due to translation
|
|
not being present in the cache, per :ref:`normalization unit <normalization-units>`.
|
|
unit: unit
|
|
Permission Misses:
|
|
rst: "The total number of translation requests that missed in the UTCL1 due to\
|
|
\ a permission error, per :ref:`normalization unit <normalization-units>`.\
|
|
\ This is unused and expected to be zero in most configurations for modern\
|
|
\ CDNA\u2122 accelerators."
|
|
unit: Requests per normalization unit
|
|
Vector L1 data-return path or Texture Data (TD):
|
|
Address Processing Unit Busy:
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address processor
|
|
was busy
|
|
unit: Percent
|
|
Address Stall:
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address processor
|
|
was stalled from sending address requests further into the vL1D pipeline
|
|
unit: Percent
|
|
Data Stall:
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the address processor
|
|
was stalled from sending write/atomic data further into the vL1D pipeline
|
|
unit: Percent
|
|
"Data-Processor \u2192 Address Stall":
|
|
rst: Percent of :ref:`total CU cycles <total-cu-cycles>` the address processor was
|
|
stalled waiting to send command data to the :ref:`data processor <desc-td>`
|
|
unit: Percent
|
|
Total Instructions:
|
|
rst: The total number of memory instructions executed by the address processer
|
|
over all compute units on the accelerator, per normalization unit.
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Instructions:
|
|
rst: The total number of global & generic memory instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Read Instructions:
|
|
rst: The total number of global & generic memory read instructions executed on all
|
|
:doc:`compute units <compute-unit>` on the accelerator, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Write Instructions:
|
|
rst: The total number of global & generic memory write instructions executed on
|
|
all :doc:`compute units <compute-unit>` on the accelerator, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Global/Generic Atomic Instructions:
|
|
rst: The total number of global & generic memory atomic (with and without return)
|
|
instructions executed on all :doc:`compute units <compute-unit>` on the accelerator,
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Instructions:
|
|
rst: The total number of spill/stack memory instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Read Instructions:
|
|
rst: The total number of spill/stack memory read instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Write Instructions:
|
|
rst: The total number of spill/stack memory write instructions executed on all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Atomic Instructions:
|
|
rst: The total number of spill/stack memory atomic (with and without return) instructions
|
|
executed on all :doc:`compute units <compute-unit>` on the accelerator, per
|
|
:ref:`normalization unit <normalization-units>`. Typically unused as these
|
|
memory operations are typically used to implement thread-local storage.
|
|
unit: Instructions per normalization unit
|
|
Spill/Stack Total Cycles:
|
|
rst: The number of cycles the address processing unit spent working on spill/stack
|
|
instructions, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cycles per normalization unit
|
|
Spill/Stack Coalesced Read:
|
|
rst: The number of cycles the address processing unit spent working on coalesced
|
|
spill/stack read instructions, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cycles per normalization unit
|
|
Spill/Stack Coalesced Write:
|
|
rst: The number of cycles the address processing unit spent working on coalesced
|
|
spill/stack write instructions, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cycles per normalization unit
|
|
Data-Return Busy:
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return unit
|
|
was busy processing or waiting on data to return to the :doc:`CU <compute-unit>`.
|
|
unit: Percent
|
|
"Cache RAM \u2192 Data-Return Stall":
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return unit
|
|
was stalled on data to be returned from the :ref:`vL1D Cache RAM <desc-tc>`.
|
|
unit: Percent
|
|
"Workgroup manager \u2192 Data-Return Stall":
|
|
rst: Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return unit
|
|
was stalled by the :ref:`workgroup manager <desc-spi>` due to initialization
|
|
of registers as a part of launching new workgroups.
|
|
unit: Percent
|
|
Coalescable Instructions:
|
|
rst: The number of instructions submitted to the :ref:`data-return unit <desc-td>`
|
|
by the :ref:`address processor <desc-ta>` that were found to be coalescable,
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Instructions per normalization unit
|
|
Read Instructions:
|
|
rst: The number of read instructions submitted to the :ref:`data-return unit
|
|
<desc-td>` by the :ref:`address processor <desc-ta>` summed over all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
This is expected to be the sum of global/generic and spill/stack reads in the :ref:`address
|
|
processor <desc-ta>`.
|
|
unit: Instructions per normalization unit
|
|
Write Instructions:
|
|
rst: The number of store instructions submitted to the :ref:`data-return unit
|
|
<desc-td>` by the :ref:`address processor <desc-ta>` summed over all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
This is expected to be the sum of global/generic and spill/stack stores counted
|
|
by the :ref:`vL1D cache-front-end <ta-instruction-counts>`.
|
|
unit: Instructions per normalization unit
|
|
Atomic Instructions:
|
|
rst: The number of atomic instructions submitted to the :ref:`data-return unit
|
|
<desc-td>` by the :ref:`address processor <desc-ta>` summed over all :doc:`compute
|
|
units <compute-unit>` on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
|
This is expected to be the sum of global/generic and spill/stack atomics in
|
|
the :ref:`address processor <desc-ta>`.
|
|
unit: Instructions per normalization unit
|
|
L2 Speed-of-Light:
|
|
Utilization:
|
|
rst: The ratio of the :ref:`number of cycles an L2 channel was active, summed
|
|
over all L2 channels on the accelerator <total-active-l2-cycles>` over the
|
|
:ref:`total L2 cycles <total-l2-cycles>`.
|
|
unit: Percent
|
|
Peak Bandwidth:
|
|
rst: The number of bytes looked up in the L2 cache, as a percent of the peak theoretical
|
|
bandwidth achievable on the specific accelerator. The number of bytes is calculated
|
|
as the number of cache lines requested multiplied by the cache line size. This
|
|
value does not consider partial requests, so e.g., if only a single value is
|
|
requested in a cache line, the data movement will still be counted as a full
|
|
cache line.
|
|
unit: Percent
|
|
Hit Rate:
|
|
rst: The ratio of the number of L2 cache line requests that hit in the L2 cache
|
|
over the total number of incoming cache line requests to the L2 cache.
|
|
unit: Percent
|
|
L2-Fabric Read BW:
|
|
rst: The number of bytes read by the L2 over the :ref:`Infinity Fabric interface
|
|
<l2-fabric>` per unit time.
|
|
unit: GB/s
|
|
L2-Fabric Write and Atomic BW:
|
|
rst: The number of bytes sent by the L2 over the :ref:`Infinity Fabric interface
|
|
<l2-fabric>` by write and atomic operations per unit time.
|
|
unit: GB/s
|
|
HBM Bandwidth:
|
|
rst: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory
|
|
(HBM) per unit time. This value is calculated as the number of HBM channels
|
|
multiplied by the HBM channel width multiplied by the HBM clock frequency.
|
|
unit: GB/s
|
|
Read BW:
|
|
rst: The total number of bytes read by the L2 cache from Infinity Fabric per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Bytes per normalization unit
|
|
HBM Read Traffic:
|
|
rst: The percent of read requests generated by the L2 cache that are routed to the
|
|
accelerator's local high-bandwidth memory (HBM). This breakdown does not consider
|
|
the *size* of the request (meaning that 32B and 64B requests are both counted
|
|
as a single request), so this metric only *approximates* the percent of the
|
|
L2-Fabric Read bandwidth directed to the local HBM.
|
|
unit: Percent
|
|
Remote Read Traffic:
|
|
rst: The percent of read requests generated by the L2 cache that are routed to any
|
|
memory location other than the accelerator's local high-bandwidth memory (HBM)
|
|
-- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown
|
|
does not consider the *size* of the request (meaning that 32B and 64B requests
|
|
are both counted as a single request), so this metric only *approximates* the
|
|
percent of the L2-Fabric Read bandwidth directed to a remote location.
|
|
unit: Percent
|
|
Uncached Read Traffic:
|
|
rst: The percent of read requests generated by the L2 cache that are reading from
|
|
an :ref:`uncached memory allocation <memory-type>`. Note, as described in the
|
|
:ref:`request flow <l2-request-flow>` section, a single 64B read request is
|
|
typically counted as two uncached read requests. So, it is possible for the
|
|
Uncached Read Traffic to reach up to 200% of the total number of read requests.
|
|
This breakdown does not consider the *size* of the request (i.e., 32B and 64B
|
|
requests are both counted as a single request), so this metric only *approximates*
|
|
the percent of the L2-Fabric read bandwidth directed to an uncached memory
|
|
location.
|
|
unit: Percent
|
|
Write and Atomic BW:
|
|
rst: The total number of bytes written by the L2 over Infinity Fabric by write and
|
|
atomic operations per :ref:`normalization unit <normalization-units>`. Note
|
|
that on current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests
|
|
are only considered *atomic* by Infinity Fabric if they are targeted at non-write-cacheable
|
|
memory, for example, :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached
|
|
memory <memory-type>` allocations on the MI2XX.
|
|
unit: Bytes per normalization unit
|
|
HBM Write and Atomic Traffic:
|
|
rst: The percent of write and atomic requests generated by the L2 cache that are
|
|
routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
|
|
does not consider the *size* of the request (meaning that 32B and 64B requests
|
|
are both counted as a single request), so this metric only *approximates* the
|
|
percent of the L2-Fabric Write and Atomic bandwidth directed to the local HBM.
|
|
Note that on current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`,
|
|
requests are only considered *atomic* by Infinity Fabric if they are targeted
|
|
at :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached
|
|
memory <memory-type>` allocations.
|
|
unit: Percent
|
|
Remote Write and Atomic Traffic:
|
|
rst: The percent of read requests generated by the L2 cache that are routed to any
|
|
memory location other than the accelerator's local high-bandwidth memory (HBM)
|
|
-- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown
|
|
does not consider the *size* of the request (meaning that 32B and 64B requests
|
|
are both counted as a single request), so this metric only *approximates* the
|
|
percent of the L2-Fabric Read bandwidth directed to a remote location. Note
|
|
that on current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests
|
|
are only considered *atomic* by Infinity Fabric if they are targeted at :ref:`fine-grained
|
|
memory <memory-type>` allocations or :ref:`uncached memory <memory-type>` allocations.
|
|
unit: Percent
|
|
Atomic Traffic:
|
|
rst: The percent of write requests generated by the L2 cache that are atomic requests
|
|
to *any* memory location. This breakdown does not consider the *size* of the
|
|
request (meaning that 32B and 64B requests are both counted as a single request),
|
|
so this metric only *approximates* the percent of the L2-Fabric Read bandwidth
|
|
directed to a remote location. Note that on current CDNA accelerators, such
|
|
as the :ref:`MI2XX <mixxx-note>`, requests are only considered *atomic* by
|
|
Infinity Fabric if they are targeted at :ref:`fine-grained memory <memory-type>`
|
|
allocations or :ref:`uncached memory <memory-type>` allocations.
|
|
unit: Percent
|
|
Uncached Write and Atomic Traffic:
|
|
rst: The percent of write and atomic requests generated by the L2 cache that are
|
|
targeting :ref:`uncached memory allocations <memory-type>`. This breakdown
|
|
does not consider the *size* of the request (meaning that 32B and 64B requests
|
|
are both counted as a single request), so this metric only *approximates* the
|
|
percent of the L2-Fabric read bandwidth directed to uncached memory allocations.
|
|
unit: Percent
|
|
Read Latency:
|
|
rst: The time-averaged number of cycles read requests spent in Infinity Fabric before
|
|
data was returned to the L2.
|
|
unit: Cycles
|
|
Write and Atomic Latency:
|
|
rst: The time-averaged number of cycles write requests spent in Infinity Fabric
|
|
before a completion acknowledgement was returned to the L2.
|
|
unit: Cycles
|
|
Atomic Latency:
|
|
rst: The time-averaged number of cycles atomic requests spent in Infinity Fabric
|
|
before a completion acknowledgement (atomic without return value) or data (atomic
|
|
with return value) was returned to the L2.
|
|
unit: Cycles
|
|
Bandwidth:
|
|
rst: The number of bytes looked up in the L2 cache, per :ref:`normalization unit
|
|
<normalization-units>`. The number of bytes is calculated as the number of
|
|
cache lines requested multiplied by the cache line size. This value does not
|
|
consider partial requests, so for example, if only a single value is requested
|
|
in a cache line, the data movement will still be counted as a full cache line.
|
|
unit: Bytes per normalization unit
|
|
Req:
|
|
rst: The total number of incoming requests to the L2 from all clients for all request
|
|
types, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Read Req:
|
|
rst: 'The total number of read requests to the L2 from all clients. '
|
|
unit: Requests per normalization unit
|
|
Write Req:
|
|
rst: The total number of write requests to the L2 from all clients.
|
|
unit: Requests per normalization unit
|
|
Atomic Req:
|
|
rst: The total number of atomic requests (with and without return) to the L2 from
|
|
all clients.
|
|
unit: Requests per normalization unit
|
|
Streaming Req:
|
|
rst: The total number of incoming requests to the L2 that are marked as *streaming*.
|
|
The exact meaning of this may differ depending on the targeted accelerator,
|
|
however on an :ref:`MI2XX <mixxx-note>` this corresponds to `non-temporal
|
|
load or stores <https://clang.llvm.org/docs/LanguageExtensions.html#non-temporal-load-store-builtins>`_. The
|
|
L2 cache attempts to evict *streaming* requests before normal requests when
|
|
the L2 is at capacity.
|
|
unit: Requests per normalization unit
|
|
Probe Req:
|
|
rst: The number of coherence probe requests made to the L2 cache from outside the
|
|
accelerator. On an :ref:`MI2XX <mixxx-note>`, probe requests may be generated
|
|
by, for example, writes to :ref:`fine-grained device <memory-type>` memory
|
|
or by writes to :ref:`coarse-grained <memory-type>` device memory.
|
|
unit: Requests per normalization unit
|
|
Cache Hit:
|
|
rst: The ratio of the number of L2 cache line requests that hit in the L2 cache
|
|
over the total number of incoming cache line requests to the L2 cache.
|
|
unit: Percent
|
|
Hits:
|
|
rst: The total number of requests to the L2 from all clients that hit in the cache.
|
|
As noted in the :ref:`Speed-of-Light <l2-sol>` section, this includes hit-on-miss
|
|
requests.
|
|
unit: Requests per normalization unit
|
|
Misses:
|
|
rst: The total number of requests to the L2 from all clients that miss in the cache.
|
|
As noted in the :ref:`Speed-of-Light <l2-sol>` section, these do not include
|
|
hit-on-miss requests.
|
|
unit: Requests per normalization unit
|
|
Writeback:
|
|
rst: The total number of L2 cache lines written back to memory for any reason. Write-backs
|
|
may occur due to user code (such as HIP kernel calls to ``__threadfence_system``
|
|
or atomic built-ins) by the :doc:`command processor <command-processor>`'s
|
|
memory acquire/release fences, or for other internal hardware reasons.
|
|
unit: Cache lines per normalization unit
|
|
Writeback (Internal):
|
|
rst: The total number of L2 cache lines written back to memory for internal hardware
|
|
reasons, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cache lines per normalization unit
|
|
Writeback (vL1D Req):
|
|
rst: The total number of L2 cache lines written back to memory due to requests initiated
|
|
by the :doc:`vL1D cache <vector-l1-cache>`, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cache lines per normalization unit
|
|
Evict (Internal):
|
|
rst: The total number of L2 cache lines evicted from the cache due to capacity limits,
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cache lines per normalization unit
|
|
Evict (vL1D Req):
|
|
rst: The total number of L2 cache lines evicted from the cache due to invalidation
|
|
requests initiated by the :doc:`vL1D cache <vector-l1-cache>`, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Cache lines per normalization unit
|
|
NC Req:
|
|
rst: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory
|
|
allocations, per :ref:`normalization unit <normalization-units>`. See the :ref:`memory-type`
|
|
for more information.
|
|
unit: Requests per normalization unit
|
|
UC Req:
|
|
rst: The total number of requests to the L2 that go to Uncached (UC) memory allocations.
|
|
See the :ref:`memory-type` for more information.
|
|
unit: Requests per normalization unit
|
|
CC Req:
|
|
rst: The total number of requests to the L2 that go to Coherently Cacheable (CC) memory
|
|
allocations. See the :ref:`memory-type` for more information.
|
|
unit: Requests per normalization unit
|
|
RW Req:
|
|
rst: The total number of requests to the L2 that go to Read-Write coherent memory (RW)
|
|
allocations. See the :ref:`memory-type` for more information.
|
|
unit: Requests per normalization unit
|
|
Write - Credit Starvation:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
|
|
requests to any memory location because too many write/atomic requests were
|
|
currently in flight, as a percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Read (32B):
|
|
rst: The total number of L2 requests to Infinity Fabric to read 32B of data from
|
|
any memory location, per :ref:`normalization unit <normalization-units>`. See :ref:`l2-request-flow`
|
|
for more detail. Typically unused on CDNA accelerators.
|
|
unit: Requests per normalization unit
|
|
Read (64B):
|
|
rst: The total number of L2 requests to Infinity Fabric to read 64B of data from
|
|
any memory location, per :ref:`normalization unit <normalization-units>`. See :ref:`l2-request-flow`
|
|
for more detail.
|
|
unit: Requests per normalization unit
|
|
Read (Uncached):
|
|
rst: The total number of L2 requests to Infinity Fabric to read :ref:`uncached
|
|
data <memory-type>` from any memory location, per :ref:`normalization unit
|
|
<normalization-units>`. 64B requests for uncached data are counted as two 32B
|
|
uncached data requests. See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
HBM Read:
|
|
rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data
|
|
from the accelerator's local HBM, per :ref:`normalization unit <normalization-units>`.
|
|
See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
Remote Read:
|
|
rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data
|
|
from any source other than the accelerator's local HBM, per :ref:`normalization
|
|
unit <normalization-units>`. See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
Write and Atomic (32B):
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
|
|
32B of data to any memory location, per :ref:`normalization unit <normalization-units>`.
|
|
See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
Write and Atomic (Uncached):
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
|
|
32B or 64B of :ref:`uncached data <memory-type>`, per :ref:`normalization unit
|
|
<normalization-units>`. See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
Write and Atomic (64B):
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
|
|
64B of data in any memory location, per :ref:`normalization unit <normalization-units>`.
|
|
See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
HBM Write and Atomic:
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
|
|
32B or 64B of data in the accelerator's local HBM, per :ref:`normalization
|
|
unit <normalization-units>`. See :ref:`l2-request-flow` for more detail. plain
|
|
unit: Requests per normalization unit
|
|
Remote Write and Atomic:
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
|
|
32B or 64B of data in any memory location other than the accelerator's local
|
|
HBM, per :ref:`normalization unit <normalization-units>`. See :ref:`l2-request-flow`
|
|
for more detail.
|
|
unit: Requests per normalization unit
|
|
Atomic:
|
|
rst: The total number of L2 requests to Infinity Fabric to atomically update 32B
|
|
or 64B of data in any memory location, per :ref:`normalization unit <normalization-units>`.
|
|
See :ref:`l2-request-flow` for more detail. Note that on current CDNA accelerators,
|
|
such as the :ref:`MI2XX <mixxx-note>`, requests are only considered *atomic*
|
|
by Infinity Fabric if they are targeted at non-write-cacheable memory, such
|
|
as :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached
|
|
memory <memory-type>` allocations on the MI2XX.
|
|
unit: Requests per normalization unit
|
|
Read Stall:
|
|
rst: "The ratio of the total number of cycles the L2-Fabric interface was stalled\
|
|
\ on a read request to any destination (local HBM, remote PCIe\xAE connected\
|
|
\ accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_\
|
|
\ or CPU) over the :ref:`total active L2 cycles <total-active-l2-cycles>`."
|
|
unit: Percent
|
|
Write Stall:
|
|
rst: The ratio of the total number of cycles the L2-Fabric interface was stalled
|
|
on a write or atomic request to any destination (local HBM, remote accelerator
|
|
or CPU, PCIe connected accelerator or CPU, or remote Infinity Fabric connected
|
|
accelerator [#inf]_ or CPU) over the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Read - PCIe Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on read requests
|
|
to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total
|
|
active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Read - Infinity Fabric Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on read requests
|
|
to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a percent
|
|
of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Read - HBM Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on read requests
|
|
to the accelerator's local HBM as a percent of the :ref:`total active L2 cycles
|
|
<total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Write - PCIe Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
|
|
requests to remote PCIe connected accelerators [#inf]_ or CPUs as a percent
|
|
of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Write - Infinity Fabric Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
|
|
requests to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as
|
|
a percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Write - HBM Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
|
|
requests to accelerator's local HBM as a percent of the total active L2 cycles.
|
|
unit: Percent
|
|
L2 cache accesses:
|
|
Utilization:
|
|
rst: The ratio of the :ref:`number of cycles an L2 channel was active, summed
|
|
over all L2 channels on the accelerator <total-active-l2-cycles>` over the
|
|
:ref:`total L2 cycles <total-l2-cycles>`.
|
|
unit: Percent
|
|
Peak Bandwidth:
|
|
rst: The number of bytes looked up in the L2 cache, as a percent of the peak theoretical
|
|
bandwidth achievable on the specific accelerator. The number of bytes is calculated
|
|
as the number of cache lines requested multiplied by the cache line size. This
|
|
value does not consider partial requests, so e.g., if only a single value is
|
|
requested in a cache line, the data movement will still be counted as a full
|
|
cache line.
|
|
unit: Percent
|
|
Hit Rate:
|
|
rst: The ratio of the number of L2 cache line requests that hit in the L2 cache
|
|
over the total number of incoming cache line requests to the L2 cache.
|
|
unit: Percent
|
|
L2-Fabric Read BW:
|
|
rst: The number of bytes read by the L2 over the :ref:`Infinity Fabric interface
|
|
<l2-fabric>` per unit time.
|
|
unit: GB/s
|
|
L2-Fabric Write and Atomic BW:
|
|
rst: The number of bytes sent by the L2 over the :ref:`Infinity Fabric interface
|
|
<l2-fabric>` by write and atomic operations per unit time.
|
|
unit: GB/s
|
|
HBM Bandwidth:
|
|
rst: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory
|
|
(HBM) per unit time. This value is calculated as the number of HBM channels
|
|
multiplied by the HBM channel width multiplied by the HBM clock frequency.
|
|
unit: GB/s
|
|
Read BW:
|
|
rst: The total number of bytes read by the L2 cache from Infinity Fabric per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Bytes per normalization unit
|
|
HBM Read Traffic:
|
|
rst: The percent of read requests generated by the L2 cache that are routed to the
|
|
accelerator's local high-bandwidth memory (HBM). This breakdown does not consider
|
|
the *size* of the request (meaning that 32B and 64B requests are both counted
|
|
as a single request), so this metric only *approximates* the percent of the
|
|
L2-Fabric Read bandwidth directed to the local HBM.
|
|
unit: Percent
|
|
Remote Read Traffic:
|
|
rst: The percent of read requests generated by the L2 cache that are routed to any
|
|
memory location other than the accelerator's local high-bandwidth memory (HBM)
|
|
-- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown
|
|
does not consider the *size* of the request (meaning that 32B and 64B requests
|
|
are both counted as a single request), so this metric only *approximates* the
|
|
percent of the L2-Fabric Read bandwidth directed to a remote location.
|
|
unit: Percent
|
|
Uncached Read Traffic:
|
|
rst: The percent of read requests generated by the L2 cache that are reading from
|
|
an :ref:`uncached memory allocation <memory-type>`. Note, as described in the
|
|
:ref:`request flow <l2-request-flow>` section, a single 64B read request is
|
|
typically counted as two uncached read requests. So, it is possible for the
|
|
Uncached Read Traffic to reach up to 200% of the total number of read requests.
|
|
This breakdown does not consider the *size* of the request (i.e., 32B and 64B
|
|
requests are both counted as a single request), so this metric only *approximates*
|
|
the percent of the L2-Fabric read bandwidth directed to an uncached memory
|
|
location.
|
|
unit: Percent
|
|
Write and Atomic BW:
|
|
rst: The total number of bytes written by the L2 over Infinity Fabric by write and
|
|
atomic operations per :ref:`normalization unit <normalization-units>`. Note
|
|
that on current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests
|
|
are only considered *atomic* by Infinity Fabric if they are targeted at non-write-cacheable
|
|
memory, for example, :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached
|
|
memory <memory-type>` allocations on the MI2XX.
|
|
unit: Bytes per normalization unit
|
|
HBM Write and Atomic Traffic:
|
|
rst: The percent of write and atomic requests generated by the L2 cache that are
|
|
routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
|
|
does not consider the *size* of the request (meaning that 32B and 64B requests
|
|
are both counted as a single request), so this metric only *approximates* the
|
|
percent of the L2-Fabric Write and Atomic bandwidth directed to the local HBM.
|
|
Note that on current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`,
|
|
requests are only considered *atomic* by Infinity Fabric if they are targeted
|
|
at :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached
|
|
memory <memory-type>` allocations.
|
|
unit: Percent
|
|
Remote Write and Atomic Traffic:
|
|
rst: The percent of read requests generated by the L2 cache that are routed to any
|
|
memory location other than the accelerator's local high-bandwidth memory (HBM)
|
|
-- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown
|
|
does not consider the *size* of the request (meaning that 32B and 64B requests
|
|
are both counted as a single request), so this metric only *approximates* the
|
|
percent of the L2-Fabric Read bandwidth directed to a remote location. Note
|
|
that on current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests
|
|
are only considered *atomic* by Infinity Fabric if they are targeted at :ref:`fine-grained
|
|
memory <memory-type>` allocations or :ref:`uncached memory <memory-type>` allocations.
|
|
unit: Percent
|
|
Atomic Traffic:
|
|
rst: The percent of write requests generated by the L2 cache that are atomic requests
|
|
to *any* memory location. This breakdown does not consider the *size* of the
|
|
request (meaning that 32B and 64B requests are both counted as a single request),
|
|
so this metric only *approximates* the percent of the L2-Fabric Read bandwidth
|
|
directed to a remote location. Note that on current CDNA accelerators, such
|
|
as the :ref:`MI2XX <mixxx-note>`, requests are only considered *atomic* by
|
|
Infinity Fabric if they are targeted at :ref:`fine-grained memory <memory-type>`
|
|
allocations or :ref:`uncached memory <memory-type>` allocations.
|
|
unit: Percent
|
|
Uncached Write and Atomic Traffic:
|
|
rst: The percent of write and atomic requests generated by the L2 cache that are
|
|
targeting :ref:`uncached memory allocations <memory-type>`. This breakdown
|
|
does not consider the *size* of the request (meaning that 32B and 64B requests
|
|
are both counted as a single request), so this metric only *approximates* the
|
|
percent of the L2-Fabric read bandwidth directed to uncached memory allocations.
|
|
unit: Percent
|
|
Read Latency:
|
|
rst: The time-averaged number of cycles read requests spent in Infinity Fabric before
|
|
data was returned to the L2.
|
|
unit: Cycles
|
|
Write and Atomic Latency:
|
|
rst: The time-averaged number of cycles write requests spent in Infinity Fabric
|
|
before a completion acknowledgement was returned to the L2.
|
|
unit: Cycles
|
|
Atomic Latency:
|
|
rst: The time-averaged number of cycles atomic requests spent in Infinity Fabric
|
|
before a completion acknowledgement (atomic without return value) or data (atomic
|
|
with return value) was returned to the L2.
|
|
unit: Cycles
|
|
Bandwidth:
|
|
rst: The number of bytes looked up in the L2 cache, per :ref:`normalization unit
|
|
<normalization-units>`. The number of bytes is calculated as the number of
|
|
cache lines requested multiplied by the cache line size. This value does not
|
|
consider partial requests, so for example, if only a single value is requested
|
|
in a cache line, the data movement will still be counted as a full cache line.
|
|
unit: Bytes per normalization unit
|
|
Req:
|
|
rst: The total number of incoming requests to the L2 from all clients for all request
|
|
types, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Read Req:
|
|
rst: 'The total number of read requests to the L2 from all clients. '
|
|
unit: Requests per normalization unit
|
|
Write Req:
|
|
rst: The total number of write requests to the L2 from all clients.
|
|
unit: Requests per normalization unit
|
|
Atomic Req:
|
|
rst: The total number of atomic requests (with and without return) to the L2 from
|
|
all clients.
|
|
unit: Requests per normalization unit
|
|
Streaming Req:
|
|
rst: The total number of incoming requests to the L2 that are marked as *streaming*.
|
|
The exact meaning of this may differ depending on the targeted accelerator,
|
|
however on an :ref:`MI2XX <mixxx-note>` this corresponds to `non-temporal
|
|
load or stores <https://clang.llvm.org/docs/LanguageExtensions.html#non-temporal-load-store-builtins>`_. The
|
|
L2 cache attempts to evict *streaming* requests before normal requests when
|
|
the L2 is at capacity.
|
|
unit: Requests per normalization unit
|
|
Probe Req:
|
|
rst: The number of coherence probe requests made to the L2 cache from outside the
|
|
accelerator. On an :ref:`MI2XX <mixxx-note>`, probe requests may be generated
|
|
by, for example, writes to :ref:`fine-grained device <memory-type>` memory
|
|
or by writes to :ref:`coarse-grained <memory-type>` device memory.
|
|
unit: Requests per normalization unit
|
|
Cache Hit:
|
|
rst: The ratio of the number of L2 cache line requests that hit in the L2 cache
|
|
over the total number of incoming cache line requests to the L2 cache.
|
|
unit: Percent
|
|
Hits:
|
|
rst: The total number of requests to the L2 from all clients that hit in the cache.
|
|
As noted in the :ref:`Speed-of-Light <l2-sol>` section, this includes hit-on-miss
|
|
requests.
|
|
unit: Requests per normalization unit
|
|
Misses:
|
|
rst: The total number of requests to the L2 from all clients that miss in the cache.
|
|
As noted in the :ref:`Speed-of-Light <l2-sol>` section, these do not include
|
|
hit-on-miss requests.
|
|
unit: Requests per normalization unit
|
|
Writeback:
|
|
rst: The total number of L2 cache lines written back to memory for any reason. Write-backs
|
|
may occur due to user code (such as HIP kernel calls to ``__threadfence_system``
|
|
or atomic built-ins) by the :doc:`command processor <command-processor>`'s
|
|
memory acquire/release fences, or for other internal hardware reasons.
|
|
unit: Cache lines per normalization unit
|
|
Writeback (Internal):
|
|
rst: The total number of L2 cache lines written back to memory for internal hardware
|
|
reasons, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cache lines per normalization unit
|
|
Writeback (vL1D Req):
|
|
rst: The total number of L2 cache lines written back to memory due to requests initiated
|
|
by the :doc:`vL1D cache <vector-l1-cache>`, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cache lines per normalization unit
|
|
Evict (Internal):
|
|
rst: The total number of L2 cache lines evicted from the cache due to capacity limits,
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cache lines per normalization unit
|
|
Evict (vL1D Req):
|
|
rst: The total number of L2 cache lines evicted from the cache due to invalidation
|
|
requests initiated by the :doc:`vL1D cache <vector-l1-cache>`, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Cache lines per normalization unit
|
|
NC Req:
|
|
rst: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory
|
|
allocations, per :ref:`normalization unit <normalization-units>`. See the :ref:`memory-type`
|
|
for more information.
|
|
unit: Requests per normalization unit
|
|
UC Req:
|
|
rst: The total number of requests to the L2 that go to Uncached (UC) memory allocations.
|
|
See the :ref:`memory-type` for more information.
|
|
unit: Requests per normalization unit
|
|
CC Req:
|
|
rst: The total number of requests to the L2 that go to Coherently Cacheable (CC) memory
|
|
allocations. See the :ref:`memory-type` for more information.
|
|
unit: Requests per normalization unit
|
|
RW Req:
|
|
rst: The total number of requests to the L2 that go to Read-Write coherent memory (RW)
|
|
allocations. See the :ref:`memory-type` for more information.
|
|
unit: Requests per normalization unit
|
|
Write - Credit Starvation:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
|
|
requests to any memory location because too many write/atomic requests were
|
|
currently in flight, as a percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Read (32B):
|
|
rst: The total number of L2 requests to Infinity Fabric to read 32B of data from
|
|
any memory location, per :ref:`normalization unit <normalization-units>`. See :ref:`l2-request-flow`
|
|
for more detail. Typically unused on CDNA accelerators.
|
|
unit: Requests per normalization unit
|
|
Read (64B):
|
|
rst: The total number of L2 requests to Infinity Fabric to read 64B of data from
|
|
any memory location, per :ref:`normalization unit <normalization-units>`. See :ref:`l2-request-flow`
|
|
for more detail.
|
|
unit: Requests per normalization unit
|
|
Read (Uncached):
|
|
rst: The total number of L2 requests to Infinity Fabric to read :ref:`uncached
|
|
data <memory-type>` from any memory location, per :ref:`normalization unit
|
|
<normalization-units>`. 64B requests for uncached data are counted as two 32B
|
|
uncached data requests. See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
HBM Read:
|
|
rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data
|
|
from the accelerator's local HBM, per :ref:`normalization unit <normalization-units>`.
|
|
See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
Remote Read:
|
|
rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data
|
|
from any source other than the accelerator's local HBM, per :ref:`normalization
|
|
unit <normalization-units>`. See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
Write and Atomic (32B):
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
|
|
32B of data to any memory location, per :ref:`normalization unit <normalization-units>`.
|
|
See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
Write and Atomic (Uncached):
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
|
|
32B or 64B of :ref:`uncached data <memory-type>`, per :ref:`normalization unit
|
|
<normalization-units>`. See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
Write and Atomic (64B):
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
|
|
64B of data in any memory location, per :ref:`normalization unit <normalization-units>`.
|
|
See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
HBM Write and Atomic:
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
|
|
32B or 64B of data in the accelerator's local HBM, per :ref:`normalization
|
|
unit <normalization-units>`. See :ref:`l2-request-flow` for more detail. plain
|
|
unit: Requests per normalization unit
|
|
Remote Write and Atomic:
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
|
|
32B or 64B of data in any memory location other than the accelerator's local
|
|
HBM, per :ref:`normalization unit <normalization-units>`. See :ref:`l2-request-flow`
|
|
for more detail.
|
|
unit: Requests per normalization unit
|
|
Atomic:
|
|
rst: The total number of L2 requests to Infinity Fabric to atomically update 32B
|
|
or 64B of data in any memory location, per :ref:`normalization unit <normalization-units>`.
|
|
See :ref:`l2-request-flow` for more detail. Note that on current CDNA accelerators,
|
|
such as the :ref:`MI2XX <mixxx-note>`, requests are only considered *atomic*
|
|
by Infinity Fabric if they are targeted at non-write-cacheable memory, such
|
|
as :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached
|
|
memory <memory-type>` allocations on the MI2XX.
|
|
unit: Requests per normalization unit
|
|
Read Stall:
|
|
rst: "The ratio of the total number of cycles the L2-Fabric interface was stalled\
|
|
\ on a read request to any destination (local HBM, remote PCIe\xAE connected\
|
|
\ accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_\
|
|
\ or CPU) over the :ref:`total active L2 cycles <total-active-l2-cycles>`."
|
|
unit: Percent
|
|
Write Stall:
|
|
rst: The ratio of the total number of cycles the L2-Fabric interface was stalled
|
|
on a write or atomic request to any destination (local HBM, remote accelerator
|
|
or CPU, PCIe connected accelerator or CPU, or remote Infinity Fabric connected
|
|
accelerator [#inf]_ or CPU) over the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Read - PCIe Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on read requests
|
|
to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total
|
|
active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Read - Infinity Fabric Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on read requests
|
|
to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a percent
|
|
of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Read - HBM Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on read requests
|
|
to the accelerator's local HBM as a percent of the :ref:`total active L2 cycles
|
|
<total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Write - PCIe Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
|
|
requests to remote PCIe connected accelerators [#inf]_ or CPUs as a percent
|
|
of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Write - Infinity Fabric Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
|
|
requests to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as
|
|
a percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Write - HBM Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
|
|
requests to accelerator's local HBM as a percent of the total active L2 cycles.
|
|
unit: Percent
|
|
L2-Fabric interface metrics:
|
|
Utilization:
|
|
rst: The ratio of the :ref:`number of cycles an L2 channel was active, summed
|
|
over all L2 channels on the accelerator <total-active-l2-cycles>` over the
|
|
:ref:`total L2 cycles <total-l2-cycles>`.
|
|
unit: Percent
|
|
Peak Bandwidth:
|
|
rst: The number of bytes looked up in the L2 cache, as a percent of the peak theoretical
|
|
bandwidth achievable on the specific accelerator. The number of bytes is calculated
|
|
as the number of cache lines requested multiplied by the cache line size. This
|
|
value does not consider partial requests, so e.g., if only a single value is
|
|
requested in a cache line, the data movement will still be counted as a full
|
|
cache line.
|
|
unit: Percent
|
|
Hit Rate:
|
|
rst: The ratio of the number of L2 cache line requests that hit in the L2 cache
|
|
over the total number of incoming cache line requests to the L2 cache.
|
|
unit: Percent
|
|
L2-Fabric Read BW:
|
|
rst: The number of bytes read by the L2 over the :ref:`Infinity Fabric interface
|
|
<l2-fabric>` per unit time.
|
|
unit: GB/s
|
|
L2-Fabric Write and Atomic BW:
|
|
rst: The number of bytes sent by the L2 over the :ref:`Infinity Fabric interface
|
|
<l2-fabric>` by write and atomic operations per unit time.
|
|
unit: GB/s
|
|
HBM Bandwidth:
|
|
rst: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory
|
|
(HBM) per unit time. This value is calculated as the number of HBM channels
|
|
multiplied by the HBM channel width multiplied by the HBM clock frequency.
|
|
unit: GB/s
|
|
Read BW:
|
|
rst: The total number of bytes read by the L2 cache from Infinity Fabric per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Bytes per normalization unit
|
|
HBM Read Traffic:
|
|
rst: The percent of read requests generated by the L2 cache that are routed to the
|
|
accelerator's local high-bandwidth memory (HBM). This breakdown does not consider
|
|
the *size* of the request (meaning that 32B and 64B requests are both counted
|
|
as a single request), so this metric only *approximates* the percent of the
|
|
L2-Fabric Read bandwidth directed to the local HBM.
|
|
unit: Percent
|
|
Remote Read Traffic:
|
|
rst: The percent of read requests generated by the L2 cache that are routed to any
|
|
memory location other than the accelerator's local high-bandwidth memory (HBM)
|
|
-- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown
|
|
does not consider the *size* of the request (meaning that 32B and 64B requests
|
|
are both counted as a single request), so this metric only *approximates* the
|
|
percent of the L2-Fabric Read bandwidth directed to a remote location.
|
|
unit: Percent
|
|
Uncached Read Traffic:
|
|
rst: The percent of read requests generated by the L2 cache that are reading from
|
|
an :ref:`uncached memory allocation <memory-type>`. Note, as described in the
|
|
:ref:`request flow <l2-request-flow>` section, a single 64B read request is
|
|
typically counted as two uncached read requests. So, it is possible for the
|
|
Uncached Read Traffic to reach up to 200% of the total number of read requests.
|
|
This breakdown does not consider the *size* of the request (i.e., 32B and 64B
|
|
requests are both counted as a single request), so this metric only *approximates*
|
|
the percent of the L2-Fabric read bandwidth directed to an uncached memory
|
|
location.
|
|
unit: Percent
|
|
Write and Atomic BW:
|
|
rst: The total number of bytes written by the L2 over Infinity Fabric by write and
|
|
atomic operations per :ref:`normalization unit <normalization-units>`. Note
|
|
that on current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests
|
|
are only considered *atomic* by Infinity Fabric if they are targeted at non-write-cacheable
|
|
memory, for example, :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached
|
|
memory <memory-type>` allocations on the MI2XX.
|
|
unit: Bytes per normalization unit
|
|
HBM Write and Atomic Traffic:
|
|
rst: The percent of write and atomic requests generated by the L2 cache that are
|
|
routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
|
|
does not consider the *size* of the request (meaning that 32B and 64B requests
|
|
are both counted as a single request), so this metric only *approximates* the
|
|
percent of the L2-Fabric Write and Atomic bandwidth directed to the local HBM.
|
|
Note that on current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`,
|
|
requests are only considered *atomic* by Infinity Fabric if they are targeted
|
|
at :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached
|
|
memory <memory-type>` allocations.
|
|
unit: Percent
|
|
Remote Write and Atomic Traffic:
|
|
rst: The percent of read requests generated by the L2 cache that are routed to any
|
|
memory location other than the accelerator's local high-bandwidth memory (HBM)
|
|
-- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown
|
|
does not consider the *size* of the request (meaning that 32B and 64B requests
|
|
are both counted as a single request), so this metric only *approximates* the
|
|
percent of the L2-Fabric Read bandwidth directed to a remote location. Note
|
|
that on current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests
|
|
are only considered *atomic* by Infinity Fabric if they are targeted at :ref:`fine-grained
|
|
memory <memory-type>` allocations or :ref:`uncached memory <memory-type>` allocations.
|
|
unit: Percent
|
|
Atomic Traffic:
|
|
rst: The percent of write requests generated by the L2 cache that are atomic requests
|
|
to *any* memory location. This breakdown does not consider the *size* of the
|
|
request (meaning that 32B and 64B requests are both counted as a single request),
|
|
so this metric only *approximates* the percent of the L2-Fabric Read bandwidth
|
|
directed to a remote location. Note that on current CDNA accelerators, such
|
|
as the :ref:`MI2XX <mixxx-note>`, requests are only considered *atomic* by
|
|
Infinity Fabric if they are targeted at :ref:`fine-grained memory <memory-type>`
|
|
allocations or :ref:`uncached memory <memory-type>` allocations.
|
|
unit: Percent
|
|
Uncached Write and Atomic Traffic:
|
|
rst: The percent of write and atomic requests generated by the L2 cache that are
|
|
targeting :ref:`uncached memory allocations <memory-type>`. This breakdown
|
|
does not consider the *size* of the request (meaning that 32B and 64B requests
|
|
are both counted as a single request), so this metric only *approximates* the
|
|
percent of the L2-Fabric read bandwidth directed to uncached memory allocations.
|
|
unit: Percent
|
|
Read Latency:
|
|
rst: The time-averaged number of cycles read requests spent in Infinity Fabric before
|
|
data was returned to the L2.
|
|
unit: Cycles
|
|
Write and Atomic Latency:
|
|
rst: The time-averaged number of cycles write requests spent in Infinity Fabric
|
|
before a completion acknowledgement was returned to the L2.
|
|
unit: Cycles
|
|
Atomic Latency:
|
|
rst: The time-averaged number of cycles atomic requests spent in Infinity Fabric
|
|
before a completion acknowledgement (atomic without return value) or data (atomic
|
|
with return value) was returned to the L2.
|
|
unit: Cycles
|
|
Bandwidth:
|
|
rst: The number of bytes looked up in the L2 cache, per :ref:`normalization unit
|
|
<normalization-units>`. The number of bytes is calculated as the number of
|
|
cache lines requested multiplied by the cache line size. This value does not
|
|
consider partial requests, so for example, if only a single value is requested
|
|
in a cache line, the data movement will still be counted as a full cache line.
|
|
unit: Bytes per normalization unit
|
|
Req:
|
|
rst: The total number of incoming requests to the L2 from all clients for all request
|
|
types, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Read Req:
|
|
rst: 'The total number of read requests to the L2 from all clients. '
|
|
unit: Requests per normalization unit
|
|
Write Req:
|
|
rst: The total number of write requests to the L2 from all clients.
|
|
unit: Requests per normalization unit
|
|
Atomic Req:
|
|
rst: The total number of atomic requests (with and without return) to the L2 from
|
|
all clients.
|
|
unit: Requests per normalization unit
|
|
Streaming Req:
|
|
rst: The total number of incoming requests to the L2 that are marked as *streaming*.
|
|
The exact meaning of this may differ depending on the targeted accelerator,
|
|
however on an :ref:`MI2XX <mixxx-note>` this corresponds to `non-temporal
|
|
load or stores <https://clang.llvm.org/docs/LanguageExtensions.html#non-temporal-load-store-builtins>`_. The
|
|
L2 cache attempts to evict *streaming* requests before normal requests when
|
|
the L2 is at capacity.
|
|
unit: Requests per normalization unit
|
|
Probe Req:
|
|
rst: The number of coherence probe requests made to the L2 cache from outside the
|
|
accelerator. On an :ref:`MI2XX <mixxx-note>`, probe requests may be generated
|
|
by, for example, writes to :ref:`fine-grained device <memory-type>` memory
|
|
or by writes to :ref:`coarse-grained <memory-type>` device memory.
|
|
unit: Requests per normalization unit
|
|
Cache Hit:
|
|
rst: The ratio of the number of L2 cache line requests that hit in the L2 cache
|
|
over the total number of incoming cache line requests to the L2 cache.
|
|
unit: Percent
|
|
Hits:
|
|
rst: The total number of requests to the L2 from all clients that hit in the cache.
|
|
As noted in the :ref:`Speed-of-Light <l2-sol>` section, this includes hit-on-miss
|
|
requests.
|
|
unit: Requests per normalization unit
|
|
Misses:
|
|
rst: The total number of requests to the L2 from all clients that miss in the cache.
|
|
As noted in the :ref:`Speed-of-Light <l2-sol>` section, these do not include
|
|
hit-on-miss requests.
|
|
unit: Requests per normalization unit
|
|
Writeback:
|
|
rst: The total number of L2 cache lines written back to memory for any reason. Write-backs
|
|
may occur due to user code (such as HIP kernel calls to ``__threadfence_system``
|
|
or atomic built-ins) by the :doc:`command processor <command-processor>`'s
|
|
memory acquire/release fences, or for other internal hardware reasons.
|
|
unit: Cache lines per normalization unit
|
|
Writeback (Internal):
|
|
rst: The total number of L2 cache lines written back to memory for internal hardware
|
|
reasons, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cache lines per normalization unit
|
|
Writeback (vL1D Req):
|
|
rst: The total number of L2 cache lines written back to memory due to requests initiated
|
|
by the :doc:`vL1D cache <vector-l1-cache>`, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cache lines per normalization unit
|
|
Evict (Internal):
|
|
rst: The total number of L2 cache lines evicted from the cache due to capacity limits,
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cache lines per normalization unit
|
|
Evict (vL1D Req):
|
|
rst: The total number of L2 cache lines evicted from the cache due to invalidation
|
|
requests initiated by the :doc:`vL1D cache <vector-l1-cache>`, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Cache lines per normalization unit
|
|
NC Req:
|
|
rst: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory
|
|
allocations, per :ref:`normalization unit <normalization-units>`. See the :ref:`memory-type`
|
|
for more information.
|
|
unit: Requests per normalization unit
|
|
UC Req:
|
|
rst: The total number of requests to the L2 that go to Uncached (UC) memory allocations.
|
|
See the :ref:`memory-type` for more information.
|
|
unit: Requests per normalization unit
|
|
CC Req:
|
|
rst: The total number of requests to the L2 that go to Coherently Cacheable (CC) memory
|
|
allocations. See the :ref:`memory-type` for more information.
|
|
unit: Requests per normalization unit
|
|
RW Req:
|
|
rst: The total number of requests to the L2 that go to Read-Write coherent memory (RW)
|
|
allocations. See the :ref:`memory-type` for more information.
|
|
unit: Requests per normalization unit
|
|
Write - Credit Starvation:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
|
|
requests to any memory location because too many write/atomic requests were
|
|
currently in flight, as a percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Read (32B):
|
|
rst: The total number of L2 requests to Infinity Fabric to read 32B of data from
|
|
any memory location, per :ref:`normalization unit <normalization-units>`. See :ref:`l2-request-flow`
|
|
for more detail. Typically unused on CDNA accelerators.
|
|
unit: Requests per normalization unit
|
|
Read (64B):
|
|
rst: The total number of L2 requests to Infinity Fabric to read 64B of data from
|
|
any memory location, per :ref:`normalization unit <normalization-units>`. See :ref:`l2-request-flow`
|
|
for more detail.
|
|
unit: Requests per normalization unit
|
|
Read (Uncached):
|
|
rst: The total number of L2 requests to Infinity Fabric to read :ref:`uncached
|
|
data <memory-type>` from any memory location, per :ref:`normalization unit
|
|
<normalization-units>`. 64B requests for uncached data are counted as two 32B
|
|
uncached data requests. See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
HBM Read:
|
|
rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data
|
|
from the accelerator's local HBM, per :ref:`normalization unit <normalization-units>`.
|
|
See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
Remote Read:
|
|
rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data
|
|
from any source other than the accelerator's local HBM, per :ref:`normalization
|
|
unit <normalization-units>`. See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
Write and Atomic (32B):
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
|
|
32B of data to any memory location, per :ref:`normalization unit <normalization-units>`.
|
|
See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
Write and Atomic (Uncached):
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
|
|
32B or 64B of :ref:`uncached data <memory-type>`, per :ref:`normalization unit
|
|
<normalization-units>`. See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
Write and Atomic (64B):
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
|
|
64B of data in any memory location, per :ref:`normalization unit <normalization-units>`.
|
|
See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
HBM Write and Atomic:
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
|
|
32B or 64B of data in the accelerator's local HBM, per :ref:`normalization
|
|
unit <normalization-units>`. See :ref:`l2-request-flow` for more detail. plain
|
|
unit: Requests per normalization unit
|
|
Remote Write and Atomic:
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
|
|
32B or 64B of data in any memory location other than the accelerator's local
|
|
HBM, per :ref:`normalization unit <normalization-units>`. See :ref:`l2-request-flow`
|
|
for more detail.
|
|
unit: Requests per normalization unit
|
|
Atomic:
|
|
rst: The total number of L2 requests to Infinity Fabric to atomically update 32B
|
|
or 64B of data in any memory location, per :ref:`normalization unit <normalization-units>`.
|
|
See :ref:`l2-request-flow` for more detail. Note that on current CDNA accelerators,
|
|
such as the :ref:`MI2XX <mixxx-note>`, requests are only considered *atomic*
|
|
by Infinity Fabric if they are targeted at non-write-cacheable memory, such
|
|
as :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached
|
|
memory <memory-type>` allocations on the MI2XX.
|
|
unit: Requests per normalization unit
|
|
Read Stall:
|
|
rst: "The ratio of the total number of cycles the L2-Fabric interface was stalled\
|
|
\ on a read request to any destination (local HBM, remote PCIe\xAE connected\
|
|
\ accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_\
|
|
\ or CPU) over the :ref:`total active L2 cycles <total-active-l2-cycles>`."
|
|
unit: Percent
|
|
Write Stall:
|
|
rst: The ratio of the total number of cycles the L2-Fabric interface was stalled
|
|
on a write or atomic request to any destination (local HBM, remote accelerator
|
|
or CPU, PCIe connected accelerator or CPU, or remote Infinity Fabric connected
|
|
accelerator [#inf]_ or CPU) over the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Read - PCIe Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on read requests
|
|
to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total
|
|
active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Read - Infinity Fabric Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on read requests
|
|
to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a percent
|
|
of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Read - HBM Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on read requests
|
|
to the accelerator's local HBM as a percent of the :ref:`total active L2 cycles
|
|
<total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Write - PCIe Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
|
|
requests to remote PCIe connected accelerators [#inf]_ or CPUs as a percent
|
|
of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Write - Infinity Fabric Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
|
|
requests to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as
|
|
a percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Write - HBM Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
|
|
requests to accelerator's local HBM as a percent of the total active L2 cycles.
|
|
unit: Percent
|
|
L2 - Fabric interface detailed metrics:
|
|
Utilization:
|
|
rst: The ratio of the :ref:`number of cycles an L2 channel was active, summed
|
|
over all L2 channels on the accelerator <total-active-l2-cycles>` over the
|
|
:ref:`total L2 cycles <total-l2-cycles>`.
|
|
unit: Percent
|
|
Peak Bandwidth:
|
|
rst: The number of bytes looked up in the L2 cache, as a percent of the peak theoretical
|
|
bandwidth achievable on the specific accelerator. The number of bytes is calculated
|
|
as the number of cache lines requested multiplied by the cache line size. This
|
|
value does not consider partial requests, so e.g., if only a single value is
|
|
requested in a cache line, the data movement will still be counted as a full
|
|
cache line.
|
|
unit: Percent
|
|
Hit Rate:
|
|
rst: The ratio of the number of L2 cache line requests that hit in the L2 cache
|
|
over the total number of incoming cache line requests to the L2 cache.
|
|
unit: Percent
|
|
L2-Fabric Read BW:
|
|
rst: The number of bytes read by the L2 over the :ref:`Infinity Fabric interface
|
|
<l2-fabric>` per unit time.
|
|
unit: GB/s
|
|
L2-Fabric Write and Atomic BW:
|
|
rst: The number of bytes sent by the L2 over the :ref:`Infinity Fabric interface
|
|
<l2-fabric>` by write and atomic operations per unit time.
|
|
unit: GB/s
|
|
HBM Bandwidth:
|
|
rst: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory
|
|
(HBM) per unit time. This value is calculated as the number of HBM channels
|
|
multiplied by the HBM channel width multiplied by the HBM clock frequency.
|
|
unit: GB/s
|
|
Read BW:
|
|
rst: The total number of bytes read by the L2 cache from Infinity Fabric per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Bytes per normalization unit
|
|
HBM Read Traffic:
|
|
rst: The percent of read requests generated by the L2 cache that are routed to the
|
|
accelerator's local high-bandwidth memory (HBM). This breakdown does not consider
|
|
the *size* of the request (meaning that 32B and 64B requests are both counted
|
|
as a single request), so this metric only *approximates* the percent of the
|
|
L2-Fabric Read bandwidth directed to the local HBM.
|
|
unit: Percent
|
|
Remote Read Traffic:
|
|
rst: The percent of read requests generated by the L2 cache that are routed to any
|
|
memory location other than the accelerator's local high-bandwidth memory (HBM)
|
|
-- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown
|
|
does not consider the *size* of the request (meaning that 32B and 64B requests
|
|
are both counted as a single request), so this metric only *approximates* the
|
|
percent of the L2-Fabric Read bandwidth directed to a remote location.
|
|
unit: Percent
|
|
Uncached Read Traffic:
|
|
rst: The percent of read requests generated by the L2 cache that are reading from
|
|
an :ref:`uncached memory allocation <memory-type>`. Note, as described in the
|
|
:ref:`request flow <l2-request-flow>` section, a single 64B read request is
|
|
typically counted as two uncached read requests. So, it is possible for the
|
|
Uncached Read Traffic to reach up to 200% of the total number of read requests.
|
|
This breakdown does not consider the *size* of the request (i.e., 32B and 64B
|
|
requests are both counted as a single request), so this metric only *approximates*
|
|
the percent of the L2-Fabric read bandwidth directed to an uncached memory
|
|
location.
|
|
unit: Percent
|
|
Write and Atomic BW:
|
|
rst: The total number of bytes written by the L2 over Infinity Fabric by write and
|
|
atomic operations per :ref:`normalization unit <normalization-units>`. Note
|
|
that on current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests
|
|
are only considered *atomic* by Infinity Fabric if they are targeted at non-write-cacheable
|
|
memory, for example, :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached
|
|
memory <memory-type>` allocations on the MI2XX.
|
|
unit: Bytes per normalization unit
|
|
HBM Write and Atomic Traffic:
|
|
rst: The percent of write and atomic requests generated by the L2 cache that are
|
|
routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
|
|
does not consider the *size* of the request (meaning that 32B and 64B requests
|
|
are both counted as a single request), so this metric only *approximates* the
|
|
percent of the L2-Fabric Write and Atomic bandwidth directed to the local HBM.
|
|
Note that on current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`,
|
|
requests are only considered *atomic* by Infinity Fabric if they are targeted
|
|
at :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached
|
|
memory <memory-type>` allocations.
|
|
unit: Percent
|
|
Remote Write and Atomic Traffic:
|
|
rst: The percent of read requests generated by the L2 cache that are routed to any
|
|
memory location other than the accelerator's local high-bandwidth memory (HBM)
|
|
-- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown
|
|
does not consider the *size* of the request (meaning that 32B and 64B requests
|
|
are both counted as a single request), so this metric only *approximates* the
|
|
percent of the L2-Fabric Read bandwidth directed to a remote location. Note
|
|
that on current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests
|
|
are only considered *atomic* by Infinity Fabric if they are targeted at :ref:`fine-grained
|
|
memory <memory-type>` allocations or :ref:`uncached memory <memory-type>` allocations.
|
|
unit: Percent
|
|
Atomic Traffic:
|
|
rst: The percent of write requests generated by the L2 cache that are atomic requests
|
|
to *any* memory location. This breakdown does not consider the *size* of the
|
|
request (meaning that 32B and 64B requests are both counted as a single request),
|
|
so this metric only *approximates* the percent of the L2-Fabric Read bandwidth
|
|
directed to a remote location. Note that on current CDNA accelerators, such
|
|
as the :ref:`MI2XX <mixxx-note>`, requests are only considered *atomic* by
|
|
Infinity Fabric if they are targeted at :ref:`fine-grained memory <memory-type>`
|
|
allocations or :ref:`uncached memory <memory-type>` allocations.
|
|
unit: Percent
|
|
Uncached Write and Atomic Traffic:
|
|
rst: The percent of write and atomic requests generated by the L2 cache that are
|
|
targeting :ref:`uncached memory allocations <memory-type>`. This breakdown
|
|
does not consider the *size* of the request (meaning that 32B and 64B requests
|
|
are both counted as a single request), so this metric only *approximates* the
|
|
percent of the L2-Fabric read bandwidth directed to uncached memory allocations.
|
|
unit: Percent
|
|
Read Latency:
|
|
rst: The time-averaged number of cycles read requests spent in Infinity Fabric before
|
|
data was returned to the L2.
|
|
unit: Cycles
|
|
Write and Atomic Latency:
|
|
rst: The time-averaged number of cycles write requests spent in Infinity Fabric
|
|
before a completion acknowledgement was returned to the L2.
|
|
unit: Cycles
|
|
Atomic Latency:
|
|
rst: The time-averaged number of cycles atomic requests spent in Infinity Fabric
|
|
before a completion acknowledgement (atomic without return value) or data (atomic
|
|
with return value) was returned to the L2.
|
|
unit: Cycles
|
|
Bandwidth:
|
|
rst: The number of bytes looked up in the L2 cache, per :ref:`normalization unit
|
|
<normalization-units>`. The number of bytes is calculated as the number of
|
|
cache lines requested multiplied by the cache line size. This value does not
|
|
consider partial requests, so for example, if only a single value is requested
|
|
in a cache line, the data movement will still be counted as a full cache line.
|
|
unit: Bytes per normalization unit
|
|
Req:
|
|
rst: The total number of incoming requests to the L2 from all clients for all request
|
|
types, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Read Req:
|
|
rst: 'The total number of read requests to the L2 from all clients. '
|
|
unit: Requests per normalization unit
|
|
Write Req:
|
|
rst: The total number of write requests to the L2 from all clients.
|
|
unit: Requests per normalization unit
|
|
Atomic Req:
|
|
rst: The total number of atomic requests (with and without return) to the L2 from
|
|
all clients.
|
|
unit: Requests per normalization unit
|
|
Streaming Req:
|
|
rst: The total number of incoming requests to the L2 that are marked as *streaming*.
|
|
The exact meaning of this may differ depending on the targeted accelerator,
|
|
however on an :ref:`MI2XX <mixxx-note>` this corresponds to `non-temporal
|
|
load or stores <https://clang.llvm.org/docs/LanguageExtensions.html#non-temporal-load-store-builtins>`_. The
|
|
L2 cache attempts to evict *streaming* requests before normal requests when
|
|
the L2 is at capacity.
|
|
unit: Requests per normalization unit
|
|
Probe Req:
|
|
rst: The number of coherence probe requests made to the L2 cache from outside the
|
|
accelerator. On an :ref:`MI2XX <mixxx-note>`, probe requests may be generated
|
|
by, for example, writes to :ref:`fine-grained device <memory-type>` memory
|
|
or by writes to :ref:`coarse-grained <memory-type>` device memory.
|
|
unit: Requests per normalization unit
|
|
Cache Hit:
|
|
rst: The ratio of the number of L2 cache line requests that hit in the L2 cache
|
|
over the total number of incoming cache line requests to the L2 cache.
|
|
unit: Percent
|
|
Hits:
|
|
rst: The total number of requests to the L2 from all clients that hit in the cache.
|
|
As noted in the :ref:`Speed-of-Light <l2-sol>` section, this includes hit-on-miss
|
|
requests.
|
|
unit: Requests per normalization unit
|
|
Misses:
|
|
rst: The total number of requests to the L2 from all clients that miss in the cache.
|
|
As noted in the :ref:`Speed-of-Light <l2-sol>` section, these do not include
|
|
hit-on-miss requests.
|
|
unit: Requests per normalization unit
|
|
Writeback:
|
|
rst: The total number of L2 cache lines written back to memory for any reason. Write-backs
|
|
may occur due to user code (such as HIP kernel calls to ``__threadfence_system``
|
|
or atomic built-ins) by the :doc:`command processor <command-processor>`'s
|
|
memory acquire/release fences, or for other internal hardware reasons.
|
|
unit: Cache lines per normalization unit
|
|
Writeback (Internal):
|
|
rst: The total number of L2 cache lines written back to memory for internal hardware
|
|
reasons, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cache lines per normalization unit
|
|
Writeback (vL1D Req):
|
|
rst: The total number of L2 cache lines written back to memory due to requests initiated
|
|
by the :doc:`vL1D cache <vector-l1-cache>`, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cache lines per normalization unit
|
|
Evict (Internal):
|
|
rst: The total number of L2 cache lines evicted from the cache due to capacity limits,
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cache lines per normalization unit
|
|
Evict (vL1D Req):
|
|
rst: The total number of L2 cache lines evicted from the cache due to invalidation
|
|
requests initiated by the :doc:`vL1D cache <vector-l1-cache>`, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Cache lines per normalization unit
|
|
NC Req:
|
|
rst: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory
|
|
allocations, per :ref:`normalization unit <normalization-units>`. See the :ref:`memory-type`
|
|
for more information.
|
|
unit: Requests per normalization unit
|
|
UC Req:
|
|
rst: The total number of requests to the L2 that go to Uncached (UC) memory allocations.
|
|
See the :ref:`memory-type` for more information.
|
|
unit: Requests per normalization unit
|
|
CC Req:
|
|
rst: The total number of requests to the L2 that go to Coherently Cacheable (CC) memory
|
|
allocations. See the :ref:`memory-type` for more information.
|
|
unit: Requests per normalization unit
|
|
RW Req:
|
|
rst: The total number of requests to the L2 that go to Read-Write coherent memory (RW)
|
|
allocations. See the :ref:`memory-type` for more information.
|
|
unit: Requests per normalization unit
|
|
Write - Credit Starvation:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
|
|
requests to any memory location because too many write/atomic requests were
|
|
currently in flight, as a percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Read (32B):
|
|
rst: The total number of L2 requests to Infinity Fabric to read 32B of data from
|
|
any memory location, per :ref:`normalization unit <normalization-units>`. See :ref:`l2-request-flow`
|
|
for more detail. Typically unused on CDNA accelerators.
|
|
unit: Requests per normalization unit
|
|
Read (64B):
|
|
rst: The total number of L2 requests to Infinity Fabric to read 64B of data from
|
|
any memory location, per :ref:`normalization unit <normalization-units>`. See :ref:`l2-request-flow`
|
|
for more detail.
|
|
unit: Requests per normalization unit
|
|
Read (Uncached):
|
|
rst: The total number of L2 requests to Infinity Fabric to read :ref:`uncached
|
|
data <memory-type>` from any memory location, per :ref:`normalization unit
|
|
<normalization-units>`. 64B requests for uncached data are counted as two 32B
|
|
uncached data requests. See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
HBM Read:
|
|
rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data
|
|
from the accelerator's local HBM, per :ref:`normalization unit <normalization-units>`.
|
|
See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
Remote Read:
|
|
rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data
|
|
from any source other than the accelerator's local HBM, per :ref:`normalization
|
|
unit <normalization-units>`. See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
Write and Atomic (32B):
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
|
|
32B of data to any memory location, per :ref:`normalization unit <normalization-units>`.
|
|
See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
Write and Atomic (Uncached):
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
|
|
32B or 64B of :ref:`uncached data <memory-type>`, per :ref:`normalization unit
|
|
<normalization-units>`. See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
Write and Atomic (64B):
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
|
|
64B of data in any memory location, per :ref:`normalization unit <normalization-units>`.
|
|
See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
HBM Write and Atomic:
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
|
|
32B or 64B of data in the accelerator's local HBM, per :ref:`normalization
|
|
unit <normalization-units>`. See :ref:`l2-request-flow` for more detail. plain
|
|
unit: Requests per normalization unit
|
|
Remote Write and Atomic:
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
|
|
32B or 64B of data in any memory location other than the accelerator's local
|
|
HBM, per :ref:`normalization unit <normalization-units>`. See :ref:`l2-request-flow`
|
|
for more detail.
|
|
unit: Requests per normalization unit
|
|
Atomic:
|
|
rst: The total number of L2 requests to Infinity Fabric to atomically update 32B
|
|
or 64B of data in any memory location, per :ref:`normalization unit <normalization-units>`.
|
|
See :ref:`l2-request-flow` for more detail. Note that on current CDNA accelerators,
|
|
such as the :ref:`MI2XX <mixxx-note>`, requests are only considered *atomic*
|
|
by Infinity Fabric if they are targeted at non-write-cacheable memory, such
|
|
as :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached
|
|
memory <memory-type>` allocations on the MI2XX.
|
|
unit: Requests per normalization unit
|
|
Read Stall:
|
|
rst: "The ratio of the total number of cycles the L2-Fabric interface was stalled\
|
|
\ on a read request to any destination (local HBM, remote PCIe\xAE connected\
|
|
\ accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_\
|
|
\ or CPU) over the :ref:`total active L2 cycles <total-active-l2-cycles>`."
|
|
unit: Percent
|
|
Write Stall:
|
|
rst: The ratio of the total number of cycles the L2-Fabric interface was stalled
|
|
on a write or atomic request to any destination (local HBM, remote accelerator
|
|
or CPU, PCIe connected accelerator or CPU, or remote Infinity Fabric connected
|
|
accelerator [#inf]_ or CPU) over the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Read - PCIe Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on read requests
|
|
to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total
|
|
active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Read - Infinity Fabric Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on read requests
|
|
to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a percent
|
|
of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Read - HBM Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on read requests
|
|
to the accelerator's local HBM as a percent of the :ref:`total active L2 cycles
|
|
<total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Write - PCIe Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
|
|
requests to remote PCIe connected accelerators [#inf]_ or CPUs as a percent
|
|
of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Write - Infinity Fabric Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
|
|
requests to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as
|
|
a percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Write - HBM Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
|
|
requests to accelerator's local HBM as a percent of the total active L2 cycles.
|
|
unit: Percent
|
|
L2 - Fabric Interface stalls:
|
|
Utilization:
|
|
rst: The ratio of the :ref:`number of cycles an L2 channel was active, summed
|
|
over all L2 channels on the accelerator <total-active-l2-cycles>` over the
|
|
:ref:`total L2 cycles <total-l2-cycles>`.
|
|
unit: Percent
|
|
Peak Bandwidth:
|
|
rst: The number of bytes looked up in the L2 cache, as a percent of the peak theoretical
|
|
bandwidth achievable on the specific accelerator. The number of bytes is calculated
|
|
as the number of cache lines requested multiplied by the cache line size. This
|
|
value does not consider partial requests, so e.g., if only a single value is
|
|
requested in a cache line, the data movement will still be counted as a full
|
|
cache line.
|
|
unit: Percent
|
|
Hit Rate:
|
|
rst: The ratio of the number of L2 cache line requests that hit in the L2 cache
|
|
over the total number of incoming cache line requests to the L2 cache.
|
|
unit: Percent
|
|
L2-Fabric Read BW:
|
|
rst: The number of bytes read by the L2 over the :ref:`Infinity Fabric interface
|
|
<l2-fabric>` per unit time.
|
|
unit: GB/s
|
|
L2-Fabric Write and Atomic BW:
|
|
rst: The number of bytes sent by the L2 over the :ref:`Infinity Fabric interface
|
|
<l2-fabric>` by write and atomic operations per unit time.
|
|
unit: GB/s
|
|
HBM Bandwidth:
|
|
rst: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory
|
|
(HBM) per unit time. This value is calculated as the number of HBM channels
|
|
multiplied by the HBM channel width multiplied by the HBM clock frequency.
|
|
unit: GB/s
|
|
Read BW:
|
|
rst: The total number of bytes read by the L2 cache from Infinity Fabric per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Bytes per normalization unit
|
|
HBM Read Traffic:
|
|
rst: The percent of read requests generated by the L2 cache that are routed to the
|
|
accelerator's local high-bandwidth memory (HBM). This breakdown does not consider
|
|
the *size* of the request (meaning that 32B and 64B requests are both counted
|
|
as a single request), so this metric only *approximates* the percent of the
|
|
L2-Fabric Read bandwidth directed to the local HBM.
|
|
unit: Percent
|
|
Remote Read Traffic:
|
|
rst: The percent of read requests generated by the L2 cache that are routed to any
|
|
memory location other than the accelerator's local high-bandwidth memory (HBM)
|
|
-- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown
|
|
does not consider the *size* of the request (meaning that 32B and 64B requests
|
|
are both counted as a single request), so this metric only *approximates* the
|
|
percent of the L2-Fabric Read bandwidth directed to a remote location.
|
|
unit: Percent
|
|
Uncached Read Traffic:
|
|
rst: The percent of read requests generated by the L2 cache that are reading from
|
|
an :ref:`uncached memory allocation <memory-type>`. Note, as described in the
|
|
:ref:`request flow <l2-request-flow>` section, a single 64B read request is
|
|
typically counted as two uncached read requests. So, it is possible for the
|
|
Uncached Read Traffic to reach up to 200% of the total number of read requests.
|
|
This breakdown does not consider the *size* of the request (i.e., 32B and 64B
|
|
requests are both counted as a single request), so this metric only *approximates*
|
|
the percent of the L2-Fabric read bandwidth directed to an uncached memory
|
|
location.
|
|
unit: Percent
|
|
Write and Atomic BW:
|
|
rst: The total number of bytes written by the L2 over Infinity Fabric by write and
|
|
atomic operations per :ref:`normalization unit <normalization-units>`. Note
|
|
that on current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests
|
|
are only considered *atomic* by Infinity Fabric if they are targeted at non-write-cacheable
|
|
memory, for example, :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached
|
|
memory <memory-type>` allocations on the MI2XX.
|
|
unit: Bytes per normalization unit
|
|
HBM Write and Atomic Traffic:
|
|
rst: The percent of write and atomic requests generated by the L2 cache that are
|
|
routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
|
|
does not consider the *size* of the request (meaning that 32B and 64B requests
|
|
are both counted as a single request), so this metric only *approximates* the
|
|
percent of the L2-Fabric Write and Atomic bandwidth directed to the local HBM.
|
|
Note that on current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`,
|
|
requests are only considered *atomic* by Infinity Fabric if they are targeted
|
|
at :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached
|
|
memory <memory-type>` allocations.
|
|
unit: Percent
|
|
Remote Write and Atomic Traffic:
|
|
rst: The percent of read requests generated by the L2 cache that are routed to any
|
|
memory location other than the accelerator's local high-bandwidth memory (HBM)
|
|
-- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown
|
|
does not consider the *size* of the request (meaning that 32B and 64B requests
|
|
are both counted as a single request), so this metric only *approximates* the
|
|
percent of the L2-Fabric Read bandwidth directed to a remote location. Note
|
|
that on current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests
|
|
are only considered *atomic* by Infinity Fabric if they are targeted at :ref:`fine-grained
|
|
memory <memory-type>` allocations or :ref:`uncached memory <memory-type>` allocations.
|
|
unit: Percent
|
|
Atomic Traffic:
|
|
rst: The percent of write requests generated by the L2 cache that are atomic requests
|
|
to *any* memory location. This breakdown does not consider the *size* of the
|
|
request (meaning that 32B and 64B requests are both counted as a single request),
|
|
so this metric only *approximates* the percent of the L2-Fabric Read bandwidth
|
|
directed to a remote location. Note that on current CDNA accelerators, such
|
|
as the :ref:`MI2XX <mixxx-note>`, requests are only considered *atomic* by
|
|
Infinity Fabric if they are targeted at :ref:`fine-grained memory <memory-type>`
|
|
allocations or :ref:`uncached memory <memory-type>` allocations.
|
|
unit: Percent
|
|
Uncached Write and Atomic Traffic:
|
|
rst: The percent of write and atomic requests generated by the L2 cache that are
|
|
targeting :ref:`uncached memory allocations <memory-type>`. This breakdown
|
|
does not consider the *size* of the request (meaning that 32B and 64B requests
|
|
are both counted as a single request), so this metric only *approximates* the
|
|
percent of the L2-Fabric read bandwidth directed to uncached memory allocations.
|
|
unit: Percent
|
|
Read Latency:
|
|
rst: The time-averaged number of cycles read requests spent in Infinity Fabric before
|
|
data was returned to the L2.
|
|
unit: Cycles
|
|
Write and Atomic Latency:
|
|
rst: The time-averaged number of cycles write requests spent in Infinity Fabric
|
|
before a completion acknowledgement was returned to the L2.
|
|
unit: Cycles
|
|
Atomic Latency:
|
|
rst: The time-averaged number of cycles atomic requests spent in Infinity Fabric
|
|
before a completion acknowledgement (atomic without return value) or data (atomic
|
|
with return value) was returned to the L2.
|
|
unit: Cycles
|
|
Bandwidth:
|
|
rst: The number of bytes looked up in the L2 cache, per :ref:`normalization unit
|
|
<normalization-units>`. The number of bytes is calculated as the number of
|
|
cache lines requested multiplied by the cache line size. This value does not
|
|
consider partial requests, so for example, if only a single value is requested
|
|
in a cache line, the data movement will still be counted as a full cache line.
|
|
unit: Bytes per normalization unit
|
|
Req:
|
|
rst: The total number of incoming requests to the L2 from all clients for all request
|
|
types, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Read Req:
|
|
rst: 'The total number of read requests to the L2 from all clients. '
|
|
unit: Requests per normalization unit
|
|
Write Req:
|
|
rst: The total number of write requests to the L2 from all clients.
|
|
unit: Requests per normalization unit
|
|
Atomic Req:
|
|
rst: The total number of atomic requests (with and without return) to the L2 from
|
|
all clients.
|
|
unit: Requests per normalization unit
|
|
Streaming Req:
|
|
rst: The total number of incoming requests to the L2 that are marked as *streaming*.
|
|
The exact meaning of this may differ depending on the targeted accelerator,
|
|
however on an :ref:`MI2XX <mixxx-note>` this corresponds to `non-temporal
|
|
load or stores <https://clang.llvm.org/docs/LanguageExtensions.html#non-temporal-load-store-builtins>`_. The
|
|
L2 cache attempts to evict *streaming* requests before normal requests when
|
|
the L2 is at capacity.
|
|
unit: Requests per normalization unit
|
|
Probe Req:
|
|
rst: The number of coherence probe requests made to the L2 cache from outside the
|
|
accelerator. On an :ref:`MI2XX <mixxx-note>`, probe requests may be generated
|
|
by, for example, writes to :ref:`fine-grained device <memory-type>` memory
|
|
or by writes to :ref:`coarse-grained <memory-type>` device memory.
|
|
unit: Requests per normalization unit
|
|
Cache Hit:
|
|
rst: The ratio of the number of L2 cache line requests that hit in the L2 cache
|
|
over the total number of incoming cache line requests to the L2 cache.
|
|
unit: Percent
|
|
Hits:
|
|
rst: The total number of requests to the L2 from all clients that hit in the cache.
|
|
As noted in the :ref:`Speed-of-Light <l2-sol>` section, this includes hit-on-miss
|
|
requests.
|
|
unit: Requests per normalization unit
|
|
Misses:
|
|
rst: The total number of requests to the L2 from all clients that miss in the cache.
|
|
As noted in the :ref:`Speed-of-Light <l2-sol>` section, these do not include
|
|
hit-on-miss requests.
|
|
unit: Requests per normalization unit
|
|
Writeback:
|
|
rst: The total number of L2 cache lines written back to memory for any reason. Write-backs
|
|
may occur due to user code (such as HIP kernel calls to ``__threadfence_system``
|
|
or atomic built-ins) by the :doc:`command processor <command-processor>`'s
|
|
memory acquire/release fences, or for other internal hardware reasons.
|
|
unit: Cache lines per normalization unit
|
|
Writeback (Internal):
|
|
rst: The total number of L2 cache lines written back to memory for internal hardware
|
|
reasons, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cache lines per normalization unit
|
|
Writeback (vL1D Req):
|
|
rst: The total number of L2 cache lines written back to memory due to requests initiated
|
|
by the :doc:`vL1D cache <vector-l1-cache>`, per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cache lines per normalization unit
|
|
Evict (Internal):
|
|
rst: The total number of L2 cache lines evicted from the cache due to capacity limits,
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Cache lines per normalization unit
|
|
Evict (vL1D Req):
|
|
rst: The total number of L2 cache lines evicted from the cache due to invalidation
|
|
requests initiated by the :doc:`vL1D cache <vector-l1-cache>`, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Cache lines per normalization unit
|
|
NC Req:
|
|
rst: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory
|
|
allocations, per :ref:`normalization unit <normalization-units>`. See the :ref:`memory-type`
|
|
for more information.
|
|
unit: Requests per normalization unit
|
|
UC Req:
|
|
rst: The total number of requests to the L2 that go to Uncached (UC) memory allocations.
|
|
See the :ref:`memory-type` for more information.
|
|
unit: Requests per normalization unit
|
|
CC Req:
|
|
rst: The total number of requests to the L2 that go to Coherently Cacheable (CC) memory
|
|
allocations. See the :ref:`memory-type` for more information.
|
|
unit: Requests per normalization unit
|
|
RW Req:
|
|
rst: The total number of requests to the L2 that go to Read-Write coherent memory (RW)
|
|
allocations. See the :ref:`memory-type` for more information.
|
|
unit: Requests per normalization unit
|
|
Write - Credit Starvation:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
|
|
requests to any memory location because too many write/atomic requests were
|
|
currently in flight, as a percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Read (32B):
|
|
rst: The total number of L2 requests to Infinity Fabric to read 32B of data from
|
|
any memory location, per :ref:`normalization unit <normalization-units>`. See :ref:`l2-request-flow`
|
|
for more detail. Typically unused on CDNA accelerators.
|
|
unit: Requests per normalization unit
|
|
Read (64B):
|
|
rst: The total number of L2 requests to Infinity Fabric to read 64B of data from
|
|
any memory location, per :ref:`normalization unit <normalization-units>`. See :ref:`l2-request-flow`
|
|
for more detail.
|
|
unit: Requests per normalization unit
|
|
Read (Uncached):
|
|
rst: The total number of L2 requests to Infinity Fabric to read :ref:`uncached
|
|
data <memory-type>` from any memory location, per :ref:`normalization unit
|
|
<normalization-units>`. 64B requests for uncached data are counted as two 32B
|
|
uncached data requests. See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
HBM Read:
|
|
rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data
|
|
from the accelerator's local HBM, per :ref:`normalization unit <normalization-units>`.
|
|
See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
Remote Read:
|
|
rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data
|
|
from any source other than the accelerator's local HBM, per :ref:`normalization
|
|
unit <normalization-units>`. See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
Write and Atomic (32B):
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
|
|
32B of data to any memory location, per :ref:`normalization unit <normalization-units>`.
|
|
See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
Write and Atomic (Uncached):
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
|
|
32B or 64B of :ref:`uncached data <memory-type>`, per :ref:`normalization unit
|
|
<normalization-units>`. See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
Write and Atomic (64B):
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
|
|
64B of data in any memory location, per :ref:`normalization unit <normalization-units>`.
|
|
See :ref:`l2-request-flow` for more detail.
|
|
unit: Requests per normalization unit
|
|
HBM Write and Atomic:
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
|
|
32B or 64B of data in the accelerator's local HBM, per :ref:`normalization
|
|
unit <normalization-units>`. See :ref:`l2-request-flow` for more detail. plain
|
|
unit: Requests per normalization unit
|
|
Remote Write and Atomic:
|
|
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
|
|
32B or 64B of data in any memory location other than the accelerator's local
|
|
HBM, per :ref:`normalization unit <normalization-units>`. See :ref:`l2-request-flow`
|
|
for more detail.
|
|
unit: Requests per normalization unit
|
|
Atomic:
|
|
rst: The total number of L2 requests to Infinity Fabric to atomically update 32B
|
|
or 64B of data in any memory location, per :ref:`normalization unit <normalization-units>`.
|
|
See :ref:`l2-request-flow` for more detail. Note that on current CDNA accelerators,
|
|
such as the :ref:`MI2XX <mixxx-note>`, requests are only considered *atomic*
|
|
by Infinity Fabric if they are targeted at non-write-cacheable memory, such
|
|
as :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached
|
|
memory <memory-type>` allocations on the MI2XX.
|
|
unit: Requests per normalization unit
|
|
Read Stall:
|
|
rst: "The ratio of the total number of cycles the L2-Fabric interface was stalled\
|
|
\ on a read request to any destination (local HBM, remote PCIe\xAE connected\
|
|
\ accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_\
|
|
\ or CPU) over the :ref:`total active L2 cycles <total-active-l2-cycles>`."
|
|
unit: Percent
|
|
Write Stall:
|
|
rst: The ratio of the total number of cycles the L2-Fabric interface was stalled
|
|
on a write or atomic request to any destination (local HBM, remote accelerator
|
|
or CPU, PCIe connected accelerator or CPU, or remote Infinity Fabric connected
|
|
accelerator [#inf]_ or CPU) over the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Read - PCIe Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on read requests
|
|
to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total
|
|
active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Read - Infinity Fabric Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on read requests
|
|
to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a percent
|
|
of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Read - HBM Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on read requests
|
|
to the accelerator's local HBM as a percent of the :ref:`total active L2 cycles
|
|
<total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Write - PCIe Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
|
|
requests to remote PCIe connected accelerators [#inf]_ or CPUs as a percent
|
|
of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Write - Infinity Fabric Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
|
|
requests to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as
|
|
a percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
|
unit: Percent
|
|
Write - HBM Stall:
|
|
rst: The number of cycles the L2-Fabric interface was stalled on write or atomic
|
|
requests to accelerator's local HBM as a percent of the total active L2 cycles.
|
|
unit: Percent
|
|
Scalar L1D Speed-of-Light:
|
|
Bandwidth:
|
|
rst: The number of bytes looked up in the sL1D cache, as a percent of the peak theoretical
|
|
bandwidth. Calculated as the ratio of sL1D requests over the :ref:`total sL1D
|
|
cycles <total-sl1d-cycles>`.
|
|
unit: Percent
|
|
Cache Hit Rate:
|
|
rst: Indicates the percent of sL1D requests that hit on a previously loaded line
|
|
the cache. The ratio of the number of sL1D requests that hit [#sl1d-cache]_
|
|
over the number of all sL1D requests.
|
|
unit: Percent
|
|
sL1D-L2 BW:
|
|
rst: "The total number of bytes read from, written to, or atomically updated \
|
|
\ across the sL1D\u2194:doc:`L2 <l2-cache>` interface, per :ref:`normalization\
|
|
\ unit <normalization-units>`. Note that sL1D writes and atomics are typically\
|
|
\ unused on current CDNA accelerators, so in the majority of cases this can\
|
|
\ be interpreted as an sL1D\u2192L2 read bandwidth."
|
|
unit: Bytes per normalization unit
|
|
Req:
|
|
rst: The total number of requests, of any size or type, made to the sL1D per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Hits:
|
|
rst: The total number of sL1D requests that hit on a previously loaded cache line,
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Misses - Non Duplicated:
|
|
rst: The total number of sL1D requests that missed on a cache line that *was not*
|
|
already pending due to another request, per :ref:`normalization unit <normalization-units>`.
|
|
See :ref:`desc-sl1d-sol` for more detail.
|
|
unit: Requests per normalization unit
|
|
Misses- Duplicated:
|
|
rst: The total number of sL1D requests that missed on a cache line that *was* already
|
|
pending due to another request, per :ref:`normalization unit <normalization-units>`.
|
|
See :ref:`desc-sl1d-sol` for more detail.
|
|
unit: Requests per normalization unit
|
|
Read Req (Total):
|
|
rst: The total number of sL1D read requests of any size, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Atomic Req:
|
|
rst: The total number of atomic requests from sL1D to the :doc:`L2 <l2-cache>`,
|
|
per :ref:`normalization unit <normalization-units>`. Typically unused on current
|
|
CDNA accelerators.
|
|
unit: Requests per normalization unit
|
|
Read Req (1 DWord):
|
|
rst: The total number of sL1D read requests made for a single dword of data (4B),
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Read Req (2 DWord):
|
|
rst: The total number of sL1D read requests made for a two dwords of data (8B),
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Read Req (4 DWord):
|
|
rst: The total number of sL1D read requests made for a four dwords of data (16B),
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Read Req (8 DWord):
|
|
rst: The total number of sL1D read requests made for a eight dwords of data (32B),
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Read Req (16 DWord):
|
|
rst: The total number of sL1D read requests made for a sixteen dwords of data (64B),
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Read Req:
|
|
rst: The total number of read requests from sL1D to the :doc:`L2 <l2-cache>`, per
|
|
:ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Write Req:
|
|
rst: The total number of write requests from sL1D to the :doc:`L2 <l2-cache>`, per
|
|
:ref:`normalization unit <normalization-units>`. Typically unused on current
|
|
CDNA accelerators.
|
|
unit: Requests per normalization unit
|
|
Stall Cycles:
|
|
rst: "The total number of cycles the sL1D\u2194 :doc:`L2 <l2-cache>` interface\
|
|
\ was stalled, per :ref:`normalization unit <normalization-units>`."
|
|
unit: Cycles per normalization unit
|
|
Scalar L1D cache accesses:
|
|
Bandwidth:
|
|
rst: The number of bytes looked up in the sL1D cache, as a percent of the peak theoretical
|
|
bandwidth. Calculated as the ratio of sL1D requests over the :ref:`total sL1D
|
|
cycles <total-sl1d-cycles>`.
|
|
unit: Percent
|
|
Cache Hit Rate:
|
|
rst: Indicates the percent of sL1D requests that hit on a previously loaded line
|
|
the cache. The ratio of the number of sL1D requests that hit [#sl1d-cache]_
|
|
over the number of all sL1D requests.
|
|
unit: Percent
|
|
sL1D-L2 BW:
|
|
rst: "The total number of bytes read from, written to, or atomically updated \
|
|
\ across the sL1D\u2194:doc:`L2 <l2-cache>` interface, per :ref:`normalization\
|
|
\ unit <normalization-units>`. Note that sL1D writes and atomics are typically\
|
|
\ unused on current CDNA accelerators, so in the majority of cases this can\
|
|
\ be interpreted as an sL1D\u2192L2 read bandwidth."
|
|
unit: Bytes per normalization unit
|
|
Req:
|
|
rst: The total number of requests, of any size or type, made to the sL1D per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Hits:
|
|
rst: The total number of sL1D requests that hit on a previously loaded cache line,
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Misses - Non Duplicated:
|
|
rst: The total number of sL1D requests that missed on a cache line that *was not*
|
|
already pending due to another request, per :ref:`normalization unit <normalization-units>`.
|
|
See :ref:`desc-sl1d-sol` for more detail.
|
|
unit: Requests per normalization unit
|
|
Misses- Duplicated:
|
|
rst: The total number of sL1D requests that missed on a cache line that *was* already
|
|
pending due to another request, per :ref:`normalization unit <normalization-units>`.
|
|
See :ref:`desc-sl1d-sol` for more detail.
|
|
unit: Requests per normalization unit
|
|
Read Req (Total):
|
|
rst: The total number of sL1D read requests of any size, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Atomic Req:
|
|
rst: The total number of atomic requests from sL1D to the :doc:`L2 <l2-cache>`,
|
|
per :ref:`normalization unit <normalization-units>`. Typically unused on current
|
|
CDNA accelerators.
|
|
unit: Requests per normalization unit
|
|
Read Req (1 DWord):
|
|
rst: The total number of sL1D read requests made for a single dword of data (4B),
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Read Req (2 DWord):
|
|
rst: The total number of sL1D read requests made for a two dwords of data (8B),
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Read Req (4 DWord):
|
|
rst: The total number of sL1D read requests made for a four dwords of data (16B),
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Read Req (8 DWord):
|
|
rst: The total number of sL1D read requests made for a eight dwords of data (32B),
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Read Req (16 DWord):
|
|
rst: The total number of sL1D read requests made for a sixteen dwords of data (64B),
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Read Req:
|
|
rst: The total number of read requests from sL1D to the :doc:`L2 <l2-cache>`, per
|
|
:ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Write Req:
|
|
rst: The total number of write requests from sL1D to the :doc:`L2 <l2-cache>`, per
|
|
:ref:`normalization unit <normalization-units>`. Typically unused on current
|
|
CDNA accelerators.
|
|
unit: Requests per normalization unit
|
|
Stall Cycles:
|
|
rst: "The total number of cycles the sL1D\u2194 :doc:`L2 <l2-cache>` interface\
|
|
\ was stalled, per :ref:`normalization unit <normalization-units>`."
|
|
unit: Cycles per normalization unit
|
|
Scalar L1D Cache - L2 Interface:
|
|
Bandwidth:
|
|
rst: The number of bytes looked up in the sL1D cache, as a percent of the peak theoretical
|
|
bandwidth. Calculated as the ratio of sL1D requests over the :ref:`total sL1D
|
|
cycles <total-sl1d-cycles>`.
|
|
unit: Percent
|
|
Cache Hit Rate:
|
|
rst: Indicates the percent of sL1D requests that hit on a previously loaded line
|
|
the cache. The ratio of the number of sL1D requests that hit [#sl1d-cache]_
|
|
over the number of all sL1D requests.
|
|
unit: Percent
|
|
sL1D-L2 BW:
|
|
rst: "The total number of bytes read from, written to, or atomically updated \
|
|
\ across the sL1D\u2194:doc:`L2 <l2-cache>` interface, per :ref:`normalization\
|
|
\ unit <normalization-units>`. Note that sL1D writes and atomics are typically\
|
|
\ unused on current CDNA accelerators, so in the majority of cases this can\
|
|
\ be interpreted as an sL1D\u2192L2 read bandwidth."
|
|
unit: Bytes per normalization unit
|
|
Req:
|
|
rst: The total number of requests, of any size or type, made to the sL1D per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Hits:
|
|
rst: The total number of sL1D requests that hit on a previously loaded cache line,
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Misses - Non Duplicated:
|
|
rst: The total number of sL1D requests that missed on a cache line that *was not*
|
|
already pending due to another request, per :ref:`normalization unit <normalization-units>`.
|
|
See :ref:`desc-sl1d-sol` for more detail.
|
|
unit: Requests per normalization unit
|
|
Misses- Duplicated:
|
|
rst: The total number of sL1D requests that missed on a cache line that *was* already
|
|
pending due to another request, per :ref:`normalization unit <normalization-units>`.
|
|
See :ref:`desc-sl1d-sol` for more detail.
|
|
unit: Requests per normalization unit
|
|
Read Req (Total):
|
|
rst: The total number of sL1D read requests of any size, per :ref:`normalization
|
|
unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Atomic Req:
|
|
rst: The total number of atomic requests from sL1D to the :doc:`L2 <l2-cache>`,
|
|
per :ref:`normalization unit <normalization-units>`. Typically unused on current
|
|
CDNA accelerators.
|
|
unit: Requests per normalization unit
|
|
Read Req (1 DWord):
|
|
rst: The total number of sL1D read requests made for a single dword of data (4B),
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Read Req (2 DWord):
|
|
rst: The total number of sL1D read requests made for a two dwords of data (8B),
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Read Req (4 DWord):
|
|
rst: The total number of sL1D read requests made for a four dwords of data (16B),
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Read Req (8 DWord):
|
|
rst: The total number of sL1D read requests made for a eight dwords of data (32B),
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Read Req (16 DWord):
|
|
rst: The total number of sL1D read requests made for a sixteen dwords of data (64B),
|
|
per :ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Read Req:
|
|
rst: The total number of read requests from sL1D to the :doc:`L2 <l2-cache>`, per
|
|
:ref:`normalization unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Write Req:
|
|
rst: The total number of write requests from sL1D to the :doc:`L2 <l2-cache>`, per
|
|
:ref:`normalization unit <normalization-units>`. Typically unused on current
|
|
CDNA accelerators.
|
|
unit: Requests per normalization unit
|
|
Stall Cycles:
|
|
rst: "The total number of cycles the sL1D\u2194 :doc:`L2 <l2-cache>` interface\
|
|
\ was stalled, per :ref:`normalization unit <normalization-units>`."
|
|
unit: Cycles per normalization unit
|
|
L1I Speed-of-Light:
|
|
Bandwidth:
|
|
rst: The number of bytes looked up in the L1I cache, as a percent of the peak theoretical
|
|
bandwidth. Calculated as the ratio of L1I requests over the :ref:`total L1I
|
|
cycles <total-l1i-cycles>`.
|
|
unit: Percent
|
|
Cache Hit Rate:
|
|
rst: The percent of L1I requests that hit [#l1i-cache]_ on a previously loaded line
|
|
the cache. Calculated as the ratio of the number of L1I requests that hit over
|
|
the number of all L1I requests.
|
|
unit: Percent
|
|
L1I-L2 Bandwidth:
|
|
rst: "The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth\
|
|
\ achieved. Calculated as the ratio of the total number of requests from the\
|
|
\ L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`."
|
|
unit: Percent
|
|
Req:
|
|
rst: The total number of requests made to the L1I per normalization-unit
|
|
unit: Requests per normalization unit
|
|
Hits:
|
|
rst: The total number of L1I requests that hit on a previously loaded cache line,
|
|
per :ref:`normalization-unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Misses - Non Duplicated:
|
|
rst: The total number of L1I requests that missed on a cache line that *were
|
|
not* already pending due to another request, per :ref:`normalization-unit <normalization-units>`.
|
|
See note in :ref:`desc-l1i-sol` for more detail.
|
|
unit: Requests per normalization unit
|
|
Misses - Duplicated:
|
|
rst: The total number of L1I requests that missed on a cache line that *were* already
|
|
pending due to another request, per :ref:`normalization-unit <normalization-units>`.
|
|
See note in :ref:`desc-l1i-sol` for more detail.
|
|
unit: Requests per normalization unit
|
|
Instruction Fetch Latency:
|
|
rst: The average number of cycles spent to fetch instructions to a :doc:`CU <compute-unit>`.
|
|
unit: Cycles
|
|
L1I cache accesses:
|
|
Bandwidth:
|
|
rst: The number of bytes looked up in the L1I cache, as a percent of the peak theoretical
|
|
bandwidth. Calculated as the ratio of L1I requests over the :ref:`total L1I
|
|
cycles <total-l1i-cycles>`.
|
|
unit: Percent
|
|
Cache Hit Rate:
|
|
rst: The percent of L1I requests that hit [#l1i-cache]_ on a previously loaded line
|
|
the cache. Calculated as the ratio of the number of L1I requests that hit over
|
|
the number of all L1I requests.
|
|
unit: Percent
|
|
L1I-L2 Bandwidth:
|
|
rst: "The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth\
|
|
\ achieved. Calculated as the ratio of the total number of requests from the\
|
|
\ L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`."
|
|
unit: Percent
|
|
Req:
|
|
rst: The total number of requests made to the L1I per normalization-unit
|
|
unit: Requests per normalization unit
|
|
Hits:
|
|
rst: The total number of L1I requests that hit on a previously loaded cache line,
|
|
per :ref:`normalization-unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Misses - Non Duplicated:
|
|
rst: The total number of L1I requests that missed on a cache line that *were
|
|
not* already pending due to another request, per :ref:`normalization-unit <normalization-units>`.
|
|
See note in :ref:`desc-l1i-sol` for more detail.
|
|
unit: Requests per normalization unit
|
|
Misses - Duplicated:
|
|
rst: The total number of L1I requests that missed on a cache line that *were* already
|
|
pending due to another request, per :ref:`normalization-unit <normalization-units>`.
|
|
See note in :ref:`desc-l1i-sol` for more detail.
|
|
unit: Requests per normalization unit
|
|
Instruction Fetch Latency:
|
|
rst: The average number of cycles spent to fetch instructions to a :doc:`CU <compute-unit>`.
|
|
unit: Cycles
|
|
L1I <-> L2 interface:
|
|
Bandwidth:
|
|
rst: The number of bytes looked up in the L1I cache, as a percent of the peak theoretical
|
|
bandwidth. Calculated as the ratio of L1I requests over the :ref:`total L1I
|
|
cycles <total-l1i-cycles>`.
|
|
unit: Percent
|
|
Cache Hit Rate:
|
|
rst: The percent of L1I requests that hit [#l1i-cache]_ on a previously loaded line
|
|
the cache. Calculated as the ratio of the number of L1I requests that hit over
|
|
the number of all L1I requests.
|
|
unit: Percent
|
|
L1I-L2 Bandwidth:
|
|
rst: "The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth\
|
|
\ achieved. Calculated as the ratio of the total number of requests from the\
|
|
\ L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`."
|
|
unit: Percent
|
|
Req:
|
|
rst: The total number of requests made to the L1I per normalization-unit
|
|
unit: Requests per normalization unit
|
|
Hits:
|
|
rst: The total number of L1I requests that hit on a previously loaded cache line,
|
|
per :ref:`normalization-unit <normalization-units>`.
|
|
unit: Requests per normalization unit
|
|
Misses - Non Duplicated:
|
|
rst: The total number of L1I requests that missed on a cache line that *were
|
|
not* already pending due to another request, per :ref:`normalization-unit <normalization-units>`.
|
|
See note in :ref:`desc-l1i-sol` for more detail.
|
|
unit: Requests per normalization unit
|
|
Misses - Duplicated:
|
|
rst: The total number of L1I requests that missed on a cache line that *were* already
|
|
pending due to another request, per :ref:`normalization-unit <normalization-units>`.
|
|
See note in :ref:`desc-l1i-sol` for more detail.
|
|
unit: Requests per normalization unit
|
|
Instruction Fetch Latency:
|
|
rst: The average number of cycles spent to fetch instructions to a :doc:`CU <compute-unit>`.
|
|
unit: Cycles
|
|
Workgroup manager utilizations:
|
|
Accelerator Utilization:
|
|
rst: The percent of cycles in the kernel where the accelerator was actively doing
|
|
any work.
|
|
unit: Percent
|
|
Scheduler-Pipe Utilization:
|
|
rst: 'The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in the
|
|
kernel where the scheduler-pipes were actively doing any work. Note: this value
|
|
is expected to range between 0% and 25%. See :ref:`desc-spi`.'
|
|
unit: Percent
|
|
Workgroup Manager Utilization:
|
|
rst: The percent of cycles in the kernel where the workgroup manager was actively
|
|
doing any work.
|
|
unit: Percent
|
|
Shader Engine Utilization:
|
|
rst: The percent of :ref:`total shader engine cycles <total-se-cycles>` in the kernel
|
|
where any CU in a shader-engine was actively doing any work, normalized over
|
|
all shader-engines. Low values (e.g., << 100%) indicate that the accelerator
|
|
was not fully saturated by the kernel, or a potential load-imbalance issue.
|
|
unit: Percent
|
|
SIMD Utilization:
|
|
rst: The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel where
|
|
any :ref:`SIMD <desc-valu>` on a CU was actively doing any work, summed over
|
|
all CUs. Low values (less than 100%) indicate that the accelerator was not
|
|
fully saturated by the kernel, or a potential load-imbalance issue.
|
|
unit: Percent
|
|
Dispatched Workgroups:
|
|
rst: The total number of workgroups forming this kernel launch.
|
|
unit: Workgroups
|
|
Dispatched Wavefronts:
|
|
rst: The total number of wavefronts, summed over all workgroups, forming this
|
|
kernel launch.
|
|
unit: Wavefronts
|
|
VGPR Writes:
|
|
rst: The average number of cycles spent initializing :ref:`VGPRs <desc-valu>` at
|
|
wave creation.
|
|
unit: Cycles/wave
|
|
SGPR Writes:
|
|
rst: The average number of cycles spent initializing :ref:`SGPRs <desc-salu>` at
|
|
wave creation.
|
|
unit: Cycles/wave
|
|
Not-scheduled Rate (Workgroup Manager):
|
|
rst: 'The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in the
|
|
kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
|
|
due to a bottleneck within the workgroup manager rather than a lack of a CU
|
|
or :ref:`SIMD <desc-valu>` with sufficient resources. Note: this value is expected
|
|
to range between 0-25%. See note in :ref:`workgroup manager <desc-spi>` description.'
|
|
unit: Percent
|
|
Not-scheduled Rate (Scheduler-Pipe):
|
|
rst: 'The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in the
|
|
kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
|
|
due to a bottleneck within the scheduler-pipes rather than a lack of a CU or
|
|
:ref:`SIMD <desc-valu>` with sufficient resources. Note: this value is expected
|
|
to range between 0-25%, see note in :ref:`workgroup manager <desc-spi>` description.'
|
|
unit: Percent
|
|
Scheduler-Pipe Stall Rate:
|
|
rst: 'The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in the
|
|
kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
|
|
due to occupancy limitations (like a lack of a CU or :ref:`SIMD <desc-valu>`
|
|
with sufficient resources). Note: this value is expected to range between 0-25%,
|
|
see note in :ref:`workgroup manager <desc-spi>` description.'
|
|
unit: Percent
|
|
Scratch Stall Rate:
|
|
rst: The percent of :ref:`total shader-engine cycles <total-se-cycles>` in the kernel
|
|
where a workgroup could not be scheduled to a :doc:`CU <compute-unit>` due
|
|
to lack of :ref:`private (a.k.a., scratch) memory <memory-type>` slots. While
|
|
this can reach up to 100%, note that the actual occupancy limitations on a kernel
|
|
using private memory are typically quite small (for example, less than 1% of
|
|
the total number of waves that can be scheduled to an accelerator).
|
|
unit: Percent
|
|
Insufficient SIMD Waveslots:
|
|
rst: The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel where
|
|
a workgroup could not be scheduled to a :ref:`SIMD <desc-valu>` due to lack
|
|
of available :ref:`waveslots <desc-valu>`.
|
|
unit: Percent
|
|
Insufficient SIMD VGPRs:
|
|
rst: The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel where
|
|
a workgroup could not be scheduled to a :ref:`SIMD <desc-valu>` due to lack
|
|
of available :ref:`VGPRs <desc-valu>`.
|
|
unit: Percent
|
|
Insufficient SIMD SGPRs:
|
|
rst: The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel where
|
|
a workgroup could not be scheduled to a :ref:`SIMD <desc-valu>` due to lack
|
|
of available :ref:`SGPRs <desc-salu>`.
|
|
unit: Percent
|
|
Insufficient CU LDS:
|
|
rst: The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel where
|
|
a workgroup could not be scheduled to a :doc:`CU <compute-unit>` due to lack
|
|
of available :doc:`LDS <local-data-share>`.
|
|
unit: Percent
|
|
Insufficient CU Barriers:
|
|
rst: The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel where
|
|
a workgroup could not be scheduled to a :doc:`CU <compute-unit>` due to lack
|
|
of available :ref:`barriers <desc-barrier>`.
|
|
unit: Percent
|
|
Reached CU Workgroup Limit:
|
|
rst: The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel where
|
|
a workgroup could not be scheduled to a :doc:`CU <compute-unit>` due to limits
|
|
within the workgroup manager. This is expected to be always be zero on CDNA2
|
|
or newer accelerators (and small for previous accelerators).
|
|
unit: Percent
|
|
Reached CU Wavefront Limit:
|
|
rst: The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel where
|
|
a wavefront could not be scheduled to a :doc:`CU <compute-unit>` due to limits
|
|
within the workgroup manager. This is expected to be always be zero on CDNA2
|
|
or newer accelerators (and small for previous accelerators).
|
|
unit: Percent
|
|
Workgroup Manager - Resource Allocation:
|
|
Accelerator Utilization:
|
|
rst: The percent of cycles in the kernel where the accelerator was actively doing
|
|
any work.
|
|
unit: Percent
|
|
Scheduler-Pipe Utilization:
|
|
rst: 'The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in the
|
|
kernel where the scheduler-pipes were actively doing any work. Note: this value
|
|
is expected to range between 0% and 25%. See :ref:`desc-spi`.'
|
|
unit: Percent
|
|
Workgroup Manager Utilization:
|
|
rst: The percent of cycles in the kernel where the workgroup manager was actively
|
|
doing any work.
|
|
unit: Percent
|
|
Shader Engine Utilization:
|
|
rst: The percent of :ref:`total shader engine cycles <total-se-cycles>` in the kernel
|
|
where any CU in a shader-engine was actively doing any work, normalized over
|
|
all shader-engines. Low values (e.g., << 100%) indicate that the accelerator
|
|
was not fully saturated by the kernel, or a potential load-imbalance issue.
|
|
unit: Percent
|
|
SIMD Utilization:
|
|
rst: The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel where
|
|
any :ref:`SIMD <desc-valu>` on a CU was actively doing any work, summed over
|
|
all CUs. Low values (less than 100%) indicate that the accelerator was not
|
|
fully saturated by the kernel, or a potential load-imbalance issue.
|
|
unit: Percent
|
|
Dispatched Workgroups:
|
|
rst: The total number of workgroups forming this kernel launch.
|
|
unit: Workgroups
|
|
Dispatched Wavefronts:
|
|
rst: The total number of wavefronts, summed over all workgroups, forming this
|
|
kernel launch.
|
|
unit: Wavefronts
|
|
VGPR Writes:
|
|
rst: The average number of cycles spent initializing :ref:`VGPRs <desc-valu>` at
|
|
wave creation.
|
|
unit: Cycles/wave
|
|
SGPR Writes:
|
|
rst: The average number of cycles spent initializing :ref:`SGPRs <desc-salu>` at
|
|
wave creation.
|
|
unit: Cycles/wave
|
|
Not-scheduled Rate (Workgroup Manager):
|
|
rst: 'The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in the
|
|
kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
|
|
due to a bottleneck within the workgroup manager rather than a lack of a CU
|
|
or :ref:`SIMD <desc-valu>` with sufficient resources. Note: this value is expected
|
|
to range between 0-25%. See note in :ref:`workgroup manager <desc-spi>` description.'
|
|
unit: Percent
|
|
Not-scheduled Rate (Scheduler-Pipe):
|
|
rst: 'The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in the
|
|
kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
|
|
due to a bottleneck within the scheduler-pipes rather than a lack of a CU or
|
|
:ref:`SIMD <desc-valu>` with sufficient resources. Note: this value is expected
|
|
to range between 0-25%, see note in :ref:`workgroup manager <desc-spi>` description.'
|
|
unit: Percent
|
|
Scheduler-Pipe Stall Rate:
|
|
rst: 'The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in the
|
|
kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
|
|
due to occupancy limitations (like a lack of a CU or :ref:`SIMD <desc-valu>`
|
|
with sufficient resources). Note: this value is expected to range between 0-25%,
|
|
see note in :ref:`workgroup manager <desc-spi>` description.'
|
|
unit: Percent
|
|
Scratch Stall Rate:
|
|
rst: The percent of :ref:`total shader-engine cycles <total-se-cycles>` in the kernel
|
|
where a workgroup could not be scheduled to a :doc:`CU <compute-unit>` due
|
|
to lack of :ref:`private (a.k.a., scratch) memory <memory-type>` slots. While
|
|
this can reach up to 100%, note that the actual occupancy limitations on a kernel
|
|
using private memory are typically quite small (for example, less than 1% of
|
|
the total number of waves that can be scheduled to an accelerator).
|
|
unit: Percent
|
|
Insufficient SIMD Waveslots:
|
|
rst: The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel where
|
|
a workgroup could not be scheduled to a :ref:`SIMD <desc-valu>` due to lack
|
|
of available :ref:`waveslots <desc-valu>`.
|
|
unit: Percent
|
|
Insufficient SIMD VGPRs:
|
|
rst: The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel where
|
|
a workgroup could not be scheduled to a :ref:`SIMD <desc-valu>` due to lack
|
|
of available :ref:`VGPRs <desc-valu>`.
|
|
unit: Percent
|
|
Insufficient SIMD SGPRs:
|
|
rst: The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel where
|
|
a workgroup could not be scheduled to a :ref:`SIMD <desc-valu>` due to lack
|
|
of available :ref:`SGPRs <desc-salu>`.
|
|
unit: Percent
|
|
Insufficient CU LDS:
|
|
rst: The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel where
|
|
a workgroup could not be scheduled to a :doc:`CU <compute-unit>` due to lack
|
|
of available :doc:`LDS <local-data-share>`.
|
|
unit: Percent
|
|
Insufficient CU Barriers:
|
|
rst: The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel where
|
|
a workgroup could not be scheduled to a :doc:`CU <compute-unit>` due to lack
|
|
of available :ref:`barriers <desc-barrier>`.
|
|
unit: Percent
|
|
Reached CU Workgroup Limit:
|
|
rst: The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel where
|
|
a workgroup could not be scheduled to a :doc:`CU <compute-unit>` due to limits
|
|
within the workgroup manager. This is expected to be always be zero on CDNA2
|
|
or newer accelerators (and small for previous accelerators).
|
|
unit: Percent
|
|
Reached CU Wavefront Limit:
|
|
rst: The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel where
|
|
a wavefront could not be scheduled to a :doc:`CU <compute-unit>` due to limits
|
|
within the workgroup manager. This is expected to be always be zero on CDNA2
|
|
or newer accelerators (and small for previous accelerators).
|
|
unit: Percent
|
|
Command processor fetcher (CPF):
|
|
CPF Utilization:
|
|
rst: Percent of total cycles where the CPF was busy actively doing any work. The
|
|
ratio of CPF busy cycles over total cycles counted by the CPF.
|
|
unit: Percent
|
|
CPF Stall:
|
|
rst: Percent of CPF busy cycles where the CPF was stalled for any reason.
|
|
unit: Percent
|
|
CPF-L2 Utilization:
|
|
rst: Percent of total cycles counted by the CPF-:doc:`L2 <l2-cache>` interface where
|
|
the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
|
|
over total cycles counted by the CPF-L2.
|
|
unit: Percent
|
|
CPF-L2 Stall:
|
|
rst: Percent of CPF-:doc:`L2 <l2-cache>` L2 busy cycles where the CPF-L2 interface
|
|
was stalled for any reason.
|
|
unit: Percent
|
|
CPF-UTCL1 Stall:
|
|
rst: Percent of CPF busy cycles where the CPF was stalled by address translation.
|
|
unit: Percent
|
|
CPC Utilization:
|
|
rst: Percent of total cycles where the CPC was busy actively doing any work. The
|
|
ratio of CPC busy cycles over total cycles counted by the CPC.
|
|
unit: Percent
|
|
CPC Stall Rate:
|
|
rst: Percent of CPC busy cycles where the CPC was stalled for any reason.
|
|
unit: Percent
|
|
CPC Packet Decoding Utilization:
|
|
rst: Percent of CPC busy cycles spent decoding commands for processing.
|
|
unit: Percent
|
|
CPC-Workgroup Manager Utilization:
|
|
rst: Percent of CPC busy cycles spent dispatching workgroups to the :ref:`workgroup
|
|
manager <desc-spi>`.
|
|
unit: Percent
|
|
CPC-L2 Utilization:
|
|
rst: Percent of total cycles counted by the CPC-:doc:`L2 <l2-cache>` interface where
|
|
the CPC-L2 interface was active doing any work.
|
|
unit: Percent
|
|
CPC-UTCL1 Stall:
|
|
rst: Percent of CPC busy cycles where the CPC was stalled by address translation
|
|
unit: Percent
|
|
CPC-UTCL2 Utilization:
|
|
rst: Percent of total cycles counted by the CPC's :doc:`L2 <l2-cache>` address translation
|
|
interface where the CPC was busy doing address translation work.
|
|
unit: Percent
|
|
Command processor packet processor (CPC):
|
|
CPF Utilization:
|
|
rst: Percent of total cycles where the CPF was busy actively doing any work. The
|
|
ratio of CPF busy cycles over total cycles counted by the CPF.
|
|
unit: Percent
|
|
CPF Stall:
|
|
rst: Percent of CPF busy cycles where the CPF was stalled for any reason.
|
|
unit: Percent
|
|
CPF-L2 Utilization:
|
|
rst: Percent of total cycles counted by the CPF-:doc:`L2 <l2-cache>` interface where
|
|
the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
|
|
over total cycles counted by the CPF-L2.
|
|
unit: Percent
|
|
CPF-L2 Stall:
|
|
rst: Percent of CPF-:doc:`L2 <l2-cache>` L2 busy cycles where the CPF-L2 interface
|
|
was stalled for any reason.
|
|
unit: Percent
|
|
CPF-UTCL1 Stall:
|
|
rst: Percent of CPF busy cycles where the CPF was stalled by address translation.
|
|
unit: Percent
|
|
CPC Utilization:
|
|
rst: Percent of total cycles where the CPC was busy actively doing any work. The
|
|
ratio of CPC busy cycles over total cycles counted by the CPC.
|
|
unit: Percent
|
|
CPC Stall Rate:
|
|
rst: Percent of CPC busy cycles where the CPC was stalled for any reason.
|
|
unit: Percent
|
|
CPC Packet Decoding Utilization:
|
|
rst: Percent of CPC busy cycles spent decoding commands for processing.
|
|
unit: Percent
|
|
CPC-Workgroup Manager Utilization:
|
|
rst: Percent of CPC busy cycles spent dispatching workgroups to the :ref:`workgroup
|
|
manager <desc-spi>`.
|
|
unit: Percent
|
|
CPC-L2 Utilization:
|
|
rst: Percent of total cycles counted by the CPC-:doc:`L2 <l2-cache>` interface where
|
|
the CPC-L2 interface was active doing any work.
|
|
unit: Percent
|
|
CPC-UTCL1 Stall:
|
|
rst: Percent of CPC busy cycles where the CPC was stalled by address translation
|
|
unit: Percent
|
|
CPC-UTCL2 Utilization:
|
|
rst: Percent of total cycles counted by the CPC's :doc:`L2 <l2-cache>` address translation
|
|
interface where the CPC was busy doing address translation work.
|
|
unit: Percent
|
|
System Speed-of-Light:
|
|
VALU FLOPs:
|
|
rst: 'The total floating-point operations executed per second on the :ref:`VALU
|
|
<desc-valu>`. This is also presented as a percent of the peak theoretical FLOPs
|
|
achievable on the specific accelerator. Note: this does not include any floating-point
|
|
operations from :ref:`MFMA <desc-mfma>` instructions.'
|
|
unit: GFLOPs
|
|
VALU IOPs:
|
|
rst: 'The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
|
|
This is also presented as a percent of the peak theoretical IOPs achievable
|
|
on the specific accelerator. Note: this does not include any integer operations
|
|
from :ref:`MFMA <desc-mfma>` instructions.'
|
|
unit: GOIPs
|
|
MFMA FLOPs (F8):
|
|
rst: 'The total number of 8-bit brain floating point :ref:`MFMA <desc-mfma>` operations
|
|
executed per second. Note: this does not include any 16-bit brain floating point
|
|
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
|
|
as a percent of the peak theoretical F8 MFMA operations achievable on the specific
|
|
accelerator. It is supported on AMD Instinct MI300 series and later only.'
|
|
unit: GFLOPs
|
|
MFMA FLOPs (BF16):
|
|
rst: 'The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
|
|
operations executed per second. Note: this does not include any 16-bit brain
|
|
floating point operations from :ref:`VALU <desc-valu>` instructions. This is
|
|
also presented as a percent of the peak theoretical BF16 MFMA operations achievable
|
|
on the specific accelerator.'
|
|
unit: GFLOPs
|
|
MFMA FLOPs (F16):
|
|
rst: 'The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
|
|
executed per second. Note: this does not include any 16-bit floating point operations
|
|
from :ref:`VALU <desc-valu>` instructions. This is also presented as a percent
|
|
of the peak theoretical F16 MFMA operations achievable on the specific accelerator.'
|
|
unit: GFLOPs
|
|
MFMA FLOPs (F32):
|
|
rst: 'The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
|
|
executed per second. Note: this does not include any 32-bit floating point operations
|
|
from :ref:`VALU <desc-valu>` instructions. This is also presented as a percent
|
|
of the peak theoretical F32 MFMA operations achievable on the specific accelerator.'
|
|
unit: GFLOPs
|
|
MFMA FLOPs (F64):
|
|
rst: 'The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
|
|
executed per second. Note: this does not include any 64-bit floating point operations
|
|
from :ref:`VALU <desc-valu>` instructions. This is also presented as a percent
|
|
of the peak theoretical F64 MFMA operations achievable on the specific accelerator.'
|
|
unit: GFLOPs
|
|
MFMA IOPs (Int8):
|
|
rst: 'The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
|
|
per second. Note: this does not include any 8-bit integer operations from :ref:`VALU
|
|
<desc-valu>` instructions. This is also presented as a percent of the peak theoretical
|
|
INT8 MFMA operations achievable on the specific accelerator.'
|
|
unit: GIOPs
|
|
Active CUs:
|
|
rst: Total number of active compute units (CUs) on the accelerator during the
|
|
kernel execution.
|
|
unit: Number
|
|
SALU Utilization:
|
|
rst: Indicates what percent of the kernel's duration the :ref:`SALU <desc-salu>`
|
|
was busy executing instructions. Computed as the ratio of the total number of
|
|
cycles spent by the :ref:`scheduler <desc-scheduler>` issuing SALU / :ref:`SMEM
|
|
<desc-smem>` instructions over the :ref:`total CU cycles <total-cu-cycles>`.
|
|
unit: Percent
|
|
VALU Utilization:
|
|
rst: Indicates what percent of the kernel's duration the :ref:`VALU <desc-valu>`
|
|
was busy executing instructions. Does not include :ref:`VMEM <desc-vmem>` operations.
|
|
Computed as the ratio of the total number of cycles spent by the :ref:`scheduler
|
|
<desc-scheduler>` issuing VALU instructions over the :ref:`total CU cycles <total-cu-cycles>`.
|
|
unit: Percent
|
|
MFMA Utilization:
|
|
rst: Indicates what percent of the kernel's duration the :ref:`MFMA <desc-mfma>`
|
|
unit was busy executing instructions. Computed as the ratio of the total number
|
|
of cycles spent by the :ref:`MFMA <desc-salu>` was busy over the :ref:`total
|
|
CU cycles <total-cu-cycles>`.
|
|
unit: Percent
|
|
VMEM Utilization:
|
|
rst: Indicates what percent of the kernel's duration the :ref:`VMEM <desc-vmem>`
|
|
unit was busy executing instructions, including both global/generic and spill/scratch
|
|
operations (see the :ref:`VMEM instruction count metrics <ta-instruction-counts>`
|
|
for more detail). Does not include :ref:`VALU <desc-valu>` operations. Computed
|
|
as the ratio of the total number of cycles spent by the :ref:`scheduler <desc-scheduler>`
|
|
issuing VMEM instructions over the :ref:`total CU cycles <total-cu-cycles>`.
|
|
unit: Percent
|
|
Branch Utilization:
|
|
rst: Indicates what percent of the kernel's duration the :ref:`branch <desc-branch>`
|
|
unit was busy executing instructions. Computed as the ratio of the total number
|
|
of cycles spent by the :ref:`scheduler <desc-scheduler>` issuing branch instructions
|
|
over the :ref:`total CU cycles <total-cu-cycles>`.
|
|
unit: Percent
|
|
VALU Active Threads:
|
|
rst: Indicates the average level of :ref:`divergence <desc-divergence>` within
|
|
a wavefront over the lifetime of the kernel. The number of work-items that were
|
|
active in a wavefront during execution of each :ref:`VALU <desc-valu>` instruction,
|
|
time-averaged over all VALU instructions run on all wavefronts in the kernel.
|
|
unit: Work-items
|
|
IPC:
|
|
rst: The ratio of the total number of instructions executed on the :doc:`CU <compute-unit>`
|
|
over the :ref:`total active CU cycles <total-active-cu-cycles>`.
|
|
unit: Instructions per-cycle
|
|
Wavefront Occupancy:
|
|
rst: 'The time-averaged number of wavefronts resident on the accelerator over
|
|
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
|
kernels (less than 1ms). This is also presented as a percent of the peak theoretical
|
|
occupancy achievable on the specific accelerator.'
|
|
unit: Wavefronts
|
|
Theoretical LDS Bandwidth:
|
|
rst: Indicates the maximum amount of bytes that could have been loaded from, stored
|
|
to, or atomically updated in the LDS per unit time (see :ref:`LDS Bandwidth
|
|
<lds-bandwidth>` example for more detail). This is also presented as a percent
|
|
of the peak theoretical F64 MFMA operations achievable on the specific accelerator.
|
|
unit: GB/s
|
|
LDS Bank Conflicts/Access:
|
|
rst: The ratio of the number of cycles spent in the :doc:`LDS scheduler <local-data-share>`
|
|
due to bank conflicts (as determined by the conflict resolution hardware) to
|
|
the base number of cycles that would be spent in the LDS scheduler in a completely uncontended
|
|
case. This is also presented in normalized form (i.e., the Bank Conflict Rate).
|
|
unit: Conflicts/Access
|
|
vL1D Cache Hit Rate:
|
|
rst: The ratio of the number of vL1D cache line requests that hit in vL1D cache
|
|
over the total number of cache line requests to the :ref:`vL1D cache RAM <desc-tc>`.
|
|
unit: Percent
|
|
vL1D Cache BW:
|
|
rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM
|
|
<desc-vmem>` instructions per unit time. The number of bytes is calculated
|
|
as the number of cache lines requested multiplied by the cache line size. This
|
|
value does not consider partial requests, so e.g., if only a single value is
|
|
requested in a cache line, the data movement will still be counted as a full
|
|
cache line. This is also presented as a percent of the peak theoretical bandwidth
|
|
achievable on the specific accelerator.
|
|
unit: GB/s
|
|
L2 Cache Hit Rate:
|
|
rst: The ratio of the number of L2 cache line requests that hit in the L2 cache
|
|
over the total number of incoming cache line requests to the L2 cache.
|
|
unit: Percent
|
|
L2 Cache BW:
|
|
rst: The number of bytes looked up in the L2 cache per unit time. The number of
|
|
bytes is calculated as the number of cache lines requested multiplied by the
|
|
cache line size. This value does not consider partial requests, so e.g., if
|
|
only a single value is requested in a cache line, the data movement will still
|
|
be counted as a full cache line. This is also presented as a percent of the
|
|
peak theoretical bandwidth achievable on the specific accelerator.
|
|
unit: GB/s
|
|
L2-Fabric Read BW:
|
|
rst: "The number of bytes read by the L2 over the :ref:`Infinity Fabric\u2122\
|
|
\ interface <l2-fabric>` per unit time. This is also presented as a percent\
|
|
\ of the peak theoretical bandwidth achievable on the specific accelerator."
|
|
unit: GB/s
|
|
L2-Fabric Write BW:
|
|
rst: The number of bytes sent by the L2 over the :ref:`Infinity Fabric interface
|
|
<l2-fabric>` by write and atomic operations per unit time. This is also presented
|
|
as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
|
|
unit: GB/s
|
|
L2-Fabric Read Latency:
|
|
rst: The time-averaged number of cycles read requests spent in Infinity Fabric before
|
|
data was returned to the L2.
|
|
unit: Cycles
|
|
L2-Fabric Write Latency:
|
|
rst: The time-averaged number of cycles write requests spent in Infinity Fabric
|
|
before a completion acknowledgement was returned to the L2.
|
|
unit: Cycles
|
|
sL1D Cache Hit Rate:
|
|
rst: The percent of sL1D requests that hit on a previously loaded line the cache.
|
|
Calculated as the ratio of the number of sL1D requests that hit over the number
|
|
of all sL1D requests.
|
|
unit: Percent
|
|
sL1D Cache BW:
|
|
rst: The number of bytes looked up in the sL1D cache per unit time. This is also
|
|
presented as a percent of the peak theoretical bandwidth achievable on the
|
|
specific accelerator.
|
|
unit: GB/s
|
|
L1I Hit Rate:
|
|
rst: The percent of L1I requests that hit on a previously loaded line the cache.
|
|
Calculated as the ratio of the number of L1I requests that hit over the number
|
|
of all L1I requests.
|
|
unit: GB/s
|
|
L1I BW:
|
|
rst: The number of bytes looked up in the L1I cache per unit time. This is also
|
|
presented as a percent of the peak theoretical bandwidth achievable on the
|
|
specific accelerator.
|
|
unit: Percent
|
|
L1I Fetch Latency:
|
|
rst: The average number of cycles spent to fetch instructions to a :doc:`CU <compute-unit>`.
|
|
unit: Cycles
|