[rocprof-compute] update yamls for docs (#1887)

Этот коммит содержится в:
xuchen-amd
2025-11-19 10:46:02 -05:00
коммит произвёл GitHub
родитель eddd4c3601
Коммит c778acdb70
83 изменённых файлов: 708 добавлений и 836 удалений
+37 -38
Просмотреть файл
@@ -1,7 +1,6 @@
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
Wavefront launch stats:
AGPRs:
rst: |-
rst: >-
The number of accumulation vector general-purpose registers allocated
for the kernel, see :ref:`AGPRs <desc-agprs>`. Note: this may not exactly match
the number of AGPRs requested by the compiler due to allocation granularity.
@@ -12,7 +11,7 @@ Wavefront launch stats:
total workgroup (or, block) size.
unit: Work-Items
LDS Allocation:
rst: |-
rst: >-
The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
allocated for this kernel. Note: This may also be larger than what was requested
at compile time due to both allocation granularity and dynamic per-dispatch
@@ -23,7 +22,7 @@ Wavefront launch stats:
<https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
unit: Wavefronts
SGPRs:
rst: |-
rst: >-
The number of scalar general-purpose registers allocated for the kernel, see
:ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
SGPRs requested by the compiler due to allocation granularity.
@@ -38,14 +37,14 @@ Wavefront launch stats:
as well as for register spills and restores.
unit: Bytes per work-item
Total Wavefronts:
rst: |-
rst: >-
The total number of wavefronts launched as part of the kernel dispatch.
On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront
size is always 64 work-items. Thus, the total number of wavefronts should
be equivalent to the ceiling of grid size divided by 64.
unit: Wavefronts
VGPRs:
rst: |-
rst: >-
The number of architected vector general-purpose registers allocated for the
kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
number of VGPRs requested by the compiler due to allocation granularity.
@@ -98,14 +97,14 @@ Wavefront runtime stats:
rst: The total duration of the executed kernel in cycles.
unit: Cycles
Wave Cycles:
rst: |-
rst: >-
The number of cycles a wavefront in the kernel dispatch spent resident
on a compute unit per :ref:`normalization unit <normalization-units>`. This is
averaged over all wavefronts in a kernel dispatch. Note: this should not
be directly compared to the kernel cycles above.
unit: Cycles per normalization unit
Wavefront Occupancy:
rst: |-
rst: >-
The time-averaged number of wavefronts resident on the accelerator over the
lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1ms).
@@ -148,7 +147,7 @@ Overall instruction mix:
unit: Instructions
VALU arithmetic instruction mix:
Conversion:
rst: |-
rst: >-
The total number of type conversion instructions (such as converting data
to or from F32\u2194F64) issued to the VALU per :ref:`normalization unit
<normalization-units>`.
@@ -240,7 +239,7 @@ MFMA instruction mix:
unit: Instructions per normalization unit
Compute Speed-of-Light:
MFMA FLOPs (BF16):
rst: |-
rst: >-
The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit brain floating
point operations from :ref:`VALU <desc-valu>` instructions. This is also
@@ -248,7 +247,7 @@ Compute Speed-of-Light:
on the specific accelerator.
unit: GFLOPs
MFMA FLOPs (F16):
rst: |-
rst: >-
The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -256,7 +255,7 @@ Compute Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F32):
rst: |-
rst: >-
The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 32-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -264,7 +263,7 @@ Compute Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F64):
rst: |-
rst: >-
The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 64-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -276,21 +275,21 @@ Compute Speed-of-Light:
achievable on the specific accelerator.
unit: GFLOPs
MFMA IOPs (INT8):
rst: |-
rst: >-
The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
per second. Note: this does not include any 8-bit integer operations from
:ref:`VALU <desc-valu>` instructions. This is also presented as a percent
of the peak theoretical INT8 MFMA operations achievable on the specific accelerator.
unit: GFLOPs
VALU FLOPs:
rst: |-
rst: >-
The total floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is also presented as a percent of the peak theoretical
FLOPs achievable on the specific accelerator. Note: this does not include
any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU IOPs:
rst: |-
rst: >-
The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
This is also presented as a percent of the peak theoretical IOPs achievable
on the specific accelerator. Note: this does not include any integer operations
@@ -361,7 +360,7 @@ Pipeline statistics:
unit: Percent
Arithmetic operations:
BF16 OPs:
rst: |-
rst: >-
The total number of 16-bit brain floating-point operations executed on
either the :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
unit <normalization-units>`. Note: on current CDNA accelerators, the VALU
@@ -388,7 +387,7 @@ Arithmetic operations:
<normalization-units>`.
unit: FLOP per normalization unit
INT8 OPs:
rst: |-
rst: >-
The total number of 8-bit integer operations executed on either the :ref:`VALU
<desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
<normalization-units>`. Note: on current CDNA accelerators, the VALU has
@@ -460,7 +459,7 @@ LDS Statistics:
acknowledgment) required for an LDS instruction to complete.
unit: Cycles
Mem Violations:
rst: |-
rst: >-
The total number of out-of-bounds accesses made to the LDS, per :ref:`normalization
unit <normalization-units>`. This is unused and expected to be zero in
most configurations for modern CDNA\u2122 accelerators.
@@ -586,7 +585,7 @@ L1 Unified Translation Cache (UTCL1):
per normalization unit.
unit: Requests per normalization unit
Permission Misses:
rst: |-
rst: >-
The total number of translation requests that missed in the UTCL1 due
to a permission error, per :ref:`normalization unit <normalization-units>`.
This is unused and expected to be zero in most configurations for modern
@@ -925,7 +924,7 @@ L2-Fabric interface metrics:
before data was returned to the L2.
unit: Cycles
Read Stall:
rst: |-
rst: >-
The ratio of the total number of cycles the L2-Fabric interface was stalled
on a read request to any destination (local HBM, remote PCIe\xAE connected
accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_
@@ -1198,7 +1197,7 @@ Scalar L1D Cache - L2 Interface:
per :ref:`normalization unit <normalization-units>`.
unit: Requests per normalization unit
Stall Cycles:
rst: |-
rst: >-
The total number of cycles the sL1D\u2194 :doc:`L2 <l2-cache>` interface
was stalled, per :ref:`normalization unit <normalization-units>`.
unit: Cycles per normalization unit
@@ -1208,7 +1207,7 @@ Scalar L1D Cache - L2 Interface:
CDNA accelerators.
unit: Requests per normalization unit
sL1D-L2 BW:
rst: |-
rst: >-
The total number of bytes read from, written to, or atomically updated
across the sL1D\u2194:doc:`L2 <l2-cache>` interface, divided by total duration.
Note that sL1D writes and atomics are typically
@@ -1227,7 +1226,7 @@ L1I Speed-of-Light:
over the number of all L1I requests.
unit: Percent
L1I-L2 Bandwidth Utilization:
rst: |-
rst: >-
The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
achieved. Calculated as the ratio of the total number of requests from
the L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`.
@@ -1286,7 +1285,7 @@ Workgroup manager utilizations:
not fully saturated by the kernel, or a potential load-imbalance issue.
unit: Percent
Scheduler-Pipe Utilization:
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where the scheduler-pipes were actively doing any work. Note: this
value is expected to range between 0% and 25%. See :ref:`desc-spi`.
@@ -1332,7 +1331,7 @@ Workgroup Manager - Resource Allocation:
lack of available :ref:`waveslots <desc-valu>`.
unit: Percent
Not-scheduled Rate (Scheduler-Pipe):
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
due to a bottleneck within the scheduler-pipes rather than a lack of a CU
@@ -1341,7 +1340,7 @@ Workgroup Manager - Resource Allocation:
description.
unit: Percent
Not-scheduled Rate (Workgroup Manager):
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
due to a bottleneck within the workgroup manager rather than a lack of a
@@ -1362,7 +1361,7 @@ Workgroup Manager - Resource Allocation:
or newer accelerators (and small for previous accelerators).
unit: Percent
Scheduler-Pipe Stall Rate:
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
due to occupancy limitations (like a lack of a CU or :ref:`SIMD <desc-valu>`
@@ -1464,7 +1463,7 @@ System Speed-of-Light:
over the total number of incoming cache line requests to the L2 cache.
unit: Percent
L2-Fabric Read BW:
rst: |-
rst: >-
The number of bytes read by the L2 over the :ref:`Infinity Fabric\u2122
interface <l2-fabric>` per unit time. This is also presented as a percent
of the peak theoretical bandwidth achievable on the specific accelerator.
@@ -1490,7 +1489,7 @@ System Speed-of-Light:
Conflict Rate).
unit: Conflicts/Access
MFMA FLOPs (BF16):
rst: |-
rst: >-
The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any 16-bit brain
floating point operations from :ref:`VALU <desc-valu>` instructions. This
@@ -1498,7 +1497,7 @@ System Speed-of-Light:
achievable on the specific accelerator.
unit: GFLOPs
MFMA FLOPs (F16):
rst: |-
rst: >-
The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1506,7 +1505,7 @@ System Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F32):
rst: |-
rst: >-
The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 32-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1514,7 +1513,7 @@ System Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F64):
rst: |-
rst: >-
The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 64-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1522,7 +1521,7 @@ System Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F8):
rst: |-
rst: >-
The total number of 8-bit brain floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any 16-bit brain
floating point operations from :ref:`VALU <desc-valu>` instructions. This
@@ -1531,7 +1530,7 @@ System Speed-of-Light:
series and later only.
unit: GFLOPs
MFMA IOPs (Int8):
rst: |-
rst: >-
The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
per second. Note: this does not include any 8-bit integer operations from
:ref:`VALU <desc-valu>` instructions. This is also presented as a percent
@@ -1562,14 +1561,14 @@ System Speed-of-Light:
time-averaged over all VALU instructions run on all wavefronts in the kernel.
unit: Work-items
VALU FLOPs:
rst: |-
rst: >-
The total floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is also presented as a percent of the peak theoretical
FLOPs achievable on the specific accelerator. Note: this does not include
any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU IOPs:
rst: |-
rst: >-
The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
This is also presented as a percent of the peak theoretical IOPs achievable
on the specific accelerator. Note: this does not include any integer operations
@@ -1590,7 +1589,7 @@ System Speed-of-Light:
issuing VMEM instructions over the :ref:`total CU cycles <total-cu-cycles>`.
unit: Percent
Wavefront Occupancy:
rst: |-
rst: >-
The time-averaged number of wavefronts resident on the accelerator over
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1ms). This is also presented as a percent of the peak theoretical
@@ -200,37 +200,37 @@ Panel Config:
pop: None
coll_level: SQ_IFETCH_LEVEL
metrics_description:
VALU FLOPs: |-
VALU FLOPs: >-
The total floating-point operations executed per second on the VALU.
This is also presented as a percent of the peak theoretical FLOPs achievable
on the specific accelerator. Note: this does not include any floating-point
operations from MFMA instructions.
VALU IOPs: |-
VALU IOPs: >-
The total integer operations executed per second on the VALU. This is
also presented as a percent of the peak theoretical IOPs achievable on the
specific accelerator. Note: this does not include any integer operations from
MFMA instructions.
MFMA FLOPs (BF16): |-
MFMA FLOPs (BF16): >-
The total number of 16-bit brain floating point MFMA operations executed
per second. Note: this does not include any 16-bit brain floating point operations
from VALU instructions. This is also presented as a percent of the peak theoretical
BF16 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F16): |-
MFMA FLOPs (F16): >-
The total number of 16-bit floating point MFMA operations executed per
second. Note: this does not include any 16-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F16 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F32): |-
MFMA FLOPs (F32): >-
The total number of 32-bit floating point MFMA operations executed per
second. Note: this does not include any 32-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F32 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F64): |-
MFMA FLOPs (F64): >-
The total number of 64-bit floating point MFMA operations executed per
second. Note: this does not include any 64-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F64 MFMA operations achievable on the specific accelerator.
MFMA IOPs (Int8): |-
MFMA IOPs (Int8): >-
The total number of 8-bit integer MFMA operations executed per second.
Note: this does not include any 8-bit integer operations from VALU instructions.
This is also presented as a percent of the peak theoretical INT8 MFMA operations
@@ -263,7 +263,7 @@ Panel Config:
IPC: The ratio of the total number of instructions executed on the CU over the
total active CU cycles. This is also presented as a percent of the peak theoretical
bandwidth achievable on the specific accelerator.
Wavefront Occupancy: |-
Wavefront Occupancy: >-
The time-averaged number of wavefronts resident on the accelerator over
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1ms). This is also presented as a percent of the peak theoretical
@@ -296,7 +296,7 @@ Panel Config:
if only a single value is requested in a cache line, the data movement will
still be counted as a full cache line. This is also presented as a percent of
the peak theoretical bandwidth achievable on the specific accelerator.
L2-Fabric Read BW: |-
L2-Fabric Read BW: >-
The number of bytes read by the L2 over the Infinity Fabric\u2122 interface
per unit time. This is also presented as a percent of the peak theoretical
bandwidth achievable on the specific accelerator.
@@ -170,15 +170,15 @@ Panel Config:
Active CUs: Total number of active compute units (CUs) on the accelerator during
the kernel execution.
Num CUs: Total number of compute units (CUs) on the accelerator.
VGPR: |-
VGPR: >-
The number of architected vector general-purpose registers allocated
for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
requested by the compiler due to allocation granularity.
SGPR: |-
SGPR: >-
The number of scalar general-purpose registers allocated for the kernel,
see SALU. Note: this may not exactly match the number of SGPRs requested by
the compiler due to allocation granularity.
LDS Allocation: |-
LDS Allocation: >-
The number of bytes of LDS memory (or, shared memory) allocated for
this kernel. Note: This may also be larger than what was requested at compile
time due to both allocation granularity and dynamic per-dispatch LDS allocations.
@@ -266,7 +266,7 @@ Panel Config:
or data (atomic with return value) was returned to the L2.
HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
of data from the accelerator's local HBM, per normalization unit.
HBM Wr: |-
HBM Wr: >-
The total number of L2 requests to Infinity Fabric to write or atomically
update 32B or 64B of data in the accelerator's local HBM, per normalization
unit.
+13 -13
Просмотреть файл
@@ -134,48 +134,48 @@ Panel Config:
/ 1e9) ) / 1e9
unit: GFLOP/s
metrics_description:
VALU FLOPs (F16): |-
VALU FLOPs (F16): >-
The total 16-bit floating-point operations executed per second on the VALU.
This is presented with the value of the peak empirical F16 FLOPs achievable
on the specific accelerator. Note: this does not include any F16 operations
from MFMA instructions.
VALU FLOPs (F32): |-
VALU FLOPs (F32): >-
The total 32-bit floating-point operations executed per second on the VALU.
This is presented with the value of the peak empirical F32 FLOPs achievable
on the specific accelerator. Note: this does not include any F32 operations
from MFMA instructions.
VALU FLOPs (F64): |-
VALU FLOPs (F64): >-
The total 64-bit floating-point operations executed per second on the VALU.
This is presented with the value of the peak empirical F64 FLOPs achievable
on the specific accelerator. Note: this does not include any F64 operations
from MFMA instructions.
MFMA FLOPs (BF16): |-
MFMA FLOPs (BF16): >-
The total number of 16-bit brain floating point MFMA operations executed
per second. Note: this does not include any 16-bit brain floating point
operations from VALU instructions. The peak empirically measured BF16 MFMA
operations achievable on the specific accelerator is displayed alongside
for comparison.
MFMA FLOPs (F16): |-
MFMA FLOPs (F16): >-
The total number of 16-bit floating point MFMA operations executed per
second. Note: this does not include any 16-bit floating point operations from
VALU instructions. The peak empirically measured F16 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.
MFMA FLOPs (F32): |-
MFMA FLOPs (F32): >-
The total number of 32-bit floating point MFMA operations executed per
second. Note: this does not include any 32-bit floating point operations from
VALU instructions. The peak empirically measured F32 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.
MFMA FLOPs (F64): |-
MFMA FLOPs (F64): >-
The total number of 64-bit floating point MFMA operations executed per
second. Note: this does not include any 64-bit floating point operations from
VALU instructions. The peak empirically measured F64 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.
MFMA IOPs (Int8): |-
MFMA IOPs (Int8): >-
The total number of 8-bit integer MFMA operations executed per second.
Note: this does not include any 8-bit integer operations from VALU instructions.
The peak empirically measured INT8 MFMA operations achievable on the specific
accelerator is displayed alongside for comparison.
HBM Bandwidth: |-
HBM Bandwidth: >-
The total number of bytes read from and written to High-Bandwidth
Memory (HBM) per second. The peak empirically measured bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
@@ -196,22 +196,22 @@ Panel Config:
from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
example for more detail). The peak empirically measured LDS bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
AI L1: |-
AI L1: >-
The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L1 cache and the processing units. This value is used as the x-coordinate
for the L1 roofline.
AI L2: |-
AI L2: >-
The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L2 cache and the L1 cache. This value is used as the x-coordinate for
the L2 roofline.
AI HBM: |-
AI HBM: >-
The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
It is the ratio of total floating-point operations (FLOPs) to total bytes
transferred between HBM and the L2 cache. This value is used as the x-coordinate
for the HBM roofline.
Performance (GFLOPs): |-
Performance (GFLOPs): >-
The overall achieved performance, measured in GigaFLOPs
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
operations divided by the total execution time. This value is used as the y-coordinate
@@ -141,6 +141,6 @@ Panel Config:
the CPC-L2 interface was active doing any work.
CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
translation
CPC-UTCL2 Utilization: |-
CPC-UTCL2 Utilization: >-
Percent of total cycles counted by the CPC's L2 address translation
interface where the CPC was busy doing address translation work.
@@ -168,7 +168,7 @@ Panel Config:
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
within the workgroup manager rather than a lack of a CU or SIMD with sufficient
resources.
Not-scheduled Rate (Scheduler-Pipe): |-
Not-scheduled Rate (Scheduler-Pipe): >-
The percent of total scheduler-pipe cycles in the kernel where a workgroup
could not be scheduled to a CU due to a bottleneck within the scheduler-pipes
rather than a lack of a CU or SIMD with sufficient resources.
+6 -6
Просмотреть файл
@@ -121,26 +121,26 @@ Panel Config:
Workgroup Size: The total number of work-items (or, threads) in each workgroup
(or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
to the total block size.
Total Wavefronts: |-
Total Wavefronts: >-
The total number of wavefronts launched as part of the kernel dispatch.
On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront
size is always 64 work-items. Thus, the total number of wavefronts should
be equivalent to the ceiling of grid size divided by 64.
Saved Wavefronts: The total number of wavefronts saved at a context-save.
Restored Wavefronts: The total number of wavefronts restored from a context-save.
VGPRs: |-
VGPRs: >-
The number of architected vector general-purpose registers allocated
for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
requested by the compiler due to allocation granularity.
AGPRs: |-
AGPRs: >-
The number of accumulation vector general-purpose registers allocated
for the kernel, see AGPRs. Note: this may not exactly match the number of
AGPRs requested by the compiler due to allocation granularity.
SGPRs: |-
SGPRs: >-
The number of scalar general-purpose registers allocated for the kernel,
see SALU. Note: this may not exactly match the number of SGPRs requested by
the compiler due to allocation granularity.
LDS Allocation: |-
LDS Allocation: >-
The number of bytes of LDS memory (or, shared memory) allocated for
this kernel. Note: This may also be larger than what was requested at compile
time due to both allocation granularity and dynamic per-dispatch LDS allocations.
@@ -173,7 +173,7 @@ Panel Config:
rather than identification of a precise limiter. The sum of this metric, Issue
Wait Cycles and Active Wait Cycles should be equal to the total Wave Cycles
metric.
Wavefront Occupancy: |-
Wavefront Occupancy: >-
The time-averaged number of wavefronts resident on the accelerator over
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1ms).
@@ -140,7 +140,7 @@ Panel Config:
unit.
Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
stalls from non-dword aligned addresses per normalization unit.
Mem Violations: |-
Mem Violations: >-
The total number of out-of-bounds accesses made to the LDS, per normalization
unit. This is unused and expected to be zero in most configurations for
modern CDNA\u2122 accelerators.
@@ -92,7 +92,7 @@ Panel Config:
Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
loaded line the cache. Calculated as the ratio of the number of L1I requests
that hit over the number of all L1I requests.
L1I-L2 Bandwidth Utilization: |-
L1I-L2 Bandwidth Utilization: >-
The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
achieved. Calculated as the ratio of the total number of requests from the
L1I to the L2 cache over the total L1I-L2 interface cycles.
@@ -154,7 +154,7 @@ Panel Config:
sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
bandwidth acheived. Calculated as total number of bytes read from, written to,
or atomically updated across the sL1D - L2 interface.
sL1D-L2 BW: |-
sL1D-L2 BW: >-
The total number of bytes read from, written to, or atomically updated
across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D
writes and atomics are typically unused on current CDNA accelerators, so
@@ -164,7 +164,7 @@ Panel Config:
unit.
Hits: The total number of sL1D requests that hit on a previously loaded cache
line, per normalization unit.
Misses - Non Duplicated: |-
Misses - Non Duplicated: >-
The total number of sL1D requests that missed on a cache line that was
not already pending due to another request, per normalization unit.
Misses- Duplicated: The total number of sL1D requests that missed on a cache line
@@ -187,6 +187,6 @@ Panel Config:
unit.
Write Req: The total number of write requests from sL1D to the L2, per normalization
unit. Typically unused on current CDNA accelerators.
Stall Cycles: |-
Stall Cycles: >-
The total number of cycles the sL1D\u2194L2 interface was stalled, per
normalization unit.
@@ -436,7 +436,7 @@ Panel Config:
per normalization unit.
Translation Misses: The total number of translation requests that missed in the
UTCL1 due to translation not being present in the cache, per normalization unit.
Permission Misses: |-
Permission Misses: >-
The total number of translation requests that missed in the UTCL1 due
to a permission error, per normalization unit. This is unused and expected
to be zero in most configurations for modern CDNA\u2122 accelerators.
@@ -218,37 +218,37 @@ Panel Config:
pop: None
coll_level: SQ_IFETCH_LEVEL
metrics_description:
VALU FLOPs: |-
VALU FLOPs: >-
The total floating-point operations executed per second on the VALU.
This is also presented as a percent of the peak theoretical FLOPs achievable
on the specific accelerator. Note: this does not include any floating-point
operations from MFMA instructions.
VALU IOPs: |-
VALU IOPs: >-
The total integer operations executed per second on the VALU. This is
also presented as a percent of the peak theoretical IOPs achievable on the
specific accelerator. Note: this does not include any integer operations from
MFMA instructions.
MFMA FLOPs (BF16): |-
MFMA FLOPs (BF16): >-
The total number of 16-bit brain floating point MFMA operations executed
per second. Note: this does not include any 16-bit brain floating point operations
from VALU instructions. This is also presented as a percent of the peak theoretical
BF16 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F16): |-
MFMA FLOPs (F16): >-
The total number of 16-bit floating point MFMA operations executed per
second. Note: this does not include any 16-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F16 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F32): |-
MFMA FLOPs (F32): >-
The total number of 32-bit floating point MFMA operations executed per
second. Note: this does not include any 32-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F32 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F64): |-
MFMA FLOPs (F64): >-
The total number of 64-bit floating point MFMA operations executed per
second. Note: this does not include any 64-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F64 MFMA operations achievable on the specific accelerator.
MFMA IOPs (Int8): |-
MFMA IOPs (Int8): >-
The total number of 8-bit integer MFMA operations executed per second.
Note: this does not include any 8-bit integer operations from VALU instructions.
This is also presented as a percent of the peak theoretical INT8 MFMA operations
@@ -281,7 +281,7 @@ Panel Config:
IPC: The ratio of the total number of instructions executed on the CU over the
total active CU cycles. This is also presented as a percent of the peak theoretical
bandwidth achievable on the specific accelerator.
Wavefront Occupancy: |-
Wavefront Occupancy: >-
The time-averaged number of wavefronts resident on the accelerator over
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1ms). This is also presented as a percent of the peak theoretical
@@ -314,7 +314,7 @@ Panel Config:
if only a single value is requested in a cache line, the data movement will
still be counted as a full cache line. This is also presented as a percent of
the peak theoretical bandwidth achievable on the specific accelerator.
L2-Fabric Read BW: |-
L2-Fabric Read BW: >-
The number of bytes read by the L2 over the Infinity Fabric\u2122 interface
per unit time. This is also presented as a percent of the peak theoretical
bandwidth achievable on the specific accelerator.
@@ -170,15 +170,15 @@ Panel Config:
Active CUs: Total number of active compute units (CUs) on the accelerator during
the kernel execution.
Num CUs: Total number of compute units (CUs) on the accelerator.
VGPR: |-
VGPR: >-
The number of architected vector general-purpose registers allocated
for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
requested by the compiler due to allocation granularity.
SGPR: |-
SGPR: >-
The number of scalar general-purpose registers allocated for the kernel,
see SALU. Note: this may not exactly match the number of SGPRs requested by
the compiler due to allocation granularity.
LDS Allocation: |-
LDS Allocation: >-
The number of bytes of LDS memory (or, shared memory) allocated for
this kernel. Note: This may also be larger than what was requested at compile
time due to both allocation granularity and dynamic per-dispatch LDS allocations.
@@ -266,7 +266,7 @@ Panel Config:
or data (atomic with return value) was returned to the L2.
HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
of data from the accelerator's local HBM, per normalization unit.
HBM Wr: |-
HBM Wr: >-
The total number of L2 requests to Infinity Fabric to write or atomically
update 32B or 64B of data in the accelerator's local HBM, per normalization
unit.
+13 -13
Просмотреть файл
@@ -132,48 +132,48 @@ Panel Config:
/ 1e9) ) / 1e9
unit: GFLOP/s
metrics_description:
VALU FLOPs (F16): |-
VALU FLOPs (F16): >-
The total 16-bit floating-point operations executed per second on the VALU.
This is presented with the value of the peak empirical F16 FLOPs achievable
on the specific accelerator. Note: this does not include any F16 operations
from MFMA instructions.
VALU FLOPs (F32): |-
VALU FLOPs (F32): >-
The total 32-bit floating-point operations executed per second on the VALU.
This is presented with the value of the peak empirical F32 FLOPs achievable
on the specific accelerator. Note: this does not include any F32 operations
from MFMA instructions.
VALU FLOPs (F64): |-
VALU FLOPs (F64): >-
The total 64-bit floating-point operations executed per second on the VALU.
This is presented with the value of the peak empirical F64 FLOPs achievable
on the specific accelerator. Note: this does not include any F64 operations
from MFMA instructions.
MFMA FLOPs (BF16): |-
MFMA FLOPs (BF16): >-
The total number of 16-bit brain floating point MFMA operations executed
per second. Note: this does not include any 16-bit brain floating point
operations from VALU instructions. The peak empirically measured BF16 MFMA
operations achievable on the specific accelerator is displayed alongside
for comparison.
MFMA FLOPs (F16): |-
MFMA FLOPs (F16): >-
The total number of 16-bit floating point MFMA operations executed per
second. Note: this does not include any 16-bit floating point operations from
VALU instructions. The peak empirically measured F16 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.
MFMA FLOPs (F32): |-
MFMA FLOPs (F32): >-
The total number of 32-bit floating point MFMA operations executed per
second. Note: this does not include any 32-bit floating point operations from
VALU instructions. The peak empirically measured F32 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.
MFMA FLOPs (F64): |-
MFMA FLOPs (F64): >-
The total number of 64-bit floating point MFMA operations executed per
second. Note: this does not include any 64-bit floating point operations from
VALU instructions. The peak empirically measured F64 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.
MFMA IOPs (Int8): |-
MFMA IOPs (Int8): >-
The total number of 8-bit integer MFMA operations executed per second.
Note: this does not include any 8-bit integer operations from VALU instructions.
The peak empirically measured INT8 MFMA operations achievable on the specific
accelerator is displayed alongside for comparison.
HBM Bandwidth: |-
HBM Bandwidth: >-
The total number of bytes read from and written to High-Bandwidth
Memory (HBM) per second. The peak empirically measured bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
@@ -194,22 +194,22 @@ Panel Config:
from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
example for more detail). The peak empirically measured LDS bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
AI L1: |-
AI L1: >-
The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L1 cache and the processing units. This value is used as the x-coordinate
for the L1 roofline.
AI L2: |-
AI L2: >-
The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L2 cache and the L1 cache. This value is used as the x-coordinate for
the L2 roofline.
AI HBM: |-
AI HBM: >-
The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
It is the ratio of total floating-point operations (FLOPs) to total bytes
transferred between HBM and the L2 cache. This value is used as the x-coordinate
for the HBM roofline.
Performance (GFLOPs): |-
Performance (GFLOPs): >-
The overall achieved performance, measured in GigaFLOPs
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
operations divided by the total execution time. This value is used as the y-coordinate
@@ -141,6 +141,6 @@ Panel Config:
the CPC-L2 interface was active doing any work.
CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
translation
CPC-UTCL2 Utilization: |-
CPC-UTCL2 Utilization: >-
Percent of total cycles counted by the CPC's L2 address translation
interface where the CPC was busy doing address translation work.
@@ -168,7 +168,7 @@ Panel Config:
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
within the workgroup manager rather than a lack of a CU or SIMD with sufficient
resources.
Not-scheduled Rate (Scheduler-Pipe): |-
Not-scheduled Rate (Scheduler-Pipe): >-
The percent of total scheduler-pipe cycles in the kernel where a workgroup
could not be scheduled to a CU due to a bottleneck within the scheduler-pipes
rather than a lack of a CU or SIMD with sufficient resources.
+6 -6
Просмотреть файл
@@ -121,26 +121,26 @@ Panel Config:
Workgroup Size: The total number of work-items (or, threads) in each workgroup
(or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
to the total block size.
Total Wavefronts: |-
Total Wavefronts: >-
The total number of wavefronts launched as part of the kernel dispatch.
On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront
size is always 64 work-items. Thus, the total number of wavefronts should
be equivalent to the ceiling of grid size divided by 64.
Saved Wavefronts: The total number of wavefronts saved at a context-save.
Restored Wavefronts: The total number of wavefronts restored from a context-save.
VGPRs: |-
VGPRs: >-
The number of architected vector general-purpose registers allocated
for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
requested by the compiler due to allocation granularity.
AGPRs: |-
AGPRs: >-
The number of accumulation vector general-purpose registers allocated
for the kernel, see AGPRs. Note: this may not exactly match the number of
AGPRs requested by the compiler due to allocation granularity.
SGPRs: |-
SGPRs: >-
The number of scalar general-purpose registers allocated for the kernel,
see SALU. Note: this may not exactly match the number of SGPRs requested by
the compiler due to allocation granularity.
LDS Allocation: |-
LDS Allocation: >-
The number of bytes of LDS memory (or, shared memory) allocated for
this kernel. Note: This may also be larger than what was requested at compile
time due to both allocation granularity and dynamic per-dispatch LDS allocations.
@@ -173,7 +173,7 @@ Panel Config:
rather than identification of a precise limiter. The sum of this metric, Issue
Wait Cycles and Active Wait Cycles should be equal to the total Wave Cycles
metric.
Wavefront Occupancy: |-
Wavefront Occupancy: >-
The time-averaged number of wavefronts resident on the accelerator over
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1ms).
@@ -268,7 +268,7 @@ Panel Config:
floating-point operands issued to the VALU per normalization unit.
F64-Trans: The total number of transcendental instructions (such as sqrt) operating
on 64-bit floating-point operands issued to the VALU per normalization unit.
Conversion: |-
Conversion: >-
The total number of type conversion instructions (such as converting
data to or from F32\u2194F64) issued to the VALU per normalization unit.
Global/Generic Instr: The total number of global & generic memory instructions
@@ -237,37 +237,37 @@ Panel Config:
max: MAX(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
unit: (OPs + $normUnit)
metrics_description:
VALU FLOPs: |-
VALU FLOPs: >-
The total floating-point operations executed per second on the VALU.
This is also presented as a percent of the peak theoretical FLOPs achievable
on the specific accelerator. Note: this does not include any floating-point
operations from MFMA instructions.
VALU IOPs: |-
VALU IOPs: >-
The total integer operations executed per second on the VALU. This is
also presented as a percent of the peak theoretical IOPs achievable on the
specific accelerator. Note: this does not include any integer operations from
MFMA instructions.
MFMA FLOPs (BF16): |-
MFMA FLOPs (BF16): >-
The total number of 16-bit brain floating point MFMA operations executed
per second. Note: this does not include any 16-bit brain floating point operations
from VALU instructions. This is also presented as a percent of the peak theoretical
BF16 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F16): |-
MFMA FLOPs (F16): >-
The total number of 16-bit floating point MFMA operations executed per
second. Note: this does not include any 16-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F16 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F32): |-
MFMA FLOPs (F32): >-
The total number of 32-bit floating point MFMA operations executed per
second. Note: this does not include any 32-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F32 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F64): |-
MFMA FLOPs (F64): >-
The total number of 64-bit floating point MFMA operations executed per
second. Note: this does not include any 64-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F64 MFMA operations achievable on the specific accelerator.
MFMA IOPs (INT8): |-
MFMA IOPs (INT8): >-
The total number of 8-bit integer MFMA operations executed per second.
Note: this does not include any 8-bit integer operations from VALU instructions.
This is also presented as a percent of the peak theoretical INT8 MFMA operations
@@ -140,7 +140,7 @@ Panel Config:
unit.
Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
stalls from non-dword aligned addresses per normalization unit.
Mem Violations: |-
Mem Violations: >-
The total number of out-of-bounds accesses made to the LDS, per normalization
unit. This is unused and expected to be zero in most configurations for
modern CDNA\u2122 accelerators.
@@ -92,7 +92,7 @@ Panel Config:
Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
loaded line the cache. Calculated as the ratio of the number of L1I requests
that hit over the number of all L1I requests.
L1I-L2 Bandwidth Utilization: |-
L1I-L2 Bandwidth Utilization: >-
The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
achieved. Calculated as the ratio of the total number of requests from the
L1I to the L2 cache over the total L1I-L2 interface cycles.
@@ -154,7 +154,7 @@ Panel Config:
sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
bandwidth acheived. Calculated as total number of bytes read from, written to,
or atomically updated across the sL1D - L2 interface.
sL1D-L2 BW: |-
sL1D-L2 BW: >-
The total number of bytes read from, written to, or atomically updated
across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D
writes and atomics are typically unused on current CDNA accelerators, so
@@ -164,7 +164,7 @@ Panel Config:
unit.
Hits: The total number of sL1D requests that hit on a previously loaded cache
line, per normalization unit.
Misses - Non Duplicated: |-
Misses - Non Duplicated: >-
The total number of sL1D requests that missed on a cache line that was
not already pending due to another request, per normalization unit.
Misses- Duplicated: The total number of sL1D requests that missed on a cache line
@@ -187,6 +187,6 @@ Panel Config:
unit.
Write Req: The total number of write requests from sL1D to the L2, per normalization
unit. Typically unused on current CDNA accelerators.
Stall Cycles: |-
Stall Cycles: >-
The total number of cycles the sL1D\u2194L2 interface was stalled, per
normalization unit.
@@ -436,7 +436,7 @@ Panel Config:
per normalization unit.
Translation Misses: The total number of translation requests that missed in the
UTCL1 due to translation not being present in the cache, per normalization unit.
Permission Misses: |-
Permission Misses: >-
The total number of translation requests that missed in the UTCL1 due
to a permission error, per normalization unit. This is unused and expected
to be zero in most configurations for modern CDNA\u2122 accelerators.
@@ -227,12 +227,12 @@ Panel Config:
pop: None
coll_level: SQ_IFETCH_LEVEL
metrics_description:
VALU FLOPs: |-
VALU FLOPs: >-
The total floating-point operations executed per second on the VALU.
This is also presented as a percent of the peak theoretical FLOPs achievable
on the specific accelerator. Note: this does not include any floating-point
operations from MFMA instructions.
VALU IOPs: |-
VALU IOPs: >-
The total integer operations executed per second on the VALU. This is
also presented as a percent of the peak theoretical IOPs achievable on the
specific accelerator. Note: this does not include any integer operations from
@@ -242,27 +242,27 @@ Panel Config:
from VALU instructions. This is also presented as a percent of the peak theoretical
F8 MFMA operations achievable on the specific accelerator. It is supported on
AMD Instinct MI300 series and later only.
MFMA FLOPs (BF16): |-
MFMA FLOPs (BF16): >-
The total number of 16-bit brain floating point MFMA operations executed
per second. Note: this does not include any 16-bit brain floating point operations
from VALU instructions. This is also presented as a percent of the peak theoretical
BF16 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F16): |-
MFMA FLOPs (F16): >-
The total number of 16-bit floating point MFMA operations executed per
second. Note: this does not include any 16-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F16 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F32): |-
MFMA FLOPs (F32): >-
The total number of 32-bit floating point MFMA operations executed per
second. Note: this does not include any 32-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F32 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F64): |-
MFMA FLOPs (F64): >-
The total number of 64-bit floating point MFMA operations executed per
second. Note: this does not include any 64-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F64 MFMA operations achievable on the specific accelerator.
MFMA IOPs (Int8): |-
MFMA IOPs (Int8): >-
The total number of 8-bit integer MFMA operations executed per second.
Note: this does not include any 8-bit integer operations from VALU instructions.
This is also presented as a percent of the peak theoretical INT8 MFMA operations
@@ -295,7 +295,7 @@ Panel Config:
IPC: The ratio of the total number of instructions executed on the CU over the
total active CU cycles. This is also presented as a percent of the peak theoretical
bandwidth achievable on the specific accelerator.
Wavefront Occupancy: |-
Wavefront Occupancy: >-
The time-averaged number of wavefronts resident on the accelerator over
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1ms). This is also presented as a percent of the peak theoretical
@@ -328,7 +328,7 @@ Panel Config:
if only a single value is requested in a cache line, the data movement will
still be counted as a full cache line. This is also presented as a percent of
the peak theoretical bandwidth achievable on the specific accelerator.
L2-Fabric Read BW: |-
L2-Fabric Read BW: >-
The number of bytes read by the L2 over the Infinity Fabric\u2122 interface
per unit time. This is also presented as a percent of the peak theoretical
bandwidth achievable on the specific accelerator.
@@ -162,15 +162,15 @@ Panel Config:
Active CUs: Total number of active compute units (CUs) on the accelerator during
the kernel execution.
Num CUs: Total number of compute units (CUs) on the accelerator.
VGPR: |-
VGPR: >-
The number of architected vector general-purpose registers allocated
for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
requested by the compiler due to allocation granularity.
SGPR: |-
SGPR: >-
The number of scalar general-purpose registers allocated for the kernel,
see SALU. Note: this may not exactly match the number of SGPRs requested by
the compiler due to allocation granularity.
LDS Allocation: |-
LDS Allocation: >-
The number of bytes of LDS memory (or, shared memory) allocated for
this kernel. Note: This may also be larger than what was requested at compile
time due to both allocation granularity and dynamic per-dispatch LDS allocations.
@@ -252,7 +252,7 @@ Panel Config:
or data (atomic with return value) was returned to the L2.
HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
of data from the accelerator's local HBM, per normalization unit.
HBM Wr: |-
HBM Wr: >-
The total number of L2 requests to Infinity Fabric to write or atomically
update 32B or 64B of data in the accelerator's local HBM, per normalization
unit.
+13 -13
Просмотреть файл
@@ -140,17 +140,17 @@ Panel Config:
* 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9
unit: GFLOP/s
metrics_description:
VALU FLOPs (F16): |-
VALU FLOPs (F16): >-
The total 16-bit floating-point operations executed per second on the VALU.
This is presented with the value of the peak empirical F16 FLOPs achievable
on the specific accelerator. Note: this does not include any F16 operations
from MFMA instructions.
VALU FLOPs (F32): |-
VALU FLOPs (F32): >-
The total 32-bit floating-point operations executed per second on the VALU.
This is presented with the value of the peak empirical F32 FLOPs achievable
on the specific accelerator. Note: this does not include any F32 operations
from MFMA instructions.
VALU FLOPs (F64): |-
VALU FLOPs (F64): >-
The total 64-bit floating-point operations executed per second on the VALU.
This is presented with the value of the peak empirical F64 FLOPs achievable
on the specific accelerator. Note: this does not include any F64 operations
@@ -160,33 +160,33 @@ Panel Config:
from VALU instructions. The peak empirically measured F8 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison. It is supported
on AMD Instinct MI300 series and later only.
MFMA FLOPs (BF16): |-
MFMA FLOPs (BF16): >-
The total number of 16-bit brain floating point MFMA operations executed
per second. Note: this does not include any 16-bit brain floating point
operations from VALU instructions. The peak empirically measured BF16 MFMA
operations achievable on the specific accelerator is displayed alongside
for comparison.
MFMA FLOPs (F16): |-
MFMA FLOPs (F16): >-
The total number of 16-bit floating point MFMA operations executed per
second. Note: this does not include any 16-bit floating point operations from
VALU instructions. The peak empirically measured F16 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.
MFMA FLOPs (F32): |-
MFMA FLOPs (F32): >-
The total number of 32-bit floating point MFMA operations executed per
second. Note: this does not include any 32-bit floating point operations from
VALU instructions. The peak empirically measured F32 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.
MFMA FLOPs (F64): |-
MFMA FLOPs (F64): >-
The total number of 64-bit floating point MFMA operations executed per
second. Note: this does not include any 64-bit floating point operations from
VALU instructions. The peak empirically measured F64 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.
MFMA IOPs (Int8): |-
MFMA IOPs (Int8): >-
The total number of 8-bit integer MFMA operations executed per second.
Note: this does not include any 8-bit integer operations from VALU instructions.
The peak empirically measured INT8 MFMA operations achievable on the specific
accelerator is displayed alongside for comparison.
HBM Bandwidth: |-
HBM Bandwidth: >-
The total number of bytes read from and written to High-Bandwidth
Memory (HBM) per second. The peak empirically measured bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
@@ -207,22 +207,22 @@ Panel Config:
from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
example for more detail). The peak empirically measured LDS bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
AI L1: |-
AI L1: >-
The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L1 cache and the processing units. This value is used as the x-coordinate
for the L1 roofline.
AI L2: |-
AI L2: >-
The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L2 cache and the L1 cache. This value is used as the x-coordinate for
the L2 roofline.
AI HBM: |-
AI HBM: >-
The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
It is the ratio of total floating-point operations (FLOPs) to total bytes
transferred between HBM and the L2 cache. This value is used as the x-coordinate
for the HBM roofline.
Performance (GFLOPs): |-
Performance (GFLOPs): >-
The overall achieved performance, measured in GigaFLOPs
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
operations divided by the total execution time. This value is used as the y-coordinate
@@ -141,6 +141,6 @@ Panel Config:
the CPC-L2 interface was active doing any work.
CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
translation
CPC-UTCL2 Utilization: |-
CPC-UTCL2 Utilization: >-
Percent of total cycles counted by the CPC's L2 address translation
interface where the CPC was busy doing address translation work.
@@ -168,7 +168,7 @@ Panel Config:
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
within the workgroup manager rather than a lack of a CU or SIMD with sufficient
resources.
Not-scheduled Rate (Scheduler-Pipe): |-
Not-scheduled Rate (Scheduler-Pipe): >-
The percent of total scheduler-pipe cycles in the kernel where a workgroup
could not be scheduled to a CU due to a bottleneck within the scheduler-pipes
rather than a lack of a CU or SIMD with sufficient resources.
+6 -6
Просмотреть файл
@@ -121,26 +121,26 @@ Panel Config:
Workgroup Size: The total number of work-items (or, threads) in each workgroup
(or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
to the total block size.
Total Wavefronts: |-
Total Wavefronts: >-
The total number of wavefronts launched as part of the kernel dispatch.
On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront
size is always 64 work-items. Thus, the total number of wavefronts should
be equivalent to the ceiling of grid size divided by 64.
Saved Wavefronts: The total number of wavefronts saved at a context-save.
Restored Wavefronts: The total number of wavefronts restored from a context-save.
VGPRs: |-
VGPRs: >-
The number of architected vector general-purpose registers allocated
for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
requested by the compiler due to allocation granularity.
AGPRs: |-
AGPRs: >-
The number of accumulation vector general-purpose registers allocated
for the kernel, see AGPRs. Note: this may not exactly match the number of
AGPRs requested by the compiler due to allocation granularity.
SGPRs: |-
SGPRs: >-
The number of scalar general-purpose registers allocated for the kernel,
see SALU. Note: this may not exactly match the number of SGPRs requested by
the compiler due to allocation granularity.
LDS Allocation: |-
LDS Allocation: >-
The number of bytes of LDS memory (or, shared memory) allocated for
this kernel. Note: This may also be larger than what was requested at compile
time due to both allocation granularity and dynamic per-dispatch LDS allocations.
@@ -173,7 +173,7 @@ Panel Config:
rather than identification of a precise limiter. The sum of this metric, Issue
Wait Cycles and Active Wait Cycles should be equal to the total Wave Cycles
metric.
Wavefront Occupancy: |-
Wavefront Occupancy: >-
The time-averaged number of wavefronts resident on the accelerator over
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1ms).
@@ -273,7 +273,7 @@ Panel Config:
floating-point operands issued to the VALU per normalization unit.
F64-Trans: The total number of transcendental instructions (such as sqrt) operating
on 64-bit floating-point operands issued to the VALU per normalization unit.
Conversion: |-
Conversion: >-
The total number of type conversion instructions (such as converting
data to or from F32\u2194F64) issued to the VALU per normalization unit.
Global/Generic Instr: The total number of global & generic memory instructions
@@ -251,37 +251,37 @@ Panel Config:
max: MAX(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
unit: (OPs + $normUnit)
metrics_description:
VALU FLOPs: |-
VALU FLOPs: >-
The total floating-point operations executed per second on the VALU.
This is also presented as a percent of the peak theoretical FLOPs achievable
on the specific accelerator. Note: this does not include any floating-point
operations from MFMA instructions.
VALU IOPs: |-
VALU IOPs: >-
The total integer operations executed per second on the VALU. This is
also presented as a percent of the peak theoretical IOPs achievable on the
specific accelerator. Note: this does not include any integer operations from
MFMA instructions.
MFMA FLOPs (BF16): |-
MFMA FLOPs (BF16): >-
The total number of 16-bit brain floating point MFMA operations executed
per second. Note: this does not include any 16-bit brain floating point operations
from VALU instructions. This is also presented as a percent of the peak theoretical
BF16 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F16): |-
MFMA FLOPs (F16): >-
The total number of 16-bit floating point MFMA operations executed per
second. Note: this does not include any 16-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F16 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F32): |-
MFMA FLOPs (F32): >-
The total number of 32-bit floating point MFMA operations executed per
second. Note: this does not include any 32-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F32 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F64): |-
MFMA FLOPs (F64): >-
The total number of 64-bit floating point MFMA operations executed per
second. Note: this does not include any 64-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F64 MFMA operations achievable on the specific accelerator.
MFMA IOPs (INT8): |-
MFMA IOPs (INT8): >-
The total number of 8-bit integer MFMA operations executed per second.
Note: this does not include any 8-bit integer operations from VALU instructions.
This is also presented as a percent of the peak theoretical INT8 MFMA operations
@@ -140,7 +140,7 @@ Panel Config:
unit.
Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
stalls from non-dword aligned addresses per normalization unit.
Mem Violations: |-
Mem Violations: >-
The total number of out-of-bounds accesses made to the LDS, per normalization
unit. This is unused and expected to be zero in most configurations for
modern CDNA\u2122 accelerators.
@@ -92,7 +92,7 @@ Panel Config:
Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
loaded line the cache. Calculated as the ratio of the number of L1I requests
that hit over the number of all L1I requests.
L1I-L2 Bandwidth Utilization: |-
L1I-L2 Bandwidth Utilization: >-
The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
achieved. Calculated as the ratio of the total number of requests from the
L1I to the L2 cache over the total L1I-L2 interface cycles.
@@ -154,7 +154,7 @@ Panel Config:
sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
bandwidth acheived. Calculated as total number of bytes read from, written to,
or atomically updated across the sL1D - L2 interface.
sL1D-L2 BW: |-
sL1D-L2 BW: >-
The total number of bytes read from, written to, or atomically updated
across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D
writes and atomics are typically unused on current CDNA accelerators, so
@@ -164,7 +164,7 @@ Panel Config:
unit.
Hits: The total number of sL1D requests that hit on a previously loaded cache
line, per normalization unit.
Misses - Non Duplicated: |-
Misses - Non Duplicated: >-
The total number of sL1D requests that missed on a cache line that was
not already pending due to another request, per normalization unit.
Misses- Duplicated: The total number of sL1D requests that missed on a cache line
@@ -187,6 +187,6 @@ Panel Config:
unit.
Write Req: The total number of write requests from sL1D to the L2, per normalization
unit. Typically unused on current CDNA accelerators.
Stall Cycles: |-
Stall Cycles: >-
The total number of cycles the sL1D\u2194L2 interface was stalled, per
normalization unit.
@@ -398,7 +398,7 @@ Panel Config:
per normalization unit.
Translation Misses: The total number of translation requests that missed in the
UTCL1 due to translation not being present in the cache, per normalization unit.
Permission Misses: |-
Permission Misses: >-
The total number of translation requests that missed in the UTCL1 due
to a permission error, per normalization unit. This is unused and expected
to be zero in most configurations for modern CDNA\u2122 accelerators.
@@ -227,12 +227,12 @@ Panel Config:
pop: None
coll_level: SQ_IFETCH_LEVEL
metrics_description:
VALU FLOPs: |-
VALU FLOPs: >-
The total floating-point operations executed per second on the VALU.
This is also presented as a percent of the peak theoretical FLOPs achievable
on the specific accelerator. Note: this does not include any floating-point
operations from MFMA instructions.
VALU IOPs: |-
VALU IOPs: >-
The total integer operations executed per second on the VALU. This is
also presented as a percent of the peak theoretical IOPs achievable on the
specific accelerator. Note: this does not include any integer operations from
@@ -242,27 +242,27 @@ Panel Config:
from VALU instructions. This is also presented as a percent of the peak theoretical
F8 MFMA operations achievable on the specific accelerator. It is supported on
AMD Instinct MI300 series and later only.
MFMA FLOPs (BF16): |-
MFMA FLOPs (BF16): >-
The total number of 16-bit brain floating point MFMA operations executed
per second. Note: this does not include any 16-bit brain floating point operations
from VALU instructions. This is also presented as a percent of the peak theoretical
BF16 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F16): |-
MFMA FLOPs (F16): >-
The total number of 16-bit floating point MFMA operations executed per
second. Note: this does not include any 16-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F16 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F32): |-
MFMA FLOPs (F32): >-
The total number of 32-bit floating point MFMA operations executed per
second. Note: this does not include any 32-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F32 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F64): |-
MFMA FLOPs (F64): >-
The total number of 64-bit floating point MFMA operations executed per
second. Note: this does not include any 64-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F64 MFMA operations achievable on the specific accelerator.
MFMA IOPs (Int8): |-
MFMA IOPs (Int8): >-
The total number of 8-bit integer MFMA operations executed per second.
Note: this does not include any 8-bit integer operations from VALU instructions.
This is also presented as a percent of the peak theoretical INT8 MFMA operations
@@ -295,7 +295,7 @@ Panel Config:
IPC: The ratio of the total number of instructions executed on the CU over the
total active CU cycles. This is also presented as a percent of the peak theoretical
bandwidth achievable on the specific accelerator.
Wavefront Occupancy: |-
Wavefront Occupancy: >-
The time-averaged number of wavefronts resident on the accelerator over
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1ms). This is also presented as a percent of the peak theoretical
@@ -328,7 +328,7 @@ Panel Config:
if only a single value is requested in a cache line, the data movement will
still be counted as a full cache line. This is also presented as a percent of
the peak theoretical bandwidth achievable on the specific accelerator.
L2-Fabric Read BW: |-
L2-Fabric Read BW: >-
The number of bytes read by the L2 over the Infinity Fabric\u2122 interface
per unit time. This is also presented as a percent of the peak theoretical
bandwidth achievable on the specific accelerator.
@@ -162,15 +162,15 @@ Panel Config:
Active CUs: Total number of active compute units (CUs) on the accelerator during
the kernel execution.
Num CUs: Total number of compute units (CUs) on the accelerator.
VGPR: |-
VGPR: >-
The number of architected vector general-purpose registers allocated
for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
requested by the compiler due to allocation granularity.
SGPR: |-
SGPR: >-
The number of scalar general-purpose registers allocated for the kernel,
see SALU. Note: this may not exactly match the number of SGPRs requested by
the compiler due to allocation granularity.
LDS Allocation: |-
LDS Allocation: >-
The number of bytes of LDS memory (or, shared memory) allocated for
this kernel. Note: This may also be larger than what was requested at compile
time due to both allocation granularity and dynamic per-dispatch LDS allocations.
@@ -252,7 +252,7 @@ Panel Config:
or data (atomic with return value) was returned to the L2.
HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
of data from the accelerator's local HBM, per normalization unit.
HBM Wr: |-
HBM Wr: >-
The total number of L2 requests to Infinity Fabric to write or atomically
update 32B or 64B of data in the accelerator's local HBM, per normalization
unit.
+13 -13
Просмотреть файл
@@ -140,17 +140,17 @@ Panel Config:
* 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9
unit: GFLOP/s
metrics_description:
VALU FLOPs (F16): |-
VALU FLOPs (F16): >-
The total 16-bit floating-point operations executed per second on the VALU.
This is presented with the value of the peak empirical F16 FLOPs achievable
on the specific accelerator. Note: this does not include any F16 operations
from MFMA instructions.
VALU FLOPs (F32): |-
VALU FLOPs (F32): >-
The total 32-bit floating-point operations executed per second on the VALU.
This is presented with the value of the peak empirical F32 FLOPs achievable
on the specific accelerator. Note: this does not include any F32 operations
from MFMA instructions.
VALU FLOPs (F64): |-
VALU FLOPs (F64): >-
The total 64-bit floating-point operations executed per second on the VALU.
This is presented with the value of the peak empirical F64 FLOPs achievable
on the specific accelerator. Note: this does not include any F64 operations
@@ -160,33 +160,33 @@ Panel Config:
from VALU instructions. The peak empirically measured F8 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison. It is supported
on AMD Instinct MI300 series and later only.
MFMA FLOPs (BF16): |-
MFMA FLOPs (BF16): >-
The total number of 16-bit brain floating point MFMA operations executed
per second. Note: this does not include any 16-bit brain floating point
operations from VALU instructions. The peak empirically measured BF16 MFMA
operations achievable on the specific accelerator is displayed alongside
for comparison.
MFMA FLOPs (F16): |-
MFMA FLOPs (F16): >-
The total number of 16-bit floating point MFMA operations executed per
second. Note: this does not include any 16-bit floating point operations from
VALU instructions. The peak empirically measured F16 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.
MFMA FLOPs (F32): |-
MFMA FLOPs (F32): >-
The total number of 32-bit floating point MFMA operations executed per
second. Note: this does not include any 32-bit floating point operations from
VALU instructions. The peak empirically measured F32 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.
MFMA FLOPs (F64): |-
MFMA FLOPs (F64): >-
The total number of 64-bit floating point MFMA operations executed per
second. Note: this does not include any 64-bit floating point operations from
VALU instructions. The peak empirically measured F64 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.
MFMA IOPs (Int8): |-
MFMA IOPs (Int8): >-
The total number of 8-bit integer MFMA operations executed per second.
Note: this does not include any 8-bit integer operations from VALU instructions.
The peak empirically measured INT8 MFMA operations achievable on the specific
accelerator is displayed alongside for comparison.
HBM Bandwidth: |-
HBM Bandwidth: >-
The total number of bytes read from and written to High-Bandwidth
Memory (HBM) per second. The peak empirically measured bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
@@ -207,22 +207,22 @@ Panel Config:
from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
example for more detail). The peak empirically measured LDS bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
AI L1: |-
AI L1: >-
The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L1 cache and the processing units. This value is used as the x-coordinate
for the L1 roofline.
AI L2: |-
AI L2: >-
The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L2 cache and the L1 cache. This value is used as the x-coordinate for
the L2 roofline.
AI HBM: |-
AI HBM: >-
The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
It is the ratio of total floating-point operations (FLOPs) to total bytes
transferred between HBM and the L2 cache. This value is used as the x-coordinate
for the HBM roofline.
Performance (GFLOPs): |-
Performance (GFLOPs): >-
The overall achieved performance, measured in GigaFLOPs
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
operations divided by the total execution time. This value is used as the y-coordinate
@@ -141,6 +141,6 @@ Panel Config:
the CPC-L2 interface was active doing any work.
CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
translation
CPC-UTCL2 Utilization: |-
CPC-UTCL2 Utilization: >-
Percent of total cycles counted by the CPC's L2 address translation
interface where the CPC was busy doing address translation work.
@@ -168,7 +168,7 @@ Panel Config:
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
within the workgroup manager rather than a lack of a CU or SIMD with sufficient
resources.
Not-scheduled Rate (Scheduler-Pipe): |-
Not-scheduled Rate (Scheduler-Pipe): >-
The percent of total scheduler-pipe cycles in the kernel where a workgroup
could not be scheduled to a CU due to a bottleneck within the scheduler-pipes
rather than a lack of a CU or SIMD with sufficient resources.
+6 -6
Просмотреть файл
@@ -121,26 +121,26 @@ Panel Config:
Workgroup Size: The total number of work-items (or, threads) in each workgroup
(or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
to the total block size.
Total Wavefronts: |-
Total Wavefronts: >-
The total number of wavefronts launched as part of the kernel dispatch.
On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront
size is always 64 work-items. Thus, the total number of wavefronts should
be equivalent to the ceiling of grid size divided by 64.
Saved Wavefronts: The total number of wavefronts saved at a context-save.
Restored Wavefronts: The total number of wavefronts restored from a context-save.
VGPRs: |-
VGPRs: >-
The number of architected vector general-purpose registers allocated
for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
requested by the compiler due to allocation granularity.
AGPRs: |-
AGPRs: >-
The number of accumulation vector general-purpose registers allocated
for the kernel, see AGPRs. Note: this may not exactly match the number of
AGPRs requested by the compiler due to allocation granularity.
SGPRs: |-
SGPRs: >-
The number of scalar general-purpose registers allocated for the kernel,
see SALU. Note: this may not exactly match the number of SGPRs requested by
the compiler due to allocation granularity.
LDS Allocation: |-
LDS Allocation: >-
The number of bytes of LDS memory (or, shared memory) allocated for
this kernel. Note: This may also be larger than what was requested at compile
time due to both allocation granularity and dynamic per-dispatch LDS allocations.
@@ -173,7 +173,7 @@ Panel Config:
rather than identification of a precise limiter. The sum of this metric, Issue
Wait Cycles and Active Wait Cycles should be equal to the total Wave Cycles
metric.
Wavefront Occupancy: |-
Wavefront Occupancy: >-
The time-averaged number of wavefronts resident on the accelerator over
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1ms).
@@ -273,7 +273,7 @@ Panel Config:
floating-point operands issued to the VALU per normalization unit.
F64-Trans: The total number of transcendental instructions (such as sqrt) operating
on 64-bit floating-point operands issued to the VALU per normalization unit.
Conversion: |-
Conversion: >-
The total number of type conversion instructions (such as converting
data to or from F32\u2194F64) issued to the VALU per normalization unit.
Global/Generic Instr: The total number of global & generic memory instructions
@@ -251,37 +251,37 @@ Panel Config:
max: MAX(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
unit: (OPs + $normUnit)
metrics_description:
VALU FLOPs: |-
VALU FLOPs: >-
The total floating-point operations executed per second on the VALU.
This is also presented as a percent of the peak theoretical FLOPs achievable
on the specific accelerator. Note: this does not include any floating-point
operations from MFMA instructions.
VALU IOPs: |-
VALU IOPs: >-
The total integer operations executed per second on the VALU. This is
also presented as a percent of the peak theoretical IOPs achievable on the
specific accelerator. Note: this does not include any integer operations from
MFMA instructions.
MFMA FLOPs (BF16): |-
MFMA FLOPs (BF16): >-
The total number of 16-bit brain floating point MFMA operations executed
per second. Note: this does not include any 16-bit brain floating point operations
from VALU instructions. This is also presented as a percent of the peak theoretical
BF16 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F16): |-
MFMA FLOPs (F16): >-
The total number of 16-bit floating point MFMA operations executed per
second. Note: this does not include any 16-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F16 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F32): |-
MFMA FLOPs (F32): >-
The total number of 32-bit floating point MFMA operations executed per
second. Note: this does not include any 32-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F32 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F64): |-
MFMA FLOPs (F64): >-
The total number of 64-bit floating point MFMA operations executed per
second. Note: this does not include any 64-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F64 MFMA operations achievable on the specific accelerator.
MFMA IOPs (INT8): |-
MFMA IOPs (INT8): >-
The total number of 8-bit integer MFMA operations executed per second.
Note: this does not include any 8-bit integer operations from VALU instructions.
This is also presented as a percent of the peak theoretical INT8 MFMA operations
@@ -140,7 +140,7 @@ Panel Config:
unit.
Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
stalls from non-dword aligned addresses per normalization unit.
Mem Violations: |-
Mem Violations: >-
The total number of out-of-bounds accesses made to the LDS, per normalization
unit. This is unused and expected to be zero in most configurations for
modern CDNA\u2122 accelerators.
@@ -92,7 +92,7 @@ Panel Config:
Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
loaded line the cache. Calculated as the ratio of the number of L1I requests
that hit over the number of all L1I requests.
L1I-L2 Bandwidth Utilization: |-
L1I-L2 Bandwidth Utilization: >-
The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
achieved. Calculated as the ratio of the total number of requests from the
L1I to the L2 cache over the total L1I-L2 interface cycles.
@@ -154,7 +154,7 @@ Panel Config:
sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
bandwidth acheived. Calculated as total number of bytes read from, written to,
or atomically updated across the sL1D - L2 interface.
sL1D-L2 BW: |-
sL1D-L2 BW: >-
The total number of bytes read from, written to, or atomically updated
across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D
writes and atomics are typically unused on current CDNA accelerators, so
@@ -164,7 +164,7 @@ Panel Config:
unit.
Hits: The total number of sL1D requests that hit on a previously loaded cache
line, per normalization unit.
Misses - Non Duplicated: |-
Misses - Non Duplicated: >-
The total number of sL1D requests that missed on a cache line that was
not already pending due to another request, per normalization unit.
Misses- Duplicated: The total number of sL1D requests that missed on a cache line
@@ -187,6 +187,6 @@ Panel Config:
unit.
Write Req: The total number of write requests from sL1D to the L2, per normalization
unit. Typically unused on current CDNA accelerators.
Stall Cycles: |-
Stall Cycles: >-
The total number of cycles the sL1D\u2194L2 interface was stalled, per
normalization unit.
@@ -398,7 +398,7 @@ Panel Config:
per normalization unit.
Translation Misses: The total number of translation requests that missed in the
UTCL1 due to translation not being present in the cache, per normalization unit.
Permission Misses: |-
Permission Misses: >-
The total number of translation requests that missed in the UTCL1 due
to a permission error, per normalization unit. This is unused and expected
to be zero in most configurations for modern CDNA\u2122 accelerators.
@@ -227,12 +227,12 @@ Panel Config:
pop: None
coll_level: SQ_IFETCH_LEVEL
metrics_description:
VALU FLOPs: |-
VALU FLOPs: >-
The total floating-point operations executed per second on the VALU.
This is also presented as a percent of the peak theoretical FLOPs achievable
on the specific accelerator. Note: this does not include any floating-point
operations from MFMA instructions.
VALU IOPs: |-
VALU IOPs: >-
The total integer operations executed per second on the VALU. This is
also presented as a percent of the peak theoretical IOPs achievable on the
specific accelerator. Note: this does not include any integer operations from
@@ -242,27 +242,27 @@ Panel Config:
from VALU instructions. This is also presented as a percent of the peak theoretical
F8 MFMA operations achievable on the specific accelerator. It is supported on
AMD Instinct MI300 series and later only.
MFMA FLOPs (BF16): |-
MFMA FLOPs (BF16): >-
The total number of 16-bit brain floating point MFMA operations executed
per second. Note: this does not include any 16-bit brain floating point operations
from VALU instructions. This is also presented as a percent of the peak theoretical
BF16 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F16): |-
MFMA FLOPs (F16): >-
The total number of 16-bit floating point MFMA operations executed per
second. Note: this does not include any 16-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F16 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F32): |-
MFMA FLOPs (F32): >-
The total number of 32-bit floating point MFMA operations executed per
second. Note: this does not include any 32-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F32 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F64): |-
MFMA FLOPs (F64): >-
The total number of 64-bit floating point MFMA operations executed per
second. Note: this does not include any 64-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F64 MFMA operations achievable on the specific accelerator.
MFMA IOPs (Int8): |-
MFMA IOPs (Int8): >-
The total number of 8-bit integer MFMA operations executed per second.
Note: this does not include any 8-bit integer operations from VALU instructions.
This is also presented as a percent of the peak theoretical INT8 MFMA operations
@@ -295,7 +295,7 @@ Panel Config:
IPC: The ratio of the total number of instructions executed on the CU over the
total active CU cycles. This is also presented as a percent of the peak theoretical
bandwidth achievable on the specific accelerator.
Wavefront Occupancy: |-
Wavefront Occupancy: >-
The time-averaged number of wavefronts resident on the accelerator over
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1ms). This is also presented as a percent of the peak theoretical
@@ -328,7 +328,7 @@ Panel Config:
if only a single value is requested in a cache line, the data movement will
still be counted as a full cache line. This is also presented as a percent of
the peak theoretical bandwidth achievable on the specific accelerator.
L2-Fabric Read BW: |-
L2-Fabric Read BW: >-
The number of bytes read by the L2 over the Infinity Fabric\u2122 interface
per unit time. This is also presented as a percent of the peak theoretical
bandwidth achievable on the specific accelerator.
@@ -162,15 +162,15 @@ Panel Config:
Active CUs: Total number of active compute units (CUs) on the accelerator during
the kernel execution.
Num CUs: Total number of compute units (CUs) on the accelerator.
VGPR: |-
VGPR: >-
The number of architected vector general-purpose registers allocated
for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
requested by the compiler due to allocation granularity.
SGPR: |-
SGPR: >-
The number of scalar general-purpose registers allocated for the kernel,
see SALU. Note: this may not exactly match the number of SGPRs requested by
the compiler due to allocation granularity.
LDS Allocation: |-
LDS Allocation: >-
The number of bytes of LDS memory (or, shared memory) allocated for
this kernel. Note: This may also be larger than what was requested at compile
time due to both allocation granularity and dynamic per-dispatch LDS allocations.
@@ -252,7 +252,7 @@ Panel Config:
or data (atomic with return value) was returned to the L2.
HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
of data from the accelerator's local HBM, per normalization unit.
HBM Wr: |-
HBM Wr: >-
The total number of L2 requests to Infinity Fabric to write or atomically
update 32B or 64B of data in the accelerator's local HBM, per normalization
unit.
+13 -13
Просмотреть файл
@@ -140,17 +140,17 @@ Panel Config:
* 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9
unit: GFLOP/s
metrics_description:
VALU FLOPs (F16): |-
VALU FLOPs (F16): >-
The total 16-bit floating-point operations executed per second on the VALU.
This is presented with the value of the peak empirical F16 FLOPs achievable
on the specific accelerator. Note: this does not include any F16 operations
from MFMA instructions.
VALU FLOPs (F32): |-
VALU FLOPs (F32): >-
The total 32-bit floating-point operations executed per second on the VALU.
This is presented with the value of the peak empirical F32 FLOPs achievable
on the specific accelerator. Note: this does not include any F32 operations
from MFMA instructions.
VALU FLOPs (F64): |-
VALU FLOPs (F64): >-
The total 64-bit floating-point operations executed per second on the VALU.
This is presented with the value of the peak empirical F64 FLOPs achievable
on the specific accelerator. Note: this does not include any F64 operations
@@ -160,33 +160,33 @@ Panel Config:
from VALU instructions. The peak empirically measured F8 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison. It is supported
on AMD Instinct MI300 series and later only.
MFMA FLOPs (BF16): |-
MFMA FLOPs (BF16): >-
The total number of 16-bit brain floating point MFMA operations executed
per second. Note: this does not include any 16-bit brain floating point
operations from VALU instructions. The peak empirically measured BF16 MFMA
operations achievable on the specific accelerator is displayed alongside
for comparison.
MFMA FLOPs (F16): |-
MFMA FLOPs (F16): >-
The total number of 16-bit floating point MFMA operations executed per
second. Note: this does not include any 16-bit floating point operations from
VALU instructions. The peak empirically measured F16 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.
MFMA FLOPs (F32): |-
MFMA FLOPs (F32): >-
The total number of 32-bit floating point MFMA operations executed per
second. Note: this does not include any 32-bit floating point operations from
VALU instructions. The peak empirically measured F32 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.
MFMA FLOPs (F64): |-
MFMA FLOPs (F64): >-
The total number of 64-bit floating point MFMA operations executed per
second. Note: this does not include any 64-bit floating point operations from
VALU instructions. The peak empirically measured F64 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.
MFMA IOPs (Int8): |-
MFMA IOPs (Int8): >-
The total number of 8-bit integer MFMA operations executed per second.
Note: this does not include any 8-bit integer operations from VALU instructions.
The peak empirically measured INT8 MFMA operations achievable on the specific
accelerator is displayed alongside for comparison.
HBM Bandwidth: |-
HBM Bandwidth: >-
The total number of bytes read from and written to High-Bandwidth
Memory (HBM) per second. The peak empirically measured bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
@@ -207,22 +207,22 @@ Panel Config:
from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
example for more detail). The peak empirically measured LDS bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
AI L1: |-
AI L1: >-
The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L1 cache and the processing units. This value is used as the x-coordinate
for the L1 roofline.
AI L2: |-
AI L2: >-
The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L2 cache and the L1 cache. This value is used as the x-coordinate for
the L2 roofline.
AI HBM: |-
AI HBM: >-
The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
It is the ratio of total floating-point operations (FLOPs) to total bytes
transferred between HBM and the L2 cache. This value is used as the x-coordinate
for the HBM roofline.
Performance (GFLOPs): |-
Performance (GFLOPs): >-
The overall achieved performance, measured in GigaFLOPs
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
operations divided by the total execution time. This value is used as the y-coordinate
@@ -141,6 +141,6 @@ Panel Config:
the CPC-L2 interface was active doing any work.
CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
translation
CPC-UTCL2 Utilization: |-
CPC-UTCL2 Utilization: >-
Percent of total cycles counted by the CPC's L2 address translation
interface where the CPC was busy doing address translation work.
@@ -168,7 +168,7 @@ Panel Config:
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
within the workgroup manager rather than a lack of a CU or SIMD with sufficient
resources.
Not-scheduled Rate (Scheduler-Pipe): |-
Not-scheduled Rate (Scheduler-Pipe): >-
The percent of total scheduler-pipe cycles in the kernel where a workgroup
could not be scheduled to a CU due to a bottleneck within the scheduler-pipes
rather than a lack of a CU or SIMD with sufficient resources.
+6 -6
Просмотреть файл
@@ -121,26 +121,26 @@ Panel Config:
Workgroup Size: The total number of work-items (or, threads) in each workgroup
(or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
to the total block size.
Total Wavefronts: |-
Total Wavefronts: >-
The total number of wavefronts launched as part of the kernel dispatch.
On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront
size is always 64 work-items. Thus, the total number of wavefronts should
be equivalent to the ceiling of grid size divided by 64.
Saved Wavefronts: The total number of wavefronts saved at a context-save.
Restored Wavefronts: The total number of wavefronts restored from a context-save.
VGPRs: |-
VGPRs: >-
The number of architected vector general-purpose registers allocated
for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
requested by the compiler due to allocation granularity.
AGPRs: |-
AGPRs: >-
The number of accumulation vector general-purpose registers allocated
for the kernel, see AGPRs. Note: this may not exactly match the number of
AGPRs requested by the compiler due to allocation granularity.
SGPRs: |-
SGPRs: >-
The number of scalar general-purpose registers allocated for the kernel,
see SALU. Note: this may not exactly match the number of SGPRs requested by
the compiler due to allocation granularity.
LDS Allocation: |-
LDS Allocation: >-
The number of bytes of LDS memory (or, shared memory) allocated for
this kernel. Note: This may also be larger than what was requested at compile
time due to both allocation granularity and dynamic per-dispatch LDS allocations.
@@ -173,7 +173,7 @@ Panel Config:
rather than identification of a precise limiter. The sum of this metric, Issue
Wait Cycles and Active Wait Cycles should be equal to the total Wave Cycles
metric.
Wavefront Occupancy: |-
Wavefront Occupancy: >-
The time-averaged number of wavefronts resident on the accelerator over
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1ms).
@@ -273,7 +273,7 @@ Panel Config:
floating-point operands issued to the VALU per normalization unit.
F64-Trans: The total number of transcendental instructions (such as sqrt) operating
on 64-bit floating-point operands issued to the VALU per normalization unit.
Conversion: |-
Conversion: >-
The total number of type conversion instructions (such as converting
data to or from F32\u2194F64) issued to the VALU per normalization unit.
Global/Generic Instr: The total number of global & generic memory instructions
@@ -251,37 +251,37 @@ Panel Config:
max: MAX(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
unit: (OPs + $normUnit)
metrics_description:
VALU FLOPs: |-
VALU FLOPs: >-
The total floating-point operations executed per second on the VALU.
This is also presented as a percent of the peak theoretical FLOPs achievable
on the specific accelerator. Note: this does not include any floating-point
operations from MFMA instructions.
VALU IOPs: |-
VALU IOPs: >-
The total integer operations executed per second on the VALU. This is
also presented as a percent of the peak theoretical IOPs achievable on the
specific accelerator. Note: this does not include any integer operations from
MFMA instructions.
MFMA FLOPs (BF16): |-
MFMA FLOPs (BF16): >-
The total number of 16-bit brain floating point MFMA operations executed
per second. Note: this does not include any 16-bit brain floating point operations
from VALU instructions. This is also presented as a percent of the peak theoretical
BF16 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F16): |-
MFMA FLOPs (F16): >-
The total number of 16-bit floating point MFMA operations executed per
second. Note: this does not include any 16-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F16 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F32): |-
MFMA FLOPs (F32): >-
The total number of 32-bit floating point MFMA operations executed per
second. Note: this does not include any 32-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F32 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F64): |-
MFMA FLOPs (F64): >-
The total number of 64-bit floating point MFMA operations executed per
second. Note: this does not include any 64-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F64 MFMA operations achievable on the specific accelerator.
MFMA IOPs (INT8): |-
MFMA IOPs (INT8): >-
The total number of 8-bit integer MFMA operations executed per second.
Note: this does not include any 8-bit integer operations from VALU instructions.
This is also presented as a percent of the peak theoretical INT8 MFMA operations
@@ -140,7 +140,7 @@ Panel Config:
unit.
Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
stalls from non-dword aligned addresses per normalization unit.
Mem Violations: |-
Mem Violations: >-
The total number of out-of-bounds accesses made to the LDS, per normalization
unit. This is unused and expected to be zero in most configurations for
modern CDNA\u2122 accelerators.
@@ -92,7 +92,7 @@ Panel Config:
Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
loaded line the cache. Calculated as the ratio of the number of L1I requests
that hit over the number of all L1I requests.
L1I-L2 Bandwidth Utilization: |-
L1I-L2 Bandwidth Utilization: >-
The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
achieved. Calculated as the ratio of the total number of requests from the
L1I to the L2 cache over the total L1I-L2 interface cycles.
@@ -154,7 +154,7 @@ Panel Config:
sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
bandwidth acheived. Calculated as total number of bytes read from, written to,
or atomically updated across the sL1D - L2 interface.
sL1D-L2 BW: |-
sL1D-L2 BW: >-
The total number of bytes read from, written to, or atomically updated
across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D
writes and atomics are typically unused on current CDNA accelerators, so
@@ -164,7 +164,7 @@ Panel Config:
unit.
Hits: The total number of sL1D requests that hit on a previously loaded cache
line, per normalization unit.
Misses - Non Duplicated: |-
Misses - Non Duplicated: >-
The total number of sL1D requests that missed on a cache line that was
not already pending due to another request, per normalization unit.
Misses- Duplicated: The total number of sL1D requests that missed on a cache line
@@ -187,6 +187,6 @@ Panel Config:
unit.
Write Req: The total number of write requests from sL1D to the L2, per normalization
unit. Typically unused on current CDNA accelerators.
Stall Cycles: |-
Stall Cycles: >-
The total number of cycles the sL1D\u2194L2 interface was stalled, per
normalization unit.
@@ -398,7 +398,7 @@ Panel Config:
per normalization unit.
Translation Misses: The total number of translation requests that missed in the
UTCL1 due to translation not being present in the cache, per normalization unit.
Permission Misses: |-
Permission Misses: >-
The total number of translation requests that missed in the UTCL1 due
to a permission error, per normalization unit. This is unused and expected
to be zero in most configurations for modern CDNA\u2122 accelerators.
@@ -233,12 +233,12 @@ Panel Config:
pop: None
coll_level: SQ_IFETCH_LEVEL
metrics_description:
VALU FLOPs: |-
VALU FLOPs: >-
The total floating-point operations executed per second on the VALU.
This is also presented as a percent of the peak theoretical FLOPs achievable
on the specific accelerator. Note: this does not include any floating-point
operations from MFMA instructions.
VALU IOPs: |-
VALU IOPs: >-
The total integer operations executed per second on the VALU. This is
also presented as a percent of the peak theoretical IOPs achievable on the
specific accelerator. Note: this does not include any integer operations from
@@ -248,27 +248,27 @@ Panel Config:
from VALU instructions. This is also presented as a percent of the peak theoretical
F8 MFMA operations achievable on the specific accelerator. It is supported on
AMD Instinct MI300 series and later only.
MFMA FLOPs (BF16): |-
MFMA FLOPs (BF16): >-
The total number of 16-bit brain floating point MFMA operations executed
per second. Note: this does not include any 16-bit brain floating point operations
from VALU instructions. This is also presented as a percent of the peak theoretical
BF16 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F16): |-
MFMA FLOPs (F16): >-
The total number of 16-bit floating point MFMA operations executed per
second. Note: this does not include any 16-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F16 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F32): |-
MFMA FLOPs (F32): >-
The total number of 32-bit floating point MFMA operations executed per
second. Note: this does not include any 32-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F32 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F64): |-
MFMA FLOPs (F64): >-
The total number of 64-bit floating point MFMA operations executed per
second. Note: this does not include any 64-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F64 MFMA operations achievable on the specific accelerator.
MFMA IOPs (Int8): |-
MFMA IOPs (Int8): >-
The total number of 8-bit integer MFMA operations executed per second.
Note: this does not include any 8-bit integer operations from VALU instructions.
This is also presented as a percent of the peak theoretical INT8 MFMA operations
@@ -301,7 +301,7 @@ Panel Config:
IPC: The ratio of the total number of instructions executed on the CU over the
total active CU cycles. This is also presented as a percent of the peak theoretical
bandwidth achievable on the specific accelerator.
Wavefront Occupancy: |-
Wavefront Occupancy: >-
The time-averaged number of wavefronts resident on the accelerator over
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1ms). This is also presented as a percent of the peak theoretical
@@ -334,7 +334,7 @@ Panel Config:
if only a single value is requested in a cache line, the data movement will
still be counted as a full cache line. This is also presented as a percent of
the peak theoretical bandwidth achievable on the specific accelerator.
L2-Fabric Read BW: |-
L2-Fabric Read BW: >-
The number of bytes read by the L2 over the Infinity Fabric\u2122 interface
per unit time. This is also presented as a percent of the peak theoretical
bandwidth achievable on the specific accelerator.
@@ -172,15 +172,15 @@ Panel Config:
Active CUs: Total number of active compute units (CUs) on the accelerator during
the kernel execution.
Num CUs: Total number of compute units (CUs) on the accelerator.
VGPR: |-
VGPR: >-
The number of architected vector general-purpose registers allocated
for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
requested by the compiler due to allocation granularity.
SGPR: |-
SGPR: >-
The number of scalar general-purpose registers allocated for the kernel,
see SALU. Note: this may not exactly match the number of SGPRs requested by
the compiler due to allocation granularity.
LDS Allocation: |-
LDS Allocation: >-
The number of bytes of LDS memory (or, shared memory) allocated for
this kernel. Note: This may also be larger than what was requested at compile
time due to both allocation granularity and dynamic per-dispatch LDS allocations.
@@ -268,7 +268,7 @@ Panel Config:
or data (atomic with return value) was returned to the L2.
HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
of data from the accelerator's local HBM, per normalization unit.
HBM Wr: |-
HBM Wr: >-
The total number of L2 requests to Infinity Fabric to write or atomically
update 32B or 64B of data in the accelerator's local HBM, per normalization
unit.
+14 -14
Просмотреть файл
@@ -148,17 +148,17 @@ Panel Config:
Start_Timestamp) / 1e9) ) / 1e9
unit: GFLOP/s
metrics_description:
VALU FLOPs (F16): |-
VALU FLOPs (F16): >-
The total 16-bit floating-point operations executed per second on the VALU.
This is presented with the value of the peak empirical F16 FLOPs achievable
on the specific accelerator. Note: this does not include any F16 operations
from MFMA instructions.
VALU FLOPs (F32): |-
VALU FLOPs (F32): >-
The total 32-bit floating-point operations executed per second on the VALU.
This is presented with the value of the peak empirical F32 FLOPs achievable
on the specific accelerator. Note: this does not include any F32 operations
from MFMA instructions.
VALU FLOPs (F64): |-
VALU FLOPs (F64): >-
The total 64-bit floating-point operations executed per second on the VALU.
This is presented with the value of the peak empirical F64 FLOPs achievable
on the specific accelerator. Note: this does not include any F64 operations
@@ -168,39 +168,39 @@ Panel Config:
from VALU instructions. The peak empirically measured F8 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison. It is supported
on AMD Instinct MI300 series and later only.
MFMA FLOPs (BF16): |-
MFMA FLOPs (BF16): >-
The total number of 16-bit brain floating point MFMA operations executed
per second. Note: this does not include any 16-bit brain floating point
operations from VALU instructions. The peak empirically measured BF16 MFMA
operations achievable on the specific accelerator is displayed alongside
for comparison.
MFMA FLOPs (F16): |-
MFMA FLOPs (F16): >-
The total number of 16-bit floating point MFMA operations executed per
second. Note: this does not include any 16-bit floating point operations from
VALU instructions. The peak empirically measured F16 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.
MFMA FLOPs (F32): |-
MFMA FLOPs (F32): >-
The total number of 32-bit floating point MFMA operations executed per
second. Note: this does not include any 32-bit floating point operations from
VALU instructions. The peak empirically measured F32 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.
MFMA FLOPs (F64): |-
MFMA FLOPs (F64): >-
The total number of 64-bit floating point MFMA operations executed per
second. Note: this does not include any 64-bit floating point operations from
VALU instructions. The peak empirically measured F64 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.
MFMA FLOPs (F6F4): |-
MFMA FLOPs (F6F4): >-
The total number of 4-bit and 6-bit floating point MFMA operations executed
per second. Note: this does not include any floating point operations from
VALU instructions. The peak empirically measured F6F4 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.
It is supported on AMD Instinct MI350 series (gfx950) and later only.
MFMA IOPs (Int8): |-
MFMA IOPs (Int8): >-
The total number of 8-bit integer MFMA operations executed per second.
Note: this does not include any 8-bit integer operations from VALU instructions.
The peak empirically measured INT8 MFMA operations achievable on the specific
accelerator is displayed alongside for comparison.
HBM Bandwidth: |-
HBM Bandwidth: >-
The total number of bytes read from and written to High-Bandwidth
Memory (HBM) per second. The peak empirically measured bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
@@ -221,22 +221,22 @@ Panel Config:
from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
example for more detail). The peak empirically measured LDS bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
AI L1: |-
AI L1: >-
The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L1 cache and the processing units. This value is used as the x-coordinate
for the L1 roofline.
AI L2: |-
AI L2: >-
The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L2 cache and the L1 cache. This value is used as the x-coordinate for
the L2 roofline.
AI HBM: |-
AI HBM: >-
The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
It is the ratio of total floating-point operations (FLOPs) to total bytes
transferred between HBM and the L2 cache. This value is used as the x-coordinate
for the HBM roofline.
Performance (GFLOPs): |-
Performance (GFLOPs): >-
The overall achieved performance, measured in GigaFLOPs
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
operations divided by the total execution time. This value is used as the y-coordinate
@@ -162,6 +162,6 @@ Panel Config:
the CPC-L2 interface was active doing any work.
CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
translation
CPC-UTCL2 Utilization: |-
CPC-UTCL2 Utilization: >-
Percent of total cycles counted by the CPC's L2 address translation
interface where the CPC was busy doing address translation work.
@@ -204,7 +204,7 @@ Panel Config:
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
within the workgroup manager rather than a lack of a CU or SIMD with sufficient
resources.
Not-scheduled Rate (Scheduler-Pipe): |-
Not-scheduled Rate (Scheduler-Pipe): >-
The percent of total scheduler-pipe cycles in the kernel where a workgroup
could not be scheduled to a CU due to a bottleneck within the scheduler-pipes
rather than a lack of a CU or SIMD with sufficient resources.
+6 -6
Просмотреть файл
@@ -121,26 +121,26 @@ Panel Config:
Workgroup Size: The total number of work-items (or, threads) in each workgroup
(or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
to the total block size.
Total Wavefronts: |-
Total Wavefronts: >-
The total number of wavefronts launched as part of the kernel dispatch.
On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront
size is always 64 work-items. Thus, the total number of wavefronts should
be equivalent to the ceiling of grid size divided by 64.
Saved Wavefronts: The total number of wavefronts saved at a context-save.
Restored Wavefronts: The total number of wavefronts restored from a context-save.
VGPRs: |-
VGPRs: >-
The number of architected vector general-purpose registers allocated
for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
requested by the compiler due to allocation granularity.
AGPRs: |-
AGPRs: >-
The number of accumulation vector general-purpose registers allocated
for the kernel, see AGPRs. Note: this may not exactly match the number of
AGPRs requested by the compiler due to allocation granularity.
SGPRs: |-
SGPRs: >-
The number of scalar general-purpose registers allocated for the kernel,
see SALU. Note: this may not exactly match the number of SGPRs requested by
the compiler due to allocation granularity.
LDS Allocation: |-
LDS Allocation: >-
The number of bytes of LDS memory (or, shared memory) allocated for
this kernel. Note: This may also be larger than what was requested at compile
time due to both allocation granularity and dynamic per-dispatch LDS allocations.
@@ -173,7 +173,7 @@ Panel Config:
rather than identification of a precise limiter. The sum of this metric, Issue
Wait Cycles and Active Wait Cycles should be equal to the total Wave Cycles
metric.
Wavefront Occupancy: |-
Wavefront Occupancy: >-
The time-averaged number of wavefronts resident on the accelerator over
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1ms).
@@ -283,7 +283,7 @@ Panel Config:
floating-point operands issued to the VALU per normalization unit.
F64-Trans: The total number of transcendental instructions (such as sqrt) operating
on 64-bit floating-point operands issued to the VALU per normalization unit.
Conversion: |-
Conversion: >-
The total number of type conversion instructions (such as converting
data to or from F32\u2194F64) issued to the VALU per normalization unit.
Global/Generic Instr: The total number of global & generic memory instructions
@@ -267,37 +267,37 @@ Panel Config:
max: MAX(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
unit: (OPs + $normUnit)
metrics_description:
VALU FLOPs: |-
VALU FLOPs: >-
The total floating-point operations executed per second on the VALU.
This is also presented as a percent of the peak theoretical FLOPs achievable
on the specific accelerator. Note: this does not include any floating-point
operations from MFMA instructions.
VALU IOPs: |-
VALU IOPs: >-
The total integer operations executed per second on the VALU. This is
also presented as a percent of the peak theoretical IOPs achievable on the
specific accelerator. Note: this does not include any integer operations from
MFMA instructions.
MFMA FLOPs (BF16): |-
MFMA FLOPs (BF16): >-
The total number of 16-bit brain floating point MFMA operations executed
per second. Note: this does not include any 16-bit brain floating point operations
from VALU instructions. This is also presented as a percent of the peak theoretical
BF16 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F16): |-
MFMA FLOPs (F16): >-
The total number of 16-bit floating point MFMA operations executed per
second. Note: this does not include any 16-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F16 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F32): |-
MFMA FLOPs (F32): >-
The total number of 32-bit floating point MFMA operations executed per
second. Note: this does not include any 32-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F32 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F64): |-
MFMA FLOPs (F64): >-
The total number of 64-bit floating point MFMA operations executed per
second. Note: this does not include any 64-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F64 MFMA operations achievable on the specific accelerator.
MFMA IOPs (INT8): |-
MFMA IOPs (INT8): >-
The total number of 8-bit integer MFMA operations executed per second.
Note: this does not include any 8-bit integer operations from VALU instructions.
This is also presented as a percent of the peak theoretical INT8 MFMA operations
@@ -180,7 +180,7 @@ Panel Config:
unit.
Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
stalls from non-dword aligned addresses per normalization unit.
Mem Violations: |-
Mem Violations: >-
The total number of out-of-bounds accesses made to the LDS, per normalization
unit. This is unused and expected to be zero in most configurations for
modern CDNA\u2122 accelerators.
@@ -92,7 +92,7 @@ Panel Config:
Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
loaded line the cache. Calculated as the ratio of the number of L1I requests
that hit over the number of all L1I requests.
L1I-L2 Bandwidth Utilization: |-
L1I-L2 Bandwidth Utilization: >-
The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
achieved. Calculated as the ratio of the total number of requests from the
L1I to the L2 cache over the total L1I-L2 interface cycles.
@@ -154,7 +154,7 @@ Panel Config:
sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
bandwidth acheived. Calculated as total number of bytes read from, written to,
or atomically updated across the sL1D - L2 interface.
sL1D-L2 BW: |-
sL1D-L2 BW: >-
The total number of bytes read from, written to, or atomically updated
across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D
writes and atomics are typically unused on current CDNA accelerators, so
@@ -164,7 +164,7 @@ Panel Config:
unit.
Hits: The total number of sL1D requests that hit on a previously loaded cache
line, per normalization unit.
Misses - Non Duplicated: |-
Misses - Non Duplicated: >-
The total number of sL1D requests that missed on a cache line that was
not already pending due to another request, per normalization unit.
Misses- Duplicated: The total number of sL1D requests that missed on a cache line
@@ -187,6 +187,6 @@ Panel Config:
unit.
Write Req: The total number of write requests from sL1D to the L2, per normalization
unit. Typically unused on current CDNA accelerators.
Stall Cycles: |-
Stall Cycles: >-
The total number of cycles the sL1D\u2194L2 interface was stalled, per
normalization unit.
@@ -501,7 +501,7 @@ Panel Config:
per normalization unit.
Translation Misses: The total number of translation requests that missed in the
UTCL1 due to translation not being present in the cache, per normalization unit.
Permission Misses: |-
Permission Misses: >-
The total number of translation requests that missed in the UTCL1 due
to a permission error, per normalization unit. This is unused and expected
to be zero in most configurations for modern CDNA\u2122 accelerators.
+1 -1
Просмотреть файл
@@ -706,7 +706,7 @@ Panel Config:
requests are only considered atomic by Infinity Fabric if they are targeted
at non-write-cacheable memory, such as fine-grained memory allocations or uncached
memory allocations on the MI2XX.
Read Stall: |-
Read Stall: >-
The ratio of the total number of cycles the L2-Fabric interface was
stalled on a read request to any destination (local HBM, remote PCIe\xAE
connected accelerator or CPU, or remote Infinity Fabric connected accelerator
+2 -116
Просмотреть файл
@@ -1,116 +1,2 @@
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
src/rocprof_compute_soc/analysis_configs/gfx908/0000_top_stats.yaml: ad7818c680acb0d4e3cb624e0f14f79d44fa7efe14531c5643f47ac96266c91d
src/rocprof_compute_soc/analysis_configs/gfx90a/0000_top_stats.yaml: ad7818c680acb0d4e3cb624e0f14f79d44fa7efe14531c5643f47ac96266c91d
src/rocprof_compute_soc/analysis_configs/gfx940/0000_top_stats.yaml: ad7818c680acb0d4e3cb624e0f14f79d44fa7efe14531c5643f47ac96266c91d
src/rocprof_compute_soc/analysis_configs/gfx941/0000_top_stats.yaml: ad7818c680acb0d4e3cb624e0f14f79d44fa7efe14531c5643f47ac96266c91d
src/rocprof_compute_soc/analysis_configs/gfx942/0000_top_stats.yaml: ad7818c680acb0d4e3cb624e0f14f79d44fa7efe14531c5643f47ac96266c91d
src/rocprof_compute_soc/analysis_configs/gfx950/0000_top_stats.yaml: ad7818c680acb0d4e3cb624e0f14f79d44fa7efe14531c5643f47ac96266c91d
src/rocprof_compute_soc/analysis_configs/gfx908/0100_system_info.yaml: d95eea137c439cc2aa4ac5273f06ac6a05037a74550bc23a095162ee366d39cb
src/rocprof_compute_soc/analysis_configs/gfx90a/0100_system_info.yaml: d95eea137c439cc2aa4ac5273f06ac6a05037a74550bc23a095162ee366d39cb
src/rocprof_compute_soc/analysis_configs/gfx940/0100_system_info.yaml: d95eea137c439cc2aa4ac5273f06ac6a05037a74550bc23a095162ee366d39cb
src/rocprof_compute_soc/analysis_configs/gfx941/0100_system_info.yaml: d95eea137c439cc2aa4ac5273f06ac6a05037a74550bc23a095162ee366d39cb
src/rocprof_compute_soc/analysis_configs/gfx942/0100_system_info.yaml: d95eea137c439cc2aa4ac5273f06ac6a05037a74550bc23a095162ee366d39cb
src/rocprof_compute_soc/analysis_configs/gfx950/0100_system_info.yaml: d95eea137c439cc2aa4ac5273f06ac6a05037a74550bc23a095162ee366d39cb
src/rocprof_compute_soc/analysis_configs/gfx908/0200_system_speed_of_light.yaml: aa60b7a75e46196195675a1c8d6aa65211483ace8dfe346ed0228056586bc8a5
src/rocprof_compute_soc/analysis_configs/gfx90a/0200_system_speed_of_light.yaml: 54d0ef58f8222463516984d3b9153806f5185de9e719d1903537af4c8344a4f4
src/rocprof_compute_soc/analysis_configs/gfx940/0200_system_speed_of_light.yaml: a6a5d78d76eb39471249c4c55ccea2e8084a5136c01d29aaeb87d308cce05d2e
src/rocprof_compute_soc/analysis_configs/gfx941/0200_system_speed_of_light.yaml: 352d4702fbebd8550883b777b875893a8404a7909d83c74cdd50c1b713452c81
src/rocprof_compute_soc/analysis_configs/gfx942/0200_system_speed_of_light.yaml: a6a5d78d76eb39471249c4c55ccea2e8084a5136c01d29aaeb87d308cce05d2e
src/rocprof_compute_soc/analysis_configs/gfx950/0200_system_speed_of_light.yaml: 1a164dfbb551e4b0a8a55a843d776738d90406cdbe2930e0f474b77a075a7353
src/rocprof_compute_soc/analysis_configs/gfx908/0300_memory_chart.yaml: ff5fd164694f454a95ccd52c8c0bfa20aebfa476908cab2ac03215fb33e48598
src/rocprof_compute_soc/analysis_configs/gfx90a/0300_memory_chart.yaml: 332c1965f462e75a479ddf3270294e1cf723701eb08b60c6cea550eb3bc192e7
src/rocprof_compute_soc/analysis_configs/gfx940/0300_memory_chart.yaml: 92dc15222a707fff79ce2084172ae2068465bfe064b89538ca7e83359422dfc8
src/rocprof_compute_soc/analysis_configs/gfx941/0300_memory_chart.yaml: 92dc15222a707fff79ce2084172ae2068465bfe064b89538ca7e83359422dfc8
src/rocprof_compute_soc/analysis_configs/gfx942/0300_memory_chart.yaml: 92dc15222a707fff79ce2084172ae2068465bfe064b89538ca7e83359422dfc8
src/rocprof_compute_soc/analysis_configs/gfx950/0300_memory_chart.yaml: d3a2e085061068ff8cff0b80f6944dc866ec3e748cf1e4c0cfcd76e1e14d21f8
src/rocprof_compute_soc/analysis_configs/gfx908/0400_roofline.yaml: e91988af6d99a03e2a19593155447f79abe64dc128a83a170a5037ab466b238c
src/rocprof_compute_soc/analysis_configs/gfx90a/0400_roofline.yaml: 0807c87d20faed19f2ef9470e9277715f2287e687aa831a328dcab4915a38812
src/rocprof_compute_soc/analysis_configs/gfx940/0400_roofline.yaml: f5f35d1ae9a35fe83bcdf572aa788401c14cc6718761c4cf8e4dddcf249c3548
src/rocprof_compute_soc/analysis_configs/gfx941/0400_roofline.yaml: 760ecef9947fa31d3a0fb5c45d653060d06213d8d9f216c19cbb1b1ce29942b6
src/rocprof_compute_soc/analysis_configs/gfx942/0400_roofline.yaml: e037ce1a2cf8ba08e2317e322b56954caace6ec2427a966acbabf2135cd89855
src/rocprof_compute_soc/analysis_configs/gfx950/0400_roofline.yaml: f53b2a92b3ac051290eff9b1f63343c30e6cd223b9cbf9d30a93ef4a5ff158b3
src/rocprof_compute_soc/analysis_configs/gfx908/0500_command_processor_cpc_cpf.yaml: 649bec27b9ccee34c96520c1f6bc0977779a8c4f8a58ee21ff59d61207962966
src/rocprof_compute_soc/analysis_configs/gfx90a/0500_command_processor_cpc_cpf.yaml: 649bec27b9ccee34c96520c1f6bc0977779a8c4f8a58ee21ff59d61207962966
src/rocprof_compute_soc/analysis_configs/gfx940/0500_command_processor_cpc_cpf.yaml: 649bec27b9ccee34c96520c1f6bc0977779a8c4f8a58ee21ff59d61207962966
src/rocprof_compute_soc/analysis_configs/gfx941/0500_command_processor_cpc_cpf.yaml: 649bec27b9ccee34c96520c1f6bc0977779a8c4f8a58ee21ff59d61207962966
src/rocprof_compute_soc/analysis_configs/gfx942/0500_command_processor_cpc_cpf.yaml: 649bec27b9ccee34c96520c1f6bc0977779a8c4f8a58ee21ff59d61207962966
src/rocprof_compute_soc/analysis_configs/gfx950/0500_command_processor_cpc_cpf.yaml: 1e4c1bc1158398df8966d24e56b7d434458ce10ade9e13f168887d9a0d9abaef
src/rocprof_compute_soc/analysis_configs/gfx908/0600_workgroup_manager_spi.yaml: 2bcb7045609e8ff023c9bfa384e63f6a2cc926ff3261f3eab6737f89a899809e
src/rocprof_compute_soc/analysis_configs/gfx90a/0600_workgroup_manager_spi.yaml: 2bcb7045609e8ff023c9bfa384e63f6a2cc926ff3261f3eab6737f89a899809e
src/rocprof_compute_soc/analysis_configs/gfx940/0600_workgroup_manager_spi.yaml: 2bcb7045609e8ff023c9bfa384e63f6a2cc926ff3261f3eab6737f89a899809e
src/rocprof_compute_soc/analysis_configs/gfx941/0600_workgroup_manager_spi.yaml: 2bcb7045609e8ff023c9bfa384e63f6a2cc926ff3261f3eab6737f89a899809e
src/rocprof_compute_soc/analysis_configs/gfx942/0600_workgroup_manager_spi.yaml: 2bcb7045609e8ff023c9bfa384e63f6a2cc926ff3261f3eab6737f89a899809e
src/rocprof_compute_soc/analysis_configs/gfx950/0600_workgroup_manager_spi.yaml: 6d97f3ebf3bef1d164255d4c4979e43d7f313f1eda067324aad9be06be98f090
src/rocprof_compute_soc/analysis_configs/gfx908/0700_wavefront.yaml: da9fb740f9dfafa43c8d0401d22082915d1c04021e07fb8003ac1f31005e282b
src/rocprof_compute_soc/analysis_configs/gfx90a/0700_wavefront.yaml: da9fb740f9dfafa43c8d0401d22082915d1c04021e07fb8003ac1f31005e282b
src/rocprof_compute_soc/analysis_configs/gfx940/0700_wavefront.yaml: da9fb740f9dfafa43c8d0401d22082915d1c04021e07fb8003ac1f31005e282b
src/rocprof_compute_soc/analysis_configs/gfx941/0700_wavefront.yaml: da9fb740f9dfafa43c8d0401d22082915d1c04021e07fb8003ac1f31005e282b
src/rocprof_compute_soc/analysis_configs/gfx942/0700_wavefront.yaml: da9fb740f9dfafa43c8d0401d22082915d1c04021e07fb8003ac1f31005e282b
src/rocprof_compute_soc/analysis_configs/gfx950/0700_wavefront.yaml: a6012921ec2e5984861d34ebfca416703b00f3b2cd4cb07541378a285a58b778
src/rocprof_compute_soc/analysis_configs/gfx908/1000_compute_units_instruction_mix.yaml: 82ef2f27395f2887d1385a33b1d4bcb7cb646ece11146fe1238af2a2fc49108f
src/rocprof_compute_soc/analysis_configs/gfx90a/1000_compute_units_instruction_mix.yaml: e58c1dff540e06ec3021ae4e852cec5a116e978f00f3e0902b74b5d86f1b88ac
src/rocprof_compute_soc/analysis_configs/gfx940/1000_compute_units_instruction_mix.yaml: c74ada0b2cd9eda1e1115679267343e7afad9c9638b3a54b3f98193ae9637e09
src/rocprof_compute_soc/analysis_configs/gfx941/1000_compute_units_instruction_mix.yaml: c74ada0b2cd9eda1e1115679267343e7afad9c9638b3a54b3f98193ae9637e09
src/rocprof_compute_soc/analysis_configs/gfx942/1000_compute_units_instruction_mix.yaml: c74ada0b2cd9eda1e1115679267343e7afad9c9638b3a54b3f98193ae9637e09
src/rocprof_compute_soc/analysis_configs/gfx950/1000_compute_units_instruction_mix.yaml: a0fe88305b0972c0702e542558c0d491eac26438577660e58817e988b7b1f0d4
src/rocprof_compute_soc/analysis_configs/gfx908/1100_compute_units_compute_pipeline.yaml: e815205890d9c815f7f53cdaa64eeef6219bce83054b92fa2be25e240093bdb0
src/rocprof_compute_soc/analysis_configs/gfx90a/1100_compute_units_compute_pipeline.yaml: b44f500ee07856ec8c59afa1ebb0a204d8b5f3247a43725ba16782484fef6ad1
src/rocprof_compute_soc/analysis_configs/gfx940/1100_compute_units_compute_pipeline.yaml: e493741974eae65d88afd4fa98b6b3089fb483900b17af2630be18160964d80c
src/rocprof_compute_soc/analysis_configs/gfx941/1100_compute_units_compute_pipeline.yaml: e493741974eae65d88afd4fa98b6b3089fb483900b17af2630be18160964d80c
src/rocprof_compute_soc/analysis_configs/gfx942/1100_compute_units_compute_pipeline.yaml: e493741974eae65d88afd4fa98b6b3089fb483900b17af2630be18160964d80c
src/rocprof_compute_soc/analysis_configs/gfx950/1100_compute_units_compute_pipeline.yaml: 4797cd3052fdb37278aa9a28572287c1a9a7228f05a77ce22c0eb4786cbbd404
src/rocprof_compute_soc/analysis_configs/gfx908/1200_local_data_share_lds.yaml: 307733f9fee02c620558e2ee4ca3978954f62c3fab26cc98766511b93e96d54c
src/rocprof_compute_soc/analysis_configs/gfx90a/1200_local_data_share_lds.yaml: 307733f9fee02c620558e2ee4ca3978954f62c3fab26cc98766511b93e96d54c
src/rocprof_compute_soc/analysis_configs/gfx940/1200_local_data_share_lds.yaml: 307733f9fee02c620558e2ee4ca3978954f62c3fab26cc98766511b93e96d54c
src/rocprof_compute_soc/analysis_configs/gfx941/1200_local_data_share_lds.yaml: 307733f9fee02c620558e2ee4ca3978954f62c3fab26cc98766511b93e96d54c
src/rocprof_compute_soc/analysis_configs/gfx942/1200_local_data_share_lds.yaml: 307733f9fee02c620558e2ee4ca3978954f62c3fab26cc98766511b93e96d54c
src/rocprof_compute_soc/analysis_configs/gfx950/1200_local_data_share_lds.yaml: 35c98741e9b5afd2f7638d2675b22138f5854168e15bc4633112857ed94edbc1
src/rocprof_compute_soc/analysis_configs/gfx908/1300_instruction_cache.yaml: d2b0a455e9f28d66e6cef701d598072285c58eeebea2d08e1864a8602cdd797a
src/rocprof_compute_soc/analysis_configs/gfx90a/1300_instruction_cache.yaml: d2b0a455e9f28d66e6cef701d598072285c58eeebea2d08e1864a8602cdd797a
src/rocprof_compute_soc/analysis_configs/gfx940/1300_instruction_cache.yaml: d2b0a455e9f28d66e6cef701d598072285c58eeebea2d08e1864a8602cdd797a
src/rocprof_compute_soc/analysis_configs/gfx941/1300_instruction_cache.yaml: d2b0a455e9f28d66e6cef701d598072285c58eeebea2d08e1864a8602cdd797a
src/rocprof_compute_soc/analysis_configs/gfx942/1300_instruction_cache.yaml: d2b0a455e9f28d66e6cef701d598072285c58eeebea2d08e1864a8602cdd797a
src/rocprof_compute_soc/analysis_configs/gfx950/1300_instruction_cache.yaml: d2b0a455e9f28d66e6cef701d598072285c58eeebea2d08e1864a8602cdd797a
src/rocprof_compute_soc/analysis_configs/gfx908/1400_scalar_l1_data_cache.yaml: ccbe9a1309177db8760727f256cd14a7612708833828068dd2bede73ad319d75
src/rocprof_compute_soc/analysis_configs/gfx90a/1400_scalar_l1_data_cache.yaml: ccbe9a1309177db8760727f256cd14a7612708833828068dd2bede73ad319d75
src/rocprof_compute_soc/analysis_configs/gfx940/1400_scalar_l1_data_cache.yaml: ccbe9a1309177db8760727f256cd14a7612708833828068dd2bede73ad319d75
src/rocprof_compute_soc/analysis_configs/gfx941/1400_scalar_l1_data_cache.yaml: ccbe9a1309177db8760727f256cd14a7612708833828068dd2bede73ad319d75
src/rocprof_compute_soc/analysis_configs/gfx942/1400_scalar_l1_data_cache.yaml: ccbe9a1309177db8760727f256cd14a7612708833828068dd2bede73ad319d75
src/rocprof_compute_soc/analysis_configs/gfx950/1400_scalar_l1_data_cache.yaml: ccbe9a1309177db8760727f256cd14a7612708833828068dd2bede73ad319d75
src/rocprof_compute_soc/analysis_configs/gfx908/1500_address_processing_unit_and_data_return_path_ta_td.yaml: b98a800c31da0275704e076e561468dccdaf0b8bff1cc8d74a4e6bf9c7be2973
src/rocprof_compute_soc/analysis_configs/gfx90a/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 58834a04fc4fb6f9eb648a6b8944f737ce4a8c9d4a6c5f75104d9fd528f520a6
src/rocprof_compute_soc/analysis_configs/gfx940/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 7f37bcd01557a45aa5ed9009962a9f2499ad924a6a07d7d25a3af97138f360f8
src/rocprof_compute_soc/analysis_configs/gfx941/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 7f37bcd01557a45aa5ed9009962a9f2499ad924a6a07d7d25a3af97138f360f8
src/rocprof_compute_soc/analysis_configs/gfx942/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 7f37bcd01557a45aa5ed9009962a9f2499ad924a6a07d7d25a3af97138f360f8
src/rocprof_compute_soc/analysis_configs/gfx950/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 5c6555a93b01c057f01e0b0cef3169eeb324ca8c256c42f5f9fc0d1ea131486b
src/rocprof_compute_soc/analysis_configs/gfx908/1600_vector_l1_data_cache.yaml: 4fcb618450366a29c09e428368e1a9afd29a0b80ec3f03a5b3d55a2111bd5704
src/rocprof_compute_soc/analysis_configs/gfx90a/1600_vector_l1_data_cache.yaml: 4fcb618450366a29c09e428368e1a9afd29a0b80ec3f03a5b3d55a2111bd5704
src/rocprof_compute_soc/analysis_configs/gfx940/1600_vector_l1_data_cache.yaml: 1c25e20d701aff1ab9276a29cfd5f219b24c621b534aa5b86d1b78d2ae2f300a
src/rocprof_compute_soc/analysis_configs/gfx941/1600_vector_l1_data_cache.yaml: 1c25e20d701aff1ab9276a29cfd5f219b24c621b534aa5b86d1b78d2ae2f300a
src/rocprof_compute_soc/analysis_configs/gfx942/1600_vector_l1_data_cache.yaml: 1c25e20d701aff1ab9276a29cfd5f219b24c621b534aa5b86d1b78d2ae2f300a
src/rocprof_compute_soc/analysis_configs/gfx950/1600_vector_l1_data_cache.yaml: 3cec51c5a848c4f513c4c0a74aa35a5657289148a67179f8db4ea3e55bdb6ac3
src/rocprof_compute_soc/analysis_configs/gfx908/1700_l2_cache.yaml: e37693ef03caf3d77ae7b91c3c166d033fa0732880cc50a21b8c06a4e79b1f38
src/rocprof_compute_soc/analysis_configs/gfx90a/1700_l2_cache.yaml: 3314a1e473b1cfc95b742b1a8cfbc47d4602061ca89d7a4ac89ea7cc15908962
src/rocprof_compute_soc/analysis_configs/gfx940/1700_l2_cache.yaml: cb8922a41dd2088e8e2b0c1e82c7b95fa55304cf90435b217da128234805d77a
src/rocprof_compute_soc/analysis_configs/gfx941/1700_l2_cache.yaml: 2187f141480a2c57b271ded46255735510de5197441de830cf1efa9345e5566a
src/rocprof_compute_soc/analysis_configs/gfx942/1700_l2_cache.yaml: 7ce34989a66b8f8750cf1bf76f5cdaf59bf662a7205355f6fe12cace796d4ceb
src/rocprof_compute_soc/analysis_configs/gfx950/1700_l2_cache.yaml: a3a8db0f555cd1069a61dfc3b89df83e9423d4a0200f1401c7612942ff75152e
src/rocprof_compute_soc/analysis_configs/gfx908/1800_l2_cache_per_channel.yaml: fd32454bf9f0d3027c77a85ea6be308e92f6815d0ea732c6bafacc8e0f32a25f
src/rocprof_compute_soc/analysis_configs/gfx90a/1800_l2_cache_per_channel.yaml: 23e9a258ab541d24d29cde2237f9445db695e7a4d17d5974cb4fd5ff9a9869c0
src/rocprof_compute_soc/analysis_configs/gfx940/1800_l2_cache_per_channel.yaml: eb0823823506bfe0d40931fd69c435baab4979d2dfee158dc33c3651721f9f33
src/rocprof_compute_soc/analysis_configs/gfx941/1800_l2_cache_per_channel.yaml: eb0823823506bfe0d40931fd69c435baab4979d2dfee158dc33c3651721f9f33
src/rocprof_compute_soc/analysis_configs/gfx942/1800_l2_cache_per_channel.yaml: eb0823823506bfe0d40931fd69c435baab4979d2dfee158dc33c3651721f9f33
src/rocprof_compute_soc/analysis_configs/gfx950/1800_l2_cache_per_channel.yaml: b6336ab78a97fb9750e2f925893a5acc4e66e43ac60472c20225e56c440983d7
src/rocprof_compute_soc/analysis_configs/gfx908/2100_pc_sampling.yaml: efa16d3aadc3363bcc067895a978b82c1e06fa90882dfe33e03315e2c0425d36
src/rocprof_compute_soc/analysis_configs/gfx90a/2100_pc_sampling.yaml: efa16d3aadc3363bcc067895a978b82c1e06fa90882dfe33e03315e2c0425d36
src/rocprof_compute_soc/analysis_configs/gfx940/2100_pc_sampling.yaml: efa16d3aadc3363bcc067895a978b82c1e06fa90882dfe33e03315e2c0425d36
src/rocprof_compute_soc/analysis_configs/gfx941/2100_pc_sampling.yaml: efa16d3aadc3363bcc067895a978b82c1e06fa90882dfe33e03315e2c0425d36
src/rocprof_compute_soc/analysis_configs/gfx942/2100_pc_sampling.yaml: efa16d3aadc3363bcc067895a978b82c1e06fa90882dfe33e03315e2c0425d36
src/rocprof_compute_soc/analysis_configs/gfx950/2100_pc_sampling.yaml: efa16d3aadc3363bcc067895a978b82c1e06fa90882dfe33e03315e2c0425d36
src/rocprof_compute_soc/profile_configs/sets/gfx908_sets.yaml: ee28989e70d0537db8b0f0a4bc5499444b44ff0e73d3e7f2926943be11d0aeda
src/rocprof_compute_soc/profile_configs/sets/gfx90a_sets.yaml: 9c9533174a3f7bd5c8e09ec998743c7bb2642c4ce3f818b546673be9cafc40a8
src/rocprof_compute_soc/profile_configs/sets/gfx940_sets.yaml: 44cd2b32b050cafa73d0ead5703b82836edf25a057c21699046b6b8b8918b242
src/rocprof_compute_soc/profile_configs/sets/gfx941_sets.yaml: 44cd2b32b050cafa73d0ead5703b82836edf25a057c21699046b6b8b8918b242
src/rocprof_compute_soc/profile_configs/sets/gfx942_sets.yaml: 44cd2b32b050cafa73d0ead5703b82836edf25a057c21699046b6b8b8918b242
src/rocprof_compute_soc/profile_configs/sets/gfx950_sets.yaml: 238d9dc8a98cfead3fc904885bfe413e5bcb4f1af31e9820cd640388bcd1e1c2
docs/data/metrics_description.yaml: 12164b43dab4a1088f90763a80ffc8feb38aa82fd7b767edf8f65bd304f22162
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from tools/unified_config.yaml. Generated by tools/split_config.py
{}
+71 -71
Просмотреть файл
@@ -5,19 +5,19 @@
"files": {
"0000_top_stats.yaml": "2819d96f5b1c3704f2ac50868a246a7f",
"0100_system_info.yaml": "cefae2b10db8cf4b0d3a971cff5e82c8",
"0200_system_speed_of_light.yaml": "c54676a8a385c02be50fcf09a721bef6",
"0300_memory_chart.yaml": "f952fe7de6d86cb22f6f8ce34867905f",
"0400_roofline.yaml": "02ca6cf3583f2718ab371bbbfdd8cfef",
"0500_command_processor_cpc_cpf.yaml": "93174ba73bf014c143e179719c110db6",
"0600_workgroup_manager_spi.yaml": "7364c8431929891d587e6f9b96ddce10",
"0700_wavefront.yaml": "5cc88d7743cba8c638491d97725f6778",
"0200_system_speed_of_light.yaml": "c4878ac57b7b7b4b5711672cb2f6dffc",
"0300_memory_chart.yaml": "221c6d2bb50a4f4177585b9988f88c7b",
"0400_roofline.yaml": "bad8d851694ff9a140e29a148a35fa50",
"0500_command_processor_cpc_cpf.yaml": "d8f424ec3fcfa4b2fcee2ad5e6456531",
"0600_workgroup_manager_spi.yaml": "8b6a89de516bed5821a9849627ad634a",
"0700_wavefront.yaml": "84ecb62efa87d87288bb63f7e3871028",
"1000_compute_units_instruction_mix.yaml": "e96eccdcb0e5d28b292107c0f68ec845",
"1100_compute_units_compute_pipeline.yaml": "8f61973d0d08bf49895b5dfe32d05c09",
"1200_local_data_share_lds.yaml": "97be647681c51e762e774eb91e8283fc",
"1300_instruction_cache.yaml": "dcab979ce17e30f2d48fd2734bf08e74",
"1400_scalar_l1_data_cache.yaml": "e90a514e7bb597ec1d22e238650c81d9",
"1200_local_data_share_lds.yaml": "14da923095596b4f0eeb82fd24bfd67c",
"1300_instruction_cache.yaml": "4b7696d75c93e55f7877e07770beda2d",
"1400_scalar_l1_data_cache.yaml": "ea6d0cdb6c34f574248f09554e976599",
"1500_address_processing_unit_and_data_return_path_ta_td.yaml": "645eb10a440eed62c6250a0f5a2407f3",
"1600_vector_l1_data_cache.yaml": "e3b8d1787003094ab7b8372da818ff1e",
"1600_vector_l1_data_cache.yaml": "1daa7d96605e8cdf4116bf3b10fb9969",
"1700_l2_cache.yaml": "38e7db4c404007c471864251dff30570",
"1800_l2_cache_per_channel.yaml": "7193043cd8eee47501cd8c0ae02b51e9",
"2100_pc_sampling.yaml": "8049866f25214544f1e53a9e2f08399b"
@@ -28,19 +28,19 @@
"files": {
"0000_top_stats.yaml": "2819d96f5b1c3704f2ac50868a246a7f",
"0100_system_info.yaml": "cefae2b10db8cf4b0d3a971cff5e82c8",
"0200_system_speed_of_light.yaml": "747b14ab50dd4d7689af7c268569b32a",
"0300_memory_chart.yaml": "0d6d094ad24cebf6e583e643beaae06e",
"0400_roofline.yaml": "632b16e1d251e57de0cf7237d3a89766",
"0500_command_processor_cpc_cpf.yaml": "93174ba73bf014c143e179719c110db6",
"0600_workgroup_manager_spi.yaml": "7364c8431929891d587e6f9b96ddce10",
"0700_wavefront.yaml": "5cc88d7743cba8c638491d97725f6778",
"1000_compute_units_instruction_mix.yaml": "af6304cce1fe38c119b1d17fa635265c",
"1100_compute_units_compute_pipeline.yaml": "c38ece6032d757f394c83ad9f93e0dce",
"1200_local_data_share_lds.yaml": "97be647681c51e762e774eb91e8283fc",
"1300_instruction_cache.yaml": "dcab979ce17e30f2d48fd2734bf08e74",
"1400_scalar_l1_data_cache.yaml": "e90a514e7bb597ec1d22e238650c81d9",
"0200_system_speed_of_light.yaml": "dc6a6e1a8513e2d32aecc055a958c639",
"0300_memory_chart.yaml": "a61f219fe063c4c4b0b9cbaf96389a8b",
"0400_roofline.yaml": "da1d514ed19ca2466c167e983bdb4f13",
"0500_command_processor_cpc_cpf.yaml": "d8f424ec3fcfa4b2fcee2ad5e6456531",
"0600_workgroup_manager_spi.yaml": "8b6a89de516bed5821a9849627ad634a",
"0700_wavefront.yaml": "84ecb62efa87d87288bb63f7e3871028",
"1000_compute_units_instruction_mix.yaml": "84bd6a22a29335a4851bba675614e103",
"1100_compute_units_compute_pipeline.yaml": "39429cd6af68f91f1b20630c1bab8cc7",
"1200_local_data_share_lds.yaml": "14da923095596b4f0eeb82fd24bfd67c",
"1300_instruction_cache.yaml": "4b7696d75c93e55f7877e07770beda2d",
"1400_scalar_l1_data_cache.yaml": "ea6d0cdb6c34f574248f09554e976599",
"1500_address_processing_unit_and_data_return_path_ta_td.yaml": "8005b28532601a759ace2f653d10da56",
"1600_vector_l1_data_cache.yaml": "e3b8d1787003094ab7b8372da818ff1e",
"1600_vector_l1_data_cache.yaml": "1daa7d96605e8cdf4116bf3b10fb9969",
"1700_l2_cache.yaml": "1630ae8fc504ea056e91bb19909d5629",
"1800_l2_cache_per_channel.yaml": "5ee4fd9c849670c301c4afee257acddd",
"2100_pc_sampling.yaml": "8049866f25214544f1e53a9e2f08399b"
@@ -51,19 +51,19 @@
"files": {
"0000_top_stats.yaml": "2819d96f5b1c3704f2ac50868a246a7f",
"0100_system_info.yaml": "cefae2b10db8cf4b0d3a971cff5e82c8",
"0200_system_speed_of_light.yaml": "74482aebb54b6d7b429c9ca605cb9951",
"0300_memory_chart.yaml": "e28ebc1340d2db1948c68225a6e008ff",
"0400_roofline.yaml": "1f3888778245e7eb05e769bda605588a",
"0500_command_processor_cpc_cpf.yaml": "93174ba73bf014c143e179719c110db6",
"0600_workgroup_manager_spi.yaml": "7364c8431929891d587e6f9b96ddce10",
"0700_wavefront.yaml": "5cc88d7743cba8c638491d97725f6778",
"1000_compute_units_instruction_mix.yaml": "ac290954de96988004b2a4be345a3a25",
"1100_compute_units_compute_pipeline.yaml": "470e3093ce9d53211923d3400e7e7bd7",
"1200_local_data_share_lds.yaml": "97be647681c51e762e774eb91e8283fc",
"1300_instruction_cache.yaml": "dcab979ce17e30f2d48fd2734bf08e74",
"1400_scalar_l1_data_cache.yaml": "e90a514e7bb597ec1d22e238650c81d9",
"0200_system_speed_of_light.yaml": "8b413c47f06f2e94b3faa723daac8edd",
"0300_memory_chart.yaml": "3d6c88ab2704dc4bd72a63e99fd68cf7",
"0400_roofline.yaml": "d4650e008f2e3a7d28871e8518153575",
"0500_command_processor_cpc_cpf.yaml": "d8f424ec3fcfa4b2fcee2ad5e6456531",
"0600_workgroup_manager_spi.yaml": "8b6a89de516bed5821a9849627ad634a",
"0700_wavefront.yaml": "84ecb62efa87d87288bb63f7e3871028",
"1000_compute_units_instruction_mix.yaml": "89c03d53fd7563da965ff3e4a1698b02",
"1100_compute_units_compute_pipeline.yaml": "0c3c36f6c2fed1f14476966295338e74",
"1200_local_data_share_lds.yaml": "14da923095596b4f0eeb82fd24bfd67c",
"1300_instruction_cache.yaml": "4b7696d75c93e55f7877e07770beda2d",
"1400_scalar_l1_data_cache.yaml": "ea6d0cdb6c34f574248f09554e976599",
"1500_address_processing_unit_and_data_return_path_ta_td.yaml": "12fe315acb3e06d4c16e4538f418f0ca",
"1600_vector_l1_data_cache.yaml": "006854a23925320b94727261f30680b7",
"1600_vector_l1_data_cache.yaml": "ebff7d80c601d03027476ae9fb16ecae",
"1700_l2_cache.yaml": "0987e21ac2547134fea87499dee01847",
"1800_l2_cache_per_channel.yaml": "ba5eeabcd749ecbb107c42de5ce69317",
"2100_pc_sampling.yaml": "8049866f25214544f1e53a9e2f08399b"
@@ -74,19 +74,19 @@
"files": {
"0000_top_stats.yaml": "2819d96f5b1c3704f2ac50868a246a7f",
"0100_system_info.yaml": "cefae2b10db8cf4b0d3a971cff5e82c8",
"0200_system_speed_of_light.yaml": "7ed2ceba47e232b4e39431228a254f7f",
"0300_memory_chart.yaml": "e28ebc1340d2db1948c68225a6e008ff",
"0400_roofline.yaml": "a80de496435c2c76eb4cfdc38d62155f",
"0500_command_processor_cpc_cpf.yaml": "93174ba73bf014c143e179719c110db6",
"0600_workgroup_manager_spi.yaml": "7364c8431929891d587e6f9b96ddce10",
"0700_wavefront.yaml": "5cc88d7743cba8c638491d97725f6778",
"1000_compute_units_instruction_mix.yaml": "ac290954de96988004b2a4be345a3a25",
"1100_compute_units_compute_pipeline.yaml": "470e3093ce9d53211923d3400e7e7bd7",
"1200_local_data_share_lds.yaml": "97be647681c51e762e774eb91e8283fc",
"1300_instruction_cache.yaml": "dcab979ce17e30f2d48fd2734bf08e74",
"1400_scalar_l1_data_cache.yaml": "e90a514e7bb597ec1d22e238650c81d9",
"0200_system_speed_of_light.yaml": "0ddeaefd245291c7f88674431efd74f6",
"0300_memory_chart.yaml": "3d6c88ab2704dc4bd72a63e99fd68cf7",
"0400_roofline.yaml": "c066a19bc0e00e692c34998e44c62387",
"0500_command_processor_cpc_cpf.yaml": "d8f424ec3fcfa4b2fcee2ad5e6456531",
"0600_workgroup_manager_spi.yaml": "8b6a89de516bed5821a9849627ad634a",
"0700_wavefront.yaml": "84ecb62efa87d87288bb63f7e3871028",
"1000_compute_units_instruction_mix.yaml": "89c03d53fd7563da965ff3e4a1698b02",
"1100_compute_units_compute_pipeline.yaml": "0c3c36f6c2fed1f14476966295338e74",
"1200_local_data_share_lds.yaml": "14da923095596b4f0eeb82fd24bfd67c",
"1300_instruction_cache.yaml": "4b7696d75c93e55f7877e07770beda2d",
"1400_scalar_l1_data_cache.yaml": "ea6d0cdb6c34f574248f09554e976599",
"1500_address_processing_unit_and_data_return_path_ta_td.yaml": "12fe315acb3e06d4c16e4538f418f0ca",
"1600_vector_l1_data_cache.yaml": "006854a23925320b94727261f30680b7",
"1600_vector_l1_data_cache.yaml": "ebff7d80c601d03027476ae9fb16ecae",
"1700_l2_cache.yaml": "05a86637744ad66f6491620c4ad659d2",
"1800_l2_cache_per_channel.yaml": "ba5eeabcd749ecbb107c42de5ce69317",
"2100_pc_sampling.yaml": "8049866f25214544f1e53a9e2f08399b"
@@ -97,19 +97,19 @@
"files": {
"0000_top_stats.yaml": "2819d96f5b1c3704f2ac50868a246a7f",
"0100_system_info.yaml": "cefae2b10db8cf4b0d3a971cff5e82c8",
"0200_system_speed_of_light.yaml": "74482aebb54b6d7b429c9ca605cb9951",
"0300_memory_chart.yaml": "e28ebc1340d2db1948c68225a6e008ff",
"0400_roofline.yaml": "f94c87dad18f87e5582566276a5c0cfc",
"0500_command_processor_cpc_cpf.yaml": "93174ba73bf014c143e179719c110db6",
"0600_workgroup_manager_spi.yaml": "7364c8431929891d587e6f9b96ddce10",
"0700_wavefront.yaml": "5cc88d7743cba8c638491d97725f6778",
"1000_compute_units_instruction_mix.yaml": "ac290954de96988004b2a4be345a3a25",
"1100_compute_units_compute_pipeline.yaml": "470e3093ce9d53211923d3400e7e7bd7",
"1200_local_data_share_lds.yaml": "97be647681c51e762e774eb91e8283fc",
"1300_instruction_cache.yaml": "dcab979ce17e30f2d48fd2734bf08e74",
"1400_scalar_l1_data_cache.yaml": "e90a514e7bb597ec1d22e238650c81d9",
"0200_system_speed_of_light.yaml": "8b413c47f06f2e94b3faa723daac8edd",
"0300_memory_chart.yaml": "3d6c88ab2704dc4bd72a63e99fd68cf7",
"0400_roofline.yaml": "318c3e774d41a639628a7f72c2462375",
"0500_command_processor_cpc_cpf.yaml": "d8f424ec3fcfa4b2fcee2ad5e6456531",
"0600_workgroup_manager_spi.yaml": "8b6a89de516bed5821a9849627ad634a",
"0700_wavefront.yaml": "84ecb62efa87d87288bb63f7e3871028",
"1000_compute_units_instruction_mix.yaml": "89c03d53fd7563da965ff3e4a1698b02",
"1100_compute_units_compute_pipeline.yaml": "0c3c36f6c2fed1f14476966295338e74",
"1200_local_data_share_lds.yaml": "14da923095596b4f0eeb82fd24bfd67c",
"1300_instruction_cache.yaml": "4b7696d75c93e55f7877e07770beda2d",
"1400_scalar_l1_data_cache.yaml": "ea6d0cdb6c34f574248f09554e976599",
"1500_address_processing_unit_and_data_return_path_ta_td.yaml": "12fe315acb3e06d4c16e4538f418f0ca",
"1600_vector_l1_data_cache.yaml": "006854a23925320b94727261f30680b7",
"1600_vector_l1_data_cache.yaml": "ebff7d80c601d03027476ae9fb16ecae",
"1700_l2_cache.yaml": "96e49399b26d00d88ad534a35c95304b",
"1800_l2_cache_per_channel.yaml": "ba5eeabcd749ecbb107c42de5ce69317",
"2100_pc_sampling.yaml": "8049866f25214544f1e53a9e2f08399b"
@@ -120,20 +120,20 @@
"files": {
"0000_top_stats.yaml": "2819d96f5b1c3704f2ac50868a246a7f",
"0100_system_info.yaml": "cefae2b10db8cf4b0d3a971cff5e82c8",
"0200_system_speed_of_light.yaml": "4a215bccc9378583a6e7e7733b601537",
"0300_memory_chart.yaml": "f19548711a687779df0c0b87a1df7a27",
"0400_roofline.yaml": "156c1a1d7a6c1e55aea25552334a84d5",
"0500_command_processor_cpc_cpf.yaml": "5b67ff80efbc2e1dffb7e3922499ca88",
"0600_workgroup_manager_spi.yaml": "63a7b6f7a4487fb87d67549214e08aac",
"0700_wavefront.yaml": "1ecfc3a91ec0cce6ed9eb94afae17aa9",
"1000_compute_units_instruction_mix.yaml": "7088fafcaa66a8ec48a9d3939cd7339a",
"1100_compute_units_compute_pipeline.yaml": "fce707e3f419ee2708676c8f7c325df5",
"1200_local_data_share_lds.yaml": "06bee89ddab210dbd122eaaedef0b29a",
"1300_instruction_cache.yaml": "dcab979ce17e30f2d48fd2734bf08e74",
"1400_scalar_l1_data_cache.yaml": "e90a514e7bb597ec1d22e238650c81d9",
"0200_system_speed_of_light.yaml": "a5ee49ce96bfab87128c856c827db870",
"0300_memory_chart.yaml": "e2401641a8f280fda308f87e5ad243df",
"0400_roofline.yaml": "2bd3b630b72d6d165c0d30cf481136a9",
"0500_command_processor_cpc_cpf.yaml": "3f7dab1663ad7a6fae3801aec2b1e8d0",
"0600_workgroup_manager_spi.yaml": "e6546a92d283fed5a5dc6df203efb670",
"0700_wavefront.yaml": "330468fd711057b422de9b952c5cfe69",
"1000_compute_units_instruction_mix.yaml": "c8bbdde1f29c9548a8e0ed7fcdd9ae04",
"1100_compute_units_compute_pipeline.yaml": "30e64960bbac4cc5626615a60240bd5f",
"1200_local_data_share_lds.yaml": "0e57c559dbcd5526e2e8006a47a69f4b",
"1300_instruction_cache.yaml": "4b7696d75c93e55f7877e07770beda2d",
"1400_scalar_l1_data_cache.yaml": "ea6d0cdb6c34f574248f09554e976599",
"1500_address_processing_unit_and_data_return_path_ta_td.yaml": "355a0c6b9b113fcfb686a300b78be21a",
"1600_vector_l1_data_cache.yaml": "68382e45c7a3c578df861d6285024803",
"1700_l2_cache.yaml": "f70f23b93e97b99327b5db3907eb133e",
"1600_vector_l1_data_cache.yaml": "689aba850739a9cbd64ce1e816e95dff",
"1700_l2_cache.yaml": "067f8c8a7264762fdc58a41728b4382b",
"1800_l2_cache_per_channel.yaml": "7e2a1809a9b7f70a088068d6689c8aa4",
"2100_pc_sampling.yaml": "8049866f25214544f1e53a9e2f08399b"
}
+2 -2
Просмотреть файл
@@ -97,7 +97,7 @@ Addition:
metric_descriptions:
New Metric:
plain: Description text
rst: |- # Optional
rst: >- # Optional
Description with :ref:`RST markup <link>`
Deletion:
@@ -231,7 +231,7 @@ Modification:
metric_descriptions:
Existing Metric:
plain: Updated description
rst: |-
rst: >-
Updated description with **RST**"
```
+7 -20
Просмотреть файл
@@ -114,11 +114,9 @@ def merge_docs_rst_as_default(descs: dict, docs_file: Path) -> dict:
for section, metrics in descs.items():
docs_section = docs.get(section) or {}
for metric_name, d in metrics.items():
# If panel didn't explicitly provide rst, inherit from docs
if not d.get("rst"):
doc_entry = docs_section.get(metric_name) or {}
if doc_entry.get("rst"):
d["rst"] = doc_entry["rst"]
doc_entry = docs_section.get(metric_name) or {}
if doc_entry.get("rst"):
d["rst"] = doc_entry["rst"]
return descs
@@ -129,10 +127,6 @@ def merge_units_as_default(descs: dict, docs_file: Path, per_arch_file: Path) ->
2) else from docs file,
3) else leave as-is (missing).
"""
per_arch: dict = {}
if per_arch_file.exists():
with open(per_arch_file, "r", encoding="utf-8") as f:
per_arch = yaml.safe_load(f) or {}
docs: dict = {}
if docs_file.exists():
@@ -140,18 +134,11 @@ def merge_units_as_default(descs: dict, docs_file: Path, per_arch_file: Path) ->
docs = yaml.safe_load(f) or {}
for section, metrics in descs.items():
psec = per_arch.get(section) or {}
dsec = docs.get(section) or {}
for metric, data in metrics.items():
# Only fill if panel did NOT explicitly set unit
if "unit" not in data or data["unit"] is None:
unit = None
if metric in psec and isinstance(psec[metric], dict):
unit = psec[metric].get("unit")
if unit is None and metric in dsec and isinstance(dsec[metric], dict):
unit = dsec[metric].get("unit")
if unit is not None:
data["unit"] = unit
doc_entry = dsec.get(metric)
if doc_entry and "unit" in doc_entry:
data["unit"] = doc_entry["unit"]
return descs
@@ -403,7 +390,7 @@ def sync_arch(
update_per_arch_metrics_file(arch_name, descriptions, per_arch_metrics_dir)
# 5) Only when latest: update docs, but overwrite 'rst' only for overrides
if is_latest:
if is_latest and (panel_rst_overrides or panel_unit_overrides):
if not update_docs_metrics_file(
descriptions,
docs_metrics_file,
+1 -1
Просмотреть файл
@@ -31,7 +31,7 @@ import yaml
def str_representer(dumper, data):
if "\n" in data:
return dumper.represent_scalar("tag:yaml.org,2002:str", data, style="|")
return dumper.represent_scalar("tag:yaml.org,2002:str", data, style=">")
return dumper.represent_scalar("tag:yaml.org,2002:str", data)
+41 -41
Просмотреть файл
@@ -1,20 +1,20 @@
System Speed-of-Light:
VALU FLOPs:
rst: |-
rst: >-
The total floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is also presented as a percent of the peak theoretical
FLOPs achievable on the specific accelerator. Note: this does not include
any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU IOPs:
rst: |-
rst: >-
The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
This is also presented as a percent of the peak theoretical IOPs achievable
on the specific accelerator. Note: this does not include any integer operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GOIPs
MFMA FLOPs (BF16):
rst: |-
rst: >-
The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any 16-bit brain
floating point operations from :ref:`VALU <desc-valu>` instructions. This
@@ -22,7 +22,7 @@ System Speed-of-Light:
achievable on the specific accelerator.
unit: GFLOPs
MFMA FLOPs (F16):
rst: |-
rst: >-
The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -30,7 +30,7 @@ System Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F32):
rst: |-
rst: >-
The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 32-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -38,7 +38,7 @@ System Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F64):
rst: |-
rst: >-
The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 64-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -46,7 +46,7 @@ System Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA IOPs (Int8):
rst: |-
rst: >-
The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
per second. Note: this does not include any 8-bit integer operations from
:ref:`VALU <desc-valu>` instructions. This is also presented as a percent
@@ -99,7 +99,7 @@ System Speed-of-Light:
over the :ref:`total active CU cycles <total-active-cu-cycles>`.
unit: Instructions per-cycle
Wavefront Occupancy:
rst: |-
rst: >-
The time-averaged number of wavefronts resident on the accelerator over
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1ms). This is also presented as a percent of the peak theoretical
@@ -144,7 +144,7 @@ System Speed-of-Light:
peak theoretical bandwidth achievable on the specific accelerator.
unit: GB/s
L2-Fabric Read BW:
rst: |-
rst: >-
The number of bytes read by the L2 over the :ref:`Infinity Fabric\u2122
interface <l2-fabric>` per unit time. This is also presented as a percent
of the peak theoretical bandwidth achievable on the specific accelerator.
@@ -229,19 +229,19 @@ Memory Chart:
rst: Total number of compute units (CUs) on the accelerator.
unit: CUs
VGPR:
rst: |-
rst: >-
The number of architected vector general-purpose registers allocated for the
kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
number of VGPRs requested by the compiler due to allocation granularity.
unit: VGPRs
SGPR:
rst: |-
rst: >-
The number of scalar general-purpose registers allocated for the kernel, see
:ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
SGPRs requested by the compiler due to allocation granularity.
unit: SGPRs
LDS Allocation:
rst: |-
rst: >-
The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
allocated for this kernel. Note: This may also be larger than what was requested
at compile time due to both allocation granularity and dynamic per-dispatch
@@ -420,28 +420,28 @@ Memory Chart:
unit: Requests per normalization unit
Roofline Performance Rates:
VALU FLOPs (F16):
rst: |-
rst: >-
The total 16-bit floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is presented with the value of the peak empirical F16 FLOPs achievable
on the specific accelerator. Note: this does not include any F16 operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU FLOPs (F32):
rst: |-
rst: >-
The total 32-bit floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is presented with the value of the peak empirical F32 FLOPs achievable
on the specific accelerator. Note: this does not include any F32 operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU FLOPs (F64):
rst: |-
rst: >-
The total 64-bit floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is presented with the value of the peak empirical F64 FLOPs achievable
on the specific accelerator. Note: this does not include any F64 operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
MFMA FLOPs (F64):
rst: |-
rst: >-
The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 64-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -449,7 +449,7 @@ Roofline Performance Rates:
displayed alongside for comparison.
unit: GFLOPs
MFMA FLOPs (F32):
rst: |-
rst: >-
The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 32-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -457,7 +457,7 @@ Roofline Performance Rates:
displayed alongside for comparison.
unit: GFLOPs
MFMA FLOPs (F16):
rst: |-
rst: >-
The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -465,7 +465,7 @@ Roofline Performance Rates:
displayed alongside for comparison.
unit: GFLOPs
MFMA FLOPs (BF16):
rst: |-
rst: >-
The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any 16-bit brain
floating point operations from :ref:`VALU <desc-valu>` instructions. The
@@ -473,7 +473,7 @@ Roofline Performance Rates:
accelerator is displayed alongside for comparison.
unit: GFLOPs
MFMA IOPs (Int8):
rst: |-
rst: >-
The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
per second. Note: this does not include any 8-bit integer operations from
:ref:`VALU <desc-valu>` instructions. The peak empirically measured INT8 MFMA
@@ -481,7 +481,7 @@ Roofline Performance Rates:
for comparison.
unit: GIOPs
HBM Bandwidth:
rst: |-
rst: >-
The total number of bytes read from and written to High-Bandwidth
Memory (HBM) per second. The peak empirically measured bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
@@ -512,28 +512,28 @@ Roofline Performance Rates:
unit: GB/s
Roofline Plot Points:
AI HBM:
rst: |-
rst: >-
The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
It is the ratio of total floating-point operations (FLOPs) to total bytes
transferred between HBM and the L2 cache. This value is used as the x-coordinate
for the HBM roofline.
unit: FLOPs/Byte
AI L2:
rst: |-
rst: >-
The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L2 cache and the L1 cache. This value is used as the x-coordinate for
the L2 roofline.
unit: FLOPs/Byte
AI L1:
rst: |-
rst: >-
The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L1 cache and the processing units. This value is used as the x-coordinate
for the L1 roofline.
unit: FLOPs/Byte
Performance (GFLOPs):
rst: |-
rst: >-
The overall achieved performance, measured in GigaFLOPs
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
operations divided by the total execution time. This value is used as the y-coordinate
@@ -591,7 +591,7 @@ Workgroup manager utilizations:
any work.
unit: Percent
Scheduler-Pipe Utilization:
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where the scheduler-pipes were actively doing any work. Note: this
value is expected to range between 0% and 25%. See :ref:`desc-spi`.
@@ -629,7 +629,7 @@ Workgroup manager utilizations:
unit: Cycles/wave
Workgroup Manager - Resource Allocation:
Not-scheduled Rate (Workgroup Manager):
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
due to a bottleneck within the workgroup manager rather than a lack of a
@@ -638,7 +638,7 @@ Workgroup Manager - Resource Allocation:
description.
unit: Percent
Not-scheduled Rate (Scheduler-Pipe):
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
due to a bottleneck within the scheduler-pipes rather than a lack of a CU
@@ -647,7 +647,7 @@ Workgroup Manager - Resource Allocation:
description.
unit: Percent
Scheduler-Pipe Stall Rate:
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
due to occupancy limitations (like a lack of a CU or :ref:`SIMD <desc-valu>`
@@ -711,7 +711,7 @@ Wavefront Launch Stats:
block size.
unit: Work-Items
Total Wavefronts:
rst: |-
rst: >-
The total number of wavefronts launched as part of the kernel dispatch.
On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront
size is always 64 work-items. Thus, the total number of wavefronts should
@@ -726,25 +726,25 @@ Wavefront Launch Stats:
<https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
unit: Wavefronts
VGPRs:
rst: |-
rst: >-
The number of architected vector general-purpose registers allocated for the
kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
number of VGPRs requested by the compiler due to allocation granularity.
unit: VGPRs
AGPRs:
rst: |-
rst: >-
The number of accumulation vector general-purpose registers allocated
for the kernel, see :ref:`AGPRs <desc-agprs>`. Note: this may not exactly match
the number of AGPRs requested by the compiler due to allocation granularity.
unit: AGPRs
SGPRs:
rst: |-
rst: >-
The number of scalar general-purpose registers allocated for the kernel, see
:ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
SGPRs requested by the compiler due to allocation granularity.
unit: SGPRs
LDS Allocation:
rst: |-
rst: >-
The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
allocated for this kernel. Note: This may also be larger than what was requested
at compile time due to both allocation granularity and dynamic per-dispatch
@@ -767,7 +767,7 @@ Wavefront Runtime Stats:
This is averaged over all wavefronts in a kernel dispatch.
unit: Instructions per wavefront
Wave Cycles:
rst: |-
rst: >-
The number of cycles a wavefront in the kernel dispatch spent resident
on a compute unit per :ref:`normalization unit <normalization-units>`. This is
averaged over all wavefronts in a kernel dispatch. Note: this should not
@@ -805,7 +805,7 @@ Wavefront Runtime Stats:
the total Wave Cycles metric.
unit: Cycles per normalization unit
Wavefront Occupancy:
rst: |-
rst: >-
The time-averaged number of wavefronts resident on the accelerator over the
lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1ms).
@@ -976,7 +976,7 @@ LDS Statistics:
to stalls from non-dword aligned addresses per :ref:`normalization unit <normalization-units>`.
unit: Cycles per normalization unit
Mem Violations:
rst: |-
rst: >-
The total number of out-of-bounds accesses made to the LDS, per :ref:`normalization
unit <normalization-units>`. This is unused and expected to be zero in
most configurations for modern CDNA\u2122 accelerators.
@@ -993,7 +993,7 @@ L1I Speed-of-Light:
over the number of all L1I requests.
unit: Percent
L1I-L2 Bandwidth Utilization:
rst: |-
rst: >-
The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
achieved. Calculated as the ratio of the total number of requests from
the L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`.
@@ -1100,7 +1100,7 @@ Scalar L1D cache accesses:
unit: Requests per normalization unit
Scalar L1D Cache - L2 Interface:
sL1D-L2 BW:
rst: |-
rst: >-
The total number of bytes read from, written to, or atomically updated
across the sL1D\u2194:doc:`L2 <l2-cache>` interface, divided by total duration.
Note that sL1D writes and atomics are typically
@@ -1122,7 +1122,7 @@ Scalar L1D Cache - L2 Interface:
CDNA accelerators.
unit: Requests per normalization unit
Stall Cycles:
rst: |-
rst: >-
The total number of cycles the sL1D\u2194 :doc:`L2 <l2-cache>` interface
was stalled, per :ref:`normalization unit <normalization-units>`.
unit: Cycles per normalization unit
@@ -1432,7 +1432,7 @@ L1 Unified Translation Cache (UTCL1):
translation not being present in the cache, per :ref:`normalization unit <normalization-units>`.
unit: unit
Permission Misses:
rst: |-
rst: >-
The total number of translation requests that missed in the UTCL1 due
to a permission error, per :ref:`normalization unit <normalization-units>`.
This is unused and expected to be zero in most configurations for modern
+51 -51
Просмотреть файл
@@ -1,20 +1,20 @@
System Speed-of-Light:
VALU FLOPs:
rst: |-
rst: >-
The total floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is also presented as a percent of the peak theoretical
FLOPs achievable on the specific accelerator. Note: this does not include
any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU IOPs:
rst: |-
rst: >-
The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
This is also presented as a percent of the peak theoretical IOPs achievable
on the specific accelerator. Note: this does not include any integer operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GOIPs
MFMA FLOPs (BF16):
rst: |-
rst: >-
The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any 16-bit brain
floating point operations from :ref:`VALU <desc-valu>` instructions. This
@@ -22,7 +22,7 @@ System Speed-of-Light:
achievable on the specific accelerator.
unit: GFLOPs
MFMA FLOPs (F16):
rst: |-
rst: >-
The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -30,7 +30,7 @@ System Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F32):
rst: |-
rst: >-
The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 32-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -38,7 +38,7 @@ System Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F64):
rst: |-
rst: >-
The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 64-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -46,7 +46,7 @@ System Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA IOPs (Int8):
rst: |-
rst: >-
The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
per second. Note: this does not include any 8-bit integer operations from
:ref:`VALU <desc-valu>` instructions. This is also presented as a percent
@@ -99,7 +99,7 @@ System Speed-of-Light:
over the :ref:`total active CU cycles <total-active-cu-cycles>`.
unit: Instructions per-cycle
Wavefront Occupancy:
rst: |-
rst: >-
The time-averaged number of wavefronts resident on the accelerator over
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1ms). This is also presented as a percent of the peak theoretical
@@ -144,7 +144,7 @@ System Speed-of-Light:
peak theoretical bandwidth achievable on the specific accelerator.
unit: GB/s
L2-Fabric Read BW:
rst: |-
rst: >-
The number of bytes read by the L2 over the :ref:`Infinity Fabric\u2122
interface <l2-fabric>` per unit time. This is also presented as a percent
of the peak theoretical bandwidth achievable on the specific accelerator.
@@ -229,19 +229,19 @@ Memory Chart:
rst: Total number of compute units (CUs) on the accelerator.
unit: CUs
VGPR:
rst: |-
rst: >-
The number of architected vector general-purpose registers allocated for the
kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
number of VGPRs requested by the compiler due to allocation granularity.
unit: VGPRs
SGPR:
rst: |-
rst: >-
The number of scalar general-purpose registers allocated for the kernel, see
:ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
SGPRs requested by the compiler due to allocation granularity.
unit: SGPRs
LDS Allocation:
rst: |-
rst: >-
The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
allocated for this kernel. Note: This may also be larger than what was requested
at compile time due to both allocation granularity and dynamic per-dispatch
@@ -420,28 +420,28 @@ Memory Chart:
unit: Requests per normalization unit
Roofline Performance Rates:
VALU FLOPs (F16):
rst: |-
rst: >-
The total 16-bit floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is presented with the value of the peak empirical F16 FLOPs achievable
on the specific accelerator. Note: this does not include any F16 operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU FLOPs (F32):
rst: |-
rst: >-
The total 32-bit floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is presented with the value of the peak empirical F32 FLOPs achievable
on the specific accelerator. Note: this does not include any F32 operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU FLOPs (F64):
rst: |-
rst: >-
The total 64-bit floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is presented with the value of the peak empirical F64 FLOPs achievable
on the specific accelerator. Note: this does not include any F64 operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
MFMA FLOPs (F64):
rst: |-
rst: >-
The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 64-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -449,7 +449,7 @@ Roofline Performance Rates:
displayed alongside for comparison.
unit: GFLOPs
MFMA FLOPs (F32):
rst: |-
rst: >-
The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 32-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -457,7 +457,7 @@ Roofline Performance Rates:
displayed alongside for comparison.
unit: GFLOPs
MFMA FLOPs (F16):
rst: |-
rst: >-
The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -465,7 +465,7 @@ Roofline Performance Rates:
displayed alongside for comparison.
unit: GFLOPs
MFMA FLOPs (BF16):
rst: |-
rst: >-
The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any 16-bit brain
floating point operations from :ref:`VALU <desc-valu>` instructions. The
@@ -473,7 +473,7 @@ Roofline Performance Rates:
accelerator is displayed alongside for comparison.
unit: GFLOPs
MFMA IOPs (Int8):
rst: |-
rst: >-
The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
per second. Note: this does not include any 8-bit integer operations from
:ref:`VALU <desc-valu>` instructions. The peak empirically measured INT8 MFMA
@@ -481,7 +481,7 @@ Roofline Performance Rates:
for comparison.
unit: GIOPs
HBM Bandwidth:
rst: |-
rst: >-
The total number of bytes read from and written to High-Bandwidth
Memory (HBM) per second. The peak empirically measured bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
@@ -512,28 +512,28 @@ Roofline Performance Rates:
unit: GB/s
Roofline Plot Points:
AI HBM:
rst: |-
rst: >-
The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
It is the ratio of total floating-point operations (FLOPs) to total bytes
transferred between HBM and the L2 cache. This value is used as the x-coordinate
for the HBM roofline.
unit: FLOPs/Byte
AI L2:
rst: |-
rst: >-
The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L2 cache and the L1 cache. This value is used as the x-coordinate for
the L2 roofline.
unit: FLOPs/Byte
AI L1:
rst: |-
rst: >-
The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L1 cache and the processing units. This value is used as the x-coordinate
for the L1 roofline.
unit: FLOPs/Byte
Performance (GFLOPs):
rst: |-
rst: >-
The overall achieved performance, measured in GigaFLOPs
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
operations divided by the total execution time. This value is used as the y-coordinate
@@ -591,7 +591,7 @@ Workgroup manager utilizations:
any work.
unit: Percent
Scheduler-Pipe Utilization:
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where the scheduler-pipes were actively doing any work. Note: this
value is expected to range between 0% and 25%. See :ref:`desc-spi`.
@@ -629,7 +629,7 @@ Workgroup manager utilizations:
unit: Cycles/wave
Workgroup Manager - Resource Allocation:
Not-scheduled Rate (Workgroup Manager):
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
due to a bottleneck within the workgroup manager rather than a lack of a
@@ -638,7 +638,7 @@ Workgroup Manager - Resource Allocation:
description.
unit: Percent
Not-scheduled Rate (Scheduler-Pipe):
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
due to a bottleneck within the scheduler-pipes rather than a lack of a CU
@@ -647,7 +647,7 @@ Workgroup Manager - Resource Allocation:
description.
unit: Percent
Scheduler-Pipe Stall Rate:
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
due to occupancy limitations (like a lack of a CU or :ref:`SIMD <desc-valu>`
@@ -711,7 +711,7 @@ Wavefront Launch Stats:
block size.
unit: Work-Items
Total Wavefronts:
rst: |-
rst: >-
The total number of wavefronts launched as part of the kernel dispatch.
On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront
size is always 64 work-items. Thus, the total number of wavefronts should
@@ -726,25 +726,25 @@ Wavefront Launch Stats:
<https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
unit: Wavefronts
VGPRs:
rst: |-
rst: >-
The number of architected vector general-purpose registers allocated for the
kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
number of VGPRs requested by the compiler due to allocation granularity.
unit: VGPRs
AGPRs:
rst: |-
rst: >-
The number of accumulation vector general-purpose registers allocated
for the kernel, see :ref:`AGPRs <desc-agprs>`. Note: this may not exactly match
the number of AGPRs requested by the compiler due to allocation granularity.
unit: AGPRs
SGPRs:
rst: |-
rst: >-
The number of scalar general-purpose registers allocated for the kernel, see
:ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
SGPRs requested by the compiler due to allocation granularity.
unit: SGPRs
LDS Allocation:
rst: |-
rst: >-
The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
allocated for this kernel. Note: This may also be larger than what was requested
at compile time due to both allocation granularity and dynamic per-dispatch
@@ -767,7 +767,7 @@ Wavefront Runtime Stats:
This is averaged over all wavefronts in a kernel dispatch.
unit: Instructions per wavefront
Wave Cycles:
rst: |-
rst: >-
The number of cycles a wavefront in the kernel dispatch spent resident
on a compute unit per :ref:`normalization unit <normalization-units>`. This is
averaged over all wavefronts in a kernel dispatch. Note: this should not
@@ -805,7 +805,7 @@ Wavefront Runtime Stats:
the total Wave Cycles metric.
unit: Cycles per normalization unit
Wavefront Occupancy:
rst: |-
rst: >-
The time-averaged number of wavefronts resident on the accelerator over the
lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1ms).
@@ -907,7 +907,7 @@ VALU Arithmetic Instruction Mix:
unit <normalization-units>`.
unit: Instructions per normalization unit
Conversion:
rst: |-
rst: >-
The total number of type conversion instructions (such as converting data
to or from F32\u2194F64) issued to the VALU per :ref:`normalization unit
<normalization-units>`.
@@ -976,21 +976,21 @@ MFMA Arithmetic Instruction Mix:
unit: Instructions per normalization unit
Compute Speed-of-Light:
VALU FLOPs:
rst: |-
rst: >-
The total floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is also presented as a percent of the peak theoretical
FLOPs achievable on the specific accelerator. Note: this does not include
any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU IOPs:
rst: |-
rst: >-
The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
This is also presented as a percent of the peak theoretical IOPs achievable
on the specific accelerator. Note: this does not include any integer operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GIOPs
MFMA FLOPs (BF16):
rst: |-
rst: >-
The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit brain floating
point operations from :ref:`VALU <desc-valu>` instructions. This is also
@@ -998,7 +998,7 @@ Compute Speed-of-Light:
on the specific accelerator.
unit: GFLOPs
MFMA FLOPs (F16):
rst: |-
rst: >-
The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1006,7 +1006,7 @@ Compute Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F32):
rst: |-
rst: >-
The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 32-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1014,7 +1014,7 @@ Compute Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F64):
rst: |-
rst: >-
The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 64-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1026,7 +1026,7 @@ Compute Speed-of-Light:
achievable on the specific accelerator.
unit: GFLOPs
MFMA IOPs (INT8):
rst: |-
rst: >-
The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
per second. Note: this does not include any 8-bit integer operations from
:ref:`VALU <desc-valu>` instructions. This is also presented as a percent
@@ -1112,7 +1112,7 @@ Arithmetic Operations:
unit <normalization-units>`.
unit: FLOP per normalization unit
BF16 OPs:
rst: |-
rst: >-
The total number of 16-bit brain floating-point operations executed on
either the :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
unit <normalization-units>`. Note: on current CDNA accelerators, the VALU
@@ -1129,7 +1129,7 @@ Arithmetic Operations:
unit <normalization-units>`.
unit: FLOP per normalization unit
INT8 OPs:
rst: |-
rst: >-
The total number of 8-bit integer operations executed on either the :ref:`VALU
<desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
<normalization-units>`. Note: on current CDNA accelerators, the VALU has
@@ -1206,7 +1206,7 @@ LDS Statistics:
to stalls from non-dword aligned addresses per :ref:`normalization unit <normalization-units>`.
unit: Cycles per normalization unit
Mem Violations:
rst: |-
rst: >-
The total number of out-of-bounds accesses made to the LDS, per :ref:`normalization
unit <normalization-units>`. This is unused and expected to be zero in
most configurations for modern CDNA\u2122 accelerators.
@@ -1223,7 +1223,7 @@ L1I Speed-of-Light:
over the number of all L1I requests.
unit: Percent
L1I-L2 Bandwidth Utilization:
rst: |-
rst: >-
The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
achieved. Calculated as the ratio of the total number of requests from
the L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`.
@@ -1330,7 +1330,7 @@ Scalar L1D cache accesses:
unit: Requests per normalization unit
Scalar L1D Cache - L2 Interface:
sL1D-L2 BW:
rst: |-
rst: >-
The total number of bytes read from, written to, or atomically updated
across the sL1D\u2194:doc:`L2 <l2-cache>` interface, divided by total duration.
Note that sL1D writes and atomics are typically
@@ -1352,7 +1352,7 @@ Scalar L1D Cache - L2 Interface:
CDNA accelerators.
unit: Requests per normalization unit
Stall Cycles:
rst: |-
rst: >-
The total number of cycles the sL1D\u2194 :doc:`L2 <l2-cache>` interface
was stalled, per :ref:`normalization unit <normalization-units>`.
unit: Cycles per normalization unit
@@ -1676,7 +1676,7 @@ L1 Unified Translation Cache (UTCL1):
translation not being present in the cache, per :ref:`normalization unit <normalization-units>`.
unit: unit
Permission Misses:
rst: |-
rst: >-
The total number of translation requests that missed in the UTCL1 due
to a permission error, per :ref:`normalization unit <normalization-units>`.
This is unused and expected to be zero in most configurations for modern
+53 -53
Просмотреть файл
@@ -1,20 +1,20 @@
System Speed-of-Light:
VALU FLOPs:
rst: |-
rst: >-
The total floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is also presented as a percent of the peak theoretical
FLOPs achievable on the specific accelerator. Note: this does not include
any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU IOPs:
rst: |-
rst: >-
The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
This is also presented as a percent of the peak theoretical IOPs achievable
on the specific accelerator. Note: this does not include any integer operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GOIPs
MFMA FLOPs (F8):
rst: |-
rst: >-
The total number of 8-bit brain floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any 16-bit brain
floating point operations from :ref:`VALU <desc-valu>` instructions. This
@@ -23,7 +23,7 @@ System Speed-of-Light:
series and later only.
unit: GFLOPs
MFMA FLOPs (BF16):
rst: |-
rst: >-
The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any 16-bit brain
floating point operations from :ref:`VALU <desc-valu>` instructions. This
@@ -31,7 +31,7 @@ System Speed-of-Light:
achievable on the specific accelerator.
unit: GFLOPs
MFMA FLOPs (F16):
rst: |-
rst: >-
The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -39,7 +39,7 @@ System Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F32):
rst: |-
rst: >-
The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 32-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -47,7 +47,7 @@ System Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F64):
rst: |-
rst: >-
The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 64-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -55,7 +55,7 @@ System Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA IOPs (Int8):
rst: |-
rst: >-
The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
per second. Note: this does not include any 8-bit integer operations from
:ref:`VALU <desc-valu>` instructions. This is also presented as a percent
@@ -108,7 +108,7 @@ System Speed-of-Light:
over the :ref:`total active CU cycles <total-active-cu-cycles>`.
unit: Instructions per-cycle
Wavefront Occupancy:
rst: |-
rst: >-
The time-averaged number of wavefronts resident on the accelerator over
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1ms). This is also presented as a percent of the peak theoretical
@@ -153,7 +153,7 @@ System Speed-of-Light:
peak theoretical bandwidth achievable on the specific accelerator.
unit: GB/s
L2-Fabric Read BW:
rst: |-
rst: >-
The number of bytes read by the L2 over the :ref:`Infinity Fabric\u2122
interface <l2-fabric>` per unit time. This is also presented as a percent
of the peak theoretical bandwidth achievable on the specific accelerator.
@@ -238,19 +238,19 @@ Memory Chart:
rst: Total number of compute units (CUs) on the accelerator.
unit: CUs
VGPR:
rst: |-
rst: >-
The number of architected vector general-purpose registers allocated for the
kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
number of VGPRs requested by the compiler due to allocation granularity.
unit: VGPRs
SGPR:
rst: |-
rst: >-
The number of scalar general-purpose registers allocated for the kernel, see
:ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
SGPRs requested by the compiler due to allocation granularity.
unit: SGPRs
LDS Allocation:
rst: |-
rst: >-
The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
allocated for this kernel. Note: This may also be larger than what was requested
at compile time due to both allocation granularity and dynamic per-dispatch
@@ -419,28 +419,28 @@ Memory Chart:
unit: Requests per normalization unit
Roofline Performance Rates:
VALU FLOPs (F16):
rst: |-
rst: >-
The total 16-bit floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is presented with the value of the peak empirical F16 FLOPs achievable
on the specific accelerator. Note: this does not include any F16 operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU FLOPs (F32):
rst: |-
rst: >-
The total 32-bit floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is presented with the value of the peak empirical F32 FLOPs achievable
on the specific accelerator. Note: this does not include any F32 operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU FLOPs (F64):
rst: |-
rst: >-
The total 64-bit floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is presented with the value of the peak empirical F64 FLOPs achievable
on the specific accelerator. Note: this does not include any F64 operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
MFMA FLOPs (F64):
rst: |-
rst: >-
The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 64-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -448,7 +448,7 @@ Roofline Performance Rates:
displayed alongside for comparison.
unit: GFLOPs
MFMA FLOPs (F32):
rst: |-
rst: >-
The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 32-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -456,7 +456,7 @@ Roofline Performance Rates:
displayed alongside for comparison.
unit: GFLOPs
MFMA FLOPs (F16):
rst: |-
rst: >-
The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -464,7 +464,7 @@ Roofline Performance Rates:
displayed alongside for comparison.
unit: GFLOPs
MFMA FLOPs (BF16):
rst: |-
rst: >-
The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any 16-bit brain
floating point operations from :ref:`VALU <desc-valu>` instructions. The
@@ -472,7 +472,7 @@ Roofline Performance Rates:
accelerator is displayed alongside for comparison.
unit: GFLOPs
MFMA FLOPs (F8):
rst: |-
rst: >-
The total number of 8-bit brain floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any 16-bit brain
floating point operations from :ref:`VALU <desc-valu>` instructions. The
@@ -481,7 +481,7 @@ Roofline Performance Rates:
Instinct MI300 series and later only.
unit: GFLOPs
MFMA IOPs (Int8):
rst: |-
rst: >-
The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
per second. Note: this does not include any 8-bit integer operations from
:ref:`VALU <desc-valu>` instructions. The peak empirically measured INT8 MFMA
@@ -489,7 +489,7 @@ Roofline Performance Rates:
for comparison.
unit: GIOPs
HBM Bandwidth:
rst: |-
rst: >-
The total number of bytes read from and written to High-Bandwidth
Memory (HBM) per second. The peak empirically measured bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
@@ -520,28 +520,28 @@ Roofline Performance Rates:
unit: GB/s
Roofline Plot Points:
AI HBM:
rst: |-
rst: >-
The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
It is the ratio of total floating-point operations (FLOPs) to total bytes
transferred between HBM and the L2 cache. This value is used as the x-coordinate
for the HBM roofline.
unit: FLOPs/Byte
AI L2:
rst: |-
rst: >-
The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L2 cache and the L1 cache. This value is used as the x-coordinate for
the L2 roofline.
unit: FLOPs/Byte
AI L1:
rst: |-
rst: >-
The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L1 cache and the processing units. This value is used as the x-coordinate
for the L1 roofline.
unit: FLOPs/Byte
Performance (GFLOPs):
rst: |-
rst: >-
The overall achieved performance, measured in GigaFLOPs
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
operations divided by the total execution time. This value is used as the y-coordinate
@@ -599,7 +599,7 @@ Workgroup manager utilizations:
any work.
unit: Percent
Scheduler-Pipe Utilization:
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where the scheduler-pipes were actively doing any work. Note: this
value is expected to range between 0% and 25%. See :ref:`desc-spi`.
@@ -637,7 +637,7 @@ Workgroup manager utilizations:
unit: Cycles/wave
Workgroup Manager - Resource Allocation:
Not-scheduled Rate (Workgroup Manager):
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
due to a bottleneck within the workgroup manager rather than a lack of a
@@ -646,7 +646,7 @@ Workgroup Manager - Resource Allocation:
description.
unit: Percent
Not-scheduled Rate (Scheduler-Pipe):
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
due to a bottleneck within the scheduler-pipes rather than a lack of a CU
@@ -655,7 +655,7 @@ Workgroup Manager - Resource Allocation:
description.
unit: Percent
Scheduler-Pipe Stall Rate:
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
due to occupancy limitations (like a lack of a CU or :ref:`SIMD <desc-valu>`
@@ -719,7 +719,7 @@ Wavefront Launch Stats:
block size.
unit: Work-Items
Total Wavefronts:
rst: |-
rst: >-
The total number of wavefronts launched as part of the kernel dispatch.
On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront
size is always 64 work-items. Thus, the total number of wavefronts should
@@ -734,25 +734,25 @@ Wavefront Launch Stats:
<https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
unit: Wavefronts
VGPRs:
rst: |-
rst: >-
The number of architected vector general-purpose registers allocated for the
kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
number of VGPRs requested by the compiler due to allocation granularity.
unit: VGPRs
AGPRs:
rst: |-
rst: >-
The number of accumulation vector general-purpose registers allocated
for the kernel, see :ref:`AGPRs <desc-agprs>`. Note: this may not exactly match
the number of AGPRs requested by the compiler due to allocation granularity.
unit: AGPRs
SGPRs:
rst: |-
rst: >-
The number of scalar general-purpose registers allocated for the kernel, see
:ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
SGPRs requested by the compiler due to allocation granularity.
unit: SGPRs
LDS Allocation:
rst: |-
rst: >-
The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
allocated for this kernel. Note: This may also be larger than what was requested
at compile time due to both allocation granularity and dynamic per-dispatch
@@ -775,7 +775,7 @@ Wavefront Runtime Stats:
This is averaged over all wavefronts in a kernel dispatch.
unit: Instructions per wavefront
Wave Cycles:
rst: |-
rst: >-
The number of cycles a wavefront in the kernel dispatch spent resident
on a compute unit per :ref:`normalization unit <normalization-units>`. This is
averaged over all wavefronts in a kernel dispatch. Note: this should not
@@ -813,7 +813,7 @@ Wavefront Runtime Stats:
the total Wave Cycles metric.
unit: Cycles per normalization unit
Wavefront Occupancy:
rst: |-
rst: >-
The time-averaged number of wavefronts resident on the accelerator over the
lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1ms).
@@ -915,7 +915,7 @@ VALU Arithmetic Instruction Mix:
unit <normalization-units>`.
unit: Instructions per normalization unit
Conversion:
rst: |-
rst: >-
The total number of type conversion instructions (such as converting data
to or from F32\u2194F64) issued to the VALU per :ref:`normalization unit
<normalization-units>`.
@@ -989,14 +989,14 @@ MFMA Arithmetic Instruction Mix:
unit: Instructions per normalization unit
Compute Speed-of-Light:
VALU FLOPs:
rst: |-
rst: >-
The total floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is also presented as a percent of the peak theoretical
FLOPs achievable on the specific accelerator. Note: this does not include
any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU IOPs:
rst: |-
rst: >-
The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
This is also presented as a percent of the peak theoretical IOPs achievable
on the specific accelerator. Note: this does not include any integer operations
@@ -1006,7 +1006,7 @@ Compute Speed-of-Light:
rst: ''
unit: Unknown
MFMA FLOPs (BF16):
rst: |-
rst: >-
The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit brain floating
point operations from :ref:`VALU <desc-valu>` instructions. This is also
@@ -1014,7 +1014,7 @@ Compute Speed-of-Light:
on the specific accelerator.
unit: GFLOPs
MFMA FLOPs (F16):
rst: |-
rst: >-
The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1022,7 +1022,7 @@ Compute Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F32):
rst: |-
rst: >-
The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 32-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1030,7 +1030,7 @@ Compute Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F64):
rst: |-
rst: >-
The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 64-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1042,7 +1042,7 @@ Compute Speed-of-Light:
achievable on the specific accelerator.
unit: GFLOPs
MFMA IOPs (INT8):
rst: |-
rst: >-
The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
per second. Note: this does not include any 8-bit integer operations from
:ref:`VALU <desc-valu>` instructions. This is also presented as a percent
@@ -1131,7 +1131,7 @@ Arithmetic Operations:
unit <normalization-units>`.
unit: FLOP per normalization unit
BF16 OPs:
rst: |-
rst: >-
The total number of 16-bit brain floating-point operations executed on
either the :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
unit <normalization-units>`. Note: on current CDNA accelerators, the VALU
@@ -1148,7 +1148,7 @@ Arithmetic Operations:
unit <normalization-units>`.
unit: FLOP per normalization unit
INT8 OPs:
rst: |-
rst: >-
The total number of 8-bit integer operations executed on either the :ref:`VALU
<desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
<normalization-units>`. Note: on current CDNA accelerators, the VALU has
@@ -1225,7 +1225,7 @@ LDS Statistics:
to stalls from non-dword aligned addresses per :ref:`normalization unit <normalization-units>`.
unit: Cycles per normalization unit
Mem Violations:
rst: |-
rst: >-
The total number of out-of-bounds accesses made to the LDS, per :ref:`normalization
unit <normalization-units>`. This is unused and expected to be zero in
most configurations for modern CDNA\u2122 accelerators.
@@ -1242,7 +1242,7 @@ L1I Speed-of-Light:
over the number of all L1I requests.
unit: Percent
L1I-L2 Bandwidth Utilization:
rst: |-
rst: >-
The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
achieved. Calculated as the ratio of the total number of requests from
the L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`.
@@ -1349,7 +1349,7 @@ Scalar L1D cache accesses:
unit: Requests per normalization unit
Scalar L1D Cache - L2 Interface:
sL1D-L2 BW:
rst: |-
rst: >-
The total number of bytes read from, written to, or atomically updated
across the sL1D\u2194:doc:`L2 <l2-cache>` interface, divided by total duration.
Note that sL1D writes and atomics are typically
@@ -1371,7 +1371,7 @@ Scalar L1D Cache - L2 Interface:
CDNA accelerators.
unit: Requests per normalization unit
Stall Cycles:
rst: |-
rst: >-
The total number of cycles the sL1D\u2194 :doc:`L2 <l2-cache>` interface
was stalled, per :ref:`normalization unit <normalization-units>`.
unit: Cycles per normalization unit
@@ -1681,7 +1681,7 @@ L1 Unified Translation Cache (UTCL1):
translation not being present in the cache, per :ref:`normalization unit <normalization-units>`.
unit: unit
Permission Misses:
rst: |-
rst: >-
The total number of translation requests that missed in the UTCL1 due
to a permission error, per :ref:`normalization unit <normalization-units>`.
This is unused and expected to be zero in most configurations for modern
+53 -53
Просмотреть файл
@@ -1,20 +1,20 @@
System Speed-of-Light:
VALU FLOPs:
rst: |-
rst: >-
The total floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is also presented as a percent of the peak theoretical
FLOPs achievable on the specific accelerator. Note: this does not include
any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU IOPs:
rst: |-
rst: >-
The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
This is also presented as a percent of the peak theoretical IOPs achievable
on the specific accelerator. Note: this does not include any integer operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GOIPs
MFMA FLOPs (F8):
rst: |-
rst: >-
The total number of 8-bit brain floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any 16-bit brain
floating point operations from :ref:`VALU <desc-valu>` instructions. This
@@ -23,7 +23,7 @@ System Speed-of-Light:
series and later only.
unit: GFLOPs
MFMA FLOPs (BF16):
rst: |-
rst: >-
The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any 16-bit brain
floating point operations from :ref:`VALU <desc-valu>` instructions. This
@@ -31,7 +31,7 @@ System Speed-of-Light:
achievable on the specific accelerator.
unit: GFLOPs
MFMA FLOPs (F16):
rst: |-
rst: >-
The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -39,7 +39,7 @@ System Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F32):
rst: |-
rst: >-
The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 32-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -47,7 +47,7 @@ System Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F64):
rst: |-
rst: >-
The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 64-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -55,7 +55,7 @@ System Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA IOPs (Int8):
rst: |-
rst: >-
The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
per second. Note: this does not include any 8-bit integer operations from
:ref:`VALU <desc-valu>` instructions. This is also presented as a percent
@@ -108,7 +108,7 @@ System Speed-of-Light:
over the :ref:`total active CU cycles <total-active-cu-cycles>`.
unit: Instructions per-cycle
Wavefront Occupancy:
rst: |-
rst: >-
The time-averaged number of wavefronts resident on the accelerator over
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1ms). This is also presented as a percent of the peak theoretical
@@ -153,7 +153,7 @@ System Speed-of-Light:
peak theoretical bandwidth achievable on the specific accelerator.
unit: GB/s
L2-Fabric Read BW:
rst: |-
rst: >-
The number of bytes read by the L2 over the :ref:`Infinity Fabric\u2122
interface <l2-fabric>` per unit time. This is also presented as a percent
of the peak theoretical bandwidth achievable on the specific accelerator.
@@ -238,19 +238,19 @@ Memory Chart:
rst: Total number of compute units (CUs) on the accelerator.
unit: CUs
VGPR:
rst: |-
rst: >-
The number of architected vector general-purpose registers allocated for the
kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
number of VGPRs requested by the compiler due to allocation granularity.
unit: VGPRs
SGPR:
rst: |-
rst: >-
The number of scalar general-purpose registers allocated for the kernel, see
:ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
SGPRs requested by the compiler due to allocation granularity.
unit: SGPRs
LDS Allocation:
rst: |-
rst: >-
The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
allocated for this kernel. Note: This may also be larger than what was requested
at compile time due to both allocation granularity and dynamic per-dispatch
@@ -419,28 +419,28 @@ Memory Chart:
unit: Requests per normalization unit
Roofline Performance Rates:
VALU FLOPs (F16):
rst: |-
rst: >-
The total 16-bit floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is presented with the value of the peak empirical F16 FLOPs achievable
on the specific accelerator. Note: this does not include any F16 operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU FLOPs (F32):
rst: |-
rst: >-
The total 32-bit floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is presented with the value of the peak empirical F32 FLOPs achievable
on the specific accelerator. Note: this does not include any F32 operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU FLOPs (F64):
rst: |-
rst: >-
The total 64-bit floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is presented with the value of the peak empirical F64 FLOPs achievable
on the specific accelerator. Note: this does not include any F64 operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
MFMA FLOPs (F64):
rst: |-
rst: >-
The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 64-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -448,7 +448,7 @@ Roofline Performance Rates:
displayed alongside for comparison.
unit: GFLOPs
MFMA FLOPs (F32):
rst: |-
rst: >-
The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 32-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -456,7 +456,7 @@ Roofline Performance Rates:
displayed alongside for comparison.
unit: GFLOPs
MFMA FLOPs (F16):
rst: |-
rst: >-
The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -464,7 +464,7 @@ Roofline Performance Rates:
displayed alongside for comparison.
unit: GFLOPs
MFMA FLOPs (BF16):
rst: |-
rst: >-
The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any 16-bit brain
floating point operations from :ref:`VALU <desc-valu>` instructions. The
@@ -472,7 +472,7 @@ Roofline Performance Rates:
accelerator is displayed alongside for comparison.
unit: GFLOPs
MFMA FLOPs (F8):
rst: |-
rst: >-
The total number of 8-bit brain floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any 16-bit brain
floating point operations from :ref:`VALU <desc-valu>` instructions. The
@@ -481,7 +481,7 @@ Roofline Performance Rates:
Instinct MI300 series and later only.
unit: GFLOPs
MFMA IOPs (Int8):
rst: |-
rst: >-
The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
per second. Note: this does not include any 8-bit integer operations from
:ref:`VALU <desc-valu>` instructions. The peak empirically measured INT8 MFMA
@@ -489,7 +489,7 @@ Roofline Performance Rates:
for comparison.
unit: GIOPs
HBM Bandwidth:
rst: |-
rst: >-
The total number of bytes read from and written to High-Bandwidth
Memory (HBM) per second. The peak empirically measured bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
@@ -520,28 +520,28 @@ Roofline Performance Rates:
unit: GB/s
Roofline Plot Points:
AI HBM:
rst: |-
rst: >-
The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
It is the ratio of total floating-point operations (FLOPs) to total bytes
transferred between HBM and the L2 cache. This value is used as the x-coordinate
for the HBM roofline.
unit: FLOPs/Byte
AI L2:
rst: |-
rst: >-
The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L2 cache and the L1 cache. This value is used as the x-coordinate for
the L2 roofline.
unit: FLOPs/Byte
AI L1:
rst: |-
rst: >-
The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L1 cache and the processing units. This value is used as the x-coordinate
for the L1 roofline.
unit: FLOPs/Byte
Performance (GFLOPs):
rst: |-
rst: >-
The overall achieved performance, measured in GigaFLOPs
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
operations divided by the total execution time. This value is used as the y-coordinate
@@ -599,7 +599,7 @@ Workgroup manager utilizations:
any work.
unit: Percent
Scheduler-Pipe Utilization:
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where the scheduler-pipes were actively doing any work. Note: this
value is expected to range between 0% and 25%. See :ref:`desc-spi`.
@@ -637,7 +637,7 @@ Workgroup manager utilizations:
unit: Cycles/wave
Workgroup Manager - Resource Allocation:
Not-scheduled Rate (Workgroup Manager):
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
due to a bottleneck within the workgroup manager rather than a lack of a
@@ -646,7 +646,7 @@ Workgroup Manager - Resource Allocation:
description.
unit: Percent
Not-scheduled Rate (Scheduler-Pipe):
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
due to a bottleneck within the scheduler-pipes rather than a lack of a CU
@@ -655,7 +655,7 @@ Workgroup Manager - Resource Allocation:
description.
unit: Percent
Scheduler-Pipe Stall Rate:
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
due to occupancy limitations (like a lack of a CU or :ref:`SIMD <desc-valu>`
@@ -719,7 +719,7 @@ Wavefront Launch Stats:
block size.
unit: Work-Items
Total Wavefronts:
rst: |-
rst: >-
The total number of wavefronts launched as part of the kernel dispatch.
On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront
size is always 64 work-items. Thus, the total number of wavefronts should
@@ -734,25 +734,25 @@ Wavefront Launch Stats:
<https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
unit: Wavefronts
VGPRs:
rst: |-
rst: >-
The number of architected vector general-purpose registers allocated for the
kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
number of VGPRs requested by the compiler due to allocation granularity.
unit: VGPRs
AGPRs:
rst: |-
rst: >-
The number of accumulation vector general-purpose registers allocated
for the kernel, see :ref:`AGPRs <desc-agprs>`. Note: this may not exactly match
the number of AGPRs requested by the compiler due to allocation granularity.
unit: AGPRs
SGPRs:
rst: |-
rst: >-
The number of scalar general-purpose registers allocated for the kernel, see
:ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
SGPRs requested by the compiler due to allocation granularity.
unit: SGPRs
LDS Allocation:
rst: |-
rst: >-
The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
allocated for this kernel. Note: This may also be larger than what was requested
at compile time due to both allocation granularity and dynamic per-dispatch
@@ -775,7 +775,7 @@ Wavefront Runtime Stats:
This is averaged over all wavefronts in a kernel dispatch.
unit: Instructions per wavefront
Wave Cycles:
rst: |-
rst: >-
The number of cycles a wavefront in the kernel dispatch spent resident
on a compute unit per :ref:`normalization unit <normalization-units>`. This is
averaged over all wavefronts in a kernel dispatch. Note: this should not
@@ -813,7 +813,7 @@ Wavefront Runtime Stats:
the total Wave Cycles metric.
unit: Cycles per normalization unit
Wavefront Occupancy:
rst: |-
rst: >-
The time-averaged number of wavefronts resident on the accelerator over the
lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1ms).
@@ -915,7 +915,7 @@ VALU Arithmetic Instruction Mix:
unit <normalization-units>`.
unit: Instructions per normalization unit
Conversion:
rst: |-
rst: >-
The total number of type conversion instructions (such as converting data
to or from F32\u2194F64) issued to the VALU per :ref:`normalization unit
<normalization-units>`.
@@ -989,14 +989,14 @@ MFMA Arithmetic Instruction Mix:
unit: Instructions per normalization unit
Compute Speed-of-Light:
VALU FLOPs:
rst: |-
rst: >-
The total floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is also presented as a percent of the peak theoretical
FLOPs achievable on the specific accelerator. Note: this does not include
any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU IOPs:
rst: |-
rst: >-
The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
This is also presented as a percent of the peak theoretical IOPs achievable
on the specific accelerator. Note: this does not include any integer operations
@@ -1006,7 +1006,7 @@ Compute Speed-of-Light:
rst: ''
unit: Unknown
MFMA FLOPs (BF16):
rst: |-
rst: >-
The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit brain floating
point operations from :ref:`VALU <desc-valu>` instructions. This is also
@@ -1014,7 +1014,7 @@ Compute Speed-of-Light:
on the specific accelerator.
unit: GFLOPs
MFMA FLOPs (F16):
rst: |-
rst: >-
The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1022,7 +1022,7 @@ Compute Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F32):
rst: |-
rst: >-
The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 32-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1030,7 +1030,7 @@ Compute Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F64):
rst: |-
rst: >-
The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 64-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1042,7 +1042,7 @@ Compute Speed-of-Light:
achievable on the specific accelerator.
unit: GFLOPs
MFMA IOPs (INT8):
rst: |-
rst: >-
The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
per second. Note: this does not include any 8-bit integer operations from
:ref:`VALU <desc-valu>` instructions. This is also presented as a percent
@@ -1131,7 +1131,7 @@ Arithmetic Operations:
unit <normalization-units>`.
unit: FLOP per normalization unit
BF16 OPs:
rst: |-
rst: >-
The total number of 16-bit brain floating-point operations executed on
either the :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
unit <normalization-units>`. Note: on current CDNA accelerators, the VALU
@@ -1148,7 +1148,7 @@ Arithmetic Operations:
unit <normalization-units>`.
unit: FLOP per normalization unit
INT8 OPs:
rst: |-
rst: >-
The total number of 8-bit integer operations executed on either the :ref:`VALU
<desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
<normalization-units>`. Note: on current CDNA accelerators, the VALU has
@@ -1225,7 +1225,7 @@ LDS Statistics:
to stalls from non-dword aligned addresses per :ref:`normalization unit <normalization-units>`.
unit: Cycles per normalization unit
Mem Violations:
rst: |-
rst: >-
The total number of out-of-bounds accesses made to the LDS, per :ref:`normalization
unit <normalization-units>`. This is unused and expected to be zero in
most configurations for modern CDNA\u2122 accelerators.
@@ -1242,7 +1242,7 @@ L1I Speed-of-Light:
over the number of all L1I requests.
unit: Percent
L1I-L2 Bandwidth Utilization:
rst: |-
rst: >-
The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
achieved. Calculated as the ratio of the total number of requests from
the L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`.
@@ -1349,7 +1349,7 @@ Scalar L1D cache accesses:
unit: Requests per normalization unit
Scalar L1D Cache - L2 Interface:
sL1D-L2 BW:
rst: |-
rst: >-
The total number of bytes read from, written to, or atomically updated
across the sL1D\u2194:doc:`L2 <l2-cache>` interface, divided by total duration.
Note that sL1D writes and atomics are typically
@@ -1371,7 +1371,7 @@ Scalar L1D Cache - L2 Interface:
CDNA accelerators.
unit: Requests per normalization unit
Stall Cycles:
rst: |-
rst: >-
The total number of cycles the sL1D\u2194 :doc:`L2 <l2-cache>` interface
was stalled, per :ref:`normalization unit <normalization-units>`.
unit: Cycles per normalization unit
@@ -1681,7 +1681,7 @@ L1 Unified Translation Cache (UTCL1):
translation not being present in the cache, per :ref:`normalization unit <normalization-units>`.
unit: unit
Permission Misses:
rst: |-
rst: >-
The total number of translation requests that missed in the UTCL1 due
to a permission error, per :ref:`normalization unit <normalization-units>`.
This is unused and expected to be zero in most configurations for modern
+53 -53
Просмотреть файл
@@ -1,20 +1,20 @@
System Speed-of-Light:
VALU FLOPs:
rst: |-
rst: >-
The total floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is also presented as a percent of the peak theoretical
FLOPs achievable on the specific accelerator. Note: this does not include
any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU IOPs:
rst: |-
rst: >-
The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
This is also presented as a percent of the peak theoretical IOPs achievable
on the specific accelerator. Note: this does not include any integer operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GOIPs
MFMA FLOPs (F8):
rst: |-
rst: >-
The total number of 8-bit brain floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any 16-bit brain
floating point operations from :ref:`VALU <desc-valu>` instructions. This
@@ -23,7 +23,7 @@ System Speed-of-Light:
series and later only.
unit: GFLOPs
MFMA FLOPs (BF16):
rst: |-
rst: >-
The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any 16-bit brain
floating point operations from :ref:`VALU <desc-valu>` instructions. This
@@ -31,7 +31,7 @@ System Speed-of-Light:
achievable on the specific accelerator.
unit: GFLOPs
MFMA FLOPs (F16):
rst: |-
rst: >-
The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -39,7 +39,7 @@ System Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F32):
rst: |-
rst: >-
The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 32-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -47,7 +47,7 @@ System Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F64):
rst: |-
rst: >-
The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 64-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -55,7 +55,7 @@ System Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA IOPs (Int8):
rst: |-
rst: >-
The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
per second. Note: this does not include any 8-bit integer operations from
:ref:`VALU <desc-valu>` instructions. This is also presented as a percent
@@ -108,7 +108,7 @@ System Speed-of-Light:
over the :ref:`total active CU cycles <total-active-cu-cycles>`.
unit: Instructions per-cycle
Wavefront Occupancy:
rst: |-
rst: >-
The time-averaged number of wavefronts resident on the accelerator over
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1ms). This is also presented as a percent of the peak theoretical
@@ -153,7 +153,7 @@ System Speed-of-Light:
peak theoretical bandwidth achievable on the specific accelerator.
unit: GB/s
L2-Fabric Read BW:
rst: |-
rst: >-
The number of bytes read by the L2 over the :ref:`Infinity Fabric\u2122
interface <l2-fabric>` per unit time. This is also presented as a percent
of the peak theoretical bandwidth achievable on the specific accelerator.
@@ -238,19 +238,19 @@ Memory Chart:
rst: Total number of compute units (CUs) on the accelerator.
unit: CUs
VGPR:
rst: |-
rst: >-
The number of architected vector general-purpose registers allocated for the
kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
number of VGPRs requested by the compiler due to allocation granularity.
unit: VGPRs
SGPR:
rst: |-
rst: >-
The number of scalar general-purpose registers allocated for the kernel, see
:ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
SGPRs requested by the compiler due to allocation granularity.
unit: SGPRs
LDS Allocation:
rst: |-
rst: >-
The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
allocated for this kernel. Note: This may also be larger than what was requested
at compile time due to both allocation granularity and dynamic per-dispatch
@@ -419,28 +419,28 @@ Memory Chart:
unit: Requests per normalization unit
Roofline Performance Rates:
VALU FLOPs (F16):
rst: |-
rst: >-
The total 16-bit floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is presented with the value of the peak empirical F16 FLOPs achievable
on the specific accelerator. Note: this does not include any F16 operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU FLOPs (F32):
rst: |-
rst: >-
The total 32-bit floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is presented with the value of the peak empirical F32 FLOPs achievable
on the specific accelerator. Note: this does not include any F32 operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU FLOPs (F64):
rst: |-
rst: >-
The total 64-bit floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is presented with the value of the peak empirical F64 FLOPs achievable
on the specific accelerator. Note: this does not include any F64 operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
MFMA FLOPs (F64):
rst: |-
rst: >-
The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 64-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -448,7 +448,7 @@ Roofline Performance Rates:
displayed alongside for comparison.
unit: GFLOPs
MFMA FLOPs (F32):
rst: |-
rst: >-
The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 32-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -456,7 +456,7 @@ Roofline Performance Rates:
displayed alongside for comparison.
unit: GFLOPs
MFMA FLOPs (F16):
rst: |-
rst: >-
The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -464,7 +464,7 @@ Roofline Performance Rates:
displayed alongside for comparison.
unit: GFLOPs
MFMA FLOPs (BF16):
rst: |-
rst: >-
The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any 16-bit brain
floating point operations from :ref:`VALU <desc-valu>` instructions. The
@@ -472,7 +472,7 @@ Roofline Performance Rates:
accelerator is displayed alongside for comparison.
unit: GFLOPs
MFMA FLOPs (F8):
rst: |-
rst: >-
The total number of 8-bit brain floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any 16-bit brain
floating point operations from :ref:`VALU <desc-valu>` instructions. The
@@ -481,7 +481,7 @@ Roofline Performance Rates:
Instinct MI300 series and later only.
unit: GFLOPs
MFMA IOPs (Int8):
rst: |-
rst: >-
The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
per second. Note: this does not include any 8-bit integer operations from
:ref:`VALU <desc-valu>` instructions. The peak empirically measured INT8 MFMA
@@ -489,7 +489,7 @@ Roofline Performance Rates:
for comparison.
unit: GIOPs
HBM Bandwidth:
rst: |-
rst: >-
The total number of bytes read from and written to High-Bandwidth
Memory (HBM) per second. The peak empirically measured bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
@@ -520,28 +520,28 @@ Roofline Performance Rates:
unit: GB/s
Roofline Plot Points:
AI HBM:
rst: |-
rst: >-
The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
It is the ratio of total floating-point operations (FLOPs) to total bytes
transferred between HBM and the L2 cache. This value is used as the x-coordinate
for the HBM roofline.
unit: FLOPs/Byte
AI L2:
rst: |-
rst: >-
The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L2 cache and the L1 cache. This value is used as the x-coordinate for
the L2 roofline.
unit: FLOPs/Byte
AI L1:
rst: |-
rst: >-
The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L1 cache and the processing units. This value is used as the x-coordinate
for the L1 roofline.
unit: FLOPs/Byte
Performance (GFLOPs):
rst: |-
rst: >-
The overall achieved performance, measured in GigaFLOPs
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
operations divided by the total execution time. This value is used as the y-coordinate
@@ -599,7 +599,7 @@ Workgroup manager utilizations:
any work.
unit: Percent
Scheduler-Pipe Utilization:
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where the scheduler-pipes were actively doing any work. Note: this
value is expected to range between 0% and 25%. See :ref:`desc-spi`.
@@ -637,7 +637,7 @@ Workgroup manager utilizations:
unit: Cycles/wave
Workgroup Manager - Resource Allocation:
Not-scheduled Rate (Workgroup Manager):
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
due to a bottleneck within the workgroup manager rather than a lack of a
@@ -646,7 +646,7 @@ Workgroup Manager - Resource Allocation:
description.
unit: Percent
Not-scheduled Rate (Scheduler-Pipe):
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
due to a bottleneck within the scheduler-pipes rather than a lack of a CU
@@ -655,7 +655,7 @@ Workgroup Manager - Resource Allocation:
description.
unit: Percent
Scheduler-Pipe Stall Rate:
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
due to occupancy limitations (like a lack of a CU or :ref:`SIMD <desc-valu>`
@@ -719,7 +719,7 @@ Wavefront Launch Stats:
block size.
unit: Work-Items
Total Wavefronts:
rst: |-
rst: >-
The total number of wavefronts launched as part of the kernel dispatch.
On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront
size is always 64 work-items. Thus, the total number of wavefronts should
@@ -734,25 +734,25 @@ Wavefront Launch Stats:
<https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
unit: Wavefronts
VGPRs:
rst: |-
rst: >-
The number of architected vector general-purpose registers allocated for the
kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
number of VGPRs requested by the compiler due to allocation granularity.
unit: VGPRs
AGPRs:
rst: |-
rst: >-
The number of accumulation vector general-purpose registers allocated
for the kernel, see :ref:`AGPRs <desc-agprs>`. Note: this may not exactly match
the number of AGPRs requested by the compiler due to allocation granularity.
unit: AGPRs
SGPRs:
rst: |-
rst: >-
The number of scalar general-purpose registers allocated for the kernel, see
:ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
SGPRs requested by the compiler due to allocation granularity.
unit: SGPRs
LDS Allocation:
rst: |-
rst: >-
The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
allocated for this kernel. Note: This may also be larger than what was requested
at compile time due to both allocation granularity and dynamic per-dispatch
@@ -775,7 +775,7 @@ Wavefront Runtime Stats:
This is averaged over all wavefronts in a kernel dispatch.
unit: Instructions per wavefront
Wave Cycles:
rst: |-
rst: >-
The number of cycles a wavefront in the kernel dispatch spent resident
on a compute unit per :ref:`normalization unit <normalization-units>`. This is
averaged over all wavefronts in a kernel dispatch. Note: this should not
@@ -813,7 +813,7 @@ Wavefront Runtime Stats:
the total Wave Cycles metric.
unit: Cycles per normalization unit
Wavefront Occupancy:
rst: |-
rst: >-
The time-averaged number of wavefronts resident on the accelerator over the
lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1ms).
@@ -915,7 +915,7 @@ VALU Arithmetic Instruction Mix:
unit <normalization-units>`.
unit: Instructions per normalization unit
Conversion:
rst: |-
rst: >-
The total number of type conversion instructions (such as converting data
to or from F32\u2194F64) issued to the VALU per :ref:`normalization unit
<normalization-units>`.
@@ -989,14 +989,14 @@ MFMA Arithmetic Instruction Mix:
unit: Instructions per normalization unit
Compute Speed-of-Light:
VALU FLOPs:
rst: |-
rst: >-
The total floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is also presented as a percent of the peak theoretical
FLOPs achievable on the specific accelerator. Note: this does not include
any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU IOPs:
rst: |-
rst: >-
The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
This is also presented as a percent of the peak theoretical IOPs achievable
on the specific accelerator. Note: this does not include any integer operations
@@ -1006,7 +1006,7 @@ Compute Speed-of-Light:
rst: ''
unit: Unknown
MFMA FLOPs (BF16):
rst: |-
rst: >-
The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit brain floating
point operations from :ref:`VALU <desc-valu>` instructions. This is also
@@ -1014,7 +1014,7 @@ Compute Speed-of-Light:
on the specific accelerator.
unit: GFLOPs
MFMA FLOPs (F16):
rst: |-
rst: >-
The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1022,7 +1022,7 @@ Compute Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F32):
rst: |-
rst: >-
The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 32-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1030,7 +1030,7 @@ Compute Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F64):
rst: |-
rst: >-
The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 64-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1042,7 +1042,7 @@ Compute Speed-of-Light:
achievable on the specific accelerator.
unit: GFLOPs
MFMA IOPs (INT8):
rst: |-
rst: >-
The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
per second. Note: this does not include any 8-bit integer operations from
:ref:`VALU <desc-valu>` instructions. This is also presented as a percent
@@ -1131,7 +1131,7 @@ Arithmetic Operations:
unit <normalization-units>`.
unit: FLOP per normalization unit
BF16 OPs:
rst: |-
rst: >-
The total number of 16-bit brain floating-point operations executed on
either the :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
unit <normalization-units>`. Note: on current CDNA accelerators, the VALU
@@ -1148,7 +1148,7 @@ Arithmetic Operations:
unit <normalization-units>`.
unit: FLOP per normalization unit
INT8 OPs:
rst: |-
rst: >-
The total number of 8-bit integer operations executed on either the :ref:`VALU
<desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
<normalization-units>`. Note: on current CDNA accelerators, the VALU has
@@ -1225,7 +1225,7 @@ LDS Statistics:
to stalls from non-dword aligned addresses per :ref:`normalization unit <normalization-units>`.
unit: Cycles per normalization unit
Mem Violations:
rst: |-
rst: >-
The total number of out-of-bounds accesses made to the LDS, per :ref:`normalization
unit <normalization-units>`. This is unused and expected to be zero in
most configurations for modern CDNA\u2122 accelerators.
@@ -1242,7 +1242,7 @@ L1I Speed-of-Light:
over the number of all L1I requests.
unit: Percent
L1I-L2 Bandwidth Utilization:
rst: |-
rst: >-
The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
achieved. Calculated as the ratio of the total number of requests from
the L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`.
@@ -1349,7 +1349,7 @@ Scalar L1D cache accesses:
unit: Requests per normalization unit
Scalar L1D Cache - L2 Interface:
sL1D-L2 BW:
rst: |-
rst: >-
The total number of bytes read from, written to, or atomically updated
across the sL1D\u2194:doc:`L2 <l2-cache>` interface, divided by total duration.
Note that sL1D writes and atomics are typically
@@ -1371,7 +1371,7 @@ Scalar L1D Cache - L2 Interface:
CDNA accelerators.
unit: Requests per normalization unit
Stall Cycles:
rst: |-
rst: >-
The total number of cycles the sL1D\u2194 :doc:`L2 <l2-cache>` interface
was stalled, per :ref:`normalization unit <normalization-units>`.
unit: Cycles per normalization unit
@@ -1681,7 +1681,7 @@ L1 Unified Translation Cache (UTCL1):
translation not being present in the cache, per :ref:`normalization unit <normalization-units>`.
unit: unit
Permission Misses:
rst: |-
rst: >-
The total number of translation requests that missed in the UTCL1 due
to a permission error, per :ref:`normalization unit <normalization-units>`.
This is unused and expected to be zero in most configurations for modern
+55 -55
Просмотреть файл
@@ -1,20 +1,20 @@
System Speed-of-Light:
VALU FLOPs:
rst: |-
rst: >-
The total floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is also presented as a percent of the peak theoretical
FLOPs achievable on the specific accelerator. Note: this does not include
any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU IOPs:
rst: |-
rst: >-
The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
This is also presented as a percent of the peak theoretical IOPs achievable
on the specific accelerator. Note: this does not include any integer operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GOIPs
MFMA FLOPs (F8):
rst: |-
rst: >-
The total number of 8-bit brain floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any 16-bit brain
floating point operations from :ref:`VALU <desc-valu>` instructions. This
@@ -23,7 +23,7 @@ System Speed-of-Light:
series and later only.
unit: GFLOPs
MFMA FLOPs (BF16):
rst: |-
rst: >-
The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any 16-bit brain
floating point operations from :ref:`VALU <desc-valu>` instructions. This
@@ -31,7 +31,7 @@ System Speed-of-Light:
achievable on the specific accelerator.
unit: GFLOPs
MFMA FLOPs (F16):
rst: |-
rst: >-
The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -39,7 +39,7 @@ System Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F32):
rst: |-
rst: >-
The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 32-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -47,7 +47,7 @@ System Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F64):
rst: |-
rst: >-
The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 64-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -58,7 +58,7 @@ System Speed-of-Light:
rst: ''
unit: Unknown
MFMA IOPs (Int8):
rst: |-
rst: >-
The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
per second. Note: this does not include any 8-bit integer operations from
:ref:`VALU <desc-valu>` instructions. This is also presented as a percent
@@ -111,7 +111,7 @@ System Speed-of-Light:
over the :ref:`total active CU cycles <total-active-cu-cycles>`.
unit: Instructions per-cycle
Wavefront Occupancy:
rst: |-
rst: >-
The time-averaged number of wavefronts resident on the accelerator over
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1ms). This is also presented as a percent of the peak theoretical
@@ -156,7 +156,7 @@ System Speed-of-Light:
peak theoretical bandwidth achievable on the specific accelerator.
unit: GB/s
L2-Fabric Read BW:
rst: |-
rst: >-
The number of bytes read by the L2 over the :ref:`Infinity Fabric\u2122
interface <l2-fabric>` per unit time. This is also presented as a percent
of the peak theoretical bandwidth achievable on the specific accelerator.
@@ -241,19 +241,19 @@ Memory Chart:
rst: Total number of compute units (CUs) on the accelerator.
unit: CUs
VGPR:
rst: |-
rst: >-
The number of architected vector general-purpose registers allocated for the
kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
number of VGPRs requested by the compiler due to allocation granularity.
unit: VGPRs
SGPR:
rst: |-
rst: >-
The number of scalar general-purpose registers allocated for the kernel, see
:ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
SGPRs requested by the compiler due to allocation granularity.
unit: SGPRs
LDS Allocation:
rst: |-
rst: >-
The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
allocated for this kernel. Note: This may also be larger than what was requested
at compile time due to both allocation granularity and dynamic per-dispatch
@@ -432,28 +432,28 @@ Memory Chart:
unit: Requests per normalization unit
Roofline Performance Rates:
VALU FLOPs (F16):
rst: |-
rst: >-
The total 16-bit floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is presented with the value of the peak empirical F16 FLOPs achievable
on the specific accelerator. Note: this does not include any F16 operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU FLOPs (F32):
rst: |-
rst: >-
The total 32-bit floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is presented with the value of the peak empirical F32 FLOPs achievable
on the specific accelerator. Note: this does not include any F32 operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU FLOPs (F64):
rst: |-
rst: >-
The total 64-bit floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is presented with the value of the peak empirical F64 FLOPs achievable
on the specific accelerator. Note: this does not include any F64 operations
from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
MFMA FLOPs (F64):
rst: |-
rst: >-
The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 64-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -461,7 +461,7 @@ Roofline Performance Rates:
displayed alongside for comparison.
unit: GFLOPs
MFMA FLOPs (F32):
rst: |-
rst: >-
The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 32-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -469,7 +469,7 @@ Roofline Performance Rates:
displayed alongside for comparison.
unit: GFLOPs
MFMA FLOPs (F16):
rst: |-
rst: >-
The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -477,7 +477,7 @@ Roofline Performance Rates:
displayed alongside for comparison.
unit: GFLOPs
MFMA FLOPs (BF16):
rst: |-
rst: >-
The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any 16-bit brain
floating point operations from :ref:`VALU <desc-valu>` instructions. The
@@ -485,7 +485,7 @@ Roofline Performance Rates:
accelerator is displayed alongside for comparison.
unit: GFLOPs
MFMA FLOPs (F8):
rst: |-
rst: >-
The total number of 8-bit brain floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any 16-bit brain
floating point operations from :ref:`VALU <desc-valu>` instructions. The
@@ -494,7 +494,7 @@ Roofline Performance Rates:
Instinct MI300 series and later only.
unit: GFLOPs
MFMA FLOPs (F6F4):
rst: |-
rst: >-
The total number of 4-bit and 6-bit floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any floating point
operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
@@ -503,7 +503,7 @@ Roofline Performance Rates:
series (gfx950) and later only.
unit: GFLOPs
MFMA IOPs (Int8):
rst: |-
rst: >-
The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
per second. Note: this does not include any 8-bit integer operations from
:ref:`VALU <desc-valu>` instructions. The peak empirically measured INT8 MFMA
@@ -511,7 +511,7 @@ Roofline Performance Rates:
for comparison.
unit: GIOPs
HBM Bandwidth:
rst: |-
rst: >-
The total number of bytes read from and written to High-Bandwidth
Memory (HBM) per second. The peak empirically measured bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
@@ -542,28 +542,28 @@ Roofline Performance Rates:
unit: GB/s
Roofline Plot Points:
AI HBM:
rst: |-
rst: >-
The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
It is the ratio of total floating-point operations (FLOPs) to total bytes
transferred between HBM and the L2 cache. This value is used as the x-coordinate
for the HBM roofline.
unit: FLOPs/Byte
AI L2:
rst: |-
rst: >-
The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L2 cache and the L1 cache. This value is used as the x-coordinate for
the L2 roofline.
unit: FLOPs/Byte
AI L1:
rst: |-
rst: >-
The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L1 cache and the processing units. This value is used as the x-coordinate
for the L1 roofline.
unit: FLOPs/Byte
Performance (GFLOPs):
rst: |-
rst: >-
The overall achieved performance, measured in GigaFLOPs
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
operations divided by the total execution time. This value is used as the y-coordinate
@@ -633,7 +633,7 @@ Workgroup manager utilizations:
any work.
unit: Percent
Scheduler-Pipe Utilization:
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where the scheduler-pipes were actively doing any work. Note: this
value is expected to range between 0% and 25%. See :ref:`desc-spi`.
@@ -674,7 +674,7 @@ Workgroup manager utilizations:
unit: Cycles/wave
Workgroup Manager - Resource Allocation:
Not-scheduled Rate (Workgroup Manager):
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
due to a bottleneck within the workgroup manager rather than a lack of a
@@ -683,7 +683,7 @@ Workgroup Manager - Resource Allocation:
description.
unit: Percent
Not-scheduled Rate (Scheduler-Pipe):
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
due to a bottleneck within the scheduler-pipes rather than a lack of a CU
@@ -695,7 +695,7 @@ Workgroup Manager - Resource Allocation:
rst: ''
unit: Unknown
Scheduler-Pipe Stall Rate:
rst: |-
rst: >-
The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>`
in the kernel where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
due to occupancy limitations (like a lack of a CU or :ref:`SIMD <desc-valu>`
@@ -759,7 +759,7 @@ Wavefront Launch Stats:
block size.
unit: Work-Items
Total Wavefronts:
rst: |-
rst: >-
The total number of wavefronts launched as part of the kernel dispatch.
On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront
size is always 64 work-items. Thus, the total number of wavefronts should
@@ -774,25 +774,25 @@ Wavefront Launch Stats:
<https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
unit: Wavefronts
VGPRs:
rst: |-
rst: >-
The number of architected vector general-purpose registers allocated for the
kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly match the
number of VGPRs requested by the compiler due to allocation granularity.
unit: VGPRs
AGPRs:
rst: |-
rst: >-
The number of accumulation vector general-purpose registers allocated
for the kernel, see :ref:`AGPRs <desc-agprs>`. Note: this may not exactly match
the number of AGPRs requested by the compiler due to allocation granularity.
unit: AGPRs
SGPRs:
rst: |-
rst: >-
The number of scalar general-purpose registers allocated for the kernel, see
:ref:`SALU <desc-salu>`. Note: this may not exactly match the number of
SGPRs requested by the compiler due to allocation granularity.
unit: SGPRs
LDS Allocation:
rst: |-
rst: >-
The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared memory)
allocated for this kernel. Note: This may also be larger than what was requested
at compile time due to both allocation granularity and dynamic per-dispatch
@@ -815,7 +815,7 @@ Wavefront Runtime Stats:
This is averaged over all wavefronts in a kernel dispatch.
unit: Instructions per wavefront
Wave Cycles:
rst: |-
rst: >-
The number of cycles a wavefront in the kernel dispatch spent resident
on a compute unit per :ref:`normalization unit <normalization-units>`. This is
averaged over all wavefronts in a kernel dispatch. Note: this should not
@@ -853,7 +853,7 @@ Wavefront Runtime Stats:
the total Wave Cycles metric.
unit: Cycles per normalization unit
Wavefront Occupancy:
rst: |-
rst: >-
The time-averaged number of wavefronts resident on the accelerator over the
lifetime of the kernel. Note: this metric may be inaccurate for short-running
kernels (less than 1ms).
@@ -955,7 +955,7 @@ VALU Arithmetic Instruction Mix:
unit <normalization-units>`.
unit: Instructions per normalization unit
Conversion:
rst: |-
rst: >-
The total number of type conversion instructions (such as converting data
to or from F32\u2194F64) issued to the VALU per :ref:`normalization unit
<normalization-units>`.
@@ -1035,14 +1035,14 @@ MFMA Arithmetic Instruction Mix:
unit: Unknown
Compute Speed-of-Light:
VALU FLOPs:
rst: |-
rst: >-
The total floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is also presented as a percent of the peak theoretical
FLOPs achievable on the specific accelerator. Note: this does not include
any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.
unit: GFLOPs
VALU IOPs:
rst: |-
rst: >-
The total integer operations executed per second on the :ref:`VALU <desc-valu>`.
This is also presented as a percent of the peak theoretical IOPs achievable
on the specific accelerator. Note: this does not include any integer operations
@@ -1052,7 +1052,7 @@ Compute Speed-of-Light:
rst: ''
unit: Unknown
MFMA FLOPs (BF16):
rst: |-
rst: >-
The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit brain floating
point operations from :ref:`VALU <desc-valu>` instructions. This is also
@@ -1060,7 +1060,7 @@ Compute Speed-of-Light:
on the specific accelerator.
unit: GFLOPs
MFMA FLOPs (F16):
rst: |-
rst: >-
The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1068,7 +1068,7 @@ Compute Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F32):
rst: |-
rst: >-
The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 32-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1076,7 +1076,7 @@ Compute Speed-of-Light:
specific accelerator.
unit: GFLOPs
MFMA FLOPs (F64):
rst: |-
rst: >-
The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 64-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. This is also presented
@@ -1091,7 +1091,7 @@ Compute Speed-of-Light:
rst: ''
unit: Unknown
MFMA IOPs (INT8):
rst: |-
rst: >-
The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
per second. Note: this does not include any 8-bit integer operations from
:ref:`VALU <desc-valu>` instructions. This is also presented as a percent
@@ -1183,7 +1183,7 @@ Arithmetic Operations:
unit <normalization-units>`.
unit: FLOP per normalization unit
BF16 OPs:
rst: |-
rst: >-
The total number of 16-bit brain floating-point operations executed on
either the :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization
unit <normalization-units>`. Note: on current CDNA accelerators, the VALU
@@ -1203,7 +1203,7 @@ Arithmetic Operations:
rst: ''
unit: Unknown
INT8 OPs:
rst: |-
rst: >-
The total number of 8-bit integer operations executed on either the :ref:`VALU
<desc-valu>` or :ref:`MFMA <desc-mfma>` units, per :ref:`normalization unit
<normalization-units>`. Note: on current CDNA accelerators, the VALU has
@@ -1298,7 +1298,7 @@ LDS Statistics:
to stalls from non-dword aligned addresses per :ref:`normalization unit <normalization-units>`.
unit: Cycles per normalization unit
Mem Violations:
rst: |-
rst: >-
The total number of out-of-bounds accesses made to the LDS, per :ref:`normalization
unit <normalization-units>`. This is unused and expected to be zero in
most configurations for modern CDNA\u2122 accelerators.
@@ -1321,7 +1321,7 @@ L1I Speed-of-Light:
over the number of all L1I requests.
unit: Percent
L1I-L2 Bandwidth Utilization:
rst: |-
rst: >-
The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
achieved. Calculated as the ratio of the total number of requests from
the L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`.
@@ -1428,7 +1428,7 @@ Scalar L1D cache accesses:
unit: Requests per normalization unit
Scalar L1D Cache - L2 Interface:
sL1D-L2 BW:
rst: |-
rst: >-
The total number of bytes read from, written to, or atomically updated
across the sL1D\u2194:doc:`L2 <l2-cache>` interface, divided by total duration.
Note that sL1D writes and atomics are typically
@@ -1450,7 +1450,7 @@ Scalar L1D Cache - L2 Interface:
CDNA accelerators.
unit: Requests per normalization unit
Stall Cycles:
rst: |-
rst: >-
The total number of cycles the sL1D\u2194 :doc:`L2 <l2-cache>` interface
was stalled, per :ref:`normalization unit <normalization-units>`.
unit: Cycles per normalization unit
@@ -1818,7 +1818,7 @@ L1 Unified Translation Cache (UTCL1):
rst: ''
unit: Unknown
Permission Misses:
rst: |-
rst: >-
The total number of translation requests that missed in the UTCL1 due
to a permission error, per :ref:`normalization unit <normalization-units>`.
This is unused and expected to be zero in most configurations for modern
@@ -1968,7 +1968,7 @@ L2-Fabric interface metrics:
with return value) was returned to the L2.
unit: Cycles
Read Stall:
rst: |-
rst: >-
The ratio of the total number of cycles the L2-Fabric interface was stalled
on a read request to any destination (local HBM, remote PCIe\xAE connected
accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_