diff --git a/projects/rocprofiler-compute/docs/data/metrics_description.yaml b/projects/rocprofiler-compute/docs/data/metrics_description.yaml index 7184df52e1..ae027f63cf 100644 --- a/projects/rocprofiler-compute/docs/data/metrics_description.yaml +++ b/projects/rocprofiler-compute/docs/data/metrics_description.yaml @@ -1,34 +1,34 @@ # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py Wavefront launch stats: - Grid Size: - rst: The total number of work-items (or, threads) launched as a part of the kernel - dispatch. In HIP, this is equivalent to the total grid size multiplied by the - total workgroup (or, block) size. - unit: Work-Items - Workgroup Size: - rst: The total number of work-items (or, threads) in each workgroup (or, block) - launched as part of the kernel dispatch. In HIP, this is equivalent to the total - block size. - unit: Work-Items - Total Wavefronts: - rst: "The total number of wavefronts launched as part of the kernel dispatch.\ - \ On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront\ - \ size is always 64 work-items. Thus, the total number of wavefronts should\ - \ be equivalent to the ceiling of grid size divided by 64." - unit: Wavefronts + Scratch Allocation: + rst: The number of bytes of :ref:`scratch memory ` requested per + work-item for this kernel. Scratch memory is used for stack memory on the accelerator, + as well as for register spills and restores. + unit: Bytes per work-item Saved Wavefronts: rst: The total number of wavefronts saved at a context-save. See `cwsr_enable `_. unit: Wavefronts - Restored Wavefronts: - rst: The total number of wavefronts restored from a context-save. See `cwsr_enable - `_. - unit: Wavefronts VGPRs: rst: 'The number of architected vector general-purpose registers allocated for the kernel, see :ref:`VALU `. 
Note: this may not exactly match the number of VGPRs requested by the compiler due to allocation granularity.' unit: VGPRs + Grid Size: + rst: The total number of work-items (or, threads) launched as a part of the kernel + dispatch. In HIP, this is equivalent to the total grid size multiplied by the + total workgroup (or, block) size. + unit: Work-Items + LDS Allocation: + rst: 'The number of bytes of :doc:`LDS ` memory (or, shared memory) + allocated for this kernel. Note: This may also be larger than what was requested + at compile time due to both allocation granularity and dynamic per-dispatch + LDS allocations.' + unit: Bytes per workgroup + Restored Wavefronts: + rst: The total number of wavefronts restored from a context-save. See `cwsr_enable + `_. + unit: Wavefronts AGPRs: rst: 'The number of accumulation vector general-purpose registers allocated for the kernel, see :ref:`AGPRs `. Note: this may not exactly match the @@ -39,146 +39,43 @@ Wavefront launch stats: :ref:`SALU `. Note: this may not exactly match the number of SGPRs requested by the compiler due to allocation granularity. plain' unit: SGPRs - LDS Allocation: - rst: 'The number of bytes of :doc:`LDS ` memory (or, shared memory) - allocated for this kernel. Note: This may also be larger than what was requested - at compile time due to both allocation granularity and dynamic per-dispatch - LDS allocations.' - unit: Bytes per workgroup - Scratch Allocation: - rst: The number of bytes of :ref:`scratch memory ` requested per - work-item for this kernel. Scratch memory is used for stack memory on the accelerator, - as well as for register spills and restores. - unit: Bytes per work-item - Kernel Time: - rst: The total duration of the executed kernel. - unit: Nanoseconds - Kernel Time (Cycles): - rst: The total duration of the executed kernel in cycles. - unit: Cycles - Instructions per wavefront: - rst: The average number of instructions (of all types) executed per wavefront. 
- This is averaged over all wavefronts in a kernel dispatch. - unit: Instructions per wavefront - Wave Cycles: - rst: 'The number of cycles a wavefront in the kernel dispatch spent resident on a - compute unit per :ref:`normalization unit `. This is averaged - over all wavefronts in a kernel dispatch. Note: this should not be directly - compared to the kernel cycles above.' - unit: Cycles per normalization unit - Dependency Wait Cycles: - rst: The number of cycles a wavefront in the kernel dispatch stalled waiting on - memory of any kind (e.g., instruction fetch, vector or scalar memory, etc.) - per :ref:`normalization unit `. This counter is incremented - at every cycle by *all* wavefronts on a CU stalled at a memory operation. As - such, it is most useful to get a sense of how waves were spending their time, - rather than identification of a precise limiter because another wave could - be actively executing while a wave is stalled. The sum of this metric, Issue - Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric. - unit: Cycles per normalization unit - Issue Wait Cycles: - rst: The number of cycles a wavefront in the kernel dispatch was unable to issue - an instruction for any reason (e.g., execution pipe back-pressure, arbitration - loss, etc.) per :ref:`normalization unit `. This counter - is incremented at every cycle by *all* wavefronts on a CU unable to issue an instruction. As - such, it is most useful to get a sense of how waves were spending their time, - rather than identification of a precise limiter because another wave could - be actively executing while a wave is issue stalled. The sum of this metric, - Dependency Wait Cycles and Active Cycles should be equal to the total Wave - Cycles metric. - unit: Cycles per normalization unit - Active Cycles: - rst: The average number of cycles a wavefront in the kernel dispatch was actively - executing instructions per :ref:`normalization unit `. 
- This measurement is made on a per-wavefront basis, and may include cycles that - another wavefront spent actively executing (on another execution unit, for - example) or was stalled. As such, it is most useful to get a sense of how - waves were spending their time, rather than identification of a precise limiter. - The sum of this metric, Issue Wait Cycles and Active Wait Cycles should be equal - to the total Wave Cycles metric. - unit: Cycles per normalization unit - Wavefront Occupancy: - rst: 'The time-averaged number of wavefronts resident on the accelerator over the - lifetime of the kernel. Note: this metric may be inaccurate for short-running - kernels (less than 1ms).' + Total Wavefronts: + rst: "The total number of wavefronts launched as part of the kernel dispatch.\ + \ On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront\ + \ size is always 64 work-items. Thus, the total number of wavefronts should\ + \ be equivalent to the ceiling of grid size divided by 64." unit: Wavefronts + Workgroup Size: + rst: The total number of work-items (or, threads) in each workgroup (or, block) + launched as part of the kernel dispatch. In HIP, this is equivalent to the total + block size. + unit: Work-Items Wavefront runtime stats: - Grid Size: - rst: The total number of work-items (or, threads) launched as a part of the kernel - dispatch. In HIP, this is equivalent to the total grid size multiplied by the - total workgroup (or, block) size. - unit: Work-Items - Workgroup Size: - rst: The total number of work-items (or, threads) in each workgroup (or, block) - launched as part of the kernel dispatch. In HIP, this is equivalent to the total - block size. - unit: Work-Items - Total Wavefronts: - rst: "The total number of wavefronts launched as part of the kernel dispatch.\ - \ On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs, the wavefront\ - \ size is always 64 work-items. 
Thus, the total number of wavefronts should\ - \ be equivalent to the ceiling of grid size divided by 64." - unit: Wavefronts - Saved Wavefronts: - rst: The total number of wavefronts saved at a context-save. See `cwsr_enable - `_. - unit: Wavefronts - Restored Wavefronts: - rst: The total number of wavefronts restored from a context-save. See `cwsr_enable - `_. - unit: Wavefronts - VGPRs: - rst: 'The number of architected vector general-purpose registers allocated for the - kernel, see :ref:`VALU `. Note: this may not exactly match the - number of VGPRs requested by the compiler due to allocation granularity.' - unit: VGPRs - AGPRs: - rst: 'The number of accumulation vector general-purpose registers allocated for the - kernel, see :ref:`AGPRs `. Note: this may not exactly match the - number of AGPRs requested by the compiler due to allocation granularity.' - unit: AGPRs - SGPRs: - rst: 'The number of scalar general-purpose registers allocated for the kernel, see - :ref:`SALU `. Note: this may not exactly match the number of SGPRs - requested by the compiler due to allocation granularity. plain' - unit: SGPRs - LDS Allocation: - rst: 'The number of bytes of :doc:`LDS ` memory (or, shared memory) - allocated for this kernel. Note: This may also be larger than what was requested - at compile time due to both allocation granularity and dynamic per-dispatch - LDS allocations.' - unit: Bytes per workgroup - Scratch Allocation: - rst: The number of bytes of :ref:`scratch memory ` requested per - work-item for this kernel. Scratch memory is used for stack memory on the accelerator, - as well as for register spills and restores. - unit: Bytes per work-item - Kernel Time: - rst: The total duration of the executed kernel. - unit: Nanoseconds - Kernel Time (Cycles): - rst: The total duration of the executed kernel in cycles. - unit: Cycles Instructions per wavefront: rst: The average number of instructions (of all types) executed per wavefront. 
This is averaged over all wavefronts in a kernel dispatch. unit: Instructions per wavefront + Active Cycles: + rst: The average number of cycles a wavefront in the kernel dispatch was actively + executing instructions per :ref:`normalization unit `. + This measurement is made on a per-wavefront basis, and may include cycles that + another wavefront spent actively executing (on another execution unit, for + example) or was stalled. As such, it is most useful to get a sense of how + waves were spending their time, rather than identification of a precise limiter. + The sum of this metric, Issue Wait Cycles and Active Wait Cycles should be equal + to the total Wave Cycles metric. + unit: Cycles per normalization unit Wave Cycles: rst: 'The number of cycles a wavefront in the kernel dispatch spent resident on a compute unit per :ref:`normalization unit `. This is averaged over all wavefronts in a kernel dispatch. Note: this should not be directly compared to the kernel cycles above.' unit: Cycles per normalization unit - Dependency Wait Cycles: - rst: The number of cycles a wavefront in the kernel dispatch stalled waiting on - memory of any kind (e.g., instruction fetch, vector or scalar memory, etc.) - per :ref:`normalization unit `. This counter is incremented - at every cycle by *all* wavefronts on a CU stalled at a memory operation. As - such, it is most useful to get a sense of how waves were spending their time, - rather than identification of a precise limiter because another wave could - be actively executing while a wave is stalled. The sum of this metric, Issue - Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric. - unit: Cycles per normalization unit + Wavefront Occupancy: + rst: 'The time-averaged number of wavefronts resident on the accelerator over the + lifetime of the kernel. Note: this metric may be inaccurate for short-running + kernels (less than 1ms).' 
+ unit: Wavefronts Issue Wait Cycles: rst: The number of cycles a wavefront in the kernel dispatch was unable to issue an instruction for any reason (e.g., execution pipe back-pressure, arbitration @@ -190,41 +87,27 @@ Wavefront runtime stats: Dependency Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric. unit: Cycles per normalization unit - Active Cycles: - rst: The average number of cycles a wavefront in the kernel dispatch was actively - executing instructions per :ref:`normalization unit `. - This measurement is made on a per-wavefront basis, and may include cycles that - another wavefront spent actively executing (on another execution unit, for - example) or was stalled. As such, it is most useful to get a sense of how - waves were spending their time, rather than identification of a precise limiter. - The sum of this metric, Issue Wait Cycles and Active Wait Cycles should be equal - to the total Wave Cycles metric. + Dependency Wait Cycles: + rst: The number of cycles a wavefront in the kernel dispatch stalled waiting on + memory of any kind (e.g., instruction fetch, vector or scalar memory, etc.) + per :ref:`normalization unit `. This counter is incremented + at every cycle by *all* wavefronts on a CU stalled at a memory operation. As + such, it is most useful to get a sense of how waves were spending their time, + rather than identification of a precise limiter because another wave could + be actively executing while a wave is stalled. The sum of this metric, Issue + Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric. unit: Cycles per normalization unit - Wavefront Occupancy: - rst: 'The time-averaged number of wavefronts resident on the accelerator over the - lifetime of the kernel. Note: this metric may be inaccurate for short-running - kernels (less than 1ms).' - unit: Wavefronts + Kernel Time: + rst: The total duration of the executed kernel. 
+ unit: Nanoseconds + Kernel Time (Cycles): + rst: The total duration of the executed kernel in cycles. + unit: Cycles Overall instruction mix: - VALU: - rst: The total number of vector arithmetic logic unit (VALU) operations issued. - These are the workhorses of the :doc:`compute unit `, and are - used to execute a wide range of instruction types including floating point - operations, non-uniform address calculations, transcendental operations, integer - operations, shifts, conditional evaluation, etc. - unit: Instructions - VMEM: - rst: The total number of vector memory operations issued. These include most loads, - stores and atomic operations and all accesses to :ref:`generic, global, private - and texture ` memory. - unit: Instructions LDS: rst: The total number of LDS (also known as shared memory) operations issued. These include loads, stores, atomics, and HIP's ``__shfl`` operations. unit: Instructions - MFMA: - rst: The total number of matrix fused multiply-add instructions issued. - unit: Instructions SALU: rst: The total number of scalar arithmetic logic unit (SALU) operations issued. Typically these are used for address calculations, literal constants, and other @@ -232,185 +115,31 @@ Overall instruction mix: (SMEM) operations are issued by the SALU, they are counted separately in this section. unit: Instructions + MFMA: + rst: The total number of matrix fused multiply-add instructions issued. + unit: Instructions + VMEM: + rst: The total number of vector memory operations issued. These include most loads, + stores and atomic operations and all accesses to :ref:`generic, global, private + and texture ` memory. + unit: Instructions SMEM: rst: The total number of scalar memory (SMEM) operations issued. These are typically used for loading kernel arguments, base-pointers and loads from HIP's ``__constant__`` memory. unit: Instructions + VALU: + rst: The total number of vector arithmetic logic unit (VALU) operations issued. 
+ These are the workhorses of the :doc:`compute unit `, and are + used to execute a wide range of instruction types including floating point + operations, non-uniform address calculations, transcendental operations, integer + operations, shifts, conditional evaluation, etc. + unit: Instructions Branch: rst: The total number of branch operations issued. These typically consist of jump or branch operations and are used to implement control flow. unit: Instructions - INT32: - rst: The total number of instructions operating on 32-bit integer operands issued - to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - INT64: - rst: The total number of instructions operating on 64-bit integer operands issued - to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F16-ADD: - rst: The total number of addition instructions operating on 16-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F16-MUL: - rst: The total number of multiplication instructions operating on 16-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F16-FMA: - rst: The total number of fused multiply-add instructions operating on 16-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F16-Trans: - rst: The total number of transcendental instructions (e.g., `sqrt`) operating on - 16-bit floating-point operands issued to the VALU per :ref:`normalization unit - `. - unit: Instructions per normalization unit - F32-ADD: - rst: The total number of addition instructions operating on 32-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. 
- unit: Instructions per normalization unit - F32-MUL: - rst: The total number of multiplication instructions operating on 32-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F32-FMA: - rst: The total number of fused multiply-add instructions operating on 32-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F32-Trans: - rst: The total number of transcendental instructions (such as ``sqrt``) operating - on 32-bit floating-point operands issued to the VALU per :ref:`normalization - unit `. - unit: Instructions per normalization unit - F64-ADD: - rst: The total number of addition instructions operating on 64-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F64-MUL: - rst: The total number of multiplication instructions operating on 64-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F64-FMA: - rst: The total number of fused multiply-add instructions operating on 64-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F64-Trans: - rst: The total number of transcendental instructions (such as `sqrt`) operating - on 64-bit floating-point operands issued to the VALU per :ref:`normalization - unit `. - unit: Instructions per normalization unit - Conversion: - rst: "The total number of type conversion instructions (such as converting data\ - \ to or from F32\u2194F64) issued to the VALU per :ref:`normalization unit\ - \ `." - unit: Instructions per normalization unit - Global/Generic Instr: - rst: The total number of global & generic memory instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. 
- unit: Instructions per normalization unit - Global/Generic Read: - rst: The total number of global & generic memory read instructions executed on all - :doc:`compute units ` on the accelerator, per :ref:`normalization - unit `. - unit: Instructions per normalization unit - Global/Generic Write: - rst: The total number of global & generic memory write instructions executed on - all :doc:`compute units ` on the accelerator, per :ref:`normalization - unit `. - unit: Instructions per normalization unit - Global/Generic Atomic: - rst: The total number of global & generic memory atomic (with and without return) - instructions executed on all :doc:`compute units ` on the accelerator, - per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Instr: - rst: The total number of spill/stack memory instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Read: - rst: The total number of spill/stack memory read instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Write: - rst: The total number of spill/stack memory write instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Atomic: - rst: The total number of spill/stack memory atomic (with and without return) instructions - executed on all :doc:`compute units ` on the accelerator, per - :ref:`normalization unit `. Typically unused as these - memory operations are typically used to implement thread-local storage. - unit: Instructions per normalization unit - MFMA-I8: - rst: The total number of 8-bit integer :ref:`MFMA ` instructions issued - per :ref:`normalization unit `. 
- unit: Instructions per normalization unit - MFMA-F8: - rst: The total number of 8-bit floating point :ref:`MFMA ` instructions issued - per :ref:`normalization unit `. This is supported in AMD - Instinct MI300 series and later only. - unit: Instructions per normalization unit - MFMA-F16: - rst: The total number of 16-bit floating point :ref:`MFMA ` instructions - issued per :ref:`normalization unit `. - unit: Instructions per normalization unit - MFMA-BF16: - rst: The total number of 16-bit brain floating point :ref:`MFMA ` instructions - issued per :ref:`normalization unit `. - unit: Instructions per normalization unit - MFMA-F32: - rst: The total number of 32-bit floating-point :ref:`MFMA ` instructions - issued per :ref:`normalization unit `. - unit: Instructions per normalization unit - MFMA-F64: - rst: The total number of 64-bit floating-point :ref:`MFMA ` instructions - issued per :ref:`normalization unit `. - unit: Instructions per normalization unit VALU arithmetic instruction mix: - VALU: - rst: The total number of vector arithmetic logic unit (VALU) operations issued. - These are the workhorses of the :doc:`compute unit `, and are - used to execute a wide range of instruction types including floating point - operations, non-uniform address calculations, transcendental operations, integer - operations, shifts, conditional evaluation, etc. - unit: Instructions - VMEM: - rst: The total number of vector memory operations issued. These include most loads, - stores and atomic operations and all accesses to :ref:`generic, global, private - and texture ` memory. - unit: Instructions - LDS: - rst: The total number of LDS (also known as shared memory) operations issued. These - include loads, stores, atomics, and HIP's ``__shfl`` operations. - unit: Instructions - MFMA: - rst: The total number of matrix fused multiply-add instructions issued. - unit: Instructions - SALU: - rst: The total number of scalar arithmetic logic unit (SALU) operations issued. 
- Typically these are used for address calculations, literal constants, and other - operations that are provably uniform across a wavefront. Although scalar memory - (SMEM) operations are issued by the SALU, they are counted separately in this - section. - unit: Instructions - SMEM: - rst: The total number of scalar memory (SMEM) operations issued. These are typically - used for loading kernel arguments, base-pointers and loads from HIP's ``__constant__`` - memory. - unit: Instructions - Branch: - rst: The total number of branch operations issued. These typically consist of jump - or branch operations and are used to implement control flow. - unit: Instructions - INT32: - rst: The total number of instructions operating on 32-bit integer operands issued - to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - INT64: - rst: The total number of instructions operating on 64-bit integer operands issued - to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit F16-ADD: rst: The total number of addition instructions operating on 16-bit floating-point operands issued to the VALU per :ref:`normalization unit `. @@ -419,291 +148,89 @@ VALU arithmetic instruction mix: rst: The total number of multiplication instructions operating on 16-bit floating-point operands issued to the VALU per :ref:`normalization unit `. unit: Instructions per normalization unit + INT32: + rst: The total number of instructions operating on 32-bit integer operands issued + to the VALU per :ref:`normalization unit `. + unit: Instructions per normalization unit + F32-MUL: + rst: The total number of multiplication instructions operating on 32-bit floating-point + operands issued to the VALU per :ref:`normalization unit `. + unit: Instructions per normalization unit + F64-ADD: + rst: The total number of addition instructions operating on 64-bit floating-point + operands issued to the VALU per :ref:`normalization unit `. 
+ unit: Instructions per normalization unit + F64-FMA: + rst: The total number of fused multiply-add instructions operating on 64-bit floating-point + operands issued to the VALU per :ref:`normalization unit `. + unit: Instructions per normalization unit + Conversion: + rst: "The total number of type conversion instructions (such as converting data\ + \ to or from F32\u2194F64) issued to the VALU per :ref:`normalization unit\ + \ `." + unit: Instructions per normalization unit F16-FMA: rst: The total number of fused multiply-add instructions operating on 16-bit floating-point operands issued to the VALU per :ref:`normalization unit `. unit: Instructions per normalization unit + F32-FMA: + rst: The total number of fused multiply-add instructions operating on 32-bit floating-point + operands issued to the VALU per :ref:`normalization unit `. + unit: Instructions per normalization unit F16-Trans: rst: The total number of transcendental instructions (e.g., `sqrt`) operating on 16-bit floating-point operands issued to the VALU per :ref:`normalization unit `. unit: Instructions per normalization unit - F32-ADD: - rst: The total number of addition instructions operating on 32-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F32-MUL: - rst: The total number of multiplication instructions operating on 32-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F32-FMA: - rst: The total number of fused multiply-add instructions operating on 32-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit F32-Trans: rst: The total number of transcendental instructions (such as ``sqrt``) operating on 32-bit floating-point operands issued to the VALU per :ref:`normalization unit `. 
unit: Instructions per normalization unit - F64-ADD: - rst: The total number of addition instructions operating on 64-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F64-MUL: - rst: The total number of multiplication instructions operating on 64-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F64-FMA: - rst: The total number of fused multiply-add instructions operating on 64-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit F64-Trans: rst: The total number of transcendental instructions (such as `sqrt`) operating on 64-bit floating-point operands issued to the VALU per :ref:`normalization unit `. unit: Instructions per normalization unit - Conversion: - rst: "The total number of type conversion instructions (such as converting data\ - \ to or from F32\u2194F64) issued to the VALU per :ref:`normalization unit\ - \ `." + F32-ADD: + rst: The total number of addition instructions operating on 32-bit floating-point + operands issued to the VALU per :ref:`normalization unit `. unit: Instructions per normalization unit - Global/Generic Instr: - rst: The total number of global & generic memory instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. + F64-MUL: + rst: The total number of multiplication instructions operating on 64-bit floating-point + operands issued to the VALU per :ref:`normalization unit `. unit: Instructions per normalization unit - Global/Generic Read: - rst: The total number of global & generic memory read instructions executed on all - :doc:`compute units ` on the accelerator, per :ref:`normalization - unit `. 
- unit: Instructions per normalization unit - Global/Generic Write: - rst: The total number of global & generic memory write instructions executed on - all :doc:`compute units ` on the accelerator, per :ref:`normalization - unit `. - unit: Instructions per normalization unit - Global/Generic Atomic: - rst: The total number of global & generic memory atomic (with and without return) - instructions executed on all :doc:`compute units ` on the accelerator, - per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Instr: - rst: The total number of spill/stack memory instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Read: - rst: The total number of spill/stack memory read instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Write: - rst: The total number of spill/stack memory write instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Atomic: - rst: The total number of spill/stack memory atomic (with and without return) instructions - executed on all :doc:`compute units ` on the accelerator, per - :ref:`normalization unit `. Typically unused as these - memory operations are typically used to implement thread-local storage. - unit: Instructions per normalization unit - MFMA-I8: - rst: The total number of 8-bit integer :ref:`MFMA ` instructions issued - per :ref:`normalization unit `. - unit: Instructions per normalization unit - MFMA-F8: - rst: The total number of 8-bit floating point :ref:`MFMA ` instructions issued - per :ref:`normalization unit `. This is supported in AMD - Instinct MI300 series and later only. 
- unit: Instructions per normalization unit - MFMA-F16: - rst: The total number of 16-bit floating point :ref:`MFMA ` instructions - issued per :ref:`normalization unit `. - unit: Instructions per normalization unit - MFMA-BF16: - rst: The total number of 16-bit brain floating point :ref:`MFMA ` instructions - issued per :ref:`normalization unit `. - unit: Instructions per normalization unit - MFMA-F32: - rst: The total number of 32-bit floating-point :ref:`MFMA ` instructions - issued per :ref:`normalization unit `. - unit: Instructions per normalization unit - MFMA-F64: - rst: The total number of 64-bit floating-point :ref:`MFMA ` instructions - issued per :ref:`normalization unit `. + INT64: + rst: The total number of instructions operating on 64-bit integer operands issued + to the VALU per :ref:`normalization unit `. unit: Instructions per normalization unit MFMA instruction mix: - VALU: - rst: The total number of vector arithmetic logic unit (VALU) operations issued. - These are the workhorses of the :doc:`compute unit `, and are - used to execute a wide range of instruction types including floating point - operations, non-uniform address calculations, transcendental operations, integer - operations, shifts, conditional evaluation, etc. - unit: Instructions - VMEM: - rst: The total number of vector memory operations issued. These include most loads, - stores and atomic operations and all accesses to :ref:`generic, global, private - and texture ` memory. - unit: Instructions - LDS: - rst: The total number of LDS (also known as shared memory) operations issued. These - include loads, stores, atomics, and HIP's ``__shfl`` operations. - unit: Instructions - MFMA: - rst: The total number of matrix fused multiply-add instructions issued. - unit: Instructions - SALU: - rst: The total number of scalar arithmetic logic unit (SALU) operations issued. 
- Typically these are used for address calculations, literal constants, and other - operations that are provably uniform across a wavefront. Although scalar memory - (SMEM) operations are issued by the SALU, they are counted separately in this - section. - unit: Instructions - SMEM: - rst: The total number of scalar memory (SMEM) operations issued. These are typically - used for loading kernel arguments, base-pointers and loads from HIP's ``__constant__`` - memory. - unit: Instructions - Branch: - rst: The total number of branch operations issued. These typically consist of jump - or branch operations and are used to implement control flow. - unit: Instructions - INT32: - rst: The total number of instructions operating on 32-bit integer operands issued - to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - INT64: - rst: The total number of instructions operating on 64-bit integer operands issued - to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F16-ADD: - rst: The total number of addition instructions operating on 16-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F16-MUL: - rst: The total number of multiplication instructions operating on 16-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F16-FMA: - rst: The total number of fused multiply-add instructions operating on 16-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F16-Trans: - rst: The total number of transcendental instructions (e.g., `sqrt`) operating on - 16-bit floating-point operands issued to the VALU per :ref:`normalization unit - `. 
- unit: Instructions per normalization unit - F32-ADD: - rst: The total number of addition instructions operating on 32-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F32-MUL: - rst: The total number of multiplication instructions operating on 32-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F32-FMA: - rst: The total number of fused multiply-add instructions operating on 32-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F32-Trans: - rst: The total number of transcendental instructions (such as ``sqrt``) operating - on 32-bit floating-point operands issued to the VALU per :ref:`normalization - unit `. - unit: Instructions per normalization unit - F64-ADD: - rst: The total number of addition instructions operating on 64-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F64-MUL: - rst: The total number of multiplication instructions operating on 64-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F64-FMA: - rst: The total number of fused multiply-add instructions operating on 64-bit floating-point - operands issued to the VALU per :ref:`normalization unit `. - unit: Instructions per normalization unit - F64-Trans: - rst: The total number of transcendental instructions (such as `sqrt`) operating - on 64-bit floating-point operands issued to the VALU per :ref:`normalization - unit `. - unit: Instructions per normalization unit - Conversion: - rst: "The total number of type conversion instructions (such as converting data\ - \ to or from F32\u2194F64) issued to the VALU per :ref:`normalization unit\ - \ `." 
- unit: Instructions per normalization unit - Global/Generic Instr: - rst: The total number of global & generic memory instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Global/Generic Read: - rst: The total number of global & generic memory read instructions executed on all - :doc:`compute units ` on the accelerator, per :ref:`normalization - unit `. - unit: Instructions per normalization unit - Global/Generic Write: - rst: The total number of global & generic memory write instructions executed on - all :doc:`compute units ` on the accelerator, per :ref:`normalization - unit `. - unit: Instructions per normalization unit - Global/Generic Atomic: - rst: The total number of global & generic memory atomic (with and without return) - instructions executed on all :doc:`compute units ` on the accelerator, - per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Instr: - rst: The total number of spill/stack memory instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Read: - rst: The total number of spill/stack memory read instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Write: - rst: The total number of spill/stack memory write instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Atomic: - rst: The total number of spill/stack memory atomic (with and without return) instructions - executed on all :doc:`compute units ` on the accelerator, per - :ref:`normalization unit `. Typically unused as these - memory operations are typically used to implement thread-local storage. 
+ MFMA-F64: + rst: The total number of 64-bit floating-point :ref:`MFMA ` instructions + issued per :ref:`normalization unit `. unit: Instructions per normalization unit MFMA-I8: rst: The total number of 8-bit integer :ref:`MFMA ` instructions issued per :ref:`normalization unit `. unit: Instructions per normalization unit - MFMA-F8: - rst: The total number of 8-bit floating point :ref:`MFMA ` instructions issued - per :ref:`normalization unit `. This is supported in AMD - Instinct MI300 series and later only. - unit: Instructions per normalization unit - MFMA-F16: - rst: The total number of 16-bit floating point :ref:`MFMA ` instructions + MFMA-F32: + rst: The total number of 32-bit floating-point :ref:`MFMA ` instructions issued per :ref:`normalization unit `. unit: Instructions per normalization unit MFMA-BF16: rst: The total number of 16-bit brain floating point :ref:`MFMA ` instructions issued per :ref:`normalization unit `. unit: Instructions per normalization unit - MFMA-F32: - rst: The total number of 32-bit floating-point :ref:`MFMA ` instructions + MFMA-F16: + rst: The total number of 16-bit floating point :ref:`MFMA ` instructions issued per :ref:`normalization unit `. unit: Instructions per normalization unit - MFMA-F64: - rst: The total number of 64-bit floating-point :ref:`MFMA ` instructions - issued per :ref:`normalization unit `. + MFMA-F8: + rst: The total number of 8-bit floating point :ref:`MFMA ` instructions issued + per :ref:`normalization unit `. This is supported in AMD + Instinct MI300 series and later only. unit: Instructions per normalization unit Compute Speed-of-Light: - VALU FLOPs: - rst: 'The total floating-point operations executed per second on the :ref:`VALU - `. This is also presented as a percent of the peak theoretical FLOPs - achievable on the specific accelerator. Note: this does not include any floating-point - operations from :ref:`MFMA ` instructions.' 
- unit: GFLOPs - VALU IOPs: - rst: 'The total integer operations executed per second on the :ref:`VALU `. - This is also presented as a percent of the peak theoretical IOPs achievable - on the specific accelerator. Note: this does not include any integer operations - from :ref:`MFMA ` instructions.' - unit: GIOPs MFMA FLOPs (BF16): rst: 'The total number of 16-bit brain floating point :ref:`MFMA ` operations executed per second. Note: this does not include any 16-bit brain floating @@ -711,13 +238,6 @@ Compute Speed-of-Light: as a percent of the peak theoretical BF16 MFMA operations achievable on the specific accelerator.' unit: GFLOPs - MFMA FLOPs (F16): - rst: 'The total number of 16-bit floating point :ref:`MFMA ` operations - executed per second. Note: this does not include any 16-bit floating point - operations from :ref:`VALU ` instructions. This is also presented - as a percent of the peak theoretical F16 MFMA operations achievable on the - specific accelerator.' - unit: GFLOPs MFMA FLOPs (F32): rst: 'The total number of 32-bit floating point :ref:`MFMA ` operations executed per second. Note: this does not include any 32-bit floating point @@ -742,180 +262,48 @@ Compute Speed-of-Light: ` instructions. This is also presented as a percent of the peak theoretical INT8 MFMA operations achievable on the specific accelerator.' unit: GFLOPs - IPC: - rst: The ratio of the total number of instructions executed on the :doc:`CU ` - over the :ref:`total active CU cycles `. - unit: Instructions per cycle - IPC (Issued): - rst: The ratio of the total number of (non-:ref:`internal `) - instructions issued over the number of cycles where the :ref:`scheduler ` - was actively working on issuing instructions. Refer to the :ref:`Issued IPC - ` example for further detail. - unit: Instructions per cycle - SALU Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`SALU ` - was busy executing instructions. 
Computed as the ratio of the total number - of cycles spent by the :ref:`scheduler ` issuing SALU / :ref:`SMEM - ` instructions over the :ref:`total CU cycles `. - unit: Percent - VALU Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`VALU ` - was busy executing instructions. Does not include :ref:`VMEM ` operations. - Computed as the ratio of the total number of cycles spent by the :ref:`scheduler - ` issuing VALU instructions over the :ref:`total CU cycles - `. - unit: Percent - VMEM Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`VMEM ` - unit was busy executing instructions, including both global/generic and spill/scratch - operations (see the :ref:`VMEM instruction count metrics ` - for more detail). Does not include :ref:`VALU ` operations. Computed as - the ratio of the total number of cycles spent by the :ref:`scheduler ` - issuing VMEM instructions over the :ref:`total CU cycles `. - unit: Percent - Branch Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`branch ` - unit was busy executing instructions. Computed as the ratio of the total number - of cycles spent by the :ref:`scheduler ` issuing branch instructions - over the :ref:`total CU cycles `. - unit: Percent - VALU Active Threads: - rst: Indicates the average level of :ref:`divergence ` within a - wavefront over the lifetime of the kernel. The number of work-items that were - active in a wavefront during execution of each :ref:`VALU ` instruction, - time-averaged over all VALU instructions run on all wavefronts in the kernel. - unit: Work-items - MFMA Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`MFMA ` - unit was busy executing instructions. Computed as the ratio of the total number - of cycles spent by the :ref:`MFMA ` was busy over the :ref:`total - CU cycles `. 
- unit: Percent - MFMA Instruction Cycles: - rst: The average duration of :ref:`MFMA ` instructions in this kernel - in cycles. Computed as the ratio of the total number of cycles the MFMA unit - was busy over the total number of MFMA instructions. Compare to, for example, - the `AMD Matrix Instruction Calculator `_. - unit: Cycles per instruction - VMEM Latency: - rst: The average number of round-trip cycles (that is, from issue to data return - / acknowledgment) required for a VMEM instruction to complete. - unit: Cycles - SMEM Latency: - rst: The average number of round-trip cycles (that is, from issue to data return - / acknowledgment) required for a SMEM instruction to complete. - unit: Cycles - FLOPs (Total): - rst: The total number of floating-point operations executed on either the :ref:`VALU - ` or :ref:`MFMA ` units, per :ref:`normalization unit - `. - unit: FLOP per normalization unit - IOPs (Total): - rst: The total number of integer operations executed on either the :ref:`VALU - ` or :ref:`MFMA ` units, per :ref:`normalization unit - `. - unit: IOP per normalization unit - F16 OPs: - rst: The total number of 16-bit floating-point operations executed on either the :ref:`VALU - ` or :ref:`MFMA ` units, per :ref:`normalization unit - `. - unit: FLOP per normalization unit - BF16 OPs: - rst: 'The total number of 16-bit brain floating-point operations executed on either - the :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization - unit `. Note: on current CDNA accelerators, the VALU has - no native BF16 instructions.' - unit: FLOP per normalization unit - F32 OPs: - rst: The total number of 32-bit floating-point operations executed on either the - :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization - unit `. - unit: FLOP per normalization unit - F64 OPs: - rst: The total number of 64-bit floating-point operations executed on either the - :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization - unit `. 
- unit: FLOP per normalization unit - INT8 OPs: - rst: 'The total number of 8-bit integer operations executed on either the :ref:`VALU - ` or :ref:`MFMA ` units, per :ref:`normalization unit - `. Note: on current CDNA accelerators, the VALU has no - native INT8 instructions.' - unit: IOP per normalization unit + VALU FLOPs: + rst: 'The total floating-point operations executed per second on the :ref:`VALU + `. This is also presented as a percent of the peak theoretical FLOPs + achievable on the specific accelerator. Note: this does not include any floating-point + operations from :ref:`MFMA ` instructions.' + unit: GFLOPs + VALU IOPs: + rst: 'The total integer operations executed per second on the :ref:`VALU `. + This is also presented as a percent of the peak theoretical IOPs achievable + on the specific accelerator. Note: this does not include any integer operations + from :ref:`MFMA ` instructions.' + unit: GIOPs + MFMA FLOPs (F16): + rst: 'The total number of 16-bit floating point :ref:`MFMA ` operations + executed per second. Note: this does not include any 16-bit floating point + operations from :ref:`VALU ` instructions. This is also presented + as a percent of the peak theoretical F16 MFMA operations achievable on the + specific accelerator.' + unit: GFLOPs Pipeline statistics: - VALU FLOPs: - rst: 'The total floating-point operations executed per second on the :ref:`VALU - `. This is also presented as a percent of the peak theoretical FLOPs - achievable on the specific accelerator. Note: this does not include any floating-point - operations from :ref:`MFMA ` instructions.' - unit: GFLOPs - VALU IOPs: - rst: 'The total integer operations executed per second on the :ref:`VALU `. - This is also presented as a percent of the peak theoretical IOPs achievable - on the specific accelerator. Note: this does not include any integer operations - from :ref:`MFMA ` instructions.' 
- unit: GIOPs - MFMA FLOPs (BF16): - rst: 'The total number of 16-bit brain floating point :ref:`MFMA ` operations - executed per second. Note: this does not include any 16-bit brain floating - point operations from :ref:`VALU ` instructions. This is also presented - as a percent of the peak theoretical BF16 MFMA operations achievable on the - specific accelerator.' - unit: GFLOPs - MFMA FLOPs (F16): - rst: 'The total number of 16-bit floating point :ref:`MFMA ` operations - executed per second. Note: this does not include any 16-bit floating point - operations from :ref:`VALU ` instructions. This is also presented - as a percent of the peak theoretical F16 MFMA operations achievable on the - specific accelerator.' - unit: GFLOPs - MFMA FLOPs (F32): - rst: 'The total number of 32-bit floating point :ref:`MFMA ` operations - executed per second. Note: this does not include any 32-bit floating point - operations from :ref:`VALU ` instructions. This is also presented - as a percent of the peak theoretical F32 MFMA operations achievable on the - specific accelerator.' - unit: GFLOPs - MFMA FLOPs (F64): - rst: 'The total number of 64-bit floating point :ref:`MFMA ` operations - executed per second. Note: this does not include any 64-bit floating point - operations from :ref:`VALU ` instructions. This is also presented - as a percent of the peak theoretical F64 MFMA operations achievable on the - specific accelerator. The total number of 64-bit floating point :ref:`MFMA - ` operations executed per second. Note: this does not include any - 64-bit floating point operations from :ref:`VALU ` instructions. - This is also presented as a percent of the peak theoretical F64 MFMA operations - achievable on the specific accelerator.' - unit: GFLOPs - MFMA IOPs (INT8): - rst: 'The total number of 8-bit integer :ref:`MFMA ` operations executed - per second. Note: this does not include any 8-bit integer operations from :ref:`VALU - ` instructions. 
This is also presented as a percent of the peak - theoretical INT8 MFMA operations achievable on the specific accelerator.' - unit: GFLOPs IPC: rst: The ratio of the total number of instructions executed on the :doc:`CU ` over the :ref:`total active CU cycles `. unit: Instructions per cycle - IPC (Issued): - rst: The ratio of the total number of (non-:ref:`internal `) - instructions issued over the number of cycles where the :ref:`scheduler ` - was actively working on issuing instructions. Refer to the :ref:`Issued IPC - ` example for further detail. - unit: Instructions per cycle + Branch Utilization: + rst: Indicates what percent of the kernel's duration the :ref:`branch ` + unit was busy executing instructions. Computed as the ratio of the total number + of cycles spent by the :ref:`scheduler ` issuing branch instructions + over the :ref:`total CU cycles `. + unit: Percent SALU Utilization: rst: Indicates what percent of the kernel's duration the :ref:`SALU ` was busy executing instructions. Computed as the ratio of the total number of cycles spent by the :ref:`scheduler ` issuing SALU / :ref:`SMEM ` instructions over the :ref:`total CU cycles `. unit: Percent - VALU Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`VALU ` - was busy executing instructions. Does not include :ref:`VMEM ` operations. - Computed as the ratio of the total number of cycles spent by the :ref:`scheduler - ` issuing VALU instructions over the :ref:`total CU cycles - `. - unit: Percent + MFMA Instruction Cycles: + rst: The average duration of :ref:`MFMA ` instructions in this kernel + in cycles. Computed as the ratio of the total number of cycles the MFMA unit + was busy over the total number of MFMA instructions. Compare to, for example, + the `AMD Matrix Instruction Calculator `_. 
+ unit: Cycles per instruction VMEM Utilization: rst: Indicates what percent of the kernel's duration the :ref:`VMEM ` unit was busy executing instructions, including both global/generic and spill/scratch @@ -924,201 +312,47 @@ Pipeline statistics: the ratio of the total number of cycles spent by the :ref:`scheduler ` issuing VMEM instructions over the :ref:`total CU cycles `. unit: Percent - Branch Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`branch ` - unit was busy executing instructions. Computed as the ratio of the total number - of cycles spent by the :ref:`scheduler ` issuing branch instructions - over the :ref:`total CU cycles `. - unit: Percent - VALU Active Threads: - rst: Indicates the average level of :ref:`divergence ` within a - wavefront over the lifetime of the kernel. The number of work-items that were - active in a wavefront during execution of each :ref:`VALU ` instruction, - time-averaged over all VALU instructions run on all wavefronts in the kernel. - unit: Work-items + SMEM Latency: + rst: The average number of round-trip cycles (that is, from issue to data return + / acknowledgment) required for a SMEM instruction to complete. + unit: Cycles MFMA Utilization: rst: Indicates what percent of the kernel's duration the :ref:`MFMA ` unit was busy executing instructions. Computed as the ratio of the total number of cycles spent by the :ref:`MFMA ` was busy over the :ref:`total CU cycles `. unit: Percent - MFMA Instruction Cycles: - rst: The average duration of :ref:`MFMA ` instructions in this kernel - in cycles. Computed as the ratio of the total number of cycles the MFMA unit - was busy over the total number of MFMA instructions. Compare to, for example, - the `AMD Matrix Instruction Calculator `_. - unit: Cycles per instruction VMEM Latency: rst: The average number of round-trip cycles (that is, from issue to data return / acknowledgment) required for a VMEM instruction to complete. 
unit: Cycles - SMEM Latency: - rst: The average number of round-trip cycles (that is, from issue to data return - / acknowledgment) required for a SMEM instruction to complete. - unit: Cycles - FLOPs (Total): - rst: The total number of floating-point operations executed on either the :ref:`VALU - ` or :ref:`MFMA ` units, per :ref:`normalization unit - `. - unit: FLOP per normalization unit - IOPs (Total): - rst: The total number of integer operations executed on either the :ref:`VALU - ` or :ref:`MFMA ` units, per :ref:`normalization unit - `. - unit: IOP per normalization unit - F16 OPs: - rst: The total number of 16-bit floating-point operations executed on either the :ref:`VALU - ` or :ref:`MFMA ` units, per :ref:`normalization unit - `. - unit: FLOP per normalization unit - BF16 OPs: - rst: 'The total number of 16-bit brain floating-point operations executed on either - the :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization - unit `. Note: on current CDNA accelerators, the VALU has - no native BF16 instructions.' - unit: FLOP per normalization unit - F32 OPs: - rst: The total number of 32-bit floating-point operations executed on either the - :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization - unit `. - unit: FLOP per normalization unit - F64 OPs: - rst: The total number of 64-bit floating-point operations executed on either the - :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization - unit `. - unit: FLOP per normalization unit - INT8 OPs: - rst: 'The total number of 8-bit integer operations executed on either the :ref:`VALU - ` or :ref:`MFMA ` units, per :ref:`normalization unit - `. Note: on current CDNA accelerators, the VALU has no - native INT8 instructions.' - unit: IOP per normalization unit + VALU Utilization: + rst: Indicates what percent of the kernel's duration the :ref:`VALU ` + was busy executing instructions. Does not include :ref:`VMEM ` operations. 
+ Computed as the ratio of the total number of cycles spent by the :ref:`scheduler + ` issuing VALU instructions over the :ref:`total CU cycles + `. + unit: Percent + IPC (Issued): + rst: The ratio of the total number of (non-:ref:`internal `) + instructions issued over the number of cycles where the :ref:`scheduler ` + was actively working on issuing instructions. Refer to the :ref:`Issued IPC + ` example for further detail. + unit: Instructions per cycle + VALU Active Threads: + rst: Indicates the average level of :ref:`divergence ` within a + wavefront over the lifetime of the kernel. The number of work-items that were + active in a wavefront during execution of each :ref:`VALU ` instruction, + time-averaged over all VALU instructions run on all wavefronts in the kernel. + unit: Work-items Arithmetic operations: - VALU FLOPs: - rst: 'The total floating-point operations executed per second on the :ref:`VALU - `. This is also presented as a percent of the peak theoretical FLOPs - achievable on the specific accelerator. Note: this does not include any floating-point - operations from :ref:`MFMA ` instructions.' - unit: GFLOPs - VALU IOPs: - rst: 'The total integer operations executed per second on the :ref:`VALU `. - This is also presented as a percent of the peak theoretical IOPs achievable - on the specific accelerator. Note: this does not include any integer operations - from :ref:`MFMA ` instructions.' - unit: GIOPs - MFMA FLOPs (BF16): - rst: 'The total number of 16-bit brain floating point :ref:`MFMA ` operations - executed per second. Note: this does not include any 16-bit brain floating - point operations from :ref:`VALU ` instructions. This is also presented - as a percent of the peak theoretical BF16 MFMA operations achievable on the - specific accelerator.' - unit: GFLOPs - MFMA FLOPs (F16): - rst: 'The total number of 16-bit floating point :ref:`MFMA ` operations - executed per second. 
Note: this does not include any 16-bit floating point - operations from :ref:`VALU ` instructions. This is also presented - as a percent of the peak theoretical F16 MFMA operations achievable on the - specific accelerator.' - unit: GFLOPs - MFMA FLOPs (F32): - rst: 'The total number of 32-bit floating point :ref:`MFMA ` operations - executed per second. Note: this does not include any 32-bit floating point - operations from :ref:`VALU ` instructions. This is also presented - as a percent of the peak theoretical F32 MFMA operations achievable on the - specific accelerator.' - unit: GFLOPs - MFMA FLOPs (F64): - rst: 'The total number of 64-bit floating point :ref:`MFMA ` operations - executed per second. Note: this does not include any 64-bit floating point - operations from :ref:`VALU ` instructions. This is also presented - as a percent of the peak theoretical F64 MFMA operations achievable on the - specific accelerator. The total number of 64-bit floating point :ref:`MFMA - ` operations executed per second. Note: this does not include any - 64-bit floating point operations from :ref:`VALU ` instructions. - This is also presented as a percent of the peak theoretical F64 MFMA operations - achievable on the specific accelerator.' - unit: GFLOPs - MFMA IOPs (INT8): - rst: 'The total number of 8-bit integer :ref:`MFMA ` operations executed - per second. Note: this does not include any 8-bit integer operations from :ref:`VALU - ` instructions. This is also presented as a percent of the peak - theoretical INT8 MFMA operations achievable on the specific accelerator.' - unit: GFLOPs - IPC: - rst: The ratio of the total number of instructions executed on the :doc:`CU ` - over the :ref:`total active CU cycles `. - unit: Instructions per cycle - IPC (Issued): - rst: The ratio of the total number of (non-:ref:`internal `) - instructions issued over the number of cycles where the :ref:`scheduler ` - was actively working on issuing instructions. 
Refer to the :ref:`Issued IPC - ` example for further detail. - unit: Instructions per cycle - SALU Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`SALU ` - was busy executing instructions. Computed as the ratio of the total number - of cycles spent by the :ref:`scheduler ` issuing SALU / :ref:`SMEM - ` instructions over the :ref:`total CU cycles `. - unit: Percent - VALU Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`VALU ` - was busy executing instructions. Does not include :ref:`VMEM ` operations. - Computed as the ratio of the total number of cycles spent by the :ref:`scheduler - ` issuing VALU instructions over the :ref:`total CU cycles - `. - unit: Percent - VMEM Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`VMEM ` - unit was busy executing instructions, including both global/generic and spill/scratch - operations (see the :ref:`VMEM instruction count metrics ` - for more detail). Does not include :ref:`VALU ` operations. Computed as - the ratio of the total number of cycles spent by the :ref:`scheduler ` - issuing VMEM instructions over the :ref:`total CU cycles `. - unit: Percent - Branch Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`branch ` - unit was busy executing instructions. Computed as the ratio of the total number - of cycles spent by the :ref:`scheduler ` issuing branch instructions - over the :ref:`total CU cycles `. - unit: Percent - VALU Active Threads: - rst: Indicates the average level of :ref:`divergence ` within a - wavefront over the lifetime of the kernel. The number of work-items that were - active in a wavefront during execution of each :ref:`VALU ` instruction, - time-averaged over all VALU instructions run on all wavefronts in the kernel. - unit: Work-items - MFMA Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`MFMA ` - unit was busy executing instructions. 
Computed as the ratio of the total number - of cycles spent by the :ref:`MFMA ` was busy over the :ref:`total - CU cycles `. - unit: Percent - MFMA Instruction Cycles: - rst: The average duration of :ref:`MFMA ` instructions in this kernel - in cycles. Computed as the ratio of the total number of cycles the MFMA unit - was busy over the total number of MFMA instructions. Compare to, for example, - the `AMD Matrix Instruction Calculator `_. - unit: Cycles per instruction - VMEM Latency: - rst: The average number of round-trip cycles (that is, from issue to data return - / acknowledgment) required for a VMEM instruction to complete. - unit: Cycles - SMEM Latency: - rst: The average number of round-trip cycles (that is, from issue to data return - / acknowledgment) required for a SMEM instruction to complete. - unit: Cycles - FLOPs (Total): - rst: The total number of floating-point operations executed on either the :ref:`VALU - ` or :ref:`MFMA ` units, per :ref:`normalization unit - `. - unit: FLOP per normalization unit IOPs (Total): rst: The total number of integer operations executed on either the :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization unit `. unit: IOP per normalization unit - F16 OPs: - rst: The total number of 16-bit floating-point operations executed on either the :ref:`VALU + FLOPs (Total): + rst: The total number of floating-point operations executed on either the :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization unit `. unit: FLOP per normalization unit @@ -1128,29 +362,28 @@ Arithmetic operations: unit `. Note: on current CDNA accelerators, the VALU has no native BF16 instructions.' unit: FLOP per normalization unit + F16 OPs: + rst: The total number of 16-bit floating-point operations executed on either the :ref:`VALU + ` or :ref:`MFMA ` units, per :ref:`normalization unit + `. 
+ unit: FLOP per normalization unit F32 OPs: rst: The total number of 32-bit floating-point operations executed on either the :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization unit `. unit: FLOP per normalization unit - F64 OPs: - rst: The total number of 64-bit floating-point operations executed on either the - :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization - unit `. - unit: FLOP per normalization unit INT8 OPs: rst: 'The total number of 8-bit integer operations executed on either the :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization unit `. Note: on current CDNA accelerators, the VALU has no native INT8 instructions.' unit: IOP per normalization unit + F64 OPs: + rst: The total number of 64-bit floating-point operations executed on either the + :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization + unit `. + unit: FLOP per normalization unit LDS Speed-of-Light: - Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`LDS ` was - actively executing instructions (including, but not limited to, load, store, - atomic and HIP's ``__shfl`` operations). Calculated as the ratio of the total - number of cycles LDS was active over the :ref:`total CU cycles `. - unit: Percent Access Rate: rst: Indicates the percentage of SIMDs in the :ref:`VALU ` [#lds-workload]_ actively issuing LDS instructions, averaged over the lifetime of the kernel. @@ -1158,6 +391,12 @@ LDS Speed-of-Light: ` issuing :ref:`LDS ` instructions over the :ref:`total CU cycles `. unit: Percent + Bank Conflict Rate: + rst: Indicates the percentage of active LDS cycles that were spent servicing bank + conflicts. Calculated as the ratio of LDS cycles spent servicing bank conflicts + over the number of LDS cycles that would have been required to move the same + amount of data in an uncontended access. 
[#lds-bank-conflict]_ + unit: Percent Theoretical Bandwidth: rst: Indicates the maximum amount of bytes that could have been loaded from, stored to, or atomically updated in the LDS per :ref:`normalization unit `. @@ -1165,67 +404,17 @@ LDS Speed-of-Light: was executed. See the :ref:`LDS bandwidth example ` for more detail. unit: Bytes per normalization unit - Bank Conflict Rate: - rst: Indicates the percentage of active LDS cycles that were spent servicing bank - conflicts. Calculated as the ratio of LDS cycles spent servicing bank conflicts - over the number of LDS cycles that would have been required to move the same - amount of data in an uncontended access. [#lds-bank-conflict]_ + Utilization: + rst: Indicates what percent of the kernel's duration the :ref:`LDS ` was + actively executing instructions (including, but not limited to, load, store, + atomic and HIP's ``__shfl`` operations). Calculated as the ratio of the total + number of cycles LDS was active over the :ref:`total CU cycles `. unit: Percent - LDS Instructions: - rst: The total number of LDS instructions (including, but not limited to, read/write/atomics - and HIP's ``__shfl`` instructions) executed per :ref:`normalization unit `. - unit: Instructions per normalization unit - LDS Latency: - rst: The average number of round-trip cycles (i.e., from issue to data-return / - acknowledgment) required for an LDS instruction to complete. - unit: Cycles - Bank Conflicts/Access: - rst: The ratio of the number of cycles spent in the :ref:`LDS scheduler ` - due to bank conflicts (as determined by the conflict resolution hardware) to - the base number of cycles that would be spent in the LDS scheduler in a completely - uncontended case. This is the unnormalized form of the Bank Conflict Rate. - unit: Conflicts per Access - Index Accesses: - rst: The total number of cycles spent in the :ref:`LDS scheduler ` over - all operations per :ref:`normalization unit `. 
- unit: Cycles per normalization unit - Atomic Return Cycles: - rst: The total number of cycles spent on LDS atomics with return per :ref:`normalization - unit `. - unit: Cycles per normalization unit - Bank Conflict: - rst: The total number of cycles spent in the :ref:`LDS scheduler ` due - to bank conflicts (as determined by the conflict resolution hardware) per :ref:`normalization - unit `. - unit: Cycles per normalization unit - Addr Conflict: - rst: The total number of cycles spent in the :ref:`LDS scheduler ` due - to address conflicts (as determined by the conflict resolution hardware) per - :ref:`normalization unit `. - unit: Cycles per normalization unit - Unaligned Stall: - rst: The total number of cycles spent in the :ref:`LDS scheduler ` due - to stalls from non-dword aligned addresses per :ref:`normalization unit `. - unit: Cycles per normalization unit - Mem Violations: - rst: "The total number of out-of-bounds accesses made to the LDS, per :ref:`normalization\ - \ unit `. This is unused and expected to be zero in most\ - \ configurations for modern CDNA\u2122 accelerators." - unit: Accesses per normalization unit LDS Statistics: - Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`LDS ` was - actively executing instructions (including, but not limited to, load, store, - atomic and HIP's ``__shfl`` operations). Calculated as the ratio of the total - number of cycles LDS was active over the :ref:`total CU cycles `. - unit: Percent - Access Rate: - rst: Indicates the percentage of SIMDs in the :ref:`VALU ` [#lds-workload]_ - actively issuing LDS instructions, averaged over the lifetime of the kernel. - Calculated as the ratio of the total number of cycles spent by the :ref:`scheduler - ` issuing :ref:`LDS ` instructions over the :ref:`total - CU cycles `. - unit: Percent + Index Accesses: + rst: The total number of cycles spent in the :ref:`LDS scheduler ` over + all operations per :ref:`normalization unit `. 
+ unit: Cycles per normalization unit Theoretical Bandwidth: rst: Indicates the maximum amount of bytes that could have been loaded from, stored to, or atomically updated in the LDS per :ref:`normalization unit `. @@ -1233,53 +422,43 @@ LDS Statistics: was executed. See the :ref:`LDS bandwidth example ` for more detail. unit: Bytes per normalization unit - Bank Conflict Rate: - rst: Indicates the percentage of active LDS cycles that were spent servicing bank - conflicts. Calculated as the ratio of LDS cycles spent servicing bank conflicts - over the number of LDS cycles that would have been required to move the same - amount of data in an uncontended access. [#lds-bank-conflict]_ - unit: Percent - LDS Instructions: - rst: The total number of LDS instructions (including, but not limited to, read/write/atomics - and HIP's ``__shfl`` instructions) executed per :ref:`normalization unit `. - unit: Instructions per normalization unit - LDS Latency: - rst: The average number of round-trip cycles (i.e., from issue to data-return / - acknowledgment) required for an LDS instruction to complete. - unit: Cycles Bank Conflicts/Access: rst: The ratio of the number of cycles spent in the :ref:`LDS scheduler ` due to bank conflicts (as determined by the conflict resolution hardware) to the base number of cycles that would be spent in the LDS scheduler in a completely uncontended case. This is the unnormalized form of the Bank Conflict Rate. unit: Conflicts per Access - Index Accesses: - rst: The total number of cycles spent in the :ref:`LDS scheduler ` over - all operations per :ref:`normalization unit `. - unit: Cycles per normalization unit - Atomic Return Cycles: - rst: The total number of cycles spent on LDS atomics with return per :ref:`normalization - unit `. + LDS Instructions: + rst: The total number of LDS instructions (including, but not limited to, read/write/atomics + and HIP's ``__shfl`` instructions) executed per :ref:`normalization unit `. 
+ unit: Instructions per normalization unit + Unaligned Stall: + rst: The total number of cycles spent in the :ref:`LDS scheduler ` due + to stalls from non-dword aligned addresses per :ref:`normalization unit `. unit: Cycles per normalization unit Bank Conflict: rst: The total number of cycles spent in the :ref:`LDS scheduler ` due to bank conflicts (as determined by the conflict resolution hardware) per :ref:`normalization unit `. unit: Cycles per normalization unit + LDS Latency: + rst: The average number of round-trip cycles (i.e., from issue to data-return / + acknowledgment) required for an LDS instruction to complete. + unit: Cycles + Mem Violations: + rst: "The total number of out-of-bounds accesses made to the LDS, per :ref:`normalization\ + \ unit `. This is unused and expected to be zero in most\ + \ configurations for modern CDNA\u2122 accelerators." + unit: Accesses per normalization unit + Atomic Return Cycles: + rst: The total number of cycles spent on LDS atomics with return per :ref:`normalization + unit `. + unit: Cycles per normalization unit Addr Conflict: rst: The total number of cycles spent in the :ref:`LDS scheduler ` due to address conflicts (as determined by the conflict resolution hardware) per :ref:`normalization unit `. unit: Cycles per normalization unit - Unaligned Stall: - rst: The total number of cycles spent in the :ref:`LDS scheduler ` due - to stalls from non-dword aligned addresses per :ref:`normalization unit `. - unit: Cycles per normalization unit - Mem Violations: - rst: "The total number of out-of-bounds accesses made to the LDS, per :ref:`normalization\ - \ unit `. This is unused and expected to be zero in most\ - \ configurations for modern CDNA\u2122 accelerators." 
- unit: Accesses per normalization unit vL1D Speed-of-Light: Hit rate: rst: The ratio of the number of vL1D cache line requests that hit [#vl1d-hit]_ in @@ -1294,11 +473,6 @@ vL1D Speed-of-Light: not consider partial requests, so for instance, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. unit: Percent - Utilization: - rst: Indicates how busy the :ref:`vL1D Cache RAM ` was during the kernel - execution. The number of cycles where the vL1D Cache RAM is actively processing - any request divided by the number of cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent Coalescing: rst: Indicates how well memory instructions were coalesced by the :ref:`address processing unit `, ranging from uncoalesced (25%) to fully coalesced @@ -1306,185 +480,12 @@ vL1D Speed-of-Light: generated per instruction divided by the ideal number of thread-requests per instruction. unit: Percent - Stalled on L2 Data: - rst: The ratio of the number of cycles where the vL1D is stalled waiting for requested - data to return from the :doc:`L2 cache ` divided by the number of - cycles where the vL1D is active [#vl1d-activity]_. + Utilization: + rst: Indicates how busy the :ref:`vL1D Cache RAM ` was during the kernel + execution. The number of cycles where the vL1D Cache RAM is actively processing + any request divided by the number of cycles where the vL1D is active [#vl1d-activity]_. unit: Percent - Stalled on L2 Req: - rst: The ratio of the number of cycles where the vL1D is stalled waiting to issue - a request for data to the :doc:`L2 cache ` divided by the number - of cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Tag RAM Stall (Read): - rst: The ratio of the number of cycles where the vL1D is stalled due to Read requests - with conflicting tags being looked up concurrently, divided by the number of - cycles where the vL1D is active [#vl1d-activity]_. 
- unit: Percent - Tag RAM Stall (Write): - rst: The ratio of the number of cycles where the vL1D is stalled due to Write - requests with conflicting tags being looked up concurrently, divided by the - number of cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Tag RAM Stall (Atomic): - rst: The ratio of the number of cycles where the vL1D is stalled due to Atomic - requests with conflicting tags being looked up concurrently, divided by the - number of cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Total Req: - rst: The total number of incoming requests from the :ref:`address processing - unit ` after coalescing. - unit: Requests - Read Req: - rst: The total number of incoming read requests from the :ref:`address processing - unit ` after coalescing per :ref:`normalization unit ` - unit: Requests per normalization unit - Write Req: - rst: The total number of incoming write requests from the :ref:`address processing - unit ` after coalescing per :ref:`normalization unit ` - unit: Requests per normalization unit - Atomic Req: - rst: The total number of incoming atomic requests from the :ref:`address processing - unit ` after coalescing per :ref:`normalization unit ` - unit: Requests per normalization unit - Cache BW: - rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM - ` instructions per :ref:`normalization unit `. The - number of bytes is calculated as the number of cache lines requested multiplied - by the cache line size. This value does not consider partial requests, so - for instance, if only a single value is requested in a cache line, the data movement - will still be counted as a full cache line. - unit: Bytes per normalization unit - Cache Hit Rate: - rst: The ratio of the number of vL1D cache line requests that hit in vL1D cache - over the total number of cache line requests to the :ref:`vL1D Cache RAM `. 
- unit: Percent - Cache Accesses: - rst: The total number of cache line lookups in the vL1D. - unit: Cache lines - Cache Hits: - rst: The number of cache accesses minus the number of outgoing requests to the :doc:`L2 - cache `, that is, the number of cache line requests serviced by the - :ref:`vL1D Cache RAM ` per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Invalidations: - rst: The number of times the vL1D was issued a write-back invalidate command during - the kernel's execution per :ref:`normalization unit `. This - may be triggered by, for instance, the ``buffer_wbinvl1`` instruction. - unit: Invalidations per normalization unit - L1-L2 BW: - rst: The number of bytes transferred across the vL1D-L2 interface as a result of - :ref:`VMEM ` instructions, per :ref:`normalization unit `. - The number of bytes is calculated as the number of cache lines requested multiplied - by the cache line size. This value does not consider partial requests, so for instance, - if only a single value is requested in a cache line, the data movement will - still be counted as a full cache line. - unit: Bytes per normalization unit - L1-L2 Read: - rst: The number of read requests for a vL1D cache line that were not satisfied by - the vL1D and must be retrieved from the to the :doc:`L2 Cache ` per :ref:`normalization - unit `. - unit: Requests per normalization unit - L1-L2 Write: - rst: The number of write requests to a vL1D cache line that were sent through the - vL1D to the :doc:`L2 cache `, per :ref:`normalization unit `. - unit: Requests per normalization unit - L1-L2 Atomic: - rst: The number of atomic requests that are sent through the vL1D to the :doc:`L2 - cache `, per :ref:`normalization unit `. This - includes requests for atomics with, and without return. - unit: Requests per normalization unit - L1 Access Latency: - rst: Calculated as the average number of cycles that a vL1D cache line request - spent in the vL1D cache pipeline. 
- unit: Cycles - L1-L2 Read Latency: - rst: Calculated as the average number of cycles that the vL1D cache took to issue - and receive read requests from the :doc:`L2 Cache `. This number - also includes requests for atomics with return values. - unit: Cycles - L1-L2 Write Latency: - rst: Calculated as the average number of cycles that the vL1D cache took to issue - and receive acknowledgement of a write request to the :doc:`L2 Cache `. - This number also includes requests for atomics without return values. - unit: Cycles - NC - Read: - rst: Total read requests with NC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - UC - Read: - rst: Total read requests with UC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - CC - Read: - rst: Total read requests with CC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - RW - Read: - rst: '' - unit: Requests per normalization unit - RW - Write: - rst: Total write requests with RW mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - NC - Write: - rst: Total write requests with NC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - UC - Write: - rst: Total write requests with UC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - CC - Write: - rst: Total write requests with CC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - NC - Atomic: - rst: Total atomic requests with NC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. 
- unit: Requests per normalization unit - UC - Atomic: - rst: Total atomic requests with UC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - CC - Atomic: - rst: Total atomic requests with CC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - RW - Atomic: - rst: Total atomic requests with RW mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - Req: - rst: The number of translation requests made to the UTCL1 per normalization unit. - unit: Requests per normalization unit - Hit Ratio: - rst: The ratio of the number of translation requests that hit in the UTCL1 divided - by the total number of translation requests made to the UTCL1. - unit: Percent - Hits: - rst: The number of translation requests that hit in the UTCL1, and could be reused, - per normalization unit. - unit: Requests per normalization unit - Translation Misses: - rst: The total number of translation requests that missed in the UTCL1 due to translation - not being present in the cache, per :ref:`normalization unit `. - unit: unit - Permission Misses: - rst: "The total number of translation requests that missed in the UTCL1 due to\ - \ a permission error, per :ref:`normalization unit `.\ - \ This is unused and expected to be zero in most configurations for modern\ - \ CDNA\u2122 accelerators." 
- unit: Requests per normalization unit Busy / stall metrics: - Address Processing Unit Busy: - rst: Percent of the :ref:`total CU cycles ` the address processor - was busy - unit: Percent - Address Stall: - rst: Percent of the :ref:`total CU cycles ` the address processor - was stalled from sending address requests further into the vL1D pipeline - unit: Percent Data Stall: rst: Percent of the :ref:`total CU cycles ` the address processor was stalled from sending write/atomic data further into the vL1D pipeline @@ -1493,121 +494,17 @@ Busy / stall metrics: rst: Percent of :ref:`total CU cycles ` the address processor was stalled waiting to send command data to the :ref:`data processor ` unit: Percent - Total Instructions: - rst: The total number of memory instructions executed by the address processer - over all compute units on the accelerator, per normalization unit. - unit: Instructions per normalization unit - Global/Generic Instructions: - rst: The total number of global & generic memory instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Global/Generic Read Instructions: - rst: The total number of global & generic memory read instructions executed on all - :doc:`compute units ` on the accelerator, per :ref:`normalization - unit `. - unit: Instructions per normalization unit - Global/Generic Write Instructions: - rst: The total number of global & generic memory write instructions executed on - all :doc:`compute units ` on the accelerator, per :ref:`normalization - unit `. - unit: Instructions per normalization unit - Global/Generic Atomic Instructions: - rst: The total number of global & generic memory atomic (with and without return) - instructions executed on all :doc:`compute units ` on the accelerator, - per :ref:`normalization unit `. 
- unit: Instructions per normalization unit - Spill/Stack Instructions: - rst: The total number of spill/stack memory instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Read Instructions: - rst: The total number of spill/stack memory read instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Write Instructions: - rst: The total number of spill/stack memory write instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Atomic Instructions: - rst: The total number of spill/stack memory atomic (with and without return) instructions - executed on all :doc:`compute units ` on the accelerator, per - :ref:`normalization unit `. Typically unused as these - memory operations are typically used to implement thread-local storage. - unit: Instructions per normalization unit - Spill/Stack Total Cycles: - rst: The number of cycles the address processing unit spent working on spill/stack - instructions, per :ref:`normalization unit `. - unit: Cycles per normalization unit - Spill/Stack Coalesced Read: - rst: The number of cycles the address processing unit spent working on coalesced - spill/stack read instructions, per :ref:`normalization unit `. - unit: Cycles per normalization unit - Spill/Stack Coalesced Write: - rst: The number of cycles the address processing unit spent working on coalesced - spill/stack write instructions, per :ref:`normalization unit `. - unit: Cycles per normalization unit - Data-Return Busy: - rst: Percent of the :ref:`total CU cycles ` the data-return unit - was busy processing or waiting on data to return to the :doc:`CU `. 
+ Address Processing Unit Busy: + rst: Percent of the :ref:`total CU cycles ` the address processor + was busy unit: Percent - "Cache RAM \u2192 Data-Return Stall": - rst: Percent of the :ref:`total CU cycles ` the data-return unit - was stalled on data to be returned from the :ref:`vL1D Cache RAM `. + Address Stall: + rst: Percent of the :ref:`total CU cycles ` the address processor + was stalled from sending address requests further into the vL1D pipeline unit: Percent - "Workgroup manager \u2192 Data-Return Stall": - rst: Percent of the :ref:`total CU cycles ` the data-return unit - was stalled by the :ref:`workgroup manager ` due to initialization - of registers as a part of launching new workgroups. - unit: Percent - Coalescable Instructions: - rst: The number of instructions submitted to the :ref:`data-return unit ` - by the :ref:`address processor ` that were found to be coalescable, - per :ref:`normalization unit `. - unit: Instructions per normalization unit - Read Instructions: - rst: The number of read instructions submitted to the :ref:`data-return unit - ` by the :ref:`address processor ` summed over all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - This is expected to be the sum of global/generic and spill/stack reads in the :ref:`address - processor `. - unit: Instructions per normalization unit - Write Instructions: - rst: The number of store instructions submitted to the :ref:`data-return unit - ` by the :ref:`address processor ` summed over all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - This is expected to be the sum of global/generic and spill/stack stores counted - by the :ref:`vL1D cache-front-end `. - unit: Instructions per normalization unit - Atomic Instructions: - rst: The number of atomic instructions submitted to the :ref:`data-return unit - ` by the :ref:`address processor ` summed over all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. 
- This is expected to be the sum of global/generic and spill/stack atomics in - the :ref:`address processor `. - unit: Instructions per normalization unit Instruction counts: - Address Processing Unit Busy: - rst: Percent of the :ref:`total CU cycles ` the address processor - was busy - unit: Percent - Address Stall: - rst: Percent of the :ref:`total CU cycles ` the address processor - was stalled from sending address requests further into the vL1D pipeline - unit: Percent - Data Stall: - rst: Percent of the :ref:`total CU cycles ` the address processor - was stalled from sending write/atomic data further into the vL1D pipeline - unit: Percent - "Data-Processor \u2192 Address Stall": - rst: Percent of :ref:`total CU cycles ` the address processor was - stalled waiting to send command data to the :ref:`data processor ` - unit: Percent - Total Instructions: - rst: The total number of memory instructions executed by the address processer - over all compute units on the accelerator, per normalization unit. - unit: Instructions per normalization unit - Global/Generic Instructions: - rst: The total number of global & generic memory instructions executed on all :doc:`compute + Spill/Stack Write Instructions: + rst: The total number of spill/stack memory write instructions executed on all :doc:`compute units ` on the accelerator, per :ref:`normalization unit `. unit: Instructions per normalization unit Global/Generic Read Instructions: @@ -1615,27 +512,23 @@ Instruction counts: :doc:`compute units ` on the accelerator, per :ref:`normalization unit `. unit: Instructions per normalization unit - Global/Generic Write Instructions: - rst: The total number of global & generic memory write instructions executed on - all :doc:`compute units ` on the accelerator, per :ref:`normalization - unit `. 
- unit: Instructions per normalization unit Global/Generic Atomic Instructions: rst: The total number of global & generic memory atomic (with and without return) instructions executed on all :doc:`compute units ` on the accelerator, per :ref:`normalization unit `. unit: Instructions per normalization unit - Spill/Stack Instructions: - rst: The total number of spill/stack memory instructions executed on all :doc:`compute + Global/Generic Instructions: + rst: The total number of global & generic memory instructions executed on all :doc:`compute units ` on the accelerator, per :ref:`normalization unit `. unit: Instructions per normalization unit Spill/Stack Read Instructions: rst: The total number of spill/stack memory read instructions executed on all :doc:`compute units ` on the accelerator, per :ref:`normalization unit `. unit: Instructions per normalization unit - Spill/Stack Write Instructions: - rst: The total number of spill/stack memory write instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. + Global/Generic Write Instructions: + rst: The total number of global & generic memory write instructions executed on + all :doc:`compute units ` on the accelerator, per :ref:`normalization + unit `. unit: Instructions per normalization unit Spill/Stack Atomic Instructions: rst: The total number of spill/stack memory atomic (with and without return) instructions @@ -1643,343 +536,33 @@ Instruction counts: :ref:`normalization unit `. Typically unused as these memory operations are typically used to implement thread-local storage. unit: Instructions per normalization unit - Spill/Stack Total Cycles: - rst: The number of cycles the address processing unit spent working on spill/stack - instructions, per :ref:`normalization unit `. 
- unit: Cycles per normalization unit - Spill/Stack Coalesced Read: - rst: The number of cycles the address processing unit spent working on coalesced - spill/stack read instructions, per :ref:`normalization unit `. - unit: Cycles per normalization unit - Spill/Stack Coalesced Write: - rst: The number of cycles the address processing unit spent working on coalesced - spill/stack write instructions, per :ref:`normalization unit `. - unit: Cycles per normalization unit - Data-Return Busy: - rst: Percent of the :ref:`total CU cycles ` the data-return unit - was busy processing or waiting on data to return to the :doc:`CU `. - unit: Percent - "Cache RAM \u2192 Data-Return Stall": - rst: Percent of the :ref:`total CU cycles ` the data-return unit - was stalled on data to be returned from the :ref:`vL1D Cache RAM `. - unit: Percent - "Workgroup manager \u2192 Data-Return Stall": - rst: Percent of the :ref:`total CU cycles ` the data-return unit - was stalled by the :ref:`workgroup manager ` due to initialization - of registers as a part of launching new workgroups. - unit: Percent - Coalescable Instructions: - rst: The number of instructions submitted to the :ref:`data-return unit ` - by the :ref:`address processor ` that were found to be coalescable, - per :ref:`normalization unit `. - unit: Instructions per normalization unit - Read Instructions: - rst: The number of read instructions submitted to the :ref:`data-return unit - ` by the :ref:`address processor ` summed over all :doc:`compute + Spill/Stack Instructions: + rst: The total number of spill/stack memory instructions executed on all :doc:`compute units ` on the accelerator, per :ref:`normalization unit `. - This is expected to be the sum of global/generic and spill/stack reads in the :ref:`address - processor `. 
unit: Instructions per normalization unit - Write Instructions: - rst: The number of store instructions submitted to the :ref:`data-return unit - ` by the :ref:`address processor ` summed over all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - This is expected to be the sum of global/generic and spill/stack stores counted - by the :ref:`vL1D cache-front-end `. - unit: Instructions per normalization unit - Atomic Instructions: - rst: The number of atomic instructions submitted to the :ref:`data-return unit - ` by the :ref:`address processor ` summed over all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - This is expected to be the sum of global/generic and spill/stack atomics in - the :ref:`address processor `. + Total Instructions: + rst: The total number of memory instructions executed by the address processer + over all compute units on the accelerator, per normalization unit. unit: Instructions per normalization unit Spill / stack metrics: - Address Processing Unit Busy: - rst: Percent of the :ref:`total CU cycles ` the address processor - was busy - unit: Percent - Address Stall: - rst: Percent of the :ref:`total CU cycles ` the address processor - was stalled from sending address requests further into the vL1D pipeline - unit: Percent - Data Stall: - rst: Percent of the :ref:`total CU cycles ` the address processor - was stalled from sending write/atomic data further into the vL1D pipeline - unit: Percent - "Data-Processor \u2192 Address Stall": - rst: Percent of :ref:`total CU cycles ` the address processor was - stalled waiting to send command data to the :ref:`data processor ` - unit: Percent - Total Instructions: - rst: The total number of memory instructions executed by the address processer - over all compute units on the accelerator, per normalization unit. 
- unit: Instructions per normalization unit - Global/Generic Instructions: - rst: The total number of global & generic memory instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Global/Generic Read Instructions: - rst: The total number of global & generic memory read instructions executed on all - :doc:`compute units ` on the accelerator, per :ref:`normalization - unit `. - unit: Instructions per normalization unit - Global/Generic Write Instructions: - rst: The total number of global & generic memory write instructions executed on - all :doc:`compute units ` on the accelerator, per :ref:`normalization - unit `. - unit: Instructions per normalization unit - Global/Generic Atomic Instructions: - rst: The total number of global & generic memory atomic (with and without return) - instructions executed on all :doc:`compute units ` on the accelerator, - per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Instructions: - rst: The total number of spill/stack memory instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Read Instructions: - rst: The total number of spill/stack memory read instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Write Instructions: - rst: The total number of spill/stack memory write instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Atomic Instructions: - rst: The total number of spill/stack memory atomic (with and without return) instructions - executed on all :doc:`compute units ` on the accelerator, per - :ref:`normalization unit `. 
Typically unused as these - memory operations are typically used to implement thread-local storage. - unit: Instructions per normalization unit - Spill/Stack Total Cycles: - rst: The number of cycles the address processing unit spent working on spill/stack - instructions, per :ref:`normalization unit `. - unit: Cycles per normalization unit Spill/Stack Coalesced Read: rst: The number of cycles the address processing unit spent working on coalesced spill/stack read instructions, per :ref:`normalization unit `. unit: Cycles per normalization unit + Spill/Stack Total Cycles: + rst: The number of cycles the address processing unit spent working on spill/stack + instructions, per :ref:`normalization unit `. + unit: Cycles per normalization unit Spill/Stack Coalesced Write: rst: The number of cycles the address processing unit spent working on coalesced spill/stack write instructions, per :ref:`normalization unit `. unit: Cycles per normalization unit - Data-Return Busy: - rst: Percent of the :ref:`total CU cycles ` the data-return unit - was busy processing or waiting on data to return to the :doc:`CU `. - unit: Percent - "Cache RAM \u2192 Data-Return Stall": - rst: Percent of the :ref:`total CU cycles ` the data-return unit - was stalled on data to be returned from the :ref:`vL1D Cache RAM `. - unit: Percent - "Workgroup manager \u2192 Data-Return Stall": - rst: Percent of the :ref:`total CU cycles ` the data-return unit - was stalled by the :ref:`workgroup manager ` due to initialization - of registers as a part of launching new workgroups. - unit: Percent - Coalescable Instructions: - rst: The number of instructions submitted to the :ref:`data-return unit ` - by the :ref:`address processor ` that were found to be coalescable, - per :ref:`normalization unit `. 
- unit: Instructions per normalization unit - Read Instructions: - rst: The number of read instructions submitted to the :ref:`data-return unit - ` by the :ref:`address processor ` summed over all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - This is expected to be the sum of global/generic and spill/stack reads in the :ref:`address - processor `. - unit: Instructions per normalization unit - Write Instructions: - rst: The number of store instructions submitted to the :ref:`data-return unit - ` by the :ref:`address processor ` summed over all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - This is expected to be the sum of global/generic and spill/stack stores counted - by the :ref:`vL1D cache-front-end `. - unit: Instructions per normalization unit - Atomic Instructions: - rst: The number of atomic instructions submitted to the :ref:`data-return unit - ` by the :ref:`address processor ` summed over all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - This is expected to be the sum of global/generic and spill/stack atomics in - the :ref:`address processor `. - unit: Instructions per normalization unit L1 Unified Translation Cache (UTCL1): - Hit rate: - rst: The ratio of the number of vL1D cache line requests that hit [#vl1d-hit]_ in - vL1D cache over the total number of cache line requests to the :ref:`vL1D Cache - RAM `. - unit: Percent - Bandwidth: - rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM - ` instructions, as a percent of the peak theoretical bandwidth achievable - on the specific accelerator. The number of bytes is calculated as the number - of cache lines requested multiplied by the cache line size. This value does - not consider partial requests, so for instance, if only a single value is requested - in a cache line, the data movement will still be counted as a full cache line. 
- unit: Percent - Utilization: - rst: Indicates how busy the :ref:`vL1D Cache RAM ` was during the kernel - execution. The number of cycles where the vL1D Cache RAM is actively processing - any request divided by the number of cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Coalescing: - rst: Indicates how well memory instructions were coalesced by the :ref:`address - processing unit `, ranging from uncoalesced (25%) to fully coalesced - (100%). Calculated as the average number of :ref:`thread-requests ` - generated per instruction divided by the ideal number of thread-requests per - instruction. - unit: Percent - Stalled on L2 Data: - rst: The ratio of the number of cycles where the vL1D is stalled waiting for requested - data to return from the :doc:`L2 cache ` divided by the number of - cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Stalled on L2 Req: - rst: The ratio of the number of cycles where the vL1D is stalled waiting to issue - a request for data to the :doc:`L2 cache ` divided by the number - of cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Tag RAM Stall (Read): - rst: The ratio of the number of cycles where the vL1D is stalled due to Read requests - with conflicting tags being looked up concurrently, divided by the number of - cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Tag RAM Stall (Write): - rst: The ratio of the number of cycles where the vL1D is stalled due to Write - requests with conflicting tags being looked up concurrently, divided by the - number of cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Tag RAM Stall (Atomic): - rst: The ratio of the number of cycles where the vL1D is stalled due to Atomic - requests with conflicting tags being looked up concurrently, divided by the - number of cycles where the vL1D is active [#vl1d-activity]_. 
- unit: Percent - Total Req: - rst: The total number of incoming requests from the :ref:`address processing - unit ` after coalescing. - unit: Requests - Read Req: - rst: The total number of incoming read requests from the :ref:`address processing - unit ` after coalescing per :ref:`normalization unit ` - unit: Requests per normalization unit - Write Req: - rst: The total number of incoming write requests from the :ref:`address processing - unit ` after coalescing per :ref:`normalization unit ` - unit: Requests per normalization unit - Atomic Req: - rst: The total number of incoming atomic requests from the :ref:`address processing - unit ` after coalescing per :ref:`normalization unit ` - unit: Requests per normalization unit - Cache BW: - rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM - ` instructions per :ref:`normalization unit `. The - number of bytes is calculated as the number of cache lines requested multiplied - by the cache line size. This value does not consider partial requests, so - for instance, if only a single value is requested in a cache line, the data movement - will still be counted as a full cache line. - unit: Bytes per normalization unit - Cache Hit Rate: - rst: The ratio of the number of vL1D cache line requests that hit in vL1D cache - over the total number of cache line requests to the :ref:`vL1D Cache RAM `. - unit: Percent - Cache Accesses: - rst: The total number of cache line lookups in the vL1D. - unit: Cache lines - Cache Hits: - rst: The number of cache accesses minus the number of outgoing requests to the :doc:`L2 - cache `, that is, the number of cache line requests serviced by the - :ref:`vL1D Cache RAM ` per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Invalidations: - rst: The number of times the vL1D was issued a write-back invalidate command during - the kernel's execution per :ref:`normalization unit `. 
This - may be triggered by, for instance, the ``buffer_wbinvl1`` instruction. - unit: Invalidations per normalization unit - L1-L2 BW: - rst: The number of bytes transferred across the vL1D-L2 interface as a result of - :ref:`VMEM ` instructions, per :ref:`normalization unit `. - The number of bytes is calculated as the number of cache lines requested multiplied - by the cache line size. This value does not consider partial requests, so for instance, - if only a single value is requested in a cache line, the data movement will - still be counted as a full cache line. - unit: Bytes per normalization unit - L1-L2 Read: - rst: The number of read requests for a vL1D cache line that were not satisfied by - the vL1D and must be retrieved from the to the :doc:`L2 Cache ` per :ref:`normalization - unit `. - unit: Requests per normalization unit - L1-L2 Write: - rst: The number of write requests to a vL1D cache line that were sent through the - vL1D to the :doc:`L2 cache `, per :ref:`normalization unit `. - unit: Requests per normalization unit - L1-L2 Atomic: - rst: The number of atomic requests that are sent through the vL1D to the :doc:`L2 - cache `, per :ref:`normalization unit `. This - includes requests for atomics with, and without return. - unit: Requests per normalization unit - L1 Access Latency: - rst: Calculated as the average number of cycles that a vL1D cache line request - spent in the vL1D cache pipeline. - unit: Cycles - L1-L2 Read Latency: - rst: Calculated as the average number of cycles that the vL1D cache took to issue - and receive read requests from the :doc:`L2 Cache `. This number - also includes requests for atomics with return values. - unit: Cycles - L1-L2 Write Latency: - rst: Calculated as the average number of cycles that the vL1D cache took to issue - and receive acknowledgement of a write request to the :doc:`L2 Cache `. - This number also includes requests for atomics without return values. 
- unit: Cycles - NC - Read: - rst: Total read requests with NC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - UC - Read: - rst: Total read requests with UC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - CC - Read: - rst: Total read requests with CC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - RW - Read: - rst: '' - unit: Requests per normalization unit - RW - Write: - rst: Total write requests with RW mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - NC - Write: - rst: Total write requests with NC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - UC - Write: - rst: Total write requests with UC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - CC - Write: - rst: Total write requests with CC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - NC - Atomic: - rst: Total atomic requests with NC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - UC - Atomic: - rst: Total atomic requests with UC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - CC - Atomic: - rst: Total atomic requests with CC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - RW - Atomic: - rst: Total atomic requests with RW mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. 
- unit: Requests per normalization unit - Req: - rst: The number of translation requests made to the UTCL1 per normalization unit. + Permission Misses: + rst: "The total number of translation requests that missed in the UTCL1 due to\ + \ a permission error, per :ref:`normalization unit `.\ + \ This is unused and expected to be zero in most configurations for modern\ + \ CDNA\u2122 accelerators." unit: Requests per normalization unit Hit Ratio: rst: The ratio of the number of translation requests that hit in the UTCL1 divided @@ -1989,47 +572,14 @@ L1 Unified Translation Cache (UTCL1): rst: The number of translation requests that hit in the UTCL1, and could be reused, per normalization unit. unit: Requests per normalization unit + Req: + rst: The number of translation requests made to the UTCL1 per normalization unit. + unit: Requests per normalization unit Translation Misses: rst: The total number of translation requests that missed in the UTCL1 due to translation not being present in the cache, per :ref:`normalization unit `. unit: unit - Permission Misses: - rst: "The total number of translation requests that missed in the UTCL1 due to\ - \ a permission error, per :ref:`normalization unit `.\ - \ This is unused and expected to be zero in most configurations for modern\ - \ CDNA\u2122 accelerators." - unit: Requests per normalization unit vL1D cache stall metrics: - Hit rate: - rst: The ratio of the number of vL1D cache line requests that hit [#vl1d-hit]_ in - vL1D cache over the total number of cache line requests to the :ref:`vL1D Cache - RAM `. - unit: Percent - Bandwidth: - rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM - ` instructions, as a percent of the peak theoretical bandwidth achievable - on the specific accelerator. The number of bytes is calculated as the number - of cache lines requested multiplied by the cache line size. 
This value does - not consider partial requests, so for instance, if only a single value is requested - in a cache line, the data movement will still be counted as a full cache line. - unit: Percent - Utilization: - rst: Indicates how busy the :ref:`vL1D Cache RAM ` was during the kernel - execution. The number of cycles where the vL1D Cache RAM is actively processing - any request divided by the number of cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Coalescing: - rst: Indicates how well memory instructions were coalesced by the :ref:`address - processing unit `, ranging from uncoalesced (25%) to fully coalesced - (100%). Calculated as the average number of :ref:`thread-requests ` - generated per instruction divided by the ideal number of thread-requests per - instruction. - unit: Percent - Stalled on L2 Data: - rst: The ratio of the number of cycles where the vL1D is stalled waiting for requested - data to return from the :doc:`L2 cache ` divided by the number of - cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent Stalled on L2 Req: rst: The ratio of the number of cycles where the vL1D is stalled waiting to issue a request for data to the :doc:`L2 cache ` divided by the number @@ -2040,228 +590,31 @@ vL1D cache stall metrics: with conflicting tags being looked up concurrently, divided by the number of cycles where the vL1D is active [#vl1d-activity]_. unit: Percent - Tag RAM Stall (Write): - rst: The ratio of the number of cycles where the vL1D is stalled due to Write - requests with conflicting tags being looked up concurrently, divided by the - number of cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent Tag RAM Stall (Atomic): rst: The ratio of the number of cycles where the vL1D is stalled due to Atomic requests with conflicting tags being looked up concurrently, divided by the number of cycles where the vL1D is active [#vl1d-activity]_. 
unit: Percent - Total Req: - rst: The total number of incoming requests from the :ref:`address processing - unit ` after coalescing. - unit: Requests - Read Req: - rst: The total number of incoming read requests from the :ref:`address processing - unit ` after coalescing per :ref:`normalization unit ` - unit: Requests per normalization unit - Write Req: - rst: The total number of incoming write requests from the :ref:`address processing - unit ` after coalescing per :ref:`normalization unit ` - unit: Requests per normalization unit - Atomic Req: - rst: The total number of incoming atomic requests from the :ref:`address processing - unit ` after coalescing per :ref:`normalization unit ` - unit: Requests per normalization unit - Cache BW: - rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM - ` instructions per :ref:`normalization unit `. The - number of bytes is calculated as the number of cache lines requested multiplied - by the cache line size. This value does not consider partial requests, so - for instance, if only a single value is requested in a cache line, the data movement - will still be counted as a full cache line. - unit: Bytes per normalization unit - Cache Hit Rate: - rst: The ratio of the number of vL1D cache line requests that hit in vL1D cache - over the total number of cache line requests to the :ref:`vL1D Cache RAM `. + Tag RAM Stall (Write): + rst: The ratio of the number of cycles where the vL1D is stalled due to Write + requests with conflicting tags being looked up concurrently, divided by the + number of cycles where the vL1D is active [#vl1d-activity]_. unit: Percent - Cache Accesses: - rst: The total number of cache line lookups in the vL1D. - unit: Cache lines - Cache Hits: - rst: The number of cache accesses minus the number of outgoing requests to the :doc:`L2 - cache `, that is, the number of cache line requests serviced by the - :ref:`vL1D Cache RAM ` per :ref:`normalization unit `. 
- unit: Cache lines per normalization unit - Invalidations: - rst: The number of times the vL1D was issued a write-back invalidate command during - the kernel's execution per :ref:`normalization unit `. This - may be triggered by, for instance, the ``buffer_wbinvl1`` instruction. - unit: Invalidations per normalization unit - L1-L2 BW: - rst: The number of bytes transferred across the vL1D-L2 interface as a result of - :ref:`VMEM ` instructions, per :ref:`normalization unit `. - The number of bytes is calculated as the number of cache lines requested multiplied - by the cache line size. This value does not consider partial requests, so for instance, - if only a single value is requested in a cache line, the data movement will - still be counted as a full cache line. - unit: Bytes per normalization unit - L1-L2 Read: - rst: The number of read requests for a vL1D cache line that were not satisfied by - the vL1D and must be retrieved from the to the :doc:`L2 Cache ` per :ref:`normalization - unit `. - unit: Requests per normalization unit - L1-L2 Write: - rst: The number of write requests to a vL1D cache line that were sent through the - vL1D to the :doc:`L2 cache `, per :ref:`normalization unit `. - unit: Requests per normalization unit - L1-L2 Atomic: - rst: The number of atomic requests that are sent through the vL1D to the :doc:`L2 - cache `, per :ref:`normalization unit `. This - includes requests for atomics with, and without return. - unit: Requests per normalization unit - L1 Access Latency: - rst: Calculated as the average number of cycles that a vL1D cache line request - spent in the vL1D cache pipeline. - unit: Cycles - L1-L2 Read Latency: - rst: Calculated as the average number of cycles that the vL1D cache took to issue - and receive read requests from the :doc:`L2 Cache `. This number - also includes requests for atomics with return values. 
- unit: Cycles - L1-L2 Write Latency: - rst: Calculated as the average number of cycles that the vL1D cache took to issue - and receive acknowledgement of a write request to the :doc:`L2 Cache `. - This number also includes requests for atomics without return values. - unit: Cycles - NC - Read: - rst: Total read requests with NC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - UC - Read: - rst: Total read requests with UC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - CC - Read: - rst: Total read requests with CC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - RW - Read: - rst: '' - unit: Requests per normalization unit - RW - Write: - rst: Total write requests with RW mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - NC - Write: - rst: Total write requests with NC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - UC - Write: - rst: Total write requests with UC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - CC - Write: - rst: Total write requests with CC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - NC - Atomic: - rst: Total atomic requests with NC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - UC - Atomic: - rst: Total atomic requests with UC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. 
- unit: Requests per normalization unit - CC - Atomic: - rst: Total atomic requests with CC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - RW - Atomic: - rst: Total atomic requests with RW mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - Req: - rst: The number of translation requests made to the UTCL1 per normalization unit. - unit: Requests per normalization unit - Hit Ratio: - rst: The ratio of the number of translation requests that hit in the UTCL1 divided - by the total number of translation requests made to the UTCL1. + Stalled on L2 Data: + rst: The ratio of the number of cycles where the vL1D is stalled waiting for requested + data to return from the :doc:`L2 cache ` divided by the number of + cycles where the vL1D is active [#vl1d-activity]_. unit: Percent - Hits: - rst: The number of translation requests that hit in the UTCL1, and could be reused, - per normalization unit. - unit: Requests per normalization unit - Translation Misses: - rst: The total number of translation requests that missed in the UTCL1 due to translation - not being present in the cache, per :ref:`normalization unit `. - unit: unit - Permission Misses: - rst: "The total number of translation requests that missed in the UTCL1 due to\ - \ a permission error, per :ref:`normalization unit `.\ - \ This is unused and expected to be zero in most configurations for modern\ - \ CDNA\u2122 accelerators." - unit: Requests per normalization unit vL1D cache access metrics: - Hit rate: - rst: The ratio of the number of vL1D cache line requests that hit [#vl1d-hit]_ in - vL1D cache over the total number of cache line requests to the :ref:`vL1D Cache - RAM `. 
- unit: Percent - Bandwidth: - rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM - ` instructions, as a percent of the peak theoretical bandwidth achievable - on the specific accelerator. The number of bytes is calculated as the number - of cache lines requested multiplied by the cache line size. This value does - not consider partial requests, so for instance, if only a single value is requested - in a cache line, the data movement will still be counted as a full cache line. - unit: Percent - Utilization: - rst: Indicates how busy the :ref:`vL1D Cache RAM ` was during the kernel - execution. The number of cycles where the vL1D Cache RAM is actively processing - any request divided by the number of cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Coalescing: - rst: Indicates how well memory instructions were coalesced by the :ref:`address - processing unit `, ranging from uncoalesced (25%) to fully coalesced - (100%). Calculated as the average number of :ref:`thread-requests ` - generated per instruction divided by the ideal number of thread-requests per - instruction. - unit: Percent - Stalled on L2 Data: - rst: The ratio of the number of cycles where the vL1D is stalled waiting for requested - data to return from the :doc:`L2 cache ` divided by the number of - cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Stalled on L2 Req: - rst: The ratio of the number of cycles where the vL1D is stalled waiting to issue - a request for data to the :doc:`L2 cache ` divided by the number - of cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Tag RAM Stall (Read): - rst: The ratio of the number of cycles where the vL1D is stalled due to Read requests - with conflicting tags being looked up concurrently, divided by the number of - cycles where the vL1D is active [#vl1d-activity]_. 
- unit: Percent - Tag RAM Stall (Write): - rst: The ratio of the number of cycles where the vL1D is stalled due to Write - requests with conflicting tags being looked up concurrently, divided by the - number of cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent - Tag RAM Stall (Atomic): - rst: The ratio of the number of cycles where the vL1D is stalled due to Atomic - requests with conflicting tags being looked up concurrently, divided by the - number of cycles where the vL1D is active [#vl1d-activity]_. - unit: Percent Total Req: rst: The total number of incoming requests from the :ref:`address processing unit ` after coalescing. unit: Requests - Read Req: - rst: The total number of incoming read requests from the :ref:`address processing - unit ` after coalescing per :ref:`normalization unit ` - unit: Requests per normalization unit - Write Req: - rst: The total number of incoming write requests from the :ref:`address processing - unit ` after coalescing per :ref:`normalization unit ` - unit: Requests per normalization unit - Atomic Req: - rst: The total number of incoming atomic requests from the :ref:`address processing - unit ` after coalescing per :ref:`normalization unit ` - unit: Requests per normalization unit + Cache Hits: + rst: The number of cache accesses minus the number of outgoing requests to the :doc:`L2 + cache `, that is, the number of cache line requests serviced by the + :ref:`vL1D Cache RAM ` per :ref:`normalization unit `. + unit: Cache lines per normalization unit Cache BW: rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM ` instructions per :ref:`normalization unit `. The @@ -2270,23 +623,17 @@ vL1D cache access metrics: for instance, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. unit: Bytes per normalization unit + Cache Accesses: + rst: The total number of cache line lookups in the vL1D. 
+ unit: Cache lines Cache Hit Rate: rst: The ratio of the number of vL1D cache line requests that hit in vL1D cache over the total number of cache line requests to the :ref:`vL1D Cache RAM `. unit: Percent - Cache Accesses: - rst: The total number of cache line lookups in the vL1D. - unit: Cache lines - Cache Hits: - rst: The number of cache accesses minus the number of outgoing requests to the :doc:`L2 - cache `, that is, the number of cache line requests serviced by the - :ref:`vL1D Cache RAM ` per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Invalidations: - rst: The number of times the vL1D was issued a write-back invalidate command during - the kernel's execution per :ref:`normalization unit `. This - may be triggered by, for instance, the ``buffer_wbinvl1`` instruction. - unit: Invalidations per normalization unit + Atomic Req: + rst: The total number of incoming atomic requests from the :ref:`address processing + unit ` after coalescing per :ref:`normalization unit ` + unit: Requests per normalization unit L1-L2 BW: rst: The number of bytes transferred across the vL1D-L2 interface as a result of :ref:`VMEM ` instructions, per :ref:`normalization unit `. @@ -2295,24 +642,6 @@ vL1D cache access metrics: if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. unit: Bytes per normalization unit - L1-L2 Read: - rst: The number of read requests for a vL1D cache line that were not satisfied by - the vL1D and must be retrieved from the to the :doc:`L2 Cache ` per :ref:`normalization - unit `. - unit: Requests per normalization unit - L1-L2 Write: - rst: The number of write requests to a vL1D cache line that were sent through the - vL1D to the :doc:`L2 cache `, per :ref:`normalization unit `. - unit: Requests per normalization unit - L1-L2 Atomic: - rst: The number of atomic requests that are sent through the vL1D to the :doc:`L2 - cache `, per :ref:`normalization unit `. 
This - includes requests for atomics with, and without return. - unit: Requests per normalization unit - L1 Access Latency: - rst: Calculated as the average number of cycles that a vL1D cache line request - spent in the vL1D cache pipeline. - unit: Cycles L1-L2 Read Latency: rst: Calculated as the average number of cycles that the vL1D cache took to issue and receive read requests from the :doc:`L2 Cache `. This number @@ -2323,162 +652,47 @@ vL1D cache access metrics: and receive acknowledgement of a write request to the :doc:`L2 Cache `. This number also includes requests for atomics without return values. unit: Cycles - NC - Read: - rst: Total read requests with NC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. + Write Req: + rst: The total number of incoming write requests from the :ref:`address processing + unit ` after coalescing per :ref:`normalization unit ` unit: Requests per normalization unit - UC - Read: - rst: Total read requests with UC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. + L1-L2 Read: + rst: The number of read requests for a vL1D cache line that were not satisfied by + the vL1D and must be retrieved from the :doc:`L2 Cache ` per :ref:`normalization + unit `. unit: Requests per normalization unit - CC - Read: - rst: Total read requests with CC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. + L1-L2 Atomic: + rst: The number of atomic requests that are sent through the vL1D to the :doc:`L2 + cache `, per :ref:`normalization unit `. This + includes requests for atomics with, and without return.
unit: Requests per normalization unit - RW - Read: - rst: '' + Read Req: + rst: The total number of incoming read requests from the :ref:`address processing + unit ` after coalescing per :ref:`normalization unit ` unit: Requests per normalization unit - RW - Write: - rst: Total write requests with RW mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - NC - Write: - rst: Total write requests with NC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - UC - Write: - rst: Total write requests with UC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - CC - Write: - rst: Total write requests with CC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - NC - Atomic: - rst: Total atomic requests with NC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - UC - Atomic: - rst: Total atomic requests with UC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - CC - Atomic: - rst: Total atomic requests with CC mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - RW - Atomic: - rst: Total atomic requests with RW mtype from this TCP to all TCCs Sum over TCP - instances per normalization unit. - unit: Requests per normalization unit - Req: - rst: The number of translation requests made to the UTCL1 per normalization unit. - unit: Requests per normalization unit - Hit Ratio: - rst: The ratio of the number of translation requests that hit in the UTCL1 divided - by the total number of translation requests made to the UTCL1. 
- unit: Percent - Hits: - rst: The number of translation requests that hit in the UTCL1, and could be reused, - per normalization unit. - unit: Requests per normalization unit - Translation Misses: - rst: The total number of translation requests that missed in the UTCL1 due to translation - not being present in the cache, per :ref:`normalization unit `. - unit: unit - Permission Misses: - rst: "The total number of translation requests that missed in the UTCL1 due to\ - \ a permission error, per :ref:`normalization unit `.\ - \ This is unused and expected to be zero in most configurations for modern\ - \ CDNA\u2122 accelerators." + Invalidations: + rst: The number of times the vL1D was issued a write-back invalidate command during + the kernel's execution per :ref:`normalization unit `. This + may be triggered by, for instance, the ``buffer_wbinvl1`` instruction. + unit: Invalidations per normalization unit + L1 Access Latency: + rst: Calculated as the average number of cycles that a vL1D cache line request + spent in the vL1D cache pipeline. + unit: Cycles + L1-L2 Write: + rst: The number of write requests to a vL1D cache line that were sent through the + vL1D to the :doc:`L2 cache `, per :ref:`normalization unit `. 
unit: Requests per normalization unit Vector L1 data-return path or Texture Data (TD): - Address Processing Unit Busy: - rst: Percent of the :ref:`total CU cycles ` the address processor - was busy - unit: Percent - Address Stall: - rst: Percent of the :ref:`total CU cycles ` the address processor - was stalled from sending address requests further into the vL1D pipeline - unit: Percent - Data Stall: - rst: Percent of the :ref:`total CU cycles ` the address processor - was stalled from sending write/atomic data further into the vL1D pipeline - unit: Percent - "Data-Processor \u2192 Address Stall": - rst: Percent of :ref:`total CU cycles ` the address processor was - stalled waiting to send command data to the :ref:`data processor ` - unit: Percent - Total Instructions: - rst: The total number of memory instructions executed by the address processer - over all compute units on the accelerator, per normalization unit. - unit: Instructions per normalization unit - Global/Generic Instructions: - rst: The total number of global & generic memory instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Global/Generic Read Instructions: - rst: The total number of global & generic memory read instructions executed on all - :doc:`compute units ` on the accelerator, per :ref:`normalization - unit `. - unit: Instructions per normalization unit - Global/Generic Write Instructions: - rst: The total number of global & generic memory write instructions executed on - all :doc:`compute units ` on the accelerator, per :ref:`normalization - unit `. - unit: Instructions per normalization unit - Global/Generic Atomic Instructions: - rst: The total number of global & generic memory atomic (with and without return) - instructions executed on all :doc:`compute units ` on the accelerator, - per :ref:`normalization unit `. 
- unit: Instructions per normalization unit - Spill/Stack Instructions: - rst: The total number of spill/stack memory instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Read Instructions: - rst: The total number of spill/stack memory read instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Write Instructions: - rst: The total number of spill/stack memory write instructions executed on all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - unit: Instructions per normalization unit - Spill/Stack Atomic Instructions: - rst: The total number of spill/stack memory atomic (with and without return) instructions - executed on all :doc:`compute units ` on the accelerator, per - :ref:`normalization unit `. Typically unused as these - memory operations are typically used to implement thread-local storage. - unit: Instructions per normalization unit - Spill/Stack Total Cycles: - rst: The number of cycles the address processing unit spent working on spill/stack - instructions, per :ref:`normalization unit `. - unit: Cycles per normalization unit - Spill/Stack Coalesced Read: - rst: The number of cycles the address processing unit spent working on coalesced - spill/stack read instructions, per :ref:`normalization unit `. - unit: Cycles per normalization unit - Spill/Stack Coalesced Write: - rst: The number of cycles the address processing unit spent working on coalesced - spill/stack write instructions, per :ref:`normalization unit `. - unit: Cycles per normalization unit - Data-Return Busy: - rst: Percent of the :ref:`total CU cycles ` the data-return unit - was busy processing or waiting on data to return to the :doc:`CU `. 
- unit: Percent - "Cache RAM \u2192 Data-Return Stall": - rst: Percent of the :ref:`total CU cycles ` the data-return unit - was stalled on data to be returned from the :ref:`vL1D Cache RAM `. - unit: Percent - "Workgroup manager \u2192 Data-Return Stall": - rst: Percent of the :ref:`total CU cycles ` the data-return unit - was stalled by the :ref:`workgroup manager ` due to initialization - of registers as a part of launching new workgroups. - unit: Percent Coalescable Instructions: rst: The number of instructions submitted to the :ref:`data-return unit ` by the :ref:`address processor ` that were found to be coalescable, per :ref:`normalization unit `. unit: Instructions per normalization unit + "Cache RAM \u2192 Data-Return Stall": + rst: Percent of the :ref:`total CU cycles ` the data-return unit + was stalled on data to be returned from the :ref:`vL1D Cache RAM `. + unit: Percent Read Instructions: rst: The number of read instructions submitted to the :ref:`data-return unit ` by the :ref:`address processor ` summed over all :doc:`compute @@ -2486,6 +700,18 @@ Vector L1 data-return path or Texture Data (TD): This is expected to be the sum of global/generic and spill/stack reads in the :ref:`address processor `. unit: Instructions per normalization unit + "Workgroup manager \u2192 Data-Return Stall": + rst: Percent of the :ref:`total CU cycles ` the data-return unit + was stalled by the :ref:`workgroup manager ` due to initialization + of registers as a part of launching new workgroups. + unit: Percent + Atomic Instructions: + rst: The number of atomic instructions submitted to the :ref:`data-return unit + ` by the :ref:`address processor ` summed over all :doc:`compute + units ` on the accelerator, per :ref:`normalization unit `. + This is expected to be the sum of global/generic and spill/stack atomics in + the :ref:`address processor `. 
+ unit: Instructions per normalization unit Write Instructions: rst: The number of store instructions submitted to the :ref:`data-return unit ` by the :ref:`address processor ` summed over all :doc:`compute @@ -2493,19 +719,11 @@ Vector L1 data-return path or Texture Data (TD): This is expected to be the sum of global/generic and spill/stack stores counted by the :ref:`vL1D cache-front-end `. unit: Instructions per normalization unit - Atomic Instructions: - rst: The number of atomic instructions submitted to the :ref:`data-return unit - ` by the :ref:`address processor ` summed over all :doc:`compute - units ` on the accelerator, per :ref:`normalization unit `. - This is expected to be the sum of global/generic and spill/stack atomics in - the :ref:`address processor `. - unit: Instructions per normalization unit -L2 Speed-of-Light: - Utilization: - rst: The ratio of the :ref:`number of cycles an L2 channel was active, summed - over all L2 channels on the accelerator ` over the - :ref:`total L2 cycles `. + Data-Return Busy: + rst: Percent of the :ref:`total CU cycles ` the data-return unit + was busy processing or waiting on data to return to the :doc:`CU `. unit: Percent +L2 Speed-of-Light: Peak Bandwidth: rst: The number of bytes looked up in the L2 cache, as a percent of the peak theoretical bandwidth achievable on the specific accelerator. The number of bytes is calculated @@ -2514,451 +732,42 @@ L2 Speed-of-Light: requested in a cache line, the data movement will still be counted as a full cache line. unit: Percent - Hit Rate: - rst: The ratio of the number of L2 cache line requests that hit in the L2 cache - over the total number of incoming cache line requests to the L2 cache. - unit: Percent - L2-Fabric Read BW: - rst: The number of bytes read by the L2 over the :ref:`Infinity Fabric interface - ` per unit time. 
- unit: GB/s - L2-Fabric Write and Atomic BW: - rst: The number of bytes sent by the L2 over the :ref:`Infinity Fabric interface - ` by write and atomic operations per unit time. - unit: GB/s HBM Bandwidth: rst: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory (HBM) per unit time. This value is calculated as the number of HBM channels multiplied by the HBM channel width multiplied by the HBM clock frequency. unit: GB/s - Read BW: - rst: The total number of bytes read by the L2 cache from Infinity Fabric per :ref:`normalization - unit `. - unit: Bytes per normalization unit - HBM Read Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to the - accelerator's local high-bandwidth memory (HBM). This breakdown does not consider - the *size* of the request (meaning that 32B and 64B requests are both counted - as a single request), so this metric only *approximates* the percent of the - L2-Fabric Read bandwidth directed to the local HBM. + Utilization: + rst: The ratio of the :ref:`number of cycles an L2 channel was active, summed + over all L2 channels on the accelerator ` over the + :ref:`total L2 cycles `. unit: Percent - Remote Read Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to any - memory location other than the accelerator's local high-bandwidth memory (HBM) - -- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Read bandwidth directed to a remote location. - unit: Percent - Uncached Read Traffic: - rst: The percent of read requests generated by the L2 cache that are reading from - an :ref:`uncached memory allocation `. 
Note, as described in the - :ref:`request flow ` section, a single 64B read request is - typically counted as two uncached read requests. So, it is possible for the - Uncached Read Traffic to reach up to 200% of the total number of read requests. - This breakdown does not consider the *size* of the request (i.e., 32B and 64B - requests are both counted as a single request), so this metric only *approximates* - the percent of the L2-Fabric read bandwidth directed to an uncached memory - location. - unit: Percent - Write and Atomic BW: - rst: The total number of bytes written by the L2 over Infinity Fabric by write and - atomic operations per :ref:`normalization unit `. Note - that on current CDNA accelerators, such as the :ref:`MI2XX `, requests - are only considered *atomic* by Infinity Fabric if they are targeted at non-write-cacheable - memory, for example, :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations on the MI2XX. - unit: Bytes per normalization unit - HBM Write and Atomic Traffic: - rst: The percent of write and atomic requests generated by the L2 cache that are - routed to the accelerator's local high-bandwidth memory (HBM). This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Write and Atomic bandwidth directed to the local HBM. - Note that on current CDNA accelerators, such as the :ref:`MI2XX `, - requests are only considered *atomic* by Infinity Fabric if they are targeted - at :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations. - unit: Percent - Remote Write and Atomic Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to any - memory location other than the accelerator's local high-bandwidth memory (HBM) - -- for example, the CPU's DRAM or a remote accelerator's HBM. 
This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Read bandwidth directed to a remote location. Note - that on current CDNA accelerators, such as the :ref:`MI2XX `, requests - are only considered *atomic* by Infinity Fabric if they are targeted at :ref:`fine-grained - memory ` allocations or :ref:`uncached memory ` allocations. - unit: Percent - Atomic Traffic: - rst: The percent of write requests generated by the L2 cache that are atomic requests - to *any* memory location. This breakdown does not consider the *size* of the - request (meaning that 32B and 64B requests are both counted as a single request), - so this metric only *approximates* the percent of the L2-Fabric Read bandwidth - directed to a remote location. Note that on current CDNA accelerators, such - as the :ref:`MI2XX `, requests are only considered *atomic* by - Infinity Fabric if they are targeted at :ref:`fine-grained memory ` - allocations or :ref:`uncached memory ` allocations. - unit: Percent - Uncached Write and Atomic Traffic: - rst: The percent of write and atomic requests generated by the L2 cache that are - targeting :ref:`uncached memory allocations `. This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric read bandwidth directed to uncached memory allocations. - unit: Percent - Read Latency: - rst: The time-averaged number of cycles read requests spent in Infinity Fabric before - data was returned to the L2. - unit: Cycles - Write and Atomic Latency: - rst: The time-averaged number of cycles write requests spent in Infinity Fabric - before a completion acknowledgement was returned to the L2. 
- unit: Cycles - Atomic Latency: - rst: The time-averaged number of cycles atomic requests spent in Infinity Fabric - before a completion acknowledgement (atomic without return value) or data (atomic - with return value) was returned to the L2. - unit: Cycles - Bandwidth: - rst: The number of bytes looked up in the L2 cache, per :ref:`normalization unit - `. The number of bytes is calculated as the number of - cache lines requested multiplied by the cache line size. This value does not - consider partial requests, so for example, if only a single value is requested - in a cache line, the data movement will still be counted as a full cache line. - unit: Bytes per normalization unit - Req: - rst: The total number of incoming requests to the L2 from all clients for all request - types, per :ref:`normalization unit `. - unit: Requests per normalization unit - Read Req: - rst: 'The total number of read requests to the L2 from all clients. ' - unit: Requests per normalization unit - Write Req: - rst: The total number of write requests to the L2 from all clients. - unit: Requests per normalization unit - Atomic Req: - rst: The total number of atomic requests (with and without return) to the L2 from - all clients. - unit: Requests per normalization unit - Streaming Req: - rst: The total number of incoming requests to the L2 that are marked as *streaming*. - The exact meaning of this may differ depending on the targeted accelerator, - however on an :ref:`MI2XX ` this corresponds to `non-temporal - load or stores `_. The - L2 cache attempts to evict *streaming* requests before normal requests when - the L2 is at capacity. - unit: Requests per normalization unit - Probe Req: - rst: The number of coherence probe requests made to the L2 cache from outside the - accelerator. On an :ref:`MI2XX `, probe requests may be generated - by, for example, writes to :ref:`fine-grained device ` memory - or by writes to :ref:`coarse-grained ` device memory. 
- unit: Requests per normalization unit - Cache Hit: + L2-Fabric Write and Atomic BW: + rst: The number of bytes sent by the L2 over the :ref:`Infinity Fabric interface + ` by write and atomic operations per unit time. + unit: GB/s + L2-Fabric Read BW: + rst: The number of bytes read by the L2 over the :ref:`Infinity Fabric interface + ` per unit time. + unit: GB/s + Hit Rate: rst: The ratio of the number of L2 cache line requests that hit in the L2 cache - over the total number of incoming cache line requests to the L2 cache. - unit: Percent - Hits: - rst: The total number of requests to the L2 from all clients that hit in the cache. - As noted in the :ref:`Speed-of-Light ` section, this includes hit-on-miss - requests. - unit: Requests per normalization unit - Misses: - rst: The total number of requests to the L2 from all clients that miss in the cache. - As noted in the :ref:`Speed-of-Light ` section, these do not include - hit-on-miss requests. - unit: Requests per normalization unit - Writeback: - rst: The total number of L2 cache lines written back to memory for any reason. Write-backs - may occur due to user code (such as HIP kernel calls to ``__threadfence_system`` - or atomic built-ins) by the :doc:`command processor `'s - memory acquire/release fences, or for other internal hardware reasons. - unit: Cache lines per normalization unit - Writeback (Internal): - rst: The total number of L2 cache lines written back to memory for internal hardware - reasons, per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Writeback (vL1D Req): - rst: The total number of L2 cache lines written back to memory due to requests initiated - by the :doc:`vL1D cache `, per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Evict (Internal): - rst: The total number of L2 cache lines evicted from the cache due to capacity limits, - per :ref:`normalization unit `. 
- unit: Cache lines per normalization unit - Evict (vL1D Req): - rst: The total number of L2 cache lines evicted from the cache due to invalidation - requests initiated by the :doc:`vL1D cache `, per :ref:`normalization - unit `. - unit: Cache lines per normalization unit - NC Req: - rst: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory - allocations, per :ref:`normalization unit `. See the :ref:`memory-type` - for more information. - unit: Requests per normalization unit - UC Req: - rst: The total number of requests to the L2 that go to Uncached (UC) memory allocations. - See the :ref:`memory-type` for more information. - unit: Requests per normalization unit - CC Req: - rst: The total number of requests to the L2 that go to Coherently Cacheable (CC) memory - allocations. See the :ref:`memory-type` for more information. - unit: Requests per normalization unit - RW Req: - rst: The total number of requests to the L2 that go to Read-Write coherent memory (RW) - allocations. See the :ref:`memory-type` for more information. - unit: Requests per normalization unit - Write - Credit Starvation: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to any memory location because too many write/atomic requests were - currently in flight, as a percent of the :ref:`total active L2 cycles `. - unit: Percent - Read (32B): - rst: The total number of L2 requests to Infinity Fabric to read 32B of data from - any memory location, per :ref:`normalization unit `. See :ref:`l2-request-flow` - for more detail. Typically unused on CDNA accelerators. - unit: Requests per normalization unit - Read (64B): - rst: The total number of L2 requests to Infinity Fabric to read 64B of data from - any memory location, per :ref:`normalization unit `. See :ref:`l2-request-flow` - for more detail. 
- unit: Requests per normalization unit - Read (Uncached): - rst: The total number of L2 requests to Infinity Fabric to read :ref:`uncached - data ` from any memory location, per :ref:`normalization unit - `. 64B requests for uncached data are counted as two 32B - uncached data requests. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - HBM Read: - rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data - from the accelerator's local HBM, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Remote Read: - rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data - from any source other than the accelerator's local HBM, per :ref:`normalization - unit `. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Write and Atomic (32B): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B of data to any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Write and Atomic (Uncached): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of :ref:`uncached data `, per :ref:`normalization unit - `. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Write and Atomic (64B): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 64B of data in any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - HBM Write and Atomic: - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of data in the accelerator's local HBM, per :ref:`normalization - unit `. See :ref:`l2-request-flow` for more detail. 
plain - unit: Requests per normalization unit - Remote Write and Atomic: - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of data in any memory location other than the accelerator's local - HBM, per :ref:`normalization unit `. See :ref:`l2-request-flow` - for more detail. - unit: Requests per normalization unit - Atomic: - rst: The total number of L2 requests to Infinity Fabric to atomically update 32B - or 64B of data in any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. Note that on current CDNA accelerators, - such as the :ref:`MI2XX `, requests are only considered *atomic* - by Infinity Fabric if they are targeted at non-write-cacheable memory, such - as :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations on the MI2XX. - unit: Requests per normalization unit - Read Stall: - rst: "The ratio of the total number of cycles the L2-Fabric interface was stalled\ - \ on a read request to any destination (local HBM, remote PCIe\xAE connected\ - \ accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_\ - \ or CPU) over the :ref:`total active L2 cycles `." - unit: Percent - Write Stall: - rst: The ratio of the total number of cycles the L2-Fabric interface was stalled - on a write or atomic request to any destination (local HBM, remote accelerator - or CPU, PCIe connected accelerator or CPU, or remote Infinity Fabric connected - accelerator [#inf]_ or CPU) over the :ref:`total active L2 cycles `. - unit: Percent - Read - PCIe Stall: - rst: The number of cycles the L2-Fabric interface was stalled on read requests - to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total - active L2 cycles `. 
- unit: Percent - Read - Infinity Fabric Stall: - rst: The number of cycles the L2-Fabric interface was stalled on read requests - to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a percent - of the :ref:`total active L2 cycles `. - unit: Percent - Read - HBM Stall: - rst: The number of cycles the L2-Fabric interface was stalled on read requests - to the accelerator's local HBM as a percent of the :ref:`total active L2 cycles - `. - unit: Percent - Write - PCIe Stall: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to remote PCIe connected accelerators [#inf]_ or CPUs as a percent - of the :ref:`total active L2 cycles `. - unit: Percent - Write - Infinity Fabric Stall: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as - a percent of the :ref:`total active L2 cycles `. - unit: Percent - Write - HBM Stall: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to accelerator's local HBM as a percent of the total active L2 cycles. + over the total number of incoming cache line requests to the L2 cache. unit: Percent L2 cache accesses: - Utilization: - rst: The ratio of the :ref:`number of cycles an L2 channel was active, summed - over all L2 channels on the accelerator ` over the - :ref:`total L2 cycles `. - unit: Percent - Peak Bandwidth: - rst: The number of bytes looked up in the L2 cache, as a percent of the peak theoretical - bandwidth achievable on the specific accelerator. The number of bytes is calculated - as the number of cache lines requested multiplied by the cache line size. This - value does not consider partial requests, so e.g., if only a single value is - requested in a cache line, the data movement will still be counted as a full - cache line. 
- unit: Percent - Hit Rate: + UC Req: + rst: The total number of requests to the L2 that go to Uncached (UC) memory allocations. + See the :ref:`memory-type` for more information. + unit: Requests per normalization unit + Misses: + rst: The total number of requests to the L2 from all clients that miss in the cache. + As noted in the :ref:`Speed-of-Light ` section, these do not include + hit-on-miss requests. + unit: Requests per normalization unit + Cache Hit: rst: The ratio of the number of L2 cache line requests that hit in the L2 cache - over the total number of incoming cache line requests to the L2 cache. + over the total number of incoming cache line requests to the L2 cache. unit: Percent - L2-Fabric Read BW: - rst: The number of bytes read by the L2 over the :ref:`Infinity Fabric interface - ` per unit time. - unit: GB/s - L2-Fabric Write and Atomic BW: - rst: The number of bytes sent by the L2 over the :ref:`Infinity Fabric interface - ` by write and atomic operations per unit time. - unit: GB/s - HBM Bandwidth: - rst: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory - (HBM) per unit time. This value is calculated as the number of HBM channels - multiplied by the HBM channel width multiplied by the HBM clock frequency. - unit: GB/s - Read BW: - rst: The total number of bytes read by the L2 cache from Infinity Fabric per :ref:`normalization - unit `. - unit: Bytes per normalization unit - HBM Read Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to the - accelerator's local high-bandwidth memory (HBM). This breakdown does not consider - the *size* of the request (meaning that 32B and 64B requests are both counted - as a single request), so this metric only *approximates* the percent of the - L2-Fabric Read bandwidth directed to the local HBM. 
- unit: Percent - Remote Read Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to any - memory location other than the accelerator's local high-bandwidth memory (HBM) - -- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Read bandwidth directed to a remote location. - unit: Percent - Uncached Read Traffic: - rst: The percent of read requests generated by the L2 cache that are reading from - an :ref:`uncached memory allocation `. Note, as described in the - :ref:`request flow ` section, a single 64B read request is - typically counted as two uncached read requests. So, it is possible for the - Uncached Read Traffic to reach up to 200% of the total number of read requests. - This breakdown does not consider the *size* of the request (i.e., 32B and 64B - requests are both counted as a single request), so this metric only *approximates* - the percent of the L2-Fabric read bandwidth directed to an uncached memory - location. - unit: Percent - Write and Atomic BW: - rst: The total number of bytes written by the L2 over Infinity Fabric by write and - atomic operations per :ref:`normalization unit `. Note - that on current CDNA accelerators, such as the :ref:`MI2XX `, requests - are only considered *atomic* by Infinity Fabric if they are targeted at non-write-cacheable - memory, for example, :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations on the MI2XX. - unit: Bytes per normalization unit - HBM Write and Atomic Traffic: - rst: The percent of write and atomic requests generated by the L2 cache that are - routed to the accelerator's local high-bandwidth memory (HBM). 
This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Write and Atomic bandwidth directed to the local HBM. - Note that on current CDNA accelerators, such as the :ref:`MI2XX `, - requests are only considered *atomic* by Infinity Fabric if they are targeted - at :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations. - unit: Percent - Remote Write and Atomic Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to any - memory location other than the accelerator's local high-bandwidth memory (HBM) - -- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Read bandwidth directed to a remote location. Note - that on current CDNA accelerators, such as the :ref:`MI2XX `, requests - are only considered *atomic* by Infinity Fabric if they are targeted at :ref:`fine-grained - memory ` allocations or :ref:`uncached memory ` allocations. - unit: Percent - Atomic Traffic: - rst: The percent of write requests generated by the L2 cache that are atomic requests - to *any* memory location. This breakdown does not consider the *size* of the - request (meaning that 32B and 64B requests are both counted as a single request), - so this metric only *approximates* the percent of the L2-Fabric Read bandwidth - directed to a remote location. Note that on current CDNA accelerators, such - as the :ref:`MI2XX `, requests are only considered *atomic* by - Infinity Fabric if they are targeted at :ref:`fine-grained memory ` - allocations or :ref:`uncached memory ` allocations. 
- unit: Percent - Uncached Write and Atomic Traffic: - rst: The percent of write and atomic requests generated by the L2 cache that are - targeting :ref:`uncached memory allocations `. This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric read bandwidth directed to uncached memory allocations. - unit: Percent - Read Latency: - rst: The time-averaged number of cycles read requests spent in Infinity Fabric before - data was returned to the L2. - unit: Cycles - Write and Atomic Latency: - rst: The time-averaged number of cycles write requests spent in Infinity Fabric - before a completion acknowledgement was returned to the L2. - unit: Cycles - Atomic Latency: - rst: The time-averaged number of cycles atomic requests spent in Infinity Fabric - before a completion acknowledgement (atomic without return value) or data (atomic - with return value) was returned to the L2. - unit: Cycles - Bandwidth: - rst: The number of bytes looked up in the L2 cache, per :ref:`normalization unit - `. The number of bytes is calculated as the number of - cache lines requested multiplied by the cache line size. This value does not - consider partial requests, so for example, if only a single value is requested - in a cache line, the data movement will still be counted as a full cache line. - unit: Bytes per normalization unit - Req: - rst: The total number of incoming requests to the L2 from all clients for all request - types, per :ref:`normalization unit `. - unit: Requests per normalization unit - Read Req: - rst: 'The total number of read requests to the L2 from all clients. ' - unit: Requests per normalization unit - Write Req: - rst: The total number of write requests to the L2 from all clients. 
- unit: Requests per normalization unit - Atomic Req: - rst: The total number of atomic requests (with and without return) to the L2 from - all clients. - unit: Requests per normalization unit Streaming Req: rst: The total number of incoming requests to the L2 that are marked as *streaming*. The exact meaning of this may differ depending on the targeted accelerator, @@ -2967,25 +776,40 @@ L2 cache accesses: L2 cache attempts to evict *streaming* requests before normal requests when the L2 is at capacity. unit: Requests per normalization unit - Probe Req: - rst: The number of coherence probe requests made to the L2 cache from outside the - accelerator. On an :ref:`MI2XX `, probe requests may be generated - by, for example, writes to :ref:`fine-grained device ` memory - or by writes to :ref:`coarse-grained ` device memory. + Writeback (Internal): + rst: The total number of L2 cache lines written back to memory for internal hardware + reasons, per :ref:`normalization unit `. + unit: Cache lines per normalization unit + Write Req: + rst: The total number of write requests to the L2 from all clients. unit: Requests per normalization unit - Cache Hit: - rst: The ratio of the number of L2 cache line requests that hit in the L2 cache - over the total number of incoming cache line requests to the L2 cache. - unit: Percent + CC Req: + rst: The total number of requests to the L2 that go to Coherently Cacheable (CC) memory + allocations. See the :ref:`memory-type` for more information. + unit: Requests per normalization unit + Writeback (vL1D Req): + rst: The total number of L2 cache lines written back to memory due to requests initiated + by the :doc:`vL1D cache `, per :ref:`normalization unit `. + unit: Cache lines per normalization unit + Req: + rst: The total number of incoming requests to the L2 from all clients for all request + types, per :ref:`normalization unit `. 
+ unit: Requests per normalization unit + Bandwidth: + rst: The number of bytes looked up in the L2 cache, per :ref:`normalization unit + `. The number of bytes is calculated as the number of + cache lines requested multiplied by the cache line size. This value does not + consider partial requests, so for example, if only a single value is requested + in a cache line, the data movement will still be counted as a full cache line. + unit: Bytes per normalization unit Hits: rst: The total number of requests to the L2 from all clients that hit in the cache. As noted in the :ref:`Speed-of-Light ` section, this includes hit-on-miss requests. unit: Requests per normalization unit - Misses: - rst: The total number of requests to the L2 from all clients that miss in the cache. - As noted in the :ref:`Speed-of-Light ` section, these do not include - hit-on-miss requests. + RW Req: + rst: The total number of requests to the L2 that go to Read-Write coherent memory (RW) + allocations. See the :ref:`memory-type` for more information. unit: Requests per normalization unit Writeback: rst: The total number of L2 cache lines written back to memory for any reason. Write-backs @@ -2993,227 +817,34 @@ L2 cache accesses: or atomic built-ins) by the :doc:`command processor `'s memory acquire/release fences, or for other internal hardware reasons. unit: Cache lines per normalization unit - Writeback (Internal): - rst: The total number of L2 cache lines written back to memory for internal hardware - reasons, per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Writeback (vL1D Req): - rst: The total number of L2 cache lines written back to memory due to requests initiated - by the :doc:`vL1D cache `, per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Evict (Internal): - rst: The total number of L2 cache lines evicted from the cache due to capacity limits, - per :ref:`normalization unit `. 
- unit: Cache lines per normalization unit - Evict (vL1D Req): - rst: The total number of L2 cache lines evicted from the cache due to invalidation - requests initiated by the :doc:`vL1D cache `, per :ref:`normalization - unit `. - unit: Cache lines per normalization unit NC Req: rst: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory allocations, per :ref:`normalization unit `. See the :ref:`memory-type` for more information. unit: Requests per normalization unit - UC Req: - rst: The total number of requests to the L2 that go to Uncached (UC) memory allocations. - See the :ref:`memory-type` for more information. + Evict (vL1D Req): + rst: The total number of L2 cache lines evicted from the cache due to invalidation + requests initiated by the :doc:`vL1D cache `, per :ref:`normalization + unit `. + unit: Cache lines per normalization unit + Probe Req: + rst: The number of coherence probe requests made to the L2 cache from outside the + accelerator. On an :ref:`MI2XX `, probe requests may be generated + by, for example, writes to :ref:`fine-grained device ` memory + or by writes to :ref:`coarse-grained ` device memory. unit: Requests per normalization unit - CC Req: - rst: The total number of requests to the L2 that go to Coherently Cacheable (CC) memory - allocations. See the :ref:`memory-type` for more information. + Atomic Req: + rst: The total number of atomic requests (with and without return) to the L2 from + all clients. unit: Requests per normalization unit - RW Req: - rst: The total number of requests to the L2 that go to Read-Write coherent memory (RW) - allocations. See the :ref:`memory-type` for more information. + Evict (Internal): + rst: The total number of L2 cache lines evicted from the cache due to capacity limits, + per :ref:`normalization unit `. + unit: Cache lines per normalization unit + Read Req: + rst: 'The total number of read requests to the L2 from all clients. 
' unit: Requests per normalization unit - Write - Credit Starvation: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to any memory location because too many write/atomic requests were - currently in flight, as a percent of the :ref:`total active L2 cycles `. - unit: Percent - Read (32B): - rst: The total number of L2 requests to Infinity Fabric to read 32B of data from - any memory location, per :ref:`normalization unit `. See :ref:`l2-request-flow` - for more detail. Typically unused on CDNA accelerators. - unit: Requests per normalization unit - Read (64B): - rst: The total number of L2 requests to Infinity Fabric to read 64B of data from - any memory location, per :ref:`normalization unit `. See :ref:`l2-request-flow` - for more detail. - unit: Requests per normalization unit - Read (Uncached): - rst: The total number of L2 requests to Infinity Fabric to read :ref:`uncached - data ` from any memory location, per :ref:`normalization unit - `. 64B requests for uncached data are counted as two 32B - uncached data requests. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - HBM Read: - rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data - from the accelerator's local HBM, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Remote Read: - rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data - from any source other than the accelerator's local HBM, per :ref:`normalization - unit `. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Write and Atomic (32B): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B of data to any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. 
- unit: Requests per normalization unit - Write and Atomic (Uncached): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of :ref:`uncached data `, per :ref:`normalization unit - `. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Write and Atomic (64B): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 64B of data in any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - HBM Write and Atomic: - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of data in the accelerator's local HBM, per :ref:`normalization - unit `. See :ref:`l2-request-flow` for more detail. plain - unit: Requests per normalization unit - Remote Write and Atomic: - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of data in any memory location other than the accelerator's local - HBM, per :ref:`normalization unit `. See :ref:`l2-request-flow` - for more detail. - unit: Requests per normalization unit - Atomic: - rst: The total number of L2 requests to Infinity Fabric to atomically update 32B - or 64B of data in any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. Note that on current CDNA accelerators, - such as the :ref:`MI2XX `, requests are only considered *atomic* - by Infinity Fabric if they are targeted at non-write-cacheable memory, such - as :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations on the MI2XX. 
- unit: Requests per normalization unit - Read Stall: - rst: "The ratio of the total number of cycles the L2-Fabric interface was stalled\ - \ on a read request to any destination (local HBM, remote PCIe\xAE connected\ - \ accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_\ - \ or CPU) over the :ref:`total active L2 cycles `." - unit: Percent - Write Stall: - rst: The ratio of the total number of cycles the L2-Fabric interface was stalled - on a write or atomic request to any destination (local HBM, remote accelerator - or CPU, PCIe connected accelerator or CPU, or remote Infinity Fabric connected - accelerator [#inf]_ or CPU) over the :ref:`total active L2 cycles `. - unit: Percent - Read - PCIe Stall: - rst: The number of cycles the L2-Fabric interface was stalled on read requests - to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total - active L2 cycles `. - unit: Percent - Read - Infinity Fabric Stall: - rst: The number of cycles the L2-Fabric interface was stalled on read requests - to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a percent - of the :ref:`total active L2 cycles `. - unit: Percent - Read - HBM Stall: - rst: The number of cycles the L2-Fabric interface was stalled on read requests - to the accelerator's local HBM as a percent of the :ref:`total active L2 cycles - `. - unit: Percent - Write - PCIe Stall: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to remote PCIe connected accelerators [#inf]_ or CPUs as a percent - of the :ref:`total active L2 cycles `. - unit: Percent - Write - Infinity Fabric Stall: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as - a percent of the :ref:`total active L2 cycles `. 
- unit: Percent - Write - HBM Stall: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to accelerator's local HBM as a percent of the total active L2 cycles. - unit: Percent L2-Fabric interface metrics: - Utilization: - rst: The ratio of the :ref:`number of cycles an L2 channel was active, summed - over all L2 channels on the accelerator ` over the - :ref:`total L2 cycles `. - unit: Percent - Peak Bandwidth: - rst: The number of bytes looked up in the L2 cache, as a percent of the peak theoretical - bandwidth achievable on the specific accelerator. The number of bytes is calculated - as the number of cache lines requested multiplied by the cache line size. This - value does not consider partial requests, so e.g., if only a single value is - requested in a cache line, the data movement will still be counted as a full - cache line. - unit: Percent - Hit Rate: - rst: The ratio of the number of L2 cache line requests that hit in the L2 cache - over the total number of incoming cache line requests to the L2 cache. - unit: Percent - L2-Fabric Read BW: - rst: The number of bytes read by the L2 over the :ref:`Infinity Fabric interface - ` per unit time. - unit: GB/s - L2-Fabric Write and Atomic BW: - rst: The number of bytes sent by the L2 over the :ref:`Infinity Fabric interface - ` by write and atomic operations per unit time. - unit: GB/s - HBM Bandwidth: - rst: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory - (HBM) per unit time. This value is calculated as the number of HBM channels - multiplied by the HBM channel width multiplied by the HBM clock frequency. - unit: GB/s - Read BW: - rst: The total number of bytes read by the L2 cache from Infinity Fabric per :ref:`normalization - unit `. - unit: Bytes per normalization unit - HBM Read Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to the - accelerator's local high-bandwidth memory (HBM). 
This breakdown does not consider - the *size* of the request (meaning that 32B and 64B requests are both counted - as a single request), so this metric only *approximates* the percent of the - L2-Fabric Read bandwidth directed to the local HBM. - unit: Percent - Remote Read Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to any - memory location other than the accelerator's local high-bandwidth memory (HBM) - -- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Read bandwidth directed to a remote location. - unit: Percent - Uncached Read Traffic: - rst: The percent of read requests generated by the L2 cache that are reading from - an :ref:`uncached memory allocation `. Note, as described in the - :ref:`request flow ` section, a single 64B read request is - typically counted as two uncached read requests. So, it is possible for the - Uncached Read Traffic to reach up to 200% of the total number of read requests. - This breakdown does not consider the *size* of the request (i.e., 32B and 64B - requests are both counted as a single request), so this metric only *approximates* - the percent of the L2-Fabric read bandwidth directed to an uncached memory - location. - unit: Percent - Write and Atomic BW: - rst: The total number of bytes written by the L2 over Infinity Fabric by write and - atomic operations per :ref:`normalization unit `. Note - that on current CDNA accelerators, such as the :ref:`MI2XX `, requests - are only considered *atomic* by Infinity Fabric if they are targeted at non-write-cacheable - memory, for example, :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations on the MI2XX. 
- unit: Bytes per normalization unit - HBM Write and Atomic Traffic: - rst: The percent of write and atomic requests generated by the L2 cache that are - routed to the accelerator's local high-bandwidth memory (HBM). This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Write and Atomic bandwidth directed to the local HBM. - Note that on current CDNA accelerators, such as the :ref:`MI2XX `, - requests are only considered *atomic* by Infinity Fabric if they are targeted - at :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations. - unit: Percent Remote Write and Atomic Traffic: rst: The percent of read requests generated by the L2 cache that are routed to any memory location other than the accelerator's local high-bandwidth memory (HBM) @@ -3225,6 +856,21 @@ L2-Fabric interface metrics: are only considered *atomic* by Infinity Fabric if they are targeted at :ref:`fine-grained memory ` allocations or :ref:`uncached memory ` allocations. unit: Percent + Remote Read Traffic: + rst: The percent of read requests generated by the L2 cache that are routed to any + memory location other than the accelerator's local high-bandwidth memory (HBM) + -- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown + does not consider the *size* of the request (meaning that 32B and 64B requests + are both counted as a single request), so this metric only *approximates* the + percent of the L2-Fabric Read bandwidth directed to a remote location. + unit: Percent + Uncached Write and Atomic Traffic: + rst: The percent of write and atomic requests generated by the L2 cache that are + targeting :ref:`uncached memory allocations `. 
This breakdown + does not consider the *size* of the request (meaning that 32B and 64B requests + are both counted as a single request), so this metric only *approximates* the + percent of the L2-Fabric read bandwidth directed to uncached memory allocations. + unit: Percent Atomic Traffic: rst: The percent of write requests generated by the L2 cache that are atomic requests to *any* memory location. This breakdown does not consider the *size* of the @@ -3235,452 +881,98 @@ L2-Fabric interface metrics: Infinity Fabric if they are targeted at :ref:`fine-grained memory ` allocations or :ref:`uncached memory ` allocations. unit: Percent - Uncached Write and Atomic Traffic: - rst: The percent of write and atomic requests generated by the L2 cache that are - targeting :ref:`uncached memory allocations `. This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric read bandwidth directed to uncached memory allocations. + HBM Read Traffic: + rst: The percent of read requests generated by the L2 cache that are routed to the + accelerator's local high-bandwidth memory (HBM). This breakdown does not consider + the *size* of the request (meaning that 32B and 64B requests are both counted + as a single request), so this metric only *approximates* the percent of the + L2-Fabric Read bandwidth directed to the local HBM. unit: Percent - Read Latency: - rst: The time-averaged number of cycles read requests spent in Infinity Fabric before - data was returned to the L2. - unit: Cycles Write and Atomic Latency: rst: The time-averaged number of cycles write requests spent in Infinity Fabric before a completion acknowledgement was returned to the L2. 
unit: Cycles - Atomic Latency: - rst: The time-averaged number of cycles atomic requests spent in Infinity Fabric - before a completion acknowledgement (atomic without return value) or data (atomic - with return value) was returned to the L2. + Read Latency: + rst: The time-averaged number of cycles read requests spent in Infinity Fabric before + data was returned to the L2. unit: Cycles - Bandwidth: - rst: The number of bytes looked up in the L2 cache, per :ref:`normalization unit - `. The number of bytes is calculated as the number of - cache lines requested multiplied by the cache line size. This value does not - consider partial requests, so for example, if only a single value is requested - in a cache line, the data movement will still be counted as a full cache line. - unit: Bytes per normalization unit - Req: - rst: The total number of incoming requests to the L2 from all clients for all request - types, per :ref:`normalization unit `. - unit: Requests per normalization unit - Read Req: - rst: 'The total number of read requests to the L2 from all clients. ' - unit: Requests per normalization unit - Write Req: - rst: The total number of write requests to the L2 from all clients. - unit: Requests per normalization unit - Atomic Req: - rst: The total number of atomic requests (with and without return) to the L2 from - all clients. - unit: Requests per normalization unit - Streaming Req: - rst: The total number of incoming requests to the L2 that are marked as *streaming*. - The exact meaning of this may differ depending on the targeted accelerator, - however on an :ref:`MI2XX ` this corresponds to `non-temporal - load or stores `_. The - L2 cache attempts to evict *streaming* requests before normal requests when - the L2 is at capacity. - unit: Requests per normalization unit - Probe Req: - rst: The number of coherence probe requests made to the L2 cache from outside the - accelerator. 
On an :ref:`MI2XX `, probe requests may be generated - by, for example, writes to :ref:`fine-grained device ` memory - or by writes to :ref:`coarse-grained ` device memory. - unit: Requests per normalization unit - Cache Hit: - rst: The ratio of the number of L2 cache line requests that hit in the L2 cache - over the total number of incoming cache line requests to the L2 cache. - unit: Percent - Hits: - rst: The total number of requests to the L2 from all clients that hit in the cache. - As noted in the :ref:`Speed-of-Light ` section, this includes hit-on-miss - requests. - unit: Requests per normalization unit - Misses: - rst: The total number of requests to the L2 from all clients that miss in the cache. - As noted in the :ref:`Speed-of-Light ` section, these do not include - hit-on-miss requests. - unit: Requests per normalization unit - Writeback: - rst: The total number of L2 cache lines written back to memory for any reason. Write-backs - may occur due to user code (such as HIP kernel calls to ``__threadfence_system`` - or atomic built-ins) by the :doc:`command processor `'s - memory acquire/release fences, or for other internal hardware reasons. - unit: Cache lines per normalization unit - Writeback (Internal): - rst: The total number of L2 cache lines written back to memory for internal hardware - reasons, per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Writeback (vL1D Req): - rst: The total number of L2 cache lines written back to memory due to requests initiated - by the :doc:`vL1D cache `, per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Evict (Internal): - rst: The total number of L2 cache lines evicted from the cache due to capacity limits, - per :ref:`normalization unit `. 
- unit: Cache lines per normalization unit - Evict (vL1D Req): - rst: The total number of L2 cache lines evicted from the cache due to invalidation - requests initiated by the :doc:`vL1D cache `, per :ref:`normalization - unit `. - unit: Cache lines per normalization unit - NC Req: - rst: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory - allocations, per :ref:`normalization unit `. See the :ref:`memory-type` - for more information. - unit: Requests per normalization unit - UC Req: - rst: The total number of requests to the L2 that go to Uncached (UC) memory allocations. - See the :ref:`memory-type` for more information. - unit: Requests per normalization unit - CC Req: - rst: The total number of requests to the L2 that go to Coherently Cacheable (CC) memory - allocations. See the :ref:`memory-type` for more information. - unit: Requests per normalization unit - RW Req: - rst: The total number of requests to the L2 that go to Read-Write coherent memory (RW) - allocations. See the :ref:`memory-type` for more information. - unit: Requests per normalization unit - Write - Credit Starvation: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to any memory location because too many write/atomic requests were - currently in flight, as a percent of the :ref:`total active L2 cycles `. - unit: Percent - Read (32B): - rst: The total number of L2 requests to Infinity Fabric to read 32B of data from - any memory location, per :ref:`normalization unit `. See :ref:`l2-request-flow` - for more detail. Typically unused on CDNA accelerators. - unit: Requests per normalization unit - Read (64B): - rst: The total number of L2 requests to Infinity Fabric to read 64B of data from - any memory location, per :ref:`normalization unit `. See :ref:`l2-request-flow` - for more detail. 
- unit: Requests per normalization unit - Read (Uncached): - rst: The total number of L2 requests to Infinity Fabric to read :ref:`uncached - data ` from any memory location, per :ref:`normalization unit - `. 64B requests for uncached data are counted as two 32B - uncached data requests. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - HBM Read: - rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data - from the accelerator's local HBM, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Remote Read: - rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data - from any source other than the accelerator's local HBM, per :ref:`normalization - unit `. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Write and Atomic (32B): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B of data to any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Write and Atomic (Uncached): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of :ref:`uncached data `, per :ref:`normalization unit - `. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Write and Atomic (64B): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 64B of data in any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - HBM Write and Atomic: - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of data in the accelerator's local HBM, per :ref:`normalization - unit `. See :ref:`l2-request-flow` for more detail. 
plain - unit: Requests per normalization unit - Remote Write and Atomic: - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of data in any memory location other than the accelerator's local - HBM, per :ref:`normalization unit `. See :ref:`l2-request-flow` - for more detail. - unit: Requests per normalization unit - Atomic: - rst: The total number of L2 requests to Infinity Fabric to atomically update 32B - or 64B of data in any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. Note that on current CDNA accelerators, - such as the :ref:`MI2XX `, requests are only considered *atomic* - by Infinity Fabric if they are targeted at non-write-cacheable memory, such - as :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations on the MI2XX. - unit: Requests per normalization unit Read Stall: rst: "The ratio of the total number of cycles the L2-Fabric interface was stalled\ \ on a read request to any destination (local HBM, remote PCIe\xAE connected\ \ accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_\ \ or CPU) over the :ref:`total active L2 cycles `." unit: Percent + HBM Write and Atomic Traffic: + rst: The percent of write and atomic requests generated by the L2 cache that are + routed to the accelerator's local high-bandwidth memory (HBM). This breakdown + does not consider the *size* of the request (meaning that 32B and 64B requests + are both counted as a single request), so this metric only *approximates* the + percent of the L2-Fabric Write and Atomic bandwidth directed to the local HBM. + Note that on current CDNA accelerators, such as the :ref:`MI2XX `, + requests are only considered *atomic* by Infinity Fabric if they are targeted + at :ref:`fine-grained memory ` allocations or :ref:`uncached + memory ` allocations. 
+ unit: Percent + Write and Atomic BW: + rst: The total number of bytes written by the L2 over Infinity Fabric by write and + atomic operations per :ref:`normalization unit `. Note + that on current CDNA accelerators, such as the :ref:`MI2XX `, requests + are only considered *atomic* by Infinity Fabric if they are targeted at non-write-cacheable + memory, for example, :ref:`fine-grained memory ` allocations or :ref:`uncached + memory ` allocations on the MI2XX. + unit: Bytes per normalization unit + Read BW: + rst: The total number of bytes read by the L2 cache from Infinity Fabric per :ref:`normalization + unit `. + unit: Bytes per normalization unit + Atomic Latency: + rst: The time-averaged number of cycles atomic requests spent in Infinity Fabric + before a completion acknowledgement (atomic without return value) or data (atomic + with return value) was returned to the L2. + unit: Cycles Write Stall: rst: The ratio of the total number of cycles the L2-Fabric interface was stalled on a write or atomic request to any destination (local HBM, remote accelerator or CPU, PCIe connected accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_ or CPU) over the :ref:`total active L2 cycles `. unit: Percent - Read - PCIe Stall: - rst: The number of cycles the L2-Fabric interface was stalled on read requests - to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total - active L2 cycles `. - unit: Percent - Read - Infinity Fabric Stall: - rst: The number of cycles the L2-Fabric interface was stalled on read requests - to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a percent - of the :ref:`total active L2 cycles `. - unit: Percent - Read - HBM Stall: - rst: The number of cycles the L2-Fabric interface was stalled on read requests - to the accelerator's local HBM as a percent of the :ref:`total active L2 cycles - `. 
- unit: Percent - Write - PCIe Stall: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to remote PCIe connected accelerators [#inf]_ or CPUs as a percent - of the :ref:`total active L2 cycles `. - unit: Percent - Write - Infinity Fabric Stall: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as - a percent of the :ref:`total active L2 cycles `. - unit: Percent - Write - HBM Stall: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to accelerator's local HBM as a percent of the total active L2 cycles. + Uncached Read Traffic: + rst: The percent of read requests generated by the L2 cache that are reading from + an :ref:`uncached memory allocation `. Note, as described in the + :ref:`request flow ` section, a single 64B read request is + typically counted as two uncached read requests. So, it is possible for the + Uncached Read Traffic to reach up to 200% of the total number of read requests. + This breakdown does not consider the *size* of the request (i.e., 32B and 64B + requests are both counted as a single request), so this metric only *approximates* + the percent of the L2-Fabric read bandwidth directed to an uncached memory + location. unit: Percent L2 - Fabric interface detailed metrics: - Utilization: - rst: The ratio of the :ref:`number of cycles an L2 channel was active, summed - over all L2 channels on the accelerator ` over the - :ref:`total L2 cycles `. - unit: Percent - Peak Bandwidth: - rst: The number of bytes looked up in the L2 cache, as a percent of the peak theoretical - bandwidth achievable on the specific accelerator. The number of bytes is calculated - as the number of cache lines requested multiplied by the cache line size. 
This - value does not consider partial requests, so e.g., if only a single value is - requested in a cache line, the data movement will still be counted as a full - cache line. - unit: Percent - Hit Rate: - rst: The ratio of the number of L2 cache line requests that hit in the L2 cache - over the total number of incoming cache line requests to the L2 cache. - unit: Percent - L2-Fabric Read BW: - rst: The number of bytes read by the L2 over the :ref:`Infinity Fabric interface - ` per unit time. - unit: GB/s - L2-Fabric Write and Atomic BW: - rst: The number of bytes sent by the L2 over the :ref:`Infinity Fabric interface - ` by write and atomic operations per unit time. - unit: GB/s - HBM Bandwidth: - rst: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory - (HBM) per unit time. This value is calculated as the number of HBM channels - multiplied by the HBM channel width multiplied by the HBM clock frequency. - unit: GB/s - Read BW: - rst: The total number of bytes read by the L2 cache from Infinity Fabric per :ref:`normalization - unit `. - unit: Bytes per normalization unit - HBM Read Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to the - accelerator's local high-bandwidth memory (HBM). This breakdown does not consider - the *size* of the request (meaning that 32B and 64B requests are both counted - as a single request), so this metric only *approximates* the percent of the - L2-Fabric Read bandwidth directed to the local HBM. - unit: Percent - Remote Read Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to any - memory location other than the accelerator's local high-bandwidth memory (HBM) - -- for example, the CPU's DRAM or a remote accelerator's HBM. 
This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Read bandwidth directed to a remote location. - unit: Percent - Uncached Read Traffic: - rst: The percent of read requests generated by the L2 cache that are reading from - an :ref:`uncached memory allocation `. Note, as described in the - :ref:`request flow ` section, a single 64B read request is - typically counted as two uncached read requests. So, it is possible for the - Uncached Read Traffic to reach up to 200% of the total number of read requests. - This breakdown does not consider the *size* of the request (i.e., 32B and 64B - requests are both counted as a single request), so this metric only *approximates* - the percent of the L2-Fabric read bandwidth directed to an uncached memory - location. - unit: Percent - Write and Atomic BW: - rst: The total number of bytes written by the L2 over Infinity Fabric by write and - atomic operations per :ref:`normalization unit `. Note - that on current CDNA accelerators, such as the :ref:`MI2XX `, requests - are only considered *atomic* by Infinity Fabric if they are targeted at non-write-cacheable - memory, for example, :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations on the MI2XX. - unit: Bytes per normalization unit - HBM Write and Atomic Traffic: - rst: The percent of write and atomic requests generated by the L2 cache that are - routed to the accelerator's local high-bandwidth memory (HBM). This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Write and Atomic bandwidth directed to the local HBM. 
- Note that on current CDNA accelerators, such as the :ref:`MI2XX `, - requests are only considered *atomic* by Infinity Fabric if they are targeted - at :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations. - unit: Percent - Remote Write and Atomic Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to any - memory location other than the accelerator's local high-bandwidth memory (HBM) - -- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Read bandwidth directed to a remote location. Note - that on current CDNA accelerators, such as the :ref:`MI2XX `, requests - are only considered *atomic* by Infinity Fabric if they are targeted at :ref:`fine-grained - memory ` allocations or :ref:`uncached memory ` allocations. - unit: Percent - Atomic Traffic: - rst: The percent of write requests generated by the L2 cache that are atomic requests - to *any* memory location. This breakdown does not consider the *size* of the - request (meaning that 32B and 64B requests are both counted as a single request), - so this metric only *approximates* the percent of the L2-Fabric Read bandwidth - directed to a remote location. Note that on current CDNA accelerators, such - as the :ref:`MI2XX `, requests are only considered *atomic* by - Infinity Fabric if they are targeted at :ref:`fine-grained memory ` - allocations or :ref:`uncached memory ` allocations. - unit: Percent - Uncached Write and Atomic Traffic: - rst: The percent of write and atomic requests generated by the L2 cache that are - targeting :ref:`uncached memory allocations `. 
This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric read bandwidth directed to uncached memory allocations. - unit: Percent - Read Latency: - rst: The time-averaged number of cycles read requests spent in Infinity Fabric before - data was returned to the L2. - unit: Cycles - Write and Atomic Latency: - rst: The time-averaged number of cycles write requests spent in Infinity Fabric - before a completion acknowledgement was returned to the L2. - unit: Cycles - Atomic Latency: - rst: The time-averaged number of cycles atomic requests spent in Infinity Fabric - before a completion acknowledgement (atomic without return value) or data (atomic - with return value) was returned to the L2. - unit: Cycles - Bandwidth: - rst: The number of bytes looked up in the L2 cache, per :ref:`normalization unit - `. The number of bytes is calculated as the number of - cache lines requested multiplied by the cache line size. This value does not - consider partial requests, so for example, if only a single value is requested - in a cache line, the data movement will still be counted as a full cache line. - unit: Bytes per normalization unit - Req: - rst: The total number of incoming requests to the L2 from all clients for all request - types, per :ref:`normalization unit `. - unit: Requests per normalization unit - Read Req: - rst: 'The total number of read requests to the L2 from all clients. ' - unit: Requests per normalization unit - Write Req: - rst: The total number of write requests to the L2 from all clients. - unit: Requests per normalization unit - Atomic Req: - rst: The total number of atomic requests (with and without return) to the L2 from - all clients. - unit: Requests per normalization unit - Streaming Req: - rst: The total number of incoming requests to the L2 that are marked as *streaming*. 
- The exact meaning of this may differ depending on the targeted accelerator, - however on an :ref:`MI2XX ` this corresponds to `non-temporal - load or stores `_. The - L2 cache attempts to evict *streaming* requests before normal requests when - the L2 is at capacity. - unit: Requests per normalization unit - Probe Req: - rst: The number of coherence probe requests made to the L2 cache from outside the - accelerator. On an :ref:`MI2XX `, probe requests may be generated - by, for example, writes to :ref:`fine-grained device ` memory - or by writes to :ref:`coarse-grained ` device memory. - unit: Requests per normalization unit - Cache Hit: - rst: The ratio of the number of L2 cache line requests that hit in the L2 cache - over the total number of incoming cache line requests to the L2 cache. - unit: Percent - Hits: - rst: The total number of requests to the L2 from all clients that hit in the cache. - As noted in the :ref:`Speed-of-Light ` section, this includes hit-on-miss - requests. - unit: Requests per normalization unit - Misses: - rst: The total number of requests to the L2 from all clients that miss in the cache. - As noted in the :ref:`Speed-of-Light ` section, these do not include - hit-on-miss requests. - unit: Requests per normalization unit - Writeback: - rst: The total number of L2 cache lines written back to memory for any reason. Write-backs - may occur due to user code (such as HIP kernel calls to ``__threadfence_system`` - or atomic built-ins) by the :doc:`command processor `'s - memory acquire/release fences, or for other internal hardware reasons. - unit: Cache lines per normalization unit - Writeback (Internal): - rst: The total number of L2 cache lines written back to memory for internal hardware - reasons, per :ref:`normalization unit `. 
- unit: Cache lines per normalization unit - Writeback (vL1D Req): - rst: The total number of L2 cache lines written back to memory due to requests initiated - by the :doc:`vL1D cache `, per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Evict (Internal): - rst: The total number of L2 cache lines evicted from the cache due to capacity limits, - per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Evict (vL1D Req): - rst: The total number of L2 cache lines evicted from the cache due to invalidation - requests initiated by the :doc:`vL1D cache `, per :ref:`normalization - unit `. - unit: Cache lines per normalization unit - NC Req: - rst: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory - allocations, per :ref:`normalization unit `. See the :ref:`memory-type` - for more information. - unit: Requests per normalization unit - UC Req: - rst: The total number of requests to the L2 that go to Uncached (UC) memory allocations. - See the :ref:`memory-type` for more information. - unit: Requests per normalization unit - CC Req: - rst: The total number of requests to the L2 that go to Coherently Cacheable (CC) memory - allocations. See the :ref:`memory-type` for more information. - unit: Requests per normalization unit - RW Req: - rst: The total number of requests to the L2 that go to Read-Write coherent memory (RW) - allocations. See the :ref:`memory-type` for more information. - unit: Requests per normalization unit - Write - Credit Starvation: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to any memory location because too many write/atomic requests were - currently in flight, as a percent of the :ref:`total active L2 cycles `. - unit: Percent - Read (32B): - rst: The total number of L2 requests to Infinity Fabric to read 32B of data from - any memory location, per :ref:`normalization unit `. See :ref:`l2-request-flow` - for more detail. 
Typically unused on CDNA accelerators. + Remote Write and Atomic: + rst: The total number of L2 requests to Infinity Fabric to write or atomically update + 32B or 64B of data in any memory location other than the accelerator's local + HBM, per :ref:`normalization unit `. See :ref:`l2-request-flow` + for more detail. unit: Requests per normalization unit Read (64B): rst: The total number of L2 requests to Infinity Fabric to read 64B of data from any memory location, per :ref:`normalization unit `. See :ref:`l2-request-flow` for more detail. unit: Requests per normalization unit - Read (Uncached): - rst: The total number of L2 requests to Infinity Fabric to read :ref:`uncached - data ` from any memory location, per :ref:`normalization unit - `. 64B requests for uncached data are counted as two 32B - uncached data requests. See :ref:`l2-request-flow` for more detail. + Read (32B): + rst: The total number of L2 requests to Infinity Fabric to read 32B of data from + any memory location, per :ref:`normalization unit `. See :ref:`l2-request-flow` + for more detail. Typically unused on CDNA accelerators. + unit: Requests per normalization unit + Write and Atomic (32B): + rst: The total number of L2 requests to Infinity Fabric to write or atomically update + 32B of data to any memory location, per :ref:`normalization unit `. + See :ref:`l2-request-flow` for more detail. + unit: Requests per normalization unit + Write and Atomic (64B): + rst: The total number of L2 requests to Infinity Fabric to write or atomically update + 64B of data in any memory location, per :ref:`normalization unit `. + See :ref:`l2-request-flow` for more detail. unit: Requests per normalization unit HBM Read: rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data @@ -3692,32 +984,11 @@ L2 - Fabric interface detailed metrics: from any source other than the accelerator's local HBM, per :ref:`normalization unit `. See :ref:`l2-request-flow` for more detail. 
unit: Requests per normalization unit - Write and Atomic (32B): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B of data to any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Write and Atomic (Uncached): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of :ref:`uncached data `, per :ref:`normalization unit - `. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Write and Atomic (64B): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 64B of data in any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit HBM Write and Atomic: rst: The total number of L2 requests to Infinity Fabric to write or atomically update 32B or 64B of data in the accelerator's local HBM, per :ref:`normalization unit `. See :ref:`l2-request-flow` for more detail. plain unit: Requests per normalization unit - Remote Write and Atomic: - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of data in any memory location other than the accelerator's local - HBM, per :ref:`normalization unit `. See :ref:`l2-request-flow` - for more detail. - unit: Requests per normalization unit Atomic: rst: The total number of L2 requests to Infinity Fabric to atomically update 32B or 64B of data in any memory location, per :ref:`normalization unit `. @@ -3727,350 +998,33 @@ L2 - Fabric interface detailed metrics: as :ref:`fine-grained memory ` allocations or :ref:`uncached memory ` allocations on the MI2XX. 
unit: Requests per normalization unit - Read Stall: - rst: "The ratio of the total number of cycles the L2-Fabric interface was stalled\ - \ on a read request to any destination (local HBM, remote PCIe\xAE connected\ - \ accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_\ - \ or CPU) over the :ref:`total active L2 cycles `." - unit: Percent - Write Stall: - rst: The ratio of the total number of cycles the L2-Fabric interface was stalled - on a write or atomic request to any destination (local HBM, remote accelerator - or CPU, PCIe connected accelerator or CPU, or remote Infinity Fabric connected - accelerator [#inf]_ or CPU) over the :ref:`total active L2 cycles `. - unit: Percent - Read - PCIe Stall: - rst: The number of cycles the L2-Fabric interface was stalled on read requests - to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total - active L2 cycles `. - unit: Percent - Read - Infinity Fabric Stall: - rst: The number of cycles the L2-Fabric interface was stalled on read requests - to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a percent - of the :ref:`total active L2 cycles `. - unit: Percent - Read - HBM Stall: - rst: The number of cycles the L2-Fabric interface was stalled on read requests - to the accelerator's local HBM as a percent of the :ref:`total active L2 cycles - `. - unit: Percent - Write - PCIe Stall: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to remote PCIe connected accelerators [#inf]_ or CPUs as a percent - of the :ref:`total active L2 cycles `. - unit: Percent - Write - Infinity Fabric Stall: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as - a percent of the :ref:`total active L2 cycles `. 
- unit: Percent - Write - HBM Stall: - rst: The number of cycles the L2-Fabric interface was stalled on write or atomic - requests to accelerator's local HBM as a percent of the total active L2 cycles. - unit: Percent + Read (Uncached): + rst: The total number of L2 requests to Infinity Fabric to read :ref:`uncached + data ` from any memory location, per :ref:`normalization unit + `. 64B requests for uncached data are counted as two 32B + uncached data requests. See :ref:`l2-request-flow` for more detail. + unit: Requests per normalization unit + Write and Atomic (Uncached): + rst: The total number of L2 requests to Infinity Fabric to write or atomically update + 32B or 64B of :ref:`uncached data `, per :ref:`normalization unit + `. See :ref:`l2-request-flow` for more detail. + unit: Requests per normalization unit L2 - Fabric Interface stalls: - Utilization: - rst: The ratio of the :ref:`number of cycles an L2 channel was active, summed - over all L2 channels on the accelerator ` over the - :ref:`total L2 cycles `. - unit: Percent - Peak Bandwidth: - rst: The number of bytes looked up in the L2 cache, as a percent of the peak theoretical - bandwidth achievable on the specific accelerator. The number of bytes is calculated - as the number of cache lines requested multiplied by the cache line size. This - value does not consider partial requests, so e.g., if only a single value is - requested in a cache line, the data movement will still be counted as a full - cache line. - unit: Percent - Hit Rate: - rst: The ratio of the number of L2 cache line requests that hit in the L2 cache - over the total number of incoming cache line requests to the L2 cache. - unit: Percent - L2-Fabric Read BW: - rst: The number of bytes read by the L2 over the :ref:`Infinity Fabric interface - ` per unit time. - unit: GB/s - L2-Fabric Write and Atomic BW: - rst: The number of bytes sent by the L2 over the :ref:`Infinity Fabric interface - ` by write and atomic operations per unit time. 
- unit: GB/s - HBM Bandwidth: - rst: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory - (HBM) per unit time. This value is calculated as the number of HBM channels - multiplied by the HBM channel width multiplied by the HBM clock frequency. - unit: GB/s - Read BW: - rst: The total number of bytes read by the L2 cache from Infinity Fabric per :ref:`normalization - unit `. - unit: Bytes per normalization unit - HBM Read Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to the - accelerator's local high-bandwidth memory (HBM). This breakdown does not consider - the *size* of the request (meaning that 32B and 64B requests are both counted - as a single request), so this metric only *approximates* the percent of the - L2-Fabric Read bandwidth directed to the local HBM. - unit: Percent - Remote Read Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to any - memory location other than the accelerator's local high-bandwidth memory (HBM) - -- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Read bandwidth directed to a remote location. - unit: Percent - Uncached Read Traffic: - rst: The percent of read requests generated by the L2 cache that are reading from - an :ref:`uncached memory allocation `. Note, as described in the - :ref:`request flow ` section, a single 64B read request is - typically counted as two uncached read requests. So, it is possible for the - Uncached Read Traffic to reach up to 200% of the total number of read requests. 
- This breakdown does not consider the *size* of the request (i.e., 32B and 64B - requests are both counted as a single request), so this metric only *approximates* - the percent of the L2-Fabric read bandwidth directed to an uncached memory - location. - unit: Percent - Write and Atomic BW: - rst: The total number of bytes written by the L2 over Infinity Fabric by write and - atomic operations per :ref:`normalization unit `. Note - that on current CDNA accelerators, such as the :ref:`MI2XX `, requests - are only considered *atomic* by Infinity Fabric if they are targeted at non-write-cacheable - memory, for example, :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations on the MI2XX. - unit: Bytes per normalization unit - HBM Write and Atomic Traffic: - rst: The percent of write and atomic requests generated by the L2 cache that are - routed to the accelerator's local high-bandwidth memory (HBM). This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Write and Atomic bandwidth directed to the local HBM. - Note that on current CDNA accelerators, such as the :ref:`MI2XX `, - requests are only considered *atomic* by Infinity Fabric if they are targeted - at :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations. - unit: Percent - Remote Write and Atomic Traffic: - rst: The percent of read requests generated by the L2 cache that are routed to any - memory location other than the accelerator's local high-bandwidth memory (HBM) - -- for example, the CPU's DRAM or a remote accelerator's HBM. This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric Read bandwidth directed to a remote location. 
Note - that on current CDNA accelerators, such as the :ref:`MI2XX `, requests - are only considered *atomic* by Infinity Fabric if they are targeted at :ref:`fine-grained - memory ` allocations or :ref:`uncached memory ` allocations. - unit: Percent - Atomic Traffic: - rst: The percent of write requests generated by the L2 cache that are atomic requests - to *any* memory location. This breakdown does not consider the *size* of the - request (meaning that 32B and 64B requests are both counted as a single request), - so this metric only *approximates* the percent of the L2-Fabric Read bandwidth - directed to a remote location. Note that on current CDNA accelerators, such - as the :ref:`MI2XX `, requests are only considered *atomic* by - Infinity Fabric if they are targeted at :ref:`fine-grained memory ` - allocations or :ref:`uncached memory ` allocations. - unit: Percent - Uncached Write and Atomic Traffic: - rst: The percent of write and atomic requests generated by the L2 cache that are - targeting :ref:`uncached memory allocations `. This breakdown - does not consider the *size* of the request (meaning that 32B and 64B requests - are both counted as a single request), so this metric only *approximates* the - percent of the L2-Fabric read bandwidth directed to uncached memory allocations. - unit: Percent - Read Latency: - rst: The time-averaged number of cycles read requests spent in Infinity Fabric before - data was returned to the L2. - unit: Cycles - Write and Atomic Latency: - rst: The time-averaged number of cycles write requests spent in Infinity Fabric - before a completion acknowledgement was returned to the L2. - unit: Cycles - Atomic Latency: - rst: The time-averaged number of cycles atomic requests spent in Infinity Fabric - before a completion acknowledgement (atomic without return value) or data (atomic - with return value) was returned to the L2. 
- unit: Cycles - Bandwidth: - rst: The number of bytes looked up in the L2 cache, per :ref:`normalization unit - `. The number of bytes is calculated as the number of - cache lines requested multiplied by the cache line size. This value does not - consider partial requests, so for example, if only a single value is requested - in a cache line, the data movement will still be counted as a full cache line. - unit: Bytes per normalization unit - Req: - rst: The total number of incoming requests to the L2 from all clients for all request - types, per :ref:`normalization unit `. - unit: Requests per normalization unit - Read Req: - rst: 'The total number of read requests to the L2 from all clients. ' - unit: Requests per normalization unit - Write Req: - rst: The total number of write requests to the L2 from all clients. - unit: Requests per normalization unit - Atomic Req: - rst: The total number of atomic requests (with and without return) to the L2 from - all clients. - unit: Requests per normalization unit - Streaming Req: - rst: The total number of incoming requests to the L2 that are marked as *streaming*. - The exact meaning of this may differ depending on the targeted accelerator, - however on an :ref:`MI2XX ` this corresponds to `non-temporal - load or stores `_. The - L2 cache attempts to evict *streaming* requests before normal requests when - the L2 is at capacity. - unit: Requests per normalization unit - Probe Req: - rst: The number of coherence probe requests made to the L2 cache from outside the - accelerator. On an :ref:`MI2XX `, probe requests may be generated - by, for example, writes to :ref:`fine-grained device ` memory - or by writes to :ref:`coarse-grained ` device memory. - unit: Requests per normalization unit - Cache Hit: - rst: The ratio of the number of L2 cache line requests that hit in the L2 cache - over the total number of incoming cache line requests to the L2 cache. 
- unit: Percent - Hits: - rst: The total number of requests to the L2 from all clients that hit in the cache. - As noted in the :ref:`Speed-of-Light ` section, this includes hit-on-miss - requests. - unit: Requests per normalization unit - Misses: - rst: The total number of requests to the L2 from all clients that miss in the cache. - As noted in the :ref:`Speed-of-Light ` section, these do not include - hit-on-miss requests. - unit: Requests per normalization unit - Writeback: - rst: The total number of L2 cache lines written back to memory for any reason. Write-backs - may occur due to user code (such as HIP kernel calls to ``__threadfence_system`` - or atomic built-ins) by the :doc:`command processor `'s - memory acquire/release fences, or for other internal hardware reasons. - unit: Cache lines per normalization unit - Writeback (Internal): - rst: The total number of L2 cache lines written back to memory for internal hardware - reasons, per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Writeback (vL1D Req): - rst: The total number of L2 cache lines written back to memory due to requests initiated - by the :doc:`vL1D cache `, per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Evict (Internal): - rst: The total number of L2 cache lines evicted from the cache due to capacity limits, - per :ref:`normalization unit `. - unit: Cache lines per normalization unit - Evict (vL1D Req): - rst: The total number of L2 cache lines evicted from the cache due to invalidation - requests initiated by the :doc:`vL1D cache `, per :ref:`normalization - unit `. - unit: Cache lines per normalization unit - NC Req: - rst: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory - allocations, per :ref:`normalization unit `. See the :ref:`memory-type` - for more information. - unit: Requests per normalization unit - UC Req: - rst: The total number of requests to the L2 that go to Uncached (UC) memory allocations. 
- See the :ref:`memory-type` for more information. - unit: Requests per normalization unit - CC Req: - rst: The total number of requests to the L2 that go to Coherently Cacheable (CC) memory - allocations. See the :ref:`memory-type` for more information. - unit: Requests per normalization unit - RW Req: - rst: The total number of requests to the L2 that go to Read-Write coherent memory (RW) - allocations. See the :ref:`memory-type` for more information. - unit: Requests per normalization unit Write - Credit Starvation: rst: The number of cycles the L2-Fabric interface was stalled on write or atomic requests to any memory location because too many write/atomic requests were currently in flight, as a percent of the :ref:`total active L2 cycles `. unit: Percent - Read (32B): - rst: The total number of L2 requests to Infinity Fabric to read 32B of data from - any memory location, per :ref:`normalization unit `. See :ref:`l2-request-flow` - for more detail. Typically unused on CDNA accelerators. - unit: Requests per normalization unit - Read (64B): - rst: The total number of L2 requests to Infinity Fabric to read 64B of data from - any memory location, per :ref:`normalization unit `. See :ref:`l2-request-flow` - for more detail. - unit: Requests per normalization unit - Read (Uncached): - rst: The total number of L2 requests to Infinity Fabric to read :ref:`uncached - data ` from any memory location, per :ref:`normalization unit - `. 64B requests for uncached data are counted as two 32B - uncached data requests. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - HBM Read: - rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data - from the accelerator's local HBM, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. 
- unit: Requests per normalization unit - Remote Read: - rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data - from any source other than the accelerator's local HBM, per :ref:`normalization - unit `. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Write and Atomic (32B): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B of data to any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Write and Atomic (Uncached): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of :ref:`uncached data `, per :ref:`normalization unit - `. See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - Write and Atomic (64B): - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 64B of data in any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. - unit: Requests per normalization unit - HBM Write and Atomic: - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of data in the accelerator's local HBM, per :ref:`normalization - unit `. See :ref:`l2-request-flow` for more detail. plain - unit: Requests per normalization unit - Remote Write and Atomic: - rst: The total number of L2 requests to Infinity Fabric to write or atomically update - 32B or 64B of data in any memory location other than the accelerator's local - HBM, per :ref:`normalization unit `. See :ref:`l2-request-flow` - for more detail. - unit: Requests per normalization unit - Atomic: - rst: The total number of L2 requests to Infinity Fabric to atomically update 32B - or 64B of data in any memory location, per :ref:`normalization unit `. - See :ref:`l2-request-flow` for more detail. 
Note that on current CDNA accelerators, - such as the :ref:`MI2XX `, requests are only considered *atomic* - by Infinity Fabric if they are targeted at non-write-cacheable memory, such - as :ref:`fine-grained memory ` allocations or :ref:`uncached - memory ` allocations on the MI2XX. - unit: Requests per normalization unit - Read Stall: - rst: "The ratio of the total number of cycles the L2-Fabric interface was stalled\ - \ on a read request to any destination (local HBM, remote PCIe\xAE connected\ - \ accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_\ - \ or CPU) over the :ref:`total active L2 cycles `." - unit: Percent - Write Stall: - rst: The ratio of the total number of cycles the L2-Fabric interface was stalled - on a write or atomic request to any destination (local HBM, remote accelerator - or CPU, PCIe connected accelerator or CPU, or remote Infinity Fabric connected - accelerator [#inf]_ or CPU) over the :ref:`total active L2 cycles `. - unit: Percent - Read - PCIe Stall: + Read - HBM Stall: rst: The number of cycles the L2-Fabric interface was stalled on read requests - to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total - active L2 cycles `. + to the accelerator's local HBM as a percent of the :ref:`total active L2 cycles + `. unit: Percent Read - Infinity Fabric Stall: rst: The number of cycles the L2-Fabric interface was stalled on read requests to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total active L2 cycles `. unit: Percent - Read - HBM Stall: - rst: The number of cycles the L2-Fabric interface was stalled on read requests - to the accelerator's local HBM as a percent of the :ref:`total active L2 cycles - `. 
- unit: Percent Write - PCIe Stall: rst: The number of cycles the L2-Fabric interface was stalled on write or atomic requests to remote PCIe connected accelerators [#inf]_ or CPUs as a percent @@ -4085,17 +1039,22 @@ L2 - Fabric Interface stalls: rst: The number of cycles the L2-Fabric interface was stalled on write or atomic requests to accelerator's local HBM as a percent of the total active L2 cycles. unit: Percent + Read - PCIe Stall: + rst: The number of cycles the L2-Fabric interface was stalled on read requests + to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the :ref:`total + active L2 cycles `. + unit: Percent Scalar L1D Speed-of-Light: - Bandwidth: - rst: The number of bytes looked up in the sL1D cache, as a percent of the peak theoretical - bandwidth. Calculated as the ratio of sL1D requests over the :ref:`total sL1D - cycles `. - unit: Percent Cache Hit Rate: rst: Indicates the percent of sL1D requests that hit on a previously loaded line the cache. The ratio of the number of sL1D requests that hit [#sl1d-cache]_ over the number of all sL1D requests. unit: Percent + Bandwidth: + rst: The number of bytes looked up in the sL1D cache, as a percent of the peak theoretical + bandwidth. Calculated as the ratio of sL1D requests over the :ref:`total sL1D + cycles `. + unit: Percent sL1D-L2 BW: rst: "The total number of bytes read from, written to, or atomically updated \ \ across the sL1D\u2194:doc:`L2 ` interface, per :ref:`normalization\ @@ -4103,119 +1062,47 @@ Scalar L1D Speed-of-Light: \ unused on current CDNA accelerators, so in the majority of cases this can\ \ be interpreted as an sL1D\u2192L2 read bandwidth." unit: Bytes per normalization unit - Req: - rst: The total number of requests, of any size or type, made to the sL1D per :ref:`normalization - unit `. - unit: Requests per normalization unit - Hits: - rst: The total number of sL1D requests that hit on a previously loaded cache line, - per :ref:`normalization unit `. 
- unit: Requests per normalization unit - Misses - Non Duplicated: - rst: The total number of sL1D requests that missed on a cache line that *was not* - already pending due to another request, per :ref:`normalization unit `. - See :ref:`desc-sl1d-sol` for more detail. - unit: Requests per normalization unit - Misses- Duplicated: - rst: The total number of sL1D requests that missed on a cache line that *was* already - pending due to another request, per :ref:`normalization unit `. - See :ref:`desc-sl1d-sol` for more detail. - unit: Requests per normalization unit - Read Req (Total): - rst: The total number of sL1D read requests of any size, per :ref:`normalization - unit `. - unit: Requests per normalization unit - Atomic Req: - rst: The total number of atomic requests from sL1D to the :doc:`L2 `, - per :ref:`normalization unit `. Typically unused on current - CDNA accelerators. - unit: Requests per normalization unit - Read Req (1 DWord): - rst: The total number of sL1D read requests made for a single dword of data (4B), - per :ref:`normalization unit `. - unit: Requests per normalization unit - Read Req (2 DWord): - rst: The total number of sL1D read requests made for a two dwords of data (8B), - per :ref:`normalization unit `. - unit: Requests per normalization unit - Read Req (4 DWord): - rst: The total number of sL1D read requests made for a four dwords of data (16B), - per :ref:`normalization unit `. - unit: Requests per normalization unit - Read Req (8 DWord): - rst: The total number of sL1D read requests made for a eight dwords of data (32B), - per :ref:`normalization unit `. - unit: Requests per normalization unit - Read Req (16 DWord): - rst: The total number of sL1D read requests made for a sixteen dwords of data (64B), - per :ref:`normalization unit `. - unit: Requests per normalization unit - Read Req: - rst: The total number of read requests from sL1D to the :doc:`L2 `, per - :ref:`normalization unit `. 
- unit: Requests per normalization unit - Write Req: - rst: The total number of write requests from sL1D to the :doc:`L2 `, per - :ref:`normalization unit `. Typically unused on current - CDNA accelerators. - unit: Requests per normalization unit - Stall Cycles: - rst: "The total number of cycles the sL1D\u2194 :doc:`L2 ` interface\ - \ was stalled, per :ref:`normalization unit `." - unit: Cycles per normalization unit Scalar L1D cache accesses: - Bandwidth: - rst: The number of bytes looked up in the sL1D cache, as a percent of the peak theoretical - bandwidth. Calculated as the ratio of sL1D requests over the :ref:`total sL1D - cycles `. - unit: Percent - Cache Hit Rate: - rst: Indicates the percent of sL1D requests that hit on a previously loaded line - the cache. The ratio of the number of sL1D requests that hit [#sl1d-cache]_ - over the number of all sL1D requests. - unit: Percent - sL1D-L2 BW: - rst: "The total number of bytes read from, written to, or atomically updated \ - \ across the sL1D\u2194:doc:`L2 ` interface, per :ref:`normalization\ - \ unit `. Note that sL1D writes and atomics are typically\ - \ unused on current CDNA accelerators, so in the majority of cases this can\ - \ be interpreted as an sL1D\u2192L2 read bandwidth." - unit: Bytes per normalization unit - Req: - rst: The total number of requests, of any size or type, made to the sL1D per :ref:`normalization - unit `. - unit: Requests per normalization unit - Hits: - rst: The total number of sL1D requests that hit on a previously loaded cache line, + Read Req (1 DWord): + rst: The total number of sL1D read requests made for a single dword of data (4B), per :ref:`normalization unit `. unit: Requests per normalization unit - Misses - Non Duplicated: - rst: The total number of sL1D requests that missed on a cache line that *was not* - already pending due to another request, per :ref:`normalization unit `. - See :ref:`desc-sl1d-sol` for more detail. 
- unit: Requests per normalization unit - Misses- Duplicated: - rst: The total number of sL1D requests that missed on a cache line that *was* already - pending due to another request, per :ref:`normalization unit `. - See :ref:`desc-sl1d-sol` for more detail. - unit: Requests per normalization unit - Read Req (Total): - rst: The total number of sL1D read requests of any size, per :ref:`normalization - unit `. + Read Req (16 DWord): + rst: The total number of sL1D read requests made for a sixteen dwords of data (64B), + per :ref:`normalization unit `. unit: Requests per normalization unit Atomic Req: rst: The total number of atomic requests from sL1D to the :doc:`L2 `, per :ref:`normalization unit `. Typically unused on current CDNA accelerators. unit: Requests per normalization unit - Read Req (1 DWord): - rst: The total number of sL1D read requests made for a single dword of data (4B), - per :ref:`normalization unit `. + Misses - Non Duplicated: + rst: The total number of sL1D requests that missed on a cache line that *was not* + already pending due to another request, per :ref:`normalization unit `. + See :ref:`desc-sl1d-sol` for more detail. unit: Requests per normalization unit Read Req (2 DWord): rst: The total number of sL1D read requests made for a two dwords of data (8B), per :ref:`normalization unit `. unit: Requests per normalization unit + Hits: + rst: The total number of sL1D requests that hit on a previously loaded cache line, + per :ref:`normalization unit `. + unit: Requests per normalization unit + Cache Hit Rate: + rst: Indicates the percent of sL1D requests that hit on a previously loaded line + the cache. The ratio of the number of sL1D requests that hit [#sl1d-cache]_ + over the number of all sL1D requests. + unit: Percent + Misses- Duplicated: + rst: The total number of sL1D requests that missed on a cache line that *was* already + pending due to another request, per :ref:`normalization unit `. + See :ref:`desc-sl1d-sol` for more detail. 
+ unit: Requests per normalization unit + Req: + rst: The total number of requests, of any size or type, made to the sL1D per :ref:`normalization + unit `. + unit: Requests per normalization unit Read Req (4 DWord): rst: The total number of sL1D read requests made for a four dwords of data (16B), per :ref:`normalization unit `. @@ -4224,34 +1111,11 @@ Scalar L1D cache accesses: rst: The total number of sL1D read requests made for a eight dwords of data (32B), per :ref:`normalization unit `. unit: Requests per normalization unit - Read Req (16 DWord): - rst: The total number of sL1D read requests made for a sixteen dwords of data (64B), - per :ref:`normalization unit `. + Read Req (Total): + rst: The total number of sL1D read requests of any size, per :ref:`normalization + unit `. unit: Requests per normalization unit - Read Req: - rst: The total number of read requests from sL1D to the :doc:`L2 `, per - :ref:`normalization unit `. - unit: Requests per normalization unit - Write Req: - rst: The total number of write requests from sL1D to the :doc:`L2 `, per - :ref:`normalization unit `. Typically unused on current - CDNA accelerators. - unit: Requests per normalization unit - Stall Cycles: - rst: "The total number of cycles the sL1D\u2194 :doc:`L2 ` interface\ - \ was stalled, per :ref:`normalization unit `." - unit: Cycles per normalization unit Scalar L1D Cache - L2 Interface: - Bandwidth: - rst: The number of bytes looked up in the sL1D cache, as a percent of the peak theoretical - bandwidth. Calculated as the ratio of sL1D requests over the :ref:`total sL1D - cycles `. - unit: Percent - Cache Hit Rate: - rst: Indicates the percent of sL1D requests that hit on a previously loaded line - the cache. The ratio of the number of sL1D requests that hit [#sl1d-cache]_ - over the number of all sL1D requests. 
- unit: Percent sL1D-L2 BW: rst: "The total number of bytes read from, written to, or atomically updated \ \ across the sL1D\u2194:doc:`L2 ` interface, per :ref:`normalization\ @@ -4259,328 +1123,142 @@ Scalar L1D Cache - L2 Interface: \ unused on current CDNA accelerators, so in the majority of cases this can\ \ be interpreted as an sL1D\u2192L2 read bandwidth." unit: Bytes per normalization unit - Req: - rst: The total number of requests, of any size or type, made to the sL1D per :ref:`normalization - unit `. - unit: Requests per normalization unit - Hits: - rst: The total number of sL1D requests that hit on a previously loaded cache line, - per :ref:`normalization unit `. - unit: Requests per normalization unit - Misses - Non Duplicated: - rst: The total number of sL1D requests that missed on a cache line that *was not* - already pending due to another request, per :ref:`normalization unit `. - See :ref:`desc-sl1d-sol` for more detail. - unit: Requests per normalization unit - Misses- Duplicated: - rst: The total number of sL1D requests that missed on a cache line that *was* already - pending due to another request, per :ref:`normalization unit `. - See :ref:`desc-sl1d-sol` for more detail. - unit: Requests per normalization unit - Read Req (Total): - rst: The total number of sL1D read requests of any size, per :ref:`normalization - unit `. - unit: Requests per normalization unit Atomic Req: rst: The total number of atomic requests from sL1D to the :doc:`L2 `, per :ref:`normalization unit `. Typically unused on current CDNA accelerators. unit: Requests per normalization unit - Read Req (1 DWord): - rst: The total number of sL1D read requests made for a single dword of data (4B), - per :ref:`normalization unit `. - unit: Requests per normalization unit - Read Req (2 DWord): - rst: The total number of sL1D read requests made for a two dwords of data (8B), - per :ref:`normalization unit `. 
- unit: Requests per normalization unit - Read Req (4 DWord): - rst: The total number of sL1D read requests made for a four dwords of data (16B), - per :ref:`normalization unit `. - unit: Requests per normalization unit - Read Req (8 DWord): - rst: The total number of sL1D read requests made for a eight dwords of data (32B), - per :ref:`normalization unit `. - unit: Requests per normalization unit - Read Req (16 DWord): - rst: The total number of sL1D read requests made for a sixteen dwords of data (64B), - per :ref:`normalization unit `. - unit: Requests per normalization unit - Read Req: - rst: The total number of read requests from sL1D to the :doc:`L2 `, per - :ref:`normalization unit `. - unit: Requests per normalization unit - Write Req: - rst: The total number of write requests from sL1D to the :doc:`L2 `, per - :ref:`normalization unit `. Typically unused on current - CDNA accelerators. - unit: Requests per normalization unit Stall Cycles: rst: "The total number of cycles the sL1D\u2194 :doc:`L2 ` interface\ \ was stalled, per :ref:`normalization unit `." unit: Cycles per normalization unit + Write Req: + rst: The total number of write requests from sL1D to the :doc:`L2 `, per + :ref:`normalization unit `. Typically unused on current + CDNA accelerators. + unit: Requests per normalization unit + Read Req: + rst: The total number of read requests from sL1D to the :doc:`L2 `, per + :ref:`normalization unit `. + unit: Requests per normalization unit L1I Speed-of-Light: - Bandwidth: - rst: The number of bytes looked up in the L1I cache, as a percent of the peak theoretical - bandwidth. Calculated as the ratio of L1I requests over the :ref:`total L1I - cycles `. - unit: Percent Cache Hit Rate: rst: The percent of L1I requests that hit [#l1i-cache]_ on a previously loaded line the cache. Calculated as the ratio of the number of L1I requests that hit over the number of all L1I requests. 
unit: Percent + Bandwidth: + rst: The number of bytes looked up in the L1I cache, as a percent of the peak theoretical + bandwidth. Calculated as the ratio of L1I requests over the :ref:`total L1I + cycles `. + unit: Percent L1I-L2 Bandwidth: rst: "The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth\ \ achieved. Calculated as the ratio of the total number of requests from the\ \ L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles `." unit: Percent - Req: - rst: The total number of requests made to the L1I per normalization-unit - unit: Requests per normalization unit - Hits: - rst: The total number of L1I requests that hit on a previously loaded cache line, - per :ref:`normalization-unit `. - unit: Requests per normalization unit - Misses - Non Duplicated: - rst: The total number of L1I requests that missed on a cache line that *were - not* already pending due to another request, per :ref:`normalization-unit `. - See note in :ref:`desc-l1i-sol` for more detail. - unit: Requests per normalization unit - Misses - Duplicated: - rst: The total number of L1I requests that missed on a cache line that *were* already - pending due to another request, per :ref:`normalization-unit `. - See note in :ref:`desc-l1i-sol` for more detail. - unit: Requests per normalization unit - Instruction Fetch Latency: - rst: The average number of cycles spent to fetch instructions to a :doc:`CU `. - unit: Cycles L1I cache accesses: - Bandwidth: - rst: The number of bytes looked up in the L1I cache, as a percent of the peak theoretical - bandwidth. Calculated as the ratio of L1I requests over the :ref:`total L1I - cycles `. - unit: Percent - Cache Hit Rate: - rst: The percent of L1I requests that hit [#l1i-cache]_ on a previously loaded line - the cache. Calculated as the ratio of the number of L1I requests that hit over - the number of all L1I requests. 
- unit: Percent - L1I-L2 Bandwidth: - rst: "The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth\ - \ achieved. Calculated as the ratio of the total number of requests from the\ - \ L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles `." - unit: Percent - Req: - rst: The total number of requests made to the L1I per normalization-unit - unit: Requests per normalization unit - Hits: - rst: The total number of L1I requests that hit on a previously loaded cache line, - per :ref:`normalization-unit `. + Instruction Fetch Latency: + rst: The average number of cycles spent to fetch instructions to a :doc:`CU `. + unit: Cycles + Misses - Duplicated: + rst: The total number of L1I requests that missed on a cache line that *were* already + pending due to another request, per :ref:`normalization-unit `. + See note in :ref:`desc-l1i-sol` for more detail. unit: Requests per normalization unit Misses - Non Duplicated: rst: The total number of L1I requests that missed on a cache line that *were not* already pending due to another request, per :ref:`normalization-unit `. See note in :ref:`desc-l1i-sol` for more detail. unit: Requests per normalization unit - Misses - Duplicated: - rst: The total number of L1I requests that missed on a cache line that *were* already - pending due to another request, per :ref:`normalization-unit `. - See note in :ref:`desc-l1i-sol` for more detail. + Hits: + rst: The total number of L1I requests that hit on a previously loaded cache line, + per :ref:`normalization-unit `. + unit: Requests per normalization unit + Cache Hit Rate: + rst: The percent of L1I requests that hit [#l1i-cache]_ on a previously loaded line + the cache. Calculated as the ratio of the number of L1I requests that hit over + the number of all L1I requests. 
+ unit: Percent + Req: + rst: The total number of requests made to the L1I per normalization-unit unit: Requests per normalization unit - Instruction Fetch Latency: - rst: The average number of cycles spent to fetch instructions to a :doc:`CU `. - unit: Cycles L1I <-> L2 interface: - Bandwidth: - rst: The number of bytes looked up in the L1I cache, as a percent of the peak theoretical - bandwidth. Calculated as the ratio of L1I requests over the :ref:`total L1I - cycles `. - unit: Percent - Cache Hit Rate: - rst: The percent of L1I requests that hit [#l1i-cache]_ on a previously loaded line - the cache. Calculated as the ratio of the number of L1I requests that hit over - the number of all L1I requests. - unit: Percent L1I-L2 Bandwidth: rst: "The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth\ \ achieved. Calculated as the ratio of the total number of requests from the\ \ L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles `." unit: Percent - Req: - rst: The total number of requests made to the L1I per normalization-unit - unit: Requests per normalization unit - Hits: - rst: The total number of L1I requests that hit on a previously loaded cache line, - per :ref:`normalization-unit `. - unit: Requests per normalization unit - Misses - Non Duplicated: - rst: The total number of L1I requests that missed on a cache line that *were - not* already pending due to another request, per :ref:`normalization-unit `. - See note in :ref:`desc-l1i-sol` for more detail. - unit: Requests per normalization unit - Misses - Duplicated: - rst: The total number of L1I requests that missed on a cache line that *were* already - pending due to another request, per :ref:`normalization-unit `. - See note in :ref:`desc-l1i-sol` for more detail. - unit: Requests per normalization unit - Instruction Fetch Latency: - rst: The average number of cycles spent to fetch instructions to a :doc:`CU `. 
- unit: Cycles Workgroup manager utilizations: - Accelerator Utilization: - rst: The percent of cycles in the kernel where the accelerator was actively doing - any work. - unit: Percent - Scheduler-Pipe Utilization: - rst: 'The percent of :ref:`total scheduler-pipe cycles ` in the - kernel where the scheduler-pipes were actively doing any work. Note: this value - is expected to range between 0% and 25%. See :ref:`desc-spi`.' - unit: Percent - Workgroup Manager Utilization: - rst: The percent of cycles in the kernel where the workgroup manager was actively - doing any work. - unit: Percent - Shader Engine Utilization: - rst: The percent of :ref:`total shader engine cycles ` in the kernel - where any CU in a shader-engine was actively doing any work, normalized over - all shader-engines. Low values (e.g., << 100%) indicate that the accelerator - was not fully saturated by the kernel, or a potential load-imbalance issue. - unit: Percent SIMD Utilization: rst: The percent of :ref:`total SIMD cycles ` in the kernel where any :ref:`SIMD ` on a CU was actively doing any work, summed over all CUs. Low values (less than 100%) indicate that the accelerator was not fully saturated by the kernel, or a potential load-imbalance issue. unit: Percent + Workgroup Manager Utilization: + rst: The percent of cycles in the kernel where the workgroup manager was actively + doing any work. + unit: Percent + Accelerator Utilization: + rst: The percent of cycles in the kernel where the accelerator was actively doing + any work. + unit: Percent Dispatched Workgroups: rst: The total number of workgroups forming this kernel launch. unit: Workgroups + Scheduler-Pipe Utilization: + rst: 'The percent of :ref:`total scheduler-pipe cycles ` in the + kernel where the scheduler-pipes were actively doing any work. Note: this value + is expected to range between 0% and 25%. See :ref:`desc-spi`.' 
+ unit: Percent Dispatched Wavefronts: rst: The total number of wavefronts, summed over all workgroups, forming this kernel launch. unit: Wavefronts - VGPR Writes: - rst: The average number of cycles spent initializing :ref:`VGPRs ` at - wave creation. - unit: Cycles/wave SGPR Writes: rst: The average number of cycles spent initializing :ref:`SGPRs ` at wave creation. unit: Cycles/wave - Not-scheduled Rate (Workgroup Manager): - rst: 'The percent of :ref:`total scheduler-pipe cycles ` in the - kernel where a workgroup could not be scheduled to a :doc:`CU ` - due to a bottleneck within the workgroup manager rather than a lack of a CU - or :ref:`SIMD ` with sufficient resources. Note: this value is expected - to range between 0-25%. See note in :ref:`workgroup manager ` description.' - unit: Percent - Not-scheduled Rate (Scheduler-Pipe): - rst: 'The percent of :ref:`total scheduler-pipe cycles ` in the - kernel where a workgroup could not be scheduled to a :doc:`CU ` - due to a bottleneck within the scheduler-pipes rather than a lack of a CU or - :ref:`SIMD ` with sufficient resources. Note: this value is expected - to range between 0-25%, see note in :ref:`workgroup manager ` description.' - unit: Percent - Scheduler-Pipe Stall Rate: - rst: 'The percent of :ref:`total scheduler-pipe cycles ` in the - kernel where a workgroup could not be scheduled to a :doc:`CU ` - due to occupancy limitations (like a lack of a CU or :ref:`SIMD ` - with sufficient resources). Note: this value is expected to range between 0-25%, - see note in :ref:`workgroup manager ` description.' - unit: Percent - Scratch Stall Rate: - rst: The percent of :ref:`total shader-engine cycles ` in the kernel - where a workgroup could not be scheduled to a :doc:`CU ` due - to lack of :ref:`private (a.k.a., scratch) memory ` slots. 
While - this can reach up to 100%, note that the actual occupancy limitations on a kernel - using private memory are typically quite small (for example, less than 1% of - the total number of waves that can be scheduled to an accelerator). - unit: Percent - Insufficient SIMD Waveslots: - rst: The percent of :ref:`total SIMD cycles ` in the kernel where - a workgroup could not be scheduled to a :ref:`SIMD ` due to lack - of available :ref:`waveslots `. - unit: Percent - Insufficient SIMD VGPRs: - rst: The percent of :ref:`total SIMD cycles ` in the kernel where - a workgroup could not be scheduled to a :ref:`SIMD ` due to lack - of available :ref:`VGPRs `. - unit: Percent - Insufficient SIMD SGPRs: - rst: The percent of :ref:`total SIMD cycles ` in the kernel where - a workgroup could not be scheduled to a :ref:`SIMD ` due to lack - of available :ref:`SGPRs `. - unit: Percent - Insufficient CU LDS: - rst: The percent of :ref:`total CU cycles ` in the kernel where - a workgroup could not be scheduled to a :doc:`CU ` due to lack - of available :doc:`LDS `. - unit: Percent - Insufficient CU Barriers: - rst: The percent of :ref:`total CU cycles ` in the kernel where - a workgroup could not be scheduled to a :doc:`CU ` due to lack - of available :ref:`barriers `. - unit: Percent - Reached CU Workgroup Limit: - rst: The percent of :ref:`total CU cycles ` in the kernel where - a workgroup could not be scheduled to a :doc:`CU ` due to limits - within the workgroup manager. This is expected to be always be zero on CDNA2 - or newer accelerators (and small for previous accelerators). - unit: Percent - Reached CU Wavefront Limit: - rst: The percent of :ref:`total CU cycles ` in the kernel where - a wavefront could not be scheduled to a :doc:`CU ` due to limits - within the workgroup manager. This is expected to be always be zero on CDNA2 - or newer accelerators (and small for previous accelerators). 
+ VGPR Writes: + rst: The average number of cycles spent initializing :ref:`VGPRs ` at + wave creation. + unit: Cycles/wave + Shader Engine Utilization: + rst: The percent of :ref:`total shader engine cycles ` in the kernel + where any CU in a shader-engine was actively doing any work, normalized over + all shader-engines. Low values (e.g., << 100%) indicate that the accelerator + was not fully saturated by the kernel, or a potential load-imbalance issue. unit: Percent Workgroup Manager - Resource Allocation: - Accelerator Utilization: - rst: The percent of cycles in the kernel where the accelerator was actively doing - any work. + Insufficient CU Barriers: + rst: The percent of :ref:`total CU cycles ` in the kernel where + a workgroup could not be scheduled to a :doc:`CU ` due to lack + of available :ref:`barriers `. unit: Percent - Scheduler-Pipe Utilization: - rst: 'The percent of :ref:`total scheduler-pipe cycles ` in the - kernel where the scheduler-pipes were actively doing any work. Note: this value - is expected to range between 0% and 25%. See :ref:`desc-spi`.' - unit: Percent - Workgroup Manager Utilization: - rst: The percent of cycles in the kernel where the workgroup manager was actively - doing any work. - unit: Percent - Shader Engine Utilization: - rst: The percent of :ref:`total shader engine cycles ` in the kernel - where any CU in a shader-engine was actively doing any work, normalized over - all shader-engines. Low values (e.g., << 100%) indicate that the accelerator - was not fully saturated by the kernel, or a potential load-imbalance issue. - unit: Percent - SIMD Utilization: + Insufficient SIMD SGPRs: rst: The percent of :ref:`total SIMD cycles ` in the kernel where - any :ref:`SIMD ` on a CU was actively doing any work, summed over - all CUs. Low values (less than 100%) indicate that the accelerator was not - fully saturated by the kernel, or a potential load-imbalance issue. 
+ a workgroup could not be scheduled to a :ref:`SIMD ` due to lack + of available :ref:`SGPRs `. unit: Percent - Dispatched Workgroups: - rst: The total number of workgroups forming this kernel launch. - unit: Workgroups - Dispatched Wavefronts: - rst: The total number of wavefronts, summed over all workgroups, forming this - kernel launch. - unit: Wavefronts - VGPR Writes: - rst: The average number of cycles spent initializing :ref:`VGPRs ` at - wave creation. - unit: Cycles/wave - SGPR Writes: - rst: The average number of cycles spent initializing :ref:`SGPRs ` at - wave creation. - unit: Cycles/wave - Not-scheduled Rate (Workgroup Manager): - rst: 'The percent of :ref:`total scheduler-pipe cycles ` in the - kernel where a workgroup could not be scheduled to a :doc:`CU ` - due to a bottleneck within the workgroup manager rather than a lack of a CU - or :ref:`SIMD ` with sufficient resources. Note: this value is expected - to range between 0-25%. See note in :ref:`workgroup manager ` description.' + Insufficient SIMD Waveslots: + rst: The percent of :ref:`total SIMD cycles ` in the kernel where + a workgroup could not be scheduled to a :ref:`SIMD ` due to lack + of available :ref:`waveslots `. + unit: Percent + Reached CU Workgroup Limit: + rst: The percent of :ref:`total CU cycles ` in the kernel where + a workgroup could not be scheduled to a :doc:`CU ` due to limits + within the workgroup manager. This is expected to be always be zero on CDNA2 + or newer accelerators (and small for previous accelerators). + unit: Percent + Scratch Stall Rate: + rst: The percent of :ref:`total shader-engine cycles ` in the kernel + where a workgroup could not be scheduled to a :doc:`CU ` due + to lack of :ref:`private (a.k.a., scratch) memory ` slots. 
While + this can reach up to 100%, note that the actual occupancy limitations on a kernel + using private memory are typically quite small (for example, less than 1% of + the total number of waves that can be scheduled to an accelerator). unit: Percent Not-scheduled Rate (Scheduler-Pipe): rst: 'The percent of :ref:`total scheduler-pipe cycles ` in the @@ -4596,44 +1274,22 @@ Workgroup Manager - Resource Allocation: with sufficient resources). Note: this value is expected to range between 0-25%, see note in :ref:`workgroup manager ` description.' unit: Percent - Scratch Stall Rate: - rst: The percent of :ref:`total shader-engine cycles ` in the kernel - where a workgroup could not be scheduled to a :doc:`CU ` due - to lack of :ref:`private (a.k.a., scratch) memory ` slots. While - this can reach up to 100%, note that the actual occupancy limitations on a kernel - using private memory are typically quite small (for example, less than 1% of - the total number of waves that can be scheduled to an accelerator). - unit: Percent - Insufficient SIMD Waveslots: - rst: The percent of :ref:`total SIMD cycles ` in the kernel where - a workgroup could not be scheduled to a :ref:`SIMD ` due to lack - of available :ref:`waveslots `. - unit: Percent Insufficient SIMD VGPRs: rst: The percent of :ref:`total SIMD cycles ` in the kernel where a workgroup could not be scheduled to a :ref:`SIMD ` due to lack of available :ref:`VGPRs `. unit: Percent - Insufficient SIMD SGPRs: - rst: The percent of :ref:`total SIMD cycles ` in the kernel where - a workgroup could not be scheduled to a :ref:`SIMD ` due to lack - of available :ref:`SGPRs `. - unit: Percent Insufficient CU LDS: rst: The percent of :ref:`total CU cycles ` in the kernel where a workgroup could not be scheduled to a :doc:`CU ` due to lack of available :doc:`LDS `. 
unit: Percent - Insufficient CU Barriers: - rst: The percent of :ref:`total CU cycles ` in the kernel where - a workgroup could not be scheduled to a :doc:`CU ` due to lack - of available :ref:`barriers `. - unit: Percent - Reached CU Workgroup Limit: - rst: The percent of :ref:`total CU cycles ` in the kernel where - a workgroup could not be scheduled to a :doc:`CU ` due to limits - within the workgroup manager. This is expected to be always be zero on CDNA2 - or newer accelerators (and small for previous accelerators). + Not-scheduled Rate (Workgroup Manager): + rst: 'The percent of :ref:`total scheduler-pipe cycles ` in the + kernel where a workgroup could not be scheduled to a :doc:`CU ` + due to a bottleneck within the workgroup manager rather than a lack of a CU + or :ref:`SIMD ` with sufficient resources. Note: this value is expected + to range between 0-25%. See note in :ref:`workgroup manager ` description.' unit: Percent Reached CU Wavefront Limit: rst: The percent of :ref:`total CU cycles ` in the kernel where @@ -4642,10 +1298,6 @@ Workgroup Manager - Resource Allocation: or newer accelerators (and small for previous accelerators). unit: Percent Command processor fetcher (CPF): - CPF Utilization: - rst: Percent of total cycles where the CPF was busy actively doing any work. The - ratio of CPF busy cycles over total cycles counted by the CPF. - unit: Percent CPF Stall: rst: Percent of CPF busy cycles where the CPF was stalled for any reason. unit: Percent @@ -4661,50 +1313,28 @@ Command processor fetcher (CPF): CPF-UTCL1 Stall: rst: Percent of CPF busy cycles where the CPF was stalled by address translation. unit: Percent - CPC Utilization: - rst: Percent of total cycles where the CPC was busy actively doing any work. The - ratio of CPC busy cycles over total cycles counted by the CPC. - unit: Percent - CPC Stall Rate: - rst: Percent of CPC busy cycles where the CPC was stalled for any reason. 
- unit: Percent - CPC Packet Decoding Utilization: - rst: Percent of CPC busy cycles spent decoding commands for processing. - unit: Percent - CPC-Workgroup Manager Utilization: - rst: Percent of CPC busy cycles spent dispatching workgroups to the :ref:`workgroup - manager `. - unit: Percent - CPC-L2 Utilization: - rst: Percent of total cycles counted by the CPC-:doc:`L2 ` interface where - the CPC-L2 interface was active doing any work. - unit: Percent - CPC-UTCL1 Stall: - rst: Percent of CPC busy cycles where the CPC was stalled by address translation - unit: Percent - CPC-UTCL2 Utilization: - rst: Percent of total cycles counted by the CPC's :doc:`L2 ` address translation - interface where the CPC was busy doing address translation work. + CPF Utilization: + rst: Percent of total cycles where the CPF was busy actively doing any work. The + ratio of CPF busy cycles over total cycles counted by the CPF. unit: Percent Command processor packet processor (CPC): - CPF Utilization: - rst: Percent of total cycles where the CPF was busy actively doing any work. The - ratio of CPF busy cycles over total cycles counted by the CPF. + CPC-Workgroup Manager Utilization: + rst: Percent of CPC busy cycles spent dispatching workgroups to the :ref:`workgroup + manager `. unit: Percent - CPF Stall: - rst: Percent of CPF busy cycles where the CPF was stalled for any reason. + CPC-UTCL2 Utilization: + rst: Percent of total cycles counted by the CPC's :doc:`L2 ` address translation + interface where the CPC was busy doing address translation work. unit: Percent - CPF-L2 Utilization: - rst: Percent of total cycles counted by the CPF-:doc:`L2 ` interface where - the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles - over total cycles counted by the CPF-L2. + CPC Packet Decoding Utilization: + rst: Percent of CPC busy cycles spent decoding commands for processing. 
unit: Percent - CPF-L2 Stall: - rst: Percent of CPF-:doc:`L2 ` L2 busy cycles where the CPF-L2 interface - was stalled for any reason. + CPC-UTCL1 Stall: + rst: Percent of CPC busy cycles where the CPC was stalled by address translation unit: Percent - CPF-UTCL1 Stall: - rst: Percent of CPF busy cycles where the CPF was stalled by address translation. + CPC-L2 Utilization: + rst: Percent of total cycles counted by the CPC-:doc:`L2 ` interface where + the CPC-L2 interface was active doing any work. unit: Percent CPC Utilization: rst: Percent of total cycles where the CPC was busy actively doing any work. The @@ -4713,97 +1343,49 @@ Command processor packet processor (CPC): CPC Stall Rate: rst: Percent of CPC busy cycles where the CPC was stalled for any reason. unit: Percent - CPC Packet Decoding Utilization: - rst: Percent of CPC busy cycles spent decoding commands for processing. - unit: Percent - CPC-Workgroup Manager Utilization: - rst: Percent of CPC busy cycles spent dispatching workgroups to the :ref:`workgroup - manager `. - unit: Percent - CPC-L2 Utilization: - rst: Percent of total cycles counted by the CPC-:doc:`L2 ` interface where - the CPC-L2 interface was active doing any work. - unit: Percent - CPC-UTCL1 Stall: - rst: Percent of CPC busy cycles where the CPC was stalled by address translation - unit: Percent - CPC-UTCL2 Utilization: - rst: Percent of total cycles counted by the CPC's :doc:`L2 ` address translation - interface where the CPC was busy doing address translation work. - unit: Percent System Speed-of-Light: - VALU FLOPs: - rst: 'The total floating-point operations executed per second on the :ref:`VALU - `. This is also presented as a percent of the peak theoretical FLOPs - achievable on the specific accelerator. Note: this does not include any floating-point - operations from :ref:`MFMA ` instructions.' - unit: GFLOPs - VALU IOPs: - rst: 'The total integer operations executed per second on the :ref:`VALU `. 
- This is also presented as a percent of the peak theoretical IOPs achievable - on the specific accelerator. Note: this does not include any integer operations - from :ref:`MFMA ` instructions.' - unit: GOIPs - MFMA FLOPs (F8): - rst: 'The total number of 8-bit brain floating point :ref:`MFMA ` operations - executed per second. Note: this does not include any 16-bit brain floating point - operations from :ref:`VALU ` instructions. This is also presented - as a percent of the peak theoretical F8 MFMA operations achievable on the specific - accelerator. It is supported on AMD Instinct MI300 series and later only.' - unit: GFLOPs - MFMA FLOPs (BF16): - rst: 'The total number of 16-bit brain floating point :ref:`MFMA ` - operations executed per second. Note: this does not include any 16-bit brain - floating point operations from :ref:`VALU ` instructions. This is - also presented as a percent of the peak theoretical BF16 MFMA operations achievable - on the specific accelerator.' - unit: GFLOPs - MFMA FLOPs (F16): - rst: 'The total number of 16-bit floating point :ref:`MFMA ` operations - executed per second. Note: this does not include any 16-bit floating point operations - from :ref:`VALU ` instructions. This is also presented as a percent - of the peak theoretical F16 MFMA operations achievable on the specific accelerator.' - unit: GFLOPs - MFMA FLOPs (F32): - rst: 'The total number of 32-bit floating point :ref:`MFMA ` operations - executed per second. Note: this does not include any 32-bit floating point operations - from :ref:`VALU ` instructions. This is also presented as a percent - of the peak theoretical F32 MFMA operations achievable on the specific accelerator.' - unit: GFLOPs - MFMA FLOPs (F64): - rst: 'The total number of 64-bit floating point :ref:`MFMA ` operations - executed per second. Note: this does not include any 64-bit floating point operations - from :ref:`VALU ` instructions. 
This is also presented as a percent - of the peak theoretical F64 MFMA operations achievable on the specific accelerator.' - unit: GFLOPs - MFMA IOPs (Int8): - rst: 'The total number of 8-bit integer :ref:`MFMA ` operations executed - per second. Note: this does not include any 8-bit integer operations from :ref:`VALU - ` instructions. This is also presented as a percent of the peak theoretical - INT8 MFMA operations achievable on the specific accelerator.' - unit: GIOPs - Active CUs: - rst: Total number of active compute units (CUs) on the accelerator during the - kernel execution. - unit: Number - SALU Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`SALU ` - was busy executing instructions. Computed as the ratio of the total number of - cycles spent by the :ref:`scheduler ` issuing SALU / :ref:`SMEM - ` instructions over the :ref:`total CU cycles `. - unit: Percent + sL1D Cache BW: + rst: The number of bytes looked up in the sL1D cache per unit time. This is also + presented as a percent of the peak theoretical bandwidth achievable on the + specific accelerator. + unit: GB/s VALU Utilization: rst: Indicates what percent of the kernel's duration the :ref:`VALU ` was busy executing instructions. Does not include :ref:`VMEM ` operations. Computed as the ratio of the total number of cycles spent by the :ref:`scheduler ` issuing VALU instructions over the :ref:`total CU cycles `. unit: Percent - MFMA Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`MFMA ` - unit was busy executing instructions. Computed as the ratio of the total number - of cycles spent by the :ref:`MFMA ` was busy over the :ref:`total - CU cycles `. + LDS Bank Conflicts/Access: + rst: The ratio of the number of cycles spent in the :doc:`LDS scheduler ` + due to bank conflicts (as determined by the conflict resolution hardware) to + the base number of cycles that would be spent in the LDS scheduler in a completely uncontended + case. 
This is also presented in normalized form (i.e., the Bank Conflict Rate). + unit: Conflicts/Access + VALU IOPs: + rst: 'The total integer operations executed per second on the :ref:`VALU `. + This is also presented as a percent of the peak theoretical IOPs achievable + on the specific accelerator. Note: this does not include any integer operations + from :ref:`MFMA ` instructions.' + unit: GOIPs + L2-Fabric Write Latency: + rst: The time-averaged number of cycles write requests spent in Infinity Fabric + before a completion acknowledgement was returned to the L2. + unit: Cycles + IPC: + rst: The ratio of the total number of instructions executed on the :doc:`CU ` + over the :ref:`total active CU cycles `. + unit: Instructions per-cycle + SALU Utilization: + rst: Indicates what percent of the kernel's duration the :ref:`SALU ` + was busy executing instructions. Computed as the ratio of the total number of + cycles spent by the :ref:`scheduler ` issuing SALU / :ref:`SMEM + ` instructions over the :ref:`total CU cycles `. unit: Percent + L1I Hit Rate: + rst: The percent of L1I requests that hit on a previously loaded line the cache. + Calculated as the ratio of the number of L1I requests that hit over the number + of all L1I requests. + unit: GB/s VMEM Utilization: rst: Indicates what percent of the kernel's duration the :ref:`VMEM ` unit was busy executing instructions, including both global/generic and spill/scratch @@ -4812,44 +1394,127 @@ System Speed-of-Light: as the ratio of the total number of cycles spent by the :ref:`scheduler ` issuing VMEM instructions over the :ref:`total CU cycles `. unit: Percent - Branch Utilization: - rst: Indicates what percent of the kernel's duration the :ref:`branch ` - unit was busy executing instructions. Computed as the ratio of the total number - of cycles spent by the :ref:`scheduler ` issuing branch instructions - over the :ref:`total CU cycles `. 
- unit: Percent - VALU Active Threads: - rst: Indicates the average level of :ref:`divergence ` within - a wavefront over the lifetime of the kernel. The number of work-items that were - active in a wavefront during execution of each :ref:`VALU ` instruction, - time-averaged over all VALU instructions run on all wavefronts in the kernel. - unit: Work-items - IPC: - rst: The ratio of the total number of instructions executed on the :doc:`CU ` - over the :ref:`total active CU cycles `. - unit: Instructions per-cycle + MFMA FLOPs (F64): + rst: 'The total number of 64-bit floating point :ref:`MFMA ` operations + executed per second. Note: this does not include any 64-bit floating point operations + from :ref:`VALU ` instructions. This is also presented as a percent + of the peak theoretical F64 MFMA operations achievable on the specific accelerator.' + unit: GFLOPs Wavefront Occupancy: rst: 'The time-averaged number of wavefronts resident on the accelerator over the lifetime of the kernel. Note: this metric may be inaccurate for short-running kernels (less than 1ms). This is also presented as a percent of the peak theoretical occupancy achievable on the specific accelerator.' unit: Wavefronts + MFMA FLOPs (BF16): + rst: 'The total number of 16-bit brain floating point :ref:`MFMA ` + operations executed per second. Note: this does not include any 16-bit brain + floating point operations from :ref:`VALU ` instructions. This is + also presented as a percent of the peak theoretical BF16 MFMA operations achievable + on the specific accelerator.' + unit: GFLOPs + Branch Utilization: + rst: Indicates what percent of the kernel's duration the :ref:`branch ` + unit was busy executing instructions. Computed as the ratio of the total number + of cycles spent by the :ref:`scheduler ` issuing branch instructions + over the :ref:`total CU cycles `. 
+ unit: Percent Theoretical LDS Bandwidth: rst: Indicates the maximum amount of bytes that could have been loaded from, stored to, or atomically updated in the LDS per unit time (see :ref:`LDS Bandwidth ` example for more detail). This is also presented as a percent of the peak theoretical F64 MFMA operations achievable on the specific accelerator. unit: GB/s - LDS Bank Conflicts/Access: - rst: The ratio of the number of cycles spent in the :doc:`LDS scheduler ` - due to bank conflicts (as determined by the conflict resolution hardware) to - the base number of cycles that would be spent in the LDS scheduler in a completely uncontended - case. This is also presented in normalized form (i.e., the Bank Conflict Rate). - unit: Conflicts/Access + L2-Fabric Read Latency: + rst: The time-averaged number of cycles read requests spent in Infinity Fabric before + data was returned to the L2. + unit: Cycles + MFMA Utilization: + rst: Indicates what percent of the kernel's duration the :ref:`MFMA ` + unit was busy executing instructions. Computed as the ratio of the total number + of cycles spent by the :ref:`MFMA ` was busy over the :ref:`total + CU cycles `. + unit: Percent + MFMA IOPs (Int8): + rst: 'The total number of 8-bit integer :ref:`MFMA ` operations executed + per second. Note: this does not include any 8-bit integer operations from :ref:`VALU + ` instructions. This is also presented as a percent of the peak theoretical + INT8 MFMA operations achievable on the specific accelerator.' + unit: GIOPs + VALU FLOPs: + rst: 'The total floating-point operations executed per second on the :ref:`VALU + `. This is also presented as a percent of the peak theoretical FLOPs + achievable on the specific accelerator. Note: this does not include any floating-point + operations from :ref:`MFMA ` instructions.' + unit: GFLOPs + L2 Cache BW: + rst: The number of bytes looked up in the L2 cache per unit time. 
The number of + bytes is calculated as the number of cache lines requested multiplied by the + cache line size. This value does not consider partial requests, so e.g., if + only a single value is requested in a cache line, the data movement will still + be counted as a full cache line. This is also presented as a percent of the + peak theoretical bandwidth achievable on the specific accelerator. + unit: GB/s + VALU Active Threads: + rst: Indicates the average level of :ref:`divergence ` within + a wavefront over the lifetime of the kernel. The number of work-items that were + active in a wavefront during execution of each :ref:`VALU ` instruction, + time-averaged over all VALU instructions run on all wavefronts in the kernel. + unit: Work-items + MFMA FLOPs (F16): + rst: 'The total number of 16-bit floating point :ref:`MFMA ` operations + executed per second. Note: this does not include any 16-bit floating point operations + from :ref:`VALU ` instructions. This is also presented as a percent + of the peak theoretical F16 MFMA operations achievable on the specific accelerator.' + unit: GFLOPs + L1I BW: + rst: The number of bytes looked up in the L1I cache per unit time. This is also + presented as a percent of the peak theoretical bandwidth achievable on the + specific accelerator. + unit: Percent + MFMA FLOPs (F8): + rst: 'The total number of 8-bit brain floating point :ref:`MFMA ` operations + executed per second. Note: this does not include any 16-bit brain floating point + operations from :ref:`VALU ` instructions. This is also presented + as a percent of the peak theoretical F8 MFMA operations achievable on the specific + accelerator. It is supported on AMD Instinct MI300 series and later only.' + unit: GFLOPs + L2-Fabric Write BW: + rst: The number of bytes sent by the L2 over the :ref:`Infinity Fabric interface + ` by write and atomic operations per unit time. 
This is also presented + as a percent of the peak theoretical bandwidth achievable on the specific accelerator. + unit: GB/s + MFMA FLOPs (F32): + rst: 'The total number of 32-bit floating point :ref:`MFMA ` operations + executed per second. Note: this does not include any 32-bit floating point operations + from :ref:`VALU ` instructions. This is also presented as a percent + of the peak theoretical F32 MFMA operations achievable on the specific accelerator.' + unit: GFLOPs + sL1D Cache Hit Rate: + rst: The percent of sL1D requests that hit on a previously loaded line the cache. + Calculated as the ratio of the number of sL1D requests that hit over the number + of all sL1D requests. + unit: Percent + Active CUs: + rst: Total number of active compute units (CUs) on the accelerator during the + kernel execution. + unit: Number + L2 Cache Hit Rate: + rst: The ratio of the number of L2 cache line requests that hit in the L2 cache + over the total number of incoming cache line requests to the L2 cache. + unit: Percent + L2-Fabric Read BW: + rst: "The number of bytes read by the L2 over the :ref:`Infinity Fabric\u2122\ + \ interface ` per unit time. This is also presented as a percent\ + \ of the peak theoretical bandwidth achievable on the specific accelerator." + unit: GB/s vL1D Cache Hit Rate: rst: The ratio of the number of vL1D cache line requests that hit in vL1D cache over the total number of cache line requests to the :ref:`vL1D cache RAM `. unit: Percent + L1I Fetch Latency: + rst: The average number of cycles spent to fetch instructions to a :doc:`CU `. + unit: Cycles vL1D Cache BW: rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM ` instructions per unit time. The number of bytes is calculated @@ -4859,56 +1524,3 @@ System Speed-of-Light: cache line. This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator. 
unit: GB/s - L2 Cache Hit Rate: - rst: The ratio of the number of L2 cache line requests that hit in the L2 cache - over the total number of incoming cache line requests to the L2 cache. - unit: Percent - L2 Cache BW: - rst: The number of bytes looked up in the L2 cache per unit time. The number of - bytes is calculated as the number of cache lines requested multiplied by the - cache line size. This value does not consider partial requests, so e.g., if - only a single value is requested in a cache line, the data movement will still - be counted as a full cache line. This is also presented as a percent of the - peak theoretical bandwidth achievable on the specific accelerator. - unit: GB/s - L2-Fabric Read BW: - rst: "The number of bytes read by the L2 over the :ref:`Infinity Fabric\u2122\ - \ interface ` per unit time. This is also presented as a percent\ - \ of the peak theoretical bandwidth achievable on the specific accelerator." - unit: GB/s - L2-Fabric Write BW: - rst: The number of bytes sent by the L2 over the :ref:`Infinity Fabric interface - ` by write and atomic operations per unit time. This is also presented - as a percent of the peak theoretical bandwidth achievable on the specific accelerator. - unit: GB/s - L2-Fabric Read Latency: - rst: The time-averaged number of cycles read requests spent in Infinity Fabric before - data was returned to the L2. - unit: Cycles - L2-Fabric Write Latency: - rst: The time-averaged number of cycles write requests spent in Infinity Fabric - before a completion acknowledgement was returned to the L2. - unit: Cycles - sL1D Cache Hit Rate: - rst: The percent of sL1D requests that hit on a previously loaded line the cache. - Calculated as the ratio of the number of sL1D requests that hit over the number - of all sL1D requests. - unit: Percent - sL1D Cache BW: - rst: The number of bytes looked up in the sL1D cache per unit time. 
This is also - presented as a percent of the peak theoretical bandwidth achievable on the - specific accelerator. - unit: GB/s - L1I Hit Rate: - rst: The percent of L1I requests that hit on a previously loaded line the cache. - Calculated as the ratio of the number of L1I requests that hit over the number - of all L1I requests. - unit: GB/s - L1I BW: - rst: The number of bytes looked up in the L1I cache per unit time. This is also - presented as a percent of the peak theoretical bandwidth achievable on the - specific accelerator. - unit: Percent - L1I Fetch Latency: - rst: The average number of cycles spent to fetch instructions to a :doc:`CU `. - unit: Cycles diff --git a/projects/rocprofiler-compute/utils/autogen_hash.yaml b/projects/rocprofiler-compute/utils/autogen_hash.yaml index ec28448cca..756d690a24 100644 --- a/projects/rocprofiler-compute/utils/autogen_hash.yaml +++ b/projects/rocprofiler-compute/utils/autogen_hash.yaml @@ -107,4 +107,4 @@ src/rocprof_compute_soc/analysis_configs/gfx940/2100_pc_sampling.yaml: 4f3af5504 src/rocprof_compute_soc/analysis_configs/gfx941/2100_pc_sampling.yaml: 4f3af55040c40bee5f1fd88d83e2324d06e5dc462c0adc3e6d5b19b3f31af5e7 src/rocprof_compute_soc/analysis_configs/gfx942/2100_pc_sampling.yaml: 4f3af55040c40bee5f1fd88d83e2324d06e5dc462c0adc3e6d5b19b3f31af5e7 src/rocprof_compute_soc/analysis_configs/gfx950/2100_pc_sampling.yaml: 4f3af55040c40bee5f1fd88d83e2324d06e5dc462c0adc3e6d5b19b3f31af5e7 -docs/data/metrics_description.yaml: 69bd9c4121e13bdda6af2dead3129a46569f37fd1c59b20f45c85593824522d2 +docs/data/metrics_description.yaml: 7a79754edf27080a1701e959904c7db80c661dc552f3cdf94b0b2d332a2b2c45 diff --git a/projects/rocprofiler-compute/utils/split_config.py b/projects/rocprofiler-compute/utils/split_config.py index bc0e6db392..7e0c2b6f67 100644 --- a/projects/rocprofiler-compute/utils/split_config.py +++ b/projects/rocprofiler-compute/utils/split_config.py @@ -126,10 +126,21 @@ def update_documentation(): for data_source in 
panel_config["data source"]:
             if "metric_table" in data_source:
                 metrics_info = {}
-                for key in panel_config["metrics_description"]:
-                    metrics_info[key] = {
-                        "rst": panel_config["metrics_description"][key]["rst"],
-                        "unit": panel_config["metrics_description"][key]["unit"],
+                # Metric names from data source
+                metric_names = {
+                    metric
+                    for _, gfx_data in data_source["metric_table"]["metric"].items()
+                    for metric in gfx_data
+                }
+                # Select metrics with descriptions available
+                metric_names = metric_names.intersection(
+                    panel_config["metrics_description"].keys()
+                )
+                # Add metrics info
+                for metric_name in metric_names:
+                    metrics_info[metric_name] = {
+                        "rst": panel_config["metrics_description"][metric_name]["rst"],
+                        "unit": panel_config["metrics_description"][metric_name]["unit"],
                     }
                 panel_metric_map[data_source["metric_table"]["id"]] = metrics_info
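
The `split_config.py` hunk above stops copying every entry of `metrics_description` into each metric table. It now keeps only the metrics that the table's data source lists and that also have a description. A minimal standalone sketch of that selection follows. The function name `select_metrics_info` and the sample dictionaries are hypothetical, and the dictionary layout is inferred from the hunk rather than taken from `utils/unified_config.yaml`. The `sorted()` call is an addition that is not in the diff. It is there because iterating a Python set has no guaranteed order, which may be why entries in the regenerated YAML above appear reordered.

```python
def select_metrics_info(panel_config, data_source):
    """Return {metric name: {"rst", "unit"}} for metrics listed by the data source.

    Hypothetical helper mirroring the selection logic in the hunk above.
    """
    # Union of metric names across every architecture entry of the metric table.
    metric_names = {
        metric
        for gfx_data in data_source["metric_table"]["metric"].values()
        for metric in gfx_data
    }
    # Keep only the names that also have a description available.
    descriptions = panel_config["metrics_description"]
    metric_names = metric_names.intersection(descriptions.keys())
    # Assumption: sort for a deterministic order (the diff iterates the set directly).
    return {
        name: {"rst": descriptions[name]["rst"], "unit": descriptions[name]["unit"]}
        for name in sorted(metric_names)
    }


# Hypothetical sample data; architecture keys and the table id are placeholders.
sample_panel = {
    "metrics_description": {
        "CPF Utilization": {"rst": "CPF busy cycles over total cycles.", "unit": "Percent"},
        "CPF Stall": {"rst": "CPF stalled cycles over busy cycles.", "unit": "Percent"},
        "CPC Utilization": {"rst": "CPC busy cycles over total cycles.", "unit": "Percent"},
    }
}
sample_source = {
    "metric_table": {
        "id": 501,
        "metric": {
            "gfx90a": {"CPF Utilization": {}, "CPF Stall": {}},
            "gfx942": {"CPF Utilization": {}, "Undocumented Metric": {}},
        },
    }
}

if __name__ == "__main__":
    for name, info in select_metrics_info(sample_panel, sample_source).items():
        print(name, info["unit"])
```

With this sample, `CPC Utilization` is dropped because the data source does not list it, and `Undocumented Metric` is dropped because it has no description. This matches the intent visible in the YAML hunks, where CPC metrics no longer appear under the CPF panel and vice versa.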