From 89c74ac3d3aa6516aa8be2eadf4dc850b227472b Mon Sep 17 00:00:00 2001
From: "systems-assistant[bot]" <221163467+systems-assistant[bot]@users.noreply.github.com>
Date: Wed, 6 Aug 2025 18:39:50 -0400
Subject: [PATCH] Update `Unit` of Bandwidth metrics to Gbps (#96)

* Add Utilization to metric name for Bandwidth related metrics whose Unit is Percent
* Update Unit of Bandwidth metrics to Gbps
* Update metric Formula to use total duration as denominator instead of normalization unit.
* Update metric Description
* Update metric Unit
* Update CHANGELOG
---
 projects/rocprofiler-compute/CHANGELOG.md |  20 +
 .../docs/data/metrics_description.yaml    | 168 ++--
 .../gfx908/1200_local_data_share_lds.yaml |  16 +-
 .../gfx908/1300_instruction_cache.yaml    |  27 +-
 .../gfx908/1400_scalar_l1_data_cache.yaml |  23 +-
 .../gfx908/1600_vector_l1_data_cache.yaml |  42 +-
 .../gfx908/1700_l2_cache.yaml             |  60 +-
 .../gfx90a/1200_local_data_share_lds.yaml |  16 +-
 .../gfx90a/1300_instruction_cache.yaml    |  27 +-
 .../gfx90a/1400_scalar_l1_data_cache.yaml |  23 +-
 .../gfx90a/1600_vector_l1_data_cache.yaml |  42 +-
 .../gfx90a/1700_l2_cache.yaml             |  60 +-
 .../gfx940/1200_local_data_share_lds.yaml |  16 +-
 .../gfx940/1300_instruction_cache.yaml    |  27 +-
 .../gfx940/1400_scalar_l1_data_cache.yaml |  23 +-
 .../gfx940/1600_vector_l1_data_cache.yaml |  42 +-
 .../gfx940/1700_l2_cache.yaml             |  60 +-
 .../gfx941/1200_local_data_share_lds.yaml |  16 +-
 .../gfx941/1300_instruction_cache.yaml    |  27 +-
 .../gfx941/1400_scalar_l1_data_cache.yaml |  23 +-
 .../gfx941/1600_vector_l1_data_cache.yaml |  42 +-
 .../gfx941/1700_l2_cache.yaml             |  60 +-
 .../gfx942/1200_local_data_share_lds.yaml |  16 +-
 .../gfx942/1300_instruction_cache.yaml    |  27 +-
 .../gfx942/1400_scalar_l1_data_cache.yaml |  23 +-
 .../gfx942/1600_vector_l1_data_cache.yaml |  42 +-
 .../gfx942/1700_l2_cache.yaml             |  63 +-
 .../gfx950/1200_local_data_share_lds.yaml |  16 +-
 .../gfx950/1300_instruction_cache.yaml    |  27 +-
 .../gfx950/1400_scalar_l1_data_cache.yaml |  23 +-
 .../gfx950/1600_vector_l1_data_cache.yaml |  42 +-
 .../gfx950/1700_l2_cache.yaml             | 156 ++--
 .../utils/autogen_hash.yaml               |  62 +-
 .../utils/unified_config.yaml             | 719 +++++++++---------
 34 files changed, 1088 insertions(+), 988 deletions(-)

diff --git a/projects/rocprofiler-compute/CHANGELOG.md b/projects/rocprofiler-compute/CHANGELOG.md
index 9f33653aa6..9c2a3075fe 100644
--- a/projects/rocprofiler-compute/CHANGELOG.md
+++ b/projects/rocprofiler-compute/CHANGELOG.md
@@ -27,6 +27,26 @@ Full documentation for ROCm Compute Profiler is available at [https://rocm.docs.
 
 * Change the basic view of TUI from aggregated analysis data to individual kernel analysis data
 
+* Update `Unit` of the following `Bandwidth` related metrics to `Gbps` instead of `Bytes per Normalization Unit`
+  * Theoretical Bandwidth (section 1202)
+  * L1I-L2 Bandwidth (section 1303)
+  * sL1D-L2 BW (section 1403)
+  * Cache BW (section 1603)
+  * L1-L2 BW (section 1603)
+  * Read BW (section 1702)
+  * Write and Atomic BW (section 1702)
+  * Bandwidth (section 1703)
+  * Atomic/Read/Write Bandwidth (section 1703)
+  * Atomic/Read/Write Bandwidth - (HBM/PCIe/Infinity Fabric) (section 1706)
+
+* Add `Utilization` to metric name for the following `Bandwidth` related metrics whose `Unit` is `Percent`
+  * Theoretical Bandwidth Utilization (section 1201)
+  * L1I-L2 Bandwidth Utilization (section 1301)
+  * Bandwidth Utilization (section 1301)
+  * Bandwidth Utilization (section 1401)
+  * sL1D-L2 BW Utilization (section 1401)
+  * Bandwidth Utilization (section 1601)
+
 ### Resolved issues
 
 * Fixed not detecting memory clock issue when using amd-smi
diff --git a/projects/rocprofiler-compute/docs/data/metrics_description.yaml b/projects/rocprofiler-compute/docs/data/metrics_description.yaml
index 512518ab65..12eb28816a 100644
--- a/projects/rocprofiler-compute/docs/data/metrics_description.yaml
+++ b/projects/rocprofiler-compute/docs/data/metrics_description.yaml
@@ -397,13 +397,13 @@ LDS Speed-of-Light:
       over the number of
LDS cycles that would have been required to move the same amount of data in an uncontended access. [#lds-bank-conflict]_ unit: Percent - Theoretical Bandwidth: + Theoretical Bandwidth Utilization: rst: Indicates the maximum amount of bytes that could have been loaded from, stored - to, or atomically updated in the LDS per :ref:`normalization unit `. + to, or atomically updated in the LDS divided as percentage of theoretical peak. Does *not* take into account the execution mask of the wavefront when the instruction was executed. See the :ref:`LDS bandwidth example ` for more detail. - unit: Bytes per normalization unit + unit: Percent Utilization: rst: Indicates what percent of the kernel's duration the :ref:`LDS ` was actively executing instructions (including, but not limited to, load, store, @@ -450,17 +450,16 @@ LDS Statistics: unit: Accesses per normalization unit Theoretical Bandwidth: rst: Indicates the maximum amount of bytes that could have been loaded from, stored - to, or atomically updated in the LDS per :ref:`normalization unit `. - Does *not* take into account the execution mask of the wavefront when the instruction - was executed. See the :ref:`LDS bandwidth example ` for more - detail. - unit: Bytes per normalization unit + to, or atomically updated in the LDS divided by total duration. Does *not* take + into account the execution mask of the wavefront when the instruction was executed. + See the :ref:`LDS bandwidth example ` for more detail. + unit: Gbps Unaligned Stall: rst: The total number of cycles spent in the :ref:`LDS scheduler ` due to stalls from non-dword aligned addresses per :ref:`normalization unit `. unit: Cycles per normalization unit vL1D Speed-of-Light: - Bandwidth: + Bandwidth Utilization: rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM ` instructions, as a percent of the peak theoretical bandwidth achievable on the specific accelerator. 
The number of bytes is calculated as the number @@ -614,13 +613,13 @@ vL1D cache access metrics: rst: The total number of cache line lookups in the vL1D. unit: Cache lines Cache BW: - rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM - ` instructions per :ref:`normalization unit `. The - number of bytes is calculated as the number of cache lines requested multiplied - by the cache line size. This value does not consider partial requests, so - for instance, if only a single value is requested in a cache line, the data movement - will still be counted as a full cache line. - unit: Bytes per normalization unit + rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM + ` instructions divided by total duration. The number of bytes is + calculated as the number of cache lines requested multiplied by the cache line + size. This value does not consider partial requests, so for instance, if only + a single value is requested in a cache line, the data movement will still be + counted as a full cache line. + unit: Gbps Cache Hit Rate: rst: The ratio of the number of vL1D cache line requests that hit in vL1D cache over the total number of cache line requests to the :ref:`vL1D Cache RAM `. @@ -646,12 +645,12 @@ vL1D cache access metrics: unit: Requests per normalization unit L1-L2 BW: rst: The number of bytes transferred across the vL1D-L2 interface as a result of - :ref:`VMEM ` instructions, per :ref:`normalization unit `. - The number of bytes is calculated as the number of cache lines requested multiplied - by the cache line size. This value does not consider partial requests, so for instance, + :ref:`VMEM ` instructions, divided by total duration. The number + of bytes is calculated as the number of cache lines requested multiplied by + the cache line size. 
This value does not consider partial requests, so for instance, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. - unit: Bytes per normalization unit + unit: Gbps L1-L2 Read: rst: The number of read requests for a vL1D cache line that were not satisfied by the vL1D and must be retrieved from the to the :doc:`L2 Cache ` per :ref:`normalization @@ -761,20 +760,20 @@ L2 Speed-of-Light: unit: Percent L2 cache accesses: Atomic Bandwidth: - rst: Total number of bytes looked up in the L2 cache for atomic requests, per - :ref:`normalization unit `. - unit: Bytes per normalization unit + rst: Total number of bytes looked up in the L2 cache for atomic requests, divided + by total duration. + unit: Gbps Atomic Req: rst: The total number of atomic requests (with and without return) to the L2 from all clients. unit: Requests per normalization unit Bandwidth: - rst: The number of bytes looked up in the L2 cache, per :ref:`normalization unit - `. The number of bytes is calculated as the number of - cache lines requested multiplied by the cache line size. This value does not - consider partial requests, so for example, if only a single value is requested - in a cache line, the data movement will still be counted as a full cache line. - unit: Bytes per normalization unit + rst: The number of bytes looked up in the L2 cache, divided by total duration. + The number of bytes is calculated as the number of cache lines requested multiplied + by the cache line size. This value does not consider partial requests, so for + example, if only a single value is requested in a cache line, the data movement will + still be counted as a full cache line. + unit: Gbps CC Req: rst: The total number of requests to the L2 that go to Coherently Cacheable (CC) memory allocations. See the :ref:`memory-type` for more information. @@ -818,9 +817,9 @@ L2 cache accesses: allocations. See the :ref:`memory-type` for more information. 
unit: Requests per normalization unit Read Bandwidth: - rst: Total number of bytes looked up in the L2 cache for read requests, per :ref:`normalization - unit `. - unit: Bytes per normalization unit + rst: Total number of bytes looked up in the L2 cache for read requests, divided + by total duration. + unit: Gbps Read Req: rst: 'The total number of read requests to the L2 from all clients. ' unit: Requests per normalization unit @@ -841,9 +840,9 @@ L2 cache accesses: See the :ref:`memory-type` for more information. unit: Requests per normalization unit Write Bandwidth: - rst: Total number of bytes looked up in the L2 cache for write requests, per :ref:`normalization - unit `. - unit: Bytes per normalization unit + rst: Total number of bytes looked up in the L2 cache for write requests, divided + by total duration. + unit: Gbps Write Req: rst: The total number of write requests to the L2 from all clients. unit: Requests per normalization unit @@ -896,9 +895,9 @@ L2-Fabric interface metrics: memory ` allocations. unit: Percent Read BW: - rst: The total number of bytes read by the L2 cache from Infinity Fabric per :ref:`normalization - unit `. - unit: Bytes per normalization unit + rst: The total number of bytes read by the L2 cache from Infinity Fabric divided + by total duration. + unit: Gbps Read Latency: rst: The time-averaged number of cycles read requests spent in Infinity Fabric before data was returned to the L2. @@ -954,12 +953,12 @@ L2-Fabric interface metrics: unit: Percent Write and Atomic BW: rst: The total number of bytes written by the L2 over Infinity Fabric by write and - atomic operations per :ref:`normalization unit `. Note - that on current CDNA accelerators, such as the :ref:`MI2XX `, requests - are only considered *atomic* by Infinity Fabric if they are targeted at non-write-cacheable - memory, for example, :ref:`fine-grained memory ` allocations or :ref:`uncached + atomic operations divided by total duration. 
Note that on current CDNA accelerators, + such as the :ref:`MI2XX `, requests are only considered *atomic* + by Infinity Fabric if they are targeted at non-write-cacheable memory, for + example, :ref:`fine-grained memory ` allocations or :ref:`uncached memory ` allocations on the MI2XX. - unit: Bytes per normalization unit + unit: Gbps Write and Atomic Latency: rst: The time-averaged number of cycles write requests spent in Infinity Fabric before a completion acknowledgement was returned to the L2. @@ -975,17 +974,17 @@ L2 - Fabric interface detailed metrics: memory ` allocations on the MI2XX. unit: Requests per normalization unit Atomic Bandwidth - HBM: - rst: Total number of bytes due to L2 atomic requests due to HBM traffic, per normalization - unit. - unit: Bytes per normalization unit + rst: Total number of bytes due to L2 atomic requests due to HBM traffic, divided + by total duration. + unit: Gbps "Atomic Bandwidth - Infinity Fabric\u2122": rst: Total number of bytes due to L2 atomic requests due to Infinity Fabric traffic, - per normalization unit. - unit: Bytes per normalization unit + divided by total duration. + unit: Gbps Atomic Bandwidth - PCIe: - rst: Total number of bytes due to L2 atomic requests due to PCIe traffic, per - normalization unit. - unit: Bytes per normalization unit + rst: Total number of bytes due to L2 atomic requests due to PCIe traffic, divided + by total duration. + unit: Gbps HBM Read: rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data from the accelerator's local HBM, per :ref:`normalization unit `. @@ -1013,17 +1012,17 @@ L2 - Fabric interface detailed metrics: uncached data requests. See :ref:`l2-request-flow` for more detail. unit: Requests per normalization unit Read Bandwidth - HBM: - rst: Total number of bytes due to L2 read requests due to HBM traffic, per normalization - unit. 
- unit: Bytes per normalization unit + rst: Total number of bytes due to L2 read requests due to HBM traffic, divided + by total duration. + unit: Gbps "Read Bandwidth - Infinity Fabric\u2122": rst: Total number of bytes due to L2 read requests due to Infinity Fabric traffic, - per normalization unit. - unit: Bytes per normalization unit + divided by total duration. + unit: Gbps Read Bandwidth - PCIe: - rst: Total number of bytes due to L2 read requests due to PCIe traffic, per normalization - unit. - unit: Bytes per normalization unit + rst: Total number of bytes due to L2 read requests due to PCIe traffic, divided + by total duration. + unit: Gbps Remote Read: rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data from any source other than the accelerator's local HBM, per :ref:`normalization @@ -1036,17 +1035,17 @@ L2 - Fabric interface detailed metrics: for more detail. unit: Requests per normalization unit Write Bandwidth - HBM: - rst: Total number of bytes due to L2 write requests due to HBM traffic, per normalization - unit. - unit: Bytes per normalization unit + rst: Total number of bytes due to L2 write requests due to HBM traffic, divided + by total duration. + unit: Gbps "Write Bandwidth - Infinity Fabric\u2122": rst: Total number of bytes due to L2 write requests due to Infinity Fabric traffic, - per normalization unit. - unit: Bytes per normalization unit + divided by total duration. + unit: Gbps Write Bandwidth - PCIe: - rst: Total number of bytes due to L2 write requests due to PCIe traffic, per normalization - unit. - unit: Bytes per normalization unit + rst: Total number of bytes due to L2 write requests due to PCIe traffic, divided + by total duration. + unit: Gbps Write and Atomic (32B): rst: The total number of L2 requests to Infinity Fabric to write or atomically update 32B of data to any memory location, per :ref:`normalization unit `. 
@@ -1098,7 +1097,7 @@ L2 - Fabric Interface stalls: of the :ref:`total active L2 cycles `. unit: Percent Scalar L1D Speed-of-Light: - Bandwidth: + Bandwidth Utilization: rst: The number of bytes looked up in the sL1D cache, as a percent of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the :ref:`total sL1D cycles `. @@ -1108,13 +1107,11 @@ Scalar L1D Speed-of-Light: the cache. The ratio of the number of sL1D requests that hit [#sl1d-cache]_ over the number of all sL1D requests. unit: Percent - sL1D-L2 BW: - rst: "The total number of bytes read from, written to, or atomically updated \ - \ across the sL1D\u2194:doc:`L2 ` interface, per :ref:`normalization\ - \ unit `. Note that sL1D writes and atomics are typically\ - \ unused on current CDNA accelerators, so in the majority of cases this can\ - \ be interpreted as an sL1D\u2192L2 read bandwidth." - unit: Bytes per normalization unit + sL1D-L2 BW Utilization: + rst: The percentage of the peak theoretical sL1D - L2 interface bandwidth achieved.\ + \ Calculated as total number of bytes read from, written to, or atomically updated\ + \ across the sL1D - L2 interface. + unit: Percent Scalar L1D cache accesses: Atomic Req: rst: The total number of atomic requests from sL1D to the :doc:`L2 `, @@ -1189,13 +1186,13 @@ Scalar L1D Cache - L2 Interface: unit: Requests per normalization unit sL1D-L2 BW: rst: "The total number of bytes read from, written to, or atomically updated \ - \ across the sL1D\u2194:doc:`L2 ` interface, per :ref:`normalization\ - \ unit `. Note that sL1D writes and atomics are typically\ - \ unused on current CDNA accelerators, so in the majority of cases this can\ - \ be interpreted as an sL1D\u2192L2 read bandwidth."
- unit: Bytes per normalization unit + \ across the sL1D\u2194:doc:`L2 ` interface, divided by total duration.\ + \ Note that sL1D writes and atomics are typically unused on current CDNA accelerators,\ + \ so in the majority of cases this can be interpreted as an sL1D\u2192L2 read\ + \ bandwidth." + unit: Gbps L1I Speed-of-Light: - Bandwidth: + Bandwidth Utilization: rst: The number of bytes looked up in the L1I cache, as a percent of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over the :ref:`total L1I cycles `. @@ -1205,7 +1202,7 @@ L1I Speed-of-Light: the cache. Calculated as the ratio of the number of L1I requests that hit over the number of all L1I requests. unit: Percent - L1I-L2 Bandwidth: + L1I-L2 Bandwidth Utilization: rst: "The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth\ \ achieved. Calculated as the ratio of the total number of requests from the\ \ L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles `." @@ -1238,10 +1235,9 @@ L1I cache accesses: unit: Requests per normalization unit L1I <-> L2 interface: L1I-L2 Bandwidth: - rst: "The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth\ - \ achieved. Calculated as the ratio of the total number of requests from the\ - \ L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles `." - unit: Percent + rst: Total number of bytes transferred across L1I - L2 interface divided by total + duration. 
+ unit: Gbps Workgroup manager utilizations: Accelerator Utilization: rst: The percent of cycles in the kernel where the accelerator was actively doing diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1200_local_data_share_lds.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1200_local_data_share_lds.yaml index 6cfe19d9de..2718654ad4 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1200_local_data_share_lds.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1200_local_data_share_lds.yaml @@ -11,8 +11,12 @@ Panel Config: instructions, averaged over the lifetime of the kernel. Calculated as the ratio of the total number of cycles spent by the scheduler issuing LDS instructions over the total CU cycles. + Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that + could have been loaded from, stored to, or atomically updated in the LDS divided + as percentage of theoretical peak. Does not take into account the execution + mask of the wavefront when the instruction was executed. Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been - loaded from, stored to, or atomically updated in the LDS per normalization unit. + loaded from, stored to, or atomically updated in the LDS divided by total duration. Does not take into account the execution mask of the wavefront when the instruction was executed. 
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent @@ -58,7 +62,7 @@ Panel Config: Access Rate: value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))) unit: Pct of Peak - Theoretical Bandwidth: + Theoretical Bandwidth Utilization: value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128))) unit: Pct of Peak @@ -86,12 +90,12 @@ Panel Config: unit: (Instr + $normUnit) Theoretical Bandwidth: avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) - unit: (Bytes + $normUnit) + / (End_Timestamp - Start_Timestamp))) + unit: Gbps LDS Latency: avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None)) diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1300_instruction_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1300_instruction_cache.yaml index a53c23691f..aeda9bc6c7 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1300_instruction_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1300_instruction_cache.yaml @@ -3,15 +3,18 @@ Panel Config: id: 1300 title: Instruction Cache metrics_description: - Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the - peak theoretical bandwidth. Calculated as the ratio of L1I requests over the - total L1I cycles. + Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent + of the peak theoretical bandwidth. 
Calculated as the ratio of L1I requests over + the total L1I cycles. Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously loaded line the cache. Calculated as the ratio of the number of L1I requests that hit over the number of all L1I requests. - L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\ - \ bandwidth achieved. Calculated as the ratio of the total number of requests\ - \ from the L1I to the L2 cache over the total L1I-L2 interface cycles." + L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\ + \ L2 cache request bandwidth achieved. Calculated as the ratio of the total\ + \ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\ + \ cycles." + L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface + divided by total duration. Req: The total number of requests made to the L1I per normalization-unit Hits: The total number of L1I requests that hit on a previously loaded cache line, per normalization-unit. 
@@ -30,7 +33,7 @@ Panel Config: value: Avg unit: Unit metric: - Bandwidth: + Bandwidth Utilization: value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -38,7 +41,7 @@ Panel Config: value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES) + SQC_ICACHE_MISSES_DUPLICATE))) unit: Pct of Peak - L1I-L2 Bandwidth: + L1I-L2 Bandwidth Utilization: value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -100,7 +103,7 @@ Panel Config: unit: Unit metric: L1I-L2 Bandwidth: - avg: AVG(((SQC_TC_INST_REQ * 64) / $denom)) - min: MIN(((SQC_TC_INST_REQ * 64) / $denom)) - max: MAX(((SQC_TC_INST_REQ * 64) / $denom)) - unit: (Bytes + $normUnit) + avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + unit: Gbps diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1400_scalar_l1_data_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1400_scalar_l1_data_cache.yaml index d43157ce8e..282b97ad1f 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1400_scalar_l1_data_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1400_scalar_l1_data_cache.yaml @@ -3,14 +3,17 @@ Panel Config: id: 1400 title: Scalar L1 Data Cache metrics_description: - Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the - peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the - total sL1D cycles. + Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent + of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests + over the total sL1D cycles. 
Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously loaded line the cache. The ratio of the number of sL1D requests that hit over the number of all sL1D requests. + sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface + bandwidth achieved.\ \ Calculated as total number of bytes read from, written + to, or atomically updated\ \ across the sL1D - L2 interface. sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\ - \ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\ + \ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\ \ writes and atomics are typically unused on current CDNA accelerators, so in\ \ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth." Req: The total number of requests, of any size or type, made to the sL1D per normalization @@ -51,7 +54,7 @@ Panel Config: value: Avg unit: Unit metric: - Bandwidth: + Bandwidth Utilization: value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -60,7 +63,7 @@ Panel Config: + SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None)) unit: Pct of Peak - sL1D-L2 BW: + sL1D-L2 BW Utilization: value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp))) unit: Pct of Peak @@ -158,12 +161,12 @@ Panel Config: metric: sL1D-L2 BW: avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) /
$denom)) - unit: (Bytes + $normUnit) + * 64)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps Read Req: avg: AVG((SQC_TC_DATA_READ_REQ / $denom)) min: MIN((SQC_TC_DATA_READ_REQ / $denom)) diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1600_vector_l1_data_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1600_vector_l1_data_cache.yaml index 96e021e378..50af33c21b 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1600_vector_l1_data_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1600_vector_l1_data_cache.yaml @@ -5,12 +5,12 @@ Panel Config: metrics_description: Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D cache over the total number of cache line requests to the vL1D Cache RAM. - Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM - instructions, as a percent of the peak theoretical bandwidth achievable on the - specific accelerator. The number of bytes is calculated as the number of cache - lines requested multiplied by the cache line size. This value does not consider - partial requests, so for instance, if only a single value is requested in a - cache line, the data movement will still be counted as a full cache line. + Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result + of VMEM instructions, as a percent of the peak theoretical bandwidth achievable + on the specific accelerator. The number of bytes is calculated as the number + of cache lines requested multiplied by the cache line size. This value does + not consider partial requests, so for instance, if only a single value is requested + in a cache line, the data movement will still be counted as a full cache line. Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution. 
The number of cycles where the vL1D Cache RAM is actively processing any request divided by the number of cycles where the vL1D is active. @@ -42,11 +42,11 @@ Panel Config: Atomic Req: The total number of incoming atomic requests from the address processing unit after coalescing per normalization unit. Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM - instructions per normalization unit. The number of bytes is calculated as the - number of cache lines requested multiplied by the cache line size. This value - does not consider partial requests, so for instance, if only a single value - is requested in a cache line, the data movement will still be counted as a full - cache line. + instructions divided by total duration. The number of bytes is calculated as + the number of cache lines requested multiplied by the cache line size. This + value does not consider partial requests, so for instance, if only a single + value is requested in a cache line, the data movement will still be counted + as a full cache line. Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in vL1D cache over the total number of cache line requests to the vL1D Cache RAM. Cache Accesses: The total number of cache line lookups in the vL1D. @@ -57,7 +57,7 @@ Panel Config: command during the kernel's execution per normalization unit. This may be triggered by, for instance, the buffer_wbinvl1 instruction. L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result - of VMEM instructions, per normalization unit. The number of bytes is calculated + of VMEM instructions, divided by total duration. The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. 
This value does not consider partial requests, so for instance, if only a single value is requested in a cache line, the data movement will still be counted @@ -128,7 +128,7 @@ Panel Config: / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else None)) unit: Pct of Peak - Bandwidth: + Bandwidth Utilization: value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk / 1000) * 64) * $cu_per_gpu)) unit: Pct of Peak @@ -201,10 +201,10 @@ Panel Config: / $denom)) unit: (Req + $normUnit) Cache BW: - avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom)) - min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom)) - max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom)) - unit: (Bytes + $normUnit) + avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp))) + min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp))) + max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp))) + unit: Gbps Cache Hit Rate: avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) @@ -242,12 +242,12 @@ Panel Config: unit: (Req + $normUnit) L1-L2 BW: avg: AVG(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) - + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom)) + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp))) min: MIN(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) - + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom)) + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp))) max: MAX(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) - + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom)) - unit: (Bytes + $normUnit) + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - 
Start_Timestamp))) + unit: Gbps L1-L2 Read: avg: AVG((TCP_TCC_READ_REQ_sum / $denom)) min: MIN((TCP_TCC_READ_REQ_sum / $denom)) diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1700_l2_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1700_l2_cache.yaml index 6e77eb8f93..54046c8470 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1700_l2_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/1700_l2_cache.yaml @@ -20,8 +20,8 @@ Panel Config: HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory (HBM) per unit time. This value is calculated as the number of HBM channels multiplied by the HBM channel width multiplied by the HBM clock frequency. - Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per - normalization unit. + Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided + by total duration. HBM Read Traffic: The percent of read requests generated by the L2 cache that are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown does not consider the size of the request (meaning that 32B and 64B requests @@ -42,9 +42,9 @@ Panel Config: as a single request), so this metric only approximates the percent of the L2-Fabric read bandwidth directed to an uncached memory location. Write and Atomic BW: The total number of bytes written by the L2 over Infinity - Fabric by write and atomic operations per normalization unit. Note that on current - CDNA accelerators, such as the MI2XX, requests are only considered atomic by - Infinity Fabric if they are targeted at non-write-cacheable memory, for example, + Fabric by write and atomic operations divided by total duration. 
Note that on + current CDNA accelerators, such as the MI2XX, requests are only considered atomic + by Infinity Fabric if they are targeted at non-write-cacheable memory, for example, fine-grained memory allocations or uncached memory allocations on the MI2XX. HBM Write and Atomic Traffic: The percent of write and atomic requests generated by the L2 cache that are routed to the accelerator's local high-bandwidth memory @@ -82,17 +82,17 @@ Panel Config: Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity Fabric before a completion acknowledgement (atomic without return value) or data (atomic with return value) was returned to the L2. - Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit. + Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration. The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. This value does not consider partial requests, so for example, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests, - per normalization unit. + divided by total duration. Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests, - per normalization unit. + divided by total duration. Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests, - per normalization unit. + divided by total duration. Req: The total number of incoming requests to the L2 from all clients for all request types, per normalization unit. Read Req: The total number of read requests to the L2 from all clients. @@ -150,11 +150,11 @@ Panel Config: 64B of data from any source other than the accelerator's local HBM, per normalization unit. Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe - traffic, per normalization unit. 
+ traffic, divided by total duration. "Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read - requests due to Infinity Fabric traffic, per normalization unit. + requests due to Infinity Fabric traffic, divided by total duration. Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM - traffic, per normalization unit. + traffic, divided by total duration. Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to write or atomically update 32B of data to any memory location, per normalization unit. @@ -171,17 +171,17 @@ Panel Config: write or atomically update 32B or 64B of data in any memory location other than the accelerator's local HBM, per normalization unit. Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to - PCIe traffic, per normalization unit. + PCIe traffic, divided by total duration. "Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write - requests due to Infinity Fabric traffic, per normalization unit. + requests due to Infinity Fabric traffic, divided by total duration. Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM - traffic, per normalization unit. + traffic, divided by total duration. Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to - PCIe traffic, per normalization unit. + PCIe traffic, divided by total duration. "Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic - requests due to Infinity Fabric traffic, per normalization unit. + requests due to Infinity Fabric traffic, divided by total duration. Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to - HBM traffic, per normalization unit. + HBM traffic, divided by total duration. Atomic: The total number of L2 requests to Infinity Fabric to atomically update 32B or 64B of data in any memory location, per normalization unit. See Request flow for more detail. 
Note that on current CDNA accelerators, such as the MI2XX, @@ -257,12 +257,12 @@ Panel Config: metric: Read BW: avg: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) min: MIN((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) max: MAX((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) - * 64)) / $denom)) - unit: (Bytes + $normUnit) + * 64)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps HBM Read Traffic: avg: AVG((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None)) @@ -289,12 +289,12 @@ Panel Config: unit: pct Write and Atomic BW: avg: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) - * 32)) / $denom)) + * 32)) / (End_Timestamp - Start_Timestamp))) min: MIN((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) - * 32)) / $denom)) + * 32)) / (End_Timestamp - Start_Timestamp))) max: MAX((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) - * 32)) / $denom)) - unit: (Bytes + $normUnit) + * 32)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps HBM Write and Atomic Traffic: avg: AVG((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None)) @@ -362,10 +362,10 @@ Panel Config: unit: Unit metric: Bandwidth: - avg: AVG((TCC_REQ_sum * 64) / $denom) - min: MIN((TCC_REQ_sum * 64) / $denom) - max: MAX((TCC_REQ_sum * 64) / $denom) - unit: (Bytes + $normUnit) + avg: AVG((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp)) + min: MIN((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp)) + max: MAX((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp)) + unit: Gbps Req: avg: AVG((TCC_REQ_sum / $denom)) min: MIN((TCC_REQ_sum / $denom)) diff --git 
a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1200_local_data_share_lds.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1200_local_data_share_lds.yaml index 6cfe19d9de..2718654ad4 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1200_local_data_share_lds.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1200_local_data_share_lds.yaml @@ -11,8 +11,12 @@ Panel Config: instructions, averaged over the lifetime of the kernel. Calculated as the ratio of the total number of cycles spent by the scheduler issuing LDS instructions over the total CU cycles. + Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that + could have been loaded from, stored to, or atomically updated in the LDS, expressed + as a percentage of the theoretical peak. Does not take into account the execution + mask of the wavefront when the instruction was executed. Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been - loaded from, stored to, or atomically updated in the LDS per normalization unit. + loaded from, stored to, or atomically updated in the LDS divided by total duration. Does not take into account the execution mask of the wavefront when the instruction was executed. 
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent @@ -58,7 +62,7 @@ Panel Config: Access Rate: value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))) unit: Pct of Peak - Theoretical Bandwidth: + Theoretical Bandwidth Utilization: value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128))) unit: Pct of Peak @@ -86,12 +90,12 @@ Panel Config: unit: (Instr + $normUnit) Theoretical Bandwidth: avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) - unit: (Bytes + $normUnit) + / (End_Timestamp - Start_Timestamp))) + unit: Gbps LDS Latency: avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None)) diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1300_instruction_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1300_instruction_cache.yaml index a53c23691f..aeda9bc6c7 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1300_instruction_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1300_instruction_cache.yaml @@ -3,15 +3,18 @@ Panel Config: id: 1300 title: Instruction Cache metrics_description: - Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the - peak theoretical bandwidth. Calculated as the ratio of L1I requests over the - total L1I cycles. + Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent + of the peak theoretical bandwidth. 
Calculated as the ratio of L1I requests over + the total L1I cycles. Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously loaded line the cache. Calculated as the ratio of the number of L1I requests that hit over the number of all L1I requests. - L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\ - \ bandwidth achieved. Calculated as the ratio of the total number of requests\ - \ from the L1I to the L2 cache over the total L1I-L2 interface cycles." + L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\ + \ L2 cache request bandwidth achieved. Calculated as the ratio of the total\ + \ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\ + \ cycles." + L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface + divided by total duration. Req: The total number of requests made to the L1I per normalization-unit Hits: The total number of L1I requests that hit on a previously loaded cache line, per normalization-unit. 
@@ -30,7 +33,7 @@ Panel Config: value: Avg unit: Unit metric: - Bandwidth: + Bandwidth Utilization: value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -38,7 +41,7 @@ Panel Config: value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES) + SQC_ICACHE_MISSES_DUPLICATE))) unit: Pct of Peak - L1I-L2 Bandwidth: + L1I-L2 Bandwidth Utilization: value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -100,7 +103,7 @@ Panel Config: unit: Unit metric: L1I-L2 Bandwidth: - avg: AVG(((SQC_TC_INST_REQ * 64) / $denom)) - min: MIN(((SQC_TC_INST_REQ * 64) / $denom)) - max: MAX(((SQC_TC_INST_REQ * 64) / $denom)) - unit: (Bytes + $normUnit) + avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + unit: Gbps diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1400_scalar_l1_data_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1400_scalar_l1_data_cache.yaml index d43157ce8e..282b97ad1f 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1400_scalar_l1_data_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1400_scalar_l1_data_cache.yaml @@ -3,14 +3,17 @@ Panel Config: id: 1400 title: Scalar L1 Data Cache metrics_description: - Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the - peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the - total sL1D cycles. + Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent + of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests + over the total sL1D cycles. 
Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously loaded line the cache. The ratio of the number of sL1D requests that hit over the number of all sL1D requests. + sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface + bandwidth achieved. Calculated as total number of bytes read from, written + to, or atomically updated across the sL1D - L2 interface. sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\ - \ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\ + \ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\ \ writes and atomics are typically unused on current CDNA accelerators, so in\ \ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth." Req: The total number of requests, of any size or type, made to the sL1D per normalization @@ -51,7 +54,7 @@ Panel Config: value: Avg unit: Unit metric: - Bandwidth: + Bandwidth Utilization: value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -60,7 +63,7 @@ Panel Config: + SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None)) unit: Pct of Peak - sL1D-L2 BW: + sL1D-L2 BW Utilization: value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp))) unit: Pct of Peak @@ -158,12 +161,12 @@ Panel Config: metric: sL1D-L2 BW: avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / 
$denom)) - unit: (Bytes + $normUnit) + * 64)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps Read Req: avg: AVG((SQC_TC_DATA_READ_REQ / $denom)) min: MIN((SQC_TC_DATA_READ_REQ / $denom)) diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1600_vector_l1_data_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1600_vector_l1_data_cache.yaml index 96e021e378..50af33c21b 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1600_vector_l1_data_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1600_vector_l1_data_cache.yaml @@ -5,12 +5,12 @@ Panel Config: metrics_description: Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D cache over the total number of cache line requests to the vL1D Cache RAM. - Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM - instructions, as a percent of the peak theoretical bandwidth achievable on the - specific accelerator. The number of bytes is calculated as the number of cache - lines requested multiplied by the cache line size. This value does not consider - partial requests, so for instance, if only a single value is requested in a - cache line, the data movement will still be counted as a full cache line. + Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result + of VMEM instructions, as a percent of the peak theoretical bandwidth achievable + on the specific accelerator. The number of bytes is calculated as the number + of cache lines requested multiplied by the cache line size. This value does + not consider partial requests, so for instance, if only a single value is requested + in a cache line, the data movement will still be counted as a full cache line. Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution. 
The number of cycles where the vL1D Cache RAM is actively processing any request divided by the number of cycles where the vL1D is active. @@ -42,11 +42,11 @@ Panel Config: Atomic Req: The total number of incoming atomic requests from the address processing unit after coalescing per normalization unit. Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM - instructions per normalization unit. The number of bytes is calculated as the - number of cache lines requested multiplied by the cache line size. This value - does not consider partial requests, so for instance, if only a single value - is requested in a cache line, the data movement will still be counted as a full - cache line. + instructions divided by total duration. The number of bytes is calculated as + the number of cache lines requested multiplied by the cache line size. This + value does not consider partial requests, so for instance, if only a single + value is requested in a cache line, the data movement will still be counted + as a full cache line. Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in vL1D cache over the total number of cache line requests to the vL1D Cache RAM. Cache Accesses: The total number of cache line lookups in the vL1D. @@ -57,7 +57,7 @@ Panel Config: command during the kernel's execution per normalization unit. This may be triggered by, for instance, the buffer_wbinvl1 instruction. L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result - of VMEM instructions, per normalization unit. The number of bytes is calculated + of VMEM instructions, divided by total duration. The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. 
This value does not consider partial requests, so for instance, if only a single value is requested in a cache line, the data movement will still be counted @@ -128,7 +128,7 @@ Panel Config: / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else None)) unit: Pct of Peak - Bandwidth: + Bandwidth Utilization: value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk / 1000) * 64) * $cu_per_gpu)) unit: Pct of Peak @@ -201,10 +201,10 @@ Panel Config: / $denom)) unit: (Req + $normUnit) Cache BW: - avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom)) - min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom)) - max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom)) - unit: (Bytes + $normUnit) + avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp))) + min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp))) + max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp))) + unit: Gbps Cache Hit Rate: avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) @@ -242,12 +242,12 @@ Panel Config: unit: (Req + $normUnit) L1-L2 BW: avg: AVG(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) - + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom)) + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp))) min: MIN(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) - + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom)) + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp))) max: MAX(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) - + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom)) - unit: (Bytes + $normUnit) + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - 
Start_Timestamp))) + unit: Gbps L1-L2 Read: avg: AVG((TCP_TCC_READ_REQ_sum / $denom)) min: MIN((TCP_TCC_READ_REQ_sum / $denom)) diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1700_l2_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1700_l2_cache.yaml index 14398e1104..8153f7363c 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1700_l2_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/1700_l2_cache.yaml @@ -20,8 +20,8 @@ Panel Config: HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory (HBM) per unit time. This value is calculated as the number of HBM channels multiplied by the HBM channel width multiplied by the HBM clock frequency. - Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per - normalization unit. + Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided + by total duration. HBM Read Traffic: The percent of read requests generated by the L2 cache that are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown does not consider the size of the request (meaning that 32B and 64B requests @@ -42,9 +42,9 @@ Panel Config: as a single request), so this metric only approximates the percent of the L2-Fabric read bandwidth directed to an uncached memory location. Write and Atomic BW: The total number of bytes written by the L2 over Infinity - Fabric by write and atomic operations per normalization unit. Note that on current - CDNA accelerators, such as the MI2XX, requests are only considered atomic by - Infinity Fabric if they are targeted at non-write-cacheable memory, for example, + Fabric by write and atomic operations divided by total duration. 
Note that on + current CDNA accelerators, such as the MI2XX, requests are only considered atomic + by Infinity Fabric if they are targeted at non-write-cacheable memory, for example, fine-grained memory allocations or uncached memory allocations on the MI2XX. HBM Write and Atomic Traffic: The percent of write and atomic requests generated by the L2 cache that are routed to the accelerator's local high-bandwidth memory @@ -82,17 +82,17 @@ Panel Config: Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity Fabric before a completion acknowledgement (atomic without return value) or data (atomic with return value) was returned to the L2. - Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit. + Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration. The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. This value does not consider partial requests, so for example, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests, - per normalization unit. + divided by total duration. Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests, - per normalization unit. + divided by total duration. Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests, - per normalization unit. + divided by total duration. Req: The total number of incoming requests to the L2 from all clients for all request types, per normalization unit. Read Req: The total number of read requests to the L2 from all clients. @@ -150,11 +150,11 @@ Panel Config: 64B of data from any source other than the accelerator's local HBM, per normalization unit. Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe - traffic, per normalization unit. 
+ traffic, divided by total duration. "Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read - requests due to Infinity Fabric traffic, per normalization unit. + requests due to Infinity Fabric traffic, divided by total duration. Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM - traffic, per normalization unit. + traffic, divided by total duration. Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to write or atomically update 32B of data to any memory location, per normalization unit. @@ -171,17 +171,17 @@ Panel Config: write or atomically update 32B or 64B of data in any memory location other than the accelerator's local HBM, per normalization unit. Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to - PCIe traffic, per normalization unit. + PCIe traffic, divided by total duration. "Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write - requests due to Infinity Fabric traffic, per normalization unit. + requests due to Infinity Fabric traffic, divided by total duration. Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM - traffic, per normalization unit. + traffic, divided by total duration. Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to - PCIe traffic, per normalization unit. + PCIe traffic, divided by total duration. "Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic - requests due to Infinity Fabric traffic, per normalization unit. + requests due to Infinity Fabric traffic, divided by total duration. Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to - HBM traffic, per normalization unit. + HBM traffic, divided by total duration. Atomic: The total number of L2 requests to Infinity Fabric to atomically update 32B or 64B of data in any memory location, per normalization unit. See Request flow for more detail. 
Note that on current CDNA accelerators, such as the MI2XX, @@ -257,12 +257,12 @@ Panel Config: metric: Read BW: avg: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) min: MIN((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) max: MAX((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) - * 64)) / $denom)) - unit: (Bytes + $normUnit) + * 64)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps HBM Read Traffic: avg: AVG((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None)) @@ -289,12 +289,12 @@ Panel Config: unit: pct Write and Atomic BW: avg: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) - * 32)) / $denom)) + * 32)) / (End_Timestamp - Start_Timestamp))) min: MIN((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) - * 32)) / $denom)) + * 32)) / (End_Timestamp - Start_Timestamp))) max: MAX((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) - * 32)) / $denom)) - unit: (Bytes + $normUnit) + * 32)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps HBM Write and Atomic Traffic: avg: AVG((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None)) @@ -362,10 +362,10 @@ Panel Config: unit: Unit metric: Bandwidth: - avg: AVG((TCC_REQ_sum * 128) / $denom) - min: MIN((TCC_REQ_sum * 128) / $denom) - max: MAX((TCC_REQ_sum * 128) / $denom) - unit: (Bytes + $normUnit) + avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + unit: Gbps Req: avg: AVG((TCC_REQ_sum / $denom)) min: MIN((TCC_REQ_sum / $denom)) diff --git 
a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1200_local_data_share_lds.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1200_local_data_share_lds.yaml index c1a8525348..2718654ad4 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1200_local_data_share_lds.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1200_local_data_share_lds.yaml @@ -11,8 +11,12 @@ Panel Config: instructions, averaged over the lifetime of the kernel. Calculated as the ratio of the total number of cycles spent by the scheduler issuing LDS instructions over the total CU cycles. + Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that + could have been loaded from, stored to, or atomically updated in the LDS, expressed + as a percentage of the theoretical peak. Does not take into account the execution + mask of the wavefront when the instruction was executed. Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been - loaded from, stored to, or atomically updated in the LDS per normalization unit. + loaded from, stored to, or atomically updated in the LDS divided by total duration. Does not take into account the execution mask of the wavefront when the instruction was executed. 
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent @@ -58,7 +62,7 @@ Panel Config: Access Rate: value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))) unit: Pct of Peak - Theoretical Bandwidth (% of Peak): + Theoretical Bandwidth Utilization: value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128))) unit: Pct of Peak @@ -86,12 +90,12 @@ Panel Config: unit: (Instr + $normUnit) Theoretical Bandwidth: avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) - unit: (Bytes + $normUnit) + / (End_Timestamp - Start_Timestamp))) + unit: Gbps LDS Latency: avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None)) diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1300_instruction_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1300_instruction_cache.yaml index a53c23691f..aeda9bc6c7 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1300_instruction_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1300_instruction_cache.yaml @@ -3,15 +3,18 @@ Panel Config: id: 1300 title: Instruction Cache metrics_description: - Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the - peak theoretical bandwidth. Calculated as the ratio of L1I requests over the - total L1I cycles. 
+ Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent + of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over + the total L1I cycles. Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously loaded line the cache. Calculated as the ratio of the number of L1I requests that hit over the number of all L1I requests. - L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\ - \ bandwidth achieved. Calculated as the ratio of the total number of requests\ - \ from the L1I to the L2 cache over the total L1I-L2 interface cycles." + L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\ + \ L2 cache request bandwidth achieved. Calculated as the ratio of the total\ + \ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\ + \ cycles." + L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface + divided by total duration. Req: The total number of requests made to the L1I per normalization-unit Hits: The total number of L1I requests that hit on a previously loaded cache line, per normalization-unit. 
@@ -30,7 +33,7 @@ Panel Config: value: Avg unit: Unit metric: - Bandwidth: + Bandwidth Utilization: value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -38,7 +41,7 @@ Panel Config: value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES) + SQC_ICACHE_MISSES_DUPLICATE))) unit: Pct of Peak - L1I-L2 Bandwidth: + L1I-L2 Bandwidth Utilization: value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -100,7 +103,7 @@ Panel Config: unit: Unit metric: L1I-L2 Bandwidth: - avg: AVG(((SQC_TC_INST_REQ * 64) / $denom)) - min: MIN(((SQC_TC_INST_REQ * 64) / $denom)) - max: MAX(((SQC_TC_INST_REQ * 64) / $denom)) - unit: (Bytes + $normUnit) + avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + unit: Gbps diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1400_scalar_l1_data_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1400_scalar_l1_data_cache.yaml index d43157ce8e..282b97ad1f 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1400_scalar_l1_data_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1400_scalar_l1_data_cache.yaml @@ -3,14 +3,17 @@ Panel Config: id: 1400 title: Scalar L1 Data Cache metrics_description: - Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the - peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the - total sL1D cycles. + Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent + of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests + over the total sL1D cycles. 
Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously loaded line the cache. The ratio of the number of sL1D requests that hit over the number of all sL1D requests. + sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface + bandwidth achieved.\ \ Calculated as total number of bytes read from, written + to, or atomically updated\ \ across the sL1D - L2 interface. sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\ - \ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\ + \ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\ \ writes and atomics are typically unused on current CDNA accelerators, so in\ \ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth." Req: The total number of requests, of any size or type, made to the sL1D per normalization @@ -51,7 +54,7 @@ Panel Config: value: Avg unit: Unit metric: - Bandwidth: + Bandwidth Utilization: value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -60,7 +63,7 @@ Panel Config: + SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None)) unit: Pct of Peak - sL1D-L2 BW: + sL1D-L2 BW Utilization: value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp))) unit: Pct of Peak @@ -158,12 +161,12 @@ Panel Config: metric: sL1D-L2 BW: avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) /
$denom)) - unit: (Bytes + $normUnit) + * 64)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps Read Req: avg: AVG((SQC_TC_DATA_READ_REQ / $denom)) min: MIN((SQC_TC_DATA_READ_REQ / $denom)) diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1600_vector_l1_data_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1600_vector_l1_data_cache.yaml index 708bbafe14..db745209b7 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1600_vector_l1_data_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1600_vector_l1_data_cache.yaml @@ -5,12 +5,12 @@ Panel Config: metrics_description: Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D cache over the total number of cache line requests to the vL1D Cache RAM. - Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM - instructions, as a percent of the peak theoretical bandwidth achievable on the - specific accelerator. The number of bytes is calculated as the number of cache - lines requested multiplied by the cache line size. This value does not consider - partial requests, so for instance, if only a single value is requested in a - cache line, the data movement will still be counted as a full cache line. + Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result + of VMEM instructions, as a percent of the peak theoretical bandwidth achievable + on the specific accelerator. The number of bytes is calculated as the number + of cache lines requested multiplied by the cache line size. This value does + not consider partial requests, so for instance, if only a single value is requested + in a cache line, the data movement will still be counted as a full cache line. Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution. 
The number of cycles where the vL1D Cache RAM is actively processing any request divided by the number of cycles where the vL1D is active. @@ -42,11 +42,11 @@ Panel Config: Atomic Req: The total number of incoming atomic requests from the address processing unit after coalescing per normalization unit. Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM - instructions per normalization unit. The number of bytes is calculated as the - number of cache lines requested multiplied by the cache line size. This value - does not consider partial requests, so for instance, if only a single value - is requested in a cache line, the data movement will still be counted as a full - cache line. + instructions divided by total duration. The number of bytes is calculated as + the number of cache lines requested multiplied by the cache line size. This + value does not consider partial requests, so for instance, if only a single + value is requested in a cache line, the data movement will still be counted + as a full cache line. Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in vL1D cache over the total number of cache line requests to the vL1D Cache RAM. Cache Accesses: The total number of cache line lookups in the vL1D. @@ -57,7 +57,7 @@ Panel Config: command during the kernel's execution per normalization unit. This may be triggered by, for instance, the buffer_wbinvl1 instruction. L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result - of VMEM instructions, per normalization unit. The number of bytes is calculated + of VMEM instructions, divided by total duration. The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. 
This value does not consider partial requests, so for instance, if only a single value is requested in a cache line, the data movement will still be counted @@ -128,7 +128,7 @@ Panel Config: / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else None)) unit: Pct of Peak - Bandwidth: + Bandwidth Utilization: value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk / 1000) * 128) * $cu_per_gpu)) unit: Pct of Peak @@ -201,10 +201,10 @@ Panel Config: / $denom)) unit: (Req + $normUnit) Cache BW: - avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom)) - min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom)) - max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom)) - unit: (Bytes + $normUnit) + avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp))) + min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp))) + max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp))) + unit: Gbps Cache Hit Rate: avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) @@ -242,12 +242,12 @@ Panel Config: unit: (Req + $normUnit) L1-L2 BW: avg: AVG(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum - + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom)) + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp))) min: MIN(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum - + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom)) + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp))) max: MAX(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum - + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom)) - unit: (Bytes + $normUnit) + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - 
Start_Timestamp))) + unit: Gbps L1-L2 Read: avg: AVG((TCP_TCC_READ_REQ_sum / $denom)) min: MIN((TCP_TCC_READ_REQ_sum / $denom)) diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1700_l2_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1700_l2_cache.yaml index 36d5943858..74c12857e0 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1700_l2_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/1700_l2_cache.yaml @@ -20,8 +20,8 @@ Panel Config: HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory (HBM) per unit time. This value is calculated as the number of HBM channels multiplied by the HBM channel width multiplied by the HBM clock frequency. - Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per - normalization unit. + Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided + by total duration. HBM Read Traffic: The percent of read requests generated by the L2 cache that are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown does not consider the size of the request (meaning that 32B and 64B requests @@ -42,9 +42,9 @@ Panel Config: as a single request), so this metric only approximates the percent of the L2-Fabric read bandwidth directed to an uncached memory location. Write and Atomic BW: The total number of bytes written by the L2 over Infinity - Fabric by write and atomic operations per normalization unit. Note that on current - CDNA accelerators, such as the MI2XX, requests are only considered atomic by - Infinity Fabric if they are targeted at non-write-cacheable memory, for example, + Fabric by write and atomic operations divided by total duration. 
Note that on + current CDNA accelerators, such as the MI2XX, requests are only considered atomic + by Infinity Fabric if they are targeted at non-write-cacheable memory, for example, fine-grained memory allocations or uncached memory allocations on the MI2XX. HBM Write and Atomic Traffic: The percent of write and atomic requests generated by the L2 cache that are routed to the accelerator's local high-bandwidth memory @@ -82,17 +82,17 @@ Panel Config: Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity Fabric before a completion acknowledgement (atomic without return value) or data (atomic with return value) was returned to the L2. - Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit. + Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration. The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. This value does not consider partial requests, so for example, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests, - per normalization unit. + divided by total duration. Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests, - per normalization unit. + divided by total duration. Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests, - per normalization unit. + divided by total duration. Req: The total number of incoming requests to the L2 from all clients for all request types, per normalization unit. Read Req: The total number of read requests to the L2 from all clients. @@ -150,11 +150,11 @@ Panel Config: 64B of data from any source other than the accelerator's local HBM, per normalization unit. Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe - traffic, per normalization unit. 
+ traffic, divided by total duration. "Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read - requests due to Infinity Fabric traffic, per normalization unit. + requests due to Infinity Fabric traffic, divided by total duration. Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM - traffic, per normalization unit. + traffic, divided by total duration. Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to write or atomically update 32B of data to any memory location, per normalization unit. @@ -171,17 +171,17 @@ Panel Config: write or atomically update 32B or 64B of data in any memory location other than the accelerator's local HBM, per normalization unit. Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to - PCIe traffic, per normalization unit. + PCIe traffic, divided by total duration. "Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write - requests due to Infinity Fabric traffic, per normalization unit. + requests due to Infinity Fabric traffic, divided by total duration. Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM - traffic, per normalization unit. + traffic, divided by total duration. Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to - PCIe traffic, per normalization unit. + PCIe traffic, divided by total duration. "Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic - requests due to Infinity Fabric traffic, per normalization unit. + requests due to Infinity Fabric traffic, divided by total duration. Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to - HBM traffic, per normalization unit. + HBM traffic, divided by total duration. Atomic: The total number of L2 requests to Infinity Fabric to atomically update 32B or 64B of data in any memory location, per normalization unit. See Request flow for more detail. 
Note that on current CDNA accelerators, such as the MI2XX, @@ -257,12 +257,12 @@ Panel Config: metric: Read BW: avg: AVG((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) min: MIN((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) max: MAX((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) - * 64)) / $denom)) - unit: (Bytes + $normUnit) + * 64)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps HBM Read Traffic: avg: AVG((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum != 0) else None)) @@ -289,12 +289,12 @@ Panel Config: unit: pct Write and Atomic BW: avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) - * 32)) / $denom)) + * 32)) / (End_Timestamp - Start_Timestamp))) min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) - * 32)) / $denom)) + * 32)) / (End_Timestamp - Start_Timestamp))) max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) - * 32)) / $denom)) - unit: (Bytes + $normUnit) + * 32)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps HBM Write and Atomic Traffic: avg: AVG((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum != 0) else None)) @@ -362,10 +362,10 @@ Panel Config: unit: Unit metric: Bandwidth: - avg: AVG((TCC_REQ_sum * 128) / $denom) - min: MIN((TCC_REQ_sum * 128) / $denom) - max: MAX((TCC_REQ_sum * 128) / $denom) - unit: (Bytes + $normUnit) + avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + unit: Gbps Req: avg: AVG((TCC_REQ_sum / $denom)) min: MIN((TCC_REQ_sum / $denom)) diff --git 
a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1200_local_data_share_lds.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1200_local_data_share_lds.yaml index c1a8525348..2718654ad4 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1200_local_data_share_lds.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1200_local_data_share_lds.yaml @@ -11,8 +11,12 @@ Panel Config: instructions, averaged over the lifetime of the kernel. Calculated as the ratio of the total number of cycles spent by the scheduler issuing LDS instructions over the total CU cycles. + Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that + could have been loaded from, stored to, or atomically updated in the LDS divided + as percentage of theoretical peak. Does not take into account the execution + mask of the wavefront when the instruction was executed. Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been - loaded from, stored to, or atomically updated in the LDS per normalization unit. + loaded from, stored to, or atomically updated in the LDS divided by total duration. Does not take into account the execution mask of the wavefront when the instruction was executed. 
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent @@ -58,7 +62,7 @@ Panel Config: Access Rate: value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))) unit: Pct of Peak - Theoretical Bandwidth (% of Peak): + Theoretical Bandwidth Utilization: value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128))) unit: Pct of Peak @@ -86,12 +90,12 @@ Panel Config: unit: (Instr + $normUnit) Theoretical Bandwidth: avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) - unit: (Bytes + $normUnit) + / (End_Timestamp - Start_Timestamp))) + unit: Gbps LDS Latency: avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None)) diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1300_instruction_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1300_instruction_cache.yaml index a53c23691f..aeda9bc6c7 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1300_instruction_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1300_instruction_cache.yaml @@ -3,15 +3,18 @@ Panel Config: id: 1300 title: Instruction Cache metrics_description: - Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the - peak theoretical bandwidth. Calculated as the ratio of L1I requests over the - total L1I cycles. 
+ Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent + of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over + the total L1I cycles. Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously loaded line the cache. Calculated as the ratio of the number of L1I requests that hit over the number of all L1I requests. - L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\ - \ bandwidth achieved. Calculated as the ratio of the total number of requests\ - \ from the L1I to the L2 cache over the total L1I-L2 interface cycles." + L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\ + \ L2 cache request bandwidth achieved. Calculated as the ratio of the total\ + \ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\ + \ cycles." + L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface + divided by total duration. Req: The total number of requests made to the L1I per normalization-unit Hits: The total number of L1I requests that hit on a previously loaded cache line, per normalization-unit. 
@@ -30,7 +33,7 @@ Panel Config: value: Avg unit: Unit metric: - Bandwidth: + Bandwidth Utilization: value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -38,7 +41,7 @@ Panel Config: value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES) + SQC_ICACHE_MISSES_DUPLICATE))) unit: Pct of Peak - L1I-L2 Bandwidth: + L1I-L2 Bandwidth Utilization: value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -100,7 +103,7 @@ Panel Config: unit: Unit metric: L1I-L2 Bandwidth: - avg: AVG(((SQC_TC_INST_REQ * 64) / $denom)) - min: MIN(((SQC_TC_INST_REQ * 64) / $denom)) - max: MAX(((SQC_TC_INST_REQ * 64) / $denom)) - unit: (Bytes + $normUnit) + avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + unit: Gbps diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1400_scalar_l1_data_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1400_scalar_l1_data_cache.yaml index d43157ce8e..282b97ad1f 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1400_scalar_l1_data_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1400_scalar_l1_data_cache.yaml @@ -3,14 +3,17 @@ Panel Config: id: 1400 title: Scalar L1 Data Cache metrics_description: - Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the - peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the - total sL1D cycles. + Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent + of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests + over the total sL1D cycles. 
Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously loaded line the cache. The ratio of the number of sL1D requests that hit over the number of all sL1D requests. + sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface + bandwidth achieved.\ \ Calculated as total number of bytes read from, written + to, or atomically updated\ \ across the sL1D - L2 interface. sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\ - \ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\ + \ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\ \ writes and atomics are typically unused on current CDNA accelerators, so in\ \ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth." Req: The total number of requests, of any size or type, made to the sL1D per normalization @@ -51,7 +54,7 @@ Panel Config: value: Avg unit: Unit metric: - Bandwidth: + Bandwidth Utilization: value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -60,7 +63,7 @@ Panel Config: + SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None)) unit: Pct of Peak - sL1D-L2 BW: + sL1D-L2 BW Utilization: value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp))) unit: Pct of Peak @@ -158,12 +161,12 @@ Panel Config: metric: sL1D-L2 BW: avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) /
$denom)) - unit: (Bytes + $normUnit) + * 64)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps Read Req: avg: AVG((SQC_TC_DATA_READ_REQ / $denom)) min: MIN((SQC_TC_DATA_READ_REQ / $denom)) diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1600_vector_l1_data_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1600_vector_l1_data_cache.yaml index 708bbafe14..db745209b7 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1600_vector_l1_data_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1600_vector_l1_data_cache.yaml @@ -5,12 +5,12 @@ Panel Config: metrics_description: Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D cache over the total number of cache line requests to the vL1D Cache RAM. - Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM - instructions, as a percent of the peak theoretical bandwidth achievable on the - specific accelerator. The number of bytes is calculated as the number of cache - lines requested multiplied by the cache line size. This value does not consider - partial requests, so for instance, if only a single value is requested in a - cache line, the data movement will still be counted as a full cache line. + Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result + of VMEM instructions, as a percent of the peak theoretical bandwidth achievable + on the specific accelerator. The number of bytes is calculated as the number + of cache lines requested multiplied by the cache line size. This value does + not consider partial requests, so for instance, if only a single value is requested + in a cache line, the data movement will still be counted as a full cache line. Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution. 
The number of cycles where the vL1D Cache RAM is actively processing any request divided by the number of cycles where the vL1D is active. @@ -42,11 +42,11 @@ Panel Config: Atomic Req: The total number of incoming atomic requests from the address processing unit after coalescing per normalization unit. Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM - instructions per normalization unit. The number of bytes is calculated as the - number of cache lines requested multiplied by the cache line size. This value - does not consider partial requests, so for instance, if only a single value - is requested in a cache line, the data movement will still be counted as a full - cache line. + instructions divided by total duration. The number of bytes is calculated as + the number of cache lines requested multiplied by the cache line size. This + value does not consider partial requests, so for instance, if only a single + value is requested in a cache line, the data movement will still be counted + as a full cache line. Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in vL1D cache over the total number of cache line requests to the vL1D Cache RAM. Cache Accesses: The total number of cache line lookups in the vL1D. @@ -57,7 +57,7 @@ Panel Config: command during the kernel's execution per normalization unit. This may be triggered by, for instance, the buffer_wbinvl1 instruction. L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result - of VMEM instructions, per normalization unit. The number of bytes is calculated + of VMEM instructions, divided by total duration. The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. 
This value does not consider partial requests, so for instance, if only a single value is requested in a cache line, the data movement will still be counted @@ -128,7 +128,7 @@ Panel Config: / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else None)) unit: Pct of Peak - Bandwidth: + Bandwidth Utilization: value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk / 1000) * 128) * $cu_per_gpu)) unit: Pct of Peak @@ -201,10 +201,10 @@ Panel Config: / $denom)) unit: (Req + $normUnit) Cache BW: - avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom)) - min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom)) - max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom)) - unit: (Bytes + $normUnit) + avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp))) + min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp))) + max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp))) + unit: Gbps Cache Hit Rate: avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) @@ -242,12 +242,12 @@ Panel Config: unit: (Req + $normUnit) L1-L2 BW: avg: AVG(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum - + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom)) + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp))) min: MIN(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum - + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom)) + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp))) max: MAX(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum - + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom)) - unit: (Bytes + $normUnit) + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - 
Start_Timestamp))) + unit: Gbps L1-L2 Read: avg: AVG((TCP_TCC_READ_REQ_sum / $denom)) min: MIN((TCP_TCC_READ_REQ_sum / $denom)) diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1700_l2_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1700_l2_cache.yaml index e7acf40a5c..f0aefff869 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1700_l2_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/1700_l2_cache.yaml @@ -20,8 +20,8 @@ Panel Config: HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory (HBM) per unit time. This value is calculated as the number of HBM channels multiplied by the HBM channel width multiplied by the HBM clock frequency. - Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per - normalization unit. + Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided + by total duration. HBM Read Traffic: The percent of read requests generated by the L2 cache that are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown does not consider the size of the request (meaning that 32B and 64B requests @@ -42,9 +42,9 @@ Panel Config: as a single request), so this metric only approximates the percent of the L2-Fabric read bandwidth directed to an uncached memory location. Write and Atomic BW: The total number of bytes written by the L2 over Infinity - Fabric by write and atomic operations per normalization unit. Note that on current - CDNA accelerators, such as the MI2XX, requests are only considered atomic by - Infinity Fabric if they are targeted at non-write-cacheable memory, for example, + Fabric by write and atomic operations divided by total duration. 
Note that on + current CDNA accelerators, such as the MI2XX, requests are only considered atomic + by Infinity Fabric if they are targeted at non-write-cacheable memory, for example, fine-grained memory allocations or uncached memory allocations on the MI2XX. HBM Write and Atomic Traffic: The percent of write and atomic requests generated by the L2 cache that are routed to the accelerator's local high-bandwidth memory @@ -82,17 +82,17 @@ Panel Config: Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity Fabric before a completion acknowledgement (atomic without return value) or data (atomic with return value) was returned to the L2. - Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit. + Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration. The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. This value does not consider partial requests, so for example, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests, - per normalization unit. + divided by total duration. Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests, - per normalization unit. + divided by total duration. Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests, - per normalization unit. + divided by total duration. Req: The total number of incoming requests to the L2 from all clients for all request types, per normalization unit. Read Req: The total number of read requests to the L2 from all clients. @@ -150,11 +150,11 @@ Panel Config: 64B of data from any source other than the accelerator's local HBM, per normalization unit. Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe - traffic, per normalization unit. 
+ traffic, divided by total duration. "Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read - requests due to Infinity Fabric traffic, per normalization unit. + requests due to Infinity Fabric traffic, divided by total duration. Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM - traffic, per normalization unit. + traffic, divided by total duration. Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to write or atomically update 32B of data to any memory location, per normalization unit. @@ -171,17 +171,17 @@ Panel Config: write or atomically update 32B or 64B of data in any memory location other than the accelerator's local HBM, per normalization unit. Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to - PCIe traffic, per normalization unit. + PCIe traffic, divided by total duration. "Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write - requests due to Infinity Fabric traffic, per normalization unit. + requests due to Infinity Fabric traffic, divided by total duration. Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM - traffic, per normalization unit. + traffic, divided by total duration. Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to - PCIe traffic, per normalization unit. + PCIe traffic, divided by total duration. "Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic - requests due to Infinity Fabric traffic, per normalization unit. + requests due to Infinity Fabric traffic, divided by total duration. Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to - HBM traffic, per normalization unit. + HBM traffic, divided by total duration. Atomic: The total number of L2 requests to Infinity Fabric to atomically update 32B or 64B of data in any memory location, per normalization unit. See Request flow for more detail. 
Note that on current CDNA accelerators, such as the MI2XX, @@ -257,12 +257,12 @@ Panel Config: metric: Read BW: avg: AVG((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) min: MIN((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) max: MAX((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) - * 64)) / $denom)) - unit: (Bytes + $normUnit) + * 64)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps HBM Read Traffic: avg: AVG((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum != 0) else None)) @@ -289,12 +289,12 @@ Panel Config: unit: pct Write and Atomic BW: avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) - * 32)) / $denom)) + * 32)) / (End_Timestamp - Start_Timestamp))) min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) - * 32)) / $denom)) + * 32)) / (End_Timestamp - Start_Timestamp))) max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) - * 32)) / $denom)) - unit: (Bytes + $normUnit) + * 32)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps HBM Write and Atomic Traffic: avg: AVG((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum != 0) else None)) @@ -362,10 +362,10 @@ Panel Config: unit: Unit metric: Bandwidth: - avg: AVG((TCC_REQ_sum * 128) / $denom) - min: MIN((TCC_REQ_sum * 128) / $denom) - max: MAX((TCC_REQ_sum * 128) / $denom) - unit: (Bytes + $normUnit) + avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + unit: Gbps Req: avg: AVG((TCC_REQ_sum / $denom)) min: MIN((TCC_REQ_sum / $denom)) diff --git 
a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1200_local_data_share_lds.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1200_local_data_share_lds.yaml index c1a8525348..2718654ad4 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1200_local_data_share_lds.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1200_local_data_share_lds.yaml @@ -11,8 +11,12 @@ Panel Config: instructions, averaged over the lifetime of the kernel. Calculated as the ratio of the total number of cycles spent by the scheduler issuing LDS instructions over the total CU cycles. + Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that + could have been loaded from, stored to, or atomically updated in the LDS divided + as percentage of theoretical peak. Does not take into account the execution + mask of the wavefront when the instruction was executed. Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been - loaded from, stored to, or atomically updated in the LDS per normalization unit. + loaded from, stored to, or atomically updated in the LDS divided by total duration. Does not take into account the execution mask of the wavefront when the instruction was executed. 
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent @@ -58,7 +62,7 @@ Panel Config: Access Rate: value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))) unit: Pct of Peak - Theoretical Bandwidth (% of Peak): + Theoretical Bandwidth Utilization: value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128))) unit: Pct of Peak @@ -86,12 +90,12 @@ Panel Config: unit: (Instr + $normUnit) Theoretical Bandwidth: avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) - unit: (Bytes + $normUnit) + / (End_Timestamp - Start_Timestamp))) + unit: Gbps LDS Latency: avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None)) diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1300_instruction_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1300_instruction_cache.yaml index a53c23691f..aeda9bc6c7 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1300_instruction_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1300_instruction_cache.yaml @@ -3,15 +3,18 @@ Panel Config: id: 1300 title: Instruction Cache metrics_description: - Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the - peak theoretical bandwidth. Calculated as the ratio of L1I requests over the - total L1I cycles. 
+ Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent + of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over + the total L1I cycles. Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously loaded line the cache. Calculated as the ratio of the number of L1I requests that hit over the number of all L1I requests. - L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\ - \ bandwidth achieved. Calculated as the ratio of the total number of requests\ - \ from the L1I to the L2 cache over the total L1I-L2 interface cycles." + L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\ + \ L2 cache request bandwidth achieved. Calculated as the ratio of the total\ + \ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\ + \ cycles." + L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface + divided by total duration. Req: The total number of requests made to the L1I per normalization-unit Hits: The total number of L1I requests that hit on a previously loaded cache line, per normalization-unit. 
@@ -30,7 +33,7 @@ Panel Config: value: Avg unit: Unit metric: - Bandwidth: + Bandwidth Utilization: value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -38,7 +41,7 @@ Panel Config: value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES) + SQC_ICACHE_MISSES_DUPLICATE))) unit: Pct of Peak - L1I-L2 Bandwidth: + L1I-L2 Bandwidth Utilization: value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -100,7 +103,7 @@ Panel Config: unit: Unit metric: L1I-L2 Bandwidth: - avg: AVG(((SQC_TC_INST_REQ * 64) / $denom)) - min: MIN(((SQC_TC_INST_REQ * 64) / $denom)) - max: MAX(((SQC_TC_INST_REQ * 64) / $denom)) - unit: (Bytes + $normUnit) + avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + unit: Gbps diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1400_scalar_l1_data_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1400_scalar_l1_data_cache.yaml index d43157ce8e..282b97ad1f 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1400_scalar_l1_data_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1400_scalar_l1_data_cache.yaml @@ -3,14 +3,17 @@ Panel Config: id: 1400 title: Scalar L1 Data Cache metrics_description: - Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the - peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the - total sL1D cycles. + Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent + of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests + over the total sL1D cycles. 
Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously loaded line the cache. The ratio of the number of sL1D requests that hit over the number of all sL1D requests. + sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface + bandwidth acheived.\ \ Caclulated as total number of bytes read from, written + to, or atomically updated\ \ across the sL1D - L2 interface. sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\ - \ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\ + \ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\ \ writes and atomics are typically unused on current CDNA accelerators, so in\ \ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth." Req: The total number of requests, of any size or type, made to the sL1D per normalization @@ -51,7 +54,7 @@ Panel Config: value: Avg unit: Unit metric: - Bandwidth: + Bandwidth Utilization: value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -60,7 +63,7 @@ Panel Config: + SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None)) unit: Pct of Peak - sL1D-L2 BW: + sL1D-L2 BW Utilization: value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp))) unit: Pct of Peak @@ -158,12 +161,12 @@ Panel Config: metric: sL1D-L2 BW: avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / 
$denom)) - unit: (Bytes + $normUnit) + * 64)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps Read Req: avg: AVG((SQC_TC_DATA_READ_REQ / $denom)) min: MIN((SQC_TC_DATA_READ_REQ / $denom)) diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1600_vector_l1_data_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1600_vector_l1_data_cache.yaml index 708bbafe14..db745209b7 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1600_vector_l1_data_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1600_vector_l1_data_cache.yaml @@ -5,12 +5,12 @@ Panel Config: metrics_description: Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D cache over the total number of cache line requests to the vL1D Cache RAM. - Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM - instructions, as a percent of the peak theoretical bandwidth achievable on the - specific accelerator. The number of bytes is calculated as the number of cache - lines requested multiplied by the cache line size. This value does not consider - partial requests, so for instance, if only a single value is requested in a - cache line, the data movement will still be counted as a full cache line. + Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result + of VMEM instructions, as a percent of the peak theoretical bandwidth achievable + on the specific accelerator. The number of bytes is calculated as the number + of cache lines requested multiplied by the cache line size. This value does + not consider partial requests, so for instance, if only a single value is requested + in a cache line, the data movement will still be counted as a full cache line. Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution. 
The number of cycles where the vL1D Cache RAM is actively processing any request divided by the number of cycles where the vL1D is active. @@ -42,11 +42,11 @@ Panel Config: Atomic Req: The total number of incoming atomic requests from the address processing unit after coalescing per normalization unit. Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM - instructions per normalization unit. The number of bytes is calculated as the - number of cache lines requested multiplied by the cache line size. This value - does not consider partial requests, so for instance, if only a single value - is requested in a cache line, the data movement will still be counted as a full - cache line. + instructions divided by total duration. The number of bytes is calculated as + the number of cache lines requested multiplied by the cache line size. This + value does not consider partial requests, so for instance, if only a single + value is requested in a cache line, the data movement will still be counted + as a full cache line. Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in vL1D cache over the total number of cache line requests to the vL1D Cache RAM. Cache Accesses: The total number of cache line lookups in the vL1D. @@ -57,7 +57,7 @@ Panel Config: command during the kernel's execution per normalization unit. This may be triggered by, for instance, the buffer_wbinvl1 instruction. L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result - of VMEM instructions, per normalization unit. The number of bytes is calculated + of VMEM instructions, divided by total duration. The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. 
This value does not consider partial requests, so for instance, if only a single value is requested in a cache line, the data movement will still be counted @@ -128,7 +128,7 @@ Panel Config: / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else None)) unit: Pct of Peak - Bandwidth: + Bandwidth Utilization: value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk / 1000) * 128) * $cu_per_gpu)) unit: Pct of Peak @@ -201,10 +201,10 @@ Panel Config: / $denom)) unit: (Req + $normUnit) Cache BW: - avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom)) - min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom)) - max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom)) - unit: (Bytes + $normUnit) + avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp))) + min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp))) + max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp))) + unit: Gbps Cache Hit Rate: avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) @@ -242,12 +242,12 @@ Panel Config: unit: (Req + $normUnit) L1-L2 BW: avg: AVG(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum - + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom)) + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp))) min: MIN(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum - + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom)) + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp))) max: MAX(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum - + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom)) - unit: (Bytes + $normUnit) + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - 
Start_Timestamp))) + unit: Gbps L1-L2 Read: avg: AVG((TCP_TCC_READ_REQ_sum / $denom)) min: MIN((TCP_TCC_READ_REQ_sum / $denom)) diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1700_l2_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1700_l2_cache.yaml index 0a72362ea7..efff4769b6 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1700_l2_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1700_l2_cache.yaml @@ -20,8 +20,8 @@ Panel Config: HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory (HBM) per unit time. This value is calculated as the number of HBM channels multiplied by the HBM channel width multiplied by the HBM clock frequency. - Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per - normalization unit. + Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided + by total duration. HBM Read Traffic: The percent of read requests generated by the L2 cache that are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown does not consider the size of the request (meaning that 32B and 64B requests @@ -42,9 +42,9 @@ Panel Config: as a single request), so this metric only approximates the percent of the L2-Fabric read bandwidth directed to an uncached memory location. Write and Atomic BW: The total number of bytes written by the L2 over Infinity - Fabric by write and atomic operations per normalization unit. Note that on current - CDNA accelerators, such as the MI2XX, requests are only considered atomic by - Infinity Fabric if they are targeted at non-write-cacheable memory, for example, + Fabric by write and atomic operations divided by total duration. 
Note that on + current CDNA accelerators, such as the MI2XX, requests are only considered atomic + by Infinity Fabric if they are targeted at non-write-cacheable memory, for example, fine-grained memory allocations or uncached memory allocations on the MI2XX. HBM Write and Atomic Traffic: The percent of write and atomic requests generated by the L2 cache that are routed to the accelerator's local high-bandwidth memory @@ -82,17 +82,17 @@ Panel Config: Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity Fabric before a completion acknowledgement (atomic without return value) or data (atomic with return value) was returned to the L2. - Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit. + Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration. The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. This value does not consider partial requests, so for example, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests, - per normalization unit. + divided by total duration. Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests, - per normalization unit. + divided by total duration. Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests, - per normalization unit. + divided by total duration. Req: The total number of incoming requests to the L2 from all clients for all request types, per normalization unit. Read Req: The total number of read requests to the L2 from all clients. @@ -150,11 +150,11 @@ Panel Config: 64B of data from any source other than the accelerator's local HBM, per normalization unit. Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe - traffic, per normalization unit. 
+ traffic, divided by total duration. "Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read - requests due to Infinity Fabric traffic, per normalization unit. + requests due to Infinity Fabric traffic, divided by total duration. Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM - traffic, per normalization unit. + traffic, divided by total duration. Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to write or atomically update 32B of data to any memory location, per normalization unit. @@ -171,17 +171,17 @@ Panel Config: write or atomically update 32B or 64B of data in any memory location other than the accelerator's local HBM, per normalization unit. Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to - PCIe traffic, per normalization unit. + PCIe traffic, divided by total duration. "Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write - requests due to Infinity Fabric traffic, per normalization unit. + requests due to Infinity Fabric traffic, divided by total duration. Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM - traffic, per normalization unit. + traffic, divided by total duration. Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to - PCIe traffic, per normalization unit. + PCIe traffic, divided by total duration. "Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic - requests due to Infinity Fabric traffic, per normalization unit. + requests due to Infinity Fabric traffic, divided by total duration. Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to - HBM traffic, per normalization unit. + HBM traffic, divided by total duration. Atomic: The total number of L2 requests to Infinity Fabric to atomically update 32B or 64B of data in any memory location, per normalization unit. See Request flow for more detail. 
Note that on current CDNA accelerators, such as the MI2XX, @@ -258,12 +258,15 @@ Panel Config: metric: Read BW: avg: AVG(((128 * TCC_BUBBLE_sum + 64 * (TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - - TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / $denom)) + - TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / (End_Timestamp + - Start_Timestamp))) min: MIN(((128 * TCC_BUBBLE_sum + 64 * (TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - - TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / $denom)) + - TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / (End_Timestamp + - Start_Timestamp))) max: MAX(((128 * TCC_BUBBLE_sum + 64 * (TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - - TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / $denom)) - unit: (Bytes + $normUnit) + - TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / (End_Timestamp + - Start_Timestamp))) + unit: Gbps HBM Read Traffic: avg: AVG((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum != 0) else None)) @@ -290,12 +293,12 @@ Panel Config: unit: pct Write and Atomic BW: avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) - * 32)) / $denom)) + * 32)) / (End_Timestamp - Start_Timestamp))) min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) - * 32)) / $denom)) + * 32)) / (End_Timestamp - Start_Timestamp))) max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) - * 32)) / $denom)) - unit: (Bytes + $normUnit) + * 32)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps HBM Write and Atomic Traffic: avg: AVG((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum != 0) else None)) @@ -363,10 +366,10 @@ Panel Config: unit: Unit metric: Bandwidth: - avg: AVG((TCC_REQ_sum * 128) / $denom) - min: MIN((TCC_REQ_sum * 128) / $denom) - max: MAX((TCC_REQ_sum * 128) / $denom) - unit: (Bytes + $normUnit) + avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + min: 
MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + unit: Gbps Req: avg: AVG((TCC_REQ_sum / $denom)) min: MIN((TCC_REQ_sum / $denom)) diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1200_local_data_share_lds.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1200_local_data_share_lds.yaml index 0609c0a203..c334698661 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1200_local_data_share_lds.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1200_local_data_share_lds.yaml @@ -11,8 +11,12 @@ Panel Config: instructions, averaged over the lifetime of the kernel. Calculated as the ratio of the total number of cycles spent by the scheduler issuing LDS instructions over the total CU cycles. + Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that + could have been loaded from, stored to, or atomically updated in the LDS divided + as percentage of theoretical peak. Does not take into account the execution + mask of the wavefront when the instruction was executed. Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been - loaded from, stored to, or atomically updated in the LDS per normalization unit. + loaded from, stored to, or atomically updated in the LDS divided by total duration. Does not take into account the execution mask of the wavefront when the instruction was executed. 
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent @@ -58,7 +62,7 @@ Panel Config: Access Rate: value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))) unit: Pct of Peak - Theoretical Bandwidth (% of Peak): + Theoretical Bandwidth Utilization: value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128))) unit: Pct of Peak @@ -116,12 +120,12 @@ Panel Config: units: Gbps Theoretical Bandwidth: avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) - unit: (Bytes + $normUnit) + / (End_Timestamp - Start_Timestamp))) + unit: Gbps LDS Latency: avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None)) diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1300_instruction_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1300_instruction_cache.yaml index a53c23691f..aeda9bc6c7 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1300_instruction_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1300_instruction_cache.yaml @@ -3,15 +3,18 @@ Panel Config: id: 1300 title: Instruction Cache metrics_description: - Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the - peak theoretical bandwidth. Calculated as the ratio of L1I requests over the - total L1I cycles. + Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent + of the peak theoretical bandwidth. 
Calculated as the ratio of L1I requests over + the total L1I cycles. Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously loaded line the cache. Calculated as the ratio of the number of L1I requests that hit over the number of all L1I requests. - L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\ - \ bandwidth achieved. Calculated as the ratio of the total number of requests\ - \ from the L1I to the L2 cache over the total L1I-L2 interface cycles." + L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\ + \ L2 cache request bandwidth achieved. Calculated as the ratio of the total\ + \ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\ + \ cycles." + L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface + divided by total duration. Req: The total number of requests made to the L1I per normalization-unit Hits: The total number of L1I requests that hit on a previously loaded cache line, per normalization-unit. 
@@ -30,7 +33,7 @@ Panel Config: value: Avg unit: Unit metric: - Bandwidth: + Bandwidth Utilization: value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -38,7 +41,7 @@ Panel Config: value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES) + SQC_ICACHE_MISSES_DUPLICATE))) unit: Pct of Peak - L1I-L2 Bandwidth: + L1I-L2 Bandwidth Utilization: value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -100,7 +103,7 @@ Panel Config: unit: Unit metric: L1I-L2 Bandwidth: - avg: AVG(((SQC_TC_INST_REQ * 64) / $denom)) - min: MIN(((SQC_TC_INST_REQ * 64) / $denom)) - max: MAX(((SQC_TC_INST_REQ * 64) / $denom)) - unit: (Bytes + $normUnit) + avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + unit: Gbps diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1400_scalar_l1_data_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1400_scalar_l1_data_cache.yaml index d43157ce8e..282b97ad1f 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1400_scalar_l1_data_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1400_scalar_l1_data_cache.yaml @@ -3,14 +3,17 @@ Panel Config: id: 1400 title: Scalar L1 Data Cache metrics_description: - Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the - peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the - total sL1D cycles. + Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent + of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests + over the total sL1D cycles. 
Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously loaded line the cache. The ratio of the number of sL1D requests that hit over the number of all sL1D requests. + sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface + bandwidth acheived.\ \ Caclulated as total number of bytes read from, written + to, or atomically updated\ \ across the sL1D - L2 interface. sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\ - \ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\ + \ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\ \ writes and atomics are typically unused on current CDNA accelerators, so in\ \ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth." Req: The total number of requests, of any size or type, made to the sL1D per normalization @@ -51,7 +54,7 @@ Panel Config: value: Avg unit: Unit metric: - Bandwidth: + Bandwidth Utilization: value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -60,7 +63,7 @@ Panel Config: + SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None)) unit: Pct of Peak - sL1D-L2 BW: + sL1D-L2 BW Utilization: value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp))) unit: Pct of Peak @@ -158,12 +161,12 @@ Panel Config: metric: sL1D-L2 BW: avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / 
$denom)) - unit: (Bytes + $normUnit) + * 64)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps Read Req: avg: AVG((SQC_TC_DATA_READ_REQ / $denom)) min: MIN((SQC_TC_DATA_READ_REQ / $denom)) diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1600_vector_l1_data_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1600_vector_l1_data_cache.yaml index a196aa64f0..f95e3fcb1f 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1600_vector_l1_data_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1600_vector_l1_data_cache.yaml @@ -5,12 +5,12 @@ Panel Config: metrics_description: Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D cache over the total number of cache line requests to the vL1D Cache RAM. - Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM - instructions, as a percent of the peak theoretical bandwidth achievable on the - specific accelerator. The number of bytes is calculated as the number of cache - lines requested multiplied by the cache line size. This value does not consider - partial requests, so for instance, if only a single value is requested in a - cache line, the data movement will still be counted as a full cache line. + Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result + of VMEM instructions, as a percent of the peak theoretical bandwidth achievable + on the specific accelerator. The number of bytes is calculated as the number + of cache lines requested multiplied by the cache line size. This value does + not consider partial requests, so for instance, if only a single value is requested + in a cache line, the data movement will still be counted as a full cache line. Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution. 
The number of cycles where the vL1D Cache RAM is actively processing any request divided by the number of cycles where the vL1D is active. @@ -42,11 +42,11 @@ Panel Config: Atomic Req: The total number of incoming atomic requests from the address processing unit after coalescing per normalization unit. Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM - instructions per normalization unit. The number of bytes is calculated as the - number of cache lines requested multiplied by the cache line size. This value - does not consider partial requests, so for instance, if only a single value - is requested in a cache line, the data movement will still be counted as a full - cache line. + instructions divided by total duration. The number of bytes is calculated as + the number of cache lines requested multiplied by the cache line size. This + value does not consider partial requests, so for instance, if only a single + value is requested in a cache line, the data movement will still be counted + as a full cache line. Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in vL1D cache over the total number of cache line requests to the vL1D Cache RAM. Cache Accesses: The total number of cache line lookups in the vL1D. @@ -57,7 +57,7 @@ Panel Config: command during the kernel's execution per normalization unit. This may be triggered by, for instance, the buffer_wbinvl1 instruction. L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result - of VMEM instructions, per normalization unit. The number of bytes is calculated + of VMEM instructions, divided by total duration. The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. 
This value does not consider partial requests, so for instance, if only a single value is requested in a cache line, the data movement will still be counted @@ -128,7 +128,7 @@ Panel Config: / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else None)) unit: Pct of Peak - Bandwidth: + Bandwidth Utilization: value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk / 1000) * 128) * $cu_per_gpu)) unit: Pct of Peak @@ -216,10 +216,10 @@ Panel Config: / $denom)) unit: (Req + $normUnit) Cache BW: - avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom)) - min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom)) - max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom)) - unit: (Bytes + $normUnit) + avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp))) + min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp))) + max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp))) + unit: Gbps Cache Hit Rate: avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) @@ -257,12 +257,12 @@ Panel Config: unit: (Req + $normUnit) L1-L2 BW: avg: AVG(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum - + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom)) + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp))) min: MIN(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum - + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom)) + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp))) max: MAX(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum - + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom)) - unit: (Bytes + $normUnit) + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - 
Start_Timestamp))) + unit: Gbps Tag RAM 0 Req: avg: AVG((TCP_TAGRAM0_REQ_sum / $denom)) min: MIN((TCP_TAGRAM0_REQ_sum / $denom)) diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1700_l2_cache.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1700_l2_cache.yaml index c354429c0e..15ba2f4745 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1700_l2_cache.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/1700_l2_cache.yaml @@ -20,8 +20,8 @@ Panel Config: HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth memory (HBM) per unit time. This value is calculated as the number of HBM channels multiplied by the HBM channel width multiplied by the HBM clock frequency. - Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per - normalization unit. + Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided + by total duration. HBM Read Traffic: The percent of read requests generated by the L2 cache that are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown does not consider the size of the request (meaning that 32B and 64B requests @@ -42,9 +42,9 @@ Panel Config: as a single request), so this metric only approximates the percent of the L2-Fabric read bandwidth directed to an uncached memory location. Write and Atomic BW: The total number of bytes written by the L2 over Infinity - Fabric by write and atomic operations per normalization unit. Note that on current - CDNA accelerators, such as the MI2XX, requests are only considered atomic by - Infinity Fabric if they are targeted at non-write-cacheable memory, for example, + Fabric by write and atomic operations divided by total duration. 
Note that on + current CDNA accelerators, such as the MI2XX, requests are only considered atomic + by Infinity Fabric if they are targeted at non-write-cacheable memory, for example, fine-grained memory allocations or uncached memory allocations on the MI2XX. HBM Write and Atomic Traffic: The percent of write and atomic requests generated by the L2 cache that are routed to the accelerator's local high-bandwidth memory @@ -82,17 +82,17 @@ Panel Config: Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity Fabric before a completion acknowledgement (atomic without return value) or data (atomic with return value) was returned to the L2. - Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit. + Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration. The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. This value does not consider partial requests, so for example, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests, - per normalization unit. + divided by total duration. Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests, - per normalization unit. + divided by total duration. Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests, - per normalization unit. + divided by total duration. Req: The total number of incoming requests to the L2 from all clients for all request types, per normalization unit. Read Req: The total number of read requests to the L2 from all clients. @@ -150,11 +150,11 @@ Panel Config: 64B of data from any source other than the accelerator's local HBM, per normalization unit. Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe - traffic, per normalization unit. 
+ traffic, divided by total duration. "Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read - requests due to Infinity Fabric traffic, per normalization unit. + requests due to Infinity Fabric traffic, divided by total duration. Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM - traffic, per normalization unit. + traffic, divided by total duration. Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to write or atomically update 32B of data to any memory location, per normalization unit. @@ -171,17 +171,17 @@ Panel Config: write or atomically update 32B or 64B of data in any memory location other than the accelerator's local HBM, per normalization unit. Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to - PCIe traffic, per normalization unit. + PCIe traffic, divided by total duration. "Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write - requests due to Infinity Fabric traffic, per normalization unit. + requests due to Infinity Fabric traffic, divided by total duration. Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM - traffic, per normalization unit. + traffic, divided by total duration. Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to - PCIe traffic, per normalization unit. + PCIe traffic, divided by total duration. "Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic - requests due to Infinity Fabric traffic, per normalization unit. + requests due to Infinity Fabric traffic, divided by total duration. Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to - HBM traffic, per normalization unit. + HBM traffic, divided by total duration. Atomic: The total number of L2 requests to Infinity Fabric to atomically update 32B or 64B of data in any memory location, per normalization unit. See Request flow for more detail. 
Note that on current CDNA accelerators, such as the MI2XX, @@ -257,12 +257,12 @@ Panel Config: metric: Read BW: avg: AVG((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) + - (TCC_EA0_RDREQ_128B_sum * 128)) / $denom)) + (TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp))) min: MIN((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) + - (TCC_EA0_RDREQ_128B_sum * 128)) / $denom)) + (TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp))) max: MAX((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) + - (TCC_EA0_RDREQ_128B_sum * 128)) / $denom)) - unit: (Bytes + $normUnit) + (TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps HBM Read Traffic: avg: AVG((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum != 0) else None)) @@ -289,12 +289,12 @@ Panel Config: unit: pct Write and Atomic BW: avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) - * 32)) / $denom)) + * 32)) / (End_Timestamp - Start_Timestamp))) min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) - * 32)) / $denom)) + * 32)) / (End_Timestamp - Start_Timestamp))) max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) - * 32)) / $denom)) - unit: (Bytes + $normUnit) + * 32)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps HBM Write and Atomic Traffic: avg: AVG((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum != 0) else None)) @@ -381,25 +381,25 @@ Panel Config: unit: Unit metric: Bandwidth: - avg: AVG((TCC_REQ_sum * 128) / $denom) - min: MIN((TCC_REQ_sum * 128) / $denom) - max: MAX((TCC_REQ_sum * 128) / $denom) - unit: (Bytes + $normUnit) + avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + unit: Gbps Read Bandwidth: - avg: 
AVG(TCC_READ_SECTORS_sum * 32/ $denom) - min: MIN(TCC_READ_SECTORS_sum * 32/ $denom) - max: MAX(TCC_READ_SECTORS_sum * 32/ $denom) - unit: (Bytes + $normUnit) + avg: AVG(TCC_READ_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp)) + min: MIN(TCC_READ_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp)) + max: MAX(TCC_READ_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp)) + unit: Gbps Write Bandwidth: - avg: AVG(TCC_WRITE_SECTORS_sum * 32/ $denom) - min: MIN(TCC_WRITE_SECTORS_sum * 32/ $denom) - max: MAX(TCC_WRITE_SECTORS_sum * 32/ $denom) - unit: (Bytes + $normUnit) + avg: AVG(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp)) + min: MIN(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp)) + max: MAX(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp)) + unit: Gbps Atomic Bandwidth: - avg: AVG(TCC_ATOMIC_SECTORS_sum * 32/ $denom) - min: MIN(TCC_ATOMIC_SECTORS_sum * 32/ $denom) - max: MAX(TCC_ATOMIC_SECTORS_sum * 32/ $denom) - unit: (Bytes + $normUnit) + avg: AVG(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp)) + min: MIN(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp)) + max: MAX(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp)) + unit: Gbps Req: avg: AVG((TCC_REQ_sum / $denom)) min: MIN((TCC_REQ_sum / $denom)) @@ -653,20 +653,20 @@ Panel Config: max: MAX((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom)) unit: (Req + $normUnit) Read Bandwidth - PCIe: - avg: AVG(TCC_EA0_RDREQ_IO_32B_sum * 32/ $denom) - min: MIN(TCC_EA0_RDREQ_IO_32B_sum * 32/ $denom) - max: MAX(TCC_EA0_RDREQ_IO_32B_sum * 32/ $denom) - unit: (Bytes + $normUnit) + avg: AVG(TCC_EA0_RDREQ_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + min: MIN(TCC_EA0_RDREQ_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + max: MAX(TCC_EA0_RDREQ_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + unit: Gbps "Read Bandwidth - Infinity Fabric\u2122": - avg: AVG(TCC_EA0_RDREQ_GMI_32B_sum * 32/ $denom) - min: 
MIN(TCC_EA0_RDREQ_GMI_32B_sum * 32/ $denom) - max: MAX(TCC_EA0_RDREQ_GMI_32B_sum * 32/ $denom) - unit: (Bytes + $normUnit) + avg: AVG(TCC_EA0_RDREQ_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + min: MIN(TCC_EA0_RDREQ_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + max: MAX(TCC_EA0_RDREQ_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + unit: Gbps Read Bandwidth - HBM: - avg: AVG(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ $denom) - min: MIN(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ $denom) - max: MAX(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ $denom) - unit: (Bytes + $normUnit) + avg: AVG(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + min: MIN(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + max: MAX(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + unit: Gbps Write and Atomic (32B): avg: AVG(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom)) min: MIN(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom)) @@ -693,20 +693,20 @@ Panel Config: max: MAX((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom)) unit: (Req + $normUnit) Write Bandwidth - PCIe: - avg: AVG(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ $denom) - min: MIN(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ $denom) - max: MAX(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ $denom) - unit: (Bytes + $normUnit) + avg: AVG(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + min: MIN(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + max: MAX(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + unit: Gbps "Write Bandwidth - Infinity Fabric\u2122": - avg: AVG(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ $denom) - min: MIN(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ $denom) - max: MAX(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ $denom) - unit: (Bytes + $normUnit) + avg: AVG(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + min: MIN(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ (End_Timestamp - 
Start_Timestamp)) + max: MAX(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + unit: Gbps Write Bandwidth - HBM: - avg: AVG(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ $denom) - min: MIN(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ $denom) - max: MAX(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ $denom) - unit: (Bytes + $normUnit) + avg: AVG(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + min: MIN(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + max: MAX(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + unit: Gbps Atomic: avg: AVG((TCC_EA0_ATOMIC_sum / $denom)) min: MIN((TCC_EA0_ATOMIC_sum / $denom)) @@ -718,17 +718,17 @@ Panel Config: max: MAX((TCC_EA0_WRREQ_ATOMIC_DRAM_sum / $denom)) unit: (Req + $normUnit) Atomic Bandwidth - PCIe: - avg: AVG(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ $denom) - min: MIN(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ $denom) - max: MAX(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ $denom) - unit: (Bytes + $normUnit) + avg: AVG(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + min: MIN(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + max: MAX(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + unit: Gbps "Atomic Bandwidth - Infinity Fabric\u2122": - avg: AVG(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ $denom) - min: MIN(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ $denom) - max: MAX(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ $denom) - unit: (Bytes + $normUnit) + avg: AVG(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + min: MIN(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + max: MAX(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + unit: Gbps Atomic Bandwidth - HBM: - avg: AVG(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ $denom) - min: MIN(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ $denom) - max: MAX(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 
32/ $denom) - unit: (Bytes + $normUnit) + avg: AVG(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + min: MIN(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + max: MAX(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + unit: Gbps diff --git a/projects/rocprofiler-compute/utils/autogen_hash.yaml b/projects/rocprofiler-compute/utils/autogen_hash.yaml index 2c50e5470b..a078e5122d 100644 --- a/projects/rocprofiler-compute/utils/autogen_hash.yaml +++ b/projects/rocprofiler-compute/utils/autogen_hash.yaml @@ -59,42 +59,42 @@ src/rocprof_compute_soc/analysis_configs/gfx940/1100_compute_units_compute_pipel src/rocprof_compute_soc/analysis_configs/gfx941/1100_compute_units_compute_pipeline.yaml: 4a25b6abf24f4a622fde1a3cfe65fe7236cf1e626fc2444667883997564cea1e src/rocprof_compute_soc/analysis_configs/gfx942/1100_compute_units_compute_pipeline.yaml: 4a25b6abf24f4a622fde1a3cfe65fe7236cf1e626fc2444667883997564cea1e src/rocprof_compute_soc/analysis_configs/gfx950/1100_compute_units_compute_pipeline.yaml: 4ef656938f8a9667ae872db522855856469accff9cb42bc0444b469346760dfd -src/rocprof_compute_soc/analysis_configs/gfx908/1200_local_data_share_lds.yaml: 80f3ca3ea15de009c5278ea20566d8c08d62e0087971e5f9aeae1c89df1dd898 -src/rocprof_compute_soc/analysis_configs/gfx90a/1200_local_data_share_lds.yaml: 80f3ca3ea15de009c5278ea20566d8c08d62e0087971e5f9aeae1c89df1dd898 -src/rocprof_compute_soc/analysis_configs/gfx940/1200_local_data_share_lds.yaml: 3bbf3928288990863cfe72fd00a28785fde0a36f103f5381df578aae2eb28be0 -src/rocprof_compute_soc/analysis_configs/gfx941/1200_local_data_share_lds.yaml: 3bbf3928288990863cfe72fd00a28785fde0a36f103f5381df578aae2eb28be0 -src/rocprof_compute_soc/analysis_configs/gfx942/1200_local_data_share_lds.yaml: 3bbf3928288990863cfe72fd00a28785fde0a36f103f5381df578aae2eb28be0 -src/rocprof_compute_soc/analysis_configs/gfx950/1200_local_data_share_lds.yaml: 
505163510a3b0132ee487f9e024188de2deb97d0f72e3d729b95f86e7c3434b3 -src/rocprof_compute_soc/analysis_configs/gfx908/1300_instruction_cache.yaml: 2437e2f8191675c4116d0da1db291f3ad2715281ea812e9fdd6506cf213e5d1b -src/rocprof_compute_soc/analysis_configs/gfx90a/1300_instruction_cache.yaml: 2437e2f8191675c4116d0da1db291f3ad2715281ea812e9fdd6506cf213e5d1b -src/rocprof_compute_soc/analysis_configs/gfx940/1300_instruction_cache.yaml: 2437e2f8191675c4116d0da1db291f3ad2715281ea812e9fdd6506cf213e5d1b -src/rocprof_compute_soc/analysis_configs/gfx941/1300_instruction_cache.yaml: 2437e2f8191675c4116d0da1db291f3ad2715281ea812e9fdd6506cf213e5d1b -src/rocprof_compute_soc/analysis_configs/gfx942/1300_instruction_cache.yaml: 2437e2f8191675c4116d0da1db291f3ad2715281ea812e9fdd6506cf213e5d1b -src/rocprof_compute_soc/analysis_configs/gfx950/1300_instruction_cache.yaml: 2437e2f8191675c4116d0da1db291f3ad2715281ea812e9fdd6506cf213e5d1b -src/rocprof_compute_soc/analysis_configs/gfx908/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5 -src/rocprof_compute_soc/analysis_configs/gfx90a/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5 -src/rocprof_compute_soc/analysis_configs/gfx940/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5 -src/rocprof_compute_soc/analysis_configs/gfx941/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5 -src/rocprof_compute_soc/analysis_configs/gfx942/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5 -src/rocprof_compute_soc/analysis_configs/gfx950/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5 +src/rocprof_compute_soc/analysis_configs/gfx908/1200_local_data_share_lds.yaml: f3f7a74e8b2915fe27eec7948f006f218a6b0a96c91b95cdff9e624b2c484bb2 
+src/rocprof_compute_soc/analysis_configs/gfx90a/1200_local_data_share_lds.yaml: f3f7a74e8b2915fe27eec7948f006f218a6b0a96c91b95cdff9e624b2c484bb2 +src/rocprof_compute_soc/analysis_configs/gfx940/1200_local_data_share_lds.yaml: f3f7a74e8b2915fe27eec7948f006f218a6b0a96c91b95cdff9e624b2c484bb2 +src/rocprof_compute_soc/analysis_configs/gfx941/1200_local_data_share_lds.yaml: f3f7a74e8b2915fe27eec7948f006f218a6b0a96c91b95cdff9e624b2c484bb2 +src/rocprof_compute_soc/analysis_configs/gfx942/1200_local_data_share_lds.yaml: f3f7a74e8b2915fe27eec7948f006f218a6b0a96c91b95cdff9e624b2c484bb2 +src/rocprof_compute_soc/analysis_configs/gfx950/1200_local_data_share_lds.yaml: 6333e18126bde83da4c66fd967531d394bd22e69c08358096b27168a9dc11a30 +src/rocprof_compute_soc/analysis_configs/gfx908/1300_instruction_cache.yaml: f60b9c657bece161e34219f3ada4041107dc5ca3d248590ee3b67e7bd400ff54 +src/rocprof_compute_soc/analysis_configs/gfx90a/1300_instruction_cache.yaml: f60b9c657bece161e34219f3ada4041107dc5ca3d248590ee3b67e7bd400ff54 +src/rocprof_compute_soc/analysis_configs/gfx940/1300_instruction_cache.yaml: f60b9c657bece161e34219f3ada4041107dc5ca3d248590ee3b67e7bd400ff54 +src/rocprof_compute_soc/analysis_configs/gfx941/1300_instruction_cache.yaml: f60b9c657bece161e34219f3ada4041107dc5ca3d248590ee3b67e7bd400ff54 +src/rocprof_compute_soc/analysis_configs/gfx942/1300_instruction_cache.yaml: f60b9c657bece161e34219f3ada4041107dc5ca3d248590ee3b67e7bd400ff54 +src/rocprof_compute_soc/analysis_configs/gfx950/1300_instruction_cache.yaml: f60b9c657bece161e34219f3ada4041107dc5ca3d248590ee3b67e7bd400ff54 +src/rocprof_compute_soc/analysis_configs/gfx908/1400_scalar_l1_data_cache.yaml: 29fac4ea38e4a018baffc4a27a720b47078fd890c10da307655d40f693e6f0e7 +src/rocprof_compute_soc/analysis_configs/gfx90a/1400_scalar_l1_data_cache.yaml: 29fac4ea38e4a018baffc4a27a720b47078fd890c10da307655d40f693e6f0e7 +src/rocprof_compute_soc/analysis_configs/gfx940/1400_scalar_l1_data_cache.yaml: 
29fac4ea38e4a018baffc4a27a720b47078fd890c10da307655d40f693e6f0e7 +src/rocprof_compute_soc/analysis_configs/gfx941/1400_scalar_l1_data_cache.yaml: 29fac4ea38e4a018baffc4a27a720b47078fd890c10da307655d40f693e6f0e7 +src/rocprof_compute_soc/analysis_configs/gfx942/1400_scalar_l1_data_cache.yaml: 29fac4ea38e4a018baffc4a27a720b47078fd890c10da307655d40f693e6f0e7 +src/rocprof_compute_soc/analysis_configs/gfx950/1400_scalar_l1_data_cache.yaml: 29fac4ea38e4a018baffc4a27a720b47078fd890c10da307655d40f693e6f0e7 src/rocprof_compute_soc/analysis_configs/gfx908/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 633d59aba82b3a495b7ba33fa4b2ae4da638b58632bcc37ff18be87af68ce4d4 src/rocprof_compute_soc/analysis_configs/gfx90a/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 2bdb9d7b3bea1057b3baee29ba3b428b211808261063a97bc4b6b319f4a19fb3 src/rocprof_compute_soc/analysis_configs/gfx940/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 3180c2f3266be0ff44e01d73d247ca43ae2ee18ecaf61765f58849e36c701b19 src/rocprof_compute_soc/analysis_configs/gfx941/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 3180c2f3266be0ff44e01d73d247ca43ae2ee18ecaf61765f58849e36c701b19 src/rocprof_compute_soc/analysis_configs/gfx942/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 3180c2f3266be0ff44e01d73d247ca43ae2ee18ecaf61765f58849e36c701b19 src/rocprof_compute_soc/analysis_configs/gfx950/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 9e56cef5b066fb575a5c530bcf9400f1291dd8636b12c8a2244cdba1defafc9f -src/rocprof_compute_soc/analysis_configs/gfx908/1600_vector_l1_data_cache.yaml: e6ec43014ce7b7cc072385d4eba072dd187b5de14979c169a3c1e9b8fc4c2762 -src/rocprof_compute_soc/analysis_configs/gfx90a/1600_vector_l1_data_cache.yaml: e6ec43014ce7b7cc072385d4eba072dd187b5de14979c169a3c1e9b8fc4c2762 -src/rocprof_compute_soc/analysis_configs/gfx940/1600_vector_l1_data_cache.yaml: 0e53921cc8d87a9adade250b9632fa42d33c825565152e37d6e56f45f83a3a28 
-src/rocprof_compute_soc/analysis_configs/gfx941/1600_vector_l1_data_cache.yaml: 0e53921cc8d87a9adade250b9632fa42d33c825565152e37d6e56f45f83a3a28 -src/rocprof_compute_soc/analysis_configs/gfx942/1600_vector_l1_data_cache.yaml: 0e53921cc8d87a9adade250b9632fa42d33c825565152e37d6e56f45f83a3a28 -src/rocprof_compute_soc/analysis_configs/gfx950/1600_vector_l1_data_cache.yaml: cd21327c193d2af8c18066b9c13f67e3d5dfb44731777bc5a1b6a7738c902dd1 -src/rocprof_compute_soc/analysis_configs/gfx908/1700_l2_cache.yaml: 5b48c690b6069a5610d07cc0c2a5e1da65a52296205dcf48a3b6fa5e3df36e9b -src/rocprof_compute_soc/analysis_configs/gfx90a/1700_l2_cache.yaml: a9b128267a069060e891533334c52586c706f145b1e813a4081cb21d425516ad -src/rocprof_compute_soc/analysis_configs/gfx940/1700_l2_cache.yaml: b4eea39f0e23e501ad503cdd96db377109c7f0e212949828fe06102de7355349 -src/rocprof_compute_soc/analysis_configs/gfx941/1700_l2_cache.yaml: da0189cd7f6e1ab4b79d0c054c2cdc1f7a9c81972dae9e5285f2f3d9c30ca644 -src/rocprof_compute_soc/analysis_configs/gfx942/1700_l2_cache.yaml: b0802f923052eb584ce138210ebf2db70fb7883926896da1861a9e857d4abe81 -src/rocprof_compute_soc/analysis_configs/gfx950/1700_l2_cache.yaml: 58bdd965421d610567e461becd7094fa41d668b119eddab99054d2bd6dc12acf +src/rocprof_compute_soc/analysis_configs/gfx908/1600_vector_l1_data_cache.yaml: 438d0f4a972dd341eb2485f51a47d6860fbb30a6169054cd8550b4b7226e199f +src/rocprof_compute_soc/analysis_configs/gfx90a/1600_vector_l1_data_cache.yaml: 438d0f4a972dd341eb2485f51a47d6860fbb30a6169054cd8550b4b7226e199f +src/rocprof_compute_soc/analysis_configs/gfx940/1600_vector_l1_data_cache.yaml: 6100b218f24de9f1433b39a093ed04b9bb9dfe656c5df77583c9db332c447230 +src/rocprof_compute_soc/analysis_configs/gfx941/1600_vector_l1_data_cache.yaml: 6100b218f24de9f1433b39a093ed04b9bb9dfe656c5df77583c9db332c447230 +src/rocprof_compute_soc/analysis_configs/gfx942/1600_vector_l1_data_cache.yaml: 6100b218f24de9f1433b39a093ed04b9bb9dfe656c5df77583c9db332c447230 
+src/rocprof_compute_soc/analysis_configs/gfx950/1600_vector_l1_data_cache.yaml: 67054ec0a4c6ca147a5dd40cc91f0e8e81378e1affe7d479274747579ecc524a +src/rocprof_compute_soc/analysis_configs/gfx908/1700_l2_cache.yaml: b1baa76f9dbfcc52d5e12cc1834102a0011ddf8bdece5be5fabc2945ab8971f4 +src/rocprof_compute_soc/analysis_configs/gfx90a/1700_l2_cache.yaml: 4d834a2066d7f2cb655a8e41fc17531282150b6fe64bbc9c5ff3a10acddee5af +src/rocprof_compute_soc/analysis_configs/gfx940/1700_l2_cache.yaml: 78f9fee5dafc83d311da1c801200c1820e16a0678dd0548fafa8a966ec6a94d5 +src/rocprof_compute_soc/analysis_configs/gfx941/1700_l2_cache.yaml: 51fe6e3888975b805594c2ab2b3147e717ae5e015468ee592cbcddc389c689bc +src/rocprof_compute_soc/analysis_configs/gfx942/1700_l2_cache.yaml: dc2dc9ff61b1747e492c28ef5ac76764fd75c18fd0827834130bc583f2afc619 +src/rocprof_compute_soc/analysis_configs/gfx950/1700_l2_cache.yaml: d181f753c3fff608c72b8015d1af30bfd8cf8cdfbc0a17c505f717ddaa3b1efc src/rocprof_compute_soc/analysis_configs/gfx908/1800_l2_cache_per_channel.yaml: a0c53202fe9f68d5e1fa689ce0643c471ced7d47e007d8ccc68fba294f7f6a05 src/rocprof_compute_soc/analysis_configs/gfx90a/1800_l2_cache_per_channel.yaml: a0c53202fe9f68d5e1fa689ce0643c471ced7d47e007d8ccc68fba294f7f6a05 src/rocprof_compute_soc/analysis_configs/gfx940/1800_l2_cache_per_channel.yaml: e184e3692eb0d641fb2e37fada0e58a6c4958553931d7c038b884e1e6986093f @@ -113,4 +113,4 @@ src/rocprof_compute_soc/profile_configs/sets/gfx940_sets.yaml: 44cd2b32b050cafa7 src/rocprof_compute_soc/profile_configs/sets/gfx941_sets.yaml: 44cd2b32b050cafa73d0ead5703b82836edf25a057c21699046b6b8b8918b242 src/rocprof_compute_soc/profile_configs/sets/gfx942_sets.yaml: 44cd2b32b050cafa73d0ead5703b82836edf25a057c21699046b6b8b8918b242 src/rocprof_compute_soc/profile_configs/sets/gfx950_sets.yaml: 238d9dc8a98cfead3fc904885bfe413e5bcb4f1af31e9820cd640388bcd1e1c2 -docs/data/metrics_description.yaml: 819c08a584ae8b418e6983aa51108b95e43eda4f3b7892eab336c61d844b20bf 
+docs/data/metrics_description.yaml: c2ddad7ef7973b128c1612e56cc6286e49c2f59af829b1795dc64b38c0ecfd61 diff --git a/projects/rocprofiler-compute/utils/unified_config.yaml b/projects/rocprofiler-compute/utils/unified_config.yaml index fb6286d7ab..0f3e89e781 100644 --- a/projects/rocprofiler-compute/utils/unified_config.yaml +++ b/projects/rocprofiler-compute/utils/unified_config.yaml @@ -7972,7 +7972,7 @@ panels: Access Rate: value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))) unit: Pct of Peak - Theoretical Bandwidth: + Theoretical Bandwidth Utilization: value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128))) @@ -7988,7 +7988,7 @@ panels: Access Rate: value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))) unit: Pct of Peak - Theoretical Bandwidth (% of Peak): + Theoretical Bandwidth Utilization: value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128))) @@ -8004,7 +8004,7 @@ panels: Access Rate: value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))) unit: Pct of Peak - Theoretical Bandwidth (% of Peak): + Theoretical Bandwidth Utilization: value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128))) @@ -8020,7 +8020,7 @@ panels: Access Rate: value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))) unit: Pct of Peak - Theoretical Bandwidth (% of Peak): + Theoretical Bandwidth Utilization: value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128))) @@ -8036,7 +8036,7 @@ panels: Access Rate: value: AVG(((200 * 
SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))) unit: Pct of Peak - Theoretical Bandwidth (% of Peak): + Theoretical Bandwidth Utilization: value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128))) @@ -8052,7 +8052,7 @@ panels: Access Rate: value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))) unit: Pct of Peak - Theoretical Bandwidth: + Theoretical Bandwidth Utilization: value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128))) @@ -8082,12 +8082,12 @@ panels: unit: (Instr + $normUnit) Theoretical Bandwidth: avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) - unit: (Bytes + $normUnit) + / (End_Timestamp - Start_Timestamp))) + unit: Gbps LDS Latency: avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None)) @@ -8143,12 +8143,12 @@ panels: unit: (Instr + $normUnit) Theoretical Bandwidth: avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) - unit: (Bytes + $normUnit) + / (End_Timestamp - Start_Timestamp))) + unit: Gbps LDS Latency: avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None)) @@ 
-8204,12 +8204,12 @@ panels: unit: (Instr + $normUnit) Theoretical Bandwidth: avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) - unit: (Bytes + $normUnit) + / (End_Timestamp - Start_Timestamp))) + unit: Gbps LDS Latency: avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None)) @@ -8265,12 +8265,12 @@ panels: unit: (Instr + $normUnit) Theoretical Bandwidth: avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) - unit: (Bytes + $normUnit) + / (End_Timestamp - Start_Timestamp))) + unit: Gbps LDS Latency: avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None)) @@ -8356,12 +8356,12 @@ panels: units: Gbps Theoretical Bandwidth: avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) - unit: (Bytes + $normUnit) + / (End_Timestamp - Start_Timestamp))) + unit: Gbps LDS Latency: avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None)) @@ -8427,12 +8427,12 @@ panels: unit: (Instr + $normUnit) Theoretical Bandwidth: avg: 
AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu)) - / $denom)) - unit: (Bytes + $normUnit) + / (End_Timestamp - Start_Timestamp))) + unit: Gbps LDS Latency: avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None)) @@ -8502,17 +8502,28 @@ panels: ` issuing :ref:`LDS ` instructions over the :ref:`total CU cycles `. unit: Percent - Theoretical Bandwidth: + Theoretical Bandwidth Utilization: plain: Indicates the maximum amount of bytes that could have been loaded from, - stored to, or atomically updated in the LDS per normalization unit. Does not - take into account the execution mask of the wavefront when the instruction + stored to, or atomically updated in the LDS, expressed as a percentage of the theoretical peak. + Does not take into account the execution mask of the wavefront when the instruction was executed. rst: Indicates the maximum amount of bytes that could have been loaded from, stored - to, or atomically updated in the LDS per :ref:`normalization unit `. + to, or atomically updated in the LDS, expressed as a percentage of the theoretical peak. Does *not* take into account the execution mask of the wavefront when the instruction was executed. See the :ref:`LDS bandwidth example ` for more detail. - unit: Bytes per normalization unit + unit: Percent + Theoretical Bandwidth: + plain: Indicates the maximum amount of bytes that could have been loaded from, + stored to, or atomically updated in the LDS divided by total duration. Does not + take into account the execution mask of the wavefront when the instruction + was executed.
+ rst: Indicates the maximum amount of bytes that could have been loaded from, stored + to, or atomically updated in the LDS divided by total duration. + Does *not* take into account the execution mask of the wavefront when the + instruction was executed. See the :ref:`LDS bandwidth example ` + for more detail. + unit: Gbps Bank Conflict Rate: plain: Indicates the percentage of active LDS cycles that were spent servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing bank @@ -8601,7 +8612,7 @@ panels: unit: Unit metric: gfx90a: - Bandwidth: + Bandwidth Utilization: value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -8609,12 +8620,12 @@ panels: value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES) + SQC_ICACHE_MISSES_DUPLICATE))) unit: Pct of Peak - L1I-L2 Bandwidth: + L1I-L2 Bandwidth Utilization: value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak gfx941: - Bandwidth: + Bandwidth Utilization: value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -8622,12 +8633,12 @@ panels: value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES) + SQC_ICACHE_MISSES_DUPLICATE))) unit: Pct of Peak - L1I-L2 Bandwidth: + L1I-L2 Bandwidth Utilization: value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak gfx940: - Bandwidth: + Bandwidth Utilization: value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -8635,12 +8646,12 @@ panels: value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES) + SQC_ICACHE_MISSES_DUPLICATE))) unit: Pct of Peak - L1I-L2 Bandwidth: + L1I-L2 Bandwidth Utilization: value: AVG(((SQC_TC_INST_REQ * 100000) / 
(2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak gfx942: - Bandwidth: + Bandwidth Utilization: value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -8648,12 +8659,12 @@ panels: value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES) + SQC_ICACHE_MISSES_DUPLICATE))) unit: Pct of Peak - L1I-L2 Bandwidth: + L1I-L2 Bandwidth Utilization: value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak gfx950: - Bandwidth: + Bandwidth Utilization: value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -8661,12 +8672,12 @@ panels: value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES) + SQC_ICACHE_MISSES_DUPLICATE))) unit: Pct of Peak - L1I-L2 Bandwidth: + L1I-L2 Bandwidth Utilization: value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak gfx908: - Bandwidth: + Bandwidth Utilization: value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -8674,7 +8685,7 @@ panels: value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES) + SQC_ICACHE_MISSES_DUPLICATE))) unit: Pct of Peak - L1I-L2 Bandwidth: + L1I-L2 Bandwidth Utilization: value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -8913,42 +8924,42 @@ panels: metric: gfx90a: L1I-L2 Bandwidth: - avg: AVG(((SQC_TC_INST_REQ * 64) / $denom)) - min: MIN(((SQC_TC_INST_REQ * 64) / $denom)) - max: MAX(((SQC_TC_INST_REQ * 64) / $denom)) - unit: (Bytes + $normUnit) + avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - 
Start_Timestamp))) + max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + unit: Gbps gfx941: L1I-L2 Bandwidth: - avg: AVG(((SQC_TC_INST_REQ * 64) / $denom)) - min: MIN(((SQC_TC_INST_REQ * 64) / $denom)) - max: MAX(((SQC_TC_INST_REQ * 64) / $denom)) - unit: (Bytes + $normUnit) + avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + unit: Gbps gfx940: L1I-L2 Bandwidth: - avg: AVG(((SQC_TC_INST_REQ * 64) / $denom)) - min: MIN(((SQC_TC_INST_REQ * 64) / $denom)) - max: MAX(((SQC_TC_INST_REQ * 64) / $denom)) - unit: (Bytes + $normUnit) + avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + unit: Gbps gfx942: L1I-L2 Bandwidth: - avg: AVG(((SQC_TC_INST_REQ * 64) / $denom)) - min: MIN(((SQC_TC_INST_REQ * 64) / $denom)) - max: MAX(((SQC_TC_INST_REQ * 64) / $denom)) - unit: (Bytes + $normUnit) + avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + unit: Gbps gfx950: L1I-L2 Bandwidth: - avg: AVG(((SQC_TC_INST_REQ * 64) / $denom)) - min: MIN(((SQC_TC_INST_REQ * 64) / $denom)) - max: MAX(((SQC_TC_INST_REQ * 64) / $denom)) - unit: (Bytes + $normUnit) + avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + unit: Gbps gfx908: L1I-L2 Bandwidth: - avg: AVG(((SQC_TC_INST_REQ * 64) / $denom)) - min: MIN(((SQC_TC_INST_REQ * 64) / $denom)) - max: MAX(((SQC_TC_INST_REQ * 64) / $denom)) - unit: (Bytes + $normUnit) + avg: 
AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp))) + unit: Gbps metrics_description: - Bandwidth: + Bandwidth Utilization: plain: The number of bytes looked up in the L1I cache, as a percent of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over the total L1I cycles. @@ -8964,7 +8975,7 @@ panels: the cache. Calculated as the ratio of the number of L1I requests that hit over the number of all L1I requests. unit: Percent - L1I-L2 Bandwidth: + L1I-L2 Bandwidth Utilization: plain: "The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth\ \ achieved. Calculated as the ratio of the total number of requests from the\ \ L1I to the L2 cache over the total L1I-L2 interface cycles." @@ -8972,6 +8983,10 @@ panels: \ achieved. Calculated as the ratio of the total number of requests from\ \ the L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles `." unit: Percent + L1I-L2 Bandwidth: + plain: Total number of bytes transferred across L1I - L2 interface divided by total duration. + rst: Total number of bytes transferred across L1I - L2 interface divided by total duration. 
+ unit: Gbps Req: plain: The total number of requests made to the L1I per normalization-unit rst: The total number of requests made to the L1I per normalization-unit @@ -9013,7 +9028,7 @@ panels: unit: Unit metric: gfx90a: - Bandwidth: + Bandwidth Utilization: value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -9022,12 +9037,12 @@ panels: + SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None)) unit: Pct of Peak - sL1D-L2 BW: + sL1D-L2 BW Utilization: value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp))) unit: Pct of Peak gfx941: - Bandwidth: + Bandwidth Utilization: value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -9036,12 +9051,12 @@ panels: + SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None)) unit: Pct of Peak - sL1D-L2 BW: + sL1D-L2 BW Utilization: value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp))) unit: Pct of Peak gfx940: - Bandwidth: + Bandwidth Utilization: value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -9050,12 +9065,12 @@ panels: + SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None)) unit: Pct of Peak - sL1D-L2 BW: + sL1D-L2 BW Utilization: value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp))) unit: Pct of Peak gfx942: - Bandwidth: + Bandwidth Utilization: value: AVG(((SQC_DCACHE_REQ * 100000) 
/ (($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -9064,12 +9079,12 @@ panels: + SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None)) unit: Pct of Peak - sL1D-L2 BW: + sL1D-L2 BW Utilization: value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp))) unit: Pct of Peak gfx950: - Bandwidth: + Bandwidth Utilization: value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -9078,12 +9093,12 @@ panels: + SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None)) unit: Pct of Peak - sL1D-L2 BW: + sL1D-L2 BW Utilization: value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp))) unit: Pct of Peak gfx908: - Bandwidth: + Bandwidth Utilization: value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))) unit: Pct of Peak @@ -9092,7 +9107,7 @@ panels: + SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None)) unit: Pct of Peak - sL1D-L2 BW: + sL1D-L2 BW Utilization: value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp))) unit: Pct of Peak @@ -9542,12 +9557,12 @@ panels: gfx90a: sL1D-L2 BW: avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) max: 
MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) - unit: (Bytes + $normUnit) + * 64)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps Read Req: avg: AVG((SQC_TC_DATA_READ_REQ / $denom)) min: MIN((SQC_TC_DATA_READ_REQ / $denom)) @@ -9571,12 +9586,12 @@ panels: gfx941: sL1D-L2 BW: avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) - unit: (Bytes + $normUnit) + * 64)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps Read Req: avg: AVG((SQC_TC_DATA_READ_REQ / $denom)) min: MIN((SQC_TC_DATA_READ_REQ / $denom)) @@ -9600,12 +9615,12 @@ panels: gfx940: sL1D-L2 BW: avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) - unit: (Bytes + $normUnit) + * 64)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps Read Req: avg: AVG((SQC_TC_DATA_READ_REQ / $denom)) min: MIN((SQC_TC_DATA_READ_REQ / $denom)) @@ -9629,12 +9644,12 @@ panels: gfx942: sL1D-L2 BW: avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) 
- * 64)) / $denom)) - unit: (Bytes + $normUnit) + * 64)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps Read Req: avg: AVG((SQC_TC_DATA_READ_REQ / $denom)) min: MIN((SQC_TC_DATA_READ_REQ / $denom)) @@ -9658,12 +9673,12 @@ panels: gfx950: sL1D-L2 BW: avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) - unit: (Bytes + $normUnit) + * 64)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps Read Req: avg: AVG((SQC_TC_DATA_READ_REQ / $denom)) min: MIN((SQC_TC_DATA_READ_REQ / $denom)) @@ -9687,12 +9702,12 @@ panels: gfx908: sL1D-L2 BW: avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) - * 64)) / $denom)) - unit: (Bytes + $normUnit) + * 64)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps Read Req: avg: AVG((SQC_TC_DATA_READ_REQ / $denom)) min: MIN((SQC_TC_DATA_READ_REQ / $denom)) @@ -9714,7 +9729,7 @@ panels: max: MAX((SQC_TC_STALL / $denom)) unit: (Cycles + $normUnit) metrics_description: - Bandwidth: + Bandwidth Utilization: plain: The number of bytes looked up in the sL1D cache, as a percent of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the total sL1D cycles. @@ -9730,18 +9745,26 @@ panels: the cache. The ratio of the number of sL1D requests that hit [#sl1d-cache]_ over the number of all sL1D requests. 
unit: Percent + sL1D-L2 BW Utilization: + plain: The percentage of the peak theoretical sL1D - L2 interface bandwidth achieved.\ + \ Calculated as total number of bytes read from, written to, or atomically updated\ + \ across the sL1D - L2 interface. + rst: The percentage of the peak theoretical sL1D - L2 interface bandwidth achieved.\ + \ Calculated as total number of bytes read from, written to, or atomically updated\ + \ across the sL1D - L2 interface. + unit: Percent sL1D-L2 BW: plain: "The total number of bytes read from, written to, or atomically updated\ - \ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\ + \ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\ \ writes and atomics are typically unused on current CDNA accelerators, so\ \ in the majority of cases this can be interpreted as an sL1D\u2192L2 read\ \ bandwidth." rst: "The total number of bytes read from, written to, or atomically updated\ - \ across the sL1D\u2194:doc:`L2 ` interface, per :ref:`normalization\ - \ unit `. Note that sL1D writes and atomics are typically\ + \ across the sL1D\u2194:doc:`L2 ` interface, divided by total duration.\ + \ Note that sL1D writes and atomics are typically\ \ unused on current CDNA accelerators, so in the majority of cases this can\ \ be interpreted as an sL1D\u2192L2 read bandwidth." - unit: Bytes per normalization unit + unit: Gbps Req: plain: The total number of requests, of any size or type, made to the sL1D per normalization unit.
@@ -10938,7 +10961,7 @@ panels: / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else None)) unit: Pct of Peak - Bandwidth: + Bandwidth Utilization: value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk / 1000) * 64) * $cu_per_gpu)) unit: Pct of Peak @@ -10957,7 +10980,7 @@ panels: / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else None)) unit: Pct of Peak - Bandwidth: + Bandwidth Utilization: value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk / 1000) * 128) * $cu_per_gpu)) unit: Pct of Peak @@ -10976,7 +10999,7 @@ panels: / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else None)) unit: Pct of Peak - Bandwidth: + Bandwidth Utilization: value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk / 1000) * 128) * $cu_per_gpu)) unit: Pct of Peak @@ -10995,7 +11018,7 @@ panels: / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else None)) unit: Pct of Peak - Bandwidth: + Bandwidth Utilization: value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk / 1000) * 128) * $cu_per_gpu)) unit: Pct of Peak @@ -11014,7 +11037,7 @@ panels: / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else None)) unit: Pct of Peak - Bandwidth: + Bandwidth Utilization: value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk / 1000) * 128) * $cu_per_gpu)) unit: Pct of Peak @@ -11033,7 +11056,7 @@ panels: / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else None)) unit: Pct of Peak - Bandwidth: + Bandwidth Utilization: value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk / 1000) * 64) * $cu_per_gpu)) unit: Pct of Peak @@ 
-11203,10 +11226,10 @@ panels: / $denom)) unit: (Req + $normUnit) Cache BW: - avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom)) - min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom)) - max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom)) - unit: (Bytes + $normUnit) + avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp))) + min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp))) + max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp))) + unit: Gbps Cache Hit Rate: avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) @@ -11244,12 +11267,12 @@ panels: unit: (Req + $normUnit) L1-L2 BW: avg: AVG(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) - + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom)) + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp))) min: MIN(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) - + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom)) + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp))) max: MAX(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) - + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom)) - unit: (Bytes + $normUnit) + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps L1-L2 Read: avg: AVG((TCP_TCC_READ_REQ_sum / $denom)) min: MIN((TCP_TCC_READ_REQ_sum / $denom)) @@ -11323,10 +11346,10 @@ panels: / $denom)) unit: (Req + $normUnit) Cache BW: - avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom)) - min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom)) - max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom)) - unit: (Bytes + $normUnit) + avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp))) + min: 
MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp))) + max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp))) + unit: Gbps Cache Hit Rate: avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) @@ -11365,14 +11388,14 @@ panels: L1-L2 BW: avg: AVG(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) min: MIN(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) max: MAX(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) - / $denom)) - unit: (Bytes + $normUnit) + / (End_Timestamp - Start_Timestamp))) + unit: Gbps L1-L2 Read: avg: AVG((TCP_TCC_READ_REQ_sum / $denom)) min: MIN((TCP_TCC_READ_REQ_sum / $denom)) @@ -11416,10 +11439,10 @@ panels: / $denom)) unit: (Req + $normUnit) Cache BW: - avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom)) - min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom)) - max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom)) - unit: (Bytes + $normUnit) + avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp))) + min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp))) + max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp))) + unit: Gbps Cache Hit Rate: avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) @@ -11458,14 +11481,14 @@ panels: L1-L2 BW: avg: AVG(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + 
TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) min: MIN(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) max: MAX(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) - / $denom)) - unit: (Bytes + $normUnit) + / (End_Timestamp - Start_Timestamp))) + unit: Gbps L1-L2 Read: avg: AVG((TCP_TCC_READ_REQ_sum / $denom)) min: MIN((TCP_TCC_READ_REQ_sum / $denom)) @@ -11509,10 +11532,10 @@ panels: / $denom)) unit: (Req + $normUnit) Cache BW: - avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom)) - min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom)) - max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom)) - unit: (Bytes + $normUnit) + avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp))) + min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp))) + max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp))) + unit: Gbps Cache Hit Rate: avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) @@ -11551,14 +11574,14 @@ panels: L1-L2 BW: avg: AVG(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) min: MIN(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) max: MAX(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) - / $denom)) - unit: (Bytes + $normUnit) + / (End_Timestamp - Start_Timestamp))) + 
unit: Gbps L1-L2 Read: avg: AVG((TCP_TCC_READ_REQ_sum / $denom)) min: MIN((TCP_TCC_READ_REQ_sum / $denom)) @@ -11602,10 +11625,10 @@ panels: / $denom)) unit: (Req + $normUnit) Cache BW: - avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom)) - min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom)) - max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom)) - unit: (Bytes + $normUnit) + avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp))) + min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp))) + max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp))) + unit: Gbps Cache Hit Rate: avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) @@ -11644,14 +11667,14 @@ panels: L1-L2 BW: avg: AVG(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) min: MIN(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) - / $denom)) + / (End_Timestamp - Start_Timestamp))) max: MAX(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) - / $denom)) - unit: (Bytes + $normUnit) + / (End_Timestamp - Start_Timestamp))) + unit: Gbps Tag RAM 0 Req: avg: AVG((TCP_TAGRAM0_REQ_sum / $denom)) min: MIN((TCP_TAGRAM0_REQ_sum / $denom)) @@ -11730,10 +11753,10 @@ panels: / $denom)) unit: (Req + $normUnit) Cache BW: - avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom)) - min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom)) - max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom)) - unit: (Bytes + $normUnit) + avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp))) + min: 
MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp))) + max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp))) + unit: Gbps Cache Hit Rate: avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) @@ -11771,12 +11794,12 @@ panels: unit: (Req + $normUnit) L1-L2 BW: avg: AVG(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) - + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom)) + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp))) min: MIN(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) - + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom)) + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp))) max: MAX(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) - + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom)) - unit: (Bytes + $normUnit) + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps L1-L2 Read: avg: AVG((TCP_TCC_READ_REQ_sum / $denom)) min: MIN((TCP_TCC_READ_REQ_sum / $denom)) @@ -12600,7 +12623,7 @@ panels: vL1D cache over the total number of cache line requests to the :ref:`vL1D Cache RAM `. unit: Percent - Bandwidth: + Bandwidth Utilization: plain: The number of bytes looked up in the vL1D cache as a result of VMEM instructions, as a percent of the peak theoretical bandwidth achievable on the specific accelerator. The number of bytes is calculated as the number of cache lines @@ -12700,18 +12723,18 @@ panels: unit: Requests per normalization unit Cache BW: plain: The number of bytes looked up in the vL1D cache as a result of VMEM instructions - per normalization unit. The number of bytes is calculated as the number of + divided by total duration. 
The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. This value does not consider partial requests, so for instance, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. - rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM - ` instructions per :ref:`normalization unit `. The - number of bytes is calculated as the number of cache lines requested multiplied - by the cache line size. This value does not consider partial requests, so - for instance, if only a single value is requested in a cache line, the data movement + rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM + ` instructions divided by total duration. The + number of bytes is calculated as the number of cache lines requested multiplied + by the cache line size. This value does not consider partial requests, so + for instance, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. - unit: Bytes per normalization unit + unit: Gbps Cache Hit Rate: plain: The ratio of the number of vL1D cache line requests that hit in vL1D cache over the total number of cache line requests to the vL1D Cache RAM. @@ -12741,18 +12764,18 @@ panels: unit: Invalidations per normalization unit L1-L2 BW: plain: The number of bytes transferred across the vL1D-L2 interface as a result - of VMEM instructions, per normalization unit. The number of bytes is calculated + of VMEM instructions, divided by total duration. The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. This value does not consider partial requests, so for instance, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. 
rst: The number of bytes transferred across the vL1D-L2 interface as a result of - :ref:`VMEM ` instructions, per :ref:`normalization unit `. + :ref:`VMEM ` instructions, divided by total duration. The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. This value does not consider partial requests, so for instance, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. - unit: Bytes per normalization unit + unit: Gbps L1-L2 Read: plain: The number of read requests for a vL1D cache line that were not satisfied by the vL1D and must be retrieved from the to the L2 Cache per normalization @@ -13064,12 +13087,12 @@ panels: gfx90a: Read BW: avg: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) min: MIN((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) max: MAX((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) - * 64)) / $denom)) - unit: (Bytes + $normUnit) + * 64)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps HBM Read Traffic: avg: AVG((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None)) @@ -13096,12 +13119,12 @@ panels: unit: pct Write and Atomic BW: avg: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) - * 32)) / $denom)) + * 32)) / (End_Timestamp - Start_Timestamp))) min: MIN((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) - * 32)) / $denom)) + * 32)) / (End_Timestamp - Start_Timestamp))) max: MAX((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) - * 32)) / $denom)) - unit: (Bytes + $normUnit) + * 32)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps HBM Write and Atomic Traffic: avg: AVG((100 * (TCC_EA_WRREQ_DRAM_sum / 
TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None)) @@ -13161,12 +13184,12 @@ panels: gfx941: Read BW: avg: AVG((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) min: MIN((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) max: MAX((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) - * 64)) / $denom)) - unit: (Bytes + $normUnit) + * 64)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps HBM Read Traffic: avg: AVG((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum != 0) else None)) @@ -13193,12 +13216,12 @@ panels: unit: pct Write and Atomic BW: avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) - * 32)) / $denom)) + * 32)) / (End_Timestamp - Start_Timestamp))) min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) - * 32)) / $denom)) + * 32)) / (End_Timestamp - Start_Timestamp))) max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) - * 32)) / $denom)) - unit: (Bytes + $normUnit) + * 32)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps HBM Write and Atomic Traffic: avg: AVG((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum != 0) else None)) @@ -13258,12 +13281,12 @@ panels: gfx940: Read BW: avg: AVG((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) min: MIN((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) max: MAX((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) - * 64)) / $denom)) - unit: (Bytes + $normUnit) + * 64)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps HBM Read 
Traffic: avg: AVG((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum != 0) else None)) @@ -13290,12 +13313,12 @@ panels: unit: pct Write and Atomic BW: avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) - * 32)) / $denom)) + * 32)) / (End_Timestamp - Start_Timestamp))) min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) - * 32)) / $denom)) + * 32)) / (End_Timestamp - Start_Timestamp))) max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) - * 32)) / $denom)) - unit: (Bytes + $normUnit) + * 32)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps HBM Write and Atomic Traffic: avg: AVG((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum != 0) else None)) @@ -13355,12 +13378,12 @@ panels: gfx942: Read BW: avg: AVG(((128 * TCC_BUBBLE_sum + 64 * (TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - - TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / $denom)) + - TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / (End_Timestamp - Start_Timestamp))) min: MIN(((128 * TCC_BUBBLE_sum + 64 * (TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - - TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / $denom)) + - TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / (End_Timestamp - Start_Timestamp))) max: MAX(((128 * TCC_BUBBLE_sum + 64 * (TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - - TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / $denom)) - unit: (Bytes + $normUnit) + - TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / (End_Timestamp - Start_Timestamp))) + unit: Gbps HBM Read Traffic: avg: AVG((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum != 0) else None)) @@ -13387,12 +13410,12 @@ panels: unit: pct Write and Atomic BW: avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) - * 32)) / $denom)) + * 32)) / (End_Timestamp - Start_Timestamp))) min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + 
((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) - * 32)) / $denom)) + * 32)) / (End_Timestamp - Start_Timestamp))) max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) - * 32)) / $denom)) - unit: (Bytes + $normUnit) + * 32)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps HBM Write and Atomic Traffic: avg: AVG((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum != 0) else None)) @@ -13452,12 +13475,12 @@ panels: gfx950: Read BW: avg: AVG((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) - + (TCC_EA0_RDREQ_128B_sum * 128)) / $denom)) + + (TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp))) min: MIN((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) - + (TCC_EA0_RDREQ_128B_sum * 128)) / $denom)) + + (TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp))) max: MAX((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) - + (TCC_EA0_RDREQ_128B_sum * 128)) / $denom)) - unit: (Bytes + $normUnit) + + (TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps HBM Read Traffic: avg: AVG((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum != 0) else None)) @@ -13484,12 +13507,12 @@ panels: unit: pct Write and Atomic BW: avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) - * 32)) / $denom)) + * 32)) / (End_Timestamp - Start_Timestamp))) min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) - * 32)) / $denom)) + * 32)) / (End_Timestamp - Start_Timestamp))) max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) - * 32)) / $denom)) - unit: (Bytes + $normUnit) + * 32)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps HBM Write and Atomic Traffic: avg: AVG((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum != 0) else None)) @@ -13568,12 +13591,12 @@ panels: gfx908: Read BW: avg: 
AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) min: MIN((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) - * 64)) / $denom)) + * 64)) / (End_Timestamp - Start_Timestamp))) max: MAX((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) - * 64)) / $denom)) - unit: (Bytes + $normUnit) + * 64)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps HBM Read Traffic: avg: AVG((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None)) @@ -13600,12 +13623,12 @@ panels: unit: pct Write and Atomic BW: avg: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) - * 32)) / $denom)) + * 32)) / (End_Timestamp - Start_Timestamp))) min: MIN((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) - * 32)) / $denom)) + * 32)) / (End_Timestamp - Start_Timestamp))) max: MAX((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) - * 32)) / $denom)) - unit: (Bytes + $normUnit) + * 32)) / (End_Timestamp - Start_Timestamp))) + unit: Gbps HBM Write and Atomic Traffic: avg: AVG((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None)) @@ -13674,10 +13697,10 @@ panels: metric: gfx90a: Bandwidth: - avg: AVG((TCC_REQ_sum * 128) / $denom) - min: MIN((TCC_REQ_sum * 128) / $denom) - max: MAX((TCC_REQ_sum * 128) / $denom) - unit: (Bytes + $normUnit) + avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + unit: Gbps Req: avg: AVG((TCC_REQ_sum / $denom)) min: MIN((TCC_REQ_sum / $denom)) @@ -13773,10 +13796,10 @@ panels: unit: (Req + $normUnit) gfx941: Bandwidth: - avg: AVG((TCC_REQ_sum * 128) / $denom) - min: MIN((TCC_REQ_sum * 128) / $denom) - max: MAX((TCC_REQ_sum * 128) / $denom) - unit: 
(Bytes + $normUnit) + avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + unit: Gbps Req: avg: AVG((TCC_REQ_sum / $denom)) min: MIN((TCC_REQ_sum / $denom)) @@ -13872,10 +13895,10 @@ panels: unit: (Req + $normUnit) gfx940: Bandwidth: - avg: AVG((TCC_REQ_sum * 128) / $denom) - min: MIN((TCC_REQ_sum * 128) / $denom) - max: MAX((TCC_REQ_sum * 128) / $denom) - unit: (Bytes + $normUnit) + avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + unit: Gbps Req: avg: AVG((TCC_REQ_sum / $denom)) min: MIN((TCC_REQ_sum / $denom)) @@ -13971,10 +13994,10 @@ panels: unit: (Req + $normUnit) gfx942: Bandwidth: - avg: AVG((TCC_REQ_sum * 128) / $denom) - min: MIN((TCC_REQ_sum * 128) / $denom) - max: MAX((TCC_REQ_sum * 128) / $denom) - unit: (Bytes + $normUnit) + avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + unit: Gbps Req: avg: AVG((TCC_REQ_sum / $denom)) min: MIN((TCC_REQ_sum / $denom)) @@ -14070,25 +14093,25 @@ panels: unit: (Req + $normUnit) gfx950: Bandwidth: - avg: AVG((TCC_REQ_sum * 128) / $denom) - min: MIN((TCC_REQ_sum * 128) / $denom) - max: MAX((TCC_REQ_sum * 128) / $denom) - unit: (Bytes + $normUnit) + avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)) + unit: Gbps Read Bandwidth: - avg: AVG(TCC_READ_SECTORS_sum * 32/ $denom) - min: MIN(TCC_READ_SECTORS_sum * 32/ $denom) - max: MAX(TCC_READ_SECTORS_sum * 32/ $denom) - unit: (Bytes + $normUnit) + avg: AVG(TCC_READ_SECTORS_sum 
* 32/ (End_Timestamp - Start_Timestamp)) + min: MIN(TCC_READ_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp)) + max: MAX(TCC_READ_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp)) + unit: Gbps Write Bandwidth: - avg: AVG(TCC_WRITE_SECTORS_sum * 32/ $denom) - min: MIN(TCC_WRITE_SECTORS_sum * 32/ $denom) - max: MAX(TCC_WRITE_SECTORS_sum * 32/ $denom) - unit: (Bytes + $normUnit) + avg: AVG(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp)) + min: MIN(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp)) + max: MAX(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp)) + unit: Gbps Atomic Bandwidth: - avg: AVG(TCC_ATOMIC_SECTORS_sum * 32/ $denom) - min: MIN(TCC_ATOMIC_SECTORS_sum * 32/ $denom) - max: MAX(TCC_ATOMIC_SECTORS_sum * 32/ $denom) - unit: (Bytes + $normUnit) + avg: AVG(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp)) + min: MIN(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp)) + max: MAX(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp)) + unit: Gbps Req: avg: AVG((TCC_REQ_sum / $denom)) min: MIN((TCC_REQ_sum / $denom)) @@ -14194,10 +14217,10 @@ panels: unit: (Req + $normUnit) gfx908: Bandwidth: - avg: AVG((TCC_REQ_sum * 64) / $denom) - min: MIN((TCC_REQ_sum * 64) / $denom) - max: MAX((TCC_REQ_sum * 64) / $denom) - unit: (Bytes + $normUnit) + avg: AVG((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp)) + min: MIN((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp)) + max: MAX((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp)) + unit: Gbps Req: avg: AVG((TCC_REQ_sum / $denom)) min: MIN((TCC_REQ_sum / $denom)) @@ -14736,20 +14759,20 @@ panels: max: MAX((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom)) unit: (Req + $normUnit) Read Bandwidth - PCIe: - avg: AVG(TCC_EA0_RDREQ_IO_32B_sum * 32/ $denom) - min: MIN(TCC_EA0_RDREQ_IO_32B_sum * 32/ $denom) - max: MAX(TCC_EA0_RDREQ_IO_32B_sum * 32/ $denom) - unit: (Bytes + $normUnit) + avg: AVG(TCC_EA0_RDREQ_IO_32B_sum 
* 32/ (End_Timestamp - Start_Timestamp)) + min: MIN(TCC_EA0_RDREQ_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + max: MAX(TCC_EA0_RDREQ_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + unit: Gbps "Read Bandwidth - Infinity Fabric\u2122": - avg: AVG(TCC_EA0_RDREQ_GMI_32B_sum * 32/ $denom) - min: MIN(TCC_EA0_RDREQ_GMI_32B_sum * 32/ $denom) - max: MAX(TCC_EA0_RDREQ_GMI_32B_sum * 32/ $denom) - unit: (Bytes + $normUnit) + avg: AVG(TCC_EA0_RDREQ_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + min: MIN(TCC_EA0_RDREQ_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + max: MAX(TCC_EA0_RDREQ_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + unit: Gbps Read Bandwidth - HBM: - avg: AVG(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ $denom) - min: MIN(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ $denom) - max: MAX(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ $denom) - unit: (Bytes + $normUnit) + avg: AVG(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + min: MIN(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + max: MAX(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + unit: Gbps Write and Atomic (32B): avg: AVG(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom)) min: MIN(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom)) @@ -14776,20 +14799,20 @@ panels: max: MAX((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom)) unit: (Req + $normUnit) Write Bandwidth - PCIe: - avg: AVG(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ $denom) - min: MIN(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ $denom) - max: MAX(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ $denom) - unit: (Bytes + $normUnit) + avg: AVG(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + min: MIN(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + max: MAX(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + unit: Gbps "Write Bandwidth - Infinity Fabric\u2122": - avg: AVG(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ 
$denom) - min: MIN(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ $denom) - max: MAX(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ $denom) - unit: (Bytes + $normUnit) + avg: AVG(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + min: MIN(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + max: MAX(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + unit: Gbps Write Bandwidth - HBM: - avg: AVG(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ $denom) - min: MIN(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ $denom) - max: MAX(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ $denom) - unit: (Bytes + $normUnit) + avg: AVG(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + min: MIN(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + max: MAX(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + unit: Gbps Atomic: avg: AVG((TCC_EA0_ATOMIC_sum / $denom)) min: MIN((TCC_EA0_ATOMIC_sum / $denom)) @@ -14801,20 +14824,20 @@ panels: max: MAX((TCC_EA0_WRREQ_ATOMIC_DRAM_sum / $denom)) unit: (Req + $normUnit) Atomic Bandwidth - PCIe: - avg: AVG(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ $denom) - min: MIN(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ $denom) - max: MAX(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ $denom) - unit: (Bytes + $normUnit) + avg: AVG(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + min: MIN(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + max: MAX(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + unit: Gbps "Atomic Bandwidth - Infinity Fabric\u2122": - avg: AVG(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ $denom) - min: MIN(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ $denom) - max: MAX(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ $denom) - unit: (Bytes + $normUnit) + avg: AVG(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + min: MIN(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ (End_Timestamp - 
Start_Timestamp)) + max: MAX(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + unit: Gbps Atomic Bandwidth - HBM: - avg: AVG(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ $denom) - min: MIN(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ $denom) - max: MAX(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ $denom) - unit: (Bytes + $normUnit) + avg: AVG(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + min: MIN(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + max: MAX(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp)) + unit: Gbps gfx908: Read (32B): avg: AVG((TCC_EA_RDREQ_32B_sum / $denom)) @@ -14920,11 +14943,9 @@ panels: channels multiplied by the HBM channel width multiplied by the HBM clock frequency. unit: GB/s Read BW: - plain: The total number of bytes read by the L2 cache from Infinity Fabric per - normalization unit. - rst: The total number of bytes read by the L2 cache from Infinity Fabric per :ref:`normalization - unit `. - unit: Bytes per normalization unit + plain: The total number of bytes read by the L2 cache from Infinity Fabric divided by total duration. + rst: The total number of bytes read by the L2 cache from Infinity Fabric divided by total duration. + unit: Gbps HBM Read Traffic: plain: The percent of read requests generated by the L2 cache that are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown does @@ -14972,17 +14993,17 @@ panels: unit: Percent Write and Atomic BW: plain: The total number of bytes written by the L2 over Infinity Fabric by write - and atomic operations per normalization unit. Note that on current CDNA accelerators, + and atomic operations divided by total duration. Note that on current CDNA accelerators, such as the MI2XX, requests are only considered atomic by Infinity Fabric if they are targeted at non-write-cacheable memory, for example, fine-grained memory allocations or uncached memory allocations on the MI2XX. 
rst: The total number of bytes written by the L2 over Infinity Fabric by write and - atomic operations per :ref:`normalization unit `. Note + atomic operations divided by total duration. Note that on current CDNA accelerators, such as the :ref:`MI2XX `, requests are only considered *atomic* by Infinity Fabric if they are targeted at non-write-cacheable memory, for example, :ref:`fine-grained memory ` allocations or :ref:`uncached memory ` allocations on the MI2XX. - unit: Bytes per normalization unit + unit: Gbps HBM Write and Atomic Traffic: plain: The percent of write and atomic requests generated by the L2 cache that are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown @@ -15074,36 +15095,36 @@ panels: (atomic with return value) was returned to the L2. unit: Cycles Bandwidth: - plain: The number of bytes looked up in the L2 cache, per normalization unit. + plain: The number of bytes looked up in the L2 cache, divided by total duration. The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. This value does not consider partial requests, so for example, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. - rst: The number of bytes looked up in the L2 cache, per :ref:`normalization - unit `. The number of bytes is calculated as the number - of cache lines requested multiplied by the cache line size. This value does - not consider partial requests, so for example, if only a single value is + rst: The number of bytes looked up in the L2 cache, divided by total duration. + The number of bytes is calculated as the number of cache lines requested + multiplied by the cache line size. This value does + not consider partial requests, so for example, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. 
- unit: Bytes per normalization unit + unit: Gbps Read Bandwidth: plain: Total number of bytes looked up in the L2 cache for read requests, - per normalization unit. + divided by total duration. rst: Total number of bytes looked up in the L2 cache for read requests, - per :ref:`normalization unit `. - unit: Bytes per normalization unit + divided by total duration. + unit: Gbps Write Bandwidth: plain: Total number of bytes looked up in the L2 cache for write requests, - per normalization unit. + divided by total duration. rst: Total number of bytes looked up in the L2 cache for write requests, - per :ref:`normalization unit `. - unit: Bytes per normalization unit + divided by total duration. + unit: Gbps Atomic Bandwidth: plain: Total number of bytes looked up in the L2 cache for atomic requests, - per normalization unit. + divided by total duration. rst: Total number of bytes looked up in the L2 cache for atomic requests, - per :ref:`normalization unit `. - unit: Bytes per normalization unit + divided by total duration. + unit: Gbps Req: plain: The total number of incoming requests to the L2 from all clients for all request types, per normalization unit. @@ -15276,17 +15297,17 @@ panels: unit `. See :ref:`l2-request-flow` for more detail. unit: Requests per normalization unit Read Bandwidth - PCIe: - plain: Total number of bytes due to L2 read requests due to PCIe traffic, per normalization unit. - rst: Total number of bytes due to L2 read requests due to PCIe traffic, per normalization unit. - unit: Bytes per normalization unit + plain: Total number of bytes due to L2 read requests due to PCIe traffic, divided by total duration. + rst: Total number of bytes due to L2 read requests due to PCIe traffic, divided by total duration. + unit: Gbps "Read Bandwidth - Infinity Fabric\u2122": - plain: Total number of bytes due to L2 read requests due to Infinity Fabric traffic, per normalization unit. 
- rst: Total number of bytes due to L2 read requests due to Infinity Fabric traffic, per normalization unit. - unit: Bytes per normalization unit + plain: Total number of bytes due to L2 read requests due to Infinity Fabric traffic, divided by total duration. + rst: Total number of bytes due to L2 read requests due to Infinity Fabric traffic, divided by total duration. + unit: Gbps Read Bandwidth - HBM: - plain: Total number of bytes due to L2 read requests due to HBM traffic, per normalization unit. - rst: Total number of bytes due to L2 read requests due to HBM traffic, per normalization unit. - unit: Bytes per normalization unit + plain: Total number of bytes due to L2 read requests due to HBM traffic, divided by total duration. + rst: Total number of bytes due to L2 read requests due to HBM traffic, divided by total duration. + unit: Gbps Write and Atomic (32B): plain: The total number of L2 requests to Infinity Fabric to write or atomically update 32B of data to any memory location, per normalization unit. @@ -15326,29 +15347,29 @@ panels: for more detail. unit: Requests per normalization unit Write Bandwidth - PCIe: - plain: Total number of bytes due to L2 write requests due to PCIe traffic, per normalization unit. - rst: Total number of bytes due to L2 write requests due to PCIe traffic, per normalization unit. - unit: Bytes per normalization unit + plain: Total number of bytes due to L2 write requests due to PCIe traffic, divided by total duration. + rst: Total number of bytes due to L2 write requests due to PCIe traffic, divided by total duration. + unit: Gbps "Write Bandwidth - Infinity Fabric\u2122": - plain: Total number of bytes due to L2 write requests due to Infinity Fabric traffic, per normalization unit. - rst: Total number of bytes due to L2 write requests due to Infinity Fabric traffic, per normalization unit. 
- unit: Bytes per normalization unit + plain: Total number of bytes due to L2 write requests due to Infinity Fabric traffic, divided by total duration. + rst: Total number of bytes due to L2 write requests due to Infinity Fabric traffic, divided by total duration. + unit: Gbps Write Bandwidth - HBM: - plain: Total number of bytes due to L2 write requests due to HBM traffic, per normalization unit. - rst: Total number of bytes due to L2 write requests due to HBM traffic, per normalization unit. - unit: Bytes per normalization unit + plain: Total number of bytes due to L2 write requests due to HBM traffic, divided by total duration. + rst: Total number of bytes due to L2 write requests due to HBM traffic, divided by total duration. + unit: Gbps Atomic Bandwidth - PCIe: - plain: Total number of bytes due to L2 atomic requests due to PCIe traffic, per normalization unit. - rst: Total number of bytes due to L2 atomic requests due to PCIe traffic, per normalization unit. - unit: Bytes per normalization unit + plain: Total number of bytes due to L2 atomic requests due to PCIe traffic, divided by total duration. + rst: Total number of bytes due to L2 atomic requests due to PCIe traffic, divided by total duration. + unit: Gbps "Atomic Bandwidth - Infinity Fabric\u2122": - plain: Total number of bytes due to L2 atomic requests due to Infinity Fabric traffic, per normalization unit. - rst: Total number of bytes due to L2 atomic requests due to Infinity Fabric traffic, per normalization unit. - unit: Bytes per normalization unit + plain: Total number of bytes due to L2 atomic requests due to Infinity Fabric traffic, divided by total duration. + rst: Total number of bytes due to L2 atomic requests due to Infinity Fabric traffic, divided by total duration. + unit: Gbps Atomic Bandwidth - HBM: - plain: Total number of bytes due to L2 atomic requests due to HBM traffic, per normalization unit. 
- rst: Total number of bytes due to L2 atomic requests due to HBM traffic, per normalization unit. - unit: Bytes per normalization unit + plain: Total number of bytes due to L2 atomic requests due to HBM traffic, divided by total duration. + rst: Total number of bytes due to L2 atomic requests due to HBM traffic, divided by total duration. + unit: Gbps Atomic: plain: The total number of L2 requests to Infinity Fabric to atomically update 32B or 64B of data in any memory location, per normalization unit. See Request