Update Unit of Bandwidth metrics to Gbps (#96)

* Add Utilization to metric name for Bandwidth related metrics whose Unit
  is Percent

* Update Unit of Bandwidth metrics to Gbps
    * Update metric Formula to use total duration as denominator instead of normalization unit.
    * Update metric Description
    * Update metric Unit

* Update CHANGELOG
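
The formula change described above (total duration as the denominator instead of the normalization unit) can be sketched as follows. This is an illustrative sketch, not the profiler's own code: the function name `bandwidth_per_duration` and the sample values are hypothetical, and it assumes `Start_Timestamp` and `End_Timestamp` are in nanoseconds, so the result is bytes per nanosecond, which this commit labels `Gbps`.

```python
from typing import Optional


def bandwidth_per_duration(total_bytes: float,
                           start_timestamp_ns: float,
                           end_timestamp_ns: float) -> Optional[float]:
    """Bytes moved divided by the total kernel duration.

    Assumption: timestamps are in nanoseconds, so the returned value is
    bytes per nanosecond. Returns None when the duration is not positive,
    to avoid dividing by zero.
    """
    duration_ns = end_timestamp_ns - start_timestamp_ns
    if duration_ns <= 0:
        return None
    return total_bytes / duration_ns


# Hypothetical dispatch: 1,000,000 cache-line requests of 64 bytes each,
# executed over 2 ms (2,000,000 ns).
requests = 1_000_000
print(bandwidth_per_duration(requests * 64, 0, 2_000_000))  # 32.0
```

Unlike the previous `$denom`-based value, the result no longer depends on which normalization unit is selected.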
systems-assistant[bot]
2025-08-06 18:39:50 -04:00
committed by GitHub
parent a10d897a69
commit 89c74ac3d3
34 changed files with 1088 additions and 988 deletions
@@ -27,6 +27,26 @@ Full documentation for ROCm Compute Profiler is available at [https://rocm.docs.
* Change the basic view of TUI from aggregated analysis data to individual kernel analysis data
* Update `Unit` of the following `Bandwidth` related metrics to `Gbps` instead of `Bytes per Normalization Unit`
* Theoretical Bandwidth (section 1202)
* L1I-L2 Bandwidth (section 1303)
* sL1D-L2 BW (section 1403)
* Cache BW (section 1603)
* L1-L2 BW (section 1603)
* Read BW (section 1702)
* Write and Atomic BW (section 1702)
* Bandwidth (section 1703)
* Atomic/Read/Write Bandwidth (section 1703)
* Atomic/Read/Write Bandwidth - (HBM/PCIe/Infinity Fabric) (section 1706)
* Add `Utilization` to metric name for the following `Bandwidth` related metrics whose `Unit` is `Percent`
* Theoretical Bandwidth Utilization (section 1201)
* L1I-L2 Bandwidth Utilization (section 1301)
* Bandwidth Utilization (section 1301)
* Bandwidth Utilization (section 1401)
* sL1D-L2 BW Utilization (section 1401)
* Bandwidth Utilization (section 1601)
### Resolved issues
* Fixed an issue where the memory clock was not detected when using amd-smi
@@ -397,13 +397,13 @@ LDS Speed-of-Light:
over the number of LDS cycles that would have been required to move the same
amount of data in an uncontended access. [#lds-bank-conflict]_
unit: Percent
Theoretical Bandwidth:
Theoretical Bandwidth Utilization:
rst: Indicates the maximum amount of bytes that could have been loaded from, stored
to, or atomically updated in the LDS per :ref:`normalization unit <normalization-units>`.
to, or atomically updated in the LDS, as a percentage of the theoretical peak.
Does *not* take into account the execution mask of the wavefront when the instruction
was executed. See the :ref:`LDS bandwidth example <lds-bandwidth>` for more
detail.
unit: Bytes per normalization unit
unit: Percent
Utilization:
rst: Indicates what percent of the kernel's duration the :ref:`LDS <desc-lds>` was
actively executing instructions (including, but not limited to, load, store,
@@ -450,17 +450,16 @@ LDS Statistics:
unit: Accesses per normalization unit
Theoretical Bandwidth:
rst: Indicates the maximum amount of bytes that could have been loaded from, stored
to, or atomically updated in the LDS per :ref:`normalization unit <normalization-units>`.
Does *not* take into account the execution mask of the wavefront when the instruction
was executed. See the :ref:`LDS bandwidth example <lds-bandwidth>` for more
detail.
unit: Bytes per normalization unit
to, or atomically updated in the LDS divided by total duration. Does *not* take
into account the execution mask of the wavefront when the instruction was executed.
See the :ref:`LDS bandwidth example <lds-bandwidth>` for more detail.
unit: Gbps
Unaligned Stall:
rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>` due
to stalls from non-dword aligned addresses per :ref:`normalization unit <normalization-units>`.
unit: Cycles per normalization unit
vL1D Speed-of-Light:
Bandwidth:
Bandwidth Utilization:
rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM
<desc-vmem>` instructions, as a percent of the peak theoretical bandwidth achievable
on the specific accelerator. The number of bytes is calculated as the number
@@ -614,13 +613,13 @@ vL1D cache access metrics:
rst: The total number of cache line lookups in the vL1D.
unit: Cache lines
Cache BW:
rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM
<desc-vmem>` instructions per :ref:`normalization unit <normalization-units>`. The
number of bytes is calculated as the number of cache lines requested multiplied
by the cache line size. This value does not consider partial requests, so
for instance, if only a single value is requested in a cache line, the data movement
will still be counted as a full cache line.
unit: Bytes per normalization unit
rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM
<desc-vmem>` instructions divided by total duration. The number of bytes is
calculated as the number of cache lines requested multiplied by the cache line
size. This value does not consider partial requests, so for instance, if only
a single value is requested in a cache line, the data movement will still be
counted as a full cache line.
unit: Gbps
Cache Hit Rate:
rst: The ratio of the number of vL1D cache line requests that hit in vL1D cache
over the total number of cache line requests to the :ref:`vL1D Cache RAM <desc-tc>`.
@@ -646,12 +645,12 @@ vL1D cache access metrics:
unit: Requests per normalization unit
L1-L2 BW:
rst: The number of bytes transferred across the vL1D-L2 interface as a result of
:ref:`VMEM <desc-vmem>` instructions, per :ref:`normalization unit <normalization-units>`.
The number of bytes is calculated as the number of cache lines requested multiplied
by the cache line size. This value does not consider partial requests, so for instance,
:ref:`VMEM <desc-vmem>` instructions, divided by total duration. The number
of bytes is calculated as the number of cache lines requested multiplied by
the cache line size. This value does not consider partial requests, so for instance,
if only a single value is requested in a cache line, the data movement will
still be counted as a full cache line.
unit: Bytes per normalization unit
unit: Gbps
L1-L2 Read:
rst: The number of read requests for a vL1D cache line that were not satisfied by
the vL1D and must be retrieved from the :doc:`L2 Cache <l2-cache>` per :ref:`normalization
@@ -761,20 +760,20 @@ L2 Speed-of-Light:
unit: Percent
L2 cache accesses:
Atomic Bandwidth:
rst: Total number of bytes looked up in the L2 cache for atomic requests, per
:ref:`normalization unit <normalization-units>`.
unit: Bytes per normalization unit
rst: Total number of bytes looked up in the L2 cache for atomic requests, divided
by total duration.
unit: Gbps
Atomic Req:
rst: The total number of atomic requests (with and without return) to the L2 from
all clients.
unit: Requests per normalization unit
Bandwidth:
rst: The number of bytes looked up in the L2 cache, per :ref:`normalization unit
<normalization-units>`. The number of bytes is calculated as the number of
cache lines requested multiplied by the cache line size. This value does not
consider partial requests, so for example, if only a single value is requested
in a cache line, the data movement will still be counted as a full cache line.
unit: Bytes per normalization unit
rst: The number of bytes looked up in the L2 cache, divided by total duration.
The number of bytes is calculated as the number of cache lines requested multiplied
by the cache line size. This value does not consider partial requests, so for
example, if only a single value is requested in a cache line, the data movement will
still be counted as a full cache line.
unit: Gbps
CC Req:
rst: The total number of requests to the L2 that go to Coherently Cacheable (CC) memory
allocations. See the :ref:`memory-type` for more information.
@@ -818,9 +817,9 @@ L2 cache accesses:
allocations. See the :ref:`memory-type` for more information.
unit: Requests per normalization unit
Read Bandwidth:
rst: Total number of bytes looked up in the L2 cache for read requests, per :ref:`normalization
unit <normalization-units>`.
unit: Bytes per normalization unit
rst: Total number of bytes looked up in the L2 cache for read requests, divided
by total duration.
unit: Gbps
Read Req:
rst: 'The total number of read requests to the L2 from all clients. '
unit: Requests per normalization unit
@@ -841,9 +840,9 @@ L2 cache accesses:
See the :ref:`memory-type` for more information.
unit: Requests per normalization unit
Write Bandwidth:
rst: Total number of bytes looked up in the L2 cache for write requests, per :ref:`normalization
unit <normalization-units>`.
unit: Bytes per normalization unit
rst: Total number of bytes looked up in the L2 cache for write requests, divided
by total duration.
unit: Gbps
Write Req:
rst: The total number of write requests to the L2 from all clients.
unit: Requests per normalization unit
@@ -896,9 +895,9 @@ L2-Fabric interface metrics:
memory <memory-type>` allocations.
unit: Percent
Read BW:
rst: The total number of bytes read by the L2 cache from Infinity Fabric per :ref:`normalization
unit <normalization-units>`.
unit: Bytes per normalization unit
rst: The total number of bytes read by the L2 cache from Infinity Fabric divided
by total duration.
unit: Gbps
Read Latency:
rst: The time-averaged number of cycles read requests spent in Infinity Fabric before
data was returned to the L2.
@@ -954,12 +953,12 @@ L2-Fabric interface metrics:
unit: Percent
Write and Atomic BW:
rst: The total number of bytes written by the L2 over Infinity Fabric by write and
atomic operations per :ref:`normalization unit <normalization-units>`. Note
that on current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests
are only considered *atomic* by Infinity Fabric if they are targeted at non-write-cacheable
memory, for example, :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached
atomic operations divided by total duration. Note that on current CDNA accelerators,
such as the :ref:`MI2XX <mixxx-note>`, requests are only considered *atomic*
by Infinity Fabric if they are targeted at non-write-cacheable memory, for
example, :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached
memory <memory-type>` allocations on the MI2XX.
unit: Bytes per normalization unit
unit: Gbps
Write and Atomic Latency:
rst: The time-averaged number of cycles write requests spent in Infinity Fabric
before a completion acknowledgement was returned to the L2.
@@ -975,17 +974,17 @@ L2 - Fabric interface detailed metrics:
memory <memory-type>` allocations on the MI2XX.
unit: Requests per normalization unit
Atomic Bandwidth - HBM:
rst: Total number of bytes due to L2 atomic requests due to HBM traffic, per normalization
unit.
unit: Bytes per normalization unit
rst: Total number of bytes due to L2 atomic requests due to HBM traffic, divided
by total duration.
unit: Gbps
"Atomic Bandwidth - Infinity Fabric\u2122":
rst: Total number of bytes due to L2 atomic requests due to Infinity Fabric traffic,
per normalization unit.
unit: Bytes per normalization unit
divided by total duration.
unit: Gbps
Atomic Bandwidth - PCIe:
rst: Total number of bytes due to L2 atomic requests due to PCIe traffic, per
normalization unit.
unit: Bytes per normalization unit
rst: Total number of bytes due to L2 atomic requests due to PCIe traffic, divided
by total duration.
unit: Gbps
HBM Read:
rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data
from the accelerator's local HBM, per :ref:`normalization unit <normalization-units>`.
@@ -1013,17 +1012,17 @@ L2 - Fabric interface detailed metrics:
uncached data requests. See :ref:`l2-request-flow` for more detail.
unit: Requests per normalization unit
Read Bandwidth - HBM:
rst: Total number of bytes due to L2 read requests due to HBM traffic, per normalization
unit.
unit: Bytes per normalization unit
rst: Total number of bytes due to L2 read requests due to HBM traffic, divided
by total duration.
unit: Gbps
"Read Bandwidth - Infinity Fabric\u2122":
rst: Total number of bytes due to L2 read requests due to Infinity Fabric traffic,
per normalization unit.
unit: Bytes per normalization unit
divided by total duration.
unit: Gbps
Read Bandwidth - PCIe:
rst: Total number of bytes due to L2 read requests due to PCIe traffic, per normalization
unit.
unit: Bytes per normalization unit
rst: Total number of bytes due to L2 read requests due to PCIe traffic, divided
by total duration.
unit: Gbps
Remote Read:
rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data
from any source other than the accelerator's local HBM, per :ref:`normalization
@@ -1036,17 +1035,17 @@ L2 - Fabric interface detailed metrics:
for more detail.
unit: Requests per normalization unit
Write Bandwidth - HBM:
rst: Total number of bytes due to L2 write requests due to HBM traffic, per normalization
unit.
unit: Bytes per normalization unit
rst: Total number of bytes due to L2 write requests due to HBM traffic, divided
by total duration.
unit: Gbps
"Write Bandwidth - Infinity Fabric\u2122":
rst: Total number of bytes due to L2 write requests due to Infinity Fabric traffic,
per normalization unit.
unit: Bytes per normalization unit
divided by total duration.
unit: Gbps
Write Bandwidth - PCIe:
rst: Total number of bytes due to L2 write requests due to PCIe traffic, per normalization
unit.
unit: Bytes per normalization unit
rst: Total number of bytes due to L2 write requests due to PCIe traffic, divided
by total duration.
unit: Gbps
Write and Atomic (32B):
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
32B of data to any memory location, per :ref:`normalization unit <normalization-units>`.
@@ -1098,7 +1097,7 @@ L2 - Fabric Interface stalls:
of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
unit: Percent
Scalar L1D Speed-of-Light:
Bandwidth:
Bandwidth Utilization:
rst: The number of bytes looked up in the sL1D cache, as a percent of the peak theoretical
bandwidth. Calculated as the ratio of sL1D requests over the :ref:`total sL1D
cycles <total-sl1d-cycles>`.
@@ -1108,13 +1107,11 @@ Scalar L1D Speed-of-Light:
the cache. The ratio of the number of sL1D requests that hit [#sl1d-cache]_
over the number of all sL1D requests.
unit: Percent
sL1D-L2 BW:
rst: "The total number of bytes read from, written to, or atomically updated \
\ across the sL1D\u2194:doc:`L2 <l2-cache>` interface, per :ref:`normalization\
\ unit <normalization-units>`. Note that sL1D writes and atomics are typically\
\ unused on current CDNA accelerators, so in the majority of cases this can\
\ be interpreted as an sL1D\u2192L2 read bandwidth."
unit: Bytes per normalization unit
sL1D-L2 BW Utilization:
rst: The percentage of the peak theoretical sL1D - L2 interface bandwidth achieved.
Calculated from the total number of bytes read from, written to, or atomically updated
across the sL1D - L2 interface.
unit: Percent
Scalar L1D cache accesses:
Atomic Req:
rst: The total number of atomic requests from sL1D to the :doc:`L2 <l2-cache>`,
@@ -1189,13 +1186,13 @@ Scalar L1D Cache - L2 Interface:
unit: Requests per normalization unit
sL1D-L2 BW:
rst: "The total number of bytes read from, written to, or atomically updated \
\ across the sL1D\u2194:doc:`L2 <l2-cache>` interface, per :ref:`normalization\
\ unit <normalization-units>`. Note that sL1D writes and atomics are typically\
\ unused on current CDNA accelerators, so in the majority of cases this can\
\ be interpreted as an sL1D\u2192L2 read bandwidth."
unit: Bytes per normalization unit
\ across the sL1D\u2194:doc:`L2 <l2-cache>` interface, divided by total duration.\
\ Note that sL1D writes and atomics are typically unused on current CDNA accelerators,\
\ so in the majority of cases this can be interpreted as an sL1D\u2192L2 read\
\ bandwidth."
unit: Gbps
L1I Speed-of-Light:
Bandwidth:
Bandwidth Utilization:
rst: The number of bytes looked up in the L1I cache, as a percent of the peak theoretical
bandwidth. Calculated as the ratio of L1I requests over the :ref:`total L1I
cycles <total-l1i-cycles>`.
@@ -1205,7 +1202,7 @@ L1I Speed-of-Light:
the cache. Calculated as the ratio of the number of L1I requests that hit over
the number of all L1I requests.
unit: Percent
L1I-L2 Bandwidth:
L1I-L2 Bandwidth Utilization:
rst: "The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth\
\ achieved. Calculated as the ratio of the total number of requests from the\
\ L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`."
@@ -1238,10 +1235,9 @@ L1I cache accesses:
unit: Requests per normalization unit
L1I <-> L2 interface:
L1I-L2 Bandwidth:
rst: "The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth\
\ achieved. Calculated as the ratio of the total number of requests from the\
\ L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`."
unit: Percent
rst: Total number of bytes transferred across L1I - L2 interface divided by total
duration.
unit: Gbps
Workgroup manager utilizations:
Accelerator Utilization:
rst: The percent of cycles in the kernel where the accelerator was actively doing
@@ -11,8 +11,12 @@ Panel Config:
instructions, averaged over the lifetime of the kernel. Calculated as the ratio
of the total number of cycles spent by the scheduler issuing LDS instructions
over the total CU cycles.
Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
could have been loaded from, stored to, or atomically updated in the LDS, as a
percentage of the theoretical peak. Does not take into account the execution
mask of the wavefront when the instruction was executed.
Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
loaded from, stored to, or atomically updated in the LDS per normalization unit.
loaded from, stored to, or atomically updated in the LDS divided by total duration.
Does not take into account the execution mask of the wavefront when the instruction
was executed.
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
@@ -58,7 +62,7 @@ Panel Config:
Access Rate:
value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
unit: Pct of Peak
Theoretical Bandwidth:
Theoretical Bandwidth Utilization:
value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
/ (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
unit: Pct of Peak
@@ -86,12 +90,12 @@ Panel Config:
unit: (Instr + $normUnit)
Theoretical Bandwidth:
avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
/ $denom))
/ (End_Timestamp - Start_Timestamp)))
min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
/ $denom))
/ (End_Timestamp - Start_Timestamp)))
max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
/ $denom))
unit: (Bytes + $normUnit)
/ (End_Timestamp - Start_Timestamp)))
unit: Gbps
LDS Latency:
avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
None))
@@ -3,15 +3,18 @@ Panel Config:
id: 1300
title: Instruction Cache
metrics_description:
Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the
peak theoretical bandwidth. Calculated as the ratio of L1I requests over the
total L1I cycles.
Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
the total L1I cycles.
Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
loaded line in the cache. Calculated as the ratio of the number of L1I requests
that hit over the number of all L1I requests.
L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\
\ bandwidth achieved. Calculated as the ratio of the total number of requests\
\ from the L1I to the L2 cache over the total L1I-L2 interface cycles."
L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\
\ L2 cache request bandwidth achieved. Calculated as the ratio of the total\
\ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\
\ cycles."
L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
divided by total duration.
Req: The total number of requests made to the L1I per normalization-unit
Hits: The total number of L1I requests that hit on a previously loaded cache line,
per normalization-unit.
@@ -30,7 +33,7 @@ Panel Config:
value: Avg
unit: Unit
metric:
Bandwidth:
Bandwidth Utilization:
value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
- Start_Timestamp))))
unit: Pct of Peak
@@ -38,7 +41,7 @@ Panel Config:
value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
+ SQC_ICACHE_MISSES_DUPLICATE)))
unit: Pct of Peak
L1I-L2 Bandwidth:
L1I-L2 Bandwidth Utilization:
value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu)
* (End_Timestamp - Start_Timestamp))))
unit: Pct of Peak
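
The `L1I-L2 Bandwidth Utilization` expression above can be read as a percent-of-peak calculation. The sketch below mirrors that expression for a single dispatch, without the `AVG` aggregation. It is illustrative only: the function name and sample values are hypothetical, and the units (`$max_sclk` in MHz, timestamps in nanoseconds) are assumptions inferred from the `100000` scaling factor, not stated in this diff.

```python
from typing import Optional


def l1i_l2_bandwidth_utilization(sqc_tc_inst_req: float,
                                 max_sclk: float,
                                 sqc_per_gpu: int,
                                 start_timestamp: float,
                                 end_timestamp: float) -> Optional[float]:
    """Percent of peak L1I-L2 request bandwidth for one dispatch.

    Mirrors: (SQC_TC_INST_REQ * 100000)
             / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp))
    Returns None when the denominator is zero.
    """
    duration = end_timestamp - start_timestamp
    denominator = 2 * (max_sclk * sqc_per_gpu) * duration
    if denominator == 0:
        return None
    return (sqc_tc_inst_req * 100000) / denominator


# Hypothetical values: 4200 requests, 2100 MHz clock, 10 SQCs, 1000 ns duration.
print(l1i_l2_bandwidth_utilization(4200, 2100, 10, 0, 1000))  # 10.0
```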
@@ -100,7 +103,7 @@ Panel Config:
unit: Unit
metric:
L1I-L2 Bandwidth:
avg: AVG(((SQC_TC_INST_REQ * 64) / $denom))
min: MIN(((SQC_TC_INST_REQ * 64) / $denom))
max: MAX(((SQC_TC_INST_REQ * 64) / $denom))
unit: (Bytes + $normUnit)
avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
@@ -3,14 +3,17 @@ Panel Config:
id: 1400
title: Scalar L1 Data Cache
metrics_description:
Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the
peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the
total sL1D cycles.
Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
over the total sL1D cycles.
Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
loaded line in the cache. The ratio of the number of sL1D requests that hit over
the number of all sL1D requests.
sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
bandwidth achieved. Calculated from the total number of bytes read from, written
to, or atomically updated across the sL1D - L2 interface.
sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
\ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\
\ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\
\ writes and atomics are typically unused on current CDNA accelerators, so in\
\ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
Req: The total number of requests, of any size or type, made to the sL1D per normalization
@@ -51,7 +54,7 @@ Panel Config:
value: Avg
unit: Unit
metric:
Bandwidth:
Bandwidth Utilization:
value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
- Start_Timestamp))))
unit: Pct of Peak
@@ -60,7 +63,7 @@ Panel Config:
+ SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES
+ SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
unit: Pct of Peak
sL1D-L2 BW:
sL1D-L2 BW Utilization:
value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
* 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))
unit: Pct of Peak
@@ -158,12 +161,12 @@ Panel Config:
metric:
sL1D-L2 BW:
avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
* 64)) / $denom))
* 64)) / (End_Timestamp - Start_Timestamp)))
min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
* 64)) / $denom))
* 64)) / (End_Timestamp - Start_Timestamp)))
max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
* 64)) / $denom))
unit: (Bytes + $normUnit)
* 64)) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
Read Req:
avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
min: MIN((SQC_TC_DATA_READ_REQ / $denom))
@@ -5,12 +5,12 @@ Panel Config:
metrics_description:
Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
cache over the total number of cache line requests to the vL1D Cache RAM.
Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM
instructions, as a percent of the peak theoretical bandwidth achievable on the
specific accelerator. The number of bytes is calculated as the number of cache
lines requested multiplied by the cache line size. This value does not consider
partial requests, so for instance, if only a single value is requested in a
cache line, the data movement will still be counted as a full cache line.
Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
on the specific accelerator. The number of bytes is calculated as the number
of cache lines requested multiplied by the cache line size. This value does
not consider partial requests, so for instance, if only a single value is requested
in a cache line, the data movement will still be counted as a full cache line.
Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
The number of cycles where the vL1D Cache RAM is actively processing any request
divided by the number of cycles where the vL1D is active.
@@ -42,11 +42,11 @@ Panel Config:
Atomic Req: The total number of incoming atomic requests from the address processing
unit after coalescing per normalization unit.
Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
instructions per normalization unit. The number of bytes is calculated as the
number of cache lines requested multiplied by the cache line size. This value
does not consider partial requests, so for instance, if only a single value
is requested in a cache line, the data movement will still be counted as a full
cache line.
instructions divided by total duration. The number of bytes is calculated as
the number of cache lines requested multiplied by the cache line size. This
value does not consider partial requests, so for instance, if only a single
value is requested in a cache line, the data movement will still be counted
as a full cache line.
Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
Cache Accesses: The total number of cache line lookups in the vL1D.
@@ -57,7 +57,7 @@ Panel Config:
command during the kernel's execution per normalization unit. This may be triggered
by, for instance, the buffer_wbinvl1 instruction.
L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
of VMEM instructions, per normalization unit. The number of bytes is calculated
of VMEM instructions, divided by total duration. The number of bytes is calculated
as the number of cache lines requested multiplied by the cache line size. This
value does not consider partial requests, so for instance, if only a single
value is requested in a cache line, the data movement will still be counted
@@ -128,7 +128,7 @@ Panel Config:
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
else None))
unit: Pct of Peak
Bandwidth:
Bandwidth Utilization:
value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp
- Start_Timestamp)))) / ((($max_sclk / 1000) * 64) * $cu_per_gpu))
unit: Pct of Peak
@@ -201,10 +201,10 @@ Panel Config:
/ $denom))
unit: (Req + $normUnit)
Cache BW:
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
unit: (Bytes + $normUnit)
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
Cache Hit Rate:
avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
@@ -242,12 +242,12 @@ Panel Config:
unit: (Req + $normUnit)
L1-L2 BW:
avg: AVG(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
min: MIN(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
max: MAX(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
unit: (Bytes + $normUnit)
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
L1-L2 Read:
avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
min: MIN((TCP_TCC_READ_REQ_sum / $denom))
@@ -20,8 +20,8 @@ Panel Config:
HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
memory (HBM) per unit time. This value is calculated as the number of HBM channels
multiplied by the HBM channel width multiplied by the HBM clock frequency.
Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per
normalization unit.
Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided
by total duration.
HBM Read Traffic: The percent of read requests generated by the L2 cache that
are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
does not consider the size of the request (meaning that 32B and 64B requests
@@ -42,9 +42,9 @@ Panel Config:
as a single request), so this metric only approximates the percent of the L2-Fabric
read bandwidth directed to an uncached memory location.
Write and Atomic BW: The total number of bytes written by the L2 over Infinity
Fabric by write and atomic operations per normalization unit. Note that on current
CDNA accelerators, such as the MI2XX, requests are only considered atomic by
Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
Fabric by write and atomic operations divided by total duration. Note that on
current CDNA accelerators, such as the MI2XX, requests are only considered atomic
by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
fine-grained memory allocations or uncached memory allocations on the MI2XX.
HBM Write and Atomic Traffic: The percent of write and atomic requests generated
by the L2 cache that are routed to the accelerator's local high-bandwidth memory
@@ -82,17 +82,17 @@ Panel Config:
Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
Fabric before a completion acknowledgement (atomic without return value) or
data (atomic with return value) was returned to the L2.
Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit.
Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration.
The number of bytes is calculated as the number of cache lines requested multiplied
by the cache line size. This value does not consider partial requests, so for
example, if only a single value is requested in a cache line, the data movement
will still be counted as a full cache line.
Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests,
per normalization unit.
divided by total duration.
Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests,
per normalization unit.
divided by total duration.
Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests,
per normalization unit.
divided by total duration.
Req: The total number of incoming requests to the L2 from all clients for all
request types, per normalization unit.
Read Req: The total number of read requests to the L2 from all clients.
@@ -150,11 +150,11 @@ Panel Config:
64B of data from any source other than the accelerator's local HBM, per normalization
unit.
Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe
traffic, per normalization unit.
traffic, divided by total duration.
"Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read
requests due to Infinity Fabric traffic, per normalization unit.
requests due to Infinity Fabric traffic, divided by total duration.
Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM
traffic, per normalization unit.
traffic, divided by total duration.
Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
write or atomically update 32B of data to any memory location, per normalization
unit.
@@ -171,17 +171,17 @@ Panel Config:
write or atomically update 32B or 64B of data in any memory location other than
the accelerator's local HBM, per normalization unit.
Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to
PCIe traffic, per normalization unit.
PCIe traffic, divided by total duration.
"Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write
requests due to Infinity Fabric traffic, per normalization unit.
requests due to Infinity Fabric traffic, divided by total duration.
Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM
traffic, per normalization unit.
traffic, divided by total duration.
Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to
PCIe traffic, per normalization unit.
PCIe traffic, divided by total duration.
"Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic
requests due to Infinity Fabric traffic, per normalization unit.
requests due to Infinity Fabric traffic, divided by total duration.
Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
HBM traffic, per normalization unit.
HBM traffic, divided by total duration.
Atomic: The total number of L2 requests to Infinity Fabric to atomically update
32B or 64B of data in any memory location, per normalization unit. See Request
flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
@@ -257,12 +257,12 @@ Panel Config:
metric:
Read BW:
avg: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
* 64)) / $denom))
* 64)) / (End_Timestamp - Start_Timestamp)))
min: MIN((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
* 64)) / $denom))
* 64)) / (End_Timestamp - Start_Timestamp)))
max: MAX((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
* 64)) / $denom))
unit: (Bytes + $normUnit)
* 64)) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
HBM Read Traffic:
avg: AVG((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
!= 0) else None))
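The updated `Read BW` formula above divides a byte count by the kernel duration instead of by `$denom`. A minimal sketch of that arithmetic in Python follows. The function and parameter names are hypothetical, and the sketch assumes the timestamps are in nanoseconds, which the diff does not state. Under that assumption the quotient is bytes per nanosecond, which is numerically gigabytes per second; whether the `Gbps` label means exactly that is not confirmed by the diff.

```python
def read_bw_bytes_per_ns(rdreq_sum, rdreq_32b_sum, start_ts_ns, end_ts_ns):
    """Hypothetical mirror of the Read BW expression for one kernel dispatch.

    32B requests contribute 32 bytes each; all remaining read requests are
    counted as 64 bytes, as in the formula in the diff.
    """
    duration = end_ts_ns - start_ts_ns  # assumed to be nanoseconds
    if duration <= 0:
        return None  # guard added for the sketch; not part of the formula
    bytes_32b = rdreq_32b_sum * 32
    bytes_64b = (rdreq_sum - rdreq_32b_sum) * 64
    return (bytes_32b + bytes_64b) / duration


# Made-up counter values, for illustration only.
bw = read_bw_bytes_per_ns(rdreq_sum=1000, rdreq_32b_sum=200,
                          start_ts_ns=0, end_ts_ns=1000)
print(bw)
```

The per-dispatch value would then be aggregated with `AVG`, `MIN`, and `MAX` across dispatches, as the `avg`, `min`, and `max` entries do.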
@@ -289,12 +289,12 @@ Panel Config:
unit: pct
Write and Atomic BW:
avg: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
* 32)) / $denom))
* 32)) / (End_Timestamp - Start_Timestamp)))
min: MIN((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
* 32)) / $denom))
* 32)) / (End_Timestamp - Start_Timestamp)))
max: MAX((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
* 32)) / $denom))
unit: (Bytes + $normUnit)
* 32)) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
HBM Write and Atomic Traffic:
avg: AVG((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
!= 0) else None))
@@ -362,10 +362,10 @@ Panel Config:
unit: Unit
metric:
Bandwidth:
avg: AVG((TCC_REQ_sum * 64) / $denom)
min: MIN((TCC_REQ_sum * 64) / $denom)
max: MAX((TCC_REQ_sum * 64) / $denom)
unit: (Bytes + $normUnit)
avg: AVG((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))
min: MIN((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))
max: MAX((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))
unit: Gbps
Req:
avg: AVG((TCC_REQ_sum / $denom))
min: MIN((TCC_REQ_sum / $denom))
@@ -11,8 +11,12 @@ Panel Config:
instructions, averaged over the lifetime of the kernel. Calculated as the ratio
of the total number of cycles spent by the scheduler issuing LDS instructions
over the total CU cycles.
Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
could have been loaded from, stored to, or atomically updated in the LDS, as
a percentage of the theoretical peak. Does not take into account the execution
mask of the wavefront when the instruction was executed.
Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
loaded from, stored to, or atomically updated in the LDS per normalization unit.
loaded from, stored to, or atomically updated in the LDS divided by total duration.
Does not take into account the execution mask of the wavefront when the instruction
was executed.
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
@@ -58,7 +62,7 @@ Panel Config:
Access Rate:
value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
unit: Pct of Peak
Theoretical Bandwidth:
Theoretical Bandwidth Utilization:
value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
/ (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
unit: Pct of Peak
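The `Theoretical Bandwidth Utilization` expression above can be read as an achieved byte rate divided by one percent of a peak byte rate. The sketch below restates it in Python with hypothetical names. It assumes `$max_sclk` is in MHz and the timestamps are in nanoseconds; under those assumptions the constant `0.00128` is consistent with a peak of 128 bytes per cycle per CU (128 × 10⁻³ cycles-per-ns scaling ÷ 100 for percent). That reading is an inference from the formula, not something the diff states.

```python
def lds_bw_utilization_pct(idx_active, bank_conflict, lds_banks_per_cu,
                           start_ts, end_ts, max_sclk, cu_per_gpu):
    """Hypothetical restatement of the LDS utilization formula in the diff."""
    # Achieved rate: conflict-free LDS cycles * 4 bytes per bank * banks per CU,
    # divided by the kernel duration.
    achieved = ((idx_active - bank_conflict) * 4 * int(lds_banks_per_cu)) / (end_ts - start_ts)
    # One percent of the assumed peak rate (constant copied from the formula).
    one_pct_of_peak = (max_sclk * cu_per_gpu) * 0.00128
    return achieved / one_pct_of_peak


# Made-up values: 1 CU at 1000 MHz for 1000 ns, LDS active half of the time.
pct = lds_bw_utilization_pct(idx_active=500, bank_conflict=0, lds_banks_per_cu=32,
                             start_ts=0, end_ts=1000, max_sclk=1000, cu_per_gpu=1)
print(round(pct, 6))
```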
@@ -86,12 +90,12 @@ Panel Config:
unit: (Instr + $normUnit)
Theoretical Bandwidth:
avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
/ $denom))
/ (End_Timestamp - Start_Timestamp)))
min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
/ $denom))
/ (End_Timestamp - Start_Timestamp)))
max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
/ $denom))
unit: (Bytes + $normUnit)
/ (End_Timestamp - Start_Timestamp)))
unit: Gbps
LDS Latency:
avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
None))
@@ -3,15 +3,18 @@ Panel Config:
id: 1300
title: Instruction Cache
metrics_description:
Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the
peak theoretical bandwidth. Calculated as the ratio of L1I requests over the
total L1I cycles.
Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
the total L1I cycles.
Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
loaded line in the cache. Calculated as the ratio of the number of L1I requests
that hit over the number of all L1I requests.
L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\
\ bandwidth achieved. Calculated as the ratio of the total number of requests\
\ from the L1I to the L2 cache over the total L1I-L2 interface cycles."
L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\
\ L2 cache request bandwidth achieved. Calculated as the ratio of the total\
\ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\
\ cycles."
L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
divided by total duration.
Req: The total number of requests made to the L1I per normalization-unit
Hits: The total number of L1I requests that hit on a previously loaded cache line,
per normalization-unit.
@@ -30,7 +33,7 @@ Panel Config:
value: Avg
unit: Unit
metric:
Bandwidth:
Bandwidth Utilization:
value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
- Start_Timestamp))))
unit: Pct of Peak
@@ -38,7 +41,7 @@ Panel Config:
value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
+ SQC_ICACHE_MISSES_DUPLICATE)))
unit: Pct of Peak
L1I-L2 Bandwidth:
L1I-L2 Bandwidth Utilization:
value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu)
* (End_Timestamp - Start_Timestamp))))
unit: Pct of Peak
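The two instruction-cache utilization formulas above share one shape: a request count scaled by `100000` and divided by clock × unit count × duration. A rough Python sketch with hypothetical names is shown below. It assumes `$max_sclk` is in MHz and the timestamps are in nanoseconds, in which case `100000` combines the percent factor (100) with the MHz-to-cycles-per-ns conversion (1000). The extra factor of 2 in the `L1I-L2 Bandwidth Utilization` denominator is copied as-is; the diff does not explain it.

```python
def request_utilization_pct(requests, max_sclk, units_per_gpu,
                            start_ts, end_ts, denom_scale=1):
    """Hypothetical restatement of the request-based utilization formulas.

    denom_scale=1 matches Bandwidth Utilization;
    denom_scale=2 matches L1I-L2 Bandwidth Utilization.
    """
    denominator = denom_scale * (max_sclk * units_per_gpu) * (end_ts - start_ts)
    return (requests * 100000) / denominator


# Made-up values: one SQC at 1000 MHz for 1000 ns, i.e. 1000 cycles.
l1i_pct = request_utilization_pct(500, 1000, 1, 0, 1000)
l1i_l2_pct = request_utilization_pct(500, 1000, 1, 0, 1000, denom_scale=2)
print(l1i_pct, l1i_l2_pct)
```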
@@ -100,7 +103,7 @@ Panel Config:
unit: Unit
metric:
L1I-L2 Bandwidth:
avg: AVG(((SQC_TC_INST_REQ * 64) / $denom))
min: MIN(((SQC_TC_INST_REQ * 64) / $denom))
max: MAX(((SQC_TC_INST_REQ * 64) / $denom))
unit: (Bytes + $normUnit)
avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
@@ -3,14 +3,17 @@ Panel Config:
id: 1400
title: Scalar L1 Data Cache
metrics_description:
Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the
peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the
total sL1D cycles.
Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
over the total sL1D cycles.
Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
loaded line in the cache. The ratio of the number of sL1D requests that hit over
the number of all sL1D requests.
sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
bandwidth achieved. Calculated as the total number of bytes read from, written
to, or atomically updated across the sL1D - L2 interface.
sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
\ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\
\ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\
\ writes and atomics are typically unused on current CDNA accelerators, so in\
\ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
Req: The total number of requests, of any size or type, made to the sL1D per normalization
@@ -51,7 +54,7 @@ Panel Config:
value: Avg
unit: Unit
metric:
Bandwidth:
Bandwidth Utilization:
value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
- Start_Timestamp))))
unit: Pct of Peak
@@ -60,7 +63,7 @@ Panel Config:
+ SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES
+ SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
unit: Pct of Peak
sL1D-L2 BW:
sL1D-L2 BW Utilization:
value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
* 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))
unit: Pct of Peak
@@ -158,12 +161,12 @@ Panel Config:
metric:
sL1D-L2 BW:
avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
* 64)) / $denom))
* 64)) / (End_Timestamp - Start_Timestamp)))
min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
* 64)) / $denom))
* 64)) / (End_Timestamp - Start_Timestamp)))
max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
* 64)) / $denom))
unit: (Bytes + $normUnit)
* 64)) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
Read Req:
avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
min: MIN((SQC_TC_DATA_READ_REQ / $denom))
@@ -5,12 +5,12 @@ Panel Config:
metrics_description:
Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
cache over the total number of cache line requests to the vL1D Cache RAM.
Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM
instructions, as a percent of the peak theoretical bandwidth achievable on the
specific accelerator. The number of bytes is calculated as the number of cache
lines requested multiplied by the cache line size. This value does not consider
partial requests, so for instance, if only a single value is requested in a
cache line, the data movement will still be counted as a full cache line.
Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
on the specific accelerator. The number of bytes is calculated as the number
of cache lines requested multiplied by the cache line size. This value does
not consider partial requests, so for instance, if only a single value is requested
in a cache line, the data movement will still be counted as a full cache line.
Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
The number of cycles where the vL1D Cache RAM is actively processing any request
divided by the number of cycles where the vL1D is active.
@@ -42,11 +42,11 @@ Panel Config:
Atomic Req: The total number of incoming atomic requests from the address processing
unit after coalescing per normalization unit.
Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
instructions per normalization unit. The number of bytes is calculated as the
number of cache lines requested multiplied by the cache line size. This value
does not consider partial requests, so for instance, if only a single value
is requested in a cache line, the data movement will still be counted as a full
cache line.
instructions divided by total duration. The number of bytes is calculated as
the number of cache lines requested multiplied by the cache line size. This
value does not consider partial requests, so for instance, if only a single
value is requested in a cache line, the data movement will still be counted
as a full cache line.
Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
Cache Accesses: The total number of cache line lookups in the vL1D.
@@ -57,7 +57,7 @@ Panel Config:
command during the kernel's execution per normalization unit. This may be triggered
by, for instance, the buffer_wbinvl1 instruction.
L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
of VMEM instructions, per normalization unit. The number of bytes is calculated
of VMEM instructions, divided by total duration. The number of bytes is calculated
as the number of cache lines requested multiplied by the cache line size. This
value does not consider partial requests, so for instance, if only a single
value is requested in a cache line, the data movement will still be counted
@@ -128,7 +128,7 @@ Panel Config:
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
else None))
unit: Pct of Peak
Bandwidth:
Bandwidth Utilization:
value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp
- Start_Timestamp)))) / ((($max_sclk / 1000) * 64) * $cu_per_gpu))
unit: Pct of Peak
@@ -201,10 +201,10 @@ Panel Config:
/ $denom))
unit: (Req + $normUnit)
Cache BW:
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
unit: (Bytes + $normUnit)
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
Cache Hit Rate:
avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
@@ -242,12 +242,12 @@ Panel Config:
unit: (Req + $normUnit)
L1-L2 BW:
avg: AVG(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
min: MIN(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
max: MAX(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
unit: (Bytes + $normUnit)
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
L1-L2 Read:
avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
min: MIN((TCP_TCC_READ_REQ_sum / $denom))
@@ -20,8 +20,8 @@ Panel Config:
HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
memory (HBM) per unit time. This value is calculated as the number of HBM channels
multiplied by the HBM channel width multiplied by the HBM clock frequency.
Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per
normalization unit.
Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided
by total duration.
HBM Read Traffic: The percent of read requests generated by the L2 cache that
are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
does not consider the size of the request (meaning that 32B and 64B requests
@@ -42,9 +42,9 @@ Panel Config:
as a single request), so this metric only approximates the percent of the L2-Fabric
read bandwidth directed to an uncached memory location.
Write and Atomic BW: The total number of bytes written by the L2 over Infinity
Fabric by write and atomic operations per normalization unit. Note that on current
CDNA accelerators, such as the MI2XX, requests are only considered atomic by
Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
Fabric by write and atomic operations divided by total duration. Note that on
current CDNA accelerators, such as the MI2XX, requests are only considered atomic
by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
fine-grained memory allocations or uncached memory allocations on the MI2XX.
HBM Write and Atomic Traffic: The percent of write and atomic requests generated
by the L2 cache that are routed to the accelerator's local high-bandwidth memory
@@ -82,17 +82,17 @@ Panel Config:
Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
Fabric before a completion acknowledgement (atomic without return value) or
data (atomic with return value) was returned to the L2.
Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit.
Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration.
The number of bytes is calculated as the number of cache lines requested multiplied
by the cache line size. This value does not consider partial requests, so for
example, if only a single value is requested in a cache line, the data movement
will still be counted as a full cache line.
Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests,
per normalization unit.
divided by total duration.
Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests,
per normalization unit.
divided by total duration.
Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests,
per normalization unit.
divided by total duration.
Req: The total number of incoming requests to the L2 from all clients for all
request types, per normalization unit.
Read Req: The total number of read requests to the L2 from all clients.
@@ -150,11 +150,11 @@ Panel Config:
64B of data from any source other than the accelerator's local HBM, per normalization
unit.
Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe
traffic, per normalization unit.
traffic, divided by total duration.
"Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read
requests due to Infinity Fabric traffic, per normalization unit.
requests due to Infinity Fabric traffic, divided by total duration.
Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM
traffic, per normalization unit.
traffic, divided by total duration.
Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
write or atomically update 32B of data to any memory location, per normalization
unit.
@@ -171,17 +171,17 @@ Panel Config:
write or atomically update 32B or 64B of data in any memory location other than
the accelerator's local HBM, per normalization unit.
Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to
PCIe traffic, per normalization unit.
PCIe traffic, divided by total duration.
"Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write
requests due to Infinity Fabric traffic, per normalization unit.
requests due to Infinity Fabric traffic, divided by total duration.
Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM
traffic, per normalization unit.
traffic, divided by total duration.
Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to
PCIe traffic, per normalization unit.
PCIe traffic, divided by total duration.
"Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic
requests due to Infinity Fabric traffic, per normalization unit.
requests due to Infinity Fabric traffic, divided by total duration.
Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
HBM traffic, per normalization unit.
HBM traffic, divided by total duration.
Atomic: The total number of L2 requests to Infinity Fabric to atomically update
32B or 64B of data in any memory location, per normalization unit. See Request
flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
@@ -257,12 +257,12 @@ Panel Config:
metric:
Read BW:
avg: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
* 64)) / $denom))
* 64)) / (End_Timestamp - Start_Timestamp)))
min: MIN((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
* 64)) / $denom))
* 64)) / (End_Timestamp - Start_Timestamp)))
max: MAX((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
* 64)) / $denom))
unit: (Bytes + $normUnit)
* 64)) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
HBM Read Traffic:
avg: AVG((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
!= 0) else None))
@@ -289,12 +289,12 @@ Panel Config:
unit: pct
Write and Atomic BW:
avg: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
* 32)) / $denom))
* 32)) / (End_Timestamp - Start_Timestamp)))
min: MIN((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
* 32)) / $denom))
* 32)) / (End_Timestamp - Start_Timestamp)))
max: MAX((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
* 32)) / $denom))
unit: (Bytes + $normUnit)
* 32)) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
HBM Write and Atomic Traffic:
avg: AVG((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
!= 0) else None))
@@ -362,10 +362,10 @@ Panel Config:
unit: Unit
metric:
Bandwidth:
avg: AVG((TCC_REQ_sum * 128) / $denom)
min: MIN((TCC_REQ_sum * 128) / $denom)
max: MAX((TCC_REQ_sum * 128) / $denom)
unit: (Bytes + $normUnit)
avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
unit: Gbps
Req:
avg: AVG((TCC_REQ_sum / $denom))
min: MIN((TCC_REQ_sum / $denom))
@@ -11,8 +11,12 @@ Panel Config:
instructions, averaged over the lifetime of the kernel. Calculated as the ratio
of the total number of cycles spent by the scheduler issuing LDS instructions
over the total CU cycles.
Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
could have been loaded from, stored to, or atomically updated in the LDS, as
a percentage of the theoretical peak. Does not take into account the execution
mask of the wavefront when the instruction was executed.
Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
loaded from, stored to, or atomically updated in the LDS per normalization unit.
loaded from, stored to, or atomically updated in the LDS divided by total duration.
Does not take into account the execution mask of the wavefront when the instruction
was executed.
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
@@ -58,7 +62,7 @@ Panel Config:
Access Rate:
value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
unit: Pct of Peak
Theoretical Bandwidth (% of Peak):
Theoretical Bandwidth Utilization:
value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
/ (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
unit: Pct of Peak
@@ -86,12 +90,12 @@ Panel Config:
unit: (Instr + $normUnit)
Theoretical Bandwidth:
avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
/ $denom))
/ (End_Timestamp - Start_Timestamp)))
min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
/ $denom))
/ (End_Timestamp - Start_Timestamp)))
max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
/ $denom))
unit: (Bytes + $normUnit)
/ (End_Timestamp - Start_Timestamp)))
unit: Gbps
LDS Latency:
avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
None))
@@ -3,15 +3,18 @@ Panel Config:
id: 1300
title: Instruction Cache
metrics_description:
Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the
peak theoretical bandwidth. Calculated as the ratio of L1I requests over the
total L1I cycles.
Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
the total L1I cycles.
Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
loaded line in the cache. Calculated as the ratio of the number of L1I requests
that hit over the number of all L1I requests.
L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\
\ bandwidth achieved. Calculated as the ratio of the total number of requests\
\ from the L1I to the L2 cache over the total L1I-L2 interface cycles."
L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\
\ L2 cache request bandwidth achieved. Calculated as the ratio of the total\
\ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\
\ cycles."
L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
divided by total duration.
Req: The total number of requests made to the L1I per normalization-unit
Hits: The total number of L1I requests that hit on a previously loaded cache line,
per normalization-unit.
@@ -30,7 +33,7 @@ Panel Config:
value: Avg
unit: Unit
metric:
Bandwidth:
Bandwidth Utilization:
value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
- Start_Timestamp))))
unit: Pct of Peak
@@ -38,7 +41,7 @@ Panel Config:
value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
+ SQC_ICACHE_MISSES_DUPLICATE)))
unit: Pct of Peak
L1I-L2 Bandwidth:
L1I-L2 Bandwidth Utilization:
value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu)
* (End_Timestamp - Start_Timestamp))))
unit: Pct of Peak
@@ -100,7 +103,7 @@ Panel Config:
unit: Unit
metric:
L1I-L2 Bandwidth:
avg: AVG(((SQC_TC_INST_REQ * 64) / $denom))
min: MIN(((SQC_TC_INST_REQ * 64) / $denom))
max: MAX(((SQC_TC_INST_REQ * 64) / $denom))
unit: (Bytes + $normUnit)
avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
@@ -3,14 +3,17 @@ Panel Config:
id: 1400
title: Scalar L1 Data Cache
metrics_description:
Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the
peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the
total sL1D cycles.
Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
over the total sL1D cycles.
Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
loaded line in the cache. The ratio of the number of sL1D requests that hit over
the number of all sL1D requests.
sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
bandwidth achieved. Calculated as the total number of bytes read from, written
to, or atomically updated across the sL1D - L2 interface.
sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
\ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\
\ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\
\ writes and atomics are typically unused on current CDNA accelerators, so in\
\ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
Req: The total number of requests, of any size or type, made to the sL1D per normalization
@@ -51,7 +54,7 @@ Panel Config:
value: Avg
unit: Unit
metric:
Bandwidth:
Bandwidth Utilization:
value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
- Start_Timestamp))))
unit: Pct of Peak
@@ -60,7 +63,7 @@ Panel Config:
+ SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES
+ SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
unit: Pct of Peak
sL1D-L2 BW:
sL1D-L2 BW Utilization:
value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
* 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))
unit: Pct of Peak
@@ -158,12 +161,12 @@ Panel Config:
metric:
sL1D-L2 BW:
avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
* 64)) / $denom))
* 64)) / (End_Timestamp - Start_Timestamp)))
min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
* 64)) / $denom))
* 64)) / (End_Timestamp - Start_Timestamp)))
max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
* 64)) / $denom))
unit: (Bytes + $normUnit)
* 64)) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
Read Req:
avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
min: MIN((SQC_TC_DATA_READ_REQ / $denom))
@@ -5,12 +5,12 @@ Panel Config:
metrics_description:
Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
cache over the total number of cache line requests to the vL1D Cache RAM.
Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM
instructions, as a percent of the peak theoretical bandwidth achievable on the
specific accelerator. The number of bytes is calculated as the number of cache
lines requested multiplied by the cache line size. This value does not consider
partial requests, so for instance, if only a single value is requested in a
cache line, the data movement will still be counted as a full cache line.
Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
on the specific accelerator. The number of bytes is calculated as the number
of cache lines requested multiplied by the cache line size. This value does
not consider partial requests, so for instance, if only a single value is requested
in a cache line, the data movement will still be counted as a full cache line.
Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
The number of cycles where the vL1D Cache RAM is actively processing any request
divided by the number of cycles where the vL1D is active.
@@ -42,11 +42,11 @@ Panel Config:
Atomic Req: The total number of incoming atomic requests from the address processing
unit after coalescing per normalization unit.
Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
instructions per normalization unit. The number of bytes is calculated as the
number of cache lines requested multiplied by the cache line size. This value
does not consider partial requests, so for instance, if only a single value
is requested in a cache line, the data movement will still be counted as a full
cache line.
instructions divided by total duration. The number of bytes is calculated as
the number of cache lines requested multiplied by the cache line size. This
value does not consider partial requests, so for instance, if only a single
value is requested in a cache line, the data movement will still be counted
as a full cache line.
Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
Cache Accesses: The total number of cache line lookups in the vL1D.
@@ -57,7 +57,7 @@ Panel Config:
command during the kernel's execution per normalization unit. This may be triggered
by, for instance, the buffer_wbinvl1 instruction.
L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
of VMEM instructions, per normalization unit. The number of bytes is calculated
of VMEM instructions, divided by total duration. The number of bytes is calculated
as the number of cache lines requested multiplied by the cache line size. This
value does not consider partial requests, so for instance, if only a single
value is requested in a cache line, the data movement will still be counted
@@ -128,7 +128,7 @@ Panel Config:
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
else None))
unit: Pct of Peak
Bandwidth:
Bandwidth Utilization:
value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp
- Start_Timestamp)))) / ((($max_sclk / 1000) * 128) * $cu_per_gpu))
unit: Pct of Peak
@@ -201,10 +201,10 @@ Panel Config:
/ $denom))
unit: (Req + $normUnit)
Cache BW:
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
unit: (Bytes + $normUnit)
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
Cache Hit Rate:
avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
@@ -242,12 +242,12 @@ Panel Config:
unit: (Req + $normUnit)
L1-L2 BW:
avg: AVG(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
min: MIN(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
max: MAX(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
unit: (Bytes + $normUnit)
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
L1-L2 Read:
avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
min: MIN((TCP_TCC_READ_REQ_sum / $denom))
@@ -20,8 +20,8 @@ Panel Config:
HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
memory (HBM) per unit time. This value is calculated as the number of HBM channels
multiplied by the HBM channel width multiplied by the HBM clock frequency.
Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per
normalization unit.
Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided
by total duration.
HBM Read Traffic: The percent of read requests generated by the L2 cache that
are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
does not consider the size of the request (meaning that 32B and 64B requests
@@ -42,9 +42,9 @@ Panel Config:
as a single request), so this metric only approximates the percent of the L2-Fabric
read bandwidth directed to an uncached memory location.
Write and Atomic BW: The total number of bytes written by the L2 over Infinity
Fabric by write and atomic operations per normalization unit. Note that on current
CDNA accelerators, such as the MI2XX, requests are only considered atomic by
Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
Fabric by write and atomic operations divided by total duration. Note that on
current CDNA accelerators, such as the MI2XX, requests are only considered atomic
by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
fine-grained memory allocations or uncached memory allocations on the MI2XX.
HBM Write and Atomic Traffic: The percent of write and atomic requests generated
by the L2 cache that are routed to the accelerator's local high-bandwidth memory
@@ -82,17 +82,17 @@ Panel Config:
Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
Fabric before a completion acknowledgement (atomic without return value) or
data (atomic with return value) was returned to the L2.
Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit.
Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration.
The number of bytes is calculated as the number of cache lines requested multiplied
by the cache line size. This value does not consider partial requests, so for
example, if only a single value is requested in a cache line, the data movement
will still be counted as a full cache line.
Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests,
per normalization unit.
divided by total duration.
Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests,
per normalization unit.
divided by total duration.
Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests,
per normalization unit.
divided by total duration.
Req: The total number of incoming requests to the L2 from all clients for all
request types, per normalization unit.
Read Req: The total number of read requests to the L2 from all clients.
@@ -150,11 +150,11 @@ Panel Config:
64B of data from any source other than the accelerator's local HBM, per normalization
unit.
Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe
traffic, per normalization unit.
traffic, divided by total duration.
"Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read
requests due to Infinity Fabric traffic, per normalization unit.
requests due to Infinity Fabric traffic, divided by total duration.
Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM
traffic, per normalization unit.
traffic, divided by total duration.
Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
write or atomically update 32B of data to any memory location, per normalization
unit.
@@ -171,17 +171,17 @@ Panel Config:
write or atomically update 32B or 64B of data in any memory location other than
the accelerator's local HBM, per normalization unit.
Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to
PCIe traffic, per normalization unit.
PCIe traffic, divided by total duration.
"Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write
requests due to Infinity Fabric traffic, per normalization unit.
requests due to Infinity Fabric traffic, divided by total duration.
Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM
traffic, per normalization unit.
traffic, divided by total duration.
Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to
PCIe traffic, per normalization unit.
PCIe traffic, divided by total duration.
"Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic
requests due to Infinity Fabric traffic, per normalization unit.
requests due to Infinity Fabric traffic, divided by total duration.
Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
HBM traffic, per normalization unit.
HBM traffic, divided by total duration.
Atomic: The total number of L2 requests to Infinity Fabric to atomically update
32B or 64B of data in any memory location, per normalization unit. See Request
flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
@@ -257,12 +257,12 @@ Panel Config:
metric:
Read BW:
avg: AVG((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
* 64)) / $denom))
* 64)) / (End_Timestamp - Start_Timestamp)))
min: MIN((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
* 64)) / $denom))
* 64)) / (End_Timestamp - Start_Timestamp)))
max: MAX((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
* 64)) / $denom))
unit: (Bytes + $normUnit)
* 64)) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
HBM Read Traffic:
avg: AVG((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
!= 0) else None))
@@ -289,12 +289,12 @@ Panel Config:
unit: pct
Write and Atomic BW:
avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
* 32)) / $denom))
* 32)) / (End_Timestamp - Start_Timestamp)))
min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
* 32)) / $denom))
* 32)) / (End_Timestamp - Start_Timestamp)))
max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
* 32)) / $denom))
unit: (Bytes + $normUnit)
* 32)) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
HBM Write and Atomic Traffic:
avg: AVG((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
!= 0) else None))
@@ -362,10 +362,10 @@ Panel Config:
unit: Unit
metric:
Bandwidth:
avg: AVG((TCC_REQ_sum * 128) / $denom)
min: MIN((TCC_REQ_sum * 128) / $denom)
max: MAX((TCC_REQ_sum * 128) / $denom)
unit: (Bytes + $normUnit)
avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
unit: Gbps
Req:
avg: AVG((TCC_REQ_sum / $denom))
min: MIN((TCC_REQ_sum / $denom))
@@ -11,8 +11,12 @@ Panel Config:
instructions, averaged over the lifetime of the kernel. Calculated as the ratio
of the total number of cycles spent by the scheduler issuing LDS instructions
over the total CU cycles.
Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
could have been loaded from, stored to, or atomically updated in the LDS, as
a percentage of the theoretical peak. Does not take into account the execution
mask of the wavefront when the instruction was executed.
Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
loaded from, stored to, or atomically updated in the LDS per normalization unit.
loaded from, stored to, or atomically updated in the LDS divided by total duration.
Does not take into account the execution mask of the wavefront when the instruction
was executed.
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
@@ -58,7 +62,7 @@ Panel Config:
Access Rate:
value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
unit: Pct of Peak
Theoretical Bandwidth (% of Peak):
Theoretical Bandwidth Utilization:
value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
/ (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
unit: Pct of Peak
@@ -86,12 +90,12 @@ Panel Config:
unit: (Instr + $normUnit)
Theoretical Bandwidth:
avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
/ $denom))
/ (End_Timestamp - Start_Timestamp)))
min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
/ $denom))
/ (End_Timestamp - Start_Timestamp)))
max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
/ $denom))
unit: (Bytes + $normUnit)
/ (End_Timestamp - Start_Timestamp)))
unit: Gbps
LDS Latency:
avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
None))
@@ -3,15 +3,18 @@ Panel Config:
id: 1300
title: Instruction Cache
metrics_description:
Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the
peak theoretical bandwidth. Calculated as the ratio of L1I requests over the
total L1I cycles.
Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
the total L1I cycles.
Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
loaded line in the cache. Calculated as the ratio of the number of L1I requests
that hit over the number of all L1I requests.
L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\
\ bandwidth achieved. Calculated as the ratio of the total number of requests\
\ from the L1I to the L2 cache over the total L1I-L2 interface cycles."
L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\
\ L2 cache request bandwidth achieved. Calculated as the ratio of the total\
\ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\
\ cycles."
L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
divided by total duration.
Req: The total number of requests made to the L1I per normalization-unit
Hits: The total number of L1I requests that hit on a previously loaded cache line,
per normalization-unit.
@@ -30,7 +33,7 @@ Panel Config:
value: Avg
unit: Unit
metric:
Bandwidth:
Bandwidth Utilization:
value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
- Start_Timestamp))))
unit: Pct of Peak
@@ -38,7 +41,7 @@ Panel Config:
value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
+ SQC_ICACHE_MISSES_DUPLICATE)))
unit: Pct of Peak
L1I-L2 Bandwidth:
L1I-L2 Bandwidth Utilization:
value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu)
* (End_Timestamp - Start_Timestamp))))
unit: Pct of Peak
@@ -100,7 +103,7 @@ Panel Config:
unit: Unit
metric:
L1I-L2 Bandwidth:
avg: AVG(((SQC_TC_INST_REQ * 64) / $denom))
min: MIN(((SQC_TC_INST_REQ * 64) / $denom))
max: MAX(((SQC_TC_INST_REQ * 64) / $denom))
unit: (Bytes + $normUnit)
avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
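The formula change in these hunks (dividing by `End_Timestamp - Start_Timestamp` instead of `$denom`) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the helper name `bandwidth_bytes_per_ns`, the sample counter values, and the premise that timestamps are in nanoseconds (in which case bytes per nanosecond equals 10^9 bytes per second). It is not the profiler's implementation, only the arithmetic the updated `avg`/`min`/`max` expressions describe.

```python
def bandwidth_bytes_per_ns(requests, bytes_per_request, start_ts, end_ts):
    """Bytes moved divided by total duration (hypothetical helper).

    Assumes start_ts and end_ts are nanosecond timestamps, so the result
    is bytes per nanosecond, numerically equal to 1e9 bytes per second.
    """
    duration = end_ts - start_ts
    if duration <= 0:
        return None  # guard against a zero or negative duration
    return (requests * bytes_per_request) / duration


# Made-up (request count, start, end) samples, one per kernel dispatch.
samples = [(1000, 0, 500), (4000, 100, 1100), (2000, 50, 250)]

# 64 bytes per request mirrors the "SQC_TC_INST_REQ * 64" term above.
values = [bandwidth_bytes_per_ns(req, 64, start, end) for req, start, end in samples]

avg_bw = sum(values) / len(values)
min_bw = min(values)
max_bw = max(values)
print(values, avg_bw, min_bw, max_bw)
```

Unlike the earlier `$denom` form, the result no longer depends on the chosen normalization unit; it depends only on the counter value and the elapsed time of each sample.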
@@ -3,14 +3,17 @@ Panel Config:
id: 1400
title: Scalar L1 Data Cache
metrics_description:
Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the
peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the
total sL1D cycles.
Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
over the total sL1D cycles.
Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
loaded line in the cache. The ratio of the number of sL1D requests that hit over
the number of all sL1D requests.
sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
bandwidth achieved. Calculated as the total number of bytes read from, written
to, or atomically updated across the sL1D - L2 interface.
sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
\ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\
\ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\
\ writes and atomics are typically unused on current CDNA accelerators, so in\
\ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
Req: The total number of requests, of any size or type, made to the sL1D per normalization
@@ -51,7 +54,7 @@ Panel Config:
value: Avg
unit: Unit
metric:
Bandwidth:
Bandwidth Utilization:
value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
- Start_Timestamp))))
unit: Pct of Peak
@@ -60,7 +63,7 @@ Panel Config:
+ SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES
+ SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
unit: Pct of Peak
sL1D-L2 BW:
sL1D-L2 BW Utilization:
value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
* 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))
unit: Pct of Peak
@@ -158,12 +161,12 @@ Panel Config:
metric:
sL1D-L2 BW:
avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
* 64)) / $denom))
* 64)) / (End_Timestamp - Start_Timestamp)))
min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
* 64)) / $denom))
* 64)) / (End_Timestamp - Start_Timestamp)))
max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
* 64)) / $denom))
unit: (Bytes + $normUnit)
* 64)) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
Read Req:
avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
min: MIN((SQC_TC_DATA_READ_REQ / $denom))
@@ -5,12 +5,12 @@ Panel Config:
metrics_description:
Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
cache over the total number of cache line requests to the vL1D Cache RAM.
Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM
instructions, as a percent of the peak theoretical bandwidth achievable on the
specific accelerator. The number of bytes is calculated as the number of cache
lines requested multiplied by the cache line size. This value does not consider
partial requests, so for instance, if only a single value is requested in a
cache line, the data movement will still be counted as a full cache line.
Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
on the specific accelerator. The number of bytes is calculated as the number
of cache lines requested multiplied by the cache line size. This value does
not consider partial requests, so for instance, if only a single value is requested
in a cache line, the data movement will still be counted as a full cache line.
Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
The number of cycles where the vL1D Cache RAM is actively processing any request
divided by the number of cycles where the vL1D is active.
@@ -42,11 +42,11 @@ Panel Config:
Atomic Req: The total number of incoming atomic requests from the address processing
unit after coalescing per normalization unit.
Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
instructions per normalization unit. The number of bytes is calculated as the
number of cache lines requested multiplied by the cache line size. This value
does not consider partial requests, so for instance, if only a single value
is requested in a cache line, the data movement will still be counted as a full
cache line.
instructions divided by total duration. The number of bytes is calculated as
the number of cache lines requested multiplied by the cache line size. This
value does not consider partial requests, so for instance, if only a single
value is requested in a cache line, the data movement will still be counted
as a full cache line.
Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
Cache Accesses: The total number of cache line lookups in the vL1D.
@@ -57,7 +57,7 @@ Panel Config:
command during the kernel's execution per normalization unit. This may be triggered
by, for instance, the buffer_wbinvl1 instruction.
L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
of VMEM instructions, per normalization unit. The number of bytes is calculated
of VMEM instructions, divided by total duration. The number of bytes is calculated
as the number of cache lines requested multiplied by the cache line size. This
value does not consider partial requests, so for instance, if only a single
value is requested in a cache line, the data movement will still be counted
@@ -128,7 +128,7 @@ Panel Config:
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
else None))
unit: Pct of Peak
Bandwidth:
Bandwidth Utilization:
value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp
- Start_Timestamp)))) / ((($max_sclk / 1000) * 128) * $cu_per_gpu))
unit: Pct of Peak
@@ -201,10 +201,10 @@ Panel Config:
/ $denom))
unit: (Req + $normUnit)
Cache BW:
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
unit: (Bytes + $normUnit)
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
Cache Hit Rate:
avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
@@ -242,12 +242,12 @@ Panel Config:
unit: (Req + $normUnit)
L1-L2 BW:
avg: AVG(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
min: MIN(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
max: MAX(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
unit: (Bytes + $normUnit)
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
L1-L2 Read:
avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
min: MIN((TCP_TCC_READ_REQ_sum / $denom))
@@ -20,8 +20,8 @@ Panel Config:
HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
memory (HBM) per unit time. This value is calculated as the number of HBM channels
multiplied by the HBM channel width multiplied by the HBM clock frequency.
Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per
normalization unit.
Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided
by total duration.
HBM Read Traffic: The percent of read requests generated by the L2 cache that
are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
does not consider the size of the request (meaning that 32B and 64B requests
@@ -42,9 +42,9 @@ Panel Config:
as a single request), so this metric only approximates the percent of the L2-Fabric
read bandwidth directed to an uncached memory location.
Write and Atomic BW: The total number of bytes written by the L2 over Infinity
Fabric by write and atomic operations per normalization unit. Note that on current
CDNA accelerators, such as the MI2XX, requests are only considered atomic by
Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
Fabric by write and atomic operations divided by total duration. Note that on
current CDNA accelerators, such as the MI2XX, requests are only considered atomic
by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
fine-grained memory allocations or uncached memory allocations on the MI2XX.
HBM Write and Atomic Traffic: The percent of write and atomic requests generated
by the L2 cache that are routed to the accelerator's local high-bandwidth memory
@@ -82,17 +82,17 @@ Panel Config:
Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
Fabric before a completion acknowledgement (atomic without return value) or
data (atomic with return value) was returned to the L2.
Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit.
Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration.
The number of bytes is calculated as the number of cache lines requested multiplied
by the cache line size. This value does not consider partial requests, so for
example, if only a single value is requested in a cache line, the data movement
will still be counted as a full cache line.
Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests,
per normalization unit.
divided by total duration.
Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests,
per normalization unit.
divided by total duration.
Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests,
per normalization unit.
divided by total duration.
Req: The total number of incoming requests to the L2 from all clients for all
request types, per normalization unit.
Read Req: The total number of read requests to the L2 from all clients.
@@ -150,11 +150,11 @@ Panel Config:
64B of data from any source other than the accelerator's local HBM, per normalization
unit.
Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe
traffic, per normalization unit.
traffic, divided by total duration.
"Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read
requests due to Infinity Fabric traffic, per normalization unit.
requests due to Infinity Fabric traffic, divided by total duration.
Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM
traffic, per normalization unit.
traffic, divided by total duration.
Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
write or atomically update 32B of data to any memory location, per normalization
unit.
@@ -171,17 +171,17 @@ Panel Config:
write or atomically update 32B or 64B of data in any memory location other than
the accelerator's local HBM, per normalization unit.
Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to
PCIe traffic, per normalization unit.
PCIe traffic, divided by total duration.
"Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write
requests due to Infinity Fabric traffic, per normalization unit.
requests due to Infinity Fabric traffic, divided by total duration.
Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM
traffic, per normalization unit.
traffic, divided by total duration.
Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to
PCIe traffic, per normalization unit.
PCIe traffic, divided by total duration.
"Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic
requests due to Infinity Fabric traffic, per normalization unit.
requests due to Infinity Fabric traffic, divided by total duration.
Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
HBM traffic, per normalization unit.
HBM traffic, divided by total duration.
Atomic: The total number of L2 requests to Infinity Fabric to atomically update
32B or 64B of data in any memory location, per normalization unit. See Request
flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
@@ -257,12 +257,12 @@ Panel Config:
metric:
Read BW:
avg: AVG((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
* 64)) / $denom))
* 64)) / (End_Timestamp - Start_Timestamp)))
min: MIN((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
* 64)) / $denom))
* 64)) / (End_Timestamp - Start_Timestamp)))
max: MAX((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
* 64)) / $denom))
unit: (Bytes + $normUnit)
* 64)) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
HBM Read Traffic:
avg: AVG((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
!= 0) else None))
@@ -289,12 +289,12 @@ Panel Config:
unit: pct
Write and Atomic BW:
avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
* 32)) / $denom))
* 32)) / (End_Timestamp - Start_Timestamp)))
min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
* 32)) / $denom))
* 32)) / (End_Timestamp - Start_Timestamp)))
max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
* 32)) / $denom))
unit: (Bytes + $normUnit)
* 32)) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
HBM Write and Atomic Traffic:
avg: AVG((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
!= 0) else None))
@@ -362,10 +362,10 @@ Panel Config:
unit: Unit
metric:
Bandwidth:
avg: AVG((TCC_REQ_sum * 128) / $denom)
min: MIN((TCC_REQ_sum * 128) / $denom)
max: MAX((TCC_REQ_sum * 128) / $denom)
unit: (Bytes + $normUnit)
avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
unit: Gbps
Req:
avg: AVG((TCC_REQ_sum / $denom))
min: MIN((TCC_REQ_sum / $denom))
@@ -11,8 +11,12 @@ Panel Config:
instructions, averaged over the lifetime of the kernel. Calculated as the ratio
of the total number of cycles spent by the scheduler issuing LDS instructions
over the total CU cycles.
Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
could have been loaded from, stored to, or atomically updated in the LDS, as
a percentage of the theoretical peak. Does not take into account the execution
mask of the wavefront when the instruction was executed.
Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
loaded from, stored to, or atomically updated in the LDS per normalization unit.
loaded from, stored to, or atomically updated in the LDS divided by total duration.
Does not take into account the execution mask of the wavefront when the instruction
was executed.
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
@@ -58,7 +62,7 @@ Panel Config:
Access Rate:
value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
unit: Pct of Peak
Theoretical Bandwidth (% of Peak):
Theoretical Bandwidth Utilization:
value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
/ (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
unit: Pct of Peak
@@ -86,12 +90,12 @@ Panel Config:
unit: (Instr + $normUnit)
Theoretical Bandwidth:
avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
/ $denom))
/ (End_Timestamp - Start_Timestamp)))
min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
/ $denom))
/ (End_Timestamp - Start_Timestamp)))
max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
/ $denom))
unit: (Bytes + $normUnit)
/ (End_Timestamp - Start_Timestamp)))
unit: Gbps
LDS Latency:
avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
None))
@@ -3,15 +3,18 @@ Panel Config:
id: 1300
title: Instruction Cache
metrics_description:
Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the
peak theoretical bandwidth. Calculated as the ratio of L1I requests over the
total L1I cycles.
Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
the total L1I cycles.
Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
loaded line in the cache. Calculated as the ratio of the number of L1I requests
that hit over the number of all L1I requests.
L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\
\ bandwidth achieved. Calculated as the ratio of the total number of requests\
\ from the L1I to the L2 cache over the total L1I-L2 interface cycles."
L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\
\ L2 cache request bandwidth achieved. Calculated as the ratio of the total\
\ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\
\ cycles."
L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
divided by total duration.
Req: The total number of requests made to the L1I per normalization-unit
Hits: The total number of L1I requests that hit on a previously loaded cache line,
per normalization-unit.
@@ -30,7 +33,7 @@ Panel Config:
value: Avg
unit: Unit
metric:
Bandwidth:
Bandwidth Utilization:
value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
- Start_Timestamp))))
unit: Pct of Peak
@@ -38,7 +41,7 @@ Panel Config:
value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
+ SQC_ICACHE_MISSES_DUPLICATE)))
unit: Pct of Peak
L1I-L2 Bandwidth:
L1I-L2 Bandwidth Utilization:
value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu)
* (End_Timestamp - Start_Timestamp))))
unit: Pct of Peak
@@ -100,7 +103,7 @@ Panel Config:
unit: Unit
metric:
L1I-L2 Bandwidth:
avg: AVG(((SQC_TC_INST_REQ * 64) / $denom))
min: MIN(((SQC_TC_INST_REQ * 64) / $denom))
max: MAX(((SQC_TC_INST_REQ * 64) / $denom))
unit: (Bytes + $normUnit)
avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
@@ -3,14 +3,17 @@ Panel Config:
id: 1400
title: Scalar L1 Data Cache
metrics_description:
Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the
peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the
total sL1D cycles.
Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
over the total sL1D cycles.
Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
loaded line in the cache. The ratio of the number of sL1D requests that hit over
the number of all sL1D requests.
sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
bandwidth achieved. Calculated as the total number of bytes read from, written
to, or atomically updated across the sL1D - L2 interface.
sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
\ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\
\ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\
\ writes and atomics are typically unused on current CDNA accelerators, so in\
\ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
Req: The total number of requests, of any size or type, made to the sL1D per normalization
@@ -51,7 +54,7 @@ Panel Config:
value: Avg
unit: Unit
metric:
Bandwidth:
Bandwidth Utilization:
value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
- Start_Timestamp))))
unit: Pct of Peak
@@ -60,7 +63,7 @@ Panel Config:
+ SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES
+ SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
unit: Pct of Peak
sL1D-L2 BW:
sL1D-L2 BW Utilization:
value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
* 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))
unit: Pct of Peak
@@ -158,12 +161,12 @@ Panel Config:
metric:
sL1D-L2 BW:
avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
* 64)) / $denom))
* 64)) / (End_Timestamp - Start_Timestamp)))
min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
* 64)) / $denom))
* 64)) / (End_Timestamp - Start_Timestamp)))
max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
* 64)) / $denom))
unit: (Bytes + $normUnit)
* 64)) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
Read Req:
avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
min: MIN((SQC_TC_DATA_READ_REQ / $denom))
@@ -5,12 +5,12 @@ Panel Config:
metrics_description:
Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
cache over the total number of cache line requests to the vL1D Cache RAM.
Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM
instructions, as a percent of the peak theoretical bandwidth achievable on the
specific accelerator. The number of bytes is calculated as the number of cache
lines requested multiplied by the cache line size. This value does not consider
partial requests, so for instance, if only a single value is requested in a
cache line, the data movement will still be counted as a full cache line.
Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
on the specific accelerator. The number of bytes is calculated as the number
of cache lines requested multiplied by the cache line size. This value does
not consider partial requests, so for instance, if only a single value is requested
in a cache line, the data movement will still be counted as a full cache line.
Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
The number of cycles where the vL1D Cache RAM is actively processing any request
divided by the number of cycles where the vL1D is active.
@@ -42,11 +42,11 @@ Panel Config:
Atomic Req: The total number of incoming atomic requests from the address processing
unit after coalescing per normalization unit.
Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
instructions per normalization unit. The number of bytes is calculated as the
number of cache lines requested multiplied by the cache line size. This value
does not consider partial requests, so for instance, if only a single value
is requested in a cache line, the data movement will still be counted as a full
cache line.
instructions divided by total duration. The number of bytes is calculated as
the number of cache lines requested multiplied by the cache line size. This
value does not consider partial requests, so for instance, if only a single
value is requested in a cache line, the data movement will still be counted
as a full cache line.
Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
Cache Accesses: The total number of cache line lookups in the vL1D.
@@ -57,7 +57,7 @@ Panel Config:
command during the kernel's execution per normalization unit. This may be triggered
by, for instance, the buffer_wbinvl1 instruction.
L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
of VMEM instructions, per normalization unit. The number of bytes is calculated
of VMEM instructions, divided by total duration. The number of bytes is calculated
as the number of cache lines requested multiplied by the cache line size. This
value does not consider partial requests, so for instance, if only a single
value is requested in a cache line, the data movement will still be counted
@@ -128,7 +128,7 @@ Panel Config:
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
else None))
unit: Pct of Peak
Bandwidth:
Bandwidth Utilization:
value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp
- Start_Timestamp)))) / ((($max_sclk / 1000) * 128) * $cu_per_gpu))
unit: Pct of Peak
@@ -201,10 +201,10 @@ Panel Config:
/ $denom))
unit: (Req + $normUnit)
Cache BW:
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
unit: (Bytes + $normUnit)
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
Cache Hit Rate:
avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
@@ -242,12 +242,12 @@ Panel Config:
unit: (Req + $normUnit)
L1-L2 BW:
avg: AVG(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
min: MIN(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
max: MAX(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
unit: (Bytes + $normUnit)
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
L1-L2 Read:
avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
min: MIN((TCP_TCC_READ_REQ_sum / $denom))
@@ -20,8 +20,8 @@ Panel Config:
HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
memory (HBM) per unit time. This value is calculated as the number of HBM channels
multiplied by the HBM channel width multiplied by the HBM clock frequency.
Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per
normalization unit.
Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided
by total duration.
HBM Read Traffic: The percent of read requests generated by the L2 cache that
are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
does not consider the size of the request (meaning that 32B and 64B requests
@@ -42,9 +42,9 @@ Panel Config:
as a single request), so this metric only approximates the percent of the L2-Fabric
read bandwidth directed to an uncached memory location.
Write and Atomic BW: The total number of bytes written by the L2 over Infinity
Fabric by write and atomic operations per normalization unit. Note that on current
CDNA accelerators, such as the MI2XX, requests are only considered atomic by
Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
Fabric by write and atomic operations divided by total duration. Note that on
current CDNA accelerators, such as the MI2XX, requests are only considered atomic
by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
fine-grained memory allocations or uncached memory allocations on the MI2XX.
HBM Write and Atomic Traffic: The percent of write and atomic requests generated
by the L2 cache that are routed to the accelerator's local high-bandwidth memory
@@ -82,17 +82,17 @@ Panel Config:
Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
Fabric before a completion acknowledgement (atomic without return value) or
data (atomic with return value) was returned to the L2.
Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit.
Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration.
The number of bytes is calculated as the number of cache lines requested multiplied
by the cache line size. This value does not consider partial requests, so for
example, if only a single value is requested in a cache line, the data movement
will still be counted as a full cache line.
Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests,
per normalization unit.
divided by total duration.
Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests,
per normalization unit.
divided by total duration.
Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests,
per normalization unit.
divided by total duration.
Req: The total number of incoming requests to the L2 from all clients for all
request types, per normalization unit.
Read Req: The total number of read requests to the L2 from all clients.
@@ -150,11 +150,11 @@ Panel Config:
64B of data from any source other than the accelerator's local HBM, per normalization
unit.
Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe
traffic, per normalization unit.
traffic, divided by total duration.
"Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read
requests due to Infinity Fabric traffic, per normalization unit.
requests due to Infinity Fabric traffic, divided by total duration.
Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM
traffic, per normalization unit.
traffic, divided by total duration.
Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
write or atomically update 32B of data to any memory location, per normalization
unit.
@@ -171,17 +171,17 @@ Panel Config:
write or atomically update 32B or 64B of data in any memory location other than
the accelerator's local HBM, per normalization unit.
Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to
PCIe traffic, per normalization unit.
PCIe traffic, divided by total duration.
"Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write
requests due to Infinity Fabric traffic, per normalization unit.
requests due to Infinity Fabric traffic, divided by total duration.
Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM
traffic, per normalization unit.
traffic, divided by total duration.
Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to
PCIe traffic, per normalization unit.
PCIe traffic, divided by total duration.
"Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic
requests due to Infinity Fabric traffic, per normalization unit.
requests due to Infinity Fabric traffic, divided by total duration.
Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
HBM traffic, per normalization unit.
HBM traffic, divided by total duration.
Atomic: The total number of L2 requests to Infinity Fabric to atomically update
32B or 64B of data in any memory location, per normalization unit. See Request
flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
@@ -258,12 +258,15 @@ Panel Config:
metric:
Read BW:
avg: AVG(((128 * TCC_BUBBLE_sum + 64 * (TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum
- TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / $denom))
- TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / (End_Timestamp
- Start_Timestamp)))
min: MIN(((128 * TCC_BUBBLE_sum + 64 * (TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum
- TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / $denom))
- TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / (End_Timestamp
- Start_Timestamp)))
max: MAX(((128 * TCC_BUBBLE_sum + 64 * (TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum
- TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / $denom))
unit: (Bytes + $normUnit)
- TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / (End_Timestamp
- Start_Timestamp)))
unit: Gbps
HBM Read Traffic:
avg: AVG((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
!= 0) else None))
@@ -290,12 +293,12 @@ Panel Config:
unit: pct
Write and Atomic BW:
avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
* 32)) / $denom))
* 32)) / (End_Timestamp - Start_Timestamp)))
min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
* 32)) / $denom))
* 32)) / (End_Timestamp - Start_Timestamp)))
max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
* 32)) / $denom))
unit: (Bytes + $normUnit)
* 32)) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
HBM Write and Atomic Traffic:
avg: AVG((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
!= 0) else None))
@@ -363,10 +366,10 @@ Panel Config:
unit: Unit
metric:
Bandwidth:
avg: AVG((TCC_REQ_sum * 128) / $denom)
min: MIN((TCC_REQ_sum * 128) / $denom)
max: MAX((TCC_REQ_sum * 128) / $denom)
unit: (Bytes + $normUnit)
avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
unit: Gbps
Req:
avg: AVG((TCC_REQ_sum / $denom))
min: MIN((TCC_REQ_sum / $denom))
@@ -11,8 +11,12 @@ Panel Config:
instructions, averaged over the lifetime of the kernel. Calculated as the ratio
of the total number of cycles spent by the scheduler issuing LDS instructions
over the total CU cycles.
Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
could have been loaded from, stored to, or atomically updated in the LDS, as a
percentage of theoretical peak. Does not take into account the execution
mask of the wavefront when the instruction was executed.
Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
loaded from, stored to, or atomically updated in the LDS per normalization unit.
loaded from, stored to, or atomically updated in the LDS divided by total duration.
Does not take into account the execution mask of the wavefront when the instruction
was executed.
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
@@ -58,7 +62,7 @@ Panel Config:
Access Rate:
value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
unit: Pct of Peak
Theoretical Bandwidth (% of Peak):
Theoretical Bandwidth Utilization:
value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
/ (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
unit: Pct of Peak
@@ -116,12 +120,12 @@ Panel Config:
units: Gbps
Theoretical Bandwidth:
avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
/ $denom))
/ (End_Timestamp - Start_Timestamp)))
min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
/ $denom))
/ (End_Timestamp - Start_Timestamp)))
max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
/ $denom))
unit: (Bytes + $normUnit)
/ (End_Timestamp - Start_Timestamp)))
unit: Gbps
LDS Latency:
avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
None))
@@ -3,15 +3,18 @@ Panel Config:
id: 1300
title: Instruction Cache
metrics_description:
Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the
peak theoretical bandwidth. Calculated as the ratio of L1I requests over the
total L1I cycles.
Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
the total L1I cycles.
Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
loaded line in the cache. Calculated as the ratio of the number of L1I requests
that hit over the number of all L1I requests.
L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\
\ bandwidth achieved. Calculated as the ratio of the total number of requests\
\ from the L1I to the L2 cache over the total L1I-L2 interface cycles."
L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\
\ L2 cache request bandwidth achieved. Calculated as the ratio of the total\
\ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\
\ cycles."
L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
divided by total duration.
Req: The total number of requests made to the L1I per normalization-unit
Hits: The total number of L1I requests that hit on a previously loaded cache line,
per normalization-unit.
@@ -30,7 +33,7 @@ Panel Config:
value: Avg
unit: Unit
metric:
Bandwidth:
Bandwidth Utilization:
value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
- Start_Timestamp))))
unit: Pct of Peak
@@ -38,7 +41,7 @@ Panel Config:
value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
+ SQC_ICACHE_MISSES_DUPLICATE)))
unit: Pct of Peak
L1I-L2 Bandwidth:
L1I-L2 Bandwidth Utilization:
value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu)
* (End_Timestamp - Start_Timestamp))))
unit: Pct of Peak
@@ -100,7 +103,7 @@ Panel Config:
unit: Unit
metric:
L1I-L2 Bandwidth:
avg: AVG(((SQC_TC_INST_REQ * 64) / $denom))
min: MIN(((SQC_TC_INST_REQ * 64) / $denom))
max: MAX(((SQC_TC_INST_REQ * 64) / $denom))
unit: (Bytes + $normUnit)
avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
@@ -3,14 +3,17 @@ Panel Config:
id: 1400
title: Scalar L1 Data Cache
metrics_description:
Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the
peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the
total sL1D cycles.
Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
over the total sL1D cycles.
Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
loaded line in the cache. The ratio of the number of sL1D requests that hit over
the number of all sL1D requests.
sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
bandwidth achieved. Calculated as the total number of bytes read from, written
to, or atomically updated across the sL1D - L2 interface.
sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
\ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\
\ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\
\ writes and atomics are typically unused on current CDNA accelerators, so in\
\ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
Req: The total number of requests, of any size or type, made to the sL1D per normalization
@@ -51,7 +54,7 @@ Panel Config:
value: Avg
unit: Unit
metric:
Bandwidth:
Bandwidth Utilization:
value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
- Start_Timestamp))))
unit: Pct of Peak
@@ -60,7 +63,7 @@ Panel Config:
+ SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES
+ SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
unit: Pct of Peak
sL1D-L2 BW:
sL1D-L2 BW Utilization:
value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
* 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))
unit: Pct of Peak
@@ -158,12 +161,12 @@ Panel Config:
metric:
sL1D-L2 BW:
avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
* 64)) / $denom))
* 64)) / (End_Timestamp - Start_Timestamp)))
min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
* 64)) / $denom))
* 64)) / (End_Timestamp - Start_Timestamp)))
max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
* 64)) / $denom))
unit: (Bytes + $normUnit)
* 64)) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
Read Req:
avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
min: MIN((SQC_TC_DATA_READ_REQ / $denom))
@@ -5,12 +5,12 @@ Panel Config:
metrics_description:
Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
cache over the total number of cache line requests to the vL1D Cache RAM.
Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM
instructions, as a percent of the peak theoretical bandwidth achievable on the
specific accelerator. The number of bytes is calculated as the number of cache
lines requested multiplied by the cache line size. This value does not consider
partial requests, so for instance, if only a single value is requested in a
cache line, the data movement will still be counted as a full cache line.
Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
on the specific accelerator. The number of bytes is calculated as the number
of cache lines requested multiplied by the cache line size. This value does
not consider partial requests, so for instance, if only a single value is requested
in a cache line, the data movement will still be counted as a full cache line.
Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
The number of cycles where the vL1D Cache RAM is actively processing any request
divided by the number of cycles where the vL1D is active.
@@ -42,11 +42,11 @@ Panel Config:
Atomic Req: The total number of incoming atomic requests from the address processing
unit after coalescing per normalization unit.
Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
instructions per normalization unit. The number of bytes is calculated as the
number of cache lines requested multiplied by the cache line size. This value
does not consider partial requests, so for instance, if only a single value
is requested in a cache line, the data movement will still be counted as a full
cache line.
instructions divided by total duration. The number of bytes is calculated as
the number of cache lines requested multiplied by the cache line size. This
value does not consider partial requests, so for instance, if only a single
value is requested in a cache line, the data movement will still be counted
as a full cache line.
Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
Cache Accesses: The total number of cache line lookups in the vL1D.
@@ -57,7 +57,7 @@ Panel Config:
command during the kernel's execution per normalization unit. This may be triggered
by, for instance, the buffer_wbinvl1 instruction.
L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
of VMEM instructions, per normalization unit. The number of bytes is calculated
of VMEM instructions, divided by total duration. The number of bytes is calculated
as the number of cache lines requested multiplied by the cache line size. This
value does not consider partial requests, so for instance, if only a single
value is requested in a cache line, the data movement will still be counted
@@ -128,7 +128,7 @@ Panel Config:
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
else None))
unit: Pct of Peak
Bandwidth:
Bandwidth Utilization:
value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp
- Start_Timestamp)))) / ((($max_sclk / 1000) * 128) * $cu_per_gpu))
unit: Pct of Peak
@@ -216,10 +216,10 @@ Panel Config:
/ $denom))
unit: (Req + $normUnit)
Cache BW:
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
unit: (Bytes + $normUnit)
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
Cache Hit Rate:
avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
@@ -257,12 +257,12 @@ Panel Config:
unit: (Req + $normUnit)
L1-L2 BW:
avg: AVG(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
min: MIN(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
max: MAX(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
unit: (Bytes + $normUnit)
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
Tag RAM 0 Req:
avg: AVG((TCP_TAGRAM0_REQ_sum / $denom))
min: MIN((TCP_TAGRAM0_REQ_sum / $denom))
@@ -20,8 +20,8 @@ Panel Config:
HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
memory (HBM) per unit time. This value is calculated as the number of HBM channels
multiplied by the HBM channel width multiplied by the HBM clock frequency.
Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per
normalization unit.
Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided
by total duration.
HBM Read Traffic: The percent of read requests generated by the L2 cache that
are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
does not consider the size of the request (meaning that 32B and 64B requests
@@ -42,9 +42,9 @@ Panel Config:
as a single request), so this metric only approximates the percent of the L2-Fabric
read bandwidth directed to an uncached memory location.
Write and Atomic BW: The total number of bytes written by the L2 over Infinity
Fabric by write and atomic operations per normalization unit. Note that on current
CDNA accelerators, such as the MI2XX, requests are only considered atomic by
Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
Fabric by write and atomic operations divided by total duration. Note that on
current CDNA accelerators, such as the MI2XX, requests are only considered atomic
by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
fine-grained memory allocations or uncached memory allocations on the MI2XX.
HBM Write and Atomic Traffic: The percent of write and atomic requests generated
by the L2 cache that are routed to the accelerator's local high-bandwidth memory
@@ -82,17 +82,17 @@ Panel Config:
Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
Fabric before a completion acknowledgement (atomic without return value) or
data (atomic with return value) was returned to the L2.
Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit.
Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration.
The number of bytes is calculated as the number of cache lines requested multiplied
by the cache line size. This value does not consider partial requests, so for
example, if only a single value is requested in a cache line, the data movement
will still be counted as a full cache line.
Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests,
per normalization unit.
divided by total duration.
Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests,
per normalization unit.
divided by total duration.
Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests,
per normalization unit.
divided by total duration.
Req: The total number of incoming requests to the L2 from all clients for all
request types, per normalization unit.
Read Req: The total number of read requests to the L2 from all clients.
@@ -150,11 +150,11 @@ Panel Config:
64B of data from any source other than the accelerator's local HBM, per normalization
unit.
Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe
traffic, per normalization unit.
traffic, divided by total duration.
"Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read
requests due to Infinity Fabric traffic, per normalization unit.
requests due to Infinity Fabric traffic, divided by total duration.
Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM
traffic, per normalization unit.
traffic, divided by total duration.
Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
write or atomically update 32B of data to any memory location, per normalization
unit.
@@ -171,17 +171,17 @@ Panel Config:
write or atomically update 32B or 64B of data in any memory location other than
the accelerator's local HBM, per normalization unit.
Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to
PCIe traffic, per normalization unit.
PCIe traffic, divided by total duration.
"Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write
requests due to Infinity Fabric traffic, per normalization unit.
requests due to Infinity Fabric traffic, divided by total duration.
Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM
traffic, per normalization unit.
traffic, divided by total duration.
Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to
PCIe traffic, per normalization unit.
PCIe traffic, divided by total duration.
"Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic
requests due to Infinity Fabric traffic, per normalization unit.
requests due to Infinity Fabric traffic, divided by total duration.
Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
HBM traffic, per normalization unit.
HBM traffic, divided by total duration.
Atomic: The total number of L2 requests to Infinity Fabric to atomically update
32B or 64B of data in any memory location, per normalization unit. See Request
flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
@@ -257,12 +257,12 @@ Panel Config:
metric:
Read BW:
avg: AVG((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) +
(TCC_EA0_RDREQ_128B_sum * 128)) / $denom))
(TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp)))
min: MIN((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) +
(TCC_EA0_RDREQ_128B_sum * 128)) / $denom))
(TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp)))
max: MAX((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) +
(TCC_EA0_RDREQ_128B_sum * 128)) / $denom))
unit: (Bytes + $normUnit)
(TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
HBM Read Traffic:
avg: AVG((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
!= 0) else None))
@@ -289,12 +289,12 @@ Panel Config:
unit: pct
Write and Atomic BW:
avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
* 32)) / $denom))
* 32)) / (End_Timestamp - Start_Timestamp)))
min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
* 32)) / $denom))
* 32)) / (End_Timestamp - Start_Timestamp)))
max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
* 32)) / $denom))
unit: (Bytes + $normUnit)
* 32)) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
HBM Write and Atomic Traffic:
avg: AVG((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
!= 0) else None))
@@ -381,25 +381,25 @@ Panel Config:
unit: Unit
metric:
Bandwidth:
avg: AVG((TCC_REQ_sum * 128) / $denom)
min: MIN((TCC_REQ_sum * 128) / $denom)
max: MAX((TCC_REQ_sum * 128) / $denom)
unit: (Bytes + $normUnit)
avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
unit: Gbps
Read Bandwidth:
avg: AVG(TCC_READ_SECTORS_sum * 32/ $denom)
min: MIN(TCC_READ_SECTORS_sum * 32/ $denom)
max: MAX(TCC_READ_SECTORS_sum * 32/ $denom)
unit: (Bytes + $normUnit)
avg: AVG(TCC_READ_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
min: MIN(TCC_READ_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
max: MAX(TCC_READ_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
unit: Gbps
Write Bandwidth:
avg: AVG(TCC_WRITE_SECTORS_sum * 32/ $denom)
min: MIN(TCC_WRITE_SECTORS_sum * 32/ $denom)
max: MAX(TCC_WRITE_SECTORS_sum * 32/ $denom)
unit: (Bytes + $normUnit)
avg: AVG(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
min: MIN(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
max: MAX(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
unit: Gbps
Atomic Bandwidth:
avg: AVG(TCC_ATOMIC_SECTORS_sum * 32/ $denom)
min: MIN(TCC_ATOMIC_SECTORS_sum * 32/ $denom)
max: MAX(TCC_ATOMIC_SECTORS_sum * 32/ $denom)
unit: (Bytes + $normUnit)
avg: AVG(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
min: MIN(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
max: MAX(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
unit: Gbps
Req:
avg: AVG((TCC_REQ_sum / $denom))
min: MIN((TCC_REQ_sum / $denom))
@@ -653,20 +653,20 @@ Panel Config:
max: MAX((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
unit: (Req + $normUnit)
Read Bandwidth - PCIe:
avg: AVG(TCC_EA0_RDREQ_IO_32B_sum * 32/ $denom)
min: MIN(TCC_EA0_RDREQ_IO_32B_sum * 32/ $denom)
max: MAX(TCC_EA0_RDREQ_IO_32B_sum * 32/ $denom)
unit: (Bytes + $normUnit)
avg: AVG(TCC_EA0_RDREQ_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
min: MIN(TCC_EA0_RDREQ_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
max: MAX(TCC_EA0_RDREQ_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
unit: Gbps
"Read Bandwidth - Infinity Fabric\u2122":
avg: AVG(TCC_EA0_RDREQ_GMI_32B_sum * 32/ $denom)
min: MIN(TCC_EA0_RDREQ_GMI_32B_sum * 32/ $denom)
max: MAX(TCC_EA0_RDREQ_GMI_32B_sum * 32/ $denom)
unit: (Bytes + $normUnit)
avg: AVG(TCC_EA0_RDREQ_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
min: MIN(TCC_EA0_RDREQ_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
max: MAX(TCC_EA0_RDREQ_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
unit: Gbps
Read Bandwidth - HBM:
avg: AVG(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ $denom)
min: MIN(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ $denom)
max: MAX(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ $denom)
unit: (Bytes + $normUnit)
avg: AVG(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
min: MIN(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
max: MAX(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
unit: Gbps
Write and Atomic (32B):
avg: AVG(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom))
min: MIN(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom))
@@ -693,20 +693,20 @@ Panel Config:
max: MAX((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
unit: (Req + $normUnit)
Write Bandwidth - PCIe:
avg: AVG(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ $denom)
min: MIN(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ $denom)
max: MAX(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ $denom)
unit: (Bytes + $normUnit)
avg: AVG(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
min: MIN(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
max: MAX(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
unit: Gbps
"Write Bandwidth - Infinity Fabric\u2122":
avg: AVG(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ $denom)
min: MIN(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ $denom)
max: MAX(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ $denom)
unit: (Bytes + $normUnit)
avg: AVG(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
min: MIN(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
max: MAX(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
unit: Gbps
Write Bandwidth - HBM:
avg: AVG(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ $denom)
min: MIN(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ $denom)
max: MAX(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ $denom)
unit: (Bytes + $normUnit)
avg: AVG(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
min: MIN(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
max: MAX(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
unit: Gbps
Atomic:
avg: AVG((TCC_EA0_ATOMIC_sum / $denom))
min: MIN((TCC_EA0_ATOMIC_sum / $denom))
@@ -718,17 +718,17 @@ Panel Config:
max: MAX((TCC_EA0_WRREQ_ATOMIC_DRAM_sum / $denom))
unit: (Req + $normUnit)
Atomic Bandwidth - PCIe:
avg: AVG(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ $denom)
min: MIN(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ $denom)
max: MAX(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ $denom)
unit: (Bytes + $normUnit)
avg: AVG(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
min: MIN(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
max: MAX(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
unit: Gbps
"Atomic Bandwidth - Infinity Fabric\u2122":
avg: AVG(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ $denom)
min: MIN(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ $denom)
max: MAX(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ $denom)
unit: (Bytes + $normUnit)
avg: AVG(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
min: MIN(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
max: MAX(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
unit: Gbps
Atomic Bandwidth - HBM:
avg: AVG(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ $denom)
min: MIN(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ $denom)
max: MAX(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ $denom)
unit: (Bytes + $normUnit)
avg: AVG(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
min: MIN(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
max: MAX(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
unit: Gbps
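The formula change above replaces the `$denom` normalization with `(End_Timestamp - Start_Timestamp)`. A minimal Python sketch of that arithmetic follows. It assumes the timestamps are in nanoseconds, which this diff does not state; under that assumption, bytes per nanosecond is numerically 10^9 bytes per second. The function names, the `None` guard for a non-positive duration, and the `summarize` helper are illustrative only and are not part of the tool.

```python
from statistics import mean


def bytes_per_duration(num_bytes, start_timestamp, end_timestamp):
    """Return num_bytes / (end - start), or None when the duration is not positive.

    The None guard is an addition for this sketch; the YAML formulas divide directly.
    """
    duration = end_timestamp - start_timestamp
    if duration <= 0:
        return None
    return num_bytes / duration


def l2_bandwidth(tcc_req_sum, start_timestamp, end_timestamp):
    # Mirrors (TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp):
    # each request is counted as one full 128-byte cache line.
    return bytes_per_duration(tcc_req_sum * 128, start_timestamp, end_timestamp)


def fabric_read_bw(rdreq_32b, rdreq_64b, rdreq_128b, start_timestamp, end_timestamp):
    # Mirrors the "Read BW" formula: weight each request count by its size in bytes.
    total_bytes = rdreq_32b * 32 + rdreq_64b * 64 + rdreq_128b * 128
    return bytes_per_duration(total_bytes, start_timestamp, end_timestamp)


def summarize(values):
    """Hypothetical stand-in for the AVG/MIN/MAX wrappers over per-dispatch values."""
    kept = [v for v in values if v is not None]
    if not kept:
        return None
    return {"avg": mean(kept), "min": min(kept), "max": max(kept)}


if __name__ == "__main__":
    per_dispatch = [
        l2_bandwidth(1000, 0, 1000),
        l2_bandwidth(4000, 2000, 4000),
    ]
    print(summarize(per_dispatch))
```

Because the denominator is now elapsed time rather than a user-selected normalization unit, the result no longer changes with the normalization setting.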
@@ -59,42 +59,42 @@ src/rocprof_compute_soc/analysis_configs/gfx940/1100_compute_units_compute_pipel
src/rocprof_compute_soc/analysis_configs/gfx941/1100_compute_units_compute_pipeline.yaml: 4a25b6abf24f4a622fde1a3cfe65fe7236cf1e626fc2444667883997564cea1e
src/rocprof_compute_soc/analysis_configs/gfx942/1100_compute_units_compute_pipeline.yaml: 4a25b6abf24f4a622fde1a3cfe65fe7236cf1e626fc2444667883997564cea1e
src/rocprof_compute_soc/analysis_configs/gfx950/1100_compute_units_compute_pipeline.yaml: 4ef656938f8a9667ae872db522855856469accff9cb42bc0444b469346760dfd
src/rocprof_compute_soc/analysis_configs/gfx908/1200_local_data_share_lds.yaml: 80f3ca3ea15de009c5278ea20566d8c08d62e0087971e5f9aeae1c89df1dd898
src/rocprof_compute_soc/analysis_configs/gfx90a/1200_local_data_share_lds.yaml: 80f3ca3ea15de009c5278ea20566d8c08d62e0087971e5f9aeae1c89df1dd898
src/rocprof_compute_soc/analysis_configs/gfx940/1200_local_data_share_lds.yaml: 3bbf3928288990863cfe72fd00a28785fde0a36f103f5381df578aae2eb28be0
src/rocprof_compute_soc/analysis_configs/gfx941/1200_local_data_share_lds.yaml: 3bbf3928288990863cfe72fd00a28785fde0a36f103f5381df578aae2eb28be0
src/rocprof_compute_soc/analysis_configs/gfx942/1200_local_data_share_lds.yaml: 3bbf3928288990863cfe72fd00a28785fde0a36f103f5381df578aae2eb28be0
src/rocprof_compute_soc/analysis_configs/gfx950/1200_local_data_share_lds.yaml: 505163510a3b0132ee487f9e024188de2deb97d0f72e3d729b95f86e7c3434b3
src/rocprof_compute_soc/analysis_configs/gfx908/1300_instruction_cache.yaml: 2437e2f8191675c4116d0da1db291f3ad2715281ea812e9fdd6506cf213e5d1b
src/rocprof_compute_soc/analysis_configs/gfx90a/1300_instruction_cache.yaml: 2437e2f8191675c4116d0da1db291f3ad2715281ea812e9fdd6506cf213e5d1b
src/rocprof_compute_soc/analysis_configs/gfx940/1300_instruction_cache.yaml: 2437e2f8191675c4116d0da1db291f3ad2715281ea812e9fdd6506cf213e5d1b
src/rocprof_compute_soc/analysis_configs/gfx941/1300_instruction_cache.yaml: 2437e2f8191675c4116d0da1db291f3ad2715281ea812e9fdd6506cf213e5d1b
src/rocprof_compute_soc/analysis_configs/gfx942/1300_instruction_cache.yaml: 2437e2f8191675c4116d0da1db291f3ad2715281ea812e9fdd6506cf213e5d1b
src/rocprof_compute_soc/analysis_configs/gfx950/1300_instruction_cache.yaml: 2437e2f8191675c4116d0da1db291f3ad2715281ea812e9fdd6506cf213e5d1b
src/rocprof_compute_soc/analysis_configs/gfx908/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5
src/rocprof_compute_soc/analysis_configs/gfx90a/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5
src/rocprof_compute_soc/analysis_configs/gfx940/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5
src/rocprof_compute_soc/analysis_configs/gfx941/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5
src/rocprof_compute_soc/analysis_configs/gfx942/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5
src/rocprof_compute_soc/analysis_configs/gfx950/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5
src/rocprof_compute_soc/analysis_configs/gfx908/1200_local_data_share_lds.yaml: f3f7a74e8b2915fe27eec7948f006f218a6b0a96c91b95cdff9e624b2c484bb2
src/rocprof_compute_soc/analysis_configs/gfx90a/1200_local_data_share_lds.yaml: f3f7a74e8b2915fe27eec7948f006f218a6b0a96c91b95cdff9e624b2c484bb2
src/rocprof_compute_soc/analysis_configs/gfx940/1200_local_data_share_lds.yaml: f3f7a74e8b2915fe27eec7948f006f218a6b0a96c91b95cdff9e624b2c484bb2
src/rocprof_compute_soc/analysis_configs/gfx941/1200_local_data_share_lds.yaml: f3f7a74e8b2915fe27eec7948f006f218a6b0a96c91b95cdff9e624b2c484bb2
src/rocprof_compute_soc/analysis_configs/gfx942/1200_local_data_share_lds.yaml: f3f7a74e8b2915fe27eec7948f006f218a6b0a96c91b95cdff9e624b2c484bb2
src/rocprof_compute_soc/analysis_configs/gfx950/1200_local_data_share_lds.yaml: 6333e18126bde83da4c66fd967531d394bd22e69c08358096b27168a9dc11a30
src/rocprof_compute_soc/analysis_configs/gfx908/1300_instruction_cache.yaml: f60b9c657bece161e34219f3ada4041107dc5ca3d248590ee3b67e7bd400ff54
src/rocprof_compute_soc/analysis_configs/gfx90a/1300_instruction_cache.yaml: f60b9c657bece161e34219f3ada4041107dc5ca3d248590ee3b67e7bd400ff54
src/rocprof_compute_soc/analysis_configs/gfx940/1300_instruction_cache.yaml: f60b9c657bece161e34219f3ada4041107dc5ca3d248590ee3b67e7bd400ff54
src/rocprof_compute_soc/analysis_configs/gfx941/1300_instruction_cache.yaml: f60b9c657bece161e34219f3ada4041107dc5ca3d248590ee3b67e7bd400ff54
src/rocprof_compute_soc/analysis_configs/gfx942/1300_instruction_cache.yaml: f60b9c657bece161e34219f3ada4041107dc5ca3d248590ee3b67e7bd400ff54
src/rocprof_compute_soc/analysis_configs/gfx950/1300_instruction_cache.yaml: f60b9c657bece161e34219f3ada4041107dc5ca3d248590ee3b67e7bd400ff54
src/rocprof_compute_soc/analysis_configs/gfx908/1400_scalar_l1_data_cache.yaml: 29fac4ea38e4a018baffc4a27a720b47078fd890c10da307655d40f693e6f0e7
src/rocprof_compute_soc/analysis_configs/gfx90a/1400_scalar_l1_data_cache.yaml: 29fac4ea38e4a018baffc4a27a720b47078fd890c10da307655d40f693e6f0e7
src/rocprof_compute_soc/analysis_configs/gfx940/1400_scalar_l1_data_cache.yaml: 29fac4ea38e4a018baffc4a27a720b47078fd890c10da307655d40f693e6f0e7
src/rocprof_compute_soc/analysis_configs/gfx941/1400_scalar_l1_data_cache.yaml: 29fac4ea38e4a018baffc4a27a720b47078fd890c10da307655d40f693e6f0e7
src/rocprof_compute_soc/analysis_configs/gfx942/1400_scalar_l1_data_cache.yaml: 29fac4ea38e4a018baffc4a27a720b47078fd890c10da307655d40f693e6f0e7
src/rocprof_compute_soc/analysis_configs/gfx950/1400_scalar_l1_data_cache.yaml: 29fac4ea38e4a018baffc4a27a720b47078fd890c10da307655d40f693e6f0e7
src/rocprof_compute_soc/analysis_configs/gfx908/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 633d59aba82b3a495b7ba33fa4b2ae4da638b58632bcc37ff18be87af68ce4d4
src/rocprof_compute_soc/analysis_configs/gfx90a/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 2bdb9d7b3bea1057b3baee29ba3b428b211808261063a97bc4b6b319f4a19fb3
src/rocprof_compute_soc/analysis_configs/gfx940/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 3180c2f3266be0ff44e01d73d247ca43ae2ee18ecaf61765f58849e36c701b19
src/rocprof_compute_soc/analysis_configs/gfx941/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 3180c2f3266be0ff44e01d73d247ca43ae2ee18ecaf61765f58849e36c701b19
src/rocprof_compute_soc/analysis_configs/gfx942/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 3180c2f3266be0ff44e01d73d247ca43ae2ee18ecaf61765f58849e36c701b19
src/rocprof_compute_soc/analysis_configs/gfx950/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 9e56cef5b066fb575a5c530bcf9400f1291dd8636b12c8a2244cdba1defafc9f
src/rocprof_compute_soc/analysis_configs/gfx908/1600_vector_l1_data_cache.yaml: e6ec43014ce7b7cc072385d4eba072dd187b5de14979c169a3c1e9b8fc4c2762
src/rocprof_compute_soc/analysis_configs/gfx90a/1600_vector_l1_data_cache.yaml: e6ec43014ce7b7cc072385d4eba072dd187b5de14979c169a3c1e9b8fc4c2762
src/rocprof_compute_soc/analysis_configs/gfx940/1600_vector_l1_data_cache.yaml: 0e53921cc8d87a9adade250b9632fa42d33c825565152e37d6e56f45f83a3a28
src/rocprof_compute_soc/analysis_configs/gfx941/1600_vector_l1_data_cache.yaml: 0e53921cc8d87a9adade250b9632fa42d33c825565152e37d6e56f45f83a3a28
src/rocprof_compute_soc/analysis_configs/gfx942/1600_vector_l1_data_cache.yaml: 0e53921cc8d87a9adade250b9632fa42d33c825565152e37d6e56f45f83a3a28
src/rocprof_compute_soc/analysis_configs/gfx950/1600_vector_l1_data_cache.yaml: cd21327c193d2af8c18066b9c13f67e3d5dfb44731777bc5a1b6a7738c902dd1
src/rocprof_compute_soc/analysis_configs/gfx908/1700_l2_cache.yaml: 5b48c690b6069a5610d07cc0c2a5e1da65a52296205dcf48a3b6fa5e3df36e9b
src/rocprof_compute_soc/analysis_configs/gfx90a/1700_l2_cache.yaml: a9b128267a069060e891533334c52586c706f145b1e813a4081cb21d425516ad
src/rocprof_compute_soc/analysis_configs/gfx940/1700_l2_cache.yaml: b4eea39f0e23e501ad503cdd96db377109c7f0e212949828fe06102de7355349
src/rocprof_compute_soc/analysis_configs/gfx941/1700_l2_cache.yaml: da0189cd7f6e1ab4b79d0c054c2cdc1f7a9c81972dae9e5285f2f3d9c30ca644
src/rocprof_compute_soc/analysis_configs/gfx942/1700_l2_cache.yaml: b0802f923052eb584ce138210ebf2db70fb7883926896da1861a9e857d4abe81
src/rocprof_compute_soc/analysis_configs/gfx950/1700_l2_cache.yaml: 58bdd965421d610567e461becd7094fa41d668b119eddab99054d2bd6dc12acf
src/rocprof_compute_soc/analysis_configs/gfx908/1600_vector_l1_data_cache.yaml: 438d0f4a972dd341eb2485f51a47d6860fbb30a6169054cd8550b4b7226e199f
src/rocprof_compute_soc/analysis_configs/gfx90a/1600_vector_l1_data_cache.yaml: 438d0f4a972dd341eb2485f51a47d6860fbb30a6169054cd8550b4b7226e199f
src/rocprof_compute_soc/analysis_configs/gfx940/1600_vector_l1_data_cache.yaml: 6100b218f24de9f1433b39a093ed04b9bb9dfe656c5df77583c9db332c447230
src/rocprof_compute_soc/analysis_configs/gfx941/1600_vector_l1_data_cache.yaml: 6100b218f24de9f1433b39a093ed04b9bb9dfe656c5df77583c9db332c447230
src/rocprof_compute_soc/analysis_configs/gfx942/1600_vector_l1_data_cache.yaml: 6100b218f24de9f1433b39a093ed04b9bb9dfe656c5df77583c9db332c447230
src/rocprof_compute_soc/analysis_configs/gfx950/1600_vector_l1_data_cache.yaml: 67054ec0a4c6ca147a5dd40cc91f0e8e81378e1affe7d479274747579ecc524a
src/rocprof_compute_soc/analysis_configs/gfx908/1700_l2_cache.yaml: b1baa76f9dbfcc52d5e12cc1834102a0011ddf8bdece5be5fabc2945ab8971f4
src/rocprof_compute_soc/analysis_configs/gfx90a/1700_l2_cache.yaml: 4d834a2066d7f2cb655a8e41fc17531282150b6fe64bbc9c5ff3a10acddee5af
src/rocprof_compute_soc/analysis_configs/gfx940/1700_l2_cache.yaml: 78f9fee5dafc83d311da1c801200c1820e16a0678dd0548fafa8a966ec6a94d5
src/rocprof_compute_soc/analysis_configs/gfx941/1700_l2_cache.yaml: 51fe6e3888975b805594c2ab2b3147e717ae5e015468ee592cbcddc389c689bc
src/rocprof_compute_soc/analysis_configs/gfx942/1700_l2_cache.yaml: dc2dc9ff61b1747e492c28ef5ac76764fd75c18fd0827834130bc583f2afc619
src/rocprof_compute_soc/analysis_configs/gfx950/1700_l2_cache.yaml: d181f753c3fff608c72b8015d1af30bfd8cf8cdfbc0a17c505f717ddaa3b1efc
src/rocprof_compute_soc/analysis_configs/gfx908/1800_l2_cache_per_channel.yaml: a0c53202fe9f68d5e1fa689ce0643c471ced7d47e007d8ccc68fba294f7f6a05
src/rocprof_compute_soc/analysis_configs/gfx90a/1800_l2_cache_per_channel.yaml: a0c53202fe9f68d5e1fa689ce0643c471ced7d47e007d8ccc68fba294f7f6a05
src/rocprof_compute_soc/analysis_configs/gfx940/1800_l2_cache_per_channel.yaml: e184e3692eb0d641fb2e37fada0e58a6c4958553931d7c038b884e1e6986093f
@@ -113,4 +113,4 @@ src/rocprof_compute_soc/profile_configs/sets/gfx940_sets.yaml: 44cd2b32b050cafa7
src/rocprof_compute_soc/profile_configs/sets/gfx941_sets.yaml: 44cd2b32b050cafa73d0ead5703b82836edf25a057c21699046b6b8b8918b242
src/rocprof_compute_soc/profile_configs/sets/gfx942_sets.yaml: 44cd2b32b050cafa73d0ead5703b82836edf25a057c21699046b6b8b8918b242
src/rocprof_compute_soc/profile_configs/sets/gfx950_sets.yaml: 238d9dc8a98cfead3fc904885bfe413e5bcb4f1af31e9820cd640388bcd1e1c2
docs/data/metrics_description.yaml: 819c08a584ae8b418e6983aa51108b95e43eda4f3b7892eab336c61d844b20bf
docs/data/metrics_description.yaml: c2ddad7ef7973b128c1612e56cc6286e49c2f59af829b1795dc64b38c0ecfd61
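Each entry in the manifest above pairs a file path with a 64-character hex digest that changes whenever the file's contents change. The length is consistent with SHA-256, but the diff does not name the algorithm, so the sketch below treats that as an assumption. The helper names are hypothetical and do not come from the repository.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as 64 lowercase hex characters."""
    return hashlib.sha256(data).hexdigest()


def file_sha256(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large configs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_line(path: str, data: bytes) -> str:
    # Assumed layout, copied from the entries above: "<path>: <hex digest>".
    return f"{path}: {sha256_hex(data)}"


if __name__ == "__main__":
    print(manifest_line("example.yaml", b"unit: Gbps\n"))
```

Comparing a freshly computed digest against the recorded one is a cheap way to notice that a generated config has drifted from the manifest.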
File diff suppressed because it is too large