Update Unit of Bandwidth metrics to Gbps (#96)
* Add Utilization to metric name for Bandwidth related metrics whose Unit
is Percent
* Update Unit of Bandwidth metrics to Gbps
* Update metric Formula to use total duration as denominator instead of normalization unit.
* Update metric Description
* Update metric Unit
* Update CHANGELOG
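
The formula change described above can be sketched as follows. This is a minimal illustration, not the profiler's own code: both function names are hypothetical, and it assumes `Start_Timestamp` and `End_Timestamp` are recorded in nanoseconds, in which case bytes per nanosecond is numerically equal to 10^9 bytes per second (the quantity the new `Gbps` label is attached to).

```python
def bytes_per_normalization_unit(total_bytes, denom):
    """Old form: divide by the selected normalization unit ($denom)."""
    return total_bytes / denom


def bytes_per_duration(total_bytes, start_timestamp, end_timestamp):
    """New form: divide by the total duration (End_Timestamp - Start_Timestamp)."""
    duration = end_timestamp - start_timestamp
    if duration <= 0:
        return None  # no meaningful rate for an empty or invalid interval
    return total_bytes / duration
```

With the old form the reported value changes with the chosen normalization unit; with the new form it depends only on the bytes moved and the elapsed time.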
Committed by GitHub
Parent: a10d897a69
Commit: 89c74ac3d3
@@ -27,6 +27,26 @@ Full documentation for ROCm Compute Profiler is available at [https://rocm.docs.
* Change the basic view of TUI from aggregated analysis data to individual kernel analysis data

* Update `Unit` of the following `Bandwidth` related metrics to `Gbps` instead of `Bytes per Normalization Unit`
  * Theoretical Bandwidth (section 1202)
  * L1I-L2 Bandwidth (section 1303)
  * sL1D-L2 BW (section 1403)
  * Cache BW (section 1603)
  * L1-L2 BW (section 1603)
  * Read BW (section 1702)
  * Write and Atomic BW (section 1702)
  * Bandwidth (section 1703)
  * Atomic/Read/Write Bandwidth (section 1703)
  * Atomic/Read/Write Bandwidth - (HBM/PCIe/Infinity Fabric) (section 1706)

* Add `Utilization` to metric name for the following `Bandwidth` related metrics whose `Unit` is `Percent`
  * Theoretical Bandwidth Utilization (section 1201)
  * L1I-L2 Bandwidth Utilization (section 1301)
  * Bandwidth Utilization (section 1301)
  * Bandwidth Utilization (section 1401)
  * sL1D-L2 BW Utilization (section 1401)
  * Bandwidth Utilization (section 1601)

### Resolved issues

* Fixed not detecting memory clock issue when using amd-smi

@@ -397,13 +397,13 @@ LDS Speed-of-Light:
|
||||
over the number of LDS cycles that would have been required to move the same
|
||||
amount of data in an uncontended access. [#lds-bank-conflict]_
|
||||
unit: Percent
|
||||
Theoretical Bandwidth:
|
||||
Theoretical Bandwidth Utilization:
|
||||
rst: Indicates the maximum amount of bytes that could have been loaded from, stored
|
||||
to, or atomically updated in the LDS per :ref:`normalization unit <normalization-units>`.
|
||||
to, or atomically updated in the LDS, expressed as a percentage of the theoretical peak.
|
||||
Does *not* take into account the execution mask of the wavefront when the instruction
|
||||
was executed. See the :ref:`LDS bandwidth example <lds-bandwidth>` for more
|
||||
detail.
|
||||
unit: Bytes per normalization unit
|
||||
unit: Percent
|
||||
Utilization:
|
||||
rst: Indicates what percent of the kernel's duration the :ref:`LDS <desc-lds>` was
|
||||
actively executing instructions (including, but not limited to, load, store,
|
||||
@@ -450,17 +450,16 @@ LDS Statistics:
|
||||
unit: Accesses per normalization unit
|
||||
Theoretical Bandwidth:
|
||||
rst: Indicates the maximum amount of bytes that could have been loaded from, stored
|
||||
to, or atomically updated in the LDS per :ref:`normalization unit <normalization-units>`.
|
||||
Does *not* take into account the execution mask of the wavefront when the instruction
|
||||
was executed. See the :ref:`LDS bandwidth example <lds-bandwidth>` for more
|
||||
detail.
|
||||
unit: Bytes per normalization unit
|
||||
to, or atomically updated in the LDS divided by total duration. Does *not* take
|
||||
into account the execution mask of the wavefront when the instruction was executed.
|
||||
See the :ref:`LDS bandwidth example <lds-bandwidth>` for more detail.
|
||||
unit: Gbps
|
||||
Unaligned Stall:
|
||||
rst: The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>` due
|
||||
to stalls from non-dword aligned addresses per :ref:`normalization unit <normalization-units>`.
|
||||
unit: Cycles per normalization unit
|
||||
vL1D Speed-of-Light:
|
||||
Bandwidth:
|
||||
Bandwidth Utilization:
|
||||
rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM
|
||||
<desc-vmem>` instructions, as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator. The number of bytes is calculated as the number
|
||||
@@ -614,13 +613,13 @@ vL1D cache access metrics:
|
||||
rst: The total number of cache line lookups in the vL1D.
|
||||
unit: Cache lines
|
||||
Cache BW:
|
||||
rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM
|
||||
<desc-vmem>` instructions per :ref:`normalization unit <normalization-units>`. The
|
||||
number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so
|
||||
for instance, if only a single value is requested in a cache line, the data movement
|
||||
will still be counted as a full cache line.
|
||||
unit: Bytes per normalization unit
|
||||
rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM
|
||||
<desc-vmem>` instructions divided by total duration. The number of bytes is
|
||||
calculated as the number of cache lines requested multiplied by the cache line
|
||||
size. This value does not consider partial requests, so for instance, if only
|
||||
a single value is requested in a cache line, the data movement will still be
|
||||
counted as a full cache line.
|
||||
unit: Gbps
|
||||
Cache Hit Rate:
|
||||
rst: The ratio of the number of vL1D cache line requests that hit in vL1D cache
|
||||
over the total number of cache line requests to the :ref:`vL1D Cache RAM <desc-tc>`.
|
||||
@@ -646,12 +645,12 @@ vL1D cache access metrics:
|
||||
unit: Requests per normalization unit
|
||||
L1-L2 BW:
|
||||
rst: The number of bytes transferred across the vL1D-L2 interface as a result of
|
||||
:ref:`VMEM <desc-vmem>` instructions, per :ref:`normalization unit <normalization-units>`.
|
||||
The number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so for instance,
|
||||
:ref:`VMEM <desc-vmem>` instructions, divided by total duration. The number
|
||||
of bytes is calculated as the number of cache lines requested multiplied by
|
||||
the cache line size. This value does not consider partial requests, so for instance,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line.
|
||||
unit: Bytes per normalization unit
|
||||
unit: Gbps
|
||||
L1-L2 Read:
|
||||
rst: The number of read requests for a vL1D cache line that were not satisfied by
|
||||
the vL1D and must be retrieved from the :doc:`L2 Cache <l2-cache>` per :ref:`normalization
|
||||
@@ -761,20 +760,20 @@ L2 Speed-of-Light:
|
||||
unit: Percent
|
||||
L2 cache accesses:
|
||||
Atomic Bandwidth:
|
||||
rst: Total number of bytes looked up in the L2 cache for atomic requests, per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
unit: Bytes per normalization unit
|
||||
rst: Total number of bytes looked up in the L2 cache for atomic requests, divided
|
||||
by total duration.
|
||||
unit: Gbps
|
||||
Atomic Req:
|
||||
rst: The total number of atomic requests (with and without return) to the L2 from
|
||||
all clients.
|
||||
unit: Requests per normalization unit
|
||||
Bandwidth:
|
||||
rst: The number of bytes looked up in the L2 cache, per :ref:`normalization unit
|
||||
<normalization-units>`. The number of bytes is calculated as the number of
|
||||
cache lines requested multiplied by the cache line size. This value does not
|
||||
consider partial requests, so for example, if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
unit: Bytes per normalization unit
|
||||
rst: The number of bytes looked up in the L2 cache, divided by total duration.
|
||||
The number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so for
|
||||
example, if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line.
|
||||
unit: Gbps
|
||||
CC Req:
|
||||
rst: The total number of requests to the L2 that go to Coherently Cacheable (CC) memory
|
||||
allocations. See the :ref:`memory-type` for more information.
|
||||
@@ -818,9 +817,9 @@ L2 cache accesses:
|
||||
allocations. See the :ref:`memory-type` for more information.
|
||||
unit: Requests per normalization unit
|
||||
Read Bandwidth:
|
||||
rst: Total number of bytes looked up in the L2 cache for read requests, per :ref:`normalization
|
||||
unit <normalization-units>`.
|
||||
unit: Bytes per normalization unit
|
||||
rst: Total number of bytes looked up in the L2 cache for read requests, divided
|
||||
by total duration.
|
||||
unit: Gbps
|
||||
Read Req:
|
||||
rst: 'The total number of read requests to the L2 from all clients. '
|
||||
unit: Requests per normalization unit
|
||||
@@ -841,9 +840,9 @@ L2 cache accesses:
|
||||
See the :ref:`memory-type` for more information.
|
||||
unit: Requests per normalization unit
|
||||
Write Bandwidth:
|
||||
rst: Total number of bytes looked up in the L2 cache for write requests, per :ref:`normalization
|
||||
unit <normalization-units>`.
|
||||
unit: Bytes per normalization unit
|
||||
rst: Total number of bytes looked up in the L2 cache for write requests, divided
|
||||
by total duration.
|
||||
unit: Gbps
|
||||
Write Req:
|
||||
rst: The total number of write requests to the L2 from all clients.
|
||||
unit: Requests per normalization unit
|
||||
@@ -896,9 +895,9 @@ L2-Fabric interface metrics:
|
||||
memory <memory-type>` allocations.
|
||||
unit: Percent
|
||||
Read BW:
|
||||
rst: The total number of bytes read by the L2 cache from Infinity Fabric per :ref:`normalization
|
||||
unit <normalization-units>`.
|
||||
unit: Bytes per normalization unit
|
||||
rst: The total number of bytes read by the L2 cache from Infinity Fabric divided
|
||||
by total duration.
|
||||
unit: Gbps
|
||||
Read Latency:
|
||||
rst: The time-averaged number of cycles read requests spent in Infinity Fabric before
|
||||
data was returned to the L2.
|
||||
@@ -954,12 +953,12 @@ L2-Fabric interface metrics:
|
||||
unit: Percent
|
||||
Write and Atomic BW:
|
||||
rst: The total number of bytes written by the L2 over Infinity Fabric by write and
|
||||
atomic operations per :ref:`normalization unit <normalization-units>`. Note
|
||||
that on current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests
|
||||
are only considered *atomic* by Infinity Fabric if they are targeted at non-write-cacheable
|
||||
memory, for example, :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached
|
||||
atomic operations divided by total duration. Note that on current CDNA accelerators,
|
||||
such as the :ref:`MI2XX <mixxx-note>`, requests are only considered *atomic*
|
||||
by Infinity Fabric if they are targeted at non-write-cacheable memory, for
|
||||
example, :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached
|
||||
memory <memory-type>` allocations on the MI2XX.
|
||||
unit: Bytes per normalization unit
|
||||
unit: Gbps
|
||||
Write and Atomic Latency:
|
||||
rst: The time-averaged number of cycles write requests spent in Infinity Fabric
|
||||
before a completion acknowledgement was returned to the L2.
|
||||
@@ -975,17 +974,17 @@ L2 - Fabric interface detailed metrics:
|
||||
memory <memory-type>` allocations on the MI2XX.
|
||||
unit: Requests per normalization unit
|
||||
Atomic Bandwidth - HBM:
|
||||
rst: Total number of bytes due to L2 atomic requests due to HBM traffic, per normalization
|
||||
unit.
|
||||
unit: Bytes per normalization unit
|
||||
rst: Total number of bytes due to L2 atomic requests due to HBM traffic, divided
|
||||
by total duration.
|
||||
unit: Gbps
|
||||
"Atomic Bandwidth - Infinity Fabric\u2122":
|
||||
rst: Total number of bytes due to L2 atomic requests due to Infinity Fabric traffic,
|
||||
per normalization unit.
|
||||
unit: Bytes per normalization unit
|
||||
divided by total duration.
|
||||
unit: Gbps
|
||||
Atomic Bandwidth - PCIe:
|
||||
rst: Total number of bytes due to L2 atomic requests due to PCIe traffic, per
|
||||
normalization unit.
|
||||
unit: Bytes per normalization unit
|
||||
rst: Total number of bytes due to L2 atomic requests due to PCIe traffic, divided
|
||||
by total duration.
|
||||
unit: Gbps
|
||||
HBM Read:
|
||||
rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data
|
||||
from the accelerator's local HBM, per :ref:`normalization unit <normalization-units>`.
|
||||
@@ -1013,17 +1012,17 @@ L2 - Fabric interface detailed metrics:
|
||||
uncached data requests. See :ref:`l2-request-flow` for more detail.
|
||||
unit: Requests per normalization unit
|
||||
Read Bandwidth - HBM:
|
||||
rst: Total number of bytes due to L2 read requests due to HBM traffic, per normalization
|
||||
unit.
|
||||
unit: Bytes per normalization unit
|
||||
rst: Total number of bytes due to L2 read requests due to HBM traffic, divided
|
||||
by total duration.
|
||||
unit: Gbps
|
||||
"Read Bandwidth - Infinity Fabric\u2122":
|
||||
rst: Total number of bytes due to L2 read requests due to Infinity Fabric traffic,
|
||||
per normalization unit.
|
||||
unit: Bytes per normalization unit
|
||||
divided by total duration.
|
||||
unit: Gbps
|
||||
Read Bandwidth - PCIe:
|
||||
rst: Total number of bytes due to L2 read requests due to PCIe traffic, per normalization
|
||||
unit.
|
||||
unit: Bytes per normalization unit
|
||||
rst: Total number of bytes due to L2 read requests due to PCIe traffic, divided
|
||||
by total duration.
|
||||
unit: Gbps
|
||||
Remote Read:
|
||||
rst: The total number of L2 requests to Infinity Fabric to read 32B or 64B of data
|
||||
from any source other than the accelerator's local HBM, per :ref:`normalization
|
||||
@@ -1036,17 +1035,17 @@ L2 - Fabric interface detailed metrics:
|
||||
for more detail.
|
||||
unit: Requests per normalization unit
|
||||
Write Bandwidth - HBM:
|
||||
rst: Total number of bytes due to L2 write requests due to HBM traffic, per normalization
|
||||
unit.
|
||||
unit: Bytes per normalization unit
|
||||
rst: Total number of bytes due to L2 write requests due to HBM traffic, divided
|
||||
by total duration.
|
||||
unit: Gbps
|
||||
"Write Bandwidth - Infinity Fabric\u2122":
|
||||
rst: Total number of bytes due to L2 write requests due to Infinity Fabric traffic,
|
||||
per normalization unit.
|
||||
unit: Bytes per normalization unit
|
||||
divided by total duration.
|
||||
unit: Gbps
|
||||
Write Bandwidth - PCIe:
|
||||
rst: Total number of bytes due to L2 write requests due to PCIe traffic, per normalization
|
||||
unit.
|
||||
unit: Bytes per normalization unit
|
||||
rst: Total number of bytes due to L2 write requests due to PCIe traffic, divided
|
||||
by total duration.
|
||||
unit: Gbps
|
||||
Write and Atomic (32B):
|
||||
rst: The total number of L2 requests to Infinity Fabric to write or atomically update
|
||||
32B of data to any memory location, per :ref:`normalization unit <normalization-units>`.
|
||||
@@ -1098,7 +1097,7 @@ L2 - Fabric Interface stalls:
|
||||
of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
|
||||
unit: Percent
|
||||
Scalar L1D Speed-of-Light:
|
||||
Bandwidth:
|
||||
Bandwidth Utilization:
|
||||
rst: The number of bytes looked up in the sL1D cache, as a percent of the peak theoretical
|
||||
bandwidth. Calculated as the ratio of sL1D requests over the :ref:`total sL1D
|
||||
cycles <total-sl1d-cycles>`.
|
||||
@@ -1108,13 +1107,11 @@ Scalar L1D Speed-of-Light:
|
||||
the cache. The ratio of the number of sL1D requests that hit [#sl1d-cache]_
|
||||
over the number of all sL1D requests.
|
||||
unit: Percent
|
||||
sL1D-L2 BW:
|
||||
rst: "The total number of bytes read from, written to, or atomically updated \
|
||||
\ across the sL1D\u2194:doc:`L2 <l2-cache>` interface, per :ref:`normalization\
|
||||
\ unit <normalization-units>`. Note that sL1D writes and atomics are typically\
|
||||
\ unused on current CDNA accelerators, so in the majority of cases this can\
|
||||
\ be interpreted as an sL1D\u2192L2 read bandwidth."
|
||||
unit: Bytes per normalization unit
|
||||
sL1D-L2 BW Utilization:
|
||||
rst: The percentage of the peak theoretical sL1D - L2 interface bandwidth achieved.\
|
||||
\ Calculated as total number of bytes read from, written to, or atomically updated\
|
||||
\ across the sL1D - L2 interface.
|
||||
unit: Percent
|
||||
Scalar L1D cache accesses:
|
||||
Atomic Req:
|
||||
rst: The total number of atomic requests from sL1D to the :doc:`L2 <l2-cache>`,
|
||||
@@ -1189,13 +1186,13 @@ Scalar L1D Cache - L2 Interface:
|
||||
unit: Requests per normalization unit
|
||||
sL1D-L2 BW:
|
||||
rst: "The total number of bytes read from, written to, or atomically updated \
|
||||
\ across the sL1D\u2194:doc:`L2 <l2-cache>` interface, per :ref:`normalization\
|
||||
\ unit <normalization-units>`. Note that sL1D writes and atomics are typically\
|
||||
\ unused on current CDNA accelerators, so in the majority of cases this can\
|
||||
\ be interpreted as an sL1D\u2192L2 read bandwidth."
|
||||
unit: Bytes per normalization unit
|
||||
\ across the sL1D\u2194:doc:`L2 <l2-cache>` interface, divided by total duration.\
|
||||
\ Note that sL1D writes and atomics are typically unused on current CDNA accelerators,\
|
||||
\ so in the majority of cases this can be interpreted as an sL1D\u2192L2 read\
|
||||
\ bandwidth."
|
||||
unit: Gbps
|
||||
L1I Speed-of-Light:
|
||||
Bandwidth:
|
||||
Bandwidth Utilization:
|
||||
rst: The number of bytes looked up in the L1I cache, as a percent of the peak theoretical
|
||||
bandwidth. Calculated as the ratio of L1I requests over the :ref:`total L1I
|
||||
cycles <total-l1i-cycles>`.
|
||||
@@ -1205,7 +1202,7 @@ L1I Speed-of-Light:
|
||||
the cache. Calculated as the ratio of the number of L1I requests that hit over
|
||||
the number of all L1I requests.
|
||||
unit: Percent
|
||||
L1I-L2 Bandwidth:
|
||||
L1I-L2 Bandwidth Utilization:
|
||||
rst: "The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth\
|
||||
\ achieved. Calculated as the ratio of the total number of requests from the\
|
||||
\ L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`."
|
||||
@@ -1238,10 +1235,9 @@ L1I cache accesses:
|
||||
unit: Requests per normalization unit
|
||||
L1I <-> L2 interface:
|
||||
L1I-L2 Bandwidth:
|
||||
rst: "The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth\
|
||||
\ achieved. Calculated as the ratio of the total number of requests from the\
|
||||
\ L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`."
|
||||
unit: Percent
|
||||
rst: Total number of bytes transferred across L1I - L2 interface divided by total
|
||||
duration.
|
||||
unit: Gbps
|
||||
Workgroup manager utilizations:
|
||||
Accelerator Utilization:
|
||||
rst: The percent of cycles in the kernel where the accelerator was actively doing
|
||||
|
||||
+10
-6
@@ -11,8 +11,12 @@ Panel Config:
|
||||
instructions, averaged over the lifetime of the kernel. Calculated as the ratio
|
||||
of the total number of cycles spent by the scheduler issuing LDS instructions
|
||||
over the total CU cycles.
|
||||
Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
|
||||
could have been loaded from, stored to, or atomically updated in the LDS, expressed
|
||||
as a percentage of the theoretical peak. Does not take into account the execution
|
||||
mask of the wavefront when the instruction was executed.
|
||||
Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
|
||||
loaded from, stored to, or atomically updated in the LDS per normalization unit.
|
||||
loaded from, stored to, or atomically updated in the LDS divided by total duration.
|
||||
Does not take into account the execution mask of the wavefront when the instruction
|
||||
was executed.
|
||||
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
|
||||
@@ -58,7 +62,7 @@ Panel Config:
|
||||
Access Rate:
|
||||
value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: Pct of Peak
|
||||
Theoretical Bandwidth:
|
||||
Theoretical Bandwidth Utilization:
|
||||
value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
|
||||
unit: Pct of Peak
|
||||
@@ -86,12 +90,12 @@ Panel Config:
|
||||
unit: (Instr + $normUnit)
|
||||
Theoretical Bandwidth:
|
||||
avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ $denom))
|
||||
/ (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ $denom))
|
||||
/ (End_Timestamp - Start_Timestamp)))
|
||||
max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
/ (End_Timestamp - Start_Timestamp)))
|
||||
unit: Gbps
|
||||
LDS Latency:
|
||||
avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
|
||||
None))
|
||||
|
||||
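
The two LDS formulas in this file can be read as the sketch below. It is an interpretation under stated assumptions, not the profiler's implementation: the function names are hypothetical, `max_sclk` is assumed to be in MHz, and timestamps are assumed to be in nanoseconds. Under those assumptions the constant `0.00128` folds together 128 bytes per cycle per CU, the MHz to cycles-per-nanosecond conversion (divide by 1000), and the conversion to percent (multiply by 100).

```python
def lds_theoretical_bandwidth(idx_active, bank_conflict, lds_banks_per_cu,
                              start_timestamp, end_timestamp):
    """(SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4 * banks / duration."""
    bytes_moved = (idx_active - bank_conflict) * 4 * int(lds_banks_per_cu)
    return bytes_moved / (end_timestamp - start_timestamp)


def lds_bandwidth_utilization(bandwidth, max_sclk, cu_per_gpu):
    """Percent of peak, written exactly as in the panel formula."""
    return bandwidth / ((max_sclk * cu_per_gpu) * 0.00128)


def lds_bandwidth_utilization_readable(bandwidth, max_sclk, cu_per_gpu):
    """Same value with the constant unfolded (assumed meaning of 0.00128)."""
    cycles_per_ns = max_sclk / 1000          # MHz -> cycles per nanosecond
    peak_bytes_per_ns = cycles_per_ns * cu_per_gpu * 128
    return 100 * bandwidth / peak_bytes_per_ns
```

The sketch handles a single dispatch; the `AVG`, `MIN`, and `MAX` wrappers in the panel formulas aggregate such values.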
+15
-12
@@ -3,15 +3,18 @@ Panel Config:
|
||||
id: 1300
|
||||
title: Instruction Cache
|
||||
metrics_description:
|
||||
Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the
|
||||
peak theoretical bandwidth. Calculated as the ratio of L1I requests over the
|
||||
total L1I cycles.
|
||||
Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
|
||||
of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
|
||||
the total L1I cycles.
|
||||
Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
|
||||
loaded line in the cache. Calculated as the ratio of the number of L1I requests
|
||||
that hit over the number of all L1I requests.
|
||||
L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\
|
||||
\ bandwidth achieved. Calculated as the ratio of the total number of requests\
|
||||
\ from the L1I to the L2 cache over the total L1I-L2 interface cycles."
|
||||
L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\
|
||||
\ L2 cache request bandwidth achieved. Calculated as the ratio of the total\
|
||||
\ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\
|
||||
\ cycles."
|
||||
L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
|
||||
divided by total duration.
|
||||
Req: The total number of requests made to the L1I per normalization-unit
|
||||
Hits: The total number of L1I requests that hit on a previously loaded cache line,
|
||||
per normalization-unit.
|
||||
@@ -30,7 +33,7 @@ Panel Config:
|
||||
value: Avg
|
||||
unit: Unit
|
||||
metric:
|
||||
Bandwidth:
|
||||
Bandwidth Utilization:
|
||||
value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
|
||||
- Start_Timestamp))))
|
||||
unit: Pct of Peak
|
||||
@@ -38,7 +41,7 @@ Panel Config:
|
||||
value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
|
||||
+ SQC_ICACHE_MISSES_DUPLICATE)))
|
||||
unit: Pct of Peak
|
||||
L1I-L2 Bandwidth:
|
||||
L1I-L2 Bandwidth Utilization:
|
||||
value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu)
|
||||
* (End_Timestamp - Start_Timestamp))))
|
||||
unit: Pct of Peak
|
||||
@@ -100,7 +103,7 @@ Panel Config:
|
||||
unit: Unit
|
||||
metric:
|
||||
L1I-L2 Bandwidth:
|
||||
avg: AVG(((SQC_TC_INST_REQ * 64) / $denom))
|
||||
min: MIN(((SQC_TC_INST_REQ * 64) / $denom))
|
||||
max: MAX(((SQC_TC_INST_REQ * 64) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: Gbps
|
||||
|
||||
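
The `avg`/`min`/`max` entries for `L1I-L2 Bandwidth` can be pictured as below. This is a sketch that assumes the aggregates are taken over per-dispatch values and that each dispatch is available as a plain dictionary keyed by counter name; the function name and that record layout are hypothetical. The factor 64 (bytes per request) is taken from the formula as written.

```python
from statistics import mean


def l1i_l2_bandwidth(dispatches):
    """Aggregate SQC_TC_INST_REQ * 64 / (End_Timestamp - Start_Timestamp)."""
    per_dispatch = [
        (d["SQC_TC_INST_REQ"] * 64) / (d["End_Timestamp"] - d["Start_Timestamp"])
        for d in dispatches
    ]
    return {
        "avg": mean(per_dispatch),
        "min": min(per_dispatch),
        "max": max(per_dispatch),
    }
```

Because each dispatch is divided by its own duration before aggregation, a short dispatch with few requests can still report a high bandwidth.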
+13
-10
@@ -3,14 +3,17 @@ Panel Config:
|
||||
id: 1400
|
||||
title: Scalar L1 Data Cache
|
||||
metrics_description:
|
||||
Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the
|
||||
peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the
|
||||
total sL1D cycles.
|
||||
Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
|
||||
of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
|
||||
over the total sL1D cycles.
|
||||
Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
|
||||
loaded line in the cache. The ratio of the number of sL1D requests that hit over
|
||||
the number of all sL1D requests.
|
||||
sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
|
||||
bandwidth achieved. Calculated as total number of bytes read from, written
|
||||
to, or atomically updated across the sL1D - L2 interface.
|
||||
sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
|
||||
\ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\
|
||||
\ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\
|
||||
\ writes and atomics are typically unused on current CDNA accelerators, so in\
|
||||
\ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
|
||||
Req: The total number of requests, of any size or type, made to the sL1D per normalization
|
||||
@@ -51,7 +54,7 @@ Panel Config:
|
||||
value: Avg
|
||||
unit: Unit
|
||||
metric:
|
||||
Bandwidth:
|
||||
Bandwidth Utilization:
|
||||
value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
|
||||
- Start_Timestamp))))
|
||||
unit: Pct of Peak
|
||||
@@ -60,7 +63,7 @@ Panel Config:
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
|
||||
unit: Pct of Peak
|
||||
sL1D-L2 BW:
|
||||
sL1D-L2 BW Utilization:
|
||||
value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
|
||||
* 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))
|
||||
unit: Pct of Peak
|
||||
@@ -158,12 +161,12 @@ Panel Config:
|
||||
metric:
|
||||
sL1D-L2 BW:
|
||||
avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
|
||||
* 64)) / $denom))
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
|
||||
* 64)) / $denom))
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
|
||||
* 64)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: Gbps
|
||||
Read Req:
|
||||
avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_READ_REQ / $denom))
|
||||
|
||||
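
`sL1D-L2 BW` and `sL1D-L2 BW Utilization` are built from the same three request counters, which the sketch below makes explicit for a single dispatch. The function name is hypothetical, and the reading of the constant is an assumption: with `max_sclk` in MHz and timestamps in nanoseconds, `100000` is 100 (percent) times 1000 (MHz to cycles per nanosecond). The factor 2 in the denominator is copied from the formula as-is.

```python
def sl1d_l2_metrics(read_req, write_req, atomic_req,
                    max_sclk, sqc_per_gpu, start_timestamp, end_timestamp):
    """Return (bandwidth, utilization_percent) for one dispatch."""
    duration = end_timestamp - start_timestamp
    requests = read_req + write_req + atomic_req

    # sL1D-L2 BW: 64 bytes per request, divided by total duration.
    bandwidth = (requests * 64) / duration

    # sL1D-L2 BW Utilization: requests as a percent of the peak request rate.
    utilization = (requests * 100000) / (2 * (max_sclk * sqc_per_gpu) * duration)
    return bandwidth, utilization
```

The utilization is computed from request counts rather than bytes, so the 64-byte request size cancels out of the percentage.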
+21
-21
@@ -5,12 +5,12 @@ Panel Config:
|
||||
metrics_description:
|
||||
Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
|
||||
cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM
|
||||
instructions, as a percent of the peak theoretical bandwidth achievable on the
|
||||
specific accelerator. The number of bytes is calculated as the number of cache
|
||||
lines requested multiplied by the cache line size. This value does not consider
|
||||
partial requests, so for instance, if only a single value is requested in a
|
||||
cache line, the data movement will still be counted as a full cache line.
|
||||
Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
|
||||
of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator. The number of bytes is calculated as the number
|
||||
of cache lines requested multiplied by the cache line size. This value does
|
||||
not consider partial requests, so for instance, if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
|
||||
The number of cycles where the vL1D Cache RAM is actively processing any request
|
||||
divided by the number of cycles where the vL1D is active.
|
||||
@@ -42,11 +42,11 @@ Panel Config:
|
||||
Atomic Req: The total number of incoming atomic requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
|
||||
instructions per normalization unit. The number of bytes is calculated as the
|
||||
number of cache lines requested multiplied by the cache line size. This value
|
||||
does not consider partial requests, so for instance, if only a single value
|
||||
is requested in a cache line, the data movement will still be counted as a full
|
||||
cache line.
|
||||
instructions divided by total duration. The number of bytes is calculated as
|
||||
the number of cache lines requested multiplied by the cache line size. This
|
||||
value does not consider partial requests, so for instance, if only a single
|
||||
value is requested in a cache line, the data movement will still be counted
|
||||
as a full cache line.
|
||||
Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
|
||||
vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
Cache Accesses: The total number of cache line lookups in the vL1D.
|
||||
@@ -57,7 +57,7 @@ Panel Config:
|
||||
command during the kernel's execution per normalization unit. This may be triggered
|
||||
by, for instance, the buffer_wbinvl1 instruction.
|
||||
L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
|
||||
of VMEM instructions, per normalization unit. The number of bytes is calculated
|
||||
of VMEM instructions, divided by total duration. The number of bytes is calculated
|
||||
as the number of cache lines requested multiplied by the cache line size. This
|
||||
value does not consider partial requests, so for instance, if only a single
|
||||
value is requested in a cache line, the data movement will still be counted
|
||||
@@ -128,7 +128,7 @@ Panel Config:
|
||||
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
|
||||
else None))
|
||||
unit: Pct of Peak
|
||||
Bandwidth:
|
||||
Bandwidth Utilization:
|
||||
value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp
|
||||
- Start_Timestamp)))) / ((($max_sclk / 1000) * 64) * $cu_per_gpu))
|
||||
unit: Pct of Peak
|
||||
@@ -201,10 +201,10 @@ Panel Config:
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Cache BW:
|
||||
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
|
||||
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
|
||||
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: Gbps
|
||||
Cache Hit Rate:
|
||||
avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
@@ -242,12 +242,12 @@ Panel Config:
 unit: (Req + $normUnit)
 L1-L2 BW:
 avg: AVG(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
-+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
++ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
 min: MIN(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
-+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
++ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
 max: MAX(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
-+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
-unit: (Bytes + $normUnit)
++ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
+unit: Gbps
 L1-L2 Read:
 avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
 min: MIN((TCP_TCC_READ_REQ_sum / $denom))
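The new `L1-L2 BW` formula above sums the four request counters, counts each request as one 64-byte cache line, and divides by the dispatch duration instead of `$denom`. The following is a rough Python sketch of that arithmetic. The helper name and its arguments are hypothetical, and it assumes the timestamps are in nanoseconds, in which case bytes per nanosecond is numerically the same as 10^9 bytes per second.

```python
CACHE_LINE_BYTES = 64  # the literal 64 used in the formula


def l1_l2_bandwidth(read_req, write_req, atomic_ret_req, atomic_noret_req,
                    start_ns, end_ns):
    """Bytes per nanosecond across the vL1D-L2 interface (hypothetical helper)."""
    duration_ns = end_ns - start_ns
    if duration_ns <= 0:
        return None  # guard added for the sketch; the config has no such check
    requests = read_req + write_req + atomic_ret_req + atomic_noret_req
    return CACHE_LINE_BYTES * requests / duration_ns


print(l1_l2_bandwidth(1000, 500, 0, 0, 0, 1000))
```

Because the denominator is now wall-clock duration, the value no longer changes with the selected normalization unit.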
|
||||
+30
-30
@@ -20,8 +20,8 @@ Panel Config:
 HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
 memory (HBM) per unit time. This value is calculated as the number of HBM channels
 multiplied by the HBM channel width multiplied by the HBM clock frequency.
-Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per
-normalization unit.
+Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided
+by total duration.
 HBM Read Traffic: The percent of read requests generated by the L2 cache that
 are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
 does not consider the size of the request (meaning that 32B and 64B requests
@@ -42,9 +42,9 @@ Panel Config:
 as a single request), so this metric only approximates the percent of the L2-Fabric
 read bandwidth directed to an uncached memory location.
 Write and Atomic BW: The total number of bytes written by the L2 over Infinity
-Fabric by write and atomic operations per normalization unit. Note that on current
-CDNA accelerators, such as the MI2XX, requests are only considered atomic by
-Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
+Fabric by write and atomic operations divided by total duration. Note that on
+current CDNA accelerators, such as the MI2XX, requests are only considered atomic
+by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
 fine-grained memory allocations or uncached memory allocations on the MI2XX.
 HBM Write and Atomic Traffic: The percent of write and atomic requests generated
 by the L2 cache that are routed to the accelerator's local high-bandwidth memory
@@ -82,17 +82,17 @@ Panel Config:
 Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
 Fabric before a completion acknowledgement (atomic without return value) or
 data (atomic with return value) was returned to the L2.
-Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit.
+Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration.
 The number of bytes is calculated as the number of cache lines requested multiplied
 by the cache line size. This value does not consider partial requests, so for
 example, if only a single value is requested in a cache line, the data movement
 will still be counted as a full cache line.
 Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests,
-per normalization unit.
+divided by total duration.
 Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests,
-per normalization unit.
+divided by total duration.
 Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests,
-per normalization unit.
+divided by total duration.
 Req: The total number of incoming requests to the L2 from all clients for all
 request types, per normalization unit.
 Read Req: The total number of read requests to the L2 from all clients.
@@ -150,11 +150,11 @@ Panel Config:
 64B of data from any source other than the accelerator's local HBM, per normalization
 unit.
 Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe
-traffic, per normalization unit.
+traffic, divided by total duration.
 "Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read
-requests due to Infinity Fabric traffic, per normalization unit.
+requests due to Infinity Fabric traffic, divided by total duration.
 Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM
-traffic, per normalization unit.
+traffic, divided by total duration.
 Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
 write or atomically update 32B of data to any memory location, per normalization
 unit.
@@ -171,17 +171,17 @@ Panel Config:
 write or atomically update 32B or 64B of data in any memory location other than
 the accelerator's local HBM, per normalization unit.
 Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to
-PCIe traffic, per normalization unit.
+PCIe traffic, divided by total duration.
 "Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write
-requests due to Infinity Fabric traffic, per normalization unit.
+requests due to Infinity Fabric traffic, divided by total duration.
 Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM
-traffic, per normalization unit.
+traffic, divided by total duration.
 Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to
-PCIe traffic, per normalization unit.
+PCIe traffic, divided by total duration.
 "Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic
-requests due to Infinity Fabric traffic, per normalization unit.
+requests due to Infinity Fabric traffic, divided by total duration.
 Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
-HBM traffic, per normalization unit.
+HBM traffic, divided by total duration.
 Atomic: The total number of L2 requests to Infinity Fabric to atomically update
 32B or 64B of data in any memory location, per normalization unit. See Request
 flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
@@ -257,12 +257,12 @@ Panel Config:
 metric:
 Read BW:
 avg: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
-* 64)) / $denom))
+* 64)) / (End_Timestamp - Start_Timestamp)))
 min: MIN((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
-* 64)) / $denom))
+* 64)) / (End_Timestamp - Start_Timestamp)))
 max: MAX((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
-* 64)) / $denom))
-unit: (Bytes + $normUnit)
+* 64)) / (End_Timestamp - Start_Timestamp)))
+unit: Gbps
 HBM Read Traffic:
 avg: AVG((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
 != 0) else None))
@@ -289,12 +289,12 @@ Panel Config:
 unit: pct
 Write and Atomic BW:
 avg: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
-* 32)) / $denom))
+* 32)) / (End_Timestamp - Start_Timestamp)))
 min: MIN((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
-* 32)) / $denom))
+* 32)) / (End_Timestamp - Start_Timestamp)))
 max: MAX((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
-* 32)) / $denom))
-unit: (Bytes + $normUnit)
+* 32)) / (End_Timestamp - Start_Timestamp)))
+unit: Gbps
 HBM Write and Atomic Traffic:
 avg: AVG((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
 != 0) else None))
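The `Read BW` and `Write and Atomic BW` formulas in the two hunks above both split the request count into 32-byte and 64-byte requests before dividing by duration. Reads take the 32B counter and treat the remainder as 64B, while writes take the 64B counter and treat the remainder as 32B. The following is a small Python sketch of that byte accounting. The function names are hypothetical, and nanosecond timestamps are an assumption.

```python
def fabric_read_bandwidth(rdreq_total, rdreq_32b, start_ns, end_ns):
    """Bytes per ns read by L2 over the fabric (hypothetical helper)."""
    # 32B requests are counted directly; all other reads are taken as 64B.
    bytes_read = rdreq_32b * 32 + (rdreq_total - rdreq_32b) * 64
    return bytes_read / (end_ns - start_ns)


def fabric_write_bandwidth(wrreq_total, wrreq_64b, start_ns, end_ns):
    """Bytes per ns written by L2 over the fabric (hypothetical helper)."""
    # 64B requests are counted directly; all other writes are taken as 32B.
    bytes_written = wrreq_64b * 64 + (wrreq_total - wrreq_64b) * 32
    return bytes_written / (end_ns - start_ns)


print(fabric_read_bandwidth(100, 40, 0, 100))
print(fabric_write_bandwidth(100, 40, 0, 100))
```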
@@ -362,10 +362,10 @@ Panel Config:
 unit: Unit
 metric:
 Bandwidth:
-avg: AVG((TCC_REQ_sum * 64) / $denom)
-min: MIN((TCC_REQ_sum * 64) / $denom)
-max: MAX((TCC_REQ_sum * 64) / $denom)
-unit: (Bytes + $normUnit)
+avg: AVG((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))
+min: MIN((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))
+max: MAX((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))
+unit: Gbps
 Req:
 avg: AVG((TCC_REQ_sum / $denom))
 min: MIN((TCC_REQ_sum / $denom))
|
||||
+10
-6
@@ -11,8 +11,12 @@ Panel Config:
 instructions, averaged over the lifetime of the kernel. Calculated as the ratio
 of the total number of cycles spent by the scheduler issuing LDS instructions
 over the total CU cycles.
+Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
+could have been loaded from, stored to, or atomically updated in the LDS divided
+as percentage of theoretical peak. Does not take into account the execution
+mask of the wavefront when the instruction was executed.
 Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
-loaded from, stored to, or atomically updated in the LDS per normalization unit.
+loaded from, stored to, or atomically updated in the LDS divided by total duration.
 Does not take into account the execution mask of the wavefront when the instruction
 was executed.
 Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
@@ -58,7 +62,7 @@ Panel Config:
 Access Rate:
 value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
 unit: Pct of Peak
-Theoretical Bandwidth:
+Theoretical Bandwidth Utilization:
 value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
 / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
 unit: Pct of Peak
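The constant `0.00128` in the `Theoretical Bandwidth Utilization` formula above is not explained in the diff. One reading that reproduces it, offered as an assumption rather than the authors' stated derivation, is a peak of 128 bytes per cycle per CU, multiplied by 1/1000 to turn a clock in MHz into cycles per nanosecond, and divided by 100 so the result is a percentage. The sketch below uses hypothetical function names and assumes nanosecond timestamps.

```python
LDS_BYTES_PER_BANK = 4  # the "* 4" factor in the formula


def lds_theoretical_bandwidth(idx_active, bank_conflict, lds_banks_per_cu,
                              start_ns, end_ns):
    """Bytes per ns the LDS could have moved (hypothetical helper)."""
    conflict_free_cycles = idx_active - bank_conflict
    total_bytes = conflict_free_cycles * LDS_BYTES_PER_BANK * int(lds_banks_per_cu)
    return total_bytes / (end_ns - start_ns)


def lds_bandwidth_utilization(achieved_bytes_per_ns, max_sclk_mhz, cu_per_gpu):
    """Percent of peak, using the same constant as the config."""
    # Assumed reading: 0.00128 = 128 B/cycle/CU * (1/1000 MHz -> cycles/ns) / 100.
    return achieved_bytes_per_ns / (max_sclk_mhz * cu_per_gpu * 0.00128)


bw = lds_theoretical_bandwidth(1_000_000, 0, 32, 0, 1_000_000)
print(bw, lds_bandwidth_utilization(bw, 1000, 1))
```

Under these assumptions, one CU at 1000 MHz that is conflict-free on every cycle of a 1 ms dispatch comes out at about 100 percent, which is consistent with the `Pct of Peak` unit.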
@@ -86,12 +90,12 @@ Panel Config:
 unit: (Instr + $normUnit)
 Theoretical Bandwidth:
 avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-/ $denom))
+/ (End_Timestamp - Start_Timestamp)))
 min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-/ $denom))
+/ (End_Timestamp - Start_Timestamp)))
 max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-/ $denom))
-unit: (Bytes + $normUnit)
+/ (End_Timestamp - Start_Timestamp)))
+unit: Gbps
 LDS Latency:
 avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
 None))
|
||||
+15
-12
@@ -3,15 +3,18 @@ Panel Config:
 id: 1300
 title: Instruction Cache
 metrics_description:
-Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the
-peak theoretical bandwidth. Calculated as the ratio of L1I requests over the
-total L1I cycles.
+Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
+of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
+the total L1I cycles.
 Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
 loaded line the cache. Calculated as the ratio of the number of L1I requests
 that hit over the number of all L1I requests.
-L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\
-\ bandwidth achieved. Calculated as the ratio of the total number of requests\
-\ from the L1I to the L2 cache over the total L1I-L2 interface cycles."
+L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\
+\ L2 cache request bandwidth achieved. Calculated as the ratio of the total\
+\ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\
+\ cycles."
+L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
+divided by total duration.
 Req: The total number of requests made to the L1I per normalization-unit
 Hits: The total number of L1I requests that hit on a previously loaded cache line,
 per normalization-unit.
@@ -30,7 +33,7 @@ Panel Config:
 value: Avg
 unit: Unit
 metric:
-Bandwidth:
+Bandwidth Utilization:
 value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
 - Start_Timestamp))))
 unit: Pct of Peak
@@ -38,7 +41,7 @@ Panel Config:
 value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
 + SQC_ICACHE_MISSES_DUPLICATE)))
 unit: Pct of Peak
-L1I-L2 Bandwidth:
+L1I-L2 Bandwidth Utilization:
 value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu)
 * (End_Timestamp - Start_Timestamp))))
 unit: Pct of Peak
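The factor `100000` in the two L1I utilization formulas above can be understood as `100` (for percent) times `1000` (to convert a clock in MHz and a duration in nanoseconds into a cycle count). That unit reading is an assumption, since the diff does not state the units. The Python sketch below uses hypothetical names. The extra divisor of 2 in the L1I-L2 variant is copied verbatim from the formula, and its hardware meaning is not given here.

```python
def l1i_bandwidth_utilization(icache_req, start_ns, end_ns,
                              max_sclk_mhz, sqc_per_gpu):
    """Requests per SQC cycle as a percent (hypothetical helper)."""
    duration_ns = end_ns - start_ns
    # cycles = max_sclk_mhz * duration_ns / 1000, so 100 * req / (cycles * sqc)
    # becomes req * 100000 / (max_sclk_mhz * sqc * duration_ns).
    return icache_req * 100000 / (max_sclk_mhz * sqc_per_gpu * duration_ns)


def l1i_l2_bandwidth_utilization(tc_inst_req, start_ns, end_ns,
                                 max_sclk_mhz, sqc_per_gpu):
    """Same shape, with the additional factor of 2 from the config."""
    duration_ns = end_ns - start_ns
    return tc_inst_req * 100000 / (2 * max_sclk_mhz * sqc_per_gpu * duration_ns)


print(l1i_bandwidth_utilization(500_000, 0, 1_000_000, 1000, 1))
print(l1i_l2_bandwidth_utilization(500_000, 0, 1_000_000, 1000, 1))
```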
@@ -100,7 +103,7 @@ Panel Config:
 unit: Unit
 metric:
 L1I-L2 Bandwidth:
-avg: AVG(((SQC_TC_INST_REQ * 64) / $denom))
-min: MIN(((SQC_TC_INST_REQ * 64) / $denom))
-max: MAX(((SQC_TC_INST_REQ * 64) / $denom))
-unit: (Bytes + $normUnit)
+avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+unit: Gbps
|
||||
+13
-10
@@ -3,14 +3,17 @@ Panel Config:
 id: 1400
 title: Scalar L1 Data Cache
 metrics_description:
-Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the
-peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the
-total sL1D cycles.
+Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
+of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
+over the total sL1D cycles.
 Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
 loaded line the cache. The ratio of the number of sL1D requests that hit over
 the number of all sL1D requests.
+sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
+bandwidth achieved.\ \ Calculated as total number of bytes read from, written
+to, or atomically updated\ \ across the sL1D - L2 interface.
 sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
-\ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\
+\ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\
 \ writes and atomics are typically unused on current CDNA accelerators, so in\
 \ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
 Req: The total number of requests, of any size or type, made to the sL1D per normalization
@@ -51,7 +54,7 @@ Panel Config:
 value: Avg
 unit: Unit
 metric:
-Bandwidth:
+Bandwidth Utilization:
 value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
 - Start_Timestamp))))
 unit: Pct of Peak
@@ -60,7 +63,7 @@ Panel Config:
 + SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES
 + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
 unit: Pct of Peak
-sL1D-L2 BW:
+sL1D-L2 BW Utilization:
 value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
 * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))
 unit: Pct of Peak
@@ -158,12 +161,12 @@ Panel Config:
 metric:
 sL1D-L2 BW:
 avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-* 64)) / $denom))
+* 64)) / (End_Timestamp - Start_Timestamp)))
 min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-* 64)) / $denom))
+* 64)) / (End_Timestamp - Start_Timestamp)))
 max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-* 64)) / $denom))
-unit: (Bytes + $normUnit)
+* 64)) / (End_Timestamp - Start_Timestamp)))
+unit: Gbps
 Read Req:
 avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
 min: MIN((SQC_TC_DATA_READ_REQ / $denom))
|
||||
+21
-21
@@ -5,12 +5,12 @@ Panel Config:
 metrics_description:
 Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
 cache over the total number of cache line requests to the vL1D Cache RAM.
-Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM
-instructions, as a percent of the peak theoretical bandwidth achievable on the
-specific accelerator. The number of bytes is calculated as the number of cache
-lines requested multiplied by the cache line size. This value does not consider
-partial requests, so for instance, if only a single value is requested in a
-cache line, the data movement will still be counted as a full cache line.
+Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
+of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
+on the specific accelerator. The number of bytes is calculated as the number
+of cache lines requested multiplied by the cache line size. This value does
+not consider partial requests, so for instance, if only a single value is requested
+in a cache line, the data movement will still be counted as a full cache line.
 Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
 The number of cycles where the vL1D Cache RAM is actively processing any request
 divided by the number of cycles where the vL1D is active.
@@ -42,11 +42,11 @@ Panel Config:
 Atomic Req: The total number of incoming atomic requests from the address processing
 unit after coalescing per normalization unit.
 Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
-instructions per normalization unit. The number of bytes is calculated as the
-number of cache lines requested multiplied by the cache line size. This value
-does not consider partial requests, so for instance, if only a single value
-is requested in a cache line, the data movement will still be counted as a full
-cache line.
+instructions divided by total duration. The number of bytes is calculated as
+the number of cache lines requested multiplied by the cache line size. This
+value does not consider partial requests, so for instance, if only a single
+value is requested in a cache line, the data movement will still be counted
+as a full cache line.
 Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
 vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
 Cache Accesses: The total number of cache line lookups in the vL1D.
@@ -57,7 +57,7 @@ Panel Config:
 command during the kernel's execution per normalization unit. This may be triggered
 by, for instance, the buffer_wbinvl1 instruction.
 L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
-of VMEM instructions, per normalization unit. The number of bytes is calculated
+of VMEM instructions, divided by total duration. The number of bytes is calculated
 as the number of cache lines requested multiplied by the cache line size. This
 value does not consider partial requests, so for instance, if only a single
 value is requested in a cache line, the data movement will still be counted
@@ -128,7 +128,7 @@ Panel Config:
 / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
 else None))
 unit: Pct of Peak
-Bandwidth:
+Bandwidth Utilization:
 value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp
 - Start_Timestamp)))) / ((($max_sclk / 1000) * 64) * $cu_per_gpu))
 unit: Pct of Peak
@@ -201,10 +201,10 @@ Panel Config:
 / $denom))
 unit: (Req + $normUnit)
 Cache BW:
-avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
-min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
-max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
-unit: (Bytes + $normUnit)
+avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
+min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
+max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
+unit: Gbps
 Cache Hit Rate:
 avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
 + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
@@ -242,12 +242,12 @@ Panel Config:
 unit: (Req + $normUnit)
 L1-L2 BW:
 avg: AVG(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
-+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
++ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
 min: MIN(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
-+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
++ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
 max: MAX(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
-+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
-unit: (Bytes + $normUnit)
++ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
+unit: Gbps
 L1-L2 Read:
 avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
 min: MIN((TCP_TCC_READ_REQ_sum / $denom))
|
||||
+30
-30
@@ -20,8 +20,8 @@ Panel Config:
 HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
 memory (HBM) per unit time. This value is calculated as the number of HBM channels
 multiplied by the HBM channel width multiplied by the HBM clock frequency.
-Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per
-normalization unit.
+Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided
+by total duration.
 HBM Read Traffic: The percent of read requests generated by the L2 cache that
 are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
 does not consider the size of the request (meaning that 32B and 64B requests
@@ -42,9 +42,9 @@ Panel Config:
 as a single request), so this metric only approximates the percent of the L2-Fabric
 read bandwidth directed to an uncached memory location.
 Write and Atomic BW: The total number of bytes written by the L2 over Infinity
-Fabric by write and atomic operations per normalization unit. Note that on current
-CDNA accelerators, such as the MI2XX, requests are only considered atomic by
-Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
+Fabric by write and atomic operations divided by total duration. Note that on
+current CDNA accelerators, such as the MI2XX, requests are only considered atomic
+by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
 fine-grained memory allocations or uncached memory allocations on the MI2XX.
 HBM Write and Atomic Traffic: The percent of write and atomic requests generated
 by the L2 cache that are routed to the accelerator's local high-bandwidth memory
@@ -82,17 +82,17 @@ Panel Config:
 Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
 Fabric before a completion acknowledgement (atomic without return value) or
 data (atomic with return value) was returned to the L2.
-Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit.
+Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration.
 The number of bytes is calculated as the number of cache lines requested multiplied
 by the cache line size. This value does not consider partial requests, so for
 example, if only a single value is requested in a cache line, the data movement
 will still be counted as a full cache line.
 Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests,
-per normalization unit.
+divided by total duration.
 Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests,
-per normalization unit.
+divided by total duration.
 Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests,
-per normalization unit.
+divided by total duration.
 Req: The total number of incoming requests to the L2 from all clients for all
 request types, per normalization unit.
 Read Req: The total number of read requests to the L2 from all clients.
@@ -150,11 +150,11 @@ Panel Config:
 64B of data from any source other than the accelerator's local HBM, per normalization
 unit.
 Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe
-traffic, per normalization unit.
+traffic, divided by total duration.
 "Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read
-requests due to Infinity Fabric traffic, per normalization unit.
+requests due to Infinity Fabric traffic, divided by total duration.
 Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM
-traffic, per normalization unit.
+traffic, divided by total duration.
 Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
 write or atomically update 32B of data to any memory location, per normalization
 unit.
@@ -171,17 +171,17 @@ Panel Config:
 write or atomically update 32B or 64B of data in any memory location other than
 the accelerator's local HBM, per normalization unit.
 Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to
-PCIe traffic, per normalization unit.
+PCIe traffic, divided by total duration.
 "Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write
-requests due to Infinity Fabric traffic, per normalization unit.
+requests due to Infinity Fabric traffic, divided by total duration.
 Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM
-traffic, per normalization unit.
+traffic, divided by total duration.
 Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to
-PCIe traffic, per normalization unit.
+PCIe traffic, divided by total duration.
 "Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic
-requests due to Infinity Fabric traffic, per normalization unit.
+requests due to Infinity Fabric traffic, divided by total duration.
 Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
-HBM traffic, per normalization unit.
+HBM traffic, divided by total duration.
 Atomic: The total number of L2 requests to Infinity Fabric to atomically update
 32B or 64B of data in any memory location, per normalization unit. See Request
 flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
@@ -257,12 +257,12 @@ Panel Config:
 metric:
 Read BW:
 avg: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
-* 64)) / $denom))
+* 64)) / (End_Timestamp - Start_Timestamp)))
 min: MIN((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
-* 64)) / $denom))
+* 64)) / (End_Timestamp - Start_Timestamp)))
 max: MAX((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
-* 64)) / $denom))
-unit: (Bytes + $normUnit)
+* 64)) / (End_Timestamp - Start_Timestamp)))
+unit: Gbps
 HBM Read Traffic:
 avg: AVG((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
 != 0) else None))
@@ -289,12 +289,12 @@ Panel Config:
 unit: pct
 Write and Atomic BW:
 avg: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
-* 32)) / $denom))
+* 32)) / (End_Timestamp - Start_Timestamp)))
 min: MIN((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
-* 32)) / $denom))
+* 32)) / (End_Timestamp - Start_Timestamp)))
 max: MAX((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
-* 32)) / $denom))
-unit: (Bytes + $normUnit)
+* 32)) / (End_Timestamp - Start_Timestamp)))
+unit: Gbps
 HBM Write and Atomic Traffic:
 avg: AVG((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
 != 0) else None))
@@ -362,10 +362,10 @@ Panel Config:
 unit: Unit
 metric:
 Bandwidth:
-avg: AVG((TCC_REQ_sum * 128) / $denom)
-min: MIN((TCC_REQ_sum * 128) / $denom)
-max: MAX((TCC_REQ_sum * 128) / $denom)
-unit: (Bytes + $normUnit)
+avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
+min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
+max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
+unit: Gbps
 Req:
 avg: AVG((TCC_REQ_sum / $denom))
 min: MIN((TCC_REQ_sum / $denom))
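This copy of the L2 panel multiplies `TCC_REQ_sum` by 128, whereas the earlier copy of the same hunk in this diff multiplies by 64. This presumably reflects a different L2 cache line size on the target architecture, though the diff does not say so. The following Python sketch treats the line size as a parameter. The function name is hypothetical, and nanosecond timestamps are an assumption.

```python
def l2_bandwidth(tcc_req, start_ns, end_ns, line_bytes):
    """Bytes per ns looked up in L2 (hypothetical helper).

    Every request is counted as a full cache line of `line_bytes`,
    matching the description's note about partial requests.
    """
    duration_ns = end_ns - start_ns
    if duration_ns <= 0:
        return None  # guard added for the sketch only
    return tcc_req * line_bytes / duration_ns


# Same counter values, two per-architecture line sizes seen in this diff.
print(l2_bandwidth(1000, 0, 1000, line_bytes=64))
print(l2_bandwidth(1000, 0, 1000, line_bytes=128))
```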
|
||||
+10
-6
@@ -11,8 +11,12 @@ Panel Config:
 instructions, averaged over the lifetime of the kernel. Calculated as the ratio
 of the total number of cycles spent by the scheduler issuing LDS instructions
 over the total CU cycles.
+Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
+could have been loaded from, stored to, or atomically updated in the LDS divided
+as percentage of theoretical peak. Does not take into account the execution
+mask of the wavefront when the instruction was executed.
 Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
-loaded from, stored to, or atomically updated in the LDS per normalization unit.
+loaded from, stored to, or atomically updated in the LDS divided by total duration.
 Does not take into account the execution mask of the wavefront when the instruction
 was executed.
 Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
@@ -58,7 +62,7 @@ Panel Config:
 Access Rate:
 value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
 unit: Pct of Peak
-Theoretical Bandwidth (% of Peak):
+Theoretical Bandwidth Utilization:
 value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
 / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
 unit: Pct of Peak
@@ -86,12 +90,12 @@ Panel Config:
 unit: (Instr + $normUnit)
 Theoretical Bandwidth:
 avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-/ $denom))
+/ (End_Timestamp - Start_Timestamp)))
 min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-/ $denom))
+/ (End_Timestamp - Start_Timestamp)))
 max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-/ $denom))
-unit: (Bytes + $normUnit)
+/ (End_Timestamp - Start_Timestamp)))
+unit: Gbps
 LDS Latency:
 avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
 None))
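The two LDS formulas in this file share a numerator, which the sketch below makes explicit. It is a hedged reading, not the tool's implementation: the function names are hypothetical, and it assumes timestamps in nanoseconds and `$max_sclk` in MHz. Under those assumptions the constant `0.00128` can be read as 128 bytes per cycle per CU, times 10^-3 to convert MHz to cycles per nanosecond, divided by 100 so the result is a percentage.

```python
# Sketch only. Assumptions: duration in ns, max_sclk in MHz, made-up inputs.

def lds_theoretical_bandwidth(idx_active, bank_conflict, lds_banks_per_cu, duration_ns):
    """Bytes that could have moved through the LDS, per nanosecond."""
    return ((idx_active - bank_conflict) * 4 * int(lds_banks_per_cu)) / duration_ns


def lds_bandwidth_utilization(idx_active, bank_conflict, lds_banks_per_cu,
                              duration_ns, max_sclk_mhz, cu_per_gpu):
    """Same numerator, expressed as a percentage of peak."""
    achieved = lds_theoretical_bandwidth(
        idx_active, bank_conflict, lds_banks_per_cu, duration_ns
    )
    # 0.00128 is taken verbatim from the formula in the diff.
    return achieved / (max_sclk_mhz * cu_per_gpu * 0.00128)


# One CU at 1000 MHz, busy on every cycle of a 1000 ns window, no bank conflicts.
pct = lds_bandwidth_utilization(
    idx_active=1000, bank_conflict=0, lds_banks_per_cu=32,
    duration_ns=1000, max_sclk_mhz=1000, cu_per_gpu=1,
)
print(round(pct, 6))
```

With 32 banks the example moves 128 bytes per cycle, which is why it lands at approximately 100 percent.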
+15
-12
@@ -3,15 +3,18 @@ Panel Config:
 id: 1300
 title: Instruction Cache
 metrics_description:
-Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the
-peak theoretical bandwidth. Calculated as the ratio of L1I requests over the
-total L1I cycles.
+Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
+of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
+the total L1I cycles.
 Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
 loaded line the cache. Calculated as the ratio of the number of L1I requests
 that hit over the number of all L1I requests.
-L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\
-\ bandwidth achieved. Calculated as the ratio of the total number of requests\
-\ from the L1I to the L2 cache over the total L1I-L2 interface cycles."
+L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\
+\ L2 cache request bandwidth achieved. Calculated as the ratio of the total\
+\ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\
+\ cycles."
+L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
+divided by total duration.
 Req: The total number of requests made to the L1I per normalization-unit
 Hits: The total number of L1I requests that hit on a previously loaded cache line,
 per normalization-unit.
@@ -30,7 +33,7 @@ Panel Config:
 value: Avg
 unit: Unit
 metric:
-Bandwidth:
+Bandwidth Utilization:
 value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
 - Start_Timestamp))))
 unit: Pct of Peak
@@ -38,7 +41,7 @@ Panel Config:
 value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
 + SQC_ICACHE_MISSES_DUPLICATE)))
 unit: Pct of Peak
-L1I-L2 Bandwidth:
+L1I-L2 Bandwidth Utilization:
 value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu)
 * (End_Timestamp - Start_Timestamp))))
 unit: Pct of Peak
@@ -100,7 +103,7 @@ Panel Config:
 unit: Unit
 metric:
 L1I-L2 Bandwidth:
-avg: AVG(((SQC_TC_INST_REQ * 64) / $denom))
-min: MIN(((SQC_TC_INST_REQ * 64) / $denom))
-max: MAX(((SQC_TC_INST_REQ * 64) / $denom))
-unit: (Bytes + $normUnit)
+avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+unit: Gbps
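The description of `Bandwidth Utilization` says it is the ratio of L1I requests over total L1I cycles, while the formula multiplies by `100000`. A minimal sketch of how the two can be reconciled is below. It assumes timestamps in nanoseconds and `$max_sclk` in MHz, so `100000` is read as 100 (percent) times 1000 (MHz to cycles per nanosecond). Function names and inputs are hypothetical, and the extra factor of 2 in the `L1I-L2 Bandwidth Utilization` denominator is not modeled here.

```python
# Sketch only. Assumptions: duration in ns, max_sclk in MHz, made-up inputs.

def l1i_utilization_as_written(icache_req, max_sclk_mhz, sqc_per_gpu, duration_ns):
    """Mirrors the shape of the formula in the diff."""
    return (icache_req * 100000) / ((max_sclk_mhz * sqc_per_gpu) * duration_ns)


def l1i_utilization_from_cycles(icache_req, max_sclk_mhz, sqc_per_gpu, duration_ns):
    """Same quantity, written as requests over available cycles."""
    cycles_per_sqc = (max_sclk_mhz / 1000) * duration_ns  # MHz -> cycles per ns
    total_cycles = cycles_per_sqc * sqc_per_gpu
    return 100 * icache_req / total_cycles


args = dict(icache_req=50_000, max_sclk_mhz=2000, sqc_per_gpu=10, duration_ns=10_000)
print(l1i_utilization_as_written(**args), l1i_utilization_from_cycles(**args))
```

Both functions should agree up to floating-point rounding for any positive inputs.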
+13
-10
@@ -3,14 +3,17 @@ Panel Config:
 id: 1400
 title: Scalar L1 Data Cache
 metrics_description:
-Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the
-peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the
-total sL1D cycles.
+Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
+of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
+over the total sL1D cycles.
 Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
 loaded line the cache. The ratio of the number of sL1D requests that hit over
 the number of all sL1D requests.
+sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
+bandwidth acheived.\ \ Caclulated as total number of bytes read from, written
+to, or atomically updated\ \ across the sL1D - L2 interface.
 sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
-\ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\
+\ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\
 \ writes and atomics are typically unused on current CDNA accelerators, so in\
 \ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
 Req: The total number of requests, of any size or type, made to the sL1D per normalization
@@ -51,7 +54,7 @@ Panel Config:
 value: Avg
 unit: Unit
 metric:
-Bandwidth:
+Bandwidth Utilization:
 value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
 - Start_Timestamp))))
 unit: Pct of Peak
@@ -60,7 +63,7 @@ Panel Config:
 + SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES
 + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
 unit: Pct of Peak
-sL1D-L2 BW:
+sL1D-L2 BW Utilization:
 value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
 * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))
 unit: Pct of Peak
@@ -158,12 +161,12 @@ Panel Config:
 metric:
 sL1D-L2 BW:
 avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-* 64)) / $denom))
+* 64)) / (End_Timestamp - Start_Timestamp)))
 min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-* 64)) / $denom))
+* 64)) / (End_Timestamp - Start_Timestamp)))
 max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-* 64)) / $denom))
-unit: (Bytes + $normUnit)
+* 64)) / (End_Timestamp - Start_Timestamp)))
+unit: Gbps
 Read Req:
 avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
 min: MIN((SQC_TC_DATA_READ_REQ / $denom))
+21
-21
@@ -5,12 +5,12 @@ Panel Config:
 metrics_description:
 Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
 cache over the total number of cache line requests to the vL1D Cache RAM.
-Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM
-instructions, as a percent of the peak theoretical bandwidth achievable on the
-specific accelerator. The number of bytes is calculated as the number of cache
-lines requested multiplied by the cache line size. This value does not consider
-partial requests, so for instance, if only a single value is requested in a
-cache line, the data movement will still be counted as a full cache line.
+Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
+of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
+on the specific accelerator. The number of bytes is calculated as the number
+of cache lines requested multiplied by the cache line size. This value does
+not consider partial requests, so for instance, if only a single value is requested
+in a cache line, the data movement will still be counted as a full cache line.
 Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
 The number of cycles where the vL1D Cache RAM is actively processing any request
 divided by the number of cycles where the vL1D is active.
@@ -42,11 +42,11 @@ Panel Config:
 Atomic Req: The total number of incoming atomic requests from the address processing
 unit after coalescing per normalization unit.
 Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
-instructions per normalization unit. The number of bytes is calculated as the
-number of cache lines requested multiplied by the cache line size. This value
-does not consider partial requests, so for instance, if only a single value
-is requested in a cache line, the data movement will still be counted as a full
-cache line.
+instructions divided by total duration. The number of bytes is calculated as
+the number of cache lines requested multiplied by the cache line size. This
+value does not consider partial requests, so for instance, if only a single
+value is requested in a cache line, the data movement will still be counted
+as a full cache line.
 Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
 vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
 Cache Accesses: The total number of cache line lookups in the vL1D.
@@ -57,7 +57,7 @@ Panel Config:
 command during the kernel's execution per normalization unit. This may be triggered
 by, for instance, the buffer_wbinvl1 instruction.
 L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
-of VMEM instructions, per normalization unit. The number of bytes is calculated
+of VMEM instructions, divided by total duration. The number of bytes is calculated
 as the number of cache lines requested multiplied by the cache line size. This
 value does not consider partial requests, so for instance, if only a single
 value is requested in a cache line, the data movement will still be counted
@@ -128,7 +128,7 @@ Panel Config:
 / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
 else None))
 unit: Pct of Peak
-Bandwidth:
+Bandwidth Utilization:
 value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp
 - Start_Timestamp)))) / ((($max_sclk / 1000) * 128) * $cu_per_gpu))
 unit: Pct of Peak
@@ -201,10 +201,10 @@ Panel Config:
 / $denom))
 unit: (Req + $normUnit)
 Cache BW:
-avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
-min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
-max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
-unit: (Bytes + $normUnit)
+avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
+min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
+max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
+unit: Gbps
 Cache Hit Rate:
 avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
 + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
@@ -242,12 +242,12 @@ Panel Config:
 unit: (Req + $normUnit)
 L1-L2 BW:
 avg: AVG(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
-+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
++ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
 min: MIN(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
-+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
++ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
 max: MAX(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
-+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
-unit: (Bytes + $normUnit)
++ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
+unit: Gbps
 L1-L2 Read:
 avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
 min: MIN((TCP_TCC_READ_REQ_sum / $denom))
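The `L1-L2 BW` metric weights reads and writes differently and then reports `avg`, `min`, and `max`. The sketch below illustrates one way to read that, assuming the aggregates are taken over per-dispatch values and that timestamps are in nanoseconds. The record layout, field names, and numbers are hypothetical and not taken from the profiler.

```python
from statistics import mean

# Sketch only: hypothetical per-dispatch records with made-up values.
dispatches = [
    {"read": 1000, "write": 500, "atomic_ret": 0, "atomic_noret": 0,
     "start_ns": 0, "end_ns": 1000},
    {"read": 2000, "write": 0, "atomic_ret": 100, "atomic_noret": 100,
     "start_ns": 1000, "end_ns": 3000},
]


def l1_l2_bytes(d):
    """128 bytes per read request, 64 bytes per write or atomic request,
    following the weights in the L1-L2 BW formula."""
    return 128 * d["read"] + 64 * (d["write"] + d["atomic_ret"] + d["atomic_noret"])


def l1_l2_bw(d):
    """Bytes divided by dispatch duration (bytes/ns under the ns assumption)."""
    return l1_l2_bytes(d) / (d["end_ns"] - d["start_ns"])


per_dispatch = [l1_l2_bw(d) for d in dispatches]
summary = {
    "avg": mean(per_dispatch),
    "min": min(per_dispatch),
    "max": max(per_dispatch),
}
print(summary)
```

Because each dispatch is divided by its own duration, a short dispatch with heavy traffic can set `max` even if it moves fewer total bytes than a longer one.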
+30
-30
@@ -20,8 +20,8 @@ Panel Config:
 HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
 memory (HBM) per unit time. This value is calculated as the number of HBM channels
 multiplied by the HBM channel width multiplied by the HBM clock frequency.
-Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per
-normalization unit.
+Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided
+by total duration.
 HBM Read Traffic: The percent of read requests generated by the L2 cache that
 are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
 does not consider the size of the request (meaning that 32B and 64B requests
@@ -42,9 +42,9 @@ Panel Config:
 as a single request), so this metric only approximates the percent of the L2-Fabric
 read bandwidth directed to an uncached memory location.
 Write and Atomic BW: The total number of bytes written by the L2 over Infinity
-Fabric by write and atomic operations per normalization unit. Note that on current
-CDNA accelerators, such as the MI2XX, requests are only considered atomic by
-Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
+Fabric by write and atomic operations divided by total duration. Note that on
+current CDNA accelerators, such as the MI2XX, requests are only considered atomic
+by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
 fine-grained memory allocations or uncached memory allocations on the MI2XX.
 HBM Write and Atomic Traffic: The percent of write and atomic requests generated
 by the L2 cache that are routed to the accelerator's local high-bandwidth memory
@@ -82,17 +82,17 @@ Panel Config:
 Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
 Fabric before a completion acknowledgement (atomic without return value) or
 data (atomic with return value) was returned to the L2.
-Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit.
+Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration.
 The number of bytes is calculated as the number of cache lines requested multiplied
 by the cache line size. This value does not consider partial requests, so for
 example, if only a single value is requested in a cache line, the data movement
 will still be counted as a full cache line.
 Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests,
-per normalization unit.
+divided by total duration.
 Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests,
-per normalization unit.
+divided by total duration.
 Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests,
-per normalization unit.
+divided by total duration.
 Req: The total number of incoming requests to the L2 from all clients for all
 request types, per normalization unit.
 Read Req: The total number of read requests to the L2 from all clients.
@@ -150,11 +150,11 @@ Panel Config:
 64B of data from any source other than the accelerator's local HBM, per normalization
 unit.
 Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe
-traffic, per normalization unit.
+traffic, divided by total duration.
 "Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read
-requests due to Infinity Fabric traffic, per normalization unit.
+requests due to Infinity Fabric traffic, divided by total duration.
 Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM
-traffic, per normalization unit.
+traffic, divided by total duration.
 Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
 write or atomically update 32B of data to any memory location, per normalization
 unit.
@@ -171,17 +171,17 @@ Panel Config:
 write or atomically update 32B or 64B of data in any memory location other than
 the accelerator's local HBM, per normalization unit.
 Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to
-PCIe traffic, per normalization unit.
+PCIe traffic, divided by total duration.
 "Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write
-requests due to Infinity Fabric traffic, per normalization unit.
+requests due to Infinity Fabric traffic, divided by total duration.
 Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM
-traffic, per normalization unit.
+traffic, divided by total duration.
 Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to
-PCIe traffic, per normalization unit.
+PCIe traffic, divided by total duration.
 "Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic
-requests due to Infinity Fabric traffic, per normalization unit.
+requests due to Infinity Fabric traffic, divided by total duration.
 Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
-HBM traffic, per normalization unit.
+HBM traffic, divided by total duration.
 Atomic: The total number of L2 requests to Infinity Fabric to atomically update
 32B or 64B of data in any memory location, per normalization unit. See Request
 flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
@@ -257,12 +257,12 @@ Panel Config:
 metric:
 Read BW:
 avg: AVG((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
-* 64)) / $denom))
+* 64)) / (End_Timestamp - Start_Timestamp)))
 min: MIN((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
-* 64)) / $denom))
+* 64)) / (End_Timestamp - Start_Timestamp)))
 max: MAX((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
-* 64)) / $denom))
-unit: (Bytes + $normUnit)
+* 64)) / (End_Timestamp - Start_Timestamp)))
+unit: Gbps
 HBM Read Traffic:
 avg: AVG((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
 != 0) else None))
@@ -289,12 +289,12 @@ Panel Config:
 unit: pct
 Write and Atomic BW:
 avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
-* 32)) / $denom))
+* 32)) / (End_Timestamp - Start_Timestamp)))
 min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
-* 32)) / $denom))
+* 32)) / (End_Timestamp - Start_Timestamp)))
 max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
-* 32)) / $denom))
-unit: (Bytes + $normUnit)
+* 32)) / (End_Timestamp - Start_Timestamp)))
+unit: Gbps
 HBM Write and Atomic Traffic:
 avg: AVG((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
 != 0) else None))
@@ -362,10 +362,10 @@ Panel Config:
 unit: Unit
 metric:
 Bandwidth:
-avg: AVG((TCC_REQ_sum * 128) / $denom)
-min: MIN((TCC_REQ_sum * 128) / $denom)
-max: MAX((TCC_REQ_sum * 128) / $denom)
-unit: (Bytes + $normUnit)
+avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
+min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
+max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
+unit: Gbps
 Req:
 avg: AVG((TCC_REQ_sum / $denom))
 min: MIN((TCC_REQ_sum / $denom))
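The `Read BW` and `Write and Atomic BW` formulas in this file split requests into 32-byte and 64-byte classes before dividing by the duration. A minimal sketch of that byte accounting follows, with hypothetical function names and made-up counter values. It assumes, as the formulas imply, that every read not counted as 32B is 64B and every write not counted as 64B is 32B, and that timestamps are in nanoseconds.

```python
# Sketch only: counter values are made up for illustration.

def fabric_read_bytes(rdreq_sum: int, rdreq_32b_sum: int) -> int:
    """32 bytes for each 32B read, 64 bytes for every other read."""
    return rdreq_32b_sum * 32 + (rdreq_sum - rdreq_32b_sum) * 64


def fabric_write_bytes(wrreq_sum: int, wrreq_64b_sum: int) -> int:
    """64 bytes for each 64B write or atomic, 32 bytes for every other one."""
    return wrreq_64b_sum * 64 + (wrreq_sum - wrreq_64b_sum) * 32


def per_duration(total_bytes: int, start_ns: int, end_ns: int) -> float:
    """Bytes divided by duration (bytes/ns under the ns assumption)."""
    return total_bytes / (end_ns - start_ns)


read_bytes = fabric_read_bytes(rdreq_sum=100, rdreq_32b_sum=40)
write_bytes = fabric_write_bytes(wrreq_sum=100, wrreq_64b_sum=40)
print(read_bytes, write_bytes)                # 5120 4480
print(per_duration(read_bytes, 0, 1000))      # 5.12
```

The two helpers are deliberately asymmetric, matching the counters each formula subtracts from the total.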
+10
-6
@@ -11,8 +11,12 @@ Panel Config:
 instructions, averaged over the lifetime of the kernel. Calculated as the ratio
 of the total number of cycles spent by the scheduler issuing LDS instructions
 over the total CU cycles.
+Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
+could have been loaded from, stored to, or atomically updated in the LDS divided
+as percentage of theoretical peak. Does not take into account the execution
+mask of the wavefront when the instruction was executed.
 Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
-loaded from, stored to, or atomically updated in the LDS per normalization unit.
+loaded from, stored to, or atomically updated in the LDS divided by total duration.
 Does not take into account the execution mask of the wavefront when the instruction
 was executed.
 Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
@@ -58,7 +62,7 @@ Panel Config:
 Access Rate:
 value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
 unit: Pct of Peak
-Theoretical Bandwidth (% of Peak):
+Theoretical Bandwidth Utilization:
 value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
 / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
 unit: Pct of Peak
@@ -86,12 +90,12 @@ Panel Config:
 unit: (Instr + $normUnit)
 Theoretical Bandwidth:
 avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-/ $denom))
+/ (End_Timestamp - Start_Timestamp)))
 min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-/ $denom))
+/ (End_Timestamp - Start_Timestamp)))
 max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
-/ $denom))
-unit: (Bytes + $normUnit)
+/ (End_Timestamp - Start_Timestamp)))
+unit: Gbps
 LDS Latency:
 avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
 None))
+15
-12
@@ -3,15 +3,18 @@ Panel Config:
 id: 1300
 title: Instruction Cache
 metrics_description:
-Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the
-peak theoretical bandwidth. Calculated as the ratio of L1I requests over the
-total L1I cycles.
+Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
+of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
+the total L1I cycles.
 Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
 loaded line the cache. Calculated as the ratio of the number of L1I requests
 that hit over the number of all L1I requests.
-L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\
-\ bandwidth achieved. Calculated as the ratio of the total number of requests\
-\ from the L1I to the L2 cache over the total L1I-L2 interface cycles."
+L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\
+\ L2 cache request bandwidth achieved. Calculated as the ratio of the total\
+\ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\
+\ cycles."
+L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
+divided by total duration.
 Req: The total number of requests made to the L1I per normalization-unit
 Hits: The total number of L1I requests that hit on a previously loaded cache line,
 per normalization-unit.
@@ -30,7 +33,7 @@ Panel Config:
 value: Avg
 unit: Unit
 metric:
-Bandwidth:
+Bandwidth Utilization:
 value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
 - Start_Timestamp))))
 unit: Pct of Peak
@@ -38,7 +41,7 @@ Panel Config:
 value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
 + SQC_ICACHE_MISSES_DUPLICATE)))
 unit: Pct of Peak
-L1I-L2 Bandwidth:
+L1I-L2 Bandwidth Utilization:
 value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu)
 * (End_Timestamp - Start_Timestamp))))
 unit: Pct of Peak
@@ -100,7 +103,7 @@ Panel Config:
 unit: Unit
 metric:
 L1I-L2 Bandwidth:
-avg: AVG(((SQC_TC_INST_REQ * 64) / $denom))
-min: MIN(((SQC_TC_INST_REQ * 64) / $denom))
-max: MAX(((SQC_TC_INST_REQ * 64) / $denom))
-unit: (Bytes + $normUnit)
+avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
+unit: Gbps
+13
-10
@@ -3,14 +3,17 @@ Panel Config:
 id: 1400
 title: Scalar L1 Data Cache
 metrics_description:
-Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the
-peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the
-total sL1D cycles.
+Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
+of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
+over the total sL1D cycles.
 Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
 loaded line the cache. The ratio of the number of sL1D requests that hit over
 the number of all sL1D requests.
+sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
+bandwidth acheived.\ \ Caclulated as total number of bytes read from, written
+to, or atomically updated\ \ across the sL1D - L2 interface.
 sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
-\ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\
+\ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\
 \ writes and atomics are typically unused on current CDNA accelerators, so in\
 \ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
 Req: The total number of requests, of any size or type, made to the sL1D per normalization
@@ -51,7 +54,7 @@ Panel Config:
 value: Avg
 unit: Unit
 metric:
-Bandwidth:
+Bandwidth Utilization:
 value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
 - Start_Timestamp))))
 unit: Pct of Peak
@@ -60,7 +63,7 @@ Panel Config:
 + SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES
 + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
 unit: Pct of Peak
-sL1D-L2 BW:
+sL1D-L2 BW Utilization:
 value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
 * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))
 unit: Pct of Peak
@@ -158,12 +161,12 @@ Panel Config:
 metric:
 sL1D-L2 BW:
 avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-* 64)) / $denom))
+* 64)) / (End_Timestamp - Start_Timestamp)))
 min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-* 64)) / $denom))
+* 64)) / (End_Timestamp - Start_Timestamp)))
 max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
-* 64)) / $denom))
-unit: (Bytes + $normUnit)
+* 64)) / (End_Timestamp - Start_Timestamp)))
+unit: Gbps
 Read Req:
 avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
 min: MIN((SQC_TC_DATA_READ_REQ / $denom))
+21
-21
@@ -5,12 +5,12 @@ Panel Config:
 metrics_description:
 Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
 cache over the total number of cache line requests to the vL1D Cache RAM.
-Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM
-instructions, as a percent of the peak theoretical bandwidth achievable on the
-specific accelerator. The number of bytes is calculated as the number of cache
-lines requested multiplied by the cache line size. This value does not consider
-partial requests, so for instance, if only a single value is requested in a
-cache line, the data movement will still be counted as a full cache line.
+Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
+of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
+on the specific accelerator. The number of bytes is calculated as the number
+of cache lines requested multiplied by the cache line size. This value does
+not consider partial requests, so for instance, if only a single value is requested
+in a cache line, the data movement will still be counted as a full cache line.
 Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
 The number of cycles where the vL1D Cache RAM is actively processing any request
 divided by the number of cycles where the vL1D is active.
@@ -42,11 +42,11 @@ Panel Config:
 Atomic Req: The total number of incoming atomic requests from the address processing
 unit after coalescing per normalization unit.
 Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
-instructions per normalization unit. The number of bytes is calculated as the
-number of cache lines requested multiplied by the cache line size. This value
-does not consider partial requests, so for instance, if only a single value
-is requested in a cache line, the data movement will still be counted as a full
-cache line.
+instructions divided by total duration. The number of bytes is calculated as
+the number of cache lines requested multiplied by the cache line size. This
+value does not consider partial requests, so for instance, if only a single
+value is requested in a cache line, the data movement will still be counted
+as a full cache line.
 Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
 vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
 Cache Accesses: The total number of cache line lookups in the vL1D.
@@ -57,7 +57,7 @@ Panel Config:
 command during the kernel's execution per normalization unit. This may be triggered
 by, for instance, the buffer_wbinvl1 instruction.
 L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
-of VMEM instructions, per normalization unit. The number of bytes is calculated
+of VMEM instructions, divided by total duration. The number of bytes is calculated
 as the number of cache lines requested multiplied by the cache line size. This
 value does not consider partial requests, so for instance, if only a single
 value is requested in a cache line, the data movement will still be counted
@@ -128,7 +128,7 @@ Panel Config:
|
||||
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
|
||||
else None))
|
||||
unit: Pct of Peak
|
||||
Bandwidth:
|
||||
Bandwidth Utilization:
|
||||
value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp
|
||||
- Start_Timestamp)))) / ((($max_sclk / 1000) * 128) * $cu_per_gpu))
|
||||
unit: Pct of Peak
|
||||
@@ -201,10 +201,10 @@ Panel Config:
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Cache BW:
|
||||
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
|
||||
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
|
||||
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
|
||||
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: Gbps
|
||||
Cache Hit Rate:
|
||||
avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
@@ -242,12 +242,12 @@ Panel Config:
|
||||
unit: (Req + $normUnit)
|
||||
L1-L2 BW:
|
||||
avg: AVG(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
|
||||
max: MAX(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: Gbps
|
||||
L1-L2 Read:
|
||||
avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_READ_REQ_sum / $denom))
|
||||
|
||||
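The vL1D hunks above replace `$denom` with `(End_Timestamp - Start_Timestamp)` in the `Cache BW` and `L1-L2 BW` formulas. A minimal Python sketch of that arithmetic follows. The function and parameter names are hypothetical and not part of the profiler, and it assumes the timestamps are in nanoseconds. Under that assumption the quotient is bytes per nanosecond, which is numerically gigabytes per second; the `Gbps` label is taken from the diff as-is.

```python
def cache_bw(total_cache_accesses, start_ts_ns, end_ts_ns):
    """Cache BW: 128-byte cache lines looked up, divided by kernel duration."""
    return (total_cache_accesses * 128) / (end_ts_ns - start_ts_ns)


def l1_l2_bw(read_req, write_req, atomic_ret_req, atomic_noret_req,
             start_ts_ns, end_ts_ns):
    """L1-L2 BW: reads weighted by 128 B, writes and atomics by 64 B."""
    total_bytes = 128 * read_req + 64 * (
        write_req + atomic_ret_req + atomic_noret_req
    )
    return total_bytes / (end_ts_ns - start_ts_ns)
```

For example, 1000 cache accesses over a 1000 ns kernel give `cache_bw(1000, 0, 1000) == 128.0`.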
+30
-30
@@ -20,8 +20,8 @@ Panel Config:
|
||||
HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
|
||||
memory (HBM) per unit time. This value is calculated as the number of HBM channels
|
||||
multiplied by the HBM channel width multiplied by the HBM clock frequency.
|
||||
Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per
|
||||
normalization unit.
|
||||
Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided
|
||||
by total duration.
|
||||
HBM Read Traffic: The percent of read requests generated by the L2 cache that
|
||||
are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
|
||||
does not consider the size of the request (meaning that 32B and 64B requests
|
||||
@@ -42,9 +42,9 @@ Panel Config:
|
||||
as a single request), so this metric only approximates the percent of the L2-Fabric
|
||||
read bandwidth directed to an uncached memory location.
|
||||
Write and Atomic BW: The total number of bytes written by the L2 over Infinity
|
||||
Fabric by write and atomic operations per normalization unit. Note that on current
|
||||
CDNA accelerators, such as the MI2XX, requests are only considered atomic by
|
||||
Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
|
||||
Fabric by write and atomic operations divided by total duration. Note that on
|
||||
current CDNA accelerators, such as the MI2XX, requests are only considered atomic
|
||||
by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
|
||||
fine-grained memory allocations or uncached memory allocations on the MI2XX.
|
||||
HBM Write and Atomic Traffic: The percent of write and atomic requests generated
|
||||
by the L2 cache that are routed to the accelerator's local high-bandwidth memory
|
||||
@@ -82,17 +82,17 @@ Panel Config:
|
||||
Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
|
||||
Fabric before a completion acknowledgement (atomic without return value) or
|
||||
data (atomic with return value) was returned to the L2.
|
||||
Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit.
|
||||
Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration.
|
||||
The number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so for
|
||||
example, if only a single value is requested in a cache line, the data movement
|
||||
will still be counted as a full cache line.
|
||||
Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests,
|
||||
per normalization unit.
|
||||
divided by total duration.
|
||||
Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests,
|
||||
per normalization unit.
|
||||
divided by total duration.
|
||||
Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests,
|
||||
per normalization unit.
|
||||
divided by total duration.
|
||||
Req: The total number of incoming requests to the L2 from all clients for all
|
||||
request types, per normalization unit.
|
||||
Read Req: The total number of read requests to the L2 from all clients.
|
||||
@@ -150,11 +150,11 @@ Panel Config:
|
||||
64B of data from any source other than the accelerator's local HBM, per normalization
|
||||
unit.
|
||||
Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe
|
||||
traffic, per normalization unit.
|
||||
traffic, divided by total duration.
|
||||
"Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read
|
||||
requests due to Infinity Fabric traffic, per normalization unit.
|
||||
requests due to Infinity Fabric traffic, divided by total duration.
|
||||
Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM
|
||||
traffic, per normalization unit.
|
||||
traffic, divided by total duration.
|
||||
Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
|
||||
write or atomically update 32B of data to any memory location, per normalization
|
||||
unit.
|
||||
@@ -171,17 +171,17 @@ Panel Config:
|
||||
write or atomically update 32B or 64B of data in any memory location other than
|
||||
the accelerator's local HBM, per normalization unit.
|
||||
Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to
|
||||
PCIe traffic, per normalization unit.
|
||||
PCIe traffic, divided by total duration.
|
||||
"Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write
|
||||
requests due to Infinity Fabric traffic, per normalization unit.
|
||||
requests due to Infinity Fabric traffic, divided by total duration.
|
||||
Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM
|
||||
traffic, per normalization unit.
|
||||
traffic, divided by total duration.
|
||||
Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to
|
||||
PCIe traffic, per normalization unit.
|
||||
PCIe traffic, divided by total duration.
|
||||
"Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic
|
||||
requests due to Infinity Fabric traffic, per normalization unit.
|
||||
requests due to Infinity Fabric traffic, divided by total duration.
|
||||
Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
|
||||
HBM traffic, per normalization unit.
|
||||
HBM traffic, divided by total duration.
|
||||
Atomic: The total number of L2 requests to Infinity Fabric to atomically update
|
||||
32B or 64B of data in any memory location, per normalization unit. See Request
|
||||
flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
|
||||
@@ -257,12 +257,12 @@ Panel Config:
|
||||
metric:
|
||||
Read BW:
|
||||
avg: AVG((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
|
||||
* 64)) / $denom))
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
|
||||
* 64)) / $denom))
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
max: MAX((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
|
||||
* 64)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: Gbps
|
||||
HBM Read Traffic:
|
||||
avg: AVG((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else None))
|
||||
@@ -289,12 +289,12 @@ Panel Config:
|
||||
unit: pct
|
||||
Write and Atomic BW:
|
||||
avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
* 32)) / (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
* 32)) / (End_Timestamp - Start_Timestamp)))
|
||||
max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
* 32)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: Gbps
|
||||
HBM Write and Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
@@ -362,10 +362,10 @@ Panel Config:
|
||||
unit: Unit
|
||||
metric:
|
||||
Bandwidth:
|
||||
avg: AVG((TCC_REQ_sum * 128) / $denom)
|
||||
min: MIN((TCC_REQ_sum * 128) / $denom)
|
||||
max: MAX((TCC_REQ_sum * 128) / $denom)
|
||||
unit: (Bytes + $normUnit)
|
||||
avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
|
||||
min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
|
||||
max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
Req:
|
||||
avg: AVG((TCC_REQ_sum / $denom))
|
||||
min: MIN((TCC_REQ_sum / $denom))
|
||||
|
||||
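The L2-Fabric `Read BW` and `Write and Atomic BW` formulas in this file weight 32B and 64B requests separately before dividing by the kernel duration. The sketch below mirrors that weighting in Python. The function names are illustrative only, and the timestamp unit (nanoseconds) is an assumption.

```python
def fabric_read_bw(rdreq, rdreq_32b, start_ts, end_ts):
    """Read BW: 32B reads count 32 bytes, all remaining reads count 64 bytes."""
    bytes_read = rdreq_32b * 32 + (rdreq - rdreq_32b) * 64
    return bytes_read / (end_ts - start_ts)


def fabric_write_atomic_bw(wrreq, wrreq_64b, start_ts, end_ts):
    """Write and Atomic BW: 64B writes count 64 bytes, the rest count 32 bytes."""
    bytes_written = wrreq_64b * 64 + (wrreq - wrreq_64b) * 32
    return bytes_written / (end_ts - start_ts)
```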
+10
-6
@@ -11,8 +11,12 @@ Panel Config:
|
||||
instructions, averaged over the lifetime of the kernel. Calculated as the ratio
|
||||
of the total number of cycles spent by the scheduler issuing LDS instructions
|
||||
over the total CU cycles.
|
||||
Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
|
||||
could have been loaded from, stored to, or atomically updated in the LDS, expressed
|
||||
as a percentage of the theoretical peak. Does not take into account the execution
|
||||
mask of the wavefront when the instruction was executed.
|
||||
Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
|
||||
loaded from, stored to, or atomically updated in the LDS per normalization unit.
|
||||
loaded from, stored to, or atomically updated in the LDS divided by total duration.
|
||||
Does not take into account the execution mask of the wavefront when the instruction
|
||||
was executed.
|
||||
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
|
||||
@@ -58,7 +62,7 @@ Panel Config:
|
||||
Access Rate:
|
||||
value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: Pct of Peak
|
||||
Theoretical Bandwidth (% of Peak):
|
||||
Theoretical Bandwidth Utilization:
|
||||
value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
|
||||
unit: Pct of Peak
|
||||
@@ -86,12 +90,12 @@ Panel Config:
|
||||
unit: (Instr + $normUnit)
|
||||
Theoretical Bandwidth:
|
||||
avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ $denom))
|
||||
/ (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ $denom))
|
||||
/ (End_Timestamp - Start_Timestamp)))
|
||||
max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
/ (End_Timestamp - Start_Timestamp)))
|
||||
unit: Gbps
|
||||
LDS Latency:
|
||||
avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
|
||||
None))
|
||||
|
||||
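The LDS `Theoretical Bandwidth` and `Theoretical Bandwidth Utilization` formulas share the same numerator. One reading of the `0.00128` constant, offered as an assumption rather than a statement of the authors' intent: with `$max_sclk` in MHz, timestamps in nanoseconds, and 128 bytes per cycle per CU, the peak is `max_sclk * 1e-3 * 128 * cu_per_gpu` bytes per nanosecond, and folding in the factor of 100 for percent gives `0.00128`. The Python names below are hypothetical.

```python
def lds_theoretical_bw(idx_active, bank_conflict, lds_banks_per_cu,
                       start_ts, end_ts):
    """Bytes moved in conflict-free LDS cycles (4 B per bank) per unit time."""
    conflict_free_cycles = idx_active - bank_conflict
    return (conflict_free_cycles * 4 * int(lds_banks_per_cu)) / (end_ts - start_ts)


def lds_bw_utilization(idx_active, bank_conflict, lds_banks_per_cu,
                       start_ts, end_ts, max_sclk, cu_per_gpu):
    """Same numerator, scaled by the peak term used in the diff's formula."""
    bw = lds_theoretical_bw(idx_active, bank_conflict, lds_banks_per_cu,
                            start_ts, end_ts)
    return bw / ((max_sclk * cu_per_gpu) * 0.00128)
```

With 1000 conflict-free cycles, 32 banks, a 1000-unit duration, `max_sclk = 1000`, and one CU, the bandwidth is 128.0 and the utilization comes out at roughly 100.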
+15
-12
@@ -3,15 +3,18 @@ Panel Config:
|
||||
id: 1300
|
||||
title: Instruction Cache
|
||||
metrics_description:
|
||||
Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the
|
||||
peak theoretical bandwidth. Calculated as the ratio of L1I requests over the
|
||||
total L1I cycles.
|
||||
Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
|
||||
of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
|
||||
the total L1I cycles.
|
||||
Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
|
||||
loaded line in the cache. Calculated as the ratio of the number of L1I requests
|
||||
that hit over the number of all L1I requests.
|
||||
L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\
|
||||
\ bandwidth achieved. Calculated as the ratio of the total number of requests\
|
||||
\ from the L1I to the L2 cache over the total L1I-L2 interface cycles."
|
||||
L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\
|
||||
\ L2 cache request bandwidth achieved. Calculated as the ratio of the total\
|
||||
\ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\
|
||||
\ cycles."
|
||||
L1I-L2 Bandwidth: Total number of bytes transferred across the L1I - L2 interface
|
||||
divided by total duration.
|
||||
Req: The total number of requests made to the L1I per normalization unit.
|
||||
Hits: The total number of L1I requests that hit on a previously loaded cache line,
|
||||
per normalization-unit.
|
||||
@@ -30,7 +33,7 @@ Panel Config:
|
||||
value: Avg
|
||||
unit: Unit
|
||||
metric:
|
||||
Bandwidth:
|
||||
Bandwidth Utilization:
|
||||
value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
|
||||
- Start_Timestamp))))
|
||||
unit: Pct of Peak
|
||||
@@ -38,7 +41,7 @@ Panel Config:
|
||||
value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
|
||||
+ SQC_ICACHE_MISSES_DUPLICATE)))
|
||||
unit: Pct of Peak
|
||||
L1I-L2 Bandwidth:
|
||||
L1I-L2 Bandwidth Utilization:
|
||||
value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu)
|
||||
* (End_Timestamp - Start_Timestamp))))
|
||||
unit: Pct of Peak
|
||||
@@ -100,7 +103,7 @@ Panel Config:
|
||||
unit: Unit
|
||||
metric:
|
||||
L1I-L2 Bandwidth:
|
||||
avg: AVG(((SQC_TC_INST_REQ * 64) / $denom))
|
||||
min: MIN(((SQC_TC_INST_REQ * 64) / $denom))
|
||||
max: MAX(((SQC_TC_INST_REQ * 64) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: Gbps
|
||||
|
||||
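In the instruction cache panel, the renamed `Bandwidth Utilization` metric is a percent of peak, while the new `L1I-L2 Bandwidth` is a byte rate. The following Python sketch shows both side by side so the difference in units is visible. The names are illustrative, and the units of `max_sclk` (MHz) and of the timestamps (nanoseconds) are assumptions.

```python
def l1i_bw_utilization(icache_req, max_sclk, sqc_per_gpu, start_ts, end_ts):
    """Percent of peak: L1I requests relative to available L1I cycles."""
    return (icache_req * 100000) / (
        (max_sclk * sqc_per_gpu) * (end_ts - start_ts)
    )


def l1i_l2_bw(tc_inst_req, start_ts, end_ts):
    """Byte rate: 64 B per L1I-to-L2 request, divided by kernel duration."""
    return (tc_inst_req * 64) / (end_ts - start_ts)
```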
+13
-10
@@ -3,14 +3,17 @@ Panel Config:
|
||||
id: 1400
|
||||
title: Scalar L1 Data Cache
|
||||
metrics_description:
|
||||
Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the
|
||||
peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the
|
||||
total sL1D cycles.
|
||||
Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
|
||||
of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
|
||||
over the total sL1D cycles.
|
||||
Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
|
||||
loaded line in the cache. The ratio of the number of sL1D requests that hit over
|
||||
the number of all sL1D requests.
|
||||
sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
|
||||
bandwidth achieved. Calculated as the total number of bytes read from, written
|
||||
to, or atomically updated across the sL1D - L2 interface.
|
||||
sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
|
||||
\ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\
|
||||
\ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\
|
||||
\ writes and atomics are typically unused on current CDNA accelerators, so in\
|
||||
\ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
|
||||
Req: The total number of requests, of any size or type, made to the sL1D per normalization
|
||||
@@ -51,7 +54,7 @@ Panel Config:
|
||||
value: Avg
|
||||
unit: Unit
|
||||
metric:
|
||||
Bandwidth:
|
||||
Bandwidth Utilization:
|
||||
value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
|
||||
- Start_Timestamp))))
|
||||
unit: Pct of Peak
|
||||
@@ -60,7 +63,7 @@ Panel Config:
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
|
||||
unit: Pct of Peak
|
||||
sL1D-L2 BW:
|
||||
sL1D-L2 BW Utilization:
|
||||
value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
|
||||
* 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))
|
||||
unit: Pct of Peak
|
||||
@@ -158,12 +161,12 @@ Panel Config:
|
||||
metric:
|
||||
sL1D-L2 BW:
|
||||
avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
|
||||
* 64)) / $denom))
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
|
||||
* 64)) / $denom))
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
|
||||
* 64)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: Gbps
|
||||
Read Req:
|
||||
avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_READ_REQ / $denom))
|
||||
|
||||
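Each bandwidth metric in these panels is reported as `avg`, `min`, and `max` of the same expression. The sketch below illustrates this for `sL1D-L2 BW`, assuming the aggregation runs over one record per kernel dispatch. That assumption, the dictionary keys, and the function name are not taken from the diff.

```python
from statistics import mean


def sl1d_l2_bw_stats(dispatches):
    """Per-dispatch sL1D-L2 byte rate (64 B per request), then avg/min/max."""
    per_dispatch = [
        ((d["read_req"] + d["write_req"] + d["atomic_req"]) * 64)
        / (d["end_ts"] - d["start_ts"])
        for d in dispatches
    ]
    return {
        "avg": mean(per_dispatch),
        "min": min(per_dispatch),
        "max": max(per_dispatch),
    }
```

Because the duration is applied per dispatch before aggregating, a short kernel with few requests can still set the `max`.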
+21
-21
@@ -5,12 +5,12 @@ Panel Config:
|
||||
metrics_description:
|
||||
Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
|
||||
cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM
|
||||
instructions, as a percent of the peak theoretical bandwidth achievable on the
|
||||
specific accelerator. The number of bytes is calculated as the number of cache
|
||||
lines requested multiplied by the cache line size. This value does not consider
|
||||
partial requests, so for instance, if only a single value is requested in a
|
||||
cache line, the data movement will still be counted as a full cache line.
|
||||
Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
|
||||
of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator. The number of bytes is calculated as the number
|
||||
of cache lines requested multiplied by the cache line size. This value does
|
||||
not consider partial requests, so for instance, if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
|
||||
The number of cycles where the vL1D Cache RAM is actively processing any request
|
||||
divided by the number of cycles where the vL1D is active.
|
||||
@@ -42,11 +42,11 @@ Panel Config:
|
||||
Atomic Req: The total number of incoming atomic requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
|
||||
instructions per normalization unit. The number of bytes is calculated as the
|
||||
number of cache lines requested multiplied by the cache line size. This value
|
||||
does not consider partial requests, so for instance, if only a single value
|
||||
is requested in a cache line, the data movement will still be counted as a full
|
||||
cache line.
|
||||
instructions divided by total duration. The number of bytes is calculated as
|
||||
the number of cache lines requested multiplied by the cache line size. This
|
||||
value does not consider partial requests, so for instance, if only a single
|
||||
value is requested in a cache line, the data movement will still be counted
|
||||
as a full cache line.
|
||||
Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
|
||||
vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
Cache Accesses: The total number of cache line lookups in the vL1D.
|
||||
@@ -57,7 +57,7 @@ Panel Config:
|
||||
command during the kernel's execution per normalization unit. This may be triggered
|
||||
by, for instance, the buffer_wbinvl1 instruction.
|
||||
L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
|
||||
of VMEM instructions, per normalization unit. The number of bytes is calculated
|
||||
of VMEM instructions, divided by total duration. The number of bytes is calculated
|
||||
as the number of cache lines requested multiplied by the cache line size. This
|
||||
value does not consider partial requests, so for instance, if only a single
|
||||
value is requested in a cache line, the data movement will still be counted
|
||||
@@ -128,7 +128,7 @@ Panel Config:
|
||||
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
|
||||
else None))
|
||||
unit: Pct of Peak
|
||||
Bandwidth:
|
||||
Bandwidth Utilization:
|
||||
value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp
|
||||
- Start_Timestamp)))) / ((($max_sclk / 1000) * 128) * $cu_per_gpu))
|
||||
unit: Pct of Peak
|
||||
@@ -201,10 +201,10 @@ Panel Config:
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Cache BW:
|
||||
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
|
||||
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
|
||||
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
|
||||
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: Gbps
|
||||
Cache Hit Rate:
|
||||
avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
@@ -242,12 +242,12 @@ Panel Config:
|
||||
unit: (Req + $normUnit)
|
||||
L1-L2 BW:
|
||||
avg: AVG(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
|
||||
max: MAX(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: Gbps
|
||||
L1-L2 Read:
|
||||
avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_READ_REQ_sum / $denom))
|
||||
|
||||
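The vL1D `Bandwidth Utilization` formula differs from the other utilization metrics in that `AVG` wraps only the achieved byte rate, and the result is then divided by a fixed peak term. A small Python sketch of that ordering follows. The tuple layout and function name are hypothetical, and `max_sclk` in MHz with nanosecond timestamps is assumed.

```python
from statistics import mean


def vl1d_bw_utilization(dispatches, max_sclk, cu_per_gpu):
    """dispatches: iterable of (cache_accesses, start_ts, end_ts) tuples."""
    # Average the achieved byte rate first, as AVG(...) does in the formula.
    avg_bw = mean(
        (accesses * 128) / (end_ts - start_ts)
        for accesses, start_ts, end_ts in dispatches
    )
    # Peak term: 128 B per cycle per CU, with the clock scaled by 1/1000.
    peak = (max_sclk / 1000) * 128 * cu_per_gpu
    return 100 * avg_bw / peak
```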
+33
-30
@@ -20,8 +20,8 @@ Panel Config:
|
||||
HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
|
||||
memory (HBM) per unit time. This value is calculated as the number of HBM channels
|
||||
multiplied by the HBM channel width multiplied by the HBM clock frequency.
|
||||
Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per
|
||||
normalization unit.
|
||||
Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided
|
||||
by total duration.
|
||||
HBM Read Traffic: The percent of read requests generated by the L2 cache that
|
||||
are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
|
||||
does not consider the size of the request (meaning that 32B and 64B requests
|
||||
@@ -42,9 +42,9 @@ Panel Config:
|
||||
as a single request), so this metric only approximates the percent of the L2-Fabric
|
||||
read bandwidth directed to an uncached memory location.
|
||||
Write and Atomic BW: The total number of bytes written by the L2 over Infinity
|
||||
Fabric by write and atomic operations per normalization unit. Note that on current
|
||||
CDNA accelerators, such as the MI2XX, requests are only considered atomic by
|
||||
Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
|
||||
Fabric by write and atomic operations divided by total duration. Note that on
|
||||
current CDNA accelerators, such as the MI2XX, requests are only considered atomic
|
||||
by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
|
||||
fine-grained memory allocations or uncached memory allocations on the MI2XX.
|
||||
HBM Write and Atomic Traffic: The percent of write and atomic requests generated
|
||||
by the L2 cache that are routed to the accelerator's local high-bandwidth memory
|
||||
@@ -82,17 +82,17 @@ Panel Config:
|
||||
Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
|
||||
Fabric before a completion acknowledgement (atomic without return value) or
|
||||
data (atomic with return value) was returned to the L2.
|
||||
Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit.
|
||||
Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration.
|
||||
The number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so for
|
||||
example, if only a single value is requested in a cache line, the data movement
|
||||
will still be counted as a full cache line.
|
||||
Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests,
|
||||
per normalization unit.
|
||||
divided by total duration.
|
||||
Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests,
|
||||
per normalization unit.
|
||||
divided by total duration.
|
||||
Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests,
|
||||
per normalization unit.
|
||||
divided by total duration.
|
||||
Req: The total number of incoming requests to the L2 from all clients for all
|
||||
request types, per normalization unit.
|
||||
Read Req: The total number of read requests to the L2 from all clients.
|
||||
@@ -150,11 +150,11 @@ Panel Config:
|
||||
64B of data from any source other than the accelerator's local HBM, per normalization
|
||||
unit.
|
||||
Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe
|
||||
traffic, per normalization unit.
|
||||
traffic, divided by total duration.
|
||||
"Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read
|
||||
requests due to Infinity Fabric traffic, per normalization unit.
|
||||
requests due to Infinity Fabric traffic, divided by total duration.
|
||||
Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM
|
||||
traffic, per normalization unit.
|
||||
traffic, divided by total duration.
|
||||
Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
|
||||
write or atomically update 32B of data to any memory location, per normalization
|
||||
unit.
|
||||
@@ -171,17 +171,17 @@ Panel Config:
|
||||
write or atomically update 32B or 64B of data in any memory location other than
|
||||
the accelerator's local HBM, per normalization unit.
|
||||
Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to
|
||||
PCIe traffic, per normalization unit.
|
||||
PCIe traffic, divided by total duration.
|
||||
"Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write
|
||||
requests due to Infinity Fabric traffic, per normalization unit.
|
||||
requests due to Infinity Fabric traffic, divided by total duration.
|
||||
Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM
|
||||
traffic, per normalization unit.
|
||||
traffic, divided by total duration.
|
||||
Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to
|
||||
PCIe traffic, per normalization unit.
|
||||
PCIe traffic, divided by total duration.
|
||||
"Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic
|
||||
requests due to Infinity Fabric traffic, per normalization unit.
|
||||
requests due to Infinity Fabric traffic, divided by total duration.
|
||||
Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
|
||||
HBM traffic, per normalization unit.
|
||||
HBM traffic, divided by total duration.
|
||||
Atomic: The total number of L2 requests to Infinity Fabric to atomically update
|
||||
32B or 64B of data in any memory location, per normalization unit. See Request
|
||||
flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
|
||||
@@ -258,12 +258,15 @@ Panel Config:
|
||||
metric:
|
||||
Read BW:
|
||||
avg: AVG(((128 * TCC_BUBBLE_sum + 64 * (TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum
|
||||
- TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / $denom))
|
||||
- TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / (End_Timestamp
|
||||
- Start_Timestamp)))
|
||||
min: MIN(((128 * TCC_BUBBLE_sum + 64 * (TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum
|
||||
- TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / $denom))
|
||||
- TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / (End_Timestamp
|
||||
- Start_Timestamp)))
|
||||
max: MAX(((128 * TCC_BUBBLE_sum + 64 * (TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum
|
||||
- TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
- TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / (End_Timestamp
|
||||
- Start_Timestamp)))
|
||||
unit: Gbps
|
||||
HBM Read Traffic:
|
||||
avg: AVG((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else None))
|
||||
@@ -290,12 +293,12 @@ Panel Config:
|
||||
unit: pct
|
||||
Write and Atomic BW:
|
||||
avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
* 32)) / (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
* 32)) / (End_Timestamp - Start_Timestamp)))
|
||||
max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
* 32)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: Gbps
|
||||
HBM Write and Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
@@ -363,10 +366,10 @@ Panel Config:
|
||||
unit: Unit
|
||||
metric:
|
||||
Bandwidth:
|
||||
avg: AVG((TCC_REQ_sum * 128) / $denom)
|
||||
min: MIN((TCC_REQ_sum * 128) / $denom)
|
||||
max: MAX((TCC_REQ_sum * 128) / $denom)
|
||||
unit: (Bytes + $normUnit)
|
||||
avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
|
||||
min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
|
||||
max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
Req:
|
||||
avg: AVG((TCC_REQ_sum / $denom))
|
||||
min: MIN((TCC_REQ_sum / $denom))
|
||||
|
||||
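The Read BW hunk above replaces `$denom` with `End_Timestamp - Start_Timestamp`. A minimal Python sketch of that arithmetic for a single dispatch follows. The function and parameter names are hypothetical stand-ins for the counters in the formula, and the zero-duration guard is an addition that is not in the YAML. The sketch only computes bytes per timestamp tick; if the timestamps are in nanoseconds (an assumption), that is numerically gigabytes per second.

```python
def l2_fabric_read_bytes(rdreq, rdreq_32b, bubble):
    """Numerator of the updated Read BW formula.

    Mirrors: 128 * TCC_BUBBLE + 64 * (RDREQ - BUBBLE - RDREQ_32B) + 32 * RDREQ_32B.
    Parameter names are illustrative, not the tool's API.
    """
    return 128 * bubble + 64 * (rdreq - bubble - rdreq_32b) + 32 * rdreq_32b


def bytes_per_tick(total_bytes, start_timestamp, end_timestamp):
    """Divide by the dispatch duration instead of a normalization unit."""
    duration = end_timestamp - start_timestamp
    if duration == 0:
        # Guard added for the sketch; the YAML expression has no such check.
        return None
    return total_bytes / duration


read_bytes = l2_fabric_read_bytes(rdreq=10, rdreq_32b=2, bubble=3)
read_bw = bytes_per_tick(read_bytes, start_timestamp=100, end_timestamp=484)
```

Because the denominator is now the duration, the value no longer changes with the selected normalization unit.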
+10
-6
@@ -11,8 +11,12 @@ Panel Config:
|
||||
instructions, averaged over the lifetime of the kernel. Calculated as the ratio
|
||||
of the total number of cycles spent by the scheduler issuing LDS instructions
|
||||
over the total CU cycles.
|
||||
Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
|
||||
could have been loaded from, stored to, or atomically updated in the LDS, expressed
|
||||
as percentage of theoretical peak. Does not take into account the execution
|
||||
mask of the wavefront when the instruction was executed.
|
||||
Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
|
||||
loaded from, stored to, or atomically updated in the LDS per normalization unit.
|
||||
loaded from, stored to, or atomically updated in the LDS divided by total duration.
|
||||
Does not take into account the execution mask of the wavefront when the instruction
|
||||
was executed.
|
||||
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
|
||||
@@ -58,7 +62,7 @@ Panel Config:
|
||||
Access Rate:
|
||||
value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: Pct of Peak
|
||||
Theoretical Bandwidth (% of Peak):
|
||||
Theoretical Bandwidth Utilization:
|
||||
value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
|
||||
unit: Pct of Peak
|
||||
@@ -116,12 +120,12 @@ Panel Config:
|
||||
units: Gbps
|
||||
Theoretical Bandwidth:
|
||||
avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ $denom))
|
||||
/ (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ $denom))
|
||||
/ (End_Timestamp - Start_Timestamp)))
|
||||
max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
/ (End_Timestamp - Start_Timestamp)))
|
||||
unit: Gbps
|
||||
LDS Latency:
|
||||
avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
|
||||
None))
|
||||
|
||||
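The two LDS formulas above share a numerator, so the relationship between `Theoretical Bandwidth` and `Theoretical Bandwidth Utilization` can be sketched in a few lines of Python. All names here are hypothetical. The reading of `0.00128` as 128 bytes per CU per cycle, combined with a MHz-to-GHz conversion and a percent scaling, is an interpretation of the constant and is not stated in the source.

```python
def lds_theoretical_bytes(idx_active, bank_conflict, lds_banks_per_cu):
    """(SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4 * TO_INT($lds_banks_per_cu)."""
    return (idx_active - bank_conflict) * 4 * int(lds_banks_per_cu)


def lds_theoretical_bw(idx_active, bank_conflict, lds_banks_per_cu, start, end):
    """Bytes divided by the dispatch duration (End_Timestamp - Start_Timestamp)."""
    total = lds_theoretical_bytes(idx_active, bank_conflict, lds_banks_per_cu)
    return total / (end - start)


def lds_bw_utilization(bw, max_sclk, cu_per_gpu):
    """Bandwidth over (max_sclk * cu_per_gpu) * 0.00128, as in the utilization formula."""
    return bw / ((max_sclk * cu_per_gpu) * 0.00128)


bw = lds_theoretical_bw(1000, 200, 32, start=0, end=1600)
util = lds_bw_utilization(bw, max_sclk=1000, cu_per_gpu=100)
```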
+15
-12
@@ -3,15 +3,18 @@ Panel Config:
|
||||
id: 1300
|
||||
title: Instruction Cache
|
||||
metrics_description:
|
||||
Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the
|
||||
peak theoretical bandwidth. Calculated as the ratio of L1I requests over the
|
||||
total L1I cycles.
|
||||
Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
|
||||
of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
|
||||
the total L1I cycles.
|
||||
Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
|
||||
loaded line in the cache. Calculated as the ratio of the number of L1I requests
|
||||
that hit over the number of all L1I requests.
|
||||
L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\
|
||||
\ bandwidth achieved. Calculated as the ratio of the total number of requests\
|
||||
\ from the L1I to the L2 cache over the total L1I-L2 interface cycles."
|
||||
L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\
|
||||
\ L2 cache request bandwidth achieved. Calculated as the ratio of the total\
|
||||
\ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\
|
||||
\ cycles."
|
||||
L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
|
||||
divided by total duration.
|
||||
Req: The total number of requests made to the L1I per normalization-unit
|
||||
Hits: The total number of L1I requests that hit on a previously loaded cache line,
|
||||
per normalization-unit.
|
||||
@@ -30,7 +33,7 @@ Panel Config:
|
||||
value: Avg
|
||||
unit: Unit
|
||||
metric:
|
||||
Bandwidth:
|
||||
Bandwidth Utilization:
|
||||
value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
|
||||
- Start_Timestamp))))
|
||||
unit: Pct of Peak
|
||||
@@ -38,7 +41,7 @@ Panel Config:
|
||||
value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
|
||||
+ SQC_ICACHE_MISSES_DUPLICATE)))
|
||||
unit: Pct of Peak
|
||||
L1I-L2 Bandwidth:
|
||||
L1I-L2 Bandwidth Utilization:
|
||||
value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu)
|
||||
* (End_Timestamp - Start_Timestamp))))
|
||||
unit: Pct of Peak
|
||||
@@ -100,7 +103,7 @@ Panel Config:
|
||||
unit: Unit
|
||||
metric:
|
||||
L1I-L2 Bandwidth:
|
||||
avg: AVG(((SQC_TC_INST_REQ * 64) / $denom))
|
||||
min: MIN(((SQC_TC_INST_REQ * 64) / $denom))
|
||||
max: MAX(((SQC_TC_INST_REQ * 64) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
avg: AVG(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: Gbps
|
||||
|
||||
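For the instruction cache, the commit keeps a percent-of-peak metric (`L1I-L2 Bandwidth Utilization`) next to the duration-based `L1I-L2 Bandwidth`. The following Python sketch evaluates both expressions for one dispatch. The function names and sample values are made up for illustration, and the units of `max_sclk` and the timestamps are whatever the tool supplies; the sketch makes no claim about them.

```python
def l1i_l2_bw(sqc_tc_inst_req, start, end):
    """(SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)."""
    return (sqc_tc_inst_req * 64) / (end - start)


def l1i_l2_bw_utilization(sqc_tc_inst_req, max_sclk, sqc_per_gpu, start, end):
    """(SQC_TC_INST_REQ * 100000) / (2 * (max_sclk * sqc_per_gpu) * duration)."""
    return (sqc_tc_inst_req * 100000) / (2 * (max_sclk * sqc_per_gpu) * (end - start))


bw = l1i_l2_bw(1000, start=0, end=500)
util = l1i_l2_bw_utilization(1000, max_sclk=1000, sqc_per_gpu=10, start=0, end=500)
```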
+13
-10
@@ -3,14 +3,17 @@ Panel Config:
|
||||
id: 1400
|
||||
title: Scalar L1 Data Cache
|
||||
metrics_description:
|
||||
Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the
|
||||
peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the
|
||||
total sL1D cycles.
|
||||
Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
|
||||
of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
|
||||
over the total sL1D cycles.
|
||||
Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
|
||||
loaded line in the cache. The ratio of the number of sL1D requests that hit over
|
||||
the number of all sL1D requests.
|
||||
sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
|
||||
bandwidth achieved.\ \ Calculated as total number of bytes read from, written
|
||||
to, or atomically updated\ \ across the sL1D - L2 interface.
|
||||
sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
|
||||
\ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\
|
||||
\ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\
|
||||
\ writes and atomics are typically unused on current CDNA accelerators, so in\
|
||||
\ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
|
||||
Req: The total number of requests, of any size or type, made to the sL1D per normalization
|
||||
@@ -51,7 +54,7 @@ Panel Config:
|
||||
value: Avg
|
||||
unit: Unit
|
||||
metric:
|
||||
Bandwidth:
|
||||
Bandwidth Utilization:
|
||||
value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
|
||||
- Start_Timestamp))))
|
||||
unit: Pct of Peak
|
||||
@@ -60,7 +63,7 @@ Panel Config:
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
|
||||
unit: Pct of Peak
|
||||
sL1D-L2 BW:
|
||||
sL1D-L2 BW Utilization:
|
||||
value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
|
||||
* 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))
|
||||
unit: Pct of Peak
|
||||
@@ -158,12 +161,12 @@ Panel Config:
|
||||
metric:
|
||||
sL1D-L2 BW:
|
||||
avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
|
||||
* 64)) / $denom))
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
|
||||
* 64)) / $denom))
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
|
||||
* 64)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: Gbps
|
||||
Read Req:
|
||||
avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_READ_REQ / $denom))
|
||||
|
||||
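Each bandwidth metric above is defined three times, as `avg`, `min`, and `max` of the same expression. The Python sketch below assumes those reductions run over one value per kernel dispatch; that is an assumption about the tool's behavior, not something the diff confirms. The dictionary keys and sample dispatches are invented for illustration.

```python
from statistics import mean


def sl1d_l2_bw(dispatch):
    """Per-dispatch sL1D-L2 bytes over duration, 64 bytes per request as in the formula."""
    requests = dispatch["read_req"] + dispatch["write_req"] + dispatch["atomic_req"]
    return (requests * 64) / (dispatch["end"] - dispatch["start"])


def summarize(dispatches):
    """Reduce the per-dispatch values the way avg/min/max are assumed to."""
    values = [sl1d_l2_bw(d) for d in dispatches]
    return {"avg": mean(values), "min": min(values), "max": max(values)}


dispatches = [
    {"read_req": 100, "write_req": 0, "atomic_req": 0, "start": 0, "end": 100},
    {"read_req": 300, "write_req": 20, "atomic_req": 0, "start": 100, "end": 260},
]
summary = summarize(dispatches)
```

Under this assumption the average is a mean of per-dispatch rates, which generally differs from total bytes divided by total time.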
+21
-21
@@ -5,12 +5,12 @@ Panel Config:
|
||||
metrics_description:
|
||||
Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
|
||||
cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM
|
||||
instructions, as a percent of the peak theoretical bandwidth achievable on the
|
||||
specific accelerator. The number of bytes is calculated as the number of cache
|
||||
lines requested multiplied by the cache line size. This value does not consider
|
||||
partial requests, so for instance, if only a single value is requested in a
|
||||
cache line, the data movement will still be counted as a full cache line.
|
||||
Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
|
||||
of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator. The number of bytes is calculated as the number
|
||||
of cache lines requested multiplied by the cache line size. This value does
|
||||
not consider partial requests, so for instance, if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
|
||||
The number of cycles where the vL1D Cache RAM is actively processing any request
|
||||
divided by the number of cycles where the vL1D is active.
|
||||
@@ -42,11 +42,11 @@ Panel Config:
|
||||
Atomic Req: The total number of incoming atomic requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
|
||||
instructions per normalization unit. The number of bytes is calculated as the
|
||||
number of cache lines requested multiplied by the cache line size. This value
|
||||
does not consider partial requests, so for instance, if only a single value
|
||||
is requested in a cache line, the data movement will still be counted as a full
|
||||
cache line.
|
||||
instructions divided by total duration. The number of bytes is calculated as
|
||||
the number of cache lines requested multiplied by the cache line size. This
|
||||
value does not consider partial requests, so for instance, if only a single
|
||||
value is requested in a cache line, the data movement will still be counted
|
||||
as a full cache line.
|
||||
Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
|
||||
vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
Cache Accesses: The total number of cache line lookups in the vL1D.
|
||||
@@ -57,7 +57,7 @@ Panel Config:
|
||||
command during the kernel's execution per normalization unit. This may be triggered
|
||||
by, for instance, the buffer_wbinvl1 instruction.
|
||||
L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
|
||||
of VMEM instructions, per normalization unit. The number of bytes is calculated
|
||||
of VMEM instructions, divided by total duration. The number of bytes is calculated
|
||||
as the number of cache lines requested multiplied by the cache line size. This
|
||||
value does not consider partial requests, so for instance, if only a single
|
||||
value is requested in a cache line, the data movement will still be counted
|
||||
@@ -128,7 +128,7 @@ Panel Config:
|
||||
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
|
||||
else None))
|
||||
unit: Pct of Peak
|
||||
Bandwidth:
|
||||
Bandwidth Utilization:
|
||||
value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp
|
||||
- Start_Timestamp)))) / ((($max_sclk / 1000) * 128) * $cu_per_gpu))
|
||||
unit: Pct of Peak
|
||||
@@ -216,10 +216,10 @@ Panel Config:
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Cache BW:
|
||||
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
|
||||
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
|
||||
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
|
||||
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: Gbps
|
||||
Cache Hit Rate:
|
||||
avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
@@ -257,12 +257,12 @@ Panel Config:
|
||||
unit: (Req + $normUnit)
|
||||
L1-L2 BW:
|
||||
avg: AVG(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
|
||||
max: MAX(((128 * TCP_TCC_READ_REQ_sum + 64 * (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: Gbps
|
||||
Tag RAM 0 Req:
|
||||
avg: AVG((TCP_TAGRAM0_REQ_sum / $denom))
|
||||
min: MIN((TCP_TAGRAM0_REQ_sum / $denom))
|
||||
|
||||
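The vL1D hunks above weight requests by size: reads count as 128 bytes and writes and atomics as 64 bytes in `L1-L2 BW`, while `Bandwidth Utilization` compares the cache bandwidth against a peak built from `$max_sclk` and `$cu_per_gpu`. A hedged Python sketch of both expressions follows, with hypothetical function names and made-up inputs. It handles a single dispatch, so the `AVG` in the utilization formula reduces to the value itself.

```python
def l1_l2_bytes(read_req, write_req, atomic_ret_req, atomic_noret_req):
    """128 * reads + 64 * (writes + atomics with return + atomics without return)."""
    return 128 * read_req + 64 * (write_req + atomic_ret_req + atomic_noret_req)


def cache_bw(total_cache_accesses, start, end):
    """(TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)."""
    return (total_cache_accesses * 128) / (end - start)


def cache_bw_utilization(bw, max_sclk, cu_per_gpu):
    """(100 * bw) / (((max_sclk / 1000) * 128) * cu_per_gpu)."""
    return (100 * bw) / (((max_sclk / 1000) * 128) * cu_per_gpu)


l1_l2_bw = l1_l2_bytes(10, 4, 1, 1) / (104 - 0)
util = cache_bw_utilization(cache_bw(1000, 0, 1000), max_sclk=1000, cu_per_gpu=100)
```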
+78
-78
@@ -20,8 +20,8 @@ Panel Config:
|
||||
HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
|
||||
memory (HBM) per unit time. This value is calculated as the number of HBM channels
|
||||
multiplied by the HBM channel width multiplied by the HBM clock frequency.
|
||||
Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per
|
||||
normalization unit.
|
||||
Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided
|
||||
by total duration.
|
||||
HBM Read Traffic: The percent of read requests generated by the L2 cache that
|
||||
are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
|
||||
does not consider the size of the request (meaning that 32B and 64B requests
|
||||
@@ -42,9 +42,9 @@ Panel Config:
|
||||
as a single request), so this metric only approximates the percent of the L2-Fabric
|
||||
read bandwidth directed to an uncached memory location.
|
||||
Write and Atomic BW: The total number of bytes written by the L2 over Infinity
|
||||
Fabric by write and atomic operations per normalization unit. Note that on current
|
||||
CDNA accelerators, such as the MI2XX, requests are only considered atomic by
|
||||
Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
|
||||
Fabric by write and atomic operations divided by total duration. Note that on
|
||||
current CDNA accelerators, such as the MI2XX, requests are only considered atomic
|
||||
by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
|
||||
fine-grained memory allocations or uncached memory allocations on the MI2XX.
|
||||
HBM Write and Atomic Traffic: The percent of write and atomic requests generated
|
||||
by the L2 cache that are routed to the accelerator's local high-bandwidth memory
|
||||
@@ -82,17 +82,17 @@ Panel Config:
|
||||
Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
|
||||
Fabric before a completion acknowledgement (atomic without return value) or
|
||||
data (atomic with return value) was returned to the L2.
|
||||
Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit.
|
||||
Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration.
|
||||
The number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so for
|
||||
example, if only a single value is requested in a cache line, the data movement
|
||||
will still be counted as a full cache line.
|
||||
Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests,
|
||||
per normalization unit.
|
||||
divided by total duration.
|
||||
Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests,
|
||||
per normalization unit.
|
||||
divided by total duration.
|
||||
Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests,
|
||||
per normalization unit.
|
||||
divided by total duration.
|
||||
Req: The total number of incoming requests to the L2 from all clients for all
|
||||
request types, per normalization unit.
|
||||
Read Req: The total number of read requests to the L2 from all clients.
|
||||
@@ -150,11 +150,11 @@ Panel Config:
|
||||
64B of data from any source other than the accelerator's local HBM, per normalization
|
||||
unit.
|
||||
Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe
|
||||
traffic, per normalization unit.
|
||||
traffic, divided by total duration.
|
||||
"Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read
|
||||
requests due to Infinity Fabric traffic, per normalization unit.
|
||||
requests due to Infinity Fabric traffic, divided by total duration.
|
||||
Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM
|
||||
traffic, per normalization unit.
|
||||
traffic, divided by total duration.
|
||||
Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
|
||||
write or atomically update 32B of data to any memory location, per normalization
|
||||
unit.
|
||||
@@ -171,17 +171,17 @@ Panel Config:
|
||||
write or atomically update 32B or 64B of data in any memory location other than
|
||||
the accelerator's local HBM, per normalization unit.
|
||||
Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to
|
||||
PCIe traffic, per normalization unit.
|
||||
PCIe traffic, divided by total duration.
|
||||
"Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write
|
||||
requests due to Infinity Fabric traffic, per normalization unit.
|
||||
requests due to Infinity Fabric traffic, divided by total duration.
|
||||
Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM
|
||||
traffic, per normalization unit.
|
||||
traffic, divided by total duration.
|
||||
Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to
|
||||
PCIe traffic, per normalization unit.
|
||||
PCIe traffic, divided by total duration.
|
||||
"Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic
|
||||
requests due to Infinity Fabric traffic, per normalization unit.
|
||||
requests due to Infinity Fabric traffic, divided by total duration.
|
||||
Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
|
||||
HBM traffic, per normalization unit.
|
||||
HBM traffic, divided by total duration.
|
||||
Atomic: The total number of L2 requests to Infinity Fabric to atomically update
|
||||
32B or 64B of data in any memory location, per normalization unit. See Request
|
||||
flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
|
||||
@@ -257,12 +257,12 @@ Panel Config:
|
||||
metric:
|
||||
Read BW:
|
||||
avg: AVG((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) +
|
||||
(TCC_EA0_RDREQ_128B_sum * 128)) / $denom))
|
||||
(TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) +
|
||||
(TCC_EA0_RDREQ_128B_sum * 128)) / $denom))
|
||||
(TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp)))
|
||||
max: MAX((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) +
|
||||
(TCC_EA0_RDREQ_128B_sum * 128)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
(TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: Gbps
|
||||
HBM Read Traffic:
|
||||
avg: AVG((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else None))
|
||||
@@ -289,12 +289,12 @@ Panel Config:
|
||||
unit: pct
|
||||
Write and Atomic BW:
|
||||
avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
* 32)) / (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
* 32)) / (End_Timestamp - Start_Timestamp)))
|
||||
max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
* 32)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: Gbps
|
||||
HBM Write and Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
@@ -381,25 +381,25 @@ Panel Config:
|
||||
unit: Unit
|
||||
metric:
|
||||
Bandwidth:
|
||||
avg: AVG((TCC_REQ_sum * 128) / $denom)
|
||||
min: MIN((TCC_REQ_sum * 128) / $denom)
|
||||
max: MAX((TCC_REQ_sum * 128) / $denom)
|
||||
unit: (Bytes + $normUnit)
|
||||
avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
|
||||
min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
|
||||
max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
Read Bandwidth:
|
||||
avg: AVG(TCC_READ_SECTORS_sum * 32/ $denom)
|
||||
min: MIN(TCC_READ_SECTORS_sum * 32/ $denom)
|
||||
max: MAX(TCC_READ_SECTORS_sum * 32/ $denom)
|
||||
unit: (Bytes + $normUnit)
|
||||
avg: AVG(TCC_READ_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_READ_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_READ_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
Write Bandwidth:
|
||||
avg: AVG(TCC_WRITE_SECTORS_sum * 32/ $denom)
|
||||
min: MIN(TCC_WRITE_SECTORS_sum * 32/ $denom)
|
||||
max: MAX(TCC_WRITE_SECTORS_sum * 32/ $denom)
|
||||
unit: (Bytes + $normUnit)
|
||||
avg: AVG(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
Atomic Bandwidth:
|
||||
avg: AVG(TCC_ATOMIC_SECTORS_sum * 32/ $denom)
|
||||
min: MIN(TCC_ATOMIC_SECTORS_sum * 32/ $denom)
|
||||
max: MAX(TCC_ATOMIC_SECTORS_sum * 32/ $denom)
|
||||
unit: (Bytes + $normUnit)
|
||||
avg: AVG(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
Req:
|
||||
avg: AVG((TCC_REQ_sum / $denom))
|
||||
min: MIN((TCC_REQ_sum / $denom))
|
||||
@@ -653,20 +653,20 @@ Panel Config:
|
||||
max: MAX((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read Bandwidth - PCIe:
|
||||
avg: AVG(TCC_EA0_RDREQ_IO_32B_sum * 32/ $denom)
|
||||
min: MIN(TCC_EA0_RDREQ_IO_32B_sum * 32/ $denom)
|
||||
max: MAX(TCC_EA0_RDREQ_IO_32B_sum * 32/ $denom)
|
||||
unit: (Bytes + $normUnit)
|
||||
avg: AVG(TCC_EA0_RDREQ_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_RDREQ_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_RDREQ_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
"Read Bandwidth - Infinity Fabric\u2122":
|
||||
avg: AVG(TCC_EA0_RDREQ_GMI_32B_sum * 32/ $denom)
|
||||
min: MIN(TCC_EA0_RDREQ_GMI_32B_sum * 32/ $denom)
|
||||
max: MAX(TCC_EA0_RDREQ_GMI_32B_sum * 32/ $denom)
|
||||
unit: (Bytes + $normUnit)
|
||||
avg: AVG(TCC_EA0_RDREQ_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_RDREQ_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_RDREQ_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
Read Bandwidth - HBM:
|
||||
avg: AVG(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ $denom)
|
||||
min: MIN(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ $denom)
|
||||
max: MAX(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ $denom)
|
||||
unit: (Bytes + $normUnit)
|
||||
avg: AVG(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
Write and Atomic (32B):
|
||||
avg: AVG(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom))
|
||||
min: MIN(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom))
|
||||
@@ -693,20 +693,20 @@ Panel Config:
|
||||
max: MAX((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Write Bandwidth - PCIe:
|
||||
avg: AVG(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ $denom)
|
||||
min: MIN(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ $denom)
|
||||
max: MAX(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ $denom)
|
||||
unit: (Bytes + $normUnit)
|
||||
avg: AVG(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
"Write Bandwidth - Infinity Fabric\u2122":
|
||||
avg: AVG(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ $denom)
|
||||
min: MIN(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ $denom)
|
||||
max: MAX(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ $denom)
|
||||
unit: (Bytes + $normUnit)
|
||||
avg: AVG(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
Write Bandwidth - HBM:
|
||||
avg: AVG(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ $denom)
|
||||
min: MIN(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ $denom)
|
||||
max: MAX(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ $denom)
|
||||
unit: (Bytes + $normUnit)
|
||||
avg: AVG(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
Atomic:
|
||||
avg: AVG((TCC_EA0_ATOMIC_sum / $denom))
|
||||
min: MIN((TCC_EA0_ATOMIC_sum / $denom))
|
||||
@@ -718,17 +718,17 @@ Panel Config:
|
||||
max: MAX((TCC_EA0_WRREQ_ATOMIC_DRAM_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Atomic Bandwidth - PCIe:
|
||||
avg: AVG(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ $denom)
|
||||
min: MIN(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ $denom)
|
||||
max: MAX(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ $denom)
|
||||
unit: (Bytes + $normUnit)
|
||||
avg: AVG(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
"Atomic Bandwidth - Infinity Fabric\u2122":
|
||||
avg: AVG(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ $denom)
|
||||
min: MIN(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ $denom)
|
||||
max: MAX(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ $denom)
|
||||
unit: (Bytes + $normUnit)
|
||||
avg: AVG(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
Atomic Bandwidth - HBM:
|
||||
avg: AVG(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ $denom)
|
||||
min: MIN(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ $denom)
|
||||
max: MAX(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ $denom)
|
||||
unit: (Bytes + $normUnit)
|
||||
avg: AVG(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
|
||||
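This variant of the L2 config sums explicit 32B, 64B, and 128B read counters for `Read BW` and multiplies sector counts by 32 for the read, write, and atomic bandwidths. A small Python sketch of those numerators and the shared duration denominator is shown here. The helper names are hypothetical and the inputs are invented.

```python
def fabric_read_bytes(rdreq_32b, rdreq_64b, rdreq_128b):
    """(RDREQ_32B * 32) + (RDREQ_64B * 64) + (RDREQ_128B * 128)."""
    return rdreq_32b * 32 + rdreq_64b * 64 + rdreq_128b * 128


def sector_bytes(sectors):
    """Sector-based metrics multiply the sector count by 32 bytes."""
    return sectors * 32


def per_duration(total_bytes, start, end):
    """Shared denominator after this commit: End_Timestamp - Start_Timestamp."""
    return total_bytes / (end - start)


read_bw = per_duration(fabric_read_bytes(4, 2, 1), start=0, end=96)
write_bw = per_duration(sector_bytes(10), start=0, end=80)
```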
@@ -59,42 +59,42 @@ src/rocprof_compute_soc/analysis_configs/gfx940/1100_compute_units_compute_pipel
|
||||
src/rocprof_compute_soc/analysis_configs/gfx941/1100_compute_units_compute_pipeline.yaml: 4a25b6abf24f4a622fde1a3cfe65fe7236cf1e626fc2444667883997564cea1e
|
||||
src/rocprof_compute_soc/analysis_configs/gfx942/1100_compute_units_compute_pipeline.yaml: 4a25b6abf24f4a622fde1a3cfe65fe7236cf1e626fc2444667883997564cea1e
|
||||
src/rocprof_compute_soc/analysis_configs/gfx950/1100_compute_units_compute_pipeline.yaml: 4ef656938f8a9667ae872db522855856469accff9cb42bc0444b469346760dfd
|
||||
src/rocprof_compute_soc/analysis_configs/gfx908/1200_local_data_share_lds.yaml: 80f3ca3ea15de009c5278ea20566d8c08d62e0087971e5f9aeae1c89df1dd898
|
||||
src/rocprof_compute_soc/analysis_configs/gfx90a/1200_local_data_share_lds.yaml: 80f3ca3ea15de009c5278ea20566d8c08d62e0087971e5f9aeae1c89df1dd898
|
||||
src/rocprof_compute_soc/analysis_configs/gfx940/1200_local_data_share_lds.yaml: 3bbf3928288990863cfe72fd00a28785fde0a36f103f5381df578aae2eb28be0
|
||||
src/rocprof_compute_soc/analysis_configs/gfx941/1200_local_data_share_lds.yaml: 3bbf3928288990863cfe72fd00a28785fde0a36f103f5381df578aae2eb28be0
|
||||
src/rocprof_compute_soc/analysis_configs/gfx942/1200_local_data_share_lds.yaml: 3bbf3928288990863cfe72fd00a28785fde0a36f103f5381df578aae2eb28be0
|
||||
src/rocprof_compute_soc/analysis_configs/gfx950/1200_local_data_share_lds.yaml: 505163510a3b0132ee487f9e024188de2deb97d0f72e3d729b95f86e7c3434b3
|
||||
src/rocprof_compute_soc/analysis_configs/gfx908/1300_instruction_cache.yaml: 2437e2f8191675c4116d0da1db291f3ad2715281ea812e9fdd6506cf213e5d1b
|
||||
src/rocprof_compute_soc/analysis_configs/gfx90a/1300_instruction_cache.yaml: 2437e2f8191675c4116d0da1db291f3ad2715281ea812e9fdd6506cf213e5d1b
|
||||
src/rocprof_compute_soc/analysis_configs/gfx940/1300_instruction_cache.yaml: 2437e2f8191675c4116d0da1db291f3ad2715281ea812e9fdd6506cf213e5d1b
|
||||
src/rocprof_compute_soc/analysis_configs/gfx941/1300_instruction_cache.yaml: 2437e2f8191675c4116d0da1db291f3ad2715281ea812e9fdd6506cf213e5d1b
|
||||
src/rocprof_compute_soc/analysis_configs/gfx942/1300_instruction_cache.yaml: 2437e2f8191675c4116d0da1db291f3ad2715281ea812e9fdd6506cf213e5d1b
|
||||
src/rocprof_compute_soc/analysis_configs/gfx950/1300_instruction_cache.yaml: 2437e2f8191675c4116d0da1db291f3ad2715281ea812e9fdd6506cf213e5d1b
|
||||
src/rocprof_compute_soc/analysis_configs/gfx908/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5
|
||||
src/rocprof_compute_soc/analysis_configs/gfx90a/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5
src/rocprof_compute_soc/analysis_configs/gfx940/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5
src/rocprof_compute_soc/analysis_configs/gfx941/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5
src/rocprof_compute_soc/analysis_configs/gfx942/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5
src/rocprof_compute_soc/analysis_configs/gfx950/1400_scalar_l1_data_cache.yaml: 8871e3b65132321cb3880a48f894d8c3b2c56a3936d382c3c2b02723ed5c8ec5
src/rocprof_compute_soc/analysis_configs/gfx908/1200_local_data_share_lds.yaml: f3f7a74e8b2915fe27eec7948f006f218a6b0a96c91b95cdff9e624b2c484bb2
src/rocprof_compute_soc/analysis_configs/gfx90a/1200_local_data_share_lds.yaml: f3f7a74e8b2915fe27eec7948f006f218a6b0a96c91b95cdff9e624b2c484bb2
src/rocprof_compute_soc/analysis_configs/gfx940/1200_local_data_share_lds.yaml: f3f7a74e8b2915fe27eec7948f006f218a6b0a96c91b95cdff9e624b2c484bb2
src/rocprof_compute_soc/analysis_configs/gfx941/1200_local_data_share_lds.yaml: f3f7a74e8b2915fe27eec7948f006f218a6b0a96c91b95cdff9e624b2c484bb2
src/rocprof_compute_soc/analysis_configs/gfx942/1200_local_data_share_lds.yaml: f3f7a74e8b2915fe27eec7948f006f218a6b0a96c91b95cdff9e624b2c484bb2
src/rocprof_compute_soc/analysis_configs/gfx950/1200_local_data_share_lds.yaml: 6333e18126bde83da4c66fd967531d394bd22e69c08358096b27168a9dc11a30
src/rocprof_compute_soc/analysis_configs/gfx908/1300_instruction_cache.yaml: f60b9c657bece161e34219f3ada4041107dc5ca3d248590ee3b67e7bd400ff54
src/rocprof_compute_soc/analysis_configs/gfx90a/1300_instruction_cache.yaml: f60b9c657bece161e34219f3ada4041107dc5ca3d248590ee3b67e7bd400ff54
src/rocprof_compute_soc/analysis_configs/gfx940/1300_instruction_cache.yaml: f60b9c657bece161e34219f3ada4041107dc5ca3d248590ee3b67e7bd400ff54
src/rocprof_compute_soc/analysis_configs/gfx941/1300_instruction_cache.yaml: f60b9c657bece161e34219f3ada4041107dc5ca3d248590ee3b67e7bd400ff54
src/rocprof_compute_soc/analysis_configs/gfx942/1300_instruction_cache.yaml: f60b9c657bece161e34219f3ada4041107dc5ca3d248590ee3b67e7bd400ff54
src/rocprof_compute_soc/analysis_configs/gfx950/1300_instruction_cache.yaml: f60b9c657bece161e34219f3ada4041107dc5ca3d248590ee3b67e7bd400ff54
src/rocprof_compute_soc/analysis_configs/gfx908/1400_scalar_l1_data_cache.yaml: 29fac4ea38e4a018baffc4a27a720b47078fd890c10da307655d40f693e6f0e7
src/rocprof_compute_soc/analysis_configs/gfx90a/1400_scalar_l1_data_cache.yaml: 29fac4ea38e4a018baffc4a27a720b47078fd890c10da307655d40f693e6f0e7
src/rocprof_compute_soc/analysis_configs/gfx940/1400_scalar_l1_data_cache.yaml: 29fac4ea38e4a018baffc4a27a720b47078fd890c10da307655d40f693e6f0e7
src/rocprof_compute_soc/analysis_configs/gfx941/1400_scalar_l1_data_cache.yaml: 29fac4ea38e4a018baffc4a27a720b47078fd890c10da307655d40f693e6f0e7
src/rocprof_compute_soc/analysis_configs/gfx942/1400_scalar_l1_data_cache.yaml: 29fac4ea38e4a018baffc4a27a720b47078fd890c10da307655d40f693e6f0e7
src/rocprof_compute_soc/analysis_configs/gfx950/1400_scalar_l1_data_cache.yaml: 29fac4ea38e4a018baffc4a27a720b47078fd890c10da307655d40f693e6f0e7
src/rocprof_compute_soc/analysis_configs/gfx908/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 633d59aba82b3a495b7ba33fa4b2ae4da638b58632bcc37ff18be87af68ce4d4
src/rocprof_compute_soc/analysis_configs/gfx90a/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 2bdb9d7b3bea1057b3baee29ba3b428b211808261063a97bc4b6b319f4a19fb3
src/rocprof_compute_soc/analysis_configs/gfx940/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 3180c2f3266be0ff44e01d73d247ca43ae2ee18ecaf61765f58849e36c701b19
src/rocprof_compute_soc/analysis_configs/gfx941/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 3180c2f3266be0ff44e01d73d247ca43ae2ee18ecaf61765f58849e36c701b19
src/rocprof_compute_soc/analysis_configs/gfx942/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 3180c2f3266be0ff44e01d73d247ca43ae2ee18ecaf61765f58849e36c701b19
src/rocprof_compute_soc/analysis_configs/gfx950/1500_address_processing_unit_and_data_return_path_ta_td.yaml: 9e56cef5b066fb575a5c530bcf9400f1291dd8636b12c8a2244cdba1defafc9f
src/rocprof_compute_soc/analysis_configs/gfx908/1600_vector_l1_data_cache.yaml: e6ec43014ce7b7cc072385d4eba072dd187b5de14979c169a3c1e9b8fc4c2762
src/rocprof_compute_soc/analysis_configs/gfx90a/1600_vector_l1_data_cache.yaml: e6ec43014ce7b7cc072385d4eba072dd187b5de14979c169a3c1e9b8fc4c2762
src/rocprof_compute_soc/analysis_configs/gfx940/1600_vector_l1_data_cache.yaml: 0e53921cc8d87a9adade250b9632fa42d33c825565152e37d6e56f45f83a3a28
src/rocprof_compute_soc/analysis_configs/gfx941/1600_vector_l1_data_cache.yaml: 0e53921cc8d87a9adade250b9632fa42d33c825565152e37d6e56f45f83a3a28
src/rocprof_compute_soc/analysis_configs/gfx942/1600_vector_l1_data_cache.yaml: 0e53921cc8d87a9adade250b9632fa42d33c825565152e37d6e56f45f83a3a28
src/rocprof_compute_soc/analysis_configs/gfx950/1600_vector_l1_data_cache.yaml: cd21327c193d2af8c18066b9c13f67e3d5dfb44731777bc5a1b6a7738c902dd1
src/rocprof_compute_soc/analysis_configs/gfx908/1700_l2_cache.yaml: 5b48c690b6069a5610d07cc0c2a5e1da65a52296205dcf48a3b6fa5e3df36e9b
src/rocprof_compute_soc/analysis_configs/gfx90a/1700_l2_cache.yaml: a9b128267a069060e891533334c52586c706f145b1e813a4081cb21d425516ad
src/rocprof_compute_soc/analysis_configs/gfx940/1700_l2_cache.yaml: b4eea39f0e23e501ad503cdd96db377109c7f0e212949828fe06102de7355349
src/rocprof_compute_soc/analysis_configs/gfx941/1700_l2_cache.yaml: da0189cd7f6e1ab4b79d0c054c2cdc1f7a9c81972dae9e5285f2f3d9c30ca644
src/rocprof_compute_soc/analysis_configs/gfx942/1700_l2_cache.yaml: b0802f923052eb584ce138210ebf2db70fb7883926896da1861a9e857d4abe81
src/rocprof_compute_soc/analysis_configs/gfx950/1700_l2_cache.yaml: 58bdd965421d610567e461becd7094fa41d668b119eddab99054d2bd6dc12acf
src/rocprof_compute_soc/analysis_configs/gfx908/1600_vector_l1_data_cache.yaml: 438d0f4a972dd341eb2485f51a47d6860fbb30a6169054cd8550b4b7226e199f
src/rocprof_compute_soc/analysis_configs/gfx90a/1600_vector_l1_data_cache.yaml: 438d0f4a972dd341eb2485f51a47d6860fbb30a6169054cd8550b4b7226e199f
src/rocprof_compute_soc/analysis_configs/gfx940/1600_vector_l1_data_cache.yaml: 6100b218f24de9f1433b39a093ed04b9bb9dfe656c5df77583c9db332c447230
src/rocprof_compute_soc/analysis_configs/gfx941/1600_vector_l1_data_cache.yaml: 6100b218f24de9f1433b39a093ed04b9bb9dfe656c5df77583c9db332c447230
src/rocprof_compute_soc/analysis_configs/gfx942/1600_vector_l1_data_cache.yaml: 6100b218f24de9f1433b39a093ed04b9bb9dfe656c5df77583c9db332c447230
src/rocprof_compute_soc/analysis_configs/gfx950/1600_vector_l1_data_cache.yaml: 67054ec0a4c6ca147a5dd40cc91f0e8e81378e1affe7d479274747579ecc524a
src/rocprof_compute_soc/analysis_configs/gfx908/1700_l2_cache.yaml: b1baa76f9dbfcc52d5e12cc1834102a0011ddf8bdece5be5fabc2945ab8971f4
src/rocprof_compute_soc/analysis_configs/gfx90a/1700_l2_cache.yaml: 4d834a2066d7f2cb655a8e41fc17531282150b6fe64bbc9c5ff3a10acddee5af
src/rocprof_compute_soc/analysis_configs/gfx940/1700_l2_cache.yaml: 78f9fee5dafc83d311da1c801200c1820e16a0678dd0548fafa8a966ec6a94d5
src/rocprof_compute_soc/analysis_configs/gfx941/1700_l2_cache.yaml: 51fe6e3888975b805594c2ab2b3147e717ae5e015468ee592cbcddc389c689bc
src/rocprof_compute_soc/analysis_configs/gfx942/1700_l2_cache.yaml: dc2dc9ff61b1747e492c28ef5ac76764fd75c18fd0827834130bc583f2afc619
src/rocprof_compute_soc/analysis_configs/gfx950/1700_l2_cache.yaml: d181f753c3fff608c72b8015d1af30bfd8cf8cdfbc0a17c505f717ddaa3b1efc
src/rocprof_compute_soc/analysis_configs/gfx908/1800_l2_cache_per_channel.yaml: a0c53202fe9f68d5e1fa689ce0643c471ced7d47e007d8ccc68fba294f7f6a05
src/rocprof_compute_soc/analysis_configs/gfx90a/1800_l2_cache_per_channel.yaml: a0c53202fe9f68d5e1fa689ce0643c471ced7d47e007d8ccc68fba294f7f6a05
src/rocprof_compute_soc/analysis_configs/gfx940/1800_l2_cache_per_channel.yaml: e184e3692eb0d641fb2e37fada0e58a6c4958553931d7c038b884e1e6986093f
@@ -113,4 +113,4 @@ src/rocprof_compute_soc/profile_configs/sets/gfx940_sets.yaml: 44cd2b32b050cafa7
src/rocprof_compute_soc/profile_configs/sets/gfx941_sets.yaml: 44cd2b32b050cafa73d0ead5703b82836edf25a057c21699046b6b8b8918b242
src/rocprof_compute_soc/profile_configs/sets/gfx942_sets.yaml: 44cd2b32b050cafa73d0ead5703b82836edf25a057c21699046b6b8b8918b242
src/rocprof_compute_soc/profile_configs/sets/gfx950_sets.yaml: 238d9dc8a98cfead3fc904885bfe413e5bcb4f1af31e9820cd640388bcd1e1c2
docs/data/metrics_description.yaml: 819c08a584ae8b418e6983aa51108b95e43eda4f3b7892eab336c61d844b20bf
docs/data/metrics_description.yaml: c2ddad7ef7973b128c1612e56cc6286e49c2f59af829b1795dc64b38c0ecfd61
File diff suppressed because it is too large