Unified configuration for metrics (#726)
* Show description of metrics during analysis
* Use --include-cols Description to show the Description column in analyze mode (this is hidden by default)
* Remove tips field from analysis config
* Align metric names in analysis config and documentation
* Add unified config utils/unified_config.yaml
* Add python script utils/split_config.py to auto generate analysis configuration and documentation metrics description
* Add test case to ensure unified config is older than auto-generated config
* Auto generate analysis config and documentation metrics description
* Update CONTRIBUTING.md to add instructions to build documentation assets
* Add docker image and compose file to build documentation
* Update CHANGELOG and Documentation
* Use jinja template instead of hardcoding metric tables in documentation
[ROCm/rocprofiler-compute commit: bb44e90b2d]
This commit is contained in:
committed by GitHub
parent dcdadfd37d
commit 354fe5f52c
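The commit message mentions a test that checks the unified config is older than the auto-generated config. How the repository's `tests/test_autogen_config.py` actually performs this check is not shown on this page, so the following is only a minimal sketch of one way such a freshness check could look, comparing modification times with `pathlib`. The function name `stale_outputs` and its parameters are hypothetical.

```python
from pathlib import Path


def stale_outputs(source: Path, outputs: list[Path]) -> list[Path]:
    """Return the generated files that are missing or older than `source`.

    Hypothetical helper: `source` would be the unified config and `outputs`
    the files generated from it. An empty result means nothing is stale.
    """
    source_mtime = source.stat().st_mtime
    return [
        out
        for out in outputs
        # A missing output, or one last written before the source, is stale.
        if not out.exists() or out.stat().st_mtime < source_mtime
    ]
```

A test built on this idea would assert that the returned list is empty, so that editing the unified config without regenerating its outputs fails the test run. Note that modification times are not always preserved by checkouts or file copies, so a real test may need a different signal.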
@@ -66,6 +66,9 @@ Full documentation for ROCm Compute Profiler is available at [https://rocm.docs.

 * Add deprecation warning for database update mode.

+* Show description of metrics during analysis
+* Use `--include-cols Description` to show `Description` column which is excluded by default from cli output
+
 ### Changed

 * Change the default rocprof version to rocprofv3, this is used when environment variable "ROCPROF" is not set
@@ -101,6 +104,7 @@ Full documentation for ROCm Compute Profiler is available at [https://rocm.docs.
 * Fixed not detecting memory clock issue when using amd-smi
 * Fixed standalone GUI crashing
 * Fixed L2 read/write/atomic bandwidths on MI350
+* Update metric names for better alignment between analysis configuration and documentation

 ### Known issues

@@ -335,6 +335,16 @@ add_test(
           ${PROJECT_SOURCE_DIR}/tests/test_utils.py
   WORKING_DIRECTORY ${PROJECT_SOURCE_DIR})

+# -----------------------------------
+# Autogenerated configuration tests
+# -----------------------------------
+
+add_test(
+  NAME test_autogen_config
+  COMMAND ${Python3_EXECUTABLE} -m pytest --junitxml=tests/test_autogen_config.xml
+          ${COV_OPTION} ${PROJECT_SOURCE_DIR}/tests/test_autogen_config.py
+  WORKING_DIRECTORY ${PROJECT_SOURCE_DIR})
+
 # ---------
 # Install
 # ---------

@@ -57,3 +57,7 @@ Please see the [pre-commit documentation](https://pre-commit.com/#quick-start) f
 Below are some repository specific guidelines which are followed throughout the repository.
 Any future contributions should adhere to these guidelines:
 * Use the `pathlib` library functions instead of `os.path` for manipulating the file paths.
+
+## Build and test documentation changes
+
+For instructions on how to build and test documentation changes (files under docs folder), please see https://rocm.docs.amd.com/en/latest/contribute/contributing.html

@@ -3,11 +3,6 @@ services:
     build:
       context: ../
       dockerfile: docker/Dockerfile.doctest
-    devices:
-      - /dev/kfd
-      - /dev/dri
-    security_opt:
-      - seccomp:unconfined
     volumes:
       - ../:/app
     tty: true

@@ -0,0 +1,12 @@
+.. list-table::
+   :header-rows: 1
+
+   * - Metric
+     - Description
+     - Unit
+
+   {% for metric, metric_info in data.items() %}
+   * - {{ metric }}
+     - {{ metric_info.rst }}
+     - {{ metric_info.unit }}
+   {% endfor %}
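To make the template above easier to picture, here is a rough standard-library sketch of the table it would produce for a small `data` mapping. It does not use Jinja itself and is not the project's rendering code. The function name `render_metrics_table` is hypothetical, the `rst` and `unit` keys mirror the fields the template reads, and the sample entry reuses the CPF Stall description from the documentation being replaced.

```python
def render_metrics_table(data: dict[str, dict[str, str]]) -> str:
    """Build an RST list-table with one row per metric (illustrative only)."""
    lines = [
        ".. list-table::",
        "   :header-rows: 1",
        "",
        "   * - Metric",
        "     - Description",
        "     - Unit",
        "",
    ]
    for metric, metric_info in data.items():
        # Each metric becomes one three-cell row, as in the template's loop.
        lines.append(f"   * - {metric}")
        lines.append(f"     - {metric_info['rst']}")
        lines.append(f"     - {metric_info['unit']}")
    return "\n".join(lines) + "\n"


sample = {
    "CPF Stall": {
        "rst": "Percent of CPF busy cycles where the CPF was stalled for any reason.",
        "unit": "Percent",
    }
}
print(render_metrics_table(sample))
```

The sketch assumes single-line descriptions. Multi-line RST descriptions would need their continuation lines indented to stay inside the table cell, and the real Jinja output may differ in blank lines depending on whitespace-control settings.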
@@ -46,108 +46,13 @@ processor’s metrics therefore are focused on reporting, for example:
 Command processor fetcher (CPF)
 ===============================

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - CPF Utilization
-
-     - Percent of total cycles where the CPF was busy actively doing any work.
-       The ratio of CPF busy cycles over total cycles counted by the CPF.
-
-     - Percent
-
-   * - CPF Stall
-
-     - Percent of CPF busy cycles where the CPF was stalled for any reason.
-
-     - Percent
-
-   * - CPF-L2 Utilization
-
-     - Percent of total cycles counted by the CPF-:doc:`L2 <l2-cache>` interface
-       where the CPF-L2 interface was active doing any work. The ratio of CPF-L2
-       busy cycles over total cycles counted by the CPF-L2.
-
-     - Percent
-
-   * - CPF-L2 Stall
-
-     - Percent of CPF-:doc:`L2 <l2-cache>` L2 busy cycles where the CPF-L2
-       interface was stalled for any reason.
-
-     - Percent
-
-   * - CPF-UTCL1 Stall
-
-     - Percent of CPF busy cycles where the CPF was stalled by address
-       translation.
-
-     - Percent
+.. jinja:: cpf-metrics
+   :file: _templates/metrics_table.j2

 .. _cpc-metrics:

 Command processor packet processor (CPC)
 ========================================

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - CPC Utilization
-
-     - Percent of total cycles where the CPC was busy actively doing any work.
-       The ratio of CPC busy cycles over total cycles counted by the CPC.
-
-     - Percent
-
-   * - CPC Stall
-
-     - Percent of CPC busy cycles where the CPC was stalled for any reason.
-
-     - Percent
-
-   * - CPC Packet Decoding Utilization
-
-     - Percent of CPC busy cycles spent decoding commands for processing.
-
-     - Percent
-
-   * - CPC-Workgroup Manager Utilization
-
-     - Percent of CPC busy cycles spent dispatching workgroups to the
-       :ref:`workgroup manager <desc-spi>`.
-
-     - Percent
-
-   * - CPC-L2 Utilization
-
-     - Percent of total cycles counted by the CPC-:doc:`L2 <l2-cache>` interface
-       where the CPC-L2 interface was active doing any work.
-
-     - Percent
-
-   * - CPC-UTCL1 Stall
-
-     - Percent of CPC busy cycles where the CPC was stalled by address
-       translation.
-
-     - Percent
-
-   * - CPC-UTCL2 Utilization
-
-     - Percent of total cycles counted by the CPC's :doc:`L2 <l2-cache>` address
-       translation interface where the CPC was busy doing address translation
-       work.
-
-     - Percent
+.. jinja:: cpc-metrics
+   :file: _templates/metrics_table.j2

@@ -48,56 +48,8 @@ The L2 cache’s speed-of-light table contains a few key metrics about the
 performance of the L2 cache, aggregated over all the L2 channels, as a
 comparison with the peak achievable values of those metrics:

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - Utilization
-
-     - The ratio of the
-       :ref:`number of cycles an L2 channel was active, summed over all L2 channels on the accelerator <total-active-l2-cycles>`
-       over the :ref:`total L2 cycles <total-l2-cycles>`.
-
-     - Percent
-
-   * - Bandwidth
-
-     - The number of bytes looked up in the L2 cache, as a percent of the peak
-       theoretical bandwidth achievable on the specific accelerator. The number
-       of bytes is calculated as the number of cache lines requested multiplied
-       by the cache line size. This value does not consider partial requests, so
-       e.g., if only a single value is requested in a cache line, the data
-       movement will still be counted as a full cache line.
-
-     - Percent
-
-   * - Hit Rate
-
-     - The ratio of the number of L2 cache line requests that hit in the L2
-       cache over the total number of incoming cache line requests to the L2
-       cache.
-
-     - Percent
-
-   * - L2-Fabric Read BW
-
-     - The number of bytes read by the L2 over the
-       :ref:`Infinity Fabric interface <l2-fabric>` per unit time.
-
-     - GB/s
-
-   * - L2-Fabric Write and Atomic BW
-
-     - The number of bytes sent by the L2 over the
-       :ref:`Infinity Fabric interface <l2-fabric>` by write and atomic
-       operations per unit time.
-
-     - GB/s
+.. jinja:: l2-sol
+   :file: _templates/metrics_table.j2

 .. note::

@@ -117,168 +69,8 @@ This section details the incoming requests to the L2 cache from the
 :doc:`vL1D <vector-l1-cache>` and other clients -- for instance, the
 :ref:`sL1D <desc-sL1D>` and :ref:`L1I <desc-l1i>` caches.

-.. list-table::
-   :header-rows: 1
-   :widths: 13 70 17
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - Bandwidth
-
-     - The number of bytes looked up in the L2 cache, per
-       :ref:`normalization unit <normalization-units>`. The number of bytes is
-       calculated as the number of cache lines requested multiplied by the cache
-       line size. This value does not consider partial requests, so for example,
-       if only a single value is requested in a cache line, the data movement
-       will still be counted as a full cache line.
-
-     - Bytes per :ref:`normalization unit <normalization-units>`.
-
-   * - Requests
-
-     - The total number of incoming requests to the L2 from all clients for all
-       request types, per :ref:`normalization unit <normalization-units>`.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - Read Requests
-
-     - The total number of read requests to the L2 from all clients.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Write Requests
-
-     - The total number of write requests to the L2 from all clients.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Atomic Requests
-
-     - The total number of atomic requests (with and without return) to the L2
-       from all clients.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Streaming Requests
-
-     - The total number of incoming requests to the L2 that are marked as
-       *streaming*. The exact meaning of this may differ depending on the
-       targeted accelerator, however on an :ref:`MI2XX <mixxx-note>` this
-       corresponds to
-       `non-temporal load or stores <https://clang.llvm.org/docs/LanguageExtensions.html#non-temporal-load-store-builtins>`_.
-       The L2 cache attempts to evict *streaming* requests before normal
-       requests when the L2 is at capacity.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Probe Requests
-
-     - The number of coherence probe requests made to the L2 cache from outside
-       the accelerator. On an :ref:`MI2XX <mixxx-note>`, probe requests may be
-       generated by, for example, writes to
-       :ref:`fine-grained device <memory-type>` memory or by writes to
-       :ref:`coarse-grained <memory-type>` device memory.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Hit Rate
-
-     - The ratio of the number of L2 cache line requests that hit in the L2
-       cache over the total number of incoming cache line requests to the L2
-       cache.
-
-     - Percent
-
-   * - Hits
-
-     - The total number of requests to the L2 from all clients that hit in the
-       cache. As noted in the :ref:`Speed-of-Light <l2-sol>` section, this
-       includes hit-on-miss requests.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Misses
-
-     - The total number of requests to the L2 from all clients that miss in the
-       cache. As noted in the :ref:`Speed-of-Light <l2-sol>` section, these do
-       not include hit-on-miss requests.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Writebacks
-
-     - The total number of L2 cache lines written back to memory for any reason.
-       Write-backs may occur due to user code (such as HIP kernel calls to
-       ``__threadfence_system`` or atomic built-ins) by the
-       :doc:`command processor <command-processor>`'s memory acquire/release
-       fences, or for other internal hardware reasons.
-
-     - Cache lines per :ref:`normalization unit <normalization-units>`
-
-   * - Writebacks (Internal)
-
-     - The total number of L2 cache lines written back to memory for internal
-       hardware reasons, per :ref:`normalization unit <normalization-units>`.
-
-     - Cache lines per :ref:`normalization unit <normalization-units>`.
-
-   * - Writebacks (vL1D Req)
-
-     - The total number of L2 cache lines written back to memory due to requests
-       initiated by the :doc:`vL1D cache <vector-l1-cache>`, per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Cache lines per :ref:`normalization unit <normalization-units>`.
-
-   * - Evictions (Normal)
-
-     - The total number of L2 cache lines evicted from the cache due to capacity
-       limits, per :ref:`normalization unit <normalization-units>`.
-
-     - Cache lines per :ref:`normalization unit <normalization-units>`.
-
-   * - Evictions (vL1D Req)
-
-     - The total number of L2 cache lines evicted from the cache due to
-       invalidation requests initiated by the
-       :doc:`vL1D cache <vector-l1-cache>`, per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Cache lines per :ref:`normalization unit <normalization-units>`.
-
-   * - Non-hardware-Coherent Requests
-
-     - The total number of requests to the L2 to Not-hardware-Coherent (NC)
-       memory allocations, per :ref:`normalization unit <normalization-units>`.
-       See the :ref:`memory-type` for more information.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - Uncached Requests
-
-     - The total number of requests to the L2 that go to Uncached (UC) memory
-       allocations. See the :ref:`memory-type` for more information.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - Coherently Cached Requests
-
-     - The total number of requests to the L2 that go to Coherently Cacheable (CC)
-       memory allocations. See the :ref:`memory-type` for more information.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - Read/Write Coherent Requests
-
-     - The total number of requests to the L2 that go to Read-Write coherent memory
-       (RW) allocations. See the :ref:`memory-type` for more information.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
+.. jinja:: l2-cache-accesses
+   :file: _templates/metrics_table.j2

 .. note::

@@ -300,7 +92,7 @@ is responsible for routing these memory requests/data to the correct
 location and returning any fetched data to the L2 cache. The
 :ref:`l2-request-flow` describes the flow of these requests through
 Infinity Fabric in more detail, as described by ROCm Compute Profiler metrics,
-while :ref:`l2-request-metrics` give detailed definitions of
+while :ref:`l2-fabric` give detailed definitions of
 individual metrics.

 .. _l2-request-flow:
@@ -363,176 +155,15 @@ to uncached memory (denoted by the dashed line), they will also be
 counted as *two* uncached read requests (that is, the request is split).


-.. _l2-request-metrics:
+.. _l2-fabric-metrics:

 Metrics
 -------

 The following metrics are reported for the L2-Fabric interface:

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - L2-Fabric Read Bandwidth
-
-     - The total number of bytes read by the L2 cache from Infinity Fabric per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Bytes per :ref:`normalization unit <normalization-units>`.
-
-   * - HBM Read Traffic
-
-     - The percent of read requests generated by the L2 cache that are routed to
-       the accelerator's local high-bandwidth memory (HBM). This breakdown does
-       not consider the *size* of the request (meaning that 32B and 64B requests
-       are both counted as a single request), so this metric only *approximates*
-       the percent of the L2-Fabric Read bandwidth directed to the local HBM.
-
-     - Percent
-
-   * - Remote Read Traffic
-
-     - The percent of read requests generated by the L2 cache that are routed to
-       any memory location other than the accelerator's local high-bandwidth
-       memory (HBM) -- for example, the CPU's DRAM or a remote accelerator's
-       HBM. This breakdown does not consider the *size* of the request (meaning
-       that 32B and 64B requests are both counted as a single request), so this
-       metric only *approximates* the percent of the L2-Fabric Read bandwidth
-       directed to a remote location.
-
-     - Percent
-
-   * - Uncached Read Traffic
-
-     - The percent of read requests generated by the L2 cache that are reading
-       from an :ref:`uncached memory allocation <memory-type>`. Note, as
-       described in the :ref:`request flow <l2-request-flow>` section, a single
-       64B read request is typically counted as two uncached read requests. So,
-       it is possible for the Uncached Read Traffic to reach up to 200% of the
-       total number of read requests. This breakdown does not consider the
-       *size* of the request (i.e., 32B and 64B requests are both counted as a
-       single request), so this metric only *approximates* the percent of the
-       L2-Fabric read bandwidth directed to an uncached memory location.
-
-     - Percent
-
-   * - L2-Fabric Write and Atomic Bandwidth
-
-     - The total number of bytes written by the L2 over Infinity Fabric by write
-       and atomic operations per
-       :ref:`normalization unit <normalization-units>`. Note that on current
-       CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests are
-       only considered *atomic* by Infinity Fabric if they are targeted at
-       non-write-cacheable memory, for example,
-       :ref:`fine-grained memory <memory-type>` allocations or
-       :ref:`uncached memory <memory-type>` allocations on the
-       MI2XX.
-
-     - Bytes per :ref:`normalization unit <normalization-units>`.
-
-   * - HBM Write and Atomic Traffic
-
-     - The percent of write and atomic requests generated by the L2 cache that
-       are routed to the accelerator's local high-bandwidth memory (HBM). This
-       breakdown does not consider the *size* of the request (meaning that 32B
-       and 64B requests are both counted as a single request), so this metric
-       only *approximates* the percent of the L2-Fabric Write and Atomic
-       bandwidth directed to the local HBM. Note that on current CDNA
-       accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests are only
-       considered *atomic* by Infinity Fabric if they are targeted at
-       :ref:`fine-grained memory <memory-type>` allocations or
-       :ref:`uncached memory <memory-type>` allocations.
-
-     - Percent
-
-   * - Remote Write and Atomic Traffic
-
-     - The percent of read requests generated by the L2 cache that are routed to
-       any memory location other than the accelerator's local high-bandwidth
-       memory (HBM) -- for example, the CPU's DRAM or a remote accelerator's
-       HBM. This breakdown does not consider the *size* of the request (meaning
-       that 32B and 64B requests are both counted as a single request), so this
-       metric only *approximates* the percent of the L2-Fabric Read bandwidth
-       directed to a remote location. Note that on current CDNA
-       accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests are only
-       considered *atomic* by Infinity Fabric if they are targeted at
-       :ref:`fine-grained memory <memory-type>` allocations or
-       :ref:`uncached memory <memory-type>` allocations.
-
-     - Percent
-
-   * - Atomic Traffic
-
-     - The percent of write requests generated by the L2 cache that are atomic
-       requests to *any* memory location. This breakdown does not consider the
-       *size* of the request (meaning that 32B and 64B requests are both counted
-       as a single request), so this metric only *approximates* the percent of
-       the L2-Fabric Read bandwidth directed to a remote location. Note that on
-       current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`,
-       requests are only considered *atomic* by Infinity Fabric if they are
-       targeted at :ref:`fine-grained memory <memory-type>` allocations or
-       :ref:`uncached memory <memory-type>` allocations.
-
-     - Percent
-
-   * - Uncached Write and Atomic Traffic
-
-     - The percent of write and atomic requests generated by the L2 cache that
-       are targeting :ref:`uncached memory allocations <memory-type>`. This
-       breakdown does not consider the *size* of the request (meaning that 32B
-       and 64B requests are both counted as a single request), so this metric
-       only *approximates* the percent of the L2-Fabric read bandwidth directed
-       to uncached memory allocations.
-
-     - Percent
-
-   * - Read Latency
-
-     - The time-averaged number of cycles read requests spent in Infinity Fabric
-       before data was returned to the L2.
-
-     - Cycles
-
-   * - Write Latency
-
-     - The time-averaged number of cycles write requests spent in Infinity
-       Fabric before a completion acknowledgement was returned to the L2.
-
-     - Cycles
-
-   * - Atomic Latency
-
-     - The time-averaged number of cycles atomic requests spent in Infinity
-       Fabric before a completion acknowledgement (atomic without return value)
-       or data (atomic with return value) was returned to the L2.
-
-     - Cycles
-
-   * - Read Stall
-
-     - The ratio of the total number of cycles the L2-Fabric interface was
-       stalled on a read request to any destination (local HBM, remote PCIe®
-       connected accelerator or CPU, or remote Infinity Fabric connected
-       accelerator [#inf]_ or CPU) over the
-       :ref:`total active L2 cycles <total-active-l2-cycles>`.
-
-     - Percent
-
-   * - Write Stall
-
-     - The ratio of the total number of cycles the L2-Fabric interface was
-       stalled on a write or atomic request to any destination (local HBM,
-       remote accelerator or CPU, PCIe connected accelerator or CPU, or remote
-       Infinity Fabric connected accelerator [#inf]_ or CPU) over the
-       :ref:`total active L2 cycles <total-active-l2-cycles>`.
-
-     - Percent
+.. jinja:: l2-fabric-metrics
+   :file: _templates/metrics_table.j2

 .. _l2-detailed-metrics:

@@ -542,121 +173,8 @@ Detailed transaction metrics
 The following metrics are available in the detailed L2-Fabric
 transaction breakdown table:

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - 32B Read Requests
-
-     - The total number of L2 requests to Infinity Fabric to read 32B of data
-       from any memory location, per
-       :ref:`normalization unit <normalization-units>`. See
-       :ref:`l2-request-flow` for more detail. Typically unused on CDNA
-       accelerators.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - Uncached Read Requests
-
-     - The total number of L2 requests to Infinity Fabric to read
-       :ref:`uncached data <memory-type>` from any memory location, per
-       :ref:`normalization unit <normalization-units>`. 64B requests for
-       uncached data are counted as two 32B uncached data requests. See
-       :ref:`l2-request-flow` for more detail.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - 64B Read Requests
-
-     - The total number of L2 requests to Infinity Fabric to read 64B of data
-       from any memory location, per
-       :ref:`normalization unit <normalization-units>`. See
-       :ref:`l2-request-flow` for more detail.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - HBM Read Requests
-
-     - The total number of L2 requests to Infinity Fabric to read 32B or 64B of
-       data from the accelerator's local HBM, per
-       :ref:`normalization unit <normalization-units>`. See
-       :ref:`l2-request-flow` for more detail.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - Remote Read Requests
-
-     - The total number of L2 requests to Infinity Fabric to read 32B or 64B of
-       data from any source other than the accelerator's local HBM, per
-       :ref:`normalization unit <normalization-units>`. See
-       :ref:`l2-request-flow` for more detail.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - 32B Write and Atomic Requests
-
-     - The total number of L2 requests to Infinity Fabric to write or atomically
-       update 32B of data to any memory location, per
-       :ref:`normalization unit <normalization-units>`. See
-       :ref:`l2-request-flow` for more detail.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - Uncached Write and Atomic Requests
-
-     - The total number of L2 requests to Infinity Fabric to write or atomically
-       update 32B or 64B of :ref:`uncached data <memory-type>`, per
-       :ref:`normalization unit <normalization-units>`. See
-       :ref:`l2-request-flow` for more detail.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - 64B Write and Atomic Requests
-
-     - The total number of L2 requests to Infinity Fabric to write or atomically
-       update 64B of data in any memory location, per
-       :ref:`normalization unit <normalization-units>`. See
-       :ref:`l2-request-flow` for more detail.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - HBM Write and Atomic Requests
-
-     - The total number of L2 requests to Infinity Fabric to write or atomically
-       update 32B or 64B of data in the accelerator's local HBM, per
-       :ref:`normalization unit <normalization-units>`. See
-       :ref:`l2-request-flow` for more detail.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - Remote Write and Atomic Requests
-
-     - The total number of L2 requests to Infinity Fabric to write or atomically
-       update 32B or 64B of data in any memory location other than the
-       accelerator's local HBM, per
-       :ref:`normalization unit <normalization-units>`. See
-       :ref:`l2-request-flow` for more detail.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - Atomic Requests
-
-     - The total number of L2 requests to Infinity Fabric to atomically update
-       32B or 64B of data in any memory location, per
-       :ref:`normalization unit <normalization-units>`. See
-       :ref:`l2-request-flow` for more detail. Note that on current CDNA
-       accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests are only
-       considered *atomic* by Infinity Fabric if they are targeted at
-       non-write-cacheable memory, such as
-       :ref:`fine-grained memory <memory-type>` allocations or
-       :ref:`uncached memory <memory-type>` allocations on the MI2XX.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
+.. jinja:: l2-detailed-metrics
+   :file: _templates/metrics_table.j2

 .. _l2-fabric-stalls:

@@ -670,72 +188,8 @@ what types of requests in a kernel caused a stall (like read versus write), and
 to which locations -- for instance, to the accelerator’s local memory, or to
 remote accelerators or CPUs.

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - Read - PCIe Stall
-
-     - The number of cycles the L2-Fabric interface was stalled on read requests
-       to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the
-       :ref:`total active L2 cycles <total-active-l2-cycles>`.
-
-     - Percent
-
-   * - Read - Infinity Fabric Stall
-
-     - The number of cycles the L2-Fabric interface was stalled on read requests
-       to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a
-       percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
-
-     - Percent
-
-   * - Read - HBM Stall
-
-     - The number of cycles the L2-Fabric interface was stalled on read requests
-       to the accelerator's local HBM as a percent of the
-       :ref:`total active L2 cycles <total-active-l2-cycles>`.
-
-     - Percent
-
-   * - Write - PCIe Stall
-
-     - The number of cycles the L2-Fabric interface was stalled on write or
-       atomic requests to remote PCIe connected accelerators [#inf]_ or CPUs as
-       a percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
-
-     - Percent
-
-   * - Write - Infinity Fabric Stall
-
-     - The number of cycles the L2-Fabric interface was stalled on write or
-       atomic requests to remote Infinity Fabric connected accelerators [#inf]_
-       or CPUs as a percent of the
-       :ref:`total active L2 cycles <total-active-l2-cycles>`.
-
-     - Percent
-
-   * - Write - HBM Stall
-
-     - The number of cycles the L2-Fabric interface was stalled on write or
-       atomic requests to accelerator's local HBM as a percent of the
-       :ref:`total active L2 cycles <total-active-l2-cycles>`.
-
-     - Percent
-
-   * - Write - Credit Starvation
-
-     - The number of cycles the L2-Fabric interface was stalled on write or
-       atomic requests to any memory location because too many write/atomic
-       requests were currently in flight, as a percent of the
-       :ref:`total active L2 cycles <total-active-l2-cycles>`.
-
-     - Percent
+.. jinja:: l2-fabric-stalls
+   :file: _templates/metrics_table.j2

 .. warning::

@@ -21,53 +21,8 @@ LDS Speed-of-Light
|
||||
The :ref:`LDS <desc-lds>` speed-of-light chart shows a number of key metrics for
|
||||
the LDS as a comparison with the peak achievable values of those metrics.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
|
||||
* - Metric
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - Utilization
|
||||
|
||||
- Indicates what percent of the kernel's duration the :ref:`LDS <desc-lds>`
|
||||
was actively executing instructions (including, but not limited to, load,
|
||||
store, atomic and HIP's ``__shfl`` operations). Calculated as the ratio
|
||||
of the total number of cycles LDS was active over the
|
||||
:ref:`total CU cycles <total-cu-cycles>`.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Access Rate
|
||||
|
||||
- Indicates the percentage of SIMDs in the :ref:`VALU <desc-valu>` [#lds-workload]_
|
||||
actively issuing LDS instructions, averaged over the lifetime of the
|
||||
kernel. Calculated as the ratio of the total number of cycles spent by
|
||||
the :ref:`scheduler <desc-scheduler>` issuing :ref:`LDS <desc-lds>`
|
||||
instructions over the
|
||||
:ref:`total CU cycles <total-cu-cycles>`.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Theoretical Bandwidth (% of Peak)
|
||||
|
||||
- Indicates the maximum number of bytes that *could* have been loaded from,
|
||||
stored to, or atomically updated in the LDS in this kernel, as a percent
|
||||
of the peak LDS bandwidth achievable. See the
|
||||
:ref:`LDS bandwidth example <lds-bandwidth>` for more detail.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Bank Conflict Rate
|
||||
|
||||
- Indicates the percentage of active LDS cycles that were spent servicing
|
||||
bank conflicts. Calculated as the ratio of LDS cycles spent servicing
|
||||
bank conflicts over the number of LDS cycles that would have been
|
||||
required to move the same amount of data in an uncontended access. [#lds-bank-conflict]_
|
||||
|
||||
- Percent
|
||||
.. jinja:: lds-sol
|
||||
:file: _templates/metrics_table.j2
|
||||
|
||||
.. rubric:: Footnotes
|
||||
|
||||
@@ -90,93 +45,5 @@ Statistics
|
||||
|
||||
The LDS statistics panel gives a more detailed view of the hardware:
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
|
||||
* - Metric
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - LDS Instructions
|
||||
|
||||
- The total number of LDS instructions (including, but not limited to,
|
||||
read/write/atomics and HIP's ``__shfl`` instructions) executed per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Theoretical Bandwidth
|
||||
|
||||
- Indicates the maximum number of bytes that could have been loaded from,
|
||||
stored to, or atomically updated in the LDS per
|
||||
:ref:`normalization unit <normalization-units>`. Does *not* take into
|
||||
account the execution mask of the wavefront when the instruction was
|
||||
executed. See the
|
||||
:ref:`LDS bandwidth example <lds-bandwidth>` for more detail.
|
||||
|
||||
- Bytes per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - LDS Latency
|
||||
|
||||
- The average number of round-trip cycles (i.e., from issue to data-return
|
||||
/ acknowledgment) required for an LDS instruction to complete.
|
||||
|
||||
- Cycles
|
||||
|
||||
* - Bank Conflicts/Access
|
||||
|
||||
- The ratio of the number of cycles spent in the
|
||||
:ref:`LDS scheduler <desc-lds>` due to bank conflicts (as determined by
|
||||
the conflict resolution hardware) to the base number of cycles that would
|
||||
be spent in the LDS scheduler in a completely uncontended case. This is
|
||||
the unnormalized form of the Bank Conflict Rate.
|
||||
|
||||
- Conflicts/Access
|
||||
|
||||
* - Index Accesses
|
||||
|
||||
- The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>`
|
||||
over all operations per :ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Cycles per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Atomic Return Cycles
|
||||
|
||||
- The total number of cycles spent on LDS atomics with return per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Cycles per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Bank Conflicts
|
||||
|
||||
- The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>`
|
||||
due to bank conflicts (as determined by the conflict resolution hardware)
|
||||
per :ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Cycles per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Address Conflicts
|
||||
|
||||
- The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>`
|
||||
due to address conflicts (as determined by the conflict resolution
|
||||
hardware) per :ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Cycles per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Unaligned Stall
|
||||
|
||||
- The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>`
|
||||
due to stalls from non-dword aligned addresses per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Cycles per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Memory Violations
|
||||
|
||||
- The total number of out-of-bounds accesses made to the LDS, per
|
||||
:ref:`normalization unit <normalization-units>`. This is unused and
|
||||
expected to be zero in most configurations for modern CDNA™ accelerators.
|
||||
|
||||
- Accesses per :ref:`normalization unit <normalization-units>`
|
||||
.. jinja:: lds-stats
|
||||
:file: _templates/metrics_table.j2
|
||||
|
||||
@@ -23,97 +23,8 @@ Wavefront launch stats
|
||||
The wavefront launch stats panel gives general information about the
|
||||
kernel launch:
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:widths: 20 65 15
|
||||
|
||||
* - Metric
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - Grid Size
|
||||
|
||||
- The total number of work-items (or, threads) launched as a part of
|
||||
the kernel dispatch. In HIP, this is equivalent to the total grid size
|
||||
multiplied by the total workgroup (or, block) size.
|
||||
|
||||
- :ref:`Work-items <desc-work-item>`
|
||||
|
||||
* - Workgroup Size
|
||||
|
||||
- The total number of work-items (or, threads) in each workgroup
|
||||
(or, block) launched as part of the kernel dispatch. In HIP, this is
|
||||
equivalent to the total block size.
|
||||
|
||||
- :ref:`Work-items <desc-work-item>`
|
||||
|
||||
* - Total Wavefronts
|
||||
|
||||
- The total number of wavefronts launched as part of the kernel dispatch.
|
||||
On AMD Instinct™ CDNA™ accelerators and GCN™ GPUs, the wavefront size is
|
||||
always 64 work-items. Thus, the total number of wavefronts should be
|
||||
equivalent to the ceiling of grid size divided by 64.
|
||||
|
||||
- :ref:`Wavefronts <desc-wavefront>`
|
||||
|
||||
* - Saved Wavefronts
|
||||
|
||||
- The total number of wavefronts saved at a context-save. See
|
||||
`cwsr_enable <https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
|
||||
|
||||
- :ref:`Wavefronts <desc-wavefront>`
|
||||
|
||||
* - Restored Wavefronts
|
||||
|
||||
- The total number of wavefronts restored from a context-save. See
|
||||
`cwsr_enable <https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
|
||||
|
||||
- :ref:`Wavefronts <desc-wavefront>`
|
||||
|
||||
* - VGPRs
|
||||
|
||||
- The number of architected vector general-purpose registers allocated for
|
||||
the kernel, see :ref:`VALU <desc-valu>`. Note: this may not exactly
|
||||
match the number of VGPRs requested by the compiler due to allocation
|
||||
granularity.
|
||||
|
||||
- :ref:`VGPRs <desc-valu>`
|
||||
|
||||
* - AGPRs
|
||||
|
||||
- The number of accumulation vector general-purpose registers allocated for
|
||||
the kernel, see :ref:`AGPRs <desc-agprs>`. Note: this may not exactly
|
||||
match the number of AGPRs requested by the compiler due to allocation
|
||||
granularity.
|
||||
|
||||
- :ref:`AGPRs <desc-agprs>`
|
||||
|
||||
* - SGPRs
|
||||
|
||||
- The number of scalar general-purpose registers allocated for the kernel,
|
||||
see :ref:`SALU <desc-salu>`. Note: this may not exactly match the number
|
||||
of SGPRs requested by the compiler due to allocation granularity.
|
||||
|
||||
- :ref:`SGPRs <desc-salu>`
|
||||
|
||||
* - LDS Allocation
|
||||
|
||||
- The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared
|
||||
memory) allocated for this kernel. Note: This may also be larger than
|
||||
what was requested at compile time due to both allocation granularity and
|
||||
dynamic per-dispatch LDS allocations.
|
||||
|
||||
- Bytes per :ref:`workgroup <desc-workgroup>`
|
||||
|
||||
* - Scratch Allocation
|
||||
|
||||
- The number of bytes of :ref:`scratch memory <memory-spaces>` requested
|
||||
per work-item for this kernel. Scratch memory is used for stack memory
|
||||
on the accelerator, as well as for register spills and restores.
|
||||
|
||||
- Bytes per :ref:`work-item <desc-work-item>`
|
||||
.. jinja:: wavefront-launch-stats
|
||||
:file: _templates/metrics_table.j2
|
||||
|
||||
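The relationship between Grid Size, Workgroup Size, and Total Wavefronts described above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the fixed wavefront size of 64 work-items stated in the Total Wavefronts description; the function names are illustrative and are not part of the profiler.

```python
WAVEFRONT_SIZE = 64  # work-items per wavefront, as stated in the table above


def grid_size_work_items(num_workgroups, workgroup_size):
    """Total work-items: HIP grid size (in blocks) times block size."""
    return num_workgroups * workgroup_size


def total_wavefronts(grid_size):
    """Ceiling of grid size divided by the wavefront size, in integer math."""
    return (grid_size + WAVEFRONT_SIZE - 1) // WAVEFRONT_SIZE


grid = grid_size_work_items(num_workgroups=1000, workgroup_size=256)
print(grid, total_wavefronts(grid))
print(total_wavefronts(100))  # a partial wavefront still counts as one
```

If the reported Total Wavefronts value differs from this estimate, the launch dimensions are a reasonable first thing to re-check.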
.. _wavefront-runtime-stats:
|
||||
|
||||
@@ -123,96 +34,8 @@ Wavefront runtime stats
|
||||
The wavefront runtime statistics give a high-level overview of the
|
||||
execution of wavefronts in a kernel:
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:widths: 18 65 17
|
||||
|
||||
* - Metric
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - :ref:`Kernel time <kernel-time>`
|
||||
|
||||
- The total duration of the executed kernel. Note: this should not be
|
||||
directly compared to the wavefront cycles / timings below.
|
||||
|
||||
- Nanoseconds
|
||||
|
||||
* - :ref:`Kernel cycles <kernel-cycles>`
|
||||
|
||||
- The total duration of the executed kernel in cycles. Note: this should
|
||||
not be directly compared to the wavefront cycles / timings below.
|
||||
|
||||
- Cycles
|
||||
|
||||
* - Instructions per wavefront
|
||||
|
||||
- The average number of instructions (of all types) executed per wavefront.
|
||||
This is averaged over all wavefronts in a kernel dispatch.
|
||||
|
||||
- Instructions / wavefront
|
||||
|
||||
* - Wave cycles
|
||||
|
||||
- The number of cycles a wavefront in the kernel dispatch spent resident on
|
||||
a compute unit per :ref:`normalization unit <normalization-units>`. This
|
||||
is averaged over all wavefronts in a kernel dispatch. Note: this should
|
||||
not be directly compared to the kernel cycles above.
|
||||
|
||||
- Cycles per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Dependency wait cycles
|
||||
|
||||
- The number of cycles a wavefront in the kernel dispatch stalled waiting
|
||||
on memory of any kind (e.g., instruction fetch, vector or scalar memory,
|
||||
etc.) per :ref:`normalization unit <normalization-units>`. This counter
|
||||
is incremented at every cycle by *all* wavefronts on a CU stalled at a
|
||||
memory operation. As such, it is most useful to get a sense of how waves
|
||||
were spending their time, rather than identification of a precise limiter
|
||||
because another wave could be actively executing while a wave is stalled.
|
||||
The sum of this metric, Issue Wait Cycles and Active Cycles should be
|
||||
equal to the total Wave Cycles metric.
|
||||
|
||||
- Cycles per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Issue Wait Cycles
|
||||
|
||||
- The number of cycles a wavefront in the kernel dispatch was unable to
|
||||
issue an instruction for any reason (e.g., execution pipe back-pressure,
|
||||
arbitration loss, etc.) per
|
||||
:ref:`normalization unit <normalization-units>`. This counter is
|
||||
incremented at every cycle by *all* wavefronts on a CU unable to issue an
|
||||
instruction. As such, it is most useful to get a sense of how waves were
|
||||
spending their time, rather than identification of a precise limiter
|
||||
because another wave could be actively executing while a wave is issue
|
||||
stalled. The sum of this metric, Dependency Wait Cycles and Active
|
||||
Cycles should be equal to the total Wave Cycles metric.
|
||||
|
||||
- Cycles per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Active Cycles
|
||||
|
||||
- The average number of cycles a wavefront in the kernel dispatch was
|
||||
actively executing instructions per
|
||||
:ref:`normalization unit <normalization-units>`. This measurement is made
|
||||
on a per-wavefront basis, and may include cycles that another wavefront
|
||||
spent actively executing (on another execution unit, for example) or was
|
||||
stalled. As such, it is most useful to get a sense of how waves were
|
||||
spending their time, rather than identification of a precise limiter. The
|
||||
sum of this metric, Issue Wait Cycles and Dependency Wait Cycles should be
|
||||
equal to the total Wave Cycles metric.
|
||||
|
||||
- Cycles per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Wavefront Occupancy
|
||||
|
||||
- The time-averaged number of wavefronts resident on the accelerator over
|
||||
the lifetime of the kernel. Note: this metric may be inaccurate for
|
||||
short-running kernels (less than 1 ms).
|
||||
|
||||
- :ref:`Wavefronts <desc-wavefront>`
|
||||
.. jinja:: wavefront-runtime-stats
|
||||
:file: _templates/metrics_table.j2
|
||||
|
||||
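The descriptions above state that Dependency Wait Cycles, Issue Wait Cycles, and Active Cycles should sum to Wave Cycles. A minimal sketch of using that identity to see how waves spent their time is shown below. The function name and the sample values are made up for illustration; as the descriptions note, the result indicates where time went rather than identifying a precise limiter.

```python
def wave_cycle_breakdown(dependency_wait, issue_wait, active):
    """Return each component as a percent of their sum (the Wave Cycles total)."""
    wave_cycles = dependency_wait + issue_wait + active
    components = {
        "Dependency Wait Cycles": dependency_wait,
        "Issue Wait Cycles": issue_wait,
        "Active Cycles": active,
    }
    if wave_cycles == 0:
        # Avoid dividing by zero for an empty or missing measurement.
        return {name: 0.0 for name in components}
    return {name: 100 * value / wave_cycles for name, value in components.items()}


# Illustrative values only, in cycles per normalization unit.
breakdown = wave_cycle_breakdown(dependency_wait=500, issue_wait=300, active=200)
for name, percent in breakdown.items():
    print(f"{name}: {percent:.1f}%")
```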
.. note::
|
||||
|
||||
@@ -256,71 +79,8 @@ This panel shows the total number of each type of instruction issued to
|
||||
the :doc:`various compute pipelines </conceptual/pipeline-descriptions>` on the
|
||||
:doc:`CU </conceptual/compute-unit>`. These are:
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
|
||||
* - Metric
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - :ref:`VALU <desc-valu>` instructions
|
||||
|
||||
- The total number of vector arithmetic logic unit (VALU) operations
|
||||
issued. These are the workhorses of the
|
||||
:doc:`compute unit <compute-unit>`, and are used to execute a wide range of
|
||||
instruction types including floating point operations, non-uniform
|
||||
address calculations, transcendental operations, integer operations,
|
||||
shifts, conditional evaluation, etc.
|
||||
|
||||
- Instructions
|
||||
|
||||
* - VMEM instructions
|
||||
|
||||
- The total number of vector memory operations issued. These include most
|
||||
loads, stores and atomic operations and all accesses to
|
||||
:ref:`generic, global, private and texture <memory-spaces>` memory.
|
||||
|
||||
- Instructions
|
||||
|
||||
* - :doc:`LDS <local-data-share>` instructions
|
||||
|
||||
- The total number of LDS (also known as shared memory) operations issued.
|
||||
These include loads, stores, atomics, and HIP's ``__shfl`` operations.
|
||||
|
||||
- Instructions
|
||||
|
||||
* - :ref:`MFMA <desc-mfma>` instructions
|
||||
|
||||
- The total number of matrix fused multiply-add instructions issued.
|
||||
|
||||
- Instructions
|
||||
|
||||
* - :ref:`SALU <desc-salu>` instructions
|
||||
|
||||
- The total number of scalar arithmetic logic unit (SALU) operations
|
||||
issued. Typically these are used for address calculations, literal
|
||||
constants, and other operations that are *provably* uniform across a
|
||||
wavefront. Although scalar memory (SMEM) operations are issued by the
|
||||
SALU, they are counted separately in this section.
|
||||
|
||||
- Instructions
|
||||
|
||||
* - SMEM instructions
|
||||
|
||||
- The total number of scalar memory (SMEM) operations issued. These are
|
||||
typically used for loading kernel arguments, base-pointers and loads
|
||||
from HIP's ``__constant__`` memory.
|
||||
|
||||
- Instructions
|
||||
|
||||
* - :ref:`Branch <desc-branch>` instructions
|
||||
|
||||
- The total number of branch operations issued. These typically consist of
|
||||
jump or branch operations and are used to implement control flow.
|
||||
|
||||
- Instructions
|
||||
.. jinja:: instruction-mix
|
||||
:file: _templates/metrics_table.j2
|
||||
|
||||
.. note::
|
||||
|
||||
@@ -345,133 +105,8 @@ include :ref:`MFMA <desc-mfma>` instructions using the same precision; for
|
||||
instance, the “F16-ADD” metric does not include any 16-bit floating point
|
||||
additions executed as part of an MFMA instruction using the same precision.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:widths: 15 65 20
|
||||
|
||||
* - Metric
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - INT32
|
||||
|
||||
- The total number of instructions operating on 32-bit integer operands
|
||||
issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - INT64
|
||||
|
||||
- The total number of instructions operating on 64-bit integer operands
|
||||
issued to the VALU per :ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - F16-ADD
|
||||
|
||||
- The total number of addition instructions operating on 16-bit
|
||||
floating-point operands issued to the VALU per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - F16-MUL
|
||||
|
||||
- The total number of multiplication instructions operating on 16-bit
|
||||
floating-point operands issued to the VALU per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - F16-FMA
|
||||
|
||||
- The total number of fused multiply-add instructions operating on 16-bit
|
||||
floating-point operands issued to the VALU per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - F16-TRANS
|
||||
|
||||
- The total number of transcendental instructions (e.g., ``sqrt``) operating
|
||||
on 16-bit floating-point operands issued to the VALU per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - F32-ADD
|
||||
|
||||
- The total number of addition instructions operating on 32-bit
|
||||
floating-point operands issued to the VALU per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - F32-MUL
|
||||
|
||||
- The total number of multiplication instructions operating on 32-bit
|
||||
floating-point operands issued to the VALU per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - F32-FMA
|
||||
|
||||
- The total number of fused multiply-add instructions operating on 32-bit
|
||||
floating-point operands issued to the VALU per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - F32-TRANS
|
||||
|
||||
- The total number of transcendental instructions (such as ``sqrt``)
|
||||
operating on 32-bit floating-point operands issued to the VALU per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - F64-ADD
|
||||
|
||||
- The total number of addition instructions operating on 64-bit
|
||||
floating-point operands issued to the VALU per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - F64-MUL
|
||||
|
||||
- The total number of multiplication instructions operating on 64-bit
|
||||
floating-point operands issued to the VALU per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - F64-FMA
|
||||
|
||||
- The total number of fused multiply-add instructions operating on 64-bit
|
||||
floating-point operands issued to the VALU per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - F64-TRANS
|
||||
|
||||
- The total number of transcendental instructions (such as ``sqrt``)
|
||||
operating on 64-bit floating-point operands issued to the VALU per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Conversion
|
||||
|
||||
- The total number of type conversion instructions (such as converting data
|
||||
between F32 and F64) issued to the VALU per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
.. jinja:: valu-arith-instruction-mix
|
||||
:file: _templates/metrics_table.j2
|
||||
|
||||
For an example of these counters in action, refer to
|
||||
:ref:`valu-arith-instruction-mix-ex`.
|
||||
@@ -502,57 +137,8 @@ This section details the types of Matrix Fused Multiply-Add
|
||||
MFMA instructions are classified by the type of input data they operate on, and
|
||||
*not* the data type the result is accumulated to.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:widths: 25 60 17
|
||||
|
||||
* - Metric
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - MFMA-I8 Instructions
|
||||
|
||||
- The total number of 8-bit integer :ref:`MFMA <desc-mfma>` instructions
|
||||
issued per :ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - MFMA-F8 Instructions
|
||||
|
||||
- The total number of 8-bit floating point :ref:`MFMA <desc-mfma>`
|
||||
instructions issued per :ref:`normalization unit <normalization-units>`. This is supported in AMD Instinct MI300 series and later only.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - MFMA-F16 Instructions
|
||||
|
||||
- The total number of 16-bit floating point :ref:`MFMA <desc-mfma>`
|
||||
instructions issued per :ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - MFMA-BF16 Instructions
|
||||
|
||||
- The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
|
||||
instructions issued per :ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - MFMA-F32 Instructions
|
||||
|
||||
- The total number of 32-bit floating-point :ref:`MFMA <desc-mfma>`
|
||||
instructions issued per :ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - MFMA-F64 Instructions
|
||||
|
||||
- The total number of 64-bit floating-point :ref:`MFMA <desc-mfma>`
|
||||
instructions issued per :ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
.. jinja:: mfma-instruction-mix
|
||||
:file: _templates/metrics_table.j2
|
||||
|
||||
Compute pipeline
|
||||
================
|
||||
@@ -612,84 +198,8 @@ various precisions. We note that unlike the
|
||||
are reported as FLOPs and IOPs, that is, the total number of operations
|
||||
executed.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
|
||||
* - Metric
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - VALU FLOPs
|
||||
|
||||
- The total floating-point operations executed per second on the
|
||||
:ref:`VALU <desc-valu>`. This is also presented as a percent of the peak
|
||||
theoretical FLOPs achievable on the specific accelerator. Note: this does
|
||||
not include any floating-point operations from :ref:`MFMA <desc-mfma>`
|
||||
instructions.
|
||||
|
||||
- GFLOPs
|
||||
|
||||
* - VALU IOPs
|
||||
|
||||
- The total integer operations executed per second on the
|
||||
:ref:`VALU <desc-valu>`. This is also presented as a percent of the peak
|
||||
theoretical IOPs achievable on the specific accelerator. Note: this does
|
||||
not include any integer operations from :ref:`MFMA <desc-mfma>`
|
||||
instructions.
|
||||
|
||||
- GIOPs
|
||||
|
||||
* - MFMA FLOPs (BF16)
|
||||
|
||||
- The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
|
||||
operations executed per second. Note: this does not include any 16-bit
|
||||
brain floating point operations from :ref:`VALU <desc-valu>`
|
||||
instructions. This is also presented as a percent of the peak theoretical
|
||||
BF16 MFMA operations achievable on the specific accelerator.
|
||||
|
||||
- GFLOPs
|
||||
|
||||
* - MFMA FLOPs (F16)
|
||||
|
||||
- The total number of 16-bit floating point :ref:`MFMA <desc-mfma>`
|
||||
operations executed per second. Note: this does not include any 16-bit
|
||||
floating point operations from :ref:`VALU <desc-valu>` instructions. This
|
||||
is also presented as a percent of the peak theoretical F16 MFMA
|
||||
operations achievable on the specific accelerator.
|
||||
|
||||
- GFLOPs
|
||||
|
||||
* - MFMA FLOPs (F32)
|
||||
|
||||
- The total number of 32-bit floating point :ref:`MFMA <desc-mfma>`
|
||||
operations executed per second. Note: this does not include any 32-bit
|
||||
floating point operations from :ref:`VALU <desc-valu>` instructions. This
|
||||
is also presented as a percent of the peak theoretical F32 MFMA
|
||||
operations achievable on the specific accelerator.
|
||||
|
||||
- GFLOPs
|
||||
|
||||
* - MFMA FLOPs (F64)
|
||||
|
||||
- The total number of 64-bit floating point :ref:`MFMA <desc-mfma>`
|
||||
operations executed per second. Note: this does not include any 64-bit
|
||||
floating point operations from :ref:`VALU <desc-valu>` instructions. This
|
||||
is also presented as a percent of the peak theoretical F64 MFMA
|
||||
operations achievable on the specific accelerator.
|
||||
|
||||
- GFLOPs
|
||||
|
||||
* - MFMA IOPs (INT8)
|
||||
|
||||
- The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations
|
||||
executed per second. Note: this does not include any 8-bit integer
|
||||
operations from :ref:`VALU <desc-valu>` instructions. This is also
|
||||
presented as a percent of the peak theoretical INT8 MFMA operations
|
||||
achievable on the specific accelerator.
|
||||
|
||||
- GIOPs
|
||||
.. jinja:: compute-speed-of-light
|
||||
:file: _templates/metrics_table.j2
|
||||
|
||||
.. _pipeline-stats:
|
||||
|
||||
@@ -702,120 +212,8 @@ various execution units on the :doc:`CU <compute-unit>`. Refer to
|
||||
the :ref:`scheduler <desc-scheduler>` for a high-level overview of execution
|
||||
units and instruction issue.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:widths: 20 65 15
|
||||
|
||||
* - Metric
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - IPC
|
||||
|
||||
- The ratio of the total number of instructions executed on the
|
||||
:doc:`CU <compute-unit>` over the
|
||||
:ref:`total active CU cycles <total-active-cu-cycles>`.
|
||||
|
||||
- Instructions per cycle
|
||||
|
||||
* - IPC (Issued)
|
||||
|
||||
- The ratio of the total number of
|
||||
(non-:ref:`internal <ipc-internal-instructions>`) instructions issued over
|
||||
the number of cycles where the :ref:`scheduler <desc-scheduler>` was
|
||||
actively working on issuing instructions. Refer to the
|
||||
:ref:`Issued IPC <issued-ipc>` example for further detail.
|
||||
|
||||
- Instructions per cycle
|
||||
|
||||
* - SALU utilization
|
||||
|
||||
- Indicates what percent of the kernel's duration the
|
||||
:ref:`SALU <desc-salu>` was busy executing instructions. Computed as the
|
||||
ratio of the total number of cycles spent by the
|
||||
:ref:`scheduler <desc-scheduler>` issuing SALU / :ref:`SMEM <desc-smem>`
|
||||
instructions over the :ref:`total CU cycles <total-cu-cycles>`.
|
||||
|
||||
- Percent
|
||||
|
||||
* - VALU utilization
|
||||
|
||||
- Indicates what percent of the kernel's duration the
|
||||
:ref:`VALU <desc-valu>` was busy executing instructions. Does not include
|
||||
:ref:`VMEM <desc-vmem>` operations. Computed as the ratio of the total
|
||||
number of cycles spent by the :ref:`scheduler <desc-scheduler>` issuing
|
||||
VALU instructions over the :ref:`total CU cycles <total-cu-cycles>`.
|
||||
|
||||
- Percent
|
||||
|
||||
* - VMEM utilization
|
||||
|
||||
- Indicates what percent of the kernel's duration the
|
||||
:ref:`VMEM <desc-vmem>` unit was busy executing instructions, including
|
||||
both global/generic and spill/scratch operations (see the
|
||||
:ref:`VMEM instruction count metrics <ta-instruction-counts>` for more
|
||||
detail). Does not include :ref:`VALU <desc-valu>` operations. Computed
|
||||
as the ratio of the total number of cycles spent by the
|
||||
:ref:`scheduler <desc-scheduler>` issuing VMEM instructions over the
|
||||
:ref:`total CU cycles <total-cu-cycles>`.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Branch utilization
|
||||
|
||||
- Indicates what percent of the kernel's duration the
|
||||
:ref:`branch <desc-branch>` unit was busy executing instructions.
|
||||
Computed as the ratio of the total number of cycles spent by the
|
||||
:ref:`scheduler <desc-scheduler>` issuing branch instructions over the
|
||||
:ref:`total CU cycles <total-cu-cycles>`.
|
||||
|
||||
- Percent
|
||||
|
||||
* - VALU active threads
|
||||
|
||||
- Indicates the average level of :ref:`divergence <desc-divergence>` within
|
||||
a wavefront over the lifetime of the kernel. The number of work-items
|
||||
that were active in a wavefront during execution of each
|
||||
:ref:`VALU <desc-valu>` instruction, time-averaged over all VALU
|
||||
instructions run on all wavefronts in the kernel.
|
||||
|
||||
- Work-items
|
||||
|
||||
* - MFMA utilization
|
||||
|
||||
- Indicates what percent of the kernel's duration the
|
||||
:ref:`MFMA <desc-mfma>` unit was busy executing instructions. Computed as
|
||||
the ratio of the total number of cycles spent by the
|
||||
:ref:`MFMA <desc-mfma>` unit executing instructions over the
|
||||
:ref:`total CU cycles <total-cu-cycles>`.
|
||||
|
||||
- Percent
|
||||
|
||||
* - MFMA instruction cycles
|
||||
|
||||
- The average duration of :ref:`MFMA <desc-mfma>` instructions in this
|
||||
kernel in cycles. Computed as the ratio of the total number of cycles the
|
||||
MFMA unit was busy over the total number of MFMA instructions. Compare
|
||||
to, for example, the
|
||||
`AMD Matrix Instruction Calculator <https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator>`_.
|
||||
|
||||
- Cycles per instruction
|
||||
|
||||
* - VMEM latency
|
||||
|
||||
- The average number of round-trip cycles (that is, from issue to data
|
||||
return / acknowledgment) required for a VMEM instruction to complete.
|
||||
|
||||
- Cycles
|
||||
|
||||
* - SMEM latency
|
||||
|
||||
- The average number of round-trip cycles (that is, from issue to data
|
||||
return / acknowledgment) required for a SMEM instruction to complete.
|
||||
|
||||
- Cycles
|
||||
.. jinja:: pipeline-stats
|
||||
:file: _templates/metrics_table.j2
|
||||
|
||||
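Most of the pipeline statistics above are simple ratios of two cycle or instruction counts. The sketch below restates three of them (IPC, the utilization metrics, and MFMA instruction cycles) as plain functions. The parameter names are descriptive placeholders rather than hardware counter names, and the sketch ignores any per-CU or per-dispatch averaging the tool may apply.

```python
def ipc(instructions_executed, active_cu_cycles):
    """Instructions per cycle: executed instructions over total active CU cycles."""
    return instructions_executed / active_cu_cycles if active_cu_cycles else 0.0


def utilization_percent(issue_cycles, total_cu_cycles):
    """Percent of total CU cycles the scheduler spent issuing to one unit."""
    return 100 * issue_cycles / total_cu_cycles if total_cu_cycles else 0.0


def mfma_instruction_cycles(mfma_busy_cycles, mfma_instructions):
    """Average MFMA duration: busy cycles over the number of MFMA instructions."""
    return mfma_busy_cycles / mfma_instructions if mfma_instructions else 0.0


# Illustrative values only.
print(ipc(2000, 4000))
print(utilization_percent(250, 1000))
print(mfma_instruction_cycles(6400, 100))
```

Each function returns zero when the denominator is zero, so a kernel that never used a given unit does not raise an error.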
.. note::
|
||||
|
||||
@@ -846,70 +244,5 @@ not. For more detail on how operations are counted see the
|
||||
take into account the execution mask of the operation, and will report the
|
||||
same value even if EXEC is identically zero.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:widths: 18 65 17
|
||||
|
||||
* - Metric
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - FLOPs (Total)
|
||||
|
||||
- The total number of floating-point operations executed on either the
|
||||
:ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- FLOP per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - IOPs (Total)
|
||||
|
||||
- The total number of integer operations executed on either the
|
||||
:ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- IOP per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - F16 OPs
|
||||
|
||||
- The total number of 16-bit floating-point operations executed on either the
|
||||
:ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- FLOP per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - BF16 OPs
|
||||
|
||||
- The total number of 16-bit brain floating-point operations executed on either the
|
||||
:ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per
|
||||
:ref:`normalization unit <normalization-units>`. Note: on current CDNA
|
||||
accelerators, the VALU has no native BF16 instructions.
|
||||
|
||||
- FLOP per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - F32 OPs
|
||||
|
||||
- The total number of 32-bit floating-point operations executed on either
|
||||
the :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- FLOP per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - F64 OPs
|
||||
|
||||
- The total number of 64-bit floating-point operations executed on either
|
||||
the :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- FLOP per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - INT8 OPs
|
||||
|
||||
- The total number of 8-bit integer operations executed on either the
|
||||
:ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per
|
||||
:ref:`normalization unit <normalization-units>`. Note: on current CDNA
|
||||
accelerators, the VALU has no native INT8 instructions.
|
||||
|
||||
- IOP per :ref:`normalization unit <normalization-units>`
|
||||
.. jinja:: arithmetic-operations
|
||||
:file: _templates/metrics_table.j2
|
||||
@@ -71,40 +71,8 @@ Scalar L1D Speed-of-Light
|
||||
The Scalar L1D speed-of-light chart shows some key metrics of the sL1D
|
||||
cache as a comparison with the peak achievable values of those metrics:
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:widths: 20 65 15
|
||||
|
||||
* - Metric
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - Bandwidth
|
||||
|
||||
- The number of bytes looked up in the sL1D cache, as a percent of the peak
|
||||
theoretical bandwidth. Calculated as the ratio of sL1D requests over the
|
||||
:ref:`total sL1D cycles <total-sl1d-cycles>`.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Cache Hit Rate
|
||||
|
||||
- The percent of sL1D requests that hit [#sl1d-cache]_ on a previously
|
||||
loaded line in the cache. Calculated as the ratio of the number of sL1D
|
||||
requests that hit over the number of all sL1D requests.
|
||||
|
||||
- Percent
|
||||
|
||||
* - sL1D-L2 BW
|
||||
|
||||
- The number of bytes requested by the sL1D from the L2 cache, as a percent
|
||||
of the peak theoretical sL1D → L2 cache bandwidth. Calculated as the
|
||||
ratio of the total number of requests from the sL1D to the L2 cache over
|
||||
the :ref:`total sL1D-L2 interface cycles <total-sl1d-cycles>`.
|
||||
|
||||
- Percent
|
||||
.. jinja:: desc-sl1d-sol
|
||||
:file: _templates/metrics_table.j2
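
The hit-rate definition in the removed table (requests that hit over all requests, expressed as a percent) can be sketched as follows. This is an illustrative helper, not the profiler's implementation. The function name and the counter values are hypothetical, and returning 0.0 when there are no requests is an assumption made for the example.

```python
def cache_hit_rate(hits, total_requests):
    """Return the cache hit rate as a percent of all requests.

    Both arguments are raw request counts. When no requests were
    made, 0.0 is returned (an assumption for this sketch).
    """
    if total_requests == 0:
        return 0.0
    return 100.0 * hits / total_requests


if __name__ == "__main__":
    # Hypothetical counter values, for illustration only.
    print(cache_hit_rate(hits=75, total_requests=100))
```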
|
||||
|
||||
.. _desc-sl1d-stats:
|
||||
|
||||
@@ -114,104 +82,8 @@ Scalar L1D cache accesses
|
||||
This panel gives more detail on the types of accesses made to the sL1D,
|
||||
and the hit/miss statistics.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
|
||||
* - Metric
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - Requests
|
||||
|
||||
- The total number of requests, of any size or type, made to the sL1D per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Requests per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Hits
|
||||
|
||||
- The total number of sL1D requests that hit on a previously loaded cache
|
||||
line, per :ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Requests per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Misses - Non Duplicated
|
||||
|
||||
- The total number of sL1D requests that missed on a cache line that *was
|
||||
not* already pending due to another request, per
|
||||
:ref:`normalization unit <normalization-units>`. See :ref:`desc-sl1d-sol`
|
||||
for more detail.
|
||||
|
||||
- Requests per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Misses - Duplicated
|
||||
|
||||
- The total number of sL1D requests that missed on a cache line that *was*
|
||||
already pending due to another request, per
|
||||
:ref:`normalization unit <normalization-units>`. See
|
||||
:ref:`desc-sl1d-sol` for more detail.
|
||||
|
||||
- Requests per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Cache Hit Rate
|
||||
|
||||
- Indicates the percent of sL1D requests that hit on a previously loaded
|
||||
line in the cache. The ratio of the number of sL1D requests that hit
|
||||
[#sl1d-cache]_ over the number of all sL1D requests.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Read Requests (Total)
|
||||
|
||||
- The total number of sL1D read requests of any size, per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Requests per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Atomic Requests
|
||||
|
||||
- The total number of sL1D atomic requests of any size, per
|
||||
:ref:`normalization unit <normalization-units>`. Typically unused on CDNA
|
||||
accelerators.
|
||||
|
||||
- Requests per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Read Requests (1 DWord)
|
||||
|
||||
- The total number of sL1D read requests made for a single dword of data
|
||||
(4B), per :ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Requests per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Read Requests (2 DWord)
|
||||
|
||||
- The total number of sL1D read requests made for two dwords of data
|
||||
(8B), per :ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Requests per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Read Requests (4 DWord)
|
||||
|
||||
- The total number of sL1D read requests made for four dwords of data
|
||||
(16B), per :ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Requests per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Read Requests (8 DWord)
|
||||
|
||||
- The total number of sL1D read requests made for eight dwords of data
|
||||
(32B), per :ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Requests per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Read Requests (16 DWord)
|
||||
|
||||
- The total number of sL1D read requests made for sixteen dwords of data
|
||||
(64B), per :ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Requests per :ref:`normalization unit <normalization-units>`
|
||||
.. jinja:: desc-sl1d-stats
|
||||
:file: _templates/metrics_table.j2
|
||||
|
||||
.. _desc-sl1d-l2-interface:
|
||||
|
||||
@@ -222,56 +94,8 @@ This panel gives more detail on the data requested across the
|
||||
sL1D↔
|
||||
:doc:`L2 <l2-cache>` interface.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
|
||||
* - Metric
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - sL1D-L2 BW
|
||||
|
||||
- The total number of bytes read from, written to, or atomically updated
|
||||
across the sL1D↔:doc:`L2 <l2-cache>` interface, per
|
||||
:ref:`normalization unit <normalization-units>`. Note that sL1D writes
|
||||
and atomics are typically unused on current CDNA accelerators, so in the
|
||||
majority of cases this can be interpreted as an sL1D→L2 read bandwidth.
|
||||
|
||||
- Bytes per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Read Requests
|
||||
|
||||
- The total number of read requests from sL1D to the :doc:`L2 <l2-cache>`,
|
||||
per :ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Requests per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Write Requests
|
||||
|
||||
- The total number of write requests from sL1D to the :doc:`L2 <l2-cache>`,
|
||||
per :ref:`normalization unit <normalization-units>`. Typically unused on
|
||||
current CDNA accelerators.
|
||||
|
||||
- Requests per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Atomic Requests
|
||||
|
||||
- The total number of atomic requests from sL1D to the
|
||||
:doc:`L2 <l2-cache>`, per
|
||||
:ref:`normalization unit <normalization-units>`. Typically unused on
|
||||
current CDNA accelerators.
|
||||
|
||||
- Requests per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Stall Cycles
|
||||
|
||||
- The total number of cycles the sL1D↔
|
||||
:doc:`L2 <l2-cache>` interface was stalled, per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Cycles per :ref:`normalization unit <normalization-units>`
|
||||
.. jinja:: desc-sl1d-l2-interface
|
||||
:file: _templates/metrics_table.j2
|
||||
|
||||
.. rubric:: Footnotes
|
||||
|
||||
@@ -318,46 +142,8 @@ The L1 Instruction Cache speed-of-light chart shows some key metrics of
|
||||
the L1I cache as a comparison with the peak achievable values of those
|
||||
metrics:
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
|
||||
* - Metric
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - Bandwidth
|
||||
|
||||
- The number of bytes looked up in the L1I cache, as a percent of the peak
|
||||
theoretical bandwidth. Calculated as the ratio of L1I requests over the
|
||||
:ref:`total L1I cycles <total-l1i-cycles>`.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Cache Hit Rate
|
||||
|
||||
- The percent of L1I requests that hit on a previously loaded line in the
|
||||
cache. Calculated as the ratio of the number of L1I requests that hit
|
||||
[#l1i-cache]_ over the number of all L1I requests.
|
||||
|
||||
- Percent
|
||||
|
||||
* - L1I-L2 BW
|
||||
|
||||
- The percent of the peak theoretical L1I → L2 cache request bandwidth
|
||||
achieved. Calculated as the ratio of the total number of requests from
|
||||
the L1I to the L2 cache over the
|
||||
:ref:`total L1I-L2 interface cycles <total-l1i-cycles>`.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Instruction Fetch Latency
|
||||
|
||||
- The average number of cycles spent to fetch instructions to a
|
||||
:doc:`CU <compute-unit>`.
|
||||
|
||||
- Cycles
|
||||
.. jinja:: desc-l1i-sol
|
||||
:file: _templates/metrics_table.j2
|
||||
|
||||
.. _desc-l1i-stats:
|
||||
|
||||
@@ -366,54 +152,10 @@ L1I cache accesses
|
||||
|
||||
This panel gives more detail on the hit/miss statistics of the L1I:
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
.. jinja:: desc-l1i-stats
|
||||
:file: _templates/metrics_table.j2
|
||||
|
||||
* - Metric
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - Requests
|
||||
|
||||
- The total number of requests made to the L1I per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Requests per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Hits
|
||||
|
||||
- The total number of L1I requests that hit on a previously loaded cache
|
||||
line, per :ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Requests per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Misses - Non Duplicated
|
||||
|
||||
- The total number of L1I requests that missed on a cache line that
|
||||
*was not* already pending due to another request, per
|
||||
:ref:`normalization unit <normalization-units>`. See note in
|
||||
:ref:`desc-l1i-sol` for more detail.
|
||||
|
||||
- Requests per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Misses - Duplicated
|
||||
|
||||
- The total number of L1I requests that missed on a cache line that *was*
|
||||
already pending due to another request, per
|
||||
:ref:`normalization unit <normalization-units>`. See note in
|
||||
:ref:`desc-l1i-sol` for more detail.
|
||||
|
||||
- Requests per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Cache Hit Rate
|
||||
|
||||
- The percent of L1I requests that hit [#l1i-cache]_ on a previously loaded
|
||||
line in the cache. Calculated as the ratio of the number of L1I requests
|
||||
that hit over the number of all L1I requests.
|
||||
|
||||
- Percent
|
||||
.. _desc-l1i-l2-interface:
|
||||
|
||||
L1I - L2 interface
|
||||
------------------
|
||||
@@ -421,21 +163,8 @@ L1I - L2 interface
|
||||
This panel gives more detail on the data requested across the
|
||||
L1I-:doc:`L2 <l2-cache>` interface.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
|
||||
* - Metric
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - L1I-L2 BW
|
||||
|
||||
- The total number of bytes read across the L1I-:doc:`L2 <l2-cache>`
|
||||
interface, per :ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Bytes per :ref:`normalization unit <normalization-units>`
|
||||
.. jinja:: desc-l1i-l2-interface
|
||||
:file: _templates/metrics_table.j2
|
||||
|
||||
.. rubric:: Footnotes
|
||||
|
||||
@@ -493,90 +222,18 @@ issuing concurrently).
|
||||
kernels). This means that these scheduler-pipe utilization metrics are
|
||||
expected to reach (for example) a maximum of one pipe active -- only 25%.
|
||||
|
||||
.. _spi-util:
|
||||
|
||||
Workgroup manager utilizations
|
||||
------------------------------
|
||||
|
||||
This section describes the utilization of the workgroup manager, and the
|
||||
hardware components it interacts with.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:widths: 20 65 15
|
||||
.. jinja:: spi-util
|
||||
:file: _templates/metrics_table.j2
|
||||
|
||||
* - Metric
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - Accelerator utilization
|
||||
|
||||
- The percent of cycles in the kernel where the accelerator was actively
|
||||
doing any work.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Scheduler-pipe utilization
|
||||
|
||||
- The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in
|
||||
the kernel where the scheduler-pipes were actively doing any work. Note:
|
||||
this value is expected to range between 0% and 25%. See :ref:`desc-spi`.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Workgroup manager utilization
|
||||
|
||||
- The percent of cycles in the kernel where the workgroup manager was
|
||||
actively doing any work.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Shader engine utilization
|
||||
|
||||
- The percent of :ref:`total shader engine cycles <total-se-cycles>` in the
|
||||
kernel where any CU in a shader-engine was actively doing any work,
|
||||
normalized over all shader-engines. Low values (much less than 100%) indicate
|
||||
that the accelerator was not fully saturated by the kernel, or a
|
||||
potential load-imbalance issue.
|
||||
|
||||
- Percent
|
||||
|
||||
* - SIMD utilization
|
||||
|
||||
- The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel
|
||||
where any :ref:`SIMD <desc-valu>` on a CU was actively doing any work,
|
||||
summed over all CUs. Low values (less than 100%) indicate that the
|
||||
accelerator was not fully saturated by the kernel, or a potential
|
||||
load-imbalance issue.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Dispatched workgroups
|
||||
|
||||
- The total number of workgroups forming this kernel launch.
|
||||
|
||||
- Workgroups
|
||||
|
||||
* - Dispatched wavefronts
|
||||
|
||||
- The total number of wavefronts, summed over all workgroups, forming this
|
||||
kernel launch.
|
||||
|
||||
- Wavefronts
|
||||
|
||||
* - VGPR writes
|
||||
|
||||
- The average number of cycles spent initializing :ref:`VGPRs <desc-valu>`
|
||||
at wave creation.
|
||||
|
||||
- Cycles/wave
|
||||
|
||||
* - SGPR Writes
|
||||
|
||||
- The average number of cycles spent initializing :ref:`SGPRs <desc-salu>`
|
||||
at wave creation.
|
||||
|
||||
- Cycles/wave
|
||||
.. _spi-resc-util:
|
||||
|
||||
Resource allocation
|
||||
-------------------
|
||||
@@ -590,117 +247,5 @@ limited by LDS usage, for example, but may still achieve high occupancy levels
|
||||
such that improving occupancy further may not improve performance. See
|
||||
:ref:`occupancy-example` for details.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
|
||||
* - Metric
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - Not-scheduled rate (Workgroup Manager)
|
||||
|
||||
- The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in
|
||||
the kernel where a workgroup could not be scheduled to a
|
||||
:doc:`CU <compute-unit>` due to a bottleneck within the workgroup manager
|
||||
rather than a lack of a CU or :ref:`SIMD <desc-valu>` with sufficient
|
||||
resources. Note: this value is expected to range between 0% and 25%. See the note
|
||||
in :ref:`workgroup manager <desc-spi>` description.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Not-scheduled rate (Scheduler-Pipe)
|
||||
|
||||
- The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in
|
||||
the kernel where a workgroup could not be scheduled to a
|
||||
:doc:`CU <compute-unit>` due to a bottleneck within the scheduler-pipes
|
||||
rather than a lack of a CU or :ref:`SIMD <desc-valu>` with sufficient
|
||||
resources. Note: this value is expected to range between 0% and 25%. See the note
|
||||
in :ref:`workgroup manager <desc-spi>` description.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Scheduler-Pipe Stall Rate
|
||||
|
||||
- The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in
|
||||
the kernel where a workgroup could not be scheduled to a
|
||||
:doc:`CU <compute-unit>` due to occupancy limitations (like a lack of a
|
||||
CU or :ref:`SIMD <desc-valu>` with sufficient resources). Note: this
|
||||
value is expected to range between 0% and 25%. See the note in
|
||||
:ref:`workgroup manager <desc-spi>` description.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Scratch Stall Rate
|
||||
|
||||
- The percent of :ref:`total shader-engine cycles <total-se-cycles>` in the
|
||||
kernel where a workgroup could not be scheduled to a
|
||||
:doc:`CU <compute-unit>` due to lack of
|
||||
:ref:`private (a.k.a., scratch) memory <memory-type>` slots. While this
|
||||
can reach up to 100%, note that the actual occupancy limitations on a
|
||||
kernel using private memory are typically quite small (for example, less
|
||||
than 1% of the total number of waves that can be scheduled to an
|
||||
accelerator).
|
||||
|
||||
- Percent
|
||||
|
||||
* - Insufficient SIMD Waveslots
|
||||
|
||||
- The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel
|
||||
where a workgroup could not be scheduled to a :ref:`SIMD <desc-valu>`
|
||||
due to lack of available :ref:`waveslots <desc-valu>`.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Insufficient SIMD VGPRs
|
||||
|
||||
- The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel
|
||||
where a workgroup could not be scheduled to a :ref:`SIMD <desc-valu>`
|
||||
due to lack of available :ref:`VGPRs <desc-valu>`.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Insufficient SIMD SGPRs
|
||||
|
||||
- The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel
|
||||
where a workgroup could not be scheduled to a :ref:`SIMD <desc-valu>`
|
||||
due to lack of available :ref:`SGPRs <desc-salu>`.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Insufficient CU LDS
|
||||
|
||||
- The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel
|
||||
where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
|
||||
due to lack of available :doc:`LDS <local-data-share>`.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Insufficient CU Barriers
|
||||
|
||||
- The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel
|
||||
where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
|
||||
due to lack of available :ref:`barriers <desc-barrier>`.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Reached CU Workgroup Limit
|
||||
|
||||
- The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel
|
||||
where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
|
||||
due to limits within the workgroup manager. This is expected to
|
||||
always be zero on CDNA2 or newer accelerators (and small for previous
|
||||
accelerators).
|
||||
|
||||
- Percent
|
||||
|
||||
* - Reached CU Wavefront Limit
|
||||
|
||||
- The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel
|
||||
where a wavefront could not be scheduled to a :doc:`CU <compute-unit>`
|
||||
due to limits within the workgroup manager. This is expected to
|
||||
always be zero on CDNA2 or newer accelerators (and small for previous
|
||||
accelerators).
|
||||
|
||||
- Percent
|
||||
.. jinja:: spi-resc-util
|
||||
:file: _templates/metrics_table.j2
|
||||
|
||||
@@ -2,6 +2,8 @@
|
||||
:description: ROCm Compute Profiler performance model: System Speed-of-Light
|
||||
:keywords: Omniperf, ROCm Compute Profiler, ROCm, profiler, tool, Instinct, accelerator, AMD, system, speed of light
|
||||
|
||||
.. _sys-sol:
|
||||
|
||||
*********************
|
||||
System Speed-of-Light
|
||||
*********************
|
||||
@@ -20,308 +22,5 @@ of ROCm Compute Profiler’s profiling report.
|
||||
Instinct™ MI-series accelerators. For more detail on how operations are
|
||||
counted, see the :ref:`metrics-flop-count` section.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
|
||||
* - Metric
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - :ref:`VALU <desc-valu>` FLOPs
|
||||
|
||||
- The total floating-point operations executed per second on the
|
||||
:ref:`VALU <desc-valu>`. This is also presented as a percent of the peak
|
||||
theoretical FLOPs achievable on the specific accelerator. Note: this does
|
||||
not include any floating-point operations from :ref:`MFMA <desc-mfma>`
|
||||
instructions.
|
||||
|
||||
- GFLOPs
|
||||
|
||||
* - :ref:`VALU <desc-valu>` IOPs
|
||||
|
||||
- The total integer operations executed per second on the
|
||||
:ref:`VALU <desc-valu>`. This is also presented as a percent of the peak
|
||||
theoretical IOPs achievable on the specific accelerator. Note: this does
|
||||
not include any integer operations from :ref:`MFMA <desc-mfma>`
|
||||
instructions.
|
||||
|
||||
- GIOPs
|
||||
|
||||
* - :ref:`MFMA <desc-mfma>` FLOPs (F8)
|
||||
|
||||
- The total number of 8-bit floating point :ref:`MFMA <desc-mfma>`
|
||||
operations executed per second. Note: this does not include any 8-bit
|
||||
floating point operations from :ref:`VALU <desc-valu>`
|
||||
instructions. This is also presented as a percent of the peak theoretical
|
||||
F8 MFMA operations achievable on the specific accelerator. This metric is supported only on AMD Instinct MI300 series and later accelerators.
|
||||
|
||||
- GFLOPs
|
||||
|
||||
* - :ref:`MFMA <desc-mfma>` FLOPs (BF16)
|
||||
|
||||
- The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
|
||||
operations executed per second. Note: this does not include any 16-bit
|
||||
brain floating point operations from :ref:`VALU <desc-valu>`
|
||||
instructions. This is also presented as a percent of the peak theoretical
|
||||
BF16 MFMA operations achievable on the specific accelerator.
|
||||
|
||||
- GFLOPs
|
||||
|
||||
* - :ref:`MFMA <desc-mfma>` FLOPs (F16)
|
||||
|
||||
- The total number of 16-bit floating point :ref:`MFMA <desc-mfma>`
|
||||
operations executed per second. Note: this does not include any 16-bit
|
||||
floating point operations from :ref:`VALU <desc-valu>` instructions. This
|
||||
is also presented as a percent of the peak theoretical F16 MFMA
|
||||
operations achievable on the specific accelerator.
|
||||
|
||||
- GFLOPs
|
||||
|
||||
* - :ref:`MFMA <desc-mfma>` FLOPs (F32)
|
||||
|
||||
- The total number of 32-bit floating point :ref:`MFMA <desc-mfma>`
|
||||
operations executed per second. Note: this does not include any 32-bit
|
||||
floating point operations from :ref:`VALU <desc-valu>` instructions. This
|
||||
is also presented as a percent of the peak theoretical F32 MFMA
|
||||
operations achievable on the specific accelerator.
|
||||
|
||||
- GFLOPs
|
||||
|
||||
* - :ref:`MFMA <desc-mfma>` FLOPs (F64)
|
||||
|
||||
- The total number of 64-bit floating point :ref:`MFMA <desc-mfma>`
|
||||
operations executed per second. Note: this does not include any 64-bit
|
||||
floating point operations from :ref:`VALU <desc-valu>` instructions. This
|
||||
is also presented as a percent of the peak theoretical F64 MFMA
|
||||
operations achievable on the specific accelerator.
|
||||
|
||||
- GFLOPs
|
||||
|
||||
* - :ref:`MFMA <desc-mfma>` IOPs (INT8)
|
||||
|
||||
- The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations
|
||||
executed per second. Note: this does not include any 8-bit integer
|
||||
operations from :ref:`VALU <desc-valu>` instructions. This is also
|
||||
presented as a percent of the peak theoretical INT8 MFMA operations
|
||||
achievable on the specific accelerator.
|
||||
|
||||
- GIOPs
|
||||
|
||||
* - :ref:`SALU <desc-salu>` utilization
|
||||
|
||||
- Indicates what percent of the kernel's duration the
|
||||
:ref:`SALU <desc-salu>` was busy executing instructions. Computed as the
|
||||
ratio of the total number of cycles spent by the
|
||||
:ref:`scheduler <desc-scheduler>` issuing :ref:`SALU <desc-salu>` or
|
||||
:ref:`SMEM <desc-salu>` instructions over the
|
||||
:ref:`total CU cycles <total-cu-cycles>`.
|
||||
|
||||
- Percent
|
||||
|
||||
* - :ref:`VALU <desc-valu>` utilization
|
||||
|
||||
- Indicates what percent of the kernel's duration the
|
||||
:ref:`VALU <desc-valu>` was busy executing instructions. Does not include
|
||||
:ref:`VMEM <desc-vmem>` operations. Computed as the ratio of the total
|
||||
number of cycles spent by the :ref:`scheduler <desc-scheduler>` issuing
|
||||
:ref:`VALU <desc-valu>` instructions over the
|
||||
:ref:`total CU cycles <total-cu-cycles>`.
|
||||
|
||||
- Percent
|
||||
|
||||
* - :ref:`MFMA <desc-mfma>` utilization
|
||||
|
||||
- Indicates what percent of the kernel's duration the
|
||||
:ref:`MFMA <desc-mfma>` unit was busy executing instructions. Computed as
|
||||
the ratio of the total number of cycles the MFMA was busy over the
|
||||
:ref:`total CU cycles <total-cu-cycles>`.
|
||||
|
||||
- Percent
|
||||
|
||||
* - :ref:`VMEM <desc-valu>` utilization
|
||||
|
||||
- Indicates what percent of the kernel's duration the
|
||||
:ref:`VMEM <desc-valu>` unit was busy executing instructions, including
|
||||
both global/generic and spill/scratch operations (see the
|
||||
:ref:`VMEM instruction count metrics <ta-instruction-counts>` for more
|
||||
detail). Does not include :ref:`VALU <desc-valu>` operations. Computed as
|
||||
the ratio of the total number of cycles spent by the
|
||||
:ref:`scheduler <desc-scheduler>` issuing VMEM instructions over the
|
||||
:ref:`total CU cycles <total-cu-cycles>`.
|
||||
|
||||
- Percent
|
||||
|
||||
* - :ref:`Branch <desc-branch>` utilization
|
||||
|
||||
- Indicates what percent of the kernel's duration the
|
||||
:ref:`branch <desc-branch>` unit was busy executing instructions.
|
||||
Computed as the ratio of the total number of cycles spent by the
|
||||
:ref:`scheduler <desc-scheduler>` issuing :ref:`branch <desc-branch>`
|
||||
instructions over the :ref:`total CU cycles <total-cu-cycles>`.
|
||||
|
||||
- Percent
|
||||
|
||||
* - :ref:`VALU <desc-valu>` active threads
|
||||
|
||||
- Indicates the average level of :ref:`divergence <desc-divergence>` within
|
||||
a wavefront over the lifetime of the kernel. The number of work-items
|
||||
that were active in a wavefront during execution of each
|
||||
:ref:`VALU <desc-valu>` instruction, time-averaged over all VALU
|
||||
instructions run on all wavefronts in the kernel.
|
||||
|
||||
- Work-items
|
||||
|
||||
* - IPC
|
||||
|
||||
- The ratio of the total number of instructions executed on the
|
||||
:doc:`CU <compute-unit>` over the
|
||||
:ref:`total active CU cycles <total-active-cu-cycles>`. This is also
|
||||
presented as a percent of the peak theoretical IPC achievable on
|
||||
the specific accelerator.
|
||||
|
||||
- Instructions per cycle
|
||||
|
||||
* - Wavefront occupancy
|
||||
|
||||
- The time-averaged number of wavefronts resident on the accelerator over
|
||||
the lifetime of the kernel. Note: this metric may be inaccurate for
|
||||
short-running kernels (less than 1ms). This is also presented as a
|
||||
percent of the peak theoretical occupancy achievable on the specific
|
||||
accelerator.
|
||||
|
||||
- Wavefronts
|
||||
|
||||
* - :doc:`LDS <local-data-share>` theoretical bandwidth
|
||||
|
||||
- Indicates the maximum number of bytes that could have been loaded from,
|
||||
stored to, or atomically updated in the LDS per unit time (see
|
||||
:ref:`LDS Bandwidth <lds-bandwidth>` example for more detail). This is
|
||||
also presented as a percent of the peak theoretical bandwidth
|
||||
achievable on the specific accelerator.
|
||||
|
||||
- GB/s
|
||||
|
||||
* - :doc:`LDS <local-data-share>` bank conflicts/access
|
||||
|
||||
- The ratio of the number of cycles spent in the
|
||||
:doc:`LDS scheduler <local-data-share>` due to bank conflicts (as
|
||||
determined by the conflict resolution hardware) to the base number of
|
||||
cycles that would be spent in the LDS scheduler in a completely
|
||||
uncontended case. This is also presented in normalized form (i.e., the
|
||||
Bank Conflict Rate).
|
||||
|
||||
- Conflicts/Access
|
||||
|
||||
* - :doc:`vL1D <vector-l1-cache>` cache hit rate
|
||||
|
||||
- The ratio of the number of vL1D cache line requests that hit in vL1D
|
||||
cache over the total number of cache line requests to the
|
||||
:ref:`vL1D cache RAM <desc-tc>`.
|
||||
|
||||
- Percent
|
||||
|
||||
* - :doc:`vL1D <vector-l1-cache>` cache bandwidth
|
||||
|
||||
- The number of bytes looked up in the vL1D cache as a result of
|
||||
:ref:`VMEM <desc-vmem>` instructions per unit time. The number of bytes
|
||||
is calculated as the number of cache lines requested multiplied by the
|
||||
cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement
|
||||
will still be counted as a full cache line. This is also presented as a
|
||||
percent of the peak theoretical bandwidth achievable on the specific
|
||||
accelerator.
|
||||
|
||||
- GB/s
|
||||
|
||||
* - :doc:`L2 <l2-cache>` cache hit rate
|
||||
|
||||
- The ratio of the number of L2 cache line requests that hit in the L2
|
||||
cache over the total number of incoming cache line requests to the L2
|
||||
cache.
|
||||
|
||||
- Percent
|
||||
|
||||
* - :doc:`L2 <l2-cache>` cache bandwidth
|
||||
|
||||
- The number of bytes looked up in the L2 cache per unit time. The number
|
||||
of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so
|
||||
e.g., if only a single value is requested in a cache line, the data
|
||||
movement will still be counted as a full cache line. This is also
|
||||
presented as a percent of the peak theoretical bandwidth achievable on
|
||||
the specific accelerator.
|
||||
|
||||
- GB/s
|
||||
|
||||
* - :doc:`L2 <l2-cache>`-fabric read BW
|
||||
|
||||
- The number of bytes read by the L2 over the
|
||||
:ref:`Infinity Fabric™ interface <l2-fabric>` per unit time. This is also
|
||||
presented as a percent of the peak theoretical bandwidth achievable on
|
||||
the specific accelerator.
|
||||
|
||||
- GB/s
|
||||
|
||||
* - :doc:`L2 <l2-cache>`-fabric write and atomic BW
|
||||
|
||||
- The number of bytes sent by the L2 over the
|
||||
:ref:`Infinity Fabric interface <l2-fabric>` by write and atomic
|
||||
operations per unit time. This is also presented as a percent of the peak
|
||||
theoretical bandwidth achievable on the specific accelerator.
|
||||
|
||||
- GB/s
|
||||
|
||||
* - :doc:`L2 <l2-cache>`-fabric read latency
|
||||
|
||||
- The time-averaged number of cycles read requests spent in Infinity Fabric
|
||||
before data was returned to the L2.
|
||||
|
||||
- Cycles
|
||||
|
||||
* - :doc:`L2 <l2-cache>`-fabric write latency
|
||||
|
||||
- The time-averaged number of cycles write requests spent in Infinity
|
||||
Fabric before a completion acknowledgement was returned to the L2.
|
||||
|
||||
- Cycles
|
||||
|
||||
* - :ref:`sL1D <desc-sl1d>` cache hit rate
|
||||
|
||||
- The percent of sL1D requests that hit on a previously loaded line in the
|
||||
cache. Calculated as the ratio of the number of sL1D requests that hit
|
||||
over the number of all sL1D requests.
|
||||
|
||||
- Percent
|
||||
|
||||
* - :ref:`sL1D <desc-sl1d>` bandwidth
|
||||
|
||||
- The number of bytes looked up in the sL1D cache per unit time. This is
|
||||
also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
|
||||
- GB/s
|
||||
|
||||
* - :ref:`L1I <desc-l1i>` bandwidth
|
||||
|
||||
- The number of bytes looked up in the L1I cache per unit time. This is
|
||||
also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
|
||||
- GB/s
|
||||
|
||||
* - :ref:`L1I <desc-l1i>` cache hit rate
|
||||
|
||||
- The percent of L1I requests that hit on a previously loaded line in the
|
||||
cache. Calculated as the ratio of the number of L1I requests that hit
|
||||
over the number of all L1I requests.
|
||||
|
||||
- Percent
|
||||
|
||||
* - :ref:`L1I <desc-l1i>` fetch latency
|
||||
|
||||
- The average number of cycles spent to fetch instructions to a
|
||||
:doc:`CU <compute-unit>`.
|
||||
|
||||
- Cycles
|
||||
.. jinja:: sys-sol
|
||||
:file: _templates/metrics_table.j2
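The bandwidth rows in the removed table above define bytes as cache lines requested multiplied by the cache line size, then report the result per unit time and as a percent of peak. A minimal sketch of that arithmetic follows. The function names, the 64-byte line size, the 256 GB/s peak, and the request counts are assumptions chosen for illustration; they are not values taken from the profiler.

```python
def cache_bandwidth_gbps(lines_requested: int, line_size_bytes: int, elapsed_ns: float) -> float:
    """Bytes looked up per unit time, counting every request as a full cache line."""
    bytes_looked_up = lines_requested * line_size_bytes
    # One byte per nanosecond equals one GB/s (1e9 bytes per 1e9 ns).
    return bytes_looked_up / elapsed_ns


def percent_of_peak(value: float, peak: float) -> float:
    """Express a measured value as a percent of a theoretical peak."""
    return 100.0 * value / peak


# Hypothetical workload: 1,000,000 cache-line requests of 64 bytes in 1 ms.
bandwidth = cache_bandwidth_gbps(1_000_000, 64, 1_000_000.0)
print(bandwidth)                          # 64.0 (GB/s)
print(percent_of_peak(bandwidth, 256.0))  # 25.0 (percent of an assumed 256 GB/s peak)
```

Because partial requests are not considered, a request for a single value still contributes a full `line_size_bytes` to the total in this sketch.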
|
||||
|
||||
@@ -63,53 +63,8 @@ vL1D Speed-of-Light
|
||||
The vL1D’s speed-of-light chart shows several key metrics for the vL1D
|
||||
as a comparison with the peak achievable values of those metrics.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
|
||||
* - Metric
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - Hit Rate
|
||||
|
||||
- The ratio of the number of vL1D cache line requests that hit [#vl1d-hit]_
|
||||
in vL1D cache over the total number of cache line requests to the
|
||||
:ref:`vL1D Cache RAM <desc-tc>`.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Bandwidth
|
||||
|
||||
- The number of bytes looked up in the vL1D cache as a result of
|
||||
:ref:`VMEM <desc-vmem>` instructions, as a percent of the peak
|
||||
theoretical bandwidth achievable on the specific accelerator. The number
|
||||
of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so
|
||||
for instance, if only a single value is requested in a cache line, the
|
||||
data movement will still be counted as a full cache line.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Utilization
|
||||
|
||||
- Indicates how busy the :ref:`vL1D Cache RAM <desc-tc>` was during the
|
||||
kernel execution. The number of cycles where the vL1D Cache RAM is
|
||||
actively processing any request divided by the number of cycles where the
|
||||
vL1D is active [#vl1d-activity]_.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Coalescing
|
||||
|
||||
- Indicates how well memory instructions were coalesced by the
|
||||
:ref:`address processing unit <desc-ta>`, ranging from uncoalesced (25%)
|
||||
to fully coalesced (100%). Calculated as the average number of
|
||||
:ref:`thread-requests <thread-requests>` generated per instruction
|
||||
divided by the ideal number of thread-requests per instruction.
|
||||
|
||||
- Percent
|
||||
.. jinja:: vl1d-sol
|
||||
:file: _templates/metrics_table.j2
|
||||
|
||||
.. _desc-ta:
|
||||
|
||||
@@ -145,45 +100,8 @@ processing unit. When the front-end cannot accept any more addresses, it
|
||||
must backpressure the wave-issue logic for the VMEM pipe and prevent the
|
||||
issue of further vector memory instructions.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
|
||||
* - Metric
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - Busy
|
||||
|
||||
- Percent of the :ref:`total CU cycles <total-cu-cycles>` the address
|
||||
processor was busy
|
||||
|
||||
- Percent
|
||||
|
||||
* - Address Stall
|
||||
|
||||
- Percent of the :ref:`total CU cycles <total-cu-cycles>` the address
|
||||
processor was stalled from sending address requests further into the vL1D
|
||||
pipeline
|
||||
|
||||
- Percent
|
||||
|
||||
* - Data Stall
|
||||
|
||||
- Percent of the :ref:`total CU cycles <total-cu-cycles>` the address
|
||||
processor was stalled from sending write/atomic data further into the
|
||||
vL1D pipeline
|
||||
|
||||
- Percent
|
||||
|
||||
* - Data-Processor → Address Stall
|
||||
|
||||
- Percent of :ref:`total CU cycles <total-cu-cycles>` the address processor
|
||||
was stalled waiting to send command data to the
|
||||
:ref:`data processor <desc-td>`
|
||||
|
||||
- Percent
|
||||
.. jinja:: ta-busy-stall
|
||||
:file: _templates/metrics_table.j2
|
||||
|
||||
.. _ta-instruction-counts:
|
||||
|
||||
@@ -232,80 +150,8 @@ kernel. These are broken down into a few major categories:
|
||||
|
||||
The address processor counts these instruction types as follows:
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
|
||||
* - Type
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - Global/Generic
|
||||
|
||||
- The total number of global & generic memory instructions executed on all
|
||||
:doc:`compute units <compute-unit>` on the accelerator, per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Global/Generic Read
|
||||
|
||||
- The total number of global & generic memory read instructions executed on
|
||||
all :doc:`compute units <compute-unit>` on the accelerator, per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Global/Generic Write
|
||||
|
||||
- The total number of global & generic memory write instructions executed
|
||||
on all :doc:`compute units <compute-unit>` on the accelerator, per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Global/Generic Atomic
|
||||
|
||||
- The total number of global & generic memory atomic (with and without
|
||||
return) instructions executed on all :doc:`compute units <compute-unit>`
|
||||
on the accelerator, per :ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Spill/Stack
|
||||
|
||||
- The total number of spill/stack memory instructions executed on all
|
||||
:doc:`compute units <compute-unit>` on the accelerator, per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Spill/Stack Read
|
||||
|
||||
- The total number of spill/stack memory read instructions executed on all
|
||||
:doc:`compute units <compute-unit>` on the accelerator, per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Spill/Stack Write
|
||||
|
||||
- The total number of spill/stack memory write instructions executed on all
|
||||
:doc:`compute units <compute-unit>` on the accelerator, per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Spill/Stack Atomic
|
||||
|
||||
- The total number of spill/stack memory atomic (with and without return)
|
||||
instructions executed on all :doc:`compute units <compute-unit>` on the
|
||||
accelerator, per :ref:`normalization unit <normalization-units>`.
|
||||
Typically unused, as these memory operations are mainly used to
|
||||
implement thread-local storage.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
.. jinja:: ta-instruction-counts
|
||||
:file: _templates/metrics_table.j2
|
||||
|
||||
.. note::
|
||||
|
||||
@@ -333,38 +179,8 @@ Spill / stack metrics
|
||||
Finally, the address processing unit contains a separate coalescing
|
||||
stage for spill/stack memory, and thus reports:
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
|
||||
* - Metric
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - Spill/Stack Total Cycles
|
||||
|
||||
- The number of cycles the address processing unit spent working on
|
||||
spill/stack instructions, per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Cycles per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Spill/Stack Coalesced Read Cycles
|
||||
|
||||
- The number of cycles the address processing unit spent working on
|
||||
coalesced spill/stack read instructions, per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Cycles per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Spill/Stack Coalesced Write Cycles
|
||||
|
||||
- The number of cycles the address processing unit spent working on
|
||||
coalesced spill/stack write instructions, per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Cycles per :ref:`normalization unit <normalization-units>`
|
||||
.. jinja:: ta-spill-stack
|
||||
:file: _templates/metrics_table.j2
|
||||
|
||||
.. _desc-utcl1:
|
||||
|
||||
@@ -380,52 +196,8 @@ reduce the cost of subsequent re-translations.
|
||||
|
||||
ROCm Compute Profiler reports the following L1 TLB metrics:
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
|
||||
* - Metric
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - Requests
|
||||
|
||||
- The number of translation requests made to the UTCL1 per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Requests per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Hits
|
||||
|
||||
- The number of translation requests that hit in the UTCL1, and could be
|
||||
reused, per :ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Requests per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Hit Ratio
|
||||
|
||||
- The ratio of the number of translation requests that hit in the UTCL1
|
||||
divided by the total number of translation requests made to the UTCL1.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Translation Misses
|
||||
|
||||
- The total number of translation requests that missed in the UTCL1 due to
|
||||
translation not being present in the cache, per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Requests per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Permission Misses
|
||||
|
||||
- The total number of translation requests that missed in the UTCL1 due to
|
||||
a permission error, per :ref:`normalization unit <normalization-units>`.
|
||||
This is unused and expected to be zero in most configurations for modern
|
||||
CDNA™ accelerators.
|
||||
|
||||
- Requests per :ref:`normalization unit <normalization-units>`
|
||||
.. jinja:: desc-utcl1
|
||||
:file: _templates/metrics_table.j2
|
||||
|
||||
.. note::
|
||||
|
||||
@@ -464,39 +236,8 @@ L2 requests may backpressure the wave-issue logic of the :ref:`VMEM <desc-vmem>`
|
||||
pipe and prevent it from issuing more vector memory instructions until
|
||||
the vL1D’s outstanding requests are completed.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
|
||||
* - Metric
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - Stalled on L2 Data
|
||||
|
||||
- The ratio of the number of cycles where the vL1D is stalled waiting for
|
||||
requested data to return from the :doc:`L2 cache <l2-cache>` divided by
|
||||
the number of cycles where the vL1D is active [#vl1d-activity]_.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Stalled on L2 Requests
|
||||
|
||||
- The ratio of the number of cycles where the vL1D is stalled waiting to
|
||||
issue a request for data to the :doc:`L2 cache <l2-cache>` divided by the
|
||||
number of cycles where the vL1D is active [#vl1d-activity]_.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Tag RAM Stall (Read/Write/Atomic)
|
||||
|
||||
- The ratio of the number of cycles where the vL1D is stalled due to
|
||||
Read/Write/Atomic requests with conflicting tags being looked up
|
||||
concurrently, divided by the number of cycles where the
|
||||
vL1D is active [#vl1d-activity]_.
|
||||
|
||||
- Percent
|
||||
.. jinja:: vl1d-cache-stall-metrics
|
||||
:file: _templates/metrics_table.j2
|
||||
|
||||
.. _vl1d-cache-access-metrics:
|
||||
|
||||
@@ -510,135 +251,8 @@ the :doc:`L2 cache <l2-cache>`. In addition, this section includes the
|
||||
approximate latencies of accesses to the cache itself, along with
|
||||
latencies of read/write memory operations to the :doc:`L2 cache <l2-cache>`.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
|
||||
* - Metric
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - Total Requests
|
||||
|
||||
- The total number of incoming requests from the
|
||||
:ref:`address processing unit <desc-ta>` after coalescing.
|
||||
|
||||
- Requests
|
||||
|
||||
* - Total read/write/atomic requests
|
||||
|
||||
- The total number of incoming read/write/atomic requests from the
|
||||
:ref:`address processing unit <desc-ta>` after coalescing per
|
||||
:ref:`normalization unit <normalization-units>`
|
||||
|
||||
- Requests per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Cache Bandwidth
|
||||
|
||||
- The number of bytes looked up in the vL1D cache as a result of
|
||||
:ref:`VMEM <desc-vmem>` instructions per
|
||||
:ref:`normalization unit <normalization-units>`. The number of bytes is
|
||||
calculated as the number of cache lines requested multiplied by the cache
|
||||
line size. This value does not consider partial requests, so for
|
||||
instance, if only a single value is requested in a cache line, the data
|
||||
movement will still be counted as a full cache line.
|
||||
|
||||
- Bytes per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Cache Hit Rate [#vl1d-hit]_
|
||||
|
||||
- The ratio of the number of vL1D cache line requests that hit in vL1D
|
||||
cache over the total number of cache line requests to the
|
||||
:ref:`vL1D Cache RAM <desc-tc>`.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Cache Accesses
|
||||
|
||||
- The total number of cache line lookups in the vL1D.
|
||||
|
||||
- Cache lines
|
||||
|
||||
* - Cache Hits [#vl1d-hit]_
|
||||
|
||||
- The number of cache accesses minus the number of outgoing requests to the
|
||||
:doc:`L2 cache <l2-cache>`, that is, the number of cache line requests
|
||||
serviced by the :ref:`vL1D Cache RAM <desc-tc>` per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Cache lines per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Invalidations
|
||||
|
||||
- The number of times the vL1D was issued a write-back invalidate command
|
||||
during the kernel's execution per
|
||||
:ref:`normalization unit <normalization-units>`. This may be triggered
|
||||
by, for instance, the ``buffer_wbinvl1`` instruction.
|
||||
|
||||
- Invalidations per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - L1-L2 Bandwidth
|
||||
|
||||
- The number of bytes transferred across the vL1D-L2 interface as a result
|
||||
of :ref:`VMEM <desc-vmem>` instructions, per
|
||||
:ref:`normalization unit <normalization-units>`. The number of bytes is
|
||||
calculated as the number of cache lines requested multiplied by the cache
|
||||
line size. This value does not consider partial requests, so for
|
||||
instance, if only a single value is requested in a cache line, the data
|
||||
movement will still be counted as a full cache line.
|
||||
|
||||
- Bytes per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - L1-L2 Reads
|
||||
|
||||
- The number of read requests for a vL1D cache line that were not satisfied
|
||||
by the vL1D and must be retrieved from the
|
||||
:doc:`L2 Cache <l2-cache>` per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Requests per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - L1-L2 Writes
|
||||
|
||||
- The number of write requests to a vL1D cache line that were sent through
|
||||
the vL1D to the :doc:`L2 cache <l2-cache>`, per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Requests per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - L1-L2 Atomics
|
||||
|
||||
- The number of atomic requests that are sent through the vL1D to the
|
||||
:doc:`L2 cache <l2-cache>`, per
|
||||
:ref:`normalization unit <normalization-units>`. This includes requests
|
||||
for atomics with and without return.
|
||||
|
||||
- Requests per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - L1 Access Latency
|
||||
|
||||
- Calculated as the average number of cycles that a vL1D cache line request
|
||||
spent in the vL1D cache pipeline.
|
||||
|
||||
- Cycles
|
||||
|
||||
* - L1-L2 Read Access Latency
|
||||
|
||||
- Calculated as the average number of cycles that the vL1D cache took to
|
||||
issue and receive read requests from the :doc:`L2 Cache <l2-cache>`. This
|
||||
number also includes requests for atomics with return values.
|
||||
|
||||
- Cycles
|
||||
|
||||
* - L1-L2 Write Access Latency
|
||||
|
||||
- Calculated as the average number of cycles that the vL1D cache took to
|
||||
issue and receive acknowledgement of a write request to the
|
||||
:doc:`L2 Cache <l2-cache>`. This number also includes requests for
|
||||
atomics without return values.
|
||||
|
||||
- Cycles
|
||||
.. jinja:: vl1d-cache-access-metrics
|
||||
:file: _templates/metrics_table.j2
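The removed rows above describe Cache Hits as cache accesses minus outgoing requests to the L2, and Cache Hit Rate as hits over total cache line requests. A small sketch of those two definitions follows. The function names and sample counts are hypothetical, and returning 0.0 when there are no accesses is an assumption made here, not documented profiler behavior.

```python
def cache_hits(cache_accesses: int, l2_requests: int) -> int:
    """Cache line requests serviced by the cache RAM itself."""
    return cache_accesses - l2_requests


def hit_rate_percent(hits: int, cache_accesses: int) -> float:
    """Hits as a percent of all cache line requests."""
    if cache_accesses == 0:
        # Assumption: report 0.0 rather than dividing by zero.
        return 0.0
    return 100.0 * hits / cache_accesses


# Hypothetical counts: 1000 lookups, 250 of which went out to the L2.
hits = cache_hits(1000, 250)
print(hits)                          # 750
print(hit_rate_percent(hits, 1000))  # 75.0
```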
|
||||
|
||||
.. note::
|
||||
|
||||
@@ -687,80 +301,5 @@ data, and returned to the appropriate SIMD.
|
||||
|
||||
ROCm Compute Profiler reports the following vL1D data-return path metrics:
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
|
||||
* - Metric
|
||||
|
||||
- Description
|
||||
|
||||
- Unit
|
||||
|
||||
* - Data-return Busy
|
||||
|
||||
- Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return
|
||||
unit was busy processing or waiting on data to return to the
|
||||
:doc:`CU <compute-unit>`.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Cache RAM → Data-return Stall
|
||||
|
||||
- Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return
|
||||
unit was stalled on data to be returned from the
|
||||
:ref:`vL1D Cache RAM <desc-tc>`.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Workgroup manager → Data-return Stall
|
||||
|
||||
- Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return
|
||||
unit was stalled by the :ref:`workgroup manager <desc-spi>` due to
|
||||
initialization of registers as a part of launching new workgroups.
|
||||
|
||||
- Percent
|
||||
|
||||
* - Coalescable Instructions
|
||||
|
||||
- The number of instructions submitted to the
|
||||
:ref:`data-return unit <desc-td>` by the
|
||||
:ref:`address processor <desc-ta>` that were found to be coalescable, per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Read Instructions
|
||||
|
||||
- The number of read instructions submitted to the
|
||||
:ref:`data-return unit <desc-td>` by the
|
||||
:ref:`address processor <desc-ta>` summed over all
|
||||
:doc:`compute units <compute-unit>` on the accelerator, per
|
||||
:ref:`normalization unit <normalization-units>`. This is expected to be
|
||||
the sum of global/generic and spill/stack reads in the
|
||||
:ref:`address processor <desc-ta>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Write Instructions
|
||||
|
||||
- The number of store instructions submitted to the
|
||||
:ref:`data-return unit <desc-td>` by the
|
||||
:ref:`address processor <desc-ta>` summed over all
|
||||
:doc:`compute units <compute-unit>` on the accelerator, per
|
||||
:ref:`normalization unit <normalization-units>`. This is expected to be
|
||||
the sum of global/generic and spill/stack stores counted by the
|
||||
:ref:`vL1D cache-front-end <ta-instruction-counts>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
|
||||
* - Atomic Instructions
|
||||
|
||||
- The number of atomic instructions submitted to the
|
||||
:ref:`data-return unit <desc-td>` by the
|
||||
:ref:`address processor <desc-ta>` summed over all
|
||||
:doc:`compute units <compute-unit>` on the accelerator, per
|
||||
:ref:`normalization unit <normalization-units>`. This is expected to be
|
||||
the sum of global/generic and spill/stack atomics in the
|
||||
:ref:`address processor <desc-ta>`.
|
||||
|
||||
- Instructions per :ref:`normalization unit <normalization-units>`
|
||||
.. jinja:: desc-td
|
||||
:file: _templates/metrics_table.j2
|
||||
|
||||
@@ -30,6 +30,8 @@
|
||||
|
||||
import re
|
||||
|
||||
import yaml
|
||||
|
||||
with open("../VERSION", encoding="utf-8") as f:
|
||||
match = re.search(r"([0-9.]+)[^0-9.]+", f.read())
|
||||
if not match:
|
||||
@@ -43,7 +45,12 @@ copyright = "Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved
 version = version_number
 release = version_number

-extensions = ["rocm_docs", "sphinx.ext.extlinks", "sphinxcontrib.datatemplates"]
+extensions = [
+    "rocm_docs",
+    "sphinx.ext.extlinks",
+    "sphinxcontrib.datatemplates",
+    "sphinx_jinja",
+]
 html_theme = "rocm_docs_theme"
 html_theme_options = {"flavor": "rocm"}
 html_title = f"{project} {version_number} documentation"
|
||||
@@ -52,6 +59,113 @@ exclude_patterns = ["archive", "*/includes"]
|
||||
html_static_path = ["sphinx/static/css"]
|
||||
html_css_files = ["o_custom.css"]
|
||||
|
||||
with open("data/metrics_description.yaml", "r") as f:
|
||||
metrics_data = yaml.safe_load(f)
|
||||
jinja_contexts = {
|
||||
"wavefront-launch-stats": {
|
||||
"data": metrics_data["Wavefront launch stats"],
|
||||
},
|
||||
"wavefront-runtime-stats": {
|
||||
"data": metrics_data["Wavefront runtime stats"],
|
||||
},
|
||||
"instruction-mix": {
|
||||
"data": metrics_data["Overall instruction mix"],
|
||||
},
|
||||
"valu-arith-instruction-mix": {
|
||||
"data": metrics_data["VALU arithmetic instruction mix"],
|
||||
},
|
||||
"mfma-instruction-mix": {
|
||||
"data": metrics_data["MFMA instruction mix"],
|
||||
},
|
||||
"compute-speed-of-light": {
|
||||
"data": metrics_data["Compute Speed-of-Light"],
|
||||
},
|
||||
"pipeline-stats": {
|
||||
"data": metrics_data["Pipeline statistics"],
|
||||
},
|
||||
"arithmetic-operations": {
|
||||
"data": metrics_data["Arithmetic operations"],
|
||||
},
|
||||
"lds-sol": {
|
||||
"data": metrics_data["LDS Speed-of-Light"],
|
||||
},
|
||||
"lds-stats": {
|
||||
"data": metrics_data["LDS Statistics"],
|
||||
},
|
||||
"vl1d-sol": {
|
||||
"data": metrics_data["vL1D Speed-of-Light"],
|
||||
},
|
||||
"ta-busy-stall": {
|
||||
"data": metrics_data["Busy / stall metrics"],
|
||||
},
|
||||
"ta-instruction-counts": {
|
||||
"data": metrics_data["Instruction counts"],
|
||||
},
|
||||
"ta-spill-stack": {
|
||||
"data": metrics_data["Spill / stack metrics"],
|
||||
},
|
||||
"desc-utcl1": {
|
||||
"data": metrics_data["L1 Unified Translation Cache (UTCL1)"],
|
||||
},
|
||||
"vl1d-cache-stall-metrics": {
|
||||
"data": metrics_data["vL1D cache stall metrics"],
|
||||
},
|
||||
"vl1d-cache-access-metrics": {
|
||||
"data": metrics_data["vL1D cache access metrics"],
|
||||
},
|
||||
"desc-td": {
|
||||
"data": metrics_data["Vector L1 data-return path or Texture Data (TD)"],
|
||||
},
|
||||
"l2-sol": {
|
||||
"data": metrics_data["L2 Speed-of-Light"],
|
||||
},
|
||||
"l2-cache-accesses": {
|
||||
"data": metrics_data["L2 cache accesses"],
|
||||
},
|
||||
"l2-fabric-metrics": {
|
||||
"data": metrics_data["L2-Fabric interface metrics"],
|
||||
},
|
||||
"l2-detailed-metrics": {
|
||||
"data": metrics_data["L2 - Fabric interface detailed metrics"],
|
||||
},
|
||||
"l2-fabric-stalls": {
|
||||
"data": metrics_data["L2 - Fabric Interface stalls"],
|
||||
},
|
||||
"desc-sl1d-sol": {
|
||||
"data": metrics_data["Scalar L1D Speed-of-Light"],
|
||||
},
|
||||
"desc-sl1d-stats": {
|
||||
"data": metrics_data["Scalar L1D cache accesses"],
|
||||
},
|
||||
"desc-sl1d-l2-interface": {
|
||||
"data": metrics_data["Scalar L1D Cache - L2 Interface"],
|
||||
},
|
||||
"desc-l1i-sol": {
|
||||
"data": metrics_data["L1I Speed-of-Light"],
|
||||
},
|
||||
"desc-l1i-stats": {
|
||||
"data": metrics_data["L1I cache accesses"],
|
||||
},
|
||||
"desc-l1i-l2-interface": {
|
||||
"data": metrics_data["L1I <-> L2 interface"],
|
||||
},
|
||||
"spi-util": {
|
||||
"data": metrics_data["Workgroup manager utilizations"],
|
||||
},
|
||||
"spi-resc-util": {
|
||||
"data": metrics_data["Workgroup Manager - Resource Allocation"],
|
||||
},
|
||||
"cpf-metrics": {
|
||||
"data": metrics_data["Command processor fetcher (CPF)"],
|
||||
},
|
||||
"cpc-metrics": {
|
||||
"data": metrics_data["Command processor packet processor (CPC)"],
|
||||
},
|
||||
"sys-sol": {
|
||||
"data": metrics_data["System Speed-of-Light"],
|
||||
},
|
||||
}
|
||||
|
||||
external_toc_path = "./sphinx/_toc.yml"
|
||||
external_projects_current_project = "rocprofiler-compute"
|
||||
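The `jinja_contexts` dictionary in the hunk above pairs each context name with one section of the loaded metrics data, one entry at a time. As a possible alternative, the sketch below builds the same shape from a name-to-section mapping and fails early when a section is missing. `build_jinja_contexts` and `section_by_context` are hypothetical names, and the stand-in `metrics_data` layout is an assumption, since the real `data/metrics_description.yaml` is not shown in this diff.

```python
def build_jinja_contexts(metrics_data: dict, section_by_context: dict) -> dict:
    """Map each jinja context name to {"data": <section>} from the metrics data."""
    missing = sorted(
        section for section in section_by_context.values() if section not in metrics_data
    )
    if missing:
        # Fail at build time with all missing section titles listed together.
        raise KeyError(f"sections missing from metrics data: {missing}")
    return {
        context: {"data": metrics_data[section]}
        for context, section in section_by_context.items()
    }


# Stand-in for the result of yaml.safe_load(); the real layout is an assumption.
metrics_data = {
    "vL1D Speed-of-Light": {"Hit Rate": {"unit": "Percent"}},
    "System Speed-of-Light": {"VALU FLOPs": {"unit": "GFLOP/s"}},
}
section_by_context = {
    "vl1d-sol": "vL1D Speed-of-Light",
    "sys-sol": "System Speed-of-Light",
}
jinja_contexts = build_jinja_contexts(metrics_data, section_by_context)
print(sorted(jinja_contexts))  # ['sys-sol', 'vl1d-sol']
```

The explicit dictionary in the commit is easier to grep for a single context name. The mapping form mainly helps when a renamed section title should produce one clear error instead of a bare `KeyError` partway through `conf.py`.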
|
||||
@@ -96,3 +210,6 @@ extlinks = {
|
||||
"HSA Runtime Programmer's Reference Manual (page %s)",
|
||||
),
|
||||
}
|
||||
|
||||
# Uncomment if you hit rate-limit-exceeded errors during a local build
|
||||
external_projects_remote_repository = ""
|
||||
File diff is too large to display
@@ -242,6 +242,11 @@ List metrics
|
||||
|
||||
$ rocprof-compute analyze -p workloads/vcopy/MI200/ --list-metrics gfx90a
|
||||
|
||||
Show the Description column, which is excluded by default from CLI output
|
||||
.. code-block:: shell
|
||||
|
||||
$ rocprof-compute analyze -p workloads/vcopy/MI200/ --list-metrics gfx90a --include-cols Description
|
||||
|
||||
Show System Speed-of-Light and CS_Busy blocks only
|
||||
.. code-block:: shell
|
||||
|
||||
|
||||
@@ -1,2 +1,3 @@
 rocm-docs-core==1.21.1
 sphinxcontrib.datatemplates==0.11.0
+sphinx-jinja==2.0.2
|
||||
|
||||
@@ -53,7 +53,8 @@ docutils==0.21.2
|
||||
# myst-parser
|
||||
# pydata-sphinx-theme
|
||||
# sphinx
|
||||
exceptiongroup==1.2.2
|
||||
# sphinx-jinja
|
||||
exceptiongroup==1.3.0
|
||||
# via ipython
|
||||
executing==2.2.0
|
||||
# via stack-data
|
||||
@@ -87,6 +88,7 @@ jinja2==3.1.5
|
||||
# via
|
||||
# myst-parser
|
||||
# sphinx
|
||||
# sphinx-jinja
|
||||
jsonschema==4.23.0
|
||||
# via nbformat
|
||||
jsonschema-specifications==2024.10.1
|
||||
@@ -215,6 +217,7 @@ sphinx==8.1.3
|
||||
# sphinx-copybutton
|
||||
# sphinx-design
|
||||
# sphinx-external-toc
|
||||
# sphinx-jinja
|
||||
# sphinx-notfound-page
|
||||
# sphinxcontrib-datatemplates
|
||||
# sphinxcontrib-runcmd
|
||||
@@ -226,6 +229,8 @@ sphinx-design==0.6.1
|
||||
# via rocm-docs-core
|
||||
sphinx-external-toc==1.0.1
|
||||
# via rocm-docs-core
|
||||
sphinx-jinja==2.0.2
|
||||
# via -r requirements.in
|
||||
sphinx-notfound-page==1.0.4
|
||||
# via rocm-docs-core
|
||||
sphinxcontrib-applehelp==2.0.0
|
||||
@@ -268,6 +273,7 @@ traitlets==5.14.3
|
||||
# nbformat
|
||||
typing-extensions==4.12.2
|
||||
# via
|
||||
# exceptiongroup
|
||||
# ipython
|
||||
# myst-nb
|
||||
# pydata-sphinx-theme
|
||||
|
||||
@@ -202,7 +202,7 @@ Examples:
|
||||
nargs="?",
|
||||
const="",
|
||||
# Argument to --list-metrics is optional
|
||||
choices=[""] + list(supported_archs.keys()), # ["gfx906", "gfx908", "gfx90a"],
|
||||
choices=[""] + list(supported_archs.keys()), # ["gfx908", "gfx90a"],
|
||||
help=print_avail_arch(supported_archs.keys()),
|
||||
)
|
||||
profile_group.add_argument(
|
||||
@@ -623,7 +623,18 @@ Examples:
|
||||
dest="cols",
|
||||
metavar="",
|
||||
nargs="+",
|
||||
help="\t\tSpecify column indices to display.",
|
||||
help="\t\tSpecify column indices to display.\n\t\tDefaults to display all columns.",
|
||||
)
|
||||
analyze_advanced_group.add_argument(
|
||||
"--include-cols",
|
||||
dest="include_cols",
|
||||
metavar="",
|
||||
nargs="+",
|
||||
help=(
|
||||
"\t\tSpecify which hidden column names should be included in cli output.\n"
|
||||
"\t\tFor example, to show 'Description' column which is hidden by default in cli output,\n"
|
||||
"\t\tuse the option --include-cols Description."
|
||||
),
|
||||
)
|
||||
analyze_advanced_group.add_argument(
|
||||
"-g", dest="debug", action="store_true", help="\t\tDebug single metric."
|
||||
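The `--include-cols` option added in the hunk above uses `nargs="+"`, so it collects one or more column names into a list and is `None` when the flag is absent. The standalone sketch below reproduces only that option on a throwaway parser; the `prog` name and the shortened help text are placeholders, not the tool's real parser setup.

```python
import argparse

# Throwaway parser that mirrors only the new option from the diff.
parser = argparse.ArgumentParser(prog="analyze-sketch")
parser.add_argument(
    "--include-cols",
    dest="include_cols",
    metavar="",
    nargs="+",
    help="Hidden column names to include in CLI output.",
)

with_flag = parser.parse_args(["--include-cols", "Description"])
print(with_flag.include_cols)  # ['Description']

several = parser.parse_args(["--include-cols", "Description", "coll_level"])
print(several.include_cols)    # ['Description', 'coll_level']

without_flag = parser.parse_args([])
print(without_flag.include_cols)  # None
```

Code that consumes `include_cols` therefore has to handle `None` as well as a list.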
|
||||
@@ -28,7 +28,8 @@ from pathlib import Path
 rocprof_compute_home = Path(__file__).resolve().parent
 PROJECT_NAME = "rocprofiler-compute"

-HIDDEN_COLUMNS = ["Tips", "coll_level"]
+HIDDEN_COLUMNS = ["coll_level"]
+HIDDEN_COLUMNS_CLI = ["Description", "coll_level"]
 HIDDEN_SECTIONS = [400, 1900, 2000]

 TIME_UNITS = {"s": 10**9, "ms": 10**6, "us": 10**3, "ns": 1}
|
||||
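This diff introduces `HIDDEN_COLUMNS_CLI` but does not show the code that applies it together with `--include-cols`. One plausible way the two could interact is sketched below. `visible_columns` is a hypothetical helper written for this illustration, and the sample column names other than `Description` and `coll_level` are made up.

```python
HIDDEN_COLUMNS_CLI = ["Description", "coll_level"]


def visible_columns(columns, include_cols=None, hidden=HIDDEN_COLUMNS_CLI):
    """Drop hidden columns unless the user explicitly asked to include them."""
    included = set(include_cols or [])
    return [col for col in columns if col not in hidden or col in included]


columns = ["Metric", "Avg", "Unit", "Description", "coll_level"]
print(visible_columns(columns))
# ['Metric', 'Avg', 'Unit']
print(visible_columns(columns, include_cols=["Description"]))
# ['Metric', 'Avg', 'Unit', 'Description']
```

The original column order is preserved, so an included column appears where the table already places it rather than at the end.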
|
||||
@@ -25,6 +25,7 @@
|
||||
import copy
|
||||
import os
|
||||
import sys
|
||||
import textwrap
|
||||
from abc import ABC, abstractmethod
|
||||
from collections import OrderedDict
|
||||
from pathlib import Path
|
||||
@@ -96,15 +97,28 @@ class OmniAnalyze_Base:
|
||||
sys_info.iloc[0],
|
||||
)
|
||||
|
||||
metric_descriptions = {
|
||||
k: v
|
||||
for dfs in self._arch_configs[args.list_metrics].dfs.values()
|
||||
for k, v in dfs.to_dict().get("Description", {}).items()
|
||||
}
|
||||
for key, value in self._arch_configs[args.list_metrics].metric_list.items():
|
||||
prefix = ""
|
||||
description = ""
|
||||
if "." not in str(key):
|
||||
prefix = ""
|
||||
elif str(key).count(".") == 1:
|
||||
prefix = "\t"
|
||||
else:
|
||||
prefix = "\t\t"
|
||||
print(prefix + key, "->", value)
|
||||
description = metric_descriptions.get(key, "")
|
||||
print(prefix + key, "->", value + "\n")
|
||||
if description:
|
||||
print(
|
||||
prefix
|
||||
+ f"\n{prefix}".join(textwrap.wrap(description, width=40))
|
||||
+ "\n"
|
||||
)
|
||||
sys.exit(0)
|
||||
else:
|
||||
console_error("Unsupported arch")
|
||||
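The hunk above indents each metric by the number of dots in its key and wraps the description to 40 characters, repeating the same prefix on every wrapped line. The sketch below isolates that formatting in a pure function so it can be checked without a profiling workload. `format_metric` is a hypothetical name, and the sketch returns a string and omits the extra blank lines that the committed code prints.

```python
import textwrap


def format_metric(key: str, value: str, description: str = "", width: int = 40) -> str:
    """Indent by key depth (0, 1, or 2 tabs) and wrap the description under it."""
    depth = str(key).count(".")
    prefix = "\t" * min(depth, 2)
    lines = [f"{prefix}{key} -> {value}"]
    if description:
        # Each wrapped chunk gets the same prefix as the metric line.
        lines.extend(prefix + chunk for chunk in textwrap.wrap(description, width=width))
    return "\n".join(lines)


print(format_metric("2", "System Speed-of-Light"))
print(format_metric("2.1.0", "VALU FLOPs", "short text"))
```

The width limit applies to the description text only; the tab prefix is added after wrapping, so deeper entries extend further to the right.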
|
||||
+11
-11
@@ -1,14 +1,14 @@
|
||||
---
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 000
|
||||
id: 0
|
||||
title: Top Stats
|
||||
metrics_description: {}
|
||||
data source:
|
||||
- raw_csv_table:
|
||||
id: 001
|
||||
title: Top Kernels
|
||||
source: pmc_kernel_top.csv
|
||||
|
||||
- raw_csv_table:
|
||||
id: 002
|
||||
title: Dispatch List
|
||||
source: pmc_dispatch_info.csv
|
||||
- raw_csv_table:
|
||||
id: 1
|
||||
title: Top Kernels
|
||||
source: pmc_kernel_top.csv
|
||||
- raw_csv_table:
|
||||
id: 2
|
||||
title: Dispatch List
|
||||
source: pmc_dispatch_info.csv
|
||||
|
||||
+6
-5
@@ -1,9 +1,10 @@
|
||||
---
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 100
|
||||
title: System Info
|
||||
metrics_description: {}
|
||||
data source:
|
||||
- raw_csv_table:
|
||||
id: 101
|
||||
source: sysinfo.csv
|
||||
columnwise: True
|
||||
- raw_csv_table:
|
||||
id: 101
|
||||
source: sysinfo.csv
|
||||
columnwise: true
|
||||
|
||||
-236
@@ -1,236 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
SALU: &SALU_anchor Scalar Arithmetic Logic Unit
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 200
|
||||
title: System Speed-of-Light
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 201
|
||||
title: Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
peak: Peak
|
||||
pop: Pct of Peak
|
||||
tips: Tips
|
||||
metric:
|
||||
VALU FLOPs:
|
||||
value: None # No perf counter
|
||||
unit: GFLOP/s
|
||||
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
|
||||
pop: None # No perf counter
|
||||
tips:
|
||||
VALU IOPs:
|
||||
value: None # No perf counter
|
||||
unit: GIOP/s
|
||||
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
|
||||
pop: None # No perf counter
|
||||
tips:
|
||||
MFMA FLOPs (BF16):
|
||||
value: None # No perf counter
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 512) / 1000)
|
||||
pop: None # No perf counter
|
||||
tips:
|
||||
MFMA FLOPs (F16):
|
||||
value: None # No perf counter
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
|
||||
pop: None # No perf counter
|
||||
tips:
|
||||
MFMA FLOPs (F32):
|
||||
value: None # No perf counter
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
|
||||
pop: None # No perf counter
|
||||
tips:
|
||||
MFMA FLOPs (F64):
|
||||
value: None # No perf counter
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
|
||||
pop: None # No perf counter
|
||||
tips:
|
||||
MFMA IOPs (Int8):
|
||||
value: None # No perf counter
|
||||
unit: GIOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
|
||||
pop: None # No perf counter
|
||||
tips:
|
||||
Active CUs:
|
||||
value: $numActiveCUs
|
||||
unit: CUs
|
||||
peak: $cu_per_gpu
|
||||
pop: ((100 * $numActiveCUs) / $cu_per_gpu)
|
||||
tips:
|
||||
SALU Utilization:
|
||||
value: AVG(((100 * SQ_ACTIVE_INST_SCA) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG(((100 * SQ_ACTIVE_INST_SCA) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
tips:
|
||||
VALU Utilization:
|
||||
value: AVG(((100 * SQ_ACTIVE_INST_VALU) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG(((100 * SQ_ACTIVE_INST_VALU) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
tips:
|
||||
MFMA Utilization:
|
||||
value: None # No HW module
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: None # No HW module
|
||||
tips:
|
||||
VMEM Utilization:
|
||||
value: None # No HW module
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: None # No HW module
|
||||
tips:
|
||||
Branch Utilization:
|
||||
value: None # No HW module
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: None # No HW module
|
||||
tips:
|
||||
VALU Active Threads:
|
||||
value: AVG(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
|
||||
!= 0) else None))
|
||||
unit: Threads
|
||||
peak: $wave_size
|
||||
pop: (100 * AVG((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU / $wave_size) if (SQ_ACTIVE_INST_VALU != 0) else None))
|
||||
tips:
|
||||
IPC:
|
||||
value: AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))
|
||||
unit: Instr/cycle
|
||||
peak: 5
|
||||
pop: ((100 * AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))) / 5)
|
||||
tips:
|
||||
Wavefront Occupancy:
|
||||
value: AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
unit: Wavefronts
|
||||
peak: ($max_waves_per_cu * $cu_per_gpu)
|
||||
pop: (100 * AVG(((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD) / ($max_waves_per_cu
|
||||
* $cu_per_gpu))))
|
||||
coll_level: SQ_LEVEL_WAVES
|
||||
tips:
|
||||
Theoretical LDS Bandwidth:
|
||||
value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
peak: (($max_sclk * $cu_per_gpu) * 0.128)
|
||||
pop: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
|
||||
tips:
|
||||
LDS Bank Conflicts/Access:
|
||||
value: AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
|
||||
unit: Conflicts/access
|
||||
peak: 32
|
||||
pop: ((100 * AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))) / 32)
|
||||
tips:
|
||||
vL1D Cache Hit Rate:
|
||||
value: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
|
||||
None))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) +
|
||||
TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) /
|
||||
TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
|
||||
None))
|
||||
tips:
|
||||
vL1D Cache BW:
|
||||
value: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
peak: ((($max_sclk / 1000) * 64) * $cu_per_gpu)
|
||||
pop: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk / 1000) * 64) * $cu_per_gpu))
|
||||
tips:
|
||||
L2 Cache Hit Rate:
|
||||
value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
tips:
|
||||
L2 Cache BW:
|
||||
value: AVG(((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
peak: ((($max_sclk / 1000) * 64) * TO_INT($total_l2_chan))
|
||||
pop: ((100 * AVG(((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk / 1000) * 64) * TO_INT($total_l2_chan)))
|
||||
tips:
|
||||
L2-Fabric Read BW:
|
||||
value: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
peak: $hbmBandwidth
|
||||
pop: ((100 * AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))) / $hbmBandwidth)
|
||||
tips:
|
||||
L2-Fabric Write BW:
|
||||
value: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
|
||||
* 32)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
peak: $hbmBandwidth
|
||||
pop: ((100 * AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
|
||||
* 32)) / (End_Timestamp - Start_Timestamp)))) / $hbmBandwidth)
|
||||
tips:
|
||||
L2-Fabric Read Latency:
|
||||
value: AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
peak: None
|
||||
pop: None
|
||||
tips:
|
||||
L2-Fabric Write Latency:
|
||||
value: AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
peak: None
|
||||
pop: None
|
||||
tips:
|
||||
sL1D Cache Hit Rate:
|
||||
value: AVG((((100 * SQC_DCACHE_HITS) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES))
|
||||
if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES) != 0) else None))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG((((100 * SQC_DCACHE_HITS) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES))
|
||||
if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES) != 0) else None))
|
||||
tips:
|
||||
sL1D Cache BW:
|
||||
value: AVG(((SQC_DCACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))
|
||||
unit: GB/s
|
||||
peak: ((($max_sclk / 1000) * 64) * $sqc_per_gpu)
|
||||
pop: ((100 * AVG(((SQC_DCACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))) / ((($max_sclk
|
||||
/ 1000) * 64) * $sqc_per_gpu))
|
||||
tips:
|
||||
L1I Hit Rate:
|
||||
value: AVG(((100 * SQC_ICACHE_HITS) / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES)))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG(((100 * SQC_ICACHE_HITS) / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES)))
|
||||
tips:
|
||||
L1I BW:
|
||||
value: AVG(((SQC_ICACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))
|
||||
unit: GB/s
|
||||
peak: ((($max_sclk / 1000) * 64) * $sqc_per_gpu)
|
||||
pop: ((100 * AVG(((SQC_ICACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))) / ((($max_sclk
|
||||
/ 1000) * 64) * $sqc_per_gpu))
|
||||
tips:
|
||||
L1I Fetch Latency:
|
||||
value: AVG((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
|
||||
unit: Cycles
|
||||
peak: None
|
||||
pop: None
|
||||
coll_level: SQ_IFETCH_LEVEL
|
||||
tips:
|
||||
+317
@@ -0,0 +1,317 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 200
|
||||
title: System Speed-of-Light
|
||||
metrics_description:
|
||||
VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
|
||||
This is also presented as a percent of the peak theoretical FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any floating-point
|
||||
operations from MFMA instructions.'
|
||||
VALU IOPs: 'The total integer operations executed per second on the VALU. This
|
||||
is also presented as a percent of the peak theoretical IOPs achievable on the
|
||||
specific accelerator. Note: this does not include any integer operations from
|
||||
MFMA instructions.'
|
||||
MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
|
||||
executed per second. This does not include any 8-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F8 MFMA operations achievable on the specific accelerator. It is supported on
|
||||
AMD Instinct MI300 series and later only.
|
||||
MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
|
||||
executed per second. Note: this does not include any 16-bit brain floating point
|
||||
operations from VALU instructions. This is also presented as a percent of the
|
||||
peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F16 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 32-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F32 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 64-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F64 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
|
||||
per second. Note: this does not include any 8-bit integer operations from VALU
|
||||
instructions. This is also presented as a percent of the peak theoretical INT8
|
||||
MFMA operations achievable on the specific accelerator.'
|
||||
Active CUs: Total number of active compute units (CUs) on the accelerator during
|
||||
the kernel execution.
|
||||
SALU Utilization: Indicates what percent of the kernel's duration the SALU was
|
||||
busy executing instructions. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
|
||||
VALU Utilization: Indicates what percent of the kernel's duration the VALU was
|
||||
busy executing instructions. Does not include VMEM operations. Computed as the
|
||||
ratio of the total number of cycles spent by the scheduler issuing VALU instructions
|
||||
over the total CU cycles.
|
||||
MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
|
||||
was busy executing instructions. Computed as the ratio of the total number of
|
||||
cycles the MFMA was busy over the total CU cycles.
|
||||
VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
|
||||
was busy executing instructions, including both global/generic and spill/scratch
|
||||
operations (see the VMEM instruction count metrics for more detail). Does not
|
||||
include VALU operations. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing VMEM instructions over the total CU cycles.
|
||||
Branch Utilization: Indicates what percent of the kernel's duration the branch
|
||||
unit was busy executing instructions. Computed as the ratio of the total number
|
||||
of cycles spent by the scheduler issuing branch instructions over the total
|
||||
CU cycles.
|
||||
VALU Active Threads: Indicates the average level of divergence within a wavefront
|
||||
over the lifetime of the kernel. The number of work-items that were active in
|
||||
a wavefront during execution of each VALU instruction, time-averaged over all
|
||||
VALU instructions run on all wavefronts in the kernel.
|
||||
IPC: The ratio of the total number of instructions executed on the CU over the
|
||||
total active CU cycles. This is also presented as a percent of the peak theoretical
|
||||
IPC achievable on the specific accelerator.
|
||||
Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
|
||||
over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
||||
kernels (less than 1ms). This is also presented as a percent of the peak theoretical
|
||||
occupancy achievable on the specific accelerator.'
|
||||
Theoretical LDS Bandwidth: Indicates the maximum amount of bytes that could have
|
||||
been loaded from, stored to, or atomically updated in the LDS per unit time
|
||||
(see LDS Bandwidth example for more detail). This is also presented as a percent
|
||||
of the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
LDS Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS
|
||||
scheduler due to bank conflicts (as determined by the conflict resolution hardware)
|
||||
to the base number of cycles that would be spent in the LDS scheduler in a completely
|
||||
uncontended case. This is also presented in normalized form (i.e., the Bank
|
||||
Conflict Rate).
|
||||
vL1D Cache Hit Rate: The ratio of the number of vL1D cache line requests that
|
||||
hit in vL1D cache over the total number of cache line requests to the vL1D cache
|
||||
RAM.
|
||||
vL1D Cache BW: The number of bytes looked up in the vL1D cache as a result of
|
||||
VMEM instructions per unit time. The number of bytes is calculated as the number
|
||||
of cache lines requested multiplied by the cache line size. This value does
|
||||
not consider partial requests, so e.g., if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
This is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L2 Cache Hit Rate: The ratio of the number of L2 cache line requests that hit
|
||||
in the L2 cache over the total number of incoming cache line requests to the
|
||||
L2 cache.
|
||||
L2 Cache BW: The number of bytes looked up in the L2 cache per unit time. The
|
||||
number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line. This is also presented as a percent of
|
||||
the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Read BW: "The number of bytes read by the L2 over the Infinity Fabric\u2122\
|
||||
\ interface per unit time. This is also presented as a percent of the peak theoretical\
|
||||
\ bandwidth achievable on the specific accelerator."
|
||||
L2-Fabric Write BW: The number of bytes sent by the L2 over the Infinity Fabric
|
||||
interface by write and atomic operations per unit time. This is also presented
|
||||
as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Read Latency: The time-averaged number of cycles read requests spent
|
||||
in Infinity Fabric before data was returned to the L2.
|
||||
L2-Fabric Write Latency: The time-averaged number of cycles write requests spent
|
||||
in Infinity Fabric before a completion acknowledgement was returned to the L2.
|
||||
sL1D Cache Hit Rate: The percent of sL1D requests that hit on a previously loaded
|
||||
line in the cache. Calculated as the ratio of the number of sL1D requests that
|
||||
hit over the number of all sL1D requests.
|
||||
sL1D Cache BW: The number of bytes looked up in the sL1D cache per unit time.
|
||||
This is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L1I Hit Rate: The percent of L1I requests that hit on a previously loaded line in
|
||||
the cache. Calculated as the ratio of the number of L1I requests that hit over
|
||||
the number of all L1I requests.
|
||||
L1I BW: The number of bytes looked up in the L1I cache per unit time. This is also
|
||||
presented as a percent of the peak theoretical bandwidth achievable on the
|
||||
specific accelerator.
|
||||
L1I Fetch Latency: The average number of cycles spent to fetch instructions to
|
||||
a CU.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 201
|
||||
title: System Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
peak: Peak
|
||||
pop: Pct of Peak
|
||||
metric:
|
||||
VALU FLOPs:
|
||||
value: None
|
||||
unit: GFLOP/s
|
||||
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
|
||||
pop: None
|
||||
VALU IOPs:
|
||||
value: None
|
||||
unit: GIOP/s
|
||||
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
|
||||
pop: None
|
||||
MFMA FLOPs (BF16):
|
||||
value: None
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 512) / 1000)
|
||||
pop: None
|
||||
MFMA FLOPs (F16):
|
||||
value: None
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
|
||||
pop: None
|
||||
MFMA FLOPs (F32):
|
||||
value: None
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
|
||||
pop: None
|
||||
MFMA FLOPs (F64):
|
||||
value: None
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
|
||||
pop: None
|
||||
MFMA IOPs (Int8):
|
||||
value: None
|
||||
unit: GIOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
|
||||
pop: None
|
||||
Active CUs:
|
||||
value: $numActiveCUs
|
||||
unit: CUs
|
||||
peak: $cu_per_gpu
|
||||
pop: ((100 * $numActiveCUs) / $cu_per_gpu)
|
||||
SALU Utilization:
|
||||
value: AVG(((100 * SQ_ACTIVE_INST_SCA) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG(((100 * SQ_ACTIVE_INST_SCA) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
VALU Utilization:
|
||||
value: AVG(((100 * SQ_ACTIVE_INST_VALU) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG(((100 * SQ_ACTIVE_INST_VALU) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
MFMA Utilization:
|
||||
value: None
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: None
|
||||
VMEM Utilization:
|
||||
value: None
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: None
|
||||
Branch Utilization:
|
||||
value: None
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: None
|
||||
VALU Active Threads:
|
||||
value: AVG(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
|
||||
!= 0) else None))
|
||||
unit: Threads
|
||||
peak: $wave_size
|
||||
pop: (100 * AVG((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU / $wave_size)
|
||||
if (SQ_ACTIVE_INST_VALU != 0) else None))
|
||||
IPC:
|
||||
value: AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))
|
||||
unit: Instr/cycle
|
||||
peak: 5
|
||||
pop: ((100 * AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))) / 5)
|
||||
Wavefront Occupancy:
|
||||
value: AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
unit: Wavefronts
|
||||
peak: ($max_waves_per_cu * $cu_per_gpu)
|
||||
pop: (100 * AVG(((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD) / ($max_waves_per_cu
|
||||
* $cu_per_gpu))))
|
||||
coll_level: SQ_LEVEL_WAVES
|
||||
Theoretical LDS Bandwidth:
|
||||
value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
peak: (($max_sclk * $cu_per_gpu) * 0.128)
|
||||
pop: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
|
||||
LDS Bank Conflicts/Access:
|
||||
value: AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
|
||||
unit: Conflicts/access
|
||||
peak: 32
|
||||
pop: ((100 * AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))) / 32)
|
||||
vL1D Cache Hit Rate:
|
||||
value: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
|
||||
else None))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
|
||||
else None))
|
||||
vL1D Cache BW:
|
||||
value: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
peak: ((($max_sclk / 1000) * 64) * $cu_per_gpu)
|
||||
pop: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp -
|
||||
Start_Timestamp)))) / ((($max_sclk / 1000) * 64) * $cu_per_gpu))
|
||||
L2 Cache Hit Rate:
|
||||
value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
L2 Cache BW:
|
||||
value: AVG(((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
peak: ((($max_sclk / 1000) * 64) * TO_INT($total_l2_chan))
|
||||
pop: ((100 * AVG(((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk / 1000) * 64) * TO_INT($total_l2_chan)))
|
||||
L2-Fabric Read BW:
|
||||
value: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
peak: $hbmBandwidth
|
||||
pop: ((100 * AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))) / $hbmBandwidth)
|
||||
L2-Fabric Write BW:
|
||||
value: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
|
||||
* 32)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
peak: $hbmBandwidth
|
||||
pop: ((100 * AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
|
||||
* 32)) / (End_Timestamp - Start_Timestamp)))) / $hbmBandwidth)
|
||||
L2-Fabric Read Latency:
|
||||
value: AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
peak: None
|
||||
pop: None
|
||||
L2-Fabric Write Latency:
|
||||
value: AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
peak: None
|
||||
pop: None
|
||||
sL1D Cache Hit Rate:
|
||||
value: AVG((((100 * SQC_DCACHE_HITS) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES))
|
||||
if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES) != 0) else None))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG((((100 * SQC_DCACHE_HITS) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES))
|
||||
if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES) != 0) else None))
|
||||
sL1D Cache BW:
|
||||
value: AVG(((SQC_DCACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))
|
||||
unit: GB/s
|
||||
peak: ((($max_sclk / 1000) * 64) * $sqc_per_gpu)
|
||||
pop: ((100 * AVG(((SQC_DCACHE_REQ / (End_Timestamp - Start_Timestamp)) *
|
||||
64))) / ((($max_sclk / 1000) * 64) * $sqc_per_gpu))
|
||||
L1I Hit Rate:
|
||||
value: AVG(((100 * SQC_ICACHE_HITS) / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES)))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG(((100 * SQC_ICACHE_HITS) / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES)))
|
||||
L1I BW:
|
||||
value: AVG(((SQC_ICACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))
|
||||
unit: GB/s
|
||||
peak: ((($max_sclk / 1000) * 64) * $sqc_per_gpu)
|
||||
pop: ((100 * AVG(((SQC_ICACHE_REQ / (End_Timestamp - Start_Timestamp)) *
|
||||
64))) / ((($max_sclk / 1000) * 64) * $sqc_per_gpu))
|
||||
L1I Fetch Latency:
|
||||
value: AVG((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
|
||||
unit: Cycles
|
||||
peak: None
|
||||
pop: None
|
||||
coll_level: SQ_IFETCH_LEVEL
|
||||
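The `peak` and `pop` fields in the panel above are arithmetic expressions over `$`-prefixed system variables. As a worked example, the sketch below substitutes values into the `MFMA FLOPs (BF16)` peak formula and derives a percent of peak. It is an illustration under stated assumptions: `evaluate_formula`, `percent_of_peak`, and the system values are hypothetical, and this is not how the analyzer itself evaluates expressions.

```python
from string import Template


def evaluate_formula(expression, variables):
    """Substitute $name placeholders, then evaluate the arithmetic.

    Hypothetical helper for illustration only, not the tool's evaluator.
    """
    resolved = Template(expression).substitute(variables)
    # After substitution only numbers and operators remain, so evaluate
    # the string with builtins disabled.
    return eval(resolved, {"__builtins__": {}}, {})


def percent_of_peak(value, peak):
    """Return 100 * value / peak, or None when either input is missing."""
    if value is None or not peak:
        return None
    return 100 * value / peak


# Hypothetical system values, chosen only to keep the arithmetic easy to follow.
system = {"max_sclk": 1000, "cu_per_gpu": 100}
bf16_peak = evaluate_formula("((($max_sclk * $cu_per_gpu) * 512) / 1000)", system)
```

With these made-up inputs the expression reduces to (1000 * 100 * 512) / 1000, and a measured value of half that peak would report 50 percent of peak.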
-310
@@ -1,310 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 300
|
||||
title: Memory Chart
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 301
|
||||
title: Memory Chart
|
||||
header:
|
||||
metric: Metric
|
||||
#alias: #alias
|
||||
value: Value
|
||||
tips: Tips
|
||||
metric:
|
||||
# ----------------------------------------
|
||||
# Instr Buff Block
|
||||
|
||||
#TODO: double check wave_occupancy
|
||||
Wavefront Occupancy:
|
||||
#alias: wave_occ_
|
||||
value: ROUND(AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD) / $numActiveCUs), 0)
|
||||
coll_level: SQ_LEVEL_WAVES
|
||||
tips:
|
||||
Wave Life:
|
||||
#alias: wave_life_
|
||||
value: ROUND(AVG(((4 * (SQ_WAVE_CYCLES / SQ_WAVES)) if (SQ_WAVES != 0) else 0)), 0)
|
||||
tips:
|
||||
|
||||
# ----------------------------------------
|
||||
# Instr Dispatch Block
|
||||
SALU:
|
||||
#alias: salu_
|
||||
value: ROUND(AVG((SQ_INSTS_SALU / $denom)), 0)
|
||||
tips:
|
||||
SMEM:
|
||||
#alias: smem_
|
||||
value: ROUND(AVG((SQ_INSTS_SMEM / $denom)), 0)
|
||||
tips:
|
||||
VALU:
|
||||
#alias: valu_
|
||||
value: ROUND(AVG((SQ_INSTS_VALU / $denom)), 0)
|
||||
tips:
|
||||
VMEM:
|
||||
#alias: vmem_
|
||||
value: ROUND(AVG((SQ_INSTS_VMEM / $denom)), 0)
|
||||
tips:
|
||||
LDS:
|
||||
#alias: lds_
|
||||
value: ROUND(AVG((SQ_INSTS_LDS / $denom)), 0)
|
||||
tips:
|
||||
GWS:
|
||||
#alias: gws_
|
||||
value: ROUND(AVG((SQ_INSTS_GDS / $denom)), 0)
|
||||
tips:
|
||||
BR:
|
||||
#alias: br_
|
||||
value: ROUND(AVG((SQ_INSTS_BRANCH / $denom)), 0)
|
||||
tips:
|
||||
|
||||
# ----------------------------------------
|
||||
# Exec Block
|
||||
Active CUs:
|
||||
#alias: active_cu_
|
||||
value: $numActiveCUs
|
||||
tips:
|
||||
Num CUs:
|
||||
#alias: num_cu_
|
||||
value: $cu_per_gpu
|
||||
tips:
|
||||
VGPR:
|
||||
#alias: vgpr_
|
||||
value: ROUND(AVG(Arch_VGPR), 0)
|
||||
tips:
|
||||
SGPR:
|
||||
#alias: sgpr_
|
||||
value: ROUND(AVG(SGPR), 0)
|
||||
tips:
|
||||
LDS Allocation:
|
||||
#alias: lds_alloc_
|
||||
value: ROUND(AVG(LDS_Per_Workgroup), 0)
|
||||
tips:
|
||||
Scratch Allocation:
|
||||
#alias: scratch_alloc_
|
||||
value: ROUND(AVG(Scratch_Per_Workitem), 0)
|
||||
tips:
|
||||
Wavefronts:
|
||||
#alias: wavefronts_
|
||||
value: ROUND(AVG(SPI_CSN_WAVE), 0)
|
||||
tips:
|
||||
Workgroups:
|
||||
#alias: workgroups_
|
||||
value: ROUND(AVG(SPI_CSN_NUM_THREADGROUPS), 0)
|
||||
tips:
|
||||
|
||||
# ----------------------------------------
|
||||
# LDS Block
|
||||
LDS Req:
|
||||
#alias: lds_req_
|
||||
value: ROUND(AVG((SQ_INSTS_LDS / $denom)), 0)
|
||||
tips:
|
||||
LDS Util:
|
||||
#alias: lds_util_
|
||||
value:
|
||||
ROUND(AVG(((100 * SQ_LDS_IDX_ACTIVE) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))),
|
||||
0)
|
||||
tips:
|
||||
LDS Latency:
|
||||
#alias: lds_lat
|
||||
value: ROUND(AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None)),0)
|
||||
coll_level: SQ_INST_LEVEL_LDS
|
||||
tips:
|
||||
|
||||
# ----------------------------------------
|
||||
# Vector L1 Cache Block
|
||||
VL1 Rd:
|
||||
#alias: vl1_rd_
|
||||
value: ROUND(AVG((TCP_TOTAL_READ_sum / $denom)), 0)
|
||||
tips:
|
||||
VL1 Wr:
|
||||
#alias: vl1_wr_
|
||||
value: ROUND(AVG((TCP_TOTAL_WRITE_sum / $denom)), 0)
|
||||
tips:
|
||||
VL1 Atomic:
|
||||
#alias: vl1_atom_
|
||||
value:
|
||||
ROUND(AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
|
||||
/ $denom)), 0)
|
||||
tips:
|
||||
|
||||
VL1 Hit:
|
||||
#alias: vl1_hit_
|
||||
value:
|
||||
ROUND(AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
|
||||
None )), 0)
|
||||
tips:
|
||||
VL1 Lat:
|
||||
#alias: vl1_lat_
|
||||
value:
|
||||
ROUND(AVG(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
|
||||
!= 0) else None)), 0)
|
||||
tips:
|
||||
VL1 Coalesce:
|
||||
#alias: vl1_coales_
|
||||
value:
|
||||
ROUND(AVG(((((TA_TOTAL_WAVEFRONTS_sum * 64) * 100) / (TCP_TOTAL_ACCESSES_sum
|
||||
* 4)) if (TCP_TOTAL_ACCESSES_sum != None) else 0)), 0)
|
||||
tips:
|
||||
VL1 Stall:
|
||||
#alias: vl1_stall_
|
||||
value:
|
||||
ROUND(AVG((((100 * TCP_TCR_TCP_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
|
||||
if (TCP_GATE_EN1_sum != 0) else None)), 0)
|
||||
tips:
|
||||
|
||||
VL1_L2 Rd:
|
||||
#alias: vl1_l2_rd_
|
||||
value: ROUND(AVG((TCP_TCC_READ_REQ_sum / $denom)), 0)
|
||||
tips:
|
||||
VL1_L2 Wr:
|
||||
#alias: vl1_l2_wr_
|
||||
value: ROUND(AVG((TCP_TCC_WRITE_REQ_sum / $denom)), 0)
|
||||
tips:
|
||||
VL1_L2 Atomic:
|
||||
#alias: vl1_l2_atom_
|
||||
value:
|
||||
ROUND(AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
/ $denom)), 0)
|
||||
tips:
|
||||
|
||||
# ----------------------------------------
|
||||
# Scalar L1D Cache Block
|
||||
VL1D Rd:
|
||||
#alias: sl1_rd_
|
||||
value: ROUND(AVG((SQC_DCACHE_REQ / $denom)), 0)
|
||||
tips:
|
||||
VL1D Hit:
|
||||
#alias: sl1_hit_
|
||||
value:
|
||||
ROUND((AVG(((SQC_DCACHE_HITS / SQC_DCACHE_REQ) if (SQC_DCACHE_REQ !=
|
||||
0) else None)) * 100), 0)
|
||||
tips:
|
||||
VL1D Lat:
|
||||
#alias: sl1_lat_
|
||||
value:
|
||||
ROUND((AVG(((SQ_ACCUM_PREV_HIRES / SQC_DCACHE_REQ) if (SQC_DCACHE_REQ !=
|
||||
0) else None)) * 100), 0)
|
||||
coll_level: SQC_DCACHE_INFLIGHT_LEVEL
|
||||
tips:
|
||||
|
||||
VL1D_L2 Rd:
|
||||
#alias: sl1_l2_rd_
|
||||
value: ROUND(AVG((SQC_TC_DATA_READ_REQ / $denom)), 0)
|
||||
tips:
|
||||
VL1D_L2 Wr:
|
||||
#alias: sl1_l2_wr_
|
||||
value: ROUND(AVG((SQC_TC_DATA_WRITE_REQ / $denom)), 0)
|
||||
tips:
|
||||
VL1D_L2 Atomic:
|
||||
#alias: sl1_l2_atom_
|
||||
value: ROUND(AVG((SQC_TC_DATA_ATOMIC_REQ / $denom)), 0)
|
||||
tips:
|
||||
|
||||
# ----------------------------------------
|
||||
# Instr L1 Cache Block
|
||||
IL1 Fetch:
|
||||
#alias: il1_fetch_
|
||||
value: ROUND(AVG((SQC_ICACHE_REQ / $denom)), 0)
|
||||
tips:
|
||||
IL1 Hit:
|
||||
#alias: il1_hit_
|
||||
value: ROUND((AVG((SQC_ICACHE_HITS / SQC_ICACHE_REQ)) * 100), 0)
|
||||
tips:
|
||||
IL1 Lat:
|
||||
#alias: il1_lat_
|
||||
value:
|
||||
ROUND((AVG(((SQ_ACCUM_PREV_HIRES / SQC_ICACHE_REQ) if (SQC_ICACHE_REQ !=
|
||||
0) else None)) * 100), 0)
|
||||
tips: # ??? coll_level: SQ_IFETCH_LEVEL
|
||||
IL1_L2 Rd:
|
||||
#alias: il1_l2_req_
|
||||
value: ROUND(AVG((SQC_TC_INST_REQ / $denom)), 0)
|
||||
tips:
|
||||
|
||||
# ----------------------------------------
|
||||
# L2 Cache Block(inside)
|
||||
L2 Rd:
|
||||
#alias: l2_rd_
|
||||
value: ROUND(AVG((TCC_READ_sum / $denom)), 0)
|
||||
tips:
|
||||
L2 Wr:
|
||||
#alias: l2_wr_
|
||||
value: ROUND(AVG((TCC_WRITE_sum / $denom)), 0)
|
||||
tips:
|
||||
L2 Atomic:
|
||||
#alias: l2_atom_
|
||||
value: ROUND(AVG((TCC_ATOMIC_sum / $denom)), 0)
|
||||
tips:
|
||||
L2 Hit:
|
||||
#alias: l2_hit_
|
||||
value:
|
||||
ROUND(AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else 0)), 0)
|
||||
tips:
|
||||
L2 Rd Lat:
|
||||
#alias: l2_rd_lat_
|
||||
value:
|
||||
ROUND(AVG(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
|
||||
if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else None)),
|
||||
0)
|
||||
tips:
|
||||
L2 Wr Lat:
|
||||
#alias: l2_wr_lat_
|
||||
value:
|
||||
ROUND(AVG(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum +
|
||||
TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
!= 0) else None)), 0)
|
||||
tips:
|
||||
|
||||
# ----------------------------------------
|
||||
# Fabric Block
|
||||
Fabric_L2 Rd:
|
||||
#alias: l2_fabric_rd_
|
||||
value: ROUND(AVG((TCC_EA_RDREQ_sum / $denom)), 0)
|
||||
tips:
|
||||
Fabric_L2 Wr:
|
||||
#alias: l2_fabric_wr_
|
||||
value: ROUND(AVG((TCC_EA_WRREQ_sum / $denom)), 0)
|
||||
tips:
|
||||
Fabric_L2 Atomic:
|
||||
#alias: l2_fabric_atom_
|
||||
value: ROUND(AVG((TCC_EA_ATOMIC_sum / $denom)), 0)
|
||||
tips:
|
||||
|
||||
Fabric Rd Lat:
|
||||
#alias: fabric_rd_lat_
|
||||
value:
|
||||
ROUND(AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else 0)), 0)
|
||||
tips:
|
||||
Fabric Wr Lat:
|
||||
#alias: fabric_wr_lat_
|
||||
value:
|
||||
ROUND(AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else 0)), 0)
|
||||
tips:
|
||||
Fabric Atomic Lat:
|
||||
#alias: fabric_atom_lat_
|
||||
value:
|
||||
ROUND(AVG(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
|
||||
!= 0) else 0)), 0)
|
||||
tips:
|
||||
|
||||
HBM Rd:
|
||||
#alias: hbm_rd_
|
||||
value: ROUND(AVG((TCC_EA_RDREQ_DRAM_sum / $denom)), 0)
|
||||
tips:
|
||||
HBM Wr:
|
||||
#alias: hbm_wr_
|
||||
value: ROUND(AVG((TCC_EA_WRREQ_DRAM_sum / $denom)), 0)
|
||||
tips:
|
||||
|
||||
comparable: false # for now
|
||||
cli_style: mem_chart
|
||||
@@ -0,0 +1,267 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 300
|
||||
title: Memory Chart
|
||||
metrics_description:
|
||||
Wavefront Occupancy: Wavefronts per active CU.
|
||||
Wave Life: Average number of cycles executing a wave.
|
||||
SALU: Total number of SALU (Scalar ALU) instructions issued per normalization
|
||||
unit.
|
||||
SMEM: Total number of SMEM (Scalar Memory Read) instructions issued per normalization
|
||||
unit.
|
||||
VALU: The number of VALU (Vector ALU) instructions issued per normalization unit.
|
||||
MFMA: Total number of MFMA (Matrix-Fused-Multiply-Add) instructions issued per
|
||||
normalization unit.
|
||||
VMEM: The number of VMEM (GPU Memory) read instructions issued (including FLAT/scratch
|
||||
memory) per normalization unit.
|
||||
LDS: The total number of LDS instructions (including, but not limited to, read/write/atomics
|
||||
and HIP's __shfl instructions) executed per normalization unit.
|
||||
GWS: Total number of GDS (global data sync) instructions issued per normalization
|
||||
unit.
|
||||
BR: Total number of BRANCH instructions issued per normalization unit.
|
||||
Active CUs: Total number of active compute units (CUs) on the accelerator during
|
||||
the kernel execution.
|
||||
Num CUs: Total number of compute units (CUs) on the accelerator.
|
||||
VGPR: 'The number of architected vector general-purpose registers allocated for
|
||||
the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
|
||||
by the compiler due to allocation granularity.'
|
||||
SGPR: 'The number of scalar general-purpose registers allocated for the kernel,
|
||||
see SALU. Note: this may not exactly match the number of SGPRs requested by
|
||||
the compiler due to allocation granularity.'
|
||||
LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
|
||||
for this kernel. Note: This may also be larger than what was requested at compile
|
||||
time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
|
||||
Scratch Allocation: The number of bytes of scratch memory requested per work-item
|
||||
for this kernel. Scratch memory is used for stack memory on the accelerator,
|
||||
as well as for register spills and restores.
|
||||
Wavefronts: The total number of wavefronts, summed over all workgroups, forming
|
||||
this kernel launch.
|
||||
Workgroups: The total number of workgroups forming this kernel launch.
|
||||
LDS Req: The total number of LDS instructions (including, but not limited to,
|
||||
read/write/atomics and HIP's __shfl instructions) executed per normalization
|
||||
unit.
|
||||
LDS Util: Indicates what percent of the kernel's duration the LDS was actively
|
||||
executing instructions (including, but not limited to, load, store, atomic and
|
||||
HIP's __shfl operations). Calculated as the ratio of the total number of cycles
|
||||
LDS was active over the total CU cycles.
|
||||
LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
|
||||
/ acknowledgment) required for an LDS instruction to complete.
|
||||
VL1 Rd: The total number of incoming read requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
VL1 Wr: The total number of incoming write requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
VL1 Atomic: The total number of incoming atomic requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
VL1 Hit: The ratio of the number of vL1D cache line requests that hit in vL1D
|
||||
cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
VL1 Lat: Calculated as the average number of cycles that a vL1D cache line request
|
||||
spent in the vL1D cache pipeline.
|
||||
VL1 Coalesce: Indicates how well memory instructions were coalesced by the address
|
||||
processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
|
||||
as the average number of thread-requests generated per instruction divided by
|
||||
the ideal number of thread-requests per instruction.
|
||||
VL1 Stall: The ratio of the number of cycles where the vL1D is stalled waiting
|
||||
to issue a request for data to the L2 cache divided by the number of cycles
|
||||
where the vL1D is active.
|
||||
VL1_L2 Rd: The number of read requests for a vL1D cache line that were not satisfied
|
||||
by the vL1D and must be retrieved from the L2 Cache per normalization
|
||||
unit.
|
||||
VL1_L2 Wr: The number of write requests to a vL1D cache line that were sent through
|
||||
the vL1D to the L2 cache, per normalization unit.
|
||||
VL1_L2 Atomic: The number of atomic requests that are sent through the vL1D to
|
||||
the L2 cache, per normalization unit. This includes requests for atomics with,
|
||||
and without return.
|
||||
sL1D Rd: The total number of requests, of any size or type, made to the sL1D per
|
||||
normalization unit.
|
||||
sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
|
||||
line, per normalization unit.
|
||||
sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
|
||||
unit.
|
||||
sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
sL1D_L2 Atomic: The total number of atomic requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
IL1 Fetch: The total number of requests made to the L1I per normalization unit.
|
||||
IL1 Hit: The percent of L1I requests that hit on a previously loaded line in the
|
||||
cache. Calculated as the ratio of the number of L1I requests that hit over the
|
||||
number of all L1I requests.
|
||||
IL1 Lat: The average number of cycles spent to fetch instructions to a CU.
|
||||
IL1_L2 Rd: The total number of requests across the L1I - L2 interface per normalization unit.
|
||||
L2 Rd: The total number of read requests to the L2 from all clients.
|
||||
L2 Wr: The total number of write requests to the L2 from all clients.
|
||||
L2 Atomic: The total number of atomic requests (with and without return) to the
|
||||
L2 from all clients.
|
||||
L2 Hit: The ratio of the number of L2 cache line requests that hit in the L2 cache
|
||||
over the total number of incoming cache line requests to the L2 cache.
|
||||
L2 Rd Lat: Calculated as the average number of cycles that the vL1D cache took
|
||||
to issue and receive read requests from the L2 Cache. This number also includes
|
||||
requests for atomics with return values.
|
||||
L2 Wr Lat: Calculated as the average number of cycles that the vL1D cache took
|
||||
to issue and receive acknowledgement of a write request to the L2 Cache. This
|
||||
number also includes requests for atomics without return values.
|
||||
Fabric_L2 Rd: Number of L2 cache - Infinity Fabric read requests (either 32-byte
|
||||
or 64-byte) summed over TCC instances per normalization unit.
|
||||
Fabric_L2 Wr: Number of L2 cache - Infinity Fabric write requests (either 32-byte
|
||||
or 64-byte) summed over TCC instances per normalization unit.
|
||||
Fabric_L2 Atomic: Number of L2 cache - Infinity Fabric write requests (either
|
||||
32-byte or 64-byte) that are actually atomic requests summed over TCC instances
|
||||
per normalization unit.
|
||||
Fabric Rd Lat: The time-averaged number of cycles read requests spent in Infinity
|
||||
Fabric before data was returned to the L2.
|
||||
Fabric Wr Lat: The time-averaged number of cycles write requests spent in Infinity
|
||||
Fabric before a completion acknowledgement was returned to the L2.
|
||||
Fabric Atomic Lat: The time-averaged number of cycles atomic requests spent in
|
||||
Infinity Fabric before a completion acknowledgement (atomic without return value)
|
||||
or data (atomic with return value) was returned to the L2.
|
||||
HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
|
||||
of data from the accelerator's local HBM, per normalization unit.
|
||||
HBM Wr: 'The total number of L2 requests to Infinity Fabric to write or atomically
|
||||
update 32B or 64B of data in the accelerator''s local HBM, per normalization
|
||||
unit. '
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 301
|
||||
title: Memory Chart
|
||||
header:
|
||||
metric: Metric
|
||||
value: Value
|
||||
metric:
|
||||
Wavefront Occupancy:
|
||||
value: ROUND(AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD) / $numActiveCUs),
|
||||
0)
|
||||
coll_level: SQ_LEVEL_WAVES
|
||||
Wave Life:
|
||||
value: ROUND(AVG(((4 * (SQ_WAVE_CYCLES / SQ_WAVES)) if (SQ_WAVES != 0) else
|
||||
0)), 0)
|
||||
SALU:
|
||||
value: ROUND(AVG((SQ_INSTS_SALU / $denom)), 0)
|
||||
SMEM:
|
||||
value: ROUND(AVG((SQ_INSTS_SMEM / $denom)), 0)
|
||||
VALU:
|
||||
value: ROUND(AVG((SQ_INSTS_VALU / $denom)), 0)
|
||||
MFMA:
|
||||
value: None
|
||||
VMEM:
|
||||
value: ROUND(AVG((SQ_INSTS_VMEM / $denom)), 0)
|
||||
LDS:
|
||||
value: ROUND(AVG((SQ_INSTS_LDS / $denom)), 0)
|
||||
GWS:
|
||||
value: ROUND(AVG((SQ_INSTS_GDS / $denom)), 0)
|
||||
BR:
|
||||
value: ROUND(AVG((SQ_INSTS_BRANCH / $denom)), 0)
|
||||
Active CUs:
|
||||
value: $numActiveCUs
|
||||
Num CUs:
|
||||
value: $cu_per_gpu
|
||||
VGPR:
|
||||
value: ROUND(AVG(Arch_VGPR), 0)
|
||||
SGPR:
|
||||
value: ROUND(AVG(SGPR), 0)
|
||||
LDS Allocation:
|
||||
value: ROUND(AVG(LDS_Per_Workgroup), 0)
|
||||
Scratch Allocation:
|
||||
value: ROUND(AVG(Scratch_Per_Workitem), 0)
|
||||
Wavefronts:
|
||||
value: ROUND(AVG(SPI_CSN_WAVE), 0)
|
||||
Workgroups:
|
||||
value: ROUND(AVG(SPI_CSN_NUM_THREADGROUPS), 0)
|
||||
LDS Req:
|
||||
value: ROUND(AVG((SQ_INSTS_LDS / $denom)), 0)
|
||||
LDS Util:
|
||||
value: ROUND(AVG(((100 * SQ_LDS_IDX_ACTIVE) / ($GRBM_GUI_ACTIVE_PER_XCD
|
||||
* $cu_per_gpu))), 0)
|
||||
LDS Latency:
|
||||
value: ROUND(AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS
|
||||
!= 0) else None)),0)
|
||||
coll_level: SQ_INST_LEVEL_LDS
|
||||
VL1 Rd:
|
||||
value: ROUND(AVG((TCP_TOTAL_READ_sum / $denom)), 0)
|
||||
VL1 Wr:
|
||||
value: ROUND(AVG((TCP_TOTAL_WRITE_sum / $denom)), 0)
|
||||
VL1 Atomic:
|
||||
value: ROUND(AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
|
||||
/ $denom)), 0)
|
||||
VL1 Hit:
|
||||
value: ROUND(AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
|
||||
else None )), 0)
|
||||
VL1 Lat:
|
||||
value: ROUND(AVG(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
|
||||
!= 0) else None)), 0)
|
||||
VL1 Coalesce:
|
||||
value: ROUND(AVG(((((TA_TOTAL_WAVEFRONTS_sum * 64) * 100) / (TCP_TOTAL_ACCESSES_sum
|
||||
* 4)) if (TCP_TOTAL_ACCESSES_sum != None) else 0)), 0)
|
||||
VL1 Stall:
|
||||
value: ROUND(AVG((((100 * TCP_TCR_TCP_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
|
||||
if (TCP_GATE_EN1_sum != 0) else None)), 0)
|
||||
VL1_L2 Rd:
|
||||
value: ROUND(AVG((TCP_TCC_READ_REQ_sum / $denom)), 0)
|
||||
VL1_L2 Wr:
|
||||
value: ROUND(AVG((TCP_TCC_WRITE_REQ_sum / $denom)), 0)
|
||||
VL1_L2 Atomic:
|
||||
value: ROUND(AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
/ $denom)), 0)
|
||||
sL1D Rd:
|
||||
value: ROUND(AVG((SQC_DCACHE_REQ / $denom)), 0)
|
||||
sL1D Hit:
|
||||
value: ROUND((AVG(((SQC_DCACHE_HITS / SQC_DCACHE_REQ) if (SQC_DCACHE_REQ
|
||||
!= 0) else None)) * 100), 0)
|
||||
sL1D Lat:
|
||||
value: ROUND((AVG(((SQ_ACCUM_PREV_HIRES / SQC_DCACHE_REQ) if (SQC_DCACHE_REQ
|
||||
!= 0) else None)) * 100), 0)
|
||||
coll_level: SQC_DCACHE_INFLIGHT_LEVEL
|
||||
sL1D_L2 Rd:
|
||||
value: ROUND(AVG((SQC_TC_DATA_READ_REQ / $denom)), 0)
|
||||
sL1D_L2 Wr:
|
||||
value: ROUND(AVG((SQC_TC_DATA_WRITE_REQ / $denom)), 0)
|
||||
sL1D_L2 Atomic:
|
||||
value: ROUND(AVG((SQC_TC_DATA_ATOMIC_REQ / $denom)), 0)
|
||||
IL1 Fetch:
|
||||
value: ROUND(AVG((SQC_ICACHE_REQ / $denom)), 0)
|
||||
IL1 Hit:
|
||||
value: ROUND((AVG((SQC_ICACHE_HITS / SQC_ICACHE_REQ)) * 100), 0)
|
||||
IL1 Lat:
|
||||
value: ROUND((AVG(((SQ_ACCUM_PREV_HIRES / SQC_ICACHE_REQ) if (SQC_ICACHE_REQ
|
||||
!= 0) else None)) * 100), 0)
|
||||
IL1_L2 Rd:
|
||||
value: ROUND(AVG((SQC_TC_INST_REQ / $denom)), 0)
|
||||
L2 Rd:
|
||||
value: ROUND(AVG((TCC_READ_sum / $denom)), 0)
|
||||
L2 Wr:
|
||||
value: ROUND(AVG((TCC_WRITE_sum / $denom)), 0)
|
||||
L2 Atomic:
|
||||
value: ROUND(AVG((TCC_ATOMIC_sum / $denom)), 0)
|
||||
L2 Hit:
|
||||
value: ROUND(AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if
|
||||
((TCC_HIT_sum + TCC_MISS_sum) != 0) else 0)), 0)
|
||||
L2 Rd Lat:
|
||||
value: ROUND(AVG(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum)) if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
|
||||
!= 0) else None)), 0)
|
||||
L2 Wr Lat:
|
||||
value: ROUND(AVG(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
!= 0) else None)), 0)
|
||||
Fabric_L2 Rd:
|
||||
value: ROUND(AVG((TCC_EA_RDREQ_sum / $denom)), 0)
|
||||
Fabric_L2 Wr:
|
||||
value: ROUND(AVG((TCC_EA_WRREQ_sum / $denom)), 0)
|
||||
Fabric_L2 Atomic:
|
||||
value: ROUND(AVG((TCC_EA_ATOMIC_sum / $denom)), 0)
|
||||
Fabric Rd Lat:
|
||||
value: ROUND(AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else 0)), 0)
|
||||
Fabric Wr Lat:
|
||||
value: ROUND(AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else 0)), 0)
|
||||
Fabric Atomic Lat:
|
||||
value: ROUND(AVG(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
|
||||
!= 0) else 0)), 0)
|
||||
HBM Rd:
|
||||
value: ROUND(AVG((TCC_EA_RDREQ_DRAM_sum / $denom)), 0)
|
||||
HBM Wr:
|
||||
value: ROUND(AVG((TCC_EA_WRREQ_DRAM_sum / $denom)), 0)
|
||||
comparable: false
|
||||
cli_style: mem_chart
|
||||
tui_style: mem_chart
|
||||
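The guarded ratios in the Memory Chart metrics above follow one pattern: compute a percentage, and return a fallback when the denominator is zero (L2 Hit falls back to 0, the latency metrics fall back to None). A minimal Python sketch of that pattern for a single dispatch follows. `guarded_pct` and `l2_hit_pct` are hypothetical helper names, not functions from the repository, and the sketch leaves out the AVG and ROUND aggregation applied by the analysis layer.

```python
from typing import Optional


def guarded_pct(
    numerator: float, denominator: float, fallback: Optional[float] = None
) -> Optional[float]:
    """Return 100 * numerator / denominator, or `fallback` if the denominator is zero."""
    if denominator == 0:
        return fallback
    return 100.0 * numerator / denominator


def l2_hit_pct(tcc_hit_sum: float, tcc_miss_sum: float) -> Optional[float]:
    """Per-dispatch L2 hit rate; mirrors the `else 0` fallback of the L2 Hit metric."""
    return guarded_pct(tcc_hit_sum, tcc_hit_sum + tcc_miss_sum, fallback=0)
```

Passing the fallback explicitly keeps the difference between "no data" (None) and "zero" (0) visible at each call site, which is the distinction the config expressions encode with `else None` versus `else 0`.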
@@ -0,0 +1,9 @@
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
Panel Config:
  id: 400
  title: Roofline
  metrics_description: {}
  data source:
  - None:
      id: 401
      title: Roofline
|
||||
@@ -1,135 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 500
|
||||
title: Command Processor (CPC/CPF)
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 501
|
||||
title: Command Processor Fetcher
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
CPF Utilization:
|
||||
avg: AVG((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
|
||||
if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
|
||||
min: MIN((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
|
||||
if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
|
||||
max: MAX((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
|
||||
if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
CPF Stall:
|
||||
avg: AVG((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
|
||||
!= 0) else None))
|
||||
min: MIN((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
|
||||
!= 0) else None))
|
||||
max: MAX((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
CPF-L2 Utilization:
|
||||
avg: AVG((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
|
||||
if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
|
||||
min: MIN((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
|
||||
if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
|
||||
max: MAX((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
|
||||
if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
CPF-L2 Stall:
|
||||
avg: AVG((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
|
||||
!= 0) else None))
|
||||
min: MIN((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
|
||||
!= 0) else None))
|
||||
max: MAX((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
CPF-UTCL1 Stall:
|
||||
avg: AVG(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
|
||||
!= 0) else None)
|
||||
min: MIN(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
|
||||
!= 0) else None)
|
||||
max: MAX(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
|
||||
!= 0) else None)
|
||||
unit: pct
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 502
|
||||
title: Packet Processor
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
CPC Utilization:
|
||||
avg: AVG((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
|
||||
if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
|
||||
min: MIN((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
|
||||
if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
|
||||
max: MAX((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
|
||||
if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
CPC Stall Rate:
|
||||
avg: AVG((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None))
|
||||
min: MIN((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None))
|
||||
max: MAX((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
CPC Packet Decoding Utilization:
|
||||
avg: AVG((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
|
||||
min: MIN((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
|
||||
max: MAX((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
|
||||
unit: pct
|
||||
tips:
|
||||
CPC-Workgroup Manager Utilization:
|
||||
avg: AVG((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
|
||||
min: MIN((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
|
||||
max: MAX((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
|
||||
unit: Pct
|
||||
tips:
|
||||
CPC-L2 Utilization:
|
||||
avg: AVG((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
|
||||
if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
|
||||
min: MIN((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
|
||||
if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
|
||||
max: MAX((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
|
||||
if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
CPC-UTCL1 Stall:
|
||||
avg: AVG(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None)
|
||||
min: MIN(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None)
|
||||
max: MAX(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None)
|
||||
unit: pct
|
||||
tips:
|
||||
CPC-UTCL2 Utilization:
|
||||
avg: AVG((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
|
||||
if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
|
||||
min: MIN((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
|
||||
if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
|
||||
max: MAX((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
|
||||
if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
@@ -0,0 +1,145 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 500
|
||||
title: Command Processor (CPC/CPF)
|
||||
metrics_description:
|
||||
CPF Utilization: Percent of total cycles where the CPF was busy actively doing
|
||||
any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
|
||||
CPF Stall: Percent of CPF busy cycles where the CPF was stalled for any reason.
|
||||
CPF-L2 Utilization: Percent of total cycles counted by the CPF-L2 interface where
|
||||
the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
|
||||
over total cycles counted by the CPF-L2.
|
||||
CPF-L2 Stall: Percent of CPF-L2 busy cycles where the CPF-L2 interface was
|
||||
stalled for any reason.
|
||||
CPF-UTCL1 Stall: Percent of CPF busy cycles where the CPF was stalled by address
|
||||
translation.
|
||||
CPC Utilization: Percent of total cycles where the CPC was busy actively doing
|
||||
any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
|
||||
CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
|
||||
CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
|
||||
for processing.
|
||||
CPC-Workgroup Manager Utilization: Percent of CPC busy cycles spent dispatching
|
||||
workgroups to the workgroup manager.
|
||||
CPC-L2 Utilization: Percent of total cycles counted by the CPC-L2 interface where
|
||||
the CPC-L2 interface was active doing any work.
|
||||
CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
|
||||
translation.
|
||||
CPC-UTCL2 Utilization: 'Percent of total cycles counted by the CPC''s L2 address
|
||||
translation interface where the CPC was busy doing address translation work. '
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 501
|
||||
title: Command processor fetcher (CPF)
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
CPF Utilization:
|
||||
avg: AVG((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
|
||||
if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
|
||||
min: MIN((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
|
||||
if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
|
||||
max: MAX((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
|
||||
if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
CPF Stall:
|
||||
avg: AVG((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
|
||||
!= 0) else None))
|
||||
min: MIN((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
|
||||
!= 0) else None))
|
||||
max: MAX((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
CPF-L2 Utilization:
|
||||
avg: AVG((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
|
||||
if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
|
||||
min: MIN((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
|
||||
if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
|
||||
max: MAX((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
|
||||
if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
CPF-L2 Stall:
|
||||
avg: AVG((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
|
||||
!= 0) else None))
|
||||
min: MIN((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
|
||||
!= 0) else None))
|
||||
max: MAX((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
CPF-UTCL1 Stall:
|
||||
avg: AVG(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY)
|
||||
if (CPF_CPF_STAT_BUSY != 0) else None)
|
||||
min: MIN(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY)
|
||||
if (CPF_CPF_STAT_BUSY != 0) else None)
|
||||
max: MAX(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY)
|
||||
if (CPF_CPF_STAT_BUSY != 0) else None)
|
||||
unit: pct
|
||||
- metric_table:
|
||||
id: 502
|
||||
title: Command processor packet processor (CPC)
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
CPC Utilization:
|
||||
avg: AVG((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
|
||||
if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
|
||||
min: MIN((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
|
||||
if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
|
||||
max: MAX((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
|
||||
if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
CPC Stall Rate:
|
||||
avg: AVG((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None))
|
||||
min: MIN((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None))
|
||||
max: MAX((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
CPC Packet Decoding Utilization:
|
||||
avg: AVG((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None)
|
||||
min: MIN((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None)
|
||||
max: MAX((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None)
|
||||
unit: pct
|
||||
CPC-Workgroup Manager Utilization:
|
||||
avg: AVG((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None)
|
||||
min: MIN((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None)
|
||||
max: MAX((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None)
|
||||
unit: Pct
|
||||
CPC-L2 Utilization:
|
||||
avg: AVG((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
|
||||
if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
|
||||
min: MIN((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
|
||||
if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
|
||||
max: MAX((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
|
||||
if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
CPC-UTCL1 Stall:
|
||||
avg: AVG(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if
|
||||
(CPC_CPC_STAT_BUSY != 0) else None)
|
||||
min: MIN(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if
|
||||
(CPC_CPC_STAT_BUSY != 0) else None)
|
||||
max: MAX(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if
|
||||
(CPC_CPC_STAT_BUSY != 0) else None)
|
||||
unit: pct
|
||||
CPC-UTCL2 Utilization:
|
||||
avg: AVG((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
|
||||
if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
|
||||
min: MIN((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
|
||||
if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
|
||||
max: MAX((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
|
||||
if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
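The CPC metrics above all share one pattern: a percentage whose denominator is guarded against zero, wrapped in `AVG`, `MIN` or `MAX`. A minimal Python sketch of that pattern follows. The helper names `guarded_pct` and `aggregate` and the sample counter values are hypothetical, and skipping `None` entries during aggregation is an assumption rather than confirmed tool behavior.

```python
from statistics import mean


def guarded_pct(numerator, denominator):
    """Return 100 * numerator / denominator, or None when the denominator is zero."""
    return (100 * numerator) / denominator if denominator != 0 else None


def aggregate(values, reducer):
    """Apply reducer to the non-None values; return None if nothing is valid.

    Skipping None here is an assumption for illustration only.
    """
    valid = [v for v in values if v is not None]
    return reducer(valid) if valid else None


# Hypothetical per-dispatch samples:
# (CPC_UTCL1_STALL_ON_TRANSLATION, CPC_CPC_STAT_BUSY)
samples = [(25, 100), (0, 0), (30, 40)]

per_dispatch = [guarded_pct(n, d) for n, d in samples]
summary = {
    "avg": aggregate(per_dispatch, mean),
    "min": aggregate(per_dispatch, min),
    "max": aggregate(per_dispatch, max),
}
print(per_dispatch)
print(summary)
```

The guard matters because an idle dispatch can report a zero busy counter, and yielding `None` keeps that dispatch from raising a division error or skewing the summary.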
-167
@@ -1,167 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 600
|
||||
title: Workgroup Manager (SPI)
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 601
|
||||
title: Workgroup Manager Utilizations
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Accelerator Utilization:
|
||||
avg: AVG(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
|
||||
min: MIN(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
|
||||
max: MAX(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
|
||||
unit: Pct
|
||||
tips:
|
||||
Scheduler-Pipe Utilization:
|
||||
avg: AVG(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
|
||||
min: MIN(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
|
||||
max: MAX(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
Workgroup Manager Utilization:
|
||||
avg: AVG(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
|
||||
min: MIN(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
|
||||
max: MAX(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
|
||||
unit: Pct
|
||||
tips:
|
||||
Shader Engine Utilization:
|
||||
avg: AVG(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
|
||||
min: MIN(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
|
||||
max: MAX(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
SIMD Utilization:
|
||||
avg: AVG(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
Dispatched Workgroups:
|
||||
avg: AVG(SPI_CSN_NUM_THREADGROUPS)
|
||||
min: MIN(SPI_CSN_NUM_THREADGROUPS)
|
||||
max: MAX(SPI_CSN_NUM_THREADGROUPS)
|
||||
unit: Workgroups
|
||||
tips:
|
||||
Dispatched Wavefronts:
|
||||
avg: AVG(SPI_CSN_WAVE)
|
||||
min: MIN(SPI_CSN_WAVE)
|
||||
max: MAX(SPI_CSN_WAVE)
|
||||
unit: Wavefronts
|
||||
tips:
|
||||
VGPR Writes:
|
||||
avg: AVG((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
min: MIN((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
max: MAX((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
unit: Cycles/wave
|
||||
tips:
|
||||
SGPR Writes:
|
||||
avg: AVG((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
min: MIN((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
max: MAX((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
unit: Cycles/wave
|
||||
tips:
|
||||
- metric_table:
|
||||
id: 602
|
||||
title: Workgroup Manager - Resource Allocation
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Not-scheduled Rate (Workgroup Manager):
|
||||
avg: AVG((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None)
|
||||
min: MIN((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None)
|
||||
max: MAX((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None)
|
||||
unit: Pct
|
||||
tips:
|
||||
Not-scheduled Rate (Scheduler-Pipe):
|
||||
avg: AVG((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None)
|
||||
min: MIN((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None)
|
||||
max: MAX((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None)
|
||||
unit: Pct
|
||||
tips:
|
||||
Scheduler-Pipe Stall Rate:
|
||||
avg: AVG((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None))
|
||||
min: MIN((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None))
|
||||
max: MAX((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None))
|
||||
unit: Pct
|
||||
tips:
|
||||
Scratch Stall Rate:
|
||||
avg: AVG((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
min: MIN((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
max: MAX((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
unit: Pct
|
||||
tips:
|
||||
Insufficient SIMD Waveslots:
|
||||
avg: AVG(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
Insufficient SIMD VGPRs:
|
||||
avg: AVG(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
Insufficient SIMD SGPRs:
|
||||
avg: AVG(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
Insufficient CU LDS:
|
||||
avg: AVG(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
Insufficient CU Barriers:
|
||||
avg: AVG(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
Reached CU Workgroup Limit:
|
||||
avg: AVG(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
Reached CU Wavefront Limit:
|
||||
avg: AVG(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
+201
@@ -0,0 +1,201 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 600
|
||||
title: Workgroup Manager (SPI)
|
||||
metrics_description:
|
||||
Accelerator Utilization: The percent of cycles in the kernel where the accelerator
|
||||
was actively doing any work.
|
||||
Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
|
||||
kernel where the scheduler-pipes were actively doing any work.
|
||||
Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
|
||||
manager was actively doing any work.
|
||||
Shader Engine Utilization: The percent of total shader engine cycles in the kernel
|
||||
where any CU in a shader-engine was actively doing any work, normalized over
|
||||
all shader-engines. Low values (e.g., << 100%) indicate that the accelerator
|
||||
was not fully saturated by the kernel, or a potential load-imbalance issue.
|
||||
SIMD Utilization: The percent of total SIMD cycles in the kernel where any SIMD
|
||||
on a CU was actively doing any work, summed over all CUs. Low values (less than
|
||||
100%) indicate that the accelerator was not fully saturated by the kernel, or
|
||||
a potential load-imbalance issue.
|
||||
Dispatched Workgroups: The total number of workgroups forming this kernel launch.
|
||||
Dispatched Wavefronts: The total number of wavefronts, summed over all workgroups,
|
||||
forming this kernel launch.
|
||||
VGPR Writes: The average number of cycles spent initializing VGPRs at wave creation.
|
||||
SGPR Writes: The average number of cycles spent initializing SGPRs at wave creation.
|
||||
Not-scheduled Rate (Workgroup Manager): The percent of total scheduler-pipe cycles
|
||||
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
|
||||
within the workgroup manager rather than a lack of a CU or SIMD with sufficient
|
||||
resources.
|
||||
Not-scheduled Rate (Scheduler-Pipe): 'The percent of total scheduler-pipe cycles
|
||||
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
|
||||
within the scheduler-pipes rather than a lack of a CU or SIMD with sufficient
|
||||
resources. '
|
||||
Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
|
||||
where a workgroup could not be scheduled to a CU due to occupancy limitations
|
||||
(like a lack of a CU or SIMD with sufficient resources).
|
||||
Scratch Stall Rate: The percent of total shader-engine cycles in the kernel where
|
||||
a workgroup could not be scheduled to a CU due to lack of private (a.k.a., scratch)
|
||||
memory slots. While this can reach up to 100%, note that the actual occupancy
|
||||
limitations on a kernel using private memory are typically quite small (for
|
||||
example, less than 1% of the total number of waves that can be scheduled to
|
||||
an accelerator).
|
||||
Insufficient SIMD Waveslots: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available waveslots.
|
||||
Insufficient SIMD VGPRs: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available VGPRs.
|
||||
Insufficient SIMD SGPRs: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available SGPRs.
|
||||
Insufficient CU LDS: The percent of total CU cycles in the kernel where a workgroup
|
||||
could not be scheduled to a CU due to lack of available LDS.
|
||||
Insufficient CU Barriers: The percent of total CU cycles in the kernel where a
|
||||
workgroup could not be scheduled to a CU due to lack of available barriers.
|
||||
Reached CU Workgroup Limit: The percent of total CU cycles in the kernel where
|
||||
a workgroup could not be scheduled to a CU due to limits within the workgroup
|
||||
manager. This is expected to always be zero on CDNA2 or newer accelerators
|
||||
(and small for previous accelerators).
|
||||
Reached CU Wavefront Limit: The percent of total CU cycles in the kernel where
|
||||
a wavefront could not be scheduled to a CU due to limits within the workgroup
|
||||
manager. This is expected to always be zero on CDNA2 or newer accelerators
|
||||
(and small for previous accelerators).
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 601
|
||||
title: Workgroup manager utilizations
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Accelerator Utilization:
|
||||
avg: AVG(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
|
||||
min: MIN(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
|
||||
max: MAX(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
|
||||
unit: Pct
|
||||
Scheduler-Pipe Utilization:
|
||||
avg: AVG(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu
|
||||
* $se_per_gpu))
|
||||
min: MIN(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu
|
||||
* $se_per_gpu))
|
||||
max: MAX(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu
|
||||
* $se_per_gpu))
|
||||
unit: Pct
|
||||
Workgroup Manager Utilization:
|
||||
avg: AVG(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
|
||||
min: MIN(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
|
||||
max: MAX(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
|
||||
unit: Pct
|
||||
Shader Engine Utilization:
|
||||
avg: AVG(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
|
||||
min: MIN(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
|
||||
max: MAX(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
|
||||
unit: Pct
|
||||
SIMD Utilization:
|
||||
avg: AVG(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
Dispatched Workgroups:
|
||||
avg: AVG(SPI_CSN_NUM_THREADGROUPS)
|
||||
min: MIN(SPI_CSN_NUM_THREADGROUPS)
|
||||
max: MAX(SPI_CSN_NUM_THREADGROUPS)
|
||||
unit: Workgroups
|
||||
Dispatched Wavefronts:
|
||||
avg: AVG(SPI_CSN_WAVE)
|
||||
min: MIN(SPI_CSN_WAVE)
|
||||
max: MAX(SPI_CSN_WAVE)
|
||||
unit: Wavefronts
|
||||
VGPR Writes:
|
||||
avg: AVG((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
min: MIN((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
max: MAX((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
unit: Cycles/wave
|
||||
SGPR Writes:
|
||||
avg: AVG((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
min: MIN((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
max: MAX((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
unit: Cycles/wave
|
||||
- metric_table:
|
||||
id: 602
|
||||
title: Workgroup Manager - Resource Allocation
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Not-scheduled Rate (Workgroup Manager):
|
||||
avg: AVG((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
min: MIN((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
max: MAX((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
unit: Pct
|
||||
Not-scheduled Rate (Scheduler-Pipe):
|
||||
avg: AVG((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
min: MIN((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
max: MAX((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
unit: Pct
|
||||
Scheduler-Pipe Stall Rate:
|
||||
avg: AVG((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None))
|
||||
min: MIN((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None))
|
||||
max: MAX((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None))
|
||||
unit: Pct
|
||||
Scratch Stall Rate:
|
||||
avg: AVG((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
min: MIN((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
max: MAX((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
unit: Pct
|
||||
Insufficient SIMD Waveslots:
|
||||
avg: AVG(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
Insufficient SIMD VGPRs:
|
||||
avg: AVG(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
Insufficient SIMD SGPRs:
|
||||
avg: AVG(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
Insufficient CU LDS:
|
||||
avg: AVG(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
Insufficient CU Barriers:
|
||||
avg: AVG(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
Reached CU Workgroup Limit:
|
||||
avg: AVG(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
Reached CU Wavefront Limit:
|
||||
avg: AVG(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
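Because the generated file keeps descriptions in `metrics_description` and formulas in the `metric` tables, the two can drift apart if a metric is renamed in only one place. The sketch below checks that every metric has a description. The function name `missing_descriptions` is hypothetical, the panel is a hand-written dict rather than a parsed file, and nesting `metrics_description` under `Panel Config` is an assumption, since indentation is not visible in this diff view.

```python
def missing_descriptions(panel):
    """Return metric names from all metric tables that lack a description entry."""
    described = set(panel.get("metrics_description", {}))
    missing = []
    for source in panel.get("data source", []):
        table = source.get("metric_table", {})
        # A table may carry an empty `metric:` key, which loads as None.
        for name in table.get("metric") or {}:
            if name not in described:
                missing.append(name)
    return missing


# Hand-written excerpt mirroring the assumed structure of panel 600.
panel = {
    "id": 600,
    "title": "Workgroup Manager (SPI)",
    "metrics_description": {
        "Dispatched Workgroups": "The total number of workgroups forming this kernel launch.",
    },
    "data source": [
        {
            "metric_table": {
                "id": 601,
                "metric": {
                    "Dispatched Workgroups": {"avg": "AVG(SPI_CSN_NUM_THREADGROUPS)"},
                    "Dispatched Wavefronts": {"avg": "AVG(SPI_CSN_WAVE)"},
                },
            }
        }
    ],
}

print(missing_descriptions(panel))  # ['Dispatched Wavefronts']
```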
-142
@@ -1,142 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 700
|
||||
title: Wavefront
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 701
|
||||
title: Wavefront Launch Stats
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Grid Size:
|
||||
avg: AVG(Grid_Size)
|
||||
min: MIN(Grid_Size)
|
||||
max: MAX(Grid_Size)
|
||||
unit: Work Items
|
||||
tips:
|
||||
Workgroup Size:
|
||||
avg: AVG(Workgroup_Size)
|
||||
min: MIN(Workgroup_Size)
|
||||
max: MAX(Workgroup_Size)
|
||||
unit: Work Items
|
||||
tips:
|
||||
Total Wavefronts:
|
||||
avg: AVG(SPI_CSN_WAVE)
|
||||
min: MIN(SPI_CSN_WAVE)
|
||||
max: MAX(SPI_CSN_WAVE)
|
||||
unit: Wavefronts
|
||||
tips:
|
||||
Saved Wavefronts:
|
||||
avg: AVG(SQ_WAVES_SAVED)
|
||||
min: MIN(SQ_WAVES_SAVED)
|
||||
max: MAX(SQ_WAVES_SAVED)
|
||||
unit: Wavefronts
|
||||
tips:
|
||||
Restored Wavefronts:
|
||||
avg: AVG(SQ_WAVES_RESTORED)
|
||||
min: MIN(SQ_WAVES_RESTORED)
|
||||
max: MAX(SQ_WAVES_RESTORED)
|
||||
unit: Wavefronts
|
||||
tips:
|
||||
VGPRs:
|
||||
avg: AVG(Arch_VGPR)
|
||||
min: MIN(Arch_VGPR)
|
||||
max: MAX(Arch_VGPR)
|
||||
unit: Registers
|
||||
tips:
|
||||
AGPRs:
|
||||
avg: AVG(Accum_VGPR)
|
||||
min: MIN(Accum_VGPR)
|
||||
max: MAX(Accum_VGPR)
|
||||
unit: Registers
|
||||
tips:
|
||||
SGPRs:
|
||||
avg: AVG(SGPR)
|
||||
min: MIN(SGPR)
|
||||
max: MAX(SGPR)
|
||||
unit: Registers
|
||||
tips:
|
||||
LDS Allocation:
|
||||
avg: AVG(LDS_Per_Workgroup)
|
||||
min: MIN(LDS_Per_Workgroup)
|
||||
max: MAX(LDS_Per_Workgroup)
|
||||
unit: Bytes
|
||||
tips:
|
||||
Scratch Allocation:
|
||||
avg: AVG(Scratch_Per_Workitem)
|
||||
min: MIN(Scratch_Per_Workitem)
|
||||
max: MAX(Scratch_Per_Workitem)
|
||||
unit: Bytes/Workitem
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 702
|
||||
title: Wavefront Runtime Stats
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Kernel Time:
|
||||
avg: AVG((End_Timestamp - Start_Timestamp))
|
||||
min: MIN((End_Timestamp - Start_Timestamp))
|
||||
max: MAX((End_Timestamp - Start_Timestamp))
|
||||
unit: ns
|
||||
tips:
|
||||
Kernel Time (Cycles):
|
||||
avg: AVG($GRBM_GUI_ACTIVE_PER_XCD)
|
||||
min: MIN($GRBM_GUI_ACTIVE_PER_XCD)
|
||||
max: MAX($GRBM_GUI_ACTIVE_PER_XCD)
|
||||
unit: Cycle
|
||||
tips:
|
||||
Instructions per wavefront:
|
||||
avg: AVG((SQ_INSTS / SQ_WAVES))
|
||||
min: MIN((SQ_INSTS / SQ_WAVES))
|
||||
max: MAX((SQ_INSTS / SQ_WAVES))
|
||||
unit: Instr/wavefront
|
||||
tips:
|
||||
Wave Cycles:
|
||||
avg: AVG(((4 * SQ_WAVE_CYCLES) / $denom))
|
||||
min: MIN(((4 * SQ_WAVE_CYCLES) / $denom))
|
||||
max: MAX(((4 * SQ_WAVE_CYCLES) / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
Dependency Wait Cycles:
|
||||
avg: AVG(((4 * SQ_WAIT_ANY) / $denom))
|
||||
min: MIN(((4 * SQ_WAIT_ANY) / $denom))
|
||||
max: MAX(((4 * SQ_WAIT_ANY) / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
Issue Wait Cycles:
|
||||
avg: AVG(((4 * SQ_WAIT_INST_ANY) / $denom))
|
||||
min: MIN(((4 * SQ_WAIT_INST_ANY) / $denom))
|
||||
max: MAX(((4 * SQ_WAIT_INST_ANY) / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
Active Cycles:
|
||||
avg: AVG(((4 * SQ_ACTIVE_INST_ANY) / $denom))
|
||||
min: MIN(((4 * SQ_ACTIVE_INST_ANY) / $denom))
|
||||
max: MAX(((4 * SQ_ACTIVE_INST_ANY) / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
Wavefront Occupancy:
|
||||
avg: AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
min: MIN((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
max: MAX((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
unit: Wavefronts
|
||||
coll_level: SQ_LEVEL_WAVES
|
||||
tips:
|
||||
+173
@@ -0,0 +1,173 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 700
|
||||
title: Wavefront
|
||||
metrics_description:
|
||||
Grid Size: The total number of work-items (or, threads) launched as a part of
|
||||
the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied
|
||||
by the total workgroup (or, block) size.
|
||||
Workgroup Size: The total number of work-items (or, threads) in each workgroup
|
||||
(or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
|
||||
to the total block size.
|
||||
Total Wavefronts: "The total number of wavefronts launched as part of the kernel\
|
||||
\ dispatch. On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs,\
|
||||
\ the wavefront size is always 64 work-items. Thus, the total number of wavefronts\
|
||||
\ should be equivalent to the ceiling of grid size divided by 64."
|
||||
Saved Wavefronts: The total number of wavefronts saved at a context-save.
|
||||
Restored Wavefronts: The total number of wavefronts restored from a context-save.
|
||||
VGPRs: 'The number of architected vector general-purpose registers allocated for
|
||||
the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
|
||||
by the compiler due to allocation granularity.'
|
||||
AGPRs: 'The number of accumulation vector general-purpose registers allocated
|
||||
for the kernel, see AGPRs. Note: this may not exactly match the number of AGPRs
|
||||
requested by the compiler due to allocation granularity.'
|
||||
SGPRs: 'The number of scalar general-purpose registers allocated for the kernel,
|
||||
see SALU. Note: this may not exactly match the number of SGPRs requested by
|
||||
the compiler due to allocation granularity.'
|
||||
LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
|
||||
for this kernel. Note: This may also be larger than what was requested at compile
|
||||
time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
|
||||
Scratch Allocation: The number of bytes of scratch memory requested per work-item
|
||||
for this kernel. Scratch memory is used for stack memory on the accelerator,
|
||||
as well as for register spills and restores.
|
||||
Kernel Time: The total duration of the executed kernel.
|
||||
Kernel Time (Cycles): The total duration of the executed kernel in cycles.
|
||||
Instructions per wavefront: The average number of instructions (of all types)
|
||||
executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.
|
||||
Wave Cycles: The number of cycles a wavefront in the kernel dispatch spent resident
|
||||
on a compute unit per normalization unit. This is averaged over all wavefronts
|
||||
in a kernel dispatch.
|
||||
Dependency Wait Cycles: The number of cycles a wavefront in the kernel dispatch
|
||||
spent waiting on a dependency per normalization unit. This is averaged over
|
||||
all wavefronts in a kernel dispatch.
|
||||
Issue Wait Cycles: The number of cycles a wavefront in the kernel dispatch was
|
||||
unable to issue an instruction for any reason (e.g., execution pipe back-pressure,
|
||||
arbitration loss, etc.) per normalization unit. This counter is incremented
|
||||
at every cycle by all wavefronts on a CU unable to issue an instruction. As
|
||||
such, it is most useful to get a sense of how waves were spending their time,
|
||||
rather than identification of a precise limiter because another wave could be
|
||||
actively executing while a wave is issue stalled. The sum of this metric, Dependency
|
||||
Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
|
||||
Active Cycles: The average number of cycles a wavefront in the kernel dispatch
|
||||
was actively executing instructions per normalization unit. This measurement
|
||||
is made on a per-wavefront basis, and may include cycles that another wavefront
|
||||
spent actively executing (on another execution unit, for example) or was stalled.
|
||||
As such, it is most useful to get a sense of how waves were spending their time,
|
||||
rather than identification of a precise limiter. The sum of this metric, Issue
|
||||
Wait Cycles and Dependency Wait Cycles should be equal to the total Wave Cycles
|
||||
metric.
|
||||
Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
|
||||
over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
||||
kernels (less than 1ms).'
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 701
|
||||
title: Wavefront Launch Stats
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Grid Size:
|
||||
avg: AVG(Grid_Size)
|
||||
min: MIN(Grid_Size)
|
||||
max: MAX(Grid_Size)
|
||||
unit: Work Items
|
||||
Workgroup Size:
|
||||
avg: AVG(Workgroup_Size)
|
||||
min: MIN(Workgroup_Size)
|
||||
max: MAX(Workgroup_Size)
|
||||
unit: Work Items
|
||||
Total Wavefronts:
|
||||
avg: AVG(SPI_CSN_WAVE)
|
||||
min: MIN(SPI_CSN_WAVE)
|
||||
max: MAX(SPI_CSN_WAVE)
|
||||
unit: Wavefronts
|
||||
Saved Wavefronts:
|
||||
avg: AVG(SQ_WAVES_SAVED)
|
||||
min: MIN(SQ_WAVES_SAVED)
|
||||
max: MAX(SQ_WAVES_SAVED)
|
||||
unit: Wavefronts
|
||||
Restored Wavefronts:
|
||||
avg: AVG(SQ_WAVES_RESTORED)
|
||||
min: MIN(SQ_WAVES_RESTORED)
|
||||
max: MAX(SQ_WAVES_RESTORED)
|
||||
unit: Wavefronts
|
||||
VGPRs:
|
||||
avg: AVG(Arch_VGPR)
|
||||
min: MIN(Arch_VGPR)
|
||||
max: MAX(Arch_VGPR)
|
||||
unit: Registers
|
||||
AGPRs:
|
||||
avg: AVG(Accum_VGPR)
|
||||
min: MIN(Accum_VGPR)
|
||||
max: MAX(Accum_VGPR)
|
||||
unit: Registers
|
||||
SGPRs:
|
||||
avg: AVG(SGPR)
|
||||
min: MIN(SGPR)
|
||||
max: MAX(SGPR)
|
||||
unit: Registers
|
||||
LDS Allocation:
|
||||
avg: AVG(LDS_Per_Workgroup)
|
||||
min: MIN(LDS_Per_Workgroup)
|
||||
max: MAX(LDS_Per_Workgroup)
|
||||
unit: Bytes
|
||||
Scratch Allocation:
|
||||
avg: AVG(Scratch_Per_Workitem)
|
||||
min: MIN(Scratch_Per_Workitem)
|
||||
max: MAX(Scratch_Per_Workitem)
|
||||
unit: Bytes/Workitem
|
||||
- metric_table:
|
||||
id: 702
|
||||
title: Wavefront Runtime Stats
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Kernel Time:
|
||||
avg: AVG((End_Timestamp - Start_Timestamp))
|
||||
min: MIN((End_Timestamp - Start_Timestamp))
|
||||
max: MAX((End_Timestamp - Start_Timestamp))
|
||||
unit: ns
|
||||
Kernel Time (Cycles):
|
||||
avg: AVG($GRBM_GUI_ACTIVE_PER_XCD)
|
||||
min: MIN($GRBM_GUI_ACTIVE_PER_XCD)
|
||||
max: MAX($GRBM_GUI_ACTIVE_PER_XCD)
|
||||
unit: Cycle
|
||||
Instructions per wavefront:
|
||||
avg: AVG((SQ_INSTS / SQ_WAVES))
|
||||
min: MIN((SQ_INSTS / SQ_WAVES))
|
||||
max: MAX((SQ_INSTS / SQ_WAVES))
|
||||
unit: Instr/wavefront
|
||||
Wave Cycles:
|
||||
avg: AVG(((4 * SQ_WAVE_CYCLES) / $denom))
|
||||
min: MIN(((4 * SQ_WAVE_CYCLES) / $denom))
|
||||
max: MAX(((4 * SQ_WAVE_CYCLES) / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
Dependency Wait Cycles:
|
||||
avg: AVG(((4 * SQ_WAIT_ANY) / $denom))
|
||||
min: MIN(((4 * SQ_WAIT_ANY) / $denom))
|
||||
max: MAX(((4 * SQ_WAIT_ANY) / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
Issue Wait Cycles:
|
||||
avg: AVG(((4 * SQ_WAIT_INST_ANY) / $denom))
|
||||
min: MIN(((4 * SQ_WAIT_INST_ANY) / $denom))
|
||||
max: MAX(((4 * SQ_WAIT_INST_ANY) / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
Active Cycles:
|
||||
avg: AVG(((4 * SQ_ACTIVE_INST_ANY) / $denom))
|
||||
min: MIN(((4 * SQ_ACTIVE_INST_ANY) / $denom))
|
||||
max: MAX(((4 * SQ_ACTIVE_INST_ANY) / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
Wavefront Occupancy:
|
||||
avg: AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
min: MIN((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
max: MAX((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
unit: Wavefronts
|
||||
coll_level: SQ_LEVEL_WAVES
|
||||
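Two relationships stated in the Wavefront descriptions above can be written out as arithmetic: the expected wavefront count is the ceiling of grid size divided by 64, and Wave Cycles should equal the sum of Dependency Wait Cycles, Issue Wait Cycles and Active Cycles. The following is a rough sketch only. The function names and sample values are hypothetical, and profiled counters may not match these identities exactly.

```python
WAVEFRONT_SIZE = 64  # work-items per wavefront, per the Total Wavefronts description


def expected_wavefronts(grid_size: int) -> int:
    """Ceiling of grid_size / 64, computed with integer arithmetic."""
    return -(-grid_size // WAVEFRONT_SIZE)


def wave_cycles_residual(wave, dependency_wait, issue_wait, active):
    """Difference between Wave Cycles and the sum of its three components."""
    return wave - (dependency_wait + issue_wait + active)


# Hypothetical values, not taken from a real profile.
print(expected_wavefronts(1000))
print(wave_cycles_residual(wave=900, dependency_wait=500, issue_wait=150, active=250))
```

A nonzero residual in a real profile is a hint to review how the counters were collected, not proof of an error.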
-129
@@ -1,129 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 1000
|
||||
title: Compute Units - Instruction Mix
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1001
|
||||
title: Overall Instruction Mix
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
LDS:
|
||||
avg: AVG((SQ_INSTS_LDS / $denom))
|
||||
min: MIN((SQ_INSTS_LDS / $denom))
|
||||
max: MAX((SQ_INSTS_LDS / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
SALU:
|
||||
avg: AVG((SQ_INSTS_SALU / $denom))
|
||||
min: MIN((SQ_INSTS_SALU / $denom))
|
||||
max: MAX((SQ_INSTS_SALU / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
SMEM:
|
||||
avg: AVG((SQ_INSTS_SMEM / $denom))
|
||||
min: MIN((SQ_INSTS_SMEM / $denom))
|
||||
max: MAX((SQ_INSTS_SMEM / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
Branch:
|
||||
avg: AVG((SQ_INSTS_BRANCH / $denom))
|
||||
min: MIN((SQ_INSTS_BRANCH / $denom))
|
||||
max: MAX((SQ_INSTS_BRANCH / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 1002
|
||||
title: VALU Arithmetic Instr Mix
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
|
||||
- metric_table:
|
||||
id: 1003
|
||||
title: VMEM Instr Mix
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Global/Generic Instr:
|
||||
avg: AVG((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
Global/Generic Read:
|
||||
avg: AVG((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
Global/Generic Write:
|
||||
avg: AVG((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
Global/Generic Atomic:
|
||||
avg: AVG((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
Spill/Stack Instr:
|
||||
avg: AVG((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
Spill/Stack Read:
|
||||
avg: AVG((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
Spill/Stack Write:
|
||||
avg: AVG((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
Spill/Stack Atomic:
|
||||
avg: AVG((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 1004
|
||||
title: MFMA Arithmetic Instr Mix
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
+189
@@ -0,0 +1,189 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 1000
|
||||
title: Compute Units - Instruction Mix
|
||||
metrics_description:
|
||||
VALU: The total number of vector arithmetic logic unit (VALU) operations issued.
|
||||
These are the workhorses of the compute unit, and are used to execute a wide
|
||||
range of instruction types including floating point operations, non-uniform
|
||||
address calculations, transcendental operations, integer operations, shifts,
|
||||
conditional evaluation, etc.
|
||||
VMEM: The total number of vector memory operations issued. These include most
|
||||
loads, stores and atomic operations and all accesses to generic, global, private
|
||||
and texture memory.
|
||||
LDS: The total number of LDS (also known as shared memory) operations issued.
|
||||
These include loads, stores, atomics, and HIP's __shfl operations.
|
||||
MFMA: The total number of matrix fused multiply-add instructions issued.
|
||||
SALU: The total number of scalar arithmetic logic unit (SALU) operations issued.
|
||||
Typically these are used for address calculations, literal constants, and other
|
||||
operations that are provably uniform across a wavefront. Although scalar memory
|
||||
(SMEM) operations are issued by the SALU, they are counted separately in this
|
||||
section.
|
||||
SMEM: The total number of scalar memory (SMEM) operations issued. These are typically
|
||||
used for loading kernel arguments, base-pointers and loads from HIP's __constant__
|
||||
memory.
|
||||
Branch: The total number of branch operations issued. These typically consist
|
||||
of jump or branch operations and are used to implement control flow.
|
||||
INT32: The total number of instructions operating on 32-bit integer operands issued
|
||||
to the VALU per normalization unit.
|
||||
INT64: The total number of instructions operating on 64-bit integer operands issued
|
||||
to the VALU per normalization unit.
|
||||
F16-ADD: The total number of addition instructions operating on 16-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F16-MUL: The total number of multiplication instructions operating on 16-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F16-FMA: The total number of fused multiply-add instructions operating on 16-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F16-Trans: The total number of transcendental instructions (such as sqrt) operating
|
||||
on 16-bit floating-point operands issued to the VALU per normalization unit.
|
||||
F32-ADD: The total number of addition instructions operating on 32-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F32-MUL: The total number of multiplication instructions operating on 32-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F32-FMA: The total number of fused multiply-add instructions operating on 32-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F32-Trans: The total number of transcendental instructions (such as sqrt) operating
|
||||
on 32-bit floating-point operands issued to the VALU per normalization unit.
|
||||
F64-ADD: The total number of addition instructions operating on 64-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F64-MUL: The total number of multiplication instructions operating on 64-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F64-FMA: The total number of fused multiply-add instructions operating on 64-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F64-Trans: The total number of transcendental instructions (such as sqrt) operating
|
||||
on 64-bit floating-point operands issued to the VALU per normalization unit.
|
||||
Conversion: "The total number of type conversion instructions (such as converting\
|
||||
\ data to or from F32\u2194F64) issued to the VALU per normalization unit."
|
||||
Global/Generic Instr: The total number of global & generic memory instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Read: The total number of global & generic memory read instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Write: The total number of global & generic memory write instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Atomic: The total number of global & generic memory atomic (with
|
||||
and without return) instructions executed on all compute units on the accelerator,
|
||||
per normalization unit.
|
||||
Spill/Stack Instr: The total number of spill/stack memory instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Read: The total number of spill/stack memory read instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Write: The total number of spill/stack memory write instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Atomic: The total number of spill/stack memory atomic (with and without
|
||||
return) instructions executed on all compute units on the accelerator, per normalization
|
||||
unit. Typically unused, as these memory operations are generally used to implement
|
||||
thread-local storage.
|
||||
MFMA-I8: The total number of 8-bit integer MFMA instructions issued per normalization
|
||||
unit.
|
||||
MFMA-F8: The total number of 8-bit floating point MFMA instructions issued per
|
||||
normalization unit. This is supported in AMD Instinct MI300 series and later
|
||||
only.
|
||||
MFMA-F16: The total number of 16-bit floating point MFMA instructions issued per
|
||||
normalization unit.
|
||||
MFMA-BF16: The total number of 16-bit brain floating point MFMA instructions issued
|
||||
per normalization unit.
|
||||
MFMA-F32: The total number of 32-bit floating-point MFMA instructions issued per
|
||||
normalization unit.
|
||||
MFMA-F64: The total number of 64-bit floating-point MFMA instructions issued per
|
||||
normalization unit.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1001
|
||||
title: Overall Instruction Mix
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
LDS:
|
||||
avg: AVG((SQ_INSTS_LDS / $denom))
|
||||
min: MIN((SQ_INSTS_LDS / $denom))
|
||||
max: MAX((SQ_INSTS_LDS / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
SALU:
|
||||
avg: AVG((SQ_INSTS_SALU / $denom))
|
||||
min: MIN((SQ_INSTS_SALU / $denom))
|
||||
max: MAX((SQ_INSTS_SALU / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
SMEM:
|
||||
avg: AVG((SQ_INSTS_SMEM / $denom))
|
||||
min: MIN((SQ_INSTS_SMEM / $denom))
|
||||
max: MAX((SQ_INSTS_SMEM / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
Branch:
|
||||
avg: AVG((SQ_INSTS_BRANCH / $denom))
|
||||
min: MIN((SQ_INSTS_BRANCH / $denom))
|
||||
max: MAX((SQ_INSTS_BRANCH / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
- metric_table:
|
||||
id: 1002
|
||||
title: VALU Arithmetic Instruction Mix
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric: {}
|
||||
- metric_table:
|
||||
id: 1003
|
||||
title: VMEM Instruction Mix
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Global/Generic Instr:
|
||||
avg: AVG((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
Global/Generic Read:
|
||||
avg: AVG((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
Global/Generic Write:
|
||||
avg: AVG((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
Global/Generic Atomic:
|
||||
avg: AVG((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
Spill/Stack Instr:
|
||||
avg: AVG((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
Spill/Stack Read:
|
||||
avg: AVG((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
Spill/Stack Write:
|
||||
avg: AVG((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
Spill/Stack Atomic:
|
||||
avg: AVG((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
- metric_table:
|
||||
id: 1004
|
||||
title: MFMA Arithmetic Instruction Mix
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric: {}
|
||||
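The new `metrics_description` block sits next to `data source`, so a Description column can be produced by a plain name lookup. The sketch below is one way that join could look, not the tool's actual implementation. The dictionary mirrors the YAML layout shown above with shortened description text, and `rows_with_description` is a hypothetical helper.

```python
# A trimmed, in-memory stand-in for the panel shown above.
# Description strings are shortened for the example.
panel = {
    "Panel Config": {
        "id": 1000,
        "title": "Compute Units - Instruction Mix",
        "metrics_description": {
            "LDS": "The total number of LDS operations issued.",
            "Branch": "The total number of branch operations issued.",
        },
        "data source": [
            {"metric_table": {"id": 1001,
                              "title": "Overall Instruction Mix",
                              "metric": {"LDS": {}, "SALU": {}, "Branch": {}}}},
            {"metric_table": {"id": 1002,
                              "title": "VALU Arithmetic Instruction Mix",
                              "metric": {}}},
        ],
    }
}


def rows_with_description(panel):
    """Yield (table id, metric name, description) for every metric row."""
    config = panel["Panel Config"]
    descriptions = config.get("metrics_description", {})
    rows = []
    for source in config["data source"]:
        table = source["metric_table"]
        for name in table["metric"]:
            # Metrics without an entry fall back to an empty description.
            rows.append((table["id"], name, descriptions.get(name, "")))
    return rows


for row in rows_with_description(panel):
    print(row)
```

Tables whose `metric` mapping is empty (`metric: {}`) simply contribute no rows, which matches how tables 1002 and 1004 appear in this file.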
-84
@@ -1,84 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 1100
|
||||
title: Compute Units - Compute Pipeline
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1101
|
||||
title: Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
peak: Peak
|
||||
pop: Pct of Peak
|
||||
tips: Tips
|
||||
metric:
|
||||
|
||||
- metric_table:
|
||||
id: 1102
|
||||
title: Pipeline Stats
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
IPC:
|
||||
avg: AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))
|
||||
min: MIN((SQ_INSTS / SQ_BUSY_CU_CYCLES))
|
||||
max: MAX((SQ_INSTS / SQ_BUSY_CU_CYCLES))
|
||||
unit: Instr/cycle
|
||||
tips:
|
||||
IPC (Issued):
|
||||
avg: AVG(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
|
||||
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
|
||||
/ SQ_ACTIVE_INST_ANY))
|
||||
min: MIN(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
|
||||
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
|
||||
/ SQ_ACTIVE_INST_ANY))
|
||||
max: MAX(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
|
||||
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
|
||||
/ SQ_ACTIVE_INST_ANY))
|
||||
unit: Instr/cycle
|
||||
tips:
|
||||
SALU Utilization:
|
||||
avg: AVG((((100 * SQ_ACTIVE_INST_SCA) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
|
||||
min: MIN((((100 * SQ_ACTIVE_INST_SCA) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
|
||||
max: MAX((((100 * SQ_ACTIVE_INST_SCA) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
|
||||
unit: pct
|
||||
tips:
|
||||
VALU Utilization:
|
||||
avg: AVG((((100 * SQ_ACTIVE_INST_VALU) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
|
||||
min: MIN((((100 * SQ_ACTIVE_INST_VALU) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
|
||||
max: MAX((((100 * SQ_ACTIVE_INST_VALU) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
|
||||
unit: pct
|
||||
tips:
|
||||
VALU Active Threads:
|
||||
avg: AVG(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
|
||||
!= 0) else None))
|
||||
min: MIN(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
|
||||
!= 0) else None))
|
||||
max: MAX(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
|
||||
!= 0) else None))
|
||||
unit: Threads
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 1103
|
||||
title: Arithmetic Operations
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
+147
@@ -0,0 +1,147 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 1100
|
||||
title: Compute Units - Compute Pipeline
|
||||
metrics_description:
|
||||
VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
|
||||
This is also presented as a percent of the peak theoretical FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any floating-point
|
||||
operations from MFMA instructions.'
|
||||
VALU IOPs: 'The total integer operations executed per second on the VALU. This
|
||||
is also presented as a percent of the peak theoretical IOPs achievable on the
|
||||
specific accelerator. Note: this does not include any integer operations from
|
||||
MFMA instructions.'
|
||||
MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
|
||||
executed per second. Note: this does not include any 16-bit brain floating point
|
||||
operations from VALU instructions. This is also presented as a percent of the
|
||||
peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F16 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 32-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F32 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 64-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F64 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA IOPs (INT8): 'The total number of 8-bit integer MFMA operations executed
|
||||
per second. Note: this does not include any 8-bit integer operations from VALU
|
||||
instructions. This is also presented as a percent of the peak theoretical INT8
|
||||
MFMA operations achievable on the specific accelerator.'
|
||||
IPC: The ratio of the total number of instructions executed on the CU over the
|
||||
total active CU cycles.
|
||||
IPC (Issued): The ratio of the total number of (non-internal) instructions issued
|
||||
over the number of cycles where the scheduler was actively working on issuing
|
||||
instructions.
|
||||
SALU Utilization: Indicates what percent of the kernel's duration the SALU was
|
||||
busy executing instructions. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing SALU / SMEM instructions over the total CU cycles.
|
||||
VALU Utilization: Indicates what percent of the kernel's duration the VALU was
|
||||
busy executing instructions. Does not include VMEM operations. Computed as the
|
||||
ratio of the total number of cycles spent by the scheduler issuing VALU instructions
|
||||
over the total CU cycles.
|
||||
VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
|
||||
was busy executing instructions, including both global/generic and spill/scratch
|
||||
operations (see the VMEM instruction count metrics for more detail). Does not
|
||||
include VALU operations. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing VMEM instructions over the total CU cycles.
|
||||
Branch Utilization: Indicates what percent of the kernel's duration the branch
|
||||
unit was busy executing instructions. Computed as the ratio of the total number
|
||||
of cycles spent by the scheduler issuing branch instructions over the total
|
||||
CU cycles.
|
||||
VALU Active Threads: Indicates the average level of divergence within a wavefront
|
||||
over the lifetime of the kernel. The number of work-items that were active in
|
||||
a wavefront during execution of each VALU instruction, time-averaged over all
|
||||
VALU instructions run on all wavefronts in the kernel.
|
||||
MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
|
||||
was busy executing instructions. Computed as the ratio of the total number of
|
||||
cycles the MFMA unit was busy over the total CU cycles.
|
||||
MFMA Instruction Cycles: The average duration of MFMA instructions in this kernel
|
||||
in cycles. Computed as the ratio of the total number of cycles the MFMA unit
|
||||
was busy over the total number of MFMA instructions.
|
||||
VMEM Latency: The average number of round-trip cycles (that is, from issue to
|
||||
data return / acknowledgment) required for a VMEM instruction to complete.
|
||||
SMEM Latency: The average number of round-trip cycles (that is, from issue to
|
||||
data return / acknowledgment) required for a SMEM instruction to complete.
|
||||
FLOPs (Total): The total number of floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
IOPs (Total): The total number of integer operations executed on either the VALU
|
||||
or MFMA units, per normalization unit.
|
||||
F16 OPs: The total number of 16-bit floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
BF16 OPs: The total number of 16-bit brain floating-point operations executed
|
||||
on either the VALU or MFMA units, per normalization unit.
|
||||
F32 OPs: The total number of 32-bit floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
F64 OPs: The total number of 64-bit floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
INT8 OPs: The total number of 8-bit integer operations executed on either the
|
||||
VALU or MFMA units, per normalization unit.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1101
|
||||
title: Compute Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
peak: Peak
|
||||
pop: Pct of Peak
|
||||
metric: {}
|
||||
- metric_table:
|
||||
id: 1102
|
||||
title: Pipeline Statistics
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
IPC:
|
||||
avg: AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))
|
||||
min: MIN((SQ_INSTS / SQ_BUSY_CU_CYCLES))
|
||||
max: MAX((SQ_INSTS / SQ_BUSY_CU_CYCLES))
|
||||
unit: Instr/cycle
|
||||
IPC (Issued):
|
||||
avg: AVG(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
|
||||
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
|
||||
/ SQ_ACTIVE_INST_ANY))
|
||||
min: MIN(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
|
||||
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
|
||||
/ SQ_ACTIVE_INST_ANY))
|
||||
max: MAX(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
|
||||
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
|
||||
/ SQ_ACTIVE_INST_ANY))
|
||||
unit: Instr/cycle
|
||||
SALU Utilization:
|
||||
avg: AVG((((100 * SQ_ACTIVE_INST_SCA) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
|
||||
min: MIN((((100 * SQ_ACTIVE_INST_SCA) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
|
||||
max: MAX((((100 * SQ_ACTIVE_INST_SCA) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
|
||||
unit: pct
|
||||
VALU Utilization:
|
||||
avg: AVG((((100 * SQ_ACTIVE_INST_VALU) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
|
||||
min: MIN((((100 * SQ_ACTIVE_INST_VALU) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
|
||||
max: MAX((((100 * SQ_ACTIVE_INST_VALU) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
|
||||
unit: pct
|
||||
VALU Active Threads:
|
||||
avg: AVG(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
|
||||
!= 0) else None))
|
||||
min: MIN(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
|
||||
!= 0) else None))
|
||||
max: MAX(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
|
||||
!= 0) else None))
|
||||
unit: Threads
|
||||
- metric_table:
|
||||
id: 1103
|
||||
title: Arithmetic Operations
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric: {}
|
||||
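The `VALU Active Threads` rows guard the division with `... if (SQ_ACTIVE_INST_VALU != 0) else None`. A small Python sketch of that guard follows. It assumes that the aggregation step skips `None` entries, which this diff does not confirm, and the function names and sample values are hypothetical.

```python
def valu_active_threads(thread_cycles_valu, active_inst_valu):
    # Same shape as the config expression: divide only when the
    # denominator counter is non-zero, otherwise report no value.
    if active_inst_valu != 0:
        return thread_cycles_valu / active_inst_valu
    return None


def aggregate(values):
    """Compute avg/min/max while skipping None (an assumption of this sketch)."""
    present = [v for v in values if v is not None]
    if not present:
        return {"avg": None, "min": None, "max": None}
    return {
        "avg": sum(present) / len(present),
        "min": min(present),
        "max": max(present),
    }


# Hypothetical samples: (SQ_THREAD_CYCLES_VALU, SQ_ACTIVE_INST_VALU).
samples = [(6400, 100), (0, 0), (3200, 100)]

stats = aggregate([valu_active_threads(t, a) for t, a in samples])
print(stats)
```

Returning `None` instead of raising keeps a dispatch that issued no VALU instructions from breaking the whole table.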
-118
@@ -1,118 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 1200
|
||||
title: Local Data Share (LDS)
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1201
|
||||
title: Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Utilization:
|
||||
value: AVG(((100 * SQ_LDS_IDX_ACTIVE) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
Access Rate:
|
||||
value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
Theoretical Bandwidth:
|
||||
value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
Bank Conflict Rate:
|
||||
value: AVG((((SQ_LDS_BANK_CONFLICT * 3.125) / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
comparable: false # for now
|
||||
cli_style: simple_bar
|
||||
|
||||
- metric_table:
|
||||
id: 1202
|
||||
title: LDS Stats
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
LDS Instrs:
|
||||
avg: AVG((SQ_INSTS_LDS / $denom))
|
||||
min: MIN((SQ_INSTS_LDS / $denom))
|
||||
max: MAX((SQ_INSTS_LDS / $denom))
|
||||
unit: (Instr + $normUnit)
|
||||
tips:
|
||||
Theoretical Bandwidth:
|
||||
avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ $denom))
|
||||
min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ $denom))
|
||||
max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
tips:
|
||||
LDS Latency:
|
||||
avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None))
|
||||
min: MIN(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None))
|
||||
max: MAX(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None))
|
||||
unit: Cycles
|
||||
coll_level: SQ_INST_LEVEL_LDS
|
||||
tips:
|
||||
Bank Conflicts/Access:
|
||||
avg: AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
|
||||
min: MIN(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
|
||||
max: MAX(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
|
||||
unit: Conflicts/Access
|
||||
tips:
|
||||
Index Accesses:
|
||||
avg: AVG((SQ_LDS_IDX_ACTIVE / $denom))
|
||||
min: MIN((SQ_LDS_IDX_ACTIVE / $denom))
|
||||
max: MAX((SQ_LDS_IDX_ACTIVE / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
Atomic Return Cycles:
|
||||
avg: AVG((SQ_LDS_ATOMIC_RETURN / $denom))
|
||||
min: MIN((SQ_LDS_ATOMIC_RETURN / $denom))
|
||||
max: MAX((SQ_LDS_ATOMIC_RETURN / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
Bank Conflict:
|
||||
avg: AVG((SQ_LDS_BANK_CONFLICT / $denom))
|
||||
min: MIN((SQ_LDS_BANK_CONFLICT / $denom))
|
||||
max: MAX((SQ_LDS_BANK_CONFLICT / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
Addr Conflict:
|
||||
avg: AVG((SQ_LDS_ADDR_CONFLICT / $denom))
|
||||
min: MIN((SQ_LDS_ADDR_CONFLICT / $denom))
|
||||
max: MAX((SQ_LDS_ADDR_CONFLICT / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
Unaligned Stall:
|
||||
avg: AVG((SQ_LDS_UNALIGNED_STALL / $denom))
|
||||
min: MIN((SQ_LDS_UNALIGNED_STALL / $denom))
|
||||
max: MAX((SQ_LDS_UNALIGNED_STALL / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
Mem Violations:
|
||||
avg: AVG((SQ_LDS_MEM_VIOLATIONS / $denom))
|
||||
min: MIN((SQ_LDS_MEM_VIOLATIONS / $denom))
|
||||
max: MAX((SQ_LDS_MEM_VIOLATIONS / $denom))
|
||||
unit: (Accesses + $normUnit)
|
||||
tips:
|
||||
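The `Bank Conflict Rate` and `Bank Conflicts/Access` expressions in this file share the same guarded ratio and differ only by the constant 3.125 (numerically 100/32). The following Python sketch makes that relationship explicit. The function names and counter values are hypothetical, and nothing here is taken from the tool's evaluator.

```python
def conflicts_per_access(bank_conflict, idx_active):
    # SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT),
    # guarded against a zero denominator as in the config.
    uncontended = idx_active - bank_conflict
    return bank_conflict / uncontended if uncontended != 0 else None


def bank_conflict_rate(bank_conflict, idx_active):
    # Same ratio with the numerator scaled by 3.125, as in the config.
    uncontended = idx_active - bank_conflict
    if uncontended == 0:
        return None
    return (bank_conflict * 3.125) / uncontended


# Hypothetical counter values for a single dispatch.
ratio = conflicts_per_access(200, 1000)
rate = bank_conflict_rate(200, 1000)
print(ratio, rate)
```

With these inputs the rate is exactly 3.125 times the per-access ratio, which is the sense in which one is a rescaled form of the other.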
+141
@@ -0,0 +1,141 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 1200
|
||||
title: Local Data Share (LDS)
|
||||
metrics_description:
|
||||
Utilization: Indicates what percent of the kernel's duration the LDS was actively
|
||||
executing instructions (including, but not limited to, load, store, atomic and
|
||||
HIP's __shfl operations). Calculated as the ratio of the total number of cycles
|
||||
LDS was active over the total CU cycles.
|
||||
Access Rate: Indicates the percentage of SIMDs in the VALU actively issuing LDS
|
||||
instructions, averaged over the lifetime of the kernel. Calculated as the ratio
|
||||
of the total number of cycles spent by the scheduler issuing LDS instructions
|
||||
over the total CU cycles.
|
||||
Theoretical Bandwidth: Indicates the maximum number of bytes that could have been
|
||||
loaded from, stored to, or atomically updated in the LDS per normalization unit.
|
||||
Does not take into account the execution mask of the wavefront when the instruction
|
||||
was executed.
|
||||
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
|
||||
servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing
|
||||
bank conflicts over the number of LDS cycles that would have been required to
|
||||
move the same amount of data in an uncontended access.
|
||||
LDS Instructions: The total number of LDS instructions (including, but not limited
|
||||
to, read/write/atomics and HIP's __shfl instructions) executed per normalization
|
||||
unit.
|
||||
LDS Latency: The average number of round-trip cycles (that is, from issue to data return
|
||||
/ acknowledgment) required for an LDS instruction to complete.
|
||||
Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS scheduler
|
||||
due to bank conflicts (as determined by the conflict resolution hardware) to
|
||||
the base number of cycles that would be spent in the LDS scheduler in a completely
|
||||
uncontended case. This is the unnormalized form of the Bank Conflict Rate.
|
||||
Index Accesses: The total number of cycles spent in the LDS scheduler over all
|
||||
operations per normalization unit.
|
||||
Atomic Return Cycles: The total number of cycles spent on LDS atomics with return
|
||||
per normalization unit.
|
||||
Bank Conflict: The total number of cycles spent in the LDS scheduler due to bank
|
||||
conflicts (as determined by the conflict resolution hardware) per normalization
|
||||
unit.
|
||||
Addr Conflict: The total number of cycles spent in the LDS scheduler due to address
|
||||
conflicts (as determined by the conflict resolution hardware) per normalization
|
||||
unit.
|
||||
Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
|
||||
stalls from non-dword aligned addresses per normalization unit.
|
||||
Mem Violations: "The total number of out-of-bounds accesses made to the LDS, per\
|
||||
\ normalization unit. This is unused and expected to be zero in most configurations\
|
||||
\ for modern CDNA\u2122 accelerators."
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1201
|
||||
title: LDS Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
metric:
|
||||
Utilization:
|
||||
value: AVG(((100 * SQ_LDS_IDX_ACTIVE) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: Pct of Peak
|
||||
Access Rate:
|
||||
value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: Pct of Peak
|
||||
Theoretical Bandwidth:
|
||||
value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
|
||||
unit: Pct of Peak
|
||||
Bank Conflict Rate:
|
||||
value: AVG((((SQ_LDS_BANK_CONFLICT * 3.125) / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
|
||||
unit: Pct of Peak
|
||||
comparable: false
|
||||
cli_style: simple_bar
|
||||
tui_style: simple_bar
|
||||
- metric_table:
|
||||
id: 1202
|
||||
title: LDS Statistics
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
LDS Instructions:
|
||||
avg: AVG((SQ_INSTS_LDS / $denom))
|
||||
min: MIN((SQ_INSTS_LDS / $denom))
|
||||
max: MAX((SQ_INSTS_LDS / $denom))
|
||||
unit: (Instr + $normUnit)
|
||||
Theoretical Bandwidth:
|
||||
avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ $denom))
|
||||
min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ $denom))
|
||||
max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
LDS Latency:
|
||||
avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
|
||||
None))
|
||||
min: MIN(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
|
||||
None))
|
||||
max: MAX(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
|
||||
None))
|
||||
unit: Cycles
|
||||
coll_level: SQ_INST_LEVEL_LDS
|
||||
Bank Conflicts/Access:
|
||||
avg: AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
|
||||
min: MIN(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
|
||||
max: MAX(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
|
||||
unit: Conflicts/Access
|
||||
Index Accesses:
|
||||
avg: AVG((SQ_LDS_IDX_ACTIVE / $denom))
|
||||
min: MIN((SQ_LDS_IDX_ACTIVE / $denom))
|
||||
max: MAX((SQ_LDS_IDX_ACTIVE / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
Atomic Return Cycles:
|
||||
avg: AVG((SQ_LDS_ATOMIC_RETURN / $denom))
|
||||
min: MIN((SQ_LDS_ATOMIC_RETURN / $denom))
|
||||
max: MAX((SQ_LDS_ATOMIC_RETURN / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
Bank Conflict:
|
||||
avg: AVG((SQ_LDS_BANK_CONFLICT / $denom))
|
||||
min: MIN((SQ_LDS_BANK_CONFLICT / $denom))
|
||||
max: MAX((SQ_LDS_BANK_CONFLICT / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
Addr Conflict:
|
||||
avg: AVG((SQ_LDS_ADDR_CONFLICT / $denom))
|
||||
min: MIN((SQ_LDS_ADDR_CONFLICT / $denom))
|
||||
max: MAX((SQ_LDS_ADDR_CONFLICT / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
Unaligned Stall:
|
||||
avg: AVG((SQ_LDS_UNALIGNED_STALL / $denom))
|
||||
min: MIN((SQ_LDS_UNALIGNED_STALL / $denom))
|
||||
max: MAX((SQ_LDS_UNALIGNED_STALL / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
Mem Violations:
|
||||
avg: AVG((SQ_LDS_MEM_VIOLATIONS / $denom))
|
||||
min: MIN((SQ_LDS_MEM_VIOLATIONS / $denom))
|
||||
max: MAX((SQ_LDS_MEM_VIOLATIONS / $denom))
|
||||
unit: (Accesses + $normUnit)
|
||||
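The two bank-conflict expressions in the LDS tables above differ only by a scale factor, which a small sketch can make concrete. This is a minimal illustration, not the profiler's evaluator: the function names are hypothetical, the inputs stand in for per-kernel `SQ_LDS_IDX_ACTIVE` and `SQ_LDS_BANK_CONFLICT` values, the `3.125` factor is copied verbatim from the config, and the `AVG`/`MIN`/`MAX` aggregation across kernels is left out.

```python
from typing import Optional


def bank_conflicts_per_access(idx_active: float, bank_conflict: float) -> Optional[float]:
    """Unnormalized form: conflict cycles over the cycles an uncontended access would need."""
    uncontended = idx_active - bank_conflict
    if uncontended == 0:
        # Mirrors the `... if (...) != 0 else None` guard in the config expression.
        return None
    return bank_conflict / uncontended


def bank_conflict_rate(idx_active: float, bank_conflict: float) -> Optional[float]:
    """Same ratio, scaled by the 3.125 factor used in the 'Bank Conflict Rate' expression."""
    uncontended = idx_active - bank_conflict
    if uncontended == 0:
        return None
    return (bank_conflict * 3.125) / uncontended


if __name__ == "__main__":
    # Hypothetical counter values, chosen only to exercise the functions.
    print(bank_conflicts_per_access(1000, 200))
    print(bank_conflict_rate(1000, 200))
```

Returning `None` instead of raising keeps the behavior close to the config, where a zero denominator yields an empty value rather than an error.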
-105
@@ -1,105 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 1300
|
||||
title: Instruction Cache
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1301
|
||||
title: Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Bandwidth:
|
||||
value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu)
|
||||
* (End_Timestamp - Start_Timestamp))))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
Cache Hit Rate:
|
||||
value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
|
||||
+ SQC_ICACHE_MISSES_DUPLICATE)))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
L1I-L2 Bandwidth:
|
||||
value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu)
|
||||
* (End_Timestamp - Start_Timestamp))))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
comparable: false # for now
|
||||
cli_style: simple_bar
|
||||
|
||||
- metric_table:
|
||||
id: 1302
|
||||
title: Instruction Cache Accesses
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Req:
|
||||
avg: AVG((SQC_ICACHE_REQ / $denom))
|
||||
min: MIN((SQC_ICACHE_REQ / $denom))
|
||||
max: MAX((SQC_ICACHE_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Hits:
|
||||
avg: AVG((SQC_ICACHE_HITS / $denom))
|
||||
min: MIN((SQC_ICACHE_HITS / $denom))
|
||||
max: MAX((SQC_ICACHE_HITS / $denom))
|
||||
unit: (Hits + $normUnit)
|
||||
tips:
|
||||
Misses - Non Duplicated:
|
||||
avg: AVG((SQC_ICACHE_MISSES / $denom))
|
||||
min: MIN((SQC_ICACHE_MISSES / $denom))
|
||||
max: MAX((SQC_ICACHE_MISSES / $denom))
|
||||
unit: (Misses + $normUnit)
|
||||
tips:
|
||||
Misses - Duplicated:
|
||||
avg: AVG((SQC_ICACHE_MISSES_DUPLICATE / $denom))
|
||||
min: MIN((SQC_ICACHE_MISSES_DUPLICATE / $denom))
|
||||
max: MAX((SQC_ICACHE_MISSES_DUPLICATE / $denom))
|
||||
unit: (Misses + $normUnit)
|
||||
tips:
|
||||
Cache Hit Rate:
|
||||
avg: AVG(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
|
||||
+ SQC_ICACHE_MISSES_DUPLICATE)))
|
||||
min: MIN(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
|
||||
+ SQC_ICACHE_MISSES_DUPLICATE)))
|
||||
max: MAX(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
|
||||
+ SQC_ICACHE_MISSES_DUPLICATE)))
|
||||
unit: pct
|
||||
tips:
|
||||
Instruction Fetch Latency:
|
||||
avg: AVG((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
|
||||
min: MIN((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
|
||||
max: MAX((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
|
||||
unit: Cycles
|
||||
coll_level: SQ_IFETCH_LEVEL
|
||||
tips:
|
||||
- metric_table:
|
||||
id: 1303
|
||||
title: Instruction Cache - L2 Interface
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
L1I-L2 Bandwidth:
|
||||
avg: AVG(((SQC_TC_INST_REQ * 64) / $denom))
|
||||
min: MIN(((SQC_TC_INST_REQ * 64) / $denom))
|
||||
max: MAX(((SQC_TC_INST_REQ * 64) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
tips:
|
||||
+106
@@ -0,0 +1,106 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 1300
|
||||
title: Instruction Cache
|
||||
metrics_description:
|
||||
Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the
|
||||
peak theoretical bandwidth. Calculated as the ratio of L1I requests over the
|
||||
total L1I cycles.
|
||||
Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
|
||||
loaded line in the cache. Calculated as the ratio of the number of L1I requests
|
||||
that hit over the number of all L1I requests.
|
||||
L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\
|
||||
\ bandwidth achieved. Calculated as the ratio of the total number of requests\
|
||||
\ from the L1I to the L2 cache over the total L1I-L2 interface cycles."
|
||||
Req: The total number of requests made to the L1I per normalization-unit.
|
||||
Hits: The total number of L1I requests that hit on a previously loaded cache line,
|
||||
per normalization-unit.
|
||||
Misses - Non Duplicated: The total number of L1I requests that missed on a cache
|
||||
line that was not already pending due to another request, per normalization-unit.
|
||||
Misses - Duplicated: The total number of L1I requests that missed on a cache line
|
||||
that was already pending due to another request, per normalization-unit.
|
||||
Instruction Fetch Latency: The average number of cycles spent to fetch instructions
|
||||
to a CU.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1301
|
||||
title: L1I Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
metric:
|
||||
Bandwidth:
|
||||
value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
|
||||
- Start_Timestamp))))
|
||||
unit: Pct of Peak
|
||||
Cache Hit Rate:
|
||||
value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
|
||||
+ SQC_ICACHE_MISSES_DUPLICATE)))
|
||||
unit: Pct of Peak
|
||||
L1I-L2 Bandwidth:
|
||||
value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu)
|
||||
* (End_Timestamp - Start_Timestamp))))
|
||||
unit: Pct of Peak
|
||||
comparable: false
|
||||
cli_style: simple_bar
|
||||
tui_style: simple_bar
|
||||
- metric_table:
|
||||
id: 1302
|
||||
title: L1I cache accesses
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Req:
|
||||
avg: AVG((SQC_ICACHE_REQ / $denom))
|
||||
min: MIN((SQC_ICACHE_REQ / $denom))
|
||||
max: MAX((SQC_ICACHE_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Hits:
|
||||
avg: AVG((SQC_ICACHE_HITS / $denom))
|
||||
min: MIN((SQC_ICACHE_HITS / $denom))
|
||||
max: MAX((SQC_ICACHE_HITS / $denom))
|
||||
unit: (Hits + $normUnit)
|
||||
Misses - Non Duplicated:
|
||||
avg: AVG((SQC_ICACHE_MISSES / $denom))
|
||||
min: MIN((SQC_ICACHE_MISSES / $denom))
|
||||
max: MAX((SQC_ICACHE_MISSES / $denom))
|
||||
unit: (Misses + $normUnit)
|
||||
Misses - Duplicated:
|
||||
avg: AVG((SQC_ICACHE_MISSES_DUPLICATE / $denom))
|
||||
min: MIN((SQC_ICACHE_MISSES_DUPLICATE / $denom))
|
||||
max: MAX((SQC_ICACHE_MISSES_DUPLICATE / $denom))
|
||||
unit: (Misses + $normUnit)
|
||||
Cache Hit Rate:
|
||||
avg: AVG(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
|
||||
+ SQC_ICACHE_MISSES_DUPLICATE)))
|
||||
min: MIN(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
|
||||
+ SQC_ICACHE_MISSES_DUPLICATE)))
|
||||
max: MAX(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
|
||||
+ SQC_ICACHE_MISSES_DUPLICATE)))
|
||||
unit: pct
|
||||
Instruction Fetch Latency:
|
||||
avg: AVG((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
|
||||
min: MIN((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
|
||||
max: MAX((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
|
||||
unit: Cycles
|
||||
coll_level: SQ_IFETCH_LEVEL
|
||||
- metric_table:
|
||||
id: 1303
|
||||
title: L1I <-> L2 interface
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
L1I-L2 Bandwidth:
|
||||
avg: AVG(((SQC_TC_INST_REQ * 64) / $denom))
|
||||
min: MIN(((SQC_TC_INST_REQ * 64) / $denom))
|
||||
max: MAX(((SQC_TC_INST_REQ * 64) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
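The L1I `Cache Hit Rate` expression above divides hits by the sum of hits, non-duplicated misses, and duplicated misses. The sketch below restates that arithmetic for a single set of counter values. The function name is hypothetical, and the zero-denominator guard is an assumption borrowed from the sL1D expression; the L1I expression as written in this config has no such guard.

```python
from typing import Optional


def l1i_cache_hit_rate(hits: float, misses: float, misses_duplicate: float) -> Optional[float]:
    """Percent of L1I requests that hit, following the config's hit-rate expression."""
    total = hits + misses + misses_duplicate
    if total == 0:
        # Assumption: guard added for illustration; not present in the L1I config expression.
        return None
    return (100 * hits) / total


if __name__ == "__main__":
    # Hypothetical counter values for SQC_ICACHE_HITS, SQC_ICACHE_MISSES,
    # and SQC_ICACHE_MISSES_DUPLICATE.
    print(l1i_cache_hit_rate(90, 8, 2))
```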
-171
@@ -1,171 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 1400
|
||||
title: Scalar L1 Data Cache
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1401
|
||||
title: Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Bandwidth:
|
||||
value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu)
|
||||
* (End_Timestamp - Start_Timestamp))))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
Cache Hit Rate:
|
||||
value: AVG((((SQC_DCACHE_HITS * 100) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES + SQC_DCACHE_MISSES_DUPLICATE))
|
||||
if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
sL1D-L2 BW:
|
||||
value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 100000)
|
||||
/ (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
comparable: false # for now
|
||||
cli_style: simple_bar
|
||||
|
||||
- metric_table:
|
||||
id: 1402
|
||||
title: Scalar L1D Cache Accesses
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Req:
|
||||
avg: AVG((SQC_DCACHE_REQ / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Hits:
|
||||
avg: AVG((SQC_DCACHE_HITS / $denom))
|
||||
min: MIN((SQC_DCACHE_HITS / $denom))
|
||||
max: MAX((SQC_DCACHE_HITS / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Misses - Non Duplicated:
|
||||
avg: AVG((SQC_DCACHE_MISSES / $denom))
|
||||
min: MIN((SQC_DCACHE_MISSES / $denom))
|
||||
max: MAX((SQC_DCACHE_MISSES / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Misses- Duplicated:
|
||||
avg: AVG((SQC_DCACHE_MISSES_DUPLICATE / $denom))
|
||||
min: MIN((SQC_DCACHE_MISSES_DUPLICATE / $denom))
|
||||
max: MAX((SQC_DCACHE_MISSES_DUPLICATE / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Cache Hit Rate:
|
||||
avg: AVG((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
|
||||
min: MIN((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
|
||||
max: MAX((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
Read Req (Total):
|
||||
avg: AVG((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
|
||||
+ SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
|
||||
min: MIN((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
|
||||
+ SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
|
||||
max: MAX((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
|
||||
+ SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Atomic Req:
|
||||
avg: AVG((SQC_DCACHE_ATOMIC / $denom))
|
||||
min: MIN((SQC_DCACHE_ATOMIC / $denom))
|
||||
max: MAX((SQC_DCACHE_ATOMIC / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Read Req (1 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_1 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_1 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_1 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Read Req (2 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_2 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_2 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_2 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Read Req (4 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_4 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_4 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_4 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Read Req (8 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_8 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_8 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_8 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Read Req (16 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_16 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_16 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_16 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 1403
|
||||
title: Scalar L1D Cache - L2 Interface
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
sL1D-L2 BW:
|
||||
avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 64)) / $denom))
|
||||
min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 64)) / $denom))
|
||||
max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 64)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
tips:
|
||||
Read Req:
|
||||
avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_READ_REQ / $denom))
|
||||
max: MAX((SQC_TC_DATA_READ_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Write Req:
|
||||
avg: AVG((SQC_TC_DATA_WRITE_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_WRITE_REQ / $denom))
|
||||
max: MAX((SQC_TC_DATA_WRITE_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Atomic Req:
|
||||
avg: AVG((SQC_TC_DATA_ATOMIC_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_ATOMIC_REQ / $denom))
|
||||
max: MAX((SQC_TC_DATA_ATOMIC_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Stall Cycles:
|
||||
avg: AVG((SQC_TC_STALL / $denom))
|
||||
min: MIN((SQC_TC_STALL / $denom))
|
||||
max: MAX((SQC_TC_STALL / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
+186
@@ -0,0 +1,186 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 1400
|
||||
title: Scalar L1 Data Cache
|
||||
metrics_description:
|
||||
Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the
|
||||
peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the
|
||||
total sL1D cycles.
|
||||
Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
|
||||
loaded line in the cache. Calculated as the ratio of the number of sL1D requests that hit over
|
||||
the number of all sL1D requests.
|
||||
sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
|
||||
\ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\
|
||||
\ writes and atomics are typically unused on current CDNA accelerators, so in\
|
||||
\ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
|
||||
Req: The total number of requests, of any size or type, made to the sL1D per normalization
|
||||
unit.
|
||||
Hits: The total number of sL1D requests that hit on a previously loaded cache
|
||||
line, per normalization unit.
|
||||
Misses - Non Duplicated: 'The total number of sL1D requests that missed on a cache
|
||||
line that was not already pending due to another request, per normalization
|
||||
unit. '
|
||||
Misses- Duplicated: The total number of sL1D requests that missed on a cache line
|
||||
that was already pending due to another request, per normalization unit.
|
||||
Read Req (Total): The total number of sL1D read requests of any size, per normalization
|
||||
unit.
|
||||
Atomic Req: The total number of atomic requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
Read Req (1 DWord): The total number of sL1D read requests made for a single dword
|
||||
of data (4B), per normalization unit.
|
||||
Read Req (2 DWord): The total number of sL1D read requests made for two dwords
|
||||
of data (8B), per normalization unit.
|
||||
Read Req (4 DWord): The total number of sL1D read requests made for four dwords
|
||||
of data (16B), per normalization unit.
|
||||
Read Req (8 DWord): The total number of sL1D read requests made for eight dwords
|
||||
of data (32B), per normalization unit.
|
||||
Read Req (16 DWord): The total number of sL1D read requests made for sixteen
|
||||
dwords of data (64B), per normalization unit.
|
||||
Read Req: The total number of read requests from sL1D to the L2 per normalization
|
||||
unit.
|
||||
Write Req: The total number of write requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
Stall Cycles: "The total number of cycles the sL1D\u2194L2 interface was stalled,\
|
||||
\ per normalization unit."
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1401
|
||||
title: Scalar L1D Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
metric:
|
||||
Bandwidth:
|
||||
value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
|
||||
- Start_Timestamp))))
|
||||
unit: Pct of Peak
|
||||
Cache Hit Rate:
|
||||
value: AVG((((SQC_DCACHE_HITS * 100) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
|
||||
unit: Pct of Peak
|
||||
sL1D-L2 BW:
|
||||
value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
|
||||
* 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))
|
||||
unit: Pct of Peak
|
||||
comparable: false
|
||||
cli_style: simple_bar
|
||||
tui_style: simple_bar
|
||||
- metric_table:
|
||||
id: 1402
|
||||
title: Scalar L1D cache accesses
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Req:
|
||||
avg: AVG((SQC_DCACHE_REQ / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Hits:
|
||||
avg: AVG((SQC_DCACHE_HITS / $denom))
|
||||
min: MIN((SQC_DCACHE_HITS / $denom))
|
||||
max: MAX((SQC_DCACHE_HITS / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Misses - Non Duplicated:
|
||||
avg: AVG((SQC_DCACHE_MISSES / $denom))
|
||||
min: MIN((SQC_DCACHE_MISSES / $denom))
|
||||
max: MAX((SQC_DCACHE_MISSES / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Misses- Duplicated:
|
||||
avg: AVG((SQC_DCACHE_MISSES_DUPLICATE / $denom))
|
||||
min: MIN((SQC_DCACHE_MISSES_DUPLICATE / $denom))
|
||||
max: MAX((SQC_DCACHE_MISSES_DUPLICATE / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Cache Hit Rate:
|
||||
avg: AVG((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
|
||||
min: MIN((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
|
||||
max: MAX((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
|
||||
unit: pct
|
||||
Read Req (Total):
|
||||
avg: AVG((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
|
||||
+ SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
|
||||
min: MIN((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
|
||||
+ SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
|
||||
max: MAX((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
|
||||
+ SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Atomic Req:
|
||||
avg: AVG((SQC_DCACHE_ATOMIC / $denom))
|
||||
min: MIN((SQC_DCACHE_ATOMIC / $denom))
|
||||
max: MAX((SQC_DCACHE_ATOMIC / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (1 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_1 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_1 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_1 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (2 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_2 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_2 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_2 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (4 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_4 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_4 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_4 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (8 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_8 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_8 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_8 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (16 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_16 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_16 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_16 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1403
|
||||
title: Scalar L1D Cache - L2 Interface
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
sL1D-L2 BW:
|
||||
avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
|
||||
* 64)) / $denom))
|
||||
min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
|
||||
* 64)) / $denom))
|
||||
max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
|
||||
* 64)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
Read Req:
|
||||
avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_READ_REQ / $denom))
|
||||
max: MAX((SQC_TC_DATA_READ_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Write Req:
|
||||
avg: AVG((SQC_TC_DATA_WRITE_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_WRITE_REQ / $denom))
|
||||
max: MAX((SQC_TC_DATA_WRITE_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Atomic Req:
|
||||
avg: AVG((SQC_TC_DATA_ATOMIC_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_ATOMIC_REQ / $denom))
|
||||
max: MAX((SQC_TC_DATA_ATOMIC_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Stall Cycles:
|
||||
avg: AVG((SQC_TC_STALL / $denom))
|
||||
min: MIN((SQC_TC_STALL / $denom))
|
||||
max: MAX((SQC_TC_STALL / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
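Two of the sL1D expressions above are plain sums over counters: `Read Req (Total)` adds the five per-size read counters, and the per-normalization-unit `sL1D-L2 BW` multiplies the combined read, write, and atomic request count by 64 before dividing by `$denom`. The sketch below is a minimal restatement under stated assumptions: the function names and the dictionary layout are hypothetical, `denom` stands in for whatever `$denom` resolves to, and the `AVG`/`MIN`/`MAX` aggregation is omitted.

```python
from typing import Dict


def total_read_requests(reads_by_dwords: Dict[int, float], denom: float) -> float:
    """Sum of the 1/2/4/8/16-dword read counters, per normalization unit."""
    return sum(reads_by_dwords.values()) / denom


def sl1d_l2_bytes(read_req: float, write_req: float, atomic_req: float, denom: float) -> float:
    """Bytes across the sL1D-L2 interface, using the factor of 64 from the config."""
    return ((read_req + write_req + atomic_req) * 64) / denom


if __name__ == "__main__":
    # Hypothetical values keyed by request size in dwords
    # (SQC_DCACHE_REQ_READ_1 ... SQC_DCACHE_REQ_READ_16).
    reads = {1: 10, 2: 5, 4: 3, 8: 1, 16: 1}
    print(total_read_requests(reads, denom=1))
    print(sl1d_l2_bytes(read_req=10, write_req=0, atomic_req=0, denom=2))
```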
-168
@@ -1,168 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 1500
|
||||
title: Address Processing Unit and Data Return Path (TA/TD)
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1501
|
||||
title: Address Processing Unit
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Address Processing Unit Busy:
|
||||
avg: AVG(((100 * TA_TA_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
min: MIN(((100 * TA_TA_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
max: MAX(((100 * TA_TA_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
tips:
|
||||
Address Stall:
|
||||
avg: AVG(((100 * TA_ADDR_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
min: MIN(((100 * TA_ADDR_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
max: MAX(((100 * TA_ADDR_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
tips:
|
||||
Data Stall:
|
||||
avg: AVG(((100 * TA_DATA_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
min: MIN(((100 * TA_DATA_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
max: MAX(((100 * TA_DATA_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
tips:
|
||||
Data-Processor → Address Stall:
|
||||
avg: AVG(((100 * TA_ADDR_STALLED_BY_TD_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
min: MIN(((100 * TA_ADDR_STALLED_BY_TD_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
max: MAX(((100 * TA_ADDR_STALLED_BY_TD_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
tips:
|
||||
Total Instructions:
|
||||
avg: AVG((TA_TOTAL_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_TOTAL_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_TOTAL_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
tips:
|
||||
Global/Generic Instructions:
|
||||
avg: AVG((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
tips:
|
||||
Global/Generic Read Instructions:
|
||||
avg: AVG((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
tips:
|
||||
Global/Generic Write Instructions:
|
||||
avg: AVG((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
tips:
|
||||
Global/Generic Atomic Instructions:
|
||||
avg: AVG((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
tips:
|
||||
Spill/Stack Instructions:
|
||||
avg: AVG((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
tips:
|
||||
Spill/Stack Read Instructions:
|
||||
avg: AVG((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
tips:
|
||||
Spill/Stack Write Instructions:
|
||||
avg: AVG((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
tips:
|
||||
Spill/Stack Atomic Instructions:
|
||||
avg: AVG((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
tips:
|
||||
Spill/Stack Total Cycles:
|
||||
avg: AVG((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
|
||||
min: MIN((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
|
||||
max: MAX((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
Spill/Stack Coalesced Read:
|
||||
avg: AVG((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
|
||||
min: MIN((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
|
||||
max: MAX((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
Spill/Stack Coalesced Write:
|
||||
avg: AVG((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
|
||||
min: MIN((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
|
||||
max: MAX((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 1502
|
||||
title: Data-Return Path
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Data-Return Busy:
|
||||
avg: AVG(((100 * TD_TD_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
min: MIN(((100 * TD_TD_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
max: MAX(((100 * TD_TD_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
tips:
|
||||
Cache RAM → Data-Return Stall:
|
||||
avg: AVG(((100 * TD_TC_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
min: MIN(((100 * TD_TC_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
max: MAX(((100 * TD_TC_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
tips:
|
||||
Coalescable Instructions:
|
||||
avg: AVG((TD_COALESCABLE_WAVEFRONT_sum / $denom))
|
||||
min: MIN((TD_COALESCABLE_WAVEFRONT_sum / $denom))
|
||||
max: MAX((TD_COALESCABLE_WAVEFRONT_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
tips:
|
||||
Read Instructions:
|
||||
avg: AVG((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
|
||||
/ $denom))
|
||||
min: MIN((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
|
||||
/ $denom))
|
||||
max: MAX((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
|
||||
/ $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
tips:
|
||||
Write Instructions:
|
||||
avg: AVG((TD_STORE_WAVEFRONT_sum / $denom))
|
||||
min: MIN((TD_STORE_WAVEFRONT_sum / $denom))
|
||||
max: MAX((TD_STORE_WAVEFRONT_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
tips:
|
||||
Atomic Instructions:
|
||||
avg: AVG((TD_ATOMIC_WAVEFRONT_sum / $denom))
|
||||
min: MIN((TD_ATOMIC_WAVEFRONT_sum / $denom))
|
||||
max: MAX((TD_ATOMIC_WAVEFRONT_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
tips:
|
||||
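The removed panel file above carries an empty `tips:` key on every metric and a `tips: Tips` header column; the regenerated file that follows has neither. A minimal sketch of that clean-up on an already-parsed config, assuming the YAML has been loaded into plain dicts and lists. `strip_tips` and the sample fragment are hypothetical and for illustration only; this is not the logic of `utils/split_config.py`.

```python
def strip_tips(node):
    """Return a copy of a parsed config with every `tips` key removed."""
    if isinstance(node, dict):
        # Drop the key at this level, then recurse into the remaining values.
        return {k: strip_tips(v) for k, v in node.items() if k != "tips"}
    if isinstance(node, list):
        return [strip_tips(item) for item in node]
    return node  # scalars (str, int, None) are returned unchanged


# Hypothetical fragment shaped like the Data-Return Path table above.
old_table = {
    "metric_table": {
        "id": 1502,
        "title": "Data-Return Path",
        "header": {"metric": "Metric", "avg": "Avg", "unit": "Unit", "tips": "Tips"},
        "metric": {
            "Atomic Instructions": {
                "avg": "AVG((TD_ATOMIC_WAVEFRONT_sum / $denom))",
                "unit": "(Instructions + $normUnit)",
                "tips": None,
            }
        },
    }
}

new_table = strip_tips(old_table)
```

The helper builds a new structure rather than mutating its input, so the original mapping stays available for comparison.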
+233
@@ -0,0 +1,233 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 1500
|
||||
title: Address Processing Unit and Data Return Path (TA/TD)
|
||||
metrics_description:
|
||||
Address Processing Unit Busy: Percent of the total CU cycles the address processor
|
||||
was busy
|
||||
Address Stall: Percent of the total CU cycles the address processor was stalled
|
||||
from sending address requests further into the vL1D pipeline.
|
||||
Data Stall: Percent of the total CU cycles the address processor was stalled from
|
||||
sending write/atomic data further into the vL1D pipeline.
|
||||
"Data-Processor \u2192 Address Stall": Percent of total CU cycles the address
|
||||
processor was stalled waiting to send command data to the data processor.
|
||||
Total Instructions: The total number of memory instructions executed by the address
|
||||
processor over all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Instructions: The total number of global & generic memory instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Read Instructions: The total number of global & generic memory
|
||||
read instructions executed on all compute units on the accelerator, per normalization
|
||||
unit.
|
||||
Global/Generic Write Instructions: The total number of global & generic memory
|
||||
write instructions executed on all compute units on the accelerator, per normalization
|
||||
unit.
|
||||
Global/Generic Atomic Instructions: The total number of global & generic memory
|
||||
atomic (with and without return) instructions executed on all compute units
|
||||
on the accelerator, per normalization unit.
|
||||
Spill/Stack Instructions: The total number of spill/stack memory instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Read Instructions: The total number of spill/stack memory read instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Write Instructions: The total number of spill/stack memory write instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Atomic Instructions: The total number of spill/stack memory atomic
|
||||
(with and without return) instructions executed on all compute units on the
|
||||
accelerator, per normalization unit. Usually unused, as these memory operations
|
||||
are typically used to implement thread-local storage.
|
||||
Spill/Stack Total Cycles: The number of cycles the address processing unit spent
|
||||
working on spill/stack instructions, per normalization unit.
|
||||
Spill/Stack Coalesced Read: The number of cycles the address processing unit spent
|
||||
working on coalesced spill/stack read instructions, per normalization unit.
|
||||
Spill/Stack Coalesced Write: The number of cycles the address processing unit
|
||||
spent working on coalesced spill/stack write instructions, per normalization
|
||||
unit.
|
||||
Data-Return Busy: Percent of the total CU cycles the data-return unit was busy
|
||||
processing or waiting on data to return to the CU.
|
||||
"Cache RAM \u2192 Data-Return Stall": Percent of the total CU cycles the data-return
|
||||
unit was stalled on data to be returned from the vL1D Cache RAM.
|
||||
"Workgroup manager \u2192 Data-Return Stall": Percent of the total CU cycles the
|
||||
data-return unit was stalled by the workgroup manager due to initialization
|
||||
of registers as a part of launching new workgroups.
|
||||
Coalescable Instructions: The number of instructions submitted to the data-return
|
||||
unit by the address processor that were found to be coalescable, per normalization
|
||||
unit.
|
||||
Read Instructions: The number of read instructions submitted to the data-return
|
||||
unit by the address processor summed over all compute units on the accelerator,
|
||||
per normalization unit. This is expected to be the sum of global/generic and
|
||||
spill/stack reads in the address processor.
|
||||
Write Instructions: The number of store instructions submitted to the data-return
|
||||
unit by the address processor summed over all compute units on the accelerator,
|
||||
per normalization unit. This is expected to be the sum of global/generic and
|
||||
spill/stack stores in the address processor.
|
||||
Atomic Instructions: The number of atomic instructions submitted to the data-return
|
||||
unit by the address processor summed over all compute units on the accelerator,
|
||||
per normalization unit. This is expected to be the sum of global/generic and
|
||||
spill/stack atomics in the address processor.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1501
|
||||
title: Busy and stall metrics
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Address Processing Unit Busy:
|
||||
avg: AVG(((100 * TA_TA_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
min: MIN(((100 * TA_TA_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
max: MAX(((100 * TA_TA_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
Address Stall:
|
||||
avg: AVG(((100 * TA_ADDR_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
|
||||
* $cu_per_gpu)))
|
||||
min: MIN(((100 * TA_ADDR_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
|
||||
* $cu_per_gpu)))
|
||||
max: MAX(((100 * TA_ADDR_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
|
||||
* $cu_per_gpu)))
|
||||
unit: pct
|
||||
Data Stall:
|
||||
avg: AVG(((100 * TA_DATA_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
|
||||
* $cu_per_gpu)))
|
||||
min: MIN(((100 * TA_DATA_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
|
||||
* $cu_per_gpu)))
|
||||
max: MAX(((100 * TA_DATA_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
|
||||
* $cu_per_gpu)))
|
||||
unit: pct
|
||||
"Data-Processor \u2192 Address Stall":
|
||||
avg: AVG(((100 * TA_ADDR_STALLED_BY_TD_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
|
||||
* $cu_per_gpu)))
|
||||
min: MIN(((100 * TA_ADDR_STALLED_BY_TD_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
|
||||
* $cu_per_gpu)))
|
||||
max: MAX(((100 * TA_ADDR_STALLED_BY_TD_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
|
||||
* $cu_per_gpu)))
|
||||
unit: pct
|
||||
- metric_table:
|
||||
id: 1502
|
||||
title: Instruction counts
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Total Instructions:
|
||||
avg: AVG((TA_TOTAL_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_TOTAL_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_TOTAL_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
Global/Generic Instructions:
|
||||
avg: AVG((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
Global/Generic Read Instructions:
|
||||
avg: AVG((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
Global/Generic Write Instructions:
|
||||
avg: AVG((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
Global/Generic Atomic Instructions:
|
||||
avg: AVG((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
Spill/Stack Instructions:
|
||||
avg: AVG((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
Spill/Stack Read Instructions:
|
||||
avg: AVG((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
Spill/Stack Write Instructions:
|
||||
avg: AVG((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
Spill/Stack Atomic Instructions:
|
||||
avg: AVG((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
- metric_table:
|
||||
id: 1503
|
||||
title: Spill and stack metrics
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Spill/Stack Total Cycles:
|
||||
avg: AVG((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
|
||||
min: MIN((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
|
||||
max: MAX((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
Spill/Stack Coalesced Read:
|
||||
avg: AVG((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
|
||||
min: MIN((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
|
||||
max: MAX((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
Spill/Stack Coalesced Write:
|
||||
avg: AVG((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
|
||||
min: MIN((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
|
||||
max: MAX((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
- metric_table:
|
||||
id: 1504
|
||||
title: Vector L1 data-return path or Texture Data (TD)
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Data-Return Busy:
|
||||
avg: AVG(((100 * TD_TD_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
min: MIN(((100 * TD_TD_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
max: MAX(((100 * TD_TD_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
"Cache RAM \u2192 Data-Return Stall":
|
||||
avg: AVG(((100 * TD_TC_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
min: MIN(((100 * TD_TC_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
max: MAX(((100 * TD_TC_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
"Workgroup manager \u2192 Data-Return Stall":
|
||||
avg: null
|
||||
min: null
|
||||
max: null
|
||||
unit: pct
|
||||
Coalescable Instructions:
|
||||
avg: AVG((TD_COALESCABLE_WAVEFRONT_sum / $denom))
|
||||
min: MIN((TD_COALESCABLE_WAVEFRONT_sum / $denom))
|
||||
max: MAX((TD_COALESCABLE_WAVEFRONT_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
Read Instructions:
|
||||
avg: AVG((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
|
||||
/ $denom))
|
||||
min: MIN((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
|
||||
/ $denom))
|
||||
max: MAX((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
|
||||
/ $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
Write Instructions:
|
||||
avg: AVG((TD_STORE_WAVEFRONT_sum / $denom))
|
||||
min: MIN((TD_STORE_WAVEFRONT_sum / $denom))
|
||||
max: MAX((TD_STORE_WAVEFRONT_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
Atomic Instructions:
|
||||
avg: AVG((TD_ATOMIC_WAVEFRONT_sum / $denom))
|
||||
min: MIN((TD_ATOMIC_WAVEFRONT_sum / $denom))
|
||||
max: MAX((TD_ATOMIC_WAVEFRONT_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
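The expressions in panel 1500 above follow two patterns: a percentage of total CU cycles, and a counter divided by the normalization unit. A minimal Python sketch of both, with the `AVG`/`MIN`/`MAX` wrappers omitted. The function names and counter values are made up for illustration, and the zero-divisor guard is an assumption that the YAML expressions for these metrics do not contain.

```python
def pct_of_cu_cycles(counter_sum, gui_active_per_xcd, cu_per_gpu):
    """Mirror of 100 * counter / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)."""
    total_cu_cycles = gui_active_per_xcd * cu_per_gpu
    if total_cu_cycles == 0:
        return None  # assumption: guard added here, not present in the config
    return 100 * counter_sum / total_cu_cycles


def td_read_instructions(load, store, atomic, denom):
    """Mirror of the Read Instructions expression: (load - store - atomic) / denom."""
    return (load - store - atomic) / denom


# Made-up counter values for a single kernel dispatch.
busy_pct = pct_of_cu_cycles(counter_sum=120_000, gui_active_per_xcd=5_000, cu_per_gpu=80)
reads = td_read_instructions(load=1_000, store=200, atomic=50, denom=10)
```

With these inputs `busy_pct` is 30.0 and `reads` is 75.0. The read count is obtained by subtraction because the config derives it from `TD_LOAD_WAVEFRONT_sum` minus the store and atomic counters.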
-414
@@ -1,414 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 1600
|
||||
title: Vector L1 Data Cache
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1601
|
||||
title: Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Hit rate:
|
||||
value: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
|
||||
None))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
Bandwidth:
|
||||
value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk / 1000) * 64) * $cu_per_gpu))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
Utilization:
|
||||
value: AVG((((TCP_GATE_EN2_sum * 100) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum
|
||||
!= 0) else None))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
Coalescing:
|
||||
value: AVG(((((TA_TOTAL_WAVEFRONTS_sum * 64) * 100) / (TCP_TOTAL_ACCESSES_sum
|
||||
* 4)) if (TCP_TOTAL_ACCESSES_sum != 0) else None))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
comparable: false # for now
|
||||
cli_style: simple_bar
|
||||
|
||||
- metric_table:
|
||||
id: 1602
|
||||
title: L1D Cache Stalls (%)
|
||||
header:
|
||||
metric: Metric
|
||||
expr: Expression
|
||||
tips: Tips
|
||||
metric:
|
||||
Stalled on L2 Data:
|
||||
expr:
|
||||
(((100 * TCP_PENDING_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum
|
||||
!= 0) else None)
|
||||
tips:
|
||||
Stalled on L2 Req:
|
||||
expr:
|
||||
(((100 * TCP_TCR_TCP_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum
|
||||
!= 0) else None)
|
||||
tips:
|
||||
Stalled on Address:
|
||||
expr:
|
||||
None
|
||||
tips:
|
||||
Stalled on Data:
|
||||
expr:
|
||||
None
|
||||
tips:
|
||||
Stalled on Latency FIFO:
|
||||
expr:
|
||||
None
|
||||
tips:
|
||||
Stalled on Request FIFO:
|
||||
expr:
|
||||
None
|
||||
tips:
|
||||
Stalled on Read Return:
|
||||
expr:
|
||||
None
|
||||
tips:
|
||||
Tag RAM Stall (Read):
|
||||
expr:
|
||||
(((100 * TCP_READ_TAGCONFLICT_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
|
||||
if (TCP_GATE_EN1_sum != 0) else None)
|
||||
tips:
|
||||
Tag RAM Stall (Write):
|
||||
expr:
|
||||
(((100 * TCP_WRITE_TAGCONFLICT_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
|
||||
if (TCP_GATE_EN1_sum != 0) else None)
|
||||
tips:
|
||||
Tag RAM Stall (Atomic):
|
||||
expr:
|
||||
(((100 * TCP_ATOMIC_TAGCONFLICT_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
|
||||
if (TCP_GATE_EN1_sum != 0) else None)
|
||||
tips:
|
||||
cli_style: simple_box
|
||||
|
||||
- metric_table:
|
||||
id: 1603
|
||||
title: L1D Cache Accesses
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Total Req:
|
||||
avg: AVG((TCP_TOTAL_ACCESSES_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_ACCESSES_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_ACCESSES_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Read Req:
|
||||
avg: AVG((TCP_TOTAL_READ_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_READ_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_READ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Write Req:
|
||||
avg: AVG((TCP_TOTAL_WRITE_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_WRITE_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_WRITE_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Atomic Req:
|
||||
avg: AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
|
||||
/ $denom))
|
||||
min: MIN(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
|
||||
/ $denom))
|
||||
max: MAX(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Cache BW:
|
||||
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
|
||||
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
|
||||
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
tips:
|
||||
Cache Hit Rate:
|
||||
avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) +
|
||||
TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) /
|
||||
TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
|
||||
None))
|
||||
min: MIN(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) +
|
||||
TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) /
|
||||
TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
|
||||
None))
|
||||
max: MAX(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) +
|
||||
TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) /
|
||||
TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
|
||||
None))
|
||||
unit: pct
|
||||
tips:
|
||||
Cache Accesses:
|
||||
avg: AVG((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Cache Hits:
|
||||
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ $denom))
|
||||
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ $denom))
|
||||
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Invalidations:
|
||||
avg: AVG((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
L1-L2 BW:
|
||||
avg: AVG(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
|
||||
min: MIN(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
|
||||
max: MAX(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
tips:
|
||||
L1-L2 Read:
|
||||
avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
L1-L2 Write:
|
||||
avg: AVG((TCP_TCC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
L1-L2 Atomic:
|
||||
avg: AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
/ $denom))
|
||||
min: MIN(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
/ $denom))
|
||||
max: MAX(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
L1 Access Latency:
|
||||
avg: AVG(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
tips:
|
||||
L1-L2 Read Latency:
|
||||
avg: AVG(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
|
||||
if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else None))
|
||||
min: MIN(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
|
||||
if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else None))
|
||||
max: MAX(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
|
||||
if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else None))
|
||||
unit: Cycles
|
||||
tips:
|
||||
L1-L2 Write Latency:
|
||||
avg: AVG(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0) else
|
||||
None))
|
||||
min: MIN(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0) else
|
||||
None))
|
||||
max: MAX(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0) else
|
||||
None))
|
||||
unit: Cycles
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 1604
|
||||
title: L1D - L2 Transactions
|
||||
header:
|
||||
metric: Metric
|
||||
xfer: Xfer
|
||||
coherency: Coherency
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
NC - Read:
|
||||
xfer: Read
|
||||
coherency: NC
|
||||
avg: AVG((TCP_TCC_NC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_NC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_NC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
UC - Read:
|
||||
xfer: Read
|
||||
coherency: UC
|
||||
avg: AVG((TCP_TCC_UC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_UC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_UC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
CC - Read:
|
||||
xfer: Read
|
||||
coherency: CC
|
||||
avg: AVG((TCP_TCC_CC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_CC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_CC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
RW - Read:
|
||||
xfer: Read
|
||||
coherency: RW
|
||||
avg: AVG((TCP_TCC_RW_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_RW_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_RW_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
RW - Write:
|
||||
xfer: Write
|
||||
coherency: RW
|
||||
avg: AVG((TCP_TCC_RW_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_RW_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_RW_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
NC - Write:
|
||||
xfer: Write
|
||||
coherency: NC
|
||||
avg: AVG((TCP_TCC_NC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_NC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_NC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
UC - Write:
|
||||
xfer: Write
|
||||
coherency: UC
|
||||
avg: AVG((TCP_TCC_UC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_UC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_UC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
CC - Write:
|
||||
xfer: Write
|
||||
coherency: CC
|
||||
avg: AVG((TCP_TCC_CC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_CC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_CC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
NC - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: NC
|
||||
avg: AVG((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
UC - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: UC
|
||||
avg: AVG((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
CC - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: CC
|
||||
avg: AVG((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
RW - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: RW
|
||||
avg: AVG((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 1605
|
||||
title: L1D Addr Translation
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
units: Units
|
||||
tips: Tips
|
||||
metric:
|
||||
Req:
|
||||
avg: AVG((TCP_UTCL1_REQUEST_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_REQUEST_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_REQUEST_sum / $denom))
|
||||
units: (Req + $normUnit)
|
||||
tips:
|
||||
Inflight Req:
|
||||
avg: None # Missing perfmon
|
||||
min: None # Missing perfmon
|
||||
max: None # Missing perfmon
|
||||
units: (Req + $normUnit)
|
||||
tips:
|
||||
Hit Ratio:
|
||||
avg: AVG((((100 * TCP_UTCL1_TRANSLATION_HIT_sum) / TCP_UTCL1_REQUEST_sum) if
|
||||
(TCP_UTCL1_REQUEST_sum != 0) else None))
|
||||
min: MIN((((100 * TCP_UTCL1_TRANSLATION_HIT_sum) / TCP_UTCL1_REQUEST_sum) if
|
||||
(TCP_UTCL1_REQUEST_sum != 0) else None))
|
||||
max: MAX((((100 * TCP_UTCL1_TRANSLATION_HIT_sum) / TCP_UTCL1_REQUEST_sum) if
|
||||
(TCP_UTCL1_REQUEST_sum != 0) else None))
|
||||
units: pct
|
||||
tips:
|
||||
Hits:
|
||||
avg: AVG((TCP_UTCL1_TRANSLATION_HIT_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_TRANSLATION_HIT_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_TRANSLATION_HIT_sum / $denom))
|
||||
units: (Req + $normUnit)
|
||||
tips:
|
||||
Translation Misses:
|
||||
avg: AVG((TCP_UTCL1_TRANSLATION_MISS_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_TRANSLATION_MISS_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_TRANSLATION_MISS_sum / $denom))
|
||||
units: (Req + $normUnit)
|
||||
tips:
|
||||
Permission Misses:
|
||||
avg: AVG((TCP_UTCL1_PERMISSION_MISS_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_PERMISSION_MISS_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_PERMISSION_MISS_sum / $denom))
|
||||
units: (Req + $normUnit)
|
||||
tips:
|
||||
- metric_table:
|
||||
id: 1606
|
||||
title: L1D Addr Translation Stalls
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
units: Units
|
||||
tips: Tips
|
||||
metric:
|
||||
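The Speed-of-Light expressions in the removed panel 1600 above guard against a zero divisor with `... if (x != 0) else None`. A minimal Python sketch of the hit-rate and coalescing expressions, with the `AVG` wrapper omitted. The function names are hypothetical, and the constants 64 and 4 are copied from the config expression as-is.

```python
def l1d_hit_rate(read_req, write_req, atomic_ret_req, atomic_noret_req,
                 total_cache_accesses):
    """100 minus the percentage of cache accesses that went out to L2."""
    if total_cache_accesses == 0:
        return None  # same guard as the config expression
    l2_requests = read_req + write_req + atomic_ret_req + atomic_noret_req
    return 100 - (100 * l2_requests) / total_cache_accesses


def coalescing_pct(ta_total_wavefronts, tcp_total_accesses):
    """Mirror of (TA_TOTAL_WAVEFRONTS * 64 * 100) / (TCP_TOTAL_ACCESSES * 4)."""
    if tcp_total_accesses == 0:
        return None
    return (ta_total_wavefronts * 64 * 100) / (tcp_total_accesses * 4)


# Made-up counter values.
hit_rate = l1d_hit_rate(20, 5, 3, 2, total_cache_accesses=100)
best_case = coalescing_pct(ta_total_wavefronts=10, tcp_total_accesses=160)
worst_case = coalescing_pct(ta_total_wavefronts=10, tcp_total_accesses=640)
```

Here 16 accesses per instruction give 100.0 and 64 accesses per instruction give 25.0, which matches the 25% to 100% range quoted in the Coalescing description.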
+442
@@ -0,0 +1,442 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 1600
|
||||
title: Vector L1 Data Cache
|
||||
metrics_description:
|
||||
Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
|
||||
cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM
|
||||
instructions, as a percent of the peak theoretical bandwidth achievable on the
|
||||
specific accelerator. The number of bytes is calculated as the number of cache
|
||||
lines requested multiplied by the cache line size. This value does not consider
|
||||
partial requests, so for instance, if only a single value is requested in a
|
||||
cache line, the data movement will still be counted as a full cache line.
|
||||
Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
|
||||
The number of cycles where the vL1D Cache RAM is actively processing any request
|
||||
divided by the number of cycles where the vL1D is active.
|
||||
Coalescing: Indicates how well memory instructions were coalesced by the address
|
||||
processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
|
||||
as the average number of thread-requests generated per instruction divided by
|
||||
the ideal number of thread-requests per instruction.
|
||||
Stalled on L2 Data: The ratio of the number of cycles where the vL1D is stalled
|
||||
waiting for requested data to return from the L2 cache divided by the number
|
||||
of cycles where the vL1D is active.
|
||||
Stalled on L2 Req: The ratio of the number of cycles where the vL1D is stalled
|
||||
waiting to issue a request for data to the L2 cache divided by the number of
|
||||
cycles where the vL1D is active.
|
||||
Tag RAM Stall (Read): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Read requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Tag RAM Stall (Write): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Write requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Tag RAM Stall (Atomic): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Atomic requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Total Req: The total number of incoming requests from the address processing unit
|
||||
after coalescing per normalization unit.
|
||||
Read Req: The total number of incoming read requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Write Req: The total number of incoming write requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Atomic Req: The total number of incoming atomic requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
|
||||
instructions per normalization unit. The number of bytes is calculated as the
|
||||
number of cache lines requested multiplied by the cache line size. This value
|
||||
does not consider partial requests, so for instance, if only a single value
|
||||
is requested in a cache line, the data movement will still be counted as a full
|
||||
cache line.
|
||||
Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
|
||||
vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
Cache Accesses: The total number of cache line lookups in the vL1D.
|
||||
Cache Hits: The number of cache accesses minus the number of outgoing requests
|
||||
to the L2 cache, that is, the number of cache line requests serviced by the
|
||||
vL1D Cache RAM per normalization unit.
|
||||
Invalidations: The number of times the vL1D was issued a write-back invalidate
|
||||
command during the kernel's execution per normalization unit. This may be triggered
|
||||
by, for instance, the buffer_wbinvl1 instruction.
|
||||
L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
|
||||
of VMEM instructions, per normalization unit. The number of bytes is calculated
|
||||
as the number of cache lines requested multiplied by the cache line size. This
|
||||
value does not consider partial requests, so for instance, if only a single
|
||||
value is requested in a cache line, the data movement will still be counted
|
||||
as a full cache line.
|
||||
L1-L2 Read: The number of read requests for a vL1D cache line that were not satisfied
|
||||
by the vL1D and must be retrieved from the L2 Cache per normalization
|
||||
unit.
|
||||
L1-L2 Write: The number of write requests to a vL1D cache line that were sent
|
||||
through the vL1D to the L2 cache, per normalization unit.
|
||||
L1-L2 Atomic: The number of atomic requests that are sent through the vL1D to
|
||||
the L2 cache, per normalization unit. This includes requests for atomics with
|
||||
and without return.
|
||||
L1 Access Latency: Calculated as the average number of cycles that a vL1D cache
|
||||
line request spent in the vL1D cache pipeline.
|
||||
L1-L2 Read Latency: Calculated as the average number of cycles that the vL1D cache
|
||||
took to issue and receive read requests from the L2 Cache. This number also
|
||||
includes requests for atomics with return values.
|
||||
L1-L2 Write Latency: Calculated as the average number of cycles that the vL1D
|
||||
cache took to issue and receive acknowledgement of a write request to the L2
|
||||
Cache. This number also includes requests for atomics without return values.
|
||||
NC - Read: Total read requests with NC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
UC - Read: Total read requests with UC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
CC - Read: Total read requests with CC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
RW - Read: Total read requests with RW mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
RW - Write: Total write requests with RW mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
NC - Write: Total write requests with NC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
UC - Write: Total write requests with UC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
CC - Write: Total write requests with CC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
NC - Atomic: Total atomic requests with NC mtype from this TCP to all TCCs, summed
|
||||
over TCP instances per normalization unit.
|
||||
UC - Atomic: Total atomic requests with UC mtype from this TCP to all TCCs, summed
|
||||
over TCP instances per normalization unit.
|
||||
CC - Atomic: Total atomic requests with CC mtype from this TCP to all TCCs, summed
|
||||
over TCP instances per normalization unit.
|
||||
RW - Atomic: Total atomic requests with RW mtype from this TCP to all TCCs, summed
|
||||
over TCP instances per normalization unit.
|
||||
Req: The number of translation requests made to the UTCL1 per normalization unit.
|
||||
Hit Ratio: The ratio of the number of translation requests that hit in the UTCL1
|
||||
divided by the total number of translation requests made to the UTCL1.
|
||||
Hits: The number of translation requests that hit in the UTCL1, and could be reused,
|
||||
per normalization unit.
|
||||
Translation Misses: The total number of translation requests that missed in the
|
||||
UTCL1 due to translation not being present in the cache, per normalization
|
||||
unit.
|
||||
Permission Misses: "The total number of translation requests that missed in the\
|
||||
\ UTCL1 due to a permission error, per normalization unit. This is unused and\
|
||||
\ expected to be zero in most configurations for modern CDNA\u2122 accelerators."
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1601
|
||||
title: vL1D Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
metric:
|
||||
Hit rate:
|
||||
value: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
|
||||
else None))
|
||||
unit: Pct of Peak
|
||||
Bandwidth:
|
||||
value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp
|
||||
- Start_Timestamp)))) / ((($max_sclk / 1000) * 64) * $cu_per_gpu))
|
||||
unit: Pct of Peak
|
||||
Utilization:
|
||||
value: AVG((((TCP_GATE_EN2_sum * 100) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum
|
||||
!= 0) else None))
|
||||
unit: Pct of Peak
|
||||
Coalescing:
|
||||
value: AVG(((((TA_TOTAL_WAVEFRONTS_sum * 64) * 100) / (TCP_TOTAL_ACCESSES_sum
|
||||
* 4)) if (TCP_TOTAL_ACCESSES_sum != 0) else None))
|
||||
unit: Pct of Peak
|
||||
comparable: false
|
||||
cli_style: simple_bar
|
||||
tui_style: simple_bar
|
||||
- metric_table:
|
||||
id: 1602
|
||||
title: vL1D cache stall metrics
|
||||
header:
|
||||
metric: Metric
|
||||
expr: Expression
|
||||
metric:
|
||||
Stalled on L2 Data:
|
||||
expr: (((100 * TCP_PENDING_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum
|
||||
!= 0) else None)
|
||||
Stalled on L2 Req:
|
||||
expr: (((100 * TCP_TCR_TCP_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum
|
||||
!= 0) else None)
|
||||
Tag RAM Stall (Read):
|
||||
expr: (((100 * TCP_READ_TAGCONFLICT_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
|
||||
if (TCP_GATE_EN1_sum != 0) else None)
|
||||
Tag RAM Stall (Write):
|
||||
expr: (((100 * TCP_WRITE_TAGCONFLICT_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
|
||||
if (TCP_GATE_EN1_sum != 0) else None)
|
||||
Tag RAM Stall (Atomic):
|
||||
expr: (((100 * TCP_ATOMIC_TAGCONFLICT_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
|
||||
if (TCP_GATE_EN1_sum != 0) else None)
|
||||
cli_style: simple_box
|
||||
tui_style: simple_box
|
||||
- metric_table:
|
||||
id: 1603
|
||||
title: vL1D cache access metrics
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Total Req:
|
||||
avg: AVG((TCP_TOTAL_ACCESSES_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_ACCESSES_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_ACCESSES_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read Req:
|
||||
avg: AVG((TCP_TOTAL_READ_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_READ_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_READ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Write Req:
|
||||
avg: AVG((TCP_TOTAL_WRITE_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_WRITE_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_WRITE_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Atomic Req:
|
||||
avg: AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
|
||||
/ $denom))
|
||||
min: MIN(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
|
||||
/ $denom))
|
||||
max: MAX(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Cache BW:
|
||||
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
|
||||
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
|
||||
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
Cache Hit Rate:
|
||||
avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
|
||||
else None))
|
||||
min: MIN(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
|
||||
else None))
|
||||
max: MAX(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
|
||||
else None))
|
||||
unit: pct
|
||||
Cache Accesses:
|
||||
avg: AVG((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Cache Hits:
|
||||
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ $denom))
|
||||
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ $denom))
|
||||
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Invalidations:
|
||||
avg: AVG((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
L1-L2 BW:
|
||||
avg: AVG(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
|
||||
min: MIN(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
|
||||
max: MAX(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
L1-L2 Read:
|
||||
avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
L1-L2 Write:
|
||||
avg: AVG((TCP_TCC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
L1-L2 Atomic:
|
||||
avg: AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
/ $denom))
|
||||
min: MIN(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
/ $denom))
|
||||
max: MAX(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
L1 Access Latency:
|
||||
avg: AVG(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
L1-L2 Read Latency:
|
||||
avg: AVG(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
|
||||
if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else
|
||||
None))
|
||||
min: MIN(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
|
||||
if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else
|
||||
None))
|
||||
max: MAX(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
|
||||
if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else
|
||||
None))
|
||||
unit: Cycles
|
||||
L1-L2 Write Latency:
|
||||
avg: AVG(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0)
|
||||
else None))
|
||||
min: MIN(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0)
|
||||
else None))
|
||||
max: MAX(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0)
|
||||
else None))
|
||||
unit: Cycles
|
||||
- metric_table:
|
||||
id: 1604
|
||||
title: L1D - L2 Transactions
|
||||
header:
|
||||
metric: Metric
|
||||
xfer: Xfer
|
||||
coherency: Coherency
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
NC - Read:
|
||||
xfer: Read
|
||||
coherency: NC
|
||||
avg: AVG((TCP_TCC_NC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_NC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_NC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
UC - Read:
|
||||
xfer: Read
|
||||
coherency: UC
|
||||
avg: AVG((TCP_TCC_UC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_UC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_UC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
CC - Read:
|
||||
xfer: Read
|
||||
coherency: CC
|
||||
avg: AVG((TCP_TCC_CC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_CC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_CC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
RW - Read:
|
||||
xfer: Read
|
||||
coherency: RW
|
||||
avg: AVG((TCP_TCC_RW_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_RW_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_RW_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
RW - Write:
|
||||
xfer: Write
|
||||
coherency: RW
|
||||
avg: AVG((TCP_TCC_RW_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_RW_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_RW_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
NC - Write:
|
||||
xfer: Write
|
||||
coherency: NC
|
||||
avg: AVG((TCP_TCC_NC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_NC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_NC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
UC - Write:
|
||||
xfer: Write
|
||||
coherency: UC
|
||||
avg: AVG((TCP_TCC_UC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_UC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_UC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
CC - Write:
|
||||
xfer: Write
|
||||
coherency: CC
|
||||
avg: AVG((TCP_TCC_CC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_CC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_CC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
NC - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: NC
|
||||
avg: AVG((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
UC - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: UC
|
||||
avg: AVG((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
CC - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: CC
|
||||
avg: AVG((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
RW - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: RW
|
||||
avg: AVG((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1605
|
||||
title: L1 Unified Translation Cache (UTCL1)
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
units: Units
|
||||
metric:
|
||||
Req:
|
||||
avg: AVG((TCP_UTCL1_REQUEST_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_REQUEST_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_REQUEST_sum / $denom))
|
||||
units: (Req + $normUnit)
|
||||
Hit Ratio:
|
||||
avg: AVG((((100 * TCP_UTCL1_TRANSLATION_HIT_sum) / TCP_UTCL1_REQUEST_sum)
|
||||
if (TCP_UTCL1_REQUEST_sum != 0) else None))
|
||||
min: MIN((((100 * TCP_UTCL1_TRANSLATION_HIT_sum) / TCP_UTCL1_REQUEST_sum)
|
||||
if (TCP_UTCL1_REQUEST_sum != 0) else None))
|
||||
max: MAX((((100 * TCP_UTCL1_TRANSLATION_HIT_sum) / TCP_UTCL1_REQUEST_sum)
|
||||
if (TCP_UTCL1_REQUEST_sum != 0) else None))
|
||||
units: pct
|
||||
Hits:
|
||||
avg: AVG((TCP_UTCL1_TRANSLATION_HIT_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_TRANSLATION_HIT_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_TRANSLATION_HIT_sum / $denom))
|
||||
units: (Req + $normUnit)
|
||||
Translation Misses:
|
||||
avg: AVG((TCP_UTCL1_TRANSLATION_MISS_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_TRANSLATION_MISS_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_TRANSLATION_MISS_sum / $denom))
|
||||
units: (Req + $normUnit)
|
||||
Permission Misses:
|
||||
avg: AVG((TCP_UTCL1_PERMISSION_MISS_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_PERMISSION_MISS_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_PERMISSION_MISS_sum / $denom))
|
||||
units: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1606
|
||||
title: L1D Addr Translation Stalls
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
units: Units
|
||||
metric: {}
|
||||
-388
@@ -1,388 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 1700
|
||||
title: L2 Cache
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1701
|
||||
title: Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Utilization:
|
||||
value: AVG(((TCC_BUSY_sum * 100) / (TO_INT($total_l2_chan) * $GRBM_GUI_ACTIVE_PER_XCD)))
|
||||
unit: pct
|
||||
tips:
|
||||
Bandwidth:
|
||||
value: ((100 * AVG(((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk / 1000) * 64) * TO_INT($total_l2_chan)))
|
||||
unit: pct
|
||||
tips:
|
||||
Hit Rate:
|
||||
value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else 0))
|
||||
unit: pct
|
||||
tips:
|
||||
L2-Fabric Read BW:
|
||||
value: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
tips:
|
||||
L2-Fabric Write and Atomic BW:
|
||||
value: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
|
||||
* 32)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
tips:
|
||||
HBM Bandwidth:
|
||||
value: $hbmBandwidth
|
||||
unit: GB/s
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 1702
|
||||
title: L2 - Fabric Transactions
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Read BW:
|
||||
avg: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / $denom))
|
||||
min: MIN((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / $denom))
|
||||
max: MAX((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
tips:
|
||||
HBM Read Traffic:
|
||||
avg: AVG((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
min: MIN((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
max: MAX((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
Remote Read Traffic:
|
||||
avg: AVG((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
min: MIN((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
max: MAX((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
Uncached Read Traffic:
|
||||
avg: AVG((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
min: MIN((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
max: MAX((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
Write and Atomic BW:
|
||||
avg: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
min: MIN((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
max: MAX((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
tips:
|
||||
HBM Write and Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
min: MIN((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
max: MAX((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
Remote Write and Atomic Traffic:
|
||||
avg: AVG((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
min: MIN((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
max: MAX((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
min: MIN((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
max: MAX((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
Uncached Write and Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
min: MIN((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
max: MAX((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
Read Latency:
|
||||
avg: AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum !=
|
||||
0) else None))
|
||||
min: MIN(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum !=
|
||||
0) else None))
|
||||
max: MAX(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum !=
|
||||
0) else None))
|
||||
unit: Cycles
|
||||
tips:
|
||||
Write and Atomic Latency:
|
||||
avg: AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum !=
|
||||
0) else None))
|
||||
min: MIN(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum !=
|
||||
0) else None))
|
||||
max: MAX(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum !=
|
||||
0) else None))
|
||||
unit: Cycles
|
||||
tips:
|
||||
Atomic Latency:
|
||||
avg: AVG(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 1703
|
||||
title: L2 Cache Accesses
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Bandwidth:
|
||||
avg: AVG((TCC_REQ_sum * 64) / $denom)
|
||||
min: MIN((TCC_REQ_sum * 64) / $denom)
|
||||
max: MAX((TCC_REQ_sum * 64) / $denom)
|
||||
unit: (Bytes + $normUnit)
|
||||
tips:
|
||||
Req:
|
||||
avg: AVG((TCC_REQ_sum / $denom))
|
||||
min: MIN((TCC_REQ_sum / $denom))
|
||||
max: MAX((TCC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Read Req:
|
||||
avg: AVG((TCC_READ_sum / $denom))
|
||||
min: MIN((TCC_READ_sum / $denom))
|
||||
max: MAX((TCC_READ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Write Req:
|
||||
avg: AVG((TCC_WRITE_sum / $denom))
|
||||
min: MIN((TCC_WRITE_sum / $denom))
|
||||
max: MAX((TCC_WRITE_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Atomic Req:
|
||||
avg: AVG((TCC_ATOMIC_sum / $denom))
|
||||
min: MIN((TCC_ATOMIC_sum / $denom))
|
||||
max: MAX((TCC_ATOMIC_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Streaming Req:
|
||||
avg: AVG((TCC_STREAMING_REQ_sum / $denom))
|
||||
min: MIN((TCC_STREAMING_REQ_sum / $denom))
|
||||
max: MAX((TCC_STREAMING_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Probe Req:
|
||||
avg: AVG((TCC_PROBE_sum / $denom))
|
||||
min: MIN((TCC_PROBE_sum / $denom))
|
||||
max: MAX((TCC_PROBE_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Cache Hit:
|
||||
avg: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
min: MIN((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
max: MAX((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
Hits:
|
||||
avg: AVG((TCC_HIT_sum / $denom))
|
||||
min: MIN((TCC_HIT_sum / $denom))
|
||||
max: MAX((TCC_HIT_sum / $denom))
|
||||
unit: (Hits + $normUnit)
|
||||
tips:
|
||||
Misses:
|
||||
avg: AVG((TCC_MISS_sum / $denom))
|
||||
min: MIN((TCC_MISS_sum / $denom))
|
||||
max: MAX((TCC_MISS_sum / $denom))
|
||||
unit: (Misses + $normUnit)
|
||||
tips:
|
||||
Writeback:
|
||||
avg: AVG((TCC_WRITEBACK_sum / $denom))
|
||||
min: MIN((TCC_WRITEBACK_sum / $denom))
|
||||
max: MAX((TCC_WRITEBACK_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
tips:
|
||||
Writeback (Internal):
|
||||
avg: AVG((TCC_NORMAL_WRITEBACK_sum / $denom))
|
||||
min: MIN((TCC_NORMAL_WRITEBACK_sum / $denom))
|
||||
max: MAX((TCC_NORMAL_WRITEBACK_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
tips:
|
||||
Writeback (vL1D Req):
|
||||
avg: AVG((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
|
||||
min: MIN((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
|
||||
max: MAX((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
tips:
|
||||
Evict (Internal):
|
||||
avg: AVG((TCC_NORMAL_EVICT_sum / $denom))
|
||||
min: MIN((TCC_NORMAL_EVICT_sum / $denom))
|
||||
max: MAX((TCC_NORMAL_EVICT_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
tips:
|
||||
Evict (vL1D Req):
|
||||
avg: AVG((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
|
||||
min: MIN((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
|
||||
max: MAX((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
tips:
|
||||
NC Req:
|
||||
avg: AVG((TCC_NC_REQ_sum / $denom))
|
||||
min: MIN((TCC_NC_REQ_sum / $denom))
|
||||
max: MAX((TCC_NC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
UC Req:
|
||||
avg: AVG((TCC_UC_REQ_sum / $denom))
|
||||
min: MIN((TCC_UC_REQ_sum / $denom))
|
||||
max: MAX((TCC_UC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
CC Req:
|
||||
avg: AVG((TCC_CC_REQ_sum / $denom))
|
||||
min: MIN((TCC_CC_REQ_sum / $denom))
|
||||
max: MAX((TCC_CC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
RW Req:
|
||||
avg: AVG((TCC_RW_REQ_sum / $denom))
|
||||
min: MIN((TCC_RW_REQ_sum / $denom))
|
||||
max: MAX((TCC_RW_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 1704
|
||||
title: L2 Cache Stalls
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
|
||||
- metric_table:
|
||||
id: 1705
|
||||
title: L2 - Fabric Interface Stalls
|
||||
header:
|
||||
metric: Metric
|
||||
type: Type
|
||||
transaction: Transaction
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
style:
|
||||
type: simple_multi_bar
|
||||
metric:
|
||||
Write - Credit Starvation:
|
||||
type: Credit Starvation
|
||||
transaction: Write
|
||||
avg: AVG(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
min: MIN(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
max: MAX(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 1706
|
||||
title: L2 - Fabric Detailed Transaction Breakdown
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Read (32B):
|
||||
avg: AVG((TCC_EA_RDREQ_32B_sum / $denom))
|
||||
min: MIN((TCC_EA_RDREQ_32B_sum / $denom))
|
||||
max: MAX((TCC_EA_RDREQ_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Read (64B):
|
||||
avg: AVG(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
|
||||
min: MIN(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
|
||||
max: MAX(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Read (Uncached):
|
||||
avg: AVG((TCC_EA_RD_UNCACHED_32B_sum / $denom))
|
||||
min: MIN((TCC_EA_RD_UNCACHED_32B_sum / $denom))
|
||||
max: MAX((TCC_EA_RD_UNCACHED_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
HBM Read:
|
||||
avg: AVG((TCC_EA_RDREQ_DRAM_sum / $denom))
|
||||
min: MIN((TCC_EA_RDREQ_DRAM_sum / $denom))
|
||||
max: MAX((TCC_EA_RDREQ_DRAM_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Remote Read:
|
||||
avg: AVG((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
|
||||
min: MIN((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
|
||||
max: MAX((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Write and Atomic (32B):
|
||||
avg: AVG(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom))
|
||||
min: MIN(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom))
|
||||
max: MAX(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Write and Atomic (Uncached):
|
||||
avg: AVG((TCC_EA_WR_UNCACHED_32B_sum / $denom))
|
||||
min: MIN((TCC_EA_WR_UNCACHED_32B_sum / $denom))
|
||||
max: MAX((TCC_EA_WR_UNCACHED_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Write and Atomic (64B):
|
||||
avg: AVG((TCC_EA_WRREQ_64B_sum / $denom))
|
||||
min: MIN((TCC_EA_WRREQ_64B_sum / $denom))
|
||||
max: MAX((TCC_EA_WRREQ_64B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
HBM Write and Atomic:
|
||||
avg: AVG((TCC_EA_WRREQ_DRAM_sum / $denom))
|
||||
min: MIN((TCC_EA_WRREQ_DRAM_sum / $denom))
|
||||
max: MAX((TCC_EA_WRREQ_DRAM_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Remote Write and Atomic:
|
||||
avg: AVG((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
|
||||
min: MIN((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
|
||||
max: MAX((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Atomic:
|
||||
avg: AVG((TCC_EA_ATOMIC_sum / $denom))
|
||||
min: MIN((TCC_EA_ATOMIC_sum / $denom))
|
||||
max: MAX((TCC_EA_ATOMIC_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
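As a reading aid for the request-breakdown expressions in the table above, here is a minimal Python sketch. It assumes a single set of already-summed counter values. The function name `fabric_read_breakdown` and the sample numbers are hypothetical, and the AVG/MIN/MAX wrappers and the `$denom` substitution done by the tool are left out (`denom` is just a plain argument here).

```python
def fabric_read_breakdown(rdreq, rdreq_32b, rdreq_dram, denom=1):
    """Split L2-Fabric read requests following the table 1706 expressions.

    rdreq      -- stands in for TCC_EA_RDREQ_sum (all read requests)
    rdreq_32b  -- stands in for TCC_EA_RDREQ_32B_sum (32B read requests)
    rdreq_dram -- stands in for TCC_EA_RDREQ_DRAM_sum (reads routed to local HBM)
    denom      -- hypothetical stand-in for the normalization denominator
    """
    return {
        "Read (32B)": rdreq_32b / denom,
        # 64B reads are not counted directly; they are total minus 32B.
        "Read (64B)": (rdreq - rdreq_32b) / denom,
        "HBM Read": rdreq_dram / denom,
        # Remote reads are total minus HBM, clamped at zero as in the config.
        "Remote Read": max(rdreq - rdreq_dram, 0) / denom,
    }


if __name__ == "__main__":
    # Made-up counter values, for illustration only.
    print(fabric_read_breakdown(rdreq=1000, rdreq_32b=400, rdreq_dram=900))
```

The `max(..., 0)` clamp mirrors the `MAX((... - ...), 0)` term in the Remote Read expressions, so a DRAM count that exceeds the total request count yields zero rather than a negative value.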
+536
@@ -0,0 +1,536 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 1700
|
||||
title: L2 Cache
|
||||
metrics_description:
|
||||
Utilization: The ratio of the number of cycles an L2 channel was active, summed
|
||||
over all L2 channels on the accelerator over the total L2 cycles.
|
||||
Peak Bandwidth: The number of bytes looked up in the L2 cache, as a percent of
|
||||
the peak theoretical bandwidth achievable on the specific accelerator. The number
|
||||
of bytes is calculated as the number of cache lines requested multiplied by
|
||||
the cache line size. This value does not consider partial requests, so for example,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line.
|
||||
Hit Rate: The ratio of the number of L2 cache line requests that hit in the L2
|
||||
cache over the total number of incoming cache line requests to the L2 cache.
|
||||
L2-Fabric Read BW: The number of bytes read by the L2 over the Infinity Fabric
|
||||
interface per unit time.
|
||||
L2-Fabric Write and Atomic BW: The number of bytes sent by the L2 over the Infinity
|
||||
Fabric interface by write and atomic operations per unit time.
|
||||
HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
|
||||
memory (HBM) per unit time. This value is calculated as the number of HBM channels
|
||||
multiplied by the HBM channel width multiplied by the HBM clock frequency.
|
||||
Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per
|
||||
normalization unit.
|
||||
HBM Read Traffic: The percent of read requests generated by the L2 cache that
|
||||
are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
|
||||
does not consider the size of the request (meaning that 32B and 64B requests
|
||||
are both counted as a single request), so this metric only approximates the
|
||||
percent of the L2-Fabric Read bandwidth directed to the local HBM.
|
||||
Remote Read Traffic: The percent of read requests generated by the L2 cache that
|
||||
are routed to any memory location other than the accelerator's local high-bandwidth
|
||||
memory (HBM) - for example, the CPU's DRAM or a remote accelerator's HBM. This
|
||||
breakdown does not consider the size of the request (meaning that 32B and 64B
|
||||
requests are both counted as a single request), so this metric only approximates
|
||||
the percent of the L2-Fabric Read bandwidth directed to a remote location.
|
||||
Uncached Read Traffic: The percent of read requests generated by the L2 cache
|
||||
that are reading from an uncached memory allocation. Note, as described in the
|
||||
request flow section, a single 64B read request is typically counted as two
|
||||
uncached read requests. So, it is possible for the Uncached Read Traffic to
|
||||
reach up to 200% of the total number of read requests. This breakdown does not
|
||||
consider the size of the request (i.e., 32B and 64B requests are both counted
|
||||
as a single request), so this metric only approximates the percent of the L2-Fabric
|
||||
read bandwidth directed to an uncached memory location.
|
||||
Write and Atomic BW: The total number of bytes written by the L2 over Infinity
|
||||
Fabric by write and atomic operations per normalization unit. Note that on current
|
||||
CDNA accelerators, such as the MI2XX, requests are only considered atomic by
|
||||
Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
|
||||
fine-grained memory allocations or uncached memory allocations on the MI2XX.
|
||||
HBM Write and Atomic Traffic: The percent of write and atomic requests generated
|
||||
by the L2 cache that are routed to the accelerator's local high-bandwidth memory
|
||||
(HBM). This breakdown does not consider the size of the request (meaning that
|
||||
32B and 64B requests are both counted as a single request), so this metric only
|
||||
approximates the percent of the L2-Fabric Write and Atomic bandwidth directed
|
||||
to the local HBM. Note that on current CDNA accelerators, such as the MI2XX,
|
||||
requests are only considered atomic by Infinity Fabric if they are targeted
|
||||
at fine-grained memory allocations or uncached memory allocations.
|
||||
Remote Write and Atomic Traffic: The percent of write and atomic requests generated by the
|
||||
L2 cache that are routed to any memory location other than the accelerator's
|
||||
local high-bandwidth memory (HBM) - for example, the CPU's DRAM or a remote
|
||||
accelerator's HBM. This breakdown does not consider the size of the request
|
||||
(meaning that 32B and 64B requests are both counted as a single request), so
|
||||
this metric only approximates the percent of the L2-Fabric Write and Atomic bandwidth directed
|
||||
to a remote location. Note that on current CDNA accelerators, such as the MI2XX,
|
||||
requests are only considered atomic by Infinity Fabric if they are targeted
|
||||
at fine-grained memory allocations or uncached memory allocations.
|
||||
Atomic Traffic: The percent of write requests generated by the L2 cache that are
|
||||
atomic requests to any memory location. This breakdown does not consider the
|
||||
size of the request (meaning that 32B and 64B requests are both counted as a
|
||||
single request), so this metric only approximates the percent of the L2-Fabric
|
||||
Write and Atomic bandwidth used by atomic requests. Note that on current CDNA accelerators,
|
||||
such as the MI2XX, requests are only considered atomic by Infinity Fabric if
|
||||
they are targeted at fine-grained memory allocations or uncached memory allocations.
|
||||
Uncached Write and Atomic Traffic: The percent of write and atomic requests generated
|
||||
by the L2 cache that are targeting uncached memory allocations. This breakdown
|
||||
does not consider the size of the request (meaning that 32B and 64B requests
|
||||
are both counted as a single request), so this metric only approximates the
|
||||
percent of the L2-Fabric Write and Atomic bandwidth directed to uncached memory allocations.
|
||||
Read Latency: The time-averaged number of cycles read requests spent in Infinity
|
||||
Fabric before data was returned to the L2.
|
||||
Write and Atomic Latency: The time-averaged number of cycles write requests spent
|
||||
in Infinity Fabric before a completion acknowledgement was returned to the L2.
|
||||
Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
|
||||
Fabric before a completion acknowledgement (atomic without return value) or
|
||||
data (atomic with return value) was returned to the L2.
|
||||
Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit.
|
||||
The number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so for
|
||||
example, if only a single value is requested in a cache line, the data movement
|
||||
will still be counted as a full cache line.
|
||||
Req: The total number of incoming requests to the L2 from all clients for all
|
||||
request types, per normalization unit.
|
||||
Read Req: The total number of read requests to the L2 from all clients.
|
||||
Write Req: The total number of write requests to the L2 from all clients.
|
||||
Atomic Req: The total number of atomic requests (with and without return) to the
|
||||
L2 from all clients.
|
||||
Streaming Req: The total number of incoming requests to the L2 that are marked
|
||||
as streaming. The exact meaning of this may differ depending on the targeted
|
||||
accelerator; on an MI2XX, however, this corresponds to non-temporal loads or stores.
|
||||
The L2 cache attempts to evict streaming requests before normal requests when
|
||||
the L2 is at capacity.
|
||||
Probe Req: The number of coherence probe requests made to the L2 cache from outside
|
||||
the accelerator. On an MI2XX, probe requests may be generated by, for example,
|
||||
writes to fine-grained device memory or by writes to coarse-grained device memory.
|
||||
Cache Hit: The ratio of the number of L2 cache line requests that hit in the L2
|
||||
cache over the total number of incoming cache line requests to the L2 cache.
|
||||
Hits: The total number of requests to the L2 from all clients that hit in the
|
||||
cache. As noted in the Speed-of-Light section, this includes hit-on-miss requests.
|
||||
Misses: The total number of requests to the L2 from all clients that miss in the
|
||||
cache. As noted in the Speed-of-Light section, these do not include hit-on-miss
|
||||
requests.
|
||||
Writeback: The total number of L2 cache lines written back to memory for any reason.
|
||||
Write-backs may occur due to user code (such as HIP kernel calls to __threadfence_system
|
||||
or atomic built-ins), by the command processor's memory acquire/release fences,
|
||||
or for other internal hardware reasons.
|
||||
Writeback (Internal): The total number of L2 cache lines written back to memory
|
||||
for internal hardware reasons, per normalization unit.
|
||||
Writeback (vL1D Req): The total number of L2 cache lines written back to memory
|
||||
due to requests initiated by the vL1D cache, per normalization unit.
|
||||
Evict (Internal): The total number of L2 cache lines evicted from the cache due
|
||||
to capacity limits, per normalization unit.
|
||||
Evict (vL1D Req): The total number of L2 cache lines evicted from the cache due
|
||||
to invalidation requests initiated by the vL1D cache, per normalization unit.
|
||||
NC Req: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory
|
||||
allocations, per normalization unit.
|
||||
UC Req: The total number of requests to the L2 that go to Uncached (UC) memory
|
||||
allocations.
|
||||
CC Req: The total number of requests to the L2 that go to Coherently Cacheable
|
||||
(CC) memory allocations.
|
||||
RW Req: The total number of requests to the L2 that go to Read-Write coherent
|
||||
memory (RW) allocations.
|
||||
Write - Credit Starvation: The number of cycles the L2-Fabric interface was stalled
|
||||
on write or atomic requests to any memory location because too many write/atomic
|
||||
requests were currently in flight, as a percent of the total active L2 cycles.
|
||||
Read (32B): The total number of L2 requests to Infinity Fabric to read 32B of
|
||||
data from any memory location, per normalization unit.
|
||||
Read (64B): The total number of L2 requests to Infinity Fabric to read 64B of
|
||||
data from any memory location, per normalization unit.
|
||||
Read (Uncached): The total number of L2 requests to Infinity Fabric to read uncached
|
||||
data from any memory location, per normalization unit. 64B requests for uncached
|
||||
data are counted as two 32B uncached data requests.
|
||||
HBM Read: The total number of L2 requests to Infinity Fabric to read 32B or 64B
|
||||
of data from the accelerator's local HBM, per normalization unit.
|
||||
Remote Read: The total number of L2 requests to Infinity Fabric to read 32B or
|
||||
64B of data from any source other than the accelerator's local HBM, per normalization
|
||||
unit.
|
||||
Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
|
||||
write or atomically update 32B of data to any memory location, per normalization
|
||||
unit.
|
||||
Write and Atomic (Uncached): The total number of L2 requests to Infinity Fabric
|
||||
to write or atomically update 32B or 64B of uncached data, per normalization
|
||||
unit.
|
||||
Write and Atomic (64B): The total number of L2 requests to Infinity Fabric to
|
||||
write or atomically update 64B of data in any memory location, per normalization
|
||||
unit.
|
||||
HBM Write and Atomic: The total number of L2 requests to Infinity Fabric to write
|
||||
or atomically update 32B or 64B of data in the accelerator's local HBM, per
|
||||
normalization unit.
|
||||
Remote Write and Atomic: The total number of L2 requests to Infinity Fabric to
|
||||
write or atomically update 32B or 64B of data in any memory location other than
|
||||
the accelerator's local HBM, per normalization unit.
|
||||
Atomic: The total number of L2 requests to Infinity Fabric to atomically update
|
||||
32B or 64B of data in any memory location, per normalization unit. See Request
|
||||
flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
|
||||
requests are only considered atomic by Infinity Fabric if they are targeted
|
||||
at non-write-cacheable memory, such as fine-grained memory allocations or uncached
|
||||
memory allocations on the MI2XX.
|
||||
Read Stall: "The ratio of the total number of cycles the L2-Fabric interface was\
|
||||
\ stalled on a read request to any destination (local HBM, remote PCIe\xAE connected\
|
||||
\ accelerator or CPU, or remote Infinity Fabric connected accelerator or CPU)\
|
||||
\ over the total active L2 cycles."
|
||||
Write Stall: The ratio of the total number of cycles the L2-Fabric interface was
|
||||
stalled on a write or atomic request to any destination (local HBM, remote PCIe connected
accelerator or CPU, or remote Infinity Fabric connected
|
||||
accelerator or CPU) over the total active L2 cycles.
|
||||
Read - PCIe Stall: The number of cycles the L2-Fabric interface was stalled on
|
||||
read requests to remote PCIe connected accelerators or CPUs as a percent of
|
||||
the total active L2 cycles.
|
||||
Read - Infinity Fabric Stall: The number of cycles the L2-Fabric interface was
|
||||
stalled on read requests to remote Infinity Fabric connected accelerators or
|
||||
CPUs as a percent of the total active L2 cycles.
|
||||
Read - HBM Stall: The number of cycles the L2-Fabric interface was stalled on
|
||||
read requests to the accelerator's local HBM as a percent of the total active
|
||||
L2 cycles.
|
||||
Write - PCIe Stall: The number of cycles the L2-Fabric interface was stalled on
|
||||
write or atomic requests to remote PCIe connected accelerators or CPUs as a
|
||||
percent of the total active L2 cycles.
|
||||
Write - Infinity Fabric Stall: The number of cycles the L2-Fabric interface was
|
||||
stalled on write or atomic requests to remote Infinity Fabric connected accelerators
|
||||
or CPUs as a percent of the total active L2 cycles.
|
||||
Write - HBM Stall: The number of cycles the L2-Fabric interface was stalled on
|
||||
write or atomic requests to the accelerator's local HBM as a percent of the total
|
||||
active L2 cycles.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1701
|
||||
title: L2 Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
metric:
|
||||
Utilization:
|
||||
value: AVG(((TCC_BUSY_sum * 100) / (TO_INT($total_l2_chan) * $GRBM_GUI_ACTIVE_PER_XCD)))
|
||||
unit: pct
|
||||
Peak Bandwidth:
|
||||
value: ((100 * AVG(((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk / 1000) * 64) * TO_INT($total_l2_chan)))
|
||||
unit: pct
|
||||
Hit Rate:
|
||||
value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else 0))
|
||||
unit: pct
|
||||
L2-Fabric Read BW:
|
||||
value: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
L2-Fabric Write and Atomic BW:
|
||||
value: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
|
||||
* 32)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
HBM Bandwidth:
|
||||
value: $hbmBandwidth
|
||||
unit: GB/s
|
||||
- metric_table:
|
||||
id: 1702
|
||||
title: L2-Fabric interface metrics
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Read BW:
|
||||
avg: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / $denom))
|
||||
min: MIN((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / $denom))
|
||||
max: MAX((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
HBM Read Traffic:
|
||||
avg: AVG((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Remote Read Traffic:
|
||||
avg: AVG((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum)
|
||||
if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
min: MIN((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum)
|
||||
if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
max: MAX((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum)
|
||||
if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
unit: pct
|
||||
Uncached Read Traffic:
|
||||
avg: AVG((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Write and Atomic BW:
|
||||
avg: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
min: MIN((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
max: MAX((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
HBM Write and Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Remote Write and Atomic Traffic:
|
||||
avg: AVG((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum)
|
||||
if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
min: MIN((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum)
|
||||
if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
max: MAX((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum)
|
||||
if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
unit: pct
|
||||
Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Uncached Write and Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Read Latency:
|
||||
avg: AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
Write and Atomic Latency:
|
||||
avg: AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
Atomic Latency:
|
||||
avg: AVG(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
- metric_table:
|
||||
id: 1703
|
||||
title: L2 Cache Accesses
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Bandwidth:
|
||||
avg: AVG((TCC_REQ_sum * 64) / $denom)
|
||||
min: MIN((TCC_REQ_sum * 64) / $denom)
|
||||
max: MAX((TCC_REQ_sum * 64) / $denom)
|
||||
unit: (Bytes + $normUnit)
|
||||
Req:
|
||||
avg: AVG((TCC_REQ_sum / $denom))
|
||||
min: MIN((TCC_REQ_sum / $denom))
|
||||
max: MAX((TCC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read Req:
|
||||
avg: AVG((TCC_READ_sum / $denom))
|
||||
min: MIN((TCC_READ_sum / $denom))
|
||||
max: MAX((TCC_READ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Write Req:
|
||||
avg: AVG((TCC_WRITE_sum / $denom))
|
||||
min: MIN((TCC_WRITE_sum / $denom))
|
||||
max: MAX((TCC_WRITE_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Atomic Req:
|
||||
avg: AVG((TCC_ATOMIC_sum / $denom))
|
||||
min: MIN((TCC_ATOMIC_sum / $denom))
|
||||
max: MAX((TCC_ATOMIC_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Streaming Req:
|
||||
avg: AVG((TCC_STREAMING_REQ_sum / $denom))
|
||||
min: MIN((TCC_STREAMING_REQ_sum / $denom))
|
||||
max: MAX((TCC_STREAMING_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Probe Req:
|
||||
avg: AVG((TCC_PROBE_sum / $denom))
|
||||
min: MIN((TCC_PROBE_sum / $denom))
|
||||
max: MAX((TCC_PROBE_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Cache Hit:
|
||||
avg: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
min: MIN((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
max: MAX((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
unit: pct
|
||||
Hits:
|
||||
avg: AVG((TCC_HIT_sum / $denom))
|
||||
min: MIN((TCC_HIT_sum / $denom))
|
||||
max: MAX((TCC_HIT_sum / $denom))
|
||||
unit: (Hits + $normUnit)
|
||||
Misses:
|
||||
avg: AVG((TCC_MISS_sum / $denom))
|
||||
min: MIN((TCC_MISS_sum / $denom))
|
||||
max: MAX((TCC_MISS_sum / $denom))
|
||||
unit: (Misses + $normUnit)
|
||||
Writeback:
|
||||
avg: AVG((TCC_WRITEBACK_sum / $denom))
|
||||
min: MIN((TCC_WRITEBACK_sum / $denom))
|
||||
max: MAX((TCC_WRITEBACK_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
Writeback (Internal):
|
||||
avg: AVG((TCC_NORMAL_WRITEBACK_sum / $denom))
|
||||
min: MIN((TCC_NORMAL_WRITEBACK_sum / $denom))
|
||||
max: MAX((TCC_NORMAL_WRITEBACK_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
Writeback (vL1D Req):
|
||||
avg: AVG((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
|
||||
min: MIN((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
|
||||
max: MAX((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
Evict (Internal):
|
||||
avg: AVG((TCC_NORMAL_EVICT_sum / $denom))
|
||||
min: MIN((TCC_NORMAL_EVICT_sum / $denom))
|
||||
max: MAX((TCC_NORMAL_EVICT_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
Evict (vL1D Req):
|
||||
avg: AVG((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
|
||||
min: MIN((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
|
||||
max: MAX((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
NC Req:
|
||||
avg: AVG((TCC_NC_REQ_sum / $denom))
|
||||
min: MIN((TCC_NC_REQ_sum / $denom))
|
||||
max: MAX((TCC_NC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
UC Req:
|
||||
avg: AVG((TCC_UC_REQ_sum / $denom))
|
||||
min: MIN((TCC_UC_REQ_sum / $denom))
|
||||
max: MAX((TCC_UC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
CC Req:
|
||||
avg: AVG((TCC_CC_REQ_sum / $denom))
|
||||
min: MIN((TCC_CC_REQ_sum / $denom))
|
||||
max: MAX((TCC_CC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
RW Req:
|
||||
avg: AVG((TCC_RW_REQ_sum / $denom))
|
||||
min: MIN((TCC_RW_REQ_sum / $denom))
|
||||
max: MAX((TCC_RW_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1704
|
||||
title: L2 Cache Stalls
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric: {}
|
||||
- metric_table:
|
||||
id: 1705
|
||||
title: L2 - Fabric Interface stalls
|
||||
header:
|
||||
metric: Metric
|
||||
type: Type
|
||||
transaction: Transaction
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
style:
|
||||
type: simple_multi_bar
|
||||
metric:
|
||||
Write - Credit Starvation:
|
||||
type: Credit Starvation
|
||||
transaction: Write
|
||||
avg: AVG(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
- metric_table:
|
||||
id: 1706
|
||||
title: L2 - Fabric interface detailed metrics
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Read (32B):
|
||||
avg: AVG((TCC_EA_RDREQ_32B_sum / $denom))
|
||||
min: MIN((TCC_EA_RDREQ_32B_sum / $denom))
|
||||
max: MAX((TCC_EA_RDREQ_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read (64B):
|
||||
avg: AVG(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
|
||||
min: MIN(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
|
||||
max: MAX(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read (Uncached):
|
||||
avg: AVG((TCC_EA_RD_UNCACHED_32B_sum / $denom))
|
||||
min: MIN((TCC_EA_RD_UNCACHED_32B_sum / $denom))
|
||||
max: MAX((TCC_EA_RD_UNCACHED_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
HBM Read:
|
||||
avg: AVG((TCC_EA_RDREQ_DRAM_sum / $denom))
|
||||
min: MIN((TCC_EA_RDREQ_DRAM_sum / $denom))
|
||||
max: MAX((TCC_EA_RDREQ_DRAM_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Remote Read:
|
||||
avg: AVG((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
|
||||
min: MIN((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
|
||||
max: MAX((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Write and Atomic (32B):
|
||||
avg: AVG(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom))
|
||||
min: MIN(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom))
|
||||
max: MAX(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Write and Atomic (Uncached):
|
||||
avg: AVG((TCC_EA_WR_UNCACHED_32B_sum / $denom))
|
||||
min: MIN((TCC_EA_WR_UNCACHED_32B_sum / $denom))
|
||||
max: MAX((TCC_EA_WR_UNCACHED_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Write and Atomic (64B):
|
||||
avg: AVG((TCC_EA_WRREQ_64B_sum / $denom))
|
||||
min: MIN((TCC_EA_WRREQ_64B_sum / $denom))
|
||||
max: MAX((TCC_EA_WRREQ_64B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
HBM Write and Atomic:
|
||||
avg: AVG((TCC_EA_WRREQ_DRAM_sum / $denom))
|
||||
min: MIN((TCC_EA_WRREQ_DRAM_sum / $denom))
|
||||
max: MAX((TCC_EA_WRREQ_DRAM_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Remote Write and Atomic:
|
||||
avg: AVG((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
|
||||
min: MIN((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
|
||||
max: MAX((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Atomic:
|
||||
avg: AVG((TCC_EA_ATOMIC_sum / $denom))
|
||||
min: MIN((TCC_EA_ATOMIC_sum / $denom))
|
||||
max: MAX((TCC_EA_ATOMIC_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
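Two patterns recur in the expressions of the generated file above: a ratio guarded against a zero denominator, and a byte count built from 32B and 64B request counters. The following Python sketch illustrates both, assuming plain numeric inputs. The helper names `pct_or_none` and `fabric_read_bytes` are hypothetical and are not part of the tool, and the sample values are made up.

```python
def pct_or_none(numerator, denominator):
    """Percent ratio that yields None when the denominator is zero.

    Mirrors expressions such as
    100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None
    """
    return 100 * (numerator / denominator) if denominator != 0 else None


def fabric_read_bytes(rdreq, rdreq_32b):
    """Bytes read over the fabric: 32B requests plus the remaining 64B requests.

    Mirrors the numerator of the Read BW expressions:
    (TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) * 64)
    """
    return rdreq_32b * 32 + (rdreq - rdreq_32b) * 64


if __name__ == "__main__":
    print(pct_or_none(900, 1200))        # share of reads routed to local HBM
    print(pct_or_none(1, 0))             # no read requests: None, not an error
    print(fabric_read_bytes(1000, 400))  # 400 x 32B + 600 x 64B
```

Because a 64B uncached read is described above as typically counting as two uncached 32B requests, `pct_or_none(20, 10)` gives 200.0, which matches the note that Uncached Read Traffic can reach up to 200%.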
-350
@@ -1,350 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 1800
|
||||
title: L2 Cache (per Channel)
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1801
|
||||
title: Aggregate Stats (All channels)
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
std dev: Std Dev
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
L2 Cache Hit Rate:
|
||||
avg: AVG(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
|
||||
+ (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100 *
|
||||
TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
|
||||
+ (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) + (100
|
||||
* TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100 * TCC_HIT[15]))
|
||||
+ (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 * TCC_HIT[18])) + (100
|
||||
* TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21])) + (100 * TCC_HIT[22]))
|
||||
+ (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) + (100 * TCC_HIT[25])) + (100
|
||||
* TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100 * TCC_HIT[28])) + (100 * TCC_HIT[29]))
|
||||
+ (100 * TCC_HIT[30])) + (100 * TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0]
|
||||
+ TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
|
||||
+ (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
|
||||
+ TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
|
||||
+ (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
|
||||
+ TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
|
||||
+ (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
|
||||
+ TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
|
||||
+ (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
|
||||
+ TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
|
||||
+ (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
|
||||
+ TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
|
||||
+ (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
|
||||
+ TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
|
||||
+ TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
|
||||
+ (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
|
||||
+ TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
|
||||
+ (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
|
||||
+ TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
|
||||
+ (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
|
||||
+ TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
|
||||
+ (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
|
||||
+ TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
|
||||
+ (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
|
||||
+ TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
|
||||
+ (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
|
||||
+ TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
|
||||
std dev: STD(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
|
||||
+ (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100 *
|
||||
TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
|
||||
+ (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) + (100
|
||||
* TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100 * TCC_HIT[15]))
|
||||
+ (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 * TCC_HIT[18])) + (100
|
||||
* TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21])) + (100 * TCC_HIT[22]))
|
||||
+ (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) + (100 * TCC_HIT[25])) + (100
|
||||
* TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100 * TCC_HIT[28])) + (100 * TCC_HIT[29]))
|
||||
+ (100 * TCC_HIT[30])) + (100 * TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0]
|
||||
+ TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
|
||||
+ (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
|
||||
+ TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
|
||||
+ (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
|
||||
+ TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
|
||||
+ (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
|
||||
+ TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
|
||||
+ (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
|
||||
+ TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
|
||||
+ (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
|
||||
+ TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
|
||||
+ (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
|
||||
+ TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
|
||||
+ TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
|
||||
+ (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
|
||||
+ TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
|
||||
+ (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
|
||||
+ TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
|
||||
+ (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
|
||||
+ TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
|
||||
+ (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
|
||||
+ TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
|
||||
+ (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
|
||||
+ TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
|
||||
+ (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
|
||||
+ TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
|
||||
min: MIN(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
|
||||
+ (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100 *
|
||||
TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
|
||||
+ (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) + (100
|
||||
* TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100 * TCC_HIT[15]))
|
||||
+ (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 * TCC_HIT[18])) + (100
|
||||
* TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21])) + (100 * TCC_HIT[22]))
|
||||
+ (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) + (100 * TCC_HIT[25])) + (100
|
||||
* TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100 * TCC_HIT[28])) + (100 * TCC_HIT[29]))
|
||||
+ (100 * TCC_HIT[30])) + (100 * TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0]
|
||||
+ TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
|
||||
+ (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
|
||||
+ TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
|
||||
+ (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
|
||||
+ TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
|
||||
+ (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
|
||||
+ TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
|
||||
+ (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
|
||||
+ TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
|
||||
+ (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
|
||||
+ TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
|
||||
+ (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
|
||||
+ TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
|
||||
+ TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
|
||||
+ (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
|
||||
+ TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
|
||||
+ (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
|
||||
+ TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
|
||||
+ (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
|
||||
+ TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
|
||||
+ (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
|
||||
+ TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
|
||||
+ (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
|
||||
+ TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
|
||||
+ (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
|
||||
+ TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
|
||||
max: MAX(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
|
||||
+ (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100 *
|
||||
TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
|
||||
+ (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) + (100
|
||||
* TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100 * TCC_HIT[15]))
|
||||
+ (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 * TCC_HIT[18])) + (100
|
||||
* TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21])) + (100 * TCC_HIT[22]))
|
||||
+ (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) + (100 * TCC_HIT[25])) + (100
|
||||
* TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100 * TCC_HIT[28])) + (100 * TCC_HIT[29]))
|
||||
+ (100 * TCC_HIT[30])) + (100 * TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0]
|
||||
+ TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
|
||||
+ (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
|
||||
+ TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
|
||||
+ (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
|
||||
+ TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
|
||||
+ (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
|
||||
+ TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
|
||||
+ (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
|
||||
+ TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
|
||||
+ (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
|
||||
+ TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
|
||||
+ (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
|
||||
+ TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
|
||||
+ TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
|
||||
+ (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
|
||||
+ TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
|
||||
+ (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
|
||||
+ TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
|
||||
+ (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
|
||||
+ TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
|
||||
+ (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
|
||||
+ TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
|
||||
+ (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
|
||||
+ TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
|
||||
+ (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
|
||||
+ TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
# FIXME: other aggr metrics!!
|
||||
|
||||
- metric_table:
|
||||
id: 1802
|
||||
title: L2 Cache Hit Rate (pct)
|
||||
header:
|
||||
metric: Channel
|
||||
expr: Expression
|
||||
metric:
|
||||
"::_1":
|
||||
expr:
|
||||
(((100 * TCC_HIT[::_1]) / (TCC_HIT[::_1] + TCC_MISS[::_1])) if ((TCC_HIT[::_1]
|
||||
+ TCC_MISS[::_1]) != 0) else None)
|
||||
placeholder_range:
|
||||
"::_1": $total_l2_chan
|
||||
cli_style: simple_box
|
||||
|
||||
- metric_table:
|
||||
id: 1803
|
||||
title: L2 Requests (per normUnit)
|
||||
header:
|
||||
metric: Channel
|
||||
expr: Expression
|
||||
metric:
|
||||
"::_1":
|
||||
expr: (TO_INT(TCC_REQ[::_1]) / $denom)
|
||||
placeholder_range:
|
||||
"::_1": $total_l2_chan
|
||||
cli_style: simple_box
|
||||
|
||||
- metric_table:
|
||||
id: 1804
|
||||
title: L2 Requests (per normUnit)
|
||||
header:
|
||||
metric: Channel
|
||||
read req: L2 Read
|
||||
write req: L2 Write
|
||||
atomic req: L2 Atomic
|
||||
metric:
|
||||
"::_1":
|
||||
read req: AVG((TO_INT(TCC_READ[::_1]) / $denom))
|
||||
write req: AVG((TO_INT(TCC_WRITE[::_1]) / $denom))
|
||||
atomic req: AVG((TO_INT(TCC_ATOMIC[::_1]) / $denom))
|
||||
placeholder_range:
|
||||
"::_1": $total_l2_chan
|
||||
cli_style: simple_multiple_bar
|
||||
|
||||
- metric_table:
|
||||
id: 1805
|
||||
title: L2-Fabric Requests (per normUnit)
|
||||
header:
|
||||
metric: Channel
|
||||
read req: L2-Fabric Read
|
||||
write req: L2-Fabric Write and Atomic
|
||||
atomic req: L2-Fabric Atomic
|
||||
metric:
|
||||
"::_1":
|
||||
read req: AVG((TO_INT(TCC_EA_RDREQ[::_1]) / $denom))
|
||||
write req: AVG((TO_INT(TCC_EA_WRREQ[::_1]) / $denom))
|
||||
atomic req: AVG((TO_INT(TCC_EA_ATOMIC[::_1]) / $denom))
|
||||
placeholder_range:
|
||||
"::_1": $total_l2_chan
|
||||
cli_style: simple_multiple_bar
|
||||
|
||||
# - metric_table:
|
||||
# id: 1806
|
||||
# title: L2-EA Latency (Cycles)
|
||||
# header:
|
||||
# metric: Metric
|
||||
# read lat: L2-EA Read
|
||||
# write lat: L2-EA Write
|
||||
# atomic lat: L2-EA Atomic
|
||||
# metric:
|
||||
# "::_1":
|
||||
# read lat:
|
||||
# AVG(((TCC_EA_RDREQ_LEVEL[::_1] / TCC_EA_RDREQ[::_1]) if (TCC_EA_RDREQ[::_1]
|
||||
# != 0) else None))
|
||||
# write lat:
|
||||
# AVG(((TCC_EA_WRREQ_LEVEL[::_1] / TCC_EA_WRREQ[::_1]) if (TCC_EA_WRREQ[::_1]
|
||||
# != 0) else None))
|
||||
# atomic lat:
|
||||
# AVG(((TCC_EA_ATOMIC_LEVEL[::_1] / TCC_EA_ATOMIC[::_1]) if
|
||||
# (TCC_EA_ATOMIC[::_1] != 0) else 0))
|
||||
# placeholder_range:
|
||||
# "::_1": 32
|
||||
# cli_style: simple_multiple_bar
|
||||
|
||||
- metric_table:
|
||||
id: 1806
|
||||
title: L2-Fabric Read Latency (Cycles)
|
||||
header:
|
||||
metric: Channel
|
||||
expr: Expression
|
||||
metric:
|
||||
"::_1":
|
||||
expr:
|
||||
((TCC_EA_RDREQ_LEVEL[::_1] / TCC_EA_RDREQ[::_1]) if (TCC_EA_RDREQ[::_1]
|
||||
!= 0) else None)
|
||||
placeholder_range:
|
||||
"::_1": $total_l2_chan
|
||||
cli_style: simple_box
|
||||
|
||||
- metric_table:
|
||||
id: 1807
|
||||
title: L2-Fabric Write and Atomic Latency (Cycles)
|
||||
header:
|
||||
metric: Channel
|
||||
expr: Expression
|
||||
metric:
|
||||
"::_1":
|
||||
expr:
|
||||
((TCC_EA_WRREQ_LEVEL[::_1] / TCC_EA_WRREQ[::_1]) if (TCC_EA_WRREQ[::_1]
|
||||
!= 0) else None)
|
||||
placeholder_range:
|
||||
"::_1": $total_l2_chan
|
||||
cli_style: simple_box
|
||||
|
||||
- metric_table:
|
||||
id: 1808
|
||||
title: L2-Fabric Atomic Latency (Cycles)
|
||||
header:
|
||||
metric: Channel
|
||||
expr: Expression
|
||||
metric:
|
||||
"::_1":
|
||||
expr: ((TCC_EA_ATOMIC_LEVEL[::_1] / TCC_EA_ATOMIC[::_1]) if
|
||||
(TCC_EA_ATOMIC[::_1] != 0) else 0)
|
||||
placeholder_range:
|
||||
"::_1": $total_l2_chan
|
||||
cli_style: simple_box
|
||||
|
||||
- metric_table:
|
||||
id: 1809
|
||||
title: L2-Fabric Read Stall (Cycles per normUnit)
|
||||
header:
|
||||
metric: Channel
|
||||
ea read stall - pcie: L2-Fabric Read Stall (PCIe)
|
||||
ea read stall - if: L2-Fabric Read Stall (Infinity Fabric™)
|
||||
ea read stall - hbm: L2-Fabric Read Stall (HBM)
|
||||
metric:
|
||||
"::_1":
|
||||
ea read stall - pcie: None # Missing perfmon
|
||||
ea read stall - if: None # Missing perfmon
|
||||
ea read stall - hbm: None # Missing perfmon
|
||||
placeholder_range:
|
||||
"::_1": $total_l2_chan
|
||||
cli_style: simple_multiple_bar
|
||||
|
||||
- metric_table:
|
||||
id: 1810
|
||||
title: L2-Fabric Write and Atomic Stall (Cycles per normUnit)
|
||||
header:
|
||||
metric: Channel
|
||||
ea write stall - pcie: L2-Fabric Write Stall (PCIe)
|
||||
ea write stall - if: L2-Fabric Write Stall (Infinity Fabric™)
|
||||
ea write stall - hbm: L2-Fabric Write Stall (HBM)
|
||||
ea write stall - starve: L2-Fabric Write Starve
|
||||
metric:
|
||||
"::_1":
|
||||
ea write stall - pcie: None # Missing perfmon
|
||||
ea write stall - if: None # Missing perfmon
|
||||
ea write stall - hbm: None # Missing perfmon
|
||||
ea write stall - starve: AVG((TO_INT(TCC_TOO_MANY_EA_WRREQS_STALL[::_1]) / $denom))
|
||||
placeholder_range:
|
||||
"::_1": $total_l2_chan
|
||||
cli_style: simple_multiple_bar
|
||||
|
||||
- metric_table:
|
||||
id: 1812
|
||||
title: L2-Fabric (128B read requests per normUnit)
|
||||
header:
|
||||
metric: Channel
|
||||
expr: Expression
|
||||
metric:
|
||||
"::_1":
|
||||
expr: (TO_INT(TCC_BUBBLE[::_1]) / $denom)
|
||||
placeholder_range:
|
||||
"::_1": $total_l2_chan
|
||||
# tips: Number of 128-byte read requests sent to EA
|
||||
cli_style: simple_box
|
||||
tui_style: simple_box
|
||||
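A note for readers of this diff: the per-channel tables above use a `::_1` placeholder, which `placeholder_range` appears to expand once per L2 channel. The long `L2 Cache Hit Rate` aggregate reads as 100 * (total hits) / (total hits + total misses) summed over all channels, yielding `None` when there were no requests. The Python below is only an illustrative sketch of that reading, not the profiler's actual implementation. `expand_placeholder` and `aggregate_hit_rate` are hypothetical helper names, and the channel count of 4 is an assumed stand-in for `$total_l2_chan`.

```python
from typing import List, Optional, Sequence


def expand_placeholder(template: str, placeholder: str, count: int) -> List[str]:
    """Return one expression per channel index (hypothetical helper).

    Every occurrence of `placeholder` in `template` is replaced by the
    channel index 0 .. count-1.
    """
    return [template.replace(placeholder, str(i)) for i in range(count)]


def aggregate_hit_rate(
    hits: Sequence[float], misses: Sequence[float]
) -> Optional[float]:
    """Hit rate in percent across all channels (hypothetical helper).

    Mirrors the guarded division in the config: the result is None when
    the total number of requests (hits + misses) is zero.
    """
    total = sum(hits) + sum(misses)
    if total == 0:
        return None
    return 100 * sum(hits) / total


# Assumed channel count of 4 in place of $total_l2_chan.
exprs = expand_placeholder("(TO_INT(TCC_REQ[::_1]) / $denom)", "::_1", 4)
for expr in exprs:
    print(expr)

# Two channels with made-up counter values, purely for illustration.
print(aggregate_hit_rate([3, 1], [1, 3]))
```

Writing the aggregate as a sum over two sequences, rather than spelling out every channel index by hand, avoids the kind of index slip that is easy to make in a 32-term expression.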
+323
@@ -0,0 +1,323 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 1800
|
||||
title: L2 Cache (per Channel)
|
||||
metrics_description:
|
||||
L2 Cache Hit Rate: The percentage of the total number of requests to the L2 from all
|
||||
clients that hit in the cache. As noted in the Speed-of-Light section, this
|
||||
includes hit-on-miss requests.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1801
|
||||
title: Aggregate Stats (All channels)
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
std dev: Std Dev
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
L2 Cache Hit Rate:
|
||||
avg: AVG(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
|
||||
+ (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100
|
||||
* TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
|
||||
+ (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) +
|
||||
(100 * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100
|
||||
* TCC_HIT[15])) + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 *
|
||||
TCC_HIT[18])) + (100 * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21]))
|
||||
+ (100 * TCC_HIT[22])) + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) +
|
||||
(100 * TCC_HIT[25])) + (100 * TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100
|
||||
* TCC_HIT[28])) + (100 * TCC_HIT[29])) + (100 * TCC_HIT[30])) + (100 *
|
||||
TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0] + TCC_HIT[0])
|
||||
+ (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2])) + (TCC_MISS[3]
|
||||
+ TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5] + TCC_HIT[5]))
|
||||
+ (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7])) + (TCC_MISS[8]
|
||||
+ TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10] + TCC_HIT[10]))
|
||||
+ (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12])) + (TCC_MISS[13]
|
||||
+ TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15] + TCC_HIT[15]))
|
||||
+ (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17])) + (TCC_MISS[18]
|
||||
+ TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20] + TCC_HIT[20]))
|
||||
+ (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22])) + (TCC_MISS[23]
|
||||
+ TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25] + TCC_HIT[25]))
|
||||
+ (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27])) + (TCC_MISS[28]
|
||||
+ TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30] + TCC_HIT[30]))
|
||||
+ (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
|
||||
+ TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
|
||||
+ (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
|
||||
+ TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
|
||||
+ (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
|
||||
+ TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
|
||||
+ (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
|
||||
+ TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
|
||||
+ (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
|
||||
+ TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
|
||||
+ (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
|
||||
+ TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
|
||||
+ (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
|
||||
+ TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
|
||||
std dev: STD(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100
|
||||
* TCC_HIT[1])) + (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4]))
|
||||
+ (100 * TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100
|
||||
* TCC_HIT[8])) + (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11]))
|
||||
+ (100 * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) +
|
||||
(100 * TCC_HIT[15])) + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100
|
||||
* TCC_HIT[18])) + (100 * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 *
|
||||
TCC_HIT[21])) + (100 * TCC_HIT[22])) + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24]))
|
||||
+ (100 * TCC_HIT[25])) + (100 * TCC_HIT[26])) + (100 * TCC_HIT[27])) +
|
||||
(100 * TCC_HIT[28])) + (100 * TCC_HIT[29])) + (100 * TCC_HIT[30])) + (100
|
||||
* TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0] + TCC_HIT[0])
|
||||
+ (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2])) + (TCC_MISS[3]
|
||||
+ TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5] + TCC_HIT[5]))
|
||||
+ (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7])) + (TCC_MISS[8]
|
||||
+ TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10] + TCC_HIT[10]))
|
||||
+ (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12])) + (TCC_MISS[13]
|
||||
+ TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15] + TCC_HIT[15]))
|
||||
+ (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17])) + (TCC_MISS[18]
|
||||
+ TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20] + TCC_HIT[20]))
|
||||
+ (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22])) + (TCC_MISS[23]
|
||||
+ TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25] + TCC_HIT[25]))
|
||||
+ (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27])) + (TCC_MISS[28]
|
||||
+ TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30] + TCC_HIT[30]))
|
||||
+ (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
|
||||
+ TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
|
||||
+ (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
|
||||
+ TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
|
||||
+ (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
|
||||
+ TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
|
||||
+ (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
|
||||
+ TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
|
||||
+ (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
|
||||
+ TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
|
||||
+ (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
|
||||
+ TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
|
||||
+ (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
|
||||
+ TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
|
||||
min: MIN(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
|
||||
+ (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100
|
||||
* TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
|
||||
+ (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) +
|
||||
(100 * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100
|
||||
* TCC_HIT[15])) + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 *
|
||||
TCC_HIT[18])) + (100 * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21]))
|
||||
+ (100 * TCC_HIT[22])) + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) +
|
||||
(100 * TCC_HIT[25])) + (100 * TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100
|
||||
* TCC_HIT[28])) + (100 * TCC_HIT[29])) + (100 * TCC_HIT[30])) + (100 *
|
||||
TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0] + TCC_HIT[0])
|
||||
+ (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2])) + (TCC_MISS[3]
|
||||
+ TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5] + TCC_HIT[5]))
|
||||
+ (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7])) + (TCC_MISS[8]
|
||||
+ TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10] + TCC_HIT[10]))
|
||||
+ (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12])) + (TCC_MISS[13]
|
||||
+ TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15] + TCC_HIT[15]))
|
||||
+ (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17])) + (TCC_MISS[18]
|
||||
+ TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20] + TCC_HIT[20]))
|
||||
+ (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22])) + (TCC_MISS[23]
|
||||
+ TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25] + TCC_HIT[25]))
|
||||
+ (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27])) + (TCC_MISS[28]
|
||||
+ TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30] + TCC_HIT[30]))
|
||||
+ (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
|
||||
+ TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
|
||||
+ (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
|
||||
+ TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
|
||||
+ (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
|
||||
+ TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
|
||||
+ (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
|
||||
+ TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
|
||||
+ (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
|
||||
+ TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
|
||||
+ (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
|
||||
+ TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
|
||||
+ (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
|
||||
+ TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
|
||||
max: MAX(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
|
||||
+ (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100
|
||||
* TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
|
||||
+ (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) +
|
||||
(100 * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100
|
||||
* TCC_HIT[15])) + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 *
|
||||
TCC_HIT[18])) + (100 * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21]))
|
||||
+ (100 * TCC_HIT[22])) + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) +
|
||||
(100 * TCC_HIT[25])) + (100 * TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100
|
||||
* TCC_HIT[28])) + (100 * TCC_HIT[29])) + (100 * TCC_HIT[30])) + (100 *
|
||||
TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0] + TCC_HIT[0])
|
||||
+ (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2])) + (TCC_MISS[3]
|
||||
+ TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5] + TCC_HIT[5]))
|
||||
+ (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7])) + (TCC_MISS[8]
|
||||
+ TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10] + TCC_HIT[10]))
|
||||
+ (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12])) + (TCC_MISS[13]
|
||||
+ TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15] + TCC_HIT[15]))
|
||||
+ (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17])) + (TCC_MISS[18]
|
||||
+ TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20] + TCC_HIT[20]))
|
||||
+ (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22])) + (TCC_MISS[23]
|
||||
+ TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25] + TCC_HIT[25]))
|
||||
+ (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27])) + (TCC_MISS[28]
|
||||
+ TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30] + TCC_HIT[30]))
|
||||
+ (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
|
||||
+ TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
|
||||
+ (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
|
||||
+ TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
|
||||
+ (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
|
||||
+ TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
|
||||
+ (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
|
||||
+ TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
|
||||
+ (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
|
||||
+ TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
|
||||
+ (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
|
||||
+ TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
|
||||
+ (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
|
||||
+ TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
|
||||
unit: pct
|
||||
- metric_table:
|
||||
id: 1802
|
||||
title: L2 Cache Hit Rate (pct)
|
||||
header:
|
||||
metric: Channel
|
||||
expr: Expression
|
||||
metric:
|
||||
::_1:
|
||||
expr: (((100 * TCC_HIT[::_1]) / (TCC_HIT[::_1] + TCC_MISS[::_1])) if ((TCC_HIT[::_1]
|
||||
+ TCC_MISS[::_1]) != 0) else None)
|
||||
placeholder_range:
|
||||
::_1: $total_l2_chan
|
||||
cli_style: simple_box
|
||||
tui_style: simple_box
|
||||
- metric_table:
|
||||
id: 1803
|
||||
title: L2 Requests (per normUnit)
|
||||
header:
|
||||
metric: Channel
|
||||
expr: Expression
|
||||
metric:
|
||||
::_1:
|
||||
expr: (TO_INT(TCC_REQ[::_1]) / $denom)
|
||||
placeholder_range:
|
||||
::_1: $total_l2_chan
|
||||
cli_style: simple_box
|
||||
tui_style: simple_box
|
||||
- metric_table:
|
||||
id: 1804
|
||||
title: L2 Requests (per normUnit)
|
||||
header:
|
||||
metric: Channel
|
||||
read req: L2 Read
|
||||
write req: L2 Write
|
||||
atomic req: L2 Atomic
|
||||
metric:
|
||||
::_1:
|
||||
read req: AVG((TO_INT(TCC_READ[::_1]) / $denom))
|
||||
write req: AVG((TO_INT(TCC_WRITE[::_1]) / $denom))
|
||||
atomic req: AVG((TO_INT(TCC_ATOMIC[::_1]) / $denom))
|
||||
placeholder_range:
|
||||
::_1: $total_l2_chan
|
||||
cli_style: simple_multiple_bar
|
||||
tui_style: simple_multiple_bar
|
||||
- metric_table:
|
||||
id: 1805
|
||||
title: L2-Fabric Requests (per normUnit)
|
||||
header:
|
||||
metric: Channel
|
||||
read req: L2-Fabric Read
|
||||
write req: L2-Fabric Write and Atomic
|
||||
atomic req: L2-Fabric Atomic
|
||||
metric:
|
||||
::_1:
|
||||
read req: AVG((TO_INT(TCC_EA_RDREQ[::_1]) / $denom))
|
||||
write req: AVG((TO_INT(TCC_EA_WRREQ[::_1]) / $denom))
|
||||
atomic req: AVG((TO_INT(TCC_EA_ATOMIC[::_1]) / $denom))
|
||||
placeholder_range:
|
||||
::_1: $total_l2_chan
|
||||
cli_style: simple_multiple_bar
|
||||
tui_style: simple_multiple_bar
|
||||
- metric_table:
|
||||
id: 1806
|
||||
title: L2-Fabric Read Latency (Cycles)
|
||||
header:
|
||||
metric: Channel
|
||||
expr: Expression
|
||||
metric:
|
||||
::_1:
|
||||
expr: ((TCC_EA_RDREQ_LEVEL[::_1] / TCC_EA_RDREQ[::_1]) if (TCC_EA_RDREQ[::_1]
|
||||
!= 0) else None)
|
||||
placeholder_range:
|
||||
::_1: $total_l2_chan
|
||||
cli_style: simple_box
|
||||
tui_style: simple_box
|
||||
- metric_table:
|
||||
id: 1807
|
||||
title: L2-Fabric Write and Atomic Latency (Cycles)
|
||||
header:
|
||||
metric: Channel
|
||||
expr: Expression
|
||||
metric:
|
||||
::_1:
|
||||
expr: ((TCC_EA_WRREQ_LEVEL[::_1] / TCC_EA_WRREQ[::_1]) if (TCC_EA_WRREQ[::_1]
|
||||
!= 0) else None)
|
||||
placeholder_range:
|
||||
::_1: $total_l2_chan
|
||||
cli_style: simple_box
|
||||
tui_style: simple_box
|
||||
- metric_table:
|
||||
id: 1808
|
||||
title: L2-Fabric Atomic Latency (Cycles)
|
||||
header:
|
||||
metric: Channel
|
||||
expr: Expression
|
||||
metric:
|
||||
::_1:
|
||||
expr: ((TCC_EA_ATOMIC_LEVEL[::_1] / TCC_EA_ATOMIC[::_1]) if (TCC_EA_ATOMIC[::_1]
|
||||
!= 0) else 0)
|
||||
placeholder_range:
|
||||
::_1: $total_l2_chan
|
||||
cli_style: simple_box
|
||||
tui_style: simple_box
|
||||
- metric_table:
|
||||
id: 1809
|
||||
title: L2-Fabric Read Stall (Cycles per normUnit)
|
||||
header:
|
||||
metric: Channel
|
||||
ea read stall - pcie: L2-Fabric Read Stall (PCIe)
|
||||
ea read stall - if: "L2-Fabric Read Stall (Infinity Fabric\u2122)"
|
||||
ea read stall - hbm: L2-Fabric Read Stall (HBM)
|
||||
metric:
|
||||
::_1:
|
||||
ea read stall - pcie: None
|
||||
ea read stall - if: None
|
||||
ea read stall - hbm: None
|
||||
placeholder_range:
|
||||
::_1: $total_l2_chan
|
||||
cli_style: simple_multiple_bar
|
||||
tui_style: simple_multiple_bar
|
||||
- metric_table:
|
||||
id: 1810
|
||||
title: L2-Fabric Write and Atomic Stall (Cycles per normUnit)
|
||||
header:
|
||||
metric: Channel
|
||||
ea write stall - pcie: L2-Fabric Write Stall (PCIe)
|
||||
ea write stall - if: "L2-Fabric Write Stall (Infinity Fabric\u2122)"
|
||||
ea write stall - hbm: L2-Fabric Write Stall (HBM)
|
||||
ea write stall - starve: L2-Fabric Write Starve
|
||||
metric:
|
||||
::_1:
|
||||
ea write stall - pcie: None
|
||||
ea write stall - if: None
|
||||
ea write stall - hbm: None
|
||||
ea write stall - starve: AVG((TO_INT(TCC_TOO_MANY_EA_WRREQS_STALL[::_1])
|
||||
/ $denom))
|
||||
placeholder_range:
|
||||
::_1: $total_l2_chan
|
||||
cli_style: simple_multiple_bar
|
||||
tui_style: simple_multiple_bar
|
||||
- metric_table:
|
||||
id: 1812
|
||||
title: L2-Fabric (128B read requests per normUnit)
|
||||
header:
|
||||
metric: Channel
|
||||
expr: Expression
|
||||
metric:
|
||||
::_1:
|
||||
expr: (TO_INT(TCC_BUBBLE[::_1]) / $denom)
|
||||
placeholder_range:
|
||||
::_1: $total_l2_chan
|
||||
cli_style: simple_box
|
||||
tui_style: simple_box
|
||||
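The per-channel tables above use a `::_1` placeholder whose `placeholder_range` is `$total_l2_chan`. As a rough illustration only, the sketch below assumes the placeholder is replaced by each channel index in turn, and that a zero request count yields `None` as in the read and write latency expressions. The helper names `expand_placeholder` and `guarded_latency` are hypothetical and are not taken from the project.

```python
from typing import List, Optional


def expand_placeholder(expr: str, total_channels: int, token: str = "::_1") -> List[str]:
    """Return one copy of expr per channel, with the token replaced by the channel index.

    Assumption: expansion is a plain textual substitution over 0..total_channels-1.
    """
    return [expr.replace(token, str(chan)) for chan in range(total_channels)]


def guarded_latency(level: float, requests: float) -> Optional[float]:
    """Mirror the guarded division in the latency tables: LEVEL / REQ, or None when REQ is 0."""
    return level / requests if requests != 0 else None


# Two channels are used here purely for illustration.
exprs = expand_placeholder("(TO_INT(TCC_REQ[::_1]) / $denom)", 2)
```

Under these assumptions, a table with `placeholder_range` of N channels would produce N rows, one per expanded expression.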
+7
-6
@@ -1,10 +1,11 @@
|
||||
---
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 2100
|
||||
title: PC Sampling
|
||||
metrics_description: {}
|
||||
data source:
|
||||
- pc_sampling_table:
|
||||
id: 2101
|
||||
title: PC Sampling
|
||||
source: None # not support
|
||||
comparable: false # enable it later
|
||||
- pc_sampling_table:
|
||||
id: 2101
|
||||
title: PC Sampling
|
||||
source: ps_file
|
||||
comparable: false
|
||||
|
||||
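Each generated panel file carries an AUTOGENERATED header naming `utils/unified_config.yaml` as its source. A minimal sketch of how one might check that a generated file is not stale relative to its source, assuming a simple modification-time comparison. `is_up_to_date` is a hypothetical helper and the file names in the demo are placeholders, not the project's actual test.

```python
import os
import tempfile


def is_up_to_date(source_path: str, generated_path: str) -> bool:
    """Return True when the generated file is at least as new as its source."""
    return os.path.getmtime(generated_path) >= os.path.getmtime(source_path)


# Demo with throwaway files and explicitly set modification times.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "unified_config.yaml")
    gen = os.path.join(tmp, "panel.yaml")
    for path in (src, gen):
        with open(path, "w") as handle:
            handle.write("---\n")
    os.utime(src, (1000, 1000))  # source last modified at t=1000
    os.utime(gen, (2000, 2000))  # generated file written later
    fresh = is_up_to_date(src, gen)
    os.utime(src, (3000, 3000))  # source edited after generation
    stale = not is_up_to_date(src, gen)
```

A check like this only detects that the source changed after generation; it does not verify that the generated content matches what the generator would produce.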
+11
-11
@@ -1,14 +1,14 @@
|
||||
---
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 000
|
||||
id: 0
|
||||
title: Top Stats
|
||||
metrics_description: {}
|
||||
data source:
|
||||
- raw_csv_table:
|
||||
id: 001
|
||||
title: Top Kernels
|
||||
source: pmc_kernel_top.csv
|
||||
|
||||
- raw_csv_table:
|
||||
id: 002
|
||||
title: Dispatch List
|
||||
source: pmc_dispatch_info.csv
|
||||
- raw_csv_table:
|
||||
id: 1
|
||||
title: Top Kernels
|
||||
source: pmc_kernel_top.csv
|
||||
- raw_csv_table:
|
||||
id: 2
|
||||
title: Dispatch List
|
||||
source: pmc_dispatch_info.csv
|
||||
|
||||
+6
-5
@@ -1,9 +1,10 @@
|
||||
---
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 100
|
||||
title: System Info
|
||||
metrics_description: {}
|
||||
data source:
|
||||
- raw_csv_table:
|
||||
id: 101
|
||||
source: sysinfo.csv
|
||||
columnwise: True
|
||||
- raw_csv_table:
|
||||
id: 101
|
||||
source: sysinfo.csv
|
||||
columnwise: true
|
||||
|
||||
-254
@@ -1,254 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
SALU: &SALU_anchor Scalar Arithmetic Logic Unit
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 200
|
||||
title: System Speed-of-Light
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 201
|
||||
title: Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
peak: Peak
|
||||
pop: Pct of Peak
|
||||
tips: Tips
|
||||
metric:
|
||||
VALU FLOPs:
|
||||
value: AVG(((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) + SQ_INSTS_VALU_TRANS_F16)
|
||||
+ (2 * SQ_INSTS_VALU_FMA_F16))) + (64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32)
|
||||
+ SQ_INSTS_VALU_TRANS_F32) + (2 * SQ_INSTS_VALU_FMA_F32)))) + (64 * (((SQ_INSTS_VALU_ADD_F64
|
||||
+ SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64) + (2 * SQ_INSTS_VALU_FMA_F64))))
|
||||
/ (End_Timestamp - Start_Timestamp)))
|
||||
unit: GFLOP/s
|
||||
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
|
||||
pop: ((100 * AVG(((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16)
|
||||
+ SQ_INSTS_VALU_TRANS_F16) + (2 * SQ_INSTS_VALU_FMA_F16))) + (64 * (((SQ_INSTS_VALU_ADD_F32
|
||||
+ SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (2 * SQ_INSTS_VALU_FMA_F32))))
|
||||
+ (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
|
||||
+ (2 * SQ_INSTS_VALU_FMA_F64)))) / (End_Timestamp - Start_Timestamp)))) / (((($max_sclk
|
||||
* $cu_per_gpu) * 64) * 2) / 1000))
|
||||
tips:
|
||||
VALU IOPs:
|
||||
value: AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GIOP/s
|
||||
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
|
||||
pop: ((100 * AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) / (End_Timestamp
|
||||
- Start_Timestamp)))) / (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000))
|
||||
tips:
|
||||
MFMA FLOPs (BF16):
|
||||
value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
|
||||
pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk * $cu_per_gpu) * 1024) / 1000))
|
||||
tips:
|
||||
MFMA FLOPs (F16):
|
||||
value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
|
||||
pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk * $cu_per_gpu) * 1024) / 1000))
|
||||
tips:
|
||||
MFMA FLOPs (F32):
|
||||
value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
|
||||
pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk * $cu_per_gpu) * 256) / 1000))
|
||||
tips:
|
||||
MFMA FLOPs (F64):
|
||||
value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
|
||||
pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk * $cu_per_gpu) * 256) / 1000))
|
||||
tips:
|
||||
MFMA IOPs (Int8):
|
||||
value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GIOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
|
||||
pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk * $cu_per_gpu) * 1024) / 1000))
|
||||
tips:
|
||||
Active CUs:
|
||||
value: $numActiveCUs
|
||||
unit: CUs
|
||||
peak: $cu_per_gpu
|
||||
pop: ((100 * $numActiveCUs) / $cu_per_gpu)
|
||||
tips:
|
||||
SALU Utilization:
|
||||
value: AVG(((100 * SQ_ACTIVE_INST_SCA) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG(((100 * SQ_ACTIVE_INST_SCA) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
tips:
|
||||
VALU Utilization:
|
||||
value: AVG(((100 * SQ_ACTIVE_INST_VALU) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG(((100 * SQ_ACTIVE_INST_VALU) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
tips:
|
||||
MFMA Utilization:
|
||||
value: AVG(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / (($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)
|
||||
* 4)))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / (($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)
|
||||
* 4)))
|
||||
tips:
|
||||
VMEM Utilization:
|
||||
value: AVG((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
|
||||
tips:
|
||||
Branch Utilization:
|
||||
value: AVG((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
|
||||
tips:
|
||||
VALU Active Threads:
|
||||
value: AVG(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
|
||||
!= 0) else None))
|
||||
unit: Threads
|
||||
peak: 64
|
||||
pop: (AVG(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
|
||||
!= 0) else None)) * 1.5625)
|
||||
tips:
|
||||
IPC:
|
||||
value: AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))
|
||||
unit: Instr/cycle
|
||||
peak: 5
|
||||
pop: ((100 * AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))) / 5)
|
||||
tips:
|
||||
Wavefront Occupancy:
|
||||
value: AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
unit: Wavefronts
|
||||
peak: ($max_waves_per_cu * $cu_per_gpu)
|
||||
pop: (100 * AVG(((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD) / ($max_waves_per_cu
|
||||
* $cu_per_gpu))))
|
||||
coll_level: SQ_LEVEL_WAVES
|
||||
tips:
|
||||
Theoretical LDS Bandwidth:
|
||||
value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
peak: (($max_sclk * $cu_per_gpu) * 0.128)
|
||||
pop: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
|
||||
tips:
|
||||
LDS Bank Conflicts/Access:
|
||||
value: AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
|
||||
unit: Conflicts/access
|
||||
peak: 32
|
||||
pop: ((100 * AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))) / 32)
|
||||
tips:
|
||||
vL1D Cache Hit Rate:
|
||||
value: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
|
||||
None))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) +
|
||||
TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) /
|
||||
TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
|
||||
None))
|
||||
tips:
|
||||
vL1D Cache BW:
|
||||
value: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
peak: ((($max_sclk / 1000) * 64) * $cu_per_gpu)
|
||||
pop: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk / 1000) * 64) * $cu_per_gpu))
|
||||
tips:
|
||||
L2 Cache Hit Rate:
|
||||
value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
tips:
|
||||
L2 Cache BW:
|
||||
value: AVG(((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
peak: ((($max_sclk / 1000) * 64) * TO_INT($total_l2_chan))
|
||||
pop: ((100 * AVG(((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk / 1000) * 64) * TO_INT($total_l2_chan)))
|
||||
tips:
|
||||
L2-Fabric Read BW:
|
||||
value: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
peak: $hbmBandwidth
|
||||
pop: ((100 * AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))) / $hbmBandwidth)
|
||||
tips:
|
||||
L2-Fabric Write BW:
|
||||
value: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
|
||||
* 32)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
peak: $hbmBandwidth
|
||||
pop: ((100 * AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
|
||||
* 32)) / (End_Timestamp - Start_Timestamp)))) / $hbmBandwidth)
|
||||
tips:
|
||||
L2-Fabric Read Latency:
|
||||
value: AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
peak: None
|
||||
pop: None
|
||||
tips:
|
||||
L2-Fabric Write Latency:
|
||||
value: AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
peak: None
|
||||
pop: None
|
||||
tips:
|
||||
sL1D Cache Hit Rate:
|
||||
value: AVG((((100 * SQC_DCACHE_HITS) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES))
|
||||
if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES) != 0) else None))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG((((100 * SQC_DCACHE_HITS) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES))
|
||||
if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES) != 0) else None))
|
||||
tips:
|
||||
sL1D Cache BW:
|
||||
value: AVG(((SQC_DCACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))
|
||||
unit: GB/s
|
||||
peak: ((($max_sclk / 1000) * 64) * $sqc_per_gpu)
|
||||
pop: ((100 * AVG(((SQC_DCACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))) / ((($max_sclk
|
||||
/ 1000) * 64) * $sqc_per_gpu))
|
||||
tips:
|
||||
L1I Hit Rate:
|
||||
value: AVG(((100 * SQC_ICACHE_HITS) / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES)))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG(((100 * SQC_ICACHE_HITS) / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES)))
|
||||
tips:
|
||||
L1I BW:
|
||||
value: AVG(((SQC_ICACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))
|
||||
unit: GB/s
|
||||
peak: ((($max_sclk / 1000) * 64) * $sqc_per_gpu)
|
||||
pop: ((100 * AVG(((SQC_ICACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))) / ((($max_sclk
|
||||
/ 1000) * 64) * $sqc_per_gpu))
|
||||
tips:
|
||||
L1I Fetch Latency:
|
||||
value: AVG((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
|
||||
unit: Cycles
|
||||
peak: None
|
||||
pop: None
|
||||
coll_level: SQ_IFETCH_LEVEL
|
||||
tips:
|
||||
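The `peak` and `pop` expressions in the Speed-of-Light table reduce to simple arithmetic. The sketch below works through the VALU FLOPs peak and the VALU Active Threads scaling factor. It assumes `$max_sclk` is expressed in MHz, and the clock and CU count are made-up illustrative values, not figures for any specific accelerator. The function names are hypothetical.

```python
def valu_flops_peak(max_sclk_mhz: float, cu_per_gpu: int) -> float:
    """Evaluate (((max_sclk * cu_per_gpu) * 64) * 2) / 1000 from the config.

    Read here as 64 lanes per CU and 2 FLOPs per FMA, with MHz scaled to GFLOP/s.
    """
    return (((max_sclk_mhz * cu_per_gpu) * 64) * 2) / 1000


def pct_of_peak(value: float, peak: float) -> float:
    """Generic 'Pct of Peak' form used by most rows: (100 * value) / peak."""
    return (100 * value) / peak


# Hypothetical device: 2000 MHz, 100 CUs.
peak = valu_flops_peak(max_sclk_mhz=2000, cu_per_gpu=100)
half = pct_of_peak(12800.0, peak)

# VALU Active Threads multiplies by 1.5625, which equals 100 / 64,
# so 32 active threads out of a peak of 64 maps to 50 percent.
threads_pct = 32 * 1.5625
```

The Theoretical LDS Bandwidth row follows the same pattern: its `pop` divides by `0.00128`, which is the `0.128` peak factor already divided by 100.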
+337
@@ -0,0 +1,337 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 200
|
||||
title: System Speed-of-Light
|
||||
metrics_description:
|
||||
VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
|
||||
This is also presented as a percent of the peak theoretical FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any floating-point
|
||||
operations from MFMA instructions.'
|
||||
VALU IOPs: 'The total integer operations executed per second on the VALU. This
|
||||
is also presented as a percent of the peak theoretical IOPs achievable on the
|
||||
specific accelerator. Note: this does not include any integer operations from
|
||||
MFMA instructions.'
|
||||
MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
|
||||
executed per second. This does not include any 8-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F8 MFMA operations achievable on the specific accelerator. It is supported on
|
||||
AMD Instinct MI300 series and later only.
|
||||
MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
|
||||
executed per second. Note: this does not include any 16-bit brain floating point
|
||||
operations from VALU instructions. This is also presented as a percent of the
|
||||
peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F16 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 32-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F32 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 64-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F64 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
|
||||
per second. Note: this does not include any 8-bit integer operations from VALU
|
||||
instructions. This is also presented as a percent of the peak theoretical INT8
|
||||
MFMA operations achievable on the specific accelerator.'
|
||||
Active CUs: Total number of active compute units (CUs) on the accelerator during
|
||||
the kernel execution.
|
||||
SALU Utilization: Indicates what percent of the kernel's duration the SALU was
|
||||
busy executing instructions. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
|
||||
VALU Utilization: Indicates what percent of the kernel's duration the VALU was
|
||||
busy executing instructions. Does not include VMEM operations. Computed as the
|
||||
ratio of the total number of cycles spent by the scheduler issuing VALU instructions
|
||||
over the total CU cycles.
|
||||
MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
|
||||
was busy executing instructions. Computed as the ratio of the total number of
|
||||
cycles the MFMA was busy over the total CU cycles.
|
||||
VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
|
||||
was busy executing instructions, including both global/generic and spill/scratch
|
||||
operations (see the VMEM instruction count metrics for more detail). Does not
|
||||
include VALU operations. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing VMEM instructions over the total CU cycles.
|
||||
Branch Utilization: Indicates what percent of the kernel's duration the branch
|
||||
unit was busy executing instructions. Computed as the ratio of the total number
|
||||
of cycles spent by the scheduler issuing branch instructions over the total
|
||||
CU cycles.
|
||||
VALU Active Threads: Indicates the average level of divergence within a wavefront
|
||||
over the lifetime of the kernel. The number of work-items that were active in
|
||||
a wavefront during execution of each VALU instruction, time-averaged over all
|
||||
VALU instructions run on all wavefronts in the kernel.
|
||||
IPC: The ratio of the total number of instructions executed on the CU over the
|
||||
total active CU cycles. This is also presented as a percent of the peak theoretical
|
||||
IPC achievable on the specific accelerator.
|
||||
Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
|
||||
over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
||||
kernels (less than 1ms). This is also presented as a percent of the peak theoretical
|
||||
occupancy achievable on the specific accelerator.'
|
||||
Theoretical LDS Bandwidth: Indicates the maximum amount of bytes that could have
|
||||
been loaded from, stored to, or atomically updated in the LDS per unit time
|
||||
(see LDS Bandwidth example for more detail). This is also presented as a percent
|
||||
of the peak theoretical LDS bandwidth achievable on the specific accelerator.
|
||||
LDS Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS
|
||||
scheduler due to bank conflicts (as determined by the conflict resolution hardware)
|
||||
to the base number of cycles that would be spent in the LDS scheduler in a completely
|
||||
uncontended case. This is also presented in normalized form (i.e., the Bank
|
||||
Conflict Rate).
|
||||
vL1D Cache Hit Rate: The ratio of the number of vL1D cache line requests that
|
||||
hit in vL1D cache over the total number of cache line requests to the vL1D cache
|
||||
RAM.
|
||||
vL1D Cache BW: The number of bytes looked up in the vL1D cache as a result of
|
||||
VMEM instructions per unit time. The number of bytes is calculated as the number
|
||||
of cache lines requested multiplied by the cache line size. This value does
|
||||
not consider partial requests, so e.g., if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
This is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L2 Cache Hit Rate: The ratio of the number of L2 cache line requests that hit
|
||||
in the L2 cache over the total number of incoming cache line requests to the
|
||||
L2 cache.
|
||||
L2 Cache BW: The number of bytes looked up in the L2 cache per unit time. The
|
||||
number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line. This is also presented as a percent of
|
||||
the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Read BW: "The number of bytes read by the L2 over the Infinity Fabric\u2122\
|
||||
\ interface per unit time. This is also presented as a percent of the peak theoretical\
|
||||
\ bandwidth achievable on the specific accelerator."
|
||||
L2-Fabric Write BW: The number of bytes sent by the L2 over the Infinity Fabric
|
||||
interface by write and atomic operations per unit time. This is also presented
|
||||
as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Read Latency: The time-averaged number of cycles read requests spent
|
||||
in Infinity Fabric before data was returned to the L2.
|
||||
L2-Fabric Write Latency: The time-averaged number of cycles write requests spent
|
||||
in Infinity Fabric before a completion acknowledgement was returned to the L2.
|
||||
sL1D Cache Hit Rate: The percent of sL1D requests that hit on a previously loaded
|
||||
line in the cache. Calculated as the ratio of the number of sL1D requests that
|
||||
hit over the number of all sL1D requests.
|
||||
sL1D Cache BW: The number of bytes looked up in the sL1D cache per unit time.
|
||||
This is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L1I Hit Rate: The percent of L1I requests that hit on a previously loaded line in
|
||||
the cache. Calculated as the ratio of the number of L1I requests that hit over
|
||||
the number of all L1I requests.
|
||||
L1I BW: The number of bytes looked up in the L1I cache per unit time. This is also
|
||||
presented as a percent of the peak theoretical bandwidth achievable on the specific
|
||||
accelerator.
|
||||
L1I Fetch Latency: The average number of cycles spent to fetch instructions to
|
||||
a CU.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 201
|
||||
title: System Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
peak: Peak
|
||||
pop: Pct of Peak
|
||||
metric:
|
||||
VALU FLOPs:
|
||||
value: AVG(((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) +
|
||||
SQ_INSTS_VALU_TRANS_F16) + (2 * SQ_INSTS_VALU_FMA_F16))) + (64 * (((SQ_INSTS_VALU_ADD_F32
|
||||
+ SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (2 * SQ_INSTS_VALU_FMA_F32))))
|
||||
+ (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
|
||||
+ (2 * SQ_INSTS_VALU_FMA_F64)))) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GFLOP/s
|
||||
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
|
||||
pop: ((100 * AVG(((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16)
|
||||
+ SQ_INSTS_VALU_TRANS_F16) + (2 * SQ_INSTS_VALU_FMA_F16))) + (64 * (((SQ_INSTS_VALU_ADD_F32
|
||||
+ SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (2 * SQ_INSTS_VALU_FMA_F32))))
|
||||
+ (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
|
||||
+ (2 * SQ_INSTS_VALU_FMA_F64)))) / (End_Timestamp - Start_Timestamp))))
|
||||
/ (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000))
|
||||
VALU IOPs:
|
||||
value: AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) / (End_Timestamp
|
||||
- Start_Timestamp)))
|
||||
unit: GIOP/s
|
||||
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
|
||||
pop: ((100 * AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) / (End_Timestamp
|
||||
- Start_Timestamp)))) / (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000))
|
||||
MFMA FLOPs (BF16):
|
||||
value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
|
||||
pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp
|
||||
- Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 1024) / 1000))
|
||||
MFMA FLOPs (F16):
|
||||
value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
|
||||
pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp -
|
||||
Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 1024) / 1000))
|
||||
MFMA FLOPs (F32):
|
||||
value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
|
||||
pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) / (End_Timestamp -
|
||||
Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 256) / 1000))
|
||||
MFMA FLOPs (F64):
|
||||
value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
|
||||
pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp -
|
||||
Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 256) / 1000))
|
||||
MFMA IOPs (Int8):
|
||||
value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GIOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
|
||||
pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp -
|
||||
Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 1024) / 1000))
|
||||
Active CUs:
|
||||
value: $numActiveCUs
|
||||
unit: CUs
|
||||
peak: $cu_per_gpu
|
||||
pop: ((100 * $numActiveCUs) / $cu_per_gpu)
|
||||
SALU Utilization:
|
||||
value: AVG(((100 * SQ_ACTIVE_INST_SCA) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG(((100 * SQ_ACTIVE_INST_SCA) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
VALU Utilization:
|
||||
value: AVG(((100 * SQ_ACTIVE_INST_VALU) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG(((100 * SQ_ACTIVE_INST_VALU) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
MFMA Utilization:
|
||||
value: AVG(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / (($GRBM_GUI_ACTIVE_PER_XCD
|
||||
* $cu_per_gpu) * 4)))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / (($GRBM_GUI_ACTIVE_PER_XCD
|
||||
* $cu_per_gpu) * 4)))
|
||||
VMEM Utilization:
|
||||
value: AVG((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD)
|
||||
/ $cu_per_gpu))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD)
|
||||
/ $cu_per_gpu))
|
||||
Branch Utilization:
|
||||
value: AVG((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
|
||||
VALU Active Threads:
|
||||
value: AVG(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
|
||||
!= 0) else None))
|
||||
unit: Threads
|
||||
peak: 64
|
||||
pop: (AVG(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
|
||||
!= 0) else None)) * 1.5625)
|
||||
IPC:
|
||||
value: AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))
|
||||
unit: Instr/cycle
|
||||
peak: 5
|
||||
pop: ((100 * AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))) / 5)
|
||||
Wavefront Occupancy:
|
||||
value: AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
unit: Wavefronts
|
||||
peak: ($max_waves_per_cu * $cu_per_gpu)
|
||||
pop: (100 * AVG(((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD) / ($max_waves_per_cu
|
||||
* $cu_per_gpu))))
|
||||
coll_level: SQ_LEVEL_WAVES
|
||||
Theoretical LDS Bandwidth:
|
||||
value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
peak: (($max_sclk * $cu_per_gpu) * 0.128)
|
||||
pop: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
|
||||
LDS Bank Conflicts/Access:
|
||||
value: AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
|
||||
unit: Conflicts/access
|
||||
peak: 32
|
||||
pop: ((100 * AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))) / 32)
|
||||
vL1D Cache Hit Rate:
value: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
else None))
unit: pct
peak: 100
pop: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
else None))
vL1D Cache BW:
value: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
unit: GB/s
peak: ((($max_sclk / 1000) * 64) * $cu_per_gpu)
pop: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp -
Start_Timestamp)))) / ((($max_sclk / 1000) * 64) * $cu_per_gpu))
L2 Cache Hit Rate:
value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
+ TCC_MISS_sum) != 0) else None))
unit: pct
peak: 100
pop: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
+ TCC_MISS_sum) != 0) else None))
L2 Cache BW:
value: AVG(((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)))
unit: GB/s
peak: ((($max_sclk / 1000) * 64) * TO_INT($total_l2_chan))
pop: ((100 * AVG(((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))))
/ ((($max_sclk / 1000) * 64) * TO_INT($total_l2_chan)))
L2-Fabric Read BW:
value: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
* 64)) / (End_Timestamp - Start_Timestamp)))
unit: GB/s
peak: $hbmBandwidth
pop: ((100 * AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
* 64)) / (End_Timestamp - Start_Timestamp)))) / $hbmBandwidth)
L2-Fabric Write BW:
value: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
* 32)) / (End_Timestamp - Start_Timestamp)))
unit: GB/s
peak: $hbmBandwidth
pop: ((100 * AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
* 32)) / (End_Timestamp - Start_Timestamp)))) / $hbmBandwidth)
L2-Fabric Read Latency:
value: AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
!= 0) else None))
unit: Cycles
peak: None
pop: None
L2-Fabric Write Latency:
value: AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
!= 0) else None))
unit: Cycles
peak: None
pop: None
sL1D Cache Hit Rate:
value: AVG((((100 * SQC_DCACHE_HITS) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES))
if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES) != 0) else None))
unit: pct
peak: 100
pop: AVG((((100 * SQC_DCACHE_HITS) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES))
if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES) != 0) else None))
sL1D Cache BW:
value: AVG(((SQC_DCACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))
unit: GB/s
peak: ((($max_sclk / 1000) * 64) * $sqc_per_gpu)
pop: ((100 * AVG(((SQC_DCACHE_REQ / (End_Timestamp - Start_Timestamp)) *
64))) / ((($max_sclk / 1000) * 64) * $sqc_per_gpu))
L1I Hit Rate:
value: AVG(((100 * SQC_ICACHE_HITS) / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES)))
unit: pct
peak: 100
pop: AVG(((100 * SQC_ICACHE_HITS) / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES)))
L1I BW:
value: AVG(((SQC_ICACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))
unit: GB/s
peak: ((($max_sclk / 1000) * 64) * $sqc_per_gpu)
pop: ((100 * AVG(((SQC_ICACHE_REQ / (End_Timestamp - Start_Timestamp)) *
64))) / ((($max_sclk / 1000) * 64) * $sqc_per_gpu))
L1I Fetch Latency:
value: AVG((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
unit: Cycles
peak: None
pop: None
coll_level: SQ_IFETCH_LEVEL
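The entries above pair each `value` expression with a `peak` and a `pop` (percent of peak), and most of them guard divisions with `if (x != 0) else None`. The following is a minimal Python sketch of that pattern, not the tool's evaluator: the helper names `guarded_ratio`, `hit_rate_pct`, and `percent_of_peak` are hypothetical, and the counter values used with them are made up for illustration.

```python
from typing import Optional


def guarded_ratio(numerator: float, denominator: float) -> Optional[float]:
    """Mirror the `(a / b) if (b != 0) else None` guard used in the config."""
    return numerator / denominator if denominator != 0 else None


def hit_rate_pct(hits: float, misses: float) -> Optional[float]:
    """Hit rate in percent, shaped like (100 * HIT) / (HIT + MISS)."""
    return guarded_ratio(100 * hits, hits + misses)


def percent_of_peak(value: Optional[float], peak: Optional[float]) -> Optional[float]:
    """`pop` as 100 * value / peak; None when either side is missing or peak is 0."""
    if value is None or peak is None or peak == 0:
        return None
    return 100 * value / peak
```

For metrics whose unit is already `pct` with `peak: 100`, `pop` is simply the same expression as `value`. The Theoretical LDS Bandwidth entry folds the factor of 100 into its constant instead: dividing by `(... * 0.00128)` is arithmetically the same as `100 * value / (... * 0.128)`.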
-315
@@ -1,315 +0,0 @@
---
# Add description/tips for each metric in this section.
# So it could be shown in hover.
Metric Description:

# Define the panel properties and properties of each metric in the panel.
Panel Config:
id: 300
title: Memory Chart
data source:
- metric_table:
id: 301
title: Memory Chart
header:
metric: Metric
#alias: #alias
value: Value
tips: Tips
metric:
# ----------------------------------------
# Instr Buff Block

#TODO: double check wave_occupancy
Wavefront Occupancy:
#alias: wave_occ_
value: ROUND(AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD) / $numActiveCUs), 0)
coll_level: SQ_LEVEL_WAVES
tips:
Wave Life:
#alias: wave_life_
value: ROUND(AVG(((4 * (SQ_WAVE_CYCLES / SQ_WAVES)) if (SQ_WAVES != 0) else 0)), 0)
tips:

# ----------------------------------------
# Instr Dispatch Block
SALU:
#alias: salu_
value: ROUND(AVG((SQ_INSTS_SALU / $denom)), 0)
tips:
SMEM:
#alias: smem_
value: ROUND(AVG((SQ_INSTS_SMEM / $denom)), 0)
tips:
VALU:
#alias: valu_
value: ROUND(AVG((SQ_INSTS_VALU / $denom)), 0)
tips:
MFMA:
#alias: mfma_
value: ROUND(AVG((SQ_INSTS_MFMA / $denom)), 0)
tips:
VMEM:
#alias: vmem_
value: ROUND(AVG((SQ_INSTS_VMEM / $denom)), 0)
tips:
LDS:
#alias: lds_
value: ROUND(AVG((SQ_INSTS_LDS / $denom)), 0)
tips:
GWS:
#alias: gws_
value: ROUND(AVG((SQ_INSTS_GDS / $denom)), 0)
tips:
BR:
#alias: br_
value: ROUND(AVG((SQ_INSTS_BRANCH / $denom)), 0)
tips:

# ----------------------------------------
# Exec Block
Active CUs:
#alias: active_cu_
value: $numActiveCUs
tips:
Num CUs:
#alias: num_cu_
value: $cu_per_gpu
tips:
VGPR:
#alias: vgpr_
value: ROUND(AVG(Arch_VGPR), 0)
tips:
# Todo: add AGPRs
SGPR:
#alias: sgpr_
value: ROUND(AVG(SGPR), 0)
tips:
LDS Allocation:
#alias: lds_alloc_
value: ROUND(AVG(LDS_Per_Workgroup), 0)
tips:
Scratch Allocation:
#alias: scratch_alloc_
value: ROUND(AVG(Scratch_Per_Workitem), 0)
tips:
Wavefronts:
#alias: wavefronts_
value: ROUND(AVG(SPI_CSN_WAVE), 0)
tips:
Workgroups:
#alias: workgroups_
value: ROUND(AVG(SPI_CSN_NUM_THREADGROUPS), 0)
tips:

# ----------------------------------------
# LDS Block
LDS Req:
#alias: lds_req_
value: ROUND(AVG((SQ_INSTS_LDS / $denom)), 0)
tips:
LDS Util:
#alias: lds_util_
value:
ROUND(AVG(((100 * SQ_LDS_IDX_ACTIVE) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))),
0)
tips:
LDS Latency:
#alias: lds_lat
value: ROUND(AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None)),0)
coll_level: SQ_INST_LEVEL_LDS
tips:

# ----------------------------------------
# Vector L1 Cache Block
VL1 Rd:
#alias: vl1_rd_
value: ROUND(AVG((TCP_TOTAL_READ_sum / $denom)), 0)
tips:
VL1 Wr:
#alias: vl1_wr_
value: ROUND(AVG((TCP_TOTAL_WRITE_sum / $denom)), 0)
tips:
VL1 Atomic:
#alias: vl1_atom_
value:
ROUND(AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
/ $denom)), 0)
tips:

VL1 Hit:
#alias: vl1_hit_
value:
ROUND(AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
None )), 0)
tips:
VL1 Lat:
#alias: vl1_lat_
value:
ROUND(AVG(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
!= 0) else None)), 0)
tips:
VL1 Coalesce:
#alias: vl1_coales_
value:
ROUND(AVG(((((TA_TOTAL_WAVEFRONTS_sum * 64) * 100) / (TCP_TOTAL_ACCESSES_sum
* 4)) if (TCP_TOTAL_ACCESSES_sum != None) else 0)), 0)
tips:
VL1 Stall:
#alias: vl1_stall_
value:
ROUND(AVG((((100 * TCP_TCR_TCP_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
if (TCP_GATE_EN1_sum != 0) else None)), 0)
tips:

VL1_L2 Rd:
#alias: vl1_l2_rd_
value: ROUND(AVG((TCP_TCC_READ_REQ_sum / $denom)), 0)
tips:
VL1_L2 Wr:
#alias: vl1_l2_wr_
value: ROUND(AVG((TCP_TCC_WRITE_REQ_sum / $denom)), 0)
tips:
VL1_L2 Atomic:
#alias: vl1_l2_atom_
value:
ROUND(AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
/ $denom)), 0)
tips:

# ----------------------------------------
# Scalar L1D Cache Block
VL1D Rd:
#alias: sl1_rd_
value: ROUND(AVG((SQC_DCACHE_REQ / $denom)), 0)
tips:
VL1D Hit:
#alias: sl1_hit_
value:
ROUND((AVG(((SQC_DCACHE_HITS / SQC_DCACHE_REQ) if (SQC_DCACHE_REQ !=
0) else None)) * 100), 0)
tips:
VL1D Lat:
#alias: sl1_lat_
value:
ROUND((AVG(((SQ_ACCUM_PREV_HIRES / SQC_DCACHE_REQ) if (SQC_DCACHE_REQ !=
0) else None)) * 100), 0)
coll_level: SQC_DCACHE_INFLIGHT_LEVEL
tips:

VL1D_L2 Rd:
#alias: sl1_l2_rd_
value: ROUND(AVG((SQC_TC_DATA_READ_REQ / $denom)), 0)
tips:
VL1D_L2 Wr:
#alias: sl1_l2_wr_
value: ROUND(AVG((SQC_TC_DATA_WRITE_REQ / $denom)), 0)
tips:
VL1D_L2 Atomic:
#alias: sl1_l2_atom_
value: ROUND(AVG((SQC_TC_DATA_ATOMIC_REQ / $denom)), 0)
tips:

# ----------------------------------------
# Instr L1 Cache Block
IL1 Fetch:
#alias: il1_fetch_
value: ROUND(AVG((SQC_ICACHE_REQ / $denom)), 0)
tips:
IL1 Hit:
#alias: il1_hit_
value: ROUND((AVG((SQC_ICACHE_HITS / SQC_ICACHE_REQ)) * 100), 0)
tips:
IL1 Lat:
#alias: il1_lat_
value:
ROUND((AVG(((SQ_ACCUM_PREV_HIRES / SQC_ICACHE_REQ) if (SQC_ICACHE_REQ !=
0) else None)) * 100), 0)
tips: # ??? coll_level: SQ_IFETCH_LEVEL
IL1_L2 Rd:
#alias: il1_l2_req_
value: ROUND(AVG((SQC_TC_INST_REQ / $denom)), 0)
tips:

# ----------------------------------------
# L2 Cache Block(inside)
L2 Rd:
#alias: l2_rd_
value: ROUND(AVG((TCC_READ_sum / $denom)), 0)
tips:
L2 Wr:
#alias: l2_wr_
value: ROUND(AVG((TCC_WRITE_sum / $denom)), 0)
tips:
L2 Atomic:
#alias: l2_atom_
value: ROUND(AVG((TCC_ATOMIC_sum / $denom)), 0)
tips:
L2 Hit:
#alias: l2_hit_
value:
ROUND(AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
+ TCC_MISS_sum) != 0) else 0)), 0)
tips:
L2 Rd Lat:
#alias: l2_rd_lat_
value:
ROUND(AVG(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else None)),
0)
tips:
L2 Wr Lat:
#alias: l2_wr_lat_
value:
ROUND(AVG(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum +
TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
!= 0) else None)), 0)
tips:

# ----------------------------------------
# Fabric Block
Fabric_L2 Rd:
#alias: l2_fabric_rd_
value: ROUND(AVG((TCC_EA_RDREQ_sum / $denom)), 0)
tips:
Fabric_L2 Wr:
#alias: l2_fabric_wr_
value: ROUND(AVG((TCC_EA_WRREQ_sum / $denom)), 0)
tips:
Fabric_L2 Atomic:
#alias: l2_fabric_atom_
value: ROUND(AVG((TCC_EA_ATOMIC_sum / $denom)), 0)
tips:

Fabric Rd Lat:
#alias: fabric_rd_lat_
value:
ROUND(AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
!= 0) else 0)), 0)
tips:
Fabric Wr Lat:
#alias: fabric_wr_lat_
value:
ROUND(AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
!= 0) else 0)), 0)
tips:
Fabric Atomic Lat:
#alias: fabric_atom_lat_
value:
ROUND(AVG(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
!= 0) else 0)), 0)
tips:

HBM Rd:
#alias: hbm_rd_
value: ROUND(AVG((TCC_EA_RDREQ_DRAM_sum / $denom)), 0)
tips:
HBM Wr:
#alias: hbm_wr_
value: ROUND(AVG((TCC_EA_WRREQ_DRAM_sum / $denom)), 0)
tips:

comparable: false # for now
cli_style: mem_chart
+267
@@ -0,0 +1,267 @@
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
Panel Config:
id: 300
title: Memory Chart
metrics_description:
Wavefront Occupancy: Wavefronts per active CU.
Wave Life: Average number of cycles executing a wave.
SALU: Total number of SALU (Scalar ALU) instructions issued per normalization
unit.
SMEM: Total number of SMEM (Scalar Memory Read) instructions issued per normalization
unit.
VALU: The number of VALU (Vector ALU) instructions issued per normalization unit.
MFMA: Total number of MFMA (Matrix-Fused-Multiply-Add) instructions issued per
normalization unit.
VMEM: The number of VMEM (GPU Memory) read instructions issued (including FLAT/scratch
memory) per normalization unit.
LDS: The total number of LDS instructions (including, but not limited to, read/write/atomics
and HIP's __shfl instructions) executed per normalization unit.
GWS: Total number of GDS (global data sync) instructions issued per normalization
unit.
BR: Total number of BRANCH instructions issued per normalization unit.
Active CUs: Total number of active compute units (CUs) on the accelerator during
the kernel execution.
Num CUs: Total number of compute units (CUs) on the accelerator.
VGPR: 'The number of architected vector general-purpose registers allocated for
the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
by the compiler due to allocation granularity.'
SGPR: 'The number of scalar general-purpose registers allocated for the kernel,
see SALU. Note: this may not exactly match the number of SGPRs requested by
the compiler due to allocation granularity.'
LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
for this kernel. Note: This may also be larger than what was requested at compile
time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
Scratch Allocation: The number of bytes of scratch memory requested per work-item
for this kernel. Scratch memory is used for stack memory on the accelerator,
as well as for register spills and restores.
Wavefronts: The total number of wavefronts, summed over all workgroups, forming
this kernel launch.
Workgroups: The total number of workgroups forming this kernel launch.
LDS Req: The total number of LDS instructions (including, but not limited to,
read/write/atomics and HIP's __shfl instructions) executed per normalization
unit.
LDS Util: Indicates what percent of the kernel's duration the LDS was actively
executing instructions (including, but not limited to, load, store, atomic and
HIP's __shfl operations). Calculated as the ratio of the total number of cycles
LDS was active over the total CU cycles.
LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
/ acknowledgment) required for an LDS instruction to complete.
VL1 Rd: The total number of incoming read requests from the address processing
unit after coalescing per normalization unit.
VL1 Wr: The total number of incoming write requests from the address processing
unit after coalescing per normalization unit.
VL1 Atomic: The total number of incoming atomic requests from the address processing
unit after coalescing per normalization unit.
VL1 Hit: The ratio of the number of vL1D cache line requests that hit in vL1D
cache over the total number of cache line requests to the vL1D Cache RAM.
VL1 Lat: Calculated as the average number of cycles that a vL1D cache line request
spent in the vL1D cache pipeline.
VL1 Coalesce: Indicates how well memory instructions were coalesced by the address
processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
as the average number of thread-requests generated per instruction divided by
the ideal number of thread-requests per instruction.
VL1 Stall: The ratio of the number of cycles where the vL1D is stalled waiting
to issue a request for data to the L2 cache divided by the number of cycles
where the vL1D is active.
VL1_L2 Rd: The number of read requests for a vL1D cache line that were not satisfied
by the vL1D and must be retrieved from the L2 cache per normalization
unit.
VL1_L2 Wr: The number of write requests to a vL1D cache line that were sent through
the vL1D to the L2 cache, per normalization unit.
VL1_L2 Atomic: The number of atomic requests that are sent through the vL1D to
the L2 cache, per normalization unit. This includes requests for atomics with
and without return.
sL1D Rd: The total number of requests, of any size or type, made to the sL1D per
normalization unit.
sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
line, per normalization unit.
sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
unit.
sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
unit. Typically unused on current CDNA accelerators.
sL1D_L2 Atomic: The total number of atomic requests from sL1D to the L2, per normalization
unit. Typically unused on current CDNA accelerators.
IL1 Fetch: The total number of requests made to the L1I per normalization-unit.
IL1 Hit: The percent of L1I requests that hit on a previously loaded line in the
cache. Calculated as the ratio of the number of L1I requests that hit over the
number of all L1I requests.
IL1 Lat: The average number of cycles spent to fetch instructions to a CU.
IL1_L2 Rd: The total number of requests across the L1I - L2 interface per normalization-unit.
L2 Rd: The total number of read requests to the L2 from all clients.
L2 Wr: The total number of write requests to the L2 from all clients.
L2 Atomic: The total number of atomic requests (with and without return) to the
L2 from all clients.
L2 Hit: The ratio of the number of L2 cache line requests that hit in the L2 cache
over the total number of incoming cache line requests to the L2 cache.
L2 Rd Lat: Calculated as the average number of cycles that the vL1D cache took
to issue and receive read requests from the L2 Cache. This number also includes
requests for atomics with return values.
L2 Wr Lat: Calculated as the average number of cycles that the vL1D cache took
to issue and receive acknowledgement of a write request to the L2 Cache. This
number also includes requests for atomics without return values.
Fabric_L2 Rd: Number of L2 cache - Infinity Fabric read requests (either 32-byte
or 64-byte) summed over TCC instances per normalization unit.
Fabric_L2 Wr: Number of L2 cache - Infinity Fabric write requests (either 32-byte
or 64-byte) summed over TCC instances per normalization unit.
Fabric_L2 Atomic: Number of L2 cache - Infinity Fabric write requests (either
32-byte or 64-byte) that are actually atomic requests summed over TCC instances
per normalization unit.
Fabric Rd Lat: The time-averaged number of cycles read requests spent in Infinity
Fabric before data was returned to the L2.
Fabric Wr Lat: The time-averaged number of cycles write requests spent in Infinity
Fabric before a completion acknowledgement was returned to the L2.
Fabric Atomic Lat: The time-averaged number of cycles atomic requests spent in
Infinity Fabric before a completion acknowledgement (atomic without return value)
or data (atomic with return value) was returned to the L2.
HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
of data from the accelerator's local HBM, per normalization unit.
HBM Wr: 'The total number of L2 requests to Infinity Fabric to write or atomically
update 32B or 64B of data in the accelerator''s local HBM, per normalization
unit.'
data source:
- metric_table:
id: 301
title: Memory Chart
header:
metric: Metric
value: Value
metric:
Wavefront Occupancy:
value: ROUND(AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD) / $numActiveCUs),
0)
coll_level: SQ_LEVEL_WAVES
Wave Life:
value: ROUND(AVG(((4 * (SQ_WAVE_CYCLES / SQ_WAVES)) if (SQ_WAVES != 0) else
0)), 0)
SALU:
value: ROUND(AVG((SQ_INSTS_SALU / $denom)), 0)
SMEM:
value: ROUND(AVG((SQ_INSTS_SMEM / $denom)), 0)
VALU:
value: ROUND(AVG((SQ_INSTS_VALU / $denom)), 0)
MFMA:
value: ROUND(AVG((SQ_INSTS_MFMA / $denom)), 0)
VMEM:
value: ROUND(AVG((SQ_INSTS_VMEM / $denom)), 0)
LDS:
value: ROUND(AVG((SQ_INSTS_LDS / $denom)), 0)
GWS:
value: ROUND(AVG((SQ_INSTS_GDS / $denom)), 0)
BR:
value: ROUND(AVG((SQ_INSTS_BRANCH / $denom)), 0)
Active CUs:
value: $numActiveCUs
Num CUs:
value: $cu_per_gpu
VGPR:
value: ROUND(AVG(Arch_VGPR), 0)
SGPR:
value: ROUND(AVG(SGPR), 0)
LDS Allocation:
value: ROUND(AVG(LDS_Per_Workgroup), 0)
Scratch Allocation:
value: ROUND(AVG(Scratch_Per_Workitem), 0)
Wavefronts:
value: ROUND(AVG(SPI_CSN_WAVE), 0)
Workgroups:
value: ROUND(AVG(SPI_CSN_NUM_THREADGROUPS), 0)
LDS Req:
value: ROUND(AVG((SQ_INSTS_LDS / $denom)), 0)
LDS Util:
value: ROUND(AVG(((100 * SQ_LDS_IDX_ACTIVE) / ($GRBM_GUI_ACTIVE_PER_XCD
* $cu_per_gpu))), 0)
LDS Latency:
value: ROUND(AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS
!= 0) else None)),0)
coll_level: SQ_INST_LEVEL_LDS
VL1 Rd:
value: ROUND(AVG((TCP_TOTAL_READ_sum / $denom)), 0)
VL1 Wr:
value: ROUND(AVG((TCP_TOTAL_WRITE_sum / $denom)), 0)
VL1 Atomic:
value: ROUND(AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
/ $denom)), 0)
VL1 Hit:
value: ROUND(AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
else None )), 0)
VL1 Lat:
value: ROUND(AVG(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
!= 0) else None)), 0)
VL1 Coalesce:
value: ROUND(AVG(((((TA_TOTAL_WAVEFRONTS_sum * 64) * 100) / (TCP_TOTAL_ACCESSES_sum
* 4)) if (TCP_TOTAL_ACCESSES_sum != None) else 0)), 0)
VL1 Stall:
value: ROUND(AVG((((100 * TCP_TCR_TCP_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
if (TCP_GATE_EN1_sum != 0) else None)), 0)
VL1_L2 Rd:
value: ROUND(AVG((TCP_TCC_READ_REQ_sum / $denom)), 0)
VL1_L2 Wr:
value: ROUND(AVG((TCP_TCC_WRITE_REQ_sum / $denom)), 0)
VL1_L2 Atomic:
value: ROUND(AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
/ $denom)), 0)
sL1D Rd:
value: ROUND(AVG((SQC_DCACHE_REQ / $denom)), 0)
sL1D Hit:
value: ROUND((AVG(((SQC_DCACHE_HITS / SQC_DCACHE_REQ) if (SQC_DCACHE_REQ
!= 0) else None)) * 100), 0)
sL1D Lat:
value: ROUND((AVG(((SQ_ACCUM_PREV_HIRES / SQC_DCACHE_REQ) if (SQC_DCACHE_REQ
!= 0) else None)) * 100), 0)
coll_level: SQC_DCACHE_INFLIGHT_LEVEL
sL1D_L2 Rd:
value: ROUND(AVG((SQC_TC_DATA_READ_REQ / $denom)), 0)
sL1D_L2 Wr:
value: ROUND(AVG((SQC_TC_DATA_WRITE_REQ / $denom)), 0)
sL1D_L2 Atomic:
value: ROUND(AVG((SQC_TC_DATA_ATOMIC_REQ / $denom)), 0)
IL1 Fetch:
value: ROUND(AVG((SQC_ICACHE_REQ / $denom)), 0)
IL1 Hit:
value: ROUND((AVG((SQC_ICACHE_HITS / SQC_ICACHE_REQ)) * 100), 0)
IL1 Lat:
value: ROUND((AVG(((SQ_ACCUM_PREV_HIRES / SQC_ICACHE_REQ) if (SQC_ICACHE_REQ
!= 0) else None)) * 100), 0)
IL1_L2 Rd:
value: ROUND(AVG((SQC_TC_INST_REQ / $denom)), 0)
L2 Rd:
value: ROUND(AVG((TCC_READ_sum / $denom)), 0)
L2 Wr:
value: ROUND(AVG((TCC_WRITE_sum / $denom)), 0)
L2 Atomic:
value: ROUND(AVG((TCC_ATOMIC_sum / $denom)), 0)
L2 Hit:
value: ROUND(AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if
((TCC_HIT_sum + TCC_MISS_sum) != 0) else 0)), 0)
L2 Rd Lat:
value: ROUND(AVG(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum)) if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
!= 0) else None)), 0)
L2 Wr Lat:
value: ROUND(AVG(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
!= 0) else None)), 0)
Fabric_L2 Rd:
value: ROUND(AVG((TCC_EA_RDREQ_sum / $denom)), 0)
Fabric_L2 Wr:
value: ROUND(AVG((TCC_EA_WRREQ_sum / $denom)), 0)
Fabric_L2 Atomic:
value: ROUND(AVG((TCC_EA_ATOMIC_sum / $denom)), 0)
Fabric Rd Lat:
value: ROUND(AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
!= 0) else 0)), 0)
Fabric Wr Lat:
value: ROUND(AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
!= 0) else 0)), 0)
Fabric Atomic Lat:
value: ROUND(AVG(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
!= 0) else 0)), 0)
HBM Rd:
value: ROUND(AVG((TCC_EA_RDREQ_DRAM_sum / $denom)), 0)
HBM Wr:
value: ROUND(AVG((TCC_EA_WRREQ_DRAM_sum / $denom)), 0)
comparable: false
cli_style: mem_chart
tui_style: mem_chart
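The generated file above keeps the prose in `metrics_description` and the expressions under `metric`, where the removed file carried an empty `tips:` field per metric. The sketch below shows one way such a split could be done; it is not the actual `utils/split_config.py`. The function `split_metrics` and the assumed input shape (a `description` key stored next to each metric's expression fields) are hypothetical, since the unified config schema is not shown in this diff.

```python
from typing import Any, Dict, Tuple

Metric = Dict[str, Any]


def split_metrics(unified: Dict[str, Metric]) -> Tuple[Dict[str, str], Dict[str, Metric]]:
    """Split {name: {description, ...fields}} into descriptions and expression fields."""
    descriptions: Dict[str, str] = {}
    expressions: Dict[str, Metric] = {}
    for name, entry in unified.items():
        fields = dict(entry)  # copy so the caller's input is left untouched
        descriptions[name] = fields.pop("description", "")
        expressions[name] = fields
    return descriptions, expressions


# Hypothetical unified entries; the strings are taken from the generated file above.
unified_example = {
    "Num CUs": {
        "description": "Total number of compute units (CUs) on the accelerator.",
        "value": "$cu_per_gpu",
    },
    "Wavefronts": {
        "description": "The total number of wavefronts, summed over all workgroups, "
        "forming this kernel launch.",
        "value": "ROUND(AVG(SPI_CSN_WAVE), 0)",
    },
}

metrics_description, metric = split_metrics(unified_example)
```

Keeping one source entry per metric means a renamed metric changes its description and its expression together, which matches the commit's stated goal of aligning metric names between the analysis configuration and the documentation.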
@@ -0,0 +1,9 @@
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
Panel Config:
id: 400
title: Roofline
metrics_description: {}
data source:
- None:
id: 401
title: Roofline
@@ -1,8 +0,0 @@
---
Panel Config:
id: 400
title: Roofline
data source:
- None:
id: 401
title: Roofline
-135
@@ -1,135 +0,0 @@
---
# Add description/tips for each metric in this section.
# So it could be shown in hover.
Metric Description:

# Define the panel properties and properties of each metric in the panel.
Panel Config:
id: 500
title: Command Processor (CPC/CPF)
data source:
- metric_table:
id: 501
title: Command Processor Fetcher
header:
metric: Metric
avg: Avg
min: Min
max: Max
unit: Unit
tips: Tips
metric:
CPF Utilization:
avg: AVG((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
min: MIN((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
max: MAX((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
unit: pct
tips:
CPF Stall:
avg: AVG((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
!= 0) else None))
min: MIN((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
!= 0) else None))
max: MAX((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
!= 0) else None))
unit: pct
tips:
CPF-L2 Utilization:
avg: AVG((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
min: MIN((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
max: MAX((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
unit: pct
tips:
CPF-L2 Stall:
avg: AVG((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
!= 0) else None))
min: MIN((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
!= 0) else None))
max: MAX((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
!= 0) else None))
unit: pct
tips:
CPF-UTCL1 Stall:
avg: AVG(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
!= 0) else None)
min: MIN(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
!= 0) else None)
max: MAX(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
!= 0) else None)
unit: pct
tips:

- metric_table:
id: 502
title: Packet Processor
header:
metric: Metric
avg: Avg
min: Min
max: Max
unit: Unit
tips: Tips
metric:
CPC Utilization:
avg: AVG((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
min: MIN((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
max: MAX((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
unit: pct
tips:
CPC Stall Rate:
avg: AVG((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
!= 0) else None))
min: MIN((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
!= 0) else None))
max: MAX((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
!= 0) else None))
unit: pct
tips:
CPC Packet Decoding Utilization:
avg: AVG((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
min: MIN((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
max: MAX((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
unit: pct
tips:
CPC-Workgroup Manager Utilization:
avg: AVG((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
min: MIN((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
max: MAX((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
unit: Pct
tips:
CPC-L2 Utilization:
avg: AVG((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
min: MIN((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
max: MAX((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
unit: pct
tips:
CPC-UTCL1 Stall:
avg: AVG(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
!= 0) else None)
min: MIN(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
!= 0) else None)
max: MAX(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
!= 0) else None)
unit: pct
tips:
CPC-UTCL2 Utilization:
avg: AVG((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
min: MIN((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
max: MAX((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
unit: pct
tips:
|
||||
+145
@@ -0,0 +1,145 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 500
|
||||
title: Command Processor (CPC/CPF)
|
||||
metrics_description:
|
||||
CPF Utilization: Percent of total cycles where the CPF was busy actively doing
|
||||
any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
|
||||
CPF Stall: Percent of CPF busy cycles where the CPF was stalled for any reason.
|
||||
CPF-L2 Utilization: Percent of total cycles counted by the CPF-L2 interface where
|
||||
the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
|
||||
over total cycles counted by the CPF-L2.
|
||||
CPF-L2 Stall: Percent of CPF-L2 busy cycles where the CPF-L2 interface was
|
||||
stalled for any reason.
|
||||
CPF-UTCL1 Stall: Percent of CPF busy cycles where the CPF was stalled by address
|
||||
translation.
|
||||
CPC Utilization: Percent of total cycles where the CPC was busy actively doing
|
||||
any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
|
||||
CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
|
||||
CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
|
||||
for processing.
|
||||
CPC-Workgroup Manager Utilization: Percent of CPC busy cycles spent dispatching
|
||||
workgroups to the workgroup manager.
|
||||
CPC-L2 Utilization: Percent of total cycles counted by the CPC-L2 interface where
|
||||
the CPC-L2 interface was active doing any work.
|
||||
CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
|
||||
translation.
|
||||
CPC-UTCL2 Utilization: 'Percent of total cycles counted by the CPC''s L2 address
|
||||
translation interface where the CPC was busy doing address translation work. '
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 501
|
||||
title: Command processor fetcher (CPF)
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
CPF Utilization:
|
||||
avg: AVG((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
|
||||
if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
|
||||
min: MIN((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
|
||||
if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
|
||||
max: MAX((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
|
||||
if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
CPF Stall:
|
||||
avg: AVG((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
|
||||
!= 0) else None))
|
||||
min: MIN((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
|
||||
!= 0) else None))
|
||||
max: MAX((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
CPF-L2 Utilization:
|
||||
avg: AVG((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
|
||||
if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
|
||||
min: MIN((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
|
||||
if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
|
||||
max: MAX((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
|
||||
if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
CPF-L2 Stall:
|
||||
avg: AVG((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
|
||||
!= 0) else None))
|
||||
min: MIN((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
|
||||
!= 0) else None))
|
||||
max: MAX((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
CPF-UTCL1 Stall:
|
||||
avg: AVG(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY)
|
||||
if (CPF_CPF_STAT_BUSY != 0) else None)
|
||||
min: MIN(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY)
|
||||
if (CPF_CPF_STAT_BUSY != 0) else None)
|
||||
max: MAX(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY)
|
||||
if (CPF_CPF_STAT_BUSY != 0) else None)
|
||||
unit: pct
|
||||
- metric_table:
|
||||
id: 502
|
||||
title: Command processor packet processor (CPC)
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
CPC Utilization:
|
||||
avg: AVG((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
|
||||
if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
|
||||
min: MIN((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
|
||||
if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
|
||||
max: MAX((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
|
||||
if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
CPC Stall Rate:
|
||||
avg: AVG((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None))
|
||||
min: MIN((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None))
|
||||
max: MAX((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
CPC Packet Decoding Utilization:
|
||||
avg: AVG((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None)
|
||||
min: MIN((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None)
|
||||
max: MAX((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None)
|
||||
unit: pct
|
||||
CPC-Workgroup Manager Utilization:
|
||||
avg: AVG((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None)
|
||||
min: MIN((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None)
|
||||
max: MAX((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None)
|
||||
unit: Pct
|
||||
CPC-L2 Utilization:
|
||||
avg: AVG((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
|
||||
if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
|
||||
min: MIN((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
|
||||
if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
|
||||
max: MAX((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
|
||||
if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
CPC-UTCL1 Stall:
|
||||
avg: AVG(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if
|
||||
(CPC_CPC_STAT_BUSY != 0) else None)
|
||||
min: MIN(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if
|
||||
(CPC_CPC_STAT_BUSY != 0) else None)
|
||||
max: MAX(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if
|
||||
(CPC_CPC_STAT_BUSY != 0) else None)
|
||||
unit: pct
|
||||
CPC-UTCL2 Utilization:
|
||||
avg: AVG((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
|
||||
if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
|
||||
min: MIN((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
|
||||
if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
|
||||
max: MAX((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
|
||||
if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
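The guarded percentage expressions in the tables above all follow one pattern: compute `100 * numerator / denominator` when the denominator is nonzero, otherwise yield `None`, then aggregate with AVG, MIN and MAX. The following is a minimal Python sketch of that pattern, not the profiler's actual evaluator. The helper names `guarded_percent` and `summarize`, the sample counter values, and the choice to skip `None` entries during aggregation are all assumptions made for illustration.

```python
from statistics import mean
from typing import Optional, Sequence


def guarded_percent(numerator: float, denominator: float) -> Optional[float]:
    """Return 100 * numerator / denominator, or None when the denominator is 0."""
    return (100 * numerator) / denominator if denominator != 0 else None


def summarize(values: Sequence[Optional[float]]) -> dict:
    """Aggregate per-dispatch values; skipping None is an assumption of this sketch."""
    valid = [v for v in values if v is not None]
    if not valid:
        return {"avg": None, "min": None, "max": None}
    return {"avg": mean(valid), "min": min(valid), "max": max(valid)}


# Made-up (busy, idle) cycle counts for three dispatches; the last one has no samples.
samples = [(50, 50), (75, 25), (0, 0)]

# Same shape as "CPF Utilization": 100 * BUSY / (BUSY + IDLE), guarded against zero.
cpf_utilization = [guarded_percent(busy, busy + idle) for busy, idle in samples]
summary = summarize(cpf_utilization)
print(cpf_utilization, summary)
```

The zero-denominator guard keeps a dispatch with no counted cycles from raising a division error; how the real tool treats the resulting `None` in its aggregates is not shown in this diff.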
-167
@@ -1,167 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 600
|
||||
title: Workgroup Manager (SPI)
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 601
|
||||
title: Workgroup Manager Utilizations
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Accelerator Utilization:
|
||||
avg: AVG(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
|
||||
min: MIN(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
|
||||
max: MAX(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
|
||||
unit: Pct
|
||||
tips:
|
||||
Scheduler-Pipe Utilization:
|
||||
avg: AVG(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
|
||||
min: MIN(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
|
||||
max: MAX(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
Workgroup Manager Utilization:
|
||||
avg: AVG(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
|
||||
min: MIN(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
|
||||
max: MAX(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
|
||||
unit: Pct
|
||||
tips:
|
||||
Shader Engine Utilization:
|
||||
avg: AVG(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
|
||||
min: MIN(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
|
||||
max: MAX(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
SIMD Utilization:
|
||||
avg: AVG(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
Dispatched Workgroups:
|
||||
avg: AVG(SPI_CSN_NUM_THREADGROUPS)
|
||||
min: MIN(SPI_CSN_NUM_THREADGROUPS)
|
||||
max: MAX(SPI_CSN_NUM_THREADGROUPS)
|
||||
unit: Workgroups
|
||||
tips:
|
||||
Dispatched Wavefronts:
|
||||
avg: AVG(SPI_CSN_WAVE)
|
||||
min: MIN(SPI_CSN_WAVE)
|
||||
max: MAX(SPI_CSN_WAVE)
|
||||
unit: Wavefronts
|
||||
tips:
|
||||
VGPR Writes:
|
||||
avg: AVG((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
min: MIN((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
max: MAX((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
unit: Cycles/wave
|
||||
tips:
|
||||
SGPR Writes:
|
||||
avg: AVG((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
min: MIN((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
max: MAX((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
unit: Cycles/wave
|
||||
tips:
|
||||
- metric_table:
|
||||
id: 602
|
||||
title: Workgroup Manager - Resource Allocation
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Not-scheduled Rate (Workgroup Manager):
|
||||
avg: AVG((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None)
|
||||
min: MIN((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None)
|
||||
max: MAX((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None)
|
||||
unit: Pct
|
||||
tips:
|
||||
Not-scheduled Rate (Scheduler-Pipe):
|
||||
avg: AVG((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None)
|
||||
min: MIN((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None)
|
||||
max: MAX((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None)
|
||||
unit: Pct
|
||||
tips:
|
||||
Scheduler-Pipe Stall Rate:
|
||||
avg: AVG((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None))
|
||||
min: MIN((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None))
|
||||
max: MAX((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None))
|
||||
unit: Pct
|
||||
tips:
|
||||
Scratch Stall Rate:
|
||||
avg: AVG((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
min: MIN((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
max: MAX((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
unit: Pct
|
||||
tips:
|
||||
Insufficient SIMD Waveslots:
|
||||
avg: AVG(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
Insufficient SIMD VGPRs:
|
||||
avg: AVG(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
Insufficient SIMD SGPRs:
|
||||
avg: AVG(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
Insufficient CU LDS:
|
||||
avg: AVG(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
Insufficient CU Barriers:
|
||||
avg: AVG(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
Reached CU Workgroup Limit:
|
||||
avg: AVG(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
Reached CU Wavefront Limit:
|
||||
avg: AVG(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
+201
@@ -0,0 +1,201 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 600
|
||||
title: Workgroup Manager (SPI)
|
||||
metrics_description:
|
||||
Accelerator Utilization: The percent of cycles in the kernel where the accelerator
|
||||
was actively doing any work.
|
||||
Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
|
||||
kernel where the scheduler-pipes were actively doing any work.
|
||||
Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
|
||||
manager was actively doing any work.
|
||||
Shader Engine Utilization: The percent of total shader engine cycles in the kernel
|
||||
where any CU in a shader-engine was actively doing any work, normalized over
|
||||
all shader-engines. Low values (e.g., << 100%) indicate that the accelerator
|
||||
was not fully saturated by the kernel, or a potential load-imbalance issue.
|
||||
SIMD Utilization: The percent of total SIMD cycles in the kernel where any SIMD
|
||||
on a CU was actively doing any work, summed over all CUs. Low values (less than
|
||||
100%) indicate that the accelerator was not fully saturated by the kernel, or
|
||||
a potential load-imbalance issue.
|
||||
Dispatched Workgroups: The total number of workgroups forming this kernel launch.
|
||||
Dispatched Wavefronts: The total number of wavefronts, summed over all workgroups,
|
||||
forming this kernel launch.
|
||||
VGPR Writes: The average number of cycles spent initializing VGPRs at wave creation.
|
||||
SGPR Writes: The average number of cycles spent initializing SGPRs at wave creation.
|
||||
Not-scheduled Rate (Workgroup Manager): The percent of total scheduler-pipe cycles
|
||||
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
|
||||
within the workgroup manager rather than a lack of a CU or SIMD with sufficient
|
||||
resources.
|
||||
Not-scheduled Rate (Scheduler-Pipe): 'The percent of total scheduler-pipe cycles
|
||||
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
|
||||
within the scheduler-pipes rather than a lack of a CU or SIMD with sufficient
|
||||
resources. '
|
||||
Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
|
||||
where a workgroup could not be scheduled to a CU due to occupancy limitations
|
||||
(like a lack of a CU or SIMD with sufficient resources).
|
||||
Scratch Stall Rate: The percent of total shader-engine cycles in the kernel where
|
||||
a workgroup could not be scheduled to a CU due to lack of private (a.k.a., scratch)
|
||||
memory slots. While this can reach up to 100%, note that the actual occupancy
|
||||
limitations on a kernel using private memory are typically quite small (for
|
||||
example, less than 1% of the total number of waves that can be scheduled to
|
||||
an accelerator).
|
||||
Insufficient SIMD Waveslots: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available waveslots.
|
||||
Insufficient SIMD VGPRs: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available VGPRs.
|
||||
Insufficient SIMD SGPRs: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available SGPRs.
|
||||
Insufficient CU LDS: The percent of total CU cycles in the kernel where a workgroup
|
||||
could not be scheduled to a CU due to lack of available LDS.
|
||||
Insufficient CU Barriers: The percent of total CU cycles in the kernel where a
|
||||
workgroup could not be scheduled to a CU due to lack of available barriers.
|
||||
Reached CU Workgroup Limit: The percent of total CU cycles in the kernel where
|
||||
a workgroup could not be scheduled to a CU due to limits within the workgroup
|
||||
manager. This is expected to always be zero on CDNA2 or newer accelerators
|
||||
(and small for previous accelerators).
|
||||
Reached CU Wavefront Limit: The percent of total CU cycles in the kernel where
|
||||
a wavefront could not be scheduled to a CU due to limits within the workgroup
|
||||
manager. This is expected to always be zero on CDNA2 or newer accelerators
|
||||
(and small for previous accelerators).
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 601
|
||||
title: Workgroup manager utilizations
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Accelerator Utilization:
|
||||
avg: AVG(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
|
||||
min: MIN(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
|
||||
max: MAX(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
|
||||
unit: Pct
|
||||
Scheduler-Pipe Utilization:
|
||||
avg: AVG(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu
|
||||
* $se_per_gpu))
|
||||
min: MIN(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu
|
||||
* $se_per_gpu))
|
||||
max: MAX(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu
|
||||
* $se_per_gpu))
|
||||
unit: Pct
|
||||
Workgroup Manager Utilization:
|
||||
avg: AVG(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
|
||||
min: MIN(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
|
||||
max: MAX(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
|
||||
unit: Pct
|
||||
Shader Engine Utilization:
|
||||
avg: AVG(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
|
||||
min: MIN(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
|
||||
max: MAX(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
|
||||
unit: Pct
|
||||
SIMD Utilization:
|
||||
avg: AVG(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
Dispatched Workgroups:
|
||||
avg: AVG(SPI_CSN_NUM_THREADGROUPS)
|
||||
min: MIN(SPI_CSN_NUM_THREADGROUPS)
|
||||
max: MAX(SPI_CSN_NUM_THREADGROUPS)
|
||||
unit: Workgroups
|
||||
Dispatched Wavefronts:
|
||||
avg: AVG(SPI_CSN_WAVE)
|
||||
min: MIN(SPI_CSN_WAVE)
|
||||
max: MAX(SPI_CSN_WAVE)
|
||||
unit: Wavefronts
|
||||
VGPR Writes:
|
||||
avg: AVG((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
min: MIN((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
max: MAX((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
unit: Cycles/wave
|
||||
SGPR Writes:
|
||||
avg: AVG((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
min: MIN((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
max: MAX((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
unit: Cycles/wave
|
||||
- metric_table:
|
||||
id: 602
|
||||
title: Workgroup Manager - Resource Allocation
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Not-scheduled Rate (Workgroup Manager):
|
||||
avg: AVG((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
min: MIN((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
max: MAX((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
unit: Pct
|
||||
Not-scheduled Rate (Scheduler-Pipe):
|
||||
avg: AVG((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
min: MIN((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
max: MAX((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
unit: Pct
|
||||
Scheduler-Pipe Stall Rate:
|
||||
avg: AVG((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None))
|
||||
min: MIN((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None))
|
||||
max: MAX((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None))
|
||||
unit: Pct
|
||||
Scratch Stall Rate:
|
||||
avg: AVG((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
min: MIN((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
max: MAX((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
unit: Pct
|
||||
Insufficient SIMD Waveslots:
|
||||
avg: AVG(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
Insufficient SIMD VGPRs:
|
||||
avg: AVG(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
Insufficient SIMD SGPRs:
|
||||
avg: AVG(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
Insufficient CU LDS:
|
||||
avg: AVG(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
Insufficient CU Barriers:
|
||||
avg: AVG(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
Reached CU Workgroup Limit:
|
||||
avg: AVG(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
Reached CU Wavefront Limit:
|
||||
avg: AVG(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
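Because the generated file above carries both a `metrics_description` map and the `metric` entries of each `metric_table`, the two sets of names can be cross-checked. The sketch below is a hypothetical consistency check, not part of `utils/split_config.py`. It assumes the panel has already been loaded into a plain dictionary and that `metrics_description` and `data source` are nested under `Panel Config`, which the flattened diff view does not confirm. The function name `undescribed_metrics` is invented for this example.

```python
def undescribed_metrics(panel: dict) -> list:
    """List metric names found in any metric_table that lack a description entry."""
    described = set(panel.get("metrics_description", {}))
    missing = []
    for source in panel.get("data source", []):
        table = source.get("metric_table", {})
        for name in table.get("metric", {}):
            if name not in described:
                missing.append(name)
    return missing


# A trimmed, hand-written stand-in for the "Panel Config" mapping of panel 600.
panel = {
    "id": 600,
    "title": "Workgroup Manager (SPI)",
    "metrics_description": {
        "Dispatched Workgroups": "The total number of workgroups forming this kernel launch.",
    },
    "data source": [
        {
            "metric_table": {
                "id": 601,
                "metric": {
                    "Dispatched Workgroups": {
                        "avg": "AVG(SPI_CSN_NUM_THREADGROUPS)",
                        "unit": "Workgroups",
                    },
                    "Dispatched Wavefronts": {
                        "avg": "AVG(SPI_CSN_WAVE)",
                        "unit": "Wavefronts",
                    },
                },
            }
        },
    ],
}

missing = undescribed_metrics(panel)
print(missing)
```

In this stand-in, `Dispatched Wavefronts` is deliberately left without a description so the check has something to report.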
-142
@@ -1,142 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 700
|
||||
title: Wavefront
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 701
|
||||
title: Wavefront Launch Stats
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Grid Size:
|
||||
avg: AVG(Grid_Size)
|
||||
min: MIN(Grid_Size)
|
||||
max: MAX(Grid_Size)
|
||||
unit: Work Items
|
||||
tips:
|
||||
Workgroup Size:
|
||||
avg: AVG(Workgroup_Size)
|
||||
min: MIN(Workgroup_Size)
|
||||
max: MAX(Workgroup_Size)
|
||||
unit: Work Items
|
||||
tips:
|
||||
Total Wavefronts:
|
||||
avg: AVG(SPI_CSN_WAVE)
|
||||
min: MIN(SPI_CSN_WAVE)
|
||||
max: MAX(SPI_CSN_WAVE)
|
||||
unit: Wavefronts
|
||||
tips:
|
||||
Saved Wavefronts:
|
||||
avg: AVG(SQ_WAVES_SAVED)
|
||||
min: MIN(SQ_WAVES_SAVED)
|
||||
max: MAX(SQ_WAVES_SAVED)
|
||||
unit: Wavefronts
|
||||
tips:
|
||||
Restored Wavefronts:
|
||||
avg: AVG(SQ_WAVES_RESTORED)
|
||||
min: MIN(SQ_WAVES_RESTORED)
|
||||
max: MAX(SQ_WAVES_RESTORED)
|
||||
unit: Wavefronts
|
||||
tips:
|
||||
VGPRs:
|
||||
avg: AVG(Arch_VGPR)
|
||||
min: MIN(Arch_VGPR)
|
||||
max: MAX(Arch_VGPR)
|
||||
unit: Registers
|
||||
tips:
|
||||
AGPRs:
|
||||
avg: AVG(Accum_VGPR)
|
||||
min: MIN(Accum_VGPR)
|
||||
max: MAX(Accum_VGPR)
|
||||
unit: Registers
|
||||
tips:
|
||||
SGPRs:
|
||||
avg: AVG(SGPR)
|
||||
min: MIN(SGPR)
|
||||
max: MAX(SGPR)
|
||||
unit: Registers
|
||||
tips:
|
||||
LDS Allocation:
|
||||
avg: AVG(LDS_Per_Workgroup)
|
||||
min: MIN(LDS_Per_Workgroup)
|
||||
max: MAX(LDS_Per_Workgroup)
|
||||
unit: Bytes
|
||||
tips:
|
||||
Scratch Allocation:
|
||||
avg: AVG(Scratch_Per_Workitem)
|
||||
min: MIN(Scratch_Per_Workitem)
|
||||
max: MAX(Scratch_Per_Workitem)
|
||||
unit: Bytes/Workitem
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 702
|
||||
title: Wavefront Runtime Stats
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Kernel Time:
|
||||
avg: AVG((End_Timestamp - Start_Timestamp))
|
||||
min: MIN((End_Timestamp - Start_Timestamp))
|
||||
max: MAX((End_Timestamp - Start_Timestamp))
|
||||
unit: ns
|
||||
tips:
|
||||
Kernel Time (Cycles):
|
||||
avg: AVG($GRBM_GUI_ACTIVE_PER_XCD)
|
||||
min: MIN($GRBM_GUI_ACTIVE_PER_XCD)
|
||||
max: MAX($GRBM_GUI_ACTIVE_PER_XCD)
|
||||
unit: Cycle
|
||||
tips:
|
||||
Instructions per wavefront:
|
||||
avg: AVG((SQ_INSTS / SQ_WAVES))
|
||||
min: MIN((SQ_INSTS / SQ_WAVES))
|
||||
max: MAX((SQ_INSTS / SQ_WAVES))
|
||||
unit: Instr/wavefront
|
||||
tips:
|
||||
Wave Cycles:
|
||||
avg: AVG(((4 * SQ_WAVE_CYCLES) / $denom))
|
||||
min: MIN(((4 * SQ_WAVE_CYCLES) / $denom))
|
||||
max: MAX(((4 * SQ_WAVE_CYCLES) / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
Dependency Wait Cycles:
|
||||
avg: AVG(((4 * SQ_WAIT_ANY) / $denom))
|
||||
min: MIN(((4 * SQ_WAIT_ANY) / $denom))
|
||||
max: MAX(((4 * SQ_WAIT_ANY) / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
Issue Wait Cycles:
|
||||
avg: AVG(((4 * SQ_WAIT_INST_ANY) / $denom))
|
||||
min: MIN(((4 * SQ_WAIT_INST_ANY) / $denom))
|
||||
max: MAX(((4 * SQ_WAIT_INST_ANY) / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
Active Cycles:
|
||||
avg: AVG(((4 * SQ_ACTIVE_INST_ANY) / $denom))
|
||||
min: MIN(((4 * SQ_ACTIVE_INST_ANY) / $denom))
|
||||
max: MAX(((4 * SQ_ACTIVE_INST_ANY) / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
Wavefront Occupancy:
|
||||
avg: AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
min: MIN((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
max: MAX((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
unit: Wavefronts
|
||||
coll_level: SQ_LEVEL_WAVES
|
||||
tips:
|
||||
+173
@@ -0,0 +1,173 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 700
|
||||
title: Wavefront
|
||||
metrics_description:
|
||||
Grid Size: The total number of work-items (or, threads) launched as a part of
|
||||
the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied
|
||||
by the total workgroup (or, block) size.
|
||||
Workgroup Size: The total number of work-items (or, threads) in each workgroup
|
||||
(or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
|
||||
to the total block size.
|
||||
Total Wavefronts: "The total number of wavefronts launched as part of the kernel\
|
||||
\ dispatch. On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs,\
|
||||
\ the wavefront size is always 64 work-items. Thus, the total number of wavefronts\
|
||||
\ should be equivalent to the ceiling of grid size divided by 64."
|
||||
Saved Wavefronts: The total number of wavefronts saved at a context-save.
|
||||
Restored Wavefronts: The total number of wavefronts restored from a context-save.
|
||||
VGPRs: 'The number of architected vector general-purpose registers allocated for
|
||||
the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
|
||||
by the compiler due to allocation granularity.'
|
||||
AGPRs: 'The number of accumulation vector general-purpose registers allocated
|
||||
for the kernel, see AGPRs. Note: this may not exactly match the number of AGPRs
|
||||
requested by the compiler due to allocation granularity.'
|
||||
SGPRs: 'The number of scalar general-purpose registers allocated for the kernel,
|
||||
see SALU. Note: this may not exactly match the number of SGPRs requested by
|
||||
the compiler due to allocation granularity.'
|
||||
LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
|
||||
for this kernel. Note: This may also be larger than what was requested at compile
|
||||
time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
|
||||
Scratch Allocation: The number of bytes of scratch memory requested per work-item
|
||||
for this kernel. Scratch memory is used for stack memory on the accelerator,
|
||||
as well as for register spills and restores.
|
||||
Kernel Time: The total duration of the executed kernel.
|
||||
Kernel Time (Cycles): The total duration of the executed kernel in cycles.
|
||||
Instructions per wavefront: The average number of instructions (of all types)
|
||||
executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.
|
||||
Wave Cycles: The number of cycles a wavefront in the kernel dispatch spent resident
|
||||
on a compute unit per normalization unit. This is averaged over all wavefronts
|
||||
in a kernel dispatch.
|
||||
Dependency Wait Cycles: The number of cycles a wavefront in the kernel dispatch
|
||||
stalled waiting on memory of any kind (e.g., instruction fetch, vector or scalar memory) per normalization unit. This is averaged over
|
||||
all wavefronts in a kernel dispatch.
|
||||
Issue Wait Cycles: The number of cycles a wavefront in the kernel dispatch was
|
||||
unable to issue an instruction for any reason (e.g., execution pipe back-pressure,
|
||||
arbitration loss, etc.) per normalization unit. This counter is incremented
|
||||
at every cycle by all wavefronts on a CU unable to issue an instruction. As
|
||||
such, it is most useful to get a sense of how waves were spending their time,
|
||||
rather than identification of a precise limiter because another wave could be
|
||||
actively executing while a wave is issue stalled. The sum of this metric, Dependency
|
||||
Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
|
||||
Active Cycles: The average number of cycles a wavefront in the kernel dispatch
|
||||
was actively executing instructions per normalization unit. This measurement
|
||||
is made on a per-wavefront basis, and may include cycles that another wavefront
|
||||
spent actively executing (on another execution unit, for example) or was stalled.
|
||||
As such, it is most useful to get a sense of how waves were spending their time,
|
||||
rather than identification of a precise limiter. The sum of this metric, Issue
|
||||
Wait Cycles and Dependency Wait Cycles should be equal to the total Wave Cycles
|
||||
metric.
|
||||
Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
|
||||
over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
||||
kernels (less than 1 ms).'
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 701
|
||||
title: Wavefront Launch Stats
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Grid Size:
|
||||
avg: AVG(Grid_Size)
|
||||
min: MIN(Grid_Size)
|
||||
max: MAX(Grid_Size)
|
||||
unit: Work Items
|
||||
Workgroup Size:
|
||||
avg: AVG(Workgroup_Size)
|
||||
min: MIN(Workgroup_Size)
|
||||
max: MAX(Workgroup_Size)
|
||||
unit: Work Items
|
||||
Total Wavefronts:
|
||||
avg: AVG(SPI_CSN_WAVE)
|
||||
min: MIN(SPI_CSN_WAVE)
|
||||
max: MAX(SPI_CSN_WAVE)
|
||||
unit: Wavefronts
|
||||
Saved Wavefronts:
|
||||
avg: AVG(SQ_WAVES_SAVED)
|
||||
min: MIN(SQ_WAVES_SAVED)
|
||||
max: MAX(SQ_WAVES_SAVED)
|
||||
unit: Wavefronts
|
||||
Restored Wavefronts:
|
||||
avg: AVG(SQ_WAVES_RESTORED)
|
||||
min: MIN(SQ_WAVES_RESTORED)
|
||||
max: MAX(SQ_WAVES_RESTORED)
|
||||
unit: Wavefronts
|
||||
VGPRs:
|
||||
avg: AVG(Arch_VGPR)
|
||||
min: MIN(Arch_VGPR)
|
||||
max: MAX(Arch_VGPR)
|
||||
unit: Registers
|
||||
AGPRs:
|
||||
avg: AVG(Accum_VGPR)
|
||||
min: MIN(Accum_VGPR)
|
||||
max: MAX(Accum_VGPR)
|
||||
unit: Registers
|
||||
SGPRs:
|
||||
avg: AVG(SGPR)
|
||||
min: MIN(SGPR)
|
||||
max: MAX(SGPR)
|
||||
unit: Registers
|
||||
LDS Allocation:
|
||||
avg: AVG(LDS_Per_Workgroup)
|
||||
min: MIN(LDS_Per_Workgroup)
|
||||
max: MAX(LDS_Per_Workgroup)
|
||||
unit: Bytes
|
||||
Scratch Allocation:
|
||||
avg: AVG(Scratch_Per_Workitem)
|
||||
min: MIN(Scratch_Per_Workitem)
|
||||
max: MAX(Scratch_Per_Workitem)
|
||||
unit: Bytes/Workitem
|
||||
- metric_table:
|
||||
id: 702
|
||||
title: Wavefront Runtime Stats
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Kernel Time:
|
||||
avg: AVG((End_Timestamp - Start_Timestamp))
|
||||
min: MIN((End_Timestamp - Start_Timestamp))
|
||||
max: MAX((End_Timestamp - Start_Timestamp))
|
||||
unit: ns
|
||||
Kernel Time (Cycles):
|
||||
avg: AVG($GRBM_GUI_ACTIVE_PER_XCD)
|
||||
min: MIN($GRBM_GUI_ACTIVE_PER_XCD)
|
||||
max: MAX($GRBM_GUI_ACTIVE_PER_XCD)
|
||||
unit: Cycle
|
||||
Instructions per wavefront:
|
||||
avg: AVG((SQ_INSTS / SQ_WAVES))
|
||||
min: MIN((SQ_INSTS / SQ_WAVES))
|
||||
max: MAX((SQ_INSTS / SQ_WAVES))
|
||||
unit: Instr/wavefront
|
||||
Wave Cycles:
|
||||
avg: AVG(((4 * SQ_WAVE_CYCLES) / $denom))
|
||||
min: MIN(((4 * SQ_WAVE_CYCLES) / $denom))
|
||||
max: MAX(((4 * SQ_WAVE_CYCLES) / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
Dependency Wait Cycles:
|
||||
avg: AVG(((4 * SQ_WAIT_ANY) / $denom))
|
||||
min: MIN(((4 * SQ_WAIT_ANY) / $denom))
|
||||
max: MAX(((4 * SQ_WAIT_ANY) / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
Issue Wait Cycles:
|
||||
avg: AVG(((4 * SQ_WAIT_INST_ANY) / $denom))
|
||||
min: MIN(((4 * SQ_WAIT_INST_ANY) / $denom))
|
||||
max: MAX(((4 * SQ_WAIT_INST_ANY) / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
Active Cycles:
|
||||
avg: AVG(((4 * SQ_ACTIVE_INST_ANY) / $denom))
|
||||
min: MIN(((4 * SQ_ACTIVE_INST_ANY) / $denom))
|
||||
max: MAX(((4 * SQ_ACTIVE_INST_ANY) / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
Wavefront Occupancy:
|
||||
avg: AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
min: MIN((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
max: MAX((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
unit: Wavefronts
|
||||
coll_level: SQ_LEVEL_WAVES
|
||||
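Two of the descriptions in this panel state simple arithmetic relations: Total Wavefronts should equal the ceiling of the grid size divided by 64, and Wave Cycles should equal the sum of Dependency Wait Cycles, Issue Wait Cycles and Active Cycles. The sketch below is only an illustration of those two relations. The function names and sample numbers are hypothetical and are not part of the profiler or of utils/split_config.py.

```python
WAVEFRONT_SIZE = 64  # work-items per wavefront, as stated in the Total Wavefronts description


def expected_total_wavefronts(grid_size: int) -> int:
    """Ceiling of grid_size / 64, computed with integer arithmetic."""
    if grid_size < 0:
        raise ValueError("grid_size must be non-negative")
    return (grid_size + WAVEFRONT_SIZE - 1) // WAVEFRONT_SIZE


def wave_cycle_residual(wave_cycles: float, dependency_wait: float,
                        issue_wait: float, active: float) -> float:
    """Difference between Wave Cycles and the sum of its three components.

    A value near zero matches the relation given in the metric descriptions;
    real measurements may deviate from it.
    """
    return wave_cycles - (dependency_wait + issue_wait + active)


if __name__ == "__main__":
    # Hypothetical values, chosen only to exercise the helpers.
    print(expected_total_wavefronts(1000))
    print(wave_cycle_residual(100.0, 30.0, 20.0, 50.0))
```

A check like this could be handy when sanity-testing profiler output against the documented descriptions, but the exact tolerance to accept is a judgment call that the config does not specify.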
-267
@@ -1,267 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 1000
|
||||
title: Compute Units - Instruction Mix
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1001
|
||||
title: Overall Instruction Mix
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
VALU:
|
||||
avg: AVG(((SQ_INSTS_VALU - SQ_INSTS_MFMA) / $denom))
|
||||
min: MIN(((SQ_INSTS_VALU - SQ_INSTS_MFMA) / $denom))
|
||||
max: MAX(((SQ_INSTS_VALU - SQ_INSTS_MFMA) / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
VMEM:
|
||||
avg: AVG(((SQ_INSTS_VMEM - SQ_INSTS_FLAT_LDS_ONLY) / $denom))
|
||||
min: MIN(((SQ_INSTS_VMEM - SQ_INSTS_FLAT_LDS_ONLY) / $denom))
|
||||
max: MAX(((SQ_INSTS_VMEM - SQ_INSTS_FLAT_LDS_ONLY) / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
LDS:
|
||||
avg: AVG((SQ_INSTS_LDS / $denom))
|
||||
min: MIN((SQ_INSTS_LDS / $denom))
|
||||
max: MAX((SQ_INSTS_LDS / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
MFMA:
|
||||
avg: AVG((SQ_INSTS_MFMA / $denom))
|
||||
min: MIN((SQ_INSTS_MFMA / $denom))
|
||||
max: MAX((SQ_INSTS_MFMA / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
SALU:
|
||||
avg: AVG((SQ_INSTS_SALU / $denom))
|
||||
min: MIN((SQ_INSTS_SALU / $denom))
|
||||
max: MAX((SQ_INSTS_SALU / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
SMEM:
|
||||
avg: AVG((SQ_INSTS_SMEM / $denom))
|
||||
min: MIN((SQ_INSTS_SMEM / $denom))
|
||||
max: MAX((SQ_INSTS_SMEM / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
Branch:
|
||||
avg: AVG((SQ_INSTS_BRANCH / $denom))
|
||||
min: MIN((SQ_INSTS_BRANCH / $denom))
|
||||
max: MAX((SQ_INSTS_BRANCH / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 1002
|
||||
title: VALU Arithmetic Instr Mix
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
INT32:
|
||||
avg: AVG((SQ_INSTS_VALU_INT32 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_INT32 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_INT32 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
INT64:
|
||||
avg: AVG((SQ_INSTS_VALU_INT64 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_INT64 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_INT64 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
F16-ADD:
|
||||
avg: AVG((SQ_INSTS_VALU_ADD_F16 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_ADD_F16 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_ADD_F16 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
F16-MUL:
|
||||
avg: AVG((SQ_INSTS_VALU_MUL_F16 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_MUL_F16 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_MUL_F16 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
F16-FMA:
|
||||
avg: AVG((SQ_INSTS_VALU_FMA_F16 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_FMA_F16 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_FMA_F16 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
F16-Trans:
|
||||
avg: AVG((SQ_INSTS_VALU_TRANS_F16 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_TRANS_F16 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_TRANS_F16 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
F32-ADD:
|
||||
avg: AVG((SQ_INSTS_VALU_ADD_F32 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_ADD_F32 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_ADD_F32 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
F32-MUL:
|
||||
avg: AVG((SQ_INSTS_VALU_MUL_F32 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_MUL_F32 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_MUL_F32 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
F32-FMA:
|
||||
avg: AVG((SQ_INSTS_VALU_FMA_F32 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_FMA_F32 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_FMA_F32 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
F32-Trans:
|
||||
avg: AVG((SQ_INSTS_VALU_TRANS_F32 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_TRANS_F32 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_TRANS_F32 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
F64-ADD:
|
||||
avg: AVG((SQ_INSTS_VALU_ADD_F64 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_ADD_F64 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_ADD_F64 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
F64-MUL:
|
||||
avg: AVG((SQ_INSTS_VALU_MUL_F64 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_MUL_F64 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_MUL_F64 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
F64-FMA:
|
||||
avg: AVG((SQ_INSTS_VALU_FMA_F64 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_FMA_F64 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_FMA_F64 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
F64-Trans:
|
||||
avg: AVG((SQ_INSTS_VALU_TRANS_F64 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_TRANS_F64 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_TRANS_F64 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
Conversion:
|
||||
avg: AVG((SQ_INSTS_VALU_CVT / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_CVT / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_CVT / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 1003
|
||||
title: VMEM Instr Mix
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Global/Generic Instr:
|
||||
avg: AVG((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
Global/Generic Read:
|
||||
avg: AVG((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
Global/Generic Write:
|
||||
avg: AVG((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
Global/Generic Atomic:
|
||||
avg: AVG((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
Spill/Stack Instr:
|
||||
avg: AVG((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
Spill/Stack Read:
|
||||
avg: AVG((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
Spill/Stack Write:
|
||||
avg: AVG((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
Spill/Stack Atomic:
|
||||
avg: AVG((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 1004
|
||||
title: MFMA Arithmetic Instr Mix
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
MFMA-I8:
|
||||
avg: AVG((SQ_INSTS_VALU_MFMA_I8 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_MFMA_I8 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_MFMA_I8 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
MFMA-F16:
|
||||
avg: AVG((SQ_INSTS_VALU_MFMA_F16 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_MFMA_F16 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_MFMA_F16 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
MFMA-BF16:
|
||||
avg: AVG((SQ_INSTS_VALU_MFMA_BF16 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_MFMA_BF16 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_MFMA_BF16 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
MFMA-F32:
|
||||
avg: AVG((SQ_INSTS_VALU_MFMA_F32 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_MFMA_F32 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_MFMA_F32 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
MFMA-F64:
|
||||
avg: AVG((SQ_INSTS_VALU_MFMA_F64 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_MFMA_F64 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_MFMA_F64 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
+304
@@ -0,0 +1,304 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 1000
|
||||
title: Compute Units - Instruction Mix
|
||||
metrics_description:
|
||||
VALU: The total number of vector arithmetic logic unit (VALU) operations issued.
|
||||
These are the workhorses of the compute unit, and are used to execute a wide
|
||||
range of instruction types including floating point operations, non-uniform
|
||||
address calculations, transcendental operations, integer operations, shifts,
|
||||
conditional evaluation, etc.
|
||||
VMEM: The total number of vector memory operations issued. These include most
|
||||
loads, stores and atomic operations and all accesses to generic, global, private
|
||||
and texture memory.
|
||||
LDS: The total number of LDS (also known as shared memory) operations issued.
|
||||
These include loads, stores, atomics, and HIP's __shfl operations.
|
||||
MFMA: The total number of matrix fused multiply-add instructions issued.
|
||||
SALU: The total number of scalar arithmetic logic unit (SALU) operations issued.
|
||||
Typically these are used for address calculations, literal constants, and other
|
||||
operations that are provably uniform across a wavefront. Although scalar memory
|
||||
(SMEM) operations are issued by the SALU, they are counted separately in this
|
||||
section.
|
||||
SMEM: The total number of scalar memory (SMEM) operations issued. These are typically
|
||||
used for loading kernel arguments, base-pointers and loads from HIP's __constant__
|
||||
memory.
|
||||
Branch: The total number of branch operations issued. These typically consist
|
||||
of jump or branch operations and are used to implement control flow.
|
||||
INT32: The total number of instructions operating on 32-bit integer operands issued
|
||||
to the VALU per normalization unit.
|
||||
INT64: The total number of instructions operating on 64-bit integer operands issued
|
||||
to the VALU per normalization unit.
|
||||
F16-ADD: The total number of addition instructions operating on 16-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F16-MUL: The total number of multiplication instructions operating on 16-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F16-FMA: The total number of fused multiply-add instructions operating on 16-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F16-Trans: The total number of transcendental instructions (e.g., sqrt) operating
|
||||
on 16-bit floating-point operands issued to the VALU per normalization unit.
|
||||
F32-ADD: The total number of addition instructions operating on 32-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F32-MUL: The total number of multiplication instructions operating on 32-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F32-FMA: The total number of fused multiply-add instructions operating on 32-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F32-Trans: The total number of transcendental instructions (such as sqrt) operating
|
||||
on 32-bit floating-point operands issued to the VALU per normalization unit.
|
||||
F64-ADD: The total number of addition instructions operating on 64-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F64-MUL: The total number of multiplication instructions operating on 64-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F64-FMA: The total number of fused multiply-add instructions operating on 64-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F64-Trans: The total number of transcendental instructions (such as sqrt) operating
|
||||
on 64-bit floating-point operands issued to the VALU per normalization unit.
|
||||
Conversion: "The total number of type conversion instructions (such as converting\
|
||||
\ data to or from F32\u2194F64) issued to the VALU per normalization unit."
|
||||
Global/Generic Instr: The total number of global & generic memory instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Read: The total number of global & generic memory read instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Write: The total number of global & generic memory write instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Atomic: The total number of global & generic memory atomic (with
|
||||
and without return) instructions executed on all compute units on the accelerator,
|
||||
per normalization unit.
|
||||
Spill/Stack Instr: The total number of spill/stack memory instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Read: The total number of spill/stack memory read instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Write: The total number of spill/stack memory write instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Atomic: The total number of spill/stack memory atomic (with and without
|
||||
return) instructions executed on all compute units on the accelerator, per normalization
|
||||
unit. Typically unused, as these memory operations are generally used to implement
|
||||
thread-local storage.
|
||||
MFMA-I8: The total number of 8-bit integer MFMA instructions issued per normalization
|
||||
unit.
|
||||
MFMA-F8: The total number of 8-bit floating point MFMA instructions issued per
|
||||
normalization unit. This is supported in AMD Instinct MI300 series and later
|
||||
only.
|
||||
MFMA-F16: The total number of 16-bit floating point MFMA instructions issued per
|
||||
normalization unit.
|
||||
MFMA-BF16: The total number of 16-bit brain floating point MFMA instructions issued
|
||||
per normalization unit.
|
||||
MFMA-F32: The total number of 32-bit floating-point MFMA instructions issued per
|
||||
normalization unit.
|
||||
MFMA-F64: The total number of 64-bit floating-point MFMA instructions issued per
|
||||
normalization unit.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1001
|
||||
title: Overall Instruction Mix
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
VALU:
|
||||
avg: AVG(((SQ_INSTS_VALU - SQ_INSTS_MFMA) / $denom))
|
||||
min: MIN(((SQ_INSTS_VALU - SQ_INSTS_MFMA) / $denom))
|
||||
max: MAX(((SQ_INSTS_VALU - SQ_INSTS_MFMA) / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
VMEM:
|
||||
avg: AVG(((SQ_INSTS_VMEM - SQ_INSTS_FLAT_LDS_ONLY) / $denom))
|
||||
min: MIN(((SQ_INSTS_VMEM - SQ_INSTS_FLAT_LDS_ONLY) / $denom))
|
||||
max: MAX(((SQ_INSTS_VMEM - SQ_INSTS_FLAT_LDS_ONLY) / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
LDS:
|
||||
avg: AVG((SQ_INSTS_LDS / $denom))
|
||||
min: MIN((SQ_INSTS_LDS / $denom))
|
||||
max: MAX((SQ_INSTS_LDS / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
MFMA:
|
||||
avg: AVG((SQ_INSTS_MFMA / $denom))
|
||||
min: MIN((SQ_INSTS_MFMA / $denom))
|
||||
max: MAX((SQ_INSTS_MFMA / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
SALU:
|
||||
avg: AVG((SQ_INSTS_SALU / $denom))
|
||||
min: MIN((SQ_INSTS_SALU / $denom))
|
||||
max: MAX((SQ_INSTS_SALU / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
SMEM:
|
||||
avg: AVG((SQ_INSTS_SMEM / $denom))
|
||||
min: MIN((SQ_INSTS_SMEM / $denom))
|
||||
max: MAX((SQ_INSTS_SMEM / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
Branch:
|
||||
avg: AVG((SQ_INSTS_BRANCH / $denom))
|
||||
min: MIN((SQ_INSTS_BRANCH / $denom))
|
||||
max: MAX((SQ_INSTS_BRANCH / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
- metric_table:
|
||||
id: 1002
|
||||
title: VALU Arithmetic Instruction Mix
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
INT32:
|
||||
avg: AVG((SQ_INSTS_VALU_INT32 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_INT32 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_INT32 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
INT64:
|
||||
avg: AVG((SQ_INSTS_VALU_INT64 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_INT64 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_INT64 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
F16-ADD:
|
||||
avg: AVG((SQ_INSTS_VALU_ADD_F16 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_ADD_F16 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_ADD_F16 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
F16-MUL:
|
||||
avg: AVG((SQ_INSTS_VALU_MUL_F16 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_MUL_F16 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_MUL_F16 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
F16-FMA:
|
||||
avg: AVG((SQ_INSTS_VALU_FMA_F16 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_FMA_F16 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_FMA_F16 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
F16-Trans:
|
||||
avg: AVG((SQ_INSTS_VALU_TRANS_F16 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_TRANS_F16 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_TRANS_F16 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
F32-ADD:
|
||||
avg: AVG((SQ_INSTS_VALU_ADD_F32 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_ADD_F32 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_ADD_F32 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
F32-MUL:
|
||||
avg: AVG((SQ_INSTS_VALU_MUL_F32 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_MUL_F32 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_MUL_F32 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
F32-FMA:
|
||||
avg: AVG((SQ_INSTS_VALU_FMA_F32 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_FMA_F32 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_FMA_F32 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
F32-Trans:
|
||||
avg: AVG((SQ_INSTS_VALU_TRANS_F32 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_TRANS_F32 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_TRANS_F32 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
F64-ADD:
|
||||
avg: AVG((SQ_INSTS_VALU_ADD_F64 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_ADD_F64 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_ADD_F64 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
F64-MUL:
|
||||
avg: AVG((SQ_INSTS_VALU_MUL_F64 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_MUL_F64 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_MUL_F64 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
F64-FMA:
|
||||
avg: AVG((SQ_INSTS_VALU_FMA_F64 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_FMA_F64 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_FMA_F64 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
F64-Trans:
|
||||
avg: AVG((SQ_INSTS_VALU_TRANS_F64 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_TRANS_F64 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_TRANS_F64 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
Conversion:
|
||||
avg: AVG((SQ_INSTS_VALU_CVT / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_CVT / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_CVT / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
- metric_table:
|
||||
id: 1003
|
||||
title: VMEM Instruction Mix
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Global/Generic Instr:
|
||||
avg: AVG((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
Global/Generic Read:
|
||||
avg: AVG((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
Global/Generic Write:
|
||||
avg: AVG((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
Global/Generic Atomic:
|
||||
avg: AVG((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
Spill/Stack Instr:
|
||||
avg: AVG((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
Spill/Stack Read:
|
||||
avg: AVG((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
Spill/Stack Write:
|
||||
avg: AVG((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
Spill/Stack Atomic:
|
||||
avg: AVG((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
    - metric_table:
        id: 1004
        title: MFMA Arithmetic Instruction Mix
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
        metric:
          MFMA-I8:
            avg: AVG((SQ_INSTS_VALU_MFMA_I8 / $denom))
            min: MIN((SQ_INSTS_VALU_MFMA_I8 / $denom))
            max: MAX((SQ_INSTS_VALU_MFMA_I8 / $denom))
            unit: (instr + $normUnit)
          MFMA-F16:
            avg: AVG((SQ_INSTS_VALU_MFMA_F16 / $denom))
            min: MIN((SQ_INSTS_VALU_MFMA_F16 / $denom))
            max: MAX((SQ_INSTS_VALU_MFMA_F16 / $denom))
            unit: (instr + $normUnit)
          MFMA-BF16:
            avg: AVG((SQ_INSTS_VALU_MFMA_BF16 / $denom))
            min: MIN((SQ_INSTS_VALU_MFMA_BF16 / $denom))
            max: MAX((SQ_INSTS_VALU_MFMA_BF16 / $denom))
            unit: (instr + $normUnit)
          MFMA-F32:
            avg: AVG((SQ_INSTS_VALU_MFMA_F32 / $denom))
            min: MIN((SQ_INSTS_VALU_MFMA_F32 / $denom))
            max: MAX((SQ_INSTS_VALU_MFMA_F32 / $denom))
            unit: (instr + $normUnit)
          MFMA-F64:
            avg: AVG((SQ_INSTS_VALU_MFMA_F64 / $denom))
            min: MIN((SQ_INSTS_VALU_MFMA_F64 / $denom))
            max: MAX((SQ_INSTS_VALU_MFMA_F64 / $denom))
            unit: (instr + $normUnit)
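Each row in the tables above reports the average, minimum, and maximum of a hardware counter divided by `$denom`. A minimal Python sketch of that reduction follows, assuming one counter sample and one denominator value per kernel dispatch. The helper name `summarize_normalized` and the sample values are hypothetical and are not part of the tool.

```python
from statistics import mean


def summarize_normalized(counter_samples, denom_samples):
    """Return avg/min/max of counter / denom, one ratio per dispatch.

    Hypothetical helper: mirrors the AVG/MIN/MAX((counter / $denom)) pattern,
    assuming $denom is available as one normalization value per dispatch.
    """
    ratios = [c / d for c, d in zip(counter_samples, denom_samples)]
    return {"avg": mean(ratios), "min": min(ratios), "max": max(ratios)}


# Illustrative values only, e.g. SQ_INSTS_VALU_MFMA_F16 over three dispatches.
stats = summarize_normalized([8, 4, 12], [4, 4, 4])
print(stats)
```

The three expressions in each metric entry differ only in the outer aggregate, so a single list of per-dispatch ratios is enough to produce all three columns.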
-260
@@ -1,260 +0,0 @@
---
# Add description/tips for each metric in this section.
# So it could be shown in hover.
Metric Description:

# Define the panel properties and properties of each metric in the panel.
Panel Config:
  id: 1100
  title: Compute Units - Compute Pipeline
  data source:
    - metric_table:
        id: 1101
        title: Speed-of-Light
        header:
          metric: Metric
          value: Avg
          unit: Unit
          peak: Peak
          pop: Pct of Peak
          tips: Tips
        metric:
          VALU FLOPs:
            value: AVG(((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) + SQ_INSTS_VALU_TRANS_F16)
              + (2 * SQ_INSTS_VALU_FMA_F16))) + (64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32)
              + SQ_INSTS_VALU_TRANS_F32) + (2 * SQ_INSTS_VALU_FMA_F32)))) + (64 * (((SQ_INSTS_VALU_ADD_F64
              + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64) + (2 * SQ_INSTS_VALU_FMA_F64))))
              / (End_Timestamp - Start_Timestamp)))
            unit: GFLOP
            peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
            pop: ((100 * AVG(((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16)
              + SQ_INSTS_VALU_TRANS_F16) + (2 * SQ_INSTS_VALU_FMA_F16))) + (64 * (((SQ_INSTS_VALU_ADD_F32
              + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (2 * SQ_INSTS_VALU_FMA_F32))))
              + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
              + (2 * SQ_INSTS_VALU_FMA_F64)))) / (End_Timestamp - Start_Timestamp)))) / (((($max_sclk
              * $cu_per_gpu) * 64) * 2) / 1000))
            tips:
          VALU IOPs:
            value: AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) / (End_Timestamp - Start_Timestamp)))
            unit: GIOP
            peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
            pop: ((100 * AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) / (End_Timestamp
              - Start_Timestamp)))) / (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000))
            tips:
          MFMA FLOPs (BF16):
            value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp)))
            unit: GFLOP
            peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
            pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk * $cu_per_gpu) * 1024) / 1000))
            tips:
          MFMA FLOPs (F16):
            value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp)))
            unit: GFLOP
            peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
            pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk * $cu_per_gpu) * 1024) / 1000))
            tips:
          MFMA FLOPs (F32):
            value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) / (End_Timestamp - Start_Timestamp)))
            unit: GFLOP
            peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
            pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk * $cu_per_gpu) * 256) / 1000))
            tips:
          MFMA FLOPs (F64):
            value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp)))
            unit: GFLOP
            peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
            pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk * $cu_per_gpu) * 256) / 1000))
            tips:
          MFMA IOPs (INT8):
            value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp)))
            unit: GIOP
            peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
            pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk * $cu_per_gpu) * 1024) / 1000))
            tips:
    - metric_table:
        id: 1102
        title: Pipeline Stats
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          IPC:
            avg: AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))
            min: MIN((SQ_INSTS / SQ_BUSY_CU_CYCLES))
            max: MAX((SQ_INSTS / SQ_BUSY_CU_CYCLES))
            unit: Instr/cycle
            tips:
          IPC (Issued):
            avg: AVG(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
              + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
              / SQ_ACTIVE_INST_ANY))
            min: MIN(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
              + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
              / SQ_ACTIVE_INST_ANY))
            max: MAX(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
              + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
              / SQ_ACTIVE_INST_ANY))
            unit: Instr/cycle
            tips:
          SALU Utilization:
            avg: AVG((((100 * SQ_ACTIVE_INST_SCA) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            min: MIN((((100 * SQ_ACTIVE_INST_SCA) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            max: MAX((((100 * SQ_ACTIVE_INST_SCA) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            unit: pct
            tips:
          VALU Utilization:
            avg: AVG((((100 * SQ_ACTIVE_INST_VALU) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            min: MIN((((100 * SQ_ACTIVE_INST_VALU) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            max: MAX((((100 * SQ_ACTIVE_INST_VALU) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            unit: pct
            tips:
          VMEM Utilization:
            avg: AVG((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            min: MIN((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            max: MAX((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            unit: pct
            tips:
          Branch Utilization:
            avg: AVG((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            min: MIN((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            max: MAX((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            unit: pct
            tips:
          VALU Active Threads:
            avg: AVG(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
              != 0) else None))
            min: MIN(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
              != 0) else None))
            max: MAX(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
              != 0) else None))
            unit: Threads
            tips:
          MFMA Utilization:
            avg: AVG(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / ((4 * $cu_per_gpu) * $GRBM_GUI_ACTIVE_PER_XCD)))
            min: MIN(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / ((4 * $cu_per_gpu) * $GRBM_GUI_ACTIVE_PER_XCD)))
            max: MAX(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / ((4 * $cu_per_gpu) * $GRBM_GUI_ACTIVE_PER_XCD)))
            unit: pct
            tips:
          MFMA Instr Cycles:
            avg: AVG(((SQ_VALU_MFMA_BUSY_CYCLES / SQ_INSTS_MFMA) if (SQ_INSTS_MFMA != 0)
              else None))
            min: MIN(((SQ_VALU_MFMA_BUSY_CYCLES / SQ_INSTS_MFMA) if (SQ_INSTS_MFMA != 0)
              else None))
            max: MAX(((SQ_VALU_MFMA_BUSY_CYCLES / SQ_INSTS_MFMA) if (SQ_INSTS_MFMA != 0)
              else None))
            unit: cycles/instr
            tips:
          VMEM Latency:
            avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_VMEM) if (SQ_INSTS_VMEM != 0)
              else None))
            min: MIN(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_VMEM) if (SQ_INSTS_VMEM != 0)
              else None))
            max: MAX(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_VMEM) if (SQ_INSTS_VMEM != 0)
              else None))
            unit: Cycles
            coll_level: SQ_INST_LEVEL_VMEM
            tips:
          SMEM Latency:
            avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_SMEM) if (SQ_INSTS_SMEM != 0)
              else None))
            min: MIN(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_SMEM) if (SQ_INSTS_SMEM != 0)
              else None))
            max: MAX(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_SMEM) if (SQ_INSTS_SMEM != 0)
              else None))
            unit: Cycles
            coll_level: SQ_INST_LEVEL_SMEM
            tips:
    - metric_table:
        id: 1103
        title: Arithmetic Operations
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          FLOPs (Total):
            avg: AVG((((((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) + SQ_INSTS_VALU_TRANS_F16)
              + (SQ_INSTS_VALU_FMA_F16 * 2))) + ((512 * SQ_INSTS_VALU_MFMA_MOPS_F16) + (512
              * SQ_INSTS_VALU_MFMA_MOPS_BF16))) + (64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32)
              + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_FMA_F32 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
              + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
              + (SQ_INSTS_VALU_FMA_F64 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64)) /
              $denom))
            min: MIN((((((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) + SQ_INSTS_VALU_TRANS_F16)
              + (SQ_INSTS_VALU_FMA_F16 * 2))) + ((512 * SQ_INSTS_VALU_MFMA_MOPS_F16) + (512
              * SQ_INSTS_VALU_MFMA_MOPS_BF16))) + (64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32)
              + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_FMA_F32 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
              + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
              + (SQ_INSTS_VALU_FMA_F64 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64)) /
              $denom))
            max: MAX((((((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) + SQ_INSTS_VALU_TRANS_F16)
              + (SQ_INSTS_VALU_FMA_F16 * 2))) + ((512 * SQ_INSTS_VALU_MFMA_MOPS_F16) + (512
              * SQ_INSTS_VALU_MFMA_MOPS_BF16))) + (64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32)
              + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_FMA_F32 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
              + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
              + (SQ_INSTS_VALU_FMA_F64 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64)) /
              $denom))
            unit: (OPs + $normUnit)
            tips:
          IOPs (Total):
            avg: AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / $denom)
            min: MIN(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / $denom)
            max: MAX(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / $denom)
            unit: (OPs + $normUnit)
            tips:
          F16 OPs:
            avg: AVG(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16)) +
              (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512 *
              SQ_INSTS_VALU_MFMA_MOPS_F16)) / $denom))
            min: MIN(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16)) +
              (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512 *
              SQ_INSTS_VALU_MFMA_MOPS_F16)) / $denom))
            max: MAX(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16)) +
              (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512 *
              SQ_INSTS_VALU_MFMA_MOPS_F16)) / $denom))
            unit: (OPs + $normUnit)
            tips:
          BF16 OPs:
            avg: AVG(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
            min: MIN(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
            max: MAX(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
            unit: (OPs + $normUnit)
            tips:
          F32 OPs:
            avg: AVG((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
              + (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32)) / $denom))
            min: MIN((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
              + (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32)) / $denom))
            max: MAX((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
              + (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32)) / $denom))
            unit: (OPs + $normUnit)
            tips:
          F64 OPs:
            avg: AVG((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
              + (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64)) / $denom))
            min: MIN((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
              + (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64)) / $denom))
            max: MAX((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
              + (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64)) / $denom))
            unit: (OPs + $normUnit)
            tips:
          INT8 OPs:
            avg: AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
            min: MIN(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
            max: MAX(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
            unit: (OPs + $normUnit)
            tips:
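The Speed-of-Light entries in this file pair a measured rate (`value`) with a theoretical `peak` and report their ratio as `pop`. The sketch below works through the VALU FLOPs case in Python. It assumes `$max_sclk` is in MHz and the timestamps are in nanoseconds, so that both sides come out in GFLOP/s. The function names and all numeric inputs are hypothetical, chosen only to illustrate the arithmetic.

```python
def valu_flops(counters):
    """Total VALU floating-point operations, following the `value` expression.

    Each instruction is weighted by 64 (one operation per lane in the
    formula), and FMA instructions count twice.
    """
    total = 0
    for width in ("F16", "F32", "F64"):
        simple = (
            counters[f"SQ_INSTS_VALU_ADD_{width}"]
            + counters[f"SQ_INSTS_VALU_MUL_{width}"]
            + counters[f"SQ_INSTS_VALU_TRANS_{width}"]
        )
        total += 64 * (simple + 2 * counters[f"SQ_INSTS_VALU_FMA_{width}"])
    return total


def peak_valu_rate(max_sclk, cu_per_gpu):
    """The `peak` expression: (((max_sclk * cu_per_gpu) * 64) * 2) / 1000."""
    return (((max_sclk * cu_per_gpu) * 64) * 2) / 1000


def pct_of_peak(value, peak):
    """The `pop` expression: (100 * value) / peak."""
    return (100 * value) / peak


# Illustrative counter values for a single dispatch (assumed, not measured).
counters = {
    f"SQ_INSTS_VALU_{op}_{w}": 0
    for op in ("ADD", "MUL", "TRANS", "FMA")
    for w in ("F16", "F32", "F64")
}
counters.update({"SQ_INSTS_VALU_ADD_F32": 10,
                 "SQ_INSTS_VALU_MUL_F32": 5,
                 "SQ_INSTS_VALU_FMA_F32": 20})

duration_ns = 1000                      # End_Timestamp - Start_Timestamp (assumed ns)
value = valu_flops(counters) / duration_ns
peak = peak_valu_rate(max_sclk=2100, cu_per_gpu=304)
print(value, peak, pct_of_peak(value, peak))
```

The MFMA rows follow the same shape, with `SQ_INSTS_VALU_MFMA_MOPS_* * 512` as the numerator and a per-datatype constant in the peak.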
+316
@@ -0,0 +1,316 @@
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
Panel Config:
  id: 1100
  title: Compute Units - Compute Pipeline
  metrics_description:
    VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
      This is also presented as a percent of the peak theoretical FLOPs achievable
      on the specific accelerator. Note: this does not include any floating-point
      operations from MFMA instructions.'
    VALU IOPs: 'The total integer operations executed per second on the VALU. This
      is also presented as a percent of the peak theoretical IOPs achievable on the
      specific accelerator. Note: this does not include any integer operations from
      MFMA instructions.'
    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
      executed per second. Note: this does not include any 16-bit brain floating point
      operations from VALU instructions. This is also presented as a percent of the
      peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
    MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
      per second. Note: this does not include any 16-bit floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F16 MFMA operations achievable on the specific accelerator.'
    MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
      per second. Note: this does not include any 32-bit floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F32 MFMA operations achievable on the specific accelerator.'
    MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
      per second. Note: this does not include any 64-bit floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F64 MFMA operations achievable on the specific accelerator.'
    MFMA IOPs (INT8): 'The total number of 8-bit integer MFMA operations executed
      per second. Note: this does not include any 8-bit integer operations from VALU
      instructions. This is also presented as a percent of the peak theoretical INT8
      MFMA operations achievable on the specific accelerator.'
    IPC: The ratio of the total number of instructions executed on the CU over the
      total active CU cycles.
    IPC (Issued): The ratio of the total number of (non-internal) instructions issued
      over the number of cycles where the scheduler was actively working on issuing
      instructions.
    SALU Utilization: Indicates what percent of the kernel's duration the SALU was
      busy executing instructions. Computed as the ratio of the total number of cycles
      spent by the scheduler issuing SALU / SMEM instructions over the total CU cycles.
    VALU Utilization: Indicates what percent of the kernel's duration the VALU was
      busy executing instructions. Does not include VMEM operations. Computed as the
      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
      over the total CU cycles.
    VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
      was busy executing instructions, including both global/generic and spill/scratch
      operations (see the VMEM instruction count metrics for more detail). Does not
      include VALU operations. Computed as the ratio of the total number of cycles
      spent by the scheduler issuing VMEM instructions over the total CU cycles.
    Branch Utilization: Indicates what percent of the kernel's duration the branch
      unit was busy executing instructions. Computed as the ratio of the total number
      of cycles spent by the scheduler issuing branch instructions over the total
      CU cycles.
    VALU Active Threads: Indicates the average level of divergence within a wavefront
      over the lifetime of the kernel. The number of work-items that were active in
      a wavefront during execution of each VALU instruction, time-averaged over all
      VALU instructions run on all wavefronts in the kernel.
    MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
      was busy executing instructions. Computed as the ratio of the total number of
      cycles the MFMA unit was busy over the total CU cycles.
    MFMA Instruction Cycles: The average duration of MFMA instructions in this kernel
      in cycles. Computed as the ratio of the total number of cycles the MFMA unit
      was busy over the total number of MFMA instructions.
    VMEM Latency: The average number of round-trip cycles (that is, from issue to
      data return / acknowledgment) required for a VMEM instruction to complete.
    SMEM Latency: The average number of round-trip cycles (that is, from issue to
      data return / acknowledgment) required for a SMEM instruction to complete.
    FLOPs (Total): The total number of floating-point operations executed on either
      the VALU or MFMA units, per normalization unit.
    IOPs (Total): The total number of integer operations executed on either the VALU
      or MFMA units, per normalization unit.
    F16 OPs: The total number of 16-bit floating-point operations executed on either
      the VALU or MFMA units, per normalization unit.
    BF16 OPs: The total number of 16-bit brain floating-point operations executed
      on either the VALU or MFMA units, per normalization unit.
    F32 OPs: The total number of 32-bit floating-point operations executed on either
      the VALU or MFMA units, per normalization unit.
    F64 OPs: The total number of 64-bit floating-point operations executed on either
      the VALU or MFMA units, per normalization unit.
    INT8 OPs: The total number of 8-bit integer operations executed on either the
      VALU or MFMA units, per normalization unit.
  data source:
    - metric_table:
        id: 1101
        title: Compute Speed-of-Light
        header:
          metric: Metric
          value: Avg
          unit: Unit
          peak: Peak
          pop: Pct of Peak
        metric:
          VALU FLOPs:
            value: AVG(((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) +
              SQ_INSTS_VALU_TRANS_F16) + (2 * SQ_INSTS_VALU_FMA_F16))) + (64 * (((SQ_INSTS_VALU_ADD_F32
              + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (2 * SQ_INSTS_VALU_FMA_F32))))
              + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
              + (2 * SQ_INSTS_VALU_FMA_F64)))) / (End_Timestamp - Start_Timestamp)))
            unit: GFLOP
            peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
            pop: ((100 * AVG(((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16)
              + SQ_INSTS_VALU_TRANS_F16) + (2 * SQ_INSTS_VALU_FMA_F16))) + (64 * (((SQ_INSTS_VALU_ADD_F32
              + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (2 * SQ_INSTS_VALU_FMA_F32))))
              + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
              + (2 * SQ_INSTS_VALU_FMA_F64)))) / (End_Timestamp - Start_Timestamp))))
              / (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000))
          VALU IOPs:
            value: AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) / (End_Timestamp
              - Start_Timestamp)))
            unit: GIOP
            peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
            pop: ((100 * AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) / (End_Timestamp
              - Start_Timestamp)))) / (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000))
          MFMA FLOPs (BF16):
            value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp)))
            unit: GFLOP
            peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
            pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp
              - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 1024) / 1000))
          MFMA FLOPs (F16):
            value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp)))
            unit: GFLOP
            peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
            pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp -
              Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 1024) / 1000))
          MFMA FLOPs (F32):
            value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) / (End_Timestamp - Start_Timestamp)))
            unit: GFLOP
            peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
            pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) / (End_Timestamp -
              Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 256) / 1000))
          MFMA FLOPs (F64):
            value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp)))
            unit: GFLOP
            peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
            pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp -
              Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 256) / 1000))
          MFMA IOPs (INT8):
            value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp)))
            unit: GIOP
            peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
            pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp -
              Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 1024) / 1000))
    - metric_table:
        id: 1102
        title: Pipeline Statistics
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
        metric:
          IPC:
            avg: AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))
            min: MIN((SQ_INSTS / SQ_BUSY_CU_CYCLES))
            max: MAX((SQ_INSTS / SQ_BUSY_CU_CYCLES))
            unit: Instr/cycle
          IPC (Issued):
            avg: AVG(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
              + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
              / SQ_ACTIVE_INST_ANY))
            min: MIN(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
              + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
              / SQ_ACTIVE_INST_ANY))
            max: MAX(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
              + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
              / SQ_ACTIVE_INST_ANY))
            unit: Instr/cycle
          SALU Utilization:
            avg: AVG((((100 * SQ_ACTIVE_INST_SCA) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            min: MIN((((100 * SQ_ACTIVE_INST_SCA) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            max: MAX((((100 * SQ_ACTIVE_INST_SCA) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            unit: pct
          VALU Utilization:
            avg: AVG((((100 * SQ_ACTIVE_INST_VALU) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            min: MIN((((100 * SQ_ACTIVE_INST_VALU) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            max: MAX((((100 * SQ_ACTIVE_INST_VALU) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            unit: pct
          VMEM Utilization:
            avg: AVG((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD)
              / $cu_per_gpu))
            min: MIN((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD)
              / $cu_per_gpu))
            max: MAX((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD)
              / $cu_per_gpu))
            unit: pct
          Branch Utilization:
            avg: AVG((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            min: MIN((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            max: MAX((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            unit: pct
          VALU Active Threads:
            avg: AVG(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
              != 0) else None))
            min: MIN(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
              != 0) else None))
            max: MAX(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
              != 0) else None))
            unit: Threads
          MFMA Utilization:
            avg: AVG(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / ((4 * $cu_per_gpu) * $GRBM_GUI_ACTIVE_PER_XCD)))
            min: MIN(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / ((4 * $cu_per_gpu) * $GRBM_GUI_ACTIVE_PER_XCD)))
            max: MAX(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / ((4 * $cu_per_gpu) * $GRBM_GUI_ACTIVE_PER_XCD)))
            unit: pct
          MFMA Instruction Cycles:
            avg: AVG(((SQ_VALU_MFMA_BUSY_CYCLES / SQ_INSTS_MFMA) if (SQ_INSTS_MFMA !=
              0) else None))
            min: MIN(((SQ_VALU_MFMA_BUSY_CYCLES / SQ_INSTS_MFMA) if (SQ_INSTS_MFMA !=
              0) else None))
            max: MAX(((SQ_VALU_MFMA_BUSY_CYCLES / SQ_INSTS_MFMA) if (SQ_INSTS_MFMA !=
              0) else None))
            unit: cycles/instr
          VMEM Latency:
            avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_VMEM) if (SQ_INSTS_VMEM != 0)
              else None))
            min: MIN(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_VMEM) if (SQ_INSTS_VMEM != 0)
              else None))
            max: MAX(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_VMEM) if (SQ_INSTS_VMEM != 0)
              else None))
            unit: Cycles
            coll_level: SQ_INST_LEVEL_VMEM
          SMEM Latency:
            avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_SMEM) if (SQ_INSTS_SMEM != 0)
              else None))
            min: MIN(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_SMEM) if (SQ_INSTS_SMEM != 0)
              else None))
            max: MAX(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_SMEM) if (SQ_INSTS_SMEM != 0)
              else None))
            unit: Cycles
            coll_level: SQ_INST_LEVEL_SMEM
    - metric_table:
        id: 1103
        title: Arithmetic Operations
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
        metric:
          FLOPs (Total):
            avg: AVG((((((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) +
              SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_FMA_F16 * 2))) + ((512 * SQ_INSTS_VALU_MFMA_MOPS_F16)
              + (512 * SQ_INSTS_VALU_MFMA_MOPS_BF16))) + (64 * (((SQ_INSTS_VALU_ADD_F32
              + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_FMA_F32
              * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32)) + (64 * (((SQ_INSTS_VALU_ADD_F64
              + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64) + (SQ_INSTS_VALU_FMA_F64
              * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64)) / $denom))
            min: MIN((((((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) +
              SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_FMA_F16 * 2))) + ((512 * SQ_INSTS_VALU_MFMA_MOPS_F16)
              + (512 * SQ_INSTS_VALU_MFMA_MOPS_BF16))) + (64 * (((SQ_INSTS_VALU_ADD_F32
              + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_FMA_F32
              * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32)) + (64 * (((SQ_INSTS_VALU_ADD_F64
              + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64) + (SQ_INSTS_VALU_FMA_F64
              * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64)) / $denom))
            max: MAX((((((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) +
              SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_FMA_F16 * 2))) + ((512 * SQ_INSTS_VALU_MFMA_MOPS_F16)
              + (512 * SQ_INSTS_VALU_MFMA_MOPS_BF16))) + (64 * (((SQ_INSTS_VALU_ADD_F32
              + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_FMA_F32
              * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32)) + (64 * (((SQ_INSTS_VALU_ADD_F64
              + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64) + (SQ_INSTS_VALU_FMA_F64
              * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64)) / $denom))
            unit: (OPs + $normUnit)
          IOPs (Total):
            avg: AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8
              * 512)) / $denom)
            min: MIN(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8
              * 512)) / $denom)
            max: MAX(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8
              * 512)) / $denom)
            unit: (OPs + $normUnit)
          F16 OPs:
            avg: AVG(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16))
              + (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512
              * SQ_INSTS_VALU_MFMA_MOPS_F16)) / $denom))
            min: MIN(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16))
              + (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512
              * SQ_INSTS_VALU_MFMA_MOPS_F16)) / $denom))
            max: MAX(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16))
              + (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512
              * SQ_INSTS_VALU_MFMA_MOPS_F16)) / $denom))
            unit: (OPs + $normUnit)
          BF16 OPs:
            avg: AVG(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
            min: MIN(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
            max: MAX(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
            unit: (OPs + $normUnit)
          F32 OPs:
            avg: AVG((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
              + (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
              / $denom))
            min: MIN((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
              + (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
              / $denom))
            max: MAX((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
              + (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
              / $denom))
            unit: (OPs + $normUnit)
          F64 OPs:
            avg: AVG((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
              + (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
              / $denom))
            min: MIN((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
              + (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
              / $denom))
            max: MAX((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
              + (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
              / $denom))
            unit: (OPs + $normUnit)
          INT8 OPs:
            avg: AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
            min: MIN(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
            max: MAX(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
            unit: (OPs + $normUnit)
-118
@@ -1,118 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 1200
|
||||
title: Local Data Share (LDS)
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1201
|
||||
title: Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Utilization:
|
||||
value: AVG(((100 * SQ_LDS_IDX_ACTIVE) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
Access Rate:
|
||||
value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
Theoretical Bandwidth:
|
||||
value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
Bank Conflict Rate:
|
||||
value: AVG((((SQ_LDS_BANK_CONFLICT * 3.125) / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
comparable: false # for now
|
||||
cli_style: simple_bar
|
||||
|
||||
- metric_table:
|
||||
id: 1202
|
||||
title: LDS Stats
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
LDS Instrs:
|
||||
avg: AVG((SQ_INSTS_LDS / $denom))
|
||||
min: MIN((SQ_INSTS_LDS / $denom))
|
||||
max: MAX((SQ_INSTS_LDS / $denom))
|
||||
unit: (Instr + $normUnit)
|
||||
tips:
|
||||
Theoretical Bandwidth:
|
||||
avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ $denom))
|
||||
min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ $denom))
|
||||
max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
tips:
|
||||
LDS Latency:
|
||||
avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None))
|
||||
min: MIN(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None))
|
||||
max: MAX(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None))
|
||||
unit: Cycles
|
||||
coll_level: SQ_INST_LEVEL_LDS
|
||||
tips:
|
||||
Bank Conflicts/Access:
|
||||
avg: AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
|
||||
min: MIN(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
|
||||
max: MAX(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
|
||||
unit: Conflicts/Access
|
||||
tips:
|
||||
Index Accesses:
|
||||
avg: AVG((SQ_LDS_IDX_ACTIVE / $denom))
|
||||
min: MIN((SQ_LDS_IDX_ACTIVE / $denom))
|
||||
max: MAX((SQ_LDS_IDX_ACTIVE / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
Atomic Return Cycles:
|
||||
avg: AVG((SQ_LDS_ATOMIC_RETURN / $denom))
|
||||
min: MIN((SQ_LDS_ATOMIC_RETURN / $denom))
|
||||
max: MAX((SQ_LDS_ATOMIC_RETURN / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
Bank Conflict:
|
||||
avg: AVG((SQ_LDS_BANK_CONFLICT / $denom))
|
||||
min: MIN((SQ_LDS_BANK_CONFLICT / $denom))
|
||||
max: MAX((SQ_LDS_BANK_CONFLICT / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
Addr Conflict:
|
||||
avg: AVG((SQ_LDS_ADDR_CONFLICT / $denom))
|
||||
min: MIN((SQ_LDS_ADDR_CONFLICT / $denom))
|
||||
max: MAX((SQ_LDS_ADDR_CONFLICT / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
Unaligned Stall:
|
||||
avg: AVG((SQ_LDS_UNALIGNED_STALL / $denom))
|
||||
min: MIN((SQ_LDS_UNALIGNED_STALL / $denom))
|
||||
max: MAX((SQ_LDS_UNALIGNED_STALL / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
Mem Violations:
|
||||
avg: AVG((SQ_LDS_MEM_VIOLATIONS / $denom))
|
||||
min: MIN((SQ_LDS_MEM_VIOLATIONS / $denom))
|
||||
max: MAX((SQ_LDS_MEM_VIOLATIONS / $denom))
|
||||
unit: (Accesses + $normUnit)
|
||||
tips:
|
||||
+141
@@ -0,0 +1,141 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 1200
|
||||
title: Local Data Share (LDS)
|
||||
metrics_description:
|
||||
Utilization: Indicates what percent of the kernel's duration the LDS was actively
|
||||
executing instructions (including, but not limited to, load, store, atomic and
|
||||
HIP's __shfl operations). Calculated as the ratio of the total number of cycles
|
||||
LDS was active over the total CU cycles.
|
||||
Access Rate: Indicates the percentage of SIMDs in the VALU actively issuing LDS
|
||||
instructions, averaged over the lifetime of the kernel. Calculated as the ratio
|
||||
of the total number of cycles spent by the scheduler issuing LDS instructions
|
||||
over the total CU cycles.
|
||||
Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
|
||||
loaded from, stored to, or atomically updated in the LDS per normalization unit.
|
||||
Does not take into account the execution mask of the wavefront when the instruction
|
||||
was executed.
|
||||
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
|
||||
servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing
|
||||
bank conflicts over the number of LDS cycles that would have been required to
|
||||
move the same amount of data in an uncontended access.
|
||||
LDS Instructions: The total number of LDS instructions (including, but not limited
|
||||
to, read/write/atomics and HIP's __shfl instructions) executed per normalization
|
||||
unit.
|
||||
LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
|
||||
/ acknowledgment) required for an LDS instruction to complete.
|
||||
Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS scheduler
|
||||
due to bank conflicts (as determined by the conflict resolution hardware) to
|
||||
the base number of cycles that would be spent in the LDS scheduler in a completely
|
||||
uncontended case. This is the unnormalized form of the Bank Conflict Rate.
|
||||
Index Accesses: The total number of cycles spent in the LDS scheduler over all
|
||||
operations per normalization unit.
|
||||
Atomic Return Cycles: The total number of cycles spent on LDS atomics with return
|
||||
per normalization unit.
|
||||
Bank Conflict: The total number of cycles spent in the LDS scheduler due to bank
|
||||
conflicts (as determined by the conflict resolution hardware) per normalization
|
||||
unit.
|
||||
Addr Conflict: The total number of cycles spent in the LDS scheduler due to address
|
||||
conflicts (as determined by the conflict resolution hardware) per normalization
|
||||
unit.
|
||||
Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
|
||||
stalls from non-dword aligned addresses per normalization unit.
|
||||
Mem Violations: "The total number of out-of-bounds accesses made to the LDS, per\
|
||||
\ normalization unit. This is unused and expected to be zero in most configurations\
|
||||
\ for modern CDNA\u2122 accelerators."
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1201
|
||||
title: LDS Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
metric:
|
||||
Utilization:
|
||||
value: AVG(((100 * SQ_LDS_IDX_ACTIVE) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: Pct of Peak
|
||||
Access Rate:
|
||||
value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: Pct of Peak
|
||||
Theoretical Bandwidth:
|
||||
value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
|
||||
unit: Pct of Peak
|
||||
Bank Conflict Rate:
|
||||
value: AVG((((SQ_LDS_BANK_CONFLICT * 3.125) / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
|
||||
unit: Pct of Peak
|
||||
comparable: false
|
||||
cli_style: simple_bar
|
||||
tui_style: simple_bar
|
||||
- metric_table:
|
||||
id: 1202
|
||||
title: LDS Statistics
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
LDS Instructions:
|
||||
avg: AVG((SQ_INSTS_LDS / $denom))
|
||||
min: MIN((SQ_INSTS_LDS / $denom))
|
||||
max: MAX((SQ_INSTS_LDS / $denom))
|
||||
unit: (Instr + $normUnit)
|
||||
Theoretical Bandwidth:
|
||||
avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ $denom))
|
||||
min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ $denom))
|
||||
max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
LDS Latency:
|
||||
avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
|
||||
None))
|
||||
min: MIN(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
|
||||
None))
|
||||
max: MAX(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
|
||||
None))
|
||||
unit: Cycles
|
||||
coll_level: SQ_INST_LEVEL_LDS
|
||||
Bank Conflicts/Access:
|
||||
avg: AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
|
||||
min: MIN(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
|
||||
max: MAX(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
|
||||
unit: Conflicts/Access
|
||||
Index Accesses:
|
||||
avg: AVG((SQ_LDS_IDX_ACTIVE / $denom))
|
||||
min: MIN((SQ_LDS_IDX_ACTIVE / $denom))
|
||||
max: MAX((SQ_LDS_IDX_ACTIVE / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
Atomic Return Cycles:
|
||||
avg: AVG((SQ_LDS_ATOMIC_RETURN / $denom))
|
||||
min: MIN((SQ_LDS_ATOMIC_RETURN / $denom))
|
||||
max: MAX((SQ_LDS_ATOMIC_RETURN / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
Bank Conflict:
|
||||
avg: AVG((SQ_LDS_BANK_CONFLICT / $denom))
|
||||
min: MIN((SQ_LDS_BANK_CONFLICT / $denom))
|
||||
max: MAX((SQ_LDS_BANK_CONFLICT / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
Addr Conflict:
|
||||
avg: AVG((SQ_LDS_ADDR_CONFLICT / $denom))
|
||||
min: MIN((SQ_LDS_ADDR_CONFLICT / $denom))
|
||||
max: MAX((SQ_LDS_ADDR_CONFLICT / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
Unaligned Stall:
|
||||
avg: AVG((SQ_LDS_UNALIGNED_STALL / $denom))
|
||||
min: MIN((SQ_LDS_UNALIGNED_STALL / $denom))
|
||||
max: MAX((SQ_LDS_UNALIGNED_STALL / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
Mem Violations:
|
||||
avg: AVG((SQ_LDS_MEM_VIOLATIONS / $denom))
|
||||
min: MIN((SQ_LDS_MEM_VIOLATIONS / $denom))
|
||||
max: MAX((SQ_LDS_MEM_VIOLATIONS / $denom))
|
||||
unit: (Accesses + $normUnit)
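Reviewer note: the guarded expressions in this panel (`... if (denominator != 0) else None`) can be read as the sketch below, using the Bank Conflict Rate formula from table 1201. The function names are hypothetical, and the way `AVG` treats `None` entries is an assumption made for illustration; the profiler's actual aggregation is not shown in this diff.

```python
from statistics import mean
from typing import Optional, Sequence


def bank_conflict_rate(bank_conflict: float, idx_active: float) -> Optional[float]:
    """Per-kernel value of the 'Bank Conflict Rate' expression.

    Mirrors: (SQ_LDS_BANK_CONFLICT * 3.125) / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT),
    returning None when the denominator is zero.
    """
    uncontended = idx_active - bank_conflict
    if uncontended == 0:
        return None
    return (bank_conflict * 3.125) / uncontended


def avg_ignoring_none(values: Sequence[Optional[float]]) -> Optional[float]:
    """Assumed AVG behaviour: skip None entries, return None if nothing is left."""
    present = [v for v in values if v is not None]
    return mean(present) if present else None


if __name__ == "__main__":
    # Hypothetical counter values for three kernel dispatches.
    samples = [(8, 40), (5, 5), (2, 30.571428571428573)]
    print(avg_ignoring_none([bank_conflict_rate(c, a) for c, a in samples]))
```

Returning `None` rather than `0` keeps dispatches with no LDS activity from pulling the average down, which is presumably why the config guards the division.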
|
||||
-105
@@ -1,105 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 1300
|
||||
title: Instruction Cache
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1301
|
||||
title: Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Bandwidth:
|
||||
value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu)
|
||||
* (End_Timestamp - Start_Timestamp))))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
Cache Hit Rate:
|
||||
value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
|
||||
+ SQC_ICACHE_MISSES_DUPLICATE)))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
L1I-L2 Bandwidth:
|
||||
value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu)
|
||||
* (End_Timestamp - Start_Timestamp))))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
comparable: false # for now
|
||||
cli_style: simple_bar
|
||||
|
||||
- metric_table:
|
||||
id: 1302
|
||||
title: Instruction Cache Accesses
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Req:
|
||||
avg: AVG((SQC_ICACHE_REQ / $denom))
|
||||
min: MIN((SQC_ICACHE_REQ / $denom))
|
||||
max: MAX((SQC_ICACHE_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Hits:
|
||||
avg: AVG((SQC_ICACHE_HITS / $denom))
|
||||
min: MIN((SQC_ICACHE_HITS / $denom))
|
||||
max: MAX((SQC_ICACHE_HITS / $denom))
|
||||
unit: (Hits + $normUnit)
|
||||
tips:
|
||||
Misses - Non Duplicated:
|
||||
avg: AVG((SQC_ICACHE_MISSES / $denom))
|
||||
min: MIN((SQC_ICACHE_MISSES / $denom))
|
||||
max: MAX((SQC_ICACHE_MISSES / $denom))
|
||||
unit: (Misses + $normUnit)
|
||||
tips:
|
||||
Misses - Duplicated:
|
||||
avg: AVG((SQC_ICACHE_MISSES_DUPLICATE / $denom))
|
||||
min: MIN((SQC_ICACHE_MISSES_DUPLICATE / $denom))
|
||||
max: MAX((SQC_ICACHE_MISSES_DUPLICATE / $denom))
|
||||
unit: (Misses + $normUnit)
|
||||
tips:
|
||||
Cache Hit Rate:
|
||||
avg: AVG(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
|
||||
+ SQC_ICACHE_MISSES_DUPLICATE)))
|
||||
min: MIN(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
|
||||
+ SQC_ICACHE_MISSES_DUPLICATE)))
|
||||
max: MAX(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
|
||||
+ SQC_ICACHE_MISSES_DUPLICATE)))
|
||||
unit: pct
|
||||
tips:
|
||||
Instruction Fetch Latency:
|
||||
avg: AVG((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
|
||||
min: MIN((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
|
||||
max: MAX((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
|
||||
unit: Cycles
|
||||
coll_level: SQ_IFETCH_LEVEL
|
||||
tips:
|
||||
- metric_table:
|
||||
id: 1303
|
||||
title: Instruction Cache - L2 Interface
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
L1I-L2 Bandwidth:
|
||||
avg: AVG(((SQC_TC_INST_REQ * 64) / $denom))
|
||||
min: MIN(((SQC_TC_INST_REQ * 64) / $denom))
|
||||
max: MAX(((SQC_TC_INST_REQ * 64) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
tips:
|
||||
+106
@@ -0,0 +1,106 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 1300
|
||||
title: Instruction Cache
|
||||
metrics_description:
|
||||
Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the
|
||||
peak theoretical bandwidth. Calculated as the ratio of L1I requests over the
|
||||
total L1I cycles.
|
||||
Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
|
||||
loaded line in the cache. Calculated as the ratio of the number of L1I requests
|
||||
that hit over the number of all L1I requests.
|
||||
L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\
|
||||
\ bandwidth achieved. Calculated as the ratio of the total number of requests\
|
||||
\ from the L1I to the L2 cache over the total L1I-L2 interface cycles."
|
||||
Req: The total number of requests made to the L1I per normalization-unit.
|
||||
Hits: The total number of L1I requests that hit on a previously loaded cache line,
|
||||
per normalization-unit.
|
||||
Misses - Non Duplicated: The total number of L1I requests that missed on a cache
|
||||
line that was not already pending due to another request, per normalization-unit.
|
||||
Misses - Duplicated: The total number of L1I requests that missed on a cache line
|
||||
that was already pending due to another request, per normalization-unit.
|
||||
Instruction Fetch Latency: The average number of cycles spent to fetch instructions
|
||||
to a CU.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1301
|
||||
title: L1I Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
metric:
|
||||
Bandwidth:
|
||||
value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
|
||||
- Start_Timestamp))))
|
||||
unit: Pct of Peak
|
||||
Cache Hit Rate:
|
||||
value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
|
||||
+ SQC_ICACHE_MISSES_DUPLICATE)))
|
||||
unit: Pct of Peak
|
||||
L1I-L2 Bandwidth:
|
||||
value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu)
|
||||
* (End_Timestamp - Start_Timestamp))))
|
||||
unit: Pct of Peak
|
||||
comparable: false
|
||||
cli_style: simple_bar
|
||||
tui_style: simple_bar
|
||||
- metric_table:
|
||||
id: 1302
|
||||
title: L1I cache accesses
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Req:
|
||||
avg: AVG((SQC_ICACHE_REQ / $denom))
|
||||
min: MIN((SQC_ICACHE_REQ / $denom))
|
||||
max: MAX((SQC_ICACHE_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Hits:
|
||||
avg: AVG((SQC_ICACHE_HITS / $denom))
|
||||
min: MIN((SQC_ICACHE_HITS / $denom))
|
||||
max: MAX((SQC_ICACHE_HITS / $denom))
|
||||
unit: (Hits + $normUnit)
|
||||
Misses - Non Duplicated:
|
||||
avg: AVG((SQC_ICACHE_MISSES / $denom))
|
||||
min: MIN((SQC_ICACHE_MISSES / $denom))
|
||||
max: MAX((SQC_ICACHE_MISSES / $denom))
|
||||
unit: (Misses + $normUnit)
|
||||
Misses - Duplicated:
|
||||
avg: AVG((SQC_ICACHE_MISSES_DUPLICATE / $denom))
|
||||
min: MIN((SQC_ICACHE_MISSES_DUPLICATE / $denom))
|
||||
max: MAX((SQC_ICACHE_MISSES_DUPLICATE / $denom))
|
||||
unit: (Misses + $normUnit)
|
||||
Cache Hit Rate:
|
||||
avg: AVG(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
|
||||
+ SQC_ICACHE_MISSES_DUPLICATE)))
|
||||
min: MIN(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
|
||||
+ SQC_ICACHE_MISSES_DUPLICATE)))
|
||||
max: MAX(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
|
||||
+ SQC_ICACHE_MISSES_DUPLICATE)))
|
||||
unit: pct
|
||||
Instruction Fetch Latency:
|
||||
avg: AVG((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
|
||||
min: MIN((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
|
||||
max: MAX((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
|
||||
unit: Cycles
|
||||
coll_level: SQ_IFETCH_LEVEL
|
||||
- metric_table:
|
||||
id: 1303
|
||||
title: L1I <-> L2 interface
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
L1I-L2 Bandwidth:
|
||||
avg: AVG(((SQC_TC_INST_REQ * 64) / $denom))
|
||||
min: MIN(((SQC_TC_INST_REQ * 64) / $denom))
|
||||
max: MAX(((SQC_TC_INST_REQ * 64) / $denom))
|
||||
unit: (Bytes + $normUnit)
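Reviewer note: the Cache Hit Rate expressions in tables 1301 and 1302 divide by `SQC_ICACHE_HITS + SQC_ICACHE_MISSES + SQC_ICACHE_MISSES_DUPLICATE` without a zero check, whereas the sL1D variant of the same metric guards the denominator. A minimal sketch of the guarded form follows; the function name is hypothetical and the guard is an assumption carried over from the sL1D expression, not something this diff adds to the L1I config.

```python
from typing import Optional


def cache_hit_rate_pct(hits: float, misses: float, duplicate_misses: float) -> Optional[float]:
    """Hit rate in percent: 100 * hits / (hits + misses + duplicate misses).

    Returns None when no requests were counted, instead of dividing by zero.
    """
    total = hits + misses + duplicate_misses
    if total == 0:
        return None
    return (100 * hits) / total


if __name__ == "__main__":
    # Hypothetical counter values for a single kernel dispatch.
    print(cache_hit_rate_pct(90, 8, 2))
    print(cache_hit_rate_pct(0, 0, 0))
```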
|
||||
-171
@@ -1,171 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 1400
|
||||
title: Scalar L1 Data Cache
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1401
|
||||
title: Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Bandwidth:
|
||||
value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu)
|
||||
* (End_Timestamp - Start_Timestamp))))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
Cache Hit Rate:
|
||||
value: AVG((((SQC_DCACHE_HITS * 100) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES + SQC_DCACHE_MISSES_DUPLICATE))
|
||||
if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
sL1D-L2 BW:
|
||||
value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 100000)
|
||||
/ (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
comparable: false # for now
|
||||
cli_style: simple_bar
|
||||
|
||||
- metric_table:
|
||||
id: 1402
|
||||
title: Scalar L1D Cache Accesses
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Req:
|
||||
avg: AVG((SQC_DCACHE_REQ / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Hits:
|
||||
avg: AVG((SQC_DCACHE_HITS / $denom))
|
||||
min: MIN((SQC_DCACHE_HITS / $denom))
|
||||
max: MAX((SQC_DCACHE_HITS / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Misses - Non Duplicated:
|
||||
avg: AVG((SQC_DCACHE_MISSES / $denom))
|
||||
min: MIN((SQC_DCACHE_MISSES / $denom))
|
||||
max: MAX((SQC_DCACHE_MISSES / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Misses- Duplicated:
|
||||
avg: AVG((SQC_DCACHE_MISSES_DUPLICATE / $denom))
|
||||
min: MIN((SQC_DCACHE_MISSES_DUPLICATE / $denom))
|
||||
max: MAX((SQC_DCACHE_MISSES_DUPLICATE / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Cache Hit Rate:
|
||||
avg: AVG((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
|
||||
min: MIN((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
|
||||
max: MAX((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
Read Req (Total):
|
||||
avg: AVG((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
|
||||
+ SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
|
||||
min: MIN((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
|
||||
+ SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
|
||||
max: MAX((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
|
||||
+ SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Atomic Req:
|
||||
avg: AVG((SQC_DCACHE_ATOMIC / $denom))
|
||||
min: MIN((SQC_DCACHE_ATOMIC / $denom))
|
||||
max: MAX((SQC_DCACHE_ATOMIC / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Read Req (1 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_1 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_1 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_1 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Read Req (2 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_2 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_2 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_2 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Read Req (4 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_4 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_4 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_4 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Read Req (8 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_8 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_8 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_8 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Read Req (16 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_16 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_16 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_16 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 1403
|
||||
title: Scalar L1D Cache - L2 Interface
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
sL1D-L2 BW:
|
||||
avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 64)) / $denom))
|
||||
min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 64)) / $denom))
|
||||
max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 64)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
tips:
|
||||
Read Req:
|
||||
avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_READ_REQ / $denom))
|
||||
max: MAX((SQC_TC_DATA_READ_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Write Req:
|
||||
avg: AVG((SQC_TC_DATA_WRITE_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_WRITE_REQ / $denom))
|
||||
max: MAX((SQC_TC_DATA_WRITE_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Atomic Req:
|
||||
avg: AVG((SQC_TC_DATA_ATOMIC_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_ATOMIC_REQ / $denom))
|
||||
max: MAX((SQC_TC_DATA_ATOMIC_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Stall Cycles:
|
||||
avg: AVG((SQC_TC_STALL / $denom))
|
||||
min: MIN((SQC_TC_STALL / $denom))
|
||||
max: MAX((SQC_TC_STALL / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
+186
@@ -0,0 +1,186 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 1400
|
||||
title: Scalar L1 Data Cache
|
||||
metrics_description:
|
||||
Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the
|
||||
peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the
|
||||
total sL1D cycles.
|
||||
Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
|
||||
loaded line in the cache. The ratio of the number of sL1D requests that hit over
|
||||
the number of all sL1D requests.
|
||||
sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
|
||||
\ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\
|
||||
\ writes and atomics are typically unused on current CDNA accelerators, so in\
|
||||
\ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
|
||||
Req: The total number of requests, of any size or type, made to the sL1D per normalization
|
||||
unit.
|
||||
Hits: The total number of sL1D requests that hit on a previously loaded cache
|
||||
line, per normalization unit.
|
||||
Misses - Non Duplicated: 'The total number of sL1D requests that missed on a cache
|
||||
line that was not already pending due to another request, per normalization
|
||||
unit. '
|
||||
Misses- Duplicated: The total number of sL1D requests that missed on a cache line
|
||||
that was already pending due to another request, per normalization unit.
|
||||
Read Req (Total): The total number of sL1D read requests of any size, per normalization
|
||||
unit.
|
||||
Atomic Req: The total number of atomic requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
Read Req (1 DWord): The total number of sL1D read requests made for a single dword
|
||||
of data (4B), per normalization unit.
|
||||
Read Req (2 DWord): The total number of sL1D read requests made for two dwords
|
||||
of data (8B), per normalization unit.
|
||||
Read Req (4 DWord): The total number of sL1D read requests made for four dwords
|
||||
of data (16B), per normalization unit.
|
||||
Read Req (8 DWord): The total number of sL1D read requests made for eight dwords
|
||||
of data (32B), per normalization unit.
|
||||
Read Req (16 DWord): The total number of sL1D read requests made for sixteen
|
||||
dwords of data (64B), per normalization unit.
|
||||
Read Req: The total number of read requests from sL1D to the L2 per normalization
|
||||
unit.
|
||||
Write Req: The total number of write requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
Stall Cycles: "The total number of cycles the sL1D\u2194L2 interface was stalled,\
|
||||
\ per normalization unit."
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1401
|
||||
title: Scalar L1D Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
metric:
|
||||
Bandwidth:
|
||||
value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
|
||||
- Start_Timestamp))))
|
||||
unit: Pct of Peak
|
||||
Cache Hit Rate:
|
||||
value: AVG((((SQC_DCACHE_HITS * 100) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
|
||||
unit: Pct of Peak
|
||||
sL1D-L2 BW:
|
||||
value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
|
||||
* 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))
|
||||
unit: Pct of Peak
|
||||
comparable: false
|
||||
cli_style: simple_bar
|
||||
tui_style: simple_bar
|
||||
- metric_table:
|
||||
id: 1402
|
||||
title: Scalar L1D cache accesses
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Req:
|
||||
avg: AVG((SQC_DCACHE_REQ / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Hits:
|
||||
avg: AVG((SQC_DCACHE_HITS / $denom))
|
||||
min: MIN((SQC_DCACHE_HITS / $denom))
|
||||
max: MAX((SQC_DCACHE_HITS / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Misses - Non Duplicated:
|
||||
avg: AVG((SQC_DCACHE_MISSES / $denom))
|
||||
min: MIN((SQC_DCACHE_MISSES / $denom))
|
||||
max: MAX((SQC_DCACHE_MISSES / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Misses- Duplicated:
|
||||
avg: AVG((SQC_DCACHE_MISSES_DUPLICATE / $denom))
|
||||
min: MIN((SQC_DCACHE_MISSES_DUPLICATE / $denom))
|
||||
max: MAX((SQC_DCACHE_MISSES_DUPLICATE / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Cache Hit Rate:
|
||||
avg: AVG((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
|
||||
min: MIN((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
|
||||
max: MAX((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
|
||||
unit: pct
|
||||
Read Req (Total):
|
||||
avg: AVG((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
|
||||
+ SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
|
||||
min: MIN((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
|
||||
+ SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
|
||||
max: MAX((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
|
||||
+ SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Atomic Req:
|
||||
avg: AVG((SQC_DCACHE_ATOMIC / $denom))
|
||||
min: MIN((SQC_DCACHE_ATOMIC / $denom))
|
||||
max: MAX((SQC_DCACHE_ATOMIC / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (1 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_1 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_1 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_1 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (2 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_2 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_2 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_2 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (4 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_4 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_4 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_4 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (8 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_8 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_8 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_8 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (16 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_16 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_16 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_16 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1403
|
||||
title: Scalar L1D Cache - L2 Interface
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
sL1D-L2 BW:
|
||||
avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
|
||||
* 64)) / $denom))
|
||||
min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
|
||||
* 64)) / $denom))
|
||||
max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
|
||||
* 64)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
Read Req:
|
||||
avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_READ_REQ / $denom))
|
||||
max: MAX((SQC_TC_DATA_READ_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Write Req:
|
||||
avg: AVG((SQC_TC_DATA_WRITE_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_WRITE_REQ / $denom))
|
||||
max: MAX((SQC_TC_DATA_WRITE_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Atomic Req:
|
||||
avg: AVG((SQC_TC_DATA_ATOMIC_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_ATOMIC_REQ / $denom))
|
||||
max: MAX((SQC_TC_DATA_ATOMIC_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Stall Cycles:
|
||||
avg: AVG((SQC_TC_STALL / $denom))
|
||||
min: MIN((SQC_TC_STALL / $denom))
|
||||
max: MAX((SQC_TC_STALL / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
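The `Cache Hit Rate` expression in the scalar L1D table above divides hits by the sum of hits, misses, and duplicated misses, and falls back to `None` when that sum is zero. A minimal Python sketch of the same arithmetic follows, assuming the three `SQC_DCACHE_*` counters are already available as plain integers for one dispatch. The function name `scalar_l1d_hit_rate` is hypothetical and is not part of the tool.

```python
from typing import Optional


def scalar_l1d_hit_rate(hits: int, misses: int, misses_duplicate: int) -> Optional[float]:
    """Percent of scalar L1D accesses that hit, or None when there were no accesses.

    Mirrors the shape of the config expression:
    (100 * HITS) / (HITS + MISSES + MISSES_DUPLICATE) if the sum != 0 else None
    """
    total = hits + misses + misses_duplicate
    if total == 0:
        # Same guard as the config expression: avoid dividing by zero.
        return None
    return (100 * hits) / total
```

Returning `None` rather than `0` keeps "no accesses" distinguishable from "every access missed", which matches the `else None` branch in the expression.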
-174
@@ -1,174 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 1500
|
||||
title: Address Processing Unit and Data Return Path (TA/TD)
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1501
|
||||
title: Address Processing Unit
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Address Processing Unit Busy:
|
||||
avg: AVG(((100 * TA_TA_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
min: MIN(((100 * TA_TA_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
max: MAX(((100 * TA_TA_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
tips:
|
||||
Address Stall:
|
||||
avg: AVG(((100 * TA_ADDR_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
min: MIN(((100 * TA_ADDR_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
max: MAX(((100 * TA_ADDR_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
tips:
|
||||
Data Stall:
|
||||
avg: AVG(((100 * TA_DATA_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
min: MIN(((100 * TA_DATA_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
max: MAX(((100 * TA_DATA_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
tips:
|
||||
Data-Processor → Address Stall:
|
||||
avg: AVG(((100 * TA_ADDR_STALLED_BY_TD_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
min: MIN(((100 * TA_ADDR_STALLED_BY_TD_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
max: MAX(((100 * TA_ADDR_STALLED_BY_TD_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
tips:
|
||||
Total Instructions:
|
||||
avg: AVG((TA_TOTAL_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_TOTAL_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_TOTAL_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
tips:
|
||||
Global/Generic Instructions:
|
||||
avg: AVG((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
tips:
|
||||
Global/Generic Read Instructions:
|
||||
avg: AVG((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
tips:
|
||||
Global/Generic Write Instructions:
|
||||
avg: AVG((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
tips:
|
||||
Global/Generic Atomic Instructions:
|
||||
avg: AVG((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
tips:
|
||||
Spill/Stack Instructions:
|
||||
avg: AVG((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
tips:
|
||||
Spill/Stack Read Instructions:
|
||||
avg: AVG((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
tips:
|
||||
Spill/Stack Write Instructions:
|
||||
avg: AVG((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
tips:
|
||||
Spill/Stack Atomic Instructions:
|
||||
avg: AVG((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
tips:
|
||||
Spill/Stack Total Cycles:
|
||||
avg: AVG((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
|
||||
min: MIN((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
|
||||
max: MAX((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
Spill/Stack Coalesced Read:
|
||||
avg: AVG((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
|
||||
min: MIN((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
|
||||
max: MAX((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
Spill/Stack Coalesced Write:
|
||||
avg: AVG((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
|
||||
min: MIN((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
|
||||
max: MAX((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 1502
|
||||
title: Data-Return Path
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Data-Return Busy:
|
||||
avg: AVG(((100 * TD_TD_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
min: MIN(((100 * TD_TD_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
max: MAX(((100 * TD_TD_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
tips:
|
||||
Cache RAM → Data-Return Stall:
|
||||
avg: AVG(((100 * TD_TC_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
min: MIN(((100 * TD_TC_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
max: MAX(((100 * TD_TC_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
tips:
|
||||
Workgroup manager → Data-Return Stall:
|
||||
avg: AVG(((100 * TD_SPI_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
min: MIN(((100 * TD_SPI_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
max: MAX(((100 * TD_SPI_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
tips:
|
||||
Coalescable Instructions:
|
||||
avg: AVG((TD_COALESCABLE_WAVEFRONT_sum / $denom))
|
||||
min: MIN((TD_COALESCABLE_WAVEFRONT_sum / $denom))
|
||||
max: MAX((TD_COALESCABLE_WAVEFRONT_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
tips:
|
||||
Read Instructions:
|
||||
avg: AVG((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
|
||||
/ $denom))
|
||||
min: MIN((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
|
||||
/ $denom))
|
||||
max: MAX((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
|
||||
/ $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
tips:
|
||||
Write Instructions:
|
||||
avg: AVG((TD_STORE_WAVEFRONT_sum / $denom))
|
||||
min: MIN((TD_STORE_WAVEFRONT_sum / $denom))
|
||||
max: MAX((TD_STORE_WAVEFRONT_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
tips:
|
||||
Atomic Instructions:
|
||||
avg: AVG((TD_ATOMIC_WAVEFRONT_sum / $denom))
|
||||
min: MIN((TD_ATOMIC_WAVEFRONT_sum / $denom))
|
||||
max: MAX((TD_ATOMIC_WAVEFRONT_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
tips:
|
||||
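The busy and stall metrics in the panel above all share one shape: `100 * <counter>_sum / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)`. A small sketch of that normalization, assuming the `_sum` counter is a total accumulated over all compute units (an inference from the name, not something the config states). The function name `pct_of_cu_cycles` is hypothetical.

```python
def pct_of_cu_cycles(counter_sum: float, gui_active_per_xcd: float, cu_per_gpu: int) -> float:
    """Express a summed cycle counter as a percentage of total CU cycles.

    counter_sum        -- e.g. a TA_TA_BUSY_sum style value (assumed summed over CUs)
    gui_active_per_xcd -- active GPU cycles, as in $GRBM_GUI_ACTIVE_PER_XCD
    cu_per_gpu         -- number of compute units, as in $cu_per_gpu
    """
    return (100 * counter_sum) / (gui_active_per_xcd * cu_per_gpu)
```

For example, a counter total of 5000 cycles over 10 compute units that were each active for 1000 cycles works out to 50 percent.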
+248
@@ -0,0 +1,248 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 1500
|
||||
title: Address Processing Unit and Data Return Path (TA/TD)
|
||||
metrics_description:
|
||||
Address Processing Unit Busy: Percent of the total CU cycles the address processor
|
||||
was busy.
|
||||
Address Stall: Percent of the total CU cycles the address processor was stalled
|
||||
from sending address requests further into the vL1D pipeline.
|
||||
Data Stall: Percent of the total CU cycles the address processor was stalled from
|
||||
sending write/atomic data further into the vL1D pipeline.
|
||||
"Data-Processor \u2192 Address Stall": Percent of total CU cycles the address
|
||||
processor was stalled waiting to send command data to the data processor.
|
||||
Total Instructions: The total number of memory instructions executed by the address
|
||||
processor over all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Instructions: The total number of global & generic memory instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Read Instructions: The total number of global & generic memory
|
||||
read instructions executed on all compute units on the accelerator, per normalization
|
||||
unit.
|
||||
Global/Generic Write Instructions: The total number of global & generic memory
|
||||
write instructions executed on all compute units on the accelerator, per normalization
|
||||
unit.
|
||||
Global/Generic Atomic Instructions: The total number of global & generic memory
|
||||
atomic (with and without return) instructions executed on all compute units
|
||||
on the accelerator, per normalization unit.
|
||||
Spill/Stack Instructions: The total number of spill/stack memory instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Read Instructions: The total number of spill/stack memory read instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Write Instructions: The total number of spill/stack memory write instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Atomic Instructions: The total number of spill/stack memory atomic
|
||||
(with and without return) instructions executed on all compute units on the
|
||||
accelerator, per normalization unit. Typically unused as these memory operations
|
||||
are generally used to implement thread-local storage.
|
||||
Spill/Stack Total Cycles: The number of cycles the address processing unit spent
|
||||
working on spill/stack instructions, per normalization unit.
|
||||
Spill/Stack Coalesced Read: The number of cycles the address processing unit spent
|
||||
working on coalesced spill/stack read instructions, per normalization unit.
|
||||
Spill/Stack Coalesced Write: The number of cycles the address processing unit
|
||||
spent working on coalesced spill/stack write instructions, per normalization
|
||||
unit.
|
||||
Data-Return Busy: Percent of the total CU cycles the data-return unit was busy
|
||||
processing or waiting on data to return to the CU.
|
||||
"Cache RAM \u2192 Data-Return Stall": Percent of the total CU cycles the data-return
|
||||
unit was stalled on data to be returned from the vL1D Cache RAM.
|
||||
"Workgroup manager \u2192 Data-Return Stall": Percent of the total CU cycles the
|
||||
data-return unit was stalled by the workgroup manager due to initialization
|
||||
of registers as a part of launching new workgroups.
|
||||
Coalescable Instructions: The number of instructions submitted to the data-return
|
||||
unit by the address processor that were found to be coalescable, per normalization
|
||||
unit.
|
||||
Read Instructions: The number of read instructions submitted to the data-return
|
||||
unit by the address processor summed over all compute units on the accelerator,
|
||||
per normalization unit. This is expected to be the sum of global/generic and
|
||||
spill/stack reads in the address processor.
|
||||
Write Instructions: The number of store instructions submitted to the data-return
|
||||
unit by the address processor summed over all compute units on the accelerator,
|
||||
per normalization unit. This is expected to be the sum of global/generic and
|
||||
spill/stack stores in the address processor.
|
||||
Atomic Instructions: The number of atomic instructions submitted to the data-return
|
||||
unit by the address processor summed over all compute units on the accelerator,
|
||||
per normalization unit. This is expected to be the sum of global/generic and
|
||||
spill/stack atomics in the address processor.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1501
|
||||
title: Busy and stall metrics
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Address Processing Unit Busy:
|
||||
avg: AVG(((100 * TA_TA_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
min: MIN(((100 * TA_TA_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
max: MAX(((100 * TA_TA_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
Address Stall:
|
||||
avg: AVG(((100 * TA_ADDR_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
|
||||
* $cu_per_gpu)))
|
||||
min: MIN(((100 * TA_ADDR_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
|
||||
* $cu_per_gpu)))
|
||||
max: MAX(((100 * TA_ADDR_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
|
||||
* $cu_per_gpu)))
|
||||
unit: pct
|
||||
Data Stall:
|
||||
avg: AVG(((100 * TA_DATA_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
|
||||
* $cu_per_gpu)))
|
||||
min: MIN(((100 * TA_DATA_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
|
||||
* $cu_per_gpu)))
|
||||
max: MAX(((100 * TA_DATA_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
|
||||
* $cu_per_gpu)))
|
||||
unit: pct
|
||||
"Data-Processor \u2192 Address Stall":
|
||||
avg: AVG(((100 * TA_ADDR_STALLED_BY_TD_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
|
||||
* $cu_per_gpu)))
|
||||
min: MIN(((100 * TA_ADDR_STALLED_BY_TD_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
|
||||
* $cu_per_gpu)))
|
||||
max: MAX(((100 * TA_ADDR_STALLED_BY_TD_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
|
||||
* $cu_per_gpu)))
|
||||
unit: pct
|
||||
"Sequencer \u2192 TA Address Stall":
|
||||
avg: AVG((SQ_VMEM_TA_ADDR_FIFO_FULL / $denom))
|
||||
min: MIN((SQ_VMEM_TA_ADDR_FIFO_FULL / $denom))
|
||||
max: MAX((SQ_VMEM_TA_ADDR_FIFO_FULL / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
"Sequencer \u2192 TA Command Stall":
|
||||
avg: AVG((SQ_VMEM_TA_CMD_FIFO_FULL / $denom))
|
||||
min: MIN((SQ_VMEM_TA_CMD_FIFO_FULL / $denom))
|
||||
max: MAX((SQ_VMEM_TA_CMD_FIFO_FULL / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
"Sequencer \u2192 TA Data Stall":
|
||||
avg: AVG((SQ_VMEM_WR_TA_DATA_FIFO_FULL / $denom))
|
||||
min: MIN((SQ_VMEM_WR_TA_DATA_FIFO_FULL / $denom))
|
||||
max: MAX((SQ_VMEM_WR_TA_DATA_FIFO_FULL / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
- metric_table:
|
||||
id: 1502
|
||||
title: Instruction counts
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Total Instructions:
|
||||
avg: AVG((TA_TOTAL_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_TOTAL_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_TOTAL_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
Global/Generic Instructions:
|
||||
avg: AVG((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
Global/Generic Read Instructions:
|
||||
avg: AVG((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
Global/Generic Write Instructions:
|
||||
avg: AVG((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
Global/Generic Atomic Instructions:
|
||||
avg: AVG((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
Spill/Stack Instructions:
|
||||
avg: AVG((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
Spill/Stack Read Instructions:
|
||||
avg: AVG((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
Spill/Stack Write Instructions:
|
||||
avg: AVG((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
Spill/Stack Atomic Instructions:
|
||||
avg: AVG((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
- metric_table:
|
||||
id: 1503
|
||||
title: Spill and stack metrics
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Spill/Stack Total Cycles:
|
||||
avg: AVG((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
|
||||
min: MIN((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
|
||||
max: MAX((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
Spill/Stack Coalesced Read:
|
||||
avg: AVG((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
|
||||
min: MIN((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
|
||||
max: MAX((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
Spill/Stack Coalesced Write:
|
||||
avg: AVG((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
|
||||
min: MIN((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
|
||||
max: MAX((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
- metric_table:
|
||||
id: 1504
|
||||
title: Vector L1 data-return path or Texture Data (TD)
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Data-Return Busy:
|
||||
avg: AVG(((100 * TD_TD_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
min: MIN(((100 * TD_TD_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
max: MAX(((100 * TD_TD_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
"Cache RAM \u2192 Data-Return Stall":
|
||||
avg: AVG(((100 * TD_TC_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
min: MIN(((100 * TD_TC_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
max: MAX(((100 * TD_TC_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
"Workgroup manager \u2192 Data-Return Stall":
|
||||
avg: AVG(((100 * TD_SPI_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
min: MIN(((100 * TD_SPI_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
max: MAX(((100 * TD_SPI_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
Coalescable Instructions:
|
||||
avg: AVG((TD_COALESCABLE_WAVEFRONT_sum / $denom))
|
||||
min: MIN((TD_COALESCABLE_WAVEFRONT_sum / $denom))
|
||||
max: MAX((TD_COALESCABLE_WAVEFRONT_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
Read Instructions:
|
||||
avg: AVG((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
|
||||
/ $denom))
|
||||
min: MIN((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
|
||||
/ $denom))
|
||||
max: MAX((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
|
||||
/ $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
Write Instructions:
|
||||
avg: AVG((TD_STORE_WAVEFRONT_sum / $denom))
|
||||
min: MIN((TD_STORE_WAVEFRONT_sum / $denom))
|
||||
max: MAX((TD_STORE_WAVEFRONT_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
Atomic Instructions:
|
||||
avg: AVG((TD_ATOMIC_WAVEFRONT_sum / $denom))
|
||||
min: MIN((TD_ATOMIC_WAVEFRONT_sum / $denom))
|
||||
max: MAX((TD_ATOMIC_WAVEFRONT_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
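In the metric tables above, the `avg`, `min`, and `max` columns of a metric wrap the same inner expression in `AVG(...)`, `MIN(...)`, and `MAX(...)` respectively. Because the file is auto-generated, a mechanical check of that pairing can be useful. The sketch below assumes the `metric:` mapping of one table has already been loaded into nested Python dicts (for instance by a YAML loader). The helper name `find_aggregator_mismatches` is hypothetical and is not part of `utils/split_config.py`.

```python
import re

# Column name -> aggregate function expected as the outermost call.
EXPECTED = {"avg": "AVG", "min": "MIN", "max": "MAX"}


def find_aggregator_mismatches(metrics):
    """Return (metric, column, found) for every column whose outer call is unexpected.

    `metrics` maps metric names to dicts of column -> expression string.
    Columns other than avg/min/max (unit, value, expr, ...) are ignored.
    """
    mismatches = []
    for name, columns in metrics.items():
        for column, expected in EXPECTED.items():
            expr = columns.get(column)
            if not isinstance(expr, str):
                continue  # column absent or not an expression string
            match = re.match(r"\s*([A-Z]+)\(", expr)
            found = match.group(1) if match else None
            if found != expected:
                mismatches.append((name, column, found))
    return mismatches
```

With made-up counter names, `find_aggregator_mismatches({"Example": {"avg": "AVG((X / $denom))", "min": "MIN((X / $denom))", "max": "MIN((X / $denom))"}})` reports the `max` column, since its outermost call is `MIN`.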
-414
@@ -1,414 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 1600
|
||||
title: Vector L1 Data Cache
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1601
|
||||
title: Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Hit rate:
|
||||
value: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
|
||||
None))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
Bandwidth:
|
||||
value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk / 1000) * 64) * $cu_per_gpu))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
Utilization:
|
||||
value: AVG((((TCP_GATE_EN2_sum * 100) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum
|
||||
!= 0) else None))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
Coalescing:
|
||||
value: AVG(((((TA_TOTAL_WAVEFRONTS_sum * 64) * 100) / (TCP_TOTAL_ACCESSES_sum
|
||||
* 4)) if (TCP_TOTAL_ACCESSES_sum != 0) else None))
|
||||
unit: Pct of Peak
|
||||
tips:
|
||||
comparable: false # for now
|
||||
cli_style: simple_bar
|
||||
|
||||
- metric_table:
|
||||
id: 1602
|
||||
title: L1D Cache Stalls (%)
|
||||
header:
|
||||
metric: Metric
|
||||
expr: Expression
|
||||
tips: Tips
|
||||
metric:
|
||||
Stalled on L2 Data:
|
||||
expr:
|
||||
(((100 * TCP_PENDING_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum
|
||||
!= 0) else None)
|
||||
tips:
|
||||
Stalled on L2 Req:
|
||||
expr:
|
||||
(((100 * TCP_TCR_TCP_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum
|
||||
!= 0) else None)
|
||||
tips:
|
||||
Stalled on Address:
|
||||
expr:
|
||||
None
|
||||
tips:
|
||||
Stalled on Data:
|
||||
expr:
|
||||
None
|
||||
tips:
|
||||
Stalled on Latency FIFO:
|
||||
expr:
|
||||
None
|
||||
tips:
|
||||
Stalled on Request FIFO:
|
||||
expr:
|
||||
None
|
||||
tips:
|
||||
Stalled on Read Return:
|
||||
expr:
|
||||
None
|
||||
tips:
|
||||
Tag RAM Stall (Read):
|
||||
expr:
|
||||
(((100 * TCP_READ_TAGCONFLICT_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
|
||||
if (TCP_GATE_EN1_sum != 0) else None)
|
||||
tips:
|
||||
Tag RAM Stall (Write):
|
||||
expr:
|
||||
(((100 * TCP_WRITE_TAGCONFLICT_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
|
||||
if (TCP_GATE_EN1_sum != 0) else None)
|
||||
tips:
|
||||
Tag RAM Stall (Atomic):
|
||||
expr:
|
||||
(((100 * TCP_ATOMIC_TAGCONFLICT_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
|
||||
if (TCP_GATE_EN1_sum != 0) else None)
|
||||
tips:
|
||||
cli_style: simple_box
|
||||
|
||||
- metric_table:
|
||||
id: 1603
|
||||
title: L1D Cache Accesses
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Total Req:
|
||||
avg: AVG((TCP_TOTAL_ACCESSES_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_ACCESSES_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_ACCESSES_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Read Req:
|
||||
avg: AVG((TCP_TOTAL_READ_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_READ_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_READ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Write Req:
|
||||
avg: AVG((TCP_TOTAL_WRITE_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_WRITE_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_WRITE_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Atomic Req:
|
||||
avg: AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
|
||||
/ $denom))
|
||||
min: MIN(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
|
||||
/ $denom))
|
||||
max: MAX(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Cache BW:
|
||||
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
|
||||
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
|
||||
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
tips:
|
||||
Cache Hit Rate:
|
||||
avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) +
|
||||
TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) /
|
||||
TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
|
||||
None))
|
||||
min: MIN(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) +
|
||||
TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) /
|
||||
TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
|
||||
None))
|
||||
max: MAX(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) +
|
||||
TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) /
|
||||
TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
|
||||
None))
|
||||
unit: pct
|
||||
tips:
|
||||
Cache Accesses:
|
||||
avg: AVG((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Cache Hits:
|
||||
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ $denom))
|
||||
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ $denom))
|
||||
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Invalidations:
|
||||
avg: AVG((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
L1-L2 BW:
|
||||
avg: AVG(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
|
||||
min: MIN(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
|
||||
max: MAX(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
tips:
|
||||
L1-L2 Read:
|
||||
avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
L1-L2 Write:
|
||||
avg: AVG((TCP_TCC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
L1-L2 Atomic:
|
||||
avg: AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
/ $denom))
|
||||
min: MIN(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
/ $denom))
|
||||
max: MAX(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
L1 Access Latency:
|
||||
avg: AVG(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
tips:
|
||||
L1-L2 Read Latency:
|
||||
avg: AVG(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
|
||||
if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else None))
|
||||
min: MIN(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
|
||||
if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else None))
|
||||
max: MAX(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
|
||||
if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else None))
|
||||
unit: Cycles
|
||||
tips:
|
||||
L1-L2 Write Latency:
|
||||
avg: AVG(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0) else
|
||||
None))
|
||||
min: MIN(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0) else
|
||||
None))
|
||||
max: MAX(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0) else
|
||||
None))
|
||||
unit: Cycles
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 1604
|
||||
title: L1D - L2 Transactions
|
||||
header:
|
||||
metric: Metric
|
||||
xfer: Xfer
|
||||
coherency: Coherency
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
NC - Read:
|
||||
xfer: Read
|
||||
coherency: NC
|
||||
avg: AVG((TCP_TCC_NC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_NC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_NC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
UC - Read:
|
||||
xfer: Read
|
||||
coherency: UC
|
||||
avg: AVG((TCP_TCC_UC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_UC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_UC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
CC - Read:
|
||||
xfer: Read
|
||||
coherency: CC
|
||||
avg: AVG((TCP_TCC_CC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_CC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_CC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
RW - Read:
|
||||
xfer: Read
|
||||
coherency: RW
|
||||
avg: AVG((TCP_TCC_RW_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_RW_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_RW_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
RW - Write:
|
||||
xfer: Write
|
||||
coherency: RW
|
||||
avg: AVG((TCP_TCC_RW_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_RW_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_RW_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
NC - Write:
|
||||
xfer: Write
|
||||
coherency: NC
|
||||
avg: AVG((TCP_TCC_NC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_NC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_NC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
UC - Write:
|
||||
xfer: Write
|
||||
coherency: UC
|
||||
avg: AVG((TCP_TCC_UC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_UC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_UC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
CC - Write:
|
||||
xfer: Write
|
||||
coherency: CC
|
||||
avg: AVG((TCP_TCC_CC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_CC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_CC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
NC - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: NC
|
||||
avg: AVG((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
UC - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: UC
|
||||
avg: AVG((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
CC - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: CC
|
||||
avg: AVG((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
RW - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: RW
|
||||
avg: AVG((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 1605
|
||||
title: L1D Addr Translation
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
units: Units
|
||||
tips: Tips
|
||||
metric:
|
||||
Req:
|
||||
avg: AVG((TCP_UTCL1_REQUEST_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_REQUEST_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_REQUEST_sum / $denom))
|
||||
units: (Req + $normUnit)
|
||||
tips:
|
||||
Inflight Req:
|
||||
avg: None # Missing perfmon
|
||||
min: None # Missing perfmon
|
||||
max: None # Missing perfmon
|
||||
units: (Req + $normUnit)
|
||||
tips:
|
||||
Hit Ratio:
|
||||
avg: AVG((((100 * TCP_UTCL1_TRANSLATION_HIT_sum) / TCP_UTCL1_REQUEST_sum) if
|
||||
(TCP_UTCL1_REQUEST_sum != 0) else None))
|
||||
min: MIN((((100 * TCP_UTCL1_TRANSLATION_HIT_sum) / TCP_UTCL1_REQUEST_sum) if
|
||||
(TCP_UTCL1_REQUEST_sum != 0) else None))
|
||||
max: MAX((((100 * TCP_UTCL1_TRANSLATION_HIT_sum) / TCP_UTCL1_REQUEST_sum) if
|
||||
(TCP_UTCL1_REQUEST_sum != 0) else None))
|
||||
units: pct
|
||||
tips:
|
||||
Hits:
|
||||
avg: AVG((TCP_UTCL1_TRANSLATION_HIT_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_TRANSLATION_HIT_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_TRANSLATION_HIT_sum / $denom))
|
||||
units: (Req + $normUnit)
|
||||
tips:
|
||||
Translation Misses:
|
||||
avg: AVG((TCP_UTCL1_TRANSLATION_MISS_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_TRANSLATION_MISS_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_TRANSLATION_MISS_sum / $denom))
|
||||
units: (Req + $normUnit)
|
||||
tips:
|
||||
Permission Misses:
|
||||
avg: AVG((TCP_UTCL1_PERMISSION_MISS_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_PERMISSION_MISS_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_PERMISSION_MISS_sum / $denom))
|
||||
units: (Req + $normUnit)
|
||||
tips:
|
||||
- metric_table:
|
||||
id: 1606
|
||||
title: L1D Addr Translation Stalls
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
units: Units
|
||||
tips: Tips
|
||||
metric:
|
||||
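The latency entries in the removed file above all share one shape: a counter sum divided by a request count, guarded so that a zero denominator yields `None`, then wrapped in `AVG`, `MIN` or `MAX`. A minimal Python sketch of that pattern for "L1-L2 Read Latency" follows. The function names are hypothetical, and skipping `None` values during aggregation is an assumption, not confirmed behaviour of the analysis tool.

```python
from typing import Optional, Sequence


def read_latency(latency_sum: float, read_req: float,
                 atomic_ret_req: float) -> Optional[float]:
    """Per-kernel L1-L2 read latency in cycles.

    Mirrors: TCP_TCC_READ_REQ_LATENCY_sum /
             (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum),
    returning None when the request count is zero.
    """
    requests = read_req + atomic_ret_req
    return latency_sum / requests if requests != 0 else None


def summarize(values: Sequence[Optional[float]]) -> dict:
    """AVG/MIN/MAX over per-kernel values.

    Assumption: None entries are skipped rather than treated as zero.
    """
    present = [v for v in values if v is not None]
    if not present:
        return {"avg": None, "min": None, "max": None}
    return {
        "avg": sum(present) / len(present),
        "min": min(present),
        "max": max(present),
    }
```

Calling `summarize([read_latency(1000, 8, 2), read_latency(5, 0, 0)])` would aggregate only the first kernel, because the second has no read requests.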
+442
@@ -0,0 +1,442 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 1600
|
||||
title: Vector L1 Data Cache
|
||||
metrics_description:
|
||||
Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
|
||||
cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM
|
||||
instructions, as a percent of the peak theoretical bandwidth achievable on the
|
||||
specific accelerator. The number of bytes is calculated as the number of cache
|
||||
lines requested multiplied by the cache line size. This value does not consider
|
||||
partial requests, so for instance, if only a single value is requested in a
|
||||
cache line, the data movement will still be counted as a full cache line.
|
||||
Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
|
||||
The number of cycles where the vL1D Cache RAM is actively processing any request
|
||||
divided by the number of cycles where the vL1D is active.
|
||||
Coalescing: Indicates how well memory instructions were coalesced by the address
|
||||
processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
|
||||
as the average number of thread-requests generated per instruction divided by
|
||||
the ideal number of thread-requests per instruction.
|
||||
Stalled on L2 Data: The ratio of the number of cycles where the vL1D is stalled
|
||||
waiting for requested data to return from the L2 cache divided by the number
|
||||
of cycles where the vL1D is active.
|
||||
Stalled on L2 Req: The ratio of the number of cycles where the vL1D is stalled
|
||||
waiting to issue a request for data to the L2 cache divided by the number of
|
||||
cycles where the vL1D is active.
|
||||
Tag RAM Stall (Read): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Read requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Tag RAM Stall (Write): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Write requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Tag RAM Stall (Atomic): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Atomic requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Total Req: The total number of incoming requests from the address processing unit
|
||||
after coalescing per normalization unit.
|
||||
Read Req: The total number of incoming read requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Write Req: The total number of incoming write requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Atomic Req: The total number of incoming atomic requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
|
||||
instructions per normalization unit. The number of bytes is calculated as the
|
||||
number of cache lines requested multiplied by the cache line size. This value
|
||||
does not consider partial requests, so for instance, if only a single value
|
||||
is requested in a cache line, the data movement will still be counted as a full
|
||||
cache line.
|
||||
Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
|
||||
vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
Cache Accesses: The total number of cache line lookups in the vL1D per normalization unit.
|
||||
Cache Hits: The number of cache accesses minus the number of outgoing requests
|
||||
to the L2 cache, that is, the number of cache line requests serviced by the
|
||||
vL1D Cache RAM per normalization unit.
|
||||
Invalidations: The number of times the vL1D was issued a write-back invalidate
|
||||
command during the kernel's execution per normalization unit. This may be triggered
|
||||
by, for instance, the buffer_wbinvl1 instruction.
|
||||
L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
|
||||
of VMEM instructions, per normalization unit. The number of bytes is calculated
|
||||
as the number of cache lines requested multiplied by the cache line size. This
|
||||
value does not consider partial requests, so for instance, if only a single
|
||||
value is requested in a cache line, the data movement will still be counted
|
||||
as a full cache line.
|
||||
L1-L2 Read: The number of read requests for a vL1D cache line that were not satisfied
|
||||
by the vL1D and must be retrieved from the L2 Cache per normalization
|
||||
unit.
|
||||
L1-L2 Write: The number of write requests to a vL1D cache line that were sent
|
||||
through the vL1D to the L2 cache, per normalization unit.
|
||||
L1-L2 Atomic: The number of atomic requests that are sent through the vL1D to
|
||||
the L2 cache, per normalization unit. This includes requests for atomics with
|
||||
and without return values.
|
||||
L1 Access Latency: Calculated as the average number of cycles that a vL1D cache
|
||||
line request spent in the vL1D cache pipeline.
|
||||
L1-L2 Read Latency: Calculated as the average number of cycles that the vL1D cache
|
||||
took to issue read requests to the L2 Cache and receive the responses. This number also
|
||||
includes requests for atomics with return values.
|
||||
L1-L2 Write Latency: Calculated as the average number of cycles that the vL1D
|
||||
cache took to issue and receive acknowledgement of a write request to the L2
|
||||
Cache. This number also includes requests for atomics without return values.
|
||||
NC - Read: Total read requests with NC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
UC - Read: Total read requests with UC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
CC - Read: Total read requests with CC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
RW - Read: Total read requests with RW mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
RW - Write: Total write requests with RW mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
NC - Write: Total write requests with NC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
UC - Write: Total write requests with UC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
CC - Write: Total write requests with CC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
NC - Atomic: Total atomic requests with NC mtype from this TCP to all TCCs, summed
|
||||
over TCP instances per normalization unit.
|
||||
UC - Atomic: Total atomic requests with UC mtype from this TCP to all TCCs, summed
|
||||
over TCP instances per normalization unit.
|
||||
CC - Atomic: Total atomic requests with CC mtype from this TCP to all TCCs, summed
|
||||
over TCP instances per normalization unit.
|
||||
RW - Atomic: Total atomic requests with RW mtype from this TCP to all TCCs, summed
|
||||
over TCP instances per normalization unit.
|
||||
Req: The number of translation requests made to the UTCL1 per normalization unit.
|
||||
Hit Ratio: The ratio of the number of translation requests that hit in the UTCL1
|
||||
divided by the total number of translation requests made to the UTCL1.
|
||||
Hits: The number of translation requests that hit in the UTCL1, and could be reused,
|
||||
per normalization unit.
|
||||
Translation Misses: The total number of translation requests that missed in the
|
||||
UTCL1 due to translation not being present in the cache, per normalization
|
||||
unit.
|
||||
Permission Misses: "The total number of translation requests that missed in the\
|
||||
\ UTCL1 due to a permission error, per normalization unit. This is unused and\
|
||||
\ expected to be zero in most configurations for modern CDNA\u2122 accelerators."
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1601
|
||||
title: vL1D Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
metric:
|
||||
Hit rate:
|
||||
value: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
|
||||
else None))
|
||||
unit: Pct of Peak
|
||||
Bandwidth:
|
||||
value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp
|
||||
- Start_Timestamp)))) / ((($max_sclk / 1000) * 64) * $cu_per_gpu))
|
||||
unit: Pct of Peak
|
||||
Utilization:
|
||||
value: AVG((((TCP_GATE_EN2_sum * 100) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum
|
||||
!= 0) else None))
|
||||
unit: Pct of Peak
|
||||
Coalescing:
|
||||
value: AVG(((((TA_TOTAL_WAVEFRONTS_sum * 64) * 100) / (TCP_TOTAL_ACCESSES_sum
|
||||
* 4)) if (TCP_TOTAL_ACCESSES_sum != 0) else None))
|
||||
unit: Pct of Peak
|
||||
comparable: false
|
||||
cli_style: simple_bar
|
||||
tui_style: simple_bar
|
||||
- metric_table:
|
||||
id: 1602
|
||||
title: vL1D cache stall metrics
|
||||
header:
|
||||
metric: Metric
|
||||
expr: Expression
|
||||
metric:
|
||||
Stalled on L2 Data:
|
||||
expr: (((100 * TCP_PENDING_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum
|
||||
!= 0) else None)
|
||||
Stalled on L2 Req:
|
||||
expr: (((100 * TCP_TCR_TCP_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum
|
||||
!= 0) else None)
|
||||
Tag RAM Stall (Read):
|
||||
expr: (((100 * TCP_READ_TAGCONFLICT_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
|
||||
if (TCP_GATE_EN1_sum != 0) else None)
|
||||
Tag RAM Stall (Write):
|
||||
expr: (((100 * TCP_WRITE_TAGCONFLICT_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
|
||||
if (TCP_GATE_EN1_sum != 0) else None)
|
||||
Tag RAM Stall (Atomic):
|
||||
expr: (((100 * TCP_ATOMIC_TAGCONFLICT_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
|
||||
if (TCP_GATE_EN1_sum != 0) else None)
|
||||
cli_style: simple_box
|
||||
tui_style: simple_box
|
||||
- metric_table:
|
||||
id: 1603
|
||||
title: vL1D cache access metrics
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Total Req:
|
||||
avg: AVG((TCP_TOTAL_ACCESSES_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_ACCESSES_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_ACCESSES_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read Req:
|
||||
avg: AVG((TCP_TOTAL_READ_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_READ_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_READ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Write Req:
|
||||
avg: AVG((TCP_TOTAL_WRITE_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_WRITE_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_WRITE_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Atomic Req:
|
||||
avg: AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
|
||||
/ $denom))
|
||||
min: MIN(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
|
||||
/ $denom))
|
||||
max: MAX(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Cache BW:
|
||||
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
|
||||
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
|
||||
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
Cache Hit Rate:
|
||||
avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
|
||||
else None))
|
||||
min: MIN(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
|
||||
else None))
|
||||
max: MAX(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
|
||||
else None))
|
||||
unit: pct
|
||||
Cache Accesses:
|
||||
avg: AVG((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Cache Hits:
|
||||
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ $denom))
|
||||
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ $denom))
|
||||
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Invalidations:
|
||||
avg: AVG((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
L1-L2 BW:
|
||||
avg: AVG(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
|
||||
min: MIN(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
|
||||
max: MAX(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
L1-L2 Read:
|
||||
avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
L1-L2 Write:
|
||||
avg: AVG((TCP_TCC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
L1-L2 Atomic:
|
||||
avg: AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
/ $denom))
|
||||
min: MIN(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
/ $denom))
|
||||
max: MAX(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
L1 Access Latency:
|
||||
avg: AVG(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
L1-L2 Read Latency:
|
||||
avg: AVG(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
|
||||
if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else
|
||||
None))
|
||||
min: MIN(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
|
||||
if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else
|
||||
None))
|
||||
max: MAX(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
|
||||
if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else
|
||||
None))
|
||||
unit: Cycles
|
||||
L1-L2 Write Latency:
|
||||
avg: AVG(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0)
|
||||
else None))
|
||||
min: MIN(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0)
|
||||
else None))
|
||||
max: MAX(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0)
|
||||
else None))
|
||||
unit: Cycles
|
||||
- metric_table:
|
||||
id: 1604
|
||||
title: L1D - L2 Transactions
|
||||
header:
|
||||
metric: Metric
|
||||
xfer: Xfer
|
||||
coherency: Coherency
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
NC - Read:
|
||||
xfer: Read
|
||||
coherency: NC
|
||||
avg: AVG((TCP_TCC_NC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_NC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_NC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
UC - Read:
|
||||
xfer: Read
|
||||
coherency: UC
|
||||
avg: AVG((TCP_TCC_UC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_UC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_UC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
CC - Read:
|
||||
xfer: Read
|
||||
coherency: CC
|
||||
avg: AVG((TCP_TCC_CC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_CC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_CC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
RW - Read:
|
||||
xfer: Read
|
||||
coherency: RW
|
||||
avg: AVG((TCP_TCC_RW_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_RW_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_RW_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
RW - Write:
|
||||
xfer: Write
|
||||
coherency: RW
|
||||
avg: AVG((TCP_TCC_RW_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_RW_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_RW_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
NC - Write:
|
||||
xfer: Write
|
||||
coherency: NC
|
||||
avg: AVG((TCP_TCC_NC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_NC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_NC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
UC - Write:
|
||||
xfer: Write
|
||||
coherency: UC
|
||||
avg: AVG((TCP_TCC_UC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_UC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_UC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
CC - Write:
|
||||
xfer: Write
|
||||
coherency: CC
|
||||
avg: AVG((TCP_TCC_CC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_CC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_CC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
NC - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: NC
|
||||
avg: AVG((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
UC - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: UC
|
||||
avg: AVG((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
CC - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: CC
|
||||
avg: AVG((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
RW - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: RW
|
||||
avg: AVG((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1605
|
||||
title: L1 Unified Translation Cache (UTCL1)
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
units: Units
|
||||
metric:
|
||||
Req:
|
||||
avg: AVG((TCP_UTCL1_REQUEST_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_REQUEST_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_REQUEST_sum / $denom))
|
||||
units: (Req + $normUnit)
|
||||
Hit Ratio:
|
||||
avg: AVG((((100 * TCP_UTCL1_TRANSLATION_HIT_sum) / TCP_UTCL1_REQUEST_sum)
|
||||
if (TCP_UTCL1_REQUEST_sum != 0) else None))
|
||||
min: MIN((((100 * TCP_UTCL1_TRANSLATION_HIT_sum) / TCP_UTCL1_REQUEST_sum)
|
||||
if (TCP_UTCL1_REQUEST_sum != 0) else None))
|
||||
max: MAX((((100 * TCP_UTCL1_TRANSLATION_HIT_sum) / TCP_UTCL1_REQUEST_sum)
|
||||
if (TCP_UTCL1_REQUEST_sum != 0) else None))
|
||||
units: pct
|
||||
Hits:
|
||||
avg: AVG((TCP_UTCL1_TRANSLATION_HIT_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_TRANSLATION_HIT_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_TRANSLATION_HIT_sum / $denom))
|
||||
units: (Req + $normUnit)
|
||||
Translation Misses:
|
||||
avg: AVG((TCP_UTCL1_TRANSLATION_MISS_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_TRANSLATION_MISS_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_TRANSLATION_MISS_sum / $denom))
|
||||
units: (Req + $normUnit)
|
||||
Permission Misses:
|
||||
avg: AVG((TCP_UTCL1_PERMISSION_MISS_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_PERMISSION_MISS_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_PERMISSION_MISS_sum / $denom))
|
||||
units: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1606
|
||||
title: L1D Addr Translation Stalls
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
units: Units
|
||||
metric: {}
|
||||
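The "Hit rate" and "Cache Hits" expressions in the generated file above both derive hits indirectly: every vL1D access that did not produce an outgoing L2 request is counted as a hit. The Python sketch below restates that arithmetic for a single kernel. The function and parameter names are hypothetical stand-ins for the `TCP_*_sum` counters, and `denom` stands in for the `$denom` normalization value, whose actual computation is not shown in this diff.

```python
from typing import Optional


def l2_requests(read_req: float, write_req: float,
                atomic_ret_req: float, atomic_noret_req: float) -> float:
    """Outgoing vL1D-to-L2 requests: reads, writes, and both kinds of atomics."""
    return read_req + write_req + atomic_ret_req + atomic_noret_req


def vl1d_hit_rate(read_req: float, write_req: float, atomic_ret_req: float,
                  atomic_noret_req: float,
                  cache_accesses: float) -> Optional[float]:
    """Hit rate in percent, or None when there were no cache accesses.

    Mirrors: 100 - (100 * L2 requests) / TCP_TOTAL_CACHE_ACCESSES_sum
    """
    if cache_accesses == 0:
        return None
    misses = l2_requests(read_req, write_req, atomic_ret_req, atomic_noret_req)
    return 100 - (100 * misses) / cache_accesses


def vl1d_cache_hits(read_req: float, write_req: float, atomic_ret_req: float,
                    atomic_noret_req: float, cache_accesses: float,
                    denom: float = 1.0) -> float:
    """Cache hits per normalization unit (denom is a placeholder for $denom)."""
    misses = l2_requests(read_req, write_req, atomic_ret_req, atomic_noret_req)
    return (cache_accesses - misses) / denom
```

With 100 cache accesses and 20 outgoing L2 requests in total, `vl1d_hit_rate(10, 5, 3, 2, 100)` evaluates to 80 percent.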
-388
@@ -1,388 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 1700
|
||||
title: L2 Cache
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1701
|
||||
title: Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Utilization:
|
||||
value: AVG(((TCC_BUSY_sum * 100) / (TO_INT($total_l2_chan) * $GRBM_GUI_ACTIVE_PER_XCD)))
|
||||
unit: pct
|
||||
tips:
|
||||
Bandwidth:
|
||||
value: ((100 * AVG(((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk / 1000) * 128) * TO_INT($total_l2_chan)))
|
||||
unit: pct
|
||||
tips:
|
||||
Hit Rate:
|
||||
value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else 0))
|
||||
unit: pct
|
||||
tips:
|
||||
L2-Fabric Read BW:
|
||||
value: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
tips:
|
||||
L2-Fabric Write and Atomic BW:
|
||||
value: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
|
||||
* 32)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
tips:
|
||||
HBM Bandwidth:
|
||||
value: $hbmBandwidth
|
||||
unit: GB/s
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 1702
|
||||
title: L2 - Fabric Transactions
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Read BW:
|
||||
avg: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / $denom))
|
||||
min: MIN((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / $denom))
|
||||
max: MAX((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
tips:
|
||||
HBM Read Traffic:
|
||||
avg: AVG((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
min: MIN((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
max: MAX((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
Remote Read Traffic:
|
||||
avg: AVG((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
min: MIN((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
max: MAX((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
Uncached Read Traffic:
|
||||
avg: AVG((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
min: MIN((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
max: MAX((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
Write and Atomic BW:
|
||||
avg: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
min: MIN((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
max: MAX((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
tips:
|
||||
HBM Write and Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
min: MIN((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
max: MAX((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
Remote Write and Atomic Traffic:
|
||||
avg: AVG((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
min: MIN((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
max: MAX((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
min: MIN((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
max: MAX((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
Uncached Write and Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
min: MIN((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
max: MAX((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
Read Latency:
|
||||
avg: AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum !=
|
||||
0) else None))
|
||||
min: MIN(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum !=
|
||||
0) else None))
|
||||
max: MAX(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum !=
|
||||
0) else None))
|
||||
unit: Cycles
|
||||
tips:
|
||||
Write and Atomic Latency:
|
||||
avg: AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum !=
|
||||
0) else None))
|
||||
min: MIN(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum !=
|
||||
0) else None))
|
||||
max: MAX(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum !=
|
||||
0) else None))
|
||||
unit: Cycles
|
||||
tips:
|
||||
Atomic Latency:
|
||||
avg: AVG(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 1703
|
||||
title: L2 Cache Accesses
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Bandwidth:
|
||||
avg: AVG((TCC_REQ_sum * 128) / $denom)
|
||||
min: MIN((TCC_REQ_sum * 128) / $denom)
|
||||
max: MAX((TCC_REQ_sum * 128) / $denom)
|
||||
unit: (Bytes + $normUnit)
|
||||
tips:
|
||||
Req:
|
||||
avg: AVG((TCC_REQ_sum / $denom))
|
||||
min: MIN((TCC_REQ_sum / $denom))
|
||||
max: MAX((TCC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Read Req:
|
||||
avg: AVG((TCC_READ_sum / $denom))
|
||||
min: MIN((TCC_READ_sum / $denom))
|
||||
max: MAX((TCC_READ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Write Req:
|
||||
avg: AVG((TCC_WRITE_sum / $denom))
|
||||
min: MIN((TCC_WRITE_sum / $denom))
|
||||
max: MAX((TCC_WRITE_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Atomic Req:
|
||||
avg: AVG((TCC_ATOMIC_sum / $denom))
|
||||
min: MIN((TCC_ATOMIC_sum / $denom))
|
||||
max: MAX((TCC_ATOMIC_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Streaming Req:
|
||||
avg: AVG((TCC_STREAMING_REQ_sum / $denom))
|
||||
min: MIN((TCC_STREAMING_REQ_sum / $denom))
|
||||
max: MAX((TCC_STREAMING_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Probe Req:
|
||||
avg: AVG((TCC_PROBE_sum / $denom))
|
||||
min: MIN((TCC_PROBE_sum / $denom))
|
||||
max: MAX((TCC_PROBE_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Cache Hit:
|
||||
avg: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
min: MIN((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
max: MAX((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
Hits:
|
||||
avg: AVG((TCC_HIT_sum / $denom))
|
||||
min: MIN((TCC_HIT_sum / $denom))
|
||||
max: MAX((TCC_HIT_sum / $denom))
|
||||
unit: (Hits + $normUnit)
|
||||
tips:
|
||||
Misses:
|
||||
avg: AVG((TCC_MISS_sum / $denom))
|
||||
min: MIN((TCC_MISS_sum / $denom))
|
||||
max: MAX((TCC_MISS_sum / $denom))
|
||||
unit: (Misses + $normUnit)
|
||||
tips:
|
||||
Writeback:
|
||||
avg: AVG((TCC_WRITEBACK_sum / $denom))
|
||||
min: MIN((TCC_WRITEBACK_sum / $denom))
|
||||
max: MAX((TCC_WRITEBACK_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
tips:
|
||||
Writeback (Internal):
|
||||
avg: AVG((TCC_NORMAL_WRITEBACK_sum / $denom))
|
||||
min: MIN((TCC_NORMAL_WRITEBACK_sum / $denom))
|
||||
max: MAX((TCC_NORMAL_WRITEBACK_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
tips:
|
||||
Writeback (vL1D Req):
|
||||
avg: AVG((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
|
||||
min: MIN((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
|
||||
max: MAX((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
tips:
|
||||
Evict (Internal):
|
||||
avg: AVG((TCC_NORMAL_EVICT_sum / $denom))
|
||||
min: MIN((TCC_NORMAL_EVICT_sum / $denom))
|
||||
max: MAX((TCC_NORMAL_EVICT_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
tips:
|
||||
Evict (vL1D Req):
|
||||
avg: AVG((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
|
||||
min: MIN((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
|
||||
max: MAX((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
tips:
|
||||
NC Req:
|
||||
avg: AVG((TCC_NC_REQ_sum / $denom))
|
||||
min: MIN((TCC_NC_REQ_sum / $denom))
|
||||
max: MAX((TCC_NC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
UC Req:
|
||||
avg: AVG((TCC_UC_REQ_sum / $denom))
|
||||
min: MIN((TCC_UC_REQ_sum / $denom))
|
||||
max: MAX((TCC_UC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
CC Req:
|
||||
avg: AVG((TCC_CC_REQ_sum / $denom))
|
||||
min: MIN((TCC_CC_REQ_sum / $denom))
|
||||
max: MAX((TCC_CC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
RW Req:
|
||||
avg: AVG((TCC_RW_REQ_sum / $denom))
|
||||
min: MIN((TCC_RW_REQ_sum / $denom))
|
||||
max: MAX((TCC_RW_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 1704
|
||||
title: L2 Cache Stalls
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
|
||||
- metric_table:
|
||||
id: 1705
|
||||
title: L2 - Fabric Interface Stalls
|
||||
header:
|
||||
metric: Metric
|
||||
type: Type
|
||||
transaction: Transaction
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
style:
|
||||
type: simple_multi_bar
|
||||
metric:
|
||||
Write - Credit Starvation:
|
||||
type: Credit Starvation
|
||||
transaction: Write
|
||||
avg: AVG(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
min: MIN(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
max: MAX(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 1706
|
||||
title: L2 - Fabric Detailed Transaction Breakdown
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Read (32B):
|
||||
avg: AVG((TCC_EA_RDREQ_32B_sum / $denom))
|
||||
min: MIN((TCC_EA_RDREQ_32B_sum / $denom))
|
||||
max: MAX((TCC_EA_RDREQ_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Read (64B):
|
||||
avg: AVG(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
|
||||
min: MIN(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
|
||||
max: MAX(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Read (Uncached):
|
||||
avg: AVG((TCC_EA_RD_UNCACHED_32B_sum / $denom))
|
||||
min: MIN((TCC_EA_RD_UNCACHED_32B_sum / $denom))
|
||||
max: MAX((TCC_EA_RD_UNCACHED_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
HBM Read:
|
||||
avg: AVG((TCC_EA_RDREQ_DRAM_sum / $denom))
|
||||
min: MIN((TCC_EA_RDREQ_DRAM_sum / $denom))
|
||||
max: MAX((TCC_EA_RDREQ_DRAM_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Remote Read:
|
||||
avg: AVG((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
|
||||
min: MIN((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
|
||||
max: MAX((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Write and Atomic (32B):
|
||||
avg: AVG(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom))
|
||||
min: MIN(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom))
|
||||
max: MAX(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Write and Atomic (Uncached):
|
||||
avg: AVG((TCC_EA_WR_UNCACHED_32B_sum / $denom))
|
||||
min: MIN((TCC_EA_WR_UNCACHED_32B_sum / $denom))
|
||||
max: MAX((TCC_EA_WR_UNCACHED_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Write and Atomic (64B):
|
||||
avg: AVG((TCC_EA_WRREQ_64B_sum / $denom))
|
||||
min: MIN((TCC_EA_WRREQ_64B_sum / $denom))
|
||||
max: MAX((TCC_EA_WRREQ_64B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
HBM Write and Atomic:
|
||||
avg: AVG((TCC_EA_WRREQ_DRAM_sum / $denom))
|
||||
min: MIN((TCC_EA_WRREQ_DRAM_sum / $denom))
|
||||
max: MAX((TCC_EA_WRREQ_DRAM_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Remote Write and Atomic:
|
||||
avg: AVG((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
|
||||
min: MIN((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
|
||||
max: MAX((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
Atomic:
|
||||
avg: AVG((TCC_EA_ATOMIC_sum / $denom))
|
||||
min: MIN((TCC_EA_ATOMIC_sum / $denom))
|
||||
max: MAX((TCC_EA_ATOMIC_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
tips:
|
||||
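The metric expressions in this file follow two recurring patterns: a ratio guarded against a zero denominator (`... if (X != 0) else None`), and a byte count built from 32B and 64B request counters. The sketch below illustrates both in plain Python. The function names and sample values are hypothetical, and it assumes that the `AVG`/`MIN`/`MAX` aggregators skip `None` entries, which this diff does not show.

```python
from statistics import mean


def guarded_pct(numerator, denominator):
    """Mirror `100 * (A / B) if (B != 0) else None` from the config."""
    return 100 * (numerator / denominator) if denominator != 0 else None


def read_bytes(rdreq_total, rdreq_32b):
    """Mirror the Read BW numerator: 32B requests plus the remaining 64B requests."""
    return rdreq_32b * 32 + (rdreq_total - rdreq_32b) * 64


def aggregate(values, fn):
    """Apply fn to the non-None values (assumed behaviour); None if none remain."""
    kept = [v for v in values if v is not None]
    return fn(kept) if kept else None


# Hypothetical per-kernel samples:
# (TCC_EA_RDREQ_sum, TCC_EA_RDREQ_DRAM_sum, TCC_EA_RDREQ_32B_sum)
samples = [(100, 80, 40), (0, 0, 0), (50, 50, 10)]

# "HBM Read Traffic" per kernel; the idle kernel yields None instead of dividing by zero.
hbm_read_traffic = [guarded_pct(dram, total) for total, dram, _ in samples]

avg_hbm = aggregate(hbm_read_traffic, mean)
min_hbm = aggregate(hbm_read_traffic, min)
max_hbm = aggregate(hbm_read_traffic, max)

# Bytes read over the fabric per kernel, before dividing by the normalization unit.
bytes_per_kernel = [read_bytes(total, r32) for total, _, r32 in samples]
```

Returning `None` rather than `0` keeps a kernel that issued no requests from dragging the average or minimum down, provided the aggregation really does drop `None` values as assumed here.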
@@ -0,0 +1,536 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 1700
|
||||
title: L2 Cache
|
||||
metrics_description:
|
||||
Utilization: The ratio of the number of cycles an L2 channel was active, summed
|
||||
over all L2 channels on the accelerator over the total L2 cycles.
|
||||
Peak Bandwidth: The number of bytes looked up in the L2 cache, as a percent of
|
||||
the peak theoretical bandwidth achievable on the specific accelerator. The number
|
||||
of bytes is calculated as the number of cache lines requested multiplied by
|
||||
the cache line size. This value does not consider partial requests, so for example,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line.
|
||||
Hit Rate: The ratio of the number of L2 cache line requests that hit in the L2
|
||||
cache over the total number of incoming cache line requests to the L2 cache.
|
||||
L2-Fabric Read BW: The number of bytes read by the L2 over the Infinity Fabric
|
||||
interface per unit time.
|
||||
L2-Fabric Write and Atomic BW: The number of bytes sent by the L2 over the Infinity
|
||||
Fabric interface by write and atomic operations per unit time.
|
||||
HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
|
||||
memory (HBM) per unit time. This value is calculated as the number of HBM channels
|
||||
multiplied by the HBM channel width multiplied by the HBM clock frequency.
|
||||
Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per
|
||||
normalization unit.
|
||||
HBM Read Traffic: The percent of read requests generated by the L2 cache that
|
||||
are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
|
||||
does not consider the size of the request (meaning that 32B and 64B requests
|
||||
are both counted as a single request), so this metric only approximates the
|
||||
percent of the L2-Fabric Read bandwidth directed to the local HBM.
|
||||
Remote Read Traffic: The percent of read requests generated by the L2 cache that
|
||||
are routed to any memory location other than the accelerator's local high-bandwidth
|
||||
memory (HBM) - for example, the CPU's DRAM or a remote accelerator's HBM. This
|
||||
breakdown does not consider the size of the request (meaning that 32B and 64B
|
||||
requests are both counted as a single request), so this metric only approximates
|
||||
the percent of the L2-Fabric Read bandwidth directed to a remote location.
|
||||
Uncached Read Traffic: The percent of read requests generated by the L2 cache
|
||||
that are reading from an uncached memory allocation. Note, as described in the
|
||||
request flow section, a single 64B read request is typically counted as two
|
||||
uncached read requests. So, it is possible for the Uncached Read Traffic to
|
||||
reach up to 200% of the total number of read requests. This breakdown does not
|
||||
consider the size of the request (i.e., 32B and 64B requests are both counted
|
||||
as a single request), so this metric only approximates the percent of the L2-Fabric
|
||||
read bandwidth directed to an uncached memory location.
|
||||
Write and Atomic BW: The total number of bytes written by the L2 over Infinity
|
||||
Fabric by write and atomic operations per normalization unit. Note that on current
|
||||
CDNA accelerators, such as the MI2XX, requests are only considered atomic by
|
||||
Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
|
||||
fine-grained memory allocations or uncached memory allocations on the MI2XX.
|
||||
HBM Write and Atomic Traffic: The percent of write and atomic requests generated
|
||||
by the L2 cache that are routed to the accelerator's local high-bandwidth memory
|
||||
(HBM). This breakdown does not consider the size of the request (meaning that
|
||||
32B and 64B requests are both counted as a single request), so this metric only
|
||||
approximates the percent of the L2-Fabric Write and Atomic bandwidth directed
|
||||
to the local HBM. Note that on current CDNA accelerators, such as the MI2XX,
|
||||
requests are only considered atomic by Infinity Fabric if they are targeted
|
||||
at fine-grained memory allocations or uncached memory allocations.
|
||||
Remote Write and Atomic Traffic: The percent of write and atomic requests generated by the
|
||||
L2 cache that are routed to any memory location other than the accelerator's
|
||||
local high-bandwidth memory (HBM) - for example, the CPU's DRAM or a remote
|
||||
accelerator's HBM. This breakdown does not consider the size of the request
|
||||
(meaning that 32B and 64B requests are both counted as a single request), so
|
||||
this metric only approximates the percent of the L2-Fabric Write and Atomic bandwidth directed
|
||||
to a remote location. Note that on current CDNA accelerators, such as the MI2XX,
|
||||
requests are only considered atomic by Infinity Fabric if they are targeted
|
||||
at fine-grained memory allocations or uncached memory allocations.
|
||||
Atomic Traffic: The percent of write requests generated by the L2 cache that are
|
||||
atomic requests to any memory location. This breakdown does not consider the
|
||||
size of the request (meaning that 32B and 64B requests are both counted as a
|
||||
single request), so this metric only approximates the percent of the L2-Fabric
|
||||
Write and Atomic bandwidth consumed by atomic requests. Note that on current CDNA accelerators,
|
||||
such as the MI2XX, requests are only considered atomic by Infinity Fabric if
|
||||
they are targeted at fine-grained memory allocations or uncached memory allocations.
|
||||
Uncached Write and Atomic Traffic: The percent of write and atomic requests generated
|
||||
by the L2 cache that are targeting uncached memory allocations. This breakdown
|
||||
does not consider the size of the request (meaning that 32B and 64B requests
|
||||
are both counted as a single request), so this metric only approximates the
|
||||
percent of the L2-Fabric Write and Atomic bandwidth directed to uncached memory allocations.
|
||||
Read Latency: The time-averaged number of cycles read requests spent in Infinity
|
||||
Fabric before data was returned to the L2.
|
||||
Write and Atomic Latency: The time-averaged number of cycles write requests spent
|
||||
in Infinity Fabric before a completion acknowledgement was returned to the L2.
|
||||
Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
|
||||
Fabric before a completion acknowledgement (atomic without return value) or
|
||||
data (atomic with return value) was returned to the L2.
|
||||
Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit.
|
||||
The number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so for
|
||||
example, if only a single value is requested in a cache line, the data movement
|
||||
will still be counted as a full cache line.
|
||||
Req: The total number of incoming requests to the L2 from all clients for all
|
||||
request types, per normalization unit.
|
||||
Read Req: The total number of read requests to the L2 from all clients.
|
||||
Write Req: The total number of write requests to the L2 from all clients.
|
||||
Atomic Req: The total number of atomic requests (with and without return) to the
|
||||
L2 from all clients.
|
||||
Streaming Req: The total number of incoming requests to the L2 that are marked
|
||||
as streaming. The exact meaning of this may differ depending on the targeted
|
||||
accelerator; on an MI2XX, however, this corresponds to non-temporal loads or stores.
|
||||
The L2 cache attempts to evict streaming requests before normal requests when
|
||||
the L2 is at capacity.
|
||||
Probe Req: The number of coherence probe requests made to the L2 cache from outside
|
||||
the accelerator. On an MI2XX, probe requests may be generated by, for example,
|
||||
writes to fine-grained device memory or by writes to coarse-grained device memory.
|
||||
Cache Hit: The ratio of the number of L2 cache line requests that hit in the L2
|
||||
cache over the total number of incoming cache line requests to the L2 cache.
|
||||
Hits: The total number of requests to the L2 from all clients that hit in the
|
||||
cache. As noted in the Speed-of-Light section, this includes hit-on-miss requests.
|
||||
Misses: The total number of requests to the L2 from all clients that miss in the
|
||||
cache. As noted in the Speed-of-Light section, these do not include hit-on-miss
|
||||
requests.
|
||||
Writeback: The total number of L2 cache lines written back to memory for any reason.
|
||||
Write-backs may occur due to user code (such as HIP kernel calls to __threadfence_system
|
||||
or atomic built-ins), by the command processor's memory acquire/release fences,
|
||||
or for other internal hardware reasons.
|
||||
Writeback (Internal): The total number of L2 cache lines written back to memory
|
||||
for internal hardware reasons, per normalization unit.
|
||||
Writeback (vL1D Req): The total number of L2 cache lines written back to memory
|
||||
due to requests initiated by the vL1D cache, per normalization unit.
|
||||
Evict (Internal): The total number of L2 cache lines evicted from the cache due
|
||||
to capacity limits, per normalization unit.
|
||||
Evict (vL1D Req): The total number of L2 cache lines evicted from the cache due
|
||||
to invalidation requests initiated by the vL1D cache, per normalization unit.
|
||||
NC Req: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory
|
||||
allocations, per normalization unit.
|
||||
UC Req: The total number of requests to the L2 that go to Uncached (UC) memory
|
||||
allocations.
|
||||
CC Req: The total number of requests to the L2 that go to Coherently Cacheable
|
||||
(CC) memory allocations.
|
||||
RW Req: The total number of requests to the L2 that go to Read-Write coherent
|
||||
memory (RW) allocations.
|
||||
Write - Credit Starvation: The number of cycles the L2-Fabric interface was stalled
|
||||
on write or atomic requests to any memory location because too many write/atomic
|
||||
requests were currently in flight, as a percent of the total active L2 cycles.
|
||||
Read (32B): The total number of L2 requests to Infinity Fabric to read 32B of
|
||||
data from any memory location, per normalization unit.
|
||||
Read (64B): The total number of L2 requests to Infinity Fabric to read 64B of
|
||||
data from any memory location, per normalization unit.
|
||||
Read (Uncached): The total number of L2 requests to Infinity Fabric to read uncached
|
||||
data from any memory location, per normalization unit. 64B requests for uncached
|
||||
data are counted as two 32B uncached data requests.
|
||||
HBM Read: The total number of L2 requests to Infinity Fabric to read 32B or 64B
|
||||
of data from the accelerator's local HBM, per normalization unit.
|
||||
Remote Read: The total number of L2 requests to Infinity Fabric to read 32B or
|
||||
64B of data from any source other than the accelerator's local HBM, per normalization
|
||||
unit.
|
||||
Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
|
||||
write or atomically update 32B of data to any memory location, per normalization
|
||||
unit.
|
||||
Write and Atomic (Uncached): The total number of L2 requests to Infinity Fabric
|
||||
to write or atomically update 32B or 64B of uncached data, per normalization
|
||||
unit.
|
||||
Write and Atomic (64B): The total number of L2 requests to Infinity Fabric to
|
||||
write or atomically update 64B of data in any memory location, per normalization
|
||||
unit.
|
||||
HBM Write and Atomic: The total number of L2 requests to Infinity Fabric to write
|
||||
or atomically update 32B or 64B of data in the accelerator's local HBM, per
|
||||
normalization unit.
|
||||
Remote Write and Atomic: The total number of L2 requests to Infinity Fabric to
|
||||
write or atomically update 32B or 64B of data in any memory location other than
|
||||
the accelerator's local HBM, per normalization unit.
|
||||
Atomic: The total number of L2 requests to Infinity Fabric to atomically update
|
||||
32B or 64B of data in any memory location, per normalization unit. See Request
|
||||
flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
|
||||
requests are only considered atomic by Infinity Fabric if they are targeted
|
||||
at non-write-cacheable memory, such as fine-grained memory allocations or uncached
|
||||
memory allocations on the MI2XX.
|
||||
Read Stall: "The ratio of the total number of cycles the L2-Fabric interface was\
|
||||
\ stalled on a read request to any destination (local HBM, remote PCIe\xAE connected\
|
||||
\ accelerator or CPU, or remote Infinity Fabric connected accelerator or CPU)\
|
||||
\ over the total active L2 cycles."
|
||||
Write Stall: The ratio of the total number of cycles the L2-Fabric interface was
|
||||
stalled on a write or atomic request to any destination (local HBM, remote accelerator
|
||||
or CPU, PCIe connected accelerator or CPU, or remote Infinity Fabric connected
|
||||
accelerator or CPU) over the total active L2 cycles.
|
||||
Read - PCIe Stall: The number of cycles the L2-Fabric interface was stalled on
|
||||
read requests to remote PCIe connected accelerators or CPUs as a percent of
|
||||
the total active L2 cycles.
|
||||
Read - Infinity Fabric Stall: The number of cycles the L2-Fabric interface was
|
||||
stalled on read requests to remote Infinity Fabric connected accelerators or
|
||||
CPUs as a percent of the total active L2 cycles.
|
||||
Read - HBM Stall: The number of cycles the L2-Fabric interface was stalled on
|
||||
read requests to the accelerator's local HBM as a percent of the total active
|
||||
L2 cycles.
|
||||
Write - PCIe Stall: The number of cycles the L2-Fabric interface was stalled on
|
||||
write or atomic requests to remote PCIe connected accelerators or CPUs as a
|
||||
percent of the total active L2 cycles.
|
||||
Write - Infinity Fabric Stall: The number of cycles the L2-Fabric interface was
|
||||
stalled on write or atomic requests to remote Infinity Fabric connected accelerators
|
||||
or CPUs as a percent of the total active L2 cycles.
|
||||
Write - HBM Stall: The number of cycles the L2-Fabric interface was stalled on
|
||||
write or atomic requests to the accelerator's local HBM as a percent of the total
|
||||
active L2 cycles.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1701
|
||||
title: L2 Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
metric:
|
||||
Utilization:
|
||||
value: AVG(((TCC_BUSY_sum * 100) / (TO_INT($total_l2_chan) * $GRBM_GUI_ACTIVE_PER_XCD)))
|
||||
unit: pct
|
||||
Peak Bandwidth:
|
||||
value: ((100 * AVG(((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk / 1000) * 128) * TO_INT($total_l2_chan)))
|
||||
unit: pct
|
||||
Hit Rate:
|
||||
value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else 0))
|
||||
unit: pct
|
||||
L2-Fabric Read BW:
|
||||
value: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
L2-Fabric Write and Atomic BW:
|
||||
value: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
|
||||
* 32)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
HBM Bandwidth:
|
||||
value: $hbmBandwidth
|
||||
unit: GB/s
|
||||
- metric_table:
|
||||
id: 1702
|
||||
title: L2-Fabric interface metrics
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Read BW:
|
||||
avg: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / $denom))
|
||||
min: MIN((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / $denom))
|
||||
max: MAX((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
HBM Read Traffic:
|
||||
avg: AVG((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Remote Read Traffic:
|
||||
avg: AVG((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum)
|
||||
if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
min: MIN((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum)
|
||||
if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
max: MAX((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum)
|
||||
if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
unit: pct
|
||||
Uncached Read Traffic:
|
||||
avg: AVG((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Write and Atomic BW:
|
||||
avg: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
min: MIN((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
max: MAX((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
HBM Write and Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Remote Write and Atomic Traffic:
|
||||
avg: AVG((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum)
|
||||
if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
min: MIN((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum)
|
||||
if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
max: MAX((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum)
|
||||
if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
unit: pct
|
||||
Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Uncached Write and Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Read Latency:
|
||||
avg: AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
Write and Atomic Latency:
|
||||
avg: AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
Atomic Latency:
|
||||
avg: AVG(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
    - metric_table:
        id: 1703
        title: L2 Cache Accesses
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
        metric:
          Bandwidth:
            avg: AVG((TCC_REQ_sum * 128) / $denom)
            min: MIN((TCC_REQ_sum * 128) / $denom)
            max: MAX((TCC_REQ_sum * 128) / $denom)
            unit: (Bytes + $normUnit)
          Req:
            avg: AVG((TCC_REQ_sum / $denom))
            min: MIN((TCC_REQ_sum / $denom))
            max: MAX((TCC_REQ_sum / $denom))
            unit: (Req + $normUnit)
          Read Req:
            avg: AVG((TCC_READ_sum / $denom))
            min: MIN((TCC_READ_sum / $denom))
            max: MAX((TCC_READ_sum / $denom))
            unit: (Req + $normUnit)
          Write Req:
            avg: AVG((TCC_WRITE_sum / $denom))
            min: MIN((TCC_WRITE_sum / $denom))
            max: MAX((TCC_WRITE_sum / $denom))
            unit: (Req + $normUnit)
          Atomic Req:
            avg: AVG((TCC_ATOMIC_sum / $denom))
            min: MIN((TCC_ATOMIC_sum / $denom))
            max: MAX((TCC_ATOMIC_sum / $denom))
            unit: (Req + $normUnit)
          Streaming Req:
            avg: AVG((TCC_STREAMING_REQ_sum / $denom))
            min: MIN((TCC_STREAMING_REQ_sum / $denom))
            max: MAX((TCC_STREAMING_REQ_sum / $denom))
            unit: (Req + $normUnit)
          Probe Req:
            avg: AVG((TCC_PROBE_sum / $denom))
            min: MIN((TCC_PROBE_sum / $denom))
            max: MAX((TCC_PROBE_sum / $denom))
            unit: (Req + $normUnit)
          Cache Hit:
            avg: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
              + TCC_MISS_sum) != 0) else None))
            min: MIN((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
              + TCC_MISS_sum) != 0) else None))
            max: MAX((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
              + TCC_MISS_sum) != 0) else None))
            unit: pct
          Hits:
            avg: AVG((TCC_HIT_sum / $denom))
            min: MIN((TCC_HIT_sum / $denom))
            max: MAX((TCC_HIT_sum / $denom))
            unit: (Hits + $normUnit)
          Misses:
            avg: AVG((TCC_MISS_sum / $denom))
            min: MIN((TCC_MISS_sum / $denom))
            max: MAX((TCC_MISS_sum / $denom))
            unit: (Misses + $normUnit)
          Writeback:
            avg: AVG((TCC_WRITEBACK_sum / $denom))
            min: MIN((TCC_WRITEBACK_sum / $denom))
            max: MAX((TCC_WRITEBACK_sum / $denom))
            unit: (Cachelines + $normUnit)
          Writeback (Internal):
            avg: AVG((TCC_NORMAL_WRITEBACK_sum / $denom))
            min: MIN((TCC_NORMAL_WRITEBACK_sum / $denom))
            max: MAX((TCC_NORMAL_WRITEBACK_sum / $denom))
            unit: (Cachelines + $normUnit)
          Writeback (vL1D Req):
            avg: AVG((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
            min: MIN((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
            max: MAX((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
            unit: (Cachelines + $normUnit)
          Evict (Internal):
            avg: AVG((TCC_NORMAL_EVICT_sum / $denom))
            min: MIN((TCC_NORMAL_EVICT_sum / $denom))
            max: MAX((TCC_NORMAL_EVICT_sum / $denom))
            unit: (Cachelines + $normUnit)
          Evict (vL1D Req):
            avg: AVG((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
            min: MIN((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
            max: MAX((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
            unit: (Cachelines + $normUnit)
          NC Req:
            avg: AVG((TCC_NC_REQ_sum / $denom))
            min: MIN((TCC_NC_REQ_sum / $denom))
            max: MAX((TCC_NC_REQ_sum / $denom))
            unit: (Req + $normUnit)
          UC Req:
            avg: AVG((TCC_UC_REQ_sum / $denom))
            min: MIN((TCC_UC_REQ_sum / $denom))
            max: MAX((TCC_UC_REQ_sum / $denom))
            unit: (Req + $normUnit)
          CC Req:
            avg: AVG((TCC_CC_REQ_sum / $denom))
            min: MIN((TCC_CC_REQ_sum / $denom))
            max: MAX((TCC_CC_REQ_sum / $denom))
            unit: (Req + $normUnit)
          RW Req:
            avg: AVG((TCC_RW_REQ_sum / $denom))
            min: MIN((TCC_RW_REQ_sum / $denom))
            max: MAX((TCC_RW_REQ_sum / $denom))
            unit: (Req + $normUnit)
    - metric_table:
        id: 1704
        title: L2 Cache Stalls
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
        metric: {}
    - metric_table:
        id: 1705
        title: L2 - Fabric Interface stalls
        header:
          metric: Metric
          type: Type
          transaction: Transaction
          avg: Avg
          min: Min
          max: Max
          unit: Unit
        style:
          type: simple_multi_bar
        metric:
          Write - Credit Starvation:
            type: Credit Starvation
            transaction: Write
            avg: AVG(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
              != 0) else None))
            min: MIN(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
              != 0) else None))
            max: MAX(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
              != 0) else None))
            unit: pct
    - metric_table:
        id: 1706
        title: L2 - Fabric interface detailed metrics
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
        metric:
          Read (32B):
            avg: AVG((TCC_EA_RDREQ_32B_sum / $denom))
            min: MIN((TCC_EA_RDREQ_32B_sum / $denom))
            max: MAX((TCC_EA_RDREQ_32B_sum / $denom))
            unit: (Req + $normUnit)
          Read (64B):
            avg: AVG(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
            min: MIN(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
            max: MAX(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
            unit: (Req + $normUnit)
          Read (Uncached):
            avg: AVG((TCC_EA_RD_UNCACHED_32B_sum / $denom))
            min: MIN((TCC_EA_RD_UNCACHED_32B_sum / $denom))
            max: MAX((TCC_EA_RD_UNCACHED_32B_sum / $denom))
            unit: (Req + $normUnit)
          HBM Read:
            avg: AVG((TCC_EA_RDREQ_DRAM_sum / $denom))
            min: MIN((TCC_EA_RDREQ_DRAM_sum / $denom))
            max: MAX((TCC_EA_RDREQ_DRAM_sum / $denom))
            unit: (Req + $normUnit)
          Remote Read:
            avg: AVG((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
            min: MIN((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
            max: MAX((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
            unit: (Req + $normUnit)
          Write and Atomic (32B):
            avg: AVG(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom))
            min: MIN(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom))
            max: MAX(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom))
            unit: (Req + $normUnit)
          Write and Atomic (Uncached):
            avg: AVG((TCC_EA_WR_UNCACHED_32B_sum / $denom))
            min: MIN((TCC_EA_WR_UNCACHED_32B_sum / $denom))
            max: MAX((TCC_EA_WR_UNCACHED_32B_sum / $denom))
            unit: (Req + $normUnit)
          Write and Atomic (64B):
            avg: AVG((TCC_EA_WRREQ_64B_sum / $denom))
            min: MIN((TCC_EA_WRREQ_64B_sum / $denom))
            max: MAX((TCC_EA_WRREQ_64B_sum / $denom))
            unit: (Req + $normUnit)
          HBM Write and Atomic:
            avg: AVG((TCC_EA_WRREQ_DRAM_sum / $denom))
            min: MIN((TCC_EA_WRREQ_DRAM_sum / $denom))
            max: MAX((TCC_EA_WRREQ_DRAM_sum / $denom))
            unit: (Req + $normUnit)
          Remote Write and Atomic:
            avg: AVG((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
            min: MIN((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
            max: MAX((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
            unit: (Req + $normUnit)
          Atomic:
            avg: AVG((TCC_EA_ATOMIC_sum / $denom))
            min: MIN((TCC_EA_ATOMIC_sum / $denom))
            max: MAX((TCC_EA_ATOMIC_sum / $denom))
            unit: (Req + $normUnit)
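Most ratio metrics in the tables above share one shape: `(numerator / denominator) if (denominator != 0) else None`, optionally scaled by 100 for percentages. A minimal Python sketch of that guard follows, assuming plain numeric counter values. The helper name `safe_ratio` and the sample numbers are illustrative assumptions and are not part of rocprofiler-compute.

```python
from typing import Optional


def safe_ratio(numerator: float, denominator: float, scale: float = 1.0) -> Optional[float]:
    """Mirror `(scale * n / d) if (d != 0) else None` from the config expressions."""
    if denominator == 0:
        # A zero denominator yields None rather than raising ZeroDivisionError.
        return None
    return scale * numerator / denominator


# Hypothetical counter values, for illustration only.
hit_pct = safe_ratio(750, 1000, scale=100)  # percentage-style metric
no_data = safe_ratio(5, 0)                  # counter never incremented
print(hit_pct, no_data)
```

Returning `None` instead of 0 keeps "no data" distinguishable from a measured zero when the table is rendered.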
-350
@@ -1,350 +0,0 @@
---
# Add description/tips for each metric in this section.
# So it could be shown in hover.
Metric Description:

# Define the panel properties and properties of each metric in the panel.
Panel Config:
  id: 1800
  title: L2 Cache (per Channel)
  data source:
    - metric_table:
        id: 1801
        title: Aggregate Stats (All channels)
        header:
          metric: Metric
          avg: Avg
          std dev: Std Dev
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          L2 Cache Hit Rate:
            avg: AVG(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
              + (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100 *
              TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
              + (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) + (100
              * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100 * TCC_HIT[15]))
              + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 * TCC_HIT[18])) + (100
              * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21])) + (100 * TCC_HIT[22]))
              + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) + (100 * TCC_HIT[25])) + (100
              * TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100 * TCC_HIT[28])) + (100 * TCC_HIT[29]))
              + (100 * TCC_HIT[30])) + (100 * TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0]
              + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
              + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
              + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
              + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
              + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
              + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
              + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
              + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
              + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
              + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
              + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
              + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[28] + TCC_HIT[29])) + (TCC_MISS[30]
              + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
              + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
              + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
              + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
              + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
              + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
              + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
              + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
              + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
              + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
              + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
              + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
              + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
              + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
            std dev: STD(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
              + (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100 *
              TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
              + (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) + (100
              * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100 * TCC_HIT[15]))
              + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 * TCC_HIT[18])) + (100
              * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21])) + (100 * TCC_HIT[22]))
              + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) + (100 * TCC_HIT[25])) + (100
              * TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100 * TCC_HIT[28])) + (100 * TCC_HIT[29]))
              + (100 * TCC_HIT[30])) + (100 * TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0]
              + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
              + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
              + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
              + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
              + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
              + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
              + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
              + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
              + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
              + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
              + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
              + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[28] + TCC_HIT[29])) + (TCC_MISS[30]
              + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
              + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
              + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
              + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
              + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
              + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
              + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
              + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
              + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
              + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
              + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
              + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
              + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[28] + TCC_HIT[29])) + (TCC_MISS[30]
              + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
            min: MIN(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
              + (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100 *
              TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
              + (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) + (100
              * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100 * TCC_HIT[15]))
              + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 * TCC_HIT[18])) + (100
              * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21])) + (100 * TCC_HIT[22]))
              + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) + (100 * TCC_HIT[25])) + (100
              * TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100 * TCC_HIT[28])) + (100 * TCC_HIT[29]))
              + (100 * TCC_HIT[30])) + (100 * TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0]
              + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
              + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
              + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
              + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
              + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
              + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
              + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
              + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
              + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
              + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
              + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
              + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[28] + TCC_HIT[29])) + (TCC_MISS[30]
              + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
              + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
              + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
              + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
              + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
              + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
              + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
              + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
              + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
              + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
              + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
              + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
              + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[28] + TCC_HIT[29])) + (TCC_MISS[30]
              + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
            max: MAX(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
              + (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100 *
              TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
              + (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) + (100
              * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100 * TCC_HIT[15]))
              + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 * TCC_HIT[18])) + (100
              * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21])) + (100 * TCC_HIT[22]))
              + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) + (100 * TCC_HIT[25])) + (100
              * TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100 * TCC_HIT[28])) + (100 * TCC_HIT[29]))
              + (100 * TCC_HIT[30])) + (100 * TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0]
              + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
              + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
              + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
              + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
              + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
              + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
              + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
              + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
              + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
              + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
              + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
              + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[28] + TCC_HIT[29])) + (TCC_MISS[30]
              + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
              + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
              + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
              + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
              + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
              + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
              + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
              + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
              + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
              + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
              + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
              + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
              + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[28] + TCC_HIT[29])) + (TCC_MISS[30]
              + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
            unit: pct
            tips:
          # FIXME: other arggr metrics!!
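The unrolled `L2 Cache Hit Rate` expression above amounts to total hits over total accesses across the 32 listed channels, guarded against a zero total. A compact Python sketch of that aggregation follows, assuming per-channel counters are available as plain integer sequences. The function name and sample values are hypothetical. The sketch pairs each channel's own hit and miss counters, whereas several denominator terms in the expression text read `(TCC_MISS[28] + TCC_HIT[29])`.

```python
from typing import Optional, Sequence


def aggregate_hit_rate(hits: Sequence[int], misses: Sequence[int]) -> Optional[float]:
    """Percent of accesses that hit, summed over all channels."""
    total_hits = sum(hits)
    total_accesses = total_hits + sum(misses)
    if total_accesses == 0:
        # Same guard as the config expression: no accesses means no value.
        return None
    return 100 * total_hits / total_accesses


# Hypothetical counters for four channels (the config lists 32).
rate = aggregate_hit_rate(hits=[90, 80, 70, 60], misses=[10, 20, 30, 40])
print(rate)
```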
    - metric_table:
        id: 1802
        title: L2 Cache Hit Rate (pct)
        header:
          metric: Channel
          expr: Expression
        metric:
          "::_1":
            expr:
              (((100 * TCC_HIT[::_1]) / (TCC_HIT[::_1] + TCC_MISS[::_1])) if ((TCC_HIT[::_1]
              + TCC_MISS[::_1]) != 0) else None)
        placeholder_range:
          "::_1": $total_l2_chan
        cli_style: simple_box
    - metric_table:
        id: 1803
        title: L2 Requests (per normUnit)
        header:
          metric: Channel
          expr: Expression
        metric:
          "::_1":
            expr: (TO_INT(TCC_REQ[::_1]) / $denom)
        placeholder_range:
          "::_1": $total_l2_chan
        cli_style: simple_box
    - metric_table:
        id: 1804
        title: L2 Requests (per normUnit)
        header:
          metric: Channel
          read req: L2 Read
          write req: L2 Write
          atomic req: L2 Atomic
        metric:
          "::_1":
            read req: AVG((TO_INT(TCC_READ[::_1]) / $denom))
            write req: AVG((TO_INT(TCC_WRITE[::_1]) / $denom))
            atomic req: AVG((TO_INT(TCC_ATOMIC[::_1]) / $denom))
        placeholder_range:
          "::_1": $total_l2_chan
        cli_style: simple_multiple_bar
    - metric_table:
        id: 1805
        title: L2-Fabric Requests (per normUnit)
        header:
          metric: Channel
          read req: L2-Fabric Read
          write req: L2-Fabric Write and Atomic
          atomic req: L2-Fabric Atomic
        metric:
          "::_1":
            read req: AVG((TO_INT(TCC_EA_RDREQ[::_1]) / $denom))
            write req: AVG((TO_INT(TCC_EA_WRREQ[::_1]) / $denom))
            atomic req: AVG((TO_INT(TCC_EA_ATOMIC[::_1]) / $denom))
        placeholder_range:
          "::_1": $total_l2_chan
        cli_style: simple_multiple_bar
    # - metric_table:
    #     id: 1806
    #     title: L2-EA Latency (Cycles)
    #     header:
    #       metric: Metric
    #       read lat: L2-EA Read
    #       write lat: L2-EA Write
    #       atomic lat: L2-EA Atomic
    #     metric:
    #       "::_1":
    #         read lat:
    #           AVG(((TCC_EA_RDREQ_LEVEL[::_1] / TCC_EA_RDREQ[::_1]) if (TCC_EA_RDREQ[::_1]
    #           != 0) else None))
    #         write lat:
    #           AVG(((TCC_EA_WRREQ_LEVEL[::_1] / TCC_EA_WRREQ[::_1]) if (TCC_EA_WRREQ[::_1]
    #           != 0) else None))
    #         atomic lat:
    #           AVG(((TCC_EA_ATOMIC_LEVEL[::_1] / TCC_EA_ATOMIC[::_1]) if
    #           (TCC_EA_ATOMIC[::_1] != 0) else 0))
    #     placeholder_range:
    #       "::_1": 32
    #     cli_style: simple_multiple_bar
    - metric_table:
        id: 1806
        title: L2-Fabric Read Latency (Cycles)
        header:
          metric: Channel
          expr: Expression
        metric:
          "::_1":
            expr:
              ((TCC_EA_RDREQ_LEVEL[::_1] / TCC_EA_RDREQ[::_1]) if (TCC_EA_RDREQ[::_1]
              != 0) else None)
        placeholder_range:
          "::_1": $total_l2_chan
        cli_style: simple_box
    - metric_table:
        id: 1807
        title: L2-Fabric Write and Atomic Latency (Cycles)
        header:
          metric: Channel
          expr: Expression
        metric:
          "::_1":
            expr:
              ((TCC_EA_WRREQ_LEVEL[::_1] / TCC_EA_WRREQ[::_1]) if (TCC_EA_WRREQ[::_1]
              != 0) else None)
        placeholder_range:
          "::_1": $total_l2_chan
        cli_style: simple_box
    - metric_table:
        id: 1808
        title: L2-Fabric Atomic Latency (Cycles)
        header:
          metric: Channel
          expr: Expression
        metric:
          "::_1":
            expr: ((TCC_EA_ATOMIC_LEVEL[::_1] / TCC_EA_ATOMIC[::_1]) if
              (TCC_EA_ATOMIC[::_1] != 0) else 0)
        placeholder_range:
          "::_1": $total_l2_chan
        cli_style: simple_box
    - metric_table:
        id: 1809
        title: L2-Fabric Read Stall (Cycles per normUnit)
        header:
          metric: Channel
          ea read stall - pcie: L2-Fabric Read Stall (PCIe)
          ea read stall - if: L2-Fabric Read Stall (Infinity Fabric™)
          ea read stall - hbm: L2-Fabric Read Stall (HBM)
        metric:
          "::_1":
            ea read stall - pcie: None # Missing perfmon
            ea read stall - if: None # Missing perfmon
            ea read stall - hbm: None # Missing perfmon
        placeholder_range:
          "::_1": $total_l2_chan
        cli_style: simple_multiple_bar
    - metric_table:
        id: 1810
        title: L2-Fabric Write and Atomic Stall (Cycles per normUnit)
        header:
          metric: Channel
          ea write stall - pcie: L2-Fabric Write Stall (PCIe)
          ea write stall - if: L2-Fabric Write Stall (Infinity Fabric™)
          ea write stall - hbm: L2-Fabric Write Stall (HBM)
          ea write stall - starve: L2-Fabric Write Starve
        metric:
          "::_1":
            ea write stall - pcie: None # Missing perfmon
            ea write stall - if: None # Missing perfmon
            ea write stall - hbm: None # Missing perfmon
            ea write stall - starve: AVG((TO_INT(TCC_TOO_MANY_EA_WRREQS_STALL[::_1]) / $denom))
        placeholder_range:
          "::_1": $total_l2_chan
        cli_style: simple_multiple_bar
    - metric_table:
        id: 1812
        title: L2-Fabric (128B read requests per normUnit)
        header:
          metric: Channel
          expr: Expression
        metric:
          "::_1":
            expr: (TO_INT(TCC_BUBBLE[::_1]) / $denom)
        placeholder_range:
          "::_1": $total_l2_chan
        # tips: Number of 128-byte read requests sent to EA
        cli_style: simple_box
        tui_style: simple_box
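The per-channel tables above use a `"::_1"` placeholder with `placeholder_range` set to `$total_l2_chan`, which reads as one expression per L2 channel index. The diff does not show how the analyzer performs that expansion, so the following Python sketch is only one plausible reading: plain string substitution over `range(count)`. The helper name `expand_placeholder` and the channel count of 4 are assumptions made for illustration.

```python
from typing import List


def expand_placeholder(template: str, placeholder: str, count: int) -> List[str]:
    """Produce one expression per index by substituting the placeholder token."""
    return [template.replace(placeholder, str(index)) for index in range(count)]


# Template copied from table 1803; a count of 4 stands in for $total_l2_chan.
expressions = expand_placeholder("(TO_INT(TCC_REQ[::_1]) / $denom)", "::_1", 4)
for expression in expressions:
    print(expression)
```

Under this reading, a template that mentions the placeholder several times (as in table 1802) gets the same index substituted at every occurrence.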
+323
@@ -0,0 +1,323 @@
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
Panel Config:
  id: 1800
  title: L2 Cache (per Channel)
  metrics_description:
    L2 Cache Hit Rate: The percent of total number of requests to the L2 from all
      clients that hit in the cache. As noted in the Speed-of-Light section, this
      includes hit-on-miss requests.
  data source:
    - metric_table:
        id: 1801
        title: Aggregate Stats (All channels)
        header:
          metric: Metric
          avg: Avg
          std dev: Std Dev
          min: Min
          max: Max
          unit: Unit
        metric:
          L2 Cache Hit Rate:
            avg: AVG(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
              + (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100
              * TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
              + (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) +
              (100 * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100
              * TCC_HIT[15])) + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 *
              TCC_HIT[18])) + (100 * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21]))
              + (100 * TCC_HIT[22])) + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) +
              (100 * TCC_HIT[25])) + (100 * TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100
              * TCC_HIT[28])) + (100 * TCC_HIT[29])) + (100 * TCC_HIT[30])) + (100 *
              TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0] + TCC_HIT[0])
              + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2])) + (TCC_MISS[3]
              + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5] + TCC_HIT[5]))
              + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7])) + (TCC_MISS[8]
              + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10] + TCC_HIT[10]))
              + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12])) + (TCC_MISS[13]
              + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15] + TCC_HIT[15]))
              + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17])) + (TCC_MISS[18]
              + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20] + TCC_HIT[20]))
              + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22])) + (TCC_MISS[23]
              + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25] + TCC_HIT[25]))
              + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27])) + (TCC_MISS[28]
              + TCC_HIT[28])) + (TCC_MISS[28] + TCC_HIT[29])) + (TCC_MISS[30] + TCC_HIT[30]))
              + (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
              + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
              + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
              + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
              + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
              + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
              + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
              + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
              + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
              + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
              + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
              + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
              + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
              + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
            std dev: STD(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100
              * TCC_HIT[1])) + (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4]))
              + (100 * TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100
              * TCC_HIT[8])) + (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11]))
              + (100 * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) +
              (100 * TCC_HIT[15])) + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100
              * TCC_HIT[18])) + (100 * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 *
              TCC_HIT[21])) + (100 * TCC_HIT[22])) + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24]))
              + (100 * TCC_HIT[25])) + (100 * TCC_HIT[26])) + (100 * TCC_HIT[27])) +
              (100 * TCC_HIT[28])) + (100 * TCC_HIT[29])) + (100 * TCC_HIT[30])) + (100
              * TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0] + TCC_HIT[0])
              + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2])) + (TCC_MISS[3]
              + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5] + TCC_HIT[5]))
              + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7])) + (TCC_MISS[8]
              + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10] + TCC_HIT[10]))
              + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12])) + (TCC_MISS[13]
              + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15] + TCC_HIT[15]))
              + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17])) + (TCC_MISS[18]
              + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20] + TCC_HIT[20]))
              + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22])) + (TCC_MISS[23]
              + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25] + TCC_HIT[25]))
              + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27])) + (TCC_MISS[28]
              + TCC_HIT[28])) + (TCC_MISS[28] + TCC_HIT[29])) + (TCC_MISS[30] + TCC_HIT[30]))
              + (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
              + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
              + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
              + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
              + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
              + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
              + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
              + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
              + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
              + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
              + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
              + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
              + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[28] + TCC_HIT[29])) + (TCC_MISS[30]
              + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
min: MIN(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
|
||||
+ (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100
|
||||
* TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
|
||||
+ (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) +
|
||||
(100 * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100
|
||||
* TCC_HIT[15])) + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 *
|
||||
TCC_HIT[18])) + (100 * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21]))
|
||||
+ (100 * TCC_HIT[22])) + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) +
|
||||
(100 * TCC_HIT[25])) + (100 * TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100
|
||||
* TCC_HIT[28])) + (100 * TCC_HIT[29])) + (100 * TCC_HIT[30])) + (100 *
|
||||
TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0] + TCC_HIT[0])
|
||||
+ (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2])) + (TCC_MISS[3]
|
||||
+ TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5] + TCC_HIT[5]))
|
||||
+ (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7])) + (TCC_MISS[8]
|
||||
+ TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10] + TCC_HIT[10]))
|
||||
+ (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12])) + (TCC_MISS[13]
|
||||
+ TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15] + TCC_HIT[15]))
|
||||
+ (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17])) + (TCC_MISS[18]
|
||||
+ TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20] + TCC_HIT[20]))
|
||||
+ (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22])) + (TCC_MISS[23]
|
||||
+ TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25] + TCC_HIT[25]))
|
||||
+ (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27])) + (TCC_MISS[28]
|
||||
+ TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30] + TCC_HIT[30]))
|
||||
+ (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
|
||||
+ TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
|
||||
+ (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
|
||||
+ TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
|
||||
+ (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
|
||||
+ TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
|
||||
+ (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
|
||||
+ TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
|
||||
+ (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
|
||||
+ TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
|
||||
+ (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
|
||||
+ TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
|
||||
+ (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
|
||||
+ TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
|
||||
max: MAX(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
|
||||
+ (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100
|
||||
* TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
|
||||
+ (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) +
|
||||
(100 * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100
|
||||
* TCC_HIT[15])) + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 *
|
||||
TCC_HIT[18])) + (100 * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21]))
|
||||
+ (100 * TCC_HIT[22])) + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) +
|
||||
(100 * TCC_HIT[25])) + (100 * TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100
|
||||
* TCC_HIT[28])) + (100 * TCC_HIT[29])) + (100 * TCC_HIT[30])) + (100 *
|
||||
TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0] + TCC_HIT[0])
|
||||
+ (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2])) + (TCC_MISS[3]
|
||||
+ TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5] + TCC_HIT[5]))
|
||||
+ (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7])) + (TCC_MISS[8]
|
||||
+ TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10] + TCC_HIT[10]))
|
||||
+ (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12])) + (TCC_MISS[13]
|
||||
+ TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15] + TCC_HIT[15]))
|
||||
+ (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17])) + (TCC_MISS[18]
|
||||
+ TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20] + TCC_HIT[20]))
|
||||
+ (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22])) + (TCC_MISS[23]
|
||||
+ TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25] + TCC_HIT[25]))
|
||||
+ (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27])) + (TCC_MISS[28]
|
||||
+ TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30] + TCC_HIT[30]))
|
||||
+ (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
|
||||
+ TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
|
||||
+ (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
|
||||
+ TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
|
||||
+ (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
|
||||
+ TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
|
||||
+ (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
|
||||
+ TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
|
||||
+ (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
|
||||
+ TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
|
||||
+ (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
|
||||
+ TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
|
||||
+ (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
|
||||
+ TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
|
||||
unit: pct
|
||||
- metric_table:
|
||||
id: 1802
|
||||
title: L2 Cache Hit Rate (pct)
|
||||
header:
|
||||
metric: Channel
|
||||
expr: Expression
|
||||
metric:
|
||||
::_1:
|
||||
expr: (((100 * TCC_HIT[::_1]) / (TCC_HIT[::_1] + TCC_MISS[::_1])) if ((TCC_HIT[::_1]
|
||||
+ TCC_MISS[::_1]) != 0) else None)
|
||||
placeholder_range:
|
||||
::_1: $total_l2_chan
|
||||
cli_style: simple_box
|
||||
tui_style: simple_box
|
||||
- metric_table:
|
||||
id: 1803
|
||||
title: L2 Requests (per normUnit)
|
||||
header:
|
||||
metric: Channel
|
||||
expr: Expression
|
||||
metric:
|
||||
::_1:
|
||||
expr: (TO_INT(TCC_REQ[::_1]) / $denom)
|
||||
placeholder_range:
|
||||
::_1: $total_l2_chan
|
||||
cli_style: simple_box
|
||||
tui_style: simple_box
|
||||
- metric_table:
|
||||
id: 1804
|
||||
title: L2 Requests (per normUnit)
|
||||
header:
|
||||
metric: Channel
|
||||
read req: L2 Read
|
||||
write req: L2 Write
|
||||
atomic req: L2 Atomic
|
||||
metric:
|
||||
::_1:
|
||||
read req: AVG((TO_INT(TCC_READ[::_1]) / $denom))
|
||||
write req: AVG((TO_INT(TCC_WRITE[::_1]) / $denom))
|
||||
atomic req: AVG((TO_INT(TCC_ATOMIC[::_1]) / $denom))
|
||||
placeholder_range:
|
||||
::_1: $total_l2_chan
|
||||
cli_style: simple_multiple_bar
|
||||
tui_style: simple_multiple_bar
|
||||
- metric_table:
|
||||
id: 1805
|
||||
title: L2-Fabric Requests (per normUnit)
|
||||
header:
|
||||
metric: Channel
|
||||
read req: L2-Fabric Read
|
||||
write req: L2-Fabric Write and Atomic
|
||||
atomic req: L2-Fabric Atomic
|
||||
metric:
|
||||
::_1:
|
||||
read req: AVG((TO_INT(TCC_EA_RDREQ[::_1]) / $denom))
|
||||
write req: AVG((TO_INT(TCC_EA_WRREQ[::_1]) / $denom))
|
||||
atomic req: AVG((TO_INT(TCC_EA_ATOMIC[::_1]) / $denom))
|
||||
placeholder_range:
|
||||
::_1: $total_l2_chan
|
||||
cli_style: simple_multiple_bar
|
||||
tui_style: simple_multiple_bar
|
||||
- metric_table:
|
||||
id: 1806
|
||||
title: L2-Fabric Read Latency (Cycles)
|
||||
header:
|
||||
metric: Channel
|
||||
expr: Expression
|
||||
metric:
|
||||
::_1:
|
||||
expr: ((TCC_EA_RDREQ_LEVEL[::_1] / TCC_EA_RDREQ[::_1]) if (TCC_EA_RDREQ[::_1]
|
||||
!= 0) else None)
|
||||
placeholder_range:
|
||||
::_1: $total_l2_chan
|
||||
cli_style: simple_box
|
||||
tui_style: simple_box
|
||||
- metric_table:
|
||||
id: 1807
|
||||
title: L2-Fabric Write and Atomic Latency (Cycles)
|
||||
header:
|
||||
metric: Channel
|
||||
expr: Expression
|
||||
metric:
|
||||
::_1:
|
||||
expr: ((TCC_EA_WRREQ_LEVEL[::_1] / TCC_EA_WRREQ[::_1]) if (TCC_EA_WRREQ[::_1]
|
||||
!= 0) else None)
|
||||
placeholder_range:
|
||||
::_1: $total_l2_chan
|
||||
cli_style: simple_box
|
||||
tui_style: simple_box
|
||||
- metric_table:
|
||||
id: 1808
|
||||
title: L2-Fabric Atomic Latency (Cycles)
|
||||
header:
|
||||
metric: Channel
|
||||
expr: Expression
|
||||
metric:
|
||||
::_1:
|
||||
expr: ((TCC_EA_ATOMIC_LEVEL[::_1] / TCC_EA_ATOMIC[::_1]) if (TCC_EA_ATOMIC[::_1]
|
||||
!= 0) else 0)
|
||||
placeholder_range:
|
||||
::_1: $total_l2_chan
|
||||
cli_style: simple_box
|
||||
tui_style: simple_box
|
||||
- metric_table:
|
||||
id: 1809
|
||||
title: L2-Fabric Read Stall (Cycles per normUnit)
|
||||
header:
|
||||
metric: Channel
|
||||
ea read stall - pcie: L2-Fabric Read Stall (PCIe)
|
||||
ea read stall - if: "L2-Fabric Read Stall (Infinity Fabric\u2122)"
|
||||
ea read stall - hbm: L2-Fabric Read Stall (HBM)
|
||||
metric:
|
||||
::_1:
|
||||
ea read stall - pcie: None
|
||||
ea read stall - if: None
|
||||
ea read stall - hbm: None
|
||||
placeholder_range:
|
||||
::_1: $total_l2_chan
|
||||
cli_style: simple_multiple_bar
|
||||
tui_style: simple_multiple_bar
|
||||
- metric_table:
|
||||
id: 1810
|
||||
title: L2-Fabric Write and Atomic Stall (Cycles per normUnit)
|
||||
header:
|
||||
metric: Channel
|
||||
ea write stall - pcie: L2-Fabric Write Stall (PCIe)
|
||||
ea write stall - if: "L2-Fabric Write Stall (Infinity Fabric\u2122)"
|
||||
ea write stall - hbm: L2-Fabric Write Stall (HBM)
|
||||
ea write stall - starve: L2-Fabric Write Starve
|
||||
metric:
|
||||
::_1:
|
||||
ea write stall - pcie: None
|
||||
ea write stall - if: None
|
||||
ea write stall - hbm: None
|
||||
ea write stall - starve: AVG((TO_INT(TCC_TOO_MANY_EA_WRREQS_STALL[::_1])
|
||||
/ $denom))
|
||||
placeholder_range:
|
||||
::_1: $total_l2_chan
|
||||
cli_style: simple_multiple_bar
|
||||
tui_style: simple_multiple_bar
|
||||
- metric_table:
|
||||
id: 1812
|
||||
title: L2-Fabric (128B read requests per normUnit)
|
||||
header:
|
||||
metric: Channel
|
||||
expr: Expression
|
||||
metric:
|
||||
::_1:
|
||||
expr: (TO_INT(TCC_BUBBLE[::_1]) / $denom)
|
||||
placeholder_range:
|
||||
::_1: $total_l2_chan
|
||||
cli_style: simple_box
|
||||
tui_style: simple_box
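The per-channel tables above use a `::_1` placeholder together with `placeholder_range`, so that one templated expression stands in for one row per L2 channel. The following is a minimal sketch of that idea, not the profiler's actual implementation: `expand_placeholder` and `hit_rate` are hypothetical helper names, and the channel count of 4 is an arbitrary illustration rather than a real `$total_l2_chan` value.

```python
from typing import Dict, Optional


def expand_placeholder(template: str, count: int, placeholder: str = "::_1") -> Dict[int, str]:
    """Return one concrete expression string per channel index (hypothetical helper)."""
    return {i: template.replace(placeholder, str(i)) for i in range(count)}


def hit_rate(hit: float, miss: float) -> Optional[float]:
    """Mirror the guarded expression used by table 1802:
    (100 * HIT) / (HIT + MISS) if (HIT + MISS) != 0 else None."""
    total = hit + miss
    return (100 * hit) / total if total != 0 else None


# Expand the template from table 1803 for an illustrative 4 channels.
template = "(TO_INT(TCC_REQ[::_1]) / $denom)"
exprs = expand_placeholder(template, 4)
for channel, expr in exprs.items():
    print(channel, expr)

# The zero-denominator guard yields None instead of raising ZeroDivisionError.
print(hit_rate(3, 1))
print(hit_rate(0, 0))
```

Returning `None` for an idle channel keeps it distinguishable from a channel with a genuine 0% hit rate. Table 1808 falls back to `0` instead of `None`, so an idle channel there would read as zero latency.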
|
||||
+7
-6
@@ -1,10 +1,11 @@
|
||||
---
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 2100
|
||||
title: PC Sampling
|
||||
metrics_description: {}
|
||||
data source:
|
||||
- pc_sampling_table:
|
||||
id: 2101
|
||||
title: PC Sampling
|
||||
source: ps_file
|
||||
comparable: false # enable it later
|
||||
- pc_sampling_table:
|
||||
id: 2101
|
||||
title: PC Sampling
|
||||
source: ps_file
|
||||
comparable: false
|
||||
|
||||
+11
-11
@@ -1,14 +1,14 @@
|
||||
---
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 000
|
||||
id: 0
|
||||
title: Top Stats
|
||||
metrics_description: {}
|
||||
data source:
|
||||
- raw_csv_table:
|
||||
id: 001
|
||||
title: Top Kernels
|
||||
source: pmc_kernel_top.csv
|
||||
|
||||
- raw_csv_table:
|
||||
id: 002
|
||||
title: Dispatch List
|
||||
source: pmc_dispatch_info.csv
|
||||
- raw_csv_table:
|
||||
id: 1
|
||||
title: Top Kernels
|
||||
source: pmc_kernel_top.csv
|
||||
- raw_csv_table:
|
||||
id: 2
|
||||
title: Dispatch List
|
||||
source: pmc_dispatch_info.csv
|
||||
|
||||
+6
-5
@@ -1,9 +1,10 @@
|
||||
---
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 100
|
||||
title: System Info
|
||||
metrics_description: {}
|
||||
data source:
|
||||
- raw_csv_table:
|
||||
id: 101
|
||||
source: sysinfo.csv
|
||||
columnwise: True
|
||||
- raw_csv_table:
|
||||
id: 101
|
||||
source: sysinfo.csv
|
||||
columnwise: true
|
||||
|
||||
-262
@@ -1,262 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
SALU: &SALU_anchor Scalar Arithmetic Logic Unit
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 200
|
||||
title: System Speed-of-Light
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 201
|
||||
title: Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
peak: Peak
|
||||
pop: Pct of Peak
|
||||
tips: Tips
|
||||
metric:
|
||||
VALU FLOPs:
|
||||
value: AVG(((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) + SQ_INSTS_VALU_TRANS_F16)
|
||||
+ (2 * SQ_INSTS_VALU_FMA_F16))) + (64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32)
|
||||
+ SQ_INSTS_VALU_TRANS_F32) + (2 * SQ_INSTS_VALU_FMA_F32)))) + (64 * (((SQ_INSTS_VALU_ADD_F64
|
||||
+ SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64) + (2 * SQ_INSTS_VALU_FMA_F64))))
|
||||
/ (End_Timestamp - Start_Timestamp)))
|
||||
unit: GFLOP/s
|
||||
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
|
||||
pop: ((100 * AVG(((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16)
|
||||
+ SQ_INSTS_VALU_TRANS_F16) + (2 * SQ_INSTS_VALU_FMA_F16))) + (64 * (((SQ_INSTS_VALU_ADD_F32
|
||||
+ SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (2 * SQ_INSTS_VALU_FMA_F32))))
|
||||
+ (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
|
||||
+ (2 * SQ_INSTS_VALU_FMA_F64)))) / (End_Timestamp - Start_Timestamp)))) / (((($max_sclk
|
||||
* $cu_per_gpu) * 64) * 2) / 1000))
|
||||
tips:
|
||||
VALU IOPs:
|
||||
value: AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GIOP/s
|
||||
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
|
||||
pop: ((100 * AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) / (End_Timestamp
|
||||
- Start_Timestamp)))) / (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000))
|
||||
tips:
|
||||
MFMA FLOPs (F8):
|
||||
value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
|
||||
pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk * $cu_per_gpu) * 4096) / 1000))
|
||||
tips:
|
||||
MFMA FLOPs (BF16):
|
||||
value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 2048) / 1000)
|
||||
pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk * $cu_per_gpu) * 2048) / 1000))
|
||||
tips:
|
||||
MFMA FLOPs (F16):
|
||||
value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 2048) / 1000)
|
||||
pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk * $cu_per_gpu) * 2048) / 1000))
|
||||
tips:
|
||||
MFMA FLOPs (F32):
|
||||
value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
|
||||
pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk * $cu_per_gpu) * 256) / 1000))
|
||||
tips:
|
||||
MFMA FLOPs (F64):
|
||||
value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
|
||||
pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk * $cu_per_gpu) * 256) / 1000))
|
||||
tips:
|
||||
MFMA IOPs (Int8):
|
||||
value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GIOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
|
||||
pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk * $cu_per_gpu) * 4096) / 1000))
|
||||
tips:
|
||||
Active CUs:
|
||||
value: $numActiveCUs
|
||||
unit: CUs
|
||||
peak: $cu_per_gpu
|
||||
pop: ((100 * $numActiveCUs) / $cu_per_gpu)
|
||||
tips:
|
||||
SALU Utilization:
|
||||
value: AVG(((100 * SQ_ACTIVE_INST_SCA) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG(((100 * SQ_ACTIVE_INST_SCA) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
tips:
|
||||
VALU Utilization:
|
||||
value: AVG(((100 * SQ_ACTIVE_INST_VALU) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG(((100 * SQ_ACTIVE_INST_VALU) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
tips:
|
||||
MFMA Utilization:
|
||||
value: AVG(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / (($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)
|
||||
* 4)))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / (($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)
|
||||
* 4)))
|
||||
tips:
|
||||
VMEM Utilization:
|
||||
value: AVG((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
|
||||
tips:
|
||||
Branch Utilization:
|
||||
value: AVG((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
|
||||
tips:
|
||||
VALU Active Threads:
|
||||
value: AVG(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
|
||||
!= 0) else None))
|
||||
unit: Threads
|
||||
peak: $wave_size
|
||||
pop: (100 * AVG((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU / $wave_size) if (SQ_ACTIVE_INST_VALU != 0) else None))
|
||||
tips:
|
||||
IPC:
|
||||
value: AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))
|
||||
unit: Instr/cycle
|
||||
peak: 5
|
||||
pop: ((100 * AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))) / 5)
|
||||
tips:
|
||||
Wavefront Occupancy:
|
||||
value: AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
unit: Wavefronts
|
||||
peak: ($max_waves_per_cu * $cu_per_gpu)
|
||||
pop: (100 * AVG(((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD) / ($max_waves_per_cu
|
||||
* $cu_per_gpu))))
|
||||
coll_level: SQ_LEVEL_WAVES
|
||||
tips:
|
||||
Theoretical LDS Bandwidth:
|
||||
value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
peak: (($max_sclk * $cu_per_gpu) * 0.128)
|
||||
pop: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
|
||||
tips:
|
||||
LDS Bank Conflicts/Access:
|
||||
value: AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
|
||||
unit: Conflicts/access
|
||||
peak: 32
|
||||
pop: ((100 * AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))) / 32)
|
||||
tips:
|
||||
vL1D Cache Hit Rate:
|
||||
value: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
|
||||
None))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) +
|
||||
TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) /
|
||||
TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
|
||||
None))
|
||||
tips:
|
||||
vL1D Cache BW:
|
||||
value: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
peak: ((($max_sclk / 1000) * 128) * $cu_per_gpu)
|
||||
pop: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk / 1000) * 128) * $cu_per_gpu))
|
||||
tips:
|
||||
L2 Cache Hit Rate:
|
||||
value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
tips:
|
||||
L2 Cache BW:
|
||||
value: AVG(((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
peak: ((($max_sclk / 1000) * 128) * TO_INT($total_l2_chan))
|
||||
pop: ((100 * AVG(((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk / 1000) * 128) * TO_INT($total_l2_chan)))
|
||||
tips:
|
||||
L2-Fabric Read BW:
|
||||
value: AVG((128 * TCC_BUBBLE_sum +
|
||||
64 * (TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) +
|
||||
32 * TCC_EA0_RDREQ_32B_sum) / (End_Timestamp - Start_Timestamp))
|
||||
unit: GB/s
|
||||
peak: $hbmBandwidth
|
||||
pop: ((100 * (AVG((128 * TCC_BUBBLE_sum +
|
||||
64 * (TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) +
|
||||
32 * TCC_EA0_RDREQ_32B_sum) / (End_Timestamp - Start_Timestamp)))) / $hbmBandwidth)
|
||||
tips:
|
||||
L2-Fabric Write BW:
|
||||
value: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
|
||||
* 32)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
peak: $hbmBandwidth
|
||||
pop: ((100 * AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
|
||||
* 32)) / (End_Timestamp - Start_Timestamp)))) / $hbmBandwidth)
|
||||
tips:
|
||||
L2-Fabric Read Latency:
|
||||
value: AVG(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
peak: None
|
||||
pop: None
|
||||
tips:
|
||||
L2-Fabric Write Latency:
|
||||
value: AVG(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
peak: None
|
||||
pop: None
|
||||
tips:
|
||||
sL1D Cache Hit Rate:
|
||||
value: AVG((((100 * SQC_DCACHE_HITS) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES))
|
||||
if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES) != 0) else None))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG((((100 * SQC_DCACHE_HITS) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES))
|
||||
if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES) != 0) else None))
|
||||
tips:
|
||||
sL1D Cache BW:
|
||||
value: AVG(((SQC_DCACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))
|
||||
unit: GB/s
|
||||
peak: ((($max_sclk / 1000) * 64) * $sqc_per_gpu)
|
||||
pop: ((100 * AVG(((SQC_DCACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))) / ((($max_sclk
|
||||
/ 1000) * 64) * $sqc_per_gpu))
|
||||
tips:
|
||||
L1I Hit Rate:
|
||||
value: AVG(((100 * SQC_ICACHE_HITS) / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES)))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG(((100 * SQC_ICACHE_HITS) / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES)))
|
||||
tips:
|
||||
L1I BW:
|
||||
value: AVG(((SQC_ICACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))
|
||||
unit: GB/s
|
||||
peak: ((($max_sclk / 1000) * 64) * $sqc_per_gpu)
|
||||
pop: ((100 * AVG(((SQC_ICACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))) / ((($max_sclk
|
||||
/ 1000) * 64) * $sqc_per_gpu))
|
||||
tips:
|
||||
L1I Fetch Latency:
|
||||
value: AVG((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
|
||||
unit: Cycles
|
||||
peak: None
|
||||
pop: None
|
||||
coll_level: SQ_IFETCH_LEVEL
|
||||
tips:
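Each Speed-of-Light entry in the removed file pairs a `value` with a `peak` and a `pop` (percent of peak), and `pop` generally restates `100 * value / peak`. The sketch below shows that relationship for the VALU FLOPs peak expression. It is illustrative only: the function names are hypothetical, the clock and CU counts are made-up round numbers rather than figures for any real accelerator, and reading `$max_sclk` as MHz is an assumption inferred from the `/ 1000` and the GFLOP/s unit.

```python
from typing import Optional


def valu_flops_peak(max_sclk_mhz: float, cu_per_gpu: int) -> float:
    """Mirror the `peak` expression: (((max_sclk * cu_per_gpu) * 64) * 2) / 1000."""
    return (((max_sclk_mhz * cu_per_gpu) * 64) * 2) / 1000


def percent_of_peak(value: Optional[float], peak: Optional[float]) -> Optional[float]:
    """Return 100 * value / peak, or None when either input is missing or peak is zero."""
    if value is None or not peak:
        return None
    return 100 * value / peak


# Made-up inputs: a 1000 MHz clock and 100 compute units.
peak = valu_flops_peak(1000, 100)
pop = percent_of_peak(3200.0, peak)
print(peak, pop)
```

Entries such as L2-Fabric Read Latency set both `peak` and `pop` to `None`, which is why the helper passes a missing peak through as `None` rather than dividing.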
|
||||
+346
@@ -0,0 +1,346 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 200
|
||||
title: System Speed-of-Light
|
||||
metrics_description:
|
||||
VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
|
||||
This is also presented as a percent of the peak theoretical FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any floating-point
|
||||
operations from MFMA instructions.'
|
||||
VALU IOPs: 'The total integer operations executed per second on the VALU. This
|
||||
is also presented as a percent of the peak theoretical IOPs achievable on the
|
||||
specific accelerator. Note: this does not include any integer operations from
|
||||
MFMA instructions.'
|
||||
MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
|
||||
executed per second. This does not include any 8-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F8 MFMA operations achievable on the specific accelerator. It is supported on
|
||||
AMD Instinct MI300 series and later only.
|
||||
MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
|
||||
executed per second. Note: this does not include any 16-bit brain floating point
|
||||
operations from VALU instructions. This is also presented as a percent of the
|
||||
peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F16 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 32-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F32 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 64-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F64 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
|
||||
per second. Note: this does not include any 8-bit integer operations from VALU
|
||||
instructions. This is also presented as a percent of the peak theoretical INT8
|
||||
MFMA operations achievable on the specific accelerator.'
|
||||
Active CUs: Total number of active compute units (CUs) on the accelerator during
|
||||
the kernel execution.
|
||||
SALU Utilization: Indicates what percent of the kernel's duration the SALU was
|
||||
busy executing instructions. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
|
||||
VALU Utilization: Indicates what percent of the kernel's duration the VALU was
|
||||
busy executing instructions. Does not include VMEM operations. Computed as the
|
||||
ratio of the total number of cycles spent by the scheduler issuing VALU instructions
|
||||
over the total CU cycles.
|
||||
MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
|
||||
was busy executing instructions. Computed as the ratio of the total number of
|
||||
cycles the MFMA was busy over the total CU cycles.
|
||||
VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
|
||||
was busy executing instructions, including both global/generic and spill/scratch
|
||||
operations (see the VMEM instruction count metrics for more detail). Does not
|
||||
include VALU operations. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing VMEM instructions over the total CU cycles.
|
||||
Branch Utilization: Indicates what percent of the kernel's duration the branch
|
||||
unit was busy executing instructions. Computed as the ratio of the total number
|
||||
of cycles spent by the scheduler issuing branch instructions over the total
|
||||
CU cycles.
|
||||
VALU Active Threads: Indicates the average level of divergence within a wavefront
|
||||
over the lifetime of the kernel. The number of work-items that were active in
|
||||
a wavefront during execution of each VALU instruction, time-averaged over all
|
||||
VALU instructions run on all wavefronts in the kernel.
|
||||
IPC: The ratio of the total number of instructions executed on the CU over the
|
||||
total active CU cycles. This is also presented as a percent of the peak theoretical
|
||||
IPC achievable on the specific accelerator.
|
||||
Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
|
||||
over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
||||
kernels (less than 1ms). This is also presented as a percent of the peak theoretical
|
||||
occupancy achievable on the specific accelerator.'
|
||||
Theoretical LDS Bandwidth: Indicates the maximum amount of bytes that could have
|
||||
been loaded from, stored to, or atomically updated in the LDS per unit time
|
||||
(see LDS Bandwidth example for more detail). This is also presented as a percent
|
||||
of the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
LDS Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS
|
||||
scheduler due to bank conflicts (as determined by the conflict resolution hardware)
|
||||
to the base number of cycles that would be spent in the LDS scheduler in a completely
|
||||
uncontended case. This is also presented in normalized form (i.e., the Bank
|
||||
Conflict Rate).
|
||||
vL1D Cache Hit Rate: The ratio of the number of vL1D cache line requests that
|
||||
hit in vL1D cache over the total number of cache line requests to the vL1D cache
|
||||
RAM.
|
||||
vL1D Cache BW: The number of bytes looked up in the vL1D cache as a result of
|
||||
VMEM instructions per unit time. The number of bytes is calculated as the number
|
||||
of cache lines requested multiplied by the cache line size. This value does
|
||||
not consider partial requests, so e.g., if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
This is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L2 Cache Hit Rate: The ratio of the number of L2 cache line requests that hit
|
||||
in the L2 cache over the total number of incoming cache line requests to the
|
||||
L2 cache.
|
||||
L2 Cache BW: The number of bytes looked up in the L2 cache per unit time. The
|
||||
number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line. This is also presented as a percent of
|
||||
the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Read BW: "The number of bytes read by the L2 over the Infinity Fabric\u2122\
|
||||
\ interface per unit time. This is also presented as a percent of the peak theoretical\
|
||||
\ bandwidth achievable on the specific accelerator."
|
||||
L2-Fabric Write BW: The number of bytes sent by the L2 over the Infinity Fabric
|
||||
interface by write and atomic operations per unit time. This is also presented
|
||||
as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Read Latency: The time-averaged number of cycles read requests spent
|
||||
in Infinity Fabric before data was returned to the L2.
|
||||
L2-Fabric Write Latency: The time-averaged number of cycles write requests spent
|
||||
in Infinity Fabric before a completion acknowledgement was returned to the L2.
|
||||
sL1D Cache Hit Rate: The percent of sL1D requests that hit on a previously loaded
|
||||
line in the cache. Calculated as the ratio of the number of sL1D requests that
|
||||
hit over the number of all sL1D requests.
|
||||
sL1D Cache BW: The number of bytes looked up in the sL1D cache per unit time.
|
||||
This is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L1I Hit Rate: The percent of L1I requests that hit on a previously loaded line in the
|
||||
cache. Calculated as the ratio of the number of L1I requests that hit over the
|
||||
number of all L1I requests.
|
||||
L1I BW: The number of bytes looked up in the L1I cache per unit time. This is also
|
||||
presented as a percent of the peak theoretical bandwidth achievable on the
|
||||
specific accelerator.
|
||||
L1I Fetch Latency: The average number of cycles spent to fetch instructions to
|
||||
a CU.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 201
|
||||
title: System Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
peak: Peak
|
||||
pop: Pct of Peak
|
||||
metric:
|
||||
VALU FLOPs:
|
||||
value: AVG(((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) +
|
||||
SQ_INSTS_VALU_TRANS_F16) + (2 * SQ_INSTS_VALU_FMA_F16))) + (64 * (((SQ_INSTS_VALU_ADD_F32
|
||||
+ SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (2 * SQ_INSTS_VALU_FMA_F32))))
|
||||
+ (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
|
||||
+ (2 * SQ_INSTS_VALU_FMA_F64)))) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GFLOP/s
|
||||
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
|
||||
pop: ((100 * AVG(((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16)
|
||||
+ SQ_INSTS_VALU_TRANS_F16) + (2 * SQ_INSTS_VALU_FMA_F16))) + (64 * (((SQ_INSTS_VALU_ADD_F32
|
||||
+ SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (2 * SQ_INSTS_VALU_FMA_F32))))
|
||||
+ (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
|
||||
+ (2 * SQ_INSTS_VALU_FMA_F64)))) / (End_Timestamp - Start_Timestamp))))
|
||||
/ (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000))
|
||||
VALU IOPs:
|
||||
value: AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) / (End_Timestamp
|
||||
- Start_Timestamp)))
|
||||
unit: GIOP/s
|
||||
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
|
||||
pop: ((100 * AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) / (End_Timestamp
|
||||
- Start_Timestamp)))) / (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000))
|
||||
MFMA FLOPs (F8):
|
||||
value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
|
||||
pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) / (End_Timestamp -
|
||||
Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 4096) / 1000))
|
||||
MFMA FLOPs (BF16):
|
||||
value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 2048) / 1000)
|
||||
pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp
|
||||
- Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 2048) / 1000))
|
||||
MFMA FLOPs (F16):
|
||||
value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 2048) / 1000)
|
||||
pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp -
|
||||
Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 2048) / 1000))
|
||||
MFMA FLOPs (F32):
|
||||
value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
|
||||
pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) / (End_Timestamp -
|
||||
Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 256) / 1000))
|
||||
MFMA FLOPs (F64):
|
||||
value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
|
||||
pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp -
|
||||
Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 256) / 1000))
|
||||
MFMA IOPs (Int8):
|
||||
value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GIOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
|
||||
pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp -
|
||||
Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 4096) / 1000))
|
||||
Active CUs:
|
||||
value: $numActiveCUs
|
||||
unit: CUs
|
||||
peak: $cu_per_gpu
|
||||
pop: ((100 * $numActiveCUs) / $cu_per_gpu)
|
||||
SALU Utilization:
|
||||
value: AVG(((100 * SQ_ACTIVE_INST_SCA) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG(((100 * SQ_ACTIVE_INST_SCA) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
VALU Utilization:
|
||||
value: AVG(((100 * SQ_ACTIVE_INST_VALU) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG(((100 * SQ_ACTIVE_INST_VALU) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
|
||||
MFMA Utilization:
|
||||
value: AVG(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / (($GRBM_GUI_ACTIVE_PER_XCD
|
||||
* $cu_per_gpu) * 4)))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / (($GRBM_GUI_ACTIVE_PER_XCD
|
||||
* $cu_per_gpu) * 4)))
|
||||
VMEM Utilization:
|
||||
value: AVG((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD)
|
||||
/ $cu_per_gpu))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD)
|
||||
/ $cu_per_gpu))
|
||||
Branch Utilization:
|
||||
value: AVG((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
|
||||
VALU Active Threads:
|
||||
value: AVG(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
|
||||
!= 0) else None))
|
||||
unit: Threads
|
||||
peak: $wave_size
|
||||
pop: (100 * AVG((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU / $wave_size)
|
||||
if (SQ_ACTIVE_INST_VALU != 0) else None))
|
||||
IPC:
|
||||
value: AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))
|
||||
unit: Instr/cycle
|
||||
peak: 5
|
||||
pop: ((100 * AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))) / 5)
|
||||
Wavefront Occupancy:
|
||||
value: AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
unit: Wavefronts
|
||||
peak: ($max_waves_per_cu * $cu_per_gpu)
|
||||
pop: (100 * AVG(((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD) / ($max_waves_per_cu
|
||||
* $cu_per_gpu))))
|
||||
coll_level: SQ_LEVEL_WAVES
|
||||
Theoretical LDS Bandwidth:
|
||||
value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
peak: (($max_sclk * $cu_per_gpu) * 0.128)
|
||||
pop: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
|
||||
LDS Bank Conflicts/Access:
|
||||
value: AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
|
||||
unit: Conflicts/access
|
||||
peak: 32
|
||||
pop: ((100 * AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
|
||||
if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))) / 32)
|
||||
vL1D Cache Hit Rate:
|
||||
value: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
|
||||
else None))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
|
||||
else None))
|
||||
vL1D Cache BW:
|
||||
value: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
peak: ((($max_sclk / 1000) * 128) * $cu_per_gpu)
|
||||
pop: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp
|
||||
- Start_Timestamp)))) / ((($max_sclk / 1000) * 128) * $cu_per_gpu))
|
||||
L2 Cache Hit Rate:
|
||||
value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
L2 Cache BW:
|
||||
value: AVG(((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
peak: ((($max_sclk / 1000) * 128) * TO_INT($total_l2_chan))
|
||||
pop: ((100 * AVG(((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk / 1000) * 128) * TO_INT($total_l2_chan)))
|
||||
L2-Fabric Read BW:
|
||||
value: AVG((128 * TCC_BUBBLE_sum + 64 * (TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum
|
||||
- TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / (End_Timestamp
|
||||
- Start_Timestamp))
|
||||
unit: GB/s
|
||||
peak: $hbmBandwidth
|
||||
pop: ((100 * (AVG((128 * TCC_BUBBLE_sum + 64 * (TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum
|
||||
- TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / (End_Timestamp
|
||||
- Start_Timestamp)))) / $hbmBandwidth)
|
||||
L2-Fabric Write BW:
|
||||
value: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
|
||||
* 32)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
peak: $hbmBandwidth
|
||||
pop: ((100 * AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum -
|
||||
TCC_EA0_WRREQ_64B_sum) * 32)) / (End_Timestamp - Start_Timestamp)))) /
|
||||
$hbmBandwidth)
|
||||
L2-Fabric Read Latency:
|
||||
value: AVG(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
peak: None
|
||||
pop: None
|
||||
L2-Fabric Write Latency:
|
||||
value: AVG(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
peak: None
|
||||
pop: None
|
||||
sL1D Cache Hit Rate:
|
||||
value: AVG((((100 * SQC_DCACHE_HITS) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES))
|
||||
if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES) != 0) else None))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG((((100 * SQC_DCACHE_HITS) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES))
|
||||
if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES) != 0) else None))
|
||||
sL1D Cache BW:
|
||||
value: AVG(((SQC_DCACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))
|
||||
unit: GB/s
|
||||
peak: ((($max_sclk / 1000) * 64) * $sqc_per_gpu)
|
||||
pop: ((100 * AVG(((SQC_DCACHE_REQ / (End_Timestamp - Start_Timestamp)) *
|
||||
64))) / ((($max_sclk / 1000) * 64) * $sqc_per_gpu))
|
||||
L1I Hit Rate:
|
||||
value: AVG(((100 * SQC_ICACHE_HITS) / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES)))
|
||||
unit: pct
|
||||
peak: 100
|
||||
pop: AVG(((100 * SQC_ICACHE_HITS) / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES)))
|
||||
L1I BW:
|
||||
value: AVG(((SQC_ICACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))
|
||||
unit: GB/s
|
||||
peak: ((($max_sclk / 1000) * 64) * $sqc_per_gpu)
|
||||
pop: ((100 * AVG(((SQC_ICACHE_REQ / (End_Timestamp - Start_Timestamp)) *
|
||||
64))) / ((($max_sclk / 1000) * 64) * $sqc_per_gpu))
|
||||
L1I Fetch Latency:
|
||||
value: AVG((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
|
||||
unit: Cycles
|
||||
peak: None
|
||||
pop: None
|
||||
coll_level: SQ_IFETCH_LEVEL
|
||||
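The `peak` and `pop` fields in the table above follow a common pattern: `pop` is the measured value expressed as a percentage of `peak`. The sketch below reproduces that arithmetic for the VALU FLOPs entry. It is an illustration only, not the profiler's evaluator: the function names are hypothetical, and it assumes `$max_sclk` is given in MHz so that the division by 1000 yields GFLOP/s.

```python
from typing import Optional


def valu_peak_gflops(max_sclk_mhz: float, cu_per_gpu: int) -> float:
    """Mirror of the config expression (((max_sclk * cu_per_gpu) * 64) * 2) / 1000.

    Assumption: max_sclk is in MHz, so the result is in GFLOP/s.
    """
    return (((max_sclk_mhz * cu_per_gpu) * 64) * 2) / 1000


def pct_of_peak(value: Optional[float], peak: Optional[float]) -> Optional[float]:
    """Return 100 * value / peak, or None when either input is unavailable."""
    if value is None or peak is None or peak == 0:
        return None
    return 100 * value / peak


# Hypothetical device: 1000 MHz shader clock, 100 compute units.
peak = valu_peak_gflops(max_sclk_mhz=1000.0, cu_per_gpu=100)
print(peak)                       # 12800.0
print(pct_of_peak(6400.0, peak))  # 50.0
```

Metrics whose `peak` is `None` in the table (for example the latency entries) would yield `None` for the percentage under this sketch, which matches their `pop: None` fields.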
-315
@@ -1,315 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 300
|
||||
title: Memory Chart
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 301
|
||||
title: Memory Chart
|
||||
header:
|
||||
metric: Metric
|
||||
#alias: #alias
|
||||
value: Value
|
||||
tips: Tips
|
||||
metric:
|
||||
# ----------------------------------------
|
||||
# Instr Buff Block
|
||||
|
||||
#TODO: double check wave_occupancy
|
||||
Wavefront Occupancy:
|
||||
#alias: wave_occ_
|
||||
value: ROUND(AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD) / $numActiveCUs), 0)
|
||||
coll_level: SQ_LEVEL_WAVES
|
||||
tips:
|
||||
Wave Life:
|
||||
#alias: wave_life_
|
||||
value: ROUND(AVG(((4 * (SQ_WAVE_CYCLES / SQ_WAVES)) if (SQ_WAVES != 0) else 0)), 0)
|
||||
tips:
|
||||
|
||||
# ----------------------------------------
|
||||
# Instr Dispatch Block
|
||||
SALU:
|
||||
#alias: salu_
|
||||
value: ROUND(AVG((SQ_INSTS_SALU / $denom)), 0)
|
||||
tips:
|
||||
SMEM:
|
||||
#alias: smem_
|
||||
value: ROUND(AVG((SQ_INSTS_SMEM / $denom)), 0)
|
||||
tips:
|
||||
VALU:
|
||||
#alias: valu_
|
||||
value: ROUND(AVG((SQ_INSTS_VALU / $denom)), 0)
|
||||
tips:
|
||||
MFMA:
|
||||
#alias: mfma_
|
||||
value: ROUND(AVG((SQ_INSTS_MFMA / $denom)), 0)
|
||||
tips:
|
||||
VMEM:
|
||||
#alias: vmem_
|
||||
value: ROUND(AVG((SQ_INSTS_VMEM / $denom)), 0)
|
||||
tips:
|
||||
LDS:
|
||||
#alias: lds_
|
||||
value: ROUND(AVG((SQ_INSTS_LDS / $denom)), 0)
|
||||
tips:
|
||||
GWS:
|
||||
#alias: gws_
|
||||
value: ROUND(AVG((SQ_INSTS_GDS / $denom)), 0)
|
||||
tips:
|
||||
BR:
|
||||
#alias: br_
|
||||
value: ROUND(AVG((SQ_INSTS_BRANCH / $denom)), 0)
|
||||
tips:
|
||||
|
||||
# ----------------------------------------
|
||||
# Exec Block
|
||||
Active CUs:
|
||||
#alias: active_cu_
|
||||
value: $numActiveCUs
|
||||
tips:
|
||||
Num CUs:
|
||||
#alias: num_cu_
|
||||
value: $cu_per_gpu
|
||||
tips:
|
||||
VGPR:
|
||||
#alias: vgpr_
|
||||
value: ROUND(AVG(Arch_VGPR), 0)
|
||||
tips:
|
||||
# Todo: add AGPRs
|
||||
SGPR:
|
||||
#alias: sgpr_
|
||||
value: ROUND(AVG(SGPR), 0)
|
||||
tips:
|
||||
LDS Allocation:
|
||||
#alias: lds_alloc_
|
||||
value: ROUND(AVG(LDS_Per_Workgroup), 0)
|
||||
tips:
|
||||
Scratch Allocation:
|
||||
#alias: scratch_alloc_
|
||||
value: ROUND(AVG(Scratch_Per_Workitem), 0)
|
||||
tips:
|
||||
Wavefronts:
|
||||
#alias: wavefronts_
|
||||
value: ROUND(AVG(SPI_CSN_WAVE), 0)
|
||||
tips:
|
||||
Workgroups:
|
||||
#alias: workgroups_
|
||||
value: ROUND(AVG(SPI_CSN_NUM_THREADGROUPS), 0)
|
||||
tips:
|
||||
|
||||
# ----------------------------------------
|
||||
# LDS Block
|
||||
LDS Req:
|
||||
#alias: lds_req_
|
||||
value: ROUND(AVG((SQ_INSTS_LDS / $denom)), 0)
|
||||
tips:
|
||||
LDS Util:
|
||||
#alias: lds_util_
|
||||
value:
|
||||
ROUND(AVG(((100 * SQ_LDS_IDX_ACTIVE) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))),
|
||||
0)
|
||||
tips:
|
||||
LDS Latency:
|
||||
#alias: lds_lat
|
||||
value: ROUND(AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None)),0)
|
||||
coll_level: SQ_INST_LEVEL_LDS
|
||||
tips:
|
||||
|
||||
# ----------------------------------------
|
||||
# Vector L1 Cache Block
|
||||
VL1 Rd:
|
||||
#alias: vl1_rd_
|
||||
value: ROUND(AVG((TCP_TOTAL_READ_sum / $denom)), 0)
|
||||
tips:
|
||||
VL1 Wr:
|
||||
#alias: vl1_wr_
|
||||
value: ROUND(AVG((TCP_TOTAL_WRITE_sum / $denom)), 0)
|
||||
tips:
|
||||
VL1 Atomic:
|
||||
#alias: vl1_atom_
|
||||
value:
|
||||
ROUND(AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
|
||||
/ $denom)), 0)
|
||||
tips:
|
||||
|
||||
VL1 Hit:
|
||||
#alias: vl1_hit_
|
||||
value:
|
||||
ROUND(AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
|
||||
None )), 0)
|
||||
tips:
|
||||
VL1 Lat:
|
||||
#alias: vl1_lat_
|
||||
value:
|
||||
ROUND(AVG(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
|
||||
!= 0) else None)), 0)
|
||||
tips:
|
||||
VL1 Coalesce:
|
||||
#alias: vl1_coales_
|
||||
value:
|
||||
ROUND(AVG(((((TA_TOTAL_WAVEFRONTS_sum * 64) * 100) / (TCP_TOTAL_ACCESSES_sum
|
||||
* 4)) if (TCP_TOTAL_ACCESSES_sum != None) else 0)), 0)
|
||||
tips:
|
||||
VL1 Stall:
|
||||
#alias: vl1_stall_
|
||||
value:
|
||||
ROUND(AVG((((100 * TCP_TCR_TCP_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
|
||||
if (TCP_GATE_EN1_sum != 0) else None)), 0)
|
||||
tips:
|
||||
|
||||
VL1_L2 Rd:
|
||||
#alias: vl1_l2_rd_
|
||||
value: ROUND(AVG((TCP_TCC_READ_REQ_sum / $denom)), 0)
|
||||
tips:
|
||||
VL1_L2 Wr:
|
||||
#alias: vl1_l2_wr_
|
||||
value: ROUND(AVG((TCP_TCC_WRITE_REQ_sum / $denom)), 0)
|
||||
tips:
|
||||
VL1_L2 Atomic:
|
||||
#alias: vl1_l2_atom_
|
||||
value:
|
||||
ROUND(AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
/ $denom)), 0)
|
||||
tips:
|
||||
|
||||
# ----------------------------------------
|
||||
# Scalar L1D Cache Block
|
||||
VL1D Rd:
|
||||
#alias: sl1_rd_
|
||||
value: ROUND(AVG((SQC_DCACHE_REQ / $denom)), 0)
|
||||
tips:
|
||||
VL1D Hit:
|
||||
#alias: sl1_hit_
|
||||
value:
|
||||
ROUND((AVG(((SQC_DCACHE_HITS / SQC_DCACHE_REQ) if (SQC_DCACHE_REQ !=
|
||||
0) else None)) * 100), 0)
|
||||
tips:
|
||||
VL1D Lat:
|
||||
#alias: sl1_lat_
|
||||
value:
|
||||
ROUND((AVG(((SQ_ACCUM_PREV_HIRES / SQC_DCACHE_REQ) if (SQC_DCACHE_REQ !=
|
||||
0) else None)) * 100), 0)
|
||||
coll_level: SQC_DCACHE_INFLIGHT_LEVEL
|
||||
tips:
|
||||
|
||||
VL1D_L2 Rd:
|
||||
#alias: sl1_l2_rd_
|
||||
value: ROUND(AVG((SQC_TC_DATA_READ_REQ / $denom)), 0)
|
||||
tips:
|
||||
VL1D_L2 Wr:
|
||||
#alias: sl1_l2_wr_
|
||||
value: ROUND(AVG((SQC_TC_DATA_WRITE_REQ / $denom)), 0)
|
||||
tips:
|
||||
VL1D_L2 Atomic:
|
||||
#alias: sl1_l2_atom_
|
||||
value: ROUND(AVG((SQC_TC_DATA_ATOMIC_REQ / $denom)), 0)
|
||||
tips:
|
||||
|
||||
# ----------------------------------------
|
||||
# Instr L1 Cache Block
|
||||
IL1 Fetch:
|
||||
#alias: il1_fetch_
|
||||
value: ROUND(AVG((SQC_ICACHE_REQ / $denom)), 0)
|
||||
tips:
|
||||
IL1 Hit:
|
||||
#alias: il1_hit_
|
||||
value: ROUND((AVG((SQC_ICACHE_HITS / SQC_ICACHE_REQ)) * 100), 0)
|
||||
tips:
|
||||
IL1 Lat:
|
||||
#alias: il1_lat_
|
||||
value:
|
||||
ROUND((AVG(((SQ_ACCUM_PREV_HIRES / SQC_ICACHE_REQ) if (SQC_ICACHE_REQ !=
|
||||
0) else None)) * 100), 0)
|
||||
tips: # ??? coll_level: SQ_IFETCH_LEVEL
|
||||
IL1_L2 Rd:
|
||||
#alias: il1_l2_req_
|
||||
value: ROUND(AVG((SQC_TC_INST_REQ / $denom)), 0)
|
||||
tips:
|
||||
|
||||
# ----------------------------------------
|
||||
# L2 Cache Block(inside)
|
||||
L2 Rd:
|
||||
#alias: l2_rd_
|
||||
value: ROUND(AVG((TCC_READ_sum / $denom)), 0)
|
||||
tips:
|
||||
L2 Wr:
|
||||
#alias: l2_wr_
|
||||
value: ROUND(AVG((TCC_WRITE_sum / $denom)), 0)
|
||||
tips:
|
||||
L2 Atomic:
|
||||
#alias: l2_atom_
|
||||
value: ROUND(AVG((TCC_ATOMIC_sum / $denom)), 0)
|
||||
tips:
|
||||
L2 Hit:
|
||||
#alias: l2_hit_
|
||||
value:
|
||||
ROUND(AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else 0)), 0)
|
||||
tips:
|
||||
L2 Rd Lat:
|
||||
#alias: l2_rd_lat_
|
||||
value:
|
||||
# ROUND(AVG(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
|
||||
# if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else None)),
|
||||
# 0)
|
||||
tips:
|
||||
L2 Wr Lat:
|
||||
#alias: l2_wr_lat_
|
||||
value:
|
||||
# ROUND(AVG(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum +
|
||||
# TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
# != 0) else None)), 0)
|
||||
tips:
|
||||
|
||||
# ----------------------------------------
|
||||
# Fabric Block
|
||||
Fabric_L2 Rd:
|
||||
#alias: l2_fabric_rd_
|
||||
value: ROUND(AVG((TCC_EA0_RDREQ_sum / $denom)), 0)
|
||||
tips:
|
||||
Fabric_L2 Wr:
|
||||
#alias: l2_fabric_wr_
|
||||
value: ROUND(AVG((TCC_EA0_WRREQ_sum / $denom)), 0)
|
||||
tips:
|
||||
Fabric_L2 Atomic:
|
||||
#alias: l2_fabric_atom_
|
||||
value: ROUND(AVG((TCC_EA0_ATOMIC_sum / $denom)), 0)
|
||||
tips:
|
||||
|
||||
Fabric Rd Lat:
|
||||
#alias: fabric_rd_lat_
|
||||
value:
|
||||
ROUND(AVG(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else 0)), 0)
|
||||
tips:
|
||||
Fabric Wr Lat:
|
||||
#alias: fabric_wr_lat_
|
||||
value:
|
||||
ROUND(AVG(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else 0)), 0)
|
||||
tips:
|
||||
Fabric Atomic Lat:
|
||||
#alias: fabric_atom_lat_
|
||||
value:
|
||||
ROUND(AVG(((TCC_EA0_ATOMIC_LEVEL_sum / TCC_EA0_ATOMIC_sum) if (TCC_EA0_ATOMIC_sum
|
||||
!= 0) else 0)), 0)
|
||||
tips:
|
||||
|
||||
HBM Rd:
|
||||
#alias: hbm_rd_
|
||||
value: ROUND(AVG((TCC_EA0_RDREQ_DRAM_sum / $denom)), 0)
|
||||
tips:
|
||||
HBM Wr:
|
||||
#alias: hbm_wr_
|
||||
value: ROUND(AVG((TCC_EA0_WRREQ_DRAM_sum / $denom)), 0)
|
||||
tips:
|
||||
|
||||
comparable: false # for now
|
||||
cli_style: mem_chart
|
||||
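Many of the expressions in the file above guard a ratio against a zero denominator, falling back to either `None` (for example LDS Latency) or `0` (for example L2 Hit). A minimal sketch of that pattern follows. The helper name and its `default` parameter are hypothetical and are not part of the analysis configuration or the tool.

```python
from typing import Optional


def guarded_pct(hits: float, misses: float,
                default: Optional[float] = None) -> Optional[float]:
    """Hit rate in percent, mirroring
    ((100 * hits) / (hits + misses)) if ((hits + misses) != 0) else default.
    """
    total = hits + misses
    return (100 * hits) / total if total != 0 else default


print(guarded_pct(75, 25))            # 75.0
print(guarded_pct(0, 0))              # None
print(guarded_pct(0, 0, default=0))   # 0
```

Returning `None` distinguishes "no requests observed" from a true 0% hit rate, whereas a fallback of `0` makes the two cases indistinguishable in the rendered table.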
+263
@@ -0,0 +1,263 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 300
|
||||
title: Memory Chart
|
||||
metrics_description:
|
||||
Wavefront Occupancy: Wavefronts per active CU.
|
||||
Wave Life: Average number of cycles executing a wave.
|
||||
SALU: Total number of SALU (Scalar ALU) instructions issued per normalization
|
||||
unit.
|
||||
SMEM: Total number of SMEM (Scalar Memory Read) instructions issued per normalization
|
||||
unit.
|
||||
VALU: The number of VALU (Vector ALU) instructions issued per normalization unit.
|
||||
MFMA: Total number of MFMA (Matrix-Fused-Multiply-Add) instructions issued per
|
||||
normalization unit.
|
||||
VMEM: The number of VMEM (GPU Memory) read instructions issued (including FLAT/scratch
|
||||
memory) per normalization unit.
|
||||
LDS: The total number of LDS instructions (including, but not limited to, read/write/atomics
|
||||
and HIP's __shfl instructions) executed per normalization unit.
|
||||
GWS: Total number of GDS (global data sync) instructions issued per normalization
|
||||
unit.
|
||||
BR: Total number of BRANCH instructions issued per normalization unit.
|
||||
Active CUs: Total number of active compute units (CUs) on the accelerator during
|
||||
the kernel execution.
|
||||
Num CUs: Total number of compute units (CUs) on the accelerator.
|
||||
VGPR: 'The number of architected vector general-purpose registers allocated for
|
||||
the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
|
||||
by the compiler due to allocation granularity.'
|
||||
SGPR: 'The number of scalar general-purpose registers allocated for the kernel,
|
||||
see SALU. Note: this may not exactly match the number of SGPRs requested by
|
||||
the compiler due to allocation granularity.'
|
||||
LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
|
||||
for this kernel. Note: This may also be larger than what was requested at compile
|
||||
time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
|
||||
Scratch Allocation: The number of bytes of scratch memory requested per work-item
|
||||
for this kernel. Scratch memory is used for stack memory on the accelerator,
|
||||
as well as for register spills and restores.
|
||||
Wavefronts: The total number of wavefronts, summed over all workgroups, forming
|
||||
this kernel launch.
|
||||
Workgroups: The total number of workgroups forming this kernel launch.
|
||||
LDS Req: The total number of LDS instructions (including, but not limited to,
|
||||
read/write/atomics and HIP's __shfl instructions) executed per normalization
|
||||
unit.
|
||||
LDS Util: Indicates what percent of the kernel's duration the LDS was actively
|
||||
executing instructions (including, but not limited to, load, store, atomic and
|
||||
HIP's __shfl operations). Calculated as the ratio of the total number of cycles
|
||||
LDS was active over the total CU cycles.
|
||||
LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
|
||||
/ acknowledgment) required for an LDS instruction to complete.
|
||||
VL1 Rd: The total number of incoming read requests from the address processing
|
||||
unit after coalescing, per normalization unit.
|
||||
VL1 Wr: The total number of incoming write requests from the address processing
|
||||
unit after coalescing, per normalization unit.
|
||||
VL1 Atomic: The total number of incoming atomic requests from the address processing
|
||||
unit after coalescing, per normalization unit.
|
||||
VL1 Hit: The ratio of the number of vL1D cache line requests that hit in vL1D
|
||||
cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
VL1 Lat: Calculated as the average number of cycles that a vL1D cache line request
|
||||
spent in the vL1D cache pipeline.
|
||||
VL1 Coalesce: Indicates how well memory instructions were coalesced by the address
|
||||
processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
|
||||
as the average number of thread-requests generated per instruction divided by
|
||||
the ideal number of thread-requests per instruction.
|
||||
VL1 Stall: The ratio of the number of cycles where the vL1D is stalled waiting
|
||||
to issue a request for data to the L2 cache divided by the number of cycles
|
||||
where the vL1D is active.
|
||||
VL1_L2 Rd: The number of read requests for a vL1D cache line that were not satisfied
|
||||
by the vL1D and must be retrieved from the L2 Cache per normalization
|
||||
unit.
|
||||
VL1_L2 Wr: The number of write requests to a vL1D cache line that were sent through
|
||||
the vL1D to the L2 cache, per normalization unit.
|
||||
VL1_L2 Atomic: The number of atomic requests that are sent through the vL1D to
|
||||
the L2 cache, per normalization unit. This includes requests for atomics with,
|
||||
and without return.
|
||||
sL1D Rd: The total number of requests, of any size or type, made to the sL1D per
|
||||
normalization unit.
|
||||
sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
|
||||
line, per normalization unit.
|
||||
sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
|
||||
unit.
|
||||
sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
sL1D_L2 Atomic: The total number of atomic requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
IL1 Fetch: The total number of requests made to the L1I per normalization-unit.
|
||||
IL1 Hit: The percent of L1I requests that hit on a previously loaded line in the
|
||||
cache. Calculated as the ratio of the number of L1I requests that hit over the
|
||||
number of all L1I requests.
|
||||
IL1 Lat: The average number of cycles spent to fetch instructions to a CU.
|
||||
IL1_L2 Rd: The total number of requests across the L1I - L2 interface per normalization-unit.
|
||||
L2 Rd: The total number of read requests to the L2 from all clients.
|
||||
L2 Wr: The total number of write requests to the L2 from all clients.
|
||||
L2 Atomic: The total number of atomic requests (with and without return) to the
|
||||
L2 from all clients.
|
||||
L2 Hit: The ratio of the number of L2 cache line requests that hit in the L2 cache
|
||||
over the total number of incoming cache line requests to the L2 cache.
|
||||
L2 Rd Lat: Calculated as the average number of cycles that the vL1D cache took
|
||||
to issue and receive read requests from the L2 Cache. This number also includes
|
||||
requests for atomics with return values.
|
||||
L2 Wr Lat: Calculated as the average number of cycles that the vL1D cache took
|
||||
to issue and receive acknowledgement of a write request to the L2 Cache. This
|
||||
number also includes requests for atomics without return values.
|
||||
Fabric_L2 Rd: Number of L2 cache - Infinity Fabric read requests (either 32-byte
|
||||
or 64-byte) summed over TCC instances per normalization unit.
|
||||
Fabric_L2 Wr: Number of L2 cache - Infinity Fabric write requests (either 32-byte
|
||||
or 64-byte) summed over TCC instances per normalization unit.
|
||||
Fabric_L2 Atomic: Number of L2 cache - Infinity Fabric write requests (either
|
||||
32-byte or 64-byte) that are actually atomic requests summed over TCC instances
|
||||
per normalization unit.
|
||||
Fabric Rd Lat: The time-averaged number of cycles read requests spent in Infinity
|
||||
Fabric before data was returned to the L2.
|
||||
Fabric Wr Lat: The time-averaged number of cycles write requests spent in Infinity
|
||||
Fabric before a completion acknowledgement was returned to the L2.
|
||||
Fabric Atomic Lat: The time-averaged number of cycles atomic requests spent in
|
||||
Infinity Fabric before a completion acknowledgement (atomic without return value)
|
||||
or data (atomic with return value) was returned to the L2.
|
||||
HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
|
||||
of data from the accelerator's local HBM, per normalization unit.
|
||||
HBM Wr: 'The total number of L2 requests to Infinity Fabric to write or atomically
|
||||
update 32B or 64B of data in the accelerator''s local HBM, per normalization
|
||||
unit. '
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 301
|
||||
title: Memory Chart
|
||||
header:
|
||||
metric: Metric
|
||||
value: Value
|
||||
metric:
|
||||
Wavefront Occupancy:
|
||||
value: ROUND(AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD) / $numActiveCUs),
|
||||
0)
|
||||
coll_level: SQ_LEVEL_WAVES
|
||||
Wave Life:
|
||||
value: ROUND(AVG(((4 * (SQ_WAVE_CYCLES / SQ_WAVES)) if (SQ_WAVES != 0) else
|
||||
0)), 0)
|
||||
SALU:
|
||||
value: ROUND(AVG((SQ_INSTS_SALU / $denom)), 0)
|
||||
SMEM:
|
||||
value: ROUND(AVG((SQ_INSTS_SMEM / $denom)), 0)
|
||||
VALU:
|
||||
value: ROUND(AVG((SQ_INSTS_VALU / $denom)), 0)
|
||||
MFMA:
|
||||
value: ROUND(AVG((SQ_INSTS_MFMA / $denom)), 0)
|
||||
VMEM:
|
||||
value: ROUND(AVG((SQ_INSTS_VMEM / $denom)), 0)
|
||||
LDS:
|
||||
value: ROUND(AVG((SQ_INSTS_LDS / $denom)), 0)
|
||||
GWS:
|
||||
value: ROUND(AVG((SQ_INSTS_GDS / $denom)), 0)
|
||||
BR:
|
||||
value: ROUND(AVG((SQ_INSTS_BRANCH / $denom)), 0)
|
||||
Active CUs:
|
||||
value: $numActiveCUs
|
||||
Num CUs:
|
||||
value: $cu_per_gpu
|
||||
VGPR:
|
||||
value: ROUND(AVG(Arch_VGPR), 0)
|
||||
SGPR:
|
||||
value: ROUND(AVG(SGPR), 0)
|
||||
LDS Allocation:
|
||||
value: ROUND(AVG(LDS_Per_Workgroup), 0)
|
||||
Scratch Allocation:
|
||||
value: ROUND(AVG(Scratch_Per_Workitem), 0)
|
||||
Wavefronts:
|
||||
value: ROUND(AVG(SPI_CSN_WAVE), 0)
|
||||
Workgroups:
|
||||
value: ROUND(AVG(SPI_CSN_NUM_THREADGROUPS), 0)
|
||||
LDS Req:
|
||||
value: ROUND(AVG((SQ_INSTS_LDS / $denom)), 0)
|
||||
LDS Util:
|
||||
value: ROUND(AVG(((100 * SQ_LDS_IDX_ACTIVE) / ($GRBM_GUI_ACTIVE_PER_XCD
|
||||
* $cu_per_gpu))), 0)
|
||||
LDS Latency:
|
||||
value: ROUND(AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS
|
||||
!= 0) else None)),0)
|
||||
coll_level: SQ_INST_LEVEL_LDS
|
||||
VL1 Rd:
|
||||
value: ROUND(AVG((TCP_TOTAL_READ_sum / $denom)), 0)
|
||||
VL1 Wr:
|
||||
value: ROUND(AVG((TCP_TOTAL_WRITE_sum / $denom)), 0)
|
||||
VL1 Atomic:
|
||||
value: ROUND(AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
|
||||
/ $denom)), 0)
|
||||
VL1 Hit:
|
||||
value: ROUND(AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
|
||||
else None )), 0)
|
||||
VL1 Lat:
|
||||
value: ROUND(AVG(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
|
||||
!= 0) else None)), 0)
|
||||
VL1 Coalesce:
|
||||
value: ROUND(AVG(((((TA_TOTAL_WAVEFRONTS_sum * 64) * 100) / (TCP_TOTAL_ACCESSES_sum
|
||||
* 4)) if (TCP_TOTAL_ACCESSES_sum != None) else 0)), 0)
|
||||
VL1 Stall:
|
||||
value: ROUND(AVG((((100 * TCP_TCR_TCP_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
|
||||
if (TCP_GATE_EN1_sum != 0) else None)), 0)
|
||||
VL1_L2 Rd:
|
||||
value: ROUND(AVG((TCP_TCC_READ_REQ_sum / $denom)), 0)
|
||||
VL1_L2 Wr:
|
||||
value: ROUND(AVG((TCP_TCC_WRITE_REQ_sum / $denom)), 0)
|
||||
VL1_L2 Atomic:
|
||||
value: ROUND(AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
/ $denom)), 0)
|
||||
sL1D Rd:
|
||||
value: ROUND(AVG((SQC_DCACHE_REQ / $denom)), 0)
|
||||
sL1D Hit:
|
||||
value: ROUND((AVG(((SQC_DCACHE_HITS / SQC_DCACHE_REQ) if (SQC_DCACHE_REQ
|
||||
!= 0) else None)) * 100), 0)
|
||||
sL1D Lat:
|
||||
value: ROUND((AVG(((SQ_ACCUM_PREV_HIRES / SQC_DCACHE_REQ) if (SQC_DCACHE_REQ
|
||||
!= 0) else None)) * 100), 0)
|
||||
coll_level: SQC_DCACHE_INFLIGHT_LEVEL
|
||||
sL1D_L2 Rd:
|
||||
value: ROUND(AVG((SQC_TC_DATA_READ_REQ / $denom)), 0)
|
||||
sL1D_L2 Wr:
|
||||
value: ROUND(AVG((SQC_TC_DATA_WRITE_REQ / $denom)), 0)
|
||||
sL1D_L2 Atomic:
|
||||
value: ROUND(AVG((SQC_TC_DATA_ATOMIC_REQ / $denom)), 0)
|
||||
IL1 Fetch:
|
||||
value: ROUND(AVG((SQC_ICACHE_REQ / $denom)), 0)
|
||||
IL1 Hit:
|
||||
value: ROUND((AVG((SQC_ICACHE_HITS / SQC_ICACHE_REQ)) * 100), 0)
|
||||
IL1 Lat:
|
||||
value: ROUND((AVG(((SQ_ACCUM_PREV_HIRES / SQC_ICACHE_REQ) if (SQC_ICACHE_REQ
|
||||
!= 0) else None)) * 100), 0)
|
||||
IL1_L2 Rd:
|
||||
value: ROUND(AVG((SQC_TC_INST_REQ / $denom)), 0)
|
||||
L2 Rd:
|
||||
value: ROUND(AVG((TCC_READ_sum / $denom)), 0)
|
||||
L2 Wr:
|
||||
value: ROUND(AVG((TCC_WRITE_sum / $denom)), 0)
|
||||
L2 Atomic:
|
||||
value: ROUND(AVG((TCC_ATOMIC_sum / $denom)), 0)
|
||||
L2 Hit:
|
||||
value: ROUND(AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if
|
||||
((TCC_HIT_sum + TCC_MISS_sum) != 0) else 0)), 0)
|
||||
L2 Rd Lat:
|
||||
value: null
|
||||
L2 Wr Lat:
|
||||
value: null
|
||||
Fabric_L2 Rd:
|
||||
value: ROUND(AVG((TCC_EA0_RDREQ_sum / $denom)), 0)
|
||||
Fabric_L2 Wr:
|
||||
value: ROUND(AVG((TCC_EA0_WRREQ_sum / $denom)), 0)
|
||||
Fabric_L2 Atomic:
|
||||
value: ROUND(AVG((TCC_EA0_ATOMIC_sum / $denom)), 0)
|
||||
Fabric Rd Lat:
|
||||
value: ROUND(AVG(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else 0)), 0)
|
||||
Fabric Wr Lat:
|
||||
value: ROUND(AVG(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else 0)), 0)
|
||||
Fabric Atomic Lat:
|
||||
value: ROUND(AVG(((TCC_EA0_ATOMIC_LEVEL_sum / TCC_EA0_ATOMIC_sum) if (TCC_EA0_ATOMIC_sum
|
||||
!= 0) else 0)), 0)
|
||||
HBM Rd:
|
||||
value: ROUND(AVG((TCC_EA0_RDREQ_DRAM_sum / $denom)), 0)
|
||||
HBM Wr:
|
||||
value: ROUND(AVG((TCC_EA0_WRREQ_DRAM_sum / $denom)), 0)
|
||||
comparable: false
|
||||
cli_style: mem_chart
|
||||
tui_style: mem_chart
|
||||
@@ -0,0 +1,9 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
Panel Config:
  id: 400
  title: Roofline
  metrics_description: {}
  data source:
  - None:
      id: 401
      title: Roofline
|
||||
@@ -1,8 +0,0 @@
|
||||
---
Panel Config:
  id: 400
  title: Roofline
  data source:
  - None:
      id: 401
      title: Roofline
|
||||
-135
@@ -1,135 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 500
|
||||
title: Command Processor (CPC/CPF)
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 501
|
||||
title: Command Processor Fetcher
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
CPF Utilization:
|
||||
avg: AVG((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
|
||||
if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
|
||||
min: MIN((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
|
||||
if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
|
||||
max: MAX((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
|
||||
if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
CPF Stall:
|
||||
avg: AVG((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
|
||||
!= 0) else None))
|
||||
min: MIN((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
|
||||
!= 0) else None))
|
||||
max: MAX((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
CPF-L2 Utilization:
|
||||
avg: AVG((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
|
||||
if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
|
||||
min: MIN((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
|
||||
if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
|
||||
max: MAX((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
|
||||
if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
CPF-L2 Stall:
|
||||
avg: AVG((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
|
||||
!= 0) else None))
|
||||
min: MIN((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
|
||||
!= 0) else None))
|
||||
max: MAX((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
CPF-UTCL1 Stall:
|
||||
avg: AVG(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
|
||||
!= 0) else None)
|
||||
min: MIN(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
|
||||
!= 0) else None)
|
||||
max: MAX(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
|
||||
!= 0) else None)
|
||||
unit: pct
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 502
|
||||
title: Packet Processor
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
CPC Utilization:
|
||||
avg: AVG((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
|
||||
if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
|
||||
min: MIN((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
|
||||
if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
|
||||
max: MAX((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
|
||||
if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
CPC Stall Rate:
|
||||
avg: AVG((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None))
|
||||
min: MIN((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None))
|
||||
max: MAX((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
CPC Packet Decoding Utilization:
|
||||
avg: AVG((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
|
||||
min: MIN((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
|
||||
max: MAX((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
|
||||
unit: pct
|
||||
tips:
|
||||
CPC-Workgroup Manager Utilization:
|
||||
avg: AVG((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
|
||||
min: MIN((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
|
||||
max: MAX((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
|
||||
unit: Pct
|
||||
tips:
|
||||
CPC-L2 Utilization:
|
||||
avg: AVG((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
|
||||
if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
|
||||
min: MIN((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
|
||||
if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
|
||||
max: MAX((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
|
||||
if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
CPC-UTCL1 Stall:
|
||||
avg: AVG(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None)
|
||||
min: MIN(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None)
|
||||
max: MAX(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None)
|
||||
unit: pct
|
||||
tips:
|
||||
CPC-UTCL2 Utilization:
|
||||
avg: AVG((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
|
||||
if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
|
||||
min: MIN((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
|
||||
if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
|
||||
max: MAX((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
|
||||
if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
tips:
|
||||
+145
@@ -0,0 +1,145 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 500
|
||||
title: Command Processor (CPC/CPF)
|
||||
metrics_description:
|
||||
CPF Utilization: Percent of total cycles where the CPF was busy actively doing
|
||||
any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
|
||||
CPF Stall: Percent of CPF busy cycles where the CPF was stalled for any reason.
|
||||
CPF-L2 Utilization: Percent of total cycles counted by the CPF-L2 interface where
|
||||
the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
|
||||
over total cycles counted by the CPF-L2.
|
||||
CPF-L2 Stall: Percent of CPF-L2 busy cycles where the CPF-L2 interface was
|
||||
stalled for any reason.
|
||||
CPF-UTCL1 Stall: Percent of CPF busy cycles where the CPF was stalled by address
|
||||
translation.
|
||||
CPC Utilization: Percent of total cycles where the CPC was busy actively doing
|
||||
any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
|
||||
CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
|
||||
CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
|
||||
for processing.
|
||||
CPC-Workgroup Manager Utilization: Percent of CPC busy cycles spent dispatching
|
||||
workgroups to the workgroup manager.
|
||||
CPC-L2 Utilization: Percent of total cycles counted by the CPC-L2 interface where
|
||||
the CPC-L2 interface was active doing any work.
|
||||
CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
|
||||
translation.
|
||||
CPC-UTCL2 Utilization: 'Percent of total cycles counted by the CPC''s L2 address
|
||||
translation interface where the CPC was busy doing address translation work. '
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 501
|
||||
title: Command processor fetcher (CPF)
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
CPF Utilization:
|
||||
avg: AVG((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
|
||||
if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
|
||||
min: MIN((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
|
||||
if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
|
||||
max: MAX((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
|
||||
if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
CPF Stall:
|
||||
avg: AVG((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
|
||||
!= 0) else None))
|
||||
min: MIN((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
|
||||
!= 0) else None))
|
||||
max: MAX((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
CPF-L2 Utilization:
|
||||
avg: AVG((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
|
||||
if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
|
||||
min: MIN((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
|
||||
if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
|
||||
max: MAX((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
|
||||
if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
CPF-L2 Stall:
|
||||
avg: AVG((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
|
||||
!= 0) else None))
|
||||
min: MIN((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
|
||||
!= 0) else None))
|
||||
max: MAX((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
CPF-UTCL1 Stall:
|
||||
avg: AVG(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY)
|
||||
if (CPF_CPF_STAT_BUSY != 0) else None)
|
||||
min: MIN(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY)
|
||||
if (CPF_CPF_STAT_BUSY != 0) else None)
|
||||
max: MAX(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY)
|
||||
if (CPF_CPF_STAT_BUSY != 0) else None)
|
||||
unit: pct
|
||||
- metric_table:
|
||||
id: 502
|
||||
title: Command processor packet processor (CPC)
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
CPC Utilization:
|
||||
avg: AVG((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
|
||||
if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
|
||||
min: MIN((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
|
||||
if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
|
||||
max: MAX((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
|
||||
if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
CPC Stall Rate:
|
||||
avg: AVG((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None))
|
||||
min: MIN((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None))
|
||||
max: MAX((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
CPC Packet Decoding Utilization:
|
||||
avg: AVG((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None)
|
||||
min: MIN((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None)
|
||||
max: MAX((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None)
|
||||
unit: pct
|
||||
CPC-Workgroup Manager Utilization:
|
||||
avg: AVG((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None)
|
||||
min: MIN((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None)
|
||||
max: MAX((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
|
||||
!= 0) else None)
|
||||
unit: Pct
|
||||
CPC-L2 Utilization:
|
||||
avg: AVG((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
|
||||
if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
|
||||
min: MIN((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
|
||||
if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
|
||||
max: MAX((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
|
||||
if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
CPC-UTCL1 Stall:
|
||||
avg: AVG(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if
|
||||
(CPC_CPC_STAT_BUSY != 0) else None)
|
||||
min: MIN(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if
|
||||
(CPC_CPC_STAT_BUSY != 0) else None)
|
||||
max: MAX(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if
|
||||
(CPC_CPC_STAT_BUSY != 0) else None)
|
||||
unit: pct
|
||||
CPC-UTCL2 Utilization:
|
||||
avg: AVG((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
|
||||
if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
|
||||
min: MIN((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
|
||||
if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
|
||||
max: MAX((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
|
||||
if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
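The utilization and stall entries above share one pattern: a percentage is computed per sample, guarded against a zero denominator with `else None`, and then reduced with `AVG`, `MIN`, or `MAX`. The following is a minimal Python sketch of that pattern only. The helper names `guarded_pct` and `avg_ignoring_none` are hypothetical, and the assumption that `None` samples are skipped by the reduction is not confirmed by this diff; the tool's real evaluator may behave differently.

```python
from statistics import mean
from typing import Optional, Sequence


def guarded_pct(busy: float, idle: float) -> Optional[float]:
    """Mirror `(100 * BUSY) / (BUSY + IDLE) if (BUSY + IDLE) != 0 else None`."""
    total = busy + idle
    return (100 * busy) / total if total != 0 else None


def avg_ignoring_none(values: Sequence[Optional[float]]) -> Optional[float]:
    """Average the defined samples; ASSUMPTION: undefined (None) samples are skipped."""
    present = [v for v in values if v is not None]
    return mean(present) if present else None


# Hypothetical (busy, idle) counter pairs, one per dispatch.
samples = [(30, 70), (0, 0), (50, 50)]
pcts = [guarded_pct(busy, idle) for busy, idle in samples]
cpf_utilization_avg = avg_ignoring_none(pcts)
```

Returning `None` rather than `0` for an empty denominator keeps "no data" distinguishable from "0% busy", which is presumably why most of these expressions use `else None`.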
-167
@@ -1,167 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 600
|
||||
title: Workgroup Manager (SPI)
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 601
|
||||
title: Workgroup Manager Utilizations
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Accelerator Utilization:
|
||||
avg: AVG(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
|
||||
min: MIN(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
|
||||
max: MAX(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
|
||||
unit: Pct
|
||||
tips:
|
||||
Scheduler-Pipe Utilization:
|
||||
avg: AVG(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
|
||||
min: MIN(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
|
||||
max: MAX(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
Workgroup Manager Utilization:
|
||||
avg: AVG(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
|
||||
min: MIN(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
|
||||
max: MAX(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
|
||||
unit: Pct
|
||||
tips:
|
||||
Shader Engine Utilization:
|
||||
avg: AVG(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
|
||||
min: MIN(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
|
||||
max: MAX(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
SIMD Utilization:
|
||||
avg: AVG(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
Dispatched Workgroups:
|
||||
avg: AVG(SPI_CSN_NUM_THREADGROUPS)
|
||||
min: MIN(SPI_CSN_NUM_THREADGROUPS)
|
||||
max: MAX(SPI_CSN_NUM_THREADGROUPS)
|
||||
unit: Workgroups
|
||||
tips:
|
||||
Dispatched Wavefronts:
|
||||
avg: AVG(SPI_CSN_WAVE)
|
||||
min: MIN(SPI_CSN_WAVE)
|
||||
max: MAX(SPI_CSN_WAVE)
|
||||
unit: Wavefronts
|
||||
tips:
|
||||
VGPR Writes:
|
||||
avg: AVG((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
min: MIN((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
max: MAX((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
unit: Cycles/wave
|
||||
tips:
|
||||
SGPR Writes:
|
||||
avg: AVG((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
min: MIN((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
max: MAX((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
unit: Cycles/wave
|
||||
tips:
|
||||
- metric_table:
|
||||
id: 602
|
||||
title: Workgroup Manager - Resource Allocation
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Not-scheduled Rate (Workgroup Manager):
|
||||
avg: AVG((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None)
|
||||
min: MIN((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None)
|
||||
max: MAX((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None)
|
||||
unit: Pct
|
||||
tips:
|
||||
Not-scheduled Rate (Scheduler-Pipe):
|
||||
avg: AVG((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None)
|
||||
min: MIN((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None)
|
||||
max: MAX((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None)
|
||||
unit: Pct
|
||||
tips:
|
||||
Scheduler-Pipe Stall Rate:
|
||||
avg: AVG((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None))
|
||||
min: MIN((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None))
|
||||
max: MAX((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
|
||||
0) else None))
|
||||
unit: Pct
|
||||
tips:
|
||||
Scratch Stall Rate:
|
||||
avg: AVG((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
min: MIN((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
max: MAX((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
unit: Pct
|
||||
tips:
|
||||
Insufficient SIMD Waveslots:
|
||||
avg: AVG(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
Insufficient SIMD VGPRs:
|
||||
avg: AVG(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
Insufficient SIMD SGPRs:
|
||||
avg: AVG(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
Insufficient CU LDS:
|
||||
avg: AVG(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
Insufficient CU Barriers:
|
||||
avg: AVG(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
Reached CU Workgroup Limit:
|
||||
avg: AVG(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
Reached CU Wavefront Limit:
|
||||
avg: AVG(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
tips:
|
||||
+201
@@ -0,0 +1,201 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 600
|
||||
title: Workgroup Manager (SPI)
|
||||
metrics_description:
|
||||
Accelerator Utilization: The percent of cycles in the kernel where the accelerator
|
||||
was actively doing any work.
|
||||
Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
|
||||
kernel where the scheduler-pipes were actively doing any work.
|
||||
Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
|
||||
manager was actively doing any work.
|
||||
Shader Engine Utilization: The percent of total shader engine cycles in the kernel
|
||||
where any CU in a shader-engine was actively doing any work, normalized over
|
||||
all shader-engines. Low values (e.g., << 100%) indicate that the accelerator
|
||||
was not fully saturated by the kernel, or a potential load-imbalance issue.
|
||||
SIMD Utilization: The percent of total SIMD cycles in the kernel where any SIMD
|
||||
on a CU was actively doing any work, summed over all CUs. Low values (less than
|
||||
100%) indicate that the accelerator was not fully saturated by the kernel, or
|
||||
a potential load-imbalance issue.
|
||||
Dispatched Workgroups: The total number of workgroups forming this kernel launch.
|
||||
Dispatched Wavefronts: The total number of wavefronts, summed over all workgroups,
|
||||
forming this kernel launch.
|
||||
VGPR Writes: The average number of cycles spent initializing VGPRs at wave creation.
|
||||
SGPR Writes: The average number of cycles spent initializing SGPRs at wave creation.
|
||||
Not-scheduled Rate (Workgroup Manager): The percent of total scheduler-pipe cycles
|
||||
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
|
||||
within the workgroup manager rather than a lack of a CU or SIMD with sufficient
|
||||
resources.
|
||||
Not-scheduled Rate (Scheduler-Pipe): 'The percent of total scheduler-pipe cycles
|
||||
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
|
||||
within the scheduler-pipes rather than a lack of a CU or SIMD with sufficient
|
||||
resources. '
|
||||
Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
|
||||
where a workgroup could not be scheduled to a CU due to occupancy limitations
|
||||
(like a lack of a CU or SIMD with sufficient resources).
|
||||
Scratch Stall Rate: The percent of total shader-engine cycles in the kernel where
|
||||
a workgroup could not be scheduled to a CU due to lack of private (a.k.a., scratch)
|
||||
memory slots. While this can reach up to 100%, note that the actual occupancy
|
||||
limitations on a kernel using private memory are typically quite small (for
|
||||
example, less than 1% of the total number of waves that can be scheduled to
|
||||
an accelerator).
|
||||
Insufficient SIMD Waveslots: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available waveslots.
|
||||
Insufficient SIMD VGPRs: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available VGPRs.
|
||||
Insufficient SIMD SGPRs: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available SGPRs.
|
||||
Insufficient CU LDS: The percent of total CU cycles in the kernel where a workgroup
|
||||
could not be scheduled to a CU due to lack of available LDS.
|
||||
Insufficient CU Barriers: The percent of total CU cycles in the kernel where a
|
||||
workgroup could not be scheduled to a CU due to lack of available barriers.
|
||||
Reached CU Workgroup Limit: The percent of total CU cycles in the kernel where
|
||||
a workgroup could not be scheduled to a CU due to limits within the workgroup
|
||||
manager. This is expected to always be zero on CDNA2 or newer accelerators
|
||||
(and small for previous accelerators).
|
||||
Reached CU Wavefront Limit: The percent of total CU cycles in the kernel where
|
||||
a wavefront could not be scheduled to a CU due to limits within the workgroup
|
||||
manager. This is expected to always be zero on CDNA2 or newer accelerators
|
||||
(and small for previous accelerators).
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 601
|
||||
title: Workgroup manager utilizations
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Accelerator Utilization:
|
||||
avg: AVG(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
|
||||
min: MIN(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
|
||||
max: MAX(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
|
||||
unit: Pct
|
||||
Scheduler-Pipe Utilization:
|
||||
avg: AVG(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu
|
||||
* $se_per_gpu))
|
||||
min: MIN(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu
|
||||
* $se_per_gpu))
|
||||
max: MAX(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu
|
||||
* $se_per_gpu))
|
||||
unit: Pct
|
||||
Workgroup Manager Utilization:
|
||||
avg: AVG(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
|
||||
min: MIN(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
|
||||
max: MAX(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
|
||||
unit: Pct
|
||||
Shader Engine Utilization:
|
||||
avg: AVG(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
|
||||
min: MIN(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
|
||||
max: MAX(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
|
||||
unit: Pct
|
||||
SIMD Utilization:
|
||||
avg: AVG(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
Dispatched Workgroups:
|
||||
avg: AVG(SPI_CSN_NUM_THREADGROUPS)
|
||||
min: MIN(SPI_CSN_NUM_THREADGROUPS)
|
||||
max: MAX(SPI_CSN_NUM_THREADGROUPS)
|
||||
unit: Workgroups
|
||||
Dispatched Wavefronts:
|
||||
avg: AVG(SPI_CSN_WAVE)
|
||||
min: MIN(SPI_CSN_WAVE)
|
||||
max: MAX(SPI_CSN_WAVE)
|
||||
unit: Wavefronts
|
||||
VGPR Writes:
|
||||
avg: AVG((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
min: MIN((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
max: MAX((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
unit: Cycles/wave
|
||||
SGPR Writes:
|
||||
avg: AVG((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
min: MIN((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
max: MAX((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
|
||||
None))
|
||||
unit: Cycles/wave
|
||||
- metric_table:
|
||||
id: 602
|
||||
title: Workgroup Manager - Resource Allocation
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Not-scheduled Rate (Workgroup Manager):
|
||||
avg: AVG((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
min: MIN((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
max: MAX((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
unit: Pct
|
||||
Not-scheduled Rate (Scheduler-Pipe):
|
||||
avg: AVG((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
min: MIN((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
max: MAX((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
unit: Pct
|
||||
Scheduler-Pipe Stall Rate:
|
||||
avg: AVG((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None))
|
||||
min: MIN((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None))
|
||||
max: MAX((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None))
|
||||
unit: Pct
|
||||
Scratch Stall Rate:
|
||||
avg: AVG((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
min: MIN((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
max: MAX((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
|
||||
if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
unit: Pct
|
||||
Insufficient SIMD Waveslots:
|
||||
avg: AVG(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
Insufficient SIMD VGPRs:
|
||||
avg: AVG(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
Insufficient SIMD SGPRs:
|
||||
avg: AVG(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
Insufficient CU LDS:
|
||||
avg: AVG(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
Insufficient CU Barriers:
|
||||
avg: AVG(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
Reached CU Workgroup Limit:
|
||||
avg: AVG(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
Reached CU Wavefront Limit:
|
||||
avg: AVG(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
min: MIN(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
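Several of the stall-rate expressions in the tables above share one shape: scale a counter, divide by a busy-cycle count times a hardware multiplier, and yield None when the busy-cycle count is zero. The following is a minimal Python sketch of that pattern. The function name `guarded_pct` and the sample counter values are assumptions made for illustration; they are not part of rocprofiler-compute, and the tool's own expression evaluator is not shown in this diff.

```python
from typing import Optional


def guarded_pct(numerator: float, busy_cycles: float, multiplier: int,
                scale: float = 100.0) -> Optional[float]:
    """Return scale * numerator / (busy_cycles * multiplier), or None.

    Mirrors expressions such as
    (100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
    if ($GRBM_SPI_BUSY_PER_XCD != 0) else None
    """
    if busy_cycles == 0:
        # No busy cycles were recorded, so the rate is undefined.
        return None
    return scale * numerator / (busy_cycles * multiplier)


# Hypothetical counter values, chosen only to make the arithmetic visible.
not_scheduled = guarded_pct(numerator=50, busy_cycles=1000, multiplier=4)
idle = guarded_pct(numerator=0, busy_cycles=0, multiplier=4)
print(not_scheduled)  # 100 * 50 / (1000 * 4) = 1.25
print(idle)           # None
```

The CU-level rows (for example Insufficient CU LDS) use a factor of 400 instead of 100 and carry no zero guard in the configuration; passing `scale=400.0` reproduces their arithmetic only.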
-142
@@ -1,142 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 700
|
||||
title: Wavefront
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 701
|
||||
title: Wavefront Launch Stats
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Grid Size:
|
||||
avg: AVG(Grid_Size)
|
||||
min: MIN(Grid_Size)
|
||||
max: MAX(Grid_Size)
|
||||
unit: Work Items
|
||||
tips:
|
||||
Workgroup Size:
|
||||
avg: AVG(Workgroup_Size)
|
||||
min: MIN(Workgroup_Size)
|
||||
max: MAX(Workgroup_Size)
|
||||
unit: Work Items
|
||||
tips:
|
||||
Total Wavefronts:
|
||||
avg: AVG(SPI_CSN_WAVE)
|
||||
min: MIN(SPI_CSN_WAVE)
|
||||
max: MAX(SPI_CSN_WAVE)
|
||||
unit: Wavefronts
|
||||
tips:
|
||||
Saved Wavefronts:
|
||||
avg: AVG(SQ_WAVES_SAVED)
|
||||
min: MIN(SQ_WAVES_SAVED)
|
||||
max: MAX(SQ_WAVES_SAVED)
|
||||
unit: Wavefronts
|
||||
tips:
|
||||
Restored Wavefronts:
|
||||
avg: AVG(SQ_WAVES_RESTORED)
|
||||
min: MIN(SQ_WAVES_RESTORED)
|
||||
max: MAX(SQ_WAVES_RESTORED)
|
||||
unit: Wavefronts
|
||||
tips:
|
||||
VGPRs:
|
||||
avg: AVG(Arch_VGPR)
|
||||
min: MIN(Arch_VGPR)
|
||||
max: MAX(Arch_VGPR)
|
||||
unit: Registers
|
||||
tips:
|
||||
AGPRs:
|
||||
avg: AVG(Accum_VGPR)
|
||||
min: MIN(Accum_VGPR)
|
||||
max: MAX(Accum_VGPR)
|
||||
unit: Registers
|
||||
tips:
|
||||
SGPRs:
|
||||
avg: AVG(SGPR)
|
||||
min: MIN(SGPR)
|
||||
max: MAX(SGPR)
|
||||
unit: Registers
|
||||
tips:
|
||||
LDS Allocation:
|
||||
avg: AVG(LDS_Per_Workgroup)
|
||||
min: MIN(LDS_Per_Workgroup)
|
||||
max: MAX(LDS_Per_Workgroup)
|
||||
unit: Bytes
|
||||
tips:
|
||||
Scratch Allocation:
|
||||
avg: AVG(Scratch_Per_Workitem)
|
||||
min: MIN(Scratch_Per_Workitem)
|
||||
max: MAX(Scratch_Per_Workitem)
|
||||
unit: Bytes/Workitem
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 702
|
||||
title: Wavefront Runtime Stats
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Kernel Time:
|
||||
avg: AVG((End_Timestamp - Start_Timestamp))
|
||||
min: MIN((End_Timestamp - Start_Timestamp))
|
||||
max: MAX((End_Timestamp - Start_Timestamp))
|
||||
unit: ns
|
||||
tips:
|
||||
Kernel Time (Cycles):
|
||||
avg: AVG($GRBM_GUI_ACTIVE_PER_XCD)
|
||||
min: MIN($GRBM_GUI_ACTIVE_PER_XCD)
|
||||
max: MAX($GRBM_GUI_ACTIVE_PER_XCD)
|
||||
unit: Cycle
|
||||
tips:
|
||||
Instructions per wavefront:
|
||||
avg: AVG((SQ_INSTS / SQ_WAVES))
|
||||
min: MIN((SQ_INSTS / SQ_WAVES))
|
||||
max: MAX((SQ_INSTS / SQ_WAVES))
|
||||
unit: Instr/wavefront
|
||||
tips:
|
||||
Wave Cycles:
|
||||
avg: AVG(((4 * SQ_WAVE_CYCLES) / $denom))
|
||||
min: MIN(((4 * SQ_WAVE_CYCLES) / $denom))
|
||||
max: MAX(((4 * SQ_WAVE_CYCLES) / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
Dependency Wait Cycles:
|
||||
avg: AVG(((4 * SQ_WAIT_ANY) / $denom))
|
||||
min: MIN(((4 * SQ_WAIT_ANY) / $denom))
|
||||
max: MAX(((4 * SQ_WAIT_ANY) / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
Issue Wait Cycles:
|
||||
avg: AVG(((4 * SQ_WAIT_INST_ANY) / $denom))
|
||||
min: MIN(((4 * SQ_WAIT_INST_ANY) / $denom))
|
||||
max: MAX(((4 * SQ_WAIT_INST_ANY) / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
Active Cycles:
|
||||
avg: AVG(((4 * SQ_ACTIVE_INST_ANY) / $denom))
|
||||
min: MIN(((4 * SQ_ACTIVE_INST_ANY) / $denom))
|
||||
max: MAX(((4 * SQ_ACTIVE_INST_ANY) / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
tips:
|
||||
Wavefront Occupancy:
|
||||
avg: AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
min: MIN((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
max: MAX((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
unit: Wavefronts
|
||||
coll_level: SQ_LEVEL_WAVES
|
||||
tips:
|
||||
+173
@@ -0,0 +1,173 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
|
||||
Panel Config:
|
||||
id: 700
|
||||
title: Wavefront
|
||||
metrics_description:
|
||||
Grid Size: The total number of work-items (or, threads) launched as a part of
|
||||
the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied
|
||||
by the total workgroup (or, block) size.
|
||||
Workgroup Size: The total number of work-items (or, threads) in each workgroup
|
||||
(or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
|
||||
to the total block size.
|
||||
Total Wavefronts: "The total number of wavefronts launched as part of the kernel\
|
||||
\ dispatch. On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs,\
|
||||
\ the wavefront size is always 64 work-items. Thus, the total number of wavefronts\
|
||||
\ should be equivalent to the ceiling of grid size divided by 64."
|
||||
Saved Wavefronts: The total number of wavefronts saved at a context-save.
|
||||
Restored Wavefronts: The total number of wavefronts restored from a context-save.
|
||||
VGPRs: 'The number of architected vector general-purpose registers allocated for
|
||||
the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
|
||||
by the compiler due to allocation granularity.'
|
||||
AGPRs: 'The number of accumulation vector general-purpose registers allocated
|
||||
for the kernel, see AGPRs. Note: this may not exactly match the number of AGPRs
|
||||
requested by the compiler due to allocation granularity.'
|
||||
SGPRs: 'The number of scalar general-purpose registers allocated for the kernel,
|
||||
see SALU. Note: this may not exactly match the number of SGPRs requested by
|
||||
the compiler due to allocation granularity.'
|
||||
LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
|
||||
for this kernel. Note: This may also be larger than what was requested at compile
|
||||
time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
|
||||
Scratch Allocation: The number of bytes of scratch memory requested per work-item
|
||||
for this kernel. Scratch memory is used for stack memory on the accelerator,
|
||||
as well as for register spills and restores.
|
||||
Kernel Time: The total duration of the executed kernel.
|
||||
Kernel Time (Cycles): The total duration of the executed kernel in cycles.
|
||||
Instructions per wavefront: The average number of instructions (of all types)
|
||||
executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.
|
||||
Wave Cycles: The number of cycles a wavefront in the kernel dispatch spent resident
|
||||
on a compute unit per normalization unit. This is averaged over all wavefronts
|
||||
in a kernel dispatch.
|
||||
Dependency Wait Cycles: The number of cycles a wavefront in the kernel dispatch
|
||||
stalled waiting on memory of any kind (e.g., instruction fetch, vector or scalar memory) per normalization unit. This is averaged over
|
||||
all wavefronts in a kernel dispatch.
|
||||
Issue Wait Cycles: The number of cycles a wavefront in the kernel dispatch was
|
||||
unable to issue an instruction for any reason (e.g., execution pipe back-pressure,
|
||||
arbitration loss, etc.) per normalization unit. This counter is incremented
|
||||
at every cycle by all wavefronts on a CU unable to issue an instruction. As
|
||||
such, it is most useful to get a sense of how waves were spending their time,
|
||||
rather than identification of a precise limiter because another wave could be
|
||||
actively executing while a wave is issue stalled. The sum of this metric, Dependency
|
||||
Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
|
||||
Active Cycles: The average number of cycles a wavefront in the kernel dispatch
|
||||
was actively executing instructions per normalization unit. This measurement
|
||||
is made on a per-wavefront basis, and may include cycles that another wavefront
|
||||
spent actively executing (on another execution unit, for example) or was stalled.
|
||||
As such, it is most useful to get a sense of how waves were spending their time,
|
||||
rather than identification of a precise limiter. The sum of this metric, Issue
|
||||
Wait Cycles and Dependency Wait Cycles should be equal to the total Wave Cycles
|
||||
metric.
|
||||
Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
|
||||
over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
||||
kernels (less than 1ms).'
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 701
|
||||
title: Wavefront Launch Stats
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Grid Size:
|
||||
avg: AVG(Grid_Size)
|
||||
min: MIN(Grid_Size)
|
||||
max: MAX(Grid_Size)
|
||||
unit: Work Items
|
||||
Workgroup Size:
|
||||
avg: AVG(Workgroup_Size)
|
||||
min: MIN(Workgroup_Size)
|
||||
max: MAX(Workgroup_Size)
|
||||
unit: Work Items
|
||||
Total Wavefronts:
|
||||
avg: AVG(SPI_CSN_WAVE)
|
||||
min: MIN(SPI_CSN_WAVE)
|
||||
max: MAX(SPI_CSN_WAVE)
|
||||
unit: Wavefronts
|
||||
Saved Wavefronts:
|
||||
avg: AVG(SQ_WAVES_SAVED)
|
||||
min: MIN(SQ_WAVES_SAVED)
|
||||
max: MAX(SQ_WAVES_SAVED)
|
||||
unit: Wavefronts
|
||||
Restored Wavefronts:
|
||||
avg: AVG(SQ_WAVES_RESTORED)
|
||||
min: MIN(SQ_WAVES_RESTORED)
|
||||
max: MAX(SQ_WAVES_RESTORED)
|
||||
unit: Wavefronts
|
||||
VGPRs:
|
||||
avg: AVG(Arch_VGPR)
|
||||
min: MIN(Arch_VGPR)
|
||||
max: MAX(Arch_VGPR)
|
||||
unit: Registers
|
||||
AGPRs:
|
||||
avg: AVG(Accum_VGPR)
|
||||
min: MIN(Accum_VGPR)
|
||||
max: MAX(Accum_VGPR)
|
||||
unit: Registers
|
||||
SGPRs:
|
||||
avg: AVG(SGPR)
|
||||
min: MIN(SGPR)
|
||||
max: MAX(SGPR)
|
||||
unit: Registers
|
||||
LDS Allocation:
|
||||
avg: AVG(LDS_Per_Workgroup)
|
||||
min: MIN(LDS_Per_Workgroup)
|
||||
max: MAX(LDS_Per_Workgroup)
|
||||
unit: Bytes
|
||||
Scratch Allocation:
|
||||
avg: AVG(Scratch_Per_Workitem)
|
||||
min: MIN(Scratch_Per_Workitem)
|
||||
max: MAX(Scratch_Per_Workitem)
|
||||
unit: Bytes/Workitem
|
||||
- metric_table:
|
||||
id: 702
|
||||
title: Wavefront Runtime Stats
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Kernel Time:
|
||||
avg: AVG((End_Timestamp - Start_Timestamp))
|
||||
min: MIN((End_Timestamp - Start_Timestamp))
|
||||
max: MAX((End_Timestamp - Start_Timestamp))
|
||||
unit: ns
|
||||
Kernel Time (Cycles):
|
||||
avg: AVG($GRBM_GUI_ACTIVE_PER_XCD)
|
||||
min: MIN($GRBM_GUI_ACTIVE_PER_XCD)
|
||||
max: MAX($GRBM_GUI_ACTIVE_PER_XCD)
|
||||
unit: Cycle
|
||||
Instructions per wavefront:
|
||||
avg: AVG((SQ_INSTS / SQ_WAVES))
|
||||
min: MIN((SQ_INSTS / SQ_WAVES))
|
||||
max: MAX((SQ_INSTS / SQ_WAVES))
|
||||
unit: Instr/wavefront
|
||||
Wave Cycles:
|
||||
avg: AVG(((4 * SQ_WAVE_CYCLES) / $denom))
|
||||
min: MIN(((4 * SQ_WAVE_CYCLES) / $denom))
|
||||
max: MAX(((4 * SQ_WAVE_CYCLES) / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
Dependency Wait Cycles:
|
||||
avg: AVG(((4 * SQ_WAIT_ANY) / $denom))
|
||||
min: MIN(((4 * SQ_WAIT_ANY) / $denom))
|
||||
max: MAX(((4 * SQ_WAIT_ANY) / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
Issue Wait Cycles:
|
||||
avg: AVG(((4 * SQ_WAIT_INST_ANY) / $denom))
|
||||
min: MIN(((4 * SQ_WAIT_INST_ANY) / $denom))
|
||||
max: MAX(((4 * SQ_WAIT_INST_ANY) / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
Active Cycles:
|
||||
avg: AVG(((4 * SQ_ACTIVE_INST_ANY) / $denom))
|
||||
min: MIN(((4 * SQ_ACTIVE_INST_ANY) / $denom))
|
||||
max: MAX(((4 * SQ_ACTIVE_INST_ANY) / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
Wavefront Occupancy:
|
||||
avg: AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
min: MIN((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
max: MAX((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
unit: Wavefronts
|
||||
coll_level: SQ_LEVEL_WAVES
|
||||
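The generated file above pairs each metric name in the metric tables with an entry under metrics_description. A quick consistency check can be sketched in Python as follows. This is an illustration only: the function name `metrics_missing_description` is hypothetical, the nesting of `metrics_description` under `Panel Config` is assumed (indentation is lost in this view), and the sample dictionary is a hand-written stand-in for a parsed YAML file, not the project's own test.

```python
from typing import Dict, List


def metrics_missing_description(panel: Dict) -> List[str]:
    """List metric names that appear in a metric table but have no description."""
    config = panel["Panel Config"]
    described = set(config.get("metrics_description", {}))
    missing = []
    for source in config.get("data source", []):
        table = source.get("metric_table", {})
        for name in table.get("metric", {}):
            if name not in described:
                missing.append(name)
    return missing


# Hand-written stand-in for a parsed panel file (assumed layout).
panel = {
    "Panel Config": {
        "id": 700,
        "title": "Wavefront",
        "metrics_description": {
            "Grid Size": "Total work-items launched.",
            "Workgroup Size": "Work-items per workgroup.",
        },
        "data source": [
            {
                "metric_table": {
                    "id": 701,
                    "title": "Wavefront Launch Stats",
                    "metric": {
                        "Grid Size": {"unit": "Work Items"},
                        "Workgroup Size": {"unit": "Work Items"},
                        "Total Wavefronts": {"unit": "Wavefronts"},
                    },
                }
            }
        ],
    }
}

print(metrics_missing_description(panel))  # ['Total Wavefronts']
```

An empty result would indicate that every metric in the panel has a matching description key.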
-277
@@ -1,277 +0,0 @@
|
||||
---
|
||||
# Add description/tips for each metric in this section.
|
||||
# So it could be shown in hover.
|
||||
Metric Description:
|
||||
|
||||
# Define the panel properties and properties of each metric in the panel.
|
||||
Panel Config:
|
||||
id: 1000
|
||||
title: Compute Units - Instruction Mix
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1001
|
||||
title: Overall Instruction Mix
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
VALU:
|
||||
avg: AVG(((SQ_INSTS_VALU - SQ_INSTS_MFMA) / $denom))
|
||||
min: MIN(((SQ_INSTS_VALU - SQ_INSTS_MFMA) / $denom))
|
||||
max: MAX(((SQ_INSTS_VALU - SQ_INSTS_MFMA) / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
VMEM:
|
||||
# TODO: need to fix this when the new FLAT/LDS counts
|
||||
# are present in ROCm
|
||||
avg: AVG(((SQ_INSTS_VMEM) / $denom))
|
||||
min: MIN(((SQ_INSTS_VMEM) / $denom))
|
||||
max: MAX(((SQ_INSTS_VMEM) / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
LDS:
|
||||
# TODO: need to fix this when the new FLAT/LDS counts
|
||||
# are present in ROCm
|
||||
avg: AVG((SQ_INSTS_LDS / $denom))
|
||||
min: MIN((SQ_INSTS_LDS / $denom))
|
||||
max: MAX((SQ_INSTS_LDS / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
MFMA:
|
||||
avg: AVG((SQ_INSTS_MFMA / $denom))
|
||||
min: MIN((SQ_INSTS_MFMA / $denom))
|
||||
max: MAX((SQ_INSTS_MFMA / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
SALU:
|
||||
avg: AVG((SQ_INSTS_SALU / $denom))
|
||||
min: MIN((SQ_INSTS_SALU / $denom))
|
||||
max: MAX((SQ_INSTS_SALU / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
SMEM:
|
||||
avg: AVG((SQ_INSTS_SMEM / $denom))
|
||||
min: MIN((SQ_INSTS_SMEM / $denom))
|
||||
max: MAX((SQ_INSTS_SMEM / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
Branch:
|
||||
avg: AVG((SQ_INSTS_BRANCH / $denom))
|
||||
min: MIN((SQ_INSTS_BRANCH / $denom))
|
||||
max: MAX((SQ_INSTS_BRANCH / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 1002
|
||||
title: VALU Arithmetic Instr Mix
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
INT32:
|
||||
avg: AVG((SQ_INSTS_VALU_INT32 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_INT32 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_INT32 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
INT64:
|
||||
avg: AVG((SQ_INSTS_VALU_INT64 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_INT64 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_INT64 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
F16-ADD:
|
||||
avg: AVG((SQ_INSTS_VALU_ADD_F16 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_ADD_F16 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_ADD_F16 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
F16-MUL:
|
||||
avg: AVG((SQ_INSTS_VALU_MUL_F16 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_MUL_F16 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_MUL_F16 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
F16-FMA:
|
||||
avg: AVG((SQ_INSTS_VALU_FMA_F16 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_FMA_F16 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_FMA_F16 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
F16-Trans:
|
||||
avg: AVG((SQ_INSTS_VALU_TRANS_F16 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_TRANS_F16 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_TRANS_F16 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
F32-ADD:
|
||||
avg: AVG((SQ_INSTS_VALU_ADD_F32 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_ADD_F32 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_ADD_F32 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
F32-MUL:
|
||||
avg: AVG((SQ_INSTS_VALU_MUL_F32 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_MUL_F32 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_MUL_F32 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
F32-FMA:
|
||||
avg: AVG((SQ_INSTS_VALU_FMA_F32 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_FMA_F32 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_FMA_F32 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
F32-Trans:
|
||||
avg: AVG((SQ_INSTS_VALU_TRANS_F32 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_TRANS_F32 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_TRANS_F32 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
F64-ADD:
|
||||
avg: AVG((SQ_INSTS_VALU_ADD_F64 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_ADD_F64 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_ADD_F64 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
F64-MUL:
|
||||
avg: AVG((SQ_INSTS_VALU_MUL_F64 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_MUL_F64 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_MUL_F64 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
F64-FMA:
|
||||
avg: AVG((SQ_INSTS_VALU_FMA_F64 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_FMA_F64 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_FMA_F64 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
F64-Trans:
|
||||
avg: AVG((SQ_INSTS_VALU_TRANS_F64 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_TRANS_F64 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_TRANS_F64 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
Conversion:
|
||||
avg: AVG((SQ_INSTS_VALU_CVT / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_CVT / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_CVT / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 1003
|
||||
title: VMEM Instr Mix
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
Global/Generic Instr:
|
||||
avg: AVG((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
Global/Generic Read:
|
||||
avg: AVG((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
Global/Generic Write:
|
||||
avg: AVG((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
Global/Generic Atomic:
|
||||
avg: AVG((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
Spill/Stack Instr:
|
||||
avg: AVG((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
Spill/Stack Read:
|
||||
avg: AVG((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
Spill/Stack Write:
|
||||
avg: AVG((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
Spill/Stack Atomic:
|
||||
avg: AVG((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
|
||||
- metric_table:
|
||||
id: 1004
|
||||
title: MFMA Arithmetic Instr Mix
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
tips: Tips
|
||||
metric:
|
||||
MFMA-I8:
|
||||
avg: AVG((SQ_INSTS_VALU_MFMA_I8 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_MFMA_I8 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_MFMA_I8 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
MFMA-F8:
|
||||
avg: AVG((SQ_INSTS_VALU_MFMA_F8 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_MFMA_F8 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_MFMA_F8 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
MFMA-F16:
|
||||
avg: AVG((SQ_INSTS_VALU_MFMA_F16 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_MFMA_F16 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_MFMA_F16 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
MFMA-BF16:
|
||||
avg: AVG((SQ_INSTS_VALU_MFMA_BF16 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_MFMA_BF16 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_MFMA_BF16 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
MFMA-F32:
|
||||
avg: AVG((SQ_INSTS_VALU_MFMA_F32 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_MFMA_F32 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_MFMA_F32 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||
MFMA-F64:
|
||||
avg: AVG((SQ_INSTS_VALU_MFMA_F64 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_MFMA_F64 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_MFMA_F64 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
tips:
|
||||