[rocprofiler-compute] metrics generator (#1199)
@@ -7,12 +7,23 @@ repos:
|
||||
- id: check-yaml
|
||||
- id: end-of-file-fixer
|
||||
- id: trailing-whitespace
|
||||
# Python import sorting and formatting
|
||||
|
||||
# Python import sorting and formatting
|
||||
- repo: https://github.com/astral-sh/ruff-pre-commit
|
||||
# Ruff version. Check https://github.com/astral-sh/ruff-pre-commit#version-compatibility,
|
||||
# Ruff version. Check https://github.com/astral-sh/ruff-pre-commit#version-compatibility
|
||||
# for the latest ruff version supported by the hook.
|
||||
rev: v0.12.12
|
||||
hooks:
|
||||
- id: ruff-check
|
||||
args: [--fix, --exit-non-zero-on-fix]
|
||||
- id: ruff-format
|
||||
args: [--fix]
|
||||
- id: ruff-format
|
||||
|
||||
# Local hook: hash consistency check
|
||||
- repo: local
|
||||
hooks:
|
||||
- id: hash-check
|
||||
name: Hash consistency check
|
||||
entry: bash -lc 'cd projects/rocprofiler-compute && python3 tools/config_management/hash_checker.py'
|
||||
language: system
|
||||
pass_filenames: false
|
||||
stages: [pre-commit]
|
||||
|
||||
@@ -5,8 +5,12 @@ Full documentation for ROCm Compute Profiler is available at [https://rocm.docs.
|
||||
## Unreleased
|
||||
|
||||
### Added
|
||||
* Added `--list-blocks <arch>` option to general options to list the available IP blocks on the specified arch (similar to `--list-metrics`). It cannot be used with `--block`.
|
||||
* Added `config_delta/gfx950_diff.yaml` to the analysis config YAMLs to track the differences between a gfx9 architecture and the latest supported architecture, gfx950.
|
||||
|
||||
### Changed
|
||||
* `-b/--block` now accepts block alias(es). List the available aliases with the command-line option `--list-blocks <arch>`.
|
||||
* Analysis config YAMLs are now managed with the new config management workflow in `tools/config_management/`.
|
||||
|
||||
### Removed
|
||||
|
||||
|
||||
@@ -400,18 +400,6 @@ add_test(
|
||||
WORKING_DIRECTORY ${PROJECT_SOURCE_DIR}
|
||||
)
|
||||
|
||||
# ---------------------------
|
||||
# DB Connector tests
|
||||
# ---------------------------
|
||||
|
||||
add_test(
|
||||
NAME test_db_connector
|
||||
COMMAND
|
||||
${Python3_EXECUTABLE} -m pytest --junitxml=tests/test_db_connector.xml
|
||||
${COV_OPTION} ${PROJECT_SOURCE_DIR}/tests/test_db_connector.py
|
||||
WORKING_DIRECTORY ${PROJECT_SOURCE_DIR}
|
||||
)
|
||||
|
||||
# ---------------------------
|
||||
# Utils tests
|
||||
# ---------------------------
|
||||
@@ -547,6 +535,13 @@ install(
|
||||
COMPONENT main
|
||||
PATTERN "__pycache__" EXCLUDE
|
||||
)
|
||||
# tools/config_management
|
||||
install(
|
||||
DIRECTORY tools/config_management
|
||||
DESTINATION ${CMAKE_INSTALL_LIBEXECDIR}/${PROJECT_NAME}
|
||||
COMPONENT main
|
||||
PATTERN "__pycache__" EXCLUDE
|
||||
)
|
||||
# grafana assets
|
||||
install(
|
||||
DIRECTORY grafana
|
||||
@@ -586,10 +581,10 @@ install(
|
||||
add_custom_target(
|
||||
license
|
||||
COMMAND
|
||||
${PROJECT_SOURCE_DIR}/utils/update_license.py --source ${PROJECT_SOURCE_DIR}/src
|
||||
${PROJECT_SOURCE_DIR}/tools/update_license.py --source ${PROJECT_SOURCE_DIR}/src
|
||||
--license ${PROJECT_SOURCE_DIR}/LICENSE.md --extension '.py'
|
||||
COMMAND
|
||||
${PROJECT_SOURCE_DIR}/utils/update_license.py --source ${PROJECT_SOURCE_DIR}
|
||||
${PROJECT_SOURCE_DIR}/tools/update_license.py --source ${PROJECT_SOURCE_DIR}
|
||||
--license ${PROJECT_SOURCE_DIR}/LICENSE.md --file
|
||||
"src/${PACKAGE_NAME},cmake/Dockerfile,cmake/rocm_install.sh,docker/docker-entrypoint.sh,src/rocprof_compute_analyze/convertor/mongodb/convert"
|
||||
)
|
||||
|
||||
@@ -190,4 +190,13 @@ Any future contributions should adhere to these guidelines:
|
||||
|
||||
### Build and test documentation changes
|
||||
|
||||
For instructions on how to build and test documentation changes (files under docs folder), please see https://rocm.docs.amd.com/en/latest/contribute/contributing.html
|
||||
For instructions on how to build and test documentation changes (files under docs folder), please see https://rocm.docs.amd.com/en/latest/contribute/contributing.html
|
||||
|
||||
|
||||
## Metrics Management
|
||||
|
||||
If your PR touches **metric configs** (panel YAMLs under `src/rocprof_compute_soc/analysis_configs/gfx<arch>/*.yaml`, config deltas, or metric descriptions in `docs/data/metrics_description.yaml`), please follow the metric management workflow summarized here:
|
||||
- Edit the panel YAMLs and, when appropriate, generate/apply a delta and (optionally) promote a new architecture using the [workflow script](./tools/config_management/master_config_workflow_script.py).
|
||||
- Verify hashes are updated and CI tests pass.
|
||||
|
||||
For full details, see the [metric config management README](./tools/config_management/README.md).
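
As an illustration of what a hash consistency check involves, the sketch below recomputes SHA-256 digests for the YAML files under a config directory and compares them with a stored manifest. It is not the implementation in `tools/config_management/hash_checker.py`: the manifest format (a dict mapping relative path to hex digest) and the function names `compute_hashes` and `find_stale` are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def compute_hashes(config_dir: Path) -> dict[str, str]:
    """Map each YAML file under config_dir (by relative path) to its SHA-256 digest."""
    return {
        str(path.relative_to(config_dir)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(config_dir.rglob("*.yaml"))
    }


def find_stale(config_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return files whose digest differs from the manifest, plus files the manifest lists but that no longer exist."""
    current = compute_hashes(config_dir)
    changed = [name for name, digest in current.items() if manifest.get(name) != digest]
    removed = [name for name in manifest if name not in current]
    return sorted(changed + removed)
```

An empty list from `find_stale` means every tracked file still matches its recorded digest; any other result names the files whose hashes need to be regenerated.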
|
||||
|
||||
@@ -13,7 +13,7 @@ monorepo/
|
||||
│ ├── CMakeLists.txt
|
||||
│ ├── coverage/
|
||||
│ │ └── coverage-latest.xml # committed coverage file
|
||||
│ ├── utils/
|
||||
│ ├── tools/
|
||||
│ │ ├── update_coverage.sh # coverage generation/update script
|
||||
│ │ └── run-ci.py # CDash upload script
|
||||
│ └── ...
|
||||
@@ -31,7 +31,7 @@ Run this periodically to update the coverage baseline:
|
||||
```bash
|
||||
# From monorepo root
|
||||
cd projects/rocprofiler-compute
|
||||
./utils/update_coverage.sh
|
||||
./tools/update_coverage.sh
|
||||
|
||||
# This will:
|
||||
# - Build with coverage enabled
|
||||
@@ -74,4 +74,4 @@ pip install coverage pytest pytest-cov
|
||||
#verify tests can run
|
||||
cd projects/rocprofiler-compute/build
|
||||
ctest --verbose
|
||||
```
|
||||
```
|
||||
|
||||
File diff suppressed because it is too large.
@@ -19,7 +19,7 @@ This section provides an overview of ROCm Compute Profiler's CLI analysis featur
|
||||
* :ref:`Filtering <cli-analysis-options>`: Home in on a particular kernel,
|
||||
GPU ID, or dispatch ID via post-process filtering.
|
||||
|
||||
* :ref:`Per-kernel roofline analysis <per-kernel-roofline>`: Detailed arithmetic
|
||||
* :ref:`Per-kernel roofline analysis <per-kernel-roofline>`: Detailed arithmetic
|
||||
intensity and performance analysis for individual kernels.
|
||||
|
||||
Run ``rocprof-compute analyze -h`` for more details.
|
||||
@@ -214,6 +214,90 @@ There are three high-level GPU analysis views:
|
||||
│ 2.1.28 │ Instr Fetch Latency │ 21.729248046875 │ Cycles │ │ │
|
||||
╘═════════╧═══════════════════════════╧═══════════════════════╧══════════════════╧════════════════════╧════════════════════════╛
|
||||
|
||||
Alternatively, use the option ``-b`` (or ``--block``) with block alias(es).
|
||||
The following snippet shows how to generate a report containing only metric 2 by using its block alias, ``sol``:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ rocprof-compute analyze -p workloads/vcopy/MI200/ -b sol
|
||||
|
||||
--------
|
||||
Analyze
|
||||
--------
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
1. Top Stat
|
||||
╒════╤══════════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
|
||||
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
|
||||
╞════╪══════════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╡
|
||||
│ 0 │ vecCopy(double*, double*, double*, int, │ 1 │ 20000.00 │ 20000.00 │ 20000.00 │ 100.00 │
|
||||
│ │ int) [clone .kd] │ │ │ │ │ │
|
||||
╘════╧══════════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╛
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
2. System Speed-of-Light
|
||||
╒═════════╤═══════════════════════════╤═══════════════════════╤══════════════════╤════════════════════╤════════════════════════╕
|
||||
│ Index │ Metric │ Value │ Unit │ Peak │ PoP │
|
||||
╞═════════╪═══════════════════════════╪═══════════════════════╪══════════════════╪════════════════════╪════════════════════════╡
|
||||
│ 2.1.0 │ VALU FLOPs │ 0.0 │ Gflop │ 22630.4 │ 0.0 │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.1 │ VALU IOPs │ 367.0016 │ Giop │ 22630.4 │ 1.6217194570135745 │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.2 │ MFMA FLOPs (BF16) │ 0.0 │ Gflop │ 90521.6 │ 0.0 │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.3 │ MFMA FLOPs (F16) │ 0.0 │ Gflop │ 181043.2 │ 0.0 │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.4 │ MFMA FLOPs (F32) │ 0.0 │ Gflop │ 45260.8 │ 0.0 │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.5 │ MFMA FLOPs (F64) │ 0.0 │ Gflop │ 45260.8 │ 0.0 │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.6 │ MFMA IOPs (Int8) │ 0.0 │ Giop │ 181043.2 │ 0.0 │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.7 │ Active CUs │ 74 │ Cus │ 104 │ 71.15384615384616 │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.8 │ SALU Util │ 4.016057506716307 │ Pct │ 100 │ 4.016057506716307 │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.9 │ VALU Util │ 5.737225009594725 │ Pct │ 100 │ 5.737225009594725 │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.10 │ MFMA Util │ 0.0 │ Pct │ 100 │ 0.0 │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.11 │ VALU Active Threads/Wave │ 64.0 │ Threads │ 64 │ 100.0 │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.12 │ IPC - Issue │ 1.0 │ Instr/cycle │ 5 │ 20.0 │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.13 │ LDS BW │ 0.0 │ Gb/sec │ 22630.4 │ 0.0 │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.14 │ LDS Bank Conflict │ │ Conflicts/access │ 32 │ │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.15 │ Instr Cache Hit Rate │ 99.91306912556854 │ Pct │ 100 │ 99.91306912556854 │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.16 │ Instr Cache BW │ 209.7152 │ Gb/s │ 6092.8 │ 3.442016806722689 │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.17 │ Scalar L1D Cache Hit Rate │ 99.81986908342313 │ Pct │ 100 │ 99.81986908342313 │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.18 │ Scalar L1D Cache BW │ 209.7152 │ Gb/s │ 6092.8 │ 3.442016806722689 │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.19 │ Vector L1D Cache Hit Rate │ 50.0 │ Pct │ 100 │ 50.0 │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.20 │ Vector L1D Cache BW │ 1677.7216 │ Gb/s │ 11315.199999999999 │ 14.82714932126697 │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.21 │ L2 Cache Hit Rate │ 35.55067615693325 │ Pct │ 100 │ 35.55067615693325 │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.22 │ L2-Fabric Read BW │ 419.8496 │ Gb/s │ 1638.4 │ 25.6255859375 │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.23 │ L2-Fabric Write BW │ 293.9456 │ Gb/s │ 1638.4 │ 17.941015625 │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.24 │ L2-Fabric Read Latency │ 256.6482321288385 │ Cycles │ │ │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.25 │ L2-Fabric Write Latency │ 317.2264255699014 │ Cycles │ │ │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.26 │ Wave Occupancy │ 1821.723057333852 │ Wavefronts │ 3328 │ 54.73927455931046 │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.27 │ Instr Fetch BW │ 4.174722306564298e-08 │ Gb/s │ 3046.4 │ 1.3703789084047721e-09 │
|
||||
├─────────┼───────────────────────────┼───────────────────────┼──────────────────┼────────────────────┼────────────────────────┤
|
||||
│ 2.1.28 │ Instr Fetch Latency │ 21.729248046875 │ Cycles │ │ │
|
||||
╘═════════╧═══════════════════════════╧═══════════════════════╧══════════════════╧════════════════════╧════════════════════════╛
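
In the table above, the PoP column is consistent with the Value column expressed as a percentage of the Peak column (for example, 74 active CUs out of 104). The sketch below reproduces that arithmetic. The helper name `percent_of_peak` is illustrative, and returning `None` for rows without a peak is an assumption that mirrors the blank cells.

```python
from typing import Optional


def percent_of_peak(value: Optional[float], peak: Optional[float]) -> Optional[float]:
    """Return value as a percentage of peak, or None when either is unavailable."""
    if value is None or peak is None or peak == 0:
        return None
    return 100.0 * value / peak


# Rows taken from the System Speed-of-Light table above.
print(percent_of_peak(74, 104))                  # Active CUs
print(percent_of_peak(419.8496, 1638.4))         # L2-Fabric Read BW
print(percent_of_peak(256.6482321288385, None))  # latency rows have no peak
```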
|
||||
.. note::
|
||||
|
||||
Some cells may be blank indicating a missing or unavailable hardware
|
||||
@@ -245,6 +329,11 @@ List metrics
|
||||
|
||||
$ rocprof-compute analyze -p workloads/vcopy/MI200/ --list-metrics gfx90a
|
||||
|
||||
List IP blocks
|
||||
.. code-block:: shell
|
||||
|
||||
$ rocprof-compute analyze -p workloads/vcopy/MI200/ --list-blocks gfx90a
|
||||
|
||||
Show the Description column, which is excluded by default in CLI output
|
||||
.. code-block:: shell
|
||||
|
||||
|
||||
@@ -261,7 +261,7 @@ detailed description of profiling filters available when using ROCm Compute Prof
|
||||
Filtering options
|
||||
-----------------
|
||||
|
||||
``-b``, ``--block <block-name>``
|
||||
``-b``, ``--block <block-id|block-alias|metric-id>``
|
||||
Allows system profiling on one or more selected analysis report blocks to speed
|
||||
up the profiling process. See :ref:`profiling-hw-component-filtering`.
|
||||
Note that this option cannot be used with ``--roof-only`` or ``--set``.
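
The three accepted token forms for ``--block`` can be told apart with a simple check. The sketch below uses the metric-id pattern of one to three dot-separated groups of one or two digits. The function name `classify_block_token` and the returned labels are illustrative only and are not part of the tool.

```python
import re

# One to three dot-separated groups of one or two digits, e.g. 12, 12.1, 12.1.1.
METRIC_ID_RE = re.compile(r"^\d{1,2}(?:\.\d{1,2}){0,2}$")


def classify_block_token(token: str) -> str:
    """Label a --block token as a block id, a metric id, or a block alias."""
    token = token.strip()
    if not token:
        raise ValueError("empty token for --block")
    if METRIC_ID_RE.match(token):
        # A bare number names a whole block; dotted forms name a table or metric.
        return "metric-id" if "." in token else "block-id"
    # Anything else is assumed to be an alias and must be resolved later.
    return "block-alias"
```

Under this scheme an alias is never validated at parse time; whether it names a real block can only be decided once the architecture's panel configs are loaded.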
|
||||
|
||||
@@ -70,6 +70,13 @@ to view the metrics for current system architecture:
|
||||
$ rocprof-compute --list-metrics <sys_arch>
|
||||
$ rocprof-compute profile --list-available-metrics
|
||||
|
||||
To view available aliases by hardware block, use the ``--list-blocks``
|
||||
option with a system architecture argument:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ rocprof-compute --list-blocks <sys_arch>
|
||||
|
||||
.. _basic-analyze-cli:
|
||||
|
||||
Analyze in the command line
|
||||
|
||||
@@ -25,13 +25,30 @@
|
||||
|
||||
import argparse
|
||||
import os
|
||||
import re
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
from utils.utils import METRIC_ID_RE
|
||||
|
||||
def print_avail_arch(avail_arch: list[str]) -> str:
|
||||
ret_str = "List all available metrics for analysis on specified arch:"
|
||||
|
||||
def validate_block(value: str) -> str:
|
||||
if METRIC_ID_RE.match(value):
|
||||
return value
|
||||
raise argparse.ArgumentTypeError(f"Invalid metric id: {value}")
|
||||
|
||||
|
||||
def block_token_or_alias(s: str) -> str:
|
||||
try:
|
||||
return validate_block(s)
|
||||
except argparse.ArgumentTypeError:
|
||||
s = (s or "").strip()
|
||||
if not s:
|
||||
raise argparse.ArgumentTypeError("empty token for --block")
|
||||
return s
|
||||
|
||||
|
||||
def print_avail_arch(avail_arch: list[str], args: str) -> str:
|
||||
ret_str = f"List all available {args} for analysis on specified arch:"
|
||||
for arch in avail_arch:
|
||||
ret_str += f"\n {arch}"
|
||||
return ret_str
|
||||
@@ -66,7 +83,14 @@ def add_general_group(
|
||||
dest="list_metrics",
|
||||
metavar="",
|
||||
choices=supported_archs.keys(), # ["gfx908", "gfx90a"],
|
||||
help=print_avail_arch(list(supported_archs.keys())),
|
||||
help=print_avail_arch(list(supported_archs.keys()), "metrics"),
|
||||
)
|
||||
general_group.add_argument(
|
||||
"--list-blocks",
|
||||
dest="list_blocks",
|
||||
metavar="",
|
||||
choices=supported_archs.keys(), # ["gfx908", "gfx90a"],
|
||||
help=print_avail_arch(list(supported_archs.keys()), "blocks"),
|
||||
)
|
||||
general_group.add_argument(
|
||||
"--config-dir",
|
||||
@@ -234,12 +258,6 @@ Examples:
|
||||
help="\t\t\tDispatch ID filtering.",
|
||||
)
|
||||
|
||||
def validate_block(value: str) -> str:
|
||||
# Metric id is of the form I or I.I or I.I.I where I is two digit number.
|
||||
if re.compile(r"^\d{1,2}(?:\.\d{1,2}){0,2}$").match(value):
|
||||
return value
|
||||
raise argparse.ArgumentTypeError(f"Invalid metric id: {value}")
|
||||
|
||||
profile_group.add_argument(
|
||||
"--list-available-metrics",
|
||||
dest="list_available_metrics",
|
||||
@@ -249,15 +267,19 @@ Examples:
|
||||
profile_group.add_argument(
|
||||
"-b",
|
||||
"--block",
|
||||
type=validate_block,
|
||||
dest="filter_blocks",
|
||||
metavar="",
|
||||
nargs="+",
|
||||
type=block_token_or_alias,
|
||||
required=False,
|
||||
default=[],
|
||||
help=(
|
||||
"\t\t\tSpecify metric id(s) from --list-metrics for filtering "
|
||||
"(e.g. 12, 12.1, 12.1.1).\n"
|
||||
"\t\t\tAlternatively, specify block id(s) for filtering "
|
||||
"(e.g. 12, 13, 14).\n"
|
||||
"\t\t\tAlternatively, specify block alias(es) for filtering "
|
||||
"(e.g. lds, l1i, sl1d).\n"
|
||||
"\t\t\tCan provide multiple space separated arguments.\n"
|
||||
"\t\t\tCannot be used with --set or --roof-only"
|
||||
),
|
||||
@@ -656,6 +678,7 @@ Examples:
|
||||
dest="filter_metrics",
|
||||
metavar="",
|
||||
nargs="+",
|
||||
type=block_token_or_alias,
|
||||
help="\t\tSpecify metric id(s) from --list-metrics for filtering.",
|
||||
)
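
A self-contained sketch of how a `type=` callable combines with `nargs="+"` in `argparse`: the callable runs once per space-separated token, so ids and aliases can be mixed in one invocation. The parser below is a stand-alone demo (the `prog` name and the sample arguments are made up), and `METRIC_ID_RE` is redefined locally rather than imported from `utils.utils`.

```python
import argparse
import re

METRIC_ID_RE = re.compile(r"^\d{1,2}(?:\.\d{1,2}){0,2}$")


def block_token_or_alias(s: str) -> str:
    """Accept a metric or block id as-is; otherwise treat the token as an alias."""
    if METRIC_ID_RE.match(s):
        return s
    s = s.strip()
    if not s:
        raise argparse.ArgumentTypeError("empty token for --block")
    return s


parser = argparse.ArgumentParser(prog="demo")
parser.add_argument(
    "-b",
    "--block",
    dest="filter_blocks",
    nargs="+",
    type=block_token_or_alias,  # called once for each token
    default=[],
)

args = parser.parse_args(["-b", "12.1", "lds", " sol "])
print(args.filter_blocks)
```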
|
||||
analyze_group.add_argument(
|
||||
|
||||
@@ -45,7 +45,12 @@ from utils.logger import (
|
||||
console_warning,
|
||||
demarcate,
|
||||
)
|
||||
from utils.utils import get_uuid, is_workload_empty, merge_counters_spatial_multiplex
|
||||
from utils.utils import (
|
||||
get_panel_alias,
|
||||
get_uuid,
|
||||
is_workload_empty,
|
||||
merge_counters_spatial_multiplex,
|
||||
)
|
||||
|
||||
# The built-in config, used only for listing kernel names
|
||||
TOP_STATS_BUILD_IN_CONFIG: OrderedDict[int, dict[str, Any]] = OrderedDict([
|
||||
@@ -160,21 +165,41 @@ class OmniAnalyze_Base:
|
||||
}
|
||||
for key, value in self._arch_configs[arch].metric_list.items():
|
||||
dot_count = str(key).count(".")
|
||||
if dot_count == 0:
|
||||
prefix = ""
|
||||
elif dot_count == 1:
|
||||
prefix = "\t"
|
||||
else:
|
||||
prefix = "\t\t"
|
||||
indent = "\t" * min(dot_count, 2)
|
||||
|
||||
description = metric_descriptions.get(key, "") if dot_count > 1 else ""
|
||||
print(f"{indent}{key} -> {value}\n")
|
||||
|
||||
print(f"{prefix}{key} -> {value}\n")
|
||||
if description:
|
||||
formatted_desc = f"\n{prefix}".join(
|
||||
textwrap.wrap(description, width=40)
|
||||
)
|
||||
print(f"{prefix}{formatted_desc}\n")
|
||||
if dot_count > 1:
|
||||
description = metric_descriptions.get(key, "")
|
||||
if description:
|
||||
wrapped = textwrap.wrap(description, width=40)
|
||||
print(f"{indent}" + f"\n{indent}".join(wrapped) + "\n")
|
||||
|
||||
sys.exit(0)
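
The listing logic in this hunk indents each entry by its nesting depth (the number of dots in the key, capped at two tabs) and wraps descriptions for leaf metrics at 40 columns. The stand-alone sketch below isolates that formatting. The function name `format_metric_entry` and the sample entries are illustrative, and it returns a string instead of printing so it can be checked easily.

```python
import textwrap


def format_metric_entry(key: str, value: str, description: str = "") -> str:
    """Indent by nesting depth (dots in the key) and wrap any description."""
    dot_count = key.count(".")
    indent = "\t" * min(dot_count, 2)
    lines = [f"{indent}{key} -> {value}"]
    # Only leaf metrics (two or more dots) carry a description.
    if dot_count > 1 and description:
        lines += [f"{indent}{line}" for line in textwrap.wrap(description, width=40)]
    return "\n".join(lines)


print(format_metric_entry("2", "System Speed-of-Light"))
print(format_metric_entry("2.1", "Speed-of-Light"))
print(
    format_metric_entry(
        "2.1.7",
        "Active CUs",
        "Total number of active compute units (CUs) on the accelerator "
        "during the kernel execution.",
    )
)
```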
|
||||
|
||||
@demarcate
|
||||
def list_blocks(self) -> None:
|
||||
args = self.get_args()
|
||||
arch = args.list_blocks
|
||||
|
||||
if arch not in self.__supported_archs:
|
||||
console_error("analysis", "Unsupported arch")
|
||||
if arch not in self._arch_configs:
|
||||
sys_info = file_io.load_sys_info(f"{args.path[0][0]}/sysinfo.csv")
|
||||
self.generate_configs(
|
||||
arch,
|
||||
args.config_dir,
|
||||
args.list_stats,
|
||||
args.filter_metrics,
|
||||
sys_info.iloc[0],
|
||||
)
|
||||
|
||||
print(f"{'INDEX':<8} {'BLOCK ALIAS':<16} {'BLOCK NAME'}")
|
||||
for key, value in self._arch_configs[arch].metric_list.items():
|
||||
panel_alias_dict = get_panel_alias()
|
||||
if key.count(".") > 0:
|
||||
continue
|
||||
print(f"{key:<8} {panel_alias_dict[value]:<16} {value}")
|
||||
|
||||
sys.exit(0)
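
The block listing keeps only top-level entries (keys without a dot) and prints them in fixed-width columns. A minimal sketch of that step follows, assuming a `metric_list` mapping of id to name and an alias mapping of name to alias. The function name `format_block_table` is illustrative, and falling back to an empty alias instead of raising `KeyError` is a choice made for this example, not necessarily the tool's behavior.

```python
def format_block_table(metric_list: dict[str, str], aliases: dict[str, str]) -> list[str]:
    """Build the INDEX / BLOCK ALIAS / BLOCK NAME rows for top-level blocks only."""
    rows = [f"{'INDEX':<8} {'BLOCK ALIAS':<16} {'BLOCK NAME'}"]
    for key, name in metric_list.items():
        if "." in key:  # skip tables and metrics nested under a block
            continue
        # Use an empty alias when the block name has no alias entry.
        rows.append(f"{key:<8} {aliases.get(name, ''):<16} {name}")
    return rows


# Illustrative data: only the "sol" alias for block 2 appears in the documentation.
sample_metrics = {"2": "System Speed-of-Light", "2.1": "Speed-of-Light"}
sample_aliases = {"System Speed-of-Light": "sol"}
print("\n".join(format_block_table(sample_metrics, sample_aliases)))
```

Building the alias mapping once and passing it in avoids looking it up again on every loop iteration.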
|
||||
|
||||
@@ -208,6 +233,9 @@ class OmniAnalyze_Base:
|
||||
if args.list_metrics:
|
||||
self.list_metrics()
|
||||
|
||||
if args.list_blocks:
|
||||
self.list_blocks()
|
||||
|
||||
def get_sysinfo_path(data_path: str) -> Optional[str]:
|
||||
return (
|
||||
data_path
|
||||
|
||||
@@ -49,6 +49,7 @@ from utils.mi_gpu_spec import mi_gpu_specs
|
||||
from utils.specs import MachineSpecs, generate_machine_specs
|
||||
from utils.utils import (
|
||||
detect_rocprof,
|
||||
get_panel_alias,
|
||||
get_submodules,
|
||||
get_version,
|
||||
get_version_display,
|
||||
@@ -142,6 +143,8 @@ class RocProfCompute:
|
||||
|
||||
if self.__args.list_metrics is not None and block:
|
||||
console_error("Cannot use --list-metrics with --block")
|
||||
if self.__args.list_blocks is not None and block:
|
||||
console_error("Cannot use --list-blocks with --block")
|
||||
if (
|
||||
hasattr(self.__args, "list_available_metrics")
|
||||
and self.__args.list_available_metrics
|
||||
@@ -194,6 +197,9 @@ class RocProfCompute:
|
||||
elif self.__args.list_metrics is not None:
|
||||
self.list_metrics()
|
||||
sys.exit(0)
|
||||
elif self.__args.list_blocks is not None:
|
||||
self.list_blocks()
|
||||
sys.exit(0)
|
||||
elif self.__args.config_dir:
|
||||
parser.print_help(sys.stderr)
|
||||
console_error(
|
||||
@@ -250,6 +256,34 @@ class RocProfCompute:
|
||||
else:
|
||||
console_error("Unsupported arch")
|
||||
|
||||
@demarcate
|
||||
def list_blocks(self) -> None:
|
||||
for_current_arch = getattr(self.__args, "list_available_metrics", False)
|
||||
|
||||
arch = (
|
||||
self.__mspec.gpu_arch
|
||||
if (for_current_arch or self.__args.list_blocks is None)
|
||||
else self.__args.list_blocks
|
||||
)
|
||||
if arch in self.__supported_archs.keys():
|
||||
ac = schema.ArchConfig()
|
||||
ac.panel_configs = file_io.load_panel_configs([
|
||||
str(Path(self.__args.config_dir) / arch)
|
||||
])
|
||||
sys_info = (
|
||||
self.__mspec.get_class_members().iloc[0] if for_current_arch else None
|
||||
)
|
||||
parser.build_dfs(arch_configs=ac, filter_metrics=[], sys_info=sys_info)
|
||||
|
||||
print(f"{'INDEX':<8} {'BLOCK ALIAS':<16} {'BLOCK NAME'}")
|
||||
for key, value in ac.metric_list.items():
|
||||
if key.count(".") > 0:
|
||||
continue
|
||||
print(f"{key:<8} {get_panel_alias()[value]:<16} {value}")
|
||||
sys.exit(0)
|
||||
else:
|
||||
console_error("Unsupported arch")
|
||||
|
||||
@demarcate
|
||||
def list_sets(self) -> None:
|
||||
sets_info = parse_sets_yaml(self.__mspec.gpu_arch)
|
||||
|
||||
@@ -505,6 +505,7 @@ class RocProfCompute_Base:
|
||||
# PC sampling data is only collected when block "21" is specified
|
||||
if not (
|
||||
"21" in args.filter_blocks
|
||||
and "pc_sampling" in args.filter_blocks
|
||||
and self.__profiler in ("rocprofv3", "rocprofiler-sdk")
|
||||
):
|
||||
return
|
||||
|
||||
+1
-1
@@ -2,7 +2,6 @@
|
||||
Panel Config:
|
||||
id: 0
|
||||
title: Top Stats
|
||||
metrics_description: {}
|
||||
data source:
|
||||
- raw_csv_table:
|
||||
id: 1
|
||||
@@ -12,3 +11,4 @@ Panel Config:
|
||||
id: 2
|
||||
title: Dispatch List
|
||||
source: pmc_dispatch_info.csv
|
||||
metrics_description: {}
|
||||
|
||||
+1
-1
@@ -2,10 +2,10 @@
|
||||
Panel Config:
|
||||
id: 100
|
||||
title: System Info
|
||||
metrics_description: {}
|
||||
data source:
|
||||
- raw_csv_table:
|
||||
id: 101
|
||||
title: System Info
|
||||
source: sysinfo.csv
|
||||
columnwise: true
|
||||
metrics_description: {}
|
||||
|
||||
+122
-118
@@ -2,124 +2,6 @@
|
||||
Panel Config:
|
||||
id: 200
|
||||
title: System Speed-of-Light
|
||||
metrics_description:
|
||||
VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
|
||||
This is also presented as a percent of the peak theoretical FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any floating-point
|
||||
operations from MFMA instructions.'
|
||||
VALU IOPs: 'The total integer operations executed per second on the VALU. This
|
||||
is also presented as a percent of the peak theoretical IOPs achievable on the
|
||||
specific accelerator. Note: this does not include any integer operations from
|
||||
MFMA instructions.'
|
||||
MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
|
||||
executed per second. This does not include any 8-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F8 MFMA operations achievable on the specific accelerator. It is supported on
|
||||
AMD Instinct MI300 series and later only.
|
||||
MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
|
||||
executed per second. Note: this does not include any 16-bit brain floating point
|
||||
operations from VALU instructions. This is also presented as a percent of the
|
||||
peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F16 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 32-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F32 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 64-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F64 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
|
||||
per second. Note: this does not include any 8-bit integer operations from VALU
|
||||
instructions. This is also presented as a percent of the peak theoretical INT8
|
||||
MFMA operations achievable on the specific accelerator.'
|
||||
Active CUs: Total number of active compute units (CUs) on the accelerator during
|
||||
the kernel execution.
|
||||
SALU Utilization: Indicates what percent of the kernel's duration the SALU was
|
||||
busy executing instructions. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
|
||||
VALU Utilization: Indicates what percent of the kernel's duration the VALU was
|
||||
busy executing instructions. Does not include VMEM operations. Computed as the
|
||||
ratio of the total number of cycles spent by the scheduler issuing VALU instructions
|
||||
over the total CU cycles.
|
||||
MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
|
||||
was busy executing instructions. Computed as the ratio of the total number of
|
||||
cycles the MFMA was busy over the total CU cycles.
|
||||
VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
|
||||
was busy executing instructions, including both global/generic and spill/scratch
|
||||
operations (see the VMEM instruction count metrics for more detail). Does not
|
||||
include VALU operations. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing VMEM instructions over the total CU cycles.
|
||||
Branch Utilization: Indicates what percent of the kernel's duration the branch
|
||||
unit was busy executing instructions. Computed as the ratio of the total number
|
||||
of cycles spent by the scheduler issuing branch instructions over the total
|
||||
CU cycles.
|
||||
VALU Active Threads: Indicates the average level of divergence within a wavefront
|
||||
over the lifetime of the kernel. The number of work-items that were active in
|
||||
a wavefront during execution of each VALU instruction, time-averaged over all
|
||||
VALU instructions run on all wavefronts in the kernel.
|
||||
IPC: The ratio of the total number of instructions executed on the CU over the
|
||||
total active CU cycles. This is also presented as a percent of the peak theoretical
|
||||
bandwidth achievable on the specific accelerator.
|
||||
Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
|
||||
over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
||||
kernels (less than 1ms). This is also presented as a percent of the peak theoretical
|
||||
occupancy achievable on the specific accelerator.'
|
||||
Theoretical LDS Bandwidth: Indicates the maximum amount of bytes that could have
|
||||
been loaded from, stored to, or atomically updated in the LDS per unit time
|
||||
(see LDS Bandwidth example for more detail). This is also presented as a percent
|
||||
of the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
LDS Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS
|
||||
scheduler due to bank conflicts (as determined by the conflict resolution hardware)
|
||||
to the base number of cycles that would be spent in the LDS scheduler in a completely
|
||||
uncontended case. This is also presented in normalized form (i.e., the Bank
|
||||
Conflict Rate).
|
||||
vL1D Cache Hit Rate: The ratio of the number of vL1D cache line requests that
|
||||
hit in vL1D cache over the total number of cache line requests to the vL1D cache
|
||||
RAM.
|
||||
vL1D Cache BW: The number of bytes looked up in the vL1D cache as a result of
|
||||
VMEM instructions per unit time. The number of bytes is calculated as the number
|
||||
of cache lines requested multiplied by the cache line size. This value does
|
||||
not consider partial requests, so e.g., if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
This is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L2 Cache Hit Rate: The ratio of the number of L2 cache line requests that hit
|
||||
in the L2 cache over the total number of incoming cache line requests to the
|
||||
L2 cache.
|
||||
L2 Cache BW: The number of bytes looked up in the L2 cache per unit time. The
|
||||
number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line. This is also presented as a percent of
|
||||
the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Read BW: "The number of bytes read by the L2 over the Infinity Fabric\u2122\
|
||||
\ interface per unit time. This is also presented as a percent of the peak theoretical\
|
||||
\ bandwidth achievable on the specific accelerator."
|
||||
L2-Fabric Write BW: The number of bytes sent by the L2 over the Infinity Fabric
|
||||
interface by write and atomic operations per unit time. This is also presented
|
||||
as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Read Latency: The time-averaged number of cycles read requests spent
|
||||
in Infinity Fabric before data was returned to the L2.
|
||||
L2-Fabric Write Latency: The time-averaged number of cycles write requests spent
|
||||
in Infinity Fabric before a completion acknowledgement was returned to the L2.
|
||||
sL1D Cache Hit Rate: The percent of sL1D requests that hit on a previously loaded
line in the cache. Calculated as the ratio of the number of sL1D requests that
hit over the number of all sL1D requests.
|
||||
sL1D Cache BW: The number of bytes looked up in the sL1D cache per unit time.
|
||||
This is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L1I Hit Rate: The percent of L1I requests that hit on a previously loaded line
in the cache. Calculated as the ratio of the number of L1I requests that hit
over the number of all L1I requests.
L1I BW: The number of bytes looked up in the L1I cache per unit time. This is
also presented as a percent of the peak theoretical bandwidth achievable on
the specific accelerator.
|
||||
L1I Fetch Latency: The average number of cycles spent to fetch instructions to
|
||||
a CU.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 201
|
||||
@@ -317,3 +199,125 @@ Panel Config:
|
||||
peak: None
|
||||
pop: None
|
||||
coll_level: SQ_IFETCH_LEVEL
|
||||
metrics_description:
|
||||
VALU FLOPs: |-
|
||||
The total floating-point operations executed per second on the VALU.
|
||||
This is also presented as a percent of the peak theoretical FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any floating-point
|
||||
operations from MFMA instructions.
|
||||
VALU IOPs: |-
|
||||
The total integer operations executed per second on the VALU. This is
|
||||
also presented as a percent of the peak theoretical IOPs achievable on the
|
||||
specific accelerator. Note: this does not include any integer operations from
|
||||
MFMA instructions.
|
||||
MFMA FLOPs (BF16): |-
|
||||
The total number of 16-bit brain floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit brain floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
BF16 MFMA operations achievable on the specific accelerator.
|
||||
MFMA FLOPs (F16): |-
|
||||
The total number of 16-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 16-bit floating point operations from
|
||||
VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F16 MFMA operations achievable on the specific accelerator.
|
||||
MFMA FLOPs (F32): |-
|
||||
The total number of 32-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 32-bit floating point operations from
|
||||
VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F32 MFMA operations achievable on the specific accelerator.
|
||||
MFMA FLOPs (F64): |-
|
||||
The total number of 64-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 64-bit floating point operations from
|
||||
VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F64 MFMA operations achievable on the specific accelerator.
|
||||
MFMA IOPs (Int8): |-
|
||||
The total number of 8-bit integer MFMA operations executed per second.
|
||||
Note: this does not include any 8-bit integer operations from VALU instructions.
|
||||
This is also presented as a percent of the peak theoretical INT8 MFMA operations
|
||||
achievable on the specific accelerator.
|
||||
Active CUs: Total number of active compute units (CUs) on the accelerator during
|
||||
the kernel execution.
|
||||
SALU Utilization: Indicates what percent of the kernel's duration the SALU was
|
||||
busy executing instructions. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
|
||||
VALU Utilization: Indicates what percent of the kernel's duration the VALU was
|
||||
busy executing instructions. Does not include VMEM operations. Computed as the
|
||||
ratio of the total number of cycles spent by the scheduler issuing VALU instructions
|
||||
over the total CU cycles.
|
||||
MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
|
||||
was busy executing instructions. Computed as the ratio of the total number of
|
||||
cycles the MFMA was busy over the total CU cycles.
|
||||
VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
was busy executing instructions, including both global/generic and spill/scratch
operations (see the VMEM instruction count metrics for more detail). Does not
include VALU operations. Computed as the ratio of the total number of cycles
spent by the scheduler issuing VMEM instructions over the total CU cycles.
|
||||
Branch Utilization: Indicates what percent of the kernel's duration the branch
unit was busy executing instructions. Computed as the ratio of the total number
of cycles spent by the scheduler issuing branch instructions over the total
CU cycles.
|
||||
VALU Active Threads: Indicates the average level of divergence within a wavefront
|
||||
over the lifetime of the kernel. The number of work-items that were active in
|
||||
a wavefront during execution of each VALU instruction, time-averaged over all
|
||||
VALU instructions run on all wavefronts in the kernel.
|
||||
IPC: The ratio of the total number of instructions executed on the CU over the
total active CU cycles. This is also presented as a percent of the peak theoretical
IPC achievable on the specific accelerator.
|
||||
Wavefront Occupancy: |-
|
||||
The time-averaged number of wavefronts resident on the accelerator over
|
||||
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
||||
kernels (less than 1ms). This is also presented as a percent of the peak theoretical
|
||||
occupancy achievable on the specific accelerator.
|
||||
Theoretical LDS Bandwidth: Indicates the maximum number of bytes that could have
been loaded from, stored to, or atomically updated in the LDS per unit time
(see LDS Bandwidth example for more detail). This is also presented as a percent
of the peak theoretical LDS bandwidth achievable on the specific accelerator.
|
||||
LDS Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS
|
||||
scheduler due to bank conflicts (as determined by the conflict resolution hardware)
|
||||
to the base number of cycles that would be spent in the LDS scheduler in a completely
|
||||
uncontended case. This is also presented in normalized form (i.e., the Bank
|
||||
Conflict Rate).
|
||||
vL1D Cache Hit Rate: The ratio of the number of vL1D cache line requests that
|
||||
hit in vL1D cache over the total number of cache line requests to the vL1D cache
|
||||
RAM.
|
||||
vL1D Cache BW: The number of bytes looked up in the vL1D cache as a result of
|
||||
VMEM instructions per unit time. The number of bytes is calculated as the number
|
||||
of cache lines requested multiplied by the cache line size. This value does
|
||||
not consider partial requests, so e.g., if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
This is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L2 Cache Hit Rate: The ratio of the number of L2 cache line requests that hit
|
||||
in the L2 cache over the total number of incoming cache line requests to the
|
||||
L2 cache.
|
||||
L2 Cache BW: The number of bytes looked up in the L2 cache per unit time. The
|
||||
number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line. This is also presented as a percent of
|
||||
the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Read BW: |-
|
||||
The number of bytes read by the L2 over the Infinity Fabric™ interface
|
||||
per unit time. This is also presented as a percent of the peak theoretical
|
||||
bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Write BW: The number of bytes sent by the L2 over the Infinity Fabric
|
||||
interface by write and atomic operations per unit time. This is also presented
|
||||
as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Read Latency: The time-averaged number of cycles read requests spent
|
||||
in Infinity Fabric before data was returned to the L2.
|
||||
L2-Fabric Write Latency: The time-averaged number of cycles write requests spent
|
||||
in Infinity Fabric before a completion acknowledgement was returned to the L2.
|
||||
sL1D Cache Hit Rate: The percent of sL1D requests that hit on a previously loaded
line in the cache. Calculated as the ratio of the number of sL1D requests that
hit over the number of all sL1D requests.
|
||||
sL1D Cache BW: The number of bytes looked up in the sL1D cache per unit time.
|
||||
This is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L1I Hit Rate: The percent of L1I requests that hit on a previously loaded line
in the cache. Calculated as the ratio of the number of L1I requests that hit
over the number of all L1I requests.
L1I BW: The number of bytes looked up in the L1I cache per unit time. This is
also presented as a percent of the peak theoretical bandwidth achievable on
the specific accelerator.
|
||||
L1I Fetch Latency: The average number of cycles spent to fetch instructions to
|
||||
a CU.
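The vL1D Cache BW and L2 Cache BW descriptions above count every request as a full cache line, then optionally express the result as a percent of peak. A minimal Python sketch of that accounting follows. The function names, the 128-byte line size, and the sample numbers are assumptions for illustration only; they are not taken from this commit or from the profiler's implementation.

```python
def cache_bandwidth_bytes_per_s(lines_requested: int,
                                line_size_bytes: int,
                                duration_s: float) -> float:
    """Bytes looked up per second, counting each request as one full cache line.

    A request that touches a single value still contributes a whole line,
    matching the "does not consider partial requests" wording above.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return lines_requested * line_size_bytes / duration_s


def percent_of_peak(achieved: float, peak: float) -> float:
    """Achieved value as a percent of a peak value; 0 if no peak is known."""
    return 100.0 * achieved / peak if peak else 0.0


# Hypothetical kernel: 1,000,000 line requests, 128-byte lines, 0.5 s runtime.
bandwidth = cache_bandwidth_bytes_per_s(1_000_000, 128, 0.5)
share = percent_of_peak(bandwidth, 1_024_000_000.0)  # assumed peak, bytes/s
```

Because partial requests are rounded up to whole lines, a bandwidth computed this way can overstate the bytes the kernel actually consumed.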
|
||||
|
||||
+123
-119
@@ -2,122 +2,6 @@
|
||||
Panel Config:
|
||||
id: 300
|
||||
title: Memory Chart
|
||||
metrics_description:
|
||||
Wavefront Occupancy: Wavefronts per active CU.
|
||||
Wave Life: Average number of cycles executing a wave.
|
||||
SALU: Total number of SALU (Scalar ALU) instructions issued per normalization
unit.
SMEM: Total number of SMEM (Scalar Memory Read) instructions issued per normalization
unit.
|
||||
VALU: The number of VALU (Vector ALU) instructions issued per normalization unit.
|
||||
MFMA: Total number of MFMA (Matrix-Fused-Multiply-Add) instructions issued per
|
||||
normalization unit.
|
||||
VMEM: The number of VMEM (GPU Memory) read instructions issued (including FLAT/scratch
|
||||
memory) per normalization unit.
|
||||
LDS: The total number of LDS instructions (including, but not limited to, read/write/atomics
|
||||
and HIP's __shfl instructions) executed per normalization unit.
|
||||
GWS: Total number of GDS (global data sync) instructions issued per normalization
|
||||
unit.
|
||||
BR: Total number of BRANCH instructions issued per normalization unit.
|
||||
Active CUs: Total number of active compute units (CUs) on the accelerator during
|
||||
the kernel execution.
|
||||
Num CUs: Total number of compute units (CUs) on the accelerator.
|
||||
VGPR: 'The number of architected vector general-purpose registers allocated for
|
||||
the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
|
||||
by the compiler due to allocation granularity.'
|
||||
SGPR: 'The number of scalar general-purpose registers allocated for the kernel,
|
||||
see SALU. Note: this may not exactly match the number of SGPRs requested by
|
||||
the compiler due to allocation granularity.'
|
||||
LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
|
||||
for this kernel. Note: This may also be larger than what was requested at compile
|
||||
time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
|
||||
Scratch Allocation: The number of bytes of scratch memory requested per work-item
|
||||
for this kernel. Scratch memory is used for stack memory on the accelerator,
|
||||
as well as for register spills and restores.
|
||||
Wavefronts: The total number of wavefronts, summed over all workgroups, forming
|
||||
this kernel launch.
|
||||
Workgroups: The total number of workgroups forming this kernel launch.
|
||||
LDS Req: The total number of LDS instructions (including, but not limited to,
|
||||
read/write/atomics and HIP's __shfl instructions) executed per normalization
|
||||
unit.
|
||||
LDS Util: Indicates what percent of the kernel's duration the LDS was actively
|
||||
executing instructions (including, but not limited to, load, store, atomic and
|
||||
HIP's __shfl operations). Calculated as the ratio of the total number of cycles
|
||||
LDS was active over the total CU cycles.
|
||||
LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
|
||||
/ acknowledgment) required for an LDS instruction to complete.
|
||||
VL1 Rd: The total number of incoming read requests from the address processing
unit after coalescing, per normalization unit.
VL1 Wr: The total number of incoming write requests from the address processing
unit after coalescing, per normalization unit.
VL1 Atomic: The total number of incoming atomic requests from the address processing
unit after coalescing, per normalization unit.
|
||||
VL1 Hit: The ratio of the number of vL1D cache line requests that hit in vL1D
|
||||
cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
VL1 Lat: Calculated as the average number of cycles that a vL1D cache line request
|
||||
spent in the vL1D cache pipeline.
|
||||
VL1 Coalesce: Indicates how well memory instructions were coalesced by the address
|
||||
processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
|
||||
as the average number of thread-requests generated per instruction divided by
|
||||
the ideal number of thread-requests per instruction.
|
||||
VL1 Stall: The ratio of the number of cycles where the vL1D is stalled waiting
|
||||
to issue a request for data to the L2 cache divided by the number of cycles
|
||||
where the vL1D is active.
|
||||
VL1_L2 Rd: The number of read requests for a vL1D cache line that were not satisfied
by the vL1D and must be retrieved from the L2 cache, per normalization
unit.
|
||||
VL1_L2 Wr: The number of write requests to a vL1D cache line that were sent through
|
||||
the vL1D to the L2 cache, per normalization unit.
|
||||
VL1_L2 Atomic: The number of atomic requests that are sent through the vL1D to
the L2 cache, per normalization unit. This includes requests for atomics with
and without return.
|
||||
sL1D Rd: The total number of requests, of any size or type, made to the sL1D per
|
||||
normalization unit.
|
||||
sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
|
||||
line, per normalization unit.
|
||||
sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
|
||||
unit.
|
||||
sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
sL1D_L2 Atomic: The total number of atomic requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
IL1 Fetch: The total number of requests made to the L1I per normalization unit.
IL1 Hit: The percent of L1I requests that hit on a previously loaded line in the
cache. Calculated as the ratio of the number of L1I requests that hit over the
number of all L1I requests.
IL1 Lat: The average number of cycles spent to fetch instructions to a CU.
IL1_L2 Rd: The total number of requests across the L1I - L2 interface per normalization unit.
|
||||
L2 Rd: The total number of read requests to the L2 from all clients.
|
||||
L2 Wr: The total number of write requests to the L2 from all clients.
|
||||
L2 Atomic: The total number of atomic requests (with and without return) to the
|
||||
L2 from all clients.
|
||||
L2 Hit: The ratio of the number of L2 cache line requests that hit in the L2 cache
|
||||
over the total number of incoming cache line requests to the L2 cache.
|
||||
L2 Rd Lat: Calculated as the average number of cycles that the vL1D cache took
|
||||
to issue and receive read requests from the L2 Cache. This number also includes
|
||||
requests for atomics with return values.
|
||||
L2 Wr Lat: Calculated as the average number of cycles that the vL1D cache took
|
||||
to issue and receive acknowledgement of a write request to the L2 Cache. This
|
||||
number also includes requests for atomics without return values.
|
||||
Fabric_L2 Rd: Number of L2 cache - Infinity Fabric read requests (either 32-byte
|
||||
or 64-byte) summed over TCC instances per normalization unit.
|
||||
Fabric_L2 Wr: Number of L2 cache - Infinity Fabric write requests (either 32-byte
|
||||
or 64-byte) summed over TCC instances per normalization unit.
|
||||
Fabric_L2 Atomic: Number of L2 cache - Infinity Fabric write requests (either
|
||||
32-byte or 64-byte) that are actually atomic requests summed over TCC instances
|
||||
per normalization unit.
|
||||
Fabric Rd Lat: The time-averaged number of cycles read requests spent in Infinity
|
||||
Fabric before data was returned to the L2.
|
||||
Fabric Wr Lat: The time-averaged number of cycles write requests spent in Infinity
|
||||
Fabric before a completion acknowledgement was returned to the L2.
|
||||
Fabric Atomic Lat: The time-averaged number of cycles atomic requests spent in
|
||||
Infinity Fabric before a completion acknowledgement (atomic without return value)
|
||||
or data (atomic with return value) was returned to the L2.
|
||||
HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
|
||||
of data from the accelerator's local HBM, per normalization unit.
|
||||
HBM Wr: 'The total number of L2 requests to Infinity Fabric to write or atomically
|
||||
update 32B or 64B of data in the accelerator''s local HBM, per normalization
|
||||
unit. '
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 301
|
||||
@@ -252,13 +136,13 @@ Panel Config:
|
||||
value: ROUND(AVG((TCC_EA0_ATOMIC_sum / $denom)), 0)
|
||||
Fabric Rd Lat:
|
||||
value: ROUND(AVG(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else 0)), 0)
|
||||
!= 0) else 0)), 0)
|
||||
Fabric Wr Lat:
|
||||
value: ROUND(AVG(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else 0)), 0)
|
||||
!= 0) else 0)), 0)
|
||||
Fabric Atomic Lat:
|
||||
value: ROUND(AVG(((TCC_EA0_ATOMIC_LEVEL_sum / TCC_EA0_ATOMIC_sum) if (TCC_EA0_ATOMIC_sum
|
||||
!= 0) else 0)), 0)
|
||||
!= 0) else 0)), 0)
|
||||
HBM Rd:
|
||||
value: ROUND(AVG((TCC_EA0_RDREQ_DRAM_sum / $denom)), 0)
|
||||
HBM Wr:
|
||||
@@ -266,3 +150,123 @@ Panel Config:
|
||||
comparable: false
|
||||
cli_style: mem_chart
|
||||
tui_style: mem_chart
|
||||
metrics_description:
|
||||
Wavefront Occupancy: Wavefronts per active CU.
|
||||
Wave Life: Average number of cycles executing a wave.
|
||||
SALU: Total number of SALU (Scalar ALU) instructions issued per normalization
unit.
SMEM: Total number of SMEM (Scalar Memory Read) instructions issued per normalization
unit.
|
||||
VALU: The number of VALU (Vector ALU) instructions issued per normalization unit.
|
||||
MFMA: Total number of MFMA (Matrix-Fused-Multiply-Add) instructions issued per
|
||||
normalization unit.
|
||||
VMEM: The number of VMEM (GPU Memory) read instructions issued (including FLAT/scratch
|
||||
memory) per normalization unit.
|
||||
LDS: The total number of LDS instructions (including, but not limited to, read/write/atomics
|
||||
and HIP's __shfl instructions) executed per normalization unit.
|
||||
GWS: Total number of GDS (global data sync) instructions issued per normalization
|
||||
unit.
|
||||
BR: Total number of BRANCH instructions issued per normalization unit.
|
||||
Active CUs: Total number of active compute units (CUs) on the accelerator during
|
||||
the kernel execution.
|
||||
Num CUs: Total number of compute units (CUs) on the accelerator.
|
||||
VGPR: |-
|
||||
The number of architected vector general-purpose registers allocated
|
||||
for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
|
||||
requested by the compiler due to allocation granularity.
|
||||
SGPR: |-
|
||||
The number of scalar general-purpose registers allocated for the kernel,
|
||||
see SALU. Note: this may not exactly match the number of SGPRs requested by
|
||||
the compiler due to allocation granularity.
|
||||
LDS Allocation: |-
|
||||
The number of bytes of LDS memory (or, shared memory) allocated for
|
||||
this kernel. Note: This may also be larger than what was requested at compile
|
||||
time due to both allocation granularity and dynamic per-dispatch LDS allocations.
|
||||
Scratch Allocation: The number of bytes of scratch memory requested per work-item
|
||||
for this kernel. Scratch memory is used for stack memory on the accelerator,
|
||||
as well as for register spills and restores.
|
||||
Wavefronts: The total number of wavefronts, summed over all workgroups, forming
|
||||
this kernel launch.
|
||||
Workgroups: The total number of workgroups forming this kernel launch.
|
||||
LDS Req: The total number of LDS instructions (including, but not limited to,
|
||||
read/write/atomics and HIP's __shfl instructions) executed per normalization
|
||||
unit.
|
||||
LDS Util: Indicates what percent of the kernel's duration the LDS was actively
|
||||
executing instructions (including, but not limited to, load, store, atomic and
|
||||
HIP's __shfl operations). Calculated as the ratio of the total number of cycles
|
||||
LDS was active over the total CU cycles.
|
||||
LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
|
||||
/ acknowledgment) required for an LDS instruction to complete.
|
||||
VL1 Rd: The total number of incoming read requests from the address processing
unit after coalescing, per normalization unit.
VL1 Wr: The total number of incoming write requests from the address processing
unit after coalescing, per normalization unit.
VL1 Atomic: The total number of incoming atomic requests from the address processing
unit after coalescing, per normalization unit.
|
||||
VL1 Hit: The ratio of the number of vL1D cache line requests that hit in vL1D
|
||||
cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
VL1 Lat: Calculated as the average number of cycles that a vL1D cache line request
|
||||
spent in the vL1D cache pipeline.
|
||||
VL1 Coalesce: Indicates how well memory instructions were coalesced by the address
|
||||
processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
|
||||
as the average number of thread-requests generated per instruction divided by
|
||||
the ideal number of thread-requests per instruction.
|
||||
VL1 Stall: The ratio of the number of cycles where the vL1D is stalled waiting
|
||||
to issue a request for data to the L2 cache divided by the number of cycles
|
||||
where the vL1D is active.
|
||||
VL1_L2 Rd: The number of read requests for a vL1D cache line that were not satisfied
by the vL1D and must be retrieved from the L2 cache, per normalization
unit.
|
||||
VL1_L2 Wr: The number of write requests to a vL1D cache line that were sent through
|
||||
the vL1D to the L2 cache, per normalization unit.
|
||||
VL1_L2 Atomic: The number of atomic requests that are sent through the vL1D to
the L2 cache, per normalization unit. This includes requests for atomics with
and without return.
|
||||
sL1D Rd: The total number of requests, of any size or type, made to the sL1D per
|
||||
normalization unit.
|
||||
sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
|
||||
line, per normalization unit.
|
||||
sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
|
||||
unit.
|
||||
sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
sL1D_L2 Atomic: The total number of atomic requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
IL1 Fetch: The total number of requests made to the L1I per normalization unit.
IL1 Hit: The percent of L1I requests that hit on a previously loaded line in the
cache. Calculated as the ratio of the number of L1I requests that hit over the
number of all L1I requests.
IL1 Lat: The average number of cycles spent to fetch instructions to a CU.
IL1_L2 Rd: The total number of requests across the L1I - L2 interface per normalization unit.
|
||||
L2 Rd: The total number of read requests to the L2 from all clients.
|
||||
L2 Wr: The total number of write requests to the L2 from all clients.
|
||||
L2 Atomic: The total number of atomic requests (with and without return) to the
|
||||
L2 from all clients.
|
||||
L2 Hit: The ratio of the number of L2 cache line requests that hit in the L2 cache
|
||||
over the total number of incoming cache line requests to the L2 cache.
|
||||
L2 Rd Lat: Calculated as the average number of cycles that the vL1D cache took
|
||||
to issue and receive read requests from the L2 Cache. This number also includes
|
||||
requests for atomics with return values.
|
||||
L2 Wr Lat: Calculated as the average number of cycles that the vL1D cache took
|
||||
to issue and receive acknowledgement of a write request to the L2 Cache. This
|
||||
number also includes requests for atomics without return values.
|
||||
Fabric_L2 Rd: Number of L2 cache - Infinity Fabric read requests (either 32-byte
|
||||
or 64-byte) summed over TCC instances per normalization unit.
|
||||
Fabric_L2 Wr: Number of L2 cache - Infinity Fabric write requests (either 32-byte
|
||||
or 64-byte) summed over TCC instances per normalization unit.
|
||||
Fabric_L2 Atomic: Number of L2 cache - Infinity Fabric write requests (either
|
||||
32-byte or 64-byte) that are actually atomic requests summed over TCC instances
|
||||
per normalization unit.
|
||||
Fabric Rd Lat: The time-averaged number of cycles read requests spent in Infinity
|
||||
Fabric before data was returned to the L2.
|
||||
Fabric Wr Lat: The time-averaged number of cycles write requests spent in Infinity
|
||||
Fabric before a completion acknowledgement was returned to the L2.
|
||||
Fabric Atomic Lat: The time-averaged number of cycles atomic requests spent in
|
||||
Infinity Fabric before a completion acknowledgement (atomic without return value)
|
||||
or data (atomic with return value) was returned to the L2.
|
||||
HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
|
||||
of data from the accelerator's local HBM, per normalization unit.
|
||||
HBM Wr: |-
|
||||
The total number of L2 requests to Infinity Fabric to write or atomically
|
||||
update 32B or 64B of data in the accelerator's local HBM, per normalization
|
||||
unit.
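The Fabric Rd Lat, Fabric Wr Lat, and Fabric Atomic Lat expressions in the hunk above share one pattern: divide a `*_LEVEL_sum` counter by the matching request counter, fall back to 0 when no requests were issued, then average and round. A rough Python sketch of that pattern follows. It assumes one (level, requests) pair per kernel dispatch; the function names and sample data are hypothetical, and the config's `AVG` and `ROUND` are only approximated here by a plain mean and Python's built-in `round`.

```python
from statistics import mean


def per_dispatch_latency(level_sum: float, req_sum: float) -> float:
    """Average cycles per request, guarded against a zero request count."""
    return level_sum / req_sum if req_sum != 0 else 0


def fabric_latency(samples: list[tuple[float, float]]) -> float:
    """Mean of the guarded per-dispatch ratios, rounded to 0 decimal places.

    `samples` holds hypothetical (LEVEL_sum, REQ_sum) counter pairs,
    one pair per dispatch.
    """
    if not samples:
        return 0
    ratios = [per_dispatch_latency(level, reqs) for level, reqs in samples]
    return round(mean(ratios), 0)


# Two made-up dispatches: one with traffic, one that issued no requests.
example = fabric_latency([(1000, 10), (0, 0)])
```

The zero guard matters because a kernel that never reaches the fabric would otherwise raise a division error instead of reporting a latency of 0.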
|
||||
|
||||
+83
-79
@@ -2,85 +2,6 @@
|
||||
Panel Config:
|
||||
id: 400
|
||||
title: Roofline
|
||||
metrics_description:
|
||||
VALU FLOPs (F16): 'The total 16-bit floating-point operations executed per second
|
||||
on the VALU. This is presented with the value of the peak empirical F16 FLOPs
|
||||
achievable on the specific accelerator. Note: this does not include any F16
|
||||
operations from MFMA instructions.'
|
||||
VALU FLOPs (F32): 'The total 32-bit floating-point operations executed per second
|
||||
on the VALU. This is presented with the value of the peak empirical F32 FLOPs
|
||||
achievable on the specific accelerator. Note: this does not include any F32
|
||||
operations from MFMA instructions.'
|
||||
VALU FLOPs (F64): 'The total 64-bit floating-point operations executed per second
|
||||
on the VALU. This is presented with the value of the peak empirical F64 FLOPs
|
||||
achievable on the specific accelerator. Note: this does not include any F64
|
||||
operations from MFMA instructions.'
|
||||
MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
executed per second. This does not include any 8-bit floating point operations
from VALU instructions. The peak empirically measured F8 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison. It is supported
on AMD Instinct MI300 series and later only.
|
||||
MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
|
||||
executed per second. Note: this does not include any 16-bit brain floating point
|
||||
operations from VALU instructions. The peak empirically measured BF16 MFMA operations
|
||||
achievable on the specific accelerator is displayed alongside for comparison.'
|
||||
MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit floating point operations
|
||||
from VALU instructions. The peak empirically measured F16 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison.'
|
||||
MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 32-bit floating point operations
|
||||
from VALU instructions. The peak empirically measured F32 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison.'
|
||||
MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 64-bit floating point operations
|
||||
from VALU instructions. The peak empirically measured F64 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison.'
|
||||
MFMA FLOPs (F6F4): 'The total number of 4-bit and 6-bit floating point MFMA operations
|
||||
executed per second. Note: this does not include any floating point operations
|
||||
from VALU instructions. The peak empirically measured F6F4 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison. It is supported
|
||||
on AMD Instinct MI350 series (gfx950) and later only.'
|
||||
MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
|
||||
per second. Note: this does not include any 8-bit integer operations from VALU
|
||||
instructions. The peak empirically measured INT8 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison.'
|
||||
HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth
|
||||
Memory (HBM) per second. The peak empirically measured bandwidth achievable
|
||||
on the specific accelerator is displayed alongside for comparison.
|
||||
L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
|
||||
The number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line. The peak empirically measured bandwidth
|
||||
achievable on the specific accelerator is displayed alongside for comparison.
|
||||
L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
|
||||
of VMEM instructions per unit time. The number of bytes is calculated as the
|
||||
number of cache lines requested multiplied by the cache line size. This value
|
||||
does not consider partial requests, so e.g., if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
The peak empirically measured bandwidth achievable on the specific accelerator
|
||||
is displayed alongside for comparison.
|
||||
LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded
|
||||
from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
|
||||
example for more detail). The peak empirically measured LDS bandwidth achievable
|
||||
on the specific accelerator is displayed alongside for comparison.
|
||||
AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
|
||||
of total floating-point operations (FLOPs) to total bytes transferred between
|
||||
the L1 cache and the processing units. This value is used as the x-coordinate
|
||||
for the L1 roofline.
|
||||
AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
|
||||
of total floating-point operations (FLOPs) to total bytes transferred between
|
||||
the L2 cache and the L1 cache. This value is used as the x-coordinate for the
|
||||
L2 roofline.
|
||||
AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
|
||||
It is the ratio of total floating-point operations (FLOPs) to total bytes transferred
|
||||
between HBM and the L2 cache. This value is used as the x-coordinate for the
|
||||
HBM roofline.
|
||||
Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs
|
||||
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
|
||||
operations divided by the total execution time. This value is used as the y-coordinate
|
||||
for the kernel's point on the Roofline plot.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 401
|
||||
@@ -212,3 +133,86 @@ Panel Config:
|
||||
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / (SUM(End_Timestamp - Start_Timestamp)
|
||||
/ 1e9) ) / 1e9
|
||||
unit: GFLOP/s
|
||||
metrics_description:
|
||||
VALU FLOPs (F16): |-
|
||||
The total 16-bit floating-point operations executed per second on the VALU.
|
||||
This is presented with the value of the peak empirical F16 FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any F16 operations
|
||||
from MFMA instructions.
|
||||
VALU FLOPs (F32): |-
|
||||
The total 32-bit floating-point operations executed per second on the VALU.
|
||||
This is presented with the value of the peak empirical F32 FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any F32 operations
|
||||
from MFMA instructions.
|
||||
VALU FLOPs (F64): |-
|
||||
The total 64-bit floating-point operations executed per second on the VALU.
|
||||
This is presented with the value of the peak empirical F64 FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any F64 operations
|
||||
from MFMA instructions.
|
||||
MFMA FLOPs (BF16): |-
|
||||
The total number of 16-bit brain floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit brain floating point
|
||||
operations from VALU instructions. The peak empirically measured BF16 MFMA
|
||||
operations achievable on the specific accelerator is displayed alongside
|
||||
for comparison.
|
||||
MFMA FLOPs (F16): |-
|
||||
The total number of 16-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 16-bit floating point operations from
|
||||
VALU instructions. The peak empirically measured F16 MFMA operations
|
||||
achievable on the specific accelerator is displayed alongside for comparison.
|
||||
MFMA FLOPs (F32): |-
|
||||
The total number of 32-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 32-bit floating point operations from
|
||||
VALU instructions. The peak empirically measured F32 MFMA operations
|
||||
achievable on the specific accelerator is displayed alongside for comparison.
|
||||
MFMA FLOPs (F64): |-
|
||||
The total number of 64-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 64-bit floating point operations from
|
||||
VALU instructions. The peak empirically measured F64 MFMA operations
|
||||
achievable on the specific accelerator is displayed alongside for comparison.
|
||||
MFMA IOPs (Int8): |-
|
||||
The total number of 8-bit integer MFMA operations executed per second.
|
||||
Note: this does not include any 8-bit integer operations from VALU instructions.
|
||||
The peak empirically measured INT8 MFMA operations achievable on the specific
|
||||
accelerator is displayed alongside for comparison.
|
||||
HBM Bandwidth: |-
|
||||
The total number of bytes read from and written to High-Bandwidth
|
||||
Memory (HBM) per second. The peak empirically measured bandwidth achievable
|
||||
on the specific accelerator is displayed alongside for comparison.
|
||||
L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
|
||||
The number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line. The peak empirically measured bandwidth
|
||||
achievable on the specific accelerator is displayed alongside for comparison.
|
||||
L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
|
||||
of VMEM instructions per unit time. The number of bytes is calculated as the
|
||||
number of cache lines requested multiplied by the cache line size. This value
|
||||
does not consider partial requests, so e.g., if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
The peak empirically measured bandwidth achievable on the specific accelerator
|
||||
is displayed alongside for comparison.
|
||||
LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded
|
||||
from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
|
||||
example for more detail). The peak empirically measured LDS bandwidth achievable
|
||||
on the specific accelerator is displayed alongside for comparison.
|
||||
AI L1: |-
|
||||
The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
|
||||
of total floating-point operations (FLOPs) to total bytes transferred between
|
||||
the L1 cache and the processing units. This value is used as the x-coordinate
|
||||
for the L1 roofline.
|
||||
AI L2: |-
|
||||
The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
|
||||
of total floating-point operations (FLOPs) to total bytes transferred between
|
||||
the L2 cache and the L1 cache. This value is used as the x-coordinate for
|
||||
the L2 roofline.
|
||||
AI HBM: |-
|
||||
The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
|
||||
It is the ratio of total floating-point operations (FLOPs) to total bytes
|
||||
transferred between HBM and the L2 cache. This value is used as the x-coordinate
|
||||
for the HBM roofline.
|
||||
Performance (GFLOPs): |-
|
||||
The overall achieved performance, measured in GigaFLOPs
|
||||
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
|
||||
operations divided by the total execution time. This value is used as the y-coordinate
|
||||
for the kernel's point on the Roofline plot.
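As a rough illustration of how the AI and Performance descriptions above fit together, the following Python sketch computes a kernel's roofline coordinates from a FLOP count, a byte count, and a duration in nanoseconds. The function name `roofline_point` and the sample numbers are hypothetical and are not taken from rocprofiler-compute; only the "divide by seconds, then by 1e9" scaling mirrors the GFLOP/s expression shown in this file.

```python
def roofline_point(total_flops, bytes_moved, duration_ns):
    """Return (arithmetic intensity in FLOP/byte, achieved GFLOP/s).

    Hypothetical helper: bytes_moved is the traffic at whichever memory
    level (L1, L2, or HBM) the roofline is drawn for.
    """
    if bytes_moved <= 0 or duration_ns <= 0:
        raise ValueError("bytes_moved and duration_ns must be positive")
    # x-coordinate: FLOPs per byte transferred at the chosen memory level.
    arithmetic_intensity = total_flops / bytes_moved
    # y-coordinate: FLOPs per second, scaled to GFLOP/s.
    gflops = (total_flops / (duration_ns / 1e9)) / 1e9
    return arithmetic_intensity, gflops


# Made-up example values: 2e9 FLOPs, 5e8 bytes of HBM traffic, 1 ms runtime.
ai_hbm, perf = roofline_point(total_flops=2.0e9, bytes_moved=5.0e8, duration_ns=1.0e6)
print(f"AI HBM = {ai_hbm:.2f} FLOP/byte, Performance = {perf:.1f} GFLOP/s")
```

The same FLOP count paired with L1, L2, or HBM byte counts gives the three x-coordinates, while the y-coordinate is shared.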
|
||||
|
||||
+25
-24
@@ -2,30 +2,6 @@
|
||||
Panel Config:
|
||||
id: 500
|
||||
title: Command Processor (CPC/CPF)
|
||||
metrics_description:
|
||||
CPF Utilization: Percent of total cycles where the CPF was busy actively doing
|
||||
any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
|
||||
CPF Stall: Percent of CPF busy cycles where the CPF was stalled for any reason.
|
||||
CPF-L2 Utilization: Percent of total cycles counted by the CPF-L2 interface where
|
||||
the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
|
||||
over total cycles counted by the CPF-L2.
|
||||
CPF-L2 Stall: Percent of CPF-L2 busy cycles where the CPF-L2 interface was
|
||||
stalled for any reason.
|
||||
CPF-UTCL1 Stall: Percent of CPF busy cycles where the CPF was stalled by address
|
||||
translation.
|
||||
CPC Utilization: Percent of total cycles where the CPC was busy actively doing
|
||||
any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
|
||||
CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
|
||||
CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
|
||||
for processing.
|
||||
CPC-Workgroup Manager Utilization: Percent of CPC busy cycles spent dispatching
|
||||
workgroups to the workgroup manager.
|
||||
CPC-L2 Utilization: Percent of total cycles counted by the CPC-L2 interface where
|
||||
the CPC-L2 interface was active doing any work.
|
||||
CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
|
||||
translation.
|
||||
CPC-UTCL2 Utilization: 'Percent of total cycles counted by the CPC''s L2 address
|
||||
translation interface where the CPC was busy doing address translation work. '
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 501
|
||||
@@ -143,3 +119,28 @@ Panel Config:
|
||||
max: MAX((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
|
||||
if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
metrics_description:
|
||||
CPF Utilization: Percent of total cycles where the CPF was busy actively doing
|
||||
any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
|
||||
CPF Stall: Percent of CPF busy cycles where the CPF was stalled for any reason.
|
||||
CPF-L2 Utilization: Percent of total cycles counted by the CPF-L2 interface where
|
||||
the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
|
||||
over total cycles counted by the CPF-L2.
|
||||
CPF-L2 Stall: Percent of CPF-L2 busy cycles where the CPF-L2 interface was
|
||||
stalled for any reason.
|
||||
CPF-UTCL1 Stall: Percent of CPF busy cycles where the CPF was stalled by address
|
||||
translation.
|
||||
CPC Utilization: Percent of total cycles where the CPC was busy actively doing
|
||||
any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
|
||||
CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
|
||||
CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
|
||||
for processing.
|
||||
CPC-Workgroup Manager Utilization: Percent of CPC busy cycles spent dispatching
|
||||
workgroups to the workgroup manager.
|
||||
CPC-L2 Utilization: Percent of total cycles counted by the CPC-L2 interface where
|
||||
the CPC-L2 interface was active doing any work.
|
||||
CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
|
||||
translation.
|
||||
CPC-UTCL2 Utilization: |-
|
||||
Percent of total cycles counted by the CPC's L2 address translation
|
||||
interface where the CPC was busy doing address translation work.
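The utilization metrics in this panel are busy-over-total percentages, and the expression for CPC-UTCL2 Utilization shown above guards against a zero denominator by yielding `None`. A minimal Python sketch of that pattern follows; the function name `utilization_pct` and the sample counter values are assumptions for illustration, not part of the tool.

```python
from typing import Optional


def utilization_pct(busy_cycles: int, idle_cycles: int) -> Optional[float]:
    """Percent of counted cycles that were busy, or None if nothing was counted."""
    total_cycles = busy_cycles + idle_cycles
    if total_cycles == 0:
        # Mirrors the "... else None" branch of the config expression.
        return None
    return 100 * busy_cycles / total_cycles


print(utilization_pct(25, 75))
print(utilization_pct(0, 0))
```

Returning `None` rather than 0 keeps "no cycles were counted" distinguishable from "the unit was idle the whole time".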
|
||||
|
||||
+55
-55
@@ -2,61 +2,6 @@
|
||||
Panel Config:
|
||||
id: 600
|
||||
title: Workgroup Manager (SPI)
|
||||
metrics_description:
|
||||
Accelerator Utilization: The percent of cycles in the kernel where the accelerator
|
||||
was actively doing any work.
|
||||
Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
|
||||
kernel where the scheduler-pipes were actively doing any work.
|
||||
Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
|
||||
manager was actively doing any work.
|
||||
Shader Engine Utilization: The percent of total shader engine cycles in the kernel
|
||||
where any CU in a shader-engine was actively doing any work, normalized over
|
||||
all shader-engines. Low values (e.g., << 100%) indicate that the accelerator
|
||||
was not fully saturated by the kernel, or a potential load-imbalance issue.
|
||||
SIMD Utilization: The percent of total SIMD cycles in the kernel where any SIMD
|
||||
on a CU was actively doing any work, summed over all CUs. Low values (less than
|
||||
100%) indicate that the accelerator was not fully saturated by the kernel, or
|
||||
a potential load-imbalance issue.
|
||||
Dispatched Workgroups: The total number of workgroups forming this kernel launch.
|
||||
Dispatched Wavefronts: The total number of wavefronts, summed over all workgroups,
|
||||
forming this kernel launch.
|
||||
VGPR Writes: The average number of cycles spent initializing VGPRs at wave creation.
|
||||
SGPR Writes: The average number of cycles spent initializing SGPRs at wave creation.
|
||||
Not-scheduled Rate (Workgroup Manager): The percent of total scheduler-pipe cycles
|
||||
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
|
||||
within the workgroup manager rather than a lack of a CU or SIMD with sufficient
|
||||
resources.
|
||||
Not-scheduled Rate (Scheduler-Pipe): 'The percent of total scheduler-pipe cycles
|
||||
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
|
||||
within the scheduler-pipes rather than a lack of a CU or SIMD with sufficient
|
||||
resources. '
|
||||
Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
|
||||
where a workgroup could not be scheduled to a CU due to occupancy limitations
|
||||
(like a lack of a CU or SIMD with sufficient resources).
|
||||
Scratch Stall Rate: The percent of total shader-engine cycles in the kernel where
|
||||
a workgroup could not be scheduled to a CU due to lack of private (a.k.a., scratch)
|
||||
memory slots. While this can reach up to 100%, note that the actual occupancy
|
||||
limitations on a kernel using private memory are typically quite small (for
|
||||
example, less than 1% of the total number of waves that can be scheduled to
|
||||
an accelerator).
|
||||
Insufficient SIMD Waveslots: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available waveslots.
|
||||
Insufficient SIMD VGPRs: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available VGPRs.
|
||||
Insufficient SIMD SGPRs: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available SGPRs.
|
||||
Insufficient CU LDS: The percent of total CU cycles in the kernel where a workgroup
|
||||
could not be scheduled to a CU due to lack of available LDS.
|
||||
Insufficient CU Barriers: The percent of total CU cycles in the kernel where a
|
||||
workgroup could not be scheduled to a CU due to lack of available barriers.
|
||||
Reached CU Workgroup Limit: The percent of total CU cycles in the kernel where
|
||||
a workgroup could not be scheduled to a CU due to limits within the workgroup
|
||||
manager. This is expected to always be zero on CDNA2 or newer accelerators
|
||||
(and small for previous accelerators).
|
||||
Reached CU Wavefront Limit: The percent of total CU cycles in the kernel where
|
||||
a wavefront could not be scheduled to a CU due to limits within the workgroup
|
||||
manager. This is expected to always be zero on CDNA2 or newer accelerators
|
||||
(and small for previous accelerators).
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 601
|
||||
@@ -199,3 +144,58 @@ Panel Config:
|
||||
min: MIN(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
metrics_description:
|
||||
Accelerator Utilization: The percent of cycles in the kernel where the accelerator
|
||||
was actively doing any work.
|
||||
Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
|
||||
kernel where the scheduler-pipes were actively doing any work.
|
||||
Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
|
||||
manager was actively doing any work.
|
||||
Shader Engine Utilization: The percent of total shader engine cycles in the kernel
|
||||
where any CU in a shader-engine was actively doing any work, normalized over
|
||||
all shader-engines. Low values (e.g., << 100%) indicate that the accelerator
|
||||
was not fully saturated by the kernel, or a potential load-imbalance issue.
|
||||
SIMD Utilization: The percent of total SIMD cycles in the kernel where any SIMD
|
||||
on a CU was actively doing any work, summed over all CUs. Low values (less than
|
||||
100%) indicate that the accelerator was not fully saturated by the kernel, or
|
||||
a potential load-imbalance issue.
|
||||
Dispatched Workgroups: The total number of workgroups forming this kernel launch.
|
||||
Dispatched Wavefronts: The total number of wavefronts, summed over all workgroups,
|
||||
forming this kernel launch.
|
||||
VGPR Writes: The average number of cycles spent initializing VGPRs at wave creation.
|
||||
SGPR Writes: The average number of cycles spent initializing SGPRs at wave creation.
|
||||
Not-scheduled Rate (Workgroup Manager): The percent of total scheduler-pipe cycles
|
||||
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
|
||||
within the workgroup manager rather than a lack of a CU or SIMD with sufficient
|
||||
resources.
|
||||
Not-scheduled Rate (Scheduler-Pipe): |-
|
||||
The percent of total scheduler-pipe cycles in the kernel where a workgroup
|
||||
could not be scheduled to a CU due to a bottleneck within the scheduler-pipes
|
||||
rather than a lack of a CU or SIMD with sufficient resources.
|
||||
Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
|
||||
where a workgroup could not be scheduled to a CU due to occupancy limitations
|
||||
(like a lack of a CU or SIMD with sufficient resources).
|
||||
Scratch Stall Rate: The percent of total shader-engine cycles in the kernel where
|
||||
a workgroup could not be scheduled to a CU due to lack of private (a.k.a., scratch)
|
||||
memory slots. While this can reach up to 100%, note that the actual occupancy
|
||||
limitations on a kernel using private memory are typically quite small (for
|
||||
example, less than 1% of the total number of waves that can be scheduled to
|
||||
an accelerator).
|
||||
Insufficient SIMD Waveslots: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available waveslots.
|
||||
Insufficient SIMD VGPRs: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available VGPRs.
|
||||
Insufficient SIMD SGPRs: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available SGPRs.
|
||||
Insufficient CU LDS: The percent of total CU cycles in the kernel where a workgroup
|
||||
could not be scheduled to a CU due to lack of available LDS.
|
||||
Insufficient CU Barriers: The percent of total CU cycles in the kernel where a
|
||||
workgroup could not be scheduled to a CU due to lack of available barriers.
|
||||
Reached CU Workgroup Limit: The percent of total CU cycles in the kernel where
|
||||
a workgroup could not be scheduled to a CU due to limits within the workgroup
|
||||
manager. This is expected to always be zero on CDNA2 or newer accelerators
|
||||
(and small for previous accelerators).
|
||||
Reached CU Wavefront Limit: The percent of total CU cycles in the kernel where
|
||||
a wavefront could not be scheduled to a CU due to limits within the workgroup
|
||||
manager. This is expected to always be zero on CDNA2 or newer accelerators
|
||||
(and small for previous accelerators).
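The "percent of total CU cycles" metrics above are stall counters normalized by active cycles times the number of CUs. The sketch below restates the Reached CU Wavefront Limit expression from this file in Python. The function and parameter names are hypothetical, the scale factor 400 is copied from the expression as-is (its derivation is not stated in this diff), and the zero-denominator guard is an addition that the original expression does not have.

```python
from typing import Optional


def cu_wavefront_limit_pct(
    wvlim_stall_cycles: int,
    gui_active_cycles_per_xcd: int,
    cu_per_gpu: int,
    scale: int = 400,  # constant taken verbatim from the config expression
) -> Optional[float]:
    """Stall cycles as a percent of (active cycles x CU count)."""
    denominator = gui_active_cycles_per_xcd * cu_per_gpu
    if denominator == 0:
        # Assumption: report "no data" instead of dividing by zero.
        return None
    return scale * wvlim_stall_cycles / denominator


# Made-up counter values for illustration only.
print(cu_wavefront_limit_pct(26, 1000, 104))
```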
|
||||
|
||||
+63
-57
@@ -2,63 +2,6 @@
|
||||
Panel Config:
|
||||
id: 700
|
||||
title: Wavefront
|
||||
metrics_description:
|
||||
Grid Size: The total number of work-items (or, threads) launched as a part of
|
||||
the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied
|
||||
by the total workgroup (or, block) size.
|
||||
Workgroup Size: The total number of work-items (or, threads) in each workgroup
|
||||
(or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
|
||||
to the total block size.
|
||||
Total Wavefronts: "The total number of wavefronts launched as part of the kernel\
|
||||
\ dispatch. On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs,\
|
||||
\ the wavefront size is always 64 work-items. Thus, the total number of wavefronts\
|
||||
\ should be equivalent to the ceiling of grid size divided by 64."
|
||||
Saved Wavefronts: The total number of wavefronts saved at a context-save.
|
||||
Restored Wavefronts: The total number of wavefronts restored from a context-save.
|
||||
VGPRs: 'The number of architected vector general-purpose registers allocated for
|
||||
the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
|
||||
by the compiler due to allocation granularity.'
|
||||
AGPRs: 'The number of accumulation vector general-purpose registers allocated
|
||||
for the kernel, see AGPRs. Note: this may not exactly match the number of AGPRs
|
||||
requested by the compiler due to allocation granularity.'
|
||||
SGPRs: 'The number of scalar general-purpose registers allocated for the kernel,
|
||||
see SALU. Note: this may not exactly match the number of SGPRs requested by
|
||||
the compiler due to allocation granularity.'
|
||||
LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
|
||||
for this kernel. Note: This may also be larger than what was requested at compile
|
||||
time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
|
||||
Scratch Allocation: The number of bytes of scratch memory requested per work-item
|
||||
for this kernel. Scratch memory is used for stack memory on the accelerator,
|
||||
as well as for register spills and restores.
|
||||
Kernel Time: The total duration of the executed kernel.
|
||||
Kernel Time (Cycles): The total duration of the executed kernel in cycles.
|
||||
Instructions per wavefront: The average number of instructions (of all types)
|
||||
executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.
|
||||
Wave Cycles: The number of cycles a wavefront in the kernel dispatch spent resident
|
||||
on a compute unit per normalization unit. This is averaged over all wavefronts
|
||||
in a kernel dispatch.
|
||||
Dependency Wait Cycles: The number of cycles a wavefront in the kernel dispatch
|
||||
was stalled waiting on memory of any kind (e.g., instruction fetch, vector or scalar memory) per normalization unit. This is averaged over
|
||||
all wavefronts in a kernel dispatch.
|
||||
Issue Wait Cycles: The number of cycles a wavefront in the kernel dispatch was
|
||||
unable to issue an instruction for any reason (e.g., execution pipe back-pressure,
|
||||
arbitration loss, etc.) per normalization unit. This counter is incremented
|
||||
at every cycle by all wavefronts on a CU unable to issue an instruction. As
|
||||
such, it is most useful to get a sense of how waves were spending their time,
|
||||
rather than identification of a precise limiter because another wave could be
|
||||
actively executing while a wave is issue stalled. The sum of this metric, Dependency
|
||||
Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
|
||||
Active Cycles: The average number of cycles a wavefront in the kernel dispatch
|
||||
was actively executing instructions per normalization unit. This measurement
|
||||
is made on a per-wavefront basis, and may include cycles that another wavefront
|
||||
spent actively executing (on another execution unit, for example) or was stalled.
|
||||
As such, it is most useful to get a sense of how waves were spending their time,
|
||||
rather than identification of a precise limiter. The sum of this metric, Issue
|
||||
Wait Cycles and Dependency Wait Cycles should be equal to the total Wave Cycles
|
||||
metric.
|
||||
Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
|
||||
over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
||||
kernels (less than 1 ms).'
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 701
|
||||
@@ -171,3 +114,66 @@ Panel Config:
|
||||
max: MAX((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
unit: Wavefronts
|
||||
coll_level: SQ_LEVEL_WAVES
|
||||
metrics_description:
|
||||
Grid Size: The total number of work-items (or, threads) launched as a part of
|
||||
the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied
|
||||
by the total workgroup (or, block) size.
|
||||
Workgroup Size: The total number of work-items (or, threads) in each workgroup
|
||||
(or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
|
||||
to the total block size.
|
||||
Total Wavefronts: |-
|
||||
The total number of wavefronts launched as part of the kernel dispatch.
|
||||
On AMD Instinct™ CDNA™ accelerators and GCN™ GPUs, the wavefront
|
||||
size is always 64 work-items. Thus, the total number of wavefronts should
|
||||
be equivalent to the ceiling of grid size divided by 64.
|
||||
Saved Wavefronts: The total number of wavefronts saved at a context-save.
|
||||
Restored Wavefronts: The total number of wavefronts restored from a context-save.
|
||||
VGPRs: |-
|
||||
The number of architected vector general-purpose registers allocated
|
||||
for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
|
||||
requested by the compiler due to allocation granularity.
|
||||
AGPRs: |-
|
||||
The number of accumulation vector general-purpose registers allocated
|
||||
for the kernel, see AGPRs. Note: this may not exactly match the number of
|
||||
AGPRs requested by the compiler due to allocation granularity.
|
||||
SGPRs: |-
|
||||
The number of scalar general-purpose registers allocated for the kernel,
|
||||
see SALU. Note: this may not exactly match the number of SGPRs requested by
|
||||
the compiler due to allocation granularity.
|
||||
LDS Allocation: |-
|
||||
The number of bytes of LDS memory (or, shared memory) allocated for
|
||||
this kernel. Note: This may also be larger than what was requested at compile
|
||||
time due to both allocation granularity and dynamic per-dispatch LDS allocations.
|
||||
Scratch Allocation: The number of bytes of scratch memory requested per work-item
|
||||
for this kernel. Scratch memory is used for stack memory on the accelerator,
|
||||
as well as for register spills and restores.
|
||||
Kernel Time: The total duration of the executed kernel.
|
||||
Kernel Time (Cycles): The total duration of the executed kernel in cycles.
|
||||
Instructions per wavefront: The average number of instructions (of all types)
|
||||
executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.
|
||||
Wave Cycles: The number of cycles a wavefront in the kernel dispatch spent resident
|
||||
on a compute unit per normalization unit. This is averaged over all wavefronts
|
||||
in a kernel dispatch.
|
||||
Dependency Wait Cycles: The number of cycles a wavefront in the kernel dispatch
|
||||
was stalled waiting on memory of any kind (e.g., instruction fetch, vector or scalar memory) per normalization unit. This is averaged over
|
||||
all wavefronts in a kernel dispatch.
|
||||
Issue Wait Cycles: The number of cycles a wavefront in the kernel dispatch was
|
||||
unable to issue an instruction for any reason (e.g., execution pipe back-pressure,
|
||||
arbitration loss, etc.) per normalization unit. This counter is incremented
|
||||
at every cycle by all wavefronts on a CU unable to issue an instruction. As
|
||||
such, it is most useful to get a sense of how waves were spending their time,
|
||||
rather than identification of a precise limiter because another wave could be
|
||||
actively executing while a wave is issue stalled. The sum of this metric, Dependency
|
||||
Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
|
||||
Active Cycles: The average number of cycles a wavefront in the kernel dispatch
|
||||
was actively executing instructions per normalization unit. This measurement
|
||||
is made on a per-wavefront basis, and may include cycles that another wavefront
|
||||
spent actively executing (on another execution unit, for example) or was stalled.
|
||||
As such, it is most useful to get a sense of how waves were spending their time,
|
||||
rather than identification of a precise limiter. The sum of this metric, Issue
|
||||
Wait Cycles and Dependency Wait Cycles should be equal to the total Wave Cycles
|
||||
metric.
|
||||
Wavefront Occupancy: |-
|
||||
The time-averaged number of wavefronts resident on the accelerator over
|
||||
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
||||
kernels (less than 1 ms).
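Two relationships stated in the descriptions above can be checked with simple arithmetic: Total Wavefronts is the ceiling of grid size divided by the 64-work-item wavefront size, and Wave Cycles should equal the sum of Dependency Wait Cycles, Issue Wait Cycles, and Active Cycles. The Python sketch below is illustrative only; the function names and sample values are hypothetical and do not come from rocprofiler-compute.

```python
WAVEFRONT_SIZE = 64  # work-items per wavefront, as stated in the description


def total_wavefronts(grid_size: int) -> int:
    """Ceiling of grid_size / 64, using integer arithmetic."""
    return (grid_size + WAVEFRONT_SIZE - 1) // WAVEFRONT_SIZE


def unaccounted_wave_cycles(wave, dependency_wait, issue_wait, active):
    """Difference between Wave Cycles and the sum of its three components.

    A value near zero means the measured counters are consistent with
    the relationship described in the metric text.
    """
    return wave - (dependency_wait + issue_wait + active)


print(total_wavefronts(1000))
print(unaccounted_wave_cycles(1000.0, 400.0, 350.0, 250.0))
```

Integer ceiling division avoids floating-point rounding for very large grid sizes.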
|
||||
|
||||
+32
-84
@@ -2,90 +2,6 @@
Panel Config:
id: 1000
title: Compute Units - Instruction Mix
metrics_description:
VALU: The total number of vector arithmetic logic unit (VALU) operations issued.
These are the workhorses of the compute unit, and are used to execute a wide
range of instruction types including floating point operations, non-uniform
address calculations, transcendental operations, integer operations, shifts,
conditional evaluation, etc.
VMEM: The total number of vector memory operations issued. These include most
loads, stores and atomic operations and all accesses to generic, global, private
and texture memory.
LDS: The total number of LDS (also known as shared memory) operations issued.
These include loads, stores, atomics, and HIP's __shfl operations.
MFMA: The total number of matrix fused multiply-add instructions issued.
SALU: The total number of scalar arithmetic logic unit (SALU) operations issued.
Typically these are used for address calculations, literal constants, and other
operations that are provably uniform across a wavefront. Although scalar memory
(SMEM) operations are issued by the SALU, they are counted separately in this
section.
SMEM: The total number of scalar memory (SMEM) operations issued. These are typically
used for loading kernel arguments, base-pointers and loads from HIP's __constant__
memory.
Branch: The total number of branch operations issued. These typically consist
of jump or branch operations and are used to implement control flow.
INT32: The total number of instructions operating on 32-bit integer operands issued
to the VALU per normalization unit.
INT64: The total number of instructions operating on 64-bit integer operands issued
to the VALU per normalization unit.
F16-ADD: The total number of addition instructions operating on 16-bit floating-point
operands issued to the VALU per normalization unit.
F16-MUL: The total number of multiplication instructions operating on 16-bit floating-point
operands issued to the VALU per normalization unit.
F16-FMA: The total number of fused multiply-add instructions operating on 16-bit
floating-point operands issued to the VALU per normalization unit.
F16-Trans: The total number of transcendental instructions (e.g., sqrt) operating
on 16-bit floating-point operands issued to the VALU per normalization unit.
F32-ADD: The total number of addition instructions operating on 32-bit floating-point
operands issued to the VALU per normalization unit.
F32-MUL: The total number of multiplication instructions operating on 32-bit floating-point
operands issued to the VALU per normalization unit.
F32-FMA: The total number of fused multiply-add instructions operating on 32-bit
floating-point operands issued to the VALU per normalization unit.
F32-Trans: The total number of transcendental instructions (such as sqrt) operating
on 32-bit floating-point operands issued to the VALU per normalization unit.
F64-ADD: The total number of addition instructions operating on 64-bit floating-point
operands issued to the VALU per normalization unit.
F64-MUL: The total number of multiplication instructions operating on 64-bit floating-point
operands issued to the VALU per normalization unit.
F64-FMA: The total number of fused multiply-add instructions operating on 64-bit
floating-point operands issued to the VALU per normalization unit.
F64-Trans: The total number of transcendental instructions (such as sqrt) operating
on 64-bit floating-point operands issued to the VALU per normalization unit.
Conversion: "The total number of type conversion instructions (such as converting\
\ data to or from F32\u2194F64) issued to the VALU per normalization unit."
Global/Generic Instr: The total number of global & generic memory instructions
executed on all compute units on the accelerator, per normalization unit.
Global/Generic Read: The total number of global & generic memory read instructions
executed on all compute units on the accelerator, per normalization unit.
Global/Generic Write: The total number of global & generic memory write instructions
executed on all compute units on the accelerator, per normalization unit.
Global/Generic Atomic: The total number of global & generic memory atomic (with
and without return) instructions executed on all compute units on the accelerator,
per normalization unit.
Spill/Stack Instr: The total number of spill/stack memory instructions executed
on all compute units on the accelerator, per normalization unit.
Spill/Stack Read: The total number of spill/stack memory read instructions executed
on all compute units on the accelerator, per normalization unit.
Spill/Stack Write: The total number of spill/stack memory write instructions executed
on all compute units on the accelerator, per normalization unit.
Spill/Stack Atomic: The total number of spill/stack memory atomic (with and without
return) instructions executed on all compute units on the accelerator, per normalization
unit. Typically unused as these memory operations are typically used to implement
thread-local storage.
MFMA-I8: The total number of 8-bit integer MFMA instructions issued per normalization
unit.
MFMA-F8: The total number of 8-bit floating point MFMA instructions issued per
normalization unit. This is supported in AMD Instinct MI300 series and later
only.
MFMA-F16: The total number of 16-bit floating point MFMA instructions issued per
normalization unit.
MFMA-BF16: The total number of 16-bit brain floating point MFMA instructions issued
per normalization unit.
MFMA-F32: The total number of 32-bit floating-point MFMA instructions issued per
normalization unit.
MFMA-F64: The total number of 64-bit floating-point MFMA instructions issued per
normalization unit.
data source:
- metric_table:
id: 1001
@@ -187,3 +103,35 @@ Panel Config:
max: Max
unit: Unit
metric: {}
metrics_description:
LDS: The total number of LDS (also known as shared memory) operations issued.
These include loads, stores, atomics, and HIP's __shfl operations.
SALU: The total number of scalar arithmetic logic unit (SALU) operations issued.
Typically these are used for address calculations, literal constants, and other
operations that are provably uniform across a wavefront. Although scalar memory
(SMEM) operations are issued by the SALU, they are counted separately in this
section.
SMEM: The total number of scalar memory (SMEM) operations issued. These are typically
used for loading kernel arguments, base-pointers and loads from HIP's __constant__
memory.
Branch: The total number of branch operations issued. These typically consist
of jump or branch operations and are used to implement control flow.
Global/Generic Instr: The total number of global & generic memory instructions
executed on all compute units on the accelerator, per normalization unit.
Global/Generic Read: The total number of global & generic memory read instructions
executed on all compute units on the accelerator, per normalization unit.
Global/Generic Write: The total number of global & generic memory write instructions
executed on all compute units on the accelerator, per normalization unit.
Global/Generic Atomic: The total number of global & generic memory atomic (with
and without return) instructions executed on all compute units on the accelerator,
per normalization unit.
Spill/Stack Instr: The total number of spill/stack memory instructions executed
on all compute units on the accelerator, per normalization unit.
Spill/Stack Read: The total number of spill/stack memory read instructions executed
on all compute units on the accelerator, per normalization unit.
Spill/Stack Write: The total number of spill/stack memory write instructions executed
on all compute units on the accelerator, per normalization unit.
Spill/Stack Atomic: The total number of spill/stack memory atomic (with and without
return) instructions executed on all compute units on the accelerator, per normalization
unit. Typically unused as these memory operations are typically used to implement
thread-local storage.
+19
-80
@@ -2,84 +2,6 @@
Panel Config:
id: 1100
title: Compute Units - Compute Pipeline
metrics_description:
VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
This is also presented as a percent of the peak theoretical FLOPs achievable
on the specific accelerator. Note: this does not include any floating-point
operations from MFMA instructions.'
VALU IOPs: 'The total integer operations executed per second on the VALU. This
is also presented as a percent of the peak theoretical IOPs achievable on the
specific accelerator. Note: this does not include any integer operations from
MFMA instructions.'
MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
executed per second. Note: this does not include any 16-bit brain floating point
operations from VALU instructions. This is also presented as a percent of the
peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
per second. Note: this does not include any 16-bit floating point operations
from VALU instructions. This is also presented as a percent of the peak theoretical
F16 MFMA operations achievable on the specific accelerator.'
MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
per second. Note: this does not include any 32-bit floating point operations
from VALU instructions. This is also presented as a percent of the peak theoretical
F32 MFMA operations achievable on the specific accelerator.'
MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
per second. Note: this does not include any 64-bit floating point operations
from VALU instructions. This is also presented as a percent of the peak theoretical
F64 MFMA operations achievable on the specific accelerator.'
MFMA IOPs (INT8): 'The total number of 8-bit integer MFMA operations executed
per second. Note: this does not include any 8-bit integer operations from VALU
instructions. This is also presented as a percent of the peak theoretical INT8
MFMA operations achievable on the specific accelerator.'
IPC: The ratio of the total number of instructions executed on the CU over the
total active CU cycles.
IPC (Issued): The ratio of the total number of (non-internal) instructions issued
over the number of cycles where the scheduler was actively working on issuing
instructions.
SALU Utilization: Indicates what percent of the kernel's duration the SALU was
busy executing instructions. Computed as the ratio of the total number of cycles
spent by the scheduler issuing SALU / SMEM instructions over the total CU cycles.
VALU Utilization: Indicates what percent of the kernel's duration the VALU was
busy executing instructions. Does not include VMEM operations. Computed as the
ratio of the total number of cycles spent by the scheduler issuing VALU instructions
over the total CU cycles.
VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
was busy executing instructions, including both global/generic and spill/scratch
operations (see the VMEM instruction count metrics for more detail). Does not
include VALU operations. Computed as the ratio of the total number of cycles
spent by the scheduler issuing VMEM instructions over the total CU cycles.
Branch Utilization: Indicates what percent of the kernel's duration the branch
unit was busy executing instructions. Computed as the ratio of the total number
of cycles spent by the scheduler issuing branch instructions over the total
CU cycles.
VALU Active Threads: Indicates the average level of divergence within a wavefront
over the lifetime of the kernel. The number of work-items that were active in
a wavefront during execution of each VALU instruction, time-averaged over all
VALU instructions run on all wavefronts in the kernel
MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
was busy executing instructions. Computed as the ratio of the total number of
cycles the MFMA unit was busy over the total CU cycles.
MFMA Instruction Cycles: The average duration of MFMA instructions in this kernel
in cycles. Computed as the ratio of the total number of cycles the MFMA unit
was busy over the total number of MFMA instructions.
VMEM Latency: The average number of round-trip cycles (that is, from issue to
data return / acknowledgment) required for a VMEM instruction to complete.
SMEM Latency: The average number of round-trip cycles (that is, from issue to
data return / acknowledgment) required for a SMEM instruction to complete.
FLOPs (Total): The total number of floating-point operations executed on either
the VALU or MFMA units, per normalization unit.
IOPs (Total): The total number of integer operations executed on either the VALU
or MFMA units, per normalization unit.
F16 OPs: The total number of 16-bit floating-point operations executed on either
the VALU or MFMA units, per normalization unit.
BF16 OPs: The total number of 16-bit brain floating-point operations executed
on either the VALU or MFMA units, per normalization unit.
F32 OPs: The total number of 32-bit floating-point operations executed on either
the VALU or MFMA units, per normalization unit.
F64 OPs: The total number of 64-bit floating-point operations executed on either
the VALU or MFMA units, per normalization unit.
INT8 OPs: The total number of 8-bit integer operations executed on either the
VALU or MFMA units, per normalization unit.
data source:
- metric_table:
id: 1101
@@ -108,13 +30,13 @@ Panel Config:
unit: Instr/cycle
IPC (Issued):
avg: AVG(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
/ SQ_ACTIVE_INST_ANY))
min: MIN(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
/ SQ_ACTIVE_INST_ANY))
max: MAX(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
/ SQ_ACTIVE_INST_ANY))
unit: Instr/cycle
SALU Utilization:
@@ -145,3 +67,20 @@ Panel Config:
max: Max
unit: Unit
metric: {}
metrics_description:
IPC: The ratio of the total number of instructions executed on the CU over the
total active CU cycles.
IPC (Issued): The ratio of the total number of (non-internal) instructions issued
over the number of cycles where the scheduler was actively working on issuing
instructions.
SALU Utilization: Indicates what percent of the kernel's duration the SALU was
busy executing instructions. Computed as the ratio of the total number of cycles
spent by the scheduler issuing SALU / SMEM instructions over the total CU cycles.
VALU Utilization: Indicates what percent of the kernel's duration the VALU was
busy executing instructions. Does not include VMEM operations. Computed as the
ratio of the total number of cycles spent by the scheduler issuing VALU instructions
over the total CU cycles.
VALU Active Threads: Indicates the average level of divergence within a wavefront
over the lifetime of the kernel. The number of work-items that were active in
a wavefront during execution of each VALU instruction, time-averaged over all
VALU instructions run on all wavefronts in the kernel
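Reviewer note: the `IPC (Issued)` expression in the Compute Pipeline panel above divides a sum of issued-instruction counters by `SQ_ACTIVE_INST_ANY`. The following is a minimal Python sketch of that arithmetic, not the tool's evaluator. The counter names come from the diff; the helper names `ipc_issued` and `summarize`, the sample values, the zero-cycle guard, and the assumption that `AVG`/`MIN`/`MAX` aggregate across dispatches are all illustrative.

```python
from statistics import mean

# Counters summed in the numerator of the IPC (Issued) expression.
ISSUED_COUNTERS = (
    "SQ_INSTS_VALU", "SQ_INSTS_VMEM", "SQ_INSTS_SALU", "SQ_INSTS_SMEM",
    "SQ_INSTS_BRANCH", "SQ_INSTS_SENDMSG", "SQ_INSTS_VSKIPPED", "SQ_INSTS_LDS",
)


def ipc_issued(counters):
    """Issued instructions per cycle in which the scheduler was issuing."""
    active = counters["SQ_ACTIVE_INST_ANY"]
    if active == 0:
        return None  # assumption: report no value rather than divide by zero
    return sum(counters[name] for name in ISSUED_COUNTERS) / active


def summarize(dispatches):
    """AVG / MIN / MAX over dispatches (assumed aggregation axis)."""
    values = [v for v in map(ipc_issued, dispatches) if v is not None]
    return {"avg": mean(values), "min": min(values), "max": max(values)}


# Made-up counter values for two dispatches of one kernel.
base = {
    "SQ_INSTS_VALU": 400, "SQ_INSTS_VMEM": 100, "SQ_INSTS_SALU": 200,
    "SQ_INSTS_SMEM": 50, "SQ_INSTS_BRANCH": 20, "SQ_INSTS_SENDMSG": 5,
    "SQ_INSTS_VSKIPPED": 0, "SQ_INSTS_LDS": 25,
}
dispatches = [
    dict(base, SQ_ACTIVE_INST_ANY=1600),
    dict(base, SQ_ACTIVE_INST_ANY=3200),
]
stats = summarize(dispatches)
print(stats)
```

With these sample values each dispatch issues 800 instructions, so the per-dispatch results are 0.5 and 0.25 instructions per cycle.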
+52
-51
@@ -2,51 +2,6 @@
Panel Config:
id: 1200
title: Local Data Share (LDS)
metrics_description:
Utilization: Indicates what percent of the kernel's duration the LDS was actively
executing instructions (including, but not limited to, load, store, atomic and
HIP's __shfl operations). Calculated as the ratio of the total number of cycles
LDS was active over the total CU cycles.
Access Rate: Indicates the percentage of SIMDs in the VALU actively issuing LDS
instructions, averaged over the lifetime of the kernel. Calculated as the ratio
of the total number of cycles spent by the scheduler issuing LDS instructions
over the total CU cycles.
Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
could have been loaded from, stored to, or atomically updated in the LDS, expressed
as a percentage of the theoretical peak. Does not take into account the execution
mask of the wavefront when the instruction was executed.
Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
loaded from, stored to, or atomically updated in the LDS divided by total duration.
Does not take into account the execution mask of the wavefront when the instruction
was executed.
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing
bank conflicts over the number of LDS cycles that would have been required to
move the same amount of data in an uncontended access.
LDS Instructions: The total number of LDS instructions (including, but not limited
to, read/write/atomics and HIP's __shfl instructions) executed per normalization
unit.
LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
/ acknowledgment) required for an LDS instruction to complete.
Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS scheduler
due to bank conflicts (as determined by the conflict resolution hardware) to
the base number of cycles that would be spent in the LDS scheduler in a completely
uncontended case. This is the unnormalized form of the Bank Conflict Rate.
Index Accesses: The total number of cycles spent in the LDS scheduler over all
operations per normalization unit.
Atomic Return Cycles: The total number of cycles spent on LDS atomics with return
per normalization unit.
Bank Conflict: The total number of cycles spent in the LDS scheduler due to bank
conflicts (as determined by the conflict resolution hardware) per normalization
unit.
Addr Conflict: The total number of cycles spent in the LDS scheduler due to address
conflicts (as determined by the conflict resolution hardware) per normalization
unit.
Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
stalls from non-dword aligned addresses per normalization unit.
Mem Violations: "The total number of out-of-bounds accesses made to the LDS, per\
\ normalization unit. This is unused and expected to be zero in most configurations\
\ for modern CDNA\u2122 accelerators."
data source:
- metric_table:
id: 1201
@@ -87,7 +42,7 @@ Panel Config:
avg: AVG((SQ_INSTS_LDS / $denom))
min: MIN((SQ_INSTS_LDS / $denom))
max: MAX((SQ_INSTS_LDS / $denom))
unit: (Instr + $normUnit)
unit: (Instr + $normUnit)
Theoretical Bandwidth:
avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
/ (End_Timestamp - Start_Timestamp)))
@@ -117,29 +72,75 @@ Panel Config:
avg: AVG((SQ_LDS_IDX_ACTIVE / $denom))
min: MIN((SQ_LDS_IDX_ACTIVE / $denom))
max: MAX((SQ_LDS_IDX_ACTIVE / $denom))
unit: (Cycles + $normUnit)
unit: (Cycles + $normUnit)
Atomic Return Cycles:
avg: AVG((SQ_LDS_ATOMIC_RETURN / $denom))
min: MIN((SQ_LDS_ATOMIC_RETURN / $denom))
max: MAX((SQ_LDS_ATOMIC_RETURN / $denom))
unit: (Cycles + $normUnit)
unit: (Cycles + $normUnit)
Bank Conflict:
avg: AVG((SQ_LDS_BANK_CONFLICT / $denom))
min: MIN((SQ_LDS_BANK_CONFLICT / $denom))
max: MAX((SQ_LDS_BANK_CONFLICT / $denom))
unit: (Cycles + $normUnit)
unit: (Cycles + $normUnit)
Addr Conflict:
avg: AVG((SQ_LDS_ADDR_CONFLICT / $denom))
min: MIN((SQ_LDS_ADDR_CONFLICT / $denom))
max: MAX((SQ_LDS_ADDR_CONFLICT / $denom))
unit: (Cycles + $normUnit)
unit: (Cycles + $normUnit)
Unaligned Stall:
avg: AVG((SQ_LDS_UNALIGNED_STALL / $denom))
min: MIN((SQ_LDS_UNALIGNED_STALL / $denom))
max: MAX((SQ_LDS_UNALIGNED_STALL / $denom))
unit: (Cycles + $normUnit)
unit: (Cycles + $normUnit)
Mem Violations:
avg: AVG((SQ_LDS_MEM_VIOLATIONS / $denom))
min: MIN((SQ_LDS_MEM_VIOLATIONS / $denom))
max: MAX((SQ_LDS_MEM_VIOLATIONS / $denom))
unit: (Accesses + $normUnit)
metrics_description:
Utilization: Indicates what percent of the kernel's duration the LDS was actively
executing instructions (including, but not limited to, load, store, atomic and
HIP's __shfl operations). Calculated as the ratio of the total number of cycles
LDS was active over the total CU cycles.
Access Rate: Indicates the percentage of SIMDs in the VALU actively issuing LDS
instructions, averaged over the lifetime of the kernel. Calculated as the ratio
of the total number of cycles spent by the scheduler issuing LDS instructions
over the total CU cycles.
Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
could have been loaded from, stored to, or atomically updated in the LDS, expressed
as a percentage of the theoretical peak. Does not take into account the execution
mask of the wavefront when the instruction was executed.
Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
loaded from, stored to, or atomically updated in the LDS divided by total duration.
Does not take into account the execution mask of the wavefront when the instruction
was executed.
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing
bank conflicts over the number of LDS cycles that would have been required to
move the same amount of data in an uncontended access.
LDS Instructions: The total number of LDS instructions (including, but not limited
to, read/write/atomics and HIP's __shfl instructions) executed per normalization
unit.
LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
acknowledgment) required for an LDS instruction to complete.
Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS scheduler
due to bank conflicts (as determined by the conflict resolution hardware) to
the base number of cycles that would be spent in the LDS scheduler in a completely
uncontended case. This is the unnormalized form of the Bank Conflict Rate.
Index Accesses: The total number of cycles spent in the LDS scheduler over all
operations per normalization unit.
Atomic Return Cycles: The total number of cycles spent on LDS atomics with return
per normalization unit.
Bank Conflict: The total number of cycles spent in the LDS scheduler due to bank
conflicts (as determined by the conflict resolution hardware) per normalization
unit.
Addr Conflict: The total number of cycles spent in the LDS scheduler due to address
conflicts (as determined by the conflict resolution hardware) per normalization
unit.
Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
stalls from non-dword aligned addresses per normalization unit.
Mem Violations: |-
The total number of out-of-bounds accesses made to the LDS, per normalization
unit. This is unused and expected to be zero in most configurations for
modern CDNA™ accelerators.
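Reviewer note: the LDS panel above computes `Theoretical Bandwidth` from conflict-free LDS cycles and normalizes the other counters by `$denom`. The sketch below restates both expressions in Python under stated assumptions: the function names are hypothetical, the factor 4 is read here as 4 bytes (one dword) per bank per cycle, the bank count of 32 is only a sample value for `$lds_banks_per_cu`, and the resulting unit depends on the timestamp resolution, which this diff does not state.

```python
def lds_theoretical_bandwidth(idx_active, bank_conflict, banks_per_cu,
                              start_ts, end_ts):
    """((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4 * banks) / duration."""
    conflict_free_cycles = idx_active - bank_conflict
    # Assumed reading of the constant: 4 bytes per bank per cycle.
    bytes_moved = conflict_free_cycles * 4 * banks_per_cu
    return bytes_moved / (end_ts - start_ts)


def per_norm_unit(counter, denom):
    """Mirror of the `<counter> / $denom` pattern used by the table rows."""
    return counter / denom


# Made-up sample values, not measurements.
bandwidth = lds_theoretical_bandwidth(
    idx_active=10_000, bank_conflict=2_000, banks_per_cu=32,
    start_ts=0, end_ts=1_000,
)
bank_conflict_per_wave = per_norm_unit(2_000, denom=250)
print(bandwidth, bank_conflict_per_wave)
```

Because bank-conflict cycles are subtracted before scaling, a kernel with heavy conflicts shows a lower theoretical bandwidth for the same number of active LDS cycles.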
+26
-26
@@ -2,28 +2,6 @@
Panel Config:
id: 1300
title: Instruction Cache
metrics_description:
Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
the total L1I cycles.
Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
loaded line in the cache. Calculated as the ratio of the number of L1I requests
that hit over the number of all L1I requests.
L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\
\ L2 cache request bandwidth achieved. Calculated as the ratio of the total\
\ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\
\ cycles."
L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
divided by total duration.
Req: The total number of requests made to the L1I per normalization-unit
Hits: The total number of L1I requests that hit on a previously loaded cache line,
per normalization-unit.
Misses - Non Duplicated: The total number of L1I requests that missed on a cache
line that were not already pending due to another request, per normalization-unit.
Misses - Duplicated: The total number of L1I requests that missed on a cache line
that were already pending due to another request, per normalization-unit.
Instruction Fetch Latency: The average number of cycles spent to fetch instructions
to a CU.
data source:
- metric_table:
id: 1301
@@ -62,22 +40,22 @@ Panel Config:
avg: AVG((SQC_ICACHE_REQ / $denom))
min: MIN((SQC_ICACHE_REQ / $denom))
max: MAX((SQC_ICACHE_REQ / $denom))
unit: (Req + $normUnit)
unit: (Req + $normUnit)
Hits:
avg: AVG((SQC_ICACHE_HITS / $denom))
min: MIN((SQC_ICACHE_HITS / $denom))
max: MAX((SQC_ICACHE_HITS / $denom))
unit: (Hits + $normUnit)
unit: (Hits + $normUnit)
Misses - Non Duplicated:
avg: AVG((SQC_ICACHE_MISSES / $denom))
min: MIN((SQC_ICACHE_MISSES / $denom))
max: MAX((SQC_ICACHE_MISSES / $denom))
unit: (Misses + $normUnit)
unit: (Misses + $normUnit)
Misses - Duplicated:
avg: AVG((SQC_ICACHE_MISSES_DUPLICATE / $denom))
min: MIN((SQC_ICACHE_MISSES_DUPLICATE / $denom))
max: MAX((SQC_ICACHE_MISSES_DUPLICATE / $denom))
unit: (Misses + $normUnit)
unit: (Misses + $normUnit)
Cache Hit Rate:
avg: AVG(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
+ SQC_ICACHE_MISSES_DUPLICATE)))
@@ -107,3 +85,25 @@ Panel Config:
min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
unit: Gbps
metrics_description:
Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
the total L1I cycles.
Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
loaded line in the cache. Calculated as the ratio of the number of L1I requests
that hit over the number of all L1I requests.
L1I-L2 Bandwidth Utilization: |-
The percent of the peak theoretical L1I → L2 cache request bandwidth
achieved. Calculated as the ratio of the total number of requests from the
L1I to the L2 cache over the total L1I-L2 interface cycles.
L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
divided by total duration.
Req: The total number of requests made to the L1I per normalization-unit
Hits: The total number of L1I requests that hit on a previously loaded cache line,
per normalization-unit.
Misses - Non Duplicated: The total number of L1I requests that missed on a cache
line that were not already pending due to another request, per normalization-unit.
Misses - Duplicated: The total number of L1I requests that missed on a cache line
that were already pending due to another request, per normalization-unit.
Instruction Fetch Latency: The average number of cycles spent to fetch instructions
to a CU.
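Reviewer note: the Instruction Cache panel above defines `Cache Hit Rate` as hits over hits plus both kinds of misses, and `L1I-L2 Bandwidth` as `SQC_TC_INST_REQ * 64` over the kernel duration. A small Python sketch of those two expressions follows. The function names and sample values are hypothetical, the factor 64 is read here as bytes per request, and the `None` return for an idle cache is an illustrative choice (the L1I expression shown in the diff has no such guard).

```python
def l1i_hit_rate(hits, misses, misses_duplicate):
    """100 * HITS / (HITS + MISSES + MISSES_DUPLICATE), as a percentage."""
    total = hits + misses + misses_duplicate
    if total == 0:
        return None  # illustrative guard; not part of the expression in the diff
    return 100 * hits / total


def l1i_l2_bandwidth(tc_inst_req, start_ts, end_ts):
    """(SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)."""
    # Assumed reading of the constant: 64 bytes transferred per request.
    return tc_inst_req * 64 / (end_ts - start_ts)


# Made-up sample values, not measurements.
hit_rate = l1i_hit_rate(hits=900, misses=80, misses_duplicate=20)
bandwidth = l1i_l2_bandwidth(tc_inst_req=500, start_ts=0, end_ts=1_000)
print(hit_rate, bandwidth)
```

Duplicated misses count toward the denominator, so a burst of requests to one pending line lowers the hit rate even though only one fill is outstanding.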
+61
-58
@@ -2,49 +2,6 @@
Panel Config:
id: 1400
title: Scalar L1 Data Cache
metrics_description:
Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
over the total sL1D cycles.
Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
loaded line in the cache. The ratio of the number of sL1D requests that hit over
the number of all sL1D requests.
sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
bandwidth achieved. Calculated as total number of bytes read from, written
to, or atomically updated across the sL1D - L2 interface.
sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
\ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\
\ writes and atomics are typically unused on current CDNA accelerators, so in\
\ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
Req: The total number of requests, of any size or type, made to the sL1D per normalization
unit.
Hits: The total number of sL1D requests that hit on a previously loaded cache
line, per normalization unit.
Misses - Non Duplicated: 'The total number of sL1D requests that missed on a cache
line that was not already pending due to another request, per normalization
unit. '
Misses- Duplicated: The total number of sL1D requests that missed on a cache line
that was already pending due to another request, per normalization unit.
Read Req (Total): The total number of sL1D read requests of any size, per normalization
unit.
Atomic Req: The total number of atomic requests from sL1D to the L2, per normalization
unit. Typically unused on current CDNA accelerators.
Read Req (1 DWord): The total number of sL1D read requests made for a single dword
of data (4B), per normalization unit.
Read Req (2 DWord): The total number of sL1D read requests made for two dwords
of data (8B), per normalization unit.
Read Req (4 DWord): The total number of sL1D read requests made for four dwords
of data (16B), per normalization unit.
Read Req (8 DWord): The total number of sL1D read requests made for eight dwords
of data (32B), per normalization unit.
Read Req (16 DWord): The total number of sL1D read requests made for sixteen
dwords of data (64B), per normalization unit.
Read Req: The total number of read requests from sL1D to the L2 per normalization
unit.
Write Req: The total number of write requests from sL1D to the L2, per normalization
unit. Typically unused on current CDNA accelerators.
Stall Cycles: "The total number of cycles the sL1D\u2194L2 interface was stalled,\
\ per normalization unit."
data source:
- metric_table:
id: 1401
@@ -84,22 +41,22 @@ Panel Config:
avg: AVG((SQC_DCACHE_REQ / $denom))
min: MIN((SQC_DCACHE_REQ / $denom))
max: MAX((SQC_DCACHE_REQ / $denom))
unit: (Req + $normUnit)
unit: (Req + $normUnit)
Hits:
avg: AVG((SQC_DCACHE_HITS / $denom))
min: MIN((SQC_DCACHE_HITS / $denom))
max: MAX((SQC_DCACHE_HITS / $denom))
unit: (Req + $normUnit)
unit: (Req + $normUnit)
Misses - Non Duplicated:
avg: AVG((SQC_DCACHE_MISSES / $denom))
min: MIN((SQC_DCACHE_MISSES / $denom))
max: MAX((SQC_DCACHE_MISSES / $denom))
unit: (Req + $normUnit)
unit: (Req + $normUnit)
Misses- Duplicated:
avg: AVG((SQC_DCACHE_MISSES_DUPLICATE / $denom))
min: MIN((SQC_DCACHE_MISSES_DUPLICATE / $denom))
max: MAX((SQC_DCACHE_MISSES_DUPLICATE / $denom))
unit: (Req + $normUnit)
unit: (Req + $normUnit)
Cache Hit Rate:
avg: AVG((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
+ SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
@@ -118,37 +75,37 @@ Panel Config:
|
||||
+ SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
|
||||
max: MAX((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
|
||||
+ SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Atomic Req:
|
||||
avg: AVG((SQC_DCACHE_ATOMIC / $denom))
|
||||
min: MIN((SQC_DCACHE_ATOMIC / $denom))
|
||||
max: MAX((SQC_DCACHE_ATOMIC / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (1 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_1 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_1 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_1 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (2 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_2 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_2 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_2 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (4 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_4 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_4 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_4 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (8 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_8 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_8 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_8 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (16 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_16 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_16 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_16 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1403
|
||||
title: Scalar L1D Cache - L2 Interface
|
||||
@@ -171,19 +128,65 @@ Panel Config:
|
||||
avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_READ_REQ / $denom))
|
||||
max: MAX((SQC_TC_DATA_READ_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Write Req:
|
||||
avg: AVG((SQC_TC_DATA_WRITE_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_WRITE_REQ / $denom))
|
||||
max: MAX((SQC_TC_DATA_WRITE_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Atomic Req:
|
||||
avg: AVG((SQC_TC_DATA_ATOMIC_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_ATOMIC_REQ / $denom))
|
||||
max: MAX((SQC_TC_DATA_ATOMIC_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Stall Cycles:
|
||||
avg: AVG((SQC_TC_STALL / $denom))
|
||||
min: MIN((SQC_TC_STALL / $denom))
|
||||
max: MAX((SQC_TC_STALL / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
metrics_description:
Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
over the total sL1D cycles.
Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
loaded line in the cache. The ratio of the number of sL1D requests that hit over
the number of all sL1D requests.
sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
bandwidth achieved. Calculated as the total number of bytes read from, written to,
or atomically updated across the sL1D - L2 interface.
sL1D-L2 BW: |-
The total number of bytes read from, written to, or atomically updated
across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D
writes and atomics are typically unused on current CDNA accelerators, so
in the majority of cases this can be interpreted as an sL1D\u2192L2 read
bandwidth.
Req: The total number of requests, of any size or type, made to the sL1D per normalization
unit.
Hits: The total number of sL1D requests that hit on a previously loaded cache
line, per normalization unit.
Misses - Non Duplicated: |-
The total number of sL1D requests that missed on a cache line that was
not already pending due to another request, per normalization unit.
Misses- Duplicated: The total number of sL1D requests that missed on a cache line
that was already pending due to another request, per normalization unit.
Read Req (Total): The total number of sL1D read requests of any size, per normalization
unit.
Atomic Req: The total number of atomic requests from sL1D to the L2, per normalization
unit. Typically unused on current CDNA accelerators.
Read Req (1 DWord): The total number of sL1D read requests made for a single dword
of data (4B), per normalization unit.
Read Req (2 DWord): The total number of sL1D read requests made for two dwords
of data (8B), per normalization unit.
Read Req (4 DWord): The total number of sL1D read requests made for four dwords
of data (16B), per normalization unit.
Read Req (8 DWord): The total number of sL1D read requests made for eight dwords
of data (32B), per normalization unit.
Read Req (16 DWord): The total number of sL1D read requests made for sixteen
dwords of data (64B), per normalization unit.
Read Req: The total number of read requests from sL1D to the L2, per normalization
unit.
Write Req: The total number of write requests from sL1D to the L2, per normalization
unit. Typically unused on current CDNA accelerators.
Stall Cycles: |-
The total number of cycles the sL1D\u2194L2 interface was stalled, per
normalization unit.
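The sL1D `Cache Hit Rate` description can be read as a guarded ratio. The following is a minimal Python sketch of that arithmetic, not the profiler's own evaluator. The function and parameter names are hypothetical. The hunk truncates the expression after the `if`, so the zero check and the `None` fallback are assumptions modeled on the `... if (x != 0) else None` pattern that other expressions in this diff use.

```python
from typing import Optional


def sl1d_cache_hit_rate(
    hits: float, misses: float, misses_duplicate: float
) -> Optional[float]:
    """Percent of sL1D requests that hit, or None when there were no requests.

    Arguments stand in for SQC_DCACHE_HITS, SQC_DCACHE_MISSES and
    SQC_DCACHE_MISSES_DUPLICATE. The None fallback is an assumption.
    """
    total = (hits + misses) + misses_duplicate
    return (100 * hits) / total if total != 0 else None


def per_normalization_unit(counter: float, denom: float) -> float:
    """Scale a raw counter by the normalization denominator ($denom)."""
    return counter / denom


# Hypothetical counter values, for illustration only.
rate = sl1d_cache_hit_rate(hits=75, misses=20, misses_duplicate=5)
idle = sl1d_cache_hit_rate(hits=0, misses=0, misses_duplicate=0)
```

Returning `None` rather than `0` for an idle cache keeps "no requests" distinguishable from "every request missed".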
+74
-80
@@ -2,70 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1500
|
||||
title: Address Processing Unit and Data Return Path (TA/TD)
|
||||
metrics_description:
Address Processing Unit Busy: Percent of the total CU cycles the address processor
was busy.
Address Stall: Percent of the total CU cycles the address processor was stalled
from sending address requests further into the vL1D pipeline.
Data Stall: Percent of the total CU cycles the address processor was stalled from
sending write/atomic data further into the vL1D pipeline.
"Data-Processor \u2192 Address Stall": Percent of total CU cycles the address
processor was stalled waiting to send command data to the data processor.
Total Instructions: The total number of memory instructions executed by the address
processor over all compute units on the accelerator, per normalization unit.
Global/Generic Instructions: The total number of global & generic memory instructions
executed on all compute units on the accelerator, per normalization unit.
Global/Generic Read Instructions: The total number of global & generic memory
read instructions executed on all compute units on the accelerator, per normalization
unit.
Global/Generic Write Instructions: The total number of global & generic memory
write instructions executed on all compute units on the accelerator, per normalization
unit.
Global/Generic Atomic Instructions: The total number of global & generic memory
atomic (with and without return) instructions executed on all compute units
on the accelerator, per normalization unit.
Spill/Stack Instructions: The total number of spill/stack memory instructions
executed on all compute units on the accelerator, per normalization unit.
Spill/Stack Read Instructions: The total number of spill/stack memory read instructions
executed on all compute units on the accelerator, per normalization unit.
Spill/Stack Write Instructions: The total number of spill/stack memory write instructions
executed on all compute units on the accelerator, per normalization unit.
Spill/Stack Atomic Instructions: The total number of spill/stack memory atomic
(with and without return) instructions executed on all compute units on the
accelerator, per normalization unit. Typically unused, since these memory operations
are generally used to implement thread-local storage.
Spill/Stack Total Cycles: The number of cycles the address processing unit spent
working on spill/stack instructions, per normalization unit.
Spill/Stack Coalesced Read: The number of cycles the address processing unit spent
working on coalesced spill/stack read instructions, per normalization unit.
Spill/Stack Coalesced Write: The number of cycles the address processing unit
spent working on coalesced spill/stack write instructions, per normalization
unit.
Data-Return Busy: Percent of the total CU cycles the data-return unit was busy
processing or waiting on data to return to the CU.
"Cache RAM \u2192 Data-Return Stall": Percent of the total CU cycles the data-return
unit was stalled on data to be returned from the vL1D Cache RAM.
"Workgroup manager \u2192 Data-Return Stall": Percent of the total CU cycles the
data-return unit was stalled by the workgroup manager due to initialization
of registers as a part of launching new workgroups.
Coalescable Instructions: The number of instructions submitted to the data-return
unit by the address processor that were found to be coalescable, per normalization
unit.
Read Instructions: The number of read instructions submitted to the data-return
unit by the address processor summed over all compute units on the accelerator,
per normalization unit. This is expected to be the sum of global/generic and
spill/stack reads in the address processor.
Write Instructions: The number of store instructions submitted to the data-return
unit by the address processor summed over all compute units on the accelerator,
per normalization unit. This is expected to be the sum of global/generic and
spill/stack stores in the address processor.
Atomic Instructions: The number of atomic instructions submitted to the data-return
unit by the address processor summed over all compute units on the accelerator,
per normalization unit. This is expected to be the sum of global/generic and
spill/stack atomics in the address processor.
Write Ack Instructions: The total number of write acknowledgements submitted by
data-return unit to SQ, summed over all compute units on the accelerator, per
normalization unit.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1501
|
||||
@@ -120,47 +56,47 @@ Panel Config:
|
||||
avg: AVG((TA_TOTAL_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_TOTAL_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_TOTAL_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Global/Generic Instructions:
|
||||
avg: AVG((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Global/Generic Read Instructions:
|
||||
avg: AVG((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Global/Generic Write Instructions:
|
||||
avg: AVG((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Global/Generic Atomic Instructions:
|
||||
avg: AVG((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Spill/Stack Instructions:
|
||||
avg: AVG((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Spill/Stack Read Instructions:
|
||||
avg: AVG((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Spill/Stack Write Instructions:
|
||||
avg: AVG((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Spill/Stack Atomic Instructions:
|
||||
avg: AVG((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
- metric_table:
|
||||
id: 1503
|
||||
title: Spill and stack metrics
|
||||
@@ -175,17 +111,17 @@ Panel Config:
|
||||
avg: AVG((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
|
||||
min: MIN((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
|
||||
max: MAX((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
Spill/Stack Coalesced Read:
|
||||
avg: AVG((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
|
||||
min: MIN((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
|
||||
max: MAX((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
Spill/Stack Coalesced Write:
|
||||
avg: AVG((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
|
||||
min: MIN((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
|
||||
max: MAX((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
- metric_table:
|
||||
id: 1504
|
||||
title: Vector L1 data-return path or Texture Data (TD)
|
||||
@@ -210,7 +146,7 @@ Panel Config:
|
||||
avg: AVG((TD_COALESCABLE_WAVEFRONT_sum / $denom))
|
||||
min: MIN((TD_COALESCABLE_WAVEFRONT_sum / $denom))
|
||||
max: MAX((TD_COALESCABLE_WAVEFRONT_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Read Instructions:
|
||||
avg: AVG((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
|
||||
/ $denom))
|
||||
@@ -218,14 +154,72 @@ Panel Config:
|
||||
/ $denom))
|
||||
max: MAX((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
|
||||
/ $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Write Instructions:
|
||||
avg: AVG((TD_STORE_WAVEFRONT_sum / $denom))
|
||||
min: MIN((TD_STORE_WAVEFRONT_sum / $denom))
|
||||
max: MAX((TD_STORE_WAVEFRONT_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Atomic Instructions:
|
||||
avg: AVG((TD_ATOMIC_WAVEFRONT_sum / $denom))
|
||||
min: MIN((TD_ATOMIC_WAVEFRONT_sum / $denom))
|
||||
max: MAX((TD_ATOMIC_WAVEFRONT_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
metrics_description:
Address Processing Unit Busy: Percent of the total CU cycles the address processor
was busy.
Address Stall: Percent of the total CU cycles the address processor was stalled
from sending address requests further into the vL1D pipeline.
Data Stall: Percent of the total CU cycles the address processor was stalled from
sending write/atomic data further into the vL1D pipeline.
"Data-Processor \u2192 Address Stall": Percent of total CU cycles the address
processor was stalled waiting to send command data to the data processor.
Total Instructions: The total number of memory instructions executed by the address
processor over all compute units on the accelerator, per normalization unit.
Global/Generic Instructions: The total number of global & generic memory instructions
executed on all compute units on the accelerator, per normalization unit.
Global/Generic Read Instructions: The total number of global & generic memory
read instructions executed on all compute units on the accelerator, per normalization
unit.
Global/Generic Write Instructions: The total number of global & generic memory
write instructions executed on all compute units on the accelerator, per normalization
unit.
Global/Generic Atomic Instructions: The total number of global & generic memory
atomic (with and without return) instructions executed on all compute units
on the accelerator, per normalization unit.
Spill/Stack Instructions: The total number of spill/stack memory instructions
executed on all compute units on the accelerator, per normalization unit.
Spill/Stack Read Instructions: The total number of spill/stack memory read instructions
executed on all compute units on the accelerator, per normalization unit.
Spill/Stack Write Instructions: The total number of spill/stack memory write instructions
executed on all compute units on the accelerator, per normalization unit.
Spill/Stack Atomic Instructions: The total number of spill/stack memory atomic
(with and without return) instructions executed on all compute units on the
accelerator, per normalization unit. Typically unused, since these memory operations
are generally used to implement thread-local storage.
Spill/Stack Total Cycles: The number of cycles the address processing unit spent
working on spill/stack instructions, per normalization unit.
Spill/Stack Coalesced Read: The number of cycles the address processing unit spent
working on coalesced spill/stack read instructions, per normalization unit.
Spill/Stack Coalesced Write: The number of cycles the address processing unit
spent working on coalesced spill/stack write instructions, per normalization
unit.
Data-Return Busy: Percent of the total CU cycles the data-return unit was busy
processing or waiting on data to return to the CU.
"Cache RAM \u2192 Data-Return Stall": Percent of the total CU cycles the data-return
unit was stalled on data to be returned from the vL1D Cache RAM.
Coalescable Instructions: The number of instructions submitted to the data-return
unit by the address processor that were found to be coalescable, per normalization
unit.
Read Instructions: The number of read instructions submitted to the data-return
unit by the address processor summed over all compute units on the accelerator,
per normalization unit. This is expected to be the sum of global/generic and
spill/stack reads in the address processor.
Write Instructions: The number of store instructions submitted to the data-return
unit by the address processor summed over all compute units on the accelerator,
per normalization unit. This is expected to be the sum of global/generic and
spill/stack stores in the address processor.
Atomic Instructions: The number of atomic instructions submitted to the data-return
unit by the address processor summed over all compute units on the accelerator,
per normalization unit. This is expected to be the sum of global/generic and
spill/stack atomics in the address processor.
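The `Read Instructions` entry in this panel is derived rather than read from a single counter: the expression subtracts store and atomic wavefronts from `TD_LOAD_WAVEFRONT_sum` before normalizing. Below is a minimal Python sketch of that arithmetic and of the `avg`/`min`/`max` columns. The helper names and sample values are hypothetical, and treating `AVG`/`MIN`/`MAX` as plain reductions over a list of per-sample values is an assumption about how the aggregates are applied.

```python
from statistics import mean
from typing import Dict, Sequence


def td_read_instructions(
    load: float, store: float, atomic: float, denom: float
) -> float:
    """((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum) / $denom."""
    return ((load - store) - atomic) / denom


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Assumed meaning of the avg/min/max columns: reductions over samples."""
    return {"avg": mean(values), "min": min(values), "max": max(values)}


# Hypothetical (load, store, atomic) counter samples, for illustration only.
samples = [(100, 30, 10), (90, 20, 0)]
per_sample = [td_read_instructions(l, s, a, denom=2) for l, s, a in samples]
summary = summarize(per_sample)
```

Per the description, the result should match the sum of global/generic and spill/stack reads counted in the address processor, which gives a simple cross-check between the TA and TD tables.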
+132
-132
@@ -2,117 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1600
|
||||
title: Vector L1 Data Cache
|
||||
metrics_description:
Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
cache over the total number of cache line requests to the vL1D Cache RAM.
Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
on the specific accelerator. The number of bytes is calculated as the number
of cache lines requested multiplied by the cache line size. This value does
not consider partial requests, so for instance, if only a single value is requested
in a cache line, the data movement will still be counted as a full cache line.
Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
The number of cycles where the vL1D Cache RAM is actively processing any request
divided by the number of cycles where the vL1D is active.
Coalescing: Indicates how well memory instructions were coalesced by the address
processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
as the average number of thread-requests generated per instruction divided by
the ideal number of thread-requests per instruction.
Stalled on L2 Data: The ratio of the number of cycles where the vL1D is stalled
waiting for requested data to return from the L2 cache divided by the number
of cycles where the vL1D is active.
Stalled on L2 Req: The ratio of the number of cycles where the vL1D is stalled
waiting to issue a request for data to the L2 cache divided by the number of
cycles where the vL1D is active.
Tag RAM Stall (Read): The ratio of the number of cycles where the vL1D is stalled
due to Read requests with conflicting tags being looked up concurrently, divided
by the number of cycles where the vL1D is active.
Tag RAM Stall (Write): The ratio of the number of cycles where the vL1D is stalled
due to Write requests with conflicting tags being looked up concurrently, divided
by the number of cycles where the vL1D is active.
Tag RAM Stall (Atomic): The ratio of the number of cycles where the vL1D is stalled
due to Atomic requests with conflicting tags being looked up concurrently, divided
by the number of cycles where the vL1D is active.
Total Req: The total number of incoming requests from the address processing unit
after coalescing.
Read Req: The total number of incoming read requests from the address processing
unit after coalescing, per normalization unit.
Write Req: The total number of incoming write requests from the address processing
unit after coalescing, per normalization unit.
Atomic Req: The total number of incoming atomic requests from the address processing
unit after coalescing, per normalization unit.
Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
instructions divided by total duration. The number of bytes is calculated as
the number of cache lines requested multiplied by the cache line size. This
value does not consider partial requests, so for instance, if only a single
value is requested in a cache line, the data movement will still be counted
as a full cache line.
Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
Cache Accesses: The total number of cache line lookups in the vL1D.
Cache Hits: The number of cache accesses minus the number of outgoing requests
to the L2 cache, that is, the number of cache line requests serviced by the
vL1D Cache RAM per normalization unit.
Invalidations: The number of times the vL1D was issued a write-back invalidate
command during the kernel's execution per normalization unit. This may be triggered
by, for instance, the buffer_wbinvl1 instruction.
L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
of VMEM instructions, divided by total duration. The number of bytes is calculated
as the number of cache lines requested multiplied by the cache line size. This
value does not consider partial requests, so for instance, if only a single
value is requested in a cache line, the data movement will still be counted
as a full cache line.
L1-L2 Read: The number of read requests for a vL1D cache line that were not satisfied
by the vL1D and must be retrieved from the L2 cache, per normalization
unit.
L1-L2 Write: The number of write requests to a vL1D cache line that were sent
through the vL1D to the L2 cache, per normalization unit.
L1-L2 Atomic: The number of atomic requests that are sent through the vL1D to
the L2 cache, per normalization unit. This includes requests for atomics with
and without return.
L1 Access Latency: Calculated as the average number of cycles that a vL1D cache
line request spent in the vL1D cache pipeline.
L1-L2 Read Latency: Calculated as the average number of cycles that the vL1D cache
took to issue and receive read requests from the L2 Cache. This number also
includes requests for atomics with return values.
L1-L2 Write Latency: Calculated as the average number of cycles that the vL1D
cache took to issue and receive acknowledgement of a write request to the L2
Cache. This number also includes requests for atomics without return values.
NC - Read: Total read requests with NC mtype from this TCP to all TCCs, summed over
TCP instances, per normalization unit.
UC - Read: Total read requests with UC mtype from this TCP to all TCCs, summed over
TCP instances, per normalization unit.
CC - Read: Total read requests with CC mtype from this TCP to all TCCs, summed over
TCP instances, per normalization unit.
RW - Read: Total read requests with RW mtype from this TCP to all TCCs, summed over
TCP instances, per normalization unit.
RW - Write: Total write requests with RW mtype from this TCP to all TCCs, summed over
TCP instances, per normalization unit.
NC - Write: Total write requests with NC mtype from this TCP to all TCCs, summed over
TCP instances, per normalization unit.
UC - Write: Total write requests with UC mtype from this TCP to all TCCs, summed over
TCP instances, per normalization unit.
CC - Write: Total write requests with CC mtype from this TCP to all TCCs, summed over
TCP instances, per normalization unit.
NC - Atomic: Total atomic requests with NC mtype from this TCP to all TCCs, summed
over TCP instances, per normalization unit.
UC - Atomic: Total atomic requests with UC mtype from this TCP to all TCCs, summed
over TCP instances, per normalization unit.
CC - Atomic: Total atomic requests with CC mtype from this TCP to all TCCs, summed
over TCP instances, per normalization unit.
RW - Atomic: Total atomic requests with RW mtype from this TCP to all TCCs, summed
over TCP instances, per normalization unit.
Req: The number of translation requests made to the UTCL1 per normalization unit.
Hit Ratio: The ratio of the number of translation requests that hit in the UTCL1
divided by the total number of translation requests made to the UTCL1.
Hits: The number of translation requests that hit in the UTCL1, and could be reused,
per normalization unit.
Translation Misses: The total number of translation requests that missed in the
UTCL1 due to translation not being present in the cache, per normalization
unit.
Permission Misses: "The total number of translation requests that missed in the\
\ UTCL1 due to a permission error, per normalization unit. This is unused and\
\ expected to be zero in most configurations for modern CDNA\u2122 accelerators."
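The `Cache BW` and `Cache Hits` descriptions above correspond to two expressions in this panel's metric tables. Here is a minimal Python sketch of that arithmetic, not the profiler's own evaluator. The function names are hypothetical. The factor `64` is taken from the `Cache BW` expression in this diff, and reading it as the cache line size in bytes is an assumption based on the description. The timestamp unit is not stated in this section, so the bandwidth unit is left unspecified.

```python
CACHE_LINE_FACTOR = 64  # multiplier from the Cache BW expression; assumed to be bytes per line


def vl1d_cache_bw(
    total_cache_accesses: float, start_timestamp: float, end_timestamp: float
) -> float:
    """(TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)."""
    return (total_cache_accesses * CACHE_LINE_FACTOR) / (end_timestamp - start_timestamp)


def vl1d_cache_hits(
    total_cache_accesses: float,
    l2_read: float,
    l2_write: float,
    l2_atomic_with_ret: float,
    l2_atomic_without_ret: float,
    denom: float,
) -> float:
    """Cache accesses minus all outgoing L2 requests, per normalization unit."""
    l2_requests = ((l2_read + l2_write) + l2_atomic_with_ret) + l2_atomic_without_ret
    return (total_cache_accesses - l2_requests) / denom


# Hypothetical counter values, for illustration only.
bw = vl1d_cache_bw(total_cache_accesses=1000, start_timestamp=0, end_timestamp=500)
hits = vl1d_cache_hits(1000, 100, 50, 10, 5, denom=1)
```

Because every access is counted as a full cache line, the bandwidth figure is an upper bound on the bytes actually consumed when requests touch only part of a line.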
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1601
|
||||
@@ -181,17 +70,17 @@ Panel Config:
|
||||
avg: AVG((TCP_TOTAL_ACCESSES_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_ACCESSES_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_ACCESSES_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read Req:
|
||||
avg: AVG((TCP_TOTAL_READ_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_READ_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_READ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Write Req:
|
||||
avg: AVG((TCP_TOTAL_WRITE_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_WRITE_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_WRITE_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Atomic Req:
|
||||
avg: AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
|
||||
/ $denom))
|
||||
@@ -199,7 +88,7 @@ Panel Config:
|
||||
/ $denom))
|
||||
max: MAX(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Cache BW:
|
||||
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
@@ -223,7 +112,7 @@ Panel Config:
|
||||
avg: AVG((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Cache Hits:
|
||||
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
@@ -234,7 +123,7 @@ Panel Config:
|
||||
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Invalidations:
|
||||
avg: AVG((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
|
||||
@@ -252,12 +141,12 @@ Panel Config:
|
||||
avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
L1-L2 Write:
|
||||
avg: AVG((TCP_TCC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
L1-L2 Atomic:
|
||||
avg: AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
/ $denom))
|
||||
@@ -265,7 +154,7 @@ Panel Config:
|
||||
/ $denom))
|
||||
max: MAX(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
L1 Access Latency:
|
||||
avg: AVG(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
|
||||
!= 0) else None))
|
||||
@@ -314,84 +203,84 @@ Panel Config:
|
||||
avg: AVG((TCP_TCC_NC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_NC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_NC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
UC - Read:
|
||||
xfer: Read
|
||||
coherency: UC
|
||||
avg: AVG((TCP_TCC_UC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_UC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_UC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
CC - Read:
|
||||
xfer: Read
|
||||
coherency: CC
|
||||
avg: AVG((TCP_TCC_CC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_CC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_CC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
RW - Read:
|
||||
xfer: Read
|
||||
coherency: RW
|
||||
avg: AVG((TCP_TCC_RW_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_RW_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_RW_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
RW - Write:
|
||||
xfer: Write
|
||||
coherency: RW
|
||||
avg: AVG((TCP_TCC_RW_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_RW_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_RW_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
NC - Write:
|
||||
xfer: Write
|
||||
coherency: NC
|
||||
avg: AVG((TCP_TCC_NC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_NC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_NC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
UC - Write:
|
||||
xfer: Write
|
||||
coherency: UC
|
||||
avg: AVG((TCP_TCC_UC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_UC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_UC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
CC - Write:
|
||||
xfer: Write
|
||||
coherency: CC
|
||||
avg: AVG((TCP_TCC_CC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_CC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_CC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
NC - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: NC
|
||||
avg: AVG((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
UC - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: UC
|
||||
avg: AVG((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
CC - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: CC
|
||||
avg: AVG((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
RW - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: RW
|
||||
avg: AVG((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1605
|
||||
title: L1 Unified Translation Cache (UTCL1)
|
||||
@@ -440,3 +329,114 @@ Panel Config:
|
||||
max: Max
|
||||
units: Unit
|
||||
metric: {}
|
||||
metrics_description:
|
||||
Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
|
||||
cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
|
||||
of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator. The number of bytes is calculated as the number
|
||||
of cache lines requested multiplied by the cache line size. This value does
|
||||
not consider partial requests, so for instance, if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
|
||||
The number of cycles where the vL1D Cache RAM is actively processing any request
|
||||
divided by the number of cycles where the vL1D is active.
|
||||
Coalescing: Indicates how well memory instructions were coalesced by the address
|
||||
processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
|
||||
as the average number of thread-requests generated per instruction divided by
|
||||
the ideal number of thread-requests per instruction.
|
||||
Stalled on L2 Data: The ratio of the number of cycles where the vL1D is stalled
|
||||
waiting for requested data to return from the L2 cache divided by the number
|
||||
of cycles where the vL1D is active.
|
||||
Stalled on L2 Req: The ratio of the number of cycles where the vL1D is stalled
|
||||
waiting to issue a request for data to the L2 cache divided by the number of
|
||||
cycles where the vL1D is active.
|
||||
Tag RAM Stall (Read): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Read requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Tag RAM Stall (Write): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Write requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Tag RAM Stall (Atomic): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Atomic requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Total Req: The total number of incoming requests from the address processing unit
|
||||
after coalescing.
|
||||
Read Req: The total number of incoming read requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Write Req: The total number of incoming write requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Atomic Req: The total number of incoming atomic requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
|
||||
instructions divided by total duration. The number of bytes is calculated as
|
||||
the number of cache lines requested multiplied by the cache line size. This
|
||||
value does not consider partial requests, so for instance, if only a single
|
||||
value is requested in a cache line, the data movement will still be counted
|
||||
as a full cache line.
|
||||
Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
|
||||
vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
Cache Accesses: The total number of cache line lookups in the vL1D.
|
||||
Cache Hits: The number of cache accesses minus the number of outgoing requests
|
||||
to the L2 cache, that is, the number of cache line requests serviced by the
|
||||
vL1D Cache RAM per normalization unit.
|
||||
Invalidations: The number of times the vL1D was issued a write-back invalidate
|
||||
command during the kernel's execution per normalization unit. This may be triggered
|
||||
by, for instance, the buffer_wbinvl1 instruction.
|
||||
L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
|
||||
of VMEM instructions, divided by total duration. The number of bytes is calculated
|
||||
as the number of cache lines requested multiplied by the cache line size. This
|
||||
value does not consider partial requests, so for instance, if only a single
|
||||
value is requested in a cache line, the data movement will still be counted
|
||||
as a full cache line.
|
||||
L1-L2 Read: The number of read requests for a vL1D cache line that were not satisfied
|
||||
by the vL1D and must be retrieved from the L2 cache, per normalization
|
||||
unit.
|
||||
L1-L2 Write: The number of write requests to a vL1D cache line that were sent
|
||||
through the vL1D to the L2 cache, per normalization unit.
|
||||
L1-L2 Atomic: The number of atomic requests that are sent through the vL1D to
|
||||
the L2 cache, per normalization unit. This includes requests for atomics with
|
||||
and without return.
|
||||
L1 Access Latency: Calculated as the average number of cycles that a vL1D cache
|
||||
line request spent in the vL1D cache pipeline.
|
||||
L1-L2 Read Latency: Calculated as the average number of cycles that the vL1D cache
|
||||
took to issue and receive read requests from the L2 Cache. This number also
|
||||
includes requests for atomics with return values.
|
||||
L1-L2 Write Latency: Calculated as the average number of cycles that the vL1D
|
||||
cache took to issue and receive acknowledgement of a write request to the L2
|
||||
Cache. This number also includes requests for atomics without return values.
|
||||
NC - Read: Total read requests with NC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
UC - Read: Total read requests with UC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
CC - Read: Total read requests with CC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
RW - Read: Total read requests with RW mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
RW - Write: Total write requests with RW mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
NC - Write: Total write requests with NC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
UC - Write: Total write requests with UC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
CC - Write: Total write requests with CC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
NC - Atomic: Total atomic requests with NC mtype from this TCP to all TCCs, summed
|
||||
over TCP instances per normalization unit.
|
||||
UC - Atomic: Total atomic requests with UC mtype from this TCP to all TCCs, summed
|
||||
over TCP instances per normalization unit.
|
||||
CC - Atomic: Total atomic requests with CC mtype from this TCP to all TCCs, summed
|
||||
over TCP instances per normalization unit.
|
||||
RW - Atomic: Total atomic requests with RW mtype from this TCP to all TCCs, summed
|
||||
over TCP instances per normalization unit.
|
||||
Req: The number of translation requests made to the UTCL1 per normalization unit.
|
||||
Hit Ratio: The ratio of the number of translation requests that hit in the UTCL1
|
||||
divided by the total number of translation requests made to the UTCL1.
|
||||
Hits: The number of translation requests that hit in the UTCL1, and could be reused,
|
||||
per normalization unit.
|
||||
Translation Misses: The total number of translation requests that missed in the
|
||||
UTCL1 due to translation not being present in the cache, per normalization unit.
|
||||
Permission Misses: |-
|
||||
The total number of translation requests that missed in the UTCL1 due
|
||||
to a permission error, per normalization unit. This is unused and expected
|
||||
to be zero in most configurations for modern CDNA™ accelerators.
|
||||
|
||||
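The vL1D "Cache Hits" and "Cache BW" expressions above can be read as plain arithmetic on the hardware counters. The following is a minimal sketch, not the profiler's own evaluator: the function names and sample values are hypothetical, the 64-byte cache line size is taken from the `* 64` factor in the expressions, and the timestamps are assumed to be in nanoseconds so that bytes per nanosecond equals GB/s.

```python
def vl1d_cache_hits(total_accesses, read_req, write_req,
                    atomic_ret_req, atomic_noret_req, denom):
    """Cache accesses minus outgoing L2 requests, per normalization unit.

    Mirrors the "Cache Hits" expression: every access that did not turn
    into a read, write, or atomic request to L2 is counted as a hit.
    """
    outgoing_to_l2 = read_req + write_req + atomic_ret_req + atomic_noret_req
    return (total_accesses - outgoing_to_l2) / denom


def vl1d_cache_bw(total_accesses, start_ts, end_ts, line_bytes=64):
    """Bytes looked up divided by duration (assumed ns, so the result is GB/s).

    Partial requests are not considered: each access counts as a full line.
    """
    return (total_accesses * line_bytes) / (end_ts - start_ts)


# Hypothetical counter values for one kernel dispatch.
hits = vl1d_cache_hits(1000, 100, 50, 10, 40, denom=1)
bandwidth = vl1d_cache_bw(1000, start_ts=0, end_ts=64000)
print(hits, bandwidth)
```

Neither helper guards against a zero denominator or a zero duration; the YAML expressions for these two metrics do not guard against them either.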
+344
-394
@@ -2,6 +2,350 @@
|
||||
Panel Config:
|
||||
id: 1700
|
||||
title: L2 Cache
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1701
|
||||
title: L2 Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
metric:
|
||||
Utilization:
|
||||
value: AVG(((TCC_BUSY_sum * 100) / (TO_INT($total_l2_chan) * $GRBM_GUI_ACTIVE_PER_XCD)))
|
||||
unit: pct
|
||||
Peak Bandwidth:
|
||||
value: ((100 * AVG(((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk / 1000) * 64) * TO_INT($total_l2_chan)))
|
||||
unit: pct
|
||||
Hit Rate:
|
||||
value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else 0))
|
||||
unit: pct
|
||||
L2-Fabric Read BW:
|
||||
value: AVG((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
L2-Fabric Write and Atomic BW:
|
||||
value: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
|
||||
* 32)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
HBM Bandwidth:
|
||||
value: $hbmBandwidth
|
||||
unit: GB/s
|
||||
- metric_table:
|
||||
id: 1702
|
||||
title: L2-Fabric interface metrics
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Read BW:
|
||||
avg: AVG((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
|
||||
* 64)) / $denom))
|
||||
min: MIN((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
|
||||
* 64)) / $denom))
|
||||
max: MAX((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
|
||||
* 64)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
HBM Read Traffic:
|
||||
avg: AVG((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Remote Read Traffic:
|
||||
avg: AVG((100 * ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum) / TCC_EA0_RDREQ_sum)
|
||||
if (TCC_EA0_RDREQ_sum != 0) else None))
|
||||
min: MIN((100 * ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum) / TCC_EA0_RDREQ_sum)
|
||||
if (TCC_EA0_RDREQ_sum != 0) else None))
|
||||
max: MAX((100 * ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum) / TCC_EA0_RDREQ_sum)
|
||||
if (TCC_EA0_RDREQ_sum != 0) else None))
|
||||
unit: pct
|
||||
Uncached Read Traffic:
|
||||
avg: AVG((100 * (TCC_EA0_RD_UNCACHED_32B_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA0_RD_UNCACHED_32B_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA0_RD_UNCACHED_32B_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Write and Atomic BW:
|
||||
avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
HBM Write and Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Remote Write and Atomic Traffic:
|
||||
avg: AVG((100 * ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum) / TCC_EA0_WRREQ_sum)
|
||||
if (TCC_EA0_WRREQ_sum != 0) else None))
|
||||
min: MIN((100 * ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum) / TCC_EA0_WRREQ_sum)
|
||||
if (TCC_EA0_WRREQ_sum != 0) else None))
|
||||
max: MAX((100 * ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum) / TCC_EA0_WRREQ_sum)
|
||||
if (TCC_EA0_WRREQ_sum != 0) else None))
|
||||
unit: pct
|
||||
Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA0_ATOMIC_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA0_ATOMIC_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA0_ATOMIC_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Uncached Write and Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA0_WR_UNCACHED_32B_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA0_WR_UNCACHED_32B_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA0_WR_UNCACHED_32B_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Read Latency:
|
||||
avg: AVG(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
Write and Atomic Latency:
|
||||
avg: AVG(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
Atomic Latency:
|
||||
avg: AVG(((TCC_EA0_ATOMIC_LEVEL_sum / TCC_EA0_ATOMIC_sum) if (TCC_EA0_ATOMIC_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((TCC_EA0_ATOMIC_LEVEL_sum / TCC_EA0_ATOMIC_sum) if (TCC_EA0_ATOMIC_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((TCC_EA0_ATOMIC_LEVEL_sum / TCC_EA0_ATOMIC_sum) if (TCC_EA0_ATOMIC_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
- metric_table:
|
||||
id: 1703
|
||||
title: L2 Cache Accesses
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Bandwidth:
|
||||
avg: AVG((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))
|
||||
min: MIN((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))
|
||||
max: MAX((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
Req:
|
||||
avg: AVG((TCC_REQ_sum / $denom))
|
||||
min: MIN((TCC_REQ_sum / $denom))
|
||||
max: MAX((TCC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read Req:
|
||||
avg: AVG((TCC_READ_sum / $denom))
|
||||
min: MIN((TCC_READ_sum / $denom))
|
||||
max: MAX((TCC_READ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Write Req:
|
||||
avg: AVG((TCC_WRITE_sum / $denom))
|
||||
min: MIN((TCC_WRITE_sum / $denom))
|
||||
max: MAX((TCC_WRITE_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Atomic Req:
|
||||
avg: AVG((TCC_ATOMIC_sum / $denom))
|
||||
min: MIN((TCC_ATOMIC_sum / $denom))
|
||||
max: MAX((TCC_ATOMIC_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Streaming Req:
|
||||
avg: AVG((TCC_STREAMING_REQ_sum / $denom))
|
||||
min: MIN((TCC_STREAMING_REQ_sum / $denom))
|
||||
max: MAX((TCC_STREAMING_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Probe Req:
|
||||
avg: AVG((TCC_PROBE_sum / $denom))
|
||||
min: MIN((TCC_PROBE_sum / $denom))
|
||||
max: MAX((TCC_PROBE_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Cache Hit:
|
||||
avg: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
min: MIN((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
max: MAX((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
unit: pct
|
||||
Hits:
|
||||
avg: AVG((TCC_HIT_sum / $denom))
|
||||
min: MIN((TCC_HIT_sum / $denom))
|
||||
max: MAX((TCC_HIT_sum / $denom))
|
||||
unit: (Hits + $normUnit)
|
||||
Misses:
|
||||
avg: AVG((TCC_MISS_sum / $denom))
|
||||
min: MIN((TCC_MISS_sum / $denom))
|
||||
max: MAX((TCC_MISS_sum / $denom))
|
||||
unit: (Misses + $normUnit)
|
||||
Writeback:
|
||||
avg: AVG((TCC_WRITEBACK_sum / $denom))
|
||||
min: MIN((TCC_WRITEBACK_sum / $denom))
|
||||
max: MAX((TCC_WRITEBACK_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
Writeback (Internal):
|
||||
avg: AVG((TCC_NORMAL_WRITEBACK_sum / $denom))
|
||||
min: MIN((TCC_NORMAL_WRITEBACK_sum / $denom))
|
||||
max: MAX((TCC_NORMAL_WRITEBACK_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
Writeback (vL1D Req):
|
||||
avg: AVG((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
|
||||
min: MIN((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
|
||||
max: MAX((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
Evict (Internal):
|
||||
avg: AVG((TCC_NORMAL_EVICT_sum / $denom))
|
||||
min: MIN((TCC_NORMAL_EVICT_sum / $denom))
|
||||
max: MAX((TCC_NORMAL_EVICT_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
Evict (vL1D Req):
|
||||
avg: AVG((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
|
||||
min: MIN((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
|
||||
max: MAX((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
NC Req:
|
||||
avg: AVG((TCC_NC_REQ_sum / $denom))
|
||||
min: MIN((TCC_NC_REQ_sum / $denom))
|
||||
max: MAX((TCC_NC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
UC Req:
|
||||
avg: AVG((TCC_UC_REQ_sum / $denom))
|
||||
min: MIN((TCC_UC_REQ_sum / $denom))
|
||||
max: MAX((TCC_UC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
CC Req:
|
||||
avg: AVG((TCC_CC_REQ_sum / $denom))
|
||||
min: MIN((TCC_CC_REQ_sum / $denom))
|
||||
max: MAX((TCC_CC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
RW Req:
|
||||
avg: AVG((TCC_RW_REQ_sum / $denom))
|
||||
min: MIN((TCC_RW_REQ_sum / $denom))
|
||||
max: MAX((TCC_RW_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1704
|
||||
title: L2 Cache Stalls
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric: {}
|
||||
- metric_table:
|
||||
id: 1705
|
||||
title: L2 - Fabric Interface stalls
|
||||
header:
|
||||
metric: Metric
|
||||
type: Type
|
||||
transaction: Transaction
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
style:
|
||||
type: simple_multi_bar
|
||||
metric:
|
||||
Write - Credit Starvation:
|
||||
type: Credit Starvation
|
||||
transaction: Write
|
||||
avg: AVG(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
- metric_table:
|
||||
id: 1706
|
||||
title: L2 - Fabric interface detailed metrics
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Read (32B):
|
||||
avg: AVG((TCC_EA0_RDREQ_32B_sum / $denom))
|
||||
min: MIN((TCC_EA0_RDREQ_32B_sum / $denom))
|
||||
max: MAX((TCC_EA0_RDREQ_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read (64B):
|
||||
avg: AVG(((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) / $denom))
|
||||
min: MIN(((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) / $denom))
|
||||
max: MAX(((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read (Uncached):
|
||||
avg: AVG((TCC_EA0_RD_UNCACHED_32B_sum / $denom))
|
||||
min: MIN((TCC_EA0_RD_UNCACHED_32B_sum / $denom))
|
||||
max: MAX((TCC_EA0_RD_UNCACHED_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
HBM Read:
|
||||
avg: AVG((TCC_EA0_RDREQ_DRAM_sum / $denom))
|
||||
min: MIN((TCC_EA0_RDREQ_DRAM_sum / $denom))
|
||||
max: MAX((TCC_EA0_RDREQ_DRAM_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Remote Read:
|
||||
avg: AVG((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
|
||||
min: MIN((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
|
||||
max: MAX((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Write and Atomic (32B):
|
||||
avg: AVG(MAX(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom), 0))
|
||||
min: MIN(MAX(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom), 0))
|
||||
max: MAX(MAX(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom), 0))
|
||||
unit: (Req + $normUnit)
|
||||
Write and Atomic (Uncached):
|
||||
avg: AVG((TCC_EA0_WR_UNCACHED_32B_sum / $denom))
|
||||
min: MIN((TCC_EA0_WR_UNCACHED_32B_sum / $denom))
|
||||
max: MAX((TCC_EA0_WR_UNCACHED_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Write and Atomic (64B):
|
||||
avg: AVG((TCC_EA0_WRREQ_64B_sum / $denom))
|
||||
min: MIN((TCC_EA0_WRREQ_64B_sum / $denom))
|
||||
max: MAX((TCC_EA0_WRREQ_64B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
HBM Write and Atomic:
|
||||
avg: AVG((TCC_EA0_WRREQ_DRAM_sum / $denom))
|
||||
min: MIN((TCC_EA0_WRREQ_DRAM_sum / $denom))
|
||||
max: MAX((TCC_EA0_WRREQ_DRAM_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Remote Write and Atomic:
|
||||
avg: AVG((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
|
||||
min: MIN((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
|
||||
max: MAX((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Atomic:
|
||||
avg: AVG((TCC_EA0_ATOMIC_sum / $denom))
|
||||
min: MIN((TCC_EA0_ATOMIC_sum / $denom))
|
||||
max: MAX((TCC_EA0_ATOMIC_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
metrics_description:
|
||||
Utilization: The ratio of the number of cycles an L2 channel was active, summed
|
||||
over all L2 channels on the accelerator over the total L2 cycles.
|
||||
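Two of the L2 expressions in tables 1701 and 1702 above are easy to misread: the hit rate guards against a zero denominator, and the fabric read volume splits requests into 32B and 64B sizes. A minimal sketch of both follows; the function names, the `empty` parameter, and the sample values are illustrative assumptions and are not part of the config files.

```python
def l2_hit_rate_pct(hits, misses, empty=None):
    """Percentage of L2 lookups that hit.

    `empty` is returned when there were no lookups at all. The
    Speed-of-Light table uses 0 for that case, while the L2 Cache
    Accesses table uses None.
    """
    total = hits + misses
    return (100 * hits) / total if total != 0 else empty


def l2_fabric_read_bytes(rdreq, rdreq_32b):
    """Bytes read over the L2-Fabric interface.

    Requests counted in rdreq_32b move 32 bytes each; every other read
    request is treated as a 64-byte request.
    """
    return rdreq_32b * 32 + (rdreq - rdreq_32b) * 64


# Hypothetical counter values.
print(l2_hit_rate_pct(75, 25))       # some lookups, 75 of 100 hit
print(l2_hit_rate_pct(0, 0))         # no lookups at all
print(l2_fabric_read_bytes(10, 4))   # 4 requests of 32B, 6 of 64B
```

Dividing the byte count by `$denom` gives the per-normalization-unit value in table 1702, and dividing it by the kernel duration gives the bandwidth in table 1701.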
@@ -87,12 +431,6 @@ Panel Config:
|
||||
by the cache line size. This value does not consider partial requests, so for
|
||||
example, if only a single value is requested in a cache line, the data movement
|
||||
will still be counted as a full cache line.
|
||||
Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests,
|
||||
divided by total duration.
|
||||
Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests,
|
||||
divided by total duration.
|
||||
Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests,
|
||||
divided by total duration.
|
||||
Req: The total number of incoming requests to the L2 from all clients for all
|
||||
request types, per normalization unit.
|
||||
Read Req: The total number of read requests to the L2 from all clients.
|
||||
@@ -149,12 +487,6 @@ Panel Config:
|
||||
Remote Read: The total number of L2 requests to Infinity Fabric to read 32B or
|
||||
64B of data from any source other than the accelerator's local HBM, per normalization
|
||||
unit.
|
||||
Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe
|
||||
traffic, divided by total duration.
|
||||
"Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read
|
||||
requests due to Infinity Fabric traffic, divided by total duration.
|
||||
Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM
|
||||
traffic, divided by total duration.
|
||||
Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
|
||||
write or atomically update 32B of data to any memory location, per normalization
|
||||
unit.
|
||||
@@ -170,391 +502,9 @@ Panel Config:
|
||||
Remote Write and Atomic: The total number of L2 requests to Infinity Fabric to
|
||||
write or atomically update 32B or 64B of data in any memory location other than
|
||||
the accelerator's local HBM, per normalization unit.
|
||||
Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to
|
||||
PCIe traffic, divided by total duration.
|
||||
"Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write
|
||||
requests due to Infinity Fabric traffic, divided by total duration.
|
||||
Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM
|
||||
traffic, divided by total duration.
|
||||
Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to
|
||||
PCIe traffic, divided by total duration.
|
||||
"Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic
|
||||
requests due to Infinity Fabric traffic, divided by total duration.
|
||||
Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
|
||||
HBM traffic, divided by total duration.
|
||||
Atomic: The total number of L2 requests to Infinity Fabric to atomically update
|
||||
32B or 64B of data in any memory location, per normalization unit. See Request
|
||||
flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
|
||||
requests are only considered atomic by Infinity Fabric if they are targeted
|
||||
at non-write-cacheable memory, such as fine-grained memory allocations or uncached
|
||||
memory allocations on the MI2XX.
|
||||
Read Stall: "The ratio of the total number of cycles the L2-Fabric interface was\
|
||||
\ stalled on a read request to any destination (local HBM, remote PCIe\xAE connected\
|
||||
\ accelerator or CPU, or remote Infinity Fabric connected accelerator or CPU)\
|
||||
\ over the total active L2 cycles."
|
||||
Write Stall: The ratio of the total number of cycles the L2-Fabric interface was
|
||||
stalled on a write or atomic request to any destination (local HBM, remote PCIe
|
||||
connected accelerator or CPU, or remote Infinity Fabric connected
|
||||
accelerator or CPU) over the total active L2 cycles.
|
||||
Read - PCIe Stall: The number of cycles the L2-Fabric interface was stalled on
|
||||
read requests to remote PCIe connected accelerators or CPUs as a percent of
|
||||
the total active L2 cycles.
|
||||
Read - Infinity Fabric Stall: The number of cycles the L2-Fabric interface was
|
||||
stalled on read requests to remote Infinity Fabric connected accelerators or
|
||||
CPUs as a percent of the total active L2 cycles.
|
||||
Read - HBM Stall: The number of cycles the L2-Fabric interface was stalled on
|
||||
read requests to the accelerator's local HBM as a percent of the total active
|
||||
L2 cycles.
|
||||
Write - PCIe Stall: The number of cycles the L2-Fabric interface was stalled on
|
||||
write or atomic requests to remote PCIe connected accelerators or CPUs as a
|
||||
percent of the total active L2 cycles.
|
||||
Write - Infinity Fabric Stall: The number of cycles the L2-Fabric interface was
|
||||
stalled on write or atomic requests to remote Infinity Fabric connected accelerators
|
||||
or CPUs as a percent of the total active L2 cycles.
|
||||
Write - HBM Stall: The number of cycles the L2-Fabric interface was stalled on
|
||||
write or atomic requests to the accelerator's local HBM as a percent of the total
|
||||
active L2 cycles.
|
||||
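The stall metrics described above are all "cycles stalled as a percent of active L2 cycles", and the remote request counts in table 1706 are clamped at zero. A short sketch of those two patterns, using hypothetical helper names and sample values rather than anything defined by the profiler:

```python
def stall_pct(stall_cycles, busy_cycles):
    """Stalled cycles as a percent of active L2 cycles.

    Returns None when the L2 was never busy, matching the
    `if (TCC_BUSY_sum != 0) else None` guard in the expressions.
    """
    if busy_cycles == 0:
        return None
    return 100 * (stall_cycles / busy_cycles)


def remote_requests(total_req, dram_req):
    """Requests that did not target local HBM, never below zero.

    Mirrors MAX((total - dram), 0): the clamp keeps the result from
    going negative if the DRAM counter exceeds the total.
    """
    return max(total_req - dram_req, 0)


# Hypothetical counter values.
print(stall_pct(25, 100))
print(stall_pct(5, 0))
print(remote_requests(12, 10))
print(remote_requests(10, 12))
```

Returning `None` instead of 0 for an idle L2 keeps "no data" distinguishable from "no stalls" in the resulting table.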
data source:
|
||||
- metric_table:
|
||||
id: 1701
|
||||
title: L2 Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
metric:
|
||||
Utilization:
|
||||
value: AVG(((TCC_BUSY_sum * 100) / (TO_INT($total_l2_chan) * $GRBM_GUI_ACTIVE_PER_XCD)))
|
||||
unit: pct
|
||||
Peak Bandwidth:
|
||||
value: ((100 * AVG(((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk / 1000) * 64) * TO_INT($total_l2_chan)))
|
||||
unit: pct
|
||||
Hit Rate:
|
||||
value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else 0))
|
||||
unit: pct
|
||||
L2-Fabric Read BW:
|
||||
value: AVG((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
L2-Fabric Write and Atomic BW:
|
||||
value: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
|
||||
* 32)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
HBM Bandwidth:
|
||||
value: $hbmBandwidth
|
||||
unit: GB/s
|
||||
- metric_table:
|
||||
id: 1702
|
||||
title: L2-Fabric interface metrics
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Read BW:
|
||||
avg: AVG((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
|
||||
* 64)) / $denom))
|
||||
min: MIN((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
|
||||
* 64)) / $denom))
|
||||
max: MAX((((TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum)
|
||||
* 64)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
HBM Read Traffic:
|
||||
avg: AVG((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA0_RDREQ_DRAM_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Remote Read Traffic:
|
||||
avg: AVG((100 * ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum) / TCC_EA0_RDREQ_sum)
|
||||
if (TCC_EA0_RDREQ_sum != 0) else None))
|
||||
min: MIN((100 * ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum) / TCC_EA0_RDREQ_sum)
|
||||
if (TCC_EA0_RDREQ_sum != 0) else None))
|
||||
max: MAX((100 * ((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum) / TCC_EA0_RDREQ_sum)
|
||||
if (TCC_EA0_RDREQ_sum != 0) else None))
|
||||
unit: pct
|
||||
Uncached Read Traffic:
|
||||
avg: AVG((100 * (TCC_EA0_RD_UNCACHED_32B_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA0_RD_UNCACHED_32B_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA0_RD_UNCACHED_32B_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Write and Atomic BW:
|
||||
avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
HBM Write and Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA0_WRREQ_DRAM_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Remote Write and Atomic Traffic:
|
||||
avg: AVG((100 * ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum) / TCC_EA0_WRREQ_sum)
|
||||
if (TCC_EA0_WRREQ_sum != 0) else None))
|
||||
min: MIN((100 * ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum) / TCC_EA0_WRREQ_sum)
|
||||
if (TCC_EA0_WRREQ_sum != 0) else None))
|
||||
max: MAX((100 * ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum) / TCC_EA0_WRREQ_sum)
|
||||
if (TCC_EA0_WRREQ_sum != 0) else None))
|
||||
unit: pct
|
||||
Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA0_ATOMIC_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA0_ATOMIC_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA0_ATOMIC_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Uncached Write and Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA0_WR_UNCACHED_32B_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA0_WR_UNCACHED_32B_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA0_WR_UNCACHED_32B_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Read Latency:
|
||||
avg: AVG(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
Write and Atomic Latency:
|
||||
avg: AVG(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
Atomic Latency:
|
||||
avg: AVG(((TCC_EA0_ATOMIC_LEVEL_sum / TCC_EA0_ATOMIC_sum) if (TCC_EA0_ATOMIC_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((TCC_EA0_ATOMIC_LEVEL_sum / TCC_EA0_ATOMIC_sum) if (TCC_EA0_ATOMIC_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((TCC_EA0_ATOMIC_LEVEL_sum / TCC_EA0_ATOMIC_sum) if (TCC_EA0_ATOMIC_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
- metric_table:
|
||||
id: 1703
|
||||
title: L2 Cache Accesses
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Bandwidth:
|
||||
avg: AVG((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))
|
||||
min: MIN((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))
|
||||
max: MAX((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
Req:
|
||||
avg: AVG((TCC_REQ_sum / $denom))
|
||||
min: MIN((TCC_REQ_sum / $denom))
|
||||
max: MAX((TCC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read Req:
|
||||
avg: AVG((TCC_READ_sum / $denom))
|
||||
min: MIN((TCC_READ_sum / $denom))
|
||||
max: MAX((TCC_READ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Write Req:
|
||||
avg: AVG((TCC_WRITE_sum / $denom))
|
||||
min: MIN((TCC_WRITE_sum / $denom))
|
||||
max: MAX((TCC_WRITE_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Atomic Req:
|
||||
avg: AVG((TCC_ATOMIC_sum / $denom))
|
||||
min: MIN((TCC_ATOMIC_sum / $denom))
|
||||
max: MAX((TCC_ATOMIC_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Streaming Req:
|
||||
avg: AVG((TCC_STREAMING_REQ_sum / $denom))
|
||||
min: MIN((TCC_STREAMING_REQ_sum / $denom))
|
||||
max: MAX((TCC_STREAMING_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Probe Req:
|
||||
avg: AVG((TCC_PROBE_sum / $denom))
|
||||
min: MIN((TCC_PROBE_sum / $denom))
|
||||
max: MAX((TCC_PROBE_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Cache Hit:
|
||||
avg: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
min: MIN((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
max: MAX((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
unit: pct
|
||||
Hits:
|
||||
avg: AVG((TCC_HIT_sum / $denom))
|
||||
min: MIN((TCC_HIT_sum / $denom))
|
||||
max: MAX((TCC_HIT_sum / $denom))
|
||||
unit: (Hits + $normUnit)
|
||||
Misses:
|
||||
avg: AVG((TCC_MISS_sum / $denom))
|
||||
min: MIN((TCC_MISS_sum / $denom))
|
||||
max: MAX((TCC_MISS_sum / $denom))
|
||||
unit: (Misses + $normUnit)
|
||||
Writeback:
|
||||
avg: AVG((TCC_WRITEBACK_sum / $denom))
|
||||
min: MIN((TCC_WRITEBACK_sum / $denom))
|
||||
max: MAX((TCC_WRITEBACK_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
Writeback (Internal):
|
||||
avg: AVG((TCC_NORMAL_WRITEBACK_sum / $denom))
|
||||
min: MIN((TCC_NORMAL_WRITEBACK_sum / $denom))
|
||||
max: MAX((TCC_NORMAL_WRITEBACK_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
Writeback (vL1D Req):
|
||||
avg: AVG((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
|
||||
min: MIN((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
|
||||
max: MAX((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
Evict (Internal):
|
||||
avg: AVG((TCC_NORMAL_EVICT_sum / $denom))
|
||||
min: MIN((TCC_NORMAL_EVICT_sum / $denom))
|
||||
max: MAX((TCC_NORMAL_EVICT_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
Evict (vL1D Req):
|
||||
avg: AVG((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
|
||||
min: MIN((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
|
||||
max: MAX((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
NC Req:
|
||||
avg: AVG((TCC_NC_REQ_sum / $denom))
|
||||
min: MIN((TCC_NC_REQ_sum / $denom))
|
||||
max: MAX((TCC_NC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
UC Req:
|
||||
avg: AVG((TCC_UC_REQ_sum / $denom))
|
||||
min: MIN((TCC_UC_REQ_sum / $denom))
|
||||
max: MAX((TCC_UC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
CC Req:
|
||||
avg: AVG((TCC_CC_REQ_sum / $denom))
|
||||
min: MIN((TCC_CC_REQ_sum / $denom))
|
||||
max: MAX((TCC_CC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
RW Req:
|
||||
avg: AVG((TCC_RW_REQ_sum / $denom))
|
||||
min: MIN((TCC_RW_REQ_sum / $denom))
|
||||
max: MAX((TCC_RW_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1704
|
||||
title: L2 Cache Stalls
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric: {}
|
||||
- metric_table:
|
||||
id: 1705
|
||||
title: L2 - Fabric Interface stalls
|
||||
header:
|
||||
metric: Metric
|
||||
type: Type
|
||||
transaction: Transaction
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
style:
|
||||
type: simple_multi_bar
|
||||
metric:
|
||||
Write - Credit Starvation:
|
||||
type: Credit Starvation
|
||||
transaction: Write
|
||||
avg: AVG(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
- metric_table:
|
||||
id: 1706
|
||||
title: L2 - Fabric interface detailed metrics
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Read (32B):
|
||||
avg: AVG((TCC_EA0_RDREQ_32B_sum / $denom))
|
||||
min: MIN((TCC_EA0_RDREQ_32B_sum / $denom))
|
||||
max: MAX((TCC_EA0_RDREQ_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read (64B):
|
||||
avg: AVG(((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) / $denom))
|
||||
min: MIN(((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) / $denom))
|
||||
max: MAX(((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read (Uncached):
|
||||
avg: AVG((TCC_EA0_RD_UNCACHED_32B_sum / $denom))
|
||||
min: MIN((TCC_EA0_RD_UNCACHED_32B_sum / $denom))
|
||||
max: MAX((TCC_EA0_RD_UNCACHED_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
HBM Read:
|
||||
avg: AVG((TCC_EA0_RDREQ_DRAM_sum / $denom))
|
||||
min: MIN((TCC_EA0_RDREQ_DRAM_sum / $denom))
|
||||
max: MAX((TCC_EA0_RDREQ_DRAM_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Remote Read:
|
||||
avg: AVG((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
|
||||
min: MIN((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
|
||||
max: MAX((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Write and Atomic (32B):
|
||||
avg: AVG(MAX(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom), 0))
|
||||
min: MIN(MAX(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom), 0))
|
||||
max: MAX(MAX(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom), 0))
|
||||
unit: (Req + $normUnit)
|
||||
Write and Atomic (Uncached):
|
||||
avg: AVG((TCC_EA0_WR_UNCACHED_32B_sum / $denom))
|
||||
min: MIN((TCC_EA0_WR_UNCACHED_32B_sum / $denom))
|
||||
max: MAX((TCC_EA0_WR_UNCACHED_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Write and Atomic (64B):
|
||||
avg: AVG((TCC_EA0_WRREQ_64B_sum / $denom))
|
||||
min: MIN((TCC_EA0_WRREQ_64B_sum / $denom))
|
||||
max: MAX((TCC_EA0_WRREQ_64B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
HBM Write and Atomic:
|
||||
avg: AVG((TCC_EA0_WRREQ_DRAM_sum / $denom))
|
||||
min: MIN((TCC_EA0_WRREQ_DRAM_sum / $denom))
|
||||
max: MAX((TCC_EA0_WRREQ_DRAM_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Remote Write and Atomic:
|
||||
avg: AVG((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
|
||||
min: MIN((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
|
||||
max: MAX((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Atomic:
|
||||
avg: AVG((TCC_EA0_ATOMIC_sum / $denom))
|
||||
min: MIN((TCC_EA0_ATOMIC_sum / $denom))
|
||||
max: MAX((TCC_EA0_ATOMIC_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
|
||||
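The L2-Fabric bandwidth and guarded-percentage expressions in the panel above (tables 1701 to 1706) follow two recurring patterns. The sketch below restates them in plain Python to make the arithmetic explicit. It is a minimal illustration, not the profiler's implementation: the function names are hypothetical, and reading bytes per timestamp tick as GB/s assumes the timestamps are in nanoseconds, which the diff does not state.

```python
from typing import Optional


def fabric_read_bytes(rdreq_total: int, rdreq_32b: int) -> int:
    """Mirror of the Read BW numerator: 32B requests move 32 bytes,
    every other read request is counted as 64 bytes."""
    return rdreq_32b * 32 + (rdreq_total - rdreq_32b) * 64


def fabric_write_bytes(wrreq_total: int, wrreq_64b: int) -> int:
    """Mirror of the Write and Atomic BW numerator: 64B requests move
    64 bytes, every other write/atomic request is counted as 32 bytes."""
    return wrreq_64b * 64 + (wrreq_total - wrreq_64b) * 32


def pct_or_none(numerator: float, denominator: float) -> Optional[float]:
    """Mirror of the `100 * a / b if b != 0 else None` guard used by the
    traffic and stall metrics (table 1701 uses 0 instead of None)."""
    return 100 * numerator / denominator if denominator != 0 else None


def bandwidth(total_bytes: int, start_ts: int, end_ts: int) -> float:
    """Bytes per timestamp tick; equals GB/s only if ticks are nanoseconds
    (an assumption made for this sketch)."""
    return total_bytes / (end_ts - start_ts)
```

For example, 10 read requests of which 4 are 32B yield `fabric_read_bytes(10, 4) == 512` bytes, and `pct_or_none(1, 0)` returns `None` rather than raising on a zero denominator.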
+4
-4
@@ -2,10 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1800
|
||||
title: L2 Cache (per Channel)
|
||||
metrics_description:
|
||||
L2 Cache Hit Rate: The percent of total number of requests to the L2 from all
|
||||
clients that hit in the cache. As noted in the Speed-of-Light section, this
|
||||
includes hit-on-miss requests.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1801
|
||||
@@ -321,3 +317,7 @@ Panel Config:
|
||||
::_1: $total_l2_chan
|
||||
cli_style: simple_box
|
||||
tui_style: simple_box
|
||||
metrics_description:
|
||||
L2 Cache Hit Rate: The percent of total number of requests to the L2 from all
|
||||
clients that hit in the cache. As noted in the Speed-of-Light section, this
|
||||
includes hit-on-miss requests.
|
||||
|
||||
+1
-1
@@ -2,10 +2,10 @@
|
||||
Panel Config:
|
||||
id: 2100
|
||||
title: PC Sampling
|
||||
metrics_description: {}
|
||||
data source:
|
||||
- pc_sampling_table:
|
||||
id: 2101
|
||||
title: PC Sampling
|
||||
source: ps_file
|
||||
comparable: false
|
||||
metrics_description: {}
|
||||
|
||||
+1128
File diff suppressed because it is too large
+1
-1
@@ -2,7 +2,6 @@
|
||||
Panel Config:
|
||||
id: 0
|
||||
title: Top Stats
|
||||
metrics_description: {}
|
||||
data source:
|
||||
- raw_csv_table:
|
||||
id: 1
|
||||
@@ -12,3 +11,4 @@ Panel Config:
|
||||
id: 2
|
||||
title: Dispatch List
|
||||
source: pmc_dispatch_info.csv
|
||||
metrics_description: {}
|
||||
|
||||
+1
-1
@@ -2,10 +2,10 @@
|
||||
Panel Config:
|
||||
id: 100
|
||||
title: System Info
|
||||
metrics_description: {}
|
||||
data source:
|
||||
- raw_csv_table:
|
||||
id: 101
|
||||
title: System Info
|
||||
source: sysinfo.csv
|
||||
columnwise: true
|
||||
metrics_description: {}
|
||||
|
||||
+122
-118
@@ -2,124 +2,6 @@
|
||||
Panel Config:
|
||||
id: 200
|
||||
title: System Speed-of-Light
|
||||
metrics_description:
|
||||
VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
|
||||
This is also presented as a percent of the peak theoretical FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any floating-point
|
||||
operations from MFMA instructions.'
|
||||
VALU IOPs: 'The total integer operations executed per second on the VALU. This
|
||||
is also presented as a percent of the peak theoretical IOPs achievable on the
|
||||
specific accelerator. Note: this does not include any integer operations from
|
||||
MFMA instructions.'
|
||||
MFMA FLOPs (F8): The total number of 8-bit brain floating point MFMA operations
|
||||
executed per second. This does not include any 16-bit brain floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F8 MFMA operations achievable on the specific accelerator. It is supported on
|
||||
AMD Instinct MI300 series and later only.
|
||||
MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
|
||||
executed per second. Note: this does not include any 16-bit brain floating point
|
||||
operations from VALU instructions. This is also presented as a percent of the
|
||||
peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F16 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 32-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F32 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 64-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F64 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
|
||||
per second. Note: this does not include any 8-bit integer operations from VALU
|
||||
instructions. This is also presented as a percent of the peak theoretical INT8
|
||||
MFMA operations achievable on the specific accelerator.'
|
||||
Active CUs: Total number of active compute units (CUs) on the accelerator during
|
||||
the kernel execution.
|
||||
SALU Utilization: Indicates what percent of the kernel's duration the SALU was
|
||||
busy executing instructions. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
|
||||
VALU Utilization: Indicates what percent of the kernel's duration the VALU was
|
||||
busy executing instructions. Does not include VMEM operations. Computed as the
|
||||
ratio of the total number of cycles spent by the scheduler issuing VALU instructions
|
||||
over the total CU cycles.
|
||||
MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
|
||||
was busy executing instructions. Computed as the ratio of the total number of
|
||||
cycles the MFMA was busy over the total CU cycles.
|
||||
VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
|
||||
was busy executing instructions, including both global/generic and spill/scratch
|
||||
operations (see the VMEM instruction count metrics for more detail). Does not
|
||||
include VALU operations. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing VMEM instructions over the total CU cycles.
|
||||
Branch Utilization: Indicates what percent of the kernel's duration the branch
|
||||
unit was busy executing instructions. Computed as the ratio of the total number
|
||||
of cycles spent by the scheduler issuing branch instructions over the total
|
||||
CU cycles.
|
||||
VALU Active Threads: Indicates the average level of divergence within a wavefront
|
||||
over the lifetime of the kernel. The number of work-items that were active in
|
||||
a wavefront during execution of each VALU instruction, time-averaged over all
|
||||
VALU instructions run on all wavefronts in the kernel.
|
||||
IPC: The ratio of the total number of instructions executed on the CU over the
|
||||
total active CU cycles. This is also presented as a percent of the peak theoretical
|
||||
bandwidth achievable on the specific accelerator.
|
||||
Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
|
||||
over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
||||
kernels (less than 1ms). This is also presented as a percent of the peak theoretical
|
||||
occupancy achievable on the specific accelerator.'
|
||||
Theoretical LDS Bandwidth: Indicates the maximum amount of bytes that could have
|
||||
been loaded from, stored to, or atomically updated in the LDS per unit time
|
||||
(see LDS Bandwidth example for more detail). This is also presented as a percent
|
||||
of the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
LDS Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS
|
||||
scheduler due to bank conflicts (as determined by the conflict resolution hardware)
|
||||
to the base number of cycles that would be spent in the LDS scheduler in a completely
|
||||
uncontended case. This is also presented in normalized form (i.e., the Bank
|
||||
Conflict Rate).
|
||||
vL1D Cache Hit Rate: The ratio of the number of vL1D cache line requests that
|
||||
hit in vL1D cache over the total number of cache line requests to the vL1D cache
|
||||
RAM.
|
||||
vL1D Cache BW: The number of bytes looked up in the vL1D cache as a result of
|
||||
VMEM instructions per unit time. The number of bytes is calculated as the number
|
||||
of cache lines requested multiplied by the cache line size. This value does
|
||||
not consider partial requests, so e.g., if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
This is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L2 Cache Hit Rate: The ratio of the number of L2 cache line requests that hit
|
||||
in the L2 cache over the total number of incoming cache line requests to the
|
||||
L2 cache.
|
||||
L2 Cache BW: The number of bytes looked up in the L2 cache per unit time. The
|
||||
number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line. This is also presented as a percent of
|
||||
the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Read BW: "The number of bytes read by the L2 over the Infinity Fabric\u2122\
|
||||
\ interface per unit time. This is also presented as a percent of the peak theoretical\
|
||||
\ bandwidth achievable on the specific accelerator."
|
||||
L2-Fabric Write BW: The number of bytes sent by the L2 over the Infinity Fabric
|
||||
interface by write and atomic operations per unit time. This is also presented
|
||||
as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Read Latency: The time-averaged number of cycles read requests spent
|
||||
in Infinity Fabric before data was returned to the L2.
|
||||
L2-Fabric Write Latency: The time-averaged number of cycles write requests spent
|
||||
in Infinity Fabric before a completion acknowledgement was returned to the L2.
|
||||
sL1D Cache Hit Rate: The percent of sL1D requests that hit on a previously loaded
|
||||
line in the cache. Calculated as the ratio of the number of sL1D requests that
|
||||
hit over the number of all sL1D requests.
|
||||
sL1D Cache BW: The number of bytes looked up in the sL1D cache per unit time.
|
||||
This is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L1I BW: The number of bytes looked up in the L1I cache per unit time. This
|
||||
is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L1I Hit Rate: The percent of L1I requests that hit on a previously loaded line in the cache.
|
||||
Calculated as the ratio of the number of L1I requests that hit over the number
|
||||
of all L1I requests.
|
||||
L1I Fetch Latency: The average number of cycles spent to fetch instructions to
|
||||
a CU.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 201
|
||||
@@ -335,3 +217,125 @@ Panel Config:
|
||||
peak: None
|
||||
pop: None
|
||||
coll_level: SQ_IFETCH_LEVEL
|
||||
metrics_description:
|
||||
VALU FLOPs: |-
|
||||
The total floating-point operations executed per second on the VALU.
|
||||
This is also presented as a percent of the peak theoretical FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any floating-point
|
||||
operations from MFMA instructions.
|
||||
VALU IOPs: |-
|
||||
The total integer operations executed per second on the VALU. This is
|
||||
also presented as a percent of the peak theoretical IOPs achievable on the
|
||||
specific accelerator. Note: this does not include any integer operations from
|
||||
MFMA instructions.
|
||||
MFMA FLOPs (BF16): |-
|
||||
The total number of 16-bit brain floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit brain floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
BF16 MFMA operations achievable on the specific accelerator.
|
||||
MFMA FLOPs (F16): |-
|
||||
The total number of 16-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 16-bit floating point operations from
|
||||
VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F16 MFMA operations achievable on the specific accelerator.
|
||||
MFMA FLOPs (F32): |-
|
||||
The total number of 32-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 32-bit floating point operations from
|
||||
VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F32 MFMA operations achievable on the specific accelerator.
|
||||
MFMA FLOPs (F64): |-
|
||||
The total number of 64-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 64-bit floating point operations from
|
||||
VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F64 MFMA operations achievable on the specific accelerator.
|
||||
MFMA IOPs (Int8): |-
|
||||
The total number of 8-bit integer MFMA operations executed per second.
|
||||
Note: this does not include any 8-bit integer operations from VALU instructions.
|
||||
This is also presented as a percent of the peak theoretical INT8 MFMA operations
|
||||
achievable on the specific accelerator.
|
||||
Active CUs: Total number of active compute units (CUs) on the accelerator during
|
||||
the kernel execution.
|
||||
SALU Utilization: Indicates what percent of the kernel's duration the SALU was
|
||||
busy executing instructions. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
|
||||
VALU Utilization: Indicates what percent of the kernel's duration the VALU was
|
||||
busy executing instructions. Does not include VMEM operations. Computed as the
|
||||
ratio of the total number of cycles spent by the scheduler issuing VALU instructions
|
||||
over the total CU cycles.
|
||||
MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
|
||||
was busy executing instructions. Computed as the ratio of the total number of
|
||||
cycles the MFMA was busy over the total CU cycles.
|
||||
VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
|
||||
was busy executing instructions, including both global/generic and spill/scratch
|
||||
operations (see the VMEM instruction count metrics for more detail). Does not
|
||||
include VALU operations. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing VMEM instructions over the total CU cycles.
|
||||
Branch Utilization: Indicates what percent of the kernel's duration the branch
|
||||
unit was busy executing instructions. Computed as the ratio of the total number
|
||||
of cycles spent by the scheduler issuing branch instructions over the total
|
||||
CU cycles.
|
||||
VALU Active Threads: Indicates the average level of divergence within a wavefront
|
||||
over the lifetime of the kernel. The number of work-items that were active in
|
||||
a wavefront during execution of each VALU instruction, time-averaged over all
|
||||
VALU instructions run on all wavefronts in the kernel.
|
||||
IPC: The ratio of the total number of instructions executed on the CU over the
|
||||
total active CU cycles. This is also presented as a percent of the peak theoretical
|
||||
bandwidth achievable on the specific accelerator.
|
||||
Wavefront Occupancy: |-
|
||||
The time-averaged number of wavefronts resident on the accelerator over
|
||||
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
||||
kernels (less than 1ms). This is also presented as a percent of the peak theoretical
|
||||
occupancy achievable on the specific accelerator.
|
||||
Theoretical LDS Bandwidth: Indicates the maximum amount of bytes that could have
|
||||
been loaded from, stored to, or atomically updated in the LDS per unit time
|
||||
(see LDS Bandwidth example for more detail). This is also presented as a percent
|
||||
of the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
LDS Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS
|
||||
scheduler due to bank conflicts (as determined by the conflict resolution hardware)
|
||||
to the base number of cycles that would be spent in the LDS scheduler in a completely
|
||||
uncontended case. This is also presented in normalized form (i.e., the Bank
|
||||
Conflict Rate).
|
||||
vL1D Cache Hit Rate: The ratio of the number of vL1D cache line requests that
|
||||
hit in vL1D cache over the total number of cache line requests to the vL1D cache
|
||||
RAM.
|
||||
vL1D Cache BW: The number of bytes looked up in the vL1D cache as a result of
|
||||
VMEM instructions per unit time. The number of bytes is calculated as the number
|
||||
of cache lines requested multiplied by the cache line size. This value does
|
||||
not consider partial requests, so e.g., if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
This is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L2 Cache Hit Rate: The ratio of the number of L2 cache line requests that hit
|
||||
in the L2 cache over the total number of incoming cache line requests to the
|
||||
L2 cache.
|
||||
L2 Cache BW: The number of bytes looked up in the L2 cache per unit time. The
|
||||
number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line. This is also presented as a percent of
|
||||
the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Read BW: |-
|
||||
The number of bytes read by the L2 over the Infinity Fabric™ interface
|
||||
per unit time. This is also presented as a percent of the peak theoretical
|
||||
bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Write BW: The number of bytes sent by the L2 over the Infinity Fabric
|
||||
interface by write and atomic operations per unit time. This is also presented
|
||||
as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Read Latency: The time-averaged number of cycles read requests spent
|
||||
in Infinity Fabric before data was returned to the L2.
|
||||
L2-Fabric Write Latency: The time-averaged number of cycles write requests spent
|
||||
in Infinity Fabric before a completion acknowledgement was returned to the L2.
|
||||
sL1D Cache Hit Rate: The percent of sL1D requests that hit on a previously loaded
|
||||
line in the cache. Calculated as the ratio of the number of sL1D requests that
|
||||
hit over the number of all sL1D requests.
|
||||
sL1D Cache BW: The number of bytes looked up in the sL1D cache per unit time.
|
||||
This is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L1I Hit Rate: The percent of L1I requests that hit on a previously loaded line in the cache.
Calculated as the ratio of the number of L1I requests that hit over the number
of all L1I requests.
L1I BW: The number of bytes looked up in the L1I cache per unit time. This
is also presented as a percent of the peak theoretical bandwidth achievable
on the specific accelerator.
|
||||
L1I Fetch Latency: The average number of cycles spent to fetch instructions to
|
||||
a CU.
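The hit-rate and bandwidth descriptions above reduce to simple arithmetic. The following is a minimal sketch of that arithmetic, not the profiler's implementation: the function names, the zero-request guard, and the 64-byte line size used in the usage note are all assumptions made for illustration.

```python
def hit_rate(hits: int, total_requests: int) -> float:
    """Ratio of cache line requests that hit over all cache line requests."""
    if total_requests == 0:
        # Assumption: report 0 when no requests were observed.
        return 0.0
    return hits / total_requests


def cache_bandwidth(lines_requested: int, line_size_bytes: int,
                    elapsed_seconds: float) -> float:
    """Bytes looked up per second.

    Every request is counted as a full cache line, so a request that
    touches a single value still contributes line_size_bytes.
    """
    return lines_requested * line_size_bytes / elapsed_seconds


def percent_of_peak(achieved: float, peak: float) -> float:
    """Express an achieved rate as a percent of a peak rate."""
    return 100.0 * achieved / peak
```

For example, `cache_bandwidth(1000, 64, 0.5)` treats 1000 requests as 1000 full 64-byte lines moved in half a second; the line size here is a placeholder and should be replaced with the value for the accelerator being profiled.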
|
||||
|
||||
+123
-119
@@ -2,122 +2,6 @@
|
||||
Panel Config:
|
||||
id: 300
|
||||
title: Memory Chart
|
||||
metrics_description:
|
||||
Wavefront Occupancy: Wavefronts per active CU.
|
||||
Wave Life: Average number of cycles executing a wave.
|
||||
SALU: Total Number of SALU (Scalar ALU) instructions issued per normalization
|
||||
unit.
|
||||
SMEM: Total number of SMEM (Scalar Memory Read) instructions issued per normalization
|
||||
unit.
|
||||
VALU: The number of VALU (Vector ALU) instructions issued per normalization unit.
|
||||
MFMA: Total number of MFMA (Matrix-Fused-Multiply-Add) instructions issued per
|
||||
normalization unit.
|
||||
VMEM: The number of VMEM (GPU Memory) read instructions issued (including FLAT/scratch
|
||||
memory) per normalization unit.
|
||||
LDS: The total number of LDS instructions (including, but not limited to, read/write/atomics
|
||||
and HIP's __shfl instructions) executed per normalization unit.
|
||||
GWS: Total number of GDS (global data sync) instructions issued per normalization
|
||||
unit.
|
||||
BR: Total number of BRANCH instructions issued per normalization unit.
|
||||
Active CUs: Total number of active compute units (CUs) on the accelerator during
|
||||
the kernel execution.
|
||||
Num CUs: Total number of compute units (CUs) on the accelerator.
|
||||
VGPR: 'The number of architected vector general-purpose registers allocated for
|
||||
the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
|
||||
by the compiler due to allocation granularity.'
|
||||
SGPR: 'The number of scalar general-purpose registers allocated for the kernel,
|
||||
see SALU. Note: this may not exactly match the number of SGPRs requested by
|
||||
the compiler due to allocation granularity.'
|
||||
LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
|
||||
for this kernel. Note: This may also be larger than what was requested at compile
|
||||
time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
|
||||
Scratch Allocation: The number of bytes of scratch memory requested per work-item
|
||||
for this kernel. Scratch memory is used for stack memory on the accelerator,
|
||||
as well as for register spills and restores.
|
||||
Wavefronts: The total number of wavefronts, summed over all workgroups, forming
|
||||
this kernel launch.
|
||||
Workgroups: The total number of workgroups forming this kernel launch.
|
||||
LDS Req: The total number of LDS instructions (including, but not limited to,
|
||||
read/write/atomics and HIP's __shfl instructions) executed per normalization
|
||||
unit.
|
||||
LDS Util: Indicates what percent of the kernel's duration the LDS was actively
|
||||
executing instructions (including, but not limited to, load, store, atomic and
|
||||
HIP's __shfl operations). Calculated as the ratio of the total number of cycles
|
||||
LDS was active over the total CU cycles.
|
||||
LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
|
||||
/ acknowledgment) required for an LDS instruction to complete.
|
||||
VL1 Rd: The total number of incoming read requests from the address processing
|
||||
unit after coalescing per normalization unit
|
||||
VL1 Wr: The total number of incoming write requests from the address processing
|
||||
unit after coalescing per normalization unit
|
||||
VL1 Atomic: The total number of incoming atomic requests from the address processing
|
||||
unit after coalescing per normalization unit
|
||||
VL1 Hit: The ratio of the number of vL1D cache line requests that hit in vL1D
|
||||
cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
VL1 Lat: Calculated as the average number of cycles that a vL1D cache line request
|
||||
spent in the vL1D cache pipeline.
|
||||
VL1 Coalesce: Indicates how well memory instructions were coalesced by the address
|
||||
processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
|
||||
as the average number of thread-requests generated per instruction divided by
|
||||
the ideal number of thread-requests per instruction.
|
||||
VL1 Stall: The ratio of the number of cycles where the vL1D is stalled waiting
|
||||
to issue a request for data to the L2 cache divided by the number of cycles
|
||||
where the vL1D is active.
|
||||
VL1_L2 Rd: The number of read requests for a vL1D cache line that were not satisfied
|
||||
by the vL1D and must be retrieved from the L2 Cache per normalization
|
||||
unit.
|
||||
VL1_L2 Wr: The number of write requests to a vL1D cache line that were sent through
|
||||
the vL1D to the L2 cache, per normalization unit.
|
||||
VL1_L2 Atomic: The number of atomic requests that are sent through the vL1D to
|
||||
the L2 cache, per normalization unit. This includes requests for atomics with,
|
||||
and without return.
|
||||
sL1D Rd: The total number of requests, of any size or type, made to the sL1D per
|
||||
normalization unit.
|
||||
sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
|
||||
line, per normalization unit.
|
||||
sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
|
||||
unit.
|
||||
sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
sL1D_L2 Atomic: The total number of atomic requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
IL1 Fetch: The total number of requests made to the L1I per normalization-unit.
|
||||
IL1 Hit: The percent of L1I requests that hit on a previously loaded line in the
|
||||
cache. Calculated as the ratio of the number of L1I requests that hit over the
|
||||
number of all L1I requests.
|
||||
IL1 Lat: The average number of cycles spent to fetch instructions to a CU.
|
||||
IL1_L2 Rd: The total number of requests across the L1I - L2 interface per normalization-unit.
|
||||
L2 Rd: The total number of read requests to the L2 from all clients.
|
||||
L2 Wr: The total number of write requests to the L2 from all clients.
|
||||
L2 Atomic: The total number of atomic requests (with and without return) to the
|
||||
L2 from all clients.
|
||||
L2 Hit: The ratio of the number of L2 cache line requests that hit in the L2 cache
|
||||
over the total number of incoming cache line requests to the L2 cache.
|
||||
L2 Rd Lat: Calculated as the average number of cycles that the vL1D cache took
|
||||
to issue and receive read requests from the L2 Cache. This number also includes
|
||||
requests for atomics with return values.
|
||||
L2 Wr Lat: Calculated as the average number of cycles that the vL1D cache took
|
||||
to issue and receive acknowledgement of a write request to the L2 Cache. This
|
||||
number also includes requests for atomics without return values.
|
||||
Fabric_L2 Rd: Number of L2 cache - Infinity Fabric read requests (either 32-byte
|
||||
or 64-byte) summed over TCC instances per normalization unit.
|
||||
Fabric_L2 Wr: Number of L2 cache - Infinity Fabric write requests (either 32-byte
|
||||
or 64-byte) summed over TCC instances per normalization unit.
|
||||
Fabric_L2 Atomic: Number of L2 cache - Infinity Fabric write requests (either
|
||||
32-byte or 64-byte) that are actually atomic requests summed over TCC instances
|
||||
per normalization unit.
|
||||
Fabric Rd Lat: The time-averaged number of cycles read requests spent in Infinity
|
||||
Fabric before data was returned to the L2.
|
||||
Fabric Wr Lat: The time-averaged number of cycles write requests spent in Infinity
|
||||
Fabric before a completion acknowledgement was returned to the L2.
|
||||
Fabric Atomic Lat: The time-averaged number of cycles atomic requests spent in
|
||||
Infinity Fabric before a completion acknowledgement (atomic without return value)
|
||||
or data (atomic with return value) was returned to the L2.
|
||||
HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
|
||||
of data from the accelerator's local HBM, per normalization unit.
|
||||
HBM Wr: 'The total number of L2 requests to Infinity Fabric to write or atomically
|
||||
update 32B or 64B of data in the accelerator''s local HBM, per normalization
|
||||
unit. '
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 301
|
||||
@@ -252,13 +136,13 @@ Panel Config:
|
||||
value: ROUND(AVG((TCC_EA_ATOMIC_sum / $denom)), 0)
|
||||
Fabric Rd Lat:
|
||||
value: ROUND(AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else 0)), 0)
|
||||
!= 0) else 0)), 0)
|
||||
Fabric Wr Lat:
|
||||
value: ROUND(AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else 0)), 0)
|
||||
!= 0) else 0)), 0)
|
||||
Fabric Atomic Lat:
|
||||
value: ROUND(AVG(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
|
||||
!= 0) else 0)), 0)
|
||||
!= 0) else 0)), 0)
|
||||
HBM Rd:
|
||||
value: ROUND(AVG((TCC_EA_RDREQ_DRAM_sum / $denom)), 0)
|
||||
HBM Wr:
|
||||
@@ -266,3 +150,123 @@ Panel Config:
|
||||
comparable: false
|
||||
cli_style: mem_chart
|
||||
tui_style: mem_chart
|
||||
metrics_description:
|
||||
Wavefront Occupancy: Wavefronts per active CU.
|
||||
Wave Life: Average number of cycles executing a wave.
|
||||
SALU: Total Number of SALU (Scalar ALU) instructions issued per normalization
|
||||
unit.
|
||||
SMEM: Total number of SMEM (Scalar Memory Read) instructions issued per normalization
|
||||
unit.
|
||||
VALU: The number of VALU (Vector ALU) instructions issued per normalization unit.
|
||||
MFMA: Total number of MFMA (Matrix-Fused-Multiply-Add) instructions issued per
|
||||
normalization unit.
|
||||
VMEM: The number of VMEM (GPU Memory) read instructions issued (including FLAT/scratch
|
||||
memory) per normalization unit.
|
||||
LDS: The total number of LDS instructions (including, but not limited to, read/write/atomics
|
||||
and HIP's __shfl instructions) executed per normalization unit.
|
||||
GWS: Total number of GDS (global data sync) instructions issued per normalization
|
||||
unit.
|
||||
BR: Total number of BRANCH instructions issued per normalization unit.
|
||||
Active CUs: Total number of active compute units (CUs) on the accelerator during
|
||||
the kernel execution.
|
||||
Num CUs: Total number of compute units (CUs) on the accelerator.
|
||||
VGPR: |-
|
||||
The number of architected vector general-purpose registers allocated
|
||||
for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
|
||||
requested by the compiler due to allocation granularity.
|
||||
SGPR: |-
|
||||
The number of scalar general-purpose registers allocated for the kernel,
|
||||
see SALU. Note: this may not exactly match the number of SGPRs requested by
|
||||
the compiler due to allocation granularity.
|
||||
LDS Allocation: |-
|
||||
The number of bytes of LDS memory (or, shared memory) allocated for
|
||||
this kernel. Note: This may also be larger than what was requested at compile
|
||||
time due to both allocation granularity and dynamic per-dispatch LDS allocations.
|
||||
Scratch Allocation: The number of bytes of scratch memory requested per work-item
|
||||
for this kernel. Scratch memory is used for stack memory on the accelerator,
|
||||
as well as for register spills and restores.
|
||||
Wavefronts: The total number of wavefronts, summed over all workgroups, forming
|
||||
this kernel launch.
|
||||
Workgroups: The total number of workgroups forming this kernel launch.
|
||||
LDS Req: The total number of LDS instructions (including, but not limited to,
|
||||
read/write/atomics and HIP's __shfl instructions) executed per normalization
|
||||
unit.
|
||||
LDS Util: Indicates what percent of the kernel's duration the LDS was actively
|
||||
executing instructions (including, but not limited to, load, store, atomic and
|
||||
HIP's __shfl operations). Calculated as the ratio of the total number of cycles
|
||||
LDS was active over the total CU cycles.
|
||||
LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
|
||||
/ acknowledgment) required for an LDS instruction to complete.
|
||||
VL1 Rd: The total number of incoming read requests from the address processing
|
||||
unit after coalescing per normalization unit
|
||||
VL1 Wr: The total number of incoming write requests from the address processing
|
||||
unit after coalescing per normalization unit
|
||||
VL1 Atomic: The total number of incoming atomic requests from the address processing
|
||||
unit after coalescing per normalization unit
|
||||
VL1 Hit: The ratio of the number of vL1D cache line requests that hit in vL1D
|
||||
cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
VL1 Lat: Calculated as the average number of cycles that a vL1D cache line request
|
||||
spent in the vL1D cache pipeline.
|
||||
VL1 Coalesce: Indicates how well memory instructions were coalesced by the address
|
||||
processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
|
||||
as the average number of thread-requests generated per instruction divided by
|
||||
the ideal number of thread-requests per instruction.
|
||||
VL1 Stall: The ratio of the number of cycles where the vL1D is stalled waiting
|
||||
to issue a request for data to the L2 cache divided by the number of cycles
|
||||
where the vL1D is active.
|
||||
VL1_L2 Rd: The number of read requests for a vL1D cache line that were not satisfied
|
||||
by the vL1D and must be retrieved from the L2 Cache per normalization
|
||||
unit.
|
||||
VL1_L2 Wr: The number of write requests to a vL1D cache line that were sent through
|
||||
the vL1D to the L2 cache, per normalization unit.
|
||||
VL1_L2 Atomic: The number of atomic requests that are sent through the vL1D to
|
||||
the L2 cache, per normalization unit. This includes requests for atomics with,
|
||||
and without return.
|
||||
sL1D Rd: The total number of requests, of any size or type, made to the sL1D per
|
||||
normalization unit.
|
||||
sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
|
||||
line, per normalization unit.
|
||||
sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
|
||||
unit.
|
||||
sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
sL1D_L2 Atomic: The total number of atomic requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
IL1 Fetch: The total number of requests made to the L1I per normalization-unit.
|
||||
IL1 Hit: The percent of L1I requests that hit on a previously loaded line in the
|
||||
cache. Calculated as the ratio of the number of L1I requests that hit over the
|
||||
number of all L1I requests.
|
||||
IL1 Lat: The average number of cycles spent to fetch instructions to a CU.
|
||||
IL1_L2 Rd: The total number of requests across the L1I - L2 interface per normalization-unit.
|
||||
L2 Rd: The total number of read requests to the L2 from all clients.
|
||||
L2 Wr: The total number of write requests to the L2 from all clients.
|
||||
L2 Atomic: The total number of atomic requests (with and without return) to the
|
||||
L2 from all clients.
|
||||
L2 Hit: The ratio of the number of L2 cache line requests that hit in the L2 cache
|
||||
over the total number of incoming cache line requests to the L2 cache.
|
||||
L2 Rd Lat: Calculated as the average number of cycles that the vL1D cache took
|
||||
to issue and receive read requests from the L2 Cache. This number also includes
|
||||
requests for atomics with return values.
|
||||
L2 Wr Lat: Calculated as the average number of cycles that the vL1D cache took
|
||||
to issue and receive acknowledgement of a write request to the L2 Cache. This
|
||||
number also includes requests for atomics without return values.
|
||||
Fabric_L2 Rd: Number of L2 cache - Infinity Fabric read requests (either 32-byte
|
||||
or 64-byte) summed over TCC instances per normalization unit.
|
||||
Fabric_L2 Wr: Number of L2 cache - Infinity Fabric write requests (either 32-byte
|
||||
or 64-byte) summed over TCC instances per normalization unit.
|
||||
Fabric_L2 Atomic: Number of L2 cache - Infinity Fabric write requests (either
|
||||
32-byte or 64-byte) that are actually atomic requests summed over TCC instances
|
||||
per normalization unit.
|
||||
Fabric Rd Lat: The time-averaged number of cycles read requests spent in Infinity
|
||||
Fabric before data was returned to the L2.
|
||||
Fabric Wr Lat: The time-averaged number of cycles write requests spent in Infinity
|
||||
Fabric before a completion acknowledgement was returned to the L2.
|
||||
Fabric Atomic Lat: The time-averaged number of cycles atomic requests spent in
|
||||
Infinity Fabric before a completion acknowledgement (atomic without return value)
|
||||
or data (atomic with return value) was returned to the L2.
|
||||
HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
|
||||
of data from the accelerator's local HBM, per normalization unit.
|
||||
HBM Wr: |-
|
||||
The total number of L2 requests to Infinity Fabric to write or atomically
|
||||
update 32B or 64B of data in the accelerator's local HBM, per normalization
|
||||
unit.
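The Fabric latency expressions in this panel divide a `*_LEVEL_sum` counter by the matching request counter and fall back to 0 when no requests were issued. The sketch below mirrors that guarded ratio in plain Python. It assumes one counter pair per profiled sample and at least one sample, and it assumes that `AVG` and `ROUND` in the panel behave like an arithmetic mean and ordinary rounding; the function name and argument layout are hypothetical.

```python
from statistics import mean


def fabric_latency(level_sums, request_sums):
    """Average cycles per request, guarded against zero request counts.

    level_sums and request_sums hold one value per sample, for example
    the TCC_EA_RDREQ_LEVEL_sum and TCC_EA_RDREQ_sum counters.
    """
    per_sample = [
        (level / reqs) if reqs != 0 else 0  # same guard as the panel expression
        for level, reqs in zip(level_sums, request_sums)
    ]
    # Assumption: AVG is an arithmetic mean and ROUND(x, 0) rounds to a whole number.
    return round(mean(per_sample), 0)
```

A sample with no requests contributes 0 to the average rather than raising a division error, which is why idle samples pull the reported latency down.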
|
||||
|
||||
+83
-79
@@ -2,85 +2,6 @@
|
||||
Panel Config:
|
||||
id: 400
|
||||
title: Roofline
|
||||
metrics_description:
|
||||
VALU FLOPs (F16): 'The total 16-bit floating-point operations executed per second
|
||||
on the VALU. This is presented with the value of the peak empirical F16 FLOPs
|
||||
achievable on the specific accelerator. Note: this does not include any F16
|
||||
operations from MFMA instructions.'
|
||||
VALU FLOPs (F32): 'The total 32-bit floating-point operations executed per second
|
||||
on the VALU. This is presented with the value of the peak empirical F32 FLOPs
|
||||
achievable on the specific accelerator. Note: this does not include any F32
|
||||
operations from MFMA instructions.'
|
||||
VALU FLOPs (F64): 'The total 64-bit floating-point operations executed per second
|
||||
on the VALU. This is presented with the value of the peak empirical F64 FLOPs
|
||||
achievable on the specific accelerator. Note: this does not include any F64
|
||||
operations from MFMA instructions.'
|
||||
MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
executed per second. This does not include any 8-bit floating point operations
|
||||
from VALU instructions. The peak empirically measured F8 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison. It is supported
|
||||
on AMD Instinct MI300 series and later only.
|
||||
MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
|
||||
executed per second. Note: this does not include any 16-bit brain floating point
|
||||
operations from VALU instructions. The peak empirically measured BF16 MFMA operations
|
||||
achievable on the specific accelerator is displayed alongside for comparison.'
|
||||
MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit floating point operations
|
||||
from VALU instructions. The peak empirically measured F16 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison.'
|
||||
MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 32-bit floating point operations
|
||||
from VALU instructions. The peak empirically measured F32 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison.'
|
||||
MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 64-bit floating point operations
|
||||
from VALU instructions. The peak empirically measured F64 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison.'
|
||||
MFMA FLOPs (F6F4): 'The total number of 4-bit and 6-bit floating point MFMA operations
|
||||
executed per second. Note: this does not include any floating point operations
|
||||
from VALU instructions. The peak empirically measured F6F4 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison. It is supported
|
||||
on AMD Instinct MI350 series (gfx950) and later only.'
|
||||
MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
|
||||
per second. Note: this does not include any 8-bit integer operations from VALU
|
||||
instructions. The peak empirically measured INT8 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison.'
|
||||
HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth
|
||||
Memory (HBM) per second. The peak empirically measured bandwidth achievable
|
||||
on the specific accelerator is displayed alongside for comparison.
|
||||
L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
|
||||
The number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line. The peak empirically measured bandwidth
|
||||
achievable on the specific accelerator is displayed alongside for comparison.
|
||||
L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
|
||||
of VMEM instructions per unit time. The number of bytes is calculated as the
|
||||
number of cache lines requested multiplied by the cache line size. This value
|
||||
does not consider partial requests, so e.g., if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
The peak empirically measured bandwidth achievable on the specific accelerator
|
||||
is displayed alongside for comparison.
|
||||
LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded
|
||||
from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
|
||||
example for more detail). The peak empirically measured LDS bandwidth achievable
|
||||
on the specific accelerator is displayed alongside for comparison.
|
||||
AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
|
||||
of total floating-point operations (FLOPs) to total bytes transferred between
|
||||
the L1 cache and the processing units. This value is used as the x-coordinate
|
||||
for the L1 roofline.
|
||||
AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
|
||||
of total floating-point operations (FLOPs) to total bytes transferred between
|
||||
the L2 cache and the L1 cache. This value is used as the x-coordinate for the
|
||||
L2 roofline.
|
||||
AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
|
||||
It is the ratio of total floating-point operations (FLOPs) to total bytes transferred
|
||||
between HBM and the L2 cache. This value is used as the x-coordinate for the
|
||||
HBM roofline.
|
||||
Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs
|
||||
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
|
||||
operations divided by the total execution time. This value is used as the y-coordinate
|
||||
for the kernel's point on the Roofline plot.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 401
|
||||
@@ -210,3 +131,86 @@ Panel Config:
|
||||
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / (SUM(End_Timestamp - Start_Timestamp)
|
||||
/ 1e9) ) / 1e9
|
||||
unit: GFLOP/s
|
||||
metrics_description:
|
||||
VALU FLOPs (F16): |-
|
||||
The total 16-bit floating-point operations executed per second on the VALU.
|
||||
This is presented with the value of the peak empirical F16 FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any F16 operations
|
||||
from MFMA instructions.
|
||||
VALU FLOPs (F32): |-
|
||||
The total 32-bit floating-point operations executed per second on the VALU.
|
||||
This is presented with the value of the peak empirical F32 FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any F32 operations
|
||||
from MFMA instructions.
|
||||
VALU FLOPs (F64): |-
|
||||
The total 64-bit floating-point operations executed per second on the VALU.
|
||||
This is presented with the value of the peak empirical F64 FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any F64 operations
|
||||
from MFMA instructions.
|
||||
MFMA FLOPs (BF16): |-
|
||||
The total number of 16-bit brain floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit brain floating point
|
||||
operations from VALU instructions. The peak empirically measured BF16 MFMA
|
||||
operations achievable on the specific accelerator is displayed alongside
|
||||
for comparison.
|
||||
MFMA FLOPs (F16): |-
|
||||
The total number of 16-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 16-bit floating point operations from
|
||||
VALU instructions. The peak empirically measured F16 MFMA operations
|
||||
achievable on the specific accelerator is displayed alongside for comparison.
|
||||
MFMA FLOPs (F32): |-
|
||||
The total number of 32-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 32-bit floating point operations from
|
||||
VALU instructions. The peak empirically measured F32 MFMA operations
|
||||
achievable on the specific accelerator is displayed alongside for comparison.
|
||||
MFMA FLOPs (F64): |-
|
||||
The total number of 64-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 64-bit floating point operations from
|
||||
VALU instructions. The peak empirically measured F64 MFMA operations
|
||||
achievable on the specific accelerator is displayed alongside for comparison.
|
||||
MFMA IOPs (Int8): |-
|
||||
The total number of 8-bit integer MFMA operations executed per second.
|
||||
Note: this does not include any 8-bit integer operations from VALU instructions.
|
||||
The peak empirically measured INT8 MFMA operations achievable on the specific
|
||||
accelerator is displayed alongside for comparison.
|
||||
HBM Bandwidth: |-
|
||||
The total number of bytes read from and written to High-Bandwidth
|
||||
Memory (HBM) per second. The peak empirically measured bandwidth achievable
|
||||
on the specific accelerator is displayed alongside for comparison.
|
||||
L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
|
||||
The number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line. The peak empirically measured bandwidth
|
||||
achievable on the specific accelerator is displayed alongside for comparison.
|
||||
L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
|
||||
of VMEM instructions per unit time. The number of bytes is calculated as the
|
||||
number of cache lines requested multiplied by the cache line size. This value
|
||||
does not consider partial requests, so e.g., if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
The peak empirically measured bandwidth achievable on the specific accelerator
|
||||
is displayed alongside for comparison.
|
||||
LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded
|
||||
from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
|
||||
example for more detail). The peak empirically measured LDS bandwidth achievable
|
||||
on the specific accelerator is displayed alongside for comparison.
|
||||
AI L1: |-
|
||||
The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
|
||||
of total floating-point operations (FLOPs) to total bytes transferred between
|
||||
the L1 cache and the processing units. This value is used as the x-coordinate
|
||||
for the L1 roofline.
|
||||
AI L2: |-
|
||||
The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
|
||||
of total floating-point operations (FLOPs) to total bytes transferred between
|
||||
the L2 cache and the L1 cache. This value is used as the x-coordinate for
|
||||
the L2 roofline.
|
||||
AI HBM: |-
|
||||
The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
|
||||
It is the ratio of total floating-point operations (FLOPs) to total bytes
|
||||
transferred between HBM and the L2 cache. This value is used as the x-coordinate
|
||||
for the HBM roofline.
|
||||
Performance (GFLOPs): |-
|
||||
The overall achieved performance, measured in GigaFLOPs
|
||||
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
|
||||
operations divided by the total execution time. This value is used as the y-coordinate
|
||||
for the kernel's point on the Roofline plot.
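The roofline coordinates described above can be sketched as follows. This is an illustration under stated assumptions, not the profiler's code: the function names are hypothetical, the elapsed time is assumed to be in nanoseconds (matching the `/ 1e9` conversions in the panel expression), and `attainable_gflops` applies the standard roofline bound, which the panel text itself does not spell out.

```python
def arithmetic_intensity(total_flops: float, bytes_moved: float) -> float:
    """FLOPs per byte transferred at one memory level (x-coordinate)."""
    return total_flops / bytes_moved


def achieved_gflops(total_flops: float, elapsed_ns: float) -> float:
    """Achieved performance in GFLOP/s (y-coordinate)."""
    seconds = elapsed_ns / 1e9
    return (total_flops / seconds) / 1e9


def attainable_gflops(ai: float, peak_gflops: float, peak_bw_gbs: float) -> float:
    """Standard roofline bound: limited by compute or by bandwidth."""
    return min(peak_gflops, ai * peak_bw_gbs)
```

Computing `arithmetic_intensity` once per memory level (L1, L2, HBM) with that level's byte count gives the three x-coordinates; the y-coordinate from `achieved_gflops` is shared by all three points.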
|
||||
|
||||
+25
-24
@@ -2,30 +2,6 @@
|
||||
Panel Config:
|
||||
id: 500
|
||||
title: Command Processor (CPC/CPF)
|
||||
metrics_description:
|
||||
CPF Utilization: Percent of total cycles where the CPF was busy actively doing
|
||||
any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
|
||||
CPF Stall: Percent of CPF busy cycles where the CPF was stalled for any reason.
|
||||
CPF-L2 Utilization: Percent of total cycles counted by the CPF-L2 interface where
|
||||
the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
|
||||
over total cycles counted by the CPF-L2.
|
||||
CPF-L2 Stall: Percent of CPF-L2 busy cycles where the CPF-L2 interface was
|
||||
stalled for any reason.
|
||||
CPF-UTCL1 Stall: Percent of CPF busy cycles where the CPF was stalled by address
|
||||
translation.
|
||||
CPC Utilization: Percent of total cycles where the CPC was busy actively doing
|
||||
any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
|
||||
CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
|
||||
CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
|
||||
for processing.
|
||||
CPC-Workgroup Manager Utilization: Percent of CPC busy cycles spent dispatching
|
||||
workgroups to the workgroup manager.
|
||||
CPC-L2 Utilization: Percent of total cycles counted by the CPC-L2 interface where
|
||||
the CPC-L2 interface was active doing any work.
|
||||
CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
|
||||
translation.
|
||||
CPC-UTCL2 Utilization: 'Percent of total cycles counted by the CPC''s L2 address
|
||||
translation interface where the CPC was busy doing address translation work. '
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 501
|
||||
@@ -143,3 +119,28 @@ Panel Config:
|
||||
max: MAX((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
|
||||
if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
metrics_description:
|
||||
CPF Utilization: Percent of total cycles where the CPF was busy actively doing
|
||||
any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
|
||||
CPF Stall: Percent of CPF busy cycles where the CPF was stalled for any reason.
|
||||
CPF-L2 Utilization: Percent of total cycles counted by the CPF-L2 interface where
|
||||
the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
|
||||
over total cycles counted by the CPF-L2.
|
||||
CPF-L2 Stall: Percent of CPF-L2 busy cycles where the CPF-L2 interface was
|
||||
stalled for any reason.
|
||||
CPF-UTCL1 Stall: Percent of CPF busy cycles where the CPF was stalled by address
|
||||
translation.
|
||||
CPC Utilization: Percent of total cycles where the CPC was busy actively doing
|
||||
any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
|
||||
CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
|
||||
CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
|
||||
for processing.
|
||||
CPC-Workgroup Manager Utilization: Percent of CPC busy cycles spent dispatching
|
||||
workgroups to the workgroup manager.
|
||||
CPC-L2 Utilization: Percent of total cycles counted by the CPC-L2 interface where
|
||||
the CPC-L2 interface was active doing any work.
|
||||
CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
|
||||
translation.
|
||||
CPC-UTCL2 Utilization: |-
|
||||
Percent of total cycles counted by the CPC's L2 address translation
|
||||
interface where the CPC was busy doing address translation work.
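The utilization expressions in this panel follow a guarded ratio: `100 * BUSY / (BUSY + IDLE)` when the denominator is non-zero, otherwise `None`. A minimal sketch of that pattern is below. The function names are hypothetical, and skipping `None` samples during aggregation is an assumption rather than confirmed profiler behavior.

```python
from typing import Dict, Iterable, Optional, Tuple


def busy_percent(busy: int, idle: int) -> Optional[float]:
    """Percent of counted cycles that were busy; None if nothing was counted."""
    total = busy + idle
    if total == 0:
        return None
    return 100 * busy / total


def summarize(samples: Iterable[Tuple[int, int]]) -> Optional[Dict[str, float]]:
    """Aggregate (busy, idle) samples into avg/min/max, ignoring empty samples."""
    values = [
        value
        for value in (busy_percent(busy, idle) for busy, idle in samples)
        if value is not None  # assumption: samples with no cycles are skipped
    ]
    if not values:
        return None
    return {
        "avg": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
    }
```

Returning `None` instead of 0 keeps "no cycles counted" distinguishable from "counted, but never busy".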
|
||||
|
||||
+55
-55
@@ -2,61 +2,6 @@
|
||||
Panel Config:
|
||||
id: 600
|
||||
title: Workgroup Manager (SPI)
|
||||
metrics_description:
|
||||
Accelerator Utilization: The percent of cycles in the kernel where the accelerator
|
||||
was actively doing any work.
|
||||
Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
|
||||
kernel where the scheduler-pipes were actively doing any work.
|
||||
Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
|
||||
manager was actively doing any work.
|
||||
Shader Engine Utilization: The percent of total shader engine cycles in the kernel
|
||||
where any CU in a shader-engine was actively doing any work, normalized over
|
||||
all shader-engines. Low values (e.g., << 100%) indicate that the accelerator
|
||||
was not fully saturated by the kernel, or a potential load-imbalance issue.
|
||||
SIMD Utilization: The percent of total SIMD cycles in the kernel where any SIMD
|
||||
on a CU was actively doing any work, summed over all CUs. Low values (less than
|
||||
100%) indicate that the accelerator was not fully saturated by the kernel, or
|
||||
a potential load-imbalance issue.
|
||||
Dispatched Workgroups: The total number of workgroups forming this kernel launch.
|
||||
Dispatched Wavefronts: The total number of wavefronts, summed over all workgroups,
|
||||
forming this kernel launch.
|
||||
VGPR Writes: The average number of cycles spent initializing VGPRs at wave creation.
|
||||
SGPR Writes: The average number of cycles spent initializing SGPRs at wave creation.
|
||||
Not-scheduled Rate (Workgroup Manager): The percent of total scheduler-pipe cycles
|
||||
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
|
||||
within the workgroup manager rather than a lack of a CU or SIMD with sufficient
|
||||
resources.
|
||||
Not-scheduled Rate (Scheduler-Pipe): 'The percent of total scheduler-pipe cycles
|
||||
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
|
||||
within the scheduler-pipes rather than a lack of a CU or SIMD with sufficient
|
||||
resources. '
|
||||
Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
|
||||
where a workgroup could not be scheduled to a CU due to occupancy limitations
|
||||
(like a lack of a CU or SIMD with sufficient resources).
|
||||
Scratch Stall Rate: The percent of total shader-engine cycles in the kernel where
|
||||
a workgroup could not be scheduled to a CU due to lack of private (a.k.a., scratch)
|
||||
memory slots. While this can reach up to 100%, note that the actual occupancy
|
||||
limitations on a kernel using private memory are typically quite small (for
|
||||
example, less than 1% of the total number of waves that can be scheduled to
|
||||
an accelerator).
|
||||
Insufficient SIMD Waveslots: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available waveslots.
|
||||
Insufficient SIMD VGPRs: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available VGPRs.
|
||||
Insufficient SIMD SGPRs: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available SGPRs.
|
||||
Insufficient CU LDS: The percent of total CU cycles in the kernel where a workgroup
|
||||
could not be scheduled to a CU due to lack of available LDS.
|
||||
Insufficient CU Barriers: The percent of total CU cycles in the kernel where a
|
||||
workgroup could not be scheduled to a CU due to lack of available barriers.
|
||||
Reached CU Workgroup Limit: The percent of total CU cycles in the kernel where
|
||||
a workgroup could not be scheduled to a CU due to limits within the workgroup
|
||||
manager. This is expected to always be zero on CDNA2 or newer accelerators
|
||||
(and small for previous accelerators).
|
||||
Reached CU Wavefront Limit: The percent of total CU cycles in the kernel where
|
||||
a wavefront could not be scheduled to a CU due to limits within the workgroup
|
||||
manager. This is expected to always be zero on CDNA2 or newer accelerators
|
||||
(and small for previous accelerators).
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 601
|
||||
@@ -199,3 +144,58 @@ Panel Config:
|
||||
min: MIN(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
metrics_description:
|
||||
Accelerator Utilization: The percent of cycles in the kernel where the accelerator
|
||||
was actively doing any work.
|
||||
Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
|
||||
kernel where the scheduler-pipes were actively doing any work.
|
||||
Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
|
||||
manager was actively doing any work.
|
||||
Shader Engine Utilization: The percent of total shader engine cycles in the kernel
|
||||
where any CU in a shader-engine was actively doing any work, normalized over
|
||||
all shader-engines. Low values (e.g., << 100%) indicate that the accelerator
|
||||
was not fully saturated by the kernel, or a potential load-imbalance issue.
|
||||
SIMD Utilization: The percent of total SIMD cycles in the kernel where any SIMD
|
||||
on a CU was actively doing any work, summed over all CUs. Low values (less than
|
||||
100%) indicate that the accelerator was not fully saturated by the kernel, or
|
||||
a potential load-imbalance issue.
|
||||
Dispatched Workgroups: The total number of workgroups forming this kernel launch.
|
||||
Dispatched Wavefronts: The total number of wavefronts, summed over all workgroups,
|
||||
forming this kernel launch.
|
||||
VGPR Writes: The average number of cycles spent initializing VGPRs at wave creation.
|
||||
SGPR Writes: The average number of cycles spent initializing SGPRs at wave creation.
|
||||
Not-scheduled Rate (Workgroup Manager): The percent of total scheduler-pipe cycles
|
||||
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
|
||||
within the workgroup manager rather than a lack of a CU or SIMD with sufficient
|
||||
resources.
|
||||
Not-scheduled Rate (Scheduler-Pipe): |-
|
||||
The percent of total scheduler-pipe cycles in the kernel where a workgroup
|
||||
could not be scheduled to a CU due to a bottleneck within the scheduler-pipes
|
||||
rather than a lack of a CU or SIMD with sufficient resources.
|
||||
Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
|
||||
where a workgroup could not be scheduled to a CU due to occupancy limitations
|
||||
(like a lack of a CU or SIMD with sufficient resources).
|
||||
Scratch Stall Rate: The percent of total shader-engine cycles in the kernel where
|
||||
a workgroup could not be scheduled to a CU due to lack of private (a.k.a., scratch)
|
||||
memory slots. While this can reach up to 100%, note that the actual occupancy
|
||||
limitations on a kernel using private memory are typically quite small (for
|
||||
example, less than 1% of the total number of waves that can be scheduled to
|
||||
an accelerator).
|
||||
Insufficient SIMD Waveslots: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available waveslots.
|
||||
Insufficient SIMD VGPRs: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available VGPRs.
|
||||
Insufficient SIMD SGPRs: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available SGPRs.
|
||||
Insufficient CU LDS: The percent of total CU cycles in the kernel where a workgroup
|
||||
could not be scheduled to a CU due to lack of available LDS.
|
||||
Insufficient CU Barriers: The percent of total CU cycles in the kernel where a
|
||||
workgroup could not be scheduled to a CU due to lack of available barriers.
|
||||
Reached CU Workgroup Limit: The percent of total CU cycles in the kernel where
|
||||
a workgroup could not be scheduled to a CU due to limits within the workgroup
|
||||
manager. This is expected to always be zero on CDNA2 or newer accelerators
|
||||
(and small for previous accelerators).
|
||||
Reached CU Wavefront Limit: The percent of total CU cycles in the kernel where
|
||||
a wavefront could not be scheduled to a CU due to limits within the workgroup
|
||||
manager. This is expected to always be zero on CDNA2 or newer accelerators
|
||||
(and small for previous accelerators).
|
||||
|
||||
+63
-57
@@ -2,63 +2,6 @@
|
||||
Panel Config:
|
||||
id: 700
|
||||
title: Wavefront
|
||||
metrics_description:
|
||||
Grid Size: The total number of work-items (or, threads) launched as a part of
|
||||
the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied
|
||||
by the total workgroup (or, block) size.
|
||||
Workgroup Size: The total number of work-items (or, threads) in each workgroup
|
||||
(or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
|
||||
to the total block size.
|
||||
Total Wavefronts: "The total number of wavefronts launched as part of the kernel\
|
||||
\ dispatch. On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs,\
|
||||
\ the wavefront size is always 64 work-items. Thus, the total number of wavefronts\
|
||||
\ should be equivalent to the ceiling of grid size divided by 64."
|
||||
Saved Wavefronts: The total number of wavefronts saved at a context-save.
|
||||
Restored Wavefronts: The total number of wavefronts restored from a context-save.
|
||||
VGPRs: 'The number of architected vector general-purpose registers allocated for
|
||||
the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
|
||||
by the compiler due to allocation granularity.'
|
||||
AGPRs: 'The number of accumulation vector general-purpose registers allocated
|
||||
for the kernel, see AGPRs. Note: this may not exactly match the number of AGPRs
|
||||
requested by the compiler due to allocation granularity.'
|
||||
SGPRs: 'The number of scalar general-purpose registers allocated for the kernel,
|
||||
see SALU. Note: this may not exactly match the number of SGPRs requested by
|
||||
the compiler due to allocation granularity.'
|
||||
LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
|
||||
for this kernel. Note: This may also be larger than what was requested at compile
|
||||
time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
|
||||
Scratch Allocation: The number of bytes of scratch memory requested per work-item
|
||||
for this kernel. Scratch memory is used for stack memory on the accelerator,
|
||||
as well as for register spills and restores.
|
||||
Kernel Time: The total duration of the executed kernel.
|
||||
Kernel Time (Cycles): The total duration of the executed kernel in cycles.
|
||||
Instructions per wavefront: The average number of instructions (of all types)
|
||||
executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.
|
||||
Wave Cycles: The number of cycles a wavefront in the kernel dispatch spent resident
|
||||
on a compute unit per normalization unit. This is averaged over all wavefronts
|
||||
in a kernel dispatch.
|
||||
Dependency Wait Cycles: The number of cycles a wavefront in the kernel dispatch
|
||||
was stalled waiting on memory of any kind (such as instruction fetch, or vector or scalar memory) per normalization unit. This is averaged over
|
||||
all wavefronts in a kernel dispatch.
|
||||
Issue Wait Cycles: The number of cycles a wavefront in the kernel dispatch was
|
||||
unable to issue an instruction for any reason (e.g., execution pipe back-pressure,
|
||||
arbitration loss, etc.) per normalization unit. This counter is incremented
|
||||
at every cycle by all wavefronts on a CU unable to issue an instruction. As
|
||||
such, it is most useful to get a sense of how waves were spending their time,
|
||||
rather than identification of a precise limiter because another wave could be
|
||||
actively executing while a wave is issue stalled. The sum of this metric, Dependency
|
||||
Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
|
||||
Active Cycles: The average number of cycles a wavefront in the kernel dispatch
|
||||
was actively executing instructions per normalization unit. This measurement
|
||||
is made on a per-wavefront basis, and may include cycles that another wavefront
|
||||
spent actively executing (on another execution unit, for example) or was stalled.
|
||||
As such, it is most useful to get a sense of how waves were spending their time,
|
||||
rather than identification of a precise limiter. The sum of this metric, Issue
|
||||
Wait Cycles and Dependency Wait Cycles should be equal to the total Wave Cycles
|
||||
metric.
|
||||
Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
|
||||
over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
||||
kernels (less than 1ms).'
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 701
|
||||
@@ -171,3 +114,66 @@ Panel Config:
|
||||
max: MAX((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
unit: Wavefronts
|
||||
coll_level: SQ_LEVEL_WAVES
|
||||
metrics_description:
|
||||
Grid Size: The total number of work-items (or, threads) launched as a part of
|
||||
the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied
|
||||
by the total workgroup (or, block) size.
|
||||
Workgroup Size: The total number of work-items (or, threads) in each workgroup
|
||||
(or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
|
||||
to the total block size.
|
||||
Total Wavefronts: |-
|
||||
The total number of wavefronts launched as part of the kernel dispatch.
|
||||
On AMD Instinct™ CDNA™ accelerators and GCN™ GPUs, the wavefront
|
||||
size is always 64 work-items. Thus, the total number of wavefronts should
|
||||
be equivalent to the ceiling of grid size divided by 64.
|
||||
Saved Wavefronts: The total number of wavefronts saved at a context-save.
|
||||
Restored Wavefronts: The total number of wavefronts restored from a context-save.
|
||||
VGPRs: |-
|
||||
The number of architected vector general-purpose registers allocated
|
||||
for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
|
||||
requested by the compiler due to allocation granularity.
|
||||
AGPRs: |-
|
||||
The number of accumulation vector general-purpose registers allocated
|
||||
for the kernel, see AGPRs. Note: this may not exactly match the number of
|
||||
AGPRs requested by the compiler due to allocation granularity.
|
||||
SGPRs: |-
|
||||
The number of scalar general-purpose registers allocated for the kernel,
|
||||
see SALU. Note: this may not exactly match the number of SGPRs requested by
|
||||
the compiler due to allocation granularity.
|
||||
LDS Allocation: |-
|
||||
The number of bytes of LDS memory (or, shared memory) allocated for
|
||||
this kernel. Note: This may also be larger than what was requested at compile
|
||||
time due to both allocation granularity and dynamic per-dispatch LDS allocations.
|
||||
Scratch Allocation: The number of bytes of scratch memory requested per work-item
|
||||
for this kernel. Scratch memory is used for stack memory on the accelerator,
|
||||
as well as for register spills and restores.
|
||||
Kernel Time: The total duration of the executed kernel.
|
||||
Kernel Time (Cycles): The total duration of the executed kernel in cycles.
|
||||
Instructions per wavefront: The average number of instructions (of all types)
|
||||
executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.
|
||||
Wave Cycles: The number of cycles a wavefront in the kernel dispatch spent resident
|
||||
on a compute unit per normalization unit. This is averaged over all wavefronts
|
||||
in a kernel dispatch.
|
||||
Dependency Wait Cycles: The number of cycles a wavefront in the kernel dispatch
|
||||
was stalled waiting on memory of any kind (such as instruction fetch, or vector or scalar memory) per normalization unit. This is averaged over
|
||||
all wavefronts in a kernel dispatch.
|
||||
Issue Wait Cycles: The number of cycles a wavefront in the kernel dispatch was
|
||||
unable to issue an instruction for any reason (e.g., execution pipe back-pressure,
|
||||
arbitration loss, etc.) per normalization unit. This counter is incremented
|
||||
at every cycle by all wavefronts on a CU unable to issue an instruction. As
|
||||
such, it is most useful to get a sense of how waves were spending their time,
|
||||
rather than identification of a precise limiter because another wave could be
|
||||
actively executing while a wave is issue stalled. The sum of this metric, Dependency
|
||||
Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
|
||||
Active Cycles: The average number of cycles a wavefront in the kernel dispatch
|
||||
was actively executing instructions per normalization unit. This measurement
|
||||
is made on a per-wavefront basis, and may include cycles that another wavefront
|
||||
spent actively executing (on another execution unit, for example) or was stalled.
|
||||
As such, it is most useful to get a sense of how waves were spending their time,
|
||||
rather than identification of a precise limiter. The sum of this metric, Issue
|
||||
Wait Cycles and Dependency Wait Cycles should be equal to the total Wave Cycles
|
||||
metric.
|
||||
Wavefront Occupancy: |-
|
||||
The time-averaged number of wavefronts resident on the accelerator over
|
||||
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
||||
kernels (less than 1ms).
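Two relationships stated in the Wavefront descriptions can be checked numerically: the total wavefront count as the ceiling of grid size divided by 64, and Wave Cycles as the sum of Dependency Wait, Issue Wait, and Active Cycles. The sketch below is illustrative only. The function names and the 5% tolerance are assumptions, not values taken from the profiler.

```python
import math

WAVEFRONT_SIZE = 64  # work-items per wavefront, as stated in the description


def total_wavefronts(grid_size: int) -> int:
    """Ceiling of grid_size / 64 using integer arithmetic."""
    return -(-grid_size // WAVEFRONT_SIZE)


def cycles_balanced(wave_cycles: float, dependency_wait: float,
                    issue_wait: float, active: float,
                    rel_tol: float = 0.05) -> bool:
    """True if the three components sum to Wave Cycles within rel_tol.

    rel_tol is a hypothetical tolerance chosen for illustration.
    """
    return math.isclose(dependency_wait + issue_wait + active,
                        wave_cycles, rel_tol=rel_tol)
```

A check such as `cycles_balanced` can serve as a quick sanity test on collected counters before the individual components are interpreted.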
|
||||
|
||||
+82
-84
@@ -2,90 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1000
|
||||
title: Compute Units - Instruction Mix
|
||||
metrics_description:
|
||||
VALU: The total number of vector arithmetic logic unit (VALU) operations issued.
|
||||
These are the workhorses of the compute unit, and are used to execute a wide
|
||||
range of instruction types including floating point operations, non-uniform
|
||||
address calculations, transcendental operations, integer operations, shifts,
|
||||
conditional evaluation, etc.
|
||||
VMEM: The total number of vector memory operations issued. These include most
|
||||
loads, stores and atomic operations and all accesses to generic, global, private
|
||||
and texture memory.
|
||||
LDS: The total number of LDS (also known as shared memory) operations issued.
|
||||
These include loads, stores, atomics, and HIP's __shfl operations.
|
||||
MFMA: The total number of matrix fused multiply-add instructions issued.
|
||||
SALU: The total number of scalar arithmetic logic unit (SALU) operations issued.
|
||||
Typically these are used for address calculations, literal constants, and other
|
||||
operations that are provably uniform across a wavefront. Although scalar memory
|
||||
(SMEM) operations are issued by the SALU, they are counted separately in this
|
||||
section.
|
||||
SMEM: The total number of scalar memory (SMEM) operations issued. These are typically
|
||||
used for loading kernel arguments, base-pointers and loads from HIP's __constant__
|
||||
memory.
|
||||
Branch: The total number of branch operations issued. These typically consist
|
||||
of jump or branch operations and are used to implement control flow.
|
||||
INT32: The total number of instructions operating on 32-bit integer operands issued
|
||||
to the VALU per normalization unit.
|
||||
INT64: The total number of instructions operating on 64-bit integer operands issued
|
||||
to the VALU per normalization unit.
|
||||
F16-ADD: The total number of addition instructions operating on 16-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F16-MUL: The total number of multiplication instructions operating on 16-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F16-FMA: The total number of fused multiply-add instructions operating on 16-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F16-Trans: The total number of transcendental instructions (e.g., sqrt) operating
|
||||
on 16-bit floating-point operands issued to the VALU per normalization unit.
|
||||
F32-ADD: The total number of addition instructions operating on 32-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F32-MUL: The total number of multiplication instructions operating on 32-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F32-FMA: The total number of fused multiply-add instructions operating on 32-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F32-Trans: The total number of transcendental instructions (such as sqrt) operating
|
||||
on 32-bit floating-point operands issued to the VALU per normalization unit.
|
||||
F64-ADD: The total number of addition instructions operating on 64-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F64-MUL: The total number of multiplication instructions operating on 64-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F64-FMA: The total number of fused multiply-add instructions operating on 64-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F64-Trans: The total number of transcendental instructions (such as sqrt) operating
|
||||
on 64-bit floating-point operands issued to the VALU per normalization unit.
|
||||
Conversion: "The total number of type conversion instructions (such as converting\
|
||||
\ data to or from F32\u2194F64) issued to the VALU per normalization unit."
|
||||
Global/Generic Instr: The total number of global & generic memory instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Read: The total number of global & generic memory read instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Write: The total number of global & generic memory write instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Atomic: The total number of global & generic memory atomic (with
|
||||
and without return) instructions executed on all compute units on the accelerator,
|
||||
per normalization unit.
|
||||
Spill/Stack Instr: The total number of spill/stack memory instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Read: The total number of spill/stack memory read instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Write: The total number of spill/stack memory write instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Atomic: The total number of spill/stack memory atomic (with and without
|
||||
return) instructions executed on all compute units on the accelerator, per normalization
|
||||
unit. Usually unused, as these memory operations are typically used to implement
|
||||
thread-local storage.
|
||||
MFMA-I8: The total number of 8-bit integer MFMA instructions issued per normalization
|
||||
unit.
|
||||
MFMA-F8: The total number of 8-bit floating point MFMA instructions issued per
|
||||
normalization unit. This is supported in AMD Instinct MI300 series and later
|
||||
only.
|
||||
MFMA-F16: The total number of 16-bit floating point MFMA instructions issued per
|
||||
normalization unit.
|
||||
MFMA-BF16: The total number of 16-bit brain floating point MFMA instructions issued
|
||||
per normalization unit.
|
||||
MFMA-F32: The total number of 32-bit floating-point MFMA instructions issued per
|
||||
normalization unit.
|
||||
MFMA-F64: The total number of 64-bit floating-point MFMA instructions issued per
|
||||
normalization unit.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1001
|
||||
@@ -302,3 +218,85 @@ Panel Config:
|
||||
min: MIN((SQ_INSTS_VALU_MFMA_F64 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_MFMA_F64 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
metrics_description:
|
||||
VALU: The total number of vector arithmetic logic unit (VALU) operations issued.
|
||||
These are the workhorses of the compute unit, and are used to execute a wide
|
||||
range of instruction types including floating point operations, non-uniform
|
||||
address calculations, transcendental operations, integer operations, shifts,
|
||||
conditional evaluation, etc.
|
||||
VMEM: The total number of vector memory operations issued. These include most
|
||||
loads, stores and atomic operations and all accesses to generic, global, private
|
||||
and texture memory.
|
||||
LDS: The total number of LDS (also known as shared memory) operations issued.
|
||||
These include loads, stores, atomics, and HIP's __shfl operations.
|
||||
MFMA: The total number of matrix fused multiply-add instructions issued.
|
||||
SALU: The total number of scalar arithmetic logic unit (SALU) operations issued.
|
||||
Typically these are used for address calculations, literal constants, and other
|
||||
operations that are provably uniform across a wavefront. Although scalar memory
|
||||
(SMEM) operations are issued by the SALU, they are counted separately in this
|
||||
section.
|
||||
SMEM: The total number of scalar memory (SMEM) operations issued. These are typically
|
||||
used for loading kernel arguments, base-pointers and loads from HIP's __constant__
|
||||
memory.
|
||||
Branch: The total number of branch operations issued. These typically consist
|
||||
of jump or branch operations and are used to implement control flow.
|
||||
INT32: The total number of instructions operating on 32-bit integer operands issued
|
||||
to the VALU per normalization unit.
|
||||
INT64: The total number of instructions operating on 64-bit integer operands issued
|
||||
to the VALU per normalization unit.
|
||||
F16-ADD: The total number of addition instructions operating on 16-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F16-MUL: The total number of multiplication instructions operating on 16-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F16-FMA: The total number of fused multiply-add instructions operating on 16-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F16-Trans: The total number of transcendental instructions (e.g., sqrt) operating
|
||||
on 16-bit floating-point operands issued to the VALU per normalization unit.
|
||||
F32-ADD: The total number of addition instructions operating on 32-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F32-MUL: The total number of multiplication instructions operating on 32-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F32-FMA: The total number of fused multiply-add instructions operating on 32-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F32-Trans: The total number of transcendental instructions (such as sqrt) operating
|
||||
on 32-bit floating-point operands issued to the VALU per normalization unit.
|
||||
F64-ADD: The total number of addition instructions operating on 64-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F64-MUL: The total number of multiplication instructions operating on 64-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F64-FMA: The total number of fused multiply-add instructions operating on 64-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F64-Trans: The total number of transcendental instructions (such as sqrt) operating
|
||||
on 64-bit floating-point operands issued to the VALU per normalization unit.
|
||||
Conversion: |-
|
||||
The total number of type conversion instructions (such as converting
|
||||
data to or from F32↔F64) issued to the VALU per normalization unit.
|
||||
Global/Generic Instr: The total number of global & generic memory instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Read: The total number of global & generic memory read instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Write: The total number of global & generic memory write instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Atomic: The total number of global & generic memory atomic (with
|
||||
and without return) instructions executed on all compute units on the accelerator,
|
||||
per normalization unit.
|
||||
Spill/Stack Instr: The total number of spill/stack memory instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Read: The total number of spill/stack memory read instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Write: The total number of spill/stack memory write instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Atomic: The total number of spill/stack memory atomic (with and without
|
||||
return) instructions executed on all compute units on the accelerator, per normalization
|
||||
unit. Typically unused, as these memory operations are mainly used to implement
|
||||
thread-local storage.
|
||||
MFMA-I8: The total number of 8-bit integer MFMA instructions issued per normalization
|
||||
unit.
|
||||
MFMA-F16: The total number of 16-bit floating point MFMA instructions issued per
|
||||
normalization unit.
|
||||
MFMA-BF16: The total number of 16-bit brain floating point MFMA instructions issued
|
||||
per normalization unit.
|
||||
MFMA-F32: The total number of 32-bit floating-point MFMA instructions issued per
|
||||
normalization unit.
|
||||
MFMA-F64: The total number of 64-bit floating-point MFMA instructions issued per
|
||||
normalization unit.
|
||||
|
||||
+94
-87
@@ -2,84 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1100
|
||||
title: Compute Units - Compute Pipeline
|
||||
metrics_description:
|
||||
VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
|
||||
This is also presented as a percent of the peak theoretical FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any floating-point
|
||||
operations from MFMA instructions.'
|
||||
VALU IOPs: 'The total integer operations executed per second on the VALU. This
|
||||
is also presented as a percent of the peak theoretical IOPs achievable on the
|
||||
specific accelerator. Note: this does not include any integer operations from
|
||||
MFMA instructions.'
|
||||
MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
|
||||
executed per second. Note: this does not include any 16-bit brain floating point
|
||||
operations from VALU instructions. This is also presented as a percent of the
|
||||
peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F16 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 32-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F32 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 64-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F64 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA IOPs (INT8): 'The total number of 8-bit integer MFMA operations executed
|
||||
per second. Note: this does not include any 8-bit integer operations from VALU
|
||||
instructions. This is also presented as a percent of the peak theoretical INT8
|
||||
MFMA operations achievable on the specific accelerator.'
|
||||
IPC: The ratio of the total number of instructions executed on the CU over the
|
||||
total active CU cycles.
|
||||
IPC (Issued): The ratio of the total number of (non-internal) instructions issued
|
||||
over the number of cycles where the scheduler was actively working on issuing
|
||||
instructions.
|
||||
SALU Utilization: Indicates what percent of the kernel's duration the SALU was
|
||||
busy executing instructions. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing SALU / SMEM instructions over the total CU cycles.
|
||||
VALU Utilization: Indicates what percent of the kernel's duration the VALU was
|
||||
busy executing instructions. Does not include VMEM operations. Computed as the
|
||||
ratio of the total number of cycles spent by the scheduler issuing VALU instructions
|
||||
over the total CU cycles.
|
||||
VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
|
||||
was busy executing instructions, including both global/generic and spill/scratch
|
||||
operations (see the VMEM instruction count metrics for more detail). Does not
|
||||
include VALU operations. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing VMEM instructions over the total CU cycles.
|
||||
Branch Utilization: Indicates what percent of the kernel's duration the branch
|
||||
unit was busy executing instructions. Computed as the ratio of the total number
|
||||
of cycles spent by the scheduler issuing branch instructions over the total
|
||||
CU cycles.
|
||||
VALU Active Threads: Indicates the average level of divergence within a wavefront
|
||||
over the lifetime of the kernel. The number of work-items that were active in
|
||||
a wavefront during execution of each VALU instruction, time-averaged over all
|
||||
VALU instructions run on all wavefronts in the kernel.
|
||||
MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
|
||||
was busy executing instructions. Computed as the ratio of the total number of
|
||||
cycles the MFMA unit was busy over the total CU cycles.
|
||||
MFMA Instruction Cycles: The average duration of MFMA instructions in this kernel
|
||||
in cycles. Computed as the ratio of the total number of cycles the MFMA unit
|
||||
was busy over the total number of MFMA instructions.
|
||||
VMEM Latency: The average number of round-trip cycles (that is, from issue to
|
||||
data return / acknowledgment) required for a VMEM instruction to complete.
|
||||
SMEM Latency: The average number of round-trip cycles (that is, from issue to
|
||||
data return / acknowledgment) required for a SMEM instruction to complete.
|
||||
FLOPs (Total): The total number of floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
IOPs (Total): The total number of integer operations executed on either the VALU
|
||||
or MFMA units, per normalization unit.
|
||||
F16 OPs: The total number of 16-bit floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
BF16 OPs: The total number of 16-bit brain floating-point operations executed
|
||||
on either the VALU or MFMA units, per normalization unit.
|
||||
F32 OPs: The total number of 32-bit floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
F64 OPs: The total number of 64-bit floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
INT8 OPs: The total number of 8-bit integer operations executed on either the
|
||||
VALU or MFMA units, per normalization unit.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1101
|
||||
@@ -159,13 +81,13 @@ Panel Config:
|
||||
unit: Instr/cycle
|
||||
IPC (Issued):
|
||||
avg: AVG(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
|
||||
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
|
||||
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
|
||||
/ SQ_ACTIVE_INST_ANY))
|
||||
min: MIN(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
|
||||
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
|
||||
/ SQ_ACTIVE_INST_ANY))
|
||||
max: MAX(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
|
||||
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
|
||||
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
|
||||
/ SQ_ACTIVE_INST_ANY))
|
||||
unit: Instr/cycle
|
||||
SALU Utilization:
|
||||
@@ -262,7 +184,7 @@ Panel Config:
|
||||
* 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32)) + (64 * (((SQ_INSTS_VALU_ADD_F64
|
||||
+ SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64) + (SQ_INSTS_VALU_FMA_F64
|
||||
* 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64)) / $denom))
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
IOPs (Total):
|
||||
avg: AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8
|
||||
* 512)) / $denom)
|
||||
@@ -270,7 +192,7 @@ Panel Config:
|
||||
* 512)) / $denom)
|
||||
max: MAX(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8
|
||||
* 512)) / $denom)
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
F16 OPs:
|
||||
avg: AVG(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16))
|
||||
+ (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512
|
||||
@@ -281,12 +203,12 @@ Panel Config:
|
||||
max: MAX(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16))
|
||||
+ (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512
|
||||
* SQ_INSTS_VALU_MFMA_MOPS_F16)) / $denom))
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
BF16 OPs:
|
||||
avg: AVG(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
|
||||
min: MIN(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
|
||||
max: MAX(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
F32 OPs:
|
||||
avg: AVG((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
|
||||
+ (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
|
||||
@@ -297,7 +219,7 @@ Panel Config:
|
||||
max: MAX((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
|
||||
+ (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
|
||||
/ $denom))
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
F64 OPs:
|
||||
avg: AVG((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
|
||||
+ (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
|
||||
@@ -308,9 +230,94 @@ Panel Config:
|
||||
max: MAX((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
|
||||
+ (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
|
||||
/ $denom))
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
INT8 OPs:
|
||||
avg: AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
|
||||
min: MIN(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
|
||||
max: MAX(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
metrics_description:
|
||||
VALU FLOPs: |-
|
||||
The total floating-point operations executed per second on the VALU.
|
||||
This is also presented as a percent of the peak theoretical FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any floating-point
|
||||
operations from MFMA instructions.
|
||||
VALU IOPs: |-
|
||||
The total integer operations executed per second on the VALU. This is
|
||||
also presented as a percent of the peak theoretical IOPs achievable on the
|
||||
specific accelerator. Note: this does not include any integer operations from
|
||||
MFMA instructions.
|
||||
MFMA FLOPs (BF16): |-
|
||||
The total number of 16-bit brain floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit brain floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
BF16 MFMA operations achievable on the specific accelerator.
|
||||
MFMA FLOPs (F16): |-
|
||||
The total number of 16-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 16-bit floating point operations from
|
||||
VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F16 MFMA operations achievable on the specific accelerator.
|
||||
MFMA FLOPs (F32): |-
|
||||
The total number of 32-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 32-bit floating point operations from
|
||||
VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F32 MFMA operations achievable on the specific accelerator.
|
||||
MFMA FLOPs (F64): |-
|
||||
The total number of 64-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 64-bit floating point operations from
|
||||
VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F64 MFMA operations achievable on the specific accelerator.
|
||||
MFMA IOPs (INT8): |-
|
||||
The total number of 8-bit integer MFMA operations executed per second.
|
||||
Note: this does not include any 8-bit integer operations from VALU instructions.
|
||||
This is also presented as a percent of the peak theoretical INT8 MFMA operations
|
||||
achievable on the specific accelerator.
|
||||
IPC: The ratio of the total number of instructions executed on the CU over the
|
||||
total active CU cycles.
|
||||
IPC (Issued): The ratio of the total number of (non-internal) instructions issued
|
||||
over the number of cycles where the scheduler was actively working on issuing
|
||||
instructions.
|
||||
SALU Utilization: Indicates what percent of the kernel's duration the SALU was
|
||||
busy executing instructions. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing SALU / SMEM instructions over the total CU cycles.
|
||||
VALU Utilization: Indicates what percent of the kernel's duration the VALU was
|
||||
busy executing instructions. Does not include VMEM operations. Computed as the
|
||||
ratio of the total number of cycles spent by the scheduler issuing VALU instructions
|
||||
over the total CU cycles.
|
||||
VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
|
||||
was busy executing instructions, including both global/generic and spill/scratch
|
||||
operations (see the VMEM instruction count metrics for more detail). Does not
|
||||
include VALU operations. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing VMEM instructions over the total CU cycles.
|
||||
Branch Utilization: Indicates what percent of the kernel's duration the branch
|
||||
unit was busy executing instructions. Computed as the ratio of the total number
|
||||
of cycles spent by the scheduler issuing branch instructions over the total
|
||||
CU cycles.
|
||||
VALU Active Threads: Indicates the average level of divergence within a wavefront
|
||||
over the lifetime of the kernel. The number of work-items that were active in
|
||||
a wavefront during execution of each VALU instruction, time-averaged over all
|
||||
VALU instructions run on all wavefronts in the kernel.
|
||||
MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
|
||||
was busy executing instructions. Computed as the ratio of the total number of
|
||||
cycles the MFMA unit was busy over the total CU cycles.
|
||||
MFMA Instruction Cycles: The average duration of MFMA instructions in this kernel
|
||||
in cycles. Computed as the ratio of the total number of cycles the MFMA unit
|
||||
was busy over the total number of MFMA instructions.
|
||||
VMEM Latency: The average number of round-trip cycles (that is, from issue to
|
||||
data return / acknowledgment) required for a VMEM instruction to complete.
|
||||
SMEM Latency: The average number of round-trip cycles (that is, from issue to
|
||||
data return / acknowledgment) required for a SMEM instruction to complete.
|
||||
FLOPs (Total): The total number of floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
IOPs (Total): The total number of integer operations executed on either the VALU
|
||||
or MFMA units, per normalization unit.
|
||||
F16 OPs: The total number of 16-bit floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
BF16 OPs: The total number of 16-bit brain floating-point operations executed
|
||||
on either the VALU or MFMA units, per normalization unit.
|
||||
F32 OPs: The total number of 32-bit floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
F64 OPs: The total number of 64-bit floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
INT8 OPs: The total number of 8-bit integer operations executed on either the
|
||||
VALU or MFMA units, per normalization unit.
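As a reading aid for the F32 OPs expression in the hunks above, the following is a minimal Python sketch of how that expression could be evaluated per dispatch and then reduced with avg/min/max. The counter names and the 64, 2, and 512 multipliers are copied from the expression. The function names, the constant names, and the dict-based input are assumptions made for illustration and are not part of the profiler.

```python
# Illustrative sketch only: f32_ops, summarize, VALU_FACTOR and MFMA_FACTOR
# are hypothetical names; only the SQ_* counter names come from the config.
from statistics import mean

VALU_FACTOR = 64   # multiplier applied to the VALU instruction counts
MFMA_FACTOR = 512  # multiplier applied to SQ_INSTS_VALU_MFMA_MOPS_F32


def f32_ops(counters, denom):
    """Evaluate the F32 OPs expression for a single dispatch."""
    valu = VALU_FACTOR * (
        counters["SQ_INSTS_VALU_ADD_F32"]
        + counters["SQ_INSTS_VALU_MUL_F32"]
        + counters["SQ_INSTS_VALU_TRANS_F32"]
        + 2 * counters["SQ_INSTS_VALU_FMA_F32"]  # FMA count is doubled in the expression
    )
    mfma = MFMA_FACTOR * counters["SQ_INSTS_VALU_MFMA_MOPS_F32"]
    return (valu + mfma) / denom


def summarize(values):
    """Reduce per-dispatch values the way the avg/min/max keys do."""
    return {"avg": mean(values), "min": min(values), "max": max(values)}
```

Here `denom` stands in for `$denom`, whose value depends on the chosen normalization unit and is not defined in this section.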
|
||||
|
||||
+52
-51
@@ -2,51 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1200
|
||||
title: Local Data Share (LDS)
|
||||
metrics_description:
|
||||
Utilization: Indicates what percent of the kernel's duration the LDS was actively
|
||||
executing instructions (including, but not limited to, load, store, atomic and
|
||||
HIP's __shfl operations). Calculated as the ratio of the total number of cycles
|
||||
LDS was active over the total CU cycles.
|
||||
Access Rate: Indicates the percentage of SIMDs in the VALU actively issuing LDS
|
||||
instructions, averaged over the lifetime of the kernel. Calculated as the ratio
|
||||
of the total number of cycles spent by the scheduler issuing LDS instructions
|
||||
over the total CU cycles.
|
||||
Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
|
||||
could have been loaded from, stored to, or atomically updated in the LDS, expressed
as a percentage of the theoretical peak. Does not take into account the execution
|
||||
mask of the wavefront when the instruction was executed.
|
||||
Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
|
||||
loaded from, stored to, or atomically updated in the LDS divided by total duration.
|
||||
Does not take into account the execution mask of the wavefront when the instruction
|
||||
was executed.
|
||||
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
|
||||
servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing
|
||||
bank conflicts over the number of LDS cycles that would have been required to
|
||||
move the same amount of data in an uncontended access.
|
||||
LDS Instructions: The total number of LDS instructions (including, but not limited
|
||||
to, read/write/atomics and HIP's __shfl instructions) executed per normalization
|
||||
unit.
|
||||
LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
|
||||
/ acknowledgment) required for an LDS instruction to complete.
|
||||
Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS scheduler
|
||||
due to bank conflicts (as determined by the conflict resolution hardware) to
|
||||
the base number of cycles that would be spent in the LDS scheduler in a completely
|
||||
uncontended case. This is the unnormalized form of the Bank Conflict Rate.
|
||||
Index Accesses: The total number of cycles spent in the LDS scheduler over all
|
||||
operations per normalization unit.
|
||||
Atomic Return Cycles: The total number of cycles spent on LDS atomics with return
|
||||
per normalization unit.
|
||||
Bank Conflict: The total number of cycles spent in the LDS scheduler due to bank
|
||||
conflicts (as determined by the conflict resolution hardware) per normalization
|
||||
unit.
|
||||
Addr Conflict: The total number of cycles spent in the LDS scheduler due to address
|
||||
conflicts (as determined by the conflict resolution hardware) per normalization
|
||||
unit.
|
||||
Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
|
||||
stalls from non-dword aligned addresses per normalization unit.
|
||||
Mem Violations: "The total number of out-of-bounds accesses made to the LDS, per\
|
||||
\ normalization unit. This is unused and expected to be zero in most configurations\
|
||||
\ for modern CDNA\u2122 accelerators."
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1201
|
||||
@@ -87,7 +42,7 @@ Panel Config:
|
||||
avg: AVG((SQ_INSTS_LDS / $denom))
|
||||
min: MIN((SQ_INSTS_LDS / $denom))
|
||||
max: MAX((SQ_INSTS_LDS / $denom))
|
||||
unit: (Instr + $normUnit)
|
||||
unit: (Instr + $normUnit)
|
||||
Theoretical Bandwidth:
|
||||
avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ (End_Timestamp - Start_Timestamp)))
|
||||
@@ -117,29 +72,75 @@ Panel Config:
|
||||
avg: AVG((SQ_LDS_IDX_ACTIVE / $denom))
|
||||
min: MIN((SQ_LDS_IDX_ACTIVE / $denom))
|
||||
max: MAX((SQ_LDS_IDX_ACTIVE / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
Atomic Return Cycles:
|
||||
avg: AVG((SQ_LDS_ATOMIC_RETURN / $denom))
|
||||
min: MIN((SQ_LDS_ATOMIC_RETURN / $denom))
|
||||
max: MAX((SQ_LDS_ATOMIC_RETURN / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
Bank Conflict:
|
||||
avg: AVG((SQ_LDS_BANK_CONFLICT / $denom))
|
||||
min: MIN((SQ_LDS_BANK_CONFLICT / $denom))
|
||||
max: MAX((SQ_LDS_BANK_CONFLICT / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
Addr Conflict:
|
||||
avg: AVG((SQ_LDS_ADDR_CONFLICT / $denom))
|
||||
min: MIN((SQ_LDS_ADDR_CONFLICT / $denom))
|
||||
max: MAX((SQ_LDS_ADDR_CONFLICT / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
Unaligned Stall:
|
||||
avg: AVG((SQ_LDS_UNALIGNED_STALL / $denom))
|
||||
min: MIN((SQ_LDS_UNALIGNED_STALL / $denom))
|
||||
max: MAX((SQ_LDS_UNALIGNED_STALL / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
Mem Violations:
|
||||
avg: AVG((SQ_LDS_MEM_VIOLATIONS / $denom))
|
||||
min: MIN((SQ_LDS_MEM_VIOLATIONS / $denom))
|
||||
max: MAX((SQ_LDS_MEM_VIOLATIONS / $denom))
|
||||
unit: (Accesses + $normUnit)
|
||||
metrics_description:
|
||||
Utilization: Indicates what percent of the kernel's duration the LDS was actively
|
||||
executing instructions (including, but not limited to, load, store, atomic and
|
||||
HIP's __shfl operations). Calculated as the ratio of the total number of cycles
|
||||
LDS was active over the total CU cycles.
|
||||
Access Rate: Indicates the percentage of SIMDs in the VALU actively issuing LDS
|
||||
instructions, averaged over the lifetime of the kernel. Calculated as the ratio
|
||||
of the total number of cycles spent by the scheduler issuing LDS instructions
|
||||
over the total CU cycles.
|
||||
Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
|
||||
could have been loaded from, stored to, or atomically updated in the LDS, expressed
as a percentage of the theoretical peak. Does not take into account the execution
|
||||
mask of the wavefront when the instruction was executed.
|
||||
Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
|
||||
loaded from, stored to, or atomically updated in the LDS divided by total duration.
|
||||
Does not take into account the execution mask of the wavefront when the instruction
|
||||
was executed.
|
||||
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
|
||||
servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing
|
||||
bank conflicts over the number of LDS cycles that would have been required to
|
||||
move the same amount of data in an uncontended access.
|
||||
LDS Instructions: The total number of LDS instructions (including, but not limited
|
||||
to, read/write/atomics and HIP's __shfl instructions) executed per normalization
|
||||
unit.
|
||||
LDS Latency: The average number of round-trip cycles (i.e., from issue to data return
/ acknowledgment) required for an LDS instruction to complete.
|
||||
Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS scheduler
|
||||
due to bank conflicts (as determined by the conflict resolution hardware) to
|
||||
the base number of cycles that would be spent in the LDS scheduler in a completely
|
||||
uncontended case. This is the unnormalized form of the Bank Conflict Rate.
|
||||
Index Accesses: The total number of cycles spent in the LDS scheduler over all
|
||||
operations per normalization unit.
|
||||
Atomic Return Cycles: The total number of cycles spent on LDS atomics with return
|
||||
per normalization unit.
|
||||
Bank Conflict: The total number of cycles spent in the LDS scheduler due to bank
|
||||
conflicts (as determined by the conflict resolution hardware) per normalization
|
||||
unit.
|
||||
Addr Conflict: The total number of cycles spent in the LDS scheduler due to address
|
||||
conflicts (as determined by the conflict resolution hardware) per normalization
|
||||
unit.
|
||||
Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
|
||||
stalls from non-dword aligned addresses per normalization unit.
|
||||
Mem Violations: |-
|
||||
The total number of out-of-bounds accesses made to the LDS, per normalization
|
||||
unit. This is unused and expected to be zero in most configurations for
|
||||
modern CDNA™ accelerators.
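The Theoretical Bandwidth expression shown earlier in this file's diff can be sketched in Python as follows. The arithmetic mirrors the expression: active LDS cycles minus bank-conflict cycles, times 4, times the bank count, divided by the dispatch duration. The function name, argument names, and the error raised for a non-positive duration are assumptions for illustration. The timestamp unit is not stated in this section, so the result is simply bytes per timestamp tick.

```python
# Illustrative sketch only: the function and argument names are hypothetical.
def lds_theoretical_bandwidth(idx_active, bank_conflict, banks_per_cu,
                              start_ts, end_ts):
    """Evaluate ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4 * banks) / duration."""
    elapsed = end_ts - start_ts
    if elapsed <= 0:
        # Assumption: a non-positive duration is treated as invalid input.
        raise ValueError("end_ts must be greater than start_ts")
    conflict_free_cycles = idx_active - bank_conflict
    return (conflict_free_cycles * 4 * banks_per_cu) / elapsed
```

Here `banks_per_cu` stands in for `TO_INT($lds_banks_per_cu)`.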
|
||||
|
||||
+26
-26
@@ -2,28 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1300
|
||||
title: Instruction Cache
|
||||
metrics_description:
|
||||
Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
|
||||
of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
|
||||
the total L1I cycles.
|
||||
Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
|
||||
loaded line in the cache. Calculated as the ratio of the number of L1I requests
|
||||
that hit over the number of all L1I requests.
|
||||
L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\
|
||||
\ L2 cache request bandwidth achieved. Calculated as the ratio of the total\
|
||||
\ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\
|
||||
\ cycles."
|
||||
L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
|
||||
divided by total duration.
|
||||
Req: The total number of requests made to the L1I per normalization-unit.
|
||||
Hits: The total number of L1I requests that hit on a previously loaded cache line,
|
||||
per normalization-unit.
|
||||
Misses - Non Duplicated: The total number of L1I requests that missed on a cache
|
||||
line that was not already pending due to another request, per normalization-unit.
|
||||
Misses - Duplicated: The total number of L1I requests that missed on a cache line
|
||||
that was already pending due to another request, per normalization-unit.
|
||||
Instruction Fetch Latency: The average number of cycles spent to fetch instructions
|
||||
to a CU.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1301
|
||||
@@ -62,22 +40,22 @@ Panel Config:
|
||||
avg: AVG((SQC_ICACHE_REQ / $denom))
|
||||
min: MIN((SQC_ICACHE_REQ / $denom))
|
||||
max: MAX((SQC_ICACHE_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Hits:
|
||||
avg: AVG((SQC_ICACHE_HITS / $denom))
|
||||
min: MIN((SQC_ICACHE_HITS / $denom))
|
||||
max: MAX((SQC_ICACHE_HITS / $denom))
|
||||
unit: (Hits + $normUnit)
|
||||
unit: (Hits + $normUnit)
|
||||
Misses - Non Duplicated:
|
||||
avg: AVG((SQC_ICACHE_MISSES / $denom))
|
||||
min: MIN((SQC_ICACHE_MISSES / $denom))
|
||||
max: MAX((SQC_ICACHE_MISSES / $denom))
|
||||
unit: (Misses + $normUnit)
|
||||
unit: (Misses + $normUnit)
|
||||
Misses - Duplicated:
|
||||
avg: AVG((SQC_ICACHE_MISSES_DUPLICATE / $denom))
|
||||
min: MIN((SQC_ICACHE_MISSES_DUPLICATE / $denom))
|
||||
max: MAX((SQC_ICACHE_MISSES_DUPLICATE / $denom))
|
||||
unit: (Misses + $normUnit)
|
||||
unit: (Misses + $normUnit)
|
||||
Cache Hit Rate:
|
||||
avg: AVG(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
|
||||
+ SQC_ICACHE_MISSES_DUPLICATE)))
|
||||
@@ -107,3 +85,25 @@ Panel Config:
|
||||
min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: Gbps
|
||||
metrics_description:
|
||||
Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
|
||||
of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
|
||||
the total L1I cycles.
|
||||
Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
|
||||
loaded line in the cache. Calculated as the ratio of the number of L1I requests
|
||||
that hit over the number of all L1I requests.
|
||||
L1I-L2 Bandwidth Utilization: |-
|
||||
The percent of the peak theoretical L1I → L2 cache request bandwidth
|
||||
achieved. Calculated as the ratio of the total number of requests from the
|
||||
L1I to the L2 cache over the total L1I-L2 interface cycles.
|
||||
L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
|
||||
divided by total duration.
|
||||
Req: The total number of requests made to the L1I per normalization-unit.
|
||||
Hits: The total number of L1I requests that hit on a previously loaded cache line,
|
||||
per normalization-unit.
|
||||
Misses - Non Duplicated: The total number of L1I requests that missed on a cache
|
||||
line that was not already pending due to another request, per normalization-unit.
|
||||
Misses - Duplicated: The total number of L1I requests that missed on a cache line
|
||||
that was already pending due to another request, per normalization-unit.
|
||||
Instruction Fetch Latency: The average number of cycles spent to fetch instructions
|
||||
to a CU.
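The Cache Hit Rate and L1I-L2 Bandwidth expressions from this file's hunks can be sketched in Python as below. The counter relationships and the factor of 64 applied to each request are copied from the expressions. The function names and the choice to return `None` when no requests were recorded are assumptions for illustration, since the config does not say how an empty denominator is handled.

```python
# Illustrative sketch only: function names and None-on-empty are hypothetical.
def l1i_hit_rate(hits, misses, misses_duplicate):
    """100 * hits / (hits + non-duplicated misses + duplicated misses)."""
    total = hits + misses + misses_duplicate
    if total == 0:
        return None  # assumption: no requests means the rate is undefined
    return (100 * hits) / total


def l1i_l2_bandwidth(tc_inst_req, start_ts, end_ts):
    """(SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)."""
    return (tc_inst_req * 64) / (end_ts - start_ts)
```

Note that the hit rate denominator counts duplicated misses as requests, so a burst of requests to a line that is already being fetched lowers the reported hit rate.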
|
||||
|
||||
+61
-58
@@ -2,49 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1400
|
||||
title: Scalar L1 Data Cache
|
||||
metrics_description:
|
||||
Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
|
||||
of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
|
||||
over the total sL1D cycles.
|
||||
Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
|
||||
loaded line in the cache. The ratio of the number of sL1D requests that hit over
|
||||
the number of all sL1D requests.
|
||||
sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
|
||||
bandwidth achieved. Calculated as the total number of bytes read from, written
to, or atomically updated across the sL1D - L2 interface.
|
||||
sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
|
||||
\ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\
|
||||
\ writes and atomics are typically unused on current CDNA accelerators, so in\
|
||||
\ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
|
||||
Req: The total number of requests, of any size or type, made to the sL1D per normalization
|
||||
unit.
|
||||
Hits: The total number of sL1D requests that hit on a previously loaded cache
|
||||
line, per normalization unit.
|
||||
Misses - Non Duplicated: 'The total number of sL1D requests that missed on a cache
|
||||
line that was not already pending due to another request, per normalization
|
||||
unit. '
|
||||
Misses- Duplicated: The total number of sL1D requests that missed on a cache line
|
||||
that was already pending due to another request, per normalization unit.
|
||||
Read Req (Total): The total number of sL1D read requests of any size, per normalization
|
||||
unit.
|
||||
Atomic Req: The total number of atomic requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
Read Req (1 DWord): The total number of sL1D read requests made for a single dword
|
||||
of data (4B), per normalization unit.
|
||||
Read Req (2 DWord): The total number of sL1D read requests made for two dwords
|
||||
of data (8B), per normalization unit.
|
||||
Read Req (4 DWord): The total number of sL1D read requests made for four dwords
|
||||
of data (16B), per normalization unit.
|
||||
Read Req (8 DWord): The total number of sL1D read requests made for eight dwords
|
||||
of data (32B), per normalization unit.
|
||||
Read Req (16 DWord): The total number of sL1D read requests made for sixteen
|
||||
dwords of data (64B), per normalization unit.
|
||||
Read Req: The total number of read requests from sL1D to the L2 per normalization
|
||||
unit.
|
||||
Write Req: The total number of write requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
Stall Cycles: "The total number of cycles the sL1D\u2194L2 interface was stalled,\
|
||||
\ per normalization unit."
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1401
|
||||
@@ -84,22 +41,22 @@ Panel Config:
|
||||
avg: AVG((SQC_DCACHE_REQ / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Hits:
|
||||
avg: AVG((SQC_DCACHE_HITS / $denom))
|
||||
min: MIN((SQC_DCACHE_HITS / $denom))
|
||||
max: MAX((SQC_DCACHE_HITS / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Misses - Non Duplicated:
|
||||
avg: AVG((SQC_DCACHE_MISSES / $denom))
|
||||
min: MIN((SQC_DCACHE_MISSES / $denom))
|
||||
max: MAX((SQC_DCACHE_MISSES / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Misses- Duplicated:
|
||||
avg: AVG((SQC_DCACHE_MISSES_DUPLICATE / $denom))
|
||||
min: MIN((SQC_DCACHE_MISSES_DUPLICATE / $denom))
|
||||
max: MAX((SQC_DCACHE_MISSES_DUPLICATE / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Cache Hit Rate:
|
||||
avg: AVG((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
@@ -118,37 +75,37 @@ Panel Config:
|
||||
+ SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
|
||||
max: MAX((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
|
||||
+ SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Atomic Req:
|
||||
avg: AVG((SQC_DCACHE_ATOMIC / $denom))
|
||||
min: MIN((SQC_DCACHE_ATOMIC / $denom))
|
||||
max: MAX((SQC_DCACHE_ATOMIC / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (1 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_1 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_1 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_1 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (2 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_2 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_2 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_2 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (4 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_4 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_4 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_4 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (8 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_8 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_8 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_8 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (16 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_16 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_16 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_16 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1403
|
||||
title: Scalar L1D Cache - L2 Interface
|
||||
@@ -171,19 +128,65 @@ Panel Config:
|
||||
avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_READ_REQ / $denom))
|
||||
max: MAX((SQC_TC_DATA_READ_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Write Req:
|
||||
avg: AVG((SQC_TC_DATA_WRITE_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_WRITE_REQ / $denom))
|
||||
max: MAX((SQC_TC_DATA_WRITE_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Atomic Req:
|
||||
avg: AVG((SQC_TC_DATA_ATOMIC_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_ATOMIC_REQ / $denom))
|
||||
max: MAX((SQC_TC_DATA_ATOMIC_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Stall Cycles:
|
||||
avg: AVG((SQC_TC_STALL / $denom))
|
||||
min: MIN((SQC_TC_STALL / $denom))
|
||||
max: MAX((SQC_TC_STALL / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
metrics_description:
|
||||
Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
|
||||
of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
|
||||
over the total sL1D cycles.
|
||||
Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
|
||||
loaded line in the cache. The ratio of the number of sL1D requests that hit over
|
||||
the number of all sL1D requests.
|
||||
sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
|
||||
bandwidth achieved. Calculated as total number of bytes read from, written to,
|
||||
or atomically updated across the sL1D - L2 interface.
|
||||
sL1D-L2 BW: |-
|
||||
The total number of bytes read from, written to, or atomically updated
|
||||
across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D
|
||||
writes and atomics are typically unused on current CDNA accelerators, so
|
||||
in the majority of cases this can be interpreted as an sL1D\u2192L2 read
|
||||
bandwidth.
|
||||
Req: The total number of requests, of any size or type, made to the sL1D per normalization
|
||||
unit.
|
||||
Hits: The total number of sL1D requests that hit on a previously loaded cache
|
||||
line, per normalization unit.
|
||||
Misses - Non Duplicated: |-
|
||||
The total number of sL1D requests that missed on a cache line that was
|
||||
not already pending due to another request, per normalization unit.
|
||||
Misses- Duplicated: The total number of sL1D requests that missed on a cache line
|
||||
that was already pending due to another request, per normalization unit.
|
||||
Read Req (Total): The total number of sL1D read requests of any size, per normalization
|
||||
unit.
|
||||
Atomic Req: The total number of atomic requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
Read Req (1 DWord): The total number of sL1D read requests made for a single dword
|
||||
of data (4B), per normalization unit.
|
||||
Read Req (2 DWord): The total number of sL1D read requests made for two dwords
|
||||
of data (8B), per normalization unit.
|
||||
Read Req (4 DWord): The total number of sL1D read requests made for four dwords
|
||||
of data (16B), per normalization unit.
|
||||
Read Req (8 DWord): The total number of sL1D read requests made for eight dwords
|
||||
of data (32B), per normalization unit.
|
||||
Read Req (16 DWord): The total number of sL1D read requests made for sixteen
|
||||
dwords of data (64B), per normalization unit.
|
||||
Read Req: The total number of read requests from sL1D to the L2 per normalization
|
||||
unit.
|
||||
Write Req: The total number of write requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
Stall Cycles: |-
|
||||
The total number of cycles the sL1D\u2194L2 interface was stalled, per
|
||||
normalization unit.
|
||||
|
||||
+77
-80
@@ -2,70 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1500
|
||||
title: Address Processing Unit and Data Return Path (TA/TD)
|
||||
metrics_description:
|
||||
Address Processing Unit Busy: Percent of the total CU cycles the address processor
|
||||
was busy
|
||||
Address Stall: Percent of the total CU cycles the address processor was stalled
|
||||
from sending address requests further into the vL1D pipeline.
|
||||
Data Stall: Percent of the total CU cycles the address processor was stalled from
|
||||
sending write/atomic data further into the vL1D pipeline.
|
||||
"Data-Processor \u2192 Address Stall": Percent of total CU cycles the address
|
||||
processor was stalled waiting to send command data to the data processor.
|
||||
Total Instructions: The total number of memory instructions executed by the address
|
||||
processor over all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Instructions: The total number of global & generic memory instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Read Instructions: The total number of global & generic memory
|
||||
read instructions executed on all compute units on the accelerator, per normalization
|
||||
unit.
|
||||
Global/Generic Write Instructions: The total number of global & generic memory
|
||||
write instructions executed on all compute units on the accelerator, per normalization
|
||||
unit.
|
||||
Global/Generic Atomic Instructions: The total number of global & generic memory
|
||||
atomic (with and without return) instructions executed on all compute units
|
||||
on the accelerator, per normalization unit.
|
||||
Spill/Stack Instructions: The total number of spill/stack memory instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Read Instructions: The total number of spill/stack memory read instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Write Instructions: The total number of spill/stack memory write instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Atomic Instructions: The total number of spill/stack memory atomic
|
||||
(with and without return) instructions executed on all compute units on the
|
||||
accelerator, per normalization unit. Typically unused as these memory operations
|
||||
are mainly used to implement thread-local storage.
|
||||
Spill/Stack Total Cycles: The number of cycles the address processing unit spent
|
||||
working on spill/stack instructions, per normalization unit.
|
||||
Spill/Stack Coalesced Read: The number of cycles the address processing unit spent
|
||||
working on coalesced spill/stack read instructions, per normalization unit.
|
||||
Spill/Stack Coalesced Write: The number of cycles the address processing unit
|
||||
spent working on coalesced spill/stack write instructions, per normalization
|
||||
unit.
|
||||
Data-Return Busy: Percent of the total CU cycles the data-return unit was busy
|
||||
processing or waiting on data to return to the CU.
|
||||
"Cache RAM \u2192 Data-Return Stall": Percent of the total CU cycles the data-return
|
||||
unit was stalled on data to be returned from the vL1D Cache RAM.
|
||||
"Workgroup manager \u2192 Data-Return Stall": Percent of the total CU cycles the
|
||||
data-return unit was stalled by the workgroup manager due to initialization
|
||||
of registers as a part of launching new workgroups.
|
||||
Coalescable Instructions: The number of instructions submitted to the data-return
|
||||
unit by the address processor that were found to be coalescable, per normalization
|
||||
unit.
|
||||
Read Instructions: The number of read instructions submitted to the data-return
|
||||
unit by the address processor summed over all compute units on the accelerator,
|
||||
per normalization unit. This is expected to be the sum of global/generic and
|
||||
spill/stack reads in the address processor.
|
||||
Write Instructions: The number of store instructions submitted to the data-return
|
||||
unit by the address processor summed over all compute units on the accelerator,
|
||||
per normalization unit. This is expected to be the sum of global/generic and
|
||||
spill/stack stores in the address processor.
|
||||
Atomic Instructions: The number of atomic instructions submitted to the data-return
|
||||
unit by the address processor summed over all compute units on the accelerator,
|
||||
per normalization unit. This is expected to be the sum of global/generic and
|
||||
spill/stack atomics in the address processor.
|
||||
Write Ack Instructions: The total number of write acknowledgements submitted by
|
||||
data-return unit to SQ, summed over all compute units on the accelerator, per
|
||||
normalization unit.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1501
|
||||
@@ -135,47 +71,47 @@ Panel Config:
|
||||
avg: AVG((TA_TOTAL_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_TOTAL_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_TOTAL_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Global/Generic Instructions:
|
||||
avg: AVG((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Global/Generic Read Instructions:
|
||||
avg: AVG((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Global/Generic Write Instructions:
|
||||
avg: AVG((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Global/Generic Atomic Instructions:
|
||||
avg: AVG((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Spill/Stack Instructions:
|
||||
avg: AVG((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Spill/Stack Read Instructions:
|
||||
avg: AVG((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Spill/Stack Write Instructions:
|
||||
avg: AVG((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Spill/Stack Atomic Instructions:
|
||||
avg: AVG((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
- metric_table:
|
||||
id: 1503
|
||||
title: Spill and stack metrics
|
||||
@@ -190,17 +126,17 @@ Panel Config:
|
||||
avg: AVG((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
|
||||
min: MIN((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
|
||||
max: MAX((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
Spill/Stack Coalesced Read:
|
||||
avg: AVG((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
|
||||
min: MIN((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
|
||||
max: MAX((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
Spill/Stack Coalesced Write:
|
||||
avg: AVG((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
|
||||
min: MIN((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
|
||||
max: MAX((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
- metric_table:
|
||||
id: 1504
|
||||
title: Vector L1 data-return path or Texture Data (TD)
|
||||
@@ -230,7 +166,7 @@ Panel Config:
|
||||
avg: AVG((TD_COALESCABLE_WAVEFRONT_sum / $denom))
|
||||
min: MIN((TD_COALESCABLE_WAVEFRONT_sum / $denom))
|
||||
max: MAX((TD_COALESCABLE_WAVEFRONT_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Read Instructions:
|
||||
avg: AVG((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
|
||||
/ $denom))
|
||||
@@ -238,14 +174,75 @@ Panel Config:
|
||||
/ $denom))
|
||||
max: MAX((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
|
||||
/ $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Write Instructions:
|
||||
avg: AVG((TD_STORE_WAVEFRONT_sum / $denom))
|
||||
min: MIN((TD_STORE_WAVEFRONT_sum / $denom))
|
||||
max: MAX((TD_STORE_WAVEFRONT_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Atomic Instructions:
|
||||
avg: AVG((TD_ATOMIC_WAVEFRONT_sum / $denom))
|
||||
min: MIN((TD_ATOMIC_WAVEFRONT_sum / $denom))
|
||||
max: MAX((TD_ATOMIC_WAVEFRONT_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
metrics_description:
|
||||
Address Processing Unit Busy: Percent of the total CU cycles the address processor
|
||||
was busy
|
||||
Address Stall: Percent of the total CU cycles the address processor was stalled
|
||||
from sending address requests further into the vL1D pipeline.
|
||||
Data Stall: Percent of the total CU cycles the address processor was stalled from
|
||||
sending write/atomic data further into the vL1D pipeline.
|
||||
"Data-Processor \u2192 Address Stall": Percent of total CU cycles the address
|
||||
processor was stalled waiting to send command data to the data processor.
|
||||
Total Instructions: The total number of memory instructions executed by the address
|
||||
processor over all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Instructions: The total number of global & generic memory instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Read Instructions: The total number of global & generic memory
|
||||
read instructions executed on all compute units on the accelerator, per normalization
|
||||
unit.
|
||||
Global/Generic Write Instructions: The total number of global & generic memory
|
||||
write instructions executed on all compute units on the accelerator, per normalization
|
||||
unit.
|
||||
Global/Generic Atomic Instructions: The total number of global & generic memory
|
||||
atomic (with and without return) instructions executed on all compute units
|
||||
on the accelerator, per normalization unit.
|
||||
Spill/Stack Instructions: The total number of spill/stack memory instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Read Instructions: The total number of spill/stack memory read instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Write Instructions: The total number of spill/stack memory write instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Atomic Instructions: The total number of spill/stack memory atomic
|
||||
(with and without return) instructions executed on all compute units on the
|
||||
accelerator, per normalization unit. Typically unused as these memory operations
|
||||
are mainly used to implement thread-local storage.
|
||||
Spill/Stack Total Cycles: The number of cycles the address processing unit spent
|
||||
working on spill/stack instructions, per normalization unit.
|
||||
Spill/Stack Coalesced Read: The number of cycles the address processing unit spent
|
||||
working on coalesced spill/stack read instructions, per normalization unit.
|
||||
Spill/Stack Coalesced Write: The number of cycles the address processing unit
|
||||
spent working on coalesced spill/stack write instructions, per normalization
|
||||
unit.
|
||||
Data-Return Busy: Percent of the total CU cycles the data-return unit was busy
|
||||
processing or waiting on data to return to the CU.
|
||||
"Cache RAM \u2192 Data-Return Stall": Percent of the total CU cycles the data-return
|
||||
unit was stalled on data to be returned from the vL1D Cache RAM.
|
||||
"Workgroup manager \u2192 Data-Return Stall": Percent of the total CU cycles the
|
||||
data-return unit was stalled by the workgroup manager due to initialization
|
||||
of registers as a part of launching new workgroups.
|
||||
Coalescable Instructions: The number of instructions submitted to the data-return
|
||||
unit by the address processor that were found to be coalescable, per normalization
|
||||
unit.
|
||||
Read Instructions: The number of read instructions submitted to the data-return
|
||||
unit by the address processor summed over all compute units on the accelerator,
|
||||
per normalization unit. This is expected to be the sum of global/generic and
|
||||
spill/stack reads in the address processor.
|
||||
Write Instructions: The number of store instructions submitted to the data-return
|
||||
unit by the address processor summed over all compute units on the accelerator,
|
||||
per normalization unit. This is expected to be the sum of global/generic and
|
||||
spill/stack stores in the address processor.
|
||||
Atomic Instructions: The number of atomic instructions submitted to the data-return
|
||||
unit by the address processor summed over all compute units on the accelerator,
|
||||
per normalization unit. This is expected to be the sum of global/generic and
|
||||
spill/stack atomics in the address processor.
|
||||
|
||||
+132
-132
@@ -2,117 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1600
|
||||
title: Vector L1 Data Cache
|
||||
metrics_description:
|
||||
Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
|
||||
cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
|
||||
of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator. The number of bytes is calculated as the number
|
||||
of cache lines requested multiplied by the cache line size. This value does
|
||||
not consider partial requests, so for instance, if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
|
||||
The number of cycles where the vL1D Cache RAM is actively processing any request
|
||||
divided by the number of cycles where the vL1D is active.
|
||||
Coalescing: Indicates how well memory instructions were coalesced by the address
|
||||
processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
|
||||
as the average number of thread-requests generated per instruction divided by
|
||||
the ideal number of thread-requests per instruction.
|
||||
Stalled on L2 Data: The ratio of the number of cycles where the vL1D is stalled
|
||||
waiting for requested data to return from the L2 cache divided by the number
|
||||
of cycles where the vL1D is active.
|
||||
Stalled on L2 Req: The ratio of the number of cycles where the vL1D is stalled
|
||||
waiting to issue a request for data to the L2 cache divided by the number of
|
||||
cycles where the vL1D is active.
|
||||
Tag RAM Stall (Read): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Read requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Tag RAM Stall (Write): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Write requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Tag RAM Stall (Atomic): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Atomic requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Total Req: The total number of incoming requests from the address processing unit
|
||||
after coalescing.
|
||||
Read Req: The total number of incoming read requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Write Req: The total number of incoming write requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Atomic Req: The total number of incoming atomic requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
|
||||
instructions divided by total duration. The number of bytes is calculated as
|
||||
the number of cache lines requested multiplied by the cache line size. This
|
||||
value does not consider partial requests, so for instance, if only a single
|
||||
value is requested in a cache line, the data movement will still be counted
|
||||
as a full cache line.
|
||||
Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
|
||||
vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
Cache Accesses: The total number of cache line lookups in the vL1D.
|
||||
Cache Hits: The number of cache accesses minus the number of outgoing requests
|
||||
to the L2 cache, that is, the number of cache line requests serviced by the
|
||||
vL1D Cache RAM per normalization unit.
|
||||
Invalidations: The number of times the vL1D was issued a write-back invalidate
|
||||
command during the kernel's execution per normalization unit. This may be triggered
|
||||
by, for instance, the buffer_wbinvl1 instruction.
|
||||
L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
|
||||
of VMEM instructions, divided by total duration. The number of bytes is calculated
|
||||
as the number of cache lines requested multiplied by the cache line size. This
|
||||
value does not consider partial requests, so for instance, if only a single
|
||||
value is requested in a cache line, the data movement will still be counted
|
||||
as a full cache line.
|
||||
L1-L2 Read: The number of read requests for a vL1D cache line that were not satisfied
|
||||
by the vL1D and must be retrieved from the L2 Cache, per normalization
|
||||
unit.
|
||||
L1-L2 Write: The number of write requests to a vL1D cache line that were sent
|
||||
through the vL1D to the L2 cache, per normalization unit.
|
||||
L1-L2 Atomic: The number of atomic requests that are sent through the vL1D to
|
||||
the L2 cache, per normalization unit. This includes requests for atomics with
|
||||
and without return.
|
||||
L1 Access Latency: Calculated as the average number of cycles that a vL1D cache
|
||||
line request spent in the vL1D cache pipeline.
|
||||
L1-L2 Read Latency: Calculated as the average number of cycles that the vL1D cache
|
||||
took to issue and receive read requests from the L2 Cache. This number also
|
||||
includes requests for atomics with return values.
|
||||
L1-L2 Write Latency: Calculated as the average number of cycles that the vL1D
|
||||
cache took to issue and receive acknowledgement of a write request to the L2
|
||||
Cache. This number also includes requests for atomics without return values.
|
||||
NC - Read: Total read requests with NC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
UC - Read: Total read requests with UC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
CC - Read: Total read requests with CC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
RW - Read: Total read requests with RW mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
RW - Write: Total write requests with RW mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
NC - Write: Total write requests with NC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
UC - Write: Total write requests with UC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
CC - Write: Total write requests with CC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
NC - Atomic: Total atomic requests with NC mtype from this TCP to all TCCs, summed
|
||||
over TCP instances per normalization unit.
|
||||
UC - Atomic: Total atomic requests with UC mtype from this TCP to all TCCs, summed
|
||||
over TCP instances per normalization unit.
|
||||
CC - Atomic: Total atomic requests with CC mtype from this TCP to all TCCs, summed
|
||||
over TCP instances per normalization unit.
|
||||
RW - Atomic: Total atomic requests with RW mtype from this TCP to all TCCs, summed
|
||||
over TCP instances per normalization unit.
|
||||
Req: The number of translation requests made to the UTCL1 per normalization unit.
|
||||
Hit Ratio: The ratio of the number of translation requests that hit in the UTCL1
|
||||
divided by the total number of translation requests made to the UTCL1.
|
||||
Hits: The number of translation requests that hit in the UTCL1, and could be reused,
|
||||
per normalization unit.
|
||||
Translation Misses: The total number of translation requests that missed in the
|
||||
UTCL1 due to translation not being present in the cache, per normalization
|
||||
unit.
|
||||
Permission Misses: "The total number of translation requests that missed in the\
|
||||
\ UTCL1 due to a permission error, per normalization unit. This is unused and\
|
||||
\ expected to be zero in most configurations for modern CDNA\u2122 accelerators."
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1601
|
||||
@@ -181,17 +70,17 @@ Panel Config:
|
||||
avg: AVG((TCP_TOTAL_ACCESSES_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_ACCESSES_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_ACCESSES_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read Req:
|
||||
avg: AVG((TCP_TOTAL_READ_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_READ_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_READ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Write Req:
|
||||
avg: AVG((TCP_TOTAL_WRITE_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_WRITE_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_WRITE_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Atomic Req:
|
||||
avg: AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
|
||||
/ $denom))
|
||||
@@ -199,7 +88,7 @@ Panel Config:
|
||||
/ $denom))
|
||||
max: MAX(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Cache BW:
|
||||
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
@@ -223,7 +112,7 @@ Panel Config:
|
||||
avg: AVG((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Cache Hits:
|
||||
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
@@ -234,7 +123,7 @@ Panel Config:
|
||||
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Invalidations:
|
||||
avg: AVG((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
|
||||
@@ -252,12 +141,12 @@ Panel Config:
|
||||
avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
L1-L2 Write:
|
||||
avg: AVG((TCP_TCC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
L1-L2 Atomic:
|
||||
avg: AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
/ $denom))
|
||||
@@ -265,7 +154,7 @@ Panel Config:
|
||||
/ $denom))
|
||||
max: MAX(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
L1 Access Latency:
|
||||
avg: AVG(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
|
||||
!= 0) else None))
|
||||
@@ -314,84 +203,84 @@ Panel Config:
|
||||
avg: AVG((TCP_TCC_NC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_NC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_NC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
UC - Read:
|
||||
xfer: Read
|
||||
coherency: UC
|
||||
avg: AVG((TCP_TCC_UC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_UC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_UC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
CC - Read:
|
||||
xfer: Read
|
||||
coherency: CC
|
||||
avg: AVG((TCP_TCC_CC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_CC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_CC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
RW - Read:
|
||||
xfer: Read
|
||||
coherency: RW
|
||||
avg: AVG((TCP_TCC_RW_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_RW_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_RW_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
RW - Write:
|
||||
xfer: Write
|
||||
coherency: RW
|
||||
avg: AVG((TCP_TCC_RW_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_RW_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_RW_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
NC - Write:
|
||||
xfer: Write
|
||||
coherency: NC
|
||||
avg: AVG((TCP_TCC_NC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_NC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_NC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
UC - Write:
|
||||
xfer: Write
|
||||
coherency: UC
|
||||
avg: AVG((TCP_TCC_UC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_UC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_UC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
CC - Write:
|
||||
xfer: Write
|
||||
coherency: CC
|
||||
avg: AVG((TCP_TCC_CC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_CC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_CC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
NC - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: NC
|
||||
avg: AVG((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
UC - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: UC
|
||||
avg: AVG((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
CC - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: CC
|
||||
avg: AVG((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
RW - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: RW
|
||||
avg: AVG((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1605
|
||||
title: L1 Unified Translation Cache (UTCL1)
|
||||
@@ -440,3 +329,114 @@ Panel Config:
|
||||
max: Max
|
||||
units: Unit
|
||||
metric: {}
|
||||
metrics_description:
|
||||
Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
|
||||
cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
|
||||
of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator. The number of bytes is calculated as the number
|
||||
of cache lines requested multiplied by the cache line size. This value does
|
||||
not consider partial requests, so for instance, if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
|
||||
The number of cycles where the vL1D Cache RAM is actively processing any request
|
||||
divided by the number of cycles where the vL1D is active.
|
||||
Coalescing: Indicates how well memory instructions were coalesced by the address
|
||||
processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
|
||||
as the average number of thread-requests generated per instruction divided by
|
||||
the ideal number of thread-requests per instruction.
|
||||
Stalled on L2 Data: The ratio of the number of cycles where the vL1D is stalled
|
||||
waiting for requested data to return from the L2 cache divided by the number
|
||||
of cycles where the vL1D is active.
|
||||
Stalled on L2 Req: The ratio of the number of cycles where the vL1D is stalled
|
||||
waiting to issue a request for data to the L2 cache divided by the number of
|
||||
cycles where the vL1D is active.
|
||||
Tag RAM Stall (Read): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Read requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Tag RAM Stall (Write): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Write requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Tag RAM Stall (Atomic): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Atomic requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Total Req: The total number of incoming requests from the address processing unit
|
||||
after coalescing.
|
||||
Read Req: The total number of incoming read requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Write Req: The total number of incoming write requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Atomic Req: The total number of incoming atomic requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
|
||||
instructions divided by total duration. The number of bytes is calculated as
|
||||
the number of cache lines requested multiplied by the cache line size. This
|
||||
value does not consider partial requests, so for instance, if only a single
|
||||
value is requested in a cache line, the data movement will still be counted
|
||||
as a full cache line.
|
||||
Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
|
||||
vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
Cache Accesses: The total number of cache line lookups in the vL1D.
|
||||
Cache Hits: The number of cache accesses minus the number of outgoing requests
|
||||
to the L2 cache, that is, the number of cache line requests serviced by the
|
||||
vL1D Cache RAM per normalization unit.
|
||||
Invalidations: The number of times the vL1D was issued a write-back invalidate
|
||||
command during the kernel's execution per normalization unit. This may be triggered
|
||||
by, for instance, the buffer_wbinvl1 instruction.
|
||||
L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
|
||||
of VMEM instructions, divided by total duration. The number of bytes is calculated
|
||||
as the number of cache lines requested multiplied by the cache line size. This
|
||||
value does not consider partial requests, so for instance, if only a single
|
||||
value is requested in a cache line, the data movement will still be counted
|
||||
as a full cache line.
|
||||
L1-L2 Read: The number of read requests for a vL1D cache line that were not satisfied
|
||||
by the vL1D and must be retrieved from the L2 Cache per normalization
|
||||
unit.
|
||||
L1-L2 Write: The number of write requests to a vL1D cache line that were sent
|
||||
through the vL1D to the L2 cache, per normalization unit.
|
||||
L1-L2 Atomic: The number of atomic requests that are sent through the vL1D to
|
||||
the L2 cache, per normalization unit. This includes requests for atomics with
|
||||
and without return.
|
||||
L1 Access Latency: Calculated as the average number of cycles that a vL1D cache
|
||||
line request spent in the vL1D cache pipeline.
|
||||
L1-L2 Read Latency: Calculated as the average number of cycles that the vL1D cache
|
||||
took to issue and receive read requests from the L2 Cache. This number also
|
||||
includes requests for atomics with return values.
|
||||
L1-L2 Write Latency: Calculated as the average number of cycles that the vL1D
|
||||
cache took to issue and receive acknowledgement of a write request to the L2
|
||||
Cache. This number also includes requests for atomics without return values.
|
||||
NC - Read: Total read requests with NC mtype from this TCP to all TCCs Sum over
|
||||
TCP instances per normalization unit.
|
||||
UC - Read: Total read requests with UC mtype from this TCP to all TCCs Sum over
|
||||
TCP instances per normalization unit.
|
||||
CC - Read: Total read requests with CC mtype from this TCP to all TCCs Sum over
|
||||
TCP instances per normalization unit.
|
||||
RW - Read: Total read requests with RW mtype from this TCP to all TCCs Sum over
|
||||
TCP instances per normalization unit.
|
||||
RW - Write: Total write requests with RW mtype from this TCP to all TCCs Sum over
|
||||
TCP instances per normalization unit.
|
||||
NC - Write: Total write requests with NC mtype from this TCP to all TCCs Sum over
|
||||
TCP instances per normalization unit.
|
||||
UC - Write: Total write requests with UC mtype from this TCP to all TCCs Sum over
|
||||
TCP instances per normalization unit.
|
||||
CC - Write: Total write requests with CC mtype from this TCP to all TCCs Sum over
|
||||
TCP instances per normalization unit.
|
||||
NC - Atomic: Total atomic requests with NC mtype from this TCP to all TCCs Sum
|
||||
over TCP instances per normalization unit.
|
||||
UC - Atomic: Total atomic requests with UC mtype from this TCP to all TCCs Sum
|
||||
over TCP instances per normalization unit.
|
||||
CC - Atomic: Total atomic requests with CC mtype from this TCP to all TCCs Sum
|
||||
over TCP instances per normalization unit.
|
||||
RW - Atomic: Total atomic requests with RW mtype from this TCP to all TCCs Sum
|
||||
over TCP instances per normalization unit.
|
||||
Req: The number of translation requests made to the UTCL1 per normalization unit.
|
||||
Hit Ratio: The ratio of the number of translation requests that hit in the UTCL1
|
||||
divided by the total number of translation requests made to the UTCL1.
|
||||
Hits: The number of translation requests that hit in the UTCL1, and could be reused,
|
||||
per normalization unit.
|
||||
Translation Misses: The total number of translation requests that missed in the
|
||||
UTCL1 due to translation not being present in the cache, per normalization unit.
|
||||
Permission Misses: |-
|
||||
The total number of translation requests that missed in the UTCL1 due
|
||||
to a permission error, per normalization unit. This is unused and expected
|
||||
to be zero in most configurations for modern CDNA™ accelerators.
|
||||
|
||||
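As a rough illustration of how two of the vL1D expressions in the panel above evaluate, the sketch below re-implements them in plain Python: `Cache Hits` (cache accesses minus the requests forwarded to L2, divided by the normalization denominator) and `L1 Access Latency` (guarded against a zero divisor, returning `None` as the config does). The function names, the `counters` mapping, and the sample values are hypothetical and chosen only for illustration; the profiler evaluates these expressions through its own machinery, which this diff does not show.

```python
from typing import Mapping, Optional


def cache_hits(counters: Mapping[str, float], denom: float) -> float:
    """Mirror the `Cache Hits` expression: accesses not forwarded to L2, per denom."""
    forwarded_to_l2 = (
        counters["TCP_TCC_READ_REQ_sum"]
        + counters["TCP_TCC_WRITE_REQ_sum"]
        + counters["TCP_TCC_ATOMIC_WITH_RET_REQ_sum"]
        + counters["TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum"]
    )
    return (counters["TCP_TOTAL_CACHE_ACCESSES_sum"] - forwarded_to_l2) / denom


def l1_access_latency(counters: Mapping[str, float]) -> Optional[float]:
    """Mirror the `L1 Access Latency` expression, including its zero guard."""
    reads = counters["TCP_TA_TCP_STATE_READ_sum"]
    if reads == 0:
        return None  # same fallback as `... if (... != 0) else None` in the config
    return counters["TCP_TCP_LATENCY_sum"] / reads


# Hypothetical counter values for one kernel dispatch (not measured data).
sample = {
    "TCP_TOTAL_CACHE_ACCESSES_sum": 1000,
    "TCP_TCC_READ_REQ_sum": 100,
    "TCP_TCC_WRITE_REQ_sum": 50,
    "TCP_TCC_ATOMIC_WITH_RET_REQ_sum": 10,
    "TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum": 5,
    "TCP_TCP_LATENCY_sum": 4000,
    "TCP_TA_TCP_STATE_READ_sum": 200,
}

print(cache_hits(sample, denom=5))
print(l1_access_latency(sample))
```

Here `denom` stands in for `$denom`, whose value depends on the normalization unit selected at analysis time. The `AVG`/`MIN`/`MAX` wrappers in the config would then aggregate such per-dispatch values.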
+344
-394
@@ -2,6 +2,350 @@
|
||||
Panel Config:
|
||||
id: 1700
|
||||
title: L2 Cache
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1701
|
||||
title: L2 Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
metric:
|
||||
Utilization:
|
||||
value: AVG(((TCC_BUSY_sum * 100) / (TO_INT($total_l2_chan) * $GRBM_GUI_ACTIVE_PER_XCD)))
|
||||
unit: pct
|
||||
Peak Bandwidth:
|
||||
value: ((100 * AVG(((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk / 1000) * 128) * TO_INT($total_l2_chan)))
|
||||
unit: pct
|
||||
Hit Rate:
|
||||
value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else 0))
|
||||
unit: pct
|
||||
L2-Fabric Read BW:
|
||||
value: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
L2-Fabric Write and Atomic BW:
|
||||
value: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
|
||||
* 32)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
HBM Bandwidth:
|
||||
value: $hbmBandwidth
|
||||
unit: GB/s
|
||||
- metric_table:
|
||||
id: 1702
|
||||
title: L2-Fabric interface metrics
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Read BW:
|
||||
avg: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
max: MAX((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: Gbps
|
||||
HBM Read Traffic:
|
||||
avg: AVG((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Remote Read Traffic:
|
||||
avg: AVG((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum)
|
||||
if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
min: MIN((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum)
|
||||
if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
max: MAX((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum)
|
||||
if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
unit: pct
|
||||
Uncached Read Traffic:
|
||||
avg: AVG((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Write and Atomic BW:
|
||||
avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
HBM Write and Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Remote Write and Atomic Traffic:
|
||||
avg: AVG((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum)
|
||||
if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
min: MIN((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum)
|
||||
if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
max: MAX((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum)
|
||||
if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
unit: pct
|
||||
Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Uncached Write and Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Read Latency:
|
||||
avg: AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
Write and Atomic Latency:
|
||||
avg: AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
Atomic Latency:
|
||||
avg: AVG(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
- metric_table:
|
||||
id: 1703
|
||||
title: L2 Cache Accesses
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Bandwidth:
|
||||
avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
|
||||
min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
|
||||
max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
Req:
|
||||
avg: AVG((TCC_REQ_sum / $denom))
|
||||
min: MIN((TCC_REQ_sum / $denom))
|
||||
max: MAX((TCC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read Req:
|
||||
avg: AVG((TCC_READ_sum / $denom))
|
||||
min: MIN((TCC_READ_sum / $denom))
|
||||
max: MAX((TCC_READ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Write Req:
|
||||
avg: AVG((TCC_WRITE_sum / $denom))
|
||||
min: MIN((TCC_WRITE_sum / $denom))
|
||||
max: MAX((TCC_WRITE_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Atomic Req:
|
||||
avg: AVG((TCC_ATOMIC_sum / $denom))
|
||||
min: MIN((TCC_ATOMIC_sum / $denom))
|
||||
max: MAX((TCC_ATOMIC_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Streaming Req:
|
||||
avg: AVG((TCC_STREAMING_REQ_sum / $denom))
|
||||
min: MIN((TCC_STREAMING_REQ_sum / $denom))
|
||||
max: MAX((TCC_STREAMING_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Probe Req:
|
||||
avg: AVG((TCC_PROBE_sum / $denom))
|
||||
min: MIN((TCC_PROBE_sum / $denom))
|
||||
max: MAX((TCC_PROBE_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Cache Hit:
|
||||
avg: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
min: MIN((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
max: MAX((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
unit: pct
|
||||
Hits:
|
||||
avg: AVG((TCC_HIT_sum / $denom))
|
||||
min: MIN((TCC_HIT_sum / $denom))
|
||||
max: MAX((TCC_HIT_sum / $denom))
|
||||
unit: (Hits + $normUnit)
|
||||
Misses:
|
||||
avg: AVG((TCC_MISS_sum / $denom))
|
||||
min: MIN((TCC_MISS_sum / $denom))
|
||||
max: MAX((TCC_MISS_sum / $denom))
|
||||
unit: (Misses + $normUnit)
|
||||
Writeback:
|
||||
avg: AVG((TCC_WRITEBACK_sum / $denom))
|
||||
min: MIN((TCC_WRITEBACK_sum / $denom))
|
||||
max: MAX((TCC_WRITEBACK_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
Writeback (Internal):
|
||||
avg: AVG((TCC_NORMAL_WRITEBACK_sum / $denom))
|
||||
min: MIN((TCC_NORMAL_WRITEBACK_sum / $denom))
|
||||
max: MAX((TCC_NORMAL_WRITEBACK_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
Writeback (vL1D Req):
|
||||
avg: AVG((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
|
||||
min: MIN((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
|
||||
max: MAX((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
Evict (Internal):
|
||||
avg: AVG((TCC_NORMAL_EVICT_sum / $denom))
|
||||
min: MIN((TCC_NORMAL_EVICT_sum / $denom))
|
||||
max: MAX((TCC_NORMAL_EVICT_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
Evict (vL1D Req):
|
||||
avg: AVG((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
|
||||
min: MIN((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
|
||||
max: MAX((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
NC Req:
|
||||
avg: AVG((TCC_NC_REQ_sum / $denom))
|
||||
min: MIN((TCC_NC_REQ_sum / $denom))
|
||||
max: MAX((TCC_NC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
UC Req:
|
||||
avg: AVG((TCC_UC_REQ_sum / $denom))
|
||||
min: MIN((TCC_UC_REQ_sum / $denom))
|
||||
max: MAX((TCC_UC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
CC Req:
|
||||
avg: AVG((TCC_CC_REQ_sum / $denom))
|
||||
min: MIN((TCC_CC_REQ_sum / $denom))
|
||||
max: MAX((TCC_CC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
RW Req:
|
||||
avg: AVG((TCC_RW_REQ_sum / $denom))
|
||||
min: MIN((TCC_RW_REQ_sum / $denom))
|
||||
max: MAX((TCC_RW_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1704
|
||||
title: L2 Cache Stalls
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric: {}
|
||||
- metric_table:
|
||||
id: 1705
|
||||
title: L2 - Fabric Interface stalls
|
||||
header:
|
||||
metric: Metric
|
||||
type: Type
|
||||
transaction: Transaction
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
style:
|
||||
type: simple_multi_bar
|
||||
metric:
|
||||
Write - Credit Starvation:
|
||||
type: Credit Starvation
|
||||
transaction: Write
|
||||
avg: AVG(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
- metric_table:
|
||||
id: 1706
|
||||
title: L2 - Fabric interface detailed metrics
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Read (32B):
|
||||
avg: AVG((TCC_EA_RDREQ_32B_sum / $denom))
|
||||
min: MIN((TCC_EA_RDREQ_32B_sum / $denom))
|
||||
max: MAX((TCC_EA_RDREQ_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read (64B):
|
||||
avg: AVG(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
|
||||
min: MIN(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
|
||||
max: MAX(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read (Uncached):
|
||||
avg: AVG((TCC_EA_RD_UNCACHED_32B_sum / $denom))
|
||||
min: MIN((TCC_EA_RD_UNCACHED_32B_sum / $denom))
|
||||
max: MAX((TCC_EA_RD_UNCACHED_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
HBM Read:
|
||||
avg: AVG((TCC_EA_RDREQ_DRAM_sum / $denom))
|
||||
min: MIN((TCC_EA_RDREQ_DRAM_sum / $denom))
|
||||
max: MAX((TCC_EA_RDREQ_DRAM_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Remote Read:
|
||||
avg: AVG((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
|
||||
min: MIN((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
|
||||
max: MAX((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Write and Atomic (32B):
|
||||
avg: AVG(MAX(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom), 0))
|
||||
min: MIN(MAX(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom), 0))
|
||||
max: MAX(MAX(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom), 0))
|
||||
unit: (Req + $normUnit)
|
||||
Write and Atomic (Uncached):
|
||||
avg: AVG((TCC_EA_WR_UNCACHED_32B_sum / $denom))
|
||||
min: MIN((TCC_EA_WR_UNCACHED_32B_sum / $denom))
|
||||
max: MAX((TCC_EA_WR_UNCACHED_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Write and Atomic (64B):
|
||||
avg: AVG((TCC_EA_WRREQ_64B_sum / $denom))
|
||||
min: MIN((TCC_EA_WRREQ_64B_sum / $denom))
|
||||
max: MAX((TCC_EA_WRREQ_64B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
HBM Write and Atomic:
|
||||
avg: AVG((TCC_EA_WRREQ_DRAM_sum / $denom))
|
||||
min: MIN((TCC_EA_WRREQ_DRAM_sum / $denom))
|
||||
max: MAX((TCC_EA_WRREQ_DRAM_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Remote Write and Atomic:
|
||||
avg: AVG((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
|
||||
min: MIN((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
|
||||
max: MAX((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Atomic:
|
||||
avg: AVG((TCC_EA_ATOMIC_sum / $denom))
|
||||
min: MIN((TCC_EA_ATOMIC_sum / $denom))
|
||||
max: MAX((TCC_EA_ATOMIC_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
metrics_description:
|
||||
Utilization: The ratio of the number of cycles an L2 channel was active, summed
|
||||
over all L2 channels on the accelerator over the total L2 cycles.
|
||||
@@ -87,12 +431,6 @@ Panel Config:
|
||||
by the cache line size. This value does not consider partial requests, so for
|
||||
example, if only a single value is requested in a cache line, the data movement
|
||||
will still be counted as a full cache line.
|
||||
Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests,
|
||||
divided by total duration.
|
||||
Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests,
|
||||
divided by total duration.
|
||||
Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests,
|
||||
divided by total duration.
|
||||
Req: The total number of incoming requests to the L2 from all clients for all
|
||||
request types, per normalization unit.
|
||||
Read Req: The total number of read requests to the L2 from all clients.
|
||||
@@ -149,12 +487,6 @@ Panel Config:
|
||||
Remote Read: The total number of L2 requests to Infinity Fabric to read 32B or
|
||||
64B of data from any source other than the accelerator's local HBM, per normalization
|
||||
unit.
|
||||
Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe
|
||||
traffic, divided by total duration.
|
||||
"Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read
|
||||
requests due to Infinity Fabric traffic, divided by total duration.
|
||||
Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM
|
||||
traffic, divided by total duration.
|
||||
Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
|
||||
write or atomically update 32B of data to any memory location, per normalization
|
||||
unit.
|
||||
@@ -170,391 +502,9 @@ Panel Config:
|
||||
Remote Write and Atomic: The total number of L2 requests to Infinity Fabric to
|
||||
write or atomically update 32B or 64B of data in any memory location other than
|
||||
the accelerator's local HBM, per normalization unit.
|
||||
Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to
|
||||
PCIe traffic, divided by total duration.
|
||||
"Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write
|
||||
requests due to Infinity Fabric traffic, divided by total duration.
|
||||
Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM
|
||||
traffic, divided by total duration.
|
||||
Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to
|
||||
PCIe traffic, divided by total duration.
|
||||
"Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic
|
||||
requests due to Infinity Fabric traffic, divided by total duration.
|
||||
Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
|
||||
HBM traffic, divided by total duration.
|
||||
Atomic: The total number of L2 requests to Infinity Fabric to atomically update
|
||||
32B or 64B of data in any memory location, per normalization unit. See Request
|
||||
flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
|
||||
requests are only considered atomic by Infinity Fabric if they are targeted
|
||||
at non-write-cacheable memory, such as fine-grained memory allocations or uncached
|
||||
memory allocations on the MI2XX.
|
||||
Read Stall: "The ratio of the total number of cycles the L2-Fabric interface was\
|
||||
\ stalled on a read request to any destination (local HBM, remote PCIe\xAE connected\
|
||||
\ accelerator or CPU, or remote Infinity Fabric connected accelerator or CPU)\
|
||||
\ over the total active L2 cycles."
|
||||
Write Stall: The ratio of the total number of cycles the L2-Fabric interface was
|
||||
stalled on a write or atomic request to any destination (local HBM, remote PCIe
|
||||
connected accelerator or CPU, or remote Infinity Fabric connected
|
||||
accelerator or CPU) over the total active L2 cycles.
|
||||
Read - PCIe Stall: The number of cycles the L2-Fabric interface was stalled on
|
||||
read requests to remote PCIe connected accelerators or CPUs as a percent of
|
||||
the total active L2 cycles.
|
||||
Read - Infinity Fabric Stall: The number of cycles the L2-Fabric interface was
|
||||
stalled on read requests to remote Infinity Fabric connected accelerators or
|
||||
CPUs as a percent of the total active L2 cycles.
|
||||
Read - HBM Stall: The number of cycles the L2-Fabric interface was stalled on
|
||||
read requests to the accelerator's local HBM as a percent of the total active
|
||||
L2 cycles.
|
||||
Write - PCIe Stall: The number of cycles the L2-Fabric interface was stalled on
|
||||
write or atomic requests to remote PCIe connected accelerators or CPUs as a
|
||||
percent of the total active L2 cycles.
|
||||
Write - Infinity Fabric Stall: The number of cycles the L2-Fabric interface was
|
||||
stalled on write or atomic requests to remote Infinity Fabric connected accelerators
|
||||
or CPUs as a percent of the total active L2 cycles.
|
||||
Write - HBM Stall: The number of cycles the L2-Fabric interface was stalled on
|
||||
write or atomic requests to accelerator's local HBM as a percent of the total
|
||||
active L2 cycles.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1701
|
||||
title: L2 Speed-of-Light
|
||||
header:
|
||||
metric: Metric
|
||||
value: Avg
|
||||
unit: Unit
|
||||
metric:
|
||||
Utilization:
|
||||
value: AVG(((TCC_BUSY_sum * 100) / (TO_INT($total_l2_chan) * $GRBM_GUI_ACTIVE_PER_XCD)))
|
||||
unit: pct
|
||||
Peak Bandwidth:
|
||||
value: ((100 * AVG(((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))))
|
||||
/ ((($max_sclk / 1000) * 128) * TO_INT($total_l2_chan)))
|
||||
unit: pct
|
||||
Hit Rate:
|
||||
value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else 0))
|
||||
unit: pct
|
||||
L2-Fabric Read BW:
|
||||
value: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
L2-Fabric Write and Atomic BW:
|
||||
value: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
|
||||
* 32)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
HBM Bandwidth:
|
||||
value: $hbmBandwidth
|
||||
unit: GB/s
|
||||
- metric_table:
|
||||
id: 1702
|
||||
title: L2-Fabric interface metrics
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Read BW:
|
||||
avg: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
max: MAX((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
|
||||
* 64)) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GB/s
|
||||
HBM Read Traffic:
|
||||
avg: AVG((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Remote Read Traffic:
|
||||
avg: AVG((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum)
|
||||
if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
min: MIN((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum)
|
||||
if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
max: MAX((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum)
|
||||
if (TCC_EA_RDREQ_sum != 0) else None))
|
||||
unit: pct
|
||||
Uncached Read Traffic:
|
||||
avg: AVG((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Write and Atomic BW:
|
||||
avg: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
min: MIN((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
max: MAX((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
|
||||
* 32)) / $denom))
|
||||
unit: (Bytes + $normUnit)
|
||||
HBM Write and Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Remote Write and Atomic Traffic:
|
||||
avg: AVG((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum)
|
||||
if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
min: MIN((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum)
|
||||
if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
max: MAX((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum)
|
||||
if (TCC_EA_WRREQ_sum != 0) else None))
|
||||
unit: pct
|
||||
Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Uncached Write and Atomic Traffic:
|
||||
avg: AVG((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
Read Latency:
|
||||
avg: AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
Write and Atomic Latency:
|
||||
avg: AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
Atomic Latency:
|
||||
avg: AVG(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
|
||||
!= 0) else None))
|
||||
unit: Cycles
|
||||
- metric_table:
|
||||
id: 1703
|
||||
title: L2 Cache Accesses
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Bandwidth:
|
||||
avg: AVG((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
|
||||
min: MIN((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
|
||||
max: MAX((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))
|
||||
unit: GB/s
|
||||
Req:
|
||||
avg: AVG((TCC_REQ_sum / $denom))
|
||||
min: MIN((TCC_REQ_sum / $denom))
|
||||
max: MAX((TCC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read Req:
|
||||
avg: AVG((TCC_READ_sum / $denom))
|
||||
min: MIN((TCC_READ_sum / $denom))
|
||||
max: MAX((TCC_READ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Write Req:
|
||||
avg: AVG((TCC_WRITE_sum / $denom))
|
||||
min: MIN((TCC_WRITE_sum / $denom))
|
||||
max: MAX((TCC_WRITE_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Atomic Req:
|
||||
avg: AVG((TCC_ATOMIC_sum / $denom))
|
||||
min: MIN((TCC_ATOMIC_sum / $denom))
|
||||
max: MAX((TCC_ATOMIC_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Streaming Req:
|
||||
avg: AVG((TCC_STREAMING_REQ_sum / $denom))
|
||||
min: MIN((TCC_STREAMING_REQ_sum / $denom))
|
||||
max: MAX((TCC_STREAMING_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Probe Req:
|
||||
avg: AVG((TCC_PROBE_sum / $denom))
|
||||
min: MIN((TCC_PROBE_sum / $denom))
|
||||
max: MAX((TCC_PROBE_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Cache Hit:
|
||||
avg: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
min: MIN((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
max: MAX((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
unit: pct
|
||||
Hits:
|
||||
avg: AVG((TCC_HIT_sum / $denom))
|
||||
min: MIN((TCC_HIT_sum / $denom))
|
||||
max: MAX((TCC_HIT_sum / $denom))
|
||||
unit: (Hits + $normUnit)
|
||||
Misses:
|
||||
avg: AVG((TCC_MISS_sum / $denom))
|
||||
min: MIN((TCC_MISS_sum / $denom))
|
||||
max: MAX((TCC_MISS_sum / $denom))
|
||||
unit: (Misses + $normUnit)
|
||||
Writeback:
|
||||
avg: AVG((TCC_WRITEBACK_sum / $denom))
|
||||
min: MIN((TCC_WRITEBACK_sum / $denom))
|
||||
max: MAX((TCC_WRITEBACK_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
Writeback (Internal):
|
||||
avg: AVG((TCC_NORMAL_WRITEBACK_sum / $denom))
|
||||
min: MIN((TCC_NORMAL_WRITEBACK_sum / $denom))
|
||||
max: MAX((TCC_NORMAL_WRITEBACK_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
Writeback (vL1D Req):
|
||||
avg: AVG((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
|
||||
min: MIN((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
|
||||
max: MAX((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
Evict (Internal):
|
||||
avg: AVG((TCC_NORMAL_EVICT_sum / $denom))
|
||||
min: MIN((TCC_NORMAL_EVICT_sum / $denom))
|
||||
max: MAX((TCC_NORMAL_EVICT_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
Evict (vL1D Req):
|
||||
avg: AVG((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
|
||||
min: MIN((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
|
||||
max: MAX((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
NC Req:
|
||||
avg: AVG((TCC_NC_REQ_sum / $denom))
|
||||
min: MIN((TCC_NC_REQ_sum / $denom))
|
||||
max: MAX((TCC_NC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
UC Req:
|
||||
avg: AVG((TCC_UC_REQ_sum / $denom))
|
||||
min: MIN((TCC_UC_REQ_sum / $denom))
|
||||
max: MAX((TCC_UC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
CC Req:
|
||||
avg: AVG((TCC_CC_REQ_sum / $denom))
|
||||
min: MIN((TCC_CC_REQ_sum / $denom))
|
||||
max: MAX((TCC_CC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
RW Req:
|
||||
avg: AVG((TCC_RW_REQ_sum / $denom))
|
||||
min: MIN((TCC_RW_REQ_sum / $denom))
|
||||
max: MAX((TCC_RW_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1704
|
||||
title: L2 Cache Stalls
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric: {}
|
||||
- metric_table:
|
||||
id: 1705
|
||||
title: L2 - Fabric Interface stalls
|
||||
header:
|
||||
metric: Metric
|
||||
type: Type
|
||||
transaction: Transaction
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
style:
|
||||
type: simple_multi_bar
|
||||
metric:
|
||||
Write - Credit Starvation:
|
||||
type: Credit Starvation
|
||||
transaction: Write
|
||||
avg: AVG(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
|
||||
!= 0) else None))
|
||||
min: MIN(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
|
||||
!= 0) else None))
|
||||
max: MAX(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
|
||||
!= 0) else None))
|
||||
unit: pct
|
||||
- metric_table:
|
||||
id: 1706
|
||||
title: L2 - Fabric interface detailed metrics
|
||||
header:
|
||||
metric: Metric
|
||||
avg: Avg
|
||||
min: Min
|
||||
max: Max
|
||||
unit: Unit
|
||||
metric:
|
||||
Read (32B):
|
||||
avg: AVG((TCC_EA_RDREQ_32B_sum / $denom))
|
||||
min: MIN((TCC_EA_RDREQ_32B_sum / $denom))
|
||||
max: MAX((TCC_EA_RDREQ_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read (64B):
|
||||
avg: AVG(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
|
||||
min: MIN(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
|
||||
max: MAX(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Read (Uncached):
|
||||
avg: AVG((TCC_EA_RD_UNCACHED_32B_sum / $denom))
|
||||
min: MIN((TCC_EA_RD_UNCACHED_32B_sum / $denom))
|
||||
max: MAX((TCC_EA_RD_UNCACHED_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
HBM Read:
|
||||
avg: AVG((TCC_EA_RDREQ_DRAM_sum / $denom))
|
||||
min: MIN((TCC_EA_RDREQ_DRAM_sum / $denom))
|
||||
max: MAX((TCC_EA_RDREQ_DRAM_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Remote Read:
|
||||
avg: AVG((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
|
||||
min: MIN((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
|
||||
max: MAX((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Write and Atomic (32B):
|
||||
avg: AVG(MAX(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom), 0))
|
||||
min: MIN(MAX(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom), 0))
|
||||
max: MAX(MAX(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom), 0))
|
||||
unit: (Req + $normUnit)
|
||||
Write and Atomic (Uncached):
|
||||
avg: AVG((TCC_EA_WR_UNCACHED_32B_sum / $denom))
|
||||
min: MIN((TCC_EA_WR_UNCACHED_32B_sum / $denom))
|
||||
max: MAX((TCC_EA_WR_UNCACHED_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Write and Atomic (64B):
|
||||
avg: AVG((TCC_EA_WRREQ_64B_sum / $denom))
|
||||
min: MIN((TCC_EA_WRREQ_64B_sum / $denom))
|
||||
max: MAX((TCC_EA_WRREQ_64B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
HBM Write and Atomic:
|
||||
avg: AVG((TCC_EA_WRREQ_DRAM_sum / $denom))
|
||||
min: MIN((TCC_EA_WRREQ_DRAM_sum / $denom))
|
||||
max: MAX((TCC_EA_WRREQ_DRAM_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Remote Write and Atomic:
|
||||
avg: AVG((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
|
||||
min: MIN((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
|
||||
max: MAX((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
Atomic:
|
||||
avg: AVG((TCC_EA_ATOMIC_sum / $denom))
|
||||
min: MIN((TCC_EA_ATOMIC_sum / $denom))
|
||||
max: MAX((TCC_EA_ATOMIC_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
|
||||
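For readers checking the expressions in tables 1701 to 1703 above, here is a minimal Python sketch of three of them: the guarded hit rate, the L2-Fabric read bandwidth, and the percent-of-peak bandwidth. It assumes timestamps are in nanoseconds (so bytes per nanosecond reads as GB/s) and that `$max_sclk` is in MHz. The function and parameter names are hypothetical and not part of rocprofiler-compute.

```python
def l2_hit_rate_pct(hits, misses):
    """Percent of L2 requests that hit, or None when no requests were seen.

    Mirrors the `... if ((TCC_HIT_sum + TCC_MISS_sum) != 0) else None` guard.
    """
    total = hits + misses
    return 100 * hits / total if total != 0 else None


def fabric_read_bw(rdreq, rdreq_32b, start_ns, end_ns):
    """Bytes read over the fabric per nanosecond (assumed equal to GB/s).

    32B requests count 32 bytes each; all remaining read requests count 64.
    """
    total_bytes = rdreq_32b * 32 + (rdreq - rdreq_32b) * 64
    return total_bytes / (end_ns - start_ns)


def l2_peak_bw_pct(req, start_ns, end_ns, max_sclk_mhz, l2_channels):
    """Achieved L2 bandwidth as a percent of the theoretical peak.

    Each request is counted as a full 128-byte cache line, and the peak is
    one 128-byte line per channel per clock (assumption: clock given in MHz).
    """
    achieved = req * 128 / (end_ns - start_ns)
    peak = (max_sclk_mhz / 1000) * 128 * l2_channels
    return 100 * achieved / peak
```

Returning `None` rather than `0` for an empty denominator follows tables 1702 and 1703; table 1701 uses `else 0` for its Hit Rate entry.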
+4
-4
@@ -2,10 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1800
|
||||
title: L2 Cache (per Channel)
|
||||
metrics_description:
|
||||
L2 Cache Hit Rate: The percent of the total number of requests to the L2 from all
|
||||
clients that hit in the cache. As noted in the Speed-of-Light section, this
|
||||
includes hit-on-miss requests.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1801
|
||||
@@ -321,3 +317,7 @@ Panel Config:
|
||||
::_1: $total_l2_chan
|
||||
cli_style: simple_box
|
||||
tui_style: simple_box
|
||||
metrics_description:
|
||||
L2 Cache Hit Rate: The percent of the total number of requests to the L2 from all
|
||||
clients that hit in the cache. As noted in the Speed-of-Light section, this
|
||||
includes hit-on-miss requests.
|
||||
|
||||
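The hunk above moves `metrics_description` from just under `title` to the end of `Panel Config`, and the smaller panel hunks that follow do the same. Below is a minimal sketch of that reordering, assuming a panel is loaded as a plain Python dict. `move_key_last` is a hypothetical helper for illustration, not the tool under `tools/config_management/`.

```python
def move_key_last(mapping, key):
    """Return a copy of `mapping` with `key` moved to the end, if present.

    Relies on dicts preserving insertion order (Python 3.7+).
    """
    reordered = {k: v for k, v in mapping.items() if k != key}
    if key in mapping:
        reordered[key] = mapping[key]
    return reordered


# Illustrative panel, shaped like the PC Sampling config in this diff.
panel = {
    "id": 2100,
    "title": "PC Sampling",
    "metrics_description": {},
    "data source": [{"pc_sampling_table": {"id": 2101}}],
}
reordered_panel = move_key_last(panel, "metrics_description")
```

The original dict is left untouched, so the before and after orderings can be compared directly.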
+1
-1
@@ -2,10 +2,10 @@
|
||||
Panel Config:
|
||||
id: 2100
|
||||
title: PC Sampling
|
||||
metrics_description: {}
|
||||
data source:
|
||||
- pc_sampling_table:
|
||||
id: 2101
|
||||
title: PC Sampling
|
||||
source: ps_file
|
||||
comparable: false
|
||||
metrics_description: {}
|
||||
|
||||
+1022
File diff suppressed because it is too large
+1
-1
@@ -2,7 +2,6 @@
|
||||
Panel Config:
|
||||
id: 0
|
||||
title: Top Stats
|
||||
metrics_description: {}
|
||||
data source:
|
||||
- raw_csv_table:
|
||||
id: 1
|
||||
@@ -12,3 +11,4 @@ Panel Config:
|
||||
id: 2
|
||||
title: Dispatch List
|
||||
source: pmc_dispatch_info.csv
|
||||
metrics_description: {}
|
||||
|
||||
+1
-1
@@ -2,10 +2,10 @@
|
||||
Panel Config:
|
||||
id: 100
|
||||
title: System Info
|
||||
metrics_description: {}
|
||||
data source:
|
||||
- raw_csv_table:
|
||||
id: 101
|
||||
title: System Info
|
||||
source: sysinfo.csv
|
||||
columnwise: true
|
||||
metrics_description: {}
|
||||
|
||||
+127
-118
@@ -2,124 +2,6 @@
|
||||
Panel Config:
|
||||
id: 200
|
||||
title: System Speed-of-Light
|
||||
metrics_description:
|
||||
VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
|
||||
This is also presented as a percent of the peak theoretical FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any floating-point
|
||||
operations from MFMA instructions.'
|
||||
VALU IOPs: 'The total integer operations executed per second on the VALU. This
|
||||
is also presented as a percent of the peak theoretical IOPs achievable on the
|
||||
specific accelerator. Note: this does not include any integer operations from
|
||||
MFMA instructions.'
|
||||
MFMA FLOPs (F8): The total number of 8-bit brain floating point MFMA operations
|
||||
executed per second. This does not include any 16-bit brain floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F8 MFMA operations achievable on the specific accelerator. It is supported on
|
||||
AMD Instinct MI300 series and later only.
|
||||
MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
|
||||
executed per second. Note: this does not include any 16-bit brain floating point
|
||||
operations from VALU instructions. This is also presented as a percent of the
|
||||
peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F16 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 32-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F32 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 64-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F64 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
|
||||
per second. Note: this does not include any 8-bit integer operations from VALU
|
||||
instructions. This is also presented as a percent of the peak theoretical INT8
|
||||
MFMA operations achievable on the specific accelerator.'
|
||||
Active CUs: Total number of active compute units (CUs) on the accelerator during
|
||||
the kernel execution.
|
||||
SALU Utilization: Indicates what percent of the kernel's duration the SALU was
|
||||
busy executing instructions. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
|
||||
VALU Utilization: Indicates what percent of the kernel's duration the VALU was
|
||||
busy executing instructions. Does not include VMEM operations. Computed as the
|
||||
ratio of the total number of cycles spent by the scheduler issuing VALU instructions
|
||||
over the total CU cycles.
|
||||
MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
|
||||
was busy executing instructions. Computed as the ratio of the total number of
|
||||
cycles the MFMA was busy over the total CU cycles.
|
||||
VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
|
||||
was busy executing instructions, including both global/generic and spill/scratch
|
||||
operations (see the VMEM instruction count metrics for more detail). Does not
|
||||
include VALU operations. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing VMEM instructions over the total CU cycles.
|
||||
Branch Utilization: Indicates what percent of the kernel's duration the branch
|
||||
unit was busy executing instructions. Computed as the ratio of the total number
|
||||
of cycles spent by the scheduler issuing branch instructions over the total
|
||||
CU cycles.
|
||||
VALU Active Threads: Indicates the average level of divergence within a wavefront
|
||||
over the lifetime of the kernel. The number of work-items that were active in
|
||||
a wavefront during execution of each VALU instruction, time-averaged over all
|
||||
VALU instructions run on all wavefronts in the kernel.
|
||||
IPC: The ratio of the total number of instructions executed on the CU over the
|
||||
total active CU cycles. This is also presented as a percent of the peak theoretical
|
||||
bandwidth achievable on the specific accelerator.
|
||||
Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
|
||||
over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
||||
kernels (less than 1 ms). This is also presented as a percent of the peak theoretical
|
||||
occupancy achievable on the specific accelerator.'
|
||||
Theoretical LDS Bandwidth: Indicates the maximum amount of bytes that could have
|
||||
been loaded from, stored to, or atomically updated in the LDS per unit time
|
||||
(see LDS Bandwidth example for more detail). This is also presented as a percent
|
||||
of the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
LDS Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS
|
||||
scheduler due to bank conflicts (as determined by the conflict resolution hardware)
|
||||
to the base number of cycles that would be spent in the LDS scheduler in a completely
|
||||
uncontended case. This is also presented in normalized form (i.e., the Bank
|
||||
Conflict Rate).
|
||||
vL1D Cache Hit Rate: The ratio of the number of vL1D cache line requests that
|
||||
hit in vL1D cache over the total number of cache line requests to the vL1D cache
|
||||
RAM.
|
||||
vL1D Cache BW: The number of bytes looked up in the vL1D cache as a result of
|
||||
VMEM instructions per unit time. The number of bytes is calculated as the number
|
||||
of cache lines requested multiplied by the cache line size. This value does
|
||||
not consider partial requests, so e.g., if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
This is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L2 Cache Hit Rate: The ratio of the number of L2 cache line requests that hit
|
||||
in the L2 cache over the total number of incoming cache line requests to the
|
||||
L2 cache.
|
||||
L2 Cache BW: The number of bytes looked up in the L2 cache per unit time. The
|
||||
number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line. This is also presented as a percent of
|
||||
the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Read BW: "The number of bytes read by the L2 over the Infinity Fabric\u2122\
|
||||
\ interface per unit time. This is also presented as a percent of the peak theoretical\
|
||||
\ bandwidth achievable on the specific accelerator."
|
||||
L2-Fabric Write BW: The number of bytes sent by the L2 over the Infinity Fabric
|
||||
interface by write and atomic operations per unit time. This is also presented
|
||||
as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Read Latency: The time-averaged number of cycles read requests spent
|
||||
in Infinity Fabric before data was returned to the L2.
|
||||
L2-Fabric Write Latency: The time-averaged number of cycles write requests spent
|
||||
in Infinity Fabric before a completion acknowledgement was returned to the L2.
|
||||
sL1D Cache Hit Rate: The percent of sL1D requests that hit on a previously loaded
|
||||
line in the cache. Calculated as the ratio of the number of sL1D requests that
|
||||
hit over the number of all sL1D requests.
|
||||
sL1D Cache BW: The number of bytes looked up in the sL1D cache per unit time.
|
||||
This is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L1I Hit Rate: The percent of L1I requests that hit on a previously loaded line in the cache.
Calculated as the ratio of the number of L1I requests that hit over the number
of all L1I requests.
L1I BW: The number of bytes looked up in the L1I cache per unit time. This
is also presented as a percent of the peak theoretical bandwidth achievable
on the specific accelerator.
|
||||
L1I Fetch Latency: The average number of cycles spent to fetch instructions to
|
||||
a CU.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 201
|
||||
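Several of the System Speed-of-Light descriptions above are defined either as a cycle ratio (for example, VALU Utilization) or as an average over instructions (VALU Active Threads). The sketch below restates those two definitions in Python under simplifying assumptions: the inputs are already-aggregated counts, and all names are hypothetical rather than real hardware counter names.

```python
def utilization_pct(issue_cycles, total_cu_cycles):
    """Percent of total CU cycles spent issuing one instruction type.

    Returns None when no CU cycles were recorded, to avoid dividing by zero.
    """
    if total_cu_cycles == 0:
        return None
    return 100 * issue_cycles / total_cu_cycles


def mean_active_threads(active_per_instruction):
    """Average number of active work-items per VALU instruction.

    `active_per_instruction` holds one entry per executed VALU instruction:
    the count of work-items active in the wavefront for that instruction.
    """
    if not active_per_instruction:
        return None
    return sum(active_per_instruction) / len(active_per_instruction)
```

A wavefront that alternates between fully active and half-active instructions averages out between the two, which is why a low value points to divergence.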
@@ -344,3 +226,130 @@ Panel Config:
|
||||
peak: None
|
||||
pop: None
|
||||
coll_level: SQ_IFETCH_LEVEL
|
||||
metrics_description:
|
||||
VALU FLOPs: |-
|
||||
The total floating-point operations executed per second on the VALU.
|
||||
This is also presented as a percent of the peak theoretical FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any floating-point
|
||||
operations from MFMA instructions.
|
||||
VALU IOPs: |-
|
||||
The total integer operations executed per second on the VALU. This is
|
||||
also presented as a percent of the peak theoretical IOPs achievable on the
|
||||
specific accelerator. Note: this does not include any integer operations from
|
||||
MFMA instructions.
|
||||
MFMA FLOPs (F8): The total number of 8-bit brain floating point MFMA operations
|
||||
executed per second. This does not include any 16-bit brain floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F8 MFMA operations achievable on the specific accelerator. It is supported on
|
||||
AMD Instinct MI300 series and later only.
|
||||
MFMA FLOPs (BF16): |-
|
||||
The total number of 16-bit brain floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit brain floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
BF16 MFMA operations achievable on the specific accelerator.
|
||||
MFMA FLOPs (F16): |-
|
||||
The total number of 16-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 16-bit floating point operations from
|
||||
VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F16 MFMA operations achievable on the specific accelerator.
|
||||
MFMA FLOPs (F32): |-
|
||||
The total number of 32-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 32-bit floating point operations from
|
||||
VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F32 MFMA operations achievable on the specific accelerator.
|
||||
MFMA FLOPs (F64): |-
|
||||
The total number of 64-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 64-bit floating point operations from
|
||||
VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F64 MFMA operations achievable on the specific accelerator.
|
||||
MFMA IOPs (Int8): |-
|
||||
The total number of 8-bit integer MFMA operations executed per second.
|
||||
Note: this does not include any 8-bit integer operations from VALU instructions.
|
||||
This is also presented as a percent of the peak theoretical INT8 MFMA operations
|
||||
achievable on the specific accelerator.
|
||||
Active CUs: Total number of active compute units (CUs) on the accelerator during
|
||||
the kernel execution.
|
||||
SALU Utilization: Indicates what percent of the kernel's duration the SALU was
|
||||
busy executing instructions. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
|
||||
VALU Utilization: Indicates what percent of the kernel's duration the VALU was
|
||||
busy executing instructions. Does not include VMEM operations. Computed as the
|
||||
ratio of the total number of cycles spent by the scheduler issuing VALU instructions
|
||||
over the total CU cycles.
|
||||
MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
|
||||
was busy executing instructions. Computed as the ratio of the total number of
|
||||
cycles the MFMA was busy over the total CU cycles.
|
||||
VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
|
||||
was busy executing instructions, including both global/generic and spill/scratch
|
||||
operations (see the VMEM instruction count metrics for more detail). Does not
|
||||
include VALU operations. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing VMEM instructions over the total CU cycles.
|
||||
Branch Utilization: Indicates what percent of the kernel's duration the branch
|
||||
unit was busy executing instructions. Computed as the ratio of the total number
|
||||
of cycles spent by the scheduler issuing branch instructions over the total
|
||||
CU cycles.
|
||||
VALU Active Threads: Indicates the average level of divergence within a wavefront
|
||||
over the lifetime of the kernel. The number of work-items that were active in
|
||||
a wavefront during execution of each VALU instruction, time-averaged over all
|
||||
VALU instructions run on all wavefronts in the kernel.
|
||||
IPC: The ratio of the total number of instructions executed on the CU over the
|
||||
total active CU cycles. This is also presented as a percent of the peak theoretical
|
||||
bandwidth achievable on the specific accelerator.
|
||||
Wavefront Occupancy: |-
|
||||
The time-averaged number of wavefronts resident on the accelerator over
|
||||
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
||||
kernels (less than 1 ms). This is also presented as a percent of the peak theoretical
|
||||
occupancy achievable on the specific accelerator.
|
||||
Theoretical LDS Bandwidth: Indicates the maximum amount of bytes that could have
|
||||
been loaded from, stored to, or atomically updated in the LDS per unit time
|
||||
(see LDS Bandwidth example for more detail). This is also presented as a percent
|
||||
of the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
LDS Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS
|
||||
scheduler due to bank conflicts (as determined by the conflict resolution hardware)
|
||||
to the base number of cycles that would be spent in the LDS scheduler in a completely
|
||||
uncontended case. This is also presented in normalized form (i.e., the Bank
|
||||
Conflict Rate).
|
||||
vL1D Cache Hit Rate: The ratio of the number of vL1D cache line requests that
|
||||
hit in vL1D cache over the total number of cache line requests to the vL1D cache
|
||||
RAM.
|
||||
vL1D Cache BW: The number of bytes looked up in the vL1D cache as a result of
|
||||
VMEM instructions per unit time. The number of bytes is calculated as the number
|
||||
of cache lines requested multiplied by the cache line size. This value does
|
||||
not consider partial requests, so e.g., if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
This is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L2 Cache Hit Rate: The ratio of the number of L2 cache line requests that hit
|
||||
in the L2 cache over the total number of incoming cache line requests to the
|
||||
L2 cache.
|
||||
L2 Cache BW: The number of bytes looked up in the L2 cache per unit time. The
|
||||
number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line. This is also presented as a percent of
|
||||
the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Read BW: |-
|
||||
The number of bytes read by the L2 over the Infinity Fabric™ interface
|
||||
per unit time. This is also presented as a percent of the peak theoretical
|
||||
bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Write BW: The number of bytes sent by the L2 over the Infinity Fabric
|
||||
interface by write and atomic operations per unit time. This is also presented
|
||||
as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Read Latency: The time-averaged number of cycles read requests spent
|
||||
in Infinity Fabric before data was returned to the L2.
|
||||
L2-Fabric Write Latency: The time-averaged number of cycles write requests spent
|
||||
in Infinity Fabric before a completion acknowledgement was returned to the L2.
|
||||
sL1D Cache Hit Rate: The percent of sL1D requests that hit on a previously loaded
|
||||
line in the cache. Calculated as the ratio of the number of sL1D requests that
|
||||
hit over the number of all sL1D requests.
|
||||
sL1D Cache BW: The number of bytes looked up in the sL1D cache per unit time.
|
||||
This is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L1I Hit Rate: The percent of L1I requests that hit on a previously loaded line in
|
||||
the cache. Calculated as the ratio of the number of L1I requests that hit over
|
||||
the number of all L1I requests.
|
||||
L1I BW: The number of bytes looked up in the L1I cache per unit time. This is also
|
||||
presented as a percent of the peak theoretical bandwidth achievable on the specific
|
||||
accelerator.
|
||||
L1I Fetch Latency: The average number of cycles spent to fetch instructions to
|
||||
a CU.
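The hit-rate and cache-bandwidth descriptions above reduce to simple ratios. The following is a minimal Python sketch of that arithmetic, not the profiler's implementation: the function names are hypothetical, and the 64-byte cache line in the usage lines is only an illustrative value (the actual line size depends on the accelerator and is not stated in these descriptions).

```python
def hit_rate_percent(hits: int, total_requests: int) -> float:
    """Percent of requests that hit; 0.0 when there were no requests."""
    if total_requests == 0:
        return 0.0
    return 100.0 * hits / total_requests


def cache_bandwidth_bytes_per_s(
    line_requests: int, line_size_bytes: int, duration_ns: float
) -> float:
    """Bytes looked up per second, counting every request as a full cache line.

    Partial requests are deliberately not discounted, matching the
    descriptions: one requested value still counts as a whole line.
    """
    if duration_ns <= 0:
        return 0.0
    return (line_requests * line_size_bytes) / (duration_ns / 1e9)


if __name__ == "__main__":
    # Illustrative inputs only (assumed values, not measured data).
    print(hit_rate_percent(75, 100))
    print(cache_bandwidth_bytes_per_s(1000, 64, 1e6))
```

Guarding the zero-request and zero-duration cases keeps the helpers usable on kernels that never touch a given cache.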
|
||||
|
||||
+117
-119
@@ -2,122 +2,6 @@
|
||||
Panel Config:
|
||||
id: 300
|
||||
title: Memory Chart
|
||||
metrics_description:
|
||||
Wavefront Occupancy: Wavefronts per active CU.
|
||||
Wave Life: Average number of cycles executing a wave.
|
||||
SALU: Total number of SALU (Scalar ALU) instructions issued per normalization
|
||||
unit.
|
||||
SMEM: Total number of SMEM (Scalar Memory Read) instructions issued per normalization
|
||||
unit.
|
||||
VALU: The number of VALU (Vector ALU) instructions issued per normalization unit.
|
||||
MFMA: Total number of MFMA (Matrix-Fused-Multiply-Add) instructions issued per
|
||||
normalization unit.
|
||||
VMEM: The number of VMEM (GPU Memory) read instructions issued (including FLAT/scratch
|
||||
memory) per normalization unit.
|
||||
LDS: The total number of LDS instructions (including, but not limited to, read/write/atomics
|
||||
and HIP's __shfl instructions) executed per normalization unit.
|
||||
GWS: Total number of GDS (Global Data Share) instructions issued per normalization
|
||||
unit.
|
||||
BR: Total number of BRANCH instructions issued per normalization unit.
|
||||
Active CUs: Total number of active compute units (CUs) on the accelerator during
|
||||
the kernel execution.
|
||||
Num CUs: Total number of compute units (CUs) on the accelerator.
|
||||
VGPR: 'The number of architected vector general-purpose registers allocated for
|
||||
the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
|
||||
by the compiler due to allocation granularity.'
|
||||
SGPR: 'The number of scalar general-purpose registers allocated for the kernel,
|
||||
see SALU. Note: this may not exactly match the number of SGPRs requested by
|
||||
the compiler due to allocation granularity.'
|
||||
LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
|
||||
for this kernel. Note: This may also be larger than what was requested at compile
|
||||
time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
|
||||
Scratch Allocation: The number of bytes of scratch memory requested per work-item
|
||||
for this kernel. Scratch memory is used for stack memory on the accelerator,
|
||||
as well as for register spills and restores.
|
||||
Wavefronts: The total number of wavefronts, summed over all workgroups, forming
|
||||
this kernel launch.
|
||||
Workgroups: The total number of workgroups forming this kernel launch.
|
||||
LDS Req: The total number of LDS instructions (including, but not limited to,
|
||||
read/write/atomics and HIP's __shfl instructions) executed per normalization
|
||||
unit.
|
||||
LDS Util: Indicates what percent of the kernel's duration the LDS was actively
|
||||
executing instructions (including, but not limited to, load, store, atomic and
|
||||
HIP's __shfl operations). Calculated as the ratio of the total number of cycles
|
||||
LDS was active over the total CU cycles.
|
||||
LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
|
||||
/ acknowledgment) required for an LDS instruction to complete.
|
||||
VL1 Rd: The total number of incoming read requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
VL1 Wr: The total number of incoming write requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
VL1 Atomic: The total number of incoming atomic requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
VL1 Hit: The ratio of the number of vL1D cache line requests that hit in vL1D
|
||||
cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
VL1 Lat: Calculated as the average number of cycles that a vL1D cache line request
|
||||
spent in the vL1D cache pipeline.
|
||||
VL1 Coalesce: Indicates how well memory instructions were coalesced by the address
|
||||
processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
|
||||
as the average number of thread-requests generated per instruction divided by
|
||||
the ideal number of thread-requests per instruction.
|
||||
VL1 Stall: The ratio of the number of cycles where the vL1D is stalled waiting
|
||||
to issue a request for data to the L2 cache divided by the number of cycles
|
||||
where the vL1D is active.
|
||||
VL1_L2 Rd: The number of read requests for a vL1D cache line that were not satisfied
|
||||
by the vL1D and must be retrieved from the L2 cache, per normalization
|
||||
unit.
|
||||
VL1_L2 Wr: The number of write requests to a vL1D cache line that were sent through
|
||||
the vL1D to the L2 cache, per normalization unit.
|
||||
VL1_L2 Atomic: The number of atomic requests that are sent through the vL1D to
|
||||
the L2 cache, per normalization unit. This includes requests for atomics with,
|
||||
and without return.
|
||||
sL1D Rd: The total number of requests, of any size or type, made to the sL1D per
|
||||
normalization unit.
|
||||
sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
|
||||
line, per normalization unit.
|
||||
sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
|
||||
unit.
|
||||
sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
sL1D_L2 Atomic: The total number of atomic requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
IL1 Fetch: The total number of requests made to the L1I per normalization-unit.
|
||||
IL1 Hit: The percent of L1I requests that hit on a previously loaded line in the
|
||||
cache. Calculated as the ratio of the number of L1I requests that hit over the
|
||||
number of all L1I requests.
|
||||
IL1 Lat: The average number of cycles spent to fetch instructions to a CU.
|
||||
IL1_L2 Rd: The total number of requests across the L1I - L2 interface per normalization-unit.
|
||||
L2 Rd: The total number of read requests to the L2 from all clients.
|
||||
L2 Wr: The total number of write requests to the L2 from all clients.
|
||||
L2 Atomic: The total number of atomic requests (with and without return) to the
|
||||
L2 from all clients.
|
||||
L2 Hit: The ratio of the number of L2 cache line requests that hit in the L2 cache
|
||||
over the total number of incoming cache line requests to the L2 cache.
|
||||
L2 Rd Lat: Calculated as the average number of cycles that the vL1D cache took
|
||||
to issue and receive read requests from the L2 Cache. This number also includes
|
||||
requests for atomics with return values.
|
||||
L2 Wr Lat: Calculated as the average number of cycles that the vL1D cache took
|
||||
to issue and receive acknowledgement of a write request to the L2 Cache. This
|
||||
number also includes requests for atomics without return values.
|
||||
Fabric_L2 Rd: Number of L2 cache - Infinity Fabric read requests (either 32-byte
|
||||
or 64-byte) summed over TCC instances per normalization unit.
|
||||
Fabric_L2 Wr: Number of L2 cache - Infinity Fabric write requests (either 32-byte
|
||||
or 64-byte) summed over TCC instances per normalization unit.
|
||||
Fabric_L2 Atomic: Number of L2 cache - Infinity Fabric write requests (either
|
||||
32-byte or 64-byte) that are actually atomic requests summed over TCC instances
|
||||
per normalization unit.
|
||||
Fabric Rd Lat: The time-averaged number of cycles read requests spent in Infinity
|
||||
Fabric before data was returned to the L2.
|
||||
Fabric Wr Lat: The time-averaged number of cycles write requests spent in Infinity
|
||||
Fabric before a completion acknowledgement was returned to the L2.
|
||||
Fabric Atomic Lat: The time-averaged number of cycles atomic requests spent in
|
||||
Infinity Fabric before a completion acknowledgement (atomic without return value)
|
||||
or data (atomic with return value) was returned to the L2.
|
||||
HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
|
||||
of data from the accelerator's local HBM, per normalization unit.
|
||||
HBM Wr: 'The total number of L2 requests to Infinity Fabric to write or atomically
|
||||
update 32B or 64B of data in the accelerator''s local HBM, per normalization
|
||||
unit. '
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 301
|
||||
@@ -244,13 +128,13 @@ Panel Config:
|
||||
value: ROUND(AVG((TCC_EA0_ATOMIC_sum / $denom)), 0)
|
||||
Fabric Rd Lat:
|
||||
value: ROUND(AVG(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else 0)), 0)
|
||||
!= 0) else 0)), 0)
|
||||
Fabric Wr Lat:
|
||||
value: ROUND(AVG(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else 0)), 0)
|
||||
!= 0) else 0)), 0)
|
||||
Fabric Atomic Lat:
|
||||
value: ROUND(AVG(((TCC_EA0_ATOMIC_LEVEL_sum / TCC_EA0_ATOMIC_sum) if (TCC_EA0_ATOMIC_sum
|
||||
!= 0) else 0)), 0)
|
||||
!= 0) else 0)), 0)
|
||||
HBM Rd:
|
||||
value: ROUND(AVG((TCC_EA0_RDREQ_DRAM_sum / $denom)), 0)
|
||||
HBM Wr:
|
||||
@@ -258,3 +142,117 @@ Panel Config:
|
||||
comparable: false
|
||||
cli_style: mem_chart
|
||||
tui_style: mem_chart
|
||||
metrics_description:
|
||||
Wavefront Occupancy: Wavefronts per active CU.
|
||||
Wave Life: Average number of cycles executing a wave.
|
||||
SALU: Total number of SALU (Scalar ALU) instructions issued per normalization
|
||||
unit.
|
||||
SMEM: Total number of SMEM (Scalar Memory Read) instructions issued per normalization
|
||||
unit.
|
||||
VALU: The number of VALU (Vector ALU) instructions issued per normalization unit.
|
||||
MFMA: Total number of MFMA (Matrix-Fused-Multiply-Add) instructions issued per
|
||||
normalization unit.
|
||||
VMEM: The number of VMEM (GPU Memory) read instructions issued (including FLAT/scratch
|
||||
memory) per normalization unit.
|
||||
LDS: The total number of LDS instructions (including, but not limited to, read/write/atomics
|
||||
and HIP's __shfl instructions) executed per normalization unit.
|
||||
GWS: Total number of GDS (Global Data Share) instructions issued per normalization
|
||||
unit.
|
||||
BR: Total number of BRANCH instructions issued per normalization unit.
|
||||
Active CUs: Total number of active compute units (CUs) on the accelerator during
|
||||
the kernel execution.
|
||||
Num CUs: Total number of compute units (CUs) on the accelerator.
|
||||
VGPR: |-
|
||||
The number of architected vector general-purpose registers allocated
|
||||
for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
|
||||
requested by the compiler due to allocation granularity.
|
||||
SGPR: |-
|
||||
The number of scalar general-purpose registers allocated for the kernel,
|
||||
see SALU. Note: this may not exactly match the number of SGPRs requested by
|
||||
the compiler due to allocation granularity.
|
||||
LDS Allocation: |-
|
||||
The number of bytes of LDS memory (or, shared memory) allocated for
|
||||
this kernel. Note: This may also be larger than what was requested at compile
|
||||
time due to both allocation granularity and dynamic per-dispatch LDS allocations.
|
||||
Scratch Allocation: The number of bytes of scratch memory requested per work-item
|
||||
for this kernel. Scratch memory is used for stack memory on the accelerator,
|
||||
as well as for register spills and restores.
|
||||
Wavefronts: The total number of wavefronts, summed over all workgroups, forming
|
||||
this kernel launch.
|
||||
Workgroups: The total number of workgroups forming this kernel launch.
|
||||
LDS Req: The total number of LDS instructions (including, but not limited to,
|
||||
read/write/atomics and HIP's __shfl instructions) executed per normalization
|
||||
unit.
|
||||
LDS Util: Indicates what percent of the kernel's duration the LDS was actively
|
||||
executing instructions (including, but not limited to, load, store, atomic and
|
||||
HIP's __shfl operations). Calculated as the ratio of the total number of cycles
|
||||
LDS was active over the total CU cycles.
|
||||
LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
|
||||
/ acknowledgment) required for an LDS instruction to complete.
|
||||
VL1 Rd: The total number of incoming read requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
VL1 Wr: The total number of incoming write requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
VL1 Atomic: The total number of incoming atomic requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
VL1 Hit: The ratio of the number of vL1D cache line requests that hit in vL1D
|
||||
cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
VL1 Lat: Calculated as the average number of cycles that a vL1D cache line request
|
||||
spent in the vL1D cache pipeline.
|
||||
VL1 Coalesce: Indicates how well memory instructions were coalesced by the address
|
||||
processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
|
||||
as the average number of thread-requests generated per instruction divided by
|
||||
the ideal number of thread-requests per instruction.
|
||||
VL1 Stall: The ratio of the number of cycles where the vL1D is stalled waiting
|
||||
to issue a request for data to the L2 cache divided by the number of cycles
|
||||
where the vL1D is active.
|
||||
VL1_L2 Rd: The number of read requests for a vL1D cache line that were not satisfied
|
||||
by the vL1D and must be retrieved from the L2 cache, per normalization
|
||||
unit.
|
||||
VL1_L2 Wr: The number of write requests to a vL1D cache line that were sent through
|
||||
the vL1D to the L2 cache, per normalization unit.
|
||||
VL1_L2 Atomic: The number of atomic requests that are sent through the vL1D to
|
||||
the L2 cache, per normalization unit. This includes requests for atomics with,
|
||||
and without return.
|
||||
sL1D Rd: The total number of requests, of any size or type, made to the sL1D per
|
||||
normalization unit.
|
||||
sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
|
||||
line, per normalization unit.
|
||||
sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
|
||||
unit.
|
||||
sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
sL1D_L2 Atomic: The total number of atomic requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
IL1 Fetch: The total number of requests made to the L1I per normalization-unit.
|
||||
IL1 Hit: The percent of L1I requests that hit on a previously loaded line in the
|
||||
cache. Calculated as the ratio of the number of L1I requests that hit over the
|
||||
number of all L1I requests.
|
||||
IL1 Lat: The average number of cycles spent to fetch instructions to a CU.
|
||||
IL1_L2 Rd: The total number of requests across the L1I - L2 interface per normalization-unit.
|
||||
L2 Rd: The total number of read requests to the L2 from all clients.
|
||||
L2 Wr: The total number of write requests to the L2 from all clients.
|
||||
L2 Atomic: The total number of atomic requests (with and without return) to the
|
||||
L2 from all clients.
|
||||
L2 Hit: The ratio of the number of L2 cache line requests that hit in the L2 cache
|
||||
over the total number of incoming cache line requests to the L2 cache.
|
||||
Fabric_L2 Rd: Number of L2 cache - Infinity Fabric read requests (either 32-byte
|
||||
or 64-byte) summed over TCC instances per normalization unit.
|
||||
Fabric_L2 Wr: Number of L2 cache - Infinity Fabric write requests (either 32-byte
|
||||
or 64-byte) summed over TCC instances per normalization unit.
|
||||
Fabric_L2 Atomic: Number of L2 cache - Infinity Fabric write requests (either
|
||||
32-byte or 64-byte) that are actually atomic requests summed over TCC instances
|
||||
per normalization unit.
|
||||
Fabric Rd Lat: The time-averaged number of cycles read requests spent in Infinity
|
||||
Fabric before data was returned to the L2.
|
||||
Fabric Wr Lat: The time-averaged number of cycles write requests spent in Infinity
|
||||
Fabric before a completion acknowledgement was returned to the L2.
|
||||
Fabric Atomic Lat: The time-averaged number of cycles atomic requests spent in
|
||||
Infinity Fabric before a completion acknowledgement (atomic without return value)
|
||||
or data (atomic with return value) was returned to the L2.
|
||||
HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
|
||||
of data from the accelerator's local HBM, per normalization unit.
|
||||
HBM Wr: |-
|
||||
The total number of L2 requests to Infinity Fabric to write or atomically
|
||||
update 32B or 64B of data in the accelerator's local HBM, per normalization
|
||||
unit.
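The Fabric latency expressions in this panel (for example `Fabric Rd Lat`) divide an accumulated level counter by a request counter, fall back to 0 when no requests were issued, then average and round. The following Python sketch mirrors that pattern under two assumptions: that `AVG` averages the per-dispatch ratios, and that each dispatch is given as a `(level_sum, request_sum)` pair. The function name and input shape are hypothetical.

```python
from statistics import mean


def fabric_latency_cycles(dispatches):
    """Average per-dispatch latency in cycles, rounded to an integer.

    `dispatches` is an iterable of (level_sum, request_sum) pairs, e.g. the
    values of TCC_EA0_RDREQ_LEVEL_sum and TCC_EA0_RDREQ_sum per dispatch.
    """
    # Guard against division by zero exactly as the config expression does.
    ratios = [(level / reqs) if reqs != 0 else 0 for level, reqs in dispatches]
    if not ratios:
        return 0
    # Note: Python's round() rounds halves to even, which may differ from
    # the ROUND used by the analysis backend.
    return round(mean(ratios))


if __name__ == "__main__":
    # Illustrative counter values only (assumed, not measured).
    print(fabric_latency_cycles([(1000, 10), (0, 0)]))
```

The zero guard matters for kernels that generate no Fabric traffic of a given type, where the request counter is 0.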
|
||||
|
||||
+88
-79
@@ -2,85 +2,6 @@
|
||||
Panel Config:
|
||||
id: 400
|
||||
title: Roofline
|
||||
metrics_description:
|
||||
VALU FLOPs (F16): 'The total 16-bit floating-point operations executed per second
|
||||
on the VALU. This is presented with the value of the peak empirical F16 FLOPs
|
||||
achievable on the specific accelerator. Note: this does not include any F16
|
||||
operations from MFMA instructions.'
|
||||
VALU FLOPs (F32): 'The total 32-bit floating-point operations executed per second
|
||||
on the VALU. This is presented with the value of the peak empirical F32 FLOPs
|
||||
achievable on the specific accelerator. Note: this does not include any F32
|
||||
operations from MFMA instructions.'
|
||||
VALU FLOPs (F64): 'The total 64-bit floating-point operations executed per second
|
||||
on the VALU. This is presented with the value of the peak empirical F64 FLOPs
|
||||
achievable on the specific accelerator. Note: this does not include any F64
|
||||
operations from MFMA instructions.'
|
||||
MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
|
||||
executed per second. This does not include any 8-bit floating point operations
|
||||
from VALU instructions. The peak empirically measured F8 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison. It is supported
|
||||
on AMD Instinct MI300 series and later only.
|
||||
MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
|
||||
executed per second. Note: this does not include any 16-bit brain floating point
|
||||
operations from VALU instructions. The peak empirically measured BF16 MFMA operations
|
||||
achievable on the specific accelerator is displayed alongside for comparison.'
|
||||
MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit floating point operations
|
||||
from VALU instructions. The peak empirically measured F16 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison.'
|
||||
MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 32-bit floating point operations
|
||||
from VALU instructions. The peak empirically measured F32 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison.'
|
||||
MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 64-bit floating point operations
|
||||
from VALU instructions. The peak empirically measured F64 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison.'
|
||||
MFMA FLOPs (F6F4): 'The total number of 4-bit and 6-bit floating point MFMA operations
|
||||
executed per second. Note: this does not include any floating point operations
|
||||
from VALU instructions. The peak empirically measured F6F4 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison. It is supported
|
||||
on AMD Instinct MI350 series (gfx950) and later only.'
|
||||
MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
|
||||
per second. Note: this does not include any 8-bit integer operations from VALU
|
||||
instructions. The peak empirically measured INT8 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison.'
|
||||
HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth
|
||||
Memory (HBM) per second. The peak empirically measured bandwidth achievable
|
||||
on the specific accelerator is displayed alongside for comparison.
|
||||
L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
|
||||
The number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line. The peak empirically measured bandwidth
|
||||
achievable on the specific accelerator is displayed alongside for comparison.
|
||||
L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
|
||||
of VMEM instructions per unit time. The number of bytes is calculated as the
|
||||
number of cache lines requested multiplied by the cache line size. This value
|
||||
does not consider partial requests, so e.g., if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
The peak empirically measured bandwidth achievable on the specific accelerator
|
||||
is displayed alongside for comparison.
|
||||
LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded
|
||||
from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
|
||||
example for more detail). The peak empirically measured LDS bandwidth achievable
|
||||
on the specific accelerator is displayed alongside for comparison.
|
||||
AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
|
||||
of total floating-point operations (FLOPs) to total bytes transferred between
|
||||
the L1 cache and the processing units. This value is used as the x-coordinate
|
||||
for the L1 roofline.
|
||||
AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
|
||||
of total floating-point operations (FLOPs) to total bytes transferred between
|
||||
the L2 cache and the L1 cache. This value is used as the x-coordinate for the
|
||||
L2 roofline.
|
||||
AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
|
||||
It is the ratio of total floating-point operations (FLOPs) to total bytes transferred
|
||||
between HBM and the L2 cache. This value is used as the x-coordinate for the
|
||||
HBM roofline.
|
||||
Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs
|
||||
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
|
||||
operations divided by the total execution time. This value is used as the y-coordinate
|
||||
for the kernel's point on the Roofline plot.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 401
|
||||
@@ -218,3 +139,91 @@ Panel Config:
|
||||
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
|
||||
* 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9
|
||||
unit: GFLOP/s
|
||||
metrics_description:
|
||||
VALU FLOPs (F16): |-
|
||||
The total 16-bit floating-point operations executed per second on the VALU.
|
||||
This is presented with the value of the peak empirical F16 FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any F16 operations
|
||||
from MFMA instructions.
|
||||
VALU FLOPs (F32): |-
|
||||
The total 32-bit floating-point operations executed per second on the VALU.
|
||||
This is presented with the value of the peak empirical F32 FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any F32 operations
|
||||
from MFMA instructions.
|
||||
VALU FLOPs (F64): |-
|
||||
The total 64-bit floating-point operations executed per second on the VALU.
|
||||
This is presented with the value of the peak empirical F64 FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any F64 operations
|
||||
from MFMA instructions.
|
||||
MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
|
||||
executed per second. This does not include any 8-bit floating point operations
|
||||
from VALU instructions. The peak empirically measured F8 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison. It is supported
|
||||
on AMD Instinct MI300 series and later only.
|
||||
MFMA FLOPs (BF16): |-
|
||||
The total number of 16-bit brain floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit brain floating point
|
||||
operations from VALU instructions. The peak empirically measured BF16 MFMA
|
||||
operations achievable on the specific accelerator is displayed alongside
|
||||
for comparison.
|
||||
MFMA FLOPs (F16): |-
|
||||
The total number of 16-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 16-bit floating point operations from
|
||||
VALU instructions. The peak empirically measured F16 MFMA operations
|
||||
achievable on the specific accelerator is displayed alongside for comparison.
|
||||
MFMA FLOPs (F32): |-
|
||||
The total number of 32-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 32-bit floating point operations from
|
||||
VALU instructions. The peak empirically measured F32 MFMA operations
|
||||
achievable on the specific accelerator is displayed alongside for comparison.
|
||||
MFMA FLOPs (F64): |-
|
||||
The total number of 64-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 64-bit floating point operations from
|
||||
VALU instructions. The peak empirically measured F64 MFMA operations
|
||||
achievable on the specific accelerator is displayed alongside for comparison.
|
||||
MFMA IOPs (Int8): |-
|
||||
The total number of 8-bit integer MFMA operations executed per second.
|
||||
Note: this does not include any 8-bit integer operations from VALU instructions.
|
||||
The peak empirically measured INT8 MFMA operations achievable on the specific
|
||||
accelerator is displayed alongside for comparison.
|
||||
HBM Bandwidth: |-
|
||||
The total number of bytes read from and written to High-Bandwidth
|
||||
Memory (HBM) per second. The peak empirically measured bandwidth achievable
|
||||
on the specific accelerator is displayed alongside for comparison.
|
||||
L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
|
||||
The number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line. The peak empirically measured bandwidth
|
||||
achievable on the specific accelerator is displayed alongside for comparison.
|
||||
L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
|
||||
of VMEM instructions per unit time. The number of bytes is calculated as the
|
||||
number of cache lines requested multiplied by the cache line size. This value
|
||||
does not consider partial requests, so e.g., if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
The peak empirically measured bandwidth achievable on the specific accelerator
|
||||
is displayed alongside for comparison.
|
||||
LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded
|
||||
from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
|
||||
example for more detail). The peak empirically measured LDS bandwidth achievable
|
||||
on the specific accelerator is displayed alongside for comparison.
|
||||
AI L1: |-
|
||||
The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
|
||||
of total floating-point operations (FLOPs) to total bytes transferred between
|
||||
the L1 cache and the processing units. This value is used as the x-coordinate
|
||||
for the L1 roofline.
|
||||
AI L2: |-
|
||||
The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
|
||||
of total floating-point operations (FLOPs) to total bytes transferred between
|
||||
the L2 cache and the L1 cache. This value is used as the x-coordinate for
|
||||
the L2 roofline.
|
||||
AI HBM: |-
|
||||
The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
|
||||
It is the ratio of total floating-point operations (FLOPs) to total bytes
|
||||
transferred between HBM and the L2 cache. This value is used as the x-coordinate
|
||||
for the HBM roofline.
|
||||
Performance (GFLOPs): |-
|
||||
The overall achieved performance, measured in GigaFLOPs
|
||||
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
|
||||
operations divided by the total execution time. This value is used as the y-coordinate
|
||||
for the kernel's point on the Roofline plot.
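As a rough illustration of how the roofline quantities described above relate to one another, the following sketch computes the three arithmetic intensities and the achieved GFLOP/s from raw totals. The function and parameter names are hypothetical and are not part of rocprofiler-compute; the byte totals are assumed to have been derived already (for example, cache lines requested multiplied by the cache line size).

```python
def roofline_point(total_flops, l1_bytes, l2_bytes, hbm_bytes, runtime_s):
    """Return illustrative roofline coordinates for a single kernel."""

    def intensity(num_bytes):
        # Arithmetic intensity is FLOPs per byte moved; undefined if no bytes moved.
        return total_flops / num_bytes if num_bytes else None

    return {
        "ai_l1": intensity(l1_bytes),    # x-coordinate on the L1 roofline
        "ai_l2": intensity(l2_bytes),    # x-coordinate on the L2 roofline
        "ai_hbm": intensity(hbm_bytes),  # x-coordinate on the HBM roofline
        # y-coordinate: total FLOPs over execution time, scaled to GFLOP/s
        "gflops": total_flops / runtime_s / 1e9 if runtime_s else None,
    }


# Made-up totals, chosen only to keep the arithmetic easy to follow.
point = roofline_point(
    total_flops=2e9, l1_bytes=4e9, l2_bytes=1e9, hbm_bytes=5e8, runtime_s=0.5
)
```

With these made-up inputs the kernel would be plotted at the same height on all three rooflines, with its x-coordinate increasing from L1 to HBM because fewer bytes are moved at each level further from the compute units.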
|
||||
|
||||
+25
-24
@@ -2,30 +2,6 @@
|
||||
Panel Config:
|
||||
id: 500
|
||||
title: Command Processor (CPC/CPF)
|
||||
metrics_description:
|
||||
CPF Utilization: Percent of total cycles where the CPF was busy actively doing
|
||||
any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
|
||||
CPF Stall: Percent of CPF busy cycles where the CPF was stalled for any reason.
|
||||
CPF-L2 Utilization: Percent of total cycles counted by the CPF-L2 interface where
|
||||
the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
|
||||
over total cycles counted by the CPF-L2.
|
||||
CPF-L2 Stall: Percent of CPF-L2 busy cycles where the CPF-L2 interface was
|
||||
stalled for any reason.
|
||||
CPF-UTCL1 Stall: Percent of CPF busy cycles where the CPF was stalled by address
|
||||
translation.
|
||||
CPC Utilization: Percent of total cycles where the CPC was busy actively doing
|
||||
any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
|
||||
CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
|
||||
CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
|
||||
for processing.
|
||||
CPC-Workgroup Manager Utilization: Percent of CPC busy cycles spent dispatching
|
||||
workgroups to the workgroup manager.
|
||||
CPC-L2 Utilization: Percent of total cycles counted by the CPC-L2 interface where
|
||||
the CPC-L2 interface was active doing any work.
|
||||
CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
|
||||
translation.
|
||||
CPC-UTCL2 Utilization: 'Percent of total cycles counted by the CPC''s L2 address
|
||||
translation interface where the CPC was busy doing address translation work. '
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 501
|
||||
@@ -143,3 +119,28 @@ Panel Config:
|
||||
max: MAX((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
|
||||
if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
metrics_description:
|
||||
CPF Utilization: Percent of total cycles where the CPF was busy actively doing
|
||||
any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
|
||||
CPF Stall: Percent of CPF busy cycles where the CPF was stalled for any reason.
|
||||
CPF-L2 Utilization: Percent of total cycles counted by the CPF-L2 interface where
|
||||
the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
|
||||
over total cycles counted by the CPF-L2.
|
||||
CPF-L2 Stall: Percent of CPF-L2 busy cycles where the CPF-L2 interface was
|
||||
stalled for any reason.
|
||||
CPF-UTCL1 Stall: Percent of CPF busy cycles where the CPF was stalled by address
|
||||
translation.
|
||||
CPC Utilization: Percent of total cycles where the CPC was busy actively doing
|
||||
any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
|
||||
CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
|
||||
CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
|
||||
for processing.
|
||||
CPC-Workgroup Manager Utilization: Percent of CPC busy cycles spent dispatching
|
||||
workgroups to the workgroup manager.
|
||||
CPC-L2 Utilization: Percent of total cycles counted by the CPC-L2 interface where
|
||||
the CPC-L2 interface was active doing any work.
|
||||
CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
|
||||
translation.
|
||||
CPC-UTCL2 Utilization: |-
|
||||
Percent of total cycles counted by the CPC's L2 address translation
|
||||
interface where the CPC was busy doing address translation work.
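The `max:` expression for CPC-UTCL2 Utilization in the hunk above follows a guarded-percentage pattern: busy cycles over busy plus idle cycles, with `None` when no cycles were counted. A minimal sketch of that pattern is shown below. The helper name and the sample values are hypothetical, and skipping `None` entries when taking the maximum is an assumption about how the aggregate behaves, not something the diff confirms.

```python
def utilization_pct(busy, idle):
    """Percent of counted cycles that were busy, or None if nothing was counted."""
    total = busy + idle
    return 100 * busy / total if total != 0 else None


# Hypothetical (busy, idle) counter pairs, one per sample.
samples = [(30, 70), (0, 0), (50, 50)]
values = [utilization_pct(busy, idle) for busy, idle in samples]

# Assumption: undefined samples are ignored when aggregating.
valid = [v for v in values if v is not None]
peak = max(valid) if valid else None
```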
|
||||
|
||||
+55
-55
@@ -2,61 +2,6 @@
|
||||
Panel Config:
|
||||
id: 600
|
||||
title: Workgroup Manager (SPI)
|
||||
metrics_description:
|
||||
Accelerator Utilization: The percent of cycles in the kernel where the accelerator
|
||||
was actively doing any work.
|
||||
Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
|
||||
kernel where the scheduler-pipes were actively doing any work.
|
||||
Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
|
||||
manager was actively doing any work.
|
||||
Shader Engine Utilization: The percent of total shader engine cycles in the kernel
|
||||
where any CU in a shader-engine was actively doing any work, normalized over
|
||||
all shader-engines. Low values (e.g., << 100%) indicate that the accelerator
|
||||
was not fully saturated by the kernel, or a potential load-imbalance issue.
|
||||
SIMD Utilization: The percent of total SIMD cycles in the kernel where any SIMD
|
||||
on a CU was actively doing any work, summed over all CUs. Low values (less than
|
||||
100%) indicate that the accelerator was not fully saturated by the kernel, or
|
||||
a potential load-imbalance issue.
|
||||
Dispatched Workgroups: The total number of workgroups forming this kernel launch.
|
||||
Dispatched Wavefronts: The total number of wavefronts, summed over all workgroups,
|
||||
forming this kernel launch.
|
||||
VGPR Writes: The average number of cycles spent initializing VGPRs at wave creation.
|
||||
SGPR Writes: The average number of cycles spent initializing SGPRs at wave creation.
|
||||
Not-scheduled Rate (Workgroup Manager): The percent of total scheduler-pipe cycles
|
||||
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
|
||||
within the workgroup manager rather than a lack of a CU or SIMD with sufficient
|
||||
resources.
|
||||
Not-scheduled Rate (Scheduler-Pipe): 'The percent of total scheduler-pipe cycles
|
||||
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
|
||||
within the scheduler-pipes rather than a lack of a CU or SIMD with sufficient
|
||||
resources. '
|
||||
Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
|
||||
where a workgroup could not be scheduled to a CU due to occupancy limitations
|
||||
(like a lack of a CU or SIMD with sufficient resources).
|
||||
Scratch Stall Rate: The percent of total shader-engine cycles in the kernel where
|
||||
a workgroup could not be scheduled to a CU due to lack of private (a.k.a., scratch)
|
||||
memory slots. While this can reach up to 100%, note that the actual occupancy
|
||||
limitations on a kernel using private memory are typically quite small (for
|
||||
example, less than 1% of the total number of waves that can be scheduled to
|
||||
an accelerator).
|
||||
Insufficient SIMD Waveslots: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available waveslots.
|
||||
Insufficient SIMD VGPRs: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available VGPRs.
|
||||
Insufficient SIMD SGPRs: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available SGPRs.
|
||||
Insufficient CU LDS: The percent of total CU cycles in the kernel where a workgroup
|
||||
could not be scheduled to a CU due to lack of available LDS.
|
||||
Insufficient CU Barriers: The percent of total CU cycles in the kernel where a
|
||||
workgroup could not be scheduled to a CU due to lack of available barriers.
|
||||
Reached CU Workgroup Limit: The percent of total CU cycles in the kernel where
|
||||
a workgroup could not be scheduled to a CU due to limits within the workgroup
|
||||
manager. This is expected to always be zero on CDNA2 or newer accelerators
|
||||
(and small for previous accelerators).
|
||||
Reached CU Wavefront Limit: The percent of total CU cycles in the kernel where
|
||||
a wavefront could not be scheduled to a CU due to limits within the workgroup
|
||||
manager. This is expected to always be zero on CDNA2 or newer accelerators
|
||||
(and small for previous accelerators).
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 601
|
||||
@@ -199,3 +144,58 @@ Panel Config:
|
||||
min: MIN(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
metrics_description:
|
||||
Accelerator Utilization: The percent of cycles in the kernel where the accelerator
|
||||
was actively doing any work.
|
||||
Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
|
||||
kernel where the scheduler-pipes were actively doing any work.
|
||||
Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
|
||||
manager was actively doing any work.
|
||||
Shader Engine Utilization: The percent of total shader engine cycles in the kernel
|
||||
where any CU in a shader-engine was actively doing any work, normalized over
|
||||
all shader-engines. Low values (e.g., << 100%) indicate that the accelerator
|
||||
was not fully saturated by the kernel, or a potential load-imbalance issue.
|
||||
SIMD Utilization: The percent of total SIMD cycles in the kernel where any SIMD
|
||||
on a CU was actively doing any work, summed over all CUs. Low values (less than
|
||||
100%) indicate that the accelerator was not fully saturated by the kernel, or
|
||||
a potential load-imbalance issue.
|
||||
Dispatched Workgroups: The total number of workgroups forming this kernel launch.
|
||||
Dispatched Wavefronts: The total number of wavefronts, summed over all workgroups,
|
||||
forming this kernel launch.
|
||||
VGPR Writes: The average number of cycles spent initializing VGPRs at wave creation.
|
||||
SGPR Writes: The average number of cycles spent initializing SGPRs at wave creation.
|
||||
Not-scheduled Rate (Workgroup Manager): The percent of total scheduler-pipe cycles
|
||||
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
|
||||
within the workgroup manager rather than a lack of a CU or SIMD with sufficient
|
||||
resources.
|
||||
Not-scheduled Rate (Scheduler-Pipe): |-
|
||||
The percent of total scheduler-pipe cycles in the kernel where a workgroup
|
||||
could not be scheduled to a CU due to a bottleneck within the scheduler-pipes
|
||||
rather than a lack of a CU or SIMD with sufficient resources.
|
||||
Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
|
||||
where a workgroup could not be scheduled to a CU due to occupancy limitations
|
||||
(like a lack of a CU or SIMD with sufficient resources).
|
||||
Scratch Stall Rate: The percent of total shader-engine cycles in the kernel where
|
||||
a workgroup could not be scheduled to a CU due to lack of private (a.k.a., scratch)
|
||||
memory slots. While this can reach up to 100%, note that the actual occupancy
|
||||
limitations on a kernel using private memory are typically quite small (for
|
||||
example, less than 1% of the total number of waves that can be scheduled to
|
||||
an accelerator).
|
||||
Insufficient SIMD Waveslots: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available waveslots.
|
||||
Insufficient SIMD VGPRs: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available VGPRs.
|
||||
Insufficient SIMD SGPRs: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available SGPRs.
|
||||
Insufficient CU LDS: The percent of total CU cycles in the kernel where a workgroup
|
||||
could not be scheduled to a CU due to lack of available LDS.
|
||||
Insufficient CU Barriers: The percent of total CU cycles in the kernel where a
|
||||
workgroup could not be scheduled to a CU due to lack of available barriers.
|
||||
Reached CU Workgroup Limit: The percent of total CU cycles in the kernel where
|
||||
a workgroup could not be scheduled to a CU due to limits within the workgroup
|
||||
manager. This is expected to always be zero on CDNA2 or newer accelerators
|
||||
(and small for previous accelerators).
|
||||
Reached CU Wavefront Limit: The percent of total CU cycles in the kernel where
|
||||
a wavefront could not be scheduled to a CU due to limits within the workgroup
|
||||
manager. This is expected to always be zero on CDNA2 or newer accelerators
|
||||
(and small for previous accelerators).
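The `min:`/`max:` expressions for Reached CU Wavefront Limit in the hunk above divide a stall counter by the product of active cycles per XCD and the CU count, scaled by 400. The sketch below restates that expression as a plain function. The function and argument names are hypothetical stand-ins for `SPI_RA_WVLIM_STALL_CSN`, `$GRBM_GUI_ACTIVE_PER_XCD`, and `$cu_per_gpu`, and the zero-denominator guard is an addition for illustration that does not appear in the config expression.

```python
def wavefront_limit_pct(wvlim_stall_cycles, gui_active_per_xcd, cu_per_gpu):
    """Restates 400 * stall / (active_cycles_per_xcd * cu_count) from the config."""
    denom = gui_active_per_xcd * cu_per_gpu
    # Guard added here for safety; the original expression has no such check.
    return 400 * wvlim_stall_cycles / denom if denom else None


# Made-up counter values for a single dispatch.
pct = wavefront_limit_pct(
    wvlim_stall_cycles=25_000, gui_active_per_xcd=100_000, cu_per_gpu=100
)
```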
|
||||
|
||||
+63
-57
@@ -2,63 +2,6 @@
|
||||
Panel Config:
|
||||
id: 700
|
||||
title: Wavefront
|
||||
metrics_description:
|
||||
Grid Size: The total number of work-items (or, threads) launched as a part of
|
||||
the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied
|
||||
by the total workgroup (or, block) size.
|
||||
Workgroup Size: The total number of work-items (or, threads) in each workgroup
|
||||
(or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
|
||||
to the total block size.
|
||||
Total Wavefronts: "The total number of wavefronts launched as part of the kernel\
|
||||
\ dispatch. On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs,\
|
||||
\ the wavefront size is always 64 work-items. Thus, the total number of wavefronts\
|
||||
\ should be equivalent to the ceiling of grid size divided by 64."
|
||||
Saved Wavefronts: The total number of wavefronts saved at a context-save.
|
||||
Restored Wavefronts: The total number of wavefronts restored from a context-save.
|
||||
VGPRs: 'The number of architected vector general-purpose registers allocated for
|
||||
the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
|
||||
by the compiler due to allocation granularity.'
|
||||
AGPRs: 'The number of accumulation vector general-purpose registers allocated
|
||||
for the kernel, see AGPRs. Note: this may not exactly match the number of AGPRs
|
||||
requested by the compiler due to allocation granularity.'
|
||||
SGPRs: 'The number of scalar general-purpose registers allocated for the kernel,
|
||||
see SALU. Note: this may not exactly match the number of SGPRs requested by
|
||||
the compiler due to allocation granularity.'
|
||||
LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
|
||||
for this kernel. Note: This may also be larger than what was requested at compile
|
||||
time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
|
||||
Scratch Allocation: The number of bytes of scratch memory requested per work-item
|
||||
for this kernel. Scratch memory is used for stack memory on the accelerator,
|
||||
as well as for register spills and restores.
|
||||
Kernel Time: The total duration of the executed kernel.
|
||||
Kernel Time (Cycles): The total duration of the executed kernel in cycles.
|
||||
Instructions per wavefront: The average number of instructions (of all types)
|
||||
executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.
|
||||
Wave Cycles: The number of cycles a wavefront in the kernel dispatch spent resident
|
||||
on a compute unit per normalization unit. This is averaged over all wavefronts
|
||||
in a kernel dispatch.
|
||||
Dependency Wait Cycles: The number of cycles a wavefront in the kernel dispatch
|
||||
stalled waiting on memory of any kind (e.g., instruction fetch, vector or scalar memory) per normalization unit. This is averaged over
|
||||
all wavefronts in a kernel dispatch.
|
||||
Issue Wait Cycles: The number of cycles a wavefront in the kernel dispatch was
|
||||
unable to issue an instruction for any reason (e.g., execution pipe back-pressure,
|
||||
arbitration loss, etc.) per normalization unit. This counter is incremented
|
||||
at every cycle by all wavefronts on a CU unable to issue an instruction. As
|
||||
such, it is most useful to get a sense of how waves were spending their time,
|
||||
rather than identification of a precise limiter because another wave could be
|
||||
actively executing while a wave is issue stalled. The sum of this metric, Dependency
|
||||
Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
|
||||
Active Cycles: The average number of cycles a wavefront in the kernel dispatch
|
||||
was actively executing instructions per normalization unit. This measurement
|
||||
is made on a per-wavefront basis, and may include cycles that another wavefront
|
||||
spent actively executing (on another execution unit, for example) or was stalled.
|
||||
As such, it is most useful to get a sense of how waves were spending their time,
|
||||
rather than identification of a precise limiter. The sum of this metric, Issue
|
||||
Wait Cycles and Dependency Wait Cycles should be equal to the total Wave Cycles
|
||||
metric.
|
||||
Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
|
||||
over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
||||
kernels (less than 1ms).'
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 701
|
||||
@@ -171,3 +114,66 @@ Panel Config:
|
||||
max: MAX((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
unit: Wavefronts
|
||||
coll_level: SQ_LEVEL_WAVES
|
||||
metrics_description:
|
||||
Grid Size: The total number of work-items (or, threads) launched as a part of
|
||||
the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied
|
||||
by the total workgroup (or, block) size.
|
||||
Workgroup Size: The total number of work-items (or, threads) in each workgroup
|
||||
(or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
|
||||
to the total block size.
|
||||
Total Wavefronts: |-
|
||||
The total number of wavefronts launched as part of the kernel dispatch.
|
||||
On AMD Instinct™ CDNA™ accelerators and GCN™ GPUs, the wavefront
|
||||
size is always 64 work-items. Thus, the total number of wavefronts should
|
||||
be equivalent to the ceiling of grid size divided by 64.
|
||||
Saved Wavefronts: The total number of wavefronts saved at a context-save.
|
||||
Restored Wavefronts: The total number of wavefronts restored from a context-save.
|
||||
VGPRs: |-
|
||||
The number of architected vector general-purpose registers allocated
|
||||
for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
|
||||
requested by the compiler due to allocation granularity.
|
||||
AGPRs: |-
|
||||
The number of accumulation vector general-purpose registers allocated
|
||||
for the kernel, see AGPRs. Note: this may not exactly match the number of
|
||||
AGPRs requested by the compiler due to allocation granularity.
|
||||
SGPRs: |-
|
||||
The number of scalar general-purpose registers allocated for the kernel,
|
||||
see SALU. Note: this may not exactly match the number of SGPRs requested by
|
||||
the compiler due to allocation granularity.
|
||||
LDS Allocation: |-
|
||||
The number of bytes of LDS memory (or, shared memory) allocated for
|
||||
this kernel. Note: This may also be larger than what was requested at compile
|
||||
time due to both allocation granularity and dynamic per-dispatch LDS allocations.
|
||||
Scratch Allocation: The number of bytes of scratch memory requested per work-item
|
||||
for this kernel. Scratch memory is used for stack memory on the accelerator,
|
||||
as well as for register spills and restores.
|
||||
Kernel Time: The total duration of the executed kernel.
|
||||
Kernel Time (Cycles): The total duration of the executed kernel in cycles.
|
||||
Instructions per wavefront: The average number of instructions (of all types)
|
||||
executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.
|
||||
Wave Cycles: The number of cycles a wavefront in the kernel dispatch spent resident
|
||||
on a compute unit per normalization unit. This is averaged over all wavefronts
|
||||
in a kernel dispatch.
|
||||
Dependency Wait Cycles: The number of cycles a wavefront in the kernel dispatch
|
||||
stalled waiting on memory of any kind (e.g., instruction fetch, vector or scalar memory) per normalization unit. This is averaged over
|
||||
all wavefronts in a kernel dispatch.
|
||||
Issue Wait Cycles: The number of cycles a wavefront in the kernel dispatch was
|
||||
unable to issue an instruction for any reason (e.g., execution pipe back-pressure,
|
||||
arbitration loss, etc.) per normalization unit. This counter is incremented
|
||||
at every cycle by all wavefronts on a CU unable to issue an instruction. As
|
||||
such, it is most useful to get a sense of how waves were spending their time,
|
||||
rather than identification of a precise limiter because another wave could be
|
||||
actively executing while a wave is issue stalled. The sum of this metric, Dependency
|
||||
Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
|
||||
Active Cycles: The average number of cycles a wavefront in the kernel dispatch
|
||||
was actively executing instructions per normalization unit. This measurement
|
||||
is made on a per-wavefront basis, and may include cycles that another wavefront
|
||||
spent actively executing (on another execution unit, for example) or was stalled.
|
||||
As such, it is most useful to get a sense of how waves were spending their time,
|
||||
rather than identification of a precise limiter. The sum of this metric, Issue
|
||||
Wait Cycles and Dependency Wait Cycles should be equal to the total Wave Cycles
|
||||
metric.
|
||||
Wavefront Occupancy: |-
|
||||
The time-averaged number of wavefronts resident on the accelerator over
|
||||
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
||||
kernels (less than 1ms).
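Two relationships stated in the descriptions above lend themselves to a quick arithmetic check: the total wavefront count as the ceiling of grid size divided by 64, and Wave Cycles as the sum of Issue Wait, Dependency Wait, and Active Cycles. The sketch below is illustrative only. The function names and inputs are hypothetical, and the exact-equality check is an idealization, since the descriptions say only that the sum should equal Wave Cycles.

```python
WAVEFRONT_SIZE = 64  # work-items per wavefront, as stated in the description


def total_wavefronts(grid_size):
    """Ceiling of grid_size / 64 using integer arithmetic."""
    return -(-grid_size // WAVEFRONT_SIZE)


def wave_cycles_consistent(wave_cycles, issue_wait, dependency_wait, active):
    """True when the three components add up to the reported Wave Cycles."""
    return issue_wait + dependency_wait + active == wave_cycles


partial = total_wavefronts(1000)  # 1000 work-items do not fill the last wavefront
exact = total_wavefronts(1024)    # 1024 work-items fill every wavefront
ok = wave_cycles_consistent(1000, issue_wait=300, dependency_wait=500, active=200)
```

Both grid sizes need 16 wavefronts here, because a partially filled wavefront still counts as a whole one.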
|
||||
|
||||
+85
-84
@@ -2,90 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1000
|
||||
title: Compute Units - Instruction Mix
|
||||
metrics_description:
|
||||
VALU: The total number of vector arithmetic logic unit (VALU) operations issued.
|
||||
These are the workhorses of the compute unit, and are used to execute a wide
|
||||
range of instruction types including floating point operations, non-uniform
|
||||
address calculations, transcendental operations, integer operations, shifts,
|
||||
conditional evaluation, etc.
|
||||
VMEM: The total number of vector memory operations issued. These include most
|
||||
loads, stores and atomic operations and all accesses to generic, global, private
|
||||
and texture memory.
|
||||
LDS: The total number of LDS (also known as shared memory) operations issued.
|
||||
These include loads, stores, atomics, and HIP's __shfl operations.
|
||||
MFMA: The total number of matrix fused multiply-add instructions issued.
|
||||
SALU: The total number of scalar arithmetic logic unit (SALU) operations issued.
|
||||
Typically these are used for address calculations, literal constants, and other
|
||||
operations that are provably uniform across a wavefront. Although scalar memory
|
||||
(SMEM) operations are issued by the SALU, they are counted separately in this
|
||||
section.
|
||||
SMEM: The total number of scalar memory (SMEM) operations issued. These are typically
|
||||
used for loading kernel arguments, base-pointers and loads from HIP's __constant__
|
||||
memory.
|
||||
Branch: The total number of branch operations issued. These typically consist
|
||||
of jump or branch operations and are used to implement control flow.
|
||||
INT32: The total number of instructions operating on 32-bit integer operands issued
|
||||
to the VALU per normalization unit.
|
||||
INT64: The total number of instructions operating on 64-bit integer operands issued
|
||||
to the VALU per normalization unit.
|
||||
F16-ADD: The total number of addition instructions operating on 16-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F16-MUL: The total number of multiplication instructions operating on 16-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F16-FMA: The total number of fused multiply-add instructions operating on 16-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F16-Trans: The total number of transcendental instructions (e.g., sqrt) operating
|
||||
on 16-bit floating-point operands issued to the VALU per normalization unit.
|
||||
F32-ADD: The total number of addition instructions operating on 32-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F32-MUL: The total number of multiplication instructions operating on 32-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F32-FMA: The total number of fused multiply-add instructions operating on 32-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F32-Trans: The total number of transcendental instructions (such as sqrt) operating
|
||||
on 32-bit floating-point operands issued to the VALU per normalization unit.
|
||||
F64-ADD: The total number of addition instructions operating on 64-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F64-MUL: The total number of multiplication instructions operating on 64-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F64-FMA: The total number of fused multiply-add instructions operating on 64-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F64-Trans: The total number of transcendental instructions (such as sqrt) operating
|
||||
on 64-bit floating-point operands issued to the VALU per normalization unit.
|
||||
Conversion: "The total number of type conversion instructions (such as converting\
|
||||
\ data to or from F32\u2194F64) issued to the VALU per normalization unit."
|
||||
Global/Generic Instr: The total number of global & generic memory instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Read: The total number of global & generic memory read instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Write: The total number of global & generic memory write instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Atomic: The total number of global & generic memory atomic (with
|
||||
and without return) instructions executed on all compute units on the accelerator,
|
||||
per normalization unit.
|
||||
Spill/Stack Instr: The total number of spill/stack memory instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Read: The total number of spill/stack memory read instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Write: The total number of spill/stack memory write instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Atomic: The total number of spill/stack memory atomic (with and without
|
||||
return) instructions executed on all compute units on the accelerator, per normalization
|
||||
unit. Typically unused as these memory operations are typically used to implement
|
||||
thread-local storage.
|
||||
MFMA-I8: The total number of 8-bit integer MFMA instructions issued per normalization
|
||||
unit.
|
||||
MFMA-F8: The total number of 8-bit floating point MFMA instructions issued per
|
||||
normalization unit. This is supported in AMD Instinct MI300 series and later
|
||||
only.
|
||||
MFMA-F16: The total number of 16-bit floating point MFMA instructions issued per
|
||||
normalization unit.
|
||||
MFMA-BF16: The total number of 16-bit brain floating point MFMA instructions issued
|
||||
per normalization unit.
|
||||
MFMA-F32: The total number of 32-bit floating-point MFMA instructions issued per
|
||||
normalization unit.
|
||||
MFMA-F64: The total number of 64-bit floating-point MFMA instructions issued per
|
||||
normalization unit.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1001
|
||||
@@ -307,3 +223,88 @@ Panel Config:
|
||||
min: MIN((SQ_INSTS_VALU_MFMA_F64 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_MFMA_F64 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
metrics_description:
|
||||
VALU: The total number of vector arithmetic logic unit (VALU) operations issued.
|
||||
These are the workhorses of the compute unit, and are used to execute a wide
|
||||
range of instruction types including floating point operations, non-uniform
|
||||
address calculations, transcendental operations, integer operations, shifts,
|
||||
conditional evaluation, etc.
|
||||
VMEM: The total number of vector memory operations issued. These include most
|
||||
loads, stores and atomic operations and all accesses to generic, global, private
|
||||
and texture memory.
|
||||
LDS: The total number of LDS (also known as shared memory) operations issued.
|
||||
These include loads, stores, atomics, and HIP's __shfl operations.
|
||||
MFMA: The total number of matrix fused multiply-add instructions issued.
|
||||
SALU: The total number of scalar arithmetic logic unit (SALU) operations issued.
|
||||
Typically these are used for address calculations, literal constants, and other
|
||||
operations that are provably uniform across a wavefront. Although scalar memory
|
||||
(SMEM) operations are issued by the SALU, they are counted separately in this
|
||||
section.
|
||||
SMEM: The total number of scalar memory (SMEM) operations issued. These are typically
|
||||
used for loading kernel arguments, base-pointers and loads from HIP's __constant__
|
||||
memory.
|
||||
Branch: The total number of branch operations issued. These typically consist
|
||||
of jump or branch operations and are used to implement control flow.
|
||||
INT32: The total number of instructions operating on 32-bit integer operands issued
|
||||
to the VALU per normalization unit.
|
||||
INT64: The total number of instructions operating on 64-bit integer operands issued
|
||||
to the VALU per normalization unit.
|
||||
F16-ADD: The total number of addition instructions operating on 16-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F16-MUL: The total number of multiplication instructions operating on 16-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F16-FMA: The total number of fused multiply-add instructions operating on 16-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F16-Trans: The total number of transcendental instructions (e.g., sqrt) operating
|
||||
on 16-bit floating-point operands issued to the VALU per normalization unit.
|
||||
F32-ADD: The total number of addition instructions operating on 32-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F32-MUL: The total number of multiplication instructions operating on 32-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F32-FMA: The total number of fused multiply-add instructions operating on 32-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F32-Trans: The total number of transcendental instructions (such as sqrt) operating
|
||||
on 32-bit floating-point operands issued to the VALU per normalization unit.
|
||||
F64-ADD: The total number of addition instructions operating on 64-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F64-MUL: The total number of multiplication instructions operating on 64-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F64-FMA: The total number of fused multiply-add instructions operating on 64-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F64-Trans: The total number of transcendental instructions (such as sqrt) operating
|
||||
on 64-bit floating-point operands issued to the VALU per normalization unit.
|
||||
Conversion: |-
|
||||
The total number of type conversion instructions (such as converting
|
||||
data to or from F32\u2194F64) issued to the VALU per normalization unit.
|
||||
Global/Generic Instr: The total number of global & generic memory instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Read: The total number of global & generic memory read instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Write: The total number of global & generic memory write instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Atomic: The total number of global & generic memory atomic (with
|
||||
and without return) instructions executed on all compute units on the accelerator,
|
||||
per normalization unit.
|
||||
Spill/Stack Instr: The total number of spill/stack memory instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Read: The total number of spill/stack memory read instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Write: The total number of spill/stack memory write instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Atomic: The total number of spill/stack memory atomic (with and without
|
||||
return) instructions executed on all compute units on the accelerator, per normalization
|
||||
unit. Typically unused as these memory operations are typically used to implement
|
||||
thread-local storage.
|
||||
MFMA-I8: The total number of 8-bit integer MFMA instructions issued per normalization
|
||||
unit.
|
||||
MFMA-F8: The total number of 8-bit floating point MFMA instructions issued per
|
||||
normalization unit. This is supported in AMD Instinct MI300 series and later
|
||||
only.
|
||||
MFMA-F16: The total number of 16-bit floating point MFMA instructions issued per
|
||||
normalization unit.
|
||||
MFMA-BF16: The total number of 16-bit brain floating point MFMA instructions issued
|
||||
per normalization unit.
|
||||
MFMA-F32: The total number of 32-bit floating-point MFMA instructions issued per
|
||||
normalization unit.
|
||||
MFMA-F64: The total number of 64-bit floating-point MFMA instructions issued per
|
||||
normalization unit.
|
||||
|
||||
+95
-88
@@ -2,84 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1100
|
||||
title: Compute Units - Compute Pipeline
|
||||
metrics_description:
|
||||
VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
|
||||
This is also presented as a percent of the peak theoretical FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any floating-point
|
||||
operations from MFMA instructions.'
|
||||
VALU IOPs: 'The total integer operations executed per second on the VALU. This
|
||||
is also presented as a percent of the peak theoretical IOPs achievable on the
|
||||
specific accelerator. Note: this does not include any integer operations from
|
||||
MFMA instructions.'
|
||||
MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
|
||||
executed per second. Note: this does not include any 16-bit brain floating point
|
||||
operations from VALU instructions. This is also presented as a percent of the
|
||||
peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F16 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 32-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F32 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 64-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F64 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA IOPs (INT8): 'The total number of 8-bit integer MFMA operations executed
|
||||
per second. Note: this does not include any 8-bit integer operations from VALU
|
||||
instructions. This is also presented as a percent of the peak theoretical INT8
|
||||
MFMA operations achievable on the specific accelerator.'
|
||||
IPC: The ratio of the total number of instructions executed on the CU over the
|
||||
total active CU cycles.
|
||||
IPC (Issued): The ratio of the total number of (non-internal) instructions issued
|
||||
over the number of cycles where the scheduler was actively working on issuing
|
||||
instructions.
|
||||
SALU Utilization: Indicates what percent of the kernel's duration the SALU was
|
||||
busy executing instructions. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing SALU / SMEM instructions over the total CU cycles.
|
||||
VALU Utilization: Indicates what percent of the kernel's duration the VALU was
|
||||
busy executing instructions. Does not include VMEM operations. Computed as the
|
||||
ratio of the total number of cycles spent by the scheduler issuing VALU instructions
|
||||
over the total CU cycles.
|
||||
VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
|
||||
was busy executing instructions, including both global/generic and spill/scratch
|
||||
operations (see the VMEM instruction count metrics for more detail). Does not
|
||||
include VALU operations. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing VMEM instructions over the total CU cycles.
|
||||
Branch Utilization: Indicates what percent of the kernel's duration the branch
|
||||
unit was busy executing instructions. Computed as the ratio of the total number
|
||||
of cycles spent by the scheduler issuing branch instructions over the total
|
||||
CU cycles.
|
||||
VALU Active Threads: Indicates the average level of divergence within a wavefront
|
||||
over the lifetime of the kernel. The number of work-items that were active in
|
||||
a wavefront during execution of each VALU instruction, time-averaged over all
|
||||
VALU instructions run on all wavefronts in the kernel.
|
||||
MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
|
||||
was busy executing instructions. Computed as the ratio of the total number of
|
||||
cycles the MFMA unit was busy over the total CU cycles.
|
||||
MFMA Instruction Cycles: The average duration of MFMA instructions in this kernel
|
||||
in cycles. Computed as the ratio of the total number of cycles the MFMA unit
|
||||
was busy over the total number of MFMA instructions.
|
||||
VMEM Latency: The average number of round-trip cycles (that is, from issue to
|
||||
data return / acknowledgment) required for a VMEM instruction to complete.
|
||||
SMEM Latency: The average number of round-trip cycles (that is, from issue to
|
||||
data return / acknowledgment) required for a SMEM instruction to complete.
|
||||
FLOPs (Total): The total number of floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
IOPs (Total): The total number of integer operations executed on either the VALU
|
||||
or MFMA units, per normalization unit.
|
||||
F16 OPs: The total number of 16-bit floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
BF16 OPs: The total number of 16-bit brain floating-point operations executed
|
||||
on either the VALU or MFMA units, per normalization unit.
|
||||
F32 OPs: The total number of 32-bit floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
F64 OPs: The total number of 64-bit floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
INT8 OPs: The total number of 8-bit integer operations executed on either the
|
||||
VALU or MFMA units, per normalization unit.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1101
|
||||
@@ -165,13 +87,13 @@ Panel Config:
|
||||
unit: Instr/cycle
|
||||
IPC (Issued):
|
||||
avg: AVG(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
|
||||
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
|
||||
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
|
||||
/ SQ_ACTIVE_INST_ANY))
|
||||
min: MIN(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
|
||||
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
|
||||
/ SQ_ACTIVE_INST_ANY))
|
||||
max: MAX(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
|
||||
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
|
||||
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
|
||||
/ SQ_ACTIVE_INST_ANY))
|
||||
unit: Instr/cycle
|
||||
SALU Utilization:
|
||||
@@ -271,7 +193,7 @@ Panel Config:
|
||||
+ (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
|
||||
+ (SQ_INSTS_VALU_FMA_F64 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
|
||||
/ $denom))
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
IOPs (Total):
|
||||
avg: AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8
|
||||
* 512)) / $denom)
|
||||
@@ -279,12 +201,12 @@ Panel Config:
|
||||
* 512)) / $denom)
|
||||
max: MAX(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8
|
||||
* 512)) / $denom)
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
F8 OPs:
|
||||
avg: AVG(((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) / $denom))
|
||||
min: MIN(((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) / $denom))
|
||||
max: MAX(((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) / $denom))
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
F16 OPs:
|
||||
avg: AVG(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16))
|
||||
+ (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512
|
||||
@@ -295,12 +217,12 @@ Panel Config:
|
||||
max: MAX(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16))
|
||||
+ (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512
|
||||
* SQ_INSTS_VALU_MFMA_MOPS_F16)) / $denom))
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
BF16 OPs:
|
||||
avg: AVG(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
|
||||
min: MIN(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
|
||||
max: MAX(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
F32 OPs:
|
||||
avg: AVG((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
|
||||
+ (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
|
||||
@@ -311,7 +233,7 @@ Panel Config:
|
||||
max: MAX((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
|
||||
+ (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
|
||||
/ $denom))
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
F64 OPs:
|
||||
avg: AVG((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
|
||||
+ (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
|
||||
@@ -322,9 +244,94 @@ Panel Config:
|
||||
max: MAX((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
|
||||
+ (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
|
||||
/ $denom))
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
INT8 OPs:
|
||||
avg: AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
|
||||
min: MIN(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
|
||||
max: MAX(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
metrics_description:
|
||||
VALU FLOPs: |-
|
||||
The total floating-point operations executed per second on the VALU.
|
||||
This is also presented as a percent of the peak theoretical FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any floating-point
|
||||
operations from MFMA instructions.
|
||||
VALU IOPs: |-
|
||||
The total integer operations executed per second on the VALU. This is
|
||||
also presented as a percent of the peak theoretical IOPs achievable on the
|
||||
specific accelerator. Note: this does not include any integer operations from
|
||||
MFMA instructions.
|
||||
MFMA FLOPs (BF16): |-
|
||||
The total number of 16-bit brain floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit brain floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
BF16 MFMA operations achievable on the specific accelerator.
|
||||
MFMA FLOPs (F16): |-
|
||||
The total number of 16-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 16-bit floating point operations from
|
||||
VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F16 MFMA operations achievable on the specific accelerator.
|
||||
MFMA FLOPs (F32): |-
|
||||
The total number of 32-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 32-bit floating point operations from
|
||||
VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F32 MFMA operations achievable on the specific accelerator.
|
||||
MFMA FLOPs (F64): |-
|
||||
The total number of 64-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 64-bit floating point operations from
|
||||
VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F64 MFMA operations achievable on the specific accelerator.
|
||||
MFMA IOPs (INT8): |-
|
||||
The total number of 8-bit integer MFMA operations executed per second.
|
||||
Note: this does not include any 8-bit integer operations from VALU instructions.
|
||||
This is also presented as a percent of the peak theoretical INT8 MFMA operations
|
||||
achievable on the specific accelerator.
|
||||
IPC: The ratio of the total number of instructions executed on the CU over the
|
||||
total active CU cycles.
|
||||
IPC (Issued): The ratio of the total number of (non-internal) instructions issued
|
||||
over the number of cycles where the scheduler was actively working on issuing
|
||||
instructions.
|
||||
SALU Utilization: Indicates what percent of the kernel's duration the SALU was
|
||||
busy executing instructions. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing SALU / SMEM instructions over the total CU cycles.
|
||||
VALU Utilization: Indicates what percent of the kernel's duration the VALU was
|
||||
busy executing instructions. Does not include VMEM operations. Computed as the
|
||||
ratio of the total number of cycles spent by the scheduler issuing VALU instructions
|
||||
over the total CU cycles.
|
||||
VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
|
||||
was busy executing instructions, including both global/generic and spill/scratch
|
||||
operations (see the VMEM instruction count metrics for more detail). Does not
|
||||
include VALU operations. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing VMEM instructions over the total CU cycles.
|
||||
Branch Utilization: Indicates what percent of the kernel's duration the branch
|
||||
unit was busy executing instructions. Computed as the ratio of the total number
|
||||
of cycles spent by the scheduler issuing branch instructions over the total
|
||||
CU cycles.
|
||||
VALU Active Threads: Indicates the average level of divergence within a wavefront
|
||||
over the lifetime of the kernel. The number of work-items that were active in
|
||||
a wavefront during execution of each VALU instruction, time-averaged over all
|
||||
VALU instructions run on all wavefronts in the kernel.
|
||||
MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
|
||||
was busy executing instructions. Computed as the ratio of the total number of
|
||||
cycles the MFMA unit was busy over the total CU cycles.
|
||||
MFMA Instruction Cycles: The average duration of MFMA instructions in this kernel
|
||||
in cycles. Computed as the ratio of the total number of cycles the MFMA unit
|
||||
was busy over the total number of MFMA instructions.
|
||||
VMEM Latency: The average number of round-trip cycles (that is, from issue to
|
||||
data return / acknowledgment) required for a VMEM instruction to complete.
|
||||
SMEM Latency: The average number of round-trip cycles (that is, from issue to
|
||||
data return / acknowledgment) required for a SMEM instruction to complete.
|
||||
FLOPs (Total): The total number of floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
IOPs (Total): The total number of integer operations executed on either the VALU
|
||||
or MFMA units, per normalization unit.
|
||||
F16 OPs: The total number of 16-bit floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
BF16 OPs: The total number of 16-bit brain floating-point operations executed
|
||||
on either the VALU or MFMA units, per normalization unit.
|
||||
F32 OPs: The total number of 32-bit floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
F64 OPs: The total number of 64-bit floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
INT8 OPs: The total number of 8-bit integer operations executed on either the
|
||||
VALU or MFMA units, per normalization unit.
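The operation-count and IPC expressions in this panel can be read as plain arithmetic over hardware counters. The following is a minimal sketch, assuming the counters for a single sample are available in a dictionary keyed by the counter names used in the config. The function names and the dictionary layout are hypothetical and are not part of rocprofiler-compute; the `AVG`/`MIN`/`MAX` aggregation is left out.

```python
from typing import Mapping

# Multipliers copied from the config expressions above.
LANES_PER_VALU_INSTR = 64   # factor applied to VALU instruction counters
OPS_PER_MFMA_MOPS = 512     # factor applied to SQ_INSTS_VALU_MFMA_MOPS_* counters


def f32_ops(c: Mapping[str, float], denom: float) -> float:
    """F32 OPs: VALU add/mul/trans count once, FMA counts twice, plus MFMA."""
    valu = LANES_PER_VALU_INSTR * (
        c["SQ_INSTS_VALU_ADD_F32"]
        + c["SQ_INSTS_VALU_MUL_F32"]
        + c["SQ_INSTS_VALU_TRANS_F32"]
        + 2 * c["SQ_INSTS_VALU_FMA_F32"]
    )
    mfma = OPS_PER_MFMA_MOPS * c["SQ_INSTS_VALU_MFMA_MOPS_F32"]
    return (valu + mfma) / denom


def iops_total(c: Mapping[str, float], denom: float) -> float:
    """IOPs (Total): VALU INT32/INT64 instructions plus INT8 MFMA operations."""
    valu = LANES_PER_VALU_INSTR * (
        c["SQ_INSTS_VALU_INT32"] + c["SQ_INSTS_VALU_INT64"]
    )
    mfma = OPS_PER_MFMA_MOPS * c["SQ_INSTS_VALU_MFMA_MOPS_I8"]
    return (valu + mfma) / denom


def ipc_issued(c: Mapping[str, float]) -> float:
    """IPC (Issued): issued instructions over cycles with any active issue."""
    issued = (
        c["SQ_INSTS_VALU"] + c["SQ_INSTS_VMEM"] + c["SQ_INSTS_SALU"]
        + c["SQ_INSTS_SMEM"] + c["SQ_INSTS_BRANCH"] + c["SQ_INSTS_SENDMSG"]
        + c["SQ_INSTS_VSKIPPED"] + c["SQ_INSTS_LDS"]
    )
    return issued / c["SQ_ACTIVE_INST_ANY"]
```

Here `denom` stands in for the `$denom` placeholder, whose value depends on the selected normalization unit.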
|
||||
|
||||
+52
-51
@@ -2,51 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1200
|
||||
title: Local Data Share (LDS)
|
||||
metrics_description:
|
||||
Utilization: Indicates what percent of the kernel's duration the LDS was actively
|
||||
executing instructions (including, but not limited to, load, store, atomic and
|
||||
HIP's __shfl operations). Calculated as the ratio of the total number of cycles
|
||||
LDS was active over the total CU cycles.
|
||||
Access Rate: Indicates the percentage of SIMDs in the VALU actively issuing LDS
|
||||
instructions, averaged over the lifetime of the kernel. Calculated as the ratio
|
||||
of the total number of cycles spent by the scheduler issuing LDS instructions
|
||||
over the total CU cycles.
|
||||
Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
|
||||
could have been loaded from, stored to, or atomically updated in the LDS, expressed
as a percentage of the theoretical peak. Does not take into account the execution
|
||||
mask of the wavefront when the instruction was executed.
|
||||
Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
|
||||
loaded from, stored to, or atomically updated in the LDS divided by total duration.
|
||||
Does not take into account the execution mask of the wavefront when the instruction
|
||||
was executed.
|
||||
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
|
||||
servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing
|
||||
bank conflicts over the number of LDS cycles that would have been required to
|
||||
move the same amount of data in an uncontended access.
|
||||
LDS Instructions: The total number of LDS instructions (including, but not limited
|
||||
to, read/write/atomics and HIP's __shfl instructions) executed per normalization
|
||||
unit.
|
||||
LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
|
||||
/ acknowledgment) required for an LDS instruction to complete.
|
||||
Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS scheduler
|
||||
due to bank conflicts (as determined by the conflict resolution hardware) to
|
||||
the base number of cycles that would be spent in the LDS scheduler in a completely
|
||||
uncontended case. This is the unnormalized form of the Bank Conflict Rate.
|
||||
Index Accesses: The total number of cycles spent in the LDS scheduler over all
|
||||
operations per normalization unit.
|
||||
Atomic Return Cycles: The total number of cycles spent on LDS atomics with return
|
||||
per normalization unit.
|
||||
Bank Conflict: The total number of cycles spent in the LDS scheduler due to bank
|
||||
conflicts (as determined by the conflict resolution hardware) per normalization
|
||||
unit.
|
||||
Addr Conflict: The total number of cycles spent in the LDS scheduler due to address
|
||||
conflicts (as determined by the conflict resolution hardware) per normalization
|
||||
unit.
|
||||
Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
|
||||
stalls from non-dword aligned addresses per normalization unit.
|
||||
Mem Violations: "The total number of out-of-bounds accesses made to the LDS, per\
|
||||
\ normalization unit. This is unused and expected to be zero in most configurations\
|
||||
\ for modern CDNA\u2122 accelerators."
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1201
|
||||
@@ -87,7 +42,7 @@ Panel Config:
|
||||
avg: AVG((SQ_INSTS_LDS / $denom))
|
||||
min: MIN((SQ_INSTS_LDS / $denom))
|
||||
max: MAX((SQ_INSTS_LDS / $denom))
|
||||
unit: (Instr + $normUnit)
|
||||
unit: (Instr + $normUnit)
|
||||
Theoretical Bandwidth:
|
||||
avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ (End_Timestamp - Start_Timestamp)))
|
||||
@@ -117,29 +72,75 @@ Panel Config:
|
||||
avg: AVG((SQ_LDS_IDX_ACTIVE / $denom))
|
||||
min: MIN((SQ_LDS_IDX_ACTIVE / $denom))
|
||||
max: MAX((SQ_LDS_IDX_ACTIVE / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
Atomic Return Cycles:
|
||||
avg: AVG((SQ_LDS_ATOMIC_RETURN / $denom))
|
||||
min: MIN((SQ_LDS_ATOMIC_RETURN / $denom))
|
||||
max: MAX((SQ_LDS_ATOMIC_RETURN / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
Bank Conflict:
|
||||
avg: AVG((SQ_LDS_BANK_CONFLICT / $denom))
|
||||
min: MIN((SQ_LDS_BANK_CONFLICT / $denom))
|
||||
max: MAX((SQ_LDS_BANK_CONFLICT / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
Addr Conflict:
|
||||
avg: AVG((SQ_LDS_ADDR_CONFLICT / $denom))
|
||||
min: MIN((SQ_LDS_ADDR_CONFLICT / $denom))
|
||||
max: MAX((SQ_LDS_ADDR_CONFLICT / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
Unaligned Stall:
|
||||
avg: AVG((SQ_LDS_UNALIGNED_STALL / $denom))
|
||||
min: MIN((SQ_LDS_UNALIGNED_STALL / $denom))
|
||||
max: MAX((SQ_LDS_UNALIGNED_STALL / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
Mem Violations:
|
||||
avg: AVG((SQ_LDS_MEM_VIOLATIONS / $denom))
|
||||
min: MIN((SQ_LDS_MEM_VIOLATIONS / $denom))
|
||||
max: MAX((SQ_LDS_MEM_VIOLATIONS / $denom))
|
||||
unit: (Accesses + $normUnit)
|
||||
metrics_description:
|
||||
Utilization: Indicates what percent of the kernel's duration the LDS was actively
|
||||
executing instructions (including, but not limited to, load, store, atomic and
|
||||
HIP's __shfl operations). Calculated as the ratio of the total number of cycles
|
||||
LDS was active over the total CU cycles.
|
||||
Access Rate: Indicates the percentage of SIMDs in the VALU actively issuing LDS
|
||||
instructions, averaged over the lifetime of the kernel. Calculated as the ratio
|
||||
of the total number of cycles spent by the scheduler issuing LDS instructions
|
||||
over the total CU cycles.
|
||||
Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
|
||||
could have been loaded from, stored to, or atomically updated in the LDS, expressed
as a percentage of the theoretical peak. Does not take into account the execution
|
||||
mask of the wavefront when the instruction was executed.
|
||||
Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
|
||||
loaded from, stored to, or atomically updated in the LDS divided by total duration.
|
||||
Does not take into account the execution mask of the wavefront when the instruction
|
||||
was executed.
|
||||
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
|
||||
servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing
|
||||
bank conflicts over the number of LDS cycles that would have been required to
|
||||
move the same amount of data in an uncontended access.
|
||||
LDS Instructions: The total number of LDS instructions (including, but not limited
|
||||
to, read/write/atomics and HIP's __shfl instructions) executed per normalization
|
||||
unit.
|
||||
LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
|
||||
/ acknowledgment) required for an LDS instruction to complete.
|
||||
Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS scheduler
|
||||
due to bank conflicts (as determined by the conflict resolution hardware) to
|
||||
the base number of cycles that would be spent in the LDS scheduler in a completely
|
||||
uncontended case. This is the unnormalized form of the Bank Conflict Rate.
|
||||
Index Accesses: The total number of cycles spent in the LDS scheduler over all
|
||||
operations per normalization unit.
|
||||
Atomic Return Cycles: The total number of cycles spent on LDS atomics with return
|
||||
per normalization unit.
|
||||
Bank Conflict: The total number of cycles spent in the LDS scheduler due to bank
|
||||
conflicts (as determined by the conflict resolution hardware) per normalization
|
||||
unit.
|
||||
Addr Conflict: The total number of cycles spent in the LDS scheduler due to address
|
||||
conflicts (as determined by the conflict resolution hardware) per normalization
|
||||
unit.
|
||||
Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
|
||||
stalls from non-dword aligned addresses per normalization unit.
|
||||
Mem Violations: |-
|
||||
The total number of out-of-bounds accesses made to the LDS, per normalization
|
||||
unit. This is unused and expected to be zero in most configurations for
|
||||
modern CDNA\u2122 accelerators.
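The Theoretical Bandwidth expression in this panel can be sketched in a few lines. This is a minimal illustration, assuming the counter values and timestamps for one kernel are already at hand; the function names and parameters are hypothetical. The second helper is only an assumption derived from the Bank Conflicts/Access description (conflict cycles over conflict-free cycles), since that formula is not shown in this diff.

```python
BYTES_PER_BANK_PER_CYCLE = 4  # the "* 4" factor in the config expression


def lds_theoretical_bandwidth(
    idx_active: float,
    bank_conflict: float,
    lds_banks_per_cu: int,
    start_timestamp: float,
    end_timestamp: float,
) -> float:
    """Bytes the LDS could have moved, divided by the kernel duration.

    Mirrors ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4 * $lds_banks_per_cu)
    / (End_Timestamp - Start_Timestamp) from the config.
    """
    conflict_free_cycles = idx_active - bank_conflict
    max_bytes = conflict_free_cycles * BYTES_PER_BANK_PER_CYCLE * lds_banks_per_cu
    return max_bytes / (end_timestamp - start_timestamp)


def bank_conflicts_per_access(idx_active: float, bank_conflict: float) -> float:
    """ASSUMPTION: conflict cycles relative to the uncontended cycle count."""
    return bank_conflict / (idx_active - bank_conflict)
```

The result is in bytes per timestamp unit; converting it to a displayed unit depends on the timestamp resolution, which this sketch does not assume.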
|
||||
|
||||
+26
-26
@@ -2,28 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1300
|
||||
title: Instruction Cache
|
||||
metrics_description:
|
||||
Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
|
||||
of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
|
||||
the total L1I cycles.
|
||||
Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
|
||||
loaded line in the cache. Calculated as the ratio of the number of L1I requests
|
||||
that hit over the number of all L1I requests.
|
||||
L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\
|
||||
\ L2 cache request bandwidth achieved. Calculated as the ratio of the total\
|
||||
\ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\
|
||||
\ cycles."
|
||||
L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
|
||||
divided by total duration.
|
||||
Req: The total number of requests made to the L1I per normalization-unit.
|
||||
Hits: The total number of L1I requests that hit on a previously loaded cache line,
|
||||
per normalization-unit.
|
||||
Misses - Non Duplicated: The total number of L1I requests that missed on a cache
|
||||
line that were not already pending due to another request, per normalization-unit.
|
||||
Misses - Duplicated: The total number of L1I requests that missed on a cache line
|
||||
that were already pending due to another request, per normalization-unit.
|
||||
Instruction Fetch Latency: The average number of cycles spent to fetch instructions
|
||||
to a CU.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1301
|
||||
@@ -62,22 +40,22 @@ Panel Config:
|
||||
avg: AVG((SQC_ICACHE_REQ / $denom))
|
||||
min: MIN((SQC_ICACHE_REQ / $denom))
|
||||
max: MAX((SQC_ICACHE_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Hits:
|
||||
avg: AVG((SQC_ICACHE_HITS / $denom))
|
||||
min: MIN((SQC_ICACHE_HITS / $denom))
|
||||
max: MAX((SQC_ICACHE_HITS / $denom))
|
||||
unit: (Hits + $normUnit)
|
||||
unit: (Hits + $normUnit)
|
||||
Misses - Non Duplicated:
|
||||
avg: AVG((SQC_ICACHE_MISSES / $denom))
|
||||
min: MIN((SQC_ICACHE_MISSES / $denom))
|
||||
max: MAX((SQC_ICACHE_MISSES / $denom))
|
||||
unit: (Misses + $normUnit)
|
||||
unit: (Misses + $normUnit)
|
||||
Misses - Duplicated:
|
||||
avg: AVG((SQC_ICACHE_MISSES_DUPLICATE / $denom))
|
||||
min: MIN((SQC_ICACHE_MISSES_DUPLICATE / $denom))
|
||||
max: MAX((SQC_ICACHE_MISSES_DUPLICATE / $denom))
|
||||
unit: (Misses + $normUnit)
|
||||
unit: (Misses + $normUnit)
|
||||
Cache Hit Rate:
|
||||
avg: AVG(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
|
||||
+ SQC_ICACHE_MISSES_DUPLICATE)))
|
||||
@@ -107,3 +85,25 @@ Panel Config:
|
||||
min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: Gbps
|
||||
metrics_description:
|
||||
Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
|
||||
of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
|
||||
the total L1I cycles.
|
||||
Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
|
||||
loaded line in the cache. Calculated as the ratio of the number of L1I requests
|
||||
that hit over the number of all L1I requests.
|
||||
L1I-L2 Bandwidth Utilization: |-
|
||||
The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth
|
||||
achieved. Calculated as the ratio of the total number of requests from the
|
||||
L1I to the L2 cache over the total L1I-L2 interface cycles.
|
||||
L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
|
||||
divided by total duration.
|
||||
Req: The total number of requests made to the L1I per normalization-unit.
|
||||
Hits: The total number of L1I requests that hit on a previously loaded cache line,
|
||||
per normalization-unit.
|
||||
Misses - Non Duplicated: The total number of L1I requests that missed on a cache
|
||||
line that was not already pending due to another request, per normalization-unit.
|
||||
Misses - Duplicated: The total number of L1I requests that missed on a cache line
|
||||
that was already pending due to another request, per normalization-unit.
|
||||
Instruction Fetch Latency: The average number of cycles spent to fetch instructions
|
||||
to a CU.
|
||||
|
||||
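The L1I expressions in this file can be read as plain arithmetic. The sketch below is illustrative only and is not part of the commit: the function names are hypothetical, the zero-request guard is an assumption added for the sketch, and the bandwidth result is simply bytes per timestamp tick (the config labels it Gbps; the tick unit is not stated in this hunk).

```python
from typing import Optional


def l1i_hit_rate(hits: float, misses: float, misses_duplicate: float) -> Optional[float]:
    """Percent of L1I requests that hit.

    Mirrors 100 * SQC_ICACHE_HITS /
    (SQC_ICACHE_HITS + SQC_ICACHE_MISSES + SQC_ICACHE_MISSES_DUPLICATE).
    """
    total = hits + misses + misses_duplicate
    if total == 0:
        # Assumption for this sketch: with no requests the rate is undefined.
        return None
    return 100 * hits / total


def l1i_l2_bandwidth(inst_req: float, start_ts: float, end_ts: float) -> float:
    """Bytes moved per timestamp tick across the L1I-L2 interface.

    Mirrors (SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp),
    where 64 is the per-request byte count used by the config expression.
    """
    return inst_req * 64 / (end_ts - start_ts)
```

For example, `l1i_hit_rate(90, 8, 2)` counts 100 requests in total, 90 of which hit.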
+61
-58
@@ -2,49 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1400
|
||||
title: Scalar L1 Data Cache
|
||||
metrics_description:
|
||||
Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
|
||||
of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
|
||||
over the total sL1D cycles.
|
||||
Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
|
||||
loaded line in the cache. The ratio of the number of sL1D requests that hit over
|
||||
the number of all sL1D requests.
|
||||
sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
|
||||
bandwidth achieved.\ \ Calculated as total number of bytes read from, written
|
||||
to, or atomically updated\ \ across the sL1D - L2 interface.
|
||||
sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
|
||||
\ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\
|
||||
\ writes and atomics are typically unused on current CDNA accelerators, so in\
|
||||
\ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
|
||||
Req: The total number of requests, of any size or type, made to the sL1D per normalization
|
||||
unit.
|
||||
Hits: The total number of sL1D requests that hit on a previously loaded cache
|
||||
line, per normalization unit.
|
||||
Misses - Non Duplicated: 'The total number of sL1D requests that missed on a cache
|
||||
line that was not already pending due to another request, per normalization
|
||||
unit. '
|
||||
Misses- Duplicated: The total number of sL1D requests that missed on a cache line
|
||||
that was already pending due to another request, per normalization unit.
|
||||
Read Req (Total): The total number of sL1D read requests of any size, per normalization
|
||||
unit.
|
||||
Atomic Req: The total number of atomic requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
Read Req (1 DWord): The total number of sL1D read requests made for a single dword
|
||||
of data (4B), per normalization unit.
|
||||
Read Req (2 DWord): The total number of sL1D read requests made for two dwords
|
||||
of data (8B), per normalization unit.
|
||||
Read Req (4 DWord): The total number of sL1D read requests made for four dwords
|
||||
of data (16B), per normalization unit.
|
||||
Read Req (8 DWord): The total number of sL1D read requests made for eight dwords
|
||||
of data (32B), per normalization unit.
|
||||
Read Req (16 DWord): The total number of sL1D read requests made for sixteen
|
||||
dwords of data (64B), per normalization unit.
|
||||
Read Req: The total number of read requests from sL1D to the L2 per normalization
|
||||
unit.
|
||||
Write Req: The total number of write requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
Stall Cycles: "The total number of cycles the sL1D\u2194L2 interface was stalled,\
|
||||
\ per normalization unit."
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1401
|
||||
@@ -84,22 +41,22 @@ Panel Config:
|
||||
avg: AVG((SQC_DCACHE_REQ / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Hits:
|
||||
avg: AVG((SQC_DCACHE_HITS / $denom))
|
||||
min: MIN((SQC_DCACHE_HITS / $denom))
|
||||
max: MAX((SQC_DCACHE_HITS / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Misses - Non Duplicated:
|
||||
avg: AVG((SQC_DCACHE_MISSES / $denom))
|
||||
min: MIN((SQC_DCACHE_MISSES / $denom))
|
||||
max: MAX((SQC_DCACHE_MISSES / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Misses- Duplicated:
|
||||
avg: AVG((SQC_DCACHE_MISSES_DUPLICATE / $denom))
|
||||
min: MIN((SQC_DCACHE_MISSES_DUPLICATE / $denom))
|
||||
max: MAX((SQC_DCACHE_MISSES_DUPLICATE / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Cache Hit Rate:
|
||||
avg: AVG((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
@@ -118,37 +75,37 @@ Panel Config:
|
||||
+ SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
|
||||
max: MAX((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
|
||||
+ SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Atomic Req:
|
||||
avg: AVG((SQC_DCACHE_ATOMIC / $denom))
|
||||
min: MIN((SQC_DCACHE_ATOMIC / $denom))
|
||||
max: MAX((SQC_DCACHE_ATOMIC / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (1 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_1 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_1 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_1 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (2 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_2 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_2 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_2 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (4 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_4 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_4 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_4 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (8 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_8 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_8 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_8 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (16 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_16 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_16 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_16 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1403
|
||||
title: Scalar L1D Cache - L2 Interface
|
||||
@@ -171,19 +128,65 @@ Panel Config:
|
||||
avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_READ_REQ / $denom))
|
||||
max: MAX((SQC_TC_DATA_READ_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Write Req:
|
||||
avg: AVG((SQC_TC_DATA_WRITE_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_WRITE_REQ / $denom))
|
||||
max: MAX((SQC_TC_DATA_WRITE_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Atomic Req:
|
||||
avg: AVG((SQC_TC_DATA_ATOMIC_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_ATOMIC_REQ / $denom))
|
||||
max: MAX((SQC_TC_DATA_ATOMIC_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Stall Cycles:
|
||||
avg: AVG((SQC_TC_STALL / $denom))
|
||||
min: MIN((SQC_TC_STALL / $denom))
|
||||
max: MAX((SQC_TC_STALL / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
metrics_description:
|
||||
Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
|
||||
of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
|
||||
over the total sL1D cycles.
|
||||
Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
|
||||
loaded line in the cache. The ratio of the number of sL1D requests that hit over
|
||||
the number of all sL1D requests.
|
||||
sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
|
||||
bandwidth achieved. Calculated as total number of bytes read from, written to,
|
||||
or atomically updated across the sL1D - L2 interface.
|
||||
sL1D-L2 BW: |-
|
||||
The total number of bytes read from, written to, or atomically updated
|
||||
across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D
|
||||
writes and atomics are typically unused on current CDNA accelerators, so
|
||||
in the majority of cases this can be interpreted as an sL1D\u2192L2 read
|
||||
bandwidth.
|
||||
Req: The total number of requests, of any size or type, made to the sL1D per normalization
|
||||
unit.
|
||||
Hits: The total number of sL1D requests that hit on a previously loaded cache
|
||||
line, per normalization unit.
|
||||
Misses - Non Duplicated: |-
|
||||
The total number of sL1D requests that missed on a cache line that was
|
||||
not already pending due to another request, per normalization unit.
|
||||
Misses- Duplicated: The total number of sL1D requests that missed on a cache line
|
||||
that was already pending due to another request, per normalization unit.
|
||||
Read Req (Total): The total number of sL1D read requests of any size, per normalization
|
||||
unit.
|
||||
Atomic Req: The total number of atomic requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
Read Req (1 DWord): The total number of sL1D read requests made for a single dword
|
||||
of data (4B), per normalization unit.
|
||||
Read Req (2 DWord): The total number of sL1D read requests made for two dwords
|
||||
of data (8B), per normalization unit.
|
||||
Read Req (4 DWord): The total number of sL1D read requests made for four dwords
|
||||
of data (16B), per normalization unit.
|
||||
Read Req (8 DWord): The total number of sL1D read requests made for eight dwords
|
||||
of data (32B), per normalization unit.
|
||||
Read Req (16 DWord): The total number of sL1D read requests made for sixteen
|
||||
dwords of data (64B), per normalization unit.
|
||||
Read Req: The total number of read requests from sL1D to the L2 per normalization
|
||||
unit.
|
||||
Write Req: The total number of write requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
Stall Cycles: |-
|
||||
The total number of cycles the sL1D\u2194L2 interface was stalled, per
|
||||
normalization unit.
|
||||
|
||||
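The sL1D read counters above are split by request size (1, 2, 4, 8, or 16 dwords, at 4 bytes per dword). A minimal sketch of how they combine is shown below. `total_read_requests` follows the Read Req (Total) expression; `requested_read_bytes` is a hypothetical helper derived from the stated sizes and does not appear in the config.

```python
from typing import Mapping

DWORD_BYTES = 4  # one dword is 4 bytes, per the metric descriptions

# Counter name -> number of dwords carried by each request of that type.
READ_COUNTERS = {
    "SQC_DCACHE_REQ_READ_1": 1,
    "SQC_DCACHE_REQ_READ_2": 2,
    "SQC_DCACHE_REQ_READ_4": 4,
    "SQC_DCACHE_REQ_READ_8": 8,
    "SQC_DCACHE_REQ_READ_16": 16,
}


def total_read_requests(counters: Mapping[str, float]) -> float:
    """Sum of all sized read counters, as in Read Req (Total) before normalization."""
    return sum(counters.get(name, 0) for name in READ_COUNTERS)


def requested_read_bytes(counters: Mapping[str, float]) -> float:
    """Hypothetical helper: bytes implied by the sized read counters."""
    return sum(
        counters.get(name, 0) * dwords * DWORD_BYTES
        for name, dwords in READ_COUNTERS.items()
    )
```

Dividing either result by the chosen normalization denominator (`$denom` in the config) gives the per-normalization-unit value.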
+77
-80
@@ -2,70 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1500
|
||||
title: Address Processing Unit and Data Return Path (TA/TD)
|
||||
metrics_description:
|
||||
Address Processing Unit Busy: Percent of the total CU cycles the address processor
|
||||
was busy.
|
||||
Address Stall: Percent of the total CU cycles the address processor was stalled
|
||||
from sending address requests further into the vL1D pipeline.
|
||||
Data Stall: Percent of the total CU cycles the address processor was stalled from
|
||||
sending write/atomic data further into the vL1D pipeline.
|
||||
"Data-Processor \u2192 Address Stall": Percent of total CU cycles the address
|
||||
processor was stalled waiting to send command data to the data processor.
|
||||
Total Instructions: The total number of memory instructions executed by the address
|
||||
processor over all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Instructions: The total number of global & generic memory instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Read Instructions: The total number of global & generic memory
|
||||
read instructions executed on all compute units on the accelerator, per normalization
|
||||
unit.
|
||||
Global/Generic Write Instructions: The total number of global & generic memory
|
||||
write instructions executed on all compute units on the accelerator, per normalization
|
||||
unit.
|
||||
Global/Generic Atomic Instructions: The total number of global & generic memory
|
||||
atomic (with and without return) instructions executed on all compute units
|
||||
on the accelerator, per normalization unit.
|
||||
Spill/Stack Instructions: The total number of spill/stack memory instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Read Instructions: The total number of spill/stack memory read instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Write Instructions: The total number of spill/stack memory write instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Atomic Instructions: The total number of spill/stack memory atomic
|
||||
(with and without return) instructions executed on all compute units on the
|
||||
accelerator, per normalization unit. Typically unused as these memory operations
|
||||
are mainly used to implement thread-local storage.
|
||||
Spill/Stack Total Cycles: The number of cycles the address processing unit spent
|
||||
working on spill/stack instructions, per normalization unit.
|
||||
Spill/Stack Coalesced Read: The number of cycles the address processing unit spent
|
||||
working on coalesced spill/stack read instructions, per normalization unit.
|
||||
Spill/Stack Coalesced Write: The number of cycles the address processing unit
|
||||
spent working on coalesced spill/stack write instructions, per normalization
|
||||
unit.
|
||||
Data-Return Busy: Percent of the total CU cycles the data-return unit was busy
|
||||
processing or waiting on data to return to the CU.
|
||||
"Cache RAM \u2192 Data-Return Stall": Percent of the total CU cycles the data-return
|
||||
unit was stalled on data to be returned from the vL1D Cache RAM.
|
||||
"Workgroup manager \u2192 Data-Return Stall": Percent of the total CU cycles the
|
||||
data-return unit was stalled by the workgroup manager due to initialization
|
||||
of registers as a part of launching new workgroups.
|
||||
Coalescable Instructions: The number of instructions submitted to the data-return
|
||||
unit by the address processor that were found to be coalescable, per normalization
|
||||
unit.
|
||||
Read Instructions: The number of read instructions submitted to the data-return
|
||||
unit by the address processor summed over all compute units on the accelerator,
|
||||
per normalization unit. This is expected to be the sum of global/generic and
|
||||
spill/stack reads in the address processor.
|
||||
Write Instructions: The number of store instructions submitted to the data-return
|
||||
unit by the address processor summed over all compute units on the accelerator,
|
||||
per normalization unit. This is expected to be the sum of global/generic and
|
||||
spill/stack stores in the address processor.
|
||||
Atomic Instructions: The number of atomic instructions submitted to the data-return
|
||||
unit by the address processor summed over all compute units on the accelerator,
|
||||
per normalization unit. This is expected to be the sum of global/generic and
|
||||
spill/stack atomics in the address processor.
|
||||
Write Ack Instructions: The total number of write acknowledgements submitted by
|
||||
data-return unit to SQ, summed over all compute units on the accelerator, per
|
||||
normalization unit.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1501
|
||||
@@ -135,47 +71,47 @@ Panel Config:
|
||||
avg: AVG((TA_TOTAL_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_TOTAL_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_TOTAL_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Global/Generic Instructions:
|
||||
avg: AVG((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Global/Generic Read Instructions:
|
||||
avg: AVG((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Global/Generic Write Instructions:
|
||||
avg: AVG((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Global/Generic Atomic Instructions:
|
||||
avg: AVG((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Spill/Stack Instructions:
|
||||
avg: AVG((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Spill/Stack Read Instructions:
|
||||
avg: AVG((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Spill/Stack Write Instructions:
|
||||
avg: AVG((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Spill/Stack Atomic Instructions:
|
||||
avg: AVG((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
- metric_table:
|
||||
id: 1503
|
||||
title: Spill and stack metrics
|
||||
@@ -190,17 +126,17 @@ Panel Config:
|
||||
avg: AVG((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
|
||||
min: MIN((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
|
||||
max: MAX((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
Spill/Stack Coalesced Read:
|
||||
avg: AVG((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
|
||||
min: MIN((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
|
||||
max: MAX((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
Spill/Stack Coalesced Write:
|
||||
avg: AVG((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
|
||||
min: MIN((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
|
||||
max: MAX((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
- metric_table:
|
||||
id: 1504
|
||||
title: Vector L1 data-return path or Texture Data (TD)
|
||||
@@ -230,7 +166,7 @@ Panel Config:
|
||||
avg: AVG((TD_COALESCABLE_WAVEFRONT_sum / $denom))
|
||||
min: MIN((TD_COALESCABLE_WAVEFRONT_sum / $denom))
|
||||
max: MAX((TD_COALESCABLE_WAVEFRONT_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Read Instructions:
|
||||
avg: AVG((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
|
||||
/ $denom))
|
||||
@@ -238,14 +174,75 @@ Panel Config:
|
||||
/ $denom))
|
||||
max: MAX((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
|
||||
/ $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Write Instructions:
|
||||
avg: AVG((TD_STORE_WAVEFRONT_sum / $denom))
|
||||
min: MIN((TD_STORE_WAVEFRONT_sum / $denom))
|
||||
max: MAX((TD_STORE_WAVEFRONT_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Atomic Instructions:
|
||||
avg: AVG((TD_ATOMIC_WAVEFRONT_sum / $denom))
|
||||
min: MIN((TD_ATOMIC_WAVEFRONT_sum / $denom))
|
||||
max: MAX((TD_ATOMIC_WAVEFRONT_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
metrics_description:
|
||||
Address Processing Unit Busy: Percent of the total CU cycles the address processor
|
||||
was busy.
|
||||
Address Stall: Percent of the total CU cycles the address processor was stalled
|
||||
from sending address requests further into the vL1D pipeline.
|
||||
Data Stall: Percent of the total CU cycles the address processor was stalled from
|
||||
sending write/atomic data further into the vL1D pipeline.
|
||||
"Data-Processor \u2192 Address Stall": Percent of total CU cycles the address
|
||||
processor was stalled waiting to send command data to the data processor.
|
||||
Total Instructions: The total number of memory instructions executed by the address
|
||||
processor over all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Instructions: The total number of global & generic memory instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Read Instructions: The total number of global & generic memory
|
||||
read instructions executed on all compute units on the accelerator, per normalization
|
||||
unit.
|
||||
Global/Generic Write Instructions: The total number of global & generic memory
|
||||
write instructions executed on all compute units on the accelerator, per normalization
|
||||
unit.
|
||||
Global/Generic Atomic Instructions: The total number of global & generic memory
|
||||
atomic (with and without return) instructions executed on all compute units
|
||||
on the accelerator, per normalization unit.
|
||||
Spill/Stack Instructions: The total number of spill/stack memory instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Read Instructions: The total number of spill/stack memory read instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Write Instructions: The total number of spill/stack memory write instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Atomic Instructions: The total number of spill/stack memory atomic
|
||||
(with and without return) instructions executed on all compute units on the
|
||||
accelerator, per normalization unit. Typically unused as these memory operations
|
||||
are mainly used to implement thread-local storage.
|
||||
Spill/Stack Total Cycles: The number of cycles the address processing unit spent
|
||||
working on spill/stack instructions, per normalization unit.
|
||||
Spill/Stack Coalesced Read: The number of cycles the address processing unit spent
|
||||
working on coalesced spill/stack read instructions, per normalization unit.
|
||||
Spill/Stack Coalesced Write: The number of cycles the address processing unit
|
||||
spent working on coalesced spill/stack write instructions, per normalization
|
||||
unit.
|
||||
Data-Return Busy: Percent of the total CU cycles the data-return unit was busy
|
||||
processing or waiting on data to return to the CU.
|
||||
"Cache RAM \u2192 Data-Return Stall": Percent of the total CU cycles the data-return
|
||||
unit was stalled on data to be returned from the vL1D Cache RAM.
|
||||
"Workgroup manager \u2192 Data-Return Stall": Percent of the total CU cycles the
|
||||
data-return unit was stalled by the workgroup manager due to initialization
|
||||
of registers as a part of launching new workgroups.
|
||||
Coalescable Instructions: The number of instructions submitted to the data-return
|
||||
unit by the address processor that were found to be coalescable, per normalization
|
||||
unit.
|
||||
Read Instructions: The number of read instructions submitted to the data-return
|
||||
unit by the address processor summed over all compute units on the accelerator,
|
||||
per normalization unit. This is expected to be the sum of global/generic and
|
||||
spill/stack reads in the address processor.
|
||||
Write Instructions: The number of store instructions submitted to the data-return
|
||||
unit by the address processor summed over all compute units on the accelerator,
|
||||
per normalization unit. This is expected to be the sum of global/generic and
|
||||
spill/stack stores in the address processor.
|
||||
Atomic Instructions: The number of atomic instructions submitted to the data-return
|
||||
unit by the address processor summed over all compute units on the accelerator,
|
||||
per normalization unit. This is expected to be the sum of global/generic and
|
||||
spill/stack atomics in the address processor.
|
||||
|
||||
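The TD Read Instructions metric is derived rather than read from a single counter: the config subtracts store and atomic wavefronts from the load wavefront counter, then divides by the normalization denominator. The sketch below illustrates that arithmetic. Treating AVG/MIN/MAX as aggregation over a list of per-dispatch rows is an assumption of this sketch, and the function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Mapping


def td_read_instructions(row: Mapping[str, float], denom: float) -> float:
    """(TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum - TD_ATOMIC_WAVEFRONT_sum) / denom."""
    return (
        row["TD_LOAD_WAVEFRONT_sum"]
        - row["TD_STORE_WAVEFRONT_sum"]
        - row["TD_ATOMIC_WAVEFRONT_sum"]
    ) / denom


def summarize(rows: List[Mapping[str, float]], denom: float) -> Dict[str, float]:
    """Assumed meaning of avg/min/max: aggregate the expression over all rows."""
    values = [td_read_instructions(row, denom) for row in rows]
    return {"avg": mean(values), "min": min(values), "max": max(values)}
```

The same `summarize` pattern applies to the single-counter metrics (for example Write Instructions), with the per-row expression swapped out.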
+124
-132
@@ -2,117 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1600
|
||||
title: Vector L1 Data Cache
|
||||
metrics_description:
|
||||
Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
|
||||
cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
|
||||
of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator. The number of bytes is calculated as the number
|
||||
of cache lines requested multiplied by the cache line size. This value does
|
||||
not consider partial requests, so for instance, if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
|
||||
The number of cycles where the vL1D Cache RAM is actively processing any request
|
||||
divided by the number of cycles where the vL1D is active.
|
||||
Coalescing: Indicates how well memory instructions were coalesced by the address
|
||||
processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
|
||||
as the average number of thread-requests generated per instruction divided by
|
||||
the ideal number of thread-requests per instruction.
|
||||
Stalled on L2 Data: The ratio of the number of cycles where the vL1D is stalled
|
||||
waiting for requested data to return from the L2 cache divided by the number
|
||||
of cycles where the vL1D is active.
|
||||
Stalled on L2 Req: The ratio of the number of cycles where the vL1D is stalled
|
||||
waiting to issue a request for data to the L2 cache divided by the number of
|
||||
cycles where the vL1D is active.
|
||||
Tag RAM Stall (Read): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Read requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Tag RAM Stall (Write): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Write requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Tag RAM Stall (Atomic): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Atomic requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Total Req: The total number of incoming requests from the address processing unit
|
||||
after coalescing.
|
||||
Read Req: The total number of incoming read requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Write Req: The total number of incoming write requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Atomic Req: The total number of incoming atomic requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
|
||||
instructions divided by total duration. The number of bytes is calculated as
|
||||
the number of cache lines requested multiplied by the cache line size. This
|
||||
value does not consider partial requests, so for instance, if only a single
|
||||
value is requested in a cache line, the data movement will still be counted
|
||||
as a full cache line.
|
||||
Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
|
||||
vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
Cache Accesses: The total number of cache line lookups in the vL1D.
|
||||
Cache Hits: The number of cache accesses minus the number of outgoing requests
|
||||
to the L2 cache, that is, the number of cache line requests serviced by the
|
||||
vL1D Cache RAM per normalization unit.
|
||||
Invalidations: The number of times the vL1D was issued a write-back invalidate
|
||||
command during the kernel's execution per normalization unit. This may be triggered
|
||||
by, for instance, the buffer_wbinvl1 instruction.
|
||||
L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
|
||||
of VMEM instructions, divided by total duration. The number of bytes is calculated
|
||||
as the number of cache lines requested multiplied by the cache line size. This
|
||||
value does not consider partial requests, so for instance, if only a single
|
||||
value is requested in a cache line, the data movement will still be counted
|
||||
as a full cache line.
|
||||
L1-L2 Read: The number of read requests for a vL1D cache line that were not satisfied
|
||||
by the vL1D and must be retrieved from the L2 cache, per normalization
|
||||
unit.
|
||||
L1-L2 Write: The number of write requests to a vL1D cache line that were sent
|
||||
through the vL1D to the L2 cache, per normalization unit.
|
||||
L1-L2 Atomic: The number of atomic requests that are sent through the vL1D to
|
||||
the L2 cache, per normalization unit. This includes requests for atomics with
|
||||
and without return.
|
||||
L1 Access Latency: Calculated as the average number of cycles that a vL1D cache
|
||||
line request spent in the vL1D cache pipeline.
|
||||
L1-L2 Read Latency: Calculated as the average number of cycles that the vL1D cache
|
||||
took to issue and receive read requests from the L2 Cache. This number also
|
||||
includes requests for atomics with return values.
|
||||
L1-L2 Write Latency: Calculated as the average number of cycles that the vL1D
|
||||
cache took to issue and receive acknowledgement of a write request to the L2
|
||||
Cache. This number also includes requests for atomics without return values.
|
||||
NC - Read: Total read requests with NC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
UC - Read: Total read requests with UC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
CC - Read: Total read requests with CC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
RW - Read: Total read requests with RW mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
RW - Write: Total write requests with RW mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
NC - Write: Total write requests with NC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
UC - Write: Total write requests with UC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
CC - Write: Total write requests with CC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
NC - Atomic: Total atomic requests with NC mtype from this TCP to all TCCs. Sum
|
||||
over TCP instances per normalization unit.
|
||||
UC - Atomic: Total atomic requests with UC mtype from this TCP to all TCCs. Sum
|
||||
over TCP instances per normalization unit.
|
||||
CC - Atomic: Total atomic requests with CC mtype from this TCP to all TCCs. Sum
|
||||
over TCP instances per normalization unit.
|
||||
RW - Atomic: Total atomic requests with RW mtype from this TCP to all TCCs. Sum
|
||||
over TCP instances per normalization unit.
|
||||
Req: The number of translation requests made to the UTCL1 per normalization unit.
|
||||
Hit Ratio: The ratio of the number of translation requests that hit in the UTCL1
|
||||
divided by the total number of translation requests made to the UTCL1.
|
||||
Hits: The number of translation requests that hit in the UTCL1, and could be reused,
|
||||
per normalization unit.
|
||||
Translation Misses: The total number of translation requests that missed in the
|
||||
UTCL1 due to translation not being present in the cache, per normalization
|
||||
unit.
|
||||
Permission Misses: "The total number of translation requests that missed in the\
|
||||
\ UTCL1 due to a permission error, per normalization unit. This is unused and\
|
||||
\ expected to be zero in most configurations for modern CDNA\u2122 accelerators."
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1601
|
||||
@@ -181,17 +70,17 @@ Panel Config:
|
||||
avg: AVG((TCP_TOTAL_ACCESSES_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_ACCESSES_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_ACCESSES_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read Req:
|
||||
avg: AVG((TCP_TOTAL_READ_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_READ_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_READ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Write Req:
|
||||
avg: AVG((TCP_TOTAL_WRITE_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_WRITE_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_WRITE_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Atomic Req:
|
||||
avg: AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
|
||||
/ $denom))
|
||||
@@ -199,7 +88,7 @@ Panel Config:
|
||||
/ $denom))
|
||||
max: MAX(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Cache BW:
|
||||
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
|
||||
@@ -223,7 +112,7 @@ Panel Config:
|
||||
avg: AVG((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Cache Hits:
|
||||
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
@@ -234,7 +123,7 @@ Panel Config:
|
||||
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Invalidations:
|
||||
avg: AVG((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
|
||||
@@ -252,12 +141,12 @@ Panel Config:
|
||||
avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
L1-L2 Write:
|
||||
avg: AVG((TCP_TCC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
L1-L2 Atomic:
|
||||
avg: AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
/ $denom))
|
||||
@@ -265,7 +154,7 @@ Panel Config:
|
||||
/ $denom))
|
||||
max: MAX(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1604
|
||||
title: L1D - L2 Transactions
|
||||
@@ -284,84 +173,84 @@ Panel Config:
|
||||
avg: AVG((TCP_TCC_NC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_NC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_NC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
UC - Read:
|
||||
xfer: Read
|
||||
coherency: UC
|
||||
avg: AVG((TCP_TCC_UC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_UC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_UC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
CC - Read:
|
||||
xfer: Read
|
||||
coherency: CC
|
||||
avg: AVG((TCP_TCC_CC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_CC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_CC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
RW - Read:
|
||||
xfer: Read
|
||||
coherency: RW
|
||||
avg: AVG((TCP_TCC_RW_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_RW_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_RW_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
RW - Write:
|
||||
xfer: Write
|
||||
coherency: RW
|
||||
avg: AVG((TCP_TCC_RW_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_RW_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_RW_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
NC - Write:
|
||||
xfer: Write
|
||||
coherency: NC
|
||||
avg: AVG((TCP_TCC_NC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_NC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_NC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
UC - Write:
|
||||
xfer: Write
|
||||
coherency: UC
|
||||
avg: AVG((TCP_TCC_UC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_UC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_UC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
CC - Write:
|
||||
xfer: Write
|
||||
coherency: CC
|
||||
avg: AVG((TCP_TCC_CC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_CC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_CC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
NC - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: NC
|
||||
avg: AVG((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
UC - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: UC
|
||||
avg: AVG((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
CC - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: CC
|
||||
avg: AVG((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
RW - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: RW
|
||||
avg: AVG((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1605
|
||||
title: L1 Unified Translation Cache (UTCL1)
|
||||
@@ -410,3 +299,106 @@ Panel Config:
|
||||
max: Max
|
||||
units: Unit
|
||||
metric: {}
|
||||
metrics_description:
|
||||
Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
|
||||
cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
|
||||
of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator. The number of bytes is calculated as the number
|
||||
of cache lines requested multiplied by the cache line size. This value does
|
||||
not consider partial requests, so for instance, if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
|
||||
The number of cycles where the vL1D Cache RAM is actively processing any request
|
||||
divided by the number of cycles where the vL1D is active.
|
||||
Coalescing: Indicates how well memory instructions were coalesced by the address
|
||||
processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
|
||||
as the average number of thread-requests generated per instruction divided by
|
||||
the ideal number of thread-requests per instruction.
|
||||
Stalled on L2 Data: The ratio of the number of cycles where the vL1D is stalled
|
||||
waiting for requested data to return from the L2 cache divided by the number
|
||||
of cycles where the vL1D is active.
|
||||
Stalled on L2 Req: The ratio of the number of cycles where the vL1D is stalled
|
||||
waiting to issue a request for data to the L2 cache divided by the number of
|
||||
cycles where the vL1D is active.
|
||||
Tag RAM Stall (Read): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Read requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Tag RAM Stall (Write): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Write requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Tag RAM Stall (Atomic): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Atomic requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Total Req: The total number of incoming requests from the address processing unit
|
||||
after coalescing.
|
||||
Read Req: The total number of incoming read requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Write Req: The total number of incoming write requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Atomic Req: The total number of incoming atomic requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
|
||||
instructions divided by total duration. The number of bytes is calculated as
|
||||
the number of cache lines requested multiplied by the cache line size. This
|
||||
value does not consider partial requests, so for instance, if only a single
|
||||
value is requested in a cache line, the data movement will still be counted
|
||||
as a full cache line.
|
||||
Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
|
||||
vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
Cache Accesses: The total number of cache line lookups in the vL1D.
|
||||
Cache Hits: The number of cache accesses minus the number of outgoing requests
|
||||
to the L2 cache, that is, the number of cache line requests serviced by the
|
||||
vL1D Cache RAM per normalization unit.
|
||||
Invalidations: The number of times the vL1D was issued a write-back invalidate
|
||||
command during the kernel's execution per normalization unit. This may be triggered
|
||||
by, for instance, the buffer_wbinvl1 instruction.
|
||||
L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
|
||||
of VMEM instructions, divided by total duration. The number of bytes is calculated
|
||||
as the number of cache lines requested multiplied by the cache line size. This
|
||||
value does not consider partial requests, so for instance, if only a single
|
||||
value is requested in a cache line, the data movement will still be counted
|
||||
as a full cache line.
|
||||
L1-L2 Read: The number of read requests for a vL1D cache line that were not satisfied
|
||||
by the vL1D and must be retrieved from the L2 Cache per normalization
|
||||
unit.
|
||||
L1-L2 Write: The number of write requests to a vL1D cache line that were sent
|
||||
through the vL1D to the L2 cache, per normalization unit.
|
||||
L1-L2 Atomic: The number of atomic requests that are sent through the vL1D to
|
||||
the L2 cache, per normalization unit. This includes requests for atomics with
|
||||
and without return.
|
||||
NC - Read: Total read requests with NC mtype from this TCP to all TCCs. Sum over
|
||||
TCP instances per normalization unit.
|
||||
UC - Read: Total read requests with UC mtype from this TCP to all TCCs. Sum over
|
||||
TCP instances per normalization unit.
|
||||
CC - Read: Total read requests with CC mtype from this TCP to all TCCs. Sum over
|
||||
TCP instances per normalization unit.
|
||||
RW - Read: Total read requests with RW mtype from this TCP to all TCCs. Sum over
|
||||
TCP instances per normalization unit.
|
||||
RW - Write: Total write requests with RW mtype from this TCP to all TCCs. Sum over
|
||||
TCP instances per normalization unit.
|
||||
NC - Write: Total write requests with NC mtype from this TCP to all TCCs. Sum over
|
||||
TCP instances per normalization unit.
|
||||
UC - Write: Total write requests with UC mtype from this TCP to all TCCs. Sum over
|
||||
TCP instances per normalization unit.
|
||||
CC - Write: Total write requests with CC mtype from this TCP to all TCCs. Sum over
|
||||
TCP instances per normalization unit.
|
||||
NC - Atomic: Total atomic requests with NC mtype from this TCP to all TCCs. Sum
|
||||
over TCP instances per normalization unit.
|
||||
UC - Atomic: Total atomic requests with UC mtype from this TCP to all TCCs. Sum
|
||||
over TCP instances per normalization unit.
|
||||
CC - Atomic: Total atomic requests with CC mtype from this TCP to all TCCs. Sum
|
||||
over TCP instances per normalization unit.
|
||||
RW - Atomic: Total atomic requests with RW mtype from this TCP to all TCCs. Sum
|
||||
over TCP instances per normalization unit.
|
||||
Req: The number of translation requests made to the UTCL1 per normalization unit.
|
||||
Hit Ratio: The ratio of the number of translation requests that hit in the UTCL1
|
||||
divided by the total number of translation requests made to the UTCL1.
|
||||
Hits: The number of translation requests that hit in the UTCL1, and could be reused,
|
||||
per normalization unit.
|
||||
Translation Misses: The total number of translation requests that missed in the
|
||||
UTCL1 due to translation not being present in the cache, per normalization unit.
|
||||
Permission Misses: |-
|
||||
The total number of translation requests that missed in the UTCL1 due
|
||||
to a permission error, per normalization unit. This is unused and expected
|
||||
to be zero in most configurations for modern CDNA™ accelerators.
|
||||
|
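The `Cache Hits` and `Cache BW` expressions in the vL1D panel config above can be read as plain arithmetic over the TCP counters. The following is a minimal Python sketch of that arithmetic for a single kernel dispatch, not the profiler's own evaluator. The `L1Counters` class, its field names, the helper functions, and the sample values are all hypothetical and exist only for illustration; the `AVG`/`MIN`/`MAX` aggregation across dispatches is not modeled, and the timestamp units are not stated in this diff, so the bandwidth result is simply bytes per timestamp tick.

```python
from dataclasses import dataclass

# Multiplier taken from the Cache BW expression (TCP_TOTAL_CACHE_ACCESSES_sum * 128).
CACHE_LINE_BYTES = 128


@dataclass
class L1Counters:
    """Hypothetical container for the per-dispatch counters used below."""

    total_cache_accesses: int        # TCP_TOTAL_CACHE_ACCESSES_sum
    tcc_read_req: int                # TCP_TCC_READ_REQ_sum
    tcc_write_req: int               # TCP_TCC_WRITE_REQ_sum
    tcc_atomic_with_ret_req: int     # TCP_TCC_ATOMIC_WITH_RET_REQ_sum
    tcc_atomic_without_ret_req: int  # TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum
    start_timestamp: int             # Start_Timestamp
    end_timestamp: int               # End_Timestamp


def l1_l2_requests(c: L1Counters) -> int:
    """Outgoing vL1D-to-L2 requests: reads + writes + atomics (with and without return)."""
    return (
        c.tcc_read_req
        + c.tcc_write_req
        + c.tcc_atomic_with_ret_req
        + c.tcc_atomic_without_ret_req
    )


def cache_hits(c: L1Counters, denom: float = 1.0) -> float:
    """Cache accesses minus outgoing L2 requests, per normalization unit ($denom)."""
    return (c.total_cache_accesses - l1_l2_requests(c)) / denom


def cache_bw(c: L1Counters) -> float:
    """Bytes looked up in the vL1D divided by the dispatch duration."""
    duration = c.end_timestamp - c.start_timestamp
    return (c.total_cache_accesses * CACHE_LINE_BYTES) / duration


# Made-up counter values, purely for illustration.
sample = L1Counters(
    total_cache_accesses=1000,
    tcc_read_req=200,
    tcc_write_req=50,
    tcc_atomic_with_ret_req=30,
    tcc_atomic_without_ret_req=20,
    start_timestamp=1000,
    end_timestamp=2000,
)

print(l1_l2_requests(sample))
print(cache_hits(sample, denom=10))
print(cache_bw(sample))
```

Because every access is counted as a full cache line, `cache_bw` overstates the useful data moved when only part of a line is requested, which matches the caveat in the `Cache BW` description.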
||||
+186
-236
@@ -2,218 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1700
|
||||
title: L2 Cache
|
||||
metrics_description:
|
||||
Utilization: The ratio of the number of cycles an L2 channel was active, summed
|
||||
over all L2 channels on the accelerator over the total L2 cycles.
|
||||
Peak Bandwidth: The number of bytes looked up in the L2 cache, as a percent of
|
||||
the peak theoretical bandwidth achievable on the specific accelerator. The number
|
||||
of bytes is calculated as the number of cache lines requested multiplied by
|
||||
the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line.
|
||||
Hit Rate: The ratio of the number of L2 cache line requests that hit in the L2
|
||||
cache over the total number of incoming cache line requests to the L2 cache.
|
||||
L2-Fabric Read BW: The number of bytes read by the L2 over the Infinity Fabric
|
||||
interface per unit time.
|
||||
L2-Fabric Write and Atomic BW: The number of bytes sent by the L2 over the Infinity
|
||||
Fabric interface by write and atomic operations per unit time.
|
||||
HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
|
||||
memory (HBM) per unit time. This value is calculated as the number of HBM channels
|
||||
multiplied by the HBM channel width multiplied by the HBM clock frequency.
|
||||
Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided
|
||||
by total duration.
|
||||
HBM Read Traffic: The percent of read requests generated by the L2 cache that
|
||||
are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
|
||||
does not consider the size of the request (meaning that 32B and 64B requests
|
||||
are both counted as a single request), so this metric only approximates the
|
||||
percent of the L2-Fabric Read bandwidth directed to the local HBM.
|
||||
Remote Read Traffic: The percent of read requests generated by the L2 cache that
|
||||
are routed to any memory location other than the accelerator's local high-bandwidth
|
||||
memory (HBM) - for example, the CPU's DRAM or a remote accelerator's HBM. This
|
||||
breakdown does not consider the size of the request (meaning that 32B and 64B
|
||||
requests are both counted as a single request), so this metric only approximates
|
||||
the percent of the L2-Fabric Read bandwidth directed to a remote location.
|
||||
Uncached Read Traffic: The percent of read requests generated by the L2 cache
|
||||
that are reading from an uncached memory allocation. Note, as described in the
|
||||
request flow section, a single 64B read request is typically counted as two
|
||||
uncached read requests. So, it is possible for the Uncached Read Traffic to
|
||||
reach up to 200% of the total number of read requests. This breakdown does not
|
||||
consider the size of the request (i.e., 32B and 64B requests are both counted
|
||||
as a single request), so this metric only approximates the percent of the L2-Fabric
|
||||
read bandwidth directed to an uncached memory location.
|
||||
Write and Atomic BW: The total number of bytes written by the L2 over Infinity
|
||||
Fabric by write and atomic operations divided by total duration. Note that on
|
||||
current CDNA accelerators, such as the MI2XX, requests are only considered atomic
|
||||
by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
|
||||
fine-grained memory allocations or uncached memory allocations on the MI2XX.
|
||||
HBM Write and Atomic Traffic: The percent of write and atomic requests generated
|
||||
by the L2 cache that are routed to the accelerator's local high-bandwidth memory
|
||||
(HBM). This breakdown does not consider the size of the request (meaning that
|
||||
32B and 64B requests are both counted as a single request), so this metric only
|
||||
approximates the percent of the L2-Fabric Write and Atomic bandwidth directed
|
||||
to the local HBM. Note that on current CDNA accelerators, such as the MI2XX,
|
||||
requests are only considered atomic by Infinity Fabric if they are targeted
|
||||
at fine-grained memory allocations or uncached memory allocations.
|
||||
Remote Write and Atomic Traffic: The percent of write and atomic requests generated by the
|
||||
L2 cache that are routed to any memory location other than the accelerator's
|
||||
local high-bandwidth memory (HBM) - for example, the CPU's DRAM or a remote
|
||||
accelerator's HBM. This breakdown does not consider the size of the request
|
||||
(meaning that 32B and 64B requests are both counted as a single request), so
|
||||
this metric only approximates the percent of the L2-Fabric Write and Atomic bandwidth directed
|
||||
to a remote location. Note that on current CDNA accelerators, such as the MI2XX,
|
||||
requests are only considered atomic by Infinity Fabric if they are targeted
|
||||
at fine-grained memory allocations or uncached memory allocations.
|
||||
Atomic Traffic: The percent of write requests generated by the L2 cache that are
|
||||
atomic requests to any memory location. This breakdown does not consider the
|
||||
size of the request (meaning that 32B and 64B requests are both counted as a
|
||||
single request), so this metric only approximates the percent of the L2-Fabric
|
||||
Write and Atomic bandwidth used by atomic requests. Note that on current CDNA accelerators,
|
||||
such as the MI2XX, requests are only considered atomic by Infinity Fabric if
|
||||
they are targeted at fine-grained memory allocations or uncached memory allocations.
|
||||
Uncached Write and Atomic Traffic: The percent of write and atomic requests generated
|
||||
by the L2 cache that are targeting uncached memory allocations. This breakdown
|
||||
does not consider the size of the request (meaning that 32B and 64B requests
|
||||
are both counted as a single request), so this metric only approximates the
|
||||
percent of the L2-Fabric Write and Atomic bandwidth directed to uncached memory allocations.
|
||||
Read Latency: The time-averaged number of cycles read requests spent in Infinity
|
||||
Fabric before data was returned to the L2.
|
||||
Write and Atomic Latency: The time-averaged number of cycles write requests spent
|
||||
in Infinity Fabric before a completion acknowledgement was returned to the L2.
|
||||
Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
|
||||
Fabric before a completion acknowledgement (atomic without return value) or
|
||||
data (atomic with return value) was returned to the L2.
|
||||
Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration.
|
||||
The number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so for
|
||||
example, if only a single value is requested in a cache line, the data movement
|
||||
will still be counted as a full cache line.
|
||||
Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests,
|
||||
divided by total duration.
|
||||
Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests,
|
||||
divided by total duration.
|
||||
Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests,
|
||||
divided by total duration.
|
||||
Req: The total number of incoming requests to the L2 from all clients for all
|
||||
request types, per normalization unit.
|
||||
Read Req: The total number of read requests to the L2 from all clients.
|
||||
Write Req: The total number of write requests to the L2 from all clients.
|
||||
Atomic Req: The total number of atomic requests (with and without return) to the
|
||||
L2 from all clients.
|
||||
Streaming Req: The total number of incoming requests to the L2 that are marked
|
||||
as streaming. The exact meaning of this may differ depending on the targeted
|
||||
accelerator; however, on an MI2XX this corresponds to non-temporal loads or stores.
|
||||
The L2 cache attempts to evict streaming requests before normal requests when
|
||||
the L2 is at capacity.
|
||||
Probe Req: The number of coherence probe requests made to the L2 cache from outside
|
||||
the accelerator. On an MI2XX, probe requests may be generated by, for example,
|
||||
writes to fine-grained device memory or by writes to coarse-grained device memory.
|
||||
Cache Hit: The ratio of the number of L2 cache line requests that hit in the L2
|
||||
cache over the total number of incoming cache line requests to the L2 cache.
|
||||
Hits: The total number of requests to the L2 from all clients that hit in the
|
||||
cache. As noted in the Speed-of-Light section, this includes hit-on-miss requests.
|
||||
Misses: The total number of requests to the L2 from all clients that miss in the
|
||||
cache. As noted in the Speed-of-Light section, these do not include hit-on-miss
|
||||
requests.
|
||||
Writeback: The total number of L2 cache lines written back to memory for any reason.
|
||||
Write-backs may occur due to user code (such as HIP kernel calls to __threadfence_system
|
||||
or atomic built-ins), by the command processor's memory acquire/release fences,
|
||||
or for other internal hardware reasons.
|
||||
Writeback (Internal): The total number of L2 cache lines written back to memory
|
||||
for internal hardware reasons, per normalization unit.
|
||||
Writeback (vL1D Req): The total number of L2 cache lines written back to memory
|
||||
due to requests initiated by the vL1D cache, per normalization unit.
|
||||
Evict (Internal): The total number of L2 cache lines evicted from the cache due
|
||||
to capacity limits, per normalization unit.
|
||||
Evict (vL1D Req): The total number of L2 cache lines evicted from the cache due
|
||||
to invalidation requests initiated by the vL1D cache, per normalization unit.
|
||||
NC Req: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory
|
||||
allocations, per normalization unit.
|
||||
UC Req: The total number of requests to the L2 that go to Uncached (UC) memory
|
||||
allocations.
|
||||
CC Req: The total number of requests to the L2 that go to Coherently Cacheable
|
||||
(CC) memory allocations.
|
||||
RW Req: The total number of requests to the L2 that go to Read-Write coherent
|
||||
memory (RW) allocations.
|
||||
Write - Credit Starvation: The number of cycles the L2-Fabric interface was stalled
|
||||
on write or atomic requests to any memory location because too many write/atomic
|
||||
requests were currently in flight, as a percent of the total active L2 cycles.
|
||||
Read (32B): The total number of L2 requests to Infinity Fabric to read 32B of
|
||||
data from any memory location, per normalization unit.
|
||||
Read (64B): The total number of L2 requests to Infinity Fabric to read 64B of
|
||||
data from any memory location, per normalization unit.
|
||||
Read (Uncached): The total number of L2 requests to Infinity Fabric to read uncached
|
||||
data from any memory location, per normalization unit. 64B requests for uncached
|
||||
data are counted as two 32B uncached data requests.
|
||||
HBM Read: The total number of L2 requests to Infinity Fabric to read 32B or 64B
|
||||
of data from the accelerator's local HBM, per normalization unit.
|
||||
Remote Read: The total number of L2 requests to Infinity Fabric to read 32B or
|
||||
64B of data from any source other than the accelerator's local HBM, per normalization
|
||||
unit.
|
||||
Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe
|
||||
traffic, divided by total duration.
|
||||
"Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read
|
||||
requests due to Infinity Fabric traffic, divided by total duration.
|
||||
Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM
|
||||
traffic, divided by total duration.
|
||||
Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
|
||||
write or atomically update 32B of data to any memory location, per normalization
|
||||
unit.
|
||||
Write and Atomic (Uncached): The total number of L2 requests to Infinity Fabric
|
||||
to write or atomically update 32B or 64B of uncached data, per normalization
|
||||
unit.
|
||||
Write and Atomic (64B): The total number of L2 requests to Infinity Fabric to
|
||||
write or atomically update 64B of data in any memory location, per normalization
|
||||
unit.
|
||||
HBM Write and Atomic: The total number of L2 requests to Infinity Fabric to write
|
||||
or atomically update 32B or 64B of data in the accelerator's local HBM, per
|
||||
normalization unit.
|
||||
Remote Write and Atomic: The total number of L2 requests to Infinity Fabric to
|
||||
write or atomically update 32B or 64B of data in any memory location other than
|
||||
the accelerator's local HBM, per normalization unit.
|
||||
Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to
|
||||
PCIe traffic, divided by total duration.
|
||||
"Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write
|
||||
requests due to Infinity Fabric traffic, divided by total duration.
|
||||
Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM
|
||||
traffic, divided by total duration.
|
||||
Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to
|
||||
PCIe traffic, divided by total duration.
|
||||
"Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic
|
||||
requests due to Infinity Fabric traffic, divided by total duration.
|
||||
Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
|
||||
HBM traffic, divided by total duration.
|
||||
Atomic: The total number of L2 requests to Infinity Fabric to atomically update
|
||||
32B or 64B of data in any memory location, per normalization unit. See Request
|
||||
flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
|
||||
requests are only considered atomic by Infinity Fabric if they are targeted
|
||||
at non-write-cacheable memory, such as fine-grained memory allocations or uncached
|
||||
memory allocations on the MI2XX.
|
||||
Read Stall: "The ratio of the total number of cycles the L2-Fabric interface was\
|
||||
\ stalled on a read request to any destination (local HBM, remote PCIe\xAE connected\
|
||||
\ accelerator or CPU, or remote Infinity Fabric connected accelerator or CPU)\
|
||||
\ over the total active L2 cycles."
|
||||
Write Stall: The ratio of the total number of cycles the L2-Fabric interface was
|
||||
stalled on a write or atomic request to any destination (local HBM, remote accelerator
|
||||
or CPU, PCIe connected accelerator or CPU, or remote Infinity Fabric connected
|
||||
accelerator or CPU) over the total active L2 cycles.
|
||||
Read - PCIe Stall: The number of cycles the L2-Fabric interface was stalled on
|
||||
read requests to remote PCIe connected accelerators or CPUs as a percent of
|
||||
the total active L2 cycles.
|
||||
Read - Infinity Fabric Stall: The number of cycles the L2-Fabric interface was
|
||||
stalled on read requests to remote Infinity Fabric connected accelerators or
|
||||
CPUs as a percent of the total active L2 cycles.
|
||||
Read - HBM Stall: The number of cycles the L2-Fabric interface was stalled on
|
||||
read requests to the accelerator's local HBM as a percent of the total active
|
||||
L2 cycles.
|
||||
Write - PCIe Stall: The number of cycles the L2-Fabric interface was stalled on
|
||||
write or atomic requests to remote PCIe connected accelerators or CPUs as a
|
||||
percent of the total active L2 cycles.
|
||||
Write - Infinity Fabric Stall: The number of cycles the L2-Fabric interface was
|
||||
stalled on write or atomic requests to remote Infinity Fabric connected accelerators
|
||||
or CPUs as a percent of the total active L2 cycles.
|
||||
Write - HBM Stall: The number of cycles the L2-Fabric interface was stalled on
|
||||
write or atomic requests to the accelerator's local HBM as a percent of the total
|
||||
active L2 cycles.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1701
|
||||
@@ -370,32 +158,32 @@ Panel Config:
|
||||
avg: AVG((TCC_REQ_sum / $denom))
|
||||
min: MIN((TCC_REQ_sum / $denom))
|
||||
max: MAX((TCC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read Req:
|
||||
avg: AVG((TCC_READ_sum / $denom))
|
||||
min: MIN((TCC_READ_sum / $denom))
|
||||
max: MAX((TCC_READ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Write Req:
|
||||
avg: AVG((TCC_WRITE_sum / $denom))
|
||||
min: MIN((TCC_WRITE_sum / $denom))
|
||||
max: MAX((TCC_WRITE_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Atomic Req:
|
||||
avg: AVG((TCC_ATOMIC_sum / $denom))
|
||||
min: MIN((TCC_ATOMIC_sum / $denom))
|
||||
max: MAX((TCC_ATOMIC_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Streaming Req:
|
||||
avg: AVG((TCC_STREAMING_REQ_sum / $denom))
|
||||
min: MIN((TCC_STREAMING_REQ_sum / $denom))
|
||||
max: MAX((TCC_STREAMING_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Probe Req:
|
||||
avg: AVG((TCC_PROBE_sum / $denom))
|
||||
min: MIN((TCC_PROBE_sum / $denom))
|
||||
max: MAX((TCC_PROBE_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Cache Hit:
|
||||
avg: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
@@ -408,17 +196,17 @@ Panel Config:
|
||||
avg: AVG((TCC_HIT_sum / $denom))
|
||||
min: MIN((TCC_HIT_sum / $denom))
|
||||
max: MAX((TCC_HIT_sum / $denom))
|
||||
unit: (Hits + $normUnit)
|
||||
unit: (Hits + $normUnit)
|
||||
Misses:
|
||||
avg: AVG((TCC_MISS_sum / $denom))
|
||||
min: MIN((TCC_MISS_sum / $denom))
|
||||
max: MAX((TCC_MISS_sum / $denom))
|
||||
unit: (Misses + $normUnit)
|
||||
unit: (Misses + $normUnit)
|
||||
Writeback:
|
||||
avg: AVG((TCC_WRITEBACK_sum / $denom))
|
||||
min: MIN((TCC_WRITEBACK_sum / $denom))
|
||||
max: MAX((TCC_WRITEBACK_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
unit: (Cachelines + $normUnit)
|
||||
Writeback (Internal):
|
||||
avg: AVG((TCC_NORMAL_WRITEBACK_sum / $denom))
|
||||
min: MIN((TCC_NORMAL_WRITEBACK_sum / $denom))
|
||||
@@ -443,22 +231,22 @@ Panel Config:
|
||||
avg: AVG((TCC_NC_REQ_sum / $denom))
|
||||
min: MIN((TCC_NC_REQ_sum / $denom))
|
||||
max: MAX((TCC_NC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
UC Req:
|
||||
avg: AVG((TCC_UC_REQ_sum / $denom))
|
||||
min: MIN((TCC_UC_REQ_sum / $denom))
|
||||
max: MAX((TCC_UC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
CC Req:
|
||||
avg: AVG((TCC_CC_REQ_sum / $denom))
|
||||
min: MIN((TCC_CC_REQ_sum / $denom))
|
||||
max: MAX((TCC_CC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
RW Req:
|
||||
avg: AVG((TCC_RW_REQ_sum / $denom))
|
||||
min: MIN((TCC_RW_REQ_sum / $denom))
|
||||
max: MAX((TCC_RW_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1704
|
||||
title: L2 Cache Stalls
|
||||
@@ -507,54 +295,216 @@ Panel Config:
|
||||
avg: AVG((TCC_EA0_RDREQ_32B_sum / $denom))
|
||||
min: MIN((TCC_EA0_RDREQ_32B_sum / $denom))
|
||||
max: MAX((TCC_EA0_RDREQ_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read (64B):
|
||||
avg: AVG(((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) / $denom))
|
||||
min: MIN(((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) / $denom))
|
||||
max: MAX(((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read (Uncached):
|
||||
avg: AVG((TCC_EA0_RD_UNCACHED_32B_sum / $denom))
|
||||
min: MIN((TCC_EA0_RD_UNCACHED_32B_sum / $denom))
|
||||
max: MAX((TCC_EA0_RD_UNCACHED_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
HBM Read:
|
||||
avg: AVG((TCC_EA0_RDREQ_DRAM_sum / $denom))
|
||||
min: MIN((TCC_EA0_RDREQ_DRAM_sum / $denom))
|
||||
max: MAX((TCC_EA0_RDREQ_DRAM_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Remote Read:
|
||||
avg: AVG((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
|
||||
min: MIN((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
|
||||
max: MAX((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Write and Atomic (32B):
|
||||
avg: AVG(MAX(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom), 0))
|
||||
min: MIN(MAX(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom), 0))
|
||||
max: MAX(MAX(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom), 0))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Write and Atomic (Uncached):
|
||||
avg: AVG((TCC_EA0_WR_UNCACHED_32B_sum / $denom))
|
||||
min: MIN((TCC_EA0_WR_UNCACHED_32B_sum / $denom))
|
||||
max: MAX((TCC_EA0_WR_UNCACHED_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Write and Atomic (64B):
|
||||
avg: AVG((TCC_EA0_WRREQ_64B_sum / $denom))
|
||||
min: MIN((TCC_EA0_WRREQ_64B_sum / $denom))
|
||||
max: MAX((TCC_EA0_WRREQ_64B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
HBM Write and Atomic:
|
||||
avg: AVG((TCC_EA0_WRREQ_DRAM_sum / $denom))
|
||||
min: MIN((TCC_EA0_WRREQ_DRAM_sum / $denom))
|
||||
max: MAX((TCC_EA0_WRREQ_DRAM_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Remote Write and Atomic:
|
||||
avg: AVG((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
|
||||
min: MIN((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
|
||||
max: MAX((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Atomic:
|
||||
avg: AVG((TCC_EA0_ATOMIC_sum / $denom))
|
||||
min: MIN((TCC_EA0_ATOMIC_sum / $denom))
|
||||
max: MAX((TCC_EA0_ATOMIC_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
metrics_description:
|
||||
Utilization: The ratio of the number of cycles an L2 channel was active, summed
|
||||
over all L2 channels on the accelerator over the total L2 cycles.
|
||||
Peak Bandwidth: The number of bytes looked up in the L2 cache, as a percent of
|
||||
the peak theoretical bandwidth achievable on the specific accelerator. The number
|
||||
of bytes is calculated as the number of cache lines requested multiplied by
|
||||
the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line.
|
||||
Hit Rate: The ratio of the number of L2 cache line requests that hit in the L2
|
||||
cache over the total number of incoming cache line requests to the L2 cache.
|
||||
L2-Fabric Read BW: The number of bytes read by the L2 over the Infinity Fabric
|
||||
interface per unit time.
|
||||
L2-Fabric Write and Atomic BW: The number of bytes sent by the L2 over the Infinity
|
||||
Fabric interface by write and atomic operations per unit time.
|
||||
HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
|
||||
memory (HBM) per unit time. This value is calculated as the number of HBM channels
|
||||
multiplied by the HBM channel width multiplied by the HBM clock frequency.
|
||||
Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided
|
||||
by total duration.
|
||||
HBM Read Traffic: The percent of read requests generated by the L2 cache that
|
||||
are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
|
||||
does not consider the size of the request (meaning that 32B and 64B requests
|
||||
are both counted as a single request), so this metric only approximates the
|
||||
percent of the L2-Fabric Read bandwidth directed to the local HBM.
|
||||
Remote Read Traffic: The percent of read requests generated by the L2 cache that
|
||||
are routed to any memory location other than the accelerator's local high-bandwidth
|
||||
memory (HBM) - for example, the CPU's DRAM or a remote accelerator's HBM. This
|
||||
breakdown does not consider the size of the request (meaning that 32B and 64B
|
||||
requests are both counted as a single request), so this metric only approximates
|
||||
the percent of the L2-Fabric Read bandwidth directed to a remote location.
|
||||
Uncached Read Traffic: The percent of read requests generated by the L2 cache
|
||||
that are reading from an uncached memory allocation. Note, as described in the
|
||||
request flow section, a single 64B read request is typically counted as two
|
||||
uncached read requests. So, it is possible for the Uncached Read Traffic to
|
||||
reach up to 200% of the total number of read requests. This breakdown does not
|
||||
consider the size of the request (i.e., 32B and 64B requests are both counted
|
||||
as a single request), so this metric only approximates the percent of the L2-Fabric
|
||||
read bandwidth directed to an uncached memory location.
|
||||
Write and Atomic BW: The total number of bytes written by the L2 over Infinity
|
||||
Fabric by write and atomic operations divided by total duration. Note that on
|
||||
current CDNA accelerators, such as the MI2XX, requests are only considered atomic
|
||||
by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
|
||||
fine-grained memory allocations or uncached memory allocations on the MI2XX.
|
||||
HBM Write and Atomic Traffic: The percent of write and atomic requests generated
|
||||
by the L2 cache that are routed to the accelerator's local high-bandwidth memory
|
||||
(HBM). This breakdown does not consider the size of the request (meaning that
|
||||
32B and 64B requests are both counted as a single request), so this metric only
|
||||
approximates the percent of the L2-Fabric Write and Atomic bandwidth directed
|
||||
to the local HBM. Note that on current CDNA accelerators, such as the MI2XX,
|
||||
requests are only considered atomic by Infinity Fabric if they are targeted
|
||||
at fine-grained memory allocations or uncached memory allocations.
|
||||
Remote Write and Atomic Traffic: The percent of write and atomic requests generated
by the L2 cache that are routed to any memory location other than the accelerator's
local high-bandwidth memory (HBM) - for example, the CPU's DRAM or a remote
accelerator's HBM. This breakdown does not consider the size of the request
(meaning that 32B and 64B requests are both counted as a single request), so
this metric only approximates the percent of the L2-Fabric Write and Atomic bandwidth
directed to a remote location. Note that on current CDNA accelerators, such as the MI2XX,
requests are only considered atomic by Infinity Fabric if they are targeted
at fine-grained memory allocations or uncached memory allocations.
|
||||
Atomic Traffic: The percent of write and atomic requests generated by the L2 cache
that are atomic requests to any memory location. This breakdown does not consider the
size of the request (meaning that 32B and 64B requests are both counted as a
single request), so this metric only approximates the percent of the L2-Fabric
Write and Atomic bandwidth used by atomic requests. Note that on current CDNA accelerators,
such as the MI2XX, requests are only considered atomic by Infinity Fabric if
they are targeted at fine-grained memory allocations or uncached memory allocations.
|
||||
Uncached Write and Atomic Traffic: The percent of write and atomic requests generated
|
||||
by the L2 cache that are targeting uncached memory allocations. This breakdown
|
||||
does not consider the size of the request (meaning that 32B and 64B requests
|
||||
are both counted as a single request), so this metric only approximates the
|
||||
percent of the L2-Fabric Write and Atomic bandwidth directed to uncached memory allocations.
|
||||
Read Latency: The time-averaged number of cycles read requests spent in Infinity
|
||||
Fabric before data was returned to the L2.
|
||||
Write and Atomic Latency: The time-averaged number of cycles write requests spent
|
||||
in Infinity Fabric before a completion acknowledgement was returned to the L2.
|
||||
Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
|
||||
Fabric before a completion acknowledgement (atomic without return value) or
|
||||
data (atomic with return value) was returned to the L2.
|
||||
Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration.
|
||||
The number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so for
|
||||
example, if only a single value is requested in a cache line, the data movement
|
||||
will still be counted as a full cache line.
|
||||
Req: The total number of incoming requests to the L2 from all clients for all
|
||||
request types, per normalization unit.
|
||||
Read Req: The total number of read requests to the L2 from all clients.
|
||||
Write Req: The total number of write requests to the L2 from all clients.
|
||||
Atomic Req: The total number of atomic requests (with and without return) to the
|
||||
L2 from all clients.
|
||||
Streaming Req: The total number of incoming requests to the L2 that are marked
|
||||
as streaming. The exact meaning of this may differ depending on the targeted
|
||||
accelerator, but on an MI2XX this corresponds to non-temporal loads or stores.
|
||||
The L2 cache attempts to evict streaming requests before normal requests when
|
||||
the L2 is at capacity.
|
||||
Probe Req: The number of coherence probe requests made to the L2 cache from outside
|
||||
the accelerator. On an MI2XX, probe requests may be generated by, for example,
|
||||
writes to fine-grained device memory or by writes to coarse-grained device memory.
|
||||
Cache Hit: The ratio of the number of L2 cache line requests that hit in the L2
|
||||
cache over the total number of incoming cache line requests to the L2 cache.
|
||||
Hits: The total number of requests to the L2 from all clients that hit in the
|
||||
cache. As noted in the Speed-of-Light section, this includes hit-on-miss requests.
|
||||
Misses: The total number of requests to the L2 from all clients that miss in the
|
||||
cache. As noted in the Speed-of-Light section, these do not include hit-on-miss
|
||||
requests.
|
||||
Writeback: The total number of L2 cache lines written back to memory for any reason.
Write-backs may occur due to user code (such as HIP kernel calls to __threadfence_system
or atomic built-ins), by the command processor's memory acquire/release fences,
or for other internal hardware reasons.
|
||||
Writeback (Internal): The total number of L2 cache lines written back to memory
|
||||
for internal hardware reasons, per normalization unit.
|
||||
Writeback (vL1D Req): The total number of L2 cache lines written back to memory
|
||||
due to requests initiated by the vL1D cache, per normalization unit.
|
||||
Evict (Internal): The total number of L2 cache lines evicted from the cache due
|
||||
to capacity limits, per normalization unit.
|
||||
Evict (vL1D Req): The total number of L2 cache lines evicted from the cache due
|
||||
to invalidation requests initiated by the vL1D cache, per normalization unit.
|
||||
NC Req: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory
|
||||
allocations, per normalization unit.
|
||||
UC Req: The total number of requests to the L2 that go to Uncached (UC) memory
|
||||
allocations.
|
||||
CC Req: The total number of requests to the L2 that go to Coherently Cacheable
|
||||
(CC) memory allocations.
|
||||
RW Req: The total number of requests to the L2 that go to Read-Write coherent
|
||||
memory (RW) allocations.
|
||||
Write - Credit Starvation: The number of cycles the L2-Fabric interface was stalled
|
||||
on write or atomic requests to any memory location because too many write/atomic
|
||||
requests were currently in flight, as a percent of the total active L2 cycles.
|
||||
Read (32B): The total number of L2 requests to Infinity Fabric to read 32B of
|
||||
data from any memory location, per normalization unit.
|
||||
Read (64B): The total number of L2 requests to Infinity Fabric to read 64B of
|
||||
data from any memory location, per normalization unit.
|
||||
Read (Uncached): The total number of L2 requests to Infinity Fabric to read uncached
|
||||
data from any memory location, per normalization unit. 64B requests for uncached
|
||||
data are counted as two 32B uncached data requests.
|
||||
HBM Read: The total number of L2 requests to Infinity Fabric to read 32B or 64B
|
||||
of data from the accelerator's local HBM, per normalization unit.
|
||||
Remote Read: The total number of L2 requests to Infinity Fabric to read 32B or
|
||||
64B of data from any source other than the accelerator's local HBM, per normalization
|
||||
unit.
|
||||
Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
|
||||
write or atomically update 32B of data to any memory location, per normalization
|
||||
unit.
|
||||
Write and Atomic (Uncached): The total number of L2 requests to Infinity Fabric
|
||||
to write or atomically update 32B or 64B of uncached data, per normalization
|
||||
unit.
|
||||
Write and Atomic (64B): The total number of L2 requests to Infinity Fabric to
|
||||
write or atomically update 64B of data in any memory location, per normalization
|
||||
unit.
|
||||
HBM Write and Atomic: The total number of L2 requests to Infinity Fabric to write
|
||||
or atomically update 32B or 64B of data in the accelerator's local HBM, per
|
||||
normalization unit.
|
||||
Remote Write and Atomic: The total number of L2 requests to Infinity Fabric to
|
||||
write or atomically update 32B or 64B of data in any memory location other than
|
||||
the accelerator's local HBM, per normalization unit.
|
||||
Atomic: The total number of L2 requests to Infinity Fabric to atomically update
|
||||
32B or 64B of data in any memory location, per normalization unit. See Request
|
||||
flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
|
||||
requests are only considered atomic by Infinity Fabric if they are targeted
|
||||
at non-write-cacheable memory, such as fine-grained memory allocations or uncached
|
||||
memory allocations on the MI2XX.
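The read-side breakdown described above can be sketched in Python. This is a minimal illustration, assuming the traffic percentages are taken relative to the total L2-Fabric read request count. The function name `fabric_read_breakdown`, its parameters, and the sample counter values are hypothetical; the subtraction and the `max(..., 0)` clamp mirror the `Read (64B)` and `Remote Read` expressions in the metric table.

```python
def fabric_read_breakdown(rdreq, rdreq_32b, rdreq_dram, rd_uncached_32b, denom=1):
    """Split L2-Fabric read requests the way the metric table does.

    Arguments stand in for TCC_EA0_RDREQ_sum, TCC_EA0_RDREQ_32B_sum,
    TCC_EA0_RDREQ_DRAM_sum and TCC_EA0_RD_UNCACHED_32B_sum; denom plays
    the role of $denom (the normalization unit).
    """
    def pct(part):
        # Assumption: traffic percentages are relative to all read requests.
        return 100 * part / rdreq if rdreq != 0 else None

    remote = max(rdreq - rdreq_dram, 0)  # clamp, as in the Remote Read expression
    return {
        "read_32b": rdreq_32b / denom,
        "read_64b": (rdreq - rdreq_32b) / denom,
        "hbm_read": rdreq_dram / denom,
        "remote_read": remote / denom,
        "hbm_read_pct": pct(rdreq_dram),
        "remote_read_pct": pct(remote),
        "uncached_read_pct": pct(rd_uncached_32b),
    }


# Made-up counter values: every read is a 64B uncached read, so each one is
# counted as two uncached 32B requests and the uncached share reaches 200%.
sample = fabric_read_breakdown(rdreq=100, rdreq_32b=0, rdreq_dram=70, rd_uncached_32b=200)
print(sample["remote_read"], sample["uncached_read_pct"])
```

The clamp matters when the DRAM counter slightly exceeds the total: without it the remote count would go negative, which is presumably why the config wraps the difference in `MAX(..., 0)`.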
|
||||
+4
-4
@@ -2,10 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1800
|
||||
title: L2 Cache (per Channel)
|
||||
metrics_description:
|
||||
L2 Cache Hit Rate: The percent of total number of requests to the L2 from all
|
||||
clients that hit in the cache. As noted in the Speed-of-Light section, this
|
||||
includes hit-on-miss requests.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1801
|
||||
@@ -249,3 +245,7 @@ Panel Config:
|
||||
::_1: $total_l2_chan
|
||||
cli_style: simple_box
|
||||
tui_style: simple_box
|
||||
metrics_description:
|
||||
L2 Cache Hit Rate: The percent of total number of requests to the L2 from all
|
||||
clients that hit in the cache. As noted in the Speed-of-Light section, this
|
||||
includes hit-on-miss requests.
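As a rough sketch of the hit-rate definition above, the Python below applies the same zero-guarded expression used for `Cache Hit` in the L2 Cache panel to each channel. The per-channel lists and both function names are hypothetical; how the profiler actually indexes per-channel counters is not shown here.

```python
def hit_rate_pct(hits, misses):
    """Percent of requests that hit, or None when there were no requests."""
    total = hits + misses
    return (100 * hits) / total if total != 0 else None


def per_channel_hit_rate(hits_by_channel, misses_by_channel):
    """Apply hit_rate_pct to each (hits, misses) pair, one pair per L2 channel."""
    return [hit_rate_pct(h, m) for h, m in zip(hits_by_channel, misses_by_channel)]


# Made-up counts for three channels; the idle channel yields None, not 0.
rates = per_channel_hit_rate([75, 0, 30], [25, 0, 10])
print(rates)
```

Returning `None` for an idle channel keeps it distinguishable from a channel that saw traffic but never hit.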
|
||||
+1
-1
@@ -2,10 +2,10 @@
|
||||
Panel Config:
|
||||
id: 2100
|
||||
title: PC Sampling
|
||||
metrics_description: {}
|
||||
data source:
|
||||
- pc_sampling_table:
|
||||
id: 2101
|
||||
title: PC Sampling
|
||||
source: ps_file
|
||||
comparable: false
|
||||
metrics_description: {}
|
||||
+755
@@ -0,0 +1,755 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated by tools/config_management/generate_config_deltas.py
|
||||
Addition:
|
||||
- Panel Config:
|
||||
id: 200
|
||||
title: System Speed-of-Light
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 201
|
||||
title: System Speed-of-Light
|
||||
metrics:
|
||||
- MFMA FLOPs (F6F4):
|
||||
value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 16834) / 1000)
|
||||
pop: |
|
||||
((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 16834) / 1000))
|
||||
- Panel Config:
|
||||
id: 300
|
||||
title: Memory Chart
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 301
|
||||
title: Memory Chart
|
||||
metrics:
|
||||
- L2 Wr Lat:
|
||||
value: |
|
||||
ROUND(AVG(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0) else None)), 0)
|
||||
- L2 Rd Lat:
|
||||
value: |
|
||||
ROUND(AVG(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)) if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else None)), 0)
|
||||
- Panel Config:
|
||||
id: 400
|
||||
title: Roofline
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 401
|
||||
title: Roofline Performance Rates
|
||||
metrics:
|
||||
- MFMA FLOPs (F6F4):
|
||||
value: |
|
||||
AVG((((SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
|
||||
unit: GFLOP/s
|
||||
peak: $MFMA_FLOPs_F6F4_empirical_peak
|
||||
- Panel Config:
|
||||
id: 500
|
||||
title: Command Processor (CPC/CPF)
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 502
|
||||
title: Command processor packet processor (CPC)
|
||||
metrics:
|
||||
- CPC SYNC FIFO Full Rate:
|
||||
avg: |
|
||||
AVG((100 * CPC_SYNC_FIFO_FULL) / CPC_SYNC_WRREQ_FIFO_BUSY if (CPC_SYNC_WRREQ_FIFO_BUSY != 0) else None)
|
||||
min: |
|
||||
MIN((100 * CPC_SYNC_FIFO_FULL) / CPC_SYNC_WRREQ_FIFO_BUSY if (CPC_SYNC_WRREQ_FIFO_BUSY != 0) else None)
|
||||
max: |
|
||||
MAX((100 * CPC_SYNC_FIFO_FULL) / CPC_SYNC_WRREQ_FIFO_BUSY if (CPC_SYNC_WRREQ_FIFO_BUSY != 0) else None)
|
||||
unit: pct
|
||||
- CPC CANE Stall Rate:
|
||||
avg: AVG((100 * CPC_CANE_STALL) / CPC_CANE_BUSY if (CPC_CANE_BUSY != 0) else None)
|
||||
min: MIN((100 * CPC_CANE_STALL) / CPC_CANE_BUSY if (CPC_CANE_BUSY != 0) else None)
|
||||
max: MAX((100 * CPC_CANE_STALL) / CPC_CANE_BUSY if (CPC_CANE_BUSY != 0) else None)
|
||||
unit: pct
|
||||
- CPC ADC Utilization:
|
||||
avg: AVG((100 * CPC_TG_SEND) / CPC_GD_BUSY if (CPC_GD_BUSY != 0) else None)
|
||||
min: MIN((100 * CPC_TG_SEND) / CPC_GD_BUSY if (CPC_GD_BUSY != 0) else None)
|
||||
max: MAX((100 * CPC_TG_SEND) / CPC_GD_BUSY if (CPC_GD_BUSY != 0) else None)
|
||||
unit: pct
|
||||
- Panel Config:
|
||||
id: 600
|
||||
title: Workgroup Manager (SPI)
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 601
|
||||
title: Workgroup manager utilizations
|
||||
metrics:
|
||||
- Scheduler-Pipe Wave Utilization:
|
||||
avg: |
|
||||
AVG(100 * (SPI_CSC_WAVE_CNT_BUSY) / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
|
||||
min: |
|
||||
MIN(100 * (SPI_CSC_WAVE_CNT_BUSY) / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
|
||||
max: |
|
||||
MAX(100 * (SPI_CSC_WAVE_CNT_BUSY) / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
|
||||
unit: Pct
|
||||
- Schedule-Pipe Wave Occupancy:
|
||||
avg: |
|
||||
AVG(SPI_CSQ_P0_OCCUPANCY + SPI_CSQ_P1_OCCUPANCY + SPI_CSQ_P2_OCCUPANCY + SPI_CSQ_P3_OCCUPANCY)
|
||||
min: |
|
||||
MIN(SPI_CSQ_P0_OCCUPANCY + SPI_CSQ_P1_OCCUPANCY + SPI_CSQ_P2_OCCUPANCY + SPI_CSQ_P3_OCCUPANCY)
|
||||
max: |
|
||||
MAX(SPI_CSQ_P0_OCCUPANCY + SPI_CSQ_P1_OCCUPANCY + SPI_CSQ_P2_OCCUPANCY + SPI_CSQ_P3_OCCUPANCY)
|
||||
unit: Wave
|
||||
- metric_table:
|
||||
id: 602
|
||||
title: Workgroup Manager - Resource Allocation
|
||||
metrics:
|
||||
- Scheduler-Pipe FIFO Full Rate:
|
||||
avg: |
|
||||
AVG((100 * (SPI_CS0_CRAWLER_STALL + SPI_CS1_CRAWLER_STALL + SPI_CS2_CRAWLER_STALL + SPI_CS3_CRAWLER_STALL) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
min: |
|
||||
MIN((100 * (SPI_CS0_CRAWLER_STALL + SPI_CS1_CRAWLER_STALL + SPI_CS2_CRAWLER_STALL + SPI_CS3_CRAWLER_STALL) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
max: |
|
||||
MAX((100 * (SPI_CS0_CRAWLER_STALL + SPI_CS1_CRAWLER_STALL + SPI_CS2_CRAWLER_STALL + SPI_CS3_CRAWLER_STALL) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
unit: Pct
|
||||
- Panel Config:
|
||||
id: 1000
|
||||
title: Compute Units - Instruction Mix
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 1003
|
||||
title: VMEM Instruction Mix
|
||||
metrics:
|
||||
- Spill/Stack Coalesceable Instr:
|
||||
avg: AVG((TA_BUFFER_COALESCEABLE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_COALESCEABLE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_COALESCEABLE_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
- metric_table:
|
||||
id: 1004
|
||||
title: MFMA Arithmetic Instruction Mix
|
||||
metrics:
|
||||
- MFMA-F6F4:
|
||||
avg: AVG((SQ_INSTS_VALU_MFMA_F6F4 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_MFMA_F6F4 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_MFMA_F6F4 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
- Panel Config:
|
||||
id: 1100
|
||||
title: Compute Units - Compute Pipeline
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 1101
|
||||
title: Compute Speed-of-Light
|
||||
metrics:
|
||||
- MFMA FLOPs (F6F4):
|
||||
value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GFLOP
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 16834) / 1000)
|
||||
pop: |
|
||||
((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 16834) / 1000))
|
||||
- metric_table:
|
||||
id: 1102
|
||||
title: Pipeline Statistics
|
||||
metrics:
|
||||
- VALU Co-Issue Efficiency:
|
||||
avg: AVG((100 * SQ_ACTIVE_INST_VALU2) / (SQ_ACTIVE_INST_VALU - SQ_ACTIVE_INST_VALU2))
|
||||
min: MIN((100 * SQ_ACTIVE_INST_VALU2) / (SQ_ACTIVE_INST_VALU - SQ_ACTIVE_INST_VALU2))
|
||||
max: MAX((100 * SQ_ACTIVE_INST_VALU2) / (SQ_ACTIVE_INST_VALU - SQ_ACTIVE_INST_VALU2))
|
||||
unit: pct
|
||||
- metric_table:
|
||||
id: 1103
|
||||
title: Arithmetic Operations
|
||||
metrics:
|
||||
- F6F4 OPs:
|
||||
avg: AVG((512 * SQ_INSTS_VALU_MFMA_MOPS_F6F4) / $denom)
|
||||
min: MIN((512 * SQ_INSTS_VALU_MFMA_MOPS_F6F4) / $denom)
|
||||
max: MAX((512 * SQ_INSTS_VALU_MFMA_MOPS_F6F4) / $denom)
|
||||
unit: (OPs + $normUnit)
|
||||
- Panel Config:
|
||||
id: 1200
|
||||
title: Local Data Share (LDS)
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 1202
|
||||
title: LDS Statistics
|
||||
metrics:
|
||||
- LDS STORE:
|
||||
avg: AVG((SQ_INSTS_LDS_STORE / $denom))
|
||||
min: MIN((SQ_INSTS_LDS_STORE / $denom))
|
||||
max: MAX((SQ_INSTS_LDS_STORE / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
- LDS LOAD Bandwidth:
|
||||
avg: AVG(64 * SQ_INSTS_LDS_LOAD_BANDWIDTH / (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(64 * SQ_INSTS_LDS_LOAD_BANDWIDTH / (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(64 * SQ_INSTS_LDS_LOAD_BANDWIDTH / (End_Timestamp - Start_Timestamp))
|
||||
units: Gbps
|
||||
- LDS ATOMIC:
|
||||
avg: AVG((SQ_INSTS_LDS_ATOMIC / $denom))
|
||||
min: MIN((SQ_INSTS_LDS_ATOMIC / $denom))
|
||||
max: MAX((SQ_INSTS_LDS_ATOMIC / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
- LDS STORE Bandwidth:
|
||||
avg: AVG(64 * SQ_INSTS_LDS_STORE_BANDWIDTH / (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(64 * SQ_INSTS_LDS_STORE_BANDWIDTH / (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(64 * SQ_INSTS_LDS_STORE_BANDWIDTH / (End_Timestamp - Start_Timestamp))
|
||||
units: Gbps
|
||||
- LDS Command FIFO Full Rate:
|
||||
avg: AVG((SQ_LDS_CMD_FIFO_FULL / $denom))
|
||||
min: MIN((SQ_LDS_CMD_FIFO_FULL / $denom))
|
||||
max: MAX((SQ_LDS_CMD_FIFO_FULL / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
- LDS LOAD:
|
||||
avg: AVG((SQ_INSTS_LDS_LOAD / $denom))
|
||||
min: MIN((SQ_INSTS_LDS_LOAD / $denom))
|
||||
max: MAX((SQ_INSTS_LDS_LOAD / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
- LDS ATOMIC Bandwidth:
|
||||
avg: AVG(64 * SQ_INSTS_LDS_ATOMIC_BANDWIDTH / (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(64 * SQ_INSTS_LDS_ATOMIC_BANDWIDTH / (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(64 * SQ_INSTS_LDS_ATOMIC_BANDWIDTH / (End_Timestamp - Start_Timestamp))
|
||||
units: Gbps
|
||||
- LDS Data FIFO Full Rate:
|
||||
avg: AVG((SQ_LDS_DATA_FIFO_FULL / $denom))
|
||||
min: MIN((SQ_LDS_DATA_FIFO_FULL / $denom))
|
||||
max: MAX((SQ_LDS_DATA_FIFO_FULL / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
- Panel Config:
|
||||
id: 1500
|
||||
title: Address Processing Unit and Data Return Path (TA/TD)
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 1504
|
||||
title: Vector L1 data-return path or Texture Data (TD)
|
||||
metrics:
|
||||
- Write Ack Instructions:
|
||||
avg: AVG((TD_WRITE_ACKT_WAVEFRONT_sum / $denom))
|
||||
min: MIN((TD_WRITE_ACKT_WAVEFRONT_sum / $denom))
|
||||
max: MAX((TD_WRITE_ACKT_WAVEFRONT_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
- metric_table:
|
||||
id: 1502
|
||||
title: Instruction counts
|
||||
metrics:
|
||||
- Spill/Stack Read Instructions for LDS:
|
||||
avg: AVG((TA_BUFFER_READ_LDS_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_READ_LDS_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_READ_LDS_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
- Global/Generic Read Instructions for LDS:
|
||||
avg: AVG((TA_FLAT_READ_LDS_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_READ_LDS_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_READ_LDS_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
- Panel Config:
|
||||
id: 1600
|
||||
title: Vector L1 Data Cache
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 1602
|
||||
title: vL1D cache stall metrics
|
||||
metrics:
|
||||
- Stalled on Address:
|
||||
expr: |
|
||||
(((100 * TCP_TCP_TA_ADDR_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum != 0) else None)
|
||||
- Stalled on Read Return:
|
||||
expr: |
|
||||
(((100 * TCP_TCR_RDRET_STALL_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum != 0) else None)
|
||||
- Stalled on Request FIFO:
|
||||
expr: |
|
||||
(((100 * TCP_RFIFO_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum != 0) else None)
|
||||
- Stalled on Data:
|
||||
expr: |
|
||||
(((100 * TCP_TCP_TA_DATA_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum != 0) else None)
|
||||
- Stalled on Latency FIFO:
|
||||
expr: |
|
||||
(((100 * TCP_LFIFO_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum != 0) else None)
|
||||
- metric_table:
|
||||
id: 1603
|
||||
title: vL1D cache access metrics
|
||||
metrics:
|
||||
- Tag RAM 3 Req:
|
||||
avg: AVG((TCP_TAGRAM3_REQ_sum / $denom))
|
||||
min: MIN((TCP_TAGRAM3_REQ_sum / $denom))
|
||||
max: MAX((TCP_TAGRAM3_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
- L1-L2 Read Latency:
|
||||
avg: AVG((TCP_TCC_READ_REQ_LATENCY_sum / $denom))
|
||||
min: MIN((TCP_TCC_READ_REQ_LATENCY_sum / $denom))
|
||||
max: MAX((TCP_TCC_READ_REQ_LATENCY_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
- Tag RAM 2 Req:
|
||||
avg: AVG((TCP_TAGRAM2_REQ_sum / $denom))
|
||||
min: MIN((TCP_TAGRAM2_REQ_sum / $denom))
|
||||
max: MAX((TCP_TAGRAM2_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
- Tag RAM 0 Req:
|
||||
avg: AVG((TCP_TAGRAM0_REQ_sum / $denom))
|
||||
min: MIN((TCP_TAGRAM0_REQ_sum / $denom))
|
||||
max: MAX((TCP_TAGRAM0_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
- L1-L2 Write Latency:
|
||||
avg: AVG((TCP_TCC_WRITE_REQ_LATENCY_sum / $denom))
|
||||
min: MIN((TCP_TCC_WRITE_REQ_LATENCY_sum / $denom))
|
||||
max: MAX((TCP_TCC_WRITE_REQ_LATENCY_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
- L1 Access Latency:
|
||||
avg: AVG((TCP_TCP_LATENCY_sum / $denom))
|
||||
min: MIN((TCP_TCP_LATENCY_sum / $denom))
|
||||
max: MAX((TCP_TCP_LATENCY_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
- Tag RAM 1 Req:
|
||||
avg: AVG((TCP_TAGRAM1_REQ_sum / $denom))
|
||||
min: MIN((TCP_TAGRAM1_REQ_sum / $denom))
|
||||
max: MAX((TCP_TAGRAM1_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1605
|
||||
title: L1 Unified Translation Cache (UTCL1)
|
||||
metrics:
|
||||
- Misses under Translation Miss:
|
||||
avg: AVG((TCP_UTCL1_TRANSLATION_MISS_UNDER_MISS_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_TRANSLATION_MISS_UNDER_MISS_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_TRANSLATION_MISS_UNDER_MISS_sum / $denom))
|
||||
units: (Req + $normUnit)
|
||||
- Inflight Req:
|
||||
avg: AVG((TCP_CLIENT_UTCL1_INFLIGHT_sum / $denom))
|
||||
min: MIN((TCP_CLIENT_UTCL1_INFLIGHT_sum / $denom))
|
||||
max: MAX((TCP_CLIENT_UTCL1_INFLIGHT_sum / $denom))
|
||||
units: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1606
|
||||
title: L1D Addr Translation Stalls
|
||||
metrics:
|
||||
- Latency FIFO Stall:
|
||||
avg: AVG((TCP_UTCL1_LFIFO_FULL_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_LFIFO_FULL_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_LFIFO_FULL_sum / $denom))
|
||||
units: (Cycles + $normUnit)
|
||||
- Serialization Stall:
|
||||
avg: AVG((TCP_UTCL1_SERIALIZATION_STALL_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_SERIALIZATION_STALL_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_SERIALIZATION_STALL_sum / $denom))
|
||||
units: (Cycles + $normUnit)
|
||||
- Cache Full Stall:
|
||||
avg: AVG((TCP_UTCL1_STALL_INFLIGHT_MAX_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_STALL_INFLIGHT_MAX_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_STALL_INFLIGHT_MAX_sum / $denom))
|
||||
units: (Cycles + $normUnit)
|
||||
- UTCL2 Stall:
|
||||
avg: AVG((TCP_UTCL1_STALL_UTCL2_REQ_OUT_OF_CREDITS_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_STALL_UTCL2_REQ_OUT_OF_CREDITS_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_STALL_UTCL2_REQ_OUT_OF_CREDITS_sum / $denom))
|
||||
units: (Cycles + $normUnit)
|
||||
- Cache Miss Stall:
|
||||
avg: AVG((TCP_UTCL1_STALL_MULTI_MISS_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_STALL_MULTI_MISS_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_STALL_MULTI_MISS_sum / $denom))
|
||||
units: (Cycles + $normUnit)
|
||||
- Resident Page Full Stall:
|
||||
avg: AVG((TCP_UTCL1_STALL_LFIFO_NO_RES_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_STALL_LFIFO_NO_RES_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_STALL_LFIFO_NO_RES_sum / $denom))
|
||||
units: (Cycles + $normUnit)
|
||||
- Thrashing Stall:
|
||||
avg: AVG((TCP_UTCL1_THRASHING_STALL_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_THRASHING_STALL_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_THRASHING_STALL_sum / $denom))
|
||||
units: (Cycles + $normUnit)
|
||||
- Panel Config:
|
||||
id: 1700
|
||||
title: L2 Cache
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 1702
|
||||
title: L2-Fabric interface metrics
|
||||
metrics:
|
||||
- Write Stall:
|
||||
avg: |
|
||||
AVG(((100 * (TCC_EA0_WRREQ_STALL_sum) / TCC_BUSY_sum) if (TCC_BUSY_sum != 0) else None))
|
||||
min: |
|
||||
MIN(((100 * (TCC_EA0_WRREQ_STALL_sum) / TCC_BUSY_sum) if (TCC_BUSY_sum != 0) else None))
|
||||
max: |
|
||||
MAX(((100 * (TCC_EA0_WRREQ_STALL_sum) / TCC_BUSY_sum) if (TCC_BUSY_sum != 0) else None))
|
||||
unit: pct
|
||||
- Read Stall:
|
||||
avg: |
|
||||
AVG((((100 * ((TCC_EA0_RDREQ_IO_CREDIT_STALL_sum + TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum) + TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum)) / TCC_BUSY_sum) if (TCC_BUSY_sum != 0) else None))
|
||||
min: |
|
||||
MIN((((100 * ((TCC_EA0_RDREQ_IO_CREDIT_STALL_sum + TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum) + TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum)) / TCC_BUSY_sum) if (TCC_BUSY_sum != 0) else None))
|
||||
max: |
|
||||
MAX((((100 * ((TCC_EA0_RDREQ_IO_CREDIT_STALL_sum + TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum) + TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum)) / TCC_BUSY_sum) if (TCC_BUSY_sum != 0) else None))
|
||||
unit: pct
|
||||
- metric_table:
|
||||
id: 1703
|
||||
title: L2 Cache Accesses
|
||||
metrics:
|
||||
- Input Buffer Req:
|
||||
avg: AVG((TCC_IB_REQ_sum / $denom))
|
||||
min: MIN((TCC_IB_REQ_sum / $denom))
|
||||
max: MAX((TCC_IB_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
- Bypass Req:
|
||||
avg: AVG((TCC_BYPASS_REQ_sum / $denom))
|
||||
min: MIN((TCC_BYPASS_REQ_sum / $denom))
|
||||
max: MAX((TCC_BYPASS_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
- Atomic Bandwidth:
|
||||
avg: AVG(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
- Write Bandwidth:
|
||||
avg: AVG(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
- Read Bandwidth:
|
||||
avg: AVG(TCC_READ_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_READ_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_READ_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
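The bandwidth entries in table 1703 multiply a sector count by 32 bytes and divide by the dispatch duration, then take AVG/MIN/MAX. The sketch below assumes that timestamps are in nanoseconds (so bytes per nanosecond is numerically GB/s) and that the aggregates run over per-dispatch values. Both are assumptions, and the helper names and sample data are hypothetical.

```python
SECTOR_BYTES = 32  # multiplier used by the expressions in this table


def sector_bandwidth(sectors: int, start_ts: int, end_ts: int) -> float:
    """Bytes moved per timestamp unit for one dispatch."""
    return sectors * SECTOR_BYTES / (end_ts - start_ts)


def aggregate(values):
    """AVG / MIN / MAX over per-dispatch values."""
    return {
        "avg": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
    }


# Hypothetical dispatches: (sector count, start timestamp, end timestamp).
dispatches = [(1000, 0, 500), (3000, 0, 1000)]
per_dispatch = [sector_bandwidth(s, t0, t1) for s, t0, t1 in dispatches]
print(per_dispatch)             # [64.0, 96.0]
print(aggregate(per_dispatch))  # {'avg': 80.0, 'min': 64.0, 'max': 96.0}
```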
- metric_table:
|
||||
id: 1704
|
||||
title: L2 Cache Stalls
|
||||
metrics:
|
||||
- Input Buffer Stalled on L2:
|
||||
avg: AVG(TCC_IB_STALL_sum / $denom)
|
||||
min: MIN(TCC_IB_STALL_sum / $denom)
|
||||
max: MAX(TCC_IB_STALL_sum / $denom)
|
||||
unit: (Cycles + $normUnit)
|
||||
- Stalled on Latency FIFO:
|
||||
avg: AVG(TCC_LATENCY_FIFO_FULL_sum / $denom)
|
||||
min: MIN(TCC_LATENCY_FIFO_FULL_sum / $denom)
|
||||
max: MAX(TCC_LATENCY_FIFO_FULL_sum / $denom)
|
||||
unit: (Cycles + $normUnit)
|
||||
- Stalled on Write Data FIFO:
|
||||
avg: AVG(TCC_SRC_FIFO_FULL_sum / $denom)
|
||||
min: MIN(TCC_SRC_FIFO_FULL_sum / $denom)
|
||||
max: MAX(TCC_SRC_FIFO_FULL_sum / $denom)
|
||||
unit: (Cycles + $normUnit)
|
||||
- metric_table:
|
||||
id: 1705
|
||||
title: L2 - Fabric Interface stalls
|
||||
metrics:
|
||||
- Write - HBM Stall:
|
||||
type: HBM Stall
|
||||
transaction: Write
|
||||
avg: |
|
||||
AVG(((100 * (TCC_EA0_WRREQ_DRAM_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
min: |
|
||||
MIN(((100 * (TCC_EA0_WRREQ_DRAM_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
max: |
|
||||
MAX(((100 * (TCC_EA0_WRREQ_DRAM_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
unit: pct
|
||||
- Read - HBM Stall:
|
||||
type: HBM Stall
|
||||
transaction: Read
|
||||
avg: |
|
||||
AVG(((100 * (TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
min: |
|
||||
MIN(((100 * (TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
max: |
|
||||
MAX(((100 * (TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
unit: pct
|
||||
- Write - PCIe Stall:
|
||||
type: PCIe Stall
|
||||
transaction: Write
|
||||
avg: |
|
||||
AVG(((100 * (TCC_EA0_WRREQ_IO_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
min: |
|
||||
MIN(((100 * (TCC_EA0_WRREQ_IO_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
max: |
|
||||
MAX(((100 * (TCC_EA0_WRREQ_IO_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
unit: pct
|
||||
- Write - Infinity Fabric Stall:
|
||||
type: Infinity Fabric™ Stall
|
||||
transaction: Write
|
||||
avg: |
|
||||
AVG(((100 * (TCC_EA0_WRREQ_GMI_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
min: |
|
||||
MIN(((100 * (TCC_EA0_WRREQ_GMI_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
max: |
|
||||
MAX(((100 * (TCC_EA0_WRREQ_GMI_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
unit: pct
|
||||
- Read - Infinity Fabric Stall:
|
||||
type: Infinity Fabric™ Stall
|
||||
transaction: Read
|
||||
avg: |
|
||||
AVG(((100 * (TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
min: |
|
||||
MIN(((100 * (TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
max: |
|
||||
MAX(((100 * (TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
unit: pct
|
||||
- Read - PCIe Stall:
|
||||
type: PCIe Stall
|
||||
transaction: Read
|
||||
avg: |
|
||||
AVG(((100 * (TCC_EA0_RDREQ_IO_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
min: |
|
||||
MIN(((100 * (TCC_EA0_RDREQ_IO_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
max: |
|
||||
MAX(((100 * (TCC_EA0_RDREQ_IO_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
unit: pct
|
||||
- metric_table:
|
||||
id: 1706
|
||||
title: L2 - Fabric interface detailed metrics
|
||||
metrics:
|
||||
- Read Bandwidth - PCIe:
|
||||
avg: AVG(TCC_EA0_RDREQ_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_RDREQ_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_RDREQ_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
- Write Bandwidth - Infinity Fabric™:
|
||||
avg: AVG(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
- Atomic Bandwidth - HBM:
|
||||
avg: AVG(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
- Atomic - HBM:
|
||||
avg: AVG((TCC_EA0_WRREQ_ATOMIC_DRAM_sum / $denom))
|
||||
min: MIN((TCC_EA0_WRREQ_ATOMIC_DRAM_sum / $denom))
|
||||
max: MAX((TCC_EA0_WRREQ_ATOMIC_DRAM_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
- Read Bandwidth - HBM:
|
||||
avg: AVG(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
- Atomic Bandwidth - Infinity Fabric™:
|
||||
avg: AVG(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
- Write Bandwidth - HBM:
|
||||
avg: AVG(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
- Atomic Bandwidth - PCIe:
|
||||
avg: AVG(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
- Read (128B):
|
||||
avg: AVG((TCC_EA0_RDREQ_128B_sum / $denom))
|
||||
min: MIN((TCC_EA0_RDREQ_128B_sum / $denom))
|
||||
max: MAX((TCC_EA0_RDREQ_128B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
- Read Bandwidth - Infinity Fabric™:
|
||||
avg: AVG(TCC_EA0_RDREQ_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_RDREQ_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_RDREQ_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
- Write Bandwidth - PCIe:
|
||||
avg: AVG(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
|
||||
Deletion:
|
||||
[]
|
||||
|
||||
Modification:
|
||||
- Panel Config:
|
||||
id: 200
|
||||
title: System Speed-of-Light
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 201
|
||||
title: System Speed-of-Light
|
||||
metrics:
|
||||
- MFMA FLOPs (F8):
|
||||
pop: |
|
||||
((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 8192) / 1000))
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 8192) / 1000)
|
||||
- MFMA FLOPs (F64):
|
||||
pop: |
|
||||
((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 128) / 1000))
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 128) / 1000)
|
||||
- MFMA IOPs (Int8):
|
||||
pop: |
|
||||
((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 8192) / 1000))
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 8192) / 1000)
|
||||
- MFMA FLOPs (F16):
|
||||
pop: |
|
||||
((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 4096) / 1000))
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
|
||||
- MFMA FLOPs (BF16):
|
||||
pop: |
|
||||
((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 4096) / 1000))
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
|
||||
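Each MFMA entry above pairs a `peak` value with a `pop` (percent of peak) value. The sketch below works through that arithmetic, assuming `$max_sclk` is in MHz and timestamps are in nanoseconds, so both quantities come out in giga-operations per second. The function names, the 256-CU device and the counter values are hypothetical.

```python
def peak_gops(max_sclk_mhz: float, cu_per_gpu: int, ops_per_cycle_per_cu: int) -> float:
    """Peak rate: (max_sclk * cu_per_gpu * ops_per_cycle_per_cu) / 1000."""
    return max_sclk_mhz * cu_per_gpu * ops_per_cycle_per_cu / 1000


def achieved_gops(mfma_mops: int, start_ts: int, end_ts: int) -> float:
    """Achieved rate: (MFMA_MOPS counter * 512) / dispatch duration."""
    return mfma_mops * 512 / (end_ts - start_ts)


def percent_of_peak(achieved: float, peak: float) -> float:
    return 100 * achieved / peak


# Hypothetical device and counters; 8192 is the F8/I8 factor used in the YAML.
peak = peak_gops(max_sclk_mhz=2000, cu_per_gpu=256, ops_per_cycle_per_cu=8192)
achieved = achieved_gops(mfma_mops=4_096_000, start_ts=0, end_ts=1000)
print(peak)                             # 4194304.0
print(achieved)                         # 2097152.0
print(percent_of_peak(achieved, peak))  # 50.0
```

The per-CU factor varies by data type in the entries above: 8192 for F8 and Int8, 4096 for F16 and BF16, and 128 for F64.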
- Panel Config:
|
||||
id: 300
|
||||
title: Memory Chart
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 301
|
||||
title: Memory Chart
|
||||
metrics:
|
||||
- Wavefronts:
|
||||
value: ROUND(AVG(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE), 0)
|
||||
- Workgroups:
|
||||
value: |
|
||||
ROUND(AVG(SPI_CS0_NUM_THREADGROUPS + SPI_CS1_NUM_THREADGROUPS + SPI_CS2_NUM_THREADGROUPS + SPI_CS3_NUM_THREADGROUPS), 0)
|
||||
- Panel Config:
|
||||
id: 400
|
||||
title: Roofline
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 402
|
||||
title: Roofline Plot Points
|
||||
metrics:
|
||||
- Performance (GFLOPs):
|
||||
value: |
|
||||
( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9
|
||||
- AI L2:
|
||||
value: |
|
||||
( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / SUM( (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64 ) )
|
||||
- AI L1:
|
||||
value: |
|
||||
( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / SUM(TCP_TOTAL_CACHE_ACCESSES_sum * 64) )
|
||||
- AI HBM:
|
||||
value: |
|
||||
( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / SUM( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum * 64) ) )
|
||||
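The four roofline expressions share one numerator (total FLOPs) and differ only in the byte count used as the denominator. As a rough illustration, the sketch below rebuilds the numerator and the HBM denominator for a single set of counters; the YAML sums over dispatches, which is omitted here. The function names and all counter values are hypothetical.

```python
VALU_TYPES = ("F16", "F32", "F64")
MFMA_TYPES = ("F16", "BF16", "F32", "F64", "F8", "F6F4")


def total_flops(c: dict, wave_size: int) -> int:
    """wave_size * (ADD + MUL + 2*FMA + TRANS) per VALU type, plus 512 per MFMA MOP."""
    valu = sum(
        c.get(f"SQ_INSTS_VALU_ADD_{t}", 0)
        + c.get(f"SQ_INSTS_VALU_MUL_{t}", 0)
        + 2 * c.get(f"SQ_INSTS_VALU_FMA_{t}", 0)
        + c.get(f"SQ_INSTS_VALU_TRANS_{t}", 0)
        for t in VALU_TYPES
    )
    mfma = sum(c.get(f"SQ_INSTS_VALU_MFMA_MOPS_{t}", 0) for t in MFMA_TYPES)
    return wave_size * valu + 512 * mfma


def hbm_bytes(c: dict) -> int:
    """Byte count used as the denominator of the AI HBM expression."""
    bubble = c["TCC_BUBBLE_sum"]
    rd, rd32 = c["TCC_EA0_RDREQ_sum"], c["TCC_EA0_RDREQ_32B_sum"]
    wr, wr64 = c["TCC_EA0_WRREQ_sum"], c["TCC_EA0_WRREQ_64B_sum"]
    return (bubble * 128 + rd32 * 32 + (rd - bubble - rd32) * 64
            + (wr - wr64) * 32 + wr64 * 64)


# Hypothetical counter values, for illustration only.
counters = {
    "SQ_INSTS_VALU_FMA_F32": 100,
    "SQ_INSTS_VALU_MFMA_MOPS_F16": 10,
    "TCC_BUBBLE_sum": 10,
    "TCC_EA0_RDREQ_sum": 100,
    "TCC_EA0_RDREQ_32B_sum": 20,
    "TCC_EA0_WRREQ_sum": 50,
    "TCC_EA0_WRREQ_64B_sum": 30,
}
flops = total_flops(counters, wave_size=64)
print(flops)                        # 17920
print(hbm_bytes(counters))          # 8960
print(flops / hbm_bytes(counters))  # 2.0
```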
- Panel Config:
|
||||
id: 600
|
||||
title: Workgroup Manager (SPI)
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 601
|
||||
title: Workgroup manager utilizations
|
||||
metrics:
|
||||
- SGPR Writes:
|
||||
max: |
|
||||
MAX((((1 * SPI_SWC_CSC_WR) / (SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)) if ((SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE) != 0) else None))
|
||||
min: |
|
||||
MIN((((1 * SPI_SWC_CSC_WR) / (SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)) if ((SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE) != 0) else None))
|
||||
avg: |
|
||||
AVG((((1 * SPI_SWC_CSC_WR) / (SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)) if ((SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE) != 0) else None))
|
||||
- Dispatched Wavefronts:
|
||||
max: MAX(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)
|
||||
min: MIN(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)
|
||||
avg: AVG(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)
|
||||
- Dispatched Workgroups:
|
||||
max: |
|
||||
MAX(SPI_CS0_NUM_THREADGROUPS + SPI_CS1_NUM_THREADGROUPS + SPI_CS2_NUM_THREADGROUPS + SPI_CS3_NUM_THREADGROUPS)
|
||||
min: |
|
||||
MIN(SPI_CS0_NUM_THREADGROUPS + SPI_CS1_NUM_THREADGROUPS + SPI_CS2_NUM_THREADGROUPS + SPI_CS3_NUM_THREADGROUPS)
|
||||
avg: |
|
||||
AVG(SPI_CS0_NUM_THREADGROUPS + SPI_CS1_NUM_THREADGROUPS + SPI_CS2_NUM_THREADGROUPS + SPI_CS3_NUM_THREADGROUPS)
|
||||
- Scheduler-Pipe Utilization:
|
||||
max: |
|
||||
MAX(100 * (SPI_CS0_BUSY + SPI_CS1_BUSY + SPI_CS2_BUSY + SPI_CS3_BUSY) / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
|
||||
min: |
|
||||
MIN(100 * (SPI_CS0_BUSY + SPI_CS1_BUSY + SPI_CS2_BUSY + SPI_CS3_BUSY) / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
|
||||
avg: |
|
||||
AVG(100 * (SPI_CS0_BUSY + SPI_CS1_BUSY + SPI_CS2_BUSY + SPI_CS3_BUSY) / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
|
||||
- VGPR Writes:
|
||||
max: |
|
||||
MAX((((SPI_VWC0_VDATA_VALID_WR + SPI_VWC1_VDATA_VALID_WR) / (SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)) if ((SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE) != 0) else None))
|
||||
min: |
|
||||
MIN((((SPI_VWC0_VDATA_VALID_WR + SPI_VWC1_VDATA_VALID_WR) / (SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)) if ((SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE) != 0) else None))
|
||||
avg: |
|
||||
AVG((((SPI_VWC0_VDATA_VALID_WR + SPI_VWC1_VDATA_VALID_WR) / (SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)) if ((SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE) != 0) else None))
|
||||
- Panel Config:
|
||||
id: 700
|
||||
title: Wavefront
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 701
|
||||
title: Wavefront Launch Stats
|
||||
metrics:
|
||||
- Total Wavefronts:
|
||||
max: MAX(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)
|
||||
min: MIN(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)
|
||||
avg: AVG(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)
|
||||
- Panel Config:
|
||||
id: 1100
|
||||
title: Compute Units - Compute Pipeline
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 1101
|
||||
title: Compute Speed-of-Light
|
||||
metrics:
|
||||
- MFMA FLOPs (F16):
|
||||
pop: |
|
||||
((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 4096) / 1000))
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
|
||||
- MFMA FLOPs (F64):
|
||||
pop: |
|
||||
((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 128) / 1000))
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 128) / 1000)
|
||||
- MFMA IOPs (INT8):
|
||||
pop: |
|
||||
((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 8192) / 1000))
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 8192) / 1000)
|
||||
- MFMA FLOPs (BF16):
|
||||
pop: |
|
||||
((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 4096) / 1000))
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
|
||||
- MFMA FLOPs (F8):
|
||||
pop: |
|
||||
((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 8192) / 1000))
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 8192) / 1000)
|
||||
- metric_table:
|
||||
id: 1103
|
||||
title: Arithmetic Operations
|
||||
metrics:
|
||||
- FLOPs (Total):
|
||||
max: |
|
||||
MAX((((((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_FMA_F16 * 2))) + ((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F16) + (512 * SQ_INSTS_VALU_MFMA_MOPS_BF16))) + (64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_FMA_F32 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32)) + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64) + (SQ_INSTS_VALU_FMA_F64 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F6F4)) / $denom))
|
||||
min: |
|
||||
MIN((((((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_FMA_F16 * 2))) + ((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F16) + (512 * SQ_INSTS_VALU_MFMA_MOPS_BF16))) + (64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_FMA_F32 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32)) + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64) + (SQ_INSTS_VALU_FMA_F64 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F6F4)) / $denom))
|
||||
avg: |
|
||||
AVG((((((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_FMA_F16 * 2))) + ((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F16) + (512 * SQ_INSTS_VALU_MFMA_MOPS_BF16))) + (64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_FMA_F32 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32)) + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64) + (SQ_INSTS_VALU_FMA_F64 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F6F4)) / $denom))
|
||||
- Panel Config:
|
||||
id: 1700
|
||||
title: L2 Cache
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 1701
|
||||
title: L2 Speed-of-Light
|
||||
metrics:
|
||||
- L2-Fabric Read BW:
|
||||
value: |
|
||||
AVG((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) + (TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp)))
|
||||
- metric_table:
|
||||
id: 1702
|
||||
title: L2-Fabric interface metrics
|
||||
metrics:
|
||||
- Read BW:
|
||||
max: |
|
||||
MAX((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) + (TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp)))
|
||||
min: |
|
||||
MIN((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) + (TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp)))
|
||||
avg: |
|
||||
AVG((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) + (TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp)))
|
||||
- metric_table:
|
||||
id: 1706
|
||||
title: L2 - Fabric interface detailed metrics
|
||||
metrics:
|
||||
- Read (64B):
|
||||
max: MAX((TCC_EA0_RDREQ_64B_sum / $denom))
|
||||
min: MIN((TCC_EA0_RDREQ_64B_sum / $denom))
|
||||
avg: AVG((TCC_EA0_RDREQ_64B_sum / $denom))
|
||||
- HBM Write and Atomic:
|
||||
max: MAX((TCC_EA0_WRREQ_WRITE_DRAM_sum / $denom))
|
||||
min: MIN((TCC_EA0_WRREQ_WRITE_DRAM_sum / $denom))
|
||||
avg: AVG((TCC_EA0_WRREQ_WRITE_DRAM_sum / $denom))
|
||||
- Panel Config:
|
||||
id: 1800
|
||||
title: L2 Cache (per Channel)
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 1809
|
||||
title: L2-Fabric Read Stall (Cycles per normUnit)
|
||||
metrics:
|
||||
- ::_1:
|
||||
ea read stall - pcie: AVG((TO_INT(TCC_EA0_RDREQ_IO_CREDIT_STALL[::_1]) / $denom))
|
||||
ea read stall - hbm: AVG((TO_INT(TCC_EA0_RDREQ_DRAM_CREDIT_STALL[::_1]) / $denom))
|
||||
ea read stall - if: AVG((TO_INT(TCC_EA0_RDREQ_GMI_CREDIT_STALL[::_1]) / $denom))
|
||||
- metric_table:
|
||||
id: 1810
|
||||
title: L2-Fabric Write and Atomic Stall (Cycles per normUnit)
|
||||
metrics:
|
||||
- ::_1:
|
||||
ea write stall - hbm: AVG((TO_INT(TCC_EA0_WRREQ_DRAM_CREDIT_STALL[::_1]) / $denom))
|
||||
ea write stall - pcie: AVG((TO_INT(TCC_EA0_WRREQ_IO_CREDIT_STALL[::_1]) / $denom))
|
||||
ea write stall - if: AVG((TO_INT(TCC_EA0_WRREQ_GMI_CREDIT_STALL[::_1]) / $denom))
|
||||
+1
-1
@@ -2,7 +2,6 @@
|
||||
Panel Config:
|
||||
id: 0
|
||||
title: Top Stats
|
||||
metrics_description: {}
|
||||
data source:
|
||||
- raw_csv_table:
|
||||
id: 1
|
||||
@@ -12,3 +11,4 @@ Panel Config:
|
||||
id: 2
|
||||
title: Dispatch List
|
||||
source: pmc_dispatch_info.csv
|
||||
metrics_description: {}
|
||||
|
||||
+1
-1
@@ -2,10 +2,10 @@
|
||||
Panel Config:
|
||||
id: 100
|
||||
title: System Info
|
||||
metrics_description: {}
|
||||
data source:
|
||||
- raw_csv_table:
|
||||
id: 101
|
||||
title: System Info
|
||||
source: sysinfo.csv
|
||||
columnwise: true
|
||||
metrics_description: {}
|
||||
|
||||
+127
-118
@@ -2,124 +2,6 @@
|
||||
Panel Config:
|
||||
id: 200
|
||||
title: System Speed-of-Light
|
||||
metrics_description:
|
||||
VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
|
||||
This is also presented as a percent of the peak theoretical FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any floating-point
|
||||
operations from MFMA instructions.'
|
||||
VALU IOPs: 'The total integer operations executed per second on the VALU. This
|
||||
is also presented as a percent of the peak theoretical IOPs achievable on the
|
||||
specific accelerator. Note: this does not include any integer operations from
|
||||
MFMA instructions.'
|
||||
MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
|
||||
executed per second. This does not include any 8-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F8 MFMA operations achievable on the specific accelerator. It is supported on
|
||||
AMD Instinct MI300 series and later only.
|
||||
MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
|
||||
executed per second. Note: this does not include any 16-bit brain floating point
|
||||
operations from VALU instructions. This is also presented as a percent of the
|
||||
peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F16 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 32-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F32 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 64-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F64 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
|
||||
per second. Note: this does not include any 8-bit integer operations from VALU
|
||||
instructions. This is also presented as a percent of the peak theoretical INT8
|
||||
MFMA operations achievable on the specific accelerator.'
|
||||
Active CUs: Total number of active compute units (CUs) on the accelerator during
|
||||
the kernel execution.
|
||||
SALU Utilization: Indicates what percent of the kernel's duration the SALU was
|
||||
busy executing instructions. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
|
||||
VALU Utilization: Indicates what percent of the kernel's duration the VALU was
|
||||
busy executing instructions. Does not include VMEM operations. Computed as the
|
||||
ratio of the total number of cycles spent by the scheduler issuing VALU instructions
|
||||
over the total CU cycles.
|
||||
MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
|
||||
was busy executing instructions. Computed as the ratio of the total number of
|
||||
cycles the MFMA was busy over the total CU cycles.
|
||||
VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
|
||||
was busy executing instructions, including both global/generic and spill/scratch
|
||||
operations (see the VMEM instruction count metrics for more detail). Does not
|
||||
include VALU operations. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing VMEM instructions over the total CU cycles.
|
||||
Branch Utilization: Indicates what percent of the kernel's duration the branch
|
||||
unit was busy executing instructions. Computed as the ratio of the total number
|
||||
of cycles spent by the scheduler issuing branch instructions over the total
|
||||
CU cycles.
|
||||
VALU Active Threads: Indicates the average level of divergence within a wavefront
|
||||
over the lifetime of the kernel. The number of work-items that were active in
|
||||
a wavefront during execution of each VALU instruction, time-averaged over all
|
||||
VALU instructions run on all wavefronts in the kernel.
|
||||
IPC: The ratio of the total number of instructions executed on the CU over the
|
||||
total active CU cycles. This is also presented as a percent of the peak theoretical
|
||||
IPC achievable on the specific accelerator.
|
||||
Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
|
||||
over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
||||
kernels (less than 1ms). This is also presented as a percent of the peak theoretical
|
||||
occupancy achievable on the specific accelerator.'
|
||||
Theoretical LDS Bandwidth: Indicates the maximum amount of bytes that could have
|
||||
been loaded from, stored to, or atomically updated in the LDS per unit time
|
||||
(see LDS Bandwidth example for more detail). This is also presented as a percent
|
||||
of the peak theoretical LDS bandwidth achievable on the specific accelerator.
|
||||
LDS Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS
|
||||
scheduler due to bank conflicts (as determined by the conflict resolution hardware)
|
||||
to the base number of cycles that would be spent in the LDS scheduler in a completely
|
||||
uncontended case. This is also presented in normalized form (i.e., the Bank
|
||||
Conflict Rate).
|
||||
vL1D Cache Hit Rate: The ratio of the number of vL1D cache line requests that
|
||||
hit in vL1D cache over the total number of cache line requests to the vL1D cache
|
||||
RAM.
|
||||
vL1D Cache BW: The number of bytes looked up in the vL1D cache as a result of
|
||||
VMEM instructions per unit time. The number of bytes is calculated as the number
|
||||
of cache lines requested multiplied by the cache line size. This value does
|
||||
not consider partial requests, so e.g., if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
This is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L2 Cache Hit Rate: The ratio of the number of L2 cache line requests that hit
|
||||
in the L2 cache over the total number of incoming cache line requests to the
|
||||
L2 cache.
|
||||
L2 Cache BW: The number of bytes looked up in the L2 cache per unit time. The
|
||||
number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line. This is also presented as a percent of
|
||||
the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Read BW: "The number of bytes read by the L2 over the Infinity Fabric\u2122\
|
||||
\ interface per unit time. This is also presented as a percent of the peak theoretical\
|
||||
\ bandwidth achievable on the specific accelerator."
|
||||
L2-Fabric Write BW: The number of bytes sent by the L2 over the Infinity Fabric
|
||||
interface by write and atomic operations per unit time. This is also presented
|
||||
as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Read Latency: The time-averaged number of cycles read requests spent
|
||||
in Infinity Fabric before data was returned to the L2.
|
||||
L2-Fabric Write Latency: The time-averaged number of cycles write requests spent
|
||||
in Infinity Fabric before a completion acknowledgement was returned to the L2.
|
||||
sL1D Cache Hit Rate: The percent of sL1D requests that hit on a previously loaded
|
||||
line in the cache. Calculated as the ratio of the number of sL1D requests that
|
||||
hit over the number of all sL1D requests.
|
||||
sL1D Cache BW: The number of bytes looked up in the sL1D cache per unit time.
|
||||
This is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L1I Hit Rate: The percent of L1I requests that hit on a previously loaded line in
the cache. Calculated as the ratio of the number of L1I requests that hit over
the number of all L1I requests.
L1I BW: The number of bytes looked up in the L1I cache per unit time. This is also
presented as a percent of the peak theoretical bandwidth achievable on the specific
accelerator.
|
||||
L1I Fetch Latency: The average number of cycles spent to fetch instructions to
|
||||
a CU.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 201
|
||||
@@ -344,3 +226,130 @@ Panel Config:
|
||||
peak: None
|
||||
pop: None
|
||||
coll_level: SQ_IFETCH_LEVEL
|
||||
metrics_description:
|
||||
VALU FLOPs: |-
|
||||
The total floating-point operations executed per second on the VALU.
|
||||
This is also presented as a percent of the peak theoretical FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any floating-point
|
||||
operations from MFMA instructions.
|
||||
VALU IOPs: |-
|
||||
The total integer operations executed per second on the VALU. This is
|
||||
also presented as a percent of the peak theoretical IOPs achievable on the
|
||||
specific accelerator. Note: this does not include any integer operations from
|
||||
MFMA instructions.
|
||||
MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
executed per second. This does not include any 8-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F8 MFMA operations achievable on the specific accelerator. It is supported on
|
||||
AMD Instinct MI300 series and later only.
|
||||
MFMA FLOPs (BF16): |-
|
||||
The total number of 16-bit brain floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit brain floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
BF16 MFMA operations achievable on the specific accelerator.
|
||||
MFMA FLOPs (F16): |-
|
||||
The total number of 16-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 16-bit floating point operations from
|
||||
VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F16 MFMA operations achievable on the specific accelerator.
|
||||
MFMA FLOPs (F32): |-
|
||||
The total number of 32-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 32-bit floating point operations from
|
||||
VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F32 MFMA operations achievable on the specific accelerator.
|
||||
MFMA FLOPs (F64): |-
|
||||
The total number of 64-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 64-bit floating point operations from
|
||||
VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F64 MFMA operations achievable on the specific accelerator.
|
||||
MFMA IOPs (Int8): |-
|
||||
The total number of 8-bit integer MFMA operations executed per second.
|
||||
Note: this does not include any 8-bit integer operations from VALU instructions.
|
||||
This is also presented as a percent of the peak theoretical INT8 MFMA operations
|
||||
achievable on the specific accelerator.
|
||||
Active CUs: Total number of active compute units (CUs) on the accelerator during
|
||||
the kernel execution.
|
||||
SALU Utilization: Indicates what percent of the kernel's duration the SALU was
|
||||
busy executing instructions. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
|
||||
VALU Utilization: Indicates what percent of the kernel's duration the VALU was
|
||||
busy executing instructions. Does not include VMEM operations. Computed as the
|
||||
ratio of the total number of cycles spent by the scheduler issuing VALU instructions
|
||||
over the total CU cycles.
|
||||
MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
|
||||
was busy executing instructions. Computed as the ratio of the total number of
|
||||
cycles the MFMA was busy over the total CU cycles.
|
||||
VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
|
||||
was busy executing instructions, including both global/generic and spill/scratch
|
||||
operations (see the VMEM instruction count metrics for more detail). Does not
|
||||
include VALU operations. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing VMEM instructions over the total CU cycles.
|
||||
Branch Utilization: Indicates what percent of the kernel's duration the branch
|
||||
unit was busy executing instructions. Computed as the ratio of the total number
|
||||
of cycles spent by the scheduler issuing branch instructions over the total
|
||||
CU cycles.
|
||||
VALU Active Threads: Indicates the average level of divergence within a wavefront
|
||||
over the lifetime of the kernel. The number of work-items that were active in
|
||||
a wavefront during execution of each VALU instruction, time-averaged over all
|
||||
VALU instructions run on all wavefronts in the kernel.
|
||||
IPC: The ratio of the total number of instructions executed on the CU over the
|
||||
total active CU cycles. This is also presented as a percent of the peak theoretical
|
||||
IPC achievable on the specific accelerator.
|
||||
Wavefront Occupancy: |-
|
||||
The time-averaged number of wavefronts resident on the accelerator over
|
||||
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
||||
kernels (less than 1ms). This is also presented as a percent of the peak theoretical
|
||||
occupancy achievable on the specific accelerator.
|
||||
Theoretical LDS Bandwidth: Indicates the maximum amount of bytes that could have
|
||||
been loaded from, stored to, or atomically updated in the LDS per unit time
|
||||
(see LDS Bandwidth example for more detail). This is also presented as a percent
|
||||
of the peak theoretical LDS bandwidth achievable on the specific accelerator.
|
||||
LDS Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS
|
||||
scheduler due to bank conflicts (as determined by the conflict resolution hardware)
|
||||
to the base number of cycles that would be spent in the LDS scheduler in a completely
|
||||
uncontended case. This is also presented in normalized form (i.e., the Bank
|
||||
Conflict Rate).
|
||||
vL1D Cache Hit Rate: The ratio of the number of vL1D cache line requests that
|
||||
hit in vL1D cache over the total number of cache line requests to the vL1D cache
|
||||
RAM.
|
||||
vL1D Cache BW: The number of bytes looked up in the vL1D cache as a result of
|
||||
VMEM instructions per unit time. The number of bytes is calculated as the number
|
||||
of cache lines requested multiplied by the cache line size. This value does
|
||||
not consider partial requests, so e.g., if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
This is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L2 Cache Hit Rate: The ratio of the number of L2 cache line requests that hit
|
||||
in the L2 cache over the total number of incoming cache line requests to the
|
||||
L2 cache.
|
||||
L2 Cache BW: The number of bytes looked up in the L2 cache per unit time. The
|
||||
number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line. This is also presented as a percent of
|
||||
the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Read BW: |-
|
||||
The number of bytes read by the L2 over the Infinity Fabric™ interface
|
||||
per unit time. This is also presented as a percent of the peak theoretical
|
||||
bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Write BW: The number of bytes sent by the L2 over the Infinity Fabric
|
||||
interface by write and atomic operations per unit time. This is also presented
|
||||
as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Read Latency: The time-averaged number of cycles read requests spent
|
||||
in Infinity Fabric before data was returned to the L2.
|
||||
L2-Fabric Write Latency: The time-averaged number of cycles write requests spent
|
||||
in Infinity Fabric before a completion acknowledgement was returned to the L2.
|
||||
sL1D Cache Hit Rate: The percent of sL1D requests that hit on a previously loaded
|
||||
line in the cache. Calculated as the ratio of the number of sL1D requests that
|
||||
hit over the number of all sL1D requests.
|
||||
sL1D Cache BW: The number of bytes looked up in the sL1D cache per unit time.
|
||||
This is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L1I Hit Rate: The percent of L1I requests that hit on a previously loaded line in
the cache. Calculated as the ratio of the number of L1I requests that hit over
the number of all L1I requests.
L1I BW: The number of bytes looked up in the L1I cache per unit time. This is also
presented as a percent of the peak theoretical bandwidth achievable on the specific
accelerator.
|
||||
L1I Fetch Latency: The average number of cycles spent to fetch instructions to
|
||||
a CU.
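The hit-rate and bandwidth descriptions above all follow the same arithmetic: a ratio of hits to total requests, and a byte count derived from whole cache lines divided by time. The sketch below illustrates that arithmetic only. The function names, the 128-byte line size, and the peak value are hypothetical placeholders and are not taken from the profiler's source or from any specific accelerator.

```python
def hit_rate(hits: int, total_requests: int) -> float:
    """Ratio of requests that hit over all requests (0.0 if nothing was requested)."""
    if total_requests == 0:
        return 0.0
    return hits / total_requests


def cache_bandwidth(lines_requested: int, line_size_bytes: int, seconds: float) -> float:
    """Bytes looked up per second.

    Every request is counted as a full cache line, so a request for a single
    value still contributes line_size_bytes, as the descriptions state.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return lines_requested * line_size_bytes / seconds


def percent_of_peak(value: float, peak: float) -> float:
    """Express a measured value as a percent of a (hypothetical) peak."""
    if peak <= 0:
        raise ValueError("peak must be positive")
    return 100.0 * value / peak


if __name__ == "__main__":
    # Hypothetical numbers: 1000 line requests, 128-byte lines, 0.5 s kernel.
    bw = cache_bandwidth(1000, 128, 0.5)
    print(hit_rate(75, 100), bw, percent_of_peak(bw, 1_024_000.0))
```

Because partial requests are rounded up to a full line, a bandwidth computed this way can overstate the bytes a kernel actually consumed.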
|
||||
|
||||
+117
-119
@@ -2,122 +2,6 @@
|
||||
Panel Config:
|
||||
id: 300
|
||||
title: Memory Chart
|
||||
metrics_description:
|
||||
Wavefront Occupancy: Wavefronts per active CU.
|
||||
Wave Life: Average number of cycles executing a wave.
|
||||
SALU: Total number of SALU (Scalar ALU) instructions issued per normalization
|
||||
unit.
|
||||
SMEM: Total number of SMEM (Scalar Memory Read) instructions issued per normalization
|
||||
unit.
|
||||
VALU: The number of VALU (Vector ALU) instructions issued per normalization unit.
|
||||
MFMA: Total number of MFMA (Matrix-Fused-Multiply-Add) instructions issued per
|
||||
normalization unit.
|
||||
VMEM: The number of VMEM (GPU Memory) read instructions issued (including FLAT/scratch
|
||||
memory) per normalization unit.
|
||||
LDS: The total number of LDS instructions (including, but not limited to, read/write/atomics
|
||||
and HIP's __shfl instructions) executed per normalization unit.
|
||||
GWS: Total number of GDS (global data sync) instructions issued per normalization
|
||||
unit.
|
||||
BR: Total number of BRANCH instructions issued per normalization unit.
|
||||
Active CUs: Total number of active compute units (CUs) on the accelerator during
|
||||
the kernel execution.
|
||||
Num CUs: Total number of compute units (CUs) on the accelerator.
|
||||
VGPR: 'The number of architected vector general-purpose registers allocated for
|
||||
the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
|
||||
by the compiler due to allocation granularity.'
|
||||
SGPR: 'The number of scalar general-purpose registers allocated for the kernel,
|
||||
see SALU. Note: this may not exactly match the number of SGPRs requested by
|
||||
the compiler due to allocation granularity.'
|
||||
LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
|
||||
for this kernel. Note: This may also be larger than what was requested at compile
|
||||
time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
|
||||
Scratch Allocation: The number of bytes of scratch memory requested per work-item
|
||||
for this kernel. Scratch memory is used for stack memory on the accelerator,
|
||||
as well as for register spills and restores.
|
||||
Wavefronts: The total number of wavefronts, summed over all workgroups, forming
|
||||
this kernel launch.
|
||||
Workgroups: The total number of workgroups forming this kernel launch.
|
||||
LDS Req: The total number of LDS instructions (including, but not limited to,
|
||||
read/write/atomics and HIP's __shfl instructions) executed per normalization
|
||||
unit.
|
||||
LDS Util: Indicates what percent of the kernel's duration the LDS was actively
|
||||
executing instructions (including, but not limited to, load, store, atomic and
|
||||
HIP's __shfl operations). Calculated as the ratio of the total number of cycles
|
||||
LDS was active over the total CU cycles.
|
||||
LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
|
||||
/ acknowledgment) required for an LDS instruction to complete.
|
||||
VL1 Rd: The total number of incoming read requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
VL1 Wr: The total number of incoming write requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
VL1 Atomic: The total number of incoming atomic requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
VL1 Hit: The ratio of the number of vL1D cache line requests that hit in vL1D
|
||||
cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
VL1 Lat: Calculated as the average number of cycles that a vL1D cache line request
|
||||
spent in the vL1D cache pipeline.
|
||||
VL1 Coalesce: Indicates how well memory instructions were coalesced by the address
|
||||
processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
|
||||
as the average number of thread-requests generated per instruction divided by
|
||||
the ideal number of thread-requests per instruction.
|
||||
VL1 Stall: The ratio of the number of cycles where the vL1D is stalled waiting
|
||||
to issue a request for data to the L2 cache divided by the number of cycles
|
||||
where the vL1D is active.
|
||||
VL1_L2 Rd: The number of read requests for a vL1D cache line that were not satisfied
|
||||
by the vL1D and must be retrieved from the L2 cache, per normalization
|
||||
unit.
|
||||
VL1_L2 Wr: The number of write requests to a vL1D cache line that were sent through
|
||||
the vL1D to the L2 cache, per normalization unit.
|
||||
VL1_L2 Atomic: The number of atomic requests that are sent through the vL1D to
|
||||
the L2 cache, per normalization unit. This includes requests for atomics with
|
||||
and without return.
|
||||
sL1D Rd: The total number of requests, of any size or type, made to the sL1D per
|
||||
normalization unit.
|
||||
sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
|
||||
line, per normalization unit.
|
||||
sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
|
||||
unit.
|
||||
sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
sL1D_L2 Atomic: The total number of atomic requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
IL1 Fetch: The total number of requests made to the L1I per normalization-unit.
|
||||
IL1 Hit: The percent of L1I requests that hit on a previously loaded line in the
|
||||
cache. Calculated as the ratio of the number of L1I requests that hit over the
|
||||
number of all L1I requests.
|
||||
IL1 Lat: The average number of cycles spent to fetch instructions to a CU.
|
||||
IL1_L2 Rd: The total number of requests across the L1I - L2 interface per normalization-unit.
|
||||
L2 Rd: The total number of read requests to the L2 from all clients.
|
||||
L2 Wr: The total number of write requests to the L2 from all clients.
|
||||
L2 Atomic: The total number of atomic requests (with and without return) to the
|
||||
L2 from all clients.
|
||||
L2 Hit: The ratio of the number of L2 cache line requests that hit in the L2 cache
|
||||
over the total number of incoming cache line requests to the L2 cache.
|
||||
L2 Rd Lat: Calculated as the average number of cycles that the vL1D cache took
|
||||
to issue and receive read requests from the L2 Cache. This number also includes
|
||||
requests for atomics with return values.
|
||||
L2 Wr Lat: Calculated as the average number of cycles that the vL1D cache took
|
||||
to issue and receive acknowledgement of a write request to the L2 Cache. This
|
||||
number also includes requests for atomics without return values.
|
||||
Fabric_L2 Rd: Number of L2 cache - Infinity Fabric read requests (either 32-byte
|
||||
or 64-byte) summed over TCC instances per normalization unit.
|
||||
Fabric_L2 Wr: Number of L2 cache - Infinity Fabric write requests (either 32-byte
|
||||
or 64-byte) summed over TCC instances per normalization unit.
|
||||
Fabric_L2 Atomic: Number of L2 cache - Infinity Fabric write requests (either
|
||||
32-byte or 64-byte) that are actually atomic requests summed over TCC instances
|
||||
per normalization unit.
|
||||
Fabric Rd Lat: The time-averaged number of cycles read requests spent in Infinity
|
||||
Fabric before data was returned to the L2.
|
||||
Fabric Wr Lat: The time-averaged number of cycles write requests spent in Infinity
|
||||
Fabric before a completion acknowledgement was returned to the L2.
|
||||
Fabric Atomic Lat: The time-averaged number of cycles atomic requests spent in
|
||||
Infinity Fabric before a completion acknowledgement (atomic without return value)
|
||||
or data (atomic with return value) was returned to the L2.
|
||||
HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
|
||||
of data from the accelerator's local HBM, per normalization unit.
|
||||
HBM Wr: 'The total number of L2 requests to Infinity Fabric to write or atomically
|
||||
update 32B or 64B of data in the accelerator''s local HBM, per normalization
|
||||
unit. '
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 301
|
||||
@@ -244,13 +128,13 @@ Panel Config:
|
||||
value: ROUND(AVG((TCC_EA0_ATOMIC_sum / $denom)), 0)
|
||||
Fabric Rd Lat:
|
||||
value: ROUND(AVG(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else 0)), 0)
|
||||
!= 0) else 0)), 0)
|
||||
Fabric Wr Lat:
|
||||
value: ROUND(AVG(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else 0)), 0)
|
||||
!= 0) else 0)), 0)
|
||||
Fabric Atomic Lat:
|
||||
value: ROUND(AVG(((TCC_EA0_ATOMIC_LEVEL_sum / TCC_EA0_ATOMIC_sum) if (TCC_EA0_ATOMIC_sum
|
||||
!= 0) else 0)), 0)
|
||||
!= 0) else 0)), 0)
|
||||
HBM Rd:
|
||||
value: ROUND(AVG((TCC_EA0_RDREQ_DRAM_sum / $denom)), 0)
|
||||
HBM Wr:
|
||||
@@ -258,3 +142,117 @@ Panel Config:
|
||||
comparable: false
|
||||
cli_style: mem_chart
|
||||
tui_style: mem_chart
|
||||
metrics_description:
|
||||
Wavefront Occupancy: Wavefronts per active CU.
|
||||
Wave Life: Average number of cycles executing a wave.
|
||||
SALU: Total number of SALU (Scalar ALU) instructions issued per normalization
|
||||
unit.
|
||||
SMEM: Total number of SMEM (Scalar Memory Read) instructions issued per normalization
|
||||
unit.
|
||||
VALU: The number of VALU (Vector ALU) instructions issued per normalization unit.
|
||||
MFMA: Total number of MFMA (Matrix-Fused-Multiply-Add) instructions issued per
|
||||
normalization unit.
|
||||
VMEM: The number of VMEM (GPU Memory) read instructions issued (including FLAT/scratch
|
||||
memory) per normalization unit.
|
||||
LDS: The total number of LDS instructions (including, but not limited to, read/write/atomics
|
||||
and HIP's __shfl instructions) executed per normalization unit.
|
||||
GWS: Total number of GDS (global data sync) instructions issued per normalization
|
||||
unit.
|
||||
BR: Total number of BRANCH instructions issued per normalization unit.
|
||||
Active CUs: Total number of active compute units (CUs) on the accelerator during
|
||||
the kernel execution.
|
||||
Num CUs: Total number of compute units (CUs) on the accelerator.
|
||||
VGPR: |-
|
||||
The number of architected vector general-purpose registers allocated
|
||||
for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
|
||||
requested by the compiler due to allocation granularity.
|
||||
SGPR: |-
|
||||
The number of scalar general-purpose registers allocated for the kernel,
|
||||
see SALU. Note: this may not exactly match the number of SGPRs requested by
|
||||
the compiler due to allocation granularity.
|
||||
LDS Allocation: |-
|
||||
The number of bytes of LDS memory (or, shared memory) allocated for
|
||||
this kernel. Note: This may also be larger than what was requested at compile
|
||||
time due to both allocation granularity and dynamic per-dispatch LDS allocations.
|
||||
Scratch Allocation: The number of bytes of scratch memory requested per work-item
|
||||
for this kernel. Scratch memory is used for stack memory on the accelerator,
|
||||
as well as for register spills and restores.
|
||||
Wavefronts: The total number of wavefronts, summed over all workgroups, forming
|
||||
this kernel launch.
|
||||
Workgroups: The total number of workgroups forming this kernel launch.
|
||||
LDS Req: The total number of LDS instructions (including, but not limited to,
|
||||
read/write/atomics and HIP's __shfl instructions) executed per normalization
|
||||
unit.
|
||||
LDS Util: Indicates what percent of the kernel's duration the LDS was actively
|
||||
executing instructions (including, but not limited to, load, store, atomic and
|
||||
HIP's __shfl operations). Calculated as the ratio of the total number of cycles
|
||||
LDS was active over the total CU cycles.
|
||||
LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
|
||||
/ acknowledgment) required for an LDS instruction to complete.
|
||||
VL1 Rd: The total number of incoming read requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
VL1 Wr: The total number of incoming write requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
VL1 Atomic: The total number of incoming atomic requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
VL1 Hit: The ratio of the number of vL1D cache line requests that hit in vL1D
|
||||
cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
VL1 Lat: Calculated as the average number of cycles that a vL1D cache line request
|
||||
spent in the vL1D cache pipeline.
|
||||
VL1 Coalesce: Indicates how well memory instructions were coalesced by the address
|
||||
processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
|
||||
as the average number of thread-requests generated per instruction divided by
|
||||
the ideal number of thread-requests per instruction.
|
||||
VL1 Stall: The ratio of the number of cycles where the vL1D is stalled waiting
|
||||
to issue a request for data to the L2 cache divided by the number of cycles
|
||||
where the vL1D is active.
|
||||
VL1_L2 Rd: The number of read requests for a vL1D cache line that were not satisfied
|
||||
by the vL1D and must be retrieved from the L2 cache, per normalization
|
||||
unit.
|
||||
VL1_L2 Wr: The number of write requests to a vL1D cache line that were sent through
|
||||
the vL1D to the L2 cache, per normalization unit.
|
||||
VL1_L2 Atomic: The number of atomic requests that are sent through the vL1D to
|
||||
the L2 cache, per normalization unit. This includes requests for atomics with
|
||||
and without return.
|
||||
sL1D Rd: The total number of requests, of any size or type, made to the sL1D per
|
||||
normalization unit.
|
||||
sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
|
||||
line, per normalization unit.
|
||||
sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
|
||||
unit.
|
||||
sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
sL1D_L2 Atomic: The total number of atomic requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
IL1 Fetch: The total number of requests made to the L1I per normalization-unit.
|
||||
IL1 Hit: The percent of L1I requests that hit on a previously loaded line in the
|
||||
cache. Calculated as the ratio of the number of L1I requests that hit over the
|
||||
number of all L1I requests.
|
||||
IL1 Lat: The average number of cycles spent to fetch instructions to a CU.
|
||||
IL1_L2 Rd: The total number of requests across the L1I - L2 interface per normalization-unit.
|
||||
L2 Rd: The total number of read requests to the L2 from all clients.
|
||||
L2 Wr: The total number of write requests to the L2 from all clients.
|
||||
L2 Atomic: The total number of atomic requests (with and without return) to the
|
||||
L2 from all clients.
|
||||
L2 Hit: The ratio of the number of L2 cache line requests that hit in the L2 cache
|
||||
over the total number of incoming cache line requests to the L2 cache.
|
||||
Fabric_L2 Rd: Number of L2 cache - Infinity Fabric read requests (either 32-byte
|
||||
or 64-byte) summed over TCC instances per normalization unit.
|
||||
Fabric_L2 Wr: Number of L2 cache - Infinity Fabric write requests (either 32-byte
|
||||
or 64-byte) summed over TCC instances per normalization unit.
|
||||
Fabric_L2 Atomic: Number of L2 cache - Infinity Fabric write requests (either
|
||||
32-byte or 64-byte) that are actually atomic requests summed over TCC instances
|
||||
per normalization unit.
|
||||
Fabric Rd Lat: The time-averaged number of cycles read requests spent in Infinity
|
||||
Fabric before data was returned to the L2.
|
||||
Fabric Wr Lat: The time-averaged number of cycles write requests spent in Infinity
|
||||
Fabric before a completion acknowledgement was returned to the L2.
|
||||
Fabric Atomic Lat: The time-averaged number of cycles atomic requests spent in
|
||||
Infinity Fabric before a completion acknowledgement (atomic without return value)
|
||||
or data (atomic with return value) was returned to the L2.
|
||||
HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
|
||||
of data from the accelerator's local HBM, per normalization unit.
|
||||
HBM Wr: |-
|
||||
The total number of L2 requests to Infinity Fabric to write or atomically
|
||||
update 32B or 64B of data in the accelerator's local HBM, per normalization
|
||||
unit.
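The `Fabric Rd Lat`, `Fabric Wr Lat`, and `Fabric Atomic Lat` expressions in this panel share one pattern: divide a `*_LEVEL_sum` counter by the matching request counter, fall back to 0 when no requests were seen, average the result, and round to zero decimals. The sketch below mirrors that pattern in plain Python. It assumes the average is taken over per-dispatch samples, which is a guess about how the expression is evaluated, and the function names are hypothetical.

```python
from typing import Iterable, Tuple


def guarded_ratio(level_sum: float, req_sum: float) -> float:
    """level_sum / req_sum, or 0 when req_sum is 0 (mirrors the `if ... else 0` guard)."""
    return level_sum / req_sum if req_sum != 0 else 0


def fabric_latency(samples: Iterable[Tuple[float, float]]) -> int:
    """Average the guarded ratio over (level_sum, req_sum) samples and round it.

    Assumption: one sample per kernel dispatch. Python's round() rounds halves
    to the nearest even integer, which may differ from the profiler's ROUND.
    """
    ratios = [guarded_ratio(level, req) for level, req in samples]
    if not ratios:
        return 0
    return round(sum(ratios) / len(ratios))


if __name__ == "__main__":
    # Hypothetical counter values; the middle sample saw no read requests.
    print(fabric_latency([(1000, 10), (0, 0), (600, 3)]))
```

Note that a sample with zero requests still counts toward the average as a 0, so dispatches without fabric traffic pull the reported latency down.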
|
||||
|
||||
+88
-79
@@ -2,85 +2,6 @@
|
||||
Panel Config:
|
||||
id: 400
|
||||
title: Roofline
|
||||
metrics_description:
|
||||
VALU FLOPs (F16): 'The total 16-bit floating-point operations executed per second
|
||||
on the VALU. This is presented with the value of the peak empirical F16 FLOPs
|
||||
achievable on the specific accelerator. Note: this does not include any F16
|
||||
operations from MFMA instructions.'
|
||||
VALU FLOPs (F32): 'The total 32-bit floating-point operations executed per second
|
||||
on the VALU. This is presented with the value of the peak empirical F32 FLOPs
|
||||
achievable on the specific accelerator. Note: this does not include any F32
|
||||
operations from MFMA instructions.'
|
||||
VALU FLOPs (F64): 'The total 64-bit floating-point operations executed per second
|
||||
on the VALU. This is presented with the value of the peak empirical F64 FLOPs
|
||||
achievable on the specific accelerator. Note: this does not include any F64
|
||||
operations from MFMA instructions.'
|
||||
MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
executed per second. This does not include any 8-bit floating point operations
|
||||
from VALU instructions. The peak empirically measured F8 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison. It is supported
|
||||
on AMD Instinct MI300 series and later only.
|
||||
MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
|
||||
executed per second. Note: this does not include any 16-bit brain floating point
|
||||
operations from VALU instructions. The peak empirically measured BF16 MFMA operations
|
||||
achievable on the specific accelerator is displayed alongside for comparison.'
|
||||
MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit floating point operations
|
||||
from VALU instructions. The peak empirically measured F16 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison.'
|
||||
MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 32-bit floating point operations
|
||||
from VALU instructions. The peak empirically measured F32 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison.'
|
||||
MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 64-bit floating point operations
|
||||
from VALU instructions. The peak empirically measured F64 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison.'
|
||||
MFMA FLOPs (F6F4): 'The total number of 4-bit and 6-bit floating point MFMA operations
|
||||
executed per second. Note: this does not include any floating point operations
|
||||
from VALU instructions. The peak empirically measured F6F4 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison. It is supported
|
||||
on AMD Instinct MI350 series (gfx950) and later only.'
|
||||
MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
|
||||
per second. Note: this does not include any 8-bit integer operations from VALU
|
||||
instructions. The peak empirically measured INT8 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison.'
|
||||
HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth
|
||||
Memory (HBM) per second. The peak empirically measured bandwidth achievable
|
||||
on the specific accelerator is displayed alongside for comparison.
|
||||
L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
|
||||
The number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line. The peak empirically measured bandwidth
|
||||
achievable on the specific accelerator is displayed alongside for comparison.
|
||||
L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
|
||||
of VMEM instructions per unit time. The number of bytes is calculated as the
|
||||
number of cache lines requested multiplied by the cache line size. This value
|
||||
does not consider partial requests, so e.g., if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
The peak empirically measured bandwidth achievable on the specific accelerator
|
||||
is displayed alongside for comparison.
|
||||
LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded
|
||||
from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
|
||||
example for more detail). The peak empirically measured LDS bandwidth achievable
|
||||
on the specific accelerator is displayed alongside for comparison.
|
||||
AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
|
||||
of total floating-point operations (FLOPs) to total bytes transferred between
|
||||
the L1 cache and the processing units. This value is used as the x-coordinate
|
||||
for the L1 roofline.
|
||||
AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
|
||||
of total floating-point operations (FLOPs) to total bytes transferred between
|
||||
the L2 cache and the L1 cache. This value is used as the x-coordinate for the
|
||||
L2 roofline.
|
||||
AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
|
||||
It is the ratio of total floating-point operations (FLOPs) to total bytes transferred
|
||||
between HBM and the L2 cache. This value is used as the x-coordinate for the
|
||||
HBM roofline.
|
||||
Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs
|
||||
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
|
||||
operations divided by the total execution time. This value is used as the y-coordinate
|
||||
for the kernel's point on the Roofline plot.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 401
|
||||
@@ -218,3 +139,91 @@ Panel Config:
|
||||
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
|
||||
* 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9
|
||||
unit: GFLOP/s
|
||||
metrics_description:
|
||||
VALU FLOPs (F16): |-
|
||||
The total 16-bit floating-point operations executed per second on the VALU.
|
||||
This is presented with the value of the peak empirical F16 FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any F16 operations
|
||||
from MFMA instructions.
|
||||
VALU FLOPs (F32): |-
|
||||
The total 32-bit floating-point operations executed per second on the VALU.
|
||||
This is presented with the value of the peak empirical F32 FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any F32 operations
|
||||
from MFMA instructions.
|
||||
VALU FLOPs (F64): |-
|
||||
The total 64-bit floating-point operations executed per second on the VALU.
|
||||
This is presented with the value of the peak empirical F64 FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any F64 operations
|
||||
from MFMA instructions.
|
||||
MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
|
||||
executed per second. This does not include any 8-bit floating point operations
|
||||
from VALU instructions. The peak empirically measured F8 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison. It is supported
|
||||
on AMD Instinct MI300 series and later only.
|
||||
MFMA FLOPs (BF16): |-
|
||||
The total number of 16-bit brain floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit brain floating point
|
||||
operations from VALU instructions. The peak empirically measured BF16 MFMA
|
||||
operations achievable on the specific accelerator is displayed alongside
|
||||
for comparison.
|
||||
MFMA FLOPs (F16): |-
|
||||
The total number of 16-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 16-bit floating point operations from
|
||||
VALU instructions. The peak empirically measured F16 MFMA operations
|
||||
achievable on the specific accelerator is displayed alongside for comparison.
|
||||
MFMA FLOPs (F32): |-
|
||||
The total number of 32-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 32-bit floating point operations from
|
||||
VALU instructions. The peak empirically measured F32 MFMA operations
|
||||
achievable on the specific accelerator is displayed alongside for comparison.
|
||||
MFMA FLOPs (F64): |-
|
||||
The total number of 64-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 64-bit floating point operations from
|
||||
VALU instructions. The peak empirically measured F64 MFMA operations
|
||||
achievable on the specific accelerator is displayed alongside for comparison.
|
||||
MFMA IOPs (Int8): |-
|
||||
The total number of 8-bit integer MFMA operations executed per second.
|
||||
Note: this does not include any 8-bit integer operations from VALU instructions.
|
||||
The peak empirically measured INT8 MFMA operations achievable on the specific
|
||||
accelerator is displayed alongside for comparison.
|
||||
HBM Bandwidth: |-
|
||||
The total number of bytes read from and written to High-Bandwidth
|
||||
Memory (HBM) per second. The peak empirically measured bandwidth achievable
|
||||
on the specific accelerator is displayed alongside for comparison.
|
||||
L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
|
||||
The number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line. The peak empirically measured bandwidth
|
||||
achievable on the specific accelerator is displayed alongside for comparison.
|
||||
L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
|
||||
of VMEM instructions per unit time. The number of bytes is calculated as the
|
||||
number of cache lines requested multiplied by the cache line size. This value
|
||||
does not consider partial requests, so e.g., if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
The peak empirically measured bandwidth achievable on the specific accelerator
|
||||
is displayed alongside for comparison.
|
||||
LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded
|
||||
from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
|
||||
example for more detail). The peak empirically measured LDS bandwidth achievable
|
||||
on the specific accelerator is displayed alongside for comparison.
|
||||
AI L1: |-
|
||||
The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
|
||||
of total floating-point operations (FLOPs) to total bytes transferred between
|
||||
the L1 cache and the processing units. This value is used as the x-coordinate
|
||||
for the L1 roofline.
|
||||
AI L2: |-
|
||||
The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
|
||||
of total floating-point operations (FLOPs) to total bytes transferred between
|
||||
the L2 cache and the L1 cache. This value is used as the x-coordinate for
|
||||
the L2 roofline.
|
||||
AI HBM: |-
|
||||
The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
|
||||
It is the ratio of total floating-point operations (FLOPs) to total bytes
|
||||
transferred between HBM and the L2 cache. This value is used as the x-coordinate
|
||||
for the HBM roofline.
|
||||
Performance (GFLOPs): |-
|
||||
The overall achieved performance, measured in GigaFLOPs
|
||||
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
|
||||
operations divided by the total execution time. This value is used as the y-coordinate
|
||||
for the kernel's point on the Roofline plot.
|
||||
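The roofline descriptions above reduce to a few small formulas. The following is a minimal sketch, assuming plain Python numbers as inputs; the function names are hypothetical and are not part of rocprofiler-compute. The `/ 1e9` conversions mirror the `Performance (GFLOPs)` expression in the hunk above, which divides by a nanosecond duration converted to seconds and then scales to giga.

```python
def achieved_gflops(total_flops, duration_ns):
    """Achieved performance in GFLOP/s (y-coordinate of the kernel's point).

    total_flops: sum of VALU and MFMA floating-point operations.
    duration_ns: total execution time in nanoseconds.
    """
    seconds = duration_ns / 1e9
    if seconds == 0:
        return None  # no meaningful rate for a zero-length interval
    return (total_flops / seconds) / 1e9


def arithmetic_intensity(total_flops, total_bytes):
    """FLOPs per byte moved at one memory level (x-coordinate of the point)."""
    if total_bytes == 0:
        return None
    return total_flops / total_bytes


def bytes_looked_up(cache_lines_requested, cache_line_size_bytes):
    """Bytes counted for L1/L2 bandwidth: whole cache lines, even if only
    part of a line was actually used by the request."""
    return cache_lines_requested * cache_line_size_bytes
```

For example, a kernel that executes 2e9 FLOPs in one second (1e9 ns) yields 2.0 GFLOP/s, and moving 50 bytes for 100 FLOPs gives an arithmetic intensity of 2.0 at that memory level.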
|
||||
+25
-24
@@ -2,30 +2,6 @@
|
||||
Panel Config:
|
||||
id: 500
|
||||
title: Command Processor (CPC/CPF)
|
||||
metrics_description:
|
||||
CPF Utilization: Percent of total cycles where the CPF was busy actively doing
|
||||
any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
|
||||
CPF Stall: Percent of CPF busy cycles where the CPF was stalled for any reason.
|
||||
CPF-L2 Utilization: Percent of total cycles counted by the CPF-L2 interface where
|
||||
the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
|
||||
over total cycles counted by the CPF-L2.
|
||||
CPF-L2 Stall: Percent of CPF-L2 busy cycles where the CPF-L2 interface was
|
||||
stalled for any reason.
|
||||
CPF-UTCL1 Stall: Percent of CPF busy cycles where the CPF was stalled by address
|
||||
translation.
|
||||
CPC Utilization: Percent of total cycles where the CPC was busy actively doing
|
||||
any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
|
||||
CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
|
||||
CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
|
||||
for processing.
|
||||
CPC-Workgroup Manager Utilization: Percent of CPC busy cycles spent dispatching
|
||||
workgroups to the workgroup manager.
|
||||
CPC-L2 Utilization: Percent of total cycles counted by the CPC-L2 interface where
|
||||
the CPC-L2 interface was active doing any work.
|
||||
CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
|
||||
translation.
|
||||
CPC-UTCL2 Utilization: 'Percent of total cycles counted by the CPC''s L2 address
|
||||
translation interface where the CPC was busy doing address translation work. '
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 501
|
||||
@@ -143,3 +119,28 @@ Panel Config:
|
||||
max: MAX((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
|
||||
if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
metrics_description:
|
||||
CPF Utilization: Percent of total cycles where the CPF was busy actively doing
|
||||
any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
|
||||
CPF Stall: Percent of CPF busy cycles where the CPF was stalled for any reason.
|
||||
CPF-L2 Utilization: Percent of total cycles counted by the CPF-L2 interface where
|
||||
the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
|
||||
over total cycles counted by the CPF-L2.
|
||||
CPF-L2 Stall: Percent of CPF-L2 busy cycles where the CPF-L2 interface was
|
||||
stalled for any reason.
|
||||
CPF-UTCL1 Stall: Percent of CPF busy cycles where the CPF was stalled by address
|
||||
translation.
|
||||
CPC Utilization: Percent of total cycles where the CPC was busy actively doing
|
||||
any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
|
||||
CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
|
||||
CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
|
||||
for processing.
|
||||
CPC-Workgroup Manager Utilization: Percent of CPC busy cycles spent dispatching
|
||||
workgroups to the workgroup manager.
|
||||
CPC-L2 Utilization: Percent of total cycles counted by the CPC-L2 interface where
|
||||
the CPC-L2 interface was active doing any work.
|
||||
CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
|
||||
translation.
|
||||
CPC-UTCL2 Utilization: |-
|
||||
Percent of total cycles counted by the CPC's L2 address translation
|
||||
interface where the CPC was busy doing address translation work.
|
||||
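The utilization metrics in this panel are all ratios of busy cycles to total counted cycles. A minimal sketch of that pattern, modeled on the `CPC_CPC_UTCL2IU_BUSY` / `CPC_CPC_UTCL2IU_IDLE` expression shown in the hunk above; the function name is hypothetical and the inputs are assumed to be raw counter values.

```python
def utilization_pct(busy_cycles, idle_cycles):
    """Percent of counted cycles that were busy.

    Returns None when no cycles were counted, matching the
    `... if (busy + idle) != 0 else None` guard in the config expression.
    """
    total = busy_cycles + idle_cycles
    if total == 0:
        return None
    return (100 * busy_cycles) / total
```

Returning `None` rather than 0 keeps "no data" distinguishable from "never busy" when the values are later aggregated.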
|
||||
+55
-55
@@ -2,61 +2,6 @@
|
||||
Panel Config:
|
||||
id: 600
|
||||
title: Workgroup Manager (SPI)
|
||||
metrics_description:
|
||||
Accelerator Utilization: The percent of cycles in the kernel where the accelerator
|
||||
was actively doing any work.
|
||||
Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
|
||||
kernel where the scheduler-pipes were actively doing any work.
|
||||
Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
|
||||
manager was actively doing any work.
|
||||
Shader Engine Utilization: The percent of total shader engine cycles in the kernel
|
||||
where any CU in a shader-engine was actively doing any work, normalized over
|
||||
all shader-engines. Low values (e.g., << 100%) indicate that the accelerator
|
||||
was not fully saturated by the kernel, or a potential load-imbalance issue.
|
||||
SIMD Utilization: The percent of total SIMD cycles in the kernel where any SIMD
|
||||
on a CU was actively doing any work, summed over all CUs. Low values (less than
|
||||
100%) indicate that the accelerator was not fully saturated by the kernel, or
|
||||
a potential load-imbalance issue.
|
||||
Dispatched Workgroups: The total number of workgroups forming this kernel launch.
|
||||
Dispatched Wavefronts: The total number of wavefronts, summed over all workgroups,
|
||||
forming this kernel launch.
|
||||
VGPR Writes: The average number of cycles spent initializing VGPRs at wave creation.
|
||||
SGPR Writes: The average number of cycles spent initializing SGPRs at wave creation.
|
||||
Not-scheduled Rate (Workgroup Manager): The percent of total scheduler-pipe cycles
|
||||
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
|
||||
within the workgroup manager rather than a lack of a CU or SIMD with sufficient
|
||||
resources.
|
||||
Not-scheduled Rate (Scheduler-Pipe): 'The percent of total scheduler-pipe cycles
|
||||
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
|
||||
within the scheduler-pipes rather than a lack of a CU or SIMD with sufficient
|
||||
resources. '
|
||||
Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
|
||||
where a workgroup could not be scheduled to a CU due to occupancy limitations
|
||||
(like a lack of a CU or SIMD with sufficient resources).
|
||||
Scratch Stall Rate: The percent of total shader-engine cycles in the kernel where
|
||||
a workgroup could not be scheduled to a CU due to lack of private (a.k.a., scratch)
|
||||
memory slots. While this can reach up to 100%, note that the actual occupancy
|
||||
limitations on a kernel using private memory are typically quite small (for
|
||||
example, less than 1% of the total number of waves that can be scheduled to
|
||||
an accelerator).
|
||||
Insufficient SIMD Waveslots: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available waveslots.
|
||||
Insufficient SIMD VGPRs: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available VGPRs.
|
||||
Insufficient SIMD SGPRs: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available SGPRs.
|
||||
Insufficient CU LDS: The percent of total CU cycles in the kernel where a workgroup
|
||||
could not be scheduled to a CU due to lack of available LDS.
|
||||
Insufficient CU Barriers: The percent of total CU cycles in the kernel where a
|
||||
workgroup could not be scheduled to a CU due to lack of available barriers.
|
||||
Reached CU Workgroup Limit: The percent of total CU cycles in the kernel where
|
||||
a workgroup could not be scheduled to a CU due to limits within the workgroup
|
||||
manager. This is expected to always be zero on CDNA2 or newer accelerators
|
||||
(and small for previous accelerators).
|
||||
Reached CU Wavefront Limit: The percent of total CU cycles in the kernel where
|
||||
a wavefront could not be scheduled to a CU due to limits within the workgroup
|
||||
manager. This is expected to always be zero on CDNA2 or newer accelerators
|
||||
(and small for previous accelerators).
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 601
|
||||
@@ -199,3 +144,58 @@ Panel Config:
|
||||
min: MIN(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
metrics_description:
|
||||
Accelerator Utilization: The percent of cycles in the kernel where the accelerator
|
||||
was actively doing any work.
|
||||
Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
|
||||
kernel where the scheduler-pipes were actively doing any work.
|
||||
Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
|
||||
manager was actively doing any work.
|
||||
Shader Engine Utilization: The percent of total shader engine cycles in the kernel
|
||||
where any CU in a shader-engine was actively doing any work, normalized over
|
||||
all shader-engines. Low values (e.g., << 100%) indicate that the accelerator
|
||||
was not fully saturated by the kernel, or a potential load-imbalance issue.
|
||||
SIMD Utilization: The percent of total SIMD cycles in the kernel where any SIMD
|
||||
on a CU was actively doing any work, summed over all CUs. Low values (less than
|
||||
100%) indicate that the accelerator was not fully saturated by the kernel, or
|
||||
a potential load-imbalance issue.
|
||||
Dispatched Workgroups: The total number of workgroups forming this kernel launch.
|
||||
Dispatched Wavefronts: The total number of wavefronts, summed over all workgroups,
|
||||
forming this kernel launch.
|
||||
VGPR Writes: The average number of cycles spent initializing VGPRs at wave creation.
|
||||
SGPR Writes: The average number of cycles spent initializing SGPRs at wave creation.
|
||||
Not-scheduled Rate (Workgroup Manager): The percent of total scheduler-pipe cycles
|
||||
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
|
||||
within the workgroup manager rather than a lack of a CU or SIMD with sufficient
|
||||
resources.
|
||||
Not-scheduled Rate (Scheduler-Pipe): |-
|
||||
The percent of total scheduler-pipe cycles in the kernel where a workgroup
|
||||
could not be scheduled to a CU due to a bottleneck within the scheduler-pipes
|
||||
rather than a lack of a CU or SIMD with sufficient resources.
|
||||
Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
|
||||
where a workgroup could not be scheduled to a CU due to occupancy limitations
|
||||
(like a lack of a CU or SIMD with sufficient resources).
|
||||
Scratch Stall Rate: The percent of total shader-engine cycles in the kernel where
|
||||
a workgroup could not be scheduled to a CU due to lack of private (a.k.a., scratch)
|
||||
memory slots. While this can reach up to 100%, note that the actual occupancy
|
||||
limitations on a kernel using private memory are typically quite small (for
|
||||
example, less than 1% of the total number of waves that can be scheduled to
|
||||
an accelerator).
|
||||
Insufficient SIMD Waveslots: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available waveslots.
|
||||
Insufficient SIMD VGPRs: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available VGPRs.
|
||||
Insufficient SIMD SGPRs: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available SGPRs.
|
||||
Insufficient CU LDS: The percent of total CU cycles in the kernel where a workgroup
|
||||
could not be scheduled to a CU due to lack of available LDS.
|
||||
Insufficient CU Barriers: The percent of total CU cycles in the kernel where a
|
||||
workgroup could not be scheduled to a CU due to lack of available barriers.
|
||||
Reached CU Workgroup Limit: The percent of total CU cycles in the kernel where
|
||||
a workgroup could not be scheduled to a CU due to limits within the workgroup
|
||||
manager. This is expected to always be zero on CDNA2 or newer accelerators
|
||||
(and small for previous accelerators).
|
||||
Reached CU Wavefront Limit: The percent of total CU cycles in the kernel where
|
||||
a wavefront could not be scheduled to a CU due to limits within the workgroup
|
||||
manager. This is expected to always be zero on CDNA2 or newer accelerators
|
||||
(and small for previous accelerators).
|
||||
|
||||
+63
-57
@@ -2,63 +2,6 @@
|
||||
Panel Config:
|
||||
id: 700
|
||||
title: Wavefront
|
||||
metrics_description:
|
||||
Grid Size: The total number of work-items (or, threads) launched as a part of
|
||||
the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied
|
||||
by the total workgroup (or, block) size.
|
||||
Workgroup Size: The total number of work-items (or, threads) in each workgroup
|
||||
(or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
|
||||
to the total block size.
|
||||
Total Wavefronts: "The total number of wavefronts launched as part of the kernel\
|
||||
\ dispatch. On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs,\
|
||||
\ the wavefront size is always 64 work-items. Thus, the total number of wavefronts\
|
||||
\ should be equivalent to the ceiling of grid size divided by 64."
|
||||
Saved Wavefronts: The total number of wavefronts saved at a context-save.
|
||||
Restored Wavefronts: The total number of wavefronts restored from a context-save.
|
||||
VGPRs: 'The number of architected vector general-purpose registers allocated for
|
||||
the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
|
||||
by the compiler due to allocation granularity.'
|
||||
AGPRs: 'The number of accumulation vector general-purpose registers allocated
|
||||
for the kernel, see AGPRs. Note: this may not exactly match the number of AGPRs
|
||||
requested by the compiler due to allocation granularity.'
|
||||
SGPRs: 'The number of scalar general-purpose registers allocated for the kernel,
|
||||
see SALU. Note: this may not exactly match the number of SGPRs requested by
|
||||
the compiler due to allocation granularity.'
|
||||
LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
|
||||
for this kernel. Note: This may also be larger than what was requested at compile
|
||||
time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
|
||||
Scratch Allocation: The number of bytes of scratch memory requested per work-item
|
||||
for this kernel. Scratch memory is used for stack memory on the accelerator,
|
||||
as well as for register spills and restores.
|
||||
Kernel Time: The total duration of the executed kernel.
|
||||
Kernel Time (Cycles): The total duration of the executed kernel in cycles.
|
||||
Instructions per wavefront: The average number of instructions (of all types)
|
||||
executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.
|
||||
Wave Cycles: The number of cycles a wavefront in the kernel dispatch spent resident
|
||||
on a compute unit per normalization unit. This is averaged over all wavefronts
|
||||
in a kernel dispatch.
|
||||
Dependency Wait Cycles: The number of cycles a wavefront in the kernel dispatch
|
||||
spent stalled on a dependency (e.g., an outstanding memory operation) per normalization unit. This is averaged over
|
||||
all wavefronts in a kernel dispatch.
|
||||
Issue Wait Cycles: The number of cycles a wavefront in the kernel dispatch was
|
||||
unable to issue an instruction for any reason (e.g., execution pipe back-pressure,
|
||||
arbitration loss, etc.) per normalization unit. This counter is incremented
|
||||
at every cycle by all wavefronts on a CU unable to issue an instruction. As
|
||||
such, it is most useful to get a sense of how waves were spending their time,
|
||||
rather than identification of a precise limiter because another wave could be
|
||||
actively executing while a wave is issue stalled. The sum of this metric, Dependency
|
||||
Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
|
||||
Active Cycles: The average number of cycles a wavefront in the kernel dispatch
|
||||
was actively executing instructions per normalization unit. This measurement
|
||||
is made on a per-wavefront basis, and may include cycles that another wavefront
|
||||
spent actively executing (on another execution unit, for example) or was stalled.
|
||||
As such, it is most useful to get a sense of how waves were spending their time,
|
||||
rather than identification of a precise limiter. The sum of this metric, Issue
|
||||
Wait Cycles and Dependency Wait Cycles should be equal to the total Wave Cycles
|
||||
metric.
|
||||
Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
|
||||
over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
||||
kernels (less than 1ms).'
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 701
|
||||
@@ -171,3 +114,66 @@ Panel Config:
|
||||
max: MAX((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
unit: Wavefronts
|
||||
coll_level: SQ_LEVEL_WAVES
|
||||
metrics_description:
|
||||
Grid Size: The total number of work-items (or, threads) launched as a part of
|
||||
the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied
|
||||
by the total workgroup (or, block) size.
|
||||
Workgroup Size: The total number of work-items (or, threads) in each workgroup
|
||||
(or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
|
||||
to the total block size.
|
||||
Total Wavefronts: |-
|
||||
The total number of wavefronts launched as part of the kernel dispatch.
|
||||
On AMD Instinct™ CDNA™ accelerators and GCN™ GPUs, the wavefront
|
||||
size is always 64 work-items. Thus, the total number of wavefronts should
|
||||
be equivalent to the ceiling of grid size divided by 64.
|
||||
Saved Wavefronts: The total number of wavefronts saved at a context-save.
|
||||
Restored Wavefronts: The total number of wavefronts restored from a context-save.
|
||||
VGPRs: |-
|
||||
The number of architected vector general-purpose registers allocated
|
||||
for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
|
||||
requested by the compiler due to allocation granularity.
|
||||
AGPRs: |-
|
||||
The number of accumulation vector general-purpose registers allocated
|
||||
for the kernel, see AGPRs. Note: this may not exactly match the number of
|
||||
AGPRs requested by the compiler due to allocation granularity.
|
||||
SGPRs: |-
|
||||
The number of scalar general-purpose registers allocated for the kernel,
|
||||
see SALU. Note: this may not exactly match the number of SGPRs requested by
|
||||
the compiler due to allocation granularity.
|
||||
LDS Allocation: |-
|
||||
The number of bytes of LDS memory (or, shared memory) allocated for
|
||||
this kernel. Note: This may also be larger than what was requested at compile
|
||||
time due to both allocation granularity and dynamic per-dispatch LDS allocations.
|
||||
Scratch Allocation: The number of bytes of scratch memory requested per work-item
|
||||
for this kernel. Scratch memory is used for stack memory on the accelerator,
|
||||
as well as for register spills and restores.
|
||||
Kernel Time: The total duration of the executed kernel.
|
||||
Kernel Time (Cycles): The total duration of the executed kernel in cycles.
|
||||
Instructions per wavefront: The average number of instructions (of all types)
|
||||
executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.
|
||||
Wave Cycles: The number of cycles a wavefront in the kernel dispatch spent resident
|
||||
on a compute unit per normalization unit. This is averaged over all wavefronts
|
||||
in a kernel dispatch.
|
||||
Dependency Wait Cycles: The number of cycles a wavefront in the kernel dispatch
|
||||
spent stalled on a dependency (e.g., an outstanding memory operation) per normalization unit. This is averaged over
|
||||
all wavefronts in a kernel dispatch.
|
||||
Issue Wait Cycles: The number of cycles a wavefront in the kernel dispatch was
|
||||
unable to issue an instruction for any reason (e.g., execution pipe back-pressure,
|
||||
arbitration loss, etc.) per normalization unit. This counter is incremented
|
||||
at every cycle by all wavefronts on a CU unable to issue an instruction. As
|
||||
such, it is most useful to get a sense of how waves were spending their time,
|
||||
rather than identification of a precise limiter because another wave could be
|
||||
actively executing while a wave is issue stalled. The sum of this metric, Dependency
|
||||
Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
|
||||
Active Cycles: The average number of cycles a wavefront in the kernel dispatch
|
||||
was actively executing instructions per normalization unit. This measurement
|
||||
is made on a per-wavefront basis, and may include cycles that another wavefront
|
||||
spent actively executing (on another execution unit, for example) or was stalled.
|
||||
As such, it is most useful to get a sense of how waves were spending their time,
|
||||
rather than identification of a precise limiter. The sum of this metric, Issue
|
||||
Wait Cycles and Dependency Wait Cycles should be equal to the total Wave Cycles
|
||||
metric.
|
||||
Wavefront Occupancy: |-
|
||||
The time-averaged number of wavefronts resident on the accelerator over
|
||||
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
||||
kernels (less than 1ms).
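The descriptions above state that Dependency Wait Cycles, Issue Wait Cycles and Active Cycles should sum to Wave Cycles. The following is a minimal sketch of how that relationship could be checked on already-collected values. The function name, the argument names and the sample numbers are hypothetical and are not part of the profiler.

```python
def wave_cycle_breakdown(wave_cycles, dependency_wait, issue_wait, active):
    """Express each component as a fraction of Wave Cycles.

    Also reports the residual, i.e. the share of Wave Cycles that the
    three components do not account for (expected to be close to zero).
    """
    if wave_cycles <= 0:
        raise ValueError("wave_cycles must be positive")
    parts = {
        "dependency_wait": dependency_wait,
        "issue_wait": issue_wait,
        "active": active,
    }
    shares = {name: value / wave_cycles for name, value in parts.items()}
    shares["residual"] = (wave_cycles - sum(parts.values())) / wave_cycles
    return shares


# Made-up per-wavefront averages, for illustration only.
breakdown = wave_cycle_breakdown(1000.0, 400.0, 350.0, 250.0)
print(breakdown)
```

A large residual would suggest that the values come from different normalization units or different dispatches and cannot be compared directly.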
|
||||
|
||||
+85
-84
@@ -2,90 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1000
|
||||
title: Compute Units - Instruction Mix
|
||||
metrics_description:
|
||||
VALU: The total number of vector arithmetic logic unit (VALU) operations issued.
|
||||
These are the workhorses of the compute unit, and are used to execute a wide
|
||||
range of instruction types including floating point operations, non-uniform
|
||||
address calculations, transcendental operations, integer operations, shifts,
|
||||
conditional evaluation, etc.
|
||||
VMEM: The total number of vector memory operations issued. These include most
|
||||
loads, stores and atomic operations and all accesses to generic, global, private
|
||||
and texture memory.
|
||||
LDS: The total number of LDS (also known as shared memory) operations issued.
|
||||
These include loads, stores, atomics, and HIP's __shfl operations.
|
||||
MFMA: The total number of matrix fused multiply-add instructions issued.
|
||||
SALU: The total number of scalar arithmetic logic unit (SALU) operations issued.
|
||||
Typically these are used for address calculations, literal constants, and other
|
||||
operations that are provably uniform across a wavefront. Although scalar memory
|
||||
(SMEM) operations are issued by the SALU, they are counted separately in this
|
||||
section.
|
||||
SMEM: The total number of scalar memory (SMEM) operations issued. These are typically
|
||||
used for loading kernel arguments, base-pointers and loads from HIP's __constant__
|
||||
memory.
|
||||
Branch: The total number of branch operations issued. These typically consist
|
||||
of jump or branch operations and are used to implement control flow.
|
||||
INT32: The total number of instructions operating on 32-bit integer operands issued
|
||||
to the VALU per normalization unit.
|
||||
INT64: The total number of instructions operating on 64-bit integer operands issued
|
||||
to the VALU per normalization unit.
|
||||
F16-ADD: The total number of addition instructions operating on 16-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F16-MUL: The total number of multiplication instructions operating on 16-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F16-FMA: The total number of fused multiply-add instructions operating on 16-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F16-Trans: The total number of transcendental instructions (e.g., sqrt) operating
|
||||
on 16-bit floating-point operands issued to the VALU per normalization unit.
|
||||
F32-ADD: The total number of addition instructions operating on 32-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F32-MUL: The total number of multiplication instructions operating on 32-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F32-FMA: The total number of fused multiply-add instructions operating on 32-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F32-Trans: The total number of transcendental instructions (such as sqrt) operating
|
||||
on 32-bit floating-point operands issued to the VALU per normalization unit.
|
||||
F64-ADD: The total number of addition instructions operating on 64-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F64-MUL: The total number of multiplication instructions operating on 64-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F64-FMA: The total number of fused multiply-add instructions operating on 64-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F64-Trans: The total number of transcendental instructions (such as sqrt) operating
|
||||
on 64-bit floating-point operands issued to the VALU per normalization unit.
|
||||
Conversion: "The total number of type conversion instructions (such as converting\
|
||||
\ data to or from F32\u2194F64) issued to the VALU per normalization unit."
|
||||
Global/Generic Instr: The total number of global & generic memory instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Read: The total number of global & generic memory read instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Write: The total number of global & generic memory write instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Atomic: The total number of global & generic memory atomic (with
|
||||
and without return) instructions executed on all compute units on the accelerator,
|
||||
per normalization unit.
|
||||
Spill/Stack Instr: The total number of spill/stack memory instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Read: The total number of spill/stack memory read instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Write: The total number of spill/stack memory write instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Atomic: The total number of spill/stack memory atomic (with and without
|
||||
return) instructions executed on all compute units on the accelerator, per normalization
|
||||
unit. Typically unused, as these memory operations are mainly used to implement
|
||||
thread-local storage.
|
||||
MFMA-I8: The total number of 8-bit integer MFMA instructions issued per normalization
|
||||
unit.
|
||||
MFMA-F8: The total number of 8-bit floating point MFMA instructions issued per
|
||||
normalization unit. This is supported in AMD Instinct MI300 series and later
|
||||
only.
|
||||
MFMA-F16: The total number of 16-bit floating point MFMA instructions issued per
|
||||
normalization unit.
|
||||
MFMA-BF16: The total number of 16-bit brain floating point MFMA instructions issued
|
||||
per normalization unit.
|
||||
MFMA-F32: The total number of 32-bit floating-point MFMA instructions issued per
|
||||
normalization unit.
|
||||
MFMA-F64: The total number of 64-bit floating-point MFMA instructions issued per
|
||||
normalization unit.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1001
|
||||
@@ -307,3 +223,88 @@ Panel Config:
|
||||
min: MIN((SQ_INSTS_VALU_MFMA_F64 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_MFMA_F64 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
metrics_description:
|
||||
VALU: The total number of vector arithmetic logic unit (VALU) operations issued.
|
||||
These are the workhorses of the compute unit, and are used to execute a wide
|
||||
range of instruction types including floating point operations, non-uniform
|
||||
address calculations, transcendental operations, integer operations, shifts,
|
||||
conditional evaluation, etc.
|
||||
VMEM: The total number of vector memory operations issued. These include most
|
||||
loads, stores and atomic operations and all accesses to generic, global, private
|
||||
and texture memory.
|
||||
LDS: The total number of LDS (also known as shared memory) operations issued.
|
||||
These include loads, stores, atomics, and HIP's __shfl operations.
|
||||
MFMA: The total number of matrix fused multiply-add instructions issued.
|
||||
SALU: The total number of scalar arithmetic logic unit (SALU) operations issued.
|
||||
Typically these are used for address calculations, literal constants, and other
|
||||
operations that are provably uniform across a wavefront. Although scalar memory
|
||||
(SMEM) operations are issued by the SALU, they are counted separately in this
|
||||
section.
|
||||
SMEM: The total number of scalar memory (SMEM) operations issued. These are typically
|
||||
used for loading kernel arguments, base-pointers and loads from HIP's __constant__
|
||||
memory.
|
||||
Branch: The total number of branch operations issued. These typically consist
|
||||
of jump or branch operations and are used to implement control flow.
|
||||
INT32: The total number of instructions operating on 32-bit integer operands issued
|
||||
to the VALU per normalization unit.
|
||||
INT64: The total number of instructions operating on 64-bit integer operands issued
|
||||
to the VALU per normalization unit.
|
||||
F16-ADD: The total number of addition instructions operating on 16-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F16-MUL: The total number of multiplication instructions operating on 16-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F16-FMA: The total number of fused multiply-add instructions operating on 16-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F16-Trans: The total number of transcendental instructions (e.g., sqrt) operating
|
||||
on 16-bit floating-point operands issued to the VALU per normalization unit.
|
||||
F32-ADD: The total number of addition instructions operating on 32-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F32-MUL: The total number of multiplication instructions operating on 32-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F32-FMA: The total number of fused multiply-add instructions operating on 32-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F32-Trans: The total number of transcendental instructions (such as sqrt) operating
|
||||
on 32-bit floating-point operands issued to the VALU per normalization unit.
|
||||
F64-ADD: The total number of addition instructions operating on 64-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F64-MUL: The total number of multiplication instructions operating on 64-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F64-FMA: The total number of fused multiply-add instructions operating on 64-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F64-Trans: The total number of transcendental instructions (such as sqrt) operating
|
||||
on 64-bit floating-point operands issued to the VALU per normalization unit.
|
||||
Conversion: |-
|
||||
The total number of type conversion instructions (such as converting
|
||||
data to or from F32↔F64) issued to the VALU per normalization unit.
|
||||
Global/Generic Instr: The total number of global & generic memory instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Read: The total number of global & generic memory read instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Write: The total number of global & generic memory write instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Atomic: The total number of global & generic memory atomic (with
|
||||
and without return) instructions executed on all compute units on the accelerator,
|
||||
per normalization unit.
|
||||
Spill/Stack Instr: The total number of spill/stack memory instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Read: The total number of spill/stack memory read instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Write: The total number of spill/stack memory write instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Atomic: The total number of spill/stack memory atomic (with and without
|
||||
return) instructions executed on all compute units on the accelerator, per normalization
|
||||
unit. Typically unused, as these memory operations are mainly used to implement
|
||||
thread-local storage.
|
||||
MFMA-I8: The total number of 8-bit integer MFMA instructions issued per normalization
|
||||
unit.
|
||||
MFMA-F8: The total number of 8-bit floating point MFMA instructions issued per
|
||||
normalization unit. This is supported in AMD Instinct MI300 series and later
|
||||
only.
|
||||
MFMA-F16: The total number of 16-bit floating point MFMA instructions issued per
|
||||
normalization unit.
|
||||
MFMA-BF16: The total number of 16-bit brain floating point MFMA instructions issued
|
||||
per normalization unit.
|
||||
MFMA-F32: The total number of 32-bit floating-point MFMA instructions issued per
|
||||
normalization unit.
|
||||
MFMA-F64: The total number of 64-bit floating-point MFMA instructions issued per
|
||||
normalization unit.
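Most metrics in this panel are reported "per normalization unit", and the config expresses this as `AVG`, `MIN` and `MAX` over `counter / $denom`. The sketch below illustrates that pattern in plain Python. It assumes one counter value and one denominator per dispatch. What `$denom` actually expands to is not shown in this section, so the helper name and the sample values are hypothetical.

```python
from statistics import mean


def normalized_stats(counter_values, denoms):
    """Divide each per-dispatch counter by its denominator, then reduce.

    Mirrors the avg/min/max triple used in the metric tables, under the
    assumption that the reduction runs over dispatches.
    """
    if len(counter_values) != len(denoms):
        raise ValueError("need one denominator per counter value")
    per_dispatch = [c / d for c, d in zip(counter_values, denoms)]
    return {
        "avg": mean(per_dispatch),
        "min": min(per_dispatch),
        "max": max(per_dispatch),
    }


# Hypothetical SQ_INSTS_VALU_MFMA_F64 values for three dispatches,
# normalized by an assumed denominator of 4 for each dispatch.
stats = normalized_stats([128, 256, 64], [4, 4, 4])
print(stats)
```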
|
||||
|
||||
+95
-88
@@ -2,84 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1100
|
||||
title: Compute Units - Compute Pipeline
|
||||
metrics_description:
|
||||
VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
|
||||
This is also presented as a percent of the peak theoretical FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any floating-point
|
||||
operations from MFMA instructions.'
|
||||
VALU IOPs: 'The total integer operations executed per second on the VALU. This
|
||||
is also presented as a percent of the peak theoretical IOPs achievable on the
|
||||
specific accelerator. Note: this does not include any integer operations from
|
||||
MFMA instructions.'
|
||||
MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
|
||||
executed per second. Note: this does not include any 16-bit brain floating point
|
||||
operations from VALU instructions. This is also presented as a percent of the
|
||||
peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F16 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 32-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F32 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 64-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F64 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA IOPs (INT8): 'The total number of 8-bit integer MFMA operations executed
|
||||
per second. Note: this does not include any 8-bit integer operations from VALU
|
||||
instructions. This is also presented as a percent of the peak theoretical INT8
|
||||
MFMA operations achievable on the specific accelerator.'
|
||||
IPC: The ratio of the total number of instructions executed on the CU over the
|
||||
total active CU cycles.
|
||||
IPC (Issued): The ratio of the total number of (non-internal) instructions issued
|
||||
over the number of cycles where the scheduler was actively working on issuing
|
||||
instructions.
|
||||
SALU Utilization: Indicates what percent of the kernel's duration the SALU was
|
||||
busy executing instructions. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing SALU / SMEM instructions over the total CU cycles.
|
||||
VALU Utilization: Indicates what percent of the kernel's duration the VALU was
|
||||
busy executing instructions. Does not include VMEM operations. Computed as the
|
||||
ratio of the total number of cycles spent by the scheduler issuing VALU instructions
|
||||
over the total CU cycles.
|
||||
VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
|
||||
was busy executing instructions, including both global/generic and spill/scratch
|
||||
operations (see the VMEM instruction count metrics for more detail). Does not
|
||||
include VALU operations. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing VMEM instructions over the total CU cycles.
|
||||
Branch Utilization: Indicates what percent of the kernel's duration the branch
|
||||
unit was busy executing instructions. Computed as the ratio of the total number
|
||||
of cycles spent by the scheduler issuing branch instructions over the total
|
||||
CU cycles.
|
||||
VALU Active Threads: Indicates the average level of divergence within a wavefront
|
||||
over the lifetime of the kernel. The number of work-items that were active in
|
||||
a wavefront during execution of each VALU instruction, time-averaged over all
|
||||
VALU instructions run on all wavefronts in the kernel.
|
||||
MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
|
||||
was busy executing instructions. Computed as the ratio of the total number of
|
||||
cycles the MFMA unit was busy over the total CU cycles.
|
||||
MFMA Instruction Cycles: The average duration of MFMA instructions in this kernel
|
||||
in cycles. Computed as the ratio of the total number of cycles the MFMA unit
|
||||
was busy over the total number of MFMA instructions.
|
||||
VMEM Latency: The average number of round-trip cycles (that is, from issue to
|
||||
data return / acknowledgment) required for a VMEM instruction to complete.
|
||||
SMEM Latency: The average number of round-trip cycles (that is, from issue to
|
||||
data return / acknowledgment) required for a SMEM instruction to complete.
|
||||
FLOPs (Total): The total number of floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
IOPs (Total): The total number of integer operations executed on either the VALU
|
||||
or MFMA units, per normalization unit.
|
||||
F16 OPs: The total number of 16-bit floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
BF16 OPs: The total number of 16-bit brain floating-point operations executed
|
||||
on either the VALU or MFMA units, per normalization unit.
|
||||
F32 OPs: The total number of 32-bit floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
F64 OPs: The total number of 64-bit floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
INT8 OPs: The total number of 8-bit integer operations executed on either the
|
||||
VALU or MFMA units, per normalization unit.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1101
|
||||
@@ -165,13 +87,13 @@ Panel Config:
|
||||
unit: Instr/cycle
|
||||
IPC (Issued):
|
||||
avg: AVG(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
|
||||
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
|
||||
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
|
||||
/ SQ_ACTIVE_INST_ANY))
|
||||
min: MIN(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
|
||||
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
|
||||
/ SQ_ACTIVE_INST_ANY))
|
||||
max: MAX(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
|
||||
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
|
||||
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
|
||||
/ SQ_ACTIVE_INST_ANY))
|
||||
unit: Instr/cycle
|
||||
SALU Utilization:
|
||||
@@ -271,7 +193,7 @@ Panel Config:
|
||||
+ (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
|
||||
+ (SQ_INSTS_VALU_FMA_F64 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
|
||||
/ $denom))
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
IOPs (Total):
|
||||
avg: AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8
|
||||
* 512)) / $denom)
|
||||
@@ -279,12 +201,12 @@ Panel Config:
|
||||
* 512)) / $denom)
|
||||
max: MAX(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8
|
||||
* 512)) / $denom)
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
F8 OPs:
|
||||
avg: AVG(((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) / $denom))
|
||||
min: MIN(((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) / $denom))
|
||||
max: MAX(((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) / $denom))
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
F16 OPs:
|
||||
avg: AVG(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16))
|
||||
+ (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512
|
||||
@@ -295,12 +217,12 @@ Panel Config:
|
||||
max: MAX(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16))
|
||||
+ (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512
|
||||
* SQ_INSTS_VALU_MFMA_MOPS_F16)) / $denom))
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
BF16 OPs:
|
||||
avg: AVG(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
|
||||
min: MIN(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
|
||||
max: MAX(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
F32 OPs:
|
||||
avg: AVG((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
|
||||
+ (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
|
||||
@@ -311,7 +233,7 @@ Panel Config:
|
||||
max: MAX((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
|
||||
+ (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
|
||||
/ $denom))
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
F64 OPs:
|
||||
avg: AVG((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
|
||||
+ (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
|
||||
@@ -322,9 +244,94 @@ Panel Config:
|
||||
max: MAX((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
|
||||
+ (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
|
||||
/ $denom))
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
INT8 OPs:
|
||||
avg: AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
|
||||
min: MIN(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
|
||||
max: MAX(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
metrics_description:
|
||||
VALU FLOPs: |-
|
||||
The total floating-point operations executed per second on the VALU.
|
||||
This is also presented as a percent of the peak theoretical FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any floating-point
|
||||
operations from MFMA instructions.
|
||||
VALU IOPs: |-
|
||||
The total integer operations executed per second on the VALU. This is
|
||||
also presented as a percent of the peak theoretical IOPs achievable on the
|
||||
specific accelerator. Note: this does not include any integer operations from
|
||||
MFMA instructions.
|
||||
MFMA FLOPs (BF16): |-
|
||||
The total number of 16-bit brain floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit brain floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
BF16 MFMA operations achievable on the specific accelerator.
|
||||
MFMA FLOPs (F16): |-
|
||||
The total number of 16-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 16-bit floating point operations from
|
||||
VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F16 MFMA operations achievable on the specific accelerator.
|
||||
MFMA FLOPs (F32): |-
|
||||
The total number of 32-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 32-bit floating point operations from
|
||||
VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F32 MFMA operations achievable on the specific accelerator.
|
||||
MFMA FLOPs (F64): |-
|
||||
The total number of 64-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 64-bit floating point operations from
|
||||
VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F64 MFMA operations achievable on the specific accelerator.
|
||||
MFMA IOPs (INT8): |-
|
||||
The total number of 8-bit integer MFMA operations executed per second.
|
||||
Note: this does not include any 8-bit integer operations from VALU instructions.
|
||||
This is also presented as a percent of the peak theoretical INT8 MFMA operations
|
||||
achievable on the specific accelerator.
|
||||
IPC: The ratio of the total number of instructions executed on the CU over the
|
||||
total active CU cycles.
|
||||
IPC (Issued): The ratio of the total number of (non-internal) instructions issued
|
||||
over the number of cycles where the scheduler was actively working on issuing
|
||||
instructions.
|
||||
SALU Utilization: Indicates what percent of the kernel's duration the SALU was
|
||||
busy executing instructions. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing SALU / SMEM instructions over the total CU cycles.
|
||||
VALU Utilization: Indicates what percent of the kernel's duration the VALU was
|
||||
busy executing instructions. Does not include VMEM operations. Computed as the
|
||||
ratio of the total number of cycles spent by the scheduler issuing VALU instructions
|
||||
over the total CU cycles.
|
||||
VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
|
||||
was busy executing instructions, including both global/generic and spill/scratch
|
||||
operations (see the VMEM instruction count metrics for more detail). Does not
|
||||
include VALU operations. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing VMEM instructions over the total CU cycles.
|
||||
Branch Utilization: Indicates what percent of the kernel's duration the branch
|
||||
unit was busy executing instructions. Computed as the ratio of the total number
|
||||
of cycles spent by the scheduler issuing branch instructions over the total
|
||||
CU cycles.
|
||||
VALU Active Threads: Indicates the average level of divergence within a wavefront
|
||||
over the lifetime of the kernel. The number of work-items that were active in
|
||||
a wavefront during execution of each VALU instruction, time-averaged over all
|
||||
VALU instructions run on all wavefronts in the kernel.
|
||||
MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
|
||||
was busy executing instructions. Computed as the ratio of the total number of
|
||||
cycles the MFMA unit was busy over the total CU cycles.
|
||||
MFMA Instruction Cycles: The average duration of MFMA instructions in this kernel
|
||||
in cycles. Computed as the ratio of the total number of cycles the MFMA unit
|
||||
was busy over the total number of MFMA instructions.
|
||||
VMEM Latency: The average number of round-trip cycles (that is, from issue to
|
||||
data return / acknowledgment) required for a VMEM instruction to complete.
|
||||
SMEM Latency: The average number of round-trip cycles (that is, from issue to
|
||||
data return / acknowledgment) required for a SMEM instruction to complete.
|
||||
FLOPs (Total): The total number of floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
IOPs (Total): The total number of integer operations executed on either the VALU
|
||||
or MFMA units, per normalization unit.
|
||||
F16 OPs: The total number of 16-bit floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
BF16 OPs: The total number of 16-bit brain floating-point operations executed
|
||||
on either the VALU or MFMA units, per normalization unit.
|
||||
F32 OPs: The total number of 32-bit floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
F64 OPs: The total number of 64-bit floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
INT8 OPs: The total number of 8-bit integer operations executed on either the
|
||||
VALU or MFMA units, per normalization unit.
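The `F32 OPs` and `IPC (Issued)` expressions in the hunks above can be restated in plain Python to make the arithmetic easier to follow. This is a sketch only: the function names are hypothetical, the counters are passed in as ordinary numbers, and reading the factor 64 as a per-wavefront lane count is an assumption rather than something the config states.

```python
VALU_FACTOR = 64   # multiplier applied to VALU instruction counts in the config
MFMA_FACTOR = 512  # multiplier applied to SQ_INSTS_VALU_MFMA_MOPS_* counters


def f32_ops(add, mul, trans, fma, mfma_mops, denom):
    """F32 OPs: FMA counts twice, MFMA MOPS are scaled by 512."""
    valu_ops = VALU_FACTOR * ((add + mul + trans) + fma * 2)
    mfma_ops = MFMA_FACTOR * mfma_mops
    return (valu_ops + mfma_ops) / denom


def ipc_issued(insts, active_inst_any):
    """IPC (Issued): sum of issued instruction counters over active cycles."""
    keys = ("VALU", "VMEM", "SALU", "SMEM", "BRANCH", "SENDMSG", "VSKIPPED", "LDS")
    return sum(insts[k] for k in keys) / active_inst_any


# Made-up counter values, for illustration only.
ops = f32_ops(add=10, mul=20, trans=5, fma=30, mfma_mops=2, denom=1)
ipc = ipc_issued(
    {"VALU": 100, "VMEM": 20, "SALU": 30, "SMEM": 10,
     "BRANCH": 5, "SENDMSG": 1, "VSKIPPED": 0, "LDS": 14},
    active_inst_any=360,
)
print(ops, ipc)
```

Because the multiplier is a constant, the result is an upper bound that does not reflect how many lanes were actually active for a given instruction.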
|
||||
|
||||
+52
-51
@@ -2,51 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1200
|
||||
title: Local Data Share (LDS)
|
||||
metrics_description:
|
||||
Utilization: Indicates what percent of the kernel's duration the LDS was actively
|
||||
executing instructions (including, but not limited to, load, store, atomic and
|
||||
HIP's __shfl operations). Calculated as the ratio of the total number of cycles
|
||||
LDS was active over the total CU cycles.
|
||||
Access Rate: Indicates the percentage of SIMDs in the VALU actively issuing LDS
|
||||
instructions, averaged over the lifetime of the kernel. Calculated as the ratio
|
||||
of the total number of cycles spent by the scheduler issuing LDS instructions
|
||||
over the total CU cycles.
|
||||
Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
|
||||
could have been loaded from, stored to, or atomically updated in the LDS, expressed
|
||||
as a percentage of the theoretical peak. Does not take into account the execution
|
||||
mask of the wavefront when the instruction was executed.
|
||||
Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
|
||||
loaded from, stored to, or atomically updated in the LDS divided by total duration.
|
||||
Does not take into account the execution mask of the wavefront when the instruction
|
||||
was executed.
|
||||
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
|
||||
servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing
|
||||
bank conflicts over the number of LDS cycles that would have been required to
|
||||
move the same amount of data in an uncontended access.
|
||||
LDS Instructions: The total number of LDS instructions (including, but not limited
|
||||
to, read/write/atomics and HIP's __shfl instructions) executed per normalization
|
||||
unit.
|
||||
LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
|
||||
/ acknowledgment) required for an LDS instruction to complete.
|
||||
Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS scheduler
|
||||
due to bank conflicts (as determined by the conflict resolution hardware) to
|
||||
the base number of cycles that would be spent in the LDS scheduler in a completely
|
||||
uncontended case. This is the unnormalized form of the Bank Conflict Rate.
|
||||
Index Accesses: The total number of cycles spent in the LDS scheduler over all
|
||||
operations per normalization unit.
|
||||
Atomic Return Cycles: The total number of cycles spent on LDS atomics with return
|
||||
per normalization unit.
|
||||
Bank Conflict: The total number of cycles spent in the LDS scheduler due to bank
|
||||
conflicts (as determined by the conflict resolution hardware) per normalization
|
||||
unit.
|
||||
Addr Conflict: The total number of cycles spent in the LDS scheduler due to address
|
||||
conflicts (as determined by the conflict resolution hardware) per normalization
|
||||
unit.
|
||||
Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
|
||||
stalls from non-dword aligned addresses per normalization unit.
|
||||
Mem Violations: "The total number of out-of-bounds accesses made to the LDS, per\
|
||||
\ normalization unit. This is unused and expected to be zero in most configurations\
|
||||
\ for modern CDNA\u2122 accelerators."
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1201
|
||||
@@ -87,7 +42,7 @@ Panel Config:
|
||||
avg: AVG((SQ_INSTS_LDS / $denom))
|
||||
min: MIN((SQ_INSTS_LDS / $denom))
|
||||
max: MAX((SQ_INSTS_LDS / $denom))
|
||||
unit: (Instr + $normUnit)
|
||||
unit: (Instr + $normUnit)
|
||||
Theoretical Bandwidth:
|
||||
avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
|
||||
/ (End_Timestamp - Start_Timestamp)))
|
||||
@@ -117,29 +72,75 @@ Panel Config:
|
||||
avg: AVG((SQ_LDS_IDX_ACTIVE / $denom))
|
||||
min: MIN((SQ_LDS_IDX_ACTIVE / $denom))
|
||||
max: MAX((SQ_LDS_IDX_ACTIVE / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
Atomic Return Cycles:
|
||||
avg: AVG((SQ_LDS_ATOMIC_RETURN / $denom))
|
||||
min: MIN((SQ_LDS_ATOMIC_RETURN / $denom))
|
||||
max: MAX((SQ_LDS_ATOMIC_RETURN / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
Bank Conflict:
|
||||
avg: AVG((SQ_LDS_BANK_CONFLICT / $denom))
|
||||
min: MIN((SQ_LDS_BANK_CONFLICT / $denom))
|
||||
max: MAX((SQ_LDS_BANK_CONFLICT / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
Addr Conflict:
|
||||
avg: AVG((SQ_LDS_ADDR_CONFLICT / $denom))
|
||||
min: MIN((SQ_LDS_ADDR_CONFLICT / $denom))
|
||||
max: MAX((SQ_LDS_ADDR_CONFLICT / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
Unaligned Stall:
|
||||
avg: AVG((SQ_LDS_UNALIGNED_STALL / $denom))
|
||||
min: MIN((SQ_LDS_UNALIGNED_STALL / $denom))
|
||||
max: MAX((SQ_LDS_UNALIGNED_STALL / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
Mem Violations:
|
||||
avg: AVG((SQ_LDS_MEM_VIOLATIONS / $denom))
|
||||
min: MIN((SQ_LDS_MEM_VIOLATIONS / $denom))
|
||||
max: MAX((SQ_LDS_MEM_VIOLATIONS / $denom))
|
||||
unit: (Accesses + $normUnit)
|
||||
metrics_description:
|
||||
Utilization: Indicates what percent of the kernel's duration the LDS was actively
|
||||
executing instructions (including, but not limited to, load, store, atomic and
|
||||
HIP's __shfl operations). Calculated as the ratio of the total number of cycles
|
||||
LDS was active over the total CU cycles.
|
||||
Access Rate: Indicates the percentage of SIMDs in the VALU actively issuing LDS
|
||||
instructions, averaged over the lifetime of the kernel. Calculated as the ratio
|
||||
of the total number of cycles spent by the scheduler issuing LDS instructions
|
||||
over the total CU cycles.
|
||||
Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
|
||||
could have been loaded from, stored to, or atomically updated in the LDS, expressed
|
||||
as a percentage of the theoretical peak. Does not take into account the execution
|
||||
mask of the wavefront when the instruction was executed.
|
||||
Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
|
||||
loaded from, stored to, or atomically updated in the LDS divided by total duration.
|
||||
Does not take into account the execution mask of the wavefront when the instruction
|
||||
was executed.
|
||||
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
|
||||
servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing
|
||||
bank conflicts over the number of LDS cycles that would have been required to
|
||||
move the same amount of data in an uncontended access.
|
||||
LDS Instructions: The total number of LDS instructions (including, but not limited
|
||||
to, read/write/atomics and HIP's __shfl instructions) executed per normalization
|
||||
unit.
|
||||
LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
|
||||
acknowledgment) required for an LDS instruction to complete.
|
||||
Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS scheduler
|
||||
due to bank conflicts (as determined by the conflict resolution hardware) to
|
||||
the base number of cycles that would be spent in the LDS scheduler in a completely
|
||||
uncontended case. This is the unnormalized form of the Bank Conflict Rate.
|
||||
Index Accesses: The total number of cycles spent in the LDS scheduler over all
|
||||
operations per normalization unit.
|
||||
Atomic Return Cycles: The total number of cycles spent on LDS atomics with return
|
||||
per normalization unit.
|
||||
Bank Conflict: The total number of cycles spent in the LDS scheduler due to bank
|
||||
conflicts (as determined by the conflict resolution hardware) per normalization
|
||||
unit.
|
||||
Addr Conflict: The total number of cycles spent in the LDS scheduler due to address
|
||||
conflicts (as determined by the conflict resolution hardware) per normalization
|
||||
unit.
|
||||
Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
|
||||
stalls from non-dword aligned addresses per normalization unit.
|
||||
Mem Violations: |-
|
||||
The total number of out-of-bounds accesses made to the LDS, per normalization
|
||||
unit. This is unused and expected to be zero in most configurations for
|
||||
modern CDNA™ accelerators.
|
||||
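The Theoretical Bandwidth expression in this panel and the Bank Conflicts/Access description can be read as plain arithmetic on hardware counters. The following is a minimal sketch, not the profiler's implementation: the function names and sample values are hypothetical, the bandwidth formula mirrors the `SQ_LDS_IDX_ACTIVE` / `SQ_LDS_BANK_CONFLICT` expression shown in the config, and the conflicts-per-access ratio is an assumption derived from the prose description (conflict cycles over the cycles an uncontended access would have needed).

```python
def lds_theoretical_bandwidth(idx_active, bank_conflict, lds_banks_per_cu,
                              start_ts, end_ts):
    """Bytes moved per timestamp tick, following the config expression
    ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4 * lds_banks_per_cu)
    / (End_Timestamp - Start_Timestamp)."""
    duration = end_ts - start_ts
    if duration <= 0:
        # Guard against a zero or negative interval (assumed behavior).
        return None
    return (idx_active - bank_conflict) * 4 * lds_banks_per_cu / duration


def bank_conflicts_per_access(idx_active, bank_conflict):
    """Assumed form of Bank Conflicts/Access: conflict cycles divided by
    the cycles that an uncontended access would have required."""
    uncontended = idx_active - bank_conflict
    if uncontended <= 0:
        return None
    return bank_conflict / uncontended


# Hypothetical counter values for a single kernel dispatch.
bandwidth = lds_theoretical_bandwidth(1000, 200, 32, 0, 400)
conflicts = bank_conflicts_per_access(1000, 200)
print(bandwidth, conflicts)
```

Returning `None` for a degenerate interval is a design choice of this sketch; the config expressions themselves do not show how that case is handled.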
|
||||
+26
-26
@@ -2,28 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1300
|
||||
title: Instruction Cache
|
||||
metrics_description:
|
||||
Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
|
||||
of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
|
||||
the total L1I cycles.
|
||||
Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
|
||||
loaded line the cache. Calculated as the ratio of the number of L1I requests
|
||||
that hit over the number of all L1I requests.
|
||||
L1I-L2 Bandwidth Utilization: "The percent of the peak theoretical L1I \u2192\
|
||||
\ L2 cache request bandwidth achieved. Calculated as the ratio of the total\
|
||||
\ number of requests from the L1I to the L2 cache over the total L1I-L2 interface\
|
||||
\ cycles."
|
||||
L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
|
||||
divided by total duration.
|
||||
Req: The total number of requests made to the L1I per normalization-unit
|
||||
Hits: The total number of L1I requests that hit on a previously loaded cache line,
|
||||
per normalization-unit.
|
||||
Misses - Non Duplicated: The total number of L1I requests that missed on a cache
|
||||
line that were not already pending due to another request, per normalization-unit.
|
||||
Misses - Duplicated: The total number of L1I requests that missed on a cache line
|
||||
that were already pending due to another request, per normalization-unit.
|
||||
Instruction Fetch Latency: The average number of cycles spent to fetch instructions
|
||||
to a CU.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1301
|
||||
@@ -62,22 +40,22 @@ Panel Config:
|
||||
avg: AVG((SQC_ICACHE_REQ / $denom))
|
||||
min: MIN((SQC_ICACHE_REQ / $denom))
|
||||
max: MAX((SQC_ICACHE_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Hits:
|
||||
avg: AVG((SQC_ICACHE_HITS / $denom))
|
||||
min: MIN((SQC_ICACHE_HITS / $denom))
|
||||
max: MAX((SQC_ICACHE_HITS / $denom))
|
||||
unit: (Hits + $normUnit)
|
||||
unit: (Hits + $normUnit)
|
||||
Misses - Non Duplicated:
|
||||
avg: AVG((SQC_ICACHE_MISSES / $denom))
|
||||
min: MIN((SQC_ICACHE_MISSES / $denom))
|
||||
max: MAX((SQC_ICACHE_MISSES / $denom))
|
||||
unit: (Misses + $normUnit)
|
||||
unit: (Misses + $normUnit)
|
||||
Misses - Duplicated:
|
||||
avg: AVG((SQC_ICACHE_MISSES_DUPLICATE / $denom))
|
||||
min: MIN((SQC_ICACHE_MISSES_DUPLICATE / $denom))
|
||||
max: MAX((SQC_ICACHE_MISSES_DUPLICATE / $denom))
|
||||
unit: (Misses + $normUnit)
|
||||
unit: (Misses + $normUnit)
|
||||
Cache Hit Rate:
|
||||
avg: AVG(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
|
||||
+ SQC_ICACHE_MISSES_DUPLICATE)))
|
||||
@@ -107,3 +85,25 @@ Panel Config:
|
||||
min: MIN(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
max: MAX(((SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: Gbps
|
||||
metrics_description:
|
||||
Bandwidth Utilization: The number of bytes looked up in the L1I cache, as a percent
|
||||
of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over
|
||||
the total L1I cycles.
|
||||
Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
|
||||
loaded line in the cache. Calculated as the ratio of the number of L1I requests
|
||||
that hit over the number of all L1I requests.
|
||||
L1I-L2 Bandwidth Utilization: |-
|
||||
The percent of the peak theoretical L1I → L2 cache request bandwidth
|
||||
achieved. Calculated as the ratio of the total number of requests from the
|
||||
L1I to the L2 cache over the total L1I-L2 interface cycles.
|
||||
L1I-L2 Bandwidth: Total number of bytes transferred across L1I - L2 interface
|
||||
divided by total duration.
|
||||
Req: The total number of requests made to the L1I per normalization-unit.
|
||||
Hits: The total number of L1I requests that hit on a previously loaded cache line,
|
||||
per normalization-unit.
|
||||
Misses - Non Duplicated: The total number of L1I requests that missed on a cache
|
||||
line that was not already pending due to another request, per normalization-unit.
|
||||
Misses - Duplicated: The total number of L1I requests that missed on a cache line
|
||||
that was already pending due to another request, per normalization-unit.
|
||||
Instruction Fetch Latency: The average number of cycles spent to fetch instructions
|
||||
to a CU.
|
||||
|
||||
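The Cache Hit Rate and L1I-L2 Bandwidth expressions in this panel reduce to two short formulas. The sketch below is illustrative only: the function names and inputs are hypothetical, the hit-rate formula follows the `SQC_ICACHE_HITS` / `SQC_ICACHE_MISSES` / `SQC_ICACHE_MISSES_DUPLICATE` expression in the config, and the factor of 64 is copied from the bandwidth expression (presumably bytes per request, which the config does not state).

```python
def l1i_hit_rate(hits, misses, misses_duplicate):
    """Percent of L1I requests that hit: 100 * hits / all requests,
    where all requests = hits + misses + duplicated misses."""
    total = hits + misses + misses_duplicate
    if total == 0:
        # No requests recorded; avoid dividing by zero (assumed behavior).
        return None
    return 100 * hits / total


def l1i_l2_bandwidth(tc_inst_req, start_ts, end_ts):
    """Bytes per timestamp tick across the L1I-L2 interface, following
    (SQC_TC_INST_REQ * 64) / (End_Timestamp - Start_Timestamp)."""
    duration = end_ts - start_ts
    if duration <= 0:
        return None
    return tc_inst_req * 64 / duration


# Hypothetical counter values for a single kernel dispatch.
print(l1i_hit_rate(90, 8, 2))
print(l1i_l2_bandwidth(10, 100, 740))
```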
+61
-58
@@ -2,49 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1400
|
||||
title: Scalar L1 Data Cache
|
||||
metrics_description:
|
||||
Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
|
||||
of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
|
||||
over the total sL1D cycles.
|
||||
Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
|
||||
loaded line the cache. The ratio of the number of sL1D requests that hit over
|
||||
the number of all sL1D requests.
|
||||
sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
|
||||
bandwidth acheived.\ \ Caclulated as total number of bytes read from, written
|
||||
to, or atomically updated\ \ across the sL1D - L2 interface.
|
||||
sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
|
||||
\ across the sL1D\u2194L2 interface, divided by total duration. Note that sL1D\
|
||||
\ writes and atomics are typically unused on current CDNA accelerators, so in\
|
||||
\ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
|
||||
Req: The total number of requests, of any size or type, made to the sL1D per normalization
|
||||
unit.
|
||||
Hits: The total number of sL1D requests that hit on a previously loaded cache
|
||||
line, per normalization unit.
|
||||
Misses - Non Duplicated: 'The total number of sL1D requests that missed on a cache
|
||||
line that was not already pending due to another request, per normalization
|
||||
unit. '
|
||||
Misses- Duplicated: The total number of sL1D requests that missed on a cache line
|
||||
that was already pending due to another request, per normalization unit.
|
||||
Read Req (Total): The total number of sL1D read requests of any size, per normalization
|
||||
unit.
|
||||
Atomic Req: The total number of atomic requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
Read Req (1 DWord): The total number of sL1D read requests made for a single dword
|
||||
of data (4B), per normalization unit.
|
||||
Read Req (2 DWord): The total number of sL1D read requests made for a two dwords
|
||||
of data (8B), per normalization unit.
|
||||
Read Req (4 DWord): The total number of sL1D read requests made for a four dwords
|
||||
of data (16B), per normalization unit.
|
||||
Read Req (8 DWord): The total number of sL1D read requests made for a eight dwords
|
||||
of data (32B), per normalization unit.
|
||||
Read Req (16 DWord): The total number of sL1D read requests made for a sixteen
|
||||
dwords of data (64B), per normalization unit.
|
||||
Read Req: The total number of read requests from sL1D to the L2 per normalization
|
||||
unit.
|
||||
Write Req: The total number of write requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
Stall Cycles: "The total number of cycles the sL1D\u2194L2 interface was stalled,\
|
||||
\ per normalization unit."
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1401
|
||||
@@ -84,22 +41,22 @@ Panel Config:
|
||||
avg: AVG((SQC_DCACHE_REQ / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Hits:
|
||||
avg: AVG((SQC_DCACHE_HITS / $denom))
|
||||
min: MIN((SQC_DCACHE_HITS / $denom))
|
||||
max: MAX((SQC_DCACHE_HITS / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Misses - Non Duplicated:
|
||||
avg: AVG((SQC_DCACHE_MISSES / $denom))
|
||||
min: MIN((SQC_DCACHE_MISSES / $denom))
|
||||
max: MAX((SQC_DCACHE_MISSES / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Misses- Duplicated:
|
||||
avg: AVG((SQC_DCACHE_MISSES_DUPLICATE / $denom))
|
||||
min: MIN((SQC_DCACHE_MISSES_DUPLICATE / $denom))
|
||||
max: MAX((SQC_DCACHE_MISSES_DUPLICATE / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Cache Hit Rate:
|
||||
avg: AVG((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
+ SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
|
||||
@@ -118,37 +75,37 @@ Panel Config:
|
||||
+ SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
|
||||
max: MAX((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
|
||||
+ SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Atomic Req:
|
||||
avg: AVG((SQC_DCACHE_ATOMIC / $denom))
|
||||
min: MIN((SQC_DCACHE_ATOMIC / $denom))
|
||||
max: MAX((SQC_DCACHE_ATOMIC / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (1 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_1 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_1 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_1 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (2 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_2 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_2 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_2 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (4 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_4 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_4 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_4 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (8 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_8 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_8 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_8 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read Req (16 DWord):
|
||||
avg: AVG((SQC_DCACHE_REQ_READ_16 / $denom))
|
||||
min: MIN((SQC_DCACHE_REQ_READ_16 / $denom))
|
||||
max: MAX((SQC_DCACHE_REQ_READ_16 / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1403
|
||||
title: Scalar L1D Cache - L2 Interface
|
||||
@@ -171,19 +128,65 @@ Panel Config:
|
||||
avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_READ_REQ / $denom))
|
||||
max: MAX((SQC_TC_DATA_READ_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Write Req:
|
||||
avg: AVG((SQC_TC_DATA_WRITE_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_WRITE_REQ / $denom))
|
||||
max: MAX((SQC_TC_DATA_WRITE_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Atomic Req:
|
||||
avg: AVG((SQC_TC_DATA_ATOMIC_REQ / $denom))
|
||||
min: MIN((SQC_TC_DATA_ATOMIC_REQ / $denom))
|
||||
max: MAX((SQC_TC_DATA_ATOMIC_REQ / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Stall Cycles:
|
||||
avg: AVG((SQC_TC_STALL / $denom))
|
||||
min: MIN((SQC_TC_STALL / $denom))
|
||||
max: MAX((SQC_TC_STALL / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
metrics_description:
|
||||
Bandwidth Utilization: The number of bytes looked up in the sL1D cache, as a percent
|
||||
of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests
|
||||
over the total sL1D cycles.
|
||||
Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
|
||||
loaded line in the cache. The ratio of the number of sL1D requests that hit over
|
||||
the number of all sL1D requests.
|
||||
sL1D-L2 BW Utilization: The percentage of the peak theoretical sL1D - L2 interface
|
||||
bandwidth achieved. Calculated as the total number of bytes read from, written to,
|
||||
or atomically updated across the sL1D - L2 interface.
|
||||
sL1D-L2 BW: |-
|
||||
The total number of bytes read from, written to, or atomically updated
|
||||
across the sL1D↔L2 interface, divided by total duration. Note that sL1D
|
||||
writes and atomics are typically unused on current CDNA accelerators, so
|
||||
in the majority of cases this can be interpreted as an sL1D→L2 read
|
||||
bandwidth.
|
||||
Req: The total number of requests, of any size or type, made to the sL1D per normalization
|
||||
unit.
|
||||
Hits: The total number of sL1D requests that hit on a previously loaded cache
|
||||
line, per normalization unit.
|
||||
Misses - Non Duplicated: |-
|
||||
The total number of sL1D requests that missed on a cache line that was
|
||||
not already pending due to another request, per normalization unit.
|
||||
Misses- Duplicated: The total number of sL1D requests that missed on a cache line
|
||||
that was already pending due to another request, per normalization unit.
|
||||
Read Req (Total): The total number of sL1D read requests of any size, per normalization
|
||||
unit.
|
||||
Atomic Req: The total number of atomic requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
Read Req (1 DWord): The total number of sL1D read requests made for a single dword
|
||||
of data (4B), per normalization unit.
|
||||
Read Req (2 DWord): The total number of sL1D read requests made for two dwords
|
||||
of data (8B), per normalization unit.
|
||||
Read Req (4 DWord): The total number of sL1D read requests made for four dwords
|
||||
of data (16B), per normalization unit.
|
||||
Read Req (8 DWord): The total number of sL1D read requests made for eight dwords
|
||||
of data (32B), per normalization unit.
|
||||
Read Req (16 DWord): The total number of sL1D read requests made for sixteen
|
||||
dwords of data (64B), per normalization unit.
|
||||
Read Req: The total number of read requests from sL1D to the L2 per normalization
|
||||
unit.
|
||||
Write Req: The total number of write requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
Stall Cycles: |-
|
||||
The total number of cycles the sL1D↔L2 interface was stalled, per
|
||||
normalization unit.
|
||||
|
||||
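The per-size read request metrics above relate to Read Req (Total) by simple summation, and the byte sizes listed in the descriptions (4B through 64B) allow a rough estimate of bytes requested. The sketch below assumes a hypothetical dictionary keyed by dword width; the request total mirrors the `SQC_DCACHE_REQ_READ_1` through `SQC_DCACHE_REQ_READ_16` sum in the config, while the byte total is an illustration derived from the descriptions and is not a metric defined in this panel.

```python
DWORD_BYTES = 4  # One dword is 4 bytes, per the metric descriptions.


def sl1d_read_totals(read_counts):
    """read_counts maps dword width (1, 2, 4, 8, 16) to a request count.

    Returns (total read requests, total bytes requested)."""
    total_requests = sum(read_counts.values())
    total_bytes = sum(width * DWORD_BYTES * count
                      for width, count in read_counts.items())
    return total_requests, total_bytes


# Hypothetical counter values for a single kernel dispatch.
counts = {1: 10, 2: 5, 4: 2, 8: 1, 16: 1}
requests, nbytes = sl1d_read_totals(counts)
print(requests, nbytes)
```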
+77
-80
@@ -2,70 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1500
|
||||
title: Address Processing Unit and Data Return Path (TA/TD)
|
||||
metrics_description:
|
||||
Address Processing Unit Busy: Percent of the total CU cycles the address processor
|
||||
was busy
|
||||
Address Stall: Percent of the total CU cycles the address processor was stalled
|
||||
from sending address requests further into the vL1D pipeline.
|
||||
Data Stall: Percent of the total CU cycles the address processor was stalled from
|
||||
sending write/atomic data further into the vL1D pipeline.
|
||||
"Data-Processor \u2192 Address Stall": Percent of total CU cycles the address
|
||||
processor was stalled waiting to send command data to the data processor.
|
||||
Total Instructions: The total number of memory instructions executed by the address
|
||||
processer over all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Instructions: The total number of global & generic memory instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Read Instructions: The total number of global & generic memory
|
||||
read instructions executed on all compute units on the accelerator, per normalization
|
||||
unit.
|
||||
Global/Generic Write Instructions: The total number of global & generic memory
|
||||
write instructions executed on all compute units on the accelerator, per normalization
|
||||
unit.
|
||||
Global/Generic Atomic Instructions: The total number of global & generic memory
|
||||
atomic (with and without return) instructions executed on all compute units
|
||||
on the accelerator, per normalization unit.
|
||||
Spill/Stack Instructions: The total number of spill/stack memory instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Read Instructions: The total number of spill/stack memory read instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Write Instructions: The total number of spill/stack memory write instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Atomic Instructions: The total number of spill/stack memory atomic
|
||||
(with and without return) instructions executed on all compute units on the
|
||||
accelerator, per normalization unit. Typically unused as these memory operations
|
||||
are typically used to implement thread-local storage.
|
||||
Spill/Stack Total Cycles: The number of cycles the address processing unit spent
|
||||
working on spill/stack instructions, per normalization unit.
|
||||
Spill/Stack Coalesced Read: The number of cycles the address processing unit spent
|
||||
working on coalesced spill/stack read instructions, per normalization unit.
|
||||
Spill/Stack Coalesced Write: The number of cycles the address processing unit
|
||||
spent working on coalesced spill/stack write instructions, per normalization
|
||||
unit.
|
||||
Data-Return Busy: Percent of the total CU cycles the data-return unit was busy
|
||||
processing or waiting on data to return to the CU.
|
||||
"Cache RAM \u2192 Data-Return Stall": Percent of the total CU cycles the data-return
|
||||
unit was stalled on data to be returned from the vL1D Cache RAM.
|
||||
"Workgroup manager \u2192 Data-Return Stall": Percent of the total CU cycles the
|
||||
data-return unit was stalled by the workgroup manager due to initialization
|
||||
of registers as a part of launching new workgroups.
|
||||
Coalescable Instructions: The number of instructions submitted to the data-return
|
||||
unit by the address processor that were found to be coalescable, per normalization
|
||||
unit.
|
||||
Read Instructions: The number of read instructions submitted to the data-return
|
||||
unit by the address processor summed over all compute units on the accelerator,
|
||||
per normalization unit. This is expected to be the sum of global/generic and
|
||||
spill/stack reads in the address processor.
|
||||
Write Instructions: The number of store instructions submitted to the data-return
|
||||
unit by the address processor summed over all compute units on the accelerator,
|
||||
per normalization unit. This is expected to be the sum of global/generic and
|
||||
spill/stack stores in the address processor.
|
||||
Atomic Instructions: The number of atomic instructions submitted to the data-return
|
||||
unit by the address processor summed over all compute units on the accelerator,
|
||||
per normalization unit. This is expected to be the sum of global/generic and
|
||||
spill/stack atomics in the address processor.
|
||||
Write Ack Instructions: The total number of write acknowledgements submitted by
|
||||
data-return unit to SQ, summed over all compute units on the accelerator, per
|
||||
normalization unit.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1501
|
||||
@@ -135,47 +71,47 @@ Panel Config:
|
||||
avg: AVG((TA_TOTAL_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_TOTAL_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_TOTAL_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Global/Generic Instructions:
|
||||
avg: AVG((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Global/Generic Read Instructions:
|
||||
avg: AVG((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Global/Generic Write Instructions:
|
||||
avg: AVG((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Global/Generic Atomic Instructions:
|
||||
avg: AVG((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Spill/Stack Instructions:
|
||||
avg: AVG((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Spill/Stack Read Instructions:
|
||||
avg: AVG((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Spill/Stack Write Instructions:
|
||||
avg: AVG((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Spill/Stack Atomic Instructions:
|
||||
avg: AVG((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
- metric_table:
|
||||
id: 1503
|
||||
title: Spill and stack metrics
|
||||
@@ -190,17 +126,17 @@ Panel Config:
|
||||
avg: AVG((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
|
||||
min: MIN((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
|
||||
max: MAX((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
Spill/Stack Coalesced Read:
|
||||
avg: AVG((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
|
||||
min: MIN((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
|
||||
max: MAX((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
Spill/Stack Coalesced Write:
|
||||
avg: AVG((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
|
||||
min: MIN((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
|
||||
max: MAX((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
unit: (Cycles + $normUnit)
|
||||
- metric_table:
|
||||
id: 1504
|
||||
title: Vector L1 data-return path or Texture Data (TD)
|
||||
@@ -230,7 +166,7 @@ Panel Config:
|
||||
avg: AVG((TD_COALESCABLE_WAVEFRONT_sum / $denom))
|
||||
min: MIN((TD_COALESCABLE_WAVEFRONT_sum / $denom))
|
||||
max: MAX((TD_COALESCABLE_WAVEFRONT_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Read Instructions:
|
||||
avg: AVG((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
|
||||
/ $denom))
|
||||
@@ -238,14 +174,75 @@ Panel Config:
|
||||
/ $denom))
|
||||
max: MAX((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
|
||||
/ $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Write Instructions:
|
||||
avg: AVG((TD_STORE_WAVEFRONT_sum / $denom))
|
||||
min: MIN((TD_STORE_WAVEFRONT_sum / $denom))
|
||||
max: MAX((TD_STORE_WAVEFRONT_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
Atomic Instructions:
|
||||
avg: AVG((TD_ATOMIC_WAVEFRONT_sum / $denom))
|
||||
min: MIN((TD_ATOMIC_WAVEFRONT_sum / $denom))
|
||||
max: MAX((TD_ATOMIC_WAVEFRONT_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
unit: (Instructions + $normUnit)
|
||||
metrics_description:
|
||||
Address Processing Unit Busy: Percent of the total CU cycles the address processor
|
||||
was busy.
|
||||
Address Stall: Percent of the total CU cycles the address processor was stalled
|
||||
from sending address requests further into the vL1D pipeline.
|
||||
Data Stall: Percent of the total CU cycles the address processor was stalled from
|
||||
sending write/atomic data further into the vL1D pipeline.
|
||||
"Data-Processor \u2192 Address Stall": Percent of total CU cycles the address
|
||||
processor was stalled waiting to send command data to the data processor.
|
||||
Total Instructions: The total number of memory instructions executed by the address
|
||||
processor over all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Instructions: The total number of global & generic memory instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Read Instructions: The total number of global & generic memory
|
||||
read instructions executed on all compute units on the accelerator, per normalization
|
||||
unit.
|
||||
Global/Generic Write Instructions: The total number of global & generic memory
|
||||
write instructions executed on all compute units on the accelerator, per normalization
|
||||
unit.
|
||||
Global/Generic Atomic Instructions: The total number of global & generic memory
|
||||
atomic (with and without return) instructions executed on all compute units
|
||||
on the accelerator, per normalization unit.
|
||||
Spill/Stack Instructions: The total number of spill/stack memory instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Read Instructions: The total number of spill/stack memory read instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Write Instructions: The total number of spill/stack memory write instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Atomic Instructions: The total number of spill/stack memory atomic
|
||||
(with and without return) instructions executed on all compute units on the
|
||||
accelerator, per normalization unit. Typically unused as these memory operations
|
||||
are generally used to implement thread-local storage.
|
||||
Spill/Stack Total Cycles: The number of cycles the address processing unit spent
|
||||
working on spill/stack instructions, per normalization unit.
|
||||
Spill/Stack Coalesced Read: The number of cycles the address processing unit spent
|
||||
working on coalesced spill/stack read instructions, per normalization unit.
|
||||
Spill/Stack Coalesced Write: The number of cycles the address processing unit
|
||||
spent working on coalesced spill/stack write instructions, per normalization
|
||||
unit.
|
||||
Data-Return Busy: Percent of the total CU cycles the data-return unit was busy
|
||||
processing or waiting on data to return to the CU.
|
||||
"Cache RAM \u2192 Data-Return Stall": Percent of the total CU cycles the data-return
|
||||
unit was stalled on data to be returned from the vL1D Cache RAM.
|
||||
"Workgroup manager \u2192 Data-Return Stall": Percent of the total CU cycles the
|
||||
data-return unit was stalled by the workgroup manager due to initialization
|
||||
of registers as a part of launching new workgroups.
|
||||
Coalescable Instructions: The number of instructions submitted to the data-return
|
||||
unit by the address processor that were found to be coalescable, per normalization
|
||||
unit.
|
||||
Read Instructions: The number of read instructions submitted to the data-return
|
||||
unit by the address processor summed over all compute units on the accelerator,
|
||||
per normalization unit. This is expected to be the sum of global/generic and
|
||||
spill/stack reads in the address processor.
|
||||
Write Instructions: The number of store instructions submitted to the data-return
|
||||
unit by the address processor summed over all compute units on the accelerator,
|
||||
per normalization unit. This is expected to be the sum of global/generic and
|
||||
spill/stack stores in the address processor.
|
||||
Atomic Instructions: The number of atomic instructions submitted to the data-return
|
||||
unit by the address processor summed over all compute units on the accelerator,
|
||||
per normalization unit. This is expected to be the sum of global/generic and
|
||||
spill/stack atomics in the address processor.
|
||||
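The metric entries above all follow the pattern `AVG((COUNTER_sum / $denom))`, `MIN(...)`, `MAX(...)`. A minimal sketch of that pattern is shown below, assuming one counter value and one normalization divisor per kernel dispatch. The function name `normalized_stats` and the per-dispatch list layout are hypothetical and are not taken from the rocprofiler-compute sources; how the tool actually resolves `$denom` is not shown in this diff.

```python
from statistics import mean


def normalized_stats(counter_per_dispatch, denom_per_dispatch):
    """Return (avg, min, max) of counter / denom over all dispatches.

    Illustrative only: mirrors the AVG/MIN/MAX((X_sum / $denom)) expressions
    in the YAML, under the assumption that each dispatch contributes one row.
    """
    if len(counter_per_dispatch) != len(denom_per_dispatch):
        raise ValueError("counter and denom lists must have the same length")
    # Normalize each dispatch first, then aggregate the normalized values.
    ratios = [c / d for c, d in zip(counter_per_dispatch, denom_per_dispatch)]
    return mean(ratios), min(ratios), max(ratios)


# Example: a counter such as TD_ATOMIC_WAVEFRONT_sum for two dispatches,
# normalized by a hypothetical divisor of 10 for each dispatch.
avg, lo, hi = normalized_stats([100, 300], [10, 10])
```

Normalizing per dispatch before aggregating matches the order of operations in the expressions, where the division sits inside the `AVG`, `MIN`, and `MAX` calls.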
|
||||
+124
-132
@@ -2,117 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1600
|
||||
title: Vector L1 Data Cache
|
||||
metrics_description:
|
||||
Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
|
||||
cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
|
||||
of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator. The number of bytes is calculated as the number
|
||||
of cache lines requested multiplied by the cache line size. This value does
|
||||
not consider partial requests, so for instance, if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
|
||||
The number of cycles where the vL1D Cache RAM is actively processing any request
|
||||
divided by the number of cycles where the vL1D is active.
|
||||
Coalescing: Indicates how well memory instructions were coalesced by the address
|
||||
processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
|
||||
as the average number of thread-requests generated per instruction divided by
|
||||
the ideal number of thread-requests per instruction.
|
||||
Stalled on L2 Data: The ratio of the number of cycles where the vL1D is stalled
|
||||
waiting for requested data to return from the L2 cache divided by the number
|
||||
of cycles where the vL1D is active.
|
||||
Stalled on L2 Req: The ratio of the number of cycles where the vL1D is stalled
|
||||
waiting to issue a request for data to the L2 cache divided by the number of
|
||||
cycles where the vL1D is active.
|
||||
Tag RAM Stall (Read): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Read requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Tag RAM Stall (Write): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Write requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Tag RAM Stall (Atomic): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Atomic requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Total Req: The total number of incoming requests from the address processing unit
|
||||
after coalescing.
|
||||
Read Req: The total number of incoming read requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Write Req: The total number of incoming write requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Atomic Req: The total number of incoming atomic requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
|
||||
instructions divided by total duration. The number of bytes is calculated as
|
||||
the number of cache lines requested multiplied by the cache line size. This
|
||||
value does not consider partial requests, so for instance, if only a single
|
||||
value is requested in a cache line, the data movement will still be counted
|
||||
as a full cache line.
|
||||
Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
|
||||
vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
Cache Accesses: The total number of cache line lookups in the vL1D.
|
||||
Cache Hits: The number of cache accesses minus the number of outgoing requests
|
||||
to the L2 cache, that is, the number of cache line requests serviced by the
|
||||
vL1D Cache RAM per normalization unit.
|
||||
Invalidations: The number of times the vL1D was issued a write-back invalidate
|
||||
command during the kernel's execution per normalization unit. This may be triggered
|
||||
by, for instance, the buffer_wbinvl1 instruction.
|
||||
L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
|
||||
of VMEM instructions, divided by total duration. The number of bytes is calculated
|
||||
as the number of cache lines requested multiplied by the cache line size. This
|
||||
value does not consider partial requests, so for instance, if only a single
|
||||
value is requested in a cache line, the data movement will still be counted
|
||||
as a full cache line.
|
||||
L1-L2 Read: The number of read requests for a vL1D cache line that were not satisfied
|
||||
by the vL1D and must be retrieved from the L2 cache, per normalization
|
||||
unit.
|
||||
L1-L2 Write: The number of write requests to a vL1D cache line that were sent
|
||||
through the vL1D to the L2 cache, per normalization unit.
|
||||
L1-L2 Atomic: The number of atomic requests that are sent through the vL1D to
|
||||
the L2 cache, per normalization unit. This includes requests for atomics with
|
||||
and without return.
|
||||
L1 Access Latency: Calculated as the average number of cycles that a vL1D cache
|
||||
line request spent in the vL1D cache pipeline.
|
||||
L1-L2 Read Latency: Calculated as the average number of cycles that the vL1D cache
|
||||
took to issue and receive read requests from the L2 Cache. This number also
|
||||
includes requests for atomics with return values.
|
||||
L1-L2 Write Latency: Calculated as the average number of cycles that the vL1D
|
||||
cache took to issue and receive acknowledgement of a write request to the L2
|
||||
Cache. This number also includes requests for atomics without return values.
|
||||
NC - Read: Total read requests with NC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
UC - Read: Total read requests with UC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
CC - Read: Total read requests with CC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
RW - Read: Total read requests with RW mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
RW - Write: Total write requests with RW mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
NC - Write: Total write requests with NC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
UC - Write: Total write requests with UC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
CC - Write: Total write requests with CC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
NC - Atomic: Total atomic requests with NC mtype from this TCP to all TCCs, summed
|
||||
over TCP instances per normalization unit.
|
||||
UC - Atomic: Total atomic requests with UC mtype from this TCP to all TCCs, summed
|
||||
over TCP instances per normalization unit.
|
||||
CC - Atomic: Total atomic requests with CC mtype from this TCP to all TCCs, summed
|
||||
over TCP instances per normalization unit.
|
||||
RW - Atomic: Total atomic requests with RW mtype from this TCP to all TCCs, summed
|
||||
over TCP instances per normalization unit.
|
||||
Req: The number of translation requests made to the UTCL1 per normalization unit.
|
||||
Hit Ratio: The ratio of the number of translation requests that hit in the UTCL1
|
||||
divided by the total number of translation requests made to the UTCL1.
|
||||
Hits: The number of translation requests that hit in the UTCL1, and could be reused,
|
||||
per normalization unit.
|
||||
Translation Misses: The total number of translation requests that missed in the
|
||||
UTCL1 due to translation not being present in the cache, per normalization
|
||||
unit.
|
||||
Permission Misses: "The total number of translation requests that missed in the\
|
||||
\ UTCL1 due to a permission error, per normalization unit. This is unused and\
|
||||
\ expected to be zero in most configurations for modern CDNA\u2122 accelerators."
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1601
|
||||
@@ -181,17 +70,17 @@ Panel Config:
|
||||
avg: AVG((TCP_TOTAL_ACCESSES_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_ACCESSES_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_ACCESSES_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read Req:
|
||||
avg: AVG((TCP_TOTAL_READ_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_READ_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_READ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Write Req:
|
||||
avg: AVG((TCP_TOTAL_WRITE_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_WRITE_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_WRITE_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Atomic Req:
|
||||
avg: AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
|
||||
/ $denom))
|
||||
@@ -199,7 +88,7 @@ Panel Config:
|
||||
/ $denom))
|
||||
max: MAX(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Cache BW:
|
||||
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
|
||||
min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
|
||||
@@ -223,7 +112,7 @@ Panel Config:
|
||||
avg: AVG((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
|
||||
max: MAX((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Cache Hits:
|
||||
avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
@@ -234,7 +123,7 @@ Panel Config:
|
||||
max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
|
||||
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Invalidations:
|
||||
avg: AVG((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
|
||||
min: MIN((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
|
||||
@@ -252,12 +141,12 @@ Panel Config:
|
||||
avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
L1-L2 Write:
|
||||
avg: AVG((TCP_TCC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
L1-L2 Atomic:
|
||||
avg: AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
/ $denom))
|
||||
@@ -265,7 +154,7 @@ Panel Config:
|
||||
/ $denom))
|
||||
max: MAX(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
|
||||
/ $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1604
|
||||
title: L1D - L2 Transactions
|
||||
@@ -284,84 +173,84 @@ Panel Config:
|
||||
avg: AVG((TCP_TCC_NC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_NC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_NC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
UC - Read:
|
||||
xfer: Read
|
||||
coherency: UC
|
||||
avg: AVG((TCP_TCC_UC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_UC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_UC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
CC - Read:
|
||||
xfer: Read
|
||||
coherency: CC
|
||||
avg: AVG((TCP_TCC_CC_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_CC_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_CC_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
RW - Read:
|
||||
xfer: Read
|
||||
coherency: RW
|
||||
avg: AVG((TCP_TCC_RW_READ_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_RW_READ_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_RW_READ_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
RW - Write:
|
||||
xfer: Write
|
||||
coherency: RW
|
||||
avg: AVG((TCP_TCC_RW_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_RW_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_RW_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
NC - Write:
|
||||
xfer: Write
|
||||
coherency: NC
|
||||
avg: AVG((TCP_TCC_NC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_NC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_NC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
UC - Write:
|
||||
xfer: Write
|
||||
coherency: UC
|
||||
avg: AVG((TCP_TCC_UC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_UC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_UC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
CC - Write:
|
||||
xfer: Write
|
||||
coherency: CC
|
||||
avg: AVG((TCP_TCC_CC_WRITE_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_CC_WRITE_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_CC_WRITE_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
NC - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: NC
|
||||
avg: AVG((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
UC - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: UC
|
||||
avg: AVG((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
CC - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: CC
|
||||
avg: AVG((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
RW - Atomic:
|
||||
xfer: Atomic
|
||||
coherency: RW
|
||||
avg: AVG((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
|
||||
min: MIN((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
|
||||
max: MAX((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1605
|
||||
title: L1 Unified Translation Cache (UTCL1)
|
||||
@@ -410,3 +299,106 @@ Panel Config:
|
||||
max: Max
|
||||
units: Unit
|
||||
metric: {}
|
||||
metrics_description:
|
||||
Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
|
||||
cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
Bandwidth Utilization: The number of bytes looked up in the vL1D cache as a result
|
||||
of VMEM instructions, as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator. The number of bytes is calculated as the number
|
||||
of cache lines requested multiplied by the cache line size. This value does
|
||||
not consider partial requests, so for instance, if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
|
||||
The number of cycles where the vL1D Cache RAM is actively processing any request
|
||||
divided by the number of cycles where the vL1D is active.
|
||||
Coalescing: Indicates how well memory instructions were coalesced by the address
|
||||
processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
|
||||
as the average number of thread-requests generated per instruction divided by
|
||||
the ideal number of thread-requests per instruction.
|
||||
Stalled on L2 Data: The ratio of the number of cycles where the vL1D is stalled
|
||||
waiting for requested data to return from the L2 cache divided by the number
|
||||
of cycles where the vL1D is active.
|
||||
Stalled on L2 Req: The ratio of the number of cycles where the vL1D is stalled
|
||||
waiting to issue a request for data to the L2 cache divided by the number of
|
||||
cycles where the vL1D is active.
|
||||
Tag RAM Stall (Read): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Read requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Tag RAM Stall (Write): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Write requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Tag RAM Stall (Atomic): The ratio of the number of cycles where the vL1D is stalled
|
||||
due to Atomic requests with conflicting tags being looked up concurrently, divided
|
||||
by the number of cycles where the vL1D is active.
|
||||
Total Req: The total number of incoming requests from the address processing unit
|
||||
after coalescing.
|
||||
Read Req: The total number of incoming read requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Write Req: The total number of incoming write requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Atomic Req: The total number of incoming atomic requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
|
||||
instructions divided by total duration. The number of bytes is calculated as
|
||||
the number of cache lines requested multiplied by the cache line size. This
|
||||
value does not consider partial requests, so for instance, if only a single
|
||||
value is requested in a cache line, the data movement will still be counted
|
||||
as a full cache line.
|
||||
Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
|
||||
vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
Cache Accesses: The total number of cache line lookups in the vL1D.
|
||||
Cache Hits: The number of cache accesses minus the number of outgoing requests
|
||||
to the L2 cache, that is, the number of cache line requests serviced by the
|
||||
vL1D Cache RAM per normalization unit.
|
||||
Invalidations: The number of times the vL1D was issued a write-back invalidate
|
||||
command during the kernel's execution per normalization unit. This may be triggered
|
||||
by, for instance, the buffer_wbinvl1 instruction.
|
||||
L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
|
||||
of VMEM instructions, divided by total duration. The number of bytes is calculated
|
||||
as the number of cache lines requested multiplied by the cache line size. This
|
||||
value does not consider partial requests, so for instance, if only a single
|
||||
value is requested in a cache line, the data movement will still be counted
|
||||
as a full cache line.
|
||||
L1-L2 Read: The number of read requests for a vL1D cache line that were not satisfied
|
||||
by the vL1D and must be retrieved from the L2 cache, per normalization
|
||||
unit.
|
||||
L1-L2 Write: The number of write requests to a vL1D cache line that were sent
|
||||
through the vL1D to the L2 cache, per normalization unit.
|
||||
L1-L2 Atomic: The number of atomic requests that are sent through the vL1D to
|
||||
the L2 cache, per normalization unit. This includes requests for atomics with
|
||||
and without return.
|
||||
NC - Read: Total read requests with NC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
UC - Read: Total read requests with UC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
CC - Read: Total read requests with CC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
RW - Read: Total read requests with RW mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
RW - Write: Total write requests with RW mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
NC - Write: Total write requests with NC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
UC - Write: Total write requests with UC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
CC - Write: Total write requests with CC mtype from this TCP to all TCCs, summed over
|
||||
TCP instances per normalization unit.
|
||||
NC - Atomic: Total atomic requests with NC mtype from this TCP to all TCCs, summed
|
||||
over TCP instances per normalization unit.
|
||||
UC - Atomic: Total atomic requests with UC mtype from this TCP to all TCCs, summed
|
||||
over TCP instances per normalization unit.
|
||||
CC - Atomic: Total atomic requests with CC mtype from this TCP to all TCCs, summed
|
||||
over TCP instances per normalization unit.
|
||||
RW - Atomic: Total atomic requests with RW mtype from this TCP to all TCCs, summed
|
||||
over TCP instances per normalization unit.
|
||||
Req: The number of translation requests made to the UTCL1 per normalization unit.
|
||||
Hit Ratio: The ratio of the number of translation requests that hit in the UTCL1
|
||||
divided by the total number of translation requests made to the UTCL1.
|
||||
Hits: The number of translation requests that hit in the UTCL1, and could be reused,
|
||||
per normalization unit.
|
||||
Translation Misses: The total number of translation requests that missed in the
|
||||
UTCL1 due to translation not being present in the cache, per normalization unit.
|
||||
Permission Misses: |-
|
||||
The total number of translation requests that missed in the UTCL1 due
|
||||
to a permission error, per normalization unit. This is unused and expected
|
||||
to be zero in most configurations for modern CDNA\u2122 accelerators.
|
||||
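The `Cache Hits` and `Cache BW` expressions in this panel can be read as plain arithmetic. The sketch below restates them in Python, using the counter names from the YAML as parameter names. The function names are hypothetical, the constant 128 is the factor that appears in the `Cache BW` expression (described in the metric text as the cache line size), and the time unit of the timestamps is an assumption left unspecified here.

```python
CACHE_LINE_BYTES = 128  # factor used in the Cache BW expression above


def vl1d_cache_hits(total_cache_accesses, read_req, write_req,
                    atomic_with_ret_req, atomic_without_ret_req, denom):
    """Cache accesses minus outgoing L2 requests, per normalization unit.

    Mirrors: (TCP_TOTAL_CACHE_ACCESSES_sum - (TCP_TCC_READ_REQ_sum
    + TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
    + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom
    """
    l2_requests = (read_req + write_req
                   + atomic_with_ret_req + atomic_without_ret_req)
    return (total_cache_accesses - l2_requests) / denom


def vl1d_cache_bw(total_cache_accesses, start_timestamp, end_timestamp):
    """Bytes looked up in the vL1D per unit of timestamp.

    Every access is counted as a full cache line, so partial requests
    are not discounted.
    """
    return (total_cache_accesses * CACHE_LINE_BYTES
            / (end_timestamp - start_timestamp))


# Example with made-up counter values for a single dispatch.
hits = vl1d_cache_hits(1000, 200, 50, 10, 5, denom=1)
bandwidth = vl1d_cache_bw(1000, start_timestamp=0, end_timestamp=64000)
```

Because each lookup is billed as a whole cache line, `vl1d_cache_bw` gives an upper bound on the bytes a kernel actually consumed when accesses touch only part of a line.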
|
||||
+186
-236
@@ -2,218 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1700
|
||||
title: L2 Cache
|
||||
metrics_description:
|
||||
Utilization: The ratio of the number of cycles an L2 channel was active, summed
|
||||
over all L2 channels on the accelerator over the total L2 cycles.
|
||||
Peak Bandwidth: The number of bytes looked up in the L2 cache, as a percent of
|
||||
the peak theoretical bandwidth achievable on the specific accelerator. The number
|
||||
of bytes is calculated as the number of cache lines requested multiplied by
|
||||
the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line.
|
||||
Hit Rate: The ratio of the number of L2 cache line requests that hit in the L2
|
||||
cache over the total number of incoming cache line requests to the L2 cache.
|
||||
L2-Fabric Read BW: The number of bytes read by the L2 over the Infinity Fabric
|
||||
interface per unit time.
|
||||
L2-Fabric Write and Atomic BW: The number of bytes sent by the L2 over the Infinity
|
||||
Fabric interface by write and atomic operations per unit time.
|
||||
HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
|
||||
memory (HBM) per unit time. This value is calculated as the number of HBM channels
|
||||
multiplied by the HBM channel width multiplied by the HBM clock frequency.
|
||||
Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided
|
||||
by total duration.
|
||||
HBM Read Traffic: The percent of read requests generated by the L2 cache that
|
||||
are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
|
||||
does not consider the size of the request (meaning that 32B and 64B requests
|
||||
are both counted as a single request), so this metric only approximates the
|
||||
percent of the L2-Fabric Read bandwidth directed to the local HBM.
|
||||
Remote Read Traffic: The percent of read requests generated by the L2 cache that
|
||||
are routed to any memory location other than the accelerator's local high-bandwidth
|
||||
memory (HBM) - for example, the CPU's DRAM or a remote accelerator's HBM. This
|
||||
breakdown does not consider the size of the request (meaning that 32B and 64B
|
||||
requests are both counted as a single request), so this metric only approximates
|
||||
the percent of the L2-Fabric Read bandwidth directed to a remote location.
|
||||
Uncached Read Traffic: The percent of read requests generated by the L2 cache
|
||||
that are reading from an uncached memory allocation. Note, as described in the
|
||||
request flow section, a single 64B read request is typically counted as two
|
||||
uncached read requests. So, it is possible for the Uncached Read Traffic to
|
||||
reach up to 200% of the total number of read requests. This breakdown does not
|
||||
consider the size of the request (i.e., 32B and 64B requests are both counted
|
||||
as a single request), so this metric only approximates the percent of the L2-Fabric
|
||||
read bandwidth directed to an uncached memory location.
|
||||
Write and Atomic BW: The total number of bytes written by the L2 over Infinity
|
||||
Fabric by write and atomic operations divided by total duration. Note that on
|
||||
current CDNA accelerators, such as the MI2XX, requests are only considered atomic
|
||||
by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
|
||||
fine-grained memory allocations or uncached memory allocations on the MI2XX.
|
||||
HBM Write and Atomic Traffic: The percent of write and atomic requests generated
|
||||
by the L2 cache that are routed to the accelerator's local high-bandwidth memory
|
||||
(HBM). This breakdown does not consider the size of the request (meaning that
|
||||
32B and 64B requests are both counted as a single request), so this metric only
|
||||
approximates the percent of the L2-Fabric Write and Atomic bandwidth directed
|
||||
to the local HBM. Note that on current CDNA accelerators, such as the MI2XX,
|
||||
requests are only considered atomic by Infinity Fabric if they are targeted
|
||||
at fine-grained memory allocations or uncached memory allocations.
|
||||
Remote Write and Atomic Traffic: The percent of write and atomic requests generated by the
|
||||
L2 cache that are routed to any memory location other than the accelerator's
|
||||
local high-bandwidth memory (HBM) - for example, the CPU's DRAM or a remote
|
||||
accelerator's HBM. This breakdown does not consider the size of the request
|
||||
(meaning that 32B and 64B requests are both counted as a single request), so
|
||||
this metric only approximates the percent of the L2-Fabric Write and Atomic bandwidth directed
|
||||
to a remote location. Note that on current CDNA accelerators, such as the MI2XX,
|
||||
requests are only considered atomic by Infinity Fabric if they are targeted
|
||||
at fine-grained memory allocations or uncached memory allocations.
|
||||
Atomic Traffic: The percent of write requests generated by the L2 cache that are
|
||||
atomic requests to any memory location. This breakdown does not consider the
|
||||
size of the request (meaning that 32B and 64B requests are both counted as a
|
||||
single request), so this metric only approximates the percent of the L2-Fabric
|
||||
Write and Atomic bandwidth made up of atomic requests. Note that on current CDNA accelerators,
|
||||
such as the MI2XX, requests are only considered atomic by Infinity Fabric if
|
||||
they are targeted at fine-grained memory allocations or uncached memory allocations.
|
||||
Uncached Write and Atomic Traffic: The percent of write and atomic requests generated
|
||||
by the L2 cache that are targeting uncached memory allocations. This breakdown
|
||||
does not consider the size of the request (meaning that 32B and 64B requests
|
||||
are both counted as a single request), so this metric only approximates the
|
||||
percent of the L2-Fabric Write and Atomic bandwidth directed to uncached memory allocations.
|
||||
Read Latency: The time-averaged number of cycles read requests spent in Infinity
|
||||
Fabric before data was returned to the L2.
|
||||
Write and Atomic Latency: The time-averaged number of cycles write requests spent
|
||||
in Infinity Fabric before a completion acknowledgement was returned to the L2.
|
||||
Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
|
||||
Fabric before a completion acknowledgement (atomic without return value) or
|
||||
data (atomic with return value) was returned to the L2.
|
||||
Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration.
|
||||
The number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so for
|
||||
example, if only a single value is requested in a cache line, the data movement
|
||||
will still be counted as a full cache line.
|
||||
Read Bandwidth: Total number of bytes looked up in the L2 cache for read requests,
|
||||
divided by total duration.
|
||||
Write Bandwidth: Total number of bytes looked up in the L2 cache for write requests,
|
||||
divided by total duration.
|
||||
Atomic Bandwidth: Total number of bytes looked up in the L2 cache for atomic requests,
|
||||
divided by total duration.
|
||||
Req: The total number of incoming requests to the L2 from all clients for all
|
||||
request types, per normalization unit.
|
||||
Read Req: The total number of read requests to the L2 from all clients.
|
||||
Write Req: The total number of write requests to the L2 from all clients.
|
||||
Atomic Req: The total number of atomic requests (with and without return) to the
|
||||
L2 from all clients.
|
||||
Streaming Req: The total number of incoming requests to the L2 that are marked
|
||||
as streaming. The exact meaning of this may differ depending on the targeted
|
||||
accelerator; on an MI2XX, however, this corresponds to non-temporal loads or stores.
|
||||
The L2 cache attempts to evict streaming requests before normal requests when
|
||||
the L2 is at capacity.
|
||||
Probe Req: The number of coherence probe requests made to the L2 cache from outside
|
||||
the accelerator. On an MI2XX, probe requests may be generated by, for example,
|
||||
writes to fine-grained device memory or by writes to coarse-grained device memory.
|
||||
Cache Hit: The ratio of the number of L2 cache line requests that hit in the L2
|
||||
cache over the total number of incoming cache line requests to the L2 cache.
|
||||
Hits: The total number of requests to the L2 from all clients that hit in the
|
||||
cache. As noted in the Speed-of-Light section, this includes hit-on-miss requests.
|
||||
Misses: The total number of requests to the L2 from all clients that miss in the
|
||||
cache. As noted in the Speed-of-Light section, these do not include hit-on-miss
|
||||
requests.
|
||||
Writeback: The total number of L2 cache lines written back to memory for any reason.
|
||||
Write-backs may occur due to user code (such as HIP kernel calls to __threadfence_system
|
||||
or atomic built-ins), by the command processor's memory acquire/release fences,
|
||||
or for other internal hardware reasons.
|
||||
Writeback (Internal): The total number of L2 cache lines written back to memory
|
||||
for internal hardware reasons, per normalization unit.
|
||||
Writeback (vL1D Req): The total number of L2 cache lines written back to memory
|
||||
due to requests initiated by the vL1D cache, per normalization unit.
|
||||
Evict (Internal): The total number of L2 cache lines evicted from the cache due
|
||||
to capacity limits, per normalization unit.
|
||||
Evict (vL1D Req): The total number of L2 cache lines evicted from the cache due
|
||||
to invalidation requests initiated by the vL1D cache, per normalization unit.
|
||||
NC Req: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory
|
||||
allocations, per normalization unit.
|
||||
UC Req: The total number of requests to the L2 that go to Uncached (UC) memory
|
||||
allocations.
|
||||
CC Req: The total number of requests to the L2 that go to Coherently Cacheable
|
||||
(CC) memory allocations.
|
||||
RW Req: The total number of requests to the L2 that go to Read-Write coherent
|
||||
memory (RW) allocations.
|
||||
Write - Credit Starvation: The number of cycles the L2-Fabric interface was stalled
|
||||
on write or atomic requests to any memory location because too many write/atomic
|
||||
requests were currently in flight, as a percent of the total active L2 cycles.
|
||||
Read (32B): The total number of L2 requests to Infinity Fabric to read 32B of
|
||||
data from any memory location, per normalization unit.
|
||||
Read (64B): The total number of L2 requests to Infinity Fabric to read 64B of
|
||||
data from any memory location, per normalization unit.
|
||||
Read (Uncached): The total number of L2 requests to Infinity Fabric to read uncached
|
||||
data from any memory location, per normalization unit. 64B requests for uncached
|
||||
data are counted as two 32B uncached data requests.
|
||||
HBM Read: The total number of L2 requests to Infinity Fabric to read 32B or 64B
|
||||
of data from the accelerator's local HBM, per normalization unit.
|
||||
Remote Read: The total number of L2 requests to Infinity Fabric to read 32B or
|
||||
64B of data from any source other than the accelerator's local HBM, per normalization
|
||||
unit.
|
||||
Read Bandwidth - PCIe: Total number of bytes due to L2 read requests due to PCIe
|
||||
traffic, divided by total duration.
|
||||
"Read Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 read
|
||||
requests due to Infinity Fabric traffic, divided by total duration.
|
||||
Read Bandwidth - HBM: Total number of bytes due to L2 read requests due to HBM
|
||||
traffic, divided by total duration.
|
||||
Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
|
||||
write or atomically update 32B of data to any memory location, per normalization
|
||||
unit.
|
||||
Write and Atomic (Uncached): The total number of L2 requests to Infinity Fabric
|
||||
to write or atomically update 32B or 64B of uncached data, per normalization
|
||||
unit.
|
||||
Write and Atomic (64B): The total number of L2 requests to Infinity Fabric to
|
||||
write or atomically update 64B of data in any memory location, per normalization
|
||||
unit.
|
||||
HBM Write and Atomic: The total number of L2 requests to Infinity Fabric to write
|
||||
or atomically update 32B or 64B of data in the accelerator's local HBM, per
|
||||
normalization unit.
|
||||
Remote Write and Atomic: The total number of L2 requests to Infinity Fabric to
|
||||
write or atomically update 32B or 64B of data in any memory location other than
|
||||
the accelerator's local HBM, per normalization unit.
|
||||
Write Bandwidth - PCIe: Total number of bytes due to L2 write requests due to
|
||||
PCIe traffic, divided by total duration.
|
||||
"Write Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 write
|
||||
requests due to Infinity Fabric traffic, divided by total duration.
|
||||
Write Bandwidth - HBM: Total number of bytes due to L2 write requests due to HBM
|
||||
traffic, divided by total duration.
|
||||
Atomic Bandwidth - PCIe: Total number of bytes due to L2 atomic requests due to
|
||||
PCIe traffic, divided by total duration.
|
||||
"Atomic Bandwidth - Infinity Fabric\u2122": Total number of bytes due to L2 atomic
|
||||
requests due to Infinity Fabric traffic, divided by total duration.
|
||||
Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
|
||||
HBM traffic, divided by total duration.
|
||||
Atomic: The total number of L2 requests to Infinity Fabric to atomically update
|
||||
32B or 64B of data in any memory location, per normalization unit. See Request
|
||||
flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
|
||||
requests are only considered atomic by Infinity Fabric if they are targeted
|
||||
at non-write-cacheable memory, such as fine-grained memory allocations or uncached
|
||||
memory allocations on the MI2XX.
|
||||
Read Stall: "The ratio of the total number of cycles the L2-Fabric interface was\
|
||||
\ stalled on a read request to any destination (local HBM, remote PCIe\xAE connected\
|
||||
\ accelerator or CPU, or remote Infinity Fabric connected accelerator or CPU)\
|
||||
\ over the total active L2 cycles."
|
||||
Write Stall: The ratio of the total number of cycles the L2-Fabric interface was
|
||||
stalled on a write or atomic request to any destination (local HBM, remote accelerator
|
||||
or CPU, PCIe connected accelerator or CPU, or remote Infinity Fabric connected
|
||||
accelerator or CPU) over the total active L2 cycles.
|
||||
Read - PCIe Stall: The number of cycles the L2-Fabric interface was stalled on
|
||||
read requests to remote PCIe connected accelerators or CPUs as a percent of
|
||||
the total active L2 cycles.
|
||||
Read - Infinity Fabric Stall: The number of cycles the L2-Fabric interface was
|
||||
stalled on read requests to remote Infinity Fabric connected accelerators or
|
||||
CPUs as a percent of the total active L2 cycles.
|
||||
Read - HBM Stall: The number of cycles the L2-Fabric interface was stalled on
|
||||
read requests to the accelerator's local HBM as a percent of the total active
|
||||
L2 cycles.
|
||||
Write - PCIe Stall: The number of cycles the L2-Fabric interface was stalled on
|
||||
write or atomic requests to remote PCIe connected accelerators or CPUs as a
|
||||
percent of the total active L2 cycles.
|
||||
Write - Infinity Fabric Stall: The number of cycles the L2-Fabric interface was
|
||||
stalled on write or atomic requests to remote Infinity Fabric connected accelerators
|
||||
or CPUs as a percent of the total active L2 cycles.
|
||||
Write - HBM Stall: The number of cycles the L2-Fabric interface was stalled on
|
||||
write or atomic requests to the accelerator's local HBM as a percent of the total
|
||||
active L2 cycles.
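Several of the traffic descriptions above note that 32B and 64B requests are each counted as one request, so a request-count percentage only approximates the share of bandwidth. The following is a minimal sketch of that gap, not code from the profiler: the workload numbers and the `percent` helper are hypothetical and chosen only for illustration.

```python
from typing import Optional


def percent(part: float, total: float) -> Optional[float]:
    """Return part as a percent of total, or None when total is zero."""
    return 100.0 * part / total if total != 0 else None


# Hypothetical workload (assumption, not measured data):
# 10 reads of 64B go to local HBM, 10 reads of 32B go to a remote location.
hbm_reqs, hbm_bytes_per_req = 10, 64
remote_reqs, remote_bytes_per_req = 10, 32

# Share computed the way the traffic metrics describe it: by request count.
by_requests = percent(hbm_reqs, hbm_reqs + remote_reqs)

# Share computed by bytes actually moved.
hbm_bytes = hbm_reqs * hbm_bytes_per_req
remote_bytes = remote_reqs * remote_bytes_per_req
by_bytes = percent(hbm_bytes, hbm_bytes + remote_bytes)

print(f"HBM share by requests: {by_requests:.1f}%")
print(f"HBM share by bytes:    {by_bytes:.1f}%")
```

In this made-up case the two shares differ noticeably, which is why the descriptions call the request-based percentages approximations of bandwidth.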
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1701
|
||||
@@ -370,32 +158,32 @@ Panel Config:
|
||||
avg: AVG((TCC_REQ_sum / $denom))
|
||||
min: MIN((TCC_REQ_sum / $denom))
|
||||
max: MAX((TCC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read Req:
|
||||
avg: AVG((TCC_READ_sum / $denom))
|
||||
min: MIN((TCC_READ_sum / $denom))
|
||||
max: MAX((TCC_READ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Write Req:
|
||||
avg: AVG((TCC_WRITE_sum / $denom))
|
||||
min: MIN((TCC_WRITE_sum / $denom))
|
||||
max: MAX((TCC_WRITE_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Atomic Req:
|
||||
avg: AVG((TCC_ATOMIC_sum / $denom))
|
||||
min: MIN((TCC_ATOMIC_sum / $denom))
|
||||
max: MAX((TCC_ATOMIC_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Streaming Req:
|
||||
avg: AVG((TCC_STREAMING_REQ_sum / $denom))
|
||||
min: MIN((TCC_STREAMING_REQ_sum / $denom))
|
||||
max: MAX((TCC_STREAMING_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Probe Req:
|
||||
avg: AVG((TCC_PROBE_sum / $denom))
|
||||
min: MIN((TCC_PROBE_sum / $denom))
|
||||
max: MAX((TCC_PROBE_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Cache Hit:
|
||||
avg: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
|
||||
+ TCC_MISS_sum) != 0) else None))
|
||||
@@ -408,17 +196,17 @@ Panel Config:
|
||||
avg: AVG((TCC_HIT_sum / $denom))
|
||||
min: MIN((TCC_HIT_sum / $denom))
|
||||
max: MAX((TCC_HIT_sum / $denom))
|
||||
unit: (Hits + $normUnit)
|
||||
unit: (Hits + $normUnit)
|
||||
Misses:
|
||||
avg: AVG((TCC_MISS_sum / $denom))
|
||||
min: MIN((TCC_MISS_sum / $denom))
|
||||
max: MAX((TCC_MISS_sum / $denom))
|
||||
unit: (Misses + $normUnit)
|
||||
unit: (Misses + $normUnit)
|
||||
Writeback:
|
||||
avg: AVG((TCC_WRITEBACK_sum / $denom))
|
||||
min: MIN((TCC_WRITEBACK_sum / $denom))
|
||||
max: MAX((TCC_WRITEBACK_sum / $denom))
|
||||
unit: (Cachelines + $normUnit)
|
||||
unit: (Cachelines + $normUnit)
|
||||
Writeback (Internal):
|
||||
avg: AVG((TCC_NORMAL_WRITEBACK_sum / $denom))
|
||||
min: MIN((TCC_NORMAL_WRITEBACK_sum / $denom))
|
||||
@@ -443,22 +231,22 @@ Panel Config:
|
||||
avg: AVG((TCC_NC_REQ_sum / $denom))
|
||||
min: MIN((TCC_NC_REQ_sum / $denom))
|
||||
max: MAX((TCC_NC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
UC Req:
|
||||
avg: AVG((TCC_UC_REQ_sum / $denom))
|
||||
min: MIN((TCC_UC_REQ_sum / $denom))
|
||||
max: MAX((TCC_UC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
CC Req:
|
||||
avg: AVG((TCC_CC_REQ_sum / $denom))
|
||||
min: MIN((TCC_CC_REQ_sum / $denom))
|
||||
max: MAX((TCC_CC_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
RW Req:
|
||||
avg: AVG((TCC_RW_REQ_sum / $denom))
|
||||
min: MIN((TCC_RW_REQ_sum / $denom))
|
||||
max: MAX((TCC_RW_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1704
|
||||
title: L2 Cache Stalls
|
||||
@@ -507,54 +295,216 @@ Panel Config:
|
||||
avg: AVG((TCC_EA0_RDREQ_32B_sum / $denom))
|
||||
min: MIN((TCC_EA0_RDREQ_32B_sum / $denom))
|
||||
max: MAX((TCC_EA0_RDREQ_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read (64B):
|
||||
avg: AVG(((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) / $denom))
|
||||
min: MIN(((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) / $denom))
|
||||
max: MAX(((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_32B_sum) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Read (Uncached):
|
||||
avg: AVG((TCC_EA0_RD_UNCACHED_32B_sum / $denom))
|
||||
min: MIN((TCC_EA0_RD_UNCACHED_32B_sum / $denom))
|
||||
max: MAX((TCC_EA0_RD_UNCACHED_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
HBM Read:
|
||||
avg: AVG((TCC_EA0_RDREQ_DRAM_sum / $denom))
|
||||
min: MIN((TCC_EA0_RDREQ_DRAM_sum / $denom))
|
||||
max: MAX((TCC_EA0_RDREQ_DRAM_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Remote Read:
|
||||
avg: AVG((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
|
||||
min: MIN((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
|
||||
max: MAX((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Write and Atomic (32B):
|
||||
avg: AVG(MAX(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom), 0))
|
||||
min: MIN(MAX(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom), 0))
|
||||
max: MAX(MAX(((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) / $denom), 0))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Write and Atomic (Uncached):
|
||||
avg: AVG((TCC_EA0_WR_UNCACHED_32B_sum / $denom))
|
||||
min: MIN((TCC_EA0_WR_UNCACHED_32B_sum / $denom))
|
||||
max: MAX((TCC_EA0_WR_UNCACHED_32B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Write and Atomic (64B):
|
||||
avg: AVG((TCC_EA0_WRREQ_64B_sum / $denom))
|
||||
min: MIN((TCC_EA0_WRREQ_64B_sum / $denom))
|
||||
max: MAX((TCC_EA0_WRREQ_64B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
HBM Write and Atomic:
|
||||
avg: AVG((TCC_EA0_WRREQ_DRAM_sum / $denom))
|
||||
min: MIN((TCC_EA0_WRREQ_DRAM_sum / $denom))
|
||||
max: MAX((TCC_EA0_WRREQ_DRAM_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Remote Write and Atomic:
|
||||
avg: AVG((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
|
||||
min: MIN((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
|
||||
max: MAX((MAX((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_DRAM_sum), 0) / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
Atomic:
|
||||
avg: AVG((TCC_EA0_ATOMIC_sum / $denom))
|
||||
min: MIN((TCC_EA0_ATOMIC_sum / $denom))
|
||||
max: MAX((TCC_EA0_ATOMIC_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
unit: (Req + $normUnit)
|
||||
metrics_description:
|
||||
Utilization: The ratio of the number of cycles an L2 channel was active, summed
|
||||
over all L2 channels on the accelerator, over the total L2 cycles.
|
||||
Peak Bandwidth: The number of bytes looked up in the L2 cache, as a percent of
|
||||
the peak theoretical bandwidth achievable on the specific accelerator. The number
|
||||
of bytes is calculated as the number of cache lines requested multiplied by
|
||||
the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line.
|
||||
Hit Rate: The ratio of the number of L2 cache line requests that hit in the L2
|
||||
cache over the total number of incoming cache line requests to the L2 cache.
|
||||
L2-Fabric Read BW: The number of bytes read by the L2 over the Infinity Fabric
|
||||
interface per unit time.
|
||||
L2-Fabric Write and Atomic BW: The number of bytes sent by the L2 over the Infinity
|
||||
Fabric interface by write and atomic operations per unit time.
|
||||
HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
|
||||
memory (HBM) per unit time. This value is calculated as the number of HBM channels
|
||||
multiplied by the HBM channel width multiplied by the HBM clock frequency.
|
||||
Read BW: The total number of bytes read by the L2 cache from Infinity Fabric divided
|
||||
by total duration.
|
||||
HBM Read Traffic: The percent of read requests generated by the L2 cache that
|
||||
are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
|
||||
does not consider the size of the request (meaning that 32B and 64B requests
|
||||
are both counted as a single request), so this metric only approximates the
|
||||
percent of the L2-Fabric Read bandwidth directed to the local HBM.
|
||||
Remote Read Traffic: The percent of read requests generated by the L2 cache that
|
||||
are routed to any memory location other than the accelerator's local high-bandwidth
|
||||
memory (HBM) - for example, the CPU's DRAM or a remote accelerator's HBM. This
|
||||
breakdown does not consider the size of the request (meaning that 32B and 64B
|
||||
requests are both counted as a single request), so this metric only approximates
|
||||
the percent of the L2-Fabric Read bandwidth directed to a remote location.
|
||||
Uncached Read Traffic: The percent of read requests generated by the L2 cache
|
||||
that are reading from an uncached memory allocation. Note, as described in the
|
||||
request flow section, a single 64B read request is typically counted as two
|
||||
uncached read requests. So, it is possible for the Uncached Read Traffic to
|
||||
reach up to 200% of the total number of read requests. This breakdown does not
|
||||
consider the size of the request (i.e., 32B and 64B requests are both counted
|
||||
as a single request), so this metric only approximates the percent of the L2-Fabric
|
||||
read bandwidth directed to an uncached memory location.
|
||||
Write and Atomic BW: The total number of bytes written by the L2 over Infinity
|
||||
Fabric by write and atomic operations divided by total duration. Note that on
|
||||
current CDNA accelerators, such as the MI2XX, requests are only considered atomic
|
||||
by Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
|
||||
fine-grained memory allocations or uncached memory allocations on the MI2XX.
|
||||
HBM Write and Atomic Traffic: The percent of write and atomic requests generated
|
||||
by the L2 cache that are routed to the accelerator's local high-bandwidth memory
|
||||
(HBM). This breakdown does not consider the size of the request (meaning that
|
||||
32B and 64B requests are both counted as a single request), so this metric only
|
||||
approximates the percent of the L2-Fabric Write and Atomic bandwidth directed
|
||||
to the local HBM. Note that on current CDNA accelerators, such as the MI2XX,
|
||||
requests are only considered atomic by Infinity Fabric if they are targeted
|
||||
at fine-grained memory allocations or uncached memory allocations.
|
||||
Remote Write and Atomic Traffic: The percent of write and atomic requests generated by the
|
||||
L2 cache that are routed to any memory location other than the accelerator's
|
||||
local high-bandwidth memory (HBM) - for example, the CPU's DRAM or a remote
|
||||
accelerator's HBM. This breakdown does not consider the size of the request
|
||||
(meaning that 32B and 64B requests are both counted as a single request), so
|
||||
this metric only approximates the percent of the L2-Fabric Write and Atomic bandwidth directed
|
||||
to a remote location. Note that on current CDNA accelerators, such as the MI2XX,
|
||||
requests are only considered atomic by Infinity Fabric if they are targeted
|
||||
at fine-grained memory allocations or uncached memory allocations.
|
||||
Atomic Traffic: The percent of write requests generated by the L2 cache that are
|
||||
atomic requests to any memory location. This breakdown does not consider the
|
||||
size of the request (meaning that 32B and 64B requests are both counted as a
|
||||
single request), so this metric only approximates the percent of the L2-Fabric
|
||||
Write and Atomic bandwidth used by atomic requests. Note that on current CDNA accelerators,
|
||||
such as the MI2XX, requests are only considered atomic by Infinity Fabric if
|
||||
they are targeted at fine-grained memory allocations or uncached memory allocations.
|
||||
Uncached Write and Atomic Traffic: The percent of write and atomic requests generated
|
||||
by the L2 cache that are targeting uncached memory allocations. This breakdown
|
||||
does not consider the size of the request (meaning that 32B and 64B requests
|
||||
are both counted as a single request), so this metric only approximates the
|
||||
percent of the L2-Fabric Write and Atomic bandwidth directed to uncached memory allocations.
|
||||
Read Latency: The time-averaged number of cycles read requests spent in Infinity
|
||||
Fabric before data was returned to the L2.
|
||||
Write and Atomic Latency: The time-averaged number of cycles write requests spent
|
||||
in Infinity Fabric before a completion acknowledgement was returned to the L2.
|
||||
Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
|
||||
Fabric before a completion acknowledgement (atomic without return value) or
|
||||
data (atomic with return value) was returned to the L2.
|
||||
Bandwidth: The number of bytes looked up in the L2 cache, divided by total duration.
|
||||
The number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so for
|
||||
example, if only a single value is requested in a cache line, the data movement
|
||||
will still be counted as a full cache line.
|
||||
Req: The total number of incoming requests to the L2 from all clients for all
|
||||
request types, per normalization unit.
|
||||
Read Req: The total number of read requests to the L2 from all clients.
|
||||
Write Req: The total number of write requests to the L2 from all clients.
|
||||
Atomic Req: The total number of atomic requests (with and without return) to the
|
||||
L2 from all clients.
|
||||
Streaming Req: The total number of incoming requests to the L2 that are marked
|
||||
as streaming. The exact meaning of this may differ depending on the targeted
|
||||
accelerator; on an MI2XX, however, this corresponds to non-temporal loads or stores.
|
||||
The L2 cache attempts to evict streaming requests before normal requests when
|
||||
the L2 is at capacity.
|
||||
Probe Req: The number of coherence probe requests made to the L2 cache from outside
|
||||
the accelerator. On an MI2XX, probe requests may be generated by, for example,
|
||||
writes to fine-grained device memory or by writes to coarse-grained device memory.
|
||||
Cache Hit: The ratio of the number of L2 cache line requests that hit in the L2
|
||||
cache over the total number of incoming cache line requests to the L2 cache.
|
||||
Hits: The total number of requests to the L2 from all clients that hit in the
|
||||
cache. As noted in the Speed-of-Light section, this includes hit-on-miss requests.
|
||||
Misses: The total number of requests to the L2 from all clients that miss in the
|
||||
cache. As noted in the Speed-of-Light section, these do not include hit-on-miss
|
||||
requests.
|
||||
Writeback: The total number of L2 cache lines written back to memory for any reason.
|
||||
Write-backs may occur due to user code (such as HIP kernel calls to __threadfence_system
|
||||
or atomic built-ins), by the command processor's memory acquire/release fences,
|
||||
or for other internal hardware reasons.
|
||||
Writeback (Internal): The total number of L2 cache lines written back to memory
|
||||
for internal hardware reasons, per normalization unit.
|
||||
Writeback (vL1D Req): The total number of L2 cache lines written back to memory
|
||||
due to requests initiated by the vL1D cache, per normalization unit.
|
||||
Evict (Internal): The total number of L2 cache lines evicted from the cache due
|
||||
to capacity limits, per normalization unit.
|
||||
Evict (vL1D Req): The total number of L2 cache lines evicted from the cache due
|
||||
to invalidation requests initiated by the vL1D cache, per normalization unit.
|
||||
NC Req: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory
|
||||
allocations, per normalization unit.
|
||||
UC Req: The total number of requests to the L2 that go to Uncached (UC) memory
|
||||
allocations.
|
||||
CC Req: The total number of requests to the L2 that go to Coherently Cacheable
|
||||
(CC) memory allocations.
|
||||
RW Req: The total number of requests to the L2 that go to Read-Write coherent
|
||||
memory (RW) allocations.
|
||||
Write - Credit Starvation: The number of cycles the L2-Fabric interface was stalled
|
||||
on write or atomic requests to any memory location because too many write/atomic
|
||||
requests were currently in flight, as a percent of the total active L2 cycles.
|
||||
Read (32B): The total number of L2 requests to Infinity Fabric to read 32B of
|
||||
data from any memory location, per normalization unit.
|
||||
Read (64B): The total number of L2 requests to Infinity Fabric to read 64B of
|
||||
data from any memory location, per normalization unit.
|
||||
Read (Uncached): The total number of L2 requests to Infinity Fabric to read uncached
|
||||
data from any memory location, per normalization unit. 64B requests for uncached
|
||||
data are counted as two 32B uncached data requests.
|
||||
HBM Read: The total number of L2 requests to Infinity Fabric to read 32B or 64B
|
||||
of data from the accelerator's local HBM, per normalization unit.
|
||||
Remote Read: The total number of L2 requests to Infinity Fabric to read 32B or
|
||||
64B of data from any source other than the accelerator's local HBM, per normalization
|
||||
unit.
|
||||
Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
|
||||
write or atomically update 32B of data to any memory location, per normalization
|
||||
unit.
|
||||
Write and Atomic (Uncached): The total number of L2 requests to Infinity Fabric
|
||||
to write or atomically update 32B or 64B of uncached data, per normalization
|
||||
unit.
|
||||
Write and Atomic (64B): The total number of L2 requests to Infinity Fabric to
|
||||
write or atomically update 64B of data in any memory location, per normalization
|
||||
unit.
|
||||
HBM Write and Atomic: The total number of L2 requests to Infinity Fabric to write
|
||||
or atomically update 32B or 64B of data in the accelerator's local HBM, per
|
||||
normalization unit.
|
||||
Remote Write and Atomic: The total number of L2 requests to Infinity Fabric to
|
||||
write or atomically update 32B or 64B of data in any memory location other than
|
||||
the accelerator's local HBM, per normalization unit.
|
||||
Atomic: The total number of L2 requests to Infinity Fabric to atomically update
|
||||
32B or 64B of data in any memory location, per normalization unit. See Request
|
||||
flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
|
||||
requests are only considered atomic by Infinity Fabric if they are targeted
|
||||
at non-write-cacheable memory, such as fine-grained memory allocations or uncached
|
||||
memory allocations on the MI2XX.
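The metric expressions in this file follow a few recurring patterns: a zero-denominator guard for Cache Hit, a subtraction to derive 64B reads from the total, and a clamp at zero for the remote read count. The sketch below restates those three expressions as plain Python for readability. The function and parameter names are hypothetical stand-ins for the counters (`TCC_HIT_sum`, `TCC_MISS_sum`, `TCC_EA0_RDREQ_sum`, `TCC_EA0_RDREQ_32B_sum`, `TCC_EA0_RDREQ_DRAM_sum`) and for `$denom`; it is not how the profiler evaluates the YAML.

```python
from typing import Optional


def cache_hit_pct(hits: float, misses: float) -> Optional[float]:
    """Mirror of the Cache Hit expression: percent of hits, None if no requests."""
    total = hits + misses
    return (100 * hits) / total if total != 0 else None


def read_64b(rdreq: float, rdreq_32b: float, denom: float) -> float:
    """Mirror of Read (64B): total reads minus 32B reads, per normalization unit."""
    return (rdreq - rdreq_32b) / denom


def remote_read(rdreq: float, rdreq_dram: float, denom: float) -> float:
    """Mirror of Remote Read: non-HBM reads, clamped at zero before normalizing."""
    return max(rdreq - rdreq_dram, 0) / denom


# Made-up counter values, for illustration only.
print(cache_hit_pct(hits=75, misses=25))
print(cache_hit_pct(hits=0, misses=0))
print(read_64b(rdreq=100, rdreq_32b=40, denom=10))
print(remote_read(rdreq=100, rdreq_dram=120, denom=10))
```

The clamp in `remote_read` matters when the DRAM counter reads slightly higher than the total, as in the last call: the result is reported as zero rather than as a negative request count.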
|
||||
|
||||
+4
-4
@@ -2,10 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1800
|
||||
title: L2 Cache (per Channel)
|
||||
metrics_description:
|
||||
L2 Cache Hit Rate: The percent of total number of requests to the L2 from all
|
||||
clients that hit in the cache. As noted in the Speed-of-Light section, this
|
||||
includes hit-on-miss requests.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1801
|
||||
@@ -249,3 +245,7 @@ Panel Config:
|
||||
::_1: $total_l2_chan
|
||||
cli_style: simple_box
|
||||
tui_style: simple_box
|
||||
metrics_description:
|
||||
L2 Cache Hit Rate: The percent of total number of requests to the L2 from all
|
||||
clients that hit in the cache. As noted in the Speed-of-Light section, this
|
||||
includes hit-on-miss requests.
|
||||
|
||||
+1
-1
@@ -2,10 +2,10 @@
|
||||
Panel Config:
|
||||
id: 2100
|
||||
title: PC Sampling
|
||||
metrics_description: {}
|
||||
data source:
|
||||
- pc_sampling_table:
|
||||
id: 2101
|
||||
title: PC Sampling
|
||||
source: ps_file
|
||||
comparable: false
|
||||
metrics_description: {}
|
||||
|
||||
+763
@@ -0,0 +1,763 @@
|
||||
# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated by tools/config_management/generate_config_deltas.py
|
||||
Addition:
|
||||
- Panel Config:
|
||||
id: 200
|
||||
title: System Speed-of-Light
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 201
|
||||
title: System Speed-of-Light
|
||||
metrics:
|
||||
- MFMA FLOPs (F6F4):
|
||||
value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GFLOP/s
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 16834) / 1000)
|
||||
pop: |
|
||||
((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 16834) / 1000))
|
||||
- Panel Config:
|
||||
id: 300
|
||||
title: Memory Chart
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 301
|
||||
title: Memory Chart
|
||||
metrics:
|
||||
- L2 Rd Lat:
|
||||
value: |
|
||||
ROUND(AVG(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)) if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else None)), 0)
|
||||
- L2 Wr Lat:
|
||||
value: |
|
||||
ROUND(AVG(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0) else None)), 0)
|
||||
- Panel Config:
|
||||
id: 400
|
||||
title: Roofline
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 401
|
||||
title: Roofline Performance Rates
|
||||
metrics:
|
||||
- MFMA FLOPs (F6F4):
|
||||
value: |
|
||||
AVG((((SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
|
||||
unit: GFLOP/s
|
||||
peak: $MFMA_FLOPs_F6F4_empirical_peak
|
||||
- Panel Config:
|
||||
id: 500
|
||||
title: Command Processor (CPC/CPF)
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 502
|
||||
title: Command processor packet processor (CPC)
|
||||
metrics:
|
||||
- CPC SYNC FIFO Full Rate:
|
||||
avg: |
|
||||
AVG((100 * CPC_SYNC_FIFO_FULL) / CPC_SYNC_WRREQ_FIFO_BUSY if (CPC_SYNC_WRREQ_FIFO_BUSY != 0) else None)
|
||||
min: |
|
||||
MIN((100 * CPC_SYNC_FIFO_FULL) / CPC_SYNC_WRREQ_FIFO_BUSY if (CPC_SYNC_WRREQ_FIFO_BUSY != 0) else None)
|
||||
max: |
|
||||
MAX((100 * CPC_SYNC_FIFO_FULL) / CPC_SYNC_WRREQ_FIFO_BUSY if (CPC_SYNC_WRREQ_FIFO_BUSY != 0) else None)
|
||||
unit: pct
|
||||
- CPC ADC Utilization:
|
||||
avg: AVG((100 * CPC_TG_SEND) / CPC_GD_BUSY if (CPC_GD_BUSY != 0) else None)
|
||||
min: MIN((100 * CPC_TG_SEND) / CPC_GD_BUSY if (CPC_GD_BUSY != 0) else None)
|
||||
max: MAX((100 * CPC_TG_SEND) / CPC_GD_BUSY if (CPC_GD_BUSY != 0) else None)
|
||||
unit: pct
|
||||
- CPC CANE Stall Rate:
|
||||
avg: AVG((100 * CPC_CANE_STALL) / CPC_CANE_BUSY if (CPC_CANE_BUSY != 0) else None)
|
||||
min: MIN((100 * CPC_CANE_STALL) / CPC_CANE_BUSY if (CPC_CANE_BUSY != 0) else None)
|
||||
max: MAX((100 * CPC_CANE_STALL) / CPC_CANE_BUSY if (CPC_CANE_BUSY != 0) else None)
|
||||
unit: pct
|
||||
- Panel Config:
|
||||
id: 600
|
||||
title: Workgroup Manager (SPI)
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 601
|
||||
title: Workgroup manager utilizations
|
||||
metrics:
|
||||
- Scheduler-Pipe Wave Occupancy:
|
||||
avg: |
|
||||
AVG(SPI_CSQ_P0_OCCUPANCY + SPI_CSQ_P1_OCCUPANCY + SPI_CSQ_P2_OCCUPANCY + SPI_CSQ_P3_OCCUPANCY)
|
||||
min: |
|
||||
MIN(SPI_CSQ_P0_OCCUPANCY + SPI_CSQ_P1_OCCUPANCY + SPI_CSQ_P2_OCCUPANCY + SPI_CSQ_P3_OCCUPANCY)
|
||||
max: |
|
||||
MAX(SPI_CSQ_P0_OCCUPANCY + SPI_CSQ_P1_OCCUPANCY + SPI_CSQ_P2_OCCUPANCY + SPI_CSQ_P3_OCCUPANCY)
|
||||
unit: Wave
|
||||
- Scheduler-Pipe Wave Utilization:
|
||||
avg: |
|
||||
AVG(100 * (SPI_CSC_WAVE_CNT_BUSY) / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
|
||||
min: |
|
||||
MIN(100 * (SPI_CSC_WAVE_CNT_BUSY) / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
|
||||
max: |
|
||||
MAX(100 * (SPI_CSC_WAVE_CNT_BUSY) / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
|
||||
unit: Pct
|
||||
- metric_table:
|
||||
id: 602
|
||||
title: Workgroup Manager - Resource Allocation
|
||||
metrics:
|
||||
- Scheduler-Pipe FIFO Full Rate:
|
||||
avg: |
|
||||
AVG((100 * (SPI_CS0_CRAWLER_STALL + SPI_CS1_CRAWLER_STALL + SPI_CS2_CRAWLER_STALL + SPI_CS3_CRAWLER_STALL) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
min: |
|
||||
MIN((100 * (SPI_CS0_CRAWLER_STALL + SPI_CS1_CRAWLER_STALL + SPI_CS2_CRAWLER_STALL + SPI_CS3_CRAWLER_STALL) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
max: |
|
||||
MAX((100 * (SPI_CS0_CRAWLER_STALL + SPI_CS1_CRAWLER_STALL + SPI_CS2_CRAWLER_STALL + SPI_CS3_CRAWLER_STALL) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
|
||||
unit: Pct
|
||||
- Panel Config:
|
||||
id: 1000
|
||||
title: Compute Units - Instruction Mix
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 1003
|
||||
title: VMEM Instruction Mix
|
||||
metrics:
|
||||
- Spill/Stack Coalesceable Instr:
|
||||
avg: AVG((TA_BUFFER_COALESCEABLE_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_COALESCEABLE_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_COALESCEABLE_WAVEFRONTS_sum / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
- metric_table:
|
||||
id: 1004
|
||||
title: MFMA Arithmetic Instruction Mix
|
||||
metrics:
|
||||
- MFMA-F6F4:
|
||||
avg: AVG((SQ_INSTS_VALU_MFMA_F6F4 / $denom))
|
||||
min: MIN((SQ_INSTS_VALU_MFMA_F6F4 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_MFMA_F6F4 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
- Panel Config:
|
||||
id: 1100
|
||||
title: Compute Units - Compute Pipeline
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 1101
|
||||
title: Compute Speed-of-Light
|
||||
metrics:
|
||||
- MFMA FLOPs (F6F4):
|
||||
value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) / (End_Timestamp - Start_Timestamp)))
|
||||
unit: GFLOP
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 16384) / 1000)
|
||||
pop: |
|
||||
((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 16384) / 1000))
|
||||
- metric_table:
|
||||
id: 1102
|
||||
title: Pipeline Statistics
|
||||
metrics:
|
||||
- VALU Co-Issue Efficiency:
|
||||
avg: AVG((100 * SQ_ACTIVE_INST_VALU2) / (SQ_ACTIVE_INST_VALU - SQ_ACTIVE_INST_VALU2))
|
||||
min: MIN((100 * SQ_ACTIVE_INST_VALU2) / (SQ_ACTIVE_INST_VALU - SQ_ACTIVE_INST_VALU2))
|
||||
max: MAX((100 * SQ_ACTIVE_INST_VALU2) / (SQ_ACTIVE_INST_VALU - SQ_ACTIVE_INST_VALU2))
|
||||
unit: pct
|
||||
- metric_table:
|
||||
id: 1103
|
||||
title: Arithmetic Operations
|
||||
metrics:
|
||||
- F6F4 OPs:
|
||||
avg: AVG((512 * SQ_INSTS_VALU_MFMA_MOPS_F6F4) / $denom)
|
||||
min: MIN((512 * SQ_INSTS_VALU_MFMA_MOPS_F6F4) / $denom)
|
||||
max: MAX((512 * SQ_INSTS_VALU_MFMA_MOPS_F6F4) / $denom)
|
||||
unit: (OPs + $normUnit)
|
||||
- Panel Config:
|
||||
id: 1200
|
||||
title: Local Data Share (LDS)
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 1202
|
||||
title: LDS Statistics
|
||||
metrics:
|
||||
- LDS ATOMIC Bandwidth:
|
||||
avg: AVG(64 * SQ_INSTS_LDS_ATOMIC_BANDWIDTH / (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(64 * SQ_INSTS_LDS_ATOMIC_BANDWIDTH / (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(64 * SQ_INSTS_LDS_ATOMIC_BANDWIDTH / (End_Timestamp - Start_Timestamp))
|
||||
units: Gbps
|
||||
- LDS LOAD:
|
||||
avg: AVG((SQ_INSTS_LDS_LOAD / $denom))
|
||||
min: MIN((SQ_INSTS_LDS_LOAD / $denom))
|
||||
max: MAX((SQ_INSTS_LDS_LOAD / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
- LDS STORE:
|
||||
avg: AVG((SQ_INSTS_LDS_STORE / $denom))
|
||||
min: MIN((SQ_INSTS_LDS_STORE / $denom))
|
||||
max: MAX((SQ_INSTS_LDS_STORE / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
- LDS STORE Bandwidth:
|
||||
avg: AVG(64 * SQ_INSTS_LDS_STORE_BANDWIDTH / (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(64 * SQ_INSTS_LDS_STORE_BANDWIDTH / (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(64 * SQ_INSTS_LDS_STORE_BANDWIDTH / (End_Timestamp - Start_Timestamp))
|
||||
units: Gbps
|
||||
- LDS LOAD Bandwidth:
|
||||
avg: AVG(64 * SQ_INSTS_LDS_LOAD_BANDWIDTH / (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(64 * SQ_INSTS_LDS_LOAD_BANDWIDTH / (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(64 * SQ_INSTS_LDS_LOAD_BANDWIDTH / (End_Timestamp - Start_Timestamp))
|
||||
units: Gbps
|
||||
- LDS Command FIFO Full Rate:
|
||||
avg: AVG((SQ_LDS_CMD_FIFO_FULL / $denom))
|
||||
min: MIN((SQ_LDS_CMD_FIFO_FULL / $denom))
|
||||
max: MAX((SQ_LDS_CMD_FIFO_FULL / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
- LDS ATOMIC:
|
||||
avg: AVG((SQ_INSTS_LDS_ATOMIC / $denom))
|
||||
min: MIN((SQ_INSTS_LDS_ATOMIC / $denom))
|
||||
max: MAX((SQ_INSTS_LDS_ATOMIC / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
- LDS Data FIFO Full Rate:
|
||||
avg: AVG((SQ_LDS_DATA_FIFO_FULL / $denom))
|
||||
min: MIN((SQ_LDS_DATA_FIFO_FULL / $denom))
|
||||
max: MAX((SQ_LDS_DATA_FIFO_FULL / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
- Panel Config:
|
||||
id: 1500
|
||||
title: Address Processing Unit and Data Return Path (TA/TD)
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 1504
|
||||
title: Vector L1 data-return path or Texture Data (TD)
|
||||
metrics:
|
||||
- Write Ack Instructions:
|
||||
avg: AVG((TD_WRITE_ACKT_WAVEFRONT_sum / $denom))
|
||||
min: MIN((TD_WRITE_ACKT_WAVEFRONT_sum / $denom))
|
||||
max: MAX((TD_WRITE_ACKT_WAVEFRONT_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
- metric_table:
|
||||
id: 1502
|
||||
title: Instruction counts
|
||||
metrics:
|
||||
- Global/Generic Read Instructions for LDS:
|
||||
avg: AVG((TA_FLAT_READ_LDS_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_FLAT_READ_LDS_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_FLAT_READ_LDS_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
- Spill/Stack Read Instructions for LDS:
|
||||
avg: AVG((TA_BUFFER_READ_LDS_WAVEFRONTS_sum / $denom))
|
||||
min: MIN((TA_BUFFER_READ_LDS_WAVEFRONTS_sum / $denom))
|
||||
max: MAX((TA_BUFFER_READ_LDS_WAVEFRONTS_sum / $denom))
|
||||
unit: (Instructions + $normUnit)
|
||||
- Panel Config:
|
||||
id: 1600
|
||||
title: Vector L1 Data Cache
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 1602
|
||||
title: vL1D cache stall metrics
|
||||
metrics:
|
||||
- Stalled on Request FIFO:
|
||||
expr: |
|
||||
(((100 * TCP_RFIFO_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum != 0) else None)
|
||||
- Stalled on Latency FIFO:
|
||||
expr: |
|
||||
(((100 * TCP_LFIFO_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum != 0) else None)
|
||||
- Stalled on Address:
|
||||
expr: |
|
||||
(((100 * TCP_TCP_TA_ADDR_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum != 0) else None)
|
||||
- Stalled on Read Return:
|
||||
expr: |
|
||||
(((100 * TCP_TCR_RDRET_STALL_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum != 0) else None)
|
||||
- Stalled on Data:
|
||||
expr: |
|
||||
(((100 * TCP_TCP_TA_DATA_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum != 0) else None)
|
||||
- metric_table:
|
||||
id: 1603
|
||||
title: vL1D cache access metrics
|
||||
metrics:
|
||||
- Tag RAM 2 Req:
|
||||
avg: AVG((TCP_TAGRAM2_REQ_sum / $denom))
|
||||
min: MIN((TCP_TAGRAM2_REQ_sum / $denom))
|
||||
max: MAX((TCP_TAGRAM2_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
- Tag RAM 0 Req:
|
||||
avg: AVG((TCP_TAGRAM0_REQ_sum / $denom))
|
||||
min: MIN((TCP_TAGRAM0_REQ_sum / $denom))
|
||||
max: MAX((TCP_TAGRAM0_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
- Tag RAM 3 Req:
|
||||
avg: AVG((TCP_TAGRAM3_REQ_sum / $denom))
|
||||
min: MIN((TCP_TAGRAM3_REQ_sum / $denom))
|
||||
max: MAX((TCP_TAGRAM3_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
- Tag RAM 1 Req:
|
||||
avg: AVG((TCP_TAGRAM1_REQ_sum / $denom))
|
||||
min: MIN((TCP_TAGRAM1_REQ_sum / $denom))
|
||||
max: MAX((TCP_TAGRAM1_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
- L1 Access Latency:
|
||||
avg: AVG((TCP_TCP_LATENCY_sum / $denom))
|
||||
min: MIN((TCP_TCP_LATENCY_sum / $denom))
|
||||
max: MAX((TCP_TCP_LATENCY_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
- L1-L2 Read Latency:
|
||||
avg: AVG((TCP_TCC_READ_REQ_LATENCY_sum / $denom))
|
||||
min: MIN((TCP_TCC_READ_REQ_LATENCY_sum / $denom))
|
||||
max: MAX((TCP_TCC_READ_REQ_LATENCY_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
- L1-L2 Write Latency:
|
||||
avg: AVG((TCP_TCC_WRITE_REQ_LATENCY_sum / $denom))
|
||||
min: MIN((TCP_TCC_WRITE_REQ_LATENCY_sum / $denom))
|
||||
max: MAX((TCP_TCC_WRITE_REQ_LATENCY_sum / $denom))
|
||||
unit: (Cycles + $normUnit)
|
||||
- metric_table:
|
||||
id: 1605
|
||||
title: L1 Unified Translation Cache (UTCL1)
|
||||
metrics:
|
||||
- Misses under Translation Miss:
|
||||
avg: AVG((TCP_UTCL1_TRANSLATION_MISS_UNDER_MISS_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_TRANSLATION_MISS_UNDER_MISS_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_TRANSLATION_MISS_UNDER_MISS_sum / $denom))
|
||||
units: (Req + $normUnit)
|
||||
- Inflight Req:
|
||||
avg: AVG((TCP_CLIENT_UTCL1_INFLIGHT_sum / $denom))
|
||||
min: MIN((TCP_CLIENT_UTCL1_INFLIGHT_sum / $denom))
|
||||
max: MAX((TCP_CLIENT_UTCL1_INFLIGHT_sum / $denom))
|
||||
units: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1606
|
||||
title: L1D Addr Translation Stalls
|
||||
metrics:
|
||||
- Serialization Stall:
|
||||
avg: AVG((TCP_UTCL1_SERIALIZATION_STALL_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_SERIALIZATION_STALL_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_SERIALIZATION_STALL_sum / $denom))
|
||||
units: (Cycles + $normUnit)
|
||||
- Cache Full Stall:
|
||||
avg: AVG((TCP_UTCL1_STALL_INFLIGHT_MAX_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_STALL_INFLIGHT_MAX_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_STALL_INFLIGHT_MAX_sum / $denom))
|
||||
units: (Cycles + $normUnit)
|
||||
- Resident Page Full Stall:
|
||||
avg: AVG((TCP_UTCL1_STALL_LFIFO_NO_RES_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_STALL_LFIFO_NO_RES_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_STALL_LFIFO_NO_RES_sum / $denom))
|
||||
units: (Cycles + $normUnit)
|
||||
- UTCL2 Stall:
|
||||
avg: AVG((TCP_UTCL1_STALL_UTCL2_REQ_OUT_OF_CREDITS_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_STALL_UTCL2_REQ_OUT_OF_CREDITS_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_STALL_UTCL2_REQ_OUT_OF_CREDITS_sum / $denom))
|
||||
units: (Cycles + $normUnit)
|
||||
- Latency FIFO Stall:
|
||||
avg: AVG((TCP_UTCL1_LFIFO_FULL_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_LFIFO_FULL_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_LFIFO_FULL_sum / $denom))
|
||||
units: (Cycles + $normUnit)
|
||||
- Thrashing Stall:
|
||||
avg: AVG((TCP_UTCL1_THRASHING_STALL_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_THRASHING_STALL_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_THRASHING_STALL_sum / $denom))
|
||||
units: (Cycles + $normUnit)
|
||||
- Cache Miss Stall:
|
||||
avg: AVG((TCP_UTCL1_STALL_MULTI_MISS_sum / $denom))
|
||||
min: MIN((TCP_UTCL1_STALL_MULTI_MISS_sum / $denom))
|
||||
max: MAX((TCP_UTCL1_STALL_MULTI_MISS_sum / $denom))
|
||||
units: (Cycles + $normUnit)
|
||||
- Panel Config:
|
||||
id: 1700
|
||||
title: L2 Cache
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 1702
|
||||
title: L2-Fabric interface metrics
|
||||
metrics:
|
||||
- Read Stall:
|
||||
avg: |
|
||||
AVG((((100 * ((TCC_EA0_RDREQ_IO_CREDIT_STALL_sum + TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum) + TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum)) / TCC_BUSY_sum) if (TCC_BUSY_sum != 0) else None))
|
||||
min: |
|
||||
MIN((((100 * ((TCC_EA0_RDREQ_IO_CREDIT_STALL_sum + TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum) + TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum)) / TCC_BUSY_sum) if (TCC_BUSY_sum != 0) else None))
|
||||
max: |
|
||||
MAX((((100 * ((TCC_EA0_RDREQ_IO_CREDIT_STALL_sum + TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum) + TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum)) / TCC_BUSY_sum) if (TCC_BUSY_sum != 0) else None))
|
||||
unit: pct
|
||||
- Write Stall:
|
||||
avg: |
|
||||
AVG(((100 * (TCC_EA0_WRREQ_STALL_sum) / TCC_BUSY_sum) if (TCC_BUSY_sum != 0) else None))
|
||||
min: |
|
||||
MIN(((100 * (TCC_EA0_WRREQ_STALL_sum) / TCC_BUSY_sum) if (TCC_BUSY_sum != 0) else None))
|
||||
max: |
|
||||
MAX(((100 * (TCC_EA0_WRREQ_STALL_sum) / TCC_BUSY_sum) if (TCC_BUSY_sum != 0) else None))
|
||||
unit: pct
|
||||
- metric_table:
|
||||
id: 1703
|
||||
title: L2 Cache Accesses
|
||||
metrics:
|
||||
- Atomic Bandwidth:
|
||||
avg: AVG(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_ATOMIC_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
- Input Buffer Req:
|
||||
avg: AVG((TCC_IB_REQ_sum / $denom))
|
||||
min: MIN((TCC_IB_REQ_sum / $denom))
|
||||
max: MAX((TCC_IB_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
- Write Bandwidth:
|
||||
avg: AVG(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_WRITE_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
- Read Bandwidth:
|
||||
avg: AVG(TCC_READ_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_READ_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_READ_SECTORS_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
- Bypass Req:
|
||||
avg: AVG((TCC_BYPASS_REQ_sum / $denom))
|
||||
min: MIN((TCC_BYPASS_REQ_sum / $denom))
|
||||
max: MAX((TCC_BYPASS_REQ_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
- metric_table:
|
||||
id: 1704
|
||||
title: L2 Cache Stalls
|
||||
metrics:
|
||||
- Input Buffer Stalled on L2:
|
||||
avg: AVG(TCC_IB_STALL_sum / $denom)
|
||||
min: MIN(TCC_IB_STALL_sum / $denom)
|
||||
max: MAX(TCC_IB_STALL_sum / $denom)
|
||||
unit: (Cycles + $normUnit)
|
||||
- Stalled on Latency FIFO:
|
||||
avg: AVG(TCC_LATENCY_FIFO_FULL_sum / $denom)
|
||||
min: MIN(TCC_LATENCY_FIFO_FULL_sum / $denom)
|
||||
max: MAX(TCC_LATENCY_FIFO_FULL_sum / $denom)
|
||||
unit: (Cycles + $normUnit)
|
||||
- Stalled on Write Data FIFO:
|
||||
avg: AVG(TCC_SRC_FIFO_FULL_sum / $denom)
|
||||
min: MIN(TCC_SRC_FIFO_FULL_sum / $denom)
|
||||
max: MAX(TCC_SRC_FIFO_FULL_sum / $denom)
|
||||
unit: (Cycles + $normUnit)
|
||||
- metric_table:
|
||||
id: 1705
|
||||
title: L2 - Fabric Interface stalls
|
||||
metrics:
|
||||
- Read - HBM Stall:
|
||||
type: HBM Stall
|
||||
transaction: Read
|
||||
avg: |
|
||||
AVG(((100 * (TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
min: |
|
||||
MIN(((100 * (TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
max: |
|
||||
MAX(((100 * (TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
unit: pct
|
||||
- Read - Infinity Fabric Stall:
|
||||
type: Infinity Fabric™ Stall
|
||||
transaction: Read
|
||||
avg: |
|
||||
AVG(((100 * (TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
min: |
|
||||
MIN(((100 * (TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
max: |
|
||||
MAX(((100 * (TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
unit: pct
|
||||
- Write - PCIe Stall:
|
||||
type: PCIe Stall
|
||||
transaction: Write
|
||||
avg: |
|
||||
AVG(((100 * (TCC_EA0_WRREQ_IO_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
min: |
|
||||
MIN(((100 * (TCC_EA0_WRREQ_IO_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
max: |
|
||||
MAX(((100 * (TCC_EA0_WRREQ_IO_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
unit: pct
|
||||
- Read - PCIe Stall:
|
||||
type: PCIe Stall
|
||||
transaction: Read
|
||||
avg: |
|
||||
AVG(((100 * (TCC_EA0_RDREQ_IO_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
min: |
|
||||
MIN(((100 * (TCC_EA0_RDREQ_IO_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
max: |
|
||||
MAX(((100 * (TCC_EA0_RDREQ_IO_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
unit: pct
|
||||
- Write - Infinity Fabric Stall:
|
||||
type: Infinity Fabric™ Stall
|
||||
transaction: Write
|
||||
avg: |
|
||||
AVG(((100 * (TCC_EA0_WRREQ_GMI_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
min: |
|
||||
MIN(((100 * (TCC_EA0_WRREQ_GMI_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
max: |
|
||||
MAX(((100 * (TCC_EA0_WRREQ_GMI_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
unit: pct
|
||||
- Write - HBM Stall:
|
||||
type: HBM Stall
|
||||
transaction: Write
|
||||
avg: |
|
||||
AVG(((100 * (TCC_EA0_WRREQ_DRAM_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
min: |
|
||||
MIN(((100 * (TCC_EA0_WRREQ_DRAM_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
max: |
|
||||
MAX(((100 * (TCC_EA0_WRREQ_DRAM_CREDIT_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
|
||||
unit: pct
|
||||
- metric_table:
|
||||
id: 1706
|
||||
title: L2 - Fabric interface detailed metrics
|
||||
metrics:
|
||||
- Write Bandwidth - HBM:
|
||||
avg: AVG(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_WRREQ_WRITE_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
- Read (128B):
|
||||
avg: AVG((TCC_EA0_RDREQ_128B_sum / $denom))
|
||||
min: MIN((TCC_EA0_RDREQ_128B_sum / $denom))
|
||||
max: MAX((TCC_EA0_RDREQ_128B_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
- Atomic - HBM:
|
||||
avg: AVG((TCC_EA0_WRREQ_ATOMIC_DRAM_sum / $denom))
|
||||
min: MIN((TCC_EA0_WRREQ_ATOMIC_DRAM_sum / $denom))
|
||||
max: MAX((TCC_EA0_WRREQ_ATOMIC_DRAM_sum / $denom))
|
||||
unit: (Req + $normUnit)
|
||||
- Read Bandwidth - PCIe:
|
||||
avg: AVG(TCC_EA0_RDREQ_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_RDREQ_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_RDREQ_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
- Atomic Bandwidth - HBM:
|
||||
avg: AVG(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_WRREQ_ATOMIC_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
- Read Bandwidth - Infinity Fabric™:
|
||||
avg: AVG(TCC_EA0_RDREQ_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_RDREQ_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_RDREQ_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
- Write Bandwidth - PCIe:
|
||||
avg: AVG(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_WRREQ_WRITE_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
- Atomic Bandwidth - PCIe:
|
||||
avg: AVG(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_WRREQ_ATOMIC_IO_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
- Write Bandwidth - Infinity Fabric™:
|
||||
avg: AVG(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_WRREQ_WRITE_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
- Atomic Bandwidth - Infinity Fabric™:
|
||||
avg: AVG(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_WRREQ_ATOMIC_GMI_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
- Read Bandwidth - HBM:
|
||||
avg: AVG(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
min: MIN(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
max: MAX(TCC_EA0_RDREQ_DRAM_32B_sum * 32/ (End_Timestamp - Start_Timestamp))
|
||||
unit: Gbps
|
||||
|
||||
Deletion:
|
||||
[]
|
||||
|
||||
Modification:
|
||||
- Panel Config:
|
||||
id: 200
|
||||
title: System Speed-of-Light
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 201
|
||||
title: System Speed-of-Light
|
||||
metrics:
|
||||
- MFMA IOPs (Int8):
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 8192) / 1000)
|
||||
pop: |
|
||||
((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 8192) / 1000))
|
||||
- MFMA FLOPs (F16):
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
|
||||
pop: |
|
||||
((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 4096) / 1000))
|
||||
- MFMA FLOPs (F8):
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 8192) / 1000)
|
||||
unit: GFLOP/s
|
||||
pop: |
|
||||
((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 8192) / 1000))
|
||||
- MFMA FLOPs (F64):
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 128) / 1000)
|
||||
pop: |
|
||||
((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 128) / 1000))
|
||||
- MFMA FLOPs (BF16):
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
|
||||
pop: |
|
||||
((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 4096) / 1000))
|
||||
- Panel Config:
|
||||
id: 300
|
||||
title: Memory Chart
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 301
|
||||
title: Memory Chart
|
||||
metrics:
|
||||
- Workgroups:
|
||||
value: |
|
||||
ROUND(AVG(SPI_CS0_NUM_THREADGROUPS + SPI_CS1_NUM_THREADGROUPS + SPI_CS2_NUM_THREADGROUPS + SPI_CS3_NUM_THREADGROUPS), 0)
|
||||
- Wavefronts:
|
||||
value: ROUND(AVG(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE), 0)
|
||||
- Panel Config:
|
||||
id: 400
|
||||
title: Roofline
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 402
|
||||
title: Roofline Plot Points
|
||||
metrics:
|
||||
- AI L2:
|
||||
value: |
|
||||
( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / SUM( (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64 ) )
|
||||
- AI HBM:
|
||||
value: |
|
||||
( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / SUM( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum * 64) ) )
|
||||
- AI L1:
|
||||
value: |
|
||||
( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / SUM(TCP_TOTAL_CACHE_ACCESSES_sum * 64) )
|
||||
- Performance (GFLOPs):
|
||||
value: |
|
||||
( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9
|
||||
- Panel Config:
|
||||
id: 600
|
||||
title: Workgroup Manager (SPI)
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 601
|
||||
title: Workgroup manager utilizations
|
||||
metrics:
|
||||
- Dispatched Workgroups:
|
||||
max: |
|
||||
MAX(SPI_CS0_NUM_THREADGROUPS + SPI_CS1_NUM_THREADGROUPS + SPI_CS2_NUM_THREADGROUPS + SPI_CS3_NUM_THREADGROUPS)
|
||||
avg: |
|
||||
AVG(SPI_CS0_NUM_THREADGROUPS + SPI_CS1_NUM_THREADGROUPS + SPI_CS2_NUM_THREADGROUPS + SPI_CS3_NUM_THREADGROUPS)
|
||||
min: |
|
||||
MIN(SPI_CS0_NUM_THREADGROUPS + SPI_CS1_NUM_THREADGROUPS + SPI_CS2_NUM_THREADGROUPS + SPI_CS3_NUM_THREADGROUPS)
|
||||
- VGPR Writes:
|
||||
max: |
|
||||
MAX((((SPI_VWC0_VDATA_VALID_WR + SPI_VWC1_VDATA_VALID_WR) / (SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)) if ((SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE) != 0) else None))
|
||||
avg: |
|
||||
AVG((((SPI_VWC0_VDATA_VALID_WR + SPI_VWC1_VDATA_VALID_WR) / (SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)) if ((SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE) != 0) else None))
|
||||
min: |
|
||||
MIN((((SPI_VWC0_VDATA_VALID_WR + SPI_VWC1_VDATA_VALID_WR) / (SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)) if ((SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE) != 0) else None))
|
||||
- Scheduler-Pipe Utilization:
|
||||
max: |
|
||||
MAX(100 * (SPI_CS0_BUSY + SPI_CS1_BUSY + SPI_CS2_BUSY + SPI_CS3_BUSY) / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
|
||||
avg: |
|
||||
AVG(100 * (SPI_CS0_BUSY + SPI_CS1_BUSY + SPI_CS2_BUSY + SPI_CS3_BUSY) / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
|
||||
min: |
|
||||
MIN(100 * (SPI_CS0_BUSY + SPI_CS1_BUSY + SPI_CS2_BUSY + SPI_CS3_BUSY) / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
|
||||
- SGPR Writes:
|
||||
max: |
|
||||
MAX((((1 * SPI_SWC_CSC_WR) / (SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)) if ((SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE) != 0) else None))
|
||||
avg: |
|
||||
AVG((((1 * SPI_SWC_CSC_WR) / (SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)) if ((SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE) != 0) else None))
|
||||
min: |
|
||||
MIN((((1 * SPI_SWC_CSC_WR) / (SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)) if ((SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE) != 0) else None))
|
||||
- Dispatched Wavefronts:
|
||||
max: MAX(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)
|
||||
avg: AVG(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)
|
||||
min: MIN(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)
|
||||
- Panel Config:
|
||||
id: 700
|
||||
title: Wavefront
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 701
|
||||
title: Wavefront Launch Stats
|
||||
metrics:
|
||||
- Total Wavefronts:
|
||||
max: MAX(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)
|
||||
avg: AVG(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)
|
||||
min: MIN(SPI_CS0_WAVE + SPI_CS1_WAVE + SPI_CS2_WAVE + SPI_CS3_WAVE)
|
||||
- Panel Config:
|
||||
id: 1100
|
||||
title: Compute Units - Compute Pipeline
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 1101
|
||||
title: Compute Speed-of-Light
|
||||
metrics:
|
||||
- MFMA FLOPs (F8):
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 8192) / 1000)
|
||||
pop: |
|
||||
((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 8192) / 1000))
|
||||
- MFMA FLOPs (F64):
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 128) / 1000)
|
||||
pop: |
|
||||
((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 128) / 1000))
|
||||
- MFMA FLOPs (BF16):
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
|
||||
pop: |
|
||||
((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 4096) / 1000))
|
||||
- MFMA IOPs (INT8):
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 8192) / 1000)
|
||||
pop: |
|
||||
((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 8192) / 1000))
|
||||
- MFMA FLOPs (F16):
|
||||
peak: ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
|
||||
pop: |
|
||||
((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 4096) / 1000))
|
||||
- metric_table:
|
||||
id: 1103
|
||||
title: Arithmetic Operations
|
||||
metrics:
|
||||
- FLOPs (Total):
|
||||
max: |
|
||||
MAX((((((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_FMA_F16 * 2))) + ((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F16) + (512 * SQ_INSTS_VALU_MFMA_MOPS_BF16))) + (64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_FMA_F32 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32)) + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64) + (SQ_INSTS_VALU_FMA_F64 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F6F4)) / $denom))
|
||||
avg: |
|
||||
AVG((((((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_FMA_F16 * 2))) + ((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F16) + (512 * SQ_INSTS_VALU_MFMA_MOPS_BF16))) + (64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_FMA_F32 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32)) + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64) + (SQ_INSTS_VALU_FMA_F64 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F6F4)) / $denom))
|
||||
min: |
|
||||
MIN((((((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_FMA_F16 * 2))) + ((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F16) + (512 * SQ_INSTS_VALU_MFMA_MOPS_BF16))) + (64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_FMA_F32 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32)) + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64) + (SQ_INSTS_VALU_FMA_F64 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F6F4)) / $denom))
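Reviewer note: the three `FLOPs (Total)` expressions differ only in the outer `MAX`/`AVG`/`MIN`, and the nested parentheses hide a fairly regular sum. The sketch below restates the numerator under stated assumptions: counters arrive as a plain dict, a missing counter counts as 0 (an assumption of this sketch, not confirmed by the source), and the helper names are hypothetical. The division by `$denom` is left to the caller.

```python
VALU_PRECISIONS = ("F16", "F32", "F64")
MFMA_PRECISIONS = ("F8", "F16", "BF16", "F32", "F64", "F6F4")


def total_flops(counters: dict) -> int:
    """Numerator of FLOPs (Total): VALU terms weighted by 64, MFMA MOPS by 512."""

    def get(name: str) -> int:
        # Assumption: treat an absent counter as zero.
        return counters.get(name, 0)

    valu = sum(
        64 * (
            get(f"SQ_INSTS_VALU_ADD_{p}")
            + get(f"SQ_INSTS_VALU_MUL_{p}")
            + get(f"SQ_INSTS_VALU_TRANS_{p}")
            + 2 * get(f"SQ_INSTS_VALU_FMA_{p}")  # an FMA counts as two operations
        )
        for p in VALU_PRECISIONS
    )
    mfma = sum(512 * get(f"SQ_INSTS_VALU_MFMA_MOPS_{p}") for p in MFMA_PRECISIONS)
    return valu + mfma


sample = {
    "SQ_INSTS_VALU_ADD_F32": 1,
    "SQ_INSTS_VALU_FMA_F32": 1,
    "SQ_INSTS_VALU_MFMA_MOPS_F16": 1,
}
print(total_flops(sample))
```

Writing the sum this way makes it easy to confirm that every MFMA precision in the YAML expression (F8, F16, BF16, F32, F64, F6F4) is included exactly once.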
|
||||
- Panel Config:
|
||||
id: 1700
|
||||
title: L2 Cache
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 1701
|
||||
title: L2 Speed-of-Light
|
||||
metrics:
|
||||
- L2-Fabric Read BW:
|
||||
value: |
|
||||
AVG((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) + (TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp)))
|
||||
- metric_table:
|
||||
id: 1702
|
||||
title: L2-Fabric interface metrics
|
||||
metrics:
|
||||
- Read BW:
|
||||
max: |
|
||||
MAX((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) + (TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp)))
|
||||
avg: |
|
||||
AVG((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) + (TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp)))
|
||||
min: |
|
||||
MIN((((TCC_EA0_RDREQ_32B_sum * 32) + (TCC_EA0_RDREQ_64B_sum * 64) + (TCC_EA0_RDREQ_128B_sum * 128)) / (End_Timestamp - Start_Timestamp)))
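Reviewer note: `Read BW` weights each request counter by its request size and divides by the dispatch duration. A minimal sketch of that arithmetic follows, assuming one tuple of counters per dispatch; the function names and sample numbers are hypothetical, and the units depend on the timestamp resolution, which this excerpt does not state.

```python
from statistics import mean


def read_bytes(rdreq_32b: int, rdreq_64b: int, rdreq_128b: int) -> int:
    """Bytes read: each request counter multiplied by its request size."""
    return rdreq_32b * 32 + rdreq_64b * 64 + rdreq_128b * 128


def read_bw(rdreq_32b: int, rdreq_64b: int, rdreq_128b: int, duration: float) -> float:
    """Per-dispatch bandwidth: bytes / (End_Timestamp - Start_Timestamp)."""
    return read_bytes(rdreq_32b, rdreq_64b, rdreq_128b) / duration


# Hypothetical dispatches: (32B requests, 64B requests, 128B requests, duration).
dispatches = [(1, 1, 1, 2.0), (0, 4, 0, 4.0)]
per_dispatch = [read_bw(*d) for d in dispatches]
summary = {"max": max(per_dispatch), "avg": mean(per_dispatch), "min": min(per_dispatch)}
print(summary)
```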
|
||||
- Remote Read Traffic:
|
||||
max: |
|
||||
MAX((100 * (MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum != 0) else None))
|
||||
avg: |
|
||||
AVG((100 * (MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum != 0) else None))
|
||||
min: |
|
||||
MIN((100 * (MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum != 0) else None))
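Reviewer note: `Remote Read Traffic` combines two guards, a clamp at zero and a division guard, inside a single conditional expression. The sketch below spells out how the expression reads under Python's conditional-expression rules; the function name is hypothetical.

```python
from typing import Optional


def remote_read_percent(rdreq_total: int, rdreq_dram: int) -> Optional[float]:
    """Percent of fabric reads that did not target local DRAM.

    Returns None when there were no read requests at all, matching the
    `... if (TCC_EA0_RDREQ_sum != 0) else None` guard in the expression.
    """
    if rdreq_total == 0:
        return None
    # The inner MAX(..., 0) keeps the result non-negative if the DRAM counter
    # ever exceeds the total.
    return 100 * max(rdreq_total - rdreq_dram, 0) / rdreq_total


print(remote_read_percent(10, 4))
print(remote_read_percent(0, 0))
```

Because the guard yields `None` rather than 0, how dispatches without reads affect the outer `MAX`/`AVG`/`MIN` depends on how the evaluator treats missing values, which this excerpt does not show.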
|
||||
- metric_table:
|
||||
id: 1706
|
||||
title: L2 - Fabric interface detailed metrics
|
||||
metrics:
|
||||
- HBM Write and Atomic:
|
||||
max: MAX((TCC_EA0_WRREQ_WRITE_DRAM_sum / $denom))
|
||||
avg: AVG((TCC_EA0_WRREQ_WRITE_DRAM_sum / $denom))
|
||||
min: MIN((TCC_EA0_WRREQ_WRITE_DRAM_sum / $denom))
|
||||
- Read (64B):
|
||||
max: MAX((TCC_EA0_RDREQ_64B_sum / $denom))
|
||||
avg: AVG((TCC_EA0_RDREQ_64B_sum / $denom))
|
||||
min: MIN((TCC_EA0_RDREQ_64B_sum / $denom))
|
||||
- Panel Config:
|
||||
id: 1800
|
||||
title: L2 Cache (per Channel)
|
||||
metric_tables:
|
||||
- metric_table:
|
||||
id: 1809
|
||||
title: L2-Fabric Read Stall (Cycles per normUnit)
|
||||
metrics:
|
||||
- ::_1:
|
||||
ea read stall - pcie: AVG((TO_INT(TCC_EA0_RDREQ_IO_CREDIT_STALL[::_1]) / $denom))
|
||||
ea read stall - hbm: AVG((TO_INT(TCC_EA0_RDREQ_DRAM_CREDIT_STALL[::_1]) / $denom))
|
||||
ea read stall - if: AVG((TO_INT(TCC_EA0_RDREQ_GMI_CREDIT_STALL[::_1]) / $denom))
|
||||
- metric_table:
|
||||
id: 1810
|
||||
title: L2-Fabric Write and Atomic Stall (Cycles per normUnit)
|
||||
metrics:
|
||||
- ::_1:
|
||||
ea write stall - pcie: AVG((TO_INT(TCC_EA0_WRREQ_IO_CREDIT_STALL[::_1]) / $denom))
|
||||
ea write stall - if: AVG((TO_INT(TCC_EA0_WRREQ_GMI_CREDIT_STALL[::_1]) / $denom))
|
||||
ea write stall - hbm: AVG((TO_INT(TCC_EA0_WRREQ_DRAM_CREDIT_STALL[::_1]) / $denom))
|
||||
+1
-1
@@ -2,7 +2,6 @@
|
||||
Panel Config:
|
||||
id: 0
|
||||
title: Top Stats
|
||||
metrics_description: {}
|
||||
data source:
|
||||
- raw_csv_table:
|
||||
id: 1
|
||||
@@ -12,3 +11,4 @@ Panel Config:
|
||||
id: 2
|
||||
title: Dispatch List
|
||||
source: pmc_dispatch_info.csv
|
||||
metrics_description: {}
|
||||
|
||||
+1
-1
@@ -2,10 +2,10 @@
|
||||
Panel Config:
|
||||
id: 100
|
||||
title: System Info
|
||||
metrics_description: {}
|
||||
data source:
|
||||
- raw_csv_table:
|
||||
id: 101
|
||||
title: System Info
|
||||
source: sysinfo.csv
|
||||
columnwise: true
|
||||
metrics_description: {}
|
||||
|
||||
+127
-118
@@ -2,124 +2,6 @@
|
||||
Panel Config:
|
||||
id: 200
|
||||
title: System Speed-of-Light
|
||||
metrics_description:
|
||||
VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
|
||||
This is also presented as a percent of the peak theoretical FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any floating-point
|
||||
operations from MFMA instructions.'
|
||||
VALU IOPs: 'The total integer operations executed per second on the VALU. This
|
||||
is also presented as a percent of the peak theoretical IOPs achievable on the
|
||||
specific accelerator. Note: this does not include any integer operations from
|
||||
MFMA instructions.'
|
||||
MFMA FLOPs (F8): The total number of 8-bit brain floating point MFMA operations
|
||||
executed per second. This does not include any 8-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F8 MFMA operations achievable on the specific accelerator. It is supported on
|
||||
AMD Instinct MI300 series and later only.
|
||||
MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
|
||||
executed per second. Note: this does not include any 16-bit brain floating point
|
||||
operations from VALU instructions. This is also presented as a percent of the
|
||||
peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F16 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 32-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F32 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 64-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F64 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
|
||||
per second. Note: this does not include any 8-bit integer operations from VALU
|
||||
instructions. This is also presented as a percent of the peak theoretical INT8
|
||||
MFMA operations achievable on the specific accelerator.'
|
||||
Active CUs: Total number of active compute units (CUs) on the accelerator during
|
||||
the kernel execution.
|
||||
SALU Utilization: Indicates what percent of the kernel's duration the SALU was
|
||||
busy executing instructions. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
|
||||
VALU Utilization: Indicates what percent of the kernel's duration the VALU was
|
||||
busy executing instructions. Does not include VMEM operations. Computed as the
|
||||
ratio of the total number of cycles spent by the scheduler issuing VALU instructions
|
||||
over the total CU cycles.
|
||||
MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
|
||||
was busy executing instructions. Computed as the ratio of the total number of
|
||||
cycles the MFMA was busy over the total CU cycles.
|
||||
VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
|
||||
was busy executing instructions, including both global/generic and spill/scratch
|
||||
operations (see the VMEM instruction count metrics for more detail). Does not
|
||||
include VALU operations. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing VMEM instructions over the total CU cycles.
|
||||
Branch Utilization: Indicates what percent of the kernel's duration the branch
|
||||
unit was busy executing instructions. Computed as the ratio of the total number
|
||||
of cycles spent by the scheduler issuing branch instructions over the total
|
||||
CU cycles.
|
||||
VALU Active Threads: Indicates the average level of divergence within a wavefront
|
||||
over the lifetime of the kernel. The number of work-items that were active in
|
||||
a wavefront during execution of each VALU instruction, time-averaged over all
|
||||
VALU instructions run on all wavefronts in the kernel.
|
||||
IPC: The ratio of the total number of instructions executed on the CU over the
|
||||
total active CU cycles. This is also presented as a percent of the peak theoretical
|
||||
IPC achievable on the specific accelerator.
|
||||
Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
|
||||
over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
||||
kernels (less than 1ms). This is also presented as a percent of the peak theoretical
|
||||
occupancy achievable on the specific accelerator.'
|
||||
Theoretical LDS Bandwidth: Indicates the maximum amount of bytes that could have
|
||||
been loaded from, stored to, or atomically updated in the LDS per unit time
|
||||
(see LDS Bandwidth example for more detail). This is also presented as a percent
|
||||
of the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
LDS Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS
|
||||
scheduler due to bank conflicts (as determined by the conflict resolution hardware)
|
||||
to the base number of cycles that would be spent in the LDS scheduler in a completely
|
||||
uncontended case. This is also presented in normalized form (i.e., the Bank
|
||||
Conflict Rate).
|
||||
vL1D Cache Hit Rate: The ratio of the number of vL1D cache line requests that
|
||||
hit in vL1D cache over the total number of cache line requests to the vL1D cache
|
||||
RAM.
|
||||
vL1D Cache BW: The number of bytes looked up in the vL1D cache as a result of
|
||||
VMEM instructions per unit time. The number of bytes is calculated as the number
|
||||
of cache lines requested multiplied by the cache line size. This value does
|
||||
not consider partial requests, so e.g., if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
This is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L2 Cache Hit Rate: The ratio of the number of L2 cache line requests that hit
|
||||
in the L2 cache over the total number of incoming cache line requests to the
|
||||
L2 cache.
|
||||
L2 Cache BW: The number of bytes looked up in the L2 cache per unit time. The
|
||||
number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line. This is also presented as a percent of
|
||||
the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Read BW: "The number of bytes read by the L2 over the Infinity Fabric\u2122\
|
||||
\ interface per unit time. This is also presented as a percent of the peak theoretical\
|
||||
\ bandwidth achievable on the specific accelerator."
|
||||
L2-Fabric Write BW: The number of bytes sent by the L2 over the Infinity Fabric
|
||||
interface by write and atomic operations per unit time. This is also presented
|
||||
as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Read Latency: The time-averaged number of cycles read requests spent
|
||||
in Infinity Fabric before data was returned to the L2.
|
||||
L2-Fabric Write Latency: The time-averaged number of cycles write requests spent
|
||||
in Infinity Fabric before a completion acknowledgement was returned to the L2.
|
||||
sL1D Cache Hit Rate: The percent of sL1D requests that hit on a previously loaded
|
||||
line in the cache. Calculated as the ratio of the number of sL1D requests that
|
||||
hit over the number of all sL1D requests.
|
||||
sL1D Cache BW: The number of bytes looked up in the sL1D cache per unit time.
|
||||
This is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L1I Hit Rate: The percent of L1I requests that hit on a previously loaded line in the cache.
Calculated as the ratio of the number of L1I requests that hit over the number
of all L1I requests.
L1I BW: The number of bytes looked up in the L1I cache per unit time. This
is also presented as a percent of the peak theoretical bandwidth achievable
on the specific accelerator.
|
||||
L1I Fetch Latency: The average number of cycles spent to fetch instructions to
|
||||
a CU.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 201
|
||||
@@ -344,3 +226,130 @@ Panel Config:
|
||||
peak: None
|
||||
pop: None
|
||||
coll_level: SQ_IFETCH_LEVEL
|
||||
metrics_description:
|
||||
VALU FLOPs: |-
|
||||
The total floating-point operations executed per second on the VALU.
|
||||
This is also presented as a percent of the peak theoretical FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any floating-point
|
||||
operations from MFMA instructions.
|
||||
VALU IOPs: |-
|
||||
The total integer operations executed per second on the VALU. This is
|
||||
also presented as a percent of the peak theoretical IOPs achievable on the
|
||||
specific accelerator. Note: this does not include any integer operations from
|
||||
MFMA instructions.
|
||||
MFMA FLOPs (F8): The total number of 8-bit brain floating point MFMA operations
|
||||
executed per second. This does not include any 8-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F8 MFMA operations achievable on the specific accelerator. It is supported on
|
||||
AMD Instinct MI300 series and later only.
|
||||
MFMA FLOPs (BF16): |-
|
||||
The total number of 16-bit brain floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit brain floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
BF16 MFMA operations achievable on the specific accelerator.
|
||||
MFMA FLOPs (F16): |-
|
||||
The total number of 16-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 16-bit floating point operations from
|
||||
VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F16 MFMA operations achievable on the specific accelerator.
|
||||
MFMA FLOPs (F32): |-
|
||||
The total number of 32-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 32-bit floating point operations from
|
||||
VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F32 MFMA operations achievable on the specific accelerator.
|
||||
MFMA FLOPs (F64): |-
|
||||
The total number of 64-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 64-bit floating point operations from
|
||||
VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F64 MFMA operations achievable on the specific accelerator.
|
||||
MFMA IOPs (Int8): |-
|
||||
The total number of 8-bit integer MFMA operations executed per second.
|
||||
Note: this does not include any 8-bit integer operations from VALU instructions.
|
||||
This is also presented as a percent of the peak theoretical INT8 MFMA operations
|
||||
achievable on the specific accelerator.
|
||||
Active CUs: Total number of active compute units (CUs) on the accelerator during
|
||||
the kernel execution.
|
||||
SALU Utilization: Indicates what percent of the kernel's duration the SALU was
|
||||
busy executing instructions. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
|
||||
VALU Utilization: Indicates what percent of the kernel's duration the VALU was
|
||||
busy executing instructions. Does not include VMEM operations. Computed as the
|
||||
ratio of the total number of cycles spent by the scheduler issuing VALU instructions
|
||||
over the total CU cycles.
|
||||
MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
|
||||
was busy executing instructions. Computed as the ratio of the total number of
|
||||
cycles the MFMA was busy over the total CU cycles.
|
||||
VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
|
||||
was busy executing instructions, including both global/generic and spill/scratch
|
||||
operations (see the VMEM instruction count metrics for more detail). Does not
|
||||
include VALU operations. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing VMEM instructions over the total CU cycles.
|
||||
Branch Utilization: Indicates what percent of the kernel's duration the branch
|
||||
unit was busy executing instructions. Computed as the ratio of the total number
|
||||
of cycles spent by the scheduler issuing branch instructions over the total
|
||||
CU cycles.
|
||||
VALU Active Threads: Indicates the average level of divergence within a wavefront
|
||||
over the lifetime of the kernel. The number of work-items that were active in
|
||||
a wavefront during execution of each VALU instruction, time-averaged over all
|
||||
VALU instructions run on all wavefronts in the kernel.
|
||||
IPC: The ratio of the total number of instructions executed on the CU over the
|
||||
total active CU cycles. This is also presented as a percent of the peak theoretical
|
||||
IPC achievable on the specific accelerator.
|
||||
Wavefront Occupancy: |-
|
||||
The time-averaged number of wavefronts resident on the accelerator over
|
||||
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
||||
kernels (less than 1ms). This is also presented as a percent of the peak theoretical
|
||||
occupancy achievable on the specific accelerator.
|
||||
Theoretical LDS Bandwidth: Indicates the maximum amount of bytes that could have
|
||||
been loaded from, stored to, or atomically updated in the LDS per unit time
|
||||
(see LDS Bandwidth example for more detail). This is also presented as a percent
|
||||
of the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
LDS Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS
|
||||
scheduler due to bank conflicts (as determined by the conflict resolution hardware)
|
||||
to the base number of cycles that would be spent in the LDS scheduler in a completely
|
||||
uncontended case. This is also presented in normalized form (i.e., the Bank
|
||||
Conflict Rate).
|
||||
vL1D Cache Hit Rate: The ratio of the number of vL1D cache line requests that
|
||||
hit in vL1D cache over the total number of cache line requests to the vL1D cache
|
||||
RAM.
|
||||
vL1D Cache BW: The number of bytes looked up in the vL1D cache as a result of
|
||||
VMEM instructions per unit time. The number of bytes is calculated as the number
|
||||
of cache lines requested multiplied by the cache line size. This value does
|
||||
not consider partial requests, so e.g., if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
This is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L2 Cache Hit Rate: The ratio of the number of L2 cache line requests that hit
|
||||
in the L2 cache over the total number of incoming cache line requests to the
|
||||
L2 cache.
|
||||
L2 Cache BW: The number of bytes looked up in the L2 cache per unit time. The
|
||||
number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line. This is also presented as a percent of
|
||||
the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Read BW: |-
|
||||
The number of bytes read by the L2 over the Infinity Fabric™ interface
|
||||
per unit time. This is also presented as a percent of the peak theoretical
|
||||
bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Write BW: The number of bytes sent by the L2 over the Infinity Fabric
|
||||
interface by write and atomic operations per unit time. This is also presented
|
||||
as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
|
||||
L2-Fabric Read Latency: The time-averaged number of cycles read requests spent
|
||||
in Infinity Fabric before data was returned to the L2.
|
||||
L2-Fabric Write Latency: The time-averaged number of cycles write requests spent
|
||||
in Infinity Fabric before a completion acknowledgement was returned to the L2.
|
||||
sL1D Cache Hit Rate: The percent of sL1D requests that hit on a previously loaded
|
||||
line in the cache. Calculated as the ratio of the number of sL1D requests that
|
||||
hit over the number of all sL1D requests.
|
||||
sL1D Cache BW: The number of bytes looked up in the sL1D cache per unit time.
|
||||
This is also presented as a percent of the peak theoretical bandwidth achievable
|
||||
on the specific accelerator.
|
||||
L1I Hit Rate: The percent of L1I requests that hit on a previously loaded line in the cache.
Calculated as the ratio of the number of L1I requests that hit over the number
of all L1I requests.
L1I BW: The number of bytes looked up in the L1I cache per unit time. This
is also presented as a percent of the peak theoretical bandwidth achievable
on the specific accelerator.
|
||||
L1I Fetch Latency: The average number of cycles spent to fetch instructions to
|
||||
a CU.
|
||||
|
||||
+117
-119
@@ -2,122 +2,6 @@
|
||||
Panel Config:
|
||||
id: 300
|
||||
title: Memory Chart
|
||||
metrics_description:
|
||||
Wavefront Occupancy: Wavefronts per active CU.
|
||||
Wave Life: Average number of cycles executing a wave.
|
||||
SALU: Total number of SALU (Scalar ALU) instructions issued per normalization
|
||||
unit.
|
||||
SMEM: Total number of SMEM (Scalar Memory Read) instructions issued per normalization
|
||||
unit.
|
||||
VALU: The number of VALU (Vector ALU) instructions issued per normalization unit.
|
||||
MFMA: Total number of MFMA (Matrix-Fused-Multiply-Add) instructions issued per
|
||||
normalization unit.
|
||||
VMEM: The number of VMEM (GPU Memory) read instructions issued (including FLAT/scratch
|
||||
memory) per normalization unit.
|
||||
LDS: The total number of LDS instructions (including, but not limited to, read/write/atomics
|
||||
and HIP's __shfl instructions) executed per normalization unit.
|
||||
GWS: Total number of GDS (global data sync) instructions issued per normalization
|
||||
unit.
|
||||
BR: Total number of BRANCH instructions issued per normalization unit.
|
||||
Active CUs: Total number of active compute units (CUs) on the accelerator during
|
||||
the kernel execution.
|
||||
Num CUs: Total number of compute units (CUs) on the accelerator.
|
||||
VGPR: 'The number of architected vector general-purpose registers allocated for
|
||||
the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
|
||||
by the compiler due to allocation granularity.'
|
||||
SGPR: 'The number of scalar general-purpose registers allocated for the kernel,
|
||||
see SALU. Note: this may not exactly match the number of SGPRs requested by
|
||||
the compiler due to allocation granularity.'
|
||||
LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
|
||||
for this kernel. Note: This may also be larger than what was requested at compile
|
||||
time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
|
||||
Scratch Allocation: The number of bytes of scratch memory requested per work-item
|
||||
for this kernel. Scratch memory is used for stack memory on the accelerator,
|
||||
as well as for register spills and restores.
|
||||
Wavefronts: The total number of wavefronts, summed over all workgroups, forming
|
||||
this kernel launch.
|
||||
Workgroups: The total number of workgroups forming this kernel launch.
|
||||
LDS Req: The total number of LDS instructions (including, but not limited to,
|
||||
read/write/atomics and HIP's __shfl instructions) executed per normalization
|
||||
unit.
|
||||
LDS Util: Indicates what percent of the kernel's duration the LDS was actively
|
||||
executing instructions (including, but not limited to, load, store, atomic and
|
||||
HIP's __shfl operations). Calculated as the ratio of the total number of cycles
|
||||
LDS was active over the total CU cycles.
|
||||
LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
|
||||
/ acknowledgment) required for an LDS instruction to complete.
|
||||
VL1 Rd: The total number of incoming read requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
VL1 Wr: The total number of incoming write requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
VL1 Atomic: The total number of incoming atomic requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
VL1 Hit: The ratio of the number of vL1D cache line requests that hit in vL1D
|
||||
cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
VL1 Lat: Calculated as the average number of cycles that a vL1D cache line request
|
||||
spent in the vL1D cache pipeline.
|
||||
VL1 Coalesce: Indicates how well memory instructions were coalesced by the address
|
||||
processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
|
||||
as the average number of thread-requests generated per instruction divided by
|
||||
the ideal number of thread-requests per instruction.
|
||||
VL1 Stall: The ratio of the number of cycles where the vL1D is stalled waiting
|
||||
to issue a request for data to the L2 cache divided by the number of cycles
|
||||
where the vL1D is active.
|
||||
VL1_L2 Rd: The number of read requests for a vL1D cache line that were not satisfied
|
||||
by the vL1D and must be retrieved from the L2 cache per normalization
|
||||
unit.
|
||||
VL1_L2 Wr: The number of write requests to a vL1D cache line that were sent through
|
||||
the vL1D to the L2 cache, per normalization unit.
|
||||
VL1_L2 Atomic: The number of atomic requests that are sent through the vL1D to
|
||||
the L2 cache, per normalization unit. This includes requests for atomics with
|
||||
and without return.
|
||||
sL1D Rd: The total number of requests, of any size or type, made to the sL1D per
|
||||
normalization unit.
|
||||
sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
|
||||
line, per normalization unit.
|
||||
sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
|
||||
unit.
|
||||
sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
sL1D_L2 Atomic: The total number of atomic requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
IL1 Fetch: The total number of requests made to the L1I per normalization-unit.
|
||||
IL1 Hit: The percent of L1I requests that hit on a previously loaded line in the
|
||||
cache. Calculated as the ratio of the number of L1I requests that hit over the
|
||||
number of all L1I requests.
|
||||
IL1 Lat: The average number of cycles spent to fetch instructions to a CU.
|
||||
IL1_L2 Rd: The total number of requests across the L1I - L2 interface per normalization-unit.
|
||||
L2 Rd: The total number of read requests to the L2 from all clients.
|
||||
L2 Wr: The total number of write requests to the L2 from all clients.
|
||||
L2 Atomic: The total number of atomic requests (with and without return) to the
|
||||
L2 from all clients.
|
||||
L2 Hit: The ratio of the number of L2 cache line requests that hit in the L2 cache
|
||||
over the total number of incoming cache line requests to the L2 cache.
|
||||
L2 Rd Lat: Calculated as the average number of cycles that the vL1D cache took
|
||||
to issue and receive read requests from the L2 Cache. This number also includes
|
||||
requests for atomics with return values.
|
||||
L2 Wr Lat: Calculated as the average number of cycles that the vL1D cache took
|
||||
to issue and receive acknowledgement of a write request to the L2 Cache. This
|
||||
number also includes requests for atomics without return values.
|
||||
Fabric_L2 Rd: Number of L2 cache - Infinity Fabric read requests (either 32-byte
|
||||
or 64-byte) summed over TCC instances per normalization unit.
|
||||
Fabric_L2 Wr: Number of L2 cache - Infinity Fabric write requests (either 32-byte
|
||||
or 64-byte) summed over TCC instances per normalization unit.
|
||||
Fabric_L2 Atomic: Number of L2 cache - Infinity Fabric write requests (either
|
||||
32-byte or 64-byte) that are actually atomic requests summed over TCC instances
|
||||
per normalization unit.
|
||||
Fabric Rd Lat: The time-averaged number of cycles read requests spent in Infinity
|
||||
Fabric before data was returned to the L2.
|
||||
Fabric Wr Lat: The time-averaged number of cycles write requests spent in Infinity
|
||||
Fabric before a completion acknowledgement was returned to the L2.
|
||||
Fabric Atomic Lat: The time-averaged number of cycles atomic requests spent in
|
||||
Infinity Fabric before a completion acknowledgement (atomic without return value)
|
||||
or data (atomic with return value) was returned to the L2.
|
||||
HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
|
||||
of data from the accelerator's local HBM, per normalization unit.
|
||||
HBM Wr: 'The total number of L2 requests to Infinity Fabric to write or atomically
|
||||
update 32B or 64B of data in the accelerator''s local HBM, per normalization
|
||||
unit. '
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 301
|
||||
@@ -244,13 +128,13 @@ Panel Config:
|
||||
value: ROUND(AVG((TCC_EA0_ATOMIC_sum / $denom)), 0)
|
||||
Fabric Rd Lat:
|
||||
value: ROUND(AVG(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
|
||||
!= 0) else 0)), 0)
|
||||
!= 0) else 0)), 0)
|
||||
Fabric Wr Lat:
|
||||
value: ROUND(AVG(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
|
||||
!= 0) else 0)), 0)
|
||||
!= 0) else 0)), 0)
|
||||
Fabric Atomic Lat:
|
||||
value: ROUND(AVG(((TCC_EA0_ATOMIC_LEVEL_sum / TCC_EA0_ATOMIC_sum) if (TCC_EA0_ATOMIC_sum
|
||||
!= 0) else 0)), 0)
|
||||
!= 0) else 0)), 0)
|
||||
HBM Rd:
|
||||
value: ROUND(AVG((TCC_EA0_RDREQ_DRAM_sum / $denom)), 0)
|
||||
HBM Wr:
|
||||
@@ -258,3 +142,117 @@ Panel Config:
|
||||
comparable: false
|
||||
cli_style: mem_chart
|
||||
tui_style: mem_chart
|
||||
metrics_description:
|
||||
Wavefront Occupancy: Wavefronts per active CU.
|
||||
Wave Life: Average number of cycles executing a wave.
|
||||
SALU: Total Number of SALU (Scalar ALU) instructions issued per normalization
|
||||
unit.
|
||||
SMEM: Total number of SMEM (Scalar Memory Read) instructions issued per normalization
|
||||
unit.
|
||||
VALU: The number of VALU (Vector ALU) instructions issued per normalization unit.
|
||||
MFMA: Total number of MFMA (Matrix-Fused-Multiply-Add) instructions issued per
|
||||
normalization unit.
|
||||
VMEM: The number of VMEM (GPU Memory) read instructions issued (including FLAT/scratch
|
||||
memory) per normalization unit.
|
||||
LDS: The total number of LDS instructions (including, but not limited to, read/write/atomics
|
||||
and HIP's __shfl instructions) executed per normalization unit.
|
||||
GWS: Total number of GDS (global data sync) instructions issued per normalization
|
||||
unit.
|
||||
BR: Total number of BRANCH instructions issued per normalization unit.
|
||||
Active CUs: Total number of active compute units (CUs) on the accelerator during
|
||||
the kernel execution.
|
||||
Num CUs: Total number of compute units (CUs) on the accelerator.
|
||||
VGPR: |-
|
||||
The number of architected vector general-purpose registers allocated
|
||||
for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
|
||||
requested by the compiler due to allocation granularity.
|
||||
SGPR: |-
|
||||
The number of scalar general-purpose registers allocated for the kernel,
|
||||
see SALU. Note: this may not exactly match the number of SGPRs requested by
|
||||
the compiler due to allocation granularity.
|
||||
LDS Allocation: |-
|
||||
The number of bytes of LDS memory (or, shared memory) allocated for
|
||||
this kernel. Note: This may also be larger than what was requested at compile
|
||||
time due to both allocation granularity and dynamic per-dispatch LDS allocations.
|
||||
Scratch Allocation: The number of bytes of scratch memory requested per work-item
|
||||
for this kernel. Scratch memory is used for stack memory on the accelerator,
|
||||
as well as for register spills and restores.
|
||||
Wavefronts: The total number of wavefronts, summed over all workgroups, forming
|
||||
this kernel launch.
|
||||
Workgroups: The total number of workgroups forming this kernel launch.
|
||||
LDS Req: The total number of LDS instructions (including, but not limited to,
|
||||
read/write/atomics and HIP's __shfl instructions) executed per normalization
|
||||
unit.
|
||||
LDS Util: Indicates what percent of the kernel's duration the LDS was actively
|
||||
executing instructions (including, but not limited to, load, store, atomic and
|
||||
HIP's __shfl operations). Calculated as the ratio of the total number of cycles
|
||||
LDS was active over the total CU cycles.
|
||||
LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
|
||||
/ acknowledgment) required for an LDS instruction to complete.
|
||||
VL1 Rd: The total number of incoming read requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
VL1 Wr: The total number of incoming write requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
VL1 Atomic: The total number of incoming atomic requests from the address processing
|
||||
unit after coalescing per normalization unit.
|
||||
VL1 Hit: The ratio of the number of vL1D cache line requests that hit in vL1D
|
||||
cache over the total number of cache line requests to the vL1D Cache RAM.
|
||||
VL1 Lat: Calculated as the average number of cycles that a vL1D cache line request
|
||||
spent in the vL1D cache pipeline.
|
||||
VL1 Coalesce: Indicates how well memory instructions were coalesced by the address
|
||||
processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
|
||||
as the average number of thread-requests generated per instruction divided by
|
||||
the ideal number of thread-requests per instruction.
|
||||
VL1 Stall: The ratio of the number of cycles where the vL1D is stalled waiting
|
||||
to issue a request for data to the L2 cache divided by the number of cycles
|
||||
where the vL1D is active.
|
||||
VL1_L2 Rd: The number of read requests for a vL1D cache line that were not satisfied
|
||||
by the vL1D and must be retrieved from the L2 Cache per normalization
|
||||
unit.
|
||||
VL1_L2 Wr: The number of write requests to a vL1D cache line that were sent through
|
||||
the vL1D to the L2 cache, per normalization unit.
|
||||
VL1_L2 Atomic: The number of atomic requests that are sent through the vL1D to
|
||||
the L2 cache, per normalization unit. This includes requests for atomics with
|
||||
and without return.
|
||||
sL1D Rd: The total number of requests, of any size or type, made to the sL1D per
|
||||
normalization unit.
|
||||
sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
|
||||
line, per normalization unit.
|
||||
sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
|
||||
unit.
|
||||
sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
sL1D_L2 Atomic: The total number of atomic requests from sL1D to the L2, per normalization
|
||||
unit. Typically unused on current CDNA accelerators.
|
||||
IL1 Fetch: The total number of requests made to the L1I per normalization-unit.
|
||||
IL1 Hit: The percent of L1I requests that hit on a previously loaded line in the
|
||||
cache. Calculated as the ratio of the number of L1I requests that hit over the
|
||||
number of all L1I requests.
|
||||
IL1 Lat: The average number of cycles spent to fetch instructions to a CU.
|
||||
IL1_L2 Rd: The total number of requests across the L1I - L2 interface per normalization-unit.
|
||||
L2 Rd: The total number of read requests to the L2 from all clients.
|
||||
L2 Wr: The total number of write requests to the L2 from all clients.
|
||||
L2 Atomic: The total number of atomic requests (with and without return) to the
|
||||
L2 from all clients.
|
||||
L2 Hit: The ratio of the number of L2 cache line requests that hit in the L2 cache
|
||||
over the total number of incoming cache line requests to the L2 cache.
|
||||
Fabric_L2 Rd: Number of L2 cache - Infinity Fabric read requests (either 32-byte
|
||||
or 64-byte) summed over TCC instances per normalization unit.
|
||||
Fabric_L2 Wr: Number of L2 cache - Infinity Fabric write requests (either 32-byte
|
||||
or 64-byte) summed over TCC instances per normalization unit.
|
||||
Fabric_L2 Atomic: Number of L2 cache - Infinity Fabric write requests (either
|
||||
32-byte or 64-byte) that are actually atomic requests summed over TCC instances
|
||||
per normalization unit.
|
||||
Fabric Rd Lat: The time-averaged number of cycles read requests spent in Infinity
|
||||
Fabric before data was returned to the L2.
|
||||
Fabric Wr Lat: The time-averaged number of cycles write requests spent in Infinity
|
||||
Fabric before a completion acknowledgement was returned to the L2.
|
||||
Fabric Atomic Lat: The time-averaged number of cycles atomic requests spent in
|
||||
Infinity Fabric before a completion acknowledgement (atomic without return value)
|
||||
or data (atomic with return value) was returned to the L2.
|
||||
HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
|
||||
of data from the accelerator's local HBM, per normalization unit.
|
||||
HBM Wr: |-
|
||||
The total number of L2 requests to Infinity Fabric to write or atomically
|
||||
update 32B or 64B of data in the accelerator's local HBM, per normalization
|
||||
unit.
|
||||
|
||||
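The Fabric latency entries in the hunk above divide a `*_LEVEL_sum` counter by the matching request counter and fall back to 0 when no requests were counted. A minimal Python sketch of that guarded division follows; the function name and the counter values are hypothetical, and the `AVG(...)` aggregation that the config applies around the expression is not reproduced here.

```python
def avg_latency_cycles(level_sum: float, req_sum: float) -> float:
    """Average cycles per request; 0 when no requests were counted."""
    return level_sum / req_sum if req_sum != 0 else 0


# Hypothetical counter totals, not measured data.
read_latency = round(avg_latency_cycles(level_sum=120_000, req_sum=400), 0)
no_traffic = round(avg_latency_cycles(level_sum=0, req_sum=0), 0)
print(read_latency, no_traffic)
```

Returning 0 rather than raising keeps a kernel with no fabric traffic from breaking the table, at the cost of making "no requests" look the same as "zero latency".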
+88
-79
@@ -2,85 +2,6 @@
|
||||
Panel Config:
|
||||
id: 400
|
||||
title: Roofline
|
||||
metrics_description:
|
||||
VALU FLOPs (F16): 'The total 16-bit floating-point operations executed per second
|
||||
on the VALU. This is presented with the value of the peak empirical F16 FLOPs
|
||||
achievable on the specific accelerator. Note: this does not include any F16
|
||||
operations from MFMA instructions.'
|
||||
VALU FLOPs (F32): 'The total 32-bit floating-point operations executed per second
|
||||
on the VALU. This is presented with the value of the peak empirical F32 FLOPs
|
||||
achievable on the specific accelerator. Note: this does not include any F32
|
||||
operations from MFMA instructions.'
|
||||
VALU FLOPs (F64): 'The total 64-bit floating-point operations executed per second
|
||||
on the VALU. This is presented with the value of the peak empirical F64 FLOPs
|
||||
achievable on the specific accelerator. Note: this does not include any F64
|
||||
operations from MFMA instructions.'
|
||||
MFMA FLOPs (F8): The total number of 8-bit brain floating point MFMA operations
|
||||
executed per second. This does not include any 16-bit brain floating point operations
|
||||
from VALU instructions. The peak empirically measured F8 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison. It is supported
|
||||
on AMD Instinct MI300 series and later only.
|
||||
MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
|
||||
executed per second. Note: this does not include any 16-bit brain floating point
|
||||
operations from VALU instructions. The peak empirically measured BF16 MFMA operations
|
||||
achievable on the specific accelerator is displayed alongside for comparison.'
|
||||
MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit floating point operations
|
||||
from VALU instructions. The peak empirically measured F16 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison.'
|
||||
MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 32-bit floating point operations
|
||||
from VALU instructions. The peak empirically measured F32 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison.'
|
||||
MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 64-bit floating point operations
|
||||
from VALU instructions. The peak empirically measured F64 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison.'
|
||||
MFMA FLOPs (F6F4): 'The total number of 4-bit and 6-bit floating point MFMA operations
|
||||
executed per second. Note: this does not include any floating point operations
|
||||
from VALU instructions. The peak empirically measured F6F4 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison. It is supported
|
||||
on AMD Instinct MI350 series (gfx950) and later only.'
|
||||
MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
|
||||
per second. Note: this does not include any 8-bit integer operations from VALU
|
||||
instructions. The peak empirically measured INT8 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison.'
|
||||
HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth
|
||||
Memory (HBM) per second. The peak empirically measured bandwidth achievable
|
||||
on the specific accelerator is displayed alongside for comparison.
|
||||
L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
|
||||
The number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line. The peak empirically measured bandwidth
|
||||
achievable on the specific accelerator is displayed alongside for comparison.
|
||||
L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
|
||||
of VMEM instructions per unit time. The number of bytes is calculated as the
|
||||
number of cache lines requested multiplied by the cache line size. This value
|
||||
does not consider partial requests, so e.g., if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
The peak empirically measured bandwidth achievable on the specific accelerator
|
||||
is displayed alongside for comparison.
|
||||
LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded
|
||||
from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
|
||||
example for more detail). The peak empirically measured LDS bandwidth achievable
|
||||
on the specific accelerator is displayed alongside for comparison.
|
||||
AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
|
||||
of total floating-point operations (FLOPs) to total bytes transferred between
|
||||
the L1 cache and the processing units. This value is used as the x-coordinate
|
||||
for the L1 roofline.
|
||||
AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
|
||||
of total floating-point operations (FLOPs) to total bytes transferred between
|
||||
the L2 cache and the L1 cache. This value is used as the x-coordinate for the
|
||||
L2 roofline.
|
||||
AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
|
||||
It is the ratio of total floating-point operations (FLOPs) to total bytes transferred
|
||||
between HBM and the L2 cache. This value is used as the x-coordinate for the
|
||||
HBM roofline.
|
||||
Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs
|
||||
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
|
||||
operations divided by the total execution time. This value is used as the y-coordinate
|
||||
for the kernel's point on the Roofline plot.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 401
|
||||
@@ -218,3 +139,91 @@ Panel Config:
|
||||
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
|
||||
* 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9
|
||||
unit: GFLOP/s
|
||||
metrics_description:
|
||||
VALU FLOPs (F16): |-
|
||||
The total 16-bit floating-point operations executed per second on the VALU.
|
||||
This is presented with the value of the peak empirical F16 FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any F16 operations
|
||||
from MFMA instructions.
|
||||
VALU FLOPs (F32): |-
|
||||
The total 32-bit floating-point operations executed per second on the VALU.
|
||||
This is presented with the value of the peak empirical F32 FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any F32 operations
|
||||
from MFMA instructions.
|
||||
VALU FLOPs (F64): |-
|
||||
The total 64-bit floating-point operations executed per second on the VALU.
|
||||
This is presented with the value of the peak empirical F64 FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any F64 operations
|
||||
from MFMA instructions.
|
||||
MFMA FLOPs (F8): The total number of 8-bit brain floating point MFMA operations
|
||||
executed per second. This does not include any 16-bit brain floating point operations
|
||||
from VALU instructions. The peak empirically measured F8 MFMA operations achievable
|
||||
on the specific accelerator is displayed alongside for comparison. It is supported
|
||||
on AMD Instinct MI300 series and later only.
|
||||
MFMA FLOPs (BF16): |-
|
||||
The total number of 16-bit brain floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit brain floating point
|
||||
operations from VALU instructions. The peak empirically measured BF16 MFMA
|
||||
operations achievable on the specific accelerator is displayed alongside
|
||||
for comparison.
|
||||
MFMA FLOPs (F16): |-
|
||||
The total number of 16-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 16-bit floating point operations from
|
||||
VALU instructions. The peak empirically measured F16 MFMA operations
|
||||
achievable on the specific accelerator is displayed alongside for comparison.
|
||||
MFMA FLOPs (F32): |-
|
||||
The total number of 32-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 32-bit floating point operations from
|
||||
VALU instructions. The peak empirically measured F32 MFMA operations
|
||||
achievable on the specific accelerator is displayed alongside for comparison.
|
||||
MFMA FLOPs (F64): |-
|
||||
The total number of 64-bit floating point MFMA operations executed per
|
||||
second. Note: this does not include any 64-bit floating point operations from
|
||||
VALU instructions. The peak empirically measured F64 MFMA operations
|
||||
achievable on the specific accelerator is displayed alongside for comparison.
|
||||
MFMA IOPs (Int8): |-
|
||||
The total number of 8-bit integer MFMA operations executed per second.
|
||||
Note: this does not include any 8-bit integer operations from VALU instructions.
|
||||
The peak empirically measured INT8 MFMA operations achievable on the specific
|
||||
accelerator is displayed alongside for comparison.
|
||||
HBM Bandwidth: |-
|
||||
The total number of bytes read from and written to High-Bandwidth
|
||||
Memory (HBM) per second. The peak empirically measured bandwidth achievable
|
||||
on the specific accelerator is displayed alongside for comparison.
|
||||
L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
|
||||
The number of bytes is calculated as the number of cache lines requested multiplied
|
||||
by the cache line size. This value does not consider partial requests, so e.g.,
|
||||
if only a single value is requested in a cache line, the data movement will
|
||||
still be counted as a full cache line. The peak empirically measured bandwidth
|
||||
achievable on the specific accelerator is displayed alongside for comparison.
|
||||
L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
|
||||
of VMEM instructions per unit time. The number of bytes is calculated as the
|
||||
number of cache lines requested multiplied by the cache line size. This value
|
||||
does not consider partial requests, so e.g., if only a single value is requested
|
||||
in a cache line, the data movement will still be counted as a full cache line.
|
||||
The peak empirically measured bandwidth achievable on the specific accelerator
|
||||
is displayed alongside for comparison.
|
||||
LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded
|
||||
from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
|
||||
example for more detail). The peak empirically measured LDS bandwidth achievable
|
||||
on the specific accelerator is displayed alongside for comparison.
|
||||
AI L1: |-
|
||||
The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
|
||||
of total floating-point operations (FLOPs) to total bytes transferred between
|
||||
the L1 cache and the processing units. This value is used as the x-coordinate
|
||||
for the L1 roofline.
|
||||
AI L2: |-
|
||||
The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
|
||||
of total floating-point operations (FLOPs) to total bytes transferred between
|
||||
the L2 cache and the L1 cache. This value is used as the x-coordinate for
|
||||
the L2 roofline.
|
||||
AI HBM: |-
|
||||
The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
|
||||
It is the ratio of total floating-point operations (FLOPs) to total bytes
|
||||
transferred between HBM and the L2 cache. This value is used as the x-coordinate
|
||||
for the HBM roofline.
|
||||
Performance (GFLOPs): |-
|
||||
The overall achieved performance, measured in GigaFLOPs
|
||||
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
|
||||
operations divided by the total execution time. This value is used as the y-coordinate
|
||||
for the kernel's point on the Roofline plot.
|
||||
|
||||
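The AI and Performance descriptions above reduce to two ratios: FLOPs per byte moved, and FLOPs per second. The Python sketch below illustrates them under stated assumptions: timestamps are taken to be in nanoseconds (as the `/ 1e9` in the hunk suggests), all function names and input values are hypothetical, and `roofline_bound` is the textbook roofline ceiling rather than anything defined in this config.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs per byte; the x-coordinate on a roofline plot."""
    return flops / bytes_moved if bytes_moved else 0.0


def achieved_gflops(flops: float, duration_ns: float) -> float:
    """FLOPs divided by run time (nanoseconds), reported in GFLOP/s."""
    return (flops / (duration_ns / 1e9)) / 1e9 if duration_ns else 0.0


def roofline_bound(ai: float, peak_gflops: float, peak_gbps: float) -> float:
    """Standard roofline ceiling: min(compute peak, AI * memory bandwidth)."""
    return min(peak_gflops, ai * peak_gbps)


# Hypothetical kernel: 2e9 FLOPs, 1e9 bytes of HBM traffic, 1 ms runtime.
ai_hbm = arithmetic_intensity(2e9, 1e9)
perf = achieved_gflops(2e9, 1e6)
bound = roofline_bound(ai_hbm, peak_gflops=10_000.0, peak_gbps=1_000.0)
print(ai_hbm, perf, bound)
```

The same kernel has a different AI at each memory level, because the byte count in the denominator changes between L1, L2, and HBM while the FLOP count stays fixed.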
+25
-24
@@ -2,30 +2,6 @@
|
||||
Panel Config:
|
||||
id: 500
|
||||
title: Command Processor (CPC/CPF)
|
||||
metrics_description:
|
||||
CPF Utilization: Percent of total cycles where the CPF was busy actively doing
|
||||
any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
|
||||
CPF Stall: Percent of CPF busy cycles where the CPF was stalled for any reason.
|
||||
CPF-L2 Utilization: Percent of total cycles counted by the CPF-L2 interface where
|
||||
the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
|
||||
over total cycles counted by the CPF-L2.
|
||||
CPF-L2 Stall: Percent of CPF-L2 busy cycles where the CPF-L2 interface was
|
||||
stalled for any reason.
|
||||
CPF-UTCL1 Stall: Percent of CPF busy cycles where the CPF was stalled by address
|
||||
translation.
|
||||
CPC Utilization: Percent of total cycles where the CPC was busy actively doing
|
||||
any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
|
||||
CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
|
||||
CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
|
||||
for processing.
|
||||
CPC-Workgroup Manager Utilization: Percent of CPC busy cycles spent dispatching
|
||||
workgroups to the workgroup manager.
|
||||
CPC-L2 Utilization: Percent of total cycles counted by the CPC-L2 interface where
|
||||
the CPC-L2 interface was active doing any work.
|
||||
CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
|
||||
translation.
|
||||
CPC-UTCL2 Utilization: 'Percent of total cycles counted by the CPC''s L2 address
|
||||
translation interface where the CPC was busy doing address translation work. '
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 501
|
||||
@@ -143,3 +119,28 @@ Panel Config:
|
||||
max: MAX((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
|
||||
if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
|
||||
unit: pct
|
||||
metrics_description:
|
||||
CPF Utilization: Percent of total cycles where the CPF was busy actively doing
|
||||
any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
|
||||
CPF Stall: Percent of CPF busy cycles where the CPF was stalled for any reason.
|
||||
CPF-L2 Utilization: Percent of total cycles counted by the CPF-L2 interface where
|
||||
the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
|
||||
over total cycles counted by the CPF-L2.
|
||||
CPF-L2 Stall: Percent of CPF-L2 busy cycles where the CPF-L2 interface was
|
||||
stalled for any reason.
|
||||
CPF-UTCL1 Stall: Percent of CPF busy cycles where the CPF was stalled by address
|
||||
translation.
|
||||
CPC Utilization: Percent of total cycles where the CPC was busy actively doing
|
||||
any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
|
||||
CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
|
||||
CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
|
||||
for processing.
|
||||
CPC-Workgroup Manager Utilization: Percent of CPC busy cycles spent dispatching
|
||||
workgroups to the workgroup manager.
|
||||
CPC-L2 Utilization: Percent of total cycles counted by the CPC-L2 interface where
|
||||
the CPC-L2 interface was active doing any work.
|
||||
CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
|
||||
translation.
|
||||
CPC-UTCL2 Utilization: |-
|
||||
Percent of total cycles counted by the CPC's L2 address translation
|
||||
interface where the CPC was busy doing address translation work.
|
||||
|
||||
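The utilization expressions in the hunk above compute busy cycles as a percent of busy plus idle cycles and yield `None` when neither counter advanced. A minimal Python sketch of that pattern follows; the function name and cycle counts are hypothetical, and the `MAX(...)` aggregation from the config is left out.

```python
from typing import Optional


def utilization_pct(busy: int, idle: int) -> Optional[float]:
    """Percent of counted cycles that were busy; None if nothing was counted."""
    total = busy + idle
    return (100 * busy) / total if total != 0 else None


# Hypothetical cycle counts, not measured data.
print(utilization_pct(busy=250, idle=750))
print(utilization_pct(busy=0, idle=0))
```

Using `None` instead of 0 for the empty case lets a consumer distinguish "interface never counted" from "interface counted but was never busy".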
+55
-55
@@ -2,61 +2,6 @@
|
||||
Panel Config:
|
||||
id: 600
|
||||
title: Workgroup Manager (SPI)
|
||||
metrics_description:
|
||||
Accelerator Utilization: The percent of cycles in the kernel where the accelerator
|
||||
was actively doing any work.
|
||||
Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
|
||||
kernel where the scheduler-pipes were actively doing any work.
|
||||
Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
|
||||
manager was actively doing any work.
|
||||
Shader Engine Utilization: The percent of total shader engine cycles in the kernel
|
||||
where any CU in a shader-engine was actively doing any work, normalized over
|
||||
all shader-engines. Low values (e.g., << 100%) indicate that the accelerator
|
||||
was not fully saturated by the kernel, or a potential load-imbalance issue.
|
||||
SIMD Utilization: The percent of total SIMD cycles in the kernel where any SIMD
|
||||
on a CU was actively doing any work, summed over all CUs. Low values (less than
|
||||
100%) indicate that the accelerator was not fully saturated by the kernel, or
|
||||
a potential load-imbalance issue.
|
||||
Dispatched Workgroups: The total number of workgroups forming this kernel launch.
|
||||
Dispatched Wavefronts: The total number of wavefronts, summed over all workgroups,
|
||||
forming this kernel launch.
|
||||
VGPR Writes: The average number of cycles spent initializing VGPRs at wave creation.
|
||||
SGPR Writes: The average number of cycles spent initializing SGPRs at wave creation.
|
||||
Not-scheduled Rate (Workgroup Manager): The percent of total scheduler-pipe cycles
|
||||
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
|
||||
within the workgroup manager rather than a lack of a CU or SIMD with sufficient
|
||||
resources.
|
||||
Not-scheduled Rate (Scheduler-Pipe): 'The percent of total scheduler-pipe cycles
|
||||
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
|
||||
within the scheduler-pipes rather than a lack of a CU or SIMD with sufficient
|
||||
resources. '
|
||||
Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
|
||||
where a workgroup could not be scheduled to a CU due to occupancy limitations
|
||||
(like a lack of a CU or SIMD with sufficient resources).
|
||||
Scratch Stall Rate: The percent of total shader-engine cycles in the kernel where
|
||||
a workgroup could not be scheduled to a CU due to lack of private (a.k.a., scratch)
|
||||
memory slots. While this can reach up to 100%, note that the actual occupancy
|
||||
limitations on a kernel using private memory are typically quite small (for
|
||||
example, less than 1% of the total number of waves that can be scheduled to
|
||||
an accelerator).
|
||||
Insufficient SIMD Waveslots: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available waveslots.
|
||||
Insufficient SIMD VGPRs: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available VGPRs.
|
||||
Insufficient SIMD SGPRs: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available SGPRs.
|
||||
Insufficient CU LDS: The percent of total CU cycles in the kernel where a workgroup
|
||||
could not be scheduled to a CU due to lack of available LDS.
|
||||
Insufficient CU Barriers: The percent of total CU cycles in the kernel where a
|
||||
workgroup could not be scheduled to a CU due to lack of available barriers.
|
||||
Reached CU Workgroup Limit: The percent of total CU cycles in the kernel where
|
||||
a workgroup could not be scheduled to a CU due to limits within the workgroup
|
||||
manager. This is expected to always be zero on CDNA2 or newer accelerators
|
||||
(and small for previous accelerators).
|
||||
Reached CU Wavefront Limit: The percent of total CU cycles in the kernel where
|
||||
a wavefront could not be scheduled to a CU due to limits within the workgroup
|
||||
manager. This is expected to always be zero on CDNA2 or newer accelerators
|
||||
(and small for previous accelerators).
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 601
|
||||
@@ -199,3 +144,58 @@ Panel Config:
|
||||
min: MIN(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
max: MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
|
||||
unit: Pct
|
||||
metrics_description:
|
||||
Accelerator Utilization: The percent of cycles in the kernel where the accelerator
|
||||
was actively doing any work.
|
||||
Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
|
||||
kernel where the scheduler-pipes were actively doing any work.
|
||||
Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
|
||||
manager was actively doing any work.
|
||||
Shader Engine Utilization: The percent of total shader engine cycles in the kernel
|
||||
where any CU in a shader-engine was actively doing any work, normalized over
|
||||
all shader-engines. Low values (e.g., << 100%) indicate that the accelerator
|
||||
was not fully saturated by the kernel, or a potential load-imbalance issue.
|
||||
SIMD Utilization: The percent of total SIMD cycles in the kernel where any SIMD
|
||||
on a CU was actively doing any work, summed over all CUs. Low values (less than
|
||||
100%) indicate that the accelerator was not fully saturated by the kernel, or
|
||||
a potential load-imbalance issue.
|
||||
Dispatched Workgroups: The total number of workgroups forming this kernel launch.
|
||||
Dispatched Wavefronts: The total number of wavefronts, summed over all workgroups,
|
||||
forming this kernel launch.
|
||||
VGPR Writes: The average number of cycles spent initializing VGPRs at wave creation.
|
||||
SGPR Writes: The average number of cycles spent initializing SGPRs at wave creation.
|
||||
Not-scheduled Rate (Workgroup Manager): The percent of total scheduler-pipe cycles
|
||||
in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
|
||||
within the workgroup manager rather than a lack of a CU or SIMD with sufficient
|
||||
resources.
|
||||
Not-scheduled Rate (Scheduler-Pipe): |-
|
||||
The percent of total scheduler-pipe cycles in the kernel where a workgroup
|
||||
could not be scheduled to a CU due to a bottleneck within the scheduler-pipes
|
||||
rather than a lack of a CU or SIMD with sufficient resources.
|
||||
Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
|
||||
where a workgroup could not be scheduled to a CU due to occupancy limitations
|
||||
(like a lack of a CU or SIMD with sufficient resources).
|
||||
Scratch Stall Rate: The percent of total shader-engine cycles in the kernel where
|
||||
a workgroup could not be scheduled to a CU due to lack of private (a.k.a., scratch)
|
||||
memory slots. While this can reach up to 100%, note that the actual occupancy
|
||||
limitations on a kernel using private memory are typically quite small (for
|
||||
example, less than 1% of the total number of waves that can be scheduled to
|
||||
an accelerator).
|
||||
Insufficient SIMD Waveslots: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available waveslots.
|
||||
Insufficient SIMD VGPRs: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available VGPRs.
|
||||
Insufficient SIMD SGPRs: The percent of total SIMD cycles in the kernel where
|
||||
a workgroup could not be scheduled to a SIMD due to lack of available SGPRs.
|
||||
Insufficient CU LDS: The percent of total CU cycles in the kernel where a workgroup
|
||||
could not be scheduled to a CU due to lack of available LDS.
|
||||
Insufficient CU Barriers: The percent of total CU cycles in the kernel where a
|
||||
workgroup could not be scheduled to a CU due to lack of available barriers.
|
||||
Reached CU Workgroup Limit: The percent of total CU cycles in the kernel where
|
||||
a workgroup could not be scheduled to a CU due to limits within the workgroup
|
||||
manager. This is expected to always be zero on CDNA2 or newer accelerators
|
||||
(and small for previous accelerators).
|
||||
Reached CU Wavefront Limit: The percent of total CU cycles in the kernel where
|
||||
a wavefront could not be scheduled to a CU due to limits within the workgroup
|
||||
manager. This is expected to always be zero on CDNA2 or newer accelerators
|
||||
(and small for previous accelerators).
|
||||
|
||||
+63
-57
@@ -2,63 +2,6 @@
|
||||
Panel Config:
|
||||
id: 700
|
||||
title: Wavefront
|
||||
metrics_description:
|
||||
Grid Size: The total number of work-items (or, threads) launched as a part of
|
||||
the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied
|
||||
by the total workgroup (or, block) size.
|
||||
Workgroup Size: The total number of work-items (or, threads) in each workgroup
|
||||
(or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
|
||||
to the total block size.
|
||||
Total Wavefronts: "The total number of wavefronts launched as part of the kernel\
|
||||
\ dispatch. On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs,\
|
||||
\ the wavefront size is always 64 work-items. Thus, the total number of wavefronts\
|
||||
\ should be equivalent to the ceiling of grid size divided by 64."
|
||||
Saved Wavefronts: The total number of wavefronts saved at a context-save.
|
||||
Restored Wavefronts: The total number of wavefronts restored from a context-save.
|
||||
VGPRs: 'The number of architected vector general-purpose registers allocated for
|
||||
the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
|
||||
by the compiler due to allocation granularity.'
|
||||
AGPRs: 'The number of accumulation vector general-purpose registers allocated
|
||||
for the kernel, see AGPRs. Note: this may not exactly match the number of AGPRs
|
||||
requested by the compiler due to allocation granularity.'
|
||||
SGPRs: 'The number of scalar general-purpose registers allocated for the kernel,
|
||||
see SALU. Note: this may not exactly match the number of SGPRs requested by
|
||||
the compiler due to allocation granularity.'
|
||||
LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
|
||||
for this kernel. Note: This may also be larger than what was requested at compile
|
||||
time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
|
||||
Scratch Allocation: The number of bytes of scratch memory requested per work-item
|
||||
for this kernel. Scratch memory is used for stack memory on the accelerator,
|
||||
as well as for register spills and restores.
|
||||
Kernel Time: The total duration of the executed kernel.
|
||||
Kernel Time (Cycles): The total duration of the executed kernel in cycles.
|
||||
Instructions per wavefront: The average number of instructions (of all types)
|
||||
executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.
|
||||
Wave Cycles: The number of cycles a wavefront in the kernel dispatch spent resident
|
||||
on a compute unit per normalization unit. This is averaged over all wavefronts
|
||||
in a kernel dispatch.
|
||||
Dependency Wait Cycles: The number of cycles a wavefront in the kernel dispatch
|
||||
spent stalled waiting on memory of any kind (for example, instruction fetch, vector
|
||||
or scalar memory) per normalization unit. This is averaged over all wavefronts in a kernel dispatch.
|
||||
Issue Wait Cycles: The number of cycles a wavefront in the kernel dispatch was
|
||||
unable to issue an instruction for any reason (e.g., execution pipe back-pressure,
|
||||
arbitration loss, etc.) per normalization unit. This counter is incremented
|
||||
at every cycle by all wavefronts on a CU unable to issue an instruction. As
|
||||
such, it is most useful to get a sense of how waves were spending their time,
|
||||
rather than identification of a precise limiter because another wave could be
|
||||
actively executing while a wave is issue stalled. The sum of this metric, Dependency
|
||||
Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
|
||||
Active Cycles: The average number of cycles a wavefront in the kernel dispatch
|
||||
was actively executing instructions per normalization unit. This measurement
|
||||
is made on a per-wavefront basis, and may include cycles that another wavefront
|
||||
spent actively executing (on another execution unit, for example) or was stalled.
|
||||
As such, it is most useful to get a sense of how waves were spending their time,
|
||||
rather than identification of a precise limiter. The sum of this metric, Issue
|
||||
Wait Cycles and Dependency Wait Cycles should be equal to the total Wave Cycles
|
||||
metric.
|
||||
Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
|
||||
over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
||||
kernels (less than 1ms).'
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 701
|
||||
@@ -171,3 +114,66 @@ Panel Config:
|
||||
max: MAX((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
|
||||
unit: Wavefronts
|
||||
coll_level: SQ_LEVEL_WAVES
|
||||
metrics_description:
|
||||
Grid Size: The total number of work-items (or, threads) launched as a part of
|
||||
the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied
|
||||
by the total workgroup (or, block) size.
|
||||
Workgroup Size: The total number of work-items (or, threads) in each workgroup
|
||||
(or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
|
||||
to the total block size.
|
||||
Total Wavefronts: |-
|
||||
The total number of wavefronts launched as part of the kernel dispatch.
|
||||
On AMD Instinct™ CDNA™ accelerators and GCN™ GPUs, the wavefront
|
||||
size is always 64 work-items. Thus, the total number of wavefronts should
|
||||
be equivalent to the ceiling of grid size divided by 64.
|
||||
Saved Wavefronts: The total number of wavefronts saved at a context-save.
|
||||
Restored Wavefronts: The total number of wavefronts restored from a context-save.
|
||||
VGPRs: |-
|
||||
The number of architected vector general-purpose registers allocated
|
||||
for the kernel, see VALU. Note: this may not exactly match the number of VGPRs
|
||||
requested by the compiler due to allocation granularity.
|
||||
AGPRs: |-
|
||||
The number of accumulation vector general-purpose registers allocated
|
||||
for the kernel, see AGPRs. Note: this may not exactly match the number of
|
||||
AGPRs requested by the compiler due to allocation granularity.
|
||||
SGPRs: |-
|
||||
The number of scalar general-purpose registers allocated for the kernel,
|
||||
see SALU. Note: this may not exactly match the number of SGPRs requested by
|
||||
the compiler due to allocation granularity.
|
||||
LDS Allocation: |-
|
||||
The number of bytes of LDS memory (or, shared memory) allocated for
|
||||
this kernel. Note: This may also be larger than what was requested at compile
|
||||
time due to both allocation granularity and dynamic per-dispatch LDS allocations.
|
||||
Scratch Allocation: The number of bytes of scratch memory requested per work-item
|
||||
for this kernel. Scratch memory is used for stack memory on the accelerator,
|
||||
as well as for register spills and restores.
|
||||
Kernel Time: The total duration of the executed kernel.
|
||||
Kernel Time (Cycles): The total duration of the executed kernel in cycles.
|
||||
Instructions per wavefront: The average number of instructions (of all types)
|
||||
executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.
|
||||
Wave Cycles: The number of cycles a wavefront in the kernel dispatch spent resident
|
||||
on a compute unit per normalization unit. This is averaged over all wavefronts
|
||||
in a kernel dispatch.
|
||||
Dependency Wait Cycles: The number of cycles a wavefront in the kernel dispatch
|
||||
spent stalled waiting on memory of any kind (for example, instruction fetch, vector
|
||||
or scalar memory) per normalization unit. This is averaged over all wavefronts in a kernel dispatch.
|
||||
Issue Wait Cycles: The number of cycles a wavefront in the kernel dispatch was
|
||||
unable to issue an instruction for any reason (e.g., execution pipe back-pressure,
|
||||
arbitration loss, etc.) per normalization unit. This counter is incremented
|
||||
at every cycle by all wavefronts on a CU unable to issue an instruction. As
|
||||
such, it is most useful to get a sense of how waves were spending their time,
|
||||
rather than identification of a precise limiter because another wave could be
|
||||
actively executing while a wave is issue stalled. The sum of this metric, Dependency
|
||||
Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
|
||||
Active Cycles: The average number of cycles a wavefront in the kernel dispatch
|
||||
was actively executing instructions per normalization unit. This measurement
|
||||
is made on a per-wavefront basis, and may include cycles that another wavefront
|
||||
spent actively executing (on another execution unit, for example) or was stalled.
|
||||
As such, it is most useful to get a sense of how waves were spending their time,
|
||||
rather than identification of a precise limiter. The sum of this metric, Issue
|
||||
Wait Cycles and Dependency Wait Cycles should be equal to the total Wave Cycles
|
||||
metric.
|
||||
Wavefront Occupancy: |-
|
||||
The time-averaged number of wavefronts resident on the accelerator over
|
||||
the lifetime of the kernel. Note: this metric may be inaccurate for short-running
|
||||
kernels (less than 1ms).
|
||||
|
||||
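The Wavefront panel descriptions above state two simple relationships: Total Wavefronts should equal the ceiling of the grid size divided by 64, and Wave Cycles should equal the sum of Dependency Wait Cycles, Issue Wait Cycles and Active Cycles. The sketch below only illustrates that arithmetic. The function names `total_wavefronts` and `wave_cycle_residual` are hypothetical helpers and are not part of rocprofiler-compute.

```python
WAVEFRONT_SIZE = 64  # work-items per wavefront, as stated in the Total Wavefronts description


def total_wavefronts(grid_size: int, wavefront_size: int = WAVEFRONT_SIZE) -> int:
    """Ceiling of grid_size / wavefront_size, using integer arithmetic only."""
    if grid_size < 0 or wavefront_size <= 0:
        raise ValueError("grid_size must be >= 0 and wavefront_size must be > 0")
    return (grid_size + wavefront_size - 1) // wavefront_size


def wave_cycle_residual(wave_cycles: float, dependency_wait: float,
                        issue_wait: float, active: float) -> float:
    """Difference between Wave Cycles and the sum of its three components.

    A value near zero is what the metric descriptions lead one to expect.
    """
    return wave_cycles - (dependency_wait + issue_wait + active)
```

For instance, a grid of 1000 work-items rounds up to 16 wavefronts, because the last wavefront is only partially filled.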
+85
-84
@@ -2,90 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1000
|
||||
title: Compute Units - Instruction Mix
|
||||
metrics_description:
|
||||
VALU: The total number of vector arithmetic logic unit (VALU) operations issued.
|
||||
These are the workhorses of the compute unit, and are used to execute a wide
|
||||
range of instruction types including floating point operations, non-uniform
|
||||
address calculations, transcendental operations, integer operations, shifts,
|
||||
conditional evaluation, etc.
|
||||
VMEM: The total number of vector memory operations issued. These include most
|
||||
loads, stores and atomic operations and all accesses to generic, global, private
|
||||
and texture memory.
|
||||
LDS: The total number of LDS (also known as shared memory) operations issued.
|
||||
These include loads, stores, atomics, and HIP's __shfl operations.
|
||||
MFMA: The total number of matrix fused multiply-add instructions issued.
|
||||
SALU: The total number of scalar arithmetic logic unit (SALU) operations issued.
|
||||
Typically these are used for address calculations, literal constants, and other
|
||||
operations that are provably uniform across a wavefront. Although scalar memory
|
||||
(SMEM) operations are issued by the SALU, they are counted separately in this
|
||||
section.
|
||||
SMEM: The total number of scalar memory (SMEM) operations issued. These are typically
|
||||
used for loading kernel arguments, base-pointers and loads from HIP's __constant__
|
||||
memory.
|
||||
Branch: The total number of branch operations issued. These typically consist
|
||||
of jump or branch operations and are used to implement control flow.
|
||||
INT32: The total number of instructions operating on 32-bit integer operands issued
|
||||
to the VALU per normalization unit.
|
||||
INT64: The total number of instructions operating on 64-bit integer operands issued
|
||||
to the VALU per normalization unit.
|
||||
F16-ADD: The total number of addition instructions operating on 16-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F16-MUL: The total number of multiplication instructions operating on 16-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F16-FMA: The total number of fused multiply-add instructions operating on 16-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F16-Trans: The total number of transcendental instructions (e.g., sqrt) operating
|
||||
on 16-bit floating-point operands issued to the VALU per normalization unit.
|
||||
F32-ADD: The total number of addition instructions operating on 32-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F32-MUL: The total number of multiplication instructions operating on 32-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F32-FMA: The total number of fused multiply-add instructions operating on 32-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F32-Trans: The total number of transcendental instructions (such as sqrt) operating
|
||||
on 32-bit floating-point operands issued to the VALU per normalization unit.
|
||||
F64-ADD: The total number of addition instructions operating on 64-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F64-MUL: The total number of multiplication instructions operating on 64-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F64-FMA: The total number of fused multiply-add instructions operating on 64-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F64-Trans: The total number of transcendental instructions (such as sqrt) operating
|
||||
on 64-bit floating-point operands issued to the VALU per normalization unit.
|
||||
Conversion: "The total number of type conversion instructions (such as converting\
|
||||
\ data to or from F32\u2194F64) issued to the VALU per normalization unit."
|
||||
Global/Generic Instr: The total number of global & generic memory instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Read: The total number of global & generic memory read instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Write: The total number of global & generic memory write instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Atomic: The total number of global & generic memory atomic (with
|
||||
and without return) instructions executed on all compute units on the accelerator,
|
||||
per normalization unit.
|
||||
Spill/Stack Instr: The total number of spill/stack memory instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Read: The total number of spill/stack memory read instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Write: The total number of spill/stack memory write instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Atomic: The total number of spill/stack memory atomic (with and without
|
||||
return) instructions executed on all compute units on the accelerator, per normalization
|
||||
unit. Typically unused, as these memory operations are mainly used to implement
|
||||
thread-local storage.
|
||||
MFMA-I8: The total number of 8-bit integer MFMA instructions issued per normalization
|
||||
unit.
|
||||
MFMA-F8: The total number of 8-bit floating point MFMA instructions issued per
|
||||
normalization unit. This is supported in AMD Instinct MI300 series and later
|
||||
only.
|
||||
MFMA-F16: The total number of 16-bit floating point MFMA instructions issued per
|
||||
normalization unit.
|
||||
MFMA-BF16: The total number of 16-bit brain floating point MFMA instructions issued
|
||||
per normalization unit.
|
||||
MFMA-F32: The total number of 32-bit floating-point MFMA instructions issued per
|
||||
normalization unit.
|
||||
MFMA-F64: The total number of 64-bit floating-point MFMA instructions issued per
|
||||
normalization unit.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1001
|
||||
@@ -307,3 +223,88 @@ Panel Config:
|
||||
min: MIN((SQ_INSTS_VALU_MFMA_F64 / $denom))
|
||||
max: MAX((SQ_INSTS_VALU_MFMA_F64 / $denom))
|
||||
unit: (instr + $normUnit)
|
||||
metrics_description:
|
||||
VALU: The total number of vector arithmetic logic unit (VALU) operations issued.
|
||||
These are the workhorses of the compute unit, and are used to execute a wide
|
||||
range of instruction types including floating point operations, non-uniform
|
||||
address calculations, transcendental operations, integer operations, shifts,
|
||||
conditional evaluation, etc.
|
||||
VMEM: The total number of vector memory operations issued. These include most
|
||||
loads, stores and atomic operations and all accesses to generic, global, private
|
||||
and texture memory.
|
||||
LDS: The total number of LDS (also known as shared memory) operations issued.
|
||||
These include loads, stores, atomics, and HIP's __shfl operations.
|
||||
MFMA: The total number of matrix fused multiply-add instructions issued.
|
||||
SALU: The total number of scalar arithmetic logic unit (SALU) operations issued.
|
||||
Typically these are used for address calculations, literal constants, and other
|
||||
operations that are provably uniform across a wavefront. Although scalar memory
|
||||
(SMEM) operations are issued by the SALU, they are counted separately in this
|
||||
section.
|
||||
SMEM: The total number of scalar memory (SMEM) operations issued. These are typically
|
||||
used for loading kernel arguments, base-pointers and loads from HIP's __constant__
|
||||
memory.
|
||||
Branch: The total number of branch operations issued. These typically consist
|
||||
of jump or branch operations and are used to implement control flow.
|
||||
INT32: The total number of instructions operating on 32-bit integer operands issued
|
||||
to the VALU per normalization unit.
|
||||
INT64: The total number of instructions operating on 64-bit integer operands issued
|
||||
to the VALU per normalization unit.
|
||||
F16-ADD: The total number of addition instructions operating on 16-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F16-MUL: The total number of multiplication instructions operating on 16-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F16-FMA: The total number of fused multiply-add instructions operating on 16-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F16-Trans: The total number of transcendental instructions (e.g., sqrt) operating
|
||||
on 16-bit floating-point operands issued to the VALU per normalization unit.
|
||||
F32-ADD: The total number of addition instructions operating on 32-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F32-MUL: The total number of multiplication instructions operating on 32-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F32-FMA: The total number of fused multiply-add instructions operating on 32-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F32-Trans: The total number of transcendental instructions (such as sqrt) operating
|
||||
on 32-bit floating-point operands issued to the VALU per normalization unit.
|
||||
F64-ADD: The total number of addition instructions operating on 64-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F64-MUL: The total number of multiplication instructions operating on 64-bit floating-point
|
||||
operands issued to the VALU per normalization unit.
|
||||
F64-FMA: The total number of fused multiply-add instructions operating on 64-bit
|
||||
floating-point operands issued to the VALU per normalization unit.
|
||||
F64-Trans: The total number of transcendental instructions (such as sqrt) operating
|
||||
on 64-bit floating-point operands issued to the VALU per normalization unit.
|
||||
Conversion: |-
|
||||
The total number of type conversion instructions (such as converting
|
||||
data to or from F32↔F64) issued to the VALU per normalization unit.
|
||||
Global/Generic Instr: The total number of global & generic memory instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Read: The total number of global & generic memory read instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Write: The total number of global & generic memory write instructions
|
||||
executed on all compute units on the accelerator, per normalization unit.
|
||||
Global/Generic Atomic: The total number of global & generic memory atomic (with
|
||||
and without return) instructions executed on all compute units on the accelerator,
|
||||
per normalization unit.
|
||||
Spill/Stack Instr: The total number of spill/stack memory instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Read: The total number of spill/stack memory read instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Write: The total number of spill/stack memory write instructions executed
|
||||
on all compute units on the accelerator, per normalization unit.
|
||||
Spill/Stack Atomic: The total number of spill/stack memory atomic (with and without
|
||||
return) instructions executed on all compute units on the accelerator, per normalization
|
||||
unit. Typically unused, as these memory operations are mainly used to implement
|
||||
thread-local storage.
|
||||
MFMA-I8: The total number of 8-bit integer MFMA instructions issued per normalization
|
||||
unit.
|
||||
MFMA-F8: The total number of 8-bit floating point MFMA instructions issued per
|
||||
normalization unit. This is supported in AMD Instinct MI300 series and later
|
||||
only.
|
||||
MFMA-F16: The total number of 16-bit floating point MFMA instructions issued per
|
||||
normalization unit.
|
||||
MFMA-BF16: The total number of 16-bit brain floating point MFMA instructions issued
|
||||
per normalization unit.
|
||||
MFMA-F32: The total number of 32-bit floating-point MFMA instructions issued per
|
||||
normalization unit.
|
||||
MFMA-F64: The total number of 64-bit floating-point MFMA instructions issued per
|
||||
normalization unit.
|
||||
|
||||
+95
-88
@@ -2,84 +2,6 @@
|
||||
Panel Config:
|
||||
id: 1100
|
||||
title: Compute Units - Compute Pipeline
|
||||
metrics_description:
|
||||
VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
|
||||
This is also presented as a percent of the peak theoretical FLOPs achievable
|
||||
on the specific accelerator. Note: this does not include any floating-point
|
||||
operations from MFMA instructions.'
|
||||
VALU IOPs: 'The total integer operations executed per second on the VALU. This
|
||||
is also presented as a percent of the peak theoretical IOPs achievable on the
|
||||
specific accelerator. Note: this does not include any integer operations from
|
||||
MFMA instructions.'
|
||||
MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
|
||||
executed per second. Note: this does not include any 16-bit brain floating point
|
||||
operations from VALU instructions. This is also presented as a percent of the
|
||||
peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 16-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F16 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 32-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F32 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
|
||||
per second. Note: this does not include any 64-bit floating point operations
|
||||
from VALU instructions. This is also presented as a percent of the peak theoretical
|
||||
F64 MFMA operations achievable on the specific accelerator.'
|
||||
MFMA IOPs (INT8): 'The total number of 8-bit integer MFMA operations executed
|
||||
per second. Note: this does not include any 8-bit integer operations from VALU
|
||||
instructions. This is also presented as a percent of the peak theoretical INT8
|
||||
MFMA operations achievable on the specific accelerator.'
|
||||
IPC: The ratio of the total number of instructions executed on the CU over the
|
||||
total active CU cycles.
|
||||
IPC (Issued): The ratio of the total number of (non-internal) instructions issued
|
||||
over the number of cycles where the scheduler was actively working on issuing
|
||||
instructions.
|
||||
SALU Utilization: Indicates what percent of the kernel's duration the SALU was
|
||||
busy executing instructions. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing SALU / SMEM instructions over the total CU cycles.
|
||||
VALU Utilization: Indicates what percent of the kernel's duration the VALU was
|
||||
busy executing instructions. Does not include VMEM operations. Computed as the
|
||||
ratio of the total number of cycles spent by the scheduler issuing VALU instructions
|
||||
over the total CU cycles.
|
||||
VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
|
||||
was busy executing instructions, including both global/generic and spill/scratch
|
||||
operations (see the VMEM instruction count metrics for more detail). Does not
|
||||
include VALU operations. Computed as the ratio of the total number of cycles
|
||||
spent by the scheduler issuing VMEM instructions over the total CU cycles.
|
||||
Branch Utilization: Indicates what percent of the kernel's duration the branch
|
||||
unit was busy executing instructions. Computed as the ratio of the total number
|
||||
of cycles spent by the scheduler issuing branch instructions over the total
|
||||
CU cycles.
|
||||
VALU Active Threads: Indicates the average level of divergence within a wavefront
|
||||
over the lifetime of the kernel. The number of work-items that were active in
|
||||
a wavefront during execution of each VALU instruction, time-averaged over all
|
||||
VALU instructions run on all wavefronts in the kernel.
|
||||
MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
|
||||
was busy executing instructions. Computed as the ratio of the total number of
|
||||
cycles the MFMA unit was busy over the total CU cycles.
|
||||
MFMA Instruction Cycles: The average duration of MFMA instructions in this kernel
|
||||
in cycles. Computed as the ratio of the total number of cycles the MFMA unit
|
||||
was busy over the total number of MFMA instructions.
|
||||
VMEM Latency: The average number of round-trip cycles (that is, from issue to
|
||||
data return / acknowledgment) required for a VMEM instruction to complete.
|
||||
SMEM Latency: The average number of round-trip cycles (that is, from issue to
|
||||
data return / acknowledgment) required for a SMEM instruction to complete.
|
||||
FLOPs (Total): The total number of floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
IOPs (Total): The total number of integer operations executed on either the VALU
|
||||
or MFMA units, per normalization unit.
|
||||
F16 OPs: The total number of 16-bit floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
BF16 OPs: The total number of 16-bit brain floating-point operations executed
|
||||
on either the VALU or MFMA units, per normalization unit.
|
||||
F32 OPs: The total number of 32-bit floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
F64 OPs: The total number of 64-bit floating-point operations executed on either
|
||||
the VALU or MFMA units, per normalization unit.
|
||||
INT8 OPs: The total number of 8-bit integer operations executed on either the
|
||||
VALU or MFMA units, per normalization unit.
|
||||
data source:
|
||||
- metric_table:
|
||||
id: 1101
|
||||
@@ -165,13 +87,13 @@ Panel Config:
|
||||
unit: Instr/cycle
|
||||
IPC (Issued):
|
||||
avg: AVG(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
|
||||
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
|
||||
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
|
||||
/ SQ_ACTIVE_INST_ANY))
|
||||
min: MIN(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
|
||||
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
|
||||
/ SQ_ACTIVE_INST_ANY))
|
||||
max: MAX(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
|
||||
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
|
||||
+ SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
|
||||
/ SQ_ACTIVE_INST_ANY))
|
||||
unit: Instr/cycle
|
||||
SALU Utilization:
|
||||
@@ -271,7 +193,7 @@ Panel Config:
|
||||
+ (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
|
||||
+ (SQ_INSTS_VALU_FMA_F64 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
|
||||
/ $denom))
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
IOPs (Total):
|
||||
avg: AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8
|
||||
* 512)) / $denom)
|
||||
@@ -279,12 +201,12 @@ Panel Config:
|
||||
* 512)) / $denom)
|
||||
max: MAX(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8
|
||||
* 512)) / $denom)
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
F8 OPs:
|
||||
avg: AVG(((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) / $denom))
|
||||
min: MIN(((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) / $denom))
|
||||
max: MAX(((512 * SQ_INSTS_VALU_MFMA_MOPS_F8) / $denom))
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
F16 OPs:
|
||||
avg: AVG(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16))
|
||||
+ (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512
|
||||
@@ -295,12 +217,12 @@ Panel Config:
|
||||
max: MAX(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16))
|
||||
+ (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512
|
||||
* SQ_INSTS_VALU_MFMA_MOPS_F16)) / $denom))
|
||||
unit: (OPs + $normUnit)
|
||||
unit: (OPs + $normUnit)
|
||||
BF16 OPs:
avg: AVG(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
min: MIN(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
max: MAX(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
unit: (OPs + $normUnit)
unit: (OPs + $normUnit)
F32 OPs:
avg: AVG((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
@@ -311,7 +233,7 @@ Panel Config:
max: MAX((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
/ $denom))
unit: (OPs + $normUnit)
unit: (OPs + $normUnit)
F64 OPs:
avg: AVG((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
+ (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
@@ -322,9 +244,94 @@ Panel Config:
max: MAX((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
+ (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
/ $denom))
unit: (OPs + $normUnit)
unit: (OPs + $normUnit)
INT8 OPs:
avg: AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
min: MIN(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
max: MAX(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
unit: (OPs + $normUnit)
unit: (OPs + $normUnit)
metrics_description:
VALU FLOPs: |-
The total floating-point operations executed per second on the VALU.
This is also presented as a percent of the peak theoretical FLOPs achievable
on the specific accelerator. Note: this does not include any floating-point
operations from MFMA instructions.
VALU IOPs: |-
The total integer operations executed per second on the VALU. This is
also presented as a percent of the peak theoretical IOPs achievable on the
specific accelerator. Note: this does not include any integer operations from
MFMA instructions.
MFMA FLOPs (BF16): |-
The total number of 16-bit brain floating point MFMA operations executed
per second. Note: this does not include any 16-bit brain floating point operations
from VALU instructions. This is also presented as a percent of the peak theoretical
BF16 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F16): |-
The total number of 16-bit floating point MFMA operations executed per
second. Note: this does not include any 16-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F16 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F32): |-
The total number of 32-bit floating point MFMA operations executed per
second. Note: this does not include any 32-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F32 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (F64): |-
The total number of 64-bit floating point MFMA operations executed per
second. Note: this does not include any 64-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F64 MFMA operations achievable on the specific accelerator.
MFMA IOPs (INT8): |-
The total number of 8-bit integer MFMA operations executed per second.
Note: this does not include any 8-bit integer operations from VALU instructions.
This is also presented as a percent of the peak theoretical INT8 MFMA operations
achievable on the specific accelerator.
IPC: The ratio of the total number of instructions executed on the CU over the
total active CU cycles.
IPC (Issued): The ratio of the total number of (non-internal) instructions issued
over the number of cycles where the scheduler was actively working on issuing
instructions.
SALU Utilization: Indicates what percent of the kernel's duration the SALU was
busy executing instructions. Computed as the ratio of the total number of cycles
spent by the scheduler issuing SALU / SMEM instructions over the total CU cycles.
VALU Utilization: Indicates what percent of the kernel's duration the VALU was
busy executing instructions. Does not include VMEM operations. Computed as the
ratio of the total number of cycles spent by the scheduler issuing VALU instructions
over the total CU cycles.
VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
was busy executing instructions, including both global/generic and spill/scratch
operations (see the VMEM instruction count metrics for more detail). Does not
include VALU operations. Computed as the ratio of the total number of cycles
spent by the scheduler issuing VMEM instructions over the total CU cycles.
Branch Utilization: Indicates what percent of the kernel's duration the branch
unit was busy executing instructions. Computed as the ratio of the total number
of cycles spent by the scheduler issuing branch instructions over the total
CU cycles.
VALU Active Threads: Indicates the average level of divergence within a wavefront
over the lifetime of the kernel. The number of work-items that were active in
a wavefront during execution of each VALU instruction, time-averaged over all
VALU instructions run on all wavefronts in the kernel.
MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
was busy executing instructions. Computed as the ratio of the total number of
cycles the MFMA unit was busy over the total CU cycles.
MFMA Instruction Cycles: The average duration of MFMA instructions in this kernel
in cycles. Computed as the ratio of the total number of cycles the MFMA unit
was busy over the total number of MFMA instructions.
VMEM Latency: The average number of round-trip cycles (that is, from issue to
data return / acknowledgment) required for a VMEM instruction to complete.
SMEM Latency: The average number of round-trip cycles (that is, from issue to
data return / acknowledgment) required for a SMEM instruction to complete.
FLOPs (Total): The total number of floating-point operations executed on either
the VALU or MFMA units, per normalization unit.
IOPs (Total): The total number of integer operations executed on either the VALU
or MFMA units, per normalization unit.
F16 OPs: The total number of 16-bit floating-point operations executed on either
the VALU or MFMA units, per normalization unit.
BF16 OPs: The total number of 16-bit brain floating-point operations executed
on either the VALU or MFMA units, per normalization unit.
F32 OPs: The total number of 32-bit floating-point operations executed on either
the VALU or MFMA units, per normalization unit.
F64 OPs: The total number of 64-bit floating-point operations executed on either
the VALU or MFMA units, per normalization unit.
INT8 OPs: The total number of 8-bit integer operations executed on either the
VALU or MFMA units, per normalization unit.
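
As a reading aid for the per-precision formulas in this panel, the following is a minimal Python sketch of the `F64 OPs` expression as it appears in the diff: 64 operations for each VALU add, multiply, or transcendental instruction, FMA instructions counted twice, plus 512 for each `SQ_INSTS_VALU_MFMA_MOPS_F64` count, all divided by `$denom`. The function name `f64_ops`, the dictionary layout, and the sample counter values are assumptions made for illustration; they are not part of rocprofiler-compute.

```python
def f64_ops(counters, denom):
    """Evaluate the F64 OPs expression from the panel config.

    counters: dict mapping hardware counter names to raw values (hypothetical layout).
    denom:    the normalization value that $denom resolves to (assumed to be non-zero).
    """
    # VALU part: each instruction is weighted by 64, and an FMA counts as two operations.
    valu = 64 * (
        counters["SQ_INSTS_VALU_ADD_F64"]
        + counters["SQ_INSTS_VALU_MUL_F64"]
        + counters["SQ_INSTS_VALU_TRANS_F64"]
        + 2 * counters["SQ_INSTS_VALU_FMA_F64"]
    )
    # MFMA part: the config weights each MOPS count by 512.
    mfma = 512 * counters["SQ_INSTS_VALU_MFMA_MOPS_F64"]
    return (valu + mfma) / denom


# Made-up counter values, for illustration only.
sample = {
    "SQ_INSTS_VALU_ADD_F64": 10,
    "SQ_INSTS_VALU_MUL_F64": 5,
    "SQ_INSTS_VALU_TRANS_F64": 1,
    "SQ_INSTS_VALU_FMA_F64": 20,
    "SQ_INSTS_VALU_MFMA_MOPS_F64": 3,
}
print(f64_ops(sample, denom=4))  # 1280.0
```

The `F32 OPs` expression has the same shape with the `_F32` counters, while `F16 OPs` applies the weights 64, 64, 64, 128, and 512 term by term.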
+52
-51
@@ -2,51 +2,6 @@
Panel Config:
id: 1200
title: Local Data Share (LDS)
metrics_description:
Utilization: Indicates what percent of the kernel's duration the LDS was actively
executing instructions (including, but not limited to, load, store, atomic and
HIP's __shfl operations). Calculated as the ratio of the total number of cycles
LDS was active over the total CU cycles.
Access Rate: Indicates the percentage of SIMDs in the VALU actively issuing LDS
instructions, averaged over the lifetime of the kernel. Calculated as the ratio
of the total number of cycles spent by the scheduler issuing LDS instructions
over the total CU cycles.
Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
could have been loaded from, stored to, or atomically updated in the LDS divided
by the theoretical peak, expressed as a percentage. Does not take into account the execution
mask of the wavefront when the instruction was executed.
Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
loaded from, stored to, or atomically updated in the LDS divided by total duration.
Does not take into account the execution mask of the wavefront when the instruction
was executed.
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing
bank conflicts over the number of LDS cycles that would have been required to
move the same amount of data in an uncontended access.
LDS Instructions: The total number of LDS instructions (including, but not limited
to, read/write/atomics and HIP's __shfl instructions) executed per normalization
unit.
LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
/ acknowledgment) required for an LDS instruction to complete.
Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS scheduler
due to bank conflicts (as determined by the conflict resolution hardware) to
the base number of cycles that would be spent in the LDS scheduler in a completely
uncontended case. This is the unnormalized form of the Bank Conflict Rate.
Index Accesses: The total number of cycles spent in the LDS scheduler over all
operations per normalization unit.
Atomic Return Cycles: The total number of cycles spent on LDS atomics with return
per normalization unit.
Bank Conflict: The total number of cycles spent in the LDS scheduler due to bank
conflicts (as determined by the conflict resolution hardware) per normalization
unit.
Addr Conflict: The total number of cycles spent in the LDS scheduler due to address
conflicts (as determined by the conflict resolution hardware) per normalization
unit.
Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
stalls from non-dword aligned addresses per normalization unit.
Mem Violations: "The total number of out-of-bounds accesses made to the LDS, per\
\ normalization unit. This is unused and expected to be zero in most configurations\
\ for modern CDNA\u2122 accelerators."
data source:
- metric_table:
id: 1201
@@ -87,7 +42,7 @@ Panel Config:
avg: AVG((SQ_INSTS_LDS / $denom))
min: MIN((SQ_INSTS_LDS / $denom))
max: MAX((SQ_INSTS_LDS / $denom))
unit: (Instr + $normUnit)
unit: (Instr + $normUnit)
Theoretical Bandwidth:
avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
/ (End_Timestamp - Start_Timestamp)))
@@ -117,29 +72,75 @@ Panel Config:
avg: AVG((SQ_LDS_IDX_ACTIVE / $denom))
min: MIN((SQ_LDS_IDX_ACTIVE / $denom))
max: MAX((SQ_LDS_IDX_ACTIVE / $denom))
unit: (Cycles + $normUnit)
unit: (Cycles + $normUnit)
Atomic Return Cycles:
avg: AVG((SQ_LDS_ATOMIC_RETURN / $denom))
min: MIN((SQ_LDS_ATOMIC_RETURN / $denom))
max: MAX((SQ_LDS_ATOMIC_RETURN / $denom))
unit: (Cycles + $normUnit)
unit: (Cycles + $normUnit)
Bank Conflict:
avg: AVG((SQ_LDS_BANK_CONFLICT / $denom))
min: MIN((SQ_LDS_BANK_CONFLICT / $denom))
max: MAX((SQ_LDS_BANK_CONFLICT / $denom))
unit: (Cycles + $normUnit)
unit: (Cycles + $normUnit)
Addr Conflict:
avg: AVG((SQ_LDS_ADDR_CONFLICT / $denom))
min: MIN((SQ_LDS_ADDR_CONFLICT / $denom))
max: MAX((SQ_LDS_ADDR_CONFLICT / $denom))
unit: (Cycles + $normUnit)
unit: (Cycles + $normUnit)
Unaligned Stall:
avg: AVG((SQ_LDS_UNALIGNED_STALL / $denom))
min: MIN((SQ_LDS_UNALIGNED_STALL / $denom))
max: MAX((SQ_LDS_UNALIGNED_STALL / $denom))
unit: (Cycles + $normUnit)
unit: (Cycles + $normUnit)
Mem Violations:
avg: AVG((SQ_LDS_MEM_VIOLATIONS / $denom))
min: MIN((SQ_LDS_MEM_VIOLATIONS / $denom))
max: MAX((SQ_LDS_MEM_VIOLATIONS / $denom))
unit: (Accesses + $normUnit)
metrics_description:
Utilization: Indicates what percent of the kernel's duration the LDS was actively
executing instructions (including, but not limited to, load, store, atomic and
HIP's __shfl operations). Calculated as the ratio of the total number of cycles
LDS was active over the total CU cycles.
Access Rate: Indicates the percentage of SIMDs in the VALU actively issuing LDS
instructions, averaged over the lifetime of the kernel. Calculated as the ratio
of the total number of cycles spent by the scheduler issuing LDS instructions
over the total CU cycles.
Theoretical Bandwidth Utilization: Indicates the maximum amount of bytes that
could have been loaded from, stored to, or atomically updated in the LDS divided
by the theoretical peak, expressed as a percentage. Does not take into account the execution
mask of the wavefront when the instruction was executed.
Theoretical Bandwidth: Indicates the maximum amount of bytes that could have been
loaded from, stored to, or atomically updated in the LDS divided by total duration.
Does not take into account the execution mask of the wavefront when the instruction
was executed.
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing
bank conflicts over the number of LDS cycles that would have been required to
move the same amount of data in an uncontended access.
LDS Instructions: The total number of LDS instructions (including, but not limited
to, read/write/atomics and HIP's __shfl instructions) executed per normalization
unit.
LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
/ acknowledgment) required for an LDS instruction to complete.
Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS scheduler
due to bank conflicts (as determined by the conflict resolution hardware) to
the base number of cycles that would be spent in the LDS scheduler in a completely
uncontended case. This is the unnormalized form of the Bank Conflict Rate.
Index Accesses: The total number of cycles spent in the LDS scheduler over all
operations per normalization unit.
Atomic Return Cycles: The total number of cycles spent on LDS atomics with return
per normalization unit.
Bank Conflict: The total number of cycles spent in the LDS scheduler due to bank
conflicts (as determined by the conflict resolution hardware) per normalization
unit.
Addr Conflict: The total number of cycles spent in the LDS scheduler due to address
conflicts (as determined by the conflict resolution hardware) per normalization
unit.
Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
stalls from non-dword aligned addresses per normalization unit.
Mem Violations: |-
The total number of out-of-bounds accesses made to the LDS, per normalization
unit. This is unused and expected to be zero in most configurations for
modern CDNA™ accelerators.
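
To make the `Theoretical Bandwidth` entry concrete, here is a minimal Python sketch of the expression shown in the diff: the LDS cycles not lost to bank conflicts, times 4, times the number of LDS banks per CU, divided by the kernel duration. The function name, argument names, and sample values are assumptions made for illustration. The unit of the result depends on the timestamp unit, which this diff does not state.

```python
def lds_theoretical_bandwidth(idx_active, bank_conflict, lds_banks_per_cu,
                              start_timestamp, end_timestamp):
    """Evaluate the LDS Theoretical Bandwidth expression from the panel config.

    Mirrors: (((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4)
              * TO_INT($lds_banks_per_cu)) / (End_Timestamp - Start_Timestamp)
    """
    duration = end_timestamp - start_timestamp
    if duration <= 0:
        raise ValueError("end_timestamp must be greater than start_timestamp")
    # Cycles in which the LDS was active and not servicing a bank conflict.
    useful_cycles = idx_active - bank_conflict
    # The factor 4 comes from the config; int() stands in for TO_INT().
    return (useful_cycles * 4 * int(lds_banks_per_cu)) / duration


# Made-up values, for illustration only.
print(lds_theoretical_bandwidth(
    idx_active=1000, bank_conflict=200, lds_banks_per_cu=32,
    start_timestamp=0, end_timestamp=1024,
))  # 100.0
```

Because the formula uses cycle counters rather than the execution mask, it describes an upper bound on the bytes moved, which matches the caveat in the description.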
Some files were not shown because too many files were changed in this diff.