[rocprof-compute] Generalize Roofline (#325)

* per kernel analysis Roofline

* added per-kernel eval_metric calculation with display

* fixed typo

* updated tty.py show_all()

* formatting

* fixed ctest failures and updated equations

* formatting

* updated metric descriptions

* review tweaks

* update docs

* added roofline gui analysis

* updated GUI docs

* updated print statement

* comment tweaks and ran ruff formatting
jamessiddeley-amd
2025-08-20 09:58:08 -04:00
committed by GitHub
parent 71b725f307
commit 5840940caa
26 changed files with 2612 additions and 158 deletions
@@ -19,6 +19,9 @@ This section provides an overview of ROCm Compute Profiler's CLI analysis featur
* :ref:`Filtering <cli-analysis-options>`: Hone in on a particular kernel,
GPU ID, or dispatch ID via post-process filtering.
* :ref:`Per-kernel roofline analysis <per-kernel-roofline>`: Detailed arithmetic
intensity and performance analysis for individual kernels.
Run ``rocprof-compute analyze -h`` for more details.
.. _cli-walkthrough:
@@ -32,7 +35,7 @@ There are three high-level GPU analysis views:
* System Speed-of-Light: Key GPU performance metrics to show overall GPU performance and utilization.
* Memory chart: Shows memory transactions and throughput on each cache hierarchical level.
* Empirical hierarchical roofline: Roofline model that compares achieved throughput with attainable peak hardware limits, more specifically peak compute throughput and memory bandwidth (on L1/LDS/L2/HBM).
* Empirical hierarchical roofline: Roofline model that compares achieved throughput with attainable peak hardware limits, more specifically peak compute throughput and memory bandwidth (on L1/LDS/L2/HBM). When combined with kernel filtering, it provides detailed per-kernel arithmetic intensity analysis and performance breakdowns.
**System Speed-of-Light:**
@@ -67,7 +70,7 @@ There are three high-level GPU analysis views:
.. note::
* Visualized memory chart and Roofline chart are only supported in single run analysis. In multiple runs comparison mode, both are switched back to basic table view.
* Visualized memory chart requires the width of the terminal output to be greater than or equal to 234 to display the whole chart properly.
* Visualized Roofline chart is adapted to the initial terminal size only. If it is not clear, you may need to adjust the terminal size and regenerate it to check the display effect.
* Visualized Roofline chart is adapted to the initial terminal size only. If it is not clear, you may need to adjust the terminal size and regenerate it to check the display effect. Roofline analysis provides detailed, structured table output with measured empirical peak values for comparison.
.. _cli-list-metrics:
@@ -309,6 +312,67 @@ Filter kernels
You should see your filtered kernels indicated by an asterisk in the **Top
Stats** table.
.. _per-kernel-roofline:
Per-kernel roofline analysis
When analyzing specific kernels, the roofline analysis provides detailed metrics for each filtered kernel:
.. code-block:: shell-session
$ rocprof-compute analyze -p workloads/vcopy/MI200/ -k 0 -b 4
This generates enhanced roofline output showing per-kernel performance rates and arithmetic intensity calculations:
.. code-block:: text
================================================================================
4. Roofline
================================================================================
(4.1) Per-Kernel Roofline Metrics and (4.2) AI Plot Points
--------------------------------------------------------------------------------
Kernel 0: vecCopy(double*, double*, double*, int, int) (100.0%)
|
├─ 4.1 Roofline Rate Metrics:
| ╒═════════════╤════════════════════╤═══════════════════╤═════════╤════════════════════╕
| │ Metric_ID │ Metric │ Value │ Unit │ Peak (Empirical) │
| ╞═════════════╪════════════════════╪═══════════════════╪═════════╪════════════════════╡
| │ 4.1.0 │ VALU FLOPs │ │ Gflop/s │ 61286.40 │
| ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
| │ 4.1.1 │ MFMA FLOPs (F64) │ │ Gflop/s │ 108544.33 │
| ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
| │ 4.1.2 │ MFMA FLOPs (F32) │ │ Gflop/s │ 104531.42 │
| ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
| │ 4.1.3 │ MFMA FLOPs (F16) │ │ Gflop/s │ 709169.38 │
| ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
| │ 4.1.4 │ MFMA FLOPs (BF16) │ 0.0 │ Gflop/s │ 388161.09 │
| ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
| │ 4.1.5 │ MFMA FLOPs (F8) │ 0.0 │ Gflop/s │ 1446089.60 │
| ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
| │ 4.1.6 │ MFMA IOPs (Int8) │ │ Giop/s │ 737317.94 │
| ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
| │ 4.1.7 │ HBM Bandwidth │ │ Gb/s │ 3231.95 │
| ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
| │ 4.1.8 │ L2 Cache Bandwidth │ │ Gb/s │ 19096.81 │
| ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
| │ 4.1.9 │ L1 Cache Bandwidth │ 3880.358726762844 │ Gb/s │ 25006.24 │
| ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
| │ 4.1.10 │ LDS Bandwidth │ │ Gb/s │ 54920.88 │
| ╘═════════════╧════════════════════╧═══════════════════╧═════════╧════════════════════╛
├─ 4.2 Roofline AI Plot Points:
| ╒═════════════╤══════════════════════╤═════════╤════════════╕
| │ Metric_ID │ Metric │ Value │ Unit │
| ╞═════════════╪══════════════════════╪═════════╪════════════╡
| │ 4.2.0 │ AI HBM │ │ Flops/byte │
| ├─────────────┼──────────────────────┼─────────┼────────────┤
| │ 4.2.1 │ AI L2 │ │ Flops/byte │
| ├─────────────┼──────────────────────┼─────────┼────────────┤
| │ 4.2.2 │ AI L1 │ │ Flops/byte │
| ├─────────────┼──────────────────────┼─────────┼────────────┤
| │ 4.2.3 │ Performance (GFLOPs) │ │ Gflop/s │
| ╘═════════════╧══════════════════════╧═════════╧════════════╛
The per-kernel values in tables 4.1 and 4.2 are evaluated from the metric expressions defined in the Roofline panel's YAML configuration.
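As a rough illustration of what the 4.2 plot points represent, the sketch below computes arithmetic intensity and achieved performance from aggregate totals for one kernel. The function name ``roofline_point`` and its parameters are hypothetical and are not part of rocprof-compute; the tool itself evaluates the expressions in its panel configuration.

```python
def roofline_point(total_flops, total_bytes, duration_ns):
    """Return (arithmetic intensity in FLOPs/byte, performance in GFLOP/s).

    total_bytes is the traffic at one memory level (HBM, L2, or L1), so the
    same FLOP count yields a different intensity for each level.
    """
    if total_bytes <= 0 or duration_ns <= 0:
        # Nothing meaningful to plot without memory traffic or elapsed time.
        return None, None
    ai = total_flops / total_bytes
    # Convert nanoseconds to seconds, then FLOP/s to GFLOP/s.
    gflops = total_flops / (duration_ns / 1e9) / 1e9
    return ai, gflops


# Hypothetical totals: 2e9 FLOPs, 1e9 bytes moved, 1 second of kernel time.
ai, gflops = roofline_point(2e9, 1e9, 1e9)
print(ai, gflops)
```

A kernel whose intensity falls to the left of the ridge point for a given memory level is limited by that level's bandwidth rather than by peak compute.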
Analyze multiple kernels for comparison:
.. code-block:: shell-session
$ rocprof-compute analyze -p workloads/vcopy/MI200/ -k 0 1 2 -b 4
Baseline comparison
.. code-block:: shell
@@ -83,6 +83,7 @@ application's profiling data:
#. Top Stats (Top Kernel Statistics)
#. System Info
#. System Speed-of-Light
#. Roofline AI Data Metrics
To dive deeper, use the dropdown menus at the top of the screen to isolate
particular kernels or dispatches. You should see the web page update with
@@ -307,7 +307,7 @@ Examples:
"\t\t\t For stochastic sampling, the interval is in cycles.\n"
"\t\t\t For host_trap sampling, the interval is in microsecond "
"(DEFAULT: 1048576)."
)
),
)
profile_group.add_argument(
@@ -32,6 +32,6 @@ PROJECT_NAME = "rocprofiler-compute"
HIDDEN_COLUMNS = ["coll_level"]
HIDDEN_COLUMNS_CLI = ["Description", "coll_level"]
HIDDEN_COLUMNS_TUI = ["Description", "coll_level"]
HIDDEN_SECTIONS = [400, 1900, 2000]
HIDDEN_SECTIONS = [1900, 2000]
TIME_UNITS = {"s": 10**9, "ms": 10**6, "us": 10**3, "ns": 1}
@@ -30,8 +30,16 @@ from abc import abstractmethod
from collections import OrderedDict
from pathlib import Path
import pandas as pd
from utils import file_io, parser, schema
from utils.logger import console_debug, console_error, console_log, demarcate
from utils.logger import (
console_debug,
console_error,
console_log,
console_warning,
demarcate,
)
from utils.utils import is_workload_empty, merge_counters_spatial_multiplex
@@ -189,6 +197,21 @@ class OmniAnalyze_Base:
else file_io.find_1st_sub_dir(d[0])
)
w.sys_info = file_io.load_sys_info(sysinfo_path.joinpath("sysinfo.csv"))
if not getattr(self.get_args(), "no_roof", False):
try:
roofline_path = sysinfo_path.joinpath("roofline.csv")
roofline_df = pd.read_csv(roofline_path)
# use original column names from roofline.csv directly
w.roofline_peaks = roofline_df
except FileNotFoundError:
console_warning("roofline.csv not found.")
w.roofline_peaks = pd.DataFrame()
else:
w.roofline_peaks = pd.DataFrame()
arch = w.sys_info.iloc[0]["gpu_arch"]
mspec = self.get_socs()[arch]._mspec
if self.__args.specs_correction:
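The fallback pattern in the hunk above (use ``roofline.csv`` when present, otherwise continue with an empty table) can be sketched in isolation. This is a standalone approximation that uses the standard-library ``csv`` module instead of pandas so that it runs without third-party packages; ``load_roofline_peaks`` is a hypothetical name, and the ``print`` call stands in for ``console_warning``.

```python
import csv
from pathlib import Path


def load_roofline_peaks(sysinfo_dir, no_roof=False):
    """Return the rows of roofline.csv as dicts, or [] when unavailable."""
    if no_roof:
        # Roofline was disabled, so there are no empirical peaks to load.
        return []
    roofline_path = Path(sysinfo_dir) / "roofline.csv"
    try:
        with roofline_path.open(newline="") as handle:
            # Keep the original column names from the file.
            return list(csv.DictReader(handle))
    except FileNotFoundError:
        print(f"warning: {roofline_path} not found.")
        return []
```

Returning an empty container in both fallback paths lets downstream code test for emptiness once instead of distinguishing the "disabled" and "missing file" cases.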
@@ -40,8 +40,9 @@ class cli_analysis(OmniAnalyze_Base):
if self.get_args().random_port:
console_error("--gui flag is required to enable --random-port")
for d in self.get_args().path:
workload = self._runs[d[0]]
# create 'mega dataframe'
self._runs[d[0]].raw_pmc = file_io.create_df_pmc(
workload.raw_pmc = file_io.create_df_pmc(
d[0],
self.get_args().nodes,
self.get_args().spatial_multiplexing,
@@ -51,29 +52,27 @@ class cli_analysis(OmniAnalyze_Base):
)
if self.get_args().spatial_multiplexing:
self._runs[d[0]].raw_pmc = self.spatial_multiplex_merge_counters(
self._runs[d[0]].raw_pmc
workload.raw_pmc = self.spatial_multiplex_merge_counters(
workload.raw_pmc
)
file_io.create_df_kernel_top_stats(
df_in=self._runs[d[0]].raw_pmc,
df_in=workload.raw_pmc,
raw_data_dir=d[0],
filter_gpu_ids=self._runs[d[0]].filter_gpu_ids,
filter_dispatch_ids=self._runs[d[0]].filter_dispatch_ids,
filter_nodes=self._runs[d[0]].filter_nodes,
filter_gpu_ids=workload.filter_gpu_ids,
filter_dispatch_ids=workload.filter_dispatch_ids,
filter_nodes=workload.filter_nodes,
time_unit=self.get_args().time_unit,
max_stat_num=self.get_args().max_stat_num,
kernel_verbose=self.get_args().kernel_verbose,
)
# demangle and overwrite original 'Kernel_Name'
kernel_name_shortener(
self._runs[d[0]].raw_pmc, self.get_args().kernel_verbose
)
kernel_name_shortener(workload.raw_pmc, self.get_args().kernel_verbose)
# create the loaded table
parser.load_table_data(
workload=self._runs[d[0]],
workload=workload,
dir=d[0],
is_gui=False,
args=self.get_args(),
@@ -85,42 +84,41 @@ class cli_analysis(OmniAnalyze_Base):
"""Run CLI analysis."""
super().run_analysis()
workload_path = self.get_args().path[0][0]
workload = self._runs[workload_path]
gpu_arch = workload.sys_info.iloc[0]["gpu_arch"]
arch_config = self._arch_configs[gpu_arch]
if self.get_args().list_stats:
tty.show_kernel_stats(
self.get_args(),
self._runs,
self._arch_configs[
self._runs[self.get_args().path[0][0]].sys_info.iloc[0]["gpu_arch"]
],
arch_config,
self._output,
)
else:
roof_plot = None
# 1. check if not baseline && compatible soc:
if (len(self.get_args().path)) == 1 and self._runs[
self.get_args().path[0][0]
].sys_info.iloc[0]["gpu_arch"] in [
"gfx90a",
"gfx940",
"gfx941",
"gfx942",
"gfx950",
]:
# add roofline plot to cli output
roof_obj = self.get_socs()[
self._runs[self.get_args().path[0][0]].sys_info.iloc[0]["gpu_arch"]
].roofline_obj
if (len(self.get_args().path)) == 1:
if gpu_arch in ["gfx90a", "gfx940", "gfx941", "gfx942", "gfx950"]:
roof_obj = self.get_socs()[gpu_arch].roofline_obj
if roof_obj:
# NOTE: using default data type
roof_plot = roof_obj.cli_generate_plot(roof_obj.get_dtype()[0])
if roof_obj:
# store path in workload for calc_ai_analyze
workload.path = workload_path
# NOTE: using default data type
roof_plot = roof_obj.cli_generate_plot(
dtype=roof_obj.get_dtype()[0],
workload=workload,
config=self._profiling_config,
arch_config=arch_config,
)
tty.show_all(
self.get_args(),
self._runs,
self._arch_configs[
self._runs[self.get_args().path[0][0]].sys_info.iloc[0]["gpu_arch"]
],
arch_config,
self._output,
self._profiling_config,
roof_plot=roof_plot,
@@ -48,7 +48,7 @@ class webui_analysis(OmniAnalyze_Base):
self.dest_dir = str(Path(args.path[0][0]).absolute().resolve())
self.arch = None
self.__hidden_sections = ["Memory Chart", "Roofline"]
self.__hidden_sections = ["Memory Chart"]
self.__hidden_columns = HIDDEN_COLUMNS
# define different types of bar charts
self.__barchart_elements = {
@@ -151,7 +151,7 @@ class webui_analysis(OmniAnalyze_Base):
# Only display basic metrics if no filters are applied
if not (disp_filt or kernel_filter or gcd_filter):
temp = {}
keep = [1, 2, 101, 201, 301, 401]
keep = [1, 2, 101, 201, 301, 401, 402]
for key in base_data[base_run].dfs:
if keep.count(key) != 0:
temp[key] = base_data[base_run].dfs[key]
@@ -219,7 +219,6 @@ class webui_analysis(OmniAnalyze_Base):
.lower()
)
html_section = []
if panel["title"] not in self.__hidden_sections:
# Iterate over each table per section
for data_source in panel["data source"]:
@@ -289,7 +289,7 @@ class RocProfCompute:
if sets_info:
first_set = next(iter(sets_info.keys()))
print(f" rocprof-compute profile --set {first_set} # Profile this set")
print(f" rocprof-compute profile --list-sets # Show this help")
print(" rocprof-compute profile --list-sets # Show this help")
print()
sys.exit(0)
@@ -2,8 +2,191 @@
Panel Config:
id: 400
title: Roofline
metrics_description: {}
metrics_description:
VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
This is also presented as a percent of the peak theoretical FLOPs achievable
on the specific accelerator. Note: this does not include any floating-point
operations from MFMA instructions.'
MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
executed per second. This does not include any 8-bit floating point operations
from VALU instructions. The peak empirically measured F8 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison. It is supported
on AMD Instinct MI300 series and later only.
MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
executed per second. Note: this does not include any 16-bit brain floating point
operations from VALU instructions. The peak empirically measured BF16 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.'
MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
per second. Note: this does not include any 16-bit floating point operations
from VALU instructions. The peak empirically measured F16 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
per second. Note: this does not include any 32-bit floating point operations
from VALU instructions. The peak empirically measured F32 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
per second. Note: this does not include any 64-bit floating point operations
from VALU instructions. The peak empirically measured F64 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
per second. Note: this does not include any 8-bit integer operations from VALU
instructions. The peak empirically measured INT8 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth
Memory (HBM) per second. The peak empirically measured bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
The number of bytes is calculated as the number of cache lines requested multiplied
by the cache line size. This value does not consider partial requests, so e.g.,
if only a single value is requested in a cache line, the data movement will
still be counted as a full cache line. The peak empirically measured bandwidth
achievable on the specific accelerator is displayed alongside for comparison.
L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
of VMEM instructions per unit time. The number of bytes is calculated as the
number of cache lines requested multiplied by the cache line size. This value
does not consider partial requests, so e.g., if only a single value is requested
in a cache line, the data movement will still be counted as a full cache line.
The peak empirically measured bandwidth achievable on the specific accelerator
is displayed alongside for comparison.
LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded
from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
example for more detail). The peak empirically measured LDS bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L1 cache and the processing units. This value is used as the x-coordinate
for the L1 roofline.
AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L2 cache and the L1 cache. This value is used as the x-coordinate for the
L2 roofline.
AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
It is the ratio of total floating-point operations (FLOPs) to total bytes transferred
between HBM and the L2 cache. This value is used as the x-coordinate for the
HBM roofline.
Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
operations divided by the total execution time. This value is used as the y-coordinate
for the kernel's point on the Roofline plot.
data source:
- None:
- metric_table:
id: 401
title: Roofline
title: Roofline Performance Rates
cli_style: Roofline
tui_style: Roofline
header:
metric: Metric
value: Value
unit: Unit
peak: Peak (Empirical)
metric:
VALU FLOPs:
value: AVG((($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) / ((End_Timestamp - Start_Timestamp) / 1e9))
/ 1e9)
unit: GFLOP/s
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
MFMA FLOPs (F64):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF64Flops_empirical_peak
MFMA FLOPs (F32):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF32Flops_empirical_peak
MFMA FLOPs (F16):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF16Flops_empirical_peak
MFMA FLOPs (BF16):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMABF16Flops_empirical_peak
MFMA IOPs (Int8):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GIOP/s
peak: $MFMAI8Ops_empirical_peak
HBM Bandwidth:
value: AVG((( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum
- TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + ((TCC_EA0_WRREQ_sum
- TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $HBMBw_empirical_peak
L2 Cache Bandwidth:
value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64)) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $L2Bw_empirical_peak
L1 Cache Bandwidth:
value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GB/s
peak: $L1Bw_empirical_peak
LDS Bandwidth:
value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4 * $lds_banks_per_cu))
/ ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $LDSBw_empirical_peak
- metric_table:
id: 402
title: Roofline Plot Points
cli_style: Roofline
tui_style: Roofline
header:
metric: Metric
value: Value
unit: Unit
metric:
AI HBM:
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / SUM( (TCC_BUBBLE_sum *
128) + (TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum
- TCC_EA0_RDREQ_32B_sum) * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
* 32) + (TCC_EA0_WRREQ_64B_sum * 64) ) )
unit: FLOPs/Byte
AI L2:
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / SUM( (TCP_TCC_WRITE_REQ_sum
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum
+ TCP_TCC_READ_REQ_sum) * 64 ) )
unit: FLOPs/Byte
AI L1:
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / SUM(TCP_TOTAL_CACHE_ACCESSES_sum
* 64) )
unit: FLOPs/Byte
Performance GFLOPs:
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / (SUM(End_Timestamp - Start_Timestamp)
/ 1e9) ) / 1e9
unit: GFLOP/s
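The HBM Bandwidth expression in this file weights each TCC request counter by a byte size before dividing by the dispatch duration. The sketch below restates that arithmetic for a single dispatch; the function names are hypothetical, the counter values are passed in as plain numbers, and the YAML's ``AVG`` over dispatches is not reproduced.

```python
def hbm_bytes(rdreq, rdreq_32b, bubble, wrreq, wrreq_64b):
    """Bytes moved to and from HBM, using the same weights as the expression."""
    read_bytes = (
        bubble * 128                           # TCC_BUBBLE_sum, weighted as 128 B
        + rdreq_32b * 32                       # TCC_EA0_RDREQ_32B_sum
        + (rdreq - bubble - rdreq_32b) * 64    # remaining reads, weighted as 64 B
    )
    write_bytes = (
        (wrreq - wrreq_64b) * 32               # writes that are not 64 B
        + wrreq_64b * 64                       # TCC_EA0_WRREQ_64B_sum
    )
    return read_bytes + write_bytes


def bandwidth_gb_per_s(num_bytes, duration_ns):
    """Convert a byte count and a duration in nanoseconds to GB/s."""
    return num_bytes / (duration_ns / 1e9) / 1e9


# Hypothetical counter values for one dispatch lasting 1000 ns.
total = hbm_bytes(rdreq=10, rdreq_32b=2, bubble=1, wrreq=4, wrreq_64b=3)
print(total, bandwidth_gb_per_s(total, 1000))
```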
@@ -2,8 +2,189 @@
Panel Config:
id: 400
title: Roofline
metrics_description: {}
metrics_description:
VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
This is also presented as a percent of the peak theoretical FLOPs achievable
on the specific accelerator. Note: this does not include any floating-point
operations from MFMA instructions.'
MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
executed per second. This does not include any 8-bit floating point operations
from VALU instructions. The peak empirically measured F8 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison. It is supported
on AMD Instinct MI300 series and later only.
MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
executed per second. Note: this does not include any 16-bit brain floating point
operations from VALU instructions. The peak empirically measured BF16 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.'
MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
per second. Note: this does not include any 16-bit floating point operations
from VALU instructions. The peak empirically measured F16 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
per second. Note: this does not include any 32-bit floating point operations
from VALU instructions. The peak empirically measured F32 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
per second. Note: this does not include any 64-bit floating point operations
from VALU instructions. The peak empirically measured F64 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
per second. Note: this does not include any 8-bit integer operations from VALU
instructions. The peak empirically measured INT8 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth
Memory (HBM) per second. The peak empirically measured bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
The number of bytes is calculated as the number of cache lines requested multiplied
by the cache line size. This value does not consider partial requests, so e.g.,
if only a single value is requested in a cache line, the data movement will
still be counted as a full cache line. The peak empirically measured bandwidth
achievable on the specific accelerator is displayed alongside for comparison.
L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
of VMEM instructions per unit time. The number of bytes is calculated as the
number of cache lines requested multiplied by the cache line size. This value
does not consider partial requests, so e.g., if only a single value is requested
in a cache line, the data movement will still be counted as a full cache line.
The peak empirically measured bandwidth achievable on the specific accelerator
is displayed alongside for comparison.
LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded
from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
example for more detail). The peak empirically measured LDS bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L1 cache and the processing units. This value is used as the x-coordinate
for the L1 roofline.
AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L2 cache and the L1 cache. This value is used as the x-coordinate for the
L2 roofline.
AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
It is the ratio of total floating-point operations (FLOPs) to total bytes transferred
between HBM and the L2 cache. This value is used as the x-coordinate for the
HBM roofline.
Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
operations divided by the total execution time. This value is used as the y-coordinate
for the kernel's point on the Roofline plot.
data source:
- None:
- metric_table:
id: 401
title: Roofline
title: Roofline Performance Rates
cli_style: Roofline
tui_style: Roofline
header:
metric: Metric
value: Value
unit: Unit
peak: Peak (Empirical)
metric:
VALU FLOPs:
value: AVG((($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) / ((End_Timestamp - Start_Timestamp) / 1e9))
/ 1e9)
unit: GFLOP/s
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
MFMA FLOPs (F64):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF64Flops_empirical_peak
MFMA FLOPs (F32):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF32Flops_empirical_peak
MFMA FLOPs (F16):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF16Flops_empirical_peak
MFMA FLOPs (BF16):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMABF16Flops_empirical_peak
MFMA IOPs (Int8):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GIOP/s
peak: $MFMAI8Ops_empirical_peak
HBM Bandwidth:
value: AVG((( (TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
* 64) + (TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
* 32) ) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $HBMBw_empirical_peak
L2 Cache Bandwidth:
value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64)) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $L2Bw_empirical_peak
L1 Cache Bandwidth:
value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GB/s
peak: $L1Bw_empirical_peak
LDS Bandwidth:
value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4 * $lds_banks_per_cu))
/ ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $LDSBw_empirical_peak
- metric_table:
id: 402
title: Roofline Plot Points
cli_style: Roofline
tui_style: Roofline
header:
metric: Metric
value: Value
unit: Unit
metric:
AI HBM:
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / SUM( (TCC_EA_RDREQ_32B_sum
* 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) * 64) + (TCC_EA_WRREQ_64B_sum
* 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) * 32) ) )
unit: FLOPs/Byte
AI L2:
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / SUM( (TCP_TCC_WRITE_REQ_sum
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum
+ TCP_TCC_READ_REQ_sum) * 64 ) )
unit: FLOPs/Byte
AI L1:
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / SUM(TCP_TOTAL_CACHE_ACCESSES_sum
* 64) )
unit: FLOPs/Byte
Performance (GFLOPs):
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / (SUM(End_Timestamp - Start_Timestamp)
/ 1e9) ) / 1e9
unit: GFLOP/s
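The plot-point expressions above reduce to a ratio of summed FLOPs to summed bytes (for arithmetic intensity) or to summed time (for achieved performance). A minimal Python sketch of the ``AI HBM`` and ``Performance GFLOPs`` expressions follows. It assumes per-dispatch counter values are available as plain dictionaries and that timestamps are in nanoseconds, as the division by ``1e9`` suggests. The dictionary layout, the function names, and the default wave size of 64 are hypothetical and not part of rocprof-compute.

```python
# Hypothetical sketch only: mirrors the "AI HBM" and "Performance GFLOPs"
# expressions for a list of dispatches. It is not the tool's implementation.

VALU_TYPES = ("F16", "F32", "F64")
MFMA_TYPES = ("F16", "BF16", "F32", "F64")


def dispatch_flops(c, wave_size=64):
    """Total FLOPs for one dispatch; c maps counter name -> value."""
    valu = sum(
        c.get(f"SQ_INSTS_VALU_ADD_{t}", 0)
        + c.get(f"SQ_INSTS_VALU_MUL_{t}", 0)
        + 2 * c.get(f"SQ_INSTS_VALU_FMA_{t}", 0)  # an FMA counts as two FLOPs
        + c.get(f"SQ_INSTS_VALU_TRANS_{t}", 0)
        for t in VALU_TYPES
    )
    # Each MFMA_MOPS count is scaled by 512, as in the expressions above.
    mfma = sum(512 * c.get(f"SQ_INSTS_VALU_MFMA_MOPS_{t}", 0) for t in MFMA_TYPES)
    return wave_size * valu + mfma


def dispatch_hbm_bytes(c):
    """Bytes moved between L2 and HBM for one dispatch."""
    rd32 = c.get("TCC_EA_RDREQ_32B_sum", 0)
    rd = c.get("TCC_EA_RDREQ_sum", 0)
    wr64 = c.get("TCC_EA_WRREQ_64B_sum", 0)
    wr = c.get("TCC_EA_WRREQ_sum", 0)
    return rd32 * 32 + (rd - rd32) * 64 + wr64 * 64 + (wr - wr64) * 32


def plot_point(dispatches, wave_size=64):
    """Return (AI HBM in FLOPs/Byte, performance in GFLOP/s)."""
    flops = sum(dispatch_flops(c, wave_size) for c in dispatches)
    hbm_bytes = sum(dispatch_hbm_bytes(c) for c in dispatches)
    elapsed_ns = sum(c["End_Timestamp"] - c["Start_Timestamp"] for c in dispatches)
    ai_hbm = flops / hbm_bytes
    gflops = flops / (elapsed_ns / 1e9) / 1e9
    return ai_hbm, gflops
```

Summing before dividing means that long-running dispatches weigh more heavily in the resulting point than short ones.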
@@ -2,8 +2,197 @@
Panel Config:
id: 400
title: Roofline
metrics_description: {}
metrics_description:
VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
This is also presented as a percent of the peak theoretical FLOPs achievable
on the specific accelerator. Note: this does not include any floating-point
operations from MFMA instructions.'
MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
executed per second. This does not include any 8-bit floating point operations
from VALU instructions. The peak empirically measured F8 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison. This metric is
supported on AMD Instinct MI300 series and later only.
MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
executed per second. Note: this does not include any 16-bit brain floating point
operations from VALU instructions. The peak empirically measured BF16 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.'
MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
per second. Note: this does not include any 16-bit floating point operations
from VALU instructions. The peak empirically measured F16 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
per second. Note: this does not include any 32-bit floating point operations
from VALU instructions. The peak empirically measured F32 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
per second. Note: this does not include any 64-bit floating point operations
from VALU instructions. The peak empirically measured F64 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
per second. Note: this does not include any 8-bit integer operations from VALU
instructions. The peak empirically measured INT8 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth
Memory (HBM) per second. The peak empirically measured bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
The number of bytes is calculated as the number of cache lines requested multiplied
by the cache line size. This value does not consider partial requests. For example,
if only a single value is requested in a cache line, the data movement will
still be counted as a full cache line. The peak empirically measured bandwidth
achievable on the specific accelerator is displayed alongside for comparison.
L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
of VMEM instructions per unit time. The number of bytes is calculated as the
number of cache lines requested multiplied by the cache line size. This value
does not consider partial requests. For example, if only a single value is requested
in a cache line, the data movement will still be counted as a full cache line.
The peak empirically measured bandwidth achievable on the specific accelerator
is displayed alongside for comparison.
LDS Bandwidth: Indicates the maximum number of bytes that could have been loaded
from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
example for more detail). The peak empirically measured LDS bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L1 cache and the processing units. This value is used as the x-coordinate
for the L1 roofline.
AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L2 cache and the L1 cache. This value is used as the x-coordinate for the
L2 roofline.
AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
It is the ratio of total floating-point operations (FLOPs) to total bytes transferred
between HBM and the L2 cache. This value is used as the x-coordinate for the
HBM roofline.
Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
operations divided by the total execution time. This value is used as the y-coordinate
for the kernel's point on the Roofline plot.
data source:
- None:
- metric_table:
id: 401
title: Roofline
title: Roofline Performance Rates
cli_style: Roofline
tui_style: Roofline
header:
metric: Metric
value: Value
unit: Unit
peak: Peak (Empirical)
metric:
VALU FLOPs:
value: AVG(($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) ) / ((End_Timestamp - Start_Timestamp) / 1e9))
/ 1e9)
unit: GFLOP/s
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
MFMA FLOPs (F64):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF64Flops_empirical_peak
MFMA FLOPs (F32):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF32Flops_empirical_peak
MFMA FLOPs (F16):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF16Flops_empirical_peak
MFMA FLOPs (BF16):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMABF16Flops_empirical_peak
MFMA FLOPs (F8):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF8Flops_empirical_peak
MFMA IOPs (Int8):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GIOP/s
peak: $MFMAI8Ops_empirical_peak
HBM Bandwidth:
value: AVG((( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum
- TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + ((TCC_EA0_WRREQ_sum
- TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $HBMBw_empirical_peak
L2 Cache Bandwidth:
value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64)) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $L2Bw_empirical_peak
L1 Cache Bandwidth:
value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GB/s
peak: $L1Bw_empirical_peak
LDS Bandwidth:
value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4 * $lds_banks_per_cu))
/ ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $LDSBw_empirical_peak
- metric_table:
id: 402
title: Roofline Plot Points
cli_style: Roofline
tui_style: Roofline
header:
metric: Metric
value: Value
unit: Unit
metric:
AI HBM:
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
* 512) ) / SUM( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32)
+ ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64)
+ ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum
* 64) ) )
unit: FLOPs/Byte
AI L2:
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) ) / SUM( (TCP_TCC_WRITE_REQ_sum
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum
+ TCP_TCC_READ_REQ_sum) * 64 ) )
unit: FLOPs/Byte
AI L1:
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) ) / SUM(TCP_TOTAL_CACHE_ACCESSES_sum
* 64) )
unit: FLOPs/Byte
Performance (GFLOPs):
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
* 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9
unit: GFLOP/s
@@ -2,8 +2,197 @@
Panel Config:
id: 400
title: Roofline
metrics_description: {}
metrics_description:
VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
This is also presented as a percent of the peak theoretical FLOPs achievable
on the specific accelerator. Note: this does not include any floating-point
operations from MFMA instructions.'
MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
executed per second. This does not include any 8-bit floating point operations
from VALU instructions. The peak empirically measured F8 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison. This metric is
supported on AMD Instinct MI300 series and later only.
MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
executed per second. Note: this does not include any 16-bit brain floating point
operations from VALU instructions. The peak empirically measured BF16 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.'
MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
per second. Note: this does not include any 16-bit floating point operations
from VALU instructions. The peak empirically measured F16 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
per second. Note: this does not include any 32-bit floating point operations
from VALU instructions. The peak empirically measured F32 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
per second. Note: this does not include any 64-bit floating point operations
from VALU instructions. The peak empirically measured F64 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
per second. Note: this does not include any 8-bit integer operations from VALU
instructions. The peak empirically measured INT8 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth
Memory (HBM) per second. The peak empirically measured bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
The number of bytes is calculated as the number of cache lines requested multiplied
by the cache line size. This value does not consider partial requests. For example,
if only a single value is requested in a cache line, the data movement will
still be counted as a full cache line. The peak empirically measured bandwidth
achievable on the specific accelerator is displayed alongside for comparison.
L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
of VMEM instructions per unit time. The number of bytes is calculated as the
number of cache lines requested multiplied by the cache line size. This value
does not consider partial requests. For example, if only a single value is requested
in a cache line, the data movement will still be counted as a full cache line.
The peak empirically measured bandwidth achievable on the specific accelerator
is displayed alongside for comparison.
LDS Bandwidth: Indicates the maximum number of bytes that could have been loaded
from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
example for more detail). The peak empirically measured LDS bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L1 cache and the processing units. This value is used as the x-coordinate
for the L1 roofline.
AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L2 cache and the L1 cache. This value is used as the x-coordinate for the
L2 roofline.
AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
It is the ratio of total floating-point operations (FLOPs) to total bytes transferred
between HBM and the L2 cache. This value is used as the x-coordinate for the
HBM roofline.
Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
operations divided by the total execution time. This value is used as the y-coordinate
for the kernel's point on the Roofline plot.
data source:
- None:
- metric_table:
id: 401
title: Roofline
title: Roofline Performance Rates
cli_style: Roofline
tui_style: Roofline
header:
metric: Metric
value: Value
unit: Unit
peak: Peak (Empirical)
metric:
VALU FLOPs:
value: AVG(($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) ) / ((End_Timestamp - Start_Timestamp) / 1e9))
/ 1e9)
unit: GFLOP/s
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
MFMA FLOPs (F64):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF64Flops_empirical_peak
MFMA FLOPs (F32):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF32Flops_empirical_peak
MFMA FLOPs (F16):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF16Flops_empirical_peak
MFMA FLOPs (BF16):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMABF16Flops_empirical_peak
MFMA FLOPs (F8):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF8Flops_empirical_peak
MFMA IOPs (Int8):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GIOP/s
peak: $MFMAI8Ops_empirical_peak
HBM Bandwidth:
value: AVG((( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum
- TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + ((TCC_EA0_WRREQ_sum
- TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $HBMBw_empirical_peak
L2 Cache Bandwidth:
value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64)) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $L2Bw_empirical_peak
L1 Cache Bandwidth:
value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GB/s
peak: $L1Bw_empirical_peak
LDS Bandwidth:
value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4 * $lds_banks_per_cu))
/ ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $LDSBw_empirical_peak
- metric_table:
id: 402
title: Roofline Plot Points
cli_style: Roofline
tui_style: Roofline
header:
metric: Metric
value: Value
unit: Unit
metric:
AI HBM:
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
* 512) ) / SUM( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32)
+ ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64)
+ ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum
* 64) ) )
unit: FLOPs/Byte
AI L2:
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
* 512) ) / SUM( (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64 ) )
unit: FLOPs/Byte
AI L1:
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
* 512) ) / SUM(TCP_TOTAL_CACHE_ACCESSES_sum * 64) )
unit: FLOPs/Byte
Performance (GFLOPs):
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
* 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9
unit: GFLOP/s
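The two tables above aggregate differently: the rates in table 401 wrap a per-dispatch rate in ``AVG(...)``, while the plot points in table 402 divide a ``SUM`` of FLOPs by a ``SUM`` of elapsed time. Assuming both aggregations run across dispatches, the two results can diverge when dispatch durations vary. The sketch below uses made-up numbers and hypothetical function names to illustrate the difference; it is not taken from the tool.

```python
# Hypothetical illustration of AVG-of-rates versus SUM/SUM aggregation.
# Durations are in nanoseconds; results are in GFLOP/s.


def mean_of_rates(flops, durations_ns):
    """AVG(flops / time): every dispatch contributes equally."""
    rates = [f / (d / 1e9) / 1e9 for f, d in zip(flops, durations_ns)]
    return sum(rates) / len(rates)


def ratio_of_sums(flops, durations_ns):
    """SUM(flops) / SUM(time): long dispatches dominate."""
    return sum(flops) / (sum(durations_ns) / 1e9) / 1e9


# Two dispatches doing the same work, one three times slower (made-up data).
flops = [1e9, 1e9]
durations_ns = [1e6, 3e6]

avg_rate = mean_of_rates(flops, durations_ns)
overall_rate = ratio_of_sums(flops, durations_ns)
```

With these inputs the mean of per-dispatch rates comes out higher than the overall rate, because the fast dispatch counts as much as the slow one even though it accounts for less of the total time.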
@@ -2,8 +2,197 @@
Panel Config:
id: 400
title: Roofline
metrics_description: {}
metrics_description:
VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
This is also presented as a percent of the peak theoretical FLOPs achievable
on the specific accelerator. Note: this does not include any floating-point
operations from MFMA instructions.'
MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
executed per second. This does not include any 8-bit floating point operations
from VALU instructions. The peak empirically measured F8 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison. This metric is
supported on AMD Instinct MI300 series and later only.
MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
executed per second. Note: this does not include any 16-bit brain floating point
operations from VALU instructions. The peak empirically measured BF16 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.'
MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
per second. Note: this does not include any 16-bit floating point operations
from VALU instructions. The peak empirically measured F16 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
per second. Note: this does not include any 32-bit floating point operations
from VALU instructions. The peak empirically measured F32 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
per second. Note: this does not include any 64-bit floating point operations
from VALU instructions. The peak empirically measured F64 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
per second. Note: this does not include any 8-bit integer operations from VALU
instructions. The peak empirically measured INT8 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth
Memory (HBM) per second. The peak empirically measured bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
The number of bytes is calculated as the number of cache lines requested multiplied
by the cache line size. This value does not consider partial requests. For example,
if only a single value is requested in a cache line, the data movement will
still be counted as a full cache line. The peak empirically measured bandwidth
achievable on the specific accelerator is displayed alongside for comparison.
L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
of VMEM instructions per unit time. The number of bytes is calculated as the
number of cache lines requested multiplied by the cache line size. This value
does not consider partial requests. For example, if only a single value is requested
in a cache line, the data movement will still be counted as a full cache line.
The peak empirically measured bandwidth achievable on the specific accelerator
is displayed alongside for comparison.
LDS Bandwidth: Indicates the maximum number of bytes that could have been loaded
from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
example for more detail). The peak empirically measured LDS bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L1 cache and the processing units. This value is used as the x-coordinate
for the L1 roofline.
AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L2 cache and the L1 cache. This value is used as the x-coordinate for the
L2 roofline.
AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
It is the ratio of total floating-point operations (FLOPs) to total bytes transferred
between HBM and the L2 cache. This value is used as the x-coordinate for the
HBM roofline.
Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
operations divided by the total execution time. This value is used as the y-coordinate
for the kernel's point on the Roofline plot.
data source:
- None:
- metric_table:
id: 401
title: Roofline
title: Roofline Performance Rates
cli_style: Roofline
tui_style: Roofline
header:
metric: Metric
value: Value
unit: Unit
peak: Peak (Empirical)
metric:
VALU FLOPs:
value: AVG((($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) / ((End_Timestamp - Start_Timestamp) / 1e9))
/ 1e9)
unit: GFLOP/s
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
MFMA FLOPs (F64):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF64Flops_empirical_peak
MFMA FLOPs (F32):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF32Flops_empirical_peak
MFMA FLOPs (F16):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF16Flops_empirical_peak
MFMA FLOPs (BF16):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMABF16Flops_empirical_peak
MFMA FLOPs (F8):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF8Flops_empirical_peak
MFMA IOPs (Int8):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GIOP/s
peak: $MFMAI8Ops_empirical_peak
HBM Bandwidth:
value: AVG((( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum
- TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + ((TCC_EA0_WRREQ_sum
- TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $HBMBw_empirical_peak
L2 Cache Bandwidth:
value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64)) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $L2Bw_empirical_peak
L1 Cache Bandwidth:
value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GB/s
peak: $L1Bw_empirical_peak
LDS Bandwidth:
value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4 * $lds_banks_per_cu))
/ ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $LDSBw_empirical_peak
- metric_table:
id: 402
title: Roofline Plot Points
cli_style: Roofline
tui_style: Roofline
header:
metric: Metric
value: Value
unit: Unit
metric:
AI HBM:
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
* 512) ) / SUM( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32)
+ ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64)
+ ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum
* 64) ) )
unit: FLOPs/Byte
AI L2:
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
* 512) ) / SUM( (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
+ TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64 ) )
unit: FLOPs/Byte
AI L1:
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
* 512) ) / SUM( TCP_TOTAL_CACHE_ACCESSES_sum * 64 ) )
unit: FLOPs/Byte
Performance (GFLOPs):
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
* 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9
unit: GFLOP/s
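The AI values serve as x-coordinates and ``Performance (GFLOPs)`` as the y-coordinate of a kernel's point. In the standard roofline model, the ceiling at a given arithmetic intensity is the smaller of the peak compute rate and the product of arithmetic intensity and peak bandwidth. The sketch below shows that relationship. The function names and the peak values in the usage lines are hypothetical placeholders, not measured numbers or rocprof-compute APIs.

```python
# Hypothetical sketch of the standard roofline bound.


def attainable_gflops(ai_flops_per_byte, peak_bw_gb_s, peak_gflops):
    """Roofline ceiling: min(compute peak, AI * memory bandwidth)."""
    return min(peak_gflops, ai_flops_per_byte * peak_bw_gb_s)


def bound_kind(ai_flops_per_byte, peak_bw_gb_s, peak_gflops):
    """Classify a point relative to the ridge point (peak / bandwidth)."""
    ridge = peak_gflops / peak_bw_gb_s
    return "memory-bound" if ai_flops_per_byte < ridge else "compute-bound"


# Placeholder peaks: 1000 GB/s bandwidth, 10000 GFLOP/s compute.
low_ai_ceiling = attainable_gflops(2.0, 1000.0, 10000.0)
high_ai_ceiling = attainable_gflops(20.0, 1000.0, 10000.0)
```

A kernel whose measured performance sits well below the ceiling for its arithmetic intensity has headroom at that memory level. Repeating the check with the L1, L2, and HBM values of AI and bandwidth gives one ceiling per level.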
@@ -2,8 +2,205 @@
Panel Config:
id: 400
title: Roofline
metrics_description: {}
metrics_description:
VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
This is also presented as a percent of the peak theoretical FLOPs achievable
on the specific accelerator. Note: this does not include any floating-point
operations from MFMA instructions.'
MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
executed per second. This does not include any 8-bit floating point operations
from VALU instructions. The peak empirically measured F8 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison. This metric
is supported only on AMD Instinct MI300 series and later.
MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
executed per second. Note: this does not include any 16-bit brain floating point
operations from VALU instructions. The peak empirically measured BF16 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.'
MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
per second. Note: this does not include any 16-bit floating point operations
from VALU instructions. The peak empirically measured F16 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
per second. Note: this does not include any 32-bit floating point operations
from VALU instructions. The peak empirically measured F32 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
per second. Note: this does not include any 64-bit floating point operations
from VALU instructions. The peak empirically measured F64 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
per second. Note: this does not include any 8-bit integer operations from VALU
instructions. The peak empirically measured INT8 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth
Memory (HBM) per second. The peak empirically measured bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time.
The number of bytes is calculated as the number of cache lines requested multiplied
by the cache line size. This value does not consider partial requests; for example,
if only a single value is requested in a cache line, the data movement will
still be counted as a full cache line. The peak empirically measured bandwidth
achievable on the specific accelerator is displayed alongside for comparison.
L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result
of VMEM instructions per unit time. The number of bytes is calculated as the
number of cache lines requested multiplied by the cache line size. This value
does not consider partial requests; for example, if only a single value is requested
in a cache line, the data movement will still be counted as a full cache line.
The peak empirically measured bandwidth achievable on the specific accelerator
is displayed alongside for comparison.
LDS Bandwidth: Indicates the maximum number of bytes that could have been loaded
from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
example for more detail). The peak empirically measured LDS bandwidth achievable
on the specific accelerator is displayed alongside for comparison.
AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L1 cache and the processing units. This value is used as the x-coordinate
for the L1 roofline.
AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L2 cache and the L1 cache. This value is used as the x-coordinate for the
L2 roofline.
AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
It is the ratio of total floating-point operations (FLOPs) to total bytes transferred
between HBM and the L2 cache. This value is used as the x-coordinate for the
HBM roofline.
Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
operations divided by the total execution time. This value is used as the y-coordinate
for the kernel's point on the Roofline plot.
data source:
- None:
- metric_table:
id: 401
title: Roofline
title: Roofline Performance Rates
cli_style: Roofline
tui_style: Roofline
header:
metric: Metric
value: Value
unit: Unit
peak: Peak (Empirical)
metric:
VALU FLOPs:
value: AVG((($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) / ((End_Timestamp - Start_Timestamp) / 1e9))
/ 1e9)
unit: GFLOP/s
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
MFMA FLOPs (F64):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF64Flops_empirical_peak
MFMA FLOPs (F32):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF32Flops_empirical_peak
MFMA FLOPs (F16):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF16Flops_empirical_peak
MFMA FLOPs (BF16):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMABF16Flops_empirical_peak
MFMA FLOPs (F8):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF8Flops_empirical_peak
MFMA FLOPs (F6F4):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMA_FLOPs_F6F4_empirical_peak
MFMA IOPs (Int8):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GIOP/s
peak: $MFMAI8Ops_empirical_peak
HBM Bandwidth:
value: AVG((( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum
- TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + ((TCC_EA0_WRREQ_sum
- TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $HBMBw_empirical_peak
L2 Cache Bandwidth:
value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64)) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $L2Bw_empirical_peak
L1 Cache Bandwidth:
value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
unit: GB/s
peak: $L1Bw_empirical_peak
LDS Bandwidth:
value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4 * $lds_banks_per_cu))
/ ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $LDSBw_empirical_peak
- metric_table:
id: 402
title: Roofline Plot Points
cli_style: Roofline
tui_style: Roofline
header:
metric: Metric
value: Value
unit: Unit
metric:
AI HBM:
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
* 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / SUM( (TCC_BUBBLE_sum
* 128) + (TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum
- TCC_EA0_RDREQ_32B_sum) * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
* 32) + (TCC_EA0_WRREQ_64B_sum * 64) ) )
unit: FLOPs/Byte
AI L2:
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
* 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / SUM( (TCP_TCC_WRITE_REQ_sum
+ TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum
+ TCP_TCC_READ_REQ_sum) * 64 ) )
unit: FLOPs/Byte
AI L1:
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
* 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / SUM(TCP_TOTAL_CACHE_ACCESSES_sum
* 64) )
unit: FLOPs/Byte
Performance (GFLOPs):
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
* 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / (SUM(End_Timestamp -
Start_Timestamp) / 1e9) ) / 1e9
unit: GFLOP/s
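The Roofline Plot Points expressions above all share one shape: a total FLOP count divided either by the bytes moved at one memory level (arithmetic intensity) or by the elapsed time (performance). The following is a minimal Python sketch of that arithmetic, not the tool's implementation. The function names and the sample counter values are hypothetical; the factor of 2 for FMA, the factor of 512 per MFMA MOPS count, and the 64-byte cache line are taken from the expressions above.

```python
def valu_flops(wave_size, add, mul, fma, trans):
    """FLOPs from VALU instruction counts; each FMA counts as two operations."""
    return wave_size * (add + mul + 2 * fma + trans)


def mfma_flops(mops):
    """FLOPs from an MFMA MOPS counter, using the factor of 512 from the expressions."""
    return mops * 512


def roofline_point(total_flops, bytes_moved, duration_ns):
    """Return (arithmetic intensity in FLOPs/Byte, performance in GFLOP/s)."""
    ai = total_flops / bytes_moved if bytes_moved else 0.0
    seconds = duration_ns / 1e9
    gflops = (total_flops / seconds) / 1e9 if seconds else 0.0
    return ai, gflops


# Hypothetical counter totals for one kernel (illustrative values only).
flops = valu_flops(wave_size=64, add=100, mul=100, fma=200, trans=0) + mfma_flops(10)
l1_bytes = 1000 * 64  # cache accesses times a 64-byte cache line
ai_l1, perf = roofline_point(flops, l1_bytes, duration_ns=2000)
print(f"AI L1 = {ai_l1:.2f} FLOPs/Byte, performance = {perf:.2f} GFLOP/s")
```

The same `roofline_point` call with L2 or HBM byte totals would give the x-coordinates for the other two memory levels; the y-coordinate is shared.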
@@ -377,14 +377,10 @@ class OmniSoC_Base:
if counter_name.startswith("TCC") and counter_name.endswith("["):
counters.remove(counter_name)
counter_name = counter_name.split("[")[0]
counters = counters.union(
{
f"{counter_name}[{i}]"
for i in range(
num_xcd_for_pmc_file * int(self._mspec._l2_banks)
)
}
)
counters = counters.union({
f"{counter_name}[{i}]"
for i in range(num_xcd_for_pmc_file * int(self._mspec._l2_banks))
})
return counters
+38 -22
@@ -48,7 +48,8 @@ from utils.roofline_calc import (
MFMA_DATATYPES,
PEAK_OPS_DATATYPES,
SUPPORTED_DATATYPES,
calc_ai,
calc_ai_analyze,
calc_ai_profile,
constuct_roof,
)
from utils.utils import mibench
@@ -182,10 +183,9 @@ class Roofline:
console_debug(
"roofline", "Path: %s" % self.__run_parameters.get("workload_dir")
)
self.__ai_data = calc_ai(
self.__ai_data = calc_ai_profile(
self.__mspec, self.__run_parameters.get("sort_type"), ret_df
)
msg = "AI at each mem level:"
for i in self.__ai_data:
msg += "\n\t%s -> %s" % (i, self.__ai_data[i])
@@ -620,7 +620,7 @@ class Roofline:
return fig
def cli_generate_plot(self, dtype):
def cli_generate_plot(self, dtype, workload=None, config=None, arch_config=None):
"""
Plot CLI mode roofline analysis in terminal using plotext
@@ -668,11 +668,43 @@ class Roofline:
else:
# workload_dir is a string
base_dir = workload_dir
self.roof_setup()
# Convert to Path object for easier manipulation
base_path = Path(base_dir)
roofline_csv = base_path / "roofline.csv"
if not roofline_csv.is_file():
console_log("roofline", "{} does not exist".format(roofline_csv))
return
# If a workload is provided, use the Roofline yamls; otherwise, fall back to the legacy calc_ai_profile
if workload is not None:
self.__ai_data = calc_ai_analyze(
workload=workload,
mspec=self.__mspec,
sort_type=self.__run_parameters.get("sort_type"),
config=config,
arch_config=arch_config,
)
else:
pmc_perf_csv = base_path / "pmc_perf.csv"
if not pmc_perf_csv.is_file():
console_error("roofline", "{} does not exist".format(pmc_perf_csv))
t_df = OrderedDict()
t_df["pmc_perf"] = pd.read_csv(pmc_perf_csv)
self.__ai_data = calc_ai_profile(
self.__mspec, self.__run_parameters["sort_type"], t_df
)
self.__ceiling_data = constuct_roof(
roofline_parameters=self.__run_parameters, dtype=dtype
)
console_debug(f"AI data: {self.__ai_data}")
console_debug(f"Kernel names: {self.__ai_data.get('kernelNames', [])}")
self.roof_setup()
# Check proper datatype input - takes single str
if not isinstance(dtype, str):
console_error("Unsupported datatype input - must be str")
@@ -682,16 +714,6 @@ class Roofline:
self.__run_parameters["mem_level"].remove("vL1D")
self.__run_parameters["mem_level"].append("L1")
roofline_csv = base_path / "roofline.csv"
if not roofline_csv.is_file():
console_log("roofline", "{} does not exist".format(roofline_csv))
return
pmc_perf_csv = base_path / "pmc_perf.csv"
if not pmc_perf_csv.is_file():
console_error("roofline", "{} does not exist".format(pmc_perf_csv))
t_df = OrderedDict()
t_df["pmc_perf"] = pd.read_csv(pmc_perf_csv)
profiling_config = file_io.load_profiling_config(self.__args.path[0][0])
if profiling_config.get("format_rocprof_output") == "rocpd":
t_df["pmc_perf"] = rocpd_data.process_rocpd_csv(t_df["pmc_perf"])
@@ -714,12 +736,6 @@ class Roofline:
5: "atom",
}
self.__ceiling_data = constuct_roof(
roofline_parameters=self.__run_parameters,
dtype=dtype,
)
self.__ai_data = calc_ai(self.__mspec, self.__run_parameters["sort_type"], t_df)
plt.clf()
plt.plotsize(plt.tw(), plt.th())
@@ -103,6 +103,7 @@ supported_call = {
"STD": "to_std",
# functions apply to whole column of df or a single value
"TO_INT": "to_int",
"SUM": "to_sum",
# Support the below with 2 inputs
"ROUND": "to_round",
"QUANTILE": "to_quantile",
@@ -196,6 +197,19 @@ def to_int(a):
raise Exception("to_int: unsupported type.")
def to_sum(a):
if a is None:
return np.nan
elif np.isnan(a).all():
return np.nan
elif a.empty:
return np.nan
elif isinstance(a, pd.core.series.Series):
return a.sum()
else:
raise Exception("to_sum: unsupported type.")
def to_round(a, b):
if isinstance(a, pd.core.series.Series):
return a.round(b)
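The new SUM call matters because the plot-point metrics divide one aggregate by another, while the rate metrics in table 401 average a per-dispatch value. These two reductions generally give different answers. A small illustration with made-up per-dispatch numbers follows; the helper names are hypothetical and only the standard library is used.

```python
from statistics import mean


def ratio_of_sums(numerators, denominators):
    """SUM(n) / SUM(d): one ratio over the whole set of dispatches."""
    return sum(numerators) / sum(denominators)


def mean_of_ratios(numerators, denominators):
    """AVG(n / d): average of the per-dispatch ratios."""
    return mean(n / d for n, d in zip(numerators, denominators))


# Hypothetical FLOP and byte counts for two dispatches of the same kernel.
flops_per_dispatch = [100, 1000]
bytes_per_dispatch = [100, 400]

print(ratio_of_sums(flops_per_dispatch, bytes_per_dispatch))
print(mean_of_ratios(flops_per_dispatch, bytes_per_dispatch))
```

The ratio of sums weights each dispatch by how much data it moved, which is usually what a single roofline point per kernel is meant to represent.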
@@ -755,7 +769,7 @@ def build_metric_value_string(dfs, dfs_type, normal_unit, profiling_config):
@demarcate
def eval_metric(dfs, dfs_type, sys_info, raw_pmc_df, debug, config):
def eval_metric(dfs, dfs_type, sys_info, empirical_peaks_df, raw_pmc_df, debug, config):
"""
Execute the expr string for each metric in the df.
"""
@@ -860,6 +874,30 @@ def eval_metric(dfs, dfs_type, sys_info, raw_pmc_df, debug, config):
"wave_size is not available in sysinfo.csv, please provide the correct "
"value using --specs-correction"
)
if not empirical_peaks_df.empty:
peak_data_row = empirical_peaks_df.iloc[0]
for metric_name in empirical_peaks_df.columns:
var_name = f"ammolite__{metric_name}_empirical_peak"
locals()[var_name] = peak_data_row[metric_name]
else:
default_peaks = [
"MFMAF64Flops",
"MFMAF32Flops",
"MFMAF16Flops",
"MFMABF16Flops",
"MFMAF8Flops",
"MFMAI8Ops",
"HBMBw",
"L2Bw",
"L1Bw",
"LDSBw",
"MFMA_FLOPs_F6F4",
]
# set values to 0 if no empirical peaks from roofline.csv are provided
for peak_name in default_peaks:
var_name = f"ammolite__{peak_name}_empirical_peak"
exec(f"{var_name} = 0", globals(), locals())
# TODO: fix all $normUnit in Unit column or title
# build and eval all derived build-in global variables
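The hunk above injects the empirical peaks by writing into `locals()` and by calling `exec`. The Python documentation warns that modifying the dictionary returned by `locals()` may not affect the variables the interpreter actually uses, so this pattern can behave differently across interpreter versions. One alternative, sketched here under the assumption that metric expressions are evaluated with `eval`, is to keep the peaks in an explicit namespace dictionary. The function names are hypothetical; the `ammolite__` prefix and the `_empirical_peak` suffix are copied from the hunk. Unlike the hunk, this sketch always seeds the defaults and then overrides them with measured values.

```python
PREFIX = "ammolite__"  # prefix copied from the hunk above


def build_peak_namespace(peaks, default_names):
    """Map every expected peak name to 0, then override with measured values."""
    namespace = {f"{PREFIX}{name}_empirical_peak": 0 for name in default_names}
    for name, value in peaks.items():
        namespace[f"{PREFIX}{name}_empirical_peak"] = value
    return namespace


def evaluate(expr, namespace):
    """Evaluate an expression against the explicit namespace only."""
    return eval(expr, {"__builtins__": {}}, namespace)


# Hypothetical measured peak for HBM only; L2 falls back to the default of 0.
ns = build_peak_namespace({"HBMBw": 1000.0}, ["HBMBw", "L2Bw"])
print(evaluate("ammolite__HBMBw_empirical_peak / 2", ns))
print(evaluate("ammolite__L2Bw_empirical_peak", ns))
```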
@@ -958,8 +996,7 @@ def eval_metric(dfs, dfs_type, sys_info, raw_pmc_df, debug, config):
except TypeError:
console_warning(
"Skipping entry. Encountered a missing "
"counter\n{} has been assigned to None\n{}"
.format(
"counter\n{} has been assigned to None\n{}".format(
expr,
np.nan,
)
@@ -984,8 +1021,14 @@ def eval_metric(dfs, dfs_type, sys_info, raw_pmc_df, debug, config):
row[expr] = ""
else:
row[expr] = out
except TypeError:
row[expr] = ""
except (TypeError, NameError) as e:
if "empirical_peak" in str(e):
console_warning(
f"Missing empirical peak data: {e}. Using empty value."
)
row[expr] = ""
else:
row[expr] = ""
except AttributeError as ae:
if (
str(ae)
@@ -1043,8 +1086,7 @@ def apply_filters(workload, dir, is_gui, debug):
for kernel_id in workload.filter_kernel_ids:
if kernel_id >= len(kernels_df["Kernel_Name"]):
console_error(
"{} is an invalid kernel id. Please enter an id between 0-{}"
.format(
"{} is an invalid kernel id. Please enter an id between 0-{}".format(
kernel_id,
len(kernels_df["Kernel_Name"]) - 1,
)
@@ -1579,6 +1621,7 @@ def load_table_data(workload, dir, is_gui, args, config, skipKernelTop=False):
workload.dfs,
workload.dfs_type,
workload.sys_info.iloc[0],
workload.roofline_peaks,
apply_filters(workload, dir, is_gui, args.debug),
args.debug,
config,
@@ -23,11 +23,15 @@
##############################################################################
import csv
from dataclasses import dataclass
from pathlib import Path
import pandas as pd
from utils.logger import console_debug, console_warning
from utils.parser import apply_filters, eval_metric
################################################
# Global vars
@@ -154,8 +158,7 @@ def get_color(catagory):
# Plot BW at each cache level
# -------------------------------------------------------------------------------------
def calc_ceilings(roofline_parameters, dtype, benchmark_data):
"""Given benchmarking data, calculate ceilings
(or peak performance) for empirical roofline"""
"""Given benchmarking data, calculate ceilings (or peak performance) for empirical roofline"""
# TODO: This is where filtering by memory level will need to occur for standalone
graphPoints = {"hbm": [], "l2": [], "l1": [], "lds": [], "valu": [], "mfma": []}
@@ -186,7 +189,7 @@ def calc_ceilings(roofline_parameters, dtype, benchmark_data):
if dtype in PEAK_OPS_DATATYPES:
x2 = peakOps / peakBw
y2 = peakOps # noqa: F841
y2 = peakOps
# Plot MFMA lines (NOTE: Assuming MI200 soc)
x1_mfma = peakOps / peakBw
@@ -220,9 +223,9 @@ def calc_ceilings(roofline_parameters, dtype, benchmark_data):
graphPoints[cacheHierarchy[i].lower()].append([y1, peakY])
graphPoints[cacheHierarchy[i].lower()].append(peakBw)
# ---------------------------------------------------------------------------------
# -------------------------------------------------------------------------------------
# Plot computing roof
# ---------------------------------------------------------------------------------
# -------------------------------------------------------------------------------------
if dtype in PEAK_OPS_DATATYPES:
# Plot FMA roof
x0 = XMAX
@@ -254,9 +257,151 @@ def calc_ceilings(roofline_parameters, dtype, benchmark_data):
# Overlay application performance
# -------------------------------------------------------------------------------------
# Calculate relevant metrics for ai calculation
def calc_ai(mspec, sort_type, ret_df):
"""Given counter data, calculate arithmetic intensity
for each kernel in the application."""
def calc_ai_analyze(workload, mspec, sort_type, config, arch_config):
"""
Calculate per-kernel metrics and AI points with Roofline yamls using eval_metric.
"""
console_debug("calc_ai_analyze: Starting calc_ai analysis using Roofline yamls")
plot_points = {
"ai_l1": [[], []],
"ai_l2": [[], []],
"ai_hbm": [[], []],
"kernelNames": [],
}
workload.roofline_metrics = {}
filtered_pmc = apply_filters(workload, workload.path, is_gui=False, debug=False)
kernel_ids_to_process = []
kernel_top_table_id = 1
if workload.filter_kernel_ids:
kernel_ids_to_process = workload.filter_kernel_ids
else:
if kernel_top_table_id in workload.dfs:
kernel_top_df = workload.dfs[kernel_top_table_id]
kernel_ids_to_process = kernel_top_df.index.tolist()
console_debug(
"roofline", f"Found {len(kernel_ids_to_process)} kernels to process"
)
if not kernel_ids_to_process:
console_warning("No kernels found to process for roofline")
return plot_points
for kernel_id in kernel_ids_to_process:
if kernel_top_table_id in workload.dfs:
kernel_top_df = workload.dfs[kernel_top_table_id]
if kernel_id in kernel_top_df.index:
kernel_name = kernel_top_df.loc[kernel_id, "Kernel_Name"]
else:
continue
else:
continue
console_debug("roofline", f"Processing kernel {kernel_id}: {kernel_name[:50]}")
# filter PMC data for specific kernel
kernel_pmc_df = filtered_pmc[
filtered_pmc["pmc_perf"]["Kernel_Name"] == kernel_name
]
if kernel_pmc_df.empty:
console_debug("roofline", f"No PMC data for kernel {kernel_id}")
continue
kernel_only_data = {"pmc_perf": kernel_pmc_df["pmc_perf"]}
kernel_dfs = {}
kernel_dfs_type = {}
for table_id in [401, 402]:
if table_id in arch_config.dfs:
kernel_dfs[table_id] = arch_config.dfs[table_id].copy()
kernel_dfs_type[table_id] = arch_config.dfs_type[table_id]
# eval metrics for single kernel only
eval_metric(
kernel_dfs,
kernel_dfs_type,
workload.sys_info.iloc[0],
workload.roofline_peaks,
kernel_only_data,
debug=False,
config=config,
)
# DEBUG
if 402 in kernel_dfs:
console_debug("roofline", f"Table 402 for kernel {kernel_id}:")
for idx, row in kernel_dfs[402].iterrows():
console_debug(
"roofline", f" {row.get('Metric', '')}: {row.get('Value', '')}"
)
ai_hbm = ai_l2 = ai_l1 = performance = 0
if 402 in kernel_dfs:
for idx, row in kernel_dfs[402].iterrows():
metric = row.get("Metric", "")
value = row.get("Value", 0)
if metric == "AI HBM":
ai_hbm = value if value and value != "" else 0
elif metric == "AI L2":
ai_l2 = value if value and value != "" else 0
elif metric == "AI L1":
ai_l1 = value if value and value != "" else 0
elif metric == "Performance (GFLOPs)":
performance = value if value and value != "" else 0
console_debug(
"roofline",
f"Kernel {kernel_id}: AI_HBM={ai_hbm:.2f}, AI_L2={ai_l2:.2f}, AI_L1={ai_l1:.2f}, Performance={performance:.2e} GFLOP/s",
)
# add to plot points if we have valid data
if performance > 0:
if ai_hbm > 0:
plot_points["ai_hbm"][0].append(ai_hbm)
plot_points["ai_hbm"][1].append(performance)
if ai_l2 > 0:
plot_points["ai_l2"][0].append(ai_l2)
plot_points["ai_l2"][1].append(performance)
if ai_l1 > 0:
plot_points["ai_l1"][0].append(ai_l1)
plot_points["ai_l1"][1].append(performance)
plot_points["kernelNames"].append(f"K{kernel_id}")
console_debug("roofline", f"Added kernel {kernel_id} to plot points")
else:
console_debug(
"roofline", f"Skipping kernel {kernel_id} - no performance data"
)
# store metrics for display
workload.roofline_metrics[kernel_id] = {
"name": kernel_name,
"ai_table": kernel_dfs.get(401, pd.DataFrame()),
"calc_table": kernel_dfs.get(402, pd.DataFrame()),
}
console_debug(
"roofline", f"Generated {len(plot_points['kernelNames'])} plot points"
)
console_debug("roofline", f"Plot points: {plot_points}")
return plot_points
def calc_ai_profile(mspec, sort_type, ret_df):
"""Given counter data, calculate arithmetic intensity for each kernel in the application.
Leverage hard-coded equations to calculate AI values.
Used during profiling stage to generate roofline PDF, since Roofline yamls are not available
in the profiling stage."""
console_debug(
"calc_ai_profile: Starting legacy roofline calculation (from roofline_calc)"
)
df = ret_df["pmc_perf"]
# Sort by top kernels or top dispatches?
df = df.sort_values(by=["Kernel_Name"])
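The plot-point bookkeeping in `calc_ai_analyze` can be read in isolation: a kernel contributes a point at a memory level only when both its performance and its arithmetic intensity at that level are positive. The sketch below mirrors that structure with hypothetical helper names. It also coerces table cells to numbers first, which is an added assumption, since the values read back from table 402 may be empty strings.

```python
def new_plot_points():
    """Empty container with the same keys as plot_points in calc_ai_analyze."""
    return {"ai_l1": [[], []], "ai_l2": [[], []], "ai_hbm": [[], []], "kernelNames": []}


def to_number(value):
    """Coerce a table cell to float; blanks and non-numeric cells become 0.0."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.0


def add_kernel(plot_points, kernel_id, performance, ai_by_level):
    """Append one kernel's points; return False if it has no usable performance."""
    perf = to_number(performance)
    if not perf > 0:  # also rejects NaN
        return False
    for level, ai in ai_by_level.items():
        ai = to_number(ai)
        if ai > 0:
            plot_points[level][0].append(ai)    # x: arithmetic intensity
            plot_points[level][1].append(perf)  # y: performance
    plot_points["kernelNames"].append(f"K{kernel_id}")
    return True


points = new_plot_points()
added = add_kernel(points, 0, 12.5, {"ai_l1": 0.5, "ai_l2": "", "ai_hbm": 2.0})
skipped = add_kernel(points, 1, "", {"ai_l1": 1.0, "ai_l2": 1.0, "ai_hbm": 1.0})
print(points)
```

Because a level is skipped when its intensity is not positive, the per-level lists can end up shorter than `kernelNames`, so labels should not be matched to points purely by index.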
@@ -463,7 +608,9 @@ def calc_ai(mspec, sort_type, ret_df):
calls += 1
if sort_type == "kernels" and (at_end or (kernelName != next_kernelName)):
if sort_type == "kernels" and (
at_end == True or (kernelName != next_kernelName)
):
myList.append(
AI_Data(
kernelName,
@@ -538,8 +685,9 @@ def calc_ai(mspec, sort_type, ret_df):
while i < TOP_N and i != len(myList):
if myList[i].total_flops == 0:
console_debug(
"No flops counted for {}, arithmetic intensities will not "
"display on plots.".format(myList[i].KernelName)
"No flops counted for {}, arithmetic intensities will not display on plots.".format(
myList[i].KernelName
)
)
kernelNames.append(myList[i].KernelName)
@@ -548,40 +696,28 @@ def calc_ai(mspec, sort_type, ret_df):
if myList[i].L1cache_data
else intensities["ai_l1"].append(0)
)
# print(
# "cur_ai_L1",
# myList[i].total_flops / myList[i].L1cache_data
# ) if myList[i].L1cache_data else print("null")
# print("cur_ai_L1", myList[i].total_flops/myList[i].L1cache_data) if myList[i].L1cache_data else print("null")
# print()
(
intensities["ai_l2"].append(myList[i].total_flops / myList[i].L2cache_data)
if myList[i].L2cache_data
else intensities["ai_l2"].append(0)
)
# print(
# "cur_ai_L2",
# myList[i].total_flops / myList[i].L2cache_data
# ) if myList[i].L2cache_data else print("null")
# print("cur_ai_L2", myList[i].total_flops/myList[i].L2cache_data) if myList[i].L2cache_data else print("null")
# print()
(
intensities["ai_hbm"].append(myList[i].total_flops / myList[i].hbm_data)
if myList[i].hbm_data
else intensities["ai_hbm"].append(0)
)
# print(
# "cur_ai_hbm",
# myList[i].total_flops / myList[i].hbm_data
# ) if myList[i].hbm_data else print("null")
# print("cur_ai_hbm", myList[i].total_flops/myList[i].hbm_data) if myList[i].hbm_data else print("null")
# print()
(
curr_perf.append(myList[i].total_flops / myList[i].avgDuration)
if myList[i].avgDuration
else curr_perf.append(0)
)
# print(
# "cur_perf",
# myList[i].total_flops / myList[i].avgDuration
# ) if myList[i].avgDuration else print("null")
# print("cur_perf", myList[i].total_flops/myList[i].avgDuration) if myList[i].avgDuration else print("null")
i += 1
@@ -590,7 +726,7 @@ def calc_ai(mspec, sort_type, ret_df):
for i in intensities:
values = intensities[i]
color = get_color(i) # noqa: F841
color = get_color(i)
x = []
y = []
for entryIndx in range(0, len(values)):
@@ -622,8 +758,7 @@ def constuct_roof(roofline_parameters, dtype):
# -----------------------------------------------------
# Initialize roofline data dictionary from roofline.csv
# -----------------------------------------------------
# TODO: consider changing this to an ordered dict for consistency over py versions
benchmark_data = {}
benchmark_data = {} # TODO: consider changing this to an ordered dict for consistency over py versions
headers = []
try:
with open(benchmark_results, "r") as csvfile:
@@ -641,7 +776,7 @@ def constuct_roof(roofline_parameters, dtype):
rowCount += 1
csvfile.close()
except Exception:
except:
graphPoints = {
"hbm": [None, None, None],
"l2": [None, None, None],
@@ -83,6 +83,7 @@ supported_field = [
"Avg",
"Pct of Peak",
"Peak",
"Peak (Empirical)",
"Count",
"Mean",
"Pct",
+104 -11
@@ -32,6 +32,7 @@ from tabulate import tabulate
import config
from utils import mem_chart, parser
from utils.kernel_name_shortener import kernel_name_shortener
from utils.logger import console_error, console_log, console_warning
from utils.utils import convert_metric_id_to_panel_info
@@ -146,6 +147,108 @@ def show_all(args, runs, archConfigs, output, profiling_config, roof_plot=None):
continue
ss = "" # store content of all data_source from one panel
if panel_id == 400:
has_roofline_style = any(
data_source.get(type, {}).get("cli_style") == "Roofline"
for data_source in panel["data source"]
for type in data_source
)
if has_roofline_style and (
not args.filter_metrics or "4" in args.filter_metrics
):
print("\n" + "=" * 80, file=output)
print("4. Roofline", file=output)
print("=" * 80, file=output)
for run_path, workload in runs.items():
if (
hasattr(workload, "roofline_metrics")
and workload.roofline_metrics
):
print(
"\n(4.1) Per-Kernel Roofline Metrics and (4.2) AI Plot Points",
file=output,
)
print("-" * 80, file=output)
kernel_top_df = workload.dfs.get(1, pd.DataFrame())
if not kernel_top_df.empty:
kernel_name_shortener(kernel_top_df, args.kernel_verbose)
for i, (kernel_id, metrics) in enumerate(
workload.roofline_metrics.items()
):
if (
not kernel_top_df.empty
and kernel_id in kernel_top_df.index
):
kernel_name = kernel_top_df.loc[
kernel_id, "Kernel_Name"
]
kernel_pct = (
kernel_top_df.loc[kernel_id, "Pct"]
if "Pct" in kernel_top_df.columns
else 0
)
else:
kernel_name = metrics.get("name", f"Kernel {kernel_id}")
kernel_pct = 0
display_name = (
kernel_name[:80] + "..."
if len(kernel_name) > 80
else kernel_name
)
print(
f"\nKernel {kernel_id}: {display_name} ({kernel_pct:.1f}%)",
file=output,
)
base_indent = " "
table_indent_prefix = f"{base_indent}| "
tables = {
401: (
"4.1 Roofline Rate Metrics:",
metrics.get("ai_table", pd.DataFrame()),
),
402: (
"4.2 Roofline AI Plot Points:",
metrics.get("calc_table", pd.DataFrame()),
),
}
print(f"{base_indent}|", file=output)
for table_id, (table_name, df) in tables.items():
if df.empty:
continue
print(f"{base_indent}├─ {table_name}", file=output)
display_df = df.copy()
for col in hidden_cols:
if col in display_df.columns:
display_df = display_df.drop(columns=[col])
table_string = get_table_string(
display_df, transpose=False, decimal=args.decimal
)
indented_table_string = textwrap.indent(
table_string, table_indent_prefix
)
print(indented_table_string, file=output)
else:
print("\nNo per-kernel metrics available", file=output)
# Show the roofline plot
if roof_plot:
show_roof_plot(roof_plot)
continue
for data_source in panel["data source"]:
for type, table_config in data_source.items():
# If block filtering was used during analysis, then don't use profiling
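The per-kernel layout in the hunk above is built from a header line, a connector line, and tables indented with `textwrap.indent`. A minimal standalone sketch of that layout follows. The function name and the two-space indent are hypothetical, and plain strings stand in for the output of `get_table_string`.

```python
import textwrap


def render_kernel_section(header, tables, base_indent="  "):
    """Render a kernel header followed by indented, titled tables."""
    prefix = f"{base_indent}| "
    lines = [header, f"{base_indent}|"]
    for title, table_string in tables:
        if not table_string:  # skip empty tables, as the hunk does for empty frames
            continue
        lines.append(f"{base_indent}├─ {title}")
        lines.append(textwrap.indent(table_string, prefix))
    return "\n".join(lines)


section = render_kernel_section(
    "Kernel 0: example_kernel (50.0%)",
    [("4.1 Roofline Rate Metrics:", "row a\nrow b"), ("4.2 Roofline AI Plot Points:", "")],
)
print(section)
```

Returning a single string and printing it once keeps every line of the section on the same output stream.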
@@ -172,16 +275,6 @@ def show_all(args, runs, archConfigs, output, profiling_config, roof_plot=None):
)
continue
# Show roofline
# Check if we have filter_metrics for analyze stage:
# no filter_metrics = show all,
# filter_metrics containing "4" = user requesting roofline chart
if panel_id == 400 and (
not args.filter_metrics or "4" in args.filter_metrics
):
show_roof_plot(roof_plot)
continue
# Metrics baseline comparison mode
# We cannot guarantee that all runs have the same metrics.
# Only show common metrics.
@@ -454,7 +547,7 @@ def show_roof_plot(roof_plot):
# TODO: short term solution to display roofline plot
print("\n" + "-" * 80)
print("4. Roofline")
print("4.1 Roofline")
print("4.3 Roofline Plot")
if roof_plot:
print(roof_plot)
else:
@@ -745,7 +745,7 @@ def run_prof(
config.rocprof_compute_home
/ "rocprof_compute_soc"
/ "profile_configs"
/ f"counter_defs.yaml",
/ "counter_defs.yaml",
"r",
) as file:
counter_defs = yaml.safe_load(file)
@@ -1676,9 +1676,9 @@ class TestSetsIntegration:
memory_metrics = ["16.1.2", "17.1.0"]
for metric_id in memory_metrics:
assert (
metric_id in open(Path(workload_dir) / "log.txt", "r").read()
), f"Expected memory metric {metric_id} not found"
assert metric_id in open(Path(workload_dir) / "log.txt", "r").read(), (
f"Expected memory metric {metric_id} not found"
)
test_utils.clean_output_dir(config["cleanup"], workload_dir)
@@ -1745,7 +1745,9 @@ class TestSetsIntegration:
assert returncode == 1
test_utils.clean_output_dir(config["cleanup"], workload_dir)
def test_set_and_block_mutual_exclusion(self, binary_handler_profile_rocprof_compute):
def test_set_and_block_mutual_exclusion(
self, binary_handler_profile_rocprof_compute
):
options = ["--set", "compute_thruput_util", "--block", "12"]
workload_dir = test_utils.get_output_dir()
@@ -30,18 +30,17 @@ import json
import locale
import logging
import os
import tempfile
import pathlib
import re
import shutil
import subprocess
import tempfile
from pathlib import Path
from types import SimpleNamespace
from unittest import mock
import pandas as pd
import pytest
import yaml
import utils.utils as utils
@@ -23,12 +23,12 @@ src/rocprof_compute_soc/analysis_configs/gfx940/0300_memory_chart.yaml: cff5509a
src/rocprof_compute_soc/analysis_configs/gfx941/0300_memory_chart.yaml: cff5509ac8502bad6dbd75e3058159fe429aece5d93279c66b2a6a8c887b43b6
src/rocprof_compute_soc/analysis_configs/gfx942/0300_memory_chart.yaml: cff5509ac8502bad6dbd75e3058159fe429aece5d93279c66b2a6a8c887b43b6
src/rocprof_compute_soc/analysis_configs/gfx950/0300_memory_chart.yaml: 643b31ffa43bc3613d6f90b0c23d95093d0d0aa5bc8e72d9a0fbc1b739a08b67
src/rocprof_compute_soc/analysis_configs/gfx908/0400_roofline.yaml: 57f95dcd487dfcdf24e1c2d8eb16d14dc3462df83d11a08e7de2b06343b48c3e
src/rocprof_compute_soc/analysis_configs/gfx90a/0400_roofline.yaml: 57f95dcd487dfcdf24e1c2d8eb16d14dc3462df83d11a08e7de2b06343b48c3e
src/rocprof_compute_soc/analysis_configs/gfx940/0400_roofline.yaml: 57f95dcd487dfcdf24e1c2d8eb16d14dc3462df83d11a08e7de2b06343b48c3e
src/rocprof_compute_soc/analysis_configs/gfx941/0400_roofline.yaml: 57f95dcd487dfcdf24e1c2d8eb16d14dc3462df83d11a08e7de2b06343b48c3e
src/rocprof_compute_soc/analysis_configs/gfx942/0400_roofline.yaml: 57f95dcd487dfcdf24e1c2d8eb16d14dc3462df83d11a08e7de2b06343b48c3e
src/rocprof_compute_soc/analysis_configs/gfx950/0400_roofline.yaml: 57f95dcd487dfcdf24e1c2d8eb16d14dc3462df83d11a08e7de2b06343b48c3e
src/rocprof_compute_soc/analysis_configs/gfx908/0400_roofline.yaml: 6406ce67cd55064f0d2db2a3511c6536cc1625314ddb31366900fbf3c60ed523
src/rocprof_compute_soc/analysis_configs/gfx90a/0400_roofline.yaml: 100d555cf9e70b892e22f92ddd9c0a5d1f914d07077c4a8d35941e8ad62b5b30
src/rocprof_compute_soc/analysis_configs/gfx940/0400_roofline.yaml: f8bf66f43c9afede4fd1f17c279050cc27cc6fbc1cdb53a71ae8ceb0eb84dc37
src/rocprof_compute_soc/analysis_configs/gfx941/0400_roofline.yaml: 6fae04dcf4bcabe4a71f5d9eefc379a38d30cdf05fbb14e2c276e1c272fdb3f6
src/rocprof_compute_soc/analysis_configs/gfx942/0400_roofline.yaml: c8dfe7df24f94dfa229ffa2035b802c6833ce98f7710e0889bc5710f2167d4c0
src/rocprof_compute_soc/analysis_configs/gfx950/0400_roofline.yaml: 734fdfa818bfd8a87e01a0dd795c502a567c72158ca9b7bfe01e99451e8aa537
src/rocprof_compute_soc/analysis_configs/gfx908/0500_command_processor_cpc_cpf.yaml: da1c2997d42d66da2aa069caa741cf9eade124357c56e4290333de2f3e0412bb
src/rocprof_compute_soc/analysis_configs/gfx90a/0500_command_processor_cpc_cpf.yaml: da1c2997d42d66da2aa069caa741cf9eade124357c56e4290333de2f3e0412bb
src/rocprof_compute_soc/analysis_configs/gfx940/0500_command_processor_cpc_cpf.yaml: da1c2997d42d66da2aa069caa741cf9eade124357c56e4290333de2f3e0412bb
@@ -87,7 +87,9 @@ def update_analysis_config():
data_source_config["metric_table"]["metric"],
gfx_version,
)
new_panel_config["Panel Config"]["data source"].append(data_source_config)
new_panel_config["Panel Config"]["data source"].append(
data_source_config
)
# Write panel config to file
filename = Path(
TARGET_DIR.joinpath(gfx_version, f"{panel_id}_{panel_title}.yaml")
@@ -134,9 +136,9 @@ def update_sets_config():
}
for metric_id in sets["metric"][gfx_version]:
current_set["metric"].append(
{metric_id: METRIC_ID_TO_NAME_MAP[gfx_version][str(metric_id)]}
)
current_set["metric"].append({
metric_id: METRIC_ID_TO_NAME_MAP[gfx_version][str(metric_id)]
})
new_sets["sets"].append(current_set)
@@ -2801,9 +2801,963 @@ panels:
- id: 400
title: Roofline
data source:
- None:
- metric_table:
id: 401
title: Roofline
title: Roofline Performance Rates
cli_style: Roofline
tui_style: Roofline
header:
metric: Metric
value: Value
unit: Unit
peak: Peak (Empirical)
metric:
gfx90a:
VALU FLOPs:
value: AVG((($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
MFMA FLOPs (F64):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF64Flops_empirical_peak
MFMA FLOPs (F32):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF32Flops_empirical_peak
MFMA FLOPs (F16):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF16Flops_empirical_peak
MFMA FLOPs (BF16):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMABF16Flops_empirical_peak
MFMA IOPs (Int8):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GIOP/s
peak: $MFMAI8Ops_empirical_peak
HBM Bandwidth:
value: AVG(((
(TCC_EA_RDREQ_32B_sum * 32) +
((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) * 64) +
(TCC_EA_WRREQ_64B_sum * 64) +
((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) * 32)
) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $HBMBw_empirical_peak
L2 Cache Bandwidth:
value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) *
64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $L2Bw_empirical_peak
L1 Cache Bandwidth:
value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $L1Bw_empirical_peak
LDS Bandwidth:
value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) *
4 * $lds_banks_per_cu)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $LDSBw_empirical_peak
gfx908:
VALU FLOPs:
value: AVG((($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
MFMA FLOPs (F64):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF64Flops_empirical_peak
MFMA FLOPs (F32):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF32Flops_empirical_peak
MFMA FLOPs (F16):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF16Flops_empirical_peak
MFMA FLOPs (BF16):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMABF16Flops_empirical_peak
MFMA IOPs (Int8):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GIOP/s
peak: $MFMAI8Ops_empirical_peak
HBM Bandwidth:
value: AVG(((
(TCC_BUBBLE_sum * 128) +
(TCC_EA0_RDREQ_32B_sum * 32) +
((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) +
((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) +
(TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $HBMBw_empirical_peak
L2 Cache Bandwidth:
value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) *
64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $L2Bw_empirical_peak
L1 Cache Bandwidth:
value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $L1Bw_empirical_peak
LDS Bandwidth:
value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) *
4 * $lds_banks_per_cu)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $LDSBw_empirical_peak
gfx940:
VALU FLOPs:
value: AVG(($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
MFMA FLOPs (F64):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF64Flops_empirical_peak
MFMA FLOPs (F32):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF32Flops_empirical_peak
MFMA FLOPs (F16):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF16Flops_empirical_peak
MFMA FLOPs (BF16):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMABF16Flops_empirical_peak
MFMA FLOPs (F8):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF8Flops_empirical_peak
MFMA IOPs (Int8):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GIOP/s
peak: $MFMAI8Ops_empirical_peak
HBM Bandwidth:
value: AVG(((
(TCC_BUBBLE_sum * 128) +
(TCC_EA0_RDREQ_32B_sum * 32) +
((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) +
((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) +
(TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $HBMBw_empirical_peak
L2 Cache Bandwidth:
value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) *
64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $L2Bw_empirical_peak
L1 Cache Bandwidth:
value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $L1Bw_empirical_peak
LDS Bandwidth:
value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) *
4 * $lds_banks_per_cu)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $LDSBw_empirical_peak
gfx941:
VALU FLOPs:
value: AVG(($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
MFMA FLOPs (F64):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF64Flops_empirical_peak
MFMA FLOPs (F32):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF32Flops_empirical_peak
MFMA FLOPs (F16):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF16Flops_empirical_peak
MFMA FLOPs (BF16):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMABF16Flops_empirical_peak
MFMA FLOPs (F8):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF8Flops_empirical_peak
MFMA IOPs (Int8):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GIOP/s
peak: $MFMAI8Ops_empirical_peak
HBM Bandwidth:
value: AVG(((
(TCC_BUBBLE_sum * 128) +
(TCC_EA0_RDREQ_32B_sum * 32) +
((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) +
((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) +
(TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $HBMBw_empirical_peak
L2 Cache Bandwidth:
value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) *
64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $L2Bw_empirical_peak
L1 Cache Bandwidth:
value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $L1Bw_empirical_peak
LDS Bandwidth:
value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) *
4 * $lds_banks_per_cu)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $LDSBw_empirical_peak
gfx942:
VALU FLOPs:
value: AVG((($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
MFMA FLOPs (F64):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF64Flops_empirical_peak
MFMA FLOPs (F32):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF32Flops_empirical_peak
MFMA FLOPs (F16):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF16Flops_empirical_peak
MFMA FLOPs (BF16):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMABF16Flops_empirical_peak
MFMA FLOPs (F8):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF8Flops_empirical_peak
MFMA IOPs (Int8):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GIOP/s
peak: $MFMAI8Ops_empirical_peak
HBM Bandwidth:
value: AVG(((
(TCC_BUBBLE_sum * 128) +
(TCC_EA0_RDREQ_32B_sum * 32) +
((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) +
((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) +
(TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $HBMBw_empirical_peak
L2 Cache Bandwidth:
value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) *
64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $L2Bw_empirical_peak
L1 Cache Bandwidth:
value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $L1Bw_empirical_peak
LDS Bandwidth:
value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) *
4 * $lds_banks_per_cu)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $LDSBw_empirical_peak
gfx950:
VALU FLOPs:
value: AVG((($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
MFMA FLOPs (F64):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF64Flops_empirical_peak
MFMA FLOPs (F32):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF32Flops_empirical_peak
MFMA FLOPs (F16):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF16Flops_empirical_peak
MFMA FLOPs (BF16):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMABF16Flops_empirical_peak
MFMA FLOPs (F8):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMAF8Flops_empirical_peak
MFMA FLOPs (F6F4):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $MFMA_FLOPs_F6F4_empirical_peak
MFMA IOPs (Int8):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GIOP/s
peak: $MFMAI8Ops_empirical_peak
HBM Bandwidth:
value: AVG(((
(TCC_BUBBLE_sum * 128) +
(TCC_EA0_RDREQ_32B_sum * 32) +
((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) +
((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) +
(TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $HBMBw_empirical_peak
L2 Cache Bandwidth:
value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) *
64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $L2Bw_empirical_peak
L1 Cache Bandwidth:
value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $L1Bw_empirical_peak
LDS Bandwidth:
value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) *
4 * $lds_banks_per_cu)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GB/s
peak: $LDSBw_empirical_peak
- metric_table:
id: 402
title: Roofline Plot Points
cli_style: Roofline
tui_style: Roofline
header:
metric: Metric
value: Value
unit: Unit
metric:
gfx90a:
AI HBM:
value: (
SUM(
($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
)) +
(SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)
) /
SUM(
(TCC_EA_RDREQ_32B_sum * 32) +
((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) * 64) +
(TCC_EA_WRREQ_64B_sum * 64) +
((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) * 32)
)
)
unit: FLOPs/Byte
AI L2:
value: (
SUM(
($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
)) +
(SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)
) /
SUM(
(TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64
)
)
unit: FLOPs/Byte
AI L1:
value: (
SUM(
($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
)) +
(SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)
) /
SUM(TCP_TOTAL_CACHE_ACCESSES_sum * 64)
)
unit: FLOPs/Byte
Performance GFLOPs:
value: (
SUM(
($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
)) +
(SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)
) /
(SUM(End_Timestamp - Start_Timestamp) / 1e9)
) / 1e9
unit: GFLOP/s
gfx908:
AI HBM:
value: (
SUM(
($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
)) +
(SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)
) /
SUM(
(TCC_BUBBLE_sum * 128) +
(TCC_EA0_RDREQ_32B_sum * 32) +
((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) +
((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) +
(TCC_EA0_WRREQ_64B_sum * 64)
)
)
unit: FLOPs/Byte
AI L2:
value: (
SUM(
($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
)) +
(SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)
) /
SUM(
(TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64
)
)
unit: FLOPs/Byte
AI L1:
value: (
SUM(
($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
)) +
(SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)
) /
SUM(TCP_TOTAL_CACHE_ACCESSES_sum * 64)
)
unit: FLOPs/Byte
Performance GFLOPs:
value: (
SUM(
($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
)) +
(SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)
) /
(SUM(End_Timestamp - Start_Timestamp) / 1e9)
) / 1e9
unit: GFLOP/s
gfx940:
AI HBM:
value: (
SUM(
($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
)) +
(SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)
) /
SUM(
(TCC_BUBBLE_sum * 128) +
(TCC_EA0_RDREQ_32B_sum * 32) +
((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) +
((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) +
(TCC_EA0_WRREQ_64B_sum * 64)
)
)
unit: FLOPs/Byte
AI L2:
value: (
SUM(
($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
)) +
(SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)
) /
SUM(
(TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64
)
)
unit: FLOPs/Byte
AI L1:
value: (
SUM(
($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
)) +
(SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)
) /
SUM(TCP_TOTAL_CACHE_ACCESSES_sum * 64)
)
unit: FLOPs/Byte
Performance GFLOPs:
value: (
SUM(
($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
)) +
(SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)
) /
(SUM(End_Timestamp - Start_Timestamp) / 1e9)
) / 1e9
unit: GFLOP/s
gfx941:
AI HBM:
value: (
SUM(
($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
)) +
(SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)
) /
SUM(
(TCC_BUBBLE_sum * 128) +
(TCC_EA0_RDREQ_32B_sum * 32) +
((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) +
((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) +
(TCC_EA0_WRREQ_64B_sum * 64)
)
)
unit: FLOPs/Byte
AI L2:
value: (
SUM(
($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
)) +
(SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)
) /
SUM(
(TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64
)
)
unit: FLOPs/Byte
AI L1:
value: (
SUM(
($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
)) +
(SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)
) /
SUM(TCP_TOTAL_CACHE_ACCESSES_sum * 64)
)
unit: FLOPs/Byte
Performance GFLOPs:
value: (
SUM(
($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
)) +
(SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)
) /
(SUM(End_Timestamp - Start_Timestamp) / 1e9)
) / 1e9
unit: GFLOP/s
gfx942:
AI HBM:
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
* 512) ) / SUM( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32)
+ ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64)
+ ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum
* 64) ) )
unit: FLOPs/Byte
AI L2:
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
* 512) ) / SUM( (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64 ) )
unit: FLOPs/Byte
AI L1:
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
* 512) ) / SUM( TCP_TOTAL_CACHE_ACCESSES_sum * 64 ) )
unit: FLOPs/Byte
Performance (GFLOPs):
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 *
512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8
* 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9
unit: GFLOP/s
gfx950:
AI HBM:
value: (
SUM(
($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
)) +
(SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512)
) /
SUM(
(TCC_BUBBLE_sum * 128) +
(TCC_EA0_RDREQ_32B_sum * 32) +
((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) +
((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) +
(TCC_EA0_WRREQ_64B_sum * 64)
)
)
unit: FLOPs/Byte
AI L2:
value: (
SUM(
($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
)) +
(SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512)
) /
SUM(
(TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum +
TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64
)
)
unit: FLOPs/Byte
AI L1:
value: (
SUM(
($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
)) +
(SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512)
) /
SUM(TCP_TOTAL_CACHE_ACCESSES_sum * 64)
)
unit: FLOPs/Byte
Performance (GFLOPs):
value: (
SUM(
($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
)) +
(SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512)
) /
(SUM(End_Timestamp - Start_Timestamp) / 1e9)
) / 1e9
unit: GFLOP/s
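# Worked example for the performance expression above. The numbers are
# illustrative only (not measured data) and assume $wave_size = 64:
#   SQ_INSTS_VALU_FMA_F32 = 1000, SQ_INSTS_VALU_MFMA_MOPS_F16 = 10, all other counters = 0,
#   SUM(End_Timestamp - Start_Timestamp) = 2000 ns
#   FLOPs       = 64 * (2 * 1000) + (10 * 512) = 128000 + 5120 = 133120
#   seconds     = 2000 / 1e9 = 2e-6
#   Performance = (133120 / 2e-6) / 1e9 = 66.56 GFLOP/s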
metrics_description:
VALU FLOPs:
plain: 'The total floating-point operations executed per second on the VALU.
This is also presented as a percent of the peak theoretical FLOPs achievable
on the specific accelerator. Note: this does not include any floating-point
operations from MFMA instructions.'
rst: 'The total floating-point operations executed per second on the :ref:`VALU
<desc-valu>`. This is also presented as a percent of the peak theoretical
FLOPs achievable on the specific accelerator. Note: this does not include
any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.'
unit: GFLOPs
MFMA FLOPs (F8):
plain: The total number of 8-bit floating point MFMA operations executed
per second. This does not include any 8-bit floating point operations
from VALU instructions. The peak empirically measured F8 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.
This metric is supported only on AMD Instinct MI300 series and later.
rst: 'The total number of 8-bit floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any 8-bit
floating point operations from :ref:`VALU <desc-valu>` instructions. The
peak empirically measured F8 MFMA operations achievable on the specific
accelerator is displayed alongside for comparison. This metric is supported
only on AMD Instinct MI300 series and later.'
unit: GFLOPs
MFMA FLOPs (BF16):
plain: 'The total number of 16-bit brain floating point MFMA operations executed
per second. Note: this does not include any 16-bit brain floating point
operations from VALU instructions. The peak empirically measured BF16 MFMA
operations achievable on the specific accelerator is displayed alongside
for comparison.'
rst: 'The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
operations executed per second. Note: this does not include any 16-bit brain
floating point operations from :ref:`VALU <desc-valu>` instructions. The
peak empirically measured BF16 MFMA operations achievable on the specific
accelerator is displayed alongside for comparison.'
unit: GFLOPs
MFMA FLOPs (F16):
plain: 'The total number of 16-bit floating point MFMA operations executed per
second. Note: this does not include any 16-bit floating point operations from
VALU instructions. The peak empirically measured F16 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.'
rst: 'The total number of 16-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 16-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
measured F16 MFMA operations achievable on the specific accelerator is
displayed alongside for comparison.'
unit: GFLOPs
MFMA FLOPs (F32):
plain: 'The total number of 32-bit floating point MFMA operations executed per
second. Note: this does not include any 32-bit floating point operations from
VALU instructions. The peak empirically measured F32 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.'
rst: 'The total number of 32-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 32-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
measured F32 MFMA operations achievable on the specific accelerator is
displayed alongside for comparison.'
unit: GFLOPs
MFMA FLOPs (F64):
plain: 'The total number of 64-bit floating point MFMA operations executed per
second. Note: this does not include any 64-bit floating point operations from
VALU instructions. The peak empirically measured F64 MFMA operations
achievable on the specific accelerator is displayed alongside for comparison.'
rst: 'The total number of 64-bit floating point :ref:`MFMA <desc-mfma>` operations
executed per second. Note: this does not include any 64-bit floating point
operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
measured F64 MFMA operations achievable on the specific accelerator is
displayed alongside for comparison.'
unit: GFLOPs
MFMA IOPs (Int8):
plain: 'The total number of 8-bit integer MFMA operations executed per second.
Note: this does not include any 8-bit integer operations from VALU instructions.
The peak empirically measured INT8 MFMA operations achievable on the specific
accelerator is displayed alongside for comparison.'
rst: 'The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations executed
per second. Note: this does not include any 8-bit integer operations from
:ref:`VALU <desc-valu>` instructions. The peak empirically measured INT8 MFMA
operations achievable on the specific accelerator is displayed alongside
for comparison.'
unit: GIOPs
HBM Bandwidth:
plain: 'The total number of bytes read from and written to High-Bandwidth
Memory (HBM) per second. The peak empirically measured bandwidth achievable
on the specific accelerator is displayed alongside for comparison.'
rst: 'The total number of bytes read from and written to High-Bandwidth
Memory (HBM) per second. The peak empirically measured bandwidth achievable
on the specific accelerator is displayed alongside for comparison.'
unit: GB/s
L2 Cache Bandwidth:
plain: The number of bytes looked up in the L2 cache per unit time. The number
of bytes is calculated as the number of cache lines requested multiplied by
the cache line size. This value does not consider partial requests, so e.g.,
if only a single value is requested in a cache line, the data movement will
still be counted as a full cache line. The peak empirically measured bandwidth
achievable on the specific accelerator is displayed alongside for comparison.
rst: The number of bytes looked up in the L2 cache per unit time. The number of
bytes is calculated as the number of cache lines requested multiplied by
the cache line size. This value does not consider partial requests, so e.g.,
if only a single value is requested in a cache line, the data movement will
still be counted as a full cache line. The peak empirically measured
bandwidth achievable on the specific accelerator is displayed alongside
for comparison.
unit: GB/s
L1 Cache Bandwidth:
plain: The number of bytes looked up in the vL1D cache as a result of VMEM
instructions per unit time. The number of bytes is calculated as the number
of cache lines requested multiplied by the cache line size. This value does
not consider partial requests, so e.g., if only a single value is requested
in a cache line, the data movement will still be counted as a full cache line.
The peak empirically measured bandwidth achievable on the specific accelerator
is displayed alongside for comparison.
rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM
<desc-vmem>` instructions per unit time. The number of bytes is calculated
as the number of cache lines requested multiplied by the cache line size.
This value does not consider partial requests, so e.g., if only a single
value is requested in a cache line, the data movement will still be counted
as a full cache line. The peak empirically measured bandwidth achievable on
the specific accelerator is displayed alongside for comparison.
unit: GB/s
LDS Bandwidth:
plain: Indicates the maximum number of bytes that could have been loaded from,
stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth
example for more detail). The peak empirically measured LDS bandwidth
achievable on the specific accelerator is displayed alongside for comparison.
rst: Indicates the maximum number of bytes that could have been loaded from,
stored to, or atomically updated in the LDS per unit time (see :ref:`LDS
Bandwidth <lds-bandwidth>` example for more detail). The peak empirically
measured LDS bandwidth achievable on the specific accelerator is displayed
alongside for comparison.
unit: GB/s
AI L1:
plain: 'The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L1 cache and the processing units. This value is used as the x-coordinate
for the L1 roofline.'
rst: 'The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L1 cache and the processing units. This value is used as the x-coordinate
for the L1 roofline.'
unit: FLOPs/Byte
AI L2:
plain: 'The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L2 cache and the L1 cache. This value is used as the x-coordinate for
the L2 roofline.'
rst: 'The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio
of total floating-point operations (FLOPs) to total bytes transferred between
the L2 cache and the L1 cache. This value is used as the x-coordinate for
the L2 roofline.'
unit: FLOPs/Byte
AI HBM:
plain: 'The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
It is the ratio of total floating-point operations (FLOPs) to total bytes
transferred between HBM and the L2 cache. This value is used as the x-coordinate
for the HBM roofline.'
rst: 'The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM).
It is the ratio of total floating-point operations (FLOPs) to total bytes
transferred between HBM and the L2 cache. This value is used as the x-coordinate
for the HBM roofline.'
unit: FLOPs/Byte
Performance (GFLOPs):
plain: 'The overall achieved performance, measured in GigaFLOPs
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
operations divided by the total execution time. This value is used as the y-coordinate
for the kernel''s point on the Roofline plot.'
rst: 'The overall achieved performance, measured in GigaFLOPs
per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point
operations divided by the total execution time. This value is used as the y-coordinate
for the kernel''s point on the Roofline plot.'
unit: GFLOP/s
- id: 500
title: Command Processor (CPC/CPF)
data source: