Datatype selection option for roofline (#624)

Added a command-line option to specify which datatype(s) to plot in the roofline PDF(s).
The roofline call still collects all applicable datatypes, but only the selected datatypes are plotted in the PDF outputs. All selected datatypes are plotted on one graph, except that FP and Int datatypes are split into two separate graphs when both are selected. A datatype that is not valid on a particular GPU arch is skipped with an error message.
Default is FP32.
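
As a rough illustration of the grouping and skipping behavior described above, the following minimal sketch mirrors the logic in this commit without any plotting. The function name `plan_roofline_pdfs` and the trimmed `SUPPORTED` table are hypothetical stand-ins, not part of rocprof-compute.

```python
# Hypothetical, trimmed stand-in for SUPPORTED_DATATYPES in roofline_calc.py.
SUPPORTED = {
    "gfx90a": ["FP16", "BF16", "FP32", "FP64", "I8"],
    "gfx942": ["FP8", "FP16", "FP32", "FP64"],
}


def plan_roofline_pdfs(requested, gpu_arch, dev_id=0):
    """Return (pdf_names, skipped) for the requested datatypes.

    Integer datatypes (names starting with "I") go to an "Ops" plot,
    everything else goes to a "Flops" plot. Unsupported datatypes are
    skipped instead of aborting the run.
    """
    ops_suffix = flops_suffix = ""
    skipped = []
    for dt in requested:
        if dt not in SUPPORTED[gpu_arch]:
            skipped.append(dt)  # the real code reports an error and continues
            continue
        if dt.startswith("I"):
            ops_suffix += "_" + dt
        else:
            flops_suffix += "_" + dt
    # One PDF per non-empty group, named after the datatypes it overlays.
    names = [
        "empirRoof_gpu-{}{}.pdf".format(dev_id, suffix)
        for suffix in (ops_suffix, flops_suffix)
        if suffix
    ]
    return names, skipped


names, skipped = plan_roofline_pdfs(["FP32", "I8", "FP8", "FP64"], "gfx90a")
print(names)    # ['empirRoof_gpu-0_I8.pdf', 'empirRoof_gpu-0_FP32_FP64.pdf']
print(skipped)  # ['FP8']
```

With the default selection (`["FP32"]`) the sketch yields the single file `empirRoof_gpu-0_FP32.pdf`, which matches the updated ctest expectation.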

Reworked roofline calls and plotting so that any new datatype added to rocm-amdgpu-bench can be reflected in rocprof-compute with simple modifications to roofline_calc.py.
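
To make the generalization concrete, here is a small sketch of how a datatype name maps to benchmark result keys under the rule used in `calc_ceilings`. The helper name `benchmark_keys` is hypothetical; the two lists and the string rule are copied from this commit.

```python
# Lists as defined in roofline_calc.py by this commit.
PEAK_OPS_DATATYPES = ["FP8", "FP32", "FP64"]
MFMA_DATATYPES = ["FP8", "FP16", "BF16", "FP32", "FP64", "I8"]


def benchmark_keys(dtype):
    """Return the benchmark_data keys looked up for a datatype (hypothetical helper)."""
    # Integer datatypes are reported as Ops, floating-point ones as Flops.
    ops_flops = "Ops" if dtype[:1] == "I" else "Flops"
    keys = {}
    if dtype in PEAK_OPS_DATATYPES:
        keys["valu"] = dtype + ops_flops
    if dtype in MFMA_DATATYPES:
        # Same rule as the commit: keep integer names, else "F" + text after the first two characters.
        target_precision = dtype if dtype[:1] == "I" else "F" + dtype[2:]
        keys["mfma"] = "MFMA{}{}".format(target_precision, ops_flops)
    return keys


print(benchmark_keys("FP32"))  # {'valu': 'FP32Flops', 'mfma': 'MFMAF32Flops'}
print(benchmark_keys("I8"))    # {'mfma': 'MFMAI8Ops'}
print(benchmark_keys("BF16"))  # {'mfma': 'MFMAF16Flops'}
```

Note that under this string rule `BF16` resolves to the same MFMA key as `FP16`; whether that is intended depends on the key names emitted by the benchmark binary, which this sketch does not verify.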

Adjusted ctest to reflect the expected default PDF outputs from roofline.

---------

Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
This commit is contained in:
cfallows-amd
2025-03-25 15:02:09 -04:00
Committed by GitHub
Parent 58cf702d40
Commit a492e92034
8 files changed, 179 additions and 139 deletions
+4
@@ -13,6 +13,10 @@ Full documentation for ROCm Compute Profiler is available at [https://rocm.docs.
* -b option in profile mode also accepts a hardware IP block for filtering; however, this support will be deprecated soon
* --list-metrics option added in profile mode to list possible metric id(s), similar to analyze mode
* Datatype selection option for roofline profiling
* --roofline-data-type / -R option added to specify which datatypes the user wants to capture in the roofline PDF plot outputs
* Default is FP32, but user can specify as many types as desired to overlay on the same plot output
### Changed
* Change normal_unit default to per_kernel
Binary file not shown (before: 89 KiB; after: 64 KiB).


+6 -6
@@ -474,6 +474,9 @@ Roofline options
Allows you to specify a device ID to collect performance data from when
running a roofline benchmark on your system.
``--roofline-data-type <datatype>``
Allows you to specify datatypes that you want plotted in the roofline PDF output(s). Selecting more than one datatype will overlay the results onto the same plot. Default: FP32
To distinguish different kernels in your ``.pdf`` roofline plot use
``--kernel-names``. This will give each kernel a unique marker identifiable from
the plot's key.
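
For reference, the parsing behavior of the ``--roofline-data-type`` option can be reproduced with a standalone argparse sketch. The parser below is a hypothetical demo (the program name ``roofline-demo`` is made up) that copies only the ``choices``, ``nargs`` and ``default`` settings from this commit; it is not the real rocprof-compute parser.

```python
import argparse

# Hypothetical demo parser; only the option settings mirror the commit.
parser = argparse.ArgumentParser(prog="roofline-demo")
parser.add_argument(
    "-R",
    "--roofline-data-type",
    choices=["FP8", "FP16", "BF16", "FP32", "FP64", "I8"],
    nargs="+",          # one or more datatypes after the flag
    default=["FP32"],   # used when the flag is omitted
)

default_args = parser.parse_args([])
multi_args = parser.parse_args(["-R", "FP32", "FP64", "I8"])

print(default_args.roofline_data_type)  # ['FP32']
print(multi_args.roofline_data_type)    # ['FP32', 'FP64', 'I8']
```

Because ``choices`` is checked for every value, a name outside the list (for example ``FP4``) is rejected by argparse before any profiling starts.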
@@ -507,8 +510,7 @@ successfully.
$ ls workloads/vcopy/MI200/
total 48
-rw-r--r-- 1 auser agroup 13331 Mar 1 16:05 empirRoof_gpu-0_fp32_fp64.pdf
-rw-r--r-- 1 auser agroup 13136 Mar 1 16:05 empirRoof_gpu-0_int8_fp16.pdf
-rw-r--r-- 1 auser agroup 13331 Mar 1 16:05 empirRoof_gpu-0_FP32.pdf
drwxr-xr-x 1 auser agroup 0 Mar 1 16:03 perfmon
-rw-r--r-- 1 auser agroup 1101 Mar 1 16:03 pmc_perf.csv
-rw-r--r-- 1 auser agroup 1715 Mar 1 16:05 roofline.csv
@@ -517,11 +519,9 @@ successfully.
.. note::
ROCm Compute Profiler generates three roofline outputs to organize results and reduce
clutter. One chart plots FP32/FP64 performance, one plots I8/FP16
performance, and the other plots FP8 performance.
ROCm Compute Profiler currently captures roofline profiling data for all datatypes, but you can reduce clutter in the PDF outputs by selecting specific datatype(s). Selecting multiple datatypes overlays the results in the same PDF. To get a separate PDF for each datatype from the same workload run, rerun the profiling command with a single datatype, as long as ``roofline.csv`` still exists in the workload folder.
The following image is a sample ``empirRoof_gpu-0_int8_fp16.pdf`` roofline
The following image is a sample ``empirRoof_gpu-0_FP32.pdf`` roofline
plot.
.. image:: ../../data/profile/sample-roof-plot.jpg
+1 -1
@@ -231,7 +231,7 @@ The following table lists ROCm Compute Profiler's basic operations, their
* - :ref:`Standalone roofline analysis <standalone-roofline>`
- ``profile``
- ``--name``, ``--roof-only``, ``-- <profile_cmd>``
- ``--name``, ``--roof-only``, ``--roofline-data-type <data_type>``, ``-- <profile_cmd>``
* - :ref:`Import a workload to database <grafana-gui-import>`
- ``database``
+13
@@ -367,6 +367,19 @@ Examples:
action="store_true",
help="\t\t\tInclude kernel names in roofline plot.",
)
roofline_group.add_argument(
"-R",
"--roofline-data-type",
required=False,
choices=["FP8", "FP16", "BF16", "FP32", "FP64", "I8"],
metavar="",
nargs="+",
type=str,
default=["FP32"],
help="\t\t\tChoose datatypes to generate plotted roofline PDFs for: (DEFAULT: FP32)\n\t\t\t FP8\n\t\t\t FP16\n\t\t\t BF16\n\t\t\t FP32\n\t\t\t FP64\n\t\t\t I8",
)
# roofline_group.add_argument('-w', '--workgroups', required=False, default=-1, type=int, help="\t\t\tNumber of kernel workgroups (DEFAULT: 1024)")
# roofline_group.add_argument('--wsize', required=False, default=-1, type=int, help="\t\t\tWorkgroup size (DEFAULT: 256)")
# roofline_group.add_argument('--dataset', required=False, default = -1, type=int, help="\t\t\tDataset size (DEFAULT: 536M)")
+117 -103
@@ -33,7 +33,13 @@ import pandas as pd
import plotly.graph_objects as go
from dash import dcc, html
from utils.roofline_calc import calc_ai, constuct_roof
from utils.roofline_calc import (
MFMA_DATATYPES,
PEAK_OPS_DATATYPES,
SUPPORTED_DATATYPES,
calc_ai,
constuct_roof,
)
from utils.utils import (
console_debug,
console_error,
@@ -60,6 +66,7 @@ class Roofline:
"mem_level": "ALL",
"include_kernel_names": False,
"is_standalone": False,
"roofline_data_type": ["FP32"],
}
)
self.__ai_data = None
@@ -76,7 +83,10 @@ class Roofline:
self.__run_parameters["mem_level"] = self.__args.mem_level
if hasattr(self.__args, "sort") and self.__args.sort != "ALL":
self.__run_parameters["sort_type"] = self.__args.sort
if hasattr(
self.__args, "roofline_data_type"
) and self.__args.roofline_data_type != ["FP32"]:
self.__run_parameters["roofline_data_type"] = self.__args.roofline_data_type
self.validate_parameters()
def validate_parameters(self):
@@ -121,19 +131,42 @@ class Roofline:
msg += "\n\t%s -> %s" % (i, self.__ai_data[i])
console_debug(msg)
# Generate a roofline figure for each data type
fp32_fig = self.generate_plot(dtype="FP32")
ml_combo_fig_fp32_fp64 = self.generate_plot(
dtype="FP64",
fig=fp32_fig,
)
fp16_fig = self.generate_plot(dtype="FP16")
ml_combo_fig_int8_fp16 = self.generate_plot(
dtype="I8",
fig=fp16_fig,
)
if self.__mspec.gpu_series != "MI200":
fig_fp8 = self.generate_plot(dtype="FP8")
# Generate a roofline figure for the datatypes
ops_figure = flops_figure = None
ops_dt_list = flops_dt_list = ""
for dt in self.__run_parameters["roofline_data_type"]:
# Do not generate a roofline figure if the datatype is not supported on this gpu_arch
if str(dt) not in SUPPORTED_DATATYPES[self.__mspec.gpu_arch]:
console_error(
"{} is not a supported datatype for roofline profiling on {}".format(
str(dt), self.__mspec.gpu_model
),
exit=False,
)
continue
ops_flops = "Ops" if (str(dt[:1]) == "I") else "Flops"
if ops_flops == "Ops":
if ops_figure:
ops_combo_figure = self.generate_plot(
dtype=str(dt),
fig=ops_figure,
)
ops_figure = ops_combo_figure
else:
ops_figure = self.generate_plot(dtype=str(dt))
ops_dt_list += "_" + str(dt)
if ops_flops == "Flops":
if flops_figure:
flops_combo_figure = self.generate_plot(
dtype=str(dt),
fig=flops_figure,
)
flops_figure = flops_combo_figure
else:
flops_figure = self.generate_plot(dtype=str(dt))
flops_dt_list += "_" + str(dt)
# Create a legend and distinct kernel markers. This can be saved, optionally
self.__figure = go.Figure(
@@ -160,80 +193,55 @@ class Roofline:
if self.__run_parameters["is_standalone"]:
dev_id = str(self.__run_parameters["device_id"])
ml_combo_fig_fp32_fp64.write_image(
self.__run_parameters["workload_dir"]
+ "/empirRoof_gpu-{}_fp32_fp64.pdf".format(dev_id)
)
ml_combo_fig_int8_fp16.write_image(
self.__run_parameters["workload_dir"]
+ "/empirRoof_gpu-{}_int8_fp16.pdf".format(dev_id)
)
if self.__mspec.gpu_series != "MI200":
fig_fp8.write_image(
self.__run_parameters["workload_dir"]
+ "/empirRoof_gpu-{}_fp8.pdf".format(dev_id)
)
# only save a legend if kernel_names option is toggled
if self.__run_parameters["include_kernel_names"]:
self.__figure.write_image(
self.__run_parameters["workload_dir"] + "/kernelName_legend.pdf"
)
time.sleep(1)
# Re-save to remove loading MathJax pop up
ml_combo_fig_fp32_fp64.write_image(
self.__run_parameters["workload_dir"]
+ "/empirRoof_gpu-{}_fp32_fp64.pdf".format(dev_id)
)
ml_combo_fig_int8_fp16.write_image(
self.__run_parameters["workload_dir"]
+ "/empirRoof_gpu-{}_int8_fp16.pdf".format(dev_id)
)
if self.__mspec.gpu_series != "MI200":
fig_fp8.write_image(
self.__run_parameters["workload_dir"]
+ "/empirRoof_gpu-{}_fp8.pdf".format(dev_id)
)
if self.__run_parameters["include_kernel_names"]:
self.__figure.write_image(
self.__run_parameters["workload_dir"] + "/kernelName_legend.pdf"
)
# Write the images twice; the second pass removes the loading MathJax pop-up
for i in range(2):
if ops_figure:
ops_figure.write_image(
self.__run_parameters["workload_dir"]
+ "/empirRoof_gpu-{}{}.pdf".format(dev_id, ops_dt_list)
)
if flops_figure:
flops_figure.write_image(
self.__run_parameters["workload_dir"]
+ "/empirRoof_gpu-{}{}.pdf".format(dev_id, flops_dt_list)
)
# only save a legend if kernel_names option is toggled
if self.__run_parameters["include_kernel_names"]:
self.__figure.write_image(
self.__run_parameters["workload_dir"] + "/kernelName_legend.pdf"
)
time.sleep(1)
console_log("roofline", "Empirical Roofline PDFs saved!")
else:
if self.__mspec.gpu_series != "MI200":
fp8_graph = html.Div(
if ops_figure:
ops_graph = html.Div(
className="float-child",
children=[
html.H3(children="Empirical Roofline Analysis (FP8)"),
dcc.Graph(figure=fig_fp8),
html.H3(children="Empirical Roofline Analysis (Ops)"),
dcc.Graph(figure=ops_figure),
],
)
else:
fp8_graph = None
ops_graph = None
if flops_figure:
flops_graph = html.Div(
className="float-child",
children=[
html.H3(children="Empirical Roofline Analysis (Flops)"),
dcc.Graph(figure=flops_figure),
],
)
else:
flops_graph = None
return html.Section(
id="roofline",
children=[
html.Div(
className="float-container",
children=[
html.Div(
className="float-child",
children=[
html.H3(
children="Empirical Roofline Analysis (FP32/FP64)"
),
dcc.Graph(figure=ml_combo_fig_fp32_fp64),
],
),
html.Div(
className="float-child",
children=[
html.H3(
children="Empirical Roofline Analysis (FP16/INT8)"
),
dcc.Graph(figure=ml_combo_fig_int8_fp16),
],
),
fp8_graph,
ops_graph,
flops_graph,
],
)
],
@@ -284,9 +292,10 @@ class Roofline:
)
)
ops_flops = "OP" if (dtype[:1] == "I") else "FLOP"
# Plot peak VALU ceiling
# VALU info I8/FP16 not collected via microbench
if dtype != "FP16" and dtype != "I8":
if dtype in PEAK_OPS_DATATYPES:
fig.add_trace(
go.Scatter(
x=self.__ceiling_data["valu"][0],
@@ -298,20 +307,18 @@ class Roofline:
(
None
if self.__run_parameters["is_standalone"]
else "{} GFLOP/s".format(
to_int(self.__ceiling_data["valu"][2])
else "{} G{}/s".format(
to_int(self.__ceiling_data["valu"][2]), ops_flops
)
),
"{} GFLOP/s".format(to_int(self.__ceiling_data["valu"][2])),
"{} G{}/s".format(
to_int(self.__ceiling_data["valu"][2]), ops_flops
),
],
textposition="top left",
)
)
if dtype == "FP16":
pos = "bottom left"
else:
pos = "top left"
# Plot peak MFMA ceiling
fig.add_trace(
go.Scatter(
@@ -324,26 +331,26 @@ class Roofline:
(
None
if self.__run_parameters["is_standalone"]
else "{} GFLOP/s".format(to_int(self.__ceiling_data["mfma"][2]))
else "{} G{}/s".format(
to_int(self.__ceiling_data["mfma"][2]), ops_flops
)
),
"{} GFLOP/s".format(to_int(self.__ceiling_data["mfma"][2])),
"{} G{}/s".format(to_int(self.__ceiling_data["mfma"][2]), ops_flops),
],
textposition=pos,
textposition="top left",
)
)
#######################
# Plot Application AI
#######################
if dtype != "I8" and dtype != "FP64":
# Plot the arithmetic intensity points for each cache level
# Omitting I8 AIs to clean up graph. FP16 tends to be higher.
# Plot the arithmetic intensity points for each cache level
if ops_flops == "FLOP":
fig.add_trace(
go.Scatter(
x=self.__ai_data["ai_l1"][0],
y=self.__ai_data["ai_l1"][1],
name="ai_l1",
name=dtype + "_ai_l1",
mode="markers",
marker={"color": "#00CC96"},
marker_symbol=(
SYMBOLS if self.__run_parameters["include_kernel_names"] else None
),
@@ -353,9 +360,8 @@ class Roofline:
go.Scatter(
x=self.__ai_data["ai_l2"][0],
y=self.__ai_data["ai_l2"][1],
name="ai_l2",
name=dtype + "_ai_l2",
mode="markers",
marker={"color": "#EF553B"},
marker_symbol=(
SYMBOLS if self.__run_parameters["include_kernel_names"] else None
),
@@ -365,22 +371,30 @@ class Roofline:
go.Scatter(
x=self.__ai_data["ai_hbm"][0],
y=self.__ai_data["ai_hbm"][1],
name="ai_hbm",
name=dtype + "_ai_hbm",
mode="markers",
marker={"color": "#636EFA"},
marker_symbol=(
SYMBOLS if self.__run_parameters["include_kernel_names"] else None
),
)
)
# Set layout
fig.update_layout(
xaxis_title="Arithmetic Intensity (FLOPs/Byte)",
yaxis_title="Performance (GFLOP/sec)",
hovermode="x unified",
margin=dict(l=50, r=50, b=50, t=50, pad=4),
)
# Set layout
fig.update_layout(
xaxis_title="Arithmetic Intensity (FLOPs/Byte)",
yaxis_title="Performance (GFLOP/sec)",
hovermode="x unified",
margin=dict(l=50, r=50, b=50, t=50, pad=4),
)
else:
# Set layout
fig.update_layout(
xaxis_title="Bandwidth (GB/sec)",
yaxis_title="Performance (GOP/sec)",
hovermode="x unified",
margin=dict(l=50, r=50, b=50, t=50, pad=4),
)
fig.update_xaxes(type="log", autorange=True)
fig.update_yaxes(type="log", autorange=True)
+37 -26
@@ -41,7 +41,17 @@ FONT_SIZE = 16
FONT_COLOR = "black"
FONT_WEIGHT = "bold"
SUPPORTED_SOC = ["mi200", "mi300"]
# SUPPORTED_DATATYPES table is based on datatype support in rocm-amdgpu-bench repository
# Indicates which datatypes per gpu arch can be generated by the roofline binary
SUPPORTED_DATATYPES = {
"gfx90a": ["FP16", "BF16", "FP32", "FP64", "I8"], # Unsupported: FP8
"gfx940": ["FP8", "FP16", "FP32", "FP64"], # Unsupported: BF16, I8
"gfx941": ["FP8", "FP16", "FP32", "FP64"], # Unsupported: BF16, I8
"gfx942": ["FP8", "FP16", "FP32", "FP64"], # Unsupported: BF16, I8
}
PEAK_OPS_DATATYPES = ["FP8", "FP32", "FP64"]
MFMA_DATATYPES = ["FP8", "FP16", "BF16", "FP32", "FP64", "I8"]
TOP_N = 10
@@ -106,31 +116,25 @@ def calc_ceilings(roofline_parameters, dtype, benchmark_data):
x1 = y1 = x2 = y2 = -1
x1_mfma = y1_mfma = x2_mfma = y2_mfma = -1
target_precision = dtype[2:]
if dtype != "FP16" and dtype != "I8":
peakOps = float(benchmark_data[dtype + "Flops"][roofline_parameters["device_id"]])
ops_flops = "Ops" if (dtype[:1] == "I") else "Flops"
if dtype in PEAK_OPS_DATATYPES:
peakOps = float(
benchmark_data[dtype + "{}".format(ops_flops)][
roofline_parameters["device_id"]
]
)
for i in range(0, len(cacheHierarchy)):
# Plot BW line
console_debug("roofline", "Current cache level is %s" % cacheHierarchy[i])
curr_bw = cacheHierarchy[i] + "Bw"
peakBw = float(benchmark_data[curr_bw][roofline_parameters["device_id"]])
if dtype == "I8":
peakMFMA = float(
benchmark_data["MFMAI8Ops"][roofline_parameters["device_id"]]
)
else:
peakMFMA = float(
benchmark_data["MFMAF{}Flops".format(target_precision)][
roofline_parameters["device_id"]
]
)
x1 = float(XMIN)
y1 = float(XMIN) * peakBw
# Note: No reg peakOps for FP16 or INT8
if dtype != "FP16" and dtype != "I8":
if dtype in PEAK_OPS_DATATYPES:
x2 = peakOps / peakBw
y2 = peakOps
@@ -138,8 +142,16 @@ def calc_ceilings(roofline_parameters, dtype, benchmark_data):
x1_mfma = peakOps / peakBw
y1_mfma = peakOps
x2_mfma = peakMFMA / peakBw
y2_mfma = peakMFMA
if dtype in MFMA_DATATYPES:
target_precision = (dtype) if (dtype[:1] == "I") else ("F" + dtype[2:])
peakMFMA = float(
benchmark_data["MFMA{}{}".format(target_precision, ops_flops)][
roofline_parameters["device_id"]
]
)
x2_mfma = peakMFMA / peakBw
y2_mfma = peakMFMA
# These are the points to use:
console_debug("roofline", "coordinate points:")
@@ -153,8 +165,7 @@ def calc_ceilings(roofline_parameters, dtype, benchmark_data):
# -------------------------------------------------------------------------------------
# Plot computing roof
# -------------------------------------------------------------------------------------
# Note: No FMA roof for FP16 or INT8
if dtype != "FP16" and dtype != "I8":
if dtype in PEAK_OPS_DATATYPES:
# Plot FMA roof
x0 = XMAX
if x2 < x0:
@@ -166,9 +177,7 @@ def calc_ceilings(roofline_parameters, dtype, benchmark_data):
graphPoints["valu"].append(peakOps)
# Plot MFMA roof
if (
x1_mfma != -1 or dtype == "FP16" or dtype == "I8"
): # assert that mfma has been assigned
if x1_mfma != -1 or (dtype in MFMA_DATATYPES): # assert that mfma has been assigned
x0_mfma = XMAX
if x2_mfma < x0_mfma:
x0_mfma = x2_mfma
@@ -206,6 +215,8 @@ def calc_ai(mspec, sort_type, ret_df):
at_end = False
next_kernelName = ""
supported_dt = SUPPORTED_DATATYPES[mspec.gpu_arch]
for idx in df.index:
# CASE: Top kernels
# Calculate + append AI data if
@@ -251,7 +262,7 @@ def calc_ai(mspec, sort_type, ret_df):
+ (df["SQ_INSTS_VALU_MFMA_MOPS_F32"][idx] * 512)
+ (df["SQ_INSTS_VALU_MFMA_MOPS_F64"][idx] * 512)
)
if mspec.gpu_series != "MI200":
if "FP8" in supported_dt:
total_flops += df["SQ_INSTS_VALU_MFMA_MOPS_F8"][idx] * 512
except KeyError:
console_debug(
@@ -291,7 +302,7 @@ def calc_ai(mspec, sort_type, ret_df):
pass
try:
if mspec.gpu_series != "MI200":
if "FP8" in supported_dt:
mfma_flops_f8 += df["SQ_INSTS_VALU_MFMA_MOPS_F8"][idx] * 512
mfma_flops_f16 += df["SQ_INSTS_VALU_MFMA_MOPS_F16"][idx] * 512
mfma_flops_bf16 += df["SQ_INSTS_VALU_MFMA_MOPS_BF16"][idx] * 512
+1 -3
@@ -107,9 +107,7 @@ ALL_CSVS_MI300 = sorted(
ROOF_ONLY_FILES = sorted(
[
"empirRoof_gpu-0_fp32_fp64.pdf",
"empirRoof_gpu-0_int8_fp16.pdf",
"empirRoof_gpu-0_fp8.pdf",
"empirRoof_gpu-0_FP32.pdf",
"pmc_perf.csv",
"pmc_perf_0.csv",
"pmc_perf_1.csv",