diff --git a/projects/rocprofiler-compute/docs/how-to/analyze/cli.rst b/projects/rocprofiler-compute/docs/how-to/analyze/cli.rst
index 2c494840e3..767e5c0365 100644
--- a/projects/rocprofiler-compute/docs/how-to/analyze/cli.rst
+++ b/projects/rocprofiler-compute/docs/how-to/analyze/cli.rst
@@ -19,6 +19,9 @@ This section provides an overview of ROCm Compute Profiler's CLI analysis featur
 * :ref:`Filtering `: Hone in on a particular kernel, GPU ID, or dispatch ID via post-process filtering.

+* :ref:`Per-kernel roofline analysis `: Detailed arithmetic
+  intensity and performance analysis for individual kernels.
+
 Run ``rocprof-compute analyze -h`` for more details.

 .. _cli-walkthrough:
@@ -32,7 +35,7 @@ There are three high-level GPU analysis views:

 * System Speed-of-Light: Key GPU performance metrics to show overall GPU performance and utilization.
 * Memory chart: Shows memory transactions and throughput on each cache hierarchical level.
-* Empirical hierarchical roofline: Roofline model that compares achieved throughput with attainable peak hardware limits, more specifically peak compute throughput and memory bandwidth (on L1/LDS/L2/HBM).
+* Empirical hierarchical roofline: Roofline model that compares achieved throughput with attainable peak hardware limits, more specifically peak compute throughput and memory bandwidth (on L1/LDS/L2/HBM). When combined with kernel filtering, provides detailed per-kernel arithmetic intensity analysis and performance breakdowns.

 **System Speed-of-Light:**

@@ -67,7 +70,7 @@ There are three high-level GPU analysis views:
 .. note::
    * Visualized memory chart and Roofline chart are only supported in single run analysis. In multiple runs comparison mode, both are switched back to basic table view.
    * Visualized memory chart requires the width of the terminal output to be greater than or equal to 234 to display the whole chart properly.
-   * Visualized Roofline chart is adapted to the initial terminal size only. If it is not clear, you may need to adjust the terminal size and regenerate it to check the display effect.
+   * Visualized Roofline chart is adapted to the initial terminal size only. If it is not clear, you may need to adjust the terminal size and regenerate it to check the display effect. Roofline analysis provides detailed, structured table output with measured empirical peak values for comparison.

 .. _cli-list-metrics:

@@ -309,6 +312,67 @@ Filter kernels

    You should see your filtered kernels indicated by an asterisk in the **Top Stats** table.

+.. _per-kernel-roofline:
+
+Per-kernel roofline analysis
+   When analyzing specific kernels, the roofline analysis provides detailed metrics for each filtered kernel:
+
+   .. code-block:: shell-session
+      $ rocprof-compute analyze -p workloads/vcopy/MI200/ -k 0 -b 4
+   This generates enhanced roofline output showing per-kernel performance rates and arithmetic intensity calculations:
+
+   .. code-block:: text
+      ================================================================================
+      4. Roofline
+      ================================================================================
+      (4.1) Per-Kernel Roofline Metrics and (4.2) AI Plot Points
+      --------------------------------------------------------------------------------
+      Kernel 0: vecCopy(double*, double*, double*, int, int) (100.0%)
+      |
+      ├─ 4.1 Roofline Rate Metrics:
+      | ╒═════════════╤════════════════════╤═══════════════════╤═════════╤════════════════════╕
+      | │ Metric_ID   │ Metric             │ Value             │ Unit    │ Peak (Empirical)   │
+      | ╞═════════════╪════════════════════╪═══════════════════╪═════════╪════════════════════╡
+      | │ 4.1.0       │ VALU FLOPs         │                   │ Gflop/s │ 61286.40           │
+      | ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      | │ 4.1.1       │ MFMA FLOPs (F64)   │                   │ Gflop/s │ 108544.33          │
+      | ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      | │ 4.1.2       │ MFMA FLOPs (F32)   │                   │ Gflop/s │ 104531.42          │
+      | ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      | │ 4.1.3       │ MFMA FLOPs (F16)   │                   │ Gflop/s │ 709169.38          │
+      | ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      | │ 4.1.4       │ MFMA FLOPs (BF16)  │ 0.0               │ Gflop/s │ 388161.09          │
+      | ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      | │ 4.1.5       │ MFMA FLOPs (F8)    │ 0.0               │ Gflop/s │ 1446089.60         │
+      | ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      | │ 4.1.6       │ MFMA IOPs (Int8)   │                   │ Giop/s  │ 737317.94          │
+      | ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      | │ 4.1.7       │ HBM Bandwidth      │                   │ Gb/s    │ 3231.95            │
+      | ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      | │ 4.1.8       │ L2 Cache Bandwidth │                   │ Gb/s    │ 19096.81           │
+      | ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      | │ 4.1.9       │ L1 Cache Bandwidth │ 3880.358726762844 │ Gb/s    │ 25006.24           │
+      | ├─────────────┼────────────────────┼───────────────────┼─────────┼────────────────────┤
+      | │ 4.1.10      │ LDS Bandwidth      │                   │ Gb/s    │ 54920.88           │
+      | ╘═════════════╧════════════════════╧═══════════════════╧═════════╧════════════════════╛
+      ├─ 4.2 Roofline AI Plot Points:
+      | ╒═════════════╤══════════════════════╤═════════╤════════════╕
+      | │ Metric_ID   │ Metric               │ Value   │ Unit       │
+      | ╞═════════════╪══════════════════════╪═════════╪════════════╡
+      | │ 4.2.0       │ AI HBM               │         │ Flops/byte │
+      | ├─────────────┼──────────────────────┼─────────┼────────────┤
+      | │ 4.2.1       │ AI L2                │         │ Flops/byte │
+      | ├─────────────┼──────────────────────┼─────────┼────────────┤
+      | │ 4.2.2       │ AI L1                │         │ Flops/byte │
+      | ├─────────────┼──────────────────────┼─────────┼────────────┤
+      | │ 4.2.3       │ Performance (GFLOPs) │         │ Gflop/s    │
+      | ╘═════════════╧══════════════════════╧═════════╧════════════╛
+   The per-kernel analysis uses YAML-based metric evaluation for accurate calculations.
+
+   Analyze multiple kernels for comparison:
+
+   .. code-block:: shell-session
+      $ rocprof-compute analyze -p workloads/vcopy/MI200/ -k 0 1 2 -b 4

 Baseline comparison
    .. code-block:: shell
diff --git a/projects/rocprofiler-compute/docs/how-to/analyze/standalone-gui.rst b/projects/rocprofiler-compute/docs/how-to/analyze/standalone-gui.rst
index 1c79c816d9..5454080951 100644
--- a/projects/rocprofiler-compute/docs/how-to/analyze/standalone-gui.rst
+++ b/projects/rocprofiler-compute/docs/how-to/analyze/standalone-gui.rst
@@ -83,6 +83,7 @@ application's profiling data:
 #. Top Stats (Top Kernel Statistics)
 #. System Info
 #. System Speed-of-Light
+#. Roofline AI Data Metrics

 To dive deeper, use the dropdown menus at the top of the screen to isolate particular kernels or dispatches.
You should see the web page update with diff --git a/projects/rocprofiler-compute/src/argparser.py b/projects/rocprofiler-compute/src/argparser.py index ec8569e3df..27cbc45c74 100644 --- a/projects/rocprofiler-compute/src/argparser.py +++ b/projects/rocprofiler-compute/src/argparser.py @@ -307,7 +307,7 @@ Examples: "\t\t\t For stochastic sampling, the interval is in cycles.\n" "\t\t\t For host_trap sampling, the interval is in microsecond " "(DEFAULT: 1048576)." - ) + ), ) profile_group.add_argument( diff --git a/projects/rocprofiler-compute/src/config.py b/projects/rocprofiler-compute/src/config.py index eda006cab6..42a599c718 100644 --- a/projects/rocprofiler-compute/src/config.py +++ b/projects/rocprofiler-compute/src/config.py @@ -32,6 +32,6 @@ PROJECT_NAME = "rocprofiler-compute" HIDDEN_COLUMNS = ["coll_level"] HIDDEN_COLUMNS_CLI = ["Description", "coll_level"] HIDDEN_COLUMNS_TUI = ["Description", "coll_level"] -HIDDEN_SECTIONS = [400, 1900, 2000] +HIDDEN_SECTIONS = [1900, 2000] TIME_UNITS = {"s": 10**9, "ms": 10**6, "us": 10**3, "ns": 1} diff --git a/projects/rocprofiler-compute/src/rocprof_compute_analyze/analysis_base.py b/projects/rocprofiler-compute/src/rocprof_compute_analyze/analysis_base.py index d1d09428c7..ea7539f42b 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_analyze/analysis_base.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_analyze/analysis_base.py @@ -30,8 +30,16 @@ from abc import abstractmethod from collections import OrderedDict from pathlib import Path +import pandas as pd + from utils import file_io, parser, schema -from utils.logger import console_debug, console_error, console_log, demarcate +from utils.logger import ( + console_debug, + console_error, + console_log, + console_warning, + demarcate, +) from utils.utils import is_workload_empty, merge_counters_spatial_multiplex @@ -189,6 +197,21 @@ class OmniAnalyze_Base: else file_io.find_1st_sub_dir(d[0]) ) w.sys_info = 
file_io.load_sys_info(sysinfo_path.joinpath("sysinfo.csv")) + + if not getattr(self.get_args(), "no_roof", False): + try: + roofline_path = sysinfo_path.joinpath("roofline.csv") + roofline_df = pd.read_csv(roofline_path) + + # use original column names from roofline.csv directly + w.roofline_peaks = roofline_df + + except FileNotFoundError: + console_warning("roofline.csv not found.") + w.roofline_peaks = pd.DataFrame() + else: + w.roofline_peaks = pd.DataFrame() + arch = w.sys_info.iloc[0]["gpu_arch"] mspec = self.get_socs()[arch]._mspec if self.__args.specs_correction: diff --git a/projects/rocprofiler-compute/src/rocprof_compute_analyze/analysis_cli.py b/projects/rocprofiler-compute/src/rocprof_compute_analyze/analysis_cli.py index 709a8aa745..573cc5b250 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_analyze/analysis_cli.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_analyze/analysis_cli.py @@ -40,8 +40,9 @@ class cli_analysis(OmniAnalyze_Base): if self.get_args().random_port: console_error("--gui flag is required to enable --random-port") for d in self.get_args().path: + workload = self._runs[d[0]] # create 'mega dataframe' - self._runs[d[0]].raw_pmc = file_io.create_df_pmc( + workload.raw_pmc = file_io.create_df_pmc( d[0], self.get_args().nodes, self.get_args().spatial_multiplexing, @@ -51,29 +52,27 @@ class cli_analysis(OmniAnalyze_Base): ) if self.get_args().spatial_multiplexing: - self._runs[d[0]].raw_pmc = self.spatial_multiplex_merge_counters( - self._runs[d[0]].raw_pmc + workload.raw_pmc = self.spatial_multiplex_merge_counters( + workload.raw_pmc ) file_io.create_df_kernel_top_stats( - df_in=self._runs[d[0]].raw_pmc, + df_in=workload.raw_pmc, raw_data_dir=d[0], - filter_gpu_ids=self._runs[d[0]].filter_gpu_ids, - filter_dispatch_ids=self._runs[d[0]].filter_dispatch_ids, - filter_nodes=self._runs[d[0]].filter_nodes, + filter_gpu_ids=workload.filter_gpu_ids, + filter_dispatch_ids=workload.filter_dispatch_ids, + 
filter_nodes=workload.filter_nodes, time_unit=self.get_args().time_unit, max_stat_num=self.get_args().max_stat_num, kernel_verbose=self.get_args().kernel_verbose, ) # demangle and overwrite original 'Kernel_Name' - kernel_name_shortener( - self._runs[d[0]].raw_pmc, self.get_args().kernel_verbose - ) + kernel_name_shortener(workload.raw_pmc, self.get_args().kernel_verbose) # create the loaded table parser.load_table_data( - workload=self._runs[d[0]], + workload=workload, dir=d[0], is_gui=False, args=self.get_args(), @@ -85,42 +84,41 @@ class cli_analysis(OmniAnalyze_Base): """Run CLI analysis.""" super().run_analysis() + workload_path = self.get_args().path[0][0] + workload = self._runs[workload_path] + gpu_arch = workload.sys_info.iloc[0]["gpu_arch"] + arch_config = self._arch_configs[gpu_arch] + if self.get_args().list_stats: tty.show_kernel_stats( self.get_args(), self._runs, - self._arch_configs[ - self._runs[self.get_args().path[0][0]].sys_info.iloc[0]["gpu_arch"] - ], + arch_config, self._output, ) else: roof_plot = None # 1. 
check if not baseline && compatible soc: - if (len(self.get_args().path)) == 1 and self._runs[ - self.get_args().path[0][0] - ].sys_info.iloc[0]["gpu_arch"] in [ - "gfx90a", - "gfx940", - "gfx941", - "gfx942", - "gfx950", - ]: - # add roofline plot to cli output - roof_obj = self.get_socs()[ - self._runs[self.get_args().path[0][0]].sys_info.iloc[0]["gpu_arch"] - ].roofline_obj + if (len(self.get_args().path)) == 1: + if gpu_arch in ["gfx90a", "gfx940", "gfx941", "gfx942", "gfx950"]: + roof_obj = self.get_socs()[gpu_arch].roofline_obj - if roof_obj: - # NOTE: using default data type - roof_plot = roof_obj.cli_generate_plot(roof_obj.get_dtype()[0]) + if roof_obj: + # store path in workload for calc_ai_analyze + workload.path = workload_path + + # NOTE: using default data type + roof_plot = roof_obj.cli_generate_plot( + dtype=roof_obj.get_dtype()[0], + workload=workload, + config=self._profiling_config, + arch_config=arch_config, + ) tty.show_all( self.get_args(), self._runs, - self._arch_configs[ - self._runs[self.get_args().path[0][0]].sys_info.iloc[0]["gpu_arch"] - ], + arch_config, self._output, self._profiling_config, roof_plot=roof_plot, diff --git a/projects/rocprofiler-compute/src/rocprof_compute_analyze/analysis_webui.py b/projects/rocprofiler-compute/src/rocprof_compute_analyze/analysis_webui.py index 283d19c9ad..904b09890a 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_analyze/analysis_webui.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_analyze/analysis_webui.py @@ -48,7 +48,7 @@ class webui_analysis(OmniAnalyze_Base): self.dest_dir = str(Path(args.path[0][0]).absolute().resolve()) self.arch = None - self.__hidden_sections = ["Memory Chart", "Roofline"] + self.__hidden_sections = ["Memory Chart"] self.__hidden_columns = HIDDEN_COLUMNS # define different types of bar charts self.__barchart_elements = { @@ -151,7 +151,7 @@ class webui_analysis(OmniAnalyze_Base): # Only display basic metrics if no filters are applied if not 
(disp_filt or kernel_filter or gcd_filter): temp = {} - keep = [1, 2, 101, 201, 301, 401] + keep = [1, 2, 101, 201, 301, 401, 402] for key in base_data[base_run].dfs: if keep.count(key) != 0: temp[key] = base_data[base_run].dfs[key] @@ -219,7 +219,6 @@ class webui_analysis(OmniAnalyze_Base): .lower() ) html_section = [] - if panel["title"] not in self.__hidden_sections: # Iterate over each table per section for data_source in panel["data source"]: diff --git a/projects/rocprofiler-compute/src/rocprof_compute_base.py b/projects/rocprofiler-compute/src/rocprof_compute_base.py index a2b06c8263..ab9021d5d2 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_base.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_base.py @@ -289,7 +289,7 @@ class RocProfCompute: if sets_info: first_set = next(iter(sets_info.keys())) print(f" rocprof-compute profile --set {first_set} # Profile this set") - print(f" rocprof-compute profile --list-sets # Show this help") + print(" rocprof-compute profile --list-sets # Show this help") print() sys.exit(0) diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/0400_roofline.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/0400_roofline.yaml index 41c8bac547..66c656fb4c 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/0400_roofline.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx908/0400_roofline.yaml @@ -2,8 +2,191 @@ Panel Config: id: 400 title: Roofline - metrics_description: {} + metrics_description: + VALU FLOPs: 'The total floating-point operations executed per second on the VALU. + This is also presented as a percent of the peak theoretical FLOPs achievable + on the specific accelerator. Note: this does not include any floating-point + operations from MFMA instructions.' + MFMA FLOPs (F8): The total number of 8-bit brain floating point MFMA operations + executed per second. 
This does not include any 16-bit brain floating point operations + from VALU instructions. The peak empirically measured F8 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison. It is supported + on AMD Instinct MI300 series and later only. + MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations + executed per second. Note: this does not include any 16-bit brain floating point + operations from VALU instructions. The peak empirically measured BF16 MFMA operations + achievable on the specific accelerator is displayed alongside for comparison.' + MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed + per second. Note: this does not include any 16-bit floating point operations + from VALU instructions. The peak empirically measured F16 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison.' + MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed + per second. Note: this does not include any 32-bit floating point operations + from VALU instructions. The peak empirically measured F32 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison.' + MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed + per second. Note: this does not include any 64-bit floating point operations + from VALU instructions. The peak empirically measured F64 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison.' + MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed + per second. Note: this does not include any 8-bit integer operations from VALU + instructions. The peak empirically measured INT8 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison.' 
+ HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth + Memory (HBM) per second. The peak empirically measured bandwidth achievable + on the specific accelerator is displayed alongside for comparison. + L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time. + The number of bytes is calculated as the number of cache lines requested multiplied + by the cache line size. This value does not consider partial requests, so e.g., + if only a single value is requested in a cache line, the data movement will + still be counted as a full cache line. The peak empirically measured bandwidth + achievable on the specific accelerator is displayed alongside for comparison. + L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result + of VMEM instructions per unit time. The number of bytes is calculated as the + number of cache lines requested multiplied by the cache line size. This value + does not consider partial requests, so e.g., if only a single value is requested + in a cache line, the data movement will still be counted as a full cache line. + The peak empirically measured bandwidth achievable on the specific accelerator + is displayed alongside for comparison. + LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded + from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth + example for more detail). The peak empirically measured LDS bandwidth achievable + on the specific accelerator is displayed alongside for comparison. + AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio + of total floating-point operations (FLOPs) to total bytes transferred between + the L1 cache and the processing units. This value is used as the x-coordinate + for the L1 roofline. + AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. 
It is the ratio + of total floating-point operations (FLOPs) to total bytes transferred between + the L2 cache and the L1 cache. This value is used as the x-coordinate for the + L2 roofline. + AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM). + It is the ratio of total floating-point operations (FLOPs) to total bytes transferred + between HBM and the L2 cache. This value is used as the x-coordinate for the + HBM roofline. + Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs + per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point + operations divided by the total execution time. This value is used as the y-coordinate + for the kernel's point on the Roofline plot. data source: - - None: + - metric_table: id: 401 - title: Roofline + title: Roofline Performance Rates + cli_style: Roofline + tui_style: Roofline + header: + metric: Metric + value: Value + unit: Unit + peak: Peak (Empirical) + metric: + VALU FLOPs: + value: AVG((($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) / ((End_Timestamp - Start_Timestamp) / 1e9)) + / 1e9) + unit: GFLOP/s + peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000) + MFMA FLOPs (F64): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF64Flops_empirical_peak + MFMA FLOPs (F32): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF32Flops_empirical_peak + MFMA FLOPs (F16): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + 
peak: $MFMAF16Flops_empirical_peak + MFMA FLOPs (BF16): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMABF16Flops_empirical_peak + MFMA IOPs (Int8): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GIOP/s + peak: $MFMAI8Ops_empirical_peak + HBM Bandwidth: + value: AVG((( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum + - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + ((TCC_EA0_WRREQ_sum + - TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp + - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $HBMBw_empirical_peak + L2 Cache Bandwidth: + value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64)) / ((End_Timestamp + - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $L2Bw_empirical_peak + L1 Cache Bandwidth: + value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GB/s + peak: $L1Bw_empirical_peak + LDS Bandwidth: + value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4 * $lds_banks_per_cu)) + / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $LDSBw_empirical_peak + - metric_table: + id: 402 + title: Roofline Plot Points + cli_style: Roofline + tui_style: Roofline + header: + metric: Metric + value: Value + unit: Unit + metric: + AI HBM: + value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + 
(SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / SUM( (TCC_BUBBLE_sum * + 128) + (TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum + - TCC_EA0_RDREQ_32B_sum) * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) + * 32) + (TCC_EA0_WRREQ_64B_sum * 64) ) ) + unit: FLOPs/Byte + AI L2: + value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / SUM( (TCP_TCC_WRITE_REQ_sum + + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + + TCP_TCC_READ_REQ_sum) * 64 ) ) + unit: FLOPs/Byte + AI L1: + value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / SUM(TCP_TOTAL_CACHE_ACCESSES_sum + * 64) ) + unit: FLOPs/Byte + Performance GFLOPs: + value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + 
(SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / (SUM(End_Timestamp - Start_Timestamp) + / 1e9) ) / 1e9 + unit: GFLOP/s diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/0400_roofline.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/0400_roofline.yaml index 41c8bac547..38af3367e9 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/0400_roofline.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx90a/0400_roofline.yaml @@ -2,8 +2,189 @@ Panel Config: id: 400 title: Roofline - metrics_description: {} + metrics_description: + VALU FLOPs: 'The total floating-point operations executed per second on the VALU. + This is also presented as a percent of the peak theoretical FLOPs achievable + on the specific accelerator. Note: this does not include any floating-point + operations from MFMA instructions.' + MFMA FLOPs (F8): The total number of 8-bit brain floating point MFMA operations + executed per second. This does not include any 16-bit brain floating point operations + from VALU instructions. The peak empirically measured F8 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison. It is supported + on AMD Instinct MI300 series and later only. + MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations + executed per second. Note: this does not include any 16-bit brain floating point + operations from VALU instructions. The peak empirically measured BF16 MFMA operations + achievable on the specific accelerator is displayed alongside for comparison.' + MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed + per second. Note: this does not include any 16-bit floating point operations + from VALU instructions. 
The peak empirically measured F16 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison.' + MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed + per second. Note: this does not include any 32-bit floating point operations + from VALU instructions. The peak empirically measured F32 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison.' + MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed + per second. Note: this does not include any 64-bit floating point operations + from VALU instructions. The peak empirically measured F64 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison.' + MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed + per second. Note: this does not include any 8-bit integer operations from VALU + instructions. The peak empirically measured INT8 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison.' + HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth + Memory (HBM) per second. The peak empirically measured bandwidth achievable + on the specific accelerator is displayed alongside for comparison. + L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time. + The number of bytes is calculated as the number of cache lines requested multiplied + by the cache line size. This value does not consider partial requests, so e.g., + if only a single value is requested in a cache line, the data movement will + still be counted as a full cache line. The peak empirically measured bandwidth + achievable on the specific accelerator is displayed alongside for comparison. + L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result + of VMEM instructions per unit time. 
The number of bytes is calculated as the + number of cache lines requested multiplied by the cache line size. This value + does not consider partial requests, so e.g., if only a single value is requested + in a cache line, the data movement will still be counted as a full cache line. + The peak empirically measured bandwidth achievable on the specific accelerator + is displayed alongside for comparison. + LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded + from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth + example for more detail). The peak empirically measured LDS bandwidth achievable + on the specific accelerator is displayed alongside for comparison. + AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio + of total floating-point operations (FLOPs) to total bytes transferred between + the L1 cache and the processing units. This value is used as the x-coordinate + for the L1 roofline. + AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio + of total floating-point operations (FLOPs) to total bytes transferred between + the L2 cache and the L1 cache. This value is used as the x-coordinate for the + L2 roofline. + AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM). + It is the ratio of total floating-point operations (FLOPs) to total bytes transferred + between HBM and the L2 cache. This value is used as the x-coordinate for the + HBM roofline. + Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs + per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point + operations divided by the total execution time. This value is used as the y-coordinate + for the kernel's point on the Roofline plot. 
data source: - - None: + - metric_table: id: 401 - title: Roofline + title: Roofline Performance Rates + cli_style: Roofline + tui_style: Roofline + header: + metric: Metric + value: Value + unit: Unit + peak: Peak (Empirical) + metric: + VALU FLOPs: + value: AVG((($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) / ((End_Timestamp - Start_Timestamp) / 1e9)) + / 1e9) + unit: GFLOP/s + peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000) + MFMA FLOPs (F64): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF64Flops_empirical_peak + MFMA FLOPs (F32): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF32Flops_empirical_peak + MFMA FLOPs (F16): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF16Flops_empirical_peak + MFMA FLOPs (BF16): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMABF16Flops_empirical_peak + MFMA IOPs (Int8): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GIOP/s + peak: $MFMAI8Ops_empirical_peak + HBM Bandwidth: + value: AVG((( (TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) + * 64) + (TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) + * 32) ) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $HBMBw_empirical_peak + L2 Cache Bandwidth: + value: AVG(((((TCP_TCC_WRITE_REQ_sum + 
TCP_TCC_ATOMIC_WITH_RET_REQ_sum + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64)) / ((End_Timestamp + - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $L2Bw_empirical_peak + L1 Cache Bandwidth: + value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GB/s + peak: $L1Bw_empirical_peak + LDS Bandwidth: + value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4 * $lds_banks_per_cu)) + / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $LDSBw_empirical_peak + - metric_table: + id: 402 + title: Roofline Plot Points + cli_style: Roofline + tui_style: Roofline + header: + metric: Metric + value: Value + unit: Unit + metric: + AI HBM: + value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / SUM( (TCC_EA_RDREQ_32B_sum + * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) * 64) + (TCC_EA_WRREQ_64B_sum + * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) * 32) ) ) + unit: FLOPs/Byte + AI L2: + value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / SUM( 
(TCP_TCC_WRITE_REQ_sum + + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + + TCP_TCC_READ_REQ_sum) * 64 ) ) + unit: FLOPs/Byte + AI L1: + value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / SUM(TCP_TOTAL_CACHE_ACCESSES_sum + * 64) ) + unit: FLOPs/Byte + Performance GFLOPs: + value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) ) / (SUM(End_Timestamp - Start_Timestamp) + / 1e9) ) / 1e9 + unit: GFLOP/s diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/0400_roofline.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/0400_roofline.yaml index 41c8bac547..839c04fd2e 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/0400_roofline.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx940/0400_roofline.yaml @@ -2,8 +2,197 @@ Panel Config: id: 400 title: Roofline - metrics_description: {} + metrics_description: + VALU FLOPs: 'The total floating-point operations executed per second on the VALU. 
+ This is also presented as a percent of the peak theoretical FLOPs achievable + on the specific accelerator. Note: this does not include any floating-point + operations from MFMA instructions.' + MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations + executed per second. This does not include any 8-bit floating point operations + from VALU instructions. The peak empirically measured F8 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison. It is supported + on AMD Instinct MI300 series and later only. + MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations + executed per second. Note: this does not include any 16-bit brain floating point + operations from VALU instructions. The peak empirically measured BF16 MFMA operations + achievable on the specific accelerator is displayed alongside for comparison.' + MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed + per second. Note: this does not include any 16-bit floating point operations + from VALU instructions. The peak empirically measured F16 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison.' + MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed + per second. Note: this does not include any 32-bit floating point operations + from VALU instructions. The peak empirically measured F32 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison.' + MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed + per second. Note: this does not include any 64-bit floating point operations + from VALU instructions. The peak empirically measured F64 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison.' + MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed + per second. 
Note: this does not include any 8-bit integer operations from VALU + instructions. The peak empirically measured INT8 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison.' + HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth + Memory (HBM) per second. The peak empirically measured bandwidth achievable + on the specific accelerator is displayed alongside for comparison. + L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time. + The number of bytes is calculated as the number of cache lines requested multiplied + by the cache line size. This value does not consider partial requests, so e.g., + if only a single value is requested in a cache line, the data movement will + still be counted as a full cache line. The peak empirically measured bandwidth + achievable on the specific accelerator is displayed alongside for comparison. + L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result + of VMEM instructions per unit time. The number of bytes is calculated as the + number of cache lines requested multiplied by the cache line size. This value + does not consider partial requests, so e.g., if only a single value is requested + in a cache line, the data movement will still be counted as a full cache line. + The peak empirically measured bandwidth achievable on the specific accelerator + is displayed alongside for comparison. + LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded + from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth + example for more detail). The peak empirically measured LDS bandwidth achievable + on the specific accelerator is displayed alongside for comparison. + AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio + of total floating-point operations (FLOPs) to total bytes transferred between + the L1 cache and the processing units. 
This value is used as the x-coordinate + for the L1 roofline. + AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio + of total floating-point operations (FLOPs) to total bytes transferred between + the L2 cache and the L1 cache. This value is used as the x-coordinate for the + L2 roofline. + AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM). + It is the ratio of total floating-point operations (FLOPs) to total bytes transferred + between HBM and the L2 cache. This value is used as the x-coordinate for the + HBM roofline. + Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs + per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point + operations divided by the total execution time. This value is used as the y-coordinate + for the kernel's point on the Roofline plot. data source: - - None: + - metric_table: id: 401 - title: Roofline + title: Roofline Performance Rates + cli_style: Roofline + tui_style: Roofline + header: + metric: Metric + value: Value + unit: Unit + peak: Peak (Empirical) + metric: + VALU FLOPs: + value: AVG(($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) ) / ((End_Timestamp - Start_Timestamp) / 1e9)) + / 1e9) + unit: GFLOP/s + peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000) + MFMA FLOPs (F64): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF64Flops_empirical_peak + MFMA FLOPs (F32): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF32Flops_empirical_peak + MFMA FLOPs (F16): + 
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF16Flops_empirical_peak + MFMA FLOPs (BF16): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMABF16Flops_empirical_peak + MFMA FLOPs (F8): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF8Flops_empirical_peak + MFMA IOPs (Int8): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GIOP/s + peak: $MFMAI8Ops_empirical_peak + HBM Bandwidth: + value: AVG((( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum + - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + ((TCC_EA0_WRREQ_sum + - TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp + - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $HBMBw_empirical_peak + L2 Cache Bandwidth: + value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64)) / ((End_Timestamp + - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $L2Bw_empirical_peak + L1 Cache Bandwidth: + value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GB/s + peak: $L1Bw_empirical_peak + LDS Bandwidth: + value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4 * $lds_banks_per_cu)) + / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $LDSBw_empirical_peak + - metric_table: + id: 402 + title: Roofline Plot Points + cli_style: Roofline + tui_style: Roofline + header: + metric: Metric + value: Value + unit: Unit + metric: + AI HBM: + value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + 
(SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 + * 512) ) / SUM( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32) + + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum + * 64) ) ) + unit: FLOPs/Byte + AI L2: + value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + ) / SUM( (TCP_TCC_WRITE_REQ_sum + + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + + TCP_TCC_READ_REQ_sum) * 64 ) ) + unit: FLOPs/Byte + AI L1: + value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + ) / SUM(TCP_TOTAL_CACHE_ACCESSES_sum + * 64) ) + unit: FLOPs/Byte + Performance GFLOPs: + value: ( SUM( ($wave_size * ( 
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 + * 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9 + unit: GFLOP/s diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/0400_roofline.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/0400_roofline.yaml index 41c8bac547..f9f4d7cc19 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/0400_roofline.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx941/0400_roofline.yaml @@ -2,8 +2,197 @@ Panel Config: id: 400 title: Roofline - metrics_description: {} + metrics_description: + VALU FLOPs: 'The total floating-point operations executed per second on the VALU. + This is also presented as a percent of the peak theoretical FLOPs achievable + on the specific accelerator. Note: this does not include any floating-point + operations from MFMA instructions.' + MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations + executed per second. This does not include any 8-bit floating point operations + from VALU instructions. The peak empirically measured F8 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison. It is supported + on AMD Instinct MI300 series and later only. + MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations + executed per second. 
Note: this does not include any 16-bit brain floating point + operations from VALU instructions. The peak empirically measured BF16 MFMA operations + achievable on the specific accelerator is displayed alongside for comparison.' + MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed + per second. Note: this does not include any 16-bit floating point operations + from VALU instructions. The peak empirically measured F16 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison.' + MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed + per second. Note: this does not include any 32-bit floating point operations + from VALU instructions. The peak empirically measured F32 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison.' + MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed + per second. Note: this does not include any 64-bit floating point operations + from VALU instructions. The peak empirically measured F64 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison.' + MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed + per second. Note: this does not include any 8-bit integer operations from VALU + instructions. The peak empirically measured INT8 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison.' + HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth + Memory (HBM) per second. The peak empirically measured bandwidth achievable + on the specific accelerator is displayed alongside for comparison. + L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time. + The number of bytes is calculated as the number of cache lines requested multiplied + by the cache line size. 
This value does not consider partial requests, so e.g., + if only a single value is requested in a cache line, the data movement will + still be counted as a full cache line. The peak empirically measured bandwidth + achievable on the specific accelerator is displayed alongside for comparison. + L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result + of VMEM instructions per unit time. The number of bytes is calculated as the + number of cache lines requested multiplied by the cache line size. This value + does not consider partial requests, so e.g., if only a single value is requested + in a cache line, the data movement will still be counted as a full cache line. + The peak empirically measured bandwidth achievable on the specific accelerator + is displayed alongside for comparison. + LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded + from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth + example for more detail). The peak empirically measured LDS bandwidth achievable + on the specific accelerator is displayed alongside for comparison. + AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio + of total floating-point operations (FLOPs) to total bytes transferred between + the L1 cache and the processing units. This value is used as the x-coordinate + for the L1 roofline. + AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio + of total floating-point operations (FLOPs) to total bytes transferred between + the L2 cache and the L1 cache. This value is used as the x-coordinate for the + L2 roofline. + AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM). + It is the ratio of total floating-point operations (FLOPs) to total bytes transferred + between HBM and the L2 cache. This value is used as the x-coordinate for the + HBM roofline. 
+ Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs + per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point + operations divided by the total execution time. This value is used as the y-coordinate + for the kernel's point on the Roofline plot. data source: - - None: + - metric_table: id: 401 - title: Roofline + title: Roofline Performance Rates + cli_style: Roofline + tui_style: Roofline + header: + metric: Metric + value: Value + unit: Unit + peak: Peak (Empirical) + metric: + VALU FLOPs: + value: AVG(($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) ) / ((End_Timestamp - Start_Timestamp) / 1e9)) + / 1e9) + unit: GFLOP/s + peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000) + MFMA FLOPs (F64): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF64Flops_empirical_peak + MFMA FLOPs (F32): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF32Flops_empirical_peak + MFMA FLOPs (F16): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF16Flops_empirical_peak + MFMA FLOPs (BF16): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMABF16Flops_empirical_peak + MFMA FLOPs (F8): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF8Flops_empirical_peak + MFMA IOPs (Int8): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 
512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GIOP/s + peak: $MFMAI8Ops_empirical_peak + HBM Bandwidth: + value: AVG((( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum + - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + ((TCC_EA0_WRREQ_sum + - TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp + - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $HBMBw_empirical_peak + L2 Cache Bandwidth: + value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64)) / ((End_Timestamp + - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $L2Bw_empirical_peak + L1 Cache Bandwidth: + value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GB/s + peak: $L1Bw_empirical_peak + LDS Bandwidth: + value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4 * $lds_banks_per_cu)) + / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $LDSBw_empirical_peak + - metric_table: + id: 402 + title: Roofline Plot Points + cli_style: Roofline + tui_style: Roofline + header: + metric: Metric + value: Value + unit: Unit + metric: + AI HBM: + value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 + * 512) ) / SUM( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32) + + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 
32) + (TCC_EA0_WRREQ_64B_sum + * 64) ) ) + unit: FLOPs/Byte + AI L2: + value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 + * 512) ) / SUM( (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64 ) ) + unit: FLOPs/Byte + AI L1: + value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 + * 512) ) / SUM(TCP_TOTAL_CACHE_ACCESSES_sum * 64) ) + unit: FLOPs/Byte + Performance GFLOPs: + value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 + * 512) ) / 
(SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9 + unit: GFLOP/s diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/0400_roofline.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/0400_roofline.yaml index 41c8bac547..9ba1e6f1fa 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/0400_roofline.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/0400_roofline.yaml @@ -2,8 +2,197 @@ Panel Config: id: 400 title: Roofline - metrics_description: {} + metrics_description: + VALU FLOPs: 'The total floating-point operations executed per second on the VALU. + This is also presented as a percent of the peak theoretical FLOPs achievable + on the specific accelerator. Note: this does not include any floating-point + operations from MFMA instructions.' + MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations + executed per second. This does not include any 8-bit floating point operations + from VALU instructions. The peak empirically measured F8 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison. It is supported + on AMD Instinct MI300 series and later only. + MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations + executed per second. Note: this does not include any 16-bit brain floating point + operations from VALU instructions. The peak empirically measured BF16 MFMA operations + achievable on the specific accelerator is displayed alongside for comparison.' + MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed + per second. Note: this does not include any 16-bit floating point operations + from VALU instructions. The peak empirically measured F16 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison.' 
+ MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed + per second. Note: this does not include any 32-bit floating point operations + from VALU instructions. The peak empirically measured F32 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison.' + MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed + per second. Note: this does not include any 64-bit floating point operations + from VALU instructions. The peak empirically measured F64 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison.' + MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed + per second. Note: this does not include any 8-bit integer operations from VALU + instructions. The peak empirically measured INT8 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison.' + HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth + Memory (HBM) per second. The peak empirically measured bandwidth achievable + on the specific accelerator is displayed alongside for comparison. + L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time. + The number of bytes is calculated as the number of cache lines requested multiplied + by the cache line size. This value does not consider partial requests, so e.g., + if only a single value is requested in a cache line, the data movement will + still be counted as a full cache line. The peak empirically measured bandwidth + achievable on the specific accelerator is displayed alongside for comparison. + L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result + of VMEM instructions per unit time. The number of bytes is calculated as the + number of cache lines requested multiplied by the cache line size. 
This value + does not consider partial requests, so e.g., if only a single value is requested + in a cache line, the data movement will still be counted as a full cache line. + The peak empirically measured bandwidth achievable on the specific accelerator + is displayed alongside for comparison. + LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded + from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth + example for more detail). The peak empirically measured LDS bandwidth achievable + on the specific accelerator is displayed alongside for comparison. + AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio + of total floating-point operations (FLOPs) to total bytes transferred between + the L1 cache and the processing units. This value is used as the x-coordinate + for the L1 roofline. + AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio + of total floating-point operations (FLOPs) to total bytes transferred between + the L2 cache and the L1 cache. This value is used as the x-coordinate for the + L2 roofline. + AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM). + It is the ratio of total floating-point operations (FLOPs) to total bytes transferred + between HBM and the L2 cache. This value is used as the x-coordinate for the + HBM roofline. + Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs + per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point + operations divided by the total execution time. This value is used as the y-coordinate + for the kernel's point on the Roofline plot. 
data source: - - None: + - metric_table: id: 401 - title: Roofline + title: Roofline Performance Rates + cli_style: Roofline + tui_style: Roofline + header: + metric: Metric + value: Value + unit: Unit + peak: Peak (Empirical) + metric: + VALU FLOPs: + value: AVG((($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) / ((End_Timestamp - Start_Timestamp) / 1e9)) + / 1e9) + unit: GFLOP/s + peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000) + MFMA FLOPs (F64): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF64Flops_empirical_peak + MFMA FLOPs (F32): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF32Flops_empirical_peak + MFMA FLOPs (F16): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF16Flops_empirical_peak + MFMA FLOPs (BF16): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMABF16Flops_empirical_peak + MFMA FLOPs (F8): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF8Flops_empirical_peak + MFMA IOPs (Int8): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GIOP/s + peak: $MFMAI8Ops_empirical_peak + HBM Bandwidth: + value: AVG((( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum + - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + ((TCC_EA0_WRREQ_sum + - 
TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp + - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $HBMBw_empirical_peak + L2 Cache Bandwidth: + value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64)) / ((End_Timestamp + - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $L2Bw_empirical_peak + L1 Cache Bandwidth: + value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GB/s + peak: $L1Bw_empirical_peak + LDS Bandwidth: + value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4 * $lds_banks_per_cu)) + / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $LDSBw_empirical_peak + - metric_table: + id: 402 + title: Roofline Plot Points + cli_style: Roofline + tui_style: Roofline + header: + metric: Metric + value: Value + unit: Unit + metric: + AI HBM: + value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 + * 512) ) / SUM( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32) + + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum + * 64) ) ) + unit: FLOPs/Byte + AI L2: + value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + 
SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 + * 512) ) / SUM( (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64 ) ) + unit: FLOPs/Byte + AI L1: + value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 + * 512) ) / SUM( TCP_TOTAL_CACHE_ACCESSES_sum * 64 ) ) + unit: FLOPs/Byte + Performance (GFLOPs): + value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 + * 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9 + unit: GFLOP/s diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/0400_roofline.yaml b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/0400_roofline.yaml index 
41c8bac547..500c7ff805 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/0400_roofline.yaml +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx950/0400_roofline.yaml @@ -2,8 +2,205 @@ Panel Config: id: 400 title: Roofline - metrics_description: {} + metrics_description: + VALU FLOPs: 'The total floating-point operations executed per second on the VALU. + This is also presented as a percent of the peak theoretical FLOPs achievable + on the specific accelerator. Note: this does not include any floating-point + operations from MFMA instructions.' + MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations + executed per second. This does not include any 8-bit floating point operations + from VALU instructions. The peak empirically measured F8 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison. It is supported + on AMD Instinct MI300 series and later only. + MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations + executed per second. Note: this does not include any 16-bit brain floating point + operations from VALU instructions. The peak empirically measured BF16 MFMA operations + achievable on the specific accelerator is displayed alongside for comparison.' + MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed + per second. Note: this does not include any 16-bit floating point operations + from VALU instructions. The peak empirically measured F16 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison.' + MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed + per second. Note: this does not include any 32-bit floating point operations + from VALU instructions. The peak empirically measured F32 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison.'
+ MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed + per second. Note: this does not include any 64-bit floating point operations + from VALU instructions. The peak empirically measured F64 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison.' + MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed + per second. Note: this does not include any 8-bit integer operations from VALU + instructions. The peak empirically measured INT8 MFMA operations achievable + on the specific accelerator is displayed alongside for comparison.' + HBM Bandwidth: The total number of bytes read from and written to High-Bandwidth + Memory (HBM) per second. The peak empirically measured bandwidth achievable + on the specific accelerator is displayed alongside for comparison. + L2 Cache Bandwidth: The number of bytes looked up in the L2 cache per unit time. + The number of bytes is calculated as the number of cache lines requested multiplied + by the cache line size. This value does not consider partial requests, so e.g., + if only a single value is requested in a cache line, the data movement will + still be counted as a full cache line. The peak empirically measured bandwidth + achievable on the specific accelerator is displayed alongside for comparison. + L1 Cache Bandwidth: The number of bytes looked up in the vL1D cache as a result + of VMEM instructions per unit time. The number of bytes is calculated as the + number of cache lines requested multiplied by the cache line size. This value + does not consider partial requests, so e.g., if only a single value is requested + in a cache line, the data movement will still be counted as a full cache line. + The peak empirically measured bandwidth achievable on the specific accelerator + is displayed alongside for comparison. 
+ LDS Bandwidth: Indicates the maximum amount of bytes that could have been loaded + from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth + example for more detail). The peak empirically measured LDS bandwidth achievable + on the specific accelerator is displayed alongside for comparison. + AI L1: The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio + of total floating-point operations (FLOPs) to total bytes transferred between + the L1 cache and the processing units. This value is used as the x-coordinate + for the L1 roofline. + AI L2: The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio + of total floating-point operations (FLOPs) to total bytes transferred between + the L2 cache and the L1 cache. This value is used as the x-coordinate for the + L2 roofline. + AI HBM: The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM). + It is the ratio of total floating-point operations (FLOPs) to total bytes transferred + between HBM and the L2 cache. This value is used as the x-coordinate for the + HBM roofline. + Performance (GFLOPs): The overall achieved performance, measured in GigaFLOPs + per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point + operations divided by the total execution time. This value is used as the y-coordinate + for the kernel's point on the Roofline plot. 
data source: - - None: + - metric_table: id: 401 - title: Roofline + title: Roofline Performance Rates + cli_style: Roofline + tui_style: Roofline + header: + metric: Metric + value: Value + unit: Unit + peak: Peak (Empirical) + metric: + VALU FLOPs: + value: AVG((($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) / ((End_Timestamp - Start_Timestamp) / 1e9)) + / 1e9) + unit: GFLOP/s + peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000) + MFMA FLOPs (F64): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF64Flops_empirical_peak + MFMA FLOPs (F32): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF32Flops_empirical_peak + MFMA FLOPs (F16): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF16Flops_empirical_peak + MFMA FLOPs (BF16): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMABF16Flops_empirical_peak + MFMA FLOPs (F8): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF8Flops_empirical_peak + MFMA FLOPs (F6F4): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMA_FLOPs_F6F4_empirical_peak + MFMA IOPs (Int8): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GIOP/s + peak: $MFMAI8Ops_empirical_peak + HBM 
Bandwidth: + value: AVG((( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum + - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + ((TCC_EA0_WRREQ_sum + - TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp + - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $HBMBw_empirical_peak + L2 Cache Bandwidth: + value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64)) / ((End_Timestamp + - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $L2Bw_empirical_peak + L1 Cache Bandwidth: + value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp) + / 1e9)) / 1e9) + unit: GB/s + peak: $L1Bw_empirical_peak + LDS Bandwidth: + value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4 * $lds_banks_per_cu)) + / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $LDSBw_empirical_peak + - metric_table: + id: 402 + title: Roofline Plot Points + cli_style: Roofline + tui_style: Roofline + header: + metric: Metric + value: Value + unit: Unit + metric: + AI HBM: + value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 + * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / SUM( (TCC_BUBBLE_sum + * 128) + (TCC_EA0_RDREQ_32B_sum * 32) + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum + - TCC_EA0_RDREQ_32B_sum) * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) + * 32) + (TCC_EA0_WRREQ_64B_sum * 64) ) ) + unit: FLOPs/Byte + AI L2: + 
value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 + * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / SUM( (TCP_TCC_WRITE_REQ_sum + + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + + TCP_TCC_READ_REQ_sum) * 64 ) ) + unit: FLOPs/Byte + AI L1: + value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 + * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / SUM(TCP_TOTAL_CACHE_ACCESSES_sum + * 64) ) + unit: FLOPs/Byte + Performance (GFLOPs): + value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 + * 512) +
(SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) ) / (SUM(End_Timestamp - + Start_Timestamp) / 1e9) ) / 1e9 + unit: GFLOP/s diff --git a/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_base.py b/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_base.py index f5790ffb48..05c8633c1a 100644 --- a/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_base.py +++ b/projects/rocprofiler-compute/src/rocprof_compute_soc/soc_base.py @@ -377,14 +377,10 @@ class OmniSoC_Base: if counter_name.startswith("TCC") and counter_name.endswith("["): counters.remove(counter_name) counter_name = counter_name.split("[")[0] - counters = counters.union( - { - f"{counter_name}[{i}]" - for i in range( - num_xcd_for_pmc_file * int(self._mspec._l2_banks) - ) - } - ) + counters = counters.union({ + f"{counter_name}[{i}]" + for i in range(num_xcd_for_pmc_file * int(self._mspec._l2_banks)) + }) return counters diff --git a/projects/rocprofiler-compute/src/roofline.py b/projects/rocprofiler-compute/src/roofline.py index 22faee0df2..1f6c43d35a 100644 --- a/projects/rocprofiler-compute/src/roofline.py +++ b/projects/rocprofiler-compute/src/roofline.py @@ -48,7 +48,8 @@ from utils.roofline_calc import ( MFMA_DATATYPES, PEAK_OPS_DATATYPES, SUPPORTED_DATATYPES, - calc_ai, + calc_ai_analyze, + calc_ai_profile, constuct_roof, ) from utils.utils import mibench @@ -182,10 +183,9 @@ class Roofline: console_debug( "roofline", "Path: %s" % self.__run_parameters.get("workload_dir") ) - self.__ai_data = calc_ai( + self.__ai_data = calc_ai_profile( self.__mspec, self.__run_parameters.get("sort_type"), ret_df ) - msg = "AI at each mem level:" for i in self.__ai_data: msg += "\n\t%s -> %s" % (i, self.__ai_data[i]) @@ -620,7 +620,7 @@ class Roofline: return fig - def cli_generate_plot(self, dtype): + def cli_generate_plot(self, dtype, workload=None, config=None, arch_config=None): """ Plot CLI mode roofline analysis in terminal using plotext @@ -668,11 +668,43 @@ class Roofline: else: # workload_dir is a 
string base_dir = workload_dir - self.roof_setup() - # Convert to Path object for easier manipulation base_path = Path(base_dir) + roofline_csv = base_path / "roofline.csv" + if not roofline_csv.is_file(): + console_log("roofline", "{} does not exist".format(roofline_csv)) + return + + # if workload is detected, utilize Roofline yamls. If not, fallback to legacy calc_ai + if workload is not None: + self.__ai_data = calc_ai_analyze( + workload=workload, + mspec=self.__mspec, + sort_type=self.__run_parameters.get("sort_type"), + config=config, + arch_config=arch_config, + ) + + else: + pmc_perf_csv = base_path / "pmc_perf.csv" + if not pmc_perf_csv.is_file(): + console_error("roofline", "{} does not exist".format(pmc_perf_csv)) + t_df = OrderedDict() + t_df["pmc_perf"] = pd.read_csv(pmc_perf_csv) + + self.__ai_data = calc_ai_profile( + self.__mspec, self.__run_parameters["sort_type"], t_df + ) + + self.__ceiling_data = constuct_roof( + roofline_parameters=self.__run_parameters, dtype=dtype + ) + console_debug(f"AI data: {self.__ai_data}") + console_debug(f"Kernel names: {self.__ai_data.get('kernelNames', [])}") + + self.roof_setup() + # Check proper datatype input - takes single str if not isinstance(dtype, str): console_error("Unsupported datatype input - must be str") @@ -682,16 +714,6 @@ class Roofline: self.__run_parameters["mem_level"].remove("vL1D") self.__run_parameters["mem_level"].append("L1") - roofline_csv = base_path / "roofline.csv" - if not roofline_csv.is_file(): - console_log("roofline", "{} does not exist".format(roofline_csv)) - return - - pmc_perf_csv = base_path / "pmc_perf.csv" - if not pmc_perf_csv.is_file(): - console_error("roofline", "{} does not exist".format(pmc_perf_csv)) - t_df = OrderedDict() - t_df["pmc_perf"] = pd.read_csv(pmc_perf_csv) profiling_config = file_io.load_profiling_config(self.__args.path[0][0]) if profiling_config.get("format_rocprof_output") == "rocpd": t_df["pmc_perf"] = rocpd_data.process_rocpd_csv(t_df["pmc_perf"]) @@ 
-714,12 +736,6 @@ class Roofline: 5: "atom", } - self.__ceiling_data = constuct_roof( - roofline_parameters=self.__run_parameters, - dtype=dtype, - ) - self.__ai_data = calc_ai(self.__mspec, self.__run_parameters["sort_type"], t_df) - plt.clf() plt.plotsize(plt.tw(), plt.th()) diff --git a/projects/rocprofiler-compute/src/utils/parser.py b/projects/rocprofiler-compute/src/utils/parser.py index d1d75abaa6..3a470b69a8 100644 --- a/projects/rocprofiler-compute/src/utils/parser.py +++ b/projects/rocprofiler-compute/src/utils/parser.py @@ -103,6 +103,7 @@ supported_call = { "STD": "to_std", # functions apply to whole column of df or a single value "TO_INT": "to_int", + "SUM": "to_sum", # Support the below with 2 inputs "ROUND": "to_round", "QUANTILE": "to_quantile", @@ -196,6 +197,19 @@ def to_int(a): raise Exception("to_int: unsupported type.") +def to_sum(a): + if str(type(a)) == "<class 'NoneType'>": + return np.nan + elif np.isnan(a).all(): + return np.nan + elif a.empty: + return np.nan + elif isinstance(a, pd.core.series.Series): + return a.sum() + else: + raise Exception("to_sum: unsupported type.") + + def to_round(a, b): if isinstance(a, pd.core.series.Series): return a.round(b) @@ -755,7 +769,7 @@ def build_metric_value_string(dfs, dfs_type, normal_unit, profiling_config): @demarcate -def eval_metric(dfs, dfs_type, sys_info, raw_pmc_df, debug, config): +def eval_metric(dfs, dfs_type, sys_info, empirical_peaks_df, raw_pmc_df, debug, config): """ Execute the expr string for each metric in the df.
""" @@ -860,6 +874,30 @@ def eval_metric(dfs, dfs_type, sys_info, raw_pmc_df, debug, config): "wave_size is not available in sysinfo.csv, please provide the correct " "value using --specs-correction" ) + if not empirical_peaks_df.empty: + peak_data_row = empirical_peaks_df.iloc[0] + for metric_name in empirical_peaks_df.columns: + var_name = f"ammolite__{metric_name}_empirical_peak" + locals()[var_name] = peak_data_row[metric_name] + else: + default_peaks = [ + "MFMAF64Flops", + "MFMAF32Flops", + "MFMAF16Flops", + "MFMABF16Flops", + "MFMAF8Flops", + "MFMAI8Ops", + "HBMBw", + "L2Bw", + "L1Bw", + "LDSBw", + "MFMA_FLOPs_F6F4", + ] + # set values to 0 if no empirical peaks from roofline.csv are provided + for peak_name in default_peaks: + var_name = f"ammolite__{peak_name}_empirical_peak" + exec(f"{var_name} = 0", globals(), locals()) + # TODO: fix all $normUnit in Unit column or title # build and eval all derived build-in global variables @@ -958,8 +996,7 @@ def eval_metric(dfs, dfs_type, sys_info, raw_pmc_df, debug, config): except TypeError: console_warning( "Skipping entry. Encountered a missing " - "counter\n{} has been assigned to None\n{}" - .format( + "counter\n{} has been assigned to None\n{}".format( expr, np.nan, ) @@ -984,8 +1021,14 @@ def eval_metric(dfs, dfs_type, sys_info, raw_pmc_df, debug, config): row[expr] = "" else: row[expr] = out - except TypeError: - row[expr] = "" + except (TypeError, NameError) as e: + if "empirical_peak" in str(e): + console_warning( + f"Missing empirical peak data: {e}. Using empty value." + ) + row[expr] = "" + else: + row[expr] = "" except AttributeError as ae: if ( str(ae) @@ -1043,8 +1086,7 @@ def apply_filters(workload, dir, is_gui, debug): for kernel_id in workload.filter_kernel_ids: if kernel_id >= len(kernels_df["Kernel_Name"]): console_error( - "{} is an invalid kernel id. Please enter an id between 0-{}" - .format( + "{} is an invalid kernel id.
Please enter an id between 0-{}".format( kernel_id, len(kernels_df["Kernel_Name"]) - 1, ) @@ -1579,6 +1621,7 @@ def load_table_data(workload, dir, is_gui, args, config, skipKernelTop=False): workload.dfs, workload.dfs_type, workload.sys_info.iloc[0], + workload.roofline_peaks, apply_filters(workload, dir, is_gui, args.debug), args.debug, config, diff --git a/projects/rocprofiler-compute/src/utils/roofline_calc.py b/projects/rocprofiler-compute/src/utils/roofline_calc.py index 3a670d2578..0fa4b10428 100644 --- a/projects/rocprofiler-compute/src/utils/roofline_calc.py +++ b/projects/rocprofiler-compute/src/utils/roofline_calc.py @@ -23,11 +23,15 @@ ############################################################################## + import csv from dataclasses import dataclass from pathlib import Path +import pandas as pd + from utils.logger import console_debug +from utils.parser import apply_filters, eval_metric ################################################ # Global vars @@ -154,8 +158,7 @@ def get_color(catagory): # Plot BW at each cache level # ------------------------------------------------------------------------------------- def calc_ceilings(roofline_parameters, dtype, benchmark_data): - """Given benchmarking data, calculate ceilings - (or peak performance) for empirical roofline""" + """Given benchmarking data, calculate ceilings (or peak performance) for empirical roofline""" # TODO: This is where filtering by memory level will need to occur for standalone graphPoints = {"hbm": [], "l2": [], "l1": [], "lds": [], "valu": [], "mfma": []} @@ -186,7 +189,7 @@ def calc_ceilings(roofline_parameters, dtype, benchmark_data): if dtype in PEAK_OPS_DATATYPES: x2 = peakOps / peakBw - y2 = peakOps # noqa: F841 + y2 = peakOps # Plot MFMA lines (NOTE: Assuming MI200 soc) x1_mfma = peakOps / peakBw @@ -220,9 +223,9 @@ def calc_ceilings(roofline_parameters, dtype, benchmark_data): graphPoints[cacheHierarchy[i].lower()].append([y1, peakY]) 
graphPoints[cacheHierarchy[i].lower()].append(peakBw) - # --------------------------------------------------------------------------------- + # ------------------------------------------------------------------------------------- # Plot computing roof - # --------------------------------------------------------------------------------- + # ------------------------------------------------------------------------------------- if dtype in PEAK_OPS_DATATYPES: # Plot FMA roof x0 = XMAX @@ -254,9 +257,151 @@ def calc_ceilings(roofline_parameters, dtype, benchmark_data): # Overlay application performance # ------------------------------------------------------------------------------------- # Calculate relevant metrics for ai calculation -def calc_ai(mspec, sort_type, ret_df): - """Given counter data, calculate arithmetic intensity - for each kernel in the application.""" +def calc_ai_analyze(workload, mspec, sort_type, config, arch_config): + """ + Calculate per-kernel metrics and AI points with Roofline yamls using eval_metric. 
+ """ + console_debug("calc_ai_analyze: Starting calc_ai analysis using Roofline yamls") + plot_points = { + "ai_l1": [[], []], + "ai_l2": [[], []], + "ai_hbm": [[], []], + "kernelNames": [], + } + + workload.roofline_metrics = {} + filtered_pmc = apply_filters(workload, workload.path, is_gui=False, debug=False) + + kernel_ids_to_process = [] + kernel_top_table_id = 1 + + if workload.filter_kernel_ids: + kernel_ids_to_process = workload.filter_kernel_ids + else: + if kernel_top_table_id in workload.dfs: + kernel_top_df = workload.dfs[kernel_top_table_id] + kernel_ids_to_process = kernel_top_df.index.tolist() + console_debug( + "roofline", f"Found {len(kernel_ids_to_process)} kernels to process" + ) + + if not kernel_ids_to_process: + console_warning("No kernels found to process for roofline") + return plot_points + + for kernel_id in kernel_ids_to_process: + if kernel_top_table_id in workload.dfs: + kernel_top_df = workload.dfs[kernel_top_table_id] + if kernel_id in kernel_top_df.index: + kernel_name = kernel_top_df.loc[kernel_id, "Kernel_Name"] + else: + continue + else: + continue + + console_debug("roofline", f"Processing kernel {kernel_id}: {kernel_name[:50]}") + + # filter PMC data for specific kernel + kernel_pmc_df = filtered_pmc[ + filtered_pmc["pmc_perf"]["Kernel_Name"] == kernel_name + ] + + if kernel_pmc_df.empty: + console_debug("roofline", f"No PMC data for kernel {kernel_id}") + continue + + kernel_only_data = {"pmc_perf": kernel_pmc_df["pmc_perf"]} + + kernel_dfs = {} + kernel_dfs_type = {} + + for table_id in [401, 402]: + if table_id in arch_config.dfs: + kernel_dfs[table_id] = arch_config.dfs[table_id].copy() + kernel_dfs_type[table_id] = arch_config.dfs_type[table_id] + + # eval metrics for single kernel only + eval_metric( + kernel_dfs, + kernel_dfs_type, + workload.sys_info.iloc[0], + workload.roofline_peaks, + kernel_only_data, + debug=False, + config=config, + ) + + # DEBUG + if 402 in kernel_dfs: + console_debug("roofline", f"Table 402 for 
kernel {kernel_id}:") + for idx, row in kernel_dfs[402].iterrows(): + console_debug( + "roofline", f" {row.get('Metric', '')}: {row.get('Value', '')}" + ) + + ai_hbm = ai_l2 = ai_l1 = performance = 0 + + if 402 in kernel_dfs: + for idx, row in kernel_dfs[402].iterrows(): + metric = row.get("Metric", "") + value = row.get("Value", 0) + if metric == "AI HBM": + ai_hbm = value if value and value != "" else 0 + elif metric == "AI L2": + ai_l2 = value if value and value != "" else 0 + elif metric == "AI L1": + ai_l1 = value if value and value != "" else 0 + elif metric == "Performance (GFLOPs)": + performance = value if value and value != "" else 0 + + console_debug( + "roofline", + f"Kernel {kernel_id}: AI_HBM={ai_hbm:.2f}, AI_L2={ai_l2:.2f}, AI_L1={ai_l1:.2f}, Performance={performance:.2e} GFLOP/s", + ) + + # add to plot points if we have valid data + if performance > 0: + if ai_hbm > 0: + plot_points["ai_hbm"][0].append(ai_hbm) + plot_points["ai_hbm"][1].append(performance) + if ai_l2 > 0: + plot_points["ai_l2"][0].append(ai_l2) + plot_points["ai_l2"][1].append(performance) + if ai_l1 > 0: + plot_points["ai_l1"][0].append(ai_l1) + plot_points["ai_l1"][1].append(performance) + + plot_points["kernelNames"].append(f"K{kernel_id}") + console_debug("roofline", f"Added kernel {kernel_id} to plot points") + else: + console_debug( + "roofline", f"Skipping kernel {kernel_id} - no performance data" + ) + + # store metrics for display + workload.roofline_metrics[kernel_id] = { + "name": kernel_name, + "ai_table": kernel_dfs.get(401, pd.DataFrame()), + "calc_table": kernel_dfs.get(402, pd.DataFrame()), + } + + console_debug( + "roofline", f"Generated {len(plot_points['kernelNames'])} plot points" + ) + console_debug("roofline", f"Plot points: {plot_points}") + return plot_points + + +def calc_ai_profile(mspec, sort_type, ret_df): + """Given counter data, calculate arithmetic intensity for each kernel in the application. + Leverage hard-coded equations to calculate AI values. 
+ + Used during profiling stage to generate roofline PDF, since Roofline yamls are not available + in the profiling stage.""" + + console_debug( + "calc_ai_profile: Starting legacy roofline calculation (from roofline_calc)" + ) df = ret_df["pmc_perf"] # Sort by top kernels or top dispatches? df = df.sort_values(by=["Kernel_Name"]) @@ -463,7 +608,9 @@ def calc_ai(mspec, sort_type, ret_df): calls += 1 - if sort_type == "kernels" and (at_end or (kernelName != next_kernelName)): + if sort_type == "kernels" and ( + at_end == True or (kernelName != next_kernelName) + ): myList.append( AI_Data( kernelName, @@ -538,8 +685,9 @@ def calc_ai(mspec, sort_type, ret_df): while i < TOP_N and i != len(myList): if myList[i].total_flops == 0: console_debug( - "No flops counted for {}, arithmetic intensities will not " - "display on plots.".format(myList[i].KernelName) + "No flops counted for {}, arithmetic intensities will not display on plots.".format( + myList[i].KernelName + ) ) kernelNames.append(myList[i].KernelName) @@ -548,40 +696,28 @@ def calc_ai(mspec, sort_type, ret_df): if myList[i].L1cache_data else intensities["ai_l1"].append(0) ) - # print( - # "cur_ai_L1", - # myList[i].total_flops / myList[i].L1cache_data - # ) if myList[i].L1cache_data else print("null") + # print("cur_ai_L1", myList[i].total_flops/myList[i].L1cache_data) if myList[i].L1cache_data else print("null") # print() ( intensities["ai_l2"].append(myList[i].total_flops / myList[i].L2cache_data) if myList[i].L2cache_data else intensities["ai_l2"].append(0) ) - # print( - # "cur_ai_L2", - # myList[i].total_flops / myList[i].L2cache_data - # ) if myList[i].L2cache_data else print("null") + # print("cur_ai_L2", myList[i].total_flops/myList[i].L2cache_data) if myList[i].L2cache_data else print("null") # print() ( intensities["ai_hbm"].append(myList[i].total_flops / myList[i].hbm_data) if myList[i].hbm_data else intensities["ai_hbm"].append(0) ) - # print( - # "cur_ai_hbm", - # myList[i].total_flops / 
myList[i].hbm_data - # ) if myList[i].hbm_data else print("null") + # print("cur_ai_hbm", myList[i].total_flops/myList[i].hbm_data) if myList[i].hbm_data else print("null") # print() ( curr_perf.append(myList[i].total_flops / myList[i].avgDuration) if myList[i].avgDuration else curr_perf.append(0) ) - # print( - # "cur_perf", - # myList[i].total_flops / myList[i].avgDuration - # ) if myList[i].avgDuration else print("null") + # print("cur_perf", myList[i].total_flops/myList[i].avgDuration) if myList[i].avgDuration else print("null") i += 1 @@ -590,7 +726,7 @@ def calc_ai(mspec, sort_type, ret_df): for i in intensities: values = intensities[i] - color = get_color(i) # noqa: F841 + color = get_color(i) x = [] y = [] for entryIndx in range(0, len(values)): @@ -622,8 +758,7 @@ def constuct_roof(roofline_parameters, dtype): # ----------------------------------------------------- # Initialize roofline data dictionary from roofline.csv # ----------------------------------------------------- - # TODO: consider changing this to an ordered dict for consistency over py versions - benchmark_data = {} + benchmark_data = {} # TODO: consider changing this to an ordered dict for consistency over py versions headers = [] try: with open(benchmark_results, "r") as csvfile: @@ -641,7 +776,7 @@ def constuct_roof(roofline_parameters, dtype): rowCount += 1 csvfile.close() - except Exception: + except: graphPoints = { "hbm": [None, None, None], "l2": [None, None, None], diff --git a/projects/rocprofiler-compute/src/utils/schema.py b/projects/rocprofiler-compute/src/utils/schema.py index d03433cadb..b61a584189 100644 --- a/projects/rocprofiler-compute/src/utils/schema.py +++ b/projects/rocprofiler-compute/src/utils/schema.py @@ -83,6 +83,7 @@ supported_field = [ "Avg", "Pct of Peak", "Peak", + "Peak (Empirical)", "Count", "Mean", "Pct", diff --git a/projects/rocprofiler-compute/src/utils/tty.py b/projects/rocprofiler-compute/src/utils/tty.py index fbedf7c880..88bd6ea367 100644 --- 
a/projects/rocprofiler-compute/src/utils/tty.py +++ b/projects/rocprofiler-compute/src/utils/tty.py @@ -32,6 +32,7 @@ from tabulate import tabulate import config from utils import mem_chart, parser +from utils.kernel_name_shortener import kernel_name_shortener from utils.logger import console_error, console_log, console_warning from utils.utils import convert_metric_id_to_panel_info @@ -146,6 +147,108 @@ def show_all(args, runs, archConfigs, output, profiling_config, roof_plot=None): continue ss = "" # store content of all data_source from one panel + if panel_id == 400: + has_roofline_style = any( + data_source.get(type, {}).get("cli_style") == "Roofline" + for data_source in panel["data source"] + for type in data_source + ) + + if has_roofline_style and ( + not args.filter_metrics or "4" in args.filter_metrics + ): + print("\n" + "=" * 80, file=output) + print("4. Roofline", file=output) + print("=" * 80, file=output) + + for run_path, workload in runs.items(): + if ( + hasattr(workload, "roofline_metrics") + and workload.roofline_metrics + ): + print( + "\n(4.1) Per-Kernel Roofline Metrics and (4.2) AI Plot Points", + file=output, + ) + print("-" * 80, file=output) + + kernel_top_df = workload.dfs.get(1, pd.DataFrame()) + if not kernel_top_df.empty: + kernel_name_shortener(kernel_top_df, args.kernel_verbose) + + for i, (kernel_id, metrics) in enumerate( + workload.roofline_metrics.items() + ): + if ( + not kernel_top_df.empty + and kernel_id in kernel_top_df.index + ): + kernel_name = kernel_top_df.loc[ + kernel_id, "Kernel_Name" + ] + kernel_pct = ( + kernel_top_df.loc[kernel_id, "Pct"] + if "Pct" in kernel_top_df.columns + else 0 + ) + else: + kernel_name = metrics.get("name", f"Kernel {kernel_id}") + kernel_pct = 0 + + display_name = ( + kernel_name[:80] + "..." 
+ if len(kernel_name) > 80 + else kernel_name + ) + print( + f"\nKernel {kernel_id}: {display_name} ({kernel_pct:.1f}%)", + file=output, + ) + + base_indent = " " + table_indent_prefix = f"{base_indent}| " + + tables = { + 401: ( + "4.1 Roofline Rate Metrics:", + metrics.get("ai_table", pd.DataFrame()), + ), + 402: ( + "4.2 Roofline AI Plot Points:", + metrics.get("calc_table", pd.DataFrame()), + ), + } + + print(f"{base_indent}|") + + for table_id, (table_name, df) in tables.items(): + if df.empty: + continue + + print(f"{base_indent}├─ {table_name}", file=output) + + display_df = df.copy() + + for col in hidden_cols: + if col in display_df.columns: + display_df = display_df.drop(columns=[col]) + + table_string = get_table_string( + display_df, transpose=False, decimal=args.decimal + ) + indented_table_string = textwrap.indent( + table_string, table_indent_prefix + ) + print(indented_table_string, file=output) + + else: + print("\nNo per-kernel metrics available", file=output) + + # Show the roofline plot + if roof_plot: + show_roof_plot(roof_plot) + continue + for data_source in panel["data source"]: for type, table_config in data_source.items(): # If block filtering was used during analysis, then don't use profiling @@ -172,16 +275,6 @@ def show_all(args, runs, archConfigs, output, profiling_config, roof_plot=None): ) continue - # Show roofline - # Check if we have filter_metrics for analyze stage: - # no filter_metrics = show all, - # filter_metrics containing "4" = user requesting roofline chart - if panel_id == 400 and ( - not args.filter_metrics or "4" in args.filter_metrics - ): - show_roof_plot(roof_plot) - continue - # Metrics baseline comparison mode # We cannot guarantee that all runs have the same metrics. # Only show common metrics. @@ -454,7 +547,7 @@ def show_roof_plot(roof_plot): # TODO: short term solution to display roofline plot print("\n" + "-" * 80) print("4. 
Roofline") - print("4.1 Roofline") + print("4.3 Roofline Plot") if roof_plot: print(roof_plot) else: diff --git a/projects/rocprofiler-compute/src/utils/utils.py b/projects/rocprofiler-compute/src/utils/utils.py index e0d2053943..d2d8c41df0 100644 --- a/projects/rocprofiler-compute/src/utils/utils.py +++ b/projects/rocprofiler-compute/src/utils/utils.py @@ -745,7 +745,7 @@ def run_prof( config.rocprof_compute_home / "rocprof_compute_soc" / "profile_configs" - / f"counter_defs.yaml", + / "counter_defs.yaml", "r", ) as file: counter_defs = yaml.safe_load(file) diff --git a/projects/rocprofiler-compute/tests/test_profile_general.py b/projects/rocprofiler-compute/tests/test_profile_general.py index 51dcdd50c7..33dcd2f3c3 100644 --- a/projects/rocprofiler-compute/tests/test_profile_general.py +++ b/projects/rocprofiler-compute/tests/test_profile_general.py @@ -1676,9 +1676,9 @@ class TestSetsIntegration: memory_metrics = ["16.1.2", "17.1.0"] for metric_id in memory_metrics: - assert ( - metric_id in open(Path(workload_dir) / "log.txt", "r").read() - ), f"Expected memory metric {metric_id} not found" + assert metric_id in open(Path(workload_dir) / "log.txt", "r").read(), ( + f"Expected memory metric {metric_id} not found" + ) test_utils.clean_output_dir(config["cleanup"], workload_dir) @@ -1745,7 +1745,9 @@ class TestSetsIntegration: assert returncode == 1 test_utils.clean_output_dir(config["cleanup"], workload_dir) - def test_set_and_block_mutual_exclusion(self, binary_handler_profile_rocprof_compute): + def test_set_and_block_mutual_exclusion( + self, binary_handler_profile_rocprof_compute + ): options = ["--set", "compute_thruput_util", "--block", "12"] workload_dir = test_utils.get_output_dir() diff --git a/projects/rocprofiler-compute/tests/test_utils.py b/projects/rocprofiler-compute/tests/test_utils.py index 4b530ce333..a57f659c3d 100644 --- a/projects/rocprofiler-compute/tests/test_utils.py +++ b/projects/rocprofiler-compute/tests/test_utils.py @@ -30,18 +30,17 
@@ import json import locale import logging import os -import tempfile import pathlib import re import shutil import subprocess +import tempfile from pathlib import Path from types import SimpleNamespace from unittest import mock import pandas as pd import pytest -import yaml import utils.utils as utils diff --git a/projects/rocprofiler-compute/utils/autogen_hash.yaml b/projects/rocprofiler-compute/utils/autogen_hash.yaml index 3a4e66320c..6de2ea2fad 100644 --- a/projects/rocprofiler-compute/utils/autogen_hash.yaml +++ b/projects/rocprofiler-compute/utils/autogen_hash.yaml @@ -23,12 +23,12 @@ src/rocprof_compute_soc/analysis_configs/gfx940/0300_memory_chart.yaml: cff5509a src/rocprof_compute_soc/analysis_configs/gfx941/0300_memory_chart.yaml: cff5509ac8502bad6dbd75e3058159fe429aece5d93279c66b2a6a8c887b43b6 src/rocprof_compute_soc/analysis_configs/gfx942/0300_memory_chart.yaml: cff5509ac8502bad6dbd75e3058159fe429aece5d93279c66b2a6a8c887b43b6 src/rocprof_compute_soc/analysis_configs/gfx950/0300_memory_chart.yaml: 643b31ffa43bc3613d6f90b0c23d95093d0d0aa5bc8e72d9a0fbc1b739a08b67 -src/rocprof_compute_soc/analysis_configs/gfx908/0400_roofline.yaml: 57f95dcd487dfcdf24e1c2d8eb16d14dc3462df83d11a08e7de2b06343b48c3e -src/rocprof_compute_soc/analysis_configs/gfx90a/0400_roofline.yaml: 57f95dcd487dfcdf24e1c2d8eb16d14dc3462df83d11a08e7de2b06343b48c3e -src/rocprof_compute_soc/analysis_configs/gfx940/0400_roofline.yaml: 57f95dcd487dfcdf24e1c2d8eb16d14dc3462df83d11a08e7de2b06343b48c3e -src/rocprof_compute_soc/analysis_configs/gfx941/0400_roofline.yaml: 57f95dcd487dfcdf24e1c2d8eb16d14dc3462df83d11a08e7de2b06343b48c3e -src/rocprof_compute_soc/analysis_configs/gfx942/0400_roofline.yaml: 57f95dcd487dfcdf24e1c2d8eb16d14dc3462df83d11a08e7de2b06343b48c3e -src/rocprof_compute_soc/analysis_configs/gfx950/0400_roofline.yaml: 57f95dcd487dfcdf24e1c2d8eb16d14dc3462df83d11a08e7de2b06343b48c3e +src/rocprof_compute_soc/analysis_configs/gfx908/0400_roofline.yaml: 
6406ce67cd55064f0d2db2a3511c6536cc1625314ddb31366900fbf3c60ed523 +src/rocprof_compute_soc/analysis_configs/gfx90a/0400_roofline.yaml: 100d555cf9e70b892e22f92ddd9c0a5d1f914d07077c4a8d35941e8ad62b5b30 +src/rocprof_compute_soc/analysis_configs/gfx940/0400_roofline.yaml: f8bf66f43c9afede4fd1f17c279050cc27cc6fbc1cdb53a71ae8ceb0eb84dc37 +src/rocprof_compute_soc/analysis_configs/gfx941/0400_roofline.yaml: 6fae04dcf4bcabe4a71f5d9eefc379a38d30cdf05fbb14e2c276e1c272fdb3f6 +src/rocprof_compute_soc/analysis_configs/gfx942/0400_roofline.yaml: c8dfe7df24f94dfa229ffa2035b802c6833ce98f7710e0889bc5710f2167d4c0 +src/rocprof_compute_soc/analysis_configs/gfx950/0400_roofline.yaml: 734fdfa818bfd8a87e01a0dd795c502a567c72158ca9b7bfe01e99451e8aa537 src/rocprof_compute_soc/analysis_configs/gfx908/0500_command_processor_cpc_cpf.yaml: da1c2997d42d66da2aa069caa741cf9eade124357c56e4290333de2f3e0412bb src/rocprof_compute_soc/analysis_configs/gfx90a/0500_command_processor_cpc_cpf.yaml: da1c2997d42d66da2aa069caa741cf9eade124357c56e4290333de2f3e0412bb src/rocprof_compute_soc/analysis_configs/gfx940/0500_command_processor_cpc_cpf.yaml: da1c2997d42d66da2aa069caa741cf9eade124357c56e4290333de2f3e0412bb diff --git a/projects/rocprofiler-compute/utils/split_config.py b/projects/rocprofiler-compute/utils/split_config.py index 0c9f864702..336e54aa74 100644 --- a/projects/rocprofiler-compute/utils/split_config.py +++ b/projects/rocprofiler-compute/utils/split_config.py @@ -87,7 +87,9 @@ def update_analysis_config(): data_source_config["metric_table"]["metric"], gfx_version, ) - new_panel_config["Panel Config"]["data source"].append(data_source_config) + new_panel_config["Panel Config"]["data source"].append( + data_source_config + ) # Write panel config to file filename = Path( TARGET_DIR.joinpath(gfx_version, f"{panel_id}_{panel_title}.yaml") @@ -134,9 +136,9 @@ def update_sets_config(): } for metric_id in sets["metric"][gfx_version]: - current_set["metric"].append( - {metric_id: 
METRIC_ID_TO_NAME_MAP[gfx_version][str(metric_id)]} - ) + current_set["metric"].append({ + metric_id: METRIC_ID_TO_NAME_MAP[gfx_version][str(metric_id)] + }) new_sets["sets"].append(current_set) diff --git a/projects/rocprofiler-compute/utils/unified_config.yaml b/projects/rocprofiler-compute/utils/unified_config.yaml index d214d3148d..4d1964dcb7 100644 --- a/projects/rocprofiler-compute/utils/unified_config.yaml +++ b/projects/rocprofiler-compute/utils/unified_config.yaml @@ -2801,9 +2801,963 @@ panels: - id: 400 title: Roofline data source: - - None: + - metric_table: id: 401 - title: Roofline + title: Roofline Performance Rates + cli_style: Roofline + tui_style: Roofline + header: + metric: Metric + value: Value + unit: Unit + peak: Peak (Empirical) + metric: + gfx90a: + VALU FLOPs: + value: AVG((($wave_size * ( + (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) + )) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000) + MFMA FLOPs (F64): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF64Flops_empirical_peak + MFMA FLOPs (F32): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF32Flops_empirical_peak + MFMA FLOPs (F16): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF16Flops_empirical_peak + MFMA FLOPs (BF16): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMABF16Flops_empirical_peak + MFMA 
IOPs (Int8): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GIOP/s + peak: $MFMAI8Ops_empirical_peak + HBM Bandwidth: + value: AVG((( + (TCC_EA_RDREQ_32B_sum * 32) + + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) * 64) + + (TCC_EA_WRREQ_64B_sum * 64) + + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) * 32) + ) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $HBMBw_empirical_peak + L2 Cache Bandwidth: + value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * + 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $L2Bw_empirical_peak + L1 Cache Bandwidth: + value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $L1Bw_empirical_peak + LDS Bandwidth: + value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * + 4 * $lds_banks_per_cu)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $LDSBw_empirical_peak + gfx908: + VALU FLOPs: + value: AVG((($wave_size * ( + (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) + )) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000) + MFMA FLOPs (F64): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF64Flops_empirical_peak + MFMA FLOPs (F32): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF32Flops_empirical_peak + MFMA FLOPs (F16): + value: 
AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF16Flops_empirical_peak + MFMA FLOPs (BF16): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMABF16Flops_empirical_peak + MFMA IOPs (Int8): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GIOP/s + peak: $MFMAI8Ops_empirical_peak + HBM Bandwidth: + value: AVG((( + (TCC_BUBBLE_sum * 128) + + (TCC_EA0_RDREQ_32B_sum * 32) + + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) + + (TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $HBMBw_empirical_peak + L2 Cache Bandwidth: + value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * + 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $L2Bw_empirical_peak + L1 Cache Bandwidth: + value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $L1Bw_empirical_peak + LDS Bandwidth: + value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * + 4 * $lds_banks_per_cu)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $LDSBw_empirical_peak + gfx940: + VALU FLOPs: + value: AVG(($wave_size * ( + (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) + ) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000) + MFMA FLOPs (F64): + 
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF64Flops_empirical_peak + MFMA FLOPs (F32): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF32Flops_empirical_peak + MFMA FLOPs (F16): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF16Flops_empirical_peak + MFMA FLOPs (BF16): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMABF16Flops_empirical_peak + MFMA FLOPs (F8): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF8Flops_empirical_peak + MFMA IOPs (Int8): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GIOP/s + peak: $MFMAI8Ops_empirical_peak + HBM Bandwidth: + value: AVG((( + (TCC_BUBBLE_sum * 128) + + (TCC_EA0_RDREQ_32B_sum * 32) + + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) + + (TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $HBMBw_empirical_peak + L2 Cache Bandwidth: + value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * + 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $L2Bw_empirical_peak + L1 Cache Bandwidth: + value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $L1Bw_empirical_peak + LDS Bandwidth: + value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * + 4 * $lds_banks_per_cu)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: 
$LDSBw_empirical_peak + gfx941: + VALU FLOPs: + value: AVG(($wave_size * ( + (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) + ) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000) + MFMA FLOPs (F64): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF64Flops_empirical_peak + MFMA FLOPs (F32): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF32Flops_empirical_peak + MFMA FLOPs (F16): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF16Flops_empirical_peak + MFMA FLOPs (BF16): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMABF16Flops_empirical_peak + MFMA FLOPs (F8): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF8Flops_empirical_peak + MFMA IOPs (Int8): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GIOP/s + peak: $MFMAI8Ops_empirical_peak + HBM Bandwidth: + value: AVG((( + (TCC_BUBBLE_sum * 128) + + (TCC_EA0_RDREQ_32B_sum * 32) + + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) + + (TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $HBMBw_empirical_peak + L2 Cache Bandwidth: + value: AVG(((((TCP_TCC_WRITE_REQ_sum + 
TCP_TCC_ATOMIC_WITH_RET_REQ_sum + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * + 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $L2Bw_empirical_peak + L1 Cache Bandwidth: + value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $L1Bw_empirical_peak + LDS Bandwidth: + value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * + 4 * $lds_banks_per_cu)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $LDSBw_empirical_peak + gfx942: + VALU FLOPs: + value: AVG((($wave_size * ( + (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) + )) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000) + MFMA FLOPs (F64): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF64Flops_empirical_peak + MFMA FLOPs (F32): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF32Flops_empirical_peak + MFMA FLOPs (F16): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF16Flops_empirical_peak + MFMA FLOPs (BF16): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMABF16Flops_empirical_peak + MFMA FLOPs (F8): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF8Flops_empirical_peak + MFMA IOPs (Int8): + value: 
AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GIOP/s + peak: $MFMAI8Ops_empirical_peak + HBM Bandwidth: + value: AVG((( + (TCC_BUBBLE_sum * 128) + + (TCC_EA0_RDREQ_32B_sum * 32) + + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) + + (TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $HBMBw_empirical_peak + L2 Cache Bandwidth: + value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * + 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $L2Bw_empirical_peak + L1 Cache Bandwidth: + value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $L1Bw_empirical_peak + LDS Bandwidth: + value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * + 4 * $lds_banks_per_cu)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $LDSBw_empirical_peak + gfx950: + VALU FLOPs: + value: AVG((($wave_size * ( + (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) + )) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000) + MFMA FLOPs (F64): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF64Flops_empirical_peak + MFMA FLOPs (F32): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF32Flops_empirical_peak + MFMA FLOPs (F16): + 
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF16Flops_empirical_peak + MFMA FLOPs (BF16): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMABF16Flops_empirical_peak + MFMA FLOPs (F8): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMAF8Flops_empirical_peak + MFMA FLOPs (F6F4): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GFLOP/s + peak: $MFMA_FLOPs_F6F4_empirical_peak + MFMA IOPs (Int8): + value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GIOP/s + peak: $MFMAI8Ops_empirical_peak + HBM Bandwidth: + value: AVG((( + (TCC_BUBBLE_sum * 128) + + (TCC_EA0_RDREQ_32B_sum * 32) + + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) + + (TCC_EA0_WRREQ_64B_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $HBMBw_empirical_peak + L2 Cache Bandwidth: + value: AVG(((((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * + 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $L2Bw_empirical_peak + L1 Cache Bandwidth: + value: AVG((((TCP_TOTAL_CACHE_ACCESSES_sum * 64)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $L1Bw_empirical_peak + LDS Bandwidth: + value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * + 4 * $lds_banks_per_cu)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9) + unit: GB/s + peak: $LDSBw_empirical_peak + - metric_table: + id: 402 + title: Roofline Plot Points + cli_style: Roofline + tui_style: Roofline + header: + metric: Metric + value: Value + unit: Unit + 
metric: + gfx90a: + AI HBM: + value: ( + SUM( + ($wave_size * ( + (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) + )) + + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + ) / + SUM( + (TCC_EA_RDREQ_32B_sum * 32) + + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) * 64) + + (TCC_EA_WRREQ_64B_sum * 64) + + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) * 32) + ) + ) + unit: FLOPs/Byte + AI L2: + value: ( + SUM( + ($wave_size * ( + (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) + )) + + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + ) / + SUM( + (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64 + ) + ) + unit: FLOPs/Byte + AI L1: + value: ( + SUM( + ($wave_size * ( + (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) + )) + + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + ) / 
+ SUM(TCP_TOTAL_CACHE_ACCESSES_sum * 64) + ) + unit: FLOPs/Byte + Performance GFLOPs: + value: ( + SUM( + ($wave_size * ( + (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) + )) + + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + ) / + (SUM(End_Timestamp - Start_Timestamp) / 1e9) + ) / 1e9 + unit: GFLOP/s + gfx908: + AI HBM: + value: ( + SUM( + ($wave_size * ( + (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) + )) + + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + ) / + SUM( + (TCC_BUBBLE_sum * 128) + + (TCC_EA0_RDREQ_32B_sum * 32) + + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) + + (TCC_EA0_WRREQ_64B_sum * 64) + ) + ) + unit: FLOPs/Byte + AI L2: + value: ( + SUM( + ($wave_size * ( + (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) + )) + + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + + 
(SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + ) / + SUM( + (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64 + ) + ) + unit: FLOPs/Byte + AI L1: + value: ( + SUM( + ($wave_size * ( + (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) + )) + + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + ) / + SUM(TCP_TOTAL_CACHE_ACCESSES_sum * 64) + ) + unit: FLOPs/Byte + Performance GFLOPs: + value: ( + SUM( + ($wave_size * ( + (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) + )) + + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + ) / + (SUM(End_Timestamp - Start_Timestamp) / 1e9) + ) / 1e9 + unit: GFLOP/s + gfx940: + AI HBM: + value: ( + SUM( + ($wave_size * ( + (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) + )) + + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + + 
(SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) + ) / + SUM( + (TCC_BUBBLE_sum * 128) + + (TCC_EA0_RDREQ_32B_sum * 32) + + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) + + (TCC_EA0_WRREQ_64B_sum * 64) + ) + ) + unit: FLOPs/Byte + AI L2: + value: ( + SUM( + ($wave_size * ( + (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) + )) + + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + + ) / + SUM( + (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64 + ) + ) + unit: FLOPs/Byte + AI L1: + value: ( + SUM( + ($wave_size * ( + (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) + )) + + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + + ) / + SUM(TCP_TOTAL_CACHE_ACCESSES_sum * 64) + ) + unit: FLOPs/Byte + Performance GFLOPs: + value: ( + SUM( + ($wave_size * ( + (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) + )) + + 
(SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) + ) / + (SUM(End_Timestamp - Start_Timestamp) / 1e9) + ) / 1e9 + unit: GFLOP/s + gfx941: + AI HBM: + value: ( + SUM( + ($wave_size * ( + (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) + )) + + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) + ) / + SUM( + (TCC_BUBBLE_sum * 128) + + (TCC_EA0_RDREQ_32B_sum * 32) + + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) + + (TCC_EA0_WRREQ_64B_sum * 64) + ) + ) + unit: FLOPs/Byte + AI L2: + value: ( + SUM( + ($wave_size * ( + (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) + )) + + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) + ) / + SUM( + (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64 + ) + ) + unit: FLOPs/Byte + AI L1: + value: ( + SUM( + ($wave_size * ( + (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + 
SQ_INSTS_VALU_TRANS_F16) + + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) + )) + + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) + ) / + SUM(TCP_TOTAL_CACHE_ACCESSES_sum * 64) + ) + unit: FLOPs/Byte + Performance GFLOPs: + value: ( + SUM( + ($wave_size * ( + (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) + )) + + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) + ) / + (SUM(End_Timestamp - Start_Timestamp) / 1e9) + ) / 1e9 + unit: GFLOP/s + gfx942: + AI HBM: + value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 + * 512) ) / SUM( (TCC_BUBBLE_sum * 128) + (TCC_EA0_RDREQ_32B_sum * 32) + + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) + (TCC_EA0_WRREQ_64B_sum + * 64) ) ) + unit: 
FLOPs/Byte + AI L2: + value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 + * 512) ) / SUM( (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64 ) ) + unit: FLOPs/Byte + AI L1: + value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 + * 512) ) / SUM( TCP_TOTAL_CACHE_ACCESSES_sum * 64 ) ) + unit: FLOPs/Byte + Performance (GFLOPs): + value: ( SUM( ($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32 + + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + + SQ_INSTS_VALU_TRANS_F64) )) + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F32 * + 512) + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + (SQ_INSTS_VALU_MFMA_MOPS_F8 + * 512) ) / (SUM(End_Timestamp - Start_Timestamp) / 1e9) ) / 1e9 + 
unit: GFLOP/s + gfx950: + AI HBM: + value: ( + SUM( + ($wave_size * ( + (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) + )) + + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) + ) / + SUM( + (TCC_BUBBLE_sum * 128) + + (TCC_EA0_RDREQ_32B_sum * 32) + + ((TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) * 64) + + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum) * 32) + + (TCC_EA0_WRREQ_64B_sum * 64) + ) + ) + unit: FLOPs/Byte + AI L2: + value: ( + SUM( + ($wave_size * ( + (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) + )) + + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) + ) / + SUM( + (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum + TCP_TCC_READ_REQ_sum) * 64 + ) + ) + unit: FLOPs/Byte + AI L1: + value: ( + SUM( + ($wave_size * ( + (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * 
SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) + )) + + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) + ) / + SUM(TCP_TOTAL_CACHE_ACCESSES_sum * 64) + ) + unit: FLOPs/Byte + Performance GFLOPs: + value: ( + SUM( + ($wave_size * ( + (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + + (SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) + + (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64) + )) + + (SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) + + (SQ_INSTS_VALU_MFMA_MOPS_F6F4 * 512) + ) / + (SUM(End_Timestamp - Start_Timestamp) / 1e9) + ) / 1e9 + unit: GFLOP/s + metrics_description: + VALU FLOPs: + plain: 'The total floating-point operations executed per second on the VALU. + This is also presented as a percent of the peak theoretical FLOPs achievable + on the specific accelerator. Note: this does not include any floating-point + operations from MFMA instructions.' + rst: 'The total floating-point operations executed per second on the :ref:`VALU + `. This is also presented as a percent of the peak theoretical + FLOPs achievable on the specific accelerator. Note: this does not include + any floating-point operations from :ref:`MFMA ` instructions.' + unit: GFLOPs + MFMA FLOPs (F8): + plain: The total number of 8-bit floating point MFMA operations executed + per second. This does not include any 8-bit floating point operations + from VALU instructions. 
The peak empirically measured F8 MFMA operations + achievable on the specific accelerator is displayed alongside for comparison. + It is supported only on AMD Instinct MI300 series and later. + rst: 'The total number of 8-bit floating point :ref:`MFMA ` + operations executed per second. Note: this does not include any 8-bit + floating point operations from :ref:`VALU ` instructions. The + peak empirically measured F8 MFMA operations achievable on the specific + accelerator is displayed alongside for comparison. It is supported only on AMD + Instinct MI300 series and later.' + unit: GFLOPs + MFMA FLOPs (BF16): + plain: 'The total number of 16-bit brain floating point MFMA operations executed + per second. Note: this does not include any 16-bit brain floating point + operations from VALU instructions. The peak empirically measured BF16 MFMA + operations achievable on the specific accelerator is displayed alongside + for comparison.' + rst: 'The total number of 16-bit brain floating point :ref:`MFMA ` + operations executed per second. Note: this does not include any 16-bit brain + floating point operations from :ref:`VALU ` instructions. The + peak empirically measured BF16 MFMA operations achievable on the specific + accelerator is displayed alongside for comparison.' + unit: GFLOPs + MFMA FLOPs (F16): + plain: 'The total number of 16-bit floating point MFMA operations executed per + second. Note: this does not include any 16-bit floating point operations from + VALU instructions. The peak empirically measured F16 MFMA operations + achievable on the specific accelerator is displayed alongside for comparison.' + rst: 'The total number of 16-bit floating point :ref:`MFMA ` operations + executed per second. Note: this does not include any 16-bit floating point + operations from :ref:`VALU ` instructions. The peak empirically + measured F16 MFMA operations achievable on the specific accelerator is + displayed alongside for comparison.' 
+ unit: GFLOPs + MFMA FLOPs (F32): + plain: 'The total number of 32-bit floating point MFMA operations executed per + second. Note: this does not include any 32-bit floating point operations from + VALU instructions. The peak empirically measured F32 MFMA operations + achievable on the specific accelerator is displayed alongside for comparison.' + rst: 'The total number of 32-bit floating point :ref:`MFMA ` operations + executed per second. Note: this does not include any 32-bit floating point + operations from :ref:`VALU ` instructions. The peak empirically + measured F32 MFMA operations achievable on the specific accelerator is + displayed alongside for comparison.' + unit: GFLOPs + MFMA FLOPs (F64): + plain: 'The total number of 64-bit floating point MFMA operations executed per + second. Note: this does not include any 64-bit floating point operations from + VALU instructions. The peak empirically measured F64 MFMA operations + achievable on the specific accelerator is displayed alongside for comparison.' + rst: 'The total number of 64-bit floating point :ref:`MFMA ` operations + executed per second. Note: this does not include any 64-bit floating point + operations from :ref:`VALU ` instructions. The peak empirically + measured F64 MFMA operations achievable on the specific accelerator is + displayed alongside for comparison.' + unit: GFLOPs + MFMA IOPs (Int8): + plain: 'The total number of 8-bit integer MFMA operations executed per second. + Note: this does not include any 8-bit integer operations from VALU instructions. + The peak empirically measured INT8 MFMA operations achievable on the specific + accelerator is displayed alongside for comparison.' + rst: 'The total number of 8-bit integer :ref:`MFMA ` operations executed + per second. Note: this does not include any 8-bit integer operations from + :ref:`VALU ` instructions. 
The peak empirically measured INT8 MFMA + operations achievable on the specific accelerator is displayed alongside + for comparison.' + unit: GIOPs + HBM Bandwidth: + plain: 'The total number of bytes read from and written to High-Bandwidth + Memory (HBM) per second. The peak empirically measured bandwidth achievable + on the specific accelerator is displayed alongside for comparison.' + rst: 'The total number of bytes read from and written to High-Bandwidth + Memory (HBM) per second. The peak empirically measured bandwidth achievable + on the specific accelerator is displayed alongside for comparison.' + unit: GB/s + L2 Cache Bandwidth: + plain: The number of bytes looked up in the L2 cache per unit time. The number + of bytes is calculated as the number of cache lines requested multiplied by + the cache line size. This value does not consider partial requests; for example, + if only a single value is requested in a cache line, the data movement will + still be counted as a full cache line. The peak empirically measured bandwidth + achievable on the specific accelerator is displayed alongside for comparison. + rst: The number of bytes looked up in the L2 cache per unit time. The number of + bytes is calculated as the number of cache lines requested multiplied by + the cache line size. This value does not consider partial requests; for example, + if only a single value is requested in a cache line, the data movement will + still be counted as a full cache line. The peak empirically measured + bandwidth achievable on the specific accelerator is displayed alongside + for comparison. + unit: GB/s + L1 Cache Bandwidth: + plain: The number of bytes looked up in the vL1D cache as a result of VMEM + instructions per unit time. The number of bytes is calculated as the number + of cache lines requested multiplied by the cache line size. 
This value does + not consider partial requests; for example, if only a single value is requested + in a cache line, the data movement will still be counted as a full cache line. + The peak empirically measured bandwidth achievable on the specific accelerator + is displayed alongside for comparison. + rst: The number of bytes looked up in the vL1D cache as a result of :ref:`VMEM + ` instructions per unit time. The number of bytes is calculated + as the number of cache lines requested multiplied by the cache line size. + This value does not consider partial requests; for example, if only a single + value is requested in a cache line, the data movement will still be counted + as a full cache line. The peak empirically measured bandwidth achievable on + the specific accelerator is displayed alongside for comparison. + unit: GB/s + LDS Bandwidth: + plain: Indicates the maximum number of bytes that could have been loaded from, + stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth + example for more detail). The peak empirically measured LDS bandwidth + achievable on the specific accelerator is displayed alongside for comparison. + rst: Indicates the maximum number of bytes that could have been loaded from, + stored to, or atomically updated in the LDS per unit time (see :ref:`LDS + Bandwidth ` example for more detail). The peak empirically + measured LDS bandwidth achievable on the specific accelerator is displayed + alongside for comparison. + unit: GB/s + AI L1: + plain: 'The Arithmetic Intensity (AI) relative to the L1 Cache. It is the ratio + of total floating-point operations (FLOPs) to total bytes transferred between + the L1 cache and the processing units. 
This value is used as the x-coordinate + for the L1 roofline.' + unit: FLOPs/Byte + AI L2: + plain: 'The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio + of total floating-point operations (FLOPs) to total bytes transferred between + the L2 cache and the L1 cache. This value is used as the x-coordinate for + the L2 roofline.' + rst: 'The Arithmetic Intensity (AI) relative to the L2 Cache. It is the ratio + of total floating-point operations (FLOPs) to total bytes transferred between + the L2 cache and the L1 cache. This value is used as the x-coordinate for + the L2 roofline.' + unit: FLOPs/Byte + AI HBM: + plain: 'The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM). + It is the ratio of total floating-point operations (FLOPs) to total bytes + transferred between HBM and the L2 cache. This value is used as the x-coordinate + for the HBM roofline.' + rst: 'The Arithmetic Intensity (AI) relative to High-Bandwidth Memory (HBM). + It is the ratio of total floating-point operations (FLOPs) to total bytes + transferred between HBM and the L2 cache. This value is used as the x-coordinate + for the HBM roofline.' + unit: FLOPs/Byte + Performance (GFLOPs): + plain: 'The overall achieved performance, measured in GigaFLOPs + per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point + operations divided by the total execution time. This value is used as the y-coordinate + for the kernel''s point on the Roofline plot.' + rst: 'The overall achieved performance, measured in GigaFLOPs + per second (GFLOP/s). This is calculated as the sum of all VALU and MFMA floating-point + operations divided by the total execution time. This value is used as the y-coordinate + for the kernel''s point on the Roofline plot.' + unit: GFLOP/s - id: 500 title: Command Processor (CPC/CPF) data source: