[rocprofiler-compute] Update baseline comparison notes in documentation (#2878)

* Update baseline comparison with anchor, text, samples, image in CLI page. Fixes broken 404 links after grafana was removed.

Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>

* Update options in list to full name, correct gpu id option.

Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>

* Formatting and broken intersphinx fixed

* Indentation formatting fixed

---------

Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
Co-authored-by: prbasyal <prbasyal@amd.com>
Co-authored-by: Pratik Basyal <pratik.basyal@amd.com>
This commit is contained in:
cfallows-amd
2026-01-27 16:04:21 -05:00
committed by GitHub
parent fdb19e5a4c
commit 4d7f709510
5 changed files: 158 additions and 111 deletions
Binary file not shown.


+146 -107
@@ -8,7 +8,7 @@ CLI analysis
This section provides an overview of ROCm Compute Profiler's CLI analysis features.
* :ref:`Derived metrics <cli-list-metrics>`: All of ROCm Compute Profiler's built-in metrics.
* :ref:`Derived metrics <cli-list-available-metrics>`: All of ROCm Compute Profiler's built-in metrics.
* :ref:`Baseline comparison <analysis-baseline-comparison>`: Compare multiple runs in a side-by-side manner.
@@ -310,35 +310,43 @@ There are three high-level GPU analysis views:
More analysis options
=====================
Single run
.. code-block:: shell
**Single run**
$ rocprof-compute analyze -p workloads/vcopy/MI200/
.. code-block:: shell
List top kernels and dispatches
.. code-block:: shell
$ rocprof-compute analyze -p workloads/vcopy/MI200/
$ rocprof-compute analyze -p workloads/vcopy/MI200/ --list-stats
**List top kernels and dispatches**
List metrics
.. code-block:: shell
.. code-block:: shell
$ rocprof-compute analyze -p workloads/vcopy/MI200/ --list-metrics gfx90a
$ rocprof-compute analyze -p workloads/vcopy/MI200/ --list-stats
List IP blocks
.. code-block:: shell
$ rocprof-compute analyze -p workloads/vcopy/MI200/ --list-blocks gfx90a
**List metrics**
Show the Description column, which is excluded by default in CLI output
.. code-block:: shell
.. code-block:: shell
$ rocprof-compute analyze -p workloads/vcopy/MI200/ --list-metrics gfx90a --include-cols Description
$ rocprof-compute analyze -p workloads/vcopy/MI200/ --list-metrics gfx90a
Show System Speed-of-Light and CS_Busy blocks only
.. code-block:: shell
**List IP blocks**
$ rocprof-compute analyze -p workloads/vcopy/MI200/ -b 2 5.1.0
.. code-block:: shell
$ rocprof-compute analyze -p workloads/vcopy/MI200/ --list-blocks gfx90a
**Show the Description column, which is excluded by default in CLI output**
.. code-block:: shell
$ rocprof-compute analyze -p workloads/vcopy/MI200/ --list-metrics gfx90a --include-cols Description
**Show System Speed-of-Light and CS_Busy blocks only**
.. code-block:: shell
$ rocprof-compute analyze -p workloads/vcopy/MI200/ -b 2 5.1.0
.. note::
@@ -347,68 +355,71 @@ Show System Speed-of-Light and CS_Busy blocks only
GPU Busy Cycles metric.
Filter kernels
First, list the top kernels in your application using ``--list-stats``.
**Filter kernels**
.. code-block::
First, list the top kernels in your application using ``--list-stats``.
$ rocprof-compute analyze -p workloads/vcopy/MI200/ --list-stats
.. code-block::
Analysis mode = cli
[analysis] deriving rocprofiler-compute metrics...
$ rocprof-compute analyze -p workloads/vcopy/MI200/ --list-stats
--------------------------------------------------------------------------------
Detected Kernels (sorted descending by duration)
╒════╤══════════════════════════════════════════════╕
│ │ Kernel_Name │
╞════╪══════════════════════════════════════════════╡
│ 0 │ vecCopy(double*, double*, double*, int, int) │
╘════╧══════════════════════════════════════════════╛
Analysis mode = cli
[analysis] deriving rocprofiler-compute metrics...
--------------------------------------------------------------------------------
Dispatch list
╒════╤═══════════════╤══════════════════════════════════════════════╤══════════
│ │ Dispatch_ID │ Kernel_Name │ GPU_ID │
╞════╪═══════════════╪══════════════════════════════════════════════╪══════════
│ 0 │ 0 │ vecCopy(double*, double*, double*, int, int) │ 0
╘════╧═══════════════╧══════════════════════════════════════════════╧══════════
--------------------------------------------------------------------------------
Detected Kernels (sorted descending by duration)
╒════╤══════════════════════════════════════════════╕
│ │ Kernel_Name │
╞════╪══════════════════════════════════════════════╡
│ 0 │ vecCopy(double*, double*, double*, int, int) │
╘════╧══════════════════════════════════════════════╛
Second, select the index of the kernel you would like to filter; for example,
``vecCopy(double*, double*, double*, int, int) [clone .kd]`` at index ``0``.
Then, use this index to apply the filter via ``-k`` or ``--kernels``.
--------------------------------------------------------------------------------
Dispatch list
╒════╤═══════════════╤══════════════════════════════════════════════╤══════════╕
│ │ Dispatch_ID │ Kernel_Name │ GPU_ID │
╞════╪═══════════════╪══════════════════════════════════════════════╪══════════╡
│ 0 │ 0 │ vecCopy(double*, double*, double*, int, int) │ 0 │
╘════╧═══════════════╧══════════════════════════════════════════════╧══════════╛
.. code-block:: shell-session
Second, select the index of the kernel you would like to filter; for example,
``vecCopy(double*, double*, double*, int, int) [clone .kd]`` at index ``0``.
Then, use this index to apply the filter via ``-k`` or ``--kernels``.
$ rocprof-compute analyze -p workloads/vcopy/MI200/ -k 0
.. code-block:: shell-session
Analysis mode = cli
[analysis] deriving rocprofiler-compute metrics...
$ rocprof-compute analyze -p workloads/vcopy/MI200/ -k 0
--------------------------------------------------------------------------------
0. Top Stats
0.1 Top Kernels
╒════╤══════════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╤═════╕
│ │ Kernel_Name │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │ S │
╞════╪══════════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╪═════╡
│ 0 │ vecCopy(double*, double*, double*, int, │ 1.00 │ 18560.00 │ 18560.00 │ 18560.00 │ 100.00 │ * │
│ │ int) │ │ │ │ │ │ │
╘════╧══════════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╧═════╛
...
Analysis mode = cli
[analysis] deriving rocprofiler-compute metrics...
You should see your filtered kernels indicated by an asterisk in the **Top
Stats** table.
--------------------------------------------------------------------------------
0. Top Stats
0.1 Top Kernels
╒════╤══════════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╤═════╕
│ │ Kernel_Name │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │ S │
╞════╪══════════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╪═════╡
│ 0 │ vecCopy(double*, double*, double*, int, │ 1.00 │ 18560.00 │ 18560.00 │ 18560.00 │ 100.00 │ * │
│ │ int) │ │ │ │ │ │ │
╘════╧══════════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╧═════╛
...
You should see your filtered kernels indicated by an asterisk in the **Top
Stats** table.
.. _per-kernel-roofline:
Per-kernel roofline analysis
When analyzing specific kernels, the roofline analysis provides detailed metrics for each filtered kernel:
**Per-kernel roofline analysis**
.. code-block:: shell-session
When analyzing specific kernels, the roofline analysis provides detailed metrics for each filtered kernel:
$ rocprof-compute analyze -p workloads/vcopy/MI200/ -k 0 -b 4
This generates enhanced roofline output showing per-kernel performance rates and arithmetic intensity calculations:
.. code-block:: shell-session
.. code-block:: text
$ rocprof-compute analyze -p workloads/vcopy/MI200/ -k 0 -b 4
This generates enhanced roofline output showing per-kernel performance rates and arithmetic intensity calculations:
.. code-block:: text
================================================================================
4. Roofline
@@ -455,24 +466,52 @@ Per-kernel roofline analysis
| ├─────────────┼──────────────────────┼─────────┼────────────┤
| │ 4.2.3 │ Performance (GFLOPs) │ │ Gflop/s │
| ╘═════════════╧══════════════════════╧═════════╧════════════╛
The per-kernel analysis uses YAML-based metric evaluation for accurate calculations.
Analyze multiple kernels for comparison:
The per-kernel analysis uses YAML-based metric evaluation for accurate calculations.
.. code-block:: shell-session
Analyze multiple kernels for comparison:
$ rocprof-compute analyze -p workloads/vcopy/MI200/ -k 0 1 2 -b 4
.. code-block:: shell-session
Baseline comparison
.. code-block:: shell
$ rocprof-compute analyze -p workloads/vcopy/MI200/ -k 0 1 2 -b 4
rocprof-compute analyze -p workload1/path/ -p workload2/path/
.. _analysis-baseline-comparison:
OR
**Baseline comparison**
.. code-block:: shell
Baseline comparison lets you check the A/B effect of a change by viewing runs side by side. Currently, baseline comparison is limited to workloads collected on the same :ref:`SoC <def-soc>`; cross-comparison between SoCs is in development.
For both the current workload and the baseline workload, you can independently set up the following filters to allow fine-grained comparisons:
* Workload Name with ``--path``
* GPU ID filtering (multi-selection) with ``--gpu-id``
* Kernel Name filtering (multi-selection) with ``--kernel``
* Dispatch ID filtering (regex filtering) with ``--dispatch``
* ROCm Compute Profiler panels/blocks (multi-selection) with ``--block``
.. code-block:: shell
rocprof-compute analyze -p [path1] [path2] ... [pathN]
.. code-block:: shell
rocprof-compute analyze -p [path1] [options for path1] ... -p [pathN] [options for pathN]
Examples:
.. code-block:: shell
rocprof-compute analyze -p workloads/workload_1/gpu_arch/ -k 0 -b 2 -p workloads/workload_2/gpu_arch/ -k 1 -b 2
.. code-block:: shell
rocprof-compute analyze -p workloads/workload_1/gpu_arch/ workloads/workload_2/gpu_arch/ ... workloads/workload_7/gpu_arch/ -b 12
.. image:: ../../data/analyze/cli/baseline_comparison.png
:align: center
:alt: Baseline Comparison example of LDS block among 7 runs
:width: 800
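The per-workload option syntax above can be sketched as a small dry-run helper. This is only an illustration: ``build_baseline_cmd`` and its ``path:kernel:block`` argument format are hypothetical and not part of ROCm Compute Profiler. The function prints the command instead of running it, so the result can be reviewed first.

```shell
# Hypothetical helper (not shipped with rocprof-compute): assemble a baseline
# comparison command from "path:kernel:block" triples and print it.
# Each argument must contain exactly two colons; leave a field empty to skip
# that filter (for example "workloads/run_a/::12").
build_baseline_cmd() {
    cmd="rocprof-compute analyze"
    for spec in "$@"; do
        path=${spec%%:*}      # text before the first colon
        rest=${spec#*:}       # text after the first colon
        kernel=${rest%%:*}    # kernel index for -k (may be empty)
        block=${rest#*:}      # block id for -b (may be empty)
        cmd="$cmd -p $path"
        if [ -n "$kernel" ]; then cmd="$cmd -k $kernel"; fi
        if [ -n "$block" ]; then cmd="$cmd -b $block"; fi
    done
    printf '%s\n' "$cmd"
}

# Prints a command equivalent to the first example above.
build_baseline_cmd "workloads/workload_1/gpu_arch/:0:2" "workloads/workload_2/gpu_arch/:1:2"
```

Printing rather than executing keeps the sketch safe to try on a machine without profiling data. The short ``-p``, ``-k``, and ``-b`` forms are the ones used in the examples in this section.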
rocprof-compute analyze -p workload1/path/ -k 0 -p workload2/path/ -k 1
Analysis output format
======================
@@ -538,37 +577,37 @@ Analysis database example
$ rocprof-compute analyze --verbose --output-name test --output-format db -p workloads/nbody/MI300X_A1 -p workloads/nbody1/MI300X_A1
DEBUG Execution mode = analyze
__ _
_ __ ___ ___ _ __ _ __ ___ / _| ___ ___ _ __ ___ _ __ _ _| |_ ___
| '__/ _ \ / __| '_ \| '__/ _ \| |_ _____ / __/ _ \| '_ ` _ \| '_ \| | | | __/ _ \
| | | (_) | (__| |_) | | | (_) | _|_____| (_| (_) | | | | | | |_) | |_| | || __/
|_| \___/ \___| .__/|_| \___/|_| \___\___/|_| |_| |_| .__/ \__,_|\__\___|
|_| |_|
__ _
_ __ ___ ___ _ __ _ __ ___ / _| ___ ___ _ __ ___ _ __ _ _| |_ ___
| '__/ _ \ / __| '_ \| '__/ _ \| |_ _____ / __/ _ \| '_ ` _ \| '_ \| | | | __/ _ \
| | | (_) | (__| |_) | | | (_) | _|_____| (_| (_) | | | | | | |_) | |_| | || __/
|_| \___/ \___| .__/|_| \___/|_| \___\___/|_| |_| |_| .__/ \__,_|\__\___|
|_| |_|
INFO Analysis mode = db
INFO ed45b0b189
DEBUG [omnisoc init]
INFO ed45b0b189
DEBUG [omnisoc init]
DEBUG [analysis] prepping to do some analysis
INFO [analysis] deriving rocprofiler-compute metrics...
DEBUG Collected roofline ceilings
WARNING PC sampling data not found for /app/projects/rocprofiler-compute/workloads/nbody/MI300X_A1.
WARNING PC sampling data not found for /app/projects/rocprofiler-compute/workloads/nbody1/MI300X_A1.
DEBUG Collected dispatch data
DEBUG Applied analysis mode filters
DEBUG Calculated dispatch data
DEBUG Collected metrics data
WARNING Failed to evaluate expression for 3.1.39 - Value: to_round((to_avg(
(pmc_df.get("pmc_perf_ACCUM") / pmc_df.get("SQC_ICACHE_REQ")).where((pmc_df.get("SQC_ICACHE_REQ") != 0), None)) * 100), 0) - unsupported operand type(s) for /: 'NoneType' and 'float'
WARNING Failed to evaluate expression for 3.1.39 - Value: to_round((to_avg(
(pmc_df.get("pmc_perf_ACCUM") / pmc_df.get("SQC_ICACHE_REQ")).where((pmc_df.get("SQC_ICACHE_REQ") != 0), None)) * 100), 0) - unsupported operand type(s) for /: 'NoneType' and 'float'
DEBUG Calculated metric values
DEBUG Calculated roofline data points
DEBUG [analysis] generating analysis
DEBUG SQLite database initialized with name: test.db
DEBUG Initialized database: test.db
INFO ed45b0b189
INFO ed45b0b189
DEBUG Completed writing database
WARNING Created file: test.db
INFO Analysis mode = db
INFO ed45b0b189
DEBUG [omnisoc init]
INFO ed45b0b189
DEBUG [omnisoc init]
DEBUG [analysis] prepping to do some analysis
INFO [analysis] deriving rocprofiler-compute metrics...
DEBUG Collected roofline ceilings
WARNING PC sampling data not found for /app/projects/rocprofiler-compute/workloads/nbody/MI300X_A1.
WARNING PC sampling data not found for /app/projects/rocprofiler-compute/workloads/nbody1/MI300X_A1.
DEBUG Collected dispatch data
DEBUG Applied analysis mode filters
DEBUG Calculated dispatch data
DEBUG Collected metrics data
WARNING Failed to evaluate expression for 3.1.39 - Value: to_round((to_avg(
(pmc_df.get("pmc_perf_ACCUM") / pmc_df.get("SQC_ICACHE_REQ")).where((pmc_df.get("SQC_ICACHE_REQ") != 0), None)) * 100), 0) - unsupported operand type(s) for /: 'NoneType' and 'float'
WARNING Failed to evaluate expression for 3.1.39 - Value: to_round((to_avg(
(pmc_df.get("pmc_perf_ACCUM") / pmc_df.get("SQC_ICACHE_REQ")).where((pmc_df.get("SQC_ICACHE_REQ") != 0), None)) * 100), 0) - unsupported operand type(s) for /: 'NoneType' and 'float'
DEBUG Calculated metric values
DEBUG Calculated roofline data points
DEBUG [analysis] generating analysis
DEBUG SQLite database initialized with name: test.db
DEBUG Initialized database: test.db
INFO ed45b0b189
INFO ed45b0b189
DEBUG Completed writing database
WARNING Created file: test.db
+1
@@ -26,6 +26,7 @@ For using profiling options for PC sampling the configuration needed are:
**Sample command:**
.. code-block:: shell
$ rocprof-compute profile -n try_live_attach_detach -b 3.1.1 4.1.1 5.1.1 --no-roof -VVV --attach-pid <process id of workload>
$ rocprof-compute profile -n try_live_attach_detach --set launch_stats --no-roof -VVV --attach-pid <process id of workload>
+9 -4
@@ -476,7 +476,7 @@ of the application (note zero-based indexing).
.. _profiling-metric-sets:
Metric sets filtering
^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^
A metric set contains a subset of metrics that can be collected in a single pass. This filtering option minimizes profiling overhead by only collecting counters of interest.
The ``--set`` filter option provides a convenient way to group related metrics for common profiling scenarios, eliminating the need to manually specify individual metrics for typical analysis workflows.
@@ -556,8 +556,11 @@ Roofline analysis occurs on any profile mode run, provided ``--no-roof`` option
You don't need to include any additional roofline-specific options for roofline analysis.
If you want to focus only on roofline-specific performance data and reduce the time it takes to profile, you can use the ``--roof-only`` option.
This option checks if there is existing profiling data in the workload directory (``pmc_perf.csv`` and ``roofline.csv``):
a) If found, uses the data files with the provided arguments to create another roofline HTML output; otherwise,
b) Profile mode runs but is limited to collecting only roofline performance counters.
a) If found, uses the data files with the provided arguments to create another roofline HTML output; otherwise,
b) Profile mode runs but is limited to collecting only roofline performance counters.
Note that ``--roof-only`` cannot be used with ``--block`` or ``--set`` options.
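The check described above can be approximated before profiling with a short shell sketch. ``roof_only_branch`` is a hypothetical helper, not a ROCm Compute Profiler command, and it assumes that both ``pmc_perf.csv`` and ``roofline.csv`` must be present for the existing data to be reused; the tool's actual check may differ.

```shell
# Hypothetical helper: guess which --roof-only branch applies to a workload
# directory. Assumption: both data files must exist for reuse.
roof_only_branch() {
    dir=$1
    if [ -f "$dir/pmc_perf.csv" ] && [ -f "$dir/roofline.csv" ]; then
        echo "reuse"      # case a) existing data files would be used
    else
        echo "profile"    # case b) roofline-only counters would be collected
    fi
}
```

For example, ``roof_only_branch workloads/vcopy/MI200`` prints ``reuse`` only when both files are already in that directory.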
Roofline options
@@ -633,6 +636,8 @@ The following example demonstrates profiling roofline data only:
GPU Device 0 (gfx942) with 304 CUs: Profiling...
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
...
An inspection of our workload output folder shows ``.html`` plots were generated
successfully.
@@ -941,7 +946,7 @@ The following example demonstrates how to use iteration multiplexing with the
Caveats
------
---------
The iteration multiplexing feature comes with some caveats to consider when profiling any workload:
+2
@@ -18,6 +18,8 @@ subtrees:
- file: how-to/use.rst
- file: how-to/pc_sampling.rst
title: Use PC sampling
- file: how-to/live_attach_detach.rst
title: Use Dynamic process attachment
- file: how-to/profile/mode.rst
- file: how-to/analyze/mode.rst
entries: