Responding to first round of Ben R.'s docs feedback.

This round includes fixes for comments up to the 'Performance Model' section. I will need to work with our documentation group to respond to those higher level comments.

Signed-off-by: colramos-amd <colramos@amd.com>
2024-03-25 12:13:13 -05:00
commit 20670da2b7
@@ -16,10 +16,10 @@ See sections below for more information on each.

 ### Features

- All of Omniperf's built-in metrics.
- Multiple runs base line comparison.
- Metrics customization: pick up subset of build-in metrics or build your own profiling configuration.
- Kernel, gpu-id, dispatch-id filters.
+- __Derived metrics__: All of Omniperf's built-in metrics.
+- __Baseline comparison__: Compare multiple runs in a side-by-side manner.
+- __Metric customization__: Isolate a subset of built-in metrics or build your own profiling configuration.
+- __Filtering__: Hone in on a particular kernel, gpu-id, and/or dispatch-id via post-process filtering.

 Run `omniperf analyze -h` for more details.

@@ -444,7 +444,7 @@ interface](https://rocm.github.io/omniperf/analysis.html#grafana-based-gui).
 #### Features
 The Omniperf Grafana GUI Analyzer supports the following features to facilitate MI GPU performance profiling and analysis:

- System and Hardware Component (IP Block) Speed-of-Light (SOL)
+- System and Hardware Component (Hardware Block) Speed-of-Light (SOL)
 - Multiple normalization options, including per-cycle, per-wave, per-kernel and per-second.
 - Baseline comparisons 
 - Regex based Dispatch ID filtering
@@ -470,7 +470,7 @@ Multiple performance number normalizations are provided to allow performance ins
 - per second

 ##### Baseline Comparison
-Omniperf enables baseline comparison to allow checking A/B effect. The current release limits the baseline comparison to the same SoC. Cross comparison between SoCs is in development.
+Omniperf enables baseline comparison to allow checking A/B effect. Currently baseline comparison is limited to the same SoC. Cross comparison between SoCs is in development.

 For both the Current Workload and the Baseline Workload, one can independently setup the following filters to allow fine grained comparions:
 - Workload Name 
@@ -480,7 +480,7 @@ For both the Current Workload and the Baseline Workload, one can independently s
 - Omniperf Panels (multi-selection)

 ##### Regex based Dispatch ID filtering
-This release enables Regular Expression (regex), a standard Linux string matching syntax, based dispatch ID filtering to flexibly choose the kernel invocations. One may refer to [Regex Numeric Range Generator](https://3widgets.com/), to generate typical number ranges.
+Omniperf enables Regular Expression (regex), a standard Linux string matching syntax, based dispatch ID filtering to flexibly choose the kernel invocations. One may refer to [Regex Numeric Range Generator](https://3widgets.com/), to generate typical number ranges.

 For example, if one wants to inspect Dispatch Range from 17 to 48, inclusive, the corresponding regex is : **(1[7-9]|[23]\d|4[0-8])**. The generated expression can be copied over for filtering.
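
As a quick sanity check on the expression above, the following minimal Python sketch confirms which IDs it selects. It assumes the filter must match the whole dispatch ID string; how the Grafana panel anchors the expression is not stated here, and `matching_ids` is a hypothetical helper, not part of Omniperf.

```python
import re

# Pattern copied from the text above; intended to cover dispatch IDs 17..48.
DISPATCH_RANGE = re.compile(r"(1[7-9]|[23]\d|4[0-8])")


def matching_ids(pattern, candidates):
    """Return the candidate IDs whose decimal form fully matches the pattern.

    Hypothetical helper for illustration only; full-string matching is an
    assumption about how the filter is applied.
    """
    return [i for i in candidates if pattern.fullmatch(str(i))]


selected = matching_ids(DISPATCH_RANGE, range(0, 100))
print(selected[0], selected[-1], len(selected))  # 17 48 32
```

If the filter were applied as an unanchored search instead, IDs such as 117 or 480 would also match, so anchoring the expression (for example with `^` and `$`) may be worth considering.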

@@ -16,7 +16,7 @@
    ```shell
    $ omniperf profile -n vcopy_data -- ./vcopy -n 1048576 -b 256
    ```
-    The app runs, each kernel is launched, and profiling results are generated. By default, results are written to e.g., ./workloads/vcopy_data (configurable via the `-n` argument). To collect all requested profile information, it may be required to replay kernels multiple times.
+    The app runs, each kernel is launched, and profiling results are generated. By default, results are written to a subdirectory with your accelerator's name e.g., ./workloads/vcopy_data/MI200/ (where name is configurable via the `-n` argument). To collect all requested profile information, it may be required to replay kernels multiple times.

 2. **Customize data collection**

@@ -29,14 +29,14 @@
    - `-d`/`--dispatch` enables filtering based on dispatch ID.
    - `-b`/`--block` enables collects metrics for only the specified (one or more) hardware component blocks.

-    To view available metrics by IP Block you can use the `--list-metrics` argument:
+    To view available metrics by hardware block you can use the `--list-metrics` argument:
    ```shell
    $ omniperf analyze --list-metrics <sys_arch>
    ```

 3. **Analyze at the command line**

-   After generating a local output folder (./workloads/\<name>), the command line tool can also be used to quickly interface with profiling results. View different metrics derived from your profiled results and get immediate access all metrics organized by IP blocks.
+   After generating a local output folder (./workloads/\<name>), the command line tool can also be used to quickly interface with profiling results. View different metrics derived from your profiled results and get immediate access to all metrics organized by hardware blocks.

   If no kernel, dispatch, or hardware block filters are applied at this stage, analysis will be reflective of the entirety of the profiling data.

@@ -10,7 +10,7 @@ Omniperf is broken into two installation components:

 1. **Omniperf Client-side (_Required_)**
   - Provides core application profiling capability
-   - Allows collection of performance counters, filtering by IP block, dispatch, kernel, etc
+   - Allows collection of performance counters, filtering by hardware block, dispatch, kernel, etc
   - CLI based analysis mode
   - Stand alone web interface for importing analysis metrics
 2. **Omniperf Server-side (_Optional_)**
@@ -14,7 +14,7 @@ This project is proudly open source, and we welcome all feedback! For more detai

 ## What is Omniperf

-Omniperf is a kernel level profiling tool for Machine Learning/HPC workloads running on AMD Instinct (tm) MI accelerators. AMD's Instinct (tm) MI accelerators are Data Center GPUs designed for compute and with some graphics functions disabled or removed. Omniperf is currently built on top of [rocProf](https://rocm.docs.amd.com/projects/rocprofiler/en/latest/rocprof.html) to monitor hardware performance counters. The Omniperf tool primarily targets accelerators in the MI100 and MI200 families. Development is in progress to support AMD Instinct (tm) MI300 and Radeon (tm) RDNA (tm) GPUs.
+Omniperf is a kernel level profiling tool for Machine Learning/HPC workloads running on AMD Instinct (tm) MI accelerators. AMD's Instinct (tm) MI accelerators are Data Center GPUs designed for compute and with some graphics functions disabled or removed. Omniperf is currently built on top of [rocProf](https://rocm.docs.amd.com/projects/rocprofiler/en/latest/rocprof.html) to monitor hardware performance counters. The Omniperf tool primarily targets accelerators in the MI100, MI200, and MI300 families. Development is in progress to support Radeon (tm) RDNA (tm) GPUs.

 ## Features

@@ -58,4 +58,4 @@ Detailed Feature List:
 | Vega 20 (MI50/60) | No support |
 | MI100             | Supported  |
 | MI200             | Supported  |
-| MI300             | Support    |
+| MI300             | Supported  |
@@ -38,7 +38,7 @@ Releasing CPU memory
 ```

 ## Omniperf Profiling
-The *omniperf* script, available through the Omniperf repository, is used to aquire all necessary performance monitoring data through analysis of compute workloads.
+The *omniperf* executable, available through the Omniperf repository, is used to acquire all necessary performance monitoring data through analysis of compute workloads.

 **omniperf help:**
 ```shell-session
@@ -101,13 +101,6 @@ Standalone Roofline Options:
  --kernel-names                                        Include kernel names in roofline plot.
 ```

- The `-k` \<kernel> flag allows for kernel filtering, which is compatible with the current rocProf utility.
-
- The `-d` \<dispatch> flag allows for dispatch ID filtering,  which is compatible with the current rocProf utility.
-
- The `-b` \<ipblocks> allows system profiling on one or more selected hardware components to speed up the profiling process. One can gradually include more hardware components, without overwriting performance data acquired on other hardware components.
-
-
 The following sample command profiles the *vcopy* workload.

 **vcopy profiling:**
@@ -128,7 +121,7 @@ Target: MI200
 Command: ./vcopy -n 1048576 -b 256
 Kernel Selection: None
 Dispatch Selection: None
-IP Blocks: All
+Hardware Blocks: All

 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Collecting Performance Counters
@@ -213,6 +206,7 @@ etc.  The SoC names are generated as a part of Omniperf, and do not necessarily
 $ ls workloads/vcopy/MI200/
 total 112
 total 60
+-rw-r--r-- 1 auser agroup 27937 Mar  1 15:15 log.txt
 drwxr-xr-x 1 auser agroup     0 Mar  1 15:15 perfmon
 -rw-r--r-- 1 auser agroup 26175 Mar  1 15:15 pmc_perf.csv
 -rw-r--r-- 1 auser agroup  1708 Mar  1 15:17 roofline.csv
@@ -232,11 +226,11 @@ To reduce profiling time and the counters collected one may use profiling filter

 Filtering Options:

- The `-k` \<kernel> flag allows for kernel filtering. Useage is equivalent with the current rocProf utility ([see details below](#kernel-filtering)).
+- The `-k` / `--kernel` flag allows for kernel filtering. Usage is equivalent to the current rocProf utility ([see details below](#kernel-filtering)).

- The `-d` \<dispatch> flag allows for dispatch ID filtering. Useage is equivalent with the current rocProf utility ([see details below](#dispatch-filtering)).
+- The `-d` / `--dispatch` flag allows for dispatch ID filtering. Usage is equivalent to the current rocProf utility ([see details below](#dispatch-filtering)).

- The `-b` \<ipblocks> allows system profiling on one or more selected hardware components to speed up the profiling process. One can gradually include more hardware components, without overwriting performance data acquired on other hardware components.
+- The `-b` / `--block` flag allows system profiling on one or more selected hardware components to speed up the profiling process ([see details below](#hardware-component-filtering)).

 ```{note}
 Be cautious while combining different profiling filters in the same call. Conflicting filters may result in error.
@@ -245,7 +239,7 @@ i.e. filtering dispatch X, but dispatch X does not match your kernel name filter
 ```

 #### Hardware Component Filtering
-One can profile specific hardware components to speed up the profiling process. In Omniperf, we use the term IP block to refer to a hardware component or a group of hardware components. All profiling results are accumulated in the same target directory, without overwriting those for other hardware components, hence enabling the incremental profiling and analysis.
+One can profile specific hardware components to speed up the profiling process. In Omniperf, we use the term hardware block to refer to a hardware component or a group of hardware components. All profiling results are accumulated in the same target directory, without overwriting those for other hardware components, hence enabling the incremental profiling and analysis.

 The following example only gathers hardware counters for the Shader Sequencer (SQ) and L2 Cache (TCC) components, skipping all other hardware components:
 ```shell-session
@@ -280,7 +274,7 @@ Target: MI200
 Command: ./vcopy -n 1048576 -b 256
 Kernel Selection: None
 Dispatch Selection: None
-IP Blocks: ['sq', 'tcc']
+Hardware Blocks: ['sq', 'tcc']

 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Collecting Performance Counters
@@ -309,7 +303,7 @@ Target: MI200
 Command: ./vcopy -n 1048576 -b 256
 Kernel Selection: ['vecCopy']
 Dispatch Selection: None
-IP Blocks: All
+Hardware Blocks: All

 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Collecting Performance Counters
@@ -320,7 +314,7 @@ Collecting Performance Counters
 #### Dispatch Filtering
 Dispatch filtering is based on the *global* dispatch index of kernels in a run. 

-The following example profiles only the 0th dispatched kernel in execution of the application:
+The following example profiles only the first kernel dispatch in execution of the application (please note zero-based indexing):
 ```shell-session
 $ omniperf profile --name vcopy -d 0 -- ./vcopy -n 1048576 -b 256

@@ -338,7 +332,7 @@ Target: MI200
 Command: ./vcopy -n 1048576 -b 256
 Kernel Selection: None
 Dispatch Selection: ['0']
-IP Blocks: All
+Hardware Blocks: All

 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Collecting Performance Counters
@@ -327,9 +327,9 @@ class OmniProfiler_Base:
        console_log("Kernel Selection: " + str(self.__args.kernel))
        console_log("Dispatch Selection: " + str(self.__args.dispatch))
        if self.__args.ipblocks == None:
-            console_log("IP Blocks: All")
+            console_log("Hardware Blocks: All")
        else:
-            console_log("IP Blocks: " + str(self.__args.ipblocks))
+            console_log("Hardware Blocks: " + str(self.__args.ipblocks))

        print_status("Collecting Performance Counters")