documentation updates for 2.x work - current set of updates are mostly tied to two changes:
(1) vcopy examples require updated command-line args
(2) profiling output directory name format changed

Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
Committed by: Karl W. Schulz
Parent: 65150e2384
Commit: bfc0dea1de
+125
-132
@@ -27,153 +27,146 @@ Run `omniperf analyze -h` for more details.
|
||||
|
||||
1) To begin, generate a comprehensive analysis report with Omniperf CLI.
|
||||
```shell-session
|
||||
$ omniperf analyze -p workloads/vcopy/mi200/
|
||||
|
||||
--------
|
||||
Analyze
|
||||
--------
|
||||
$ omniperf analyze -p workloads/vcopy/MI200/
|
||||
|
||||
Analysis mode = cli
|
||||
SoC = {'MI200'}
|
||||
[analysis] deriving Omniperf metrics...
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
0. Top Stat
|
||||
0. Top Stats
|
||||
0.1 Top Kernels
|
||||
╒════╤══════════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
|
||||
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
|
||||
│ │ Kernel_Name │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
|
||||
╞════╪══════════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╡
|
||||
│ 0 │ vecCopy(double*, double*, double*, int, │ 1 │ 20000.00 │ 20000.00 │ 20000.00 │ 100.00 │
|
||||
│ │ int) [clone .kd] │ │ │ │ │ │
|
||||
│ 0 │ vecCopy(double*, double*, double*, int, │ 1.00 │ 20480.00 │ 20480.00 │ 20480.00 │ 100.00 │
|
||||
│ │ int) │ │ │ │ │ │
|
||||
╘════╧══════════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╛
|
||||
0.2 Dispatch List
|
||||
╒════╤═══════════════╤══════════════════════════════════════════════╤══════════╕
|
||||
│ │ Dispatch_ID │ Kernel_Name │ GPU_ID │
|
||||
╞════╪═══════════════╪══════════════════════════════════════════════╪══════════╡
|
||||
│ 0 │ 0 │ vecCopy(double*, double*, double*, int, int) │ 2 │
|
||||
╘════╧═══════════════╧══════════════════════════════════════════════╧══════════╛
|
||||
|
||||
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
1. System Info
|
||||
╒══════════════════╤═══════════════════════════════════════════════╕
|
||||
│ │ Info │
|
||||
╞══════════════════╪═══════════════════════════════════════════════╡
|
||||
│ workload_name │ vcopy │
|
||||
├──────────────────┼───────────────────────────────────────────────┤
|
||||
│ command │ /home/colramos/vcopy 1048576 256 │
|
||||
├──────────────────┼───────────────────────────────────────────────┤
|
||||
│ host_name │ sv-pdp-2 │
|
||||
├──────────────────┼───────────────────────────────────────────────┤
|
||||
│ host_cpu │ AMD EPYC 7282 16-Core Processor │
|
||||
├──────────────────┼───────────────────────────────────────────────┤
|
||||
│ host_distro │ Ubuntu 20.04.3 LTS │
|
||||
├──────────────────┼───────────────────────────────────────────────┤
|
||||
│ host_kernel │ 5.15.0-43-generic │
|
||||
├──────────────────┼───────────────────────────────────────────────┤
|
||||
│ host_rocmver │ 5.2.1-79 │
|
||||
├──────────────────┼───────────────────────────────────────────────┤
|
||||
│ date │ Fri Jan 20 11:22:20 2023 (CST) │
|
||||
├──────────────────┼───────────────────────────────────────────────┤
|
||||
│ gpu_soc │ gfx90a │
|
||||
├──────────────────┼───────────────────────────────────────────────┤
|
||||
│ numSE │ 8 │
|
||||
├──────────────────┼───────────────────────────────────────────────┤
|
||||
│ numCU │ 104 │
|
||||
├──────────────────┼───────────────────────────────────────────────┤
|
||||
│ numSIMD │ 4 │
|
||||
├──────────────────┼───────────────────────────────────────────────┤
|
||||
│ waveSize │ 64 │
|
||||
├──────────────────┼───────────────────────────────────────────────┤
|
||||
│ maxWavesPerCU │ 32 │
|
||||
├──────────────────┼───────────────────────────────────────────────┤
|
||||
│ maxWorkgroupSize │ 1024 │
|
||||
├──────────────────┼───────────────────────────────────────────────┤
|
||||
│ L1 │ 16 │
|
||||
├──────────────────┼───────────────────────────────────────────────┤
|
||||
│ L2 │ 8192 │
|
||||
├──────────────────┼───────────────────────────────────────────────┤
|
||||
│ sclk │ 1700 │
|
||||
├──────────────────┼───────────────────────────────────────────────┤
|
||||
│ mclk │ 1600 │
|
||||
├──────────────────┼───────────────────────────────────────────────┤
|
||||
│ cur_sclk │ 800 │
|
||||
├──────────────────┼───────────────────────────────────────────────┤
|
||||
│ cur_mclk │ 1600 │
|
||||
├──────────────────┼───────────────────────────────────────────────┤
|
||||
│ L2Banks │ 32 │
|
||||
├──────────────────┼───────────────────────────────────────────────┤
|
||||
│ name │ mi200 │
|
||||
├──────────────────┼───────────────────────────────────────────────┤
|
||||
│ numSQC │ 56 │
|
||||
├──────────────────┼───────────────────────────────────────────────┤
|
||||
│ hbmBW │ 1638.4 │
|
||||
├──────────────────┼───────────────────────────────────────────────┤
|
||||
│ ip_blocks │ roofline|SQ|LDS|SQC|TA|TD|TCP|TCC|SPI|CPC|CPF │
|
||||
╘══════════════════╧═══════════════════════════════════════════════╛
|
||||
|
||||
╒═══════════════════╤═══════════════════════════════════════════════╕
|
||||
│ │ Info │
|
||||
╞═══════════════════╪═══════════════════════════════════════════════╡
|
||||
│ workload_name │ vcopy │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ command │ ./vcopy -n 1048576 -b 256 │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ host_name │ t007-002 │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ host_cpu │ AMD EPYC 7V13 64-Core Processor │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ sbios │ American Megatrends Inc.0602 │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ host_distro │ Rocky Linux 9.1 (Blue Onyx) │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ host_kernel │ 5.14.0-162.18.1.el9_1.x86_64 │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ host_rocmver │ 5.7.1-98 │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ date │ Fri Mar 1 15:32:43 2024 (CST) │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ gpu_soc │ gfx90a │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ vbios │ 113-D67301-059 │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ numSE │ 8 │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ numCU │ 104 │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ numSIMD │ 4 │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ waveSize │ 64 │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ maxWavesPerCU │ 32 │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ maxWorkgroupSize │ 1024 │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ L1 │ 16 │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ L2 │ 8192 │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ sclk │ 1700 │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ mclk │ 1600 │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ cur_sclk │ 1700 │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ cur_mclk │ 1600 │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ L2Banks │ 32 │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ totalL2Banks │ 32 │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ LDSBanks │ 32 │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ name │ MI200 │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ numSQC │ 56 │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ numPipes │ 4 │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ hbmBW │ 1638.4 │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
│ ip_blocks │ roofline|SQ|LDS|SQC|TA|TD|TCP|TCC|SPI|CPC|CPF │
|
||||
├───────────────────┼───────────────────────────────────────────────┤
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
2. System Speed-of-Light
|
||||
....
|
||||
```
|
||||
2. Use `--list-metrics` to generate a list of available metrics for inspection
|
||||
```shell-session
|
||||
$ omniperf analyze -p workloads/vcopy/mi200/ --list-metrics gfx90a
|
||||
╒═════════╤═════════════════════════════╕
|
||||
│ │ Metric │
|
||||
╞═════════╪═════════════════════════════╡
|
||||
│ 0 │ Top Stat │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 1 │ System Info │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 2.1.0 │ VALU_FLOPs │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 2.1.1 │ VALU_IOPs │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 2.1.2 │ MFMA_FLOPs_(BF16) │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 2.1.3 │ MFMA_FLOPs_(F16) │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 2.1.4 │ MFMA_FLOPs_(F32) │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 2.1.5 │ MFMA_FLOPs_(F64) │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 2.1.6 │ MFMA_IOPs_(Int8) │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 2.1.7 │ Active_CUs │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 2.1.8 │ SALU_Util │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 2.1.9 │ VALU_Util │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 2.1.10 │ MFMA_Util │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 2.1.11 │ VALU_Active_Threads/Wave │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 2.1.12 │ IPC_-_Issue │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 2.1.13 │ LDS_BW │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 2.1.14 │ LDS_Bank_Conflict │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 2.1.15 │ Instr_Cache_Hit_Rate │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 2.1.16 │ Instr_Cache_BW │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 2.1.17 │ Scalar_L1D_Cache_Hit_Rate │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 2.1.18 │ Scalar_L1D_Cache_BW │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 2.1.19 │ Vector_L1D_Cache_Hit_Rate │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 2.1.20 │ Vector_L1D_Cache_BW │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 2.1.21 │ L2_Cache_Hit_Rate │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 2.1.22 │ L2-Fabric_Read_BW │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 2.1.23 │ L2-Fabric_Write_BW │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 2.1.24 │ L2-Fabric_Read_Latency │
|
||||
├─────────┼─────────────────────────────┤
|
||||
│ 2.1.25 │ L2-Fabric_Write_Latency │
|
||||
├─────────┼─────────────────────────────┤
|
||||
$ omniperf analyze -p workloads/vcopy/MI200/ --list-metrics gfx90a
|
||||
Execution mode = analyze
|
||||
|
||||
Analysis mode = cli
|
||||
SoC = {'MI200'}
|
||||
[analysis] deriving Omniperf metrics...
|
||||
0 -> Top Stats
|
||||
1 -> System Info
|
||||
2 -> System Speed-of-Light
|
||||
2.1 -> Speed-of-Light
|
||||
2.1.0 -> VALU FLOPs
|
||||
2.1.1 -> VALU IOPs
|
||||
2.1.2 -> MFMA FLOPs (BF16)
|
||||
2.1.3 -> MFMA FLOPs (F16)
|
||||
2.1.4 -> MFMA FLOPs (F32)
|
||||
2.1.5 -> MFMA FLOPs (F64)
|
||||
2.1.6 -> MFMA IOPs (Int8)
|
||||
2.1.7 -> Active CUs
|
||||
2.1.8 -> SALU Utilization
|
||||
2.1.9 -> VALU Utilization
|
||||
2.1.10 -> MFMA Utilization
|
||||
2.1.11 -> VMEM Utilization
|
||||
2.1.12 -> Branch Utilization
|
||||
2.1.13 -> VALU Active Threads
|
||||
2.1.14 -> IPC
|
||||
2.1.15 -> Wavefront Occupancy
|
||||
2.1.16 -> Theoretical LDS Bandwidth
|
||||
2.1.17 -> LDS Bank Conflicts/Access
|
||||
2.1.18 -> vL1D Cache Hit Rate
|
||||
2.1.19 -> vL1D Cache BW
|
||||
2.1.20 -> L2 Cache Hit Rate
|
||||
2.1.21 -> L2 Cache BW
|
||||
2.1.22 -> L2-Fabric Read BW
|
||||
2.1.23 -> L2-Fabric Write BW
|
||||
2.1.24 -> L2-Fabric Read Latency
|
||||
2.1.25 -> L2-Fabric Write Latency
|
||||
...
|
||||
```
|
||||
3. Choose your own customized subset of metrics with `-b` (a.k.a. `--metric`), or build your own config following [config_template](https://github.com/AMDResearch/omniperf/blob/main/src/omniperf_analyze/configs/panel_config_template.yaml). The example below generates a report containing only metric 2 (a.k.a. System Speed-of-Light).
|
||||
```shell-session
|
||||
$ omniperf analyze -p workloads/vcopy/mi200/ -b 2
|
||||
$ omniperf analyze -p workloads/vcopy/MI200/ -b 2
|
||||
--------
|
||||
Analyze
|
||||
--------
|
||||
@@ -261,24 +254,24 @@ Analyze
|
||||
|
||||
- Single run
|
||||
```shell
|
||||
$ omniperf analyze -p workloads/vcopy/mi200/
|
||||
$ omniperf analyze -p workloads/vcopy/MI200/
|
||||
```
|
||||
|
||||
- List top kernels
|
||||
```shell
|
||||
$ omniperf analyze -p workloads/vcopy/mi200/ --list-kernels
|
||||
$ omniperf analyze -p workloads/vcopy/MI200/ --list-kernels
|
||||
```
|
||||
|
||||
- List metrics
|
||||
|
||||
```shell
|
||||
$ omniperf analyze -p workloads/vcopy/mi200/ --list-metrics gfx90a
|
||||
$ omniperf analyze -p workloads/vcopy/MI200/ --list-metrics gfx90a
|
||||
```
|
||||
|
||||
- Customized profiling "System Speed-of-Light" and "CS_Busy" only
|
||||
|
||||
```shell
|
||||
$ omniperf analyze -p workloads/vcopy/mi200/ -b 2 5.1.0
|
||||
$ omniperf analyze -p workloads/vcopy/MI200/ -b 2 5.1.0
|
||||
```
|
||||
|
||||
> Note: Users can filter a single metric or a whole hardware component by its id. In this case, 2 is the id for "System Speed-of-Light" and 5.1.0 is the id for the metric "GPU Busy Cycles".
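
The id-based filtering described in the note can be pictured as a prefix match on the dotted id: selecting `2` keeps every entry beneath it (`2.1`, `2.1.0`, ...), while selecting `5.1.0` keeps only that one metric. The following is a minimal sketch of that idea, not Omniperf's actual implementation; the function name `matches_filter` and the id `5.1.1` are hypothetical and used only for illustration.

```python
def matches_filter(metric_id, selected):
    """Return True if metric_id equals a selected id or sits beneath one.

    Ids are compared component by component, so "2" selects "2.1.0"
    but does not select "20.1".
    """
    parts = metric_id.split(".")
    for sel in selected:
        sel_parts = sel.split(".")
        if parts[: len(sel_parts)] == sel_parts:
            return True
    return False


# Ids 0 through 2.1.1 appear in the metric listing in this document;
# 5.1.0 comes from the example command, and 5.1.1 is a hypothetical sibling.
available = ["0", "1", "2", "2.1", "2.1.0", "2.1.1", "5.1.0", "5.1.1"]
kept = [m for m in available if matches_filter(m, ["2", "5.1.0"])]
print(kept)
```

Comparing split components rather than raw string prefixes is a deliberate choice in this sketch: a plain `startswith("2")` would also match an unrelated id such as `20.1`.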
|
||||
@@ -287,7 +280,7 @@ Analyze
|
||||
|
||||
First, list the top kernels in your application using `--list-kernels`.
|
||||
```shell-session
|
||||
$ omniperf analyze -p workloads/vcopy/mi200/ --list-kernels
|
||||
$ omniperf analyze -p workloads/vcopy/MI200/ --list-kernels
|
||||
|
||||
--------
|
||||
Analyze
|
||||
@@ -373,7 +366,7 @@ See [FAQ](https://amdresearch.github.io/omniperf/faq.html) for more details on S
|
||||
To launch the standalone GUI, include the `--gui` flag with your desired analysis command. For example:
|
||||
|
||||
```shell-session
|
||||
$ omniperf analyze -p workloads/vcopy/mi200/ --gui
|
||||
$ omniperf analyze -p workloads/vcopy/MI200/ --gui
|
||||
|
||||
--------
|
||||
Analyze
|
||||
|
||||
@@ -1,7 +1,9 @@
|
||||
# AMD Instinct(tm) MI Series Accelerator Performance Model
|
||||
|
||||
```eval_rst
|
||||
.. sectionauthor:: Nicholas Curtis <nicholas.curtis@amd.com>
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 5
|
||||
```
|
||||
|
||||
Omniperf makes available an extensive list of metrics to better understand achieved application performance on AMD Instinct(tm) MI accelerators including Graphics Core Next (GCN) GPUs such as the AMD Instinct MI50, CDNA(tm) accelerators such as the MI100, and CDNA(tm) 2 accelerators such as MI250X/250/210.
|
||||
|
||||
+197
-223
@@ -23,11 +23,12 @@ the MI200 platform.
|
||||
$ hipcc vcopy.cpp -o vcopy
|
||||
$ ls
|
||||
vcopy vcopy.cpp
|
||||
$ ./vcopy 1048576 256
|
||||
$ ./vcopy -n 1048576 -b 256
|
||||
vcopy testing on GCD 0
|
||||
Finished allocating vectors on the CPU
|
||||
Finished allocating vectors on the GPU
|
||||
Finished copying vectors to the GPU
|
||||
sw thinks it moved 1.000000 KB per wave
|
||||
sw thinks it moved 1.000000 KB per wave
|
||||
Total threads: 1048576, Grid Size: 4096 block Size:256, Wavefronts:16384:
|
||||
Launching the kernel on the GPU
|
||||
Finished executing kernel
|
||||
@@ -42,70 +43,66 @@ The *omniperf* script, available through the Omniperf repository, is used to aqu
|
||||
**omniperf help:**
|
||||
```shell-session
|
||||
$ omniperf profile --help
|
||||
ROC Profiler: /usr/bin/rocprof
|
||||
|
||||
usage:
|
||||
|
||||
usage:
|
||||
|
||||
omniperf profile --name <workload_name> [profile options] [roofline options] -- <profile_cmd>
|
||||
|
||||
|
||||
|
||||
-------------------------------------------------------------------------------
|
||||
|
||||
|
||||
Examples:
|
||||
|
||||
omniperf profile -n vcopy_all -- ./vcopy 1048576 256
|
||||
|
||||
omniperf profile -n vcopy_SPI_TCC -b SQ TCC -- ./vcopy 1048576 256
|
||||
|
||||
omniperf profile -n vcopy_kernel -k vecCopy -- ./vcopy 1048576 256
|
||||
|
||||
omniperf profile -n vcopy_disp -d 0 -- ./vcopy 1048576 256
|
||||
|
||||
omniperf profile -n vcopy_roof --roof-only -- ./vcopy 1048576 256
|
||||
|
||||
|
||||
omniperf profile -n vcopy_all -- ./vcopy -n 1048576 -b 256
|
||||
omniperf profile -n vcopy_SPI_TCC -b SQ TCC -- ./vcopy -n 1048576 -b 256
|
||||
omniperf profile -n vcopy_kernel -k vecCopy -- ./vcopy -n 1048576 -b 256
|
||||
omniperf profile -n vcopy_disp -d 0 -- ./vcopy -n 1048576 -b 256
|
||||
omniperf profile -n vcopy_roof --roof-only -- ./vcopy -n 1048576 -b 256
|
||||
|
||||
-------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
||||
Help:
|
||||
-h, --help show this help message and exit
|
||||
-h, --help show this help message and exit
|
||||
|
||||
General Options:
|
||||
-v, --version show program's version number and exit
|
||||
-V, --verbose Increase output verbosity
|
||||
-v, --version show program's version number and exit
|
||||
-V, --verbose Increase output verbosity
|
||||
|
||||
Profile Options:
|
||||
-n , --name Assign a name to workload.
|
||||
-p , --path Specify path to save workload.
|
||||
(DEFAULT: /home/colramos/GitHub/omniperf/workloads/<name>)
|
||||
-k [ ...], --kernel [ ...] Kernel filtering.
|
||||
-b [ ...], --ipblocks [ ...] Hardware block filtering:
|
||||
SQ
|
||||
SQC
|
||||
TA
|
||||
TD
|
||||
TCP
|
||||
TCC
|
||||
SPI
|
||||
CPC
|
||||
CPF
|
||||
-d [ ...], --dispatch [ ...] Dispatch ID filtering.
|
||||
--no-roof Profile without collecting roofline data.
|
||||
-- [ ...] Provide command for profiling after double dash.
|
||||
-n , --name Assign a name to workload.
|
||||
-p , --path Specify path to save workload.
|
||||
|
||||
-k [ ...], --kernel [ ...] Kernel filtering.
|
||||
-d [ ...], --dispatch [ ...] Dispatch ID filtering.
|
||||
-b [ ...], --ipblocks [ ...] IP block filtering:
|
||||
SQ
|
||||
SQC
|
||||
TA
|
||||
TD
|
||||
TCP
|
||||
TCC
|
||||
SPI
|
||||
CPC
|
||||
CPF
|
||||
--join-type Choose how to join rocprof runs: (DEFAULT: grid)
|
||||
kernel (i.e. By unique kernel name dispatches)
|
||||
grid (i.e. By unique kernel name + grid size dispatches)
|
||||
--no-roof Profile without collecting roofline data.
|
||||
-- [ ...] Provide command for profiling after double dash.
|
||||
--kernel-verbose Specify Kernel Name verbose level 1-5. Lower the level, shorter the kernel name. (DEFAULT: 2) (DISABLE: 5)
|
||||
|
||||
Standalone Roofline Options:
|
||||
--roof-only Profile roofline data only.
|
||||
--sort Overlay top kernels or top dispatches: (DEFAULT: kernels)
|
||||
kernels
|
||||
dispatches
|
||||
-m , --mem-level Filter by memory level: (DEFAULT: ALL)
|
||||
HBM
|
||||
L2
|
||||
vL1D
|
||||
LDS
|
||||
--device GPU device ID. (DEFAULT: ALL)
|
||||
--kernel-names Include kernel names in roofline plot.
|
||||
--roof-only Profile roofline data only.
|
||||
--sort Overlay top kernels or top dispatches: (DEFAULT: kernels)
|
||||
kernels
|
||||
dispatches
|
||||
-m [ ...], --mem-level [ ...] Filter by memory level: (DEFAULT: ALL)
|
||||
HBM
|
||||
L2
|
||||
vL1D
|
||||
LDS
|
||||
--device GPU device ID. (DEFAULT: ALL)
|
||||
--kernel-names Include kernel names in roofline plot.
|
||||
|
||||
```
|
||||
|
||||
- The `-k` \<kernel> flag allows for kernel filtering, which is compatible with the current rocProf utility.
|
||||
@@ -119,36 +116,42 @@ The following sample command profiles the *vcopy* workload.
|
||||
|
||||
**vcopy profiling:**
|
||||
```shell-session
|
||||
$ omniperf profile --name vcopy -- ./vcopy 1048576 256
|
||||
Resolving rocprof
|
||||
ROC Profiler: /usr/bin/rocprof
|
||||
$ omniperf profile --name vcopy -- ./vcopy -n 1048576 -b 256
|
||||
ROC Profiler: /opt/rocm-5.7.1/bin/rocprof
|
||||
Execution mode = profile
|
||||
|
||||
-------------
|
||||
Profile only
|
||||
-------------
|
||||
___ _ __
|
||||
/ _ \ _ __ ___ _ __ (_)_ __ ___ _ __ / _|
|
||||
| | | | '_ ` _ \| '_ \| | '_ \ / _ \ '__| |_
|
||||
| |_| | | | | | | | | | | |_) | __/ | | _|
|
||||
\___/|_| |_| |_|_| |_|_| .__/ \___|_| |_|
|
||||
|_|
|
||||
|
||||
omniperf ver: 1.0.8-PR1
|
||||
Path: /home/colramos/GitHub/omniperf-pub/workloads
|
||||
Target: mi200
|
||||
Command: /home/colramos/vcopy 1048576 256
|
||||
Kernel Selection: None
|
||||
Dispatch Selection: None
|
||||
SoC = {'MI200'}
|
||||
Profiler choice = rocprofv1
|
||||
omniperf ver: 1.0.10
|
||||
Path: /home/auser/repos/omniperf/sample/workloads/vcopy/MI200
|
||||
Target: MI200
|
||||
Command: ./vcopy -n 1048576 -b 256
|
||||
Kernel Selection: None
|
||||
Dispatch Selection: None
|
||||
IP Blocks: All
|
||||
Log: /home/colramos/GitHub/omniperf-pub/workloads/vcopy/mi200/log.txt
|
||||
KernelName verbose: 2
|
||||
|
||||
/home/colramos/GitHub/omniperf-pub/workloads/vcopy/mi200/perfmon/SQ_INST_LEVEL_SMEM.txt
|
||||
RPL: on '230411_165021' from '/opt/rocm-5.2.1' in '/home/colramos/GitHub/omniperf-pub'
|
||||
RPL: profiling '""/home/colramos/vcopy 1048576 256""'
|
||||
RPL: input file '/home/colramos/GitHub/omniperf-pub/workloads/vcopy/mi200/perfmon/SQ_INST_LEVEL_SMEM.txt'
|
||||
RPL: output dir '/tmp/rpl_data_230411_165021_26406'
|
||||
RPL: result dir '/tmp/rpl_data_230411_165021_26406/input0_results_230411_165021'
|
||||
Finished allocating vectors on the CPU
|
||||
ROCProfiler: input from "/tmp/rpl_data_230411_165021_26406/input0.xml"
|
||||
Current input file: /home/auser/repos/omniperf/sample/workloads/vcopy/MI200/perfmon/pmc_perf_11.txt
|
||||
RPL: on '240301_151506' from '/opt/rocm-5.7.1' in '/home/auser/repos/omniperf/sample'
|
||||
RPL: profiling '""./vcopy -n 1048576 -b 256""'
|
||||
RPL: input file '/home/auser/repos/omniperf/sample/workloads/vcopy/MI200/perfmon/pmc_perf_11.txt'
|
||||
RPL: output dir '/tmp/rpl_data_240301_151506_553019'
|
||||
RPL: result dir '/tmp/rpl_data_240301_151506_553019/input0_results_240301_151506'
|
||||
ROCProfiler: input from "/tmp/rpl_data_240301_151506_553019/input0.xml"
|
||||
gpu_index =
|
||||
kernel =
|
||||
range =
|
||||
3 metrics
|
||||
SQ_INSTS_SMEM, SQ_INST_LEVEL_SMEM, SQ_ACCUM_PREV_HIRES
|
||||
8 metrics
|
||||
SQ_INSTS_VALU_MFMA_F16, SQ_INSTS_VALU_MFMA_BF16, SQ_INSTS_VALU_MFMA_F32, SQ_INSTS_VALU_MFMA_F64, SQ_VALU_MFMA_BUSY_CYCLES, SQ_INSTS_FLAT_LDS_ONLY, SQ_INSTS_VALU_MFMA_MOPS_I8, SQ_INSTS_VALU_MFMA_MOPS_F16
|
||||
vcopy testing on GCD 0
|
||||
Finished allocating vectors on the CPU
|
||||
Finished allocating vectors on the GPU
|
||||
Finished copying vectors to the GPU
|
||||
sw thinks it moved 1.000000 KB per wave
|
||||
@@ -159,58 +162,48 @@ Finished copying the output vector from the GPU to the CPU
|
||||
Releasing GPU memory
|
||||
Releasing CPU memory
|
||||
|
||||
... ...
|
||||
ROCPRofiler: 1 contexts collected, output directory /tmp/rpl_data_220527_130317_1787038/input_results_220527_130317
|
||||
File 'workloads/vcopy/mi200/timestamps.csv' is generating
|
||||
Total detected GPU devices: 2
|
||||
ROCPRofiler: 1 contexts collected, output directory /tmp/rpl_data_240301_151506_553019/input0_results_240301_151506
|
||||
File '/home/auser/repos/omniperf/sample/workloads/vcopy/MI200/pmc_perf_11.csv' is generating
|
||||
...
|
||||
[profiling] Kernel_Name shortening complete.
|
||||
|
||||
[roofline] Checking for roofline.csv in /home/auser/repos/omniperf/sample/workloads/vcopy/MI200
|
||||
[roofline] No roofline data found. Generating...
|
||||
Empirical Roofline Calculation
|
||||
Copyright © 2022 Advanced Micro Devices, Inc. All rights reserved.
|
||||
Total detected GPU devices: 4
|
||||
GPU Device 0: Profiling...
|
||||
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
|
||||
HBM BW, GPU ID: 0, workgroupSize:256, workgroups:2097152, experiments:100, traffic:8589934592 bytes, duration:6.2 ms, mean:1382.7 GB/sec, stdev=2.4 GB/sec
|
||||
HBM BW, GPU ID: 0, workgroupSize:256, workgroups:2097152, experiments:100, traffic:8589934592 bytes, duration:6.2 ms, mean:1388.0 GB/sec, stdev=3.1 GB/sec
|
||||
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
|
||||
L2 BW, GPU ID: 0, workgroupSize:256, workgroups:8192, experiments:100, traffic:687194767360 bytes, duration:157.9 ms, mean:4358.7 GB/sec, stdev=4.7 GB/sec
|
||||
L2 BW, GPU ID: 0, workgroupSize:256, workgroups:8192, experiments:100, traffic:687194767360 bytes, duration:136.5 ms, mean:5020.8 GB/sec, stdev=16.5 GB/sec
|
||||
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
|
||||
L1 BW, GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, traffic:26843545600 bytes, duration:3.3 ms, mean:8247.1 GB/sec, stdev=5.1 GB/sec
|
||||
L1 BW, GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, traffic:26843545600 bytes, duration:2.9 ms, mean:9229.5 GB/sec, stdev=2.9 GB/sec
|
||||
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
|
||||
LDS BW, GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, traffic:33554432000 bytes, duration:2.4 ms, mean:14246.3 GB/sec, stdev=29.5 GB/sec
|
||||
LDS BW, GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, traffic:33554432000 bytes, duration:1.9 ms, mean:17645.6 GB/sec, stdev=20.1 GB/sec
|
||||
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
|
||||
Peak FLOPs (FP32), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, FLOP:274877906944, duration:14.507 ms, mean:18949.6 GFLOPS, stdev=4.5 GFLOPS
|
||||
Peak FLOPs (FP32), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, FLOP:274877906944, duration:13.078 ms, mean:20986.9 GFLOPS, stdev=310.8 GFLOPS
|
||||
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
|
||||
Peak FLOPs (FP64), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, FLOP:137438953472, duration:7.5 ms, mean:18308.197266.1 GFLOPS, stdev=3.6 GFLOPS
|
||||
Peak FLOPs (FP64), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, FLOP:137438953472, duration:6.7 ms, mean:20408.029297.1 GFLOPS, stdev=2.7 GFLOPS
|
||||
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
|
||||
Peak MFMA FLOPs (BF16), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, FLOP:2147483648000, duration:14.0 ms, mean:153574.8 GFLOPS, stdev=79.9 GFLOPS
|
||||
Peak MFMA FLOPs (BF16), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, FLOP:2147483648000, duration:12.6 ms, mean:170280.0 GFLOPS, stdev=22.3 GFLOPS
|
||||
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
|
||||
Peak MFMA FLOPs (F16), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, FLOP:2147483648000, duration:14.5 ms, mean:147680.1 GFLOPS, stdev=34.7 GFLOPS
|
||||
Peak MFMA FLOPs (F16), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, FLOP:2147483648000, duration:13.0 ms, mean:164733.6 GFLOPS, stdev=24.3 GFLOPS
|
||||
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
|
||||
Peak MFMA FLOPs (F32), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, FLOP:536870912000, duration:14.5 ms, mean:37142.1 GFLOPS, stdev=8.4 GFLOPS
|
||||
Peak MFMA FLOPs (F32), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, FLOP:536870912000, duration:13.0 ms, mean:41399.6 GFLOPS, stdev=4.1 GFLOPS
|
||||
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
|
||||
Peak MFMA FLOPs (F64), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, FLOP:268435456000, duration:7.3 ms, mean:36919.5 GFLOPS, stdev=14.1 GFLOPS
|
||||
Peak MFMA FLOPs (F64), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, FLOP:268435456000, duration:6.5 ms, mean:41379.2 GFLOPS, stdev=4.4 GFLOPS
|
||||
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
|
||||
Peak MFMA IOPs (I8), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, IOP:2147483648000, duration:14.4 ms, mean:149570.6 GOPS, stdev=41.7 GOPS
|
||||
Peak MFMA IOPs (I8), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, IOP:2147483648000, duration:12.9 ms, mean:166281.9 GOPS, stdev=2495.9 GOPS
|
||||
GPU Device 1: Profiling...
|
||||
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
|
||||
HBM BW, GPU ID: 1, workgroupSize:256, workgroups:2097152, experiments:100, traffic:8589934592 bytes, duration:6.2 ms, mean:1382.7 GB/sec, stdev=2.9 GB/sec
|
||||
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
|
||||
L2 BW, GPU ID: 1, workgroupSize:256, workgroups:8192, experiments:100, traffic:687194767360 bytes, duration:157.6 ms, mean:4371.0 GB/sec, stdev=4.1 GB/sec
|
||||
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
|
||||
L1 BW, GPU ID: 1, workgroupSize:256, workgroups:16384, experiments:100, traffic:26843545600 bytes, duration:3.2 ms, mean:8297.4 GB/sec, stdev=11.6 GB/sec
|
||||
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
|
||||
LDS BW, GPU ID: 1, workgroupSize:256, workgroups:16384, experiments:100, traffic:33554432000 bytes, duration:1.8 ms, mean:18839.2 GB/sec, stdev=44.5 GB/sec
|
||||
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
|
||||
Peak FLOPs (FP32), GPU ID: 1, workgroupSize:256, workgroups:16384, experiments:100, FLOP:274877906944, duration:14.441 ms, mean:19037.6 GFLOPS, stdev=2.7 GFLOPS
|
||||
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
|
||||
Peak FLOPs (FP64), GPU ID: 1, workgroupSize:256, workgroups:16384, experiments:100, FLOP:137438953472, duration:7.5 ms, mean:18402.255859.1 GFLOPS, stdev=20.1 GFLOPS
|
||||
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
|
||||
Peak MFMA FLOPs (BF16), GPU ID: 1, workgroupSize:256, workgroups:16384, experiments:100, FLOP:2147483648000, duration:13.9 ms, mean:154240.3 GFLOPS, stdev=119.3 GFLOPS
|
||||
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
|
||||
Peak MFMA FLOPs (F16), GPU ID: 1, workgroupSize:256, workgroups:16384, experiments:100, FLOP:2147483648000, duration:14.5 ms, mean:148450.1 GFLOPS, stdev=112.6 GFLOPS
|
||||
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
|
||||
Peak MFMA FLOPs (F32), GPU ID: 1, workgroupSize:256, workgroups:16384, experiments:100, FLOP:536870912000, duration:14.4 ms, mean:37335.2 GFLOPS, stdev=43.1 GFLOPS
|
||||
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
|
||||
Peak MFMA FLOPs (F64), GPU ID: 1, workgroupSize:256, workgroups:16384, experiments:100, FLOP:268435456000, duration:7.2 ms, mean:37105.3 GFLOPS, stdev=39.5 GFLOPS
|
||||
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
|
||||
Peak MFMA IOPs (I8), GPU ID: 1, workgroupSize:256, workgroups:16384, experiments:100, IOP:2147483648000, duration:14.3 ms, mean:150317.8 GOPS, stdev=203.5 GOPS
|
||||
...
|
||||
GPU Device 2: Profiling...
|
||||
...
|
||||
GPU Device 3: Profiling...
|
||||
...
|
||||
Peak MFMA IOPs (I8), GPU ID: 3, workgroupSize:256, workgroups:16384, experiments:100, IOP:2147483648000, duration:12.9 ms, mean:166686.0 GOPS, stdev=11.2 GOPS
|
||||
```
|
||||
You will notice two stages in *default* Omniperf profiling. The first stage collects all the counters needed for Omniperf analysis (omitting any filters you have provided). The second stage collects data for the roofline analysis (this stage can be disabled using `--no-roof`)
|
||||
You will notice two main stages in *default* Omniperf profiling. The first stage collects all the counters needed for Omniperf analysis (omitting any filters you have provided). The second stage collects data for the roofline analysis (this stage can be disabled using `--no-roof`).
|
||||
|
||||
In this document, we use the term System on Chip (SoC) to refer to a particular family of accelerators. At the end of profiling, all resulting CSV files should be located in an SoC-specific target directory, e.g.:
|
||||
- "mi200" for the AMD Instinct (tm) MI200 family of accelerators
|
||||
@@ -220,21 +213,19 @@ etc. The SoC names are generated as a part of Omniperf, and do not necessarily
|
||||
> Note: Additionally, you will notice a few extra files. An SoC parameters file, *sysinfo.csv*, is created to reflect the target device settings. All profiling output is stored in *log.txt*. Roofline-specific benchmark results are stored in *roofline.csv*.
|
||||
|
||||
```shell-session
|
||||
$ ls workloads/vcopy/mi200/
|
||||
$ ls workloads/vcopy/MI200/
|
||||
total 112
|
||||
drwxrwxr-x 3 colramos colramos 4096 Apr 11 16:42 .
|
||||
drwxrwxr-x 3 colramos colramos 4096 Apr 11 16:42 ..
|
||||
-rw-rw-r-- 1 colramos colramos 40750 Apr 11 16:44 log.txt
|
||||
drwxrwxr-x 2 colramos colramos 4096 Apr 11 16:42 perfmon
|
||||
-rw-rw-r-- 1 colramos colramos 25877 Apr 11 16:42 pmc_perf.csv
|
||||
-rw-rw-r-- 1 colramos colramos 1716 Apr 11 16:44 roofline.csv
|
||||
-rw-rw-r-- 1 colramos colramos 429 Apr 11 16:42 SQ_IFETCH_LEVEL.csv
|
||||
-rw-rw-r-- 1 colramos colramos 366 Apr 11 16:42 SQ_INST_LEVEL_LDS.csv
|
||||
-rw-rw-r-- 1 colramos colramos 391 Apr 11 16:42 SQ_INST_LEVEL_SMEM.csv
|
||||
-rw-rw-r-- 1 colramos colramos 384 Apr 11 16:42 SQ_INST_LEVEL_VMEM.csv
|
||||
-rw-rw-r-- 1 colramos colramos 509 Apr 11 16:42 SQ_LEVEL_WAVES.csv
|
||||
-rw-rw-r-- 1 colramos colramos 498 Apr 11 16:42 sysinfo.csv
|
||||
-rw-rw-r-- 1 colramos colramos 309 Apr 11 16:42 timestamps.csv
|
||||
total 60
|
||||
drwxr-xr-x 1 auser agroup 0 Mar 1 15:15 perfmon
|
||||
-rw-r--r-- 1 auser agroup 26175 Mar 1 15:15 pmc_perf.csv
|
||||
-rw-r--r-- 1 auser agroup 1708 Mar 1 15:17 roofline.csv
|
||||
-rw-r--r-- 1 auser agroup 519 Mar 1 15:15 SQ_IFETCH_LEVEL.csv
|
||||
-rw-r--r-- 1 auser agroup 456 Mar 1 15:15 SQ_INST_LEVEL_LDS.csv
|
||||
-rw-r--r-- 1 auser agroup 474 Mar 1 15:15 SQ_INST_LEVEL_SMEM.csv
|
||||
-rw-r--r-- 1 auser agroup 474 Mar 1 15:15 SQ_INST_LEVEL_VMEM.csv
|
||||
-rw-r--r-- 1 auser agroup 599 Mar 1 15:15 SQ_LEVEL_WAVES.csv
|
||||
-rw-r--r-- 1 auser agroup 650 Mar 1 15:15 sysinfo.csv
|
||||
-rw-r--r-- 1 auser agroup 399 Mar 1 15:15 timestamps.csv
|
||||
```
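As a quick sanity check after profiling, the listing above can be verified programmatically. The following is a minimal sketch, not part of Omniperf: the helper name `missing_outputs` is hypothetical, and the set of "expected" files is an assumption taken from the listing shown here.

```python
from pathlib import Path

# File names taken from the directory listing in this guide; treating them as
# "expected" is an assumption of this sketch, not an Omniperf guarantee.
EXPECTED_FILES = ["pmc_perf.csv", "sysinfo.csv", "timestamps.csv"]
# Roofline results; presumably not produced when the roofline stage is
# disabled with --no-roof (assumption).
ROOFLINE_FILE = "roofline.csv"


def missing_outputs(workload_dir, expect_roofline=True):
    """Return the expected file names that are not present in workload_dir."""
    root = Path(workload_dir)
    names = list(EXPECTED_FILES)
    if expect_roofline:
        names.append(ROOFLINE_FILE)
    return [name for name in names if not (root / name).is_file()]


if __name__ == "__main__":
    # Path matches the SoC-specific directory used in the examples.
    print(missing_outputs("workloads/vcopy/MI200"))
```

An empty list means every file the sketch looks for is present; pass `expect_roofline=False` if the roofline stage was skipped.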
|
||||
|
||||
### Filtering
|
||||
@@ -261,38 +252,36 @@ One can profile specific hardware components to speed up the profiling process.
|
||||
|
||||
The following example only gathers hardware counters for the Shader Sequencer (SQ) and L2 Cache (TCC) components, skipping all other hardware components:
|
||||
```shell-session
|
||||
$ omniperf profile --name vcopy -b SQ TCC -- ./sample/vcopy 1048576 256
|
||||
Resolving rocprof
|
||||
ROC Profiler: /usr/bin/rocprof
|
||||
$ omniperf profile --name vcopy -b SQ TCC -- ./vcopy -n 1048576 -b 256
|
||||
ROC Profiler: /opt/rocm-5.7.1/bin/rocprof
|
||||
Execution mode = profile
|
||||
|
||||
|
||||
|
||||
-------------
|
||||
Profile only
|
||||
-------------
|
||||
|
||||
omniperf ver: 1.0.8-PR1
|
||||
Path: /home/colramos/GitHub/omniperf-pub/workloads
|
||||
Target: mi200
|
||||
Command: /home/colramos/vcopy 1048576 256
|
||||
Kernel Selection: None
|
||||
Dispatch Selection: None
|
||||
IP Blocks: ['SQ', 'TCC']
|
||||
fname: pmc_sq_perf2: Added
|
||||
fname: pmc_td_perf: Skipped
|
||||
fname: pmc_tcc2_perf: Skipped
|
||||
fname: pmc_tcp_perf: Skipped
|
||||
SoC = {'MI200'}
|
||||
Profiler choice = rocprofv1
|
||||
fname: pmc_sq_perf8: Added
|
||||
fname: pmc_spi_perf: Skipped
|
||||
fname: pmc_sq_perf4: Added
|
||||
fname: pmc_sq_perf6: Added
|
||||
fname: pmc_cpf_perf: Skipped
|
||||
fname: pmc_sqc_perf1: Skipped
|
||||
fname: pmc_tcc_perf: Added
|
||||
fname: pmc_cpf_perf: Skipped
|
||||
fname: pmc_sq_perf8: Added
|
||||
fname: pmc_tcc2_perf: Skipped
|
||||
fname: pmc_sq_perf2: Added
|
||||
fname: pmc_cpc_perf: Skipped
|
||||
fname: pmc_td_perf: Skipped
|
||||
fname: pmc_tcp_perf: Skipped
|
||||
fname: pmc_sq_perf1: Added
|
||||
fname: pmc_ta_perf: Skipped
|
||||
fname: pmc_sq_perf3: Added
|
||||
fname: pmc_sq_perf6: Added
|
||||
Log: /home/colramos/GitHub/omniperf-pub/workloads/vcopy/mi200/log.txt
|
||||
fname: pmc_ta_perf: Skipped
|
||||
omniperf ver: 1.0.10
|
||||
Path: /home/auser/repos/omniperf/sample/workloads/vcopy/MI200
|
||||
Target: MI200
|
||||
Command: ./vcopy -n 1048576 -b 256
|
||||
Kernel Selection: None
|
||||
Dispatch Selection: None
|
||||
IP Blocks: ['sq', 'tcc']
|
||||
KernelName verbose: 2
|
||||
...
|
||||
```
|
||||
|
||||
@@ -301,35 +290,32 @@ Kernel filtering is based on the name of the kernel(s) you would like to isolate
|
||||
|
||||
The following example demonstrates profiling only the kernel(s) whose name matches the substring "vecCopy":
|
||||
```shell-session
|
||||
$ omniperf profile --name vcopy -k vecCopy -- ./vcopy 1048576 256
|
||||
Resolving rocprof
|
||||
ROC Profiler: /usr/bin/rocprof
|
||||
$ omniperf profile --name vcopy -k vecCopy -- ./vcopy -n 1048576 -b 256
|
||||
ROC Profiler: /opt/rocm-5.7.1/bin/rocprof
|
||||
Execution mode = profile
|
||||
|
||||
|
||||
-------------
|
||||
Profile only
|
||||
-------------
|
||||
|
||||
omniperf ver: 1.0.8-PR1
|
||||
Path: /home/colramos/GitHub/omniperf-pub/workloads
|
||||
Target: mi200
|
||||
Command: /home/colramos/vcopy 1048576 256
|
||||
Kernel Selection: ['vecCopy']
|
||||
Dispatch Selection: None
|
||||
SoC = {'MI200'}
|
||||
Profiler choice = rocprofv1
|
||||
omniperf ver: 1.0.10
|
||||
Path: /home/auser/repos/omniperf/sample/workloads/vcopy/MI200
|
||||
Target: MI200
|
||||
Command: ./vcopy -n 1048576 -b 256
|
||||
Kernel Selection: ['vecCopy']
|
||||
Dispatch Selection: None
|
||||
IP Blocks: All
|
||||
Log: /home/colramos/GitHub/omniperf-pub/workloads/vcopy/mi200/log.txt
|
||||
KernelName verbose: 2
|
||||
|
||||
/home/colramos/GitHub/omniperf-pub/workloads/vcopy/mi200/perfmon/SQ_INST_LEVEL_SMEM.txt
|
||||
RPL: on '230411_170300' from '/opt/rocm-5.2.1' in '/home/colramos/GitHub/omniperf-pub'
|
||||
RPL: profiling '""/home/colramos/vcopy 1048576 256""'
|
||||
RPL: input file '/home/colramos/GitHub/omniperf-pub/workloads/vcopy/mi200/perfmon/SQ_INST_LEVEL_SMEM.txt'
|
||||
RPL: output dir '/tmp/rpl_data_230411_170300_29696'
|
||||
RPL: result dir '/tmp/rpl_data_230411_170300_29696/input0_results_230411_170300'
|
||||
Finished allocating vectors on the CPU
|
||||
ROCProfiler: input from "/tmp/rpl_data_230411_170300_29696/input0.xml"
|
||||
Current input file: /home/auser/repos/omniperf/sample/workloads/vcopy/MI200/perfmon/pmc_perf_12.txt
|
||||
RPL: on '240301_152305' from '/opt/rocm-5.7.1' in '/home/auser/repos/omniperf/sample'
|
||||
RPL: profiling '""./vcopy -n 1048576 -b 256""'
|
||||
RPL: input file '/home/auser/repos/omniperf/sample/workloads/vcopy/MI200/perfmon/pmc_perf_12.txt'
|
||||
RPL: output dir '/tmp/rpl_data_240301_152305_562565'
|
||||
RPL: result dir '/tmp/rpl_data_240301_152305_562565/input0_results_240301_152305'
|
||||
ROCProfiler: input from "/tmp/rpl_data_240301_152305_562565/input0.xml"
|
||||
gpu_index =
|
||||
kernel = vecCopy
|
||||
|
||||
... ...
|
||||
...
|
||||
```
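To illustrate what substring matching means for kernel names, here is a small sketch. It is not Omniperf's or rocprof's actual matching code; `select_kernels` and the second kernel name are hypothetical and used only for illustration.

```python
def select_kernels(kernel_names, substrings):
    """Keep kernel names that contain at least one of the given substrings."""
    return [k for k in kernel_names if any(s in k for s in substrings)]


kernels = [
    "vecCopy(double*, double*, double*, int, int)",
    "vecAdd(double*, double*, double*, int)",  # hypothetical second kernel
]

# Only the kernel whose name contains "vecCopy" is retained.
print(select_kernels(kernels, ["vecCopy"]))
```

Because the match is on a substring, a short filter such as "vec" would select both kernels in this list, so it pays to make the filter string as specific as needed.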
|
||||
|
||||
#### Dispatch Filtering
|
||||
@@ -337,34 +323,33 @@ Dispatch filtering is based on the *global* dispatch index of kernels in a run.
|
||||
|
||||
The following example profiles only the first dispatched kernel (dispatch index 0) in the execution of the application:
|
||||
```shell-session
|
||||
$ omniperf profile --name vcopy -d 0 -- ./vcopy 1048576 256
|
||||
Resolving rocprof
|
||||
ROC Profiler: /usr/bin/rocprof
|
||||
$ omniperf profile --name vcopy -d 0 -- ./vcopy -n 1048576 -b 256
|
||||
ROC Profiler: /opt/rocm-5.7.1/bin/rocprof
|
||||
Execution mode = profile
|
||||
|
||||
|
||||
-------------
|
||||
Profile only
|
||||
-------------
|
||||
|
||||
omniperf ver: 1.0.8-PR1
|
||||
Path: /home/colramos/GitHub/omniperf-pub/workloads
|
||||
Target: mi200
|
||||
Command: /home/colramos/vcopy 1048576 256
|
||||
Kernel Selection: None
|
||||
Dispatch Selection: ['0']
|
||||
SoC = {'MI200'}
|
||||
Profiler choice = rocprofv1
|
||||
omniperf ver: 1.0.10
|
||||
Path: /home/auser/repos/omniperf/sample/workloads/vcopy/MI200
|
||||
Target: MI200
|
||||
Command: ./vcopy -n 1048576 -b 256
|
||||
Kernel Selection: None
|
||||
Dispatch Selection: ['0']
|
||||
IP Blocks: All
|
||||
Log: /home/colramos/GitHub/omniperf-pub/workloads/vcopy/mi200/log.txt
|
||||
KernelName verbose: 2
|
||||
|
||||
/home/colramos/GitHub/omniperf-pub/workloads/vcopy/mi200/perfmon/SQ_INST_LEVEL_SMEM.txt
|
||||
RPL: on '230411_170356' from '/opt/rocm-5.2.1' in '/home/colramos/GitHub/omniperf-pub'
|
||||
RPL: profiling '""/home/colramos/vcopy 1048576 256""'
|
||||
RPL: input file '/home/colramos/GitHub/omniperf-pub/workloads/vcopy/mi200/perfmon/SQ_INST_LEVEL_SMEM.txt'
|
||||
RPL: output dir '/tmp/rpl_data_230411_170356_30314'
|
||||
RPL: result dir '/tmp/rpl_data_230411_170356_30314/input0_results_230411_170356'
|
||||
Finished allocating vectors on the CPU
|
||||
ROCProfiler: input from "/tmp/rpl_data_230411_170356_30314/input0.xml"
|
||||
Current input file: /home/auser/repos/omniperf/sample/workloads/vcopy/MI200/perfmon/timestamps.txt
|
||||
RPL: on '240301_152445' from '/opt/rocm-5.7.1' in '/home/auser/repos/omniperf/sample'
|
||||
RPL: profiling '""./vcopy -n 1048576 -b 256""'
|
||||
RPL: input file '/home/auser/repos/omniperf/sample/workloads/vcopy/MI200/perfmon/timestamps.txt'
|
||||
RPL: output dir '/tmp/rpl_data_240301_152445_563349'
|
||||
RPL: result dir '/tmp/rpl_data_240301_152445_563349/input0_results_240301_152445'
|
||||
ROCProfiler: input from "/tmp/rpl_data_240301_152445_563349/input0.xml"
|
||||
gpu_index =
|
||||
kernel =
|
||||
range = 0
|
||||
|
||||
...
|
||||
```
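When scripting several filtered runs, it can help to assemble the command line in one place. The sketch below uses only flags that appear in this guide (`--name`, `-b`, `-k`, `-d`); the helper `build_profile_command` is hypothetical, and passing several values to `-k` or `-d` is an assumption based on how `-b SQ TCC` is used above.

```python
import shlex


def build_profile_command(name, app_cmd, blocks=None, kernels=None, dispatches=None):
    """Assemble an `omniperf profile` argument list from optional filters."""
    cmd = ["omniperf", "profile", "--name", name]
    if blocks:
        cmd += ["-b", *blocks]  # hardware component filter, e.g. SQ TCC
    if kernels:
        cmd += ["-k", *kernels]  # kernel name substring filter
    if dispatches:
        cmd += ["-d", *(str(d) for d in dispatches)]  # global dispatch index filter
    # Everything after "--" is the application and its own arguments.
    return cmd + ["--", *app_cmd]


cmd = build_profile_command(
    "vcopy", ["./vcopy", "-n", "1048576", "-b", "256"], blocks=["SQ", "TCC"]
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run` directly; `shlex.join` is used here only to print a copy-pasteable form of the command.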
|
||||
|
||||
@@ -386,42 +371,31 @@ Standalone Roofline Options:
|
||||
#### Roofline Only
|
||||
The following example demonstrates profiling roofline data only:
|
||||
```shell-session
|
||||
$ omniperf profile --name vcopy --roof-only -- ./vcopy 1048576 256
|
||||
Resolving rocprof
|
||||
ROC Profiler: /usr/bin/rocprof
|
||||
$ omniperf profile --name vcopy --roof-only -- ./vcopy -n 1048576 -b 256
|
||||
|
||||
|
||||
--------
|
||||
Roofline only
|
||||
--------
|
||||
|
||||
Checking for roofline.csv in /home/colramos/GitHub/omniperf-pub/workloads/vcopy/mi200
|
||||
No roofline data found. Generating...
|
||||
...
|
||||
[roofline] Checking for roofline.csv in /home/auser/repos/omniperf/sample/workloads/vcopy/MI200
|
||||
[roofline] No roofline data found. Generating...
|
||||
Checking for roofline.csv in /home/auser/repos/omniperf/sample/workloads/vcopy/MI200
|
||||
Empirical Roofline Calculation
|
||||
Copyright © 2022 Advanced Micro Devices, Inc. All rights reserved.
|
||||
Total detected GPU devices: 4
|
||||
GPU Device 0: Profiling...
|
||||
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
|
||||
... ...
|
||||
Checking for roofline.csv in /home/colramos/GitHub/omniperf-pub/workloads/mix/mi200
|
||||
Checking for sysinfo.csv in /home/colramos/GitHub/omniperf-pub/workloads/mix/mi200
|
||||
Checking for pmc_perf.csv in /home/colramos/GitHub/omniperf-pub/workloads/mix/mi200
|
||||
...
|
||||
Empirical Roofline PDFs saved!
|
||||
```
|
||||
An inspection of our workload output folder shows that the .pdf plots were generated successfully:
|
||||
```shell-session
|
||||
$ ls workloads/vcopy/mi200/
|
||||
total 176
|
||||
drwxrwxr-x 3 colramos colramos 4096 Apr 11 17:18 .
|
||||
drwxrwxr-x 3 colramos colramos 4096 Apr 11 17:15 ..
|
||||
-rw-rw-r-- 1 colramos colramos 13271 Apr 11 17:18 empirRoof_gpu-ALL_fp32.pdf
|
||||
-rw-rw-r-- 1 colramos colramos 13175 Apr 11 17:18 empirRoof_gpu-ALL_int8_fp16.pdf
|
||||
-rw-rw-r-- 1 colramos colramos 26560 Apr 11 17:16 log.txt
|
||||
drwxrwxr-x 2 colramos colramos 4096 Apr 11 17:16 perfmon
|
||||
-rw-rw-r-- 1 colramos colramos 54031 Apr 11 17:16 pmc_perf.csv
|
||||
-rw-rw-r-- 1 colramos colramos 1714 Apr 11 17:16 roofline.csv
|
||||
-rw-rw-r-- 1 colramos colramos 457 Apr 11 17:16 sysinfo.csv
|
||||
-rw-rw-r-- 1 colramos colramos 37521 Apr 11 17:16 timestamps.csv
|
||||
$ ls workloads/vcopy/MI200/
|
||||
total 48
|
||||
-rw-r--r-- 1 auser agroup 13331 Mar 1 16:05 empirRoof_gpu-0_fp32.pdf
|
||||
-rw-r--r-- 1 auser agroup 13136 Mar 1 16:05 empirRoof_gpu-0_int8_fp16.pdf
|
||||
drwxr-xr-x 1 auser agroup 0 Mar 1 16:03 perfmon
|
||||
-rw-r--r-- 1 auser agroup 1101 Mar 1 16:03 pmc_perf.csv
|
||||
-rw-r--r-- 1 auser agroup 1715 Mar 1 16:05 roofline.csv
|
||||
-rw-r--r-- 1 auser agroup 650 Mar 1 16:03 sysinfo.csv
|
||||
-rw-r--r-- 1 auser agroup 399 Mar 1 16:03 timestamps.csv
|
||||
```
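The same check can be scripted. This is a minimal sketch under the assumption that roofline plots follow the `empirRoof_gpu-*.pdf` naming seen in the listing above; the helper `roofline_plots` is hypothetical and not part of Omniperf.

```python
from pathlib import Path


def roofline_plots(workload_dir):
    """Return the sorted names of empirical roofline PDFs in workload_dir."""
    # The pattern is inferred from the file names in the listing above.
    return sorted(p.name for p in Path(workload_dir).glob("empirRoof_gpu-*.pdf"))


if __name__ == "__main__":
    print(roofline_plots("workloads/vcopy/MI200"))
```

An empty result suggests the roofline stage did not complete or wrote its plots elsewhere.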
|
||||
A sample *empirRoof_gpu-ALL_fp32.pdf* looks something like this:
|
||||
|
||||
|
||||