documentation updates for 2.x work - current set of updates are mostly tied to two changes:

(1) vcopy examples require updated command-line args
(2) profiling output directory name format changed

Signed-off-by: Karl W. Schulz <karl.schulz@amd.com>
Karl W. Schulz
2024-03-01 16:16:30 -06:00
committed by Karl W. Schulz
parent 65150e2384
commit bfc0dea1de
3 changed files with 325 additions and 356 deletions
+125 -132
@@ -27,153 +27,146 @@ Run `omniperf analyze -h` for more details.
1) To begin, generate a comprehensive analysis report with Omniperf CLI.
```shell-session
$ omniperf analyze -p workloads/vcopy/mi200/
--------
Analyze
--------
$ omniperf analyze -p workloads/vcopy/MI200/
Analysis mode = cli
SoC = {'MI200'}
[analysis] deriving Omniperf metrics...
--------------------------------------------------------------------------------
0. Top Stat
0. Top Stats
0.1 Top Kernels
╒════╤══════════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
│ │ Kernel_Name │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
╞════╪══════════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╡
│ 0 │ vecCopy(double*, double*, double*, int, │ 1 │ 20000.00 │ 20000.00 │ 20000.00 │ 100.00 │
│ │ int) [clone .kd] │ │ │ │ │ │
│ 0 │ vecCopy(double*, double*, double*, int, │ 1.00 │ 20480.00 │ 20480.00 │ 20480.00 │ 100.00 │
│ │ int) │ │ │ │ │ │
╘════╧══════════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╛
0.2 Dispatch List
╒════╤═══════════════╤══════════════════════════════════════════════╤══════════╕
│ │ Dispatch_ID │ Kernel_Name │ GPU_ID │
╞════╪═══════════════╪══════════════════════════════════════════════╪══════════╡
│ 0 │ 0 │ vecCopy(double*, double*, double*, int, int) │ 2 │
╘════╧═══════════════╧══════════════════════════════════════════════╧══════════╛
--------------------------------------------------------------------------------
1. System Info
╒══════════════════╤═══════════════════════════════════════════════╕
│ │ Info │
╞══════════════════╪═══════════════════════════════════════════════╡
│ workload_name │ vcopy │
├──────────────────┼───────────────────────────────────────────────┤
│ command │ /home/colramos/vcopy 1048576 256 │
├──────────────────┼───────────────────────────────────────────────┤
│ host_name │ sv-pdp-2 │
├──────────────────┼───────────────────────────────────────────────┤
│ host_cpu │ AMD EPYC 7282 16-Core Processor │
├──────────────────┼───────────────────────────────────────────────┤
│ host_distro │ Ubuntu 20.04.3 LTS │
├──────────────────┼───────────────────────────────────────────────┤
│ host_kernel │ 5.15.0-43-generic │
├──────────────────┼───────────────────────────────────────────────┤
│ host_rocmver │ 5.2.1-79 │
├──────────────────┼───────────────────────────────────────────────┤
│ date │ Fri Jan 20 11:22:20 2023 (CST) │
├──────────────────┼───────────────────────────────────────────────┤
│ gpu_soc │ gfx90a │
├──────────────────┼───────────────────────────────────────────────┤
│ numSE │ 8 │
├──────────────────┼───────────────────────────────────────────────┤
│ numCU │ 104 │
├──────────────────┼───────────────────────────────────────────────┤
│ numSIMD │ 4 │
├──────────────────┼───────────────────────────────────────────────┤
│ waveSize │ 64 │
├──────────────────┼───────────────────────────────────────────────┤
│ maxWavesPerCU │ 32 │
├──────────────────┼───────────────────────────────────────────────┤
│ maxWorkgroupSize │ 1024 │
├──────────────────┼───────────────────────────────────────────────┤
│ L1 │ 16 │
├──────────────────┼───────────────────────────────────────────────┤
│ L2 │ 8192 │
├──────────────────┼───────────────────────────────────────────────┤
│ sclk │ 1700 │
├──────────────────┼───────────────────────────────────────────────┤
│ mclk │ 1600 │
├──────────────────┼───────────────────────────────────────────────┤
│ cur_sclk │ 800 │
├──────────────────┼───────────────────────────────────────────────┤
│ cur_mclk │ 1600 │
├──────────────────┼───────────────────────────────────────────────┤
│ L2Banks │ 32 │
├──────────────────┼───────────────────────────────────────────────┤
│ name │ mi200 │
├──────────────────┼───────────────────────────────────────────────┤
│ numSQC │ 56 │
├──────────────────┼───────────────────────────────────────────────┤
│ hbmBW │ 1638.4 │
├──────────────────┼───────────────────────────────────────────────┤
│ ip_blocks │ roofline|SQ|LDS|SQC|TA|TD|TCP|TCC|SPI|CPC|CPF │
╘══════════════════╧═══════════════════════════════════════════════╛
╒═══════════════════╤═══════════════════════════════════════════════╕
│ │ Info │
╞═══════════════════╪═══════════════════════════════════════════════╡
│ workload_name │ vcopy │
├───────────────────┼───────────────────────────────────────────────┤
│ command │ ./vcopy -n 1048576 -b 256 │
├───────────────────┼───────────────────────────────────────────────┤
│ host_name │ t007-002 │
├───────────────────┼───────────────────────────────────────────────┤
│ host_cpu │ AMD EPYC 7V13 64-Core Processor │
├───────────────────┼───────────────────────────────────────────────┤
│ sbios │ American Megatrends Inc.0602 │
├───────────────────┼───────────────────────────────────────────────┤
│ host_distro │ Rocky Linux 9.1 (Blue Onyx) │
├───────────────────┼───────────────────────────────────────────────┤
│ host_kernel │ 5.14.0-162.18.1.el9_1.x86_64 │
├───────────────────┼───────────────────────────────────────────────┤
│ host_rocmver │ 5.7.1-98 │
├───────────────────┼───────────────────────────────────────────────┤
│ date │ Fri Mar 1 15:32:43 2024 (CST) │
├───────────────────┼───────────────────────────────────────────────┤
│ gpu_soc │ gfx90a │
├───────────────────┼───────────────────────────────────────────────┤
│ vbios │ 113-D67301-059 │
├───────────────────┼───────────────────────────────────────────────┤
│ numSE │ 8 │
├───────────────────┼───────────────────────────────────────────────┤
│ numCU │ 104 │
├───────────────────┼───────────────────────────────────────────────┤
│ numSIMD │ 4 │
├───────────────────┼───────────────────────────────────────────────┤
│ waveSize │ 64 │
├───────────────────┼───────────────────────────────────────────────┤
│ maxWavesPerCU │ 32 │
├───────────────────┼───────────────────────────────────────────────┤
│ maxWorkgroupSize │ 1024 │
├───────────────────┼───────────────────────────────────────────────┤
│ L1 │ 16 │
├───────────────────┼───────────────────────────────────────────────┤
│ L2 │ 8192 │
├───────────────────┼───────────────────────────────────────────────┤
│ sclk │ 1700 │
├───────────────────┼───────────────────────────────────────────────┤
│ mclk │ 1600 │
├───────────────────┼───────────────────────────────────────────────┤
│ cur_sclk │ 1700 │
├───────────────────┼───────────────────────────────────────────────┤
│ cur_mclk │ 1600 │
├───────────────────┼───────────────────────────────────────────────┤
│ L2Banks │ 32 │
├───────────────────┼───────────────────────────────────────────────┤
│ totalL2Banks │ 32 │
├───────────────────┼───────────────────────────────────────────────┤
│ LDSBanks │ 32 │
├───────────────────┼───────────────────────────────────────────────┤
│ name │ MI200 │
├───────────────────┼───────────────────────────────────────────────┤
│ numSQC │ 56 │
├───────────────────┼───────────────────────────────────────────────┤
│ numPipes │ 4 │
├───────────────────┼───────────────────────────────────────────────┤
│ hbmBW │ 1638.4 │
├───────────────────┼───────────────────────────────────────────────┤
│ ip_blocks │ roofline|SQ|LDS|SQC|TA|TD|TCP|TCC|SPI|CPC|CPF │
├───────────────────┼───────────────────────────────────────────────┤
--------------------------------------------------------------------------------
2. System Speed-of-Light
....
```
2. Use `--list-metrics` to generate a list of available metrics for inspection
```shell-session
$ omniperf analyze -p workloads/vcopy/mi200/ --list-metrics gfx90a
╒═════════╤═════════════════════════════╕
│ │ Metric │
╞═════════╪═════════════════════════════╡
│ 0 │ Top Stat │
├─────────┼─────────────────────────────┤
│ 1 │ System Info │
├─────────┼─────────────────────────────┤
│ 2.1.0 │ VALU_FLOPs │
├─────────┼─────────────────────────────┤
│ 2.1.1 │ VALU_IOPs │
├─────────┼─────────────────────────────┤
│ 2.1.2 │ MFMA_FLOPs_(BF16) │
├─────────┼─────────────────────────────┤
│ 2.1.3 │ MFMA_FLOPs_(F16) │
├─────────┼─────────────────────────────┤
│ 2.1.4 │ MFMA_FLOPs_(F32) │
├─────────┼─────────────────────────────┤
│ 2.1.5 │ MFMA_FLOPs_(F64) │
├─────────┼─────────────────────────────┤
│ 2.1.6 │ MFMA_IOPs_(Int8) │
├─────────┼─────────────────────────────┤
│ 2.1.7 │ Active_CUs │
├─────────┼─────────────────────────────┤
│ 2.1.8 │ SALU_Util │
├─────────┼─────────────────────────────┤
│ 2.1.9 │ VALU_Util │
├─────────┼─────────────────────────────┤
│ 2.1.10 │ MFMA_Util │
├─────────┼─────────────────────────────┤
│ 2.1.11 │ VALU_Active_Threads/Wave │
├─────────┼─────────────────────────────┤
│ 2.1.12 │ IPC_-_Issue │
├─────────┼─────────────────────────────┤
│ 2.1.13 │ LDS_BW │
├─────────┼─────────────────────────────┤
│ 2.1.14 │ LDS_Bank_Conflict │
├─────────┼─────────────────────────────┤
│ 2.1.15 │ Instr_Cache_Hit_Rate │
├─────────┼─────────────────────────────┤
│ 2.1.16 │ Instr_Cache_BW │
├─────────┼─────────────────────────────┤
│ 2.1.17 │ Scalar_L1D_Cache_Hit_Rate │
├─────────┼─────────────────────────────┤
│ 2.1.18 │ Scalar_L1D_Cache_BW │
├─────────┼─────────────────────────────┤
│ 2.1.19 │ Vector_L1D_Cache_Hit_Rate │
├─────────┼─────────────────────────────┤
│ 2.1.20 │ Vector_L1D_Cache_BW │
├─────────┼─────────────────────────────┤
│ 2.1.21 │ L2_Cache_Hit_Rate │
├─────────┼─────────────────────────────┤
│ 2.1.22 │ L2-Fabric_Read_BW │
├─────────┼─────────────────────────────┤
│ 2.1.23 │ L2-Fabric_Write_BW │
├─────────┼─────────────────────────────┤
│ 2.1.24 │ L2-Fabric_Read_Latency │
├─────────┼─────────────────────────────┤
│ 2.1.25 │ L2-Fabric_Write_Latency │
├─────────┼─────────────────────────────┤
$ omniperf analyze -p workloads/vcopy/MI200/ --list-metrics gfx90a
Execution mode = analyze
Analysis mode = cli
SoC = {'MI200'}
[analysis] deriving Omniperf metrics...
0 -> Top Stats
1 -> System Info
2 -> System Speed-of-Light
2.1 -> Speed-of-Light
2.1.0 -> VALU FLOPs
2.1.1 -> VALU IOPs
2.1.2 -> MFMA FLOPs (BF16)
2.1.3 -> MFMA FLOPs (F16)
2.1.4 -> MFMA FLOPs (F32)
2.1.5 -> MFMA FLOPs (F64)
2.1.6 -> MFMA IOPs (Int8)
2.1.7 -> Active CUs
2.1.8 -> SALU Utilization
2.1.9 -> VALU Utilization
2.1.10 -> MFMA Utilization
2.1.11 -> VMEM Utilization
2.1.12 -> Branch Utilization
2.1.13 -> VALU Active Threads
2.1.14 -> IPC
2.1.15 -> Wavefront Occupancy
2.1.16 -> Theoretical LDS Bandwidth
2.1.17 -> LDS Bank Conflicts/Access
2.1.18 -> vL1D Cache Hit Rate
2.1.19 -> vL1D Cache BW
2.1.20 -> L2 Cache Hit Rate
2.1.21 -> L2 Cache BW
2.1.22 -> L2-Fabric Read BW
2.1.23 -> L2-Fabric Write BW
2.1.24 -> L2-Fabric Read Latency
2.1.25 -> L2-Fabric Write Latency
...
```
3. Choose your own customized subset of metrics with `-b` (a.k.a. `--metric`), or build your own config following [config_template](https://github.com/AMDResearch/omniperf/blob/main/src/omniperf_analyze/configs/panel_config_template.yaml). The following example generates a report containing only metric 2 (a.k.a. System Speed-of-Light).
```shell-session
$ omniperf analyze -p workloads/vcopy/mi200/ -b 2
$ omniperf analyze -p workloads/vcopy/MI200/ -b 2
--------
Analyze
--------
@@ -261,24 +254,24 @@ Analyze
- Single run
```shell
$ omniperf analyze -p workloads/vcopy/mi200/
$ omniperf analyze -p workloads/vcopy/MI200/
```
- List top kernels
```shell
$ omniperf analyze -p workloads/vcopy/mi200/ --list-kernels
$ omniperf analyze -p workloads/vcopy/MI200/ --list-kernels
```
- List metrics
```shell
$ omniperf analyze -p workloads/vcopy/mi200/ --list-metrics gfx90a
$ omniperf analyze -p workloads/vcopy/MI200/ --list-metrics gfx90a
```
- Customized profiling "System Speed-of-Light" and "CS_Busy" only
```shell
$ omniperf analyze -p workloads/vcopy/mi200/ -b 2 5.1.0
$ omniperf analyze -p workloads/vcopy/MI200/ -b 2 5.1.0
```
> Note: Users can filter a single metric or a whole hardware component by its id. In this case, 2 is the id for "System Speed-of-Light" and 5.1.0 is the id for the metric "GPU Busy Cycles".
@@ -287,7 +280,7 @@ Analyze
First, list the top kernels in your application using `--list-kernels`.
```shell-session
$ omniperf analyze -p workloads/vcopy/mi200/ --list-kernels
$ omniperf analyze -p workloads/vcopy/MI200/ --list-kernels
--------
Analyze
@@ -373,7 +366,7 @@ See [FAQ](https://amdresearch.github.io/omniperf/faq.html) for more details on S
To launch the standalone GUI, include the `--gui` flag with your desired analysis command. For example:
```shell-session
$ omniperf analyze -p workloads/vcopy/mi200/ --gui
$ omniperf analyze -p workloads/vcopy/MI200/ --gui
--------
Analyze
+3 -1
@@ -1,7 +1,9 @@
# AMD Instinct(tm) MI Series Accelerator Performance Model
```eval_rst
.. sectionauthor:: Nicholas Curtis <nicholas.curtis@amd.com>
.. toctree::
:glob:
:maxdepth: 5
```
Omniperf makes available an extensive list of metrics to better understand achieved application performance on AMD Instinct(tm) MI accelerators including Graphics Core Next (GCN) GPUs such as the AMD Instinct MI50, CDNA(tm) accelerators such as the MI100, and CDNA(tm) 2 accelerators such as MI250X/250/210.
+197 -223
@@ -23,11 +23,12 @@ the MI200 platform.
$ hipcc vcopy.cpp -o vcopy
$ ls
vcopy vcopy.cpp
$ ./vcopy 1048576 256
$ ./vcopy -n 1048576 -b 256
vcopy testing on GCD 0
Finished allocating vectors on the CPU
Finished allocating vectors on the GPU
Finished copying vectors to the GPU
sw thinks it moved 1.000000 KB per wave
sw thinks it moved 1.000000 KB per wave
Total threads: 1048576, Grid Size: 4096 block Size:256, Wavefronts:16384:
Launching the kernel on the GPU
Finished executing kernel
@@ -42,70 +43,66 @@ The *omniperf* script, available through the Omniperf repository, is used to aqu
**omniperf help:**
```shell-session
$ omniperf profile --help
ROC Profiler: /usr/bin/rocprof
usage:
usage:
omniperf profile --name <workload_name> [profile options] [roofline options] -- <profile_cmd>
-------------------------------------------------------------------------------
Examples:
omniperf profile -n vcopy_all -- ./vcopy 1048576 256
omniperf profile -n vcopy_SPI_TCC -b SQ TCC -- ./vcopy 1048576 256
omniperf profile -n vcopy_kernel -k vecCopy -- ./vcopy 1048576 256
omniperf profile -n vcopy_disp -d 0 -- ./vcopy 1048576 256
omniperf profile -n vcopy_roof --roof-only -- ./vcopy 1048576 256
omniperf profile -n vcopy_all -- ./vcopy -n 1048576 -b 256
omniperf profile -n vcopy_SPI_TCC -b SQ TCC -- ./vcopy -n 1048576 -b 256
omniperf profile -n vcopy_kernel -k vecCopy -- ./vcopy -n 1048576 -b 256
omniperf profile -n vcopy_disp -d 0 -- ./vcopy -n 1048576 -b 256
omniperf profile -n vcopy_roof --roof-only -- ./vcopy -n 1048576 -b 256
-------------------------------------------------------------------------------
Help:
-h, --help show this help message and exit
-h, --help show this help message and exit
General Options:
-v, --version show program's version number and exit
-V, --verbose Increase output verbosity
-v, --version show program's version number and exit
-V, --verbose Increase output verbosity
Profile Options:
-n , --name Assign a name to workload.
-p , --path Specify path to save workload.
(DEFAULT: /home/colramos/GitHub/omniperf/workloads/<name>)
-k [ ...], --kernel [ ...] Kernel filtering.
-b [ ...], --ipblocks [ ...] Hardware block filtering:
SQ
SQC
TA
TD
TCP
TCC
SPI
CPC
CPF
-d [ ...], --dispatch [ ...] Dispatch ID filtering.
--no-roof Profile without collecting roofline data.
-- [ ...] Provide command for profiling after double dash.
-n , --name Assign a name to workload.
-p , --path Specify path to save workload.
-k [ ...], --kernel [ ...] Kernel filtering.
-d [ ...], --dispatch [ ...] Dispatch ID filtering.
-b [ ...], --ipblocks [ ...] IP block filtering:
SQ
SQC
TA
TD
TCP
TCC
SPI
CPC
CPF
--join-type Choose how to join rocprof runs: (DEFAULT: grid)
kernel (i.e. By unique kernel name dispatches)
grid (i.e. By unique kernel name + grid size dispatches)
--no-roof Profile without collecting roofline data.
-- [ ...] Provide command for profiling after double dash.
--kernel-verbose Specify Kernel Name verbose level 1-5. Lower the level, shorter the kernel name. (DEFAULT: 2) (DISABLE: 5)
Standalone Roofline Options:
--roof-only Profile roofline data only.
--sort Overlay top kernels or top dispatches: (DEFAULT: kernels)
kernels
dispatches
-m , --mem-level Filter by memory level: (DEFAULT: ALL)
HBM
L2
vL1D
LDS
--device GPU device ID. (DEFAULT: ALL)
--kernel-names Include kernel names in roofline plot.
--roof-only Profile roofline data only.
--sort Overlay top kernels or top dispatches: (DEFAULT: kernels)
kernels
dispatches
-m [ ...], --mem-level [ ...] Filter by memory level: (DEFAULT: ALL)
HBM
L2
vL1D
LDS
--device GPU device ID. (DEFAULT: ALL)
--kernel-names Include kernel names in roofline plot.
```
- The `-k` \<kernel> flag allows for kernel filtering, which is compatible with the current rocProf utility.
@@ -119,36 +116,42 @@ The following sample command profiles the *vcopy* workload.
**vcopy profiling:**
```shell-session
$ omniperf profile --name vcopy -- ./vcopy 1048576 256
Resolving rocprof
ROC Profiler: /usr/bin/rocprof
$ omniperf profile --name vcopy -- ./vcopy -n 1048576 -b 256
ROC Profiler: /opt/rocm-5.7.1/bin/rocprof
Execution mode = profile
-------------
Profile only
-------------
___ _ __
/ _ \ _ __ ___ _ __ (_)_ __ ___ _ __ / _|
| | | | '_ ` _ \| '_ \| | '_ \ / _ \ '__| |_
| |_| | | | | | | | | | | |_) | __/ | | _|
\___/|_| |_| |_|_| |_|_| .__/ \___|_| |_|
|_|
omniperf ver: 1.0.8-PR1
Path: /home/colramos/GitHub/omniperf-pub/workloads
Target: mi200
Command: /home/colramos/vcopy 1048576 256
Kernel Selection: None
Dispatch Selection: None
SoC = {'MI200'}
Profiler choice = rocprofv1
omniperf ver: 1.0.10
Path: /home/auser/repos/omniperf/sample/workloads/vcopy/MI200
Target: MI200
Command: ./vcopy -n 1048576 -b 256
Kernel Selection: None
Dispatch Selection: None
IP Blocks: All
Log: /home/colramos/GitHub/omniperf-pub/workloads/vcopy/mi200/log.txt
KernelName verbose: 2
/home/colramos/GitHub/omniperf-pub/workloads/vcopy/mi200/perfmon/SQ_INST_LEVEL_SMEM.txt
RPL: on '230411_165021' from '/opt/rocm-5.2.1' in '/home/colramos/GitHub/omniperf-pub'
RPL: profiling '""/home/colramos/vcopy 1048576 256""'
RPL: input file '/home/colramos/GitHub/omniperf-pub/workloads/vcopy/mi200/perfmon/SQ_INST_LEVEL_SMEM.txt'
RPL: output dir '/tmp/rpl_data_230411_165021_26406'
RPL: result dir '/tmp/rpl_data_230411_165021_26406/input0_results_230411_165021'
Finished allocating vectors on the CPU
ROCProfiler: input from "/tmp/rpl_data_230411_165021_26406/input0.xml"
Current input file: /home/auser/repos/omniperf/sample/workloads/vcopy/MI200/perfmon/pmc_perf_11.txt
RPL: on '240301_151506' from '/opt/rocm-5.7.1' in '/home/auser/repos/omniperf/sample'
RPL: profiling '""./vcopy -n 1048576 -b 256""'
RPL: input file '/home/auser/repos/omniperf/sample/workloads/vcopy/MI200/perfmon/pmc_perf_11.txt'
RPL: output dir '/tmp/rpl_data_240301_151506_553019'
RPL: result dir '/tmp/rpl_data_240301_151506_553019/input0_results_240301_151506'
ROCProfiler: input from "/tmp/rpl_data_240301_151506_553019/input0.xml"
gpu_index =
kernel =
range =
3 metrics
SQ_INSTS_SMEM, SQ_INST_LEVEL_SMEM, SQ_ACCUM_PREV_HIRES
8 metrics
SQ_INSTS_VALU_MFMA_F16, SQ_INSTS_VALU_MFMA_BF16, SQ_INSTS_VALU_MFMA_F32, SQ_INSTS_VALU_MFMA_F64, SQ_VALU_MFMA_BUSY_CYCLES, SQ_INSTS_FLAT_LDS_ONLY, SQ_INSTS_VALU_MFMA_MOPS_I8, SQ_INSTS_VALU_MFMA_MOPS_F16
vcopy testing on GCD 0
Finished allocating vectors on the CPU
Finished allocating vectors on the GPU
Finished copying vectors to the GPU
sw thinks it moved 1.000000 KB per wave
@@ -159,58 +162,48 @@ Finished copying the output vector from the GPU to the CPU
Releasing GPU memory
Releasing CPU memory
... ...
ROCPRofiler: 1 contexts collected, output directory /tmp/rpl_data_220527_130317_1787038/input_results_220527_130317
File 'workloads/vcopy/mi200/timestamps.csv' is generating
Total detected GPU devices: 2
ROCPRofiler: 1 contexts collected, output directory /tmp/rpl_data_240301_151506_553019/input0_results_240301_151506
File '/home/auser/repos/omniperf/sample/workloads/vcopy/MI200/pmc_perf_11.csv' is generating
...
[profiling] Kernel_Name shortening complete.
[roofline] Checking for roofline.csv in /home/auser/repos/omniperf/sample/workloads/vcopy/MI200
[roofline] No roofline data found. Generating...
Empirical Roofline Calculation
Copyright © 2022 Advanced Micro Devices, Inc. All rights reserved.
Total detected GPU devices: 4
GPU Device 0: Profiling...
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
HBM BW, GPU ID: 0, workgroupSize:256, workgroups:2097152, experiments:100, traffic:8589934592 bytes, duration:6.2 ms, mean:1382.7 GB/sec, stdev=2.4 GB/sec
HBM BW, GPU ID: 0, workgroupSize:256, workgroups:2097152, experiments:100, traffic:8589934592 bytes, duration:6.2 ms, mean:1388.0 GB/sec, stdev=3.1 GB/sec
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
L2 BW, GPU ID: 0, workgroupSize:256, workgroups:8192, experiments:100, traffic:687194767360 bytes, duration:157.9 ms, mean:4358.7 GB/sec, stdev=4.7 GB/sec
L2 BW, GPU ID: 0, workgroupSize:256, workgroups:8192, experiments:100, traffic:687194767360 bytes, duration:136.5 ms, mean:5020.8 GB/sec, stdev=16.5 GB/sec
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
L1 BW, GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, traffic:26843545600 bytes, duration:3.3 ms, mean:8247.1 GB/sec, stdev=5.1 GB/sec
L1 BW, GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, traffic:26843545600 bytes, duration:2.9 ms, mean:9229.5 GB/sec, stdev=2.9 GB/sec
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
LDS BW, GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, traffic:33554432000 bytes, duration:2.4 ms, mean:14246.3 GB/sec, stdev=29.5 GB/sec
LDS BW, GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, traffic:33554432000 bytes, duration:1.9 ms, mean:17645.6 GB/sec, stdev=20.1 GB/sec
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
Peak FLOPs (FP32), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, FLOP:274877906944, duration:14.507 ms, mean:18949.6 GFLOPS, stdev=4.5 GFLOPS
Peak FLOPs (FP32), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, FLOP:274877906944, duration:13.078 ms, mean:20986.9 GFLOPS, stdev=310.8 GFLOPS
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
Peak FLOPs (FP64), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, FLOP:137438953472, duration:7.5 ms, mean:18308.197266.1 GFLOPS, stdev=3.6 GFLOPS
Peak FLOPs (FP64), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, FLOP:137438953472, duration:6.7 ms, mean:20408.029297.1 GFLOPS, stdev=2.7 GFLOPS
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
Peak MFMA FLOPs (BF16), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, FLOP:2147483648000, duration:14.0 ms, mean:153574.8 GFLOPS, stdev=79.9 GFLOPS
Peak MFMA FLOPs (BF16), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, FLOP:2147483648000, duration:12.6 ms, mean:170280.0 GFLOPS, stdev=22.3 GFLOPS
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
Peak MFMA FLOPs (F16), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, FLOP:2147483648000, duration:14.5 ms, mean:147680.1 GFLOPS, stdev=34.7 GFLOPS
Peak MFMA FLOPs (F16), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, FLOP:2147483648000, duration:13.0 ms, mean:164733.6 GFLOPS, stdev=24.3 GFLOPS
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
Peak MFMA FLOPs (F32), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, FLOP:536870912000, duration:14.5 ms, mean:37142.1 GFLOPS, stdev=8.4 GFLOPS
Peak MFMA FLOPs (F32), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, FLOP:536870912000, duration:13.0 ms, mean:41399.6 GFLOPS, stdev=4.1 GFLOPS
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
Peak MFMA FLOPs (F64), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, FLOP:268435456000, duration:7.3 ms, mean:36919.5 GFLOPS, stdev=14.1 GFLOPS
Peak MFMA FLOPs (F64), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, FLOP:268435456000, duration:6.5 ms, mean:41379.2 GFLOPS, stdev=4.4 GFLOPS
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
Peak MFMA IOPs (I8), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, IOP:2147483648000, duration:14.4 ms, mean:149570.6 GOPS, stdev=41.7 GOPS
Peak MFMA IOPs (I8), GPU ID: 0, workgroupSize:256, workgroups:16384, experiments:100, IOP:2147483648000, duration:12.9 ms, mean:166281.9 GOPS, stdev=2495.9 GOPS
GPU Device 1: Profiling...
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
HBM BW, GPU ID: 1, workgroupSize:256, workgroups:2097152, experiments:100, traffic:8589934592 bytes, duration:6.2 ms, mean:1382.7 GB/sec, stdev=2.9 GB/sec
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
L2 BW, GPU ID: 1, workgroupSize:256, workgroups:8192, experiments:100, traffic:687194767360 bytes, duration:157.6 ms, mean:4371.0 GB/sec, stdev=4.1 GB/sec
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
L1 BW, GPU ID: 1, workgroupSize:256, workgroups:16384, experiments:100, traffic:26843545600 bytes, duration:3.2 ms, mean:8297.4 GB/sec, stdev=11.6 GB/sec
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
LDS BW, GPU ID: 1, workgroupSize:256, workgroups:16384, experiments:100, traffic:33554432000 bytes, duration:1.8 ms, mean:18839.2 GB/sec, stdev=44.5 GB/sec
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
Peak FLOPs (FP32), GPU ID: 1, workgroupSize:256, workgroups:16384, experiments:100, FLOP:274877906944, duration:14.441 ms, mean:19037.6 GFLOPS, stdev=2.7 GFLOPS
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
Peak FLOPs (FP64), GPU ID: 1, workgroupSize:256, workgroups:16384, experiments:100, FLOP:137438953472, duration:7.5 ms, mean:18402.255859.1 GFLOPS, stdev=20.1 GFLOPS
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
Peak MFMA FLOPs (BF16), GPU ID: 1, workgroupSize:256, workgroups:16384, experiments:100, FLOP:2147483648000, duration:13.9 ms, mean:154240.3 GFLOPS, stdev=119.3 GFLOPS
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
Peak MFMA FLOPs (F16), GPU ID: 1, workgroupSize:256, workgroups:16384, experiments:100, FLOP:2147483648000, duration:14.5 ms, mean:148450.1 GFLOPS, stdev=112.6 GFLOPS
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
Peak MFMA FLOPs (F32), GPU ID: 1, workgroupSize:256, workgroups:16384, experiments:100, FLOP:536870912000, duration:14.4 ms, mean:37335.2 GFLOPS, stdev=43.1 GFLOPS
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
Peak MFMA FLOPs (F64), GPU ID: 1, workgroupSize:256, workgroups:16384, experiments:100, FLOP:268435456000, duration:7.2 ms, mean:37105.3 GFLOPS, stdev=39.5 GFLOPS
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
Peak MFMA IOPs (I8), GPU ID: 1, workgroupSize:256, workgroups:16384, experiments:100, IOP:2147483648000, duration:14.3 ms, mean:150317.8 GOPS, stdev=203.5 GOPS
...
GPU Device 2: Profiling...
...
GPU Device 3: Profiling...
...
Peak MFMA IOPs (I8), GPU ID: 3, workgroupSize:256, workgroups:16384, experiments:100, IOP:2147483648000, duration:12.9 ms, mean:166686.0 GOPS, stdev=11.2 GOPS
```
You will notice two stages in *default* Omniperf profiling. The first stage collects all the counters needed for Omniperf analysis (omitting any filters you have provided). The second stage collects data for the roofline analysis (this stage can be disabled using `--no-roof`)
You will notice two main stages in *default* Omniperf profiling. The first stage collects all the counters needed for Omniperf analysis (omitting any filters you have provided). The second stage collects data for the roofline analysis (this stage can be disabled using `--no-roof`).
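As a rough sketch of skipping the second stage, the snippet below assembles a profile command that includes `--no-roof` (a flag listed in the help output above) and prints it rather than running it. The `build_profile_cmd` helper and its argument order are hypothetical, introduced only for illustration; they are not part of Omniperf.

```shell
# Hypothetical helper (not part of Omniperf): print an `omniperf profile`
# command line, optionally adding --no-roof to skip the roofline stage.
build_profile_cmd() {
    name="$1"       # workload name passed to --name
    roofline="$2"   # "no-roof" skips the roofline stage; anything else keeps it
    shift 2         # remaining arguments form the application command
    if [ "$roofline" = "no-roof" ]; then
        printf '%s\n' "omniperf profile --name $name --no-roof -- $*"
    else
        printf '%s\n' "omniperf profile --name $name -- $*"
    fi
}

# Print the command for a counter-only run of the vcopy example.
build_profile_cmd vcopy no-roof ./vcopy -n 1048576 -b 256
```

Printing the command first makes it easy to review before pasting it into a session on a machine where Omniperf is installed.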
In this document, we use the term System on Chip (SoC) to refer to a particular family of accelerators. At the end of profiling, all resulting CSV files should be located in an SoC-specific target directory, e.g.:
- "MI200" for the AMD Instinct (tm) MI200 family of accelerators
@@ -220,21 +213,19 @@ etc. The SoC names are generated as a part of Omniperf, and do not necessarily
> Note: Additionally, you will notice a few extra files. An SoC parameters file, *sysinfo.csv*, is created to reflect the target device settings. All profiling output is stored in *log.txt*. Roofline specific benchmark results are stored in *roofline.csv*.
```shell-session
$ ls workloads/vcopy/mi200/
$ ls workloads/vcopy/MI200/
total 112
drwxrwxr-x 3 colramos colramos 4096 Apr 11 16:42 .
drwxrwxr-x 3 colramos colramos 4096 Apr 11 16:42 ..
-rw-rw-r-- 1 colramos colramos 40750 Apr 11 16:44 log.txt
drwxrwxr-x 2 colramos colramos 4096 Apr 11 16:42 perfmon
-rw-rw-r-- 1 colramos colramos 25877 Apr 11 16:42 pmc_perf.csv
-rw-rw-r-- 1 colramos colramos 1716 Apr 11 16:44 roofline.csv
-rw-rw-r-- 1 colramos colramos 429 Apr 11 16:42 SQ_IFETCH_LEVEL.csv
-rw-rw-r-- 1 colramos colramos 366 Apr 11 16:42 SQ_INST_LEVEL_LDS.csv
-rw-rw-r-- 1 colramos colramos 391 Apr 11 16:42 SQ_INST_LEVEL_SMEM.csv
-rw-rw-r-- 1 colramos colramos 384 Apr 11 16:42 SQ_INST_LEVEL_VMEM.csv
-rw-rw-r-- 1 colramos colramos 509 Apr 11 16:42 SQ_LEVEL_WAVES.csv
-rw-rw-r-- 1 colramos colramos 498 Apr 11 16:42 sysinfo.csv
-rw-rw-r-- 1 colramos colramos 309 Apr 11 16:42 timestamps.csv
total 60
drwxr-xr-x 1 auser agroup 0 Mar 1 15:15 perfmon
-rw-r--r-- 1 auser agroup 26175 Mar 1 15:15 pmc_perf.csv
-rw-r--r-- 1 auser agroup 1708 Mar 1 15:17 roofline.csv
-rw-r--r-- 1 auser agroup 519 Mar 1 15:15 SQ_IFETCH_LEVEL.csv
-rw-r--r-- 1 auser agroup 456 Mar 1 15:15 SQ_INST_LEVEL_LDS.csv
-rw-r--r-- 1 auser agroup 474 Mar 1 15:15 SQ_INST_LEVEL_SMEM.csv
-rw-r--r-- 1 auser agroup 474 Mar 1 15:15 SQ_INST_LEVEL_VMEM.csv
-rw-r--r-- 1 auser agroup 599 Mar 1 15:15 SQ_LEVEL_WAVES.csv
-rw-r--r-- 1 auser agroup 650 Mar 1 15:15 sysinfo.csv
-rw-r--r-- 1 auser agroup 399 Mar 1 15:15 timestamps.csv
```
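A quick way to confirm that a profiling run produced the files shown in the listing above is sketched below. The `check_workload_dir` helper is hypothetical (not part of Omniperf), and treating `roofline.csv` as optional rests on the assumption that it is not written when profiling with `--no-roof`.

```shell
# Hypothetical helper (not part of Omniperf): report which of the files shown
# in the listing above are missing from a workload directory.
check_workload_dir() {
    dir="$1"
    missing=0
    for f in pmc_perf.csv sysinfo.csv timestamps.csv; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    # Assumption: roofline.csv is absent after a --no-roof run, so its
    # absence is not counted as an error here.
    return "$missing"
}

# Example: check the directory used throughout this section, if it exists.
if [ -d workloads/vcopy/MI200 ]; then
    check_workload_dir workloads/vcopy/MI200
fi
```

The function prints one line per missing file and returns a non-zero status if any required file is absent, so it can gate a later `omniperf analyze` call in a script.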
### Filtering
@@ -261,38 +252,36 @@ One can profile specific hardware components to speed up the profiling process.
The following example only gathers hardware counters for the Shader Sequencer (SQ) and L2 Cache (TCC) components, skipping all other hardware components:
```shell-session
$ omniperf profile --name vcopy -b SQ TCC -- ./sample/vcopy 1048576 256
Resolving rocprof
ROC Profiler: /usr/bin/rocprof
$ omniperf profile --name vcopy -b SQ TCC -- ./vcopy -n 1048576 -b 256
ROC Profiler: /opt/rocm-5.7.1/bin/rocprof
Execution mode = profile
-------------
Profile only
-------------
omniperf ver: 1.0.8-PR1
Path: /home/colramos/GitHub/omniperf-pub/workloads
Target: mi200
Command: /home/colramos/vcopy 1048576 256
Kernel Selection: None
Dispatch Selection: None
IP Blocks: ['SQ', 'TCC']
fname: pmc_sq_perf2: Added
fname: pmc_td_perf: Skipped
fname: pmc_tcc2_perf: Skipped
fname: pmc_tcp_perf: Skipped
SoC = {'MI200'}
Profiler choice = rocprofv1
fname: pmc_sq_perf8: Added
fname: pmc_spi_perf: Skipped
fname: pmc_sq_perf4: Added
fname: pmc_sq_perf6: Added
fname: pmc_cpf_perf: Skipped
fname: pmc_sqc_perf1: Skipped
fname: pmc_tcc_perf: Added
fname: pmc_cpf_perf: Skipped
fname: pmc_sq_perf8: Added
fname: pmc_tcc2_perf: Skipped
fname: pmc_sq_perf2: Added
fname: pmc_cpc_perf: Skipped
fname: pmc_td_perf: Skipped
fname: pmc_tcp_perf: Skipped
fname: pmc_sq_perf1: Added
fname: pmc_ta_perf: Skipped
fname: pmc_sq_perf3: Added
fname: pmc_sq_perf6: Added
Log: /home/colramos/GitHub/omniperf-pub/workloads/vcopy/mi200/log.txt
fname: pmc_ta_perf: Skipped
omniperf ver: 1.0.10
Path: /home/auser/repos/omniperf/sample/workloads/vcopy/MI200
Target: MI200
Command: ./vcopy -n 1048576 -b 256
Kernel Selection: None
Dispatch Selection: None
IP Blocks: ['sq', 'tcc']
KernelName verbose: 2
...
```
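The `fname: ...: Added` / `Skipped` lines in the log above indicate which counter input files were kept for the selected IP blocks. As a minimal sketch (not part of Omniperf; the helper name `summarize_counter_files` is hypothetical), a saved copy of that log text could be grouped by status like this:

```python
import re

# A few lines copied from the profiling log shown above.
log_text = """\
fname: pmc_sq_perf2: Added
fname: pmc_td_perf: Skipped
fname: pmc_tcc_perf: Added
fname: pmc_tcc2_perf: Skipped
"""


def summarize_counter_files(text):
    """Group counter input files by the status the log reports for them."""
    # Hypothetical helper: it only parses log text and assumes the
    # "fname: <name>: <status>" line shape seen in the output above.
    pattern = re.compile(r"^\s*fname:\s*(\S+):\s*(Added|Skipped)\s*$")
    summary = {"Added": [], "Skipped": []}
    for line in text.splitlines():
        match = pattern.match(line)
        if match:
            name, status = match.groups()
            summary[status].append(name)
    return summary


summary = summarize_counter_files(log_text)
print("Added:  ", summary["Added"])
print("Skipped:", summary["Skipped"])
```

This only reads the log; it makes no assumption about how Omniperf decides which files belong to which hardware block.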
@@ -301,35 +290,32 @@ Kernel filtering is based on the name of the kernel(s) you would like to isolate
The following example profiles only the kernel(s) whose name matches the substring "vecCopy":
```shell-session
$ omniperf profile --name vcopy -k vecCopy -- ./vcopy 1048576 256
Resolving rocprof
ROC Profiler: /usr/bin/rocprof
$ omniperf profile --name vcopy -k vecCopy -- ./vcopy -n 1048576 -b 256
ROC Profiler: /opt/rocm-5.7.1/bin/rocprof
Execution mode = profile
-------------
Profile only
-------------
omniperf ver: 1.0.8-PR1
Path: /home/colramos/GitHub/omniperf-pub/workloads
Target: mi200
Command: /home/colramos/vcopy 1048576 256
Kernel Selection: ['vecCopy']
Dispatch Selection: None
SoC = {'MI200'}
Profiler choice = rocprofv1
omniperf ver: 1.0.10
Path: /home/auser/repos/omniperf/sample/workloads/vcopy/MI200
Target: MI200
Command: ./vcopy -n 1048576 -b 256
Kernel Selection: ['vecCopy']
Dispatch Selection: None
IP Blocks: All
Log: /home/colramos/GitHub/omniperf-pub/workloads/vcopy/mi200/log.txt
KernelName verbose: 2
/home/colramos/GitHub/omniperf-pub/workloads/vcopy/mi200/perfmon/SQ_INST_LEVEL_SMEM.txt
RPL: on '230411_170300' from '/opt/rocm-5.2.1' in '/home/colramos/GitHub/omniperf-pub'
RPL: profiling '""/home/colramos/vcopy 1048576 256""'
RPL: input file '/home/colramos/GitHub/omniperf-pub/workloads/vcopy/mi200/perfmon/SQ_INST_LEVEL_SMEM.txt'
RPL: output dir '/tmp/rpl_data_230411_170300_29696'
RPL: result dir '/tmp/rpl_data_230411_170300_29696/input0_results_230411_170300'
Finished allocating vectors on the CPU
ROCProfiler: input from "/tmp/rpl_data_230411_170300_29696/input0.xml"
Current input file: /home/auser/repos/omniperf/sample/workloads/vcopy/MI200/perfmon/pmc_perf_12.txt
RPL: on '240301_152305' from '/opt/rocm-5.7.1' in '/home/auser/repos/omniperf/sample'
RPL: profiling '""./vcopy -n 1048576 -b 256""'
RPL: input file '/home/auser/repos/omniperf/sample/workloads/vcopy/MI200/perfmon/pmc_perf_12.txt'
RPL: output dir '/tmp/rpl_data_240301_152305_562565'
RPL: result dir '/tmp/rpl_data_240301_152305_562565/input0_results_240301_152305'
ROCProfiler: input from "/tmp/rpl_data_240301_152305_562565/input0.xml"
gpu_index =
kernel = vecCopy
... ...
...
```
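To illustrate what matching on a kernel-name substring means, here is a minimal sketch. It is an assumption about the general idea, not Omniperf's implementation; `select_kernels` and the second kernel name are hypothetical.

```python
def select_kernels(kernel_names, substrings):
    """Return the kernel names that contain at least one requested substring."""
    # Hypothetical helper illustrating substring-based kernel selection.
    return [
        name
        for name in kernel_names
        if any(substring in name for substring in substrings)
    ]


kernels = [
    "vecCopy(double*, double*, double*, int, int)",  # kernel from the vcopy sample
    "otherKernel(float*, int)",                      # hypothetical second kernel
]

selected = select_kernels(kernels, ["vecCopy"])
print(selected)
```

With this kind of matching, a short substring can select several kernels at once, so a more specific substring narrows the selection.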
#### Dispatch Filtering
@@ -337,34 +323,33 @@ Dispatch filtering is based on the *global* dispatch index of kernels in a run.
The following example profiles only the 0th kernel dispatched during the application's execution:
```shell-session
$ omniperf profile --name vcopy -d 0 -- ./vcopy 1048576 256
Resolving rocprof
ROC Profiler: /usr/bin/rocprof
$ omniperf profile --name vcopy -d 0 -- ./vcopy -n 1048576 -b 256
ROC Profiler: /opt/rocm-5.7.1/bin/rocprof
Execution mode = profile
-------------
Profile only
-------------
omniperf ver: 1.0.8-PR1
Path: /home/colramos/GitHub/omniperf-pub/workloads
Target: mi200
Command: /home/colramos/vcopy 1048576 256
Kernel Selection: None
Dispatch Selection: ['0']
SoC = {'MI200'}
Profiler choice = rocprofv1
omniperf ver: 1.0.10
Path: /home/auser/repos/omniperf/sample/workloads/vcopy/MI200
Target: MI200
Command: ./vcopy -n 1048576 -b 256
Kernel Selection: None
Dispatch Selection: ['0']
IP Blocks: All
Log: /home/colramos/GitHub/omniperf-pub/workloads/vcopy/mi200/log.txt
KernelName verbose: 2
/home/colramos/GitHub/omniperf-pub/workloads/vcopy/mi200/perfmon/SQ_INST_LEVEL_SMEM.txt
RPL: on '230411_170356' from '/opt/rocm-5.2.1' in '/home/colramos/GitHub/omniperf-pub'
RPL: profiling '""/home/colramos/vcopy 1048576 256""'
RPL: input file '/home/colramos/GitHub/omniperf-pub/workloads/vcopy/mi200/perfmon/SQ_INST_LEVEL_SMEM.txt'
RPL: output dir '/tmp/rpl_data_230411_170356_30314'
RPL: result dir '/tmp/rpl_data_230411_170356_30314/input0_results_230411_170356'
Finished allocating vectors on the CPU
ROCProfiler: input from "/tmp/rpl_data_230411_170356_30314/input0.xml"
Current input file: /home/auser/repos/omniperf/sample/workloads/vcopy/MI200/perfmon/timestamps.txt
RPL: on '240301_152445' from '/opt/rocm-5.7.1' in '/home/auser/repos/omniperf/sample'
RPL: profiling '""./vcopy -n 1048576 -b 256""'
RPL: input file '/home/auser/repos/omniperf/sample/workloads/vcopy/MI200/perfmon/timestamps.txt'
RPL: output dir '/tmp/rpl_data_240301_152445_563349'
RPL: result dir '/tmp/rpl_data_240301_152445_563349/input0_results_240301_152445'
ROCProfiler: input from "/tmp/rpl_data_240301_152445_563349/input0.xml"
gpu_index =
kernel =
range = 0
...
```
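The idea of a *global* dispatch index can be sketched as follows: dispatches are numbered in launch order across the whole run, regardless of kernel name, and only the requested indices are kept. This is an illustrative assumption rather than Omniperf's code; `select_dispatches` and the launch list are hypothetical.

```python
def select_dispatches(dispatched_kernels, dispatch_ids):
    """Keep (index, kernel name) pairs whose global launch index was requested."""
    # Hypothetical helper: indices count every dispatch in launch order,
    # starting from 0, independent of which kernel was launched.
    wanted = set(dispatch_ids)
    return [
        (index, name)
        for index, name in enumerate(dispatched_kernels)
        if index in wanted
    ]


# Hypothetical launch order for an application with three dispatches.
launches = ["vecCopy", "otherKernel", "vecCopy"]

print(select_dispatches(launches, [0]))
print(select_dispatches(launches, [0, 2]))
```

Because the index is global, the second `vecCopy` launch in this sketch has index 2, not 1.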
@@ -386,42 +371,31 @@ Standalone Roofline Options:
#### Roofline Only
The following example demonstrates profiling roofline data only:
```shell-session
$ omniperf profile --name vcopy --roof-only -- ./vcopy 1048576 256
Resolving rocprof
ROC Profiler: /usr/bin/rocprof
$ omniperf profile --name vcopy --roof-only -- ./vcopy -n 1048576 -b 256
--------
Roofline only
--------
Checking for roofline.csv in /home/colramos/GitHub/omniperf-pub/workloads/vcopy/mi200
No roofline data found. Generating...
...
[roofline] Checking for roofline.csv in /home/auser/repos/omniperf/sample/workloads/vcopy/MI200
[roofline] No roofline data found. Generating...
Checking for roofline.csv in /home/auser/repos/omniperf/sample/workloads/vcopy/MI200
Empirical Roofline Calculation
Copyright © 2022 Advanced Micro Devices, Inc. All rights reserved.
Total detected GPU devices: 4
GPU Device 0: Profiling...
99% [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
... ...
Checking for roofline.csv in /home/colramos/GitHub/omniperf-pub/workloads/mix/mi200
Checking for sysinfo.csv in /home/colramos/GitHub/omniperf-pub/workloads/mix/mi200
Checking for pmc_perf.csv in /home/colramos/GitHub/omniperf-pub/workloads/mix/mi200
...
Empirical Roofline PDFs saved!
```
An inspection of our workload output folder shows that the .pdf plots were generated successfully:
```shell-session
$ ls workloads/vcopy/mi200/
total 176
drwxrwxr-x 3 colramos colramos 4096 Apr 11 17:18 .
drwxrwxr-x 3 colramos colramos 4096 Apr 11 17:15 ..
-rw-rw-r-- 1 colramos colramos 13271 Apr 11 17:18 empirRoof_gpu-ALL_fp32.pdf
-rw-rw-r-- 1 colramos colramos 13175 Apr 11 17:18 empirRoof_gpu-ALL_int8_fp16.pdf
-rw-rw-r-- 1 colramos colramos 26560 Apr 11 17:16 log.txt
drwxrwxr-x 2 colramos colramos 4096 Apr 11 17:16 perfmon
-rw-rw-r-- 1 colramos colramos 54031 Apr 11 17:16 pmc_perf.csv
-rw-rw-r-- 1 colramos colramos 1714 Apr 11 17:16 roofline.csv
-rw-rw-r-- 1 colramos colramos 457 Apr 11 17:16 sysinfo.csv
-rw-rw-r-- 1 colramos colramos 37521 Apr 11 17:16 timestamps.csv
$ ls workloads/vcopy/MI200/
total 48
-rw-r--r-- 1 auser agroup 13331 Mar 1 16:05 empirRoof_gpu-0_fp32.pdf
-rw-r--r-- 1 auser agroup 13136 Mar 1 16:05 empirRoof_gpu-0_int8_fp16.pdf
drwxr-xr-x 1 auser agroup 0 Mar 1 16:03 perfmon
-rw-r--r-- 1 auser agroup 1101 Mar 1 16:03 pmc_perf.csv
-rw-r--r-- 1 auser agroup 1715 Mar 1 16:05 roofline.csv
-rw-r--r-- 1 auser agroup 650 Mar 1 16:03 sysinfo.csv
-rw-r--r-- 1 auser agroup 399 Mar 1 16:03 timestamps.csv
```
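The same check can be scripted. The sketch below assumes only the `empirRoof_*.pdf` naming visible in the listing above; the helper `find_roofline_plots` is hypothetical, and the directory path should be adjusted to your own workload.

```python
from pathlib import Path


def find_roofline_plots(workload_dir):
    """Return the sorted names of roofline PDF plots found in workload_dir."""
    # Hypothetical helper: relies on the "empirRoof_" file name prefix
    # seen in the directory listing above.
    directory = Path(workload_dir)
    if not directory.is_dir():
        return []
    return sorted(path.name for path in directory.glob("empirRoof_*.pdf"))


plots = find_roofline_plots("workloads/vcopy/MI200")
if plots:
    print("Roofline plots found:", ", ".join(plots))
else:
    print("No roofline plots found.")
```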
A sample *empirRoof_gpu-ALL_fp32.pdf* looks something like this: