# rocprof
## 1. Overview
rocProf is a command-line tool built on top of the rocProfiler and rocTracer APIs. The source code for rocProf is available here:
GitHub: https://github.com/ROCm-Developer-Tools/rocprofiler/blob/amd-master/bin/rocprof
The tool is implemented as a script that sets up the environment for attaching the profiler and then runs the provided application command line. It uses two profiling plugins, loaded by the ROC runtime and based on rocProfiler and rocTracer, to collect metrics/counters, HW traces and runtime API/activity traces. The tool takes an input XML or text file with a counters list or trace parameters and produces profiling data and statistics in several formats: text, CSV and JSON traces. Google Chrome tracing can be used to visualize the JSON traces with runtime API/activity timelines and per-kernel counter data.
## 2. Profiling Modes
`rocprof` can be used for GPU profiling with HW counters and for application tracing.
### 2.1. GPU profiling
GPU profiling is controlled by an input file that defines a list of metrics/counters and a profiling scope. The input file is passed with the option `-i [input file]`. The tool generates an output CSV file with one line per submitted kernel; each line contains the kernel name, kernel parameters and counter values. With the option `--stats`, kernel execution statistics can also be generated in CSV format. Currently, profiling serializes the submitted kernels.
An example input file:
```
# Perf counters group 1
pmc : Wavefronts VALUInsts SALUInsts SFetchInsts
# Perf counters group 2
pmc : TCC_HIT[0], TCC_MISS[0]
# Filter by dispatches range, GPU index and kernel names
# supported range formats: "3:9", "3:", "3"
range: 1 : 4
gpu: 0 1 2 3
kernel: simple Pass1 simpleConvolutionPass2
```
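To make the layout of the input file concrete, here is a minimal Python sketch that reads the example above into counter groups and scope settings. `ProfileInput` and `parse_input` are hypothetical helper names, not part of rocprof, and the parsing rules (counters separated by spaces or commas, `#` starting a comment) are assumptions inferred from the example only.

```python
from dataclasses import dataclass, field


@dataclass
class ProfileInput:
    """Parsed view of a rocprof text input file (hypothetical helper type)."""
    pmc_groups: list = field(default_factory=list)  # one counter list per 'pmc' line
    settings: dict = field(default_factory=dict)    # e.g. 'range', 'gpu', 'kernel'


def parse_input(text: str) -> ProfileInput:
    result = ProfileInput()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'key: value' line
        key = key.strip()
        if key == "pmc":
            # Assumption: counters may be separated by spaces and/or commas.
            result.pmc_groups.append(value.replace(",", " ").split())
        elif key == "range":
            # Keep the range as a compact string such as "1:4".
            result.settings[key] = value.replace(" ", "")
        else:
            result.settings[key] = value.split()
    return result


EXAMPLE = """\
# Perf counters group 1
pmc : Wavefronts VALUInsts SALUInsts SFetchInsts
# Perf counters group 2
pmc : TCC_HIT[0], TCC_MISS[0]
range: 1 : 4
gpu: 0 1 2 3
kernel: simple Pass1 simpleConvolutionPass2
"""

parsed = parse_input(EXAMPLE)
print(parsed.pmc_groups)
print(parsed.settings)
```

Each `pmc` line becomes its own group, which matches the way the file lists counter groups one per line.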
An example of a profiling command line for the `MatrixTranspose` application:
```
$ rocprof -i input.txt MatrixTranspose
RPL: on '191018_011134' from '/…./rocprofiler_pkg' in '/…./MatrixTranspose'
RPL: profiling '"./MatrixTranspose"'
RPL: input file 'input.txt'
RPL: output dir '/tmp/rpl_data_191018_011134_9695'
RPL: result dir '/tmp/rpl_data_191018_011134_9695/input0_results_191018_011134'
ROCProfiler: rc-file '/…./rpl_rc.xml'
ROCProfiler: input from "/tmp/rpl_data_191018_011134_9695/input0.xml"
gpu_index =
kernel =
range =
4 metrics
L2CacheHit, VFetchInsts, VWriteInsts, MemUnitStalled
0 traces
Device name Ellesmere [Radeon RX 470/480/570/570X/580/580X]
PASSED!
ROCprofiler: 1 contexts collected, output directory /tmp/rpl_data_191018_011134_9695/input0_results_191018_011134
RPL: '/…./MatrixTranspose/input.csv' is generated
```
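Since the output CSV has one line per submitted kernel, it is easy to post-process. The following Python sketch sums one counter column per kernel name. The function `total_by_kernel` is a hypothetical helper, the column name `KernelName` is an assumption about the CSV header, and the sample rows are made up for illustration; check the header of your own output file before relying on these names.

```python
import csv
import io
from collections import defaultdict


def total_by_kernel(csv_file, counter, name_column="KernelName"):
    """Sum one counter column per kernel name.

    Assumption: the CSV has a header row containing `name_column`
    and a numeric column named `counter`.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        totals[row[name_column]] += float(row[counter])
    return dict(totals)


# Made-up sample rows for illustration only; not real profiler output.
SAMPLE = io.StringIO(
    "Index,KernelName,Wavefronts\n"
    "0,kernelA,64\n"
    "1,kernelB,128\n"
    "2,kernelA,64\n"
)
print(total_by_kernel(SAMPLE, "Wavefronts"))
```

For a real run, open the generated CSV file and pass the file object instead of the in-memory sample.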
#### 2.1.1. Counters and metrics
There are two profiling features: metrics and traces. Hardware performance counters are treated as basic metrics, and formulas can be defined for derived metrics.
Counters and metrics can be dynamically configured using XML configuration files with counters and metrics tables:
- Counters table entry, basic metric: counter name, block name, event id
- Derived metrics table entry: metric name, an expression for calculating the metric from the counters
Metrics XML File Example:
```
. . .
. . .
```
##### 2.1.1.1. Metrics query
Available counters and metrics can be queried with the options `--list-basic` for counters and `--list-derived` for derived metrics. The counters output shows the number of block instances and the number of counter registers per block. The derived metrics output prints the metric expressions.
Examples:
```
$ rocprof --list-basic
RPL: on '191018_014450' from '/opt/rocm/rocprofiler' in '/…./MatrixTranspose'
ROCProfiler: rc-file '/…./rpl_rc.xml'
Basic HW counters:
gpu-agent0 : GRBM_COUNT : Tie High - Count Number of Clocks
block GRBM has 2 counters
gpu-agent0 : GRBM_GUI_ACTIVE : The GUI is Active
block GRBM has 2 counters
. . .
gpu-agent0 : TCC_HIT[0-15] : Number of cache hits.
block TCC has 4 counters
gpu-agent0 : TCC_MISS[0-15] : Number of cache misses. UC reads count as misses.
block TCC has 4 counters
. . .
$ rocprof --list-derived
RPL: on '191018_015911' from '/opt/rocm/rocprofiler' in '/home/evgeny/work/BUILD/0_MatrixTranspose'
ROCProfiler: rc-file '/home/evgeny/rpl_rc.xml'
Derived metrics:
gpu-agent0 : TCC_HIT_sum : Number of cache hits. Sum over TCC instances.
TCC_HIT_sum = sum(TCC_HIT,16)
gpu-agent0 : TCC_MISS_sum : Number of cache misses. Sum over TCC instances.
TCC_MISS_sum = sum(TCC_MISS,16)
gpu-agent0 : TCC_MC_RDREQ_sum : Number of 32-byte reads. Sum over TCC instances.
TCC_MC_RDREQ_sum = sum(TCC_MC_RDREQ,16)
. . .
```
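The derived metrics listed above are all of the form `sum(NAME,N)`. As a rough illustration, the Python sketch below evaluates such an expression by adding up per-instance counter values. It assumes that `sum(NAME,N)` means `NAME[0] + ... + NAME[N-1]`, which is inferred from the "Sum over TCC instances" descriptions and not confirmed here. `eval_sum` is a hypothetical helper that handles only this one expression form, and the counter values are made up.

```python
import re

# Matches expressions such as "sum(TCC_HIT,16)".
SUM_EXPR = re.compile(r"^\s*sum\(\s*(\w+)\s*,\s*(\d+)\s*\)\s*$")


def eval_sum(expr, counters):
    """Evaluate 'sum(NAME,N)' as NAME[0] + ... + NAME[N-1] (assumed semantics)."""
    match = SUM_EXPR.match(expr)
    if match is None:
        raise ValueError(f"unsupported expression: {expr!r}")
    name, count = match.group(1), int(match.group(2))
    return sum(counters[f"{name}[{i}]"] for i in range(count))


# Made-up per-instance values for illustration only.
values = {f"TCC_HIT[{i}]": 10 * (i + 1) for i in range(16)}
print(eval_sum("sum(TCC_HIT,16)", values))
```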
##### 2.1.1.2. Metrics collecting
Counters and metrics accumulated per kernel can be collected using an input file with a list of metrics; see the example in 2.1.
Currently, profiling serializes the submitted kernels.
The number of counters that can be collected in one run is limited by the GPU HW; see 2.1.1.2.2.
###### 2.1.1.2.1. Blocks instancing
GPU blocks are implemented as several identical instances. To collect the counters of a specific instance, use square brackets with the instance index; see the example in 2.1.
The number of block instances can be queried; see 2.1.1.1.
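The instance counts and per-block register limits can be pulled out of the `--list-basic` listing shown in 2.1.1.1. The Python sketch below does this with two regular expressions written against that sample output only, so the real output may contain lines these patterns do not cover. `parse_list_basic` is a hypothetical helper, and treating a counter without a bracketed range as having one instance is an assumption.

```python
import re

# "gpu-agent0 : TCC_HIT[0-15] : Number of cache hits."
COUNTER_LINE = re.compile(r"^(\S+) : (\w+)(?:\[(\d+)-(\d+)\])? : (.*)$")
# "block TCC has 4 counters"
BLOCK_LINE = re.compile(r"^block (\w+) has (\d+) counters$")


def parse_list_basic(text):
    """Return (instances per counter, counter registers per block)."""
    instances = {}
    block_limits = {}
    for raw in text.splitlines():
        line = raw.strip()
        m = COUNTER_LINE.match(line)
        if m:
            _agent, name, first, last, _descr = m.groups()
            # Assumption: no "[a-b]" suffix means a single instance.
            instances[name] = int(last) - int(first) + 1 if first is not None else 1
            continue
        m = BLOCK_LINE.match(line)
        if m:
            block_limits[m.group(1)] = int(m.group(2))
    return instances, block_limits


SAMPLE = """\
Basic HW counters:
  gpu-agent0 : GRBM_COUNT : Tie High - Count Number of Clocks
      block GRBM has 2 counters
  gpu-agent0 : TCC_HIT[0-15] : Number of cache hits.
      block TCC has 4 counters
"""

instances, block_limits = parse_list_basic(SAMPLE)
print(instances)
print(block_limits)
```

The sketch keeps one entry per counter name, so with several GPU agents the last agent listed wins.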
###### 2.1.1.2.2. HW limitations
The number of counters that can be collected in one run is limited by the GPU HW, specifically by the number of counter registers per block. This number can differ between blocks and can be queried; see 2.1.1.1.
- Metrics groups
To collect a list of metrics that exceeds the HW limits, the metrics list can be split into groups.
The tool supports automatic splitting into optimal metric groups:
```
$ rocprof -i input.txt ./MatrixTranspose
RPL: on '191018_032645' from '/opt/rocm/rocprofiler' in '/…./MatrixTranspose'
RPL: profiling './MatrixTranspose'
RPL: input file 'input.txt'
RPL: output dir '/tmp/rpl_data_191018_032645_12106'
RPL: result dir '/tmp/rpl_data_191018_032645_12106/input0_results_191018_032645'
ROCProfiler: rc-file '/…./rpl_rc.xml'
ROCProfiler: input from "/tmp/rpl_data_191018_032645_12106/input0.xml"
gpu_index =
kernel =
range =
20 metrics
Wavefronts, VALUInsts, SALUInsts, SFetchInsts, FlatVMemInsts, LDSInsts, FlatLDSInsts, GDSInsts, VALUUtilization, FetchSize, WriteSize, L2CacheHit, VWriteInsts, GPUBusy, VALUBusy, SALUBusy, MemUnitStalled, WriteUnitStalled, LDSBankConflict, MemUnitBusy
0 traces
Device name Ellesmere [Radeon RX 470/480/570/570X/580/580X]
Input metrics out of HW limit. Proposed metrics group set:
group1: L2CacheHit VWriteInsts MemUnitStalled WriteUnitStalled MemUnitBusy FetchSize FlatVMemInsts LDSInsts VALUInsts SALUInsts SFetchInsts FlatLDSInsts GPUBusy Wavefronts
group2: WriteSize GDSInsts VALUUtilization VALUBusy SALUBusy LDSBankConflict
ERROR: rocprofiler_open(), Construct(), Metrics list exceeds HW limits
Aborted (core dumped)
Error found, profiling aborted.
```
- Collecting with multiple runs
To collect several metric groups, a full application replay is used: define several `pmc:` lines in the input file, one per group; see 2.1.
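The grouping idea can be sketched in Python for basic counters. This is a simple first-fit split, not the tool's own grouping algorithm. It assumes that the block name is the text before the first underscore in the counter name, and the per-block limits used here are hypothetical values chosen to force a split. `block_of`, `split_into_groups` and `to_input_lines` are illustrative names.

```python
from collections import Counter


def block_of(counter):
    """Assumption: the block name is the text before the first underscore."""
    return counter.split("_", 1)[0]


def split_into_groups(counters, block_limits):
    """First-fit: put each counter in the first group whose block still has room."""
    groups, usage = [], []
    for counter in counters:
        block = block_of(counter)
        limit = block_limits[block]
        for group, used in zip(groups, usage):
            if used[block] < limit:
                group.append(counter)
                used[block] += 1
                break
        else:
            # No existing group has room for this block: start a new group.
            groups.append([counter])
            usage.append(Counter({block: 1}))
    return groups


def to_input_lines(groups):
    """Format one 'pmc' line per group for the profiler input file."""
    return "\n".join("pmc : " + " ".join(group) for group in groups)


# Hypothetical limits for illustration; query real ones with --list-basic.
limits = {"TCC": 2, "GRBM": 2}
wanted = ["TCC_HIT[0]", "TCC_MISS[0]", "TCC_MC_RDREQ[0]", "GRBM_COUNT", "GRBM_GUI_ACTIVE"]
groups = split_into_groups(wanted, limits)
print(to_input_lines(groups))
```

Each resulting `pmc` line corresponds to one replay of the application, so fewer groups means fewer runs.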
### 2.2. Application tracing
Supported application tracing includes runtime API and GPU activity tracing.
Supported runtimes: ROCr (HSA API) and HIP.
Supported GPU activity: kernel execution, async memory copy, barrier packets.
The trace is generated in JSON format compatible with Chrome tracing.
The trace consists of several sections with timelines for the per-thread API trace and for GPU activity. The timeline events show the event name and parameters.
Supported options: `--hsa-trace`, `--hip-trace`, `--sys-trace`, where `--sys-trace` produces a combined HIP and HSA trace.
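Because the trace is Chrome-tracing-compatible JSON, it can also be inspected without a browser. The Python sketch below totals the duration of "complete" events per event name. It assumes the file follows the Chrome Trace Event format, where complete events carry `"ph": "X"` and a `"dur"` field; whether a given rocprof trace uses complete events is not stated here. `duration_by_name` is a hypothetical helper and the sample events are made up.

```python
import json
from collections import defaultdict


def duration_by_name(trace_text):
    """Sum the 'dur' field of complete ('X') events per event name."""
    data = json.loads(trace_text)
    # The Trace Event format allows a bare array or an object with "traceEvents".
    events = data["traceEvents"] if isinstance(data, dict) else data
    totals = defaultdict(int)
    for event in events:
        if event.get("ph") == "X":
            totals[event["name"]] += event["dur"]
    return dict(totals)


# Made-up events for illustration only; not real profiler output.
SAMPLE = json.dumps({"traceEvents": [
    {"name": "hipMemcpy", "ph": "X", "ts": 0, "dur": 40, "pid": 1, "tid": 1},
    {"name": "hipMemcpy", "ph": "X", "ts": 50, "dur": 60, "pid": 1, "tid": 1},
    {"name": "hipFree", "ph": "X", "ts": 120, "dur": 5, "pid": 1, "tid": 1},
    {"name": "process_name", "ph": "M", "pid": 1, "args": {"name": "demo"}},
]})
print(duration_by_name(SAMPLE))
```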
#### 2.2.1. HIP runtime trace
The trace is generated with the option `--hip-trace` and includes HIP API timelines and GPU activity at the runtime level.
#### 2.2.2. ROCr runtime trace
The trace is generated with the option `--hsa-trace` and includes ROCr API timelines and GPU activity at the AQL queue level. It can also provide counters per kernel.
#### 2.2.3. KFD driver trace
The trace is generated with the option `--kfd-trace` and includes the KFD Thunk API timeline.
Tracing of memory allocations and migration is planned.
#### 2.2.4. Code annotation
Application code annotation is supported.
A start/stop API is supported to control the profiling programmatically.
The `roctx` library provides the annotation API. Annotations are visualized in the JSON trace as a separate "Markers and Ranges" timeline section.
##### 2.2.4.1. Start/stop API
```
// Tracing start API
void roctracer_start();
// Tracing stop API
void roctracer_stop();
```
##### 2.2.4.2. rocTX basic markers API
```
// A marker created by the given ASCII message
void roctxMark(const char* message);
// Starts a nested range associated with the given message.
// Returns the 0-based level of the range being started.
// A negative value is returned on error.
int roctxRangePush(const char* message);
// Marks the end of a nested range.
// Returns the 0-based level of the range.
// A negative value is returned on error.
int roctxRangePop();
```
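The return values documented above can be illustrated with a small pure-Python model of the nesting levels. This is not a binding to the `roctx` library. `RangeStack` is a hypothetical class that mirrors the documented 0-based levels only, and treating a pop with no open range as the error case is an assumption.

```python
class RangeStack:
    """Pure-Python model of nested ranges; not a binding to the roctx library."""

    def __init__(self):
        self._messages = []

    def push(self, message):
        """Start a nested range and return its 0-based level."""
        self._messages.append(message)
        return len(self._messages) - 1

    def pop(self):
        """End the innermost range and return its 0-based level.

        Assumption: popping with no open range is the error case (-1).
        """
        if not self._messages:
            return -1
        self._messages.pop()
        return len(self._messages)


ranges = RangeStack()
outer = ranges.push("frame")        # level 0
inner = ranges.push("upload data")  # level 1, nested inside "frame"
print(outer, inner, ranges.pop(), ranges.pop(), ranges.pop())
```

In this model a push and its matching pop report the same level.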
### 2.3. Multiple GPUs profiling
The profiler supports profiling of multiple GPUs and provides the GPU id for counters and kernel data in the CSV output file. The GPU id is also indicated for the respective GPU activity timeline in the JSON trace.
## 3. Profiling control
Profiling can be controlled by specifying a profiling scope, by filtering trace events and specifying interesting time intervals.
### 3.1. Profiling scope
The counters profiling scope can be specified by a GPU id list, a list of kernel name substrings and a dispatch range.
Examples of supported range formats: "3:9", "3:", "3". See the example input file in 2.1.
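The three range formats can be told apart with a few lines of Python. The sketch below only extracts the numbers. How rocprof interprets a bare "3", or whether the end index is inclusive, is not stated here, so no such semantics are encoded. `DispatchRange` and `parse_range` are hypothetical names.

```python
from typing import NamedTuple, Optional


class DispatchRange(NamedTuple):
    start: int
    end: Optional[int]  # None when no end index is given
    has_colon: bool     # distinguishes "3:" from "3"


def parse_range(text):
    """Split a range string such as "3:9", "3:" or "3" into its parts."""
    compact = text.replace(" ", "")  # the input file example writes "1 : 4"
    start, sep, end = compact.partition(":")
    return DispatchRange(int(start), int(end) if end else None, bool(sep))


for spec in ("3:9", "3:", "3", "1 : 4"):
    print(spec, "->", parse_range(spec))
```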
### 3.2. Tracing control
Tracing can be filtered by event names using the profiler input file, and interesting time intervals can be enabled with a command line option.
#### 3.2.1. Filtering traced APIs
A list of traced API names can be specified in the profiler input file.
An example input file line for the ROCr runtime trace (HSA API):
```
hsa: hsa_queue_create hsa_amd_memory_pool_allocate
```
#### 3.2.2. Tracing time period
The trace can be dumped periodically, with an initial delay, a dumping period length and a rate:
```
--trace-period
```
### 3.3. Concurrent kernels
Profiling of concurrent kernels is not currently supported; it is a planned feature. Kernels are serialized.
### 3.4. Multi-process profiling
Multi-process profiling is not currently supported.
### 3.5. Errors logging
Profiler errors are logged to global logs:
```
/tmp/aql_profile_log.txt
/tmp/rocprofiler_log.txt
/tmp/roctracer_log.txt
```
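After a failed run it can be convenient to print whatever the profiler wrote to these logs. The Python sketch below is one way to do that. `read_logs` is a hypothetical helper, and the only paths it knows about are the three listed above.

```python
from pathlib import Path

LOG_FILES = (
    "/tmp/aql_profile_log.txt",
    "/tmp/rocprofiler_log.txt",
    "/tmp/roctracer_log.txt",
)


def read_logs(paths=LOG_FILES):
    """Return {path: contents} for each log that exists and is non-empty."""
    found = {}
    for name in paths:
        path = Path(name)
        if path.is_file():
            text = path.read_text(errors="replace")
            if text.strip():
                found[name] = text
    return found


for name, text in read_logs().items():
    print(f"== {name} ==")
    print(text.rstrip())
```

Missing or empty log files are skipped, so the script prints nothing when no errors were logged.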
## 4. 3rd party visualization tools
`rocprof` produces a JSON trace compatible with Chrome Tracing, the trace visualization tool built into Google Chrome.
### 4.1. Chrome tracing
A good overview can be found at: https://aras-p.info/blog/2017/01/23/Chrome-Tracing-as-Profiler-Frontend/
## 5. Command line options
The command line options can be printed with the option `-h`:
```
$ rocprof -h
RPL: on '191018_023018' from '/opt/rocm/rocprofiler' in '/…./MatrixTranspose'
ROCm Profiling Library (RPL) run script, a part of ROCprofiler library package.
Full path: /opt/rocm/rocprofiler/bin/rocprof
Metrics definition: /opt/rocm/rocprofiler/lib/metrics.xml
Usage:
rocprof [-h] [--list-basic] [--list-derived] [-i ] [-o