This command-line tool is implemented as a script that sets up the environment for attaching the profiler and then runs the provided application command line. The tool uses two profiling plugins, loaded by the ROC runtime and based on rocProfiler and rocTracer, to collect metrics/counters, HW traces and runtime API/activity traces. It consumes an input XML or text file with a counters list or trace parameters and produces profiling data and statistics in several formats: text, CSV and JSON traces. Google Chrome tracing can be used to visualize the JSON traces, which contain runtime API/activity timelines and per-kernel counter data.
## 2. Profiling Modes
‘rocprof’ can be used for GPU profiling with HW counters and for application tracing.
GPU profiling is controlled by an input file that defines a list of metrics/counters and a profiling scope. The input file is passed with the option ‘-i [input file]’. The tool generates an output CSV file with one line per submitted kernel; each line contains the kernel name, kernel parameters and counter values. With the option ‘--stats’, kernel execution statistics are also generated in CSV format. Profiling currently has the limitation that submitted kernels are serialized.
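As a minimal sketch of post-processing such a CSV file, the Python snippet below sums one counter over all dispatches of each kernel. The column names ‘KernelName’ and ‘TCC_HIT_sum’ and the sample rows are assumptions made for illustration, not a guaranteed layout of the tool's output; adjust them to the header of the file actually produced.

```python
import csv
import io
from collections import defaultdict

# Made-up sample standing in for an output CSV file; the column names
# 'KernelName' and 'TCC_HIT_sum' are assumptions for illustration only.
SAMPLE_CSV = """Index,KernelName,TCC_HIT_sum
0,matrixTranspose,120
1,matrixTranspose,80
2,vectorAdd,50
"""

def counter_total_per_kernel(csv_file, kernel_column, counter_column):
    """Sum one counter column over all dispatches of each kernel."""
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        totals[row[kernel_column]] += float(row[counter_column])
    return dict(totals)

totals = counter_total_per_kernel(
    io.StringIO(SAMPLE_CSV), "KernelName", "TCC_HIT_sum"
)
print(totals)
```

For a real run, the in-memory sample would be replaced by the opened output file.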
There are two profiling features: metrics and traces. Hardware performance counters are treated as basic metrics, and formulas can be defined for derived metrics.
Counters and metrics can be dynamically configured using XML configuration files that contain counter and metric tables.
Available counters and metrics can be queried with the options ‘--list-basic’ for counters and ‘--list-derived’ for derived metrics. The counter output indicates the number of block instances and the number of counter registers per block. The derived-metric output prints the metric expressions.
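To illustrate what a derived-metric expression such as ‘sum(TCC_HIT,16)’ means, here is a small Python sketch that adds up one basic counter over its block instances. The profiler evaluates these expressions itself; the dictionary layout and the sample values below are hypothetical and only serve the illustration.

```python
def derived_sum(samples, counter, instances):
    """Add up a basic counter over its block instances.

    Mirrors an expression of the form sum(<counter>,<instances>).
    """
    return sum(samples[f"{counter}[{i}]"] for i in range(instances))

# Hypothetical per-instance values for TCC_HIT[0] .. TCC_HIT[15].
samples = {f"TCC_HIT[{i}]": 10 * (i + 1) for i in range(16)}

tcc_hit_sum = derived_sum(samples, "TCC_HIT", 16)
print(tcc_hit_sum)
```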
Examples:
```
$ rocprof --list-basic
RPL: on '191018_014450' from '/opt/rocm/rocprofiler' in '/…./MatrixTranspose'
ROCProfiler: rc-file '/…./rpl_rc.xml'
Basic HW counters:
gpu-agent0 : GRBM_COUNT : Tie High - Count Number of Clocks
block GRBM has 2 counters
gpu-agent0 : GRBM_GUI_ACTIVE : The GUI is Active
block GRBM has 2 counters
. . .
gpu-agent0 : TCC_HIT[0-15] : Number of cache hits.
block TCC has 4 counters
gpu-agent0 : TCC_MISS[0-15] : Number of cache misses. UC reads count as misses.
block TCC has 4 counters
. . .
$ rocprof --list-derived
RPL: on '191018_015911' from '/opt/rocm/rocprofiler' in '/home/evgeny/work/BUILD/0_MatrixTranspose'
ROCProfiler: rc-file '/home/evgeny/rpl_rc.xml'
Derived metrics:
gpu-agent0 : TCC_HIT_sum : Number of cache hits. Sum over TCC instances.
TCC_HIT_sum = sum(TCC_HIT,16)
gpu-agent0 : TCC_MISS_sum : Number of cache misses. Sum over TCC instances.
```
Counters and metrics accumulated per kernel can be collected using an input file with a list of metrics; see the example in 2.1.
Profiling currently has the limitation that submitted kernels are serialized.
###### 2.1.1.2.1. Blocks instancing
GPU blocks are implemented as several identical instances. To dump the counters of a specific instance, use square brackets; see the example in 2.1.
The number of block instances can be queried, see 2.1.1.1.
###### 2.1.1.2.2. HW limitations
The number of counters that can be dumped in one run is limited by the GPU HW, namely by the number of counter registers per block. The number of counters can differ between blocks and can be queried, see 2.1.1.1.
- Metrics groups
To dump a list of metrics exceeding the HW limitations, the metrics list can be split into groups.
The tool supports automatic splitting into optimal metric groups:
```
$ rocprof -i input.txt ./MatrixTranspose
RPL: on '191018_032645' from '/opt/rocm/rocprofiler' in '/…./MatrixTranspose'
RPL: profiling './MatrixTranspose'
RPL: input file 'input.txt'
RPL: output dir '/tmp/rpl_data_191018_032645_12106'
RPL: result dir '/tmp/rpl_data_191018_032645_12106/input0_results_191018_032645'
ROCProfiler: rc-file '/…./rpl_rc.xml'
ROCProfiler: input from "/tmp/rpl_data_191018_032645_12106/input0.xml"
```
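The idea of splitting a metrics list into groups that respect the per-block register limits can be pictured with the greedy first-fit sketch below. It is not the tool's own algorithm, which is not described here and is stated to be optimal. The sketch assumes that the block name is the prefix before the first underscore and that each listed counter occupies one register of its block; the limits are the ones shown by ‘--list-basic’ for GRBM and TCC.

```python
from collections import Counter

def block_of(counter_name):
    """Assumption: the block name is the prefix before the first underscore."""
    return counter_name.split("_", 1)[0]

def split_into_groups(counters, limits):
    """Greedy first-fit split so that no group exceeds a block's limit."""
    groups = []  # list of (counter names, registers used per block)
    for name in counters:
        block = block_of(name)
        if limits[block] < 1:
            raise ValueError(f"block {block} has no counter registers")
        for names, used in groups:
            if used[block] < limits[block]:
                names.append(name)
                used[block] += 1
                break
        else:
            # No existing group has a free register for this block.
            groups.append(([name], Counter({block: 1})))
    return [names for names, _ in groups]

limits = {"GRBM": 2, "TCC": 4}
counters = [
    "GRBM_COUNT", "GRBM_GUI_ACTIVE",
    "TCC_HIT", "TCC_MISS", "TCC_EA_WRREQ", "TCC_EA_WRREQ_64B", "TCC_EA_RDREQ",
]
groups = split_into_groups(counters, limits)
for index, group in enumerate(groups):
    print(index, group)
```

With these inputs the fifth TCC counter does not fit into the first group and ends up in a second one, which corresponds to a second profiling run.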
The trace is generated in JSON format compatible with Chrome tracing.
The trace consists of several sections with timelines for the per-thread API trace and for GPU activity. The timeline events show the event name and parameters.
Supported options: ‘--hsa-trace’, ‘--hip-trace’, ‘--sys-trace’, where ‘sys trace’ is the combined HIP and HSA trace.
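Besides viewing the JSON trace in Chrome tracing, it can be inspected with a script. The Python sketch below sums the duration of complete events per event name. It assumes the Chrome Trace Event layout (a ‘traceEvents’ array, or a bare array, whose complete events have ‘ph’ equal to "X" and a ‘dur’ field); the sample trace is made up, and a real trace may contain further fields and event phases.

```python
import json
from collections import defaultdict

# Made-up trace in the Chrome Trace Event format, for illustration only.
SAMPLE_TRACE = """
{"traceEvents": [
  {"name": "hipMemcpy", "ph": "X", "pid": 1, "tid": 10, "ts": 100, "dur": 40},
  {"name": "hipLaunchKernel", "ph": "X", "pid": 1, "tid": 10, "ts": 150, "dur": 5},
  {"name": "hipMemcpy", "ph": "X", "pid": 1, "tid": 10, "ts": 200, "dur": 60}
]}
"""

def total_duration_by_name(trace_text):
    """Sum the 'dur' field of complete ('X') events per event name."""
    data = json.loads(trace_text)
    # The format allows either an object with 'traceEvents' or a bare array.
    events = data["traceEvents"] if isinstance(data, dict) else data
    totals = defaultdict(int)
    for event in events:
        if event.get("ph") == "X":
            totals[event["name"]] += event.get("dur", 0)
    return dict(totals)

durations = total_duration_by_name(SAMPLE_TRACE)
print(durations)
```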
#### 2.2.1. HIP runtime trace
The trace is generated with the option ‘--hip-trace’ and includes HIP API timelines and GPU activity at the runtime level.
#### 2.2.2. ROCr runtime trace
The trace is generated with the option ‘--hsa-trace’ and includes ROCr API timelines and GPU activity at the AQL queue level. It can also provide per-kernel counters.
#### 2.2.3. KFD driver trace
The trace is generated with the option ‘--kfd-trace’ and includes the KFD Thunk API timeline.
Tracing of memory allocations/migration is planned.
#### 2.2.4. Code annotation
Application code annotation is supported.
A start/stop API is provided to control profiling programmatically.
The ‘roctx’ library provides the annotation API. Annotations are visualized in the JSON trace as a separate "Markers and Ranges" timeline section.
##### 2.2.4.1. Start/stop API
```
// Tracing start API
void roctracer_start();
// Tracing stop API
void roctracer_stop();
```
##### 2.2.4.2. rocTX basic markers API
```
// Creates a marker with the given ASCII message.
void roctxMark(const char* message);
// Starts a nested range associated with the given message.
// Returns the 0-based level of the range being started.
// A negative value is returned on error.
int roctxRangePush(const char* message);
// Marks the end of a nested range.
// Returns the 0-based level of the range being ended.
// A negative value is returned on error.
int roctxRangePop();
```
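The return values of the nested-range calls can be pictured with a toy model. The Python class below is not the rocTX library and does not call it; it only mimics the documented behavior, where a push returns the 0-based level of the range being started, a pop returns the level of the range being ended, and a negative value signals an error. The class and method names are hypothetical.

```python
class RangeStack:
    """Toy model of nested ranges; not the rocTX implementation."""

    def __init__(self):
        self._messages = []

    def push(self, message):
        """Start a nested range and return its 0-based level."""
        self._messages.append(message)
        return len(self._messages) - 1

    def pop(self):
        """End the innermost range and return its level, or -1 on error."""
        if not self._messages:
            return -1  # nothing to pop: modeled as the error case
        self._messages.pop()
        return len(self._messages)

ranges = RangeStack()
levels = [
    ranges.push("frame"),       # outermost range
    ranges.push("copy input"),  # nested inside "frame"
    ranges.pop(),               # ends "copy input"
    ranges.pop(),               # ends "frame"
    ranges.pop(),               # unbalanced pop
]
print(levels)
```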
### 2.3. Multiple GPUs profiling
The profiler supports profiling of multiple GPUs and provides the GPU id for counter and kernel data in the CSV output file. The GPU id is also indicated for the respective GPU activity timeline in the JSON trace.
## 3. Profiling control
Profiling can be controlled by specifying a profiling scope, by filtering trace events and by specifying time intervals of interest.
### 3.1. Profiling scope
The counter profiling scope can be specified by a GPU id list, a list of kernel name substrings and a dispatch range.
Examples of supported range formats: "3:9", "3:", "3". An example input file is shown in 2.1.
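The three range formats can be told apart with a few lines of Python. This is a sketch of one possible reading, not the tool's parser: it assumes that a bare "3" selects a single dispatch and that a missing end means an open-ended range, and it makes no claim about whether the end index is inclusive. The function name is hypothetical.

```python
def parse_dispatch_range(text):
    """Parse 'B:E', 'B:' or 'B' into (begin, end).

    end is None for an open-ended range such as '3:'.
    """
    begin, separator, end = text.strip().partition(":")
    begin_index = int(begin)
    if not separator:
        # Assumption: a bare index selects a single dispatch.
        return (begin_index, begin_index)
    end = end.strip()
    return (begin_index, int(end) if end else None)

for example in ("3:9", "3:", "3"):
    print(example, parse_dispatch_range(example))
```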
### 3.2. Tracing control
Tracing can be filtered by event names using the profiler input file, and time intervals of interest can be enabled with a command-line option.
#### 3.2.1. Filtering traced APIs
A list of traced API names can be specified in profiler input file.
An example of an input file line for the ROCr runtime trace (HSA API):
The following counters are publicly available for commercially available VEGA10/20 GPUs.
Counters:
```
• GRBM_COUNT : Tie High - Count Number of Clocks
• GRBM_GUI_ACTIVE : The GUI is Active
• SQ_WAVES : Count number of waves sent to SQs. (per-simd, emulated, global)
• SQ_INSTS_VALU : Number of VALU instructions issued. (per-simd, emulated)
• SQ_INSTS_VMEM_WR : Number of VMEM write instructions issued (including FLAT). (per-simd, emulated)
• SQ_INSTS_VMEM_RD : Number of VMEM read instructions issued (including FLAT). (per-simd, emulated)
• SQ_INSTS_SALU : Number of SALU instructions issued. (per-simd, emulated)
• SQ_INSTS_SMEM : Number of SMEM instructions issued. (per-simd, emulated)
• SQ_INSTS_FLAT : Number of FLAT instructions issued. (per-simd, emulated)
• SQ_INSTS_FLAT_LDS_ONLY : Number of FLAT instructions issued that read/wrote only from/to LDS (only works if EARLY_TA_DONE is enabled). (per-simd, emulated)
• SQ_INSTS_LDS : Number of LDS instructions issued (including FLAT). (per-simd, emulated)
• SQ_INSTS_GDS : Number of GDS instructions issued. (per-simd, emulated)
• SQ_WAIT_INST_LDS : Number of wave-cycles spent waiting for LDS instruction issue. In units of 4 cycles. (per-simd, nondeterministic)
• SQ_ACTIVE_INST_VALU : Number of cycles the SQ instruction arbiter is working on a VALU instruction. (per-simd, nondeterministic)
• SQ_INST_CYCLES_SALU : Number of cycles needed to execute non-memory read scalar operations. (per-simd, emulated)
• SQ_THREAD_CYCLES_VALU : Number of thread-cycles used to execute VALU operations (similar to INST_CYCLES_VALU but multiplied by # of active threads). (per-simd)
• SQ_LDS_BANK_CONFLICT : Number of cycles LDS is stalled by bank conflicts. (emulated)
• TA_TA_BUSY[0-15] : TA block is busy. Perf_Windowing not supported for this counter.
• TA_FLAT_READ_WAVEFRONTS[0-15] : Number of flat opcode reads processed by the TA.
• TA_FLAT_WRITE_WAVEFRONTS[0-15] : Number of flat opcode writes processed by the TA.
• TCC_HIT[0-15] : Number of cache hits.
• TCC_MISS[0-15] : Number of cache misses. UC reads count as misses.
• TCC_EA_WRREQ[0-15] : Number of transactions (either 32-byte or 64-byte) going over the TC_EA_wrreq interface. Atomics may travel over the same interface and are generally classified as write requests. This does not include probe commands.
• TCC_EA_WRREQ_64B[0-15] : Number of 64-byte transactions going (64-byte write or CMPSWAP) over the TC_EA_wrreq interface.
• TCC_EA_WRREQ_STALL[0-15] : Number of cycles a write request was stalled.
• TCC_EA_RDREQ[0-15] : Number of TCC/EA read requests (either 32-byte or 64-byte)
• TCC_EA_RDREQ_32B[0-15] : Number of 32-byte TCC/EA read requests
• TCP_TCP_TA_DATA_STALL_CYCLES[0-15] : TCP stalls TA data interface. Now Windowed.
```
The following derived metrics have been defined and the profiler metrics XML specification can be found at: https://github.com/ROCm-Developer-Tools/rocprofiler/blob/amd-master/test/tool/metrics.xml.
Metrics:
```
• TA_BUSY_avr : TA block is busy. Average over TA instances.
• TA_BUSY_max : TA block is busy. Max over TA instances.
• TA_BUSY_min : TA block is busy. Min over TA instances.
• TA_FLAT_READ_WAVEFRONTS_sum : Number of flat opcode reads processed by the TA. Sum over TA instances.
• TA_FLAT_WRITE_WAVEFRONTS_sum : Number of flat opcode writes processed by the TA. Sum over TA instances.
• TCC_HIT_sum : Number of cache hits. Sum over TCC instances.
• TCC_MISS_sum : Number of cache misses. Sum over TCC instances.
• TCC_EA_RDREQ_32B_sum : Number of 32-byte TCC/EA read requests. Sum over TCC instances.
• TCC_EA_RDREQ_sum : Number of TCC/EA read requests (either 32-byte or 64-byte). Sum over TCC instances.
• TCC_EA_WRREQ_sum : Number of transactions (either 32-byte or 64-byte) going over the TC_EA_wrreq interface. Sum over TCC instances.
• TCC_EA_WRREQ_64B_sum : Number of 64-byte transactions going (64-byte write or CMPSWAP) over the TC_EA_wrreq interface. Sum over TCC instances.
• TCC_WRREQ_STALL_max : Number of cycles a write request was stalled. Max over TCC instances.
• TCC_MC_WRREQ_sum : Number of 32-byte effective writes. Sum over TCC instances.
• FETCH_SIZE : The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
• WRITE_SIZE : The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
• GPUBusy : The percentage of time GPU was busy.
• Wavefronts : Total wavefronts.
• VALUInsts : The average number of vector ALU instructions executed per work-item (affected by flow control).
• SALUInsts : The average number of scalar ALU instructions executed per work-item (affected by flow control).
• VFetchInsts : The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that fetch from video memory.
• SFetchInsts : The average number of scalar fetch instructions from the video memory executed per work-item (affected by flow control).
• VWriteInsts : The average number of vector write instructions to the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that write to video memory.
• FlatVMemInsts : The average number of FLAT instructions that read from or write to the video memory executed per work item (affected by flow control). Includes FLAT instructions that read from or write to scratch.
• LDSInsts : The average number of LDS read or LDS write instructions executed per work item (affected by flow control). Excludes FLAT instructions that read from or write to LDS.
• FlatLDSInsts : The average number of FLAT instructions that read or write to LDS executed per work item (affected by flow control).
• GDSInsts : The average number of GDS read or GDS write instructions executed per work item (affected by flow control).
• VALUUtilization : The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64. Value range: 0% (bad), 100% (ideal - no thread divergence).
• VALUBusy : The percentage of GPUTime vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
• SALUBusy : The percentage of GPUTime scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
• Mem32Bwrites :
• FetchSize : The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
• WriteSize : The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
• L2CacheHit : The percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache. Value range: 0% (no hit) to 100% (optimal).
• MemUnitBusy : The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
• MemUnitStalled : The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad).
• WriteUnitStalled : The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad).
• ALUStalledByLDS : The percentage of GPUTime ALU units are stalled by the LDS input queue being full or the output queue being not ready. If there are LDS bank conflicts, reduce them. Otherwise, try reducing the number of LDS accesses if possible. Value range: 0% (optimal) to 100% (bad).
• LDSBankConflict : The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad).
```