gobhardw/docs refactor (#975)

* refactor docs

* Fixing docs build

* merging 967

* Improving readability with table usage

* review changes

* compilation fix

* README links

---------

Co-authored-by: srawat <120587655+SwRaw@users.noreply.github.com>

[ROCm/rocprofiler-sdk commit: 785cc39c16]
Este commit está contenido en:
Gopesh Bhardwaj
2024-07-18 12:58:20 +05:30
cometido por GitHub
padre a136f3f129
commit c798a92f36
Se han modificado 20 ficheros con 1103 adiciones y 987 borrados
+3 -3
Ver fichero
@@ -5,7 +5,7 @@ Note: rocprofiler-sdk is currently considered a beta version and is subject to c
## Overview
ROCProfiler-SDK is AMD’s new and improved tooling infrastructure, providing a hardware-specific low-level performance analysis interface for profiling and tracing GPU compute applications. To see what's changed [Click Here](source/docs/index.md)
ROCProfiler-SDK is AMD’s new and improved tooling infrastructure, providing a hardware-specific low-level performance analysis interface for profiling and tracing GPU compute applications. To see what's changed [Click Here](source/docs/what-is-rocprof-sdk.rst)
## GPU Metrics
@@ -21,7 +21,7 @@ ROCProfiler-SDK is AMD’s new and improved tooling infrastructure, providing a
rocprofv3 is the command line tool built using the rocprofiler-sdk library and shipped with the ROCm stack. To see details on
the command line options of rocprofv3, please see rocprofv3 user guide
[Click Here](source/docs/rocprofv3.md)
[Click Here](source/docs/how-to/using-rocprofv3.rst)
## Documentation
@@ -55,7 +55,7 @@ To install ROCprofiler, run:
cmake --build rocprofiler-sdk-build --target install
```
Please see the detailed section on build and installation here: [Click Here](source/docs/installation.md)
Please see the detailed section on build and installation here: [Click Here](source/docs/install/installation.md)
## Support
+19 -13
Ver fichero
@@ -1,24 +1,30 @@
# Anywhere {branch} is used, the branch name will be substituted.
# These comments will also be removed.
defaults:
numbered: True
numbered: false
maxdepth: 4
root: index
subtrees:
- caption: Table of Contents
entries:
- file: features
- file: installation
- file: tool_library_overview
- file: callback_services
- file: buffered_services
- file: pc_sampling
- file: intercept_table
- file: counter_collection_services
- entries:
- file: what-is-rocprof-sdk
- file: buffered_services.md
- file: callback_services.md
- file: counter_collection_services.md
- file: intercept_table.md
- file: pc_sampling.md
- file: tool_library_overview.md
- caption: Install
entries:
- file: install/installation
- caption: How to
entries:
- file: how-to/using-rocprofv3
- file: how-to/samples
- caption: API reference
entries:
- file: _doxygen/html/index
- file: samples
- file: rocprofv3
title: API library
- caption: License
entries:
- file: license
@@ -0,0 +1,5 @@
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"HIP_COMPILER_API","__hipRegisterFatBinary",208,208,1,1508780270085955,1508780270096795
"HIP_COMPILER_API","__hipRegisterFunction",208,208,2,1508780270104242,1508780270115355
"HIP_COMPILER_API","__hipPushCallConfiguration",208,208,3,1508780613897816,1508780613898701
"HIP_COMPILER_API","__hipPopCallConfiguration",208,208,4,1508780613901714,1508780613902200
1 Domain Function Process_Id Thread_Id Correlation_Id Start_Timestamp End_Timestamp
2 HIP_COMPILER_API __hipRegisterFatBinary 208 208 1 1508780270085955 1508780270096795
3 HIP_COMPILER_API __hipRegisterFunction 208 208 2 1508780270104242 1508780270115355
4 HIP_COMPILER_API __hipPushCallConfiguration 208 208 3 1508780613897816 1508780613898701
5 HIP_COMPILER_API __hipPopCallConfiguration 208 208 4 1508780613901714 1508780613902200
@@ -0,0 +1,9 @@
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"HIP_RUNTIME_API","hipGetDevicePropertiesR0600",238,238,1,1191915574691984,1191915687784011
"HIP_RUNTIME_API","hipMalloc",238,238,2,1191915691312459,1191915691388696
"HIP_RUNTIME_API","hipMalloc",238,238,3,1191915691390637,1191915691423279
"HIP_RUNTIME_API","hipMemcpy",238,238,4,1191915691439107,1191916547828448
"HIP_RUNTIME_API","hipLaunchKernel",238,238,5,1191916547842972,1191916548408842
"HIP_RUNTIME_API","hipMemcpy",238,238,6,1191916548412677,1191916550217834
"HIP_RUNTIME_API","hipFree",238,238,7,1191916562618151,1191916562789093
"HIP_RUNTIME_API","hipFree",238,238,8,1191916562790923,1191916562836351
1 Domain Function Process_Id Thread_Id Correlation_Id Start_Timestamp End_Timestamp
2 HIP_RUNTIME_API hipGetDevicePropertiesR0600 238 238 1 1191915574691984 1191915687784011
3 HIP_RUNTIME_API hipMalloc 238 238 2 1191915691312459 1191915691388696
4 HIP_RUNTIME_API hipMalloc 238 238 3 1191915691390637 1191915691423279
5 HIP_RUNTIME_API hipMemcpy 238 238 4 1191915691439107 1191916547828448
6 HIP_RUNTIME_API hipLaunchKernel 238 238 5 1191916547842972 1191916548408842
7 HIP_RUNTIME_API hipMemcpy 238 238 6 1191916548412677 1191916550217834
8 HIP_RUNTIME_API hipFree 238 238 7 1191916562618151 1191916562789093
9 HIP_RUNTIME_API hipFree 238 238 8 1191916562790923 1191916562836351
@@ -0,0 +1,18 @@
"Name","Calls","TotalDurationNs","AverageNs","Percentage","MinNs","MaxNs","StdDev"
"hipStreamCreateWithFlags",4,262497406,65624351.500000,85.15,3991286,249121840,122332531.343496
"hipGetDeviceCount",1,32505687,32505687.000000,10.54,32505687,32505687,0.00000000e+00
"hipHostMalloc",12,6096409,508034.083333,1.98,443793,548024,39236.753678
"hipFree",12,1994421,166201.750000,0.6470,7790,1036046,299086.860470
"hipMemcpyAsync",12,1368378,114031.500000,0.4439,2490,764044,249308.051619
"hipMallocAsync",12,927255,77271.250000,0.3008,51540,107671,20487.475966
"hipStreamSynchronize",12,870486,72540.500000,0.2824,140,866606,250065.900069
"hipLaunchKernel",16,692734,43295.875000,0.2247,1000,670044,167133.656647
"hipStreamDestroy",4,619905,154976.250000,0.2011,92901,339252,122852.320356
"hipDeviceSynchronize",4,404252,101063.000000,0.1311,570,385212,189518.505401
"hipHostFree",12,271202,22600.166667,0.0880,11950,34950,7480.268600
"__hipRegisterFatBinary",1,9000,9000.000000,2.920e-03,9000,9000,0.00000000e+00
"__hipRegisterFunction",4,6150,1537.500000,1.995e-03,230,5370,2555.091323
"__hipPushCallConfiguration",16,2460,153.750000,7.980e-04,70,1140,267.503894
"__hipPopCallConfiguration",16,2000,125.000000,6.488e-04,70,680,151.613544
"hipGetLastError",16,1270,79.375000,4.120e-04,50,440,96.295985
"hipSetDevice",1,660,660.000000,2.141e-04,660,660,0.00000000e+00
1 Name Calls TotalDurationNs AverageNs Percentage MinNs MaxNs StdDev
2 hipStreamCreateWithFlags 4 262497406 65624351.500000 85.15 3991286 249121840 122332531.343496
3 hipGetDeviceCount 1 32505687 32505687.000000 10.54 32505687 32505687 0.00000000e+00
4 hipHostMalloc 12 6096409 508034.083333 1.98 443793 548024 39236.753678
5 hipFree 12 1994421 166201.750000 0.6470 7790 1036046 299086.860470
6 hipMemcpyAsync 12 1368378 114031.500000 0.4439 2490 764044 249308.051619
7 hipMallocAsync 12 927255 77271.250000 0.3008 51540 107671 20487.475966
8 hipStreamSynchronize 12 870486 72540.500000 0.2824 140 866606 250065.900069
9 hipLaunchKernel 16 692734 43295.875000 0.2247 1000 670044 167133.656647
10 hipStreamDestroy 4 619905 154976.250000 0.2011 92901 339252 122852.320356
11 hipDeviceSynchronize 4 404252 101063.000000 0.1311 570 385212 189518.505401
12 hipHostFree 12 271202 22600.166667 0.0880 11950 34950 7480.268600
13 __hipRegisterFatBinary 1 9000 9000.000000 2.920e-03 9000 9000 0.00000000e+00
14 __hipRegisterFunction 4 6150 1537.500000 1.995e-03 230 5370 2555.091323
15 __hipPushCallConfiguration 16 2460 153.750000 7.980e-04 70 1140 267.503894
16 __hipPopCallConfiguration 16 2000 125.000000 6.488e-04 70 680 151.613544
17 hipGetLastError 16 1270 79.375000 4.120e-04 50 440 96.295985
18 hipSetDevice 1 660 660.000000 2.141e-04 660 660 0.00000000e+00
@@ -0,0 +1,11 @@
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"HSA_CORE_API","hsa_system_get_major_extension_table",197,197,1,1507843974724237,1507843974724947
"HSA_CORE_API","hsa_agent_get_info",197,197,3,1507843974754471,1507843974755014
"HSA_AMD_EXT_API","hsa_amd_memory_pool_get_info",197,197,5,1507843974761705,1507843974762398
"HSA_AMD_EXT_API","hsa_amd_memory_pool_get_info",197,197,6,1507843974763901,1507843974764030
"HSA_AMD_EXT_API","hsa_amd_memory_pool_get_info",197,197,7,1507843974765121,1507843974765224
"HSA_AMD_EXT_API","hsa_amd_memory_pool_get_info",197,197,8,1507843974766196,1507843974766328
"HSA_AMD_EXT_API","hsa_amd_memory_pool_get_info",197,197,9,1507843974767534,1507843974767641
"HSA_AMD_EXT_API","hsa_amd_memory_pool_get_info",197,197,10,1507843974768639,1507843974768779
"HSA_AMD_EXT_API","hsa_amd_agent_iterate_memory_pools",197,197,4,1507843974758768,1507843974769238
"HSA_CORE_API","hsa_agent_get_info",197,197,11,1507843974771091,1507843974771537
1 Domain Function Process_Id Thread_Id Correlation_Id Start_Timestamp End_Timestamp
2 HSA_CORE_API hsa_system_get_major_extension_table 197 197 1 1507843974724237 1507843974724947
3 HSA_CORE_API hsa_agent_get_info 197 197 3 1507843974754471 1507843974755014
4 HSA_AMD_EXT_API hsa_amd_memory_pool_get_info 197 197 5 1507843974761705 1507843974762398
5 HSA_AMD_EXT_API hsa_amd_memory_pool_get_info 197 197 6 1507843974763901 1507843974764030
6 HSA_AMD_EXT_API hsa_amd_memory_pool_get_info 197 197 7 1507843974765121 1507843974765224
7 HSA_AMD_EXT_API hsa_amd_memory_pool_get_info 197 197 8 1507843974766196 1507843974766328
8 HSA_AMD_EXT_API hsa_amd_memory_pool_get_info 197 197 9 1507843974767534 1507843974767641
9 HSA_AMD_EXT_API hsa_amd_memory_pool_get_info 197 197 10 1507843974768639 1507843974768779
10 HSA_AMD_EXT_API hsa_amd_agent_iterate_memory_pools 197 197 4 1507843974758768 1507843974769238
11 HSA_CORE_API hsa_agent_get_info 197 197 11 1507843974771091 1507843974771537
@@ -0,0 +1,2 @@
"Kind","Agent_Id","Queue_Id","Kernel_Id","Kernel_Name","Correlation_Id","Start_Timestamp","End_Timestamp","Private_Segment_Size","Group_Segment_Size","Workgroup_Size_X","Workgroup_Size_Y","Workgroup_Size_Z","Grid_Size_X","Grid_Size_Y","Grid_Size_Z"
"KERNEL_DISPATCH",1,139690710949888,15,"matrixTranspose(float*, float*, int)",0,671599758568,671599825328,0,0,4,4,1,1024,1024,1
1 Kind Agent_Id Queue_Id Kernel_Id Kernel_Name Correlation_Id Start_Timestamp End_Timestamp Private_Segment_Size Group_Segment_Size Workgroup_Size_X Workgroup_Size_Y Workgroup_Size_Z Grid_Size_X Grid_Size_Y Grid_Size_Z
2 KERNEL_DISPATCH 1 139690710949888 15 matrixTranspose(float*, float*, int) 0 671599758568 671599825328 0 0 4 4 1 1024 1024 1
@@ -0,0 +1,6 @@
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"MARKER_CORE_API","before hipLaunchKernel",717,717,1,1520113899312225,1520113899312225
"MARKER_CORE_API","after hipLaunchKernel",717,717,4,1520113900128482,1520113900128482
"MARKER_CORE_API","hipMemcpy",717,717,5,1520113900141100,1520113901483408
"MARKER_CORE_API","hipLaunchKernel",717,717,3,1520113899684965,1520113901491622
"MARKER_CORE_API","hipLaunchKernel range",717,0,2,1520113899682208,1520113901495882
1 Domain Function Process_Id Thread_Id Correlation_Id Start_Timestamp End_Timestamp
2 MARKER_CORE_API before hipLaunchKernel 717 717 1 1520113899312225 1520113899312225
3 MARKER_CORE_API after hipLaunchKernel 717 717 4 1520113900128482 1520113900128482
4 MARKER_CORE_API hipMemcpy 717 717 5 1520113900141100 1520113901483408
5 MARKER_CORE_API hipLaunchKernel 717 717 3 1520113899684965 1520113901491622
6 MARKER_CORE_API hipLaunchKernel range 717 0 2 1520113899682208 1520113901495882
@@ -0,0 +1,3 @@
"Kind","Direction","Source_Agent_Id","Destination_Agent_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"MEMORY_COPY","HOST_TO_DEVICE",0,1,0,14955949675563,14955950239443
"MEMORY_COPY","DEVICE_TO_HOST",1,0,0,14955952733485,14955953315285
1 Kind Direction Source_Agent_Id Destination_Agent_Id Correlation_Id Start_Timestamp End_Timestamp
2 MEMORY_COPY HOST_TO_DEVICE 0 1 0 14955949675563 14955950239443
3 MEMORY_COPY DEVICE_TO_HOST 1 0 0 14955952733485 14955953315285
@@ -1,10 +0,0 @@
# Features
## Overview
- Improved tool initialization
- Support for multiple tools using the same services simulatenously
- Simplified management of enabling/disabling one or more data collection services
- Improved error checking and logging
- Backwards ABI compatibility (goal)
- PC Sampling(Beta Implementation)
@@ -0,0 +1,43 @@
# Samples
The samples are provided to help you see the profiler in action.
## Finding samples
After the ROCm build is installed:
- Sample programs are installed here:
```bash
/opt/rocm/share/rocprofiler-sdk/samples
```
- `rocprofv3` tool is installed here:
```bash
/opt/rocm/bin
```
## Building Samples
To build samples from any directory, run:
```bash
cmake -B build-rocprofiler-sdk-samples /opt/rocm/share/rocprofiler-sdk/samples -DCMAKE_PREFIX_PATH=/opt/rocm
cmake --build build-rocprofiler-sdk-samples --target all --parallel 8
```
## Running samples
To run the built samples, `cd` into the `build-rocprofiler-sdk-samples` directory and run:
```bash
ctest -V
```
:::{note}
Running a few of these tests require you to install Pandas and pytest first.
:::
```bash
/usr/local/bin/python -m pip install -r requirements.txt
```
@@ -0,0 +1,808 @@
.. meta::
:description: Documentation of the installation, configuration, use of the ROCProfiler SDK, and rocprofv3 command-line tool
:keywords: ROCProfiler SDK tool, ROCProfiler SDK library, rocprofv3, ROCm, API, reference
.. _using-rocprofv3:
======================
Using rocprofv3
======================
``rocprofv3`` is a CLI tool that helps you quickly optimize applications and understand the low-level kernel details without requiring any modification in the source code.
It is being developed to be backward compatible with its predecessor, ``rocprof``, and to provide more features for application profiling with better accuracy.
The following sections demonstrate the use of ``rocprofv3`` for application tracing and kernel profiling using various command-line options.
``rocprofv3`` is installed with ROCm under ``/opt/rocm/bin``. To use the tool from anywhere in the system, export ``PATH`` variable:
.. code-block:: bash
export PATH=$PATH:/opt/rocm/bin
Before you start tracing or profiling your HIP application using ``rocprofv3``, build the application using:
.. code-block:: bash
cmake -B <build-directory> <source-directory> -DCMAKE_PREFIX_PATH=/opt/rocm
cmake --build <build-directory> --target all --parallel <N>
Options
---------
Here is the list of ``rocprofv3`` command-line options. Some options are used for application tracing and some for kernel profiling while the output control options control the presentation and redirection of the generated output.
.. list-table:: rocprofv3 options
:header-rows: 1
* - Option
- Description
- Use
* - ``--hip-trace``
- Collects HIP runtime traces.
- Application tracing
* - ``--hip-runtime-trace``
- Collects HIP runtime API traces.
- Application tracing
* - ``--hip-compiler-trace``
- Collects HIP compiler-generated code traces.
- Application tracing
* - ``--scratch-memory-trace``
- Collects scratch memory operations traces.
- Application tracing
* - ``--hsa-trace``
- Collects HSA API traces.
- Application tracing
* - ``--hsa-core-trace``
- Collects HSA API traces (core API).
- Application tracing
* - ``--hsa-amd-trace``
- Collects HSA API traces (AMD-extension API).
- Application tracing
* - ``--hsa-image-trace``
- Collects HSA API Ttaces (Image-extension API).
- Application tracing
* - ``--hsa-finalizer-trace``
- Collects HSA API traces (Finalizer-extension API).
- Application tracing
* - ``--stats``
- For Collecting statistics of enabled tracing types
- Application tracing
* - ``--kernel-trace``
- Collects kernel dispatch traces.
- Application tracing
* - ``--marker-trace``
- Collects marker (ROC-TX) traces.
- Application tracing
* - ``--memory-copy-trace``
- Collects memory copy traces.
- Application tracing
* - ``--sys-trace``
- Collects HIP, HSA, memory copy, marker, and kernel dispatch traces.
- Application Tracing
* - ``-i``
- Specifies the input file.
- Kernel profiling
* - ``--kernel-names``
- pecifies the kernel names to target during counter collection.
- Kernel profiling
* - ``-L`` \| ``--list-metrics``
- List metrics for counter collection.
- Kernel profiling
* - ``-d`` \| ``--output-directory``
- Specifies the path for the output files.
- Output control
* - ``-o`` \| ``--output-file``
- Specifies the name of the output file. Note that this name is appended to the default names (_api_trace or counter_collection.csv) of the generated files'.
- Output control
* - ``-M`` \| ``--mangled-kernels``
- Overrides the default demangling of kernel names.
- Output control
* - ``-T`` \| ``--truncate-kernels``
- Truncates the demangled kernel names for improved readability.
- Output control
* - ``--output-format``
- For adding output format (supported formats: csv, json, pftrace)
- Output control
* - ``--preload``
- Libraries to prepend to LD_PRELOAD (usually for sanitizers)
- Extension
You can also see all the ``rocprofv3`` options using:
.. code-block:: bash
rocprofv3 --help
Application tracing
---------------------
Application tracing provides the big picture of a program’s execution by collecting data on the execution times of API calls and GPU commands, such as kernel execution, async memory copy, and barrier packets. This information can be used as the first step in the profiling process to answer important questions, such as how much percentage of time was spent on memory copy and which kernel took the longest time to execute.
To use ``rocprofv3`` for application tracing, run:
.. code-block:: bash
rocprofv3 <tracing_option> -- <app_relative_path>
HIP trace
+++++++++++
HIP trace comprises execution traces for the entire application at the HIP level. This includes HIP API functions and their asynchronous activities at the runtime level. In general, HIP APIs directly interact with the user program. It is easier to analyze HIP traces as you can directly map them to the program.
To trace HIP runtime APIs, use:
.. code-block:: bash
rocprofv3 --hip-trace -- < app_relative_path >
.. note::
The tracing and counter collection options generate an additional `agent info` file.
The above command generates a `hip_api_trace.csv` file prefixed with the process ID.
.. code-block:: shell
$ cat 238_hip_api_trace.csv
Here are the contents of `hip_api_trace.csv` file:
.. csv-table:: HIP runtime api trace
:file: /data/hip_compile_trace.csv
:widths: 10,10,10,10,10,20,20
:header-rows: 1
To trace HIP compile time APIs, use:
.. code-block:: shell
rocprofv3 --hip-compiler-trace -- < app_relative_path >
The above command generates a `hip_api_trace.csv` file prefixed with the process ID.
.. code-block:: shell
$ cat 208_hip_api_trace.csv
Here are the contents of `hip_api_trace.csv` file:
.. csv-table:: HIP compile time api trace
:file: /data/hip_compile_trace.csv
:widths: 10,10,10,10,10,20,20
:header-rows: 1
For the description of the fields in the output file, see :ref:`output-file-fields`.
Agent Info
''''''''''''''
.. code-block:: shell
$ cat 238_agent_info.csv
"Node_Id","Logical_Node_Id","Agent_Type","Cpu_Cores_Count","Simd_Count","Cpu_Core_Id_Base","Simd_Id_Base","Max_Waves_Per_Simd","Lds_Size_In_Kb","Gds_Size_In_Kb","Num_Gws","Wave_Front_Size","Num_Xcc","Cu_Count","Array_Count","Num_Shader_Banks","Simd_Arrays_Per_Engine","Cu_Per_Simd_Array","Simd_Per_Cu","Max_Slots_Scratch_Cu","Gfx_Target_Version","Vendor_Id","Device_Id","Location_Id","Domain","Drm_Render_Minor","Num_Sdma_Engines","Num_Sdma_Xgmi_Engines","Num_Sdma_Queues_Per_Engine","Num_Cp_Queues","Max_Engine_Clk_Ccompute","Max_Engine_Clk_Fcompute","Sdma_Fw_Version","Fw_Version","Capability","Cu_Per_Engine","Max_Waves_Per_Cu","Family_Id","Workgroup_Max_Size","Grid_Max_Size","Local_Mem_Size","Hive_Id","Gpu_Id","Workgroup_Max_Dim_X","Workgroup_Max_Dim_Y","Workgroup_Max_Dim_Z","Grid_Max_Dim_X","Grid_Max_Dim_Y","Grid_Max_Dim_Z","Name","Vendor_Name","Product_Name","Model_Name"
0,0,"CPU",24,0,0,0,0,0,0,0,0,1,24,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3800,0,0,0,0,0,0,23,0,0,0,0,0,0,0,0,0,0,0,"AMD Ryzen 9 3900X 12-Core Processor","CPU","AMD Ryzen 9 3900X 12-Core Processor",""
1,1,"GPU",0,256,0,2147487744,10,64,0,64,64,1,64,4,4,1,16,4,32,90000,4098,26751,12032,0,128,2,0,2,24,3800,1630,432,440,138420864,16,40,141,1024,4294967295,0,0,64700,1024,1024,1024,4294967295,4294967295,4294967295,"gfx900","AMD","Radeon RX Vega","vega10"
HSA trace
+++++++++++++
The HIP runtime library is implemented with the low-level HSA runtime. HSA API tracing is more suited for advanced users who want to understand the application behavior at the lower level. In general, tracing at the HIP level is recommended for most users. You should use HSA trace only if you are familiar with HSA runtime.
HSA trace contains the start and end time of HSA runtime API calls and their asynchronous activities.
.. code-block:: bash
rocprofv3 --hsa-trace -- < app_relative_path >
The above command generates a `hsa_api_trace.csv` file prefixed with process ID. Note that the contents of this file have been truncated for demonstration purposes.
.. code-block:: shell
$ cat 197_hsa_api_trace.csv
Here are the contents of `hsa_api_trace.csv` file:
.. csv-table:: HSA api trace
:file: /data/hsa_trace.csv
:widths: 10,10,10,10,10,20,20
:header-rows: 1
For the description of the fields in the output file, see :ref:`output-file-fields`.
Marker trace
++++++++++++++
In certain situations, such as debugging performance issues in large-scale GPU programs, API-level tracing might be too fine-grained to provide a big picture of the program execution. In such cases, it is helpful to define specific tasks to be traced.
To specify the tasks for tracing, enclose the respective source code with the API calls provided by the `ROCTX` library. This process is also known as instrumentation. As the scope of code for instrumentation is defined using the enclosing API calls, it is called a range. A range is a programmer-defined task that has a well-defined start and end code scope. You can also refine the scope specified within a range using further nested ranges. ``rocprofv3`` also reports the timelines for these nested ranges.
Here is a list of useful APIs for code instrumentation.
- ``roctxMark``: Inserts a marker in the code with a message. Creating marks help you see when a line of code is executed.
- ``roctxRangeStart``: Starts a range. Different threads can start ranges.
- ``roctxRangePush``: Starts a new nested range.
- ``roctxRangePop``: Stops the current nested range.
- ``roctxRangeStop``: Stops the given range.
See how to use `rocTX` APIs in the MatrixTranspose application below:
.. code-block:: bash
roctxMark("before hipLaunchKernel");
int rangeId = roctxRangeStart("hipLaunchKernel range");
roctxRangePush("hipLaunchKernel");
// Launching kernel from host
hipLaunchKernelGGL(matrixTranspose, dim3(WIDTH/THREADS_PER_BLOCK_X, WIDTH/THREADS_PER_BLOCK_Y), dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y), 0,0,gpuTransposeMatrix,gpuMatrix, WIDTH);
roctxMark("after hipLaunchKernel");
// Memory transfer from device to host
roctxRangePush("hipMemcpy");
hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost);
roctxRangePop(); // for "hipMemcpy"
roctxRangePop(); // for "hipLaunchKernel"
roctxRangeStop(rangeId);
To trace the API calls enclosed within the range, use:
.. code-block:: bash
rocprofv3 --marker-trace -- < app_relative_path >
Running the preceding command generates a `marker_api_trace.csv` file prefixed with the process ID.
.. code-block:: shell
$ cat 210_marker_api_trace.csv
Here are the contents of `marker_api_trace.csv` file:
.. csv-table:: Marker api trace
:file: /data/marker_api_trace.csv
:widths: 10,10,10,10,10,20,20
:header-rows: 1
For the description of the fields in the output file, see :ref:`output-file-fields`.
Kernel trace
++++++++++++++
To trace kernel dispatch traces, use:
.. code-block:: shell
rocprofv3 --kernel-trace -- < app_relative_path >
The above command generates a `kernel_trace.csv` file prefixed with the process ID.
.. code-block:: shell
$ cat 199_kernel_trace.csv
Here are the contents of `kernel_trace.csv` file:
.. csv-table:: Kernel trace
:file: /data/kernel_trace.csv
:widths: 10,10,10,10,10,10,20,20,10,10,10,10,10,10,10,10
:header-rows: 1
For the description of the fields in the output file, see :ref:`output-file-fields`.
Memory copy trace
+++++++++++++++++++
To trace memory moves across the application, use:
.. code-block:: shell
rocprofv3 -memory-copy-trace -- < app_relative_path >
The above command generates a `memory_copy_trace.csv` file prefixed with the process ID.
.. code-block:: shell
$ cat 197_memory_copy_trace.csv
Here are the contents of `memory_copy_trace.csv` file:
.. csv-table:: Memory copy trace
:file: /data/memory_copy_trace.csv
:widths: 10,10,10,10,10,20,20
:header-rows: 1
For the description of the fields in the output file, see :ref:`output-file-fields`.
Sys trace
+++++++++++
This is an all-inclusive option to collect all the above-mentioned traces.
.. code-block:: shell
rocprofv3 -sys-trace -- < app_relative_path >
Running the above command generates `hip_api_trace.csv`, `hsa_api_trace.csv`, `kernel_trace.csv`, `memory_copy_trace.csv`, and `marker_api_trace.csv` (if `rocTX` APIs are specified in the application) files prefixed with the process ID.
Scratch memory trace
++++++++++++++++++++++
This option collects scratch memory operation's traces. Scratch is an address space on AMDGPUs, which is roughly equivalent to the `local memory` in NVIDIA CUDA. The `local memory` in CUDA is a thread-local global memory with interleaved addressing, which is used for register spills or stack space. With this option, you can trace when the ``rocr`` runtime allocates, frees, and tries to reclaim scratch memory.
.. code-block:: shell
rocprofv3 --scratch-memory-trace -- < app_relative_path >
Stats
++++++++
This option collects statistics for the enabled tracing types. For example, to collect statistics of HIP APIs, when HIP trace is enabled.
A higher percentage in statistics can help user focus on the API/function that has taken the most time:
.. code-block:: shell
rocprofv3 --stats --hip-trace -- < app_relative_path >
The above command generates a `hip_stats.csv` and `hip_api_trace` file prefixed with the process ID.
.. code-block:: shell
$ cat hip_stats.csv
Here are the contents of `hip_stats.csv` file:
.. csv-table:: HIP stats
:file: /data/hip_stats.csv
:widths: 10,10,20,20,10,10,10,10
:header-rows: 1
Kernel profiling
-------------------
The application tracing functionality allows you to evaluate the duration of kernel execution but is of little help in providing insight into kernel execution details. The kernel profiling functionality allows you to select kernels for profiling and choose the basic counters or derived metrics to be collected for each kernel execution, thus providing a greater insight into kernel execution.
For a comprehensive list of counters available on MI200, see `MI200 performance counters and metrics <https://rocm.docs.amd.com/en/latest/conceptual/gpu-arch/mi300-mi200-performance-counters.html>`_.
Input file
++++++++++++
To collect the desired basic counters or derived metrics, mention them in an input file. In the input file, the line consisting of the counter or metric names must begin with ``pmc``. The input file could be in text (.txt), yaml (.yaml/.yml), or JSON (.json) format.
.. code-block:: shell
$ cat input.txt
pmc: GPUBusy SQ_WAVES
pmc: GRBM_GUI_ACTIVE
.. code-block:: shell
$ cat input.json
{
"metrics": [
{
"pmc": ["SQ_WAVES", "GRBM_COUNT", "GUI_ACTIVE"]
},
{
"pmc": ["FETCH_SIZE", "WRITE_SIZE"]
}
]
}
.. code-block:: shell
$ cat input.yaml
metrics:
- pmc:
- SQ_WAVES
- GRBM_COUNT
- GUI_ACTIVE
- 'TCC_HIT[1]'
- 'TCC_HIT[2]'
- pmc:
- FETCH_SIZE
- WRITE_SIZE
The number of basic counters or derived metrics that can be collected in one run of profiling are limited by the GPU hardware resources. If too many counters or metrics are selected, the kernels need to be executed multiple times to collect them. For multi-pass execution, include multiple ``pmc`` rows in the input file. Counters or metrics in each ``pmc`` row can be collected in each kernel run.
Kernel profiling output
+++++++++++++++++++++++++
To supply the input file for kernel profiling, use:
.. code-block:: shell
rocprofv3 -i input.txt -- <app_relative_path>
Running the above command generates a `./pmc_n/counter_collection.csv` file prefixed with the process ID. For each ``pmc`` row, a directory ``pmc_n`` containing a `counter_collection.csv` file is generated, where n = 1 for the first row and so on.
Each row of the CSV file is an instance of kernel execution. Here is a truncated version of the output file from ``pmc_1``.
.. code-block:: shell
$ cat pmc_1/218_counter_collection.csv
"Correlation_Id","Dispatch_Id","Agent_Id","Queue_Id","Process_Id","Thread_Id","Grid_Size","Kernel_Name","Workgroup_Size","LDS_Block_Size","Scratch_Size","VGPR_Count","SGPR_Count","Counter_Name","Counter_Value"
0,1,1,139892123975680,5619,5619,1048576,"matrixTranspose(float*, float*, int)",16,0,0,8,16,"SQ_WAVES",65536
.. _output-file-fields:
Output file fields
-----------------------
The following table lists the various fields or the columns in the output CSV files generated for application tracing and kernel profiling:
.. list-table:: output file fields
:header-rows: 1
* - Field
- Description
* - Agent_Id
- GPU identifier to which the kernel was submitted.
* - Correlation_Id
- Unique identifier for correlation between HIP and HSA async calls during activity tracing.
* - Start_Timestamp
- Begin time in nanoseconds (ns) when the kernel begins execution.
* - End_Timestamp
- End time in ns when the kernel finishes execution.
* - Queue_Id
- ROCm queue unique identifier to which the kernel was submitted.
* - Private_Segment_Size
- The amount of memory required in bytes for the combined private, spill, and arg segments for a work item.
* - Group_Segment_Size
- The group segment memory required by a workgroup in bytes. This does not include any dynamically allocated group segment memory that may be added when the kernel is dispatched.
* - Workgroup_Size
- Size of the workgroup as declared by the compute shader.
* - Workgroup_Size_n
- Size of the workgroup in the nth dimension as declared by the compute shader, where n = X, Y, or Z.
* - Grid_Size
- Number of thread blocks required to launch the kernel.
* - Grid_Size_n
- Number of thread blocks in the nth dimension required to launch the kernel, where n = X, Y, or Z.
* - LDS_Block_Size
- Thread block size for the kernel's Local Data Share (LDS) memory.
* - Scratch_Size
- Kernel’s scratch memory size.
* - SGPR_Count
- Kernel's Scalar General Purpose Register (SGPR) count.
* - VGPR_Count
- Kernel's Vector General Purpose Register (VGPR) count.
Kernel names
++++++++++++++
To target a kernel name during countr collection.
.. code-block:: shell
rocprofv3 -i input.txt --kernel-names divide_kernel -- <app_relative_path>
$ cat pmc_1/312_counter_collection.csv
"Correlation_Id","Dispatch_Id","Agent_Id","Queue_Id","Process_Id","Thread_Id","Grid_Size","Kernel_Name","Workgroup_Size","LDS_Block_Size","Scratch_Size","VGPR_Count","SGPR_Count","Counter_Name","Counter_Value"
4,4,1,1,36499,36499,1048576,"divide_kernel(float*, float const*, float const*, int, int)",64,0,0,12,16,"SQ_WAVES",16384
8,8,1,2,36499,36499,1048576,"divide_kernel(float*, float const*, float const*, int, int)",64,0,0,12,16,"SQ_WAVES",16384
12,12,1,3,36499,36499,1048576,"divide_kernel(float*, float const*, float const*, int, int)",64,0,0,12,16,"SQ_WAVES",16384
16,16,1,4,36499,36499,1048576,"divide_kernel(float*, float const*, float const*, int, int)",64,0,0,12,16,"SQ_WAVES",16384
Output formats
----------------
``rocprofv3`` supports the following output formats:
- CSV (default)
- JSON
- PFTrace
You can specify the output format using the ``--output-format`` command-line option. Format selection is case-insensitive
and multiple output formats are supported. For example: ``--output-format json`` enables JSON output exclusively whereas
``--output-format csv json pftrace`` enables all three output formats for the run.
For trace visualization, use the PFTrace format and open the trace in `ui.perfetto.dev <https://ui.perfetto.dev/>`_.
JSON output schema
++++++++++++++++++++
``rocprofv3`` supports a custom JSON output format designed for programmatic analysis. The schema is optimized for size
while factoring in usability. You can generate the JSON output using ``--output-format json`` command-line option.
Properties
++++++++++++
- **`rocprofiler-sdk-tool`** `(array)`: rocprofv3 data per process (each element represents a process).
- **Items** `(object)`: Data for rocprofv3.
- **`metadata`** `(object, required)`: Metadata related to the profiler session.
- **`pid`** `(integer, required)`: Process ID.
- **`init_time`** `(integer, required)`: Initialization time in nanoseconds.
- **`fini_time`** `(integer, required)`: Finalization time in nanoseconds.
- **`agents`** `(array, required)`: List of agents.
- **Items** `(object)`: Data for an agent.
- **`size`** `(integer, required)`: Size of the agent data.
- **`id`** `(object, required)`: Identifier for the agent.
- **`handle`** `(integer, required)`: Handle for the agent.
- **`type`** `(integer, required)`: Type of the agent.
- **`cpu_cores_count`** `(integer)`: Number of CPU cores.
- **`simd_count`** `(integer)`: Number of SIMD units.
- **`mem_banks_count`** `(integer)`: Number of memory banks.
- **`caches_count`** `(integer)`: Number of caches.
- **`io_links_count`** `(integer)`: Number of I/O links.
- **`cpu_core_id_base`** `(integer)`: Base ID for CPU cores.
- **`simd_id_base`** `(integer)`: Base ID for SIMD units.
- **`max_waves_per_simd`** `(integer)`: Maximum waves per SIMD.
- **`lds_size_in_kb`** `(integer)`: Size of LDS in KB.
- **`gds_size_in_kb`** `(integer)`: Size of GDS in KB.
- **`num_gws`** `(integer)`: Number of GWS (global work size).
- **`wave_front_size`** `(integer)`: Size of the wave front.
- **`num_xcc`** `(integer)`: Number of XCC (execution compute units).
- **`cu_count`** `(integer)`: Number of compute units (CUs).
- **`array_count`** `(integer)`: Number of arrays.
- **`num_shader_banks`** `(integer)`: Number of shader banks.
- **`simd_arrays_per_engine`** `(integer)`: SIMD arrays per engine.
- **`cu_per_simd_array`** `(integer)`: CUs per SIMD array.
- **`simd_per_cu`** `(integer)`: SIMDs per CU.
- **`max_slots_scratch_cu`** `(integer)`: Maximum slots for scratch CU.
- **`gfx_target_version`** `(integer)`: GFX target version.
- **`vendor_id`** `(integer)`: Vendor ID.
- **`device_id`** `(integer)`: Device ID.
- **`location_id`** `(integer)`: Location ID.
- **`domain`** `(integer)`: Domain identifier.
- **`drm_render_minor`** `(integer)`: DRM render minor version.
- **`num_sdma_engines`** `(integer)`: Number of SDMA engines.
- **`num_sdma_xgmi_engines`** `(integer)`: Number of SDMA XGMI engines.
- **`num_sdma_queues_per_engine`** `(integer)`: Number of SDMA queues per engine.
- **`num_cp_queues`** `(integer)`: Number of CP queues.
- **`max_engine_clk_ccompute`** `(integer)`: Maximum engine clock for compute.
- **`max_engine_clk_fcompute`** `(integer)`: Maximum engine clock for F compute.
- **`sdma_fw_version`** `(object)`: SDMA firmware version.
- **`uCodeSDMA`** `(integer, required)`: SDMA microcode version.
- **`uCodeRes`** `(integer, required)`: Reserved microcode version.
- **`fw_version`** `(object)`: Firmware version.
- **`uCode`** `(integer, required)`: Microcode version.
- **`Major`** `(integer, required)`: Major version.
- **`Minor`** `(integer, required)`: Minor version.
- **`Stepping`** `(integer, required)`: Stepping version.
- **`capability`** `(object, required)`: Agent capability flags.
- **`HotPluggable`** `(integer, required)`: Hot pluggable capability.
- **`HSAMMUPresent`** `(integer, required)`: HSAMMU present capability.
- **`SharedWithGraphics`** `(integer, required)`: Shared with graphics capability.
- **`QueueSizePowerOfTwo`** `(integer, required)`: Queue size is power of two.
- **`QueueSize32bit`** `(integer, required)`: Queue size is 32-bit.
- **`QueueIdleEvent`** `(integer, required)`: Queue idle event.
- **`VALimit`** `(integer, required)`: VA limit.
- **`WatchPointsSupported`** `(integer, required)`: Watch points supported.
- **`WatchPointsTotalBits`** `(integer, required)`: Total bits for watch points.
- **`DoorbellType`** `(integer, required)`: Doorbell type.
- **`AQLQueueDoubleMap`** `(integer, required)`: AQL queue double map.
- **`DebugTrapSupported`** `(integer, required)`: Debug trap supported.
- **`WaveLaunchTrapOverrideSupported`** `(integer, required)`: Wave launch trap override supported.
- **`WaveLaunchModeSupported`** `(integer, required)`: Wave launch mode supported.
- **`PreciseMemoryOperationsSupported`** `(integer, required)`: Precise memory operations supported.
- **`DEPRECATED_SRAM_EDCSupport`** `(integer, required)`: Deprecated SRAM EDC support.
- **`Mem_EDCSupport`** `(integer, required)`: Memory EDC support.
- **`RASEventNotify`** `(integer, required)`: RAS event notify.
- **`ASICRevision`** `(integer, required)`: ASIC revision.
- **`SRAM_EDCSupport`** `(integer, required)`: SRAM EDC support.
- **`SVMAPISupported`** `(integer, required)`: SVM API supported.
- **`CoherentHostAccess`** `(integer, required)`: Coherent host access.
- **`DebugSupportedFirmware`** `(integer, required)`: Debug supported firmware.
- **`Reserved`** `(integer, required)`: Reserved field.
- **`counters`** `(array, required)`: Array of counter objects.
- **Items** `(object)`
- **`agent_id`** *(object, required)*: Agent ID information.
- **`handle`** *(integer, required)*: Handle of the agent.
- **`id`** *(object, required)*: Counter ID information.
- **`handle`** *(integer, required)*: Handle of the counter.
- **`is_constant`** *(integer, required)*: Indicator if the counter value is constant.
- **`is_derived`** *(integer, required)*: Indicator if the counter value is derived.
- **`name`** *(string, required)*: Name of the counter.
- **`description`** *(string, required)*: Description of the counter.
- **`block`** *(string, required)*: Block information of the counter.
- **`expression`** *(string, required)*: Expression of the counter.
- **`dimension_ids`** *(array, required)*: Array of dimension IDs.
- **Items** *(integer)*: Dimension ID.
- **`strings`** *(object, required)*: String records.
- **`callback_records`** *(array)*: Callback records.
- **Items** *(object)*
- **`kind`** *(string, required)*: Kind of the record.
- **`operations`** *(array, required)*: Array of operations.
- **Items** *(string)*: Operation.
- **`buffer_records`** *(array)*: Buffer records.
- **Items** *(object)*
- **`kind`** *(string, required)*: Kind of the record.
- **`operations`** *(array, required)*: Array of operations.
- **Items** *(string)*: Operation.
- **`marker_api`** *(array)*: Marker API records.
- **Items** *(object)*
- **`key`** *(integer, required)*: Key of the record.
- **`value`** *(string, required)*: Value of the record.
- **`counters`** *(object)*: Counter records.
- **`dimension_ids`** *(array, required)*: Array of dimension IDs.
- **Items** *(object)*
- **`id`** *(integer, required)*: Dimension ID.
- **`instance_size`** *(integer, required)*: Size of the instance.
- **`name`** *(string, required)*: Name of the dimension.
- **`code_objects`** *(array, required)*: Code object records.
- **Items** *(object)*
- **`size`** *(integer, required)*: Size of the code object.
- **`code_object_id`** *(integer, required)*: ID of the code object.
- **`rocp_agent`** *(object, required)*: ROCP agent information.
- **`handle`** *(integer, required)*: Handle of the ROCP agent.
- **`hsa_agent`** *(object, required)*: HSA agent information.
- **`handle`** *(integer, required)*: Handle of the HSA agent.
- **`uri`** *(string, required)*: URI of the code object.
- **`load_base`** *(integer, required)*: Base address for loading.
- **`load_size`** *(integer, required)*: Size for loading.
- **`load_delta`** *(integer, required)*: Delta for loading.
- **`storage_type`** *(integer, required)*: Type of storage.
- **`memory_base`** *(integer, required)*: Base address for memory.
- **`memory_size`** *(integer, required)*: Size of memory.
- **`kernel_symbols`** *(array, required)*: Kernel symbol records.
- **Items** *(object)*
- **`size`** *(integer, required)*: Size of the kernel symbol.
- **`kernel_id`** *(integer, required)*: ID of the kernel.
- **`code_object_id`** *(integer, required)*: ID of the code object.
- **`kernel_name`** *(string, required)*: Name of the kernel.
- **`kernel_object`** *(integer, required)*: Object of the kernel.
- **`kernarg_segment_size`** *(integer, required)*: Size of the kernarg segment.
- **`kernarg_segment_alignment`** *(integer, required)*: Alignment of the kernarg segment.
- **`group_segment_size`** *(integer, required)*: Size of the group segment.
- **`private_segment_size`** *(integer, required)*: Size of the private segment.
- **`formatted_kernel_name`** *(string, required)*: Formatted name of the kernel.
- **`demangled_kernel_name`** *(string, required)*: Demangled name of the kernel.
- **`truncated_kernel_name`** *(string, required)*: Truncated name of the kernel.
- **`callback_records`** *(object, required)*: Callback record details.
- **`counter_collection`** *(array)*: Counter collection records.
- **Items** *(object)*
- **`dispatch_data`** *(object, required)*: Dispatch data details.
- **`size`** *(integer, required)*: Size of the dispatch data.
- **`correlation_id`** *(object, required)*: Correlation ID information.
- **`internal`** *(integer, required)*: Internal correlation ID.
- **`external`** *(integer, required)*: External correlation ID.
- **`dispatch_info`** *(object, required)*: Dispatch information details.
- **`size`** *(integer, required)*: Size of the dispatch information.
- **`agent_id`** *(object, required)*: Agent ID information.
- **`handle`** *(integer, required)*: Handle of the agent.
- **`queue_id`** *(object, required)*: Queue ID information.
- **`handle`** *(integer, required)*: Handle of the queue.
- **`kernel_id`** *(integer, required)*: ID of the kernel.
- **`dispatch_id`** *(integer, required)*: ID of the dispatch.
- **`private_segment_size`** *(integer, required)*: Size of the private segment.
- **`group_segment_size`** *(integer, required)*: Size of the group segment.
- **`workgroup_size`** *(object, required)*: Workgroup size information.
- **`x`** *(integer, required)*: X dimension.
- **`y`** *(integer, required)*: Y dimension.
- **`z`** *(integer, required)*: Z dimension.
- **`grid_size`** *(object, required)*: Grid size information.
- **`x`** *(integer, required)*: X dimension.
- **`y`** *(integer, required)*: Y dimension.
- **`z`** *(integer, required)*: Z dimension.
- **`records`** *(array, required)*: Records.
- **Items** *(object)*
- **`counter_id`** *(object, required)*: Counter ID information.
- **`handle`** *(integer, required)*: Handle of the counter.
- **`value`** *(number, required)*: Value of the counter.
- **`thread_id`** *(integer, required)*: Thread ID.
- **`arch_vgpr_count`** *(integer, required)*: Count of VGPRs.
- **`sgpr_count`** *(integer, required)*: Count of SGPRs.
- **`lds_block_size_v`** *(integer, required)*: Size of LDS block.
- **`buffer_records`** *(object, required)*: Buffer record details.
- **`kernel_dispatch`** *(array)*: Kernel dispatch records.
- **Items** *(object)*
- **`size`** *(integer, required)*: Size of the dispatch.
- **`kind`** *(integer, required)*: Kind of the dispatch.
- **`operation`** *(integer, required)*: Operation of the dispatch.
- **`thread_id`** *(integer, required)*: Thread ID.
- **`correlation_id`** *(object, required)*: Correlation ID information.
- **`internal`** *(integer, required)*: Internal correlation ID.
- **`external`** *(integer, required)*: External correlation ID.
- **`start_timestamp`** *(integer, required)*: Start timestamp.
- **`end_timestamp`** *(integer, required)*: End timestamp.
- **`dispatch_info`** *(object, required)*: Dispatch information details.
- **`size`** *(integer, required)*: Size of the dispatch information.
- **`agent_id`** *(object, required)*: Agent ID information.
- **`handle`** *(integer, required)*: Handle of the agent.
- **`queue_id`** *(object, required)*: Queue ID information.
- **`handle`** *(integer, required)*: Handle of the queue.
- **`kernel_id`** *(integer, required)*: ID of the kernel.
- **`dispatch_id`** *(integer, required)*: ID of the dispatch.
- **`private_segment_size`** *(integer, required)*: Size of the private segment.
- **`group_segment_size`** *(integer, required)*: Size of the group segment.
- **`workgroup_size`** *(object, required)*: Workgroup size information.
- **`x`** *(integer, required)*: X dimension.
- **`y`** *(integer, required)*: Y dimension.
- **`z`** *(integer, required)*: Z dimension.
- **`grid_size`** *(object, required)*: Grid size information.
- **`x`** *(integer, required)*: X dimension.
- **`y`** *(integer, required)*: Y dimension.
- **`z`** *(integer, required)*: Z dimension.
- **`hip_api`** *(array)*: HIP API records.
- **Items** *(object)*
- **`size`** *(integer, required)*: Size of the HIP API record.
- **`kind`** *(integer, required)*: Kind of the HIP API.
- **`operation`** *(integer, required)*: Operation of the HIP API.
- **`correlation_id`** *(object, required)*: Correlation ID information.
- **`internal`** *(integer, required)*: Internal correlation ID.
- **`external`** *(integer, required)*: External correlation ID.
- **`start_timestamp`** *(integer, required)*: Start timestamp.
- **`end_timestamp`** *(integer, required)*: End timestamp.
- **`thread_id`** *(integer, required)*: Thread ID.
- **`hsa_api`** *(array)*: HSA API records.
- **Items** *(object)*
- **`size`** *(integer, required)*: Size of the HSA API record.
- **`kind`** *(integer, required)*: Kind of the HSA API.
- **`operation`** *(integer, required)*: Operation of the HSA API.
- **`correlation_id`** *(object, required)*: Correlation ID information.
- **`internal`** *(integer, required)*: Internal correlation ID.
- **`external`** *(integer, required)*: External correlation ID.
- **`start_timestamp`** *(integer, required)*: Start timestamp.
- **`end_timestamp`** *(integer, required)*: End timestamp.
- **`thread_id`** *(integer, required)*: Thread ID.
- **`marker_api`** *(array)*: Marker (ROCTx) API records.
- **Items** *(object)*
- **`size`** *(integer, required)*: Size of the Marker API record.
- **`kind`** *(integer, required)*: Kind of the Marker API.
- **`operation`** *(integer, required)*: Operation of the Marker API.
- **`correlation_id`** *(object, required)*: Correlation ID information.
- **`internal`** *(integer, required)*: Internal correlation ID.
- **`external`** *(integer, required)*: External correlation ID.
- **`start_timestamp`** *(integer, required)*: Start timestamp.
- **`end_timestamp`** *(integer, required)*: End timestamp.
- **`thread_id`** *(integer, required)*: Thread ID.
- **`memory_copy`** *(array)*: Async memory copy records.
- **Items** *(object)*
- **`size`** *(integer, required)*: Size of the Marker API record.
- **`kind`** *(integer, required)*: Kind of the Marker API.
- **`operation`** *(integer, required)*: Operation of the Marker API.
- **`correlation_id`** *(object, required)*: Correlation ID information.
- **`internal`** *(integer, required)*: Internal correlation ID.
- **`external`** *(integer, required)*: External correlation ID.
- **`start_timestamp`** *(integer, required)*: Start timestamp.
- **`end_timestamp`** *(integer, required)*: End timestamp.
- **`thread_id`** *(integer, required)*: Thread ID.
- **`dst_agent_id`** *(object, required)*: Destination Agent ID.
- **`handle`** *(integer, required)*: Handle of the agent.
- **`src_agent_id`** *(object, required)*: Source Agent ID.
- **`handle`** *(integer, required)*: Handle of the agent.
- **`bytes`** *(integer, required)*: Bytes copied.
@@ -1,28 +0,0 @@
# Welcome to the [ROCprofiler](https://github.com/ROCm/rocprofiler-sdk) Documentation!
## Important Changes
[Roctracer](https://github.com/ROCm/roctracer) and [rocprofiler (v1)](https://github.com/ROCm/rocprofiler)
have been combined into a single rocprofiler SDK and re-designed from scratch. The new rocprofiler API has been designed with some
new restrictions to avoid problems that plagued the former implementations. These restrictions enable more efficient implementations
and much better thread-safety. The most important restriction is the window for tools to inform rocprofiler about which services
the tool wishes to use (where "services" refers to the capabilities for API tracing, kernel tracing, etc.).
In the former implementations, when one of the ROCm runtimes were initially loaded, a tool only had
to inform roctracer/rocprofiler that it wished to use its services at some point (e.g. calling `roctracer_init()`)
and were not required to specify which services it would eventually or potentially use. Thus, these libraries had to effectively prepare for
any service to be enable at any point in time -- which introduced unnecessary overhead when tools had no desire to use certain features and
made thread-safe data management difficult. For example, roctracer was required to _always_ install wrappers around _every_ runtime API function
and _always_ added extra overhead of indirection through the roctracer library and checks for the current service configuration (in a thread-safe manner).
In the re-designed implementation, rocprofiler introduces the concept of a "context". Contexts are effectively
bundles of service configurations. Rocprofiler gives each tool _one_ opportunity to create as many contexts as necessary --
for example, a tool can group all of the services into one context, create individual contexts for each service, or somewhere in between.
Due to this design choice change, rocprofiler now knows _exactly_ which services might be requested by the tool clients at any point in time.
This has several important implications:
- rocprofiler does not have to unnecessarily prepare for services that are never used -- if no registered contexts requested tracing the HSA API, no wrappers need to be generated
- rocprofiler can perform more extensive checks during service specification and inform tools about potential issues very early on
- rocprofiler can allow multiple tools to use certain services simulatenously
- rocprofiler was able to improve thread-safety without introducing parallel bottlenecks
- rocprofiler can manage internal data and allocations more efficiently
@@ -0,0 +1,37 @@
.. meta::
:description: Documentation of the installation, configuration, use of the ROCProfiler SDK, and rocprofv3 command-line tool
:keywords: ROCProfiler SDK tool, ROCProfiler SDK library, rocprofv3, ROCm, API, reference
.. _index:
******************************************
ROCProfiler SDK documentation
******************************************
ROCProfiler SDK is a comprehensive library that provides APIs for profiling and tracing HIP applications on AMD ROCm Software. To learn more, see :ref:`what-is-rocprof-sdk`
You can access ROCProfiler SDK on our `GitHub repository <https://github.com/ROCm/rocprofiler-sdk>`_.
The documentation is structured as follows:
.. grid:: 2
:gutter: 3
.. grid-item-card:: Install
* :doc:`Installation <install/installation>`
.. grid-item-card:: How to
* :doc:`Using rocprofv3 <how-to/using-rocprofv3>`
* :doc:`Samples <how-to/samples>`
.. grid-item-card:: API reference
* :doc:`API library <_doxygen/html/index>`
To contribute to the documentation, refer to
`Contributing to ROCm <https://rocm.docs.amd.com/en/latest/contribute/contributing.html>`_.
You can find licensing information on the
`Licensing <https://rocm.docs.amd.com/en/latest/about/license.html>`_ page.
@@ -0,0 +1,70 @@
# Installation
This document provides information required to install ROCprofiler-SDK from source.
## Supported systems
ROCprofiler-SDK is supported only on Linux. The following distributions are tested:
- Ubuntu 20.04
- Ubuntu 22.04
- OpenSUSE 15.4
- RedHat 8.8
Other [Linux distributions](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-operating-systems) might be supported but not tested yet.
### Identifying the operating system
To identify the Linux distribution and version, see the `/etc/os-release` and `/usr/lib/os-release` files:
```shell
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
...
VERSION_ID="20.04"
...
```
The relevant fields are `ID` and the `VERSION_ID`.
## Build requirements
Install [CMake](https://cmake.org/) version 3.21 or higher.
**Note:** If the `CMake` installed on the system is too old, you can install a new version using various methods. One of the easiest options is to use PyPi (Python’s pip).
```bash
pip install --user 'cmake==3.22.0'
export PATH=${HOME}/.local/bin:${PATH}
```
## Building ROCprofiler-SDK
```bash
git clone https://git@github.com:ROCm/rocprofiler-sdk.git rocprofiler-sdk-source
cmake \
-B rocprofiler-sdk-build \
-D ROCPROFILER_BUILD_TESTS=ON \
-D ROCPROFILER_BUILD_SAMPLES=ON \
-D CMAKE_INSTALL_PREFIX=/opt/rocm \
rocprofiler-sdk-source
cmake --build rocprofiler-sdk-build --target all --parallel 8
```
## Installing ROCprofiler-SDK
To install ROCprofiler-SDK from the `rocprofiler-sdk-build` directory, run:
```bash
cmake --build rocprofiler-sdk-build --target install
```
## Testing ROCprofiler-SDK
To run the built tests, `cd` into the `rocprofiler-sdk-build` directory and run:
```bash
ctest --output-on-failure -O ctest.all.log
```
@@ -1,73 +0,0 @@
# Installation
## Operating System
ROCprofiler is only supported on Linux. The following distributions are tested:
- Ubuntu 20.04
- Ubuntu 22.04
- OpenSUSE 15.4
- RedHat 8.8
Other OS distributions may be supported but have yet to be tested.
### Identifying the Operating System
If you are unsure of the operating system and version, the `/etc/os-release` and `/usr/lib/os-release` files contain
operating system identification data for Linux systems.
```shell
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
...
VERSION_ID="20.04"
...
```
The relevant fields are `ID` and the `VERSION_ID`.
## Installing ROCprofiler from source
### Build Requirements
ROCprofiler needs a CMake (https://cmake.org/) version 3.21 or higher.
***If the system installed 'CMake' is too old, installing a new version can be done through several methods. One of the easiest options is to use PyPi (i.e., python’s pip):***
```bash
pip install --user 'cmake==3.22.0'
export PATH=${HOME}/.local/bin:${PATH}
```
### Building ROCprofiler
```bash
git clone https://git@github.com:ROCm/rocprofiler-sdk.git rocprofiler-sdk-source
cmake \
-B rocprofiler-sdk-build \
-D ROCPROFILER_BUILD_TESTS=ON \
-D ROCPROFILER_BUILD_SAMPLES=ON \
-D ROCPROFILER_BUILD_DOCS=ON \
-D CMAKE_INSTALL_PREFIX=/opt/rocm \
rocprofiler-sdk-source
cmake --build rocprofiler-sdk-build --target all --parallel 8
```
### Testing ROCprofiler
To run the built tests, cd into the `rocprofiler-sdk-build` directory and run:
```bash
ctest --output-on-failure -O ctest.all.log
```
### Installing ROCprofiler
To install ROCprofiler from the `rocprofiler-sdk-build` directory, run:
```bash
cmake --build rocprofiler-sdk-build --target install
```
@@ -1,802 +0,0 @@
# rocprofv3 User Guide
ROCProfiler SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm platform. It supports application tracing to provide a big picture of the GPU application execution and kernel profiling to provide low-level hardware details from the performance counters.
The ROCProfiler SDK library provides runtime-independent APIs for tracing runtime calls and asynchronous activities such as GPU kernel dispatches and memory moves. The tracing includes callback APIs for runtime API tracing and activity APIs for asynchronous activity records logging. You can use these APIs to develop a tracing tool or implement tracing in the application.
This document discusses the command-line tool `rocprofv3` in detail. It is based on the APIs from the ROCProfiler SDK library.
## Installation
To install ROCProfiler SDK from the source, follow the instructions provided in the sections below:
### Prerequisites
- Linux operating system. Here is the list of operating systems tested for ROCProfiler SDK support:
- Ubuntu 20.04
- Ubuntu 22.04
- OpenSUSE 15.4
- RedHat 8.8
To check a system’s operating system and version, see the `/etc/os-release` and `/usr/lib/os-release` files:
```bash
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
...
VERSION_ID="20.04"
...
```
- Cmake version 3.21 or higher.
- To install a new version of Cmake, we recommend using PyPi (Python’s pip):
```bash
pip install --user 'cmake==3.21.0'
export PATH=${HOME}/.local/bin:${PATH}
```
### Build
To build ROCProfiler SDK, use:
```bash
git clone https://git@github.com:ROCm/rocprofiler-sdk.git rocprofiler-sdk-source
```
```bash
cmake --build rocprofiler-sdk-build --target all --parallel 8
```
To see the various build options along with their default values, use:
```bash
$ cmake -LA
REPRODUCIBLE_RUNTIME_USE_MPI:BOOL=OFF
ROCPROFILER_BLACK_FORMAT_EXE:FILEPATH=ROCPROFILER_BLACK_FORMAT_EXE-NOTFOUND
ROCPROFILER_BUILD_CI:BOOL=OFF
ROCPROFILER_BUILD_CODECOV:BOOL=OFF
ROCPROFILER_BUILD_DEBUG:BOOL=OFF
ROCPROFILER_BUILD_DEVELOPER:BOOL=OFF
ROCPROFILER_BUILD_DOCS:BOOL=OFF
ROCPROFILER_BUILD_FMT:BOOL=ON
ROCPROFILER_BUILD_GHC_FS:BOOL=ON
ROCPROFILER_BUILD_GLOG:BOOL=ON
ROCPROFILER_BUILD_GTEST:BOOL=ON
ROCPROFILER_BUILD_RELEASE:BOOL=OFF
ROCPROFILER_BUILD_SAMPLES:BOOL=OFF
ROCPROFILER_BUILD_STACK_PROTECTOR:BOOL=ON
ROCPROFILER_BUILD_STATIC_LIBGCC:BOOL=OFF
ROCPROFILER_BUILD_STATIC_LIBSTDCXX:BOOL=OFF
ROCPROFILER_BUILD_TESTS:BOOL=ON
ROCPROFILER_BUILD_WERROR:BOOL=OFF
ROCPROFILER_CLANG_FORMAT_EXE:FILEPATH=ROCPROFILER_CLANG_FORMAT_EXE-NOTFOUND
ROCPROFILER_CLANG_TIDY_COMMAND:FILEPATH=ROCPROFILER_CLANG_TIDY_COMMAND-NOTFOUND
ROCPROFILER_CMAKE_FORMAT_EXE:FILEPATH=ROCPROFILER_CMAKE_FORMAT_EXE-NOTFOUND
ROCPROFILER_CPACK_SYSTEM_NAME:STRING=Linux
ROCPROFILER_DEBUG_TRACE:BOOL=OFF
ROCPROFILER_DEFAULT_ROCM_PATH:PATH=/opt/rocm-6.1.0-13278
ROCPROFILER_ENABLE_CLANG_TIDY:BOOL=OFF
ROCPROFILER_LD_AQLPROFILE:BOOL=OFF
ROCPROFILER_MEMCHECK:STRING=
ROCPROFILER_REGENERATE_COUNTERS_PARSER:BOOL=OFF
ROCPROFILER_UNSAFE_NO_VERSION_CHECK:BOOL=OFF
```
### Install
To install ROCProfiler SDK from the `rocprofiler-sdk-build` directory, run:
```bash
cmake --build rocprofiler-sdk-build --target install
```
### Test
To run the build tests, `cd` into the `rocprofiler-sdk-build` directory and run:
```bash
ctest -R
```
## Usage
`rocprofv3` is a CLI tool that helps you quickly optimize applications and understand the low-level kernel details without requiring any modification in the source code. It is being developed to be backward compatible with its predecessor, `rocprof`, and to provide more features to help users profile their applications with better accuracy.
The following sections demonstrate the use of `rocprofv3` for application tracing and kernel profiling using various command-line options.
`rocprofv3` is installed with ROCm under `/opt/rocm/bin`. To use the tool from anywhere in the system, export `PATH` variable:
```bash
export PATH=$PATH:/opt/rocm/bin
```
Before you can start tracing or profiling your HIP application using `rocprofv3`, build the application using:
```bash
cmake -B <build-directory> <source-directory> -DCMAKE_PREFIX_PATH=/opt/rocm
cmake --build <build-directory> --target all --parallel <N>
```
### Options
Below is the list of `rocprofv3` command-line options. Some options are used for application tracing and some for kernel profiling while the output control options control the presentation and redirection of the generated output.
| Option | Description | Use |
|--------|-------------|-----|
| --hip-trace | Collects HIP runtime traces. | Application tracing |
| --hip-runtime-trace | Collects HIP runtime API traces. | Application tracing |
| --hip-compiler-trace | Collects HIP compiler-generated code traces. | Application tracing |
| --scratch-memory-trace | Collects scratch memory operations traces. | Application tracing |
| --hsa-trace | Collects HSA API traces. | Application tracing |
| --hsa-core-trace | Collects HSA API traces (core API). | Application tracing |
| --hsa-amd-trace | Collects HSA API traces (AMD-extension API). | Application tracing |
| --hsa-image-trace | Collects HSA API Ttaces (Image-extension API). | Application tracing |
| --hsa-finalizer-trace | Collects HSA API traces (Finalizer-extension API). | Application tracing |
| --stats | For Collecting statistics of enabled tracing types | Application tracing |
| --kernel-trace | Collects kernel dispatch traces. | Application tracing |
| --marker-trace | Collects marker (ROC-TX) traces. | Application tracing |
| --memory-copy-trace | Collects memory copy traces. | Application tracing |
| --sys-trace | Collects HIP, HSA, memory copy, marker, and kernel dispatch traces. | Application Tracing |
| -i | Specifies the input file. | Kernel profiling |
| -L \| --list-metrics | List metrics for counter collection. | Kernel profiling |
| -d \| --output-directory | Specifies the path for the output files. | Output control |
| -o \| --output-file | Specifies the name of the output file. Note that this name is appended to the default names (_api_trace or counter_collection.csv) of the generated files'. | Output control |
| -M \| --mangled-kernels | Overrides the default demangling of kernel names. | Output control |
| -T \| --truncate-kernels | Truncates the demangled kernel names for improved readability. | Output control |
| --output-format | For adding output format (supported formats: csv, json, pftrace) | Output control |
You can also see all the `rocprofv3` options using:
```bash
rocprofv3 --help
```
### Application tracing
Application tracing provides the big picture of a program’s execution by collecting data on the execution times of API calls and GPU commands, such as kernel execution, async memory copy, and barrier packets. This information can be used as the first step in the profiling process to answer important questions, such as how much percentage of time was spent on memory copy and which kernel took the longest time to execute.
To use `rocprofv3` for application tracing, run:
```bash
rocprofv3 <tracing_option> -- <app_relative_path>
```
#### HIP trace
HIP trace comprises execution traces for the entire application at the HIP level. This includes HIP API functions and their asynchronous activities at the runtime level. In general, HIP APIs directly interact with the user program. It is easier to analyze HIP traces as you can directly map them to the program.
To trace HIP runtime APIs, use:
```bash
rocprofv3 --hip-trace -- < app_relative_path >
```
**Note: The tracing and counter colleciton options generates an additional agent info file.
The above command generates a `hip_api_trace.csv` file prefixed with the process ID.
```bash
$ cat 238_hip_api_trace.csv
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"HIP_RUNTIME_API","hipGetDevicePropertiesR0600",238,238,1,1191915574691984,1191915687784011
"HIP_RUNTIME_API","hipMalloc",238,238,2,1191915691312459,1191915691388696
"HIP_RUNTIME_API","hipMalloc",238,238,3,1191915691390637,1191915691423279
"HIP_RUNTIME_API","hipMemcpy",238,238,4,1191915691439107,1191916547828448
"HIP_RUNTIME_API","hipLaunchKernel",238,238,5,1191916547842972,1191916548408842
"HIP_RUNTIME_API","hipMemcpy",238,238,6,1191916548412677,1191916550217834
"HIP_RUNTIME_API","hipFree",238,238,7,1191916562618151,1191916562789093
"HIP_RUNTIME_API","hipFree",238,238,8,1191916562790923,1191916562836351
```
To trace HIP compile time APIs, use:
```bash
rocprofv3 --hip-compiler-trace -- < app_relative_path >
```
The above command generates a `hip_api_trace.csv` file prefixed with the process ID.
```bash
$ cat 208_hip_api_trace.csv
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"HIP_COMPILER_API","__hipRegisterFatBinary",208,208,1,1508780270085955,1508780270096795
"HIP_COMPILER_API","__hipRegisterFunction",208,208,2,1508780270104242,1508780270115355
"HIP_COMPILER_API","__hipPushCallConfiguration",208,208,3,1508780613897816,1508780613898701
"HIP_COMPILER_API","__hipPopCallConfiguration",208,208,4,1508780613901714,1508780613902200
```
To describe the fields in the output file, see [Output file fields](#output-file-fields).
##### Agent Info
```bash
$ cat 238_agent_info.csv
"Node_Id","Logical_Node_Id","Agent_Type","Cpu_Cores_Count","Simd_Count","Cpu_Core_Id_Base","Simd_Id_Base","Max_Waves_Per_Simd","Lds_Size_In_Kb","Gds_Size_In_Kb","Num_Gws","Wave_Front_Size","Num_Xcc","Cu_Count","Array_Count","Num_Shader_Banks","Simd_Arrays_Per_Engine","Cu_Per_Simd_Array","Simd_Per_Cu","Max_Slots_Scratch_Cu","Gfx_Target_Version","Vendor_Id","Device_Id","Location_Id","Domain","Drm_Render_Minor","Num_Sdma_Engines","Num_Sdma_Xgmi_Engines","Num_Sdma_Queues_Per_Engine","Num_Cp_Queues","Max_Engine_Clk_Ccompute","Max_Engine_Clk_Fcompute","Sdma_Fw_Version","Fw_Version","Capability","Cu_Per_Engine","Max_Waves_Per_Cu","Family_Id","Workgroup_Max_Size","Grid_Max_Size","Local_Mem_Size","Hive_Id","Gpu_Id","Workgroup_Max_Dim_X","Workgroup_Max_Dim_Y","Workgroup_Max_Dim_Z","Grid_Max_Dim_X","Grid_Max_Dim_Y","Grid_Max_Dim_Z","Name","Vendor_Name","Product_Name","Model_Name"
0,0,"CPU",24,0,0,0,0,0,0,0,0,1,24,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3800,0,0,0,0,0,0,23,0,0,0,0,0,0,0,0,0,0,0,"AMD Ryzen 9 3900X 12-Core Processor","CPU","AMD Ryzen 9 3900X 12-Core Processor",""
1,1,"GPU",0,256,0,2147487744,10,64,0,64,64,1,64,4,4,1,16,4,32,90000,4098,26751,12032,0,128,2,0,2,24,3800,1630,432,440,138420864,16,40,141,1024,4294967295,0,0,64700,1024,1024,1024,4294967295,4294967295,4294967295,"gfx900","AMD","Radeon RX Vega","vega10"
```
#### HSA trace
The HIP runtime library is implemented with the low-level HSA runtime. HSA API tracing is more suited for advanced users who want to understand the application behavior at the lower level. In general, tracing at the HIP level is recommended for most users. You should use HSA trace only if you are familiar with HSA runtime.
HSA trace contains the start and end time of HSA runtime API calls and their asynchronous activities.
```bash
rocprofv3 --hsa-trace -- < app_relative_path >
```
The above command generates a `hsa_api_trace.csv` file prefixed with process ID.
Note: the contents of this file have been truncated for demonstration purposes.
```bash
$ cat 197_hsa_api_trace.csv
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"HSA_CORE_API","hsa_system_get_major_extension_table",197,197,1,1507843974724237,1507843974724947
"HSA_CORE_API","hsa_agent_get_info",197,197,3,1507843974754471,1507843974755014
"HSA_AMD_EXT_API","hsa_amd_memory_pool_get_info",197,197,5,1507843974761705,1507843974762398
"HSA_AMD_EXT_API","hsa_amd_memory_pool_get_info",197,197,6,1507843974763901,1507843974764030
"HSA_AMD_EXT_API","hsa_amd_memory_pool_get_info",197,197,7,1507843974765121,1507843974765224
"HSA_AMD_EXT_API","hsa_amd_memory_pool_get_info",197,197,8,1507843974766196,1507843974766328
"HSA_AMD_EXT_API","hsa_amd_memory_pool_get_info",197,197,9,1507843974767534,1507843974767641
"HSA_AMD_EXT_API","hsa_amd_memory_pool_get_info",197,197,10,1507843974768639,1507843974768779
"HSA_AMD_EXT_API","hsa_amd_agent_iterate_memory_pools",197,197,4,1507843974758768,1507843974769238
"HSA_CORE_API","hsa_agent_get_info",197,197,11,1507843974771091,1507843974771537
```
To describe the fields in the output file, see [Output file fields](#output-file-fields).
#### Marker trace
In certain situations, such as debugging performance issues in large-scale GPU programs, API-level tracing may be too fine-grained to provide a big picture of the program execution. In such cases, defining specific tasks to be traced is helpful.
To specify the tasks for tracing, enclose the respective source code with the API calls provided by the ROCTX library. This process is also known as instrumentation. As the scope of code for instrumentation is defined using the enclosing API calls, it is called a range. A range is a programmer-defined task that has a well-defined start and end code scope. You can also fine-grained the scope specified within a range using further nested ranges. The `rocprofv3` tool also reports the timelines for these nested ranges.
Here is a list of useful APIs for code instrumentation.
- `roctxMark`: Inserts a marker in the code with a message. Creating marks can help you see when a line of code is executed.
- `roctxRangeStart`: Starts a range. Different threads can start ranges.
- `roctxRangePush`: Starts a new nested range.
- `roctxRangePop`: Stops the current nested range.
- `roctxRangeStop`: Stops the given range.
See how to use `rocTX` APIs in the MatrixTranspose application below:
```bash
roctxMark("before hipLaunchKernel");
int rangeId = roctxRangeStart("hipLaunchKernel range");
roctxRangePush("hipLaunchKernel");
// Launching kernel from host
hipLaunchKernelGGL(matrixTranspose, dim3(WIDTH/THREADS_PER_BLOCK_X, WIDTH/THREADS_PER_BLOCK_Y), dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y), 0,0,gpuTransposeMatrix,gpuMatrix, WIDTH);
roctxMark("after hipLaunchKernel");
// Memory transfer from device to host
roctxRangePush("hipMemcpy");
hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost);
roctxRangePop(); // for "hipMemcpy"
roctxRangePop(); // for "hipLaunchKernel"
roctxRangeStop(rangeId);
```
To trace the API calls enclosed within the range, use:
```bash
rocprofv3 --marker-trace -- < app_relative_path >
```
Running the above command generates a `marker_api_trace.csv` file prefixed with the process ID.
```bash
$ cat 210_marker_api_trace.csv
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"MARKER_CORE_API","before hipLaunchKernel",717,717,1,1520113899312225,1520113899312225
"MARKER_CORE_API","after hipLaunchKernel",717,717,4,1520113900128482,1520113900128482
"MARKER_CORE_API","hipMemcpy",717,717,5,1520113900141100,1520113901483408
"MARKER_CORE_API","hipLaunchKernel",717,717,3,1520113899684965,1520113901491622
"MARKER_CORE_API","hipLaunchKernel range",717,0,2,1520113899682208,1520113901495882
```
For the description of the fields in the output file, see [Output file fields](#output-file-fields).
#### Kernel trace
To trace kernel dispatch traces, use:
```bash
rocprofv3 --kernel-trace -- < app_relative_path >
```
The above command generates a `kernel_trace.csv` file prefixed with the process ID.
```bash
$ cat 199_kernel_trace.csv
"Kind","Agent_Id","Queue_Id","Kernel_Id","Kernel_Name","Correlation_Id","Start_Timestamp","End_Timestamp","Private_Segment_Size","Group_Segment_Size","Workgroup_Size_X","Workgroup_Size_Y","Workgroup_Size_Z","Grid_Size_X","Grid_Size_Y","Grid_Size_Z"
"KERNEL_DISPATCH",1,139690710949888,15,"matrixTranspose(float*, float*, int)",0,671599758568,671599825328,0,0,4,4,1,1024,1024,1
```
To describe the fields in the output file, see [Output file fields](#output-file-fields).
#### Memory copy trace
To trace memory moves across the application, use:
```bash
rocprofv3 -memory-copy-trace -- < app_relative_path >
```
The above command generates a `memory_copy_trace.csv` file prefixed with the process ID.
```bash
$ cat 197_memory_copy_trace.csv
"Kind","Direction","Source_Agent_Id","Destination_Agent_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"MEMORY_COPY","HOST_TO_DEVICE",0,1,0,14955949675563,14955950239443
"MEMORY_COPY","DEVICE_TO_HOST",1,0,0,14955952733485,14955953315285
```
To describe the fields in the output file, see [Output file fields](#output-file-fields).
#### Sys trace
This is an all-inclusive option to collect all the above-mentioned traces.
```bash
rocprofv3 -sys-trace -- < app_relative_path >
```
Running the above command generates `hip_api_trace.csv`, `hsa_api_trace.csv`, `kernel_trace.csv`, `memory_copy_trace.csv`, and `marker_api_trace.csv` (if `rocTX` APIs are specified in the application) files prefixed with the process Id.
#### Scratch Memory Trace
This command collects scratch memory operations traces. Scratch is an address space on AMD GPUs that is roughly equivalent to “local memory” in CUDA (i.e., thread-local global memory with interleaved addressing) that is used for register spills/stack space. rocprof
traces when the rocr runtime allocates, frees, and tries to reclaim scratch memory.
```bash
rocprofv3 --scratch-memory-trace < app_relative_path >
```
#### Stats
This command collects statistics of enabled tracing types. If HIP trace is enabled, then statisitics of HIP APIs will be collected
```bash
rocprofv3 --stats --hip-trace < app_relative_path >
```
The above command generates a `hip_stats.csv`, `hip_api_trace` file prefixed with the process ID.
```bash
$ cat 24189_hip_stats.csv
"Name","Calls","TotalDurationNs","AverageNs","Percentage","MinNs","MaxNs","StdDev"
"__hipPopCallConfiguration",1,721,721.000000,2.541116e-04,721,721,0.000000e+00
"__hipPushCallConfiguration",1,1090,1090.000000,3.841631e-04,1090,1090,0.000000e+00
"__hipRegisterFatBinary",1,5290,5290.000000,1.864425e-03,5290,5290,0.000000e+00
"__hipRegisterFunction",1,6620,6620.000000,2.333174e-03,6620,6620,0.000000e+00
"__hipUnregisterFatBinary",1,866077,866077.000000,3.052430e-01,866077,866077,0.000000e+00
"hipFree",2,65271,32635.500000,2.300432e-02,10900,54371,30738.638885
"hipGetDevicePropertiesR0600",1,37427618,37427618.000000,13.191110,37427618,37427618,0.000000e+00
"hipLaunchKernel",1,352186,352186.000000,1.241256e-01,352186,352186,0.000000e+00
"hipMalloc",2,237654,118827.000000,8.375954e-02,60091,177563,83065.247800
"hipMemcpy",3,232015273,77338424.333333,81.772208,9630,230659937,132782005.405723
```
### Kernel profiling
The application tracing functionality allows you to evaluate the duration of kernel execution but is of little help in providing insight into kernel execution details. The kernel profiling functionality allows you to select kernels for profiling and choose the basic counters or derived metrics to be collected for each kernel execution, thus providing a greater insight into kernel execution.
For more information on counters available on MI200, refer to the [MI200 Performance Counters and Metrics](https://rocm.docs.amd.com/en/latest/conceptual/gpu-arch/mi300-mi200-performance-counters.html).
#### Input file
To collect the desired basic counters or derived metrics, you can just mention them in an input file below. The line consisting of the counter or metric names must begin with `pmc`. We support input file in text(.txt extension), yaml(.yaml/.yml) and json(.json) format.
```bash
$ cat input.txt
pmc: GPUBusy SQ_WAVES
pmc: GRBM_GUI_ACTIVE
OR
$ cat input.json
{
"metrics": [
{
"pmc": ["SQ_WAVES", "GRBM_COUNT", "GUI_ACTIVE"]
},
{
"pmc": ["FETCH_SIZE", "WRITE_SIZE"]
}
]
}
OR
$ cat input.yaml
metrics:
- pmc:
- SQ_WAVES
- GRBM_COUNT
- GUI_ACTIVE
- 'TCC_HIT[1]'
- 'TCC_HIT[2]'
- pmc:
- FETCH_SIZE
- WRITE_SIZE
```
The GPU hardware resources limit the number of basic counters or derived metrics that can be collected in one run of profiling. If too many counters or metrics are selected, the kernels need to be executed multiple times to collect them. For multi-pass execution, include multiple `pmc` rows in the input file. Counters or metrics in each `pmc` row can be collected in each kernel run.
#### Kernel profiling output
To supply the input file for kernel profiling, use:
```bash
rocprofv3 -i input.txt -- <app_relative_path>
```
Running the above command generates a `./pmc_n/counter_collection.csv` file prefixed with the process ID. For each `pmc` row, a directory `pmc_n` containing a `counter_collection.csv` file is generated, where n = 1 for the first row and so on.
Each row of the CSV file is an instance of kernel execution. Here is a truncated version of the output file from `pmc_1`.
```bash
$ cat pmc_1/218_counter_collection.csv
"Correlation_Id","Dispatch_Id","Agent_Id","Queue_Id","Process_Id","Thread_Id","Grid_Size","Kernel_Name","Workgroup_Size","LDS_Block_Size","Scratch_Size","VGPR_Count","SGPR_Count","Counter_Name","Counter_Value"
0,1,1,139892123975680,5619,5619,1048576,"matrixTranspose(float*, float*, int)",16,0,0,8,16,"SQ_WAVES",65536
```
### Output file fields
The various fields or the columns in the output CSV files generated for application tracing and kernel profiling are described here:
| Field | Description |
|-------|-------------|
| Agent_Id | GPU identifier to which the kernel was submitted. |
| Correlation_Id | Unique identifier for correlation between HIP and HSA async calls during activity tracing. |
| Start_Timestamp | Begin time in nanoseconds (`ns`) when the kernel begins execution. |
| End_Timestamp | End time in ns when the kernel finishes execution. |
| Queue_Id | ROCm queue unique identifier to which the kernel was submitted. |
| Private_Segment_Size | The amount of memory required for the combined private, spill, and arg segments for a work item in bytes. |
| Group_Segment_Size | The group segment memory required by a workgroup in bytes. This does not include any dynamically allocated group segment memory that may be added when the kernel is dispatched. |
| Workgroup_Size | Size of the workgroup as declared by the compute shader. |
| Workgroup_Size_n | Size of the workgroup in the nth dimension as declared by the compute shader, where n = X, Y, or Z. |
| Grid_Size | Number of thread blocks required to launch the kernel. |
| Grid_Size_n | Number of thread blocks in the nth dimension required to launch the kernel, where n = X, Y, or Z. |
| LDS_Block_Size | Thread block size for the kernel's Local Data Share (`LDS`) memory. |
| Scratch_Size | Kernel’s scratch memory size. |
| SGPR_Count | Kernel's Scalar General-Purpose Register (`SGPR`) count. |
| VGPR_Count | Kernel's Vector General-Purpose Register (`VGPR`) count. |
### Sample programs
After the ROCm build is installed:
- Sample programs are installed here:
```bash
/opt/rocm/share/rocprofiler-sdk/samples
```
- `rocprofv3` tool is installed here:
```bash
/opt/rocm/bin
```
To build samples from any directory, run the following:
```bash
cmake -B <build directory> /opt/rocm/share/rocprofiler-sdk/samples -DCMAKE_PREFIX_PATH=/opt/rocm
cmake --build <build directory> --target all --parallel 8
```
To run the built samples, `cd` into the `<build directory>` mentioned in the build commands above and run:
```bash
ctest -V
```
**Note:** Running a few of these tests will require pandas and pytest to be installed first.
```bash
/usr/local/bin/python -m pip install -r requirements.txt
```
## Output Formats
rocprofv3 provides the following output formats:
- CSV (default)
- JSON
- PFTrace
Specification of the output format is via the `--output-format` command-line option. Format selection is case-insensitive
and multiple output formats are supported. Example: `--output-format json` enables JSON output exclusively whereas
`--output-format csv json pftrace` enables all three output formats for the run.
For trace visualization, use the PFTrace format and open the trace in [ui.perfetto.dev](https://ui.perfetto.dev/).
### JSON Output Schema
rocprofv3 supports a custom JSON output format designed for programmatic analysis. The schema is optimized for size
while factoring in usability. The JSON output can be generated via the `--output-format json` command-line option.
#### Properties
- **`rocprofiler-sdk-tool`** *(array)*: rocprofv3 data per process (each element represents a process).
- **Items** *(object)*: Data for rocprofv3.
- **`metadata`** *(object, required)*: Metadata related to the profiler session.
- **`pid`** *(integer, required)*: Process ID.
- **`init_time`** *(integer, required)*: Initialization time in nanoseconds.
- **`fini_time`** *(integer, required)*: Finalization time in nanoseconds.
- **`agents`** *(array, required)*: List of agents.
- **Items** *(object)*: Data for an agent.
- **`size`** *(integer, required)*: Size of the agent data.
- **`id`** *(object, required)*: Identifier for the agent.
- **`handle`** *(integer, required)*: Handle for the agent.
- **`type`** *(integer, required)*: Type of the agent.
- **`cpu_cores_count`** *(integer)*: Number of CPU cores.
- **`simd_count`** *(integer)*: Number of SIMD units.
- **`mem_banks_count`** *(integer)*: Number of memory banks.
- **`caches_count`** *(integer)*: Number of caches.
- **`io_links_count`** *(integer)*: Number of I/O links.
- **`cpu_core_id_base`** *(integer)*: Base ID for CPU cores.
- **`simd_id_base`** *(integer)*: Base ID for SIMD units.
- **`max_waves_per_simd`** *(integer)*: Maximum waves per SIMD.
- **`lds_size_in_kb`** *(integer)*: Size of LDS in KB.
- **`gds_size_in_kb`** *(integer)*: Size of GDS in KB.
- **`num_gws`** *(integer)*: Number of GWS (global work size).
- **`wave_front_size`** *(integer)*: Size of the wave front.
- **`num_xcc`** *(integer)*: Number of XCC (execution compute units).
- **`cu_count`** *(integer)*: Number of compute units (CUs).
- **`array_count`** *(integer)*: Number of arrays.
- **`num_shader_banks`** *(integer)*: Number of shader banks.
- **`simd_arrays_per_engine`** *(integer)*: SIMD arrays per engine.
- **`cu_per_simd_array`** *(integer)*: CUs per SIMD array.
- **`simd_per_cu`** *(integer)*: SIMDs per CU.
- **`max_slots_scratch_cu`** *(integer)*: Maximum slots for scratch CU.
- **`gfx_target_version`** *(integer)*: GFX target version.
- **`vendor_id`** *(integer)*: Vendor ID.
- **`device_id`** *(integer)*: Device ID.
- **`location_id`** *(integer)*: Location ID.
- **`domain`** *(integer)*: Domain identifier.
- **`drm_render_minor`** *(integer)*: DRM render minor version.
- **`num_sdma_engines`** *(integer)*: Number of SDMA engines.
- **`num_sdma_xgmi_engines`** *(integer)*: Number of SDMA XGMI engines.
- **`num_sdma_queues_per_engine`** *(integer)*: Number of SDMA queues per engine.
- **`num_cp_queues`** *(integer)*: Number of CP queues.
- **`max_engine_clk_ccompute`** *(integer)*: Maximum engine clock for compute.
- **`max_engine_clk_fcompute`** *(integer)*: Maximum engine clock for F compute.
- **`sdma_fw_version`** *(object)*: SDMA firmware version.
- **`uCodeSDMA`** *(integer, required)*: SDMA microcode version.
- **`uCodeRes`** *(integer, required)*: Reserved microcode version.
- **`fw_version`** *(object)*: Firmware version.
- **`uCode`** *(integer, required)*: Microcode version.
- **`Major`** *(integer, required)*: Major version.
- **`Minor`** *(integer, required)*: Minor version.
- **`Stepping`** *(integer, required)*: Stepping version.
- **`capability`** *(object, required)*: Agent capability flags.
- **`HotPluggable`** *(integer, required)*: Hot pluggable capability.
- **`HSAMMUPresent`** *(integer, required)*: HSAMMU present capability.
- **`SharedWithGraphics`** *(integer, required)*: Shared with graphics capability.
- **`QueueSizePowerOfTwo`** *(integer, required)*: Queue size is power of two.
- **`QueueSize32bit`** *(integer, required)*: Queue size is 32-bit.
- **`QueueIdleEvent`** *(integer, required)*: Queue idle event.
- **`VALimit`** *(integer, required)*: VA limit.
- **`WatchPointsSupported`** *(integer, required)*: Watch points supported.
- **`WatchPointsTotalBits`** *(integer, required)*: Total bits for watch points.
- **`DoorbellType`** *(integer, required)*: Doorbell type.
- **`AQLQueueDoubleMap`** *(integer, required)*: AQL queue double map.
- **`DebugTrapSupported`** *(integer, required)*: Debug trap supported.
- **`WaveLaunchTrapOverrideSupported`** *(integer, required)*: Wave launch trap override supported.
- **`WaveLaunchModeSupported`** *(integer, required)*: Wave launch mode supported.
- **`PreciseMemoryOperationsSupported`** *(integer, required)*: Precise memory operations supported.
- **`DEPRECATED_SRAM_EDCSupport`** *(integer, required)*: Deprecated SRAM EDC support.
- **`Mem_EDCSupport`** *(integer, required)*: Memory EDC support.
- **`RASEventNotify`** *(integer, required)*: RAS event notify.
- **`ASICRevision`** *(integer, required)*: ASIC revision.
- **`SRAM_EDCSupport`** *(integer, required)*: SRAM EDC support.
- **`SVMAPISupported`** *(integer, required)*: SVM API supported.
- **`CoherentHostAccess`** *(integer, required)*: Coherent host access.
- **`DebugSupportedFirmware`** *(integer, required)*: Debug supported firmware.
- **`Reserved`** *(integer, required)*: Reserved field.
- **`counters`** *(array, required)*: Array of counter objects.
- **Items** *(object)*
- **`agent_id`** *(object, required)*: Agent ID information.
- **`handle`** *(integer, required)*: Handle of the agent.
- **`id`** *(object, required)*: Counter ID information.
- **`handle`** *(integer, required)*: Handle of the counter.
- **`is_constant`** *(integer, required)*: Indicator if the counter value is constant.
- **`is_derived`** *(integer, required)*: Indicator if the counter value is derived.
- **`name`** *(string, required)*: Name of the counter.
- **`description`** *(string, required)*: Description of the counter.
- **`block`** *(string, required)*: Block information of the counter.
- **`expression`** *(string, required)*: Expression of the counter.
- **`dimension_ids`** *(array, required)*: Array of dimension IDs.
- **Items** *(integer)*: Dimension ID.
- **`strings`** *(object, required)*: String records.
- **`callback_records`** *(array)*: Callback records.
- **Items** *(object)*
- **`kind`** *(string, required)*: Kind of the record.
- **`operations`** *(array, required)*: Array of operations.
- **Items** *(string)*: Operation.
- **`buffer_records`** *(array)*: Buffer records.
- **Items** *(object)*
- **`kind`** *(string, required)*: Kind of the record.
- **`operations`** *(array, required)*: Array of operations.
- **Items** *(string)*: Operation.
- **`marker_api`** *(array)*: Marker API records.
- **Items** *(object)*
- **`key`** *(integer, required)*: Key of the record.
- **`value`** *(string, required)*: Value of the record.
- **`counters`** *(object)*: Counter records.
- **`dimension_ids`** *(array, required)*: Array of dimension IDs.
- **Items** *(object)*
- **`id`** *(integer, required)*: Dimension ID.
- **`instance_size`** *(integer, required)*: Size of the instance.
- **`name`** *(string, required)*: Name of the dimension.
- **`code_objects`** *(array, required)*: Code object records.
- **Items** *(object)*
- **`size`** *(integer, required)*: Size of the code object.
- **`code_object_id`** *(integer, required)*: ID of the code object.
- **`rocp_agent`** *(object, required)*: ROCP agent information.
- **`handle`** *(integer, required)*: Handle of the ROCP agent.
- **`hsa_agent`** *(object, required)*: HSA agent information.
- **`handle`** *(integer, required)*: Handle of the HSA agent.
- **`uri`** *(string, required)*: URI of the code object.
- **`load_base`** *(integer, required)*: Base address for loading.
- **`load_size`** *(integer, required)*: Size for loading.
- **`load_delta`** *(integer, required)*: Delta for loading.
- **`storage_type`** *(integer, required)*: Type of storage.
- **`memory_base`** *(integer, required)*: Base address for memory.
- **`memory_size`** *(integer, required)*: Size of memory.
- **`kernel_symbols`** *(array, required)*: Kernel symbol records.
- **Items** *(object)*
- **`size`** *(integer, required)*: Size of the kernel symbol.
- **`kernel_id`** *(integer, required)*: ID of the kernel.
- **`code_object_id`** *(integer, required)*: ID of the code object.
- **`kernel_name`** *(string, required)*: Name of the kernel.
- **`kernel_object`** *(integer, required)*: Object of the kernel.
- **`kernarg_segment_size`** *(integer, required)*: Size of the kernarg segment.
- **`kernarg_segment_alignment`** *(integer, required)*: Alignment of the kernarg segment.
- **`group_segment_size`** *(integer, required)*: Size of the group segment.
- **`private_segment_size`** *(integer, required)*: Size of the private segment.
- **`formatted_kernel_name`** *(string, required)*: Formatted name of the kernel.
- **`demangled_kernel_name`** *(string, required)*: Demangled name of the kernel.
- **`truncated_kernel_name`** *(string, required)*: Truncated name of the kernel.
- **`callback_records`** *(object, required)*: Callback record details.
- **`counter_collection`** *(array)*: Counter collection records.
- **Items** *(object)*
- **`dispatch_data`** *(object, required)*: Dispatch data details.
- **`size`** *(integer, required)*: Size of the dispatch data.
- **`correlation_id`** *(object, required)*: Correlation ID information.
- **`internal`** *(integer, required)*: Internal correlation ID.
- **`external`** *(integer, required)*: External correlation ID.
- **`dispatch_info`** *(object, required)*: Dispatch information details.
- **`size`** *(integer, required)*: Size of the dispatch information.
- **`agent_id`** *(object, required)*: Agent ID information.
- **`handle`** *(integer, required)*: Handle of the agent.
- **`queue_id`** *(object, required)*: Queue ID information.
- **`handle`** *(integer, required)*: Handle of the queue.
- **`kernel_id`** *(integer, required)*: ID of the kernel.
- **`dispatch_id`** *(integer, required)*: ID of the dispatch.
- **`private_segment_size`** *(integer, required)*: Size of the private segment.
- **`group_segment_size`** *(integer, required)*: Size of the group segment.
- **`workgroup_size`** *(object, required)*: Workgroup size information.
- **`x`** *(integer, required)*: X dimension.
- **`y`** *(integer, required)*: Y dimension.
- **`z`** *(integer, required)*: Z dimension.
- **`grid_size`** *(object, required)*: Grid size information.
- **`x`** *(integer, required)*: X dimension.
- **`y`** *(integer, required)*: Y dimension.
- **`z`** *(integer, required)*: Z dimension.
- **`records`** *(array, required)*: Records.
- **Items** *(object)*
- **`counter_id`** *(object, required)*: Counter ID information.
- **`handle`** *(integer, required)*: Handle of the counter.
- **`value`** *(number, required)*: Value of the counter.
- **`thread_id`** *(integer, required)*: Thread ID.
- **`arch_vgpr_count`** *(integer, required)*: Count of VGPRs.
- **`sgpr_count`** *(integer, required)*: Count of SGPRs.
- **`lds_block_size_v`** *(integer, required)*: Size of LDS block.
- **`buffer_records`** *(object, required)*: Buffer record details.
- **`kernel_dispatch`** *(array)*: Kernel dispatch records.
- **Items** *(object)*
- **`size`** *(integer, required)*: Size of the dispatch.
- **`kind`** *(integer, required)*: Kind of the dispatch.
- **`operation`** *(integer, required)*: Operation of the dispatch.
- **`thread_id`** *(integer, required)*: Thread ID.
- **`correlation_id`** *(object, required)*: Correlation ID information.
- **`internal`** *(integer, required)*: Internal correlation ID.
- **`external`** *(integer, required)*: External correlation ID.
- **`start_timestamp`** *(integer, required)*: Start timestamp.
- **`end_timestamp`** *(integer, required)*: End timestamp.
- **`dispatch_info`** *(object, required)*: Dispatch information details.
- **`size`** *(integer, required)*: Size of the dispatch information.
- **`agent_id`** *(object, required)*: Agent ID information.
- **`handle`** *(integer, required)*: Handle of the agent.
- **`queue_id`** *(object, required)*: Queue ID information.
- **`handle`** *(integer, required)*: Handle of the queue.
- **`kernel_id`** *(integer, required)*: ID of the kernel.
- **`dispatch_id`** *(integer, required)*: ID of the dispatch.
- **`private_segment_size`** *(integer, required)*: Size of the private segment.
- **`group_segment_size`** *(integer, required)*: Size of the group segment.
- **`workgroup_size`** *(object, required)*: Workgroup size information.
- **`x`** *(integer, required)*: X dimension.
- **`y`** *(integer, required)*: Y dimension.
- **`z`** *(integer, required)*: Z dimension.
- **`grid_size`** *(object, required)*: Grid size information.
- **`x`** *(integer, required)*: X dimension.
- **`y`** *(integer, required)*: Y dimension.
- **`z`** *(integer, required)*: Z dimension.
- **`hip_api`** *(array)*: HIP API records.
- **Items** *(object)*
- **`size`** *(integer, required)*: Size of the HIP API record.
- **`kind`** *(integer, required)*: Kind of the HIP API.
- **`operation`** *(integer, required)*: Operation of the HIP API.
- **`correlation_id`** *(object, required)*: Correlation ID information.
- **`internal`** *(integer, required)*: Internal correlation ID.
- **`external`** *(integer, required)*: External correlation ID.
- **`start_timestamp`** *(integer, required)*: Start timestamp.
- **`end_timestamp`** *(integer, required)*: End timestamp.
- **`thread_id`** *(integer, required)*: Thread ID.
- **`hsa_api`** *(array)*: HSA API records.
- **Items** *(object)*
- **`size`** *(integer, required)*: Size of the HSA API record.
- **`kind`** *(integer, required)*: Kind of the HSA API.
- **`operation`** *(integer, required)*: Operation of the HSA API.
- **`correlation_id`** *(object, required)*: Correlation ID information.
- **`internal`** *(integer, required)*: Internal correlation ID.
- **`external`** *(integer, required)*: External correlation ID.
- **`start_timestamp`** *(integer, required)*: Start timestamp.
- **`end_timestamp`** *(integer, required)*: End timestamp.
- **`thread_id`** *(integer, required)*: Thread ID.
- **`marker_api`** *(array)*: Marker (ROCTx) API records.
- **Items** *(object)*
- **`size`** *(integer, required)*: Size of the Marker API record.
- **`kind`** *(integer, required)*: Kind of the Marker API.
- **`operation`** *(integer, required)*: Operation of the Marker API.
- **`correlation_id`** *(object, required)*: Correlation ID information.
- **`internal`** *(integer, required)*: Internal correlation ID.
- **`external`** *(integer, required)*: External correlation ID.
- **`start_timestamp`** *(integer, required)*: Start timestamp.
- **`end_timestamp`** *(integer, required)*: End timestamp.
- **`thread_id`** *(integer, required)*: Thread ID.
- **`memory_copy`** *(array)*: Async memory copy records.
- **Items** *(object)*
- **`size`** *(integer, required)*: Size of the Marker API record.
- **`kind`** *(integer, required)*: Kind of the Marker API.
- **`operation`** *(integer, required)*: Operation of the Marker API.
- **`correlation_id`** *(object, required)*: Correlation ID information.
- **`internal`** *(integer, required)*: Internal correlation ID.
- **`external`** *(integer, required)*: External correlation ID.
- **`start_timestamp`** *(integer, required)*: Start timestamp.
- **`end_timestamp`** *(integer, required)*: End timestamp.
- **`thread_id`** *(integer, required)*: Thread ID.
- **`dst_agent_id`** *(object, required)*: Destination Agent ID.
- **`handle`** *(integer, required)*: Handle of the agent.
- **`src_agent_id`** *(object, required)*: Source Agent ID.
- **`handle`** *(integer, required)*: Handle of the agent.
- **`bytes`** *(integer, required)*: Bytes copied.
@@ -1,32 +0,0 @@
# Samples
## Running Samples
Samples and tool can be run in order to see the profiler in action. This section covers on how to build these samples and run the tool.
Once the rocm build is installed, samples are installed under:
```bash
/opt/rocm/share/rocprofiler-sdk/samples
```
rocprofv3 tool is installed under:
```bash
/opt/rocm/bin
```
### Building Samples
From any directory, run:
```bash
cmake -B build-rocprofiler-sdk-samples /opt/rocm/share/rocprofiler-sdk/samples -DCMAKE_PREFIX_PATH=/opt/rocm
cmake --build build-rocprofiler-sdk-samples --target all --parallel 8
### Running samples
To run the built samples, cd into the `build-rocprofiler-sdk-samples` directory and run:
```bash
ctest -V
@@ -1,22 +1,19 @@
# Tool Library
# Tool library
## Rocprofiler and ROCm Runtimes Design
The tool library utilizes APIs from `rocprofiler-sdk` and `rocprofiler-register` libraries for profiling and tracing HIP applications. This document provides information to help you design a tool by utilizing the `rocprofiler-sdk` and `rocprofiler-register` libraries efficiently. The command-line tool `rocprofv3` is also built on `librocprofiler-sdk-tool.so.0.4.0`, which uses these libraries.
The ROCm runtimes are now designed to directly communicate with a new library called rocprofiler-register during their initialization. This library does cursory checks
for whether any tools have indicated they want rocprofiler support via detection of one or more instances of a symbol named `rocprofiler_configure` (which is provided by
the tool libraries) and/or the `ROCP_TOOL_LIBRARIES` environment variable. This design dramatically improves upon previous designs, which relied solely on
a tool racing to set runtime-specific environment variables (e.g., `HSA_TOOLS_LIB`) before the runtime initialization.
## ROCm runtimes design
## Tool Library Design
The ROCm runtimes are designed to directly communicate with a helper library named `rocprofiler-register` during initialization. This library performs cursory checks to find if a tool requires ROCprofiler-SDK services. This detection is based on the presence of one or more instances of `rocprofiler_configure` in the tool or `ROCP_TOOL_LIBRARIES` environment variable. This design provides drastic improvement over previous designs, which relied solely on a tool racing to set runtime-specific environment variables like `HSA_TOOLS_LIB` before the runtime initialization.
When a tool has `rocprofiler_configure` visible in its symbol table, rocprofiler will invoke this function and provide information regarding
the version of rocprofiler, which invokes the function, how many tools have already been invoked, and a unique identifier for the tool. The tool
returns a pointer to a `rocprofiler_tool_configure_result_t` struct, which, if non-null, can provide rocprofiler with the function it should
call for tool initialization (i.e., the opportunity for context creation), and a function should call when rocprofiler is finalized, and a pointer
to any data that the rocprofiler should provide back to the tool when it calls the initialization and finalization functions.
## Tool library design
Rocprofiler provides a `rocprofiler/registration.h` header file, which forward declares the `rocprofiler_configure` function with the necessary
compiler function attributes to ensure that the symbol is publicly visible.
When ROCprofiler-SDK detects `rocprofiler_configure` in a tool's symbol table, ROCprofiler-SDK invokes `rocprofiler-configure` with parameters such as ROCprofiler-SDK version that invokes the function, number of tools already invoked, and a unique identifier for the tool. The tool returns a pointer to a `rocprofiler_tool_configure_result_t` struct, which, if non-null, provides ROCprofiler-SDK with:
- Function to be called for tool initialization, which is also the opportunity for context creation.
- Function to be called when ROCprofiler-SDK is finalized.
- A pointer to data to be provided to the tool when ROCprofiler-SDK calls the initialization and finalization functions.
ROCprofiler-SDK provides a `rocprofiler-sdk/registration.h` header file, which forward declares the `rocprofiler_configure` function with the necessary compiler function attributes to ensure that the `rocprofiler-configure` symbol is publicly visible.
```cpp
#include <rocprofiler-sdk/registration.h>
@@ -73,15 +70,16 @@ rocprofiler_configure(uint32_t version,
}
```
## Tool Initialization
## Tool initialization
> ***NOTE: rocprofiler does NOT support calls to any runtime function (HSA, HIP, etc.) during tool initialization.***
> ***Invoking any functions from the runtimes will result in a deadlock.***
:::{note}
ROCprofiler-SDK does NOT support calls to any runtime function (HSA, HIP, and so on) during tool initialization.
Invoking any functions from the runtimes results in a deadlock.
:::
For each tool that contains a `rocprofiler_configure` function and returns a non-null pointer to a `rocprofiler_tool_configure_result_t` struct,
rocprofiler will invoke the `initialize` callback after completing the scan for all `rocprofiler_configure` symbols. In other words, rocprofiler
collects all of the `rocprofiler_tool_configure_result_t` instances before invoking the `initialize` member of any of these instances.
When rocprofiler invokes this function in a tool, this is the opportunity to create contexts:
For each tool that contains a `rocprofiler_configure` function and returns a non-null pointer to a `rocprofiler_tool_configure_result_t` struct, ROCprofiler-SDK invokes the `initialize` callback after completing the scan for all `rocprofiler_configure` symbols. In other words, ROCprofiler-SDK
collects all `rocprofiler_tool_configure_result_t` instances before invoking the `initialize` member of any of these instances.
When ROCprofiler-SDK invokes `initialize` function in a tool, this is the opportunity to create contexts:
```cpp
#include <rocprofiler-sdk/rocprofiler.h>
@@ -106,12 +104,12 @@ tool_init(rocprofiler_client_finalize_t fini_func,
}
```
Although not strictly necessary, it is recommended that tools store the context handle(s) to control the data collection of the services associated with the context.
Although not mandatory, it is recommended that tools store the context handles to control the data collection for the services associated with the context.
## Tool Finalization
## Tool finalization
When the user-provided `initialize` callback is invoked, rocprofiler will provide a function pointer of type `rocprofiler_client_finalize_t`.
This function pointer can be invoked by the tool to explicitly invoke the `finalize` callback from the `rocprofiler_tool_configure_result_t` instance:
When the `initialize` callback is invoked in the tool, ROCprofiler-SDK provides a function pointer of type `rocprofiler_client_finalize_t`.
The tool can invoke this function pointer to explicitly invoke the `finalize` callback from the `rocprofiler_tool_configure_result_t` instance:
```cpp
#include <rocprofiler-sdk/rocprofiler.h>
@@ -143,7 +141,7 @@ tool_init(rocprofiler_client_finalize_t fini_func,
}
```
Otherwise, the rocprofiler will invoke the `finalize` callback via an `atexit` handler.
Otherwise, ROCprofiler-SDK invokes the `finalize` callback via an `atexit` handler.
## Agent Information
@@ -0,0 +1,45 @@
.. meta::
:description: Documentation of the installation, configuration, use of the ROCProfiler SDK, and rocprofv3 command-line tool
:keywords: ROCProfiler SDK tool, ROCProfiler SDK library, rocprofv3, ROCm, API, reference
.. _what-is-rocprof-sdk:
==========================
What is ROCprofiler-SDK?
==========================
ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software.
It supports application tracing to provide a big picture of the GPU application execution and kernel profiling to provide low-level hardware details from the performance counters.
The ROCprofiler-SDK library provides runtime-independent APIs for tracing runtime calls and asynchronous activities such as GPU kernel dispatches and memory moves. The tracing includes callback APIs for runtime API tracing and activity APIs for asynchronous activity records logging.
In summary, ROCprofiler-SDK combines `ROCProfiler <https://rocm.docs.amd.com/projects/rocprofiler/en/latest/index.html>`_ and `ROCTracer <https://rocm.docs.amd.com/projects/roctracer/en/latest/index.html>`_.
You can utilize the ROCprofiler-SDK to develop a tool for profiling and tracing HIP applications on ROCm software.
ROCprofiler-SDK is an improved version that enables more efficient implementations and better thread safety while avoiding problems that plague the former implementations of ROCProfiler and ROCTracer.
Here are the distinct ROCprofiler-SDK features:
- Improved tool initialization
- Support for simultaneous use of the same services by multiple tools
- Simplified control of one or more data collection services
- Improved error checking and logging
- Backward ABI compatibility
- PC sampling (beta implementation)
Improvements over ROCProfiler and ROCTracer
----------------------------------------------------
The former implementations allow a tool to access any of the services provided by ROCProfiler or ROCTracer such as API tracing, kernel tracing, etc., by calling ``roctracer_init()`` when a ROCm runtime is initially loaded.
As the calling tool is not required to specify during initialization, the services it needs to use, the libraries must be effectively prepared for any service to be available anytime.
This behavior introduces unnecessary overhead and makes thread-safe data management difficult, as tools generally don't use all the available services.
For example, ROCTracer always installs wrappers around every runtime API and adds indirection overhead through the ROCTracer library to check for the current service configuration in a thread-safe manner.
ROCprofiler-SDK introduces `context` to solve the preceding issues. Contexts are effectively bundles of service configurations. ROCprofiler-SDK provides a single opportunity for a tool to create as many contexts as required.
A tool can group all services into one context, create one context per service, or choose a mix.
This change in the design allows ROCprofiler-SDK to be aware of the services that might be requested by a tool at any given time.
The design change empowers ROCprofiler-SDK to:
- Avoid unnecessary preparation for services that are never used. If no registered contexts request HSA API tracing, no wrappers need to be generated.
- Perform more extensive checks during service specification and inform a tool about potential issues early.
- Allow multiple tools to use certain services simultaneously.
- Improve thread safety without introducing parallel bottlenecks.
- Manage internal data and allocations more efficiently.