.. meta::
    :description: Documentation of the usage of pc-sampling with rocprofv3 command-line tool
    :keywords: Sampling PC, Sampling program counter, rocprofv3, rocprofv3 tool usage, Using rocprofv3, ROCprofiler-SDK command line tool, PC sampling

.. _using-pc-sampling:

==================
Using PC sampling
==================
PC (program counter) sampling is a GPU profiling technique that periodically samples the program counter during kernel execution. It helps you understand code execution patterns and identify hotspots.
Here are the benefits of using PC sampling:
- Identify performance bottlenecks
- Understand kernel execution behavior
- Analyze code coverage
- Find heavily executed code paths
To try out PC sampling, use the ``rocprofv3`` command-line tool or the ROCprofiler-SDK library on ROCm 6.4 or later.

.. note::

    PC sampling is supported only on AMD GPUs with architectures gfx90a and later.

PC sampling availability and configuration
===========================================
To check if the GPU supports PC sampling, use:

.. code-block:: bash

    rocprofv3 -L

Or

.. code-block:: bash

    rocprofv3 --list-avail

The output shows whether ``rocprofv3`` supports PC sampling on the GPU and lists the supported configurations.

.. code-block:: bash

    List available PC Sample Configurations for node_id 11
    Method: ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP
    Unit: ROCPROFILER_PC_SAMPLING_UNIT_TIME
    Minimum_Interval: 1
    Maximum_Interval: 18446744073709551615

The preceding output shows that the GPU supports PC sampling with the ``ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP`` method and the ``ROCPROFILER_PC_SAMPLING_UNIT_TIME`` unit. The minimum and maximum intervals are also displayed.
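
If you want to check the reported configuration from a script, the listing can be parsed with a few lines of Python. This is a minimal sketch that assumes the text layout shown above (one ``Key: value`` pair per line, with ``Method`` starting each configuration). The function name ``parse_pc_sampling_configs`` is hypothetical and is not part of ROCprofiler-SDK.

```python
import re


def parse_pc_sampling_configs(listing):
    """Collect PC sampling configurations from the text printed by `rocprofv3 -L`.

    Assumes the layout shown in this document; other lines are ignored.
    """
    wanted = ("Unit", "Minimum_Interval", "Maximum_Interval")
    configs = []
    current = None
    for line in listing.splitlines():
        match = re.match(r"\s*(\w+):\s*(\S+)\s*$", line)
        if not match:
            continue
        key, value = match.groups()
        if key == "Method":
            # A "Method" line starts a new configuration entry.
            current = {"Method": value}
            configs.append(current)
        elif current is not None and key in wanted:
            current[key] = int(value) if value.isdigit() else value
    return configs


# Listing copied from the example output above.
sample = """\
List available PC Sample Configurations for node_id 11
Method: ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP
Unit: ROCPROFILER_PC_SAMPLING_UNIT_TIME
Minimum_Interval: 1
Maximum_Interval: 18446744073709551615
"""

configs = parse_pc_sampling_configs(sample)
for config in configs:
    print(config)
```

The parsed minimum and maximum can then be used to sanity-check the value passed to ``--pc-sampling-interval`` before launching a run.
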
Based on the preceding configuration, you can use the following command to profile the application using PC sampling:

.. code-block:: bash

    rocprofv3 --pc-sampling-beta-enabled --pc-sampling-method host_trap --pc-sampling-unit time --pc-sampling-interval 1 -- <application_path>

The preceding command enables PC sampling with the ``host_trap`` method, ``time`` unit, and an interval of ``1`` μs (microsecond). Replace ``<application_path>`` with the path to the application you want to profile.
This generates two files, ``agent_info.csv`` and ``pc_sampling_host_trap.csv``, each prefixed with the process ID.
Here are the contents of the ``pc_sampling_host_trap.csv`` file generated for the MatrixTranspose sample application:

.. csv-table:: PC sampling host trap
    :file: /data/pc_sampling_host_trap.csv
    :widths: 20,10,10,10,10,20
    :header-rows: 1

For a description of the fields in the output file, see :ref:`pc-sampling-fields`.
If the ``Instruction_Comment`` field in the output file is empty, compile your application with debug symbols to populate it.
With debug symbols enabled, each sampled instruction maps back to its source line, which helps in understanding code execution patterns and hotspots.

.. csv-table:: PC sampling host trap with debug symbols
    :file: /data/pc_sampling_host_trap_debug.csv
    :widths: 20,10,10,10,10,20
    :header-rows: 1

The preceding output shows the ``Instruction_Comment`` field populated with the source-line information.
.. _pc-sampling-fields:
PC sampling fields
===================
Here are the fields in the output file generated by PC sampling:
- ``Sample_Timestamp``: Timestamp when the sample was generated
- ``Exec_Mask``: Active SIMD lanes when the sample was taken
- ``Dispatch_Id``: Originating kernel dispatch ID
- ``Instruction``: Assembly instruction such as ``s_load_dword s8, s[1:2], 0x10``
- ``Instruction_Comment``: Instruction comment that maps back to the source line if debug symbols were enabled when the application was compiled
- ``Correlation_Id``: ID of the API launch call that matches the dispatch ID
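
To find hotspots, one option is to count how many samples landed on each instruction. The following sketch does this with Python's ``csv`` module. It assumes that the CSV header uses the field names listed above. The rows in ``SAMPLE_CSV`` are made-up illustrative data rather than real profiler output, and ``top_instructions`` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# Illustrative rows only; timestamps and instructions are placeholders.
SAMPLE_CSV = '''Sample_Timestamp,Exec_Mask,Dispatch_Id,Instruction,Instruction_Comment,Correlation_Id
100,18446744073709551615,1,"s_load_dword s8, s[1:2], 0x10","",1
110,18446744073709551615,1,"s_waitcnt lgkmcnt(0)","",1
120,18446744073709551615,1,"s_waitcnt lgkmcnt(0)","",1
'''


def top_instructions(csv_text, limit=5):
    """Return the most frequently sampled (instruction, comment) pairs."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[(row["Instruction"], row["Instruction_Comment"])] += 1
    return counts.most_common(limit)


for (instruction, comment), hits in top_instructions(SAMPLE_CSV):
    print(hits, instruction, comment)
```

For a real run, read the text from the generated ``pc_sampling_host_trap.csv`` file instead of ``SAMPLE_CSV``. Instructions with the highest sample counts are the likeliest hotspots.
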
By default, the output file is in CSV format. To dump samples in a more comprehensive format, use JSON through ``--output-format json``:

.. code-block:: bash

    rocprofv3 --pc-sampling-beta-enabled --pc-sampling-method host_trap --pc-sampling-unit time --pc-sampling-interval 1 --output-format json -- <application_path>

The preceding command generates a JSON file with the comprehensive output. Here is a trimmed-down output with multiple records:

.. code-block:: text

    {
      "pc_sample_host_trap": [
        {
          "record": {
            "hw_id": {
              "chiplet": 0,
              "wave_id": 0,
              "simd_id": 2,
              "pipe_id": 0,
              "cu_or_wgp_id": 1,
              "shader_array_id": 0,
              "shader_engine_id": 2,
              "workgroup_id": 0,
              "vm_id": 3,
              "queue_id": 2,
              "microengine_id": 1
            },
            "pc": {
              "code_object_id": 1,
              "code_object_offset": 20228
            },
            "exec_mask": 18446744073709551615,
            "timestamp": 51040126667689,
            "dispatch_id": 1,
            "corr_id": {
              "internal": 1,
              "external": 0
            },
            "wrkgrp_id": {
              "x": 182,
              "y": 0,
              "z": 0
            },
            "wave_in_grp": 1
          },
          "inst_index": 0
        },
        {
          "record": {
            "hw_id": {
              "chiplet": 0,
              "wave_id": 0,
              "simd_id": 2,
              "pipe_id": 0,
              "cu_or_wgp_id": 0,
              "shader_array_id": 0,
              "shader_engine_id": 2,
              "workgroup_id": 0,
              "vm_id": 3,
              "queue_id": 2,
              "microengine_id": 1
            },
            "pc": {
              "code_object_id": 1,
              "code_object_offset": 20236
            },
            "exec_mask": 18446744073709551615,
            "timestamp": 51040126667689,
            "dispatch_id": 1,
            "corr_id": {
              "internal": 1,
              "external": 0
            },
            "wrkgrp_id": {
              "x": 158,
              "y": 0,
              "z": 0
            },
            "wave_in_grp": 2
          },
          "inst_index": 1
        }
      ]
    }

For a description of the fields in the JSON output, see :ref:`output-file-fields`.
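
The JSON records can be post-processed in the same spirit. The sketch below counts samples per program counter and converts ``exec_mask`` into a number of active lanes. It assumes the trimmed-down structure shown above, that is, a ``pc_sample_host_trap`` list that is directly accessible. A full output file may nest this list inside other objects. ``SAMPLE_JSON`` keeps only the fields the sketch needs, and the helper names are hypothetical.

```python
import json
from collections import Counter

# Two records reduced to the fields used below; values copied from the sample above.
SAMPLE_JSON = """
{
  "pc_sample_host_trap": [
    {"record": {"pc": {"code_object_id": 1, "code_object_offset": 20228},
                "exec_mask": 18446744073709551615}, "inst_index": 0},
    {"record": {"pc": {"code_object_id": 1, "code_object_offset": 20236},
                "exec_mask": 18446744073709551615}, "inst_index": 1}
  ]
}
"""


def active_lanes(exec_mask):
    """Count the set bits of the execution mask (one bit per active lane)."""
    return bin(exec_mask).count("1")


def samples_per_pc(samples):
    """Count samples per (code_object_id, code_object_offset) pair."""
    counts = Counter()
    for sample in samples:
        pc = sample["record"]["pc"]
        counts[(pc["code_object_id"], pc["code_object_offset"])] += 1
    return counts


samples = json.loads(SAMPLE_JSON)["pc_sample_host_trap"]
counts = samples_per_pc(samples)
for (code_object_id, offset), hits in counts.items():
    print(code_object_id, offset, hits)
print(active_lanes(samples[0]["record"]["exec_mask"]))
```
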
Hardware-based (stochastic) PC sampling method
===============================================
The ``ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC`` method has been introduced for the gfx942 architecture.
It uses dedicated hardware to probe waves actively running on the GPU.
Besides the information already provided by ``ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP``, which is useful for determining hotspots within a kernel,
it reports whether the sampled wave issued the instruction at the sampled PC.
If the instruction was not issued, it provides the reason (the stall reason).
This information is particularly useful for understanding stalls during kernel execution.
To use this method on gfx942, we recommend listing the available PC sampling configurations to verify that the latest ROCm stack is installed
on the system:

.. code-block:: bash

    rocprofv3 -L

Output similar to the following indicates that the ``ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC`` method is available:

.. code-block:: bash

    Method: ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC
    Unit: ROCPROFILER_PC_SAMPLING_UNIT_CYCLES
    Minimum_Interval: 256
    Maximum_Interval: 2147483648

Note that on gfx942, ``ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC`` requires the interval to be specified in cycles, and its value must be a power of 2.
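
A candidate interval can be checked against these constraints before launching a run. The following sketch uses the minimum and maximum from the example listing above as defaults. The values reported on your system may differ, and ``is_valid_stochastic_interval`` is a hypothetical helper name.

```python
def is_valid_stochastic_interval(interval, minimum=256, maximum=2147483648):
    """Check that the interval is a power of 2 within the reported range.

    The default bounds are taken from the example listing in this document.
    """
    # A positive integer is a power of 2 when exactly one bit is set.
    is_power_of_two = interval > 0 and (interval & (interval - 1)) == 0
    return is_power_of_two and minimum <= interval <= maximum


for candidate in (128, 1000, 1048576):
    print(candidate, is_valid_stochastic_interval(candidate))
```

Here ``128`` is rejected because it is below the reported minimum, ``1000`` because it is not a power of 2, and ``1048576`` (2 to the power of 20) is accepted.
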
To profile a gfx942-accelerated application with ``ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC`` PC sampling, use the following command:

.. code-block:: bash

    rocprofv3 --pc-sampling-beta-enabled --pc-sampling-method stochastic --pc-sampling-unit cycles --pc-sampling-interval 1048576 --output-format csv json -- <application_path>

The preceding command serializes samples in both CSV and JSON formats, in the ``pc_sampling_stochastic.csv`` and ``out_results.json`` files, respectively.
Compared with ``pc_sampling_host_trap.csv`` from the previous section, ``pc_sampling_stochastic.csv`` contains the following additional fields
generated by the ``ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC`` method:
- ``Wave_Issued_Instruction``: Indicates whether the wave issued the instruction at the sampled PC (value 1) or not (value 0).
- ``Instruction_Type``: If the value of ``Wave_Issued_Instruction`` is 1, this field indicates the type of the issued instruction. Otherwise, this field is irrelevant.
- ``Stall_Reason``: If the value of ``Wave_Issued_Instruction`` is 0, this field indicates the reason for not issuing the instruction (stall reason). Otherwise, this field is irrelevant.
- ``Wave_Count``: Total number of waves actively running on a compute unit when the sample was generated.
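
These fields can be combined into a quick summary of how often sampled waves issued an instruction and why they stalled otherwise. The sketch below assumes that the CSV header uses the field names listed above and that ``Wave_Issued_Instruction`` is written as ``0`` or ``1``. ``SAMPLE_CSV`` holds made-up rows with only the relevant columns, ``EXAMPLE_HYPOTHETICAL_REASON`` is a placeholder rather than a real stall reason, and the exact strings written to the CSV file may differ from the enum names used here.

```python
import csv
import io
from collections import Counter

# Illustrative rows only; a real file contains more columns and other values.
SAMPLE_CSV = '''Wave_Issued_Instruction,Instruction_Type,Stall_Reason,Wave_Count
1,ROCPROFILER_PC_SAMPLING_INSTRUCTION_TYPE_VALU,,6
0,,ROCPROFILER_PC_SAMPLING_INSTRUCTION_NOT_ISSUED_REASON_OTHER_WAIT,6
0,,ROCPROFILER_PC_SAMPLING_INSTRUCTION_NOT_ISSUED_REASON_OTHER_WAIT,5
0,,EXAMPLE_HYPOTHETICAL_REASON,4
'''


def summarize_issue_and_stalls(csv_text):
    """Return (fraction of samples that issued, Counter of stall reasons)."""
    issued = 0
    total = 0
    stalls = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        total += 1
        if row["Wave_Issued_Instruction"] == "1":
            issued += 1
        else:
            # Stall_Reason is only meaningful when no instruction was issued.
            stalls[row["Stall_Reason"]] += 1
    issue_ratio = issued / total if total else 0.0
    return issue_ratio, stalls


issue_ratio, stalls = summarize_issue_and_stalls(SAMPLE_CSV)
print(issue_ratio)
for reason, hits in stalls.most_common():
    print(hits, reason)
```
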

.. csv-table:: PC sampling stochastic with debug symbols
    :file: /data/pc_sampling_stochastic_debug.csv
    :widths: 20,10,10,10,10,20,10,20,20,10
    :header-rows: 1

Similarly, the ``ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC`` method adds information to every sample in the JSON output.
The following snippet shows one sample from the ``out_results.json`` file:

.. code-block:: text

    {
      "record": {
        "flags": {
          "has_mem_cnt": 0
        },
        "hw_id": {
          "chiplet": 4,
          "wave_id": 0,
          "simd_id": 2,
          "pipe_id": 3,
          "cu_or_wgp_id": 1,
          "shader_array_id": 0,
          "shader_engine_id": 3,
          "workgroup_id": 0,
          "vm_id": 3,
          "queue_id": 2,
          "microengine_id": 1
        },
        "pc": {
          "code_object_id": 2,
          "code_object_offset": 13880
        },
        "exec_mask": 18446744073709551615,
        "timestamp": 390705261924637,
        "dispatch_id": 29,
        "corr_id": {
          "internal": 29,
          "external": 0
        },
        "wrkgrp_id": {
          "x": 9,
          "y": 489,
          "z": 0
        },
        "wave_in_grp": 0,
        "wave_issued": 1,
        "inst_type": "ROCPROFILER_PC_SAMPLING_INSTRUCTION_TYPE_VALU",
        "wave_cnt": 6,
        "snapshot": {
          "stall_reason": "ROCPROFILER_PC_SAMPLING_INSTRUCTION_NOT_ISSUED_REASON_OTHER_WAIT",
          "dual_issue_valu": 0,
          "arb_state_issue_valu": 1,
          "arb_state_issue_matrix": 0,
          "arb_state_issue_lds": 0,
          "arb_state_issue_lds_direct": 0,
          "arb_state_issue_scalar": 0,
          "arb_state_issue_vmem_tex": 0,
          "arb_state_issue_flat": 0,
          "arb_state_issue_exp": 0,
          "arb_state_issue_misc": 0,
          "arb_state_issue_brmsg": 0,
          "arb_state_stall_valu": 0,
          "arb_state_stall_matrix": 0,
          "arb_state_stall_lds": 0,
          "arb_state_stall_lds_direct": 0,
          "arb_state_stall_scalar": 0,
          "arb_state_stall_vmem_tex": 0,
          "arb_state_stall_flat": 0,
          "arb_state_stall_exp": 0,
          "arb_state_stall_misc": 0,
          "arb_state_stall_brmsg": 0
        }
      },
      "inst_index": 1
    },

Fields starting with ``arb_state_`` are of particular interest because they capture the state of the arbiter at the time of sampling.
The ``arb_state_issue_`` fields indicate which types of instructions the arbiter issued at the time of sampling.
The ``arb_state_stall_`` fields indicate which types of instructions were stalled at the time of sampling.
This information is useful for understanding how many instructions per cycle (IPC) are issued.
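
One way to summarize the arbiter state is to count, across samples, how often each instruction type was reported as issued or stalled. The sketch below assumes that samples have the structure shown in the preceding snippet and that a value of ``1`` marks the state as set. The sample is reduced to a few ``snapshot`` fields, and ``arbiter_summary`` is a hypothetical helper name.

```python
from collections import Counter

ISSUE_PREFIX = "arb_state_issue_"
STALL_PREFIX = "arb_state_stall_"

# One sample reduced to a few snapshot fields from the snippet above.
samples = [
    {
        "record": {
            "snapshot": {
                "dual_issue_valu": 0,
                "arb_state_issue_valu": 1,
                "arb_state_issue_lds_direct": 0,
                "arb_state_stall_valu": 0,
                "arb_state_stall_lds_direct": 0,
            }
        },
        "inst_index": 1,
    }
]


def arbiter_summary(samples):
    """Count set arb_state_issue_* and arb_state_stall_* flags per instruction type."""
    issued = Counter()
    stalled = Counter()
    for sample in samples:
        snapshot = sample["record"].get("snapshot", {})
        for key, value in snapshot.items():
            if key.startswith(ISSUE_PREFIX) and value:
                issued[key[len(ISSUE_PREFIX):]] += 1
            elif key.startswith(STALL_PREFIX) and value:
                stalled[key[len(STALL_PREFIX):]] += 1
    return issued, stalled


issued, stalled = arbiter_summary(samples)
print(dict(issued))
print(dict(stalled))
```
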