[Docs] Update using-pc-sampling (#157)

This commit is contained in:
systems-assistant[bot]
2025-08-21 11:14:16 -04:00
Committed by GitHub
Parent 87b348c51d
Commit c7b9533836
2 changed files with 53 additions and 37 deletions
@@ -42,5 +42,4 @@
# VSCode Workspaces
*.code-workspace
rocprofiler-sdk-build/CMakeCache.txt
/rocprofiler-sdk-build
@@ -63,7 +63,7 @@ To check the firmware versions, use:
.. code-block:: bash
# To check PSP TOS Firmware:
# To check PSP TOS Firmware:
sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info | grep SOS
# To check MEC Firmware:
@@ -81,6 +81,8 @@ This generates two files, ``agent_info.csv`` and ``pc_sampling_host_trap.csv``.
Here are the contents of ``pc_sampling_host_trap.csv`` file generated for MatrixTranspose sample application:
.. _pc_sampling_host_trap:
.. csv-table:: PC sampling host trap
:file: /data/pc_sampling_host_trap.csv
:widths: 20,10,10,10,10,20
@@ -201,50 +203,53 @@ The preceding command generates a JSON file with the comprehensive output. Here
For description of the fields in the JSON output, see :ref:`output-file-fields`.
An Arbitrary Host-Trap PC Sampling Skid
===============================================
Host-trap PC sampling and arbitrary sampling skid
==================================================
Host-Trap PC sampling is a software-based technique that utilizes a background kernel thread
to periodically interrupt running waves in order to capture the program counter (PC).
Host-trap PC sampling is a software-based technique that utilizes a background kernel thread
to periodically interrupt running waves to capture the program counter (PC).
This method is effective for gathering performance data without requiring specialized hardware
to snapshot the waves. However, it has limitations due to the potential delay between
when a wave receives an interrupt and when it processes the interrupt to capture the PC.
This delay can lead to a sampling skid, where the PC samples may be attributed to instructions
to snapshot the waves. However, this method has certain limitations due to the potential delay between
receiving and processing the interrupt by the wave to capture the PC.
This delay can lead to a sampling skid, where the PC samples might be attributed to the instructions
that are up to two instructions away from the actual source of latency.
This results in a non-precise intra-kernel sampling method.
When analyzing an application profile generated by host-trap PC sampling,
developers should consider not only the reported most costly instruction but
it is important to consider not only the costliest reported instruction but
also the instructions immediately preceding or following it.
If the costly instruction is near a branch instruction, it is important
to also consider the instruction targeted by the branch and the one immediately following it.
If the costly instruction is near a branch instruction, it is important
to consider the instruction targeted by the branch and the one immediately following it as well.
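The advice above can be sketched as a small helper that widens the hottest instruction into a window of candidates. This is only an illustration: it assumes the per-instruction sample counts have already been extracted into a list ordered by instruction address, and the name ``skid_candidates`` and the input layout are hypothetical, not part of rocprofv3. It does not follow branch targets, which would require the kernel disassembly.

```python
def skid_candidates(sample_counts, skid=2):
    """Return indices of instructions worth inspecting around the hottest one.

    sample_counts: samples per instruction, ordered by instruction address
                   (hypothetical layout, prepared by the caller).
    skid:          maximum distance, in instructions, between the reported PC
                   and the actual source of latency.
    """
    # Index of the instruction with the most samples attributed to it.
    hottest = max(range(len(sample_counts)), key=sample_counts.__getitem__)
    # Clamp the window to the bounds of the instruction list.
    first = max(0, hottest - skid)
    last = min(len(sample_counts) - 1, hottest + skid)
    return list(range(first, last + 1))


# Made-up counts: the instruction at index 2 is reported as the costliest.
counts = [0, 1, 40, 3, 0, 0, 2]
candidates = skid_candidates(counts)
print(candidates)
```

With a skid of two instructions, the window covers the reported instruction plus up to two neighbors on each side.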
To address the limitations of host-trap sampling, the hardware-based stochastic PC sampling method
has been developed. This method provides precise intra-kernel sampling with zero sampling skid,
offering more accurate performance insights.
It is important to note that the skid issue inherent in host-trap PC sampling will not be resolved
in its current form. Therefore, users are encouraged to adopt stochastic PC sampling,
starting with the GFX942 architecture, to achieve more precise performance profiling.
It is important to note that the skid issue inherent in host-trap PC sampling is not likely to be resolved
in its current form. Therefore, to achieve more precise performance profiling, it is recommended to adopt stochastic PC sampling starting with the gfx942 architecture.
Hardware-Based (Stochastic) PC Sampling Method
.. note::
Host-trap PC sampling is supported on AMD Instinct MI200, MI300, MI325, MI350, and MI355.
Hardware-based (stochastic) PC sampling method
===============================================
The new ``ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC`` has been introduced for gfx942 architecture.
It employs a specific hardware for probing waves actively running on GPU.
Beside information already provided with ``ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP`` useful for determining hot-spots within the kernel,
it delivers additional information that tells whether a sampled wave issued an instruction represented with particular PC.
If not, it provides the reason for not issuing the instruction (stall reason).
This type of information is particularly useful for understanding stalls during the kernel execution.
The ``ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC`` method has been introduced for the gfx942 architecture.
It employs specific hardware to probe waves actively running on the GPU.
Besides the information already provided with ``ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP`` useful for
determining hotspots within the kernel, this method delivers additional information, which helps to determine
whether a sampled wave issued an instruction represented with the specified PC.
If not, this method provides the reason for not issuing the instruction (stall reason).
Such information is particularly useful for understanding stalls during kernel execution.
To use this method on gfx942, we recommend listing available PC sampling configurations to verify if the latest ROCm stack is installed
on the system by running:
To use this method on gfx942, it is recommended to list available PC sampling configurations to verify if the latest ROCm stack is installed on the system using:
.. code-block:: bash
rocprofv3 -L
Output similar to the following indicates that the ``ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC`` method is available:
An output similar to the following indicates that the ``ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC`` method is available:
.. code-block:: bash
@@ -257,22 +262,30 @@ Output similar to the following indicates that the ``ROCPROFILER_PC_SAMPLING_MET
Max_Interval :2147483648
Flags :interval pow2
Please note that on gfx942, `ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC` requires intervals to be specified in cycles, whose values are powers of 2
.. note::
To profile a gfx942 accelerated application with ``ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC`` PC sampling, one can use the following command:
On gfx942, ``ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC`` requires intervals to be specified in cycles with values as powers of 2.
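A quick way to check an interval before passing it to ``--pc-sampling-interval`` is sketched below. The helper names are hypothetical. The default upper bound is the ``Max_Interval`` value from the sample ``rocprofv3 -L`` output above, and the lower bound is not checked because it is not shown in this excerpt.

```python
def is_valid_stochastic_interval(interval, max_interval=2147483648):
    """Check that an interval in cycles is a power of 2 and within the maximum.

    max_interval defaults to the Max_Interval from the sample output; query
    your own system with `rocprofv3 -L` for the actual limits.
    """
    is_pow2 = interval > 0 and (interval & (interval - 1)) == 0
    return is_pow2 and interval <= max_interval


def round_up_to_pow2(interval):
    """Return the smallest power of 2 that is >= interval (interval >= 1)."""
    return 1 << (interval - 1).bit_length()


print(is_valid_stochastic_interval(1048576))
print(round_up_to_pow2(1000000))
```

Here ``1048576`` is 2 to the power of 20, which is the interval used in the profiling command in this section.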
To profile an application with ``ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC`` PC sampling enabled on gfx942, use:
.. code-block:: bash
rocprofv3 --pc-sampling-beta-enabled --pc-sampling-method stochastic --pc-sampling-unit cycles --pc-sampling-interval 1048576 --output-format csv json -- <application_path>
The previous command serializes samples in both CSV and JSON output formats in the ``pc_sampling_stochastic.csv`` and ``out_results.json`` files, respectively.
The preceding command serializes samples in both CSV and JSON output formats in the ``pc_sampling_stochastic.csv`` and ``out_results.json`` files, respectively.
Comparing the ``pc_sampling_stochastic.csv`` to ``pc_sampling_host_trap`` from previous section, one can notice that the ``ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC`` method
generates additional fields:
- ``Wave_Issued_Instruction``: Indicates whether the wave issued an instruction (value 1) represented with particular PC or not (value 0)
- ``Instruction_Type``: If the value of ``Wave_Issued_Instruction`` is 1, this fields indicates the type of the issued instruction. Otherwise, this fields irrelevant.
- ``Stall_Reason``: If the value of ``Wave_Issued_Instruction`` is 0, this fields indicates the reason for not issuing the instruction (stall reason). Otherwise, this field is irrelevant.
- ``Wave_Count``: Total number of waves actively running on a compute unit when the sample was generated.
Comparing :ref:`pc_sampling_stochastic.csv <pc_sampling_stochastic>` with :ref:`pc_sampling_host_trap.csv <pc_sampling_host_trap>` shows that the ``ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC`` method
generates the following additional fields:
- ``Wave_Issued_Instruction``: Indicates whether the wave issued an instruction represented with the specified PC. Value = 1 for yes and 0 for no.
- ``Instruction_Type``: If the value of ``Wave_Issued_Instruction`` is 1, this field indicates the type of the issued instruction. Otherwise, this field remains irrelevant.
- ``Stall_Reason``: If the value of ``Wave_Issued_Instruction`` is 0, this field indicates the reason for not issuing the instruction (stall reason). Otherwise, this field remains irrelevant.
- ``Wave_Count``: Total number of waves actively running on a compute unit when the sample is generated.
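As a rough illustration of how these fields can be used together, the following Python sketch counts issued samples per instruction type and stalled samples per stall reason. The column names come from the list above. The function name, the in-memory CSV rows, and the values ``TYPE_A``, ``REASON_X``, and ``REASON_Y`` are made up for the example and do not reflect actual rocprofv3 output.

```python
import csv
import io
from collections import Counter


def summarize_stochastic_samples(csv_file):
    """Tally issued instruction types and stall reasons from stochastic samples."""
    issued = 0
    instruction_types = Counter()
    stall_reasons = Counter()
    for row in csv.DictReader(csv_file):
        if row["Wave_Issued_Instruction"].strip() == "1":
            # The wave issued the instruction: Instruction_Type is relevant.
            issued += 1
            instruction_types[row["Instruction_Type"].strip()] += 1
        else:
            # The wave did not issue: Stall_Reason is relevant.
            stall_reasons[row["Stall_Reason"].strip()] += 1
    return issued, instruction_types, stall_reasons


# Made-up rows containing only the columns discussed above.
sample_csv = io.StringIO(
    "Wave_Issued_Instruction,Instruction_Type,Stall_Reason,Wave_Count\n"
    "1,TYPE_A,,8\n"
    "0,,REASON_X,8\n"
    "0,,REASON_X,6\n"
    "0,,REASON_Y,7\n"
)
issued, instruction_types, stall_reasons = summarize_stochastic_samples(sample_csv)
print(issued, dict(instruction_types), stall_reasons.most_common())
```

To run this against a real profile, pass a file object for the generated CSV, opened with ``newline=""``, instead of the in-memory sample.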
.. _pc_sampling_stochastic:
.. csv-table:: PC sampling stochastic with debug symbols
:file: /data/pc_sampling_stochastic_debug.csv
@@ -280,7 +293,8 @@ generates additional fields:
:header-rows: 1
Similarly, ``ROCPROFILER_PC_SAMPLING_METHOD_STOCHASTIC`` method delivers additional information to every sample in the JSON output.
The following snippet shows one sample from ``out_results.json`` file.
Here is a sample from the ``out_results.json`` file:
.. code-block:: text
@@ -351,7 +365,10 @@ The following snippet shows one sample from ``out_results.json`` file.
},
Fields starting with ``arb_state_`` are of particular interest as they indicate the state of the arbiter at the time of sampling.
Namely, ``arb_state_issue_`` fields indicate what type of instructions arbiter issued at the time of sampling.
On the other hand, ``arb_state_stall_`` fields indicate what type of instructions were stalled at the time of sampling.
For example, ``arb_state_issue_`` fields indicate the type of instructions issued by the arbiter at the time of sampling.
On the other hand, ``arb_state_stall_`` fields indicate the type of instructions stalled at the time of sampling.
This information is useful for understanding how many instructions per cycle (IPC) are issued.
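A minimal sketch of tallying these fields across samples follows. It assumes each sample has already been loaded and flattened into a dictionary whose ``arb_state_`` values are 0 or 1. The function name and the field suffixes ``foo`` and ``bar`` are hypothetical placeholders, and the nesting of the real ``out_results.json`` is not reproduced here.

```python
from collections import Counter

ISSUE_PREFIX = "arb_state_issue_"
STALL_PREFIX = "arb_state_stall_"


def tally_arb_state(samples):
    """Count how often each arbiter issue/stall field is set across samples."""
    issued = Counter()
    stalled = Counter()
    for sample in samples:
        for key, value in sample.items():
            if key.startswith(ISSUE_PREFIX) and value:
                issued[key[len(ISSUE_PREFIX):]] += 1
            elif key.startswith(STALL_PREFIX) and value:
                stalled[key[len(STALL_PREFIX):]] += 1
    return issued, stalled


# Made-up samples with placeholder field suffixes.
samples = [
    {"arb_state_issue_foo": 1, "arb_state_stall_bar": 0},
    {"arb_state_issue_foo": 0, "arb_state_stall_bar": 1},
    {"arb_state_issue_foo": 1, "arb_state_stall_bar": 1},
]
issued, stalled = tally_arb_state(samples)
print(dict(issued), dict(stalled))
```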
.. note::
The stochastic PC sampling is supported on AMD Instinct MI300, MI325, MI350, and MI355.