[Host-Trap PC Sampling] Host-Trap PC sampling can introduce an arbitrary sampling skid of [0, 2] instructions (#515)

* Arbitrary host-trap sampling skid (doc)

Host-trap PC sampling might introduce a skid of [0, 2]
instructions. We document this and provide some advice
to application developers on how to find
hot-spots in the profiles generated by host-trap sampling.
Indic, Vladimir
2025-07-17 17:59:46 +02:00
committed by GitHub
parent 5c45c77ec7
commit 650d35bdaa
+26
@@ -201,6 +201,31 @@ The preceding command generates a JSON file with the comprehensive output. Here
For description of the fields in the JSON output, see :ref:`output-file-fields`.
An Arbitrary Host-Trap PC Sampling Skid
===============================================
Host-Trap PC sampling is a software-based technique that uses a background kernel thread
to periodically interrupt running waves in order to capture the program counter (PC).
This method is effective for gathering performance data without requiring specialized hardware
to snapshot the waves. However, it has limitations due to the potential delay between
when a wave receives an interrupt and when it processes the interrupt to capture the PC.
This delay can lead to a sampling skid, where the PC samples may be attributed to instructions
that are up to two instructions away from the actual source of latency.
This makes host-trap sampling an imprecise intra-kernel sampling method.
When analyzing an application profile generated by host-trap PC sampling,
developers should consider not only the instruction reported as most costly but
also the instructions immediately preceding or following it.
If the costly instruction is near a branch instruction, it is important
to also consider the instruction targeted by the branch and the one immediately following it.
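The advice above can be sketched as a small post-processing step. This is only an illustrative sketch, not part of the profiler: the ``Instr`` record, the ``hotspot_candidates`` helper, and the idea of indexing instructions by their position in the disassembly are all assumptions made for the example. It takes the most-sampled instruction, widens the search to the two instructions on either side of it (the maximum skid described above), and, if a branch falls inside that window, adds the branch target and the instruction that follows the target.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_SKID = 2  # from the text: a sample may land up to two instructions away


@dataclass
class Instr:
    """Hypothetical per-instruction record built from a host-trap profile."""
    text: str                            # disassembly text, for display only
    samples: int                         # samples attributed to this instruction
    branch_target: Optional[int] = None  # index of the target if this is a branch


def hotspot_candidates(instrs: List[Instr]) -> List[int]:
    """Return indices worth inspecting around the most-sampled instruction."""
    if not instrs:
        return []
    # The instruction the profile reports as most costly (first one on ties).
    hot = max(range(len(instrs)), key=lambda i: instrs[i].samples)
    lo = max(0, hot - MAX_SKID)
    hi = min(len(instrs) - 1, hot + MAX_SKID)
    candidates = set(range(lo, hi + 1))
    # A branch inside the window means the real culprit may sit at the
    # branch target or at the instruction right after the target.
    for i in range(lo, hi + 1):
        target = instrs[i].branch_target
        if target is not None:
            candidates.add(target)
            if target + 1 < len(instrs):
                candidates.add(target + 1)
    return sorted(candidates)
```

The window is symmetric because the text asks developers to look at instructions both before and after the reported one. A tool that knows the direction of the skid could narrow it.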
To address the limitations of host-trap sampling, the hardware-based stochastic PC sampling method
has been developed. This method provides precise intra-kernel sampling with zero sampling skid,
offering more accurate performance insights.
It is important to note that the skid issue inherent in host-trap PC sampling will not be resolved
in its current form. Therefore, users are encouraged to adopt stochastic PC sampling,
available starting with the GFX942 architecture, to achieve more precise performance profiling.
Hardware-Based (Stochastic) PC Sampling Method
===============================================
@@ -329,3 +354,4 @@ Fields starting with ``arb_state_`` are of particular interest as they indicate
Namely, ``arb_state_issue_`` fields indicate what type of instructions the arbiter issued at the time of sampling.
On the other hand, ``arb_state_stall_`` fields indicate what type of instructions were stalled at the time of sampling.
This information is useful for understanding how many instructions per cycle (IPC) are issued.