Adding pc sampling how to guide (#160)
* Adding pc sampling how to guide
* doc update
* Fixing indentation
* updating index
* udpating doc
* updating doc
* Added field information
* Fixing Formatting
* fix formatting error
* Added json format for pc sampling
* feedback resolved
* formatting for text
* PC Sampling API doc
* Reformatted
* Note for shared systems
* update docs
* correcting relative path for cross-referencing

---------

Co-authored-by: vlaindic_amdeng <vladimir.indic@amd.com>

@@ -14,6 +14,7 @@ subtrees:
    - file: how-to/using-rocprofv3
    - file: how-to/using-rocprofiler-sdk-roctx
    - file: how-to/samples
    - file: how-to/using-pc-sampling
    - file: how-to/using-rocprofv3-with-mpi
  - caption: API reference
    entries:

@@ -9,13 +9,161 @@ myst:

Program Counter (PC) sampling is a profiling method that statistically approximates kernel execution by sampling GPU program counters. The method periodically chooses an active wave in a round-robin manner and snapshots its PC. This process takes place on every compute unit simultaneously, making it device-wide PC sampling. The outcome is a histogram of samples that shows how many times each kernel instruction was sampled.
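
To illustrate the histogram mentioned above, the following minimal sketch counts how often each instruction appears in a list of samples. It is self-contained and does not use the ROCprofiler-SDK API: `build_histogram` is a hypothetical helper, and its input stands in for samples that have already been decoded into instruction strings.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper (not part of ROCprofiler-SDK): counts how many
// times each decoded instruction appears among the collected samples.
std::map<std::string, std::size_t>
build_histogram(const std::vector<std::string>& sampled_instructions)
{
    std::map<std::string, std::size_t> histogram;
    for(const auto& inst : sampled_instructions)
        ++histogram[inst];  // operator[] value-initializes a missing key to 0
    return histogram;
}
```

For the samples `{"s_endpgm", "s_waitcnt vmcnt(0)", "s_endpgm"}`, the resulting histogram maps `s_endpgm` to 2 and `s_waitcnt vmcnt(0)` to 1. Instructions with high counts are the ones the sampler observed most often.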

> **Warning:**
> Risk acknowledgment: The PC sampling feature is under development and might not be completely stable. Use this beta feature cautiously. It may affect your system's stability and performance. Proceed at your own risk.
>
> By activating this feature through the `ROCPROFILER_PC_SAMPLING_BETA_ENABLED` environment variable, you acknowledge and accept the following potential risks:
>
> - Hardware freeze: This beta feature could cause your hardware to freeze unexpectedly.
> - Need for cold restart: In the event of a hardware freeze, you might need to perform a cold restart (turning the hardware off and on) to restore normal operations.

## ROCprofiler-SDK PC Sampling Service

This section describes how to use the ROCprofiler-SDK PC sampling API to configure and use the PC sampling service. For a fully functional example, see [Samples](https://github.com/ROCm/rocprofiler-sdk/tree/amd-mainline/samples).

### tool_init() Setup

Because the PC sampling service belongs to the group of [buffered services](buffered_services.md), it requires a buffer and a context to be set up in this phase.

```cpp
rocprofiler_context_id_t ctx{0};
rocprofiler_buffer_id_t  buff;

// Create the context that the PC sampling service will be attached to.
ROCPROFILER_CALL(rocprofiler_create_context(&ctx), "context creation failed");

// Create the buffer that receives the PC samples.
ROCPROFILER_CALL(rocprofiler_create_buffer(ctx,
                                           8192,  // buffer size in bytes
                                           2048,  // watermark in bytes
                                           ROCPROFILER_BUFFER_POLICY_LOSSLESS,
                                           pc_sampling_callback,  // Callback to process PC samples
                                           user_data,
                                           &buff),
                 "buffer creation failed");
```

For more details about buffer creation, see the [buffered services section](buffered_services.md).

The PC sampling service is tied to a GPU agent. To get the list of available agents, use `rocprofiler_query_available_agents`, as the following snippet outlines.

```cpp
std::vector<rocprofiler_agent_v0_t> agents;

// Callback used by rocprofiler_query_available_agents to return
// agents on the device. This can include CPU agents as well. We
// select GPU agents only (i.e. type == ROCPROFILER_AGENT_TYPE_GPU)
rocprofiler_query_available_agents_cb_t iterate_cb = [](rocprofiler_agent_version_t agents_ver,
                                                        const void**                agents_arr,
                                                        size_t                      num_agents,
                                                        void*                       udata) {
    if(agents_ver != ROCPROFILER_AGENT_INFO_VERSION_0)
        throw std::runtime_error{"unexpected rocprofiler agent version"};
    auto* agents_v = static_cast<std::vector<rocprofiler_agent_v0_t>*>(udata);
    for(size_t i = 0; i < num_agents; ++i)
    {
        const auto* agent = static_cast<const rocprofiler_agent_v0_t*>(agents_arr[i]);
        if(agent->type == ROCPROFILER_AGENT_TYPE_GPU) agents_v->emplace_back(*agent);
    }
    return ROCPROFILER_STATUS_SUCCESS;
};

// Query the agents, only a single callback is made that contains a vector
// of all agents.
ROCPROFILER_CALL(
    rocprofiler_query_available_agents(ROCPROFILER_AGENT_INFO_VERSION_0,
                                       iterate_cb,
                                       sizeof(rocprofiler_agent_t),
                                       const_cast<void*>(static_cast<const void*>(&agents))),
    "query available agents");
```

Only recent GPU architectures support this feature. To determine whether an agent with `agent_id` supports PC sampling and which configurations (`rocprofiler_pc_sampling_configuration_t`) are available, use `rocprofiler_query_pc_sampling_agent_configurations`.

```cpp
using avail_configs_vec_t = std::vector<rocprofiler_pc_sampling_configuration_t>;

avail_configs_vec_t available_configurations;

// Callback that copies every reported configuration into the vector
// passed through user_data.
auto cb = [](const rocprofiler_pc_sampling_configuration_t* configs,
             size_t                                         num_config,
             void*                                          user_data) {
    auto* avail_configs = static_cast<avail_configs_vec_t*>(user_data);
    for(size_t i = 0; i < num_config; i++)
    {
        avail_configs->emplace_back(configs[i]);
    }
    return ROCPROFILER_STATUS_SUCCESS;
};

auto status = rocprofiler_query_pc_sampling_agent_configurations(
    agent_id, cb, &available_configurations);
```

Assuming `available_configurations` contains a single element:

```cpp
rocprofiler_pc_sampling_configuration_t {
    .method = ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP,
    .unit = ROCPROFILER_PC_SAMPLING_UNIT_TIME,
    .min_interval = 1,
    .max_interval = 10000
};
```

you can configure the PC sampling service on the agent with `agent_id` to generate samples every 1000 microseconds as follows:

```cpp
// Pick the only available configuration.
const auto* picked_cfg = &available_configurations.front();

auto status = rocprofiler_configure_pc_sampling_service(ctx,
                                                        agent_id,
                                                        picked_cfg->method,
                                                        picked_cfg->unit,
                                                        1000,  // 1000 us
                                                        buff,
                                                        0);
if(status == ROCPROFILER_STATUS_SUCCESS)
{
    // PC Sampling service has been configured successfully.
}
else
{
    // code for error handling
}
```
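
The `min_interval` and `max_interval` fields suggest that the requested interval should fall within the reported range, so it can be useful to check it before configuring the service. The sketch below shows one way to do that. It makes several assumptions: `sampling_config` is a hypothetical stand-in that mirrors only the two interval fields so that the snippet compiles without the SDK headers, both bounds are treated as inclusive, and `interval_is_supported` and `clamp_interval` are illustrative names rather than SDK functions.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for the interval fields of
// rocprofiler_pc_sampling_configuration_t (assumption: bounds are inclusive).
struct sampling_config
{
    std::uint64_t min_interval;
    std::uint64_t max_interval;
};

// Returns true if the requested interval lies within the reported range.
bool
interval_is_supported(const sampling_config& cfg, std::uint64_t interval)
{
    return interval >= cfg.min_interval && interval <= cfg.max_interval;
}

// Moves the requested interval to the nearest value inside the reported range.
std::uint64_t
clamp_interval(const sampling_config& cfg, std::uint64_t interval)
{
    return std::max(cfg.min_interval, std::min(interval, cfg.max_interval));
}
```

With the example configuration above (`min_interval = 1`, `max_interval = 10000`), an interval of 1000 is accepted, while a request for 50000 would be clamped to 10000.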

> **Note**
>
> Multiple processes can share the same GPU agent simultaneously, so the following ABA problem is possible on shared systems. Process A queries the available configurations and decides to configure the service with configuration CA. Before it does so, process B finishes configuring the service with configuration CB, so process A's attempt fails. In that case, process A should repeat the query to observe configuration CB and reuse it to configure the PC sampling service. For more technical details, see the [Samples](https://github.com/ROCm/rocprofiler-sdk/tree/amd-mainline/samples).
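
The query-and-retry advice in the note can be sketched as a small loop. This is only an illustration under stated assumptions, not the approach used in the SDK samples: `chosen_config`, `configure_with_retry`, the two callables, and the retry limit are all hypothetical. In real code, `query` would wrap `rocprofiler_query_pc_sampling_agent_configurations` and `configure` would wrap `rocprofiler_configure_pc_sampling_service`.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical stand-in for one selected PC sampling configuration.
struct chosen_config
{
    int           method;
    int           unit;
    std::uint64_t interval;
};

// Sketch of the query-and-retry pattern.
// `query` fills `out` with the configuration to use and returns false if
// the agent reports no configuration; `configure` returns true on success.
bool
configure_with_retry(const std::function<bool(chosen_config&)>&       query,
                     const std::function<bool(const chosen_config&)>& configure,
                     int                                              max_attempts)
{
    for(int attempt = 0; attempt < max_attempts; ++attempt)
    {
        // Re-query on every attempt so that a configuration installed by
        // another process in the meantime is observed and reused.
        chosen_config cfg{};
        if(!query(cfg)) return false;  // PC sampling unavailable on this agent
        if(configure(cfg)) return true;
    }
    return false;  // gave up after max_attempts
}
```

Bounding the number of attempts keeps the tool from spinning indefinitely if another process keeps changing the configuration.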

### Processing PC Samples (`pc_sampling_callback`)

The PC sampling service delivers samples asynchronously via a dedicated callback. The following code outlines how to iterate over the samples.

```cpp
void
pc_sampling_callback(rocprofiler_context_id_t      ctx,
                     rocprofiler_buffer_id_t       buff,
                     rocprofiler_record_header_t** headers,
                     size_t                        num_headers,
                     void*                         data,
                     uint64_t                      drop_count)
{
    for(size_t i = 0; i < num_headers; i++)
    {
        auto* cur_header = headers[i];

        if(cur_header->category == ROCPROFILER_BUFFER_CATEGORY_PC_SAMPLING)
        {
            if(cur_header->kind == ROCPROFILER_PC_SAMPLING_RECORD_HOST_TRAP_V0_SAMPLE)
            {
                auto* pc_sample = static_cast<rocprofiler_pc_sampling_record_host_trap_v0_t*>(
                    cur_header->payload);

                // Processing a single sample...
            }
            else
            {
                // ...
            }
        }
    }
}
```

For more information about the data that makes up a single sample, see [pc_sampling.h](https://github.com/ROCm/rocprofiler-sdk/blob/amd-mainline/source/include/rocprofiler-sdk/pc_sampling.h).

Note that you can also flush the buffer synchronously via `rocprofiler_buffer_flush`, which triggers `pc_sampling_callback`.

@@ -0,0 +1,80 @@
"Sample_Timestamp","Exec_Mask","Dispatch_Id","Instruction","Instruction_Comment","Correlation_Id"
3464444413017201,65535,1,"s_endpgm","",1
3464444413017201,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413018481,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413018481,65535,1,"s_endpgm","",1
3464444413018481,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413018481,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413018481,65535,1,"s_endpgm","",1
3464444413018481,65535,1,"s_endpgm","",1
3464444413019601,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413019761,65535,1,"s_load_dword s8, s[4:5], 0x24","",1
3464444413019761,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413019761,65535,1,"s_endpgm","",1
3464444413019761,65535,1,"s_load_dword s8, s[4:5], 0x24","",1
3464444413019761,65535,1,"s_endpgm","",1
3464444413019761,65535,1,"s_endpgm","",1
3464444413020881,65535,1,"s_endpgm","",1
3464444413020881,65535,1,"s_endpgm","",1
3464444413020881,65535,1,"s_endpgm","",1
3464444413020881,65535,1,"s_waitcnt lgkmcnt(0)","",1
3464444413020881,65535,1,"v_addc_co_u32_e32 v5, vcc, v1, v5, vcc","",1
3464444413020881,65535,1,"s_endpgm","",1
3464444413020881,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413020881,65535,1,"s_endpgm","",1
3464444413020881,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413021041,65535,1,"s_endpgm","",1
3464444413020881,65535,1,"v_bfe_u32 v0, v0, 10, 10","",1
3464444413021041,65535,1,"s_endpgm","",1
3464444413021041,65535,1,"s_endpgm","",1
3464444413021041,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413021041,65535,1,"s_endpgm","",1
3464444413021041,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413021041,65535,1,"s_endpgm","",1
3464444413022001,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413022001,65535,1,"s_endpgm","",1
3464444413022001,65535,1,"s_endpgm","",1
3464444413022001,65535,1,"s_endpgm","",1
3464444413022001,65535,1,"s_endpgm","",1
3464444413022001,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413022001,65535,1,"s_endpgm","",1
3464444413022001,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413022001,65535,1,"s_waitcnt lgkmcnt(0)","",1
3464444413022161,65535,1,"s_endpgm","",1
3464444413022161,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413022161,65535,1,"s_endpgm","",1
3464444413022161,65535,1,"s_load_dword s8, s[4:5], 0x24","",1
3464444413022161,65535,1,"global_store_dword v[0:1], v3, off","",1
3464444413022161,65535,1,"s_endpgm","",1
3464444413022161,65535,1,"s_endpgm","",1
3464444413022161,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413022161,65535,1,"s_endpgm","",1
3464444413022161,65535,1,"s_endpgm","",1
3464444413022161,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413022161,65535,1,"s_endpgm","",1
3464444413022321,65535,1,"s_load_dwordx4 s[0:3], s[4:5], 0x0","",1
3464444413022161,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413022321,65535,1,"s_endpgm","",1
3464444413022161,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413023281,65535,1,"s_endpgm","",1
3464444413023281,65535,1,"s_endpgm","",1
3464444413023281,65535,1,"v_ashrrev_i32_e32 v1, 31, v0","",1
3464444413024561,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413023281,65535,1,"s_endpgm","",1
3464444413024561,65535,1,"s_endpgm","",1
3464444413023761,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413026321,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413024401,65535,1,"global_store_dword v[0:1], v3, off","",1
3464444413027121,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413025041,65535,1,"v_add_co_u32_e32 v0, vcc, s0, v0","",1
3464444413027761,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413025361,65535,1,"s_endpgm","",1
3464444413027601,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413026321,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413028401,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413026481,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413028881,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413026641,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413028401,65535,1,"s_load_dword s8, s[4:5], 0x24","",1
3464444413027281,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413029681,65535,1,"s_endpgm","",1

@@ -0,0 +1,22 @@
"Sample_Timestamp","Exec_Mask","Dispatch_Id","Instruction","Instruction_Comment","Correlation_Id"
54155306462675,65535,1,"s_waitcnt lgkmcnt(0)","/opt/rocm/include/hip/amd_detail/amd_hip_runtime.h:275",1
54155306462715,65535,1,"s_waitcnt vmcnt(0)","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44",1
54155306462755,65535,1,"s_endpgm","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:45",1
54155306462755,65535,1,"s_endpgm","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:45",1
54155306462955,65535,1,"s_endpgm","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:45",1
54155306463035,65535,1,"s_waitcnt vmcnt(0)","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44",1
54155306463235,65535,1,"s_waitcnt vmcnt(0)","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44",1
54155306463315,65535,1,"s_waitcnt vmcnt(0)","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44",1
54155306463515,65535,1,"s_endpgm","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:45",1
54155306463755,65535,1,"s_waitcnt vmcnt(0)","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44",1
54155306463875,65535,1,"s_waitcnt vmcnt(0)","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44",1
54155306464075,65535,1,"v_mov_b32_e32 v2, s4","/opt/rocm/include/hip/amd_detail/amd_hip_runtime.h:275",1
54155306464155,65535,1,"s_waitcnt vmcnt(0)","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44",1
54155306464155,65535,1,"s_waitcnt vmcnt(0)","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44",1
54155306464275,65535,1,"s_endpgm","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:45",1
54155306464395,65535,1,"s_waitcnt vmcnt(0)","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44",1
54155306464515,65535,1,"s_waitcnt lgkmcnt(0)","/opt/rocm/include/hip/amd_detail/amd_hip_runtime.h:275",1
54155306464555,65535,1,"s_waitcnt vmcnt(0)","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44",1
54155306464595,65535,1,"s_waitcnt vmcnt(0)","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44",1
54155306464595,65535,1,"v_mov_b32_e32 v2, s6","/opt/rocm/include/hip/amd_detail/amd_hip_runtime.h:275",1
54155306464595,65535,1,"s_waitcnt lgkmcnt(0)","/opt/rocm/include/hip/amd_detail/amd_hip_runtime.h:275",1

@@ -0,0 +1,178 @@
.. meta::
    :description: Documentation of the usage of pc-sampling with rocprofv3 command-line tool
    :keywords: ROCprofiler-SDK tool, ROCprofiler-SDK library, rocprofv3, rocprofv3 tool usage, Using rocprofv3, ROCprofiler-SDK command line tool, PC sampling

.. _using-pc-sampling:

======================
Using ``pc-sampling``
======================

Program counter (PC) sampling is a GPU profiling technique that periodically samples the program counter during kernel execution to reveal code execution patterns and hotspots.
This helps in:

- Identifying performance bottlenecks
- Understanding kernel execution behavior
- Analyzing code coverage
- Finding heavily executed code paths

To try out the PC sampling feature, you can use the ``rocprofv3`` command-line tool or the ROCprofiler-SDK library on ROCm 6.4 or later.

.. note::

    PC sampling is supported on AMD GPUs with gfx90a and later architectures. Before using the PC sampling feature, ensure that the GPU supports it.

PC sampling availability and configuration
==========================================

To check if the GPU supports PC sampling, use the following command:

.. code-block:: bash

    rocprofv3 -L

or

.. code-block:: bash

    rocprofv3 --list-avail

The output shows whether ``rocprofv3`` supports PC sampling on the GPU and lists the supported configurations.

.. code-block:: bash

    List available PC Sample Configurations for node_id 11
    Method: ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP
    Unit: ROCPROFILER_PC_SAMPLING_UNIT_TIME
    Minimum_Interval: 1
    Maximum_Interval: 18446744073709551615

The above output shows that the GPU supports PC sampling with the ``ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP`` method and the ``ROCPROFILER_PC_SAMPLING_UNIT_TIME`` unit. The minimum and maximum intervals are also displayed.

Based on the above configuration, use the following command to profile the application with PC sampling:

.. code-block:: bash

    rocprofv3 --pc-sampling-beta-enabled --pc-sampling-method host_trap --pc-sampling-unit time --pc-sampling-interval 1 -- <application_path>

The above command enables PC sampling with the ``host_trap`` method, the ``time`` unit, and an interval of ``1`` microsecond. Replace ``<application_path>`` with the path to the application you want to profile.

This generates two files, ``agent_info.csv`` and ``pc_sampling_host_trap.csv``, both prefixed with the process ID.
Here are the contents of the ``pc_sampling_host_trap.csv`` file for the ``MatrixTranspose`` sample application:

.. csv-table:: PC sampling host trap
    :file: /data/pc_sampling_host_trap.csv
    :widths: 20,10,10,10,10,20
    :header-rows: 1

For the description of the fields in the output file, see :ref:`pc-sampling-fields`.

The ``Instruction_Comment`` field in the output above is empty. To populate it, compile your application with debug symbols.
The field then maps each sample back to its source line, which helps in understanding code execution patterns and hotspots.

.. csv-table:: PC sampling host trap with debug symbols
    :file: /data/pc_sampling_host_trap_debug.csv
    :widths: 20,10,10,10,10,20
    :header-rows: 1

The above output shows the ``Instruction_Comment`` field populated with the source line information.

.. _pc-sampling-fields:

PC sampling fields
==================

The output file generated by PC sampling contains the following fields:

- ``Sample_Timestamp``: Timestamp at which the sample was generated
- ``Exec_Mask``: Mask of the SIMD lanes that were active when the sample was taken
- ``Dispatch_Id``: ID of the originating kernel dispatch
- ``Instruction``: Assembly instruction, for example ``s_load_dword s8, s[1:2], 0x10``
- ``Instruction_Comment``: Instruction comment (maps back to the source line if the application was compiled with debug symbols)
- ``Correlation_Id``: ID of the API launch call that corresponds to the dispatch ID

By default, the output file is in CSV format. To dump samples in a more comprehensive format, use JSON through ``--output-format json``.

.. code-block:: bash

    rocprofv3 --pc-sampling-beta-enabled --pc-sampling-method host_trap --pc-sampling-unit time --pc-sampling-interval 1 --output-format json -- <application_path>

This generates a JSON file with the comprehensive output. Here is a trimmed-down output with multiple records:

.. code-block:: text

    {
      "pc_sample_host_trap": [
        {
          "record": {
            "hw_id": {
              "chiplet": 0,
              "wave_id": 0,
              "simd_id": 2,
              "pipe_id": 0,
              "cu_or_wgp_id": 1,
              "shader_array_id": 0,
              "shader_engine_id": 2,
              "workgroup_id": 0,
              "vm_id": 3,
              "queue_id": 2,
              "microengine_id": 1
            },
            "pc": {
              "code_object_id": 1,
              "code_object_offset": 20228
            },
            "exec_mask": 18446744073709551615,
            "timestamp": 51040126667689,
            "dispatch_id": 1,
            "corr_id": {
              "internal": 1,
              "external": 0
            },
            "wrkgrp_id": {
              "x": 182,
              "y": 0,
              "z": 0
            },
            "wave_in_grp": 1
          },
          "inst_index": 0
        },
        {
          "record": {
            "hw_id": {
              "chiplet": 0,
              "wave_id": 0,
              "simd_id": 2,
              "pipe_id": 0,
              "cu_or_wgp_id": 0,
              "shader_array_id": 0,
              "shader_engine_id": 2,
              "workgroup_id": 0,
              "vm_id": 3,
              "queue_id": 2,
              "microengine_id": 1
            },
            "pc": {
              "code_object_id": 1,
              "code_object_offset": 20236
            },
            "exec_mask": 18446744073709551615,
            "timestamp": 51040126667689,
            "dispatch_id": 1,
            "corr_id": {
              "internal": 1,
              "external": 0
            },
            "wrkgrp_id": {
              "x": 158,
              "y": 0,
              "z": 0
            },
            "wave_in_grp": 2
          },
          "inst_index": 1
        }
      ]
    }

For a description of the fields in the JSON output, see :ref:`output-file-fields`.

@@ -33,6 +33,7 @@ The documentation is structured as follows:

* :ref:`using-rocprofv3`
* :ref:`using-rocprofiler-sdk-roctx`
* :ref:`using-pc-sampling`
* :doc:`Samples <how-to/samples>`
* :ref:`using-rocprofv3-with-mpi`
