Adding pc sampling how to guide (#160)

* Adding pc sampling how to guide

* doc update

* Fixing indentation

* updating index

* updating doc

* updating doc

* Added field information

* Fixing Formatting

* fix formatting error

* Added json format for pc sampling

* feedback resolved

* formatting for text

* PC Sampling API doc

* Reformatted

* Note for shared systems

* update docs

* correcting relative path for cross-referencing

---------

Co-authored-by: vlaindic_amdeng <vladimir.indic@amd.com>
This commit is contained in:
Bhardwaj, Gopesh
2025-02-11 08:03:05 +05:30
committed by GitHub
parent c478c24616
commit cdf22eba7d
6 files changed, with 437 additions and 7 deletions
+1
@@ -14,6 +14,7 @@ subtrees:
- file: how-to/using-rocprofv3
- file: how-to/using-rocprofiler-sdk-roctx
- file: how-to/samples
- file: how-to/using-pc-sampling
- file: how-to/using-rocprofv3-with-mpi
- caption: API reference
entries:
+155 -7
@@ -9,13 +9,161 @@ myst:
Program Counter (PC) sampling is a profiling method that statistically approximates kernel execution by sampling GPU program counters. The method periodically chooses an active wave in a round-robin manner and snapshots its PC. This process takes place on every compute unit simultaneously, making it device-wide PC sampling. The result is a histogram of samples that shows how many times each kernel instruction was sampled.
> **Warning:**
> Risk acknowledgment: The PC sampling feature is under development and might not be completely stable. Use this beta feature cautiously. It may affect your system's stability and performance. Proceed at your own risk.
>
> By activating this feature through the `ROCPROFILER_PC_SAMPLING_BETA_ENABLED` environment variable, you acknowledge and accept the following potential risks:
>
> - Hardware freeze: This beta feature could cause your hardware to freeze unexpectedly.
> - Need for cold restart: In the event of a hardware freeze, you might need to perform a cold restart (turning the hardware off and on) to restore normal operations.

## ROCprofiler-SDK PC Sampling Service
This section describes how to use the ROCprofiler-SDK PC Sampling API to configure and use the PC sampling service. For a fully functional example, see [Samples](https://github.com/ROCm/rocprofiler-sdk/tree/amd-mainline/samples).
### tool_init() Setup
As the PC sampling service belongs to the group of [buffered services](buffered_services.md), it requires a buffer and a context to be set up in this phase.
```cpp
rocprofiler_context_id_t ctx{0};
rocprofiler_buffer_id_t  buff;
ROCPROFILER_CALL(rocprofiler_create_context(&ctx), "context creation failed");
ROCPROFILER_CALL(rocprofiler_create_buffer(ctx,
                                           8192,  // buffer size
                                           2048,  // watermark
                                           ROCPROFILER_BUFFER_POLICY_LOSSLESS,
                                           pc_sampling_callback,  // Callback to process PC samples
                                           user_data,
                                           &buff),
                 "buffer creation failed");
```
For more details about buffer creation, see the [buffered services section](buffered_services.md).
The PC sampling service is tied to a GPU agent. To extract the list of available agents, use `rocprofiler_query_available_agents`, as outlined in the following snippet.
```cpp
std::vector<rocprofiler_agent_v0_t> agents;
// Callback used by rocprofiler_query_available_agents to return
// agents on the device. This can include CPU agents as well. We
// select GPU agents only (i.e. type == ROCPROFILER_AGENT_TYPE_GPU)
rocprofiler_query_available_agents_cb_t iterate_cb = [](rocprofiler_agent_version_t agents_ver,
                                                        const void**                agents_arr,
                                                        size_t                      num_agents,
                                                        void*                       udata) {
    if(agents_ver != ROCPROFILER_AGENT_INFO_VERSION_0)
        throw std::runtime_error{"unexpected rocprofiler agent version"};
    auto* agents_v = static_cast<std::vector<rocprofiler_agent_v0_t>*>(udata);
    for(size_t i = 0; i < num_agents; ++i)
    {
        const auto* agent = static_cast<const rocprofiler_agent_v0_t*>(agents_arr[i]);
        if(agent->type == ROCPROFILER_AGENT_TYPE_GPU) agents_v->emplace_back(*agent);
    }
    return ROCPROFILER_STATUS_SUCCESS;
};
// Query the agents, only a single callback is made that contains a vector
// of all agents.
ROCPROFILER_CALL(
    rocprofiler_query_available_agents(ROCPROFILER_AGENT_INFO_VERSION_0,
                                       iterate_cb,
                                       sizeof(rocprofiler_agent_t),
                                       const_cast<void*>(static_cast<const void*>(&agents))),
    "query available agents");
```
Only recent GPU architectures support the feature. To determine whether the agent identified by `agent_id` supports PC sampling and which configurations (`rocprofiler_pc_sampling_configuration_t`) are available, use `rocprofiler_query_pc_sampling_agent_configurations`.
```cpp
std::vector<rocprofiler_pc_sampling_configuration_t> available_configurations;
auto cb = [](const rocprofiler_pc_sampling_configuration_t* configs,
             size_t                                         num_config,
             void*                                          user_data) {
    // user_data carries the address of available_configurations
    auto* avail_configs =
        static_cast<std::vector<rocprofiler_pc_sampling_configuration_t>*>(user_data);
    for(size_t i = 0; i < num_config; i++)
    {
        avail_configs->emplace_back(configs[i]);
    }
    return ROCPROFILER_STATUS_SUCCESS;
};
auto status = rocprofiler_query_pc_sampling_agent_configurations(
    agent_id, cb, &available_configurations);
```
Assuming the `available_configurations` contains a single element:
```cpp
rocprofiler_pc_sampling_configuration_t {
    .method       = ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP,
    .unit         = ROCPROFILER_PC_SAMPLING_UNIT_TIME,
    .min_interval = 1,
    .max_interval = 10000
};
```
the PC sampling service can be configured on the agent identified by `agent_id` to generate a sample every 1000 microseconds as follows (here, `picked_cfg` is assumed to point to the chosen element of `available_configurations`):
```cpp
auto status = rocprofiler_configure_pc_sampling_service(ctx,
                                                        agent_id,
                                                        picked_cfg->method,
                                                        picked_cfg->unit,
                                                        1000,  // 1000 us
                                                        buff,  // buffer created in tool_init()
                                                        0);
if(status == ROCPROFILER_STATUS_SUCCESS)
{
    // PC Sampling service has been configured successfully.
}
else
{
    // code for error handling
}
```
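
Choosing the interval can be sketched independently of the SDK. The snippet below is a minimal illustration in C++17, assuming a hypothetical `sampling_config` struct that mirrors only the `min_interval` and `max_interval` fields shown above. `clamp_interval` is an illustrative helper and is not part of ROCprofiler-SDK.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for the interval bounds of
// rocprofiler_pc_sampling_configuration_t (illustration only).
struct sampling_config
{
    uint64_t min_interval;
    uint64_t max_interval;
};

// Clamp a requested interval into the range advertised by the configuration,
// so the value passed to the configure call is always within bounds.
inline uint64_t
clamp_interval(const sampling_config& cfg, uint64_t requested)
{
    return std::clamp(requested, cfg.min_interval, cfg.max_interval);
}
```

With the example configuration above (`min_interval = 1`, `max_interval = 10000`), a requested interval of 1000 is returned unchanged, while a request of 50000 would be reduced to 10000.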
> **Note**
>
> Multiple processes can share the same GPU agent simultaneously, so the following ABA problem is possible on shared systems. Namely, process A can query available configurations and decide to configure the service with configuration CA. However, process B manages to finish configuring the service with configuration CB, meaning process A will fail. Thus, we advise that process A repeat the querying process to observe configuration CB and reuse it for configuring the PC sampling service. Please refer to the [Samples](https://github.com/ROCm/rocprofiler-sdk/tree/amd-mainline/samples) section for more technical details.
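
The re-query strategy from the note can be sketched as a small retry loop. This is a minimal C++17 illustration, not the SDK's or the samples' implementation: `pc_config`, `configure_with_retry`, and the two callables are hypothetical placeholders standing in for the query and configure calls shown earlier.

```cpp
#include <functional>
#include <optional>
#include <vector>

// Hypothetical, simplified view of a PC sampling configuration.
struct pc_config
{
    int method;
    int unit;
};

// query():     returns the configurations currently available on the agent.
// configure(): tries to apply one configuration; returns true on success.
// Both callables are placeholders for the SDK calls shown above.
inline std::optional<pc_config>
configure_with_retry(const std::function<std::vector<pc_config>()>& query,
                     const std::function<bool(const pc_config&)>&   configure,
                     int                                            max_attempts)
{
    for(int attempt = 0; attempt < max_attempts; ++attempt)
    {
        // Re-query on every attempt so that a configuration installed by
        // another process in the meantime becomes visible.
        auto available = query();
        if(available.empty()) return std::nullopt;  // nothing to configure
        if(configure(available.front())) return available.front();
    }
    return std::nullopt;  // gave up after max_attempts
}
```

Bounding the number of attempts keeps the tool from spinning indefinitely if other processes keep changing the configuration.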
### Processing PC Samples (`pc_sampling_callback`)
The PC sampling service delivers samples asynchronously via a dedicated callback. The following code outlines the process of iterating over samples.
```cpp
void
pc_sampling_callback(rocprofiler_context_id_t      ctx,
                     rocprofiler_buffer_id_t       buff,
                     rocprofiler_record_header_t** headers,
                     size_t                        num_headers,
                     void*                         data,
                     uint64_t                      drop_count)
{
    for(size_t i = 0; i < num_headers; i++)
    {
        auto* cur_header = headers[i];
        if(cur_header->category == ROCPROFILER_BUFFER_CATEGORY_PC_SAMPLING)
        {
            if(cur_header->kind == ROCPROFILER_PC_SAMPLING_RECORD_HOST_TRAP_V0_SAMPLE)
            {
                auto* pc_sample = static_cast<rocprofiler_pc_sampling_record_host_trap_v0_t*>(
                    cur_header->payload);
                // Processing a single sample...
            }
            else
            {
                // ...
            }
        }
    }
}
```
For more information about the data that comprises a single sample, see [pc_sampling.h](https://github.com/ROCm/rocprofiler-sdk/blob/amd-mainline/source/include/rocprofiler-sdk/pc_sampling.h).
Note that buffers can also be flushed synchronously via `rocprofiler_buffer_flush`, which triggers `pc_sampling_callback`.
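
One way to turn delivered samples into the histogram described in the introduction is to count samples per program counter location. The sketch below is a minimal illustration that assumes a hypothetical, reduced `pc_location` struct whose two fields mirror the `code_object_id` and `code_object_offset` values reported for each sample; `pc_histogram` and `accumulate` are illustrative names, not SDK types.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical, reduced sample: only the program counter location is kept.
struct pc_location
{
    uint64_t code_object_id;
    uint64_t code_object_offset;
};

// Maps (code_object_id, code_object_offset) to the number of samples seen.
using pc_histogram = std::map<std::pair<uint64_t, uint64_t>, uint64_t>;

// Add a batch of samples to the histogram.
inline void
accumulate(pc_histogram& hist, const std::vector<pc_location>& samples)
{
    for(const auto& s : samples)
        ++hist[{s.code_object_id, s.code_object_offset}];
}
```

In a real tool, the body of `pc_sampling_callback` would extract the location from each record and feed it into such a structure. Because the callback is asynchronous, access to the histogram would need appropriate synchronization.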
+80
@@ -0,0 +1,80 @@
"Sample_Timestamp","Exec_Mask","Dispatch_Id","Instruction","Instruction_Comment","Correlation_Id"
3464444413017201,65535,1,"s_endpgm","",1
3464444413017201,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413018481,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413018481,65535,1,"s_endpgm","",1
3464444413018481,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413018481,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413018481,65535,1,"s_endpgm","",1
3464444413018481,65535,1,"s_endpgm","",1
3464444413019601,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413019761,65535,1,"s_load_dword s8, s[4:5], 0x24","",1
3464444413019761,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413019761,65535,1,"s_endpgm","",1
3464444413019761,65535,1,"s_load_dword s8, s[4:5], 0x24","",1
3464444413019761,65535,1,"s_endpgm","",1
3464444413019761,65535,1,"s_endpgm","",1
3464444413020881,65535,1,"s_endpgm","",1
3464444413020881,65535,1,"s_endpgm","",1
3464444413020881,65535,1,"s_endpgm","",1
3464444413020881,65535,1,"s_waitcnt lgkmcnt(0)","",1
3464444413020881,65535,1,"v_addc_co_u32_e32 v5, vcc, v1, v5, vcc","",1
3464444413020881,65535,1,"s_endpgm","",1
3464444413020881,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413020881,65535,1,"s_endpgm","",1
3464444413020881,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413021041,65535,1,"s_endpgm","",1
3464444413020881,65535,1,"v_bfe_u32 v0, v0, 10, 10","",1
3464444413021041,65535,1,"s_endpgm","",1
3464444413021041,65535,1,"s_endpgm","",1
3464444413021041,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413021041,65535,1,"s_endpgm","",1
3464444413021041,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413021041,65535,1,"s_endpgm","",1
3464444413022001,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413022001,65535,1,"s_endpgm","",1
3464444413022001,65535,1,"s_endpgm","",1
3464444413022001,65535,1,"s_endpgm","",1
3464444413022001,65535,1,"s_endpgm","",1
3464444413022001,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413022001,65535,1,"s_endpgm","",1
3464444413022001,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413022001,65535,1,"s_waitcnt lgkmcnt(0)","",1
3464444413022161,65535,1,"s_endpgm","",1
3464444413022161,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413022161,65535,1,"s_endpgm","",1
3464444413022161,65535,1,"s_load_dword s8, s[4:5], 0x24","",1
3464444413022161,65535,1,"global_store_dword v[0:1], v3, off","",1
3464444413022161,65535,1,"s_endpgm","",1
3464444413022161,65535,1,"s_endpgm","",1
3464444413022161,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413022161,65535,1,"s_endpgm","",1
3464444413022161,65535,1,"s_endpgm","",1
3464444413022161,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413022161,65535,1,"s_endpgm","",1
3464444413022321,65535,1,"s_load_dwordx4 s[0:3], s[4:5], 0x0","",1
3464444413022161,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413022321,65535,1,"s_endpgm","",1
3464444413022161,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413023281,65535,1,"s_endpgm","",1
3464444413023281,65535,1,"s_endpgm","",1
3464444413023281,65535,1,"v_ashrrev_i32_e32 v1, 31, v0","",1
3464444413024561,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413023281,65535,1,"s_endpgm","",1
3464444413024561,65535,1,"s_endpgm","",1
3464444413023761,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413026321,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413024401,65535,1,"global_store_dword v[0:1], v3, off","",1
3464444413027121,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413025041,65535,1,"v_add_co_u32_e32 v0, vcc, s0, v0","",1
3464444413027761,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413025361,65535,1,"s_endpgm","",1
3464444413027601,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413026321,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413028401,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413026481,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413028881,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413026641,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413028401,65535,1,"s_load_dword s8, s[4:5], 0x24","",1
3464444413027281,65535,1,"s_waitcnt vmcnt(0)","",1
3464444413029681,65535,1,"s_endpgm","",1
@@ -0,0 +1,22 @@
"Sample_Timestamp","Exec_Mask","Dispatch_Id","Instruction","Instruction_Comment","Correlation_Id"
54155306462675,65535,1,"s_waitcnt lgkmcnt(0)","/opt/rocm/include/hip/amd_detail/amd_hip_runtime.h:275",1
54155306462715,65535,1,"s_waitcnt vmcnt(0)","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44",1
54155306462755,65535,1,"s_endpgm","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:45",1
54155306462755,65535,1,"s_endpgm","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:45",1
54155306462955,65535,1,"s_endpgm","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:45",1
54155306463035,65535,1,"s_waitcnt vmcnt(0)","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44",1
54155306463235,65535,1,"s_waitcnt vmcnt(0)","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44",1
54155306463315,65535,1,"s_waitcnt vmcnt(0)","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44",1
54155306463515,65535,1,"s_endpgm","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:45",1
54155306463755,65535,1,"s_waitcnt vmcnt(0)","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44",1
54155306463875,65535,1,"s_waitcnt vmcnt(0)","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44",1
54155306464075,65535,1,"v_mov_b32_e32 v2, s4","/opt/rocm/include/hip/amd_detail/amd_hip_runtime.h:275",1
54155306464155,65535,1,"s_waitcnt vmcnt(0)","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44",1
54155306464155,65535,1,"s_waitcnt vmcnt(0)","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44",1
54155306464275,65535,1,"s_endpgm","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:45",1
54155306464395,65535,1,"s_waitcnt vmcnt(0)","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44",1
54155306464515,65535,1,"s_waitcnt lgkmcnt(0)","/opt/rocm/include/hip/amd_detail/amd_hip_runtime.h:275",1
54155306464555,65535,1,"s_waitcnt vmcnt(0)","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44",1
54155306464595,65535,1,"s_waitcnt vmcnt(0)","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44",1
54155306464595,65535,1,"v_mov_b32_e32 v2, s6","/opt/rocm/include/hip/amd_detail/amd_hip_runtime.h:275",1
54155306464595,65535,1,"s_waitcnt lgkmcnt(0)","/opt/rocm/include/hip/amd_detail/amd_hip_runtime.h:275",1
+178
@@ -0,0 +1,178 @@
.. meta::
  :description: Documentation of the usage of pc-sampling with rocprofv3 command-line tool
  :keywords: ROCprofiler-SDK tool, ROCprofiler-SDK library, rocprofv3, rocprofv3 tool usage, Using rocprofv3, ROCprofiler-SDK command line tool, PC sampling

.. _using-pc-sampling:
======================
Using ``pc-sampling``
======================
PC (Program Counter) sampling is a GPU profiling technique that periodically samples the program counter during GPU kernel execution to reveal code execution patterns and hotspots.
This helps in:
- Identifying performance bottlenecks
- Understanding kernel execution behavior
- Analyzing code coverage
- Finding heavily executed code paths
To try out the PC sampling feature, use the ``rocprofv3`` command-line tool or the ROCprofiler-SDK library on ROCm 6.4 or later.
.. note::

   PC sampling is supported on AMD GPUs with gfx90a and later architectures. Before using the PC sampling feature, ensure that the GPU supports it.

PC Sampling availability and Configuration
==========================================
To check if the GPU supports PC sampling, use the following command:
.. code-block:: bash

   rocprofv3 -L

or

.. code-block:: bash

   rocprofv3 --list-avail

The output lists whether ``rocprofv3`` supports PC sampling on the GPU and which configurations are supported.

.. code-block:: bash

   List available PC Sample Configurations for node_id 11
   Method: ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP
   Unit: ROCPROFILER_PC_SAMPLING_UNIT_TIME
   Minimum_Interval: 1
   Maximum_Interval: 18446744073709551615

The above output shows that the GPU supports PC sampling with the ``ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP`` method and the ``ROCPROFILER_PC_SAMPLING_UNIT_TIME`` unit. The minimum and maximum intervals are also displayed.
Based on the above configuration, you can use the following command to profile the application using PC sampling:
.. code-block:: bash

   rocprofv3 --pc-sampling-beta-enabled --pc-sampling-method host_trap --pc-sampling-unit time --pc-sampling-interval 1 -- <application_path>

The above command enables PC sampling with the ``host_trap`` method, the ``time`` unit, and an interval of 1 microsecond (µs). Replace ``<application_path>`` with the path to the application you want to profile.
This generates two files, ``agent_info.csv`` and ``pc_sampling_host_trap.csv``, both prefixed with the process ID.
Here is the PC sampling output for the ``MatrixTranspose`` sample application.
The contents of the ``pc_sampling_host_trap.csv`` file are:

.. csv-table:: PC sampling host trap
   :file: /data/pc_sampling_host_trap.csv
   :widths: 20,10,10,10,10,20
   :header-rows: 1

For the description of the fields in the output file, see :ref:`pc-sampling-fields`.
If the ``Instruction_Comment`` field in the output file is empty, compile your application with debug symbols to populate it.
The field maps each sample back to a source line when debug symbols are enabled at compile time, which helps in understanding code execution patterns and hotspots.

.. csv-table:: PC sampling host trap with debug symbols
   :file: /data/pc_sampling_host_trap_debug.csv
   :widths: 20,10,10,10,10,20
   :header-rows: 1

The above output shows the `Instruction_Comment` field populated with the source line information.
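
The CSV output can be reduced to a hotspot summary by counting how many samples share the same value in one column. The following is a minimal sketch of that idea. ``split_csv_line`` and ``count_by_column`` are illustrative helpers (not part of ``rocprofv3``), and the parser assumes that quoted fields do not contain escaped double quotes.

.. code-block:: cpp

   #include <cstddef>
   #include <istream>
   #include <map>
   #include <string>
   #include <vector>

   // Split one CSV line into fields, honoring double-quoted fields that may
   // contain commas (for example "s_load_dword s8, s[4:5], 0x24").
   inline std::vector<std::string>
   split_csv_line(const std::string& line)
   {
       std::vector<std::string> fields(1);
       bool                     in_quotes = false;
       for(char c : line)
       {
           if(c == '"')
               in_quotes = !in_quotes;
           else if(c == ',' && !in_quotes)
               fields.emplace_back();
           else
               fields.back() += c;
       }
       return fields;
   }

   // Count samples per distinct value of the given zero-based column.
   // The first line is treated as the header and skipped.
   inline std::map<std::string, std::size_t>
   count_by_column(std::istream& csv, std::size_t column)
   {
       std::map<std::string, std::size_t> counts;
       std::string                        line;
       std::getline(csv, line);  // skip header
       while(std::getline(csv, line))
       {
           auto fields = split_csv_line(line);
           if(column < fields.size()) ++counts[fields[column]];
       }
       return counts;
   }

With the column order shown in the tables above, column 3 groups samples by ``Instruction`` and column 4 groups them by ``Instruction_Comment``, that is, by source line when debug symbols are available.
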
.. _pc-sampling-fields:
PC Sampling Fields:
===================
The output file generated by PC sampling contains the following fields:
- ``Sample_Timestamp``: Timestamp when sample is generated
- ``Exec_Mask``: Active SIMD lanes when sampled
- ``Dispatch_Id``: Originating kernel dispatch ID
- ``Instruction``: Assembly instruction, for example ``s_load_dword s8, s[1:2], 0x10``
- ``Instruction_Comment``: Instruction comment (Maps back to source-line if debug symbols were enabled when application was compiled)
- ``Correlation_Id``: API launch call id that matches dispatch ID
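
As a small illustration of reading ``Exec_Mask``, the number of active lanes can be obtained by counting the set bits of the mask. This sketch assumes that each set bit corresponds to one active lane, which is an interpretation of the field description above. ``active_lanes`` is a hypothetical helper name.

.. code-block:: cpp

   #include <bitset>
   #include <cstddef>
   #include <cstdint>

   // Number of set bits in the execution mask, interpreted here as the
   // number of lanes that were active when the sample was taken.
   inline std::size_t
   active_lanes(uint64_t exec_mask)
   {
       return std::bitset<64>(exec_mask).count();
   }

For the value ``65535`` that appears in the tables above, this yields 16, because ``65535`` is ``0xFFFF`` and has its 16 lowest bits set.
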
By default, the output file is in CSV format. To dump samples in a more comprehensive format, use JSON by specifying ``--output-format json``.

.. code-block:: bash

   rocprofv3 --pc-sampling-beta-enabled --pc-sampling-method host_trap --pc-sampling-unit time --pc-sampling-interval 1 --output-format json -- <application_path>

This generates a JSON file with the comprehensive output. Here is a trimmed-down output with multiple records:

.. code-block:: text

   {
     "pc_sample_host_trap": [
       {
         "record": {
           "hw_id": {
             "chiplet": 0,
             "wave_id": 0,
             "simd_id": 2,
             "pipe_id": 0,
             "cu_or_wgp_id": 1,
             "shader_array_id": 0,
             "shader_engine_id": 2,
             "workgroup_id": 0,
             "vm_id": 3,
             "queue_id": 2,
             "microengine_id": 1
           },
           "pc": {
             "code_object_id": 1,
             "code_object_offset": 20228
           },
           "exec_mask": 18446744073709551615,
           "timestamp": 51040126667689,
           "dispatch_id": 1,
           "corr_id": {
             "internal": 1,
             "external": 0
           },
           "wrkgrp_id": {
             "x": 182,
             "y": 0,
             "z": 0
           },
           "wave_in_grp": 1
         },
         "inst_index": 0
       },
       {
         "record": {
           "hw_id": {
             "chiplet": 0,
             "wave_id": 0,
             "simd_id": 2,
             "pipe_id": 0,
             "cu_or_wgp_id": 0,
             "shader_array_id": 0,
             "shader_engine_id": 2,
             "workgroup_id": 0,
             "vm_id": 3,
             "queue_id": 2,
             "microengine_id": 1
           },
           "pc": {
             "code_object_id": 1,
             "code_object_offset": 20236
           },
           "exec_mask": 18446744073709551615,
           "timestamp": 51040126667689,
           "dispatch_id": 1,
           "corr_id": {
             "internal": 1,
             "external": 0
           },
           "wrkgrp_id": {
             "x": 158,
             "y": 0,
             "z": 0
           },
           "wave_in_grp": 2
         },
         "inst_index": 1
       }
     ]
   }

The description of the fields in the JSON output is available in the :ref:`output-file-fields`.
+1
@@ -33,6 +33,7 @@ The documentation is structured as follows:
* :ref:`using-rocprofv3`
* :ref:`using-rocprofiler-sdk-roctx`
* :ref:`using-pc-sampling`
* :doc:`Samples <how-to/samples>`
* :ref:`using-rocprofv3-with-mpi`