doc improvements for 1.0.0 part 2 (#330)

* update installation steps

* Github Issue #50 Adding README's for samples

* Making name change to ROCprofiler-SDK for consistency

* Fix HIP trace documentation

* Fix HSA trace in docs

* Fix kernel trace in docs

* Fixing memory copy and memory allocation traces

* runtime trace and sys trace doc update

* Fix scratch memory doc

* kernel naming and filtering options

* Adding collection period in docs

* Perfetto configs update

* summary output file

* kernel trace format fix

* update CHANGELOG

* Agent index doc update

* rocm-smi output

* group by queue option

* Updated --group-by-queue description

* perfetto visualization

---------

Co-authored-by: Ian Trowbridge <Ian.Trowbridge@amd.com>
Bhardwaj, Gopesh
2025-04-16 02:00:07 +05:30
committed by GitHub
parent 4fbcfd142c
commit ca7cce9e81
32 changed files with 744 additions and 106 deletions
+6 -4
@@ -158,8 +158,10 @@ Full documentation for ROCprofiler-SDK is available at [rocm.docs.amd.com/projec
- SDK: `rocprofiler_agent_v0_t` support for agent UUIDs
- SDK: `rocprofiler_agent_v0_t` support for agent visibility based on gpu isolation environment variables such as `ROCR_VISIBLE_DEVICES` and so on.
- Accumulation VGPR support for `rocprofv3`.
- Host-trap based PC sampling support for rocprofv3.
- Support for OpenMP tool.
## ROCprofiler-SDK 0.7.0 for ROCm release 6.5
## ROCprofiler-SDK 1.0.0 for ROCm release 6.5
### Added
@@ -170,9 +172,9 @@ Full documentation for ROCprofiler-SDK is available at [rocm.docs.amd.com/projec
- Added perfetto support for counter collection.
- Added support for negating rocprofv3 tracing options when using aggregate options, e.g. `--sys-trace --hsa-trace=no`
- Added `--agent-index` option in rocprofv3 to specify the agent naming convention in the output
- absolute == node_id
- relative == logical_node_id
- type-relative == logical_node_type_id
- absolute == node_id
- relative == logical_node_id
- type-relative == logical_node_type_id
### Changed
+1 -1
@@ -31,7 +31,7 @@ ROCProfiler-SDK is AMDs new and improved tooling infrastructure, providing a
- Scratch Memory tracing
- RCCL API tracing
- rocDecode API tracing
- rocjpeg API tracing
- rocJPEG API tracing
## Parallelism API Support
+14
@@ -0,0 +1,14 @@
# API Callback Tracing Sample
## Services
- Captures events like API calls using callbacks.
- HSA API (Core, AMD Ext)
- HIP API (Runtime)
- Marker API (Core, Name)
## Properties
- Handles roctxProfilerPause and roctxProfilerResume operations using a control context.
- Captures API calls and logs details like thread ID, operation type, and duration.
- Provides a detailed trace of all function calls and events for debugging.
@@ -0,0 +1,13 @@
# CodeObject tracing
## Services
- code object tracing.
## Properties
- This tool hooks into ROCProfiler's callback and buffer tracing mechanisms to:
- Decode and analyze GPU code objects.
- Three kernel variants are used in the sample: simple transpose, in-place LDS swap, and LDS with no bank conflicts.
- Trace kernel symbols and instructions.
- Log disassembly and statistics for debugging or performance analysis.
+11
@@ -0,0 +1,11 @@
# CodeObject Tracing
## Services
Trace and analyze the execution of GPU code objects and kernel symbols.
## Properties
- This tool is designed to capture and log information about code object loading/unloading and kernel symbol registration/un-registration events during the execution of GPU programs.
- Whenever a relevant event occurs, such as a code object being loaded/unloaded or a kernel symbol being registered/unregistered, the function processes the event data, formats it into a human-readable string, and appends it to the call stack.
+21
@@ -0,0 +1,21 @@
# Counter collection
## Services
- Dispatch counting
- Device Counting async
- Device Counting sync
## Properties
- Initializes tool and setup for counting service.
- Create a collection profile for the counters.
- Outputs counters mentioned during profiler creation.
- Usage of enum ROCPROFILER_BUFFER_CATEGORY_COUNTERS.
- Buffered_callback
- This sample shows the usage of the buffered approach when collecting counters. The buffered callback is called when the buffer is full (or when the buffer is flushed) and is responsible for processing the records in the buffer.
- Dispatch callback
- This sample creates a profile to collect the counter SQ_WAVES for all kernel dispatch packets.
- Prints all functional counters.
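The buffered approach described in this README can be illustrated with a small conceptual sketch. It is not ROCprofiler-SDK code: `RecordBuffer`, its `capacity`, and the record dictionaries are hypothetical names used only to show a callback firing when a buffer fills or is explicitly flushed.

```python
from typing import Callable, List


class RecordBuffer:
    """Hypothetical fixed-capacity buffer; not part of ROCprofiler-SDK."""

    def __init__(self, capacity: int, callback: Callable[[List[dict]], None]):
        self.capacity = capacity
        self.callback = callback
        self._records: List[dict] = []

    def push(self, record: dict) -> None:
        # Store the record and deliver the batch once the buffer is full.
        self._records.append(record)
        if len(self._records) >= self.capacity:
            self.flush()

    def flush(self) -> None:
        # Hand all pending records to the callback, then start a new batch.
        if self._records:
            batch, self._records = self._records, []
            self.callback(batch)


# Collect delivered batches so they can be inspected afterwards.
batches: List[List[dict]] = []
buf = RecordBuffer(capacity=2, callback=batches.append)
for dispatch_id in range(1, 4):
    buf.push({"dispatch_id": dispatch_id, "counter": "SQ_WAVES"})
buf.flush()  # deliver the final, partially filled batch
```

The callback sees records in batches rather than one at a time, which is why a final flush is needed to receive records left in a partially filled buffer.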
+10
@@ -0,0 +1,10 @@
# Runtime API Registration
## Services
- HIP runtime table registration
## Properties
- `api_registration_callback` function validates the type of library being intercepted, ensures there is only one instance of the HIP runtime library, and retrieves the dispatch table containing the API functions.
- Collects a "call stack" of intercepted API calls.
+15
@@ -0,0 +1,15 @@
# OMPT tool tracing
## Services
- OMPT tracing.
- CodeObject tracing.
- Marker API (Core, Name).
## Properties
- Configures tool for callback tracing.
- Configures tool for buffer tracing.
- Sets up callstack for tracing kind names and tracing operation names.
- Create a specialized (throw-away) context for handling ROCTx profiler pause and resume.
- Demonstrates the use of the `ompt_data_t*` fields from OMPT.
+12
@@ -0,0 +1,12 @@
# PC sampling service
## Services
- PC sampling stochastic method
## Properties
- Iterate through all GPU agents that support PC sampling.
- Iterate through the supported configuration for that agent.
- The `configure_pc_sampling_prefer_stochastic` function is responsible for configuring PC sampling on a given GPU agent. It attempts to select a stochastic sampling configuration if available, falling back to a host-trap configuration otherwise.
- `rocprofiler_pc_sampling_callback` function processes PC sampling records delivered by the profiler. It validates the records, determines their type, and delegates the printing of their details to the appropriate print_sample function.
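The "prefer stochastic, fall back to host-trap" selection described above can be sketched as follows. This is an illustration of the preference logic only, under the assumption that each supported configuration can be represented as a dictionary with a `method` key. The constants and `pick_config` are hypothetical and do not correspond to the SDK's actual types or functions.

```python
from typing import Optional, Sequence

# Hypothetical method labels; the real SDK uses its own enum values.
STOCHASTIC = "stochastic"
HOST_TRAP = "host_trap"


def pick_config(configs: Sequence[dict]) -> Optional[dict]:
    """Return the first stochastic config, else the first host-trap one."""
    for method in (STOCHASTIC, HOST_TRAP):  # order encodes the preference
        for cfg in configs:
            if cfg["method"] == method:
                return cfg
    return None  # the agent supports neither method


both = [{"method": HOST_TRAP, "interval": 1000},
        {"method": STOCHASTIC, "interval": 4096}]
only_host_trap = [{"method": HOST_TRAP, "interval": 1000}]

chosen_both = pick_config(both)
chosen_fallback = pick_config(only_host_trap)
chosen_none = pick_config([])
```

Returning `None` when nothing matches lets the caller skip agents that do not support PC sampling at all.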
+1 -1
@@ -361,7 +361,7 @@ For MPI applications (or other job launchers such as SLURM), place rocprofv3 ins
add_parser_bool_argument(
extended_tracing_options,
"--hsa-image-trace",
help="For collecting HSA API Traces (Image-extenson API), e.g. HSA functions prefixed with only 'hsa_ext_image_' (i.e. hsa_ext_image_get_capability).",
help="For collecting HSA API Traces (Image-extension API), e.g. HSA functions prefixed with only 'hsa_ext_image_' (i.e. hsa_ext_image_get_capability).",
)
add_parser_bool_argument(
extended_tracing_options,
+1 -1
@@ -35,7 +35,7 @@ def build_doxyfile():
[
"cmake",
f"-DSOURCE_DIR={_srcdir}",
"-DPROJECT_NAME='Rocprofiler SDK'",
"-DPROJECT_NAME='ROCprofiler-SDK'",
f"-P {_srcdir}/source/docs/generate-doxyfile.cmake",
]
)
+6 -4
@@ -1,5 +1,7 @@
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"HIP_COMPILER_API","__hipRegisterFatBinary",208,208,1,1508780270085955,1508780270096795
"HIP_COMPILER_API","__hipRegisterFunction",208,208,2,1508780270104242,1508780270115355
"HIP_COMPILER_API","__hipPushCallConfiguration",208,208,3,1508780613897816,1508780613898701
"HIP_COMPILER_API","__hipPopCallConfiguration",208,208,4,1508780613901714,1508780613902200
"HIP_COMPILER_API_EXT","__hipRegisterFatBinary",15,15,1,1055015439953054,1055015439976484
"HIP_COMPILER_API_EXT","__hipRegisterFunction",15,15,2,1055015439992584,1055015440011104
"HIP_COMPILER_API_EXT","__hipRegisterFunction",15,15,3,1055015440011744,1055015440013824
"HIP_COMPILER_API_EXT","__hipRegisterFunction",15,15,4,1055015440014244,1055015440014534
"HIP_COMPILER_API_EXT","__hipRegisterFunction",15,15,5,1055015440014854,1055015440015524
+8 -8
@@ -1,9 +1,9 @@
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"HIP_RUNTIME_API","hipGetDevicePropertiesR0600",238,238,1,1191915574691984,1191915687784011
"HIP_RUNTIME_API","hipMalloc",238,238,2,1191915691312459,1191915691388696
"HIP_RUNTIME_API","hipMalloc",238,238,3,1191915691390637,1191915691423279
"HIP_RUNTIME_API","hipMemcpy",238,238,4,1191915691439107,1191916547828448
"HIP_RUNTIME_API","hipLaunchKernel",238,238,5,1191916547842972,1191916548408842
"HIP_RUNTIME_API","hipMemcpy",238,238,6,1191916548412677,1191916550217834
"HIP_RUNTIME_API","hipFree",238,238,7,1191916562618151,1191916562789093
"HIP_RUNTIME_API","hipFree",238,238,8,1191916562790923,1191916562836351
"HIP_RUNTIME_API_EXT","hipGetDevicePropertiesR0600",238,238,1,1191915574691984,1191915687784011
"HIP_RUNTIME_API_EXT","hipMalloc",238,238,2,1191915691312459,1191915691388696
"HIP_RUNTIME_API_EXT","hipMalloc",238,238,3,1191915691390637,1191915691423279
"HIP_RUNTIME_API_EXT","hipMemcpy",238,238,4,1191915691439107,1191916547828448
"HIP_RUNTIME_API_EXT","hipLaunchKernel",238,238,5,1191916547842972,1191916548408842
"HIP_RUNTIME_API_EXT","hipMemcpy",238,238,6,1191916548412677,1191916550217834
"HIP_RUNTIME_API_EXT","hipFree",238,238,7,1191916562618151,1191916562789093
"HIP_RUNTIME_API_EXT","hipFree",238,238,8,1191916562790923,1191916562836351
+18
@@ -0,0 +1,18 @@
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"HIP_COMPILER_API_EXT","__hipRegisterFatBinary",15,15,1,1055015439953054,1055015439976484
"HIP_COMPILER_API_EXT","__hipRegisterFunction",15,15,2,1055015439992584,1055015440011104
"HIP_COMPILER_API_EXT","__hipRegisterFunction",15,15,3,1055015440011744,1055015440013824
"HIP_COMPILER_API_EXT","__hipRegisterFunction",15,15,4,1055015440014244,1055015440014534
"HIP_COMPILER_API_EXT","__hipRegisterFunction",15,15,5,1055015440014854,1055015440015524
"HIP_RUNTIME_API_EXT","hipGetDeviceCount",15,15,6,1055015440617618,1055015539800733
"HIP_RUNTIME_API_EXT","hipSetDevice",15,15,7,1055015539819503,1055015539821693
"HIP_RUNTIME_API_EXT","hipDeviceSynchronize",15,15,8,1055015539832333,1055015539840903
"HIP_RUNTIME_API_EXT","hipStreamCreateWithFlags",15,15,9,1055015539861673,1055015865247140
"HIP_RUNTIME_API_EXT","hipHostMalloc",15,15,10,1055015865309761,1055015865849494
"HIP_RUNTIME_API_EXT","hipHostMalloc",15,15,11,1055015865850944,1055015866265546
"HIP_RUNTIME_API_EXT","hipHostMalloc",15,15,12,1055015866266646,1055015867082900
"HIP_RUNTIME_API_EXT","hipMallocAsync",15,15,13,1055015867356542,1055015867662314
"HIP_RUNTIME_API_EXT","hipMallocAsync",15,15,14,1055015867664174,1055015867937465
"HIP_RUNTIME_API_EXT","hipMallocAsync",15,15,15,1055015867938815,1055015868219987
"HIP_RUNTIME_API_EXT","hipMemcpyAsync",15,15,16,1055015868240137,1055015917307652
"HIP_RUNTIME_API_EXT","hipMemcpyAsync",15,15,17,1055015917337263,1055015917360493
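A trace file in this CSV layout can be post-processed with the Python standard library. The sketch below is one possible approach, not part of the commit: it copies three rows from the HIP trace above into an inline string and aggregates call counts and total duration per function. `summarize` is a hypothetical helper, and durations are simply `End_Timestamp - Start_Timestamp` in whatever unit the trace uses.

```python
import csv
import io
from collections import defaultdict

# Three rows copied from the HIP trace CSV shown above.
SAMPLE = '''"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"HIP_RUNTIME_API_EXT","hipHostMalloc",15,15,10,1055015865309761,1055015865849494
"HIP_RUNTIME_API_EXT","hipHostMalloc",15,15,11,1055015865850944,1055015866265546
"HIP_RUNTIME_API_EXT","hipMemcpyAsync",15,15,17,1055015917337263,1055015917360493
'''


def summarize(text: str) -> dict:
    """Map each function name to (call count, total duration)."""
    totals = defaultdict(lambda: [0, 0])
    for row in csv.DictReader(io.StringIO(text)):
        duration = int(row["End_Timestamp"]) - int(row["Start_Timestamp"])
        totals[row["Function"]][0] += 1
        totals[row["Function"]][1] += duration
    return {name: tuple(values) for name, values in totals.items()}


summary = summarize(SAMPLE)
for name, (calls, total) in summary.items():
    print(f"{name}: calls={calls} total_duration={total}")
```

To process a real output file, replace `io.StringIO(text)` with an open file handle.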
@@ -0,0 +1,34 @@
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"HSA_CORE_API","hsa_system_get_major_extension_table",57,57,1,1056813747808832,1056813747809252
"HSA_CORE_API","hsa_agent_get_info",57,57,3,1056813747826572,1056813747826672
"HSA_CORE_API","hsa_agent_get_info",57,57,4,1056813747837582,1056813747837622
"HSA_CORE_API","hsa_agent_get_info",57,57,5,1056813747838542,1056813747838582
"HSA_CORE_API","hsa_agent_get_info",57,57,6,1056813747839042,1056813747839082
"HSA_CORE_API","hsa_agent_get_info",57,57,7,1056813747839512,1056813747839622
"HSA_CORE_API","hsa_iterate_agents",57,57,2,1056813747821012,1056813747839832
"HSA_CORE_API","hsa_agent_get_info",57,57,8,1056813747843832,1056813747844132
"HSA_CORE_API","hsa_agent_get_info",57,57,9,1056813747844482,1056813747844542
"HSA_CORE_API","hsa_agent_iterate_isas",57,57,10,1056813747849402,1056813747850422
"HSA_CORE_API","hsa_isa_get_info_alt",57,57,11,1056813747853542,1056813747875253
"HSA_CORE_API","hsa_isa_get_info_alt",57,57,12,1056813747875883,1056813747878353
"HSA_CORE_API","hsa_agent_get_info",57,57,13,1056813747886343,1056813747886403
"HSA_CORE_API","hsa_agent_get_info",57,57,54,1056813748282015,1056813748282085
"HSA_CORE_API","hsa_system_get_info",57,57,55,1056813748282465,1056813748282505
"HSA_CORE_API","hsa_signal_create",57,57,56,1056813749083419,1056813749085399
"HSA_CORE_API","hsa_agent_get_info",57,57,57,1056813749741363,1056813749741443
"HSA_CORE_API","hsa_queue_create",57,57,58,1056813749744053,1056813856914188
"HSA_CORE_API","hsa_signal_create",57,57,59,1056813857149169,1056813857154109
"HSA_CORE_API","hsa_signal_create",57,57,60,1056813857154929,1056813857155389
"HSA_CORE_API","hsa_signal_create",57,57,61,1056813857155949,1056813857156429
"HSA_CORE_API","hsa_signal_create",57,57,62,1056813857157169,1056813857157349
"HSA_CORE_API","hsa_executable_create_alt",57,57,63,1056813965439362,1056813965466952
"HSA_CORE_API","hsa_code_object_reader_create_from_memory",57,57,64,1056813965476642,1056813965587493
"HSA_CORE_API","hsa_executable_load_agent_code_object",57,57,65,1056813965592483,1056813965965295
"HSA_CORE_API","hsa_signal_create",57,57,67,1056813966149786,1056813966151706
"HSA_CORE_API","hsa_signal_wait_scacquire",57,57,68,1056813966156596,1056813966158646
"HSA_CORE_API","hsa_signal_destroy",57,57,69,1056813966162276,1056813966163746
"HSA_CORE_API","hsa_executable_freeze",57,57,66,1056813965973105,1056813966778050
"HSA_CORE_API","hsa_executable_get_symbol_by_name",57,57,70,1056813966800070,1056813966801880
"HSA_CORE_API","hsa_executable_symbol_get_info",57,57,71,1056813966805750,1056813966805980
"HSA_CORE_API","hsa_executable_symbol_get_info",57,57,72,1056813966806300,1056813966806340
+9 -8
@@ -1,9 +1,10 @@
"Kind","Agent_Id","Queue_Id","Stream_Id","Thread_Id","Dispatch_Id","Kernel_Id","Kernel_Name","Correlation_Id","Start_Timestamp","End_Timestamp","Private_Segment_Size","Group_Segment_Size","Workgroup_Size_X","Workgroup_Size_Y","Workgroup_Size_Z","Grid_Size_X","Grid_Size_Y","Grid_Size_Z"
"KERNEL_DISPATCH",2,1,1,21228,1,11,"void addition_kernel<float>(float*, float const*, float const*, int, int)",1937,2414192765353337,2414192765369494,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH",2,1,1,21228,2,14,"subtract_kernel(float*, float const*, float const*, int, int)",1945,2414192765424862,2414192765435326,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH",2,1,1,21228,3,13,"multiply_kernel(float*, float const*, float const*, int, int)",1953,2414192765487486,2414192765497669,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH",2,1,1,21228,4,12,"divide_kernel(float*, float const*, float const*, int, int)",1961,2414192765545619,2414192765555722,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH",2,2,2,21228,5,11,"void addition_kernel<float>(float*, float const*, float const*, int, int)",1969,2414192765608844,2414192765621674,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH",2,2,2,21228,6,14,"subtract_kernel(float*, float const*, float const*, int, int)",1977,2414192765658519,2414192765669424,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH",2,2,2,21228,7,13,"multiply_kernel(float*, float const*, float const*, int, int)",1985,2414192765715650,2414192765726795,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH",2,2,2,21228,8,12,"divide_kernel(float*, float const*, float const*, int, int)",1993,2414192765773422,2414192765784969,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",3,4,77,9,12,"void addition_kernel<float>(float*, float const*, float const*, int, int)",9,1057678125307992,1057678125319352,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",2,3,77,8,11,"divide_kernel(float*, float const*, float const*, int, int)",8,1057678125310912,1057678125319952,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",2,3,77,7,13,"multiply_kernel(float*, float const*, float const*, int, int)",7,1057678125298952,1057678125310912,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",2,3,77,6,14,"subtract_kernel(float*, float const*, float const*, int, int)",6,1057678125288912,1057678125298952,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",2,3,77,5,12,"void addition_kernel<float>(float*, float const*, float const*, int, int)",5,1057678125279112,1057678125288912,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",3,4,77,10,14,"subtract_kernel(float*, float const*, float const*, int, int)",10,1057678125319352,1057678125328912,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",3,4,77,12,11,"divide_kernel(float*, float const*, float const*, int, int)",12,1057678125356832,1057678125366712,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",3,4,77,11,13,"multiply_kernel(float*, float const*, float const*, int, int)",11,1057678125344352,1057678125356832,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",4,5,77,13,12,"void addition_kernel<float>(float*, float const*, float const*, int, int)",13,1057678125382792,1057678125394032,0,0,64,1,1,1024,1024,1
@@ -0,0 +1,5 @@
"Kind","Agent_Id","Queue_Id","Stream_Id","Thread_Id","Dispatch_Id","Kernel_Id","Kernel_Name","Correlation_Id","Start_Timestamp","End_Timestamp","Private_Segment_Size","Group_Segment_Size","Workgroup_Size_X","Workgroup_Size_Y","Workgroup_Size_Z","Grid_Size_X","Grid_Size_Y","Grid_Size_Z"
"KERNEL_DISPATCH","Agent 4",1,2,123,1,11,"addition_kernel",1,1143263119836160,1143263119848360,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",2,3,123,6,14,"subtract_kernel",6,1143263119886120,1143263119896040,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",2,3,123,5,11,"addition_kernel",5,1143263119876240,1143263119886120,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",1,2,123,4,12,"divide_kernel",4,1143263119866920,1143263119878960,0,0,64,1,1,1024,1024,1
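The rows in the kernel trace above are not in chronological order (dispatch 6 is listed before dispatches 5 and 4), so a consumer that needs a timeline may want to sort by `Start_Timestamp` first. The following is a minimal sketch using only the standard library, with the four rows above copied into an inline string. `load_dispatches` is a hypothetical helper name.

```python
import csv
import io

# Header and rows copied from the kernel trace CSV shown above.
SAMPLE = '''"Kind","Agent_Id","Queue_Id","Stream_Id","Thread_Id","Dispatch_Id","Kernel_Id","Kernel_Name","Correlation_Id","Start_Timestamp","End_Timestamp","Private_Segment_Size","Group_Segment_Size","Workgroup_Size_X","Workgroup_Size_Y","Workgroup_Size_Z","Grid_Size_X","Grid_Size_Y","Grid_Size_Z"
"KERNEL_DISPATCH","Agent 4",1,2,123,1,11,"addition_kernel",1,1143263119836160,1143263119848360,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",2,3,123,6,14,"subtract_kernel",6,1143263119886120,1143263119896040,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",2,3,123,5,11,"addition_kernel",5,1143263119876240,1143263119886120,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",1,2,123,4,12,"divide_kernel",4,1143263119866920,1143263119878960,0,0,64,1,1,1024,1024,1
'''


def load_dispatches(text: str) -> list:
    """Parse kernel dispatch rows and return them sorted by start time."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        start = int(row["Start_Timestamp"])
        end = int(row["End_Timestamp"])
        rows.append({
            "dispatch_id": int(row["Dispatch_Id"]),
            "kernel": row["Kernel_Name"],
            "agent": row["Agent_Id"],
            "start": start,
            "duration": end - start,
        })
    return sorted(rows, key=lambda r: r["start"])


dispatches = load_dispatches(SAMPLE)
for d in dispatches:
    print(d["dispatch_id"], d["kernel"], d["agent"], d["duration"])
```

Note that `Agent_Id` is a string such as `Agent 4` in this output, so it should not be parsed as an integer.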
@@ -1,7 +1,7 @@
"Kind","Operation","Agent_Id","Allocation_Size","Address","Correlation_Id","Start_Timestamp","End_Timestamp"
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_ALLOCATE",0,1024,0x7fb2d0005000,11,3721742710532634,3721742710584854
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_FREE",0,0,0x7fb2d0005000,12,3721742710596404,3721742710933366
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_ALLOCATE",0,1024,0x7fb2d0005000,13,3721742710941416,3721742710960916
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_FREE",0,0,0x7fb2d0005000,14,3721742710967236,3721742711197647
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_ALLOCATE",0,1024,0x7fb2d0005000,15,3721742711204077,3721742711219717
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_FREE",0,0,0x7fb2d0005000,16,3721742711225857,3721742711466018
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_ALLOCATE",Agent 0,1024,0x7fb2d0005000,11,3721742710532634,3721742710584854
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_FREE",Agent 0,0,0x7fb2d0005000,12,3721742710596404,3721742710933366
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_ALLOCATE",Agent 0,1024,0x7fb2d0005000,13,3721742710941416,3721742710960916
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_FREE",Agent 0,0,0x7fb2d0005000,14,3721742710967236,3721742711197647
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_ALLOCATE",Agent 0,1024,0x7fb2d0005000,15,3721742711204077,3721742711219717
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_FREE",Agent 0,0,0x7fb2d0005000,16,3721742711225857,3721742711466018
+4 -4
@@ -1,5 +1,5 @@
"Kind","Direction","Stream_Id","Source_Agent_Id","Destination_Agent_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"MEMORY_COPY","MEMORY_COPY_HOST_TO_DEVICE",1,0,2,952,2414192684609085,2414192684710679
"MEMORY_COPY","MEMORY_COPY_HOST_TO_DEVICE",1,0,2,960,2414192684873841,2414192684973470
"MEMORY_COPY","MEMORY_COPY_HOST_TO_DEVICE",2,0,2,1066,2414192706436949,2414192706538622
"MEMORY_COPY","MEMORY_COPY_HOST_TO_DEVICE",2,0,2,1074,2414192706592442,2414192706692312
"MEMORY_COPY","MEMORY_COPY_HOST_TO_DEVICE",0,"Agent 0","Agent 4",1,1057963336487172,1057963336564212
"MEMORY_COPY","MEMORY_COPY_HOST_TO_DEVICE",0,"Agent 0","Agent 4",2,1057963336783973,1057963336859334
"MEMORY_COPY","MEMORY_COPY_DEVICE_TO_HOST",0,"Agent 4","Agent 0",23,1057963497396292,1057963497471732
"MEMORY_COPY","MEMORY_COPY_DEVICE_TO_HOST",0,"Agent 4","Agent 0",24,1057963498099125,1057963498200446
Binary file not shown (size: 174 KiB).

Binary file not shown (size: 124 KiB).

Binary file not shown (size: 124 KiB).

+11
@@ -0,0 +1,11 @@
============================================ ROCm System Management Interface ============================================
====================================================== Concise Info ======================================================
Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
(DID, GUID) (Junction) (Socket) (Mem, Compute, ID)
==========================================================================================================================
0 4 0x74a0, 50375 48.0°C 110.0W NPS1, SPX, 0 98Mhz 1300Mhz 0% auto 550.0W 0% 0%
1 5 0x74a0, 20890 53.0°C 113.0W NPS1, SPX, 0 99Mhz 1200Mhz 0% auto 550.0W 0% 0%
2 6 0x74a0, 44670 52.0°C 125.0W NPS1, SPX, 0 100Mhz 1300Mhz 0% auto 550.0W 0% 0%
3 7 0x74a0, 15139 47.0°C 115.0W NPS1, SPX, 0 100Mhz 1300Mhz 0% auto 550.0W 0% 0%
==========================================================================================================================
================================================== End of ROCm SMI Log ===================================================
@@ -0,0 +1,3 @@
"Kind","Operation","Agent_Id","Queue_Id","Thread_Id","Alloc_Flags","Start_Timestamp","End_Timestamp"
"SCRATCH_MEMORY","SCRATCH_MEMORY_ALLOC","Agent 4",1,113,0,1124926523146168,1124926554133606
"SCRATCH_MEMORY","SCRATCH_MEMORY_ALLOC","Agent 4",1,113,0,1124926554522025,1124927132642186
Binary file not shown. (Size: 148 KiB)
Binary file not shown. (Size: 124 KiB)
+154
View File
@@ -0,0 +1,154 @@
ROCPROFV3 HSA_API SUMMARY:
| NAME | DOMAIN | CALLS | DURATION (nsec) | AVERAGE (nsec) | PERCENT (INC) | MIN (nsec) | MAX (nsec) | STDDEV |
|-------------------------------------------|--------------|-----------------|-----------------|-----------------|---------------|-----------------|-----------------|-----------------|
| hsa_queue_create | HSA_API | 4 | 280077621 | 7.002e+07 | 75.372632 | 55026812 | 113288760 | 2.885e+07 |
| hsa_amd_memory_async_copy_on_engine | HSA_API | 24 | 55617052 | 2.317e+06 | 14.967292 | 7580 | 55195188 | 1.126e+07 |
| hsa_amd_memory_pool_allocate | HSA_API | 67 | 26428438 | 3.945e+05 | 7.112246 | 1510 | 857592 | 1.782e+05 |
| hsa_amd_memory_pool_free | HSA_API | 72 | 5176173 | 7.189e+04 | 1.392977 | 290 | 170374 | 3.903e+04 |
| hsa_executable_freeze | HSA_API | 2 | 964125 | 4.821e+05 | 0.259459 | 437471 | 526654 | 6.306e+04 |
| hsa_signal_wait_scacquire | HSA_API | 26 | 853122 | 3.281e+04 | 0.229587 | 2530 | 100782 | 3.394e+04 |
| hsa_executable_load_agent_code_object | HSA_API | 2 | 616175 | 3.081e+05 | 0.165821 | 254476 | 361699 | 7.582e+04 |
| hsa_amd_agents_allow_access | HSA_API | 35 | 430680 | 1.231e+04 | 0.115902 | 4830 | 55182 | 9.939e+03 |
| hsa_signal_store_screlease | HSA_API | 56 | 381491 | 6.812e+03 | 0.102664 | 1560 | 41831 | 7.895e+03 |
| hsa_signal_create | HSA_API | 107 | 160889 | 1.504e+03 | 0.043297 | 80 | 5650 | 1.475e+03 |
| hsa_code_object_reader_create_from_memory | HSA_API | 2 | 151314 | 7.566e+04 | 0.040721 | 32121 | 119193 | 6.157e+04 |
| hsa_signal_load_relaxed | HSA_API | 1296 | 137626 | 1.062e+02 | 0.037037 | 20 | 2930 | 2.712e+02 |
| hsa_signal_destroy | HSA_API | 618 | 111224 | 1.800e+02 | 0.029932 | 40 | 1540 | 2.429e+02 |
| hsa_agent_get_info | HSA_API | 65 | 77472 | 1.192e+03 | 0.020849 | 30 | 47121 | 6.341e+03 |
| hsa_amd_signal_create | HSA_API | 512 | 61290 | 1.197e+02 | 0.016494 | 40 | 930 | 1.559e+02 |
| hsa_amd_signal_async_handler | HSA_API | 24 | 52641 | 2.193e+03 | 0.014166 | 1180 | 4020 | 9.252e+02 |
| hsa_executable_iterate_symbols | HSA_API | 14 | 52521 | 3.752e+03 | 0.014134 | 2740 | 6940 | 1.105e+03 |
| hsa_amd_memory_copy_engine_status | HSA_API | 18 | 47370 | 2.632e+03 | 0.012748 | 260 | 7990 | 2.274e+03 |
| hsa_iterate_agents | HSA_API | 1 | 41391 | 4.139e+04 | 0.011139 | 41391 | 41391 | 0.000e+00 |
| hsa_executable_create_alt | HSA_API | 2 | 40470 | 2.024e+04 | 0.010891 | 7530 | 32940 | 1.797e+04 |
| hsa_isa_get_info_alt | HSA_API | 2 | 30391 | 1.520e+04 | 0.008179 | 2490 | 27901 | 1.797e+04 |
| hsa_signal_silent_store_relaxed | HSA_API | 48 | 24920 | 5.192e+02 | 0.006706 | 20 | 4570 | 7.120e+02 |
| hsa_amd_agent_iterate_memory_pools | HSA_API | 5 | 20221 | 4.044e+03 | 0.005442 | 2561 | 8600 | 2.574e+03 |
| hsa_queue_add_write_index_screlease | HSA_API | 56 | 7270 | 1.298e+02 | 0.001956 | 30 | 2310 | 3.471e+02 |
| hsa_amd_profiling_set_profiler_enabled | HSA_API | 4 | 5600 | 1.400e+03 | 0.001507 | 1370 | 1470 | 4.690e+01 |
| hsa_executable_symbol_get_info | HSA_API | 152 | 5470 | 3.599e+01 | 0.001472 | 30 | 340 | 3.563e+01 |
| hsa_queue_load_read_index_relaxed | HSA_API | 56 | 4560 | 8.143e+01 | 0.001227 | 20 | 1310 | 1.863e+02 |
| hsa_executable_get_symbol_by_name | HSA_API | 14 | 4500 | 3.214e+02 | 0.001211 | 110 | 1510 | 4.732e+02 |
| hsa_queue_load_read_index_scacquire | HSA_API | 56 | 3040 | 5.429e+01 | 0.000818 | 30 | 690 | 8.705e+01 |
| hsa_amd_memory_pool_get_info | HSA_API | 43 | 1770 | 4.116e+01 | 0.000476 | 30 | 270 | 3.640e+01 |
| hsa_system_get_info | HSA_API | 4 | 1750 | 4.375e+02 | 0.000471 | 40 | 830 | 3.544e+02 |
| hsa_amd_agent_memory_pool_get_info | HSA_API | 13 | 1140 | 8.769e+01 | 0.000307 | 30 | 640 | 1.664e+02 |
| hsa_agent_iterate_isas | HSA_API | 1 | 700 | 7.000e+02 | 0.000188 | 700 | 700 | 0.000e+00 |
| hsa_system_get_major_extension_table | HSA_API | 1 | 190 | 1.900e+02 | 0.000051 | 190 | 190 | 0.000e+00 |
ROCPROFV3 HIP_API SUMMARY:
| NAME | DOMAIN | CALLS | DURATION (nsec) | AVERAGE (nsec) | PERCENT (INC) | MIN (nsec) | MAX (nsec) | STDDEV |
|------------------------------------------|--------------|-----------------|-----------------|-----------------|---------------|-----------------|-----------------|-----------------|
| hipStreamCreateWithFlags | HIP_API | 8 | 406507215 | 5.081e+07 | 71.307804 | 735979 | 233800881 | 7.889e+07 |
| hipGetDeviceCount | HIP_API | 1 | 76707894 | 7.671e+07 | 13.455780 | 76707894 | 76707894 | 0.000e+00 |
| hipMemcpyAsync | HIP_API | 24 | 56109444 | 2.338e+06 | 9.842485 | 11640 | 55299811 | 1.128e+07 |
| hipHostMalloc | HIP_API | 24 | 13007523 | 5.420e+05 | 2.281726 | 416631 | 866382 | 1.206e+05 |
| hipMallocAsync | HIP_API | 24 | 7304847 | 3.044e+05 | 1.281386 | 275397 | 353719 | 2.207e+04 |
| hipHostFree | HIP_API | 24 | 2786484 | 1.161e+05 | 0.488793 | 72242 | 221646 | 4.606e+04 |
| hipStreamDestroy | HIP_API | 8 | 2137924 | 2.672e+05 | 0.375026 | 221596 | 377469 | 5.489e+04 |
| hipLaunchKernel | HIP_API | 32 | 2080214 | 6.501e+04 | 0.364902 | 8850 | 1608721 | 2.819e+05 |
| hipFree | HIP_API | 24 | 1572948 | 6.554e+04 | 0.275920 | 2130 | 186994 | 4.815e+04 |
| hipStreamSynchronize | HIP_API | 24 | 1452706 | 6.053e+04 | 0.254828 | 20810 | 135803 | 3.469e+04 |
| __hipRegisterFunction | HIP_API | 4 | 294207 | 7.355e+04 | 0.051609 | 210 | 291807 | 1.455e+05 |
| hipDeviceSynchronize | HIP_API | 4 | 50663 | 1.267e+04 | 0.008887 | 510 | 23621 | 9.554e+03 |
| __hipRegisterFatBinary | HIP_API | 1 | 43811 | 4.381e+04 | 0.007685 | 43811 | 43811 | 0.000e+00 |
| __hipPushCallConfiguration | HIP_API | 32 | 6250 | 1.953e+02 | 0.001096 | 60 | 3640 | 6.308e+02 |
| __hipPopCallConfiguration | HIP_API | 32 | 4780 | 1.494e+02 | 0.000838 | 60 | 2520 | 4.340e+02 |
| hipGetLastError | HIP_API | 32 | 4471 | 1.397e+02 | 0.000784 | 60 | 2381 | 4.092e+02 |
| hipSetDevice | HIP_API | 1 | 2570 | 2.570e+03 | 0.000451 | 2570 | 2570 | 0.000e+00 |
ROCPROFV3 KERNEL_DISPATCH SUMMARY:
| NAME | DOMAIN | CALLS | DURATION (nsec) | AVERAGE (nsec) | PERCENT (INC) | MIN (nsec) | MAX (nsec) | STDDEV |
|---------------------------------------------------------------------------|-----------------|-----------------|-----------------|-----------------|---------------|-----------------|-----------------|-----------------|
| void addition_kernel<float>(float*, float const*, float const*, int, int) | KERNEL_DISPATCH | 8 | 184324 | 2.304e+04 | 40.681542 | 11200 | 98802 | 3.062e+04 |
| divide_kernel(float*, float const*, float const*, int, int) | KERNEL_DISPATCH | 8 | 94482 | 1.181e+04 | 20.852811 | 10240 | 13520 | 1.061e+03 |
| multiply_kernel(float*, float const*, float const*, int, int) | KERNEL_DISPATCH | 8 | 91763 | 1.147e+04 | 20.252709 | 9800 | 12800 | 9.417e+02 |
| subtract_kernel(float*, float const*, float const*, int, int) | KERNEL_DISPATCH | 8 | 82521 | 1.032e+04 | 18.212938 | 8320 | 12920 | 1.436e+03 |
ROCPROFV3 MEMORY_COPY SUMMARY:
| NAME | DOMAIN | CALLS | DURATION (nsec) | AVERAGE (nsec) | PERCENT (INC) | MIN (nsec) | MAX (nsec) | STDDEV |
|------------------------------------------|--------------|-----------------|-----------------|-----------------|---------------|-----------------|-----------------|-----------------|
| MEMORY_COPY_HOST_TO_DEVICE | MEMORY_COPY | 16 | 3691929 | 2.307e+05 | 85.494053 | 74842 | 284487 | 6.265e+04 |
| MEMORY_COPY_DEVICE_TO_HOST | MEMORY_COPY | 8 | 626417 | 7.830e+04 | 14.505947 | 74842 | 98603 | 8.207e+03 |
ROCPROFV3 MEMORY_ALLOCATION SUMMARY:
| NAME | DOMAIN | CALLS | DURATION (nsec) | AVERAGE (nsec) | PERCENT (INC) | MIN (nsec) | MAX (nsec) | STDDEV |
|------------------------------------------|-------------------|-----------------|-----------------|-----------------|---------------|-----------------|-----------------|-----------------|
| MEMORY_ALLOCATION_ALLOCATE | MEMORY_ALLOCATION | 67 | 26314096 | 3.927e+05 | 83.661617 | 950 | 856812 | 1.785e+05 |
| MEMORY_ALLOCATION_FREE | MEMORY_ALLOCATION | 72 | 5138913 | 7.137e+04 | 16.338383 | 20 | 166234 | 3.882e+04 |
ROCPROFV3 SUMMARY:
| NAME | DOMAIN | CALLS | DURATION (nsec) | AVERAGE (nsec) | PERCENT (INC) | MIN (nsec) | MAX (nsec) | STDDEV |
|---------------------------------------------------------------------------|-------------------|-----------------|-----------------|-----------------|---------------|-----------------|-----------------|-----------------|
| hipStreamCreateWithFlags | HIP_API | 8 | 406507215 | 5.081e+07 | 41.569873 | 735979 | 233800881 | 7.889e+07 |
| hsa_queue_create | HSA_API | 4 | 280077621 | 7.002e+07 | 28.641044 | 55026812 | 113288760 | 2.885e+07 |
| hipGetDeviceCount | HIP_API | 1 | 76707894 | 7.671e+07 | 7.844233 | 76707894 | 76707894 | 0.000e+00 |
| hipMemcpyAsync | HIP_API | 24 | 56109444 | 2.338e+06 | 5.737813 | 11640 | 55299811 | 1.128e+07 |
| hsa_amd_memory_async_copy_on_engine | HSA_API | 24 | 55617052 | 2.317e+06 | 5.687461 | 7580 | 55195188 | 1.126e+07 |
| hsa_amd_memory_pool_allocate | HSA_API | 67 | 26428438 | 3.945e+05 | 2.702601 | 1510 | 857592 | 1.782e+05 |
| MEMORY_ALLOCATION_ALLOCATE | MEMORY_ALLOCATION | 67 | 26314096 | 3.927e+05 | 2.690908 | 950 | 856812 | 1.785e+05 |
| hipHostMalloc | HIP_API | 24 | 13007523 | 5.420e+05 | 1.330164 | 416631 | 866382 | 1.206e+05 |
| hipMallocAsync | HIP_API | 24 | 7304847 | 3.044e+05 | 0.747002 | 275397 | 353719 | 2.207e+04 |
| hsa_amd_memory_pool_free | HSA_API | 72 | 5176173 | 7.189e+04 | 0.529321 | 290 | 170374 | 3.903e+04 |
| MEMORY_ALLOCATION_FREE | MEMORY_ALLOCATION | 72 | 5138913 | 7.137e+04 | 0.525511 | 20 | 166234 | 3.882e+04 |
| MEMORY_COPY_HOST_TO_DEVICE | MEMORY_COPY | 16 | 3691929 | 2.307e+05 | 0.377541 | 74842 | 284487 | 6.265e+04 |
| hipHostFree | HIP_API | 24 | 2786484 | 1.161e+05 | 0.284949 | 72242 | 221646 | 4.606e+04 |
| hipStreamDestroy | HIP_API | 8 | 2137924 | 2.672e+05 | 0.218626 | 221596 | 377469 | 5.489e+04 |
| hipLaunchKernel | HIP_API | 32 | 2080214 | 6.501e+04 | 0.212725 | 8850 | 1608721 | 2.819e+05 |
| hipFree | HIP_API | 24 | 1572948 | 6.554e+04 | 0.160851 | 2130 | 186994 | 4.815e+04 |
| hipStreamSynchronize | HIP_API | 24 | 1452706 | 6.053e+04 | 0.148555 | 20810 | 135803 | 3.469e+04 |
| hsa_executable_freeze | HSA_API | 2 | 964125 | 4.821e+05 | 0.098592 | 437471 | 526654 | 6.306e+04 |
| hsa_signal_wait_scacquire | HSA_API | 26 | 853122 | 3.281e+04 | 0.087241 | 2530 | 100782 | 3.394e+04 |
| MEMORY_COPY_DEVICE_TO_HOST | MEMORY_COPY | 8 | 626417 | 7.830e+04 | 0.064058 | 74842 | 98603 | 8.207e+03 |
| hsa_executable_load_agent_code_object | HSA_API | 2 | 616175 | 3.081e+05 | 0.063011 | 254476 | 361699 | 7.582e+04 |
| hsa_amd_agents_allow_access | HSA_API | 35 | 430680 | 1.231e+04 | 0.044042 | 4830 | 55182 | 9.939e+03 |
| hsa_signal_store_screlease | HSA_API | 56 | 381491 | 6.812e+03 | 0.039012 | 1560 | 41831 | 7.895e+03 |
| __hipRegisterFunction | HIP_API | 4 | 294207 | 7.355e+04 | 0.030086 | 210 | 291807 | 1.455e+05 |
| void addition_kernel<float>(float*, float const*, float const*, int, int) | KERNEL_DISPATCH | 8 | 184324 | 2.304e+04 | 0.018849 | 11200 | 98802 | 3.062e+04 |
| hsa_signal_create | HSA_API | 107 | 160889 | 1.504e+03 | 0.016453 | 80 | 5650 | 1.475e+03 |
| hsa_code_object_reader_create_from_memory | HSA_API | 2 | 151314 | 7.566e+04 | 0.015474 | 32121 | 119193 | 6.157e+04 |
| hsa_signal_load_relaxed | HSA_API | 1296 | 137626 | 1.062e+02 | 0.014074 | 20 | 2930 | 2.712e+02 |
| hsa_signal_destroy | HSA_API | 618 | 111224 | 1.800e+02 | 0.011374 | 40 | 1540 | 2.429e+02 |
| divide_kernel(float*, float const*, float const*, int, int) | KERNEL_DISPATCH | 8 | 94482 | 1.181e+04 | 0.009662 | 10240 | 13520 | 1.061e+03 |
| multiply_kernel(float*, float const*, float const*, int, int) | KERNEL_DISPATCH | 8 | 91763 | 1.147e+04 | 0.009384 | 9800 | 12800 | 9.417e+02 |
| subtract_kernel(float*, float const*, float const*, int, int) | KERNEL_DISPATCH | 8 | 82521 | 1.032e+04 | 0.008439 | 8320 | 12920 | 1.436e+03 |
| hsa_agent_get_info | HSA_API | 65 | 77472 | 1.192e+03 | 0.007922 | 30 | 47121 | 6.341e+03 |
| hsa_amd_signal_create | HSA_API | 512 | 61290 | 1.197e+02 | 0.006268 | 40 | 930 | 1.559e+02 |
| hsa_amd_signal_async_handler | HSA_API | 24 | 52641 | 2.193e+03 | 0.005383 | 1180 | 4020 | 9.252e+02 |
| hsa_executable_iterate_symbols | HSA_API | 14 | 52521 | 3.752e+03 | 0.005371 | 2740 | 6940 | 1.105e+03 |
| hipDeviceSynchronize | HIP_API | 4 | 50663 | 1.267e+04 | 0.005181 | 510 | 23621 | 9.554e+03 |
| hsa_amd_memory_copy_engine_status | HSA_API | 18 | 47370 | 2.632e+03 | 0.004844 | 260 | 7990 | 2.274e+03 |
| __hipRegisterFatBinary | HIP_API | 1 | 43811 | 4.381e+04 | 0.004480 | 43811 | 43811 | 0.000e+00 |
| hsa_iterate_agents | HSA_API | 1 | 41391 | 4.139e+04 | 0.004233 | 41391 | 41391 | 0.000e+00 |
| hsa_executable_create_alt | HSA_API | 2 | 40470 | 2.024e+04 | 0.004139 | 7530 | 32940 | 1.797e+04 |
| hsa_isa_get_info_alt | HSA_API | 2 | 30391 | 1.520e+04 | 0.003108 | 2490 | 27901 | 1.797e+04 |
| hsa_signal_silent_store_relaxed | HSA_API | 48 | 24920 | 5.192e+02 | 0.002548 | 20 | 4570 | 7.120e+02 |
| hsa_amd_agent_iterate_memory_pools | HSA_API | 5 | 20221 | 4.044e+03 | 0.002068 | 2561 | 8600 | 2.574e+03 |
| hsa_queue_add_write_index_screlease | HSA_API | 56 | 7270 | 1.298e+02 | 0.000743 | 30 | 2310 | 3.471e+02 |
| __hipPushCallConfiguration | HIP_API | 32 | 6250 | 1.953e+02 | 0.000639 | 60 | 3640 | 6.308e+02 |
| hsa_amd_profiling_set_profiler_enabled | HSA_API | 4 | 5600 | 1.400e+03 | 0.000573 | 1370 | 1470 | 4.690e+01 |
| hsa_executable_symbol_get_info | HSA_API | 152 | 5470 | 3.599e+01 | 0.000559 | 30 | 340 | 3.563e+01 |
| __hipPopCallConfiguration | HIP_API | 32 | 4780 | 1.494e+02 | 0.000489 | 60 | 2520 | 4.340e+02 |
| hsa_queue_load_read_index_relaxed | HSA_API | 56 | 4560 | 8.143e+01 | 0.000466 | 20 | 1310 | 1.863e+02 |
| hsa_executable_get_symbol_by_name | HSA_API | 14 | 4500 | 3.214e+02 | 0.000460 | 110 | 1510 | 4.732e+02 |
| hipGetLastError | HIP_API | 32 | 4471 | 1.397e+02 | 0.000457 | 60 | 2381 | 4.092e+02 |
| hsa_queue_load_read_index_scacquire | HSA_API | 56 | 3040 | 5.429e+01 | 0.000311 | 30 | 690 | 8.705e+01 |
| hipSetDevice | HIP_API | 1 | 2570 | 2.570e+03 | 0.000263 | 2570 | 2570 | 0.000e+00 |
| hsa_amd_memory_pool_get_info | HSA_API | 43 | 1770 | 4.116e+01 | 0.000181 | 30 | 270 | 3.640e+01 |
| hsa_system_get_info | HSA_API | 4 | 1750 | 4.375e+02 | 0.000179 | 40 | 830 | 3.544e+02 |
| hsa_amd_agent_memory_pool_get_info | HSA_API | 13 | 1140 | 8.769e+01 | 0.000117 | 30 | 640 | 1.664e+02 |
| hsa_agent_iterate_isas | HSA_API | 1 | 700 | 7.000e+02 | 0.000072 | 700 | 700 | 0.000e+00 |
| hsa_system_get_major_extension_table | HSA_API | 1 | 190 | 1.900e+02 | 0.000019 | 190 | 190 | 0.000e+00 |
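In the summary tables above, the ``AVERAGE (nsec)`` column appears to be ``DURATION (nsec)`` divided by ``CALLS``. The following is a minimal sketch, assuming the pipe-delimited layout shown above, that parses one row and recomputes the mean. The ``parse_summary_row`` helper is hypothetical and is not part of ``rocprofv3``.

```python
def parse_summary_row(line):
    """Parse one pipe-delimited summary row into a dict (hypothetical helper)."""
    # Drop the outer pipes, then split the nine columns on the inner pipes.
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    name, domain, calls, duration, average, percent, min_ns, max_ns, stddev = cells
    return {
        "name": name,
        "domain": domain,
        "calls": int(calls),
        "duration_ns": int(duration),
        "average_ns": float(average),
        "percent": float(percent),
        "min_ns": int(min_ns),
        "max_ns": int(max_ns),
        "stddev": float(stddev),
    }


# Row copied from the HSA_API summary above.
row = parse_summary_row(
    "| hsa_queue_create | HSA_API | 4 | 280077621 | 7.002e+07 "
    "| 75.372632 | 55026812 | 113288760 | 2.885e+07 |"
)

# Recompute the mean from the total duration and the call count.
mean_ns = row["duration_ns"] / row["calls"]
print(f"{row['name']}: mean {mean_ns:.3e} ns, reported {row['average_ns']:.3e} ns")
```

The same helper can be applied to every row of a domain to cross-check the reported averages, provided that no kernel or function name contains a ``|`` character.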
+340 -61
View File
@@ -190,6 +190,7 @@ HIP trace
+++++++++++
HIP trace comprises execution traces for the entire application at the HIP level. This includes HIP API functions and their asynchronous activities at the runtime level. In general, HIP APIs directly interact with the user program. It is easier to analyze HIP traces as you can directly map them to the program.
Unlike previous iterations of ``rocprof``, this option does not enable kernel tracing, memory copy tracing, and so on. To collect those traces, specify the corresponding options explicitly.
To trace HIP runtime APIs, use:
@@ -205,12 +206,15 @@ The preceding command generates a ``hip_api_trace.csv`` file prefixed with the p
Here are the contents of ``hip_api_trace.csv`` file:
.. csv-table:: HIP runtime api trace
:file: /data/hip_compile_trace.csv
.. csv-table:: HIP api trace
:file: /data/hip_trace.csv
:widths: 10,10,10,10,10,20,20
:header-rows: 1
To trace HIP compile time APIs, use:
``rocprofv3`` provides options to collect traces at a more granular level. For HIP, you can collect traces for HIP compile-time APIs and runtime APIs separately.
To collect HIP compile time API traces, use:
.. code-block:: shell
@@ -229,6 +233,26 @@ Here are the contents of ``hip_api_trace.csv`` file:
:widths: 10,10,10,10,10,20,20
:header-rows: 1
To collect HIP runtime API traces, use:
.. code-block:: shell
rocprofv3 --hip-runtime-trace -- <application_path>
The preceding command generates a ``hip_api_trace.csv`` file prefixed with the process ID.
.. code-block:: shell
$ cat 208_hip_api_trace.csv
Here are the contents of ``hip_api_trace.csv`` file:
.. csv-table:: HIP runtime api trace
:file: /data/hip_runtime_trace.csv
:widths: 10,10,10,10,10,20,20
:header-rows: 1
For the description of the fields in the output file, see :ref:`output-file-fields`.
HSA trace
@@ -255,6 +279,28 @@ Here are the contents of ``hsa_api_trace.csv`` file:
:widths: 10,10,10,10,10,20,20
:header-rows: 1
``rocprofv3`` provides options to collect traces at a more granular level. HSA traces can be collected separately for four API domains: ``HSA_AMD_EXT_API``, ``HSA_CORE_API``, ``HSA_IMAGE_EXT_API``, and ``HSA_FINALIZE_EXT_API``.
To collect HSA core API traces, use:
.. code-block:: bash
rocprofv3 --hsa-core-trace -- <application_path>
The preceding command generates a ``hsa_api_trace.csv`` file prefixed with the process ID. Note that the contents of this file have been truncated for demonstration purposes.
.. code-block:: shell
$ cat 197_hsa_api_trace.csv
Here are the contents of ``hsa_api_trace.csv`` file:
.. csv-table:: HSA core api trace
:file: /data/hsa_core_api_trace.csv
:widths: 10,10,10,10,10,20,20
:header-rows: 1
For the description of the fields in the output file, see :ref:`output-file-fields`.
Marker trace
@@ -268,51 +314,6 @@ Marker trace
To see usage of ``ROCTx`` or marker library, see :ref:`using-rocprofiler-sdk-roctx`.
Kernel rename
++++++++++++++
The ``roctxRangePush`` and ``roctxRangePop`` also let you rename the enclosed kernel with the supplied message. In the legacy ``rocprof``, this functionality was known as ``--roctx-rename``.
See how to use ``roctxRangePush`` and ``roctxRangePop`` for renaming the enclosed kernel:
.. code-block:: cpp
#include <rocprofiler-sdk-roctx/roctx.h>
// Open the outer range; its message is used as the new kernel name
roctxRangePush("HIP_Kernel-1");
// Launching kernel from host
hipLaunchKernelGGL(matrixTranspose, dim3(WIDTH/THREADS_PER_BLOCK_X, WIDTH/THREADS_PER_BLOCK_Y), dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y), 0,0,gpuTransposeMatrix,gpuMatrix, WIDTH);
// Memory transfer from device to host, wrapped in a nested range
roctxRangePush("hipMemCpy-DeviceToHost");
hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost);
roctxRangePop(); // closes "hipMemCpy-DeviceToHost"
roctxRangePop(); // closes "HIP_Kernel-1"
To rename the kernel, use:
.. code-block:: bash
rocprofv3 --marker-trace --kernel-rename -- <application_path>
The preceding command generates a ``marker-trace`` file prefixed with the process ID.
.. code-block:: shell
$ cat 210_marker_api_trace.csv
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"MARKER_CORE_API","roctxGetThreadId",315155,315155,2,58378843928406,58378843930247
"MARKER_CONTROL_API","roctxProfilerPause",315155,315155,3,58378844627184,58378844627502
"MARKER_CONTROL_API","roctxProfilerResume",315155,315155,4,58378844638601,58378844639267
"MARKER_CORE_API","pre-kernel-launch",315155,315155,5,58378844641787,58378844641787
"MARKER_CORE_API","post-kernel-launch",315155,315155,6,58378844936586,58378844936586
"MARKER_CORE_API","memCopyDth",315155,315155,7,58378844938371,58378851383270
"MARKER_CORE_API","HIP_Kernel-1",315155,315155,1,58378526575735,58378851384485
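To see how long each marker range lasted, subtract ``Start_Timestamp`` from ``End_Timestamp``. The following is a minimal sketch that uses the rows shown above as inline sample data. The ``range_durations`` helper is hypothetical, and the timestamps are assumed to be in nanoseconds.

```python
import csv
import io

# Rows copied from the marker trace shown above.
SAMPLE = '''"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"MARKER_CORE_API","roctxGetThreadId",315155,315155,2,58378843928406,58378843930247
"MARKER_CONTROL_API","roctxProfilerPause",315155,315155,3,58378844627184,58378844627502
"MARKER_CONTROL_API","roctxProfilerResume",315155,315155,4,58378844638601,58378844639267
"MARKER_CORE_API","pre-kernel-launch",315155,315155,5,58378844641787,58378844641787
"MARKER_CORE_API","post-kernel-launch",315155,315155,6,58378844936586,58378844936586
"MARKER_CORE_API","memCopyDth",315155,315155,7,58378844938371,58378851383270
"MARKER_CORE_API","HIP_Kernel-1",315155,315155,1,58378526575735,58378851384485
'''


def range_durations(csv_text):
    """Return (function, duration) pairs sorted by duration, longest first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    pairs = [
        (row["Function"], int(row["End_Timestamp"]) - int(row["Start_Timestamp"]))
        for row in reader
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


result = range_durations(SAMPLE)
for function, duration in result:
    print(f"{function:<24} {duration:>12}")
```

To run the sketch on a real trace, replace ``SAMPLE`` with the contents of the generated file, for example ``open("210_marker_api_trace.csv").read()``.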
Kokkos trace
++++++++++++++
@@ -362,7 +363,7 @@ For the description of the fields in the output file, see :ref:`output-file-fiel
Memory copy trace
+++++++++++++++++++
To trace memory moves across the application, use:
Memory copy traces track the ``hipMemcpy`` and ``hipMemcpyAsync`` functions, which use the ``hsa_amd_memory_async_copy_on_engine`` HSA function internally. To trace memory moves across the application, use:
.. code-block:: shell
@@ -431,7 +432,7 @@ The HSA runtime API is excluded because it is a lower-level API upon which HIP a
thus, tends to be an implementation detail irrelevant to most users. Similarly, the HIP compiler API is also excluded for being an implementation detail as these functions are automatically inserted during HIP compilation.
``--runtime-trace`` traces the HIP runtime API, marker API, kernel dispatches, and
memory operations (copies and scratch).
memory operations (copies, allocations and scratch).
.. code-block:: shell
@@ -448,16 +449,33 @@ This is an all-inclusive option to collect HIP, HSA, kernel, memory copy, memory
rocprofv3 --sys-trace -- <application_path>
Running the above command generates ``hip_api_trace.csv``, ``hsa_api_trace.csv``, ``kernel_trace.csv``, ``memory_copy_trace.csv``, ``memory_allocation_trace.csv``, and ``marker_api_trace.csv`` (if files prefixed with the process ID.
Running the above command generates ``hip_api_trace.csv``, ``hsa_api_trace.csv``, ``kernel_trace.csv``, ``memory_copy_trace.csv``, ``scratch_memory_trace.csv``, ``memory_allocation_trace.csv``, and ``marker_api_trace.csv`` (if ``ROCTx`` APIs are specified in the application).
Scratch memory trace
++++++++++++++++++++++
This option collects scratch memory operation traces. Scratch is an address space on AMD GPUs roughly equivalent to the `local memory` in NVIDIA CUDA. The `local memory` in CUDA is a thread-local global memory with interleaved addressing, which is used for register spills or stack space. This option helps to trace when the ``rocr`` runtime allocates, frees, and tries to reclaim scratch memory.
To trace scratch memory allocations during the application run, use:
.. code-block:: shell
rocprofv3 --scratch-memory-trace -- <application_path>
rocprofv3 -scratch-memory-trace -- < app_path >
The preceding command generates a ``scratch_memory_trace.csv`` file prefixed with the process ID.
.. code-block:: shell
$ cat 100_scratch_memory_trace.csv
Here are the contents of ``scratch_memory_trace.csv`` file:
.. csv-table:: Scratch memory trace
:file: /data/scratch_memory_trace.csv
:widths: 10,10,10,10,10,10,20,20
:header-rows: 1
For the description of the fields in the output file, see :ref:`output-file-fields`.
RCCL trace
++++++++++++
@@ -607,6 +625,20 @@ To see a summary for ``MEMORY_COPY`` and ``HIP_API`` domains, use:
.. image:: /data/rocprofv3_hip_memcpy_summary.png
Summary output file
######################
This option specifies the output file for the summary. By default, the summary is displayed on stderr. To specify a different output file, use:
.. code-block:: shell
rocprofv3 -S -D --summary-output-file summary --sys-trace -- <application_path>
The preceding command generates a ``summary`` file displaying the summary of each domain. It also generates the files for the tracing types enabled by the ``--sys-trace`` option.
.. include:: /data/summary.txt
:literal:
Collecting traces using input file
++++++++++++++++++++++++++++++++++++
@@ -931,6 +963,36 @@ Here is the same sample in JSON format:
]
}
Perfetto visualization
-----------------------
Perfetto visualization for traces
+++++++++++++++++++++++++++++++++++++++++++++
Users can generate Perfetto trace files using the ``--output-format pftrace`` option. This allows users to visualize the traces in the Perfetto viewer.
Perfetto is a powerful open-source tracing tool that provides a comprehensive view of system performance. It allows you to visualize the collected traces in a user-friendly interface, making it easier to analyze and understand the performance characteristics of your application.
To generate a Perfetto trace file, use the ``--output-format pftrace`` option along with the desired tracing options. For example, to collect system traces and generate a Perfetto trace file, use:
.. code-block:: bash
rocprofv3 --sys-trace --output-format pftrace -- <application_path>
The generated Perfetto trace file can be opened in the Perfetto UI (https://ui.perfetto.dev/).
**Figure 1:** Generic Perfetto visualization
.. image:: /data/perfetto_generic.png
:width: 100%
:align: center
**Figure 2:** Visualization of ROCm flow data in Perfetto
.. image:: /data/perfetto_flow.png
:width: 100%
:align: center
Perfetto visualization for counter collection
+++++++++++++++++++++++++++++++++++++++++++++
@@ -940,18 +1002,22 @@ To generate a Perfetto trace file with counter data, use:
.. code-block:: shell
rocprofv3 --pmc SQ_WAVES GRBM_GUI_ACTIVE --output-format pftrace -- <application_path>
You can also combine this with other tracing options to correlate counter data with API and kernel execution:
.. code-block:: shell
rocprofv3 -s --pmc SQ_WAVES --output-format pftrace -- <application_path>
rocprofv3 --pmc SQ_WAVES GRBM_COUNT --output-format pftrace -- <application_path>
The generated Perfetto trace file can be opened in the Perfetto UI (https://ui.perfetto.dev/). In the viewer, performance counters will appear as counter tracks organized by agent, allowing you to visualize counter values changing over time alongside kernel executions and other traced activities.
You can also combine this with the system trace option to get a more comprehensive view of the system's performance. For example, use the following command to collect both system trace and performance counter data:
.. code-block:: bash
rocprofv3 --pmc SQ_WAVES GRBM_COUNT --sys-trace --output-format pftrace -- <application_path>
.. image:: /data/perfetto_counters.png
:width: 100%
:align: center
Agent info
++++++++++++
-----------
.. note::
All tracing and counter collection options generate an additional ``agent_info.csv`` file prefixed with the process ID.
@@ -966,6 +1032,125 @@ The ``agent_info.csv`` file contains information about the CPU or GPU the kernel
0,0,"CPU",24,0,0,0,0,0,0,0,0,1,24,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3800,0,0,0,0,0,0,23,0,0,0,0,0,0,0,0,0,0,0,"AMD Ryzen 9 3900X 12-Core Processor","CPU","AMD Ryzen 9 3900X 12-Core Processor",""
1,1,"GPU",0,256,0,2147487744,10,64,0,64,64,1,64,4,4,1,16,4,32,90000,4098,26751,12032,0,128,2,0,2,24,3800,1630,432,440,138420864,16,40,141,1024,4294967295,0,0,64700,1024,1024,1024,4294967295,4294967295,4294967295,"gfx900","AMD","Radeon RX Vega","vega10"
Advanced options
-----------------
Agent index
++++++++++++++
The agent index is a unique identifier for each agent in the system. It is used to identify the agent in the output files. Because each runtime or tool has its own way of representing agent indices, ``rocprofv3`` lets you configure the agent index used in the output files.
- **absolute** == *node_id*: absolute index of the agent regardless of cgroups masking. This is a monotonically increasing number that is incremented for every folder in ``/sys/class/kfd/kfd/topology/nodes``. For example, Agent-0, Agent-2, Agent-4.
- **relative** == *logical_node_id*: relative index of the agent accounting for cgroups masking. This is a monotonically increasing number that is incremented for every folder in ``/sys/class/kfd/kfd/topology/nodes/`` whose properties file is non-empty. For example, Agent-0, Agent-1, Agent-2.
- **type-relative** == *logical_node_type_id*: relative index of the agent accounting for cgroups masking, where indexing starts at zero for each agent type. For example, CPU-0, GPU-0, GPU-1.
To set the agent index in the output files, use the ``--agent-index`` option. The default value is ``absolute``.
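The three conventions can be illustrated with a small model. The following is a minimal sketch under stated assumptions: the topology list is hypothetical (four CPU nodes followed by four GPU nodes, with node 5 hidden by cgroups masking), and ``agent_labels`` is an illustrative helper rather than ``rocprofv3`` code.

```python
# Hypothetical topology: (node_id, agent_type, properties_file_non_empty).
# Node 5 is assumed to be masked, so its properties file is empty.
NODES = [
    (0, "CPU", True),
    (1, "CPU", True),
    (2, "CPU", True),
    (3, "CPU", True),
    (4, "GPU", True),
    (5, "GPU", False),
    (6, "GPU", True),
    (7, "GPU", True),
]


def agent_labels(nodes):
    """Map each visible node_id to its label under the three conventions."""
    labels = {}
    logical_node_id = 0  # counts visible nodes of any type
    next_type_id = {}    # counts visible nodes per agent type
    for node_id, agent_type, visible in nodes:
        if not visible:
            continue  # masked nodes receive no relative indices
        type_id = next_type_id.get(agent_type, 0)
        labels[node_id] = {
            "absolute": f"Agent {node_id}",
            "relative": f"Agent {logical_node_id}",
            "type-relative": f"{agent_type} {type_id}",
        }
        logical_node_id += 1
        next_type_id[agent_type] = type_id + 1
    return labels


labels = agent_labels(NODES)
for node_id, names in labels.items():
    print(node_id, names)
```

In this model, the last GPU keeps the absolute label ``Agent 7`` but becomes ``Agent 6`` under relative indexing and ``GPU 2`` under type-relative indexing, because the masked node is skipped.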
The following example shows how to set the agent index on a system with multiple GPUs and CPUs:
Here is the `rocm-smi` output:
.. include:: /data/rocm-smi.txt
:literal:
.. code-block:: shell
rocprofv3 --kernel-trace --agent-index=relative -- <application_path>
.. code-block:: shell
$ cat kernel_trace.csv
"Kind","Agent_Id","Queue_Id","Stream_Id","Thread_Id","Dispatch_Id","Kernel_Id","Kernel_Name","Correlation_Id","Start_Timestamp","End_Timestamp","Private_Segment_Size","Group_Segment_Size","Workgroup_Size_X","Workgroup_Size_Y","Workgroup_Size_Z","Grid_Size_X","Grid_Size_Y","Grid_Size_Z"
"KERNEL_DISPATCH","Agent 7",1,2,15044,1,17,"void addition_kernel<float>(float*, float const*, float const*, int, int)",1,1671247151691610,1671247151718010,0,0,64,1,1,1024,1024,1
.. code-block:: shell
rocprofv3 --kernel-trace --agent-index=type-relative -- <application_path>
.. code-block:: shell
$ cat kernel_trace.csv
"Kind","Agent_Id","Queue_Id","Stream_Id","Thread_Id","Dispatch_Id","Kernel_Id","Kernel_Name","Correlation_Id","Start_Timestamp","End_Timestamp","Private_Segment_Size","Group_Segment_Size","Workgroup_Size_X","Workgroup_Size_Y","Workgroup_Size_Z","Grid_Size_X","Grid_Size_Y","Grid_Size_Z"
"KERNEL_DISPATCH","GPU 3",1,2,15056,1,17,"void addition_kernel<float>(float*, float const*, float const*, int, int)",1,1671390884499766,1671390884525686,0,0,64,1,1,1024,1024,1
Group by queue
++++++++++++++++++
By default, ``rocprofv3`` shows the HIP streams to which kernel and memory copy operations were submitted when outputting a Perfetto trace. The ``--group-by-queue`` option displays the HSA queues to which these kernel and memory copy operations were submitted instead.
.. image:: /data/streams_pftrace.png
.. code-block:: shell
rocprofv3 -s --group-by-queue --output-format pftrace -- <application_path>
The above command generates a ``pftrace`` file with the kernel and memory copy operations grouped into HSA queues rather than HIP streams.
.. image:: /data/streams_pftrace_grouped.png
Kernel naming and filtering
----------------------------
Kernel name mangling
++++++++++++++++++++++
In ``rocprofv3`` output, kernel names are demangled by default, which improves the readability of the collected output.
If you want to see the mangled kernel names, you can disable this feature by using the ``--mangled-kernels`` option.
By default, the kernel trace looks like this:
.. code-block:: shell
$ cat 123_kernel_trace.csv
"Kind","Agent_Id","Queue_Id","Stream_Id","Thread_Id","Dispatch_Id","Kernel_Id","Kernel_Name","Correlation_Id","Start_Timestamp","End_Timestamp","Private_Segment_Size","Group_Segment_Size","Workgroup_Size_X","Workgroup_Size_Y","Workgroup_Size_Z","Grid_Size_X","Grid_Size_Y","Grid_Size_Z"
"KERNEL_DISPATCH","Agent 4",1,2,123,1,11,"addition_kernel<float>(float*, float const*, float const*, int, int)",1,1143263119836160,1143263119848360,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",2,3,123,6,14,"subtract_kernel(float*, float const*, float const*, int, int)",6,1143263119886120,1143263119896040,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",2,3,123,5,11,"addition_kernel<float>(float*, float const*, float const*, int, int)",5,1143263119876240,1143263119886120,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",1,2,123,4,12,"divide_kernel(float*, float const*, float const*, int, int)",4,1143263119866920,1143263119878960,0,0,64,1,1,1024,1024,1
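The kernel trace columns can be used to derive per-dispatch figures such as the kernel duration and the number of workgroups launched. The following is a minimal sketch using the first row of the trace above. It assumes that timestamps are in nanoseconds and that ``Grid_Size_*`` counts work-items (the HSA convention), so the workgroup count per dimension is the grid size divided by the workgroup size, rounded up. The ``dispatch_stats`` helper is hypothetical.

```python
import csv
import io
import math

# Header and first data row copied from the kernel trace shown above.
SAMPLE = '''"Kind","Agent_Id","Queue_Id","Stream_Id","Thread_Id","Dispatch_Id","Kernel_Id","Kernel_Name","Correlation_Id","Start_Timestamp","End_Timestamp","Private_Segment_Size","Group_Segment_Size","Workgroup_Size_X","Workgroup_Size_Y","Workgroup_Size_Z","Grid_Size_X","Grid_Size_Y","Grid_Size_Z"
"KERNEL_DISPATCH","Agent 4",1,2,123,1,11,"addition_kernel<float>(float*, float const*, float const*, int, int)",1,1143263119836160,1143263119848360,0,0,64,1,1,1024,1024,1
'''


def dispatch_stats(row):
    """Return (duration, workgroup count) for one kernel dispatch row."""
    duration = int(row["End_Timestamp"]) - int(row["Start_Timestamp"])
    workgroups = 1
    for axis in ("X", "Y", "Z"):
        grid = int(row[f"Grid_Size_{axis}"])
        group = int(row[f"Workgroup_Size_{axis}"])
        # Assumption: grid sizes are expressed in work-items, not workgroups.
        workgroups *= math.ceil(grid / group)
    return duration, workgroups


row = next(csv.DictReader(io.StringIO(SAMPLE)))
duration_ns, workgroups = dispatch_stats(row)
print(row["Kernel_Name"], duration_ns, workgroups)
```

The ``csv`` module handles the quoted kernel name, which itself contains commas, so splitting the line on ``,`` by hand would give the wrong columns.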
To disable kernel name demangling, use:
.. code-block:: shell
rocprofv3 --mangled-kernels --kernel-trace -- <application_path>
The above command generates a ``kernel_trace.csv`` file with the mangled kernel names.
.. code-block:: shell
$ cat 123_kernel_trace.csv
"Kind","Agent_Id","Queue_Id","Stream_Id","Thread_Id","Dispatch_Id","Kernel_Id","Kernel_Name","Correlation_Id","Start_Timestamp","End_Timestamp","Private_Segment_Size","Group_Segment_Size","Workgroup_Size_X","Workgroup_Size_Y","Workgroup_Size_Z","Grid_Size_X","Grid_Size_Y","Grid_Size_Z"
"KERNEL_DISPATCH","Agent 4",1,2,123,1,11,"_Z15addition_kernelIfEvPT_PKfS3_ii.kd",1,1143263119836160,1143263119848360,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",2,3,123,6,14,"_Z15subtract_kernelPfPKfS1_ii.kd",6,1143263119886120,1143263119896040,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",2,3,123,5,11,"_Z15addition_kernelIfEvPT_PKfS3_ii.kd",5,1143263119876240,1143263119886120,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",1,2,123,4,12,"_Z13divide_kernelPfPKfS1_ii.kd",4,1143263119866920,1143263119878960,0,0,64,1,1,1024,1024,1
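If a trace was collected with ``--mangled-kernels``, the names can still be demangled afterwards. The following is a minimal sketch that assumes the GNU binutils ``c++filt`` tool may be installed and that the trailing ``.kd`` suffix should be removed before demangling. Both points are assumptions about the environment and are not something ``rocprofv3`` provides. The ``demangle`` helper is hypothetical.

```python
import shutil
import subprocess


def demangle(symbol):
    """Demangle a kernel symbol with c++filt when it is available."""
    # Assumption: the ".kd" suffix is not part of the mangled C++ name.
    base = symbol[:-3] if symbol.endswith(".kd") else symbol
    tool = shutil.which("c++filt")
    if tool is None:
        return base  # fall back to the mangled name without the suffix
    result = subprocess.run([tool, base], capture_output=True, text=True, check=False)
    return result.stdout.strip() or base


name = demangle("_Z15addition_kernelIfEvPT_PKfS3_ii.kd")
print(name)
```

When ``c++filt`` is present, the printed name should resemble the demangled names in the default kernel trace. Otherwise the sketch returns the mangled name with the suffix removed.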
Kernel name truncation
+++++++++++++++++++++++
Kernel name truncation is a feature that allows you to limit the length of kernel names in the output files. This is useful when dealing with long kernel names that can make the output files difficult to read.
To enable kernel name truncation, use the ``--truncate-kernels`` option.
.. code-block:: shell
rocprofv3 --truncate-kernels --kernel-trace -- <application_path>
The above command generates a ``kernel_trace.csv`` file with truncated kernel names.
.. csv-table:: Kernel trace truncated
:file: /data/kernel_trace_truncated.csv
:widths: 10,10,10,10,10,10,10,10,10,20,20,10,10,10,10,10,10,10,10
:header-rows: 1
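
Kernel trace CSV files such as the ones shown in this section can be post-processed with a short script. The following is a minimal sketch, not part of ``rocprofv3``. It assumes the ``Kernel_Name``, ``Start_Timestamp``, and ``End_Timestamp`` columns shown in the headers above. The helper name ``total_time_by_kernel`` is hypothetical, and the sample rows are copied from the mangled-name trace shown earlier, trimmed to the columns the sketch reads. It sums the time per kernel name and drops the ``.kd`` suffix that appears on the mangled names.

.. code-block:: python

    import csv
    import io
    from collections import defaultdict

    # Rows copied from the mangled-name trace, trimmed to the columns used here.
    SAMPLE = """\
    "Kind","Kernel_Name","Start_Timestamp","End_Timestamp"
    "KERNEL_DISPATCH","_Z15addition_kernelIfEvPT_PKfS3_ii.kd",1143263119836160,1143263119848360
    "KERNEL_DISPATCH","_Z15subtract_kernelPfPKfS1_ii.kd",1143263119886120,1143263119896040
    "KERNEL_DISPATCH","_Z15addition_kernelIfEvPT_PKfS3_ii.kd",1143263119876240,1143263119886120
    "KERNEL_DISPATCH","_Z13divide_kernelPfPKfS1_ii.kd",1143263119866920,1143263119878960
    """


    def total_time_by_kernel(csv_text):
        """Sum End_Timestamp - Start_Timestamp per kernel name (hypothetical helper)."""
        totals = defaultdict(int)
        for row in csv.DictReader(io.StringIO(csv_text)):
            name = row["Kernel_Name"]
            if name.endswith(".kd"):
                name = name[:-3]  # drop the ".kd" suffix seen on mangled names
            totals[name] += int(row["End_Timestamp"]) - int(row["Start_Timestamp"])
        return dict(totals)


    totals = total_time_by_kernel(SAMPLE)
    # Print kernels from the largest to the smallest total time.
    for name, ticks in sorted(totals.items(), key=lambda item: -item[1]):
        print(f"{ticks:>8}  {name}")

To run the same aggregation on a real trace, pass the contents of the generated file (for example, ``open("123_kernel_trace.csv").read()``) instead of ``SAMPLE``.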
Kernel filtering
+++++++++++++++++
@@ -995,8 +1180,55 @@ To collect counters for the kernels matching the filters specified in the preced
3,3,4,1,225049,225049,1048576,11,"multiply_kernel(float*, float const*, float const*, int, int)",64,0,0,8,0,16,"SQ_WAVES",16384.000000,317095767176998,317095767186678
4,4,4,1,225049,225049,1048576,12,"divide_kernel(float*, float const*, float const*, int, int)",64,0,0,12,4,16,"SQ_WAVES",16384.000000,317095767380718,317095767390878
Kernel rename
++++++++++++++
The ``roctxRangePush`` and ``roctxRangePop`` calls also let you rename the enclosed kernel with the supplied message. In the legacy ``rocprof``, this functionality was known as ``--roctx-rename``.
The following example uses ``roctxRangePush`` and ``roctxRangePop`` to rename the enclosed kernel:
.. code-block:: cpp

    #include <hip/hip_runtime.h>
    #include <rocprofiler-sdk-roctx/roctx.h>

    // Outer range: the kernel launched inside it is renamed with this message
    roctxRangePush("HIP_Kernel-1");

    // Launching kernel from host
    hipLaunchKernelGGL(matrixTranspose,
                       dim3(WIDTH / THREADS_PER_BLOCK_X, WIDTH / THREADS_PER_BLOCK_Y),
                       dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y),
                       0, 0, gpuTransposeMatrix, gpuMatrix, WIDTH);

    // Memory transfer from device to host, wrapped in a nested range
    roctxRangePush("hipMemCpy-DeviceToHost");
    hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost);
    roctxRangePop();  // closes "hipMemCpy-DeviceToHost"

    roctxRangePop();  // closes "HIP_Kernel-1"
To rename the kernel, use:
.. code-block:: bash
rocprofv3 --marker-trace --kernel-rename -- <application_path>
The preceding command generates a ``marker_api_trace.csv`` file prefixed with the process ID.
.. code-block:: shell
$ cat 210_marker_api_trace.csv
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"MARKER_CORE_API","roctxGetThreadId",315155,315155,2,58378843928406,58378843930247
"MARKER_CONTROL_API","roctxProfilerPause",315155,315155,3,58378844627184,58378844627502
"MARKER_CONTROL_API","roctxProfilerResume",315155,315155,4,58378844638601,58378844639267
"MARKER_CORE_API","pre-kernel-launch",315155,315155,5,58378844641787,58378844641787
"MARKER_CORE_API","post-kernel-launch",315155,315155,6,58378844936586,58378844936586
"MARKER_CORE_API","memCopyDth",315155,315155,7,58378844938371,58378851383270
"MARKER_CORE_API","HIP_Kernel-1",315155,315155,1,58378526575735,58378851384485
I/O control options
++++++++++++++++++++
--------------------
``rocprofv3`` provides the following options to control the output.
@@ -1106,6 +1338,53 @@ Output prefix keys are useful in multiple use cases but are most helpful when de
* - ``%s``
- Shorthand for ``%size%``
Collection period
+++++++++++++++++++
The collection period is the time interval during which the profiling data is collected. You can specify the collection period using the ``--collection-period`` or ``-p`` option.
You can specify multiple configurations, each defined by a triplet in the format ``start_delay:collection_time:repeat``.
The triplet is defined as follows:
- **Start delay**: The time to wait before profiling data collection starts.
- **Collection time**: The duration for which profiling data is collected.
- **Repeat**: The number of times the cycle is repeated. A value of 0 repeats the cycle indefinitely.
.. code-block:: shell
rocprofv3 --collection-period 5:1:0 --hip-trace -- <application_path>
The above command waits 5 seconds after the application starts and then collects profiling data for 1 second. Because the repeat value is 0, this cycle repeats indefinitely.
The collection period can be specified in different units, such as seconds, milliseconds, microseconds, and nanoseconds. The default unit is seconds. You can change the unit using the ``--collection-period-unit`` option.
The available units are ``hour``, ``min``, ``sec``, ``msec``, ``usec``, and ``nsec``.
.. code-block:: shell
rocprofv3 --collection-period 5:1:0 --collection-period-unit msec --hip-trace -- <application_path>
The above command waits 5 milliseconds after the application starts and then collects profiling data for 1 millisecond. Because the repeat value is 0, this cycle repeats indefinitely.
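
The triplet format and unit handling described above can be sketched in Python. This is an illustration only, not the parser used by ``rocprofv3``. The names ``parse_collection_period`` and ``collection_windows`` are hypothetical. The sketch assumes that each cycle consists of the start delay followed by the collection time, and that a nonzero repeat value is the total number of cycles.

.. code-block:: python

    from typing import List, NamedTuple, Tuple

    # Nanoseconds per unit name accepted by --collection-period-unit.
    NS_PER_UNIT = {
        "hour": 3_600_000_000_000,
        "min": 60_000_000_000,
        "sec": 1_000_000_000,
        "msec": 1_000_000,
        "usec": 1_000,
        "nsec": 1,
    }


    class CollectionPeriod(NamedTuple):
        delay_ns: int
        duration_ns: int
        repeat: int  # 0 means "repeat indefinitely"


    def parse_collection_period(spec: str, unit: str = "sec") -> CollectionPeriod:
        """Parse a start_delay:collection_time:repeat triplet (hypothetical helper)."""
        delay, duration, repeat = spec.split(":")
        scale = NS_PER_UNIT[unit]
        return CollectionPeriod(int(delay) * scale, int(duration) * scale, int(repeat))


    def collection_windows(period: CollectionPeriod, max_cycles: int) -> List[Tuple[int, int]]:
        """List (start, end) offsets in ns, assuming each cycle is delay then collection."""
        # A repeat of 0 is unbounded, so max_cycles caps how many windows are listed.
        cycles = max_cycles if period.repeat == 0 else min(period.repeat, max_cycles)
        windows = []
        clock = 0
        for _ in range(cycles):
            start = clock + period.delay_ns
            end = start + period.duration_ns
            windows.append((start, end))
            clock = end
        return windows


    period = parse_collection_period("5:1:0", unit="msec")
    windows = collection_windows(period, max_cycles=2)
    print(period)
    print(windows)

Under these assumptions, ``5:1:0`` with ``msec`` yields a first collection window from 5 ms to 6 ms and a second from 11 ms to 12 ms after the application starts.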
Perfetto-specific options
++++++++++++++++++++++++++
The following options are specific to Perfetto tracing and are used to control the behavior of the Perfetto data collection:
- **--perfetto-buffer-fill-policy {discard,ring_buffer}**: Policy for handling new records when Perfetto has reached the buffer limit.
  - **ring_buffer (default)**: The buffer behaves like a ring buffer. When it is full, new writes wrap around and replace the oldest trace data in the buffer.
  - **discard**: The buffer stops accepting data once it is full. Further write attempts are dropped.
- **--perfetto-buffer-size KB**: Size of the buffer for Perfetto output in KB. Default: 1 GB. If set, the tracing session stops once the specified amount of data has been written, which caps the size of the trace.
- **--perfetto-backend {inprocess,system}**: Perfetto data collection backend. The ``system`` mode requires starting the ``traced`` and ``perfetto`` daemons. By default, Perfetto keeps the full trace buffer(s) in memory.
- **--perfetto-shmem-size-hint KB**: Perfetto shared memory size hint in KB. Default: 64 KB. This option controls the sizing of the shared memory buffer and can be tuned to avoid data loss when data is produced at a high rate.
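
The difference between the two buffer fill policies can be illustrated with a toy model. The following Python sketch is not Perfetto's implementation. It models the buffer as a fixed number of records rather than bytes, and the function names are hypothetical.

.. code-block:: python

    from collections import deque


    def fill_ring_buffer(records, capacity):
        """Ring-buffer policy: once full, each new record replaces the oldest one."""
        buffer = deque(maxlen=capacity)
        for record in records:
            buffer.append(record)
        return list(buffer)


    def fill_discard(records, capacity):
        """Discard policy: once full, further records are dropped."""
        buffer = []
        for record in records:
            if len(buffer) < capacity:
                buffer.append(record)
        return buffer


    events = ["e0", "e1", "e2", "e3", "e4"]
    ring_kept = fill_ring_buffer(events, capacity=3)
    discard_kept = fill_discard(events, capacity=3)
    print("ring_buffer keeps:", ring_kept)
    print("discard keeps:    ", discard_kept)

In this model, the ring-buffer policy retains the most recent records, while the discard policy retains the earliest ones. The choice therefore depends on whether the end or the beginning of the run matters more for the analysis.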
.. _output-file-fields:
Output file fields
+38 -5
@@ -17,8 +17,15 @@ ROCprofiler-SDK is supported only on Linux. The following distributions are test
- Ubuntu 20.04
- Ubuntu 22.04
- OpenSUSE 15.4
- RedHat 8.8
- Ubuntu 24.04
- OpenSUSE 15.5
- OpenSUSE 15.6
- Red Hat 8.8
- Red Hat 8.9
- Red Hat 8.10
- Red Hat 9.2
- Red Hat 9.3
- Red Hat 9.4
ROCprofiler-SDK might operate as expected on other `Linux distributions <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-operating-systems>`_, but has not been tested.
@@ -56,8 +63,8 @@ Install `CMake <https://cmake.org/>`_ version 3.21 (or later).
export PATH=${HOME}/.local/bin:${PATH}
Building ROCprofiler-SDK
------------------------
Building ROCprofiler-SDK from source
-------------------------------------
.. code-block:: bash
@@ -95,4 +102,30 @@ To run the built tests, ``cd`` into the ``rocprofiler-sdk-build`` directory and
.. code-block:: bash
/usr/local/bin/python -m pip install -r requirements.txt
/usr/local/bin/python -m pip install -r requirements.txt
Install via package manager
----------------------------
If you have ROCm version 6.2 or higher installed, you can use the package manager to install a pre-built copy of ROCprofiler-SDK.
.. tab-set::
.. tab-item:: Ubuntu
.. code-block:: shell
$ sudo apt install rocprofiler-sdk
.. tab-item:: Red Hat Enterprise Linux
.. code-block:: shell
$ sudo dnf install rocprofiler-sdk
.. tab-item:: SUSE Linux Enterprise Server
.. code-block:: shell
$ sudo zypper install rocprofiler-sdk
+1 -1
@@ -4,7 +4,7 @@
# Project related configuration options
#---------------------------------------------------------------------------
DOXYFILE_ENCODING = UTF-8
PROJECT_NAME = ROCTx Developer API
PROJECT_NAME = ROCTx developer API
PROJECT_NUMBER = @ROCPROFILER_VERSION@
PROJECT_BRIEF = "ROCm Profiling API and tools"
PROJECT_LOGO =
+1 -1
@@ -4,7 +4,7 @@
# Project related configuration options
#---------------------------------------------------------------------------
DOXYFILE_ENCODING = UTF-8
PROJECT_NAME = @PROJECT_NAME@ Developer API
PROJECT_NAME = @PROJECT_NAME@ developer API
PROJECT_NUMBER = @ROCPROFILER_VERSION@
PROJECT_BRIEF = "ROCm Profiling API and tools"
PROJECT_LOGO =
+1 -1
@@ -21,7 +21,7 @@ message "Changing directory to ${WORK_DIR}"
cd ${WORK_DIR}
message "Generating rocprofiler-sdk.dox"
cmake -DSOURCE_DIR=${SOURCE_DIR} -DPROJECT_NAME="Rocprofiler SDK" -P ${WORK_DIR}/generate-doxyfile.cmake
cmake -DSOURCE_DIR=${SOURCE_DIR} -DPROJECT_NAME="ROCprofiler-SDK" -P ${WORK_DIR}/generate-doxyfile.cmake
message "Generating doxygen xml files"
mkdir -p _doxygen