doc improvements for 1.0.0 part 2 (#330)
* update installation steps
* Github Issue #50 Adding README's for samples
* Making name change to ROCprofiler-SDK for consistency
* Fix HIP trace documentation
* Fix HSA trace in docs
* Fix kernel trace in docs
* Fixing memory copy and memory allocation traces
* runtime trace and sys trace doc update
* Fix scratch memory doc
* kernel naming and filtering options
* Adding collection period in docs
* Perfetto configs update
* summary output file
* kernel trace format fix
* update CHANGELOG
* Agent index doc update
* rocm-smi output
* group by queue option
* Updated --group-by-queue description
* perfetto visualization
---------
Co-authored-by: Ian Trowbridge <Ian.Trowbridge@amd.com>
[ROCm/rocprofiler-sdk commit: ca7cce9e81]
@@ -158,8 +158,10 @@ Full documentation for ROCprofiler-SDK is available at [rocm.docs.amd.com/projec
- SDK: `rocprofiler_agent_v0_t` support for agent UUIDs
- SDK: `rocprofiler_agent_v0_t` support for agent visibility based on gpu isolation environment variables such as `ROCR_VISIBLE_DEVICES` and so on.
- Accumulation VGPR support for `rocprofv3`.
- Host-trap based PC sampling support for rocprofv3.
- Support for OpenMP tool.

## ROCprofiler-SDK 0.7.0 for ROCm release 6.5
## ROCprofiler-SDK 1.0.0 for ROCm release 6.5

### Added

@@ -170,9 +172,9 @@ Full documentation for ROCprofiler-SDK is available at [rocm.docs.amd.com/projec
- Added perfetto support for counter collection.
- Added support for negating rocprofv3 tracing options when using aggregate options, e.g. `--sys-trace --hsa-trace=no`
- Added `--agent-index` option in rocprofv3 to specify the agent naming convention in the output
- absolute == node_id
- relative == logical_node_id
- type-relative == logical_node_type_id
- absolute == node_id
- relative == logical_node_id
- type-relative == logical_node_type_id
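
The three `--agent-index` modes listed above each select a different agent identifier. The following is a minimal sketch of that mapping, assuming a hypothetical `Agent` record whose field names mirror the identifiers in the changelog. The numeric values are made up for illustration, and this is not the rocprofv3 implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    """Hypothetical agent record; field names follow the changelog entry."""
    node_id: int
    logical_node_id: int
    logical_node_type_id: int


# Mode name (as given on the command line) -> field that supplies the index.
_FIELD_FOR_MODE = {
    "absolute": "node_id",
    "relative": "logical_node_id",
    "type-relative": "logical_node_type_id",
}


def agent_index(agent: Agent, mode: str) -> int:
    """Return the index used to name `agent` under the given mode."""
    try:
        field = _FIELD_FOR_MODE[mode]
    except KeyError:
        raise ValueError(f"unknown agent index mode: {mode!r}") from None
    return getattr(agent, field)


# Illustrative values only.
gpu = Agent(node_id=4, logical_node_id=2, logical_node_type_id=0)
for mode in _FIELD_FOR_MODE:
    print(mode, agent_index(gpu, mode))
```

Keeping the mapping in a single table means an unsupported mode fails loudly instead of silently falling back to a default.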

### Changed

@@ -31,7 +31,7 @@ ROCProfiler-SDK is AMD’s new and improved tooling infrastructure, providing a
- Scratch Memory tracing
- RCCL API tracing
- rocDecode API tracing
- rocjpeg API tracing
- rocJPEG API tracing

## Parallelism API Support

@@ -0,0 +1,14 @@
# API Callback Tracing Sample

## Services

- Captures events like API calls using callbacks.
- HSA API (Core, AMD Ext)
- HIP API (Runtime)
- Marker API (Core, Name)

## Properties

- Handles roctxProfilerPause and roctxProfilerResume operations using a control context.
- Captures API calls and logs details like thread ID, operation type, and duration.
- Provides a detailed trace of all function calls and events for debugging.
@@ -0,0 +1,13 @@
# CodeObject tracing

## Services

- Code object tracing.

## Properties

- This tool hooks into ROCProfiler's callback and buffer tracing mechanisms to:
- Decode and analyze GPU code objects.
- Three kernel variants are used in the sample: simple transpose, in-place LDS swap, and LDS with no bank conflicts.
- Trace kernel symbols and instructions.
- Log disassembly and statistics for debugging or performance analysis.
@@ -0,0 +1,11 @@
# CodeObject Tracing

## Services

Trace and analyze the execution of GPU code objects and kernel symbols.

## Properties

- This tool is designed to capture and log information about code object loading/unloading and kernel symbol registration/unregistration events during the execution of GPU programs.

- Whenever a relevant event occurs, such as a code object being loaded/unloaded or a kernel symbol being registered/unregistered, the function processes the event data, formats it into a human-readable string, and appends it to the call stack.
@@ -0,0 +1,21 @@
# Counter collection

## Services

- Dispatch counting
- Device Counting async
- Device Counting sync

## Properties

- Initializes the tool and sets up the counting service.
- Creates a collection profile for the counters.
- Outputs counters mentioned during profiler creation.
- Usage of enum ROCPROFILER_BUFFER_CATEGORY_COUNTERS.
- Buffered_callback
- This sample shows the usage of the buffered approach when collecting counters. The buffered callback is called when the buffer is full (or when the buffer is flushed). The callback is responsible for processing the records in the buffer.

- Dispatch callback
- This sample creates a profile to collect the counter SQ_WAVES for all kernel dispatch packets.

- Prints all functional counters.
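
The buffered approach described above (records accumulate, and a callback processes them when the buffer fills or is flushed) can be illustrated in isolation. This is a minimal sketch of the pattern only: `RecordBuffer`, its capacity, and the record layout are hypothetical, and it does not use or mirror the ROCprofiler-SDK C API.

```python
from typing import Callable, List


class RecordBuffer:
    """Collects records and hands them to a callback in batches."""

    def __init__(self, capacity: int, callback: Callable[[List[dict]], None]) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._callback = callback
        self._records: List[dict] = []

    def push(self, record: dict) -> None:
        """Store one record; deliver the batch once the buffer is full."""
        self._records.append(record)
        if len(self._records) >= self._capacity:
            self.flush()

    def flush(self) -> None:
        """Deliver any pending records, even if the buffer is not full."""
        if not self._records:
            return
        batch, self._records = self._records, []
        self._callback(batch)


batches = []
buffer = RecordBuffer(capacity=3, callback=batches.append)
for value in range(5):
    buffer.push({"counter": "SQ_WAVES", "value": value})
buffer.flush()  # deliver the two records still pending
print([len(batch) for batch in batches])  # [3, 2]
```

The explicit `flush()` at the end matters: without it, records that never fill a whole buffer would not reach the callback.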
@@ -0,0 +1,10 @@
# Runtime API Registration

## Services

- HIP runtime table registration

## Properties

- The `api_registration_callback` function validates the type of library being intercepted, ensures there is only one instance of the HIP runtime library, and retrieves the dispatch table containing the API functions.
- Collects a "call stack" of intercepted API calls.
@@ -0,0 +1,15 @@
# OMPT tool tracing

## Services

- OMPT tracing.
- CodeObject tracing.
- Marker API (Core, Name).

## Properties

- Configures tool for callback tracing.
- Configures tool for buffer tracing.
- Sets up callstack for tracing kind names and tracing operation names.
- Creates a specialized (throw-away) context for handling ROCTx profiler pause and resume.
- Demonstrates the use of the `ompt_data_t*` fields from OMPT.
@@ -0,0 +1,12 @@
# PC sampling service

## Services

- PC sampling stochastic method

## Properties

- Iterates through all GPU agents that support PC sampling.
- Iterates through the supported configurations for each agent.
- The `configure_pc_sampling_prefer_stochastic` function is responsible for configuring PC sampling on a given GPU agent. It attempts to select a stochastic sampling configuration if available, falling back to a host-trap configuration otherwise.
- The `rocprofiler_pc_sampling_callback` function processes PC sampling records delivered by the profiler. It validates the records, determines their type, and delegates the printing of their details to the appropriate print_sample function.
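
The selection rule described above (use a stochastic configuration when one is available, otherwise fall back to host-trap) can be sketched independently of the SDK. Everything named here is an assumption for illustration: `SamplingConfig`, the method labels, and the interval values are hypothetical and do not reflect the C API used by the sample.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SamplingConfig:
    """Hypothetical description of one supported PC sampling configuration."""
    method: str    # "stochastic" or "host_trap" (illustrative labels)
    interval: int  # illustrative sampling interval


def prefer_stochastic(configs: Sequence[SamplingConfig]) -> Optional[SamplingConfig]:
    """Return the first stochastic config, else the first host-trap config, else None."""
    for wanted in ("stochastic", "host_trap"):
        for config in configs:
            if config.method == wanted:
                return config
    return None


supported = [SamplingConfig("host_trap", 1000), SamplingConfig("stochastic", 1 << 20)]
print(prefer_stochastic(supported).method)      # stochastic
print(prefer_stochastic(supported[:1]).method)  # host_trap
```

Returning `None` when neither method is offered leaves it to the caller to decide whether an agent without PC sampling support is an error or simply skipped.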
@@ -361,7 +361,7 @@ For MPI applications (or other job launchers such as SLURM), place rocprofv3 ins
add_parser_bool_argument(
    extended_tracing_options,
    "--hsa-image-trace",
    help="For collecting HSA API Traces (Image-extenson API), e.g. HSA functions prefixed with only 'hsa_ext_image_' (i.e. hsa_ext_image_get_capability).",
    help="For collecting HSA API Traces (Image-extension API), e.g. HSA functions prefixed with only 'hsa_ext_image_' (i.e. hsa_ext_image_get_capability).",
)
add_parser_bool_argument(
    extended_tracing_options,

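
For readers unfamiliar with boolean options that accept an optional value, the following is a minimal `argparse` sketch of how a flag such as `--hsa-trace=no` can override an aggregate flag such as `--sys-trace`. The helper `add_bool_option`, the `resolve` function, and the accepted spellings are assumptions made for this example; the body of `add_parser_bool_argument` is not shown in this diff and may differ.

```python
import argparse

_TRUE = {"1", "y", "yes", "on", "true"}
_FALSE = {"0", "n", "no", "off", "false"}


def str_to_bool(value):
    """Convert a command-line string to a bool, rejecting anything unrecognized."""
    text = str(value).strip().lower()
    if text in _TRUE:
        return True
    if text in _FALSE:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean value, got {value!r}")


def add_bool_option(parser, flag, help_text):
    """Hypothetical helper: `--flag` is True, `--flag=no` is False, absent is None."""
    parser.add_argument(
        flag, nargs="?", const=True, default=None,
        type=str_to_bool, metavar="BOOL", help=help_text,
    )


def resolve(args):
    """An explicit per-domain value wins; otherwise inherit the aggregate option."""
    hsa = args.hsa_trace
    if hsa is None:
        hsa = bool(args.sys_trace)
    return {"hsa_trace": hsa}


parser = argparse.ArgumentParser(prog="example")
add_bool_option(parser, "--sys-trace", "aggregate option (illustrative)")
add_bool_option(parser, "--hsa-trace", "HSA API tracing (illustrative)")

print(resolve(parser.parse_args(["--sys-trace", "--hsa-trace=no"])))  # {'hsa_trace': False}
```

Using `None` as the default keeps "not specified" distinct from an explicit "no", which is what lets the aggregate option fill in only the values the user left unset.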
@@ -35,7 +35,7 @@ def build_doxyfile():
        [
            "cmake",
            f"-DSOURCE_DIR={_srcdir}",
            "-DPROJECT_NAME='Rocprofiler SDK'",
            "-DPROJECT_NAME='ROCprofiler-SDK'",
            f"-P {_srcdir}/source/docs/generate-doxyfile.cmake",
        ]
    )

@@ -1,5 +1,7 @@
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"HIP_COMPILER_API","__hipRegisterFatBinary",208,208,1,1508780270085955,1508780270096795
"HIP_COMPILER_API","__hipRegisterFunction",208,208,2,1508780270104242,1508780270115355
"HIP_COMPILER_API","__hipPushCallConfiguration",208,208,3,1508780613897816,1508780613898701
"HIP_COMPILER_API","__hipPopCallConfiguration",208,208,4,1508780613901714,1508780613902200
"HIP_COMPILER_API_EXT","__hipRegisterFatBinary",15,15,1,1055015439953054,1055015439976484
"HIP_COMPILER_API_EXT","__hipRegisterFunction",15,15,2,1055015439992584,1055015440011104
"HIP_COMPILER_API_EXT","__hipRegisterFunction",15,15,3,1055015440011744,1055015440013824
"HIP_COMPILER_API_EXT","__hipRegisterFunction",15,15,4,1055015440014244,1055015440014534
"HIP_COMPILER_API_EXT","__hipRegisterFunction",15,15,5,1055015440014854,1055015440015524

@@ -1,9 +1,9 @@
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"HIP_RUNTIME_API","hipGetDevicePropertiesR0600",238,238,1,1191915574691984,1191915687784011
"HIP_RUNTIME_API","hipMalloc",238,238,2,1191915691312459,1191915691388696
"HIP_RUNTIME_API","hipMalloc",238,238,3,1191915691390637,1191915691423279
"HIP_RUNTIME_API","hipMemcpy",238,238,4,1191915691439107,1191916547828448
"HIP_RUNTIME_API","hipLaunchKernel",238,238,5,1191916547842972,1191916548408842
"HIP_RUNTIME_API","hipMemcpy",238,238,6,1191916548412677,1191916550217834
"HIP_RUNTIME_API","hipFree",238,238,7,1191916562618151,1191916562789093
"HIP_RUNTIME_API","hipFree",238,238,8,1191916562790923,1191916562836351
"HIP_RUNTIME_API_EXT","hipGetDevicePropertiesR0600",238,238,1,1191915574691984,1191915687784011
"HIP_RUNTIME_API_EXT","hipMalloc",238,238,2,1191915691312459,1191915691388696
"HIP_RUNTIME_API_EXT","hipMalloc",238,238,3,1191915691390637,1191915691423279
"HIP_RUNTIME_API_EXT","hipMemcpy",238,238,4,1191915691439107,1191916547828448
"HIP_RUNTIME_API_EXT","hipLaunchKernel",238,238,5,1191916547842972,1191916548408842
"HIP_RUNTIME_API_EXT","hipMemcpy",238,238,6,1191916548412677,1191916550217834
"HIP_RUNTIME_API_EXT","hipFree",238,238,7,1191916562618151,1191916562789093
"HIP_RUNTIME_API_EXT","hipFree",238,238,8,1191916562790923,1191916562836351

@@ -0,0 +1,18 @@
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"HIP_COMPILER_API_EXT","__hipRegisterFatBinary",15,15,1,1055015439953054,1055015439976484
"HIP_COMPILER_API_EXT","__hipRegisterFunction",15,15,2,1055015439992584,1055015440011104
"HIP_COMPILER_API_EXT","__hipRegisterFunction",15,15,3,1055015440011744,1055015440013824
"HIP_COMPILER_API_EXT","__hipRegisterFunction",15,15,4,1055015440014244,1055015440014534
"HIP_COMPILER_API_EXT","__hipRegisterFunction",15,15,5,1055015440014854,1055015440015524
"HIP_RUNTIME_API_EXT","hipGetDeviceCount",15,15,6,1055015440617618,1055015539800733
"HIP_RUNTIME_API_EXT","hipSetDevice",15,15,7,1055015539819503,1055015539821693
"HIP_RUNTIME_API_EXT","hipDeviceSynchronize",15,15,8,1055015539832333,1055015539840903
"HIP_RUNTIME_API_EXT","hipStreamCreateWithFlags",15,15,9,1055015539861673,1055015865247140
"HIP_RUNTIME_API_EXT","hipHostMalloc",15,15,10,1055015865309761,1055015865849494
"HIP_RUNTIME_API_EXT","hipHostMalloc",15,15,11,1055015865850944,1055015866265546
"HIP_RUNTIME_API_EXT","hipHostMalloc",15,15,12,1055015866266646,1055015867082900
"HIP_RUNTIME_API_EXT","hipMallocAsync",15,15,13,1055015867356542,1055015867662314
"HIP_RUNTIME_API_EXT","hipMallocAsync",15,15,14,1055015867664174,1055015867937465
"HIP_RUNTIME_API_EXT","hipMallocAsync",15,15,15,1055015867938815,1055015868219987
"HIP_RUNTIME_API_EXT","hipMemcpyAsync",15,15,16,1055015868240137,1055015917307652
"HIP_RUNTIME_API_EXT","hipMemcpyAsync",15,15,17,1055015917337263,1055015917360493
@@ -0,0 +1,34 @@
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"HSA_CORE_API","hsa_system_get_major_extension_table",57,57,1,1056813747808832,1056813747809252
"HSA_CORE_API","hsa_agent_get_info",57,57,3,1056813747826572,1056813747826672
"HSA_CORE_API","hsa_agent_get_info",57,57,4,1056813747837582,1056813747837622
"HSA_CORE_API","hsa_agent_get_info",57,57,5,1056813747838542,1056813747838582
"HSA_CORE_API","hsa_agent_get_info",57,57,6,1056813747839042,1056813747839082
"HSA_CORE_API","hsa_agent_get_info",57,57,7,1056813747839512,1056813747839622
"HSA_CORE_API","hsa_iterate_agents",57,57,2,1056813747821012,1056813747839832
"HSA_CORE_API","hsa_agent_get_info",57,57,8,1056813747843832,1056813747844132
"HSA_CORE_API","hsa_agent_get_info",57,57,9,1056813747844482,1056813747844542
"HSA_CORE_API","hsa_agent_iterate_isas",57,57,10,1056813747849402,1056813747850422
"HSA_CORE_API","hsa_isa_get_info_alt",57,57,11,1056813747853542,1056813747875253
"HSA_CORE_API","hsa_isa_get_info_alt",57,57,12,1056813747875883,1056813747878353
"HSA_CORE_API","hsa_agent_get_info",57,57,13,1056813747886343,1056813747886403
"HSA_CORE_API","hsa_agent_get_info",57,57,54,1056813748282015,1056813748282085
"HSA_CORE_API","hsa_system_get_info",57,57,55,1056813748282465,1056813748282505
"HSA_CORE_API","hsa_signal_create",57,57,56,1056813749083419,1056813749085399
"HSA_CORE_API","hsa_agent_get_info",57,57,57,1056813749741363,1056813749741443
"HSA_CORE_API","hsa_queue_create",57,57,58,1056813749744053,1056813856914188
"HSA_CORE_API","hsa_signal_create",57,57,59,1056813857149169,1056813857154109
"HSA_CORE_API","hsa_signal_create",57,57,60,1056813857154929,1056813857155389
"HSA_CORE_API","hsa_signal_create",57,57,61,1056813857155949,1056813857156429
"HSA_CORE_API","hsa_signal_create",57,57,62,1056813857157169,1056813857157349
"HSA_CORE_API","hsa_executable_create_alt",57,57,63,1056813965439362,1056813965466952
"HSA_CORE_API","hsa_code_object_reader_create_from_memory",57,57,64,1056813965476642,1056813965587493
"HSA_CORE_API","hsa_executable_load_agent_code_object",57,57,65,1056813965592483,1056813965965295
"HSA_CORE_API","hsa_signal_create",57,57,67,1056813966149786,1056813966151706
"HSA_CORE_API","hsa_signal_wait_scacquire",57,57,68,1056813966156596,1056813966158646
"HSA_CORE_API","hsa_signal_destroy",57,57,69,1056813966162276,1056813966163746
"HSA_CORE_API","hsa_executable_freeze",57,57,66,1056813965973105,1056813966778050
"HSA_CORE_API","hsa_executable_get_symbol_by_name",57,57,70,1056813966800070,1056813966801880
"HSA_CORE_API","hsa_executable_symbol_get_info",57,57,71,1056813966805750,1056813966805980
"HSA_CORE_API","hsa_executable_symbol_get_info",57,57,72,1056813966806300,1056813966806340

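
In the HSA trace above, `hsa_iterate_agents` (correlation ID 2) is listed after the `hsa_agent_get_info` calls with IDs 3 to 7, and its timestamps enclose theirs. The following minimal sketch recovers that nesting from the timestamps alone, using a few rows copied from the sample. The helper names `load_calls` and `nested_in` are hypothetical, and the containment test is an assumption about how to interpret the columns, not a documented rocprofv3 feature.

```python
import csv
import io

# A few rows copied from the HSA trace sample above.
SAMPLE = '''"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"HSA_CORE_API","hsa_agent_get_info",57,57,3,1056813747826572,1056813747826672
"HSA_CORE_API","hsa_agent_get_info",57,57,7,1056813747839512,1056813747839622
"HSA_CORE_API","hsa_iterate_agents",57,57,2,1056813747821012,1056813747839832
"HSA_CORE_API","hsa_agent_get_info",57,57,8,1056813747843832,1056813747844132
'''


def load_calls(text):
    """Parse trace CSV text into dicts with integer IDs and timestamps."""
    calls = []
    for rec in csv.DictReader(io.StringIO(text)):
        calls.append({
            "function": rec["Function"],
            "thread": int(rec["Thread_Id"]),
            "corr": int(rec["Correlation_Id"]),
            "start": int(rec["Start_Timestamp"]),
            "end": int(rec["End_Timestamp"]),
        })
    return calls


def nested_in(calls):
    """Map each correlation ID to the ID of its innermost enclosing call, or None."""
    parents = {}
    for call in calls:
        enclosing = [
            other for other in calls
            if other is not call
            and other["thread"] == call["thread"]
            and other["start"] <= call["start"]
            and call["end"] <= other["end"]
        ]
        # The innermost parent is the enclosing call with the shortest duration.
        parent = min(enclosing, key=lambda c: c["end"] - c["start"], default=None)
        parents[call["corr"]] = parent["corr"] if parent else None
    return parents


print(nested_in(load_calls(SAMPLE)))  # {3: 2, 7: 2, 2: None, 8: None}
```

The pairwise comparison is quadratic in the number of rows, which is fine for a sketch; a large trace would call for sorting by start time and tracking a stack of open calls.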
@@ -1,9 +1,10 @@
"Kind","Agent_Id","Queue_Id","Stream_Id","Thread_Id","Dispatch_Id","Kernel_Id","Kernel_Name","Correlation_Id","Start_Timestamp","End_Timestamp","Private_Segment_Size","Group_Segment_Size","Workgroup_Size_X","Workgroup_Size_Y","Workgroup_Size_Z","Grid_Size_X","Grid_Size_Y","Grid_Size_Z"
"KERNEL_DISPATCH",2,1,1,21228,1,11,"void addition_kernel<float>(float*, float const*, float const*, int, int)",1937,2414192765353337,2414192765369494,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH",2,1,1,21228,2,14,"subtract_kernel(float*, float const*, float const*, int, int)",1945,2414192765424862,2414192765435326,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH",2,1,1,21228,3,13,"multiply_kernel(float*, float const*, float const*, int, int)",1953,2414192765487486,2414192765497669,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH",2,1,1,21228,4,12,"divide_kernel(float*, float const*, float const*, int, int)",1961,2414192765545619,2414192765555722,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH",2,2,2,21228,5,11,"void addition_kernel<float>(float*, float const*, float const*, int, int)",1969,2414192765608844,2414192765621674,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH",2,2,2,21228,6,14,"subtract_kernel(float*, float const*, float const*, int, int)",1977,2414192765658519,2414192765669424,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH",2,2,2,21228,7,13,"multiply_kernel(float*, float const*, float const*, int, int)",1985,2414192765715650,2414192765726795,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH",2,2,2,21228,8,12,"divide_kernel(float*, float const*, float const*, int, int)",1993,2414192765773422,2414192765784969,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",3,4,77,9,12,"void addition_kernel<float>(float*, float const*, float const*, int, int)",9,1057678125307992,1057678125319352,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",2,3,77,8,11,"divide_kernel(float*, float const*, float const*, int, int)",8,1057678125310912,1057678125319952,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",2,3,77,7,13,"multiply_kernel(float*, float const*, float const*, int, int)",7,1057678125298952,1057678125310912,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",2,3,77,6,14,"subtract_kernel(float*, float const*, float const*, int, int)",6,1057678125288912,1057678125298952,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",2,3,77,5,12,"void addition_kernel<float>(float*, float const*, float const*, int, int)",5,1057678125279112,1057678125288912,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",3,4,77,10,14,"subtract_kernel(float*, float const*, float const*, int, int)",10,1057678125319352,1057678125328912,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",3,4,77,12,11,"divide_kernel(float*, float const*, float const*, int, int)",12,1057678125356832,1057678125366712,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",3,4,77,11,13,"multiply_kernel(float*, float const*, float const*, int, int)",11,1057678125344352,1057678125356832,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",4,5,77,13,12,"void addition_kernel<float>(float*, float const*, float const*, int, int)",13,1057678125382792,1057678125394032,0,0,64,1,1,1024,1024,1

@@ -0,0 +1,5 @@
"Kind","Agent_Id","Queue_Id","Stream_Id","Thread_Id","Dispatch_Id","Kernel_Id","Kernel_Name","Correlation_Id","Start_Timestamp","End_Timestamp","Private_Segment_Size","Group_Segment_Size","Workgroup_Size_X","Workgroup_Size_Y","Workgroup_Size_Z","Grid_Size_X","Grid_Size_Y","Grid_Size_Z"
"KERNEL_DISPATCH","Agent 4",1,2,123,1,11,"addition_kernel",1,1143263119836160,1143263119848360,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",2,3,123,6,14,"subtract_kernel",6,1143263119886120,1143263119896040,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",2,3,123,5,11,"addition_kernel",5,1143263119876240,1143263119886120,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",1,2,123,4,12,"divide_kernel",4,1143263119866920,1143263119878960,0,0,64,1,1,1024,1024,1
@@ -1,7 +1,7 @@
"Kind","Operation","Agent_Id","Allocation_Size","Address","Correlation_Id","Start_Timestamp","End_Timestamp"
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_ALLOCATE",0,1024,0x7fb2d0005000,11,3721742710532634,3721742710584854
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_FREE",0,0,0x7fb2d0005000,12,3721742710596404,3721742710933366
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_ALLOCATE",0,1024,0x7fb2d0005000,13,3721742710941416,3721742710960916
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_FREE",0,0,0x7fb2d0005000,14,3721742710967236,3721742711197647
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_ALLOCATE",0,1024,0x7fb2d0005000,15,3721742711204077,3721742711219717
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_FREE",0,0,0x7fb2d0005000,16,3721742711225857,3721742711466018
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_ALLOCATE",Agent 0,1024,0x7fb2d0005000,11,3721742710532634,3721742710584854
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_FREE",Agent 0,0,0x7fb2d0005000,12,3721742710596404,3721742710933366
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_ALLOCATE",Agent 0,1024,0x7fb2d0005000,13,3721742710941416,3721742710960916
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_FREE",Agent 0,0,0x7fb2d0005000,14,3721742710967236,3721742711197647
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_ALLOCATE",Agent 0,1024,0x7fb2d0005000,15,3721742711204077,3721742711219717
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_FREE",Agent 0,0,0x7fb2d0005000,16,3721742711225857,3721742711466018

@@ -1,5 +1,5 @@
"Kind","Direction","Stream_Id","Source_Agent_Id","Destination_Agent_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"MEMORY_COPY","MEMORY_COPY_HOST_TO_DEVICE",1,0,2,952,2414192684609085,2414192684710679
"MEMORY_COPY","MEMORY_COPY_HOST_TO_DEVICE",1,0,2,960,2414192684873841,2414192684973470
"MEMORY_COPY","MEMORY_COPY_HOST_TO_DEVICE",2,0,2,1066,2414192706436949,2414192706538622
"MEMORY_COPY","MEMORY_COPY_HOST_TO_DEVICE",2,0,2,1074,2414192706592442,2414192706692312
"MEMORY_COPY","MEMORY_COPY_HOST_TO_DEVICE",0,"Agent 0","Agent 4",1,1057963336487172,1057963336564212
"MEMORY_COPY","MEMORY_COPY_HOST_TO_DEVICE",0,"Agent 0","Agent 4",2,1057963336783973,1057963336859334
"MEMORY_COPY","MEMORY_COPY_DEVICE_TO_HOST",0,"Agent 4","Agent 0",23,1057963497396292,1057963497471732
"MEMORY_COPY","MEMORY_COPY_DEVICE_TO_HOST",0,"Agent 4","Agent 0",24,1057963498099125,1057963498200446

Binary file not shown.
After Width: | Height: | Size: 174 KiB
Binary file not shown.
After Width: | Height: | Size: 124 KiB
Binary file not shown.
After Width: | Height: | Size: 124 KiB
@@ -0,0 +1,11 @@
============================================ ROCm System Management Interface ============================================
====================================================== Concise Info ======================================================
Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
(DID, GUID) (Junction) (Socket) (Mem, Compute, ID)
==========================================================================================================================
0 4 0x74a0, 50375 48.0°C 110.0W NPS1, SPX, 0 98Mhz 1300Mhz 0% auto 550.0W 0% 0%
1 5 0x74a0, 20890 53.0°C 113.0W NPS1, SPX, 0 99Mhz 1200Mhz 0% auto 550.0W 0% 0%
2 6 0x74a0, 44670 52.0°C 125.0W NPS1, SPX, 0 100Mhz 1300Mhz 0% auto 550.0W 0% 0%
3 7 0x74a0, 15139 47.0°C 115.0W NPS1, SPX, 0 100Mhz 1300Mhz 0% auto 550.0W 0% 0%
==========================================================================================================================
================================================== End of ROCm SMI Log ===================================================
@@ -0,0 +1,3 @@
"Kind","Operation","Agent_Id","Queue_Id","Thread_Id","Alloc_Flags","Start_Timestamp","End_Timestamp"
"SCRATCH_MEMORY","SCRATCH_MEMORY_ALLOC","Agent 4",1,113,0,1124926523146168,1124926554133606
"SCRATCH_MEMORY","SCRATCH_MEMORY_ALLOC","Agent 4",1,113,0,1124926554522025,1124927132642186
Binary file not shown.
After Width: | Height: | Size: 148 KiB
Binary file not shown.
After Width: | Height: | Size: 124 KiB
@@ -0,0 +1,154 @@

ROCPROFV3 HSA_API SUMMARY:

| NAME | DOMAIN | CALLS | DURATION (nsec) | AVERAGE (nsec) | PERCENT (INC) | MIN (nsec) | MAX (nsec) | STDDEV |
|-------------------------------------------|--------------|-----------------|-----------------|-----------------|---------------|-----------------|-----------------|-----------------|
| hsa_queue_create | HSA_API | 4 | 280077621 | 7.002e+07 | 75.372632 | 55026812 | 113288760 | 2.885e+07 |
| hsa_amd_memory_async_copy_on_engine | HSA_API | 24 | 55617052 | 2.317e+06 | 14.967292 | 7580 | 55195188 | 1.126e+07 |
| hsa_amd_memory_pool_allocate | HSA_API | 67 | 26428438 | 3.945e+05 | 7.112246 | 1510 | 857592 | 1.782e+05 |
| hsa_amd_memory_pool_free | HSA_API | 72 | 5176173 | 7.189e+04 | 1.392977 | 290 | 170374 | 3.903e+04 |
| hsa_executable_freeze | HSA_API | 2 | 964125 | 4.821e+05 | 0.259459 | 437471 | 526654 | 6.306e+04 |
| hsa_signal_wait_scacquire | HSA_API | 26 | 853122 | 3.281e+04 | 0.229587 | 2530 | 100782 | 3.394e+04 |
| hsa_executable_load_agent_code_object | HSA_API | 2 | 616175 | 3.081e+05 | 0.165821 | 254476 | 361699 | 7.582e+04 |
| hsa_amd_agents_allow_access | HSA_API | 35 | 430680 | 1.231e+04 | 0.115902 | 4830 | 55182 | 9.939e+03 |
| hsa_signal_store_screlease | HSA_API | 56 | 381491 | 6.812e+03 | 0.102664 | 1560 | 41831 | 7.895e+03 |
| hsa_signal_create | HSA_API | 107 | 160889 | 1.504e+03 | 0.043297 | 80 | 5650 | 1.475e+03 |
| hsa_code_object_reader_create_from_memory | HSA_API | 2 | 151314 | 7.566e+04 | 0.040721 | 32121 | 119193 | 6.157e+04 |
| hsa_signal_load_relaxed | HSA_API | 1296 | 137626 | 1.062e+02 | 0.037037 | 20 | 2930 | 2.712e+02 |
| hsa_signal_destroy | HSA_API | 618 | 111224 | 1.800e+02 | 0.029932 | 40 | 1540 | 2.429e+02 |
| hsa_agent_get_info | HSA_API | 65 | 77472 | 1.192e+03 | 0.020849 | 30 | 47121 | 6.341e+03 |
| hsa_amd_signal_create | HSA_API | 512 | 61290 | 1.197e+02 | 0.016494 | 40 | 930 | 1.559e+02 |
| hsa_amd_signal_async_handler | HSA_API | 24 | 52641 | 2.193e+03 | 0.014166 | 1180 | 4020 | 9.252e+02 |
| hsa_executable_iterate_symbols | HSA_API | 14 | 52521 | 3.752e+03 | 0.014134 | 2740 | 6940 | 1.105e+03 |
| hsa_amd_memory_copy_engine_status | HSA_API | 18 | 47370 | 2.632e+03 | 0.012748 | 260 | 7990 | 2.274e+03 |
| hsa_iterate_agents | HSA_API | 1 | 41391 | 4.139e+04 | 0.011139 | 41391 | 41391 | 0.000e+00 |
| hsa_executable_create_alt | HSA_API | 2 | 40470 | 2.024e+04 | 0.010891 | 7530 | 32940 | 1.797e+04 |
| hsa_isa_get_info_alt | HSA_API | 2 | 30391 | 1.520e+04 | 0.008179 | 2490 | 27901 | 1.797e+04 |
| hsa_signal_silent_store_relaxed | HSA_API | 48 | 24920 | 5.192e+02 | 0.006706 | 20 | 4570 | 7.120e+02 |
| hsa_amd_agent_iterate_memory_pools | HSA_API | 5 | 20221 | 4.044e+03 | 0.005442 | 2561 | 8600 | 2.574e+03 |
| hsa_queue_add_write_index_screlease | HSA_API | 56 | 7270 | 1.298e+02 | 0.001956 | 30 | 2310 | 3.471e+02 |
| hsa_amd_profiling_set_profiler_enabled | HSA_API | 4 | 5600 | 1.400e+03 | 0.001507 | 1370 | 1470 | 4.690e+01 |
| hsa_executable_symbol_get_info | HSA_API | 152 | 5470 | 3.599e+01 | 0.001472 | 30 | 340 | 3.563e+01 |
| hsa_queue_load_read_index_relaxed | HSA_API | 56 | 4560 | 8.143e+01 | 0.001227 | 20 | 1310 | 1.863e+02 |
| hsa_executable_get_symbol_by_name | HSA_API | 14 | 4500 | 3.214e+02 | 0.001211 | 110 | 1510 | 4.732e+02 |
| hsa_queue_load_read_index_scacquire | HSA_API | 56 | 3040 | 5.429e+01 | 0.000818 | 30 | 690 | 8.705e+01 |
| hsa_amd_memory_pool_get_info | HSA_API | 43 | 1770 | 4.116e+01 | 0.000476 | 30 | 270 | 3.640e+01 |
| hsa_system_get_info | HSA_API | 4 | 1750 | 4.375e+02 | 0.000471 | 40 | 830 | 3.544e+02 |
| hsa_amd_agent_memory_pool_get_info | HSA_API | 13 | 1140 | 8.769e+01 | 0.000307 | 30 | 640 | 1.664e+02 |
| hsa_agent_iterate_isas | HSA_API | 1 | 700 | 7.000e+02 | 0.000188 | 700 | 700 | 0.000e+00 |
| hsa_system_get_major_extension_table | HSA_API | 1 | 190 | 1.900e+02 | 0.000051 | 190 | 190 | 0.000e+00 |

ROCPROFV3 HIP_API SUMMARY:

| NAME | DOMAIN | CALLS | DURATION (nsec) | AVERAGE (nsec) | PERCENT (INC) | MIN (nsec) | MAX (nsec) | STDDEV |
|------------------------------------------|--------------|-----------------|-----------------|-----------------|---------------|-----------------|-----------------|-----------------|
| hipStreamCreateWithFlags | HIP_API | 8 | 406507215 | 5.081e+07 | 71.307804 | 735979 | 233800881 | 7.889e+07 |
| hipGetDeviceCount | HIP_API | 1 | 76707894 | 7.671e+07 | 13.455780 | 76707894 | 76707894 | 0.000e+00 |
| hipMemcpyAsync | HIP_API | 24 | 56109444 | 2.338e+06 | 9.842485 | 11640 | 55299811 | 1.128e+07 |
| hipHostMalloc | HIP_API | 24 | 13007523 | 5.420e+05 | 2.281726 | 416631 | 866382 | 1.206e+05 |
| hipMallocAsync | HIP_API | 24 | 7304847 | 3.044e+05 | 1.281386 | 275397 | 353719 | 2.207e+04 |
| hipHostFree | HIP_API | 24 | 2786484 | 1.161e+05 | 0.488793 | 72242 | 221646 | 4.606e+04 |
| hipStreamDestroy | HIP_API | 8 | 2137924 | 2.672e+05 | 0.375026 | 221596 | 377469 | 5.489e+04 |
| hipLaunchKernel | HIP_API | 32 | 2080214 | 6.501e+04 | 0.364902 | 8850 | 1608721 | 2.819e+05 |
| hipFree | HIP_API | 24 | 1572948 | 6.554e+04 | 0.275920 | 2130 | 186994 | 4.815e+04 |
| hipStreamSynchronize | HIP_API | 24 | 1452706 | 6.053e+04 | 0.254828 | 20810 | 135803 | 3.469e+04 |
| __hipRegisterFunction | HIP_API | 4 | 294207 | 7.355e+04 | 0.051609 | 210 | 291807 | 1.455e+05 |
| hipDeviceSynchronize | HIP_API | 4 | 50663 | 1.267e+04 | 0.008887 | 510 | 23621 | 9.554e+03 |
| __hipRegisterFatBinary | HIP_API | 1 | 43811 | 4.381e+04 | 0.007685 | 43811 | 43811 | 0.000e+00 |
| __hipPushCallConfiguration | HIP_API | 32 | 6250 | 1.953e+02 | 0.001096 | 60 | 3640 | 6.308e+02 |
| __hipPopCallConfiguration | HIP_API | 32 | 4780 | 1.494e+02 | 0.000838 | 60 | 2520 | 4.340e+02 |
| hipGetLastError | HIP_API | 32 | 4471 | 1.397e+02 | 0.000784 | 60 | 2381 | 4.092e+02 |
| hipSetDevice | HIP_API | 1 | 2570 | 2.570e+03 | 0.000451 | 2570 | 2570 | 0.000e+00 |

ROCPROFV3 KERNEL_DISPATCH SUMMARY:

| NAME | DOMAIN | CALLS | DURATION (nsec) | AVERAGE (nsec) | PERCENT (INC) | MIN (nsec) | MAX (nsec) | STDDEV |
|---------------------------------------------------------------------------|-----------------|-----------------|-----------------|-----------------|---------------|-----------------|-----------------|-----------------|
| void addition_kernel<float>(float*, float const*, float const*, int, int) | KERNEL_DISPATCH | 8 | 184324 | 2.304e+04 | 40.681542 | 11200 | 98802 | 3.062e+04 |
| divide_kernel(float*, float const*, float const*, int, int) | KERNEL_DISPATCH | 8 | 94482 | 1.181e+04 | 20.852811 | 10240 | 13520 | 1.061e+03 |
| multiply_kernel(float*, float const*, float const*, int, int) | KERNEL_DISPATCH | 8 | 91763 | 1.147e+04 | 20.252709 | 9800 | 12800 | 9.417e+02 |
| subtract_kernel(float*, float const*, float const*, int, int) | KERNEL_DISPATCH | 8 | 82521 | 1.032e+04 | 18.212938 | 8320 | 12920 | 1.436e+03 |

ROCPROFV3 MEMORY_COPY SUMMARY:

| NAME | DOMAIN | CALLS | DURATION (nsec) | AVERAGE (nsec) | PERCENT (INC) | MIN (nsec) | MAX (nsec) | STDDEV |
|------------------------------------------|--------------|-----------------|-----------------|-----------------|---------------|-----------------|-----------------|-----------------|
| MEMORY_COPY_HOST_TO_DEVICE | MEMORY_COPY | 16 | 3691929 | 2.307e+05 | 85.494053 | 74842 | 284487 | 6.265e+04 |
| MEMORY_COPY_DEVICE_TO_HOST | MEMORY_COPY | 8 | 626417 | 7.830e+04 | 14.505947 | 74842 | 98603 | 8.207e+03 |

ROCPROFV3 MEMORY_ALLOCATION SUMMARY:

| NAME | DOMAIN | CALLS | DURATION (nsec) | AVERAGE (nsec) | PERCENT (INC) | MIN (nsec) | MAX (nsec) | STDDEV |
|------------------------------------------|-------------------|-----------------|-----------------|-----------------|---------------|-----------------|-----------------|-----------------|
| MEMORY_ALLOCATION_ALLOCATE | MEMORY_ALLOCATION | 67 | 26314096 | 3.927e+05 | 83.661617 | 950 | 856812 | 1.785e+05 |
| MEMORY_ALLOCATION_FREE | MEMORY_ALLOCATION | 72 | 5138913 | 7.137e+04 | 16.338383 | 20 | 166234 | 3.882e+04 |

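
The columns in the per-domain summary tables above can be recomputed from raw durations. The sketch below is one way to do that, not the rocprofv3 implementation: `summarize` and `percent_of_total` are hypothetical helpers. It uses the sample (n-1) standard deviation, which is an inference from the two-call rows: `hsa_isa_get_info_alt` has MIN 2490, MAX 27901, and STDDEV 1.797e+04. Percentages are taken as each row's share of its domain total, which matches the MEMORY_COPY rows.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class SummaryRow:
    name: str
    calls: int
    total: int
    average: float
    minimum: int
    maximum: int
    stddev: float


def summarize(name: str, durations: Sequence[int]) -> SummaryRow:
    """Aggregate per-call durations (nsec) into one summary row."""
    calls = len(durations)
    if calls == 0:
        raise ValueError("at least one duration is required")
    total = sum(durations)
    mean = total / calls
    if calls > 1:
        # Sample standard deviation (divides by n - 1).
        variance = sum((d - mean) ** 2 for d in durations) / (calls - 1)
        stddev = math.sqrt(variance)
    else:
        stddev = 0.0  # single-call rows show 0.000e+00 in the tables
    return SummaryRow(name, calls, total, mean, min(durations), max(durations), stddev)


def percent_of_total(totals: Dict[str, int]) -> Dict[str, float]:
    """Express each total duration as a percentage of the sum of all totals."""
    grand_total = sum(totals.values())
    return {name: 100.0 * value / grand_total for name, value in totals.items()}


row = summarize("hsa_isa_get_info_alt", [2490, 27901])
print(row.calls, row.total, f"{row.average:.3e}", f"{row.stddev:.3e}")

shares = percent_of_total({
    "MEMORY_COPY_HOST_TO_DEVICE": 3691929,
    "MEMORY_COPY_DEVICE_TO_HOST": 626417,
})
print({name: round(value, 6) for name, value in shares.items()})
```

The two durations passed to `summarize` are the MIN and MAX of a two-call row, so they are the only values that row can contain; rows with more calls would need the per-call durations from the trace files.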
ROCPROFV3 SUMMARY:

| NAME | DOMAIN | CALLS | DURATION (nsec) | AVERAGE (nsec) | PERCENT (INC) | MIN (nsec) | MAX (nsec) | STDDEV |
|---------------------------------------------------------------------------|-------------------|-----------------|-----------------|-----------------|---------------|-----------------|-----------------|-----------------|
| hipStreamCreateWithFlags | HIP_API | 8 | 406507215 | 5.081e+07 | 41.569873 | 735979 | 233800881 | 7.889e+07 |
| hsa_queue_create | HSA_API | 4 | 280077621 | 7.002e+07 | 28.641044 | 55026812 | 113288760 | 2.885e+07 |
| hipGetDeviceCount | HIP_API | 1 | 76707894 | 7.671e+07 | 7.844233 | 76707894 | 76707894 | 0.000e+00 |
| hipMemcpyAsync | HIP_API | 24 | 56109444 | 2.338e+06 | 5.737813 | 11640 | 55299811 | 1.128e+07 |
| hsa_amd_memory_async_copy_on_engine | HSA_API | 24 | 55617052 | 2.317e+06 | 5.687461 | 7580 | 55195188 | 1.126e+07 |
| hsa_amd_memory_pool_allocate | HSA_API | 67 | 26428438 | 3.945e+05 | 2.702601 | 1510 | 857592 | 1.782e+05 |
| MEMORY_ALLOCATION_ALLOCATE | MEMORY_ALLOCATION | 67 | 26314096 | 3.927e+05 | 2.690908 | 950 | 856812 | 1.785e+05 |
| hipHostMalloc | HIP_API | 24 | 13007523 | 5.420e+05 | 1.330164 | 416631 | 866382 | 1.206e+05 |
| hipMallocAsync | HIP_API | 24 | 7304847 | 3.044e+05 | 0.747002 | 275397 | 353719 | 2.207e+04 |
| hsa_amd_memory_pool_free | HSA_API | 72 | 5176173 | 7.189e+04 | 0.529321 | 290 | 170374 | 3.903e+04 |
| MEMORY_ALLOCATION_FREE | MEMORY_ALLOCATION | 72 | 5138913 | 7.137e+04 | 0.525511 | 20 | 166234 | 3.882e+04 |
| MEMORY_COPY_HOST_TO_DEVICE | MEMORY_COPY | 16 | 3691929 | 2.307e+05 | 0.377541 | 74842 | 284487 | 6.265e+04 |
| hipHostFree | HIP_API | 24 | 2786484 | 1.161e+05 | 0.284949 | 72242 | 221646 | 4.606e+04 |
| hipStreamDestroy | HIP_API | 8 | 2137924 | 2.672e+05 | 0.218626 | 221596 | 377469 | 5.489e+04 |
| hipLaunchKernel | HIP_API | 32 | 2080214 | 6.501e+04 | 0.212725 | 8850 | 1608721 | 2.819e+05 |
| hipFree | HIP_API | 24 | 1572948 | 6.554e+04 | 0.160851 | 2130 | 186994 | 4.815e+04 |
|
||||
| hipStreamSynchronize | HIP_API | 24 | 1452706 | 6.053e+04 | 0.148555 | 20810 | 135803 | 3.469e+04 |
|
||||
| hsa_executable_freeze | HSA_API | 2 | 964125 | 4.821e+05 | 0.098592 | 437471 | 526654 | 6.306e+04 |
|
||||
| hsa_signal_wait_scacquire | HSA_API | 26 | 853122 | 3.281e+04 | 0.087241 | 2530 | 100782 | 3.394e+04 |
|
||||
| MEMORY_COPY_DEVICE_TO_HOST | MEMORY_COPY | 8 | 626417 | 7.830e+04 | 0.064058 | 74842 | 98603 | 8.207e+03 |
|
||||
| hsa_executable_load_agent_code_object | HSA_API | 2 | 616175 | 3.081e+05 | 0.063011 | 254476 | 361699 | 7.582e+04 |
|
||||
| hsa_amd_agents_allow_access | HSA_API | 35 | 430680 | 1.231e+04 | 0.044042 | 4830 | 55182 | 9.939e+03 |
|
||||
| hsa_signal_store_screlease | HSA_API | 56 | 381491 | 6.812e+03 | 0.039012 | 1560 | 41831 | 7.895e+03 |
|
||||
| __hipRegisterFunction | HIP_API | 4 | 294207 | 7.355e+04 | 0.030086 | 210 | 291807 | 1.455e+05 |
|
||||
| void addition_kernel<float>(float*, float const*, float const*, int, int) | KERNEL_DISPATCH | 8 | 184324 | 2.304e+04 | 0.018849 | 11200 | 98802 | 3.062e+04 |
|
||||
| hsa_signal_create | HSA_API | 107 | 160889 | 1.504e+03 | 0.016453 | 80 | 5650 | 1.475e+03 |
|
||||
| hsa_code_object_reader_create_from_memory | HSA_API | 2 | 151314 | 7.566e+04 | 0.015474 | 32121 | 119193 | 6.157e+04 |
|
||||
| hsa_signal_load_relaxed | HSA_API | 1296 | 137626 | 1.062e+02 | 0.014074 | 20 | 2930 | 2.712e+02 |
|
||||
| hsa_signal_destroy | HSA_API | 618 | 111224 | 1.800e+02 | 0.011374 | 40 | 1540 | 2.429e+02 |
|
||||
| divide_kernel(float*, float const*, float const*, int, int) | KERNEL_DISPATCH | 8 | 94482 | 1.181e+04 | 0.009662 | 10240 | 13520 | 1.061e+03 |
|
||||
| multiply_kernel(float*, float const*, float const*, int, int) | KERNEL_DISPATCH | 8 | 91763 | 1.147e+04 | 0.009384 | 9800 | 12800 | 9.417e+02 |
|
||||
| subtract_kernel(float*, float const*, float const*, int, int) | KERNEL_DISPATCH | 8 | 82521 | 1.032e+04 | 0.008439 | 8320 | 12920 | 1.436e+03 |
|
||||
| hsa_agent_get_info | HSA_API | 65 | 77472 | 1.192e+03 | 0.007922 | 30 | 47121 | 6.341e+03 |
|
||||
| hsa_amd_signal_create | HSA_API | 512 | 61290 | 1.197e+02 | 0.006268 | 40 | 930 | 1.559e+02 |
|
||||
| hsa_amd_signal_async_handler | HSA_API | 24 | 52641 | 2.193e+03 | 0.005383 | 1180 | 4020 | 9.252e+02 |
|
||||
| hsa_executable_iterate_symbols | HSA_API | 14 | 52521 | 3.752e+03 | 0.005371 | 2740 | 6940 | 1.105e+03 |
|
||||
| hipDeviceSynchronize | HIP_API | 4 | 50663 | 1.267e+04 | 0.005181 | 510 | 23621 | 9.554e+03 |
|
||||
| hsa_amd_memory_copy_engine_status | HSA_API | 18 | 47370 | 2.632e+03 | 0.004844 | 260 | 7990 | 2.274e+03 |
|
||||
| __hipRegisterFatBinary | HIP_API | 1 | 43811 | 4.381e+04 | 0.004480 | 43811 | 43811 | 0.000e+00 |
|
||||
| hsa_iterate_agents | HSA_API | 1 | 41391 | 4.139e+04 | 0.004233 | 41391 | 41391 | 0.000e+00 |
|
||||
| hsa_executable_create_alt | HSA_API | 2 | 40470 | 2.024e+04 | 0.004139 | 7530 | 32940 | 1.797e+04 |
|
||||
| hsa_isa_get_info_alt | HSA_API | 2 | 30391 | 1.520e+04 | 0.003108 | 2490 | 27901 | 1.797e+04 |
|
||||
| hsa_signal_silent_store_relaxed | HSA_API | 48 | 24920 | 5.192e+02 | 0.002548 | 20 | 4570 | 7.120e+02 |
|
||||
| hsa_amd_agent_iterate_memory_pools | HSA_API | 5 | 20221 | 4.044e+03 | 0.002068 | 2561 | 8600 | 2.574e+03 |
|
||||
| hsa_queue_add_write_index_screlease | HSA_API | 56 | 7270 | 1.298e+02 | 0.000743 | 30 | 2310 | 3.471e+02 |
|
||||
| __hipPushCallConfiguration | HIP_API | 32 | 6250 | 1.953e+02 | 0.000639 | 60 | 3640 | 6.308e+02 |
|
||||
| hsa_amd_profiling_set_profiler_enabled | HSA_API | 4 | 5600 | 1.400e+03 | 0.000573 | 1370 | 1470 | 4.690e+01 |
|
||||
| hsa_executable_symbol_get_info | HSA_API | 152 | 5470 | 3.599e+01 | 0.000559 | 30 | 340 | 3.563e+01 |
|
||||
| __hipPopCallConfiguration | HIP_API | 32 | 4780 | 1.494e+02 | 0.000489 | 60 | 2520 | 4.340e+02 |
|
||||
| hsa_queue_load_read_index_relaxed | HSA_API | 56 | 4560 | 8.143e+01 | 0.000466 | 20 | 1310 | 1.863e+02 |
|
||||
| hsa_executable_get_symbol_by_name | HSA_API | 14 | 4500 | 3.214e+02 | 0.000460 | 110 | 1510 | 4.732e+02 |
|
||||
| hipGetLastError | HIP_API | 32 | 4471 | 1.397e+02 | 0.000457 | 60 | 2381 | 4.092e+02 |
|
||||
| hsa_queue_load_read_index_scacquire | HSA_API | 56 | 3040 | 5.429e+01 | 0.000311 | 30 | 690 | 8.705e+01 |
|
||||
| hipSetDevice | HIP_API | 1 | 2570 | 2.570e+03 | 0.000263 | 2570 | 2570 | 0.000e+00 |
|
||||
| hsa_amd_memory_pool_get_info | HSA_API | 43 | 1770 | 4.116e+01 | 0.000181 | 30 | 270 | 3.640e+01 |
|
||||
| hsa_system_get_info | HSA_API | 4 | 1750 | 4.375e+02 | 0.000179 | 40 | 830 | 3.544e+02 |
|
||||
| hsa_amd_agent_memory_pool_get_info | HSA_API | 13 | 1140 | 8.769e+01 | 0.000117 | 30 | 640 | 1.664e+02 |
|
||||
| hsa_agent_iterate_isas | HSA_API | 1 | 700 | 7.000e+02 | 0.000072 | 700 | 700 | 0.000e+00 |
|
||||
| hsa_system_get_major_extension_table | HSA_API | 1 | 190 | 1.900e+02 | 0.000019 | 190 | 190 | 0.000e+00 |
|
||||
|
||||
@@ -190,6 +190,7 @@ HIP trace
|
||||
+++++++++++
|
||||
|
||||
HIP trace comprises execution traces for the entire application at the HIP level. This includes HIP API functions and their asynchronous activities at the runtime level. In general, HIP APIs directly interact with the user program. It is easier to analyze HIP traces as you can directly map them to the program.
|
||||
Unlike previous iterations of `rocprof`, this option does not enable kernel tracing, memory copy tracing, and so on. To collect those traces as well, specify the corresponding options explicitly.
|
||||
|
||||
To trace HIP runtime APIs, use:
|
||||
|
||||
@@ -205,12 +206,15 @@ The preceding command generates a ``hip_api_trace.csv`` file prefixed with the p
|
||||
|
||||
Here are the contents of ``hip_api_trace.csv`` file:
|
||||
|
||||
.. csv-table:: HIP runtime api trace
|
||||
:file: /data/hip_compile_trace.csv
|
||||
.. csv-table:: HIP api trace
|
||||
:file: /data/hip_trace.csv
|
||||
:widths: 10,10,10,10,10,20,20
|
||||
:header-rows: 1
|
||||
|
||||
To trace HIP compile time APIs, use:
|
||||
|
||||
`rocprofv3` provides options to collect traces at a more granular level. For HIP, you can collect traces for the compile-time APIs and the runtime APIs separately.
|
||||
|
||||
To collect HIP compile time API traces, use:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
@@ -229,6 +233,26 @@ Here are the contents of ``hip_api_trace.csv`` file:
|
||||
:widths: 10,10,10,10,10,20,20
|
||||
:header-rows: 1
|
||||
|
||||
|
||||
To collect HIP runtime API traces, use:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
rocprofv3 --hip-runtime-trace -- <application_path>
|
||||
|
||||
The preceding command generates a ``hip_api_trace.csv`` file prefixed with the process ID.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ cat 208_hip_api_trace.csv
|
||||
|
||||
Here are the contents of ``hip_api_trace.csv`` file:
|
||||
|
||||
.. csv-table:: HIP runtime api trace
|
||||
:file: /data/hip_runtime_trace.csv
|
||||
:widths: 10,10,10,10,10,20,20
|
||||
:header-rows: 1
|
||||
|
||||
For the description of the fields in the output file, see :ref:`output-file-fields`.
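Because every trace file is prefixed with the process ID, a run that involves several processes leaves several ``*_hip_api_trace.csv`` files behind. The following is a minimal sketch for listing those process IDs. The helper name ``list_trace_pids`` and the ``./out`` directory are hypothetical and chosen only for illustration; they are not part of ``rocprofv3``.

```shell
# list_trace_pids DIR KIND
# Prints the process-ID prefix of every "<pid>_<KIND>.csv" file found
# under DIR, one per line, sorted numerically.
# (Hypothetical helper; only the "<pid>_" naming scheme comes from the docs.)
list_trace_pids() {
    dir=$1
    kind=$2
    find "$dir" -type f -name "*_${kind}.csv" |
        sed -e 's|.*/||' -e "s|_${kind}\.csv\$||" |
        sort -n
}

# Example, assuming the traces were written somewhere under ./out:
# list_trace_pids ./out hip_api_trace
```

The same helper works for the other trace kinds mentioned in this document, for example ``hsa_api_trace`` or ``kernel_trace``.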
|
||||
|
||||
HSA trace
|
||||
@@ -255,6 +279,28 @@ Here are the contents of ``hsa_api_trace.csv`` file:
|
||||
:widths: 10,10,10,10,10,20,20
|
||||
:header-rows: 1
|
||||
|
||||
|
||||
`rocprofv3` provides options to collect traces at a more granular level. HSA traces can be collected separately for four API domains: `HSA_AMD_EXT_API`, `HSA_CORE_API`, `HSA_IMAGE_EXT_API` and `HSA_FINALIZE_EXT_API`.
|
||||
|
||||
To collect HSA core API traces, use:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
rocprofv3 --hsa-core-trace -- <application_path>
|
||||
|
||||
The preceding command generates a ``hsa_api_trace.csv`` file prefixed with the process ID. Note that the contents of this file have been truncated for demonstration purposes.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ cat 197_hsa_api_trace.csv
|
||||
|
||||
Here are the contents of ``hsa_api_trace.csv`` file:
|
||||
|
||||
.. csv-table:: HSA core api trace
|
||||
:file: /data/hsa_core_api_trace.csv
|
||||
:widths: 10,10,10,10,10,20,20
|
||||
:header-rows: 1
|
||||
|
||||
For the description of the fields in the output file, see :ref:`output-file-fields`.
|
||||
|
||||
Marker trace
|
||||
@@ -268,51 +314,6 @@ Marker trace
|
||||
|
||||
To see usage of ``ROCTx`` or marker library, see :ref:`using-rocprofiler-sdk-roctx`.
|
||||
|
||||
Kernel rename
|
||||
++++++++++++++
|
||||
|
||||
The ``roctxRangePush`` and ``roctxRangePop`` also let you rename the enclosed kernel with the supplied message. In the legacy ``rocprof``, this functionality was known as ``--roctx-rename``.
|
||||
|
||||
See how to use ``roctxRangePush`` and ``roctxRangePop`` for renaming the enclosed kernel:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
#include <rocprofiler-sdk-roctx/roctx.h>
|
||||
|
||||
roctxRangePush("HIP_Kernel-1");
|
||||
|
||||
// Launching kernel from host
|
||||
hipLaunchKernelGGL(matrixTranspose, dim3(WIDTH/THREADS_PER_BLOCK_X, WIDTH/THREADS_PER_BLOCK_Y), dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y), 0,0,gpuTransposeMatrix,gpuMatrix, WIDTH);
|
||||
|
||||
// Memory transfer from device to host
|
||||
roctxRangePush("hipMemCpy-DeviceToHost");
|
||||
|
||||
hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost);
|
||||
|
||||
roctxRangePop(); // for "hipMemcpy"
|
||||
roctxRangePop(); // for "hipLaunchKernel"
|
||||
roctxRangeStop(rangeId);
|
||||
|
||||
To rename the kernel, use:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
rocprofv3 --marker-trace --kernel-rename -- <application_path>
|
||||
|
||||
The preceding command generates a ``marker-trace`` file prefixed with the process ID.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ cat 210_marker_api_trace.csv
|
||||
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
|
||||
"MARKER_CORE_API","roctxGetThreadId",315155,315155,2,58378843928406,58378843930247
|
||||
"MARKER_CONTROL_API","roctxProfilerPause",315155,315155,3,58378844627184,58378844627502
|
||||
"MARKER_CONTROL_API","roctxProfilerResume",315155,315155,4,58378844638601,58378844639267
|
||||
"MARKER_CORE_API","pre-kernel-launch",315155,315155,5,58378844641787,58378844641787
|
||||
"MARKER_CORE_API","post-kernel-launch",315155,315155,6,58378844936586,58378844936586
|
||||
"MARKER_CORE_API","memCopyDth",315155,315155,7,58378844938371,58378851383270
|
||||
"MARKER_CORE_API","HIP_Kernel-1",315155,315155,1,58378526575735,58378851384485
|
||||
|
||||
Kokkos trace
|
||||
++++++++++++++
|
||||
|
||||
@@ -362,7 +363,7 @@ For the description of the fields in the output file, see :ref:`output-file-fiel
|
||||
Memory copy trace
|
||||
+++++++++++++++++++
|
||||
|
||||
To trace memory moves across the application, use:
|
||||
Memory copy traces track the `hipMemcpy` and `hipMemcpyAsync` functions, which use the `hsa_amd_memory_async_copy_on_engine` HSA function internally. To trace memory moves across the application, use:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
@@ -431,7 +432,7 @@ The HSA runtime API is excluded because it is a lower-level API upon which HIP a
|
||||
thus, tends to be an implementation detail irrelevant to most users. Similarly, the HIP compiler API is also excluded for being an implementation detail as these functions are automatically inserted during HIP compilation.
|
||||
|
||||
``--runtime-trace`` traces the HIP runtime API, marker API, kernel dispatches, and
|
||||
memory operations (copies and scratch).
|
||||
memory operations (copies, allocations and scratch).
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
@@ -448,16 +449,33 @@ This is an all-inclusive option to collect HIP, HSA, kernel, memory copy, memory
|
||||
|
||||
rocprofv3 --sys-trace -- <application_path>
|
||||
|
||||
Running the above command generates ``hip_api_trace.csv``, ``hsa_api_trace.csv``, ``kernel_trace.csv``, ``memory_copy_trace.csv``, ``memory_allocation_trace.csv``, and ``marker_api_trace.csv`` (if files prefixed with the process ID.
|
||||
Running the above command generates ``hip_api_trace.csv``, ``hsa_api_trace.csv``, ``kernel_trace.csv``, ``memory_copy_trace.csv``, ``scratch_memory_trace.csv``, ``memory_allocation_trace.csv``, and ``marker_api_trace.csv`` (if ``ROCTx`` APIs are specified in the application).
|
||||
|
||||
Scratch memory trace
|
||||
++++++++++++++++++++++
|
||||
|
||||
This option collects scratch memory operation traces. Scratch is an address space on AMD GPUs roughly equivalent to the `local memory` in NVIDIA CUDA. The `local memory` in CUDA is a thread-local global memory with interleaved addressing, which is used for register spills or stack space. This option helps to trace when the ``rocr`` runtime allocates, frees, and tries to reclaim scratch memory.
|
||||
|
||||
To trace scratch memory allocations during the application run, use:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
rocprofv3 --scratch-memory-trace -- <application_path>
|
||||
rocprofv3 --scratch-memory-trace -- <app_path>
|
||||
|
||||
The preceding command generates a ``scratch_memory_trace.csv`` file prefixed with the process ID.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ cat 100_scratch_memory_trace.csv
|
||||
|
||||
Here are the contents of ``scratch_memory_trace.csv`` file:
|
||||
|
||||
.. csv-table:: Scratch memory trace
|
||||
:file: /data/scratch_memory_trace.csv
|
||||
:widths: 10,10,10,10,10,10,20,20
|
||||
:header-rows: 1
|
||||
|
||||
For the description of the fields in the output file, see :ref:`output-file-fields`.
|
||||
|
||||
RCCL trace
|
||||
++++++++++++
|
||||
@@ -607,6 +625,20 @@ To see a summary for ``MEMORY_COPY`` and ``HIP_API`` domains, use:
|
||||
|
||||
.. image:: /data/rocprofv3_hip_memcpy_summary.png
|
||||
|
||||
Summary output file
|
||||
######################
|
||||
|
||||
This option specifies the output file for the summary. By default, the summary is displayed on stderr. To specify a different output file, use:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
rocprofv3 -S -D --summary-output-file summary --sys-trace -- <application_path>
|
||||
|
||||
The preceding command generates a ``summary`` file displaying the summary of each domain. It also generates the files for the tracing types enabled by the ``--sys-trace`` option.
|
||||
|
||||
.. include:: /data/summary.txt
|
||||
:literal:
|
||||
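Once the summary is written to a file, it can be filtered with standard text tools. The sketch below assumes the file keeps the ``|``-separated column layout shown in the summary tables in this document (NAME, DOMAIN, CALLS, DURATION, AVERAGE, PERCENT, ...). The helper name ``summary_rows`` is hypothetical, and the sample rows are copied from the summary output above.

```shell
# summary_rows DOMAIN MIN_PERCENT
# Reads a rocprofv3 summary table on stdin and prints "<name> <percent>"
# for every row of the given domain whose PERCENT column is at least
# MIN_PERCENT. (Hypothetical helper; assumes the column order shown above.)
summary_rows() {
    awk -F'|' -v dom="$1" -v min="$2" '
        NF >= 10 {
            name = $2; domain = $3; pct = $7
            gsub(/^ +| +$/, "", name)      # trim column padding
            gsub(/^ +| +$/, "", domain)
            gsub(/^ +| +$/, "", pct)
            if (domain == dom && pct + 0 >= min + 0)
                printf "%s %s\n", name, pct
        }'
}

# Keep only HIP API calls that account for at least 1% of the total.
summary_rows HIP_API 1 <<'EOF'
| NAME | DOMAIN | CALLS | DURATION (nsec) | AVERAGE (nsec) | PERCENT (INC) | MIN (nsec) | MAX (nsec) | STDDEV |
| hipStreamCreateWithFlags | HIP_API | 8 | 406507215 | 5.081e+07 | 41.569873 | 735979 | 233800881 | 7.889e+07 |
| hsa_queue_create | HSA_API | 4 | 280077621 | 7.002e+07 | 28.641044 | 55026812 | 113288760 | 2.885e+07 |
| hipGetDeviceCount | HIP_API | 1 | 76707894 | 7.671e+07 | 7.844233 | 76707894 | 76707894 | 0.000e+00 |
| hipMallocAsync | HIP_API | 24 | 7304847 | 3.044e+05 | 0.747002 | 275397 | 353719 | 2.207e+04 |
EOF
```

With the sample rows, only ``hipStreamCreateWithFlags`` and ``hipGetDeviceCount`` pass the filter; the header row and rows from other domains are skipped because their DOMAIN column does not match.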
|
||||
Collecting traces using input file
|
||||
++++++++++++++++++++++++++++++++++++
|
||||
|
||||
@@ -931,6 +963,36 @@ Here is the same sample in JSON format:
|
||||
]
|
||||
}
|
||||
|
||||
Perfetto visualization
|
||||
-----------------------
|
||||
|
||||
Perfetto visualization for traces
|
||||
+++++++++++++++++++++++++++++++++++++++++++++
|
||||
|
||||
Users can generate Perfetto trace files using the ``--output-format pftrace`` option. This allows users to visualize the traces in the Perfetto viewer.
|
||||
Perfetto is a powerful open-source tracing tool that provides a comprehensive view of system performance. It allows you to visualize the collected traces in a user-friendly interface, making it easier to analyze and understand the performance characteristics of your application.
|
||||
To generate a Perfetto trace file, use the ``--output-format pftrace`` option along with the desired tracing options. For example, to collect system traces and generate a Perfetto trace file, use:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
rocprofv3 --sys-trace --output-format pftrace -- <application_path>
|
||||
|
||||
The generated Perfetto trace file can be opened in the Perfetto UI (https://ui.perfetto.dev/).
|
||||
|
||||
**Figure 1:** Generic perfetto visualization
|
||||
|
||||
.. image:: /data/perfetto_generic.png
|
||||
:width: 100%
|
||||
:align: center
|
||||
|
||||
|
||||
**Figure 2:** Visualization of ROCm flow data in Perfetto
|
||||
|
||||
.. image:: /data/perfetto_flow.png
|
||||
:width: 100%
|
||||
:align: center
|
||||
|
||||
|
||||
Perfetto visualization for counter collection
|
||||
+++++++++++++++++++++++++++++++++++++++++++++
|
||||
|
||||
@@ -940,18 +1002,22 @@ To generate a Perfetto trace file with counter data, use:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
rocprofv3 --pmc SQ_WAVES GRBM_GUI_ACTIVE --output-format pftrace -- <application_path>
|
||||
|
||||
You can also combine this with other tracing options to correlate counter data with API and kernel execution:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
rocprofv3 -s --pmc SQ_WAVES --output-format pftrace -- <application_path>
|
||||
rocprofv3 --pmc SQ_WAVES GRBM_COUNT --output-format pftrace -- <application_path>
|
||||
|
||||
The generated Perfetto trace file can be opened in the Perfetto UI (https://ui.perfetto.dev/). In the viewer, performance counters will appear as counter tracks organized by agent, allowing you to visualize counter values changing over time alongside kernel executions and other traced activities.
|
||||
|
||||
You can also combine this with the system trace option to get a more comprehensive view of the system's performance. For example, the following command collects both system trace and performance counter data:
|
||||
|
||||
.. code-block:: bash
|
||||
rocprofv3 --pmc SQ_WAVES GRBM_COUNT --sys-trace --output-format pftrace -- <application_path>
|
||||
|
||||
.. image:: /data/perfetto_counters.png
|
||||
:width: 100%
|
||||
:align: center
|
||||
|
||||
|
||||
Agent info
|
||||
++++++++++++
|
||||
-----------
|
||||
|
||||
.. note::
|
||||
All tracing and counter collection options generate an additional ``agent_info.csv`` file prefixed with the process ID.
|
||||
@@ -966,6 +1032,125 @@ The ``agent_info.csv`` file contains information about the CPU or GPU the kernel
|
||||
0,0,"CPU",24,0,0,0,0,0,0,0,0,1,24,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3800,0,0,0,0,0,0,23,0,0,0,0,0,0,0,0,0,0,0,"AMD Ryzen 9 3900X 12-Core Processor","CPU","AMD Ryzen 9 3900X 12-Core Processor",""
|
||||
1,1,"GPU",0,256,0,2147487744,10,64,0,64,64,1,64,4,4,1,16,4,32,90000,4098,26751,12032,0,128,2,0,2,24,3800,1630,432,440,138420864,16,40,141,1024,4294967295,0,0,64700,1024,1024,1024,4294967295,4294967295,4294967295,"gfx900","AMD","Radeon RX Vega","vega10"
|
||||
|
||||
Advanced options
|
||||
-----------------
|
||||
|
||||
Agent Index
|
||||
++++++++++++++
|
||||
|
||||
The agent index is a unique identifier for each agent in the system and is used to identify the agent in the output files. Because each runtime or tool represents agent indices in its own way, `rocprofv3` lets you configure how the agent index appears in the output files.
|
||||
|
||||
- **absolute** == *node_id*: Absolute index of the agent, regardless of cgroups masking. This is a monotonically increasing number that is incremented for every folder in `/sys/class/kfd/kfd/topology/nodes/`. For example, Agent-0, Agent-2, Agent-4.
- **relative** == *logical_node_id*: Relative index of the agent, accounting for cgroups masking. This is a monotonically increasing number that is incremented for every folder in `/sys/class/kfd/kfd/topology/nodes/` whose properties file is non-empty. For example, Agent-0, Agent-1, Agent-2.
- **type-relative** == *logical_node_type_id*: Relative index of the agent, accounting for cgroups masking, where indexing starts at zero for each agent type. For example, CPU-0, GPU-0, GPU-1.
|
||||
|
||||
|
||||
To set the agent index in the output files, use the ``--agent-index`` option. The default value is ``absolute``.
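The relationship between the three conventions can be illustrated with a small simulation. This is only a sketch of the numbering rules described above, not how ``rocprofv3`` computes the indices internally. The input format (``node_id type visible``) and the helper name ``agent_indices`` are assumptions made for this example; ``visible`` stands in for "the node's properties file is non-empty".

```shell
# agent_indices
# Reads "node_id type visible" lines on stdin and prints, for each visible
# agent, its absolute, relative, and type-relative label.
# (Hypothetical helper that mimics the numbering rules described above.)
agent_indices() {
    awk '{
        node_id = $1; type = $2; visible = $3
        if (visible != 1) next            # masked agents get no logical index
        printf "Agent-%d Agent-%d %s-%d\n",
               node_id, logical++, type, per_type[type]++
    }'
}

# Five topology nodes, two of which are masked out:
agent_indices <<'EOF'
0 CPU 1
1 GPU 0
2 GPU 1
3 GPU 0
4 GPU 1
EOF
```

For this input, the three columns reproduce the examples from the list above: Agent-0, Agent-2, Agent-4 (absolute); Agent-0, Agent-1, Agent-2 (relative); and CPU-0, GPU-0, GPU-1 (type-relative).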
|
||||
|
||||
The following example shows how to set the agent index on a system with multiple GPUs and CPUs:
|
||||
|
||||
Here is the `rocm-smi` output:
|
||||
|
||||
.. include:: /data/rocm-smi.txt
|
||||
:literal:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
rocprofv3 --kernel-trace --agent-index=relative -- <application_path>
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ cat kernel_trace.csv
|
||||
|
||||
"Kind","Agent_Id","Queue_Id","Stream_Id","Thread_Id","Dispatch_Id","Kernel_Id","Kernel_Name","Correlation_Id","Start_Timestamp","End_Timestamp","Private_Segment_Size","Group_Segment_Size","Workgroup_Size_X","Workgroup_Size_Y","Workgroup_Size_Z","Grid_Size_X","Grid_Size_Y","Grid_Size_Z"
|
||||
"KERNEL_DISPATCH","Agent 7",1,2,15044,1,17,"void addition_kernel<float>(float*, float const*, float const*, int, int)",1,1671247151691610,1671247151718010,0,0,64,1,1,1024,1024,1
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
rocprofv3 --kernel-trace --agent-index=type-relative -- <application_path>
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ cat kernel_trace.csv
|
||||
|
||||
"Kind","Agent_Id","Queue_Id","Stream_Id","Thread_Id","Dispatch_Id","Kernel_Id","Kernel_Name","Correlation_Id","Start_Timestamp","End_Timestamp","Private_Segment_Size","Group_Segment_Size","Workgroup_Size_X","Workgroup_Size_Y","Workgroup_Size_Z","Grid_Size_X","Grid_Size_Y","Grid_Size_Z"
|
||||
"KERNEL_DISPATCH","GPU 3",1,2,15056,1,17,"void addition_kernel<float>(float*, float const*, float const*, int, int)",1,1671390884499766,1671390884525686,0,0,64,1,1,1024,1024,1
|
||||
|
||||
|
||||
Group by queue
|
||||
++++++++++++++++++
|
||||
|
||||
By default, `rocprofv3` shows which HIP streams kernel and memory copy operations were submitted to when outputting a perfetto trace. The ``--group-by-queue`` option allows users to display the HSA queues that these kernel and memory operations were submitted to instead.
|
||||
|
||||
.. image:: /data/streams_pftrace.png
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
rocprofv3 -s --group-by-queue --output-format pftrace -- <application_path>
|
||||
|
||||
The above command generates a ``pftrace`` file with the kernel and memory copy operations grouped into HSA queues rather than HIP streams.
|
||||
|
||||
.. image:: /data/streams_pftrace_grouped.png
|
||||
|
||||
|
||||
Kernel naming and Filtering
|
||||
----------------------------
|
||||
|
||||
Kernel name mangling
|
||||
++++++++++++++++++++++
|
||||
|
||||
By default, `rocprofv3` demangles kernel names in its output, which improves the readability of the collected data.
|
||||
|
||||
If you want to see the mangled kernel names, you can disable this feature by using the ``--mangled-kernels`` option.
|
||||
|
||||
By default the kernel trace would look like this:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ cat 123_kernel_trace.csv
|
||||
|
||||
"Kind","Agent_Id","Queue_Id","Stream_Id","Thread_Id","Dispatch_Id","Kernel_Id","Kernel_Name","Correlation_Id","Start_Timestamp","End_Timestamp","Private_Segment_Size","Group_Segment_Size","Workgroup_Size_X","Workgroup_Size_Y","Workgroup_Size_Z","Grid_Size_X","Grid_Size_Y","Grid_Size_Z"
|
||||
"KERNEL_DISPATCH","Agent 4",1,2,123,1,11,"addition_kernel<float>(float*, float const*, float const*, int, int)",1,1143263119836160,1143263119848360,0,0,64,1,1,1024,1024,1
|
||||
"KERNEL_DISPATCH","Agent 4",2,3,123,6,14,"subtract_kernel(float*, float const*, float const*, int, int)",6,1143263119886120,1143263119896040,0,0,64,1,1,1024,1024,1
|
||||
"KERNEL_DISPATCH","Agent 4",2,3,123,5,11,"addition_kernel<float>(float*, float const*, float const*, int, int)",5,1143263119876240,1143263119886120,0,0,64,1,1,1024,1024,1
|
"KERNEL_DISPATCH","Agent 4",1,2,123,4,12,"divide_kernel(float*, float const*, float const*, int, int)",4,1143263119866920,1143263119878960,0,0,64,1,1,1024,1024,1
|
||||
|
||||
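Demangled kernel names contain commas, so a kernel trace cannot be split naively on ``,``. One way around this, shown here as a sketch, is to split each row on the double quote instead: the kernel name is then the sixth field and the timestamps follow in the seventh. The helper name ``kernel_totals`` is hypothetical, the column positions are taken from the header shown above, and the sample rows come from the preceding output.

```shell
# kernel_totals
# Reads a kernel_trace.csv on stdin and prints "<kernel>|<calls>|<total nsec>"
# per kernel name, sorted by name. (Hypothetical helper; assumes the column
# order of the header shown above.)
kernel_totals() {
    awk -F'"' 'NR > 1 {
        # $6 is the quoted Kernel_Name; $7 holds the unquoted columns that
        # follow it: ",Correlation_Id,Start_Timestamp,End_Timestamp,..."
        split($7, t, ",")
        calls[$6]++
        total[$6] += t[4] - t[3]
    }
    END {
        for (k in calls) printf "%s|%d|%d\n", k, calls[k], total[k]
    }' | LC_ALL=C sort
}

kernel_totals <<'EOF'
"Kind","Agent_Id","Queue_Id","Stream_Id","Thread_Id","Dispatch_Id","Kernel_Id","Kernel_Name","Correlation_Id","Start_Timestamp","End_Timestamp","Private_Segment_Size","Group_Segment_Size","Workgroup_Size_X","Workgroup_Size_Y","Workgroup_Size_Z","Grid_Size_X","Grid_Size_Y","Grid_Size_Z"
"KERNEL_DISPATCH","Agent 4",1,2,123,1,11,"addition_kernel<float>(float*, float const*, float const*, int, int)",1,1143263119836160,1143263119848360,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",2,3,123,6,14,"subtract_kernel(float*, float const*, float const*, int, int)",6,1143263119886120,1143263119896040,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH","Agent 4",2,3,123,5,11,"addition_kernel<float>(float*, float const*, float const*, int, int)",5,1143263119876240,1143263119886120,0,0,64,1,1,1024,1024,1
EOF
```

For the three sample rows, ``addition_kernel<float>`` is reported with 2 calls and 22080 nsec in total, and ``subtract_kernel`` with 1 call and 9920 nsec. The approach breaks down if a kernel name itself contains a double quote.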
To disable kernel name demangling, use:
|
||||
.. code-block:: shell
|
||||
|
||||
rocprofv3 --mangled-kernels --kernel-trace -- <application_path>
|
||||
|
||||
The above command generates a ``kernel_trace.csv`` file with the mangled kernel names.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ cat 123_kernel_trace.csv
|
||||
|
||||
"Kind","Agent_Id","Queue_Id","Stream_Id","Thread_Id","Dispatch_Id","Kernel_Id","Kernel_Name","Correlation_Id","Start_Timestamp","End_Timestamp","Private_Segment_Size","Group_Segment_Size","Workgroup_Size_X","Workgroup_Size_Y","Workgroup_Size_Z","Grid_Size_X","Grid_Size_Y","Grid_Size_Z"
|
||||
"KERNEL_DISPATCH","Agent 4",1,2,123,1,11,"_Z15addition_kernelIfEvPT_PKfS3_ii.kd",1,1143263119836160,1143263119848360,0,0,64,1,1,1024,1024,1
|
||||
"KERNEL_DISPATCH","Agent 4",2,3,123,6,14,"_Z15subtract_kernelPfPKfS1_ii.kd",6,1143263119886120,1143263119896040,0,0,64,1,1,1024,1024,1
|
||||
"KERNEL_DISPATCH","Agent 4",2,3,123,5,11,"_Z15addition_kernelIfEvPT_PKfS3_ii.kd",5,1143263119876240,1143263119886120,0,0,64,1,1,1024,1024,1
|
||||
"KERNEL_DISPATCH","Agent 4",1,2,123,4,12,"_Z13divide_kernelPfPKfS1_ii.kd",4,1143263119866920,1143263119878960,0,0,64,1,1,1024,1024,1
|
||||
|
||||
Kernel name truncation
|
||||
+++++++++++++++++++++++
|
||||
|
||||
Kernel name truncation is a feature that allows you to limit the length of kernel names in the output files. This is useful when dealing with long kernel names that can make the output files difficult to read.
|
||||
|
||||
To enable kernel name truncation, use the ``--truncate-kernels`` option.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
rocprofv3 --truncate-kernels --kernel-trace -- <application_path>
|
||||
|
||||
The above command generates a ``kernel_trace.csv`` file with truncated kernel names.
|
||||
|
||||
.. csv-table:: Kernel trace truncated
|
||||
:file: /data/kernel_trace_truncated.csv
|
||||
:widths: 10,10,10,10,10,10,10,10,10,20,20,10,10,10,10,10,10,10,10
|
||||
:header-rows: 1
|
||||
|
||||
|
||||
Kernel filtering
|
||||
+++++++++++++++++
|
||||
|
||||
@@ -995,8 +1180,55 @@ To collect counters for the kernels matching the filters specified in the preced
|
||||
3,3,4,1,225049,225049,1048576,11,"multiply_kernel(float*, float const*, float const*, int, int)",64,0,0,8,0,16,"SQ_WAVES",16384.000000,317095767176998,317095767186678
|
||||
4,4,4,1,225049,225049,1048576,12,"divide_kernel(float*, float const*, float const*, int, int)",64,0,0,12,4,16,"SQ_WAVES",16384.000000,317095767380718,317095767390878
|
||||
|
||||
|
||||
Kernel rename
|
||||
++++++++++++++
|
||||
|
||||
The ``roctxRangePush`` and ``roctxRangePop`` functions also let you rename the enclosed kernel with the supplied message. In the legacy ``rocprof``, this functionality was known as ``--roctx-rename``.

The following example shows how to use ``roctxRangePush`` and ``roctxRangePop`` to rename the enclosed kernel:
|
||||
|
||||
.. code-block:: cpp

    #include <rocprofiler-sdk-roctx/roctx.h>

    // Open the outer range; the enclosed kernel is renamed to this message
    roctxRangePush("HIP_Kernel-1");

    // Launching kernel from host
    hipLaunchKernelGGL(matrixTranspose, dim3(WIDTH/THREADS_PER_BLOCK_X, WIDTH/THREADS_PER_BLOCK_Y), dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y), 0, 0, gpuTransposeMatrix, gpuMatrix, WIDTH);

    // Memory transfer from device to host
    roctxRangePush("hipMemCpy-DeviceToHost");

    hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost);

    roctxRangePop();  // closes "hipMemCpy-DeviceToHost"
    roctxRangePop();  // closes "HIP_Kernel-1"

    // rangeId is assumed to come from an earlier roctxRangeStart() call (not shown)
    roctxRangeStop(rangeId);
|
||||
|
||||
To rename the kernel, use:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
rocprofv3 --marker-trace --kernel-rename -- <application_path>
|
||||
|
||||
The preceding command generates a ``marker_api_trace.csv`` file prefixed with the process ID.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ cat 210_marker_api_trace.csv
|
||||
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
|
||||
"MARKER_CORE_API","roctxGetThreadId",315155,315155,2,58378843928406,58378843930247
|
||||
"MARKER_CONTROL_API","roctxProfilerPause",315155,315155,3,58378844627184,58378844627502
|
||||
"MARKER_CONTROL_API","roctxProfilerResume",315155,315155,4,58378844638601,58378844639267
|
||||
"MARKER_CORE_API","pre-kernel-launch",315155,315155,5,58378844641787,58378844641787
|
||||
"MARKER_CORE_API","post-kernel-launch",315155,315155,6,58378844936586,58378844936586
|
||||
"MARKER_CORE_API","memCopyDth",315155,315155,7,58378844938371,58378851383270
|
||||
"MARKER_CORE_API","HIP_Kernel-1",315155,315155,1,58378526575735,58378851384485
|
||||
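The start and end timestamps in the marker trace can be turned into per-range durations with a short ``awk`` filter. This is a sketch that assumes the seven-column layout shown above and marker messages without embedded commas. The helper name ``marker_durations`` is hypothetical, and the sample rows are taken from the preceding output.

```shell
# marker_durations
# Reads a marker_api_trace.csv on stdin and prints "<function> <nsec>",
# where nsec is End_Timestamp - Start_Timestamp.
# (Hypothetical helper; assumes marker messages contain no commas.)
marker_durations() {
    awk -F',' 'NR > 1 {
        name = $2
        gsub(/"/, "", name)               # drop the CSV quoting
        printf "%s %d\n", name, $7 - $6
    }'
}

marker_durations <<'EOF'
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"MARKER_CORE_API","roctxGetThreadId",315155,315155,2,58378843928406,58378843930247
"MARKER_CORE_API","memCopyDth",315155,315155,7,58378844938371,58378851383270
"MARKER_CORE_API","HIP_Kernel-1",315155,315155,1,58378526575735,58378851384485
EOF
```

For these rows the filter reports 1841 nsec for ``roctxGetThreadId``, 6444899 nsec for ``memCopyDth``, and 324808750 nsec for the enclosing ``HIP_Kernel-1`` range.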
|
||||
|
||||
I/O control options
|
||||
++++++++++++++++++++
|
||||
--------------------
|
||||
|
||||
``rocprofv3`` provides the following options to control the output.
|
||||
|
||||
@@ -1106,6 +1338,53 @@ Output prefix keys are useful in multiple use cases but are most helpful when de
|
||||
* - ``%s``
|
||||
- Shorthand for ``%size%``
|
||||
|
||||
|
||||
Collection period
|
||||
+++++++++++++++++++
|
||||
|
||||
The collection period is the time interval during which the profiling data is collected. You can specify the collection period using the ``--collection-period`` or ``-p`` option.
|
||||
Users can specify multiple configurations, each defined by a triplet in the format `start_delay:collection_time:repeat`.
|
||||
|
||||
The triplet is defined as follows:
|
||||
|
||||
- **Start delay time**: The time after which the profiling data collection starts.
|
||||
- **Collection time**: The time for which the profiling data is collected.
|
||||
- **Repeat**: The number of times the cycle is repeated. A repeat of 0 indicates that the cycle repeats indefinitely.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
rocprofv3 --collection-period 5:1:0 --hip-trace -- <application_path>
|
||||
|
||||
|
||||
The above command starts collecting profiling data 5 seconds after the application starts and collects for 1 second. Because the repeat value is 0, this cycle repeats indefinitely.
|
||||
|
||||
The collection period can be specified in different units: hours, minutes, seconds, milliseconds, microseconds, or nanoseconds. The default unit is seconds. To change the unit, use the ``--collection-period-unit`` option.
|
||||
|
||||
The available units are:
|
||||
|
||||
`--collection-period-unit`: `hour`, `min`, `sec`, `msec`, `usec`, `nsec`
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
rocprofv3 --collection-period 5:1:0 --collection-period-unit msec --hip-trace -- <application_path>
|
||||
|
||||
The above command starts collecting profiling data 5 milliseconds after the application starts and collects for 1 millisecond. Because the repeat value is 0, this cycle repeats indefinitely.
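To make the triplet format easier to check before a long profiling run, the following sketch splits a ``start_delay:collection_time:repeat`` value and describes it in words. The helper name ``describe_period`` is hypothetical and not part of ``rocprofv3``; the treatment of a repeat value of 0 follows the description above.

```shell
# describe_period TRIPLET [UNIT]
# Splits "start_delay:collection_time:repeat" and prints a one-line
# description. UNIT defaults to "sec", the documented default unit.
# (Hypothetical helper for illustration only.)
describe_period() {
    spec=$1
    unit=${2:-sec}
    case $spec in
        *:*:*) ;;
        *) echo "expected start_delay:collection_time:repeat, got '$spec'" >&2
           return 1 ;;
    esac
    delay=${spec%%:*}        # text before the first colon
    rest=${spec#*:}          # text after the first colon
    collect=${rest%%:*}
    repeat=${rest#*:}
    if [ "$repeat" -eq 0 ]; then
        cycles="indefinitely"
    else
        cycles="$repeat time(s)"
    fi
    echo "wait $delay $unit, collect for $collect $unit, repeat $cycles"
}

describe_period 5:1:0
describe_period 5:1:2 msec
```

The first call describes the ``5:1:0`` value used in the commands above; the second shows a finite repeat count combined with a unit passed as it would be given to ``--collection-period-unit``.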
|
||||
|
||||
Perfetto-specific options
|
||||
++++++++++++++++++++++++++
|
||||
|
||||
The following options are specific to Perfetto tracing and are used to control the behavior of the Perfetto data collection:
|
||||
|
||||
- **--perfetto-buffer-fill-policy {discard,ring_buffer}**: Policy for handling new records when Perfetto has reached the buffer limit.
|
||||
- **RING_BUFFER (default)**: The buffer behaves like a ring buffer. Once it is full, new writes wrap around and replace the oldest trace data in the buffer.
|
||||
- **DISCARD**: The buffer stops accepting data once full. Further write attempts are dropped.
|
||||
|
||||
- **--perfetto-buffer-size KB**: Size of the buffer for Perfetto output, in KB. Default: 1 GB. If set, the tracing session stops after the specified amount of data has been written. Use this option to cap the size of the trace.
|
||||
|
||||
- **--perfetto-backend {inprocess,system}**: Perfetto data collection backend. The ``system`` mode requires starting the ``traced`` and ``perfetto`` daemons. By default, Perfetto keeps the full trace buffer(s) in memory.
|
||||
|
||||
- **--perfetto-shmem-size-hint KB**: Perfetto shared memory size hint, in KB. Default: 64 KB. This option controls the sizing of the shared memory buffer. Increase it to avoid data loss when data is produced at a high rate.
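The difference between the two ``--perfetto-buffer-fill-policy`` values can be illustrated with a small Python sketch. This is a toy model of the behavior described above, not Perfetto's implementation; the ``fill`` helper and the record values are hypothetical.

```python
from collections import deque


def fill(policy, capacity, records):
    """Return the buffer contents after writing records under the given fill policy."""
    if policy == "ring_buffer":
        # A bounded deque drops the oldest entry when a new one arrives at capacity.
        buffer = deque(maxlen=capacity)
        for record in records:
            buffer.append(record)
        return list(buffer)
    if policy == "discard":
        # Once the buffer is full, further writes are dropped.
        buffer = []
        for record in records:
            if len(buffer) < capacity:
                buffer.append(record)
        return buffer
    raise ValueError(f"unknown fill policy: {policy}")


records = [1, 2, 3, 4, 5]
print(fill("ring_buffer", 3, records))  # [3, 4, 5]
print(fill("discard", 3, records))      # [1, 2, 3]
```

In this model, ``ring_buffer`` keeps the most recent records, while ``discard`` keeps the earliest ones.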
|
||||
|
||||
.. _output-file-fields:
|
||||
|
||||
Output file fields
|
||||
|
||||
@@ -17,8 +17,15 @@ ROCprofiler-SDK is supported only on Linux. The following distributions are test
|
||||
|
||||
- Ubuntu 20.04
|
||||
- Ubuntu 22.04
|
||||
- OpenSUSE 15.4
|
||||
- RedHat 8.8
|
||||
- Ubuntu 24.04
|
||||
- OpenSUSE 15.5
|
||||
- OpenSUSE 15.6
|
||||
- Red Hat 8.8
|
||||
- Red Hat 8.9
|
||||
- Red Hat 8.10
|
||||
- Red Hat 9.2
|
||||
- Red Hat 9.3
|
||||
- Red Hat 9.4
|
||||
|
||||
ROCprofiler-SDK might operate as expected on other `Linux distributions <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-operating-systems>`_, but has not been tested.
|
||||
|
||||
@@ -56,8 +63,8 @@ Install `CMake <https://cmake.org/>`_ version 3.21 (or later).
|
||||
export PATH=${HOME}/.local/bin:${PATH}
|
||||
|
||||
|
||||
Building ROCprofiler-SDK
|
||||
------------------------
|
||||
Building ROCprofiler-SDK from source
|
||||
-------------------------------------
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
@@ -95,4 +102,30 @@ To run the built tests, ``cd`` into the ``rocprofiler-sdk-build`` directory and
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
/usr/local/bin/python -m pip install -r requirements.txt
|
||||
/usr/local/bin/python -m pip install -r requirements.txt
|
||||
|
||||
|
||||
Install via package manager
|
||||
----------------------------
|
||||
|
||||
If you have ROCm version 6.2 or later installed, you can use the package manager to install a pre-built copy of ROCprofiler-SDK.
|
||||
|
||||
.. tab-set::
|
||||
|
||||
.. tab-item:: Ubuntu
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ sudo apt install rocprofiler-sdk
|
||||
|
||||
.. tab-item:: Red Hat Enterprise Linux
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ sudo dnf install rocprofiler-sdk
|
||||
|
||||
.. tab-item:: SUSE Linux Enterprise Server
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ sudo zypper install rocprofiler-sdk
|
||||
@@ -4,7 +4,7 @@
|
||||
# Project related configuration options
|
||||
#---------------------------------------------------------------------------
|
||||
DOXYFILE_ENCODING = UTF-8
|
||||
PROJECT_NAME = ROCTx Developer API
|
||||
PROJECT_NAME = ROCTx developer API
|
||||
PROJECT_NUMBER = @ROCPROFILER_VERSION@
|
||||
PROJECT_BRIEF = "ROCm Profiling API and tools"
|
||||
PROJECT_LOGO =
|
||||
|
||||
@@ -4,7 +4,7 @@
|
||||
# Project related configuration options
|
||||
#---------------------------------------------------------------------------
|
||||
DOXYFILE_ENCODING = UTF-8
|
||||
PROJECT_NAME = @PROJECT_NAME@ Developer API
|
||||
PROJECT_NAME = @PROJECT_NAME@ developer API
|
||||
PROJECT_NUMBER = @ROCPROFILER_VERSION@
|
||||
PROJECT_BRIEF = "ROCm Profiling API and tools"
|
||||
PROJECT_LOGO =
|
||||
|
||||
@@ -21,7 +21,7 @@ message "Changing directory to ${WORK_DIR}"
|
||||
cd ${WORK_DIR}
|
||||
|
||||
message "Generating rocprofiler-sdk.dox"
|
||||
cmake -DSOURCE_DIR=${SOURCE_DIR} -DPROJECT_NAME="Rocprofiler SDK" -P ${WORK_DIR}/generate-doxyfile.cmake
|
||||
cmake -DSOURCE_DIR=${SOURCE_DIR} -DPROJECT_NAME="ROCprofiler-SDK" -P ${WORK_DIR}/generate-doxyfile.cmake
|
||||
|
||||
message "Generating doxygen xml files"
|
||||
mkdir -p _doxygen
|
||||
|
||||