[rocprofv3] Use -P for collection period shorthand option (#356)

* [rocprofv3] Use -P for collection period option

- Reserve -p for profiler attachment

* Update changelog

---------

Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>
This commit is contained in:
Madsen, Jonathan
2025-04-27 20:18:26 -05:00
committed by GitHub
parent 3580478426
commit d2bde3ce27
4 changed files with 88 additions and 70 deletions
+1 -1
@@ -177,11 +177,11 @@ Full documentation for ROCprofiler-SDK is available at [rocm.docs.amd.com/projec
- type-relative == logical_node_type_id
- Added MI300 stochastic (hardware-based) PC sampling support in ROCProfiler-SDK and ROCProfV3
### Changed
- SDK no longer creates a background thread when every tool returns a nullptr from `rocprofiler_configure`.
- Updated disassembly.hpp's vaddr-to-file-offset mapping to use the dedicated comgr API.
- rocprofv3 shorthand argument for `--collection-period` is now `-P` (upper-case) as `-p` (lower-case) is reserved for later use
### Resolved issues
+20 -2
@@ -200,6 +200,7 @@ For MPI applications (or other job launchers such as SLURM), place rocprofv3 ins
description="ROCProfilerV3 Run Script",
usage="%(prog)s [options] -- <application> [application options]",
epilog=usage_examples,
allow_abbrev=False,
formatter_class=format_help(argparse.RawTextHelpFormatter),
)
@@ -501,7 +502,7 @@ For MPI applications (or other job launchers such as SLURM), place rocprofv3 ins
type=str,
)
filter_options.add_argument(
"-p",
"-P",
"--collection-period",
help="The times are specified in seconds by default, but the unit can be changed using the `--collection-period-unit` option. Start Delay Time is the time in seconds before the collection begins, Collection Time is the duration in seconds for which data is collected, and Rate is the number of times the cycle is repeated. A repeat of 0 indicates that the cycle will repeat indefinitely. Users can specify multiple configurations, each defined by a triplet in the format `start_delay:collection_time:repeat`",
nargs="+",
@@ -511,7 +512,7 @@ For MPI applications (or other job launchers such as SLURM), place rocprofv3 ins
)
filter_options.add_argument(
"--collection-period-unit",
help="To change the unit used in `--collection-period` or `-p`, you can specify the desired unit using the `--collection-period-unit` option. The available units are `hour` for hours, `min` for minutes, `sec` for seconds, `msec` for milliseconds, `usec` for microseconds, and `nsec` for nanoseconds",
help="To change the unit used in `--collection-period` or `-P`, you can specify the desired unit using the `--collection-period-unit` option. The available units are `hour` for hours, `min` for minutes, `sec` for seconds, `msec` for milliseconds, `usec` for microseconds, and `nsec` for nanoseconds",
nargs=1,
default=["sec"],
type=str,
@@ -640,6 +641,16 @@ For MPI applications (or other job launchers such as SLURM), place rocprofv3 ins
metavar="KB",
)
reserved_options = parser.add_argument_group("Reserved options")
reserved_options.add_argument(
"-p",
"--pid",
help=argparse.SUPPRESS,
type=str,
nargs="+",
default=None,
)
if args is None:
args = sys.argv[1:]
@@ -886,6 +897,13 @@ def run(app_args, args, **kwargs):
use_execv = kwargs.get("use_execv", True)
app_pass = kwargs.get("pass_id", None)
if args.pid is not None:
fatal_error(
"""The -p shorthand option for --collection-period is now an upper-case -P
In the future, rocprofv3 plans to support debugger-like process attachment and -p
is de-facto standard shorthand option for this feature"""
)
def setattrifnone(obj, attr, value):
if getattr(obj, f"{attr}") is None:
setattr(obj, f"{attr}", value)
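The pattern in the hunks above (a visible `-P` option, a hidden `-p`/`--pid` placeholder, and `allow_abbrev=False`) can be sketched in isolation as follows. This is a minimal illustration, not the rocprofv3 source: `build_parser` and `check_reserved` are hypothetical names, and the sketch returns a message instead of calling `fatal_error`.

```python
import argparse


def build_parser():
    # allow_abbrev=False stops argparse from accepting unambiguous
    # prefixes of long options (e.g. --collection for --collection-period).
    parser = argparse.ArgumentParser(prog="demo", allow_abbrev=False)
    parser.add_argument("-P", "--collection-period", nargs="+", default=None)

    # Hidden placeholder: argparse.SUPPRESS keeps it out of --help, but the
    # parser still recognizes -p so the caller can report a clear error.
    reserved = parser.add_argument_group("Reserved options")
    reserved.add_argument(
        "-p", "--pid", help=argparse.SUPPRESS, type=str, nargs="+", default=None
    )
    return parser


def check_reserved(args):
    """Return an error message if the reserved -p option was used, else None."""
    if args.pid is not None:
        return "-p is reserved for process attachment; use -P for --collection-period"
    return None
```

Registering the old flag as a hidden option, rather than simply dropping it, means a stale `-p 10:10:1` produces a targeted message instead of argparse's generic "unrecognized arguments" error.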
+52 -52
@@ -19,11 +19,11 @@ Here are the distinct ROCprofiler-SDK features, which also highlight the improve
- PC sampling (beta implementation)
The former implementations allow a tool to access any of the services provided by ROCProfiler or ROCTracer, such as API tracing and kernel tracing, by calling ``roctracer_init()`` when an ROCm runtime is initially loaded.
As the calling tool is not required to specify during initialization, the services it needs to use, the libraries must be effectively prepared for any service to be available anytime.
As the calling tool is not required to specify during initialization, the services it needs to use, the libraries must be effectively prepared for any service to be available anytime.
This behavior introduces unnecessary overhead and makes thread-safe data management difficult, as tools generally don't use all the available services.
For example, ROCTracer always installs wrappers around every runtime API and adds indirection overhead through the ROCTracer library to check for the current service configuration in a thread-safe manner.
ROCprofiler-SDK introduces `context` to solve the preceding issues. Contexts are effectively bundles of service configurations. ROCprofiler-SDK provides a single opportunity for a tool to create as many contexts as required.
ROCprofiler-SDK introduces `context` to solve the preceding issues. Contexts are effectively bundles of service configurations. ROCprofiler-SDK provides a single opportunity for a tool to create as many contexts as required.
A tool can group all services into one context, create one context per service, or choose a mix.
This change in the design allows ROCprofiler-SDK to be aware of the services that might be requested by a tool at any given time.
The design change empowers ROCprofiler-SDK to:
@@ -50,36 +50,36 @@ ROCprofiler-SDK introduces a new command-line tool, `rocprofv3`, which is a more
- rocprofv3
- Improvements
- Notes
* - Basic tracing options
* - Basic tracing options
- HIP Trace
- `--hip-trace`
- `--hip-api`, `--hip-trace`
- `--hip-trace`
- `--hip-trace`
- No change
- | rocprof and rocprofv2 `--hip-trace` options include kernel dispatches and memory copy activities,
| which is not the case in rocprofv3
* - Basic tracing options
* - Basic tracing options
- HSA Trace
- `--hsa-trace`
- `--hsa-trace`
- `--hsa-trace`
- No change
- | rocprof and rocprofv2 `--hsa-trace` options include kernel dispatches and memory copy activities,
- | rocprof and rocprofv2 `--hsa-trace` options include kernel dispatches and memory copy activities,
| which is not the case in rocprofv3
* - Basic tracing options
* - Basic tracing options
- Scratch Memory Trace
- *Not Available*
- *Not Available*
- `--scratch-memory-trace`
- New option to trace scratch memory operations
-
-
* - Basic tracing options
- Marker Trace (ROCTx)
- `--roctx-trace`
- `--roctx-trace`
- `--marker-trace`
- Improved ROCTx library with more features
-
-
* - Basic tracing options
- Memory Copy Trace
- Part of HIP and HSA Traces
@@ -93,56 +93,56 @@ ROCprofiler-SDK introduces a new command-line tool, `rocprofv3`, which is a more
- *Not Available*
- `--memory-allocation-trace`
- New option for collecting Memory Allocation Traces. Displays starting address, allocation size, and agent where allocation occurred.
-
-
* - Basic tracing options
- Kernel Trace
- `--kernel-trace`
- `--kernel-trace`
- `--kernel-trace`
- Performance improvement.
-
-
* - Granular tracing options
- HIP runtime trace
- Part of `--hip-trace` option
- Part of `--hip-trace` option
- `--hip-runtime-trace`
- For collecting HIP Runtime API Traces, e.g. public HIP API functions starting with 'hip' (i.e. hipSetDevice).
-
-
* - Granular tracing options
- HIP compiler trace
- *Not Available*
- *Not Available*
- `--hip-compiler-trace`
- For collecting HIP Compiler generated code Traces, e.g. HIP API functions starting with '__hip' (i.e. __hipRegisterFatBinary).
-
-
* - Granular tracing options
- HSA core API trace
- Part of `--hsa-trace` option
- Part of `--hsa-trace` option
- `--hsa-core-trace`
- New option for collecting only HSA API Traces (core API), e.g. HSA functions prefixed with only `hsa_` (i.e. hsa_init)
-
-
* - Granular tracing options
- HSA AMD trace
- Part of `--hsa-trace` option
- Part of `--hsa-trace` option
- `--hsa-amd-trace`
- For collecting HSA API Traces (AMD-extension API), e.g. HSA function prefixed with `hsa_amd_` (i.e. hsa_amd_coherency_get_type)
-
-
* - Granular tracing options
- HSA Image Extension trace
- Part of `--hsa-trace` option
- Part of `--hsa-trace` option
- `--hsa-image-trace`
- New option for collecting HSA API Traces (Image-extension API), e.g. HSA functions prefixed with only `hsa_ext_image_` (i.e. hsa_ext_image_get_capability).
-
-
* - Granular tracing options
- HSA Finalizer trace
- Part of `--hsa-trace` option
- Part of `--hsa-trace` option
- `--hsa-finalizer-trace`
- New option for collecting HSA API Traces (Finalizer-extension API), e.g. HSA functions prefixed with only `hsa_ext_program_` (i.e. hsa_ext_program_create)
-
-
* - Advanced tracing options
- Kokkos trace
- *Not Available*
@@ -156,70 +156,70 @@ ROCprofiler-SDK introduces a new command-line tool, `rocprofv3`, which is a more
- *Not Available*
- `--rccl-trace`
- For collecting RCCL (ROCm Communication Collectives Library, pronounced 'Rickle') traces
-
-
* - Advanced tracing options
- Scratch memory trace
- *Not Available*
- *Not Available*
- `--scratch-memory-trace`
- Collecting scratch memory event traces.
-
-
* - Advanced tracing options
- rocDecode trace
- *Not Available*
- *Not Available*
- `--rocdecode-trace`
- Tracing rocDecode library.
-
-
* - Advanced tracing options
- rocJPEG trace
- *Not Available*
- *Not Available*
- `--rocjpeg-trace`
- Tracing rocJPEG library.
-
-
* - Aggregate tracing options
- Sys Trace
- `--sys-trace` [hip-trace|hsa-trace|roctx-trace|kernel-trace]
- `--sys-trace` [hip-trace|hsa-trace|roctx-trace|kernel-trace]
- ` -s, --sys-trace` [hip-trace|hsa-trace|scratch-trace|memory-copy-trace|roctx-trace|kernel-trace]
- Extends the sys trace options with more features
-
-
* - Aggregate tracing options
- Runtime Trace
- *Not available*
- *Not available*
- ` -r, --runtime-trace` [hip-runtime-trace|scratch-trace|memory-copy-trace|roctx-trace|kernel-trace]
- New option to aggregate trace operations
-
-
* - Kernel naming options
- Kernel Name Mangling
- *Not Available*
- *Not Available*
- `-M`, `--mangled-kernels`
- New option for mangled kernel names
-
-
* - Kernel naming options
- Kernel Name Truncation
- `--basenames <on|off>`
- `--basenames`
- `-T`, `--truncate-kernels`
- New option for truncating the demangled kernel names
-
-
* - Kernel naming options
- Kernel Rename
- `--roctx-rename`
- *Not available*
- `--kernel-rename`
- New option to use region names defined by roctxRangePush/roctxRangePop regions to rename the kernels
-
-
* - Post-processing tracing options
- Statistics
- --stats
- *Not Available*
- --stats
- Statistics for the collected traces
-
-
* - Post-processing tracing options
- Summary
- *Not available*
@@ -240,28 +240,28 @@ ROCprofiler-SDK introduces a new command-line tool, `rocprofv3`, which is a more
- *Not available*
- `--summary-groups REGULAR_EXPRESSION`
- New option to output a summary for each set of domains matching the regular expression, e.g. 'KERNEL_DISPATCH|MEMORY_COPY' will generate a summary from all the tracing data in the KERNEL_DISPATCH and MEMORY_COPY domains
-
-
* - Summary options
- Summary Output File
- *Not available*
- *Not available*
- `--summary-output-file SUMMARY_OUTPUT_FILE`
- New option to output summary to a file, stdout, or stderr (default: stderr)
-
-
* - Summary options
- Summary Units
- *Not available*
- *Not available*
- `-u , --summary-units`
- New option to output summary in desired time units {sec,msec,usec,nsec}
-
-
* - Display options
- List available basic and derived metrics and PC sampling configurations
- `--list-basic`, `--list-derived`
- `--list-counters`
- `-L`, `--list-avail`
- A valid YAML is supported for this option now
-
-
* - Perfetto-specific options
- Perfetto data collection backend
- *Not available*
@@ -275,7 +275,7 @@ ROCprofiler-SDK introduces a new command-line tool, `rocprofv3`, which is a more
- Setting env variable `rocprofiler_PERFETTO_MAX_BUFFER_SIZE_KIB` to the desired buffer size
- `--perfetto-buffer-size` {KB}
- New option to define size of buffer for perfetto output in KB. default: 1 GB
-
-
* - Perfetto-specific options
- Perfetto Buffer fill Policy
- *Not available*
@@ -289,48 +289,48 @@ ROCprofiler-SDK introduces a new command-line tool, `rocprofv3`, which is a more
- *Not available*
- `--perfetto-shmem-size-hint` KB
- New option to define perfetto shared memory size hint in KB. default: 64 KB
-
-
* - Filtering options
- Kernel Filtration options for Counter Collection
- Supported in input.xml file (supports range, gpu and kernel filtration)
- kernel: <kernel_name> (can only be provided in input.txt file)
- `--kernel-include-regex`, `--kernel-exclude-regex`, `--kernel-iteration-range`
- Extensive control over output options using regular expressions
-
-
* - I/O options
- Output Directory
- `-d` <data directory>
- `-d` | `--output-directory`
- `-d` OUTPUT_DIRECTORY, `--output-directory` OUTPUT_DIRECTORY
- rocprofv3 supports special keys for runtime values, e.g. %pid% gets replaced by the process ID
-
-
* - I/O options
- Output File
- `-o` <output file>
- `-o` | `--output-file-name`
- `-o` OUTPUT_FILE, `--output-file` OUTPUT_FILE
- rocprofv3 supports special keys for runtime values, e.g. %pid% gets replaced by the process ID
-
-
* - I/O options
- Logging
- Minimal logging via environment variable
- Minimal logging via environment variable
- --log-level {fatal,error,warning,info,trace,env}
- Extensive logging options
-
-
* - I/O options
- Plugins
- *Not Available*
- plugin support for different output formats
- Replaced by `--output-format` option
- Not needed as rocprofv3 supports multiple output formats
-
-
* - I/O options
- Output Formats
- CSV, JSON (Chrome-Tracing format)
- CSV, JSON (Chrome-Tracing format), Perfetto, CTF
- CSV, JSON (custom schema), Perfetto, OTF2
- | # Multiple output formats can be supported in single run.
- | # Multiple output formats can be supported in single run.
| # OTF2 can visualize larger trace files compared to perfetto.
- The Perfetto UI does not accept the JSON output format produced by rocprofv3. Perfetto is dropping support for the JSON Chrome tracing format in favor of the binary Perfetto protobuf format (``.pftrace`` extension), which is supported by rocprofv3.
* - I/O options
@@ -349,25 +349,25 @@ ROCprofiler-SDK introduces a new command-line tool, `rocprofv3`, which is a more
- `--pmc`
- New option to collect performance counters from command line. Counters should be comma OR space separated in case of more than 1 counters
-
* - I/O options
* - I/O options
- Providing Custom metrics file
- `-m` <metric file>
- `-m` <metric file>
- `-E` <metric file> --pmc <counter>
- In rocprofv3, this option has changed to provide a file with custom metrics and collect performance counters from the command line using --pmc option
-
-
* - Advanced options
- Preload
- *Not Available*
- *Not Available*
- --preload
- Libraries to prepend to LD_PRELOAD (usually for sanitizers)
-
-
* - Trace Control options
- Trace Period
- `--trace-period`
- `-tp | --trace-period`
- `-p |--collection-period`,`--collection-period-unit`
- `-P |--collection-period`,`--collection-period-unit`
- Users can specify multiple configurations, each defined by a triplet in the format `start_delay:collection_time:repeat`, with the ability to change the unit of time in the given configurations.
-
* - Trace Control options
@@ -376,14 +376,14 @@ ROCprofiler-SDK introduces a new command-line tool, `rocprofv3`, which is a more
- *Not available*
- *Not available*
- Not yet in rocprofv3
-
-
* - Trace Control options
- Flush Interval
- `--flush-rate`
- `--flush-interval`
- *Not available*
- Not applicable for rocprofv3
-
-
* - Trace Control options
- Merge Traces
- `--merge-traces`
@@ -397,46 +397,46 @@ ROCprofiler-SDK introduces a new command-line tool, `rocprofv3`, which is a more
- *Not available*
- `--pc-sampling-beta-enabled`
- Enable pc sampling support; beta version.
-
-
* - Legacy options
- Timestamp On/Off
- `--timestamp <on|off>`
- *Not available*
- *Not available*
- Not applicable for rocprofv3
-
-
* - Legacy options
- Context wait
- `--ctx-wait`
- *Not available*
- *Not available*
- Not applicable for rocprofv3
-
-
* - Legacy options
- Context Limit
- `--ctx-limit <max number>`
- *Not available*
- *Not available*
- Not applicable for rocprofv3
-
-
* - Legacy options
- Code Object Tracking
- `--obj-tracking <on|off>`
- Always ``ON`` in rocprofv2
- Always ``ON`` in rocprofv3
-
-
-
* - Legacy options
- Heartbeat
- `--heartbeat <rate sec>`
- *Not available*
- *Not available*
- Not applicable for rocprofv3
-
-
========================================================
Timing Difference Between rocprofv3 and rocprofv1/v2
Timing Difference Between rocprofv3 and rocprofv1/v2
========================================================
``rocprofv3`` has improved the accuracy of timing information by reducing the tool overhead required to collect data and reducing the interference to the timing of the kernel being measured. The result of this work is a reduction in variance of kernel times received for the same kernel execution and more accurate timing in general. These changes have not been backported (and will not be backported) to rocprofv1/v2, so there can be substantial (20%) differences in execution time reported by v1/v2 vs v3 for a single kernel execution. Over a large number of samples of the same kernel, the difference in average execution time is in the low single digit percentage time with a much tighter variance of results on rocprofv3. We have included testing in the test suite to verify the timing information outputted by rocprofv3 to ensure that the values we are returning are accurate.
+15 -15
@@ -143,13 +143,13 @@ The following table lists the commonly used ``rocprofv3`` command-line options c
- | ``--kernel-include-regex`` REGULAR_EXPRESSION |br| |br| |br| |br|
| ``--kernel-exclude-regex`` REGULAR_EXPRESSION |br| |br| |br| |br|
| ``--kernel-iteration-range`` KERNEL_ITERATION_RANGE [KERNEL_ITERATION_RANGE ...] |br| |br|
| ``-p`` (START_DELAY_TIME):(COLLECTION_TIME):(REPEAT) [(START_DELAY_TIME):(COLLECTION_TIME):(REPEAT) ...] \| ``--collection-period`` (START_DELAY_TIME):(COLLECTION_TIME):(REPEAT) [(START_DELAY_TIME):(COLLECTION_TIME):(REPEAT) ...] |br| |br| |br| |br| |br| |br| |br| |br| |br| |br| |br| |br| |br| |br| |br|
| ``-P`` (START_DELAY_TIME):(COLLECTION_TIME):(REPEAT) [(START_DELAY_TIME):(COLLECTION_TIME):(REPEAT) ...] \| ``--collection-period`` (START_DELAY_TIME):(COLLECTION_TIME):(REPEAT) [(START_DELAY_TIME):(COLLECTION_TIME):(REPEAT) ...] |br| |br| |br| |br| |br| |br| |br| |br| |br| |br| |br| |br| |br| |br| |br|
| ``--collection-period-unit`` {hour,min,sec,msec,usec,nsec}
- | Filters counter-collection and thread-trace data to include the kernels matching the specified regular expression. Non-matching kernels are excluded. |br| |br|
| Filters counter-collection and thread-trace data to exclude the kernels matching the specified regular expression. It is applied after ``--kernel-include-regex`` option. |br| |br|
| Specifies iteration range for each kernel matching the filter [start-stop]. |br| |br| |br|
| START_DELAY_TIME\: Time in seconds before the data collection begins. |br| COLLECTION_TIME\: Duration of data collection in seconds. |br| REPEAT\: Number of times the data collection cycle is repeated. |br| The default unit for time is seconds, which can be changed using the ``--collection-period-unit`` or ``-pu`` option. To repeat the cycle indefinitely, specify ``repeat`` as 0. You can specify multiple configurations, each defined by a triplet in the format ``start_delay_time:collection_time:repeat``. For example, the command ``-p 10:10:1 5:3:0`` specifies two configurations, the first one with a start delay time of 10 seconds, a collection time of 10 seconds, and a repeat of 1 (the cycle repeats once), and the second with a start delay time of 5 seconds, a collection time of 3 seconds, and a repeat of 0 (the cycle repeats indefinitely). |br| |br| |br|
| To change the unit of time used in ``--collection-period`` or ``-p``, specify the desired unit using the ``--collection-period-unit`` or ``-pu`` option. The available units are ``hour`` for hours, ``min`` for minutes, ``sec`` for seconds, ``msec`` for milliseconds, ``usec`` for microseconds, and ``nsec`` for nanoseconds.
| START_DELAY_TIME\: Time in seconds before the data collection begins. |br| COLLECTION_TIME\: Duration of data collection in seconds. |br| REPEAT\: Number of times the data collection cycle is repeated. |br| The default unit for time is seconds, which can be changed using the ``--collection-period-unit`` option. To repeat the cycle indefinitely, specify ``repeat`` as 0. You can specify multiple configurations, each defined by a triplet in the format ``start_delay_time:collection_time:repeat``. For example, the command ``-P 10:10:1 5:3:0`` specifies two configurations, the first one with a start delay time of 10 seconds, a collection time of 10 seconds, and a repeat of 1 (the cycle repeats once), and the second with a start delay time of 5 seconds, a collection time of 3 seconds, and a repeat of 0 (the cycle repeats indefinitely). |br| |br| |br|
| To change the unit of time used in ``--collection-period`` or ``-P``, specify the desired unit using the ``--collection-period-unit`` option. The available units are ``hour`` for hours, ``min`` for minutes, ``sec`` for seconds, ``msec`` for milliseconds, ``usec`` for microseconds, and ``nsec`` for nanoseconds.
* - Perfetto-specific
- | ``--perfetto-backend`` {inprocess,system} |br| |br| |br| |br| |br|
@@ -935,14 +935,14 @@ For the description of the fields in the output file, see :ref:`output-file-fiel
Iteration based counter multiplexing
++++++++++++++++++++++++++++++++++++
Counter multiplexing allows a single run of the program to collect groups of counters. This is useful when the counters you want to collect exceed the hardware limits and you cannot run the program multiple times for collection.
Counter multiplexing allows a single run of the program to collect groups of counters. This is useful when the counters you want to collect exceed the hardware limits and you cannot run the program multiple times for collection.
This feature is available when using YAML (.yaml/.yml) or JSON (.json) input formats. Two new fields are introduced, ``pmc_groups`` and ``pmc_group_interval``. The ``pmc_groups`` field is used to specify the groups of counters to be collected in each run. The ``pmc_group_interval`` field is used to specify the interval between each group of counters. Interval is per-device and increments per dispatch on the device (i.e. dispatch_id). When the interval is reached the next group is selected.
Here is a sample input.yaml file for specifying counter multiplexing:
.. code-block:: yaml
jobs:
- pmc_groups: [["SQ_WAVES", "GRBM_COUNT"], ["GRBM_GUI_ACTIVE"]]
pmc_group_interval: 4
@@ -952,7 +952,7 @@ This sample input will collect the first group of counters (``SQ_WAVES``, ``GRBM
An example of the interval period for this input is given below:
.. code-block:: shell
Device 1, <Kernel A>, Collect SQ_WAVES, GRBM_COUNT
Device 1, <Kernel A>, Collect SQ_WAVES, GRBM_COUNT
Device 1, <Kernel B>, Collect SQ_WAVES, GRBM_COUNT
@@ -1054,7 +1054,7 @@ The agent index is a unique identifier for each agent in the system. It is used
- **absolute** == *node_id* - absolute index of the agent regardless of cgroups masking. This is a monotonically increasing number that is incremented for every folder in `/sys/class/kfd/kfd/topology/nodes`. e.g. Agent-0, Agent-2, Agent-4.
- **relative** == *logical_node_id* - relative index of the agent accounting for cgroups masking. This is a monotonically increasing number which is incremented for every folder in `/sys/class/kfd/kfd/topology/nodes/` whose properties file was non-empty, e.g. Agent-0, Agent-1, Agent-2.
- **type-relative** == *logical_node_type_id* - relative index of the agent accounting for cgroups masking where indexing starts at zero for each agent type. e.g. CPU-0, GPU-0, GPU-1
To set the agent index in the output files, use the ``--agent-index`` option. The default value is ``relative``.
@@ -1071,19 +1071,19 @@ Here is the `rocm-smi` output:
.. code-block:: shell
$ cat kernel_trace.csv
$ cat kernel_trace.csv
"Kind","Agent_Id","Queue_Id","Stream_Id","Thread_Id","Dispatch_Id","Kernel_Id","Kernel_Name","Correlation_Id","Start_Timestamp","End_Timestamp","Private_Segment_Size","Group_Segment_Size","Workgroup_Size_X","Workgroup_Size_Y","Workgroup_Size_Z","Grid_Size_X","Grid_Size_Y","Grid_Size_Z"
"KERNEL_DISPATCH","Agent 7",1,2,15044,1,17,"void addition_kernel<float>(float*, float const*, float const*, int, int)",1,1671247151691610,1671247151718010,0,0,64,1,1,1024,1024,1
.. code-block:: shell
rocprofv3 --kernel-trace --agent-index=type-relative -- <application_path>
.. code-block:: shell
$ cat kernel_trace.csv
$ cat kernel_trace.csv
"Kind","Agent_Id","Queue_Id","Stream_Id","Thread_Id","Dispatch_Id","Kernel_Id","Kernel_Name","Correlation_Id","Start_Timestamp","End_Timestamp","Private_Segment_Size","Group_Segment_Size","Workgroup_Size_X","Workgroup_Size_Y","Workgroup_Size_Z","Grid_Size_X","Grid_Size_Y","Grid_Size_Z"
"KERNEL_DISPATCH","GPU 3",1,2,15056,1,17,"void addition_kernel<float>(float*, float const*, float const*, int, int)",1,1671390884499766,1671390884525686,0,0,64,1,1,1024,1024,1
@@ -1154,7 +1154,7 @@ To enable kernel name truncation, use the ``--truncate-kernels`` option.
rocprofv3 --truncate-kernels --kernel-trace -- <application_path>
The above command generates a ``kernel_trace.csv`` file with truncated kernel names.
The above command generates a ``kernel_trace.csv`` file with truncated kernel names.
.. csv-table:: Kernel trace truncated
:file: /data/kernel_trace_truncated.csv
@@ -1361,7 +1361,7 @@ The above command generates an ``%hostname%/%pid%_hip_api_trace.csv`` file.
Collection period
+++++++++++++++++++
The collection period is the time interval during which the profiling data is collected. You can specify the collection period using the ``--collection-period`` or ``-p`` option.
The collection period is the time interval during which the profiling data is collected. You can specify the collection period using the ``--collection-period`` or ``-P`` option.
Users can specify multiple configurations, each defined by a triplet in the format `start_delay:collection_time:repeat`.
The triplet is defined as follows:
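The triplet format described in this diff (`start_delay:collection_time:repeat`, with units selected by `--collection-period-unit`) could be parsed roughly as below. This is a hedged sketch and not how rocprofv3 processes the option: `CollectionPeriod`, `parse_collection_periods`, and the choice to normalize times to seconds are assumptions made for illustration. Only the unit names and the triplet layout come from the help text.

```python
from typing import List, NamedTuple

# Seconds per unit; the unit names follow the --collection-period-unit help text.
UNIT_TO_SECONDS = {
    "hour": 3600.0,
    "min": 60.0,
    "sec": 1.0,
    "msec": 1e-3,
    "usec": 1e-6,
    "nsec": 1e-9,
}


class CollectionPeriod(NamedTuple):
    start_delay: float      # seconds before collection begins
    collection_time: float  # seconds of data collection
    repeat: int             # number of cycles; 0 means repeat indefinitely


def parse_collection_periods(specs: List[str], unit: str = "sec") -> List[CollectionPeriod]:
    """Parse 'start_delay:collection_time:repeat' triplets into seconds (hypothetical helper)."""
    scale = UNIT_TO_SECONDS[unit]
    periods = []
    for spec in specs:
        fields = spec.split(":")
        if len(fields) != 3:
            raise ValueError(f"expected start_delay:collection_time:repeat, got {spec!r}")
        delay, duration, repeat = fields
        periods.append(
            CollectionPeriod(float(delay) * scale, float(duration) * scale, int(repeat))
        )
    return periods
```

For the documented example `-P 10:10:1 5:3:0`, the helper yields two periods: a 10 second delay with 10 seconds of collection repeated once, and a 5 second delay with 3 seconds of collection repeated indefinitely.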
@@ -1399,7 +1399,7 @@ The following options are specific to Perfetto tracing and are used to control t
- **DISCARD**: The buffer stops accepting data once full. Further write attempts are dropped.
- **--perfetto-buffer-size KB**: Size of buffer for perfetto output in KB. default: 1 GB. If set, stops the tracing session after N bytes have been written. Used to cap the size of the trace.
- **--perfetto-backend {inprocess,system}**: Perfetto data collection backend. 'system' mode requires starting traced and perfetto daemons. By default Perfetto keeps the full trace buffer(s) in memory.
- **--perfetto-shmem-size-hint KB**: Perfetto shared memory size hint in KB. default: 64 KB. This option controls shared memory buffer sizing and can be tuned to avoid data loss when data is produced at a high rate.