:keywords: ROCprofiler-SDK tool usage, rocprofv3 user manual, rocprofv3 usage, rocprofv3 user guide, using rocprofv3, ROCprofiler-SDK tool user guide, ROCprofiler-SDK tool user manual, using ROCprofiler-SDK tool, ROCprofiler-SDK command-line tool, ROCprofiler-SDK CLI, ROCprofiler-SDK command line tool
``rocprofv3`` is a CLI tool that helps you optimize applications and analyze the low-level kernel details without requiring any modification in the source code.
It's backward compatible with its predecessor, `rocprof <https://rocm.docs.amd.com/projects/rocprofiler/en/latest/index.html>`_, and provides enhanced features for application profiling with better accuracy.
- | Specifies the path to the input file. JSON and YAML formats support configuration of all command-line options for tracing and profiling whereas the text format supports only the specification of HW counters. |br| |br|
| Specifies output file name. If nothing is specified, the default path is ``%hostname%/%pid%``. |br| |br|
| Specifies the output path for saving the output files. If nothing is specified, the default path is ``%hostname%/%pid%``. |br| |br|
- | Attaches to a running process by process ID and profiles it dynamically. This enables profiling of applications that are already running without needing to restart them from the profiler. The profiler will instrument the target process and collect the specified tracing or counter data for the configured duration.
- | Collects tracing data for HIP runtime API, marker (ROCTx) API, RCCL API, memory operations (copies, scratch, and allocations), and kernel dispatches. Similar to ``--sys-trace`` but without HIP compiler API and the underlying HSA API tracing. |br| |br|
| Collects tracing data for HIP API, HSA API, marker (ROCTx) API, RCCL API, memory operations (copies, scratch, and allocations), and kernel dispatches.
- | Enables PC sampling and sets the ``ROCPROFILER_PC_SAMPLING_BETA_ENABLED`` environment variable. Note that PC sampling support is in beta. |br| |br|
| Specifies the unit for PC sampling type or method. Note that only units of time are supported. |br| |br|
| Specifies the PC sampling type. Note that only host trap method is supported. |br| |br|
- | Combination of ``--hip-runtime-trace`` and ``--hip-compiler-trace``. This option only enables the HIP API tracing. Unlike previous iterations of ``rocprof``, this option doesn't enable kernel tracing, memory copy tracing, and so on. |br| |br|
| Collects marker (ROCTx) traces. Similar to ``--roctx-trace`` option in earlier ``rocprof`` versions, but with improved ``ROCTx`` library with more features. |br| |br|
| Collects kernel dispatch traces. |br| |br|
| Collects memory copy traces. This was a part of the HIP and HSA traces in previous ``rocprof`` versions. |br| |br|
| Collects memory allocation traces. Displays starting address, allocation size, and the agent where allocation occurs. |br| |br|
| Collects scratch memory operations traces. Helps in determining scratch allocations and manage them efficiently. |br| |br|
| Collects ``--hsa-core-trace``, ``--hsa-amd-trace``, ``--hsa-image-trace``, and ``--hsa-finalizer-trace``. This option only enables the HSA API tracing. Unlike previous iterations of ``rocprof``, this doesn't enable kernel tracing, memory copy tracing, and so on. |br| |br|
| Collects traces for RCCL (ROCm Communication Collectives Library), which is also pronounced as 'Rickle'. |br| |br|
| Enables built-in Kokkos tools support, which implies enabling ``--marker-trace`` collection and ``--kernel-rename``. |br| |br|
- | Collects HIP Runtime API traces. For example, public HIP API functions starting with ``hip`` such as ``hipSetDevice``. |br| |br|
| Collects HIP Compiler generated code traces. For example, HIP API functions starting with ``__hip`` such as ``__hipRegisterFatBinary``. |br| |br|
| Collects HSA API traces (core API). For example, HSA functions prefixed with only ``hsa_`` such as ``hsa_init``. |br| |br|
| Collects HSA API traces (AMD-extension API). For example, HSA functions prefixed with ``hsa_amd_`` such as ``hsa_amd_coherency_get_type``. |br| |br|
| Collects HSA API traces (image-extension API). For example, HSA functions prefixed with only ``hsa_ext_image_`` such as ``hsa_ext_image_get_capability``. |br| |br|
| Collects HSA API traces (Finalizer-extension API). For example, HSA functions prefixed with only ``hsa_ext_program_`` such as ``hsa_ext_program_create``.
* - Counter collection
- | ``--pmc`` [PMC ...]
- | Specifies performance monitoring counters to be collected. Use a comma or space to specify more than one counter. Note that the job fails if the entire set of counters can't be collected in a single pass.
- | Collects statistics of enabled tracing types. Must be combined with one or more tracing options. Doesn't include default kernel stats unlike previous ``rocprof`` versions. |br| |br|
| Displays a single summary of tracing data for the enabled tracing type after the profiling session concludes. |br| |br|
| Displays a summary of each tracing domain for the enabled tracing type, after conclusion of the profiling session. |br| |br|
| Displays a summary for each set of domains matching the specified regular expression. For example, ``'KERNEL_DISPATCH|MEMORY_COPY'`` generates a summary of all the tracing data in the ``KERNEL_DISPATCH`` and ``MEMORY_COPY`` domains. Similarly, ``'.*_API'`` generates a summary of all the tracing data in the ``HIP_API``, ``HSA_API``, and ``MARKER_API`` domains.
- | Overrides the default demangling of kernel names. |br| |br|
| Truncates the demangled kernel names for improved readability. In earlier ``rocprof`` versions, this was known as ``--basenames [on/off]``. |br| |br|
| Uses region names defined using ``roctxRangePush`` and ``roctxRangePop`` to rename the kernels. Known as ``--roctx-rename`` in earlier ``rocprof`` versions.
- | Filters counter-collection and thread-trace data to include the kernels matching the specified regular expression. Non-matching kernels are excluded. |br| |br|
| Filters counter-collection and thread-trace data to exclude the kernels matching the specified regular expression. It is applied after ``--kernel-include-regex`` option. |br| |br|
| Specifies iteration range for each kernel matching the filter [start-stop]. |br| |br| |br|
| START_DELAY_TIME\: Time in seconds before the data collection begins. |br| COLLECTION_TIME\: Duration of data collection in seconds. |br| REPEAT\: Number of times the data collection cycle is repeated. |br| The default unit for time is seconds, which can be changed using the ``--collection-period-unit`` option. To repeat the cycle indefinitely, specify ``repeat`` as 0. You can specify multiple configurations, each defined by a triplet in the format ``start_delay_time:collection_time:repeat``. For example, the command ``-P 10:10:1 5:3:0`` specifies two configurations, the first one with a start delay time of 10 seconds, a collection time of 10 seconds, and a repeat of 1 (the cycle repeats once), and the second with a start delay time of 5 seconds, a collection time of 3 seconds, and a repeat of 0 (the cycle repeats indefinitely). |br| |br| |br|
| To change the unit of time used in ``--collection-period`` or ``-P``, specify the desired unit using the ``--collection-period-unit`` option. The available units are ``hour`` for hours, ``min`` for minutes, ``sec`` for seconds, ``msec`` for milliseconds, ``usec`` for microseconds, and ``nsec`` for nanoseconds.
- | Specifies backend for Perfetto data collection. When selecting 'system' mode, ensure to run the Perfetto ``traced`` daemon and then start a Perfetto session. |br| |br|
| Specifies buffer size for Perfetto output in KB. Default: 1 GB. |br| |br|
| Specifies policy for handling new records when Perfetto reaches the buffer limit. |br| |br|
| Specifies Perfetto shared memory size hint in KB. Default: 64 KB.
- | Lists the PC sampling configurations and metrics available in the ``counter_defs.yaml`` file for counter collection. In earlier ``rocprof`` versions, this was known as ``--list-basic``, ``--list-derived``, and ``--list-counters``. |br| |br|
| Displays the HSA queues to which kernels and memory copy operations are submitted, rather than the default grouping by HIP streams, in Perfetto output.
- | Specifies libraries to prepend to ``LD_PRELOAD``. Useful for sanitizer libraries and custom instrumentation tools. Multiple libraries can be specified. |br| |br|
| Specifies the minimum output data size threshold in KB. Output files are generated only if the collected profiling data exceeds this threshold. This prevents creation of empty or very small output files. Default is 0 (no threshold). |br| |br|
| Controls signal handler prioritization. When set to true, disables rocprofv3 signal handler prioritization, allowing application signal handlers to take precedence. Useful for applications with custom crash handling or when integrating with testing frameworks. Default is false (rocprofv3 handlers have priority). |br| |br|
| Specifies custom ROCm installation directory instead of automatic detection. Useful for multiple ROCm installations, custom builds, or non-standard locations. |br| |br|
| Specifies the shared object version number for ROCProfiler SDK library resolution. Controls which major version of librocprofiler-sdk.so.X to use. |br| |br|
| Specifies the exact version number for ROCProfiler SDK library resolution. Controls library selection with full semantic versioning (X.Y.Z format).
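The ``--collection-period`` triplets described in the table above can be read as a small schedule. The following Python sketch is an illustration only and is not part of ``rocprofv3``; the names ``Period``, ``parse_period``, and ``UNIT_IN_SECONDS`` are hypothetical. It parses triplets in the ``start_delay_time:collection_time:repeat`` format and converts them to seconds using the unit names listed for ``--collection-period-unit``.

.. code-block:: python

   from typing import NamedTuple

   # Seconds per unit, using the unit names listed for --collection-period-unit.
   UNIT_IN_SECONDS = {
       "hour": 3600.0,
       "min": 60.0,
       "sec": 1.0,
       "msec": 1e-3,
       "usec": 1e-6,
       "nsec": 1e-9,
   }


   class Period(NamedTuple):
       start_delay: float  # seconds before data collection begins
       collection: float   # seconds of data collection
       repeat: int         # number of cycles; 0 means repeat indefinitely


   def parse_period(triplet: str, unit: str = "sec") -> Period:
       """Parse one 'start_delay_time:collection_time:repeat' triplet."""
       delay, collect, repeat = triplet.split(":")
       scale = UNIT_IN_SECONDS[unit]
       return Period(float(delay) * scale, float(collect) * scale, int(repeat))


   # The two configurations from the example "-P 10:10:1 5:3:0".
   periods = [parse_period(t) for t in "10:10:1 5:3:0".split()]
   for p in periods:
       cycles = "indefinitely" if p.repeat == 0 else f"{p.repeat} time(s)"
       print(f"delay={p.start_delay}s collect={p.collection}s repeats {cycles}")

Keeping the unit conversion in one table makes it easy to check how a change of ``--collection-period-unit`` rescales both the delay and the collection time.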
Application tracing provides the big picture of a program’s execution by collecting data on the execution times of API calls and GPU commands, such as kernel execution, async memory copy, and barrier packets. This information can be used as the first step in the profiling process to answer important questions, such as what percentage of time was spent on memory copies and which kernel took the longest to execute.
To use ``rocprofv3`` for application tracing, run:
All the tracing examples below use the ``--output-format csv`` option to generate output in CSV format.
However, the default output format is ``rocpd`` (SQLite3 database). You can simply omit the ``--output-format`` option to generate output in the default format.
``rocpd`` format can be converted to other formats such as CSV, OTF2, and PFTrace using the ``rocpd`` module.
To understand how to convert ``rocpd`` output to other formats, see :ref:`using-rocpd-output-format`.
HIP trace comprises execution traces for the entire application at the HIP level. This includes HIP API functions and their asynchronous activities at the runtime level. In general, HIP APIs directly interact with the user program. It is easier to analyze HIP traces as you can directly map them to the program.
Unlike previous iterations of ``rocprof``, this does not enable kernel tracing, memory copy tracing, and so on. To enable kernel tracing, memory copy tracing, or other tracing types, specify them explicitly.
``rocprofv3`` provides options to collect traces at a more granular level. For HIP, you can collect traces for HIP compiler APIs and runtime APIs separately.
For the description of the fields in the output file, see :ref:`output-file-fields`.
HSA trace
+++++++++++++
The HIP runtime library is implemented with the low-level HSA runtime. HSA API tracing is more suited for advanced users who want to understand the application behavior at the lower level. In general, tracing at the HIP level is recommended for most users. You should use HSA trace only if you are familiar with HSA runtime.
HSA trace contains the start and end time of HSA runtime API calls and their asynchronous activities.
The preceding command generates an ``hsa_api_trace.csv`` file prefixed with the process ID. Note that the contents of this file have been truncated for demonstration purposes.
``rocprofv3`` provides options to collect HSA traces at a more granular level. HSA traces can be collected separately for four API domains: ``HSA_AMD_EXT_API``, ``HSA_CORE_API``, ``HSA_IMAGE_EXT_API``, and ``HSA_FINALIZE_EXT_API``.
The preceding command generates an ``hsa_api_trace.csv`` file prefixed with the process ID. Note that the contents of this file have been truncated for demonstration purposes.
For marker tracing with ``rocprofv3``, including and linking to the old ``ROCTx`` still works, but switching to the new ``ROCTx`` is recommended to take advantage of the new APIs.
To use the new ``ROCTx``, include header ``"rocprofiler-sdk-roctx/roctx.h"`` and link your application with ``librocprofiler-sdk-roctx.so``.
To see the complete list of ``ROCTx`` APIs, see public header file ``"rocprofiler-sdk-roctx/roctx.h"``.
`Kokkos <https://github.com/kokkos/kokkos>`_ is a C++ library for writing performance portable applications. Kokkos is widely used in scientific applications to write performance-portable code for CPUs, GPUs, and other accelerators.
``rocprofv3`` loads a built-in `Kokkos Tools library <https://github.com/kokkos/kokkos-tools>`_, which emits ``ROCTx`` ranges with the labels passed using Kokkos APIs. For example, ``Kokkos::parallel_for("MyParallelForLabel", ...)`` calls ``roctxRangePush`` internally and enables the kernel renaming option to replace the highly templated kernel names with the Kokkos labels.
To enable the built-in marker support, use the ``--kokkos-trace`` option. Internally, this option automatically enables ``--marker-trace`` and ``--kernel-rename``:
Memory copy traces track ``hipMemcpy`` and ``hipMemcpyAsync`` functions, which use the ``hsa_amd_memory_async_copy_on_engine`` HSA function internally. To trace memory moves across the application, use:
The ``--runtime-trace`` option excludes HSA API tracing because the HSA API tends to be an implementation detail irrelevant to most users. Similarly, the HIP compiler API is excluded as an implementation detail because these functions are automatically inserted during HIP compilation.
Running the preceding command generates ``hip_api_trace.csv``, ``kernel_trace.csv``, ``memory_copy_trace.csv``, ``scratch_memory_trace.csv``, ``memory_allocation_trace.csv``, and ``marker_api_trace.csv`` (if ``ROCTx`` APIs are specified in the application) files prefixed with the process ID.
This is an all-inclusive option to collect HIP, HSA, kernel, memory copy, memory allocation, and marker trace (if ``ROCTx`` APIs are specified in the application).
Running the preceding command generates ``hip_api_trace.csv``, ``hsa_api_trace.csv``, ``kernel_trace.csv``, ``memory_copy_trace.csv``, ``scratch_memory_trace.csv``, ``memory_allocation_trace.csv``, and ``marker_api_trace.csv`` if ``ROCTx`` APIs are specified in the application.
This option collects scratch memory operation traces. Scratch is an address space on AMD GPUs roughly equivalent to the local memory in NVIDIA CUDA. The local memory in CUDA is a thread-local global memory with interleaved addressing, which is used for register spills or stack space. This option helps to trace when the ``rocr`` runtime allocates, frees, and tries to reclaim scratch memory.
`RCCL <https://github.com/ROCm/rccl>`_ (pronounced "Rickle") is a stand-alone library of standard collective communication routines for GPUs. This option traces those communication routines.
``rocprofv3`` supports attaching to already running processes to profile them dynamically without requiring application restart. This is particularly useful for long-running applications, services, or when you need to profile an application that is already in a specific state.
Process attachment uses the ``-p``, ``--pid``, or ``--attach`` options (all equivalent) followed by the target process ID. The profiler will instrument the target process and collect the specified tracing or counter data for the configured duration.
The preceding command generates ``hip_trace.csv`` and ``hsa_trace.csv`` files prefixed with the process ID and displays the summary of each domain.
This option displays a summary of multiple domains for the domain names specified on the command line. The summary groups can be separated using a pipe ( | ) symbol.
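As an illustration of how such an expression selects domains, the following Python sketch applies two example patterns to the domain names mentioned in this document. It assumes that the whole domain name must match the expression; the matching rules used by ``rocprofv3`` itself might differ. ``select_domains`` and ``DOMAINS`` are hypothetical names used only for this example.

.. code-block:: python

   import re

   # Domain names mentioned in this document (not an exhaustive list).
   DOMAINS = ["HIP_API", "HSA_API", "MARKER_API", "KERNEL_DISPATCH", "MEMORY_COPY"]


   def select_domains(pattern, domains=DOMAINS):
       """Return the domains whose full name matches the regular expression."""
       regex = re.compile(pattern)
       return [name for name in domains if regex.fullmatch(name)]


   # Two alternatives separated by a pipe select two domains.
   print(select_domains("KERNEL_DISPATCH|MEMORY_COPY"))

   # A wildcard selects every domain ending in _API.
   print(select_domains(".*_API"))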
This option specifies the output file for the summary. By default, the summary is displayed on ``stderr``. To specify another output file for summary, use:
The preceding command generates an output file named "filename" consisting of the summary for each domain. This also generates the files for the tracing types enabled by the ``--sys-trace`` option.
The ``--output-config`` option generates a comprehensive configuration output file that contains all resolved ``rocprofv3`` settings and options used during a profiling session. This feature is essential for debugging, reproducibility, and configuration validation.
To generate a configuration output file during profiling:
This command generates a configuration file (typically ``<process_id>_config.json``) alongside the regular profiling output files.
The generated JSON configuration file contains detailed information about the profiling session and is structured with a ``rocprofiler-sdk-tool`` array containing comprehensive metadata and configuration details.
The metadata section includes essential session information such as process ID (``pid``), initialization and finalization timestamps (``init_time``, ``fini_time``), the exact command executed, and detailed build specifications. The build specification contains version information, compiler details, git revision, system architecture, and kernel version, providing complete context for reproducing the environment.
The config section is the most comprehensive part, containing all profiling options with their resolved boolean and numerical values. This includes tracing options like ``hip_runtime_api_trace``, ``hip_compiler_api_trace``, ``kernel_trace``, ``hsa_core_api_trace``, ``memory_copy_trace``, and many others. It also shows advanced configuration like PC sampling settings (``pc_sampling_method``, ``pc_sampling_interval``), filtering options (``kernel_filter_include``, ``kernel_filter_exclude``), output formatting choices (``csv_output``, ``json_output``, ``pftrace_output``), and performance tuning parameters.
The environment section captures all environment variables active during the profiling session, including system variables such as ``SHELL``, ``COLORTERM``, ``HOSTNAME``, and ROCm-specific variables, providing complete environmental context for reproduction.
The configuration output file provides complete transparency into ``rocprofv3`` operation, documenting all settings, defaults, and environmental context required for profiling sessions.
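A generated configuration file can be inspected with a few lines of Python. The sketch below works on a hypothetical, heavily abbreviated stand-in for such a file: the top-level ``rocprofiler-sdk-tool`` array and the option names come from the description above, whereas the key names ``metadata``, ``config``, and ``environment`` and all values are assumptions made for illustration. Check an actual file before relying on this layout.

.. code-block:: python

   import json

   # Hypothetical, abbreviated stand-in for a generated configuration file.
   # The key names "metadata", "config", and "environment" are assumed from the
   # section names described above; the values are made up.
   sample = """
   {
     "rocprofiler-sdk-tool": [
       {
         "metadata": {"pid": 1234},
         "config": {
           "hip_runtime_api_trace": true,
           "hip_compiler_api_trace": false,
           "kernel_trace": true,
           "memory_copy_trace": false
         },
         "environment": {"SHELL": "/bin/bash"}
       }
     ]
   }
   """


   def enabled_options(document):
       """List the boolean options set to true in the first tool entry."""
       config = document["rocprofiler-sdk-tool"][0]["config"]
       return sorted(name for name, value in config.items() if value is True)


   print(enabled_options(json.loads(sample)))

Comparing the enabled options of two sessions in this way is a quick check that both runs used the same tracing settings.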
The preceding sections describe how to collect traces by specifying the desired tracing type on the command line. You can also specify the desired tracing types in an input file in YAML (.yaml/.yml), or JSON (.json) format. You can supply any command-line option for tracing in the input file.
When using aggregate tracing options like ``--runtime-trace`` or ``--sys-trace``, you can disable specific tracing options by setting them to ``False``. This allows fine-grained control over the traces to be collected.
This feature is particularly useful to collect most traces excluding specific ones that might be unnecessary for your analysis or that generate excessive data.
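The following Python sketch builds such an input file programmatically. The key names ``sys_trace`` and ``hsa_trace`` are assumptions that mirror the ``--sys-trace`` and ``--hsa-trace`` options, following the ``hip_trace`` and ``kernel_trace`` naming used in the input-file samples in this document. Verify them against the input schema before use.

.. code-block:: python

   import json

   # Assumed key names: "sys_trace" and "hsa_trace" mirror the command-line
   # options --sys-trace and --hsa-trace.
   job = {
       "sys_trace": True,    # aggregate option: enables every trace it covers
       "hsa_trace": False,   # then switches off one trace that isn't needed
       "output_format": ["csv"],
   }

   # An input file holds an array of profiling configurations called jobs.
   input_file = {"jobs": [job]}

   text = json.dumps(input_file, indent=2)
   print(text)

The printed text can be saved to a ``.json`` file and supplied as the input file.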
The application tracing functionality allows you to evaluate the duration of kernel execution but is of little help in providing insight into kernel execution details. The kernel counter collection functionality allows you to select kernels for profiling and choose the basic counters or derived metrics to be collected for each kernel execution, thus providing a greater insight into kernel execution.
AMD GPUs are equipped with hardware performance counters that can be used to measure specific values during kernel execution, which are then exported from the GPU and written into the output files at the end of the kernel execution. These performance counters vary according to the GPU. Therefore, it is recommended to examine the hardware counters that can be collected before running the profile.
There are two types of data available for profiling: hardware basic counters and derived metrics.
The derived metrics are the counters derived from the basic counters using mathematical expressions. Note that the basic counters and derived metrics are collectively referred to as counters in this document.
You can also customize the counters as required. Such counters are named :ref:`extra-counters`.
For a comprehensive list of counters available on MI200, see `MI200 performance counters and metrics <https://rocm.docs.amd.com/en/latest/conceptual/gpu-arch/mi300-mi200-performance-counters.html>`_.
When using an input file in text format, the line listing the counter names must begin with ``pmc``. The number of counters that can be collected in one profiling run is limited by the GPU hardware resources. If too many counters are selected, the kernels need to be executed multiple times (multi-pass execution) to collect all the counters. For multi-pass execution, include multiple ``pmc`` rows in the input file. The counters in each ``pmc`` row are collected in a separate application run.
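As a rough sketch of preparing a multi-pass text input, the following Python snippet splits a list of counters into ``pmc`` rows. Two points are assumptions: the ``pmc:`` prefix with a colon followed by space-separated names, and the fixed ``per_pass`` limit. On real hardware, the number of counters that fit in one pass depends on the counters chosen, not on a simple count. ``pmc_rows`` is a hypothetical helper.

.. code-block:: python

   def pmc_rows(counters, per_pass):
       """Split counter names into 'pmc' rows of at most per_pass counters."""
       rows = []
       for start in range(0, len(counters), per_pass):
           chunk = counters[start:start + per_pass]
           # Assumed row syntax: the "pmc" keyword, a colon, then the names.
           rows.append("pmc: " + " ".join(chunk))
       return rows


   counters = ["SQ_WAVES", "GRBM_COUNT", "GRBM_GUI_ACTIVE"]

   # Hypothetical limit of two counters per pass, so two rows (two runs) result.
   for row in pmc_rows(counters, per_pass=2):
       print(row)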
While the input file in text format can only be used for counter collection, JSON and YAML formats support all the command-line options for profiling. The input file in YAML or JSON format has an array of profiling configurations called jobs. Each job is used to configure profiling for an application execution.
Here is the input schema (properties) of JSON or YAML input files:
While the basic counters and derived metrics are available for collection by default, you can also define counters as per requirement. These user-defined counters with custom definitions are named extra counters.
Counter collection with ``rocprofv3``, whether configured through an input file or the command line, generates a ``./pmc_n/counter_collection.csv`` file prefixed with the process ID. For each ``pmc`` row, a directory ``pmc_n`` containing a ``counter_collection.csv`` file is generated, where n = 1 for the first row and so on.
When using an input file in JSON or YAML format, a directory ``pass_n`` containing a ``counter_collection.csv`` file is generated for each job, where n = 1 for the first job and so on.
Counter multiplexing allows a single run of the program to collect groups of counters. This is useful when the counters you want to collect exceed the hardware limits and you cannot run the program multiple times for collection.
This feature is available when using YAML (.yaml/.yml) or JSON (.json) input formats. It introduces two fields, ``pmc_groups`` and ``pmc_group_interval``. The ``pmc_groups`` field specifies the groups of counters to be collected in each run. The ``pmc_group_interval`` field specifies the interval between each group of counters. The interval is per-device and increments with each dispatch on the device (that is, with ``dispatch_id``). When the interval is reached, the next group is selected.
Here is a sample input.yaml file for specifying counter multiplexing:
This sample input will collect the first group of counters (``SQ_WAVES``, ``GRBM_COUNT``) for the first 4 kernel executions on the device, then the second group of counters (``GRBM_GUI_ACTIVE``) for the next 4 kernel executions on the device, and so on.
An example of the interval period for this input is given below:
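The rotation between counter groups can be sketched in Python as follows. This is an illustrative model, not the ``rocprofv3`` implementation: it assumes that dispatches on a device are counted from zero and that the selection wraps around to the first group after the last one. ``group_for_dispatch`` is a hypothetical helper.

.. code-block:: python

   def group_for_dispatch(dispatch_index, groups, interval):
       """Return the counter group active for the N-th dispatch on a device.

       dispatch_index is assumed to start at 0 for the first dispatch.
       """
       return groups[(dispatch_index // interval) % len(groups)]


   # Values matching the sample described above.
   pmc_groups = [["SQ_WAVES", "GRBM_COUNT"], ["GRBM_GUI_ACTIVE"]]
   pmc_group_interval = 4

   for index in range(12):
       group = group_for_dispatch(index, pmc_groups, pmc_group_interval)
       print(index, ", ".join(group))

Under these assumptions, dispatches 0 to 3 collect the first group, dispatches 4 to 7 collect the second group, and dispatch 8 starts again with the first group.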
`Perfetto <https://perfetto.dev/>`_ is an open-source tracing tool that provides a detailed view of system performance. You can use Perfetto to visualize traces and performance counter data as explained in the following sections.
Perfetto helps you visualize the collected traces in the Perfetto viewer, a user-friendly interface that makes it easier to analyze and understand the performance characteristics of your application.
To generate a Perfetto trace file, use the ``--output-format pftrace`` option along with the desired tracing options. For example, to collect system traces and generate a Perfetto trace file, use:
When collecting performance counter data, you can visualize the counter tracks per agent in the Perfetto viewer by using the PFTrace output format. This helps you see how counter values change over time during kernel execution.
The generated Perfetto trace file can be opened in the `Perfetto UI <https://ui.perfetto.dev/>`_. In the viewer, performance counters will appear as counter tracks organized by agent, allowing you to visualize counter values changing over time alongside kernel executions and other traced activities.
You can also combine this with the system trace option to get a more comprehensive view of the system's performance. For example, you can use the following command to collect both system trace and performance counter data:
When using the ``--scratch-memory-trace`` option with Perfetto output format, ROCProfiler SDK creates visualization tracks for scratch memory usage. Scratch memory operations are displayed as counter tracks organized by agent (GPU), allowing you to monitor the scratch memory allocation patterns during kernel execution.
To generate a Perfetto trace file that includes scratch memory visualization:
In the Perfetto UI, scratch memory appears as counter tracks that show:
- **Allocation peaks**: Each peak represents scratch memory allocation for a kernel execution
- **Memory usage over time**: The height of each peak indicates the amount of memory allocated (typically in KB)
- **Allocation/deallocation pattern**: You can observe when memory is allocated at kernel start and freed at kernel end
For applications with multiple kernel iterations, you'll see multiple peaks in the scratch memory track, with each peak corresponding to a kernel execution. This visualization helps identify scratch memory usage patterns and potential optimization opportunities.
.. image:: /data/perfetto_scratch_memory.png
   :width: 100%
   :align: center
For comprehensive GPU execution insights, combine scratch memory tracing with kernel tracing:
The ``--minimum-output-data`` option allows you to control the generation of output files by setting a minimum data size threshold. This prevents the creation of empty or very small output files that contain no meaningful profiling data.
When this option is specified, ``rocprofv3`` only generates output files if the collected data size exceeds the specified threshold. This is particularly useful in scenarios where:
- You're profiling applications that may have sporadic GPU activity
- You want to avoid processing empty trace files in automated workflows
- You're running batch jobs and only want meaningful results
To specify the minimum output data threshold, use the ``--minimum-output-data`` option followed by the size in KB:
You can also specify this option in YAML or JSON input files:

.. code-block:: yaml

   jobs:
     - hip_trace: true
       kernel_trace: true
       minimum_output_data: 100
       output_format: ["csv", "json"]
       output_directory: "filtered_results"

.. code-block:: json

   {
     "jobs": [
       {
         "hip_trace": true,
         "kernel_trace": true,
         "minimum_output_data": 100,
         "output_format": ["csv", "json"],
         "output_directory": "filtered_results"
       }
     ]
   }

**Important notes:**
- The threshold applies to the raw profiling data size, not the final output file size
- If multiple output formats are specified, the threshold check applies to each format independently
- A value of 0 (default) means all output files are generated regardless of size
- This option works with all tracing and counter collection modes
This feature is especially valuable in automated testing environments where you want to ensure that only applications with meaningful GPU activity generate profiling outputs, reducing storage overhead and simplifying result analysis.
Signal handler control
++++++++++++++++++++++
The ``--disable-signal-handlers`` option provides control over signal handling behavior in ``rocprofv3``, allowing you to manage how the profiler responds to system signals like SIGSEGV, SIGTERM, and others.
By default, ``rocprofv3`` installs its own signal handlers to ensure proper cleanup and data collection when the application encounters errors or is terminated. However, in some scenarios, you may want the application's own signal handlers to take precedence.
When ``--disable-signal-handlers`` is set to ``true``, ``rocprofv3`` disables the prioritization of its signal handlers over application-installed signal handlers. This means:
- If your application has custom signal handlers for SIGSEGV, SIGTERM, or similar signals, those handlers will be executed instead of ``rocprofv3``'s handlers
- The application maintains full control over signal handling behavior
- ``rocprofv3`` will still attempt to collect and save profiling data when possible
**Important note**: Even with this option enabled, the underlying ``glog`` library may still install signal handlers that provide stack backtraces for debugging purposes.
- If profiling data appears incomplete with this option enabled, check if your application's signal handlers are properly saving or flushing data
- Consider implementing explicit ``rocprofv3`` cleanup calls in your application's signal handlers if data integrity is important
- Monitor application behavior to ensure custom signal handling doesn't interfere with profiling data collection
This option provides the flexibility needed for complex applications and testing environments while maintaining ``rocprofv3``'s core profiling functionality.
Library preloading
+++++++++++++++++++
The ``--preload`` option allows you to specify additional libraries to prepend to the ``LD_PRELOAD`` environment variable. This is particularly useful when working with sanitizer libraries, debugging tools, or other instrumentation libraries that need to be loaded before the application starts.
``LD_PRELOAD`` is a powerful mechanism in Linux that allows you to load shared libraries before any other libraries, effectively intercepting and overriding function calls. The ``--preload`` option in ``rocprofv3`` provides a convenient way to manage this without manually setting environment variables.
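The effect of prepending libraries to ``LD_PRELOAD`` can be sketched in Python as follows. This models only the resulting value of the environment variable and is not how ``rocprofv3`` implements ``--preload``. The library paths and the helper ``prepend_preload`` are hypothetical.

.. code-block:: python

   def prepend_preload(libraries, environ):
       """Return an LD_PRELOAD value with the given libraries placed first."""
       existing = environ.get("LD_PRELOAD", "")
       parts = list(libraries) + ([existing] if existing else [])
       # The dynamic linker accepts a colon-separated list of libraries.
       return ":".join(parts)


   # Hypothetical environment that already preloads one library.
   env = {"LD_PRELOAD": "/opt/tools/libexisting.so"}

   # Hypothetical sanitizer library path; it ends up ahead of the existing entry.
   print(prepend_preload(["/usr/lib/libasan.so"], env))

Libraries that appear earlier in the list are loaded first, which is why prepending matters for sanitizer runtimes.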
- **Library not found**: Ensure the library path is correct and the library exists
- **Symbol conflicts**: Check for conflicting symbols between preloaded libraries
- **Performance impact**: Sanitizers can significantly slow down execution
- **Memory usage**: Some tools like AddressSanitizer increase memory consumption substantially
ROCm root path configuration
++++++++++++++++++++++++++++
The ``--rocm-root`` option allows you to specify a custom ROCm installation directory instead of using the default relative path detection. This is useful when working with multiple ROCm installations, custom builds, or non-standard installation locations.
By default, ``rocprofv3`` automatically detects the ROCm installation path relative to its own location. However, in some environments, you may need to explicitly specify which ROCm installation to use.
- **Path validation**: Ensure the specified path contains a valid ROCm installation
- **Library compatibility**: Verify that the ROCm version is compatible with your application
- **Permission issues**: Check read permissions for the ROCm directory
- **Path format**: Use absolute paths to avoid ambiguity
SDK shared object version control
++++++++++++++++++++++++++++++++++
The ``--sdk-soversion`` option allows you to specify the shared object version number for the ROCProfiler SDK library. This provides precise control over which version of the library is loaded, useful for testing, compatibility verification, or working with specific library versions.
Shared object versioning follows the Linux convention where libraries have version suffixes like ``.so.X`` where X is the major version number. This option helps resolve library paths when multiple versions are installed.
- **Library not found**: Verify the specified soversion exists in the library path
- **ABI compatibility**: Ensure the SDK version is compatible with your ROCm installation
- **Symbol mismatches**: Check for symbol compatibility between versions
- **Performance differences**: Different versions may have different performance characteristics
SDK version specification
+++++++++++++++++++++++++
The ``--sdk-version`` option allows you to specify the exact version number used to resolve the ROCprofiler-SDK library. This provides the finest level of control over library selection, which is useful for testing specific versions, development workflows, or ensuring reproducible profiling environments.
This option helps resolve library paths for version-specific libraries like ``librocprofiler-sdk.so.X.Y.Z`` where X.Y.Z represents the full semantic version.
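As an illustration of the naming convention, the following sketch builds the file names that a shared object version and a full version correspond to. The function ``sdk_library_names``, the most-specific-first ordering, and the version numbers in the usage line are assumptions made for this example; the actual resolution logic inside ``rocprofv3`` may differ.

```python
def sdk_library_names(soversion=None, version=None, stem="librocprofiler-sdk.so"):
    """List candidate library file names, most specific first.

    Hypothetical helper that only illustrates the .so.X and .so.X.Y.Z naming
    convention; it does not search the file system.
    """
    names = []
    if version is not None:
        names.append(f"{stem}.{version}")    # for example, .so.X.Y.Z
    if soversion is not None:
        names.append(f"{stem}.{soversion}")  # for example, .so.X
    names.append(stem)                       # unversioned name
    return names


# The version numbers below are placeholders, not real release numbers.
candidates = sdk_library_names(soversion=1, version="1.0.0")
```

Checking whether any of these names exists in the library directory is a quick way to diagnose a "library not found" error before profiling.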
The agent index is a unique identifier for each agent in the system and is used to identify the agent in the output files. Since each runtime or tool has an independent representation of agent indices, ``rocprofv3`` provides an option to configure the agent index used in the output files.
- **absolute** == *node_id*: Absolute index of the agent, regardless of cgroups masking. This is a monotonically increasing number, which is incremented for every folder in ``/sys/class/kfd/kfd/topology/nodes/``. For example, Agent-0, Agent-2, Agent-4.
- **relative** == *logical_node_id*: Relative index of the agent, accounting for cgroups masking. This is a monotonically increasing number, which is incremented for every folder in ``/sys/class/kfd/kfd/topology/nodes/`` whose properties file is non-empty. For example, Agent-0, Agent-1, Agent-2.
- **type-relative** == *logical_node_type_id*: Relative index of the agent, accounting for cgroups masking, where indexing starts at zero for each agent type. For example, CPU-0, GPU-0, GPU-1.
To set the agent index in the output files, use the ``--agent-index`` or ``-A {absolute,relative,type-relative}`` option. The default value is ``relative``.
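The three numbering schemes can be compared with a small model. The sketch below assumes a hypothetical system with five topology nodes, two of which are masked by cgroups. The ``Node`` type and the ``agent_indices`` function are illustrative names and are not part of ``rocprofv3``.

```python
from collections import namedtuple

# kind: agent type ("CPU" or "GPU"); visible: False when masked by cgroups.
Node = namedtuple("Node", ["kind", "visible"])


def agent_indices(nodes):
    """Map each visible node to its absolute, relative, and type-relative label."""
    result = []
    relative = 0
    per_type = {}  # next free index for each agent type
    for absolute, node in enumerate(nodes):
        if not node.visible:
            continue  # masked nodes consume an absolute index only
        type_relative = per_type.get(node.kind, 0)
        result.append({
            "absolute": f"Agent-{absolute}",
            "relative": f"Agent-{relative}",
            "type-relative": f"{node.kind}-{type_relative}",
        })
        relative += 1
        per_type[node.kind] = type_relative + 1
    return result


# Hypothetical topology: one CPU and four GPUs, with two GPUs masked.
nodes = [
    Node("CPU", True),
    Node("GPU", False),
    Node("GPU", True),
    Node("GPU", False),
    Node("GPU", True),
]
indices = agent_indices(nodes)
```

For this topology, the visible agents are labeled Agent-0, Agent-2, and Agent-4 in the absolute scheme, Agent-0, Agent-1, and Agent-2 in the relative scheme, and CPU-0, GPU-0, and GPU-1 in the type-relative scheme.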
By default, when outputting a Perfetto trace, ``rocprofv3`` shows the HIP streams to which the kernel and memory copy operations were submitted. The ``--group-by-queue`` option instead displays the HSA queues to which these operations were submitted.
In ``rocprofv3`` output, by default, the kernel names are demangled to exclude the kernel arguments. This improves readability of the collected output.
The kernel name truncation feature allows you to limit the kernel name length in the output files. This is useful when dealing with long kernel names that can make the output files difficult to read.
Kernel filtering helps to include or exclude the kernels for profiling by specifying a filter using a regex string. You can also specify an iteration range for profiling the included kernels. If the iteration range is not provided, then all iterations of the included kernels are profiled.
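The filtering rules described above can be modeled as a single predicate. This is a sketch under stated assumptions: the function ``should_profile`` and its parameter names are hypothetical, the regex is applied with search semantics (a match anywhere in the name), and the iteration range is represented as a Python ``range``. The exact matching rules and range syntax used by ``rocprofv3`` are not specified in this section.

```python
import re


def should_profile(kernel_name, iteration, include=None, exclude=None, iterations=None):
    """Decide whether one invocation of a kernel would be profiled.

    Hypothetical model of kernel filtering, not the rocprofv3 implementation.
    """
    if include is not None and not re.search(include, kernel_name):
        return False  # name does not match the include filter
    if exclude is not None and re.search(exclude, kernel_name):
        return False  # name matches the exclude filter
    if iterations is not None and iteration not in iterations:
        return False  # invocation falls outside the requested range
    return True       # no range given: every iteration is profiled
```

For example, ``should_profile("matmul_kernel", 3, include="matmul", iterations=range(1, 5))`` returns ``True``, while the same call with iteration ``7`` returns ``False``.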
``roctxRangePush`` and ``roctxRangePop`` also let you rename the enclosed kernel with the supplied message. In the legacy ``rocprof``, this functionality was known as ``--roctx-rename``.
See how to use ``roctxRangePush`` and ``roctxRangePop`` for renaming the enclosed kernel:
Output prefix keys are useful in multiple use cases but are most helpful when dealing with multiple profiling runs or large MPI jobs. Here is the list of available keys:
The preceding command runs the application with ``rocprofv3`` and generates the trace file for each rank. The trace files are prefixed with hostname, process ID, and MPI rank.
Assuming the hostname is ``ubuntu-latest`` and the process IDs are 3000020 and 3000019, the output file names are:
The collection period is the time interval during which the profiling data is collected. You can specify the collection period using the ``--collection-period`` or ``-P`` option.
The collection period can be specified in different units, such as seconds, milliseconds, microseconds, and nanoseconds. The default unit is "seconds". You can change the unit using the ``--collection-period-unit`` option.
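To show how the units relate, the following sketch converts a period value to nanoseconds. The unit names used as dictionary keys are spelled out for readability and are an assumption; consult ``rocprofv3 --help`` for the exact spellings that ``--collection-period-unit`` accepts.

```python
# Conversion factors to nanoseconds. The key spellings are illustrative and
# may differ from the values accepted by --collection-period-unit.
UNIT_TO_NANOSECONDS = {
    "seconds": 1_000_000_000,
    "milliseconds": 1_000_000,
    "microseconds": 1_000,
    "nanoseconds": 1,
}


def to_nanoseconds(value, unit="seconds"):
    """Convert a collection period expressed in `unit` to whole nanoseconds."""
    if unit not in UNIT_TO_NANOSECONDS:
        raise ValueError(f"unknown unit: {unit}")
    return int(round(value * UNIT_TO_NANOSECONDS[unit]))
```

With the default unit, a value of ``2`` corresponds to 2,000,000,000 nanoseconds, and ``500`` milliseconds corresponds to 500,000,000 nanoseconds.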
- **--perfetto-buffer-size KB**: The buffer size for Perfetto output in KB. Default: 1 GB. If set, stops the tracing session after N bytes have been written. Used to cap the trace size.
- **--perfetto-backend {inprocess,system}**: Perfetto data collection backend. ``system`` mode requires starting the ``traced`` and ``perfetto`` daemons. By default, Perfetto keeps the full trace buffers in memory.
- **--perfetto-shmem-size-hint KB**: Perfetto shared memory size hint in KB. Default: 64 KB. This option gives you control over shared memory buffer sizing. You can tune it to avoid data loss when data is produced at a high rate.
- The amount of memory required in bytes for the combined private, spill, and arg segments for a work item.
* - Group_Segment_Size
- The group segment memory required by a workgroup in bytes. This does not include any dynamically allocated group segment memory that may be added when the kernel is dispatched.
- The total number of work-items (or, threads) in each workgroup (or, block) launched as part of the kernel dispatch. In HIP, this is equivalent to the total block size.
- The total number of work-items (or, threads) launched as a part of the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied by the total workgroup (or, block) size.
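The relationship between the two size fields described above can be checked with a short calculation. The function name ``dispatch_sizes`` and the launch dimensions in the example are hypothetical.

```python
from math import prod


def dispatch_sizes(grid_blocks, block_threads):
    """Return (workgroup size, grid size) in work-items for a HIP-style launch.

    grid_blocks:   number of blocks per dimension, for example (4, 2, 1)
    block_threads: number of threads per block per dimension, for example (64, 4, 1)
    """
    workgroup_size = prod(block_threads)             # work-items per workgroup
    grid_size = prod(grid_blocks) * workgroup_size   # work-items in the dispatch
    return workgroup_size, grid_size


# Hypothetical launch: 4 x 2 x 1 blocks of 64 x 4 x 1 threads each.
workgroup_size, grid_size = dispatch_sizes((4, 2, 1), (64, 4, 1))
```

For this launch, each workgroup has 256 work-items and the dispatch has 8 × 256 = 2048 work-items in total.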
Format selection is case-insensitive and multiple output formats are supported. For example, ``--output-format json`` enables only JSON output, whereas ``--output-format csv json pftrace otf2 rocpd`` enables all five output formats for the run.
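Case-insensitive selection of multiple formats can be modeled as follows. The function ``normalize_formats`` is a hypothetical illustration of the behavior described above, using the format names that appear in this section; it is not the parser used by ``rocprofv3``.

```python
# Format names as they appear in this section.
SUPPORTED_FORMATS = ("csv", "json", "pftrace", "otf2", "rocpd")


def normalize_formats(values):
    """Lower-case the requested formats, reject unknown ones, drop duplicates."""
    selected = []
    for value in values:
        name = value.strip().lower()
        if name not in SUPPORTED_FORMATS:
            raise ValueError(f"unsupported output format: {value!r}")
        if name not in selected:
            selected.append(name)  # keep the first occurrence, preserve order
    return selected
```

Under this model, ``["JSON"]`` and ``["json"]`` select the same single format, and a list naming every supported format enables all of them in one run.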