.. meta::
   :description: ROCm Systems Profiler development documentation and reference
   :keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, development, developers guide, profiler, tracking, visualization, tool, Instinct, accelerator, AMD

****************************************************
Development guide
****************************************************

This guide discusses the `ROCm Systems Profiler <https://github.com/ROCm/rocm-systems/tree/develop/projects/rocprofiler-systems>`_ design.
It includes a list of the executables and libraries, along with a discussion of the application's
memory, sampling, and time-window constraint models.

Executables
========================================

This section lists the ROCm Systems Profiler executables.

rocprof-sys-avail: `source/bin/rocprof-sys-avail <https://github.com/ROCm/rocm-systems/tree/develop/projects/rocprofiler-systems/source/bin/rocprof-sys-avail>`_
------------------------------------------------------------------------------------------------------------------------------------------------------------------

The ``main`` routine of ``rocprof-sys-avail`` has three important sections:

* Printing components
* Printing options
* Printing hardware counters

rocprof-sys-sample: `source/bin/rocprof-sys-sample <https://github.com/ROCm/rocm-systems/tree/develop/projects/rocprofiler-systems/source/bin/rocprof-sys-sample>`_
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

* Requires a command-line format of ``rocprof-sys-sample <options> -- <command> <command-args>``
* Translates command-line options into environment variables
* Adds ``librocprof-sys-dl.so`` to ``LD_PRELOAD``
* Launches ``<command> <command-args>`` with the modified environment by using ``execvpe``
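
The launch sequence above can be sketched roughly as follows. This is a minimal illustration, not the actual ``rocprof-sys-sample`` source: the helper names ``with_preload`` and ``launch`` are hypothetical, and placing the library first in ``LD_PRELOAD`` is an assumption. ``execvpe`` is a GNU extension, so the sketch assumes Linux with glibc.

```cpp
#include <cassert>
#include <string>
#include <vector>
#include <unistd.h>

// Hypothetical helper: returns a copy of `env` ("KEY=VALUE" strings) in which
// LD_PRELOAD lists `lib` first, keeping any libraries that were already there.
std::vector<std::string>
with_preload(std::vector<std::string> env, const std::string& lib)
{
    const std::string key = "LD_PRELOAD=";
    for(auto& entry : env)
    {
        if(entry.compare(0, key.size(), key) == 0)
        {
            std::string current = entry.substr(key.size());
            entry = key + lib;
            if(!current.empty()) entry += ":" + current;
            return env;
        }
    }
    env.push_back(key + lib);
    return env;
}

// Hypothetical helper: replaces the current process with `argv`, using `env`
// as the complete environment. It only returns (with -1) if execvpe fails.
int
launch(std::vector<std::string> argv, std::vector<std::string> env)
{
    std::vector<char*> c_argv;
    std::vector<char*> c_env;
    for(auto& arg : argv) c_argv.push_back(arg.data());
    c_argv.push_back(nullptr);
    for(auto& entry : env) c_env.push_back(entry.data());
    c_env.push_back(nullptr);
    return execvpe(c_argv[0], c_argv.data(), c_env.data());
}
```

Because ``execvpe`` replaces the process image, the launcher itself adds no overhead once the target application is running; all further work happens inside the preloaded library.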

rocprof-sys-causal: `source/bin/rocprof-sys-causal <https://github.com/ROCm/rocm-systems/tree/develop/projects/rocprofiler-systems/source/bin/rocprof-sys-causal>`_
----------------------------------------------------------------------------------------------------------------------------------------------------------------------

When there is exactly one causal profiling configuration variant (which enables debugging),
``rocprof-sys-causal`` has a nearly identical design to ``rocprof-sys-sample``.

When the command-line options produce more than one causal profiling configuration variant,
the following actions take place for each variant:

* ``rocprof-sys-causal`` calls ``fork()``
* The child process launches ``<command> <command-args>`` using ``execvpe``, with the environment modified for the variant
* The parent process waits for the child process to finish
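
The per-variant loop can be sketched as below. This is an illustrative outline under stated assumptions rather than the real ``rocprof-sys-causal`` code: ``run_variants`` and ``env_t`` are hypothetical names, each variant is modeled as a complete ``KEY=VALUE`` environment, and error handling is reduced to returning ``-1``.

```cpp
#include <cassert>
#include <string>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

using env_t = std::vector<std::string>;  // hypothetical: one environment per variant

// Runs `argv` once per variant, one after another, and returns the exit code
// of each run (-1 if the fork failed or the child did not exit normally).
std::vector<int>
run_variants(std::vector<std::string> argv, std::vector<env_t> variants)
{
    std::vector<char*> c_argv;
    for(auto& arg : argv) c_argv.push_back(arg.data());
    c_argv.push_back(nullptr);

    std::vector<int> exit_codes;
    for(auto& env : variants)
    {
        std::vector<char*> c_env;
        for(auto& entry : env) c_env.push_back(entry.data());
        c_env.push_back(nullptr);

        pid_t pid = fork();
        if(pid < 0)
        {
            exit_codes.push_back(-1);
            continue;
        }
        if(pid == 0)
        {
            // child: launch the command with this variant's environment
            execvpe(c_argv[0], c_argv.data(), c_env.data());
            _exit(127);  // only reached if execvpe failed
        }
        // parent: wait for the child before starting the next variant
        int status = 0;
        waitpid(pid, &status, 0);
        exit_codes.push_back(WIFEXITED(status) ? WEXITSTATUS(status) : -1);
    }
    return exit_codes;
}
```

Running the variants sequentially keeps each experiment isolated in its own process, so state from one variant cannot leak into the next.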

rocprof-sys-instrument: `source/bin/rocprof-sys-instrument <https://github.com/ROCm/rocm-systems/tree/develop/projects/rocprofiler-systems/source/bin/rocprof-sys-instrument>`_
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

* Requires a command-line format of ``rocprof-sys-instrument <options> -- <command> <command-args>``
* Allows the user to provide options specifying whether to perform runtime instrumentation, use binary rewrite, or
  attach to a process
* Either opens the instrumentation target (for binary rewrite), launches the target and stops it
  before it starts executing ``main``, or attaches to a running executable and pauses it
* Finds all functions in the targets
* Finds ``librocprof-sys-dl`` and locates the instrumentation functions
* Iterates over and instruments all the functions, provided they satisfy the
  defined criteria (such as a minimum number of instructions)

  * See the ``module_function`` class

* Until this point, the workflow has been the same for the different options,
  but it diverges after instrumentation is complete:

  * For a binary rewrite: it produces a new instrumented binary and exits
  * For runtime instrumentation or attaching to a process: it instructs the application
    to resume and then waits for it to exit

Libraries
========================================

Common library: `source/lib/common <https://github.com/ROCm/rocm-systems/tree/develop/projects/rocprofiler-systems/source/lib/common>`_
------------------------------------------------------------------------------------------------------------------------------------------

* General header-only functionality used in multiple executables and/or libraries.
* Not installed or exported outside of the build tree.

Core library: `source/lib/core <https://github.com/ROCm/rocm-systems/tree/develop/projects/rocprofiler-systems/source/lib/core>`_
-----------------------------------------------------------------------------------------------------------------------------------------

* Static PIC library with functionality that does not depend on any components.
* Not installed or exported outside of the build tree.

Binary library: `source/lib/binary <https://github.com/ROCm/rocm-systems/tree/develop/projects/rocprofiler-systems/source/lib/binary>`_
--------------------------------------------------------------------------------------------------------------------------------------------

* Static PIC library with functionality for reading/analyzing binary info.
* Mostly used by the causal profiling sections of ``librocprof-sys``.
* Not installed or exported outside of the build tree.

librocprof-sys: `source/lib/rocprof-sys <https://github.com/ROCm/rocm-systems/tree/develop/projects/rocprofiler-systems/source/lib/rocprof-sys>`_
-----------------------------------------------------------------------------------------------------------------------------------------------------

This is the main library encapsulating all the capabilities.

librocprof-sys-dl: `source/lib/rocprof-sys-dl <https://github.com/ROCm/rocm-systems/tree/develop/projects/rocprofiler-systems/source/lib/rocprof-sys-dl>`_
-----------------------------------------------------------------------------------------------------------------------------------------------------------------

This is a lightweight, front-end library for ``librocprof-sys`` which serves three primary purposes:

* Dramatically reduces instrumentation time compared to using ``librocprof-sys`` directly, because
  Dyninst must parse the entire library in order to find the instrumentation functions
  (a ``dlopen`` call is made on ``librocprof-sys`` when the instrumentation functions get called)
* Prevents re-entry if ``librocprof-sys`` calls an instrumented function internally
* Coordinates communication between ``librocprof-sys-user`` and ``librocprof-sys``
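
The re-entry protection can be illustrated with a thread-local guard. This is only a sketch of the general technique, not the ``librocprof-sys-dl`` implementation: every name in the ``demo`` namespace is hypothetical, and the lazy ``dlopen`` of the backend is replaced by an ordinary function so that the example stays self-contained.

```cpp
#include <cassert>

namespace demo
{
int regions_pushed = 0;  // stands in for work done by the heavyweight backend

// Set while the current thread is executing inside the backend.
thread_local bool inside_backend = false;

void instrumented_helper();

// Stand-in for a backend entry point (in practice resolved after dlopen).
void backend_push_region()
{
    ++regions_pushed;
    instrumented_helper();  // the backend happens to call instrumented code
}

// Front-end entry point that instrumentation calls.
void frontend_push_region()
{
    if(inside_backend) return;  // re-entrant call from the backend: ignore it
    inside_backend = true;
    backend_push_region();
    inside_backend = false;
}

// An instrumented function: its instrumentation calls back into the front end.
void instrumented_helper() { frontend_push_region(); }
}  // namespace demo
```

Without the guard, ``frontend_push_region`` and ``instrumented_helper`` would recurse indefinitely; with it, each outer call reaches the backend exactly once.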

librocprof-sys-user: `source/lib/rocprof-sys-user <https://github.com/ROCm/rocm-systems/tree/develop/projects/rocprofiler-systems/source/lib/rocprof-sys-user>`_
------------------------------------------------------------------------------------------------------------------------------------------------------------------

* Provides a set of functions and types that users can add to their code, for example,
  to disable data collection globally, on a specific thread, or within a
  user-defined region
* If ``librocprof-sys-dl`` is not loaded, the user API is effectively a set of no-op function calls.

Testing tools
========================================

* `CDash Testing Dashboard <https://my.cdash.org/index.php?project=rocprofiler-systems>`_ (requires a login)

Components
========================================

Most measurements and capabilities are encapsulated into a "component" with the following definitions:

Measurement
   A recording of some data relevant to performance, for instance, the current call-stack,
   hardware counter values, current memory usage, or timestamp.

Capability
   Handles the implementation or orchestration of some feature which is used
   to collect measurements, for example, a component which handles setting up function wrappers
   around various functions such as ``pthread_create`` or ``MPI_Init``.

Components are designed to either hold no data at all or only the data for both an instantaneous
measurement and a phase measurement.

Components which store data typically implement a static ``record()`` function
for getting a record of the measurement,
``start()`` and ``stop()`` member functions for calculating a phase measurement,
and a ``sample()`` member function for storing an
instantaneous measurement. In reality, there are several more "standard" functions
but these are the most commonly used ones.

Components which do not store data might also have ``start()``, ``stop()``, and ``sample()``
functions. However, components which
implement function wrappers typically provide a call operator or ``audit(...)``
functions. These are invoked with the
wrapped function's arguments before the wrapped function gets called and with the return value
after the wrapped function gets called.

.. note::

   The goal of this design is to provide relatively small and reusable lightweight objects
   for recording measurements and implementing capabilities.

Wall-clock component example
--------------------------------------

A component for computing the elapsed wall-clock time looks like this:

.. code-block:: cpp

   #include <chrono>
   #include <cstdint>

   struct wall_clock
   {
       using value_type = int64_t;

       // current reading of the monotonic clock, in clock ticks
       static value_type record() noexcept
       {
           return std::chrono::steady_clock::now().time_since_epoch().count();
       }

       // instantaneous measurement
       void sample() noexcept
       {
           value = record();
       }

       // beginning of a phase measurement
       void start() noexcept
       {
           value = record();
       }

       // end of a phase measurement: accumulate the elapsed ticks
       void stop() noexcept
       {
           auto _start_value = value;
           value             = record();
           accum += (value - _start_value);
       }

   private:
       int64_t value = 0;
       int64_t accum = 0;
   };

Function wrapper component example
--------------------------------------

A component which implements wrappers around ``fork()`` and ``exit(int)`` (and stores no data)
could look like this:

.. code-block:: cpp

   struct function_wrapper
   {
       pid_t operator()(const gotcha_data&, pid_t (*real_fork)())
       {
           // disable all collection before forking
           categories::disable_categories(config::get_enabled_categories());

           auto _pid_v = real_fork();

           // only re-enable collection on parent process
           if(_pid_v != 0)
               categories::enable_categories(config::get_enabled_categories());

           return _pid_v;
       }

       void operator()(const gotcha_data&, void (*real_exit)(int), int _exit_code)
       {
           // catch the call to exit and finalize before truly exiting
           rocprofsys_finalize();

           real_exit(_exit_code);
       }
   };

Component member functions
--------------------------------------

There are no real restrictions or requirements on the member functions a component needs to provide.
Unless the component is being used directly, the invocation of component member functions via a "component bundler"
(provided by Timemory) makes extensive use of template metaprogramming concepts. This finds the best match, if any,
for calling a component's member function. This is a bit easier to demonstrate using an example:

.. code-block:: cpp

   struct foo
   {
       void sample() { puts("foo::sample()"); }
   };

   struct bar
   {
       void sample(int) { puts("bar::sample(int)"); }
   };

   struct spam
   {
       void start() { puts("spam::start()"); }
       void stop() { puts("spam::stop()"); }
   };

   int main()
   {
       auto _bundle = component_tuple<foo, bar, spam>{ "main" };

       puts("A");
       _bundle.start();

       puts("B");
       _bundle.sample(10);

       puts("C");
       _bundle.sample();

       puts("D");
       _bundle.stop();
   }

When the preceding code runs, the following messages are printed:

.. code-block:: shell

   A
   spam::start()
   B
   foo::sample()
   bar::sample(int)
   C
   foo::sample()
   D
   spam::stop()

In section A, the bundle determined that only the ``spam`` object has a ``start`` function. Since this is determined
via template metaprogramming instead of dynamic polymorphism, this effectively omits any code related to
the ``foo`` or ``bar`` objects. In section B, because the integer ``10`` is passed to the bundle,
the bundle forwards this value to ``bar::sample(int)`` after it invokes ``foo::sample()``. ``foo::sample()`` is
invoked because the bundle recognizes that the call to the ``sample`` member function is still possible without
the argument. In section C, no argument is passed, so only ``foo::sample()`` matches. In section D, only the
``spam`` object has a ``stop`` function.
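
The compile-time matching described above can be approximated with the C++17 detection idiom and ``if constexpr``. The sketch below is not Timemory's implementation; ``mini_bundle``, ``try_sample``, ``has_sample``, and ``call_log`` are hypothetical names used only to show how a bundle could pick the best-matching ``sample`` overload, fall back to the zero-argument form, or generate no call at all.

```cpp
#include <cassert>
#include <string>
#include <tuple>
#include <type_traits>
#include <utility>

std::string call_log;  // records which member functions were invoked

struct foo  { void sample()    { call_log += "foo::sample() "; } };
struct bar  { void sample(int) { call_log += "bar::sample(int) "; } };
struct spam { void start()     { call_log += "spam::start() "; } };

// Type of the expression `obj.sample(args...)`, if that expression is valid.
template <typename T, typename... Args>
using sample_expr_t = decltype(std::declval<T&>().sample(std::declval<Args>()...));

template <typename T, typename Enable, typename... Args>
struct has_sample_impl : std::false_type {};

template <typename T, typename... Args>
struct has_sample_impl<T, std::void_t<sample_expr_t<T, Args...>>, Args...> : std::true_type {};

template <typename T, typename... Args>
constexpr bool has_sample = has_sample_impl<T, void, Args...>::value;

// Calls the best available `sample` overload on one component, if any.
template <typename T, typename... Args>
void try_sample(T& obj, Args&&... args)
{
    if constexpr(has_sample<T, Args...>)
        obj.sample(std::forward<Args>(args)...);
    else if constexpr(has_sample<T>)
        obj.sample();  // fall back to the zero-argument overload
    // otherwise no call is generated for this component
}

template <typename... Ts>
struct mini_bundle
{
    std::tuple<Ts...> components;

    // Forwards the call to every component, in declaration order.
    template <typename... Args>
    void sample(Args&&... args)
    {
        std::apply([&](auto&... c) { (try_sample(c, args...), ...); }, components);
    }
};
```

With ``mini_bundle<foo, bar, spam>``, calling ``sample(10)`` reaches ``foo::sample()`` and ``bar::sample(int)``, while ``sample()`` reaches only ``foo::sample()``; ``spam`` contributes no code in either case because the decision is made at compile time.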

Memory model
========================================

Collected data is generally handled in one of the three following ways:

* It is handed directly to, and stored by, Perfetto
* It is managed implicitly by Timemory and accessed as needed
* It is stored as thread-local data

In general, only instrumentation for relatively simple data is directly passed to
Perfetto and/or Timemory during runtime.
For example, the callbacks from binary instrumentation, user API instrumentation,
and rocprofiler-sdk directly invoke
calls to Perfetto or Timemory's storage model. Otherwise, the data is stored
by ROCm Systems Profiler in the thread-data model,
which is more persistent than plain ``thread_local`` static data because the latter is deleted
when the thread stops.

Thread identification
--------------------------------------

Each CPU thread is assigned two integral identifiers. One identifier, the ``internal_value``, is
atomically incremented every time a new thread is created.
The other identifier, known as the ``sequent_value``, tries to account for the fact that ROCm Systems Profiler, Perfetto, ROCm, and other applications
start background threads. When a thread is created as a by-product of ROCm Systems Profiler,
the index is offset by a large value. This serves
two purposes:

* Data for threads created by the user is stored closer together in memory
* When log messages are printed, the index approximately correlates to the order of thread creation from the user's perspective.

The ``sequent_value`` identifier is typically used to access the thread-data.
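
One way to picture the two identifiers is the sketch below. It is an assumption-laden illustration, not the project's thread-info code: the offset value, the counters, and the ``assign_ids`` function are all hypothetical, and only the names ``internal_value`` and ``sequent_value`` come from the text.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

namespace demo
{
// Hypothetical "large value" separating tool-created threads from user threads.
constexpr int64_t internal_offset = int64_t{ 1 } << 20;

struct thread_ids
{
    int64_t internal_value;  // incremented for every thread, in creation order
    int64_t sequent_value;   // dense for user threads, offset for tool threads
};

std::atomic<int64_t> total_threads{ 0 };
std::atomic<int64_t> user_threads{ 0 };
std::atomic<int64_t> tool_threads{ 0 };

// Assigns both identifiers to a newly created thread.
thread_ids assign_ids(bool created_by_profiler)
{
    thread_ids ids{};
    ids.internal_value = total_threads++;
    ids.sequent_value  = created_by_profiler ? internal_offset + tool_threads++
                                             : user_threads++;
    return ids;
}
}  // namespace demo
```

In this sketch, user threads receive ``sequent_value`` 0, 1, 2, and so on regardless of how many background threads start in between, which is what keeps log output aligned with the user's view of thread creation order.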

Thread-data class
--------------------------------------

Currently, most thread data is effectively stored in a static
``std::array<std::unique_ptr<T>, ROCPROFSYS_MAX_THREADS>`` instance.
``ROCPROFSYS_MAX_THREADS`` is a value defined at compile-time for release builds. During finalization,
ROCm Systems Profiler iterates through the thread-data and transforms that data
into something that can be passed along to Perfetto and/or Timemory.

In the current model, if the user exceeds ``ROCPROFSYS_MAX_THREADS`` at runtime,
thread creation fails gracefully with a warning message, excess threads operate with thread-local fallback,
and profiling data for threads beyond ``ROCPROFSYS_MAX_THREADS`` is skipped and not persisted to output files.

To support truly dynamic thread limits without compile-time constraints, a new model is being adopted which
has all the benefits of static allocation but permits dynamic expansion beyond ``ROCPROFSYS_MAX_THREADS``.
Currently, the thread limit can be increased at compile-time using the ``ROCPROFSYS_MAX_THREADS`` CMake configuration option.
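
A stripped-down version of this storage pattern might look like the following. It is a sketch under stated assumptions, not the real thread-data class: ``demo::thread_data``, ``instances``, and ``get`` are hypothetical names, and ``max_threads`` stands in for ``ROCPROFSYS_MAX_THREADS`` with a deliberately small value.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

namespace demo
{
constexpr std::size_t max_threads = 8;  // stand-in for ROCPROFSYS_MAX_THREADS

template <typename T>
struct thread_data
{
    using array_t = std::array<std::unique_ptr<T>, max_threads>;

    // One static array per data type; it outlives every individual thread.
    static array_t& instances()
    {
        static array_t data{};
        return data;
    }

    // Returns the slot for thread index `idx`, constructing it on first use.
    // Returns nullptr when `idx` is beyond the compile-time limit.
    static T* get(std::size_t idx)
    {
        if(idx >= max_threads) return nullptr;
        auto& slot = instances()[idx];
        if(!slot) slot = std::make_unique<T>();
        return slot.get();
    }
};
}  // namespace demo
```

Because the array is static rather than ``thread_local``, a finalization step can still walk ``instances()`` after the worker threads have exited and convert whatever they recorded.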

Configuring thread limits
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

ROCm Systems Profiler uses a single CMake configuration option to control thread-related memory allocation:

* ``ROCPROFSYS_MAX_THREADS``: Maximum number of threads supported (default if not explicitly set: ``128`` if nproc < 8, otherwise ``pow2_ceil(16 * nproc)``; must be a power of 2)

This setting controls:

* Thread ID manager capacity (maximum thread IDs that can be tracked)
* Storage array sizes for thread-local data across the codebase
* Timemory's internal thread storage (``TIMEMORY_MAX_THREADS``)

**Build-time validation:**

CMake enforces that ``ROCPROFSYS_MAX_THREADS`` must be a power of 2:

.. code-block:: cmake

   # Valid:   16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, ... any power of 2
   # Invalid: 100, 3000, 5000, 10000, ... (FATAL_ERROR)

**Example: Building with custom thread limit**

.. code-block:: shell

   # Build with support for 8192 threads
   cmake -B build \
       -DROCPROFSYS_MAX_THREADS=8192 \
       ..

   cmake --build build
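
The default value and the power-of-2 rule can be checked with a few lines of arithmetic. The sketch below mirrors the rule stated above rather than the project's CMake logic: it assumes ``pow2_ceil`` means "the smallest power of 2 that is greater than or equal to its argument", and ``is_pow2`` and ``default_max_threads`` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// True when `v` is a power of 2 (exactly one bit set).
constexpr bool is_pow2(uint64_t v) { return v != 0 && (v & (v - 1)) == 0; }

// Smallest power of 2 that is >= v (assumed meaning of pow2_ceil).
constexpr uint64_t pow2_ceil(uint64_t v)
{
    uint64_t p = 1;
    while(p < v) p <<= 1;
    return p;
}

// Default thread limit for a machine with `nproc` processors, per the rule above.
constexpr uint64_t default_max_threads(uint64_t nproc)
{
    return (nproc < 8) ? 128 : pow2_ceil(16 * nproc);
}

static_assert(is_pow2(8192), "8192 is accepted");
static_assert(!is_pow2(3000), "3000 is rejected");
```

Under these assumptions, a 12-core machine gets ``pow2_ceil(192)``, which is 256, and a 64-core machine gets exactly 1024.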

Sampling model
========================================

The general sampling infrastructure lives within Timemory (``source/timemory/sampling``).
Currently, all sampling is done per-thread
via POSIX timers. ROCm Systems Profiler supports both a real-time timer and a CPU-time timer.
Both have adjustable frequencies, delays, and durations.
By default, only CPU-time sampling is enabled. Initial settings are inherited from
the settings starting with ``ROCPROFSYS_SAMPLING_``.
For each type of timer, timer-specific settings can be used to
override the common and inherited timer settings.
These settings begin with ``ROCPROFSYS_SAMPLING_CPUTIME`` for the CPU-time sampler
and ``ROCPROFSYS_SAMPLING_REALTIME`` for
the real-time sampler. For example, ``ROCPROFSYS_SAMPLING_FREQ=500`` initially sets the
sampling frequency to 500 interrupts per second. Adding the setting ``ROCPROFSYS_SAMPLING_REALTIME_FREQ=10``
lowers the sampling frequency for the real-time sampler
to 10 interrupts per second of real-time.
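
The override order can be sketched as a two-level lookup. This is an illustration of the inheritance rule only, not the settings code in ROCm Systems Profiler: ``env_or`` and ``sampling_freq`` are hypothetical helpers, the built-in default passed by the caller is arbitrary, and the ``ROCPROFSYS_SAMPLING_CPUTIME_FREQ`` name is inferred from the prefix described above.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Returns environment variable `name` parsed as a number, or `fallback`
// when the variable is unset or empty.
double env_or(const std::string& name, double fallback)
{
    const char* value = std::getenv(name.c_str());
    return (value != nullptr && *value != '\0') ? std::strtod(value, nullptr) : fallback;
}

// Resolves the sampling frequency for one timer ("CPUTIME" or "REALTIME"):
// the timer-specific setting wins, then the common setting, then the default.
double sampling_freq(const std::string& timer, double builtin_default)
{
    double common = env_or("ROCPROFSYS_SAMPLING_FREQ", builtin_default);
    return env_or("ROCPROFSYS_SAMPLING_" + timer + "_FREQ", common);
}
```

With ``ROCPROFSYS_SAMPLING_FREQ=500`` and ``ROCPROFSYS_SAMPLING_REALTIME_FREQ=10`` in the environment, this lookup yields 500 for the CPU-time sampler and 10 for the real-time sampler, matching the example in the text.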

The ROCm Systems Profiler-specific implementation can be found in
`source/lib/rocprof-sys/library/sampling.cpp <https://github.com/ROCm/rocm-systems/blob/develop/projects/rocprofiler-systems/source/lib/rocprof-sys/library/sampling.cpp>`_.
Within `sampling.cpp <https://github.com/ROCm/rocm-systems/blob/develop/projects/rocprofiler-systems/source/lib/rocprof-sys/library/sampling.cpp>`_,
there is a bundle of three sampling components:

* `backtrace_timestamp <https://github.com/ROCm/rocm-systems/blob/develop/projects/rocprofiler-systems/source/lib/rocprof-sys/library/components/backtrace_timestamp.hpp>`_ simply
  records the wall-clock time of the sample.
* `backtrace <https://github.com/ROCm/rocm-systems/blob/develop/projects/rocprofiler-systems/source/lib/rocprof-sys/library/components/backtrace.hpp>`_
  records the call-stack via libunwind.
* `backtrace_metrics <https://github.com/ROCm/rocm-systems/blob/develop/projects/rocprofiler-systems/source/lib/rocprof-sys/library/components/backtrace_metrics.hpp>`_
  records the sample metrics, such as peak RSS and the hardware counters.

These three components are bundled together in
a tuple-like ``struct`` (``tuple<backtrace_timestamp, backtrace, backtrace_metrics>``).
A buffer of at least 1024 instances of this tuple is mapped using ``mmap``
per-thread. When this buffer is full,
the sampler hands the buffer off to its allocator thread and maps a new buffer with ``mmap``
before taking the next sample. The allocator thread takes this data
and either dynamically stores it in memory or writes it to a file depending on the
value of ``ROCPROFSYS_USE_TEMPORARY_FILES``.
This scheme avoids all allocations in the signal handler, lets the data grow
dynamically, avoids potentially slow I/O within the signal handler, and also makes it
possible to avoid I/O altogether.
The maximum number of samplers handled by each allocator is governed by the
``ROCPROFSYS_SAMPLING_ALLOCATOR_SIZE`` setting (the default is eight). Whenever an allocator
has reached its limit,
a new internal thread is created to handle the new samplers.
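
The buffer hand-off can be pictured with the simplified, single-threaded sketch below. It is not the Timemory sampler: ``sample_t``, ``sample_buffer``, and ``sampler`` are hypothetical types, the sample holds only a timestamp, and the allocator thread is replaced by a small fixed-size array so that the hand-off itself needs no heap allocation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

struct sample_t { long timestamp; };  // stand-in for the three-component tuple

struct sample_buffer
{
    static constexpr std::size_t capacity = 1024;
    sample_t*   data = nullptr;
    std::size_t size = 0;

    // Maps an anonymous, zero-filled region large enough for `capacity` samples.
    static sample_buffer create()
    {
        void* ptr = mmap(nullptr, capacity * sizeof(sample_t), PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        sample_buffer buffer;
        buffer.data = (ptr == MAP_FAILED) ? nullptr : static_cast<sample_t*>(ptr);
        return buffer;
    }

    void destroy()
    {
        if(data != nullptr) munmap(data, capacity * sizeof(sample_t));
        data = nullptr;
        size = 0;
    }

    bool full() const { return size == capacity; }
};

struct sampler
{
    sample_buffer                active = sample_buffer::create();
    std::array<sample_buffer, 8> handed_off{};  // stands in for the allocator's queue
    std::size_t                  num_handed_off = 0;

    // Called for every sample: swap in a fresh buffer when the current one is full.
    void take_sample(long timestamp)
    {
        if(active.full() && num_handed_off < handed_off.size())
        {
            handed_off[num_handed_off++] = active;  // hand the full buffer off
            active = sample_buffer::create();       // map a new one before sampling
        }
        if(active.data != nullptr && !active.full())
            active.data[active.size++] = sample_t{ timestamp };
    }
};
```

In this sketch the sampling path only writes into mapped memory and swaps buffer handles; storing the data or writing it to a file would be left to whoever drains ``handed_off``, which corresponds to the allocator thread in the description above.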

Time-window constraint model
========================================

The
`constraint namespace <https://github.com/ROCm/rocm-systems/blob/develop/projects/rocprofiler-systems/source/lib/core/constraint.hpp>`_
was added alongside the tracing delay and duration settings to improve the management of delays and duration limits for
data collection. The ``spec`` class accepts a clock identifier, a delay value, a duration value, and an
integer indicating how many times to repeat the delay and duration cycle. It is therefore
possible to perform tasks such as periodically enabling tracing for brief periods
of time in between long periods without data collection while the application runs. The
syntax follows the format ``clock_identifier:delay:capture_duration:cycles``, so a value of
``10:1:3`` for the last three parameters represents the following sequence of operations:

* Ten seconds where no data is collected, then one second where it is
* Ten seconds where no data is collected, then one second where it is
* Ten seconds where no data is collected, then one second where it is
* Stop

As another example, ``ROCPROFSYS_TRACE_PERIODS = realtime:10:1:5 process_cputime:10:2:20`` translates
to this sequence:

* Five cycles of: no data collection for ten seconds of real-time followed by one second of data collection
* Twenty cycles of: no data collection for ten seconds of process CPU time followed by two CPU-time seconds of data collection
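
A parser for this syntax could be sketched as follows. This is not the ``constraint::spec`` implementation: the ``spec`` struct here is a simplified stand-in, ``parse_spec`` and ``parse_periods`` are hypothetical helpers, and the sketch assumes that all four fields are required and that multiple specifications are separated by whitespace.

```cpp
#include <cassert>
#include <optional>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

struct spec
{
    std::string clock_id;        // for example "realtime" or "process_cputime"
    double      delay    = 0.0;  // seconds without data collection
    double      duration = 0.0;  // seconds with data collection
    long        cycles   = 1;    // number of delay + duration repetitions
};

// Parses one "clock_identifier:delay:capture_duration:cycles" entry.
std::optional<spec> parse_spec(const std::string& text)
{
    std::vector<std::string> fields;
    std::stringstream        stream{ text };
    for(std::string field; std::getline(stream, field, ':');)
        fields.push_back(field);
    if(fields.size() != 4) return std::nullopt;
    try
    {
        return spec{ fields[0], std::stod(fields[1]), std::stod(fields[2]),
                     std::stol(fields[3]) };
    } catch(const std::exception&)
    {
        return std::nullopt;  // a numeric field could not be parsed
    }
}

// Parses a whitespace-separated list of entries, skipping malformed ones.
std::vector<spec> parse_periods(const std::string& value)
{
    std::vector<spec> specs;
    std::stringstream stream{ value };
    for(std::string token; stream >> token;)
        if(auto parsed = parse_spec(token)) specs.push_back(*parsed);
    return specs;
}
```

Applied to ``realtime:10:1:5 process_cputime:10:2:20``, this yields two entries: five cycles on the real-time clock and twenty cycles on the process CPU-time clock.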

Eventually, the goal is to migrate all subsets of data collection which currently support
more rudimentary models of time window constraints, such as process sampling and causal profiling,
to this model.