Update branding to ROCm Systems Profiler in documentation (#2)

* Update branding in docs

* Rename image used in documentation

* Update names of code samples.

In the code snippets, the "-" is not valid, e.g., rocprof-sys_ --> rocprofsys_

* Update ASCII art

* update Doxyfile strip_from_path

* Add a "Formerly known as" message.

* Fixed typo in product name

ROCm Systems Profiler, not ROCm Profiler System

* Add "Omnitrace" back to the metadata keywords

* Update "install via package manager" section

* Update paths to user API files

* Rename configuration and environment settings

* Update Doxyfiles

Update publisher name & ID to "AMD".
Update bundle ID to "rocprofiler-systems"

* Update docs/what-is-rocprof-sys.rst

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update docs/conceptual/data-collection-modes.rst

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update docs/tutorials/video-tutorials.rst

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update docs/conceptual/rocprof-sys-feature-set.rst

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update docs/how-to/configuring-runtime-options.rst

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update docs/how-to/configuring-validating-environment.rst

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update docs/how-to/general-tips-using-rocprof-sys.rst

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update docs/reference/rocprof-sys-glossary.rst

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update docs/reference/development-guide.rst

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update docs/how-to/instrumenting-rewriting-binary-application.rst

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update docs/install/quick-start.rst

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Note that videos were recorded using the "Omnitrace" name.

* Rebase and update some file paths

* Update paths to doc images

* Update Omnitrace references in code snippets

* Rename examples still using the "omni" prefix.

* Update docs/how-to/performing-causal-profiling.rst

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update docs/how-to/profiling-python-scripts.rst

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update docs/how-to/sampling-call-stack.rst

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update docs/how-to/understanding-rocprof-sys-output.rst

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update docs/install/install.rst

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

---------

Co-authored-by: Peter Park <peter.park@amd.com>
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

[ROCm/rocprofiler-systems commit: 032d39f15c]
David Galiffi
2024-10-17 15:19:19 -04:00
committed by: GitHub
parent 181a782835
commit d13617cf91
59 changed files with 1340 additions and 10282 deletions
+9 -8
@@ -20,7 +20,7 @@ In addition to runtimes, ROCm Systems Profiler supports the collection of system
such as the memory usage, page-faults, and context-switches, and thread-level metrics such as memory usage, CPU time, and numerous hardware counters.
> [!NOTE]
> Full documentation is available at [ROCm Systems Profiler documentation](https://rocm.docs.amd.com/projects/omnitrace/en/latest/index.html) in an organized, easy-to-read, searchable format.
> Full documentation is available at [ROCm Systems Profiler documentation](https://rocm.docs.amd.com/projects/rocprofiler-systems/en/latest/index.html) in an organized, easy-to-read, searchable format.
The documentation source files reside in the [`/docs`](/docs) folder of this repository. For information on contributing to the documentation, see
[Contribute to ROCm documentation](https://rocm.docs.amd.com/en/latest/contribute/contributing.html)
@@ -99,14 +99,14 @@ The documentation source files reside in the [`/docs`](/docs) folder of this rep
If the above recommendation is not desired, download the `rocprofiler-systems-install.py` and specify `--prefix <install-directory>` when
executing it. This script will attempt to auto-detect a compatible OS distribution and version.
If ROCm support is desired, specify `--rocm X.Y` where `X` is the ROCm major version and `Y`
is the ROCm minor version, e.g. `--rocm 5.4`.
is the ROCm minor version, e.g. `--rocm 6.2`.
```console
wget https://github.com/ROCm/rocprofiler-systems/releases/latest/download/rocprofiler-systems-install.py
python3 ./rocprofiler-systems-install.py --prefix /opt/rocprofiler-systems/rocm-5.4 --rocm 5.4
python3 ./rocprofiler-systems-install.py --prefix /opt/rocprofiler-systems --rocm 6.2
```
See the [ROCm Systems Profiler installation guide](https://rocm.docs.amd.com/projects/omnitrace/en/latest/install/install.html) for detailed information.
See the [ROCm Systems Profiler installation guide](https://rocm.docs.amd.com/projects/rocprofiler-systems/en/latest/install/install.html) for detailed information.
### Setup
@@ -295,13 +295,13 @@ for `foo` via the direct call within `spam`. There will be no entries for `bar`
- Select "Open trace file" from panel on the left
- Locate the rocprofiler-systems perfetto output (extension: `.proto`)
![rocprof-sys-perfetto](docs/data/omnitrace-perfetto.png)
![rocprof-sys-perfetto](docs/data/rocprof-sys-perfetto.png)
![rocprof-sys-rocm](docs/data/omnitrace-rocm.png)
![rocprof-sys-rocm](docs/data/rocprof-sys-rocm.png)
![rocprof-sys-rocm-flow](docs/data/omnitrace-rocm-flow.png)
![rocprof-sys-rocm-flow](docs/data/rocprof-sys-rocm-flow.png)
![rocprof-sys-user-api](docs/data/omnitrace-user-api.png)
![rocprof-sys-user-api](docs/data/rocprof-sys-user-api.png)
## Using Perfetto tracing with System Backend
@@ -331,6 +331,7 @@ Configure rocprofiler-systems to use the perfetto system backend via the `--perf
```shell
# enable sampling on the uninstrumented binary
rocprof-sys-run --sample --trace --perfetto-backend=system -- ./myapp
# instrument the binary, then trace the instrumented binary
rocprof-sys-instrument -o ./myapp.inst -- ./myapp
rocprof-sys-run --trace --perfetto-backend=system -- ./myapp.inst
+41 -40
@@ -1,17 +1,17 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
:description: ROCm Systems Profiler data collection modes documentation
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, profiler, data collection, tracking, visualization, tool, Instinct, accelerator, AMD
**********************
Data collection modes
**********************
Omnitrace supports several modes of recording trace and profiling data for your application.
ROCm Systems Profiler supports several modes of recording trace and profiling data for your application.
.. note::
For an explanation of the terms used in this topic, see
the :doc:`Omnitrace glossary <../reference/omnitrace-glossary>`.
For an explanation of the terms used in this topic, see
the :doc:`ROCm Systems Profiler glossary <../reference/rocprof-sys-glossary>`.
+-----------------------------+---------------------------------------------------------+
| Mode | Description |
@@ -23,61 +23,62 @@ Omnitrace supports several modes of recording trace and profiling data for your
| | and records various metrics for the given call stack |
+-----------------------------+---------------------------------------------------------+
| Callback APIs | Parallelism frameworks such as ROCm, OpenMP, and Kokkos |
| | make callbacks into Omnitrace to provide information |
| | about the work the API is performing |
| | make callbacks into ROCm Systems Profiler to provide |
| | information about the work the API is performing |
+-----------------------------+---------------------------------------------------------+
| Dynamic Symbol Interception | Wrap function symbols defined in a position independent |
| | dynamic library/executable, like ``pthread_mutex_lock`` |
| | in ``libpthread.so`` or ``MPI_Init`` in the MPI library |
+-----------------------------+---------------------------------------------------------+
| User API | User-defined regions and controls for Omnitrace |
| User API | User-defined regions and controls for ROCm Systems |
| | Profiler |
+-----------------------------+---------------------------------------------------------+
The two most generic and important modes are binary instrumentation and statistical sampling.
The two most generic and important modes are binary instrumentation and statistical sampling.
It is important to understand their advantages and disadvantages.
Binary instrumentation and statistical sampling can be performed with the ``omnitrace-instrument``
Binary instrumentation and statistical sampling can be performed with the ``rocprof-sys-instrument``
executable. For statistical sampling, it's highly recommended to use the
``omnitrace-sample`` executable instead if binary instrumentation isn't required or needed.
``rocprof-sys-sample`` executable instead if binary instrumentation isn't required or needed.
Callback APIs and dynamic symbol interception can be utilized with either tool.
Binary instrumentation
-----------------------------------
Binary instrumentation lets you record deterministic measurements for
Binary instrumentation lets you record deterministic measurements for
every single invocation of a given function.
Binary instrumentation effectively adds instructions to the target application to
collect the required information. It therefore has the potential to cause performance
changes which might, in some cases, lead to inaccurate results. The effect depends on
the information being collected and which features are activated in Omnitrace.
Binary instrumentation effectively adds instructions to the target application to
collect the required information. It therefore has the potential to cause performance
changes which might, in some cases, lead to inaccurate results. The effect depends on
the information being collected and which features are activated in ROCm Systems Profiler.
For example, collecting only the wall-clock timing data
has less of an effect than collecting the wall-clock timing, CPU-clock timing,
memory usage, cache-misses, and number of instructions that were run. Similarly,
collecting a flat profile has less overhead than a hierarchical profile
and collecting a trace OR a profile has less overhead than collecting a
has less of an effect than collecting the wall-clock timing, CPU-clock timing,
memory usage, cache-misses, and number of instructions that were run. Similarly,
collecting a flat profile has less overhead than a hierarchical profile
and collecting a trace OR a profile has less overhead than collecting a
trace AND a profile.
In Omnitrace, the primary heuristic for controlling the overhead with binary
instrumentation is the minimum number of instructions for selecting functions
In ROCm Systems Profiler, the primary heuristic for controlling the overhead with binary
instrumentation is the minimum number of instructions for selecting functions
for instrumentation.
Statistical sampling
-----------------------------------
Statistical call-stack sampling periodically interrupts the application at
Statistical call-stack sampling periodically interrupts the application at
regular intervals using operating system interrupts.
Sampling is typically less numerically accurate and specific, but the
Sampling is typically less numerically accurate and specific, but the
target program runs at nearly full speed.
In contrast to the data derived from binary instrumentation, the resulting
In contrast to the data derived from binary instrumentation, the resulting
data is not exact but is instead a statistical approximation.
However, sampling often provides a more accurate picture of the application
However, sampling often provides a more accurate picture of the application
execution because it is less intrusive to the target application and has fewer
side effects on memory caches or instruction decoding pipelines. Furthermore,
side effects on memory caches or instruction decoding pipelines. Furthermore,
because sampling does not affect the execution speed as much, it is
relatively immune to over-evaluating the cost of small, frequently called
relatively immune to over-evaluating the cost of small, frequently called
functions or "tight" loops.
In Omnitrace, the overhead for statistical sampling depends on the
sampling rate and whether the samples are taken with respect to the CPU time
In ROCm Systems Profiler, the overhead for statistical sampling depends on the
sampling rate and whether the samples are taken with respect to the CPU time
and/or real time.
Binary instrumentation vs. statistical sampling example
@@ -112,24 +113,24 @@ Consider the following code:
return 0;
}
Binary instrumentation of the ``fib`` function will record **every single invocation**
Binary instrumentation of the ``fib`` function will record **every single invocation**
of the function. For a very small function
such as ``fib``, this results in **significant** overhead since this simple function
such as ``fib``, this results in **significant** overhead since this simple function
takes about 20 instructions, whereas the entry and
exit snippets are ~1024 instructions. Therefore, you generally want to avoid
exit snippets are ~1024 instructions. Therefore, you generally want to avoid
instrumenting functions where the instrumented function has significantly fewer
instructions than entry and exit instrumentation. (Note that many of the
instructions than entry and exit instrumentation. (Note that many of the
instructions in entry and exit functions are either logging functions or
depend on the runtime settings and thus might never run). However,
depend on the runtime settings and thus might never run). However,
due to the number of potential instructions in the entry and exit snippets,
the default behavior of ``omnitrace-instrument`` is to only instrument functions
the default behavior of ``rocprof-sys-instrument`` is to only instrument functions
which contain at least 1024 instructions.
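To make the overhead argument concrete, the following sketch plugs in the figures quoted above (about 20 instructions for ``fib``, roughly 1024 for the entry and exit snippets). The function name ``instrumentation_overhead`` and the simple additive model are assumptions made for illustration. Because many snippet instructions might never run, the result is a pessimistic estimate, not a measurement.

```python
def instrumentation_overhead(function_instructions: int,
                             snippet_instructions: int = 1024) -> float:
    """Return the ratio of instrumented to uninstrumented instruction counts.

    Assumption: every entry/exit snippet instruction executes on each call,
    which overstates the real cost.
    """
    if function_instructions <= 0:
        raise ValueError("function_instructions must be positive")
    return (function_instructions + snippet_instructions) / function_instructions


# A ~20-instruction function such as fib.
print(f"small function: {instrumentation_overhead(20):.1f}x")
# A function right at the default 1024-instruction threshold.
print(f"threshold function: {instrumentation_overhead(1024):.1f}x")
```

Under this model, the relative cost drops quickly as the function grows, which is the motivation for a minimum-instruction threshold.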
However, recording every single invocation of the function can be extremely
However, recording every single invocation of the function can be extremely
useful for detecting anomalies, such as profiles that show minimum or maximum values much smaller or larger
than the average or a high standard deviation. In this case, the traces help you
than the average or a high standard deviation. In this case, the traces help you
identify exactly when and where those instances deviated from the norm.
Compare the level of detail in the following traces. In the top image,
Compare the level of detail in the following traces. In the top image,
every instance of the ``fib`` function is instrumented, while in the bottom image,
the ``fib`` call-stack is derived via sampling.
@@ -1,14 +1,14 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
:description: ROCm Systems Profiler feature set documentation and reference
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, profiler, feature set, use cases, tracking, visualization, tool, Instinct, accelerator, AMD
***************************************
The Omnitrace feature set and use cases
The ROCm Systems Profiler feature set and use cases
***************************************
`Omnitrace <https://github.com/ROCm/omnitrace>`_ is designed to be highly extensible.
Internally, it leverages the `Timemory performance analysis toolkit <https://github.com/NERSC/timemory>`_
to manage extensions, resources, data, and other items. It supports the following features,
`ROCm Systems Profiler <https://github.com/ROCm/rocprofiler-systems>`_ is designed to be highly extensible.
Internally, it leverages the `Timemory performance analysis toolkit <https://github.com/NERSC/timemory>`_
to manage extensions, resources, data, and other items. It supports the following features,
modes, metrics, and APIs.
Data collection modes
@@ -22,11 +22,6 @@ Data collection modes
* Statistical sampling: Periodic software interrupts per-thread
* Process-level sampling: A background thread records process-, system- and device-level metrics while the application runs
* Causal profiling: Quantifies the potential impact of optimizations in parallel code
.. note::
Critical trace support was removed in Omnitrace v1.11.0.
It was replaced by the causal profiling feature.
Data analysis
========================================
@@ -98,40 +93,40 @@ Third-party API support
* NVTX
* ROCTX
Omnitrace use cases
ROCm Systems Profiler use cases
========================================
When analyzing the performance of an application, do NOT
When analyzing the performance of an application, do NOT
assume you know where the performance bottlenecks are
and why they are happening. Omnitrace is a tool for analyzing the entire
and why they are happening. ROCm Systems Profiler is a tool for analyzing the entire
application and its performance. It is
ideal for characterizing where optimization would have the greatest impact
ideal for characterizing where optimization would have the greatest impact
on an end-to-end run of the application and for
viewing what else is happening on the system during a performance bottleneck.
When GPUs are involved, there is a tendency to assume that
When GPUs are involved, there is a tendency to assume that
the quickest path to performance improvement is minimizing
the runtime of the GPU kernels. This is a highly flawed assumption.
the runtime of the GPU kernels. This is a highly flawed assumption.
If you optimize the runtime of a kernel from one millisecond
to 1 microsecond (1000x speed-up) but the original application never
to 1 microsecond (1000x speed-up) but the original application never
spent time waiting for kernels to complete,
there would be no statistically significant reduction in the end-to-end
there would be no statistically significant reduction in the end-to-end
runtime of your application. In other words, it does not matter
how fast or slow the code on GPU is if the application is not
how fast or slow the code on GPU is if the application is not
bottlenecked waiting on the GPU.
Use Omnitrace to obtain a high-level view of the entire application. Use it
Use ROCm Systems Profiler to obtain a high-level view of the entire application. Use it
to determine where the performance bottlenecks are and
obtain clues to why these bottlenecks are happening. Rather than worrying about kernel
performance, start your investigation with Omnitrace, which characterizes the
performance, start your investigation with ROCm Systems Profiler, which characterizes the
broad picture.
.. note::
For insight into the execution of individual kernels on the GPU,
use `Omniperf <https://github.com/rocm/omniperf>`_.
For insight into the execution of individual kernels on the GPU,
use `ROCm Compute Profiler <https://github.com/rocm/rocprofiler-compute>`_.
In terms of CPU analysis, Omnitrace does not target any specific vendor.
In terms of CPU analysis, ROCm Systems Profiler does not target any specific vendor.
It works just as well on AMD and non-AMD CPUs.
With regard to the GPU, Omnitrace is currently restricted to HIP and HSA APIs
With regard to the GPU, ROCm Systems Profiler is currently restricted to HIP and HSA APIs
and kernels running on AMD GPUs.
+3 -3
@@ -36,14 +36,14 @@ with open("../VERSION", encoding="utf-8") as f:
raise ValueError("VERSION not found!")
version_number = match[1]
external_projects_current_project = "omnitrace"
external_projects_current_project = "rocprofiler-systems"
project = "omnitrace"
project = "rocprofiler-systems"
author = "Advanced Micro Devices, Inc."
copyright = "Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved."
version = version_number
release = version_number
html_title = f"Omnitrace {version} documentation"
html_title = f"ROCm Systems Profiler {version} documentation"
external_toc_path = "./sphinx/_toc.yml"
Binary files not shown (image previews omitted; reported sizes: 313 KiB, 195 KiB, 230 KiB, and 277 KiB).


+12 -12
@@ -4,7 +4,7 @@
# Project related configuration options
#---------------------------------------------------------------------------
DOXYFILE_ENCODING = UTF-8
PROJECT_NAME = omnitrace
PROJECT_NAME = rocprofiler-systems
PROJECT_NUMBER = 1.11.3
PROJECT_BRIEF = "High-level and comprehensive application tracing and profiling on both the CPU and GPU"
PROJECT_LOGO =
@@ -19,8 +19,8 @@ ABBREVIATE_BRIEF =
ALWAYS_DETAILED_SEC = YES
INLINE_INHERITED_MEMB = YES
FULL_PATH_NAMES = YES
STRIP_FROM_PATH = /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-omnitrace/checkouts/
STRIP_FROM_INC_PATH = /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-omnitrace/checkouts/
STRIP_FROM_PATH = /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-rocprofiler-systems/checkouts/
STRIP_FROM_INC_PATH = /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-rocprofiler-systems/checkouts/
SHORT_NAMES = NO
JAVADOC_AUTOBRIEF = NO
JAVADOC_BANNER = NO
@@ -114,10 +114,10 @@ WARN_LOGFILE = doc/warnings.log
# Configuration options related to the input files
#---------------------------------------------------------------------------
INPUT = ../../README.md \
../../source/lib/omnitrace-user/omnitrace/types.h \
../../source/lib/omnitrace-user/omnitrace/categories.h \
../../source/lib/omnitrace-user/omnitrace/user.h \
../../source/lib/omnitrace-user/omnitrace/causal.h
../../source/lib/rocprof-sys-user/rocprofiler-systems/types.h \
../../source/lib/rocprof-sys-user/rocprofiler-systems/categories.h \
../../source/lib/rocprof-sys-user/rocprofiler-systems/user.h \
../../source/lib/rocprof-sys-user/rocprofiler-systems/causal.h
INPUT_ENCODING = UTF-8
FILE_PATTERNS = *.h \
*.hh \
@@ -198,9 +198,9 @@ HTML_DYNAMIC_SECTIONS = YES
HTML_INDEX_NUM_ENTRIES = 1000
GENERATE_DOCSET = NO
DOCSET_FEEDNAME = "Doxygen generated docs"
DOCSET_BUNDLE_ID = org.doxygen.omnitrace
DOCSET_PUBLISHER_ID = org.doxygen.amdresearch
DOCSET_PUBLISHER_NAME = "Audacious Software Group"
DOCSET_BUNDLE_ID = org.doxygen.rocprofiler-systems
DOCSET_PUBLISHER_ID = org.doxygen.amd
DOCSET_PUBLISHER_NAME = "Advanced Micro Devices, Inc."
GENERATE_HTMLHELP = NO
CHM_FILE =
HHC_LOCATION =
@@ -217,7 +217,7 @@ QHP_CUST_FILTER_ATTRS =
QHP_SECT_FILTER_ATTRS =
QHG_LOCATION =
GENERATE_ECLIPSEHELP = NO
ECLIPSE_DOC_ID = org.doxygen.omnitrace
ECLIPSE_DOC_ID = org.doxygen.rocprofiler-systems
DISABLE_INDEX = NO
GENERATE_TREEVIEW = NO
ENUM_VALUES_PER_LINE = 1
@@ -311,7 +311,7 @@ ENABLE_PREPROCESSING = YES
MACRO_EXPANSION = YES
EXPAND_ONLY_PREDEF = NO
SEARCH_INCLUDES = YES
INCLUDE_PATH = ../../source/lib/omnitrace-user
INCLUDE_PATH = ../../source/lib/rocprof-sys-user
INCLUDE_FILE_PATTERNS = *.h \
*.hpp
PREDEFINED = ROCPROFSYS_PUBLIC_API= \
+281 -279
@@ -1,131 +1,133 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
:description: ROCm Systems Profiler runtime options documentation and reference
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, runtime options, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
****************************************************
Configuring runtime options
****************************************************
The ``omnitrace.cfg`` file maintains a list of the `Omnitrace <https://github.com/ROCm/omnitrace>`_ runtime options. To create this configuration
file and view the current runtime options, use the ``omnitrace-avail`` executable.
The ``rocprof-sys.cfg`` file maintains a list of the
`ROCm Systems Profiler <https://github.com/ROCm/rocprofiler-systems>`_ runtime
options. To create this configuration
file and view the current runtime options, use the ``rocprof-sys-avail`` executable.
The omnitrace-avail executable
The rocprof-sys-avail executable
========================================
The ``omnitrace-avail`` executable provides information about the runtime settings,
The ``rocprof-sys-avail`` executable provides information about the runtime settings,
data collection capabilities, and, when built with PAPI support, the
available hardware counters. The executable is effectively
self-updating. As new capabilities and settings are added to the Omnitrace source code, they are
propagated to ``omnitrace-avail``. ``omnitrace-avail`` should be viewed as the ultimate authority
self-updating. As new capabilities and settings are added to the ROCm Systems Profiler source code, they are
propagated to ``rocprof-sys-avail``. ``rocprof-sys-avail`` should be viewed as the ultimate authority
in the event of any conflicts with this documentation.
It is recommended that you create a default configuration file in
``${HOME}/.omnitrace.cfg``. This can be done by
running the command ``omnitrace-avail -G ~/.omnitrace.cfg``. Alternatively,
use the ``omnitrace-avail -G ~/.omnitrace.cfg --all`` option
It is recommended that you create a default configuration file in
``${HOME}/.rocprof-sys.cfg``. This can be done by
running the command ``rocprof-sys-avail -G ~/.rocprof-sys.cfg``. Alternatively,
use the ``rocprof-sys-avail -G ~/.rocprof-sys.cfg --all`` option
for a verbose configuration file with descriptions, categories, and additional information.
Modify ``${HOME}/.omnitrace.cfg`` as required. For example, enable `Perfetto <https://perfetto.dev/>`_,
Modify ``${HOME}/.rocprof-sys.cfg`` as required. For example, enable `Perfetto <https://perfetto.dev/>`_,
`Timemory <https://github.com/NERSC/timemory>`_, sampling, and process-level sampling by default
and tweak the default sampling values.
.. code-block:: shell
# ...
OMNITRACE_TRACE = true
OMNITRACE_PROFILE = true
OMNITRACE_USE_SAMPLING = true
OMNITRACE_USE_PROCESS_SAMPLING = true
ROCPROFSYS_TRACE = true
ROCPROFSYS_PROFILE = true
ROCPROFSYS_USE_SAMPLING = true
ROCPROFSYS_USE_PROCESS_SAMPLING = true
# ...
OMNITRACE_SAMPLING_FREQ = 50
OMNITRACE_SAMPLING_CPUS = all
OMNITRACE_SAMPLING_GPUS = $env:HIP_VISIBLE_DEVICES
ROCPROFSYS_SAMPLING_FREQ = 50
ROCPROFSYS_SAMPLING_CPUS = all
ROCPROFSYS_SAMPLING_GPUS = $env:HIP_VISIBLE_DEVICES
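The settings above follow a ``NAME = value`` layout with ``#`` comment lines. As a rough illustration, the sketch below reads such lines into a dictionary. The parsing rules (whitespace trimming, ``#`` starting a comment line) are inferred from the snippet and are assumptions. The tool's own parser may behave differently, and ``$env:`` references are deliberately left unexpanded here.

```python
def parse_cfg(text: str) -> dict[str, str]:
    """Parse 'NAME = value' lines, skipping blanks and '#' comments.

    Hypothetical helper for illustration; not part of rocprof-sys.
    """
    settings: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            continue  # ignore lines that are not assignments
        settings[name.strip()] = value.strip()
    return settings


SAMPLE = """
# ...
ROCPROFSYS_TRACE = true
ROCPROFSYS_SAMPLING_FREQ = 50
ROCPROFSYS_SAMPLING_GPUS = $env:HIP_VISIBLE_DEVICES
"""

for name, value in parse_cfg(SAMPLE).items():
    print(f"{name} -> {value}")
```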
Exploring runtime settings
-----------------------------------
Use the following command to view the list of the available runtime settings, their current values, and descriptions
Use the following command to view the list of the available runtime settings, their current values, and descriptions
for each setting:
.. code-block:: shell
omnitrace-avail --description
rocprof-sys-avail --description
.. note::
Use ``--brief`` to suppress printing the current value and/or ``-c 0`` to suppress truncation of the descriptions.
Any Boolean setting (``omnitrace-avail --settings --value --brief --filter bool``)
accepts a case insensitive match for nearly all common Boolean logic expressions:
Any Boolean setting (``rocprof-sys-avail --settings --value --brief --filter bool``)
accepts a case insensitive match for nearly all common Boolean logic expressions:
``ON``, ``OFF``, ``YES``, ``NO``, ``TRUE``, ``FALSE``, ``0``, ``1``, etc.
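A case-insensitive Boolean match of this kind can be sketched as follows. This is not the tool's implementation: the helper name ``parse_bool`` is hypothetical, and only the spellings listed above are recognized (the "etc." in the list is not expanded).

```python
# Spellings taken from the list above; anything else is rejected.
TRUE_VALUES = {"on", "yes", "true", "1"}
FALSE_VALUES = {"off", "no", "false", "0"}


def parse_bool(value: str) -> bool:
    """Interpret a setting string as a Boolean, ignoring case and padding."""
    key = value.strip().lower()
    if key in TRUE_VALUES:
        return True
    if key in FALSE_VALUES:
        return False
    raise ValueError(f"not a recognized Boolean value: {value!r}")


for text in ("ON", "off", "Yes", "FALSE", "1"):
    print(f"{text!r} -> {parse_bool(text)}")
```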
Exploring components
-----------------------------------
Omnitrace uses `Timemory <https://github.com/NERSC/timemory>`_ extensively to provide
ROCm Systems Profiler uses `Timemory <https://github.com/NERSC/timemory>`_ extensively to provide
various capabilities and manage
data and resources. By default, with ``OMNITRACE_PROFILE=ON``, Omnitrace only collects wall-clock
timing values. However, by modifying the ``OMNITRACE_TIMEMORY_COMPONENTS`` setting,
Omnitrace can be configured to
data and resources. By default, with ``ROCPROFSYS_PROFILE=ON``, ROCm Systems Profiler only collects wall-clock
timing values. However, by modifying the ``ROCPROFSYS_TIMEMORY_COMPONENTS`` setting,
ROCm Systems Profiler can be configured to
collect hardware counters, CPU-clock timers, memory usage, context switches, page faults, network statistics,
and much more. Omnitrace can even be used as a dynamic instrumentation vehicle
and much more. ROCm Systems Profiler can even be used as a dynamic instrumentation vehicle
for other third-party profiling
APIs such as `Caliper <https://github.com/LLNL/Caliper>`_ and `LIKWID <https://github.com/RRZE-HPC/likwid>`_.
To leverage this capability, build Omnitrace from source with the CMake
To leverage this capability, build ROCm Systems Profiler from source with the CMake
options ``TIMEMORY_USE_CALIPER=ON`` or ``TIMEMORY_USE_LIKWID=ON`` and then add
``caliper_marker``, ``likwid_marker``, or both to ``OMNITRACE_TIMEMORY_COMPONENTS``.
``caliper_marker``, ``likwid_marker``, or both to ``ROCPROFSYS_TIMEMORY_COMPONENTS``.
To view all possible components and their descriptions:
.. code-block:: shell
omnitrace-avail --components --description
rocprof-sys-avail --components --description
To restrict the output to available components and view the string identifiers for ``OMNITRACE_TIMEMORY_COMPONENTS``:
To restrict the output to available components and view the string identifiers for ``ROCPROFSYS_TIMEMORY_COMPONENTS``:
.. code-block:: shell
omnitrace-avail --components --available --string --brief
rocprof-sys-avail --components --available --string --brief
Exploring hardware counters
-----------------------------------
Omnitrace supports hardware counter collection via PAPI and ROCm.
ROCm Systems Profiler supports hardware counter collection via PAPI and ROCm.
Generally, PAPI is used to collect CPU-based hardware counters and ROCm is used to collect GPU-based hardware
counters. Although it is possible to install PAPI with ROCm support and use it to
collect GPU-based hardware counters, this is not recommended because PAPI
counters. Although it is possible to install PAPI with ROCm support and use it to
collect GPU-based hardware counters, this is not recommended because PAPI
cannot simultaneously collect CPU and GPU hardware counters.
To view all possible hardware counters and their descriptions, use the following command:
.. code-block:: shell
omnitrace-avail --hw-counters --description
rocprof-sys-avail --hw-counters --description
Appending the ``-c CPU`` option restricts the list of hardware counters to
Appending the ``-c CPU`` option restricts the list of hardware counters to
those available through PAPI, while ``-c GPU`` limits the list to those available from ROCm.
Enabling hardware counters
-----------------------------------
PAPI Hardware counters are configured with the ``OMNITRACE_PAPI_EVENTS`` configuration variable.
ROCm Hardware counters are configured with the ``OMNITRACE_ROCM_EVENTS`` configuration variable.
ROCm hardware counters also require the ``OMNITRACE_USE_ROCPROFILER`` configuration
variable to be enabled using ``OMNITRACE_USE_ROCPROFILER=ON``.
PAPI Hardware counters are configured with the ``ROCPROFSYS_PAPI_EVENTS`` configuration variable.
ROCm Hardware counters are configured with the ``ROCPROFSYS_ROCM_EVENTS`` configuration variable.
ROCm hardware counters also require the ``ROCPROFSYS_USE_ROCPROFILER`` configuration
variable to be enabled using ``ROCPROFSYS_USE_ROCPROFILER=ON``.
Here is a sample configuration for hardware counters:
.. code-block:: shell
# using papi identifiers
OMNITRACE_PAPI_EVENTS = PAPI_TOT_CYC PAPI_TOT_INS
ROCPROFSYS_PAPI_EVENTS = PAPI_TOT_CYC PAPI_TOT_INS
# using perf identifiers
OMNITRACE_PAPI_EVENTS = perf::INSTRUCTIONS perf::CACHE-REFERENCES perf::CACHE-MISSES
ROCPROFSYS_PAPI_EVENTS = perf::INSTRUCTIONS perf::CACHE-REFERENCES perf::CACHE-MISSES
.. _omnitrace_papi_events:
.. _rocprof-sys_papi_events:
OMNITRACE_PAPI_EVENTS
ROCPROFSYS_PAPI_EVENTS
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In order to collect the majority of hardware counters via PAPI, ensure the ``/proc/sys/kernel/perf_event_paranoid``
@@ -135,18 +137,18 @@ has a value <= 2. If you have ``sudo`` access, use the following command to modi
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid
However, this value is not retained upon reboot.
Use the following command to preserve this setting after a reboot:
.. code-block:: shell
echo 'kernel.perf_event_paranoid=0' | sudo tee -a /etc/sysctl.conf
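Before changing the setting, it can help to check the current value. The following is a minimal sketch, assuming a Linux system. The ``paranoid_ok`` function and its optional file argument are hypothetical helpers for illustration and are not part of ROCm Systems Profiler.

```shell
# paranoid_ok [FILE]: succeed if the perf_event_paranoid value is <= 2.
# FILE defaults to the kernel's procfs entry; the argument exists only so
# the check can be pointed at another file (hypothetical helper).
paranoid_ok() {
    local file="${1:-/proc/sys/kernel/perf_event_paranoid}"
    local value
    [ -r "$file" ] || return 2      # file is missing or unreadable
    value="$(cat "$file")"
    [ "$value" -le 2 ]
}

if paranoid_ok; then
    echo "perf_event_paranoid is <= 2"
else
    echo "perf_event_paranoid is too restrictive or unavailable"
fi
```

If the check fails, the ``sudo tee`` commands shown above are the way to lower the value.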
PAPI events use a concept similar to a namespace. All specified hardware
counters must be from the same namespace.
For hardware counters starting with the ``PAPI_`` prefix, these are high-level
aggregates of multiple hardware counters.
Otherwise, most events use two or three colons (``::`` or ``:::``) between the
component name and the counter name, for example,
``amd64_rapl::RAPL_ENERGY_PKG`` and ``perf::PERF_COUNT_HW_CPU_CYCLES``.
@@ -154,33 +156,33 @@ For example, the following is a valid configuration:
.. code-block:: shell
OMNITRACE_PAPI_EVENTS = perf::INSTRUCTIONS perf::CACHE-REFERENCES perf::CACHE-MISSES
ROCPROFSYS_PAPI_EVENTS = perf::INSTRUCTIONS perf::CACHE-REFERENCES perf::CACHE-MISSES
However, the following specification of a roughly equivalent set of hardware counters is an incorrect configuration because it mixes
PAPI components from different namespaces:
.. code-block:: shell
OMNITRACE_PAPI_EVENTS = PAPI_TOT_INS perf::CACHE-REFERENCES perf::CACHE-MISSES
ROCPROFSYS_PAPI_EVENTS = PAPI_TOT_INS perf::CACHE-REFERENCES perf::CACHE-MISSES
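The same-namespace rule can be checked before a run. The following is a rough sketch, not the check that ROCm Systems Profiler performs. It assumes that the namespace of an event is either the ``PAPI_`` prefix or the component name before the first colon. The function names ``papi_namespace`` and ``same_namespace`` are hypothetical.

```shell
# papi_namespace EVENT: print the namespace of one event name.
# Events with the PAPI_ prefix are reported as "PAPI"; otherwise the text
# before the first ':' (the component name) is used. (Assumed convention.)
papi_namespace() {
    case "$1" in
        PAPI_*) echo "PAPI" ;;
        *:*)    echo "${1%%:*}" ;;
        *)      echo "unknown" ;;
    esac
}

# same_namespace EVENT...: succeed only if all events share one namespace.
same_namespace() {
    local first="" ns event
    for event in "$@"; do
        ns="$(papi_namespace "$event")"
        if [ -z "$first" ]; then
            first="$ns"
        elif [ "$ns" != "$first" ]; then
            return 1
        fi
    done
    return 0
}

same_namespace perf::INSTRUCTIONS perf::CACHE-REFERENCES perf::CACHE-MISSES \
    && echo "valid: single namespace"
same_namespace PAPI_TOT_INS perf::CACHE-REFERENCES perf::CACHE-MISSES \
    || echo "invalid: mixed namespaces"
```

The two calls at the end mirror the valid and invalid configurations shown above.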
.. note::
If Omnitrace was configured with the default ``OMNITRACE_BUILD_PAPI=ON`` setting,
If ROCm Systems Profiler was configured with the default ``ROCPROFSYS_BUILD_PAPI=ON`` setting,
standard PAPI command-line tools such as
``papi_avail`` and ``papi_event_chooser`` are not able to provide information
about the PAPI library used by Omnitrace
(because Omnitrace statically links to ``libpapi``). However, all of these tools are
installed with the prefix ``omnitrace-`` with
underscores replaced with hyphens, for example ``papi_avail`` becomes ``omnitrace-papi-avail``.
``papi_avail`` and ``papi_event_chooser`` are not able to provide information
about the PAPI library used by ROCm Systems Profiler
(because ROCm Systems Profiler statically links to ``libpapi``). However, all of these tools are
installed with the prefix ``rocprof-sys-`` with
underscores replaced with hyphens, for example ``papi_avail`` becomes ``rocprof-sys-papi-avail``.
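The renaming rule in the note can be written as a one-line transformation. This is an illustrative sketch only. ``prefixed_papi_tool`` is a hypothetical helper, and the second name below follows from the stated rule rather than from a checked installation.

```shell
# prefixed_papi_tool NAME: map a standard PAPI tool name to the prefixed
# form described in the note ("rocprof-sys-" prefix, '_' becomes '-').
prefixed_papi_tool() {
    printf 'rocprof-sys-%s\n' "$(printf '%s' "$1" | tr '_' '-')"
}

prefixed_papi_tool papi_avail
prefixed_papi_tool papi_event_chooser
```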
OMNITRACE_ROCM_EVENTS
ROCPROFSYS_ROCM_EVENTS
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Omnitrace reads the ROCm events from the ``${ROCM_PATH}/lib/rocprofiler/metrics.xml``
ROCm Systems Profiler reads the ROCm events from the ``${ROCM_PATH}/lib/rocprofiler/metrics.xml``
file. Use the ``ROCP_METRICS`` environment
variable to point Omnitrace to a different XML metrics file, for example,
variable to point ROCm Systems Profiler to a different XML metrics file, for example,
``export ROCP_METRICS=${PWD}/custom_metrics.xml``.
``omnitrace-avail -H -c GPU`` shows event names with a suffix of ``:device=N``
``rocprof-sys-avail -H -c GPU`` shows event names with a suffix of ``:device=N``
where ``N`` is the device number.
For example, if you have two devices, the output is:
@@ -190,7 +192,7 @@ For example, if you have two devices, the output is:
...
| Wavefronts:device=1 | Derived counter: SQ_WAVES |
To collect the event on all devices, specify the event,
such as ``Wavefronts``, without the ``:device=`` suffix.
To collect the event only on specific devices, use the ``:device=`` suffix.
@@ -202,12 +204,12 @@ The following example:
.. code-block:: shell
OMNITRACE_ROCM_EVENTS = GPUBusy SQ_WAVES:device=0 SQ_INSTS_VALU:device=1
ROCPROFSYS_ROCM_EVENTS = GPUBusy SQ_WAVES:device=0 SQ_INSTS_VALU:device=1
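To make the ``:device=`` convention concrete, the following sketch reports which devices each entry in such a list targets. It only illustrates the naming rule described above. ``describe_rocm_event`` is a hypothetical helper and does not query any hardware.

```shell
# describe_rocm_event EVENT: report which devices an event entry targets,
# based only on the presence of the ":device=N" suffix.
describe_rocm_event() {
    case "$1" in
        *:device=*) echo "${1%%:device=*} -> device ${1##*:device=}" ;;
        *)          echo "$1 -> all devices" ;;
    esac
}

for event in GPUBusy SQ_WAVES:device=0 SQ_INSTS_VALU:device=1; do
    describe_rocm_event "$event"
done
```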
omnitrace-avail examples
rocprof-sys-avail examples
-----------------------------------
The following examples demonstrate how to use ``omnitrace-avail`` to perform several common
The following examples demonstrate how to use ``rocprof-sys-avail`` to perform several common
configuration tasks.
Generating a default configuration file
@@ -215,96 +217,96 @@ Generating a default configuration file
.. code-block:: shell
$ omnitrace-avail -G ~/.omnitrace.cfg
[omnitrace-avail] Outputting text configuration file '/home/user/.omnitrace.cfg'...
$ cat ~/.omnitrace.cfg
# auto-generated by omnitrace-avail (version 1.2.0) on 2022-06-27 @ 19:15
$ rocprof-sys-avail -G ~/.rocprof-sys.cfg
[rocprof-sys-avail] Outputting text configuration file '/home/user/.rocprof-sys.cfg'...
$ cat ~/.rocprof-sys.cfg
# auto-generated by rocprof-sys-avail (version 1.2.0) on 2022-06-27 @ 19:15
OMNITRACE_CONFIG_FILE =
OMNITRACE_MODE = trace
OMNITRACE_TRACE = true
OMNITRACE_PROFILE = false
OMNITRACE_USE_SAMPLING = false
OMNITRACE_USE_PROCESS_SAMPLING = true
OMNITRACE_USE_ROCTRACER = true
OMNITRACE_USE_ROCM_SMI = true
OMNITRACE_USE_KOKKOSP = false
OMNITRACE_USE_CODE_COVERAGE = false
OMNITRACE_USE_PID = true
OMNITRACE_OUTPUT_PATH = omnitrace-%tag%-output
OMNITRACE_OUTPUT_PREFIX =
OMNITRACE_CI = false
OMNITRACE_THREAD_POOL_SIZE = 8
OMNITRACE_DEBUG = false
OMNITRACE_DL_VERBOSE = 0
OMNITRACE_INSTRUMENTATION_INTERVAL = 1
OMNITRACE_KOKKOSP_KERNEL_LOGGER = false
OMNITRACE_PAPI_EVENTS = PAPI_TOT_CYC
OMNITRACE_PERFETTO_BACKEND = inprocess
OMNITRACE_PERFETTO_BUFFER_SIZE_KB = 1024000
OMNITRACE_PERFETTO_COMBINE_TRACES = false
OMNITRACE_PERFETTO_FILE = perfetto-trace.proto
OMNITRACE_PERFETTO_FILL_POLICY = discard
OMNITRACE_PERFETTO_SHMEM_SIZE_HINT_KB = 4096
OMNITRACE_ROCTRACER_HSA_ACTIVITY = false
OMNITRACE_ROCTRACER_HSA_API = false
OMNITRACE_ROCTRACER_HSA_API_TYPES =
OMNITRACE_SAMPLING_CPUS =
OMNITRACE_SAMPLING_DELAY = 0.5
OMNITRACE_SAMPLING_FREQ = 10
OMNITRACE_SAMPLING_GPUS = all
OMNITRACE_TIME_OUTPUT = true
OMNITRACE_TIMEMORY_COMPONENTS = wall_clock
OMNITRACE_TRACE_THREAD_LOCKS = false
OMNITRACE_VERBOSE = 0
OMNITRACE_COLLAPSE_PROCESSES = false
OMNITRACE_COLLAPSE_THREADS = false
OMNITRACE_COUT_OUTPUT = false
OMNITRACE_CPU_AFFINITY = false
OMNITRACE_DIFF_OUTPUT = false
OMNITRACE_ENABLE_SIGNAL_HANDLER = true
OMNITRACE_ENABLED = true
OMNITRACE_FILE_OUTPUT = true
OMNITRACE_FLAT_PROFILE = false
OMNITRACE_INPUT_EXTENSIONS = json,xml
OMNITRACE_INPUT_PATH =
OMNITRACE_INPUT_PREFIX =
OMNITRACE_JSON_OUTPUT = true
OMNITRACE_MAX_DEPTH = 65535
OMNITRACE_MAX_WIDTH = 120
OMNITRACE_MEMORY_PRECISION = -1
OMNITRACE_MEMORY_SCIENTIFIC = false
OMNITRACE_MEMORY_UNITS = MB
OMNITRACE_MEMORY_WIDTH = -1
OMNITRACE_NETWORK_INTERFACE =
OMNITRACE_NODE_COUNT = 0
OMNITRACE_PAPI_FAIL_ON_ERROR = false
OMNITRACE_PAPI_MULTIPLEXING = false
OMNITRACE_PAPI_OVERFLOW = 0
OMNITRACE_PAPI_QUIET = false
OMNITRACE_PAPI_THREADING = true
OMNITRACE_PRECISION = -1
OMNITRACE_SCIENTIFIC = false
OMNITRACE_STRICT_CONFIG = true
OMNITRACE_SUPPRESS_CONFIG = true
OMNITRACE_SUPPRESS_PARSING = true
OMNITRACE_TEXT_OUTPUT = true
OMNITRACE_TIME_FORMAT = %F_%H.%M
OMNITRACE_TIMELINE_PROFILE = false
OMNITRACE_TIMING_PRECISION = 6
OMNITRACE_TIMING_SCIENTIFIC = false
OMNITRACE_TIMING_UNITS = sec
OMNITRACE_TIMING_WIDTH = -1
OMNITRACE_TREE_OUTPUT = true
OMNITRACE_WIDTH = -1
ROCPROFSYS_CONFIG_FILE =
ROCPROFSYS_MODE = trace
ROCPROFSYS_TRACE = true
ROCPROFSYS_PROFILE = false
ROCPROFSYS_USE_SAMPLING = false
ROCPROFSYS_USE_PROCESS_SAMPLING = true
ROCPROFSYS_USE_ROCTRACER = true
ROCPROFSYS_USE_ROCM_SMI = true
ROCPROFSYS_USE_KOKKOSP = false
ROCPROFSYS_USE_CODE_COVERAGE = false
ROCPROFSYS_USE_PID = true
ROCPROFSYS_OUTPUT_PATH = rocprof-sys-%tag%-output
ROCPROFSYS_OUTPUT_PREFIX =
ROCPROFSYS_CI = false
ROCPROFSYS_THREAD_POOL_SIZE = 8
ROCPROFSYS_DEBUG = false
ROCPROFSYS_DL_VERBOSE = 0
ROCPROFSYS_INSTRUMENTATION_INTERVAL = 1
ROCPROFSYS_KOKKOSP_KERNEL_LOGGER = false
ROCPROFSYS_PAPI_EVENTS = PAPI_TOT_CYC
ROCPROFSYS_PERFETTO_BACKEND = inprocess
ROCPROFSYS_PERFETTO_BUFFER_SIZE_KB = 1024000
ROCPROFSYS_PERFETTO_COMBINE_TRACES = false
ROCPROFSYS_PERFETTO_FILE = perfetto-trace.proto
ROCPROFSYS_PERFETTO_FILL_POLICY = discard
ROCPROFSYS_PERFETTO_SHMEM_SIZE_HINT_KB = 4096
ROCPROFSYS_ROCTRACER_HSA_ACTIVITY = false
ROCPROFSYS_ROCTRACER_HSA_API = false
ROCPROFSYS_ROCTRACER_HSA_API_TYPES =
ROCPROFSYS_SAMPLING_CPUS =
ROCPROFSYS_SAMPLING_DELAY = 0.5
ROCPROFSYS_SAMPLING_FREQ = 10
ROCPROFSYS_SAMPLING_GPUS = all
ROCPROFSYS_TIME_OUTPUT = true
ROCPROFSYS_TIMEMORY_COMPONENTS = wall_clock
ROCPROFSYS_TRACE_THREAD_LOCKS = false
ROCPROFSYS_VERBOSE = 0
ROCPROFSYS_COLLAPSE_PROCESSES = false
ROCPROFSYS_COLLAPSE_THREADS = false
ROCPROFSYS_COUT_OUTPUT = false
ROCPROFSYS_CPU_AFFINITY = false
ROCPROFSYS_DIFF_OUTPUT = false
ROCPROFSYS_ENABLE_SIGNAL_HANDLER = true
ROCPROFSYS_ENABLED = true
ROCPROFSYS_FILE_OUTPUT = true
ROCPROFSYS_FLAT_PROFILE = false
ROCPROFSYS_INPUT_EXTENSIONS = json,xml
ROCPROFSYS_INPUT_PATH =
ROCPROFSYS_INPUT_PREFIX =
ROCPROFSYS_JSON_OUTPUT = true
ROCPROFSYS_MAX_DEPTH = 65535
ROCPROFSYS_MAX_WIDTH = 120
ROCPROFSYS_MEMORY_PRECISION = -1
ROCPROFSYS_MEMORY_SCIENTIFIC = false
ROCPROFSYS_MEMORY_UNITS = MB
ROCPROFSYS_MEMORY_WIDTH = -1
ROCPROFSYS_NETWORK_INTERFACE =
ROCPROFSYS_NODE_COUNT = 0
ROCPROFSYS_PAPI_FAIL_ON_ERROR = false
ROCPROFSYS_PAPI_MULTIPLEXING = false
ROCPROFSYS_PAPI_OVERFLOW = 0
ROCPROFSYS_PAPI_QUIET = false
ROCPROFSYS_PAPI_THREADING = true
ROCPROFSYS_PRECISION = -1
ROCPROFSYS_SCIENTIFIC = false
ROCPROFSYS_STRICT_CONFIG = true
ROCPROFSYS_SUPPRESS_CONFIG = true
ROCPROFSYS_SUPPRESS_PARSING = true
ROCPROFSYS_TEXT_OUTPUT = true
ROCPROFSYS_TIME_FORMAT = %F_%H.%M
ROCPROFSYS_TIMELINE_PROFILE = false
ROCPROFSYS_TIMING_PRECISION = 6
ROCPROFSYS_TIMING_SCIENTIFIC = false
ROCPROFSYS_TIMING_UNITS = sec
ROCPROFSYS_TIMING_WIDTH = -1
ROCPROFSYS_TREE_OUTPUT = true
ROCPROFSYS_WIDTH = -1
When creating a new configuration file, the following recommendations apply:
* Use the ``--all`` option to view all descriptions, choices, and other information in the configuration file.
* To create a new configuration without inheriting from an existing ``${HOME}/.omnitrace.cfg`` file,
set ``OMNITRACE_SUPPRESS_CONFIG=ON`` in the environment beforehand.
* To create a new configuration without inheriting from an existing ``${HOME}/.rocprof-sys.cfg`` file,
set ``ROCPROFSYS_SUPPRESS_CONFIG=ON`` in the environment beforehand.
* To create a new configuration that makes minor changes to an existing configuration,
set ``OMNITRACE_CONFIG_FILE=/path/to/existing/file`` and define the changes as environment
set ``ROCPROFSYS_CONFIG_FILE=/path/to/existing/file`` and define the changes as environment
variables before generating it.
Viewing the setting descriptions
@@ -312,89 +314,89 @@ Viewing the setting descriptions
.. code-block:: shell
$ omnitrace-avail -S -bd
$ rocprof-sys-avail -S -bd
|-----------------------------------------|-----------------------------------------|
| ENVIRONMENT VARIABLE | DESCRIPTION |
|-----------------------------------------|-----------------------------------------|
| OMNITRACE_CI | Enable some runtime validation check... |
| OMNITRACE_ADD_SECONDARY | Enable/disable components adding sec... |
| OMNITRACE_COLLAPSE_PROCESSES | Enable/disable combining process-spe... |
| OMNITRACE_COLLAPSE_THREADS | Enable/disable combining thread-spec... |
| OMNITRACE_CONFIG_FILE | Configuration file for omnitrace |
| OMNITRACE_COUT_OUTPUT | Write output to stdout |
| OMNITRACE_CPU_AFFINITY | Enable pinning threads to CPUs (Linu... |
| OMNITRACE_THREAD_POOL_SIZE | Number of threads to use when genera... |
| OMNITRACE_DEBUG | Enable debug output |
| OMNITRACE_DIFF_OUTPUT | Generate a difference output vs. a p... |
| OMNITRACE_DL_VERBOSE | Verbosity within the omnitrace-dl li... |
| OMNITRACE_ENABLED | Activation state of timemory |
| OMNITRACE_ENABLE_SIGNAL_HANDLER | Enable signals in timemory_init |
| OMNITRACE_FILE_OUTPUT | Write output to files |
| OMNITRACE_FLAT_PROFILE | Set the label hierarchy mode to defa... |
| OMNITRACE_INPUT_EXTENSIONS | File extensions used when searching ... |
| OMNITRACE_INPUT_PATH | Explicitly specify the input folder ... |
| OMNITRACE_INPUT_PREFIX | Explicitly specify the prefix for in... |
| OMNITRACE_INSTRUMENTATION_INTERVAL | Instrumentation only takes measureme... |
| OMNITRACE_JSON_OUTPUT | Write json output files |
| OMNITRACE_KOKKOSP_KERNEL_LOGGER | Enables kernel logging |
| OMNITRACE_MAX_DEPTH | Set the maximum depth of label hiera... |
| OMNITRACE_MAX_THREAD_BOOKMARKS | Maximum number of times a worker thr... |
| OMNITRACE_MAX_WIDTH | Set the maximum width for component ... |
| OMNITRACE_MEMORY_PRECISION | Set the precision for components wit... |
| OMNITRACE_MEMORY_SCIENTIFIC | Set the numerical reporting format f... |
| OMNITRACE_MEMORY_UNITS | Set the units for components with u... |
| OMNITRACE_MEMORY_WIDTH | Set the output width for components ... |
| OMNITRACE_NETWORK_INTERFACE | Default network interface |
| OMNITRACE_NODE_COUNT | Total number of nodes used in applic... |
| OMNITRACE_OUTPUT_FILE | Perfetto filename |
| OMNITRACE_OUTPUT_PATH | Explicitly specify the output folder... |
| OMNITRACE_OUTPUT_PREFIX | Explicitly specify a prefix for all ... |
| OMNITRACE_PAPI_EVENTS | PAPI presets and events to collect (... |
| OMNITRACE_PAPI_FAIL_ON_ERROR | Configure PAPI errors to trigger a r... |
| OMNITRACE_PAPI_MULTIPLEXING | Enable multiplexing when using PAPI |
| OMNITRACE_PAPI_OVERFLOW | Value at which PAPI hw counters trig... |
| OMNITRACE_PAPI_QUIET | Configure suppression of reporting P... |
| OMNITRACE_PAPI_THREADING | Enable multithreading support when u... |
| OMNITRACE_PERFETTO_BACKEND | Specify the perfetto backend to acti... |
| OMNITRACE_PERFETTO_BUFFER_SIZE_KB | Size of perfetto buffer (in KB) |
| OMNITRACE_PERFETTO_COMBINE_TRACES | Combine Perfetto traces. If not expl... |
| OMNITRACE_PERFETTO_FILL_POLICY | Behavior when perfetto buffer is ful... |
| OMNITRACE_PERFETTO_SHMEM_SIZE_HINT_KB | Hint for shared-memory buffer size i... |
| OMNITRACE_PRECISION | Set the global output precision for ... |
| OMNITRACE_ROCTRACER_HSA_ACTIVITY | Enable HSA activity tracing support |
| OMNITRACE_ROCTRACER_HSA_API | Enable HSA API tracing support |
| OMNITRACE_ROCTRACER_HSA_API_TYPES | HSA API type to collect |
| OMNITRACE_SAMPLING_CPUS | CPUs to collect frequency informatio... |
| OMNITRACE_SAMPLING_DELAY | Number of seconds to wait before the... |
| OMNITRACE_SAMPLING_FREQ | Number of software interrupts per se... |
| OMNITRACE_SAMPLING_GPUS | Devices to query when OMNITRACE_USE_... |
| OMNITRACE_SCIENTIFIC | Set the global numerical reporting t... |
| OMNITRACE_STRICT_CONFIG | Throw errors for unknown setting nam... |
| OMNITRACE_SUPPRESS_CONFIG | Disable processing of setting config... |
| OMNITRACE_SUPPRESS_PARSING | Disable parsing environment |
| OMNITRACE_TEXT_OUTPUT | Write text output files |
| OMNITRACE_TIMELINE_PROFILE | Set the label hierarchy mode to defa... |
| OMNITRACE_TIMEMORY_COMPONENTS | List of components to collect via ti... |
| OMNITRACE_TIME_FORMAT | Customize the folder generation when... |
| OMNITRACE_TIME_OUTPUT | Output data to subfolder w/ a timest... |
| OMNITRACE_TIMING_PRECISION | Set the precision for components wit... |
| OMNITRACE_TIMING_SCIENTIFIC | Set the numerical reporting format f... |
| OMNITRACE_TIMING_UNITS | Set the units for components with u... |
| OMNITRACE_TIMING_WIDTH | Set the output width for components ... |
| OMNITRACE_TRACE_THREAD_LOCKS | Enable tracking calls to pthread_mut... |
| OMNITRACE_TREE_OUTPUT | Write hierarchical json output files |
| OMNITRACE_USE_CODE_COVERAGE | Enable support for code coverage |
| OMNITRACE_USE_KOKKOSP | Enable support for Kokkos Tools |
| OMNITRACE_USE_OMPT | Enable support for OpenMP-Tools |
| OMNITRACE_TRACE | Enable perfetto backend |
| OMNITRACE_USE_PID | Enable tagging filenames with proces... |
| OMNITRACE_USE_ROCM_SMI | Enable sampling GPU power, temp, uti... |
| OMNITRACE_USE_ROCTRACER | Enable ROCM tracing |
| OMNITRACE_USE_SAMPLING | Enable statistical sampling of call-... |
| OMNITRACE_USE_PROCESS_SAMPLING | Enable a background thread which sam... |
| OMNITRACE_PROFILE | Enable timemory backend |
| OMNITRACE_VERBOSE | Verbosity level |
| OMNITRACE_WIDTH | Set the global output width for comp... |
| ROCPROFSYS_CI | Enable some runtime validation check... |
| ROCPROFSYS_ADD_SECONDARY | Enable/disable components adding sec... |
| ROCPROFSYS_COLLAPSE_PROCESSES | Enable/disable combining process-spe... |
| ROCPROFSYS_COLLAPSE_THREADS | Enable/disable combining thread-spec... |
| ROCPROFSYS_CONFIG_FILE | Configuration file for rocprof-sys |
| ROCPROFSYS_COUT_OUTPUT | Write output to stdout |
| ROCPROFSYS_CPU_AFFINITY | Enable pinning threads to CPUs (Linu... |
| ROCPROFSYS_THREAD_POOL_SIZE | Number of threads to use when genera... |
| ROCPROFSYS_DEBUG | Enable debug output |
| ROCPROFSYS_DIFF_OUTPUT | Generate a difference output vs. a p... |
| ROCPROFSYS_DL_VERBOSE | Verbosity within the rocprof-sys-dl ... |
| ROCPROFSYS_ENABLED | Activation state of timemory |
| ROCPROFSYS_ENABLE_SIGNAL_HANDLER | Enable signals in timemory_init |
| ROCPROFSYS_FILE_OUTPUT | Write output to files |
| ROCPROFSYS_FLAT_PROFILE | Set the label hierarchy mode to defa... |
| ROCPROFSYS_INPUT_EXTENSIONS | File extensions used when searching ... |
| ROCPROFSYS_INPUT_PATH | Explicitly specify the input folder ... |
| ROCPROFSYS_INPUT_PREFIX | Explicitly specify the prefix for in... |
| ROCPROFSYS_INSTRUMENTATION_INTERVAL | Instrumentation only takes measureme... |
| ROCPROFSYS_JSON_OUTPUT | Write json output files |
| ROCPROFSYS_KOKKOSP_KERNEL_LOGGER | Enables kernel logging |
| ROCPROFSYS_MAX_DEPTH | Set the maximum depth of label hiera... |
| ROCPROFSYS_MAX_THREAD_BOOKMARKS | Maximum number of times a worker thr... |
| ROCPROFSYS_MAX_WIDTH | Set the maximum width for component ... |
| ROCPROFSYS_MEMORY_PRECISION | Set the precision for components wit... |
| ROCPROFSYS_MEMORY_SCIENTIFIC | Set the numerical reporting format f... |
| ROCPROFSYS_MEMORY_UNITS | Set the units for components with u... |
| ROCPROFSYS_MEMORY_WIDTH | Set the output width for components ... |
| ROCPROFSYS_NETWORK_INTERFACE | Default network interface |
| ROCPROFSYS_NODE_COUNT | Total number of nodes used in applic... |
| ROCPROFSYS_OUTPUT_FILE | Perfetto filename |
| ROCPROFSYS_OUTPUT_PATH | Explicitly specify the output folder... |
| ROCPROFSYS_OUTPUT_PREFIX | Explicitly specify a prefix for all ... |
| ROCPROFSYS_PAPI_EVENTS | PAPI presets and events to collect (... |
| ROCPROFSYS_PAPI_FAIL_ON_ERROR | Configure PAPI errors to trigger a r... |
| ROCPROFSYS_PAPI_MULTIPLEXING | Enable multiplexing when using PAPI |
| ROCPROFSYS_PAPI_OVERFLOW | Value at which PAPI hw counters trig... |
| ROCPROFSYS_PAPI_QUIET | Configure suppression of reporting P... |
| ROCPROFSYS_PAPI_THREADING | Enable multithreading support when u... |
| ROCPROFSYS_PERFETTO_BACKEND | Specify the perfetto backend to acti... |
| ROCPROFSYS_PERFETTO_BUFFER_SIZE_KB | Size of perfetto buffer (in KB) |
| ROCPROFSYS_PERFETTO_COMBINE_TRACES | Combine Perfetto traces. If not expl... |
| ROCPROFSYS_PERFETTO_FILL_POLICY | Behavior when perfetto buffer is ful... |
| ROCPROFSYS_PERFETTO_SHMEM_SIZE_HINT_KB | Hint for shared-memory buffer size i... |
| ROCPROFSYS_PRECISION | Set the global output precision for ... |
| ROCPROFSYS_ROCTRACER_HSA_ACTIVITY | Enable HSA activity tracing support |
| ROCPROFSYS_ROCTRACER_HSA_API | Enable HSA API tracing support |
| ROCPROFSYS_ROCTRACER_HSA_API_TYPES | HSA API type to collect |
| ROCPROFSYS_SAMPLING_CPUS | CPUs to collect frequency informatio... |
| ROCPROFSYS_SAMPLING_DELAY | Number of seconds to wait before the... |
| ROCPROFSYS_SAMPLING_FREQ | Number of software interrupts per se... |
| ROCPROFSYS_SAMPLING_GPUS | Devices to query when ROCPROFSYS_USE_... |
| ROCPROFSYS_SCIENTIFIC | Set the global numerical reporting t... |
| ROCPROFSYS_STRICT_CONFIG | Throw errors for unknown setting nam... |
| ROCPROFSYS_SUPPRESS_CONFIG | Disable processing of setting config... |
| ROCPROFSYS_SUPPRESS_PARSING | Disable parsing environment |
| ROCPROFSYS_TEXT_OUTPUT | Write text output files |
| ROCPROFSYS_TIMELINE_PROFILE | Set the label hierarchy mode to defa... |
| ROCPROFSYS_TIMEMORY_COMPONENTS | List of components to collect via ti... |
| ROCPROFSYS_TIME_FORMAT | Customize the folder generation when... |
| ROCPROFSYS_TIME_OUTPUT | Output data to subfolder w/ a timest... |
| ROCPROFSYS_TIMING_PRECISION | Set the precision for components wit... |
| ROCPROFSYS_TIMING_SCIENTIFIC | Set the numerical reporting format f... |
| ROCPROFSYS_TIMING_UNITS | Set the units for components with u... |
| ROCPROFSYS_TIMING_WIDTH | Set the output width for components ... |
| ROCPROFSYS_TRACE_THREAD_LOCKS | Enable tracking calls to pthread_mut... |
| ROCPROFSYS_TREE_OUTPUT | Write hierarchical json output files |
| ROCPROFSYS_USE_CODE_COVERAGE | Enable support for code coverage |
| ROCPROFSYS_USE_KOKKOSP | Enable support for Kokkos Tools |
| ROCPROFSYS_USE_OMPT | Enable support for OpenMP-Tools |
| ROCPROFSYS_TRACE | Enable perfetto backend |
| ROCPROFSYS_USE_PID | Enable tagging filenames with proces... |
| ROCPROFSYS_USE_ROCM_SMI | Enable sampling GPU power, temp, uti... |
| ROCPROFSYS_USE_ROCTRACER | Enable ROCM tracing |
| ROCPROFSYS_USE_SAMPLING | Enable statistical sampling of call-... |
| ROCPROFSYS_USE_PROCESS_SAMPLING | Enable a background thread which sam... |
| ROCPROFSYS_PROFILE | Enable timemory backend |
| ROCPROFSYS_VERBOSE | Verbosity level |
| ROCPROFSYS_WIDTH | Set the global output width for comp... |
|-----------------------------------------|-----------------------------------------|
Viewing components
@@ -402,7 +404,7 @@ Viewing components
.. code-block:: shell
$ omnitrace-avail -C -bd
$ rocprof-sys-avail -C -bd
|-----------------------------------|----------------------------------------------|
| COMPONENT | DESCRIPTION |
|-----------------------------------|----------------------------------------------|
@@ -460,7 +462,7 @@ Viewing components
| wall_clock | Real-clock timer (i.e. wall-clock timer). |
| written_bytes | Number of bytes sent to the storage layer. |
| written_char | Number of bytes which this task has cause... |
| omnitrace | Invokes instrumentation functions omnitr... |
| rocprof-sys | Invokes instrumentation functions rocprof... |
| roctracer | High-precision ROCm API and kernel tracing. |
| sampling_wall_clock | Wall-clock timing. Derived from statistic... |
| sampling_cpu_clock | CPU-clock timing. Derived from statistica... |
@@ -476,7 +478,7 @@ Viewing hardware counters
.. code-block:: shell
$ omnitrace-avail -H -bd
$ rocprof-sys-avail -H -bd
|---------------------------------------|---------------------------------------|
| HARDWARE COUNTER | DESCRIPTION |
|---------------------------------------|---------------------------------------|
@@ -1197,17 +1199,17 @@ Viewing hardware counters
Creating a configuration file
========================================
Omnitrace supports three configuration file formats: JSON, XML, and plain text.
Use ``omnitrace-avail -G <filename> -F txt json xml`` to generate default
ROCm Systems Profiler supports three configuration file formats: JSON, XML, and plain text.
Use ``rocprof-sys-avail -G <filename> -F txt json xml`` to generate default
configuration files in each format. Optionally
include the ``--all`` flag to include full descriptions and other information.
Configuration files are specified by the ``OMNITRACE_CONFIG_FILE`` environment variable
which by default looks for ``${HOME}/.omnitrace.cfg`` and ``${HOME}/.omnitrace.json``.
Configuration files are specified by the ``ROCPROFSYS_CONFIG_FILE`` environment variable
which by default looks for ``${HOME}/.rocprof-sys.cfg`` and ``${HOME}/.rocprof-sys.json``.
Multiple configuration files can be concatenated using the ``:`` symbol, for example:
.. code-block:: shell
export OMNITRACE_CONFIG_FILE=~/.config/omnitrace.cfg:~/.config/omnitrace.json
export ROCPROFSYS_CONFIG_FILE=~/.config/rocprof-sys.cfg:~/.config/rocprof-sys.json
If a configuration variable is specified in both a configuration file and in the environment,
the environment variable takes precedence.
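The precedence rule can be illustrated with a small sketch. It is not how ROCm Systems Profiler parses its configuration. It assumes a plain-text file with ``NAME = VALUE`` lines and that the last matching line wins. ``resolve_setting`` is a hypothetical helper.

```shell
# resolve_setting NAME FILE: print the effective value of NAME.
# An environment variable wins; otherwise the last "NAME = VALUE" line in
# the text configuration file FILE is used (assumed behavior).
resolve_setting() {
    local name="$1" file="$2" env_value
    env_value="$(printenv "$name")" && { echo "$env_value"; return 0; }
    awk -v key="$name" '
        {
            k = $0; sub(/[ \t]*=.*/, "", k); sub(/^[ \t]+/, "", k)
            if (k == key && index($0, "=") > 0) {
                v = $0; sub(/^[^=]*=[ \t]*/, "", v); sub(/[ \t]+$/, "", v)
                found = 1
            }
        }
        END { if (!found) exit 1; print v }
    ' "$file"
}

cfg="$(mktemp)"
printf 'ROCPROFSYS_VERBOSE = 1\n' > "$cfg"
# value comes from the file when the variable is not in the environment
( unset ROCPROFSYS_VERBOSE; resolve_setting ROCPROFSYS_VERBOSE "$cfg" )
# value comes from the environment when the variable is exported
( export ROCPROFSYS_VERBOSE=3; resolve_setting ROCPROFSYS_VERBOSE "$cfg" )
rm -f "$cfg"
```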
@@ -1220,7 +1222,7 @@ Variables are created when an lvalue starts with a ``$`` and are
de-referenced when they appear as rvalues.
Entries in the text configuration file which do not match a known setting
in ``omnitrace-avail`` but are prefixed with ``OMNITRACE_`` are interpreted as
in ``rocprof-sys-avail`` but are prefixed with ``ROCPROFSYS_`` are interpreted as
environment variables. They are exported via ``setenv``
but do not override an existing value for the environment variable.
@@ -1231,35 +1233,35 @@ but do not override an existing value for the environment variable.
$SAMPLE = OFF
# use fields
OMNITRACE_TRACE = $ENABLE
OMNITRACE_PROFILE = $ENABLE
OMNITRACE_USE_SAMPLING = $SAMPLE
OMNITRACE_USE_PROCESS_SAMPLING = $SAMPLE
ROCPROFSYS_TRACE = $ENABLE
ROCPROFSYS_PROFILE = $ENABLE
ROCPROFSYS_USE_SAMPLING = $SAMPLE
ROCPROFSYS_USE_PROCESS_SAMPLING = $SAMPLE
# debug
OMNITRACE_DEBUG = OFF
OMNITRACE_VERBOSE = 1
ROCPROFSYS_DEBUG = OFF
ROCPROFSYS_VERBOSE = 1
# output fields
OMNITRACE_OUTPUT_PATH = omnitrace-output
OMNITRACE_OUTPUT_PREFIX = %tag%/
OMNITRACE_TIME_OUTPUT = OFF
OMNITRACE_USE_PID = OFF
ROCPROFSYS_OUTPUT_PATH = rocprof-sys-output
ROCPROFSYS_OUTPUT_PREFIX = %tag%/
ROCPROFSYS_TIME_OUTPUT = OFF
ROCPROFSYS_USE_PID = OFF
# timemory fields
OMNITRACE_PAPI_EVENTS = PAPI_TOT_INS PAPI_FP_INS
OMNITRACE_TIMEMORY_COMPONENTS = wall_clock peak_rss trip_count
OMNITRACE_MEMORY_UNITS = MB
OMNITRACE_TIMING_UNITS = sec
ROCPROFSYS_PAPI_EVENTS = PAPI_TOT_INS PAPI_FP_INS
ROCPROFSYS_TIMEMORY_COMPONENTS = wall_clock peak_rss trip_count
ROCPROFSYS_MEMORY_UNITS = MB
ROCPROFSYS_TIMING_UNITS = sec
# sampling fields
OMNITRACE_SAMPLING_FREQ = 50
OMNITRACE_SAMPLING_DELAY = 0.1
OMNITRACE_SAMPLING_CPUS = 0-3
OMNITRACE_SAMPLING_GPUS = $env:HIP_VISIBLE_DEVICES
ROCPROFSYS_SAMPLING_FREQ = 50
ROCPROFSYS_SAMPLING_DELAY = 0.1
ROCPROFSYS_SAMPLING_CPUS = 0-3
ROCPROFSYS_SAMPLING_GPUS = $env:HIP_VISIBLE_DEVICES
# misc env variables (see metadata JSON file after run)
$env:OMNITRACE_SAMPLING_KEEP_DYNINST_SUFFIX = OFF
$env:ROCPROFSYS_SAMPLING_KEEP_DYNINST_SUFFIX = OFF
Sample JSON configuration file
-----------------------------------
@@ -1269,9 +1271,9 @@ The full JSON specification for a configuration value contains a lot of informat
.. code-block:: json
{
"omnitrace": {
"rocprof-sys": {
"settings": {
"OMNITRACE_ADD_SECONDARY": {
"ROCPROFSYS_ADD_SECONDARY": {
"count": -1,
"name": "add_secondary",
"data_type": "bool",
@@ -1279,9 +1281,9 @@ The full JSON specification for a configuration value contains a lot of informat
"value": true,
"max_count": 1,
"cmdline": [
"--omnitrace-add-secondary"
"--rocprof-sys-add-secondary"
],
"environ": "OMNITRACE_ADD_SECONDARY",
"environ": "ROCPROFSYS_ADD_SECONDARY",
"cereal_class_version": 1,
"categories": [
"component",
@@ -1294,15 +1296,15 @@ The full JSON specification for a configuration value contains a lot of informat
}
}
However, when writing a JSON configuration file, the following example is minimally acceptable
for ``OMNITRACE_ADD_SECONDARY``:
However, when writing a JSON configuration file, the following example is minimally acceptable
for ``ROCPROFSYS_ADD_SECONDARY``:
.. code-block:: json
{
"omnitrace": {
"rocprof-sys": {
"settings": {
"OMNITRACE_ADD_SECONDARY": {
"ROCPROFSYS_ADD_SECONDARY": {
"value": true
}
}
@@ -1318,19 +1320,19 @@ The full XML specification for a configuration value contains the same informati
<?xml version="1.0" encoding="utf-8"?>
<timemory_xml>
<omnitrace>
<rocprofiler-systems>
<settings>
<cereal_class_version>2</cereal_class_version>
<!-- Full setting specification -->
<OMNITRACE_ADD_SECONDARY>
<ROCPROFSYS_ADD_SECONDARY>
<cereal_class_version>1</cereal_class_version>
<name>add_secondary</name>
<environ>OMNITRACE_ADD_SECONDARY</environ>
<environ>ROCPROFSYS_ADD_SECONDARY</environ>
<description>...</description>
<count>-1</count>
<max_count>1</max_count>
<cmdline>
<value0>--omnitrace-add-secondary</value0>
<value0>--rocprof-sys-add-secondary</value0>
</cmdline>
<categories>
<value0>component</value0>
@@ -1340,24 +1342,24 @@ The full XML specification for a configuration value contains the same informati
<data_type>bool</data_type>
<initial>true</initial>
<value>true</value>
</OMNITRACE_ADD_SECONDARY>
</ROCPROFSYS_ADD_SECONDARY>
<!-- etc. -->
</settings>
</omnitrace>
</rocprofiler-systems>
</timemory_xml>
However, when writing an XML configuration file, it is minimally acceptable
to set ``OMNITRACE_ADD_SECONDARY=false``:
However, when writing an XML configuration file, it is minimally acceptable
to set ``ROCPROFSYS_ADD_SECONDARY=false``:
.. code-block:: xml
<?xml version="1.0" encoding="utf-8"?>
<timemory_xml>
<omnitrace>
<rocprofiler-systems>
<settings>
<OMNITRACE_ADD_SECONDARY>
<ROCPROFSYS_ADD_SECONDARY>
<value>false</value>
</OMNITRACE_ADD_SECONDARY>
</ROCPROFSYS_ADD_SECONDARY>
</settings>
</omnitrace>
</rocprofiler-systems>
</timemory_xml>
@@ -1,47 +1,47 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
:description: ROCm Systems Profiler environment validation documentation and reference
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, profiler, environment, tracking, visualization, tool, Instinct, accelerator, AMD
****************************************************
Configuring and validating the environment
****************************************************
After installing `Omnitrace <https://github.com/ROCm/omnitrace>`_, additional steps are required to set up
After installing `ROCm Systems Profiler <https://github.com/ROCm/rocprofiler-systems>`_, additional steps are required to set up
and validate the environment.
.. note::
The following instructions use the installation path ``/opt/omnitrace``. If
Omnitrace is installed elsewhere, substitute the actual installation path.
The following instructions use the installation path ``/opt/rocprofiler-systems``. If
ROCm Systems Profiler is installed elsewhere, substitute the actual installation path.
Configuring the environment
========================================
After Omnitrace is installed, source the ``setup-env.sh`` script to prefix the
After ROCm Systems Profiler is installed, source the ``setup-env.sh`` script to prefix the
``PATH``, ``LD_LIBRARY_PATH``, and other environment variables:
.. code-block:: shell
source /opt/omnitrace/share/omnitrace/setup-env.sh
source /opt/rocprofiler-systems/share/rocprofiler-systems/setup-env.sh
Alternatively, if environment modules are supported, add the ``<prefix>/share/modulefiles`` directory
to ``MODULEPATH``:
.. code-block:: shell
module use /opt/omnitrace/share/modulefiles
module use /opt/rocprofiler-systems/share/modulefiles
.. note::
As an alternative, the above line can be added to the ``${HOME}/.modulerc`` file.
After Omnitrace has been added to the ``MODULEPATH``, it can be loaded
using ``module load omnitrace/<VERSION>`` and unloaded using ``module unload omnitrace/<VERSION>``.
After ROCm Systems Profiler has been added to the ``MODULEPATH``, it can be loaded
using ``module load rocprofiler-systems/<VERSION>`` and unloaded using ``module unload rocprofiler-systems/<VERSION>``.
.. code-block:: shell
module load omnitrace/1.0.0
module unload omnitrace/1.0.0
module load rocprofiler-systems/1.0.0
module unload rocprofiler-systems/1.0.0
.. note::
@@ -51,21 +51,21 @@ using ``module load omnitrace/<VERSION>`` and unloaded using ``module unload omn
Validating the environment configuration
========================================
If the following commands all run successfully with the expected output,
then you are ready to use Omnitrace:
If the following commands all run successfully with the expected output,
then you are ready to use ROCm Systems Profiler:
.. code-block:: shell
which omnitrace
which omnitrace-avail
which omnitrace-sample
omnitrace-instrument --help
omnitrace-avail --all
omnitrace-sample --help
which rocprof-sys
which rocprof-sys-avail
which rocprof-sys-sample
rocprof-sys-instrument --help
rocprof-sys-avail --all
rocprof-sys-sample --help
If Omnitrace was built with Python support, validate these additional commands:
If ROCm Systems Profiler was built with Python support, validate these additional commands:
.. code-block:: shell
which omnitrace-python
omnitrace-python --help
which rocprof-sys-python
rocprof-sys-python --help
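The checks above can also be scripted. The following sketch assumes only that the commands are expected on ``PATH``; the helper name ``check_rocprofsys_commands`` is hypothetical and is not part of ROCm Systems Profiler.

```shell
# Check that each named command is on PATH.
# Prints one line per command and returns non-zero if any are missing.
check_rocprofsys_commands() {
    missing=0
    for cmd in "$@"; do
        if command -v "${cmd}" >/dev/null 2>&1; then
            echo "found:   ${cmd}"
        else
            echo "missing: ${cmd}"
            missing=1
        fi
    done
    return "${missing}"
}

check_rocprofsys_commands rocprof-sys-instrument rocprof-sys-avail rocprof-sys-sample \
    || echo "One or more commands are missing; re-check the setup-env.sh or module step"
```

If the package was built with Python support, ``rocprof-sys-python`` can be appended to the argument list.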
@@ -1,19 +1,19 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
:description: ROCm Systems Profiler general tips and usage documentation and reference
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, tips, how to, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
**********************************
General tips for using Omnitrace
General tips for using ROCm Systems Profiler
**********************************
Follow these general guidelines when using Omnitrace. For an explanation of the terms used in this topic, see
the :doc:`Omnitrace glossary <../reference/omnitrace-glossary>`.
Follow these general guidelines when using ROCm Systems Profiler. For an explanation of the terms used in this topic, see
the :doc:`ROCm Systems Profiler glossary <../reference/rocprof-sys-glossary>`.
* Use ``omnitrace-avail`` to look up configuration settings, hardware counters, and data collection components
* Use ``rocprof-sys-avail`` to look up configuration settings, hardware counters, and data collection components
* Use the ``-d`` flag for descriptions
* Generate a default configuration with ``omnitrace-avail -G ${HOME}/.omnitrace.cfg`` and adjust it
* Generate a default configuration with ``rocprof-sys-avail -G ${HOME}/.rocprof-sys.cfg`` and adjust it
to the desired default behavior
* **Decide whether binary instrumentation, statistical sampling, or both** provides the desired performance data (for non-Python applications)
* Compile code with optimization enabled (``-O2`` or higher), disable asserts (i.e. ``-DNDEBUG``), and include debug info (for instance, ``-g1`` at a minimum)
@@ -24,26 +24,26 @@ the :doc:`Omnitrace glossary <../reference/omnitrace-glossary>`.
* **Use binary instrumentation for characterizing the performance of every invocation of specific functions**
* **Use statistical sampling to characterize the performance of the entire application while minimizing overhead**
* Enable statistical sampling after binary instrumentation to help "fill in the gaps" between instrumented regions
* Use the user API to create custom regions and enable/disable Omnitrace for specific processes, threads, and regions
* Use the user API to create custom regions and enable/disable ROCm Systems Profiler for specific processes, threads, and regions
* Dynamic symbol interception, callback APIs, and the user API are always available with binary instrumentation and sampling
* Dynamic symbol interception and callback APIs are (generally) controlled through ``OMNITRACE_USE_<API>``
options, for example, ``OMNITRACE_USE_KOKKOSP`` and ``OMNITRACE_USE_OMPT`` enable Kokkos-Tools and OpenMP-Tools
* Dynamic symbol interception and callback APIs are (generally) controlled through ``ROCPROFSYS_USE_<API>``
options, for example, ``ROCPROFSYS_USE_KOKKOSP`` and ``ROCPROFSYS_USE_OMPT`` enable Kokkos-Tools and OpenMP-Tools
callbacks, respectively
* When generically seeking regions for performance improvement:
* **Start off by collecting a flat profile**
* Look for functions with high call counts, large cumulative runtimes/values, or large standard deviations
* When call counts are high, improving the performance of this function or "inlining" the function can result in quick and easy performance improvements
* When the standard deviation is high, collect a hierarchical profile and see if the high variation can be attributed to the calling context.
* When the standard deviation is high, collect a hierarchical profile and see if the high variation can be attributed to the calling context.
In this scenario, consider creating a specialized version of the function for the longer-running contexts
* **Collect a hierarchical profile** and verify the functions that are part of the "critical path" of your
* **Collect a hierarchical profile** and verify the functions that are part of the "critical path" of your
application, as indicated in the flat profile
* For example, functions with high call counts but which are part of a "setup" or "post-processing"
* For example, functions with high call counts but which are part of a "setup" or "post-processing"
phase that does not consume much time relative to the overall time are generally a lower priority for optimization
* **Use the information from the profiles when analyzing detailed traces**
@@ -54,7 +54,7 @@ the :doc:`Omnitrace glossary <../reference/omnitrace-glossary>`.
* When using binary instrumentation with MPI, avoid runtime instrumentation
* Runtime instrumentation requires a fork and a ``ptrace``, which is generally incompatible with how MPI applications spawn processes
* Perform a binary rewrite of the executable (and optionally, libraries used by the executable) using MPI and run
the generated instrumented executable using ``omnitrace-run`` instead of the original.
For example, instead of ``mpirun -n 2 ./myexe``, use ``mpirun -n 2 omnitrace-run -- ./myexe.inst``, where
* Perform a binary rewrite of the executable (and optionally, libraries used by the executable) using MPI and run
the generated instrumented executable using ``rocprof-sys-run`` instead of the original.
For example, instead of ``mpirun -n 2 ./myexe``, use ``mpirun -n 2 rocprof-sys-run -- ./myexe.inst``, where
``myexe.inst`` is the instrumented ``myexe`` executable that was generated.
@@ -1,12 +1,12 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
:description: ROCm Systems Profiler binary instrumentation and rewrite documentation and reference
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, binary instrumentation, binary rewrite, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
****************************************************
Instrumenting and rewriting a binary application
****************************************************
There are three ways to perform instrumentation with the ``omnitrace-instrument`` executable:
There are three ways to perform instrumentation with the ``rocprof-sys-instrument`` executable:
* Runtime instrumentation
* Attaching to an already running process
@@ -14,11 +14,11 @@ There are three ways to perform instrumentation with the ``omnitrace-instrument`
Here is a comparison of the three modes:
* Runtime instrumentation of the application using the ``omnitrace-instrument`` executable
* Runtime instrumentation of the application using the ``rocprof-sys-instrument`` executable
(analogous to ``gdb --args <program> <args>``)
* This mode is the default if neither the ``-p`` nor ``-o`` command-line options are used
* Runtime instrumentation supports instrumenting not only the target executable but also
* Runtime instrumentation supports instrumenting not only the target executable but also
the shared libraries loaded by the target executable. Consequently, this mode consumes more memory,
takes longer to perform the instrumentation, and tends to add more significant overhead to the
runtime of the application.
@@ -26,7 +26,7 @@ Here is a comparison of the three modes:
libraries but also the performance of the library dependencies
* Attaching to a process that is currently running (analogous to ``gdb -p <PID>``)
* This mode is activated using ``-p <PID>``
* The same caveats from the first example apply with respect to memory and overhead
@@ -39,25 +39,25 @@ Here is a comparison of the three modes:
* This mode is activated through the ``-o <output-file>`` option
* Binary rewriting is limited to the text section of the target executable or library. It does not instrument
the dynamically-linked libraries. Consequently, this mode performs the
the dynamically-linked libraries. Consequently, this mode performs the
instrumentation significantly faster
and has a much lower overhead when running the instrumented executable and libraries.
* Binary rewriting is the recommended mode when the target executable uses
* Binary rewriting is the recommended mode when the target executable uses
process-level parallelism (for example, MPI)
* If the target executable has a minimal ``main`` routine and the bulk of your
* If the target executable has a minimal ``main`` routine and the bulk of your
application is in one specific dynamic library,
see :ref:`binary-rewriting-library-label` for help
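The mode selection described above depends only on whether ``-p`` or ``-o`` appears before the ``--`` separator. The following toy classifier illustrates that rule; ``instrument_mode`` is a hypothetical helper for illustration and does not invoke the real tool.

```shell
# Illustrative only: classify a rocprof-sys-instrument command line by mode.
# Only the options before the standalone "--" separator are inspected.
instrument_mode() {
    mode="runtime"                      # default when neither -p nor -o is present
    for arg in "$@"; do
        case "${arg}" in
            --) break ;;                # everything after "--" belongs to the application
            -p|--pid) mode="attach" ;;
            -o|--output) mode="rewrite" ;;
        esac
    done
    echo "${mode}"
}

instrument_mode -- ./foo                 # prints "runtime"
instrument_mode -p 1234 -- foo           # prints "attach"
instrument_mode -o foo.inst -- ./foo     # prints "rewrite"
```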
The omnitrace-instrument executable
The rocprof-sys-instrument executable
========================================
Instrumentation is performed with the ``omnitrace-instrument`` executable. For more details, use the ``-h`` or ``--help`` option to
Instrumentation is performed with the ``rocprof-sys-instrument`` executable. For more details, use the ``-h`` or ``--help`` option to
view the help menu.
.. code-block:: shell
$ omnitrace-instrument --help
[omnitrace-instrument] Usage: omnitrace-instrument [ --help (count: 0, dtype: bool)
$ rocprof-sys-instrument --help
[rocprof-sys-instrument] Usage: rocprof-sys-instrument [ --help (count: 0, dtype: bool)
--version (count: 0, dtype: bool)
--verbose (max: 1, dtype: bool)
--error (max: 1, dtype: boolean)
@@ -161,8 +161,8 @@ view the help menu.
[MODE OPTIONS]
-o, --output Enable generation of a new executable (binary-rewrite). If a filename is not provided,
omnitrace will use the basename and output to the cwd, unless the target binary is in the
cwd. In the latter case, omnitrace will either use ${PWD}/<basename>.inst (non-libraries)
rocprof-sys will use the basename and output to the cwd, unless the target binary is in the
cwd. In the latter case, rocprof-sys will either use ${PWD}/<basename>.inst (non-libraries)
or ${PWD}/instrumented/<basename> (libraries)
-p, --pid Connect to running process
-M, --mode [ coverage | sampling | trace ]
@@ -177,7 +177,7 @@ view the help menu.
[LIBRARY OPTIONS]
--prefer [ shared | static ] Prefer this library types when available
-L, --library Libraries with instrumentation routines (default: "libomnitrace-dl")
-L, --library Libraries with instrumentation routines (default: "librocprof-sys-dl")
-m, --main-function The primary function to instrument around, e.g. \'main\'
--load Supplemental instrumentation library names w/o extension (e.g. \'libinstr\' for
\'libinstr.so\' or \'libinstr.a\')
@@ -200,17 +200,17 @@ view the help menu.
-ME, --module-exclude Regex(es) for excluding modules/files/libraries (always applied)
-MR, --module-restrict Regex(es) for restricting modules/files/libraries only to those that match the provided
regular-expressions
--internal-function-include Regex(es) for including functions which are (likely) utilized by omnitrace itself. Use
--internal-function-include Regex(es) for including functions which are (likely) utilized by rocprof-sys itself. Use
this option with care.
--internal-module-include Regex(es) for including modules/libraries which are (likely) utilized by omnitrace
--internal-module-include Regex(es) for including modules/libraries which are (likely) utilized by rocprof-sys
itself. Use this option with care.
--instruction-exclude Regex(es) for excluding functions containing certain instructions
--internal-library-deps Treat the libraries linked to the internal libraries as internal libraries. This increase
the internal library processing time and consume more memory (so use with care) but may
be useful when the application uses Boost libraries and Dyninst is dynamically linked
against the same boost libraries
--internal-library-append Append to the list of libraries which omnitrace treats as being used internally, e.g.
OmniTrace will find all the symbols in this library and prevent them from being
--internal-library-append Append to the list of libraries which rocprof-sys treats as being used internally, e.g.
ROCm Systems Profiler will find all the symbols in this library and prevent them from being
instrumented.
--internal-library-remove [ ld-linux-x86-64.so.2
libBrokenLocale.so.1
@@ -272,7 +272,7 @@ view the help menu.
libz.so
libzstd.so ]
Remove the specified libraries from being treated as being used internally, e.g.
OmniTrace will permit all the symbols in these libraries to be eligible for
ROCm Systems Profiler will permit all the symbols in these libraries to be eligible for
instrumentation.
--linkage [ global | local | unique | unknown | weak ]
Only instrument functions with specified linkage (default: global, local, unique)
@@ -287,11 +287,11 @@ view the help menu.
options to gain more information about the function signature or location of the
functions
-C, --config Read in a configuration file and encode these values as the defaults in the executable
-d, --default-components Default components to instrument (only useful when timemory is enabled in omnitrace
-d, --default-components Default components to instrument (only useful when timemory is enabled in rocprof-sys
library)
--env Environment variables to add to the runtime in form VARIABLE=VALUE. E.g. use \'--env
OMNITRACE_PROFILE=ON\' to default to using timemory instead of perfetto
--mpi Enable MPI support (requires omnitrace built w/ full or partial MPI support). NOTE: this
ROCPROFSYS_PROFILE=ON\' to default to using timemory instead of perfetto
--mpi Enable MPI support (requires rocprof-sys built w/ full or partial MPI support). NOTE: this
will automatically be activated if MPI_Init, MPI_Init_thread, MPI_Finalize,
MPI_Comm_rank, or MPI_Comm_size are found in the symbol table of target
@@ -322,8 +322,8 @@ view the help menu.
--allow-overlapping Allow dyninst to instrument either multiple functions which overlap (share part of same
function body) or single functions with multiple entry points. For more info, see Section
2 of the DyninstAPI documentation.
--parse-all-modules By default, omnitrace simply requests Dyninst to provide all the procedures in the
application image. If this option is enabled, omnitrace will iterate over all the modules
--parse-all-modules By default, rocprof-sys simply requests Dyninst to provide all the procedures in the
application image. If this option is enabled, rocprof-sys will iterate over all the modules
and extract the functions. Theoretically, it should be the same but the data is slightly
different, possibly due to weak binding scopes. In general, enabling option will probably
have no visible effect
@@ -344,17 +344,17 @@ view the help menu.
TypeChecking ]
Advanced dyninst options: BPatch::set<OPTION>(bool), e.g. bpatch->setTrampRecursive(true)
``omnitrace-instrument`` uses a syntax similar to LLVM's to separate command-line arguments from the
application's arguments. It uses a standalone
double-hyphen (``--``) as a separator.
``rocprof-sys-instrument`` uses a syntax similar to LLVM's to separate command-line arguments from the
application's arguments. It uses a standalone
double-hyphen (``--``) as a separator.
All arguments preceding the double-hyphen
are interpreted as belonging to Omnitrace and all arguments following the
are interpreted as belonging to ROCm Systems Profiler and all arguments following the
double-hyphen are interpreted as being part of the
application and its arguments. In binary rewrite mode, all application arguments after the first argument
are ignored. As an example, ``./omnitrace-instrument -o ls.inst -- ls -l`` interprets ``ls`` as
are ignored. As an example, ``./rocprof-sys-instrument -o ls.inst -- ls -l`` interprets ``ls`` as
the target to instrument, ignoring the ``-l`` argument,
and generates a ``ls.inst`` executable that you can subsequently run using the
``omnitrace-run -- ls.inst -l`` command.
and generates a ``ls.inst`` executable that you can subsequently run using the
``rocprof-sys-run -- ls.inst -l`` command.
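To make the separator rule concrete, the sketch below picks out the argument that binary rewrite mode treats as the target. The function ``rewrite_target`` is hypothetical and only mimics the parsing convention described above.

```shell
# Illustrative only: print the argument that binary rewrite mode treats as the target.
rewrite_target() {
    while [ "$#" -gt 0 ]; do
        if [ "$1" = "--" ]; then
            shift
            # The first argument after "--" is the target; the rest are ignored
            echo "${1:-}"
            return 0
        fi
        shift
    done
    return 1    # no "--" separator found
}

rewrite_target -o ls.inst -- ls -l   # prints "ls"
```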
Runtime instrumentation example
========================================
@@ -363,7 +363,7 @@ The following example shows how to enable runtime instrumentation.
.. code-block:: shell
omnitrace-instrument <omnitrace-options> -- <exe> [<exe-options>...]
rocprof-sys-instrument <rocprof-sys-options> -- <exe> [<exe-options>...]
Attaching to a running process
========================================
@@ -372,7 +372,7 @@ Use the following command to attach to an active process.
.. code-block:: shell
omnitrace-instrument <omnitrace-options> -p <PID> -- <exe-name>
rocprof-sys-instrument <rocprof-sys-options> -p <PID> -- <exe-name>
Binary rewrite
========================================
@@ -381,24 +381,24 @@ This example demonstrates how to rewrite a binary.
.. code-block:: shell
omnitrace-instrument <omnitrace-options> -o <name-of-new-exe-or-library> -- <exe-or-library>
rocprof-sys-instrument <rocprof-sys-options> -o <name-of-new-exe-or-library> -- <exe-or-library>
.. _binary-rewriting-library-label:
Binary rewrite of a library
-----------------------------------
Many applications bundle the bulk of their functionality into one or more
Many applications bundle the bulk of their functionality into one or more
dynamic libraries and have a relatively simple ``main``
which links to these libraries and serves as the "driver" for
which links to these libraries and serves as the "driver" for
setting up the workflow. If you perform a binary rewrite of an
executable like this and find there is insufficient information, you
executable like this and find there is insufficient information, you
can either switch to runtime instrumentation or perform a
binary rewrite on the relevant libraries.
Support for stand-alone binary rewriting of a dynamic library without a binary rewrite of
Support for stand-alone binary rewriting of a dynamic library without a binary rewrite of
the executable is a beta feature.
In general, it is supported as long as the library contains the ``_init`` and
In general, it is supported as long as the library contains the ``_init`` and
``_fini`` symbols but these symbols are not
standardized to the extent of ``main`` in an executable.
@@ -406,8 +406,8 @@ Here is the recommended workflow for the binary rewrite of a library:
#. Determine the names of the dynamically linked libraries of interest using ``ldd``
#. Generate a binary rewrite of the executable
#. Generate a binary rewrite of the desired libraries with the same base name as the
original library, for example, ``libfoo.so.2`` instead of ``libfoo.so``, and output the instrumented
#. Generate a binary rewrite of the desired libraries with the same base name as the
original library, for example, ``libfoo.so.2`` instead of ``libfoo.so``, and output the instrumented
library into a different folder than the original library.
#. Prefix the ``LD_LIBRARY_PATH`` environment variable with the output folder from the previous step
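The last step of the workflow can be sketched as a small helper. The function name and the ``instrumented`` folder are assumptions made for illustration; substitute the folder that actually holds the rewritten library.

```shell
# Prepend a directory to LD_LIBRARY_PATH without leaving a trailing ":" when the
# variable is empty (an empty entry makes the loader search the current directory).
prepend_ld_library_path() {
    LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
    export LD_LIBRARY_PATH
}

# Assumed location of the instrumented library from the previous step
prepend_ld_library_path "${PWD}/instrumented"
echo "${LD_LIBRARY_PATH}"
```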
@@ -433,10 +433,10 @@ Generate binary rewrites of ``foo`` and ``libfoo.so.2``:
.. code-block:: shell
omnitrace-instrument -o ./foo.inst -- foo
omnitrace-instrument -o ./libfoo.so.2 -- /usr/local/lib/libfoo.so.2
rocprof-sys-instrument -o ./foo.inst -- foo
rocprof-sys-instrument -o ./libfoo.so.2 -- /usr/local/lib/libfoo.so.2
At this point, the instrumented ``foo.inst`` executable still dynamically loads the
At this point, the instrumented ``foo.inst`` executable still dynamically loads the
original ``libfoo.so.2`` in ``/usr/local/lib``:
.. code-block:: shell
@@ -446,7 +446,7 @@ original ``libfoo.so.2`` in ``/usr/local/lib``:
libfoo.so.2 => /usr/local/lib/libfoo.so.2 (...)
...
Prefix the ``LD_LIBRARY_PATH`` environment variable with the folder containing
Prefix the ``LD_LIBRARY_PATH`` environment variable with the folder containing
the instrumented ``libfoo.so.2``:
.. code-block:: shell
@@ -465,90 +465,90 @@ the instrumented ``libfoo.so.2``:
Selective instrumentation
========================================
The default behavior of ``omnitrace-instrument`` does not instrument every symbol in the binary.
The default behavior of ``rocprof-sys-instrument`` does not instrument every symbol in the binary.
The default rules are:
* Skip instrumenting dynamic call-sites (such as function pointers)
* The ``--dynamic-callsites`` option forces instrumentation for all dynamic call-sites
* The cost of a function can be loosely approximated by the number of
instructions. By default, ``omnitrace-instrument`` only instruments functions
* The cost of a function can be loosely approximated by the number of
instructions. By default, ``rocprof-sys-instrument`` only instruments functions
with at least 1024 instructions
* The ``--min-instructions`` option modifies this heuristic for all functions which do not contain loops
* The ``--min-instructions-loop`` option modifies this heuristic for functions which contain loops.
* The cost of a function can also be loosely approximated by the size of the function
in the binary so this heuristic can be used in lieu of or in addition to the
* The cost of a function can also be loosely approximated by the size of the function
in the binary so this heuristic can be used in lieu of or in addition to the
minimum number of instructions
* The ``--min-address-range`` option modifies this heuristic for all functions which do not contain loops
* The ``--min-address-range-loop`` option modifies this heuristic for functions which contain loops
* The ``--min-address-range-loop`` option modifies this heuristic for functions which contain loops
* Skip instrumentation points which require using a trap
* See the description for the ``--traps`` and ``--loop-traps`` options for more information
* Skip instrumenting loops within the body of a function
* The ``--instrument-loops`` option enables loop instrumentation
* Skip instrumenting functions with overlapping function bodies and single
* Skip instrumenting functions with overlapping function bodies and single
functions with multiple entry points
* These behaviors arise from various optimizations. Enable instrumenting for these functions
* These behaviors arise from various optimizations. Enable instrumenting for these functions
by using the ``--allow-overlapping`` option
.. note::
The separate loop options ``--min-instructions-loop`` and ``--min-address-range-loop``
The separate loop options ``--min-instructions-loop`` and ``--min-address-range-loop``
are provided because functions with loops can be compact in the binary while also being costly
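As a rough illustration of the instruction-count rule, the toy predicate below applies the default minimum of 1024 instructions stated above. It is a model only: ``meets_min_instructions`` is a hypothetical name, and the lowered minimum of 256 is an arbitrary example value.

```shell
# Toy model of the default size heuristic: instrument only functions with at
# least MIN instructions (1024 by default, per the rule above).
# $1 = instruction count of the function, $2 = optional minimum
meets_min_instructions() {
    [ "$1" -ge "${2:-1024}" ]
}

meets_min_instructions 2048 && echo "2048 instructions: instrumented by default"
meets_min_instructions 300 || echo "300 instructions: skipped by default"
meets_min_instructions 300 256 && echo "300 instructions: instrumented when the minimum is 256"
```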
Viewing the available, instrumented, excluded, and overlapping functions
-------------------------------------------------------------------------
Whenever ``omnitrace-instrument`` runs with a verbosity of zero or higher,
it generates files that detail which functions
were available for instrumentation (along with the module they were defined in), actually instrumented,
Whenever ``rocprof-sys-instrument`` runs with a verbosity of zero or higher,
it generates files that detail which functions
were available for instrumentation (along with the module they were defined in), actually instrumented,
excluded, and which contained overlapping function bodies.
By default, these files are saved to the ``omnitrace-<NAME>-output`` folder
By default, these files are saved to the ``rocprof-sys-<NAME>-output`` folder
where ``<NAME>`` is the base name of the targeted binary (or
the base name of the resulting executable in the case of binary rewrite). For example,
``omnitrace-instrument -- ls`` outputs these files to ``omnitrace-ls-output``
whereas ``omnitrace-instrument -o ls.inst -- ls`` places them in ``omnitrace-ls.inst-output``.
``rocprof-sys-instrument -- ls`` outputs these files to ``rocprof-sys-ls-output``
whereas ``rocprof-sys-instrument -o ls.inst -- ls`` places them in ``rocprof-sys-ls.inst-output``.
To generate these files without running or generating an
To generate these files without running or generating an
executable, use the ``--simulate`` option:
.. code-block:: shell
omnitrace-instrument --simulate -- foo
omnitrace-instrument --simulate -o foo.inst -- foo
rocprof-sys-instrument --simulate -- foo
rocprof-sys-instrument --simulate -o foo.inst -- foo
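The folder naming rule can be expressed as a short helper, which may be handy when a script needs to locate these files afterwards. ``rocprofsys_output_dir`` is a hypothetical name; the sketch covers only the default naming described above.

```shell
# Illustrative only: derive the default output folder name described above.
# $1 = target binary, $2 = optional output file passed to -o (binary rewrite)
rocprofsys_output_dir() {
    name="$(basename "${2:-$1}")"
    echo "rocprof-sys-${name}-output"
}

rocprofsys_output_dir ls            # prints "rocprof-sys-ls-output"
rocprofsys_output_dir ls ls.inst    # prints "rocprof-sys-ls.inst-output"
```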
Excluding and including modules and functions
----------------------------------------------
Omnitrace has a set of six command-line options which each accept one or more
ROCm Systems Profiler has a set of six command-line options which each accept one or more
regular expressions for customizing the scope of which modules and/or functions are
instrumented. Multiple regex patterns per option are treated as an OR operation,
instrumented. Multiple regex patterns per option are treated as an OR operation,
for example, ``--module-include libfoo libbar`` is effectively the same as ``--module-include 'libfoo|libbar'``.
To force the inclusion of certain modules and/or functions
To force the inclusion of certain modules and/or functions
without changing any of the heuristics, use the ``--module-include`` and/or ``--function-include`` options.
These options do not exclude modules or functions which do
These options do not exclude modules or functions which do
not satisfy their regular expression.
To narrow the scope of the instrumentation to a specific set
To narrow the scope of the instrumentation to a specific set
of libraries and/or functions, use the ``--module-restrict`` and ``--function-restrict`` options.
These options let you exclusively select the union of one or more
These options let you exclusively select the union of one or more
regular expressions, regardless of whether or not the functions satisfy the
previously-mentioned default heuristics. Any function or module that is not within
previously-mentioned default heuristics. Any function or module that is not within
the union of these regular expressions is excluded from instrumentation.
To avoid instrumenting a set of modules and/or functions,
To avoid instrumenting a set of modules and/or functions,
use the ``--module-exclude`` and ``--function-exclude`` options.
These options are always applied, even if the module or function
These options are always applied, even if the module or function
satisfies the "restrict" or "include" regular expression.
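The OR semantics and the precedence of the "exclude" options can be tried out on a plain list of names. In the sketch below, ``grep -E`` stands in for the tool's regex matching (an assumption), and the function names in the list are made up for illustration.

```shell
# grep -E stands in for the tool's regex engine (an assumption).
functions="foo_init foo_compute bar_compute baz_io"

# Multiple patterns for one option behave like a single alternation (OR).
separate="$(printf '%s\n' ${functions} | grep -E -e '^foo' -e '^bar')"
combined="$(printf '%s\n' ${functions} | grep -E '^foo|^bar')"
[ "${separate}" = "${combined}" ] && echo "OR semantics confirmed"

# "restrict" narrows the candidates; "exclude" is then applied on top of it.
selected="$(printf '%s\n' ${functions} | grep -E '^foo|^bar' | grep -E -v '_init$')"
echo "${selected}"
```

Here ``foo_init`` matches the restrict pattern but is still dropped because it also matches the exclude pattern.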
.. _available-module-function-output:
@@ -558,7 +558,7 @@ An example of the available module and function info output
.. code-block:: shell
omnitrace-instrument -o lulesh.inst --label file line args --simulate -- lulesh
rocprof-sys-instrument -o lulesh.inst --label file line args --simulate -- lulesh
.. code-block:: shell
@@ -779,7 +779,7 @@ An example of instrumented module and function info output
.. code-block:: shell
omnitrace-instrument -o lulesh.inst --label file line args --simulate -- lulesh
rocprof-sys-instrument -o lulesh.inst --label file line args --simulate -- lulesh
After the heuristics are applied based on the pattern in :ref:`available-module-function-output`,
the selected module and functions are:
@@ -850,15 +850,15 @@ Sampling
This capability has been deprecated in favor of :doc:`Call stack sampling <./sampling-call-stack>`.
By default, ``omnitrace-instrument`` uses ``--mode trace`` for instrumentation. The ``--mode sampling`` option
By default, ``rocprof-sys-instrument`` uses ``--mode trace`` for instrumentation. The ``--mode sampling`` option
only instruments ``main`` in an executable. It activates both CPU call-stack sampling and
background system-level thread sampling by default.
Tracing capabilities which do not rely on instrumentation, such as the HIP API and kernel tracing
(which is collected by roctracer), are still available.
The Omnitrace sampling capabilities are always available, even in trace mode, but are deactivated by default.
To activate sampling in trace mode, set ``OMNITRACE_USE_SAMPLING=ON`` in the environment
or in an Omnitrace configuration file.
The ROCm Systems Profiler sampling capabilities are always available, even in trace mode, but are deactivated by default.
To activate sampling in trace mode, set ``ROCPROFSYS_USE_SAMPLING=ON`` in the environment
or in a ROCm Systems Profiler configuration file.
Embedding a default configuration
========================================
@@ -872,31 +872,31 @@ the configuration settings are not be preserved for subsequent sessions:
.. code-block:: shell
omnitrace-instrument -o ./foo.inst -- ./foo
export OMNITRACE_USE_SAMPLING=ON
export OMNITRACE_SAMPLING_FREQ=5
omnitrace-run -- ./foo.inst
rocprof-sys-instrument -o ./foo.inst -- ./foo
export ROCPROFSYS_USE_SAMPLING=ON
export ROCPROFSYS_SAMPLING_FREQ=5
rocprof-sys-run -- ./foo.inst
Whereas the following command preserves those environment variables:
.. code-block:: shell
omnitrace-instrument -o ./foo.samp --env OMNITRACE_USE_SAMPLING=ON OMNITRACE_SAMPLING_FREQ=5 -- ./foo
rocprof-sys-instrument -o ./foo.samp --env ROCPROFSYS_USE_SAMPLING=ON ROCPROFSYS_SAMPLING_FREQ=5 -- ./foo
They can now be used in future sessions.
.. code-block:: shell
# will sample 5x per second
omnitrace-run -- ./foo.samp
rocprof-sys-run -- ./foo.samp
Even though the environment variables are preserved, subsequent sessions can still override those defaults:
.. code-block:: shell
# will sample 100x per second
export OMNITRACE_SAMPLING_FREQ=100
omnitrace-run -- ./foo.samp
export ROCPROFSYS_SAMPLING_FREQ=100
rocprof-sys-run -- ./foo.samp
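The precedence shown in these examples resembles shell default expansion: the embedded value applies only when the variable is not set at run time. The following toy model illustrates that idea only; it does not reflect how the instrumented binary stores or reads its defaults.

```shell
# Toy model: an embedded default applies only when the variable is unset at run time.
effective_sampling_freq() {
    embedded_default=5      # value embedded with --env in the example above
    echo "${ROCPROFSYS_SAMPLING_FREQ-${embedded_default}}"
}

unset ROCPROFSYS_SAMPLING_FREQ
effective_sampling_freq                 # prints 5

ROCPROFSYS_SAMPLING_FREQ=100
effective_sampling_freq                 # prints 100
unset ROCPROFSYS_SAMPLING_FREQ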
.. _rpath-troubleshooting:
@@ -906,10 +906,10 @@ Troubleshooting
Checking for RPATH
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If ``ldd ./foo.inst`` from the :ref:`binary-rewriting-library-label`
section still returns ``/usr/local/lib/libfoo.so.2``, the executable could have
If ``ldd ./foo.inst`` from the :ref:`binary-rewriting-library-label`
section still returns ``/usr/local/lib/libfoo.so.2``, the executable could have
an rpath encoded in the binary.
This ELF entry results in the dynamic linker ignoring ``LD_LIBRARY_PATH`` if
This ELF entry results in the dynamic linker ignoring ``LD_LIBRARY_PATH`` if
it finds ``libfoo.so.2`` in the rpath.
Using the ``objdump`` tool, perform the following query:
@@ -923,13 +923,13 @@ If this produces output that appears similar to this output.:
RUNPATH $ORIGIN:$ORIGIN/../lib
Remove or modify the rpath to get ``foo.inst`` to resolve
Remove or modify the rpath to get ``foo.inst`` to resolve
to the instrumented ``libfoo.so.2`` as explained in the next section.
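When this check runs inside a script, the rpath value can be extracted from the dynamic section listing. The sketch below assumes input in the style of ``objdump -p`` and uses a hypothetical helper named ``extract_rpath``; the sample line is the one shown above.

```shell
# Read `objdump -p` style text on stdin and print the RPATH/RUNPATH value, if any.
# Exits non-zero when no such entry is present.
extract_rpath() {
    awk '$1 == "RPATH" || $1 == "RUNPATH" { print $2; found = 1 } END { exit !found }'
}

# Sample input; with a real binary, pipe in `objdump -p ./foo.inst` instead.
printf '  RUNPATH              $ORIGIN:$ORIGIN/../lib\n' | extract_rpath
```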
Modifying an RPATH
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This code snippet uses the ``patchelf`` tool to modify the rpath of the given executable
This code snippet uses the ``patchelf`` tool to modify the rpath of the given executable
or library to ``/home/user``, which is where the instrumented libraries are located.
.. note::
+110 -114
@@ -1,6 +1,6 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
:description: ROCm Systems Profiler causal profiling documentation and reference
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, causal profiling, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
****************************************************
Performing causal profiling
@@ -18,10 +18,6 @@ Thus, causal profiling works by performing experiments on blocks of code during
insert pauses to slow down all other concurrently running code. During post-processing, these experiments
are translated into calculations for the potential impact of speeding up this block of code.
.. note::
Causal profiling supersedes the original critical trace feature, which was removed in Omnitrace v1.11.0.
Consider the following C++ code executing ``foo`` and ``bar`` concurrently in two different threads
where ``foo`` is ideally 30% faster than ``bar``:
@@ -51,52 +47,52 @@ where ``foo`` is ideally 30% faster than ``bar``:
itr.join();
}
No matter how many optimizations are applied to ``foo``, the application will always
No matter how many optimizations are applied to ``foo``, the application will always
require the same amount of time
because the end-to-end performance is limited by ``bar``. However, a 5% speed-up
because the end-to-end performance is limited by ``bar``. However, a 5% speed-up
in ``bar`` results in the
end-to-end performance improving by 5%. This trend continues linearly, with a 10% speed-up
end-to-end performance improving by 5%. This trend continues linearly, with a 10% speed-up
in ``bar`` yielding a 10% speed-up in
end-to-end performance, and so on, up to a 30% speed-up, at which point ``bar`` runs as fast as ``foo``.
Any speed-up to ``bar`` beyond 30% still only yields an end-to-end performance
Any speed-up to ``bar`` beyond 30% still only yields an end-to-end performance
improvement of 30% because the application
is now limited by performance of ``foo``, as demonstrated below in the causal
is now limited by performance of ``foo``, as demonstrated below in the causal
profiling visualization:
.. image:: ../data/causal-foobar.png
:alt: Visualization of the performance improvements for two functions with causal profiling
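The arithmetic behind this example can be sketched in a few lines of Python. This is an illustrative model only and is not part of ROCm Systems Profiler: it assumes ``bar`` takes 100 time units, ``foo`` takes 70 (30% faster), and the two threads run fully in parallel. The function name ``end_to_end_improvement`` is hypothetical.

```python
def end_to_end_improvement(bar_speedup_pct, foo_time=70.0, bar_time=100.0):
    """Return the end-to-end improvement (in percent) for a given speed-up of bar."""
    # The application finishes when the slower of the two threads finishes.
    baseline = max(foo_time, bar_time)
    new_bar_time = bar_time * (1.0 - bar_speedup_pct / 100.0)
    new_total = max(foo_time, new_bar_time)
    return 100.0 * (baseline - new_total) / baseline


if __name__ == "__main__":
    for pct in (0, 5, 10, 30, 50):
        gain = end_to_end_improvement(pct)
        print(f"bar sped up by {pct}% -> end-to-end improvement of {gain:.0f}%")
```

Under these assumptions the improvement tracks the speed-up of ``bar`` until ``foo`` becomes the limiting thread, after which it stays flat.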
The full details of the causal profiling methodology can be found in the paper
The full details of the causal profiling methodology can be found in the paper
`Coz: Finding Code that Counts with Causal Profiling <http://arxiv.org/pdf/1608.03676v1.pdf>`_.
The authors' implementation is publicly available on `GitHub <https://github.com/plasma-umass/coz>`_.
Getting started
========================================
To effectively use causal profiling, it is important to understand a few key
To effectively use causal profiling, it is important to understand a few key
concepts, such as progress points.
Progress points
-----------------------------------
Causal profiling requires "progress points" to track progress through the code
Causal profiling requires "progress points" to track progress through the code
in between samples. Progress points must be triggered in a deterministic manner via instrumentation.
This can happen in three different ways:
* `Omnitrace <https://github.com/ROCm/omnitrace>`_ can leverage the callbacks from
Kokkos-Tools, OpenMP-Tools, roctracer, etc. and the wrappers around functions for
* `ROCm Systems Profiler <https://github.com/ROCm/rocprofiler-systems>`_ can leverage the callbacks from
Kokkos-Tools, OpenMP-Tools, roctracer, etc. and the wrappers around functions for
MPI, NUMA, RCCL, etc. to act as progress points
* Users can leverage the :doc:`runtime instrumentation capabilities <./instrumenting-rewriting-binary-application>`
* Users can leverage the :doc:`runtime instrumentation capabilities <./instrumenting-rewriting-binary-application>`
to insert progress points
* Users can leverage :doc:`User APIs <../how-to/using-omnitrace-api>`,
such as ``OMNITRACE_CAUSAL_PROGRESS``
* Users can leverage :doc:`User APIs <../how-to/using-rocprof-sys-api>`,
such as ``ROCPROFSYS_CAUSAL_PROGRESS``
.. note::
Binary rewrite to insert progress points is not supported. When a rewritten binary
runs, Dyninst translates the instruction pointer address in order to perform
the instrumentation. As a result, call stack samples never return instruction
pointer addresses within the valid Omnitrace range.
Binary rewrite to insert progress points is not supported. When a rewritten binary
runs, Dyninst translates the instruction pointer address in order to perform
the instrumentation. As a result, call stack samples never return instruction
pointer addresses within the valid ROCm Systems Profiler range.
Key concepts
-----------------------------------
@@ -104,26 +100,26 @@ Key concepts
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
| Concept | Setting | Options | Description |
+==================+=====================================+==================================+============================================+
| Backend | ``OMNITRACE_CAUSAL_BACKEND`` | ``perf``, ``timer`` | Backend for recording samples required |
| Backend | ``ROCPROFSYS_CAUSAL_BACKEND`` | ``perf``, ``timer`` | Backend for recording samples required |
| | | | to calculate the virtual speed-up |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
| Mode | ``OMNITRACE_CAUSAL_MODE`` | ``function``, ``line`` | Select an entire function or individual |
| Mode | ``ROCPROFSYS_CAUSAL_MODE`` | ``function``, ``line`` | Select an entire function or individual |
| | | | line of code for causal experiments |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
| End-to-end | ``OMNITRACE_CAUSAL_END_TO_END`` | Boolean | Perform a single experiment during the |
| End-to-end | ``ROCPROFSYS_CAUSAL_END_TO_END`` | Boolean | Perform a single experiment during the |
| | | | entire run (does not require |
| | | | progress points) |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
| Fixed speed-up | ``OMNITRACE_CAUSAL_FIXED_SPEEDUP`` | one or more values from [0, 100] | Virtual speed-up or pool of virtual |
| Fixed speed-up | ``ROCPROFSYS_CAUSAL_FIXED_SPEEDUP`` | one or more values from [0, 100] | Virtual speed-up or pool of virtual |
| | | | speed-ups to randomly select |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
| Binary scope | ``OMNITRACE_CAUSAL_BINARY_SCOPE`` | regular expression(s) | Dynamic binaries containing code for |
| Binary scope | ``ROCPROFSYS_CAUSAL_BINARY_SCOPE`` | regular expression(s) | Dynamic binaries containing code for |
| | | | experiments |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
| Source scope | ``OMNITRACE_CAUSAL_SOURCE_SCOPE`` | regular expression(s) | ``<file>`` and/or ``<file>:<line>`` |
| Source scope | ``ROCPROFSYS_CAUSAL_SOURCE_SCOPE`` | regular expression(s) | ``<file>`` and/or ``<file>:<line>`` |
| | | | containing code to include in experiments |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
| Function scope | ``OMNITRACE_CAUSAL_FUNCTION_SCOPE`` | regular expression(s) | Restricts experiments to matching |
| Function scope | ``ROCPROFSYS_CAUSAL_FUNCTION_SCOPE`` | regular expression(s) | Restricts experiments to matching |
| | | | functions (function mode) or lines of |
| | | | code within matching functions (line mode) |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
@@ -137,30 +133,30 @@ Key concepts
Backends
-----------------------------------
There are two backends to choose from: ``perf`` and ``timer``.
They are used to record the samples required to calculate the virtual speedup.
There are two backends to choose from: ``perf`` and ``timer``.
They are used to record the samples required to calculate the virtual speedup.
Both backends interrupt each thread 1000 times per second (of CPU-time) to apply the virtual speed-ups.
The difference between each backend is how the samples are recorded.
There are three key differences between the two backends:
* the ``perf`` backend requires Linux Perf and elevated security privileges
* the ``perf`` backend interrupts the application less frequently whereas the ``timer`` backend
* the ``perf`` backend interrupts the application less frequently whereas the ``timer`` backend
interrupts the application 1000 times per second of realtime
* the ``timer`` backend has less accurate call stacks due to instruction pointer skid
In general, the ``perf`` backend is preferred over the ``timer`` backend when sufficient
In general, the ``perf`` backend is preferred over the ``timer`` backend when sufficient
security privileges permit its usage.
If ``OMNITRACE_CAUSAL_BACKEND`` is set to ``auto``, Omnitrace falls back
If ``ROCPROFSYS_CAUSAL_BACKEND`` is set to ``auto``, ROCm Systems Profiler falls back
to using the ``timer`` backend only if
the ``perf`` backend fails. If ``OMNITRACE_CAUSAL_BACKEND`` is
set to ``perf`` and using this backend fails, Omnitrace aborts.
the ``perf`` backend fails. If ``ROCPROFSYS_CAUSAL_BACKEND`` is
set to ``perf`` and using this backend fails, ROCm Systems Profiler aborts.
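The fallback rules described above can be summarized as a small decision function. This is a sketch of the documented behavior only, not the profiler's actual code; ``select_backend`` and its ``perf_works`` argument are hypothetical names.

```python
def select_backend(requested, perf_works):
    """Mirror the documented fallback rules for the causal backend setting."""
    if requested == "timer":
        return "timer"
    if requested == "perf":
        if not perf_works:
            # An explicit request for perf does not fall back to timer.
            raise RuntimeError("perf backend requested but it failed: aborting")
        return "perf"
    if requested == "auto":
        # Only auto is allowed to fall back to the timer backend.
        return "perf" if perf_works else "timer"
    raise ValueError(f"unknown backend: {requested!r}")


if __name__ == "__main__":
    print(select_backend("auto", perf_works=False))
```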
Instruction pointer skid
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Instruction pointer (IP) skid measures how many instructions run after the event of interest
before the program actually stops. The IP skid is calculated by subtracting
the location of the IP at the point of interest from the location of the IP
the location of the IP at the point of interest from the location of the IP
when the kernel finally stops the application.
For the ``timer`` backend, this translates to the
difference in the IP between when the timer generated a signal and when the
@@ -172,9 +168,9 @@ especially in ``line`` mode.
Installing Linux Perf
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Linux Perf is built into the kernel and may already be installed
Linux Perf is built into the kernel and may already be installed
(for instance, it is included in the default kernel for OpenSUSE).
The official method of checking whether Linux Perf is installed is
The official method of checking whether Linux Perf is installed is
checking for the existence of the file
``/proc/sys/kernel/perf_event_paranoid``. If the file exists, the kernel has Perf installed.
@@ -184,12 +180,12 @@ If this file does not exist, as with Debian-based systems like Ubuntu, run the f
apt-get install linux-tools-common linux-tools-generic linux-tools-$(uname -r)
and reboot your computer. In order to use the ``perf`` backend, the value
and reboot your computer. In order to use the ``perf`` backend, the value
of ``/proc/sys/kernel/perf_event_paranoid``
should be less than or equal to 2. If the value in this file is greater than 2, you can't
should be less than or equal to 2. If the value in this file is greater than 2, you can't
use the ``perf`` backend.
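As a convenience, the two checks above (does the file exist, and is its value at most 2) can be scripted. The following is a minimal sketch using only the Python standard library; the function name ``perf_backend_status`` is an illustrative choice and is not part of ROCm Systems Profiler.

```python
from pathlib import Path

PARANOID_FILE = Path("/proc/sys/kernel/perf_event_paranoid")


def perf_backend_status(path=PARANOID_FILE):
    """Describe whether the perf backend looks usable on this machine."""
    if not path.exists():
        return "Linux Perf does not appear to be installed"
    level = int(path.read_text().strip())
    if level <= 2:
        return f"paranoid level {level}: perf backend can be used"
    return f"paranoid level {level}: perf backend cannot be used"


if __name__ == "__main__":
    print(perf_backend_status())
```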
To update the paranoid level temporarily until the system is rebooted, run
To update the paranoid level temporarily until the system is rebooted, run
one of the following commands
as a superuser (where ``PARANOID_LEVEL=<N>`` has a value of ``<N>`` in the range ``[-1, 2]``):
@@ -206,18 +202,18 @@ or
To make the paranoid level persistent after a reboot, add ``kernel.perf_event_paranoid=<N>``
(where ``<N>`` is the desired paranoid level) to the ``/etc/sysctl.conf`` file.
Speed-up prediction variability and the omnitrace-causal executable
Speed-up prediction variability and the rocprof-sys-causal executable
-----------------------------------------------------------------------
Causal profiling typically requires running the application several times in
order to adequately sample all the code domains, experiment
Causal profiling typically requires running the application several times in
order to adequately sample all the code domains, experiment
with speed-ups and other techniques, and resolve statistical fluctuations.
The ``omnitrace-causal`` executable is designed to simplify this procedure:
The ``rocprof-sys-causal`` executable is designed to simplify this procedure:
.. code-block:: shell
$ omnitrace-causal --help
[omnitrace-causal] Usage: ./bin/omnitrace-causal [ --help (count: 0, dtype: bool)
$ rocprof-sys-causal --help
[rocprof-sys-causal] Usage: ./bin/rocprof-sys-causal [ --help (count: 0, dtype: bool)
--version (count: 0, dtype: bool)
--monochrome (max: 1, dtype: bool)
--debug (max: 1, dtype: bool)
@@ -246,21 +242,21 @@ The ``omnitrace-causal`` executable is designed to simplify this procedure:
This executable is designed to streamline that process.
For example (assume all commands end with \'-- <exe> <args>\'):
omnitrace-causal -n 5 -- <exe> # runs <exe> 5x with causal profiling enabled
rocprof-sys-causal -n 5 -- <exe> # runs <exe> 5x with causal profiling enabled
omnitrace-causal -s 0 5,10,15,20 # runs <exe> 2x with virtual speedups:
rocprof-sys-causal -s 0 5,10,15,20 # runs <exe> 2x with virtual speedups:
# - 0
# - randomly selected from 5, 10, 15, and 20
omnitrace-causal -F func_A func_B func_(A|B) # runs <exe> 3x with the function scope limited to:
rocprof-sys-causal -F func_A func_B func_(A|B) # runs <exe> 3x with the function scope limited to:
# 1. func_A
# 2. func_B
# 3. func_A or func_B
General tips:
- Insert progress points at hotspots in your code or use omnitrace\'s runtime instrumentation
- Insert progress points at hotspots in your code or use rocprof-sys\'s runtime instrumentation
- Note: binary rewrite will produce a incompatible new binary
- Run omnitrace-causal in "function" mode first (does not require debug info)
- Run omnitrace-causal in "line" mode when you are targeting one function (requires debug info)
- Run rocprof-sys-causal in "function" mode first (does not require debug info)
- Run rocprof-sys-causal in "line" mode when you are targeting one function (requires debug info)
- Preferably, use predictions from the "function" mode to determine which function to target
- Limit the virtual speedups to a smaller pool, e.g., 0,5,10,25,50, to get reliable predictions quicker
- Make use of the binary, source, and function scope to limit the functions/lines selected for experiments
@@ -280,15 +276,15 @@ The ``omnitrace-causal`` executable is designed to simplify this procedure:
[GENERAL OPTIONS]
-c, --config Base configuration file
-l, --launcher When running MPI jobs, omnitrace-causal needs to be *before* the executable which launches the MPI processes (i.e.
-l, --launcher When running MPI jobs, rocprof-sys-causal needs to be *before* the executable which launches the MPI processes (i.e.
before `mpirun`, `srun`, etc.). Pass the name of the target executable (or a regex for matching to the name of the
target) for causal profiling, e.g., `omnitrace-causal -l foo -- mpirun -n 4 foo`. This ensures that the omnitrace
target) for causal profiling, e.g., `rocprof-sys-causal -l foo -- mpirun -n 4 foo`. This ensures that the rocprof-sys
library is LD_PRELOADed on the proper target
-g, --generate-configs Generate config files instead of passing environment variables directly. If no arguments are provided, the config files
will be placed in ${PWD}/omnitrace-causal-config folder
will be placed in ${PWD}/rocprof-sys-causal-config folder
--no-defaults Do not activate default features which are recommended for causal profiling. For example: PID-tagging of output files
and timestamped subdirectories are disabled by default. Kokkos tools support is added by default
(OMNITRACE_USE_KOKKOSP=ON) because, for Kokkos applications, the Kokkos-Tools callbacks are used for progress points.
(ROCPROFSYS_USE_KOKKOSP=ON) because, for Kokkos applications, the Kokkos-Tools callbacks are used for progress points.
Activation of OpenMP tools support is similar
[CAUSAL PROFILING OPTIONS (General)]
@@ -335,20 +331,20 @@ Examples
#!/bin/bash -e
module load omnitrace
module load rocprofiler-systems
N=20
I=3
# when providing speedups to omnitrace-causal, speedup
# when providing speedups to rocprof-sys-causal, speedup
# groups are separated by a space so "0,10" results in
# one speedup group where omnitrace samples from
# one speedup group where rocprof-sys samples from
# the speedup set of {0, 10}. Passing "0 10" (without
# quotes) to omnitrace-causal multiplies the
# quotes) to rocprof-sys-causal multiplies the
# number of runs by 2, where the first half of the
# runs instruct omnitrace to only use 0 as the
# runs instruct rocprof-sys to only use 0 as the
# speedup and the second half of the runs instruct
# omnitrace to only use 10 as the speedup.
# rocprof-sys to only use 10 as the speedup.
SPEEDUPS="0,0,0,10,20,30,40,50,50,75,75,75,90,90,90"
# thus, -s ${SPEEDUPS} only multiplies the number
# of runs by 1 whereas -S ${SPEEDUPS_E2E} multiplies
@@ -370,14 +366,14 @@ Examples
#
# total executions: 20
#
omnitrace-causal \
rocprof-sys-causal \
-n ${N} \
-s ${SPEEDUPS} \
-m function \
-o experiments.func \
-S ".*\\.cpp" \
-- \
./causal-omni-cpu "${@}"
./causal-rocprofsys-cpu "${@}"
# 20 iterations in line mode with 1 speedup group
@@ -390,14 +386,14 @@ Examples
#
# total executions: 20
#
omnitrace-causal \
rocprof-sys-causal \
-n ${N} \
-s ${SPEEDUPS} \
-m line \
-o experiments.line \
-S "causal\\.cpp:(100|110)" \
-- \
./causal-omni-cpu "${@}"
./causal-rocprofsys-cpu "${@}"
# 3 iterations in function mode of 15 singular speedups
@@ -411,7 +407,7 @@ Examples
#
# total executions: 90
#
omnitrace-causal \
rocprof-sys-causal \
-n ${I} \
-s ${SPEEDUPS_E2E} \
-m func \
@@ -420,7 +416,7 @@ Examples
-F "cpu_slow_func" \
"cpu_fast_func" \
-- \
./causal-omni-cpu "${@}"
./causal-rocprofsys-cpu "${@}"
# 3 iterations in line mode of 15 singular speedups
# in end-to-end mode with 2 different source scopes
@@ -433,7 +429,7 @@ Examples
#
# total executions: 90
#
omnitrace-causal \
rocprof-sys-causal \
-n ${I} \
-s ${SPEEDUPS_E2E} \
-m line \
@@ -442,7 +438,7 @@ Examples
-S "causal\\.cpp:100" \
"causal\\.cpp:110" \
-- \
./causal-omni-cpu "${@}"
./causal-rocprofsys-cpu "${@}"
export OMP_NUM_THREADS=8
@@ -468,7 +464,7 @@ Examples
# existing causal/experiments.func.(coz|json)
# file due to "--reset" argument
#
omnitrace-causal \
rocprof-sys-causal \
--reset \
-n ${N} \
-s ${SPEEDUPS} \
@@ -477,7 +473,7 @@ Examples
-S "lulesh.*" \
-FE "^(Kokkos::|std::enable_if)" \
-- \
./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p
./lulesh-rocprofsys -i 50 -s 200 -r 20 -b 5 -c 5 -p
# 5 iterations in line mode of 1 speedup
@@ -498,7 +494,7 @@ Examples
# existing causal/experiments.line.(coz|json)
# file due to "--reset" argument
#
omnitrace-causal \
rocprof-sys-causal \
--reset \
-n ${N} \
-s ${SPEEDUPS} \
@@ -507,7 +503,7 @@ Examples
-S "lulesh.*" \
-FE "^(exec_range|execute);construct_shared_allocation;\\._omp_fn\\." \
-- \
./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p
./lulesh-rocprofsys -i 50 -s 200 -r 20 -b 5 -c 5 -p
# 5 iterations in line mode of 1 speedup
@@ -528,7 +524,7 @@ Examples
# existing causal/experiments.line.(coz|json)
# file due to "--reset" argument
#
omnitrace-causal \
rocprof-sys-causal \
--reset \
-n ${N} \
-s ${SPEEDUPS} \
@@ -539,30 +535,30 @@ Examples
"CalcVolumeForceForElems" \
-S "lulesh\\.cc" \
-- \
./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p
./lulesh-rocprofsys -i 50 -s 200 -r 20 -b 5 -c 5 -p
Using omnitrace-causal with other launchers like mpirun
Using rocprof-sys-causal with other launchers like mpirun
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``omnitrace-causal`` executable is intended to assist with application replay
The ``rocprof-sys-causal`` executable is intended to assist with application replay
and is designed to always be at the start of the command line as the primary process.
``omnitrace-causal`` typically adds an ``LD_PRELOAD`` of the Omnitrace libraries
``rocprof-sys-causal`` typically adds an ``LD_PRELOAD`` of the ROCm Systems Profiler libraries
into the environment before launching the command to inject the functionality
required to start the causal profiling tooling. However, this is problematic
when the target application for causal profiling uses a launcher, in which case
it is listed as an argument rather than as the main application. For example,
``foo`` is the target application for profiling, but the command to run it is
``mpirun -n 2 foo``. Running the command ``omnitrace-causal -- mpirun -n 2 foo``
applies the causal profiling to ``mpirun`` instead of ``foo``.
required to start the causal profiling tooling. However, this is problematic
when the target application for causal profiling uses a launcher, in which case
it is listed as an argument rather than as the main application. For example,
``foo`` is the target application for profiling, but the command to run it is
``mpirun -n 2 foo``. Running the command ``rocprof-sys-causal -- mpirun -n 2 foo``
applies the causal profiling to ``mpirun`` instead of ``foo``.
``omnitrace-causal`` remedies this by providing a command-line option ``-l`` / ``--launcher``
to indicate the target application is using a launcher script/executable. The
``rocprof-sys-causal`` remedies this by providing a command-line option ``-l`` / ``--launcher``
to indicate the target application is using a launcher script/executable. The
argument to the command-line option is the name of, or regular expression for, the target application
on the command line. When ``--launcher`` is used, ``omnitrace-causal`` generates
on the command line. When ``--launcher`` is used, ``rocprof-sys-causal`` generates
all the replay configurations and runs them but delays adding the ``LD_PRELOAD``. Instead it
inserts a call to itself into the command line right before the target
inserts a call to itself into the command line right before the target
application. This recursive call inherits the configuration from
the parent ``omnitrace-causal`` executable, inserts an ``LD_PRELOAD`` into the environment,
the parent ``rocprof-sys-causal`` executable, inserts an ``LD_PRELOAD`` into the environment,
and calls ``execv`` to replace itself with the new process launched by the target
application.
@@ -570,32 +566,32 @@ In other words, the following command:
.. code-block:: shell
omnitrace-causal -l foo -n 3 -- mpirun -n 2 foo
rocprof-sys-causal -l foo -n 3 -- mpirun -n 2 foo
Effectively results in:
.. code-block:: shell
mpirun -n 2 omnitrace-causal -- foo
mpirun -n 2 omnitrace-causal -- foo
mpirun -n 2 omnitrace-causal -- foo
mpirun -n 2 rocprof-sys-causal -- foo
mpirun -n 2 rocprof-sys-causal -- foo
mpirun -n 2 rocprof-sys-causal -- foo
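The command-line rewrite shown above can be illustrated with a short Python sketch. It is not the tool's implementation: the helper name ``insert_before_target`` is hypothetical, and matching the launcher pattern with ``re.search`` against each argument is an assumption made for illustration.

```python
import re


def insert_before_target(command, launcher_pattern, tool="rocprof-sys-causal"):
    """Return a copy of command with `tool --` placed before the first matching argument."""
    regex = re.compile(launcher_pattern)
    for index, arg in enumerate(command):
        if regex.search(arg):
            return command[:index] + [tool, "--"] + command[index:]
    raise ValueError(f"no argument matches {launcher_pattern!r}")


if __name__ == "__main__":
    original = ["mpirun", "-n", "2", "foo"]
    print(" ".join(insert_before_target(original, "foo")))
```

Each of the three replay runs in the example would use a command line of this rewritten form.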
Visualizing the causal output
-------------------------------------------------------------------------
Omnitrace generates ``causal/experiments.json`` and ``causal/experiments.coz`` in
``${OMNITRACE_OUTPUT_PATH}/${OMNITRACE_OUTPUT_PREFIX}``. Visit
ROCm Systems Profiler generates ``causal/experiments.json`` and ``causal/experiments.coz`` in
``${ROCPROFSYS_OUTPUT_PATH}/${ROCPROFSYS_OUTPUT_PREFIX}``. Visit
`plasma-umass.org/coz <https://plasma-umass.org/coz/>`_ to open the ``*.coz`` file.
Omnitrace versus Coz
ROCm Systems Profiler versus Coz
=======================================
This comparison is intended for readers who are familiar with the
This comparison is intended for readers who are familiar with the
`Coz profiler <https://github.com/plasma-umass/coz>`_.
Omnitrace provides several additional features and utilities for causal profiling:
ROCm Systems Profiler provides several additional features and utilities for causal profiling:
.. csv-table::
:header: "Feature", "Coz", "Omnitrace", "Notes"
.. csv-table::
:header: "Feature", "Coz", "ROCm Systems Profiler", "Notes"
:widths: 20, 60, 60, 30
"Debug info", "requires debug info in DWARF v3 format (``-gdwarf-3``)", "optional, supports any DWARF format version", "See Note #1 below"
@@ -608,23 +604,23 @@ Omnitrace provides several additional features and utilities for causal profilin
.. note::
#. Omnitrace supports a "function" mode which does not require debug info.
#. Omnitrace supports selecting an entire range of instruction pointers for a function instead
#. ROCm Systems Profiler supports a "function" mode which does not require debug info.
#. ROCm Systems Profiler supports selecting an entire range of instruction pointers for a function instead
of an instruction pointer for one line. In large code bases, "function" mode
can resolve in fewer iterations. After a target function is identified, you can
can resolve in fewer iterations. After a target function is identified, you can
switch to line mode and limit the function scope to the target function.
#. Omnitrace supports randomly sampling from subsets, e.g. { 0, 0, 5, 10 }
#. ROCm Systems Profiler supports randomly sampling from subsets, e.g. { 0, 0, 5, 10 }
where 0% is randomly selected 50% of the time and 5% and 10% are each randomly selected 25% of the time.
#. Omnitrace and COZ have the same definition for binary scope, which is the binaries
#. ROCm Systems Profiler and COZ have the same definition for binary scope, which is the binaries
loaded at runtime (the executable and linked libraries).
#. Omnitrace "source scope" supports both ``<file>`` and ``<file>:<line>`` formats
#. ROCm Systems Profiler "source scope" supports both ``<file>`` and ``<file>:<line>`` formats
in contrast to the COZ "source scope" which requires ``<file>:<line>`` format.
#. Omnitrace supports a "function" scope which narrows the function and lines
#. ROCm Systems Profiler supports a "function" scope which narrows the function and lines
which are eligible for causal experiments to those within the matching functions.
#. Omnitrace supports a second filter on scopes for removing binary/source/function
#. ROCm Systems Profiler supports a second filter on scopes for removing binary/source/function
caught by an inclusive match. For example ``BINARY_SCOPE=.*`` and ``BINARY_EXCLUDE=libmpi.*``
initially includes all binaries, but the exclude regex then removes the MPI libraries.
#. In Omnitrace, the Linux Perf backend is preferred over using libunwind. However,
#. In ROCm Systems Profiler, the Linux Perf backend is preferred over using libunwind. However,
Linux Perf usage can be restricted for security reasons.
Omnitrace falls back to using a second POSIX timer and libunwind if
ROCm Systems Profiler falls back to using a second POSIX timer and libunwind if
Linux Perf is not available.
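To make the weighting in note 3 concrete, the following Python sketch computes the selection probability of each speed-up in a pool. It assumes each entry in the pool is equally likely to be drawn, so repeated values are proportionally more likely; the function name ``selection_probabilities`` is hypothetical.

```python
import random
from collections import Counter


def selection_probabilities(pool):
    """Return the chance of drawing each distinct speed-up from the pool."""
    counts = Counter(pool)
    return {value: count / len(pool) for value, count in counts.items()}


if __name__ == "__main__":
    pool = [0, 0, 5, 10]
    print(selection_probabilities(pool))
    print(random.choice(pool))  # one uniformly random draw from the pool
```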
+53 -53
@@ -1,80 +1,80 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
:description: ROCm Systems Profiler Python profiling documentation and reference
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, Python, profiling Python, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
****************************************************
Profiling Python scripts
****************************************************
`Omnitrace <https://github.com/ROCm/omnitrace>`_ supports profiling Python code at the
`ROCm Systems Profiler <https://github.com/ROCm/rocprofiler-systems>`_ supports profiling Python code at the
source level and the script level.
Python support is enabled via the ``OMNITRACE_USE_PYTHON`` and the
``OMNITRACE_PYTHON_VERSIONS="<MAJOR>.<MINOR>`` CMake options.
Alternatively, to build multiple Python versions, use
``OMNITRACE_PYTHON_VERSIONS="<MAJOR>.<MINOR>;[<MAJOR>.<MINOR>]"``,
and ``OMNITRACE_PYTHON_ROOT_DIRS="/path/to/version;[/path/to/version]"`` instead of ``OMNITRACE_PYTHON_VERSION``.
When building multiple Python versions, the length of the ``OMNITRACE_PYTHON_VERSIONS``
and ``OMNITRACE_PYTHON_ROOT_DIRS`` lists must
Python support is enabled via the ``ROCPROFSYS_USE_PYTHON`` and the
``ROCPROFSYS_PYTHON_VERSIONS="<MAJOR>.<MINOR>"`` CMake options.
Alternatively, to build multiple Python versions, use
``ROCPROFSYS_PYTHON_VERSIONS="<MAJOR>.<MINOR>;[<MAJOR>.<MINOR>]"``,
and ``ROCPROFSYS_PYTHON_ROOT_DIRS="/path/to/version;[/path/to/version]"`` instead of ``ROCPROFSYS_PYTHON_VERSION``.
When building multiple Python versions, the length of the ``ROCPROFSYS_PYTHON_VERSIONS``
and ``ROCPROFSYS_PYTHON_ROOT_DIRS`` lists must
be the same size.
.. note::
When using Omnitrace with Python programs, the Python interpreter major and minor version (e.g. 3.7)
When using ROCm Systems Profiler with Python programs, the Python interpreter major and minor version (e.g. 3.7)
must match the interpreter major and minor version
used when compiling the Python bindings. When building Omnitrace,
the shared object file ``libpyomnitrace.<IMPL>-<VERSION>-<ARCH>-<OS>-<ABI>.so`` is generated
where ``IMPL`` is the Python implementation, ``VERSION`` is the major and minor
used when compiling the Python bindings. When building ROCm Systems Profiler,
the shared object file ``libpyrocprofsys.<IMPL>-<VERSION>-<ARCH>-<OS>-<ABI>.so`` is generated
where ``IMPL`` is the Python implementation, ``VERSION`` is the major and minor
version, ``ARCH`` is the architecture,
``OS`` is the operating system, and ``ABI`` is the application binary interface,
for example, ``libpyomnitrace.cpython-38-x86_64-linux-gnu.so``.
``OS`` is the operating system, and ``ABI`` is the application binary interface,
for example, ``libpyrocprofsys.cpython-38-x86_64-linux-gnu.so``.
Getting Started
========================================
The Omnitrace Python package is installed in ``lib/pythonX.Y/site-packages/omnitrace``.
To ensure the Python interpreter can find the Omnitrace package,
The ROCm Systems Profiler Python package is installed in ``lib/pythonX.Y/site-packages/rocprofsys``.
To ensure the Python interpreter can find the ROCm Systems Profiler package,
add this path to the ``PYTHONPATH`` environment variable, as in the following example:
.. code-block:: shell
export PYTHONPATH=/opt/omnitrace/lib/python3.8/site-packages:${PYTHONPATH}
export PYTHONPATH=/opt/rocprofiler-systems/lib/python3.8/site-packages:${PYTHONPATH}
Both the ``share/omnitrace/setup-env.sh`` script and the module file in
``share/modulefiles/omnitrace`` automatically handle the prefixing of the ``PYTHONPATH``
Both the ``share/rocprofiler-systems/setup-env.sh`` script and the module file in
``share/modulefiles/rocprofiler-systems`` automatically handle the prefixing of the ``PYTHONPATH``
environment variable.
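Because the package directory depends on the interpreter's major and minor version, it can help to compute the expected path for the interpreter you are about to run. This is a minimal sketch: the default prefix ``/opt/rocprofiler-systems`` is taken from the example above and may differ on your system, and ``site_packages_dir`` is a hypothetical helper name.

```python
import sys
from pathlib import Path


def site_packages_dir(prefix="/opt/rocprofiler-systems"):
    """Build the lib/pythonX.Y/site-packages path for the running interpreter."""
    version = f"python{sys.version_info.major}.{sys.version_info.minor}"
    return Path(prefix) / "lib" / version / "site-packages"


if __name__ == "__main__":
    candidate = site_packages_dir()
    print(candidate, "(exists)" if candidate.is_dir() else "(not found)")
```

If the directory is reported as not found, the bindings were most likely built for a different Python version than the one running the script.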
Running Omnitrace on a Python script
Running ROCm Systems Profiler on a Python script
========================================
Omnitrace provides an ``omnitrace-python`` helper bash script which
ROCm Systems Profiler provides a ``rocprof-sys-python`` helper bash script which
ensures ``PYTHONPATH`` is properly set and the correct Python interpreter is used.
This means the following commands are effectively equivalent:
.. code-block:: shell
omnitrace-python --help
rocprof-sys-python --help
and
.. code-block:: shell
export PYTHONPATH=/opt/omnitrace/lib/python3.8/site-packages:${PYTHONPATH}
python3.8 -m omnitrace --help
export PYTHONPATH=/opt/rocprofiler-systems/lib/python3.8/site-packages:${PYTHONPATH}
python3.8 -m rocprofsys --help
.. note::
``omnitrace-python`` and ``python -m omnitrace`` use the same command-line syntax
as the other ``omnitrace`` executables (``omnitrace-python <OMNITRACE_ARGS> -- <SCRIPT> <SCRIPT_ARGS>``)
``rocprof-sys-python`` and ``python -m rocprofsys`` use the same command-line syntax
as the other ``rocprof-sys`` executables (``rocprof-sys-python <ROCPROFSYS_ARGS> -- <SCRIPT> <SCRIPT_ARGS>``)
and have similar options.
Command line options
-----------------------------------
Use ``omnitrace-python --help`` to view the available options:
Use ``rocprof-sys-python --help`` to view the available options:
.. code-block:: shell
usage: omnitrace [-h] [-v VERBOSITY] [-b] [-c FILE] [-s FILE] [-F [BOOL]] [--label [{args,file,line} [{args,file,line} ...]]] [-I FUNC [FUNC ...]] [-E FUNC [FUNC ...]] [-R FUNC [FUNC ...]] [-MI FILE [FILE ...]] [-ME FILE [FILE ...]] [-MR FILE [FILE ...]] [--trace-c [BOOL]]
usage: rocprof-sys [-h] [-v VERBOSITY] [-b] [-c FILE] [-s FILE] [-F [BOOL]] [--label [{args,file,line} [{args,file,line} ...]]] [-I FUNC [FUNC ...]] [-E FUNC [FUNC ...]] [-R FUNC [FUNC ...]] [-MI FILE [FILE ...]] [-ME FILE [FILE ...]] [-MR FILE [FILE ...]] [--trace-c [BOOL]]
optional arguments:
-h, --help show this help message and exit
@@ -82,7 +82,7 @@ Use ``omnitrace-python --help`` to view the available options:
Logging verbosity
-b, --builtin Put 'profile' in the builtins. Use '@profile' to decorate a single function, or 'with profile:' to profile a single section of code.
-c FILE, --config FILE
OmniTrace configuration file
ROCm Systems Profiler configuration file
-s FILE, --setup FILE
Code to execute before the code to profile
-F [BOOL], --full-filepath [BOOL]
@@ -103,19 +103,19 @@ Use ``omnitrace-python --help`` to view the available options:
Select only entries from these files
--trace-c [BOOL] Enable profiling C functions
usage: python3 -m omnitrace <OMNITRACE_ARGS> -- <SCRIPT> <SCRIPT_ARGS>
usage: python3 -m rocprofsys <ROCPROFSYS_ARGS> -- <SCRIPT> <SCRIPT_ARGS>
.. note::
The ``--trace-c`` option does not incorporate Omnitrace's dynamic instrumentation support.
The ``--trace-c`` option does not incorporate ROCm Systems Profiler's dynamic instrumentation support.
It only enables profiling the underlying C function call within the Python interpreter.
Selective instrumentation
-----------------------------------
Similar to the ``omnitrace-instrument`` executable, command-line options exist for restricting,
Similar to the ``rocprof-sys-instrument`` executable, command-line options exist for restricting,
including, and excluding certain functions and modules, for example, ``--function-exclude "^__init__$"``.
Alternatively, add the ``@profile`` decorator to the primary function of interest
Alternatively, add the ``@profile`` decorator to the primary function of interest
in your program and use the ``-b`` / ``--builtin`` command-line option to narrow the scope of the
instrumentation to this function and its children.
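
As a rough illustration of how a pattern such as ``^__init__$`` narrows the set of instrumented functions, the sketch below filters a list of function names with Python's ``re`` module. The ``select_functions`` helper and the sample names are hypothetical, and the use of search semantics is an assumption about how the exclude patterns are applied, not a description of the profiler's implementation.

```python
import re


def select_functions(names, exclude_patterns):
    """Return the names that match none of the exclude patterns.

    Hypothetical helper: re.search is an assumption about how the
    --function-exclude patterns are matched against function names.
    """
    compiled = [re.compile(pattern) for pattern in exclude_patterns]
    return [n for n in names if not any(c.search(n) for c in compiled)]


# Sample names for illustration only.
names = ["__init__", "run", "fib", "inefficient", "__init__helper"]
kept = select_functions(names, [r"^__init__$"])
print(kept)
```

Because the pattern is anchored at both ends, only the exact name ``__init__`` is dropped; a name that merely starts with ``__init__`` is kept.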
@@ -145,8 +145,8 @@ Consider the following Python code (``example.py``):
if __name__ == "__main__":
run(20)
Running ``omnitrace-python ./example.py`` with ``OMNITRACE_PROFILE=ON`` and
``OMNITRACE_TIMEMORY_COMPONENTS=trip_count`` produces the following:
Running ``rocprof-sys-python ./example.py`` with ``ROCPROFSYS_PROFILE=ON`` and
``ROCPROFSYS_TIMEMORY_COMPONENTS=trip_count`` produces the following:
.. code-block:: shell
@@ -187,7 +187,7 @@ If the ``inefficient`` function is decorated with ``@profile`` as follows:
def inefficient(n):
# ...
And then run using the command ``omnitrace-python -b -- ./example.py``, Omnitrace produces this output:
And then run using the command ``rocprof-sys-python -b -- ./example.py``, ROCm Systems Profiler produces this output:
.. code-block:: shell
@@ -199,37 +199,37 @@ And then run using the command ``omnitrace-python -b -- ./example.py``, Omnitrac
| |0>>> inefficient | 1 | 0 | trip_count | 1 |
|-----------------------------------------------------------|
Omnitrace Python source instrumentation
ROCm Systems Profiler Python source instrumentation
===================================================
Starting with the unmodified ``example.py`` script above, import the ``omnitrace`` module:
Starting with the unmodified ``example.py`` script above, import the ``rocprofsys`` module:
.. code-block:: python
import sys
import omnitrace # import omnitrace
import rocprofsys # import rocprofsys
def fib(n):
# ... etc. ...
Next, add ``@omnitrace.profile()`` to the ``run`` function:
Next, add ``@rocprofsys.profile()`` to the ``run`` function:
.. code-block:: python
@omnitrace.profile()
@rocprofsys.profile()
def run(n):
# ...
Alternatively, use ``omnitrace.profile()`` as a context-manager around ``run(20)``:
Alternatively, use ``rocprofsys.profile()`` as a context-manager around ``run(20)``:
.. code-block:: python
if __name__ == "__main__":
with omnitrace.profile():
with rocprofsys.profile():
run(20)
The results for both of the source-level instrumentation modes are identical to the
original ``omnitrace-python ./example.py`` results:
The results for both of the source-level instrumentation modes are identical to the
original ``rocprof-sys-python ./example.py`` results:
.. code-block:: shell
@@ -264,14 +264,14 @@ original ``omnitrace-python ./example.py`` results:
.. note::
When ``omnitrace-python`` is used without built-ins, the profiling results can be cluttered by the
When ``rocprof-sys-python`` is used without built-ins, the profiling results can be cluttered by the
numerous functions called when more complex modules are imported, such as ``import numpy``.
Omnitrace Python source instrumentation configuration
ROCm Systems Profiler Python source instrumentation configuration
-----------------------------------------------------------------
Within the Python source code, the profiler can be configured by directly
modifying the ``omnitrace.profiler.config`` data fields.
Within the Python source code, the profiler can be configured by directly
modifying the ``rocprofsys.profiler.config`` data fields.
.. code-block:: python
@@ -295,8 +295,8 @@ modifying the ``omnitrace.profiler.config`` data fields.
if __name__ == "__main__":
from omnitrace.profiler import config
from omnitrace import profile
from rocprofsys.profiler import config
from rocprofsys import profile
config.include_args = True
config.include_filename = False
+229 -228
@@ -1,77 +1,77 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
:description: ROCm Systems Profiler call stack sampling documentation and reference
:keywords: rocprofiler-systems, rocprofsys, ROCm, profiler, sampling, call stack, tracking, visualization, tool, Instinct, accelerator, AMD
****************************************************
Sampling the call stack
****************************************************
`Omnitrace <https://github.com/ROCm/omnitrace>`_ can use call-stack sampling
on a binary instrumented with either the ``omnitrace`` executable
or the ``omnitrace-sample`` executable.
`ROCm Systems Profiler <https://github.com/ROCm/rocprofiler-systems>`_ can use call-stack sampling
on a binary instrumented with either the ``rocprof-sys`` executable
or the ``rocprof-sys-sample`` executable.
For example, all of the following commands are effectively equivalent:
* Binary rewrite with only the instrumentation necessary to start and stop sampling
.. code-block:: shell
omnitrace-instrument -M sampling -o foo.inst -- foo
omnitrace-run -- ./foo.inst
rocprof-sys-instrument -M sampling -o foo.inst -- foo
rocprof-sys-run -- ./foo.inst
* Runtime instrumentation with only the instrumentation necessary to start and stop sampling
.. code-block:: shell
omnitrace-instrument -M sampling -- foo
rocprof-sys-instrument -M sampling -- foo
* No instrumentation required
.. code-block:: shell
omnitrace-sample -- foo
rocprof-sys-sample -- foo
.. note::
Set ``OMNITRACE_USE_SAMPLING=ON`` to activate call-stack sampling when executing an instrumented binary.
Set ``ROCPROFSYS_USE_SAMPLING=ON`` to activate call-stack sampling when executing an instrumented binary.
All ``omnitrace-instrument -M sampling`` (subsequently referred to as "instrumented-sampling")
All ``rocprof-sys-instrument -M sampling`` (subsequently referred to as "instrumented-sampling")
does is wrap the ``main`` of the executable with initialization
before ``main`` starts and finalization after ``main`` ends.
This can be accomplished without instrumentation through a ``LD_PRELOAD``
This can be accomplished without instrumentation through a ``LD_PRELOAD``
of a library containing a dynamic symbol wrapper around ``__libc_start_main``.
The use of ``omnitrace-sample`` is **recommended** over
``omnitrace-instrument -M sampling`` when binary instrumentation
The use of ``rocprof-sys-sample`` is **recommended** over
``rocprof-sys-instrument -M sampling`` when binary instrumentation
is not necessary. This is for a number of reasons:
* ``omnitrace-sample`` provides command-line options for controlling the Omnitrace feature set instead of
* ``rocprof-sys-sample`` provides command-line options for controlling the ROCm Systems Profiler feature set instead of
requiring configuration files or environment variables
* Despite the fact that instrumented-sampling only requires inserting snippets
* Despite the fact that instrumented-sampling only requires inserting snippets
around one function (``main``), Dyninst
does not have a feature for specifying that parsing and processing all the
does not have a feature for specifying that parsing and processing all the
other symbols in the binary is unnecessary.
In the best-case scenario when the target binary is relatively small,
In the best-case scenario when the target binary is relatively small,
instrumented-sampling has a slightly slower launch time,
but in the worst case scenarios it requires a significant amount of time and memory to launch.
* ``omnitrace-sample`` is fully compatible with MPI. For example,
the command ``mpirun -n 2 omnitrace-sample -- foo`` is valid,
whereas ``mpirun -n 2 omnitrace-instrument -M sampling -- foo``
* ``rocprof-sys-sample`` is fully compatible with MPI. For example,
the command ``mpirun -n 2 rocprof-sys-sample -- foo`` is valid,
whereas ``mpirun -n 2 rocprof-sys-instrument -M sampling -- foo``
is incompatible with some MPI distributions (particularly OpenMPI). This is because
MPI prohibits forking within an MPI rank.
* When MPI and binary instrumentation are both involved, two steps are required:
performing a binary rewrite of the executable and then using the instrumented executable
in lieu of the original executable. ``omnitrace-sample`` is therefore much easier to use with MPI.
performing a binary rewrite of the executable and then using the instrumented executable
in lieu of the original executable. ``rocprof-sys-sample`` is therefore much easier to use with MPI.
The omnitrace-sample executable
The rocprof-sys-sample executable
========================================
View the help menu of ``omnitrace-sample`` with the ``-h`` / ``--help`` option:
View the help menu of ``rocprof-sys-sample`` with the ``-h`` / ``--help`` option:
.. code-block:: shell
$ omnitrace-sample --help
[omnitrace-sample] Usage: omnitrace-sample [ --help (count: 0, dtype: bool)
$ rocprof-sys-sample --help
[rocprof-sys-sample] Usage: rocprof-sys-sample [ --help (count: 0, dtype: bool)
--version (count: 0, dtype: bool)
--monochrome (max: 1, dtype: bool)
--debug (max: 1, dtype: bool)
@@ -111,47 +111,47 @@ View the help menu of ``omnitrace-sample`` with the ``-h`` / ``--help`` option:
--gpu-events (count: unlimited)
--inlines (max: 1, dtype: bool)
--hsa-interrupt (count: 1, dtype: int)
]
]
Options:
-h, -?, --help Shows this page (count: 0, dtype: bool)
--version Prints the version and exit (count: 0, dtype: bool)
[DEBUG OPTIONS]
--monochrome Disable colorized output (max: 1, dtype: bool)
--debug Debug output (max: 1, dtype: bool)
-v, --verbose Verbose output (count: 1)
[GENERAL OPTIONS] These are options which are ubiquitously applied
-c, --config Configuration file (min: 0, dtype: filepath)
-o, --output Output path. Accepts 1-2 parameters corresponding to the output path and the output prefix (min: 1)
-T, --trace Generate a detailed trace (perfetto output) (max: 1, dtype: bool)
-P, --profile Generate a call-stack-based profile (conflicts with --flat-profile) (max: 1, dtype: bool)
-F, --flat-profile Generate a flat profile (conflicts with --profile) (max: 1, dtype: bool)
-H, --host Enable sampling host-based metrics for the process. E.g. CPU frequency, memory usage, etc. (max: 1, dtype: bool)
-D, --device Enable sampling device-based metrics for the process. E.g. GPU temperature, memory usage, etc. (max: 1, dtype: bool)
-w, --wait This option is a combination of '--trace-wait' and '--sampling-wait'. See the descriptions for those two options.
(count: 1)
-d, --duration This option is a combination of '--trace-duration' and '--sampling-duration'. See the descriptions for those two
options. (count: 1)
[TRACING OPTIONS] Specific options controlling tracing (i.e. deterministic measurements of every event)
--trace-file Specify the trace output filename. Relative filepath will be with respect to output path and output prefix. (count: 1,
dtype: filepath)
--trace-buffer-size Size limit for the trace output (in KB) (count: 1, dtype: KB)
-h, -?, --help Shows this page (count: 0, dtype: bool)
--version Prints the version and exit (count: 0, dtype: bool)
[DEBUG OPTIONS]
--monochrome Disable colorized output (max: 1, dtype: bool)
--debug Debug output (max: 1, dtype: bool)
-v, --verbose Verbose output (count: 1)
[GENERAL OPTIONS] These are options which are ubiquitously applied
-c, --config Configuration file (min: 0, dtype: filepath)
-o, --output Output path. Accepts 1-2 parameters corresponding to the output path and the output prefix (min: 1)
-T, --trace Generate a detailed trace (perfetto output) (max: 1, dtype: bool)
-P, --profile Generate a call-stack-based profile (conflicts with --flat-profile) (max: 1, dtype: bool)
-F, --flat-profile Generate a flat profile (conflicts with --profile) (max: 1, dtype: bool)
-H, --host Enable sampling host-based metrics for the process. E.g. CPU frequency, memory usage, etc. (max: 1, dtype: bool)
-D, --device Enable sampling device-based metrics for the process. E.g. GPU temperature, memory usage, etc. (max: 1, dtype: bool)
-w, --wait This option is a combination of '--trace-wait' and '--sampling-wait'. See the descriptions for those two options.
(count: 1)
-d, --duration This option is a combination of '--trace-duration' and '--sampling-duration'. See the descriptions for those two
options. (count: 1)
[TRACING OPTIONS] Specific options controlling tracing (i.e. deterministic measurements of every event)
--trace-file Specify the trace output filename. Relative filepath will be with respect to output path and output prefix. (count: 1,
dtype: filepath)
--trace-buffer-size Size limit for the trace output (in KB) (count: 1, dtype: KB)
--trace-fill-policy [ discard | ring_buffer ]
Policy for new data when the buffer size limit is reached:
- discard : new data is ignored
- ring_buffer : new data overwrites oldest data (count: 1)
--trace-wait Set the wait time (in seconds) before collecting trace and/or profiling data(in seconds). By default, the duration is
in seconds of realtime but that can changed via --trace-clock-id. (count: 1)
--trace-duration Set the duration of the trace and/or profile data collection (in seconds). By default, the duration is in seconds of
realtime but that can changed via --trace-clock-id. (count: 1)
--trace-periods More powerful version of specifying trace delay and/or duration. Format is one or more groups of: <DELAY>:<DURATION>,
<DELAY>:<DURATION>:<REPEAT>, and/or <DELAY>:<DURATION>:<REPEAT>:<CLOCK_ID>. (min: 1)
--trace-wait Set the wait time (in seconds) before collecting trace and/or profiling data(in seconds). By default, the duration is
in seconds of realtime but that can changed via --trace-clock-id. (count: 1)
--trace-duration Set the duration of the trace and/or profile data collection (in seconds). By default, the duration is in seconds of
realtime but that can changed via --trace-clock-id. (count: 1)
--trace-periods More powerful version of specifying trace delay and/or duration. Format is one or more groups of: <DELAY>:<DURATION>,
<DELAY>:<DURATION>:<REPEAT>, and/or <DELAY>:<DURATION>:<REPEAT>:<CLOCK_ID>. (min: 1)
--trace-clock-id [ 0 (realtime|CLOCK_REALTIME)
1 (monotonic|CLOCK_MONOTONIC)
2 (cputime|CLOCK_PROCESS_CPUTIME_ID)
@@ -159,40 +159,40 @@ View the help menu of ``omnitrace-sample`` with the ``-h`` / ``--help`` option:
5 (realtime_coarse|CLOCK_REALTIME_COARSE)
6 (monotonic_coarse|CLOCK_MONOTONIC_COARSE)
7 (boottime|CLOCK_BOOTTIME) ]
Set the default clock ID for for trace delay/duration. Note: "cputime" is the *process* CPU time and might need to be
scaled based on the number of threads, i.e. 4 seconds of CPU-time for an application with 4 fully active threads would
equate to ~1 second of realtime. If this proves to be difficult to handle in practice, please file a feature request
for omnitrace to auto-scale based on the number of threads. (count: 1)
[PROFILE OPTIONS] Specific options controlling profiling (i.e. deterministic measurements which are aggregated into a summary)
Set the default clock ID for for trace delay/duration. Note: "cputime" is the *process* CPU time and might need to be
scaled based on the number of threads, i.e. 4 seconds of CPU-time for an application with 4 fully active threads would
equate to ~1 second of realtime. If this proves to be difficult to handle in practice, please file a feature request
for rocprof-sys to auto-scale based on the number of threads. (count: 1)
[PROFILE OPTIONS] Specific options controlling profiling (i.e. deterministic measurements which are aggregated into a summary)
--profile-format [ console | json | text ]
Data formats for profiling results (min: 1)
--profile-diff Generate a diff output b/t the profile collected and an existing profile from another run Accepts 1-2 parameters
corresponding to the input path and the input prefix (min: 1)
Data formats for profiling results (min: 1)
--profile-diff Generate a diff output b/t the profile collected and an existing profile from another run Accepts 1-2 parameters
corresponding to the input path and the input prefix (min: 1)
[HOST/DEVICE (PROCESS SAMPLING) OPTIONS]
Process sampling is background measurements for resources available to the entire process. These samples are not tied
to specific lines/regions of code
--process-freq Set the default host/device sampling frequency (number of interrupts per second) (count: 1)
--process-wait Set the default wait time (i.e. delay) before taking first host/device sample (in seconds of realtime) (count: 1)
--process-duration Set the duration of the host/device sampling (in seconds of realtime) (count: 1)
--cpus CPU IDs for frequency sampling. Supports integers and/or ranges (count: unlimited, dtype: int or range)
--gpus GPU IDs for SMI queries. Supports integers and/or ranges (count: unlimited, dtype: int or range)
[GENERAL SAMPLING OPTIONS] General options for timer-based sampling per-thread
-f, --freq Set the default sampling frequency (number of interrupts per second) (count: 1)
--sampling-wait Set the default wait time (i.e. delay) before taking first sample (in seconds). This delay time is based on the clock
of the sampler, i.e., a delay of 1 second for CPU-clock sampler may not equal 1 second of realtime (count: 1)
--sampling-duration Set the duration of the sampling (in seconds of realtime). I.e., it is possible (currently) to set a CPU-clock time
delay that exceeds the real-time duration... resulting in zero samples being taken (count: 1)
-t, --tids Specify the default thread IDs for sampling, where 0 (zero) is the main thread and each thread created by the target
application is assigned an atomically incrementing value. (min: 1)
[SAMPLING TIMER OPTIONS] These options determine the heuristic for deciding when to take a sample
Process sampling is background measurements for resources available to the entire process. These samples are not tied
to specific lines/regions of code
--process-freq Set the default host/device sampling frequency (number of interrupts per second) (count: 1)
--process-wait Set the default wait time (i.e. delay) before taking first host/device sample (in seconds of realtime) (count: 1)
--process-duration Set the duration of the host/device sampling (in seconds of realtime) (count: 1)
--cpus CPU IDs for frequency sampling. Supports integers and/or ranges (count: unlimited, dtype: int or range)
--gpus GPU IDs for SMI queries. Supports integers and/or ranges (count: unlimited, dtype: int or range)
[GENERAL SAMPLING OPTIONS] General options for timer-based sampling per-thread
-f, --freq Set the default sampling frequency (number of interrupts per second) (count: 1)
--sampling-wait Set the default wait time (i.e. delay) before taking first sample (in seconds). This delay time is based on the clock
of the sampler, i.e., a delay of 1 second for CPU-clock sampler may not equal 1 second of realtime (count: 1)
--sampling-duration Set the duration of the sampling (in seconds of realtime). I.e., it is possible (currently) to set a CPU-clock time
delay that exceeds the real-time duration... resulting in zero samples being taken (count: 1)
-t, --tids Specify the default thread IDs for sampling, where 0 (zero) is the main thread and each thread created by the target
application is assigned an atomically incrementing value. (min: 1)
[SAMPLING TIMER OPTIONS] These options determine the heuristic for deciding when to take a sample
--cputime Sample based on a CPU-clock timer (default). Accepts zero or more arguments:
0. Enables sampling based on CPU-clock timer.
1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of CPU-time.
@@ -210,22 +210,22 @@ View the help menu of ``omnitrace-sample`` with the ``-h`` / ``--help`` option:
When sampling with a real-clock timer, please note that enabling this will cause threads which are typically "idle"
to consume more resources since, while idle, the real-clock time increases (and therefore triggers taking samples)
whereas the CPU-clock time does not. (min: 0)
[BACKEND OPTIONS] These options control region information captured w/o sampling or instrumentation
[BACKEND OPTIONS] These options control region information captured w/o sampling or instrumentation
-I, --include [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
Include data from these backends (count: unlimited)
Include data from these backends (count: unlimited)
-E, --exclude [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
Exclude data from these backends (count: unlimited)
[HARDWARE COUNTER OPTIONS] See also: omnitrace-avail -H
-C, --cpu-events Set the CPU hardware counter events to record (ref: `omnitrace-avail -H -c CPU`) (count: unlimited)
-G, --gpu-events Set the GPU hardware counter events to record (ref: `omnitrace-avail -H -c GPU`) (count: unlimited)
[MISCELLANEOUS OPTIONS]
-i, --inlines Include inline info in output when available (max: 1, dtype: bool)
Exclude data from these backends (count: unlimited)
[HARDWARE COUNTER OPTIONS] See also: rocprof-sys-avail -H
-C, --cpu-events Set the CPU hardware counter events to record (ref: `rocprof-sys-avail -H -c CPU`) (count: unlimited)
-G, --gpu-events Set the GPU hardware counter events to record (ref: `rocprof-sys-avail -H -c GPU`) (count: unlimited)
[MISCELLANEOUS OPTIONS]
-i, --inlines Include inline info in output when available (max: 1, dtype: bool)
--hsa-interrupt [ 0 | 1 ] Set the value of the HSA_ENABLE_INTERRUPT environment variable.
ROCm version 5.2 and older have a bug which will cause a deadlock if a sample is taken while waiting for the signal
that a kernel completed -- which happens when sampling with a real-clock timer. We require this option to be set to
@@ -235,147 +235,148 @@ View the help menu of ``omnitrace-sample`` with the ``-h`` / ``--help`` option:
0 avoid triggering the bug, potentially at the cost of reduced performance
1 do not modify how ROCm is notified about kernel completion (count: 1, dtype: int)
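
The ``--trace-periods`` format listed in the help output, ``<DELAY>:<DURATION>``, ``<DELAY>:<DURATION>:<REPEAT>``, or ``<DELAY>:<DURATION>:<REPEAT>:<CLOCK_ID>``, can be pictured with a small parser. This is only a sketch of the field layout: the ``parse_trace_period`` function is hypothetical, and the choices to read delay and duration as floats, repeat as an integer, and to leave omitted fields as ``None`` are assumptions rather than documented behavior.

```python
def parse_trace_period(spec):
    """Split one trace-period group into its named fields.

    Hypothetical helper; field types and defaults are assumptions.
    """
    fields = spec.split(":")
    if not 2 <= len(fields) <= 4:
        raise ValueError(f"expected 2 to 4 colon-separated fields: {spec!r}")
    period = {
        "delay": float(fields[0]),
        "duration": float(fields[1]),
        "repeat": None,    # omitted in the two-field form
        "clock_id": None,  # omitted unless all four fields are given
    }
    if len(fields) >= 3:
        period["repeat"] = int(fields[2])
    if len(fields) == 4:
        period["clock_id"] = fields[3]
    return period


print(parse_trace_period("1:2"))
print(parse_trace_period("0.5:10:3:realtime"))
```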
The general syntax for separating Omnitrace command-line arguments from the
following application arguments
is consistent with the LLVM style of using a stand-alone double hyphen (``--``).
The general syntax for separating ROCm Systems Profiler command-line arguments from the
following application arguments
is consistent with the LLVM style of using a stand-alone double hyphen (``--``).
All arguments preceding the double hyphen
are interpreted as belonging to Omnitrace and all arguments following it
are interpreted as belonging to ROCm Systems Profiler and all arguments following it
are interpreted as the
application and its arguments. The double hyphen is only necessary when passing
application and its arguments. The double hyphen is only necessary when passing
command-line arguments to a target
which also uses hyphens. For example, you can run ``omnitrace-sample ls``, but
to run ``ls -la``, use ``omnitrace-sample -- ls -la``.
which also uses hyphens. For example, you can run ``rocprof-sys-sample ls``, but
to run ``ls -la``, use ``rocprof-sys-sample -- ls -la``.
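
The double-hyphen convention described above can be sketched in a few lines of Python. The ``split_command_line`` function is hypothetical and simplified: when no separator is present it treats every argument as part of the target command, which matches the ``rocprof-sys-sample ls`` case but is otherwise an assumption about the real argument parser.

```python
def split_command_line(argv):
    """Split argv into (profiler_args, target_command) at the first '--'."""
    if "--" in argv:
        idx = argv.index("--")
        return argv[:idx], argv[idx + 1:]
    # Simplification: with no separator, everything is the target command.
    return [], list(argv)


profiler_args, target = split_command_line(["-PTDH", "-E", "all", "--", "ls", "-la"])
print(profiler_args)
print(target)
```

Only the first ``--`` is treated as the separator in this sketch, so a target application that itself accepts ``--`` still receives it unchanged.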
:doc:`Configuring the Omnitrace runtime options <./configuring-runtime-options>`
establishes the precedence of environment variable values over values specified
:doc:`Configuring the ROCm Systems Profiler runtime options <./configuring-runtime-options>`
establishes the precedence of environment variable values over values specified
in the configuration files. This enables
you to configure the Omnitrace runtime to your preferred default behavior
in a file such as ``~/.omnitrace.cfg`` and then easily override
those settings in the command line, for example, ``OMNITRACE_ENABLED=OFF omnitrace-sample -- foo``.
Similarly, the command-line arguments passed to ``omnitrace-sample`` take precedence
you to configure the ROCm Systems Profiler runtime to your preferred default behavior
in a file such as ``~/.rocprof-sys.cfg`` and then easily override
those settings in the command line, for example, ``ROCPROFSYS_ENABLED=OFF rocprof-sys-sample -- foo``.
Similarly, the command-line arguments passed to ``rocprof-sys-sample`` take precedence
over environment variables.
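
The precedence order described here (command-line arguments over environment variables over configuration-file values) can be sketched as a simple lookup. The ``resolve_setting`` function and the dictionaries passed to it are hypothetical stand-ins for the three sources; this is not how the runtime stores its settings.

```python
def resolve_setting(name, config_file, environ, cli):
    """Return the effective value of a setting.

    Sources are checked from highest to lowest precedence:
    command line, then environment, then configuration file.
    """
    for source in (cli, environ, config_file):
        if name in source:
            return source[name]
    return None


# Illustrative values only.
config_file = {"ROCPROFSYS_ENABLED": "ON", "ROCPROFSYS_TRACE": "OFF"}
environ = {"ROCPROFSYS_ENABLED": "OFF"}
cli = {"ROCPROFSYS_TRACE": "ON"}

print(resolve_setting("ROCPROFSYS_ENABLED", config_file, environ, cli))
print(resolve_setting("ROCPROFSYS_TRACE", config_file, environ, cli))
```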
All of the command-line options above correlate to one or more configuration
settings, for example, ``--cpu-events`` correlates to the ``OMNITRACE_PAPI_EVENTS`` configuration variable.
``omnitrace-sample`` processes the arguments and outputs a summary of its configuration
before running the target application.
All of the command-line options above correlate to one or more configuration
settings, for example, ``--cpu-events`` correlates to the ``ROCPROFSYS_PAPI_EVENTS`` configuration variable.
``rocprof-sys-sample`` processes the arguments and outputs a summary of its configuration
before running the target application.
The following snippets show how ``omnitrace-sample`` runs with various environment updates.
The following snippets show how ``rocprof-sys-sample`` runs with various environment updates.
* This snippet shows the environment updates when ``omnitrace-sample`` is invoked with no arguments:
* This snippet shows the environment updates when ``rocprof-sys-sample`` is invoked with no arguments:
.. code-block:: shell
$ omnitrace-sample -- ./parallel-overhead-locks 30 4 100
$ rocprof-sys-sample -- ./parallel-overhead-locks 30 4 100
HSA_TOOLS_LIB=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
HSA_TOOLS_LIB=/opt/rocprofiler-systems/lib/librocprof-sys-dl.so.1.7.1
HSA_TOOLS_REPORT_LOAD_FAILURE=1
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
OMNITRACE_USE_PROCESS_SAMPLING=false
OMNITRACE_USE_SAMPLING=true
OMP_TOOL_LIBRARIES=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
ROCP_TOOL_LIB=/opt/omnitrace/lib/libomnitrace.so.1.7.1
LD_PRELOAD=/opt/rocprofiler-systems/lib/librocprof-sys-dl.so.1.7.1
ROCPROFSYS_USE_PROCESS_SAMPLING=false
ROCPROFSYS_USE_SAMPLING=true
OMP_TOOL_LIBRARIES=/opt/rocprofiler-systems/lib/librocprof-sys-dl.so.1.7.1
ROCP_TOOL_LIB=/opt/rocprofiler-systems/lib/librocprof-sys.so.1.7.1
* The next snippet shows the environment updates when ``omnitrace-sample`` enables
* The next snippet shows the environment updates when ``rocprof-sys-sample`` enables
profiling, tracing, host process-sampling, device process-sampling, and all the available backends:
.. code-block:: shell
$ omnitrace-sample -PTDH -I all -- ./parallel-overhead-locks 30 4 100
$ rocprof-sys-sample -PTDH -I all -- ./parallel-overhead-locks 30 4 100
HSA_TOOLS_LIB=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
HSA_TOOLS_LIB=/opt/rocprofiler-systems/lib/librocprof-sys-dl.so.1.7.1
HSA_TOOLS_REPORT_LOAD_FAILURE=1
KOKKOS_PROFILE_LIBRARY=/opt/omnitrace/lib/libomnitrace.so.1.7.1
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
OMNITRACE_CPU_FREQ_ENABLED=true
OMNITRACE_TRACE_THREAD_LOCKS=true
OMNITRACE_TRACE_THREAD_RW_LOCKS=true
OMNITRACE_TRACE_THREAD_SPIN_LOCKS=true
OMNITRACE_USE_KOKKOSP=true
OMNITRACE_USE_MPIP=true
OMNITRACE_USE_OMPT=true
OMNITRACE_TRACE=true
OMNITRACE_USE_PROCESS_SAMPLING=true
OMNITRACE_USE_RCCLP=true
OMNITRACE_USE_ROCM_SMI=true
OMNITRACE_USE_ROCPROFILER=true
OMNITRACE_USE_ROCTRACER=true
OMNITRACE_USE_ROCTX=true
OMNITRACE_USE_SAMPLING=true
OMNITRACE_PROFILE=true
OMP_TOOL_LIBRARIES=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
ROCP_TOOL_LIB=/opt/omnitrace/lib/libomnitrace.so.1.7.1
KOKKOS_PROFILE_LIBRARY=/opt/rocprofiler-systems/lib/librocprof-sys.so.1.7.1
LD_PRELOAD=/opt/rocprofiler-systems/lib/librocprof-sys-dl.so.1.7.1
ROCPROFSYS_CPU_FREQ_ENABLED=true
ROCPROFSYS_TRACE_THREAD_LOCKS=true
ROCPROFSYS_TRACE_THREAD_RW_LOCKS=true
ROCPROFSYS_TRACE_THREAD_SPIN_LOCKS=true
ROCPROFSYS_USE_KOKKOSP=true
ROCPROFSYS_USE_MPIP=true
ROCPROFSYS_USE_OMPT=true
ROCPROFSYS_TRACE=true
ROCPROFSYS_USE_PROCESS_SAMPLING=true
ROCPROFSYS_USE_RCCLP=true
ROCPROFSYS_USE_ROCM_SMI=true
ROCPROFSYS_USE_ROCPROFILER=true
ROCPROFSYS_USE_ROCTRACER=true
ROCPROFSYS_USE_ROCTX=true
ROCPROFSYS_USE_SAMPLING=true
ROCPROFSYS_PROFILE=true
OMP_TOOL_LIBRARIES=/opt/rocprofiler-systems/lib/librocprof-sys-dl.so.1.7.1
ROCP_TOOL_LIB=/opt/rocprofiler-systems/lib/librocprof-sys.so.1.7.1
...
* The final snippet shows the environment updates when ``omnitrace-sample`` enables
* The final snippet shows the environment updates when ``rocprof-sys-sample`` enables
profiling, tracing, host process-sampling, and device process-sampling,
sets the output path to ``omnitrace-output`` and the output prefix to ``%tag%``, and disables
sets the output path to ``rocprof-sys-output`` and the output prefix to ``%tag%``, and disables
all the available backends:
.. code-block:: shell
$ omnitrace-sample -PTDH -E all -o omnitrace-output %tag% -- ./parallel-overhead-locks 30 4 100
$ rocprof-sys-sample -PTDH -E all -o rocprof-sys-output %tag% -- ./parallel-overhead-locks 30 4 100
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
OMNITRACE_CPU_FREQ_ENABLED=true
OMNITRACE_OUTPUT_PATH=omnitrace-output
OMNITRACE_OUTPUT_PREFIX=%tag%
OMNITRACE_TRACE_THREAD_LOCKS=false
OMNITRACE_TRACE_THREAD_RW_LOCKS=false
OMNITRACE_TRACE_THREAD_SPIN_LOCKS=false
OMNITRACE_USE_KOKKOSP=false
OMNITRACE_USE_MPIP=false
OMNITRACE_USE_OMPT=false
OMNITRACE_TRACE=true
OMNITRACE_USE_PROCESS_SAMPLING=true
OMNITRACE_USE_RCCLP=false
OMNITRACE_USE_ROCM_SMI=false
OMNITRACE_USE_ROCPROFILER=false
OMNITRACE_USE_ROCTRACER=false
OMNITRACE_USE_ROCTX=false
OMNITRACE_USE_SAMPLING=true
OMNITRACE_PROFILE=true
LD_PRELOAD=/opt/rocprofiler-systems/lib/librocprof-sys-dl.so.1.7.1
ROCPROFSYS_CPU_FREQ_ENABLED=true
ROCPROFSYS_OUTPUT_PATH=rocprof-sys-output
ROCPROFSYS_OUTPUT_PREFIX=%tag%
ROCPROFSYS_TRACE_THREAD_LOCKS=false
ROCPROFSYS_TRACE_THREAD_RW_LOCKS=false
ROCPROFSYS_TRACE_THREAD_SPIN_LOCKS=false
ROCPROFSYS_USE_KOKKOSP=false
ROCPROFSYS_USE_MPIP=false
ROCPROFSYS_USE_OMPT=false
ROCPROFSYS_TRACE=true
ROCPROFSYS_USE_PROCESS_SAMPLING=true
ROCPROFSYS_USE_RCCLP=false
ROCPROFSYS_USE_ROCM_SMI=false
ROCPROFSYS_USE_ROCPROFILER=false
ROCPROFSYS_USE_ROCTRACER=false
ROCPROFSYS_USE_ROCTX=false
ROCPROFSYS_USE_SAMPLING=true
ROCPROFSYS_PROFILE=true
...
An omnitrace-sample example
A rocprof-sys-sample example
========================================
Here is the full output from the previous
``omnitrace-sample -PTDH -E all -o omnitrace-output %tag% -- ./parallel-overhead-locks 30 4 100`` command:
Here is the full output from the previous
``rocprof-sys-sample -PTDH -E all -o rocprof-sys-output %tag% -- ./parallel-overhead-locks 30 4 100`` command:
.. code-block:: shell
$ omnitrace-sample -PTDH -E all -o omnitrace-output %tag% -c -- ./parallel-overhead-locks 30 4 100
$ rocprof-sys-sample -PTDH -E all -o rocprof-sys-output %tag% -c -- ./parallel-overhead-locks 30 4 100
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.11.3
OMNITRACE_CONFIG_FILE=
OMNITRACE_CPU_FREQ_ENABLED=true
OMNITRACE_OUTPUT_PATH=omnitrace-output
OMNITRACE_OUTPUT_PREFIX=%tag%
OMNITRACE_PROFILE=true
OMNITRACE_TRACE=true
OMNITRACE_TRACE_THREAD_LOCKS=false
OMNITRACE_TRACE_THREAD_RW_LOCKS=false
OMNITRACE_TRACE_THREAD_SPIN_LOCKS=false
OMNITRACE_USE_KOKKOSP=false
OMNITRACE_USE_MPIP=false
OMNITRACE_USE_OMPT=false
OMNITRACE_USE_PROCESS_SAMPLING=true
OMNITRACE_USE_RCCLP=false
OMNITRACE_USE_ROCM_SMI=false
OMNITRACE_USE_ROCPROFILER=false
OMNITRACE_USE_ROCTRACER=false
OMNITRACE_USE_ROCTX=false
OMNITRACE_USE_SAMPLING=true
[omnitrace][dl][1785877] omnitrace_main
[omnitrace][1785877][omnitrace_init_tooling] Instrumentation mode: Sampling
______ .___ ___. .__ __. __ .___________..______ ___ ______ _______
/ __ \ | \/ | | \ | | | | | || _ \ / \ / || ____|
| | | | | \ / | | \| | | | `---| |----`| |_) | / ^ \ | ,----'| |__
| | | | | |\/| | | . ` | | | | | | / / /_\ \ | | | __|
| `--' | | | | | | |\ | | | | | | |\ \----./ _____ \ | `----.| |____
\______/ |__| |__| |__| \__| |__| |__| | _| `._____/__/ \__\ \______||_______|
omnitrace v1.11.2 (rev: 2586b74db8bf335742600010b8d9f1ce8da9cf89, compiler: GNU v11.4.1, rocm: v6.1.x)
LD_PRELOAD=/opt/rocprofiler-systems/lib/librocprof-sys-dl.so.1.11.3
ROCPROFSYS_CONFIG_FILE=
ROCPROFSYS_CPU_FREQ_ENABLED=true
ROCPROFSYS_OUTPUT_PATH=rocprof-sys-output
ROCPROFSYS_OUTPUT_PREFIX=%tag%
ROCPROFSYS_PROFILE=true
ROCPROFSYS_TRACE=true
ROCPROFSYS_TRACE_THREAD_LOCKS=false
ROCPROFSYS_TRACE_THREAD_RW_LOCKS=false
ROCPROFSYS_TRACE_THREAD_SPIN_LOCKS=false
ROCPROFSYS_USE_KOKKOSP=false
ROCPROFSYS_USE_MPIP=false
ROCPROFSYS_USE_OMPT=false
ROCPROFSYS_USE_PROCESS_SAMPLING=true
ROCPROFSYS_USE_RCCLP=false
ROCPROFSYS_USE_ROCM_SMI=false
ROCPROFSYS_USE_ROCPROFILER=false
ROCPROFSYS_USE_ROCTRACER=false
ROCPROFSYS_USE_ROCTX=false
ROCPROFSYS_USE_SAMPLING=true
[rocprof-sys][dl][1785877] rocprofsys_main
[rocprof-sys][1785877][rocprofsys_init_tooling] Instrumentation mode: Sampling
__
_ __ ___ ___ _ __ _ __ ___ / _| ___ _ _ ___
| '__| / _ \ / __| | '_ \ | '__| / _ \ | |_ _____ / __| | | | | / __|
| | | (_) | | (__ | |_) | | | | (_) | | _| |_____| \__ \ | |_| | \__ \
|_| \___/ \___| | .__/ |_| \___/ |_| |___/ \__, | |___/
|_| |___/
rocprof-sys v1.11.2 (rev: 2586b74db8bf335742600010b8d9f1ce8da9cf89, compiler: GNU v11.4.1, rocm: v6.1.x)
[988.958] perfetto.cc:58649 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
[parallel-overhead-locks] Threads: 4
[parallel-overhead-locks] Iterations: 100
@@ -386,19 +387,19 @@ Here is the full output from the previous
[4] number of iterations: 100
[parallel-overhead-locks] fibonacci(30) x 4 = 409221992
[parallel-overhead-locks] number of mutex locks = 400
[omnitrace][1785877][0][omnitrace_finalize] finalizing...
[omnitrace][1785877][0][omnitrace_finalize]
[omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877 : 0.294342 sec wall_clock, 4.776 MB peak_rss, 3.170 MB page_rss, 0.990000 sec cpu_clock, 336.3 % cpu_util [laps: 1]
[omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/0 : 0.291535 sec wall_clock, 0.002619 sec thread_cpu_clock, 0.9 % thread_cpu_util, 4.776 MB peak_rss [laps: 1]
[omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/1 : 0.271353 sec wall_clock, 0.222572 sec thread_cpu_clock, 82.0 % thread_cpu_util, 4.200 MB peak_rss [laps: 1]
[omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/2 : 0.238218 sec wall_clock, 0.206405 sec thread_cpu_clock, 86.6 % thread_cpu_util, 3.432 MB peak_rss [laps: 1]
[omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/3 : 0.209459 sec wall_clock, 0.193415 sec thread_cpu_clock, 92.3 % thread_cpu_util, 2.472 MB peak_rss [laps: 1]
[omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/4 : 0.212029 sec wall_clock, 0.211694 sec thread_cpu_clock, 99.8 % thread_cpu_util, 1.152 MB peak_rss [laps: 1]
[omnitrace][1785877][0][omnitrace_finalize]
[omnitrace][1785877][0][omnitrace_finalize] Finalizing perfetto...
[omnitrace][1785877][perfetto]> Outputting '/home/user/code/omnitrace/build-release/omnitrace-output/2024-07-15_16.21/parallel-overhead-locksperfetto-trace-1785877.proto' (39.12 KB / 0.04 MB / 0.00 GB)... Done
[omnitrace][1785877][wall_clock]> Outputting 'omnitrace-output/2024-07-15_16.21/parallel-overhead-lockswall_clock-1785877.json'
[omnitrace][1785877][wall_clock]> Outputting 'omnitrace-output/2024-07-15_16.21/parallel-overhead-lockswall_clock-1785877.txt'
[omnitrace][1785877][metadata]> Outputting 'omnitrace-output/2024-07-15_16.21/parallel-overhead-locksmetadata-1785877.json' and 'omnitrace-output/2024-07-15_16.21/parallel-overhead-locksfunctions-1785877.json'
[omnitrace][1785877][0][omnitrace_finalize] Finalized: 0.054582 sec wall_clock, 0.000 MB peak_rss, -1.798 MB page_rss, 0.040000 sec cpu_clock, 73.3 % cpu_util
[rocprof-sys][1785877][0][rocprofsys_finalize] finalizing...
[rocprof-sys][1785877][0][rocprofsys_finalize]
[rocprof-sys][1785877][0][rocprofsys_finalize] rocprof-sys/process/1785877 : 0.294342 sec wall_clock, 4.776 MB peak_rss, 3.170 MB page_rss, 0.990000 sec cpu_clock, 336.3 % cpu_util [laps: 1]
[rocprof-sys][1785877][0][rocprofsys_finalize] rocprof-sys/process/1785877/thread/0 : 0.291535 sec wall_clock, 0.002619 sec thread_cpu_clock, 0.9 % thread_cpu_util, 4.776 MB peak_rss [laps: 1]
[rocprof-sys][1785877][0][rocprofsys_finalize] rocprof-sys/process/1785877/thread/1 : 0.271353 sec wall_clock, 0.222572 sec thread_cpu_clock, 82.0 % thread_cpu_util, 4.200 MB peak_rss [laps: 1]
[rocprof-sys][1785877][0][rocprofsys_finalize] rocprof-sys/process/1785877/thread/2 : 0.238218 sec wall_clock, 0.206405 sec thread_cpu_clock, 86.6 % thread_cpu_util, 3.432 MB peak_rss [laps: 1]
[rocprof-sys][1785877][0][rocprofsys_finalize] rocprof-sys/process/1785877/thread/3 : 0.209459 sec wall_clock, 0.193415 sec thread_cpu_clock, 92.3 % thread_cpu_util, 2.472 MB peak_rss [laps: 1]
[rocprof-sys][1785877][0][rocprofsys_finalize] rocprof-sys/process/1785877/thread/4 : 0.212029 sec wall_clock, 0.211694 sec thread_cpu_clock, 99.8 % thread_cpu_util, 1.152 MB peak_rss [laps: 1]
[rocprof-sys][1785877][0][rocprofsys_finalize]
[rocprof-sys][1785877][0][rocprofsys_finalize] Finalizing perfetto...
[rocprof-sys][1785877][perfetto]> Outputting '/home/user/code/rocprofiler-systems/build-release/rocprof-sys-output/2024-07-15_16.21/parallel-overhead-locksperfetto-trace-1785877.proto' (39.12 KB / 0.04 MB / 0.00 GB)... Done
[rocprof-sys][1785877][wall_clock]> Outputting 'rocprof-sys-output/2024-07-15_16.21/parallel-overhead-lockswall_clock-1785877.json'
[rocprof-sys][1785877][wall_clock]> Outputting 'rocprof-sys-output/2024-07-15_16.21/parallel-overhead-lockswall_clock-1785877.txt'
[rocprof-sys][1785877][metadata]> Outputting 'rocprof-sys-output/2024-07-15_16.21/parallel-overhead-locksmetadata-1785877.json' and 'rocprof-sys-output/2024-07-15_16.21/parallel-overhead-locksfunctions-1785877.json'
[rocprof-sys][1785877][0][rocprofsys_finalize] Finalized: 0.054582 sec wall_clock, 0.000 MB peak_rss, -1.798 MB page_rss, 0.040000 sec cpu_clock, 73.3 % cpu_util
[989.312] perfetto.cc:60128 Tracing session 1 ended, total sessions:0
@@ -1,63 +1,63 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
:description: ROCm Systems Profiler system output documentation and reference
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, profiler, system output, tracking, visualization, tool, Instinct, accelerator, AMD
****************************************************
Understanding the Omnitrace output
Understanding the ROCm Systems Profiler output
****************************************************
The general output form of `Omnitrace <https://github.com/ROCm/omnitrace>`_ is
The general output form of `ROCm Systems Profiler <https://github.com/ROCm/rocprofiler-systems>`_ is
``<OUTPUT_PATH>[/<TIMESTAMP>]/[<PREFIX>]<DATA_NAME>[-<OUTPUT_SUFFIX>].<EXT>``.
For example, starting with the following base configuration:
.. code-block:: shell
export OMNITRACE_OUTPUT_PATH=omnitrace-example-output
export OMNITRACE_TIME_OUTPUT=ON
export OMNITRACE_USE_PID=OFF
export OMNITRACE_PROFILE=ON
export OMNITRACE_TRACE=ON
export ROCPROFSYS_OUTPUT_PATH=rocprof-sys-example-output
export ROCPROFSYS_TIME_OUTPUT=ON
export ROCPROFSYS_USE_PID=OFF
export ROCPROFSYS_PROFILE=ON
export ROCPROFSYS_TRACE=ON
.. code-block:: shell
$ omnitrace-instrument -- ./foo
$ rocprof-sys-instrument -- ./foo
...
[omnitrace] Outputting 'omnitrace-example-output/perfetto-trace.proto'...
[rocprof-sys] Outputting 'rocprof-sys-example-output/perfetto-trace.proto'...
[omnitrace] Outputting 'omnitrace-example-output/wall-clock.txt'...
[omnitrace] Outputting 'omnitrace-example-output/wall-clock.json'...
[rocprof-sys] Outputting 'rocprof-sys-example-output/wall-clock.txt'...
[rocprof-sys] Outputting 'rocprof-sys-example-output/wall-clock.json'...
If the ``OMNITRACE_USE_PID`` option is enabled, then running a non-MPI executable
If the ``ROCPROFSYS_USE_PID`` option is enabled, then running a non-MPI executable
with a PID of ``63453`` results in the following output:
.. code-block:: shell
$ export OMNITRACE_USE_PID=ON
$ omnitrace-instrument -- ./foo
$ export ROCPROFSYS_USE_PID=ON
$ rocprof-sys-instrument -- ./foo
...
[omnitrace] Outputting 'omnitrace-example-output/perfetto-trace-63453.proto'...
[rocprof-sys] Outputting 'rocprof-sys-example-output/perfetto-trace-63453.proto'...
[omnitrace] Outputting 'omnitrace-example-output/wall-clock-63453.txt'...
[omnitrace] Outputting 'omnitrace-example-output/wall-clock-63453.json'...
[rocprof-sys] Outputting 'rocprof-sys-example-output/wall-clock-63453.txt'...
[rocprof-sys] Outputting 'rocprof-sys-example-output/wall-clock-63453.json'...
If ``OMNITRACE_TIME_OUTPUT`` is enabled, then a job that started on January 31, 2022 at 12:30 PM
If ``ROCPROFSYS_TIME_OUTPUT`` is enabled, then a job that started on January 31, 2022 at 12:30 PM
generates the following:
.. code-block:: shell
$ export OMNITRACE_TIME_OUTPUT=ON
$ omnitrace-instrument -- ./foo
$ export ROCPROFSYS_TIME_OUTPUT=ON
$ rocprof-sys-instrument -- ./foo
...
[omnitrace] Outputting 'omnitrace-example-output/2022-01-31_12.30_PM/perfetto-trace-63453.proto'...
[rocprof-sys] Outputting 'rocprof-sys-example-output/2022-01-31_12.30_PM/perfetto-trace-63453.proto'...
[omnitrace] Outputting 'omnitrace-example-output/2022-01-31_12.30_PM/wall-clock-63453.txt'...
[omnitrace] Outputting 'omnitrace-example-output/2022-01-31_12.30_PM/wall-clock-63453.json'...
[rocprof-sys] Outputting 'rocprof-sys-example-output/2022-01-31_12.30_PM/wall-clock-63453.txt'...
[rocprof-sys] Outputting 'rocprof-sys-example-output/2022-01-31_12.30_PM/wall-clock-63453.json'...
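The naming scheme above can be sketched in a few lines of Python. This is only an illustration of the ``<OUTPUT_PATH>[/<TIMESTAMP>]/[<PREFIX>]<DATA_NAME>[-<OUTPUT_SUFFIX>].<EXT>`` pattern, not the tool's implementation. The helper names ``timestamp_folder`` and ``output_filename`` are hypothetical, and the ``strftime`` format string is an assumption chosen only because it reproduces the ``2022-01-31_12.30_PM`` folder shown in the example.

```python
from datetime import datetime


def timestamp_folder(launch, time_format="%Y-%m-%d_%I.%M_%p"):
    """Format the launch time as a folder name (format string is an assumption)."""
    return launch.strftime(time_format)


def output_filename(output_path, data_name, ext, timestamp=None, prefix="", suffix=None):
    """Compose <OUTPUT_PATH>[/<TIMESTAMP>]/[<PREFIX>]<DATA_NAME>[-<OUTPUT_SUFFIX>].<EXT>."""
    parts = [output_path]
    if timestamp:
        # Optional timestamped sub-folder (the TIME_OUTPUT behavior).
        parts.append(timestamp)
    name = f"{prefix}{data_name}"
    if suffix is not None:
        # Optional PID or rank suffix (the USE_PID behavior).
        name += f"-{suffix}"
    parts.append(f"{name}.{ext}")
    return "/".join(parts)


if __name__ == "__main__":
    base = "rocprof-sys-example-output"
    # Base configuration: no timestamp folder, no PID suffix.
    print(output_filename(base, "perfetto-trace", "proto"))
    # PID suffix enabled.
    print(output_filename(base, "wall-clock", "txt", suffix=63453))
    # Timestamp folder and PID suffix enabled.
    stamp = timestamp_folder(datetime(2022, 1, 31, 12, 30))
    print(output_filename(base, "wall-clock", "json", timestamp=stamp, suffix=63453))
```

The three calls correspond to the three configurations shown above: the base configuration, the PID suffix, and the timestamped folder combined with the PID suffix.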
Metadata
========================================
Omnitrace outputs a ``metadata.json`` file. This metadata file contains
ROCm Systems Profiler outputs a ``metadata.json`` file. This metadata file contains
information about the settings, environment variables, output files, and info
about the system and the run, as follows:
@@ -77,7 +77,7 @@ Metadata JSON Sample
.. code-block:: json
{
"omnitrace": {
"rocprof-sys": {
"metadata": {
"info": {
"HW_L1_CACHE_SIZE": 32768,
@@ -161,13 +161,13 @@ Metadata JSON Sample
"text": [
{
"value": [
"omnitrace-tests-output/parallel-overhead-binary-rewrite/roctracer.txt"
"rocprof-sys-tests-output/parallel-overhead-binary-rewrite/roctracer.txt"
],
"key": "roctracer"
},
{
"value": [
"omnitrace-tests-output/parallel-overhead-binary-rewrite/wall_clock.txt"
"rocprof-sys-tests-output/parallel-overhead-binary-rewrite/wall_clock.txt"
],
"key": "wall_clock"
}
@@ -175,15 +175,15 @@ Metadata JSON Sample
"json": [
{
"value": [
"omnitrace-tests-output/parallel-overhead-binary-rewrite/roctracer.json",
"omnitrace-tests-output/parallel-overhead-binary-rewrite/roctracer.tree.json"
"rocprof-sys-tests-output/parallel-overhead-binary-rewrite/roctracer.json",
"rocprof-sys-tests-output/parallel-overhead-binary-rewrite/roctracer.tree.json"
],
"key": "roctracer"
},
{
"value": [
"omnitrace-tests-output/parallel-overhead-binary-rewrite/wall_clock.json",
"omnitrace-tests-output/parallel-overhead-binary-rewrite/wall_clock.tree.json"
"rocprof-sys-tests-output/parallel-overhead-binary-rewrite/wall_clock.json",
"rocprof-sys-tests-output/parallel-overhead-binary-rewrite/wall_clock.tree.json"
],
"key": "wall_clock"
}
@@ -208,7 +208,7 @@ Metadata JSON Sample
}
],
"settings": {
"OMNITRACE_JSON_OUTPUT": {
"ROCPROFSYS_JSON_OUTPUT": {
"count": -1,
"environ_updated": false,
"name": "json_output",
@@ -218,9 +218,9 @@ Metadata JSON Sample
"value": true,
"max_count": 1,
"cmdline": [
"--omnitrace-json-output"
"--rocprof-sys-json-output"
],
"environ": "OMNITRACE_JSON_OUTPUT",
"environ": "ROCPROFSYS_JSON_OUTPUT",
"config_updated": false,
"categories": [
"io",
@@ -237,10 +237,10 @@ Metadata JSON Sample
}
}
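A metadata file of this shape can be inspected with a short Python script. The following is a minimal sketch: the sample above is abbreviated, so the exact nesting of the ``settings`` object is not assumed here and the script searches for it recursively. The helper names ``find_key`` and ``summarize_settings`` are hypothetical, and the inline JSON document is a reduced stand-in whose nesting is an assumption made for illustration only.

```python
import json


def find_key(node, key):
    """Depth-first search for the first value stored under `key`."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def summarize_settings(metadata):
    """Map each setting name to its recorded value."""
    settings = find_key(metadata, "settings") or {}
    return {
        name: entry.get("value")
        for name, entry in settings.items()
        if isinstance(entry, dict)
    }


if __name__ == "__main__":
    # Reduced stand-in for a real metadata file (nesting is an assumption).
    sample = json.loads(
        """
        {"rocprof-sys": {"metadata": {"settings": {
            "ROCPROFSYS_JSON_OUTPUT": {
                "name": "json_output",
                "value": true,
                "environ": "ROCPROFSYS_JSON_OUTPUT"
            }
        }}}}
        """
    )
    for name, value in summarize_settings(sample).items():
        print(f"{name} = {value}")
```

To use it on a real run, replace the inline sample with ``json.load`` on the generated metadata file.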
Configuring the Omnitrace output
Configuring the ROCm Systems Profiler output
========================================
Omnitrace includes a core set of options for controlling the format
ROCm Systems Profiler includes a core set of options for controlling the format
and contents of the output files. For additional information, see the guide on
:doc:`configuring runtime options <./configuring-runtime-options>`.
@@ -251,19 +251,19 @@ Core configuration settings
:header: "Setting", "Value", "Description"
:widths: 30, 30, 100
"``OMNITRACE_OUTPUT_PATH``", "Any valid path", "Path to folder where output files should be placed"
"``OMNITRACE_OUTPUT_PREFIX``", "String", "Useful for multiple runs with different arguments. See the next section on output prefix keys."
"``OMNITRACE_OUTPUT_FILE``", "Any valid filepath", "Specific location for the Perfetto output file"
"``OMNITRACE_TIME_OUTPUT``", "Boolean", "Place all output in a timestamped folder, timestamp format controlled via ``OMNITRACE_TIME_FORMAT``"
"``OMNITRACE_TIME_FORMAT``", "String", "See ``strftime`` man pages for valid identifiers"
"``OMNITRACE_USE_PID``", "Boolean", "Append either the PID or the MPI rank to all output files (before the extension)"
"``ROCPROFSYS_OUTPUT_PATH``", "Any valid path", "Path to folder where output files should be placed"
"``ROCPROFSYS_OUTPUT_PREFIX``", "String", "Useful for multiple runs with different arguments. See the next section on output prefix keys."
"``ROCPROFSYS_OUTPUT_FILE``", "Any valid filepath", "Specific location for the Perfetto output file"
"``ROCPROFSYS_TIME_OUTPUT``", "Boolean", "Place all output in a timestamped folder, timestamp format controlled via ``ROCPROFSYS_TIME_FORMAT``"
"``ROCPROFSYS_TIME_FORMAT``", "String", "See ``strftime`` man pages for valid identifiers"
"``ROCPROFSYS_USE_PID``", "Boolean", "Append either the PID or the MPI rank to all output files (before the extension)"
Output prefix keys
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Output prefix keys have many uses but are most helpful when dealing with multiple
profiling runs or large MPI jobs.
They are included in Omnitrace because they were introduced into Timemory
They are included in ROCm Systems Profiler because they were introduced into Timemory
for `compile-time-perf <https://github.com/jrmadsen/compile-time-perf>`_.
They are needed to create different output files for a generic wrapper around
compilation commands while still
@@ -271,8 +271,8 @@ overwriting the output from the last time a file was compiled.
When doing scaling studies and specifying options via the command line,
the recommended process is to
use a common ``OMNITRACE_OUTPUT_PATH``, disable ``OMNITRACE_TIME_OUTPUT``,
set ``OMNITRACE_OUTPUT_PREFIX="%argt%-"``, and let Omnitrace cleanly organize the output.
use a common ``ROCPROFSYS_OUTPUT_PATH``, disable ``ROCPROFSYS_TIME_OUTPUT``,
set ``ROCPROFSYS_OUTPUT_PREFIX="%argt%-"``, and let ROCm Systems Profiler cleanly organize the output.
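The recommended scaling-study setup can be scripted. The following Python sketch only prepares the environment and the command lines and does not launch anything. The helper names, the ``./foo`` executable, the ``scaling-study-output`` folder, and the thread counts are hypothetical. The ``rocprof-sys-run -- <command>`` form mirrors the invocation used elsewhere in this documentation.

```python
import os


def scaling_study_env(output_path, base_env=None):
    """Return an environment with a common output path, timestamped output
    disabled, and an argument-based output prefix."""
    env = dict(os.environ if base_env is None else base_env)
    env["ROCPROFSYS_OUTPUT_PATH"] = output_path
    env["ROCPROFSYS_TIME_OUTPUT"] = "OFF"
    env["ROCPROFSYS_OUTPUT_PREFIX"] = "%argt%-"
    return env


def scaling_study_commands(executable, thread_counts):
    """Build one command line per run; only the program argument changes."""
    return [["rocprof-sys-run", "--", executable, str(n)] for n in thread_counts]


if __name__ == "__main__":
    env = scaling_study_env("scaling-study-output")
    for cmd in scaling_study_commands("./foo", [1, 2, 4, 8]):
        # A real script could pass `cmd` and `env` to subprocess.run(cmd, env=env).
        print(" ".join(cmd))
```

Because every run shares one output folder and differs only in its arguments, the ``%argt%-`` prefix is what keeps the result files of the individual runs apart.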
.. csv-table::
:header: "String", "Encoding"
@@ -297,9 +297,9 @@ set ``OMNITRACE_OUTPUT_PREFIX="%argt%-"``, and let Omnitrace cleanly organize th
"``%rank%``", "Value of ``SLURM_PROCID`` environment variable if exists, else ``MPI_Comm_rank`` (or ``0`` non-mpi)"
"``%size%``", "``MPI_Comm_size`` or ``1`` if non-mpi"
"``%nid%``", "``%rank%`` if possible, otherwise ``%pid%``"
"``%launch_time%``", "Launch date and time (uses ``OMNITRACE_TIME_FORMAT``)"
"``%launch_time%``", "Launch date and time (uses ``ROCPROFSYS_TIME_FORMAT``)"
"``%env{NAME}%``", "Value of environment variable ``NAME`` (i.e. ``getenv(NAME)``)"
"``%cfg{NAME}%``", "Value of configuration variable ``NAME`` (e.g. ``%cfg{OMNITRACE_SAMPLING_FREQ}%`` would resolve to sampling frequency)"
"``%cfg{NAME}%``", "Value of configuration variable ``NAME`` (e.g. ``%cfg{ROCPROFSYS_SAMPLING_FREQ}%`` would resolve to sampling frequency)"
"``$env{NAME}``", "Alternative syntax to ``%env{NAME}%``"
"``$cfg{NAME}``", "Alternative syntax to ``%cfg{NAME}%``"
"``%m``", "Shorthand for ``%argt_hash%``"
@@ -318,8 +318,8 @@ set ``OMNITRACE_OUTPUT_PREFIX="%argt%-"``, and let Omnitrace cleanly organize th
Perfetto output
========================================
Use the ``OMNITRACE_OUTPUT_FILE`` to specify a specific location. If this is an
absolute path, then all ``OMNITRACE_OUTPUT_PATH`` and similar
Use the ``ROCPROFSYS_OUTPUT_FILE`` setting to specify the location of the Perfetto output file. If this is an
absolute path, then ``ROCPROFSYS_OUTPUT_PATH`` and similar
settings are ignored. Visit `ui.perfetto.dev <https://ui.perfetto.dev>`_ and open
this file.
@@ -328,26 +328,26 @@ this file.
If you are experiencing problems viewing your trace in the latest version of `Perfetto <http://ui.perfetto.dev>`_,
then try using `Perfetto UI v46.0 <https://ui.perfetto.dev/v46.0-35b3d9845/#!/>`_.
.. image:: ../data/omnitrace-perfetto.png
.. image:: ../data/rocprof-sys-perfetto.png
:alt: Visualization of a performance graph in Perfetto
.. image:: ../data/omnitrace-rocm.png
.. image:: ../data/rocprof-sys-rocm.png
:alt: Visualization of ROCm data in Perfetto
.. image:: ../data/omnitrace-rocm-flow.png
.. image:: ../data/rocprof-sys-rocm-flow.png
:alt: Visualization of ROCm flow data in Perfetto
.. image:: ../data/omnitrace-user-api.png
.. image:: ../data/rocprof-sys-user-api.png
:alt: Visualization of ROCm API calls in Perfetto
Timemory output
========================================
Use ``omnitrace-avail --components --filename`` to view the base filename for each component, as follows
Use ``rocprof-sys-avail --components --filename`` to view the base filename for each component, as follows:
.. code-block:: shell
$ omnitrace-avail wall_clock -C -f
$ rocprof-sys-avail wall_clock -C -f
|---------------------------------|---------------|------------------------|
| COMPONENT | AVAILABLE | FILENAME |
|---------------------------------|---------------|------------------------|
@@ -355,16 +355,16 @@ Use ``omnitrace-avail --components --filename`` to view the base filename for ea
| sampling_wall_clock | true | sampling_wall_clock |
|---------------------------------|---------------|------------------------|
The ``OMNITRACE_COLLAPSE_THREADS`` and ``OMNITRACE_COLLAPSE_PROCESSES`` settings are
only valid when full `MPI support is enabled <../install/install.html#mpi-support-within-omnitrace>`_.
The ``ROCPROFSYS_COLLAPSE_THREADS`` and ``ROCPROFSYS_COLLAPSE_PROCESSES`` settings are
only valid when full `MPI support is enabled <../install/install.html#mpi-support-within-rocprof-sys>`_.
When they are set, Timemory combines the per-thread and per-rank data (respectively) of
identical call stacks.
The ``OMNITRACE_FLAT_PROFILE`` setting removes all call stack hierarchy.
Using ``OMNITRACE_FLAT_PROFILE=ON`` in combination
with ``OMNITRACE_COLLAPSE_THREADS=ON`` is a useful configuration for identifying
The ``ROCPROFSYS_FLAT_PROFILE`` setting removes all call stack hierarchy.
Using ``ROCPROFSYS_FLAT_PROFILE=ON`` in combination
with ``ROCPROFSYS_COLLAPSE_THREADS=ON`` is a useful configuration for identifying
min/max measurements regardless of the calling context.
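The effect of combining a flat profile with collapsed threads can be illustrated with a toy model. The following Python sketch is not Timemory's implementation. It assumes each measurement is a hypothetical ``(thread, call stack, seconds)`` tuple, drops the call stack hierarchy by keeping only the innermost function, ignores the thread ID, and then reports the minimum and maximum per function.

```python
def flat_collapsed_min_max(samples):
    """Return {function: (min_seconds, max_seconds)} across all threads
    and calling contexts."""
    stats = {}
    for _thread, stack, seconds in samples:
        func = stack[-1]  # flat profile: keep only the innermost function
        low, high = stats.get(func, (seconds, seconds))
        stats[func] = (min(low, seconds), max(high, seconds))
    return stats


if __name__ == "__main__":
    # Hypothetical measurements: (thread ID, call stack, wall-clock seconds).
    samples = [
        (0, ("main", "run", "fib"), 0.30),
        (1, ("start_thread", "run", "fib"), 0.25),
        (2, ("start_thread", "run", "fib"), 0.40),
        (0, ("main", "run"), 0.35),
    ]
    for func, (low, high) in sorted(flat_collapsed_min_max(samples).items()):
        print(f"{func}: min={low:.2f} s, max={high:.2f} s")
```

In this model, the three ``fib`` entries merge into a single row even though they were reached from different threads and different callers, which is what makes the extremes easy to spot.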
The ``OMNITRACE_TIMELINE_PROFILE`` setting (with ``OMNITRACE_FLAT_PROFILE=OFF``) effectively
The ``ROCPROFSYS_TIMELINE_PROFILE`` setting (with ``ROCPROFSYS_FLAT_PROFILE=OFF``) effectively
generates similar data to that found
in Perfetto. Enabling timeline and flat profiling effectively generates
similar data to ``strace``. However, while Timemory generally
@@ -376,11 +376,11 @@ Timemory text output
Timemory text output files are meant for human consumption (while JSON formats are for analysis),
so some fields such as the ``LABEL`` might be truncated for readability.
The truncation settings be changed through the ``OMNITRACE_MAX_WIDTH`` setting.
The truncation settings can be changed through the ``ROCPROFSYS_MAX_WIDTH`` setting.
.. note::
The generation of text output is configurable via ``OMNITRACE_TEXT_OUTPUT``.
The generation of text output is configurable via ``ROCPROFSYS_TEXT_OUTPUT``.
.. _text-output-example-label:
@@ -389,7 +389,7 @@ Timemory text output example
In the following example, the ``NN`` field in ``|NN>>>`` is the thread ID. If MPI support is enabled,
this becomes ``|MM|NN>>>`` where ``MM`` is the rank.
If ``OMNITRACE_COLLAPSE_THREADS=ON`` and ``OMNITRACE_COLLAPSE_PROCESSES=ON`` are configured,
If ``ROCPROFSYS_COLLAPSE_THREADS=ON`` and ``ROCPROFSYS_COLLAPSE_PROCESSES=ON`` are configured,
neither the ``MM`` nor the ``NN`` are present unless the
component explicitly sets type traits. Type traits specify that the data is only
relevant per-thread or per-process, such as the ``thread_cpu_clock`` clock component.
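When post-processing the text output, the ``|NN>>>`` and ``|MM|NN>>>`` prefixes can be split off with a regular expression. The following Python sketch assumes that a label starts exactly with one of these two prefix forms. The helper name ``parse_label`` and the sample labels are hypothetical. Lines that carry neither prefix, such as collapsed output, are reported as ``None`` rather than guessed at.

```python
import re

# Optional rank (MM), then thread ID (NN), then the ">>>" marker and the label.
PREFIX = re.compile(r"^\|(?:(?P<rank>\d+)\|)?(?P<thread>\d+)>>>\s*(?P<label>.*)$")


def parse_label(text):
    """Return (rank, thread, label), or None if the prefix is absent.
    `rank` is None when no MPI rank field is present."""
    match = PREFIX.match(text)
    if match is None:
        return None
    rank = match.group("rank")
    return (
        int(rank) if rank is not None else None,
        int(match.group("thread")),
        match.group("label"),
    )


if __name__ == "__main__":
    for text in ("|0>>> main", "|1|3>>> run", "main"):
        print(repr(text), "->", parse_label(text))
```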
@@ -592,8 +592,8 @@ write a simple Python script for post-processing using this format than with the
.. note::
The generation of flat JSON output is configurable via ``OMNITRACE_JSON_OUTPUT``.
The generation of hierarchical JSON data is configurable via ``OMNITRACE_TREE_OUTPUT``
The generation of flat JSON output is configurable via ``ROCPROFSYS_JSON_OUTPUT``.
The generation of hierarchical JSON data is configurable via ``ROCPROFSYS_TREE_OUTPUT``.
Timemory JSON output sample
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -1,37 +1,38 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
:description: ROCm Systems Profiler documentation and reference
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
****************************************************
Using the Omnitrace API
Using the ROCm Systems Profiler API
****************************************************
The following example shows how a program can use the Omnitrace API for run-time analysis.
The following example shows how a program can use the ROCm Systems Profiler API
for run-time analysis.
Omnitrace user API example program
ROCm Systems Profiler user API example program
========================================
You can use the Omnitrace API to define custom regions to profile and trace.
The following C++ program demonstrates this technique by calling several functions from the
Omnitrace API, such as ``omnitrace_user_push_region`` and
``omnitrace_user_stop_thread_trace``.
You can use the ROCm Systems Profiler API to define custom regions to profile and trace.
The following C++ program demonstrates this technique by calling several functions from the
ROCm Systems Profiler API, such as ``rocprofsys_user_push_region`` and
``rocprofsys_user_stop_thread_trace``.
.. note::
By default, when Omnitrace detects any ``omnitrace_user_start_*`` or
``omnitrace_user_stop_*`` function, instrumentation
is disabled at start up, which means ``omnitrace_user_stop_trace()`` is not
By default, when ROCm Systems Profiler detects any ``rocprofsys_user_start_*`` or
``rocprofsys_user_stop_*`` function, instrumentation
is disabled at start up, which means ``rocprofsys_user_stop_trace()`` is not
required at the beginning of ``main``. This behavior
can be manually controlled by using the ``OMNITRACE_INIT_ENABLED`` environment variable.
can be manually controlled by using the ``ROCPROFSYS_INIT_ENABLED`` environment variable.
User-defined regions are always
recorded, regardless of whether ``omnitrace_user_start_*`` or
``omnitrace_user_stop_*`` has been called.
recorded, regardless of whether ``rocprofsys_user_start_*`` or
``rocprofsys_user_stop_*`` has been called.
.. code-block:: cpp
#include <omnitrace/categories.h>
#include <omnitrace/types.h>
#include <omnitrace/user.h>
#include <rocprofiler-systems/categories.h>
#include <rocprofiler-systems/types.h>
#include <rocprofiler-systems/user.h>
#include <atomic>
#include <cassert>
@@ -56,52 +57,52 @@ Omnitrace API, such as ``omnitrace_user_push_region`` and
namespace
{
omnitrace_user_callbacks_t custom_callbacks = OMNITRACE_USER_CALLBACKS_INIT;
omnitrace_user_callbacks_t original_callbacks = OMNITRACE_USER_CALLBACKS_INIT;
rocprofsys_user_callbacks_t custom_callbacks = ROCPROFSYS_USER_CALLBACKS_INIT;
rocprofsys_user_callbacks_t original_callbacks = ROCPROFSYS_USER_CALLBACKS_INIT;
} // namespace
int
main(int argc, char** argv)
{
custom_callbacks.push_region = &custom_push_region;
omnitrace_user_configure(OMNITRACE_USER_UNION_CONFIG, custom_callbacks,
rocprofsys_user_configure(ROCPROFSYS_USER_UNION_CONFIG, custom_callbacks,
&original_callbacks);
omnitrace_user_push_region(argv[0]);
omnitrace_user_push_region("initialization");
rocprofsys_user_push_region(argv[0]);
rocprofsys_user_push_region("initialization");
size_t nthread = std::min<size_t>(16, std::thread::hardware_concurrency());
size_t nitr = 50000;
long nfib = 10;
if(argc > 1) nfib = atol(argv[1]);
if(argc > 2) nthread = atol(argv[2]);
if(argc > 3) nitr = atol(argv[3]);
omnitrace_user_pop_region("initialization");
rocprofsys_user_pop_region("initialization");
printf("[%s] Threads: %zu\n[%s] Iterations: %zu\n[%s] fibonacci(%li)...\n", argv[0],
nthread, argv[0], nitr, argv[0], nfib);
omnitrace_user_push_region("thread_creation");
rocprofsys_user_push_region("thread_creation");
std::vector<std::thread> threads{};
threads.reserve(nthread);
// disable instrumentation for child threads
omnitrace_user_stop_thread_trace();
rocprofsys_user_stop_thread_trace();
for(size_t i = 0; i < nthread; ++i)
{
threads.emplace_back(&run, nitr, nfib);
}
// re-enable instrumentation
omnitrace_user_start_thread_trace();
omnitrace_user_pop_region("thread_creation");
rocprofsys_user_start_thread_trace();
rocprofsys_user_pop_region("thread_creation");
omnitrace_user_push_region("thread_wait");
rocprofsys_user_push_region("thread_wait");
for(auto& itr : threads)
itr.join();
omnitrace_user_pop_region("thread_wait");
rocprofsys_user_pop_region("thread_wait");
run(nitr, nfib);
printf("[%s] fibonacci(%li) x %lu = %li\n", argv[0], nfib, nthread, total.load());
omnitrace_user_pop_region(argv[0]);
rocprofsys_user_pop_region(argv[0]);
return 0;
}
@@ -120,19 +121,19 @@ Omnitrace API, such as ``omnitrace_user_push_region`` and
void
run(size_t nitr, long n)
{
omnitrace_user_push_region(RUN_LABEL);
rocprofsys_user_push_region(RUN_LABEL);
long local = 0;
for(size_t i = 0; i < nitr; ++i)
local += fib(n);
total += local;
omnitrace_user_pop_region(RUN_LABEL);
rocprofsys_user_pop_region(RUN_LABEL);
}
int
custom_push_region(const char* name)
{
if(!original_callbacks.push_region || !original_callbacks.push_annotated_region)
return OMNITRACE_USER_ERROR_NO_BINDING;
return ROCPROFSYS_USER_ERROR_NO_BINDING;
printf("Pushing custom region :: %s\n", name);
@@ -143,22 +144,22 @@ Omnitrace API, such as ``omnitrace_user_push_region`` and
char _buff[1024];
if(_err != 0) _msg = strerror_r(_err, _buff, sizeof(_buff));
omnitrace_annotation_t _annotations[] = {
{ "errno", OMNITRACE_INT32, &_err }, { "strerror", OMNITRACE_STRING, _msg }
rocprofsys_annotation_t _annotations[] = {
{ "errno", ROCPROFSYS_INT32, &_err }, { "strerror", ROCPROFSYS_STRING, _msg }
};
errno = 0; // reset errno
return (*original_callbacks.push_annotated_region)(
name, _annotations, sizeof(_annotations) / sizeof(omnitrace_annotation_t));
name, _annotations, sizeof(_annotations) / sizeof(rocprofsys_annotation_t));
}
return (*original_callbacks.push_region)(name);
}
Linking the Omnitrace libraries to another program
Linking the ROCm Systems Profiler libraries to another program
=======================================================
To link the ``omnitrace-user-library`` to another program,
To link the ``rocprofiler-systems-user-library`` to another program,
use the following CMake and ``g++`` directives.
CMake
@@ -166,19 +167,19 @@ CMake
.. code-block:: cmake
find_package(omnitrace REQUIRED COMPONENTS user)
find_package(rocprofiler-systems REQUIRED COMPONENTS user)
add_executable(foo foo.cpp)
target_link_libraries(foo PRIVATE omnitrace::omnitrace-user-library)
target_link_libraries(foo PRIVATE rocprofiler-systems::rocprofiler-systems-user-library)
g++ compilation
-------------------------------------------------------
Assuming Omnitrace is installed in ``/opt/omnitrace``, use the ``g++`` compiler
Assuming ROCm Systems Profiler is installed in ``/opt/rocprofiler-systems``, use the ``g++`` compiler
to build the application.
.. code-block:: shell
g++ -I/opt/omnitrace foo.cpp -o foo -lomnitrace-user
g++ -I/opt/rocprofiler-systems foo.cpp -o foo -lrocprofiler-systems-user
Output from the API example program
========================================
@@ -187,19 +188,19 @@ First, instrument and run the program.
.. code-block:: shell
$ omnitrace-instrument -l --min-instructions=8 -E custom_push_region -o -- ./user-api
$ rocprof-sys-instrument -l --min-instructions=8 -E custom_push_region -o -- ./user-api
...
$ omnitrace-run --profile --use-pid off --time-output off -- ./user-api.inst 20 4 100
$ rocprof-sys-run --profile --use-pid off --time-output off -- ./user-api.inst 20 4 100
Pushing custom region :: ./user-api.inst
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Trace
[rocprof-sys][rocprofsys_init_tooling] Instrumentation mode: Trace
______ .___ ___. .__ __. __ .___________..______ ___ ______ _______
/ __ \ | \/ | | \ | | | | | || _ \ / \ / || ____|
| | | | | \ / | | \| | | | `---| |----`| |_) | / ^ \ | ,----'| |__
| | | | | |\/| | | . ` | | | | | | / / /_\ \ | | | __|
| `--' | | | | | | |\ | | | | | | |\ \----./ _____ \ | `----.| |____
\______/ |__| |__| |__| \__| |__| |__| | _| `._____/__/ \__\ \______||_______|
__
_ __ ___ ___ _ __ _ __ ___ / _| ___ _ _ ___
| '__| / _ \ / __| | '_ \ | '__| / _ \ | |_ _____ / __| | | | | / __|
| | | (_) | | (__ | |_) | | | | (_) | | _| |_____| \__ \ | |_| | \__ \
|_| \___/ \___| | .__/ |_| \___/ |_| |___/ \__, | |___/
|_| |___/
@@ -215,29 +216,29 @@ First, instrument and run the program.
Pushing custom region :: run(20) x 100
Pushing custom region :: run(20) x 100
[./user-api.inst] fibonacci(20) x 4 = 3382500
[omnitrace][86267][0][omnitrace_finalize] finalizing...
[rocprof-sys][86267][0][rocprofsys_finalize] finalizing...
[omnitrace][86267][0] omnitrace : 5.190895 sec wall_clock, 2.748 mb peak_rss, 6.330000 sec cpu_clock, 121.9 % cpu_util [laps: 1]
[omnitrace][86267][0] user-api.inst/thread-0 : 5.078713 sec wall_clock, 4.722415 sec thread_cpu_clock, 93.0 % thread_cpu_util, 1.276 mb peak_rss [laps: 1]
[omnitrace][86267][0] user-api.inst/thread-1 : 0.322248 sec wall_clock, 0.322191 sec thread_cpu_clock, 100.0 % thread_cpu_util, 1.000 mb peak_rss [laps: 1]
[omnitrace][86267][0] user-api.inst/thread-2 : 0.323255 sec wall_clock, 0.323194 sec thread_cpu_clock, 100.0 % thread_cpu_util, 0.000 mb peak_rss [laps: 1]
[omnitrace][86267][0] user-api.inst/thread-3 : 0.323569 sec wall_clock, 0.323484 sec thread_cpu_clock, 100.0 % thread_cpu_util, 1.092 mb peak_rss [laps: 1]
[omnitrace][86267][0] user-api.inst/thread-4 : 0.324178 sec wall_clock, 0.324057 sec thread_cpu_clock, 100.0 % thread_cpu_util, 1.184 mb peak_rss [laps: 1]
[omnitrace][86267][0] Post-processing 51 cpu frequency and memory usage entries...
[rocprof-sys][86267][0] rocprof-sys : 5.190895 sec wall_clock, 2.748 mb peak_rss, 6.330000 sec cpu_clock, 121.9 % cpu_util [laps: 1]
[rocprof-sys][86267][0] user-api.inst/thread-0 : 5.078713 sec wall_clock, 4.722415 sec thread_cpu_clock, 93.0 % thread_cpu_util, 1.276 mb peak_rss [laps: 1]
[rocprof-sys][86267][0] user-api.inst/thread-1 : 0.322248 sec wall_clock, 0.322191 sec thread_cpu_clock, 100.0 % thread_cpu_util, 1.000 mb peak_rss [laps: 1]
[rocprof-sys][86267][0] user-api.inst/thread-2 : 0.323255 sec wall_clock, 0.323194 sec thread_cpu_clock, 100.0 % thread_cpu_util, 0.000 mb peak_rss [laps: 1]
[rocprof-sys][86267][0] user-api.inst/thread-3 : 0.323569 sec wall_clock, 0.323484 sec thread_cpu_clock, 100.0 % thread_cpu_util, 1.092 mb peak_rss [laps: 1]
[rocprof-sys][86267][0] user-api.inst/thread-4 : 0.324178 sec wall_clock, 0.324057 sec thread_cpu_clock, 100.0 % thread_cpu_util, 1.184 mb peak_rss [laps: 1]
[rocprof-sys][86267][0] Post-processing 51 cpu frequency and memory usage entries...
[omnitrace][wall_clock]|0> Outputting 'omnitrace-user-api.inst-output/wall_clock.json'...
[omnitrace][wall_clock]|0> Outputting 'omnitrace-user-api.inst-output/wall_clock.tree.json'...
[omnitrace][wall_clock]|0> Outputting 'omnitrace-user-api.inst-output/wall_clock.txt'...
[rocprof-sys][wall_clock]|0> Outputting 'rocprof-sys-user-api.inst-output/wall_clock.json'...
[rocprof-sys][wall_clock]|0> Outputting 'rocprof-sys-user-api.inst-output/wall_clock.tree.json'...
[rocprof-sys][wall_clock]|0> Outputting 'rocprof-sys-user-api.inst-output/wall_clock.txt'...
[omnitrace][manager::finalize][metadata]> Outputting 'omnitrace-user-api.inst-output/metadata.json' and 'omnitrace-user-api.inst-output/functions.json'...
[omnitrace][86267][0][omnitrace_finalize] Finalized
[rocprof-sys][manager::finalize][metadata]> Outputting 'rocprof-sys-user-api.inst-output/metadata.json' and 'rocprof-sys-user-api.inst-output/functions.json'...
[rocprof-sys][86267][0][rocprofsys_finalize] Finalized
Then review the output.
.. code-block:: shell
$ cat omnitrace-example-output/wall_clock.txt
$ cat rocprof-sys-example-output/wall_clock.txt
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| REAL-CLOCK TIMER (I.E. WALL-CLOCK TIMER) |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+15 -16
@@ -1,17 +1,17 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
:description: ROCm Systems Profiler documentation and reference
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
***********************************
Omnitrace documentation
ROCm Systems Profiler documentation
***********************************
Omnitrace is designed for the high-level profiling and comprehensive tracing
ROCm Systems Profiler, formerly known as "Omnitrace", is designed for the high-level profiling and comprehensive tracing
of applications running on the CPU or the CPU and GPU. It supports dynamic binary
instrumentation, call-stack sampling, and various other features for determining
which function and line number are currently executing. To learn more, see :doc:`what-is-omnitrace`
which function and line number are currently executing. To learn more, see :doc:`what-is-rocprof-sys`.
The code is open and hosted at `<https://github.com/ROCm/omnitrace>`_.
The code is open and hosted at `<https://github.com/ROCm/rocprofiler-systems>`_.
.. grid:: 2
@@ -20,7 +20,7 @@ The code is open and hosted at `<https://github.com/ROCm/omnitrace>`_.
.. grid-item-card:: Install
* :doc:`Quick start <./install/quick-start>`
* :doc:`Omnitrace installation <./install/install>`
* :doc:`ROCm Systems Profiler installation <./install/install>`
The documentation is structured as follows:
@@ -30,31 +30,30 @@ The documentation is structured as follows:
.. grid-item-card:: Tutorials
* `GitHub examples <https://github.com/ROCm/omnitrace/tree/amd-mainline/examples>`_
* `GitHub examples <https://github.com/ROCm/rocprofiler-systems/tree/amd-mainline/examples>`_
* :doc:`Video tutorials <./tutorials/video-tutorials>`
.. grid-item-card:: How to
* :doc:`Configuring and validating the Omnitrace environment <./how-to/configuring-validating-environment>`
* :doc:`Configuring and validating the ROCm Systems Profiler environment <./how-to/configuring-validating-environment>`
* :doc:`Configuring runtime options <./how-to/configuring-runtime-options>`
* :doc:`Sampling the call stack <./how-to/sampling-call-stack>`
* :doc:`Instrumenting and rewriting a binary application <./how-to/instrumenting-rewriting-binary-application>`
* :doc:`Performing causal profiling <./how-to/performing-causal-profiling>`
* :doc:`Understanding the Omnitrace output <./how-to/understanding-omnitrace-output>`
* :doc:`Understanding the ROCm Systems Profiler output <./how-to/understanding-rocprof-sys-output>`
* :doc:`Profiling Python scripts <./how-to/profiling-python-scripts>`
* :doc:`Using the Omnitrace API <./how-to/using-omnitrace-api>`
* :doc:`General tips for using Omnitrace <./how-to/general-tips-using-omnitrace>`
* :doc:`Using the ROCm Systems Profiler API <./how-to/using-rocprof-sys-api>`
* :doc:`General tips for using ROCm Systems Profiler <./how-to/general-tips-using-rocprof-sys>`
.. grid-item-card:: Conceptual
* :doc:`Data collection modes <./conceptual/data-collection-modes>`
* :doc:`The Omnitrace feature set <./conceptual/omnitrace-feature-set>`
* :doc:`The ROCm Systems Profiler feature set <./conceptual/rocprof-sys-feature-set>`
.. grid-item-card:: Reference
* :doc:`Development guide <./reference/development-guide>`
* :doc:`Omnitrace glossary <./reference/omnitrace-glossary>`
* :doc:`ROCm Systems Profiler glossary <./reference/rocprof-sys-glossary>`
* :doc:`API library <./doxygen/html/files>`
* :doc:`Class member functions <./doxygen/html/functions>`
* :doc:`Globals <./doxygen/html/globals>`
+93 -90
@@ -1,38 +1,41 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
:description: ROCm Systems Profiler installation documentation and reference
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, installation, installer, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
*************************************
Omnitrace installation
ROCm Systems Profiler installation
*************************************
The following information builds on the guidelines in the :doc:`Quick start <./quick-start>` guide.
It covers how to install `Omnitrace <https://github.com/ROCm/omnitrace>`_ from source or a binary distribution,
as well as the :ref:`post-installation-steps`.
It covers how to install `ROCm Systems Profiler <https://github.com/ROCm/rocprofiler-systems>`_ from
source or a binary distribution, as well as the :ref:`post-installation-steps`.
If you have problems using Omnitrace after installation,
If you have problems using ROCm Systems Profiler after installation,
consult the :ref:`post-installation-troubleshooting` section.
Release links
========================================
To review and install either the current Omnitrace release or earlier releases, use these links:
To review and install either the current ROCm Systems Profiler release or earlier releases, use these links:
* Latest Omnitrace Release: `<https://github.com/ROCm/omnitrace/releases/latest>`_
* All Omnitrace Releases: `<https://github.com/ROCm/omnitrace/releases>`_
* Latest ROCm Systems Profiler Release: `<https://github.com/ROCm/rocprofiler-systems/releases/latest>`_
* All ROCm Systems Profiler Releases: `<https://github.com/ROCm/rocprofiler-systems/releases>`_
Operating system support
========================================
Omnitrace is only supported on Linux. The following distributions are tested in the Omnitrace GitHub workflows:
ROCm Systems Profiler is only supported on Linux. The following distributions are tested in the ROCm Systems Profiler GitHub workflows:
* Ubuntu 20.04
* Ubuntu 22.04
* OpenSUSE 15.3
* OpenSUSE 15.4
* Red Hat 8.7
* Red Hat 9.0
* Red Hat 9.1
* OpenSUSE 15.5
* OpenSUSE 15.6
* Red Hat 8.8
* Red Hat 8.9
* Red Hat 8.10
* Red Hat 9.2
* Red Hat 9.3
* Red Hat 9.4
Other OS distributions might function but are not supported or tested.
@@ -61,58 +64,58 @@ Architecture
========================================
With regard to instrumentation, at present only AMD64 (x86_64) architectures are tested. However,
Dyninst supports several more architectures and Omnitrace instrumentation may support other
Dyninst supports several more architectures and ROCm Systems Profiler instrumentation may support other
CPU architectures such as aarch64 and ppc64.
Other modes of use, such as sampling and causal profiling, are not dependent on Dyninst and therefore
might be more portable.
Installing Omnitrace from binary distributions
Installing ROCm Systems Profiler from binary distributions
============================================================
Every Omnitrace release provides binary installer scripts of the form:
Every ROCm Systems Profiler release provides binary installer scripts of the form:
.. code-block:: shell
omnitrace-{VERSION}-{OS_DISTRIB}-{OS_VERSION}[-ROCm-{ROCM_VERSION}[-{EXTRA}]].sh
rocprof-sys-{VERSION}-{OS_DISTRIB}-{OS_VERSION}[-ROCm-{ROCM_VERSION}[-{EXTRA}]].sh
For example,
.. code-block:: shell
omnitrace-1.0.0-ubuntu-18.04-OMPT-PAPI-Python3.sh
omnitrace-1.0.0-ubuntu-18.04-ROCm-405000-OMPT-PAPI-Python3.sh
rocprof-sys-1.0.0-ubuntu-18.04-OMPT-PAPI-Python3.sh
rocprof-sys-1.0.0-ubuntu-18.04-ROCm-405000-OMPT-PAPI-Python3.sh
...
omnitrace-1.0.0-ubuntu-20.04-ROCm-50000-OMPT-PAPI-Python3.sh
rocprof-sys-1.0.0-ubuntu-20.04-ROCm-50000-OMPT-PAPI-Python3.sh
Any of the ``EXTRA`` fields with a CMake build option
(for example, PAPI, as referenced in a following section) or
with no link requirements (such as OMPT) have
self-contained support for these packages.
To install Omnitrace using a binary installer script, follow these steps:
To install ROCm Systems Profiler using a binary installer script, follow these steps:
#. Download the appropriate binary distribution
.. code-block:: shell
wget https://github.com/ROCm/omnitrace/releases/download/v<VERSION>/<SCRIPT>
wget https://github.com/ROCm/rocprofiler-systems/releases/download/v<VERSION>/<SCRIPT>
#. Create the target installation directory
.. code-block:: shell
mkdir /opt/omnitrace
mkdir /opt/rocprofiler-systems
#. Run the installer script
.. code-block:: shell
./omnitrace-1.0.0-ubuntu-18.04-ROCm-405000-OMPT-PAPI.sh --prefix=/opt/omnitrace --exclude-subdir
./rocprofiler-systems-1.0.0-ubuntu-18.04-ROCm-405000-OMPT-PAPI.sh --prefix=/opt/rocprofiler-systems --exclude-subdir
Installing Omnitrace from source
Installing ROCm Systems Profiler from source
==================================================
Omnitrace needs a GCC compiler with full support for C++17 and CMake v3.16 or higher.
ROCm Systems Profiler needs a GCC compiler with full support for C++17 and CMake v3.16 or higher.
The Clang compiler may be used in lieu of the GCC compiler if `Dyninst <https://github.com/dyninst/dyninst>`_
is already installed.
@@ -122,7 +125,7 @@ Build requirements
* GCC compiler v7+
* Older GCC compilers may be supported but are not tested
* Clang compilers are generally supported for Omnitrace but not Dyninst
* Clang compilers are generally supported for ROCm Systems Profiler but not Dyninst
* `CMake <https://cmake.org/>`_ v3.16+
@@ -151,16 +154,16 @@ Required third-party packages
* `libunwind <https://www.nongnu.org/libunwind/>`_ for call-stack sampling
Any of the third-party packages required by Dyninst, along with Dyninst itself, can be built and installed
during the Omnitrace build. The following list indicates the package, the version,
the application that requires the package (for example, Omnitrace requires Dyninst
while Dyninst requires TBB), and the CMake option to build the package alongside Omnitrace:
during the ROCm Systems Profiler build. The following list indicates the package, the version,
the application that requires the package (for example, ROCm Systems Profiler requires Dyninst
while Dyninst requires TBB), and the CMake option to build the package alongside ROCm Systems Profiler:
.. csv-table::
:header: "Third-Party Library", "Minimum Version", "Required By", "CMake Option"
:widths: 15, 10, 12, 40
"Dyninst", "12.0", "Omnitrace", "``OMNITRACE_BUILD_DYNINST`` (default: OFF)"
"Libunwind", "", "Omnitrace", "``OMNITRACE_BUILD_LIBUNWIND`` (default: ON)"
"Dyninst", "12.0", "ROCm Systems Profiler", "``ROCPROFSYS_BUILD_DYNINST`` (default: OFF)"
"Libunwind", "", "ROCm Systems Profiler", "``ROCPROFSYS_BUILD_LIBUNWIND`` (default: ON)"
"TBB", "2018.6", "Dyninst", "``DYNINST_BUILD_TBB`` (default: OFF)"
"ElfUtils", "0.178", "Dyninst", "``DYNINST_BUILD_ELFUTILS`` (default: OFF)"
"LibIberty", "", "Dyninst", "``DYNINST_BUILD_LIBIBERTY`` (default: OFF)"
@@ -180,9 +183,9 @@ Optional third-party packages
* `PAPI <https://icl.utk.edu/papi/>`_
* MPI
* ``OMNITRACE_USE_MPI`` enables full MPI support
* ``OMNITRACE_USE_MPI_HEADERS`` enables wrapping of the dynamically-linked MPI C function calls.
(By default, if Omnitrace cannot find an OpenMPI MPI distribution, it uses a local copy
* ``ROCPROFSYS_USE_MPI`` enables full MPI support
* ``ROCPROFSYS_USE_MPI_HEADERS`` enables wrapping of the dynamically-linked MPI C function calls.
(By default, if ROCm Systems Profiler cannot find an OpenMPI MPI distribution, it uses a local copy
of the OpenMPI ``mpi.h``.)
* Several optional third-party profiling tools supported by Timemory
@@ -192,19 +195,19 @@ Optional third-party packages
:header: "Third-Party Library", "CMake Enable Option", "CMake Build Option"
:widths: 15, 45, 40
"PAPI", "``OMNITRACE_USE_PAPI`` (default: ON)", "``OMNITRACE_BUILD_PAPI`` (default: ON)"
"MPI", "``OMNITRACE_USE_MPI`` (default: OFF)", ""
"MPI (header-only)", "``OMNITRACE_USE_MPI_HEADERS`` (default: ON)", ""
"PAPI", "``ROCPROFSYS_USE_PAPI`` (default: ON)", "``ROCPROFSYS_BUILD_PAPI`` (default: ON)"
"MPI", "``ROCPROFSYS_USE_MPI`` (default: OFF)", ""
"MPI (header-only)", "``ROCPROFSYS_USE_MPI_HEADERS`` (default: ON)", ""
Installing Dyninst
-----------------------------------
The easiest way to install Dyninst is alongside Omnitrace, but it can also be installed using Spack.
The easiest way to install Dyninst is alongside ROCm Systems Profiler, but it can also be installed using Spack.
Building Dyninst alongside Omnitrace
Building Dyninst alongside ROCm Systems Profiler
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To install Dyninst alongside Omnitrace, configure Omnitrace with ``OMNITRACE_BUILD_DYNINST=ON``.
To install Dyninst alongside ROCm Systems Profiler, configure ROCm Systems Profiler with ``ROCPROFSYS_BUILD_DYNINST=ON``.
Depending on the version of Ubuntu, the ``apt`` package manager might have current enough
versions of the Dyninst Boost, TBB, and LibIberty dependencies
(use ``apt-get install libtbb-dev libiberty-dev libboost-dev``).
@@ -213,8 +216,8 @@ its dependencies via ``DYNINST_BUILD_<DEP>=ON``, as follows:
.. code-block:: shell
git clone https://github.com/ROCm/omnitrace.git omnitrace-source
cmake -B omnitrace-build -DOMNITRACE_BUILD_DYNINST=ON -DDYNINST_BUILD_{TBB,ELFUTILS,BOOST,LIBIBERTY}=ON omnitrace-source
git clone https://github.com/ROCm/rocprofiler-systems.git rocprof-sys-source
cmake -B rocprof-sys-build -DROCPROFSYS_BUILD_DYNINST=ON -DDYNINST_BUILD_{TBB,ELFUTILS,BOOST,LIBIBERTY}=ON rocprof-sys-source
where ``-DDYNINST_BUILD_{TBB,BOOST,ELFUTILS,LIBIBERTY}=ON`` is expanded by
the shell to ``-DDYNINST_BUILD_TBB=ON -DDYNINST_BUILD_BOOST=ON ...``
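To check what the shell actually passes to CMake, the expansion can be printed on its own. This is a minimal sketch: the helper name ``expand_dyninst_flags`` is made up for illustration, and it calls ``bash`` explicitly because brace expansion is a bash/zsh feature rather than part of POSIX ``sh``.

```shell
# Hypothetical helper (not part of ROCm Systems Profiler): print the flags
# produced by the brace expansion, one per line.
expand_dyninst_flags() {
    # Invoke bash explicitly; a strict POSIX sh would print the braces verbatim.
    bash -c 'printf "%s\n" -DDYNINST_BUILD_{TBB,BOOST,ELFUTILS,LIBIBERTY}=ON'
}

expand_dyninst_flags
```

If the output still contains the braces, the invoking shell does not support brace expansion, and each ``-DDYNINST_BUILD_<DEP>=ON`` flag needs to be written out separately.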
@@ -234,18 +237,18 @@ Installing Dyninst via Spack
spack install --reuse dyninst
spack load -r dyninst
Installing Omnitrace
Installing ROCm Systems Profiler
-----------------------------------
Omnitrace has CMake configuration options for MPI support (``OMNITRACE_USE_MPI`` or
``OMNITRACE_USE_MPI_HEADERS``), HIP kernel tracing (``OMNITRACE_USE_ROCTRACER``),
ROCm device sampling (``OMNITRACE_USE_ROCM_SMI``), OpenMP-Tools (``OMNITRACE_USE_OMPT``),
hardware counters via PAPI (``OMNITRACE_USE_PAPI``), among other features.
ROCm Systems Profiler has CMake configuration options for MPI support (``ROCPROFSYS_USE_MPI`` or
``ROCPROFSYS_USE_MPI_HEADERS``), HIP kernel tracing (``ROCPROFSYS_USE_ROCTRACER``),
ROCm device sampling (``ROCPROFSYS_USE_ROCM_SMI``), OpenMP-Tools (``ROCPROFSYS_USE_OMPT``),
hardware counters via PAPI (``ROCPROFSYS_USE_PAPI``), among other features.
Various additional features can be enabled via the
``TIMEMORY_USE_*`` `CMake options <https://timemory.readthedocs.io/en/develop/installation.html#cmake-options>`_.
Any ``OMNITRACE_USE_<VAL>`` option which has a corresponding ``TIMEMORY_USE_<VAL>``
Any ``ROCPROFSYS_USE_<VAL>`` option which has a corresponding ``TIMEMORY_USE_<VAL>``
option means that the Timemory support for this feature has been integrated
into Perfetto support for Omnitrace, for example, ``OMNITRACE_USE_PAPI=<VAL>`` also configures
into Perfetto support for ROCm Systems Profiler, for example, ``ROCPROFSYS_USE_PAPI=<VAL>`` also configures
``TIMEMORY_USE_PAPI=<VAL>``. This means the data that Timemory is able to collect via this package
is passed along to Perfetto and is displayed when the ``.proto`` file is visualized
in `the Perfetto UI <https://ui.perfetto.dev>`_.
@@ -257,39 +260,39 @@ in `the Perfetto UI <https://ui.perfetto.dev>`_.
.. code-block:: shell
git clone https://github.com/ROCm/omnitrace.git omnitrace-source
git clone https://github.com/ROCm/rocprofiler-systems.git rocprof-sys-source
cmake \
-B omnitrace-build \
-D CMAKE_INSTALL_PREFIX=/opt/omnitrace \
-D OMNITRACE_USE_HIP=ON \
-D OMNITRACE_USE_ROCM_SMI=ON \
-D OMNITRACE_USE_ROCTRACER=ON \
-D OMNITRACE_USE_PYTHON=ON \
-D OMNITRACE_USE_OMPT=ON \
-D OMNITRACE_USE_MPI_HEADERS=ON \
-D OMNITRACE_BUILD_PAPI=ON \
-D OMNITRACE_BUILD_LIBUNWIND=ON \
-D OMNITRACE_BUILD_DYNINST=ON \
-B rocprof-sys-build \
-D CMAKE_INSTALL_PREFIX=/opt/rocprofiler-systems \
-D ROCPROFSYS_USE_HIP=ON \
-D ROCPROFSYS_USE_ROCM_SMI=ON \
-D ROCPROFSYS_USE_ROCTRACER=ON \
-D ROCPROFSYS_USE_PYTHON=ON \
-D ROCPROFSYS_USE_OMPT=ON \
-D ROCPROFSYS_USE_MPI_HEADERS=ON \
-D ROCPROFSYS_BUILD_PAPI=ON \
-D ROCPROFSYS_BUILD_LIBUNWIND=ON \
-D ROCPROFSYS_BUILD_DYNINST=ON \
-D DYNINST_BUILD_TBB=ON \
-D DYNINST_BUILD_BOOST=ON \
-D DYNINST_BUILD_ELFUTILS=ON \
-D DYNINST_BUILD_LIBIBERTY=ON \
omnitrace-source
cmake --build omnitrace-build --target all --parallel 8
cmake --build omnitrace-build --target install
source /opt/omnitrace/share/omnitrace/setup-env.sh
rocprof-sys-source
cmake --build rocprof-sys-build --target all --parallel 8
cmake --build rocprof-sys-build --target install
source /opt/rocprofiler-systems/share/rocprofiler-systems/setup-env.sh
.. _mpi-support-omnitrace:
.. _mpi-support-rocprof-sys:
MPI support within Omnitrace
MPI support within ROCm Systems Profiler
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Omnitrace can have full (``OMNITRACE_USE_MPI=ON``) or partial (``OMNITRACE_USE_MPI_HEADERS=ON``) MPI support.
ROCm Systems Profiler can have full (``ROCPROFSYS_USE_MPI=ON``) or partial (``ROCPROFSYS_USE_MPI_HEADERS=ON``) MPI support.
The only difference between these two modes is whether or not the results collected
via Timemory and/or Perfetto can be aggregated into a single
output file during finalization. When full MPI support is enabled, combining the
Timemory results always occurs, whereas combining the Perfetto
results is configurable via the ``OMNITRACE_PERFETTO_COMBINE_TRACES`` setting.
results is configurable via the ``ROCPROFSYS_PERFETTO_COMBINE_TRACES`` setting.
The primary benefits of partial or full MPI support are the automatic wrapping
of MPI functions and the ability
@@ -298,13 +301,13 @@ instead of having to use the system process identifier (i.e. ``PID``).
In general, it's recommended to use partial MPI support with the OpenMPI
headers as this is the most portable configuration.
If full MPI support is selected, make sure your target application is built
against the same MPI distribution as Omnitrace.
For example, do not build Omnitrace with MPICH and use it on a target application built against OpenMPI.
against the same MPI distribution as ROCm Systems Profiler.
For example, do not build ROCm Systems Profiler with MPICH and use it on a target application built against OpenMPI.
If partial support is selected, the OpenMPI headers are recommended instead of the MPICH headers
because the ``MPI_COMM_WORLD`` in OpenMPI is a pointer to ``ompi_communicator_t`` (8 bytes),
whereas ``MPI_COMM_WORLD`` in MPICH is an ``int`` (4 bytes). Building Omnitrace with partial MPI support
whereas ``MPI_COMM_WORLD`` in MPICH is an ``int`` (4 bytes). Building ROCm Systems Profiler with partial MPI support
and the MPICH headers and then using
Omnitrace on an application built against OpenMPI causes a segmentation fault.
ROCm Systems Profiler on an application built against OpenMPI causes a segmentation fault.
This happens because the value of the ``MPI_COMM_WORLD`` is truncated
during the function wrapping before being passed along to the underlying MPI function.
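The truncation can be illustrated with plain shell arithmetic. This is only a sketch: the address below is a made-up 8-byte pointer value, and the helper name ``truncate_to_int32`` is hypothetical. Masking to the low 32 bits models what is left when such a value is squeezed through a 4-byte ``int``.

```shell
# Hypothetical helper: keep only the low 32 bits of a value, as happens when
# an 8-byte pointer is stored in a 4-byte int.
truncate_to_int32() {
    printf '0x%x\n' $(( $1 & 0xffffffff ))
}

# Made-up pointer value standing in for the address of OpenMPI's MPI_COMM_WORLD.
truncate_to_int32 0x7f3a12345678   # prints 0x12345678
```

The upper bytes (``0x7f3a`` in this example) are lost, so the value handed to the underlying MPI function is no longer the original address, which is consistent with the segmentation fault described above.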
@@ -313,8 +316,8 @@ during the function wrapping before being passed along to the underlying MPI fun
Post-installation steps
========================================
After installation, you can optionally configure the Omnitrace environment.
You should also test the executables to confirm Omnitrace is correctly installed.
After installation, you can optionally configure the ROCm Systems Profiler environment.
You should also test the executables to confirm ROCm Systems Profiler is correctly installed.
Configure the environment
-----------------------------------
@@ -323,14 +326,14 @@ If environment modules are available and preferred, add them using these command
.. code-block:: shell
module use /opt/omnitrace/share/modulefiles
module load omnitrace/1.0.0
module use /opt/rocprofiler-systems/share/modulefiles
module load rocprofiler-systems/1.0.0
Alternatively, you can directly source the ``setup-env.sh`` script:
.. code-block:: shell
source /opt/omnitrace/share/omnitrace/setup-env.sh
source /opt/rocprofiler-systems/share/rocprofiler-systems/setup-env.sh
Test the executables
-----------------------------------
@@ -340,8 +343,8 @@ issues locating the installed libraries:
.. code-block:: shell
omnitrace-instrument --help
omnitrace-avail --help
rocprof-sys-instrument --help
rocprof-sys-avail --help
.. note::
@@ -353,27 +356,27 @@ issues locating the installed libraries:
Post-installation troubleshooting
========================================
This section explains how to resolve certain issues that might happen when you first use Omnitrace.
This section explains how to resolve certain issues that might happen when you first use ROCm Systems Profiler.
Issues with RHEL and SELinux
----------------------------------------------------
RHEL (Red Hat Enterprise Linux) and related distributions of Linux automatically enable a security feature
named SELinux (Security-Enhanced Linux) that prevents Omnitrace from running.
named SELinux (Security-Enhanced Linux) that prevents ROCm Systems Profiler from running.
This issue applies to any Linux distribution with SELinux installed, including RHEL,
CentOS, Fedora, and Rocky Linux. The problem can happen with any GPU, or even without a GPU.
The problem occurs after you instrument a program and try to
run ``omnitrace-run`` with the instrumented program.
run ``rocprof-sys-run`` with the instrumented program.
.. code-block:: shell
g++ hello.cpp -o hello
omniperf-instrument -M sampling -o hello.instr -- ./hello
omnitrace-run -- ./hello.instr
rocprof-sys-instrument -M sampling -o hello.instr -- ./hello
rocprof-sys-run -- ./hello.instr
Instead of successfully running the binary with call-stack sampling,
Omnitrace crashes with a segmentation fault.
ROCm Systems Profiler crashes with a segmentation fault.
.. note::
@@ -412,4 +415,4 @@ Configuring PAPI to collect hardware counters
To use PAPI to collect the majority of hardware counters, ensure
the ``/proc/sys/kernel/perf_event_paranoid`` setting has a value less than or equal to ``2``.
For more information, see the :ref:`omnitrace_papi_events` section.
For more information, see the :ref:`rocprof-sys_papi_events` section.
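A minimal sketch for checking this setting before a run follows. The helper name ``paranoid_ok`` is made up for illustration; ``/proc/sys/kernel/perf_event_paranoid`` and ``sysctl kernel.perf_event_paranoid`` are standard Linux interfaces.

```shell
# Hypothetical helper: succeed when a perf_event_paranoid value is <= 2.
paranoid_ok() {
    [ "$1" -le 2 ]
}

paranoid_file=/proc/sys/kernel/perf_event_paranoid
if [ -r "$paranoid_file" ]; then
    level=$(cat "$paranoid_file")
    if paranoid_ok "$level"; then
        echo "perf_event_paranoid=$level: OK for most hardware counters"
    else
        echo "perf_event_paranoid=$level: too restrictive"
        echo "consider: sudo sysctl kernel.perf_event_paranoid=2"
    fi
fi
```

The ``sysctl`` change shown in the message lasts until the next reboot unless it is also added to the system's sysctl configuration.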
+34 -13
@@ -1,21 +1,22 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
:description: ROCm Systems Profiler quick start documentation and reference
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, profiler, quick start, getting started, quick install, tracking, visualization, tool, Instinct, accelerator, AMD
*************************************
Omnitrace quick start
ROCm Systems Profiler quick start
*************************************
To install Omnitrace, download the `Omnitrace installer <https://github.com/ROCm/omnitrace/releases/latest/download/omnitrace-install.py>`_
and specify ``--prefix <install-directory>``. The script attempts to auto-detect
the appropriate OS distribution and version. To include AMD ROCm Software support,
To install ROCm Systems Profiler, download the
`ROCm Systems Profiler installer <https://github.com/ROCm/rocprofiler-systems/releases/latest/download/rocprofiler-systems-install.py>`_
and specify ``--prefix <install-directory>``. The script attempts to auto-detect
the appropriate OS distribution and version. To include AMD ROCm Software support,
specify ``--rocm X.Y``, where ``X`` is the ROCm major
version and ``Y`` is the ROCm minor version, for example, ``--rocm 6.2``.
version and ``Y`` is the ROCm minor version, for example, ``--rocm 6.3``.
.. code-block:: shell
wget https://github.com/ROCm/omnitrace/releases/latest/download/omnitrace-install.py
python3 ./omnitrace-install.py --prefix /opt/omnitrace --rocm 6.2
wget https://github.com/ROCm/rocprofiler-systems/releases/latest/download/rocprofiler-systems-install.py
python3 ./rocprofiler-systems-install.py --prefix /opt/rocprofiler-systems --rocm 6.3
This script supports installation on Ubuntu, OpenSUSE, Red Hat, Debian, CentOS, and Fedora.
If the target OS is compatible with one of the operating system versions listed in
@@ -23,8 +24,28 @@ the comprehensive :doc:`Installation guidelines <./install>`,
specify ``-d <DISTRO> -v <VERSION>``. For example, if the OS is compatible with Ubuntu 22.04, pass
``-d ubuntu -v 22.04`` to the script.
.. note::
Install via package manager
============================
If you have ROCm version 6.2 or higher installed, you can use the
package manager to install a pre-built copy of Omnitrace using
``apt install omnitrace`` or ``dnf install omnitrace``.
If you have ROCm version 6.3 or higher installed, you can use the
package manager to install a pre-built copy of ROCm Systems Profiler.
.. tab-set::
.. tab-item:: Ubuntu
.. code-block:: shell
$ sudo apt install rocprofiler-systems
.. tab-item:: Red Hat Enterprise Linux
.. code-block:: shell
$ sudo dnf install rocprofiler-systems
.. tab-item:: SUSE Linux Enterprise Server
.. code-block:: shell
$ sudo zypper install rocprofiler-systems
+98 -98
@@ -1,122 +1,122 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
:description: ROCm Systems Profiler development documentation and reference
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, development, developers guide, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
****************************************************
Development guide
****************************************************
This guide discusses the `Omnitrace <https://github.com/ROCm/omnitrace>`_ design.
It includes a list of the executables and libraries, along with a discussion of the application's
This guide discusses the `ROCm Systems Profiler <https://github.com/ROCm/rocprofiler-systems>`_ design.
It includes a list of the executables and libraries, along with a discussion of the application's
memory, sampling, and time-window constraint models.
Executables
========================================
This section lists the Omnitrace executables.
This section lists the ROCm Systems Profiler executables.
omnitrace-avail: `source/bin/omnitrace-avail <https://github.com/ROCm/omnitrace/tree/amd-mainline/source/bin/omnitrace-avail>`_
rocprof-sys-avail: `source/bin/rocprof-sys-avail <https://github.com/ROCm/rocprofiler-systems/tree/amd-mainline/source/bin/rocprof-sys-avail>`_
------------------------------------------------------------------------------------------------------------------------------------------------------
The ``main`` routine of ``omnitrace-avail`` has three important sections:
The ``main`` routine of ``rocprof-sys-avail`` has three important sections:
* Printing components
* Printing options
* Printing hardware counters
omnitrace-sample: `source/bin/omnitrace-sample <https://github.com/ROCm/omnitrace/tree/amd-mainline/source/bin/omnitrace-sample>`_
rocprof-sys-sample: `source/bin/rocprof-sys-sample <https://github.com/ROCm/rocprofiler-systems/tree/amd-mainline/source/bin/rocprof-sys-sample>`_
------------------------------------------------------------------------------------------------------------------------------------------------------
* Requires a command-line format of ``omnitrace-sample <options> -- <command> <command-args>``
* Requires a command-line format of ``rocprof-sys-sample <options> -- <command> <command-args>``
* Translates command-line options into environment variables
* Adds ``libomnitrace-dl.so`` to ``LD_PRELOAD``
* Adds ``librocprof-sys-dl.so`` to ``LD_PRELOAD``
* Is launched by using ``execvpe`` with ``<command> <command-args>`` and a modified environment
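The launch sequence above can be approximated in a few lines of shell. This is a sketch under stated assumptions, not the actual implementation: the helper names, the ``EXAMPLE_SETTING`` variable, and the choice to prepend (rather than append) to ``LD_PRELOAD`` are all illustrative.

```shell
# Hypothetical helper: build a new LD_PRELOAD value.
# $1 = library to add, $2 = current LD_PRELOAD value (may be empty)
prepend_preload() {
    if [ -n "$2" ]; then
        printf '%s:%s\n' "$1" "$2"
    else
        printf '%s\n' "$1"
    fi
}

# Hypothetical launcher: $1 = library, remaining arguments = <command> <command-args>
launch_with_preload() {
    lib=$1
    shift
    (
        # The subshell keeps the caller's environment untouched.
        export EXAMPLE_SETTING=ON   # stand-in for a translated command-line option
        LD_PRELOAD=$(prepend_preload "$lib" "${LD_PRELOAD:-}")
        export LD_PRELOAD
        exec "$@"                   # replace the subshell with the target command
    )
}
```

For example, ``launch_with_preload librocprof-sys-dl.so ./hello`` would start ``./hello`` with the library preloaded, assuming the dynamic loader can find it.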
omnitrace-casual: `source/bin/omnitrace-causal <https://github.com/ROCm/omnitrace/tree/amd-mainline/source/bin/omnitrace-causal>`_
rocprof-sys-causal: `source/bin/rocprof-sys-causal <https://github.com/ROCm/rocprofiler-systems/tree/amd-mainline/source/bin/rocprof-sys-causal>`_
------------------------------------------------------------------------------------------------------------------------------------------------------
When there is exactly one causal profiling configuration variant (which enables debugging),
``omnitrace-casual`` has a nearly identical design to ``omnitrace-sample``
``rocprof-sys-causal`` has a nearly identical design to ``rocprof-sys-sample``.
When the command-line options produce more than one causal profiling configuration variant,
the following actions take place for each variant:
* ``omnitrace-causal`` calls ``fork()``
* ``rocprof-sys-causal`` calls ``fork()``
* the child process launches ``<command> <command-args>`` using ``execvpe``, which modifies the environment for the variant
* the parent process waits for the child process to finish
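The per-variant loop can be sketched in shell, where a backgrounded subshell plays the role of ``fork()`` and ``exec`` plays the role of ``execvpe``. The function name and the ``EXAMPLE_CAUSAL_VARIANT`` variable are hypothetical and stand in for whatever environment changes each variant actually requires.

```shell
# Hypothetical sketch: $1 = space-separated variant names,
# remaining arguments = <command> <command-args>
run_each_variant() {
    variants=$1
    shift
    for variant in $variants; do
        (
            # Child: adjust the environment for this variant, then exec the command.
            export EXAMPLE_CAUSAL_VARIANT="$variant"
            exec "$@"
        ) &
        # Parent: wait for the child to finish before starting the next variant.
        wait "$!"
    done
}

run_each_variant "a b c" sh -c 'echo "running variant $EXAMPLE_CAUSAL_VARIANT"'
```

Because the parent waits after every launch, the variants run one at a time, each in its own modified environment, and the parent's environment is left unchanged.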
omnitrace-instrument: `source/bin/omnitrace-instrument <https://github.com/ROCm/omnitrace/tree/amd-mainline/source/bin/omnitrace-instrument>`_
rocprof-sys-instrument: `source/bin/rocprof-sys-instrument <https://github.com/ROCm/rocprofiler-systems/tree/amd-mainline/source/bin/rocprof-sys-instrument>`_
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
* Requires a command-line format of ``omnitrace-instrument <options> -- <command> <command-args>``
* Allows the user to provide options specifying whether to perform runtime instrumentation, use binary rewrite, or
* Requires a command-line format of ``rocprof-sys-instrument <options> -- <command> <command-args>``
* Allows the user to provide options specifying whether to perform runtime instrumentation, use binary rewrite, or
attach to a process
* Either opens the instrumentation target (for binary rewrite), launches the target and stops it
before it starts executing ``main``, or attaches to a running executable and pauses it
* Finds all functions in the targets
* Finds ``libomnitrace-dl`` and locates the functions
* Iterates over and instruments all the functions, provided they satisfy the
* Finds ``librocprof-sys-dl`` and locates the functions
* Iterates over and instruments all the functions, provided they satisfy the
defined criteria (such as a minimum number of instructions)
* See the ``module_function`` class
* Until this point, the workflow has been the same for the different options,
* Until this point, the workflow has been the same for the different options,
but it diverges after instrumentation is complete:
* For a binary rewrite: it produces a new instrumented binary and exits
* For runtime instrumentation or attaching to a process: it instructs the application
* For runtime instrumentation or attaching to a process: it instructs the application
to resume and then waits for it to exit
Libraries
========================================
Common library: `source/lib/common <https://github.com/ROCm/omnitrace/tree/amd-mainline/source/lib/common>`_
Common library: `source/lib/common <https://github.com/ROCm/rocprofiler-systems/tree/amd-mainline/source/lib/common>`_
--------------------------------------------------------------------------------------------------------------------------------
* General header-only functionality used in multiple executables and/or libraries.
* General header-only functionality used in multiple executables and/or libraries.
* Not installed or exported outside of the build tree.
Core library: `source/lib/core <https://github.com/ROCm/omnitrace/tree/amd-mainline/source/lib/core>`_
Core library: `source/lib/core <https://github.com/ROCm/rocprofiler-systems/tree/amd-mainline/source/lib/core>`_
--------------------------------------------------------------------------------------------------------------------------------
* Static PIC library with functionality that does not depend on any components.
* Static PIC library with functionality that does not depend on any components.
* Not installed or exported outside of the build tree.
Binary library: `source/lib/binary <https://github.com/ROCm/omnitrace/tree/amd-mainline/source/lib/binary>`_
Binary library: `source/lib/binary <https://github.com/ROCm/rocprofiler-systems/tree/amd-mainline/source/lib/binary>`_
--------------------------------------------------------------------------------------------------------------------------------
* Static PIC library with functionality for reading/analyzing binary info.
* Mostly used by the causal profiling sections of ``libomnitrace``.
* Mostly used by the causal profiling sections of ``librocprof-sys``.
* Not installed or exported outside of the build tree.
libomnitrace: `source/lib/omnitrace <https://github.com/ROCm/omnitrace/tree/amd-mainline/source/lib/omnitrace>`_
librocprof-sys: `source/lib/rocprof-sys <https://github.com/ROCm/rocprofiler-systems/tree/amd-mainline/source/lib/rocprof-sys>`_
--------------------------------------------------------------------------------------------------------------------------------
This is the main library encapsulating all the capabilities.
libomnitrace-dl: `source/lib/omnitrace-dl <https://github.com/ROCm/omnitrace/tree/amd-mainline/source/lib/omnitrace-dl>`_
librocprof-sys-dl: `source/lib/rocprof-sys-dl <https://github.com/ROCm/rocprofiler-systems/tree/amd-mainline/source/lib/rocprof-sys-dl>`_
--------------------------------------------------------------------------------------------------------------------------------
This is a lightweight, front-end library for ``libomnitrace`` which serves three primary purposes:
This is a lightweight, front-end library for ``librocprof-sys`` which serves three primary purposes:
* Dramatically speeds up instrumentation time compared to using ``libomnitrace`` directly because
Dyninst must parse the entire library in order to find the instrumentation functions
(a ``dlopen`` call is made on ``libomnitrace`` when the instrumentation functions get called)
* Prevents re-entry if ``libomnitrace`` calls an instrumented function internally
* Coordinates communication between ``libomnitrace-user`` and ``libomnitrace``
* Dramatically speeds up instrumentation time compared to using ``librocprof-sys`` directly because
Dyninst must parse the entire library in order to find the instrumentation functions
(a ``dlopen`` call is made on ``librocprof-sys`` when the instrumentation functions get called)
* Prevents re-entry if ``librocprof-sys`` calls an instrumented function internally
* Coordinates communication between ``librocprof-sys-user`` and ``librocprof-sys``
libomnitrace-user: `source/lib/omnitrace-user <https://github.com/ROCm/omnitrace/tree/amd-mainline/source/lib/omnitrace-user>`_
librocprof-sys-user: `source/lib/rocprof-sys-user <https://github.com/ROCm/rocprofiler-systems/tree/amd-mainline/source/lib/rocprof-sys-user>`_
--------------------------------------------------------------------------------------------------------------------------------
* Provides a set of functions and types for the users to add to their code, for example,
disabling data collection globally or on a specific thread or
user-defined region
* If ``libomnitrace-dl`` is not loaded, the user API is effectively a set of no-op function calls.
* If ``librocprof-sys-dl`` is not loaded, the user API is effectively a set of no-op function calls.
Testing tools
========================================
* `CDash Testing Dashboard <https://my.cdash.org/index.php?project=Omnitrace>`_ (requires a login)
* `CDash Testing Dashboard <https://my.cdash.org/index.php?project=rocprofiler-systems>`_ (requires a login)
Components
========================================
@@ -124,34 +124,34 @@ Components
Most measurements and capabilities are encapsulated into a "component" with the following definitions:
Measurement
A recording of some data relevant to performance, for instance, the current call-stack,
A recording of some data relevant to performance, for instance, the current call-stack,
hardware counter values, current memory usage, or timestamp
Capability
Handles the implementation or orchestration of some feature which is used
to collect measurements, for example, a component which handles setting up function wrappers
Handles the implementation or orchestration of some feature which is used
to collect measurements, for example, a component which handles setting up function wrappers
around various functions such as ``pthread_create`` or ``MPI_Init``.
Components are designed to either hold no data at all or only the data for both an instantaneous
Components are designed to either hold no data at all or only the data for both an instantaneous
measurement and a phase measurement.
Components which store data typically implement a static ``record()`` function
Components which store data typically implement a static ``record()`` function
for getting a record of the measurement,
``start()`` and ``stop()`` member functions for calculating a phase measurement,
``start()`` and ``stop()`` member functions for calculating a phase measurement,
and a ``sample()`` member function for storing an
instantaneous measurement. In reality, there are several more "standard" functions
instantaneous measurement. In reality, there are several more "standard" functions
but these are the most commonly-used ones.
Components which do not store data might also have ``start()``, ``stop()``, and ``sample()``
Components which do not store data might also have ``start()``, ``stop()``, and ``sample()``
functions. However, components which
implement function wrappers typically provide a call operator or ``audit(...)``
implement function wrappers typically provide a call operator or ``audit(...)``
functions. These are invoked with the
wrapped function's arguments before the wrapped function gets called and with the return value
wrapped function's arguments before the wrapped function gets called and with the return value
after the wrapped function gets called.
.. note::
The goal of this design is to provide relatively small and reusable lightweight objects
The goal of this design is to provide relatively small and reusable lightweight objects
for recording measurements and implementing capabilities.
Wall-clock component example
@@ -195,7 +195,7 @@ A component for computing the elapsed wall-clock time looks like this:
Function wrapper component example
--------------------------------------
A component which implements wrappers around ``fork()`` and ``exit(int)`` (and stores no data)
A component which implements wrappers around ``fork()`` and ``exit(int)`` (and stores no data)
could look like this:
.. code-block:: cpp
@@ -219,7 +219,7 @@ could look like this:
void operator()(const gotcha_data&, void (*real_exit)(int), int _exit_code)
{
// catch the call to exit and finalize before truly exiting
omnitrace_finalize();
rocprofsys_finalize();
real_exit(_exit_code);
}
@@ -298,22 +298,22 @@ Collected data is generally handled in one of the three following ways:
* It is managed implicitly by Timemory and accessed as needed
* As thread-local data
In general, only instrumentation for relatively simple data is directly passed to
In general, only instrumentation for relatively simple data is directly passed to
Perfetto and/or Timemory during runtime.
For example, the callbacks from binary instrumentation, user API instrumentation,
For example, the callbacks from binary instrumentation, user API instrumentation,
and roctracer directly invoke
calls to Perfetto or Timemory's storage model. Otherwise, the data is stored
by Omnitrace in the thread-data model
calls to Perfetto or Timemory's storage model. Otherwise, the data is stored
by ROCm Systems Profiler in the thread-data model
which is more persistent than simply using ``thread_local`` static data, which gets deleted
when the thread stops.
Thread identification
--------------------------------------
Each CPU thread is assigned two integral identifiers. One identifier, the ``internal_value``, is
Each CPU thread is assigned two integral identifiers. One identifier, the ``internal_value``, is
atomically incremented every time a new thread is created.
The other identifier, known as the ``sequent_value``, tries to account for the fact that Omnitrace, Perfetto, ROCm, and other applications
start background threads. When a thread is created as a by-product of Omnitrace,
The other identifier, known as the ``sequent_value``, tries to account for the fact that ROCm Systems Profiler, Perfetto, ROCm, and other applications
start background threads. When a thread is created as a by-product of ROCm Systems Profiler,
the index is offset by a large value. This serves
two purposes:
@@ -325,88 +325,88 @@ The ``sequent_value`` identifier is typically used to access the thread-data.
Thread-data class
--------------------------------------
Currently, most thread data is effectively stored in a static
``std::array<std::unique_ptr<T>, OMNITRACE_MAX_THREADS>`` instance.
``OMNITRACE_MAX_THREADS`` is a value defined at compile time and set to ``2048``
Currently, most thread data is effectively stored in a static
``std::array<std::unique_ptr<T>, ROCPROFSYS_MAX_THREADS>`` instance.
``ROCPROFSYS_MAX_THREADS`` is a value defined at compile time and set to ``2048``
for release builds. During finalization,
Omnitrace iterates through the thread-data and transforms that data
ROCm Systems Profiler iterates through the thread-data and transforms that data
into something that can be passed along to Perfetto and/or Timemory.
The downside of the current model is that if the user exceeds ``OMNITRACE_MAX_THREADS``,
The downside of the current model is that if the user exceeds ``ROCPROFSYS_MAX_THREADS``,
a segmentation fault occurs. To fix this issue,
a new model is being adopted which has all the benefits of this model
a new model is being adopted which has all the benefits of this model
but permits dynamic expansion.
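The fixed-size storage described above might look roughly like the following sketch. The names ``max_threads``, ``thread_data``, ``instances``, and ``get`` are illustrative assumptions rather than the actual class layout. Unlike the behavior described in the text, this sketch returns ``nullptr`` for an out-of-range index instead of faulting.

```cpp
#include <array>
#include <cstddef>
#include <memory>

// Stand-in for ROCPROFSYS_MAX_THREADS (2048 for release builds per the text).
constexpr std::size_t max_threads = 2048;

// Hypothetical simplification of the thread-data storage for a type T.
template <typename T>
struct thread_data
{
    using array_type = std::array<std::unique_ptr<T>, max_threads>;

    // One static array per data type, indexed by the thread identifier.
    static array_type& instances()
    {
        static array_type _v{};
        return _v;
    }

    // Returns the slot for a thread, creating it on first use.
    // An index past the fixed capacity yields nullptr rather than
    // an out-of-bounds access.
    static T* get(std::size_t sequent_value)
    {
        if(sequent_value >= max_threads) return nullptr;
        auto& slot = instances()[sequent_value];
        if(!slot) slot = std::unique_ptr<T>(new T{});
        return slot.get();
    }
};
```

Because the array is static and owns its elements through ``std::unique_ptr``, the data outlives the thread that produced it, which is the persistence property that plain ``thread_local`` data lacks.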
Sampling model
========================================
The general structure for the sampling is within Timemory (``source/timemory/sampling``).
The general structure for the sampling is within Timemory (``source/timemory/sampling``).
Currently, all sampling is done per-thread
via POSIX timers. Omnitrace supports both a real-time timer and a CPU-time timer.
via POSIX timers. ROCm Systems Profiler supports both a real-time timer and a CPU-time timer.
Both have adjustable frequencies, delays, and durations.
By default, only CPU-time sampling is enabled. Initial settings are inherited from
the settings starting with ``OMNITRACE_SAMPLING_``.
By default, only CPU-time sampling is enabled. Initial settings are inherited from
the settings starting with ``ROCPROFSYS_SAMPLING_``.
For each type of timer, timer-specific settings can be used to
override the common and inherited timer settings.
These settings begin with ``OMNITRACE_SAMPLING_CPUTIME`` for the CPU-time sampler
and ``OMNITRACE_SAMPLING_REALTIME`` for
the real-time sampler. For example, ``OMNITRACE_SAMPLING_FREQ=500`` initially sets the
sampling frequency to 500 interrupts per second. Adding the setting ``OMNITRACE_SAMPLING_REALTIME_FREQ=10``
For each type of timer, timer-specific settings can be used to
override the common and inherited timer settings.
These settings begin with ``ROCPROFSYS_SAMPLING_CPUTIME`` for the CPU-time sampler
and ``ROCPROFSYS_SAMPLING_REALTIME`` for
the real-time sampler. For example, ``ROCPROFSYS_SAMPLING_FREQ=500`` initially sets the
sampling frequency to 500 interrupts per second. Adding the setting ``ROCPROFSYS_SAMPLING_REALTIME_FREQ=10``
lowers the sampling frequency for the real-time sampler
to 10 interrupts per second of real-time.
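The override order described above can be sketched as a small lookup. This is an illustration only: the function ``effective_sampling_freq`` and the ``settings_map`` type are hypothetical, and the key ``ROCPROFSYS_SAMPLING_CPUTIME_FREQ`` is assumed by analogy with the real-time setting named in the text.

```cpp
#include <map>
#include <string>

using settings_map = std::map<std::string, double>;

// Hypothetical resolver: `timer` is "CPUTIME" or "REALTIME".
// A timer-specific setting wins over the common setting, which in turn
// wins over the supplied fallback.
double
effective_sampling_freq(const settings_map& settings, const std::string& timer,
                        double fallback)
{
    auto specific = settings.find("ROCPROFSYS_SAMPLING_" + timer + "_FREQ");
    if(specific != settings.end()) return specific->second;

    auto common = settings.find("ROCPROFSYS_SAMPLING_FREQ");
    if(common != settings.end()) return common->second;

    return fallback;
}
```

With ``ROCPROFSYS_SAMPLING_FREQ=500`` and ``ROCPROFSYS_SAMPLING_REALTIME_FREQ=10`` both present, the real-time lookup resolves to 10 while the CPU-time lookup falls back to the common value of 500.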
The Omnitrace-specific implementation can be found in
`source/lib/omnitrace/library/sampling.cpp <https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/sampling.cpp>`_.
Within `sampling.cpp <https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/sampling.cpp>`_,
The ROCm Systems Profiler-specific implementation can be found in
`source/lib/rocprof-sys/library/sampling.cpp <https://github.com/ROCm/rocprofiler-systems/blob/main/source/lib/rocprof-sys/library/sampling.cpp>`_.
Within `sampling.cpp <https://github.com/ROCm/rocprofiler-systems/blob/main/source/lib/rocprof-sys/library/sampling.cpp>`_,
there is a bundle of three sampling components:
* `backtrace_timestamp <https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/components/backtrace_timestamp.hpp>`_ simply
* `backtrace_timestamp <https://github.com/ROCm/rocprofiler-systems/blob/main/source/lib/rocprof-sys/library/components/backtrace_timestamp.hpp>`_ simply
records the wall-clock time of the sample.
* `backtrace <https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/components/backtrace.hpp>`_
* `backtrace <https://github.com/ROCm/rocprofiler-systems/blob/main/source/lib/rocprof-sys/library/components/backtrace.hpp>`_
records the call-stack via libunwind.
* `backtrace_metrics <https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/components/backtrace_metrics.hpp>`_
* `backtrace_metrics <https://github.com/ROCm/rocprofiler-systems/blob/main/source/lib/rocprof-sys/library/components/backtrace_metrics.hpp>`_
records the sample metrics, such as peak RSS and the hardware counters.
These three components are bundled together in
These three components are bundled together in
a tuple-like ``struct`` (``tuple<backtrace_timestamp, backtrace, backtrace_metrics>``).
A buffer of at least 1024 instances of this tuple is mapped using ``mmap``
per-thread. When this buffer is full,
A buffer of at least 1024 instances of this tuple is mapped using ``mmap``
per-thread. When this buffer is full,
the sampler hands the buffer off to its allocator thread and maps a new buffer with ``mmap``
before taking the next sample. The allocator thread takes this data
and either dynamically stores it in memory or writes it to a file depending on the
value of ``OMNITRACE_USE_TEMPORARY_FILES``.
This schema avoids all allocations in the signal handler, lets the data grow
dynamically, avoids potentially slow I/O within the signal handler, and also enables
before taking the next sample. The allocator thread takes this data
and either dynamically stores it in memory or writes it to a file depending on the
value of ``ROCPROFSYS_USE_TEMPORARY_FILES``.
This schema avoids all allocations in the signal handler, lets the data grow
dynamically, avoids potentially slow I/O within the signal handler, and also enables
the capability of avoiding I/O altogether.
The maximum number of samplers handled by each allocator is governed by the
``OMNITRACE_SAMPLING_ALLOCATOR_SIZE`` setting (the default is eight). Whenever an allocator
The maximum number of samplers handled by each allocator is governed by the
``ROCPROFSYS_SAMPLING_ALLOCATOR_SIZE`` setting (the default is eight). Whenever an allocator
has reached its limit,
a new internal thread is created to handle the new samplers.
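As a rough illustration of the rule above, the number of allocator threads grows by one each time the per-allocator limit is reached. The function name below is hypothetical, and the sketch only models the arithmetic, not the actual thread management.

```cpp
#include <cstddef>

// Hypothetical helper: how many allocator threads are needed for a given
// number of samplers when each allocator handles at most `allocator_size`
// samplers (eight by default, per the text).
std::size_t
allocator_threads_needed(std::size_t samplers, std::size_t allocator_size = 8)
{
    if(allocator_size == 0) return 0;  // guard against division by zero
    return (samplers + allocator_size - 1) / allocator_size;  // ceiling division
}
```

For example, eight samplers fit in one allocator, while a ninth sampler causes a second allocator thread to be created.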
Time-window constraint model
========================================
With the recent introduction of tracing delay and duration, the
`constraint namespace <https://github.com/ROCm/omnitrace/blob/main/source/lib/core/constraint.hpp>`_
was introduced to improve the management of delays and duration limits for
With the recent introduction of tracing delay and duration, the
`constraint namespace <https://github.com/ROCm/rocprofiler-systems/blob/main/source/lib/core/constraint.hpp>`_
was introduced to improve the management of delays and duration limits for
data collection. The ``spec`` class accepts a clock identifier, a delay value, a duration value, and an
integer indicating how many times to repeat the delay and duration cycle. It is therefore
integer indicating how many times to repeat the delay and duration cycle. It is therefore
possible to perform tasks such as periodically enabling tracing for brief periods
of time in between long periods without data collection while the application runs. The
syntax follows the format ``clock_identifier:delay:capture_duration:cycles``, so a value of
syntax follows the format ``clock_identifier:delay:capture_duration:cycles``, so a value of
``10:1:3`` for the last three parameters represents the following sequence of operations:
* Ten seconds where no data is collected, then one second where it is
* Ten seconds where no data is collected, then one second where it is
* Ten seconds where no data is collected, then one second where it is
* Ten seconds where no data is collected, then one second where it is
* Ten seconds where no data is collected, then one second where it is
* Stop
As another example, ``OMNITRACE_TRACE_PERIODS = realtime:10:1:5 process_cputime:10:2:20`` translates
As another example, ``ROCPROFSYS_TRACE_PERIODS = realtime:10:1:5 process_cputime:10:2:20`` translates
to this sequence:
* Five cycles of: no data collection for ten seconds of real-time followed by one second of data collection
* Twenty cycles of: no data collection for ten seconds of process CPU time followed by two CPU-time seconds of data collection
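To make the ``clock_identifier:delay:capture_duration:cycles`` format concrete, the following sketch parses a whitespace-separated list of such entries. It is not the parser from the ``constraint`` namespace: the ``period_spec`` struct and both function names are assumptions for illustration, and accepting fractional seconds is a choice made here rather than documented behavior.

```cpp
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical plain-data view of one entry in the format
// clock_identifier:delay:capture_duration:cycles
struct period_spec
{
    std::string   clock_id;
    double        delay;     // seconds without data collection
    double        duration;  // seconds with data collection
    unsigned long cycles;    // how many times to repeat delay + duration
};

// Parses a single entry; returns false if it is malformed.
bool
parse_period(const std::string& text, period_spec& out)
{
    std::vector<std::string> fields;
    std::istringstream       ss(text);
    std::string              field;
    while(std::getline(ss, field, ':'))
        fields.push_back(field);
    if(fields.size() != 4) return false;

    try
    {
        out.clock_id = fields[0];
        out.delay    = std::stod(fields[1]);
        out.duration = std::stod(fields[2]);
        out.cycles   = std::stoul(fields[3]);
    } catch(const std::exception&)
    {
        return false;  // non-numeric or out-of-range field
    }
    return true;
}

// Parses a whitespace-separated list of entries, skipping malformed ones.
std::vector<period_spec>
parse_periods(const std::string& text)
{
    std::vector<period_spec> specs;
    std::istringstream       ss(text);
    std::string              entry;
    while(ss >> entry)
    {
        period_spec spec;
        if(parse_period(entry, spec)) specs.push_back(spec);
    }
    return specs;
}
```

Applied to ``realtime:10:1:5 process_cputime:10:2:20``, this yields two entries: five cycles on the real-time clock and twenty cycles on the process CPU-time clock.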
Eventually, the goal is to migrate all subsets of data collection which currently support
Eventually, the goal is to migrate all subsets of data collection which currently support
more rudimentary models of time window constraints, such as process sampling and causal profiling,
to this model.
@@ -1,40 +1,40 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
:description: ROCm Systems Profiler glossary and reference
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, glossary, terminology, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
******************************
Omnitrace Glossary
ROCm Systems Profiler Glossary
******************************
This topic explains the terminology necessary to use Omnitrace.
The list below provides a basic glossary for those who
are new to binary instrumentation. It also clarifies ambiguities
when certain terms have different
contextual meanings, for example, the Omnitrace meaning of the term "module"
This topic explains the terminology necessary to use ROCm Systems Profiler.
The list below provides a basic glossary for those who
are new to binary instrumentation. It also clarifies ambiguities
when certain terms have different
contextual meanings, for example, the ROCm Systems Profiler meaning of the term "module"
when instrumenting Python.
**Binary**
A file written in the Executable and Linkable Format (ELF). This is the standard file
A file written in the Executable and Linkable Format (ELF). This is the standard file
format for executable files, shared libraries, etc.
**Binary instrumentation**
Inserting callbacks to instrumentation into an existing binary. This can be performed
Inserting callbacks to instrumentation into an existing binary. This can be performed
statically or dynamically.
**Static binary instrumentation**
Loads an existing binary, determines instrumentation points, and generates a new binary
with instrumentation directly embedded. It is applicable to executables and libraries but
Loads an existing binary, determines instrumentation points, and generates a new binary
with instrumentation directly embedded. It is applicable to executables and libraries but
limited to only the functions defined in the binary. This is also known as **Binary rewrite**.
**Dynamic binary instrumentation**
Loads an existing binary into memory, inserts instrumentation, and runs the binary.
It is limited to executables but is capable of instrumenting linked libraries.
Loads an existing binary into memory, inserts instrumentation, and runs the binary.
It is limited to executables but is capable of instrumenting linked libraries.
This is also known as **Runtime instrumentation**.
**Statistical sampling**
At periodic intervals, the application is paused and the current call-stack of the CPU
is recorded along with various other metrics. It uses timers that measure either
(A) real clock time or (B) the CPU time used by the current thread and the CPU time
**Statistical sampling**
At periodic intervals, the application is paused and the current call-stack of the CPU
is recorded along with various other metrics. It uses timers that measure either
(A) real clock time or (B) the CPU time used by the current thread and the CPU time
expended on behalf of the thread by the system. This is also known as simply **sampling**.
**Sampling rate**
@@ -45,12 +45,12 @@ when instrumenting Python.
* How long to wait before (A) and (B) begin triggering at their designated rate
**Sampling duration**
* The amount of time (in real-time) after the start of the application to record samples.
* The amount of time (in real-time) after the start of the application to record samples.
* After this time limit has been reached, no more samples are recorded.
**Process sampling**
At periodic (real-time) intervals, a background thread records global metrics without
interrupting the current process. These metrics include, but are not limited to:
At periodic (real-time) intervals, a background thread records global metrics without
interrupting the current process. These metrics include, but are not limited to:
CPU frequency, CPU memory high-water mark (i.e. peak memory usage), GPU temperature,
and GPU power usage.
@@ -62,41 +62,41 @@ when instrumenting Python.
* How long to wait (in real-time) before recording samples
**Sampling duration**
* The amount of time (in real-time) after the start of the application to record samples.
* The amount of time (in real-time) after the start of the application to record samples.
* After this time limit has been reached, no more samples are recorded.
**Module**
With respect to binary instrumentation, a module is defined as either the filename
(such as ``foo.c``) or library name (``libfoo.so``) which contains the definition
With respect to binary instrumentation, a module is defined as either the filename
(such as ``foo.c``) or library name (``libfoo.so``) which contains the definition
of one or more functions.
With respect to Python instrumentation, a module is defined as the **file** which contains
the definition of one or more functions. The full path to this file typically contains the
With respect to Python instrumentation, a module is defined as the **file** which contains
the definition of one or more functions. The full path to this file typically contains the
name of the "Python module".
**Basic block**
A straight-line code sequence with no branches in (except for the entry) and
A straight-line code sequence with no branches in (except for the entry) and
no branches out (except for the exit).
**Address range**
The instructions for a function in a binary start at a certain address within the ELF file
The instructions for a function in a binary start at a certain address within the ELF file
and end at a certain address. The range is ``end - start``.
The address range is a decent approximation for the "cost" of a function.
The address range is a decent approximation for the "cost" of a function.
For example, a larger address range approximately equates to more instructions.
**Instrumentation traps**
On the x86 architecture, because instructions are of variable size, an instruction
might be too small for Dyninst to replace it with the normal code sequence
used to call instrumentation. When instrumentation is placed at points other
than subroutine entry, exit, or call points, traps may be used to ensure
the instrumentation fits. (By default, ``omnitrace-instrument`` avoids instrumentation
On the x86 architecture, because instructions are of variable size, an instruction
might be too small for Dyninst to replace it with the normal code sequence
used to call instrumentation. When instrumentation is placed at points other
than subroutine entry, exit, or call points, traps may be used to ensure
the instrumentation fits. (By default, ``rocprof-sys-instrument`` avoids instrumentation
which requires a trap.)
**Overlapping functions**
Due to language constructs or compiler optimizations, it might be possible for
multiple functions to overlap (that is, share part of the same function body)
or for a single function to have multiple entry points. In practice, it's
impossible to determine the difference between multiple overlapping functions
and a single function with multiple entry points. (By default, ``omnitrace-instrument``
Due to language constructs or compiler optimizations, it might be possible for
multiple functions to overlap (that is, share part of the same function body)
or for a single function to have multiple entry points. In practice, it's
impossible to determine the difference between multiple overlapping functions
and a single function with multiple entry points. (By default, ``rocprof-sys-instrument``
avoids instrumenting overlapping functions.)
+19 -19
@@ -6,18 +6,18 @@ defaults:
root: index
subtrees:
- entries:
- file: what-is-omnitrace.rst
- file: what-is-rocprof-sys.rst
- caption: Install
entries:
- file: install/quick-start.rst
title: Omnitrace quick start
title: ROCm Systems Profiler quick start
- file: install/install.rst
title: Omnitrace installation guide
title: ROCm Systems Profiler installation guide
- caption: Tutorials
entries:
- url: https://github.com/ROCm/omnitrace/tree/amd-mainline/examples
- url: https://github.com/ROCm/rocprofiler-systems/tree/amd-mainline/examples
title: GitHub examples
- file: tutorials/video-tutorials.rst
title: Video tutorials
@@ -25,37 +25,37 @@ subtrees:
- caption: How to
entries:
- file: how-to/configuring-validating-environment.rst
title: Configuring and validating the environment
title: Configuring and validating the environment
- file: how-to/configuring-runtime-options.rst
title: Configuring runtime options
title: Configuring runtime options
- file: how-to/sampling-call-stack.rst
title: Sampling the call stack
title: Sampling the call stack
- file: how-to/instrumenting-rewriting-binary-application.rst
title: Instrumenting and rewriting a binary application
- file: how-to/performing-causal-profiling.rst
title: Performing causal profiling
- file: how-to/understanding-omnitrace-output.rst
title: Understanding the Omnitrace output
title: Performing causal profiling
- file: how-to/understanding-rocprof-sys-output.rst
title: Understanding the ROCm Systems Profiler output
- file: how-to/profiling-python-scripts.rst
title: Profiling Python scripts
- file: how-to/using-omnitrace-api.rst
title: Using the Omnitrace API
- file: how-to/general-tips-using-omnitrace.rst
title: General tips for using Omnitrace
title: Profiling Python scripts
- file: how-to/using-rocprof-sys-api.rst
title: Using the ROCm Systems Profiler API
- file: how-to/general-tips-using-rocprof-sys.rst
title: General tips for using ROCm Systems Profiler
- caption: Conceptual
entries:
- file: conceptual/data-collection-modes.rst
title: Data collection modes
- file: conceptual/omnitrace-feature-set.rst
title: The Omnitrace feature set and use cases
- file: conceptual/rocprof-sys-feature-set.rst
title: The ROCm Systems Profiler feature set and use cases
- caption: Reference
entries:
- file: reference/development-guide.rst
title: Development guide
- file: reference/omnitrace-glossary.rst
title: Omnitrace glossary
- file: reference/rocprof-sys-glossary.rst
title: ROCm Systems Profiler glossary
- file: doxygen/html/files
title: API library
- file: doxygen/html/functions
+6 -3
@@ -1,11 +1,14 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
:description: ROCm Systems Profiler video documentation and reference
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, profiler, video, tutorial, demonstration, tracking, visualization, tool, Instinct, accelerator, AMD
****************************************************
Video tutorials
****************************************************
The following video tutorials provide a visual guide to using ROCm Systems Profiler.
They were recorded using the former name of the tool, Omnitrace, but the content is still applicable.
Installing a binary release
========================================
@@ -20,7 +23,7 @@ Instrumenting a binary
<p align="center"><iframe width="560" height="315" src="https://www.youtube.com/embed/2B0gRr3FygQ?modestbranding=1" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>
Writing an Omnitrace configuration file
Writing a ROCm Systems Profiler configuration file
==================================================
.. raw:: html
@@ -1,18 +1,18 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
:description: ROCm Systems Profiler introduction, explanation, and reference
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, profiler, explanation, introduction, what is, tracking, visualization, tool, Instinct, accelerator, AMD
******************************
What is Omnitrace?
What is ROCm Systems Profiler?
******************************
Omnitrace is designed for the high-level profiling and comprehensive tracing
ROCm Systems Profiler is designed for the high-level profiling and comprehensive tracing
of applications running on the CPU or the CPU and GPU. It supports dynamic binary
instrumentation, call-stack sampling, and various other features for determining
which function and line number are currently executing.
A visualization of the comprehensive Omnitrace results can be observed in any modern
web browser. Upload the Perfetto (``.proto``) output files produced by Omnitrace at
A visualization of the comprehensive ROCm Systems Profiler results can be observed in any modern
web browser. Upload the Perfetto (``.proto``) output files produced by ROCm Systems Profiler at
`ui.perfetto.dev <https://ui.perfetto.dev/>`_ to see the details.
.. important::
@@ -26,7 +26,7 @@ JSON files for programmatic analysis. The JSON output files are compatible with
the performance data into pandas data frames and facilitates multi-run comparisons, filtering,
and visualization in Jupyter notebooks.
To use Omnitrace for instrumentation, follow these two configuration steps:
To use ROCm Systems Profiler for instrumentation, follow these two configuration steps:
#. Indicate the functions and modules to :doc:`instrument <./how-to/instrumenting-rewriting-binary-application>` in the target binaries, including the executable and any libraries
#. Specify the :doc:`instrumentation parameters <./how-to/configuring-runtime-options>` to use when the instrumented binaries are launched
@@ -1,5 +0,0 @@
/build*
/_build
/_doxygen
/.gitinfo
/omnitrace.dox
@@ -1,20 +0,0 @@
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
@@ -1,53 +0,0 @@
# About
```eval_rst
.. toctree::
:glob:
:maxdepth: 4
```
## Overview
> ***[OmniTrace](https://github.com/ROCm/omnitrace) is an AMD open source research project and is not supported as part of the ROCm software stack.***
[Browse OmniTrace source code on Github](https://github.com/ROCm/omnitrace)
[OmniTrace](https://github.com/ROCm/omnitrace) is designed for both high-level profiling and
comprehensive tracing of applications running on the CPU or the CPU+GPU via dynamic binary instrumentation,
call-stack sampling, and various other means for determining currently executing function and line information.
Visualization of the comprehensive omnitrace results can be viewed in any modern web browser by visiting
[ui.perfetto.dev](https://ui.perfetto.dev/) and loading the perfetto output (`.proto` files) produced by omnitrace.
Aggregated high-level results are available in text files for human consumption and JSON files for programmatic analysis.
The JSON output files are compatible with the python package [hatchet](https://github.com/hatchet/hatchet) which converts
the performance data into pandas dataframes and facilitates multi-run comparisons, filtering, visualization in Jupyter notebooks,
and much more.
[OmniTrace](https://github.com/ROCm/omnitrace) has two distinct configuration steps when instrumenting:
1. Configuring which functions and modules are instrumented in the target binaries (i.e. executable and/or libraries)
- [Instrumenting with OmniTrace](instrumenting.md)
2. Configuring what the instrumentation does when the instrumented binaries are executed
- [Customizing OmniTrace Runtime](runtime.md)
## OmniTrace Use Cases
When analyzing the performance of an application, ***it is always best to NOT assume you know where the performance bottlenecks are***
***and why they are happening.*** OmniTrace is a ***tool for the entire execution of application***. It is the sort of tool which is
ideal for *characterizing* where optimization would have the greatest impact on the end-to-end execution of the application and/or
viewing what else is happening on the system during a performance bottleneck.
Especially when GPUs are involved, there is a tendency to assume that the quickest path to performance improvement is minimizing
the runtime of the GPU kernels. This is a highly flawed assumption: if you optimize the runtime of a kernel from 1 millisecond
to 1 microsecond (1000x speed-up) but the original application *never spent time waiting* for kernel(s) to complete,
you will see zero statistically significant speed-up in end-to-end runtime of your application. In other words, it does not matter
how fast or slow the code on GPU is if the application is not bottlenecked waiting on the GPU.
Use OmniTrace to obtain a high-level view of the entire application. Use it to determine where the performance bottlenecks are and
obtain clues to why these bottlenecks are happening. If you want ***extensive*** insight into the execution of individual kernels
on the GPU, AMD Research is working on another tool for this but you should start with the tool which characterizes the
broad picture: OmniTrace.
With regard to the CPU, OmniTrace does not target any specific vendor; it works just as well with non-AMD CPUs as with AMD CPUs.
With regard to the GPU, OmniTrace is currently restricted to the HIP and HSA APIs and kernels executing on AMD GPUs.
@@ -1,535 +0,0 @@
# Causal Profiling
```eval_rst
.. toctree::
:glob:
:maxdepth: 3
```
## What is "Causal Profiling"?
> ***If you speed up a given block of code by X%, the application will execute Y% faster***
Causal profiling directs parallel application developers to where they should focus their optimization
efforts by quantifying the potential impact of optimizations. Causal profiling is rooted in the concept
that *software execution speed is relative*: speeding up a block of code by X% is mathematically equivalent
to that block of code running at its current speed while all the other code runs slower by X%.
Thus, causal profiling works by performing experiments on blocks of code during program execution which
insert pauses to slow down all other concurrently running code. During post-processing, these experiments
are translated into calculations for the potential impact of speeding up this block of code.
Consider the following C++ code executing `foo` and `bar` concurrently in two different threads
where `foo` is 30% faster than `bar` (ideally):
```cpp
#include <cstddef>
#include <thread>

constexpr size_t FOO_N = 7 * 1000000000UL;
constexpr size_t BAR_N = 10 * 1000000000UL;

void foo()
{
    for(volatile size_t i = 0; i < FOO_N; ++i) {}
}

void bar()
{
    for(volatile size_t i = 0; i < BAR_N; ++i) {}
}

int main()
{
    std::thread _threads[] = { std::thread{ foo },
                               std::thread{ bar } };
    for(auto& itr : _threads)
        itr.join();
}
```
No matter how many optimizations are applied to `foo`, the application will always require the same amount of time
because the end-to-end performance is limited by `bar`. However, a 5% speedup in `bar` will result in the
end-to-end performance improving by 5% and this trend will continue linearly (10% speedup in `bar` yields 10% speedup in
end-to-end performance, and so on) up to 30% speedup, at which point, `bar` executes as fast as `foo`;
any speedup to `bar` beyond 30% will still only yield an end-to-end performance speedup of 30% since the application
will be limited by performance of `foo`, as demonstrated below in the causal profiling visualization:
![foobar-causal-plot](images/causal-foobar.png)
The full details of the causal profiling methodology can be found in the paper [Coz: Finding Code that Counts with Causal Profiling](http://arxiv.org/pdf/1608.03676v1.pdf).
The author's implementation is publicly available on [GitHub](https://github.com/plasma-umass/coz).
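The `foo`/`bar` reasoning above can be checked with a small model. This is only an illustrative sketch, not part of OmniTrace: it assumes the two threads run fully in parallel, that their runtimes are proportional to `FOO_N` and `BAR_N`, and that a "speedup" of X% means X% less runtime. The names `FOO_TIME`, `BAR_TIME`, and `end_to_end_gain` are hypothetical.

```python
# Illustrative model only: relative runtimes mirror FOO_N and BAR_N above.
FOO_TIME = 7.0
BAR_TIME = 10.0


def end_to_end_gain(bar_speedup_pct: float) -> float:
    """Percent reduction in total runtime when only bar is sped up."""
    baseline = max(FOO_TIME, BAR_TIME)
    new_bar = BAR_TIME * (1.0 - bar_speedup_pct / 100.0)
    # The threads run concurrently, so the slower one sets the total runtime.
    new_total = max(FOO_TIME, new_bar)
    return 100.0 * (baseline - new_total) / baseline


for pct in (0, 5, 10, 30, 50):
    print(f"bar sped up by {pct:>2}% -> end-to-end gain {end_to_end_gain(pct):.1f}%")
```

Under these assumptions the gain tracks the speedup of `bar` linearly and then flattens at 30%, which is the shape the causal profiling plot is expected to show.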
## Getting Started
### Progress Points
Causal profiling requires "progress points" to track progress through the code in between samples. Progress points must be triggered deterministically via instrumentation.
This can happen in three different ways:
1. OmniTrace can leverage the callbacks from Kokkos-Tools, OpenMP-Tools, roctracer, etc. and the wrappers around functions for MPI, NUMA, RCCL, etc. to act as progress-points
2. User can leverage the [runtime instrumentation capabilities](instrumenting.md#runtime-instrumentation) to insert progress-points (NOTE: binary rewrite to insert progress-points is not supported)
3. User can leverage the [User API](user_api.md), e.g. `OMNITRACE_CAUSAL_PROGRESS`
Please note with regard to #2, binary rewrite to insert progress-points is not supported: when a rewritten binary is executed, Dyninst translates the instruction pointer address in order
to execute the instrumentation and, as a result, call-stack samples never return instruction pointer addresses in the ranges defined as valid by OmniTrace. Hopefully, a work-around will
be found in the future.
### Key Concepts
| Concept | Setting | Options | Description |
|------------------|-----------------------------------|----------------------------------|--------------------------------------------------------------------------------------------------------------------|
| Backend | `OMNITRACE_CAUSAL_BACKEND` | `perf`, `timer` | Backend for recording samples required to calculate the virtual speed-up |
| Mode | `OMNITRACE_CAUSAL_MODE` | `function`, `line` | Select entire function or individual line of code for causal experiments |
| End-to-End | `OMNITRACE_CAUSAL_END_TO_END` | boolean | Perform a single experiment during the entire run (does not require progress-points) |
| Fixed speedup(s) | `OMNITRACE_CAUSAL_FIXED_SPEEDUP` | one or more values from [0, 100] | Virtual speedup or pool of virtual speedups to randomly select |
| Binary scope | `OMNITRACE_CAUSAL_BINARY_SCOPE` | regular expression(s) | Dynamic binaries containing code for experiments |
| Source scope | `OMNITRACE_CAUSAL_SOURCE_SCOPE` | regular expression(s) | `<file>` and/or `<file>:<line>` containing code to include in experiments |
| Function scope | `OMNITRACE_CAUSAL_FUNCTION_SCOPE` | regular expression(s) | Restricts experiments to matching functions (function mode) or lines of code within matching functions (line mode) |
#### Notes
1. Binary scope defaults to `%MAIN%` (executable). Scope can be expanded to include linked libraries
2. `<file>` and `<file>:<line>` support requires debug info (i.e. code was compiled with `-g` or, preferably, `-g3`)
3. Function mode does not require debug info but does not support stripped binaries
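As a rough aid for the settings above, the following sketch validates a few values before a run. It is not an OmniTrace API: the helper `check_causal_settings` is hypothetical, and it only encodes the options listed in the table (plus the `auto` backend value mentioned in the Backends section).

```python
# Allowed values taken from the table above; "auto" is described under Backends.
VALID_BACKENDS = {"auto", "perf", "timer"}
VALID_MODES = {"function", "line"}


def check_causal_settings(backend, mode, speedups):
    """Return a list of problems; an empty list means the values look valid."""
    problems = []
    if backend not in VALID_BACKENDS:
        problems.append(f"unknown backend: {backend!r}")
    if mode not in VALID_MODES:
        problems.append(f"unknown mode: {mode!r}")
    for value in speedups:
        # Fixed speedups must lie in the closed range [0, 100].
        if not 0 <= value <= 100:
            problems.append(f"speedup out of range: {value}")
    return problems


print(check_causal_settings("perf", "function", [0, 10, 25, 50]))
print(check_causal_settings("ptrace", "line", [150]))
```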
### Backends
Both causal profiling backends interrupt each thread 1000x per second of CPU-time to apply virtual speedups.
The difference between the backends is how the samples which are responsible for calculating the virtual speedup are recorded.
There are 3 key differences between the two backends:
1. `perf` backend requires Linux Perf and elevated security privileges
2. `perf` backend interrupts the application less frequently whereas the `timer` backend will interrupt the application 1000x per second of realtime
3. `timer` backend has less accurate call-stacks due to instruction pointer skid
In general, the `"perf"` backend is preferred over the `"timer"` backend when sufficient security privileges permit its usage.
If `"OMNITRACE_CAUSAL_BACKEND"` is set to `"auto"`, Omnitrace will fall back to using the `"timer"` backend only if
using the `"perf"` backend fails; if `"OMNITRACE_CAUSAL_BACKEND"` is set to `"perf"` and using this backend fails, Omnitrace
will abort.
#### Instruction Pointer Skid
Instruction pointer (IP) skid is how many instructions execute between an event of interest
happening and where the IP is when the kernel is able to stop the application.
For the `"timer"` backend, this translates to the
difference between the IP when the timer generated a signal and the IP when the
signal was actually generated. Although IP skid does still occur with the `"perf"` backend,
the overhead of pausing the entire thread with the `"timer"` backend makes this much more pronounced
and, as such, the `"timer"` backend tends to have a lower resolution than the `"perf"` backend,
especially in `"line"` mode.
#### Installing Linux Perf
Linux Perf is built into the kernel and may already be installed (e.g., included in the default kernel for OpenSUSE).
The official method of checking whether Linux Perf is installed is checking for the existence of the file
`/proc/sys/kernel/perf_event_paranoid` -- if the file exists, the kernel has Perf installed.
If this file does not exist, on Debian-based systems like Ubuntu, install (as superuser):
```console
apt-get install linux-tools-common linux-tools-generic linux-tools-$(uname -r)
```
and reboot your computer. In order to use the `"perf"` backend, the value of `/proc/sys/kernel/perf_event_paranoid`
should be <= 2. If the value in this file is greater than 2, you will likely be unable to use the perf backend.
To update the paranoid level temporarily (until the system is rebooted), run one of the following methods
as a superuser (where `PARANOID_LEVEL=<N>` with `<N>` in the range `[-1, 2]`):
```console
echo ${PARANOID_LEVEL} | sudo tee /proc/sys/kernel/perf_event_paranoid
sysctl kernel.perf_event_paranoid=${PARANOID_LEVEL}
```
To make the paranoid level persistent after a reboot, add `kernel.perf_event_paranoid=<N>`
(where `<N>` is the desired paranoid level) to the `/etc/sysctl.conf` file.
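The checks described above can be scripted. The sketch below is an assumption-labeled illustration rather than anything shipped with OmniTrace: the function name `perf_backend_status` and its return strings are hypothetical, and it only applies the two rules stated in this section (the file must exist, and its value should be at most 2).

```python
from pathlib import Path

PARANOID_FILE = Path("/proc/sys/kernel/perf_event_paranoid")


def perf_backend_status(path: Path = PARANOID_FILE) -> str:
    """Classify the host using the rules described in this section."""
    if not path.exists():
        # No paranoid file means the kernel does not expose Linux Perf.
        return "perf not installed"
    level = int(path.read_text().strip())
    if level > 2:
        return f"paranoid level {level} is too high"
    return f"paranoid level {level} should allow the perf backend"


print(perf_backend_status())
```

The `path` parameter exists only so the logic can be exercised against a temporary file instead of the real `/proc` entry.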
### Speedup Prediction Variability and `omnitrace-causal` Executable
Causal profiling typically requires executing the application several times in order to adequately sample all the domains of executing code, experiment speedups, etc. and resolve statistical fluctuations.
The `omnitrace-causal` executable is designed to simplify running this procedure:
```console
$ omnitrace-causal --help
[omnitrace-causal] Usage: ./bin/omnitrace-causal [ --help (count: 0, dtype: bool)
--version (count: 0, dtype: bool)
--monochrome (max: 1, dtype: bool)
--debug (max: 1, dtype: bool)
--verbose (count: 1)
--config (min: 0, dtype: filepath)
--launcher (count: 1, dtype: executable)
--generate-configs (min: 0, dtype: folder)
--no-defaults (min: 0, dtype: bool)
--mode (count: 1, dtype: string)
--output-name (min: 1, dtype: filename)
--reset (max: 1, dtype: bool)
--end-to-end (max: 1, dtype: bool)
--wait (count: 1, dtype: seconds)
--duration (count: 1, dtype: seconds)
--iterations (count: 1, dtype: int)
--speedups (min: 0, dtype: integers)
--binary-scope (min: 0, dtype: integers)
--source-scope (min: 0, dtype: integers)
--function-scope (min: 0, dtype: regex-list)
--binary-exclude (min: 0, dtype: integers)
--source-exclude (min: 0, dtype: integers)
--function-exclude (min: 0, dtype: regex-list)
]
Causal profiling usually requires multiple runs to reliably resolve the speedup estimates.
This executable is designed to streamline that process.
For example (assume all commands end with '-- <exe> <args>'):
omnitrace-causal -n 5 -- <exe> # runs <exe> 5x with causal profiling enabled
omnitrace-causal -s 0 5,10,15,20 # runs <exe> 2x with virtual speedups:
# - 0
# - randomly selected from 5, 10, 15, and 20
omnitrace-causal -F func_A func_B func_(A|B) # runs <exe> 3x with the function scope limited to:
# 1. func_A
# 2. func_B
# 3. func_A or func_B
General tips:
- Insert progress points at hotspots in your code or use omnitrace's runtime instrumentation
- Note: binary rewrite will produce a incompatible new binary
- Run omnitrace-causal in "function" mode first (does not require debug info)
- Run omnitrace-causal in "line" mode when you are targeting one function (requires debug info)
- Preferably, use predictions from the "function" mode to determine which function to target
- Limit the virtual speedups to a smaller pool, e.g., 0,5,10,25,50, to get reliable predictions quicker
- Make use of the binary, source, and function scope to limit the functions/lines selected for experiments
- Note: source scope requires debug info
Options:
-h, -?, --help Shows this page
--version Prints the version and exit
[DEBUG OPTIONS]
--monochrome Disable colorized output
--debug Debug output
-v, --verbose Verbose output
[GENERAL OPTIONS]
-c, --config Base configuration file
-l, --launcher When running MPI jobs, omnitrace-causal needs to be *before* the executable which launches the MPI processes (i.e.
before `mpirun`, `srun`, etc.). Pass the name of the target executable (or a regex for matching to the name of the
target) for causal profiling, e.g., `omnitrace-causal -l foo -- mpirun -n 4 foo`. This ensures that the omnitrace
library is LD_PRELOADed on the proper target
-g, --generate-configs Generate config files instead of passing environment variables directly. If no arguments are provided, the config files
will be placed in ${PWD}/omnitrace-causal-config folder
--no-defaults Do not activate default features which are recommended for causal profiling. For example: PID-tagging of output files
and timestamped subdirectories are disabled by default. Kokkos tools support is added by default
(OMNITRACE_USE_KOKKOSP=ON) because, for Kokkos applications, the Kokkos-Tools callbacks are used for progress points.
Activation of OpenMP tools support is similar
[CAUSAL PROFILING OPTIONS (General)]
(These settings will be applied to all causal profiling runs)
-m, --mode [ function (func) | line ]
Causal profiling mode
-o, --output-name Output filename of causal profiling data w/o extension
-r, --reset Overwrite any existing experiment results during the first run
-e, --end-to-end Single causal experiment for the entire application runtime
-w, --wait Set the wait time (i.e. delay) before starting the first causal experiment (in seconds)
-d, --duration Set the length of time (in seconds) to perform causal experimentationafter the first experiment is started. Once this
amount of time has elapsed, no more causal experiments will be started but any currently running experiment will be
allowed to finish.
-n, --iterations Number of times to repeat the combination of run configurations
[CAUSAL PROFILING OPTIONS (Combinatorial)]
(Each individual argument to these options will multiply the number runs by the number of arguments and the number of
iterations. E.g. -n 2 -B "MAIN" -F "foo" "bar" will produce 4 runs: 2 iterations x 1 binary scope x 2 function scopes
(MAIN+foo, MAIN+bar, MAIN+foo, MAIN+bar))
-s, --speedups Pool of virtual speedups to sample from during experimentation. Each space designates a group and multiple speedups can
be grouped together by commas, e.g. -s 0 0,10,20-50 is two groups: group #1 is '0' and group #2 is '0 10 20 25 30 35 40
45 50'
-B, --binary-scope Restricts causal experiments to the binaries matching the list of regular expressions. Each space designates a group
and multiple scopes can be grouped together with a semi-colon
-S, --source-scope Restricts causal experiments to the source files or source file + lineno pairs (i.e. <file> or <file>:<line>) matching
the list of regular expressions. Each space designates a group and multiple scopes can be grouped together with a
semi-colon
-F, --function-scope Restricts causal experiments to the functions matching the list of regular expressions. Each space designates a group
and multiple scopes can be grouped together with a semi-colon
-BE, --binary-exclude Excludes causal experiments from being performed on the binaries matching the list of regular expressions. Each space
designates a group and multiple excludes can be grouped together with a semi-colon
-SE, --source-exclude Excludes causal experiments from being performed on the code from the source files or source file + lineno pair (i.e.
<file> or <file>:<line>) matching the list of regular expressions. Each space designates a group and multiple excludes
can be grouped together with a semi-colon
-FE, --function-exclude Excludes causal experiments from being performed on the functions matching the list of regular expressions. Each space
designates a group and multiple excludes can be grouped together with a semi-colon
```
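The `--speedups` grouping rules in the help text can be sketched as follows. This is not the tool's parser: `expand_speedup_group` is a hypothetical helper, and the step of 5 for ranges such as `20-50` is inferred from the single example in the help output.

```python
def expand_speedup_group(group: str, step: int = 5) -> list[int]:
    """Expand one comma-separated group such as "0,10,20-50"."""
    values = []
    for item in group.split(","):
        if "-" in item:
            # A range like "20-50" expands inclusively (assumed step of 5).
            first, last = (int(part) for part in item.split("-"))
            values.extend(range(first, last + 1, step))
        else:
            values.append(int(item))
    return values


# Spaces separate groups; commas join values within one group.
groups = [expand_speedup_group(g) for g in "0 0,10,20-50".split()]
print(groups)  # [[0], [0, 10, 20, 25, 30, 35, 40, 45, 50]]
```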
#### Examples
```bash
#!/bin/bash -e
module load omnitrace
N=20
I=3
# when providing speedups to omnitrace-causal, speedup
# groups are separated by a space so "0,10" results in
# one speedup group where omnitrace samples from
# the speedup set of {0, 10}. Passing "0 10" (without
# quotes) to omnitrace-causal multiplies the
# number of runs by 2, where the first half of the
# runs instruct omnitrace to only use 0 as the
# speedup and the second half of the runs instruct
# omnitrace to only use 10 as the speedup.
SPEEDUPS="0,0,0,10,20,30,40,50,50,75,75,75,90,90,90"
# thus, -s ${SPEEDUPS} only multiplies the number
# of runs by 1 whereas -s ${SPEEDUPS_E2E} multiplies
# the number of runs by 15:
# - 3 runs with speedup of 0
# - 1 run for each of the speedups 10, 20, 30, and 40
# - 2 runs with speedup of 50
# - 3 runs with speedup of 75
# - 3 runs with speedup of 90
SPEEDUPS_E2E=$(echo "${SPEEDUPS}" | sed 's/,/ /g')
# 20 iterations in function mode with 1 speedup group
# and source scope set to .cpp files
#
# outputs to files:
# - causal/experiments.func.coz
# - causal/experiments.func.json
#
# total executions: 20
#
omnitrace-causal \
-n ${N} \
-s ${SPEEDUPS} \
-m function \
-o experiments.func \
-S ".*\\.cpp" \
-- \
./causal-omni-cpu "${@}"
# 20 iterations in line mode with 1 speedup group
# and source scope restricted to lines 100 and 110
# in the causal.cpp file.
#
# outputs to files:
# - causal/experiments.line.coz
# - causal/experiments.line.json
#
# total executions: 20
#
omnitrace-causal \
-n ${N} \
-s ${SPEEDUPS} \
-m line \
-o experiments.line \
-S "causal\\.cpp:(100|110)" \
-- \
./causal-omni-cpu "${@}"
# 3 iterations in function mode of 15 singular speedups
# in end-to-end mode with 2 different function scopes
# where one is restricted to "cpu_slow_func" and
# another is restricted to "cpu_fast_func".
#
# outputs to files:
# - causal/experiments.func.e2e.coz
# - causal/experiments.func.e2e.json
#
# total executions: 90
#
omnitrace-causal \
-n ${I} \
-s ${SPEEDUPS_E2E} \
-m func \
-e \
-o experiments.func.e2e \
-F "cpu_slow_func" \
"cpu_fast_func" \
-- \
./causal-omni-cpu "${@}"
# 3 iterations in line mode of 15 singular speedups
# in end-to-end mode with 2 different source scopes
# where one is restricted to line 100 in causal.cpp
# and another is restricted to line 110 in causal.cpp.
#
# outputs to files:
# - causal/experiments.line.e2e.coz
# - causal/experiments.line.e2e.json
#
# total executions: 90
#
omnitrace-causal \
-n ${I} \
-s ${SPEEDUPS_E2E} \
-m line \
-e \
-o experiments.line.e2e \
-S "causal\\.cpp:100" \
"causal\\.cpp:110" \
-- \
./causal-omni-cpu "${@}"
export OMP_NUM_THREADS=8
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
# set number of iterations to 5
N=5
# 5 iterations in function mode of 1 speedup
# group with the source scope restricted
# to files containing "lulesh" in their filename
# and exclude functions which start with "Kokkos::"
# or "std::enable_if".
#
# outputs to files:
# - causal/experiments.func.coz
# - causal/experiments.func.json
#
# total executions: 5
#
# First of 5 executions overwrites any
# existing causal/experiments.func.(coz|json)
# file due to "--reset" argument
#
omnitrace-causal \
--reset \
-n ${N} \
-s ${SPEEDUPS} \
-m func \
-o experiments.func \
-S "lulesh.*" \
-FE "^(Kokkos::|std::enable_if)" \
-- \
./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p
# 5 iterations in line mode of 1 speedup
# group with the source scope restricted
# to files containing "lulesh" in their filename
# and exclude functions which start with "exec_range"
# or "execute" and which contain either
# "construct_shared_allocation" or "._omp_fn." in
# the function name.
#
# outputs to files:
# - causal/experiments.line.coz
# - causal/experiments.line.json
#
# total executions: 5
#
# First of 5 executions overwrites any
# existing causal/experiments.line.(coz|json)
# file due to "--reset" argument
#
omnitrace-causal \
--reset \
-n ${N} \
-s ${SPEEDUPS} \
-m line \
-o experiments.line \
-S "lulesh.*" \
-FE "^(exec_range|execute);construct_shared_allocation;\\._omp_fn\\." \
-- \
./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p
# 5 iterations in line mode of 1 speedup
# group with the source scope restricted
# to files whose basename is "lulesh.cc"
# for 3 different functions:
# - ApplyMaterialPropertiesForElems
# - CalcHourglassControlForElems
# - CalcVolumeForceForElems
#
# outputs to files:
# - causal/experiments.line.targeted.coz
# - causal/experiments.line.targeted.json
#
# total executions: 15
#
# First of 5 executions overwrites any
# existing causal/experiments.line.targeted.(coz|json)
# file due to "--reset" argument
#
omnitrace-causal \
--reset \
-n ${N} \
-s ${SPEEDUPS} \
-m line \
-o experiments.line.targeted \
-F "ApplyMaterialPropertiesForElems" \
"CalcHourglassControlForElems" \
"CalcVolumeForceForElems" \
-S "lulesh\\.cc" \
-- \
./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p
```
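The "total executions" figures quoted in the script comments follow from the combinatorial rule in the help text: iterations multiplied by the number of groups given to each combinatorial option. A minimal sketch of that arithmetic, using a hypothetical helper `total_runs`:

```python
from math import prod


def total_runs(iterations: int, *group_counts: int) -> int:
    """Iterations times the number of groups for each combinatorial option."""
    # An option that is not passed contributes a factor of 1.
    return iterations * prod(max(count, 1) for count in group_counts)


print(total_runs(20, 1, 1))  # 20 iterations, 1 speedup group, 1 source scope
print(total_runs(3, 15, 2))  # 3 iterations, 15 speedup groups, 2 scopes
print(total_runs(5, 1, 3))   # 5 iterations, 1 speedup group, 3 function scopes
```

These reproduce the 20, 90, and 15 executions stated in the comments.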
#### Using `omnitrace-causal` with other launchers (e.g. `mpirun`)
The `omnitrace-causal` executable is intended to assist with application replay and is designed to always be at the start of the command-line (i.e. the primary process).
`omnitrace-causal` typically adds a `LD_PRELOAD` of the OmniTrace libraries into the environment before launching the command in order to inject the functionality
required to start the causal profiling tooling. However, this is problematic when the target application for causal profiling requires another command-line
tool in order to run, e.g. `foo` is the target application but executing `foo` requires `mpirun -n 2 foo`. If one were to simply do `omnitrace-causal -- mpirun -n 2 foo`,
then the causal profiling would be applied to `mpirun` instead of `foo`. `omnitrace-causal` remedies this by providing a command-line option `-l` / `--launcher`
to indicate the target application is using a launcher script/executable. The argument to the command-line option is the name of (or regex for) the target application
on the command-line. When `--launcher` is used, `omnitrace-causal` will generate all the replay configurations and execute them but delay adding the `LD_PRELOAD`; instead, it
will inject a call to itself into the command-line right before the target application. This recursive call to itself will inherit the configuration from
the parent `omnitrace-causal` executable, insert an `LD_PRELOAD` into the environment, and then invoke an `execv` to replace itself with the new process launched by the target
application.
In other words, the following command:
```console
omnitrace-causal -l foo -n 3 -- mpirun -n 2 foo
```
Effectively results in:
```console
mpirun -n 2 omnitrace-causal -- foo
mpirun -n 2 omnitrace-causal -- foo
mpirun -n 2 omnitrace-causal -- foo
```
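The command-line rewrite described above can be illustrated with a short sketch. It is a simplified model of the documented behavior, not the actual `omnitrace-causal` implementation: the function `inject_before_target` is hypothetical, and matching the pattern with `re.search` against each argument is an assumption.

```python
import re


def inject_before_target(command, target_pattern,
                         wrapper=("omnitrace-causal", "--")):
    """Insert the wrapper right before the first argument matching the target."""
    for index, arg in enumerate(command):
        if re.search(target_pattern, arg):
            return command[:index] + list(wrapper) + command[index:]
    raise ValueError(f"no argument matches {target_pattern!r}")


rewritten = inject_before_target(["mpirun", "-n", "2", "foo"], "foo")
print(" ".join(rewritten))  # mpirun -n 2 omnitrace-causal -- foo
```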
### Visualizing the Causal Output
OmniTrace generates a `causal/experiments.json` and `causal/experiments.coz` in `${OMNITRACE_OUTPUT_PATH}/${OMNITRACE_OUTPUT_PREFIX}`. A standalone GUI for viewing the causal profiling
results is under development. Until it is available, visit [plasma-umass.org/coz/](https://plasma-umass.org/coz/) and open the `*.coz` file.
## OmniTrace vs. Coz
This section is intended for readers who are familiar with the [Coz profiler](https://github.com/plasma-umass/coz).
OmniTrace provides several additional features and utilities for causal profiling:
| | [Coz](https://github.com/plasma-umass/coz) | [OmniTrace](https://github.com/ROCm/omnitrace) | Notes |
|----------------------|:-------------------------------------------------------------------:|:----------------------------------------------------------:|-------------------------------|
| Debug info | requires debug info in DWARF v3 format (`-gdwarf-3`) | optional, supports any DWARF format version | See Note #1 below |
| Experiment selection | `<file>:<line>` | `<function>` or `<file>:<line>` | See Note #2 below |
| Experiment speedups  | Randomly samples b/t 0..100 in increments of 5 or one fixed speedup | Supports specifying smaller subset                         | See Note #3 below             |
| Scope options | Supports binary and source scopes | Supports binary, source, and function scopes | See Note #4, #5, and #6 below |
| Scope inclusion | Uses `%` as wildcard for binary and source scopes | Full regex support for binary, source, and function scopes | |
| Scope exclusion | Not supported | Supports regexes for excluding binary/source/function | See Note #7 below |
| Call-stack sampling | Linux perf | Linux perf, libunwind | See Note #8 below |
1. OmniTrace supports a "function" mode which does not require debug info
2. OmniTrace supports selecting entire range of instruction pointers for a function instead of instruction pointer for one line. In large codes, "function" mode
can resolve in fewer iterations and once a target function is identified, one can switch to line mode and limit the function scope to the target function
3. OmniTrace supports randomly sampling from subsets, e.g. { 0, 0, 5, 10 } where 0% is randomly selected 50% of the time and 5% and 10% are each randomly selected 25% of the time
4. OmniTrace and COZ have same definition for binary scope: the binaries loaded at runtime (e.g. executable and linked libraries)
5. OmniTrace "source scope" supports both `<file>` and `<file>:<line>` formats in contrast to COZ "source scope" which requires `<file>:<line>` format
6. OmniTrace supports a "function" scope which narrows the functions/lines which are eligible for causal experiments to those within the matching functions
7. OmniTrace supports a second filter on scopes for removing binary/source/function caught by inclusive match, e.g. `BINARY_SCOPE=.*` + `BINARY_EXCLUDE=libmpi.*`
initially includes all binaries but exclude regex removes MPI libraries
8. In Omnitrace, the Linux perf backend is preferred over using libunwind. However, Linux perf usage can be restricted for security reasons.
Omnitrace will fall back to using a second POSIX timer and libunwind if Linux perf is not available.
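Note #3 can be made concrete with a short calculation. The sketch below assumes each entry of the pool is equally likely to be picked, so repeating a value raises its selection probability; `selection_probabilities` is a hypothetical helper, not an OmniTrace function.

```python
from collections import Counter
from fractions import Fraction


def selection_probabilities(pool):
    """Probability of each speedup when one pool entry is drawn uniformly."""
    counts = Counter(pool)
    return {value: Fraction(n, len(pool)) for value, n in counts.items()}


for value, prob in selection_probabilities([0, 0, 5, 10]).items():
    print(f"{value:>3}% speedup selected with probability {prob}")
```

The same reasoning explains why a pool that repeats 0 several times spends more experiments on the baseline.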
@@ -1,169 +0,0 @@
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# http://www.sphinx-doc.org/en/master/config
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# sys.path.insert(0, os.path.abspath('.'))
import os
import sys
import subprocess as sp
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
sys.path.insert(0, os.path.abspath(".."))
def install(package):
sp.call([sys.executable, "-m", "pip", "install", package])
# Check if we're running on Read the Docs' servers
read_the_docs_build = os.environ.get("READTHEDOCS", None) == "True"
# -- Project information -----------------------------------------------------
project = "omnitrace"
copyright = "2022, Advanced Micro Devices, Inc."
author = "Audacious Software Group"
project_root = os.path.normpath(os.path.join(os.getcwd(), "..", ".."))
version = open(os.path.join(project_root, "VERSION")).read().strip()
# The full version, including alpha/beta/rc tags
release = version
_docdir = os.path.realpath(os.getcwd())
_srcdir = os.path.realpath(os.path.join(os.getcwd(), ".."))
_sitedir = os.path.realpath(os.path.join(os.getcwd(), "..", "site"))
_staticdir = os.path.realpath(os.path.join(_docdir, "_static"))
_templatedir = os.path.realpath(os.path.join(_docdir, "_templates"))
if not os.path.exists(_staticdir):
os.makedirs(_staticdir)
if not os.path.exists(_templatedir):
os.makedirs(_templatedir)
# -- General configuration ---------------------------------------------------
install("sphinx_rtd_theme")
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
"sphinx.ext.autodoc",
"sphinx.ext.doctest",
"sphinx.ext.todo",
"sphinx.ext.viewcode",
"sphinx.ext.githubpages",
"sphinx.ext.mathjax",
"sphinx.ext.autosummary",
"sphinx.ext.napoleon",
"sphinx_markdown_tables",
"recommonmark",
"breathe",
]
source_suffix = {
".rst": "restructuredtext",
".md": "markdown",
}
from recommonmark.parser import CommonMarkParser
source_parsers = {".md": CommonMarkParser}
# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]
# The master toctree document.
master_doc = "index"
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
default_role = None
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = "sphinx_rtd_theme"
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["_static"]
html_theme_options = {
"analytics_id": "G-1HLBBRSTT9", # Provided by Google in your dashboard
"analytics_anonymize_ip": False,
"logo_only": False,
"display_version": True,
"prev_next_buttons_location": "bottom",
"style_external_links": False,
"vcs_pageview_mode": "",
# 'style_nav_header_background': 'white',
# Toc options
"collapse_navigation": True,
"sticky_navigation": True,
"navigation_depth": 4,
"includehidden": True,
"titles_only": False,
}
# Breathe Configuration
breathe_projects = {"omnitrace": "_doxygen/xml"}
breathe_default_project = "omnitrace"
breathe_default_members = ("members",)
breathe_projects_source = {
"omnitrace": (
os.path.join(project_root, "source", "lib", "omnitrace-user"),
[
"omnitrace/types.h",
"omnitrace/categories.h",
"omnitrace/user.h",
"omnitrace/causal.h",
],
)
}
from pygments.styles import get_all_styles
# The name of the Pygments (syntax highlighting) style to use.
styles = list(get_all_styles())
preferences = ("emacs", "pastie", "colorful")
for pref in preferences:
    if pref in styles:
        pygments_style = pref
        break
from recommonmark.transform import AutoStructify
# app setup hook
def setup(app):
    app.add_config_value(
        "recommonmark_config",
        {
            "auto_toc_tree_section": "Contents",
            "enable_eval_rst": True,
            "enable_auto_doc_ref": False,
        },
        True,
    )
    app.add_transform(AutoStructify)
-10
@@ -1,10 +0,0 @@
# Critical Trace Support
```eval_rst
.. toctree::
:glob:
:maxdepth: 4
```
Critical trace support has been superseded by causal profiling support.
Critical trace support was removed in Omnitrace v1.11.0 due to incomplete implementation.
-307
@@ -1,307 +0,0 @@
# Development Guide
## Miscellaneous Info
- [CDash Testing Dashboard](https://my.cdash.org/index.php?project=Omnitrace)
  - requires login to view
## Executables
### omnitrace-avail: [source/bin/omnitrace-avail](https://github.com/ROCm/omnitrace/tree/amd-mainline/source/bin/omnitrace-avail)
The `main` function of `omnitrace-avail` has three important sections:
1. Printing components
2. Printing options
3. Printing hardware counters
### omnitrace-sample: [source/bin/omnitrace-sample](https://github.com/ROCm/omnitrace/tree/amd-mainline/source/bin/omnitrace-sample)
General design:
- Requires a command-line format of `omnitrace-sample <options> -- <command> <command-args>`
- Translates command line options into environment variables
- Adds `libomnitrace-dl.so` to `LD_PRELOAD`
- Application is launched via `execvpe` with `<command> <command-args>` and modified environment
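
The launcher design above can be sketched in a few lines of Python. This is only an illustration of the pattern (translate options, prepend the library to `LD_PRELOAD`, then `exec`), not the tool's actual C++ implementation. The `OPTION_TO_ENV` mapping and the function names are hypothetical.

```python
import os

# Hypothetical option-to-variable mapping; the real tool defines its own.
OPTION_TO_ENV = {
    "freq": "OMNITRACE_SAMPLING_FREQ",
    "delay": "OMNITRACE_SAMPLING_DELAY",
}
PRELOAD_LIB = "libomnitrace-dl.so"


def build_environment(options, base_env):
    """Return a copy of base_env with options translated and the library preloaded."""
    env = dict(base_env)
    for name, value in options.items():
        env[OPTION_TO_ENV[name]] = str(value)
    existing = env.get("LD_PRELOAD", "")
    # LD_PRELOAD takes a colon- or space-separated list; put our library first.
    env["LD_PRELOAD"] = f"{PRELOAD_LIB}:{existing}" if existing else PRELOAD_LIB
    return env


def launch(options, command):
    """Replace the current process with the target command (does not return)."""
    os.execvpe(command[0], command, build_environment(options, os.environ))
```

Keeping the environment construction separate from the `exec` call makes the translation step easy to inspect without launching anything.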
### omnitrace-causal: [source/bin/omnitrace-causal](https://github.com/ROCm/omnitrace/tree/amd-mainline/source/bin/omnitrace-causal)
The design is nearly identical to [omnitrace-sample](#omnitrace-sample-sourcebinomnitrace-sample) when
there is exactly one causal profiling configuration variant (this enables debugging).
When more than one causal profiling configuration variant is produced from the command-line options,
for each variant:
- `omnitrace-causal` calls `fork()`
- the child process launches `<command> <command-args>` via `execvpe` with the environment modified for that variant
- the parent process waits for the child process to finish
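
The fork-per-variant loop can be sketched as follows. This is a minimal Python illustration of the pattern on a POSIX system, not the actual implementation. The function name `run_variants` and the idea of representing a variant as a dictionary of environment overrides are assumptions made for the example.

```python
import os
import sys


def run_variants(command, variants):
    """Run command once per variant environment; return the exit codes in order."""
    exit_codes = []
    for overrides in variants:
        pid = os.fork()
        if pid == 0:
            # Child: launch the command with this variant's environment.
            env = dict(os.environ, **overrides)
            try:
                os.execvpe(command[0], command, env)
            finally:
                os._exit(127)  # only reached if exec failed
        # Parent: wait for the child before starting the next variant.
        _, status = os.waitpid(pid, 0)
        exit_codes.append(os.waitstatus_to_exitcode(status))
    return exit_codes
```

Because the parent waits for each child before forking again, the variants run strictly one after another.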
### omnitrace-instrument: [source/bin/omnitrace-instrument](https://github.com/ROCm/omnitrace/tree/amd-mainline/source/bin/omnitrace-instrument)
- Requires a command-line format of `omnitrace-instrument <options> -- <command> <command-args>`
- User specifies in options whether they want to do runtime instrumentation, binary rewrite, or attach to process
- Either opens the instrumentation target (binary rewrite), launches the target and stops it before it starts executing main (runtime), or
  attaches to a running executable and pauses it
- Finds all functions in target(s)
- Finds `libomnitrace-dl` and finds its instrumentation functions
- Iterates over all the functions and instruments them as long as they satisfy the defined criteria (minimum number of instructions, etc.)
  - See the `module_function` class
- The workflow is the same up to this point, but once the instrumentation is complete, it diverges:
  - For a binary rewrite: outputs the new instrumented binary and exits
  - For runtime instrumentation or attaching to a process: instructs the application to resume executing and then waits for the application to exit
## Libraries
### Common Library: [source/lib/common](https://github.com/ROCm/omnitrace/tree/amd-mainline/source/lib/common)
General header-only functionality used in multiple executables and/or libraries. Not installed or exported outside of the build tree.
### Core Library: [source/lib/core](https://github.com/ROCm/omnitrace/tree/amd-mainline/source/lib/core)
Static PIC library with functionality that does not depend on any components. Not installed or exported outside of the build tree.
### Binary Library: [source/lib/binary](https://github.com/ROCm/omnitrace/tree/amd-mainline/source/lib/binary)
Static PIC library with functionality for reading/analyzing binary info. Mostly used by the causal profiling sections
of [libomnitrace](#libomnitrace-sourcelibomnitrace). Not installed or exported outside of the build tree.
### libomnitrace: [source/lib/omnitrace](https://github.com/ROCm/omnitrace/tree/amd-mainline/source/lib/omnitrace)
This is the main library encapsulating all the capabilities.
### libomnitrace-dl: [source/lib/omnitrace-dl](https://github.com/ROCm/omnitrace/tree/amd-mainline/source/lib/omnitrace-dl)
Lightweight, front-end library for [libomnitrace](#libomnitrace-sourcelibomnitrace) which serves 3 primary purposes:
1. Dramatically speeds up instrumentation time vs. using [libomnitrace](#libomnitrace-sourcelibomnitrace) directly, since Dyninst must parse the entire library in order to find the instrumentation functions ([libomnitrace](#libomnitrace-sourcelibomnitrace) is dlopen'ed when the instrumentation functions get called)
2. Prevents re-entry if [libomnitrace](#libomnitrace-sourcelibomnitrace) calls an instrumented function internally
3. Coordinates communication between [libomnitrace-user](#libomnitrace-user-sourcelibomnitrace-user) and [libomnitrace](#libomnitrace-sourcelibomnitrace)
### libomnitrace-user: [source/lib/omnitrace-user](https://github.com/ROCm/omnitrace/tree/amd-mainline/source/lib/omnitrace-user)
Provides a set of functions and types for the users to add to their code, e.g. disabling data collection globally or on a specific thread,
user-defined regions, etc. If [libomnitrace-dl](#libomnitrace-dl-sourcelibomnitrace-dl) is not loaded, the user API is effectively no-op
function calls.
## Concepts
### Component
Most measurements and capabilities are encapsulated into a "component" with the following definitions:
- Measurement: recording of some data relevant to performance, e.g. current call-stack, hardware counter values, current memory usage, timestamp
- Capability: handles the implementation or orchestration of some feature which is used to collect measurements, e.g. a component which handles setting up function wrappers around various functions such as `pthread_create`, `MPI_Init`, etc.
Components are designed to hold either no data at all or only the data needed for an instantaneous measurement and a phase measurement.
Components which store data typically implement a static `record()` function (for getting a record of the measurement),
`start()` + `stop()` member functions for calculating a phase measurement, and a `sample()` member function for storing an
instantaneous measurement. In reality, there are several more "standard" functions but these are the most often used ones.
Components which do not store data may also have `start()`, `stop()`, and `sample()` functions but for components which
implement function wrappers, they typically provide a call operator or `audit(...)` functions which are invoked with the
wrappee function's arguments before the wrappee gets called and with the return value after the wrappee gets called.
***The goal of this design is to provide relatively small and reusable lightweight objects for recording measurements
and/or implementing capabilities.***
#### Wall-Clock Component Example
A component for computing the elapsed wall-clock time looks like this:
```cpp
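#include <chrono>
#include <cstdint>

struct wall_clock
{
    using value_type = int64_t;

    // current steady-clock reading, in ticks of the clock's period
    static value_type record() noexcept
    {
        return std::chrono::steady_clock::now().time_since_epoch().count();
    }

    // instantaneous measurement
    void sample() noexcept
    {
        value = record();
    }

    // begin a phase measurement
    void start() noexcept
    {
        value = record();
    }

    // end a phase measurement and accumulate the elapsed ticks
    void stop() noexcept
    {
        auto _start_value = value;
        value = record();
        accum += (value - _start_value);
    }

private:
    int64_t value = 0;
    int64_t accum = 0;
};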
```
#### Function Wrapper Component Example
A component which implements wrappers around `fork()` and `exit(int)` (and stores no data) may look like this:
```cpp
struct function_wrapper
{
    pid_t operator()(const gotcha_data&, pid_t (*real_fork)())
    {
        // disable all collection before forking
        categories::disable_categories(config::get_enabled_categories());
        auto _pid_v = real_fork();
        // only re-enable collection on parent process
        if(_pid_v != 0)
            categories::enable_categories(config::get_enabled_categories());
        return _pid_v;
    }

    void operator()(const gotcha_data&, void (*real_exit)(int), int _exit_code)
    {
        // catch the call to exit and finalize before truly exiting
        omnitrace_finalize();
        real_exit(_exit_code);
    }
};
```
#### Component Member Functions
There are no real restrictions or requirements on the member functions a component needs to provide.
Unless the component is being directly used, invocation of component member functions via "component bundlers"
(provided via timemory) makes extensive use of template metaprogramming concepts to find the best match (if any)
for calling a component's member function. This is a bit easier to demonstrate via example:
```cpp
struct foo
{
    void sample() { puts("foo::sample()"); }
};

struct bar
{
    void sample(int) { puts("bar::sample(int)"); }
};

struct spam
{
    void start() { puts("spam::start()"); }
    void stop() { puts("spam::stop()"); }
};

int main()
{
    auto _bundle = component_tuple<foo, bar, spam>{ "main" };

    puts("A");
    _bundle.start();

    puts("B");
    _bundle.sample(10);

    puts("C");
    _bundle.sample();

    puts("D");
    _bundle.stop();
}
```
The example above prints the following:
```console
A
spam::start()
B
foo::sample()
bar::sample(int)
C
foo::sample()
D
spam::stop()
```
In section A, the bundle determined that only the `spam` object had a `start` function. Since this is determined
via template metaprogramming instead of dynamic polymorphism, this effectively elides any code related to
the `foo` or `bar` objects. In section B, since an integer of `10` was passed to the bundle,
the bundle forwards that value on to `bar::sample(int)` after it invokes `foo::sample()`, which
is invoked because the bundle recognizes that calling the `sample` member function is still possible without
the arguments.
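
In C++ this matching happens at compile time. As a rough runtime analogy only (this is not how timemory implements it), the following Python sketch mimics the same selection rule: call a member function with the given arguments if it accepts them, otherwise fall back to calling it without arguments, otherwise skip the component. The names `Bundle`, `invoke`, and `_accepts` are hypothetical.

```python
import inspect


class Foo:
    def sample(self):
        return "foo::sample()"


class Bar:
    def sample(self, value):
        return "bar::sample(int)"


class Spam:
    def start(self):
        return "spam::start()"

    def stop(self):
        return "spam::stop()"


def _accepts(func, args):
    """Return True if func can be called with the given positional arguments."""
    try:
        inspect.signature(func).bind(*args)
        return True
    except TypeError:
        return False


class Bundle:
    def __init__(self, *components):
        self.components = components

    def invoke(self, name, *args):
        """Call `name` on every component that supports it; collect the results."""
        results = []
        for comp in self.components:
            func = getattr(comp, name, None)
            if func is None:
                continue  # component has no such member function
            if _accepts(func, args):
                results.append(func(*args))
            elif _accepts(func, ()):
                results.append(func())  # still callable without the arguments
        return results
```

For instance, `Bundle(Foo(), Bar(), Spam()).invoke("sample", 10)` calls `Foo.sample()` without the argument and `Bar.sample(10)` with it, and skips `Spam` entirely.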
## Memory Model
Collected data is generally stored in one of the following 3 places:
1. Perfetto (i.e. data is handed directly to perfetto)
2. Managed implicitly by timemory and accessed as needed
3. Thread-local data
In general, only instrumentation for relatively simple data is directly passed to Perfetto and/or timemory during runtime.
For example, the callbacks from binary instrumentation, user API instrumentation, and roctracer directly invoke
calls to Perfetto and/or timemory's storage model. Otherwise, the data is stored by omnitrace in the thread-data model
which is more persistent than simply using `thread_local` static data (which is problematic because the data gets deleted
when a thread terminates).
### Thread Identification
Each CPU thread is assigned two integral identifiers. One identifier (called `internal_value`) is simply an atomic increment every time
a new thread is created.
The other identifier (called `sequent_value`) tries to account for the fact that OmniTrace, Perfetto, ROCm, etc. start background threads.
When a thread is created as a byproduct of OmniTrace, the index is offset by a large value. This serves
two purposes: (1) the data for threads created by the user is closer together in memory and (2) when log messages are printed,
the index more-or-less correlates to the order of thread creation as the user understands it.
The `sequent_value` is typically the one used to access the thread-data.
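
A minimal sketch of the two-identifier scheme, assuming a hypothetical offset of 1024 for tool-created threads (the text only says "a large value") and hypothetical names `ThreadIds` and `assign`:

```python
import itertools
import threading

INTERNAL_OFFSET = 1024  # hypothetical; the real offset is not stated here


class ThreadIds:
    def __init__(self):
        self._lock = threading.Lock()
        self._internal = itertools.count()                # every thread
        self._user_seq = itertools.count()                # user-created threads
        self._tool_seq = itertools.count(INTERNAL_OFFSET) # tool-created threads

    def assign(self, created_by_tool):
        """Return (internal_value, sequent_value) for a newly created thread."""
        with self._lock:
            internal_value = next(self._internal)
            seq = self._tool_seq if created_by_tool else self._user_seq
            return internal_value, next(seq)
```

With this scheme, user threads keep consecutive `sequent_value` indices even when background threads are created in between.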
### Thread-Data Class
Currently, most thread data is effectively stored in a static `std::array<std::unique_ptr<T>, OMNITRACE_MAX_THREADS>` instance.
`OMNITRACE_MAX_THREADS` is a value defined at compile time and set to 2048 for release builds. During finalization,
omnitrace iterates over all the thread-data and then transforms that data into something that is passed to perfetto and/or timemory.
The downside of the current model is that if the user exceeds `OMNITRACE_MAX_THREADS`, omnitrace segfaults. To fix this issue,
a new model is being adopted which has all the benefits of this model but permits dynamic expansion.
## Sampling Model
The general structure for the sampling is within timemory (`source/timemory/sampling`). Currently, all sampling is done per-thread
via POSIX timers. Omnitrace supports using a realtime timer and a CPU-time timer. Both have adjustable frequencies, delays, and durations.
By default, only CPU-time sampling is enabled. Initial settings are inherited from the settings starting with `OMNITRACE_SAMPLING_`.
For each type of timer, there exists timer-specific settings that can be used to override the common/inherited settings for that timer
specifically. For the CPU-time sampler, these settings start with `OMNITRACE_SAMPLING_CPUTIME`; for the realtime sampler,
they start with `OMNITRACE_SAMPLING_REALTIME`. For example, `OMNITRACE_SAMPLING_FREQ=500` initially sets the sampling frequency to 500 interrupts per second
(based on each timer's clock). Setting `OMNITRACE_SAMPLING_REALTIME_FREQ=10` then lowers the sampling frequency for the realtime sampler
to 10 interrupts per second of realtime.
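
The inheritance rule can be sketched as a simple lookup: use the timer-specific setting when present, otherwise fall back to the common one. The helper name `effective_frequency` is hypothetical, and `OMNITRACE_SAMPLING_CPUTIME_FREQ` is assumed by analogy with the prefix pattern described above.

```python
def effective_frequency(settings, timer):
    """Resolve the sampling frequency for timer ("CPUTIME" or "REALTIME")."""
    specific = settings.get(f"OMNITRACE_SAMPLING_{timer}_FREQ")
    if specific is not None:
        return float(specific)  # timer-specific override
    return float(settings["OMNITRACE_SAMPLING_FREQ"])  # common/inherited value
```

With `OMNITRACE_SAMPLING_FREQ=500` and `OMNITRACE_SAMPLING_REALTIME_FREQ=10`, the CPU-time sampler resolves to 500 and the realtime sampler to 10.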
The omnitrace-specific implementation can be found in [source/lib/omnitrace/library/sampling.cpp](https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/sampling.cpp).
Within [sampling.cpp](https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/sampling.cpp), you will find a bundle of 3 sampling components:
`backtrace_timestamp`, `backtrace`, and `backtrace_metrics`.
The first component [backtrace_timestamp](https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/components/backtrace_timestamp.hpp) simply
records the wall-clock time of the sample.
The second component [backtrace](https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/components/backtrace.hpp) records the call-stack via libunwind.
The last component [backtrace_metrics](https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/components/backtrace_metrics.hpp) is responsible for recording the
metrics for that sample, e.g. peak RSS, HW counters, etc. These 3 components are bundled together in a tuple-like struct (e.g. `tuple<backtrace_timestamp, backtrace, backtrace_metrics>`), and
a buffer of at least 1024 instances of this tuple is mmap'ed per-thread. When this buffer is full, before taking the next sample, the sampler hands the buffer
off to its allocator thread and mmaps a new buffer. The allocator thread takes this data and either dynamically stores it in memory or writes it to a file, depending on the value of `OMNITRACE_USE_TEMPORARY_FILES`.
This scheme avoids all allocations in the signal handler, allows the data to grow dynamically, avoids potentially slow I/O within the signal handler, and also makes it possible to avoid I/O altogether.
The maximum number of samplers handled by each allocator is governed by the `OMNITRACE_SAMPLING_ALLOCATOR_SIZE` setting (the default is 8). Whenever an allocator has reached its limit,
a new internal thread is created to handle the new samplers.
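
The buffer hand-off can be sketched as a producer/consumer pair. This Python illustration only shows the ordering of events (fill, hand off, start a fresh buffer). It does not model `mmap`, signal safety, or temporary files, and all names in it are hypothetical.

```python
import queue
import threading

BUFFER_SIZE = 4  # small for the demo; the text describes at least 1024


class Sampler:
    def __init__(self, handoff):
        self._handoff = handoff
        self._buffer = []

    def take_sample(self, sample):
        # When the buffer is full, hand it off before taking the next sample.
        if len(self._buffer) == BUFFER_SIZE:
            self._handoff.put(self._buffer)
            self._buffer = []
        self._buffer.append(sample)

    def flush(self):
        if self._buffer:
            self._handoff.put(self._buffer)
            self._buffer = []


def allocator(handoff, storage):
    """Drain full buffers into long-term storage until a None sentinel arrives."""
    while True:
        buf = handoff.get()
        if buf is None:
            break
        storage.extend(buf)


def demo(num_samples):
    handoff, storage = queue.Queue(), []
    worker = threading.Thread(target=allocator, args=(handoff, storage))
    worker.start()
    sampler = Sampler(handoff)
    for i in range(num_samples):
        sampler.take_sample(i)
    sampler.flush()
    handoff.put(None)
    worker.join()
    return storage
```

The sampler never touches long-term storage itself, which is the property that keeps slow work out of the sampling path.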
## Time-Window Constraint Model
Recently with the introduction of tracing delay/duration/etc., the [constraint namespace](https://github.com/ROCm/omnitrace/blob/main/source/lib/core/constraint.hpp)
was introduced to improve the management of delays and/or duration limits of data collection. The `spec` class takes a clock identifier, a delay value, a duration value, and an
integer indicating how many times to repeat the delay + duration. Thus, it is possible to perform tasks such as periodically enabling tracing for brief periods
of time in between long periods without data collection during the application. For example, `OMNITRACE_TRACE_PERIODS = realtime:10:1:5 process_cputime:10:2:20` would enable
five periods of 10 seconds of realtime without data collection, each followed by 1 second of data collection, plus twenty periods of 10 seconds
of process CPU time without data collection, each followed by 2 CPU-time seconds of data collection.
Eventually, the goal is to migrate all subsets of data collection which currently support more rudimentary models of time window constraints, such as process sampling and causal profiling,
to this model.
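
As a sketch of how such a specification could be interpreted, the following Python snippet parses the `clock:delay:duration:repeat` fields inferred from the example above and checks whether collection is active at a given elapsed time. The names `Spec`, `parse_periods`, and `collecting` are hypothetical and do not mirror the C++ `spec` class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    clock: str
    delay: float
    duration: float
    repeat: int


def parse_periods(text):
    """Parse space-separated clock:delay:duration:repeat fields."""
    specs = []
    for field in text.split():
        clock, delay, duration, repeat = field.split(":")
        specs.append(Spec(clock, float(delay), float(duration), int(repeat)))
    return specs


def collecting(spec, elapsed):
    """Return True if collection is active at `elapsed` seconds on spec's clock."""
    cycle = spec.delay + spec.duration
    if elapsed >= cycle * spec.repeat:
        return False  # all repetitions have finished
    return (elapsed % cycle) >= spec.delay
```

For `realtime:10:1:5`, each cycle lasts 11 seconds, so collection is active only during the last second of each of the five cycles.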
-196
@@ -1,196 +0,0 @@
name: omnitrace-docs
channels:
- conda-forge
- defaults
dependencies:
- _libgcc_mutex=0.1=conda_forge
- _openmp_mutex=4.5=1_gnu
- alabaster=0.7.12=py_0
- alsa-lib=1.2.3=h516909a_0
- argh=0.26.2=pyh9f0ad1d_1002
- atk-1.0=2.36.0=h3371d22_4
- babel=2.9.1=pyh44b312d_0
- breathe=4.29.2=pyhd8ed1ab_0
- brotli=1.0.9=h7f98852_6
- brotli-bin=1.0.9=h7f98852_6
- brotlipy=0.7.0=py39h3811e60_1003
- bzip2=1.0.8=h7f98852_4
- c-ares=1.18.1=h7f98852_0
- ca-certificates=2021.10.8=ha878542_0
- cairo=1.16.0=ha00ac49_1009
- certifi=2021.10.8=py39hf3d152e_1
- cffi=1.15.0=py39h4bc2ebd_0
- charset-normalizer=2.0.12=pyhd8ed1ab_0
- click=8.0.4=py39hf3d152e_0
- cmake=3.22.2=h1021d11_0
- colorama=0.4.4=pyh9f0ad1d_0
- commonmark=0.9.1=py_0
- cryptography=36.0.1=py39h95dcef6_0
- curl=7.81.0=h2574ce0_0
- cycler=0.11.0=pyhd8ed1ab_0
- dbus=1.13.6=h5008d03_3
- docutils=0.16=py39hf3d152e_3
- doxygen=1.9.2=hb166930_0
- expat=2.4.4=h9c3ff4c_0
- font-ttf-dejavu-sans-mono=2.37=hab24e00_0
- font-ttf-inconsolata=3.000=h77eed37_0
- font-ttf-source-code-pro=2.038=h77eed37_0
- font-ttf-ubuntu=0.83=hab24e00_0
- fontconfig=2.13.96=ha180cfb_0
- fonts-conda-ecosystem=1=0
- fonts-conda-forge=1=0
- fonttools=4.29.1=py39h3811e60_0
- freetype=2.10.4=h0708190_1
- fribidi=1.0.10=h36c2ea0_0
- future=0.18.2=py39hf3d152e_4
- gdk-pixbuf=2.42.6=h04a7f16_0
- gettext=0.19.8.1=h73d1719_1008
- ghp-import=2.0.2=pyhd8ed1ab_0
- giflib=5.2.1=h36c2ea0_2
- git=2.35.0=pl5321hc30692c_0
- graphite2=1.3.13=h58526e2_1001
- graphviz=2.50.0=h8e749b2_2
- gst-plugins-base=1.18.5=hf529b03_3
- gstreamer=1.18.5=h9f60fe5_3
- gtk2=2.24.33=h90689f9_2
- gts=0.7.6=h64030ff_2
- harfbuzz=3.4.0=hb4a5f5f_0
- icu=69.1=h9c3ff4c_0
- idna=3.3=pyhd8ed1ab_0
- imagesize=1.3.0=pyhd8ed1ab_0
- importlib-metadata=4.11.1=py39hf3d152e_0
- jbig=2.1=h7f98852_2003
- jinja2=3.0.3=pyhd8ed1ab_0
- jpeg=9e=h7f98852_0
- kiwisolver=1.3.2=py39h1a9c180_1
- krb5=1.19.2=hcc1bbae_3
- lcms2=2.12=hddcbb42_0
- ld_impl_linux-64=2.36.1=hea4e1c9_2
- lerc=3.0=h9c3ff4c_0
- libblas=3.9.0=13_linux64_openblas
- libbrotlicommon=1.0.9=h7f98852_6
- libbrotlidec=1.0.9=h7f98852_6
- libbrotlienc=1.0.9=h7f98852_6
- libcblas=3.9.0=13_linux64_openblas
- libclang=13.0.1=default_hc23dcda_0
- libcurl=7.81.0=h2574ce0_0
- libdeflate=1.10=h7f98852_0
- libedit=3.1.20191231=he28a2e2_2
- libev=4.33=h516909a_1
- libevent=2.1.10=h9b69904_4
- libffi=3.4.2=h7f98852_5
- libgcc-ng=11.2.0=h1d223b6_12
- libgd=2.3.3=h3cfcdeb_1
- libgfortran-ng=11.2.0=h69a702a_12
- libgfortran5=11.2.0=h5c6108e_12
- libglib=2.70.2=h174f98d_4
- libgomp=11.2.0=h1d223b6_12
- libiconv=1.16=h516909a_0
- liblapack=3.9.0=13_linux64_openblas
- libllvm13=13.0.1=hf817b99_0
- libnghttp2=1.46.0=h812cca2_0
- libnsl=2.0.0=h7f98852_0
- libogg=1.3.4=h7f98852_1
- libopenblas=0.3.18=pthreads_h8fe5266_0
- libopus=1.3.1=h7f98852_1
- libpng=1.6.37=h21135ba_2
- libpq=14.2=hd57d9b9_0
- librsvg=2.52.5=h0a9e6e8_2
- libssh2=1.10.0=ha56f1ee_2
- libstdcxx-ng=11.2.0=he4da1e4_12
- libtiff=4.3.0=h542a066_3
- libtool=2.4.6=h9c3ff4c_1008
- libuuid=2.32.1=h7f98852_1000
- libuv=1.43.0=h7f98852_0
- libvorbis=1.3.7=h9c3ff4c_0
- libwebp=1.2.2=h3452ae3_0
- libwebp-base=1.2.2=h7f98852_1
- libxcb=1.13=h7f98852_1004
- libxkbcommon=1.0.3=he3ba5ed_0
- libxml2=2.9.12=h885dcf4_1
- libzlib=1.2.11=h36c2ea0_1013
- lz4-c=1.9.3=h9c3ff4c_1
- markdown=3.3.6=pyhd8ed1ab_0
- markupsafe=2.1.0=py39hb9d737c_0
- matplotlib=3.5.1=py39hf3d152e_0
- matplotlib-base=3.5.1=py39h2fa2bec_0
- mergedeep=1.3.4=pyhd8ed1ab_0
- mkdocs=1.2.3=pyhd8ed1ab_0
- munkres=1.1.4=pyh9f0ad1d_0
- mysql-common=8.0.28=ha770c72_0
- mysql-libs=8.0.28=hfa10184_0
- ncurses=6.3=h9c3ff4c_0
- nspr=4.32=h9c3ff4c_1
- nss=3.74=hb5efdd6_0
- numpy=1.22.2=py39h91f2184_0
- openjpeg=2.4.0=hb52868f_1
- openssl=1.1.1l=h7f98852_0
- packaging=21.3=pyhd8ed1ab_0
- pango=1.50.3=h9967ed3_0
- pcre=8.45=h9c3ff4c_0
- pcre2=10.37=h032f7d1_0
- perl=5.32.1=2_h7f98852_perl5
- pillow=9.0.1=py39hae2aec6_2
- pip=22.0.3=pyhd8ed1ab_0
- pixman=0.40.0=h36c2ea0_0
- pthread-stubs=0.4=h36c2ea0_1001
- pycparser=2.21=pyhd8ed1ab_0
- pygments=2.11.2=pyhd8ed1ab_0
- pyopenssl=22.0.0=pyhd8ed1ab_0
- pyparsing=3.0.7=pyhd8ed1ab_0
- pyqt=5.12.3=py39hf3d152e_8
- pyqt-impl=5.12.3=py39hde8b62d_8
- pyqt5-sip=4.19.18=py39he80948d_8
- pyqtchart=5.12=py39h0fcd23e_8
- pyqtwebengine=5.12.1=py39h0fcd23e_8
- pysocks=1.7.1=py39hf3d152e_4
- python=3.9.10=h85951f9_2_cpython
- python-dateutil=2.8.2=pyhd8ed1ab_0
- python_abi=3.9=2_cp39
- pytz=2021.3=pyhd8ed1ab_0
- pyyaml=6.0=py39h3811e60_3
- pyyaml-env-tag=0.1=pyhd8ed1ab_0
- qt=5.12.9=ha98a1a1_5
- readline=8.1=h46c0cb4_0
- recommonmark=0.7.1=pyhd8ed1ab_0
- requests=2.27.1=pyhd8ed1ab_0
- rhash=1.4.1=h7f98852_0
- setuptools=60.9.3=py39hf3d152e_0
- six=1.16.0=pyh6c4a22f_0
- snowballstemmer=2.2.0=pyhd8ed1ab_0
- sphinx=3.5.4=pyh44b312d_0
- sphinx-markdown-tables=0.0.15=pyhd3deb0d_0
- sphinxcontrib-applehelp=1.0.2=py_0
- sphinxcontrib-devhelp=1.0.2=py_0
- sphinxcontrib-htmlhelp=2.0.0=pyhd8ed1ab_0
- sphinxcontrib-jsmath=1.0.1=py_0
- sphinxcontrib-qthelp=1.0.3=py_0
- sphinxcontrib-serializinghtml=1.1.5=pyhd8ed1ab_1
- sqlite=3.37.0=h9cd32fc_0
- tk=8.6.12=h27826a3_0
- tornado=6.1=py39h3811e60_2
- tzdata=2021e=he74cb21_0
- unicodedata2=14.0.0=py39h3811e60_0
- urllib3=1.26.8=pyhd8ed1ab_1
- watchdog=2.1.6=py39hf3d152e_1
- wheel=0.37.1=pyhd8ed1ab_0
- xorg-kbproto=1.0.7=h7f98852_1002
- xorg-libice=1.0.10=h7f98852_0
- xorg-libsm=1.2.3=hd9c2040_1000
- xorg-libx11=1.7.2=h7f98852_0
- xorg-libxau=1.0.9=h7f98852_0
- xorg-libxdmcp=1.1.3=h7f98852_0
- xorg-libxext=1.3.4=h7f98852_1
- xorg-libxrender=0.9.10=h7f98852_1003
- xorg-renderproto=0.11.1=h7f98852_1002
- xorg-xextproto=7.3.0=h7f98852_1002
- xorg-xproto=7.0.31=h7f98852_1007
- xz=5.2.5=h516909a_1
- yaml=0.2.5=h7f98852_2
- zipp=3.7.0=pyhd8ed1ab_1
- zlib=1.2.11=h36c2ea0_1013
- zstd=1.5.2=ha95c52a_0
- pip:
  - py-gfm==1.0.2
  - sphinx-markdown==1.0.2
  - sphinx-rtd-theme==1.0.0
-86
@@ -1,86 +0,0 @@
# Features
```eval_rst
.. toctree::
:glob:
:maxdepth: 4
```
## Overview
[OmniTrace](https://github.com/ROCm/omnitrace) is designed to be highly extensible. Internally, it leverages the
[timemory performance analysis toolkit](https://github.com/NERSC/timemory) to
manage extensions, resources, data, etc.
### Data Collection Modes
- Dynamic instrumentation
  - Runtime instrumentation
    - Instrument executable and shared libraries at runtime
  - Binary rewriting
    - Generate a new executable and/or library with instrumentation built-in
- Statistical sampling
  - Periodic software interrupts per-thread
- Process-level sampling
  - Background thread records process-, system- and device-level metrics while the application executes
- Causal profiling
  - Quantifies the potential impact of optimizations in parallel codes
### Data Analysis
- High-level summary profiles with mean/min/max/stddev statistics
  - Low overhead, memory efficient
  - Ideal for running at scale
- Comprehensive traces
  - Every individual event/measurement
- Application speedup predictions resulting from potential optimizations in functions and lines of code (causal profiling)
### Parallelism API Support
- HIP
- HSA
- Pthreads
- MPI
- Kokkos-Tools (KokkosP)
- OpenMP-Tools (OMPT)
### GPU Metrics
- GPU hardware counters
- HIP API tracing
- HIP kernel tracing
- HSA API tracing
- HSA operation tracing
- System-level sampling (via rocm-smi)
  - Memory usage
  - Power usage
  - Temperature
  - Utilization
### CPU Metrics
- CPU hardware counters sampling and profiles
- CPU frequency sampling
- Various timing metrics
  - Wall time
  - CPU time (process and/or thread)
  - CPU utilization (process and/or thread)
  - User CPU time
  - Kernel CPU time
- Various memory metrics
  - High-water mark (sampling and profiles)
  - Memory page allocation
  - Virtual memory usage
- Network statistics
- I/O metrics
- ... many more
### Third-party API support
- TAU
- LIKWID
- Caliper
- CrayPAT
- VTune
- NVTX
- ROCTX
-19
@@ -1,19 +0,0 @@
if(NOT DEFINED SOURCE_DIR)
message(FATAL_ERROR "Please define SOURCE_DIR")
endif()
get_filename_component(SOURCE_DIR "${SOURCE_DIR}" ABSOLUTE)
find_program(DOT_EXECUTABLE NAMES dot)
if(NOT DOT_EXECUTABLE)
message(FATAL_ERROR "Please install dot and/or specify DOT_EXECUTABLE")
endif()
file(READ "${SOURCE_DIR}/VERSION" FULL_VERSION_STRING LIMIT_COUNT 1)
string(REGEX REPLACE "(\n|\r)" "" FULL_VERSION_STRING "${FULL_VERSION_STRING}")
string(REGEX REPLACE "([0-9]+)\\.([0-9]+)\\.([0-9]+)(.*)" "\\1.\\2.\\3" OMNITRACE_VERSION
"${FULL_VERSION_STRING}")
configure_file(${SOURCE_DIR}/source/docs/omnitrace.dox.in
${SOURCE_DIR}/source/docs/omnitrace.dox @ONLY)
-189
@@ -1,189 +0,0 @@
# Getting Started
```eval_rst
.. toctree::
:glob:
:maxdepth: 3
```
<style>
em { color: Green; }
</style>
## Nomenclature
The list provided below is intended to (A) provide a basic glossary for those who are not familiar with binary instrumentation, etc. and (B)
provide clarification to ambiguities when certain terms have different contextual meanings,
e.g., omnitrace's meaning of the term "module" when instrumenting Python.
- **Binary**
  - File written in the Executable and Linkable Format (ELF)
  - Standard file format for executable files, shared libraries, etc.
- **Binary Instrumentation**
  - Inserting callbacks to instrumentation into an existing binary. This can be performed statically or dynamically
- **Static Binary Instrumentation**
  - Loads an existing binary, determines instrumentation points, and generates a new binary with instrumentation directly embedded
  - Applicable to executables and libraries but limited to only the functions defined in the binary
  - Also known as: **Binary Rewrite**
- **Dynamic Binary Instrumentation**
  - Loads an existing binary into memory, inserts instrumentation, executes binary
  - Limited to executables but capable of instrumenting linked libraries
  - Also known as: **Runtime Instrumentation**
- **Statistical Sampling**
  - Also known as (simply) "sampling"
  - At periodic intervals, the application is paused and the current call-stack of the CPU is recorded along with various other metrics
  - Uses timers that measure either (A) real clock time or (B) the CPU time used by the current thread and the CPU time expended on behalf of the thread by the system
  - **Sampling Rate**
    - The rate at which (A) or (B) are triggered (in units of `# interrupts / second`)
    - Higher values increase the number of samples
  - **Sampling Delay**
    - How long to wait before (A) and (B) begin triggering at their designated rate
  - **Sampling Duration**
    - The time (in realtime) after the start of the application to record samples. Once this time limit has been reached, no more samples will be recorded.
- **Process Sampling**
  - At periodic (realtime) intervals, a background thread records global metrics without interrupting the current process. These metrics include, but are not limited to: CPU frequency,
    CPU memory high-water mark (i.e. peak memory usage), GPU temperature, GPU power usage, etc.
  - **Sampling Rate**
    - The realtime rate for recording metrics (in units of `# measurements / second`)
    - Higher values increase the number of samples
  - **Sampling Delay**
    - How long to wait (in realtime) before recording samples
  - **Sampling Duration**
    - The time (in realtime) after the start of the application to record samples. Once this time limit has been reached, no more samples will be recorded.
- **Module**
  - With respect to binary instrumentation, a module is defined as either the filename (e.g. `foo.c`) or library name (`libfoo.so`) which contains the definition of one or more functions
  - With respect to Python instrumentation, a module is defined as the *file* which contains the definition of one or more functions.
    - The full path to this file *typically* contains the name of the "Python module"
- **Basic Block**
  - Straight-line code sequence with:
    - No branches in (except for the entry)
    - No branches out (except for the exit)
- **Address Range**
  - The instructions for a function in a binary start at a certain address within the ELF file and end at a certain address; the range is `end - start`
  - The address range is a decent approximation for the "cost" of a function, i.e., a larger address range approximately equates to more instructions
- **Instrumentation Traps**
  - On the x86 architecture, because instructions are of variable size, the instruction at a point may be too small for Dyninst to replace it with the normal code sequence used to call instrumentation
  - Also, when instrumentation is placed at points other than subroutine entry, exit, or call points, traps may be used to ensure the instrumentation fits
  - By default, omnitrace-instrument avoids instrumentation which requires using a trap
- **Overlapping functions**
  - Due to language constructs or compiler optimizations, it may be possible for multiple functions to overlap (that is, share part of the same function body) or for a single function to have multiple entry points
  - In practice, it is impossible to determine the difference between multiple overlapping functions and a single function with multiple entry points
  - By default, omnitrace-instrument avoids instrumenting overlapping functions
## General Tips
- ***Use `omnitrace-avail` to lookup configuration settings***, hardware counters, and data collection components
- Use `-d` flag for descriptions
- Generate a default configuration with `omnitrace-avail -G ${HOME}/.omnitrace.cfg` and tweak accordingly to the desired default behavior
- ***Decide whether binary instrumentation, statistical sampling, or both*** will provide the desired performance data (for non-Python applications)
- Compile code with optimization enabled (e.g. `-O2` or higher), disable asserts (i.e. `-DNDEBUG`), and include debug info (i.e. `-g1` at a minimum)
- NOTE: compiling with debug info does not slow down the code; it only increases compile time and the size of the binary
- In CMake, this is generally as easy as setting `CMAKE_BUILD_TYPE=RelWithDebInfo` or `CMAKE_BUILD_TYPE=Release` and `CMAKE_<LANG>_FLAGS=-g1`
- Use ***binary instrumentation for characterizing the performance of every invocation of specific functions***
- Use ***statistical sampling to characterize the performance of the entire application while minimizing overhead***
- Enable statistical sampling after binary instrumentation to help "fill in the gaps" between instrumented regions
- Use the user API to create custom regions and to enable/disable omnitrace for specific processes, threads, and/or regions
- Dynamic symbol interception, callback APIs, and the user API are always available with binary instrumentation and sampling
- Dynamic symbol interception and callback APIs are (generally) controlled through `OMNITRACE_USE_<API>` options, e.g. `OMNITRACE_USE_KOKKOSP`, `OMNITRACE_USE_OMPT` enable Kokkos-Tools and OpenMP-Tools callbacks, respectively
- When generically seeking regions for performance improvement:
- ***Start off collecting a flat profile***
- Look for functions with high call counts, large cumulative runtimes/values, and/or large standard deviations
- When call counts are high, improving the performance of this function or "inlining" the function can be a quick and easy performance improvement
- When the standard deviation is high, collect a hierarchical profile and see if the high variation can be attributed to the calling context. In this scenario, consider creating a specialized version of the function for the longer-running contexts
- ***Collect a hierarchical profile*** and, keeping the flat-profiling data in mind, verify the functions noted in the flat profile are part of the "critical path" of your application
- E.g. functions with high call counts which are part of a "setup" or "post-processing" phase that does not consume much time relative to the overall time are, generally, a lower priority for optimization
- ***Use the information from the profiles when analyzing detailed traces***
- When using binary instrumentation in the "trace" mode, the ***binary rewrites are preferable to runtime instrumentation***.
- Binary rewrites only instrument the functions defined in the target binary, whereas runtime instrumentation can/will instrument functions defined in the shared libraries which are linked into the target binary
- When using binary instrumentation with MPI, avoid runtime instrumentation
- Runtime instrumentation requires a fork + ptrace, which is generally incompatible with how MPI applications spawn their processes
- Binary rewrite the executable using MPI (and, optionally, libraries used by the executable) and execute the generated instrumented executable via `omnitrace-run` instead of the original, e.g. `mpirun -n 2 ./myexe` should be `mpirun -n 2 omnitrace-run -- ./myexe.inst` where `myexe.inst` is the generated instrumented `myexe` executable.
## Data Collection Mode(s)
OmniTrace supports several modes of recording trace and profiling data for your application:
| Mode | Descriptions |
|-----------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Binary Instrumentation | Locates functions (and loops, if desired) in binary and inserts snippets at the entry and exit |
| Statistical Sampling | Periodically pauses application at specified intervals and records various metrics for the given call-stack |
| Callback APIs | Parallelism frameworks such as ROCm, OpenMP, and Kokkos will make callbacks into omnitrace to provide information about the work the API is performing |
| Dynamic Symbol Interception | Wrap function symbols defined in position independent dynamic library/executable, e.g. `pthread_mutex_lock` in libpthread.so or `MPI_Init` in the MPI library |
| User API | User-defined regions and controls for omnitrace |
The two most generic, important modes are binary instrumentation and statistical sampling. It is important to understand the advantages and disadvantages.
Binary instrumentation and statistical sampling can be performed with the `omnitrace` executable but for statistical sampling, it is highly recommended to use the
`omnitrace-sample` executable instead if no binary instrumentation is required/desired. With either tool, the callback APIs and dynamic symbol interception can be
utilized.
### Binary Instrumentation
Binary instrumentation will allow one to deterministically record measurements for every single invocation of a given function.
Binary instrumentation effectively adds instructions to the target application to collect the required information and, thus, has the potential to cause performance changes which may,
in some cases, lead to inaccurate results. The effect depends on what information is being collected and which features are activated in omnitrace. For example, collecting only the wall-clock timing data
will have less effect than collecting the wall-clock timing, cpu-clock timing, memory usage, cache-misses, and number of instructions executed. Similarly, collecting a flat profile will have
less overhead than a hierarchical profile and collecting a trace OR a profile will have less overhead than collecting a trace AND a profile.
In omnitrace, the primary heuristic for controlling the overhead with binary instrumentation is the minimum number of instructions for selecting functions for instrumentation.
### Statistical Sampling
Statistical call-stack sampling periodically interrupts the application at regular intervals using operating system interrupts.
Sampling is typically less numerically accurate and specific, but allows the target program to run at near full speed.
In contrast to the data derived from binary instrumentation, the resulting data is not exact but, instead, a statistical approximation.
However, sampling often provides a more accurate picture of the application execution because it is less intrusive to the target application and has fewer
side effects on memory caches or instruction decoding pipelines. Furthermore, since sampling does not affect the execution speed as significantly, it is
relatively immune to over-evaluating the cost of small, frequently called functions or "tight" loops.
In omnitrace, the overhead for statistical sampling is a factor of the sampling rate and whether the samples are taken with respect to the CPU time and/or real time.
### Binary Instrumentation vs. Statistical Sampling Example
Consider the following code:
```cpp
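#include <cstdio>
#include <cstdlib>

// Naive recursive Fibonacci: a very small function that is called very often
long fib(long n)
{
    if(n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}

// Compute and print one result; i is the iteration index
void run(long i, long nfib)
{
    long result = fib(nfib);
    printf("[%li] fibonacci(%li) = %li\n", i, nfib, result);
}

int main(int argc, char** argv)
{
    long nfib = 30;
    long nitr = 10;
    if(argc > 1) nfib = atol(argv[1]);
    if(argc > 2) nitr = atol(argv[2]);
    for(long i = 0; i < nitr; ++i)
        run(i, nfib);
    return 0;
}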
```
Binary instrumentation of the `fib` function will record ***every single invocation*** of the function -- which for a very small function
such as `fib`, will result in *significant* overhead since this simple function tends to be less than 20 or so instructions, whereas the entry and
exit snippets are ~1024 instructions. Thus, ***we generally want to avoid instrumenting functions where the instrumented function has significantly fewer
instructions than entry + exit instrumentation*** (please note, however, that many of the instructions in the entry/exit functions are either logging functions or
depend on the runtime settings and thus may never be executed). Because of the number of potentially executed instructions in the entry/exit snippets,
the default behavior of omnitrace-instrument is to only instrument functions which contain at least 1024 instructions.
However, recording every single invocation of the function can be extremely useful for detecting anomalies: profiles will show min/max values much smaller/larger
than the average and/or high standard deviation and traces will allow you to identify exactly when and where those instances deviated from the norm.
Consider the level of details in the following traces where, in the top image, every instance of the `fib` function was instrumented vs. the bottom image
where the `fib` call-stack was derived via sampling:
#### Binary Instrumentation of Fibonacci Function
![instrumented-fibonacci-trace](images/fibonacci-instrumented.png)
#### Statistical Sampling of Fibonacci Function
![sampled-fibonacci-trace](images/fibonacci-sampling.png)
@@ -1,24 +0,0 @@
# Welcome to the [OmniTrace](https://github.com/ROCm/omnitrace) Documentation!
```eval_rst
.. toctree::
   :glob:
   :maxdepth: 4
   :caption: Table of Contents

   about
   features
   installation
   setup
   getting_started
   runtime
   sampling
   instrumenting
   causal_profiling
   critical_trace
   output
   user_api
   python
   youtube
   development
```
@@ -1,281 +0,0 @@
# Installation
```eval_rst
.. toctree::
   :glob:
   :maxdepth: 4
```
## Release Links
- [Latest Omnitrace Release](https://github.com/ROCm/omnitrace/releases/latest)
- [All Omnitrace Releases](https://github.com/ROCm/omnitrace/releases)
## Quick Start (Latest Release, Binary Installer)
Download the [omnitrace-install.py](https://github.com/ROCm/omnitrace/releases/latest/download/omnitrace-install.py)
and specify `--prefix <install-directory>`. This script will attempt to auto-detect the appropriate OS
distribution and OS version. If ROCm support is desired, specify `--rocm X.Y` where `X` is the ROCm major
version and `Y` is the ROCm minor version, e.g. `--rocm 6.0`.
```shell
wget https://github.com/ROCm/omnitrace/releases/latest/download/omnitrace-install.py
python3 ./omnitrace-install.py --prefix /opt/omnitrace --rocm 6.0
```
This script supports installation on Ubuntu, OpenSUSE, RedHat, Debian, CentOS, and Fedora.
If the target OS is compatible with one of the [operating system versions](#operating-system-support) below,
specify `-d <DISTRO> -v <VERSION>`, e.g. if the OS is compatible with Ubuntu 18.04, pass
`-d ubuntu -v 18.04` to the script.
## Operating System Support
OmniTrace is only supported on Linux. The following distributions are tested:
- Ubuntu 18.04
- Ubuntu 20.04
- Ubuntu 22.04
- OpenSUSE 15.2
- OpenSUSE 15.3
- OpenSUSE 15.4
- RedHat 8.7
- RedHat 9.0
- RedHat 9.1
Other OS distributions may be supported but are not tested.
### Identifying the Operating System
If you are unsure of the operating system and version, the `/etc/os-release` and `/usr/lib/os-release` files contain operating system identification data for Linux systems.
```shell
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
...
VERSION_ID="20.04"
...
```
The relevant fields are `ID` and `VERSION_ID`.
## Architecture
With regard to instrumentation, at present only amd64 (x86_64) architectures are tested; however,
Dyninst supports several more architectures and thus, omnitrace instrumentation may support other
CPU architectures such as aarch64, ppc64, etc.
Other modes of use, such as sampling and causal profiling, are not dependent on Dyninst and therefore
may be more portable.
## Installing omnitrace from binary distributions
Every omnitrace release provides binary installer scripts of the form:
```shell
omnitrace-{VERSION}-{OS_DISTRIB}-{OS_VERSION}[-ROCm-{ROCM_VERSION}[-{EXTRA}]].sh
```
E.g.:
```shell
omnitrace-1.0.0-ubuntu-18.04-OMPT-PAPI-Python3.sh
omnitrace-1.0.0-ubuntu-18.04-ROCm-405000-OMPT-PAPI-Python3.sh
...
omnitrace-1.0.0-ubuntu-20.04-ROCm-50000-OMPT-PAPI-Python3.sh
```
Any of the EXTRA fields with a cmake build option (e.g. PAPI, see below) or no link requirements (e.g. OMPT) have
self-contained support for these packages.
### Download the appropriate binary distribution
```shell
wget https://github.com/ROCm/omnitrace/releases/download/v<VERSION>/<SCRIPT>
```
### Create the target installation directory
```shell
mkdir /opt/omnitrace
```
### Run the installer script
```shell
./omnitrace-1.0.0-ubuntu-18.04-ROCm-405000-OMPT-PAPI.sh --prefix=/opt/omnitrace --exclude-subdir
```
## Installing OmniTrace from source
### Build Requirements
OmniTrace needs a GCC compiler with full support for C++17 and CMake v3.16 or higher.
The Clang compiler may be used in lieu of the GCC compiler if Dyninst is already installed.
- GCC compiler v7+
- Older GCC compilers may be supported but are not tested
- Clang compilers are generally supported for [OmniTrace](https://github.com/ROCm/omnitrace) but not Dyninst
- [CMake](https://cmake.org/) v3.16+
> ***If the system installed cmake is too old, installing a new version of cmake can be done through several methods.***
> ***One of the easiest options is to use PyPI (i.e. python's pip):***
>
> ```shell
> pip install --user 'cmake==3.18.4'
> export PATH=${HOME}/.local/bin:${PATH}
> ```
### Required Third-Party Packages
- [DynInst](https://github.com/dyninst/dyninst) for dynamic or static instrumentation
- [TBB](https://github.com/oneapi-src/oneTBB) required by Dyninst
- [ElfUtils](https://sourceware.org/elfutils/) required by Dyninst
- [LibIberty](https://github.com/gcc-mirror/gcc/tree/master/libiberty) required by Dyninst
- [Boost](https://www.boost.org/) required by Dyninst
- [OpenMP](https://www.openmp.org/) optional by Dyninst
- [libunwind](https://www.nongnu.org/libunwind/) for call-stack sampling
All of the third-party packages required by [DynInst](https://github.com/dyninst/dyninst) and
[DynInst](https://github.com/dyninst/dyninst) itself can be built and installed
during the build of omnitrace itself. The table below lists each package, its minimum version,
which package requires it (i.e. omnitrace requires Dyninst
and Dyninst requires TBB), and the CMake option to build the package alongside omnitrace:
| Third-Party Library | Minimum Version | Required By | CMake Option |
|---------------------|-----------------|-------------|-------------------------------------------|
| Dyninst | 12.0 | OmniTrace | `OMNITRACE_BUILD_DYNINST` (default: OFF) |
| Libunwind | | OmniTrace | `OMNITRACE_BUILD_LIBUNWIND` (default: ON) |
| TBB | 2018.6 | Dyninst | `DYNINST_BUILD_TBB` (default: OFF) |
| ElfUtils | 0.178 | Dyninst | `DYNINST_BUILD_ELFUTILS` (default: OFF) |
| LibIberty | | Dyninst | `DYNINST_BUILD_LIBIBERTY` (default: OFF) |
| Boost | 1.67.0 | Dyninst | `DYNINST_BUILD_BOOST` (default: OFF) |
| OpenMP | 4.x | Dyninst | |
### Optional Third-Party Packages
- [ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/latest)
- HIP
- Roctracer for HIP API and kernel tracing
- ROCM-SMI for GPU monitoring
- Rocprofiler for GPU hardware counters
- [PAPI](https://icl.utk.edu/papi/)
- MPI
- `OMNITRACE_USE_MPI` will enable full MPI support
- `OMNITRACE_USE_MPI_HEADERS` will enable wrapping of the dynamically-linked MPI C function calls
- By default, if an OpenMPI MPI distribution cannot be found, omnitrace will use a local copy of the OpenMPI mpi.h
- Several optional third-party profiling tools supported by timemory (e.g. [Caliper](https://github.com/LLNL/Caliper), [TAU](https://www.cs.uoregon.edu/research/tau/home.php), CrayPAT, etc.)
| Third-Party Library | CMake Enable Option | CMake Build Option |
|---------------------|--------------------------------------------|--------------------------------------|
| PAPI | `OMNITRACE_USE_PAPI` (default: ON) | `OMNITRACE_BUILD_PAPI` (default: ON) |
| MPI | `OMNITRACE_USE_MPI` (default: OFF) | |
| MPI (header-only) | `OMNITRACE_USE_MPI_HEADERS` (default: ON) | |
### Installing DynInst
#### Building Dyninst alongside OmniTrace
The easiest way to install Dyninst is to configure omnitrace with `OMNITRACE_BUILD_DYNINST=ON`. Depending on the version of Ubuntu, the apt package manager may have current enough
versions of Dyninst's Boost, TBB, and LibIberty dependencies (i.e. `apt-get install libtbb-dev libiberty-dev libboost-dev`); however, it is possible to request Dyninst to install
its dependencies via `DYNINST_BUILD_<DEP>=ON`, e.g.:
```shell
git clone https://github.com/ROCm/omnitrace.git omnitrace-source
cmake -B omnitrace-build -DOMNITRACE_BUILD_DYNINST=ON -DDYNINST_BUILD_{TBB,ELFUTILS,BOOST,LIBIBERTY}=ON omnitrace-source
```
where `-DDYNINST_BUILD_{TBB,BOOST,ELFUTILS,LIBIBERTY}=ON` is expanded by the shell to `-DDYNINST_BUILD_TBB=ON -DDYNINST_BUILD_BOOST=ON ...`
#### Installing Dyninst via Spack
[Spack](https://github.com/spack/spack) is another option to install Dyninst and its dependencies:
```shell
git clone https://github.com/spack/spack.git
source ./spack/share/spack/setup-env.sh
spack compiler find
spack external find --all --not-buildable
spack spec -I --reuse dyninst
spack install --reuse dyninst
spack load -r dyninst
```
### Installing omnitrace
OmniTrace has cmake configuration options for supporting MPI (`OMNITRACE_USE_MPI` or `OMNITRACE_USE_MPI_HEADERS`), HIP kernel tracing (`OMNITRACE_USE_ROCTRACER`),
sampling ROCm devices (`OMNITRACE_USE_ROCM_SMI`), OpenMP-Tools (`OMNITRACE_USE_OMPT`), hardware counters via PAPI (`OMNITRACE_USE_PAPI`), among others.
Various additional features can be enabled via the [`TIMEMORY_USE_*` CMake options](https://timemory.readthedocs.io/en/develop/installation.html#cmake-options).
Any `OMNITRACE_USE_<VAL>` option which has a corresponding `TIMEMORY_USE_<VAL>` option means that the support within timemory for this feature has been integrated
into omnitrace's perfetto support, e.g. `OMNITRACE_USE_PAPI=<VAL>` forces `TIMEMORY_USE_PAPI=<VAL>` and the data that timemory is able to collect via this package
is passed along to perfetto and will be displayed when the `.proto` file is visualized in [ui.perfetto.dev](https://ui.perfetto.dev).
```shell
git clone https://github.com/ROCm/omnitrace.git omnitrace-source
cmake \
-B omnitrace-build \
-D CMAKE_INSTALL_PREFIX=/opt/omnitrace \
-D OMNITRACE_USE_HIP=ON \
-D OMNITRACE_USE_ROCM_SMI=ON \
-D OMNITRACE_USE_ROCTRACER=ON \
-D OMNITRACE_USE_PYTHON=ON \
-D OMNITRACE_USE_OMPT=ON \
-D OMNITRACE_USE_MPI_HEADERS=ON \
-D OMNITRACE_BUILD_PAPI=ON \
-D OMNITRACE_BUILD_LIBUNWIND=ON \
-D OMNITRACE_BUILD_DYNINST=ON \
-D DYNINST_BUILD_TBB=ON \
-D DYNINST_BUILD_BOOST=ON \
-D DYNINST_BUILD_ELFUTILS=ON \
-D DYNINST_BUILD_LIBIBERTY=ON \
omnitrace-source
cmake --build omnitrace-build --target all --parallel 8
cmake --build omnitrace-build --target install
source /opt/omnitrace/share/omnitrace/setup-env.sh
```
#### MPI Support within OmniTrace
[OmniTrace](https://github.com/ROCm/omnitrace) can have full (`OMNITRACE_USE_MPI=ON`) or partial (`OMNITRACE_USE_MPI_HEADERS=ON`) MPI support.
The only difference between these two modes is whether or not the results collected via timemory and/or perfetto can be aggregated into a single
output file during finalization. When full MPI support is enabled, combining the timemory results always occurs whereas combining the perfetto
results is configurable via the `OMNITRACE_PERFETTO_COMBINE_TRACES` setting.
The primary benefits of partial or full MPI support are the automatic wrapping of MPI functions and the ability
to label output with suffixes which correspond to the `MPI_COMM_WORLD` rank ID instead of using the system process identifier (i.e. PID).
In general, it is recommended to use partial MPI support with the OpenMPI headers as this is the most portable configuration.
If full MPI support is selected, make sure your target application is built against the same MPI distribution as omnitrace,
i.e. do not build omnitrace with MPICH and use it on a target application built against OpenMPI.
If partial support is selected, the reason the OpenMPI headers are recommended instead of the MPICH headers is
because the `MPI_COMM_WORLD` in OpenMPI is a pointer to `ompi_communicator_t` (8 bytes), whereas the `MPI_COMM_WORLD` in MPICH
is an `int` (4 bytes). Building omnitrace with partial MPI support and the MPICH headers and then using
omnitrace on an application built against OpenMPI will cause a segmentation fault due to the value of the `MPI_COMM_WORLD` being narrowed
during the function wrapping before being passed along to the underlying MPI function.
## Post-Installation Steps
### Configure the environment
If environment modules are available and preferred:
```shell
module use /opt/omnitrace/share/modulefiles
module load omnitrace/1.0.0
```
Alternatively, one can directly source the `setup-env.sh` script:
```shell
source /opt/omnitrace/share/omnitrace/setup-env.sh
```
### Test the executables
Successful execution of these commands indicates that the installation does not have any issues locating the installed libraries:
```shell
omnitrace-instrument --help
omnitrace-avail --help
```
> ***NOTE: If ROCm support was enabled, you may have to add the path to the ROCm libraries to `LD_LIBRARY_PATH`, e.g. `export LD_LIBRARY_PATH=/opt/rocm/lib:${LD_LIBRARY_PATH}`***
@@ -1,835 +0,0 @@
# Binary Instrumentation
```eval_rst
.. toctree::
   :glob:
   :maxdepth: 4
```
## omnitrace-instrument Executable
> ***NOTE: With the introduction of `omnitrace-sample`, in future versions of omnitrace, the current `omnitrace` executable***
> ***noted below will likely be renamed to `omnitrace-instrument` and a new `omnitrace` executable will serve as a common***
> ***executable for multiple executables, e.g. `omnitrace-instrument sample ...`, `omnitrace run ...`, `omnitrace rewrite ...`, etc.***
Instrumentation is performed with the `omnitrace-instrument` executable. View the help menu with the `-h` / `--help` option:
```console
$ omnitrace-instrument --help
[omnitrace-instrument] Usage: omnitrace-instrument [ --help (count: 0, dtype: bool)
--version (count: 0, dtype: bool)
--verbose (max: 1, dtype: bool)
--error (max: 1, dtype: boolean)
--debug (max: 1, dtype: bool)
--log (count: 1)
--log-file (count: 1)
--simulate (max: 1, dtype: boolean)
--print-format (min: 1, dtype: string)
--print-dir (count: 1, dtype: string)
--print-available (count: 1)
--print-instrumented (count: 1)
--print-coverage (count: 1)
--print-excluded (count: 1)
--print-overlapping (count: 1)
--print-instructions (max: 1, dtype: bool)
--output (min: 0, dtype: string)
--pid (count: 1, dtype: int)
--mode (count: 1)
--force (max: 1, dtype: bool)
--command (count: 1)
--prefer (count: 1)
--library (count: unlimited)
--main-function (count: 1)
--load (count: unlimited, dtype: string)
--load-instr (count: unlimited, dtype: filepath)
--init-functions (count: unlimited, dtype: string)
--fini-functions (count: unlimited, dtype: string)
--all-functions (max: 1, dtype: boolean)
--function-include (count: unlimited)
--function-exclude (count: unlimited)
--function-restrict (count: unlimited)
--caller-include (count: unlimited)
--module-include (count: unlimited)
--module-exclude (count: unlimited)
--module-restrict (count: unlimited)
--internal-function-include (count: unlimited)
--internal-module-include (count: unlimited)
--instruction-exclude (count: unlimited)
--internal-library-deps (min: 0, dtype: boolean)
--internal-library-append (count: unlimited)
--internal-library-remove (count: unlimited)
--linkage (min: 1)
--visibility (min: 1)
--label (count: unlimited, dtype: string)
--config (min: 1, dtype: string)
--default-components (count: unlimited, dtype: string)
--env (count: unlimited)
--mpi (max: 1, dtype: bool)
--instrument-loops (max: 1, dtype: boolean)
--min-instructions (count: 1, dtype: int)
--min-address-range (count: 1, dtype: int)
--min-instructions-loop (count: 1, dtype: int)
--min-address-range-loop (count: 1, dtype: int)
--coverage (max: 1, dtype: bool)
--dynamic-callsites (max: 1, dtype: boolean)
--traps (max: 1, dtype: boolean)
--loop-traps (max: 1, dtype: boolean)
--allow-overlapping (max: 1, dtype: bool)
--parse-all-modules (max: 1, dtype: bool)
--batch-size (count: 1, dtype: int)
--dyninst-rt (min: 1, dtype: filepath)
--dyninst-options (count: unlimited)
] -- <CMD> <ARGS>
Options:
-h, -?, --help Shows this page
--version Prints the version and exit
[DEBUG OPTIONS]
-v, --verbose Verbose output
-e, --error All warnings produce runtime errors
--debug Debug output
--log Number of log entries to display after an error. Any value < 0 will emit the entire log
--log-file Write the log out the specified file during the run
--simulate Exit after outputting diagnostic {available,instrumented,excluded,overlapping} module
function lists, e.g. available.txt
--print-format [ json | txt | xml ]
Output format for diagnostic {available,instrumented,excluded,overlapping} module
function lists, e.g. {print-dir}/available.txt
--print-dir Output directory for diagnostic {available,instrumented,excluded,overlapping} module
function lists, e.g. {print-dir}/available.txt
--print-available [ functions | functions+ | modules | pair | pair+ ]
Print the available entities for instrumentation (functions, modules, or module-function
pair) to stdout after applying regular expressions
--print-instrumented [ functions | functions+ | modules | pair | pair+ ]
Print the instrumented entities (functions, modules, or module-function pair) to stdout
after applying regular expressions
--print-coverage [ functions | functions+ | modules | pair | pair+ ]
Print the instrumented coverage entities (functions, modules, or module-function pair) to
stdout after applying regular expressions
--print-excluded [ functions | functions+ | modules | pair | pair+ ]
Print the entities for instrumentation (functions, modules, or module-function pair)
which are excluded from the instrumentation to stdout after applying regular expressions
--print-overlapping [ functions | functions+ | modules | pair | pair+ ]
Print the entities for instrumentation (functions, modules, or module-function pair)
which overlap other function calls or have multiple entry points to stdout after applying
regular expressions
--print-instructions Print the instructions for each basic-block in the JSON/XML outputs
[MODE OPTIONS]
-o, --output Enable generation of a new executable (binary-rewrite). If a filename is not provided,
omnitrace will use the basename and output to the cwd, unless the target binary is in the
cwd. In the latter case, omnitrace will either use ${PWD}/<basename>.inst (non-libraries)
or ${PWD}/instrumented/<basename> (libraries)
-p, --pid Connect to running process
-M, --mode [ coverage | sampling | trace ]
Instrumentation mode. 'trace' mode instruments the selected functions, 'sampling' mode
only instruments the main function to start and stop the sampler.
-f, --force Force the command-line argument configuration, i.e. don't get cute. Useful for forcing
runtime instrumentation of an executable that [A] Dyninst thinks is a library after
reading ELF and [B] whose name makes it look like a library (e.g. starts with 'lib'
and/or ends in '.so', '.so.*', or '.a')
-c, --command Input executable and arguments (if '-- <CMD>' not provided)
[LIBRARY OPTIONS]
--prefer [ shared | static ] Prefer this library types when available
-L, --library Libraries with instrumentation routines (default: "libomnitrace-dl")
-m, --main-function The primary function to instrument around, e.g. 'main'
--load Supplemental instrumentation library names w/o extension (e.g. 'libinstr' for
'libinstr.so' or 'libinstr.a')
--load-instr Load {available,instrumented,excluded,overlapping}-instr JSON or XML file(s) and override
what is read from the binary
--init-functions Initialization function(s) for supplemental instrumentation libraries (see '--load'
option)
--fini-functions Finalization function(s) for supplemental instrumentation libraries (see '--load' option)
--all-functions When finding functions, include the functions which are not instrumentable. This is
purely diagnostic for the available/excluded functions output
[SYMBOL SELECTION OPTIONS]
-I, --function-include Regex(es) for including functions (despite heuristics)
-E, --function-exclude Regex(es) for excluding functions (always applied)
-R, --function-restrict Regex(es) for restricting functions only to those that match the provided
regular-expressions
--caller-include Regex(es) for including functions that call the listed functions (despite heuristics)
-MI, --module-include Regex(es) for selecting modules/files/libraries (despite heuristics)
-ME, --module-exclude Regex(es) for excluding modules/files/libraries (always applied)
-MR, --module-restrict Regex(es) for restricting modules/files/libraries only to those that match the provided
regular-expressions
--internal-function-include Regex(es) for including functions which are (likely) utilized by omnitrace itself. Use
this option with care.
--internal-module-include Regex(es) for including modules/libraries which are (likely) utilized by omnitrace
itself. Use this option with care.
--instruction-exclude Regex(es) for excluding functions containing certain instructions
--internal-library-deps Treat the libraries linked to the internal libraries as internal libraries. This increase
the internal library processing time and consume more memory (so use with care) but may
be useful when the application uses Boost libraries and Dyninst is dynamically linked
against the same boost libraries
--internal-library-append Append to the list of libraries which omnitrace treats as being used internally, e.g.
OmniTrace will find all the symbols in this library and prevent them from being
instrumented.
--internal-library-remove [ ld-linux-x86-64.so.2
libBrokenLocale.so.1
libanl.so.1
libbfd.so
libbz2.so
libc.so.6
libcaliper.so
libcommon.so
libcrypt.so.1
libdl.so.2
libdw.so
libdwarf.so
libdyninstAPI_RT.so
libelf.so
libgcc_s.so.1
libgotcha.so
liblikwid.so
liblzma.so
libnsl.so.1
libnss_compat.so.2
libnss_db.so.2
libnss_dns.so.2
libnss_files.so.2
libnss_hesiod.so.2
libnss_ldap.so.2
libnss_nis.so.2
libnss_nisplus.so.2
libnss_test1.so.2
libnss_test2.so.2
libpapi.so
libpfm.so
libprofiler.so
libpthread.so.0
libresolv.so.2
librocm_smi64.so
librocmtools.so
librocprofiler64.so
libroctracer64.so
libroctx64.so
librt.so.1
libstdc++.so.6
libtbb.so
libtbbmalloc.so
libtbbmalloc_proxy.so
libtcmalloc.so
libtcmalloc_and_profiler.so
libtcmalloc_debug.so
libtcmalloc_minimal.so
libtcmalloc_minimal_debug.so
libthread_db.so.1
libunwind-coredump.so
libunwind-generic.so
libunwind-ptrace.so
libunwind-setjmp.so
libunwind-x86_64.so
libunwind.so
libutil.so.1
libz.so
libzstd.so ]
Remove the specified libraries from being treated as being used internally, e.g.
OmniTrace will permit all the symbols in these libraries to be eligible for
instrumentation.
--linkage [ global | local | unique | unknown | weak ]
Only instrument functions with specified linkage (default: global, local, unique)
--visibility [ default | hidden | internal | protected | unknown ]
Only instrument functions with specified visibility (default: default, internal, hidden,
protected)
[RUNTIME OPTIONS]
--label [ args | file | line | return ]
Labeling info for functions. By default, just the function name is recorded. Use these
options to gain more information about the function signature or location of the
functions
-C, --config Read in a configuration file and encode these values as the defaults in the executable
-d, --default-components Default components to instrument (only useful when timemory is enabled in omnitrace
library)
--env Environment variables to add to the runtime in form VARIABLE=VALUE. E.g. use '--env
OMNITRACE_PROFILE=ON' to default to using timemory instead of perfetto
--mpi Enable MPI support (requires omnitrace built w/ full or partial MPI support). NOTE: this
will automatically be activated if MPI_Init, MPI_Init_thread, MPI_Finalize,
MPI_Comm_rank, or MPI_Comm_size are found in the symbol table of target
[GRANULARITY OPTIONS]
-l, --instrument-loops Instrument at the loop level
-i, --min-instructions If the number of instructions in a function is less than this value, exclude it from
instrumentation
-r, --min-address-range If the address range of a function is less than this value, exclude it from
instrumentation
--min-instructions-loop If the number of instructions in a function containing a loop is less than this value,
exclude it from instrumentation
--min-address-range-loop If the address range of a function containing a loop is less than this value, exclude it
from instrumentation
--coverage [ basic_block | function | none ]
Enable recording the code coverage. If instrumenting in coverage mode ('-M converage'),
this simply specifies the granularity. If instrumenting in trace or sampling mode, this
enables recording code-coverage in addition to the instrumentation of that mode (if any).
--dynamic-callsites Force instrumentation if a function has dynamic callsites (e.g. function pointers)
--traps Instrument points which require using a trap. On the x86 architecture, because
instructions are of variable size, the instruction at a point may be too small for
Dyninst to replace it with the normal code sequence used to call instrumentation. Also,
when instrumentation is placed at points other than subroutine entry, exit, or call
points, traps may be used to ensure the instrumentation fits. In this case, Dyninst
replaces the instruction with a single-byte instruction that generates a trap.
--loop-traps Instrument points within a loop which require using a trap (only relevant when
--instrument-loops is enabled).
--allow-overlapping Allow dyninst to instrument either multiple functions which overlap (share part of same
function body) or single functions with multiple entry points. For more info, see Section
2 of the DyninstAPI documentation.
--parse-all-modules By default, omnitrace simply requests Dyninst to provide all the procedures in the
application image. If this option is enabled, omnitrace will iterate over all the modules
and extract the functions. Theoretically, it should be the same but the data is slightly
different, possibly due to weak binding scopes. In general, enabling option will probably
have no visible effect
[DYNINST OPTIONS]
-b, --batch-size Dyninst supports batch insertion of multiple points during runtime instrumentation. If
one large batch insertion fails, this value will be used to create smaller batches.
Larger batches generally decrease the instrumentation time
--dyninst-rt Path(s) to the dyninstAPI_RT library
--dyninst-options [ BaseTrampDeletion
DebugParsing
DelayedParsing
InstrStackFrames
MergeTramp
SaveFPR
TrampRecursive
TypeChecking ]
Advanced dyninst options: BPatch::set<OPTION>(bool), e.g. bpatch->setTrampRecursive(true)
```
There are three ways to perform instrumentation:
1. Running the application via the omnitrace-instrument executable (analogous to `gdb --args <program> <args>`)
   - This mode is the default if neither the `-p` nor the `-o` command-line option is used
   - Runtime instrumentation supports instrumenting not only the target executable but also
     the shared libraries loaded by the target executable. Consequently, this mode consumes more memory,
     takes longer to perform the instrumentation, and tends to have a more significant overhead on the
     runtime of the application
   - This mode is recommended if you want to analyze not only the performance of your executable and/or
     libraries but also the performance of the library dependencies
2. Attaching to a process that is currently running (analogous to `gdb -p <PID>`)
   - This mode is activated via the `-p <PID>` option
   - The same caveats as the first mode apply with respect to memory and overhead
3. Generating a new executable or library with the instrumentation built-in (binary rewrite)
   - This mode is activated via the `-o <output-file>` option
   - Binary rewriting is limited to the text section of the target executable or library: it will not instrument
     the dynamically-linked libraries. Consequently, this mode performs the instrumentation significantly faster
     and has a much lower overhead when running the instrumented executable and/or libraries
   - Binary rewriting is the recommended mode when the target executable uses process-level parallelism (e.g. MPI)
   - If your target executable has a minimal main and the bulk of your application is in one specific dynamic library,
     see [Binary Rewriting a Library](#binary-rewriting-a-library) for help
> ***Attaching to a running process is an alpha feature. Detaching from the target process***
> ***without ending the target process is not currently supported.***
The general syntax for separating omnitrace command line arguments from the application arguments
is consistent with the LLVM style of using a standalone double-hyphen (`--`). All arguments preceding the double-hyphen
are interpreted as belonging to omnitrace and all arguments following the double-hyphen are interpreted as the
application and its arguments. In binary rewrite mode, all application arguments after the first argument
are ignored. For example, `./omnitrace-instrument -o ls.inst -- ls -l` interprets `ls` as the target to instrument (ignoring the `-l` argument)
and generates an `ls.inst` executable that you can subsequently run with `omnitrace-run -- ls.inst -l`.
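The splitting rule can be illustrated with a small shell sketch. The `split_args` function below is hypothetical and is not part of omnitrace; it only mimics the convention described above by labelling each argument according to which side of the first standalone `--` it falls on.

```shell
# Hypothetical helper (not part of omnitrace): label each argument by the side
# of the first standalone "--" on which it appears.
split_args() {
    side="omnitrace"
    for arg in "$@"; do
        if [ "$side" = "omnitrace" ] && [ "$arg" = "--" ]; then
            side="application"   # everything after the first "--" belongs to the target
            continue
        fi
        printf '%s: %s\n' "$side" "$arg"
    done
}

split_args -o ls.inst -- ls -l
# omnitrace: -o
# omnitrace: ls.inst
# application: ls
# application: -l
```

In this sketch, any later `--` is passed through as an application argument. That choice is an assumption made for illustration, not documented omnitrace behavior.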
## Runtime Instrumentation
```shell
omnitrace-instrument <omnitrace-options> -- <exe> [<exe-options>...]
```
## Attaching to Running Process
```shell
omnitrace-instrument <omnitrace-options> -p <PID> -- <exe-name>
```
## Binary Rewrite
```shell
omnitrace-instrument <omnitrace-options> -o <name-of-new-exe-or-library> -- <exe-or-library>
```
### Binary Rewriting a Library
Many applications bundle the bulk of their functionality into one or more dynamic libraries and have a relatively simple main
which links to these libraries and simply serves as the "driver" for setting up the workflow. If you binary rewrite your
executable and find there is insufficient info because of this, you can either switch to runtime instrumentation or
binary rewrite the libraries of interest.
Support for standalone binary rewriting of a dynamic library without binary rewriting the executable is a beta feature.
In general, it is supported as long as the library contains the `_init` and `_fini` symbols, but these symbols are not
as standardized as `main` is in an executable.
The recommended workflow is as follows:
1. Determine the names of the dynamically linked libraries of interest via `ldd`
2. Generate a binary rewrite of the executable
3. Generate a binary rewrite of the desired libraries with the same base name as the original library, e.g. `libfoo.so.2` instead of `libfoo.so`
   - Output the instrumented library into a different folder than the original library
4. Prefix the `LD_LIBRARY_PATH` environment variable with the output folder from step 3
5. Verify via `ldd` that the instrumented executable resolves the location of the instrumented library
### Binary Rewriting a Library Example
The `foo` executable is dynamically linked to `libfoo.so.2`:
```shell
$ pwd
/home/user
$ which foo
/usr/local/bin/foo
$ ldd /usr/local/bin/foo
...
libfoo.so.2 => /usr/local/lib/libfoo.so.2 (...)
...
```
Generate binary rewrites of `foo` and `libfoo.so.2`:
```shell
omnitrace-instrument -o ./foo.inst -- foo
omnitrace-instrument -o ./libfoo.so.2 -- /usr/local/lib/libfoo.so.2
```
At this point, the instrumented `foo.inst` executable will still dynamically load the original `libfoo.so.2` in `/usr/local/lib`:
```shell
$ ldd ./foo.inst
...
libfoo.so.2 => /usr/local/lib/libfoo.so.2 (...)
...
```
Prefix the `LD_LIBRARY_PATH` environment variable with the folder containing the instrumented `libfoo.so.2`:
```shell
export LD_LIBRARY_PATH=/home/user:${LD_LIBRARY_PATH}
```
When `foo.inst` is executed, it will now load the instrumented library:
```shell
$ ldd ./foo.inst
...
libfoo.so.2 => /home/user/libfoo.so.2 (...)
...
```
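The final `ldd` check can also be scripted. The sketch below assumes the usual `name => path (address)` layout of `ldd` output; `resolves_in` is a hypothetical helper name, and the sample line fed to it mirrors the output shown above.

```shell
# Hypothetical helper: succeeds when the ldd-style text on stdin resolves
# library $1 to a path inside directory $2.
resolves_in() {
    awk -v lib="$1" -v dir="$2" '
        $1 == lib && $2 == "=>" && index($3, dir "/") == 1 { found = 1 }
        END { exit(found ? 0 : 1) }'
}

# Typical use, once the instrumented files from the example above exist:
#   ldd ./foo.inst | resolves_in libfoo.so.2 /home/user
# Demonstration on a sample line:
printf '\tlibfoo.so.2 => /home/user/libfoo.so.2 (...)\n' \
    | resolves_in libfoo.so.2 /home/user \
    && echo "instrumented libfoo.so.2 will be loaded"
```

A non-zero exit status means the original library is still being picked up, which usually points to a missing or misordered `LD_LIBRARY_PATH` entry.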
## Selective Instrumentation
By default, omnitrace-instrument does not instrument every symbol in the binary. The default rules are:
- Skip instrumenting dynamic call-sites (i.e. function pointers)
  - Option `--dynamic-callsites` will force instrumentation for all dynamic call-sites
- The cost of a function can be loosely approximated by the number of instructions, so by default, omnitrace-instrument only instruments functions with at least 1024 instructions
  - Option `--min-instructions` will modify this heuristic for all functions which do not contain loops
  - Option `--min-instructions-loop` will modify this heuristic for functions which contain loops
    - This separate loop option is provided because functions with loops can be compact in the binary while also being costly
- The cost of a function can also be loosely approximated by the size of the function in the binary, so this heuristic can be used in lieu of or in addition to the minimum number of instructions
  - Option `--min-address-range` will modify this heuristic for all functions which do not contain loops
  - Option `--min-address-range-loop` will modify this heuristic for functions which contain loops
    - This separate loop option is provided because functions with loops can be compact in the binary while also being costly
- Skip instrumentation points which require using a trap
  - See the description for the `--traps` and `--loop-traps` options for more information
- Skip instrumenting loops within the body of a function
  - Option `--instrument-loops` will enable this behavior
- Skip instrumenting functions with overlapping function bodies and single functions with multiple entry points
  - These arise from various optimizations and instrumenting these functions can be enabled via the `--allow-overlapping` option
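As a sketch of how these options combine, the following prints a command line that loosens several of the default rules. The target name `foo` and the threshold values are placeholders chosen for illustration, not recommendations; only the option names come from the help text.

```shell
# Placeholder thresholds: keep functions with at least 256 instructions,
# or at least 64 instructions when the function contains a loop.
opts="--min-instructions 256 --min-instructions-loop 64"
# Also instrument loops inside function bodies and force dynamic call-sites.
opts="$opts --instrument-loops --dynamic-callsites"

# Print the command for review; remove the leading "echo" to run it.
echo omnitrace-instrument -o foo.inst $opts -- foo
```

Lowering the thresholds increases the number of instrumented functions and therefore the runtime overhead, so it is usually worth changing one option at a time.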
### Viewing the Available, Instrumented, Excluded, and Overlapping Functions
Whenever omnitrace-instrument is executed with a verbosity of zero or higher, it emits files which detail which functions (and which module they were defined in)
were available for instrumentation, which functions were instrumented, which functions were excluded, and which functions contained overlapping function bodies.
The default output path of these files is an `omnitrace-<NAME>-output` folder, where `<NAME>` is the basename of the targeted binary or,
in the case of binary rewrite, the basename of the resulting executable. For example,
`omnitrace-instrument -- ls` will output its files to `omnitrace-ls-output` whereas `omnitrace-instrument -o ls.inst -- ls` will output to `omnitrace-ls.inst-output`.
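The naming rule can be written as a one-line shell function. `output_dir` is a hypothetical helper, not an omnitrace command, and it assumes the default output path has not been overridden by any configuration.

```shell
# Hypothetical helper: $1 is the target binary, $2 is the optional value
# passed to -o. Prints the default output folder name described above.
output_dir() {
    printf 'omnitrace-%s-output\n' "$(basename "${2:-$1}")"
}

output_dir ls            # omnitrace-ls-output
output_dir ls ls.inst    # omnitrace-ls.inst-output
```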
If you would like to generate these files without executing or generating an executable, use the `--simulate` option:
```shell
omnitrace-instrument --simulate -- foo
omnitrace-instrument --simulate -o foo.inst -- foo
```
### Excluding and Including Modules and Functions
[OmniTrace](https://github.com/ROCm/omnitrace) has six command-line options, each of which accepts one or more regular expressions, for customizing which modules and/or functions are
instrumented. Multiple regexes per option are treated as an OR operation, e.g. `--module-include libfoo libbar` is effectively the same as `--module-include 'libfoo|libbar'`.
If you would like to force the inclusion of certain modules and/or functions without changing any of the heuristics, use the `--module-include` and/or `--function-include` options.
Note that these options will not exclude modules and/or functions which do not satisfy their regular expression.
If you would like to narrow the scope of the instrumentation to a specific set of libraries and/or functions, use the `--module-restrict` and `--function-restrict` options.
Applying these options allows you to exclusively select the union of one or more regular expressions, regardless of whether or not the functions satisfy the
aforementioned default heuristics. Any function or module that is not within the union of these regular expressions will be excluded from instrumentation.
If you would like to avoid instrumenting a set of modules and/or functions, use the `--module-exclude` and `--function-exclude` options.
These options are always applied regardless of whether the module or function satisfied the "restrict" or "include" regular expression.
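The interaction between the three kinds of options can be modeled in a few lines of shell. This is only a reading of the rules as stated above, not omnitrace's implementation: `selected` and `matches` are hypothetical names, `grep -E` stands in for whatever regex engine omnitrace uses, and treating "restrict" as taking priority over "include" is an assumption.

```shell
# Hypothetical model of the include/restrict/exclude rules (not omnitrace code).
matches() { printf '%s\n' "$1" | grep -Eq -- "$2"; }

# Usage: selected NAME PASSES_HEURISTICS(yes|no) RESTRICT INCLUDE EXCLUDE
# An empty regex argument means the option was not given.
selected() {
    name=$1; heur=$2; restrict=$3; include=$4; exclude=$5
    # "exclude" is always applied, whatever the other options say.
    if [ -n "$exclude" ] && matches "$name" "$exclude"; then return 1; fi
    # "restrict" selects matching names regardless of heuristics and drops the rest.
    if [ -n "$restrict" ]; then
        if matches "$name" "$restrict"; then return 0; else return 1; fi
    fi
    # "include" forces matching names in; non-matching names fall back to heuristics.
    if [ -n "$include" ] && matches "$name" "$include"; then return 0; fi
    [ "$heur" = yes ]
}

selected libfoo.so no  ""     'libfoo|libbar' ""     && echo "libfoo.so: instrumented"
selected libbaz.so no  ""     'libfoo|libbar' ""     || echo "libbaz.so: skipped"
selected libfoo.so yes libfoo ""              libfoo || echo "libfoo.so: skipped"
```

The first call shows "include" overriding the heuristics, the second shows that "include" does not exclude or promote non-matching names, and the third shows "exclude" winning over "restrict".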
#### Example Available Module and Function Info Output
> ***`omnitrace-instrument -o lulesh.inst --label file line args --simulate -- lulesh`***
```console
AddressRange Module Function FunctionSignature
9165 ../examples/lulesh/lulesh-comm.cc CommMonoQ CommMonoQ(domain) [lulesh-comm.cc:1891]
3396 ../examples/lulesh/lulesh-comm.cc CommRecv CommRecv(domain, int, Index_t, Index_t, Index_t, Index_t, bool, bool) [lulesh...
8666 ../examples/lulesh/lulesh-comm.cc CommSBN CommSBN(domain, int, Domain_member *) [lulesh-comm.cc:926]
10212 ../examples/lulesh/lulesh-comm.cc CommSend CommSend(domain, int, Index_t, Domain_member *, Index_t, Index_t, Index_t, bo...
6823 ../examples/lulesh/lulesh-comm.cc CommSyncPosVel CommSyncPosVel(domain) [lulesh-comm.cc:1404]
126 ../examples/lulesh/lulesh-comm.cc _GLOBAL__sub_I_lulesh_comm.cc _GLOBAL__sub_I_lulesh_comm.cc() [lulesh-comm.cc]
308 ../examples/lulesh/lulesh-init.cc .omp_outlined..26 .omp_outlined..26(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
628 ../examples/lulesh/lulesh-init.cc .omp_outlined..34 .omp_outlined..34(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
656 ../examples/lulesh/lulesh-init.cc .omp_outlined..41 .omp_outlined..41(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
662 ../examples/lulesh/lulesh-init.cc .omp_outlined..45 .omp_outlined..45(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
550 ../examples/lulesh/lulesh-init.cc .omp_outlined..55 .omp_outlined..55(const , const , const ParallelFor<Kokkos::Impl::ViewFill<Ko...
556 ../examples/lulesh/lulesh-init.cc .omp_outlined..57 .omp_outlined..57(const , const , const ParallelFor<Kokkos::Impl::ViewFill<Ko...
550 ../examples/lulesh/lulesh-init.cc .omp_outlined..78 .omp_outlined..78(const , const , const ParallelFor<Kokkos::Impl::ViewFill<Ko...
640 ../examples/lulesh/lulesh-init.cc .omp_outlined..84 .omp_outlined..84(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
646 ../examples/lulesh/lulesh-init.cc .omp_outlined..88 .omp_outlined..88(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
1840 ../examples/lulesh/lulesh-init.cc Domain::AllocateElemPersistent Domain::AllocateElemPersistent(Domain *, Int_t) [lulesh-init.cc:94]
1384 ../examples/lulesh/lulesh-init.cc Domain::AllocateNodePersistent Domain::AllocateNodePersistent(Domain *, Int_t) [lulesh-init.cc:94]
1264 ../examples/lulesh/lulesh-init.cc Domain::BuildMesh Domain::BuildMesh(Domain *, Int_t, Int_t, Int_t) [lulesh-init.cc:308]
2312 ../examples/lulesh/lulesh-init.cc Domain::CreateRegionIndexSets Domain::CreateRegionIndexSets(Domain *, Int_t, Int_t) [lulesh-init.cc:409]
7109 ../examples/lulesh/lulesh-init.cc Domain::Domain Domain::Domain(Domain *, Int_t, Index_t, Index_t, Index_t, Index_t, int, int,...
2458 ../examples/lulesh/lulesh-init.cc Domain::SetupBoundaryConditions Domain::SetupBoundaryConditions(Domain *, Int_t) [lulesh-init.cc:409]
956 ../examples/lulesh/lulesh-init.cc Domain::SetupCommBuffers Domain::SetupCommBuffers(Domain *, Int_t) [lulesh-init.cc]
1456 ../examples/lulesh/lulesh-init.cc Domain::SetupElementConnectivities Domain::SetupElementConnectivities(Domain *, Int_t) [lulesh-init.cc:409]
721 ../examples/lulesh/lulesh-init.cc Domain::SetupSymmetryPlanes Domain::SetupSymmetryPlanes(Domain *, Int_t) [lulesh-init.cc:409]
1591 ../examples/lulesh/lulesh-init.cc Domain::SetupThreadSupportStructures Domain::SetupThreadSupportStructures(Domain *) [lulesh-init.cc:376]
1644 ../examples/lulesh/lulesh-init.cc Domain::~Domain Domain::~Domain(Domain *) [lulesh-init.cc:286]
218 ../examples/lulesh/lulesh-init.cc InitMeshDecomp InitMeshDecomp(Int_t, Int_t, Int_t *, Int_t *, Int_t *, Int_t *) [lulesh-init...
260 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::CommonSubview<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokk... Kokkos::Impl::CommonSubview<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokk...
1786 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::HostIterateTile<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::R... Kokkos::Impl::HostIterateTile<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::R...
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int**... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int**...
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int**... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int**...
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int*,... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int*,...
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int*,... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int*,...
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl...
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl...
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl...
522 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::... Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::...
232 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::... Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::...
49 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewVal... Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewVal...
1476 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::Tile_Loop_Type<2, false, int, void, void>::apply<Kokkos::Impl::... Kokkos::Impl::Tile_Loop_Type<2, false, int, void, void>::apply<Kokkos::Impl::...
555 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::LayoutRight, Kokkos::Devic... Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::LayoutRight, Kokkos::Devic...
613 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::LayoutRight, Kokkos::Devic... Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::LayoutRight, Kokkos::Devic...
603 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCopy<Kokkos::View<int*, Kokkos::LayoutLeft, Kokkos::Device<... Kokkos::Impl::ViewCopy<Kokkos::View<int*, Kokkos::LayoutLeft, Kokkos::Device<...
604 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCopy<Kokkos::View<int*, Kokkos::LayoutLeft, Kokkos::Device<... Kokkos::Impl::ViewCopy<Kokkos::View<int*, Kokkos::LayoutLeft, Kokkos::Device<...
281 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
281 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
281 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
281 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
281 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
524 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev... Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev...
525 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev... Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev...
524 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev... Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev...
583 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<int* [8], Kokkos::LayoutRight>, ... SharedAllocationRecord<void, void> * Kokkos::Impl::ViewMapping<Kokkos::ViewTr...
529 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<int*, Kokkos::HostSpace>, void>:... SharedAllocationRecord<void, void> * Kokkos::Impl::ViewMapping<Kokkos::ViewTr...
529 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<int*>, void>::allocate_shared<st... SharedAllocationRecord<void, void> * Kokkos::Impl::ViewMapping<Kokkos::ViewTr...
203 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewRemap<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::... Kokkos::Impl::ViewRemap<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::...
331 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewRemap<Kokkos::View<int*>, Kokkos::View<int*>, Kokkos::OpenM... Kokkos::Impl::ViewRemap<Kokkos::View<int*>, Kokkos::View<int*>, Kokkos::OpenM...
461 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpa... enable_if_t<std::is_trivial<int>::value && std::is_trivially_copy_assignable<...
353 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::contiguous_fill<Kokkos::OpenMP, double*> Kokkos::Impl::contiguous_fill<Kokkos::OpenMP, double*>(exec_space, dst, value...
139 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::contiguous_fill<Kokkos::OpenMP, double, Kokkos::LayoutRight, Ko... Kokkos::Impl::contiguous_fill<Kokkos::OpenMP, double, Kokkos::LayoutRight, Ko...
824 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight, Kokkos::D... Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight, Kokkos::D...
824 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight, Kokkos::D... Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight, Kokkos::D...
824 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::... Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::...
824 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::... Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::...
697 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::view_copy<Kokkos::View<int*, Kokkos::LayoutRight, Kokkos::Devic... Kokkos::Impl::view_copy<Kokkos::View<int*, Kokkos::LayoutRight, Kokkos::Devic...
697 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::view_copy<Kokkos::View<int*>, Kokkos::View<int*> > Kokkos::Impl::view_copy<Kokkos::View<int*>, Kokkos::View<int*> >(dst, src) [l...
2036 ../examples/lulesh/lulesh-init.cc Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::Schedule<Kokkos::Static>, int>::R... Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::Schedule<Kokkos::Static>, int>::R...
2506 ../examples/lulesh/lulesh-init.cc Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::Schedule<Kokkos::Static>, long>::... Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::Schedule<Kokkos::Static>, long>::...
271 ../examples/lulesh/lulesh-init.cc Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor... Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor...
470 ../examples/lulesh/lulesh-init.cc Kokkos::View<int* [8], Kokkos::LayoutRight>::View<std::__cxx11::basic_string<... Kokkos::View<int* [8], Kokkos::LayoutRight>::View<std::__cxx11::basic_string<...
323 ../examples/lulesh/lulesh-init.cc Kokkos::View<int* [8], Kokkos::LayoutRight>::View<std::__cxx11::basic_string<... Kokkos::View<int* [8], Kokkos::LayoutRight>::View<std::__cxx11::basic_string<...
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*, Kokkos::HostSpace>::View<char [10]> Kokkos::View<int*, Kokkos::HostSpace>::View<char [10]>(View<int *, Kokkos::Ho...
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*, Kokkos::HostSpace>::View<char [14]> Kokkos::View<int*, Kokkos::HostSpace>::View<char [14]>(View<int *, Kokkos::Ho...
462 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*, Kokkos::HostSpace>::View<std::__cxx11::basic_string<char, ... Kokkos::View<int*, Kokkos::HostSpace>::View<std::__cxx11::basic_string<char, ...
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<char [16]> Kokkos::View<int*>::View<char [16]>(View<int *> *, arg_label, type, const siz...
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<char [19]> Kokkos::View<int*>::View<char [19]>(View<int *> *, arg_label, type, const siz...
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<char [21]> Kokkos::View<int*>::View<char [21]>(View<int *> *, arg_label, type, const siz...
462 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<std::__cxx11::basic_string<char, std::char_traits<ch... Kokkos::View<int*>::View<std::__cxx11::basic_string<char, std::char_traits<ch...
323 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<std::__cxx11::basic_string<char, std::char_traits<ch... Kokkos::View<int*>::View<std::__cxx11::basic_string<char, std::char_traits<ch...
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<double*, , double*, Kokkos::LayoutRight, Kokkos::Device<Kok... Kokkos::deep_copy<double*, , double*, Kokkos::LayoutRight, Kokkos::Device<Kok...
1052 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<double*> Kokkos::deep_copy<double*>(dst, value) [lulesh-init.cc]
1050 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<double, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP,... Kokkos::deep_copy<double, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP,...
7686 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenM... Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenM...
7686 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, int* [8], Kokkos::LayoutRigh... Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, int* [8], Kokkos::LayoutRigh...
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int*, , int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::O... Kokkos::deep_copy<int*, , int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::O...
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Ko... Kokkos::deep_copy<int*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Ko...
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP, K... Kokkos::deep_copy<int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP, K...
863 ../examples/lulesh/lulesh-init.cc Kokkos::impl_resize<, int* [8], Kokkos::LayoutRight> type Kokkos::impl_resize<, int* [8], Kokkos::LayoutRight>(v, const size_t, co...
854 ../examples/lulesh/lulesh-init.cc Kokkos::impl_resize<, int*> type Kokkos::impl_resize<, int*>(v, const size_t, const size_t, const size_t,...
697 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (... Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (...
706 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (... Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (...
912 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
791 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
791 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
944 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
839 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
126 ../examples/lulesh/lulesh-init.cc _GLOBAL__sub_I_lulesh_init.cc _GLOBAL__sub_I_lulesh_init.cc() [lulesh-init.cc]
6589 ../examples/lulesh/lulesh-util.cc Kokkos::deep_copy<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP... Kokkos::deep_copy<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP...
1345 ../examples/lulesh/lulesh-util.cc ParseCommandLineOptions ParseCommandLineOptions(int, char * *, int, cmdLineOpts *) [lulesh-util.cc:67]
171 ../examples/lulesh/lulesh-util.cc PrintCommandLineOptions PrintCommandLineOptions(char *, int) [lulesh-util.cc:31]
67 ../examples/lulesh/lulesh-util.cc StrToInt int StrToInt(const char *, int *) [lulesh-util.cc:13]
706 ../examples/lulesh/lulesh-util.cc VerifyAndWriteFinalOutput VerifyAndWriteFinalOutput(Real_t, locDom, Int_t, Int_t) [lulesh-util.cc:222]
126 ../examples/lulesh/lulesh-util.cc _GLOBAL__sub_I_lulesh_util.cc _GLOBAL__sub_I_lulesh_util.cc() [lulesh-util.cc]
17 ../examples/lulesh/lulesh-viz.cc DumpToVisit DumpToVisit(domain, int, int, int) [lulesh-viz.cc:415]
126 ../examples/lulesh/lulesh-viz.cc _GLOBAL__sub_I_lulesh_viz.cc _GLOBAL__sub_I_lulesh_viz.cc() [lulesh-viz.cc]
451 ../examples/lulesh/lulesh.cc .omp_outlined..103 .omp_outlined..103(const , const , const ParallelReduce<(lambda at ../example...
796 ../examples/lulesh/lulesh.cc .omp_outlined..109 .omp_outlined..109(const , const , const ParallelFor<(lambda at ../examples/l...
394 ../examples/lulesh/lulesh.cc .omp_outlined..111 .omp_outlined..111(const , const , const ParallelFor<(lambda at ../examples/l...
402 ../examples/lulesh/lulesh.cc .omp_outlined..113 .omp_outlined..113(const , const , const ParallelFor<(lambda at ../examples/l...
427 ../examples/lulesh/lulesh.cc .omp_outlined..115 .omp_outlined..115(const , const , const ParallelReduce<(lambda at ../example...
859 ../examples/lulesh/lulesh.cc .omp_outlined..119 .omp_outlined..119(const , const , const ParallelFor<(lambda at ../examples/l...
243 ../examples/lulesh/lulesh.cc .omp_outlined..122 .omp_outlined..122(const , const , const ParallelFor<(lambda at ../examples/l...
426 ../examples/lulesh/lulesh.cc .omp_outlined..124 .omp_outlined..124(const , const , const ParallelFor<(lambda at ../examples/l...
529 ../examples/lulesh/lulesh.cc .omp_outlined..127 .omp_outlined..127(const , const , const ParallelFor<(lambda at ../examples/l...
865 ../examples/lulesh/lulesh.cc .omp_outlined..130 .omp_outlined..130(const , const , const ParallelFor<(lambda at ../examples/l...
539 ../examples/lulesh/lulesh.cc .omp_outlined..132 .omp_outlined..132(const , const , const ParallelReduce<(lambda at ../example...
456 ../examples/lulesh/lulesh.cc .omp_outlined..134 .omp_outlined..134(const , const , const ParallelReduce<(lambda at ../example...
252 ../examples/lulesh/lulesh.cc .omp_outlined..20 .omp_outlined..20(const , const , const ParallelFor<(lambda at ../examples/lu...
870 ../examples/lulesh/lulesh.cc .omp_outlined..35 .omp_outlined..35(const , const , const ParallelFor<(lambda at ../examples/lu...
473 ../examples/lulesh/lulesh.cc .omp_outlined..42 .omp_outlined..42(const , const , const ParallelFor<(lambda at ../examples/lu...
252 ../examples/lulesh/lulesh.cc .omp_outlined..46 .omp_outlined..46(const , const , const ParallelFor<(lambda at ../examples/lu...
1101 ../examples/lulesh/lulesh.cc .omp_outlined..48 .omp_outlined..48(const , const , const ParallelFor<(lambda at ../examples/lu...
427 ../examples/lulesh/lulesh.cc .omp_outlined..55 .omp_outlined..55(const , const , const ParallelReduce<(lambda at ../examples...
1326 ../examples/lulesh/lulesh.cc .omp_outlined..57 .omp_outlined..57(const , const , const ParallelReduce<(lambda at ../examples...
243 ../examples/lulesh/lulesh.cc .omp_outlined..61 .omp_outlined..61(const , const , const ParallelFor<(lambda at ../examples/lu...
1101 ../examples/lulesh/lulesh.cc .omp_outlined..63 .omp_outlined..63(const , const , const ParallelFor<(lambda at ../examples/lu...
372 ../examples/lulesh/lulesh.cc .omp_outlined..66 .omp_outlined..66(const , const , const ParallelFor<(lambda at ../examples/lu...
499 ../examples/lulesh/lulesh.cc .omp_outlined..71 .omp_outlined..71(const , const , const ParallelFor<(lambda at ../examples/lu...
499 ../examples/lulesh/lulesh.cc .omp_outlined..73 .omp_outlined..73(const , const , const ParallelFor<(lambda at ../examples/lu...
499 ../examples/lulesh/lulesh.cc .omp_outlined..75 .omp_outlined..75(const , const , const ParallelFor<(lambda at ../examples/lu...
465 ../examples/lulesh/lulesh.cc .omp_outlined..78 .omp_outlined..78(const , const , const ParallelFor<(lambda at ../examples/lu...
396 ../examples/lulesh/lulesh.cc .omp_outlined..81 .omp_outlined..81(const , const , const ParallelFor<(lambda at ../examples/lu...
656 ../examples/lulesh/lulesh.cc .omp_outlined..85 .omp_outlined..85(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
662 ../examples/lulesh/lulesh.cc .omp_outlined..89 .omp_outlined..89(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
443 ../examples/lulesh/lulesh.cc .omp_outlined..93 .omp_outlined..93(const , const , const ParallelReduce<(lambda at ../examples...
243 ../examples/lulesh/lulesh.cc .omp_outlined..96 .omp_outlined..96(const , const , const ParallelFor<(lambda at ../examples/lu...
243 ../examples/lulesh/lulesh.cc .omp_outlined..99 .omp_outlined..99(const , const , const ParallelFor<(lambda at ../examples/lu...
13367 ../examples/lulesh/lulesh.cc ApplyMaterialPropertiesForElems ApplyMaterialPropertiesForElems(domain) [lulesh.cc:409]
1530 ../examples/lulesh/lulesh.cc CalcElemCharacteristicLength Real_t CalcElemCharacteristicLength(const Real_t *, const Real_t *, const Rea...
982 ../examples/lulesh/lulesh.cc CalcElemFBHourglassForce CalcElemFBHourglassForce(const Real_t *, const Real_t[] *, coefficient, Real_...
2428 ../examples/lulesh/lulesh.cc CalcElemNodeNormals CalcElemNodeNormals(Real_t *, Real_t *, Real_t *, const Real_t *, const Real_...
853 ../examples/lulesh/lulesh.cc CalcElemShapeFunctionDerivatives CalcElemShapeFunctionDerivatives(const Real_t *, const Real_t *, const Real_t...
1097 ../examples/lulesh/lulesh.cc CalcElemVolumeDerivative CalcElemVolumeDerivative(i, dvdx, dvdy, dvdz, const Real_t *, const Real_t *,...
1054 ../examples/lulesh/lulesh.cc CalcKinematicsForElems CalcKinematicsForElems(domain, Real_t, Index_t) [lulesh.cc]
14160 ../examples/lulesh/lulesh.cc CalcVolumeForceForElems CalcVolumeForceForElems(domain) [lulesh.cc:409]
366 ../examples/lulesh/lulesh.cc Domain::AllocateGradients Domain::AllocateGradients(Domain *, Int_t, Int_t) [lulesh.cc:214]
475 ../examples/lulesh/lulesh.cc Domain::DeallocateGradients Domain::DeallocateGradients(Domain *) [lulesh.cc:105]
250 ../examples/lulesh/lulesh.cc Domain::DeallocateStrains Domain::DeallocateStrains(Domain *) [lulesh.cc:105]
4356 ../examples/lulesh/lulesh.cc Domain::Domain Domain::Domain(Domain *) [lulesh.cc:78]
15 ../examples/lulesh/lulesh.cc Domain::delv_eta Domain::delv_eta(const Domain *, const Index_t) [lulesh.cc:371]
15 ../examples/lulesh/lulesh.cc Domain::delv_xi Domain::delv_xi(const Domain *, const Index_t) [lulesh.cc:368]
15 ../examples/lulesh/lulesh.cc Domain::delv_zeta Domain::delv_zeta(const Domain *, const Index_t) [lulesh.cc:374]
15 ../examples/lulesh/lulesh.cc Domain::fx Domain::fx(const Domain *, const Index_t) [lulesh.cc:303]
15 ../examples/lulesh/lulesh.cc Domain::fy Domain::fy(const Domain *, const Index_t) [lulesh.cc:306]
15 ../examples/lulesh/lulesh.cc Domain::fz Domain::fz(const Domain *, const Index_t) [lulesh.cc:309]
15 ../examples/lulesh/lulesh.cc Domain::nodalMass Domain::nodalMass(const Domain *, const Index_t) [lulesh.cc:314]
15 ../examples/lulesh/lulesh.cc Domain::x Domain::x(const Domain *, const Index_t) [lulesh.cc:257]
15 ../examples/lulesh/lulesh.cc Domain::xd Domain::xd(const Domain *, const Index_t) [lulesh.cc:272]
15 ../examples/lulesh/lulesh.cc Domain::y Domain::y(const Domain *, const Index_t) [lulesh.cc:258]
15 ../examples/lulesh/lulesh.cc Domain::yd Domain::yd(const Domain *, const Index_t) [lulesh.cc:275]
15 ../examples/lulesh/lulesh.cc Domain::z Domain::z(const Domain *, const Index_t) [lulesh.cc:259]
15 ../examples/lulesh/lulesh.cc Domain::zd Domain::zd(const Domain *, const Index_t) [lulesh.cc:278]
330 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<doubl... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<doubl...
330 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<doubl... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<doubl...
1508 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelFor<CalcEnergyForElems(double*, double*, double*, doubl... type Kokkos::Impl::ParallelFor<CalcEnergyForElems(double*, double*, double*, ...
3606 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelFor<CalcFBHourglassForceForElems(Domain&, double*, Kokk... type Kokkos::Impl::ParallelFor<CalcFBHourglassForceForElems(Domain&, double*,...
2917 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelFor<CalcKinematicsForElems(Domain&, double, int)::$_0, ... type Kokkos::Impl::ParallelFor<CalcKinematicsForElems(Domain&, double, int)::...
3119 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelFor<CalcMonotonicQGradientsForElems(Domain&)::{lambda(i... type Kokkos::Impl::ParallelFor<CalcMonotonicQGradientsForElems(Domain&)::{lam...
1969 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelFor<CalcMonotonicQRegionForElems(Domain&, int, double):... type Kokkos::Impl::ParallelFor<CalcMonotonicQRegionForElems(Domain&, int, dou...
1265 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelFor<IntegrateStressForElems(Domain&, double*, double*, ... type Kokkos::Impl::ParallelFor<IntegrateStressForElems(Domain&, double*, doub...
49 ../examples/lulesh/lulesh.cc Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewVal... Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewVal...
1497 ../examples/lulesh/lulesh.cc Kokkos::Impl::TeamPolicyInternal<Kokkos::OpenMP>::TeamPolicyInternal Kokkos::Impl::TeamPolicyInternal<Kokkos::OpenMP>::TeamPolicyInternal(TeamPoli...
603 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewCopy<Kokkos::View<double*, Kokkos::LayoutLeft, Kokkos::Devi... Kokkos::Impl::ViewCopy<Kokkos::View<double*, Kokkos::LayoutLeft, Kokkos::Devi...
604 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewCopy<Kokkos::View<double*, Kokkos::LayoutLeft, Kokkos::Devi... Kokkos::Impl::ViewCopy<Kokkos::View<double*, Kokkos::LayoutLeft, Kokkos::Devi...
281 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
281 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
521 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<double*>, void>::allocate_shared... SharedAllocationRecord<void, void> * Kokkos::Impl::ViewMapping<Kokkos::ViewTr...
331 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewRemap<Kokkos::View<double*>, Kokkos::View<double*>, Kokkos:... Kokkos::Impl::ViewRemap<Kokkos::View<double*>, Kokkos::View<double*>, Kokkos:...
461 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpa... enable_if_t<std::is_trivial<double>::value && std::is_trivially_copy_assignab...
1609 ../examples/lulesh/lulesh.cc Kokkos::Impl::runtime_check_rank_host Kokkos::Impl::runtime_check_rank_host(const size_t, const bool, const size_t,...
697 ../examples/lulesh/lulesh.cc Kokkos::Impl::view_copy<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::De... Kokkos::Impl::view_copy<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::De...
697 ../examples/lulesh/lulesh.cc Kokkos::Impl::view_copy<Kokkos::View<double*>, Kokkos::View<double*> > Kokkos::Impl::view_copy<Kokkos::View<double*>, Kokkos::View<double*> >(dst, s...
2250 ../examples/lulesh/lulesh.cc Kokkos::RangePolicy<Kokkos::OpenMP>::RangePolicy Kokkos::RangePolicy<Kokkos::OpenMP>::RangePolicy(RangePolicy<Kokkos::OpenMP> ...
213 ../examples/lulesh/lulesh.cc Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor... Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor...
410 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::View<char [6]> Kokkos::View<double*>::View<char [6]>(View<double *> *, arg_label, type, cons...
410 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::View<char [7]> Kokkos::View<double*>::View<char [7]>(View<double *> *, arg_label, type, cons...
462 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::View<std::__cxx11::basic_string<char, std::char_traits... Kokkos::View<double*>::View<std::__cxx11::basic_string<char, std::char_traits...
323 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::View<std::__cxx11::basic_string<char, std::char_traits... Kokkos::View<double*>::View<std::__cxx11::basic_string<char, std::char_traits...
25 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::~View Kokkos::View<double*>::~View(View<double *> *) [lulesh.cc:409]
840 ../examples/lulesh/lulesh.cc Kokkos::abort Kokkos::abort(const const char *, const const char *) [lulesh.cc:202]
854 ../examples/lulesh/lulesh.cc Kokkos::impl_resize<, double*> type Kokkos::impl_resize<, double*>(v, const size_t, const size_t, const size...
928 ../examples/lulesh/lulesh.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
960 ../examples/lulesh/lulesh.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
21470 ../examples/lulesh/lulesh.cc LagrangeLeapFrog LagrangeLeapFrog(domain) [lulesh.cc]
226 ../examples/lulesh/lulesh.cc ResizeBuffer ResizeBuffer(const size_t) [lulesh.cc:23]
169 ../examples/lulesh/lulesh.cc _GLOBAL__sub_I_lulesh.cc _GLOBAL__sub_I_lulesh.cc() [lulesh.cc]
1836 ../examples/lulesh/lulesh.cc main int main(int, char * *) [lulesh.cc]
63 ../examples/lulesh/lulesh.cc std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::a... std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::a...
20 ../examples/lulesh/lulesh.cc std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::alloca... std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::alloca...
160 ../examples/lulesh/lulesh.cc std::operator+<char, std::char_traits<char>, std::allocator<char> > basic_string<char, std::char_traits<char>, std::allocator<char> > std::operat...
187 ../examples/lulesh/lulesh.cc std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::alloc... std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::alloc...
11 lulesh __clang_call_terminate __clang_call_terminate() [lulesh]
33 lulesh __do_global_dtors_aux __do_global_dtors_aux() [lulesh]
5 lulesh __libc_csu_fini __libc_csu_fini() [lulesh]
101 lulesh __libc_csu_init __libc_csu_init() [lulesh]
5 lulesh _dl_relocate_static_pie _dl_relocate_static_pie() [lulesh]
13 lulesh _fini _fini() [lulesh]
27 lulesh _init _init() [lulesh]
47 lulesh _start _start() [lulesh]
6 lulesh frame_dummy frame_dummy() [lulesh]
```
#### Example Instrumented Module and Function Info Output
> ***`omnitrace-instrument -o lulesh.inst --label file line args --simulate -- lulesh`***
After the heuristics are applied in [Example Available Module and Function Info Output](#example-available-module-and-function-info-output),
the selected module/functions are:
```console
AddressRange Module Function FunctionSignature
9165 ../examples/lulesh/lulesh-comm.cc CommMonoQ CommMonoQ(domain) [lulesh-comm.cc:1891]
3396 ../examples/lulesh/lulesh-comm.cc CommRecv CommRecv(domain, int, Index_t, Index_t, Index_t, Index_t, bool, bool) [lulesh...
8666 ../examples/lulesh/lulesh-comm.cc CommSBN CommSBN(domain, int, Domain_member *) [lulesh-comm.cc:926]
10212 ../examples/lulesh/lulesh-comm.cc CommSend CommSend(domain, int, Index_t, Domain_member *, Index_t, Index_t, Index_t, bo...
6823 ../examples/lulesh/lulesh-comm.cc CommSyncPosVel CommSyncPosVel(domain) [lulesh-comm.cc:1404]
1840 ../examples/lulesh/lulesh-init.cc Domain::AllocateElemPersistent Domain::AllocateElemPersistent(Domain *, Int_t) [lulesh-init.cc:94]
1384 ../examples/lulesh/lulesh-init.cc Domain::AllocateNodePersistent Domain::AllocateNodePersistent(Domain *, Int_t) [lulesh-init.cc:94]
1264 ../examples/lulesh/lulesh-init.cc Domain::BuildMesh Domain::BuildMesh(Domain *, Int_t, Int_t, Int_t) [lulesh-init.cc:308]
2312 ../examples/lulesh/lulesh-init.cc Domain::CreateRegionIndexSets Domain::CreateRegionIndexSets(Domain *, Int_t, Int_t) [lulesh-init.cc:409]
7109 ../examples/lulesh/lulesh-init.cc Domain::Domain Domain::Domain(Domain *, Int_t, Index_t, Index_t, Index_t, Index_t, int, int,...
2458 ../examples/lulesh/lulesh-init.cc Domain::SetupBoundaryConditions Domain::SetupBoundaryConditions(Domain *, Int_t) [lulesh-init.cc:409]
956 ../examples/lulesh/lulesh-init.cc Domain::SetupCommBuffers Domain::SetupCommBuffers(Domain *, Int_t) [lulesh-init.cc]
1456 ../examples/lulesh/lulesh-init.cc Domain::SetupElementConnectivities Domain::SetupElementConnectivities(Domain *, Int_t) [lulesh-init.cc:409]
721 ../examples/lulesh/lulesh-init.cc Domain::SetupSymmetryPlanes Domain::SetupSymmetryPlanes(Domain *, Int_t) [lulesh-init.cc:409]
1591 ../examples/lulesh/lulesh-init.cc Domain::SetupThreadSupportStructures Domain::SetupThreadSupportStructures(Domain *) [lulesh-init.cc:376]
1644 ../examples/lulesh/lulesh-init.cc Domain::~Domain Domain::~Domain(Domain *) [lulesh-init.cc:286]
271 ../examples/lulesh/lulesh-init.cc Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor... Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor...
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*, Kokkos::HostSpace>::View<char [10]> Kokkos::View<int*, Kokkos::HostSpace>::View<char [10]>(View<int *, Kokkos::Ho...
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*, Kokkos::HostSpace>::View<char [14]> Kokkos::View<int*, Kokkos::HostSpace>::View<char [14]>(View<int *, Kokkos::Ho...
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<char [16]> Kokkos::View<int*>::View<char [16]>(View<int *> *, arg_label, type, const siz...
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<char [19]> Kokkos::View<int*>::View<char [19]>(View<int *> *, arg_label, type, const siz...
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<char [21]> Kokkos::View<int*>::View<char [21]>(View<int *> *, arg_label, type, const siz...
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<double*, , double*, Kokkos::LayoutRight, Kokkos::Device<Kok... Kokkos::deep_copy<double*, , double*, Kokkos::LayoutRight, Kokkos::Device<Kok...
1052 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<double*> Kokkos::deep_copy<double*>(dst, value) [lulesh-init.cc]
1050 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<double, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP,... Kokkos::deep_copy<double, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP,...
7686 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenM... Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenM...
7686 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, int* [8], Kokkos::LayoutRigh... Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, int* [8], Kokkos::LayoutRigh...
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int*, , int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::O... Kokkos::deep_copy<int*, , int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::O...
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Ko... Kokkos::deep_copy<int*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Ko...
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP, K... Kokkos::deep_copy<int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP, K...
697 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (... Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (...
706 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (... Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (...
912 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
791 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
791 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
944 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
839 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
6589 ../examples/lulesh/lulesh-util.cc Kokkos::deep_copy<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP... Kokkos::deep_copy<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP...
1345 ../examples/lulesh/lulesh-util.cc ParseCommandLineOptions ParseCommandLineOptions(int, char * *, int, cmdLineOpts *) [lulesh-util.cc:67]
706 ../examples/lulesh/lulesh-util.cc VerifyAndWriteFinalOutput VerifyAndWriteFinalOutput(Real_t, locDom, Int_t, Int_t) [lulesh-util.cc:222]
13367 ../examples/lulesh/lulesh.cc ApplyMaterialPropertiesForElems ApplyMaterialPropertiesForElems(domain) [lulesh.cc:409]
982 ../examples/lulesh/lulesh.cc CalcElemFBHourglassForce CalcElemFBHourglassForce(const Real_t *, const Real_t[] *, coefficient, Real_...
2428 ../examples/lulesh/lulesh.cc CalcElemNodeNormals CalcElemNodeNormals(Real_t *, Real_t *, Real_t *, const Real_t *, const Real_...
853 ../examples/lulesh/lulesh.cc CalcElemShapeFunctionDerivatives CalcElemShapeFunctionDerivatives(const Real_t *, const Real_t *, const Real_t...
1054 ../examples/lulesh/lulesh.cc CalcKinematicsForElems CalcKinematicsForElems(domain, Real_t, Index_t) [lulesh.cc]
14160 ../examples/lulesh/lulesh.cc CalcVolumeForceForElems CalcVolumeForceForElems(domain) [lulesh.cc:409]
366 ../examples/lulesh/lulesh.cc Domain::AllocateGradients Domain::AllocateGradients(Domain *, Int_t, Int_t) [lulesh.cc:214]
475 ../examples/lulesh/lulesh.cc Domain::DeallocateGradients Domain::DeallocateGradients(Domain *) [lulesh.cc:105]
4356 ../examples/lulesh/lulesh.cc Domain::Domain Domain::Domain(Domain *) [lulesh.cc:78]
410 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::View<char [6]> Kokkos::View<double*>::View<char [6]>(View<double *> *, arg_label, type, cons...
410 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::View<char [7]> Kokkos::View<double*>::View<char [7]>(View<double *> *, arg_label, type, cons...
928 ../examples/lulesh/lulesh.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
960 ../examples/lulesh/lulesh.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
21470 ../examples/lulesh/lulesh.cc LagrangeLeapFrog LagrangeLeapFrog(domain) [lulesh.cc]
1836 ../examples/lulesh/lulesh.cc main int main(int, char * *) [lulesh.cc]
```
## Sampling
> ***NOTE: This capability has been deprecated in favor of [omnitrace-sample](sampling.md)***
By default, omnitrace-instrument uses `--mode trace` for instrumentation. The `--mode sampling` option
will only instrument `main` in an executable and will activate both CPU call-stack sampling and
background system-level thread sampling by default.
Tracing capabilities which do not rely on instrumentation, such as the HIP API and kernel tracing
(which is collected via roctracer), will still be available.
[OmniTrace](https://github.com/ROCm/omnitrace)'s sampling capabilities are always available, even in trace mode, but are deactivated by default.
To activate sampling in trace mode, set `OMNITRACE_USE_SAMPLING=ON` in the environment
or in an omnitrace configuration file.
## Embedding a Default Configuration
Using the `--env` option, a default configuration can be embedded into the target. Although this option
works for runtime instrumentation, it is most useful when generating new binaries since the generated
binary may be used later in a different login session where the environment may have changed.
For example, if the following sequence of commands is run:
```shell
omnitrace-instrument -o ./foo.inst -- ./foo
export OMNITRACE_USE_SAMPLING=ON
export OMNITRACE_SAMPLING_FREQ=5
omnitrace-run -- ./foo.inst
```
These configuration settings will not be preserved in another session, whereas:
```shell
omnitrace-instrument -o ./foo.samp --env OMNITRACE_USE_SAMPLING=ON OMNITRACE_SAMPLING_FREQ=5 -- ./foo
```
will preserve those environment variables:
```shell
# will sample 5x per second
omnitrace-run -- ./foo.samp
```
while still allowing the subsequent session to override those defaults:
```shell
# will sample 100x per second
export OMNITRACE_SAMPLING_FREQ=100
omnitrace-run -- ./foo.samp
```
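The override behaviour shown above follows the familiar "use the environment value if it is set, otherwise fall back to a default" pattern. The following is a minimal shell sketch of that pattern only. It is not how `omnitrace-instrument` implements `--env`, and the helper name `resolve_setting` is hypothetical:

```shell
# Hypothetical helper: print the value of the exported environment variable
# named by $1 if it is set, otherwise print the fallback given in $2
# (standing in for a default embedded with --env).
resolve_setting() {
    name="$1"
    embedded_default="$2"
    # printenv exits non-zero when the variable is not in the environment
    if value="$(printenv "$name")"; then
        printf '%s\n' "$value"
    else
        printf '%s\n' "$embedded_default"
    fi
}

unset OMNITRACE_SAMPLING_FREQ
resolve_setting OMNITRACE_SAMPLING_FREQ 5    # prints 5 (embedded default)

export OMNITRACE_SAMPLING_FREQ=100
resolve_setting OMNITRACE_SAMPLING_FREQ 5    # prints 100 (session override)
```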
### Troubleshooting
#### Checking for RPATH
If `ldd ./foo.inst` from the [Binary Rewriting a Library Example](#binary-rewriting-a-library-example) section still returns `/usr/local/lib/libfoo.so.2`, your executable may have an rpath encoded in the binary.
This ELF entry causes the dynamic linker to ignore `LD_LIBRARY_PATH` if it finds a `libfoo.so.2` in the rpath.
You can use the `objdump` tool to check for such an entry:
```shell
objdump -p <exe-or-library> | grep -E 'RPATH|RUNPATH'
```
If this produces output, e.g.:
```shell
RUNPATH $ORIGIN:$ORIGIN/../lib
```
You will have to remove or modify the rpath in order to get `foo.inst` to resolve to the instrumented `libfoo.so.2`.
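When checking several binaries, the query above can be wrapped in a small filter. This is only a sketch: the function name `find_rpath_entries` is hypothetical, and the sample input is hand-written to resemble `objdump -p` output rather than captured from a real binary:

```shell
# Hypothetical filter: read `objdump -p` output on stdin and print any
# RPATH/RUNPATH entries. Exit status follows grep: 0 if found, 1 otherwise.
find_rpath_entries() {
    grep -E '^[[:space:]]*(RPATH|RUNPATH)[[:space:]]'
}

# Hand-written sample resembling the dynamic section of an ELF file
printf '  NEEDED               libfoo.so.2\n  RUNPATH              $ORIGIN:$ORIGIN/../lib\n' |
    find_rpath_entries && echo "rpath/runpath entry found"
```

With a real file, the equivalent call would be `objdump -p ./foo.inst | find_rpath_entries`.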
#### Modifying RPATH
> ***Requires `patchelf` package***
```shell
# Remove the rpath entirely
patchelf --remove-rpath <exe-or-library>
# Or replace it with a directory of your choice
patchelf --set-rpath '/home/user' <exe-or-library>
```
-35
@@ -1,35 +0,0 @@
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build
if "%1" == "" goto help
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.http://sphinx-doc.org/
exit /b 1
)
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd
-888
@@ -1,888 +0,0 @@
# OmniTrace Output
```eval_rst
.. toctree::
   :glob:
   :maxdepth: 3
```
## Overview
The general output form of omnitrace is `<OUTPUT_PATH>[/<TIMESTAMP>]/[<PREFIX>]<DATA_NAME>[-<OUTPUT_SUFFIX>].<EXT>`.
E.g. with the base configuration:
```shell
export OMNITRACE_OUTPUT_PATH=omnitrace-example-output
export OMNITRACE_TIME_OUTPUT=OFF
export OMNITRACE_USE_PID=OFF
export OMNITRACE_PROFILE=ON
export OMNITRACE_TRACE=ON
```
```shell
$ omnitrace-instrument -- ./foo
...
[omnitrace] Outputting 'omnitrace-example-output/perfetto-trace.proto'...
[omnitrace] Outputting 'omnitrace-example-output/wall-clock.txt'...
[omnitrace] Outputting 'omnitrace-example-output/wall-clock.json'...
```
If we enable the `OMNITRACE_USE_PID` option, then running our non-MPI executable with a PID of 63453 produces:
```shell
$ export OMNITRACE_USE_PID=ON
$ omnitrace-instrument -- ./foo
...
[omnitrace] Outputting 'omnitrace-example-output/perfetto-trace-63453.proto'...
[omnitrace] Outputting 'omnitrace-example-output/wall-clock-63453.txt'...
[omnitrace] Outputting 'omnitrace-example-output/wall-clock-63453.json'...
```
If we enable `OMNITRACE_TIME_OUTPUT`, then a job started on January 31, 2022 at 12:30 PM produces:
```shell
$ export OMNITRACE_TIME_OUTPUT=ON
$ omnitrace-instrument -- ./foo
...
[omnitrace] Outputting 'omnitrace-example-output/2022-01-31_12.30_PM/perfetto-trace-63453.proto'...
[omnitrace] Outputting 'omnitrace-example-output/2022-01-31_12.30_PM/wall-clock-63453.txt'...
[omnitrace] Outputting 'omnitrace-example-output/2022-01-31_12.30_PM/wall-clock-63453.json'...
```
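The three examples above all follow the general form `<OUTPUT_PATH>[/<TIMESTAMP>]/[<PREFIX>]<DATA_NAME>[-<OUTPUT_SUFFIX>].<EXT>`. As an illustration of how the optional components combine, here is a small shell sketch. The helper name `compose_output_name` and its argument order are hypothetical and are not part of omnitrace:

```shell
# Hypothetical helper composing
#   <OUTPUT_PATH>[/<TIMESTAMP>]/[<PREFIX>]<DATA_NAME>[-<OUTPUT_SUFFIX>].<EXT>
# Arguments: output_path timestamp prefix data_name suffix ext
# Pass an empty string for any optional component (timestamp, prefix, suffix).
compose_output_name() {
    output_path="$1"; timestamp="$2"; prefix="$3"
    data_name="$4"; suffix="$5"; ext="$6"
    printf '%s%s/%s%s%s.%s\n' \
        "$output_path" \
        "${timestamp:+/$timestamp}" \
        "$prefix" \
        "$data_name" \
        "${suffix:+-$suffix}" \
        "$ext"
}

# Base configuration: no timestamp folder, no PID suffix
compose_output_name omnitrace-example-output "" "" wall-clock "" txt
# -> omnitrace-example-output/wall-clock.txt

# With the PID appended as the output suffix
compose_output_name omnitrace-example-output "" "" wall-clock 63453 json
# -> omnitrace-example-output/wall-clock-63453.json

# With a timestamped folder and the PID suffix
compose_output_name omnitrace-example-output 2022-01-31_12.30_PM "" perfetto-trace 63453 proto
# -> omnitrace-example-output/2022-01-31_12.30_PM/perfetto-trace-63453.proto
```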
## Metadata
[OmniTrace](https://github.com/ROCm/omnitrace) will output a metadata.json file. This metadata file will contain
information about the settings, environment variables, output files, and info about the system and the run:
- Hardware cache sizes
- Physical CPUs
- Hardware concurrency
- CPU model, frequency, vendor, and features
- Launch date and time
- Memory maps (e.g. shared libraries)
- Output files
- Environment Variables
- Configuration Settings
### Metadata JSON Sample
```json
{
"omnitrace": {
"metadata": {
"info": {
"HW_L1_CACHE_SIZE": 32768,
"HW_L2_CACHE_SIZE": 524288,
"HW_L3_CACHE_SIZE": 16777216,
"HW_PHYSICAL_CPU": 12,
"HW_CONCURRENCY": 24,
"LAUNCH_TIME": "02:04",
"LAUNCH_DATE": "05/08/22",
"TIMEMORY_GIT_REVISION": "52e7034fd419ff296506cdef43084f6071dbaba1",
"TIMEMORY_VERSION": "3.3.0rc4",
"TIMEMORY_API": "tim::project::timemory",
"TIMEMORY_GIT_DESCRIBE": "v3.2.0-263-g52e7034f",
"PWD": "/home/jrmadsen/devel/c++/AARInternal/hosttrace-dyninst/build-vscode",
"USER": "jrmadsen",
"HOME": "/home/jrmadsen",
"SHELL": "/bin/bash",
"CPU_MODEL": "AMD Ryzen Threadripper PRO 3945WX 12-Cores",
"CPU_FREQUENCY": 2400,
"CPU_VENDOR": "AuthenticAMD",
"CPU_FEATURES": [
"fpu",
"msr",
"sse",
"sse2",
"constant_tsc",
"ssse3",
"fma",
"sse4_1",
"sse4_2",
"popcnt",
"avx2",
"... etc. ..."
],
"memory_maps": [
{
"end_address": "7f4013797000",
"start_address": "7f4012e58000",
"pathname": "/opt/rocm-5.0.0/hip/lib/libamdhip64.so.5.0.50000",
"offset": "34a000",
"device": "103:05",
"inode": 4331165,
"permissions": "rw-p"
},
{
"end_address": "7f4013902000",
"start_address": "7f4013901000",
"pathname": "/usr/lib/x86_64-linux-gnu/libm-2.31.so",
"offset": "14d000",
"device": "103:05",
"inode": 42078854,
"permissions": "rwxp"
},
{
"end_address": "7f4013919000",
"start_address": "7f4013908000",
"pathname": "/usr/lib/x86_64-linux-gnu/libpthread-2.31.so",
"offset": "6000",
"device": "103:05",
"inode": 42078874,
"permissions": "r-xp"
},
{
"...": "etc."
}
],
"memory_maps_files": [
"/opt/rocm-5.0.0/hip/lib/libamdhip64.so.5.0.50000",
"/opt/rocm-5.0.0/hsa-amd-aqlprofile/lib/libhsa-amd-aqlprofile64.so.1.0.50000",
"/opt/rocm-5.0.0/lib/libamd_comgr.so.2.4.50000",
"/opt/rocm-5.0.0/lib/libhsa-runtime64.so.1.5.50000",
"/opt/rocm-5.0.0/rocm_smi/lib/librocm_smi64.so.5.0.50000",
"/opt/rocm-5.0.0/roctracer/lib/libroctracer64.so.1.0.50000",
"/usr/lib/x86_64-linux-gnu/ld-2.31.so",
"/usr/lib/x86_64-linux-gnu/libc-2.31.so",
"/usr/lib/x86_64-linux-gnu/libdl-2.31.so",
"... etc. ..."
]
},
"output": {
"text": [
{
"value": [
"omnitrace-tests-output/parallel-overhead-binary-rewrite/roctracer.txt"
],
"key": "roctracer"
},
{
"value": [
"omnitrace-tests-output/parallel-overhead-binary-rewrite/wall_clock.txt"
],
"key": "wall_clock"
}
],
"json": [
{
"value": [
"omnitrace-tests-output/parallel-overhead-binary-rewrite/roctracer.json",
"omnitrace-tests-output/parallel-overhead-binary-rewrite/roctracer.tree.json"
],
"key": "roctracer"
},
{
"value": [
"omnitrace-tests-output/parallel-overhead-binary-rewrite/wall_clock.json",
"omnitrace-tests-output/parallel-overhead-binary-rewrite/wall_clock.tree.json"
],
"key": "wall_clock"
}
]
},
"environment": [
{
"value": "/home/jrmadsen",
"key": "HOME"
},
{
"value": "/bin/bash",
"key": "SHELL"
},
{
"value": "jrmadsen",
"key": "USER"
},
{
"value": "true",
"key": "... etc. ..."
}
],
"settings": {
"OMNITRACE_JSON_OUTPUT": {
"count": -1,
"environ_updated": false,
"name": "json_output",
"data_type": "bool",
"initial": true,
"enabled": true,
"value": true,
"max_count": 1,
"cmdline": [
"--omnitrace-json-output"
],
"environ": "OMNITRACE_JSON_OUTPUT",
"config_updated": false,
"categories": [
"io",
"json",
"native"
],
"description": "Write json output files"
},
"... etc. ...": {
"etc.": true
}
}
}
}
}
```
## Configuring Output
### Core Configuration Settings
> ***See also: [Customizing OmniTrace Runtime](runtime.md)***
| Setting | Value | Description |
|---------------------------|--------------------|---------------------------------------------------------------------------------------------------|
| `OMNITRACE_OUTPUT_PATH` | Any valid path | Path to folder where output files should be placed |
| `OMNITRACE_OUTPUT_PREFIX` | String | Useful for multiple runs with different arguments. See [Output Prefix Keys](#output-prefix-keys) |
| `OMNITRACE_OUTPUT_FILE` | Any valid filepath | Specific location for perfetto output file. |
| `OMNITRACE_TIME_OUTPUT` | Boolean | Place all output in a timestamped folder, timestamp format controlled via `OMNITRACE_TIME_FORMAT` |
| `OMNITRACE_TIME_FORMAT` | String | See `strftime` man pages for valid identifiers |
| `OMNITRACE_USE_PID` | Boolean | Append either the PID or the MPI rank to all output files (before the extension) |
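Because `OMNITRACE_TIME_FORMAT` takes `strftime` identifiers, a candidate format can be previewed with the `date` command before it is used. The sketch below assumes GNU `date` (for the `-d` option) and uses a format string inferred from the timestamped folder name shown in the Overview examples (`2022-01-31_12.30_PM`). That format is an assumption and is not confirmed here as the default:

```shell
# Preview a strftime-style format against a fixed date.
# Assumptions: GNU date (-d), and a format inferred from the example
# folder name rather than taken from the omnitrace defaults.
export OMNITRACE_TIME_FORMAT="%F_%I.%M_%p"

# LC_ALL=C keeps the AM/PM marker independent of the user's locale
LC_ALL=C date -d '2022-01-31 12:30' +"$OMNITRACE_TIME_FORMAT"
```

On a GNU system this should print `2022-01-31_12.30_PM`, matching the folder name in the earlier example.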
#### Output Prefix Keys
Output prefix keys have many uses but are most useful when dealing with multiple profiling runs or large MPI jobs.
Their inclusion in omnitrace stems from their introduction into timemory for [compile-time-perf](https://github.com/jrmadsen/compile-time-perf)
which needed to be able to create different output files for a generic wrapper around compilation commands while still
overwriting the output from the last time a file was compiled.
If you are doing scaling studies and specifying options via the command line, it is highly recommended to
use a common `OMNITRACE_OUTPUT_PATH`, disable `OMNITRACE_TIME_OUTPUT`,
set `OMNITRACE_OUTPUT_PREFIX="%argt%-"`, and let omnitrace cleanly organize the output.
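The recommendation above amounts to the following environment setup. This is a sketch: the output directory name is a hypothetical placeholder, and the commented-out run commands use a placeholder binary:

```shell
# Settings recommended above for scaling studies.
# The output directory name is a hypothetical placeholder.
export OMNITRACE_OUTPUT_PATH=omnitrace-scaling-output
export OMNITRACE_TIME_OUTPUT=OFF
export OMNITRACE_OUTPUT_PREFIX="%argt%-"

# Each run then writes into the same folder under a prefix derived from its
# command line (commented out because ./foo.inst is a placeholder):
# omnitrace-run -- ./foo.inst 10
# omnitrace-run -- ./foo.inst 20
```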
| String | Encoding |
|-----------------|--------------------------------------------------------------------------------------------------------------------|
| `%argv%` | Entire command-line condensed into a single string |
| `%argt%` | Similar to `%argv%` except that only the basename of the first command line argument is used |
| `%args%` | All command line arguments condensed into a single string |
| `%tag%` | Basename of first command line argument |
| `%arg<N>%` | Command line argument at position `<N>` (zero indexed), e.g. `%arg0%` for first argument. |
| `%argv_hash%` | MD5 sum of `%argv%` |
| `%argt_hash%` | MD5 sum of `%argt%` |
| `%args_hash%` | MD5 sum of `%args%` |
| `%tag_hash%` | MD5 sum of `%tag%` |
| `%arg<N>_hash%` | MD5 sum of `%arg<N>%` |
| `%pid%` | Process identifier (i.e. `getpid()`) |
| `%ppid%` | Parent process identifier (i.e. `getppid()`) |
| `%pgid%` | Process group identifier (i.e. `getpgid(getpid())`) |
| `%psid%` | Process session identifier (i.e. `getsid(getpid())`) |
| `%psize%` | Number of sibling processes (from reading `/proc/<PPID>/task/<PPID>/children`) |
| `%job%` | Value of `SLURM_JOB_ID` environment variable if it exists, else `0` |
| `%rank%` | Value of `SLURM_PROCID` environment variable if it exists, else `MPI_Comm_rank` (or `0` if non-MPI) |
| `%size%` | `MPI_Comm_size` or `1` if non-MPI |
| `%nid%` | `%rank%` if possible, otherwise `%pid%` |
| `%launch_time%` | Launch date and time (uses `OMNITRACE_TIME_FORMAT`) |
| `%env{NAME}%` | Value of environment variable `NAME` (i.e. `getenv(NAME)`) |
| `%cfg{NAME}%` | Value of configuration variable `NAME` (e.g. `%cfg{OMNITRACE_SAMPLING_FREQ}%` would resolve to sampling frequency) |
| `$env{NAME}` | Alternative syntax to `%env{NAME}%` |
| `$cfg{NAME}` | Alternative syntax to `%cfg{NAME}%` |
| `%m` | Shorthand for `%argt_hash%` |
| `%p` | Shorthand for `%pid%` |
| `%j` | Shorthand for `%job%` |
| `%r` | Shorthand for `%rank%` |
| `%s` | Shorthand for `%size%` |
> ***Any output prefix key which contains a `/` will have the `/` characters***
> ***replaced with `_` and any leading underscores will be stripped, e.g. if `%arg0%` is `/usr/bin/foo`, this***
> ***will translate to `usr_bin_foo`. Additionally, any `%arg<N>%` keys which do not have a command line argument***
> ***at position `<N>` will be ignored.***
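The sanitization rule in the note above can be illustrated with a short Python sketch. This is not omnitrace's implementation: the function names are hypothetical, only `%arg<N>%` keys are handled, and "ignored" is assumed here to mean that a key without a matching argument expands to an empty string.

```python
import re


def sanitize(value):
    """Replace '/' with '_' and strip leading underscores, per the note above."""
    return value.replace("/", "_").lstrip("_")


def expand_arg_keys(prefix, argv):
    """Expand %arg<N>% keys only; a key without a matching argument becomes ''."""
    def replace(match):
        index = int(match.group(1))
        return sanitize(argv[index]) if index < len(argv) else ""
    return re.sub(r"%arg(\d+)%", replace, prefix)


print(sanitize("/usr/bin/foo"))                                       # usr_bin_foo
print(expand_arg_keys("%arg0%-%arg1%-%arg5%", ["/usr/bin/foo", "16"]))  # usr_bin_foo-16-
```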
## Perfetto Output
Use the `OMNITRACE_OUTPUT_FILE` setting to specify a specific location for the perfetto output file. If this is an absolute path,
`OMNITRACE_OUTPUT_PATH` and the other output settings are ignored. To view the trace, visit [ui.perfetto.dev](https://ui.perfetto.dev) and open this file.
![omnitrace-perfetto](images/omnitrace-perfetto.png)
![omnitrace-rocm](images/omnitrace-rocm.png)
![omnitrace-rocm-flow](images/omnitrace-rocm-flow.png)
![omnitrace-user-api](images/omnitrace-user-api.png)
## Timemory Output
Use `omnitrace-avail --components --filename` to view the base filename for each component. For example:
```shell
$ omnitrace-avail wall_clock -C -f
|---------------------------------|---------------|------------------------|
| COMPONENT | AVAILABLE | FILENAME |
|---------------------------------|---------------|------------------------|
| wall_clock | true | wall_clock |
| sampling_wall_clock | true | sampling_wall_clock |
|---------------------------------|---------------|------------------------|
```
When `OMNITRACE_COLLAPSE_THREADS=ON` and/or `OMNITRACE_COLLAPSE_PROCESSES=ON` (only valid with full MPI support) is set, the timemory output
combines the per-thread and/or per-rank data which have identical call-stacks.
The `OMNITRACE_FLAT_PROFILE` setting removes all call stack hierarchy. Using `OMNITRACE_FLAT_PROFILE=ON` in combination
with `OMNITRACE_COLLAPSE_THREADS=ON` is a useful configuration for identifying min/max measurements regardless of calling context.
The `OMNITRACE_TIMELINE_PROFILE` setting (with `OMNITRACE_FLAT_PROFILE=OFF`) generates data similar to what can be found
in perfetto. Enabling both timeline and flat profiling generates data similar to `strace`. However, while timemory in general
requires significantly less memory than perfetto, this is not the case in timeline mode, so activate this setting with caution.
### Timemory Text Output
> ***Hint: the generation of text output is configurable via `OMNITRACE_TEXT_OUTPUT`***
Timemory text output files are meant for human consumption (use the JSON formats for analysis)
and, as such, some fields such as the `LABEL` fields may be truncated for readability.
The truncation width can be changed via the `OMNITRACE_MAX_WIDTH` setting.
#### Timemory Text Output Example
In the example below, the `NN` field in `|NN>>>` is the thread ID. If MPI support is enabled, this will be `|MM|NN>>>`, where `MM` is the rank.
If `OMNITRACE_COLLAPSE_THREADS=ON` and `OMNITRACE_COLLAPSE_PROCESSES=ON`, neither the `MM` nor the `NN` will be present unless the
component explicitly sets type-traits which specify that the data is only relevant per-thread or per-process, e.g. the `thread_cpu_clock` component.
```console
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| REAL-CLOCK TIMER (I.E. WALL-CLOCK TIMER) |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| LABEL | COUNT | DEPTH | METRIC | UNITS | SUM | MEAN | MIN | MAX | VAR | STDDEV | % SELF |
|--------------------------------------------------------------|--------|--------|------------|--------|-----------|-----------|-----------|-----------|----------|----------|--------|
| |00>>> main | 1 | 0 | wall_clock | sec | 13.360265 | 13.360265 | 13.360265 | 13.360265 | 0.000000 | 0.000000 | 18.2 |
| |00>>> |_ompt_thread_initial | 1 | 1 | wall_clock | sec | 10.924161 | 10.924161 | 10.924161 | 10.924161 | 0.000000 | 0.000000 | 0.0 |
| |00>>> |_ompt_implicit_task | 1 | 2 | wall_clock | sec | 10.923050 | 10.923050 | 10.923050 | 10.923050 | 0.000000 | 0.000000 | 0.1 |
| |00>>> |_ompt_parallel [parallelism=12] | 1 | 3 | wall_clock | sec | 10.915026 | 10.915026 | 10.915026 | 10.915026 | 0.000000 | 0.000000 | 0.0 |
| |00>>> |_ompt_implicit_task | 1 | 4 | wall_clock | sec | 10.647951 | 10.647951 | 10.647951 | 10.647951 | 0.000000 | 0.000000 | 0.0 |
| |00>>> |_ompt_work_loop | 156 | 5 | wall_clock | sec | 0.000812 | 0.000005 | 0.000001 | 0.000212 | 0.000000 | 0.000018 | 100.0 |
| |00>>> |_ompt_work_single_executor | 40 | 5 | wall_clock | sec | 0.000016 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |00>>> |_ompt_sync_region_barrier_implicit | 308 | 5 | wall_clock | sec | 0.000629 | 0.000002 | 0.000001 | 0.000017 | 0.000000 | 0.000002 | 100.0 |
| |00>>> |_conj_grad | 76 | 5 | wall_clock | sec | 10.641165 | 0.140015 | 0.131894 | 0.155099 | 0.000017 | 0.004080 | 1.0 |
| |00>>> |_ompt_work_single_executor | 803 | 6 | wall_clock | sec | 0.000292 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |00>>> |_ompt_work_loop | 7904 | 6 | wall_clock | sec | 7.420265 | 0.000939 | 0.000005 | 0.006974 | 0.000003 | 0.001613 | 100.0 |
| |00>>> |_ompt_sync_region_barrier_implicit | 6004 | 6 | wall_clock | sec | 0.283160 | 0.000047 | 0.000001 | 0.004087 | 0.000000 | 0.000303 | 100.0 |
| |00>>> |_ompt_sync_region_barrier_implementation | 3952 | 6 | wall_clock | sec | 2.829252 | 0.000716 | 0.000007 | 0.009005 | 0.000001 | 0.000985 | 99.7 |
| |00>>> |_ompt_sync_region_reduction | 15808 | 7 | wall_clock | sec | 0.009142 | 0.000001 | 0.000000 | 0.000007 | 0.000000 | 0.000000 | 100.0 |
| |00>>> |_ompt_work_single_other | 1249 | 6 | wall_clock | sec | 0.000270 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |00>>> |_ompt_work_single_other | 114 | 5 | wall_clock | sec | 0.000024 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |00>>> |_ompt_sync_region_barrier_implementation | 76 | 5 | wall_clock | sec | 0.000876 | 0.000012 | 0.000008 | 0.000025 | 0.000000 | 0.000003 | 84.4 |
| |00>>> |_ompt_sync_region_reduction | 304 | 6 | wall_clock | sec | 0.000136 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |00>>> |_ompt_master | 226 | 5 | wall_clock | sec | 0.001978 | 0.000009 | 0.000000 | 0.000038 | 0.000000 | 0.000012 | 100.0 |
| |11>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.656145 | 10.656145 | 10.656145 | 10.656145 | 0.000000 | 0.000000 | 0.1 |
| |11>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649183 | 10.649183 | 10.649183 | 10.649183 | 0.000000 | 0.000000 | 0.0 |
| |11>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000852 | 0.000005 | 0.000002 | 0.000230 | 0.000000 | 0.000019 | 100.0 |
| |11>>> |_ompt_work_single_other | 149 | 6 | wall_clock | sec | 0.000035 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
| |11>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004135 | 0.000013 | 0.000001 | 0.001233 | 0.000000 | 0.000070 | 100.0 |
| |11>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641302 | 0.140017 | 0.131896 | 0.155102 | 0.000017 | 0.004080 | 0.6 |
| |11>>> |_ompt_work_single_other | 2023 | 7 | wall_clock | sec | 0.000458 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |11>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 8.253555 | 0.001044 | 0.000005 | 0.008021 | 0.000003 | 0.001790 | 100.0 |
| |11>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.263840 | 0.000044 | 0.000001 | 0.004087 | 0.000000 | 0.000297 | 100.0 |
| |11>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.059823 | 0.000521 | 0.000007 | 0.009508 | 0.000001 | 0.000863 | 100.0 |
| |11>>> |_ompt_work_single_executor | 29 | 7 | wall_clock | sec | 0.000011 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |11>>> |_ompt_work_single_executor | 5 | 6 | wall_clock | sec | 0.000002 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
| |11>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000975 | 0.000013 | 0.000008 | 0.000024 | 0.000000 | 0.000003 | 100.0 |
| |10>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.681664 | 10.681664 | 10.681664 | 10.681664 | 0.000000 | 0.000000 | 0.3 |
| |10>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649158 | 10.649158 | 10.649158 | 10.649158 | 0.000000 | 0.000000 | 0.0 |
| |10>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000863 | 0.000006 | 0.000002 | 0.000231 | 0.000000 | 0.000019 | 100.0 |
| |10>>> |_ompt_work_single_other | 140 | 6 | wall_clock | sec | 0.000037 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |10>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004149 | 0.000013 | 0.000001 | 0.001221 | 0.000000 | 0.000070 | 100.0 |
| |10>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641288 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 0.7 |
| |10>>> |_ompt_work_single_other | 1883 | 7 | wall_clock | sec | 0.000487 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |10>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 8.174545 | 0.001034 | 0.000005 | 0.006899 | 0.000003 | 0.001766 | 100.0 |
| |10>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.268808 | 0.000045 | 0.000001 | 0.004087 | 0.000000 | 0.000299 | 100.0 |
| |10>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.126988 | 0.000538 | 0.000007 | 0.009843 | 0.000001 | 0.000872 | 99.9 |
| |10>>> |_ompt_sync_region_reduction | 3952 | 8 | wall_clock | sec | 0.002574 | 0.000001 | 0.000000 | 0.000014 | 0.000000 | 0.000000 | 100.0 |
| |10>>> |_ompt_work_single_executor | 169 | 7 | wall_clock | sec | 0.000072 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |10>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000954 | 0.000013 | 0.000009 | 0.000023 | 0.000000 | 0.000003 | 95.9 |
| |10>>> |_ompt_sync_region_reduction | 76 | 7 | wall_clock | sec | 0.000039 | 0.000001 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |10>>> |_ompt_work_single_executor | 14 | 6 | wall_clock | sec | 0.000006 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |09>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.686552 | 10.686552 | 10.686552 | 10.686552 | 0.000000 | 0.000000 | 0.3 |
| |09>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649151 | 10.649151 | 10.649151 | 10.649151 | 0.000000 | 0.000000 | 0.0 |
| |09>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000880 | 0.000006 | 0.000002 | 0.000258 | 0.000000 | 0.000021 | 100.0 |
| |09>>> |_ompt_work_single_other | 148 | 6 | wall_clock | sec | 0.000034 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |09>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004129 | 0.000013 | 0.000001 | 0.001210 | 0.000000 | 0.000069 | 100.0 |
| |09>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641308 | 0.140017 | 0.131895 | 0.155102 | 0.000017 | 0.004080 | 0.7 |
| |09>>> |_ompt_work_single_other | 2043 | 7 | wall_clock | sec | 0.000473 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |09>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 7.977001 | 0.001009 | 0.000005 | 0.007325 | 0.000003 | 0.001732 | 100.0 |
| |09>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.242996 | 0.000040 | 0.000001 | 0.004087 | 0.000000 | 0.000284 | 100.0 |
| |09>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.350895 | 0.000595 | 0.000007 | 0.008689 | 0.000001 | 0.000926 | 100.0 |
| |09>>> |_ompt_work_single_executor | 9 | 7 | wall_clock | sec | 0.000004 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |09>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000973 | 0.000013 | 0.000008 | 0.000025 | 0.000000 | 0.000003 | 100.0 |
| |09>>> |_ompt_work_single_executor | 6 | 6 | wall_clock | sec | 0.000002 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
| |08>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.721622 | 10.721622 | 10.721622 | 10.721622 | 0.000000 | 0.000000 | 0.7 |
| |08>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649135 | 10.649135 | 10.649135 | 10.649135 | 0.000000 | 0.000000 | 0.0 |
| |08>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000839 | 0.000005 | 0.000001 | 0.000231 | 0.000000 | 0.000019 | 100.0 |
| |08>>> |_ompt_work_single_other | 141 | 6 | wall_clock | sec | 0.000030 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |08>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004114 | 0.000013 | 0.000001 | 0.001198 | 0.000000 | 0.000069 | 100.0 |
| |08>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641294 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 0.6 |
| |08>>> |_ompt_work_single_other | 1742 | 7 | wall_clock | sec | 0.000392 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |08>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 8.306388 | 0.001051 | 0.000005 | 0.007886 | 0.000003 | 0.001795 | 100.0 |
| |08>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.274358 | 0.000046 | 0.000001 | 0.004090 | 0.000000 | 0.000302 | 100.0 |
| |08>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 1.991251 | 0.000504 | 0.000007 | 0.008694 | 0.000001 | 0.000844 | 99.8 |
| |08>>> |_ompt_sync_region_reduction | 7904 | 8 | wall_clock | sec | 0.003816 | 0.000000 | 0.000000 | 0.000017 | 0.000000 | 0.000000 | 100.0 |
| |08>>> |_ompt_work_single_executor | 310 | 7 | wall_clock | sec | 0.000112 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |08>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000955 | 0.000013 | 0.000009 | 0.000026 | 0.000000 | 0.000003 | 93.7 |
| |08>>> |_ompt_sync_region_reduction | 152 | 7 | wall_clock | sec | 0.000060 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |08>>> |_ompt_work_single_executor | 13 | 6 | wall_clock | sec | 0.000005 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |07>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.747282 | 10.747282 | 10.747282 | 10.747282 | 0.000000 | 0.000000 | 0.9 |
| |07>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649093 | 10.649093 | 10.649093 | 10.649093 | 0.000000 | 0.000000 | 0.0 |
| |07>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000923 | 0.000006 | 0.000002 | 0.000231 | 0.000000 | 0.000019 | 100.0 |
| |07>>> |_ompt_work_single_other | 152 | 6 | wall_clock | sec | 0.000048 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |07>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.003981 | 0.000013 | 0.000001 | 0.001186 | 0.000000 | 0.000068 | 100.0 |
| |07>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641295 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 0.7 |
| |07>>> |_ompt_work_single_other | 2043 | 7 | wall_clock | sec | 0.000648 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |07>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 7.978811 | 0.001009 | 0.000005 | 0.006728 | 0.000003 | 0.001732 | 100.0 |
| |07>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.199939 | 0.000033 | 0.000001 | 0.004086 | 0.000000 | 0.000255 | 100.0 |
| |07>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.385843 | 0.000604 | 0.000009 | 0.009039 | 0.000001 | 0.000938 | 100.0 |
| |07>>> |_ompt_work_single_executor | 9 | 7 | wall_clock | sec | 0.000004 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |07>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000905 | 0.000012 | 0.000010 | 0.000025 | 0.000000 | 0.000003 | 100.0 |
| |07>>> |_ompt_work_single_executor | 2 | 6 | wall_clock | sec | 0.000001 | 0.000001 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |06>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.772278 | 10.772278 | 10.772278 | 10.772278 | 0.000000 | 0.000000 | 1.1 |
| |06>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649092 | 10.649092 | 10.649092 | 10.649092 | 0.000000 | 0.000000 | 0.0 |
| |06>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000888 | 0.000006 | 0.000002 | 0.000236 | 0.000000 | 0.000020 | 100.0 |
| |06>>> |_ompt_work_single_other | 153 | 6 | wall_clock | sec | 0.000037 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |06>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004090 | 0.000013 | 0.000001 | 0.001175 | 0.000000 | 0.000067 | 100.0 |
| |06>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641317 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 0.8 |
| |06>>> |_ompt_work_single_other | 2041 | 7 | wall_clock | sec | 0.000476 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |06>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 7.467961 | 0.000945 | 0.000005 | 0.010712 | 0.000003 | 0.001627 | 100.0 |
| |06>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.250883 | 0.000042 | 0.000001 | 0.004087 | 0.000000 | 0.000285 | 100.0 |
| |06>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.838733 | 0.000718 | 0.000009 | 0.009015 | 0.000001 | 0.001015 | 99.9 |
| |06>>> |_ompt_sync_region_reduction | 3952 | 8 | wall_clock | sec | 0.003334 | 0.000001 | 0.000000 | 0.000025 | 0.000000 | 0.000001 | 100.0 |
| |06>>> |_ompt_work_single_executor | 11 | 7 | wall_clock | sec | 0.000005 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |06>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000940 | 0.000012 | 0.000009 | 0.000025 | 0.000000 | 0.000003 | 95.4 |
| |06>>> |_ompt_sync_region_reduction | 76 | 7 | wall_clock | sec | 0.000044 | 0.000001 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |06>>> |_ompt_work_single_executor | 1 | 6 | wall_clock | sec | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
| |05>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.797950 | 10.797950 | 10.797950 | 10.797950 | 0.000000 | 0.000000 | 1.4 |
| |05>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649072 | 10.649072 | 10.649072 | 10.649072 | 0.000000 | 0.000000 | 0.0 |
| |05>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000879 | 0.000006 | 0.000001 | 0.000248 | 0.000000 | 0.000021 | 100.0 |
| |05>>> |_ompt_work_single_other | 142 | 6 | wall_clock | sec | 0.000034 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |05>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004062 | 0.000013 | 0.000002 | 0.001163 | 0.000000 | 0.000067 | 100.0 |
| |05>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641291 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 0.7 |
| |05>>> |_ompt_work_single_other | 2038 | 7 | wall_clock | sec | 0.000500 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |05>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 8.279191 | 0.001047 | 0.000005 | 0.006596 | 0.000003 | 0.001792 | 100.0 |
| |05>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.250939 | 0.000042 | 0.000001 | 0.004090 | 0.000000 | 0.000286 | 100.0 |
| |05>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.039013 | 0.000516 | 0.000009 | 0.008689 | 0.000001 | 0.000855 | 100.0 |
| |05>>> |_ompt_work_single_executor | 14 | 7 | wall_clock | sec | 0.000005 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
| |05>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000926 | 0.000012 | 0.000009 | 0.000023 | 0.000000 | 0.000003 | 100.0 |
| |05>>> |_ompt_work_single_executor | 12 | 6 | wall_clock | sec | 0.000005 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |04>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.825935 | 10.825935 | 10.825935 | 10.825935 | 0.000000 | 0.000000 | 1.6 |
| |04>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649068 | 10.649068 | 10.649068 | 10.649068 | 0.000000 | 0.000000 | 0.0 |
| |04>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000884 | 0.000006 | 0.000002 | 0.000245 | 0.000000 | 0.000020 | 100.0 |
| |04>>> |_ompt_work_single_other | 150 | 6 | wall_clock | sec | 0.000034 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |04>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004069 | 0.000013 | 0.000001 | 0.001151 | 0.000000 | 0.000066 | 100.0 |
| |04>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641300 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 1.1 |
| |04>>> |_ompt_work_single_other | 2041 | 7 | wall_clock | sec | 0.000448 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |04>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 7.438393 | 0.000941 | 0.000005 | 0.007090 | 0.000003 | 0.001624 | 100.0 |
| |04>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.270654 | 0.000045 | 0.000001 | 0.004090 | 0.000000 | 0.000295 | 100.0 |
| |04>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.819165 | 0.000713 | 0.000009 | 0.008379 | 0.000001 | 0.001013 | 99.9 |
| |04>>> |_ompt_sync_region_reduction | 7904 | 8 | wall_clock | sec | 0.003932 | 0.000000 | 0.000000 | 0.000015 | 0.000000 | 0.000000 | 100.0 |
| |04>>> |_ompt_work_single_executor | 11 | 7 | wall_clock | sec | 0.000005 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |04>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000936 | 0.000012 | 0.000009 | 0.000025 | 0.000000 | 0.000003 | 93.2 |
| |04>>> |_ompt_sync_region_reduction | 152 | 7 | wall_clock | sec | 0.000064 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |04>>> |_ompt_work_single_executor | 4 | 6 | wall_clock | sec | 0.000001 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
| |03>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.849322 | 10.849322 | 10.849322 | 10.849322 | 0.000000 | 0.000000 | 1.8 |
| |03>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649075 | 10.649075 | 10.649075 | 10.649075 | 0.000000 | 0.000000 | 0.0 |
| |03>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000861 | 0.000006 | 0.000002 | 0.000238 | 0.000000 | 0.000020 | 100.0 |
| |03>>> |_ompt_work_single_other | 120 | 6 | wall_clock | sec | 0.000028 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |03>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.003993 | 0.000013 | 0.000001 | 0.001138 | 0.000000 | 0.000065 | 100.0 |
| |03>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641302 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 0.8 |
| |03>>> |_ompt_work_single_other | 1756 | 7 | wall_clock | sec | 0.000426 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |03>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 8.005617 | 0.001013 | 0.000005 | 0.011500 | 0.000003 | 0.001741 | 100.0 |
| |03>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.231485 | 0.000039 | 0.000001 | 0.004086 | 0.000000 | 0.000277 | 100.0 |
| |03>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.320428 | 0.000587 | 0.000009 | 0.010868 | 0.000001 | 0.000912 | 100.0 |
| |03>>> |_ompt_work_single_executor | 296 | 7 | wall_clock | sec | 0.000120 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |03>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000967 | 0.000013 | 0.000010 | 0.000023 | 0.000000 | 0.000003 | 100.0 |
| |03>>> |_ompt_work_single_executor | 34 | 6 | wall_clock | sec | 0.000013 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |02>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.876387 | 10.876387 | 10.876387 | 10.876387 | 0.000000 | 0.000000 | 2.1 |
| |02>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649050 | 10.649050 | 10.649050 | 10.649050 | 0.000000 | 0.000000 | 0.0 |
| |02>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000924 | 0.000006 | 0.000001 | 0.000241 | 0.000000 | 0.000020 | 100.0 |
| |02>>> |_ompt_work_single_other | 139 | 6 | wall_clock | sec | 0.000040 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |02>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.003972 | 0.000013 | 0.000001 | 0.001127 | 0.000000 | 0.000064 | 100.0 |
| |02>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641287 | 0.140017 | 0.131895 | 0.155101 | 0.000017 | 0.004080 | 0.7 |
| |02>>> |_ompt_work_single_other | 1902 | 7 | wall_clock | sec | 0.000553 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |02>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 7.906688 | 0.001000 | 0.000005 | 0.007068 | 0.000003 | 0.001713 | 100.0 |
| |02>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.261367 | 0.000044 | 0.000001 | 0.004088 | 0.000000 | 0.000295 | 100.0 |
| |02>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.402362 | 0.000608 | 0.000009 | 0.010399 | 0.000001 | 0.000944 | 99.9 |
| |02>>> |_ompt_sync_region_reduction | 3952 | 8 | wall_clock | sec | 0.002937 | 0.000001 | 0.000000 | 0.000021 | 0.000000 | 0.000000 | 100.0 |
| |02>>> |_ompt_work_single_executor | 150 | 7 | wall_clock | sec | 0.000073 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |02>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000895 | 0.000012 | 0.000009 | 0.000026 | 0.000000 | 0.000003 | 95.2 |
| |02>>> |_ompt_sync_region_reduction | 76 | 7 | wall_clock | sec | 0.000043 | 0.000001 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |02>>> |_ompt_work_single_executor | 15 | 6 | wall_clock | sec | 0.000007 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |01>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.901650 | 10.901650 | 10.901650 | 10.901650 | 0.000000 | 0.000000 | 2.3 |
| |01>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649017 | 10.649017 | 10.649017 | 10.649017 | 0.000000 | 0.000000 | 0.0 |
| |01>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000863 | 0.000006 | 0.000001 | 0.000231 | 0.000000 | 0.000019 | 100.0 |
| |01>>> |_ompt_work_single_other | 146 | 6 | wall_clock | sec | 0.000033 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
| |01>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004012 | 0.000013 | 0.000001 | 0.001115 | 0.000000 | 0.000064 | 100.0 |
| |01>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641316 | 0.140017 | 0.131895 | 0.155101 | 0.000017 | 0.004080 | 0.8 |
| |01>>> |_ompt_work_single_other | 1811 | 7 | wall_clock | sec | 0.000403 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |01>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 7.410337 | 0.000938 | 0.000005 | 0.010556 | 0.000003 | 0.001610 | 100.0 |
| |01>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.202494 | 0.000034 | 0.000001 | 0.003521 | 0.000000 | 0.000256 | 100.0 |
| |01>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.943604 | 0.000745 | 0.000008 | 0.009033 | 0.000001 | 0.001024 | 100.0 |
| |01>>> |_ompt_work_single_executor | 241 | 7 | wall_clock | sec | 0.000093 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |01>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000917 | 0.000012 | 0.000009 | 0.000026 | 0.000000 | 0.000003 | 100.0 |
| |01>>> |_ompt_work_single_executor | 8 | 6 | wall_clock | sec | 0.000004 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |00>>> |_c_print_results | 1 | 2 | wall_clock | sec | 0.000049 | 0.000049 | 0.000049 | 0.000049 | 0.000000 | 0.000000 | 100.0 |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
```
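When post-processing the text output, the `|NN>>>` and `|MM|NN>>>` label prefixes described above can be split off with a regular expression. The following is a minimal sketch, not part of omnitrace: the function name `parse_label` is hypothetical, and labels without a prefix (as with collapsed output) are assumed to carry only the name.

```python
import re

# Optional rank (MM), thread ID (NN), then the label text.
LABEL_RE = re.compile(r"^\|(?:(?P<rank>\d+)\|)?(?P<thread>\d+)>>>\s*(?P<name>.*)$")


def parse_label(label):
    """Return (rank, thread, name); rank/thread are None when not present."""
    text = label.strip()
    match = LABEL_RE.match(text)
    if match is None:
        return None, None, text
    rank = int(match.group("rank")) if match.group("rank") is not None else None
    name = match.group("name").strip()
    if name.startswith("|_"):  # drop the call-tree marker used for nested entries
        name = name[2:]
    return rank, int(match.group("thread")), name


print(parse_label("|00>>> main"))            # (None, 0, 'main')
print(parse_label("|3|11>>> |_conj_grad"))   # (3, 11, 'conj_grad')
```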
### Timemory JSON Output
> ***Hint: the generation of flat JSON output is configurable via `OMNITRACE_JSON_OUTPUT`.***
> ***The generation of hierarchical JSON data is configurable via `OMNITRACE_TREE_OUTPUT`.***
Timemory represents the data within the JSON output in two forms: a flat structure and a hierarchical structure.
The flat JSON data represents the data similarly to the text files: the hierarchical information
is represented by the indentation of the `"prefix"` field and the `"depth"` field.
The hierarchical JSON contains additional information with respect to inclusive and exclusive values. However,
its structure requires processing through recursion. This section of the JSON supports analysis
by [hatchet](https://github.com/hatchet/hatchet).
All the data entries for the flat structure are in a single JSON array.
It is easier to write a simple Python script for post-processing against this format than against the hierarchical format.
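As an illustration of such a post-processing script, the sketch below walks the flat layout and sorts the entries by their `sum`. It assumes the `["timemory"]["wall_clock"]["ranks"][...]["graph"]` layout shown in the JSON sample in this document; the inline data is a trimmed-down stand-in using values from that sample, and the function name `flat_entries` is hypothetical.

```python
import json

# Trimmed-down stand-in for a flat timemory JSON file.
SAMPLE = """
{
  "timemory": {
    "wall_clock": {
      "unit_repr": "sec",
      "ranks": [
        {
          "rank": 0,
          "graph": [
            {"prefix": "|0>>> main", "depth": 0,
             "stats": {"sum": 0.894743517, "count": 1}},
            {"prefix": "|0>>> |_read_input", "depth": 1,
             "stats": {"sum": 9.808e-06, "count": 1}}
          ]
        }
      ]
    }
  }
}
"""


def flat_entries(data, component="wall_clock"):
    """Yield (rank, prefix, depth, stats) for every entry in the flat layout."""
    for rank_data in data["timemory"][component]["ranks"]:
        for node in rank_data["graph"]:
            yield rank_data["rank"], node["prefix"], node["depth"], node["stats"]


data = json.loads(SAMPLE)
entries = sorted(flat_entries(data), key=lambda e: e[3]["sum"], reverse=True)
for rank, prefix, depth, stats in entries:
    print(f"rank={rank} depth={depth} sum={stats['sum']:.6f} {prefix}")
```

For a real output file, `json.loads(SAMPLE)` would be replaced by `json.load` on the opened file.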
#### Timemory JSON Output Sample
In the JSON below, the flat data starts at `["timemory"]["wall_clock"]["ranks"]`
and the hierarchical data starts at `["timemory"]["wall_clock"]["graph"]`.
For example, the name (prefix) of the nth entry in the flat data layout is accessed via
`["timemory"]["wall_clock"]["ranks"][0]["graph"][<N>]["prefix"]`. When full MPI
support is enabled, the per-rank data in the flat layout is represented
as an entry in the "ranks" array; in the hierarchical data structure,
the per-rank data is represented as an entry in the "mpi" array (but "graph"
is used in lieu of "mpi" when full MPI support is enabled).
In the hierarchical layout, all data for the process is a child of a (dummy)
root node (which has the name `unknown-hash=0`).
```json
{
"timemory": {
"wall_clock": {
"properties": {
"cereal_class_version": 0,
"value": 78,
"enum": "WALL_CLOCK",
"id": "wall_clock",
"ids": [
"real_clock",
"virtual_clock",
"wall_clock"
]
},
"type": "wall_clock",
"description": "Real-clock timer (i.e. wall-clock timer)",
"unit_value": 1000000000,
"unit_repr": "sec",
"thread_scope_only": false,
"thread_count": 2,
"mpi_size": 1,
"upcxx_size": 1,
"process_count": 1,
"num_ranks": 1,
"concurrency": 2,
"ranks": [
{
"rank": 0,
"graph_size": 112,
"graph": [
{
"hash": 17481650134347108265,
"prefix": "|0>>> main",
"depth": 0,
"entry": {
"cereal_class_version": 0,
"laps": 1,
"value": 894743517,
"accum": 894743517,
"repr_data": 0.894743517,
"repr_display": 0.894743517
},
"stats": {
"cereal_class_version": 0,
"sum": 0.894743517,
"count": 1,
"min": 0.894743517,
"max": 0.894743517,
"sqr": 0.8005659612135293,
"mean": 0.894743517,
"stddev": 0.0
},
"rolling_hash": 17481650134347108265
},
{
"hash": 3455444288293231339,
"prefix": "|0>>> |_read_input",
"depth": 1,
"entry": {
"laps": 1,
"value": 9808,
"accum": 9808,
"repr_data": 9.808e-06,
"repr_display": 9.808e-06
},
"stats": {
"sum": 9.808e-06,
"count": 1,
"min": 9.808e-06,
"max": 9.808e-06,
"sqr": 9.6196864e-11,
"mean": 9.808e-06,
"stddev": 0.0
},
"rolling_hash": 2490350348930787988
},
{
"hash": 8456966793631718807,
"prefix": "|0>>> |_setcoeff",
"depth": 1,
"entry": {
"laps": 1,
"value": 922,
"accum": 922,
"repr_data": 9.22e-07,
"repr_display": 9.22e-07
},
"stats": {
"sum": 9.22e-07,
"count": 1,
"min": 9.22e-07,
"max": 9.22e-07,
"sqr": 8.50084e-13,
"mean": 9.22e-07,
"stddev": 0.0
},
"rolling_hash": 7491872854269275456
},
{
"hash": 6107876127803219007,
"prefix": "|0>>> |_ompt_thread_initial",
"depth": 1,
"entry": {
"laps": 1,
"value": 896506392,
"accum": 896506392,
"repr_data": 0.896506392,
"repr_display": 0.896506392
},
"stats": {
"sum": 0.896506392,
"count": 1,
"min": 0.896506392,
"max": 0.896506392,
"sqr": 0.8037237108968578,
"mean": 0.896506392,
"stddev": 0.0
},
"rolling_hash": 5142782188440775656
},
{
"hash": 15402802091993617561,
"prefix": "|0>>> |_ompt_implicit_task",
"depth": 2,
"entry": {
"laps": 1,
"value": 896479111,
"accum": 896479111,
"repr_data": 0.896479111,
"repr_display": 0.896479111
},
"stats": {
"sum": 0.896479111,
"count": 1,
"min": 0.896479111,
"max": 0.896479111,
"sqr": 0.8036747964593504,
"mean": 0.896479111,
"stddev": 0.0
},
"rolling_hash": 2098840206724841601 },
{
"..." : "... etc. ..."
}
]
}
],
"graph": [
[
{
"cereal_class_version": 0,
"node": {
"hash": 0,
"prefix": "unknown-hash=0",
"tid": [
0
],
"pid": [
2539175
],
"depth": 0,
"is_dummy": false,
"inclusive": {
"entry": {
"laps": 0,
"value": 0,
"accum": 0,
"repr_data": 0.0,
"repr_display": 0.0
},
"stats": {
"sum": 0.0,
"count": 0,
"min": 0.0,
"max": 0.0,
"sqr": 0.0,
"mean": 0.0,
"stddev": 0.0
}
},
"exclusive": {
"entry": {
"laps": 0,
"value": -894743517,
"accum": -894743517,
"repr_data": -0.894743517,
"repr_display": -0.894743517
},
"stats": {
"sum": 0.0,
"count": 0,
"min": 0.0,
"max": 0.0,
"sqr": 0.0,
"mean": 0.0,
"stddev": 0.0
}
}
},
"children": [
{
"node": {
"hash": 17481650134347108265,
"prefix": "main",
"tid": [
0
],
"pid": [
2539175
],
"depth": 1,
"is_dummy": false,
"inclusive": {
"entry": {
"laps": 1,
"value": 894743517,
"accum": 894743517,
"repr_data": 0.894743517,
"repr_display": 0.894743517
},
"stats": {
"sum": 0.894743517,
"count": 1,
"min": 0.894743517,
"max": 0.894743517,
"sqr": 0.8005659612135293,
"mean": 0.894743517,
"stddev": 0.0
}
},
"exclusive": {
"entry": {
"laps": 1,
"value": -1773605,
"accum": -1773605,
"repr_data": -0.001773605,
"repr_display": -0.001773605
},
"stats": {
"sum": -0.001773605,
"count": 1,
"min": 9.22e-07,
"max": 0.896506392,
"sqr": -0.0031577497803754,
"mean": -0.001773605,
"stddev": 0.0
}
}
},
"children": [
{
"..." : "... etc. ..."
}
]
}
]
}
]
]
}
}
}
```
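Because the hierarchical layout nests each node's callees under `"children"`, it is naturally traversed with recursion. The following is a minimal sketch, assuming the `"node"` / `"children"` / `"inclusive"` / `"exclusive"` layout shown in the sample above; the function names and the reduced inline data are hypothetical.

```python
def iter_nodes(entries, level=0):
    """Yield (level, node) for each {"node": ..., "children": [...]} entry."""
    for entry in entries:
        node = entry.get("node")
        if node is not None:  # skip entries which carry no node
            yield level, node
        # recurse into the callees of this entry (if any)
        yield from iter_nodes(entry.get("children", []), level + 1)


def inclusive_exclusive(metric_data):
    """Return (level, prefix, inclusive, exclusive) rows for one metric."""
    rows = []
    for roots in metric_data["graph"]:  # "graph" is a list of lists of root entries
        for level, node in iter_nodes(roots):
            rows.append(
                (
                    level,
                    node["prefix"],
                    node["inclusive"]["entry"]["repr_data"],
                    node["exclusive"]["entry"]["repr_data"],
                )
            )
    return rows


def make_node(prefix, incl, excl):
    """Build a hypothetical node with only the fields used above."""
    return {
        "prefix": prefix,
        "inclusive": {"entry": {"repr_data": incl}},
        "exclusive": {"entry": {"repr_data": excl}},
    }


# hypothetical data mimicking the nesting of the sample
sample = {
    "graph": [
        [
            {
                "node": make_node("unknown-hash=0", 0.0, -0.5),
                "children": [
                    {
                        "node": make_node("main", 0.5, 0.1),
                        "children": [
                            {"node": make_node("solve", 0.4, 0.4), "children": []}
                        ],
                    }
                ],
            }
        ]
    ]
}

for level, prefix, incl, excl in inclusive_exclusive(sample):
    print(f"{'  ' * level}{prefix}: inclusive={incl}, exclusive={excl}")
```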
#### Timemory JSON Output Python Post-Processing Example
```python
#!/usr/bin/env python3

import sys
import json


def read_json(inp):
    """Load a timemory JSON output file"""
    with open(inp, "r") as f:
        return json.load(f)


def find_max(data):
    """Find the max for any function called multiple times"""
    max_entry = None
    for itr in data:
        # skip functions which were only called once
        if itr["entry"]["laps"] == 1:
            continue
        if max_entry is None or itr["stats"]["mean"] > max_entry["stats"]["mean"]:
            max_entry = itr
    return max_entry


def strip_name(name):
    """Return everything after |_ if it exists"""
    idx = name.find("|_")  # str.find returns -1 when "|_" is absent
    return name if idx < 0 else name[(idx + 2) :]


if __name__ == "__main__":
    input_data = [[x, read_json(x)] for x in sys.argv[1:]]
    for file, data in input_data:
        for metric, metric_data in data["timemory"].items():
            print(f"[{file}] Found metric: {metric}")
            for itr in metric_data["ranks"]:
                max_entry = find_max(itr["graph"])
                if max_entry is None:
                    # no function in this rank was called more than once
                    continue
                print(
                    "[{}] Maximum value: '{}' at depth {} was called {}x :: {:.3f} {} (mean = {:.3e} {})".format(
                        file,
                        strip_name(max_entry["prefix"]),
                        max_entry["depth"],
                        max_entry["entry"]["laps"],
                        max_entry["entry"]["repr_data"],
                        metric_data["unit_repr"],
                        max_entry["stats"]["mean"],
                        metric_data["unit_repr"],
                    )
                )
```
Applying this script to the corresponding JSON output from [Text Output Example](#timemory-text-output-example) would produce:
```console
[openmp-cg.inst-wall_clock.json] Found metric: wall_clock
[openmp-cg.inst-wall_clock.json] Maximum value: 'conj_grad' at depth 6 was called 76x :: 10.641 sec (mean = 1.400e-01 sec)
```
@@ -1,297 +0,0 @@
# Python Support
```eval_rst
.. toctree::
   :glob:
   :maxdepth: 3
```
[OmniTrace](https://github.com/ROCm/omnitrace) supports profiling Python code at the source-level and/or the script-level.
Python support is enabled via the `OMNITRACE_USE_PYTHON` and `OMNITRACE_PYTHON_VERSION="<MAJOR>.<MINOR>"` CMake options.
Alternatively, to build for multiple Python versions, use `OMNITRACE_PYTHON_VERSIONS="<MAJOR>.<MINOR>;[<MAJOR>.<MINOR>]"`
and `OMNITRACE_PYTHON_ROOT_DIRS="/path/to/version;[/path/to/version]"` instead of `OMNITRACE_PYTHON_VERSION`.
When building multiple Python versions, the `OMNITRACE_PYTHON_VERSIONS` and `OMNITRACE_PYTHON_ROOT_DIRS` lists must
have the same length.
> ***When using omnitrace for Python, the major and minor version of the Python interpreter (e.g. 3.7) must match the version***
> ***used when compiling the Python bindings. Building omnitrace generates a `libpyomnitrace.<IMPL>-<VERSION>-<ARCH>-<OS>-<ABI>.so`***
> ***where `IMPL` is the Python implementation, `VERSION` is the major and minor version, `ARCH` is the architecture,***
> ***`OS` is the operating system, and `ABI` is the application binary interface, e.g. `libpyomnitrace.cpython-38-x86_64-linux-gnu.so`.***
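To see which binding a given interpreter would need, the naming convention above can be compared against the interpreter's own extension-module suffix. The sketch below uses the standard-library `sysconfig` module; the helper names and the idea of checking a caller-supplied directory are illustrative assumptions, not part of omnitrace.

```python
import sysconfig
from pathlib import Path


def expected_binding_name(stem="libpyomnitrace"):
    """Return the library file name expected for the running interpreter."""
    # EXT_SUFFIX is e.g. ".cpython-38-x86_64-linux-gnu.so" for CPython 3.8 on Linux
    suffix = sysconfig.get_config_var("EXT_SUFFIX") or ".so"
    return stem + suffix


def has_matching_binding(directory):
    """Check whether `directory` (chosen by the caller) holds the matching binding."""
    return (Path(directory) / expected_binding_name()).is_file()


print(expected_binding_name())
```

If the printed name differs from the library that was built, the interpreter version most likely does not match the one used when compiling the bindings.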
## Getting Started
The omnitrace Python package is installed in `lib/pythonX.Y/site-packages/omnitrace`. In order to ensure the Python interpreter can find the omnitrace package,
add this path to the `PYTHONPATH` environment variable, e.g.:
```bash
export PYTHONPATH=/opt/omnitrace/lib/python3.8/site-packages:${PYTHONPATH}
```
If using either the `share/omnitrace/setup-env.sh` script or the modulefile in `share/modulefiles/omnitrace`, prefixing the `PYTHONPATH`
environment variable is automatically handled.
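A quick way to confirm that `PYTHONPATH` is set up as intended is to ask the interpreter where it would load the package from, without importing it. This is a small sketch using the standard-library `importlib.util.find_spec`; the helper name `describe_module` is hypothetical.

```python
import importlib.util


def describe_module(name):
    """Report where a top-level module would be imported from, if anywhere."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return f"{name} is not importable: check PYTHONPATH"
    return f"{name} would be imported from {spec.origin}"


print(describe_module("omnitrace"))
```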
## Running OmniTrace on a Python Script
OmniTrace provides an `omnitrace-python` helper bash script which ensures `PYTHONPATH` is properly set and the correct Python interpreter is used.
Thus, the following are effectively equivalent:
```bash
# using the helper script
omnitrace-python --help
# setting PYTHONPATH manually and invoking the module
export PYTHONPATH=/opt/omnitrace/lib/python3.8/site-packages:${PYTHONPATH}
python3.8 -m omnitrace --help
```
> ***`omnitrace-python` / `python -m omnitrace` uses the same command-line syntax as the `omnitrace` executable (i.e. `omnitrace-python <OMNITRACE_ARGS> -- <SCRIPT> <SCRIPT_ARGS>`) and has similar options.***
### Command Line Options
Use `omnitrace-python --help` to view the available options:
```console
usage: omnitrace [-h] [-v VERBOSITY] [-b] [-c FILE] [-s FILE] [-F [BOOL]] [--label [{args,file,line} [{args,file,line} ...]]] [-I FUNC [FUNC ...]] [-E FUNC [FUNC ...]] [-R FUNC [FUNC ...]] [-MI FILE [FILE ...]] [-ME FILE [FILE ...]] [-MR FILE [FILE ...]] [--trace-c [BOOL]]
optional arguments:
-h, --help show this help message and exit
-v VERBOSITY, --verbosity VERBOSITY
Logging verbosity
-b, --builtin Put 'profile' in the builtins. Use '@profile' to decorate a single function, or 'with profile:' to profile a single section of code.
-c FILE, --config FILE
OmniTrace configuration file
-s FILE, --setup FILE
Code to execute before the code to profile
-F [BOOL], --full-filepath [BOOL]
Encode the full function filename (instead of basename)
--label [{args,file,line} [{args,file,line} ...]]
Encode the function arguments, filename, and/or line number into the profiling function label
-I FUNC [FUNC ...], --function-include FUNC [FUNC ...]
Include any entries with these function names
-E FUNC [FUNC ...], --function-exclude FUNC [FUNC ...]
Filter out any entries with these function names
-R FUNC [FUNC ...], --function-restrict FUNC [FUNC ...]
Select only entries with these function names
-MI FILE [FILE ...], --module-include FILE [FILE ...]
Include any entries from these files
-ME FILE [FILE ...], --module-exclude FILE [FILE ...]
Filter out any entries from these files
-MR FILE [FILE ...], --module-restrict FILE [FILE ...]
Select only entries from these files
--trace-c [BOOL] Enable profiling C functions
usage: python3 -m omnitrace <OMNITRACE_ARGS> -- <SCRIPT> <SCRIPT_ARGS>
```
> ***The `--trace-c` option does not incorporate omnitrace's dynamic instrumentation support; it only enables profiling the underlying C function calls within the Python interpreter.***
### Selective Instrumentation
Similar to the `omnitrace` executable, command-line options exist for restricting, including, and excluding the desired functions and modules, e.g. `--function-exclude "^__init__$"`.
Alternatively, adding the `@profile` decorator to the primary function of interest in combination with the `-b` / `--builtin` option will narrow the scope of the
instrumentation to the decorated function(s) and their children.
Consider the following Python code (`example.py`):
```python
import sys


def fib(n):
    return n if n < 2 else (fib(n - 1) + fib(n - 2))


def inefficient(n):
    a = 0
    for i in range(n):
        a += i
        for j in range(n):
            a += j
    return a


def run(n):
    return fib(n) + inefficient(n)


if __name__ == "__main__":
    run(20)
```
Using `omnitrace-python ./example.py` with `OMNITRACE_PROFILE=ON` and `OMNITRACE_TIMEMORY_COMPONENTS=trip_count` would produce:
```console
|-------------------------------------------------------------------------------------------|
| COUNTS NUMBER OF INVOCATIONS |
|-------------------------------------------------------------------------------------------|
| LABEL | COUNT | DEPTH | METRIC | SUM |
|---------------------------------------------------|--------|--------|------------|--------|
| |0>>> run | 1 | 0 | trip_count | 1 |
| |0>>> |_fib | 1 | 1 | trip_count | 1 |
| |0>>> |_fib | 2 | 2 | trip_count | 2 |
| |0>>> |_fib | 4 | 3 | trip_count | 4 |
| |0>>> |_fib | 8 | 4 | trip_count | 8 |
| |0>>> |_fib | 16 | 5 | trip_count | 16 |
| |0>>> |_fib | 32 | 6 | trip_count | 32 |
| |0>>> |_fib | 64 | 7 | trip_count | 64 |
| |0>>> |_fib | 128 | 8 | trip_count | 128 |
| |0>>> |_fib | 256 | 9 | trip_count | 256 |
| |0>>> |_fib | 512 | 10 | trip_count | 512 |
| |0>>> |_fib | 1024 | 11 | trip_count | 1024 |
| |0>>> |_fib | 2026 | 12 | trip_count | 2026 |
| |0>>> |_fib | 3632 | 13 | trip_count | 3632 |
| |0>>> |_fib | 5020 | 14 | trip_count | 5020 |
| |0>>> |_fib | 4760 | 15 | trip_count | 4760 |
| |0>>> |_fib | 2942 | 16 | trip_count | 2942 |
| |0>>> |_fib | 1152 | 17 | trip_count | 1152 |
| |0>>> |_fib | 274 | 18 | trip_count | 274 |
| |0>>> |_fib | 36 | 19 | trip_count | 36 |
| |0>>> |_fib | 2 | 20 | trip_count | 2 |
| |0>>> |_inefficient | 1 | 1 | trip_count | 1 |
|-------------------------------------------------------------------------------------------|
```
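The COUNT column in this table follows directly from the recursion in `fib`: each call with `n >= 2` spawns two calls one level deeper, so the counts double until the recursion starts reaching `n < 2`. The standalone sketch below (it does not use omnitrace, and the function name is illustrative) counts the calls per depth, with `fib(20)` placed at depth 1 as in the table.

```python
from collections import Counter


def fib_calls_per_depth(n, depth=1):
    """Count the calls made by the recursive fib(n), keyed by call depth."""
    level = Counter({n: 1})  # argument -> number of calls at the current depth
    counts = {}
    while level:
        counts[depth] = sum(level.values())
        nxt = Counter()
        for arg, num in level.items():
            if arg >= 2:  # fib(arg) calls fib(arg - 1) and fib(arg - 2)
                nxt[arg - 1] += num
                nxt[arg - 2] += num
        level = nxt
        depth += 1
    return counts


counts = fib_calls_per_depth(20)
for depth, count in counts.items():
    print(f"depth {depth:2d}: {count:5d} calls")
```

The doubling stops at depth 12 (2026 calls rather than 2048) because some branches have already reached `fib(1)` or `fib(0)` one level earlier.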
If the `inefficient` function were decorated with `@profile`:
```python
@profile
def inefficient(n):
    # ...
```
And executed with `omnitrace-python -b -- ./example.py`, omnitrace would produce:
```console
|-----------------------------------------------------------|
| COUNTS NUMBER OF INVOCATIONS |
|-----------------------------------------------------------|
| LABEL | COUNT | DEPTH | METRIC | SUM |
|-------------------|--------|--------|------------|--------|
| |0>>> inefficient | 1 | 0 | trip_count | 1 |
|-----------------------------------------------------------|
```
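Since `profile` only exists as a built-in when the script is launched with `-b` / `--builtin`, a script decorated with `@profile` raises a `NameError` when run with a plain `python`. One common workaround, sketched here as an assumption rather than an omnitrace feature, is to fall back to a do-nothing decorator when the built-in is absent.

```python
import builtins

# use the injected built-in when present, otherwise a decorator which does nothing
profile = getattr(builtins, "profile", lambda func: func)


@profile
def inefficient(n):
    a = 0
    for i in range(n):
        a += i
        for j in range(n):
            a += j
    return a


print(inefficient(20))
```

Note that this fallback only covers the decorator form; it does not emulate the `with profile:` usage mentioned in the help output.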
## OmniTrace Python Source Instrumentation
Starting from the unmodified `example.py` script above, we first import the `omnitrace` module:
```python
import sys
import omnitrace  # import omnitrace


def fib(n):
    # ... etc. ...
```
Then, we can add `@omnitrace.profile()` to the `run` function:
```python
@omnitrace.profile()
def run(n):
    # ...
```
Or we can use `omnitrace.profile()` as a context-manager around `run(20)`:
```python
if __name__ == "__main__":
    with omnitrace.profile():
        run(20)
```
The results for both of the source-level instrumentation modes are identical to the original `omnitrace-python ./example.py` results:
```console
|-------------------------------------------------------------------------------------------|
| COUNTS NUMBER OF INVOCATIONS |
|-------------------------------------------------------------------------------------------|
| LABEL | COUNT | DEPTH | METRIC | SUM |
|---------------------------------------------------|--------|--------|------------|--------|
| |0>>> run | 1 | 0 | trip_count | 1 |
| |0>>> |_fib | 1 | 1 | trip_count | 1 |
| |0>>> |_fib | 2 | 2 | trip_count | 2 |
| |0>>> |_fib | 4 | 3 | trip_count | 4 |
| |0>>> |_fib | 8 | 4 | trip_count | 8 |
| |0>>> |_fib | 16 | 5 | trip_count | 16 |
| |0>>> |_fib | 32 | 6 | trip_count | 32 |
| |0>>> |_fib | 64 | 7 | trip_count | 64 |
| |0>>> |_fib | 128 | 8 | trip_count | 128 |
| |0>>> |_fib | 256 | 9 | trip_count | 256 |
| |0>>> |_fib | 512 | 10 | trip_count | 512 |
| |0>>> |_fib | 1024 | 11 | trip_count | 1024 |
| |0>>> |_fib | 2026 | 12 | trip_count | 2026 |
| |0>>> |_fib | 3632 | 13 | trip_count | 3632 |
| |0>>> |_fib | 5020 | 14 | trip_count | 5020 |
| |0>>> |_fib | 4760 | 15 | trip_count | 4760 |
| |0>>> |_fib | 2942 | 16 | trip_count | 2942 |
| |0>>> |_fib | 1152 | 17 | trip_count | 1152 |
| |0>>> |_fib | 274 | 18 | trip_count | 274 |
| |0>>> |_fib | 36 | 19 | trip_count | 36 |
| |0>>> |_fib | 2 | 20 | trip_count | 2 |
| |0>>> |_inefficient | 1 | 1 | trip_count | 1 |
|-------------------------------------------------------------------------------------------|
```
> ***When `omnitrace-python` is used without built-ins, the profiling results will likely be cluttered by***
> ***numerous functions called during the importing of more complex modules, e.g. `import numpy`.***
### OmniTrace Python Source Instrumentation Configuration
Within the Python source code, the profiler can be configured by directly modifying the `omnitrace.profiler.config` data fields.
```python
import sys


def fib(n):
    return n if n < 2 else (fib(n - 1) + fib(n - 2))


def inefficient(n):
    a = 0
    for i in range(n):
        a += i
        for j in range(n):
            a += j
    return a


def run(n):
    return fib(n) + inefficient(n)


if __name__ == "__main__":
    from omnitrace.profiler import config
    from omnitrace import profile

    config.include_args = True
    config.include_filename = False
    config.include_line = False
    config.restrict_functions += ["fib", "run"]

    with profile():
        run(5)
```
Executing this script would produce:
```console
|------------------------------------------------------------------|
| COUNTS NUMBER OF INVOCATIONS |
|------------------------------------------------------------------|
| LABEL | COUNT | DEPTH | METRIC | SUM |
|--------------------------|--------|--------|------------|--------|
| |0>>> run(n=5) | 1 | 0 | trip_count | 1 |
| |0>>> |_fib(n=5) | 1 | 1 | trip_count | 1 |
| |0>>> |_fib(n=4) | 1 | 2 | trip_count | 1 |
| |0>>> |_fib(n=3) | 1 | 3 | trip_count | 1 |
| |0>>> |_fib(n=2) | 1 | 4 | trip_count | 1 |
| |0>>> |_fib(n=1) | 1 | 5 | trip_count | 1 |
| |0>>> |_fib(n=0) | 1 | 5 | trip_count | 1 |
| |0>>> |_fib(n=1) | 1 | 4 | trip_count | 1 |
| |0>>> |_fib(n=2) | 1 | 3 | trip_count | 1 |
| |0>>> |_fib(n=1) | 1 | 4 | trip_count | 1 |
| |0>>> |_fib(n=0) | 1 | 4 | trip_count | 1 |
| |0>>> |_fib(n=3) | 1 | 2 | trip_count | 1 |
| |0>>> |_fib(n=2) | 1 | 3 | trip_count | 1 |
| |0>>> |_fib(n=1) | 1 | 4 | trip_count | 1 |
| |0>>> |_fib(n=0) | 1 | 4 | trip_count | 1 |
| |0>>> |_fib(n=1) | 1 | 3 | trip_count | 1 |
|------------------------------------------------------------------|
```
@@ -1,353 +0,0 @@
# Call-Stack Sampling
```eval_rst
.. toctree::
   :glob:
   :maxdepth: 4
```
> ***NOTE: Set `OMNITRACE_USE_SAMPLING=ON` to activate call-stack sampling when executing an instrumented binary.***
Call-stack sampling can be activated either with a binary instrumented via the `omnitrace-instrument` executable or with the `omnitrace-sample` executable.
***Effectively***, all of the commands below are equivalent:
- Binary rewrite with only instrumentation necessary to start/stop sampling
```console
omnitrace-instrument -M sampling -o foo.inst -- foo
omnitrace-run -- ./foo.inst
```
- Runtime instrumentation with only instrumentation necessary to start/stop sampling
```console
omnitrace-instrument -M sampling -- foo
```
- No instrumentation required
```console
omnitrace-sample -- foo
```
All `omnitrace-instrument -M sampling` (referred to as "instrumented-sampling" henceforth) does is wrap the `main` of the executable with initialization
before `main` starts and finalization after `main` ends.
This can be easily accomplished without instrumentation via an `LD_PRELOAD` of a library containing a dynamic symbol wrapper around `__libc_start_main`.
Thus, whenever binary instrumentation is unnecessary, using `omnitrace-sample` is recommended over `omnitrace-instrument -M sampling` for several reasons:
1. `omnitrace-sample` provides command-line options for controlling features of omnitrace instead of *requiring* configuration files or environment variables.
2. Although instrumented-sampling only requires inserting snippets around one function (`main`), Dyninst
   does not have a feature for specifying that parsing and processing all the other symbols in the binary is unnecessary.
   Thus, in the best-case scenario, instrumented-sampling has a slightly slower launch time when the target binary is relatively small
   but, in the worst-case scenarios, requires a significant amount of time and memory to launch.
3. `omnitrace-sample` is fully compatible with MPI, e.g. `mpirun -n 2 omnitrace-sample -- foo`, whereas `mpirun -n 2 omnitrace-instrument -M sampling -- foo`
   is incompatible with some MPI distributions (particularly OpenMPI) because of MPI restrictions against forking within an MPI rank.
    - When MPI and binary instrumentation are involved, two steps are required: (1) do a binary rewrite of the executable
      and (2) use the instrumented executable in lieu of the original executable. `omnitrace-sample` is thus much easier to use with MPI.
## omnitrace-sample Executable
View the help menu of `omnitrace-sample` with the `-h` / `--help` option:
```console
$ omnitrace-sample --help
[omnitrace-sample] Usage: omnitrace-sample [ --help (count: 0, dtype: bool)
--monochrome (max: 1, dtype: bool)
--debug (max: 1, dtype: bool)
--verbose (count: 1)
--config (min: 0, dtype: filepath)
--output (min: 1)
--trace (max: 1, dtype: bool)
--profile (max: 1, dtype: bool)
--flat-profile (max: 1, dtype: bool)
--host (max: 1, dtype: bool)
--device (max: 1, dtype: bool)
--trace-file (count: 1, dtype: filepath)
--trace-buffer-size (count: 1, dtype: KB)
--trace-fill-policy (count: 1)
--profile-format (min: 1)
--profile-diff (min: 1)
--process-freq (count: 1)
--process-wait (count: 1)
--process-duration (count: 1)
--cpus (count: unlimited, dtype: int or range)
--gpus (count: unlimited, dtype: int or range)
--freq (count: 1)
--wait (count: 1)
--duration (count: 1)
--tids (min: 1)
--cputime (min: 0)
--realtime (min: 0)
--include (count: unlimited)
--exclude (count: unlimited)
--cpu-events (count: unlimited)
--gpu-events (count: unlimited)
--inlines (max: 1, dtype: bool)
--hsa-interrupt (count: 1, dtype: int)
]
Options:
-h, -?, --help Shows this page
[DEBUG OPTIONS]
--monochrome Disable colorized output
--debug Debug output
-v, --verbose Verbose output
[GENERAL OPTIONS]
-c, --config Configuration file
-o, --output Output path. Accepts 1-2 parameters corresponding to the output path and the output prefix
-T, --trace Generate a detailed trace (perfetto output)
-P, --profile Generate a call-stack-based profile (conflicts with --flat-profile)
-F, --flat-profile Generate a flat profile (conflicts with --profile)
-H, --host Enable sampling host-based metrics for the process. E.g. CPU frequency, memory usage, etc.
-D, --device Enable sampling device-based metrics for the process. E.g. GPU temperature, memory usage, etc.
[TRACING OPTIONS]
--trace-file Specify the trace output filename. Relative filepath will be with respect to output path and output prefix.
--trace-buffer-size Size limit for the trace output (in KB)
--trace-fill-policy [ discard | ring_buffer ]
Policy for new data when the buffer size limit is reached:
- discard : new data is ignored
- ring_buffer : new data overwrites oldest data
[PROFILE OPTIONS]
--profile-format [ console | json | text ]
Data formats for profiling results
--profile-diff Generate a diff output b/t the profile collected and an existing profile from another run Accepts 1-2 parameters
corresponding to the input path and the input prefix
[HOST/DEVICE (PROCESS SAMPLING) OPTIONS]
--process-freq Set the default host/device sampling frequency (number of interrupts per second)
--process-wait Set the default wait time (i.e. delay) before taking first host/device sample (in seconds of realtime)
--process-duration Set the duration of the host/device sampling (in seconds of realtime)
--cpus CPU IDs for frequency sampling. Supports integers and/or ranges
--gpus GPU IDs for SMI queries. Supports integers and/or ranges
[GENERAL SAMPLING OPTIONS]
-f, --freq Set the default sampling frequency (number of interrupts per second)
-w, --wait Set the default wait time (i.e. delay) before taking first sample (in seconds). This delay time is based on the clock
of the sampler, i.e., a delay of 1 second for CPU-clock sampler may not equal 1 second of realtime
-d, --duration Set the duration of the sampling (in seconds of realtime). I.e., it is possible (currently) to set a CPU-clock time
delay that exceeds the real-time duration... resulting in zero samples being taken
-t, --tids Specify the default thread IDs for sampling, where 0 (zero) is the main thread and each thread created by the target
application is assigned an atomically incrementing value.
[SAMPLING TIMER OPTIONS]
--cputime Sample based on a CPU-clock timer (default). Accepts zero or more arguments:
0. Enables sampling based on CPU-clock timer.
1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of CPU-time.
2. Delay (in seconds of CPU-clock time). I.e., how long each thread should wait before taking first sample.
3+ Thread IDs to target for sampling, starting at 0 (the main thread).
May be specified as index or range, e.g., '0 2-4' will be interpreted as:
sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads
--realtime Sample based on a real-clock timer. Accepts zero or more arguments:
0. Enables sampling based on real-clock timer.
1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of realtime.
2. Delay (in seconds of real-clock time). I.e., how long each thread should wait before taking first sample.
3+ Thread IDs to target for sampling, starting at 0 (the main thread).
May be specified as index or range, e.g., '0 2-4' will be interpreted as:
sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads
When sampling with a real-clock timer, please note that enabling this will cause threads which are typically "idle"
to consume more resources since, while idle, the real-clock time increases (and therefore triggers taking samples)
whereas the CPU-clock time does not.
[BACKEND OPTIONS] (These options control region information captured w/o sampling or instrumentation)
-I, --include [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
Include data from these backends
-E, --exclude [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
Exclude data from these backends
[HARDWARE COUNTER OPTIONS]
-C, --cpu-events Set the CPU hardware counter events to record (ref: `omnitrace-avail -H -c CPU`)
-G, --gpu-events Set the GPU hardware counter events to record (ref: `omnitrace-avail -H -c GPU`)
[MISCELLANEOUS OPTIONS]
-i, --inlines Include inline info in output when available
--hsa-interrupt [ 0 | 1 ] Set the value of the HSA_ENABLE_INTERRUPT environment variable.
ROCm version 5.2 and older have a bug which will cause a deadlock if a sample is taken while waiting for the signal
that a kernel completed -- which happens when sampling with a real-clock timer. We require this option to be set to
when --realtime is specified to make users aware that, while this may fix the bug, it can have a negative impact on
performance.
Values:
0 avoid triggering the bug, potentially at the cost of reduced performance
1 do not modify how ROCm is notified about kernel completion
```
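The thread ID syntax described for `--cputime` and `--realtime` mixes single indices and ranges. As a rough illustration of how a specification such as `'0 2-4'` expands, consider the following hypothetical helper (it is not the parser used by `omnitrace-sample`):

```python
def expand_thread_ids(spec):
    """Expand a string such as '0 2-4' into a sorted list of thread IDs."""
    ids = set()
    for token in spec.split():
        if "-" in token:
            first, last = (int(x) for x in token.split("-", 1))
            ids.update(range(first, last + 1))  # ranges are inclusive
        else:
            ids.add(int(token))
    return sorted(ids)


print(expand_thread_ids("0 2-4"))  # [0, 2, 3, 4]
```

Thread 1 (the first child thread) is absent from the result, matching the interpretation given in the help text.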
The general syntax for separating omnitrace command line arguments from the application arguments
is consistent with the LLVM style of using a standalone double-hyphen (`--`). All arguments preceding the double-hyphen
are interpreted as belonging to omnitrace and all arguments following the double-hyphen are interpreted as the
application and its arguments. The double-hyphen is only necessary when passing command line arguments to the target
which also use hyphens. E.g. `omnitrace-sample ls` works but, in order to run `ls -la`, use `omnitrace-sample -- ls -la`.
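The double-hyphen convention can be pictured as a simple split of the argument list. The sketch below is only an illustration of the convention under a simplifying assumption (when no `--` is present, everything is treated as the application command); it is not how `omnitrace-sample` itself parses its arguments, and `split_command` is a hypothetical name.

```python
def split_command(argv):
    """Split argv at the first standalone '--' into (tool_args, app_command)."""
    if "--" in argv:
        idx = argv.index("--")
        return argv[:idx], argv[idx + 1 :]
    # no separator: this sketch treats everything as the application command
    return [], list(argv)


tool_args, app = split_command(["-P", "-T", "--", "ls", "-la"])
print(tool_args, app)
```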
[Configuring OmniTrace Runtime](runtime.md) establishes the precedence of environment variable values over values specified in the configuration files. This enables
the user to configure the omnitrace runtime to their preferred default behavior in a file such as `~/.omnitrace.cfg` and then easily override
those settings via something like `OMNITRACE_ENABLED=OFF omnitrace-sample -- foo`.
Similarly, the command line arguments passed to `omnitrace-sample` take precedence over environment variables.
All of the command-line options above correlate to one or more configuration settings, e.g. `--cpu-events` correlates to the `OMNITRACE_PAPI_EVENTS` configuration variable.
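The resulting precedence (command line over environment variables over configuration file) can be modeled as a layered lookup. The following sketch uses the standard-library `collections.ChainMap`; the function name and the setting values are made up for illustration and do not reflect how omnitrace stores its settings.

```python
from collections import ChainMap


def resolve_settings(config_file, environment, command_line):
    """Merge settings so that command line > environment > configuration file."""
    # ChainMap returns the value from the first mapping which contains the key
    return dict(ChainMap(command_line, environment, config_file))


settings = resolve_settings(
    config_file={"OMNITRACE_ENABLED": "ON", "OMNITRACE_PROFILE": "OFF"},
    environment={"OMNITRACE_ENABLED": "OFF"},
    command_line={"OMNITRACE_PROFILE": "true"},
)
print(settings)
```

Here the environment overrides the configuration file for `OMNITRACE_ENABLED`, while the command line overrides the configuration file for `OMNITRACE_PROFILE`.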
After the command-line arguments to `omnitrace-sample` have been processed but before the target application is executed, `omnitrace-sample` emits a log
of the environment variables which were set and/or modified.
The snippet below shows the environment updates when `omnitrace-sample` is invoked with no arguments:
```console
$ omnitrace-sample -- ./parallel-overhead-locks 30 4 100
HSA_TOOLS_LIB=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
HSA_TOOLS_REPORT_LOAD_FAILURE=1
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
OMNITRACE_USE_PROCESS_SAMPLING=false
OMNITRACE_USE_SAMPLING=true
OMP_TOOL_LIBRARIES=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
ROCP_TOOL_LIB=/opt/omnitrace/lib/libomnitrace.so.1.7.1
...
```
The snippet below shows the environment updates when `omnitrace-sample` enables profiling, tracing, host process-sampling, device process-sampling, and all the available backends:
```console
$ omnitrace-sample -PTDH -I all -- ./parallel-overhead-locks 30 4 100
HSA_TOOLS_LIB=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
HSA_TOOLS_REPORT_LOAD_FAILURE=1
KOKKOS_PROFILE_LIBRARY=/opt/omnitrace/lib/libomnitrace.so.1.7.1
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
OMNITRACE_CPU_FREQ_ENABLED=true
OMNITRACE_TRACE_THREAD_LOCKS=true
OMNITRACE_TRACE_THREAD_RW_LOCKS=true
OMNITRACE_TRACE_THREAD_SPIN_LOCKS=true
OMNITRACE_USE_KOKKOSP=true
OMNITRACE_USE_MPIP=true
OMNITRACE_USE_OMPT=true
OMNITRACE_TRACE=true
OMNITRACE_USE_PROCESS_SAMPLING=true
OMNITRACE_USE_RCCLP=true
OMNITRACE_USE_ROCM_SMI=true
OMNITRACE_USE_ROCPROFILER=true
OMNITRACE_USE_ROCTRACER=true
OMNITRACE_USE_ROCTX=true
OMNITRACE_USE_SAMPLING=true
OMNITRACE_PROFILE=true
OMP_TOOL_LIBRARIES=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
ROCP_TOOL_LIB=/opt/omnitrace/lib/libomnitrace.so.1.7.1
...
```
The snippet below shows the environment updates when `omnitrace-sample` enables profiling, tracing, host process-sampling, device process-sampling,
sets the output path to `omnitrace-output`, the output prefix to `%tag%` and disables all the available backends:
```console
$ omnitrace-sample -PTDH -E all -o omnitrace-output %tag% -- ./parallel-overhead-locks 30 4 100
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
OMNITRACE_CPU_FREQ_ENABLED=true
OMNITRACE_OUTPUT_PATH=omnitrace-output
OMNITRACE_OUTPUT_PREFIX=%tag%
OMNITRACE_TRACE_THREAD_LOCKS=false
OMNITRACE_TRACE_THREAD_RW_LOCKS=false
OMNITRACE_TRACE_THREAD_SPIN_LOCKS=false
OMNITRACE_USE_KOKKOSP=false
OMNITRACE_USE_MPIP=false
OMNITRACE_USE_OMPT=false
OMNITRACE_TRACE=true
OMNITRACE_USE_PROCESS_SAMPLING=true
OMNITRACE_USE_RCCLP=false
OMNITRACE_USE_ROCM_SMI=false
OMNITRACE_USE_ROCPROFILER=false
OMNITRACE_USE_ROCTRACER=false
OMNITRACE_USE_ROCTX=false
OMNITRACE_USE_SAMPLING=true
OMNITRACE_PROFILE=true
...
```
## omnitrace-sample Example
```console
$ omnitrace-sample -PTDH -E all -o omnitrace-output %tag% -c -- ./parallel-overhead-locks 30 4 100
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
OMNITRACE_CONFIG_FILE=
OMNITRACE_CPU_FREQ_ENABLED=true
OMNITRACE_OUTPUT_PATH=omnitrace-output
OMNITRACE_OUTPUT_PREFIX=%tag%
OMNITRACE_TRACE_THREAD_LOCKS=false
OMNITRACE_TRACE_THREAD_RW_LOCKS=false
OMNITRACE_TRACE_THREAD_SPIN_LOCKS=false
OMNITRACE_USE_KOKKOSP=false
OMNITRACE_USE_MPIP=false
OMNITRACE_USE_OMPT=false
OMNITRACE_TRACE=true
OMNITRACE_USE_PROCESS_SAMPLING=true
OMNITRACE_USE_RCCLP=false
OMNITRACE_USE_ROCM_SMI=false
OMNITRACE_USE_ROCPROFILER=false
OMNITRACE_USE_ROCTRACER=false
OMNITRACE_USE_ROCTX=false
OMNITRACE_USE_SAMPLING=true
OMNITRACE_PROFILE=true
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Sampling
______ .___ ___. .__ __. __ .___________..______ ___ ______ _______
/ __ \ | \/ | | \ | | | | | || _ \ / \ / || ____|
| | | | | \ / | | \| | | | `---| |----`| |_) | / ^ \ | ,----'| |__
| | | | | |\/| | | . ` | | | | | | / / /_\ \ | | | __|
| `--' | | | | | | |\ | | | | | | |\ \----./ _____ \ | `----.| |____
\______/ |__| |__| |__| \__| |__| |__| | _| `._____/__/ \__\ \______||_______|
[759.689] perfetto.cc:55903 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
[parallel-overhead-locks] Threads: 4
[parallel-overhead-locks] Iterations: 100
[parallel-overhead-locks] fibonacci(30)...
[1] number of iterations: 100
[2] number of iterations: 100
[3] number of iterations: 100
[4] number of iterations: 100
[parallel-overhead-locks] fibonacci(30) x 4 = 394644873
[parallel-overhead-locks] number of mutex locks = 400
[omnitrace][107157][0][omnitrace_finalize]
[omnitrace][107157][0][omnitrace_finalize] finalizing...
[omnitrace][107157][0][omnitrace_finalize]
[omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157 : 0.610427 sec wall_clock, 2.248 MB peak_rss, 2.265 MB page_rss, 2.560000 sec cpu_clock, 419.4 % cpu_util [laps: 1]
[omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157/thread/0 : 0.608866 sec wall_clock, 0.000677 sec thread_cpu_clock, 0.1 % thread_cpu_util, 2.248 MB peak_rss [laps: 1]
[omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157/thread/1 : 0.608237 sec wall_clock, 0.603553 sec thread_cpu_clock, 99.2 % thread_cpu_util, 2.204 MB peak_rss [laps: 1]
[omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157/thread/2 : 0.601430 sec wall_clock, 0.598378 sec thread_cpu_clock, 99.5 % thread_cpu_util, 1.156 MB peak_rss [laps: 1]
[omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157/thread/3 : 0.570223 sec wall_clock, 0.568713 sec thread_cpu_clock, 99.7 % thread_cpu_util, 0.772 MB peak_rss [laps: 1]
[omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157/thread/4 : 0.557637 sec wall_clock, 0.557198 sec thread_cpu_clock, 99.9 % thread_cpu_util, 0.156 MB peak_rss [laps: 1]
[omnitrace][107157][0][omnitrace_finalize]
[omnitrace][107157][0][omnitrace_finalize] Finalizing perfetto...
[omnitrace][107157][perfetto]> Outputting '/home/user/data/omnitrace-output/2022-10-19_02.46/parallel-overhead-locksperfetto-trace-107157.proto' (842.90 KB / 0.84 MB / 0.00 GB)... Done
[omnitrace][107157][trip_count]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockstrip_count-107157.json'
[omnitrace][107157][trip_count]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockstrip_count-107157.txt'
[omnitrace][107157][sampling_percent]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_percent-107157.json'
[omnitrace][107157][sampling_percent]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_percent-107157.txt'
[omnitrace][107157][sampling_cpu_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_cpu_clock-107157.json'
[omnitrace][107157][sampling_cpu_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_cpu_clock-107157.txt'
[omnitrace][107157][sampling_wall_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_wall_clock-107157.json'
[omnitrace][107157][sampling_wall_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_wall_clock-107157.txt'
[omnitrace][107157][wall_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockswall_clock-107157.json'
[omnitrace][107157][wall_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockswall_clock-107157.txt'
[omnitrace][107157][metadata]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-locksmetadata-107157.json' and 'omnitrace-output/2022-10-19_02.46/parallel-overhead-locksfunctions-107157.json'
[omnitrace][107157][0][omnitrace_finalize] Finalized
[761.584] perfetto.cc:57382 Tracing session 1 ended, total sessions:0
```
-49
@@ -1,49 +0,0 @@
# Setup and Validation
```eval_rst
.. toctree::
:glob:
:maxdepth: 3
```
## Configuring Environment
Once omnitrace is installed, source the `setup-env.sh` script to prepend the omnitrace install locations to `PATH`, `LD_LIBRARY_PATH`, and related environment variables:
```bash
source /opt/omnitrace/share/omnitrace/setup-env.sh
```
Alternatively, if environment modules are supported, add the `<prefix>/share/modulefiles` directory to `MODULEPATH`:
```bash
module use /opt/omnitrace/share/modulefiles
```
> ***Alternatively, the above line can be added to the `${HOME}/.modulerc` file.***
Once omnitrace is in the `MODULEPATH`, omnitrace can be loaded via `module load omnitrace/<VERSION>` and unloaded via `module unload omnitrace/<VERSION>`, e.g.:
```bash
module load omnitrace/1.0.0
module unload omnitrace/1.0.0
```
> ***You may also need to add the path to the ROCm libraries to `LD_LIBRARY_PATH`, e.g. `export LD_LIBRARY_PATH=/opt/rocm/lib:${LD_LIBRARY_PATH}`***
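Sourcing a profile several times can leave duplicate entries in `LD_LIBRARY_PATH`. The following is a minimal sketch of an idempotent prepend; the helper name `prepend_ld_library_path` is hypothetical and not part of omnitrace, and `/opt/rocm/lib` is only the example location from the note above:

```bash
# Hypothetical helper (not part of omnitrace): prepend a directory to
# LD_LIBRARY_PATH only when it is not already listed.
prepend_ld_library_path() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) ;;  # already present: leave the variable unchanged
        *) LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}" ;;
    esac
    export LD_LIBRARY_PATH
}

# Assumed ROCm library location; adjust to the local installation.
prepend_ld_library_path /opt/rocm/lib
```

Calling the helper a second time with the same directory leaves the variable unchanged.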
## Validating Environment Configuration
If all the following commands execute successfully with output, then you are ready to use omnitrace:
```bash
which omnitrace-instrument
which omnitrace-avail
which omnitrace-sample
omnitrace-instrument --help
omnitrace-avail --all
omnitrace-sample --help
# if built with python support
which omnitrace-python
omnitrace-python --help
```
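The checks above can also be scripted. This is a rough sketch, assuming only that the executables named in the validation list should be on `PATH`; the function name `check_tools` is hypothetical:

```bash
# Hypothetical helper: report which of the given executables are missing
# from PATH. Returns non-zero if any of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" > /dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

check_tools omnitrace-instrument omnitrace-avail omnitrace-sample \
    || echo "omnitrace is not fully set up in this shell"
```

This only confirms that the executables can be located; running the `--help` commands remains the way to confirm that they start correctly.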
-36
@@ -1,36 +0,0 @@
#!/bin/bash -e
message()
{
echo -e "\n\n##### ${@}... #####\n"
}
WORK_DIR=$(cd $(dirname ${BASH_SOURCE[0]}) && pwd)
SOURCE_DIR=$(cd ${WORK_DIR}/../.. &> /dev/null && pwd)
message "Working directory is ${WORK_DIR}"
message "Source directory is ${SOURCE_DIR}"
message "Changing directory to ${WORK_DIR}"
cd ${WORK_DIR}
message "Generating omnitrace.dox"
cmake -DSOURCE_DIR=${SOURCE_DIR} -P ${WORK_DIR}/generate-doxyfile.cmake
message "Generating doxygen xml files"
doxygen omnitrace.dox
doxygen omnitrace.dox
message "Building html documentation"
make html SPHINXOPTS="-W --keep-going -n"
if [ -d ${SOURCE_DIR}/docs ]; then
message "Removing stale documentation in ${SOURCE_DIR}/docs/"
rm -rf ${SOURCE_DIR}/docs/*
message "Adding nojekyll to docs/"
cp -r ${WORK_DIR}/.nojekyll ${SOURCE_DIR}/docs/.nojekyll
message "Copying source/docs/_build/html/* to docs/"
cp -r ${WORK_DIR}/_build/html/* ${SOURCE_DIR}/docs/
fi
-9
@@ -1,9 +0,0 @@
#!/bin/bash -e
WORK_DIR=$(dirname ${BASH_SOURCE[0]})
SOURCE_DIR=$(cd ${WORK_DIR}/../.. &> /dev/null && pwd)
cmake -DSOURCE_DIR=${SOURCE_DIR} -P generate-doxyfile.cmake
doxygen omnitrace.dox
-270
@@ -1,270 +0,0 @@
# User API
```eval_rst
.. doxygenfile:: omnitrace/types.h
.. doxygenfile:: omnitrace/categories.h
.. doxygenfile:: omnitrace/user.h
.. doxygenfile:: omnitrace/causal.h
```
By default, when omnitrace detects any `omnitrace_user_start_*` or `omnitrace_user_stop_*` function, instrumentation
is disabled at start-up, so `omnitrace_user_stop_trace()` is not required at the beginning of main. This behavior
can be manually controlled via the `OMNITRACE_INIT_ENABLED` environment variable. User-defined regions are always
recorded, regardless of whether `omnitrace_user_start_*` or `omnitrace_user_stop_*` has been called.
## Example
### Compilation
#### CMake
```cmake
find_package(omnitrace REQUIRED COMPONENTS user)
add_executable(foo foo.cpp)
target_link_libraries(foo PRIVATE omnitrace::omnitrace-user-library)
```
#### General
Assuming omnitrace is installed in `/opt/omnitrace`:
```bash
g++ -I/opt/omnitrace/include -L/opt/omnitrace/lib foo.cpp -o foo -lomnitrace-user
```
### User API Implementation
```cpp
#include <omnitrace/categories.h>
#include <omnitrace/types.h>
#include <omnitrace/user.h>
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <sstream>
#include <string>
#include <thread>
#include <vector>
std::atomic<long> total{ 0 };
long
fib(long n) __attribute__((noinline));
void
run(size_t nitr, long) __attribute__((noinline));
int
custom_push_region(const char* name);
namespace
{
omnitrace_user_callbacks_t custom_callbacks = OMNITRACE_USER_CALLBACKS_INIT;
omnitrace_user_callbacks_t original_callbacks = OMNITRACE_USER_CALLBACKS_INIT;
} // namespace
int
main(int argc, char** argv)
{
custom_callbacks.push_region = &custom_push_region;
omnitrace_user_configure(OMNITRACE_USER_UNION_CONFIG, custom_callbacks,
&original_callbacks);
omnitrace_user_push_region(argv[0]);
omnitrace_user_push_region("initialization");
size_t nthread = std::min<size_t>(16, std::thread::hardware_concurrency());
size_t nitr = 50000;
long nfib = 10;
if(argc > 1) nfib = atol(argv[1]);
if(argc > 2) nthread = atol(argv[2]);
if(argc > 3) nitr = atol(argv[3]);
omnitrace_user_pop_region("initialization");
printf("[%s] Threads: %zu\n[%s] Iterations: %zu\n[%s] fibonacci(%li)...\n", argv[0],
nthread, argv[0], nitr, argv[0], nfib);
omnitrace_user_push_region("thread_creation");
std::vector<std::thread> threads{};
threads.reserve(nthread);
// disable instrumentation for child threads
omnitrace_user_stop_thread_trace();
for(size_t i = 0; i < nthread; ++i)
{
threads.emplace_back(&run, nitr, nfib);
}
// re-enable instrumentation
omnitrace_user_start_thread_trace();
omnitrace_user_pop_region("thread_creation");
omnitrace_user_push_region("thread_wait");
for(auto& itr : threads)
itr.join();
omnitrace_user_pop_region("thread_wait");
run(nitr, nfib);
printf("[%s] fibonacci(%li) x %zu = %li\n", argv[0], nfib, nthread, total.load());
omnitrace_user_pop_region(argv[0]);
return 0;
}
long
fib(long n)
{
return (n < 2) ? n : fib(n - 1) + fib(n - 2);
}
#define RUN_LABEL \
std::string{ std::string{ __FUNCTION__ } + "(" + std::to_string(n) + ") x " + \
std::to_string(nitr) } \
.c_str()
void
run(size_t nitr, long n)
{
omnitrace_user_push_region(RUN_LABEL);
long local = 0;
for(size_t i = 0; i < nitr; ++i)
local += fib(n);
total += local;
omnitrace_user_pop_region(RUN_LABEL);
}
int
custom_push_region(const char* name)
{
if(!original_callbacks.push_region || !original_callbacks.push_annotated_region)
return OMNITRACE_USER_ERROR_NO_BINDING;
printf("Pushing custom region :: %s\n", name);
if(original_callbacks.push_annotated_region)
{
int32_t _err = errno;
char* _msg = nullptr;
char _buff[1024];
if(_err != 0) _msg = strerror_r(_err, _buff, sizeof(_buff));
omnitrace_annotation_t _annotations[] = {
{ "errno", OMNITRACE_INT32, &_err }, { "strerror", OMNITRACE_STRING, _msg }
};
errno = 0; // reset errno
return (*original_callbacks.push_annotated_region)(
name, _annotations, sizeof(_annotations) / sizeof(omnitrace_annotation_t));
}
return (*original_callbacks.push_region)(name);
}
```
### User API Output
```console
$ omnitrace-instrument -l --min-instructions=8 -E custom_push_region -o -- ./user-api
...
$ omnitrace-run --profile --use-pid off --time-output off -- ./user-api.inst 20 4 100
Pushing custom region :: ./user-api.inst
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Trace
  ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
 /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
|  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
|  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
|  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
 \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|
Pushing custom region :: initialization
[./user-api.inst] Threads: 4
[./user-api.inst] Iterations: 100
[./user-api.inst] fibonacci(20)...
Pushing custom region :: thread_creation
Pushing custom region :: thread_wait
Pushing custom region :: run(20) x 100
Pushing custom region :: run(20) x 100
Pushing custom region :: run(20) x 100
Pushing custom region :: run(20) x 100
Pushing custom region :: run(20) x 100
[./user-api.inst] fibonacci(20) x 4 = 3382500
[omnitrace][86267][0][omnitrace_finalize] finalizing...
[omnitrace][86267][0] omnitrace : 5.190895 sec wall_clock, 2.748 mb peak_rss, 6.330000 sec cpu_clock, 121.9 % cpu_util [laps: 1]
[omnitrace][86267][0] user-api.inst/thread-0 : 5.078713 sec wall_clock, 4.722415 sec thread_cpu_clock, 93.0 % thread_cpu_util, 1.276 mb peak_rss [laps: 1]
[omnitrace][86267][0] user-api.inst/thread-1 : 0.322248 sec wall_clock, 0.322191 sec thread_cpu_clock, 100.0 % thread_cpu_util, 1.000 mb peak_rss [laps: 1]
[omnitrace][86267][0] user-api.inst/thread-2 : 0.323255 sec wall_clock, 0.323194 sec thread_cpu_clock, 100.0 % thread_cpu_util, 0.000 mb peak_rss [laps: 1]
[omnitrace][86267][0] user-api.inst/thread-3 : 0.323569 sec wall_clock, 0.323484 sec thread_cpu_clock, 100.0 % thread_cpu_util, 1.092 mb peak_rss [laps: 1]
[omnitrace][86267][0] user-api.inst/thread-4 : 0.324178 sec wall_clock, 0.324057 sec thread_cpu_clock, 100.0 % thread_cpu_util, 1.184 mb peak_rss [laps: 1]
[omnitrace][86267][0] Post-processing 51 cpu frequency and memory usage entries...
[omnitrace][wall_clock]|0> Outputting 'omnitrace-user-api.inst-output/wall_clock.json'...
[omnitrace][wall_clock]|0> Outputting 'omnitrace-user-api.inst-output/wall_clock.tree.json'...
[omnitrace][wall_clock]|0> Outputting 'omnitrace-user-api.inst-output/wall_clock.txt'...
[omnitrace][manager::finalize][metadata]> Outputting 'omnitrace-user-api.inst-output/metadata.json' and 'omnitrace-user-api.inst-output/functions.json'...
[omnitrace][86267][0][omnitrace_finalize] Finalized
$ cat omnitrace-user-api.inst-output/wall_clock.txt
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| REAL-CLOCK TIMER (I.E. WALL-CLOCK TIMER) |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| LABEL | COUNT | DEPTH | METRIC | UNITS | SUM | MEAN | MIN | MAX | VAR | STDDEV | % SELF |
|---------------------------------------------------------------------------------|--------|--------|------------|--------|----------|----------|----------|----------|----------|----------|--------|
| |0>>> ./user-api.inst | 1 | 0 | wall_clock | sec | 5.078521 | 5.078521 | 5.078521 | 5.078521 | 0.000000 | 0.000000 | 0.0 |
| |0>>> |_initialization | 1 | 1 | wall_clock | sec | 0.000004 | 0.000004 | 0.000004 | 0.000004 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_thread_creation | 1 | 1 | wall_clock | sec | 0.000159 | 0.000159 | 0.000159 | 0.000159 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_thread_wait | 1 | 1 | wall_clock | sec | 0.355307 | 0.355307 | 0.355307 | 0.355307 | 0.000000 | 0.000000 | 0.0 |
| |0>>> |_std::vector<std::thread, std::allocator<std::thread> >::begin | 1 | 2 | wall_clock | sec | 0.000001 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_std::vector<std::thread, std::allocator<std::thread> >::end | 1 | 2 | wall_clock | sec | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_pthread_join | 4 | 2 | wall_clock | sec | 0.355257 | 0.088814 | 0.000001 | 0.333144 | 0.026559 | 0.162970 | 100.0 |
| |2>>> |_start_thread | 1 | 3 | wall_clock | sec | 0.000032 | 0.000032 | 0.000032 | 0.000032 | 0.000000 | 0.000000 | 100.0 |
| |1>>> |_start_thread | 1 | 3 | wall_clock | sec | 0.000036 | 0.000036 | 0.000036 | 0.000036 | 0.000000 | 0.000000 | 100.0 |
| |3>>> |_start_thread | 1 | 3 | wall_clock | sec | 0.000034 | 0.000034 | 0.000034 | 0.000034 | 0.000000 | 0.000000 | 100.0 |
| |4>>> |_start_thread | 1 | 3 | wall_clock | sec | 0.000039 | 0.000039 | 0.000039 | 0.000039 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_run | 1 | 1 | wall_clock | sec | 4.722993 | 4.722993 | 4.722993 | 4.722993 | 0.000000 | 0.000000 | 0.0 |
| |0>>> |_std::char_traits<char>::length | 1 | 2 | wall_clock | sec | 0.000001 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_std::distance<char const*> | 1 | 2 | wall_clock | sec | 0.000001 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_std::operator+<char, std::char_traits<char>, std::allocator<char> > | 2 | 2 | wall_clock | sec | 0.000002 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_run(20) x 100 | 1 | 2 | wall_clock | sec | 4.722951 | 4.722951 | 4.722951 | 4.722951 | 0.000000 | 0.000000 | 0.0 |
| |0>>> |_run [{94,25}-{96,25}] | 1 | 3 | wall_clock | sec | 4.722925 | 4.722925 | 4.722925 | 4.722925 | 0.000000 | 0.000000 | 0.0 |
| |0>>> |_fib | 100 | 4 | wall_clock | sec | 4.722718 | 0.047227 | 0.046713 | 0.051987 | 0.000000 | 0.000625 | 0.0 |
| |0>>> |_fib | 200 | 5 | wall_clock | sec | 4.722302 | 0.023612 | 0.017827 | 0.034091 | 0.000032 | 0.005627 | 0.0 |
| |0>>> |_fib | 400 | 6 | wall_clock | sec | 4.721485 | 0.011804 | 0.006790 | 0.023003 | 0.000016 | 0.004024 | 0.0 |
| |0>>> |_fib | 800 | 7 | wall_clock | sec | 4.719858 | 0.005900 | 0.002564 | 0.016078 | 0.000006 | 0.002498 | 0.1 |
| |0>>> |_fib | 1600 | 8 | wall_clock | sec | 4.716572 | 0.002948 | 0.000977 | 0.011849 | 0.000002 | 0.001465 | 0.1 |
| |0>>> |_fib | 3200 | 9 | wall_clock | sec | 4.709918 | 0.001472 | 0.000371 | 0.008246 | 0.000001 | 0.000831 | 0.3 |
| |0>>> |_fib | 6400 | 10 | wall_clock | sec | 4.696775 | 0.000734 | 0.000140 | 0.005111 | 0.000000 | 0.000461 | 0.6 |
| |0>>> |_fib | 12800 | 11 | wall_clock | sec | 4.670093 | 0.000365 | 0.000050 | 0.003166 | 0.000000 | 0.000253 | 1.1 |
| |0>>> |_fib | 25600 | 12 | wall_clock | sec | 4.617496 | 0.000180 | 0.000017 | 0.001959 | 0.000000 | 0.000137 | 2.3 |
| |0>>> |_fib | 51200 | 13 | wall_clock | sec | 4.512671 | 0.000088 | 0.000004 | 0.001212 | 0.000000 | 0.000074 | 4.6 |
| |0>>> |_fib | 102400 | 14 | wall_clock | sec | 4.304142 | 0.000042 | 0.000000 | 0.000752 | 0.000000 | 0.000039 | 9.6 |
| |0>>> |_fib | 202600 | 15 | wall_clock | sec | 3.892580 | 0.000019 | 0.000000 | 0.000469 | 0.000000 | 0.000021 | 19.0 |
| |0>>> |_fib | 363200 | 16 | wall_clock | sec | 3.151143 | 0.000009 | 0.000000 | 0.000293 | 0.000000 | 0.000011 | 33.2 |
| |0>>> |_fib | 502000 | 17 | wall_clock | sec | 2.105217 | 0.000004 | 0.000000 | 0.000183 | 0.000000 | 0.000006 | 49.1 |
| |0>>> |_fib | 476000 | 18 | wall_clock | sec | 1.071652 | 0.000002 | 0.000000 | 0.000114 | 0.000000 | 0.000004 | 63.6 |
| |0>>> |_fib | 294200 | 19 | wall_clock | sec | 0.390193 | 0.000001 | 0.000000 | 0.000071 | 0.000000 | 0.000003 | 75.3 |
| |0>>> |_fib | 115200 | 20 | wall_clock | sec | 0.096190 | 0.000001 | 0.000000 | 0.000043 | 0.000000 | 0.000002 | 84.4 |
| |0>>> |_fib | 27400 | 21 | wall_clock | sec | 0.015020 | 0.000001 | 0.000000 | 0.000025 | 0.000000 | 0.000001 | 91.1 |
| |0>>> |_fib | 3600 | 22 | wall_clock | sec | 0.001336 | 0.000000 | 0.000000 | 0.000013 | 0.000000 | 0.000001 | 96.3 |
| |0>>> |_fib | 200 | 23 | wall_clock | sec | 0.000050 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_std::char_traits<char>::length | 1 | 3 | wall_clock | sec | 0.000001 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_std::distance<char const*> | 1 | 3 | wall_clock | sec | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_std::operator+<char, std::char_traits<char>, std::allocator<char> > | 2 | 3 | wall_clock | sec | 0.000001 | 0.000001 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_std::operator& | 1 | 1 | wall_clock | sec | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
| |0>>> std::vector<std::thread, std::allocator<std::thread> >::~vector | 1 | 0 | wall_clock | sec | 0.000045 | 0.000045 | 0.000045 | 0.000045 | 0.000000 | 0.000000 | 32.7 |
| |0>>> |_std::thread::~thread | 4 | 1 | wall_clock | sec | 0.000030 | 0.000007 | 0.000007 | 0.000009 | 0.000000 | 0.000001 | 31.2 |
| |0>>> |_std::thread::joinable | 4 | 2 | wall_clock | sec | 0.000021 | 0.000005 | 0.000005 | 0.000006 | 0.000000 | 0.000001 | 89.4 |
| |0>>> |_std::thread::id::id | 4 | 3 | wall_clock | sec | 0.000001 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_std::operator== | 4 | 3 | wall_clock | sec | 0.000001 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_std::allocator_traits<std::allocator<std::thread> >::deallocate | 1 | 1 | wall_clock | sec | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_std::allocator<std::thread>::~allocator | 1 | 1 | wall_clock | sec | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
```
-23
@@ -1,23 +0,0 @@
# YouTube Tutorials
```eval_rst
.. toctree::
:glob:
:maxdepth: 3
```
## Installing a binary release
<p align="center"><iframe width="560" height="315" src="https://www.youtube.com/embed/gKtNCKf1IXA?modestbranding=1" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>
## Instrumenting a binary
<p align="center"><iframe width="560" height="315" src="https://www.youtube.com/embed/2B0gRr3FygQ?modestbranding=1" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>
## Writing an omnitrace configuration file
<p align="center"><iframe width="560" height="315" src="https://www.youtube.com/embed/oG_fPYx9_gs?modestbranding=1" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>
## Visualization and Features of Perfetto Traces
<p align="center"><iframe width="560" height="315" src="https://www.youtube.com/embed/7WN3N1hnCbI?modestbranding=1" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>