Omnitrace docs refactoring (#353)

* Add Sphinx and Read the Docs configs

* Add documentation workflow configurations

* Changed macros verbprintf and verbprintf_bare so they write to stdout… (#346)

Flush stdout when listing keys + bump verbose level for GPU count

* Removing static version asserts. (#347)

It is causing failures on our internal builds

Signed-off-by: David Galiffi <David.Galiffi@amd.com>

* Check for an empty vector before popping (#350)

Protect from possible seg. fault

Signed-off-by: David Galiffi <David.Galiffi@amd.com>

* Add release links to installation.md (#351)

* Initial infrastructure rework for Omnitrace refactoring and a rewrite of the What is file

* Add files in conceptual section, along with images and infrastructure changes.

* Formatting and style fixes for files in conceptual directory

* Add quick start install guide and fix spelling errors in other files

* Add install document and fix code tags. Infrastructure changes

* Add two how-to guides along with infra changes and spelling fixes

* Add two new how to files and fix errors in the last commit

* Fix spelling mistakes

* Add new how to file on causal profiling and infra changes.

* Add how to file on interpreting Omnitrace output, fixes, and images

* Add remaining how-to guides and reference materials along with fixes and infrastructure

* Add YouTube file and fix spelling and formatting

* Fix a few loose ends and add link to license page

* Add Sphinx and Doxygen infrastructure and some additional corrections

* Update rocm-docs-core

* Fix Doxyfile

* Fix path to API header files

* Run doxysphinx in conf.py

* Add back custom css for doxygen

* Remove doxygenlayout

* Add api to toc

* Update Doxyfile

Generate from source .in

* Proofreading edits and other changes

* Add .gitignore for Doxygen and remove deprecated words and typos

* Fix one additional typo

* Turn off dot

* Update doxyfile strip from path

* Workflow, submodules, and thread info Updates (#352)

* Update CI workflows

- use node20 workflow packages

* Update tests/source/CMakeLists.txt

- Use OMNITRACE_TRACE and OMNTRACE_PROFILE instead of perfetto/timemory

* Update timemory submodule

- argparse: requires -> required
- parse callbacks

* Update thread_info.cpp

- fix causal::delay::get_local usage

* Update timemory submodule

* Update kokkos submodule

- release 3.7.02

* Revert opensuse.yml and ubuntu-bionic.yml to use node16 workflows

* Update docs.yml

* ROCm 6.1 Installers (#349)

* Add ROCm 6.1 to packages
* Bump version to 1.11.3
* Add 6.1 support to the docker build support.
   Simplified this by adding 6.* to case statements, now that repo links have been standardized.

* Update timemory submodule (#354)

- fix argparse::argument::required template deduction

* Build omnitrace-rt library (#355)

* Build omnitrace-rt library

- Explicitly build dyninstAPI_RT as omnitrace-rt so that the SONAME in the ELF is omnitrace-rt instead of dyninstAPI_RT
- Create symbolic link lib/omnitrace/libdyninstAPI_RT.so which points to lib/libomnitrace-rt.so
- Simplify build tree location of libomnitrace-rt.so since it is ../lib from the bin directory even in the build tree
- Update dyninst submodule with minor tweaks to dyninstAPI_RT/CMakeLists.txt

* Update source/lib/omnitrace-rt/cmake/platform.cmake

* Use ftpmirror.gnu.org instead of ftp.gnu.org

- in timemory and dyninst submodules
- minor .clang-tidy tweak

* Executables append omnitrace library directory to LD_LIBRARY_PATH (#356)

- omnitrace-run, omnitrace-sample, and omnitrace-causal now automatically append the LD_LIBRARY_PATH with the directory containing the omnitrace libraries
  - this helps ensure that binary rewritten exes can resolve omnitrace-rt library location

* Fix a few typos and formatting issues

* Additional fixes and minor formatting changes.

* More fixes and minor formatting changes.

* Complete second proofreading with fixes and minor formatting changes.

* Make changes to table of contents and disable linting

* Update links in the README doc to reflect the new structure.

* Align intro on the Omnitrace index page with the first paragraph of the what-is page

* Changes and edits based on review comments

* Additional changes and edits based on external review

* Additional updates and changes from the external review of Omnitrace

* Additional changes based on the external review

* New round of edits based on the external review

* Additional edits based on the external review

* Changes to address comments from the internal review

* Correct to the RHEL SELinux note in the troubleshooting guide

* One additional change to the development guide code example

* Move troubleshooting to post-install of install.rst and other minor edits.

* Remove troubleshooting page and modify new post-install troubleshooting section on install.rst

* Refactor the how Omnitrace works page into seperate topics and redo infrastructure

* API ToC changes

* Additional API and ToC changes

* Back out API and ToC changes and update requirements.txt

* Additional API and ToC changes

* Add commit for signing purposes

* Add ElfUtils and BinUtils Download URL Overrides (#358)

* Add CMake CACHE Variable ElfUtils_DOWNLOAD_URL

Used to override the default URL to download ElfUtils from.
Useful for internal builds

Also, include a mirror to fallback to if the override URL fails.

* Update timemory submodule

Updating to include the BINUTIL_DOWNLOAD_URL override cmake
variable.

---------

Signed-off-by: David Galiffi <David.Galiffi@amd.com>

* Remove Ubuntu 18.04 and SUSE 15.2

* Update checkout action to v4

* Add `docs/**` to `paths-ignore`

Document location is being refactored.

* Modified submodules dyninst and timemory. (#361)

---------

Signed-off-by: David Galiffi <David.Galiffi@amd.com>
Co-authored-by: Peter Jun Park <peter.park@amd.com>
Co-authored-by: ajanicijamd <Aleksandar.Janicijevic@amd.com>
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
Co-authored-by: Jonathan R. Madsen <jrmadsen@users.noreply.github.com>
Co-authored-by: Sam Wu <22262939+samjwu@users.noreply.github.com>

[ROCm/rocprofiler-systems commit: 0689797736]
Tá an tiomantas seo le fáil i:
Jeffrey Novotny
2024-07-29 17:23:36 -04:00
tiomanta ag GitHub
tuismitheoir a416c7d817
tiomantas dfaa4dc9c5
D'athraigh 38 comhad le 7122 breiseanna agus 9 scriosta
+1
Féach ar an gComhad
@@ -4,3 +4,4 @@
docs/* @ROCm/rocm-documentation
*.md @ROCm/rocm-documentation
*.rst @ROCm/rocm-documentation
.readthedocs.yaml @ROCm/rocm-documentation
+11
Féach ar an gComhad
@@ -9,3 +9,14 @@ updates:
directory: "/" # Location of package manifests
schedule:
interval: "weekly"
- package-ecosystem: "pip" # See documentation for possible values
directory: "/docs/sphinx" # Location of package manifests
open-pull-requests-limit: 10
schedule:
interval: "daily"
labels:
- "documentation"
- "dependencies"
reviewers:
- "samjwu"
+4
Féach ar an gComhad
@@ -37,6 +37,10 @@
# Python cache files
*.pyc
# Documentation artifacts
/_build
_toc.yml
/build*
/.vscode
/.cache
+18
Féach ar an gComhad
@@ -0,0 +1,18 @@
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
version: 2
build:
os: ubuntu-22.04
tools:
python: "3.10"
python:
install:
- requirements: docs/sphinx/requirements.txt
sphinx:
configuration: docs/conf.py
formats: []
+7 -9
Féach ar an gComhad
@@ -7,8 +7,6 @@
[![Installer Packaging (CPack)](https://github.com/ROCm/omnitrace/actions/workflows/cpack.yml/badge.svg)](https://github.com/ROCm/omnitrace/actions/workflows/cpack.yml)
[![Documentation](https://github.com/ROCm/omnitrace/actions/workflows/docs.yml/badge.svg)](https://github.com/ROCm/omnitrace/actions/workflows/docs.yml)
> ***[Omnitrace](https://github.com/ROCm/omnitrace) is an AMD open source research project and is not supported as part of the ROCm software stack.***
## Overview
AMD Research is seeking to improve observability and performance analysis for software running on AMD heterogeneous systems.
@@ -86,8 +84,8 @@ such as the memory usage, page-faults, and context-switches, and thread-level me
## Documentation
The full documentation for [omnitrace](https://github.com/ROCm/omnitrace) is available at [rocm.github.io/omnitrace](https://rocm.github.io/omnitrace/).
See the [Getting Started documentation](https://rocm.github.io/omnitrace/getting_started) for general tips and a detailed discussion about sampling vs. binary instrumentation.
The full documentation for [omnitrace](https://github.com/ROCm/omnitrace) is available at [the ROCm Omnitrace documentation repository](https://rocm.docs.amd.com/projects/omnitrace/en/latest/index.html).
See the [Getting Started documentation](https://rocm.docs.amd.com/projects/omnitrace/en/conceptual/how-omnitrace-works.html) for general tips and a detailed discussion about sampling vs. binary instrumentation.
## Quick Start
@@ -108,7 +106,7 @@ wget https://github.com/ROCm/omnitrace/releases/latest/download/omnitrace-instal
python3 ./omnitrace-install.py --prefix /opt/omnitrace/rocm-5.4 --rocm 5.4
```
See the [Installation Documentation](https://rocm.github.io/omnitrace/installation) for detailed information.
See the [Installation Documentation](https://rocm.docs.amd.com/projects/omnitrace/en/install/install.html) for detailed information.
### Setup
@@ -297,13 +295,13 @@ for `foo` via the direct call within `spam`. There will be no entries for `bar`
- Select "Open trace file" from panel on the left
- Locate the omnitrace perfetto output (extension: `.proto`)
![omnitrace-perfetto](source/docs/images/omnitrace-perfetto.png)
![omnitrace-perfetto](docs/data/omnitrace-perfetto.png)
![omnitrace-rocm](source/docs/images/omnitrace-rocm.png)
![omnitrace-rocm](docs/data/omnitrace-rocm.png)
![omnitrace-rocm-flow](source/docs/images/omnitrace-rocm-flow.png)
![omnitrace-rocm-flow](docs/data/omnitrace-rocm-flow.png)
![omnitrace-user-api](source/docs/images/omnitrace-user-api.png)
![omnitrace-user-api](docs/data/omnitrace-user-api.png)
## Using Perfetto tracing with System Backend
+2
Féach ar an gComhad
@@ -0,0 +1,2 @@
_build/
_doxygen/
@@ -0,0 +1,146 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
**********************
Data collection modes
**********************
Omnitrace supports several modes of recording trace and profiling data for your application.
.. note::
For an explanation of the terms used in this topic, see
the :doc:`Omnitrace glossary <../reference/omnitrace-glossary>`.
+-----------------------------+---------------------------------------------------------+
| Mode | Description |
+=============================+=========================================================+
| Binary Instrumentation | Locates functions (and loops, if desired) in the binary |
| | and inserts snippets at the entry and exit |
+-----------------------------+---------------------------------------------------------+
| Statistical Sampling | Periodically pauses application at specified intervals |
| | and records various metrics for the given call stack |
+-----------------------------+---------------------------------------------------------+
| Callback APIs | Parallelism frameworks such as ROCm, OpenMP, and Kokkos |
| | make callbacks into Omnitrace to provide information |
| | about the work the API is performing |
+-----------------------------+---------------------------------------------------------+
| Dynamic Symbol Interception | Wrap function symbols defined in a position independent |
| | dynamic library/executable, like ``pthread_mutex_lock`` |
| | in ``libpthread.so`` or ``MPI_Init`` in the MPI library |
+-----------------------------+---------------------------------------------------------+
| User API | User-defined regions and controls for Omnitrace |
+-----------------------------+---------------------------------------------------------+
The two most generic and important modes are binary instrumentation and statistical sampling.
It is important to understand their advantages and disadvantages.
Binary instrumentation and statistical sampling can be performed with the ``omnitrace-instrument``
executable. For statistical sampling, it's highly recommended to use the
``omnitrace-sample`` executable instead if binary instrumentation isn't required or needed.
Callback APIs and dynamic symbol interception can be utilized with either tool.
Binary instrumentation
-----------------------------------
Binary instrumentation lets you record deterministic measurements for
every single invocation of a given function.
Binary instrumentation effectively adds instructions to the target application to
collect the required information. It therefore has the potential to cause performance
changes which might, in some cases, lead to inaccurate results. The effect depends on
the information being collected and which features are activated in Omnitrace.
For example, collecting only the wall-clock timing data
has less of an effect than collecting the wall-clock timing, CPU-clock timing,
memory usage, cache-misses, and number of instructions that were run. Similarly,
collecting a flat profile has less overhead than a hierarchical profile
and collecting a trace OR a profile has less overhead than collecting a
trace AND a profile.
In Omnitrace, the primary heuristic for controlling the overhead with binary
instrumentation is the minimum number of instructions for selecting functions
for instrumentation.
Statistical sampling
-----------------------------------
Statistical call-stack sampling periodically interrupts the application at
regular intervals using operating system interrupts.
Sampling is typically less numerically accurate and specific, but the
target program runs at nearly full speed.
In contrast to the data derived from binary instrumentation, the resulting
data is not exact but is instead a statistical approximation.
However, sampling often provides a more accurate picture of the application
execution because it is less intrusive to the target application and has fewer
side effects on memory caches or instruction decoding pipelines. Furthermore,
because sampling does not affect the execution speed as much, is it
relatively immune to over-evaluating the cost of small, frequently called
functions or "tight" loops.
In Omnitrace, the overhead for statistical sampling depends on the
sampling rate and whether the samples are taken with respect to the CPU time
and/or real time.
Binary instrumentation vs. statistical sampling example
-------------------------------------------------------
Consider the following code:
.. code-block:: c++
long fib(long n)
{
if(n < 2) return n;
return fib(n - 1) + fib(n - 2);
}
void run(long n)
{
long result = fib(n);
printf("[%li] fibonacci(%li) = %li\n", i, n, result);
}
int main(int argc, char** argv)
{
long nfib = 30;
long nitr = 10;
if(argc > 1) nfib = atol(argv[1]);
if(argc > 2) nitr = atol(argv[2]);
for(long i = 0; i < nitr; ++i)
run(nfib);
return 0;
}
Binary instrumentation of the ``fib`` function will record **every single invocation**
of the function. For a very small function
such as ``fib``, this results in **significant** overhead since this simple function
takes about 20 instructions, whereas the entry and
exit snippets are ~1024 instructions. Therefore, you generally want to avoid
instrumenting functions where the instrumented function has significantly fewer
instructions than entry and exit instrumentation. (Note that many of the
instructions in entry and exit functions are either logging functions or
depend on the runtime settings and thus might never run). However,
due to the number of potential instructions in the entry and exit snippets,
the default behavior of ``omnitrace-instrument`` is to only instrument functions
which contain fewer than 1024 instructions.
However, recording every single invocation of the function can be extremely
useful for detecting anomalies, such as profiles that show minimum or maximum values much smaller or larger
than the average or a high standard deviation. In this case, the traces help you
identify exactly when and where those instances deviated from the norm.
Compare the level of detail in the following traces. In the top image,
every instance of the ``fib`` function is instrumented, while in the bottom image,
the ``fib`` call-stack is derived via sampling.
Binary instrumentation of the Fibonacci function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. image:: ../data/fibonacci-instrumented.png
:alt: Visualization of the output of a binary instrumentation of the Fibonacci function
Statistical sampling of the Fibonacci function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. image:: ../data/fibonacci-sampling.png
:alt: Visualization of the output of a statistical sample of the Fibonacci function
@@ -0,0 +1,137 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
***************************************
The Omnitrace feature set and use cases
***************************************
`Omnitrace <https://github.com/ROCm/omnitrace>`_ is designed to be highly extensible.
Internally, it leverages the `Timemory performance analysis toolkit <https://github.com/NERSC/timemory>`_
to manage extensions, resources, data, and other items. It supports the following features,
modes, metrics, and APIs.
Data collection modes
========================================
* Dynamic instrumentation
* Runtime instrumentation: Instrument executables and shared libraries at runtime
* Binary rewriting: Generate a new executable and/or library with instrumentation built-in
* Statistical sampling: Periodic software interrupts per-thread
* Process-level sampling: A background thread records process-, system- and device-level metrics while the application runs
* Causal profiling: Quantifies the potential impact of optimizations in parallel code
.. note::
Critical trace support was removed in Omnitrace v1.11.0.
It was replaced by the causal profiling feature.
Data analysis
========================================
* High-level summary profiles with mean, min, max, and standard deviation statistics
* Low overhead and memory efficient
* Ideal for running at scale
* Comprehensive traces for every individual event and measurement
* Application speed-up predictions resulting from potential optimizations in functions and lines of code based on causal profiling
Parallelism API support
========================================
* HIP
* HSA
* Pthreads
* MPI
* Kokkos-Tools (KokkosP)
* OpenMP-Tools (OMPT)
GPU metrics
========================================
* GPU hardware counters
* HIP API tracing
* HIP kernel tracing
* HSA API tracing
* HSA operation tracing
* System-level sampling (via rocm-smi)
* Memory usage
* Power usage
* Temperature
* Utilization
CPU metrics
========================================
* CPU hardware counters sampling and profiles
* CPU frequency sampling
* Various timing metrics
* Wall time
* CPU time (process and thread)
* CPU utilization (process and thread)
* User CPU time
* Kernel CPU time
* Various memory metrics
* High-water mark (sampling and profiles)
* Memory page allocation
* Virtual memory usage
* Network statistics
* I/O metrics
* Many others
Third-party API support
========================================
* TAU
* LIKWID
* Caliper
* CrayPAT
* VTune
* NVTX
* ROCTX
Omnitrace use cases
========================================
When analyzing the performance of an application, do NOT
assume you know where the performance bottlenecks are
and why they are happening. Omnitrace is a tool for analyzing the entire
application and its performance. It is
ideal for characterizing where optimization would have the greatest impact
on an end-to-end run of the application and for
viewing what else is happening on the system during a performance bottleneck.
When GPUs are involved, there is a tendency to assume that
the quickest path to performance improvement is minimizing
the runtime of the GPU kernels. This is a highly flawed assumption.
If you optimize the runtime of a kernel from one millisecond
to 1 microsecond (1000x speed-up) but the original application never
spent time waiting for kernels to complete,
there would be no statistically significant reduction in the end-to-end
runtime of your application. In other words, it does not matter
how fast or slow the code on GPU is if the application has a
bottleneck on waiting on the GPU.
Use Omnitrace to obtain a high-level view of the entire application. Use it
to determine where the performance bottlenecks are and
obtain clues to why these bottlenecks are happening. Rather than worrying about kernel
performance, start your investigation with Omnitrace, which characterizes the
broad picture.
.. note::
For insight into the execution of individual kernels on the GPU,
use `Omniperf <https://github.com/rocm/omniperf>`_.
In terms of CPU analysis, Omnitrace does not target any specific vendor.
It works just as well on AMD and non-AMD CPUs.
With regard to the GPU, Omnitrace is currently restricted to HIP and HSA APIs
and kernels running on AMD GPUs.
+56
Féach ar an gComhad
@@ -0,0 +1,56 @@
# MIT License
# Copyright (c) 2023 - 2024 Advanced Micro Devices, Inc. All rights reserved.
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
import re
from rocm_docs import ROCmDocs
with open("../VERSION", encoding="utf-8") as f:
match = re.search(r"([0-9.]+)[^0-9.]+", f.read())
if not match:
raise ValueError("VERSION not found!")
version_number = match[1]
external_projects_current_project = "omnitrace"
project = "omnitrace"
author = "Advanced Micro Devices, Inc."
copyright = "Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved."
version = version_number
release = version_number
html_title = f"Omnitrace {version} documentation"
external_toc_path = "./sphinx/_toc.yml"
docs_core = ROCmDocs(html_title)
docs_core.setup()
docs_core.run_doxygen(doxygen_root="doxygen", doxygen_path="doxygen/xml")
docs_core.enable_api_reference()
for sphinx_var in ROCmDocs.SPHINX_VARS:
globals()[sphinx_var] = getattr(docs_core, sphinx_var)
Ní thaispeántar comhad dénártha.

Tar éis

Leithead:  |  Airde:  |  Méid: 27 KiB

Ní thaispeántar comhad dénártha.

Tar éis

Leithead:  |  Airde:  |  Méid: 106 KiB

Ní thaispeántar comhad dénártha.

Tar éis

Leithead:  |  Airde:  |  Méid: 408 KiB

Ní thaispeántar comhad dénártha.

Tar éis

Leithead:  |  Airde:  |  Méid: 313 KiB

Ní thaispeántar comhad dénártha.

Tar éis

Leithead:  |  Airde:  |  Méid: 195 KiB

Ní thaispeántar comhad dénártha.

Tar éis

Leithead:  |  Airde:  |  Méid: 230 KiB

Ní thaispeántar comhad dénártha.

Tar éis

Leithead:  |  Airde:  |  Méid: 277 KiB

@@ -0,0 +1,3 @@
html/
latex/
xml/
+373
Féach ar an gComhad
@@ -0,0 +1,373 @@
# Doxyfile 1.8.20
#---------------------------------------------------------------------------
# Project related configuration options
#---------------------------------------------------------------------------
DOXYFILE_ENCODING = UTF-8
PROJECT_NAME = omnitrace
PROJECT_NUMBER = 1.11.3
PROJECT_BRIEF = "High-level and comprehensive application tracing and profiling on both the CPU and GPU"
PROJECT_LOGO =
OUTPUT_DIRECTORY = .
CREATE_SUBDIRS = NO
ALLOW_UNICODE_NAMES = YES
OUTPUT_LANGUAGE = English
OUTPUT_TEXT_DIRECTION = None
BRIEF_MEMBER_DESC = YES
REPEAT_BRIEF = YES
ABBREVIATE_BRIEF =
ALWAYS_DETAILED_SEC = YES
INLINE_INHERITED_MEMB = YES
FULL_PATH_NAMES = YES
STRIP_FROM_PATH = /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-omnitrace/checkouts/
STRIP_FROM_INC_PATH = /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-omnitrace/checkouts/
SHORT_NAMES = NO
JAVADOC_AUTOBRIEF = NO
JAVADOC_BANNER = NO
QT_AUTOBRIEF = NO
MULTILINE_CPP_IS_BRIEF = YES
PYTHON_DOCSTRING = YES
INHERIT_DOCS = YES
SEPARATE_MEMBER_PAGES = NO
TAB_SIZE = 4
ALIASES =
OPTIMIZE_OUTPUT_FOR_C = NO
OPTIMIZE_OUTPUT_JAVA = NO
OPTIMIZE_FOR_FORTRAN = NO
OPTIMIZE_OUTPUT_VHDL = NO
OPTIMIZE_OUTPUT_SLICE = NO
EXTENSION_MAPPING = hpp=C++ \
cpp=C++ \
hh=C++ \
cc=C++ \
h=C \
c=C \
py=Python
MARKDOWN_SUPPORT = YES
TOC_INCLUDE_HEADINGS = 2
AUTOLINK_SUPPORT = YES
BUILTIN_STL_SUPPORT = YES
CPP_CLI_SUPPORT = NO
SIP_SUPPORT = NO
IDL_PROPERTY_SUPPORT = YES
DISTRIBUTE_GROUP_DOC = NO
GROUP_NESTED_COMPOUNDS = YES
SUBGROUPING = YES
INLINE_GROUPED_CLASSES = NO
INLINE_SIMPLE_STRUCTS = YES
TYPEDEF_HIDES_STRUCT = NO
LOOKUP_CACHE_SIZE = 5
NUM_PROC_THREADS = 0
#---------------------------------------------------------------------------
# Build related configuration options
#---------------------------------------------------------------------------
EXTRACT_ALL = YES
EXTRACT_PRIVATE = NO
EXTRACT_PRIV_VIRTUAL = NO
EXTRACT_PACKAGE = NO
EXTRACT_STATIC = NO
EXTRACT_LOCAL_CLASSES = YES
EXTRACT_LOCAL_METHODS = NO
EXTRACT_ANON_NSPACES = NO
HIDE_UNDOC_MEMBERS = NO
HIDE_UNDOC_CLASSES = YES
HIDE_FRIEND_COMPOUNDS = NO
HIDE_IN_BODY_DOCS = NO
INTERNAL_DOCS = NO
CASE_SENSE_NAMES = NO
HIDE_SCOPE_NAMES = NO
HIDE_COMPOUND_REFERENCE= NO
SHOW_INCLUDE_FILES = YES
SHOW_GROUPED_MEMB_INC = NO
FORCE_LOCAL_INCLUDES = YES
INLINE_INFO = YES
SORT_MEMBER_DOCS = YES
SORT_BRIEF_DOCS = NO
SORT_MEMBERS_CTORS_1ST = YES
SORT_GROUP_NAMES = NO
SORT_BY_SCOPE_NAME = NO
STRICT_PROTO_MATCHING = NO
GENERATE_TODOLIST = NO
GENERATE_TESTLIST = NO
GENERATE_BUGLIST = NO
GENERATE_DEPRECATEDLIST= NO
ENABLED_SECTIONS =
MAX_INITIALIZER_LINES = 30
SHOW_USED_FILES = YES
SHOW_FILES = YES
SHOW_NAMESPACES = YES
FILE_VERSION_FILTER =
LAYOUT_FILE =
CITE_BIB_FILES =
#---------------------------------------------------------------------------
# Configuration options related to warning and progress messages
#---------------------------------------------------------------------------
QUIET = NO
WARNINGS = YES
WARN_IF_UNDOCUMENTED = YES
WARN_IF_DOC_ERROR = YES
WARN_NO_PARAMDOC = YES
WARN_AS_ERROR = YES
WARN_FORMAT = "---> WARNING! $file:$line: $text"
WARN_LOGFILE = doc/warnings.log
#---------------------------------------------------------------------------
# Configuration options related to the input files
#---------------------------------------------------------------------------
INPUT = ../../README.md \
../../source/lib/omnitrace-user/omnitrace/types.h \
../../source/lib/omnitrace-user/omnitrace/categories.h \
../../source/lib/omnitrace-user/omnitrace/user.h \
../../source/lib/omnitrace-user/omnitrace/causal.h
INPUT_ENCODING = UTF-8
FILE_PATTERNS = *.h \
*.hh \
*.hpp \
*.c \
*.cc \
*.cxx \
*.cpp \
*.c++ \
*.icc \
*.tcc \
*.py
RECURSIVE = YES
EXCLUDE =
EXCLUDE_SYMLINKS = YES
EXCLUDE_PATTERNS = */.git/* \
../../external/* \
../../examples/* \
../../tests/*
EXCLUDE_SYMBOLS = "std::*" \
"OMNITRACE_ATTRIBUTE" \
"OMNITRACE_VISIBILITY" \
"OMNITRACE_PUBLIC_API" \
"OMNITRACE_HIDDEN_API" \
"SpaceHandle" \
"KokkosPDevice*"
EXAMPLE_PATH = ../../examples
EXAMPLE_PATTERNS = *.h \
*.hh \
*.hpp \
*.c \
*.cc \
*.cpp \
*.py \
*.txt
EXAMPLE_RECURSIVE = YES
IMAGE_PATH =
INPUT_FILTER =
FILTER_PATTERNS =
FILTER_SOURCE_FILES = NO
FILTER_SOURCE_PATTERNS =
USE_MDFILE_AS_MAINPAGE = ../../README.md
#---------------------------------------------------------------------------
# Configuration options related to source browsing
#---------------------------------------------------------------------------
SOURCE_BROWSER = YES
INLINE_SOURCES = YES
STRIP_CODE_COMMENTS = NO
REFERENCED_BY_RELATION = YES
REFERENCES_RELATION = YES
REFERENCES_LINK_SOURCE = YES
SOURCE_TOOLTIPS = YES
USE_HTAGS = NO
VERBATIM_HEADERS = YES
#---------------------------------------------------------------------------
# Configuration options related to the alphabetical class index
#---------------------------------------------------------------------------
ALPHABETICAL_INDEX = YES
COLS_IN_ALPHA_INDEX = 5
IGNORE_PREFIX =
#---------------------------------------------------------------------------
# Configuration options related to the HTML output
#---------------------------------------------------------------------------
GENERATE_HTML = YES
HTML_OUTPUT = html
HTML_FILE_EXTENSION = .html
HTML_HEADER = ../_doxygen/header.html
HTML_FOOTER = ../_doxygen/footer.html
HTML_STYLESHEET = ../_doxygen/stylesheet.css
HTML_EXTRA_STYLESHEET = ../_doxygen/extra_stylesheet.css
HTML_EXTRA_FILES =
HTML_COLORSTYLE_HUE = 220
HTML_COLORSTYLE_SAT = 100
HTML_COLORSTYLE_GAMMA = 80
HTML_TIMESTAMP = YES
HTML_DYNAMIC_MENUS = YES
HTML_DYNAMIC_SECTIONS = YES
HTML_INDEX_NUM_ENTRIES = 1000
GENERATE_DOCSET = NO
DOCSET_FEEDNAME = "Doxygen generated docs"
DOCSET_BUNDLE_ID = org.doxygen.omnitrace
DOCSET_PUBLISHER_ID = org.doxygen.amdresearch
DOCSET_PUBLISHER_NAME = "Audacious Software Group"
GENERATE_HTMLHELP = NO
CHM_FILE =
HHC_LOCATION =
GENERATE_CHI = NO
CHM_INDEX_ENCODING =
BINARY_TOC = NO
TOC_EXPAND = YES
GENERATE_QHP = NO
QCH_FILE =
QHP_NAMESPACE =
QHP_VIRTUAL_FOLDER = doc
QHP_CUST_FILTER_NAME =
QHP_CUST_FILTER_ATTRS =
QHP_SECT_FILTER_ATTRS =
QHG_LOCATION =
GENERATE_ECLIPSEHELP = NO
ECLIPSE_DOC_ID = org.doxygen.omnitrace
DISABLE_INDEX = NO
GENERATE_TREEVIEW = NO
ENUM_VALUES_PER_LINE = 1
TREEVIEW_WIDTH = 300
EXT_LINKS_IN_WINDOW = YES
HTML_FORMULA_FORMAT = png
FORMULA_FONTSIZE = 12
FORMULA_TRANSPARENT = YES
FORMULA_MACROFILE =
USE_MATHJAX = NO
MATHJAX_FORMAT = HTML-CSS
MATHJAX_RELPATH = http://cdn.mathjax.org/mathjax/latest
MATHJAX_EXTENSIONS =
MATHJAX_CODEFILE =
SEARCHENGINE = NO
SERVER_BASED_SEARCH = NO
EXTERNAL_SEARCH = NO
SEARCHENGINE_URL =
SEARCHDATA_FILE = searchdata.xml
EXTERNAL_SEARCH_ID =
EXTRA_SEARCH_MAPPINGS =
#---------------------------------------------------------------------------
# Configuration options related to the LaTeX output
#---------------------------------------------------------------------------
GENERATE_LATEX = NO
LATEX_OUTPUT = latex
LATEX_CMD_NAME = latex
MAKEINDEX_CMD_NAME = makeindex
LATEX_MAKEINDEX_CMD = makeindex
COMPACT_LATEX = NO
PAPER_TYPE = a4wide
EXTRA_PACKAGES = float
LATEX_HEADER =
LATEX_FOOTER =
LATEX_EXTRA_STYLESHEET =
LATEX_EXTRA_FILES =
PDF_HYPERLINKS = YES
USE_PDFLATEX = YES
LATEX_BATCHMODE = YES
LATEX_HIDE_INDICES = NO
LATEX_SOURCE_CODE = YES
LATEX_BIB_STYLE = plain
LATEX_TIMESTAMP = NO
LATEX_EMOJI_DIRECTORY =
#---------------------------------------------------------------------------
# Configuration options related to the RTF output
#---------------------------------------------------------------------------
GENERATE_RTF = NO
RTF_OUTPUT = rtf
COMPACT_RTF = NO
RTF_HYPERLINKS = NO
RTF_STYLESHEET_FILE =
RTF_EXTENSIONS_FILE =
RTF_SOURCE_CODE = NO
#---------------------------------------------------------------------------
# Configuration options related to the man page output
#---------------------------------------------------------------------------
GENERATE_MAN = NO
MAN_OUTPUT = man
MAN_EXTENSION = .3
MAN_SUBDIR =
MAN_LINKS = YES
#---------------------------------------------------------------------------
# Configuration options related to the XML output
#---------------------------------------------------------------------------
GENERATE_XML = YES
XML_OUTPUT = xml
XML_PROGRAMLISTING = YES
XML_NS_MEMB_FILE_SCOPE = YES
#---------------------------------------------------------------------------
# Configuration options related to the DOCBOOK output
#---------------------------------------------------------------------------
GENERATE_DOCBOOK = NO
DOCBOOK_OUTPUT = docbook
DOCBOOK_PROGRAMLISTING = NO
#---------------------------------------------------------------------------
# Configuration options for the AutoGen Definitions output
#---------------------------------------------------------------------------
GENERATE_AUTOGEN_DEF = NO
#---------------------------------------------------------------------------
# Configuration options related to the Perl module output
#---------------------------------------------------------------------------
GENERATE_PERLMOD = NO
PERLMOD_LATEX = NO
PERLMOD_PRETTY = YES
PERLMOD_MAKEVAR_PREFIX =
#---------------------------------------------------------------------------
# Configuration options related to the preprocessor
#---------------------------------------------------------------------------
ENABLE_PREPROCESSING = YES
MACRO_EXPANSION = YES
EXPAND_ONLY_PREDEF = NO
SEARCH_INCLUDES = YES
INCLUDE_PATH = ../../source/lib/omnitrace-user
INCLUDE_FILE_PATTERNS = *.h \
*.hpp
PREDEFINED = OMNITRACE_PUBLIC_API= \
OMNITRACE_HIDDEN_API= \
"OMNITRACE_ATTRIBUTE(...)=" \
"OMNITRACE_VISIBILITY(...)=" \
"__attribute__(x)=" \
"__declspec(x)=" \
"size_t=unsigned long" \
"uintptr_t=unsigned long" \
DOXYGEN_SHOULD_SKIP_THIS
EXPAND_AS_DEFINED =
SKIP_FUNCTION_MACROS = NO
#---------------------------------------------------------------------------
# Configuration options related to external references
#---------------------------------------------------------------------------
TAGFILES =
GENERATE_TAGFILE = html/tagfile.xml
ALLEXTERNALS = NO
EXTERNAL_GROUPS = YES
EXTERNAL_PAGES = YES
#---------------------------------------------------------------------------
# Configuration options related to the dot tool
#---------------------------------------------------------------------------
CLASS_DIAGRAMS = YES
DIA_PATH =
HIDE_UNDOC_RELATIONS = NO
HAVE_DOT = NO
DOT_NUM_THREADS = 0
DOT_FONTNAME = Helvetica
DOT_FONTSIZE = 12
DOT_FONTPATH =
CLASS_GRAPH = NO
COLLABORATION_GRAPH = YES
GROUP_GRAPHS = YES
UML_LOOK = YES
UML_LIMIT_NUM_FIELDS = 10
TEMPLATE_RELATIONS = YES
INCLUDE_GRAPH = YES
INCLUDED_BY_GRAPH = YES
CALL_GRAPH = NO
CALLER_GRAPH = NO
GRAPHICAL_HIERARCHY = YES
DIRECTORY_GRAPH = YES
DOT_IMAGE_FORMAT = svg
INTERACTIVE_SVG = YES
DOT_PATH = /usr/bin/dot
DOTFILE_DIRS =
MSCFILE_DIRS =
DIAFILE_DIRS =
PLANTUML_JAR_PATH =
PLANTUML_CFG_FILE =
PLANTUML_INCLUDE_PATH =
DOT_GRAPH_MAX_NODES = 50
MAX_DOT_GRAPH_DEPTH = 0
DOT_TRANSPARENT = NO
DOT_MULTI_TARGETS = YES
GENERATE_LEGEND = YES
DOT_CLEANUP = YES
Tá difríocht comhad cosc orthu toisc go bhfuil sé ró-mhór Difríocht Luchtaigh
@@ -0,0 +1,71 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
****************************************************
Configuring and validating the environment
****************************************************
After installing `Omnitrace <https://github.com/ROCm/omnitrace>`_, additional steps are required to set up
and validate the environment.
.. note::
The following instructions use the installation path ``/opt/omnitrace``. If
Omnitrace is installed elsewhere, substitute the actual installation path.
Configuring the environment
========================================
After Omnitrace is installed, source the ``setup-env.sh`` script to prefix the
``PATH``, ``LD_LIBRARY_PATH``, and other environment variables:
.. code-block:: shell
source /opt/omnitrace/share/omnitrace/setup-env.sh
Alternatively, if environment modules are supported, add the ``<prefix>/share/modulefiles`` directory
to ``MODULEPATH``:
.. code-block:: shell
module use /opt/omnitrace/share/modulefiles
.. note::
As an alternative, the above line can be added to the ``${HOME}/.modulerc`` file.
After Omnitrace has been added to the ``MODULEPATH``, it can be loaded
using ``module load omnitrace/<VERSION>`` and unloaded using ``module unload omnitrace/<VERSION>``.
.. code-block:: shell
module load omnitrace/1.0.0
module unload omnitrace/1.0.0
.. note::
You might also need to add the path to the ROCm libraries to ``LD_LIBRARY_PATH``,
for example, ``export LD_LIBRARY_PATH=/opt/rocm/lib:${LD_LIBRARY_PATH}``
Validating the environment configuration
========================================
If the following commands all run successfully with the expected output,
then you are ready to use Omnitrace:
.. code-block:: shell
which omnitrace
which omnitrace-avail
which omnitrace-sample
omnitrace-instrument --help
omnitrace-avail --all
omnitrace-sample --help
If Omnitrace was built with Python support, validate these additional commands:
.. code-block:: shell
which omnitrace-python
omnitrace-python --help
@@ -0,0 +1,60 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
**********************************
General tips for using Omnitrace
**********************************
Follow these general guidelines when using Omnitrace. For an explanation of the terms used in this topic, see
the :doc:`Omnitrace glossary <../reference/omnitrace-glossary>`.
* Use ``omnitrace-avail`` to look up configuration settings, hardware counters, and data collection components
* Use the ``-d`` flag for descriptions
* Generate a default configuration with ``omnitrace-avail -G ${HOME}/.omnitrace.cfg`` and adjust it
to the desired default behavior
* **Decide whether binary instrumentation, statistical sampling, or both** provides the desired performance data (for non-Python applications)
* Compile code with optimization enabled (``-O2`` or higher), disable asserts (i.e. ``-DNDEBUG``), and include debug info (for instance, ``-g1`` at a minimum)
* Compiling with debug info does not slow down the code, it only increases compile time and the size of the binary
* In CMake, this is generally done with the settings ``CMAKE_BUILD_TYPE=RelWithDebInfo`` or ``CMAKE_BUILD_TYPE=Release`` and ``CMAKE_<LANG>_FLAGS=-g1``
* **Use binary instrumentation for characterizing the performance of every invocation of specific functions**
* **Use statistical sampling to characterize the performance of the entire application while minimizing overhead**
* Enable statistical sampling after binary instrumentation to help "fill in the gaps" between instrumented regions
* Use the user API to create custom regions and enable/disable Omnitrace for specific processes, threads, and regions
* Dynamic symbol interception, callback APIs, and the user API are always available with binary instrumentation and sampling
* Dynamic symbol interception and callback APIs are (generally) controlled through ``OMNITRACE_USE_<API>``
options, for example, ``OMNITRACE_USE_KOKKOSP`` and ``OMNITRACE_USE_OMPT`` enable Kokkos-Tools and OpenMP-Tools
callbacks, respectively
* When generically seeking regions for performance improvement:
* **Start off by collecting a flat profile**
* Look for functions with high call counts, large cumulative runtimes/values, or large standard deviations
* When call counts are high, improving the performance of this function or "inlining" the function can result in quick and easy performance improvements
* When the standard deviation is high, collect a hierarchical profile and see if the high variation can be attributable to the calling context.
In this scenario, consider creating a specialized version of the function for the longer-running contexts
* **Collect a hierarchical profile** and verify the functions that are part of the "critical path" of your
application, as indicated in the flat profile
* For example, functions with high call counts but which are part of a "setup" or "post-processing"
phase that does not consume much time relative to the overall time are generally a lower priority for optimization
* **Use the information from the profiles when analyzing detailed traces**
* When using binary instrumentation in "trace" mode, **binary rewrites are preferable to runtime instrumentation**.
* Binary rewrites only instrument the functions defined in the target binary, whereas runtime instrumentation might instrument functions defined in the shared libraries which are linked into the target binary
* When using binary instrumentation with MPI, avoid runtime instrumentation
* Runtime instrumentation requires a fork and a ``ptrace``, which is generally incompatible with how MPI applications spawn processes
* Perform a binary rewrite of the executable (and optionally, libraries used by the executable) using MPI and run
the generated instrumented executable using ``omnitrace-run`` instead of the original.
For example, instead of ``mpirun -n 2 ./myexe``, use ``mpirun -n 2 omnitrace-run -- ./myexe.inst``, where
``myexe.inst`` is the instrumented ``myexe`` executable that was generated.
@@ -0,0 +1,942 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
****************************************************
Instrumenting and rewriting a binary application
****************************************************
There are three ways to perform instrumentation with the ``omnitrace-instrument`` executable:
* Runtime instrumentation
* Attaching to an already running process
* Binary rewrite
Here is a comparison of the three modes:
* Runtime instrumentation of the application using the ``omnitrace-instrument`` executable
(analogous to ``gdb --args <program> <args>``)
* This mode is the default if neither the ``-p`` nor ``-o`` command-line options are used
* Runtime instrumentation supports instrumenting not only the target executable but also
the shared libraries loaded by the target executable. Consequently, this mode consumes more memory,
takes longer to perform the instrumentation, and tends to add more significant overhead to the
runtime of the application.
* This mode is recommended if you want to analyze not only the performance of your executable and/or
libraries but also the performance of the library dependencies
* Attaching to a process that is currently running (analogous to ``gdb -p <PID>``)
* This mode is activated using ``-p <PID>``
* The same caveats from the first example apply with respect to memory and overhead
.. note::
Attaching to a running process is an alpha feature and detaching from the target process
without ending the target process is not currently supported.
* Binary rewrite to generate a new executable or library with the instrumentation built-in
* This mode is activated through the ``-o <output-file>`` option
* Binary rewriting is limited to the text section of the target executable or library. It does not instrument
the dynamically-linked libraries. Consequently, this mode performs the
instrumentation significantly faster
and has a much lower overhead when running the instrumented executable and libraries.
* Binary rewriting is the recommended mode when the target executable uses
process-level parallelism (for example, MPI)
* If the target executable has a minimal ``main`` routine and the bulk of your
application is in one specific dynamic library,
see :ref:`binary-rewriting-library-label` for help
The omnitrace-instrument executable
========================================
Instrumentation is performed with the ``omnitrace-instrument`` executable. For more details, use the ``-h`` or ``--help`` option to
view the help menu.
.. code-block:: shell
$ omnitrace-instrument --help
[omnitrace-instrument] Usage: omnitrace-instrument [ --help (count: 0, dtype: bool)
--version (count: 0, dtype: bool)
--verbose (max: 1, dtype: bool)
--error (max: 1, dtype: boolean)
--debug (max: 1, dtype: bool)
--log (count: 1)
--log-file (count: 1)
--simulate (max: 1, dtype: boolean)
--print-format (min: 1, dtype: string)
--print-dir (count: 1, dtype: string)
--print-available (count: 1)
--print-instrumented (count: 1)
--print-coverage (count: 1)
--print-excluded (count: 1)
--print-overlapping (count: 1)
--print-instructions (max: 1, dtype: bool)
--output (min: 0, dtype: string)
--pid (count: 1, dtype: int)
--mode (count: 1)
--force (max: 1, dtype: bool)
--command (count: 1)
--prefer (count: 1)
--library (count: unlimited)
--main-function (count: 1)
--load (count: unlimited, dtype: string)
--load-instr (count: unlimited, dtype: filepath)
--init-functions (count: unlimited, dtype: string)
--fini-functions (count: unlimited, dtype: string)
--all-functions (max: 1, dtype: boolean)
--function-include (count: unlimited)
--function-exclude (count: unlimited)
--function-restrict (count: unlimited)
--caller-include (count: unlimited)
--module-include (count: unlimited)
--module-exclude (count: unlimited)
--module-restrict (count: unlimited)
--internal-function-include (count: unlimited)
--internal-module-include (count: unlimited)
--instruction-exclude (count: unlimited)
--internal-library-deps (min: 0, dtype: boolean)
--internal-library-append (count: unlimited)
--internal-library-remove (count: unlimited)
--linkage (min: 1)
--visibility (min: 1)
--label (count: unlimited, dtype: string)
--config (min: 1, dtype: string)
--default-components (count: unlimited, dtype: string)
--env (count: unlimited)
--mpi (max: 1, dtype: bool)
--instrument-loops (max: 1, dtype: boolean)
--min-instructions (count: 1, dtype: int)
--min-address-range (count: 1, dtype: int)
--min-instructions-loop (count: 1, dtype: int)
--min-address-range-loop (count: 1, dtype: int)
--coverage (max: 1, dtype: bool)
--dynamic-callsites (max: 1, dtype: boolean)
--traps (max: 1, dtype: boolean)
--loop-traps (max: 1, dtype: boolean)
--allow-overlapping (max: 1, dtype: bool)
--parse-all-modules (max: 1, dtype: bool)
--batch-size (count: 1, dtype: int)
--dyninst-rt (min: 1, dtype: filepath)
--dyninst-options (count: unlimited)
] -- <CMD> <ARGS>
Options:
-h, -?, --help Shows this page
--version Prints the version and exit
[DEBUG OPTIONS]
-v, --verbose Verbose output
-e, --error All warnings produce runtime errors
--debug Debug output
--log Number of log entries to display after an error. Any value < 0 will emit the entire log
--log-file Write the log out the specified file during the run
--simulate Exit after outputting diagnostic {available,instrumented,excluded,overlapping} module
function lists, e.g. available.txt
--print-format [ json | txt | xml ]
Output format for diagnostic {available,instrumented,excluded,overlapping} module
function lists, e.g. {print-dir}/available.txt
--print-dir Output directory for diagnostic {available,instrumented,excluded,overlapping} module
function lists, e.g. {print-dir}/available.txt
--print-available [ functions | functions+ | modules | pair | pair+ ]
Print the available entities for instrumentation (functions, modules, or module-function
pair) to stdout after applying regular expressions
--print-instrumented [ functions | functions+ | modules | pair | pair+ ]
Print the instrumented entities (functions, modules, or module-function pair) to stdout
after applying regular expressions
--print-coverage [ functions | functions+ | modules | pair | pair+ ]
Print the instrumented coverage entities (functions, modules, or module-function pair) to
stdout after applying regular expressions
--print-excluded [ functions | functions+ | modules | pair | pair+ ]
Print the entities for instrumentation (functions, modules, or module-function pair)
which are excluded from the instrumentation to stdout after applying regular expressions
--print-overlapping [ functions | functions+ | modules | pair | pair+ ]
Print the entities for instrumentation (functions, modules, or module-function pair)
which overlap other function calls or have multiple entry points to stdout after applying
regular expressions
--print-instructions Print the instructions for each basic-block in the JSON/XML outputs
[MODE OPTIONS]
-o, --output Enable generation of a new executable (binary-rewrite). If a filename is not provided,
omnitrace will use the basename and output to the cwd, unless the target binary is in the
cwd. In the latter case, omnitrace will either use ${PWD}/<basename>.inst (non-libraries)
or ${PWD}/instrumented/<basename> (libraries)
-p, --pid Connect to running process
-M, --mode [ coverage | sampling | trace ]
Instrumentation mode. \'trace\' mode instruments the selected functions, \'sampling\' mode
only instruments the main function to start and stop the sampler.
-f, --force Force the command-line argument configuration, i.e. don't get cute. Useful for forcing
runtime instrumentation of an executable that [A] Dyninst thinks is a library after
reading ELF and [B] whose name makes it look like a library (e.g. starts with 'lib'
and/or ends in \'.so\', \'.so.*\', or \'.a\')
-c, --command Input executable and arguments (if \'-- <CMD>\' not provided)
[LIBRARY OPTIONS]
--prefer [ shared | static ] Prefer this library types when available
-L, --library Libraries with instrumentation routines (default: "libomnitrace-dl")
-m, --main-function The primary function to instrument around, e.g. \'main\'
--load Supplemental instrumentation library names w/o extension (e.g. \'libinstr\' for
\'libinstr.so\' or \'libinstr.a\')
--load-instr Load {available,instrumented,excluded,overlapping}-instr JSON or XML file(s) and override
what is read from the binary
--init-functions Initialization function(s) for supplemental instrumentation libraries (see \'--load\'
option)
--fini-functions Finalization function(s) for supplemental instrumentation libraries (see \'--load\' option)
--all-functions When finding functions, include the functions which are not instrumentable. This is
purely diagnostic for the available/excluded functions output
[SYMBOL SELECTION OPTIONS]
-I, --function-include Regex(es) for including functions (despite heuristics)
-E, --function-exclude Regex(es) for excluding functions (always applied)
-R, --function-restrict Regex(es) for restricting functions only to those that match the provided
regular-expressions
--caller-include Regex(es) for including functions that call the listed functions (despite heuristics)
-MI, --module-include Regex(es) for selecting modules/files/libraries (despite heuristics)
-ME, --module-exclude Regex(es) for excluding modules/files/libraries (always applied)
-MR, --module-restrict Regex(es) for restricting modules/files/libraries only to those that match the provided
regular-expressions
--internal-function-include Regex(es) for including functions which are (likely) utilized by omnitrace itself. Use
this option with care.
--internal-module-include Regex(es) for including modules/libraries which are (likely) utilized by omnitrace
itself. Use this option with care.
--instruction-exclude Regex(es) for excluding functions containing certain instructions
--internal-library-deps Treat the libraries linked to the internal libraries as internal libraries. This increase
the internal library processing time and consume more memory (so use with care) but may
be useful when the application uses Boost libraries and Dyninst is dynamically linked
against the same boost libraries
--internal-library-append Append to the list of libraries which omnitrace treats as being used internally, e.g.
OmniTrace will find all the symbols in this library and prevent them from being
instrumented.
--internal-library-remove [ ld-linux-x86-64.so.2
libBrokenLocale.so.1
libanl.so.1
libbfd.so
libbz2.so
libc.so.6
libcaliper.so
libcommon.so
libcrypt.so.1
libdl.so.2
libdw.so
libdwarf.so
libdyninstAPI_RT.so
libelf.so
libgcc_s.so.1
libgotcha.so
liblikwid.so
liblzma.so
libnsl.so.1
libnss_compat.so.2
libnss_db.so.2
libnss_dns.so.2
libnss_files.so.2
libnss_hesiod.so.2
libnss_ldap.so.2
libnss_nis.so.2
libnss_nisplus.so.2
libnss_test1.so.2
libnss_test2.so.2
libpapi.so
libpfm.so
libprofiler.so
libpthread.so.0
libresolv.so.2
librocm_smi64.so
librocmtools.so
librocprofiler64.so
libroctracer64.so
libroctx64.so
librt.so.1
libstdc++.so.6
libtbb.so
libtbbmalloc.so
libtbbmalloc_proxy.so
libtcmalloc.so
libtcmalloc_and_profiler.so
libtcmalloc_debug.so
libtcmalloc_minimal.so
libtcmalloc_minimal_debug.so
libthread_db.so.1
libunwind-coredump.so
libunwind-generic.so
libunwind-ptrace.so
libunwind-setjmp.so
libunwind-x86_64.so
libunwind.so
libutil.so.1
libz.so
libzstd.so ]
Remove the specified libraries from being treated as being used internally, e.g.
OmniTrace will permit all the symbols in these libraries to be eligible for
instrumentation.
--linkage [ global | local | unique | unknown | weak ]
Only instrument functions with specified linkage (default: global, local, unique)
--visibility [ default | hidden | internal | protected | unknown ]
Only instrument functions with specified visibility (default: default, internal, hidden,
protected)
[RUNTIME OPTIONS]
--label [ args | file | line | return ]
Labeling info for functions. By default, just the function name is recorded. Use these
options to gain more information about the function signature or location of the
functions
-C, --config Read in a configuration file and encode these values as the defaults in the executable
-d, --default-components Default components to instrument (only useful when timemory is enabled in omnitrace
library)
--env Environment variables to add to the runtime in form VARIABLE=VALUE. E.g. use \'--env
OMNITRACE_PROFILE=ON\' to default to using timemory instead of perfetto
--mpi Enable MPI support (requires omnitrace built w/ full or partial MPI support). NOTE: this
will automatically be activated if MPI_Init, MPI_Init_thread, MPI_Finalize,
MPI_Comm_rank, or MPI_Comm_size are found in the symbol table of target
[GRANULARITY OPTIONS]
-l, --instrument-loops Instrument at the loop level
-i, --min-instructions If the number of instructions in a function is less than this value, exclude it from
instrumentation
-r, --min-address-range If the address range of a function is less than this value, exclude it from
instrumentation
--min-instructions-loop If the number of instructions in a function containing a loop is less than this value,
exclude it from instrumentation
--min-address-range-loop If the address range of a function containing a loop is less than this value, exclude it
from instrumentation
--coverage [ basic_block | function | none ]
Enable recording the code coverage. If instrumenting in coverage mode (\'-M converage\'),
this simply specifies the granularity. If instrumenting in trace or sampling mode, this
enables recording code-coverage in addition to the instrumentation of that mode (if any).
--dynamic-callsites Force instrumentation if a function has dynamic callsites (e.g. function pointers)
--traps Instrument points which require using a trap. On the x86 architecture, because
instructions are of variable size, the instruction at a point may be too small for
Dyninst to replace it with the normal code sequence used to call instrumentation. Also,
when instrumentation is placed at points other than subroutine entry, exit, or call
points, traps may be used to ensure the instrumentation fits. In this case, Dyninst
replaces the instruction with a single-byte instruction that generates a trap.
--loop-traps Instrument points within a loop which require using a trap (only relevant when
--instrument-loops is enabled).
--allow-overlapping Allow dyninst to instrument either multiple functions which overlap (share part of same
function body) or single functions with multiple entry points. For more info, see Section
2 of the DyninstAPI documentation.
--parse-all-modules By default, omnitrace simply requests Dyninst to provide all the procedures in the
application image. If this option is enabled, omnitrace will iterate over all the modules
and extract the functions. Theoretically, it should be the same but the data is slightly
different, possibly due to weak binding scopes. In general, enabling option will probably
have no visible effect
[DYNINST OPTIONS]
-b, --batch-size Dyninst supports batch insertion of multiple points during runtime instrumentation. If
one large batch insertion fails, this value will be used to create smaller batches.
Larger batches generally decrease the instrumentation time
--dyninst-rt Path(s) to the dyninstAPI_RT library
--dyninst-options [ BaseTrampDeletion
DebugParsing
DelayedParsing
InstrStackFrames
MergeTramp
SaveFPR
TrampRecursive
TypeChecking ]
Advanced dyninst options: BPatch::set<OPTION>(bool), e.g. bpatch->setTrampRecursive(true)
``omnitrace-instrument`` uses a similar syntax as LLVM to separate command-line arguments from the
application's arguments. It uses a standalone
double-hyphen (``--``) as a separator.
All arguments preceding the double-hyphen
are interpreted as belonging to Omnitrace and all arguments following the
double-hyphen are interpreted as being part of the
application and its arguments. In binary rewrite mode, all application arguments after the first argument
are ignored. As an example, ``./omnitrace-instrument -o ls.inst -- ls -l`` interprets ``ls`` as
the target to instrument, ignoring the ``-l`` argument,
and generates a ``ls.inst`` executable that you can subsequently run using the
``omnitrace-run -- ls.inst -l`` command.
Runtime instrumentation example
========================================
The following example shows how to enable runtime instrumentation.
.. code-block:: shell
omnitrace-instrument <omnitrace-options> -- <exe> [<exe-options>...]
Attaching to a running process
========================================
Use the following command to attach to an active process.
.. code-block:: shell
omnitrace-instrument <omnitrace-options> -p <PID> -- <exe-name>
Binary rewrite
========================================
This example demonstrates how to rewrite a binary.
.. code-block:: shell
omnitrace-instrument <omnitrace-options> -o <name-of-new-exe-or-library> -- <exe-or-library>
.. _binary-rewriting-library-label:
Binary rewrite of a library
-----------------------------------
Many applications bundle the bulk of their functionality into one or more
dynamic libraries and have a relatively simple ``main``
which links to these libraries and serves as the "driver" for
setting up the workflow. If you perform a binary rewrite of an
executable like this and find there is insufficient information, you
can either switch to runtime instrumentation or perform a
binary rewrite on the relevant libraries.
Support for stand-alone binary rewriting of a dynamic library without a binary rewrite of
the executable is a beta feature.
In general, it is supported as long as the library contains the ``_init`` and
``_fini`` symbols but these symbols are not
standardized to the extent of ``main`` in an executable.
Here is the recommended workflow for the binary rewrite of a library:
#. Determine the names of the dynamically linked libraries of interest using ``ldd``
#. Generate a binary rewrite of the executable
#. Generate a binary rewrite of the desired libraries with the same base name as the
original library, for example, ``libfoo.so.2`` instead of ``libfoo.so``, and output the instrumented
library into a different folder than the original library.
#. Prefix the ``LD_LIBRARY_PATH`` executable with the output folder from the previous step
#. Use ``ldd`` to verify that the instrumented executable can resolve the location of the instrumented library
Binary rewrite of a library example
-----------------------------------
The ``foo`` executable is dynamically linked to ``libfoo.so.2``:
.. code-block:: shell
$ pwd
/home/user
$ which foo
/usr/local/bin/foo
$ ldd /usr/local/bin/foo
...
libfoo.so.2 => /usr/local/lib/libfoo.so.2 (...)
...
Generate binary rewrites of ``foo`` and ``libfoo.so.2``:
.. code-block:: shell
omnitrace-instrument -o ./foo.inst -- foo
omnitrace-instrument -o ./libfoo.so.2 -- /usr/local/lib/libfoo.so.2
At this point, the instrumented ``foo.inst`` executable still dynamically loads the
original ``libfoo.so.2`` in ``/usr/local/lib``:
.. code-block:: shell
$ ldd ./foo.inst
...
libfoo.so.2 => /usr/local/lib/libfoo.so.2 (...)
...
Prefix the ``LD_LIBRARY_PATH`` environment variable with the folder containing
the instrumented ``libfoo.so.2``:
.. code-block:: shell
export LD_LIBRARY_PATH=/home/user:${LD_LIBRARY_PATH}
``foo.inst`` now loads the instrumented library when it runs:
.. code-block:: shell
$ ldd ./foo.inst
...
libfoo.so.2 => /home/user/libfoo.so.2 (...)
...
Selective instrumentation
========================================
The default behavior of ``omnitrace-instrument`` does not instrument every symbol in the binary.
The default rules are:
* Skip instrumenting dynamic call-sites (such as function pointers)
* The ``--dynamic-callsites`` option forces instrumentation for all dynamic call-sites
* The cost of a function can be loosely approximated by the number of
instructions. By default, ``omnitrace-instrument`` only instruments functions
with at least 1024 instructions
* The ``--min-instructions`` option modifies this heuristic for all functions which do not contain loops
* The ``--min-instructions-loop`` option modifies this heuristic for functions which contain loops.
* The cost of a function can be also be loosely approximated by the size of the function
in the binary so this heuristic can be used in lieu of or in addition to the
minimum number of instructions
* The ``--min-address-range`` option modifies this heuristic for all functions which do not contain loops
* The ``--min-address-range-loop`` option modifies this heuristic for functions which contain loops
* Skip instrumentation points which require using a trap
* See the description for the ``--traps`` and ``--loop-traps`` options for more information
* Skip instrumenting loops within the body of a function
* The ``--instrument-loops`` option enables this behavior
* Skip instrumenting functions with overlapping function bodies and single
functions with multiple entry point
* These behaviors arise from various optimizations. Enable instrumenting for these functions
by using the ``--allow-overlapping`` option
.. note::
The separate loop options ``--min-instructions-loop`` and ``--min-address-range-loop``
are provided because functions with loops can be compact in the binary while also being costly
Viewing the available, instrumented, excluded, and overlapping functions
-------------------------------------------------------------------------
Whenever ``omnitrace-instrument`` runs with a verbosity of zero or higher,
it generates files that detail which functions
were available for instrumentation (along with the module they were defined in), actually instrumented,
excluded, and which contained overlapping function bodies.
By default, these files are saved to the ``omnitrace-<NAME>-output`` folder
where ``<NAME>`` is the base name of the targeted binary (or
the base name of the resulting executable in the case of binary rewrite). For example,
``omnitrace-instrument -- ls`` outputs these files to ``omnitrace-ls-output``
whereas ``omnitrace-instrument -o ls.inst -- ls`` places them in ``omnitrace-ls.inst-output``.
To generate these files without running or generating an
executable, use the ``--simulate`` option:
.. code-block:: shell
omnitrace-instrument --simulate -- foo
omnitrace-instrument --simulate -o foo.inst -- foo
Excluding and including modules and functions
----------------------------------------------
Omnitrace has a set of six command-line options which each accept one or more
regular expressions for customizing the scope of which module and/or functions are
instrumented. Multiple regex patterns per option are treated as an OR operation,
for example, ``--module-include libfoo libbar`` is effectively the same as ``--module-include 'libfoo|libbar'``.
To force the inclusion of certain modules and/or function
without changing any of the heuristics, use the ``--module-include`` and/or ``--function-include`` options.
These options do not exclude modules or functions which do
not satisfy their regular expression.
To narrow the scope of the instrumentation to a specific set
of libraries and/or functions, use the ``--module-restrict`` and ``--function-restrict`` options.
These options let you exclusively select the union of one or more
regular expressions, regardless of whether or not the functions satisfy the
previously-mentioned default heuristics. Any function or module that is not within
the union of these regular expressions is excluded from instrumentation.
To avoid instrumenting a set of modules and/or functions,
use the ``--module-exclude`` and ``--function-exclude`` options.
These options are always applied, even if the module or function
satisfies the "restrict" or "include" regular expression.
.. _available-module-function-output:
An example of the available module and function info output
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: shell
omnitrace-instrument -o lulesh.inst --label file line args --simulate -- lulesh
.. code-block:: shell
AddressRange Module Function FunctionSignature
9165 ../examples/lulesh/lulesh-comm.cc CommMonoQ CommMonoQ(domain) [lulesh-comm.cc:1891]
3396 ../examples/lulesh/lulesh-comm.cc CommRecv CommRecv(domain, int, Index_t, Index_t, Index_t, Index_t, bool, bool) [lulesh...
8666 ../examples/lulesh/lulesh-comm.cc CommSBN CommSBN(domain, int, Domain_member *) [lulesh-comm.cc:926]
10212 ../examples/lulesh/lulesh-comm.cc CommSend CommSend(domain, int, Index_t, Domain_member *, Index_t, Index_t, Index_t, bo...
6823 ../examples/lulesh/lulesh-comm.cc CommSyncPosVel CommSyncPosVel(domain) [lulesh-comm.cc:1404]
126 ../examples/lulesh/lulesh-comm.cc _GLOBAL__sub_I_lulesh_comm.cc _GLOBAL__sub_I_lulesh_comm.cc() [lulesh-comm.cc]
308 ../examples/lulesh/lulesh-init.cc .omp_outlined..26 .omp_outlined..26(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
628 ../examples/lulesh/lulesh-init.cc .omp_outlined..34 .omp_outlined..34(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
656 ../examples/lulesh/lulesh-init.cc .omp_outlined..41 .omp_outlined..41(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
662 ../examples/lulesh/lulesh-init.cc .omp_outlined..45 .omp_outlined..45(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
550 ../examples/lulesh/lulesh-init.cc .omp_outlined..55 .omp_outlined..55(const , const , const ParallelFor<Kokkos::Impl::ViewFill<Ko...
556 ../examples/lulesh/lulesh-init.cc .omp_outlined..57 .omp_outlined..57(const , const , const ParallelFor<Kokkos::Impl::ViewFill<Ko...
550 ../examples/lulesh/lulesh-init.cc .omp_outlined..78 .omp_outlined..78(const , const , const ParallelFor<Kokkos::Impl::ViewFill<Ko...
640 ../examples/lulesh/lulesh-init.cc .omp_outlined..84 .omp_outlined..84(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
646 ../examples/lulesh/lulesh-init.cc .omp_outlined..88 .omp_outlined..88(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
1840 ../examples/lulesh/lulesh-init.cc Domain::AllocateElemPersistent Domain::AllocateElemPersistent(Domain *, Int_t) [lulesh-init.cc:94]
1384 ../examples/lulesh/lulesh-init.cc Domain::AllocateNodePersistent Domain::AllocateNodePersistent(Domain *, Int_t) [lulesh-init.cc:94]
1264 ../examples/lulesh/lulesh-init.cc Domain::BuildMesh Domain::BuildMesh(Domain *, Int_t, Int_t, Int_t) [lulesh-init.cc:308]
2312 ../examples/lulesh/lulesh-init.cc Domain::CreateRegionIndexSets Domain::CreateRegionIndexSets(Domain *, Int_t, Int_t) [lulesh-init.cc:409]
7109 ../examples/lulesh/lulesh-init.cc Domain::Domain Domain::Domain(Domain *, Int_t, Index_t, Index_t, Index_t, Index_t, int, int,...
2458 ../examples/lulesh/lulesh-init.cc Domain::SetupBoundaryConditions Domain::SetupBoundaryConditions(Domain *, Int_t) [lulesh-init.cc:409]
956 ../examples/lulesh/lulesh-init.cc Domain::SetupCommBuffers Domain::SetupCommBuffers(Domain *, Int_t) [lulesh-init.cc]
1456 ../examples/lulesh/lulesh-init.cc Domain::SetupElementConnectivities Domain::SetupElementConnectivities(Domain *, Int_t) [lulesh-init.cc:409]
721 ../examples/lulesh/lulesh-init.cc Domain::SetupSymmetryPlanes Domain::SetupSymmetryPlanes(Domain *, Int_t) [lulesh-init.cc:409]
1591 ../examples/lulesh/lulesh-init.cc Domain::SetupThreadSupportStructures Domain::SetupThreadSupportStructures(Domain *) [lulesh-init.cc:376]
1644 ../examples/lulesh/lulesh-init.cc Domain::~Domain Domain::~Domain(Domain *) [lulesh-init.cc:286]
218 ../examples/lulesh/lulesh-init.cc InitMeshDecomp InitMeshDecomp(Int_t, Int_t, Int_t *, Int_t *, Int_t *, Int_t *) [lulesh-init...
260 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::CommonSubview<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokk... Kokkos::Impl::CommonSubview<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokk...
1786 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::HostIterateTile<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::R... Kokkos::Impl::HostIterateTile<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::R...
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int**... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int**...
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int**... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int**...
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int*,... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int*,...
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int*,... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int*,...
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl...
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl...
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl...
522 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::... Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::...
232 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::... Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::...
49 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewVal... Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewVal...
1476 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::Tile_Loop_Type<2, false, int, void, void>::apply<Kokkos::Impl::... Kokkos::Impl::Tile_Loop_Type<2, false, int, void, void>::apply<Kokkos::Impl::...
555 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::LayoutRight, Kokkos::Devic... Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::LayoutRight, Kokkos::Devic...
613 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::LayoutRight, Kokkos::Devic... Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::LayoutRight, Kokkos::Devic...
603 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCopy<Kokkos::View<int*, Kokkos::LayoutLeft, Kokkos::Device<... Kokkos::Impl::ViewCopy<Kokkos::View<int*, Kokkos::LayoutLeft, Kokkos::Device<...
604 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCopy<Kokkos::View<int*, Kokkos::LayoutLeft, Kokkos::Device<... Kokkos::Impl::ViewCopy<Kokkos::View<int*, Kokkos::LayoutLeft, Kokkos::Device<...
281 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
281 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
281 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
281 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
281 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
524 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev... Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev...
525 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev... Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev...
524 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev... Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev...
583 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<int* [8], Kokkos::LayoutRight>, ... SharedAllocationRecord<void, void> * Kokkos::Impl::ViewMapping<Kokkos::ViewTr...
529 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<int*, Kokkos::HostSpace>, void>:... SharedAllocationRecord<void, void> * Kokkos::Impl::ViewMapping<Kokkos::ViewTr...
529 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<int*>, void>::allocate_shared<st... SharedAllocationRecord<void, void> * Kokkos::Impl::ViewMapping<Kokkos::ViewTr...
203 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewRemap<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::... Kokkos::Impl::ViewRemap<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::...
331 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewRemap<Kokkos::View<int*>, Kokkos::View<int*>, Kokkos::OpenM... Kokkos::Impl::ViewRemap<Kokkos::View<int*>, Kokkos::View<int*>, Kokkos::OpenM...
461 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpa... enable_if_t<std::is_trivial<int>::value && std::is_trivially_copy_assignable<...
353 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::contiguous_fill<Kokkos::OpenMP, double*> Kokkos::Impl::contiguous_fill<Kokkos::OpenMP, double*>(exec_space, dst, value...
139 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::contiguous_fill<Kokkos::OpenMP, double, Kokkos::LayoutRight, Ko... Kokkos::Impl::contiguous_fill<Kokkos::OpenMP, double, Kokkos::LayoutRight, Ko...
824 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight, Kokkos::D... Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight, Kokkos::D...
824 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight, Kokkos::D... Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight, Kokkos::D...
824 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::... Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::...
824 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::... Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::...
697 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::view_copy<Kokkos::View<int*, Kokkos::LayoutRight, Kokkos::Devic... Kokkos::Impl::view_copy<Kokkos::View<int*, Kokkos::LayoutRight, Kokkos::Devic...
697 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::view_copy<Kokkos::View<int*>, Kokkos::View<int*> > Kokkos::Impl::view_copy<Kokkos::View<int*>, Kokkos::View<int*> >(dst, src) [l...
2036 ../examples/lulesh/lulesh-init.cc Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::Schedule<Kokkos::Static>, int>::R... Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::Schedule<Kokkos::Static>, int>::R...
2506 ../examples/lulesh/lulesh-init.cc Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::Schedule<Kokkos::Static>, long>::... Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::Schedule<Kokkos::Static>, long>::...
271 ../examples/lulesh/lulesh-init.cc Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor... Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor...
470 ../examples/lulesh/lulesh-init.cc Kokkos::View<int* [8], Kokkos::LayoutRight>::View<std::__cxx11::basic_string<... Kokkos::View<int* [8], Kokkos::LayoutRight>::View<std::__cxx11::basic_string<...
323 ../examples/lulesh/lulesh-init.cc Kokkos::View<int* [8], Kokkos::LayoutRight>::View<std::__cxx11::basic_string<... Kokkos::View<int* [8], Kokkos::LayoutRight>::View<std::__cxx11::basic_string<...
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*, Kokkos::HostSpace>::View<char [10]> Kokkos::View<int*, Kokkos::HostSpace>::View<char [10]>(View<int *, Kokkos::Ho...
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*, Kokkos::HostSpace>::View<char [14]> Kokkos::View<int*, Kokkos::HostSpace>::View<char [14]>(View<int *, Kokkos::Ho...
462 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*, Kokkos::HostSpace>::View<std::__cxx11::basic_string<char, ... Kokkos::View<int*, Kokkos::HostSpace>::View<std::__cxx11::basic_string<char, ...
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<char [16]> Kokkos::View<int*>::View<char [16]>(View<int *> *, arg_label, type, const siz...
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<char [19]> Kokkos::View<int*>::View<char [19]>(View<int *> *, arg_label, type, const siz...
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<char [21]> Kokkos::View<int*>::View<char [21]>(View<int *> *, arg_label, type, const siz...
462 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<std::__cxx11::basic_string<char, std::char_traits<ch... Kokkos::View<int*>::View<std::__cxx11::basic_string<char, std::char_traits<ch...
323 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<std::__cxx11::basic_string<char, std::char_traits<ch... Kokkos::View<int*>::View<std::__cxx11::basic_string<char, std::char_traits<ch...
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<double*, , double*, Kokkos::LayoutRight, Kokkos::Device<Kok... Kokkos::deep_copy<double*, , double*, Kokkos::LayoutRight, Kokkos::Device<Kok...
1052 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<double*> Kokkos::deep_copy<double*>(dst, value) [lulesh-init.cc]
1050 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<double, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP,... Kokkos::deep_copy<double, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP,...
7686 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenM... Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenM...
7686 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, int* [8], Kokkos::LayoutRigh... Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, int* [8], Kokkos::LayoutRigh...
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int*, , int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::O... Kokkos::deep_copy<int*, , int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::O...
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Ko... Kokkos::deep_copy<int*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Ko...
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP, K... Kokkos::deep_copy<int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP, K...
863 ../examples/lulesh/lulesh-init.cc Kokkos::impl_resize<, int* [8], Kokkos::LayoutRight> type Kokkos::impl_resize<, int* [8], Kokkos::LayoutRight>(v, const size_t, co...
854 ../examples/lulesh/lulesh-init.cc Kokkos::impl_resize<, int*> type Kokkos::impl_resize<, int*>(v, const size_t, const size_t, const size_t,...
697 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (... Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (...
706 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (... Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (...
912 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
791 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
791 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
944 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
839 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
126 ../examples/lulesh/lulesh-init.cc _GLOBAL__sub_I_lulesh_init.cc _GLOBAL__sub_I_lulesh_init.cc() [lulesh-init.cc]
6589 ../examples/lulesh/lulesh-util.cc Kokkos::deep_copy<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP... Kokkos::deep_copy<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP...
1345 ../examples/lulesh/lulesh-util.cc ParseCommandLineOptions ParseCommandLineOptions(int, char * *, int, cmdLineOpts *) [lulesh-util.cc:67]
171 ../examples/lulesh/lulesh-util.cc PrintCommandLineOptions PrintCommandLineOptions(char *, int) [lulesh-util.cc:31]
67 ../examples/lulesh/lulesh-util.cc StrToInt int StrToInt(const char *, int *) [lulesh-util.cc:13]
706 ../examples/lulesh/lulesh-util.cc VerifyAndWriteFinalOutput VerifyAndWriteFinalOutput(Real_t, locDom, Int_t, Int_t) [lulesh-util.cc:222]
126 ../examples/lulesh/lulesh-util.cc _GLOBAL__sub_I_lulesh_util.cc _GLOBAL__sub_I_lulesh_util.cc() [lulesh-util.cc]
17 ../examples/lulesh/lulesh-viz.cc DumpToVisit DumpToVisit(domain, int, int, int) [lulesh-viz.cc:415]
126 ../examples/lulesh/lulesh-viz.cc _GLOBAL__sub_I_lulesh_viz.cc _GLOBAL__sub_I_lulesh_viz.cc() [lulesh-viz.cc]
451 ../examples/lulesh/lulesh.cc .omp_outlined..103 .omp_outlined..103(const , const , const ParallelReduce<(lambda at ../example...
796 ../examples/lulesh/lulesh.cc .omp_outlined..109 .omp_outlined..109(const , const , const ParallelFor<(lambda at ../examples/l...
394 ../examples/lulesh/lulesh.cc .omp_outlined..111 .omp_outlined..111(const , const , const ParallelFor<(lambda at ../examples/l...
402 ../examples/lulesh/lulesh.cc .omp_outlined..113 .omp_outlined..113(const , const , const ParallelFor<(lambda at ../examples/l...
427 ../examples/lulesh/lulesh.cc .omp_outlined..115 .omp_outlined..115(const , const , const ParallelReduce<(lambda at ../example...
859 ../examples/lulesh/lulesh.cc .omp_outlined..119 .omp_outlined..119(const , const , const ParallelFor<(lambda at ../examples/l...
243 ../examples/lulesh/lulesh.cc .omp_outlined..122 .omp_outlined..122(const , const , const ParallelFor<(lambda at ../examples/l...
426 ../examples/lulesh/lulesh.cc .omp_outlined..124 .omp_outlined..124(const , const , const ParallelFor<(lambda at ../examples/l...
529 ../examples/lulesh/lulesh.cc .omp_outlined..127 .omp_outlined..127(const , const , const ParallelFor<(lambda at ../examples/l...
865 ../examples/lulesh/lulesh.cc .omp_outlined..130 .omp_outlined..130(const , const , const ParallelFor<(lambda at ../examples/l...
539 ../examples/lulesh/lulesh.cc .omp_outlined..132 .omp_outlined..132(const , const , const ParallelReduce<(lambda at ../example...
456 ../examples/lulesh/lulesh.cc .omp_outlined..134 .omp_outlined..134(const , const , const ParallelReduce<(lambda at ../example...
252 ../examples/lulesh/lulesh.cc .omp_outlined..20 .omp_outlined..20(const , const , const ParallelFor<(lambda at ../examples/lu...
870 ../examples/lulesh/lulesh.cc .omp_outlined..35 .omp_outlined..35(const , const , const ParallelFor<(lambda at ../examples/lu...
473 ../examples/lulesh/lulesh.cc .omp_outlined..42 .omp_outlined..42(const , const , const ParallelFor<(lambda at ../examples/lu...
252 ../examples/lulesh/lulesh.cc .omp_outlined..46 .omp_outlined..46(const , const , const ParallelFor<(lambda at ../examples/lu...
1101 ../examples/lulesh/lulesh.cc .omp_outlined..48 .omp_outlined..48(const , const , const ParallelFor<(lambda at ../examples/lu...
427 ../examples/lulesh/lulesh.cc .omp_outlined..55 .omp_outlined..55(const , const , const ParallelReduce<(lambda at ../examples...
1326 ../examples/lulesh/lulesh.cc .omp_outlined..57 .omp_outlined..57(const , const , const ParallelReduce<(lambda at ../examples...
243 ../examples/lulesh/lulesh.cc .omp_outlined..61 .omp_outlined..61(const , const , const ParallelFor<(lambda at ../examples/lu...
1101 ../examples/lulesh/lulesh.cc .omp_outlined..63 .omp_outlined..63(const , const , const ParallelFor<(lambda at ../examples/lu...
372 ../examples/lulesh/lulesh.cc .omp_outlined..66 .omp_outlined..66(const , const , const ParallelFor<(lambda at ../examples/lu...
499 ../examples/lulesh/lulesh.cc .omp_outlined..71 .omp_outlined..71(const , const , const ParallelFor<(lambda at ../examples/lu...
499 ../examples/lulesh/lulesh.cc .omp_outlined..73 .omp_outlined..73(const , const , const ParallelFor<(lambda at ../examples/lu...
499 ../examples/lulesh/lulesh.cc .omp_outlined..75 .omp_outlined..75(const , const , const ParallelFor<(lambda at ../examples/lu...
465 ../examples/lulesh/lulesh.cc .omp_outlined..78 .omp_outlined..78(const , const , const ParallelFor<(lambda at ../examples/lu...
396 ../examples/lulesh/lulesh.cc .omp_outlined..81 .omp_outlined..81(const , const , const ParallelFor<(lambda at ../examples/lu...
656 ../examples/lulesh/lulesh.cc .omp_outlined..85 .omp_outlined..85(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
662 ../examples/lulesh/lulesh.cc .omp_outlined..89 .omp_outlined..89(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
443 ../examples/lulesh/lulesh.cc .omp_outlined..93 .omp_outlined..93(const , const , const ParallelReduce<(lambda at ../examples...
243 ../examples/lulesh/lulesh.cc .omp_outlined..96 .omp_outlined..96(const , const , const ParallelFor<(lambda at ../examples/lu...
243 ../examples/lulesh/lulesh.cc .omp_outlined..99 .omp_outlined..99(const , const , const ParallelFor<(lambda at ../examples/lu...
13367 ../examples/lulesh/lulesh.cc ApplyMaterialPropertiesForElems ApplyMaterialPropertiesForElems(domain) [lulesh.cc:409]
1530 ../examples/lulesh/lulesh.cc CalcElemCharacteristicLength Real_t CalcElemCharacteristicLength(const Real_t *, const Real_t *, const Rea...
982 ../examples/lulesh/lulesh.cc CalcElemFBHourglassForce CalcElemFBHourglassForce(const Real_t *, const Real_t[] *, coefficient, Real_...
2428 ../examples/lulesh/lulesh.cc CalcElemNodeNormals CalcElemNodeNormals(Real_t *, Real_t *, Real_t *, const Real_t *, const Real_...
853 ../examples/lulesh/lulesh.cc CalcElemShapeFunctionDerivatives CalcElemShapeFunctionDerivatives(const Real_t *, const Real_t *, const Real_t...
1097 ../examples/lulesh/lulesh.cc CalcElemVolumeDerivative CalcElemVolumeDerivative(i, dvdx, dvdy, dvdz, const Real_t *, const Real_t *,...
1054 ../examples/lulesh/lulesh.cc CalcKinematicsForElems CalcKinematicsForElems(domain, Real_t, Index_t) [lulesh.cc]
14160 ../examples/lulesh/lulesh.cc CalcVolumeForceForElems CalcVolumeForceForElems(domain) [lulesh.cc:409]
366 ../examples/lulesh/lulesh.cc Domain::AllocateGradients Domain::AllocateGradients(Domain *, Int_t, Int_t) [lulesh.cc:214]
475 ../examples/lulesh/lulesh.cc Domain::DeallocateGradients Domain::DeallocateGradients(Domain *) [lulesh.cc:105]
250 ../examples/lulesh/lulesh.cc Domain::DeallocateStrains Domain::DeallocateStrains(Domain *) [lulesh.cc:105]
4356 ../examples/lulesh/lulesh.cc Domain::Domain Domain::Domain(Domain *) [lulesh.cc:78]
15 ../examples/lulesh/lulesh.cc Domain::delv_eta Domain::delv_eta(const Domain *, const Index_t) [lulesh.cc:371]
15 ../examples/lulesh/lulesh.cc Domain::delv_xi Domain::delv_xi(const Domain *, const Index_t) [lulesh.cc:368]
15 ../examples/lulesh/lulesh.cc Domain::delv_zeta Domain::delv_zeta(const Domain *, const Index_t) [lulesh.cc:374]
15 ../examples/lulesh/lulesh.cc Domain::fx Domain::fx(const Domain *, const Index_t) [lulesh.cc:303]
15 ../examples/lulesh/lulesh.cc Domain::fy Domain::fy(const Domain *, const Index_t) [lulesh.cc:306]
15 ../examples/lulesh/lulesh.cc Domain::fz Domain::fz(const Domain *, const Index_t) [lulesh.cc:309]
15 ../examples/lulesh/lulesh.cc Domain::nodalMass Domain::nodalMass(const Domain *, const Index_t) [lulesh.cc:314]
15 ../examples/lulesh/lulesh.cc Domain::x Domain::x(const Domain *, const Index_t) [lulesh.cc:257]
15 ../examples/lulesh/lulesh.cc Domain::xd Domain::xd(const Domain *, const Index_t) [lulesh.cc:272]
15 ../examples/lulesh/lulesh.cc Domain::y Domain::y(const Domain *, const Index_t) [lulesh.cc:258]
15 ../examples/lulesh/lulesh.cc Domain::yd Domain::yd(const Domain *, const Index_t) [lulesh.cc:275]
15 ../examples/lulesh/lulesh.cc Domain::z Domain::z(const Domain *, const Index_t) [lulesh.cc:259]
15 ../examples/lulesh/lulesh.cc Domain::zd Domain::zd(const Domain *, const Index_t) [lulesh.cc:278]
330 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<doubl... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<doubl...
330 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<doubl... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<doubl...
1508 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelFor<CalcEnergyForElems(double*, double*, double*, doubl... type Kokkos::Impl::ParallelFor<CalcEnergyForElems(double*, double*, double*, ...
3606 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelFor<CalcFBHourglassForceForElems(Domain&, double*, Kokk... type Kokkos::Impl::ParallelFor<CalcFBHourglassForceForElems(Domain&, double*,...
2917 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelFor<CalcKinematicsForElems(Domain&, double, int)::$_0, ... type Kokkos::Impl::ParallelFor<CalcKinematicsForElems(Domain&, double, int)::...
3119 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelFor<CalcMonotonicQGradientsForElems(Domain&)::{lambda(i... type Kokkos::Impl::ParallelFor<CalcMonotonicQGradientsForElems(Domain&)::{lam...
1969 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelFor<CalcMonotonicQRegionForElems(Domain&, int, double):... type Kokkos::Impl::ParallelFor<CalcMonotonicQRegionForElems(Domain&, int, dou...
1265 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelFor<IntegrateStressForElems(Domain&, double*, double*, ... type Kokkos::Impl::ParallelFor<IntegrateStressForElems(Domain&, double*, doub...
49 ../examples/lulesh/lulesh.cc Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewVal... Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewVal...
1497 ../examples/lulesh/lulesh.cc Kokkos::Impl::TeamPolicyInternal<Kokkos::OpenMP>::TeamPolicyInternal Kokkos::Impl::TeamPolicyInternal<Kokkos::OpenMP>::TeamPolicyInternal(TeamPoli...
603 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewCopy<Kokkos::View<double*, Kokkos::LayoutLeft, Kokkos::Devi... Kokkos::Impl::ViewCopy<Kokkos::View<double*, Kokkos::LayoutLeft, Kokkos::Devi...
604 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewCopy<Kokkos::View<double*, Kokkos::LayoutLeft, Kokkos::Devi... Kokkos::Impl::ViewCopy<Kokkos::View<double*, Kokkos::LayoutLeft, Kokkos::Devi...
281 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
281 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
521 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<double*>, void>::allocate_shared... SharedAllocationRecord<void, void> * Kokkos::Impl::ViewMapping<Kokkos::ViewTr...
331 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewRemap<Kokkos::View<double*>, Kokkos::View<double*>, Kokkos:... Kokkos::Impl::ViewRemap<Kokkos::View<double*>, Kokkos::View<double*>, Kokkos:...
461 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpa... enable_if_t<std::is_trivial<double>::value && std::is_trivially_copy_assignab...
1609 ../examples/lulesh/lulesh.cc Kokkos::Impl::runtime_check_rank_host Kokkos::Impl::runtime_check_rank_host(const size_t, const bool, const size_t,...
697 ../examples/lulesh/lulesh.cc Kokkos::Impl::view_copy<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::De... Kokkos::Impl::view_copy<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::De...
697 ../examples/lulesh/lulesh.cc Kokkos::Impl::view_copy<Kokkos::View<double*>, Kokkos::View<double*> > Kokkos::Impl::view_copy<Kokkos::View<double*>, Kokkos::View<double*> >(dst, s...
2250 ../examples/lulesh/lulesh.cc Kokkos::RangePolicy<Kokkos::OpenMP>::RangePolicy Kokkos::RangePolicy<Kokkos::OpenMP>::RangePolicy(RangePolicy<Kokkos::OpenMP> ...
213 ../examples/lulesh/lulesh.cc Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor... Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor...
410 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::View<char [6]> Kokkos::View<double*>::View<char [6]>(View<double *> *, arg_label, type, cons...
410 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::View<char [7]> Kokkos::View<double*>::View<char [7]>(View<double *> *, arg_label, type, cons...
462 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::View<std::__cxx11::basic_string<char, std::char_traits... Kokkos::View<double*>::View<std::__cxx11::basic_string<char, std::char_traits...
323 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::View<std::__cxx11::basic_string<char, std::char_traits... Kokkos::View<double*>::View<std::__cxx11::basic_string<char, std::char_traits...
25 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::~View Kokkos::View<double*>::~View(View<double *> *) [lulesh.cc:409]
840 ../examples/lulesh/lulesh.cc Kokkos::abort Kokkos::abort(const const char *, const const char *) [lulesh.cc:202]
854 ../examples/lulesh/lulesh.cc Kokkos::impl_resize<, double*> type Kokkos::impl_resize<, double*>(v, const size_t, const size_t, const size...
928 ../examples/lulesh/lulesh.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
960 ../examples/lulesh/lulesh.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
21470 ../examples/lulesh/lulesh.cc LagrangeLeapFrog LagrangeLeapFrog(domain) [lulesh.cc]
226 ../examples/lulesh/lulesh.cc ResizeBuffer ResizeBuffer(const size_t) [lulesh.cc:23]
169 ../examples/lulesh/lulesh.cc _GLOBAL__sub_I_lulesh.cc _GLOBAL__sub_I_lulesh.cc() [lulesh.cc]
1836 ../examples/lulesh/lulesh.cc main int main(int, char * *) [lulesh.cc]
63 ../examples/lulesh/lulesh.cc std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::a... std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::a...
20 ../examples/lulesh/lulesh.cc std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::alloca... std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::alloca...
160 ../examples/lulesh/lulesh.cc std::operator+<char, std::char_traits<char>, std::allocator<char> > basic_string<char, std::char_traits<char>, std::allocator<char> > std::operat...
187 ../examples/lulesh/lulesh.cc std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::alloc... std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::alloc...
11 lulesh __clang_call_terminate __clang_call_terminate() [lulesh]
33 lulesh __do_global_dtors_aux __do_global_dtors_aux() [lulesh]
5 lulesh __libc_csu_fini __libc_csu_fini() [lulesh]
101 lulesh __libc_csu_init __libc_csu_init() [lulesh]
5 lulesh _dl_relocate_static_pie _dl_relocate_static_pie() [lulesh]
13 lulesh _fini _fini() [lulesh]
27 lulesh _init _init() [lulesh]
47 lulesh _start _start() [lulesh]
6 lulesh frame_dummy frame_dummy() [lulesh]
An example of instrumented module and function info output
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: shell
omnitrace-instrument -o lulesh.inst --label file line args --simulate -- lulesh
After the heuristics are applied based on the pattern in :ref:`available-module-function-output`,
the selected module and functions are:
.. code-block:: shell
AddressRange Module Function FunctionSignature
9165 ../examples/lulesh/lulesh-comm.cc CommMonoQ CommMonoQ(domain) [lulesh-comm.cc:1891]
3396 ../examples/lulesh/lulesh-comm.cc CommRecv CommRecv(domain, int, Index_t, Index_t, Index_t, Index_t, bool, bool) [lulesh...
8666 ../examples/lulesh/lulesh-comm.cc CommSBN CommSBN(domain, int, Domain_member *) [lulesh-comm.cc:926]
10212 ../examples/lulesh/lulesh-comm.cc CommSend CommSend(domain, int, Index_t, Domain_member *, Index_t, Index_t, Index_t, bo...
6823 ../examples/lulesh/lulesh-comm.cc CommSyncPosVel CommSyncPosVel(domain) [lulesh-comm.cc:1404]
1840 ../examples/lulesh/lulesh-init.cc Domain::AllocateElemPersistent Domain::AllocateElemPersistent(Domain *, Int_t) [lulesh-init.cc:94]
1384 ../examples/lulesh/lulesh-init.cc Domain::AllocateNodePersistent Domain::AllocateNodePersistent(Domain *, Int_t) [lulesh-init.cc:94]
1264 ../examples/lulesh/lulesh-init.cc Domain::BuildMesh Domain::BuildMesh(Domain *, Int_t, Int_t, Int_t) [lulesh-init.cc:308]
2312 ../examples/lulesh/lulesh-init.cc Domain::CreateRegionIndexSets Domain::CreateRegionIndexSets(Domain *, Int_t, Int_t) [lulesh-init.cc:409]
7109 ../examples/lulesh/lulesh-init.cc Domain::Domain Domain::Domain(Domain *, Int_t, Index_t, Index_t, Index_t, Index_t, int, int,...
2458 ../examples/lulesh/lulesh-init.cc Domain::SetupBoundaryConditions Domain::SetupBoundaryConditions(Domain *, Int_t) [lulesh-init.cc:409]
956 ../examples/lulesh/lulesh-init.cc Domain::SetupCommBuffers Domain::SetupCommBuffers(Domain *, Int_t) [lulesh-init.cc]
1456 ../examples/lulesh/lulesh-init.cc Domain::SetupElementConnectivities Domain::SetupElementConnectivities(Domain *, Int_t) [lulesh-init.cc:409]
721 ../examples/lulesh/lulesh-init.cc Domain::SetupSymmetryPlanes Domain::SetupSymmetryPlanes(Domain *, Int_t) [lulesh-init.cc:409]
1591 ../examples/lulesh/lulesh-init.cc Domain::SetupThreadSupportStructures Domain::SetupThreadSupportStructures(Domain *) [lulesh-init.cc:376]
1644 ../examples/lulesh/lulesh-init.cc Domain::~Domain Domain::~Domain(Domain *) [lulesh-init.cc:286]
271 ../examples/lulesh/lulesh-init.cc Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor... Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor...
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*, Kokkos::HostSpace>::View<char [10]> Kokkos::View<int*, Kokkos::HostSpace>::View<char [10]>(View<int *, Kokkos::Ho...
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*, Kokkos::HostSpace>::View<char [14]> Kokkos::View<int*, Kokkos::HostSpace>::View<char [14]>(View<int *, Kokkos::Ho...
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<char [16]> Kokkos::View<int*>::View<char [16]>(View<int *> *, arg_label, type, const siz...
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<char [19]> Kokkos::View<int*>::View<char [19]>(View<int *> *, arg_label, type, const siz...
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<char [21]> Kokkos::View<int*>::View<char [21]>(View<int *> *, arg_label, type, const siz...
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<double*, , double*, Kokkos::LayoutRight, Kokkos::Device<Kok... Kokkos::deep_copy<double*, , double*, Kokkos::LayoutRight, Kokkos::Device<Kok...
1052 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<double*> Kokkos::deep_copy<double*>(dst, value) [lulesh-init.cc]
1050 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<double, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP,... Kokkos::deep_copy<double, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP,...
7686 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenM... Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenM...
7686 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, int* [8], Kokkos::LayoutRigh... Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, int* [8], Kokkos::LayoutRigh...
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int*, , int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::O... Kokkos::deep_copy<int*, , int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::O...
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Ko... Kokkos::deep_copy<int*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Ko...
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP, K... Kokkos::deep_copy<int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP, K...
697 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (... Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (...
706 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (... Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (...
912 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
791 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
791 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
944 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
839 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
6589 ../examples/lulesh/lulesh-util.cc Kokkos::deep_copy<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP... Kokkos::deep_copy<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP...
1345 ../examples/lulesh/lulesh-util.cc ParseCommandLineOptions ParseCommandLineOptions(int, char * *, int, cmdLineOpts *) [lulesh-util.cc:67]
706 ../examples/lulesh/lulesh-util.cc VerifyAndWriteFinalOutput VerifyAndWriteFinalOutput(Real_t, locDom, Int_t, Int_t) [lulesh-util.cc:222]
13367 ../examples/lulesh/lulesh.cc ApplyMaterialPropertiesForElems ApplyMaterialPropertiesForElems(domain) [lulesh.cc:409]
982 ../examples/lulesh/lulesh.cc CalcElemFBHourglassForce CalcElemFBHourglassForce(const Real_t *, const Real_t[] *, coefficient, Real_...
2428 ../examples/lulesh/lulesh.cc CalcElemNodeNormals CalcElemNodeNormals(Real_t *, Real_t *, Real_t *, const Real_t *, const Real_...
853 ../examples/lulesh/lulesh.cc CalcElemShapeFunctionDerivatives CalcElemShapeFunctionDerivatives(const Real_t *, const Real_t *, const Real_t...
1054 ../examples/lulesh/lulesh.cc CalcKinematicsForElems CalcKinematicsForElems(domain, Real_t, Index_t) [lulesh.cc]
14160 ../examples/lulesh/lulesh.cc CalcVolumeForceForElems CalcVolumeForceForElems(domain) [lulesh.cc:409]
366 ../examples/lulesh/lulesh.cc Domain::AllocateGradients Domain::AllocateGradients(Domain *, Int_t, Int_t) [lulesh.cc:214]
475 ../examples/lulesh/lulesh.cc Domain::DeallocateGradients Domain::DeallocateGradients(Domain *) [lulesh.cc:105]
4356 ../examples/lulesh/lulesh.cc Domain::Domain Domain::Domain(Domain *) [lulesh.cc:78]
410 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::View<char [6]> Kokkos::View<double*>::View<char [6]>(View<double *> *, arg_label, type, cons...
410 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::View<char [7]> Kokkos::View<double*>::View<char [7]>(View<double *> *, arg_label, type, cons...
928 ../examples/lulesh/lulesh.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
960 ../examples/lulesh/lulesh.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
21470 ../examples/lulesh/lulesh.cc LagrangeLeapFrog LagrangeLeapFrog(domain) [lulesh.cc]
1836 ../examples/lulesh/lulesh.cc main int main(int, char * *) [lulesh.cc]
Sampling
========================================
.. note::
This capability has been deprecated in favor of :doc:`Call stack sampling <./sampling-call-stack>`.
By default, ``omnitrace-instrument`` uses ``--mode trace`` for instrumentation. The ``--mode sampling`` option
only instruments ``main`` in an executable. It activates both CPU call-stack sampling and
background system-level thread sampling by default.
Tracing capabilities which do not rely on instrumentation, such as the HIP API and kernel tracing
(which is collected by roctracer), are still available.
The Omnitrace sampling capabilities are always available, even in trace mode, but are deactivated by default.
To activate sampling in trace mode, set ``OMNITRACE_USE_SAMPLING=ON`` in the environment
or in an Omnitrace configuration file.
Embedding a default configuration
========================================
Use the ``--env`` option to embed a default configuration into the target. Although this option
works for runtime instrumentation, it is most useful when generating new binaries because the generated
binary can be used later on in a different login session when the environment might have changed.
For example, if the following commands are run,
the configuration settings are not be preserved for subsequent sessions:
.. code-block:: shell
omnitrace-instrument -o ./foo.inst -- ./foo
export OMNITRACE_USE_SAMPLING=ON
export OMNITRACE_SAMPLING_FREQ=5
omnitrace-run -- ./foo.inst
Whereas the following command preserves those environment variables:
.. code-block:: shell
omnitrace-instrument -o ./foo.samp --env OMNITRACE_USE_SAMPLING=ON OMNITRACE_SAMPLING_FREQ=5 -- ./foo
They can now be used in future sessions.
.. code-block:: shell
# will sample 5x per second
omnitrace-run -- ./foo.samp
Even though the environment variables are preserved, subsequent sessions can still override those defaults:
.. code-block:: shell
# will sample 100x per second
export OMNITRACE_SAMPLING_FREQ=100
omnitrace-run -- ./foo.samp
.. _rpath-troubleshooting:
Troubleshooting
----------------------------------------------
Checking for RPATH
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If ``ldd ./foo.inst`` from the :ref:`binary-rewriting-library-label`
section still returns ``/usr/local/lib/libfoo.so.2``, the executable could have
an rpath encoded in the binary.
This ELF entry results in the dynamic linker ignoring ``LD_LIBRARY_PATH`` if
it finds ``libfoo.so.2`` in the rpath.
Using the ``objdump`` tool, perform the following query:
.. code-block:: shell
objdump -p <exe-or-library> | egrep 'RPATH|RUNPATH'
If this produces output that appears similar to this output.:
.. code-block:: shell
RUNPATH $ORIGIN:$ORIGIN/../lib
Remove or modify the rpath to get ``foo.inst`` to resolve
to the instrumented ``libfoo.so.2`` as explained in the next section.
Modifying an RPATH
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This code snippet uses the ``patchelf`` tool to modify the rpath of the given executable
or library to ``/home/user``, which is where the instrumented libraries are located.
.. note::
This functionality requires the ``patchelf`` package.
.. code-block:: shell
patchelf --remove-rpath <exe-or-library>
patchelf --set-rpath '/home/user' <exe-or-library>
@@ -0,0 +1,630 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
****************************************************
Performing causal profiling
****************************************************
The process of causal profiling can be summarized as:
*If you speed up a given block of code by X%, the application will run Y% faster*.
Causal profiling directs parallel application developers to where they should focus their optimization
efforts by quantifying the potential impact of optimizations. Causal profiling is rooted in the concept
that *software execution speed is relative*. Speeding up a block of code by X% is mathematically equivalent
to that block of code running at its current speed if all the other code is running slower by X%.
Thus, causal profiling works by performing experiments on blocks of code during program execution which
insert pauses to slow down all other concurrently running code. During post-processing, these experiments
are translated into calculations for the potential impact of speeding up this block of code.
.. note::
Causal profiling supersedes the original critical trace feature, which was removed in Omnitrace v1.11.0.
Consider the following C++ code executing ``foo`` and ``bar`` concurrently in two different threads
where ``foo`` is ideally 30% faster than ``bar``:
.. code-block:: cpp
#include <cstddef>
#include <thread>
constexpr size_t FOO_N = 7 * 1000000000UL;
constexpr size_t BAR_N = 10 * 1000000000UL;
void foo()
{
for(volatile size_t i = 0; i < FOO_N; ++i) {}
}
void bar()
{
for(volatile size_t i = 0; i < BAR_N; ++i) {}
}
int main()
{
std::thread _threads[] = { std::thread{ foo },
std::thread{ bar } };
for(auto& itr : _threads)
itr.join();
}
No matter how many optimizations are applied to ``foo``, the application will always
require the same amount of time
because the end-to-end performance is limited by ``bar``. However, a 5% speed-up
in ``bar`` results in the
end-to-end performance improving by 5%. This trend continues linearly, with a 10% speed-up
in ``bar`` yielding a 10% speed-up in
end-to-end performance, and so on, up to a 30% speed-up, at which point ``bar`` runs as fast as ``foo``.
Any speed-up to ``bar`` beyond 30% still only yields an end-to-end performance
improvement of 30% because the application
is now limited by performance of ``foo``, as demonstrated below in the causal
profiling visualization:
.. image:: ../data/causal-foobar.png
:alt: Visualization of the performance improvements for two functions with causal profiling
The full details of the causal profiling methodology can be found in the paper
`Coz: Finding Code that Counts with Causal Profiling <http://arxiv.org/pdf/1608.03676v1.pdf>`_.
The author's implementation is publicly available on `GitHub <https://github.com/plasma-umass/coz>`_.
Getting started
========================================
To effectively use causal profiling, it is important to understand a few key
concepts, such as progress points.
Progress points
-----------------------------------
Causal profiling requires "progress points" to track progress through the code
in between samples. Progress points must be triggered in a deterministic manner via instrumentation.
This can happen in three different ways:
* `Omnitrace <https://github.com/ROCm/omnitrace>`_ can leverage the callbacks from
Kokkos-Tools, OpenMP-Tools, roctracer, etc. and the wrappers around functions for
MPI, NUMA, RCCL, etc. to act as progress points
* Users can leverage the :doc:`runtime instrumentation capabilities <./instrumenting-rewriting-binary-application>`
to insert progress points
* Users can leverage :doc:`User APIs <../how-to/using-omnitrace-api>`,
such as ``OMNITRACE_CAUSAL_PROGRESS``
.. note::
Binary rewrite to insert progress points is not supported. When a rewritten binary
runs, Dyninst translates the instruction pointer address in order to perform
the instrumentation. As a result, call stack samples never return instruction
pointer addresses within the valid Omnitrace range.
Key concepts
-----------------------------------
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
| Concept | Setting | Options | Description |
+==================+=====================================+==================================+============================================+
| Backend | ``OMNITRACE_CAUSAL_BACKEND`` | ``perf``, ``timer`` | Backend for recording samples required |
| | | | to calculate the virtual speed-up |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
| Mode | ``OMNITRACE_CAUSAL_MODE`` | ``function``, ``line`` | Select an entire function or individual |
| | | | line of code for causal experiments |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
| End-to-end | ``OMNITRACE_CAUSAL_END_TO_END`` | Boolean | Perform a single experiment during the |
| | | | entire run (does not require |
| | | | progress points) |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
| Fixed speed-up | ``OMNITRACE_CAUSAL_FIXED_SPEEDUP`` | one or more values from [0, 100] | Virtual speed-up or pool of virtual |
| | | | speed-ups to randomly select |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
| Binary scope | ``OMNITRACE_CAUSAL_BINARY_SCOPE`` | regular expression(s) | Dynamic binaries containing code for |
| | | | experiments |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
| Source scope | ``OMNITRACE_CAUSAL_SOURCE_SCOPE`` | regular expression(s) | ``<file>`` and/or ``<file>:<line>`` |
| | | | containing code to include in experiments |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
| Function scope | ``OMNITRACE_CAUSAL_FUNCTION_SCOPE`` | regular expression(s) | Restricts experiments to matching |
| | | | functions (function mode) or lines of |
| | | | code within matching functions (line mode) |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
.. note::
* Binary scope defaults to ``%MAIN%`` (in the executable), but the scope can be expanded to include linked libraries.
* ``<file>`` and ``<file>:<line>`` support requires debug info (for example, the code must be compiled with ``-g`` or, preferably, with ``-g3``)
* Function mode does not require debug info but does not support stripped binaries
Backends
-----------------------------------
There are two backends to choose from: ``perf`` and ``timer``.
They are used to record the samples required to calculate the virtual speedup.
Both backends interrupt each thread 1000 times per second (of CPU-time) to apply the virtual speed-ups.
The difference between each backend is how the samples are recorded.
There are three key differences between the two backends:
* the ``perf`` backend requires Linux Perf and elevated security priviledges
* the ``perf`` backend interrupts the application less frequently whereas the ``timer`` backend
interrupts the application 1000 times per second of realtime
* the ``timer`` backend has less accurate call stacks due to instruction pointer skid
In general, the ``perf`` backend is preferred over the ``timer`` backend when sufficient
security priviledges permit its usage.
If ``OMNITRACE_CAUSAL_BACKEND`` is set to ``auto``, Omnitrace falls back
to using the ``timer`` backend only if
the ``perf`` backend fails. If ``OMNITRACE_CAUSAL_BACKEND`` is
set to ``perf`` and using this backend fails, Omnitrace aborts.
Instruction pointer skid
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Instruction pointer (IP) skid measures how many instructions run after the event of interest
before the program actually stops. The IP skid is calculated by subtracting
the location of the IP at the point of interest from the location of the IP
when the kernel finally stops the application.
For the ``timer`` backend, this translates to the
difference in the IP between when the timer generated a signal and when the
signal was actually generated. Although IP skid still occurs with the ``perf`` backend,
it is much more pronounced with the ``timer`` backend due to the overhead of pausing the entire thread.
This means the ``timer`` backend tends to have a lower resolution than the ``perf`` backend,
especially in ``line`` mode.
Installing Linux Perf
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Linux Perf is built into the kernel and may already be installed
(for instance, it is included in the default kernel for OpenSUSE).
The official method of checking whether Linux Perf is installed is
checking for the existence of the file
``/proc/sys/kernel/perf_event_paranoid``. If the file exists, the kernel has Perf installed.
If this file does not exist, as with Debian-based systems like Ubuntu, run the following command as superuser:
.. code-block:: shell
apt-get install linux-tools-common linux-tools-generic linux-tools-$(uname -r)
and reboot your computer. In order to use the ``perf`` backend, the value
of ``/proc/sys/kernel/perf_event_paranoid``
should be less than or equal to 2. If the value in this file is greater than 2, you can't
use the ``perf`` backend.
To update the paranoid level temporarily until the system is rebooted, run
one of the following commands
as a superuser (where ``PARANOID_LEVEL=<N>`` has a value of ``<N>`` in the range ``[-1, 2]``):
.. code-block:: shell
echo ${PARANOID_LEVEL} | sudo tee /proc/sys/kernel/perf_event_paranoid
or
.. code-block:: shell
sysctl kernel.perf_event_paranoid=${PARANOID_LEVEL}
To make the paranoid level persistent after a reboot, add ``kernel.perf_event_paranoid=<N>``
(where ``<N>`` is the desired paranoid level) to the ``/etc/sysctl.conf`` file.
Speed-up prediction variability and the omnitrace-causal executable
-----------------------------------------------------------------------
Causal profiling typically requires running the application several times in
order to adequately sample all the code domains, experiment
with speed-ups and other techniques, and resolve statistical fluctuations.
The ``omnitrace-causal`` executable is designed to simplify this procedure:
.. code-block:: shell
$ omnitrace-causal --help
[omnitrace-causal] Usage: ./bin/omnitrace-causal [ --help (count: 0, dtype: bool)
--version (count: 0, dtype: bool)
--monochrome (max: 1, dtype: bool)
--debug (max: 1, dtype: bool)
--verbose (count: 1)
--config (min: 0, dtype: filepath)
--launcher (count: 1, dtype: executable)
--generate-configs (min: 0, dtype: folder)
--no-defaults (min: 0, dtype: bool)
--mode (count: 1, dtype: string)
--output-name (min: 1, dtype: filename)
--reset (max: 1, dtype: bool)
--end-to-end (max: 1, dtype: bool)
--wait (count: 1, dtype: seconds)
--duration (count: 1, dtype: seconds)
--iterations (count: 1, dtype: int)
--speedups (min: 0, dtype: integers)
--binary-scope (min: 0, dtype: integers)
--source-scope (min: 0, dtype: integers)
--function-scope (min: 0, dtype: regex-list)
--binary-exclude (min: 0, dtype: integers)
--source-exclude (min: 0, dtype: integers)
--function-exclude (min: 0, dtype: regex-list)
]
Causal profiling usually requires multiple runs to reliably resolve the speedup estimates.
This executable is designed to streamline that process.
For example (assume all commands end with \'-- <exe> <args>\'):
omnitrace-causal -n 5 -- <exe> # runs <exe> 5x with causal profiling enabled
omnitrace-causal -s 0 5,10,15,20 # runs <exe> 2x with virtual speedups:
# - 0
# - randomly selected from 5, 10, 15, and 20
omnitrace-causal -F func_A func_B func_(A|B) # runs <exe> 3x with the function scope limited to:
# 1. func_A
# 2. func_B
# 3. func_A or func_B
General tips:
- Insert progress points at hotspots in your code or use omnitrace\'s runtime instrumentation
- Note: binary rewrite will produce a incompatible new binary
- Run omnitrace-causal in "function" mode first (does not require debug info)
- Run omnitrace-causal in "line" mode when you are targeting one function (requires debug info)
- Preferably, use predictions from the "function" mode to determine which function to target
- Limit the virtual speedups to a smaller pool, e.g., 0,5,10,25,50, to get reliable predictions quicker
- Make use of the binary, source, and function scope to limit the functions/lines selected for experiments
- Note: source scope requires debug info
Options:
-h, -?, --help Shows this page
--version Prints the version and exit
[DEBUG OPTIONS]
--monochrome Disable colorized output
--debug Debug output
-v, --verbose Verbose output
[GENERAL OPTIONS]
-c, --config Base configuration file
-l, --launcher When running MPI jobs, omnitrace-causal needs to be *before* the executable which launches the MPI processes (i.e.
before `mpirun`, `srun`, etc.). Pass the name of the target executable (or a regex for matching to the name of the
target) for causal profiling, e.g., `omnitrace-causal -l foo -- mpirun -n 4 foo`. This ensures that the omnitrace
library is LD_PRELOADed on the proper target
-g, --generate-configs Generate config files instead of passing environment variables directly. If no arguments are provided, the config files
will be placed in ${PWD}/omnitrace-causal-config folder
--no-defaults Do not activate default features which are recommended for causal profiling. For example: PID-tagging of output files
and timestamped subdirectories are disabled by default. Kokkos tools support is added by default
(OMNITRACE_USE_KOKKOSP=ON) because, for Kokkos applications, the Kokkos-Tools callbacks are used for progress points.
Activation of OpenMP tools support is similar
[CAUSAL PROFILING OPTIONS (General)]
(These settings will be applied to all causal profiling runs)
-m, --mode [ function (func) | line ]
Causal profiling mode
-o, --output-name Output filename of causal profiling data w/o extension
-r, --reset Overwrite any existing experiment results during the first run
-e, --end-to-end Single causal experiment for the entire application runtime
-w, --wait Set the wait time (i.e. delay) before starting the first causal experiment (in seconds)
-d, --duration Set the length of time (in seconds) to perform causal experimentationafter the first experiment is started. Once this
amount of time has elapsed, no more causal experiments will be started but any currently running experiment will be
allowed to finish.
-n, --iterations Number of times to repeat the combination of run configurations
[CAUSAL PROFILING OPTIONS (Combinatorial)]
(Each individual argument to these options will multiply the number runs by the number of arguments and the number of
iterations. E.g. -n 2 -B "MAIN" -F "foo" "bar" will produce 4 runs: 2 iterations x 1 binary scope x 2 function scopes
(MAIN+foo, MAIN+bar, MAIN+foo, MAIN+bar))
-s, --speedups Pool of virtual speedups to sample from during experimentation. Each space designates a group and multiple speedups can
be grouped together by commas, e.g. -s 0 0,10,20-50 is two groups: group #1 is \'0\' and group #2 is \'0 10 20 25 30 35 40
45 50\'
-B, --binary-scope Restricts causal experiments to the binaries matching the list of regular expressions. Each space designates a group
and multiple scopes can be grouped together with a semi-colon
-S, --source-scope Restricts causal experiments to the source files or source file + lineno pairs (i.e. <file> or <file>:<line>) matching
the list of regular expressions. Each space designates a group and multiple scopes can be grouped together with a
semi-colon
-F, --function-scope Restricts causal experiments to the functions matching the list of regular expressions. Each space designates a group
and multiple scopes can be grouped together with a semi-colon
-BE, --binary-exclude Excludes causal experiments from being performed on the binaries matching the list of regular expressions. Each space
designates a group and multiple excludes can be grouped together with a semi-colon
-SE, --source-exclude Excludes causal experiments from being performed on the code from the source files or source file + lineno pair (i.e.
<file> or <file>:<line>) matching the list of regular expressions. Each space designates a group and multiple excludes
can be grouped together with a semi-colon
-FE, --function-exclude Excludes causal experiments from being performed on the functions matching the list of regular expressions. Each space
designates a group and multiple excludes can be grouped together with a semi-colon
Examples
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: shell
#!/bin/bash -e
module load omnitrace
N=20
I=3
# when providing speedups to omnitrace-causal, speedup
# groups are separated by a space so "0,10" results in
# one speedup group where omnitrace samples from
# the speedup set of {0, 10}. Passing "0 10" (without
# quotes to omnitrace-causal multiplies the
# number of runs by 2, where the first half of the
# runs instruct omnitrace to only use 0 as the
# speedup and the second half of the runs instruct
# omnitrace to only use 10 as the speedup.
SPEEDUPS="0,0,0,10,20,30,40,50,50,75,75,75,90,90,90"
# thus, -s ${SPEEDUPS} only multiplies the number
# of runs by 1 whereas -S ${SPEEDUPS_E2E} multiplies
# the number of runs by 15:
# - 3 runs with speedup of 0
# - 1 run for each of the speedups 10, 20, 30, and 40
# - 2 runs with speedup of 50
# - 3 runs with speedup of 75
# - 3 runs with speedup of 90
SPEEDUPS_E2E=$(echo "${SPEEDUPS}" | sed \'s/,/ /g\')
# 20 iterations in function mode with 1 speedup group
# and source scope set to .cpp files
#
# outputs to files:
# - causal/experiments.func.coz
# - causal/experiments.func.json
#
# total executions: 20
#
omnitrace-causal \
-n ${N} \
-s ${SPEEDUPS} \
-m function \
-o experiments.func \
-S ".*\\.cpp" \
-- \
./causal-omni-cpu "${@}"
# 20 iterations in line mode with 1 speedup group
# and source scope restricted to lines 100 and 110
# in the causal.cpp file.
#
# outputs to files:
# - causal/experiments.line.coz
# - causal/experiments.line.json
#
# total executions: 20
#
omnitrace-causal \
-n ${N} \
-s ${SPEEDUPS} \
-m line \
-o experiments.line \
-S "causal\\.cpp:(100|110)" \
-- \
./causal-omni-cpu "${@}"
# 3 iterations in function mode of 15 singular speedups
# in end-to-end mode with 2 different function scopes
# where one is restricted to "cpu_slow_func" and
# another is restricted to "cpu_fast_func".
#
# outputs to files:
# - causal/experiments.func.e2e.coz
# - causal/experiments.func.e2e.json
#
# total executions: 90
#
omnitrace-causal \
-n ${I} \
-s ${SPEEDUPS_E2E} \
-m func \
-e \
-o experiments.func.e2e \
-F "cpu_slow_func" \
"cpu_fast_func" \
-- \
./causal-omni-cpu "${@}"
# 3 iterations in line mode of 15 singular speedups
# in end-to-end mode with 2 different source scopes
# where one is restricted to line 100 in causal.cpp
# and another is restricted to line 110 in causal.cpp.
#
# outputs to files:
# - causal/experiments.line.e2e.coz
# - causal/experiments.line.e2e.json
#
# total executions: 90
#
omnitrace-causal \
-n ${I} \
-s ${SPEEDUPS_E2E} \
-m line \
-e \
-o experiments.line.e2e \
-S "causal\\.cpp:100" \
"causal\\.cpp:110" \
-- \
./causal-omni-cpu "${@}"
export OMP_NUM_THREADS=8
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
# set number of iterations to 5
N=5
# 5 iterations in function mode of 1 speedup
# group with the source scope restricted
# to files containing "lulesh" in their filename
# and exclude functions which start with "Kokkos::"
# or "std::enable_if".
#
# outputs to files:
# - causal/experiments.func.coz
# - causal/experiments.func.json
#
# total executions: 5
#
# First of 5 executions overwrites any
# existing causal/experiments.func.(coz|json)
# file due to "--reset" argument
#
omnitrace-causal \
--reset \
-n ${N} \
-s ${SPEEDUPS} \
-m func \
-o experiments.func \
-S "lulesh.*" \
-FE "^(Kokkos::|std::enable_if)" \
-- \
./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p
# 5 iterations in line mode of 1 speedup
# group with the source scope restricted
# to files containing "lulesh" in their filename
# and exclude functions which start with "exec_range"
# or "execute" and which contain either
# "construct_shared_allocation" or "._omp_fn." in
# the function name.
#
# outputs to files:
# - causal/experiments.line.coz
# - causal/experiments.line.json
#
# total executions: 5
#
# First of 5 executions overwrites any
# existing causal/experiments.line.(coz|json)
# file due to "--reset" argument
#
omnitrace-causal \
--reset \
-n ${N} \
-s ${SPEEDUPS} \
-m line \
-o experiments.line \
-S "lulesh.*" \
-FE "^(exec_range|execute);construct_shared_allocation;\\._omp_fn\\." \
-- \
./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p
# 5 iterations in line mode of 1 speedup
# group with the source scope restricted
# to files whose basename is "lulesh.cc"
# for 3 different functions:
# - ApplyMaterialPropertiesForElems
# - CalcHourglassControlForElems
# - CalcVolumeForceForElems
#
# outputs to files:
# - causal/experiments.line.targeted.coz
# - causal/experiments.line.targeted.json
#
# total executions: 15
#
# First of 5 executions overwrites any
# existing causal/experiments.line.(coz|json)
# file due to "--reset" argument
#
omnitrace-causal \
--reset \
-n ${N} \
-s ${SPEEDUPS} \
-m line \
-o experiments.line.targeted \
-F "ApplyMaterialPropertiesForElems" \
"CalcHourglassControlForElems" \
"CalcVolumeForceForElems" \
-S "lulesh\\.cc" \
-- \
./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p
Using omnitrace-causal with other launchers like mpirun
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``omnitrace-causal`` executable is intended to assist with application replay
and is designed to always be at the start of the command line as the primary process.
``omnitrace-causal`` typically adds a ``LD_PRELOAD`` of the Omnitrace libraries
into the environment before launching the command to inject the functionality
required to start the causal profiling tooling. However, this is problematic
when the target application for causal profiling uses a launcher, in which case
it is listed as an argument rather than as the main application. For example,
``foo`` is the target application for profiling, but the command to run it is
``mpirun -n 2 foo``. Running the command ``omnitrace-causal -- mpirun -n 2 foo``
applies the causal profiling to ``mpirun`` instead of ``foo``.
``omnitrace-causal`` remedies this by providing a command-line option ``-l` / `--launcher``
to indicate the target application is using a launcher script/executable. The
argument to the command-line option is the name of, or regular expression for, the target application
on the command line. When ``--launcher`` is used, ``omnitrace-causal`` generates
all the replay configurations and runs them but delays adding the ``LD_PRELOAD``. Instead it
inserts a call to itself into the command line right before the target
application. This recursive call inherits the configuration from
the parent ``omnitrace-causal`` executable, inserts an ``LD_PRELOAD`` into the environment,
and calls ``execv`` to replace itself with the new process launched by the target
application.
In other words, the following command:
.. code-block:: shell
omnitrace-causal -l foo -n 3 -- mpirun -n 2 foo`
Effectively results in:
.. code-block:: shell
mpirun -n 2 omnitrace-causal -- foo
mpirun -n 2 omnitrace-causal -- foo
mpirun -n 2 omnitrace-causal -- foo
Visualizing the causal output
-------------------------------------------------------------------------
Omnitrace generates ``causal/experiments.json`` and ``causal/experiments.coz`` in
``${OMNITRACE_OUTPUT_PATH}/${OMNITRACE_OUTPUT_PREFIX}``. Visit
`plasma-umass.org/coz <https://plasma-umass.org/coz/>`_ to open the ``*.coz`` file.
Omnitrace versus Coz
=======================================
This comparison is intended for readers who are familiar with the
`Coz profiler <https://github.com/plasma-umass/coz>`_.
Omnitrace provides several additional features and utilities for causal profiling:
.. csv-table::
:header: "Feature", "Coz", "Omnitrace", "Notes"
:widths: 20, 60, 60, 30
"Debug info", "requires debug info in DWARF v3 format (``-gdwarf-3``)", "optional, supports any DWARF format version", "See Note #1 below"
"Experiment selection", "``<file>:<line>``", "``<function>`` or ``<file>:<line>``", "See Note #2 below"
"Experiment speed-ups", "Randomly samples b/t 0..100 in increments of 5 or one fixed speed-up", "Supports specifying smaller subset", "See Note #3 below"
"Scope options", "Supports binary and source scopes", "Supports binary, source, and function scopes", "See Note #4, #5, and #6 below"
"Scope inclusion", "Uses ``%`` as a wildcard for binary and source scopes", "Full regex support for binary, source, and function scopes", ""
"Scope exclusion", "Not supported", "Supports regexes for excluding binary/source/function", "See Note #7 below"
"Call-stack sampling", "Linux Perf", "Linux Perf, libunwind", "See Note #8 below"
.. note::
#. Omnitrace supports a "function" mode which does not require debug info.
#. Omnitrace supports selecting an entire range of instruction pointers for a function instead
of an instruction pointer for one line. In large code bases, "function" mode
can resolve in fewer iterations. After a target function is identified, you can
switch to line mode and limit the function scope to the target function.
#. Omnitrace supports randomly sampling from subsets, e.g. { 0, 0, 5, 10 }
where 0% is randomly selected 50% of time and 5% and 10% are randomly selected 25% of the time.
#. Omnitrace and COZ have the same definition for binary scope, which is the binaries
loaded at runtime (the executable and linked libraries).
#. Omnitrace "source scope" supports both ``<file>`` and ``<file>:<line>`` formats
in contrast to the COZ "source scope" which requires ``<file>:<line>`` format.
#. Omnitrace supports a "function" scope which narrows the function and lines
which are eligible for causal experiments to those within the matching functions.
#. Omnitrace supports a second filter on scopes for removing binary/source/function
caught by an inclusive match. For example ``BINARY_SCOPE=.*`` and ``BINARY_EXCLUDE=libmpi.*``
initially includes all binaries but exclude regex removes MPI libraries.
#. In Omnitrace, the Linux Perf backend is preferred over use libunwind. However,
Linux Perf usage can be restricted for security reasons.
Omnitrace falls back to using a second POSIX timer and libunwind if
Linux Perf is not available.
@@ -0,0 +1,334 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
****************************************************
Profiling Python scripts
****************************************************
`Omnitrace <https://github.com/ROCm/omnitrace>`_ supports profiling Python code at the
source level and the script level.
Python support is enabled via the ``OMNITRACE_USE_PYTHON`` and the
``OMNITRACE_PYTHON_VERSIONS="<MAJOR>.<MINOR>`` CMake options.
Alternatively, to build multiple Python versions, use
``OMNITRACE_PYTHON_VERSIONS="<MAJOR>.<MINOR>;[<MAJOR>.<MINOR>]"``,
and ``OMNITRACE_PYTHON_ROOT_DIRS="/path/to/version;[/path/to/version]"`` instead of ``OMNITRACE_PYTHON_VERSION``.
When building multiple Python versions, the length of the ``OMNITRACE_PYTHON_VERSIONS``
and ``OMNITRACE_PYTHON_ROOT_DIRS`` lists must
be the same size.
.. note::
When using Omnitrace with Python programs, the Python interpreter major and minor version (e.g. 3.7)
must match the interpreter major and minor version
used when compiling the Python bindings. When building Omnitrace,
the shared object file ``libpyomnitrace.<IMPL>-<VERSION>-<ARCH>-<OS>-<ABI>.so`` is generated
where ``IMPL`` is the Python implementation, ``VERSION`` is the major and minor
version, ``ARCH`` is the architecture,
``OS`` is the operating system, and ``ABI`` is the application binary interface,
for example, ``libpyomnitrace.cpython-38-x86_64-linux-gnu.so``.
Getting Started
========================================
The Omnitrace Python package is installed in ``lib/pythonX.Y/site-packages/omnitrace``.
To ensure the Python interpreter can find the Omnitrace package,
add this path to the ``PYTHONPATH`` environment variable, as in the following example:
.. code-block:: shell
export PYTHONPATH=/opt/omnitrace/lib/python3.8/site-packages:${PYTHONPATH}
Both the ``share/omnitrace/setup-env.sh`` script and the module file in
``share/modulefiles/omnitrace`` automatically handle the prefixing of the ``PYTHONPATH``
environment variable.
Running Omnitrace on a Python script
========================================
Omnitrace provides an ``omnitrace-python`` helper bash script which
ensures ``PYTHONPATH`` is properly set and the correct Python interpreter is used.
This means the following commands are effectively equivalent:
.. code-block:: shell
omnitrace-python --help
and
.. code-block:: shell
export PYTHONPATH=/opt/omnitrace/lib/python3.8/site-packages:${PYTHONPATH}
python3.8 -m omnitrace --help
.. note::
``omnitrace-python`` and ``python -m omnitrace`` use the same command-line syntax
as the other ``omnitrace`` executables (``omnitrace-python <OMNITRACE_ARGS> -- <SCRIPT> <SCRIPT_ARGS>``)
and has similar options.
Command line options
-----------------------------------
Use ``omnitrace-python --help`` to view the available options:
.. code-block:: shell
usage: omnitrace [-h] [-v VERBOSITY] [-b] [-c FILE] [-s FILE] [-F [BOOL]] [--label [{args,file,line} [{args,file,line} ...]]] [-I FUNC [FUNC ...]] [-E FUNC [FUNC ...]] [-R FUNC [FUNC ...]] [-MI FILE [FILE ...]] [-ME FILE [FILE ...]] [-MR FILE [FILE ...]] [--trace-c [BOOL]]
optional arguments:
-h, --help show this help message and exit
-v VERBOSITY, --verbosity VERBOSITY
Logging verbosity
-b, --builtin Put 'profile' in the builtins. Use '@profile' to decorate a single function, or 'with profile:' to profile a single section of code.
-c FILE, --config FILE
OmniTrace configuration file
-s FILE, --setup FILE
Code to execute before the code to profile
-F [BOOL], --full-filepath [BOOL]
Encode the full function filename (instead of basename)
--label [{args,file,line} [{args,file,line} ...]]
Encode the function arguments, filename, and/or line number into the profiling function label
-I FUNC [FUNC ...], --function-include FUNC [FUNC ...]
Include any entries with these function names
-E FUNC [FUNC ...], --function-exclude FUNC [FUNC ...]
Filter out any entries with these function names
-R FUNC [FUNC ...], --function-restrict FUNC [FUNC ...]
Select only entries with these function names
-MI FILE [FILE ...], --module-include FILE [FILE ...]
Include any entries from these files
-ME FILE [FILE ...], --module-exclude FILE [FILE ...]
Filter out any entries from these files
-MR FILE [FILE ...], --module-restrict FILE [FILE ...]
Select only entries from these files
--trace-c [BOOL] Enable profiling C functions
usage: python3 -m omnitrace <OMNITRACE_ARGS> -- <SCRIPT> <SCRIPT_ARGS>
.. note::
The ``--trace-c`` option does not incorporate Omnitrace's dynamic instrumentation support.
It only enables profiling the underlying C function call within the Python interpreter.
Selective instrumentation
-----------------------------------
Similar to the ``omnitrace-instrument`` executable, command-line options exist for restricting,
including, and excluding certain functions and modules, for example, ``--function-exclude "^__init__$"``.
Alternatively, add the ``@profile`` decorator to the primary function of interest
in your program and use the ``-b`` / ``--builtin`` command-line option to narrow the scope of the
instrumentation to this function and its children.
Consider the following Python code (``example.py``):
.. code-block:: python
import sys
def fib(n):
return n if n < 2 else (fib(n - 1) + fib(n - 2))
def inefficient(n):
a = 0
for i in range(n):
a += i
for j in range(n):
a += j
return a
def run(n):
return fib(n) + inefficient(n)
if __name__ == "__main__":
run(20)
Running ``omnitrace-python ./example.py`` with ``OMNITRACE_PROFILE=ON`` and
``OMNITRACE_TIMEMORY_COMPONENTS=trip_count`` produces the following:
.. code-block:: shell
|-------------------------------------------------------------------------------------------|
| COUNTS NUMBER OF INVOCATIONS |
|-------------------------------------------------------------------------------------------|
| LABEL | COUNT | DEPTH | METRIC | SUM |
|---------------------------------------------------|--------|--------|------------|--------|
| |0>>> run | 1 | 0 | trip_count | 1 |
| |0>>> |_fib | 1 | 1 | trip_count | 1 |
| |0>>> |_fib | 2 | 2 | trip_count | 2 |
| |0>>> |_fib | 4 | 3 | trip_count | 4 |
| |0>>> |_fib | 8 | 4 | trip_count | 8 |
| |0>>> |_fib | 16 | 5 | trip_count | 16 |
| |0>>> |_fib | 32 | 6 | trip_count | 32 |
| |0>>> |_fib | 64 | 7 | trip_count | 64 |
| |0>>> |_fib | 128 | 8 | trip_count | 128 |
| |0>>> |_fib | 256 | 9 | trip_count | 256 |
| |0>>> |_fib | 512 | 10 | trip_count | 512 |
| |0>>> |_fib | 1024 | 11 | trip_count | 1024 |
| |0>>> |_fib | 2026 | 12 | trip_count | 2026 |
| |0>>> |_fib | 3632 | 13 | trip_count | 3632 |
| |0>>> |_fib | 5020 | 14 | trip_count | 5020 |
| |0>>> |_fib | 4760 | 15 | trip_count | 4760 |
| |0>>> |_fib | 2942 | 16 | trip_count | 2942 |
| |0>>> |_fib | 1152 | 17 | trip_count | 1152 |
| |0>>> |_fib | 274 | 18 | trip_count | 274 |
| |0>>> |_fib | 36 | 19 | trip_count | 36 |
| |0>>> |_fib | 2 | 20 | trip_count | 2 |
| |0>>> |_inefficient | 1 | 1 | trip_count | 1 |
|-------------------------------------------------------------------------------------------|
If the ``inefficient`` function is decorated with ``@profile`` as follows:
.. code-block:: python
@profile
def inefficient(n):
# ...
And then run using the command ``omnitrace-python -b -- ./example.py``, Omnitrace produces this output:
.. code-block:: shell
|-----------------------------------------------------------|
| COUNTS NUMBER OF INVOCATIONS |
|-----------------------------------------------------------|
| LABEL | COUNT | DEPTH | METRIC | SUM |
|-------------------|--------|--------|------------|--------|
| |0>>> inefficient | 1 | 0 | trip_count | 1 |
|-----------------------------------------------------------|
Omnitrace Python source instrumentation
========================================
Starting with the unmodified ``example.py`` script above, import the ``omnitrace`` module:
.. code-block:: python
import sys
import omnitrace # import omnitrace
def fib(n):
# ... etc. ...
Next, add ``@omnitrace.profile()`` to the ``run`` function:
.. code-block:: python
@omnitrace.profile()
def run(n):
# ...
Alternatively, use ``omnitrace.profile()`` as a context-manager around ``run(20)``:
.. code-block:: python
if __name__ == "__main__":
with omnitrace.profile():
run(20)
The results for both of the source-level instrumentation modes are identical to the
original ``omnitrace-python ./example.py`` results:
.. code-block:: shell
|-------------------------------------------------------------------------------------------|
| COUNTS NUMBER OF INVOCATIONS |
|-------------------------------------------------------------------------------------------|
| LABEL | COUNT | DEPTH | METRIC | SUM |
|---------------------------------------------------|--------|--------|------------|--------|
| |0>>> run | 1 | 0 | trip_count | 1 |
| |0>>> |_fib | 1 | 1 | trip_count | 1 |
| |0>>> |_fib | 2 | 2 | trip_count | 2 |
| |0>>> |_fib | 4 | 3 | trip_count | 4 |
| |0>>> |_fib | 8 | 4 | trip_count | 8 |
| |0>>> |_fib | 16 | 5 | trip_count | 16 |
| |0>>> |_fib | 32 | 6 | trip_count | 32 |
| |0>>> |_fib | 64 | 7 | trip_count | 64 |
| |0>>> |_fib | 128 | 8 | trip_count | 128 |
| |0>>> |_fib | 256 | 9 | trip_count | 256 |
| |0>>> |_fib | 512 | 10 | trip_count | 512 |
| |0>>> |_fib | 1024 | 11 | trip_count | 1024 |
| |0>>> |_fib | 2026 | 12 | trip_count | 2026 |
| |0>>> |_fib | 3632 | 13 | trip_count | 3632 |
| |0>>> |_fib | 5020 | 14 | trip_count | 5020 |
| |0>>> |_fib | 4760 | 15 | trip_count | 4760 |
| |0>>> |_fib | 2942 | 16 | trip_count | 2942 |
| |0>>> |_fib | 1152 | 17 | trip_count | 1152 |
| |0>>> |_fib | 274 | 18 | trip_count | 274 |
| |0>>> |_fib | 36 | 19 | trip_count | 36 |
| |0>>> |_fib | 2 | 20 | trip_count | 2 |
| |0>>> |_inefficient | 1 | 1 | trip_count | 1 |
|-------------------------------------------------------------------------------------------|
.. note::
When ``omnitrace-python`` is used without built-ins, the profiling results can be cluttered by the
numerous functions called when more complex modules are imported, such as ``import numpy``.
Omnitrace Python source instrumentation configuration
-------------------------------------------------------------
Within the Python source code, the profiler can be configured by directly
modifying the ``omnitrace.profiler.config`` data fields.
.. code-block:: python
import sys
def fib(n):
return n if n < 2 else (fib(n - 1) + fib(n - 2))
def inefficient(n):
a = 0
for i in range(n):
a += i
for j in range(n):
a += j
return a
def run(n):
return fib(n) + inefficient(n)
if __name__ == "__main__":
from omnitrace.profiler import config
from omnitrace import profile
config.include_args = True
config.include_filename = False
config.include_line = False
config.restrict_functions += ["fib", "run"]
with profile():
run(5)
Executing this script produces the following:
.. code-block:: shell
|------------------------------------------------------------------|
| COUNTS NUMBER OF INVOCATIONS |
|------------------------------------------------------------------|
| LABEL | COUNT | DEPTH | METRIC | SUM |
|--------------------------|--------|--------|------------|--------|
| |0>>> run(n=5) | 1 | 0 | trip_count | 1 |
| |0>>> |_fib(n=5) | 1 | 1 | trip_count | 1 |
| |0>>> |_fib(n=4) | 1 | 2 | trip_count | 1 |
| |0>>> |_fib(n=3) | 1 | 3 | trip_count | 1 |
| |0>>> |_fib(n=2) | 1 | 4 | trip_count | 1 |
| |0>>> |_fib(n=1) | 1 | 5 | trip_count | 1 |
| |0>>> |_fib(n=0) | 1 | 5 | trip_count | 1 |
| |0>>> |_fib(n=1) | 1 | 4 | trip_count | 1 |
| |0>>> |_fib(n=2) | 1 | 3 | trip_count | 1 |
| |0>>> |_fib(n=1) | 1 | 4 | trip_count | 1 |
| |0>>> |_fib(n=0) | 1 | 4 | trip_count | 1 |
| |0>>> |_fib(n=3) | 1 | 2 | trip_count | 1 |
| |0>>> |_fib(n=2) | 1 | 3 | trip_count | 1 |
| |0>>> |_fib(n=1) | 1 | 4 | trip_count | 1 |
| |0>>> |_fib(n=0) | 1 | 4 | trip_count | 1 |
| |0>>> |_fib(n=1) | 1 | 3 | trip_count | 1 |
|------------------------------------------------------------------|
@@ -0,0 +1,404 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
****************************************************
Sampling the call stack
****************************************************
`Omnitrace <https://github.com/ROCm/omnitrace>`_ can use call-stack sampling
on a binary instrumented with either the ``omnitrace`` executable
or the ``omnitrace-sample`` executable.
For example, all of the following commands are effectively equivalent:
* Binary rewrite with only the instrumentation necessary to start and stop sampling
.. code-block:: shell
omnitrace-instrument -M sampling -o foo.inst -- foo
omnitrace-run -- ./foo.inst
* Runtime instrumentation with only the instrumentation necessary to start and stop sampling
.. code-block:: shell
omnitrace-instrument -M sampling -- foo
* No instrumentation required
.. code-block:: shell
omnitrace-sample -- foo
.. note::
Set ``OMNITRACE_USE_SAMPLING=ON`` to activate call-stack sampling when executing an instrumented binary.
All ``omnitrace-instrument -M sampling`` (subsequently referred to as "instrumented-sampling")
does is wrap the ``main`` of the executable with initialization
before ``main`` starts and finalization after ``main`` ends.
This can be accomplished without instrumentation through a ``LD_PRELOAD``
of a library containing a dynamic symbol wrapper around ``__libc_start_main``.
The use of ``omnitrace-sample`` is **recommended** over
``omnitrace-instrument -M sampling`` when binary instrumentation
is not necessary. This is for a number of reasons:
* ``omnitrace-sample`` provides command-line options for controlling the Omnitrace feature set instead of
requiring configuration files or environment variables
* Despite the fact that instrumented-sampling only requires inserting snippets
around one function (``main``), Dyninst
does not have a feature for specifying that parsing and processing all the
other symbols in the binary is unnecessary.
In the best-case scenario when the target binary is relatively small,
instrumented-sampling has a slightly slower launch time,
but in the worst case scenarios it requires a significant amount of time and memory to launch.
* ``omnitrace-sample`` is fully compatible with MPI. For example,
the command ``mpirun -n 2 omnitrace-sample -- foo`` is valid,
whereas ``mpirun -n 2 omnitrace-instrument -M sampling -- foo``
is incompatible with some MPI distributions (particularly OpenMPI). This is because
MPI prohibits forking within an MPI rank.
* When MPI and binary instrumentation are both involved, two steps are required:
performing a binary rewrite of the executable and then using the instrumented executable
in lieu of the original executable. ``omnitrace-sample`` is therefore much easier to use with MPI.
The omnitrace-sample executable
========================================
View the help menu of ``omnitrace-sample`` with the ``-h`` / ``--help`` option:
.. code-block:: shell
$ omnitrace-sample --help
[omnitrace-sample] Usage: omnitrace-sample [ --help (count: 0, dtype: bool)
--version (count: 0, dtype: bool)
--monochrome (max: 1, dtype: bool)
--debug (max: 1, dtype: bool)
--verbose (count: 1)
--config (min: 0, dtype: filepath)
--output (min: 1)
--trace (max: 1, dtype: bool)
--profile (max: 1, dtype: bool)
--flat-profile (max: 1, dtype: bool)
--host (max: 1, dtype: bool)
--device (max: 1, dtype: bool)
--wait (count: 1)
--duration (count: 1)
--trace-file (count: 1, dtype: filepath)
--trace-buffer-size (count: 1, dtype: KB)
--trace-fill-policy (count: 1)
--trace-wait (count: 1)
--trace-duration (count: 1)
--trace-periods (min: 1)
--trace-clock-id (count: 1)
--profile-format (min: 1)
--profile-diff (min: 1)
--process-freq (count: 1)
--process-wait (count: 1)
--process-duration (count: 1)
--cpus (count: unlimited, dtype: int or range)
--gpus (count: unlimited, dtype: int or range)
--freq (count: 1)
--sampling-wait (count: 1)
--sampling-duration (count: 1)
--tids (min: 1)
--cputime (min: 0)
--realtime (min: 0)
--include (count: unlimited)
--exclude (count: unlimited)
--cpu-events (count: unlimited)
--gpu-events (count: unlimited)
--inlines (max: 1, dtype: bool)
--hsa-interrupt (count: 1, dtype: int)
]
Options:
-h, -?, --help Shows this page (count: 0, dtype: bool)
--version Prints the version and exit (count: 0, dtype: bool)
[DEBUG OPTIONS]
--monochrome Disable colorized output (max: 1, dtype: bool)
--debug Debug output (max: 1, dtype: bool)
-v, --verbose Verbose output (count: 1)
[GENERAL OPTIONS] These are options which are ubiquitously applied
-c, --config Configuration file (min: 0, dtype: filepath)
-o, --output Output path. Accepts 1-2 parameters corresponding to the output path and the output prefix (min: 1)
-T, --trace Generate a detailed trace (perfetto output) (max: 1, dtype: bool)
-P, --profile Generate a call-stack-based profile (conflicts with --flat-profile) (max: 1, dtype: bool)
-F, --flat-profile Generate a flat profile (conflicts with --profile) (max: 1, dtype: bool)
-H, --host Enable sampling host-based metrics for the process. E.g. CPU frequency, memory usage, etc. (max: 1, dtype: bool)
-D, --device Enable sampling device-based metrics for the process. E.g. GPU temperature, memory usage, etc. (max: 1, dtype: bool)
-w, --wait This option is a combination of '--trace-wait' and '--sampling-wait'. See the descriptions for those two options.
(count: 1)
-d, --duration This option is a combination of '--trace-duration' and '--sampling-duration'. See the descriptions for those two
options. (count: 1)
[TRACING OPTIONS] Specific options controlling tracing (i.e. deterministic measurements of every event)
--trace-file Specify the trace output filename. Relative filepath will be with respect to output path and output prefix. (count: 1,
dtype: filepath)
--trace-buffer-size Size limit for the trace output (in KB) (count: 1, dtype: KB)
--trace-fill-policy [ discard | ring_buffer ]
Policy for new data when the buffer size limit is reached:
- discard : new data is ignored
- ring_buffer : new data overwrites oldest data (count: 1)
--trace-wait Set the wait time (in seconds) before collecting trace and/or profiling data(in seconds). By default, the duration is
in seconds of realtime but that can changed via --trace-clock-id. (count: 1)
--trace-duration Set the duration of the trace and/or profile data collection (in seconds). By default, the duration is in seconds of
realtime but that can changed via --trace-clock-id. (count: 1)
--trace-periods More powerful version of specifying trace delay and/or duration. Format is one or more groups of: <DELAY>:<DURATION>,
<DELAY>:<DURATION>:<REPEAT>, and/or <DELAY>:<DURATION>:<REPEAT>:<CLOCK_ID>. (min: 1)
--trace-clock-id [ 0 (realtime|CLOCK_REALTIME)
1 (monotonic|CLOCK_MONOTONIC)
2 (cputime|CLOCK_PROCESS_CPUTIME_ID)
4 (monotonic_raw|CLOCK_MONOTONIC_RAW)
5 (realtime_coarse|CLOCK_REALTIME_COARSE)
6 (monotonic_coarse|CLOCK_MONOTONIC_COARSE)
7 (boottime|CLOCK_BOOTTIME) ]
Set the default clock ID for for trace delay/duration. Note: "cputime" is the *process* CPU time and might need to be
scaled based on the number of threads, i.e. 4 seconds of CPU-time for an application with 4 fully active threads would
equate to ~1 second of realtime. If this proves to be difficult to handle in practice, please file a feature request
for omnitrace to auto-scale based on the number of threads. (count: 1)
[PROFILE OPTIONS] Specific options controlling profiling (i.e. deterministic measurements which are aggregated into a summary)
--profile-format [ console | json | text ]
Data formats for profiling results (min: 1)
--profile-diff Generate a diff output b/t the profile collected and an existing profile from another run Accepts 1-2 parameters
corresponding to the input path and the input prefix (min: 1)
[HOST/DEVICE (PROCESS SAMPLING) OPTIONS]
Process sampling is background measurements for resources available to the entire process. These samples are not tied
to specific lines/regions of code
--process-freq Set the default host/device sampling frequency (number of interrupts per second) (count: 1)
--process-wait Set the default wait time (i.e. delay) before taking first host/device sample (in seconds of realtime) (count: 1)
--process-duration Set the duration of the host/device sampling (in seconds of realtime) (count: 1)
--cpus CPU IDs for frequency sampling. Supports integers and/or ranges (count: unlimited, dtype: int or range)
--gpus GPU IDs for SMI queries. Supports integers and/or ranges (count: unlimited, dtype: int or range)
[GENERAL SAMPLING OPTIONS] General options for timer-based sampling per-thread
-f, --freq Set the default sampling frequency (number of interrupts per second) (count: 1)
--sampling-wait Set the default wait time (i.e. delay) before taking first sample (in seconds). This delay time is based on the clock
of the sampler, i.e., a delay of 1 second for CPU-clock sampler may not equal 1 second of realtime (count: 1)
--sampling-duration Set the duration of the sampling (in seconds of realtime). I.e., it is possible (currently) to set a CPU-clock time
delay that exceeds the real-time duration... resulting in zero samples being taken (count: 1)
-t, --tids Specify the default thread IDs for sampling, where 0 (zero) is the main thread and each thread created by the target
application is assigned an atomically incrementing value. (min: 1)
[SAMPLING TIMER OPTIONS] These options determine the heuristic for deciding when to take a sample
--cputime Sample based on a CPU-clock timer (default). Accepts zero or more arguments:
0. Enables sampling based on CPU-clock timer.
1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of CPU-time.
2. Delay (in seconds of CPU-clock time). I.e., how long each thread should wait before taking first sample.
3+ Thread IDs to target for sampling, starting at 0 (the main thread).
May be specified as index or range, e.g., '0 2-4' will be interpreted as:
sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads (min: 0)
--realtime Sample based on a real-clock timer. Accepts zero or more arguments:
0. Enables sampling based on real-clock timer.
1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of realtime.
2. Delay (in seconds of real-clock time). I.e., how long each thread should wait before taking first sample.
3+ Thread IDs to target for sampling, starting at 0 (the main thread).
May be specified as index or range, e.g., '0 2-4' will be interpreted as:
sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads
When sampling with a real-clock timer, please note that enabling this will cause threads which are typically "idle"
to consume more resources since, while idle, the real-clock time increases (and therefore triggers taking samples)
whereas the CPU-clock time does not. (min: 0)
[BACKEND OPTIONS] These options control region information captured w/o sampling or instrumentation
-I, --include [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
Include data from these backends (count: unlimited)
-E, --exclude [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
Exclude data from these backends (count: unlimited)
[HARDWARE COUNTER OPTIONS] See also: omnitrace-avail -H
-C, --cpu-events Set the CPU hardware counter events to record (ref: `omnitrace-avail -H -c CPU`) (count: unlimited)
-G, --gpu-events Set the GPU hardware counter events to record (ref: `omnitrace-avail -H -c GPU`) (count: unlimited)
[MISCELLANEOUS OPTIONS]
-i, --inlines Include inline info in output when available (max: 1, dtype: bool)
--hsa-interrupt [ 0 | 1 ] Set the value of the HSA_ENABLE_INTERRUPT environment variable.
ROCm version 5.2 and older have a bug which will cause a deadlock if a sample is taken while waiting for the signal
that a kernel completed -- which happens when sampling with a real-clock timer. We require this option to be set to
when --realtime is specified to make users aware that, while this may fix the bug, it can have a negative impact on
performance.
Values:
0 avoid triggering the bug, potentially at the cost of reduced performance
1 do not modify how ROCm is notified about kernel completion (count: 1, dtype: int)
The general syntax for separating Omnitrace command-line arguments from the
following application arguments
is consistent with the LLVM style of using a stand-alone double hyphen (``--``).
All arguments preceding the double hyphen
are interpreted as belonging to Omnitrace and all arguments following it
are interpreted as the
application and its arguments. The double hyphen is only necessary when passing
command-line arguments to a target
which also uses hyphens. For example, you can run ``omnitrace-sample ls``, but
to run ``ls -la``, use ``omnitrace-sample -- ls -la``.
:doc:`Configuring the Omnitrace runtime options <./configuring-runtime-options>`
establishes the precedence of environment variable values over values specified
in the configuration files. This enables
you to configure the Omnitrace runtime to your preferred default behavior
in a file such as ``~/.omnitrace.cfg`` and then easily override
those settings in the command line, for example, ``OMNITRACE_ENABLED=OFF omnitrace-sample -- foo``.
Similarly, the command-line arguments passed to ``omnitrace-sample`` take precedence
over environment variables.
All of the command-line options above correlate to one or more configuration
settings, for example, ``--cpu-events`` correlates to the ``OMNITRACE_PAPI_EVENTS`` configuration variable.
``omnitrace-sample`` processes the arguments and outputs a summary of its configuration
before running the target application.
The following snippets show how ``omnitrace-sample`` runs with various environment updates.
* This snippet shows the environment updates when ``omnitrace-sample`` is invoked with no arguments:
.. code-block:: shell
$ omnitrace-sample -- ./parallel-overhead-locks 30 4 100
HSA_TOOLS_LIB=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
HSA_TOOLS_REPORT_LOAD_FAILURE=1
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
OMNITRACE_USE_PROCESS_SAMPLING=false
OMNITRACE_USE_SAMPLING=true
OMP_TOOL_LIBRARIES=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
ROCP_TOOL_LIB=/opt/omnitrace/lib/libomnitrace.so.1.7.1
* The next snippet shows the environment updates when ``omnitrace-sample`` enables
profiling, tracing, host process-sampling, device process-sampling, and all the available backends:
.. code-block:: shell
$ omnitrace-sample -PTDH -I all -- ./parallel-overhead-locks 30 4 100
HSA_TOOLS_LIB=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
HSA_TOOLS_REPORT_LOAD_FAILURE=1
KOKKOS_PROFILE_LIBRARY=/opt/omnitrace/lib/libomnitrace.so.1.7.1
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
OMNITRACE_CPU_FREQ_ENABLED=true
OMNITRACE_TRACE_THREAD_LOCKS=true
OMNITRACE_TRACE_THREAD_RW_LOCKS=true
OMNITRACE_TRACE_THREAD_SPIN_LOCKS=true
OMNITRACE_USE_KOKKOSP=true
OMNITRACE_USE_MPIP=true
OMNITRACE_USE_OMPT=true
OMNITRACE_TRACE=true
OMNITRACE_USE_PROCESS_SAMPLING=true
OMNITRACE_USE_RCCLP=true
OMNITRACE_USE_ROCM_SMI=true
OMNITRACE_USE_ROCPROFILER=true
OMNITRACE_USE_ROCTRACER=true
OMNITRACE_USE_ROCTX=true
OMNITRACE_USE_SAMPLING=true
OMNITRACE_PROFILE=true
OMP_TOOL_LIBRARIES=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
ROCP_TOOL_LIB=/opt/omnitrace/lib/libomnitrace.so.1.7.1
...
* The final snippet shows the environment updates when ``omnitrace-sample`` enables
profiling, tracing, host process-sampling, and device process-sampling,
sets the output path to ``omnitrace-output`` and the output prefix to ``%tag%``, and disables
all the available backends:
.. code-block:: shell
$ omnitrace-sample -PTDH -E all -o omnitrace-output %tag% -- ./parallel-overhead-locks 30 4 100
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
OMNITRACE_CPU_FREQ_ENABLED=true
OMNITRACE_OUTPUT_PATH=omnitrace-output
OMNITRACE_OUTPUT_PREFIX=%tag%
OMNITRACE_TRACE_THREAD_LOCKS=false
OMNITRACE_TRACE_THREAD_RW_LOCKS=false
OMNITRACE_TRACE_THREAD_SPIN_LOCKS=false
OMNITRACE_USE_KOKKOSP=false
OMNITRACE_USE_MPIP=false
OMNITRACE_USE_OMPT=false
OMNITRACE_TRACE=true
OMNITRACE_USE_PROCESS_SAMPLING=true
OMNITRACE_USE_RCCLP=false
OMNITRACE_USE_ROCM_SMI=false
OMNITRACE_USE_ROCPROFILER=false
OMNITRACE_USE_ROCTRACER=false
OMNITRACE_USE_ROCTX=false
OMNITRACE_USE_SAMPLING=true
OMNITRACE_PROFILE=true
...
An omnitrace-sample example
========================================
Here is the full output from the previous
``omnitrace-sample -PTDH -E all -o omnitrace-output %tag% -- ./parallel-overhead-locks 30 4 100`` command:
.. code-block:: shell
$ omnitrace-sample -PTDH -E all -o omnitrace-output %tag% -c -- ./parallel-overhead-locks 30 4 100
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.11.3
OMNITRACE_CONFIG_FILE=
OMNITRACE_CPU_FREQ_ENABLED=true
OMNITRACE_OUTPUT_PATH=omnitrace-output
OMNITRACE_OUTPUT_PREFIX=%tag%
OMNITRACE_PROFILE=true
OMNITRACE_TRACE=true
OMNITRACE_TRACE_THREAD_LOCKS=false
OMNITRACE_TRACE_THREAD_RW_LOCKS=false
OMNITRACE_TRACE_THREAD_SPIN_LOCKS=false
OMNITRACE_USE_KOKKOSP=false
OMNITRACE_USE_MPIP=false
OMNITRACE_USE_OMPT=false
OMNITRACE_USE_PROCESS_SAMPLING=true
OMNITRACE_USE_RCCLP=false
OMNITRACE_USE_ROCM_SMI=false
OMNITRACE_USE_ROCPROFILER=false
OMNITRACE_USE_ROCTRACER=false
OMNITRACE_USE_ROCTX=false
OMNITRACE_USE_SAMPLING=true
[omnitrace][dl][1785877] omnitrace_main
[omnitrace][1785877][omnitrace_init_tooling] Instrumentation mode: Sampling
______ .___ ___. .__ __. __ .___________..______ ___ ______ _______
/ __ \ | \/ | | \ | | | | | || _ \ / \ / || ____|
| | | | | \ / | | \| | | | `---| |----`| |_) | / ^ \ | ,----'| |__
| | | | | |\/| | | . ` | | | | | | / / /_\ \ | | | __|
| `--' | | | | | | |\ | | | | | | |\ \----./ _____ \ | `----.| |____
\______/ |__| |__| |__| \__| |__| |__| | _| `._____/__/ \__\ \______||_______|
omnitrace v1.11.2 (rev: 2586b74db8bf335742600010b8d9f1ce8da9cf89, compiler: GNU v11.4.1, rocm: v6.1.x)
[988.958] perfetto.cc:58649 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
[parallel-overhead-locks] Threads: 4
[parallel-overhead-locks] Iterations: 100
[parallel-overhead-locks] fibonacci(30)...
[1] number of iterations: 100
[2] number of iterations: 100
[3] number of iterations: 100
[4] number of iterations: 100
[parallel-overhead-locks] fibonacci(30) x 4 = 409221992
[parallel-overhead-locks] number of mutex locks = 400
[omnitrace][1785877][0][omnitrace_finalize] finalizing...
[omnitrace][1785877][0][omnitrace_finalize]
[omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877 : 0.294342 sec wall_clock, 4.776 MB peak_rss, 3.170 MB page_rss, 0.990000 sec cpu_clock, 336.3 % cpu_util [laps: 1]
[omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/0 : 0.291535 sec wall_clock, 0.002619 sec thread_cpu_clock, 0.9 % thread_cpu_util, 4.776 MB peak_rss [laps: 1]
[omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/1 : 0.271353 sec wall_clock, 0.222572 sec thread_cpu_clock, 82.0 % thread_cpu_util, 4.200 MB peak_rss [laps: 1]
[omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/2 : 0.238218 sec wall_clock, 0.206405 sec thread_cpu_clock, 86.6 % thread_cpu_util, 3.432 MB peak_rss [laps: 1]
[omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/3 : 0.209459 sec wall_clock, 0.193415 sec thread_cpu_clock, 92.3 % thread_cpu_util, 2.472 MB peak_rss [laps: 1]
[omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/4 : 0.212029 sec wall_clock, 0.211694 sec thread_cpu_clock, 99.8 % thread_cpu_util, 1.152 MB peak_rss [laps: 1]
[omnitrace][1785877][0][omnitrace_finalize]
[omnitrace][1785877][0][omnitrace_finalize] Finalizing perfetto...
[omnitrace][1785877][perfetto]> Outputting '/home/user/code/omnitrace/build-release/omnitrace-output/2024-07-15_16.21/parallel-overhead-locksperfetto-trace-1785877.proto' (39.12 KB / 0.04 MB / 0.00 GB)... Done
[omnitrace][1785877][wall_clock]> Outputting 'omnitrace-output/2024-07-15_16.21/parallel-overhead-lockswall_clock-1785877.json'
[omnitrace][1785877][wall_clock]> Outputting 'omnitrace-output/2024-07-15_16.21/parallel-overhead-lockswall_clock-1785877.txt'
[omnitrace][1785877][metadata]> Outputting 'omnitrace-output/2024-07-15_16.21/parallel-overhead-locksmetadata-1785877.json' and 'omnitrace-output/2024-07-15_16.21/parallel-overhead-locksfunctions-1785877.json'
[omnitrace][1785877][0][omnitrace_finalize] Finalized: 0.054582 sec wall_clock, 0.000 MB peak_rss, -1.798 MB page_rss, 0.040000 sec cpu_clock, 73.3 % cpu_util
[989.312] perfetto.cc:60128 Tracing session 1 ended, total sessions:0
@@ -0,0 +1,938 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
****************************************************
Understanding the Omnitrace output
****************************************************
The general output form of `Omnitrace <https://github.com/ROCm/omnitrace>`_ is
``<OUTPUT_PATH>[/<TIMESTAMP>]/[<PREFIX>]<DATA_NAME>[-<OUTPUT_SUFFIX>].<EXT>``.
For example, starting with the following base configuration:
.. code-block:: shell
export OMNITRACE_OUTPUT_PATH=omnitrace-example-output
export OMNITRACE_TIME_OUTPUT=ON
export OMNITRACE_USE_PID=OFF
export OMNITRACE_PROFILE=ON
export OMNITRACE_TRACE=ON
.. code-block:: shell
$ omnitrace-instrument -- ./foo
...
[omnitrace] Outputting 'omnitrace-example-output/perfetto-trace.proto'...
[omnitrace] Outputting 'omnitrace-example-output/wall-clock.txt'...
[omnitrace] Outputting 'omnitrace-example-output/wall-clock.json'...
If the ``OMNITRACE_USE_PID`` option is enabled, then running a non-MPI executable
with a PID of ``63453`` results in the following output:
.. code-block:: shell
$ export OMNITRACE_USE_PID=ON
$ omnitrace-instrument -- ./foo
...
[omnitrace] Outputting 'omnitrace-example-output/perfetto-trace-63453.proto'...
[omnitrace] Outputting 'omnitrace-example-output/wall-clock-63453.txt'...
[omnitrace] Outputting 'omnitrace-example-output/wall-clock-63453.json'...
If ``OMNITRACE_TIME_OUTPUT`` is enabled, then a job that started on January 31, 2022 at 12:30 PM
generates the following:
.. code-block:: shell
$ export OMNITRACE_TIME_OUTPUT=ON
$ omnitrace-instrument -- ./foo
...
[omnitrace] Outputting 'omnitrace-example-output/2022-01-31_12.30_PM/perfetto-trace-63453.proto'...
[omnitrace] Outputting 'omnitrace-example-output/2022-01-31_12.30_PM/wall-clock-63453.txt'...
[omnitrace] Outputting 'omnitrace-example-output/2022-01-31_12.30_PM/wall-clock-63453.json'...
Metadata
========================================
Omnitrace outputs a ``metadata.json`` file. This metadata file contains
information about the settings, environment variables, output files, and info
about the system and the run, as follows:
* Hardware cache sizes
* Physical CPUs
* Hardware concurrency
* CPU model, frequency, vendor, and features
* Launch date and time
* Memory maps (for example, shared libraries)
* Output files
* Environment variables
* Configuration settings
Metadata JSON Sample
-----------------------------------------------------------------------
.. code-block:: json
{
"omnitrace": {
"metadata": {
"info": {
"HW_L1_CACHE_SIZE": 32768,
"HW_L2_CACHE_SIZE": 524288,
"HW_L3_CACHE_SIZE": 16777216,
"HW_PHYSICAL_CPU": 12,
"HW_CONCURRENCY": 24,
"LAUNCH_TIME": "02:04",
"LAUNCH_DATE": "05/08/22",
"TIMEMORY_GIT_REVISION": "52e7034fd419ff296506cdef43084f6071dbaba1",
"TIMEMORY_VERSION": "3.3.0rc4",
"TIMEMORY_API": "tim::project::timemory",
"TIMEMORY_GIT_DESCRIBE": "v3.2.0-263-g52e7034f",
"PWD": "/home/jrmadsen/devel/c++/AARInternal/hosttrace-dyninst/build-vscode",
"USER": "jrmadsen",
"HOME": "/home/jrmadsen",
"SHELL": "/bin/bash",
"CPU_MODEL": "AMD Ryzen Threadripper PRO 3945WX 12-Cores",
"CPU_FREQUENCY": 2400,
"CPU_VENDOR": "AuthenticAMD",
"CPU_FEATURES": [
"fpu",
"msr",
"sse",
"sse2",
"constant_tsc",
"ssse3",
"fma",
"sse4_1",
"sse4_2",
"popcnt",
"avx2",
"... etc. ..."
],
"memory_maps": [
{
"end_address": "7f4013797000",
"start_address": "7f4012e58000",
"pathname": "/opt/rocm-5.0.0/hip/lib/libamdhip64.so.5.0.50000",
"offset": "34a000",
"device": "103:05",
"inode": 4331165,
"permissions": "rw-p"
},
{
"end_address": "7f4013902000",
"start_address": "7f4013901000",
"pathname": "/usr/lib/x86_64-linux-gnu/libm-2.31.so",
"offset": "14d000",
"device": "103:05",
"inode": 42078854,
"permissions": "rwxp"
},
{
"end_address": "7f4013919000",
"start_address": "7f4013908000",
"pathname": "/usr/lib/x86_64-linux-gnu/libpthread-2.31.so",
"offset": "6000",
"device": "103:05",
"inode": 42078874,
"permissions": "r-xp"
},
{
"...": "etc."
},
],
"memory_maps_files": [
"/opt/rocm-5.0.0/hip/lib/libamdhip64.so.5.0.50000",
"/opt/rocm-5.0.0/hsa-amd-aqlprofile/lib/libhsa-amd-aqlprofile64.so.1.0.50000",
"/opt/rocm-5.0.0/lib/libamd_comgr.so.2.4.50000",
"/opt/rocm-5.0.0/lib/libhsa-runtime64.so.1.5.50000",
"/opt/rocm-5.0.0/rocm_smi/lib/librocm_smi64.so.5.0.50000",
"/opt/rocm-5.0.0/roctracer/lib/libroctracer64.so.1.0.50000",
"/usr/lib/x86_64-linux-gnu/ld-2.31.so",
"/usr/lib/x86_64-linux-gnu/libc-2.31.so",
"/usr/lib/x86_64-linux-gnu/libdl-2.31.so",
"... etc. ..."
],
},
"output": {
"text": [
{
"value": [
"omnitrace-tests-output/parallel-overhead-binary-rewrite/roctracer.txt"
],
"key": "roctracer"
},
{
"value": [
"omnitrace-tests-output/parallel-overhead-binary-rewrite/wall_clock.txt"
],
"key": "wall_clock"
}
],
"json": [
{
"value": [
"omnitrace-tests-output/parallel-overhead-binary-rewrite/roctracer.json",
"omnitrace-tests-output/parallel-overhead-binary-rewrite/roctracer.tree.json"
],
"key": "roctracer"
},
{
"value": [
"omnitrace-tests-output/parallel-overhead-binary-rewrite/wall_clock.json",
"omnitrace-tests-output/parallel-overhead-binary-rewrite/wall_clock.tree.json"
],
"key": "wall_clock"
}
]
},
"environment": [
{
"value": "/home/jrmadsen",
"key": "HOME"
},
{
"value": "/bin/bash",
"key": "SHELL"
},
{
"value": "jrmadsen",
"key": "USER"
},
{
"value": "true",
"key": "... etc. ..."
}
],
"settings": {
"OMNITRACE_JSON_OUTPUT": {
"count": -1,
"environ_updated": false,
"name": "json_output",
"data_type": "bool",
"initial": true,
"enabled": true,
"value": true,
"max_count": 1,
"cmdline": [
"--omnitrace-json-output"
],
"environ": "OMNITRACE_JSON_OUTPUT",
"config_updated": false,
"categories": [
"io",
"json",
"native"
],
"description": "Write json output files"
},
"... etc. ...": {
"etc.": true
}
}
}
}
}
Configuring the Omnitrace output
========================================
Omnitrace includes a core set of options for controlling the format
and contents of the output files. For additional information, see the guide on
:doc:`configuring runtime options <./configuring-runtime-options>`.
Core configuration settings
-----------------------------------
.. csv-table::
:header: "Setting", "Value", "Description"
:widths: 30, 30, 100
"``OMNITRACE_OUTPUT_PATH``", "Any valid path", "Path to folder where output files should be placed"
"``OMNITRACE_OUTPUT_PREFIX``", "String", "Useful for multiple runs with different arguments. See the next section on output prefix keys."
"``OMNITRACE_OUTPUT_FILE``", "Any valid filepath", "Specific location for the Perfetto output file"
"``OMNITRACE_TIME_OUTPUT``", "Boolean", "Place all output in a timestamped folder, timestamp format controlled via ``OMNITRACE_TIME_FORMAT``"
"``OMNITRACE_TIME_FORMAT``", "String", "See ``strftime`` man pages for valid identifiers"
"``OMNITRACE_USE_PID``", "Boolean", "Append either the PID or the MPI rank to all output files (before the extension)"
Output prefix keys
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Output prefix keys have many uses but are most helpful when dealing with multiple
profiling runs or large MPI jobs.
They are included in Omnitrace because they were introduced into Timemory
for `compile-time-perf <https://github.com/jrmadsen/compile-time-perf>`_.
They are needed to create different output files for a generic wrapper around
compilation commands while still
overwriting the output from the last time a file was compiled.
When doing scaling studies and specifying options via the command line,
the recommended process is to
use a common ``OMNITRACE_OUTPUT_PATH``, disable ``OMNITRACE_TIME_OUTPUT``,
set ``OMNITRACE_OUTPUT_PREFIX="%argt%-"``, and let Omnitrace cleanly organize the output.
.. csv-table::
:header: "String", "Encoding"
:widths: 20, 120
"``%argv%``", "Entire command-line condensed into a single string"
"``%argt%``", "Similar to ``%argv%`` except basename of first command line argument"
"``%args%``", "All command line arguments condensed into a single string"
"``%tag%``", "Basename of first command line argument"
"``%arg<N>%``", "Command line argument at position ``<N>`` (zero indexed), e.g. ``%arg0%`` for first argument"
"``%argv_hash%``", "MD5 sum of ``%argv%``"
"``%argt_hash%``", "MD5 sum if ``%argt%``"
"``%args_hash%``", "MD5 sum of ``%args%``"
"``%tag_hash%``", "MD5 sum of ``%tag%``"
"``%arg<N>_hash%``", "MD5 sum of ``%arg<N>%``"
"``%pid%``", "Process identifier (i.e. ``getpid()``)"
"``%ppid%``", "Parent process identifier (i.e. ``getppid()``)"
"``%pgid%``", "Process group identifier (i.e. ``getpgid(getpid())``)"
"``%psid%``", "Process session identifier (i.e. ``getsid(getpid())``)"
"``%psize%``", "Number of sibling process (from reading ``/proc/<PPID>/tasks/<PPID>/children``)"
"``%job%``", "Value of ``SLURM_JOB_ID`` environment variable if exists, else ``0``"
"``%rank%``", "Value of ``SLURM_PROCID`` environment variable if exists, else ``MPI_Comm_rank`` (or ``0`` non-mpi)"
"``%size%``", "``MPI_Comm_size`` or ``1`` if non-mpi"
"``%nid%``", "``%rank%`` if possible, otherwise ``%pid%``"
"``%launch_time%``", "Launch date and time (uses ``OMNITRACE_TIME_FORMAT``)"
"``%env{NAME}%``", "Value of environment variable ``NAME`` (i.e. ``getenv(NAME)``)"
"``%cfg{NAME}%``", "Value of configuration variable ``NAME`` (e.g. ``%cfg{OMNITRACE_SAMPLING_FREQ}%`` would resolve to sampling frequency)"
"``$env{NAME}``", "Alternative syntax to ``%env{NAME}%``"
"``$cfg{NAME}``", "Alternative syntax to ``%cfg{NAME}%``"
"``%m``", "Shorthand for ``%argt_hash%``"
"``%p``", "Shorthand for ``%pid%``"
"``%j``", "Shorthand for ``%job%``"
"``%r``", "Shorthand for ``%rank%``"
"``%s``", "Shorthand for ``%size%``"
.. note::
In any output prefix key which contains a ``/`` character, the ``/`` characters
are replaced with ``_`` and any leading underscores are stripped. For example,
an ``%arg0%`` of ``/usr/bin/foo`` translates to ``usr_bin_foo``. Additionally, any ``%arg<N>%`` keys which
do not have a command line argument at position ``<N>`` are ignored.
Perfetto output
========================================
Use the ``OMNITRACE_OUTPUT_FILE`` to specify a specific location. If this is an
absolute path, then all ``OMNITRACE_OUTPUT_PATH`` and similar
settings are ignored. Visit `ui.perfetto.dev <https://ui.perfetto.dev>`_ and open this file.
.. image:: ../data/omnitrace-perfetto.png
:alt: Visualization of a performance graph in Perfetto
.. image:: ../data/omnitrace-rocm.png
:alt: Visualization of ROCm data in Perfetto
.. image:: ../data/omnitrace-rocm-flow.png
:alt: Visualization of ROCm flow data in Perfetto
.. image:: ../data/omnitrace-user-api.png
:alt: Visualization of ROCm API calls in Perfetto
Timemory output
========================================
Use ``omnitrace-avail --components --filename`` to view the base filename for each component, as follows
.. code-block:: shell
$ omnitrace-avail wall_clock -C -f
|---------------------------------|---------------|------------------------|
| COMPONENT | AVAILABLE | FILENAME |
|---------------------------------|---------------|------------------------|
| wall_clock | true | wall_clock |
| sampling_wall_clock | true | sampling_wall_clock |
|---------------------------------|---------------|------------------------|
The ``OMNITRACE_COLLAPSE_THREADS`` and ``OMNITRACE_COLLAPSE_PROCESSES`` settings are
only valid when full `MPI support is enabled <../install/install.html#mpi-support-within-omnitrace>`_.
When they are set, Timemory combines the per-thread and per-rank data (respectively) of
identical call stacks.
The ``OMNITRACE_FLAT_PROFILE`` setting removes all call stack hierarchy.
Using ``OMNITRACE_FLAT_PROFILE=ON`` in combination
with ``OMNITRACE_COLLAPSE_THREADS=ON`` is a useful configuration for identifying
min/max measurements regardless of the calling context.
The ``OMNITRACE_TIMELINE_PROFILE`` setting (with ``OMNITRACE_FLAT_PROFILE=OFF``) effectively
generates similar data to that found
in Perfetto. Enabling timeline and flat profiling effectively generates
similar data to ``strace``. However, while Timemory generally
requires significantly less memory than Perfetto, this is not the case in timeline
mode, so use this setting with caution.
Timemory text output
-----------------------------------------------------------------------
Timemory text output files are meant for human consumption (while JSON formats are for analysis),
so some fields such as the ``LABEL`` might be truncated for readability.
The truncation settings be changed through the ``OMNITRACE_MAX_WIDTH`` setting.
.. note::
The generation of text output is configurable via ``OMNITRACE_TEXT_OUTPUT``.
.. _text-output-example-label:
Timemory text output example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In the following example, the ``NN`` field in ``|NN>>>`` is the thread ID. If MPI support is enabled,
this becomes ``|MM|NN>>>`` where ``MM`` is the rank.
If ``OMNITRACE_COLLAPSE_THREADS=ON`` and ``OMNITRACE_COLLAPSE_PROCESSES=ON`` are configured,
neither the ``MM`` nor the ``NN`` are present unless the
component explicitly sets type traits. Type traits specify that the data is only
relevant per-thread or per-process, such as the ``thread_cpu_clock`` clock component.
.. code-block:: shell
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| REAL-CLOCK TIMER (I.E. WALL-CLOCK TIMER) |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| LABEL | COUNT | DEPTH | METRIC | UNITS | SUM | MEAN | MIN | MAX | VAR | STDDEV | % SELF |
|--------------------------------------------------------------|--------|--------|------------|--------|-----------|-----------|-----------|-----------|----------|----------|--------|
| |00>>> main | 1 | 0 | wall_clock | sec | 13.360265 | 13.360265 | 13.360265 | 13.360265 | 0.000000 | 0.000000 | 18.2 |
| |00>>> |_ompt_thread_initial | 1 | 1 | wall_clock | sec | 10.924161 | 10.924161 | 10.924161 | 10.924161 | 0.000000 | 0.000000 | 0.0 |
| |00>>> |_ompt_implicit_task | 1 | 2 | wall_clock | sec | 10.923050 | 10.923050 | 10.923050 | 10.923050 | 0.000000 | 0.000000 | 0.1 |
| |00>>> |_ompt_parallel [parallelism=12] | 1 | 3 | wall_clock | sec | 10.915026 | 10.915026 | 10.915026 | 10.915026 | 0.000000 | 0.000000 | 0.0 |
| |00>>> |_ompt_implicit_task | 1 | 4 | wall_clock | sec | 10.647951 | 10.647951 | 10.647951 | 10.647951 | 0.000000 | 0.000000 | 0.0 |
| |00>>> |_ompt_work_loop | 156 | 5 | wall_clock | sec | 0.000812 | 0.000005 | 0.000001 | 0.000212 | 0.000000 | 0.000018 | 100.0 |
| |00>>> |_ompt_work_single_executor | 40 | 5 | wall_clock | sec | 0.000016 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |00>>> |_ompt_sync_region_barrier_implicit | 308 | 5 | wall_clock | sec | 0.000629 | 0.000002 | 0.000001 | 0.000017 | 0.000000 | 0.000002 | 100.0 |
| |00>>> |_conj_grad | 76 | 5 | wall_clock | sec | 10.641165 | 0.140015 | 0.131894 | 0.155099 | 0.000017 | 0.004080 | 1.0 |
| |00>>> |_ompt_work_single_executor | 803 | 6 | wall_clock | sec | 0.000292 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |00>>> |_ompt_work_loop | 7904 | 6 | wall_clock | sec | 7.420265 | 0.000939 | 0.000005 | 0.006974 | 0.000003 | 0.001613 | 100.0 |
| |00>>> |_ompt_sync_region_barrier_implicit | 6004 | 6 | wall_clock | sec | 0.283160 | 0.000047 | 0.000001 | 0.004087 | 0.000000 | 0.000303 | 100.0 |
| |00>>> |_ompt_sync_region_barrier_implementation | 3952 | 6 | wall_clock | sec | 2.829252 | 0.000716 | 0.000007 | 0.009005 | 0.000001 | 0.000985 | 99.7 |
| |00>>> |_ompt_sync_region_reduction | 15808 | 7 | wall_clock | sec | 0.009142 | 0.000001 | 0.000000 | 0.000007 | 0.000000 | 0.000000 | 100.0 |
| |00>>> |_ompt_work_single_other | 1249 | 6 | wall_clock | sec | 0.000270 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |00>>> |_ompt_work_single_other | 114 | 5 | wall_clock | sec | 0.000024 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |00>>> |_ompt_sync_region_barrier_implementation | 76 | 5 | wall_clock | sec | 0.000876 | 0.000012 | 0.000008 | 0.000025 | 0.000000 | 0.000003 | 84.4 |
| |00>>> |_ompt_sync_region_reduction | 304 | 6 | wall_clock | sec | 0.000136 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |00>>> |_ompt_master | 226 | 5 | wall_clock | sec | 0.001978 | 0.000009 | 0.000000 | 0.000038 | 0.000000 | 0.000012 | 100.0 |
| |11>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.656145 | 10.656145 | 10.656145 | 10.656145 | 0.000000 | 0.000000 | 0.1 |
| |11>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649183 | 10.649183 | 10.649183 | 10.649183 | 0.000000 | 0.000000 | 0.0 |
| |11>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000852 | 0.000005 | 0.000002 | 0.000230 | 0.000000 | 0.000019 | 100.0 |
| |11>>> |_ompt_work_single_other | 149 | 6 | wall_clock | sec | 0.000035 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
| |11>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004135 | 0.000013 | 0.000001 | 0.001233 | 0.000000 | 0.000070 | 100.0 |
| |11>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641302 | 0.140017 | 0.131896 | 0.155102 | 0.000017 | 0.004080 | 0.6 |
| |11>>> |_ompt_work_single_other | 2023 | 7 | wall_clock | sec | 0.000458 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |11>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 8.253555 | 0.001044 | 0.000005 | 0.008021 | 0.000003 | 0.001790 | 100.0 |
| |11>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.263840 | 0.000044 | 0.000001 | 0.004087 | 0.000000 | 0.000297 | 100.0 |
| |11>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.059823 | 0.000521 | 0.000007 | 0.009508 | 0.000001 | 0.000863 | 100.0 |
| |11>>> |_ompt_work_single_executor | 29 | 7 | wall_clock | sec | 0.000011 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |11>>> |_ompt_work_single_executor | 5 | 6 | wall_clock | sec | 0.000002 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
| |11>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000975 | 0.000013 | 0.000008 | 0.000024 | 0.000000 | 0.000003 | 100.0 |
| |10>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.681664 | 10.681664 | 10.681664 | 10.681664 | 0.000000 | 0.000000 | 0.3 |
| |10>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649158 | 10.649158 | 10.649158 | 10.649158 | 0.000000 | 0.000000 | 0.0 |
| |10>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000863 | 0.000006 | 0.000002 | 0.000231 | 0.000000 | 0.000019 | 100.0 |
| |10>>> |_ompt_work_single_other | 140 | 6 | wall_clock | sec | 0.000037 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |10>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004149 | 0.000013 | 0.000001 | 0.001221 | 0.000000 | 0.000070 | 100.0 |
| |10>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641288 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 0.7 |
| |10>>> |_ompt_work_single_other | 1883 | 7 | wall_clock | sec | 0.000487 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |10>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 8.174545 | 0.001034 | 0.000005 | 0.006899 | 0.000003 | 0.001766 | 100.0 |
| |10>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.268808 | 0.000045 | 0.000001 | 0.004087 | 0.000000 | 0.000299 | 100.0 |
| |10>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.126988 | 0.000538 | 0.000007 | 0.009843 | 0.000001 | 0.000872 | 99.9 |
| |10>>> |_ompt_sync_region_reduction | 3952 | 8 | wall_clock | sec | 0.002574 | 0.000001 | 0.000000 | 0.000014 | 0.000000 | 0.000000 | 100.0 |
| |10>>> |_ompt_work_single_executor | 169 | 7 | wall_clock | sec | 0.000072 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |10>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000954 | 0.000013 | 0.000009 | 0.000023 | 0.000000 | 0.000003 | 95.9 |
| |10>>> |_ompt_sync_region_reduction | 76 | 7 | wall_clock | sec | 0.000039 | 0.000001 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |10>>> |_ompt_work_single_executor | 14 | 6 | wall_clock | sec | 0.000006 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |09>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.686552 | 10.686552 | 10.686552 | 10.686552 | 0.000000 | 0.000000 | 0.3 |
| |09>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649151 | 10.649151 | 10.649151 | 10.649151 | 0.000000 | 0.000000 | 0.0 |
| |09>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000880 | 0.000006 | 0.000002 | 0.000258 | 0.000000 | 0.000021 | 100.0 |
| |09>>> |_ompt_work_single_other | 148 | 6 | wall_clock | sec | 0.000034 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |09>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004129 | 0.000013 | 0.000001 | 0.001210 | 0.000000 | 0.000069 | 100.0 |
| |09>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641308 | 0.140017 | 0.131895 | 0.155102 | 0.000017 | 0.004080 | 0.7 |
| |09>>> |_ompt_work_single_other | 2043 | 7 | wall_clock | sec | 0.000473 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |09>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 7.977001 | 0.001009 | 0.000005 | 0.007325 | 0.000003 | 0.001732 | 100.0 |
| |09>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.242996 | 0.000040 | 0.000001 | 0.004087 | 0.000000 | 0.000284 | 100.0 |
| |09>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.350895 | 0.000595 | 0.000007 | 0.008689 | 0.000001 | 0.000926 | 100.0 |
| |09>>> |_ompt_work_single_executor | 9 | 7 | wall_clock | sec | 0.000004 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |09>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000973 | 0.000013 | 0.000008 | 0.000025 | 0.000000 | 0.000003 | 100.0 |
| |09>>> |_ompt_work_single_executor | 6 | 6 | wall_clock | sec | 0.000002 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
| |08>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.721622 | 10.721622 | 10.721622 | 10.721622 | 0.000000 | 0.000000 | 0.7 |
| |08>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649135 | 10.649135 | 10.649135 | 10.649135 | 0.000000 | 0.000000 | 0.0 |
| |08>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000839 | 0.000005 | 0.000001 | 0.000231 | 0.000000 | 0.000019 | 100.0 |
| |08>>> |_ompt_work_single_other | 141 | 6 | wall_clock | sec | 0.000030 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |08>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004114 | 0.000013 | 0.000001 | 0.001198 | 0.000000 | 0.000069 | 100.0 |
| |08>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641294 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 0.6 |
| |08>>> |_ompt_work_single_other | 1742 | 7 | wall_clock | sec | 0.000392 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |08>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 8.306388 | 0.001051 | 0.000005 | 0.007886 | 0.000003 | 0.001795 | 100.0 |
| |08>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.274358 | 0.000046 | 0.000001 | 0.004090 | 0.000000 | 0.000302 | 100.0 |
| |08>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 1.991251 | 0.000504 | 0.000007 | 0.008694 | 0.000001 | 0.000844 | 99.8 |
| |08>>> |_ompt_sync_region_reduction | 7904 | 8 | wall_clock | sec | 0.003816 | 0.000000 | 0.000000 | 0.000017 | 0.000000 | 0.000000 | 100.0 |
| |08>>> |_ompt_work_single_executor | 310 | 7 | wall_clock | sec | 0.000112 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |08>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000955 | 0.000013 | 0.000009 | 0.000026 | 0.000000 | 0.000003 | 93.7 |
| |08>>> |_ompt_sync_region_reduction | 152 | 7 | wall_clock | sec | 0.000060 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |08>>> |_ompt_work_single_executor | 13 | 6 | wall_clock | sec | 0.000005 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |07>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.747282 | 10.747282 | 10.747282 | 10.747282 | 0.000000 | 0.000000 | 0.9 |
| |07>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649093 | 10.649093 | 10.649093 | 10.649093 | 0.000000 | 0.000000 | 0.0 |
| |07>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000923 | 0.000006 | 0.000002 | 0.000231 | 0.000000 | 0.000019 | 100.0 |
| |07>>> |_ompt_work_single_other | 152 | 6 | wall_clock | sec | 0.000048 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |07>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.003981 | 0.000013 | 0.000001 | 0.001186 | 0.000000 | 0.000068 | 100.0 |
| |07>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641295 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 0.7 |
| |07>>> |_ompt_work_single_other | 2043 | 7 | wall_clock | sec | 0.000648 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |07>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 7.978811 | 0.001009 | 0.000005 | 0.006728 | 0.000003 | 0.001732 | 100.0 |
| |07>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.199939 | 0.000033 | 0.000001 | 0.004086 | 0.000000 | 0.000255 | 100.0 |
| |07>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.385843 | 0.000604 | 0.000009 | 0.009039 | 0.000001 | 0.000938 | 100.0 |
| |07>>> |_ompt_work_single_executor | 9 | 7 | wall_clock | sec | 0.000004 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |07>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000905 | 0.000012 | 0.000010 | 0.000025 | 0.000000 | 0.000003 | 100.0 |
| |07>>> |_ompt_work_single_executor | 2 | 6 | wall_clock | sec | 0.000001 | 0.000001 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |06>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.772278 | 10.772278 | 10.772278 | 10.772278 | 0.000000 | 0.000000 | 1.1 |
| |06>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649092 | 10.649092 | 10.649092 | 10.649092 | 0.000000 | 0.000000 | 0.0 |
| |06>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000888 | 0.000006 | 0.000002 | 0.000236 | 0.000000 | 0.000020 | 100.0 |
| |06>>> |_ompt_work_single_other | 153 | 6 | wall_clock | sec | 0.000037 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |06>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004090 | 0.000013 | 0.000001 | 0.001175 | 0.000000 | 0.000067 | 100.0 |
| |06>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641317 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 0.8 |
| |06>>> |_ompt_work_single_other | 2041 | 7 | wall_clock | sec | 0.000476 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |06>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 7.467961 | 0.000945 | 0.000005 | 0.010712 | 0.000003 | 0.001627 | 100.0 |
| |06>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.250883 | 0.000042 | 0.000001 | 0.004087 | 0.000000 | 0.000285 | 100.0 |
| |06>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.838733 | 0.000718 | 0.000009 | 0.009015 | 0.000001 | 0.001015 | 99.9 |
| |06>>> |_ompt_sync_region_reduction | 3952 | 8 | wall_clock | sec | 0.003334 | 0.000001 | 0.000000 | 0.000025 | 0.000000 | 0.000001 | 100.0 |
| |06>>> |_ompt_work_single_executor | 11 | 7 | wall_clock | sec | 0.000005 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |06>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000940 | 0.000012 | 0.000009 | 0.000025 | 0.000000 | 0.000003 | 95.4 |
| |06>>> |_ompt_sync_region_reduction | 76 | 7 | wall_clock | sec | 0.000044 | 0.000001 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |06>>> |_ompt_work_single_executor | 1 | 6 | wall_clock | sec | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
| |05>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.797950 | 10.797950 | 10.797950 | 10.797950 | 0.000000 | 0.000000 | 1.4 |
| |05>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649072 | 10.649072 | 10.649072 | 10.649072 | 0.000000 | 0.000000 | 0.0 |
| |05>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000879 | 0.000006 | 0.000001 | 0.000248 | 0.000000 | 0.000021 | 100.0 |
| |05>>> |_ompt_work_single_other | 142 | 6 | wall_clock | sec | 0.000034 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |05>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004062 | 0.000013 | 0.000002 | 0.001163 | 0.000000 | 0.000067 | 100.0 |
| |05>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641291 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 0.7 |
| |05>>> |_ompt_work_single_other | 2038 | 7 | wall_clock | sec | 0.000500 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |05>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 8.279191 | 0.001047 | 0.000005 | 0.006596 | 0.000003 | 0.001792 | 100.0 |
| |05>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.250939 | 0.000042 | 0.000001 | 0.004090 | 0.000000 | 0.000286 | 100.0 |
| |05>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.039013 | 0.000516 | 0.000009 | 0.008689 | 0.000001 | 0.000855 | 100.0 |
| |05>>> |_ompt_work_single_executor | 14 | 7 | wall_clock | sec | 0.000005 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
| |05>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000926 | 0.000012 | 0.000009 | 0.000023 | 0.000000 | 0.000003 | 100.0 |
| |05>>> |_ompt_work_single_executor | 12 | 6 | wall_clock | sec | 0.000005 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |04>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.825935 | 10.825935 | 10.825935 | 10.825935 | 0.000000 | 0.000000 | 1.6 |
| |04>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649068 | 10.649068 | 10.649068 | 10.649068 | 0.000000 | 0.000000 | 0.0 |
| |04>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000884 | 0.000006 | 0.000002 | 0.000245 | 0.000000 | 0.000020 | 100.0 |
| |04>>> |_ompt_work_single_other | 150 | 6 | wall_clock | sec | 0.000034 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |04>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004069 | 0.000013 | 0.000001 | 0.001151 | 0.000000 | 0.000066 | 100.0 |
| |04>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641300 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 1.1 |
| |04>>> |_ompt_work_single_other | 2041 | 7 | wall_clock | sec | 0.000448 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |04>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 7.438393 | 0.000941 | 0.000005 | 0.007090 | 0.000003 | 0.001624 | 100.0 |
| |04>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.270654 | 0.000045 | 0.000001 | 0.004090 | 0.000000 | 0.000295 | 100.0 |
| |04>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.819165 | 0.000713 | 0.000009 | 0.008379 | 0.000001 | 0.001013 | 99.9 |
| |04>>> |_ompt_sync_region_reduction | 7904 | 8 | wall_clock | sec | 0.003932 | 0.000000 | 0.000000 | 0.000015 | 0.000000 | 0.000000 | 100.0 |
| |04>>> |_ompt_work_single_executor | 11 | 7 | wall_clock | sec | 0.000005 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |04>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000936 | 0.000012 | 0.000009 | 0.000025 | 0.000000 | 0.000003 | 93.2 |
| |04>>> |_ompt_sync_region_reduction | 152 | 7 | wall_clock | sec | 0.000064 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |04>>> |_ompt_work_single_executor | 4 | 6 | wall_clock | sec | 0.000001 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
| |03>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.849322 | 10.849322 | 10.849322 | 10.849322 | 0.000000 | 0.000000 | 1.8 |
| |03>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649075 | 10.649075 | 10.649075 | 10.649075 | 0.000000 | 0.000000 | 0.0 |
| |03>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000861 | 0.000006 | 0.000002 | 0.000238 | 0.000000 | 0.000020 | 100.0 |
| |03>>> |_ompt_work_single_other | 120 | 6 | wall_clock | sec | 0.000028 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |03>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.003993 | 0.000013 | 0.000001 | 0.001138 | 0.000000 | 0.000065 | 100.0 |
| |03>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641302 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 0.8 |
| |03>>> |_ompt_work_single_other | 1756 | 7 | wall_clock | sec | 0.000426 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |03>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 8.005617 | 0.001013 | 0.000005 | 0.011500 | 0.000003 | 0.001741 | 100.0 |
| |03>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.231485 | 0.000039 | 0.000001 | 0.004086 | 0.000000 | 0.000277 | 100.0 |
| |03>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.320428 | 0.000587 | 0.000009 | 0.010868 | 0.000001 | 0.000912 | 100.0 |
| |03>>> |_ompt_work_single_executor | 296 | 7 | wall_clock | sec | 0.000120 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |03>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000967 | 0.000013 | 0.000010 | 0.000023 | 0.000000 | 0.000003 | 100.0 |
| |03>>> |_ompt_work_single_executor | 34 | 6 | wall_clock | sec | 0.000013 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |02>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.876387 | 10.876387 | 10.876387 | 10.876387 | 0.000000 | 0.000000 | 2.1 |
| |02>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649050 | 10.649050 | 10.649050 | 10.649050 | 0.000000 | 0.000000 | 0.0 |
| |02>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000924 | 0.000006 | 0.000001 | 0.000241 | 0.000000 | 0.000020 | 100.0 |
| |02>>> |_ompt_work_single_other | 139 | 6 | wall_clock | sec | 0.000040 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |02>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.003972 | 0.000013 | 0.000001 | 0.001127 | 0.000000 | 0.000064 | 100.0 |
| |02>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641287 | 0.140017 | 0.131895 | 0.155101 | 0.000017 | 0.004080 | 0.7 |
| |02>>> |_ompt_work_single_other | 1902 | 7 | wall_clock | sec | 0.000553 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |02>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 7.906688 | 0.001000 | 0.000005 | 0.007068 | 0.000003 | 0.001713 | 100.0 |
| |02>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.261367 | 0.000044 | 0.000001 | 0.004088 | 0.000000 | 0.000295 | 100.0 |
| |02>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.402362 | 0.000608 | 0.000009 | 0.010399 | 0.000001 | 0.000944 | 99.9 |
| |02>>> |_ompt_sync_region_reduction | 3952 | 8 | wall_clock | sec | 0.002937 | 0.000001 | 0.000000 | 0.000021 | 0.000000 | 0.000000 | 100.0 |
| |02>>> |_ompt_work_single_executor | 150 | 7 | wall_clock | sec | 0.000073 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |02>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000895 | 0.000012 | 0.000009 | 0.000026 | 0.000000 | 0.000003 | 95.2 |
| |02>>> |_ompt_sync_region_reduction | 76 | 7 | wall_clock | sec | 0.000043 | 0.000001 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |02>>> |_ompt_work_single_executor | 15 | 6 | wall_clock | sec | 0.000007 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |01>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.901650 | 10.901650 | 10.901650 | 10.901650 | 0.000000 | 0.000000 | 2.3 |
| |01>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649017 | 10.649017 | 10.649017 | 10.649017 | 0.000000 | 0.000000 | 0.0 |
| |01>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000863 | 0.000006 | 0.000001 | 0.000231 | 0.000000 | 0.000019 | 100.0 |
| |01>>> |_ompt_work_single_other | 146 | 6 | wall_clock | sec | 0.000033 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
| |01>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004012 | 0.000013 | 0.000001 | 0.001115 | 0.000000 | 0.000064 | 100.0 |
| |01>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641316 | 0.140017 | 0.131895 | 0.155101 | 0.000017 | 0.004080 | 0.8 |
| |01>>> |_ompt_work_single_other | 1811 | 7 | wall_clock | sec | 0.000403 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |01>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 7.410337 | 0.000938 | 0.000005 | 0.010556 | 0.000003 | 0.001610 | 100.0 |
| |01>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.202494 | 0.000034 | 0.000001 | 0.003521 | 0.000000 | 0.000256 | 100.0 |
| |01>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.943604 | 0.000745 | 0.000008 | 0.009033 | 0.000001 | 0.001024 | 100.0 |
| |01>>> |_ompt_work_single_executor | 241 | 7 | wall_clock | sec | 0.000093 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |01>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000917 | 0.000012 | 0.000009 | 0.000026 | 0.000000 | 0.000003 | 100.0 |
| |01>>> |_ompt_work_single_executor | 8 | 6 | wall_clock | sec | 0.000004 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |00>>> |_c_print_results | 1 | 2 | wall_clock | sec | 0.000049 | 0.000049 | 0.000049 | 0.000049 | 0.000000 | 0.000000 | 100.0 |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
Timemory JSON output
-------------------------------------------------------------------------
Timemory represents the data within the JSON output in two forms:
a flat structure and a hierarchical structure.
The flat JSON data represents the data similar to the text files, where the hierarchical information
is represented by the indentation of the ``prefix`` field and the ``depth`` field.
The hierarchical JSON contains additional information with respect
to inclusive and exclusive values. However,
its structure must be processed using recursion. This section of the JSON output supports analysis
by `hatchet <https://github.com/hatchet/hatchet>`_.
All the data entries for the flat structure are in a single JSON array. It is easier to
write a simple Python script for post-processing using this format than with the hierarchical structure.
.. note::
The generation of flat JSON output is configurable via ``OMNITRACE_JSON_OUTPUT``.
The generation of hierarchical JSON data is configurable via ``OMNITRACE_TREE_OUTPUT``
Timemory JSON output sample
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In the following JSON data, the flat data starts at ``["timemory"]["wall_clock"]["ranks"]``
and the hierarchical data starts at ``["timemory"]["wall_clock"]["graph"]``.
To access the name (or prefix) of the nth entry in the flat data layout, use
``["timemory"]["wall_clock"]["ranks"][0]["graph"][<N>]["prefix"]``. When full MPI
support is enabled, the per-rank data in flat layout is represented
as an entry in the ``ranks`` array. In the hierarchical data structure,
the per-rank data is represented as an entry in the ``mpi`` array. However, ``graph``
is used in lieu of ``mpi`` when full MPI support is enabled.
In the hierarchical layout, all data for the process is a child of a dummy
root node, which has the name ``unknown-hash=0``.
.. code-block:: json
{
"timemory": {
"wall_clock": {
"properties": {
"cereal_class_version": 0,
"value": 78,
"enum": "WALL_CLOCK",
"id": "wall_clock",
"ids": [
"real_clock",
"virtual_clock",
"wall_clock"
]
},
"type": "wall_clock",
"description": "Real-clock timer (i.e. wall-clock timer)",
"unit_value": 1000000000,
"unit_repr": "sec",
"thread_scope_only": false,
"thread_count": 2,
"mpi_size": 1,
"upcxx_size": 1,
"process_count": 1,
"num_ranks": 1,
"concurrency": 2,
"ranks": [
{
"rank": 0,
"graph_size": 112,
"graph": [
{
"hash": 17481650134347108265,
"prefix": "|0>>> main",
"depth": 0,
"entry": {
"cereal_class_version": 0,
"laps": 1,
"value": 894743517,
"accum": 894743517,
"repr_data": 0.894743517,
"repr_display": 0.894743517
},
"stats": {
"cereal_class_version": 0,
"sum": 0.894743517,
"count": 1,
"min": 0.894743517,
"max": 0.894743517,
"sqr": 0.8005659612135293,
"mean": 0.894743517,
"stddev": 0.0
},
"rolling_hash": 17481650134347108265
},
{
"hash": 3455444288293231339,
"prefix": "|0>>> |_read_input",
"depth": 1,
"entry": {
"laps": 1,
"value": 9808,
"accum": 9808,
"repr_data": 9.808e-06,
"repr_display": 9.808e-06
},
"stats": {
"sum": 9.808e-06,
"count": 1,
"min": 9.808e-06,
"max": 9.808e-06,
"sqr": 9.6196864e-11,
"mean": 9.808e-06,
"stddev": 0.0
},
"rolling_hash": 2490350348930787988
},
{
"hash": 8456966793631718807,
"prefix": "|0>>> |_setcoeff",
"depth": 1,
"entry": {
"laps": 1,
"value": 922,
"accum": 922,
"repr_data": 9.22e-07,
"repr_display": 9.22e-07
},
"stats": {
"sum": 9.22e-07,
"count": 1,
"min": 9.22e-07,
"max": 9.22e-07,
"sqr": 8.50084e-13,
"mean": 9.22e-07,
"stddev": 0.0
},
"rolling_hash": 7491872854269275456
},
{
"hash": 6107876127803219007,
"prefix": "|0>>> |_ompt_thread_initial",
"depth": 1,
"entry": {
"laps": 1,
"value": 896506392,
"accum": 896506392,
"repr_data": 0.896506392,
"repr_display": 0.896506392
},
"stats": {
"sum": 0.896506392,
"count": 1,
"min": 0.896506392,
"max": 0.896506392,
"sqr": 0.8037237108968578,
"mean": 0.896506392,
"stddev": 0.0
},
"rolling_hash": 5142782188440775656
},
{
"hash": 15402802091993617561,
"prefix": "|0>>> |_ompt_implicit_task",
"depth": 2,
"entry": {
"laps": 1,
"value": 896479111,
"accum": 896479111,
"repr_data": 0.896479111,
"repr_display": 0.896479111
},
"stats": {
"sum": 0.896479111,
"count": 1,
"min": 0.896479111,
"max": 0.896479111,
"sqr": 0.8036747964593504,
"mean": 0.896479111,
"stddev": 0.0
},
"rolling_hash": 2098840206724841601 },
{
"..." : "... etc. ..."
}
]
}
],
"graph": [
[
{
"cereal_class_version": 0,
"node": {
"hash": 0,
"prefix": "unknown-hash=0",
"tid": [
0
],
"pid": [
2539175
],
"depth": 0,
"is_dummy": false,
"inclusive": {
"entry": {
"laps": 0,
"value": 0,
"accum": 0,
"repr_data": 0.0,
"repr_display": 0.0
},
"stats": {
"sum": 0.0,
"count": 0,
"min": 0.0,
"max": 0.0,
"sqr": 0.0,
"mean": 0.0,
"stddev": 0.0
}
},
"exclusive": {
"entry": {
"laps": 0,
"value": -894743517,
"accum": -894743517,
"repr_data": -0.894743517,
"repr_display": -0.894743517
},
"stats": {
"sum": 0.0,
"count": 0,
"min": 0.0,
"max": 0.0,
"sqr": 0.0,
"mean": 0.0,
"stddev": 0.0
}
}
},
"children": [
{
"node": {
"hash": 17481650134347108265,
"prefix": "main",
"tid": [
0
],
"pid": [
2539175
],
"depth": 1,
"is_dummy": false,
"inclusive": {
"entry": {
"laps": 1,
"value": 894743517,
"accum": 894743517,
"repr_data": 0.894743517,
"repr_display": 0.894743517
},
"stats": {
"sum": 0.894743517,
"count": 1,
"min": 0.894743517,
"max": 0.894743517,
"sqr": 0.8005659612135293,
"mean": 0.894743517,
"stddev": 0.0
}
},
"exclusive": {
"entry": {
"laps": 1,
"value": -1773605,
"accum": -1773605,
"repr_data": -0.001773605,
"repr_display": -0.001773605
},
"stats": {
"sum": -0.001773605,
"count": 1,
"min": 9.22e-07,
"max": 0.896506392,
"sqr": -0.0031577497803754,
"mean": -0.001773605,
"stddev": 0.0
}
}
},
"children": [
{
"..." : "... etc. ..."
}
]
}
]
}
]
]
}
}
}
Timemory JSON output Python post-processing example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: python
#!/usr/bin/env python3
import sys
import json
def read_json(inp):
with open(inp, "r") as f:
return json.load(f)
def find_max(data):
"""Find the max for any function called multiple times"""
max_entry = None
for itr in data:
if itr["entry"]["laps"] == 1:
continue
if max_entry is None:
max_entry = itr
else:
if itr["stats"]["mean"] > max_entry["stats"]["mean"]:
max_entry = itr
return max_entry
def strip_name(name):
"""Return everything after |_ if it exists"""
idx = name.index("|_")
return name if idx is None else name[(idx + 2) :]
if __name__ == "__main__":
input_data = [[x, read_json(x)] for x in sys.argv[1:]]
for file, data in input_data:
for metric, metric_data in data["timemory"].items():
print(f"[{file}] Found metric: {metric}")
for n, itr in enumerate(metric_data["ranks"]):
max_entry = find_max(itr["graph"])
print(
"[{}] Maximum value: '{}' at depth {} was called {}x :: {:.3f} {} (mean = {:.3e} {})".format(
file,
strip_name(max_entry["prefix"]),
max_entry["depth"],
max_entry["entry"]["laps"],
max_entry["entry"]["repr_data"],
metric_data["unit_repr"],
max_entry["stats"]["mean"],
metric_data["unit_repr"],
)
)
The result of applying this script to the corresponding JSON output from the :ref:`text-output-example-label`
section is as follows:
.. code-block:: shell
[openmp-cg.inst-wall_clock.json] Found metric: wall_clock
[openmp-cg.inst-wall_clock.json] Maximum value: 'conj_grad' at depth 6 was called 76x :: 10.641 sec (mean = 1.400e-01 sec)
@@ -0,0 +1,294 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
****************************************************
Using the Omnitrace API
****************************************************
The following example shows how a program can use the Omnitrace API for run-time analysis.
Omnitrace user API example program
========================================
You can use the Omnitrace API to define custom regions to profile and trace.
The following C++ program demonstrates this technique by calling several functions from the
Omnitrace API, such as ``omnitrace_user_push_region`` and
``omnitrace_user_stop_thread_trace``.
.. note::
By default, when Omnitrace detects any ``omnitrace_user_start_*`` or
``omnitrace_user_stop_*`` function, instrumentation
is disabled at start up, which means ``omnitrace_user_stop_trace()`` is not
required at the beginning of ``main``. This behavior
can be manually controlled by using the ``OMNITRACE_INIT_ENABLED`` environment variable.
User-defined regions are always
recorded, regardless of whether ``omnitrace_user_start_*`` or
``omnitrace_user_stop_*`` has been called.
.. code-block:: shell
#include <omnitrace/categories.h>
#include <omnitrace/types.h>
#include <omnitrace/user.h>
#include <atomic>
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <sstream>
#include <thread>
#include <vector>
std::atomic<long> total{ 0 };
long
fib(long n) __attribute__((noinline));
void
run(size_t nitr, long) __attribute__((noinline));
int
custom_push_region(const char* name);
namespace
{
omnitrace_user_callbacks_t custom_callbacks = OMNITRACE_USER_CALLBACKS_INIT;
omnitrace_user_callbacks_t original_callbacks = OMNITRACE_USER_CALLBACKS_INIT;
} // namespace
int
main(int argc, char** argv)
{
custom_callbacks.push_region = &custom_push_region;
omnitrace_user_configure(OMNITRACE_USER_UNION_CONFIG, custom_callbacks,
&original_callbacks);
omnitrace_user_push_region(argv[0]);
omnitrace_user_push_region("initialization");
size_t nthread = std::min<size_t>(16, std::thread::hardware_concurrency());
size_t nitr = 50000;
long nfib = 10;
if(argc > 1) nfib = atol(argv[1]);
if(argc > 2) nthread = atol(argv[2]);
if(argc > 3) nitr = atol(argv[3]);
omnitrace_user_pop_region("initialization");
printf("[%s] Threads: %zu\n[%s] Iterations: %zu\n[%s] fibonacci(%li)...\n", argv[0],
nthread, argv[0], nitr, argv[0], nfib);
omnitrace_user_push_region("thread_creation");
std::vector<std::thread> threads{};
threads.reserve(nthread);
// disable instrumentation for child threads
omnitrace_user_stop_thread_trace();
for(size_t i = 0; i < nthread; ++i)
{
threads.emplace_back(&run, nitr, nfib);
}
// re-enable instrumentation
omnitrace_user_start_thread_trace();
omnitrace_user_pop_region("thread_creation");
omnitrace_user_push_region("thread_wait");
for(auto& itr : threads)
itr.join();
omnitrace_user_pop_region("thread_wait");
run(nitr, nfib);
printf("[%s] fibonacci(%li) x %lu = %li\n", argv[0], nfib, nthread, total.load());
omnitrace_user_pop_region(argv[0]);
return 0;
}
long
fib(long n)
{
return (n < 2) ? n : fib(n - 1) + fib(n - 2);
}
#define RUN_LABEL \
std::string{ std::string{ __FUNCTION__ } + "(" + std::to_string(n) + ") x " + \
std::to_string(nitr) } \
.c_str()
void
run(size_t nitr, long n)
{
omnitrace_user_push_region(RUN_LABEL);
long local = 0;
for(size_t i = 0; i < nitr; ++i)
local += fib(n);
total += local;
omnitrace_user_pop_region(RUN_LABEL);
}
int
custom_push_region(const char* name)
{
if(!original_callbacks.push_region || !original_callbacks.push_annotated_region)
return OMNITRACE_USER_ERROR_NO_BINDING;
printf("Pushing custom region :: %s\n", name);
if(original_callbacks.push_annotated_region)
{
int32_t _err = errno;
char* _msg = nullptr;
char _buff[1024];
if(_err != 0) _msg = strerror_r(_err, _buff, sizeof(_buff));
omnitrace_annotation_t _annotations[] = {
{ "errno", OMNITRACE_INT32, &_err }, { "strerror", OMNITRACE_STRING, _msg }
};
errno = 0; // reset errno
return (*original_callbacks.push_annotated_region)(
name, _annotations, sizeof(_annotations) / sizeof(omnitrace_annotation_t));
}
return (*original_callbacks.push_region)(name);
}
Linking the Omnitrace libraries to another program
=======================================================
To link the ``omnitrace-user-library`` to another program,
use the following CMake and ``g++`` directives.
CMake
-------------------------------------------------------
.. code-block:: cmake
find_package(omnitrace REQUIRED COMPONENTS user)
add_executable(foo foo.cpp)
target_link_libraries(foo PRIVATE omnitrace::omnitrace-user-library)
g++ compilation
-------------------------------------------------------
Assuming Omnitrace is installed in ``/opt/omnitrace``, use the ``g++`` compiler
to build the application.
.. code-block:: shell
g++ -I/opt/omnitrace foo.cpp -o foo -lomnitrace-user
Output from the API example program
========================================
First, instrument and run the program.
.. code-block:: shell
$ omnitrace-instrument -l --min-instructions=8 -E custom_push_region -o -- ./user-api
...
$ omnitrace-run --profile --use-pid off --time-output off -- ./user-api.inst 20 4 100
Pushing custom region :: ./user-api.inst
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Trace
______ .___ ___. .__ __. __ .___________..______ ___ ______ _______
/ __ \ | \/ | | \ | | | | | || _ \ / \ / || ____|
| | | | | \ / | | \| | | | `---| |----`| |_) | / ^ \ | ,----'| |__
| | | | | |\/| | | . ` | | | | | | / / /_\ \ | | | __|
| `--' | | | | | | |\ | | | | | | |\ \----./ _____ \ | `----.| |____
\______/ |__| |__| |__| \__| |__| |__| | _| `._____/__/ \__\ \______||_______|
Pushing custom region :: initialization
[./user-api.inst] Threads: 4
[./user-api.inst] Iterations: 100
[./user-api.inst] fibonacci(20)...
Pushing custom region :: thread_creation
Pushing custom region :: thread_wait
Pushing custom region :: run(20) x 100
Pushing custom region :: run(20) x 100
Pushing custom region :: run(20) x 100
Pushing custom region :: run(20) x 100
Pushing custom region :: run(20) x 100
[./user-api.inst] fibonacci(20) x 4 = 3382500
[omnitrace][86267][0][omnitrace_finalize] finalizing...
[omnitrace][86267][0] omnitrace : 5.190895 sec wall_clock, 2.748 mb peak_rss, 6.330000 sec cpu_clock, 121.9 % cpu_util [laps: 1]
[omnitrace][86267][0] user-api.inst/thread-0 : 5.078713 sec wall_clock, 4.722415 sec thread_cpu_clock, 93.0 % thread_cpu_util, 1.276 mb peak_rss [laps: 1]
[omnitrace][86267][0] user-api.inst/thread-1 : 0.322248 sec wall_clock, 0.322191 sec thread_cpu_clock, 100.0 % thread_cpu_util, 1.000 mb peak_rss [laps: 1]
[omnitrace][86267][0] user-api.inst/thread-2 : 0.323255 sec wall_clock, 0.323194 sec thread_cpu_clock, 100.0 % thread_cpu_util, 0.000 mb peak_rss [laps: 1]
[omnitrace][86267][0] user-api.inst/thread-3 : 0.323569 sec wall_clock, 0.323484 sec thread_cpu_clock, 100.0 % thread_cpu_util, 1.092 mb peak_rss [laps: 1]
[omnitrace][86267][0] user-api.inst/thread-4 : 0.324178 sec wall_clock, 0.324057 sec thread_cpu_clock, 100.0 % thread_cpu_util, 1.184 mb peak_rss [laps: 1]
[omnitrace][86267][0] Post-processing 51 cpu frequency and memory usage entries...
[omnitrace][wall_clock]|0> Outputting 'omnitrace-user-api.inst-output/wall_clock.json'...
[omnitrace][wall_clock]|0> Outputting 'omnitrace-user-api.inst-output/wall_clock.tree.json'...
[omnitrace][wall_clock]|0> Outputting 'omnitrace-user-api.inst-output/wall_clock.txt'...
[omnitrace][manager::finalize][metadata]> Outputting 'omnitrace-user-api.inst-output/metadata.json' and 'omnitrace-user-api.inst-output/functions.json'...
[omnitrace][86267][0][omnitrace_finalize] Finalized
Then review the output.
.. code-block:: shell
$ cat omnitrace-example-output/wall_clock.txt
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| REAL-CLOCK TIMER (I.E. WALL-CLOCK TIMER) |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| LABEL | COUNT | DEPTH | METRIC | UNITS | SUM | MEAN | MIN | MAX | VAR | STDDEV | % SELF |
|---------------------------------------------------------------------------------|--------|--------|------------|--------|----------|----------|----------|----------|----------|----------|--------|
| |0>>> ./user-api.inst | 1 | 0 | wall_clock | sec | 5.078521 | 5.078521 | 5.078521 | 5.078521 | 0.000000 | 0.000000 | 0.0 |
| |0>>> |_initialization | 1 | 1 | wall_clock | sec | 0.000004 | 0.000004 | 0.000004 | 0.000004 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_thread_creation | 1 | 1 | wall_clock | sec | 0.000159 | 0.000159 | 0.000159 | 0.000159 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_thread_wait | 1 | 1 | wall_clock | sec | 0.355307 | 0.355307 | 0.355307 | 0.355307 | 0.000000 | 0.000000 | 0.0 |
| |0>>> |_std::vector<std::thread, std::allocator<std::thread> >::begin | 1 | 2 | wall_clock | sec | 0.000001 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_std::vector<std::thread, std::allocator<std::thread> >::end | 1 | 2 | wall_clock | sec | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_pthread_join | 4 | 2 | wall_clock | sec | 0.355257 | 0.088814 | 0.000001 | 0.333144 | 0.026559 | 0.162970 | 100.0 |
| |2>>> |_start_thread | 1 | 3 | wall_clock | sec | 0.000032 | 0.000032 | 0.000032 | 0.000032 | 0.000000 | 0.000000 | 100.0 |
| |1>>> |_start_thread | 1 | 3 | wall_clock | sec | 0.000036 | 0.000036 | 0.000036 | 0.000036 | 0.000000 | 0.000000 | 100.0 |
| |3>>> |_start_thread | 1 | 3 | wall_clock | sec | 0.000034 | 0.000034 | 0.000034 | 0.000034 | 0.000000 | 0.000000 | 100.0 |
| |4>>> |_start_thread | 1 | 3 | wall_clock | sec | 0.000039 | 0.000039 | 0.000039 | 0.000039 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_run | 1 | 1 | wall_clock | sec | 4.722993 | 4.722993 | 4.722993 | 4.722993 | 0.000000 | 0.000000 | 0.0 |
| |0>>> |_std::char_traits<char>::length | 1 | 2 | wall_clock | sec | 0.000001 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_std::distance<char const*> | 1 | 2 | wall_clock | sec | 0.000001 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_std::operator+<char, std::char_traits<char>, std::allocator<char> > | 2 | 2 | wall_clock | sec | 0.000002 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_run(20) x 100 | 1 | 2 | wall_clock | sec | 4.722951 | 4.722951 | 4.722951 | 4.722951 | 0.000000 | 0.000000 | 0.0 |
| |0>>> |_run [{94,25}-{96,25}] | 1 | 3 | wall_clock | sec | 4.722925 | 4.722925 | 4.722925 | 4.722925 | 0.000000 | 0.000000 | 0.0 |
| |0>>> |_fib | 100 | 4 | wall_clock | sec | 4.722718 | 0.047227 | 0.046713 | 0.051987 | 0.000000 | 0.000625 | 0.0 |
| |0>>> |_fib | 200 | 5 | wall_clock | sec | 4.722302 | 0.023612 | 0.017827 | 0.034091 | 0.000032 | 0.005627 | 0.0 |
| |0>>> |_fib | 400 | 6 | wall_clock | sec | 4.721485 | 0.011804 | 0.006790 | 0.023003 | 0.000016 | 0.004024 | 0.0 |
| |0>>> |_fib | 800 | 7 | wall_clock | sec | 4.719858 | 0.005900 | 0.002564 | 0.016078 | 0.000006 | 0.002498 | 0.1 |
| |0>>> |_fib | 1600 | 8 | wall_clock | sec | 4.716572 | 0.002948 | 0.000977 | 0.011849 | 0.000002 | 0.001465 | 0.1 |
| |0>>> |_fib | 3200 | 9 | wall_clock | sec | 4.709918 | 0.001472 | 0.000371 | 0.008246 | 0.000001 | 0.000831 | 0.3 |
| |0>>> |_fib | 6400 | 10 | wall_clock | sec | 4.696775 | 0.000734 | 0.000140 | 0.005111 | 0.000000 | 0.000461 | 0.6 |
| |0>>> |_fib | 12800 | 11 | wall_clock | sec | 4.670093 | 0.000365 | 0.000050 | 0.003166 | 0.000000 | 0.000253 | 1.1 |
| |0>>> |_fib | 25600 | 12 | wall_clock | sec | 4.617496 | 0.000180 | 0.000017 | 0.001959 | 0.000000 | 0.000137 | 2.3 |
| |0>>> |_fib | 51200 | 13 | wall_clock | sec | 4.512671 | 0.000088 | 0.000004 | 0.001212 | 0.000000 | 0.000074 | 4.6 |
| |0>>> |_fib | 102400 | 14 | wall_clock | sec | 4.304142 | 0.000042 | 0.000000 | 0.000752 | 0.000000 | 0.000039 | 9.6 |
| |0>>> |_fib | 202600 | 15 | wall_clock | sec | 3.892580 | 0.000019 | 0.000000 | 0.000469 | 0.000000 | 0.000021 | 19.0 |
| |0>>> |_fib | 363200 | 16 | wall_clock | sec | 3.151143 | 0.000009 | 0.000000 | 0.000293 | 0.000000 | 0.000011 | 33.2 |
| |0>>> |_fib | 502000 | 17 | wall_clock | sec | 2.105217 | 0.000004 | 0.000000 | 0.000183 | 0.000000 | 0.000006 | 49.1 |
| |0>>> |_fib | 476000 | 18 | wall_clock | sec | 1.071652 | 0.000002 | 0.000000 | 0.000114 | 0.000000 | 0.000004 | 63.6 |
| |0>>> |_fib | 294200 | 19 | wall_clock | sec | 0.390193 | 0.000001 | 0.000000 | 0.000071 | 0.000000 | 0.000003 | 75.3 |
| |0>>> |_fib | 115200 | 20 | wall_clock | sec | 0.096190 | 0.000001 | 0.000000 | 0.000043 | 0.000000 | 0.000002 | 84.4 |
| |0>>> |_fib | 27400 | 21 | wall_clock | sec | 0.015020 | 0.000001 | 0.000000 | 0.000025 | 0.000000 | 0.000001 | 91.1 |
| |0>>> |_fib | 3600 | 22 | wall_clock | sec | 0.001336 | 0.000000 | 0.000000 | 0.000013 | 0.000000 | 0.000001 | 96.3 |
| |0>>> |_fib | 200 | 23 | wall_clock | sec | 0.000050 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_std::char_traits<char>::length | 1 | 3 | wall_clock | sec | 0.000001 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_std::distance<char const*> | 1 | 3 | wall_clock | sec | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_std::operator+<char, std::char_traits<char>, std::allocator<char> > | 2 | 3 | wall_clock | sec | 0.000001 | 0.000001 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_std::operator& | 1 | 1 | wall_clock | sec | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
| |0>>> std::vector<std::thread, std::allocator<std::thread> >::~vector | 1 | 0 | wall_clock | sec | 0.000045 | 0.000045 | 0.000045 | 0.000045 | 0.000000 | 0.000000 | 32.7 |
| |0>>> |_std::thread::~thread | 4 | 1 | wall_clock | sec | 0.000030 | 0.000007 | 0.000007 | 0.000009 | 0.000000 | 0.000001 | 31.2 |
| |0>>> |_std::thread::joinable | 4 | 2 | wall_clock | sec | 0.000021 | 0.000005 | 0.000005 | 0.000006 | 0.000000 | 0.000001 | 89.4 |
| |0>>> |_std::thread::id::id | 4 | 3 | wall_clock | sec | 0.000001 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_std::operator== | 4 | 3 | wall_clock | sec | 0.000001 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_std::allocator_traits<std::allocator<std::thread> >::deallocate | 1 | 1 | wall_clock | sec | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
| |0>>> |_std::allocator<std::thread>::~allocator | 1 | 1 | wall_clock | sec | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+67
Féach ar an gComhad
@@ -0,0 +1,67 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
***********************
Omnitrace documentation
***********************
Omnitrace is designed for the high-level profiling and comprehensive tracing
of applications running on the CPU or the CPU and GPU. It supports dynamic binary
instrumentation, call-stack sampling, and various other features for determining
which function and line number are currently executing. To learn more, see :doc:`what-is-omnitrace`
The code is open and hosted at `<https://github.com/ROCm/omnitrace>`_.
.. grid:: 2
:gutter: 3
.. grid-item-card:: Install
* :doc:`Quick start <./install/quick-start>`
* :doc:`Omnitrace installation <./install/install>`
The documentation is structured as follows:
.. grid:: 2
:gutter: 3
.. grid-item-card:: Tutorials
* `GitHub examples <https://github.com/ROCm/omnitrace/tree/main/examples>`_
* :doc:`Video tutorials <./tutorials/video-tutorials>`
.. grid-item-card:: How to
* :doc:`Configuring and validating the Omnitrace environment <./how-to/configuring-validating-environment>`
* :doc:`Configuring runtime options <./how-to/configuring-runtime-options>`
* :doc:`Sampling the call stack <./how-to/sampling-call-stack>`
* :doc:`Instrumenting and rewriting a binary application <./how-to/instrumenting-rewriting-binary-application>`
* :doc:`Performing causal profiling <./how-to/performing-causal-profiling>`
* :doc:`Understanding the Omnitrace output <./how-to/understanding-omnitrace-output>`
* :doc:`Profiling Python scripts <./how-to/profiling-python-scripts>`
* :doc:`Using the Omnitrace API <./how-to/using-omnitrace-api>`
* :doc:`General tips for using Omnitrace <./how-to/general-tips-using-omnitrace>`
.. grid-item-card:: Conceptual
* :doc:`Data collection modes <./conceptual/data-collection-modes>`
* :doc:`The Omnitrace feature set <./conceptual/omnitrace-feature-set>`
.. grid-item-card:: Reference
* :doc:`Development guide <./reference/development-guide>`
* :doc:`Omnitrace glossary <./reference/omnitrace-glossary>`
* :doc:`API library <./doxygen/html/files>`
* :doc:`Class member functions <./doxygen/html/functions>`
* :doc:`Globals <./doxygen/html/globals>`
* :doc:`Classes, structures, and interfaces <./doxygen/html/annotated>`
To contribute to the documentation, refer to
`Contributing to ROCm <https://rocm.docs.amd.com/en/latest/contribute/contributing.html>`_.
You can find licensing information on the
`Licensing <https://rocm.docs.amd.com/en/latest/about/license.html>`_ page.
@@ -0,0 +1,410 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
*************************************
Omnitrace installation
*************************************
The following information builds on the guidelines in the :doc:`Quick start <./quick-start>` guide.
It covers how to install `Omnitrace <https://github.com/ROCm/omnitrace>`_ from source or a binary distribution,
as well as the :ref:`post-installation-steps`.
If you have problems using Omnitrace after installation,
consult the :ref:`post-installation-troubleshooting` section.
Release links
========================================
To review and install either the current Omnitrace release or earlier releases, use these links:
* Latest Omnitrace Release: `<https://github.com/ROCm/omnitrace/releases/latest>`_
* All Omnitrace Releases: `<https://github.com/ROCm/omnitrace/releases>`_
Operating system support
========================================
Omnitrace is only supported on Linux. The following distributions are tested in the Omnitrace GitHub workflows:
* Ubuntu 20.04
* Ubuntu 22.04
* OpenSUSE 15.3
* OpenSUSE 15.4
* Red Hat 8.7
* Red Hat 9.0
* Red Hat 9.1
Other OS distributions might function but are not supported or tested.
Identifying the operating system
-----------------------------------
If you are unsure of the operating system and version, the ``/etc/os-release`` and
``/usr/lib/os-release`` files contain operating system identification data for Linux systems.
.. code-block:: shell
$ cat /etc/os-release
.. code-block:: shell
NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
...
VERSION_ID="20.04"
...
The relevant fields are ``ID`` and the ``VERSION_ID``.
Architecture
========================================
With regards to instrumentation, at present only AMD64 (x86_64) architectures are tested. However,
Dyninst supports several more architectures and Omnitrace instrumentation may support other
CPU architectures such as aarch64 and ppc64.
Other modes of use, such as sampling and causal profiling, are not dependent on Dyninst and therefore
might be more portable.
Installing Omnitrace from binary distributions
================================================
Every Omnitrace release provides binary installer scripts of the form:
.. code-block:: shell
omnitrace-{VERSION}-{OS_DISTRIB}-{OS_VERSION}[-ROCm-{ROCM_VERSION}[-{EXTRA}]].sh
For example,
.. code-block:: shell
omnitrace-1.0.0-ubuntu-18.04-OMPT-PAPI-Python3.sh
omnitrace-1.0.0-ubuntu-18.04-ROCm-405000-OMPT-PAPI-Python3.sh
...
omnitrace-1.0.0-ubuntu-20.04-ROCm-50000-OMPT-PAPI-Python3.sh
Any of the ``EXTRA`` fields with a CMake build option
(for example, PAPI, as referenced in a following section) or
with no link requirements (such as OMPT) have
self-contained support for these packages.
To install Omnitrace using a binary installer script, follow these steps:
#. Download the appropriate binary distribution
.. code-block:: shell
wget https://github.com/ROCm/omnitrace/releases/download/v<VERSION>/<SCRIPT>
#. Create the target installation directory
.. code-block:: shell
mkdir /opt/omnitrace
#. Run the installer script
.. code-block:: shell
./omnitrace-1.0.0-ubuntu-18.04-ROCm-405000-OMPT-PAPI.sh --prefix=/opt/omnitrace --exclude-subdir
Installing Omnitrace from source
========================================
Omnitrace needs a GCC compiler with full support for C++17 and CMake v3.16 or higher.
The Clang compiler may be used in lieu of the GCC compiler if `Dyninst <https://github.com/dyninst/dyninst>`_
is already installed.
Build requirements
-----------------------------------
* GCC compiler v7+
* Older GCC compilers may be supported but are not tested
* Clang compilers are generally supported for Omnitrace but not Dyninst
* `CMake <https://cmake.org/>`_ v3.16+
.. note::
* If the installed version of CMake is too old, installing a new version of CMake can be done through several methods
* One of the easiest options is to use the python ``pip`` utility, as follows:
.. code-block:: shell
pip install --user 'cmake==3.18.4'
export PATH=${HOME}/.local/bin:${PATH}
Required third-party packages
-----------------------------------
* `Dyninst <https://github.com/dyninst/dyninst>`_ for dynamic or static instrumentation.
Dyninst uses the following required and optional components.
* `TBB <https://github.com/oneapi-src/oneTBB>`_ (required)
* `Elfutils <https://sourceware.org/elfutils/>`_ (required)
* `Libiberty <https://github.com/gcc-mirror/gcc/tree/master/libiberty>`_ (required)
* `Boost <https://www.boost.org/>`_ (required)
* `OpenMP <https://www.openmp.org/>`_ (optional)
* `libunwind <https://www.nongnu.org/libunwind/>`_ for call-stack sampling
Any of the third-party packages required by Dyninst, along with Dyninst itself, can be built and installed
during the Omnitrace build. The following list indicates the package, the version,
the application that requires the package (for example, Omnitrace requires Dyninst
while Dyninst requires TBB), and the CMake option to build the package alongside Omnitrace:
.. csv-table::
:header: "Third-Party Library", "Minimum Version", "Required By", "CMake Option"
:widths: 15, 10, 12, 40
"Dyninst", "12.0", "Omnitrace", "``OMNITRACE_BUILD_DYNINST`` (default: OFF)"
"Libunwind", "", "Omnitrace", "``OMNITRACE_BUILD_LIBUNWIND`` (default: ON)"
"TBB", "2018.6", "Dyninst", "``DYNINST_BUILD_TBB`` (default: OFF)"
"ElfUtils", "0.178", "Dyninst", "``DYNINST_BUILD_ELFUTILS`` (default: OFF)"
"LibIberty", "", "Dyninst", "``DYNINST_BUILD_LIBIBERTY`` (default: OFF)"
"Boost", "1.67.0", "Dyninst", "``DYNINST_BUILD_BOOST`` (default: OFF)"
"OpenMP", "4.x", "Dyninst", ""
Optional third-party packages
-----------------------------------
* `ROCm <https://rocm.docs.amd.com/projects/install-on-linux/en/latest>`_
* HIP
* Roctracer for HIP API and kernel tracing
* ROCM-SMI for GPU monitoring
* Rocprofiler for GPU hardware counters
* `PAPI <https://icl.utk.edu/papi/>`_
* MPI
* ``OMNITRACE_USE_MPI`` enables full MPI support
* ``OMNITRACE_USE_MPI_HEADERS`` enables wrapping of the dynamically-linked MPI C function calls.
(By default, if Omnitrace cannot find an OpenMPI MPI distribution, it uses a local copy
of the OpenMPI ``mpi.h``.)
* Several optional third-party profiling tools supported by Timemory
(for example, `Caliper <https://github.com/LLNL/Caliper>`_, `TAU <https://www.cs.uoregon.edu/research/tau/home.php>`_, CrayPAT, and others)
.. csv-table::
:header: "Third-Party Library", "CMake Enable Option", "CMake Build Option"
:widths: 15, 45, 40
"PAPI", "``OMNITRACE_USE_PAPI`` (default: ON)", "``OMNITRACE_BUILD_PAPI`` (default: ON)"
"MPI", "``OMNITRACE_USE_MPI`` (default: OFF)", ""
"MPI (header-only)", "``OMNITRACE_USE_MPI_HEADERS`` (default: ON)", ""
Installing Dyninst
-----------------------------------
The easiest way to install Dyninst is alongside Omnitrace, but it can also be installed using Spack.
Building Dyninst alongside Omnitrace
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To install Dyninst alongside Omnitrace, configure Omnitrace with ``OMNITRACE_BUILD_DYNINST=ON``.
Depending on the version of Ubuntu, the ``apt`` package manager might have current enough
versions of the Dyninst Boost, TBB, and LibIberty dependencies
(use ``apt-get install libtbb-dev libiberty-dev libboost-dev``).
However, it is possible to request Dyninst to install
its dependencies via ``DYNINST_BUILD_<DEP>=ON``, as follows:
.. code-block:: shell
git clone https://github.com/ROCm/omnitrace.git omnitrace-source
cmake -B omnitrace-build -DOMNITRACE_BUILD_DYNINST=ON -DDYNINST_BUILD_{TBB,ELFUTILS,BOOST,LIBIBERTY}=ON omnitrace-source
where ``-DDYNINST_BUILD_{TBB,BOOST,ELFUTILS,LIBIBERTY}=ON`` is expanded by
the shell to ``-DDYNINST_BUILD_TBB=ON -DDYNINST_BUILD_BOOST=ON ...``
Installing Dyninst via Spack
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
`Spack <https://github.com/spack/spack>`_ is another option to install Dyninst and its dependencies:
.. code-block:: shell
git clone https://github.com/spack/spack.git
source ./spack/share/spack/setup-env.sh
spack compiler find
spack external find --all --not-buildable
spack spec -I --reuse dyninst
spack install --reuse dyninst
spack load -r dyninst
Installing Omnitrace
-----------------------------------
Omnitrace has CMake configuration options for MPI support (``OMNITRACE_USE_MPI`` or
``OMNITRACE_USE_MPI_HEADERS``), HIP kernel tracing (``OMNITRACE_USE_ROCTRACER``),
ROCm device sampling (``OMNITRACE_USE_ROCM_SMI``), OpenMP-Tools (``OMNITRACE_USE_OMPT``),
hardware counters via PAPI (``OMNITRACE_USE_PAPI``), among other features.
Various additional features can be enabled via the
``TIMEMORY_USE_*`` `CMake options <https://timemory.readthedocs.io/en/develop/installation.html#cmake-options>`_.
Any ``OMNITRACE_USE_<VAL>`` option which has a corresponding ``TIMEMORY_USE_<VAL>``
option means that the Timemory support for this feature has been integrated
into Perfetto support for Omnitrace, for example, ``OMNITRACE_USE_PAPI=<VAL>`` also configures
``TIMEMORY_USE_PAPI=<VAL>``. This means the data that Timemory is able to collect via this package
is passed along to Perfetto and is displayed when the ``.proto`` file is visualized
in `the Perfetto UI <https://ui.perfetto.dev>`_.
.. code-block:: shell
git clone https://github.com/ROCm/omnitrace.git omnitrace-source
cmake \
-B omnitrace-build \
-D CMAKE_INSTALL_PREFIX=/opt/omnitrace \
-D OMNITRACE_USE_HIP=ON \
-D OMNITRACE_USE_ROCM_SMI=ON \
-D OMNITRACE_USE_ROCTRACER=ON \
-D OMNITRACE_USE_PYTHON=ON \
-D OMNITRACE_USE_OMPT=ON \
-D OMNITRACE_USE_MPI_HEADERS=ON \
-D OMNITRACE_BUILD_PAPI=ON \
-D OMNITRACE_BUILD_LIBUNWIND=ON \
-D OMNITRACE_BUILD_DYNINST=ON \
-D DYNINST_BUILD_TBB=ON \
-D DYNINST_BUILD_BOOST=ON \
-D DYNINST_BUILD_ELFUTILS=ON \
-D DYNINST_BUILD_LIBIBERTY=ON \
omnitrace-source
cmake --build omnitrace-build --target all --parallel 8
cmake --build omnitrace-build --target install
source /opt/omnitrace/share/omnitrace/setup-env.sh
.. _mpi-support-omnitrace:
MPI support within Omnitrace
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Omnitrace can have full (``OMNITRACE_USE_MPI=ON``) or partial (``OMNITRACE_USE_MPI_HEADERS=ON``) MPI support.
The only difference between these two modes is whether or not the results collected
via Timemory and/or Perfetto can be aggregated into a single
output file during finalization. When full MPI support is enabled, combining the
Timemory results always occurs, whereas combining the Perfetto
results is configurable via the ``OMNITRACE_PERFETTO_COMBINE_TRACES`` setting.
The primary benefits of partial or full MPI support are the automatic wrapping
of MPI functions and the ability
to label output with suffixes which correspond to the ``MPI_COMM_WORLD`` rank ID
instead of having to use the system process identifier (i.e. ``PID``).
In general, it's recommended to use partial MPI support with the OpenMPI
headers as this is the most portable configuration.
If full MPI support is selected, make sure your target application is built
against the same MPI distribution as Omnitrace.
For example, do not build Omnitrace with MPICH and use it on a target application built against OpenMPI.
If partial support is selected, the reason the OpenMPI headers are recommended instead of the MPICH headers is
because the ``MPI_COMM_WORLD`` in OpenMPI is a pointer to ``ompi_communicator_t`` (8 bytes),
whereas ``MPI_COMM_WORLD`` in MPICH is an ``int`` (4 bytes). Building Omnitrace with partial MPI support
and the MPICH headers and then using
Omnitrace on an application built against OpenMPI causes a segmentation fault.
This happens because the value of the ``MPI_COMM_WORLD`` is truncated
during the function wrapping before being passed along to the underlying MPI function.
.. _post-installation-steps:
Post-installation steps
========================================
After installation, you can optionally configure the Omnitrace environment.
You should also test the executables to confirm Omnitrace is correctly installed.
Configure the environment
-----------------------------------
If environment modules are available and preferred, add them using these commands:
.. code-block:: shell
module use /opt/omnitrace/share/modulefiles
module load omnitrace/1.0.0
Alternatively, you can directly source the ``setup-env.sh`` script:
.. code-block:: shell
source /opt/omnitrace/share/omnitrace/setup-env.sh
Test the executables
-----------------------------------
Successful execution of these commands confirms that the installation does not have any
issues locating the installed libraries:
.. code-block:: shell
omnitrace-instrument --help
omnitrace-avail --help
.. note::
If ROCm support is enabled, you might have to add the path to the ROCm libraries to ``LD_LIBRARY_PATH``,
for example, ``export LD_LIBRARY_PATH=/opt/rocm/lib:${LD_LIBRARY_PATH}``.
.. _post-installation-troubleshooting:
Post-installation troubleshooting
========================================
This section explains how to resolve certain issues that might happen when you first use Omnitrace.
Issues with RHEL and SELinux
----------------------------------------------------
RHEL (Red Hat Enterprise Linux) and related distributions of Linux automatically enable a security feature
named SELinux (Security-Enhanced Linux) that prevents Omnitrace from running.
This issue applies to any Linux distribution with SELinux installed, including RHEL,
CentOS, Fedora, and Rocky Linux. The problem can happen with any GPU, or even without a GPU.
The problem occurs after you instrument a program and try to
run ``omnitrace-run`` with the instrumented program.
.. code-block:: shell
g++ hello.cpp -o hello
omniperf-instrument -M sampling -o hello.instr -- ./hello
omnitrace-run -- ./hello.instr
Instead of successfully running the binary with call-stack sampling,
Omnitrace crashes with a segmentation fault.
.. note::
If you are physically logged in on the system (not using SSH or a remote connection),
the operating system might display an SELinux pop-up warning in the notifications.
To workaround this problem, either disable SELinux or configure it to use a more
permissive setting.
To avoid this problem for the duration of the current session, run this command
from the shell:
.. code-block:: shell
sudo setenforce 0
For a permanent workaround, edit the SELinux configuration file using the command
``sudo vim /etc/sysconfig/selinux`` and change the ``SELINUX`` setting to
either ``Permissive`` or ``Disabled``.
.. note::
Permanently changing the SELinux settings can have security implications.
Ensure you review your system security settings before making any changes.
Modifying RPATH details
----------------------------------------------------
If you're experiencing problems loading your application with an instrumented library,
then you might have to check and modify the RPATH specified in your application.
See the section on `troubleshooting RPATHs <../how-to/instrumenting-rewriting-binary-application.html#rpath-troubleshooting>`_
for further details.
Configuring PAPI to collect hardware counters
----------------------------------------------------
To use PAPI to collect the majority of hardware counters, ensure
the ``/proc/sys/kernel/perf_event_paranoid`` setting has a value less than or equal to ``2``.
For more information, see the :ref:`omnitrace_papi_events` section.
@@ -0,0 +1,30 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
*************************************
Omnitrace quick start
*************************************
To install Omnitrace, download the `Omnitrace installer <https://github.com/ROCm/omnitrace/releases/latest/download/omnitrace-install.py>`_
and specify ``--prefix <install-directory>``. The script attempts to auto-detect
the appropriate OS distribution and version. To include AMD ROCm Software support,
specify ``--rocm X.Y``, where ``X`` is the ROCm major
version and ``Y`` is the ROCm minor version, for example, ``--rocm 6.2``.
.. code-block:: shell
wget https://github.com/ROCm/omnitrace/releases/latest/download/omnitrace-install.py
python3 ./omnitrace-install.py --prefix /opt/omnitrace --rocm 6.2
This script supports installation on Ubuntu, OpenSUSE, Red Hat, Debian, CentOS, and Fedora.
If the target OS is compatible with one of the operating system versions listed in
the comprehensive :doc:`Installation guidelines <./install>`,
specify ``-d <DISTRO> -v <VERSION>``. For example, if the OS is compatible with Ubuntu 22.04, pass
``-d ubuntu -v 22.04`` to the script.
.. note::
If you have ROCm version 6.2 or higher installed, you can use the
package manager to install a pre-built copy of Omnitrace using
``apt install omnitrace`` or ``dnf install omnitrace``.
+4
Féach ar an gComhad
@@ -0,0 +1,4 @@
# License
```{include} ../LICENSE
```
@@ -0,0 +1,412 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
****************************************************
Development guide
****************************************************
This guide discusses the `Omnitrace <https://github.com/ROCm/omnitrace>`_ design.
It includes a list of the executables and libraries, along with a discussion of the application's
memory, sampling, and time-window constraint models.
Executables
========================================
This section lists the Omnitrace executables.
omnitrace-avail: `source/bin/omnitrace-avail <https://github.com/ROCm/omnitrace/tree/main/source/bin/omnitrace-avail>`_
-------------------------------------------------------------------------------------------------------------------------------
The ``main`` routine of ``omnitrace-avail`` has three important sections:
* Printing components
* Printing options
* Printing hardware counters
omnitrace-sample: `source/bin/omnitrace-sample <https://github.com/ROCm/omnitrace/tree/main/source/bin/omnitrace-sample>`_
-------------------------------------------------------------------------------------------------------------------------------
* Requires a command-line format of ``omnitrace-sample <options> -- <command> <command-args>``
* Translates command-line options into environment variables
* Adds ``libomnitrace-dl.so`` to ``LD_PRELOAD``
* Is launched by using ``execvpe`` with ``<command> <command-args>`` and a modified environment
omnitrace-casual: `source/bin/omnitrace-causal <https://github.com/ROCm/omnitrace/tree/main/source/bin/omnitrace-causal>`_
-------------------------------------------------------------------------------------------------------------------------------
When there is exactly one causal profiling configuration variant (which enables debugging),
``omnitrace-casual`` has a nearly identical design to ``omnitrace-sample``
When the command-line options produce more than one causal profiling configuration variant,
the following actions take place for each variant:
* ``omnitrace-causal`` calls ``fork()``
* the child process launches ``<command> <command-args>`` using ``execvpe``, which modifies the environment for the variant
* the parent process waits for the child process to finish
omnitrace-instrument: `source/bin/omnitrace-instrument <https://github.com/ROCm/omnitrace/tree/main/source/bin/omnitrace-instrument>`_
-------------------------------------------------------------------------------------------------------------------------------------------
* Requires a command-line format of ``omnitrace-instrument <options> -- <command> <command-args>``
* Allows the user to provide options specifying whether to perform runtime instrumentation, use binary rewrite, or
attach to process
* Either opens the instrumentation target (for binary rewrite), launches the target and stops it
before it starts executing ``main``, or attaches to a running executable and pauses it
* Finds all functions in the targets
* Finds ``libomnitrace-dl`` and locates the functions
* Iterates over and instruments all the functions, provided they satisfy the
defined criteria (such as a minimum number of instructions)
* See the ``module_function`` class
* Until this point, the workflow has been the same for the different options,
but it diverges after instrumentation is complete:
* For a binary rewrite: it produces a new instrumented binary and exits
* For runtime instrumentation or attaching to a process: it instructs the application
to resume and then waits for it to exit
Libraries
========================================
Common library: `source/lib/common <https://github.com/ROCm/omnitrace/tree/main/source/lib/common>`_
--------------------------------------------------------------------------------------------------------------------------------
* General header-only functionality used in multiple executables and/or libraries.
* Not installed or exported outside of the build tree.
Core library: `source/lib/core <https://github.com/ROCm/omnitrace/tree/main/source/lib/core>`_
--------------------------------------------------------------------------------------------------------------------------------
* Static PIC library with functionality that does not depend on any components.
* Not installed or exported outside of the build tree.
Binary library: `source/lib/binary <https://github.com/ROCm/omnitrace/tree/main/source/lib/binary>`_
--------------------------------------------------------------------------------------------------------------------------------
* Static PIC library with functionality for reading/analyzing binary info.
* Mostly used by the causal profiling sections of ``libomnitrace``.
* Not installed or exported outside of the build tree.
libomnitrace: `source/lib/omnitrace <https://github.com/ROCm/omnitrace/tree/main/source/lib/omnitrace>`_
--------------------------------------------------------------------------------------------------------------------------------
This is the main library encapsulating all the capabilities.
libomnitrace-dl: `source/lib/omnitrace-dl <https://github.com/ROCm/omnitrace/tree/main/source/lib/omnitrace-dl>`_
--------------------------------------------------------------------------------------------------------------------------------
This is a lightweight, front-end library for ``libomnitrace`` which serves three primary purposes:
* Dramatically speeds up instrumentation time compared to using ``libomnitrace`` directly because
Dyninst must parse the entire library in order to find the instrumentation functions
(a ``dlopen`` call is made on ``libomnitrace`` when the instrumentation functions get called)
* Prevents re-entry if ``libomnitrace`` calls an instrumented function internally
* Coordinates communication between ``libomnitrace-user`` and ``libomnitrace``
libomnitrace-user: `source/lib/omnitrace-user <https://github.com/ROCm/omnitrace/tree/main/source/lib/omnitrace-user>`_
--------------------------------------------------------------------------------------------------------------------------------
* Provides a set of functions and types for the users to add to their code, for example,
disabling data collection globally or on a specific thread or
user-defined region
* If ``libomnitrace-dl`` is not loaded, the user API is effectively a set of no-op function calls.
Testing tools
========================================
* `CDash Testing Dashboard <https://my.cdash.org/index.php?project=Omnitrace>`_ (requires a login)
Components
========================================
Most measurements and capabilities are encapsulated into a "component" with the following definitions:
Measurement
A recording of some data relevant to performance, for instance, the current call-stack,
hardware counter values, current memory usage, or timestamp
Capability
Handles the implementation or orchestration of some feature which is used
to collect measurements, for example, a component which handles setting up function wrappers
around various functions such as ``pthread_create`` or ``MPI_Init``.
Components are designed to either hold no data at all or only the data for both an instantaneous
measurement and a phase measurement.
Components which store data typically implement a static ``record()`` function
for getting a record of the measurement,
``start()`` and ``stop()`` member functions for calculating a phase measurement,
and a ``sample()`` member function for storing an
instantaneous measurement. In reality, there are several more "standard" functions
but these are the most commonly-used ones.
Components which do not store data might also have ``start()``, ``stop()``, and ``sample()``
functions. However, components which
implement function wrappers typically provide a call operator or ``audit(...)``
functions. These are invoked with the
wrapped function's arguments before the wrapped function gets called and with the return value
after the wrapped function gets called.
.. note::
The goal of this design is to provide relatively small and resuable lightweight objects
for recording measurements and implementing capabilities.
Wall-clock component example
--------------------------------------
A component for computing the elapsed wall-clock time looks like this:
.. code-block:: cpp
struct wall_clock
{
using value_type = int64_t;
static value_type record() noexcept
{
return std::chrono::steady_clock::now().time_since_epoch().count();
}
void sample() noexcept
{
value = record();
}
void start() noexcept
{
value = record();
}
void stop() noexcept
{
auto _start_value = value;
value = record();
accum += (value - _start_value);
}
private:
int64_t value = 0;
int64_t accum = 0;
};
Function wrapper component example
--------------------------------------
A component which implements wrappers around ``fork()`` and ``exit(int)`` (and stores no data)
could look like this:
.. code-block:: cpp
struct function_wrapper
{
pid_t operator()(const gotcha_data&, pid_t (*real_fork)())
{
// disable all collection before forking
categories::disable_categories(config::get_enabled_categories());
auto _pid_v = real_fork();
// only re-enable collection on parent process
if(_pid_v != 0)
categories::enable_categories(config::get_enabled_categories());
return _pid_v;
}
void operator()(const gotcha_data&, void (*real_exit)(int), int _exit_code)
{
// catch the call to exit and finalize before truly exiting
omnitrace_finalize();
real_exit(_exit_code);
}
};
Component member functions
--------------------------------------
There are no real restrictions or requirements on the member functions a component needs to provide.
Unless the component is being used directly, the invocation of component member functions via a "component bundler"
(provided by Timemory) makes extensive use of template metaprogramming concepts. This finds the best match, if any,
for calling a component's member function. This is a bit easier to demonstrate using an example:
.. code-block:: cpp
struct foo
{
void sample() { puts("foo::sample()"); }
};
struct bar
{
void sample(int) { puts("bar::sample(int)"); }
};
struct spam
{
void start(int) { puts("spam::start()"); }
void stop() { puts("spam::stop()"); }
};
int main()
{
auto _bundle = component_tuple<foo, bar, spam>{ "main" };
puts("A");
_bundle.start();
puts("B");
_bundle.sample(10);
puts("C");
_bundle.sample();
puts("D");
_bundle.stop();
}
When the preceding code runs, the following messages are printed:
.. code-block:: shell
A
spam::start()
B
foo::sample()
bar::sample(int)
C
foo::sample()
D
spam::stop()
In section A, the bundle determined that only the ``spam`` object has a ``start`` function. Since this is determined
via template metaprogramming instead of dynamic polymorphism, this effectively omits any code related to
the ``foo`` or ``bar`` objects. In section B, because the integer ``10`` is passed to the bundle,
the bundle forwards this value to ``bar::sample(int)`` after it invokes ``foo::sample()``. ``foo::sample()`` is
invoked because the bundle recognizes that the call to the ``sample`` member function is still possible without
the argument.
Memory model
========================================
Collected data is generally handled in one of the three following ways:
* It is handed directly to, and stored by, Perfetto
* It is managed implicitly by Timemory and accessed as needed
* As thread-local data
In general, only instrumentation for relatively simple data is directly passed to
Perfetto and/or Timemory during runtime.
For example, the callbacks from binary instrumentation, user API instrumentation,
and roctracer directly invoke
calls to Perfetto or Timemory's storage model. Otherwise, the data is stored
by Omnitrace in the thread-data model
which is more persistent than simply using ``thread_local`` static data, which gets deleted
when the thread stops.
Thread identification
--------------------------------------
Each CPU thread is assigned two integral identifiers. One identifier, the ``internal_value``, is
atomically incremented every time a new thread is created.
The other identifier, known as the ``sequent_value``, tries to account for the fact that Omnitrace, Perfetto, ROCm, and other applications
start background threads. When a thread is created as a by-product of Omnitrace,
the index is offset by a large value. This serves
two purposes:
* Accessing the data for threads created by the user is closer in memory
* When log messages are printed, the index approximately correlates to the order of thread creation from the user's perspective.
The ``sequent_value`` identifier is typically used to access the thread-data.
Thread-data class
--------------------------------------
Currently, most thread data is effectively stored in a static
``std::array<std::unique_ptr<T>, OMNITRACE_MAX_THREADS>`` instance.
``OMNITRACE_MAX_THREADS`` is a value defined a compile-time and set to ``2048``
for release builds. During finalization,
Omnitrace iterates through the thread-data and transforms that data
into something that can be passed along to Perfetto and/or Timemory.
The downside of the current model is that if the user exceeds ``OMNITRACE_MAX_THREADS``,
a segmentation fault occurs. To fix this issue,
a new model is being adopted which has all the benefits of this model
but permits dynamic expansion.
Sampling model
========================================
The general structure for the sampling is within Timemory (``source/timemory/sampling``).
Currently, all sampling is done per-thread
via POSIX timers. Omnitrace supports both a real-time timer and a CPU-time timer.
Both have adjustable frequencies, delays, and durations.
By default, only CPU-time sampling is enabled. Initial settings are inherited from
the settings starting with ``OMNITRACE_SAMPLING_``.
For each type of timer, timer-specific settings can be used to
override the common and inherited timer settings.
These settings begin with ``OMNITRACE_SAMPLING_CPUTIME`` for the CPU-time sampler
and ``OMNITRACE_SAMPLING_REALTIME`` for
the real-time sampler. For example, ``OMNITRACE_SAMPLING_FREQ=500`` initially sets the
sampling frequency to 500 interrupts per second. Adding the setting ``OMNITRACE_SAMPLING_REALTIME_FREQ=10``
lowers the sampling frequency for the real-time sampler
to 10 interrupts per second of real-time.
The Omnitrace-specific implementation can be found in
`source/lib/omnitrace/library/sampling.cpp <https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/sampling.cpp>`_.
Within `sampling.cpp <https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/sampling.cpp>`_,
there is a bundle of three sampling components:
* `backtrace_timestamp <https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/components/backtrace_timestamp.hpp>`_ simply
records the wall-clock time of the sample.
* `backtrace <https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/components/backtrace.hpp>`_
records the call-stack via libunwind.
* `backtrace_metrics <https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/components/backtrace_metrics.hpp>`_
records the sample metrics, such as peak RSS and the hardware counters.
These three components are bundled together in
a tuple-like ``struct`` (``tuple<backtrace_timestamp, backtrace, backtrace_metrics>``).
A buffer of at least 1024 instances of this tuple is mapped using ``mmap``
per-thread. When this buffer is full,
the sampler hands the buffer off to its allocator thread and maps a new buffer with ``mmap``
before taking the next sample. The allocator thread takes this data
and either dynamically stores it in memory or writes it to a file depending on the
value of ``OMNITRACE_USE_TEMPORARY_FILES``.
This schema avoids all allocations in the signal handler, lets the data grow
dynamically, avoids potentially slow I/O within the signal handler, and also enables
the capability of avoiding I/O altogether.
The maximum number of samplers handled by each allocator is governed by the
``OMNITRACE_SAMPLING_ALLOCATOR_SIZE`` setting (the default is eight). Whenever an allocator
has reached its limit,
a new internal thread is created to handle the new samplers.
Time-window constraint model
========================================
With the recent introduction of tracing delay and duration, the
`constraint namespace <https://github.com/ROCm/omnitrace/blob/main/source/lib/core/constraint.hpp>`_
was introduced to improve the management of delays and duration limits for
data collection. The ``spec`` class accepts a clock identifier, a delay value, a duration value, and an
integer indicating how many times to repeat the delay and duration cycle. It is therefore
possible to perform tasks such as periodically enabling tracing for brief periods
of time in between long periods without data collection while the application runs. The
syntax follows the format ``clock_identifier:delay:capture_duration:cycles``, so a value of
``10:1:3`` for the last three parameters represents the following sequence of operations:
* Ten seconds where no data is collected, then one second where it is
* Ten seconds where no data is collected, then one second where it is
* Ten seconds where no data is collected, then one second where it is
* Stop
As another example, ``OMNITRACE_TRACE_PERIODS = realtime:10:1:5 process_cputime:10:2:20`` translates
to this sequence:
* Five cycles of: no data collection for ten seconds of real-time followed by one second of data collection
* Twenty cycles of: no data collection for ten seconds of process CPU time followed by two CPU-time seconds of data collection
Eventually, the goal is to migrate all subsets of data collection which currently support
more rudimentary models of time window constraints, such as process sampling and causal profiling,
to this model.
@@ -0,0 +1,102 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
*******************
Omnitrace Glossary
*******************
This topic explains the terminology necessary to use Omnitrace.
The list below provides a basic glossary for those who
are new to binary instrumentation. It also clarifies ambiguities
when certain terms have different
contextual meanings, for example, the Omnitrace meaning of the term "module"
when instrumenting Python.
**Binary**
A file written in the Executable and Linkable Format (ELF). This is the standard file
format for executable files, shared libraries, etc.
**Binary instrumentation**
Inserting callbacks to instrumentation into an existing binary. This can be performed
statically or dynamically.
**Static binary instrumentation**
Loads an existing binary, determines instrumentation points, and generates a new binary
with instrumentation directly embedded. It is applicable to executables and libraries but
limited to only the functions defined in the binary. This is also known as **Binary rewrite**.
**Dynamic binary instrumentation**
Loads an existing binary into memory, inserts instrumentation, and runs the binary.
It is limited to executables but is capable of instrumenting linked libraries.
This is also known as **Runtime instrumentation**.
**Statistical sampling**
At periodic intervals, the application is paused and the current call-stack of the CPU
is recorded along with various other metrics. It uses timers that measure either
(A) real clock time or (B) the CPU time used by the current thread and the CPU time
expended on behalf of the thread by the system. This is also known as simply **sampling**.
**Sampling rate**
* The period at which (A) or (B) are triggered (in units of ``# interrupts / second``)
* Higher values increase the number of samples
**Sampling delay**
* How long to wait before (A) and (B) begin triggering at their designated rate
**Sampling duration**
* The amount of time (in real-time) after the start of the application to record samples.
* After this time limit has been reached, no more samples are recorded.
**Process sampling**
At periodic (real-time) intervals, a background thread records global metrics without
interrupting the current process. These metrics include, but are not limited to:
CPU frequency, CPU memory high-water mark (i.e. peak memory usage), GPU temperature,
and GPU power usage.
**Sampling rate**
* The real-time period for recording metrics (in units of ``# measurements / second``)
* Higher values increase the number of samples
**Sampling delay**
* How long to wait (in real-time) before recording samples
**Sampling duration**
* The amount of time (in real-time) after the start of the application to record samples.
* After this time limit has been reached, no more samples are recorded.
**Module**
With respect to binary instrumentation, a module is defined as either the filename
(such as ``foo.c``) or library name (``libfoo.so``) which contains the definition
of one or more functions.
With respect to Python instrumentation, a module is defined as the **file** which contains
the definition of one or more functions. The full path to this file typically contains the
name of the "Python module".
**Basic block**
A straight-line code sequence with no branches in (except for the entry) and
no branches out (except for the exit).
**Address range**
The instructions for a function in a binary start at certain address with the ELF file
and end at a certain address. The range is ``end - start``.
The address range is a decent approximation for the "cost" of a function.
For example, a larger address range approximately equates to more instructions.
**Instrumentation traps**
On the x86 architecture, because instructions are of variable size, an instruction
might be too small for Dyninst to replace it with the normal code sequence
used to call instrumentation. When instrumentation is placed at points other
than subroutine entry, exit, or call points, traps may be used to ensure
the instrumentation fits. (By default, ``omnitrace-instrument`` avoids instrumentation
which requires a trap.)
**Overlapping functions**
Due to language constructs or compiler optimizations, it might be possible for
multiple functions to overlap (that is, share part of the same function body)
or for a single function to have multiple entry points. In practice, it's
impossible to determine the difference between multiple overlapping functions
and a single function with multiple entry points. (By default, ``omnitrace-instrument``
avoids instrumenting overlapping functions.)
@@ -0,0 +1,70 @@
# Anywhere {branch} is used, the branch name will be substituted.
# These comments will also be removed.
defaults:
numbered: False
maxdepth: 6
root: index
subtrees:
- entries:
- file: what-is-omnitrace.rst
- caption: Install
entries:
- file: install/quick-start.rst
title: Omnitrace quick start
- file: install/install.rst
title: Omnitrace installation guide
- caption: Tutorials
entries:
- url: https://github.com/ROCm/omnitrace/tree/main/examples
title: GitHub examples
- file: tutorials/video-tutorials.rst
title: Video tutorials
- caption: How to
entries:
- file: how-to/configuring-validating-environment.rst
title: Configuring and validating the environment
- file: how-to/configuring-runtime-options.rst
title: Configuring runtime options
- file: how-to/sampling-call-stack.rst
title: Sampling the call stack
- file: how-to/instrumenting-rewriting-binary-application.rst
title: Instrumenting and rewriting a binary application
- file: how-to/performing-causal-profiling.rst
title: Performing causal profiling
- file: how-to/understanding-omnitrace-output.rst
title: Understanding the Omnitrace output
- file: how-to/profiling-python-scripts.rst
title: Profiling Python scripts
- file: how-to/using-omnitrace-api.rst
title: Using the Omnitrace API
- file: how-to/general-tips-using-omnitrace.rst
title: General tips for using Omnitrace
- caption: Conceptual
entries:
- file: conceptual/data-collection-modes.rst
title: Data collection modes
- file: conceptual/omnitrace-feature-set.rst
title: The Omnitrace feature set and use cases
- caption: Reference
entries:
- file: reference/development-guide.rst
title: Development guide
- file: reference/omnitrace-glossary.rst
title: Omnitrace glossary
- file: doxygen/html/files
title: API library
- file: doxygen/html/functions
title: Class member functions
- file: doxygen/html/globals
title: Globals
- file: doxygen/html/annotated
title: Classes, structures, and interfaces
- caption: About
entries:
- file: license.md
@@ -0,0 +1 @@
rocm-docs-core[api_reference]==1.4.1
@@ -0,0 +1,169 @@
#
# This file is autogenerated by pip-compile with Python 3.10
# by the following command:
#
# pip-compile requirements.in
#
accessible-pygments==0.0.5
# via pydata-sphinx-theme
alabaster==0.7.16
# via sphinx
babel==2.15.0
# via
# pydata-sphinx-theme
# sphinx
beautifulsoup4==4.12.3
# via pydata-sphinx-theme
breathe==4.35.0
# via rocm-docs-core
certifi==2024.6.2
# via requests
cffi==1.16.0
# via
# cryptography
# pynacl
charset-normalizer==3.3.2
# via requests
click==8.1.7
# via
# click-log
# doxysphinx
# sphinx-external-toc
click-log==0.4.0
# via doxysphinx
cryptography==42.0.8
# via pyjwt
deprecated==1.2.14
# via pygithub
docutils==0.21.2
# via
# breathe
# myst-parser
# pydata-sphinx-theme
# sphinx
doxysphinx==3.3.9
# via rocm-docs-core
fastjsonschema==2.20.0
# via rocm-docs-core
gitdb==4.0.11
# via gitpython
gitpython==3.1.43
# via rocm-docs-core
idna==3.7
# via requests
imagesize==1.4.1
# via sphinx
jinja2==3.1.4
# via
# myst-parser
# sphinx
libsass==0.22.0
# via doxysphinx
lxml==4.9.4
# via doxysphinx
markdown-it-py==3.0.0
# via
# mdit-py-plugins
# myst-parser
markupsafe==2.1.5
# via jinja2
mdit-py-plugins==0.4.1
# via myst-parser
mdurl==0.1.2
# via markdown-it-py
mpire==2.10.2
# via doxysphinx
myst-parser==3.0.1
# via rocm-docs-core
numpy==1.26.4
# via doxysphinx
packaging==24.1
# via
# pydata-sphinx-theme
# sphinx
pycparser==2.22
# via cffi
pydata-sphinx-theme==0.15.4
# via
# rocm-docs-core
# sphinx-book-theme
pygithub==2.3.0
# via rocm-docs-core
pygments==2.18.0
# via
# accessible-pygments
# mpire
# pydata-sphinx-theme
# sphinx
pyjson5==1.6.6
# via doxysphinx
pyjwt[crypto]==2.8.0
# via pygithub
pynacl==1.5.0
# via pygithub
pyparsing==3.1.2
# via doxysphinx
pyyaml==6.0.1
# via
# myst-parser
# rocm-docs-core
# sphinx-external-toc
requests==2.32.3
# via
# pygithub
# sphinx
rocm-docs-core[api_reference]==1.4.1
# via -r requirements.in
smmap==5.0.1
# via gitdb
snowballstemmer==2.2.0
# via sphinx
soupsieve==2.5
# via beautifulsoup4
sphinx==7.3.7
# via
# breathe
# myst-parser
# pydata-sphinx-theme
# rocm-docs-core
# sphinx-book-theme
# sphinx-copybutton
# sphinx-design
# sphinx-external-toc
# sphinx-notfound-page
sphinx-book-theme==1.1.3
# via rocm-docs-core
sphinx-copybutton==0.5.2
# via rocm-docs-core
sphinx-design==0.6.0
# via rocm-docs-core
sphinx-external-toc==1.0.1
# via rocm-docs-core
sphinx-notfound-page==1.0.2
# via rocm-docs-core
sphinxcontrib-applehelp==1.0.8
# via sphinx
sphinxcontrib-devhelp==1.0.6
# via sphinx
sphinxcontrib-htmlhelp==2.0.5
# via sphinx
sphinxcontrib-jsmath==1.0.1
# via sphinx
sphinxcontrib-qthelp==1.0.7
# via sphinx
sphinxcontrib-serializinghtml==1.1.10
# via sphinx
tomli==2.0.1
# via sphinx
tqdm==4.66.4
# via mpire
typing-extensions==4.12.2
# via
# pydata-sphinx-theme
# pygithub
urllib3==2.2.2
# via
# pygithub
# requests
wrapt==1.16.0
# via deprecated
@@ -0,0 +1,35 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
****************************************************
Video tutorials
****************************************************
Installing a binary release
========================================
.. raw:: html
<p align="center"><iframe width="560" height="315" src="https://www.youtube.com/embed/gKtNCKf1IXA?modestbranding=1" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>
Instrumenting a binary
========================================
.. raw:: html
<p align="center"><iframe width="560" height="315" src="https://www.youtube.com/embed/2B0gRr3FygQ?modestbranding=1" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>
Writing an Omnitrace configuration file
========================================
.. raw:: html
<p align="center"><iframe width="560" height="315" src="https://www.youtube.com/embed/oG_fPYx9_gs?modestbranding=1" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>
Visualization and features of Perfetto traces
=============================================
.. raw:: html
<p align="center"><iframe width="560" height="315" src="https://www.youtube.com/embed/7WN3N1hnCbI?modestbranding=1" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>
@@ -0,0 +1,28 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
******************
What is Omnitrace?
******************
Omnitrace is designed for the high-level profiling and comprehensive tracing
of applications running on the CPU or the CPU and GPU. It supports dynamic binary
instrumentation, call-stack sampling, and various other features for determining
which function and line number are currently executing.
A visualization of the comprehensive Omnitrace results can be observed in any modern
web browser. Upload the Perfetto (``.proto``) output files produced by Omnitrace at
`ui.perfetto.dev <https://ui.perfetto.dev/>`_ to see the details.
Aggregated high-level results are available as human-readable text files and
JSON files for programmatic analysis. The JSON output files are compatible with the
`hatchet <https://github.com/hatchet/hatchet>`_ Python package. Hatchet converts
the performance data into pandas data frames and facilitates multi-run comparisons, filtering,
and visualization in Jupyter notebooks.
To use Omnitrace for instrumentation, follow these two configuration steps:
#. Indicate the functions and modules to :doc:`instrument <./how-to/instrumenting-rewriting-binary-application>` in the target binaries, including the executable and any libraries
#. Specify the :doc:`instrumentation parameters <./how-to/configuring-runtime-options>` to use when the instrumented binaries are launched