Omnitrace docs refactoring (#353)

* Add Sphinx and Read the Docs configs * Add documentation workflow configurations * Changed macros verbprintf and verbprintf_bare so they write to stdout… (#346) Flush stdout when listing keys + bump verbose level for GPU count * Removing static version asserts. (#347) It is causing failures on our internal builds Signed-off-by: David Galiffi <David.Galiffi@amd.com> * Check for an empty vector before popping (#350) Protect from possible seg. fault Signed-off-by: David Galiffi <David.Galiffi@amd.com> * Add release links to installation.md (#351) * Initial infrastructure rework for Omnitrace refactoring and a rewrite of the What is file * Add files in conceptual section, along with images and infrastructure changes. * Formatting and style fixes for files in conceptual directory * Add quick start install guide and fix spelling errors in other files * Add install document and fix code tags. Infrastructure changes * Add two how-to guides along with infra changes and spelling fixes * Add two new how to files and fix errors in the last commit * Fix spelling mistakes * Add new how to file on causal profiling and infra changes. 
* Add how to file on interpreting Omnitrace output, fixes, and images * Add remaining how-to guides and reference materials along with fixes and infrastructure * Add YouTube file and fix spelling and formatting * Fix a few loose ends and add link to license page * Add Sphinx and Doxygen infrastructure and some additional corrections * Update rocm-docs-core * Fix Doxyfile * Fix path to API header files * Run doxysphinx in conf.py * Add back custom css for doxygen * Remove doxygenlayout * Add api to toc * Update Doxyfile Generate from source .in * Proofreading edits and other changes * Add .gitignore for Doxygen and remove deprecated words and typos * Fix one additional typo * Turn off dot * Update doxyfile strip from path * Workflow, submodules, and thread info Updates (#352) * Update CI workflows - use node20 workflow packages * Update tests/source/CMakeLists.txt - Use OMNITRACE_TRACE and OMNTRACE_PROFILE instead of perfetto/timemory * Update timemory submodule - argparse: requires -> required - parse callbacks * Update thread_info.cpp - fix causal::delay::get_local usage * Update timemory submodule * Update kokkos submodule - release 3.7.02 * Revert opensuse.yml and ubuntu-bionic.yml to use node16 workflows * Update docs.yml * ROCm 6.1 Installers (#349) * Add ROCm 6.1 to packages * Bump version to 1.11.3 * Add 6.1 support to the docker build support. Simplified this by adding 6.* to case statements, now that repo links have been standardized. 
* Update timemory submodule (#354) - fix argparse::argument::required template deduction * Build omnitrace-rt library (#355) * Build omnitrace-rt library - Explicitly build dyninstAPI_RT as omnitrace-rt so that the SONAME in the ELF is omnitrace-rt instead of dyninstAPI_RT - Create symbolic link lib/omnitrace/libdyninstAPI_RT.so which points to lib/libomnitrace-rt.so - Simplify build tree location of libomnitrace-rt.so since it is ../lib from the bin directory even in the build tree - Update dyninst submodule with minor tweaks to dyninstAPI_RT/CMakeLists.txt * Update source/lib/omnitrace-rt/cmake/platform.cmake * Use ftpmirror.gnu.org instead of ftp.gnu.org - in timemory and dyninst submodules - minor .clang-tidy tweak * Executables append omnitrace library directory to LD_LIBRARY_PATH (#356) - omnitrace-run, omnitrace-sample, and omnitrace-causal now automatically append the LD_LIBRARY_PATH with the directory containing the omnitrace libraries - this helps ensure that binary rewritten exes can resolve omnitrace-rt library location * Fix a few typos and formatting issues * Additional fixes and minor formatting changes. * More fixes and minor formatting changes. * Complete second proofreading with fixes and minor formatting changes. * Make changes to table of contents and disable linting * Update links in the README doc to reflect the new structure. 
* Align intro on the Omnitrace index page with the first paragraph of the what-is page * Changes and edits based on review comments * Additional changes and edits based on external review * Additional updates and changes from the external review of Omnitrace * Additional changes based on the external review * New round of edits based on the external review * Additional edits based on the external review * Changes to address comments from the internal review * Correct to the RHEL SELinux note in the troubleshooting guide * One additional change to the development guide code example * Move troubleshooting to post-install of install.rst and other minor edits. * Remove troubleshooting page and modify new post-install troubleshooting section on install.rst * Refactor the how Omnitrace works page into seperate topics and redo infrastructure * API ToC changes * Additional API and ToC changes * Back out API and ToC changes and update requirements.txt * Additional API and ToC changes * Add commit for signing purposes * Add ElfUtils and BinUtils Download URL Overrides (#358) * Add CMake CACHE Variable ElfUtils_DOWNLOAD_URL Used to override the default URL to download ElfUtils from. Useful for internal builds Also, include a mirror to fallback to if the override URL fails. * Update timemory submodule Updating to include the BINUTIL_DOWNLOAD_URL override cmake variable. --------- Signed-off-by: David Galiffi <David.Galiffi@amd.com> * Remove Ubuntu 18.04 and SUSE 15.2 * Update checkout action to v4 * Add `docs/**` to `paths-ignore` Document location is being refactored. * Modified submodules dyninst and timemory. (#361) --------- Signed-off-by: David Galiffi <David.Galiffi@amd.com> Co-authored-by: Peter Jun Park <peter.park@amd.com> Co-authored-by: ajanicijamd <Aleksandar.Janicijevic@amd.com> Co-authored-by: David Galiffi <David.Galiffi@amd.com> Co-authored-by: Jonathan R. 
Madsen <jrmadsen@users.noreply.github.com> Co-authored-by: Sam Wu <22262939+samjwu@users.noreply.github.com> [ROCm/rocprofiler-systems commit: 0689797736]
2024-07-29 17:23:36 -04:00
@@ -4,3 +4,4 @@
 docs/* @ROCm/rocm-documentation
 *.md @ROCm/rocm-documentation
 *.rst @ROCm/rocm-documentation
+.readthedocs.yaml @ROCm/rocm-documentation
@@ -9,3 +9,14 @@ updates:
    directory: "/" # Location of package manifests
    schedule:
      interval: "weekly"
+
+  - package-ecosystem: "pip" # See documentation for possible values
+    directory: "/docs/sphinx" # Location of package manifests
+    open-pull-requests-limit: 10
+    schedule:
+      interval: "daily"
+    labels:
+      - "documentation"
+      - "dependencies"
+    reviewers:
+      - "samjwu"
@@ -37,6 +37,10 @@
 # Python cache files
 *.pyc

+# Documentation artifacts
+/_build
+_toc.yml
+
 /build*
 /.vscode
 /.cache
@@ -0,0 +1,18 @@
+# Read the Docs configuration file
+# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
+
+version: 2
+
+build:
+  os: ubuntu-22.04
+  tools:
+    python: "3.10"
+
+python:
+  install:
+  - requirements: docs/sphinx/requirements.txt
+
+sphinx:
+  configuration: docs/conf.py
+
+formats: []
@@ -7,8 +7,6 @@
 [![Installer Packaging (CPack)](https://github.com/ROCm/omnitrace/actions/workflows/cpack.yml/badge.svg)](https://github.com/ROCm/omnitrace/actions/workflows/cpack.yml)
 [![Documentation](https://github.com/ROCm/omnitrace/actions/workflows/docs.yml/badge.svg)](https://github.com/ROCm/omnitrace/actions/workflows/docs.yml)

-> ***[Omnitrace](https://github.com/ROCm/omnitrace) is an AMD open source research project and is not supported as part of the ROCm software stack.***
-
 ## Overview

 AMD Research is seeking to improve observability and performance analysis for software running on AMD heterogeneous systems.
@@ -86,8 +84,8 @@ such as the memory usage, page-faults, and context-switches, and thread-level me

 ## Documentation

-The full documentation for [omnitrace](https://github.com/ROCm/omnitrace) is available at [rocm.github.io/omnitrace](https://rocm.github.io/omnitrace/).
-See the [Getting Started documentation](https://rocm.github.io/omnitrace/getting_started) for general tips and a detailed discussion about sampling vs. binary instrumentation.
+The full documentation for [omnitrace](https://github.com/ROCm/omnitrace) is available at [the ROCm Omnitrace documentation repository](https://rocm.docs.amd.com/projects/omnitrace/en/latest/index.html).
+See the [Getting Started documentation](https://rocm.docs.amd.com/projects/omnitrace/en/conceptual/how-omnitrace-works.html) for general tips and a detailed discussion about sampling vs. binary instrumentation.

 ## Quick Start

@@ -108,7 +106,7 @@ wget https://github.com/ROCm/omnitrace/releases/latest/download/omnitrace-instal
 python3 ./omnitrace-install.py --prefix /opt/omnitrace/rocm-5.4 --rocm 5.4
 ```

-See the [Installation Documentation](https://rocm.github.io/omnitrace/installation) for detailed information.
+See the [Installation Documentation](https://rocm.docs.amd.com/projects/omnitrace/en/install/install.html) for detailed information.

 ### Setup

@@ -297,13 +295,13 @@ for `foo` via the direct call within `spam`. There will be no entries for `bar`
 - Select "Open trace file" from panel on the left
 - Locate the omnitrace perfetto output (extension: `.proto`)

-![omnitrace-perfetto](source/docs/images/omnitrace-perfetto.png)
+![omnitrace-perfetto](docs/data/omnitrace-perfetto.png)

-![omnitrace-rocm](source/docs/images/omnitrace-rocm.png)
+![omnitrace-rocm](docs/data/omnitrace-rocm.png)

-![omnitrace-rocm-flow](source/docs/images/omnitrace-rocm-flow.png)
+![omnitrace-rocm-flow](docs/data/omnitrace-rocm-flow.png)

-![omnitrace-user-api](source/docs/images/omnitrace-user-api.png)
+![omnitrace-user-api](docs/data/omnitrace-user-api.png)

 ## Using Perfetto tracing with System Backend

@@ -0,0 +1,2 @@
+_build/
+_doxygen/
@@ -0,0 +1,146 @@
+.. meta::
+   :description: Omnitrace documentation and reference
+   :keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
+
+**********************
+Data collection modes
+**********************
+
+Omnitrace supports several modes of recording trace and profiling data for your application.
+
+.. note::
+    
+   For an explanation of the terms used in this topic, see 
+   the :doc:`Omnitrace glossary <../reference/omnitrace-glossary>`.
+
+-----------------------------+---------------------------------------------------------+
+| Mode                        | Description                                             |
+=============================+=========================================================+
+| Binary Instrumentation      | Locates functions (and loops, if desired) in the binary |
+|                             | and inserts snippets at the entry and exit              |
+-----------------------------+---------------------------------------------------------+
+| Statistical Sampling        | Periodically pauses application at specified intervals  |
+|                             | and records various metrics for the given call stack    |
+-----------------------------+---------------------------------------------------------+
+| Callback APIs               | Parallelism frameworks such as ROCm, OpenMP, and Kokkos |
+|                             | make callbacks into Omnitrace to provide information    |
+|                             | about the work the API is performing                    |
+-----------------------------+---------------------------------------------------------+
+| Dynamic Symbol Interception | Wrap function symbols defined in a position independent |
+|                             | dynamic library/executable, like ``pthread_mutex_lock`` |
+|                             | in ``libpthread.so`` or ``MPI_Init`` in the MPI library |
+-----------------------------+---------------------------------------------------------+
+| User API                    | User-defined regions and controls for Omnitrace         |
+-----------------------------+---------------------------------------------------------+
+
+The two most generic and important modes are binary instrumentation and statistical sampling. 
+It is important to understand their advantages and disadvantages.
+Binary instrumentation and statistical sampling can be performed with the ``omnitrace-instrument`` 
+executable. For statistical sampling, it's highly recommended to use the
+``omnitrace-sample`` executable instead if binary instrumentation isn't required. 
+Callback APIs and dynamic symbol interception can be utilized with either tool.
+
+Binary instrumentation
+-----------------------------------
+
+Binary instrumentation lets you record deterministic measurements for 
+every single invocation of a given function.
+Binary instrumentation effectively adds instructions to the target application to 
+collect the required information. It therefore has the potential to cause performance 
+changes which might, in some cases, lead to inaccurate results. The effect depends on 
+the information being collected and which features are activated in Omnitrace. 
+For example, collecting only the wall-clock timing data
+has less of an effect than collecting the wall-clock timing, CPU-clock timing, 
+memory usage, cache-misses, and number of instructions that were run. Similarly, 
+collecting a flat profile has less overhead than a hierarchical profile 
+and collecting a trace OR a profile has less overhead than collecting a 
+trace AND a profile.
+
+In Omnitrace, the primary heuristic for controlling the overhead with binary 
+instrumentation is the minimum number of instructions for selecting functions 
+for instrumentation.
+
+Statistical sampling
+-----------------------------------
+
+Statistical call-stack sampling interrupts the application at 
+regular intervals using operating system interrupts.
+Sampling is typically less numerically accurate and specific, but the 
+target program runs at nearly full speed.
+In contrast to the data derived from binary instrumentation, the resulting 
+data is not exact but is instead a statistical approximation.
+However, sampling often provides a more accurate picture of the application 
+execution because it is less intrusive to the target application and has fewer
+side effects on memory caches or instruction decoding pipelines. Furthermore, 
+because sampling does not affect the execution speed as much, it is
+relatively immune to over-evaluating the cost of small, frequently called 
+functions or "tight" loops.
+
+In Omnitrace, the overhead for statistical sampling depends on the 
+sampling rate and whether the samples are taken with respect to the CPU time 
+and/or real time.
+
+Binary instrumentation vs. statistical sampling example
+-------------------------------------------------------
+
+Consider the following code:
+
+.. code-block:: c++
+
+   #include <cstdio>
+   #include <cstdlib>
+
+   long fib(long n)
+   {
+        if(n < 2) return n;
+        return fib(n - 1) + fib(n - 2);
+   }
+
+   // i is the iteration index, n is the Fibonacci input
+   void run(long i, long n)
+   {
+        long result = fib(n);
+        std::printf("[%li] fibonacci(%li) = %li\n", i, n, result);
+   }
+
+   int main(int argc, char** argv)
+   {
+        long nfib = 30;
+        long nitr = 10;
+        if(argc > 1) nfib = std::atol(argv[1]);
+        if(argc > 2) nitr = std::atol(argv[2]);
+
+        for(long i = 0; i < nitr; ++i)
+            run(i, nfib);
+
+        return 0;
+   }
+
+Binary instrumentation of the ``fib`` function will record **every single invocation** 
+of the function. For a very small function
+such as ``fib``, this results in **significant** overhead since this simple function 
+takes about 20 instructions, whereas the entry and
+exit snippets are ~1024 instructions. Therefore, you generally want to avoid 
+instrumenting functions that have significantly fewer
+instructions than the entry and exit instrumentation. (Note that many of the 
+instructions in the entry and exit functions are either logging functions or
+depend on the runtime settings and thus might never run.) Because of the 
+number of potential instructions in the entry and exit snippets,
+the default behavior of ``omnitrace-instrument`` is to only instrument functions 
+that contain at least 1024 instructions.
+
+However, recording every single invocation of the function can be extremely 
+useful for detecting anomalies, such as profiles that show minimum or maximum values much smaller or larger
+than the average or a high standard deviation. In this case, the traces help you 
+identify exactly when and where those instances deviated from the norm.
+Compare the level of detail in the following traces. In the top image, 
+every instance of the ``fib`` function is instrumented, while in the bottom image,
+the ``fib`` call-stack is derived via sampling.
+
+Binary instrumentation of the Fibonacci function
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. image:: ../data/fibonacci-instrumented.png
+   :alt: Visualization of the output of a binary instrumentation of the Fibonacci function
+
+Statistical sampling of the Fibonacci function
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. image:: ../data/fibonacci-sampling.png
+   :alt: Visualization of the output of a statistical sample of the Fibonacci function
@@ -0,0 +1,137 @@
+.. meta::
+   :description: Omnitrace documentation and reference
+   :keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
+
+***************************************
+The Omnitrace feature set and use cases
+***************************************
+
+`Omnitrace <https://github.com/ROCm/omnitrace>`_ is designed to be highly extensible. 
+Internally, it leverages the `Timemory performance analysis toolkit <https://github.com/NERSC/timemory>`_ 
+to manage extensions, resources, data, and other items. It supports the following features, 
+modes, metrics, and APIs.
+
+Data collection modes
+========================================
+
+* Dynamic instrumentation
+
+  * Runtime instrumentation: Instrument executables and shared libraries at runtime
+  * Binary rewriting: Generate a new executable and/or library with instrumentation built-in
+
+* Statistical sampling: Periodic software interrupts per-thread
+* Process-level sampling: A background thread records process-, system- and device-level metrics while the application runs
+* Causal profiling: Quantifies the potential impact of optimizations in parallel code
+  
+.. note::
+
+   Critical trace support was removed in Omnitrace v1.11.0. 
+   It was replaced by the causal profiling feature.
+
+Data analysis
+========================================
+
+* High-level summary profiles with mean, min, max, and standard deviation statistics
+
+  * Low overhead and memory efficient
+  * Ideal for running at scale
+
+* Comprehensive traces for every individual event and measurement
+* Application speed-up predictions resulting from potential optimizations in functions and lines of code based on causal profiling
+
+Parallelism API support
+========================================
+
+* HIP
+* HSA
+* Pthreads
+* MPI
+* Kokkos-Tools (KokkosP)
+* OpenMP-Tools (OMPT)
+
+GPU metrics
+========================================
+
+* GPU hardware counters
+* HIP API tracing
+* HIP kernel tracing
+* HSA API tracing
+* HSA operation tracing
+* System-level sampling (via rocm-smi)
+
+  * Memory usage
+  * Power usage
+  * Temperature
+  * Utilization
+
+CPU metrics
+========================================
+
+* CPU hardware counters sampling and profiles
+* CPU frequency sampling
+* Various timing metrics
+
+  * Wall time
+  * CPU time (process and thread)
+  * CPU utilization (process and thread)
+  * User CPU time
+  * Kernel CPU time
+
+* Various memory metrics
+
+  * High-water mark (sampling and profiles)
+  * Memory page allocation
+  * Virtual memory usage
+
+* Network statistics
+* I/O metrics
+* Many others
+
+Third-party API support
+========================================
+
+* TAU
+* LIKWID
+* Caliper
+* CrayPAT
+* VTune
+* NVTX
+* ROCTX
+
+Omnitrace use cases
+========================================
+
+When analyzing the performance of an application, do NOT 
+assume you know where the performance bottlenecks are
+and why they are happening. Omnitrace is a tool for analyzing the entire 
+application and its performance. It is
+ideal for characterizing where optimization would have the greatest impact 
+on an end-to-end run of the application and for
+viewing what else is happening on the system during a performance bottleneck.
+
+When GPUs are involved, there is a tendency to assume that 
+the quickest path to performance improvement is minimizing
+the runtime of the GPU kernels. This is a highly flawed assumption. 
+If you optimize the runtime of a kernel from 1 millisecond
+to 1 microsecond (1000x speed-up) but the original application never 
+spent time waiting for kernels to complete,
+there would be no statistically significant reduction in the end-to-end 
+runtime of your application. In other words, it does not matter
+how fast or slow the code on the GPU is if the application is not 
+bottlenecked waiting on the GPU.
+
+Use Omnitrace to obtain a high-level view of the entire application. Use it 
+to determine where the performance bottlenecks are and
+obtain clues as to why these bottlenecks are happening. Rather than worrying about kernel
+performance, start your investigation with Omnitrace, which characterizes the
+broad picture.
+
+.. note::
+
+   For insight into the execution of individual kernels on the GPU, 
+   use `Omniperf <https://github.com/rocm/omniperf>`_.
+
+In terms of CPU analysis, Omnitrace does not target any specific vendor. 
+It works just as well on AMD and non-AMD CPUs.
+With regard to the GPU, Omnitrace is currently restricted to HIP and HSA APIs 
+and kernels running on AMD GPUs.
@@ -0,0 +1,56 @@
+# MIT License
+
+# Copyright (c) 2023 - 2024 Advanced Micro Devices, Inc. All rights reserved.
+
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+
+# Configuration file for the Sphinx documentation builder.
+#
+# This file only contains a selection of the most common options. For a full
+# list see the documentation:
+# https://www.sphinx-doc.org/en/master/usage/configuration.html
+
+import re
+
+from rocm_docs import ROCmDocs
+
+with open("../VERSION", encoding="utf-8") as f:
+    match = re.search(r"([0-9.]+)[^0-9.]+", f.read())
+    if not match:
+        raise ValueError("VERSION not found!")
+    version_number = match[1]
+
+external_projects_current_project = "omnitrace"
+
+project = "omnitrace"
+author = "Advanced Micro Devices, Inc."
+copyright = "Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved."
+version = version_number
+release = version_number
+html_title = f"Omnitrace {version} documentation"
+
+external_toc_path = "./sphinx/_toc.yml"
+
+docs_core = ROCmDocs(html_title)
+docs_core.setup()
+docs_core.run_doxygen(doxygen_root="doxygen", doxygen_path="doxygen/xml")
+docs_core.enable_api_reference()
+
+for sphinx_var in ROCmDocs.SPHINX_VARS:
+    globals()[sphinx_var] = getattr(docs_core, sphinx_var)
@@ -0,0 +1,3 @@
+html/
+latex/
+xml/
@@ -0,0 +1,373 @@
+# Doxyfile 1.8.20
+
+#---------------------------------------------------------------------------
+# Project related configuration options
+#---------------------------------------------------------------------------
+DOXYFILE_ENCODING      = UTF-8
+PROJECT_NAME           = omnitrace
+PROJECT_NUMBER         = 1.11.3
+PROJECT_BRIEF          = "High-level and comprehensive application tracing and profiling on both the CPU and GPU"
+PROJECT_LOGO           =
+OUTPUT_DIRECTORY       = .
+CREATE_SUBDIRS         = NO
+ALLOW_UNICODE_NAMES    = YES
+OUTPUT_LANGUAGE        = English
+OUTPUT_TEXT_DIRECTION  = None
+BRIEF_MEMBER_DESC      = YES
+REPEAT_BRIEF           = YES
+ABBREVIATE_BRIEF       =
+ALWAYS_DETAILED_SEC    = YES
+INLINE_INHERITED_MEMB  = YES
+FULL_PATH_NAMES        = YES
+STRIP_FROM_PATH        = /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-omnitrace/checkouts/
+STRIP_FROM_INC_PATH    = /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-omnitrace/checkouts/
+SHORT_NAMES            = NO
+JAVADOC_AUTOBRIEF      = NO
+JAVADOC_BANNER         = NO
+QT_AUTOBRIEF           = NO
+MULTILINE_CPP_IS_BRIEF = YES
+PYTHON_DOCSTRING       = YES
+INHERIT_DOCS           = YES
+SEPARATE_MEMBER_PAGES  = NO
+TAB_SIZE               = 4
+ALIASES                =
+OPTIMIZE_OUTPUT_FOR_C  = NO
+OPTIMIZE_OUTPUT_JAVA   = NO
+OPTIMIZE_FOR_FORTRAN   = NO
+OPTIMIZE_OUTPUT_VHDL   = NO
+OPTIMIZE_OUTPUT_SLICE  = NO
+EXTENSION_MAPPING      = hpp=C++ \
+                         cpp=C++ \
+                         hh=C++ \
+                         cc=C++ \
+                         h=C \
+                         c=C \
+                         py=Python
+MARKDOWN_SUPPORT       = YES
+TOC_INCLUDE_HEADINGS   = 2
+AUTOLINK_SUPPORT       = YES
+BUILTIN_STL_SUPPORT    = YES
+CPP_CLI_SUPPORT        = NO
+SIP_SUPPORT            = NO
+IDL_PROPERTY_SUPPORT   = YES
+DISTRIBUTE_GROUP_DOC   = NO
+GROUP_NESTED_COMPOUNDS = YES
+SUBGROUPING            = YES
+INLINE_GROUPED_CLASSES = NO
+INLINE_SIMPLE_STRUCTS  = YES
+TYPEDEF_HIDES_STRUCT   = NO
+LOOKUP_CACHE_SIZE      = 5
+NUM_PROC_THREADS       = 0
+#---------------------------------------------------------------------------
+# Build related configuration options
+#---------------------------------------------------------------------------
+EXTRACT_ALL            = YES
+EXTRACT_PRIVATE        = NO
+EXTRACT_PRIV_VIRTUAL   = NO
+EXTRACT_PACKAGE        = NO
+EXTRACT_STATIC         = NO
+EXTRACT_LOCAL_CLASSES  = YES
+EXTRACT_LOCAL_METHODS  = NO
+EXTRACT_ANON_NSPACES   = NO
+HIDE_UNDOC_MEMBERS     = NO
+HIDE_UNDOC_CLASSES     = YES
+HIDE_FRIEND_COMPOUNDS  = NO
+HIDE_IN_BODY_DOCS      = NO
+INTERNAL_DOCS          = NO
+CASE_SENSE_NAMES       = NO
+HIDE_SCOPE_NAMES       = NO
+HIDE_COMPOUND_REFERENCE= NO
+SHOW_INCLUDE_FILES     = YES
+SHOW_GROUPED_MEMB_INC  = NO
+FORCE_LOCAL_INCLUDES   = YES
+INLINE_INFO            = YES
+SORT_MEMBER_DOCS       = YES
+SORT_BRIEF_DOCS        = NO
+SORT_MEMBERS_CTORS_1ST = YES
+SORT_GROUP_NAMES       = NO
+SORT_BY_SCOPE_NAME     = NO
+STRICT_PROTO_MATCHING  = NO
+GENERATE_TODOLIST      = NO
+GENERATE_TESTLIST      = NO
+GENERATE_BUGLIST       = NO
+GENERATE_DEPRECATEDLIST= NO
+ENABLED_SECTIONS       =
+MAX_INITIALIZER_LINES  = 30
+SHOW_USED_FILES        = YES
+SHOW_FILES             = YES
+SHOW_NAMESPACES        = YES
+FILE_VERSION_FILTER    =
+LAYOUT_FILE            =
+CITE_BIB_FILES         =
+#---------------------------------------------------------------------------
+# Configuration options related to warning and progress messages
+#---------------------------------------------------------------------------
+QUIET                  = NO
+WARNINGS               = YES
+WARN_IF_UNDOCUMENTED   = YES
+WARN_IF_DOC_ERROR      = YES
+WARN_NO_PARAMDOC       = YES
+WARN_AS_ERROR          = YES
+WARN_FORMAT            = "---> WARNING!   $file:$line: $text"
+WARN_LOGFILE           = doc/warnings.log
+#---------------------------------------------------------------------------
+# Configuration options related to the input files
+#---------------------------------------------------------------------------
+INPUT                  = ../../README.md \
+                         ../../source/lib/omnitrace-user/omnitrace/types.h \
+                         ../../source/lib/omnitrace-user/omnitrace/categories.h \
+                         ../../source/lib/omnitrace-user/omnitrace/user.h \
+                         ../../source/lib/omnitrace-user/omnitrace/causal.h
+INPUT_ENCODING         = UTF-8
+FILE_PATTERNS          = *.h \
+                         *.hh \
+                         *.hpp \
+                         *.c \
+                         *.cc \
+                         *.cxx \
+                         *.cpp \
+                         *.c++ \
+                         *.icc \
+                         *.tcc \
+                         *.py
+RECURSIVE              = YES
+EXCLUDE                =
+EXCLUDE_SYMLINKS       = YES
+EXCLUDE_PATTERNS       = */.git/* \
+                         ../../external/* \
+                         ../../examples/* \
+                         ../../tests/*
+EXCLUDE_SYMBOLS        = "std::*" \
+                         "OMNITRACE_ATTRIBUTE" \
+                         "OMNITRACE_VISIBILITY" \
+                         "OMNITRACE_PUBLIC_API" \
+                         "OMNITRACE_HIDDEN_API" \
+                         "SpaceHandle" \
+                         "KokkosPDevice*"
+EXAMPLE_PATH           = ../../examples
+EXAMPLE_PATTERNS       = *.h \
+                         *.hh \
+                         *.hpp \
+                         *.c \
+                         *.cc \
+                         *.cpp \
+                         *.py \
+                         *.txt
+EXAMPLE_RECURSIVE      = YES
+IMAGE_PATH             =
+INPUT_FILTER           =
+FILTER_PATTERNS        =
+FILTER_SOURCE_FILES    = NO
+FILTER_SOURCE_PATTERNS =
+USE_MDFILE_AS_MAINPAGE = ../../README.md
+#---------------------------------------------------------------------------
+# Configuration options related to source browsing
+#---------------------------------------------------------------------------
+SOURCE_BROWSER         = YES
+INLINE_SOURCES         = YES
+STRIP_CODE_COMMENTS    = NO
+REFERENCED_BY_RELATION = YES
+REFERENCES_RELATION    = YES
+REFERENCES_LINK_SOURCE = YES
+SOURCE_TOOLTIPS        = YES
+USE_HTAGS              = NO
+VERBATIM_HEADERS       = YES
+#---------------------------------------------------------------------------
+# Configuration options related to the alphabetical class index
+#---------------------------------------------------------------------------
+ALPHABETICAL_INDEX     = YES
+COLS_IN_ALPHA_INDEX    = 5
+IGNORE_PREFIX          =
+#---------------------------------------------------------------------------
+# Configuration options related to the HTML output
+#---------------------------------------------------------------------------
+GENERATE_HTML          = YES
+HTML_OUTPUT            = html
+HTML_FILE_EXTENSION    = .html
+HTML_HEADER            = ../_doxygen/header.html
+HTML_FOOTER            = ../_doxygen/footer.html
+HTML_STYLESHEET        = ../_doxygen/stylesheet.css
+HTML_EXTRA_STYLESHEET  = ../_doxygen/extra_stylesheet.css
+HTML_EXTRA_FILES       =
+HTML_COLORSTYLE_HUE    = 220
+HTML_COLORSTYLE_SAT    = 100
+HTML_COLORSTYLE_GAMMA  = 80
+HTML_TIMESTAMP         = YES
+HTML_DYNAMIC_MENUS     = YES
+HTML_DYNAMIC_SECTIONS  = YES
+HTML_INDEX_NUM_ENTRIES = 1000
+GENERATE_DOCSET        = NO
+DOCSET_FEEDNAME        = "Doxygen generated docs"
+DOCSET_BUNDLE_ID       = org.doxygen.omnitrace
+DOCSET_PUBLISHER_ID    = org.doxygen.amdresearch
+DOCSET_PUBLISHER_NAME  = "Audacious Software Group"
+GENERATE_HTMLHELP      = NO
+CHM_FILE               =
+HHC_LOCATION           =
+GENERATE_CHI           = NO
+CHM_INDEX_ENCODING     =
+BINARY_TOC             = NO
+TOC_EXPAND             = YES
+GENERATE_QHP           = NO
+QCH_FILE               =
+QHP_NAMESPACE          =
+QHP_VIRTUAL_FOLDER     = doc
+QHP_CUST_FILTER_NAME   =
+QHP_CUST_FILTER_ATTRS  =
+QHP_SECT_FILTER_ATTRS  =
+QHG_LOCATION           =
+GENERATE_ECLIPSEHELP   = NO
+ECLIPSE_DOC_ID         = org.doxygen.omnitrace
+DISABLE_INDEX          = NO
+GENERATE_TREEVIEW      = NO
+ENUM_VALUES_PER_LINE   = 1
+TREEVIEW_WIDTH         = 300
+EXT_LINKS_IN_WINDOW    = YES
+HTML_FORMULA_FORMAT    = png
+FORMULA_FONTSIZE       = 12
+FORMULA_TRANSPARENT    = YES
+FORMULA_MACROFILE      =
+USE_MATHJAX            = NO
+MATHJAX_FORMAT         = HTML-CSS
+MATHJAX_RELPATH        = http://cdn.mathjax.org/mathjax/latest
+MATHJAX_EXTENSIONS     =
+MATHJAX_CODEFILE       =
+SEARCHENGINE           = NO
+SERVER_BASED_SEARCH    = NO
+EXTERNAL_SEARCH        = NO
+SEARCHENGINE_URL       =
+SEARCHDATA_FILE        = searchdata.xml
+EXTERNAL_SEARCH_ID     =
+EXTRA_SEARCH_MAPPINGS  =
+#---------------------------------------------------------------------------
+# Configuration options related to the LaTeX output
+#---------------------------------------------------------------------------
+GENERATE_LATEX         = NO
+LATEX_OUTPUT           = latex
+LATEX_CMD_NAME         = latex
+MAKEINDEX_CMD_NAME     = makeindex
+LATEX_MAKEINDEX_CMD    = makeindex
+COMPACT_LATEX          = NO
+PAPER_TYPE             = a4wide
+EXTRA_PACKAGES         = float
+LATEX_HEADER           =
+LATEX_FOOTER           =
+LATEX_EXTRA_STYLESHEET =
+LATEX_EXTRA_FILES      =
+PDF_HYPERLINKS         = YES
+USE_PDFLATEX           = YES
+LATEX_BATCHMODE        = YES
+LATEX_HIDE_INDICES     = NO
+LATEX_SOURCE_CODE      = YES
+LATEX_BIB_STYLE        = plain
+LATEX_TIMESTAMP        = NO
+LATEX_EMOJI_DIRECTORY  =
+#---------------------------------------------------------------------------
+# Configuration options related to the RTF output
+#---------------------------------------------------------------------------
+GENERATE_RTF           = NO
+RTF_OUTPUT             = rtf
+COMPACT_RTF            = NO
+RTF_HYPERLINKS         = NO
+RTF_STYLESHEET_FILE    =
+RTF_EXTENSIONS_FILE    =
+RTF_SOURCE_CODE        = NO
+#---------------------------------------------------------------------------
+# Configuration options related to the man page output
+#---------------------------------------------------------------------------
+GENERATE_MAN           = NO
+MAN_OUTPUT             = man
+MAN_EXTENSION          = .3
+MAN_SUBDIR             =
+MAN_LINKS              = YES
+#---------------------------------------------------------------------------
+# Configuration options related to the XML output
+#---------------------------------------------------------------------------
+GENERATE_XML           = YES
+XML_OUTPUT             = xml
+XML_PROGRAMLISTING     = YES
+XML_NS_MEMB_FILE_SCOPE = YES
+#---------------------------------------------------------------------------
+# Configuration options related to the DOCBOOK output
+#---------------------------------------------------------------------------
+GENERATE_DOCBOOK       = NO
+DOCBOOK_OUTPUT         = docbook
+DOCBOOK_PROGRAMLISTING = NO
+#---------------------------------------------------------------------------
+# Configuration options for the AutoGen Definitions output
+#---------------------------------------------------------------------------
+GENERATE_AUTOGEN_DEF   = NO
+#---------------------------------------------------------------------------
+# Configuration options related to the Perl module output
+#---------------------------------------------------------------------------
+GENERATE_PERLMOD       = NO
+PERLMOD_LATEX          = NO
+PERLMOD_PRETTY         = YES
+PERLMOD_MAKEVAR_PREFIX =
+#---------------------------------------------------------------------------
+# Configuration options related to the preprocessor
+#---------------------------------------------------------------------------
+ENABLE_PREPROCESSING   = YES
+MACRO_EXPANSION        = YES
+EXPAND_ONLY_PREDEF     = NO
+SEARCH_INCLUDES        = YES
+INCLUDE_PATH           = ../../source/lib/omnitrace-user
+INCLUDE_FILE_PATTERNS  = *.h \
+                         *.hpp
+PREDEFINED             = OMNITRACE_PUBLIC_API= \
+                         OMNITRACE_HIDDEN_API= \
+                         "OMNITRACE_ATTRIBUTE(...)=" \
+                         "OMNITRACE_VISIBILITY(...)=" \
+                         "__attribute__(x)=" \
+                         "__declspec(x)=" \
+                         "size_t=unsigned long" \
+                         "uintptr_t=unsigned long" \
+                         DOXYGEN_SHOULD_SKIP_THIS
+EXPAND_AS_DEFINED      =
+SKIP_FUNCTION_MACROS   = NO
+#---------------------------------------------------------------------------
+# Configuration options related to external references
+#---------------------------------------------------------------------------
+TAGFILES               =
+GENERATE_TAGFILE       = html/tagfile.xml
+ALLEXTERNALS           = NO
+EXTERNAL_GROUPS        = YES
+EXTERNAL_PAGES         = YES
+#---------------------------------------------------------------------------
+# Configuration options related to the dot tool
+#---------------------------------------------------------------------------
+CLASS_DIAGRAMS         = YES
+DIA_PATH               =
+HIDE_UNDOC_RELATIONS   = NO
+HAVE_DOT               = NO
+DOT_NUM_THREADS        = 0
+DOT_FONTNAME           = Helvetica
+DOT_FONTSIZE           = 12
+DOT_FONTPATH           =
+CLASS_GRAPH            = NO
+COLLABORATION_GRAPH    = YES
+GROUP_GRAPHS           = YES
+UML_LOOK               = YES
+UML_LIMIT_NUM_FIELDS   = 10
+TEMPLATE_RELATIONS     = YES
+INCLUDE_GRAPH          = YES
+INCLUDED_BY_GRAPH      = YES
+CALL_GRAPH             = NO
+CALLER_GRAPH           = NO
+GRAPHICAL_HIERARCHY    = YES
+DIRECTORY_GRAPH        = YES
+DOT_IMAGE_FORMAT       = svg
+INTERACTIVE_SVG        = YES
+DOT_PATH               = /usr/bin/dot
+DOTFILE_DIRS           =
+MSCFILE_DIRS           =
+DIAFILE_DIRS           =
+PLANTUML_JAR_PATH      =
+PLANTUML_CFG_FILE      =
+PLANTUML_INCLUDE_PATH  =
+DOT_GRAPH_MAX_NODES    = 50
+MAX_DOT_GRAPH_DEPTH    = 0
+DOT_TRANSPARENT        = NO
+DOT_MULTI_TARGETS      = YES
+GENERATE_LEGEND        = YES
+DOT_CLEANUP            = YES
@@ -0,0 +1,71 @@
+.. meta::
+   :description: Omnitrace documentation and reference
+   :keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
+
+****************************************************
+Configuring and validating the environment
+****************************************************
+
+After installing `Omnitrace <https://github.com/ROCm/omnitrace>`_, additional steps are required to set up
+and validate the environment.
+
+.. note::
+
+   The following instructions use the installation path ``/opt/omnitrace``. If
+   Omnitrace is installed elsewhere, substitute the actual installation path.
+
+Configuring the environment
+========================================
+
+After Omnitrace is installed, source the ``setup-env.sh`` script to prefix the 
+``PATH``, ``LD_LIBRARY_PATH``, and other environment variables:
+
+.. code-block:: shell
+
+   source /opt/omnitrace/share/omnitrace/setup-env.sh
+
+Alternatively, if environment modules are supported, add the ``<prefix>/share/modulefiles`` directory
+to ``MODULEPATH``:
+
+.. code-block:: shell
+
+   module use /opt/omnitrace/share/modulefiles
+
+.. note::
+    
+   As an alternative, the above line can be added to the ``${HOME}/.modulerc`` file.
+
+After the Omnitrace modulefiles directory has been added to ``MODULEPATH``, Omnitrace can be loaded
+using ``module load omnitrace/<VERSION>`` and unloaded using ``module unload omnitrace/<VERSION>``:
+
+.. code-block:: shell
+
+   module load omnitrace/1.0.0
+   module unload omnitrace/1.0.0
+
+.. note::
+
+   You might also need to add the path to the ROCm libraries to ``LD_LIBRARY_PATH``,
+   for example, ``export LD_LIBRARY_PATH=/opt/rocm/lib:${LD_LIBRARY_PATH}``.
+
+Validating the environment configuration
+========================================
+
+If the following commands all run successfully with the expected output, 
+then you are ready to use Omnitrace:
+
+.. code-block:: shell
+
+   which omnitrace
+   which omnitrace-avail
+   which omnitrace-sample
+   omnitrace-instrument --help
+   omnitrace-avail --all
+   omnitrace-sample --help
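+
+The lookups above can also be scripted. The following is a minimal sketch, assuming a POSIX-compatible
+shell; ``check_tools`` is a hypothetical helper name used only for illustration and is not part of Omnitrace.
+It reports whether each executable can be found on the ``PATH``:
+
+.. code-block:: shell
+
+   # check_tools is a hypothetical helper, not an Omnitrace command
+   check_tools() {
+       for tool in "$@"; do
+           if command -v "${tool}" > /dev/null 2>&1; then
+               echo "found: ${tool}"
+           else
+               echo "missing: ${tool}"
+           fi
+       done
+   }
+
+   check_tools omnitrace omnitrace-avail omnitrace-sample omnitrace-instrument
+
+Any ``missing`` entry suggests that the environment configuration has not been applied to the current shell.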
+
+If Omnitrace was built with Python support, validate these additional commands:
+
+.. code-block:: shell
+
+   which omnitrace-python
+   omnitrace-python --help
@@ -0,0 +1,60 @@
+.. meta::
+   :description: Omnitrace documentation and reference
+   :keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
+
+**********************************
+General tips for using Omnitrace
+**********************************
+
+Follow these general guidelines when using Omnitrace. For an explanation of the terms used in this topic, see 
+the :doc:`Omnitrace glossary <../reference/omnitrace-glossary>`.
+
+* Use ``omnitrace-avail`` to look up configuration settings, hardware counters, and data collection components
+
+  * Use the ``-d`` flag for descriptions
+
+* Generate a default configuration with ``omnitrace-avail -G ${HOME}/.omnitrace.cfg`` and adjust it 
+  to the desired default behavior
+* **Decide whether binary instrumentation, statistical sampling, or both** provide the desired performance data (for non-Python applications)
+* Compile code with optimization enabled (``-O2`` or higher), disable asserts (i.e. ``-DNDEBUG``), and include debug info (for instance, ``-g1`` at a minimum)
+
+  * Compiling with debug info does not slow down the code; it only increases compile time and the size of the binary
+  * In CMake, this is generally done with either ``CMAKE_BUILD_TYPE=RelWithDebInfo``, or ``CMAKE_BUILD_TYPE=Release`` combined with ``CMAKE_<LANG>_FLAGS=-g1``
+
+* **Use binary instrumentation for characterizing the performance of every invocation of specific functions**
+* **Use statistical sampling to characterize the performance of the entire application while minimizing overhead**
+* Enable statistical sampling after binary instrumentation to help "fill in the gaps" between instrumented regions
+* Use the user API to create custom regions and enable/disable Omnitrace for specific processes, threads, and regions
+* Dynamic symbol interception, callback APIs, and the user API are always available with binary instrumentation and sampling
+
+  * Dynamic symbol interception and callback APIs are (generally) controlled through ``OMNITRACE_USE_<API>`` 
+    options, for example, ``OMNITRACE_USE_KOKKOSP`` and ``OMNITRACE_USE_OMPT`` enable Kokkos-Tools and OpenMP-Tools 
+    callbacks, respectively
+
+* When generically seeking regions for performance improvement:
+
+  * **Start off by collecting a flat profile**
+  * Look for functions with high call counts, large cumulative runtimes/values, or large standard deviations
+  
+    * When call counts are high, improving the performance of the function or "inlining" it can result in quick and easy performance improvements
+    * When the standard deviation is high, collect a hierarchical profile and see if the high variation can be attributed to the calling context.
+      In this scenario, consider creating a specialized version of the function for the longer-running contexts
+
+  * **Collect a hierarchical profile** and verify the functions that are part of the "critical path" of your 
+    application, as indicated in the flat profile
+
+    * For example, functions with high call counts but which are part of a "setup" or "post-processing" 
+      phase that does not consume much time relative to the overall time are generally a lower priority for optimization
+
+* **Use the information from the profiles when analyzing detailed traces**
+* When using binary instrumentation in "trace" mode, **binary rewrites are preferable to runtime instrumentation**.
+
+  * Binary rewrites only instrument the functions defined in the target binary, whereas runtime instrumentation might instrument functions defined in the shared libraries which are linked into the target binary
+
+* When using binary instrumentation with MPI, avoid runtime instrumentation
+
+  * Runtime instrumentation requires a fork and a ``ptrace``, which is generally incompatible with how MPI applications spawn processes
+  * Perform a binary rewrite of the executable (and optionally, libraries used by the executable) using MPI and run 
+    the generated instrumented executable using ``omnitrace-run`` instead of the original. 
+    For example, instead of ``mpirun -n 2 ./myexe``, use ``mpirun -n 2 omnitrace-run -- ./myexe.inst``, where 
+    ``myexe.inst`` is the instrumented ``myexe`` executable that was generated.
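+
+As a sketch of the MPI workflow described above, assuming a CMake project that builds an executable named
+``myexe`` (the ``build`` directory and the output paths are hypothetical):
+
+.. code-block:: shell
+
+   # Configure and build with optimization plus debug info (hypothetical layout)
+   cmake -S . -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo
+   cmake --build build
+
+   # Binary rewrite: write an instrumented copy instead of instrumenting at runtime
+   omnitrace-instrument -o ./myexe.inst -- ./build/myexe
+
+   # Launch the instrumented executable under MPI through omnitrace-run
+   mpirun -n 2 omnitrace-run -- ./myexe.inst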
@@ -0,0 +1,942 @@
+.. meta::
+   :description: Omnitrace documentation and reference
+   :keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
+
+****************************************************
+Instrumenting and rewriting a binary application
+****************************************************
+
+There are three ways to perform instrumentation with the ``omnitrace-instrument`` executable:
+
+* Runtime instrumentation
+* Attaching to an already running process
+* Binary rewrite
+
+Here is a comparison of the three modes:
+
+* Runtime instrumentation of the application using the ``omnitrace-instrument`` executable 
+  (analogous to ``gdb --args <program> <args>``)
+
+  * This mode is the default if neither the ``-p`` nor the ``-o`` command-line option is used
+  * Runtime instrumentation supports instrumenting not only the target executable but also 
+    the shared libraries loaded by the target executable. Consequently, this mode consumes more memory,
+    takes longer to perform the instrumentation, and tends to add more significant overhead to the
+    runtime of the application.
+  * This mode is recommended if you want to analyze not only the performance of your executable and/or
+    libraries but also the performance of the library dependencies
+
+* Attaching to a process that is currently running (analogous to ``gdb -p <PID>``)
+ 
+  * This mode is activated using ``-p <PID>``
+  * The same caveats as runtime instrumentation apply with respect to memory and overhead
+
+  .. note::
+
+     Attaching to a running process is an alpha feature and detaching from the target process
+     without ending the target process is not currently supported.
+
+* Binary rewrite to generate a new executable or library with the instrumentation built-in
+
+  * This mode is activated through the ``-o <output-file>`` option
+  * Binary rewriting is limited to the text section of the target executable or library. It does not instrument
+    the dynamically-linked libraries. Consequently, this mode performs the 
+    instrumentation significantly faster
+    and has a much lower overhead when running the instrumented executable and libraries.
+  * Binary rewriting is the recommended mode when the target executable uses 
+    process-level parallelism (for example, MPI)
+  * If the target executable has a minimal ``main`` routine and the bulk of your 
+    application is in one specific dynamic library,
+    see :ref:`binary-rewriting-library-label` for help
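+
+The three modes correspond to the following invocation patterns. This is a sketch: ``./myexe`` and the
+PID value are placeholders, and passing the executable name after ``--`` when attaching is an assumption
+rather than something stated above.
+
+.. code-block:: shell
+
+   # Runtime instrumentation (default when neither -p nor -o is given)
+   omnitrace-instrument -- ./myexe
+
+   # Attach to a running process (alpha feature); 12345 is a placeholder PID
+   omnitrace-instrument -p 12345 -- ./myexe
+
+   # Binary rewrite: generate an instrumented executable
+   omnitrace-instrument -o ./myexe.inst -- ./myexe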
+
+The omnitrace-instrument executable
+========================================
+
+Instrumentation is performed with the ``omnitrace-instrument`` executable. For more details, use the ``-h`` or ``--help`` option to
+view the help menu.
+
+.. code-block:: shell
+
+   $ omnitrace-instrument --help
+   [omnitrace-instrument] Usage: omnitrace-instrument [ --help (count: 0, dtype: bool)
+                                                      --version (count: 0, dtype: bool)
+                                                      --verbose (max: 1, dtype: bool)
+                                                      --error (max: 1, dtype: boolean)
+                                                      --debug (max: 1, dtype: bool)
+                                                      --log (count: 1)
+                                                      --log-file (count: 1)
+                                                      --simulate (max: 1, dtype: boolean)
+                                                      --print-format (min: 1, dtype: string)
+                                                      --print-dir (count: 1, dtype: string)
+                                                      --print-available (count: 1)
+                                                      --print-instrumented (count: 1)
+                                                      --print-coverage (count: 1)
+                                                      --print-excluded (count: 1)
+                                                      --print-overlapping (count: 1)
+                                                      --print-instructions (max: 1, dtype: bool)
+                                                      --output (min: 0, dtype: string)
+                                                      --pid (count: 1, dtype: int)
+                                                      --mode (count: 1)
+                                                      --force (max: 1, dtype: bool)
+                                                      --command (count: 1)
+                                                      --prefer (count: 1)
+                                                      --library (count: unlimited)
+                                                      --main-function (count: 1)
+                                                      --load (count: unlimited, dtype: string)
+                                                      --load-instr (count: unlimited, dtype: filepath)
+                                                      --init-functions (count: unlimited, dtype: string)
+                                                      --fini-functions (count: unlimited, dtype: string)
+                                                      --all-functions (max: 1, dtype: boolean)
+                                                      --function-include (count: unlimited)
+                                                      --function-exclude (count: unlimited)
+                                                      --function-restrict (count: unlimited)
+                                                      --caller-include (count: unlimited)
+                                                      --module-include (count: unlimited)
+                                                      --module-exclude (count: unlimited)
+                                                      --module-restrict (count: unlimited)
+                                                      --internal-function-include (count: unlimited)
+                                                      --internal-module-include (count: unlimited)
+                                                      --instruction-exclude (count: unlimited)
+                                                      --internal-library-deps (min: 0, dtype: boolean)
+                                                      --internal-library-append (count: unlimited)
+                                                      --internal-library-remove (count: unlimited)
+                                                      --linkage (min: 1)
+                                                      --visibility (min: 1)
+                                                      --label (count: unlimited, dtype: string)
+                                                      --config (min: 1, dtype: string)
+                                                      --default-components (count: unlimited, dtype: string)
+                                                      --env (count: unlimited)
+                                                      --mpi (max: 1, dtype: bool)
+                                                      --instrument-loops (max: 1, dtype: boolean)
+                                                      --min-instructions (count: 1, dtype: int)
+                                                      --min-address-range (count: 1, dtype: int)
+                                                      --min-instructions-loop (count: 1, dtype: int)
+                                                      --min-address-range-loop (count: 1, dtype: int)
+                                                      --coverage (max: 1, dtype: bool)
+                                                      --dynamic-callsites (max: 1, dtype: boolean)
+                                                      --traps (max: 1, dtype: boolean)
+                                                      --loop-traps (max: 1, dtype: boolean)
+                                                      --allow-overlapping (max: 1, dtype: bool)
+                                                      --parse-all-modules (max: 1, dtype: bool)
+                                                      --batch-size (count: 1, dtype: int)
+                                                      --dyninst-rt (min: 1, dtype: filepath)
+                                                      --dyninst-options (count: unlimited)
+                                                      ] -- <CMD> <ARGS>
+
+   Options:
+      -h, -?, --help                 Shows this page
+      --version                      Prints the version and exit
+
+      [DEBUG OPTIONS]
+
+      -v, --verbose                  Verbose output
+      -e, --error                    All warnings produce runtime errors
+      --debug                        Debug output
+      --log                          Number of log entries to display after an error. Any value < 0 will emit the entire log
+      --log-file                     Write the log out the specified file during the run
+      --simulate                     Exit after outputting diagnostic {available,instrumented,excluded,overlapping} module
+                                    function lists, e.g. available.txt
+      --print-format [ json | txt | xml ]
+                                    Output format for diagnostic {available,instrumented,excluded,overlapping} module
+                                    function lists, e.g. {print-dir}/available.txt
+      --print-dir                    Output directory for diagnostic {available,instrumented,excluded,overlapping} module
+                                    function lists, e.g. {print-dir}/available.txt
+      --print-available [ functions | functions+ | modules | pair | pair+ ]
+                                    Print the available entities for instrumentation (functions, modules, or module-function
+                                    pair) to stdout after applying regular expressions
+      --print-instrumented [ functions | functions+ | modules | pair | pair+ ]
+                                    Print the instrumented entities (functions, modules, or module-function pair) to stdout
+                                    after applying regular expressions
+      --print-coverage [ functions | functions+ | modules | pair | pair+ ]
+                                    Print the instrumented coverage entities (functions, modules, or module-function pair) to
+                                    stdout after applying regular expressions
+      --print-excluded [ functions | functions+ | modules | pair | pair+ ]
+                                    Print the entities for instrumentation (functions, modules, or module-function pair)
+                                    which are excluded from the instrumentation to stdout after applying regular expressions
+      --print-overlapping [ functions | functions+ | modules | pair | pair+ ]
+                                    Print the entities for instrumentation (functions, modules, or module-function pair)
+                                    which overlap other function calls or have multiple entry points to stdout after applying
+                                    regular expressions
+      --print-instructions           Print the instructions for each basic-block in the JSON/XML outputs
+
+      [MODE OPTIONS]
+
+      -o, --output                   Enable generation of a new executable (binary-rewrite). If a filename is not provided,
+                                    omnitrace will use the basename and output to the cwd, unless the target binary is in the
+                                    cwd. In the latter case, omnitrace will either use ${PWD}/<basename>.inst (non-libraries)
+                                    or ${PWD}/instrumented/<basename> (libraries)
+      -p, --pid                      Connect to running process
+      -M, --mode [ coverage | sampling | trace ]
+                                    Instrumentation mode. 'trace' mode instruments the selected functions, 'sampling' mode
+                                    only instruments the main function to start and stop the sampler.
+      -f, --force                    Force the command-line argument configuration, i.e. don't get cute. Useful for forcing
+                                    runtime instrumentation of an executable that [A] Dyninst thinks is a library after
+                                    reading ELF and [B] whose name makes it look like a library (e.g. starts with 'lib'
+                                    and/or ends in '.so', '.so.*', or '.a')
+      -c, --command                  Input executable and arguments (if '-- <CMD>' not provided)
+
+      [LIBRARY OPTIONS]
+
+      --prefer [ shared | static ]   Prefer this library types when available
+      -L, --library                  Libraries with instrumentation routines (default: "libomnitrace-dl")
+      -m, --main-function            The primary function to instrument around, e.g. 'main'
+      --load                         Supplemental instrumentation library names w/o extension (e.g. 'libinstr' for
+                                    'libinstr.so' or 'libinstr.a')
+      --load-instr                   Load {available,instrumented,excluded,overlapping}-instr JSON or XML file(s) and override
+                                    what is read from the binary
+      --init-functions               Initialization function(s) for supplemental instrumentation libraries (see '--load'
+                                    option)
+      --fini-functions               Finalization function(s) for supplemental instrumentation libraries (see '--load' option)
+      --all-functions                When finding functions, include the functions which are not instrumentable. This is
+                                    purely diagnostic for the available/excluded functions output
+
+      [SYMBOL SELECTION OPTIONS]
+
+      -I, --function-include         Regex(es) for including functions (despite heuristics)
+      -E, --function-exclude         Regex(es) for excluding functions (always applied)
+      -R, --function-restrict        Regex(es) for restricting functions only to those that match the provided
+                                    regular-expressions
+      --caller-include               Regex(es) for including functions that call the listed functions (despite heuristics)
+      -MI, --module-include          Regex(es) for selecting modules/files/libraries (despite heuristics)
+      -ME, --module-exclude          Regex(es) for excluding modules/files/libraries (always applied)
+      -MR, --module-restrict         Regex(es) for restricting modules/files/libraries only to those that match the provided
+                                    regular-expressions
+      --internal-function-include    Regex(es) for including functions which are (likely) utilized by omnitrace itself. Use
+                                    this option with care.
+      --internal-module-include      Regex(es) for including modules/libraries which are (likely) utilized by omnitrace
+                                    itself. Use this option with care.
+      --instruction-exclude          Regex(es) for excluding functions containing certain instructions
+      --internal-library-deps        Treat the libraries linked to the internal libraries as internal libraries. This increase
+                                    the internal library processing time and consume more memory (so use with care) but may
+                                    be useful when the application uses Boost libraries and Dyninst is dynamically linked
+                                    against the same boost libraries
+      --internal-library-append      Append to the list of libraries which omnitrace treats as being used internally, e.g.
+                                    OmniTrace will find all the symbols in this library and prevent them from being
+                                    instrumented.
+      --internal-library-remove [ ld-linux-x86-64.so.2
+                                 libBrokenLocale.so.1
+                                 libanl.so.1
+                                 libbfd.so
+                                 libbz2.so
+                                 libc.so.6
+                                 libcaliper.so
+                                 libcommon.so
+                                 libcrypt.so.1
+                                 libdl.so.2
+                                 libdw.so
+                                 libdwarf.so
+                                 libdyninstAPI_RT.so
+                                 libelf.so
+                                 libgcc_s.so.1
+                                 libgotcha.so
+                                 liblikwid.so
+                                 liblzma.so
+                                 libnsl.so.1
+                                 libnss_compat.so.2
+                                 libnss_db.so.2
+                                 libnss_dns.so.2
+                                 libnss_files.so.2
+                                 libnss_hesiod.so.2
+                                 libnss_ldap.so.2
+                                 libnss_nis.so.2
+                                 libnss_nisplus.so.2
+                                 libnss_test1.so.2
+                                 libnss_test2.so.2
+                                 libpapi.so
+                                 libpfm.so
+                                 libprofiler.so
+                                 libpthread.so.0
+                                 libresolv.so.2
+                                 librocm_smi64.so
+                                 librocmtools.so
+                                 librocprofiler64.so
+                                 libroctracer64.so
+                                 libroctx64.so
+                                 librt.so.1
+                                 libstdc++.so.6
+                                 libtbb.so
+                                 libtbbmalloc.so
+                                 libtbbmalloc_proxy.so
+                                 libtcmalloc.so
+                                 libtcmalloc_and_profiler.so
+                                 libtcmalloc_debug.so
+                                 libtcmalloc_minimal.so
+                                 libtcmalloc_minimal_debug.so
+                                 libthread_db.so.1
+                                 libunwind-coredump.so
+                                 libunwind-generic.so
+                                 libunwind-ptrace.so
+                                 libunwind-setjmp.so
+                                 libunwind-x86_64.so
+                                 libunwind.so
+                                 libutil.so.1
+                                 libz.so
+                                 libzstd.so ]
+                                    Remove the specified libraries from being treated as being used internally, e.g.
+                                    OmniTrace will permit all the symbols in these libraries to be eligible for
+                                    instrumentation.
+      --linkage [ global | local | unique | unknown | weak ]
+                                    Only instrument functions with specified linkage (default: global, local, unique)
+      --visibility [ default | hidden | internal | protected | unknown ]
+                                    Only instrument functions with specified visibility (default: default, internal, hidden,
+                                    protected)
+
+      [RUNTIME OPTIONS]
+
+      --label [ args | file | line | return ]
+                                    Labeling info for functions. By default, just the function name is recorded. Use these
+                                    options to gain more information about the function signature or location of the
+                                    functions
+      -C, --config                   Read in a configuration file and encode these values as the defaults in the executable
+      -d, --default-components       Default components to instrument (only useful when timemory is enabled in omnitrace
+                                    library)
+      --env                          Environment variables to add to the runtime in form VARIABLE=VALUE. E.g. use '--env
+                                    OMNITRACE_PROFILE=ON' to default to using timemory instead of perfetto
+      --mpi                          Enable MPI support (requires omnitrace built w/ full or partial MPI support). NOTE: this
+                                    will automatically be activated if MPI_Init, MPI_Init_thread, MPI_Finalize,
+                                    MPI_Comm_rank, or MPI_Comm_size are found in the symbol table of target
+
+      [GRANULARITY OPTIONS]
+
+      -l, --instrument-loops         Instrument at the loop level
+      -i, --min-instructions         If the number of instructions in a function is less than this value, exclude it from
+                                    instrumentation
+      -r, --min-address-range        If the address range of a function is less than this value, exclude it from
+                                    instrumentation
+      --min-instructions-loop        If the number of instructions in a function containing a loop is less than this value,
+                                    exclude it from instrumentation
+      --min-address-range-loop       If the address range of a function containing a loop is less than this value, exclude it
+                                    from instrumentation
+      --coverage [ basic_block | function | none ]
+                                    Enable recording the code coverage. If instrumenting in coverage mode ('-M coverage'),
+                                    this simply specifies the granularity. If instrumenting in trace or sampling mode, this
+                                    enables recording code-coverage in addition to the instrumentation of that mode (if any).
+      --dynamic-callsites            Force instrumentation if a function has dynamic callsites (e.g. function pointers)
+      --traps                        Instrument points which require using a trap. On the x86 architecture, because
+                                    instructions are of variable size, the instruction at a point may be too small for
+                                    Dyninst to replace it with the normal code sequence used to call instrumentation. Also,
+                                    when instrumentation is placed at points other than subroutine entry, exit, or call
+                                    points, traps may be used to ensure the instrumentation fits. In this case, Dyninst
+                                    replaces the instruction with a single-byte instruction that generates a trap.
+      --loop-traps                   Instrument points within a loop which require using a trap (only relevant when
+                                    --instrument-loops is enabled).
+      --allow-overlapping            Allow dyninst to instrument either multiple functions which overlap (share part of same
+                                    function body) or single functions with multiple entry points. For more info, see Section
+                                    2 of the DyninstAPI documentation.
+      --parse-all-modules            By default, omnitrace simply requests Dyninst to provide all the procedures in the
+                                    application image. If this option is enabled, omnitrace will iterate over all the modules
+                                    and extract the functions. Theoretically, it should be the same but the data is slightly
+                                    different, possibly due to weak binding scopes. In general, enabling option will probably
+                                    have no visible effect
+
+      [DYNINST OPTIONS]
+
+      -b, --batch-size               Dyninst supports batch insertion of multiple points during runtime instrumentation. If
+                                    one large batch insertion fails, this value will be used to create smaller batches.
+                                    Larger batches generally decrease the instrumentation time
+      --dyninst-rt                   Path(s) to the dyninstAPI_RT library
+      --dyninst-options [ BaseTrampDeletion
+                           DebugParsing
+                           DelayedParsing
+                           InstrStackFrames
+                           MergeTramp
+                           SaveFPR
+                           TrampRecursive
+                           TypeChecking ]
+                                    Advanced dyninst options: BPatch::set<OPTION>(bool), e.g. bpatch->setTrampRecursive(true)
+
+``omnitrace-instrument`` uses a syntax similar to LLVM's to separate its own command-line arguments
+from the application's arguments: a standalone double hyphen (``--``) acts as the separator.
+All arguments preceding the double hyphen are interpreted as belonging to Omnitrace, and all arguments
+following it are interpreted as the application and its arguments.
+In binary rewrite mode, all application arguments after the first argument are ignored.
+For example, ``./omnitrace-instrument -o ls.inst -- ls -l`` interprets ``ls`` as
+the target to instrument, ignores the ``-l`` argument,
+and generates an ``ls.inst`` executable that you can subsequently run using the
+``omnitrace-run -- ls.inst -l`` command.
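+
+The separator rule can be sketched in a few lines of POSIX shell. This is only an illustration
+of the convention described above, not Omnitrace's actual argument parser. The ``split_at_separator``
+helper and its ``tool_args`` and ``app_args`` variables are hypothetical names.

```shell
# Hypothetical helper: split a command line at the first standalone "--".
# Arguments are joined with single spaces, which is enough for illustration
# but does not preserve quoting of arguments that contain whitespace.
split_at_separator() {
    tool_args=""
    app_args=""
    seen=0
    for arg in "$@"; do
        if [ "$seen" -eq 0 ] && [ "$arg" = "--" ]; then
            seen=1                                      # first separator found
        elif [ "$seen" -eq 0 ]; then
            tool_args="${tool_args:+$tool_args }$arg"   # before "--"
        else
            app_args="${app_args:+$app_args }$arg"      # after "--"
        fi
    done
}

split_at_separator -o ls.inst -- ls -l
printf 'tool arguments: %s\n' "$tool_args"
printf 'application:    %s\n' "$app_args"
```

+
+Only the first standalone ``--`` is treated as the separator in this sketch, so any later ``--``
+stays with the application's arguments.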
+
+Runtime instrumentation example
+========================================
+
+The following example shows how to enable runtime instrumentation.
+
+.. code-block:: shell
+
+   omnitrace-instrument <omnitrace-options> -- <exe> [<exe-options>...]
+
+Attaching to a running process
+========================================
+
+Use the following command to attach to an active process.
+
+.. code-block:: shell
+
+   omnitrace-instrument <omnitrace-options> -p <PID> -- <exe-name>
+
+Binary rewrite
+========================================
+
+This example demonstrates how to rewrite a binary.
+
+.. code-block:: shell
+
+   omnitrace-instrument <omnitrace-options> -o <name-of-new-exe-or-library> -- <exe-or-library>
+
+.. _binary-rewriting-library-label:
+
+Binary rewrite of a library
+-----------------------------------
+
+Many applications bundle the bulk of their functionality into one or more 
+dynamic libraries and have a relatively simple ``main``
+which links to these libraries and serves as the "driver" for 
+setting up the workflow. If you perform a binary rewrite of an
+executable like this and find there is insufficient information, you 
+can either switch to runtime instrumentation or perform a
+binary rewrite on the relevant libraries.
+
+Support for stand-alone binary rewriting of a dynamic library without a binary rewrite of 
+the executable is a beta feature.
+In general, it is supported as long as the library contains the ``_init`` and 
+``_fini`` symbols, but these symbols are not
+as standardized as ``main`` is for an executable.
+
+Here is the recommended workflow for the binary rewrite of a library:
+
+#. Determine the names of the dynamically linked libraries of interest using ``ldd``
+#. Generate a binary rewrite of the executable
+#. Generate a binary rewrite of each desired library, giving it the same base name as the 
+   original library (for example, ``libfoo.so.2`` instead of ``libfoo.so``), and output the instrumented 
+   library into a different folder than the original library.
+
+#. Prefix the ``LD_LIBRARY_PATH`` environment variable with the output folder from the previous step
+#. Use ``ldd`` to verify that the instrumented executable can resolve the location of the instrumented library
+
+Binary rewrite of a library example
+-----------------------------------
+
+The ``foo`` executable is dynamically linked to ``libfoo.so.2``:
+
+.. code-block:: shell
+
+   $ pwd
+   /home/user
+   $ which foo
+   /usr/local/bin/foo
+   $ ldd /usr/local/bin/foo
+         ...
+         libfoo.so.2 => /usr/local/lib/libfoo.so.2 (...)
+         ...
+
+Generate binary rewrites of ``foo`` and ``libfoo.so.2``:
+
+.. code-block:: shell
+
+   omnitrace-instrument -o ./foo.inst -- foo
+   omnitrace-instrument -o ./libfoo.so.2 -- /usr/local/lib/libfoo.so.2
+
+At this point, the instrumented ``foo.inst`` executable still dynamically loads the 
+original ``libfoo.so.2`` in ``/usr/local/lib``:
+
+.. code-block:: shell
+
+   $ ldd ./foo.inst
+         ...
+         libfoo.so.2 => /usr/local/lib/libfoo.so.2 (...)
+         ...
+
+Prefix the ``LD_LIBRARY_PATH`` environment variable with the folder containing 
+the instrumented ``libfoo.so.2``:
+
+.. code-block:: shell
+
+   export LD_LIBRARY_PATH=/home/user:${LD_LIBRARY_PATH}
+
+``foo.inst`` now loads the instrumented library when it runs:
+
+.. code-block:: shell
+
+   $ ldd ./foo.inst
+         ...
+         libfoo.so.2 => /home/user/libfoo.so.2 (...)
+         ...
+
+Selective instrumentation
+========================================
+
+By default, ``omnitrace-instrument`` does not instrument every symbol in the binary. 
+The default rules are:
+
+* Skip instrumenting dynamic call-sites (such as function pointers)
+
+  * The ``--dynamic-callsites`` option forces instrumentation for all dynamic call-sites
+
+* The cost of a function can be loosely approximated by the number of 
+  instructions. By default, ``omnitrace-instrument`` only instruments functions 
+  with at least 1024 instructions
+
+  * The ``--min-instructions`` option modifies this heuristic for all functions which do not contain loops
+  * The ``--min-instructions-loop`` option modifies this heuristic for functions which contain loops
+
+* The cost of a function can also be loosely approximated by the size of the function 
+  in the binary, so this heuristic can be used in lieu of or in addition to the 
+  minimum number of instructions
+
+  * The ``--min-address-range`` option modifies this heuristic for all functions which do not contain loops
+  * The ``--min-address-range-loop`` option modifies this heuristic for functions which contain loops
+
+* Skip instrumentation points which require using a trap
+ 
+  * See the description for the ``--traps`` and ``--loop-traps`` options for more information
+
+* Skip instrumenting loops within the body of a function
+
+  * The ``--instrument-loops`` option enables loop-level instrumentation
+
+* Skip instrumenting functions with overlapping function bodies and single 
+  functions with multiple entry points
+
+  * These cases arise from various compiler optimizations. Enable instrumentation of these functions 
+    by using the ``--allow-overlapping`` option
+
+.. note::
+
+   The separate loop options ``--min-instructions-loop`` and ``--min-address-range-loop`` 
+   are provided because functions with loops can be compact in the binary while also being costly
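+
+As a rough model of the minimum-instruction rule above, the following sketch filters a made-up
+list of functions by instruction count. It is not Omnitrace code. The ``filter_by_min_instructions``
+helper, the function names, and the counts are all hypothetical, and only the default threshold
+of 1024 comes from the text above.

```shell
# Hypothetical model of the minimum-instruction heuristic; not Omnitrace code.
# Reads "<function-name> <instruction-count>" lines on stdin and prints the
# names whose count is at least the threshold (default: 1024).
filter_by_min_instructions() {
    awk -v min="${1:-1024}" '$2 >= min { print $1 }'
}

# Made-up function names and instruction counts, default threshold
printf '%s\n' 'small_helper 40' 'compute_kernel 4096' 'medium_func 1024' |
    filter_by_min_instructions

# A lower threshold, analogous to passing a smaller --min-instructions value,
# admits more functions
printf '%s\n' 'small_helper 40' 'compute_kernel 4096' 'medium_func 1024' |
    filter_by_min_instructions 32
```

+
+With the default threshold, only ``compute_kernel`` and ``medium_func`` pass. With a threshold of 32,
+all three functions pass.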
+
+Viewing the available, instrumented, excluded, and overlapping functions
+-------------------------------------------------------------------------
+
+Whenever ``omnitrace-instrument`` runs with a verbosity of zero or higher, 
+it generates files that list which functions 
+were available for instrumentation (along with the module each is defined in), which were instrumented, 
+which were excluded, and which contain overlapping function bodies.
+By default, these files are saved to the ``omnitrace-<NAME>-output`` folder 
+where ``<NAME>`` is the base name of the targeted binary (or
+the base name of the resulting executable in the case of binary rewrite). For example,
+``omnitrace-instrument -- ls`` outputs these files to ``omnitrace-ls-output`` 
+whereas ``omnitrace-instrument -o ls.inst -- ls`` places them in ``omnitrace-ls.inst-output``.
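+
+The naming rule can be expressed as a small shell helper. This is a sketch of the
+``omnitrace-<NAME>-output`` convention described above, assuming the default output location.
+The ``omnitrace_output_dir`` function is a hypothetical name and is not part of Omnitrace.

```shell
# Hypothetical helper: predict the default output folder name.
# $1 is the target binary for runtime instrumentation, or the value
# passed to -o for a binary rewrite.
omnitrace_output_dir() {
    printf 'omnitrace-%s-output\n' "$(basename "$1")"
}

omnitrace_output_dir /usr/bin/ls   # runtime instrumentation of ls
omnitrace_output_dir ./ls.inst     # binary rewrite written to ls.inst
```

+
+The two calls print ``omnitrace-ls-output`` and ``omnitrace-ls.inst-output``, matching the examples above.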
+
+To generate these files without running or generating an 
+executable, use the ``--simulate`` option:
+
+.. code-block:: shell
+
+   omnitrace-instrument --simulate -- foo
+   omnitrace-instrument --simulate -o foo.inst -- foo
+
+Excluding and including modules and functions
+----------------------------------------------
+
+Omnitrace has a set of six command-line options, each of which accepts one or more 
+regular expressions for customizing which modules and/or functions are
+instrumented. Multiple regex patterns per option are treated as an OR operation; 
+for example, ``--module-include libfoo libbar`` is effectively the same as ``--module-include 'libfoo|libbar'``.
+
+To force the inclusion of certain modules and/or functions 
+without changing any of the heuristics, use the ``--module-include`` and/or ``--function-include`` options.
+These options do not exclude modules or functions which do 
+not satisfy their regular expression.
+
+To narrow the scope of the instrumentation to a specific set 
+of libraries and/or functions, use the ``--module-restrict`` and ``--function-restrict`` options.
+These options let you exclusively select the union of one or more 
+regular expressions, regardless of whether or not the functions satisfy the
+previously-mentioned default heuristics. Any function or module that is not within 
+the union of these regular expressions is excluded from instrumentation.
+
+To avoid instrumenting a set of modules and/or functions, 
+use the ``--module-exclude`` and ``--function-exclude`` options.
+These options are always applied, even if the module or function 
+satisfies the "restrict" or "include" regular expression.
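+
+The OR semantics and the precedence of "exclude" can be imitated with ``grep``. In this sketch,
+``grep -E`` only stands in for the matching step. The regular expression dialect that Omnitrace uses
+may differ, and the module names are made up for illustration.

```shell
# Hypothetical module list, used only for illustration
modules='libfoo.so
libbar.so
libbaz.so'

# Two patterns given to one option behave like a single alternation:
# both commands select libfoo.so and libbar.so
printf '%s\n' "$modules" | grep -E -e 'libfoo' -e 'libbar'
printf '%s\n' "$modules" | grep -E 'libfoo|libbar'

# A "restrict"-style filter narrows the set, and an "exclude"-style
# filter is still applied on top of it: only libfoo.so remains
printf '%s\n' "$modules" | grep -E 'libfoo|libbar' | grep -Ev 'libbar'
```

+
+The last pipeline mirrors combining ``--module-restrict 'libfoo|libbar'`` with ``--module-exclude libbar``:
+the excluded module is dropped even though it also matches the "restrict" expression.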
+
+.. _available-module-function-output:
+
+An example of the available module and function info output
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: shell
+
+   omnitrace-instrument -o lulesh.inst --label file line args --simulate -- lulesh
+
+.. code-block:: shell
+
+   AddressRange  Module                                    Function                                                                                 FunctionSignature
+           9165  ../examples/lulesh/lulesh-comm.cc         CommMonoQ                                                                                CommMonoQ(domain) [lulesh-comm.cc:1891]
+           3396  ../examples/lulesh/lulesh-comm.cc         CommRecv                                                                                 CommRecv(domain, int, Index_t, Index_t, Index_t, Index_t, bool, bool) [lulesh...
+           8666  ../examples/lulesh/lulesh-comm.cc         CommSBN                                                                                  CommSBN(domain, int, Domain_member *) [lulesh-comm.cc:926]
+          10212  ../examples/lulesh/lulesh-comm.cc         CommSend                                                                                 CommSend(domain, int, Index_t, Domain_member *, Index_t, Index_t, Index_t, bo...
+           6823  ../examples/lulesh/lulesh-comm.cc         CommSyncPosVel                                                                           CommSyncPosVel(domain) [lulesh-comm.cc:1404]
+            126  ../examples/lulesh/lulesh-comm.cc         _GLOBAL__sub_I_lulesh_comm.cc                                                            _GLOBAL__sub_I_lulesh_comm.cc() [lulesh-comm.cc]
+            308  ../examples/lulesh/lulesh-init.cc         .omp_outlined..26                                                                        .omp_outlined..26(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
+            628  ../examples/lulesh/lulesh-init.cc         .omp_outlined..34                                                                        .omp_outlined..34(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
+            656  ../examples/lulesh/lulesh-init.cc         .omp_outlined..41                                                                        .omp_outlined..41(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
+            662  ../examples/lulesh/lulesh-init.cc         .omp_outlined..45                                                                        .omp_outlined..45(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
+            550  ../examples/lulesh/lulesh-init.cc         .omp_outlined..55                                                                        .omp_outlined..55(const , const , const ParallelFor<Kokkos::Impl::ViewFill<Ko...
+            556  ../examples/lulesh/lulesh-init.cc         .omp_outlined..57                                                                        .omp_outlined..57(const , const , const ParallelFor<Kokkos::Impl::ViewFill<Ko...
+            550  ../examples/lulesh/lulesh-init.cc         .omp_outlined..78                                                                        .omp_outlined..78(const , const , const ParallelFor<Kokkos::Impl::ViewFill<Ko...
+            640  ../examples/lulesh/lulesh-init.cc         .omp_outlined..84                                                                        .omp_outlined..84(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
+            646  ../examples/lulesh/lulesh-init.cc         .omp_outlined..88                                                                        .omp_outlined..88(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
+           1840  ../examples/lulesh/lulesh-init.cc         Domain::AllocateElemPersistent                                                           Domain::AllocateElemPersistent(Domain *, Int_t) [lulesh-init.cc:94]
+           1384  ../examples/lulesh/lulesh-init.cc         Domain::AllocateNodePersistent                                                           Domain::AllocateNodePersistent(Domain *, Int_t) [lulesh-init.cc:94]
+           1264  ../examples/lulesh/lulesh-init.cc         Domain::BuildMesh                                                                        Domain::BuildMesh(Domain *, Int_t, Int_t, Int_t) [lulesh-init.cc:308]
+           2312  ../examples/lulesh/lulesh-init.cc         Domain::CreateRegionIndexSets                                                            Domain::CreateRegionIndexSets(Domain *, Int_t, Int_t) [lulesh-init.cc:409]
+           7109  ../examples/lulesh/lulesh-init.cc         Domain::Domain                                                                           Domain::Domain(Domain *, Int_t, Index_t, Index_t, Index_t, Index_t, int, int,...
+           2458  ../examples/lulesh/lulesh-init.cc         Domain::SetupBoundaryConditions                                                          Domain::SetupBoundaryConditions(Domain *, Int_t) [lulesh-init.cc:409]
+            956  ../examples/lulesh/lulesh-init.cc         Domain::SetupCommBuffers                                                                 Domain::SetupCommBuffers(Domain *, Int_t) [lulesh-init.cc]
+           1456  ../examples/lulesh/lulesh-init.cc         Domain::SetupElementConnectivities                                                       Domain::SetupElementConnectivities(Domain *, Int_t) [lulesh-init.cc:409]
+            721  ../examples/lulesh/lulesh-init.cc         Domain::SetupSymmetryPlanes                                                              Domain::SetupSymmetryPlanes(Domain *, Int_t) [lulesh-init.cc:409]
+           1591  ../examples/lulesh/lulesh-init.cc         Domain::SetupThreadSupportStructures                                                     Domain::SetupThreadSupportStructures(Domain *) [lulesh-init.cc:376]
+           1644  ../examples/lulesh/lulesh-init.cc         Domain::~Domain                                                                          Domain::~Domain(Domain *) [lulesh-init.cc:286]
+            218  ../examples/lulesh/lulesh-init.cc         InitMeshDecomp                                                                           InitMeshDecomp(Int_t, Int_t, Int_t *, Int_t *, Int_t *, Int_t *) [lulesh-init...
+            260  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::CommonSubview<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokk...         Kokkos::Impl::CommonSubview<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokk...
+           1786  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::HostIterateTile<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::R...         Kokkos::Impl::HostIterateTile<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::R...
+            330  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int**...         Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int**...
+            330  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int**...         Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int**...
+            330  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int*,...         Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int*,...
+            330  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int*,...         Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int*,...
+            330  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl...         Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl...
+            330  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl...         Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl...
+            330  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl...         Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl...
+            522  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::...         Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::...
+            232  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::...         Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::...
+             49  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewVal...         Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewVal...
+           1476  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::Tile_Loop_Type<2, false, int, void, void>::apply<Kokkos::Impl::...         Kokkos::Impl::Tile_Loop_Type<2, false, int, void, void>::apply<Kokkos::Impl::...
+            555  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::LayoutRight, Kokkos::Devic...         Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::LayoutRight, Kokkos::Devic...
+            613  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::LayoutRight, Kokkos::Devic...         Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::LayoutRight, Kokkos::Devic...
+            603  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ViewCopy<Kokkos::View<int*, Kokkos::LayoutLeft, Kokkos::Device<...         Kokkos::Impl::ViewCopy<Kokkos::View<int*, Kokkos::LayoutLeft, Kokkos::Device<...
+            604  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ViewCopy<Kokkos::View<int*, Kokkos::LayoutLeft, Kokkos::Device<...         Kokkos::Impl::ViewCopy<Kokkos::View<int*, Kokkos::LayoutLeft, Kokkos::Device<...
+            281  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...         Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
+            281  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...         Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
+            281  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...         Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
+            281  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...         Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
+            281  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...         Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
+            524  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev...         Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev...
+            525  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev...         Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev...
+            524  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev...         Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev...
+            583  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<int* [8], Kokkos::LayoutRight>, ...         SharedAllocationRecord<void, void> * Kokkos::Impl::ViewMapping<Kokkos::ViewTr...
+            529  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<int*, Kokkos::HostSpace>, void>:...         SharedAllocationRecord<void, void> * Kokkos::Impl::ViewMapping<Kokkos::ViewTr...
+            529  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<int*>, void>::allocate_shared<st...         SharedAllocationRecord<void, void> * Kokkos::Impl::ViewMapping<Kokkos::ViewTr...
+            203  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ViewRemap<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::...         Kokkos::Impl::ViewRemap<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::...
+            331  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ViewRemap<Kokkos::View<int*>, Kokkos::View<int*>, Kokkos::OpenM...         Kokkos::Impl::ViewRemap<Kokkos::View<int*>, Kokkos::View<int*>, Kokkos::OpenM...
+            461  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpa...         enable_if_t<std::is_trivial<int>::value && std::is_trivially_copy_assignable<...
+            353  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::contiguous_fill<Kokkos::OpenMP, double*>                                   Kokkos::Impl::contiguous_fill<Kokkos::OpenMP, double*>(exec_space, dst, value...
+            139  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::contiguous_fill<Kokkos::OpenMP, double, Kokkos::LayoutRight, Ko...         Kokkos::Impl::contiguous_fill<Kokkos::OpenMP, double, Kokkos::LayoutRight, Ko...
+            824  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight, Kokkos::D...         Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight, Kokkos::D...
+            824  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight, Kokkos::D...         Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight, Kokkos::D...
+            824  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::...         Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::...
+            824  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::...         Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::...
+            697  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::view_copy<Kokkos::View<int*, Kokkos::LayoutRight, Kokkos::Devic...         Kokkos::Impl::view_copy<Kokkos::View<int*, Kokkos::LayoutRight, Kokkos::Devic...
+            697  ../examples/lulesh/lulesh-init.cc         Kokkos::Impl::view_copy<Kokkos::View<int*>, Kokkos::View<int*> >                         Kokkos::Impl::view_copy<Kokkos::View<int*>, Kokkos::View<int*> >(dst, src) [l...
+           2036  ../examples/lulesh/lulesh-init.cc         Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::Schedule<Kokkos::Static>, int>::R...         Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::Schedule<Kokkos::Static>, int>::R...
+           2506  ../examples/lulesh/lulesh-init.cc         Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::Schedule<Kokkos::Static>, long>::...         Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::Schedule<Kokkos::Static>, long>::...
+            271  ../examples/lulesh/lulesh-init.cc         Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor...         Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor...
+            470  ../examples/lulesh/lulesh-init.cc         Kokkos::View<int* [8], Kokkos::LayoutRight>::View<std::__cxx11::basic_string<...         Kokkos::View<int* [8], Kokkos::LayoutRight>::View<std::__cxx11::basic_string<...
+            323  ../examples/lulesh/lulesh-init.cc         Kokkos::View<int* [8], Kokkos::LayoutRight>::View<std::__cxx11::basic_string<...         Kokkos::View<int* [8], Kokkos::LayoutRight>::View<std::__cxx11::basic_string<...
+            410  ../examples/lulesh/lulesh-init.cc         Kokkos::View<int*, Kokkos::HostSpace>::View<char [10]>                                   Kokkos::View<int*, Kokkos::HostSpace>::View<char [10]>(View<int *, Kokkos::Ho...
+            410  ../examples/lulesh/lulesh-init.cc         Kokkos::View<int*, Kokkos::HostSpace>::View<char [14]>                                   Kokkos::View<int*, Kokkos::HostSpace>::View<char [14]>(View<int *, Kokkos::Ho...
+            462  ../examples/lulesh/lulesh-init.cc         Kokkos::View<int*, Kokkos::HostSpace>::View<std::__cxx11::basic_string<char, ...         Kokkos::View<int*, Kokkos::HostSpace>::View<std::__cxx11::basic_string<char, ...
+            410  ../examples/lulesh/lulesh-init.cc         Kokkos::View<int*>::View<char [16]>                                                      Kokkos::View<int*>::View<char [16]>(View<int *> *, arg_label, type, const siz...
+            410  ../examples/lulesh/lulesh-init.cc         Kokkos::View<int*>::View<char [19]>                                                      Kokkos::View<int*>::View<char [19]>(View<int *> *, arg_label, type, const siz...
+            410  ../examples/lulesh/lulesh-init.cc         Kokkos::View<int*>::View<char [21]>                                                      Kokkos::View<int*>::View<char [21]>(View<int *> *, arg_label, type, const siz...
+            462  ../examples/lulesh/lulesh-init.cc         Kokkos::View<int*>::View<std::__cxx11::basic_string<char, std::char_traits<ch...         Kokkos::View<int*>::View<std::__cxx11::basic_string<char, std::char_traits<ch...
+            323  ../examples/lulesh/lulesh-init.cc         Kokkos::View<int*>::View<std::__cxx11::basic_string<char, std::char_traits<ch...         Kokkos::View<int*>::View<std::__cxx11::basic_string<char, std::char_traits<ch...
+           6589  ../examples/lulesh/lulesh-init.cc         Kokkos::deep_copy<double*, , double*, Kokkos::LayoutRight, Kokkos::Device<Kok...         Kokkos::deep_copy<double*, , double*, Kokkos::LayoutRight, Kokkos::Device<Kok...
+           1052  ../examples/lulesh/lulesh-init.cc         Kokkos::deep_copy<double*>                                                               Kokkos::deep_copy<double*>(dst, value) [lulesh-init.cc]
+           1050  ../examples/lulesh/lulesh-init.cc         Kokkos::deep_copy<double, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP,...         Kokkos::deep_copy<double, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP,...
+           7686  ../examples/lulesh/lulesh-init.cc         Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenM...         Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenM...
+           7686  ../examples/lulesh/lulesh-init.cc         Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, int* [8], Kokkos::LayoutRigh...         Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, int* [8], Kokkos::LayoutRigh...
+           6589  ../examples/lulesh/lulesh-init.cc         Kokkos::deep_copy<int*, , int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::O...         Kokkos::deep_copy<int*, , int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::O...
+           6589  ../examples/lulesh/lulesh-init.cc         Kokkos::deep_copy<int*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Ko...         Kokkos::deep_copy<int*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Ko...
+           6589  ../examples/lulesh/lulesh-init.cc         Kokkos::deep_copy<int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP, K...         Kokkos::deep_copy<int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP, K...
+            863  ../examples/lulesh/lulesh-init.cc         Kokkos::impl_resize<, int* [8], Kokkos::LayoutRight>                                     type Kokkos::impl_resize<, int* [8], Kokkos::LayoutRight>(v, const size_t, co...
+            854  ../examples/lulesh/lulesh-init.cc         Kokkos::impl_resize<, int*>                                                              type Kokkos::impl_resize<, int*>(v, const size_t, const size_t, const size_t,...
+            697  ../examples/lulesh/lulesh-init.cc         Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (...         Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (...
+            706  ../examples/lulesh/lulesh-init.cc         Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (...         Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (...
+            912  ../examples/lulesh/lulesh-init.cc         Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...         Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
+            791  ../examples/lulesh/lulesh-init.cc         Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...         Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
+            791  ../examples/lulesh/lulesh-init.cc         Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...         Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
+            944  ../examples/lulesh/lulesh-init.cc         Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...         Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
+            839  ../examples/lulesh/lulesh-init.cc         Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...         Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
+            126  ../examples/lulesh/lulesh-init.cc         _GLOBAL__sub_I_lulesh_init.cc                                                            _GLOBAL__sub_I_lulesh_init.cc() [lulesh-init.cc]
+           6589  ../examples/lulesh/lulesh-util.cc         Kokkos::deep_copy<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP...         Kokkos::deep_copy<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP...
+           1345  ../examples/lulesh/lulesh-util.cc         ParseCommandLineOptions                                                                  ParseCommandLineOptions(int, char * *, int, cmdLineOpts *) [lulesh-util.cc:67]
+            171  ../examples/lulesh/lulesh-util.cc         PrintCommandLineOptions                                                                  PrintCommandLineOptions(char *, int) [lulesh-util.cc:31]
+             67  ../examples/lulesh/lulesh-util.cc         StrToInt                                                                                 int StrToInt(const char *, int *) [lulesh-util.cc:13]
+            706  ../examples/lulesh/lulesh-util.cc         VerifyAndWriteFinalOutput                                                                VerifyAndWriteFinalOutput(Real_t, locDom, Int_t, Int_t) [lulesh-util.cc:222]
+            126  ../examples/lulesh/lulesh-util.cc         _GLOBAL__sub_I_lulesh_util.cc                                                            _GLOBAL__sub_I_lulesh_util.cc() [lulesh-util.cc]
+             17  ../examples/lulesh/lulesh-viz.cc          DumpToVisit                                                                              DumpToVisit(domain, int, int, int) [lulesh-viz.cc:415]
+            126  ../examples/lulesh/lulesh-viz.cc          _GLOBAL__sub_I_lulesh_viz.cc                                                             _GLOBAL__sub_I_lulesh_viz.cc() [lulesh-viz.cc]
+            451  ../examples/lulesh/lulesh.cc              .omp_outlined..103                                                                       .omp_outlined..103(const , const , const ParallelReduce<(lambda at ../example...
+            796  ../examples/lulesh/lulesh.cc              .omp_outlined..109                                                                       .omp_outlined..109(const , const , const ParallelFor<(lambda at ../examples/l...
+            394  ../examples/lulesh/lulesh.cc              .omp_outlined..111                                                                       .omp_outlined..111(const , const , const ParallelFor<(lambda at ../examples/l...
+            402  ../examples/lulesh/lulesh.cc              .omp_outlined..113                                                                       .omp_outlined..113(const , const , const ParallelFor<(lambda at ../examples/l...
+            427  ../examples/lulesh/lulesh.cc              .omp_outlined..115                                                                       .omp_outlined..115(const , const , const ParallelReduce<(lambda at ../example...
+            859  ../examples/lulesh/lulesh.cc              .omp_outlined..119                                                                       .omp_outlined..119(const , const , const ParallelFor<(lambda at ../examples/l...
+            243  ../examples/lulesh/lulesh.cc              .omp_outlined..122                                                                       .omp_outlined..122(const , const , const ParallelFor<(lambda at ../examples/l...
+            426  ../examples/lulesh/lulesh.cc              .omp_outlined..124                                                                       .omp_outlined..124(const , const , const ParallelFor<(lambda at ../examples/l...
+            529  ../examples/lulesh/lulesh.cc              .omp_outlined..127                                                                       .omp_outlined..127(const , const , const ParallelFor<(lambda at ../examples/l...
+            865  ../examples/lulesh/lulesh.cc              .omp_outlined..130                                                                       .omp_outlined..130(const , const , const ParallelFor<(lambda at ../examples/l...
+            539  ../examples/lulesh/lulesh.cc              .omp_outlined..132                                                                       .omp_outlined..132(const , const , const ParallelReduce<(lambda at ../example...
+            456  ../examples/lulesh/lulesh.cc              .omp_outlined..134                                                                       .omp_outlined..134(const , const , const ParallelReduce<(lambda at ../example...
+            252  ../examples/lulesh/lulesh.cc              .omp_outlined..20                                                                        .omp_outlined..20(const , const , const ParallelFor<(lambda at ../examples/lu...
+            870  ../examples/lulesh/lulesh.cc              .omp_outlined..35                                                                        .omp_outlined..35(const , const , const ParallelFor<(lambda at ../examples/lu...
+            473  ../examples/lulesh/lulesh.cc              .omp_outlined..42                                                                        .omp_outlined..42(const , const , const ParallelFor<(lambda at ../examples/lu...
+            252  ../examples/lulesh/lulesh.cc              .omp_outlined..46                                                                        .omp_outlined..46(const , const , const ParallelFor<(lambda at ../examples/lu...
+           1101  ../examples/lulesh/lulesh.cc              .omp_outlined..48                                                                        .omp_outlined..48(const , const , const ParallelFor<(lambda at ../examples/lu...
+            427  ../examples/lulesh/lulesh.cc              .omp_outlined..55                                                                        .omp_outlined..55(const , const , const ParallelReduce<(lambda at ../examples...
+           1326  ../examples/lulesh/lulesh.cc              .omp_outlined..57                                                                        .omp_outlined..57(const , const , const ParallelReduce<(lambda at ../examples...
+            243  ../examples/lulesh/lulesh.cc              .omp_outlined..61                                                                        .omp_outlined..61(const , const , const ParallelFor<(lambda at ../examples/lu...
+           1101  ../examples/lulesh/lulesh.cc              .omp_outlined..63                                                                        .omp_outlined..63(const , const , const ParallelFor<(lambda at ../examples/lu...
+            372  ../examples/lulesh/lulesh.cc              .omp_outlined..66                                                                        .omp_outlined..66(const , const , const ParallelFor<(lambda at ../examples/lu...
+            499  ../examples/lulesh/lulesh.cc              .omp_outlined..71                                                                        .omp_outlined..71(const , const , const ParallelFor<(lambda at ../examples/lu...
+            499  ../examples/lulesh/lulesh.cc              .omp_outlined..73                                                                        .omp_outlined..73(const , const , const ParallelFor<(lambda at ../examples/lu...
+            499  ../examples/lulesh/lulesh.cc              .omp_outlined..75                                                                        .omp_outlined..75(const , const , const ParallelFor<(lambda at ../examples/lu...
+            465  ../examples/lulesh/lulesh.cc              .omp_outlined..78                                                                        .omp_outlined..78(const , const , const ParallelFor<(lambda at ../examples/lu...
+            396  ../examples/lulesh/lulesh.cc              .omp_outlined..81                                                                        .omp_outlined..81(const , const , const ParallelFor<(lambda at ../examples/lu...
+            656  ../examples/lulesh/lulesh.cc              .omp_outlined..85                                                                        .omp_outlined..85(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
+            662  ../examples/lulesh/lulesh.cc              .omp_outlined..89                                                                        .omp_outlined..89(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
+            443  ../examples/lulesh/lulesh.cc              .omp_outlined..93                                                                        .omp_outlined..93(const , const , const ParallelReduce<(lambda at ../examples...
+            243  ../examples/lulesh/lulesh.cc              .omp_outlined..96                                                                        .omp_outlined..96(const , const , const ParallelFor<(lambda at ../examples/lu...
+            243  ../examples/lulesh/lulesh.cc              .omp_outlined..99                                                                        .omp_outlined..99(const , const , const ParallelFor<(lambda at ../examples/lu...
+          13367  ../examples/lulesh/lulesh.cc              ApplyMaterialPropertiesForElems                                                          ApplyMaterialPropertiesForElems(domain) [lulesh.cc:409]
+           1530  ../examples/lulesh/lulesh.cc              CalcElemCharacteristicLength                                                             Real_t CalcElemCharacteristicLength(const Real_t *, const Real_t *, const Rea...
+            982  ../examples/lulesh/lulesh.cc              CalcElemFBHourglassForce                                                                 CalcElemFBHourglassForce(const Real_t *, const Real_t[] *, coefficient, Real_...
+           2428  ../examples/lulesh/lulesh.cc              CalcElemNodeNormals                                                                      CalcElemNodeNormals(Real_t *, Real_t *, Real_t *, const Real_t *, const Real_...
+            853  ../examples/lulesh/lulesh.cc              CalcElemShapeFunctionDerivatives                                                         CalcElemShapeFunctionDerivatives(const Real_t *, const Real_t *, const Real_t...
+           1097  ../examples/lulesh/lulesh.cc              CalcElemVolumeDerivative                                                                 CalcElemVolumeDerivative(i, dvdx, dvdy, dvdz, const Real_t *, const Real_t *,...
+           1054  ../examples/lulesh/lulesh.cc              CalcKinematicsForElems                                                                   CalcKinematicsForElems(domain, Real_t, Index_t) [lulesh.cc]
+          14160  ../examples/lulesh/lulesh.cc              CalcVolumeForceForElems                                                                  CalcVolumeForceForElems(domain) [lulesh.cc:409]
+            366  ../examples/lulesh/lulesh.cc              Domain::AllocateGradients                                                                Domain::AllocateGradients(Domain *, Int_t, Int_t) [lulesh.cc:214]
+            475  ../examples/lulesh/lulesh.cc              Domain::DeallocateGradients                                                              Domain::DeallocateGradients(Domain *) [lulesh.cc:105]
+            250  ../examples/lulesh/lulesh.cc              Domain::DeallocateStrains                                                                Domain::DeallocateStrains(Domain *) [lulesh.cc:105]
+           4356  ../examples/lulesh/lulesh.cc              Domain::Domain                                                                           Domain::Domain(Domain *) [lulesh.cc:78]
+             15  ../examples/lulesh/lulesh.cc              Domain::delv_eta                                                                         Domain::delv_eta(const Domain *, const Index_t) [lulesh.cc:371]
+             15  ../examples/lulesh/lulesh.cc              Domain::delv_xi                                                                          Domain::delv_xi(const Domain *, const Index_t) [lulesh.cc:368]
+             15  ../examples/lulesh/lulesh.cc              Domain::delv_zeta                                                                        Domain::delv_zeta(const Domain *, const Index_t) [lulesh.cc:374]
+             15  ../examples/lulesh/lulesh.cc              Domain::fx                                                                               Domain::fx(const Domain *, const Index_t) [lulesh.cc:303]
+             15  ../examples/lulesh/lulesh.cc              Domain::fy                                                                               Domain::fy(const Domain *, const Index_t) [lulesh.cc:306]
+             15  ../examples/lulesh/lulesh.cc              Domain::fz                                                                               Domain::fz(const Domain *, const Index_t) [lulesh.cc:309]
+             15  ../examples/lulesh/lulesh.cc              Domain::nodalMass                                                                        Domain::nodalMass(const Domain *, const Index_t) [lulesh.cc:314]
+             15  ../examples/lulesh/lulesh.cc              Domain::x                                                                                Domain::x(const Domain *, const Index_t) [lulesh.cc:257]
+             15  ../examples/lulesh/lulesh.cc              Domain::xd                                                                               Domain::xd(const Domain *, const Index_t) [lulesh.cc:272]
+             15  ../examples/lulesh/lulesh.cc              Domain::y                                                                                Domain::y(const Domain *, const Index_t) [lulesh.cc:258]
+             15  ../examples/lulesh/lulesh.cc              Domain::yd                                                                               Domain::yd(const Domain *, const Index_t) [lulesh.cc:275]
+             15  ../examples/lulesh/lulesh.cc              Domain::z                                                                                Domain::z(const Domain *, const Index_t) [lulesh.cc:259]
+             15  ../examples/lulesh/lulesh.cc              Domain::zd                                                                               Domain::zd(const Domain *, const Index_t) [lulesh.cc:278]
+            330  ../examples/lulesh/lulesh.cc              Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<doubl...         Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<doubl...
+            330  ../examples/lulesh/lulesh.cc              Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<doubl...         Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<doubl...
+           1508  ../examples/lulesh/lulesh.cc              Kokkos::Impl::ParallelFor<CalcEnergyForElems(double*, double*, double*, doubl...         type Kokkos::Impl::ParallelFor<CalcEnergyForElems(double*, double*, double*, ...
+           3606  ../examples/lulesh/lulesh.cc              Kokkos::Impl::ParallelFor<CalcFBHourglassForceForElems(Domain&, double*, Kokk...         type Kokkos::Impl::ParallelFor<CalcFBHourglassForceForElems(Domain&, double*,...
+           2917  ../examples/lulesh/lulesh.cc              Kokkos::Impl::ParallelFor<CalcKinematicsForElems(Domain&, double, int)::$_0, ...         type Kokkos::Impl::ParallelFor<CalcKinematicsForElems(Domain&, double, int)::...
+           3119  ../examples/lulesh/lulesh.cc              Kokkos::Impl::ParallelFor<CalcMonotonicQGradientsForElems(Domain&)::{lambda(i...         type Kokkos::Impl::ParallelFor<CalcMonotonicQGradientsForElems(Domain&)::{lam...
+           1969  ../examples/lulesh/lulesh.cc              Kokkos::Impl::ParallelFor<CalcMonotonicQRegionForElems(Domain&, int, double):...         type Kokkos::Impl::ParallelFor<CalcMonotonicQRegionForElems(Domain&, int, dou...
+           1265  ../examples/lulesh/lulesh.cc              Kokkos::Impl::ParallelFor<IntegrateStressForElems(Domain&, double*, double*, ...         type Kokkos::Impl::ParallelFor<IntegrateStressForElems(Domain&, double*, doub...
+             49  ../examples/lulesh/lulesh.cc              Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewVal...         Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewVal...
+           1497  ../examples/lulesh/lulesh.cc              Kokkos::Impl::TeamPolicyInternal<Kokkos::OpenMP>::TeamPolicyInternal                     Kokkos::Impl::TeamPolicyInternal<Kokkos::OpenMP>::TeamPolicyInternal(TeamPoli...
+            603  ../examples/lulesh/lulesh.cc              Kokkos::Impl::ViewCopy<Kokkos::View<double*, Kokkos::LayoutLeft, Kokkos::Devi...         Kokkos::Impl::ViewCopy<Kokkos::View<double*, Kokkos::LayoutLeft, Kokkos::Devi...
+            604  ../examples/lulesh/lulesh.cc              Kokkos::Impl::ViewCopy<Kokkos::View<double*, Kokkos::LayoutLeft, Kokkos::Devi...         Kokkos::Impl::ViewCopy<Kokkos::View<double*, Kokkos::LayoutLeft, Kokkos::Devi...
+            281  ../examples/lulesh/lulesh.cc              Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...         Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
+            281  ../examples/lulesh/lulesh.cc              Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...         Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
+            521  ../examples/lulesh/lulesh.cc              Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<double*>, void>::allocate_shared...         SharedAllocationRecord<void, void> * Kokkos::Impl::ViewMapping<Kokkos::ViewTr...
+            331  ../examples/lulesh/lulesh.cc              Kokkos::Impl::ViewRemap<Kokkos::View<double*>, Kokkos::View<double*>, Kokkos:...         Kokkos::Impl::ViewRemap<Kokkos::View<double*>, Kokkos::View<double*>, Kokkos:...
+            461  ../examples/lulesh/lulesh.cc              Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpa...         enable_if_t<std::is_trivial<double>::value && std::is_trivially_copy_assignab...
+           1609  ../examples/lulesh/lulesh.cc              Kokkos::Impl::runtime_check_rank_host                                                    Kokkos::Impl::runtime_check_rank_host(const size_t, const bool, const size_t,...
+            697  ../examples/lulesh/lulesh.cc              Kokkos::Impl::view_copy<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::De...         Kokkos::Impl::view_copy<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::De...
+            697  ../examples/lulesh/lulesh.cc              Kokkos::Impl::view_copy<Kokkos::View<double*>, Kokkos::View<double*> >                   Kokkos::Impl::view_copy<Kokkos::View<double*>, Kokkos::View<double*> >(dst, s...
+           2250  ../examples/lulesh/lulesh.cc              Kokkos::RangePolicy<Kokkos::OpenMP>::RangePolicy                                         Kokkos::RangePolicy<Kokkos::OpenMP>::RangePolicy(RangePolicy<Kokkos::OpenMP> ...
+            213  ../examples/lulesh/lulesh.cc              Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor...         Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor...
+            410  ../examples/lulesh/lulesh.cc              Kokkos::View<double*>::View<char [6]>                                                    Kokkos::View<double*>::View<char [6]>(View<double *> *, arg_label, type, cons...
+            410  ../examples/lulesh/lulesh.cc              Kokkos::View<double*>::View<char [7]>                                                    Kokkos::View<double*>::View<char [7]>(View<double *> *, arg_label, type, cons...
+            462  ../examples/lulesh/lulesh.cc              Kokkos::View<double*>::View<std::__cxx11::basic_string<char, std::char_traits...         Kokkos::View<double*>::View<std::__cxx11::basic_string<char, std::char_traits...
+            323  ../examples/lulesh/lulesh.cc              Kokkos::View<double*>::View<std::__cxx11::basic_string<char, std::char_traits...         Kokkos::View<double*>::View<std::__cxx11::basic_string<char, std::char_traits...
+             25  ../examples/lulesh/lulesh.cc              Kokkos::View<double*>::~View                                                             Kokkos::View<double*>::~View(View<double *> *) [lulesh.cc:409]
+            840  ../examples/lulesh/lulesh.cc              Kokkos::abort                                                                            Kokkos::abort(const const char *, const const char *) [lulesh.cc:202]
+            854  ../examples/lulesh/lulesh.cc              Kokkos::impl_resize<, double*>                                                           type Kokkos::impl_resize<, double*>(v, const size_t, const size_t, const size...
+            928  ../examples/lulesh/lulesh.cc              Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...         Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
+            960  ../examples/lulesh/lulesh.cc              Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...         Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
+          21470  ../examples/lulesh/lulesh.cc              LagrangeLeapFrog                                                                         LagrangeLeapFrog(domain) [lulesh.cc]
+            226  ../examples/lulesh/lulesh.cc              ResizeBuffer                                                                             ResizeBuffer(const size_t) [lulesh.cc:23]
+            169  ../examples/lulesh/lulesh.cc              _GLOBAL__sub_I_lulesh.cc                                                                 _GLOBAL__sub_I_lulesh.cc() [lulesh.cc]
+           1836  ../examples/lulesh/lulesh.cc              main                                                                                     int main(int, char * *) [lulesh.cc]
+             63  ../examples/lulesh/lulesh.cc              std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::a...         std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::a...
+             20  ../examples/lulesh/lulesh.cc              std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::alloca...         std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::alloca...
+            160  ../examples/lulesh/lulesh.cc              std::operator+<char, std::char_traits<char>, std::allocator<char> >                      basic_string<char, std::char_traits<char>, std::allocator<char> > std::operat...
+            187  ../examples/lulesh/lulesh.cc              std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::alloc...         std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::alloc...
+             11  lulesh                                    __clang_call_terminate                                                                   __clang_call_terminate() [lulesh]
+             33  lulesh                                    __do_global_dtors_aux                                                                    __do_global_dtors_aux() [lulesh]
+              5  lulesh                                    __libc_csu_fini                                                                          __libc_csu_fini() [lulesh]
+            101  lulesh                                    __libc_csu_init                                                                          __libc_csu_init() [lulesh]
+              5  lulesh                                    _dl_relocate_static_pie                                                                  _dl_relocate_static_pie() [lulesh]
+             13  lulesh                                    _fini                                                                                    _fini() [lulesh]
+             27  lulesh                                    _init                                                                                    _init() [lulesh]
+             47  lulesh                                    _start                                                                                   _start() [lulesh]
+              6  lulesh                                    frame_dummy                                                                              frame_dummy() [lulesh]
+
+An example of instrumented module and function info output
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: shell
+
+   omnitrace-instrument -o lulesh.inst --label file line args --simulate -- lulesh
+
+After the heuristics are applied to the modules and functions listed in :ref:`available-module-function-output`,
+the selected modules and functions are:
+
+.. code-block:: shell
+
+   AddressRange  Module                                    Function                                                                                 FunctionSignature
+           9165  ../examples/lulesh/lulesh-comm.cc         CommMonoQ                                                                                CommMonoQ(domain) [lulesh-comm.cc:1891]
+           3396  ../examples/lulesh/lulesh-comm.cc         CommRecv                                                                                 CommRecv(domain, int, Index_t, Index_t, Index_t, Index_t, bool, bool) [lulesh...
+           8666  ../examples/lulesh/lulesh-comm.cc         CommSBN                                                                                  CommSBN(domain, int, Domain_member *) [lulesh-comm.cc:926]
+          10212  ../examples/lulesh/lulesh-comm.cc         CommSend                                                                                 CommSend(domain, int, Index_t, Domain_member *, Index_t, Index_t, Index_t, bo...
+           6823  ../examples/lulesh/lulesh-comm.cc         CommSyncPosVel                                                                           CommSyncPosVel(domain) [lulesh-comm.cc:1404]
+           1840  ../examples/lulesh/lulesh-init.cc         Domain::AllocateElemPersistent                                                           Domain::AllocateElemPersistent(Domain *, Int_t) [lulesh-init.cc:94]
+           1384  ../examples/lulesh/lulesh-init.cc         Domain::AllocateNodePersistent                                                           Domain::AllocateNodePersistent(Domain *, Int_t) [lulesh-init.cc:94]
+           1264  ../examples/lulesh/lulesh-init.cc         Domain::BuildMesh                                                                        Domain::BuildMesh(Domain *, Int_t, Int_t, Int_t) [lulesh-init.cc:308]
+           2312  ../examples/lulesh/lulesh-init.cc         Domain::CreateRegionIndexSets                                                            Domain::CreateRegionIndexSets(Domain *, Int_t, Int_t) [lulesh-init.cc:409]
+           7109  ../examples/lulesh/lulesh-init.cc         Domain::Domain                                                                           Domain::Domain(Domain *, Int_t, Index_t, Index_t, Index_t, Index_t, int, int,...
+           2458  ../examples/lulesh/lulesh-init.cc         Domain::SetupBoundaryConditions                                                          Domain::SetupBoundaryConditions(Domain *, Int_t) [lulesh-init.cc:409]
+            956  ../examples/lulesh/lulesh-init.cc         Domain::SetupCommBuffers                                                                 Domain::SetupCommBuffers(Domain *, Int_t) [lulesh-init.cc]
+           1456  ../examples/lulesh/lulesh-init.cc         Domain::SetupElementConnectivities                                                       Domain::SetupElementConnectivities(Domain *, Int_t) [lulesh-init.cc:409]
+            721  ../examples/lulesh/lulesh-init.cc         Domain::SetupSymmetryPlanes                                                              Domain::SetupSymmetryPlanes(Domain *, Int_t) [lulesh-init.cc:409]
+           1591  ../examples/lulesh/lulesh-init.cc         Domain::SetupThreadSupportStructures                                                     Domain::SetupThreadSupportStructures(Domain *) [lulesh-init.cc:376]
+           1644  ../examples/lulesh/lulesh-init.cc         Domain::~Domain                                                                          Domain::~Domain(Domain *) [lulesh-init.cc:286]
+            271  ../examples/lulesh/lulesh-init.cc         Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor...         Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor...
+            410  ../examples/lulesh/lulesh-init.cc         Kokkos::View<int*, Kokkos::HostSpace>::View<char [10]>                                   Kokkos::View<int*, Kokkos::HostSpace>::View<char [10]>(View<int *, Kokkos::Ho...
+            410  ../examples/lulesh/lulesh-init.cc         Kokkos::View<int*, Kokkos::HostSpace>::View<char [14]>                                   Kokkos::View<int*, Kokkos::HostSpace>::View<char [14]>(View<int *, Kokkos::Ho...
+            410  ../examples/lulesh/lulesh-init.cc         Kokkos::View<int*>::View<char [16]>                                                      Kokkos::View<int*>::View<char [16]>(View<int *> *, arg_label, type, const siz...
+            410  ../examples/lulesh/lulesh-init.cc         Kokkos::View<int*>::View<char [19]>                                                      Kokkos::View<int*>::View<char [19]>(View<int *> *, arg_label, type, const siz...
+            410  ../examples/lulesh/lulesh-init.cc         Kokkos::View<int*>::View<char [21]>                                                      Kokkos::View<int*>::View<char [21]>(View<int *> *, arg_label, type, const siz...
+           6589  ../examples/lulesh/lulesh-init.cc         Kokkos::deep_copy<double*, , double*, Kokkos::LayoutRight, Kokkos::Device<Kok...         Kokkos::deep_copy<double*, , double*, Kokkos::LayoutRight, Kokkos::Device<Kok...
+           1052  ../examples/lulesh/lulesh-init.cc         Kokkos::deep_copy<double*>                                                               Kokkos::deep_copy<double*>(dst, value) [lulesh-init.cc]
+           1050  ../examples/lulesh/lulesh-init.cc         Kokkos::deep_copy<double, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP,...         Kokkos::deep_copy<double, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP,...
+           7686  ../examples/lulesh/lulesh-init.cc         Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenM...         Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenM...
+           7686  ../examples/lulesh/lulesh-init.cc         Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, int* [8], Kokkos::LayoutRigh...         Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, int* [8], Kokkos::LayoutRigh...
+           6589  ../examples/lulesh/lulesh-init.cc         Kokkos::deep_copy<int*, , int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::O...         Kokkos::deep_copy<int*, , int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::O...
+           6589  ../examples/lulesh/lulesh-init.cc         Kokkos::deep_copy<int*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Ko...         Kokkos::deep_copy<int*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Ko...
+           6589  ../examples/lulesh/lulesh-init.cc         Kokkos::deep_copy<int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP, K...         Kokkos::deep_copy<int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP, K...
+            697  ../examples/lulesh/lulesh-init.cc         Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (...         Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (...
+            706  ../examples/lulesh/lulesh-init.cc         Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (...         Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (...
+            912  ../examples/lulesh/lulesh-init.cc         Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...         Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
+            791  ../examples/lulesh/lulesh-init.cc         Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...         Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
+            791  ../examples/lulesh/lulesh-init.cc         Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...         Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
+            944  ../examples/lulesh/lulesh-init.cc         Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...         Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
+            839  ../examples/lulesh/lulesh-init.cc         Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...         Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
+           6589  ../examples/lulesh/lulesh-util.cc         Kokkos::deep_copy<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP...         Kokkos::deep_copy<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP...
+           1345  ../examples/lulesh/lulesh-util.cc         ParseCommandLineOptions                                                                  ParseCommandLineOptions(int, char * *, int, cmdLineOpts *) [lulesh-util.cc:67]
+            706  ../examples/lulesh/lulesh-util.cc         VerifyAndWriteFinalOutput                                                                VerifyAndWriteFinalOutput(Real_t, locDom, Int_t, Int_t) [lulesh-util.cc:222]
+          13367  ../examples/lulesh/lulesh.cc              ApplyMaterialPropertiesForElems                                                          ApplyMaterialPropertiesForElems(domain) [lulesh.cc:409]
+            982  ../examples/lulesh/lulesh.cc              CalcElemFBHourglassForce                                                                 CalcElemFBHourglassForce(const Real_t *, const Real_t[] *, coefficient, Real_...
+           2428  ../examples/lulesh/lulesh.cc              CalcElemNodeNormals                                                                      CalcElemNodeNormals(Real_t *, Real_t *, Real_t *, const Real_t *, const Real_...
+            853  ../examples/lulesh/lulesh.cc              CalcElemShapeFunctionDerivatives                                                         CalcElemShapeFunctionDerivatives(const Real_t *, const Real_t *, const Real_t...
+           1054  ../examples/lulesh/lulesh.cc              CalcKinematicsForElems                                                                   CalcKinematicsForElems(domain, Real_t, Index_t) [lulesh.cc]
+          14160  ../examples/lulesh/lulesh.cc              CalcVolumeForceForElems                                                                  CalcVolumeForceForElems(domain) [lulesh.cc:409]
+            366  ../examples/lulesh/lulesh.cc              Domain::AllocateGradients                                                                Domain::AllocateGradients(Domain *, Int_t, Int_t) [lulesh.cc:214]
+            475  ../examples/lulesh/lulesh.cc              Domain::DeallocateGradients                                                              Domain::DeallocateGradients(Domain *) [lulesh.cc:105]
+           4356  ../examples/lulesh/lulesh.cc              Domain::Domain                                                                           Domain::Domain(Domain *) [lulesh.cc:78]
+            410  ../examples/lulesh/lulesh.cc              Kokkos::View<double*>::View<char [6]>                                                    Kokkos::View<double*>::View<char [6]>(View<double *> *, arg_label, type, cons...
+            410  ../examples/lulesh/lulesh.cc              Kokkos::View<double*>::View<char [7]>                                                    Kokkos::View<double*>::View<char [7]>(View<double *> *, arg_label, type, cons...
+            928  ../examples/lulesh/lulesh.cc              Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...         Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
+            960  ../examples/lulesh/lulesh.cc              Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...         Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
+          21470  ../examples/lulesh/lulesh.cc              LagrangeLeapFrog                                                                         LagrangeLeapFrog(domain) [lulesh.cc]
+           1836  ../examples/lulesh/lulesh.cc              main                                                                                     int main(int, char * *) [lulesh.cc]
+
+Sampling
+========================================
+
+.. note::
+
+   This capability has been deprecated in favor of :doc:`Call stack sampling <./sampling-call-stack>`.
+
+By default, ``omnitrace-instrument`` uses ``--mode trace`` for instrumentation. The ``--mode sampling`` option
+only instruments ``main`` in an executable. It activates both CPU call-stack sampling and
+background system-level thread sampling by default.
+Tracing capabilities which do not rely on instrumentation, such as the HIP API and kernel tracing
+(which is collected by roctracer), are still available.
+
+The Omnitrace sampling capabilities are always available, even in trace mode, but are deactivated by default.
+To activate sampling in trace mode, set ``OMNITRACE_USE_SAMPLING=ON`` in the environment
+or in an Omnitrace configuration file.
+
+Embedding a default configuration
+========================================
+
+Use the ``--env`` option to embed a default configuration into the target. Although this option
+works for runtime instrumentation, it is most useful when generating new binaries because the generated
+binary can be used later on in a different login session when the environment might have changed.
+
+For example, if the following commands are run,
+the configuration settings are not preserved for subsequent sessions:
+
+.. code-block:: shell
+
+   omnitrace-instrument -o ./foo.inst -- ./foo
+   export OMNITRACE_USE_SAMPLING=ON
+   export OMNITRACE_SAMPLING_FREQ=5
+   omnitrace-run -- ./foo.inst
+
+Whereas the following command preserves those environment variables:
+
+.. code-block:: shell
+
+   omnitrace-instrument -o ./foo.samp --env OMNITRACE_USE_SAMPLING=ON OMNITRACE_SAMPLING_FREQ=5 -- ./foo
+
+These settings now apply in future sessions:
+
+.. code-block:: shell
+
+   # will sample 5x per second
+   omnitrace-run -- ./foo.samp
+
+Even though the environment variables are preserved, subsequent sessions can still override those defaults:
+
+.. code-block:: shell
+
+   # will sample 100x per second
+   export OMNITRACE_SAMPLING_FREQ=100
+   omnitrace-run -- ./foo.samp
+
+.. _rpath-troubleshooting:
+
+Troubleshooting
+----------------------------------------------
+
+Checking for RPATH
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If ``ldd ./foo.inst`` from the :ref:`binary-rewriting-library-label` 
+section still returns ``/usr/local/lib/libfoo.so.2``, the executable could have 
+an rpath encoded in the binary.
+This ELF entry results in the dynamic linker ignoring ``LD_LIBRARY_PATH`` if 
+it finds ``libfoo.so.2`` in the rpath.
+Using the ``objdump`` tool, perform the following query:
+
+.. code-block:: shell
+
+   objdump -p <exe-or-library> | egrep 'RPATH|RUNPATH'
+
+The executable has an rpath if this produces output similar to the following:
+
+.. code-block:: shell
+
+   RUNPATH              $ORIGIN:$ORIGIN/../lib
+
+Remove or modify the rpath to get ``foo.inst`` to resolve 
+to the instrumented ``libfoo.so.2`` as explained in the next section.
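+
+The query above can be wrapped in a small helper that states the result explicitly.
+This is a minimal sketch: the function name ``report_rpath`` is hypothetical, and it
+reads the ``objdump -p`` output from standard input so that it makes no assumptions about
+where the binary lives.
+

```shell
#!/bin/sh
# report_rpath: reads the output of `objdump -p <exe-or-library>` on stdin
# and prints any RPATH or RUNPATH entries it contains.
# The function name is illustrative only.
report_rpath() {
    if entries=$(grep -E 'RPATH|RUNPATH'); then
        echo "rpath entries found:"
        echo "$entries"
    else
        echo "no rpath entries"
    fi
}

# Example: objdump -p ./foo.inst | report_rpath
```

+
+If entries are reported, the rpath needs to be removed or modified before the
+instrumented library can be resolved.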
+
+Modifying an RPATH
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This code snippet uses the ``patchelf`` tool to modify the rpath of the given executable 
+or library to ``/home/user``, which is where the instrumented libraries are located.
+
+.. note::
+
+   This functionality requires the ``patchelf`` package.
+
+.. code-block:: shell
+
+   patchelf --remove-rpath <exe-or-library>
+   patchelf --set-rpath '/home/user' <exe-or-library>
@@ -0,0 +1,630 @@
+.. meta::
+   :description: Omnitrace documentation and reference
+   :keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
+
+****************************************************
+Performing causal profiling
+****************************************************
+
+The process of causal profiling can be summarized as:
+
+*If you speed up a given block of code by X%, the application will run Y% faster*.
+
+Causal profiling directs parallel application developers to where they should focus their optimization
+efforts by quantifying the potential impact of optimizations. Causal profiling is rooted in the concept
+that *software execution speed is relative*. Speeding up a block of code by X% is mathematically equivalent
+to that block of code running at its current speed if all the other code is running slower by X%.
+Thus, causal profiling works by performing experiments on blocks of code during program execution which
+insert pauses to slow down all other concurrently running code. During post-processing, these experiments
+are translated into calculations for the potential impact of speeding up this block of code.
+
+.. note::
+
+   Causal profiling supersedes the original critical trace feature, which was removed in Omnitrace v1.11.0.
+
+Consider the following C++ code executing ``foo`` and ``bar`` concurrently in two different threads
+where ``foo`` is ideally 30% faster than ``bar``:
+
+.. code-block:: cpp
+
+   #include <cstddef>
+   #include <thread>
+   constexpr size_t FOO_N =  7 * 1000000000UL;
+   constexpr size_t BAR_N = 10 * 1000000000UL;
+
+   void foo()
+   {
+      for(volatile size_t i = 0; i < FOO_N; ++i) {}
+   }
+
+   void bar()
+   {
+      for(volatile size_t i = 0; i < BAR_N; ++i) {}
+   }
+
+   int main()
+   {
+      std::thread _threads[] = { std::thread{ foo },
+                        std::thread{ bar } };
+
+      for(auto& itr : _threads)
+         itr.join();
+   }
+
+No matter how many optimizations are applied to ``foo``, the application always
+requires the same amount of time because the end-to-end performance is limited by ``bar``.
+However, a 5% speed-up in ``bar`` improves the end-to-end performance by 5%.
+This trend continues linearly, with a 10% speed-up in ``bar`` yielding a 10% speed-up in
+end-to-end performance, and so on, up to a 30% speed-up, at which point ``bar`` runs as fast as ``foo``.
+Any speed-up to ``bar`` beyond 30% still only yields an end-to-end performance
+improvement of 30% because the application is now limited by the performance of ``foo``,
+as demonstrated below in the causal profiling visualization:
+
+.. image:: ../data/causal-foobar.png
+   :alt: Visualization of the performance improvements for two functions with causal profiling
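+
+The arithmetic behind this example can be sketched in a few lines of shell. This is a toy
+model, not part of Omnitrace. It assumes that an X% speed-up shrinks the run time of ``bar`` by X%,
+that both threads start together, and that thread start-up costs are negligible.
+The function name ``predicted_improvement`` is hypothetical.
+

```shell
#!/bin/sh
# Toy model of the foo/bar example: the two threads run concurrently, so the
# end-to-end time is the time of the slower thread.
# Work units mirror FOO_N (7) and BAR_N (10); times are scaled by 100 so
# that integer arithmetic stays exact.

# predicted_improvement <percent speed-up applied to bar>
# Prints the resulting end-to-end improvement in percent.
predicted_improvement() {
    speedup=$1
    foo_time=$(( 7 * 100 ))                 # foo is left untouched
    bar_time=$(( 10 * (100 - speedup) ))    # bar after the assumed speed-up
    baseline=$(( 10 * 100 ))                # original end-to-end time (set by bar)
    if [ "$bar_time" -gt "$foo_time" ]; then
        total=$bar_time
    else
        total=$foo_time
    fi
    echo $(( (baseline - total) * 100 / baseline ))
}

for s in 0 5 10 30 50; do
    echo "bar sped up by ${s}% -> end-to-end improvement $(predicted_improvement "$s")%"
done
```

+
+Under these assumptions, the improvement tracks the speed-up of ``bar`` up to 30% and then
+stays at 30%, because ``foo`` becomes the limiting thread.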
+
+The full details of the causal profiling methodology can be found in the paper 
+`Coz: Finding Code that Counts with Causal Profiling <http://arxiv.org/pdf/1608.03676v1.pdf>`_.
+The authors' implementation is publicly available on `GitHub <https://github.com/plasma-umass/coz>`_.
+
+Getting started
+========================================
+
+To effectively use causal profiling, it is important to understand a few key 
+concepts, such as progress points.
+
+Progress points
+-----------------------------------
+
+Causal profiling requires "progress points" to track progress through the code 
+in between samples. Progress points must be triggered in a deterministic manner via instrumentation.
+This can happen in three different ways:
+
+* `Omnitrace <https://github.com/ROCm/omnitrace>`_ can leverage the callbacks from 
+  Kokkos-Tools, OpenMP-Tools, roctracer, etc. and the wrappers around functions for 
+  MPI, NUMA, RCCL, etc. to act as progress points
+* Users can leverage the :doc:`runtime instrumentation capabilities <./instrumenting-rewriting-binary-application>` 
+  to insert progress points
+* Users can leverage :doc:`User APIs <../how-to/using-omnitrace-api>`, 
+  such as ``OMNITRACE_CAUSAL_PROGRESS``
+
+.. note::
+
+   Binary rewrite to insert progress points is not supported. When a rewritten binary 
+   runs, Dyninst translates the instruction pointer address in order to perform 
+   the instrumentation. As a result, call stack samples never return instruction 
+   pointer addresses within the valid Omnitrace range.
+
+Key concepts
+-----------------------------------
+
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
+| Concept          | Setting                             | Options                          | Description                                |
+==================+=====================================+==================================+============================================+
+| Backend          | ``OMNITRACE_CAUSAL_BACKEND``        | ``perf``, ``timer``              | Backend for recording samples required     |
+|                  |                                     |                                  | to calculate the virtual speed-up          |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
+| Mode             | ``OMNITRACE_CAUSAL_MODE``           | ``function``, ``line``           | Select an entire function or individual    |
+|                  |                                     |                                  | line of code for causal experiments        |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
+| End-to-end       | ``OMNITRACE_CAUSAL_END_TO_END``     | Boolean                          | Perform a single experiment during the     |
+|                  |                                     |                                  | entire run (does not require               |
+|                  |                                     |                                  | progress points)                           |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
+| Fixed speed-up   | ``OMNITRACE_CAUSAL_FIXED_SPEEDUP``  | one or more values from [0, 100] | Virtual speed-up or pool of virtual        |
+|                  |                                     |                                  | speed-ups to randomly select               |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
+| Binary scope     | ``OMNITRACE_CAUSAL_BINARY_SCOPE``   | regular expression(s)            | Dynamic binaries containing code for       |
+|                  |                                     |                                  | experiments                                |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
+| Source scope     | ``OMNITRACE_CAUSAL_SOURCE_SCOPE``   | regular expression(s)            | ``<file>`` and/or ``<file>:<line>``        |
+|                  |                                     |                                  | containing code to include in experiments  |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
+| Function scope   | ``OMNITRACE_CAUSAL_FUNCTION_SCOPE`` | regular expression(s)            | Restricts experiments to matching          |
+|                  |                                     |                                  | functions (function mode) or lines of      |
+|                  |                                     |                                  | code within matching functions (line mode) |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
+
+.. note::
+
+   * Binary scope defaults to ``%MAIN%`` (in the executable), but the scope can be expanded to include linked libraries.
+   * ``<file>`` and ``<file>:<line>`` support requires debug info (for example, the code must be compiled with ``-g`` or, preferably, with ``-g3``).
+   * Function mode does not require debug info but does not support stripped binaries.
+
+Backends
+-----------------------------------
+
+There are two backends to choose from: ``perf`` and ``timer``.
+They are used to record the samples required to calculate the virtual speed-up.
+Both backends interrupt each thread 1000 times per second (of CPU-time) to apply the virtual speed-ups.
+The backends differ in how the samples are recorded.
+There are three key differences between the two backends:
+
+* the ``perf`` backend requires Linux Perf and elevated security privileges
+* the ``perf`` backend interrupts the application less frequently whereas the ``timer`` backend
+  interrupts the application 1000 times per second of realtime
+* the ``timer`` backend has less accurate call stacks due to instruction pointer skid
+
+In general, the ``perf`` backend is preferred over the ``timer`` backend when
+security privileges permit its use.
+If ``OMNITRACE_CAUSAL_BACKEND`` is set to ``auto``, Omnitrace falls back 
+to using the ``timer`` backend only if
+the ``perf`` backend fails. If ``OMNITRACE_CAUSAL_BACKEND`` is 
+set to ``perf`` and using this backend fails, Omnitrace aborts.
+
+Instruction pointer skid
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Instruction pointer (IP) skid measures how many instructions run after the event of interest
+before the program actually stops. The IP skid is calculated by subtracting
+the location of the IP at the point of interest from the location of the IP 
+when the kernel finally stops the application.
+For the ``timer`` backend, this translates to the
+difference in the IP between when the timer generated a signal and when the
+signal was actually delivered. Although IP skid still occurs with the ``perf`` backend,
+it is much more pronounced with the ``timer`` backend due to the overhead of pausing the entire thread.
+This means the ``timer`` backend tends to have a lower resolution than the ``perf`` backend,
+especially in ``line`` mode.
+
+Installing Linux Perf
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Linux Perf is built into the kernel and may already be installed 
+(for instance, it is included in the default kernel for OpenSUSE).
+The official method of checking whether Linux Perf is installed is
+to check for the existence of the file
+``/proc/sys/kernel/perf_event_paranoid``. If the file exists, the kernel has Perf installed.
+
+If this file does not exist, as with Debian-based systems like Ubuntu, run the following command as superuser:
+
+.. code-block:: shell
+
+   apt-get install linux-tools-common linux-tools-generic linux-tools-$(uname -r)
+
+Then reboot your computer. To use the ``perf`` backend, the value
+of ``/proc/sys/kernel/perf_event_paranoid``
+must be less than or equal to 2. If the value in this file is greater than 2, the
+``perf`` backend cannot be used.
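+
+The two checks described above can be combined into a small helper. This is a minimal sketch:
+the function name ``check_perf_paranoid`` and its optional file argument are hypothetical
+conveniences for illustration, and the threshold of 2 is taken from the text above.
+

```shell
#!/bin/sh
# check_perf_paranoid [file]
# Prints "missing" if the paranoid file is absent or unreadable,
# "ok" if its value is less than or equal to 2, and "restricted" otherwise.
# The function name and the optional file argument are illustrative only.
check_perf_paranoid() {
    paranoid_file=${1:-/proc/sys/kernel/perf_event_paranoid}
    if [ ! -r "$paranoid_file" ]; then
        echo "missing"        # Linux Perf does not appear to be installed
        return 2
    fi
    level=$(cat "$paranoid_file")
    if [ "$level" -le 2 ]; then
        echo "ok"             # paranoid level permits the perf backend
        return 0
    fi
    echo "restricted"         # paranoid level is greater than 2
    return 1
}

# Report the status of the current system without aborting the script.
check_perf_paranoid || true
```

+
+A result of ``restricted`` means the paranoid level must be lowered before the ``perf`` backend can be used.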
+
+To update the paranoid level temporarily until the system is rebooted, run 
+one of the following commands
+as a superuser (where ``PARANOID_LEVEL=<N>`` has a value of ``<N>`` in the range ``[-1, 2]``):
+
+.. code-block:: shell
+
+   echo ${PARANOID_LEVEL} | sudo tee /proc/sys/kernel/perf_event_paranoid
+
+or
+
+.. code-block:: shell
+
+   sysctl kernel.perf_event_paranoid=${PARANOID_LEVEL}
+
+To make the paranoid level persistent after a reboot, add ``kernel.perf_event_paranoid=<N>``
+(where ``<N>`` is the desired paranoid level) to the ``/etc/sysctl.conf`` file.
+
+Speed-up prediction variability and the omnitrace-causal executable
+-----------------------------------------------------------------------
+
+Causal profiling typically requires running the application several times in 
+order to adequately sample all the code domains, experiment 
+with speed-ups and other techniques, and resolve statistical fluctuations.
+The ``omnitrace-causal`` executable is designed to simplify this procedure:
+
+.. code-block:: shell
+
+   $ omnitrace-causal --help
+   [omnitrace-causal] Usage: ./bin/omnitrace-causal [ --help (count: 0, dtype: bool)
+                                                      --version (count: 0, dtype: bool)
+                                                      --monochrome (max: 1, dtype: bool)
+                                                      --debug (max: 1, dtype: bool)
+                                                      --verbose (count: 1)
+                                                      --config (min: 0, dtype: filepath)
+                                                      --launcher (count: 1, dtype: executable)
+                                                      --generate-configs (min: 0, dtype: folder)
+                                                      --no-defaults (min: 0, dtype: bool)
+                                                      --mode (count: 1, dtype: string)
+                                                      --output-name (min: 1, dtype: filename)
+                                                      --reset (max: 1, dtype: bool)
+                                                      --end-to-end (max: 1, dtype: bool)
+                                                      --wait (count: 1, dtype: seconds)
+                                                      --duration (count: 1, dtype: seconds)
+                                                      --iterations (count: 1, dtype: int)
+                                                      --speedups (min: 0, dtype: integers)
+                                                      --binary-scope (min: 0, dtype: integers)
+                                                      --source-scope (min: 0, dtype: integers)
+                                                      --function-scope (min: 0, dtype: regex-list)
+                                                      --binary-exclude (min: 0, dtype: integers)
+                                                      --source-exclude (min: 0, dtype: integers)
+                                                      --function-exclude (min: 0, dtype: regex-list)
+                                                   ]
+
+      Causal profiling usually requires multiple runs to reliably resolve the speedup estimates.
+      This executable is designed to streamline that process.
+      For example (assume all commands end with '-- <exe> <args>'):
+
+         omnitrace-causal -n 5 -- <exe>                  # runs <exe> 5x with causal profiling enabled
+
+         omnitrace-causal -s 0 5,10,15,20                # runs <exe> 2x with virtual speedups:
+                                                         #   - 0
+                                                         #   - randomly selected from 5, 10, 15, and 20
+
+         omnitrace-causal -F func_A func_B func_(A|B)    # runs <exe> 3x with the function scope limited to:
+                                                         #   1. func_A
+                                                         #   2. func_B
+                                                         #   3. func_A or func_B
+      General tips:
+      - Insert progress points at hotspots in your code or use omnitrace's runtime instrumentation
+         - Note: binary rewrite will produce a incompatible new binary
+      - Run omnitrace-causal in "function" mode first (does not require debug info)
+      - Run omnitrace-causal in "line" mode when you are targeting one function (requires debug info)
+         - Preferably, use predictions from the "function" mode to determine which function to target
+      - Limit the virtual speedups to a smaller pool, e.g., 0,5,10,25,50, to get reliable predictions quicker
+      - Make use of the binary, source, and function scope to limit the functions/lines selected for experiments
+         - Note: source scope requires debug info
+
+
+   Options:
+      -h, -?, --help                 Shows this page
+      --version                      Prints the version and exit
+
+      [DEBUG OPTIONS]
+
+      --monochrome                   Disable colorized output
+      --debug                        Debug output
+      -v, --verbose                  Verbose output
+
+      [GENERAL OPTIONS]
+
+      -c, --config                   Base configuration file
+      -l, --launcher                 When running MPI jobs, omnitrace-causal needs to be *before* the executable which launches the MPI processes (i.e.
+                                    before `mpirun`, `srun`, etc.). Pass the name of the target executable (or a regex for matching to the name of the
+                                    target) for causal profiling, e.g., `omnitrace-causal -l foo -- mpirun -n 4 foo`. This ensures that the omnitrace
+                                    library is LD_PRELOADed on the proper target
+      -g, --generate-configs         Generate config files instead of passing environment variables directly. If no arguments are provided, the config files
+                                    will be placed in ${PWD}/omnitrace-causal-config folder
+      --no-defaults                  Do not activate default features which are recommended for causal profiling. For example: PID-tagging of output files
+                                    and timestamped subdirectories are disabled by default. Kokkos tools support is added by default
+                                    (OMNITRACE_USE_KOKKOSP=ON) because, for Kokkos applications, the Kokkos-Tools callbacks are used for progress points.
+                                    Activation of OpenMP tools support is similar
+
+      [CAUSAL PROFILING OPTIONS (General)]
+                                    (These settings will be applied to all causal profiling runs)
+
+      -m, --mode [ function (func) | line ]
+                                    Causal profiling mode
+      -o, --output-name              Output filename of causal profiling data w/o extension
+      -r, --reset                    Overwrite any existing experiment results during the first run
+      -e, --end-to-end               Single causal experiment for the entire application runtime
+      -w, --wait                     Set the wait time (i.e. delay) before starting the first causal experiment (in seconds)
+      -d, --duration                 Set the length of time (in seconds) to perform causal experimentation after the first experiment is started. Once this
+                                    amount of time has elapsed, no more causal experiments will be started but any currently running experiment will be
+                                    allowed to finish.
+      -n, --iterations               Number of times to repeat the combination of run configurations
+
+      [CAUSAL PROFILING OPTIONS (Combinatorial)]
+                                    (Each individual argument to these options will multiply the number of runs by the number of arguments and the number of
+                                    iterations. E.g. -n 2 -B "MAIN" -F "foo" "bar" will produce 4 runs: 2 iterations x 1 binary scope x 2 function scopes
+                                    (MAIN+foo, MAIN+bar, MAIN+foo, MAIN+bar))
+
+      -s, --speedups                 Pool of virtual speedups to sample from during experimentation. Each space designates a group and multiple speedups can
+                                    be grouped together by commas, e.g. -s 0 0,10,20-50 is two groups: group #1 is '0' and group #2 is '0 10 20 25 30 35 40
+                                    45 50'
+      -B, --binary-scope             Restricts causal experiments to the binaries matching the list of regular expressions. Each space designates a group
+                                    and multiple scopes can be grouped together with a semi-colon
+      -S, --source-scope             Restricts causal experiments to the source files or source file + lineno pairs (i.e. <file> or <file>:<line>) matching
+                                    the list of regular expressions. Each space designates a group and multiple scopes can be grouped together with a
+                                    semi-colon
+      -F, --function-scope           Restricts causal experiments to the functions matching the list of regular expressions. Each space designates a group
+                                    and multiple scopes can be grouped together with a semi-colon
+      -BE, --binary-exclude          Excludes causal experiments from being performed on the binaries matching the list of regular expressions. Each space
+                                    designates a group and multiple excludes can be grouped together with a semi-colon
+      -SE, --source-exclude          Excludes causal experiments from being performed on the code from the source files or source file + lineno pair (i.e.
+                                    <file> or <file>:<line>) matching the list of regular expressions. Each space designates a group and multiple excludes
+                                    can be grouped together with a semi-colon
+      -FE, --function-exclude        Excludes causal experiments from being performed on the functions matching the list of regular expressions. Each space
+                                    designates a group and multiple excludes can be grouped together with a semi-colon
+
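+The speedup grouping rules described in the help text can be sketched in Python.
+This is an illustration only, not the parser used by ``omnitrace-causal``: the
+``parse_speedup_groups`` helper is hypothetical, and the step of 5 used to expand
+ranges such as ``20-50`` is an assumption inferred from the ``-s 0 0,10,20-50``
+example in the help text.
+
```python
def parse_speedup_groups(args, step=5):
    """Expand each space-separated argument into one group of speedups.

    Hypothetical helper: commas join values within a group and "low-high"
    is expanded in increments of `step` (assumed to be 5).
    """
    groups = []
    for arg in args:
        group = []
        for item in arg.split(","):
            if "-" in item:
                low, high = (int(value) for value in item.split("-"))
                group.extend(range(low, high + 1, step))
            else:
                group.append(int(item))
        groups.append(group)
    return groups


# Mirrors "-s 0 0,10,20-50" from the help text: two arguments, two groups.
groups = parse_speedup_groups(["0", "0,10,20-50"])
print(groups)
```
+
+Each group corresponds to a separate set of runs, which is why the number of
+space-separated arguments multiplies the total number of executions.
+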
+Examples
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: shell
+
+   #!/bin/bash -e
+
+   module load omnitrace
+
+   N=20
+   I=3
+
+   # when providing speedups to omnitrace-causal, speedup
+   # groups are separated by a space so "0,10" results in
+   # one speedup group where omnitrace samples from
+   # the speedup set of {0, 10}. Passing "0 10" (without
+   # quotes) to omnitrace-causal multiplies the
+   # number of runs by 2, where the first half of the
+   # runs instruct omnitrace to only use 0 as the
+   # speedup and the second half of the runs instruct
+   # omnitrace to only use 10 as the speedup.
+   SPEEDUPS="0,0,0,10,20,30,40,50,50,75,75,75,90,90,90"
+   # thus, -s ${SPEEDUPS} only multiplies the number
+   # of runs by 1 whereas -s ${SPEEDUPS_E2E} multiplies
+   # the number of runs by 15:
+   #   - 3 runs with speedup of 0
+   #   - 1 run for each of the speedups 10, 20, 30, and 40
+   #   - 2 runs with speedup of 50
+   #   - 3 runs with speedup of 75
+   #   - 3 runs with speedup of 90
+   SPEEDUPS_E2E=$(echo "${SPEEDUPS}" | sed 's/,/ /g')
+
+
+   # 20 iterations in function mode with 1 speedup group
+   # and source scope set to .cpp files
+   #
+   # outputs to files:
+   #   - causal/experiments.func.coz
+   #   - causal/experiments.func.json
+   #
+   # total executions: 20
+   #
+   omnitrace-causal        \
+      -n ${N}             \
+      -s ${SPEEDUPS}      \
+      -m function         \
+      -o experiments.func \
+      -S ".*\\.cpp"       \
+      --                  \
+      ./causal-omni-cpu "${@}"
+
+
+   # 20 iterations in line mode with 1 speedup group
+   # and source scope restricted to lines 100 and 110
+   # in the causal.cpp file.
+   #
+   # outputs to files:
+   #   - causal/experiments.line.coz
+   #   - causal/experiments.line.json
+   #
+   # total executions: 20
+   #
+   omnitrace-causal                \
+      -n ${N}                     \
+      -s ${SPEEDUPS}              \
+      -m line                     \
+      -o experiments.line         \
+      -S "causal\\.cpp:(100|110)" \
+      --                          \
+      ./causal-omni-cpu "${@}"
+
+
+   # 3 iterations in function mode of 15 singular speedups
+   # in end-to-end mode with 2 different function scopes
+   # where one is restricted to "cpu_slow_func" and
+   # another is restricted to "cpu_fast_func".
+   #
+   # outputs to files:
+   #   - causal/experiments.func.e2e.coz
+   #   - causal/experiments.func.e2e.json
+   #
+   # total executions: 90
+   #
+   omnitrace-causal            \
+      -n ${I}                 \
+      -s ${SPEEDUPS_E2E}      \
+      -m func                 \
+      -e                      \
+      -o experiments.func.e2e \
+      -F "cpu_slow_func"      \
+         "cpu_fast_func"      \
+      --                      \
+      ./causal-omni-cpu "${@}"
+
+   # 3 iterations in line mode of 15 singular speedups
+   # in end-to-end mode with 2 different source scopes
+   # where one is restricted to line 100 in causal.cpp
+   # and another is restricted to line 110 in causal.cpp.
+   #
+   # outputs to files:
+   #   - causal/experiments.line.e2e.coz
+   #   - causal/experiments.line.e2e.json
+   #
+   # total executions: 90
+   #
+   omnitrace-causal            \
+      -n ${I}                 \
+      -s ${SPEEDUPS_E2E}      \
+      -m line                 \
+      -e                      \
+      -o experiments.line.e2e \
+      -S "causal\\.cpp:100"   \
+         "causal\\.cpp:110"   \
+      --                      \
+      ./causal-omni-cpu "${@}"
+
+
+   export OMP_NUM_THREADS=8
+   export OMP_PROC_BIND=spread
+   export OMP_PLACES=threads
+
+   # set number of iterations to 5
+   N=5
+
+   # 5 iterations in function mode of 1 speedup
+   # group with the source scope restricted
+   # to files containing "lulesh" in their filename
+   # and exclude functions which start with "Kokkos::"
+   # or "std::enable_if".
+   #
+   # outputs to files:
+   #   - causal/experiments.func.coz
+   #   - causal/experiments.func.json
+   #
+   # total executions: 5
+   #
+   # First of 5 executions overwrites any
+   # existing causal/experiments.func.(coz|json)
+   # file due to "--reset" argument
+   #
+   omnitrace-causal                            \
+      --reset                                 \
+      -n ${N}                                 \
+      -s ${SPEEDUPS}                          \
+      -m func                                 \
+      -o experiments.func                     \
+      -S "lulesh.*"                           \
+      -FE "^(Kokkos::|std::enable_if)"        \
+      --                                      \
+      ./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p
+
+
+   # 5 iterations in line mode of 1 speedup
+   # group with the source scope restricted
+   # to files containing "lulesh" in their filename
+   # and exclude functions which start with "exec_range"
+   # or "execute", or which contain either
+   # "construct_shared_allocation" or "._omp_fn." in
+   # the function name.
+   #
+   # outputs to files:
+   #   - causal/experiments.line.coz
+   #   - causal/experiments.line.json
+   #
+   # total executions: 5
+   #
+   # First of 5 executions overwrites any
+   # existing causal/experiments.line.(coz|json)
+   # file due to "--reset" argument
+   #
+   omnitrace-causal                            \
+      --reset                                 \
+      -n ${N}                                 \
+      -s ${SPEEDUPS}                          \
+      -m line                                 \
+      -o experiments.line                     \
+      -S "lulesh.*"                           \
+      -FE "^(exec_range|execute);construct_shared_allocation;\\._omp_fn\\." \
+      --                                      \
+      ./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p
+
+
+   # 5 iterations in line mode of 1 speedup
+   # group with the source scope restricted
+   # to files whose basename is "lulesh.cc"
+   # for 3 different functions:
+   #   - ApplyMaterialPropertiesForElems
+   #   - CalcHourglassControlForElems
+   #   - CalcVolumeForceForElems
+   #
+   # outputs to files:
+   #   - causal/experiments.line.targeted.coz
+   #   - causal/experiments.line.targeted.json
+   #
+   # total executions: 15
+   #
+   # First of 15 executions overwrites any
+   # existing causal/experiments.line.targeted.(coz|json)
+   # file due to "--reset" argument
+   #
+   omnitrace-causal                            \
+      --reset                                 \
+      -n ${N}                                 \
+      -s ${SPEEDUPS}                          \
+      -m line                                 \
+      -o experiments.line.targeted            \
+      -F "ApplyMaterialPropertiesForElems"    \
+         "CalcHourglassControlForElems"       \
+         "CalcVolumeForceForElems"            \
+      -S "lulesh\\.cc"                        \
+      --                                      \
+      ./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p
+
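+The "total executions" figures quoted in the script comments follow from the
+combinatorial rule in the help text: the iteration count multiplied by the number
+of groups given to each combinatorial option. The sketch below reproduces that
+arithmetic. The ``count_runs`` function and its keyword names are hypothetical and
+are not part of Omnitrace.
+
```python
from itertools import product


def count_runs(iterations, **option_groups):
    """Return iterations x (number of group combinations across all options).

    Hypothetical helper: each keyword holds the list of space-separated
    groups passed to one combinatorial option (-s, -B, -S, -F, ...).
    """
    combinations = list(product(*option_groups.values()))
    return iterations * len(combinations)


SPEEDUPS = "0,0,0,10,20,30,40,50,50,75,75,75,90,90,90"

# -n 20 with one speedup group and one source scope
func_runs = count_runs(20, speedups=[SPEEDUPS], source_scope=[r".*\.cpp"])

# -n 3 with 15 single-value speedup groups and two function scopes
e2e_runs = count_runs(
    3,
    speedups=SPEEDUPS.split(","),
    function_scope=["cpu_slow_func", "cpu_fast_func"],
)

# -n 5 with one speedup group, three function scopes, and one source scope
targeted_runs = count_runs(
    5,
    speedups=[SPEEDUPS],
    function_scope=[
        "ApplyMaterialPropertiesForElems",
        "CalcHourglassControlForElems",
        "CalcVolumeForceForElems",
    ],
    source_scope=[r"lulesh\.cc"],
)

print(func_runs, e2e_runs, targeted_runs)
```
+
+The same rule gives the 4 runs quoted in the help text for
+``-n 2 -B "MAIN" -F "foo" "bar"``.
+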
+Using omnitrace-causal with other launchers like mpirun
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The ``omnitrace-causal`` executable is intended to assist with application replay 
+and is designed to always be at the start of the command line as the primary process.
+``omnitrace-causal`` typically adds a ``LD_PRELOAD`` of the Omnitrace libraries 
+into the environment before launching the command to inject the functionality
+required to start the causal profiling tooling. However, this is problematic 
+when the target application for causal profiling uses a launcher, in which case 
+it is listed as an argument rather than as the main application. For example, 
+``foo`` is the target application for profiling, but the command to run it is 
+``mpirun -n 2 foo``. Running the command ``omnitrace-causal -- mpirun -n 2 foo`` 
+applies the causal profiling to ``mpirun`` instead of ``foo``. 
+
+``omnitrace-causal`` remedies this by providing a command-line option ``-l`` / ``--launcher``
+to indicate the target application is using a launcher script/executable. The 
+argument to the command-line option is the name of, or regular expression for, the target application
+on the command line. When ``--launcher`` is used, ``omnitrace-causal`` generates 
+all the replay configurations and runs them but delays adding the ``LD_PRELOAD``. Instead it
+inserts a call to itself into the command line right before the target 
+application. This recursive call inherits the configuration from
+the parent ``omnitrace-causal`` executable, inserts an ``LD_PRELOAD`` into the environment, 
+and calls ``execv`` to replace itself with the new process launched by the target
+application.
+
+In other words, the following command:
+
+.. code-block:: shell
+
+   omnitrace-causal -l foo -n 3 -- mpirun -n 2 foo
+
+Effectively results in:
+
+.. code-block:: shell
+
+   mpirun -n 2 omnitrace-causal -- foo
+   mpirun -n 2 omnitrace-causal -- foo
+   mpirun -n 2 omnitrace-causal -- foo
+
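+The command-line rewriting described above can be modelled with a few lines of
+Python. This is a simplified sketch, not the logic inside ``omnitrace-causal``:
+the ``insert_before_target`` function is hypothetical, and it assumes the
+``--launcher`` value is applied as a regular-expression search against each
+argument in turn.
+
```python
import re


def insert_before_target(command, launcher_pattern,
                         wrapper=("omnitrace-causal", "--")):
    """Insert `wrapper` immediately before the first argument matching the pattern.

    Hypothetical model of the --launcher behavior: everything before the
    target (e.g. "mpirun -n 2") is left untouched.
    """
    pattern = re.compile(launcher_pattern)
    for index, argument in enumerate(command):
        if pattern.search(argument):
            return command[:index] + list(wrapper) + command[index:]
    raise ValueError(f"no argument matches {launcher_pattern!r}")


rewritten = insert_before_target(["mpirun", "-n", "2", "foo"], "foo")
print(" ".join(rewritten))
```
+
+With ``-n 3``, the rewritten command is executed three times, once per iteration.
+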
+Visualizing the causal output
+-------------------------------------------------------------------------
+
+Omnitrace generates ``causal/experiments.json`` and ``causal/experiments.coz`` in 
+``${OMNITRACE_OUTPUT_PATH}/${OMNITRACE_OUTPUT_PREFIX}``. Visit 
+`plasma-umass.org/coz <https://plasma-umass.org/coz/>`_ to open the ``*.coz`` file.
+
+Omnitrace versus Coz
+=======================================
+
+This comparison is intended for readers who are familiar with the 
+`Coz profiler <https://github.com/plasma-umass/coz>`_.
+Omnitrace provides several additional features and utilities for causal profiling:
+
+.. csv-table:: 
+   :header: "Feature", "Coz", "Omnitrace", "Notes"
+   :widths: 20, 60, 60, 30
+
+   "Debug info", "requires debug info in DWARF v3 format (``-gdwarf-3``)", "optional, supports any DWARF format version", "See Note #1 below"
+   "Experiment selection", "``<file>:<line>``", "``<function>`` or ``<file>:<line>``", "See Note #2 below"
+   "Experiment speed-ups", "Randomly samples from 0 to 100 in increments of 5, or uses one fixed speed-up", "Supports specifying a smaller subset", "See Note #3 below"
+   "Scope options", "Supports binary and source scopes", "Supports binary, source, and function scopes", "See Notes #4, #5, and #6 below"
+   "Scope inclusion", "Uses ``%`` as a wildcard for binary and source scopes", "Full regex support for binary, source, and function scopes", ""
+   "Scope exclusion", "Not supported", "Supports regexes for excluding binary/source/function", "See Note #7 below"
+   "Call-stack sampling", "Linux Perf", "Linux Perf, libunwind", "See Note #8 below"
+
+.. note::
+
+  #. Omnitrace supports a "function" mode which does not require debug info.
+  #. Omnitrace supports selecting an entire range of instruction pointers for a function instead 
+     of an instruction pointer for one line. In large code bases, "function" mode
+     can resolve in fewer iterations. After a target function is identified, you can 
+     switch to line mode and limit the function scope to the target function.
+  #. Omnitrace supports randomly sampling from subsets, e.g. { 0, 0, 5, 10 } 
+     where 0% is randomly selected 50% of the time and 5% and 10% are each randomly selected 25% of the time.
+  #. Omnitrace and Coz have the same definition for binary scope, which is the binaries 
+     loaded at runtime (the executable and linked libraries).
+  #. Omnitrace "source scope" supports both ``<file>`` and ``<file>:<line>`` formats 
+     in contrast to the Coz "source scope" which requires ``<file>:<line>`` format.
+  #. Omnitrace supports a "function" scope which narrows the function and lines 
+     which are eligible for causal experiments to those within the matching functions.
+  #. Omnitrace supports a second filter on scopes for removing binaries, sources, or functions 
+     caught by an inclusive match. For example, ``BINARY_SCOPE=.*`` and ``BINARY_EXCLUDE=libmpi.*``
+     initially includes all binaries, but the exclude regex then removes the MPI libraries.
+  #. In Omnitrace, the Linux Perf backend is preferred over libunwind. However, 
+     Linux Perf usage can be restricted for security reasons.
+     Omnitrace falls back to using a second POSIX timer and libunwind if 
+     Linux Perf is not available.
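+
+Notes #3 and #7 can be illustrated with a short Python sketch. It is a model of
+the behavior described in the notes, not Omnitrace code: both helper functions are
+hypothetical, and the sketch assumes scope regexes are applied with search
+semantics, which this document does not confirm.
+
```python
import re
from collections import Counter


def selection_probabilities(speedups):
    """Probability of drawing each value when sampling uniformly from the pool."""
    counts = Counter(speedups)
    return {value: count / len(speedups) for value, count in counts.items()}


def filter_scope(names, include, exclude=None):
    """Keep names matching `include`, then drop those matching `exclude`."""
    selected = [name for name in names if re.search(include, name)]
    if exclude is not None:
        selected = [name for name in selected if not re.search(exclude, name)]
    return selected


# Note #3: repeating a value in the pool raises its selection probability.
probabilities = selection_probabilities([0, 0, 5, 10])

# Note #7: include everything, then exclude the MPI libraries.
binaries = filter_scope(
    ["foo", "libfoo.so", "libmpi.so.40"],
    include=r".*",
    exclude=r"libmpi.*",
)

print(probabilities)
print(binaries)
```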
@@ -0,0 +1,334 @@
+.. meta::
+   :description: Omnitrace documentation and reference
+   :keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
+
+****************************************************
+Profiling Python scripts
+****************************************************
+
+`Omnitrace <https://github.com/ROCm/omnitrace>`_ supports profiling Python code at the 
+source level and the script level.
+Python support is enabled via the ``OMNITRACE_USE_PYTHON`` and the 
+``OMNITRACE_PYTHON_VERSION="<MAJOR>.<MINOR>"`` CMake options.
+Alternatively, to build multiple Python versions, use 
+``OMNITRACE_PYTHON_VERSIONS="<MAJOR>.<MINOR>;[<MAJOR>.<MINOR>]"``
+and ``OMNITRACE_PYTHON_ROOT_DIRS="/path/to/version;[/path/to/version]"`` instead of ``OMNITRACE_PYTHON_VERSION``.
+When building multiple Python versions, the ``OMNITRACE_PYTHON_VERSIONS`` 
+and ``OMNITRACE_PYTHON_ROOT_DIRS`` lists must
+be the same length.
+
+.. note::
+
+   When using Omnitrace with Python programs, the Python interpreter major and minor version (e.g. 3.7) 
+   must match the interpreter major and minor version
+   used when compiling the Python bindings. When building Omnitrace, 
+   the shared object file ``libpyomnitrace.<IMPL>-<VERSION>-<ARCH>-<OS>-<ABI>.so`` is generated
+   where ``IMPL`` is the Python implementation, ``VERSION`` is the major and minor 
+   version, ``ARCH`` is the architecture,
+   ``OS`` is the operating system, and ``ABI`` is the application binary interface, 
+   for example, ``libpyomnitrace.cpython-38-x86_64-linux-gnu.so``.
+
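+To see which binding file name the current interpreter would look for, the
+standard-library ``sysconfig`` module reports the extension-module suffix. The
+sketch below is a convenience check, not an Omnitrace utility: the
+``expected_binding_name`` function is hypothetical, and it assumes the suffix in
+the library name matches the interpreter's ``EXT_SUFFIX`` value, as the example
+file name in the note suggests.
+
```python
import sysconfig


def expected_binding_name(prefix="libpyomnitrace"):
    """Build the shared-object name expected for the running interpreter.

    EXT_SUFFIX looks like ".cpython-38-x86_64-linux-gnu.so" on Linux CPython;
    fall back to ".so" if the variable is not defined.
    """
    suffix = sysconfig.get_config_var("EXT_SUFFIX") or ".so"
    return f"{prefix}{suffix}"


binding_name = expected_binding_name()
print(binding_name)
```
+
+If the printed name does not correspond to a file produced by the Omnitrace
+build, the interpreter version likely differs from the one used to compile the
+bindings.
+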
+Getting Started
+========================================
+
+The Omnitrace Python package is installed in ``lib/pythonX.Y/site-packages/omnitrace``. 
+To ensure the Python interpreter can find the Omnitrace package,
+add this path to the ``PYTHONPATH`` environment variable, as in the following example:
+
+.. code-block:: shell
+
+   export PYTHONPATH=/opt/omnitrace/lib/python3.8/site-packages:${PYTHONPATH}
+
+Both the ``share/omnitrace/setup-env.sh`` script and the module file in 
+``share/modulefiles/omnitrace`` automatically handle the prefixing of the ``PYTHONPATH``
+environment variable.
+
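+A quick way to confirm that ``PYTHONPATH`` is set correctly is to ask the
+interpreter whether it can locate the package without importing it. The
+``is_importable`` helper below is hypothetical and uses only the standard library.
+
```python
import importlib.util


def is_importable(module_name):
    """Return True if the interpreter can locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Prints False when the Omnitrace site-packages directory is not on the path.
print("omnitrace importable:", is_importable("omnitrace"))
```
+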
+Running Omnitrace on a Python script
+========================================
+
+Omnitrace provides an ``omnitrace-python`` helper bash script which 
+ensures ``PYTHONPATH`` is properly set and the correct Python interpreter is used.
+This means the following commands are effectively equivalent:
+
+.. code-block:: shell
+
+   omnitrace-python --help
+
+and
+
+.. code-block:: shell
+
+   export PYTHONPATH=/opt/omnitrace/lib/python3.8/site-packages:${PYTHONPATH}
+   python3.8 -m omnitrace --help
+
+.. note::
+
+   ``omnitrace-python`` and ``python -m omnitrace`` use the same command-line syntax 
+   as the other ``omnitrace`` executables (``omnitrace-python <OMNITRACE_ARGS> -- <SCRIPT> <SCRIPT_ARGS>``) 
+   and have similar options.
+
+Command line options
+-----------------------------------
+
+Use ``omnitrace-python --help`` to view the available options:
+
+.. code-block:: shell
+
+   usage: omnitrace [-h] [-v VERBOSITY] [-b] [-c FILE] [-s FILE] [-F [BOOL]] [--label [{args,file,line} [{args,file,line} ...]]] [-I FUNC [FUNC ...]] [-E FUNC [FUNC ...]] [-R FUNC [FUNC ...]] [-MI FILE [FILE ...]] [-ME FILE [FILE ...]] [-MR FILE [FILE ...]] [--trace-c [BOOL]]
+
+   optional arguments:
+   -h, --help            show this help message and exit
+   -v VERBOSITY, --verbosity VERBOSITY
+                           Logging verbosity
+   -b, --builtin         Put 'profile' in the builtins. Use '@profile' to decorate a single function, or 'with profile:' to profile a single section of code.
+   -c FILE, --config FILE
+                           OmniTrace configuration file
+   -s FILE, --setup FILE
+                           Code to execute before the code to profile
+   -F [BOOL], --full-filepath [BOOL]
+                           Encode the full function filename (instead of basename)
+   --label [{args,file,line} [{args,file,line} ...]]
+                           Encode the function arguments, filename, and/or line number into the profiling function label
+   -I FUNC [FUNC ...], --function-include FUNC [FUNC ...]
+                           Include any entries with these function names
+   -E FUNC [FUNC ...], --function-exclude FUNC [FUNC ...]
+                           Filter out any entries with these function names
+   -R FUNC [FUNC ...], --function-restrict FUNC [FUNC ...]
+                           Select only entries with these function names
+   -MI FILE [FILE ...], --module-include FILE [FILE ...]
+                           Include any entries from these files
+   -ME FILE [FILE ...], --module-exclude FILE [FILE ...]
+                           Filter out any entries from these files
+   -MR FILE [FILE ...], --module-restrict FILE [FILE ...]
+                           Select only entries from these files
+   --trace-c [BOOL]      Enable profiling C functions
+
+   usage: python3 -m omnitrace <OMNITRACE_ARGS> -- <SCRIPT> <SCRIPT_ARGS>
+
+.. note::
+
+   The ``--trace-c`` option does not incorporate Omnitrace's dynamic instrumentation support. 
+   It only enables profiling the underlying C function call within the Python interpreter.
+
+Selective instrumentation
+-----------------------------------
+
+Similar to the ``omnitrace-instrument`` executable, command-line options exist for restricting, 
+including, and excluding certain functions and modules, for example, ``--function-exclude "^__init__$"``.
+Alternatively, add the ``@profile`` decorator to the primary function of interest 
+in your program and use the ``-b`` / ``--builtin`` command-line option to narrow the scope of the
+instrumentation to this function and its children.
+
+Consider the following Python code (``example.py``):
+
+.. code-block:: python
+
+   import sys
+
+   def fib(n):
+      return n if n < 2 else (fib(n - 1) + fib(n - 2))
+
+
+   def inefficient(n):
+      a = 0
+      for i in range(n):
+         a += i
+         for j in range(n):
+               a += j
+      return a
+
+
+   def run(n):
+      return fib(n) + inefficient(n)
+
+
+   if __name__ == "__main__":
+      run(20)
+
+Running ``omnitrace-python ./example.py`` with ``OMNITRACE_PROFILE=ON`` and 
+``OMNITRACE_TIMEMORY_COMPONENTS=trip_count`` produces the following:
+
+.. code-block:: shell
+
+   |-------------------------------------------------------------------------------------------|
+   |                                COUNTS NUMBER OF INVOCATIONS                               |
+   |-------------------------------------------------------------------------------------------|
+   |                      LABEL                        | COUNT  | DEPTH  |   METRIC   |  SUM   |
+   |---------------------------------------------------|--------|--------|------------|--------|
+   | |0>>> run                                         |      1 |      0 | trip_count |      1 |
+   | |0>>> |_fib                                       |      1 |      1 | trip_count |      1 |
+   | |0>>>   |_fib                                     |      2 |      2 | trip_count |      2 |
+   | |0>>>     |_fib                                   |      4 |      3 | trip_count |      4 |
+   | |0>>>       |_fib                                 |      8 |      4 | trip_count |      8 |
+   | |0>>>         |_fib                               |     16 |      5 | trip_count |     16 |
+   | |0>>>           |_fib                             |     32 |      6 | trip_count |     32 |
+   | |0>>>             |_fib                           |     64 |      7 | trip_count |     64 |
+   | |0>>>               |_fib                         |    128 |      8 | trip_count |    128 |
+   | |0>>>                 |_fib                       |    256 |      9 | trip_count |    256 |
+   | |0>>>                   |_fib                     |    512 |     10 | trip_count |    512 |
+   | |0>>>                     |_fib                   |   1024 |     11 | trip_count |   1024 |
+   | |0>>>                       |_fib                 |   2026 |     12 | trip_count |   2026 |
+   | |0>>>                         |_fib               |   3632 |     13 | trip_count |   3632 |
+   | |0>>>                           |_fib             |   5020 |     14 | trip_count |   5020 |
+   | |0>>>                             |_fib           |   4760 |     15 | trip_count |   4760 |
+   | |0>>>                               |_fib         |   2942 |     16 | trip_count |   2942 |
+   | |0>>>                                 |_fib       |   1152 |     17 | trip_count |   1152 |
+   | |0>>>                                   |_fib     |    274 |     18 | trip_count |    274 |
+   | |0>>>                                     |_fib   |     36 |     19 | trip_count |     36 |
+   | |0>>>                                       |_fib |      2 |     20 | trip_count |      2 |
+   | |0>>> |_inefficient                               |      1 |      1 | trip_count |      1 |
+   |-------------------------------------------------------------------------------------------|
+
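+The ``COUNT`` column for the ``fib`` rows is the number of ``fib`` calls at each
+call-stack depth. These values can be cross-checked in plain Python without
+Omnitrace. The ``fib_calls_per_depth`` function below is a hypothetical helper
+written for this check, with the depth numbered as in the table
+(``run`` is depth 0, so the first ``fib`` call is depth 1).
+
```python
from collections import Counter


def fib_calls_per_depth(n, depth=1, counts=None):
    """Count how many fib() calls occur at each call-stack depth."""
    counts = Counter() if counts is None else counts
    counts[depth] += 1
    if n >= 2:  # same recursion as fib(): only n >= 2 makes further calls
        fib_calls_per_depth(n - 1, depth + 1, counts)
        fib_calls_per_depth(n - 2, depth + 1, counts)
    return counts


counts = fib_calls_per_depth(20)
for depth in sorted(counts):
    print(f"depth {depth:2d}: {counts[depth]:5d} calls")
```
+
+The counts double at each level until depth 11 and then fall off, because
+branches that reach ``n < 2`` stop recursing.
+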
+The ``inefficient`` function can be decorated with ``@profile`` as follows:
+
+.. code-block:: python
+
+   @profile
+   def inefficient(n):
+      # ...
+
+When the script is then run using the command ``omnitrace-python -b -- ./example.py``, Omnitrace produces this output:
+
+.. code-block:: shell
+
+   |-----------------------------------------------------------|
+   |                COUNTS NUMBER OF INVOCATIONS               |
+   |-----------------------------------------------------------|
+   |      LABEL        | COUNT  | DEPTH  |   METRIC   |  SUM   |
+   |-------------------|--------|--------|------------|--------|
+   | |0>>> inefficient |      1 |      0 | trip_count |      1 |
+   |-----------------------------------------------------------|
+
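+Because ``@profile`` only exists in the builtins when the script is launched with
+``-b`` / ``--builtin``, running the decorated script with a plain interpreter
+raises a ``NameError``. One common workaround, shown here as a sketch and not as
+an Omnitrace feature, is to define a pass-through decorator when ``profile`` is
+not already available.
+
```python
import builtins

# `omnitrace-python -b` places `profile` in the builtins (per the help text).
# When it is absent, define a pass-through stand-in so the script still runs.
# This fallback is an illustrative pattern, not something Omnitrace provides.
if not hasattr(builtins, "profile"):
    def profile(func):
        return func


@profile
def inefficient(n):
    a = 0
    for i in range(n):
        a += i
        for j in range(n):
            a += j
    return a


if __name__ == "__main__":
    print(inefficient(20))
```
+
+The fallback covers only the decorator form. Code that uses ``with profile:``
+would need a stand-in that also implements the context-manager protocol.
+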
+Omnitrace Python source instrumentation
+========================================
+
+Starting with the unmodified ``example.py`` script above, import the ``omnitrace`` module:
+
+.. code-block:: python
+
+   import sys
+   import omnitrace  # import omnitrace
+
+   def fib(n):
+      # ... etc. ...
+
+Next, add ``@omnitrace.profile()`` to the ``run`` function:
+
+.. code-block:: python
+
+   @omnitrace.profile()
+   def run(n):
+      # ...
+
+Alternatively, use ``omnitrace.profile()`` as a context-manager around ``run(20)``:
+
+.. code-block:: python
+
+   if __name__ == "__main__":
+      with omnitrace.profile():
+         run(20)
+
+The results for both of the source-level instrumentation modes are identical to the 
+original ``omnitrace-python ./example.py`` results:
+
+.. code-block:: shell
+
+   |-------------------------------------------------------------------------------------------|
+   |                                COUNTS NUMBER OF INVOCATIONS                               |
+   |-------------------------------------------------------------------------------------------|
+   |                      LABEL                        | COUNT  | DEPTH  |   METRIC   |  SUM   |
+   |---------------------------------------------------|--------|--------|------------|--------|
+   | |0>>> run                                         |      1 |      0 | trip_count |      1 |
+   | |0>>> |_fib                                       |      1 |      1 | trip_count |      1 |
+   | |0>>>   |_fib                                     |      2 |      2 | trip_count |      2 |
+   | |0>>>     |_fib                                   |      4 |      3 | trip_count |      4 |
+   | |0>>>       |_fib                                 |      8 |      4 | trip_count |      8 |
+   | |0>>>         |_fib                               |     16 |      5 | trip_count |     16 |
+   | |0>>>           |_fib                             |     32 |      6 | trip_count |     32 |
+   | |0>>>             |_fib                           |     64 |      7 | trip_count |     64 |
+   | |0>>>               |_fib                         |    128 |      8 | trip_count |    128 |
+   | |0>>>                 |_fib                       |    256 |      9 | trip_count |    256 |
+   | |0>>>                   |_fib                     |    512 |     10 | trip_count |    512 |
+   | |0>>>                     |_fib                   |   1024 |     11 | trip_count |   1024 |
+   | |0>>>                       |_fib                 |   2026 |     12 | trip_count |   2026 |
+   | |0>>>                         |_fib               |   3632 |     13 | trip_count |   3632 |
+   | |0>>>                           |_fib             |   5020 |     14 | trip_count |   5020 |
+   | |0>>>                             |_fib           |   4760 |     15 | trip_count |   4760 |
+   | |0>>>                               |_fib         |   2942 |     16 | trip_count |   2942 |
+   | |0>>>                                 |_fib       |   1152 |     17 | trip_count |   1152 |
+   | |0>>>                                   |_fib     |    274 |     18 | trip_count |    274 |
+   | |0>>>                                     |_fib   |     36 |     19 | trip_count |     36 |
+   | |0>>>                                       |_fib |      2 |     20 | trip_count |      2 |
+   | |0>>> |_inefficient                               |      1 |      1 | trip_count |      1 |
+   |-------------------------------------------------------------------------------------------|
+
+.. note::
+
+   When ``omnitrace-python`` is used without built-ins, the profiling results can be cluttered by the
+   numerous functions called when more complex modules are imported, such as ``import numpy``.
+
+Omnitrace Python source instrumentation configuration
+-------------------------------------------------------------
+
+Within the Python source code, the profiler can be configured by directly 
+modifying the ``omnitrace.profiler.config`` data fields.
+
+.. code-block:: python
+
+   import sys
+
+   def fib(n):
+      return n if n < 2 else (fib(n - 1) + fib(n - 2))
+
+
+   def inefficient(n):
+      a = 0
+      for i in range(n):
+         a += i
+         for j in range(n):
+               a += j
+      return a
+
+
+   def run(n):
+      return fib(n) + inefficient(n)
+
+
+   if __name__ == "__main__":
+      from omnitrace.profiler import config
+      from omnitrace import profile
+
+      config.include_args = True
+      config.include_filename = False
+      config.include_line = False
+      config.restrict_functions += ["fib", "run"]
+
+      with profile():
+         run(5)
+
+Executing this script produces the following:
+
+.. code-block:: shell
+
+   |------------------------------------------------------------------|
+   |                   COUNTS NUMBER OF INVOCATIONS                   |
+   |------------------------------------------------------------------|
+   |          LABEL           | COUNT  | DEPTH  |   METRIC   |  SUM   |
+   |--------------------------|--------|--------|------------|--------|
+   | |0>>> run(n=5)           |      1 |      0 | trip_count |      1 |
+   | |0>>> |_fib(n=5)         |      1 |      1 | trip_count |      1 |
+   | |0>>>   |_fib(n=4)       |      1 |      2 | trip_count |      1 |
+   | |0>>>     |_fib(n=3)     |      1 |      3 | trip_count |      1 |
+   | |0>>>       |_fib(n=2)   |      1 |      4 | trip_count |      1 |
+   | |0>>>         |_fib(n=1) |      1 |      5 | trip_count |      1 |
+   | |0>>>         |_fib(n=0) |      1 |      5 | trip_count |      1 |
+   | |0>>>       |_fib(n=1)   |      1 |      4 | trip_count |      1 |
+   | |0>>>     |_fib(n=2)     |      1 |      3 | trip_count |      1 |
+   | |0>>>       |_fib(n=1)   |      1 |      4 | trip_count |      1 |
+   | |0>>>       |_fib(n=0)   |      1 |      4 | trip_count |      1 |
+   | |0>>>   |_fib(n=3)       |      1 |      2 | trip_count |      1 |
+   | |0>>>     |_fib(n=2)     |      1 |      3 | trip_count |      1 |
+   | |0>>>       |_fib(n=1)   |      1 |      4 | trip_count |      1 |
+   | |0>>>       |_fib(n=0)   |      1 |      4 | trip_count |      1 |
+   | |0>>>     |_fib(n=1)     |      1 |      3 | trip_count |      1 |
+   |------------------------------------------------------------------|
@@ -0,0 +1,404 @@
+.. meta::
+   :description: Omnitrace documentation and reference
+   :keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
+
+****************************************************
+Sampling the call stack
+****************************************************
+
+`Omnitrace <https://github.com/ROCm/omnitrace>`_ can perform call-stack sampling
+on a binary launched through either the ``omnitrace-instrument`` executable
+or the ``omnitrace-sample`` executable.
+For example, all of the following commands are effectively equivalent:
+
+* Binary rewrite with only the instrumentation necessary to start and stop sampling
+
+  .. code-block:: shell
+
+     omnitrace-instrument -M sampling -o foo.inst -- foo
+     omnitrace-run -- ./foo.inst
+
+* Runtime instrumentation with only the instrumentation necessary to start and stop sampling
+
+  .. code-block:: shell
+
+     omnitrace-instrument -M sampling -- foo
+
+* No instrumentation required
+
+  .. code-block:: shell
+
+     omnitrace-sample -- foo
+
+.. note::
+
+   Set ``OMNITRACE_USE_SAMPLING=ON`` to activate call-stack sampling when executing an instrumented binary.
+
+The only thing ``omnitrace-instrument -M sampling`` (subsequently referred to as "instrumented-sampling")
+does is wrap the ``main`` function of the executable with initialization
+before ``main`` starts and finalization after ``main`` ends.
+The same result can be achieved without instrumentation by using ``LD_PRELOAD``
+to load a library containing a dynamic symbol wrapper around ``__libc_start_main``.
+
+The use of ``omnitrace-sample`` is **recommended** over 
+``omnitrace-instrument -M sampling`` when binary instrumentation
+is not necessary. This is for a number of reasons:
+
+* ``omnitrace-sample`` provides command-line options for controlling the Omnitrace feature set instead of
+  requiring configuration files or environment variables.
+* Although instrumented-sampling only requires inserting snippets
+  around one function (``main``), Dyninst
+  does not provide a way to skip parsing and processing all the
+  other symbols in the binary.
+  In the best-case scenario, when the target binary is relatively small,
+  instrumented-sampling has a slightly slower launch time.
+  In the worst-case scenario, it requires a significant amount of time and memory to launch.
+* ``omnitrace-sample`` is fully compatible with MPI. For example, 
+  the command ``mpirun -n 2 omnitrace-sample -- foo`` is valid, 
+  whereas ``mpirun -n 2 omnitrace-instrument -M sampling -- foo``
+  is incompatible with some MPI distributions (particularly OpenMPI). This is because
+  MPI prohibits forking within an MPI rank.
+
+  * When MPI and binary instrumentation are both involved, two steps are required:
+    performing a binary rewrite of the executable and then using the instrumented executable 
+    in lieu of the original executable. ``omnitrace-sample`` is therefore much easier to use with MPI.
+
+The omnitrace-sample executable
+========================================
+
+View the help menu of ``omnitrace-sample`` with the ``-h`` / ``--help`` option:
+
+.. code-block:: shell
+
+   $ omnitrace-sample --help
+   [omnitrace-sample] Usage: omnitrace-sample [ --help (count: 0, dtype: bool)
+                                                --version (count: 0, dtype: bool)
+                                                --monochrome (max: 1, dtype: bool)
+                                                --debug (max: 1, dtype: bool)
+                                                --verbose (count: 1)
+                                                --config (min: 0, dtype: filepath)
+                                                --output (min: 1)
+                                                --trace (max: 1, dtype: bool)
+                                                --profile (max: 1, dtype: bool)
+                                                --flat-profile (max: 1, dtype: bool)
+                                                --host (max: 1, dtype: bool)
+                                                --device (max: 1, dtype: bool)
+                                                --wait (count: 1)
+                                                --duration (count: 1)
+                                                --trace-file (count: 1, dtype: filepath)
+                                                --trace-buffer-size (count: 1, dtype: KB)
+                                                --trace-fill-policy (count: 1)
+                                                --trace-wait (count: 1)
+                                                --trace-duration (count: 1)
+                                                --trace-periods (min: 1)
+                                                --trace-clock-id (count: 1)
+                                                --profile-format (min: 1)
+                                                --profile-diff (min: 1)
+                                                --process-freq (count: 1)
+                                                --process-wait (count: 1)
+                                                --process-duration (count: 1)
+                                                --cpus (count: unlimited, dtype: int or range)
+                                                --gpus (count: unlimited, dtype: int or range)
+                                                --freq (count: 1)
+                                                --sampling-wait (count: 1)
+                                                --sampling-duration (count: 1)
+                                                --tids (min: 1)
+                                                --cputime (min: 0)
+                                                --realtime (min: 0)
+                                                --include (count: unlimited)
+                                                --exclude (count: unlimited)
+                                                --cpu-events (count: unlimited)
+                                                --gpu-events (count: unlimited)
+                                                --inlines (max: 1, dtype: bool)
+                                                --hsa-interrupt (count: 1, dtype: int)
+                                             ] 
+   Options:
+      -h, -?, --help                 Shows this page (count: 0, dtype: bool) 
+      --version                      Prints the version and exit (count: 0, dtype: bool) 
+                                                                  
+      [DEBUG OPTIONS]                                  
+                                                                  
+      --monochrome                   Disable colorized output (max: 1, dtype: bool) 
+      --debug                        Debug output (max: 1, dtype: bool) 
+      -v, --verbose                  Verbose output (count: 1)     
+                                                                  
+      [GENERAL OPTIONS]  These are options which are ubiquitously applied 
+                                                                  
+      -c, --config                   Configuration file (min: 0, dtype: filepath) 
+      -o, --output                   Output path. Accepts 1-2 parameters corresponding to the output path and the output prefix (min: 1) 
+      -T, --trace                    Generate a detailed trace (perfetto output) (max: 1, dtype: bool) 
+      -P, --profile                  Generate a call-stack-based profile (conflicts with --flat-profile) (max: 1, dtype: bool) 
+      -F, --flat-profile             Generate a flat profile (conflicts with --profile) (max: 1, dtype: bool) 
+      -H, --host                     Enable sampling host-based metrics for the process. E.g. CPU frequency, memory usage, etc. (max: 1, dtype: bool) 
+      -D, --device                   Enable sampling device-based metrics for the process. E.g. GPU temperature, memory usage, etc. (max: 1, dtype: bool) 
+      -w, --wait                     This option is a combination of '--trace-wait' and '--sampling-wait'. See the descriptions for those two options. 
+                                    (count: 1) 
+      -d, --duration                 This option is a combination of '--trace-duration' and '--sampling-duration'. See the descriptions for those two 
+                                    options. (count: 1) 
+                                                                  
+      [TRACING OPTIONS]  Specific options controlling tracing (i.e. deterministic measurements of every event) 
+                                                                  
+      --trace-file                   Specify the trace output filename. Relative filepath will be with respect to output path and output prefix. (count: 1, 
+                                    dtype: filepath) 
+      --trace-buffer-size            Size limit for the trace output (in KB) (count: 1, dtype: KB) 
+      --trace-fill-policy [ discard | ring_buffer ]
+                                    
+                                    Policy for new data when the buffer size limit is reached:
+                                          - discard     : new data is ignored
+                                          - ring_buffer : new data overwrites oldest data (count: 1)
+      --trace-wait                   Set the wait time (in seconds) before collecting trace and/or profiling data(in seconds). By default, the duration is 
+                                    in seconds of realtime but that can changed via --trace-clock-id. (count: 1) 
+      --trace-duration               Set the duration of the trace and/or profile data collection (in seconds). By default, the duration is in seconds of 
+                                    realtime but that can changed via --trace-clock-id. (count: 1) 
+      --trace-periods                More powerful version of specifying trace delay and/or duration. Format is one or more groups of: <DELAY>:<DURATION>, 
+                                    <DELAY>:<DURATION>:<REPEAT>, and/or <DELAY>:<DURATION>:<REPEAT>:<CLOCK_ID>. (min: 1) 
+      --trace-clock-id [ 0 (realtime|CLOCK_REALTIME)
+                        1 (monotonic|CLOCK_MONOTONIC)
+                        2 (cputime|CLOCK_PROCESS_CPUTIME_ID)
+                        4 (monotonic_raw|CLOCK_MONOTONIC_RAW)
+                        5 (realtime_coarse|CLOCK_REALTIME_COARSE)
+                        6 (monotonic_coarse|CLOCK_MONOTONIC_COARSE)
+                        7 (boottime|CLOCK_BOOTTIME) ]
+                                    Set the default clock ID for for trace delay/duration. Note: "cputime" is the *process* CPU time and might need to be 
+                                    scaled based on the number of threads, i.e. 4 seconds of CPU-time for an application with 4 fully active threads would 
+                                    equate to ~1 second of realtime. If this proves to be difficult to handle in practice, please file a feature request 
+                                    for omnitrace to auto-scale based on the number of threads. (count: 1) 
+                                                                  
+      [PROFILE OPTIONS]  Specific options controlling profiling (i.e. deterministic measurements which are aggregated into a summary) 
+                                                                  
+      --profile-format [ console | json | text ]
+                                    Data formats for profiling results (min: 1) 
+      --profile-diff                 Generate a diff output b/t the profile collected and an existing profile from another run Accepts 1-2 parameters 
+                                    corresponding to the input path and the input prefix (min: 1) 
+                                                                  
+      [HOST/DEVICE (PROCESS SAMPLING) OPTIONS]
+                                    Process sampling is background measurements for resources available to the entire process. These samples are not tied 
+                                    to specific lines/regions of code 
+                                                                  
+      --process-freq                 Set the default host/device sampling frequency (number of interrupts per second) (count: 1) 
+      --process-wait                 Set the default wait time (i.e. delay) before taking first host/device sample (in seconds of realtime) (count: 1) 
+      --process-duration             Set the duration of the host/device sampling (in seconds of realtime) (count: 1) 
+      --cpus                         CPU IDs for frequency sampling. Supports integers and/or ranges (count: unlimited, dtype: int or range) 
+      --gpus                         GPU IDs for SMI queries. Supports integers and/or ranges (count: unlimited, dtype: int or range) 
+                                                                  
+      [GENERAL SAMPLING OPTIONS] General options for timer-based sampling per-thread 
+                                                                  
+      -f, --freq                     Set the default sampling frequency (number of interrupts per second) (count: 1) 
+      --sampling-wait                Set the default wait time (i.e. delay) before taking first sample (in seconds). This delay time is based on the clock 
+                                    of the sampler, i.e., a delay of 1 second for CPU-clock sampler may not equal 1 second of realtime (count: 1) 
+      --sampling-duration            Set the duration of the sampling (in seconds of realtime). I.e., it is possible (currently) to set a CPU-clock time 
+                                    delay that exceeds the real-time duration... resulting in zero samples being taken (count: 1) 
+      -t, --tids                     Specify the default thread IDs for sampling, where 0 (zero) is the main thread and each thread created by the target 
+                                    application is assigned an atomically incrementing value. (min: 1) 
+                                                                  
+      [SAMPLING TIMER OPTIONS] These options determine the heuristic for deciding when to take a sample 
+                                                                  
+      --cputime                      Sample based on a CPU-clock timer (default). Accepts zero or more arguments:
+                                          0. Enables sampling based on CPU-clock timer.
+                                          1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of CPU-time.
+                                          2. Delay (in seconds of CPU-clock time). I.e., how long each thread should wait before taking first sample.
+                                          3+ Thread IDs to target for sampling, starting at 0 (the main thread).
+                                             May be specified as index or range, e.g., '0 2-4' will be interpreted as:
+                                                sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads (min: 0)
+      --realtime                     Sample based on a real-clock timer. Accepts zero or more arguments:
+                                          0. Enables sampling based on real-clock timer.
+                                          1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of realtime.
+                                          2. Delay (in seconds of real-clock time). I.e., how long each thread should wait before taking first sample.
+                                          3+ Thread IDs to target for sampling, starting at 0 (the main thread).
+                                             May be specified as index or range, e.g., '0 2-4' will be interpreted as:
+                                                sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads
+                                             When sampling with a real-clock timer, please note that enabling this will cause threads which are typically "idle"
+                                             to consume more resources since, while idle, the real-clock time increases (and therefore triggers taking samples)
+                                             whereas the CPU-clock time does not. (min: 0)
+                                                                  
+      [BACKEND OPTIONS]  These options control region information captured w/o sampling or instrumentation 
+                                                                  
+      -I, --include [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
+                                    Include data from these backends (count: unlimited) 
+      -E, --exclude [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
+                                    Exclude data from these backends (count: unlimited) 
+                                                                  
+      [HARDWARE COUNTER OPTIONS] See also: omnitrace-avail -H  
+                                                                  
+      -C, --cpu-events               Set the CPU hardware counter events to record (ref: `omnitrace-avail -H -c CPU`) (count: unlimited) 
+      -G, --gpu-events               Set the GPU hardware counter events to record (ref: `omnitrace-avail -H -c GPU`) (count: unlimited) 
+                                                                  
+      [MISCELLANEOUS OPTIONS]                               
+                                                                  
+      -i, --inlines                  Include inline info in output when available (max: 1, dtype: bool) 
+      --hsa-interrupt [ 0 | 1 ]      Set the value of the HSA_ENABLE_INTERRUPT environment variable.
+                                       ROCm version 5.2 and older have a bug which will cause a deadlock if a sample is taken while waiting for the signal
+                                       that a kernel completed -- which happens when sampling with a real-clock timer. We require this option to be set to
+                                       when --realtime is specified to make users aware that, while this may fix the bug, it can have a negative impact on
+                                       performance.
+                                       Values:
+                                          0     avoid triggering the bug, potentially at the cost of reduced performance
+                                          1     do not modify how ROCm is notified about kernel completion (count: 1, dtype: int)
+
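The help text above describes two numeric conventions: a sampling frequency of 100 interrupts per second means one sample every 10 milliseconds, and a thread ID list such as ``'0 2-4'`` selects the main thread plus the second, third, and fourth child threads. The following Python sketch only illustrates that arithmetic. The function names ``sampling_interval_ms`` and ``expand_thread_ids`` are hypothetical and are not part of Omnitrace.

```python
from __future__ import annotations


def sampling_interval_ms(freq_hz: float) -> float:
    """Return the time between samples, in milliseconds, for a frequency in Hz."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    return 1000.0 / freq_hz


def expand_thread_ids(spec: str) -> list[int]:
    """Expand whitespace-separated indexes and inclusive ranges, e.g. '0 2-4'."""
    ids: set[int] = set()
    for token in spec.split():
        if "-" in token:
            # A range such as "2-4" includes both endpoints.
            first, last = (int(part) for part in token.split("-", 1))
            ids.update(range(first, last + 1))
        else:
            ids.add(int(token))
    return sorted(ids)


print(sampling_interval_ms(100))   # 10.0
print(expand_thread_ids("0 2-4"))  # [0, 2, 3, 4]
```

In the second call, thread ``1`` (the first child thread) is absent from the result, which matches the interpretation given in the help text.
+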
+Omnitrace separates its own command-line arguments from those of the
+target application with a standalone double hyphen (``--``),
+consistent with the LLVM style.
+All arguments preceding the double hyphen
+are interpreted as belonging to Omnitrace, and all arguments following it
+are interpreted as the
+application and its arguments. The double hyphen is only necessary when the
+target application is passed command-line arguments
+that also use hyphens. For example, you can run ``omnitrace-sample ls``, but
+to run ``ls -la``, use ``omnitrace-sample -- ls -la``.
+
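The double-hyphen convention can be sketched in a few lines of Python. This is an illustration of the convention only, not the parser that ``omnitrace-sample`` uses. The function name ``split_command_line`` is hypothetical, and treating the whole argument list as the target command when no ``--`` is present is a simplifying assumption.

```python
def split_command_line(argv):
    """Split arguments at the first standalone '--'.

    Returns a (tool_args, target_command) pair. When no separator is
    present, this sketch assumes every argument belongs to the target.
    """
    if "--" in argv:
        index = argv.index("--")
        return argv[:index], argv[index + 1:]
    return [], list(argv)


# Everything before '--' belongs to the tool, everything after to the target.
print(split_command_line(["-P", "-T", "--", "ls", "-la"]))
# No separator: the whole list is treated as the target command.
print(split_command_line(["ls"]))
```

Only the first ``--`` acts as the separator in this sketch, so a target application that itself accepts ``--`` still receives it unchanged.
+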
+:doc:`Configuring the Omnitrace runtime options <./configuring-runtime-options>`
+establishes that environment variable values take precedence over values specified
+in the configuration files. This lets
+you configure your preferred default behavior for the Omnitrace runtime
+in a file such as ``~/.omnitrace.cfg`` and then easily override
+those settings on the command line, for example, ``OMNITRACE_ENABLED=OFF omnitrace-sample -- foo``.
+Similarly, the command-line arguments passed to ``omnitrace-sample`` take precedence
+over environment variables.
+
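The precedence order described above (configuration file, then environment variables, then command-line arguments) can be modeled as a layered lookup. The following Python sketch is a conceptual model under that assumption, not the Omnitrace implementation. The function name ``resolve_settings`` and the sample values are hypothetical.

```python
from collections import ChainMap


def resolve_settings(config_file, environment, command_line):
    """Merge three setting layers; earlier ChainMap layers win on conflicts."""
    # Lookup order: command line first, then environment, then config file.
    return dict(ChainMap(command_line, environment, config_file))


config_file = {"OMNITRACE_ENABLED": "ON", "OMNITRACE_TRACE": "OFF"}
environment = {"OMNITRACE_ENABLED": "OFF"}
command_line = {"OMNITRACE_TRACE": "ON"}

settings = resolve_settings(config_file, environment, command_line)
print(settings["OMNITRACE_ENABLED"])  # OFF (environment overrides the file)
print(settings["OMNITRACE_TRACE"])    # ON (command line overrides the file)
```

A setting that appears only in the configuration file keeps its file value, because no higher-priority layer overrides it.
+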
+Each of the command-line options above corresponds to one or more configuration
+settings. For example, ``--cpu-events`` corresponds to the ``OMNITRACE_PAPI_EVENTS`` configuration variable.
+``omnitrace-sample`` processes the arguments and outputs a summary of its configuration
+before running the target application.
+
+The following snippets show the environment updates that ``omnitrace-sample`` makes for different sets of arguments.
+
+*  This snippet shows the environment updates when ``omnitrace-sample`` is invoked without any Omnitrace options:
+
+   .. code-block:: shell
+
+      $ omnitrace-sample -- ./parallel-overhead-locks 30 4 100
+
+      HSA_TOOLS_LIB=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
+      HSA_TOOLS_REPORT_LOAD_FAILURE=1
+      LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
+      OMNITRACE_USE_PROCESS_SAMPLING=false
+      OMNITRACE_USE_SAMPLING=true
+      OMP_TOOL_LIBRARIES=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
+      ROCP_TOOL_LIB=/opt/omnitrace/lib/libomnitrace.so.1.7.1
+
+*  The next snippet shows the environment updates when ``omnitrace-sample`` enables 
+   profiling, tracing, host process-sampling, device process-sampling, and all the available backends:
+
+   .. code-block:: shell
+
+      $ omnitrace-sample -PTDH -I all -- ./parallel-overhead-locks 30 4 100
+
+      HSA_TOOLS_LIB=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
+      HSA_TOOLS_REPORT_LOAD_FAILURE=1
+      KOKKOS_PROFILE_LIBRARY=/opt/omnitrace/lib/libomnitrace.so.1.7.1
+      LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
+      OMNITRACE_CPU_FREQ_ENABLED=true
+      OMNITRACE_TRACE_THREAD_LOCKS=true
+      OMNITRACE_TRACE_THREAD_RW_LOCKS=true
+      OMNITRACE_TRACE_THREAD_SPIN_LOCKS=true
+      OMNITRACE_USE_KOKKOSP=true
+      OMNITRACE_USE_MPIP=true
+      OMNITRACE_USE_OMPT=true
+      OMNITRACE_TRACE=true
+      OMNITRACE_USE_PROCESS_SAMPLING=true
+      OMNITRACE_USE_RCCLP=true
+      OMNITRACE_USE_ROCM_SMI=true
+      OMNITRACE_USE_ROCPROFILER=true
+      OMNITRACE_USE_ROCTRACER=true
+      OMNITRACE_USE_ROCTX=true
+      OMNITRACE_USE_SAMPLING=true
+      OMNITRACE_PROFILE=true
+      OMP_TOOL_LIBRARIES=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
+      ROCP_TOOL_LIB=/opt/omnitrace/lib/libomnitrace.so.1.7.1
+      ...
+
+*  The final snippet shows the environment updates when ``omnitrace-sample`` enables 
+   profiling, tracing, host process-sampling, and device process-sampling,
+   sets the output path to ``omnitrace-output`` and the output prefix to ``%tag%``, and disables 
+   all the available backends:
+
+   .. code-block:: shell
+
+      $ omnitrace-sample -PTDH -E all -o omnitrace-output %tag% -- ./parallel-overhead-locks 30 4 100
+
+      LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
+      OMNITRACE_CPU_FREQ_ENABLED=true
+      OMNITRACE_OUTPUT_PATH=omnitrace-output
+      OMNITRACE_OUTPUT_PREFIX=%tag%
+      OMNITRACE_TRACE_THREAD_LOCKS=false
+      OMNITRACE_TRACE_THREAD_RW_LOCKS=false
+      OMNITRACE_TRACE_THREAD_SPIN_LOCKS=false
+      OMNITRACE_USE_KOKKOSP=false
+      OMNITRACE_USE_MPIP=false
+      OMNITRACE_USE_OMPT=false
+      OMNITRACE_TRACE=true
+      OMNITRACE_USE_PROCESS_SAMPLING=true
+      OMNITRACE_USE_RCCLP=false
+      OMNITRACE_USE_ROCM_SMI=false
+      OMNITRACE_USE_ROCPROFILER=false
+      OMNITRACE_USE_ROCTRACER=false
+      OMNITRACE_USE_ROCTX=false
+      OMNITRACE_USE_SAMPLING=true
+      OMNITRACE_PROFILE=true
+      ...
+
+An omnitrace-sample example
+========================================
+
+Here is the full output from the previous command, with the ``-c`` option added
+and no configuration file specified:
+
+.. code-block:: shell
+
+   $ omnitrace-sample -PTDH -E all -o omnitrace-output %tag% -c -- ./parallel-overhead-locks 30 4 100
+
+   LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.11.3
+   OMNITRACE_CONFIG_FILE=
+   OMNITRACE_CPU_FREQ_ENABLED=true
+   OMNITRACE_OUTPUT_PATH=omnitrace-output
+   OMNITRACE_OUTPUT_PREFIX=%tag%
+   OMNITRACE_PROFILE=true
+   OMNITRACE_TRACE=true
+   OMNITRACE_TRACE_THREAD_LOCKS=false
+   OMNITRACE_TRACE_THREAD_RW_LOCKS=false
+   OMNITRACE_TRACE_THREAD_SPIN_LOCKS=false
+   OMNITRACE_USE_KOKKOSP=false
+   OMNITRACE_USE_MPIP=false
+   OMNITRACE_USE_OMPT=false
+   OMNITRACE_USE_PROCESS_SAMPLING=true
+   OMNITRACE_USE_RCCLP=false
+   OMNITRACE_USE_ROCM_SMI=false
+   OMNITRACE_USE_ROCPROFILER=false
+   OMNITRACE_USE_ROCTRACER=false
+   OMNITRACE_USE_ROCTX=false
+   OMNITRACE_USE_SAMPLING=true
+   [omnitrace][dl][1785877] omnitrace_main
+   [omnitrace][1785877][omnitrace_init_tooling] Instrumentation mode: Sampling
+       ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
+      /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
+     |  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
+     |  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
+     |  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
+      \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|
+      omnitrace v1.11.2 (rev: 2586b74db8bf335742600010b8d9f1ce8da9cf89, compiler: GNU v11.4.1, rocm: v6.1.x)
+   [988.958]       perfetto.cc:58649 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
+   [parallel-overhead-locks] Threads: 4
+   [parallel-overhead-locks] Iterations: 100
+   [parallel-overhead-locks] fibonacci(30)...
+   [1] number of iterations: 100
+   [2] number of iterations: 100
+   [3] number of iterations: 100
+   [4] number of iterations: 100
+   [parallel-overhead-locks] fibonacci(30) x 4 = 409221992
+   [parallel-overhead-locks] number of mutex locks = 400
+   [omnitrace][1785877][0][omnitrace_finalize] finalizing...
+   [omnitrace][1785877][0][omnitrace_finalize] 
+   [omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877 : 0.294342 sec wall_clock,    4.776 MB peak_rss,    3.170 MB page_rss, 0.990000 sec cpu_clock,  336.3 % cpu_util [laps: 1]
+   [omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/0 : 0.291535 sec wall_clock, 0.002619 sec thread_cpu_clock,    0.9 % thread_cpu_util,    4.776 MB peak_rss [laps: 1]
+   [omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/1 : 0.271353 sec wall_clock, 0.222572 sec thread_cpu_clock,   82.0 % thread_cpu_util,    4.200 MB peak_rss [laps: 1]
+   [omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/2 : 0.238218 sec wall_clock, 0.206405 sec thread_cpu_clock,   86.6 % thread_cpu_util,    3.432 MB peak_rss [laps: 1]
+   [omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/3 : 0.209459 sec wall_clock, 0.193415 sec thread_cpu_clock,   92.3 % thread_cpu_util,    2.472 MB peak_rss [laps: 1]
+   [omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/4 : 0.212029 sec wall_clock, 0.211694 sec thread_cpu_clock,   99.8 % thread_cpu_util,    1.152 MB peak_rss [laps: 1]
+   [omnitrace][1785877][0][omnitrace_finalize] 
+   [omnitrace][1785877][0][omnitrace_finalize] Finalizing perfetto...
+   [omnitrace][1785877][perfetto]> Outputting '/home/user/code/omnitrace/build-release/omnitrace-output/2024-07-15_16.21/parallel-overhead-locksperfetto-trace-1785877.proto' (39.12 KB / 0.04 MB / 0.00 GB)... Done
+   [omnitrace][1785877][wall_clock]> Outputting 'omnitrace-output/2024-07-15_16.21/parallel-overhead-lockswall_clock-1785877.json'
+   [omnitrace][1785877][wall_clock]> Outputting 'omnitrace-output/2024-07-15_16.21/parallel-overhead-lockswall_clock-1785877.txt'
+   [omnitrace][1785877][metadata]> Outputting 'omnitrace-output/2024-07-15_16.21/parallel-overhead-locksmetadata-1785877.json' and 'omnitrace-output/2024-07-15_16.21/parallel-overhead-locksfunctions-1785877.json'
+   [omnitrace][1785877][0][omnitrace_finalize] Finalized: 0.054582 sec wall_clock,    0.000 MB peak_rss,   -1.798 MB page_rss, 0.040000 sec cpu_clock,   73.3 % cpu_util
+   [989.312]       perfetto.cc:60128 Tracing session 1 ended, total sessions:0
@@ -0,0 +1,938 @@
+.. meta::
+   :description: Omnitrace documentation and reference
+   :keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
+
+****************************************************
+Understanding the Omnitrace output
+****************************************************
+
+The general output form of `Omnitrace <https://github.com/ROCm/omnitrace>`_ is
+``<OUTPUT_PATH>[/<TIMESTAMP>]/[<PREFIX>]<DATA_NAME>[-<OUTPUT_SUFFIX>].<EXT>``.
+
+For example, starting with the following base configuration:
+
+.. code-block:: shell
+
+   export OMNITRACE_OUTPUT_PATH=omnitrace-example-output
+   export OMNITRACE_TIME_OUTPUT=OFF
+   export OMNITRACE_USE_PID=OFF
+   export OMNITRACE_PROFILE=ON
+   export OMNITRACE_TRACE=ON
+
+Running an application with this configuration generates output similar to the following:
+
+.. code-block:: shell
+
+   $ omnitrace-instrument -- ./foo
+   ...
+   [omnitrace] Outputting 'omnitrace-example-output/perfetto-trace.proto'...
+
+   [omnitrace] Outputting 'omnitrace-example-output/wall-clock.txt'...
+   [omnitrace] Outputting 'omnitrace-example-output/wall-clock.json'...
+
+If the ``OMNITRACE_USE_PID`` option is enabled, then running a non-MPI executable 
+with a PID of ``63453`` results in the following output:
+
+.. code-block:: shell
+
+   $ export OMNITRACE_USE_PID=ON
+   $ omnitrace-instrument -- ./foo
+   ...
+   [omnitrace] Outputting 'omnitrace-example-output/perfetto-trace-63453.proto'...
+
+   [omnitrace] Outputting 'omnitrace-example-output/wall-clock-63453.txt'...
+   [omnitrace] Outputting 'omnitrace-example-output/wall-clock-63453.json'...
+
+If ``OMNITRACE_TIME_OUTPUT`` is enabled, then a job that started on January 31, 2022 at 12:30 PM
+generates the following:
+
+.. code-block:: shell
+
+   $ export OMNITRACE_TIME_OUTPUT=ON
+   $ omnitrace-instrument -- ./foo
+   ...
+   [omnitrace] Outputting 'omnitrace-example-output/2022-01-31_12.30_PM/perfetto-trace-63453.proto'...
+
+   [omnitrace] Outputting 'omnitrace-example-output/2022-01-31_12.30_PM/wall-clock-63453.txt'...
+   [omnitrace] Outputting 'omnitrace-example-output/2022-01-31_12.30_PM/wall-clock-63453.json'...
+
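The naming pattern ``<OUTPUT_PATH>[/<TIMESTAMP>]/[<PREFIX>]<DATA_NAME>[-<OUTPUT_SUFFIX>].<EXT>`` can be expressed as a small Python helper. This sketch only reproduces the pattern and the example file names shown above. The function name ``output_filename`` and its parameters are hypothetical, and Omnitrace assembles these paths internally.

```python
import posixpath


def output_filename(output_path, data_name, ext, timestamp="", prefix="", suffix=""):
    """Build a file name following the general Omnitrace output form."""
    # The timestamp directory is optional.
    directory = posixpath.join(output_path, timestamp) if timestamp else output_path
    name = f"{prefix}{data_name}"
    # The suffix (for example, a PID) is optional and joined with a hyphen.
    if suffix:
        name += f"-{suffix}"
    return posixpath.join(directory, f"{name}.{ext}")


print(output_filename("omnitrace-example-output", "perfetto-trace", "proto"))
print(output_filename("omnitrace-example-output", "wall-clock", "txt",
                      timestamp="2022-01-31_12.30_PM", suffix="63453"))
```

The first call corresponds to the base configuration, and the second corresponds to the case where both ``OMNITRACE_USE_PID`` and ``OMNITRACE_TIME_OUTPUT`` are enabled.
+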
+Metadata
+========================================
+
+Omnitrace outputs a ``metadata.json`` file. This file contains
+information about the settings, environment variables, output files,
+the system, and the run, including the following:
+
+* Hardware cache sizes
+* Physical CPUs
+* Hardware concurrency
+* CPU model, frequency, vendor, and features
+* Launch date and time
+* Memory maps (for example, shared libraries)
+* Output files
+* Environment variables
+* Configuration settings
+
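Because the metadata file is plain JSON, you can inspect it with a short script. The following Python sketch assumes the ``omnitrace`` / ``metadata`` / ``info`` nesting and the key names shown in the sample in the next section. The function names ``load_metadata`` and ``summarize_info`` are hypothetical, and the actual file name depends on your output settings.

```python
import json


def load_metadata(path):
    """Read a metadata JSON file and return the parsed document."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_info(document):
    """Pick a few system fields out of the 'info' block of the metadata."""
    info = document["omnitrace"]["metadata"]["info"]
    return {
        "cpu_model": info.get("CPU_MODEL"),
        "physical_cpus": info.get("HW_PHYSICAL_CPU"),
        "hardware_concurrency": info.get("HW_CONCURRENCY"),
        "launch": f'{info.get("LAUNCH_DATE")} {info.get("LAUNCH_TIME")}',
    }


# Example usage (the path is a placeholder for your own output file):
# print(summarize_info(load_metadata("omnitrace-example-output/metadata.json")))
```

Using ``dict.get`` means a field that is missing from the file is reported as ``None`` instead of raising an error.
+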
+Metadata JSON sample
+-----------------------------------------------------------------------
+
+.. code-block:: json
+
+   {
+      "omnitrace": {
+         "metadata": {
+               "info": {
+                  "HW_L1_CACHE_SIZE": 32768,
+                  "HW_L2_CACHE_SIZE": 524288,
+                  "HW_L3_CACHE_SIZE": 16777216,
+                  "HW_PHYSICAL_CPU": 12,
+                  "HW_CONCURRENCY": 24,
+                  "LAUNCH_TIME": "02:04",
+                  "LAUNCH_DATE": "05/08/22",
+                  "TIMEMORY_GIT_REVISION": "52e7034fd419ff296506cdef43084f6071dbaba1",
+                  "TIMEMORY_VERSION": "3.3.0rc4",
+                  "TIMEMORY_API": "tim::project::timemory",
+                  "TIMEMORY_GIT_DESCRIBE": "v3.2.0-263-g52e7034f",
+                  "PWD": "/home/jrmadsen/devel/c++/AARInternal/hosttrace-dyninst/build-vscode",
+                  "USER": "jrmadsen",
+                  "HOME": "/home/jrmadsen",
+                  "SHELL": "/bin/bash",
+                  "CPU_MODEL": "AMD Ryzen Threadripper PRO 3945WX 12-Cores",
+                  "CPU_FREQUENCY": 2400,
+                  "CPU_VENDOR": "AuthenticAMD",
+                  "CPU_FEATURES": [
+                     "fpu",
+                     "msr",
+                     "sse",
+                     "sse2",
+                     "constant_tsc",
+                     "ssse3",
+                     "fma",
+                     "sse4_1",
+                     "sse4_2",
+                     "popcnt",
+                     "avx2",
+                     "... etc. ..."
+                  ],
+                  "memory_maps": [
+                     {
+                           "end_address": "7f4013797000",
+                           "start_address": "7f4012e58000",
+                           "pathname": "/opt/rocm-5.0.0/hip/lib/libamdhip64.so.5.0.50000",
+                           "offset": "34a000",
+                           "device": "103:05",
+                           "inode": 4331165,
+                           "permissions": "rw-p"
+                     },
+                     {
+                           "end_address": "7f4013902000",
+                           "start_address": "7f4013901000",
+                           "pathname": "/usr/lib/x86_64-linux-gnu/libm-2.31.so",
+                           "offset": "14d000",
+                           "device": "103:05",
+                           "inode": 42078854,
+                           "permissions": "rwxp"
+                     },
+                     {
+                           "end_address": "7f4013919000",
+                           "start_address": "7f4013908000",
+                           "pathname": "/usr/lib/x86_64-linux-gnu/libpthread-2.31.so",
+                           "offset": "6000",
+                           "device": "103:05",
+                           "inode": 42078874,
+                           "permissions": "r-xp"
+                     },
+                     {
+                           "...": "etc."
+                     }
+                  ],
+                  "memory_maps_files": [
+                     "/opt/rocm-5.0.0/hip/lib/libamdhip64.so.5.0.50000",
+                     "/opt/rocm-5.0.0/hsa-amd-aqlprofile/lib/libhsa-amd-aqlprofile64.so.1.0.50000",
+                     "/opt/rocm-5.0.0/lib/libamd_comgr.so.2.4.50000",
+                     "/opt/rocm-5.0.0/lib/libhsa-runtime64.so.1.5.50000",
+                     "/opt/rocm-5.0.0/rocm_smi/lib/librocm_smi64.so.5.0.50000",
+                     "/opt/rocm-5.0.0/roctracer/lib/libroctracer64.so.1.0.50000",
+                     "/usr/lib/x86_64-linux-gnu/ld-2.31.so",
+                     "/usr/lib/x86_64-linux-gnu/libc-2.31.so",
+                     "/usr/lib/x86_64-linux-gnu/libdl-2.31.so",
+                     "... etc. ..."
+                  ]
+               },
+               "output": {
+                  "text": [
+                     {
+                           "value": [
+                              "omnitrace-tests-output/parallel-overhead-binary-rewrite/roctracer.txt"
+                           ],
+                           "key": "roctracer"
+                     },
+                     {
+                           "value": [
+                              "omnitrace-tests-output/parallel-overhead-binary-rewrite/wall_clock.txt"
+                           ],
+                           "key": "wall_clock"
+                     }
+                  ],
+                  "json": [
+                     {
+                           "value": [
+                              "omnitrace-tests-output/parallel-overhead-binary-rewrite/roctracer.json",
+                              "omnitrace-tests-output/parallel-overhead-binary-rewrite/roctracer.tree.json"
+                           ],
+                           "key": "roctracer"
+                     },
+                     {
+                           "value": [
+                              "omnitrace-tests-output/parallel-overhead-binary-rewrite/wall_clock.json",
+                              "omnitrace-tests-output/parallel-overhead-binary-rewrite/wall_clock.tree.json"
+                           ],
+                           "key": "wall_clock"
+                     }
+                  ]
+               },
+               "environment": [
+                  {
+                     "value": "/home/jrmadsen",
+                     "key": "HOME"
+                  },
+                  {
+                     "value": "/bin/bash",
+                     "key": "SHELL"
+                  },
+                  {
+                     "value": "jrmadsen",
+                     "key": "USER"
+                  },
+                  {
+                     "value": "true",
+                     "key": "... etc. ..."
+                  }
+               ],
+               "settings": {
+                  "OMNITRACE_JSON_OUTPUT": {
+                     "count": -1,
+                     "environ_updated": false,
+                     "name": "json_output",
+                     "data_type": "bool",
+                     "initial": true,
+                     "enabled": true,
+                     "value": true,
+                     "max_count": 1,
+                     "cmdline": [
+                           "--omnitrace-json-output"
+                     ],
+                     "environ": "OMNITRACE_JSON_OUTPUT",
+                     "config_updated": false,
+                     "categories": [
+                           "io",
+                           "json",
+                           "native"
+                     ],
+                     "description": "Write json output files"
+                  },
+                  "... etc. ...": {
+                     "etc.": true
+                  }
+               }
+         }
+      }
+   }
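+
+The layout above can be read with any JSON library. The following is a minimal
+Python sketch, not part of Omnitrace: the ``summarize_metadata`` helper and the
+shortened ``out/...`` paths are hypothetical, and the embedded sample is a
+trimmed stand-in that reuses only field names shown in the sample above.
+

```python
import json

# Trimmed stand-in for metadata.json; field names follow the sample above,
# the "out/..." paths are hypothetical placeholders.
SAMPLE = """
{
  "omnitrace": {
    "metadata": {
      "info": {
        "HW_PHYSICAL_CPU": 12,
        "HW_CONCURRENCY": 24,
        "CPU_MODEL": "AMD Ryzen Threadripper PRO 3945WX 12-Cores"
      },
      "output": {
        "json": [
          {"key": "wall_clock",
           "value": ["out/wall_clock.json", "out/wall_clock.tree.json"]}
        ]
      },
      "environment": [
        {"key": "SHELL", "value": "/bin/bash"}
      ]
    }
  }
}
"""


def summarize_metadata(doc):
    """Collect a few fields from a parsed metadata document."""
    meta = doc["omnitrace"]["metadata"]
    info = meta.get("info", {})
    # "output" maps a format name to a list of {"key": ..., "value": [...]} entries.
    outputs = {
        fmt: {entry["key"]: entry["value"] for entry in entries}
        for fmt, entries in meta.get("output", {}).items()
    }
    # "environment" is a list of {"key": ..., "value": ...} pairs.
    environment = {e["key"]: e["value"] for e in meta.get("environment", [])}
    return {
        "cpu_model": info.get("CPU_MODEL"),
        "physical_cpus": info.get("HW_PHYSICAL_CPU"),
        "hardware_threads": info.get("HW_CONCURRENCY"),
        "outputs": outputs,
        "environment": environment,
    }


summary = summarize_metadata(json.loads(SAMPLE))
print(summary["cpu_model"])
print(summary["outputs"]["json"]["wall_clock"])
```

+
+For a real run, replace ``json.loads(SAMPLE)`` with ``json.load`` on the opened
+``metadata.json`` file. Its location depends on the output settings described
+in the next section.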
+
+Configuring the Omnitrace output
+========================================
+
+Omnitrace includes a core set of options for controlling the format 
+and contents of the output files. For additional information, see the guide on
+:doc:`configuring runtime options <./configuring-runtime-options>`.
+
+Core configuration settings
+-----------------------------------
+
+.. csv-table:: 
+   :header: "Setting", "Value", "Description"
+   :widths: 30, 30, 100
+
+   "``OMNITRACE_OUTPUT_PATH``", "Any valid path", "Path to folder where output files should be placed"
+   "``OMNITRACE_OUTPUT_PREFIX``", "String", "Useful for multiple runs with different arguments. See the next section on output prefix keys."
+   "``OMNITRACE_OUTPUT_FILE``", "Any valid filepath", "Specific location for the Perfetto output file"
+   "``OMNITRACE_TIME_OUTPUT``", "Boolean", "Place all output in a timestamped folder, timestamp format controlled via ``OMNITRACE_TIME_FORMAT``"
+   "``OMNITRACE_TIME_FORMAT``", "String", "See ``strftime`` man pages for valid identifiers"
+   "``OMNITRACE_USE_PID``", "Boolean", "Append either the PID or the MPI rank to all output files (before the extension)"
+
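+To illustrate how these settings combine, here is a minimal Python sketch that
+rebuilds a file name like the ones in the ``OMNITRACE_TIME_OUTPUT`` example
+earlier on this page. It is an illustration only, not Omnitrace's
+implementation: the ``sketch_output_path`` helper, the placement of the prefix
+directly before the base name, and the ``time_format`` default (chosen to match
+the folder name in that example) are all assumptions.
+

```python
import os
from datetime import datetime
from pathlib import PurePosixPath


def sketch_output_path(output_path, name, ext, prefix="", time_output=False,
                       time_format="%Y-%m-%d_%I.%M_%p", launch=None,
                       use_pid=False, pid=None):
    """Illustrative only: mimic how the settings appear to shape a file name."""
    # OMNITRACE_OUTPUT_PATH: folder where output files are placed.
    folder = PurePosixPath(output_path)
    if time_output:
        # OMNITRACE_TIME_OUTPUT: add a timestamped subfolder whose name is
        # produced by a strftime format (OMNITRACE_TIME_FORMAT).
        stamp = (launch or datetime.now()).strftime(time_format)
        folder = folder / stamp
    # OMNITRACE_OUTPUT_PREFIX: assumed here to go directly before the base name.
    stem = prefix + name
    if use_pid:
        # OMNITRACE_USE_PID: append the PID (or MPI rank) before the extension.
        stem = f"{stem}-{os.getpid() if pid is None else pid}"
    return str(folder / f"{stem}.{ext}")


path = sketch_output_path(
    "omnitrace-example-output", "wall-clock", "txt",
    time_output=True, launch=datetime(2022, 1, 31, 12, 30),
    use_pid=True, pid=63453,
)
print(path)
```

+
+With these arguments, the sketch yields the same ``wall-clock-63453.txt`` path
+as the timestamped example, which makes it a quick way to reason about where a
+given combination of settings places a file.
+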
+Output prefix keys
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Output prefix keys have many uses but are most helpful when dealing with multiple
+profiling runs or large MPI jobs.
+They are available in Omnitrace because they were originally added to Timemory
+for `compile-time-perf <https://github.com/jrmadsen/compile-time-perf>`_,
+which needed a generic wrapper around compilation commands to write a different
+output file for each source file while still
+overwriting the output from the last time that file was compiled.
+
+When doing scaling studies and specifying options via the command line, 
+the recommended process is to
+use a common ``OMNITRACE_OUTPUT_PATH``, disable ``OMNITRACE_TIME_OUTPUT``,
+set ``OMNITRACE_OUTPUT_PREFIX="%argt%-"``, and let Omnitrace cleanly organize the output.
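+
+A minimal Python sketch of this setup follows. The ``scaling_study_env`` helper,
+the ``omnitrace-scaling-output`` folder name, and the problem sizes passed to
+``./foo`` are hypothetical. The launch command mirrors the
+``omnitrace-instrument -- ./foo`` example earlier on this page and is left
+commented out so the sketch runs without Omnitrace installed.
+

```python
import os


def scaling_study_env(output_path, base_env=None):
    """Return an environment with the recommended scaling-study settings."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "OMNITRACE_OUTPUT_PATH": output_path,   # common folder for all runs
        "OMNITRACE_TIME_OUTPUT": "OFF",         # no timestamped subfolders
        "OMNITRACE_OUTPUT_PREFIX": "%argt%-",   # prefix derived from the command line
    })
    return env


env = scaling_study_env("omnitrace-scaling-output", base_env={})
for size in (100, 200):  # hypothetical problem sizes
    cmd = ["omnitrace-instrument", "--", "./foo", str(size)]
    print(" ".join(cmd))
    # To launch: import subprocess; subprocess.run(cmd, env=env, check=True)
```

+
+Because each run has different command-line arguments, the ``%argt%`` prefix
+differs per run, so the runs can share one output folder.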
+
+.. csv-table:: 
+   :header: "String", "Encoding"
+   :widths: 20, 120
+
+   "``%argv%``", "Entire command-line condensed into a single string"
+   "``%argt%``", "Similar to ``%argv%``, except that only the basename of the first command-line argument is used"
+   "``%args%``", "All command line arguments condensed into a single string"
+   "``%tag%``", "Basename of first command line argument"
+   "``%arg<N>%``", "Command line argument at position ``<N>`` (zero indexed), e.g. ``%arg0%`` for first argument"
+   "``%argv_hash%``", "MD5 sum of ``%argv%``"
+   "``%argt_hash%``", "MD5 sum of ``%argt%``"
+   "``%args_hash%``", "MD5 sum of ``%args%``"
+   "``%tag_hash%``", "MD5 sum of ``%tag%``"
+   "``%arg<N>_hash%``", "MD5 sum of ``%arg<N>%``"
+   "``%pid%``", "Process identifier (i.e. ``getpid()``)"
+   "``%ppid%``", "Parent process identifier (i.e. ``getppid()``)"
+   "``%pgid%``", "Process group identifier (i.e. ``getpgid(getpid())``)"
+   "``%psid%``", "Process session identifier  (i.e. ``getsid(getpid())``)"
+   "``%psize%``", "Number of sibling processes (from reading ``/proc/<PPID>/task/<PPID>/children``)"
+   "``%job%``", "Value of ``SLURM_JOB_ID`` environment variable if it exists, else ``0``"
+   "``%rank%``", "Value of ``SLURM_PROCID`` environment variable if it exists, else ``MPI_Comm_rank`` (or ``0`` if non-MPI)"
+   "``%size%``", "``MPI_Comm_size`` or ``1`` if non-MPI"
+   "``%nid%``", "``%rank%`` if possible, otherwise ``%pid%``"
+   "``%launch_time%``", "Launch date and time (uses ``OMNITRACE_TIME_FORMAT``)"
+   "``%env{NAME}%``", "Value of environment variable ``NAME`` (i.e. ``getenv(NAME)``)"
+   "``%cfg{NAME}%``", "Value of configuration variable ``NAME`` (e.g. ``%cfg{OMNITRACE_SAMPLING_FREQ}%`` would resolve to sampling frequency)"
+   "``$env{NAME}``", "Alternative syntax to ``%env{NAME}%``"
+   "``$cfg{NAME}``", "Alternative syntax to ``%cfg{NAME}%``"
+   "``%m``", "Shorthand for ``%argt_hash%``"
+   "``%p``", "Shorthand for ``%pid%``"
+   "``%j``", "Shorthand for ``%job%``"
+   "``%r``", "Shorthand for ``%rank%``"
+   "``%s``", "Shorthand for ``%size%``"
+
+.. note::
+
+   In any output prefix key which contains a ``/`` character, the ``/`` characters
+   are replaced with ``_`` and any leading underscores are stripped. For example, 
+   an ``%arg0%`` of ``/usr/bin/foo`` translates to ``usr_bin_foo``. Additionally, any ``%arg<N>%`` keys which 
+   do not have a command line argument at position ``<N>`` are ignored.
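+
+The substitution rules in the note can be sketched in a few lines of Python.
+This is an illustration, not Timemory's implementation: the ``sanitize`` and
+``expand_arg_keys`` helpers are hypothetical, only ``%tag%`` and ``%arg<N>%``
+are handled, and "ignored" is assumed to mean that an unmatched ``%arg<N>%``
+key is left in the string unchanged.
+

```python
import os
import re


def sanitize(value):
    """Replace '/' with '_' and strip leading underscores, as the note describes."""
    return value.replace("/", "_").lstrip("_")


def expand_arg_keys(prefix, argv):
    """Expand %tag% and %arg<N>% in `prefix` for the command line `argv`."""
    def replace_arg(match):
        index = int(match.group(1))
        if index >= len(argv):
            # ASSUMPTION: "ignored" is modeled as leaving the key text unchanged.
            return match.group(0)
        return sanitize(argv[index])

    expanded = re.sub(r"%arg(\d+)%", replace_arg, prefix)
    # %tag% is the basename of the first command-line argument.
    return expanded.replace("%tag%", sanitize(os.path.basename(argv[0])))


print(sanitize("/usr/bin/foo"))
print(expand_arg_keys("%tag%-%arg1%-", ["/usr/bin/foo", "data/in.txt"]))
```

+
+The first call reproduces the ``usr_bin_foo`` example from the note. The second
+shows a prefix built from the command name and its first argument.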
+
+Perfetto output
+========================================
+
+Use the ``OMNITRACE_OUTPUT_FILE`` setting to specify a location for the Perfetto output file. If this is an
+absolute path, then ``OMNITRACE_OUTPUT_PATH`` and similar
+settings are ignored. To view the trace, visit `ui.perfetto.dev <https://ui.perfetto.dev>`_ and open this file.
+
+.. image:: ../data/omnitrace-perfetto.png
+   :alt: Visualization of a performance graph in Perfetto
+
+.. image:: ../data/omnitrace-rocm.png
+   :alt: Visualization of ROCm data in Perfetto
+
+.. image:: ../data/omnitrace-rocm-flow.png
+   :alt: Visualization of ROCm flow data in Perfetto
+
+.. image:: ../data/omnitrace-user-api.png
+   :alt: Visualization of ROCm API calls in Perfetto
+
+Timemory output
+========================================
+
+Use ``omnitrace-avail --components --filename`` to view the base filename for each component, as follows:
+
+.. code-block:: shell
+
+   $ omnitrace-avail wall_clock -C -f
+   |---------------------------------|---------------|------------------------|
+   |            COMPONENT            |   AVAILABLE   |        FILENAME        |
+   |---------------------------------|---------------|------------------------|
+   | wall_clock                      |     true      | wall_clock             |
+   | sampling_wall_clock             |     true      | sampling_wall_clock    |
+   |---------------------------------|---------------|------------------------|
+
+The ``OMNITRACE_COLLAPSE_THREADS`` and ``OMNITRACE_COLLAPSE_PROCESSES`` settings are 
+only valid when full `MPI support is enabled <../install/install.html#mpi-support-within-omnitrace>`_. 
+When they are set, Timemory combines the per-thread and per-rank data (respectively) of 
+identical call stacks.
+
+The ``OMNITRACE_FLAT_PROFILE`` setting removes all call stack hierarchy. 
+Using ``OMNITRACE_FLAT_PROFILE=ON`` in combination
+with ``OMNITRACE_COLLAPSE_THREADS=ON`` is a useful configuration for identifying 
+min/max measurements regardless of the calling context.
+The ``OMNITRACE_TIMELINE_PROFILE`` setting (with ``OMNITRACE_FLAT_PROFILE=OFF``) effectively
+generates data similar to that found
+in Perfetto. Enabling both timeline and flat profiling effectively generates
+data similar to ``strace``. However, while Timemory generally
+requires significantly less memory than Perfetto, this is not the case in timeline
+mode, so use this setting with caution.
+
+Timemory text output
+-----------------------------------------------------------------------
+
+Timemory text output files are meant for human consumption (while JSON formats are for analysis),
+so some fields such as the ``LABEL`` might be truncated for readability.
+The truncation settings can be changed through the ``OMNITRACE_MAX_WIDTH`` setting.
+
+.. note::
+
+   The generation of text output is configurable via ``OMNITRACE_TEXT_OUTPUT``.
+
+.. _text-output-example-label:
+
+Timemory text output example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In the following example, the ``NN`` field in ``|NN>>>`` is the thread ID. If MPI support is enabled, 
+this becomes ``|MM|NN>>>`` where ``MM`` is the rank.
+If ``OMNITRACE_COLLAPSE_THREADS=ON`` and ``OMNITRACE_COLLAPSE_PROCESSES=ON`` are configured, 
+neither the ``MM`` nor the ``NN`` are present unless the
+component explicitly sets type traits. Type traits specify that the data is only 
+relevant per-thread or per-process, such as the ``thread_cpu_clock`` clock component.
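+
+Before looking at the sample output, here is a minimal Python sketch of how the
+label prefix described above could be split into rank, thread ID, and name.
+The regular expression and the ``parse_label`` helper are hypothetical, written
+from the ``|NN>>>`` and ``|MM|NN>>>`` forms described here, and are not part of
+Omnitrace or Timemory. The sketch assumes the IDs are decimal digits and treats
+a leading ``|_`` before the name as tree decoration.
+

```python
import re

# Optional "MM|" rank, then "NN>>>" thread ID, then an optional "|_" tree marker.
LABEL_RE = re.compile(
    r"^\|(?:(?P<rank>\d+)\|)?(?P<thread>\d+)>>>\s*(?:\|_)?(?P<name>.*)$"
)


def parse_label(label):
    """Split a LABEL cell into rank (or None), thread ID, and region name."""
    match = LABEL_RE.match(label.strip())
    if match is None:
        return None
    rank = match.group("rank")
    return {
        "rank": int(rank) if rank is not None else None,
        "thread": int(match.group("thread")),
        "name": match.group("name").strip(),
    }


print(parse_label("|00>>> main"))
print(parse_label("|11>>>       |_ompt_thread_worker"))
```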
+
+.. code-block:: shell
+
+   |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+   |                                                                       REAL-CLOCK TIMER (I.E. WALL-CLOCK TIMER)                                                                      |
+   |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+   |                            LABEL                             | COUNT  | DEPTH  |   METRIC   | UNITS  |   SUM     |   MEAN    |   MIN     |   MAX     |   VAR    | STDDEV   | % SELF |
+   |--------------------------------------------------------------|--------|--------|------------|--------|-----------|-----------|-----------|-----------|----------|----------|--------|
+   | |00>>> main                                                  |      1 |      0 | wall_clock | sec    | 13.360265 | 13.360265 | 13.360265 | 13.360265 | 0.000000 | 0.000000 |   18.2 |
+   | |00>>> |_ompt_thread_initial                                 |      1 |      1 | wall_clock | sec    | 10.924161 | 10.924161 | 10.924161 | 10.924161 | 0.000000 | 0.000000 |    0.0 |
+   | |00>>>   |_ompt_implicit_task                                |      1 |      2 | wall_clock | sec    | 10.923050 | 10.923050 | 10.923050 | 10.923050 | 0.000000 | 0.000000 |    0.1 |
+   | |00>>>     |_ompt_parallel [parallelism=12]                  |      1 |      3 | wall_clock | sec    | 10.915026 | 10.915026 | 10.915026 | 10.915026 | 0.000000 | 0.000000 |    0.0 |
+   | |00>>>       |_ompt_implicit_task                            |      1 |      4 | wall_clock | sec    | 10.647951 | 10.647951 | 10.647951 | 10.647951 | 0.000000 | 0.000000 |    0.0 |
+   | |00>>>         |_ompt_work_loop                              |    156 |      5 | wall_clock | sec    |  0.000812 |  0.000005 |  0.000001 |  0.000212 | 0.000000 | 0.000018 |  100.0 |
+   | |00>>>         |_ompt_work_single_executor                   |     40 |      5 | wall_clock | sec    |  0.000016 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |00>>>         |_ompt_sync_region_barrier_implicit           |    308 |      5 | wall_clock | sec    |  0.000629 |  0.000002 |  0.000001 |  0.000017 | 0.000000 | 0.000002 |  100.0 |
+   | |00>>>         |_conj_grad                                   |     76 |      5 | wall_clock | sec    | 10.641165 |  0.140015 |  0.131894 |  0.155099 | 0.000017 | 0.004080 |    1.0 |
+   | |00>>>           |_ompt_work_single_executor                 |    803 |      6 | wall_clock | sec    |  0.000292 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |00>>>           |_ompt_work_loop                            |   7904 |      6 | wall_clock | sec    |  7.420265 |  0.000939 |  0.000005 |  0.006974 | 0.000003 | 0.001613 |  100.0 |
+   | |00>>>           |_ompt_sync_region_barrier_implicit         |   6004 |      6 | wall_clock | sec    |  0.283160 |  0.000047 |  0.000001 |  0.004087 | 0.000000 | 0.000303 |  100.0 |
+   | |00>>>           |_ompt_sync_region_barrier_implementation   |   3952 |      6 | wall_clock | sec    |  2.829252 |  0.000716 |  0.000007 |  0.009005 | 0.000001 | 0.000985 |   99.7 |
+   | |00>>>             |_ompt_sync_region_reduction              |  15808 |      7 | wall_clock | sec    |  0.009142 |  0.000001 |  0.000000 |  0.000007 | 0.000000 | 0.000000 |  100.0 |
+   | |00>>>           |_ompt_work_single_other                    |   1249 |      6 | wall_clock | sec    |  0.000270 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |00>>>         |_ompt_work_single_other                      |    114 |      5 | wall_clock | sec    |  0.000024 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |00>>>         |_ompt_sync_region_barrier_implementation     |     76 |      5 | wall_clock | sec    |  0.000876 |  0.000012 |  0.000008 |  0.000025 | 0.000000 | 0.000003 |   84.4 |
+   | |00>>>           |_ompt_sync_region_reduction                |    304 |      6 | wall_clock | sec    |  0.000136 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |00>>>         |_ompt_master                                 |    226 |      5 | wall_clock | sec    |  0.001978 |  0.000009 |  0.000000 |  0.000038 | 0.000000 | 0.000012 |  100.0 |
+   | |11>>>       |_ompt_thread_worker                            |      1 |      4 | wall_clock | sec    | 10.656145 | 10.656145 | 10.656145 | 10.656145 | 0.000000 | 0.000000 |    0.1 |
+   | |11>>>         |_ompt_implicit_task                          |      1 |      5 | wall_clock | sec    | 10.649183 | 10.649183 | 10.649183 | 10.649183 | 0.000000 | 0.000000 |    0.0 |
+   | |11>>>           |_ompt_work_loop                            |    156 |      6 | wall_clock | sec    |  0.000852 |  0.000005 |  0.000002 |  0.000230 | 0.000000 | 0.000019 |  100.0 |
+   | |11>>>           |_ompt_work_single_other                    |    149 |      6 | wall_clock | sec    |  0.000035 |  0.000000 |  0.000000 |  0.000000 | 0.000000 | 0.000000 |  100.0 |
+   | |11>>>           |_ompt_sync_region_barrier_implicit         |    308 |      6 | wall_clock | sec    |  0.004135 |  0.000013 |  0.000001 |  0.001233 | 0.000000 | 0.000070 |  100.0 |
+   | |11>>>           |_conj_grad                                 |     76 |      6 | wall_clock | sec    | 10.641302 |  0.140017 |  0.131896 |  0.155102 | 0.000017 | 0.004080 |    0.6 |
+   | |11>>>             |_ompt_work_single_other                  |   2023 |      7 | wall_clock | sec    |  0.000458 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |11>>>             |_ompt_work_loop                          |   7904 |      7 | wall_clock | sec    |  8.253555 |  0.001044 |  0.000005 |  0.008021 | 0.000003 | 0.001790 |  100.0 |
+   | |11>>>             |_ompt_sync_region_barrier_implicit       |   6004 |      7 | wall_clock | sec    |  0.263840 |  0.000044 |  0.000001 |  0.004087 | 0.000000 | 0.000297 |  100.0 |
+   | |11>>>             |_ompt_sync_region_barrier_implementation |   3952 |      7 | wall_clock | sec    |  2.059823 |  0.000521 |  0.000007 |  0.009508 | 0.000001 | 0.000863 |  100.0 |
+   | |11>>>             |_ompt_work_single_executor               |     29 |      7 | wall_clock | sec    |  0.000011 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |11>>>           |_ompt_work_single_executor                 |      5 |      6 | wall_clock | sec    |  0.000002 |  0.000000 |  0.000000 |  0.000000 | 0.000000 | 0.000000 |  100.0 |
+   | |11>>>           |_ompt_sync_region_barrier_implementation   |     76 |      6 | wall_clock | sec    |  0.000975 |  0.000013 |  0.000008 |  0.000024 | 0.000000 | 0.000003 |  100.0 |
+   | |10>>>       |_ompt_thread_worker                            |      1 |      4 | wall_clock | sec    | 10.681664 | 10.681664 | 10.681664 | 10.681664 | 0.000000 | 0.000000 |    0.3 |
+   | |10>>>         |_ompt_implicit_task                          |      1 |      5 | wall_clock | sec    | 10.649158 | 10.649158 | 10.649158 | 10.649158 | 0.000000 | 0.000000 |    0.0 |
+   | |10>>>           |_ompt_work_loop                            |    156 |      6 | wall_clock | sec    |  0.000863 |  0.000006 |  0.000002 |  0.000231 | 0.000000 | 0.000019 |  100.0 |
+   | |10>>>           |_ompt_work_single_other                    |    140 |      6 | wall_clock | sec    |  0.000037 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |10>>>           |_ompt_sync_region_barrier_implicit         |    308 |      6 | wall_clock | sec    |  0.004149 |  0.000013 |  0.000001 |  0.001221 | 0.000000 | 0.000070 |  100.0 |
+   | |10>>>           |_conj_grad                                 |     76 |      6 | wall_clock | sec    | 10.641288 |  0.140017 |  0.131896 |  0.155101 | 0.000017 | 0.004080 |    0.7 |
+   | |10>>>             |_ompt_work_single_other                  |   1883 |      7 | wall_clock | sec    |  0.000487 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |10>>>             |_ompt_work_loop                          |   7904 |      7 | wall_clock | sec    |  8.174545 |  0.001034 |  0.000005 |  0.006899 | 0.000003 | 0.001766 |  100.0 |
+   | |10>>>             |_ompt_sync_region_barrier_implicit       |   6004 |      7 | wall_clock | sec    |  0.268808 |  0.000045 |  0.000001 |  0.004087 | 0.000000 | 0.000299 |  100.0 |
+   | |10>>>             |_ompt_sync_region_barrier_implementation |   3952 |      7 | wall_clock | sec    |  2.126988 |  0.000538 |  0.000007 |  0.009843 | 0.000001 | 0.000872 |   99.9 |
+   | |10>>>               |_ompt_sync_region_reduction            |   3952 |      8 | wall_clock | sec    |  0.002574 |  0.000001 |  0.000000 |  0.000014 | 0.000000 | 0.000000 |  100.0 |
+   | |10>>>             |_ompt_work_single_executor               |    169 |      7 | wall_clock | sec    |  0.000072 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |10>>>           |_ompt_sync_region_barrier_implementation   |     76 |      6 | wall_clock | sec    |  0.000954 |  0.000013 |  0.000009 |  0.000023 | 0.000000 | 0.000003 |   95.9 |
+   | |10>>>             |_ompt_sync_region_reduction              |     76 |      7 | wall_clock | sec    |  0.000039 |  0.000001 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |10>>>           |_ompt_work_single_executor                 |     14 |      6 | wall_clock | sec    |  0.000006 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |09>>>       |_ompt_thread_worker                            |      1 |      4 | wall_clock | sec    | 10.686552 | 10.686552 | 10.686552 | 10.686552 | 0.000000 | 0.000000 |    0.3 |
+   | |09>>>         |_ompt_implicit_task                          |      1 |      5 | wall_clock | sec    | 10.649151 | 10.649151 | 10.649151 | 10.649151 | 0.000000 | 0.000000 |    0.0 |
+   | |09>>>           |_ompt_work_loop                            |    156 |      6 | wall_clock | sec    |  0.000880 |  0.000006 |  0.000002 |  0.000258 | 0.000000 | 0.000021 |  100.0 |
+   | |09>>>           |_ompt_work_single_other                    |    148 |      6 | wall_clock | sec    |  0.000034 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |09>>>           |_ompt_sync_region_barrier_implicit         |    308 |      6 | wall_clock | sec    |  0.004129 |  0.000013 |  0.000001 |  0.001210 | 0.000000 | 0.000069 |  100.0 |
+   | |09>>>           |_conj_grad                                 |     76 |      6 | wall_clock | sec    | 10.641308 |  0.140017 |  0.131895 |  0.155102 | 0.000017 | 0.004080 |    0.7 |
+   | |09>>>             |_ompt_work_single_other                  |   2043 |      7 | wall_clock | sec    |  0.000473 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |09>>>             |_ompt_work_loop                          |   7904 |      7 | wall_clock | sec    |  7.977001 |  0.001009 |  0.000005 |  0.007325 | 0.000003 | 0.001732 |  100.0 |
+   | |09>>>             |_ompt_sync_region_barrier_implicit       |   6004 |      7 | wall_clock | sec    |  0.242996 |  0.000040 |  0.000001 |  0.004087 | 0.000000 | 0.000284 |  100.0 |
+   | |09>>>             |_ompt_sync_region_barrier_implementation |   3952 |      7 | wall_clock | sec    |  2.350895 |  0.000595 |  0.000007 |  0.008689 | 0.000001 | 0.000926 |  100.0 |
+   | |09>>>             |_ompt_work_single_executor               |      9 |      7 | wall_clock | sec    |  0.000004 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |09>>>           |_ompt_sync_region_barrier_implementation   |     76 |      6 | wall_clock | sec    |  0.000973 |  0.000013 |  0.000008 |  0.000025 | 0.000000 | 0.000003 |  100.0 |
+   | |09>>>           |_ompt_work_single_executor                 |      6 |      6 | wall_clock | sec    |  0.000002 |  0.000000 |  0.000000 |  0.000000 | 0.000000 | 0.000000 |  100.0 |
+   | |08>>>       |_ompt_thread_worker                            |      1 |      4 | wall_clock | sec    | 10.721622 | 10.721622 | 10.721622 | 10.721622 | 0.000000 | 0.000000 |    0.7 |
+   | |08>>>         |_ompt_implicit_task                          |      1 |      5 | wall_clock | sec    | 10.649135 | 10.649135 | 10.649135 | 10.649135 | 0.000000 | 0.000000 |    0.0 |
+   | |08>>>           |_ompt_work_loop                            |    156 |      6 | wall_clock | sec    |  0.000839 |  0.000005 |  0.000001 |  0.000231 | 0.000000 | 0.000019 |  100.0 |
+   | |08>>>           |_ompt_work_single_other                    |    141 |      6 | wall_clock | sec    |  0.000030 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |08>>>           |_ompt_sync_region_barrier_implicit         |    308 |      6 | wall_clock | sec    |  0.004114 |  0.000013 |  0.000001 |  0.001198 | 0.000000 | 0.000069 |  100.0 |
+   | |08>>>           |_conj_grad                                 |     76 |      6 | wall_clock | sec    | 10.641294 |  0.140017 |  0.131896 |  0.155101 | 0.000017 | 0.004080 |    0.6 |
+   | |08>>>             |_ompt_work_single_other                  |   1742 |      7 | wall_clock | sec    |  0.000392 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |08>>>             |_ompt_work_loop                          |   7904 |      7 | wall_clock | sec    |  8.306388 |  0.001051 |  0.000005 |  0.007886 | 0.000003 | 0.001795 |  100.0 |
+   | |08>>>             |_ompt_sync_region_barrier_implicit       |   6004 |      7 | wall_clock | sec    |  0.274358 |  0.000046 |  0.000001 |  0.004090 | 0.000000 | 0.000302 |  100.0 |
+   | |08>>>             |_ompt_sync_region_barrier_implementation |   3952 |      7 | wall_clock | sec    |  1.991251 |  0.000504 |  0.000007 |  0.008694 | 0.000001 | 0.000844 |   99.8 |
+   | |08>>>               |_ompt_sync_region_reduction            |   7904 |      8 | wall_clock | sec    |  0.003816 |  0.000000 |  0.000000 |  0.000017 | 0.000000 | 0.000000 |  100.0 |
+   | |08>>>             |_ompt_work_single_executor               |    310 |      7 | wall_clock | sec    |  0.000112 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |08>>>           |_ompt_sync_region_barrier_implementation   |     76 |      6 | wall_clock | sec    |  0.000955 |  0.000013 |  0.000009 |  0.000026 | 0.000000 | 0.000003 |   93.7 |
+   | |08>>>             |_ompt_sync_region_reduction              |    152 |      7 | wall_clock | sec    |  0.000060 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |08>>>           |_ompt_work_single_executor                 |     13 |      6 | wall_clock | sec    |  0.000005 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |07>>>       |_ompt_thread_worker                            |      1 |      4 | wall_clock | sec    | 10.747282 | 10.747282 | 10.747282 | 10.747282 | 0.000000 | 0.000000 |    0.9 |
+   | |07>>>         |_ompt_implicit_task                          |      1 |      5 | wall_clock | sec    | 10.649093 | 10.649093 | 10.649093 | 10.649093 | 0.000000 | 0.000000 |    0.0 |
+   | |07>>>           |_ompt_work_loop                            |    156 |      6 | wall_clock | sec    |  0.000923 |  0.000006 |  0.000002 |  0.000231 | 0.000000 | 0.000019 |  100.0 |
+   | |07>>>           |_ompt_work_single_other                    |    152 |      6 | wall_clock | sec    |  0.000048 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |07>>>           |_ompt_sync_region_barrier_implicit         |    308 |      6 | wall_clock | sec    |  0.003981 |  0.000013 |  0.000001 |  0.001186 | 0.000000 | 0.000068 |  100.0 |
+   | |07>>>           |_conj_grad                                 |     76 |      6 | wall_clock | sec    | 10.641295 |  0.140017 |  0.131896 |  0.155101 | 0.000017 | 0.004080 |    0.7 |
+   | |07>>>             |_ompt_work_single_other                  |   2043 |      7 | wall_clock | sec    |  0.000648 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |07>>>             |_ompt_work_loop                          |   7904 |      7 | wall_clock | sec    |  7.978811 |  0.001009 |  0.000005 |  0.006728 | 0.000003 | 0.001732 |  100.0 |
+   | |07>>>             |_ompt_sync_region_barrier_implicit       |   6004 |      7 | wall_clock | sec    |  0.199939 |  0.000033 |  0.000001 |  0.004086 | 0.000000 | 0.000255 |  100.0 |
+   | |07>>>             |_ompt_sync_region_barrier_implementation |   3952 |      7 | wall_clock | sec    |  2.385843 |  0.000604 |  0.000009 |  0.009039 | 0.000001 | 0.000938 |  100.0 |
+   | |07>>>             |_ompt_work_single_executor               |      9 |      7 | wall_clock | sec    |  0.000004 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |07>>>           |_ompt_sync_region_barrier_implementation   |     76 |      6 | wall_clock | sec    |  0.000905 |  0.000012 |  0.000010 |  0.000025 | 0.000000 | 0.000003 |  100.0 |
+   | |07>>>           |_ompt_work_single_executor                 |      2 |      6 | wall_clock | sec    |  0.000001 |  0.000001 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |06>>>       |_ompt_thread_worker                            |      1 |      4 | wall_clock | sec    | 10.772278 | 10.772278 | 10.772278 | 10.772278 | 0.000000 | 0.000000 |    1.1 |
+   | |06>>>         |_ompt_implicit_task                          |      1 |      5 | wall_clock | sec    | 10.649092 | 10.649092 | 10.649092 | 10.649092 | 0.000000 | 0.000000 |    0.0 |
+   | |06>>>           |_ompt_work_loop                            |    156 |      6 | wall_clock | sec    |  0.000888 |  0.000006 |  0.000002 |  0.000236 | 0.000000 | 0.000020 |  100.0 |
+   | |06>>>           |_ompt_work_single_other                    |    153 |      6 | wall_clock | sec    |  0.000037 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |06>>>           |_ompt_sync_region_barrier_implicit         |    308 |      6 | wall_clock | sec    |  0.004090 |  0.000013 |  0.000001 |  0.001175 | 0.000000 | 0.000067 |  100.0 |
+   | |06>>>           |_conj_grad                                 |     76 |      6 | wall_clock | sec    | 10.641317 |  0.140017 |  0.131896 |  0.155101 | 0.000017 | 0.004080 |    0.8 |
+   | |06>>>             |_ompt_work_single_other                  |   2041 |      7 | wall_clock | sec    |  0.000476 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |06>>>             |_ompt_work_loop                          |   7904 |      7 | wall_clock | sec    |  7.467961 |  0.000945 |  0.000005 |  0.010712 | 0.000003 | 0.001627 |  100.0 |
+   | |06>>>             |_ompt_sync_region_barrier_implicit       |   6004 |      7 | wall_clock | sec    |  0.250883 |  0.000042 |  0.000001 |  0.004087 | 0.000000 | 0.000285 |  100.0 |
+   | |06>>>             |_ompt_sync_region_barrier_implementation |   3952 |      7 | wall_clock | sec    |  2.838733 |  0.000718 |  0.000009 |  0.009015 | 0.000001 | 0.001015 |   99.9 |
+   | |06>>>               |_ompt_sync_region_reduction            |   3952 |      8 | wall_clock | sec    |  0.003334 |  0.000001 |  0.000000 |  0.000025 | 0.000000 | 0.000001 |  100.0 |
+   | |06>>>             |_ompt_work_single_executor               |     11 |      7 | wall_clock | sec    |  0.000005 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |06>>>           |_ompt_sync_region_barrier_implementation   |     76 |      6 | wall_clock | sec    |  0.000940 |  0.000012 |  0.000009 |  0.000025 | 0.000000 | 0.000003 |   95.4 |
+   | |06>>>             |_ompt_sync_region_reduction              |     76 |      7 | wall_clock | sec    |  0.000044 |  0.000001 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |06>>>           |_ompt_work_single_executor                 |      1 |      6 | wall_clock | sec    |  0.000000 |  0.000000 |  0.000000 |  0.000000 | 0.000000 | 0.000000 |  100.0 |
+   | |05>>>       |_ompt_thread_worker                            |      1 |      4 | wall_clock | sec    | 10.797950 | 10.797950 | 10.797950 | 10.797950 | 0.000000 | 0.000000 |    1.4 |
+   | |05>>>         |_ompt_implicit_task                          |      1 |      5 | wall_clock | sec    | 10.649072 | 10.649072 | 10.649072 | 10.649072 | 0.000000 | 0.000000 |    0.0 |
+   | |05>>>           |_ompt_work_loop                            |    156 |      6 | wall_clock | sec    |  0.000879 |  0.000006 |  0.000001 |  0.000248 | 0.000000 | 0.000021 |  100.0 |
+   | |05>>>           |_ompt_work_single_other                    |    142 |      6 | wall_clock | sec    |  0.000034 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |05>>>           |_ompt_sync_region_barrier_implicit         |    308 |      6 | wall_clock | sec    |  0.004062 |  0.000013 |  0.000002 |  0.001163 | 0.000000 | 0.000067 |  100.0 |
+   | |05>>>           |_conj_grad                                 |     76 |      6 | wall_clock | sec    | 10.641291 |  0.140017 |  0.131896 |  0.155101 | 0.000017 | 0.004080 |    0.7 |
+   | |05>>>             |_ompt_work_single_other                  |   2038 |      7 | wall_clock | sec    |  0.000500 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |05>>>             |_ompt_work_loop                          |   7904 |      7 | wall_clock | sec    |  8.279191 |  0.001047 |  0.000005 |  0.006596 | 0.000003 | 0.001792 |  100.0 |
+   | |05>>>             |_ompt_sync_region_barrier_implicit       |   6004 |      7 | wall_clock | sec    |  0.250939 |  0.000042 |  0.000001 |  0.004090 | 0.000000 | 0.000286 |  100.0 |
+   | |05>>>             |_ompt_sync_region_barrier_implementation |   3952 |      7 | wall_clock | sec    |  2.039013 |  0.000516 |  0.000009 |  0.008689 | 0.000001 | 0.000855 |  100.0 |
+   | |05>>>             |_ompt_work_single_executor               |     14 |      7 | wall_clock | sec    |  0.000005 |  0.000000 |  0.000000 |  0.000000 | 0.000000 | 0.000000 |  100.0 |
+   | |05>>>           |_ompt_sync_region_barrier_implementation   |     76 |      6 | wall_clock | sec    |  0.000926 |  0.000012 |  0.000009 |  0.000023 | 0.000000 | 0.000003 |  100.0 |
+   | |05>>>           |_ompt_work_single_executor                 |     12 |      6 | wall_clock | sec    |  0.000005 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |04>>>       |_ompt_thread_worker                            |      1 |      4 | wall_clock | sec    | 10.825935 | 10.825935 | 10.825935 | 10.825935 | 0.000000 | 0.000000 |    1.6 |
+   | |04>>>         |_ompt_implicit_task                          |      1 |      5 | wall_clock | sec    | 10.649068 | 10.649068 | 10.649068 | 10.649068 | 0.000000 | 0.000000 |    0.0 |
+   | |04>>>           |_ompt_work_loop                            |    156 |      6 | wall_clock | sec    |  0.000884 |  0.000006 |  0.000002 |  0.000245 | 0.000000 | 0.000020 |  100.0 |
+   | |04>>>           |_ompt_work_single_other                    |    150 |      6 | wall_clock | sec    |  0.000034 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |04>>>           |_ompt_sync_region_barrier_implicit         |    308 |      6 | wall_clock | sec    |  0.004069 |  0.000013 |  0.000001 |  0.001151 | 0.000000 | 0.000066 |  100.0 |
+   | |04>>>           |_conj_grad                                 |     76 |      6 | wall_clock | sec    | 10.641300 |  0.140017 |  0.131896 |  0.155101 | 0.000017 | 0.004080 |    1.1 |
+   | |04>>>             |_ompt_work_single_other                  |   2041 |      7 | wall_clock | sec    |  0.000448 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |04>>>             |_ompt_work_loop                          |   7904 |      7 | wall_clock | sec    |  7.438393 |  0.000941 |  0.000005 |  0.007090 | 0.000003 | 0.001624 |  100.0 |
+   | |04>>>             |_ompt_sync_region_barrier_implicit       |   6004 |      7 | wall_clock | sec    |  0.270654 |  0.000045 |  0.000001 |  0.004090 | 0.000000 | 0.000295 |  100.0 |
+   | |04>>>             |_ompt_sync_region_barrier_implementation |   3952 |      7 | wall_clock | sec    |  2.819165 |  0.000713 |  0.000009 |  0.008379 | 0.000001 | 0.001013 |   99.9 |
+   | |04>>>               |_ompt_sync_region_reduction            |   7904 |      8 | wall_clock | sec    |  0.003932 |  0.000000 |  0.000000 |  0.000015 | 0.000000 | 0.000000 |  100.0 |
+   | |04>>>             |_ompt_work_single_executor               |     11 |      7 | wall_clock | sec    |  0.000005 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |04>>>           |_ompt_sync_region_barrier_implementation   |     76 |      6 | wall_clock | sec    |  0.000936 |  0.000012 |  0.000009 |  0.000025 | 0.000000 | 0.000003 |   93.2 |
+   | |04>>>             |_ompt_sync_region_reduction              |    152 |      7 | wall_clock | sec    |  0.000064 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |04>>>           |_ompt_work_single_executor                 |      4 |      6 | wall_clock | sec    |  0.000001 |  0.000000 |  0.000000 |  0.000000 | 0.000000 | 0.000000 |  100.0 |
+   | |03>>>       |_ompt_thread_worker                            |      1 |      4 | wall_clock | sec    | 10.849322 | 10.849322 | 10.849322 | 10.849322 | 0.000000 | 0.000000 |    1.8 |
+   | |03>>>         |_ompt_implicit_task                          |      1 |      5 | wall_clock | sec    | 10.649075 | 10.649075 | 10.649075 | 10.649075 | 0.000000 | 0.000000 |    0.0 |
+   | |03>>>           |_ompt_work_loop                            |    156 |      6 | wall_clock | sec    |  0.000861 |  0.000006 |  0.000002 |  0.000238 | 0.000000 | 0.000020 |  100.0 |
+   | |03>>>           |_ompt_work_single_other                    |    120 |      6 | wall_clock | sec    |  0.000028 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |03>>>           |_ompt_sync_region_barrier_implicit         |    308 |      6 | wall_clock | sec    |  0.003993 |  0.000013 |  0.000001 |  0.001138 | 0.000000 | 0.000065 |  100.0 |
+   | |03>>>           |_conj_grad                                 |     76 |      6 | wall_clock | sec    | 10.641302 |  0.140017 |  0.131896 |  0.155101 | 0.000017 | 0.004080 |    0.8 |
+   | |03>>>             |_ompt_work_single_other                  |   1756 |      7 | wall_clock | sec    |  0.000426 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |03>>>             |_ompt_work_loop                          |   7904 |      7 | wall_clock | sec    |  8.005617 |  0.001013 |  0.000005 |  0.011500 | 0.000003 | 0.001741 |  100.0 |
+   | |03>>>             |_ompt_sync_region_barrier_implicit       |   6004 |      7 | wall_clock | sec    |  0.231485 |  0.000039 |  0.000001 |  0.004086 | 0.000000 | 0.000277 |  100.0 |
+   | |03>>>             |_ompt_sync_region_barrier_implementation |   3952 |      7 | wall_clock | sec    |  2.320428 |  0.000587 |  0.000009 |  0.010868 | 0.000001 | 0.000912 |  100.0 |
+   | |03>>>             |_ompt_work_single_executor               |    296 |      7 | wall_clock | sec    |  0.000120 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |03>>>           |_ompt_sync_region_barrier_implementation   |     76 |      6 | wall_clock | sec    |  0.000967 |  0.000013 |  0.000010 |  0.000023 | 0.000000 | 0.000003 |  100.0 |
+   | |03>>>           |_ompt_work_single_executor                 |     34 |      6 | wall_clock | sec    |  0.000013 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |02>>>       |_ompt_thread_worker                            |      1 |      4 | wall_clock | sec    | 10.876387 | 10.876387 | 10.876387 | 10.876387 | 0.000000 | 0.000000 |    2.1 |
+   | |02>>>         |_ompt_implicit_task                          |      1 |      5 | wall_clock | sec    | 10.649050 | 10.649050 | 10.649050 | 10.649050 | 0.000000 | 0.000000 |    0.0 |
+   | |02>>>           |_ompt_work_loop                            |    156 |      6 | wall_clock | sec    |  0.000924 |  0.000006 |  0.000001 |  0.000241 | 0.000000 | 0.000020 |  100.0 |
+   | |02>>>           |_ompt_work_single_other                    |    139 |      6 | wall_clock | sec    |  0.000040 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |02>>>           |_ompt_sync_region_barrier_implicit         |    308 |      6 | wall_clock | sec    |  0.003972 |  0.000013 |  0.000001 |  0.001127 | 0.000000 | 0.000064 |  100.0 |
+   | |02>>>           |_conj_grad                                 |     76 |      6 | wall_clock | sec    | 10.641287 |  0.140017 |  0.131895 |  0.155101 | 0.000017 | 0.004080 |    0.7 |
+   | |02>>>             |_ompt_work_single_other                  |   1902 |      7 | wall_clock | sec    |  0.000553 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |02>>>             |_ompt_work_loop                          |   7904 |      7 | wall_clock | sec    |  7.906688 |  0.001000 |  0.000005 |  0.007068 | 0.000003 | 0.001713 |  100.0 |
+   | |02>>>             |_ompt_sync_region_barrier_implicit       |   6004 |      7 | wall_clock | sec    |  0.261367 |  0.000044 |  0.000001 |  0.004088 | 0.000000 | 0.000295 |  100.0 |
+   | |02>>>             |_ompt_sync_region_barrier_implementation |   3952 |      7 | wall_clock | sec    |  2.402362 |  0.000608 |  0.000009 |  0.010399 | 0.000001 | 0.000944 |   99.9 |
+   | |02>>>               |_ompt_sync_region_reduction            |   3952 |      8 | wall_clock | sec    |  0.002937 |  0.000001 |  0.000000 |  0.000021 | 0.000000 | 0.000000 |  100.0 |
+   | |02>>>             |_ompt_work_single_executor               |    150 |      7 | wall_clock | sec    |  0.000073 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |02>>>           |_ompt_sync_region_barrier_implementation   |     76 |      6 | wall_clock | sec    |  0.000895 |  0.000012 |  0.000009 |  0.000026 | 0.000000 | 0.000003 |   95.2 |
+   | |02>>>             |_ompt_sync_region_reduction              |     76 |      7 | wall_clock | sec    |  0.000043 |  0.000001 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |02>>>           |_ompt_work_single_executor                 |     15 |      6 | wall_clock | sec    |  0.000007 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |01>>>       |_ompt_thread_worker                            |      1 |      4 | wall_clock | sec    | 10.901650 | 10.901650 | 10.901650 | 10.901650 | 0.000000 | 0.000000 |    2.3 |
+   | |01>>>         |_ompt_implicit_task                          |      1 |      5 | wall_clock | sec    | 10.649017 | 10.649017 | 10.649017 | 10.649017 | 0.000000 | 0.000000 |    0.0 |
+   | |01>>>           |_ompt_work_loop                            |    156 |      6 | wall_clock | sec    |  0.000863 |  0.000006 |  0.000001 |  0.000231 | 0.000000 | 0.000019 |  100.0 |
+   | |01>>>           |_ompt_work_single_other                    |    146 |      6 | wall_clock | sec    |  0.000033 |  0.000000 |  0.000000 |  0.000000 | 0.000000 | 0.000000 |  100.0 |
+   | |01>>>           |_ompt_sync_region_barrier_implicit         |    308 |      6 | wall_clock | sec    |  0.004012 |  0.000013 |  0.000001 |  0.001115 | 0.000000 | 0.000064 |  100.0 |
+   | |01>>>           |_conj_grad                                 |     76 |      6 | wall_clock | sec    | 10.641316 |  0.140017 |  0.131895 |  0.155101 | 0.000017 | 0.004080 |    0.8 |
+   | |01>>>             |_ompt_work_single_other                  |   1811 |      7 | wall_clock | sec    |  0.000403 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |01>>>             |_ompt_work_loop                          |   7904 |      7 | wall_clock | sec    |  7.410337 |  0.000938 |  0.000005 |  0.010556 | 0.000003 | 0.001610 |  100.0 |
+   | |01>>>             |_ompt_sync_region_barrier_implicit       |   6004 |      7 | wall_clock | sec    |  0.202494 |  0.000034 |  0.000001 |  0.003521 | 0.000000 | 0.000256 |  100.0 |
+   | |01>>>             |_ompt_sync_region_barrier_implementation |   3952 |      7 | wall_clock | sec    |  2.943604 |  0.000745 |  0.000008 |  0.009033 | 0.000001 | 0.001024 |  100.0 |
+   | |01>>>             |_ompt_work_single_executor               |    241 |      7 | wall_clock | sec    |  0.000093 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |01>>>           |_ompt_sync_region_barrier_implementation   |     76 |      6 | wall_clock | sec    |  0.000917 |  0.000012 |  0.000009 |  0.000026 | 0.000000 | 0.000003 |  100.0 |
+   | |01>>>           |_ompt_work_single_executor                 |      8 |      6 | wall_clock | sec    |  0.000004 |  0.000000 |  0.000000 |  0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |00>>>   |_c_print_results                                   |      1 |      2 | wall_clock | sec    |  0.000049 |  0.000049 |  0.000049 |  0.000049 | 0.000000 | 0.000000 |  100.0 |
+   |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+
+Timemory JSON output
+-------------------------------------------------------------------------
+
+Timemory represents the data within the JSON output in two forms: 
+a flat structure and a hierarchical structure.
+The flat JSON data mirrors the text files: the hierarchy is encoded by
+the indentation in the ``prefix`` field and by the ``depth`` field.
+The hierarchical JSON contains additional information with respect 
+to inclusive and exclusive values. However,
+its structure must be processed using recursion. This section of the JSON output supports analysis
+by `hatchet <https://github.com/hatchet/hatchet>`_.
+All the data entries for the flat structure are in a single JSON array. It is easier to 
+write a simple Python script for post-processing using this format than with the hierarchical structure.
+
+.. note::
+
+   The generation of flat JSON output is configurable via ``OMNITRACE_JSON_OUTPUT``.
+   The generation of hierarchical JSON data is configurable via ``OMNITRACE_TREE_OUTPUT``.
+
+Timemory JSON output sample
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In the following JSON data, the flat data starts at ``["timemory"]["wall_clock"]["ranks"]``
+and the hierarchical data starts at ``["timemory"]["wall_clock"]["graph"]``.
+To access the name (or prefix) of the nth entry in the flat data layout, use
+``["timemory"]["wall_clock"]["ranks"][0]["graph"][<N>]["prefix"]``. When full MPI
+support is enabled, the per-rank data in flat layout is represented
+as an entry in the ``ranks`` array. In the hierarchical data structure,
+the per-rank data is represented as an entry in the ``mpi`` array. However, ``graph``
+is used in lieu of ``mpi`` when full MPI support is not enabled.
+In the hierarchical layout, all data for the process is a child of a dummy
+root node, which has the name ``unknown-hash=0``.
+
+.. code-block:: json
+
+   {
+      "timemory": {
+         "wall_clock": {
+               "properties": {
+                  "cereal_class_version": 0,
+                  "value": 78,
+                  "enum": "WALL_CLOCK",
+                  "id": "wall_clock",
+                  "ids": [
+                     "real_clock",
+                     "virtual_clock",
+                     "wall_clock"
+                  ]
+               },
+               "type": "wall_clock",
+               "description": "Real-clock timer (i.e. wall-clock timer)",
+               "unit_value": 1000000000,
+               "unit_repr": "sec",
+               "thread_scope_only": false,
+               "thread_count": 2,
+               "mpi_size": 1,
+               "upcxx_size": 1,
+               "process_count": 1,
+               "num_ranks": 1,
+               "concurrency": 2,
+               "ranks": [
+                  {
+                     "rank": 0,
+                     "graph_size": 112,
+                     "graph": [
+                           {
+                              "hash": 17481650134347108265,
+                              "prefix": "|0>>> main",
+                              "depth": 0,
+                              "entry": {
+                                 "cereal_class_version": 0,
+                                 "laps": 1,
+                                 "value": 894743517,
+                                 "accum": 894743517,
+                                 "repr_data": 0.894743517,
+                                 "repr_display": 0.894743517
+                              },
+                              "stats": {
+                                 "cereal_class_version": 0,
+                                 "sum": 0.894743517,
+                                 "count": 1,
+                                 "min": 0.894743517,
+                                 "max": 0.894743517,
+                                 "sqr": 0.8005659612135293,
+                                 "mean": 0.894743517,
+                                 "stddev": 0.0
+                              },
+                              "rolling_hash": 17481650134347108265
+                           },
+                           {
+                              "hash": 3455444288293231339,
+                              "prefix": "|0>>> |_read_input",
+                              "depth": 1,
+                              "entry": {
+                                 "laps": 1,
+                                 "value": 9808,
+                                 "accum": 9808,
+                                 "repr_data": 9.808e-06,
+                                 "repr_display": 9.808e-06
+                              },
+                              "stats": {
+                                 "sum": 9.808e-06,
+                                 "count": 1,
+                                 "min": 9.808e-06,
+                                 "max": 9.808e-06,
+                                 "sqr": 9.6196864e-11,
+                                 "mean": 9.808e-06,
+                                 "stddev": 0.0
+                              },
+                              "rolling_hash": 2490350348930787988
+                           },
+                           {
+                              "hash": 8456966793631718807,
+                              "prefix": "|0>>> |_setcoeff",
+                              "depth": 1,
+                              "entry": {
+                                 "laps": 1,
+                                 "value": 922,
+                                 "accum": 922,
+                                 "repr_data": 9.22e-07,
+                                 "repr_display": 9.22e-07
+                              },
+                              "stats": {
+                                 "sum": 9.22e-07,
+                                 "count": 1,
+                                 "min": 9.22e-07,
+                                 "max": 9.22e-07,
+                                 "sqr": 8.50084e-13,
+                                 "mean": 9.22e-07,
+                                 "stddev": 0.0
+                              },
+                              "rolling_hash": 7491872854269275456
+                           },
+                           {
+                              "hash": 6107876127803219007,
+                              "prefix": "|0>>> |_ompt_thread_initial",
+                              "depth": 1,
+                              "entry": {
+                                 "laps": 1,
+                                 "value": 896506392,
+                                 "accum": 896506392,
+                                 "repr_data": 0.896506392,
+                                 "repr_display": 0.896506392
+                              },
+                              "stats": {
+                                 "sum": 0.896506392,
+                                 "count": 1,
+                                 "min": 0.896506392,
+                                 "max": 0.896506392,
+                                 "sqr": 0.8037237108968578,
+                                 "mean": 0.896506392,
+                                 "stddev": 0.0
+                              },
+                              "rolling_hash": 5142782188440775656
+                           },
+                           {
+                              "hash": 15402802091993617561,
+                              "prefix": "|0>>>   |_ompt_implicit_task",
+                              "depth": 2,
+                              "entry": {
+                                 "laps": 1,
+                                 "value": 896479111,
+                                 "accum": 896479111,
+                                 "repr_data": 0.896479111,
+                                 "repr_display": 0.896479111
+                              },
+                              "stats": {
+                                 "sum": 0.896479111,
+                                 "count": 1,
+                                 "min": 0.896479111,
+                                 "max": 0.896479111,
+                                 "sqr": 0.8036747964593504,
+                                 "mean": 0.896479111,
+                                 "stddev": 0.0
+                              },
+                              "rolling_hash": 2098840206724841601
+                           },
+                           {
+                              "..." : "... etc. ..."
+                           }
+                     ]
+                  }
+               ],
+               "graph": [
+                  [
+                     {
+                           "cereal_class_version": 0,
+                           "node": {
+                              "hash": 0,
+                              "prefix": "unknown-hash=0",
+                              "tid": [
+                                 0
+                              ],
+                              "pid": [
+                                 2539175
+                              ],
+                              "depth": 0,
+                              "is_dummy": false,
+                              "inclusive": {
+                                 "entry": {
+                                       "laps": 0,
+                                       "value": 0,
+                                       "accum": 0,
+                                       "repr_data": 0.0,
+                                       "repr_display": 0.0
+                                 },
+                                 "stats": {
+                                       "sum": 0.0,
+                                       "count": 0,
+                                       "min": 0.0,
+                                       "max": 0.0,
+                                       "sqr": 0.0,
+                                       "mean": 0.0,
+                                       "stddev": 0.0
+                                 }
+                              },
+                              "exclusive": {
+                                 "entry": {
+                                       "laps": 0,
+                                       "value": -894743517,
+                                       "accum": -894743517,
+                                       "repr_data": -0.894743517,
+                                       "repr_display": -0.894743517
+                                 },
+                                 "stats": {
+                                       "sum": 0.0,
+                                       "count": 0,
+                                       "min": 0.0,
+                                       "max": 0.0,
+                                       "sqr": 0.0,
+                                       "mean": 0.0,
+                                       "stddev": 0.0
+                                 }
+                              }
+                           },
+                           "children": [
+                              {
+                                 "node": {
+                                       "hash": 17481650134347108265,
+                                       "prefix": "main",
+                                       "tid": [
+                                          0
+                                       ],
+                                       "pid": [
+                                          2539175
+                                       ],
+                                       "depth": 1,
+                                       "is_dummy": false,
+                                       "inclusive": {
+                                          "entry": {
+                                             "laps": 1,
+                                             "value": 894743517,
+                                             "accum": 894743517,
+                                             "repr_data": 0.894743517,
+                                             "repr_display": 0.894743517
+                                          },
+                                          "stats": {
+                                             "sum": 0.894743517,
+                                             "count": 1,
+                                             "min": 0.894743517,
+                                             "max": 0.894743517,
+                                             "sqr": 0.8005659612135293,
+                                             "mean": 0.894743517,
+                                             "stddev": 0.0
+                                          }
+                                       },
+                                       "exclusive": {
+                                          "entry": {
+                                             "laps": 1,
+                                             "value": -1773605,
+                                             "accum": -1773605,
+                                             "repr_data": -0.001773605,
+                                             "repr_display": -0.001773605
+                                          },
+                                          "stats": {
+                                             "sum": -0.001773605,
+                                             "count": 1,
+                                             "min": 9.22e-07,
+                                             "max": 0.896506392,
+                                             "sqr": -0.0031577497803754,
+                                             "mean": -0.001773605,
+                                             "stddev": 0.0
+                                          }
+                                       }
+                                 },
+                                 "children": [
+                                       {
+                                          "..." : "... etc. ..."
+                                       }
+                                 ]
+                              }
+                           ]
+                     }
+                  ]
+               ]
+         }
+      }
+   }
+
+Timemory JSON output Python post-processing example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: python
+
+   #!/usr/bin/env python3
+
+   import sys
+   import json
+
+
+   def read_json(inp):
+      with open(inp, "r") as f:
+         return json.load(f)
+
+
+   def find_max(data):
+      """Find the max for any function called multiple times"""
+      max_entry = None
+      for itr in data:
+         if itr["entry"]["laps"] == 1:
+               continue
+         if max_entry is None:
+               max_entry = itr
+         else:
+               if itr["stats"]["mean"] > max_entry["stats"]["mean"]:
+                  max_entry = itr
+      return max_entry
+
+
+   def strip_name(name):
+      """Return everything after |_ if it exists"""
+      # str.find returns -1 when the marker is absent (str.index would raise ValueError)
+      idx = name.find("|_")
+      return name if idx < 0 else name[(idx + 2) :]
+
+
+   if __name__ == "__main__":
+
+      input_data = [[x, read_json(x)] for x in sys.argv[1:]]
+
+      for file, data in input_data:
+         for metric, metric_data in data["timemory"].items():
+
+               print(f"[{file}] Found metric: {metric}")
+
+               for n, itr in enumerate(metric_data["ranks"]):
+
+                  max_entry = find_max(itr["graph"])
+                  if max_entry is None:
+                     # no function in this rank was called more than once
+                     continue
+                  print(
+                     "[{}] Maximum value: '{}' at depth {} was called {}x :: {:.3f} {} (mean = {:.3e} {})".format(
+                           file,
+                           strip_name(max_entry["prefix"]),
+                           max_entry["depth"],
+                           max_entry["entry"]["laps"],
+                           max_entry["entry"]["repr_data"],
+                           metric_data["unit_repr"],
+                           max_entry["stats"]["mean"],
+                           metric_data["unit_repr"],
+                     )
+                  )
+
+The result of applying this script to the corresponding JSON output from the :ref:`text-output-example-label` 
+section is as follows:
+
+.. code-block:: shell
+
+   [openmp-cg.inst-wall_clock.json] Found metric: wall_clock
+   [openmp-cg.inst-wall_clock.json] Maximum value: 'conj_grad' at depth 6 was called 76x :: 10.641 sec (mean = 1.400e-01 sec)
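+
+The hierarchical data can be post-processed in a similar way, although it requires recursion.
+The following is a minimal sketch rather than an official utility. It assumes the layout shown in the
+JSON sample (a ``graph`` array of arrays whose elements have ``node`` and ``children`` keys) and
+does not handle the ``mpi`` array. The ``walk`` and ``collect_inclusive`` helpers are
+hypothetical names introduced for illustration.
+
+.. code-block:: python
+
+   #!/usr/bin/env python3
+
+   import json
+   import sys
+
+
+   def walk(tree, visit):
+       """Depth-first traversal: visit this node, then recurse into its children"""
+       visit(tree["node"])
+       for child in tree.get("children", []):
+           walk(child, visit)
+
+
+   def collect_inclusive(metric_data):
+       """Return (depth, prefix, inclusive value) for every node except the dummy root"""
+       rows = []
+
+       def visit(node):
+           # skip the placeholder root node named "unknown-hash=0"
+           if node["prefix"] == "unknown-hash=0":
+               return
+           rows.append(
+               (node["depth"], node["prefix"], node["inclusive"]["entry"]["repr_data"])
+           )
+
+       # assumption: "graph" is an array of arrays of root nodes, as in the sample
+       for roots in metric_data["graph"]:
+           for root in roots:
+               walk(root, visit)
+       return rows
+
+
+   if __name__ == "__main__":
+
+       for fname in sys.argv[1:]:
+           with open(fname, "r") as f:
+               data = json.load(f)
+
+           for metric, metric_data in data["timemory"].items():
+               unit = metric_data["unit_repr"]
+               for depth, prefix, value in collect_inclusive(metric_data):
+                   indent = "  " * depth
+                   print(f"[{fname}][{metric}] {indent}{prefix}: {value:.6f} {unit}")
+
+Each line of output corresponds to one node, indented according to its ``depth``, followed by
+its inclusive value. Replacing ``inclusive`` with ``exclusive`` in ``visit`` reports the exclusive values instead.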
@@ -0,0 +1,294 @@
+.. meta::
+   :description: Omnitrace documentation and reference
+   :keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
+
+****************************************************
+Using the Omnitrace API
+****************************************************
+
+The following example shows how a program can use the Omnitrace API for run-time analysis.
+
+Omnitrace user API example program
+========================================
+
+You can use the Omnitrace API to define custom regions to profile and trace.
+The following C++ program demonstrates this technique by calling several functions from the 
+Omnitrace API, such as ``omnitrace_user_push_region`` and 
+``omnitrace_user_stop_thread_trace``.
+
+.. note::
+
+   By default, when Omnitrace detects any ``omnitrace_user_start_*`` or 
+   ``omnitrace_user_stop_*`` function, instrumentation
+   is disabled at startup, which means ``omnitrace_user_stop_trace()`` is not 
+   required at the beginning of ``main``. This behavior
+   can be controlled manually by using the ``OMNITRACE_INIT_ENABLED`` environment variable. 
+   User-defined regions are always
+   recorded, regardless of whether ``omnitrace_user_start_*`` or 
+   ``omnitrace_user_stop_*`` has been called.
+
+.. code-block:: cpp
+
+   #include <omnitrace/categories.h>
+   #include <omnitrace/types.h>
+   #include <omnitrace/user.h>
+
+   #include <algorithm>
+   #include <atomic>
+   #include <cassert>
+   #include <cerrno>
+   #include <cstdio>
+   #include <cstdlib>
+   #include <cstring>
+   #include <sstream>
+   #include <string>
+   #include <thread>
+   #include <vector>
+
+   std::atomic<long> total{ 0 };
+
+   long
+   fib(long n) __attribute__((noinline));
+
+   void
+   run(size_t nitr, long) __attribute__((noinline));
+
+   int
+   custom_push_region(const char* name);
+
+   namespace
+   {
+   omnitrace_user_callbacks_t custom_callbacks   = OMNITRACE_USER_CALLBACKS_INIT;
+   omnitrace_user_callbacks_t original_callbacks = OMNITRACE_USER_CALLBACKS_INIT;
+   }  // namespace
+
+   int
+   main(int argc, char** argv)
+   {
+      custom_callbacks.push_region = &custom_push_region;
+      omnitrace_user_configure(OMNITRACE_USER_UNION_CONFIG, custom_callbacks,
+                              &original_callbacks);
+
+      omnitrace_user_push_region(argv[0]);
+      omnitrace_user_push_region("initialization");
+      size_t nthread = std::min<size_t>(16, std::thread::hardware_concurrency());
+      size_t nitr    = 50000;
+      long   nfib    = 10;
+      if(argc > 1) nfib = atol(argv[1]);
+      if(argc > 2) nthread = atol(argv[2]);
+      if(argc > 3) nitr = atol(argv[3]);
+      omnitrace_user_pop_region("initialization");
+
+      printf("[%s] Threads: %zu\n[%s] Iterations: %zu\n[%s] fibonacci(%li)...\n", argv[0],
+            nthread, argv[0], nitr, argv[0], nfib);
+
+      omnitrace_user_push_region("thread_creation");
+      std::vector<std::thread> threads{};
+      threads.reserve(nthread);
+      // disable instrumentation for child threads
+      omnitrace_user_stop_thread_trace();
+      for(size_t i = 0; i < nthread; ++i)
+      {
+         threads.emplace_back(&run, nitr, nfib);
+      }
+      // re-enable instrumentation
+      omnitrace_user_start_thread_trace();
+      omnitrace_user_pop_region("thread_creation");
+
+      omnitrace_user_push_region("thread_wait");
+      for(auto& itr : threads)
+         itr.join();
+      omnitrace_user_pop_region("thread_wait");
+
+      run(nitr, nfib);
+
+      printf("[%s] fibonacci(%li) x %zu = %li\n", argv[0], nfib, nthread, total.load());
+      omnitrace_user_pop_region(argv[0]);
+
+      return 0;
+   }
+
+   long
+   fib(long n)
+   {
+      return (n < 2) ? n : fib(n - 1) + fib(n - 2);
+   }
+
+   #define RUN_LABEL                                                                        \
+      std::string{ std::string{ __FUNCTION__ } + "(" + std::to_string(n) + ") x " +        \
+                  std::to_string(nitr) }                                                  \
+         .c_str()
+
+   void
+   run(size_t nitr, long n)
+   {
+      omnitrace_user_push_region(RUN_LABEL);
+      long local = 0;
+      for(size_t i = 0; i < nitr; ++i)
+         local += fib(n);
+      total += local;
+      omnitrace_user_pop_region(RUN_LABEL);
+   }
+
+   int
+   custom_push_region(const char* name)
+   {
+      if(!original_callbacks.push_region || !original_callbacks.push_annotated_region)
+         return OMNITRACE_USER_ERROR_NO_BINDING;
+
+      printf("Pushing custom region :: %s\n", name);
+
+      if(original_callbacks.push_annotated_region)
+      {
+         int32_t _err = errno;
+         char*   _msg = nullptr;
+         char    _buff[1024];
+         if(_err != 0) _msg = strerror_r(_err, _buff, sizeof(_buff));
+
+         omnitrace_annotation_t _annotations[] = {
+               { "errno", OMNITRACE_INT32, &_err }, { "strerror", OMNITRACE_STRING, _msg }
+         };
+
+         errno = 0;  // reset errno
+         return (*original_callbacks.push_annotated_region)(
+               name, _annotations, sizeof(_annotations) / sizeof(omnitrace_annotation_t));
+      }
+
+      return (*original_callbacks.push_region)(name);
+   }
+
+Linking the Omnitrace libraries to another program
+=======================================================
+
+To link the ``omnitrace-user-library`` to another program, 
+use the following CMake and ``g++`` directives.
+
+CMake
+-------------------------------------------------------
+
+.. code-block:: cmake
+
+   find_package(omnitrace REQUIRED COMPONENTS user)
+   add_executable(foo foo.cpp)
+   target_link_libraries(foo PRIVATE omnitrace::omnitrace-user-library)
+
+g++ compilation
+-------------------------------------------------------
+
+Assuming Omnitrace is installed in ``/opt/omnitrace``, use the ``g++`` compiler 
+to build the application.
+
+.. code-block:: shell
+
+   g++ -I/opt/omnitrace/include -L/opt/omnitrace/lib foo.cpp -o foo -lomnitrace-user
+
+Output from the API example program
+========================================
+
+First, instrument and run the program.
+
+.. code-block:: shell
+
+   $ omnitrace-instrument -l --min-instructions=8 -E custom_push_region -o -- ./user-api
+   ...
+   $ omnitrace-run --profile --use-pid off --time-output off -- ./user-api.inst 20 4 100
+   Pushing custom region :: ./user-api.inst
+   [omnitrace][omnitrace_init_tooling] Instrumentation mode: Trace
+
+
+       ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
+      /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
+     |  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
+     |  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
+     |  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
+      \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|
+
+
+
+   Pushing custom region :: initialization
+   [./user-api.inst] Threads: 4
+   [./user-api.inst] Iterations: 100
+   [./user-api.inst] fibonacci(20)...
+   Pushing custom region :: thread_creation
+   Pushing custom region :: thread_wait
+   Pushing custom region :: run(20) x 100
+   Pushing custom region :: run(20) x 100
+   Pushing custom region :: run(20) x 100
+   Pushing custom region :: run(20) x 100
+   Pushing custom region :: run(20) x 100
+   [./user-api.inst] fibonacci(20) x 4 = 3382500
+   [omnitrace][86267][0][omnitrace_finalize] finalizing...
+
+
+   [omnitrace][86267][0] omnitrace : 5.190895 sec wall_clock,    2.748 mb peak_rss, 6.330000 sec cpu_clock,  121.9 % cpu_util [laps: 1]
+   [omnitrace][86267][0] user-api.inst/thread-0 : 5.078713 sec wall_clock, 4.722415 sec thread_cpu_clock,   93.0 % thread_cpu_util,    1.276 mb peak_rss [laps: 1]
+   [omnitrace][86267][0] user-api.inst/thread-1 : 0.322248 sec wall_clock, 0.322191 sec thread_cpu_clock,  100.0 % thread_cpu_util,    1.000 mb peak_rss [laps: 1]
+   [omnitrace][86267][0] user-api.inst/thread-2 : 0.323255 sec wall_clock, 0.323194 sec thread_cpu_clock,  100.0 % thread_cpu_util,    0.000 mb peak_rss [laps: 1]
+   [omnitrace][86267][0] user-api.inst/thread-3 : 0.323569 sec wall_clock, 0.323484 sec thread_cpu_clock,  100.0 % thread_cpu_util,    1.092 mb peak_rss [laps: 1]
+   [omnitrace][86267][0] user-api.inst/thread-4 : 0.324178 sec wall_clock, 0.324057 sec thread_cpu_clock,  100.0 % thread_cpu_util,    1.184 mb peak_rss [laps: 1]
+   [omnitrace][86267][0] Post-processing 51 cpu frequency and memory usage entries...
+
+   [omnitrace][wall_clock]|0> Outputting 'omnitrace-user-api.inst-output/wall_clock.json'...
+   [omnitrace][wall_clock]|0> Outputting 'omnitrace-user-api.inst-output/wall_clock.tree.json'...
+   [omnitrace][wall_clock]|0> Outputting 'omnitrace-user-api.inst-output/wall_clock.txt'...
+
+   [omnitrace][manager::finalize][metadata]> Outputting 'omnitrace-user-api.inst-output/metadata.json' and 'omnitrace-user-api.inst-output/functions.json'...
+   [omnitrace][86267][0][omnitrace_finalize] Finalized
+
+Then review the output.
+
+.. code-block:: shell
+
+   $ cat omnitrace-user-api.inst-output/wall_clock.txt
+   |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+   |                                                                              REAL-CLOCK TIMER (I.E. WALL-CLOCK TIMER)                                                                              |
+   |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+   |                                     LABEL                                       | COUNT  | DEPTH  |   METRIC   | UNITS  |   SUM    |   MEAN   |   MIN    |   MAX    |   VAR    | STDDEV   | % SELF |
+   |---------------------------------------------------------------------------------|--------|--------|------------|--------|----------|----------|----------|----------|----------|----------|--------|
+   | |0>>> ./user-api.inst                                                           |      1 |      0 | wall_clock | sec    | 5.078521 | 5.078521 | 5.078521 | 5.078521 | 0.000000 | 0.000000 |    0.0 |
+   | |0>>> |_initialization                                                          |      1 |      1 | wall_clock | sec    | 0.000004 | 0.000004 | 0.000004 | 0.000004 | 0.000000 | 0.000000 |  100.0 |
+   | |0>>> |_thread_creation                                                         |      1 |      1 | wall_clock | sec    | 0.000159 | 0.000159 | 0.000159 | 0.000159 | 0.000000 | 0.000000 |  100.0 |
+   | |0>>> |_thread_wait                                                             |      1 |      1 | wall_clock | sec    | 0.355307 | 0.355307 | 0.355307 | 0.355307 | 0.000000 | 0.000000 |    0.0 |
+   | |0>>>   |_std::vector<std::thread, std::allocator<std::thread> >::begin         |      1 |      2 | wall_clock | sec    | 0.000001 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |0>>>   |_std::vector<std::thread, std::allocator<std::thread> >::end           |      1 |      2 | wall_clock | sec    | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |  100.0 |
+   | |0>>>   |_pthread_join                                                          |      4 |      2 | wall_clock | sec    | 0.355257 | 0.088814 | 0.000001 | 0.333144 | 0.026559 | 0.162970 |  100.0 |
+   | |2>>>     |_start_thread                                                        |      1 |      3 | wall_clock | sec    | 0.000032 | 0.000032 | 0.000032 | 0.000032 | 0.000000 | 0.000000 |  100.0 |
+   | |1>>>     |_start_thread                                                        |      1 |      3 | wall_clock | sec    | 0.000036 | 0.000036 | 0.000036 | 0.000036 | 0.000000 | 0.000000 |  100.0 |
+   | |3>>>     |_start_thread                                                        |      1 |      3 | wall_clock | sec    | 0.000034 | 0.000034 | 0.000034 | 0.000034 | 0.000000 | 0.000000 |  100.0 |
+   | |4>>>     |_start_thread                                                        |      1 |      3 | wall_clock | sec    | 0.000039 | 0.000039 | 0.000039 | 0.000039 | 0.000000 | 0.000000 |  100.0 |
+   | |0>>> |_run                                                                     |      1 |      1 | wall_clock | sec    | 4.722993 | 4.722993 | 4.722993 | 4.722993 | 0.000000 | 0.000000 |    0.0 |
+   | |0>>>   |_std::char_traits<char>::length                                        |      1 |      2 | wall_clock | sec    | 0.000001 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |0>>>   |_std::distance<char const*>                                            |      1 |      2 | wall_clock | sec    | 0.000001 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |0>>>   |_std::operator+<char, std::char_traits<char>, std::allocator<char> >   |      2 |      2 | wall_clock | sec    | 0.000002 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |0>>>   |_run(20) x 100                                                         |      1 |      2 | wall_clock | sec    | 4.722951 | 4.722951 | 4.722951 | 4.722951 | 0.000000 | 0.000000 |    0.0 |
+   | |0>>>     |_run [{94,25}-{96,25}]                                               |      1 |      3 | wall_clock | sec    | 4.722925 | 4.722925 | 4.722925 | 4.722925 | 0.000000 | 0.000000 |    0.0 |
+   | |0>>>       |_fib                                                               |    100 |      4 | wall_clock | sec    | 4.722718 | 0.047227 | 0.046713 | 0.051987 | 0.000000 | 0.000625 |    0.0 |
+   | |0>>>         |_fib                                                             |    200 |      5 | wall_clock | sec    | 4.722302 | 0.023612 | 0.017827 | 0.034091 | 0.000032 | 0.005627 |    0.0 |
+   | |0>>>           |_fib                                                           |    400 |      6 | wall_clock | sec    | 4.721485 | 0.011804 | 0.006790 | 0.023003 | 0.000016 | 0.004024 |    0.0 |
+   | |0>>>             |_fib                                                         |    800 |      7 | wall_clock | sec    | 4.719858 | 0.005900 | 0.002564 | 0.016078 | 0.000006 | 0.002498 |    0.1 |
+   | |0>>>               |_fib                                                       |   1600 |      8 | wall_clock | sec    | 4.716572 | 0.002948 | 0.000977 | 0.011849 | 0.000002 | 0.001465 |    0.1 |
+   | |0>>>                 |_fib                                                     |   3200 |      9 | wall_clock | sec    | 4.709918 | 0.001472 | 0.000371 | 0.008246 | 0.000001 | 0.000831 |    0.3 |
+   | |0>>>                   |_fib                                                   |   6400 |     10 | wall_clock | sec    | 4.696775 | 0.000734 | 0.000140 | 0.005111 | 0.000000 | 0.000461 |    0.6 |
+   | |0>>>                     |_fib                                                 |  12800 |     11 | wall_clock | sec    | 4.670093 | 0.000365 | 0.000050 | 0.003166 | 0.000000 | 0.000253 |    1.1 |
+   | |0>>>                       |_fib                                               |  25600 |     12 | wall_clock | sec    | 4.617496 | 0.000180 | 0.000017 | 0.001959 | 0.000000 | 0.000137 |    2.3 |
+   | |0>>>                         |_fib                                             |  51200 |     13 | wall_clock | sec    | 4.512671 | 0.000088 | 0.000004 | 0.001212 | 0.000000 | 0.000074 |    4.6 |
+   | |0>>>                           |_fib                                           | 102400 |     14 | wall_clock | sec    | 4.304142 | 0.000042 | 0.000000 | 0.000752 | 0.000000 | 0.000039 |    9.6 |
+   | |0>>>                             |_fib                                         | 202600 |     15 | wall_clock | sec    | 3.892580 | 0.000019 | 0.000000 | 0.000469 | 0.000000 | 0.000021 |   19.0 |
+   | |0>>>                               |_fib                                       | 363200 |     16 | wall_clock | sec    | 3.151143 | 0.000009 | 0.000000 | 0.000293 | 0.000000 | 0.000011 |   33.2 |
+   | |0>>>                                 |_fib                                     | 502000 |     17 | wall_clock | sec    | 2.105217 | 0.000004 | 0.000000 | 0.000183 | 0.000000 | 0.000006 |   49.1 |
+   | |0>>>                                   |_fib                                   | 476000 |     18 | wall_clock | sec    | 1.071652 | 0.000002 | 0.000000 | 0.000114 | 0.000000 | 0.000004 |   63.6 |
+   | |0>>>                                     |_fib                                 | 294200 |     19 | wall_clock | sec    | 0.390193 | 0.000001 | 0.000000 | 0.000071 | 0.000000 | 0.000003 |   75.3 |
+   | |0>>>                                       |_fib                               | 115200 |     20 | wall_clock | sec    | 0.096190 | 0.000001 | 0.000000 | 0.000043 | 0.000000 | 0.000002 |   84.4 |
+   | |0>>>                                         |_fib                             |  27400 |     21 | wall_clock | sec    | 0.015020 | 0.000001 | 0.000000 | 0.000025 | 0.000000 | 0.000001 |   91.1 |
+   | |0>>>                                           |_fib                           |   3600 |     22 | wall_clock | sec    | 0.001336 | 0.000000 | 0.000000 | 0.000013 | 0.000000 | 0.000001 |   96.3 |
+   | |0>>>                                             |_fib                         |    200 |     23 | wall_clock | sec    | 0.000050 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |0>>>     |_std::char_traits<char>::length                                      |      1 |      3 | wall_clock | sec    | 0.000001 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |0>>>     |_std::distance<char const*>                                          |      1 |      3 | wall_clock | sec    | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |  100.0 |
+   | |0>>>     |_std::operator+<char, std::char_traits<char>, std::allocator<char> > |      2 |      3 | wall_clock | sec    | 0.000001 | 0.000001 | 0.000000 | 0.000001 | 0.000000 | 0.000000 |  100.0 |
+   | |0>>> |_std::operator&                                                          |      1 |      1 | wall_clock | sec    | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |  100.0 |
+   | |0>>> std::vector<std::thread, std::allocator<std::thread> >::~vector           |      1 |      0 | wall_clock | sec    | 0.000045 | 0.000045 | 0.000045 | 0.000045 | 0.000000 | 0.000000 |   32.7 |
+   | |0>>> |_std::thread::~thread                                                    |      4 |      1 | wall_clock | sec    | 0.000030 | 0.000007 | 0.000007 | 0.000009 | 0.000000 | 0.000001 |   31.2 |
+   | |0>>>   |_std::thread::joinable                                                 |      4 |      2 | wall_clock | sec    | 0.000021 | 0.000005 | 0.000005 | 0.000006 | 0.000000 | 0.000001 |   89.4 |
+   | |0>>>     |_std::thread::id::id                                                 |      4 |      3 | wall_clock | sec    | 0.000001 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |  100.0 |
+   | |0>>>     |_std::operator==                                                     |      4 |      3 | wall_clock | sec    | 0.000001 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |  100.0 |
+   | |0>>> |_std::allocator_traits<std::allocator<std::thread> >::deallocate         |      1 |      1 | wall_clock | sec    | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |  100.0 |
+   | |0>>> |_std::allocator<std::thread>::~allocator                                 |      1 |      1 | wall_clock | sec    | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |  100.0 |
+   |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
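+
+If only the text table is available, its rows can be split into fields with a few lines of Python.
+The following is a minimal sketch, not part of Omnitrace: the ``parse_row`` helper and the
+column keys are hypothetical names chosen to match the table header. The JSON output is
+generally the more robust input for post-processing.
+
+.. code-block:: python
+
+   # Keys for the columns that follow LABEL in the table header
+   COLUMNS = ("count", "depth", "metric", "units", "sum", "mean",
+              "min", "max", "var", "stddev", "self_pct")
+
+
+   def parse_row(line):
+       """Split one data row of wall_clock.txt into a label and named fields."""
+       body = line.strip().strip("|")
+       # Labels can contain '|' (for example '|0>>> |_fib'), so split from the right
+       label, *values = body.rsplit("|", len(COLUMNS))
+       row = {key: value.strip() for key, value in zip(COLUMNS, values)}
+       row["label"] = label.strip()
+       return row
+
+
+   # Row copied from the table above with the padding shortened
+   line = ("| |0>>> |_thread_wait | 1 | 1 | wall_clock | sec | 0.355307 | 0.355307 "
+           "| 0.355307 | 0.355307 | 0.000000 | 0.000000 | 0.0 |")
+   row = parse_row(line)
+   print(row["label"], row["count"], float(row["sum"]))
+
+Header and separator rows do not contain all eleven fields, so a caller would filter them out,
+for example by checking that ``"count" in row`` and that the value is numeric.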
@@ -0,0 +1,67 @@
+.. meta::
+   :description: Omnitrace documentation and reference
+   :keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
+
+***********************
+Omnitrace documentation
+***********************
+
+Omnitrace is designed for the high-level profiling and comprehensive tracing
+of applications running on the CPU or the CPU and GPU. It supports dynamic binary
+instrumentation, call-stack sampling, and various other features for determining
+which function and line number are currently executing. To learn more, see :doc:`what-is-omnitrace`.
+
+The code is open and hosted at `<https://github.com/ROCm/omnitrace>`_.
+
+
+.. grid:: 2
+  :gutter: 3
+
+  .. grid-item-card:: Install
+
+    * :doc:`Quick start <./install/quick-start>`
+    * :doc:`Omnitrace installation <./install/install>`
+
+
+The documentation is structured as follows:
+
+.. grid:: 2
+  :gutter: 3
+
+  .. grid-item-card:: Tutorials
+
+    * `GitHub examples <https://github.com/ROCm/omnitrace/tree/main/examples>`_
+    * :doc:`Video tutorials <./tutorials/video-tutorials>`
+
+  .. grid-item-card:: How to
+
+    * :doc:`Configuring and validating the Omnitrace environment <./how-to/configuring-validating-environment>`
+    * :doc:`Configuring runtime options <./how-to/configuring-runtime-options>`
+    * :doc:`Sampling the call stack <./how-to/sampling-call-stack>`
+    * :doc:`Instrumenting and rewriting a binary application <./how-to/instrumenting-rewriting-binary-application>`
+    * :doc:`Performing causal profiling <./how-to/performing-causal-profiling>`
+    * :doc:`Understanding the Omnitrace output <./how-to/understanding-omnitrace-output>`
+    * :doc:`Profiling Python scripts <./how-to/profiling-python-scripts>`
+    * :doc:`Using the Omnitrace API <./how-to/using-omnitrace-api>`
+    * :doc:`General tips for using Omnitrace <./how-to/general-tips-using-omnitrace>`
+
+
+  .. grid-item-card:: Conceptual
+
+    * :doc:`Data collection modes <./conceptual/data-collection-modes>`
+    * :doc:`The Omnitrace feature set <./conceptual/omnitrace-feature-set>`
+  
+  .. grid-item-card:: Reference
+
+    * :doc:`Development guide <./reference/development-guide>`
+    * :doc:`Omnitrace glossary <./reference/omnitrace-glossary>`
+    * :doc:`API library <./doxygen/html/files>`
+    * :doc:`Class member functions <./doxygen/html/functions>`
+    * :doc:`Globals <./doxygen/html/globals>`
+    * :doc:`Classes, structures, and interfaces <./doxygen/html/annotated>`
+
+To contribute to the documentation, refer to
+`Contributing to ROCm <https://rocm.docs.amd.com/en/latest/contribute/contributing.html>`_.
+
+You can find licensing information on the
+`Licensing <https://rocm.docs.amd.com/en/latest/about/license.html>`_ page.
@@ -0,0 +1,410 @@
+.. meta::
+   :description: Omnitrace documentation and reference
+   :keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
+
+*************************************
+Omnitrace installation
+*************************************
+
+The following information builds on the guidelines in the :doc:`Quick start <./quick-start>` guide.
+It covers how to install `Omnitrace <https://github.com/ROCm/omnitrace>`_ from source or a binary distribution,
+as well as the :ref:`post-installation-steps`.
+
+If you have problems using Omnitrace after installation,
+consult the :ref:`post-installation-troubleshooting` section.
+
+Release links
+========================================
+
+To review and install either the current Omnitrace release or earlier releases, use these links:
+
+* Latest Omnitrace Release: `<https://github.com/ROCm/omnitrace/releases/latest>`_ 
+* All Omnitrace Releases: `<https://github.com/ROCm/omnitrace/releases>`_ 
+
+Operating system support
+========================================
+
+Omnitrace is only supported on Linux. The following distributions are tested in the Omnitrace GitHub workflows:
+
+* Ubuntu 20.04
+* Ubuntu 22.04
+* OpenSUSE 15.3
+* OpenSUSE 15.4
+* Red Hat 8.7
+* Red Hat 9.0
+* Red Hat 9.1
+
+Other OS distributions might function but are not supported or tested.
+
+Identifying the operating system
+-----------------------------------
+
+If you are unsure of the operating system and version, the ``/etc/os-release`` and 
+``/usr/lib/os-release`` files contain operating system identification data for Linux systems.
+
+.. code-block:: shell
+
+   $ cat /etc/os-release
+
+.. code-block:: shell
+
+   NAME="Ubuntu"
+   VERSION="20.04.4 LTS (Focal Fossa)"
+   ID=ubuntu
+   ...
+   VERSION_ID="20.04"
+   ...
+
+The relevant fields are ``ID`` and ``VERSION_ID``.
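+
+These two fields can also be extracted programmatically. The following is a minimal sketch
+that parses text in the ``os-release`` format; the ``parse_os_release`` helper is a
+hypothetical name, and the sample text is abbreviated from the output above.
+
+.. code-block:: python
+
+   def parse_os_release(text):
+       """Parse KEY=VALUE lines in os-release format into a dictionary."""
+       fields = {}
+       for line in text.splitlines():
+           line = line.strip()
+           # Skip blank lines, comments, and anything that is not KEY=VALUE
+           if not line or line.startswith("#") or "=" not in line:
+               continue
+           key, _, value = line.partition("=")
+           fields[key] = value.strip().strip('"').strip("'")
+       return fields
+
+
+   sample = '''NAME="Ubuntu"
+   VERSION="20.04.4 LTS (Focal Fossa)"
+   ID=ubuntu
+   VERSION_ID="20.04"
+   '''
+   info = parse_os_release(sample)
+   print(info["ID"], info["VERSION_ID"])
+
+To inspect the local system, pass the contents of ``/etc/os-release`` to the helper instead
+of the sample text. On Python 3.10 and later, ``platform.freedesktop_os_release()`` returns
+a similar dictionary.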
+
+Architecture
+========================================
+
+With regard to instrumentation, only AMD64 (x86_64) architectures are currently tested. However,
+Dyninst supports several other architectures, so Omnitrace instrumentation might also work on
+CPU architectures such as aarch64 and ppc64.
+Other modes of use, such as sampling and causal profiling, do not depend on Dyninst and therefore
+might be more portable.
+
+Installing Omnitrace from binary distributions
+================================================
+
+Every Omnitrace release provides binary installer scripts of the form:
+
+.. code-block:: shell
+
+   omnitrace-{VERSION}-{OS_DISTRIB}-{OS_VERSION}[-ROCm-{ROCM_VERSION}][-{EXTRA}].sh
+
+For example:
+
+.. code-block:: shell
+
+   omnitrace-1.0.0-ubuntu-18.04-OMPT-PAPI-Python3.sh
+   omnitrace-1.0.0-ubuntu-18.04-ROCm-405000-OMPT-PAPI-Python3.sh
+   ...
+   omnitrace-1.0.0-ubuntu-20.04-ROCm-50000-OMPT-PAPI-Python3.sh
+
+Any ``EXTRA`` field that has a CMake build option 
+(for example, PAPI, as described in a later section) or 
+that has no link requirements (such as OMPT) indicates that the installer provides
+self-contained support for that package.
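+
+The naming scheme can be sketched as a small helper that assembles the script name from its
+parts. The ``installer_name`` function and its parameters are hypothetical and follow the
+example names above, in which the ``EXTRA`` fields appear with or without a ROCm version.
+Check the release page for the scripts that actually exist for a given release.
+
+.. code-block:: python
+
+   def installer_name(version, os_distrib, os_version, rocm_version=None, extras=()):
+       """Assemble an installer script name from its components."""
+       parts = ["omnitrace", version, os_distrib, os_version]
+       if rocm_version is not None:
+           # The ROCm version is encoded as a single number, for example 50000
+           parts += ["ROCm", str(rocm_version)]
+       parts += list(extras)
+       return "-".join(parts) + ".sh"
+
+
+   print(installer_name("1.0.0", "ubuntu", "20.04", rocm_version=50000,
+                        extras=("OMPT", "PAPI", "Python3")))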
+
+To install Omnitrace using a binary installer script, follow these steps:
+
+#. Download the appropriate binary distribution.
+
+   .. code-block:: shell
+
+      wget https://github.com/ROCm/omnitrace/releases/download/v<VERSION>/<SCRIPT>
+
+#. Create the target installation directory.
+
+   .. code-block:: shell
+
+      mkdir /opt/omnitrace
+
+#. Run the installer script.
+
+   .. code-block:: shell
+
+      ./omnitrace-1.0.0-ubuntu-18.04-ROCm-405000-OMPT-PAPI.sh --prefix=/opt/omnitrace --exclude-subdir
+
+Installing Omnitrace from source
+========================================
+
+Omnitrace requires a GCC compiler with full C++17 support, along with CMake v3.16 or later.
+The Clang compiler can be used instead of GCC if `Dyninst <https://github.com/dyninst/dyninst>`_
+is already installed.
+
+Build requirements
+-----------------------------------
+
+* GCC compiler v7+
+  
+  * Older GCC compilers may be supported but are not tested
+  * Clang compilers are generally supported for Omnitrace but not Dyninst
+  
+* `CMake <https://cmake.org/>`_ v3.16+
+
+  .. note::
+
+     If the installed version of CMake is too old, there are several ways to install a newer version.
+     One of the easiest options is to use the Python ``pip`` utility, as follows:
+
+     .. code-block:: shell
+
+        pip install --user 'cmake==3.18.4'
+        export PATH=${HOME}/.local/bin:${PATH}
+
+Required third-party packages
+-----------------------------------
+
+* `Dyninst <https://github.com/dyninst/dyninst>`_ for dynamic or static instrumentation. 
+  Dyninst uses the following required and optional components.
+
+  * `TBB <https://github.com/oneapi-src/oneTBB>`_ (required)
+  * `Elfutils <https://sourceware.org/elfutils/>`_ (required)
+  * `Libiberty <https://github.com/gcc-mirror/gcc/tree/master/libiberty>`_ (required)
+  * `Boost <https://www.boost.org/>`_ (required)
+  * `OpenMP <https://www.openmp.org/>`_ (optional)
+
+* `libunwind <https://www.nongnu.org/libunwind/>`_ for call-stack sampling
+
+Any of the third-party packages required by Dyninst, along with Dyninst itself, can be built and installed
+during the Omnitrace build. The following list indicates the package, the version,
+the application that requires the package (for example, Omnitrace requires Dyninst
+while Dyninst requires TBB), and the CMake option to build the package alongside Omnitrace:
+
+.. csv-table:: 
+   :header: "Third-Party Library", "Minimum Version", "Required By", "CMake Option"
+   :widths: 15, 10, 12, 40
+
+   "Dyninst", "12.0", "Omnitrace", "``OMNITRACE_BUILD_DYNINST`` (default: OFF)"
+   "Libunwind", "", "Omnitrace", "``OMNITRACE_BUILD_LIBUNWIND`` (default: ON)"
+   "TBB", "2018.6", "Dyninst", "``DYNINST_BUILD_TBB`` (default: OFF)"
+   "ElfUtils", "0.178", "Dyninst", "``DYNINST_BUILD_ELFUTILS`` (default: OFF)"
+   "LibIberty",  "", "Dyninst", "``DYNINST_BUILD_LIBIBERTY`` (default: OFF)"
+   "Boost",  "1.67.0", "Dyninst", "``DYNINST_BUILD_BOOST`` (default: OFF)"
+   "OpenMP", "4.x", "Dyninst", ""
+
+Optional third-party packages
+-----------------------------------
+
+* `ROCm <https://rocm.docs.amd.com/projects/install-on-linux/en/latest>`_
+
+  * HIP
+  * Roctracer for HIP API and kernel tracing
+  * ROCM-SMI for GPU monitoring
+  * Rocprofiler for GPU hardware counters
+
+* `PAPI <https://icl.utk.edu/papi/>`_
+* MPI
+
+  * ``OMNITRACE_USE_MPI`` enables full MPI support
+  * ``OMNITRACE_USE_MPI_HEADERS`` enables wrapping of the dynamically linked MPI C function calls.
+    (By default, if Omnitrace cannot find an Open MPI distribution, it uses a local copy 
+    of the Open MPI ``mpi.h``.)
+
+* Several optional third-party profiling tools supported by Timemory 
+  (for example, `Caliper <https://github.com/LLNL/Caliper>`_, `TAU <https://www.cs.uoregon.edu/research/tau/home.php>`_, CrayPAT, and others)
+
+.. csv-table:: 
+   :header: "Third-Party Library", "CMake Enable Option", "CMake Build Option"
+   :widths: 15, 45, 40
+
+   "PAPI", "``OMNITRACE_USE_PAPI`` (default: ON)", "``OMNITRACE_BUILD_PAPI`` (default: ON)"
+   "MPI", "``OMNITRACE_USE_MPI`` (default: OFF)", ""
+   "MPI (header-only)", "``OMNITRACE_USE_MPI_HEADERS`` (default: ON)", ""
+
+Installing Dyninst
+-----------------------------------
+
+The easiest way to install Dyninst is alongside Omnitrace, but it can also be installed using Spack.
+
+Building Dyninst alongside Omnitrace
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To install Dyninst alongside Omnitrace, configure Omnitrace with ``OMNITRACE_BUILD_DYNINST=ON``. 
+Depending on the version of Ubuntu, the ``apt`` package manager might provide sufficiently recent
+versions of the Boost, TBB, and LibIberty dependencies that Dyninst requires 
+(use ``apt-get install libtbb-dev libiberty-dev libboost-dev``). 
+Alternatively, Dyninst can build
+its own dependencies when ``DYNINST_BUILD_<DEP>=ON`` is set, as follows:
+
+.. code-block:: shell
+
+   git clone https://github.com/ROCm/omnitrace.git omnitrace-source
+   cmake -B omnitrace-build -DOMNITRACE_BUILD_DYNINST=ON -DDYNINST_BUILD_{TBB,ELFUTILS,BOOST,LIBIBERTY}=ON omnitrace-source
+
+where ``-DDYNINST_BUILD_{TBB,ELFUTILS,BOOST,LIBIBERTY}=ON`` is expanded by 
+shells that support brace expansion, such as Bash, to 
+``-DDYNINST_BUILD_TBB=ON -DDYNINST_BUILD_ELFUTILS=ON ...``
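+
+In a shell or build script that does not support brace expansion, each flag must be written
+out explicitly. The following minimal sketch generates the expanded flags; the
+``dyninst_build_flags`` helper is a hypothetical name used only for illustration.
+
+.. code-block:: python
+
+   def dyninst_build_flags(deps, value="ON"):
+       """Return one -DDYNINST_BUILD_<DEP>=<value> flag per dependency."""
+       return [f"-DDYNINST_BUILD_{dep}={value}" for dep in deps]
+
+
+   flags = dyninst_build_flags(["TBB", "ELFUTILS", "BOOST", "LIBIBERTY"])
+   print(" ".join(flags))
+
+The printed flags can be pasted into the ``cmake`` command in place of the brace expression.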
+
+Installing Dyninst via Spack
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+`Spack <https://github.com/spack/spack>`_ is another option to install Dyninst and its dependencies:
+
+.. code-block:: shell
+
+   git clone https://github.com/spack/spack.git
+   source ./spack/share/spack/setup-env.sh
+   spack compiler find
+   spack external find --all --not-buildable
+   spack spec -I --reuse dyninst
+   spack install --reuse dyninst
+   spack load -r dyninst
+
+Installing Omnitrace
+-----------------------------------
+
+Omnitrace has CMake configuration options for MPI support (``OMNITRACE_USE_MPI`` or 
+``OMNITRACE_USE_MPI_HEADERS``), HIP kernel tracing (``OMNITRACE_USE_ROCTRACER``), 
+ROCm device sampling (``OMNITRACE_USE_ROCM_SMI``), OpenMP-Tools (``OMNITRACE_USE_OMPT``), 
+hardware counters via PAPI (``OMNITRACE_USE_PAPI``), among other features.
+Various additional features can be enabled via the 
+``TIMEMORY_USE_*`` `CMake options <https://timemory.readthedocs.io/en/develop/installation.html#cmake-options>`_.
+If an ``OMNITRACE_USE_<VAL>`` option has a corresponding ``TIMEMORY_USE_<VAL>`` 
+option, the Timemory support for that feature has been integrated
+into the Perfetto support for Omnitrace. For example, ``OMNITRACE_USE_PAPI=<VAL>`` also configures 
+``TIMEMORY_USE_PAPI=<VAL>``. The data that Timemory collects via this package
+is therefore passed along to Perfetto and displayed when the ``.proto`` file is visualized 
+in `the Perfetto UI <https://ui.perfetto.dev>`_.
+
+.. code-block:: shell
+
+   git clone https://github.com/ROCm/omnitrace.git omnitrace-source
+   cmake                                       \
+       -B omnitrace-build                      \
+       -D CMAKE_INSTALL_PREFIX=/opt/omnitrace  \
+       -D OMNITRACE_USE_HIP=ON                 \
+       -D OMNITRACE_USE_ROCM_SMI=ON            \
+       -D OMNITRACE_USE_ROCTRACER=ON           \
+       -D OMNITRACE_USE_PYTHON=ON              \
+       -D OMNITRACE_USE_OMPT=ON                \
+       -D OMNITRACE_USE_MPI_HEADERS=ON         \
+       -D OMNITRACE_BUILD_PAPI=ON              \
+       -D OMNITRACE_BUILD_LIBUNWIND=ON         \
+       -D OMNITRACE_BUILD_DYNINST=ON           \
+       -D DYNINST_BUILD_TBB=ON                 \
+       -D DYNINST_BUILD_BOOST=ON               \
+       -D DYNINST_BUILD_ELFUTILS=ON            \
+       -D DYNINST_BUILD_LIBIBERTY=ON           \
+       omnitrace-source
+   cmake --build omnitrace-build --target all --parallel 8
+   cmake --build omnitrace-build --target install
+   source /opt/omnitrace/share/omnitrace/setup-env.sh
+
+.. _mpi-support-omnitrace:
+
+MPI support within Omnitrace
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Omnitrace can have full (``OMNITRACE_USE_MPI=ON``) or partial (``OMNITRACE_USE_MPI_HEADERS=ON``) MPI support.
+The only difference between these two modes is whether or not the results collected 
+via Timemory and/or Perfetto can be aggregated into a single
+output file during finalization. When full MPI support is enabled, combining the 
+Timemory results always occurs, whereas combining the Perfetto
+results is configurable via the ``OMNITRACE_PERFETTO_COMBINE_TRACES`` setting.
+
+The primary benefits of partial or full MPI support are the automatic wrapping 
+of MPI functions and the ability
+to label output with suffixes which correspond to the ``MPI_COMM_WORLD`` rank ID 
+instead of having to use the system process identifier (i.e. ``PID``).
+In general, it's recommended to use partial MPI support with the OpenMPI 
+headers as this is the most portable configuration.
+If full MPI support is selected, make sure your target application is built 
+against the same MPI distribution as Omnitrace.
+For example, do not build Omnitrace with MPICH and use it on a target application built against OpenMPI.
+If partial support is selected, the OpenMPI headers are recommended instead of the MPICH headers
+because ``MPI_COMM_WORLD`` in OpenMPI is a pointer to ``ompi_communicator_t`` (8 bytes), 
+whereas ``MPI_COMM_WORLD`` in MPICH is an ``int`` (4 bytes). Building Omnitrace with partial MPI support 
+and the MPICH headers and then using
+Omnitrace on an application built against OpenMPI causes a segmentation fault. 
+This happens because the value of ``MPI_COMM_WORLD`` is truncated
+during the function wrapping before being passed along to the underlying MPI function.
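+
+The truncation described above can be illustrated without any MPI library.
+The following is a minimal sketch, assuming a 64-bit platform; the function names
+and handle values are hypothetical and are not part of Omnitrace, OpenMPI, or MPICH.
+It only models what happens when an 8-byte handle is squeezed through a 4-byte integer.
+

```cpp
#include <cassert>
#include <cstdint>

// Models passing an 8-byte handle through a wrapper that was compiled
// against headers declaring the handle as a 4-byte integer.
inline std::uint64_t pass_through_4_byte_handle(std::uint64_t handle)
{
    // Only the lower 32 bits fit; the upper 32 bits are discarded.
    auto narrowed = static_cast<std::uint32_t>(handle);
    return static_cast<std::uint64_t>(narrowed);
}

// Returns true when the handle arrives unchanged on the other side.
inline bool handle_survives(std::uint64_t handle)
{
    return pass_through_4_byte_handle(handle) == handle;
}

inline void truncation_example()
{
    // A made-up 64-bit pointer value loses its upper half ...
    assert(!handle_survives(0x00007f3a12345678ULL));
    // ... while a value that already fits in 32 bits is unaffected.
    assert(handle_survives(0x12345678ULL));
}
```

+
+A pointer that has lost its upper 32 bits no longer refers to a valid object, 
+which is consistent with the segmentation fault described above.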
+
+.. _post-installation-steps:
+
+Post-installation steps
+========================================
+
+After installation, you can optionally configure the Omnitrace environment.
+You should also test the executables to confirm Omnitrace is correctly installed.
+
+Configure the environment
+-----------------------------------
+
+If environment modules are available and preferred, add them using these commands:
+
+.. code-block:: shell
+
+   module use /opt/omnitrace/share/modulefiles
+   module load omnitrace/1.0.0
+
+Alternatively, you can directly source the ``setup-env.sh`` script:
+
+.. code-block:: shell
+
+   source /opt/omnitrace/share/omnitrace/setup-env.sh
+
+Test the executables
+-----------------------------------
+
+Successful execution of these commands confirms that the installation does not have any 
+issues locating the installed libraries:
+
+.. code-block:: shell
+
+   omnitrace-instrument --help
+   omnitrace-avail --help
+
+.. note::
+
+   If ROCm support is enabled, you might have to add the path to the ROCm libraries to ``LD_LIBRARY_PATH``,
+   for example, ``export LD_LIBRARY_PATH=/opt/rocm/lib:${LD_LIBRARY_PATH}``.
+
+.. _post-installation-troubleshooting:
+
+Post-installation troubleshooting
+========================================
+
+This section explains how to resolve certain issues that might happen when you first use Omnitrace.
+
+Issues with RHEL and SELinux
+----------------------------------------------------
+
+RHEL (Red Hat Enterprise Linux) and related distributions of Linux automatically enable a security feature 
+named SELinux (Security-Enhanced Linux) that prevents Omnitrace from running.
+This issue applies to any Linux distribution with SELinux installed, including RHEL,
+CentOS, Fedora, and Rocky Linux. The problem can happen with any GPU, or even without a GPU.
+
+The problem occurs after you instrument a program and try to
+run ``omnitrace-run`` with the instrumented program.
+
+.. code-block:: shell
+
+   g++ hello.cpp -o hello
+   omnitrace-instrument -M sampling -o hello.instr -- ./hello
+   omnitrace-run -- ./hello.instr
+
+Instead of successfully running the binary with call-stack sampling, 
+Omnitrace crashes with a segmentation fault.
+
+.. note::
+
+   If you are physically logged in on the system (not using SSH or a remote connection),
+   the operating system might display an SELinux pop-up warning in the notifications.
+
+To work around this problem, either disable SELinux or configure it to use a more 
+permissive setting.
+
+To avoid this problem for the duration of the current session, run this command 
+from the shell:
+
+.. code-block:: shell
+
+   sudo setenforce 0
+
+For a permanent workaround, edit the SELinux configuration file using the command
+``sudo vim /etc/sysconfig/selinux`` and change the ``SELINUX`` setting to 
+either ``permissive`` or ``disabled``.
+
+.. note::
+
+   Permanently changing the SELinux settings can have security implications. 
+   Ensure you review your system security settings before making any changes.
+
+Modifying RPATH details
+----------------------------------------------------
+
+If you're experiencing problems loading your application with an instrumented library, 
+then you might have to check and modify the RPATH specified in your application. 
+See the section on `troubleshooting RPATHs <../how-to/instrumenting-rewriting-binary-application.html#rpath-troubleshooting>`_
+for further details.
+
+Configuring PAPI to collect hardware counters
+----------------------------------------------------
+
+To use PAPI to collect the majority of hardware counters, ensure 
+the ``/proc/sys/kernel/perf_event_paranoid`` setting has a value less than or equal to ``2``. 
+For more information, see the :ref:`omnitrace_papi_events` section.
@@ -0,0 +1,30 @@
+.. meta::
+   :description: Omnitrace documentation and reference
+   :keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
+
+*************************************
+Omnitrace quick start
+*************************************
+
+To install Omnitrace, download the `Omnitrace installer <https://github.com/ROCm/omnitrace/releases/latest/download/omnitrace-install.py>`_ 
+and specify ``--prefix <install-directory>``. The script attempts to auto-detect 
+the appropriate OS distribution and version. To include AMD ROCm Software support, 
+specify ``--rocm X.Y``, where ``X`` is the ROCm major
+version and ``Y`` is the ROCm minor version, for example, ``--rocm 6.2``.
+
+.. code-block:: shell
+
+   wget https://github.com/ROCm/omnitrace/releases/latest/download/omnitrace-install.py
+   python3 ./omnitrace-install.py --prefix /opt/omnitrace --rocm 6.2
+
+This script supports installation on Ubuntu, OpenSUSE, Red Hat, Debian, CentOS, and Fedora.
+If the target OS is compatible with one of the operating system versions listed in
+the comprehensive :doc:`Installation guidelines <./install>`,
+specify ``-d <DISTRO> -v <VERSION>``. For example, if the OS is compatible with Ubuntu 22.04, pass
+``-d ubuntu -v 22.04`` to the script.
+
+.. note::
+
+   If you have ROCm version 6.2 or higher installed, you can use the
+   package manager to install a pre-built copy of Omnitrace using 
+   ``apt install omnitrace`` or ``dnf install omnitrace``.
@@ -0,0 +1,4 @@
+# License
+
+```{include} ../LICENSE
+```
@@ -0,0 +1,412 @@
+.. meta::
+   :description: Omnitrace documentation and reference
+   :keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
+
+****************************************************
+Development guide
+****************************************************
+
+This guide discusses the `Omnitrace <https://github.com/ROCm/omnitrace>`_ design. 
+It includes a list of the executables and libraries, along with a discussion of the application's 
+memory, sampling, and time-window constraint models.
+
+Executables
+========================================
+
+This section lists the Omnitrace executables.
+
+omnitrace-avail: `source/bin/omnitrace-avail <https://github.com/ROCm/omnitrace/tree/main/source/bin/omnitrace-avail>`_
+-------------------------------------------------------------------------------------------------------------------------------
+
+The ``main`` routine of ``omnitrace-avail`` has three important sections:
+
+* Printing components
+* Printing options
+* Printing hardware counters
+
+omnitrace-sample: `source/bin/omnitrace-sample <https://github.com/ROCm/omnitrace/tree/main/source/bin/omnitrace-sample>`_
+-------------------------------------------------------------------------------------------------------------------------------
+
+* Requires a command-line format of ``omnitrace-sample <options> -- <command> <command-args>``
+* Translates command-line options into environment variables
+* Adds ``libomnitrace-dl.so`` to ``LD_PRELOAD``
+* Launches ``<command> <command-args>`` with the modified environment by using ``execvpe``
+
+omnitrace-causal: `source/bin/omnitrace-causal <https://github.com/ROCm/omnitrace/tree/main/source/bin/omnitrace-causal>`_
+-------------------------------------------------------------------------------------------------------------------------------
+
+When there is exactly one causal profiling configuration variant (which enables debugging),
+``omnitrace-causal`` has a nearly identical design to ``omnitrace-sample``.
+
+When the command-line options produce more than one causal profiling configuration variant,
+the following actions take place for each variant:
+
+* ``omnitrace-causal`` calls ``fork()``
+* the child process launches ``<command> <command-args>`` using ``execvpe`` with an environment modified for the variant
+* the parent process waits for the child process to finish
+
+omnitrace-instrument: `source/bin/omnitrace-instrument <https://github.com/ROCm/omnitrace/tree/main/source/bin/omnitrace-instrument>`_
+-------------------------------------------------------------------------------------------------------------------------------------------
+
+* Requires a command-line format of ``omnitrace-instrument <options> -- <command> <command-args>``
+* Allows the user to provide options specifying whether to perform runtime instrumentation, use binary rewrite, or 
+  attach to process
+* Either opens the instrumentation target (for binary rewrite), launches the target and stops it
+  before it starts executing ``main``, or attaches to a running executable and pauses it
+* Finds all functions in the targets
+* Finds ``libomnitrace-dl`` and locates the functions
+* Iterates over and instruments all the functions, provided they satisfy the 
+  defined criteria (such as a minimum number of instructions)
+
+  * See the ``module_function`` class
+
+* Until this point, the workflow has been the same for the different options, 
+  but it diverges after instrumentation is complete:
+
+  * For a binary rewrite: it produces a new instrumented binary and exits
+  * For runtime instrumentation or attaching to a process: it instructs the application 
+    to resume and then waits for it to exit
+
+Libraries
+========================================
+
+Common library: `source/lib/common <https://github.com/ROCm/omnitrace/tree/main/source/lib/common>`_
+--------------------------------------------------------------------------------------------------------------------------------
+
+* General header-only functionality used in multiple executables and/or libraries. 
+* Not installed or exported outside of the build tree.
+
+Core library: `source/lib/core <https://github.com/ROCm/omnitrace/tree/main/source/lib/core>`_
+--------------------------------------------------------------------------------------------------------------------------------
+
+* Static PIC library with functionality that does not depend on any components. 
+* Not installed or exported outside of the build tree.
+
+Binary library: `source/lib/binary <https://github.com/ROCm/omnitrace/tree/main/source/lib/binary>`_
+--------------------------------------------------------------------------------------------------------------------------------
+
+* Static PIC library with functionality for reading/analyzing binary info.
+* Mostly used by the causal profiling sections of ``libomnitrace``.
+* Not installed or exported outside of the build tree.
+
+libomnitrace: `source/lib/omnitrace <https://github.com/ROCm/omnitrace/tree/main/source/lib/omnitrace>`_
+--------------------------------------------------------------------------------------------------------------------------------
+
+This is the main library encapsulating all the capabilities.
+
+libomnitrace-dl: `source/lib/omnitrace-dl <https://github.com/ROCm/omnitrace/tree/main/source/lib/omnitrace-dl>`_
+--------------------------------------------------------------------------------------------------------------------------------
+
+This is a lightweight, front-end library for ``libomnitrace`` which serves three primary purposes:
+
+* Dramatically speeds up instrumentation time compared to using ``libomnitrace`` directly because 
+  Dyninst must parse the entire library in order to find the instrumentation functions 
+  (a ``dlopen`` call is made on ``libomnitrace`` when the instrumentation functions get called)
+* Prevents re-entry if ``libomnitrace`` calls an instrumented function internally
+* Coordinates communication between ``libomnitrace-user`` and ``libomnitrace``
+
+libomnitrace-user: `source/lib/omnitrace-user <https://github.com/ROCm/omnitrace/tree/main/source/lib/omnitrace-user>`_
+--------------------------------------------------------------------------------------------------------------------------------
+
+* Provides a set of functions and types for the users to add to their code, for example,
+  disabling data collection globally or on a specific thread or
+  user-defined region
+* If ``libomnitrace-dl`` is not loaded, the user API is effectively a set of no-op function calls.
+
+Testing tools
+========================================
+
+* `CDash Testing Dashboard <https://my.cdash.org/index.php?project=Omnitrace>`_ (requires a login)
+
+Components
+========================================
+
+Most measurements and capabilities are encapsulated into a "component" with the following definitions:
+
+Measurement
+   A recording of some data relevant to performance, for instance, the current call-stack, 
+   hardware counter values, current memory usage, or timestamp
+
+Capability
+   Handles the implementation or orchestration of some feature which is used 
+   to collect measurements, for example, a component which handles setting up function wrappers 
+   around various functions such as ``pthread_create`` or ``MPI_Init``.
+
+Components are designed to either hold no data at all or only the data for both an instantaneous 
+measurement and a phase measurement.
+
+Components which store data typically implement a static ``record()`` function 
+for getting a record of the measurement,
+``start()`` and ``stop()`` member functions for calculating a phase measurement, 
+and a ``sample()`` member function for storing an
+instantaneous measurement. In reality, there are several more "standard" functions 
+but these are the most commonly-used ones.
+
+Components which do not store data might also have ``start()``, ``stop()``, and ``sample()`` 
+functions. However, components which
+implement function wrappers typically provide a call operator or ``audit(...)`` 
+functions. These are invoked with the
+wrapped function's arguments before the wrapped function gets called and with the return value 
+after the wrapped function gets called.
+
+.. note::
+
+   The goal of this design is to provide relatively small, reusable, and lightweight objects 
+   for recording measurements and implementing capabilities.
+
+Wall-clock component example
+--------------------------------------
+
+A component for computing the elapsed wall-clock time looks like this:
+
+.. code-block:: cpp
+
+   struct wall_clock
+   {
+      using value_type = int64_t;
+
+      static value_type record() noexcept
+      {
+         return std::chrono::steady_clock::now().time_since_epoch().count();
+      }
+
+      void sample() noexcept
+      {
+         value = record();
+      }
+
+      void start() noexcept
+      {
+         value = record();
+      }
+
+      void stop() noexcept
+      {
+         auto _start_value = value;
+         value = record();
+         accum += (value - _start_value);
+      }
+
+   private:
+      int64_t value = 0;
+      int64_t accum = 0;
+   };
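+
+A phase measurement with such a component could be taken as follows. This is a minimal,
+self-contained sketch: ``wall_clock_sketch`` repeats the component above in condensed form, 
+while the ``get()`` accessor and the ``measure_phase`` helper are hypothetical additions 
+for illustration and are not part of Omnitrace or Timemory.
+

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Condensed copy of the wall-clock component with a hypothetical accessor.
struct wall_clock_sketch
{
    using value_type = std::int64_t;

    // Current steady-clock reading in clock ticks.
    static value_type record() noexcept
    {
        return std::chrono::steady_clock::now().time_since_epoch().count();
    }

    void start() noexcept { value = record(); }
    void stop() noexcept { accum += (record() - value); }

    // Hypothetical accessor: accumulated ticks over all start/stop phases.
    value_type get() const noexcept { return accum; }

private:
    value_type value = 0;
    value_type accum = 0;
};

// Hypothetical helper: wraps a callable in one start/stop phase.
template <typename Func>
std::int64_t measure_phase(Func&& func)
{
    wall_clock_sketch clock{};
    clock.start();
    func();
    clock.stop();
    return clock.get();
}

inline void wall_clock_example()
{
    auto ticks = measure_phase(
        [] { std::this_thread::sleep_for(std::chrono::milliseconds(2)); });
    assert(ticks > 0);  // some time elapsed during the phase
}
```

+
+Because ``stop()`` adds to ``accum`` instead of overwriting it, repeated 
+``start()`` and ``stop()`` calls on the same object accumulate the total time across phases.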
+
+Function wrapper component example
+--------------------------------------
+
+A component which implements wrappers around ``fork()`` and ``exit(int)`` (and stores no data) 
+could look like this:
+
+.. code-block:: cpp
+
+   struct function_wrapper
+   {
+      pid_t operator()(const gotcha_data&, pid_t (*real_fork)())
+      {
+         // disable all collection before forking
+         categories::disable_categories(config::get_enabled_categories());
+
+         auto _pid_v = real_fork();
+
+         // only re-enable collection on parent process
+         if(_pid_v != 0)
+            categories::enable_categories(config::get_enabled_categories());
+
+         return _pid_v;
+      }
+
+      void operator()(const gotcha_data&, void (*real_exit)(int), int _exit_code)
+      {
+         // catch the call to exit and finalize before truly exiting
+         omnitrace_finalize();
+
+         real_exit(_exit_code);
+      }
+   };
+
+Component member functions
+--------------------------------------
+
+There are no real restrictions or requirements on the member functions a component needs to provide.
+Unless the component is being used directly, the invocation of component member functions via a "component bundler"
+(provided by Timemory) makes extensive use of template metaprogramming concepts. This finds the best match, if any,
+for calling a component's member function. This is a bit easier to demonstrate using an example:
+
+.. code-block:: cpp
+
+   struct foo
+   {
+      void sample() { puts("foo::sample()"); }
+   };
+
+   struct bar
+   {
+      void sample(int) { puts("bar::sample(int)"); }
+   };
+
+   struct spam
+   {
+      void start()    { puts("spam::start()"); }
+      void stop()     { puts("spam::stop()"); }
+   };
+
+   int main()
+   {
+      auto _bundle = component_tuple<foo, bar, spam>{ "main" };
+
+      puts("A");
+      _bundle.start();
+
+      puts("B");
+      _bundle.sample(10);
+
+      puts("C");
+      _bundle.sample();
+
+      puts("D");
+      _bundle.stop();
+   }
+
+When the preceding code runs, the following messages are printed:
+
+.. code-block:: shell
+
+   A
+   spam::start()
+   B
+   foo::sample()
+   bar::sample(int)
+   C
+   foo::sample()
+   D
+   spam::stop()
+
+In section A, the bundle determined that only the ``spam`` object has a ``start`` function. Since this is determined
+via template metaprogramming instead of dynamic polymorphism, this effectively omits any code related to
+the ``foo`` or ``bar`` objects. In section B, because the integer ``10`` is passed to the bundle,
+the bundle forwards this value to ``bar::sample(int)`` after it invokes ``foo::sample()``. ``foo::sample()`` is
+invoked because the bundle recognizes that the call to the ``sample`` member function is still possible without
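+the argument. In section C, no argument is passed, so only ``foo::sample()`` is invoked because
+``bar::sample(int)`` requires an argument. In section D, only the ``spam`` object has a ``stop`` function.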
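+
+The compile-time matching described above can be approximated with a few lines of C++17.
+This is only a sketch of the general technique (the detection idiom plus ``if constexpr``),
+not Timemory's actual implementation. The names ``can_sample_v``, ``invoke_sample``, 
+``mini_bundle``, and ``call_log`` are hypothetical, and the components write to a log 
+instead of calling ``puts`` so the result can be checked.
+

```cpp
#include <cassert>
#include <string>
#include <tuple>
#include <type_traits>
#include <utility>
#include <vector>

// Type of the expression `obj.sample(args...)`, if that expression is valid.
template <typename T, typename... Args>
using sample_call_t = decltype(std::declval<T&>().sample(std::declval<Args>()...));

template <typename Enable, typename T, typename... Args>
struct can_sample_impl : std::false_type {};

template <typename T, typename... Args>
struct can_sample_impl<std::void_t<sample_call_t<T, Args...>>, T, Args...> : std::true_type {};

template <typename T, typename... Args>
inline constexpr bool can_sample_v = can_sample_impl<void, T, Args...>::value;

// Prefer sample(args...), fall back to sample(), otherwise generate no code.
template <typename T, typename... Args>
void invoke_sample([[maybe_unused]] T& obj, [[maybe_unused]] Args&&... args)
{
    if constexpr(can_sample_v<T, Args...>)
        obj.sample(std::forward<Args>(args)...);
    else if constexpr(can_sample_v<T>)
        obj.sample();
}

// Hypothetical log used in place of puts().
inline std::vector<std::string>& call_log()
{
    static std::vector<std::string> log{};
    return log;
}

struct foo  { void sample()    { call_log().push_back("foo::sample()"); } };
struct bar  { void sample(int) { call_log().push_back("bar::sample(int)"); } };
struct spam { void stop()      { call_log().push_back("spam::stop()"); } };

// Minimal stand-in for a component bundler: forwards sample() to each component in order.
template <typename... Components>
struct mini_bundle
{
    template <typename... Args>
    void sample(Args&&... args)
    {
        std::apply([&](auto&... component) { (invoke_sample(component, args...), ...); },
                   m_components);
    }

private:
    std::tuple<Components...> m_components{};
};

inline void bundle_example()
{
    mini_bundle<foo, bar, spam> bundle{};
    call_log().clear();
    bundle.sample(10);  // foo::sample() then bar::sample(int); spam is skipped
    assert(call_log().size() == 2);
}
```

+
+Because each branch is selected with ``if constexpr``, a component without a matching 
+``sample`` overload contributes no code to the call, which mirrors the behavior described for sections A through D.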
+
+Memory model
+========================================
+
+Collected data is generally handled in one of the three following ways:
+
+* It is handed directly to, and stored by, Perfetto
+* It is managed implicitly by Timemory and accessed as needed
+* It is stored as thread-local data
+
+In general, only instrumentation for relatively simple data is directly passed to 
+Perfetto and/or Timemory during runtime.
+For example, the callbacks from binary instrumentation, user API instrumentation, 
+and roctracer directly invoke
+calls to Perfetto or Timemory's storage model. Otherwise, the data is stored 
+by Omnitrace in the thread-data model
+which is more persistent than simply using ``thread_local`` static data, which gets deleted
+when the thread stops.
+
+Thread identification
+--------------------------------------
+
+Each CPU thread is assigned two integral identifiers. One identifier, the ``internal_value``, is 
+atomically incremented every time a new thread is created.
+The other identifier, known as the ``sequent_value``, tries to account for the fact that Omnitrace, Perfetto, ROCm, and other applications 
+start background threads. When a thread is created as a by-product of Omnitrace, 
+the index is offset by a large value. This serves
+two purposes:
+
+* Accessing the data for threads created by the user is closer in memory
+* When log messages are printed, the index approximately correlates to the order of thread creation from the user's perspective.
+
+The ``sequent_value`` identifier is typically used to access the thread-data.
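+
+The two-identifier scheme can be sketched as follows. This is an illustration under stated
+assumptions, not the Omnitrace implementation: the class name, the ``assign`` function, and
+the offset value are hypothetical. Only the ``internal_value`` and ``sequent_value`` 
+names come from the description above.
+

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct thread_index_sketch
{
    std::int64_t internal_value;  // creation order across all threads
    std::int64_t sequent_value;   // creation order from the user's perspective
};

class thread_index_allocator_sketch
{
public:
    // Arbitrary illustrative offset; not the value Omnitrace uses.
    static constexpr std::int64_t tool_thread_offset = 1'000'000;

    // created_by_tool: true for background threads started by the tooling itself.
    thread_index_sketch assign(bool created_by_tool)
    {
        // Incremented for every thread, regardless of who created it.
        std::int64_t internal = m_total++;
        // User threads stay densely numbered; tool threads are moved out of the way.
        std::int64_t sequent =
            created_by_tool ? tool_thread_offset + m_tool++ : m_user++;
        return { internal, sequent };
    }

private:
    std::atomic<std::int64_t> m_total{ 0 };
    std::atomic<std::int64_t> m_user{ 0 };
    std::atomic<std::int64_t> m_tool{ 0 };
};

inline void thread_index_example()
{
    thread_index_allocator_sketch allocator{};
    auto user_a = allocator.assign(false);
    auto tool   = allocator.assign(true);
    auto user_b = allocator.assign(false);
    // The background thread does not create a gap in the user-visible numbering.
    assert(user_b.sequent_value == user_a.sequent_value + 1);
    assert(tool.sequent_value >= thread_index_allocator_sketch::tool_thread_offset);
}
```
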
+
+Thread-data class
+--------------------------------------
+
+Currently, most thread data is effectively stored in a static 
+``std::array<std::unique_ptr<T>, OMNITRACE_MAX_THREADS>`` instance.
+``OMNITRACE_MAX_THREADS`` is a value defined at compile time and set to ``2048`` 
+for release builds. During finalization,
+Omnitrace iterates through the thread-data and transforms that data 
+into something that can be passed along to Perfetto and/or Timemory.
+The downside of the current model is that if the user exceeds ``OMNITRACE_MAX_THREADS``, 
+a segmentation fault occurs. To fix this issue,
+a new model is being adopted which has all the benefits of this model 
+but permits dynamic expansion.
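+
+The fixed-capacity model can be sketched as shown below. This is a simplified illustration,
+not the Omnitrace class: ``thread_data_sketch`` and ``max_threads_sketch`` are hypothetical names,
+and the bounds check that throws an exception is an addition made by this sketch 
+to make the failure mode visible instead of crashing.
+

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>

// Stand-in for OMNITRACE_MAX_THREADS in a release build.
constexpr std::size_t max_threads_sketch = 2048;

template <typename T>
struct thread_data_sketch
{
    // Returns the instance for a thread, creating it on first access.
    static T& get(std::size_t sequent_value)
    {
        // Added by this sketch: without a check, an out-of-range index
        // reads past the end of the array.
        if(sequent_value >= max_threads_sketch)
            throw std::out_of_range("thread index exceeds fixed capacity");
        auto& slot = instances()[sequent_value];
        if(!slot) slot = std::make_unique<T>();
        return *slot;
    }

private:
    // One static array per data type; entries outlive the threads that used them.
    static std::array<std::unique_ptr<T>, max_threads_sketch>& instances()
    {
        static std::array<std::unique_ptr<T>, max_threads_sketch> data{};
        return data;
    }
};

inline void thread_data_example()
{
    thread_data_sketch<int>::get(3) = 42;
    assert(thread_data_sketch<int>::get(3) == 42);  // same slot on later access
}
```

+
+Because the array is static rather than ``thread_local``, the entries remain available 
+during finalization after the owning threads have stopped.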
+
+Sampling model
+========================================
+
+The general structure for the sampling is within Timemory (``source/timemory/sampling``). 
+Currently, all sampling is done per-thread
+via POSIX timers. Omnitrace supports both a real-time timer and a CPU-time timer. 
+Both have adjustable frequencies, delays, and durations.
+By default, only CPU-time sampling is enabled. Initial settings are inherited from 
+the settings starting with ``OMNITRACE_SAMPLING_``.
+
+For each type of timer, timer-specific settings can be used to 
+override the common and inherited timer settings. 
+These settings begin with ``OMNITRACE_SAMPLING_CPUTIME`` for the CPU-time sampler
+and ``OMNITRACE_SAMPLING_REALTIME`` for
+the real-time sampler. For example, ``OMNITRACE_SAMPLING_FREQ=500`` initially sets the 
+sampling frequency to 500 interrupts per second. Adding the setting ``OMNITRACE_SAMPLING_REALTIME_FREQ=10`` 
+lowers the sampling frequency for the real-time sampler
+to 10 interrupts per second of real-time.
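+
+The precedence between the common and timer-specific settings can be sketched as a simple lookup.
+This is an illustration of the override order only, not Omnitrace's configuration code.
+The ``settings_sketch`` map and the ``resolve_sampling_freq`` function are hypothetical, and 
+``OMNITRACE_SAMPLING_CPUTIME_FREQ`` is assumed from the naming pattern described above.
+

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical container for settings that were explicitly provided.
using settings_sketch = std::map<std::string, double>;

// timer is "CPUTIME" or "REALTIME"; fallback is used when nothing is set.
inline double resolve_sampling_freq(const settings_sketch& settings,
                                    const std::string&     timer,
                                    double                 fallback)
{
    // 1. A timer-specific setting wins when present.
    auto specific = settings.find("OMNITRACE_SAMPLING_" + timer + "_FREQ");
    if(specific != settings.end()) return specific->second;

    // 2. Otherwise the common setting is inherited.
    auto common = settings.find("OMNITRACE_SAMPLING_FREQ");
    if(common != settings.end()) return common->second;

    // 3. Otherwise use the caller-supplied fallback.
    return fallback;
}

inline void sampling_freq_example()
{
    settings_sketch settings{ { "OMNITRACE_SAMPLING_FREQ", 500.0 },
                              { "OMNITRACE_SAMPLING_REALTIME_FREQ", 10.0 } };
    assert(resolve_sampling_freq(settings, "REALTIME", 100.0) == 10.0);
    assert(resolve_sampling_freq(settings, "CPUTIME", 100.0) == 500.0);
}
```
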
+
+The Omnitrace-specific implementation can be found in 
+`source/lib/omnitrace/library/sampling.cpp <https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/sampling.cpp>`_.
+Within `sampling.cpp <https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/sampling.cpp>`_, 
+there is a bundle of three sampling components:
+
+* `backtrace_timestamp <https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/components/backtrace_timestamp.hpp>`_ simply
+  records the wall-clock time of the sample.
+* `backtrace <https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/components/backtrace.hpp>`_
+  records the call-stack via libunwind.
+* `backtrace_metrics <https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/components/backtrace_metrics.hpp>`_
+  records the sample metrics, such as peak RSS and the hardware counters.
+
+These three components are bundled together in 
+a tuple-like ``struct`` (``tuple<backtrace_timestamp, backtrace, backtrace_metrics>``).
+A buffer of at least 1024 instances of this tuple is mapped using ``mmap`` 
+per-thread. When this buffer is full, 
+the sampler hands the buffer off to its allocator thread and maps a new buffer with ``mmap``
+before taking the next sample. The allocator thread takes this data 
+and either dynamically stores it in memory or writes it to a file depending on the 
+value of ``OMNITRACE_USE_TEMPORARY_FILES``.
+This schema avoids all allocations in the signal handler, lets the data grow 
+dynamically, avoids potentially slow I/O within the signal handler, and also enables 
+the capability of avoiding I/O altogether.
+The maximum number of samplers handled by each allocator is governed by the 
+``OMNITRACE_SAMPLING_ALLOCATOR_SIZE`` setting (the default is eight). Whenever an allocator 
+has reached its limit,
+a new internal thread is created to handle the new samplers.
+
+Time-window constraint model
+========================================
+
+With the recent introduction of tracing delay and duration, the 
+`constraint namespace <https://github.com/ROCm/omnitrace/blob/main/source/lib/core/constraint.hpp>`_
+was introduced to improve the management of delays and duration limits for 
+data collection. The ``spec`` class accepts a clock identifier, a delay value, a duration value, and an
+integer indicating how many times to repeat the delay and duration cycle. It is therefore 
+possible to perform tasks such as periodically enabling tracing for brief periods
+of time in between long periods without data collection while the application runs. The
+syntax follows the format ``clock_identifier:delay:capture_duration:cycles``, so a value of 
+``10:1:3`` for the last three parameters represents the following sequence of operations:
+
+* Ten seconds where no data is collected, then one second where it is
+* Ten seconds where no data is collected, then one second where it is 
+* Ten seconds where no data is collected, then one second where it is 
+* Stop
+
+As another example, ``OMNITRACE_TRACE_PERIODS = realtime:10:1:5 process_cputime:10:2:20`` translates
+to this sequence:
+
+* Five cycles of: no data collection for ten seconds of real-time followed by one second of data collection
+* Twenty cycles of: no data collection for ten seconds of process CPU time followed by two CPU-time seconds of data collection
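+
+Parsing a value in this format can be sketched as follows. This is a minimal example that 
+assumes all four fields are always present and numeric. The ``period_spec_sketch`` type and the 
+``parse_period`` and ``parse_periods`` functions are hypothetical and do not reflect the actual 
+``spec`` class in the constraint namespace.
+

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

struct period_spec_sketch
{
    std::string   clock_id;  // for example "realtime" or "process_cputime"
    double        delay;     // seconds without data collection
    double        duration;  // seconds with data collection
    unsigned long cycles;    // number of delay + duration repetitions
};

// Parses one "clock_identifier:delay:capture_duration:cycles" entry.
inline period_spec_sketch parse_period(const std::string& text)
{
    std::vector<std::string> fields{};
    std::istringstream       stream{ text };
    for(std::string field; std::getline(stream, field, ':');)
        fields.push_back(field);
    if(fields.size() != 4)
        throw std::invalid_argument("expected four ':'-separated fields: " + text);
    return { fields[0], std::stod(fields[1]), std::stod(fields[2]),
             std::stoul(fields[3]) };
}

// Parses a whitespace-separated list of entries.
inline std::vector<period_spec_sketch> parse_periods(const std::string& text)
{
    std::vector<period_spec_sketch> specs{};
    std::istringstream              stream{ text };
    for(std::string entry; stream >> entry;)
        specs.push_back(parse_period(entry));
    return specs;
}

inline void parse_periods_example()
{
    auto specs = parse_periods("realtime:10:1:5 process_cputime:10:2:20");
    assert(specs.size() == 2);
    // Total capture time of the first entry: 1 second x 5 cycles.
    assert(specs[0].duration * specs[0].cycles == 5.0);
}
```
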
+
+Eventually, the goal is to migrate all subsets of data collection which currently support 
+more rudimentary models of time window constraints, such as process sampling and causal profiling,
+to this model.
@@ -0,0 +1,102 @@
+.. meta::
+   :description: Omnitrace documentation and reference
+   :keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
+
+*******************
+Omnitrace Glossary
+*******************
+
+This topic explains the terminology necessary to use Omnitrace. 
+The list below provides a basic glossary for those who 
+are new to binary instrumentation. It also clarifies ambiguities 
+when certain terms have different 
+contextual meanings, for example, the Omnitrace meaning of the term "module" 
+when instrumenting Python.
+
+**Binary**
+  A file written in the Executable and Linkable Format (ELF). This is the standard file 
+  format for executable files, shared libraries, etc.
+
+**Binary instrumentation**
+  Inserting callbacks to instrumentation into an existing binary. This can be performed 
+  statically or dynamically.
+
+**Static binary instrumentation**
+  Loads an existing binary, determines instrumentation points, and generates a new binary 
+  with instrumentation directly embedded. It is applicable to executables and libraries but 
+  limited to only the functions defined in the binary. This is also known as **Binary rewrite**.
+
+**Dynamic binary instrumentation**
+  Loads an existing binary into memory, inserts instrumentation, and runs the binary. 
+  It is limited to executables but is capable of instrumenting linked libraries. 
+  This is also known as **Runtime instrumentation**.
+
+**Statistical sampling**  
+  At periodic intervals, the application is paused and the current call-stack of the CPU 
+  is recorded along with various other metrics. It uses timers that measure either 
+  (A) real clock time or (B) the CPU time used by the current thread and the CPU time 
+  expended on behalf of the thread by the system. This is also known as simply **sampling**.
+
+  **Sampling rate**
+    * The frequency at which (A) or (B) is triggered (in units of ``# interrupts / second``)
+    * Higher values increase the number of samples
+
+  **Sampling delay**
+    * How long to wait before (A) and (B) begin triggering at their designated rate
+
+  **Sampling duration**
+    * The amount of time (in real-time) after the start of the application to record samples. 
+    * After this time limit has been reached, no more samples are recorded.
+
+**Process sampling**
+  At periodic (real-time) intervals, a background thread records global metrics without 
+  interrupting the current process. These metrics include, but are not limited to: 
+  CPU frequency, CPU memory high-water mark (i.e. peak memory usage), GPU temperature,
+  and GPU power usage.
+
+  **Sampling rate**
+    * The real-time frequency for recording metrics (in units of ``# measurements / second``)
+    * Higher values increase the number of samples
+
+  **Sampling delay**
+    * How long to wait (in real-time) before recording samples
+
+  **Sampling duration**
+    * The amount of time (in real-time) after the start of the application to record samples. 
+    * After this time limit has been reached, no more samples are recorded.
+
+**Module**
+  With respect to binary instrumentation, a module is defined as either the filename 
+  (such as ``foo.c``) or library name (``libfoo.so``) which contains the definition 
+  of one or more functions.
+
+  With respect to Python instrumentation, a module is defined as the **file** which contains 
+  the definition of one or more functions. The full path to this file typically contains the 
+  name of the "Python module".
+
+**Basic block**
+  A straight-line code sequence with no branches in (except for the entry) and 
+  no branches out (except for the exit).
+
+**Address range**
+  The instructions for a function in a binary start at a certain address within the ELF file 
+  and end at a certain address. The range is ``end - start``.
+
+  The address range is a decent approximation for the "cost" of a function. 
+  For example, a larger address range approximately equates to more instructions.
+
+**Instrumentation traps**
+  On the x86 architecture, because instructions are of variable size, an instruction 
+  might be too small for Dyninst to replace it with the normal code sequence 
+  used to call instrumentation. When instrumentation is placed at points other 
+  than subroutine entry, exit, or call points, traps may be used to ensure 
+  the instrumentation fits. (By default, ``omnitrace-instrument`` avoids instrumentation 
+  which requires a trap.)
+
+**Overlapping functions**
+  Due to language constructs or compiler optimizations, it might be possible for 
+  multiple functions to overlap (that is, share part of the same function body) 
+  or for a single function to have multiple entry points. In practice, it's 
+  impossible to determine the difference between multiple overlapping functions 
+  and a single function with multiple entry points. (By default, ``omnitrace-instrument`` 
+  avoids instrumenting overlapping functions.)
@@ -0,0 +1,70 @@
+# Anywhere {branch} is used, the branch name will be substituted.
+# These comments will also be removed.
+defaults:
+  numbered: False
+  maxdepth: 6
+root: index
+subtrees:
+  - entries:
+    - file: what-is-omnitrace.rst
+
+  - caption: Install
+    entries:
+    - file: install/quick-start.rst
+      title: Omnitrace quick start
+    - file: install/install.rst
+      title: Omnitrace installation guide
+
+  - caption: Tutorials
+    entries:
+    - url: https://github.com/ROCm/omnitrace/tree/main/examples
+      title: GitHub examples
+    - file: tutorials/video-tutorials.rst
+      title: Video tutorials
+
+  - caption: How to
+    entries:
+    - file: how-to/configuring-validating-environment.rst
+      title: Configuring and validating the environment 
+    - file: how-to/configuring-runtime-options.rst
+      title: Configuring runtime options 
+    - file: how-to/sampling-call-stack.rst
+      title: Sampling the call stack 
+    - file: how-to/instrumenting-rewriting-binary-application.rst
+      title: Instrumenting and rewriting a binary application
+    - file: how-to/performing-causal-profiling.rst
+      title: Performing causal profiling 
+    - file: how-to/understanding-omnitrace-output.rst
+      title: Understanding the Omnitrace output 
+    - file: how-to/profiling-python-scripts.rst
+      title: Profiling Python scripts  
+    - file: how-to/using-omnitrace-api.rst
+      title: Using the Omnitrace API
+    - file: how-to/general-tips-using-omnitrace.rst 
+      title: General tips for using Omnitrace 
+
+  - caption: Conceptual
+    entries:
+    - file: conceptual/data-collection-modes.rst
+      title: Data collection modes
+    - file: conceptual/omnitrace-feature-set.rst
+      title: The Omnitrace feature set and use cases
+
+  - caption: Reference
+    entries:
+    - file: reference/development-guide.rst
+      title: Development guide
+    - file: reference/omnitrace-glossary.rst
+      title: Omnitrace glossary
+    - file: doxygen/html/files
+      title: API library
+    - file: doxygen/html/functions
+      title: Class member functions
+    - file: doxygen/html/globals
+      title: Globals
+    - file: doxygen/html/annotated
+      title: Classes, structures, and interfaces
+
+  - caption: About
+    entries:
+    - file: license.md
@@ -0,0 +1 @@
+rocm-docs-core[api_reference]==1.4.1
@@ -0,0 +1,169 @@
+#
+# This file is autogenerated by pip-compile with Python 3.10
+# by the following command:
+#
+#    pip-compile requirements.in
+#
+accessible-pygments==0.0.5
+    # via pydata-sphinx-theme
+alabaster==0.7.16
+    # via sphinx
+babel==2.15.0
+    # via
+    #   pydata-sphinx-theme
+    #   sphinx
+beautifulsoup4==4.12.3
+    # via pydata-sphinx-theme
+breathe==4.35.0
+    # via rocm-docs-core
+certifi==2024.6.2
+    # via requests
+cffi==1.16.0
+    # via
+    #   cryptography
+    #   pynacl
+charset-normalizer==3.3.2
+    # via requests
+click==8.1.7
+    # via
+    #   click-log
+    #   doxysphinx
+    #   sphinx-external-toc
+click-log==0.4.0
+    # via doxysphinx
+cryptography==42.0.8
+    # via pyjwt
+deprecated==1.2.14
+    # via pygithub
+docutils==0.21.2
+    # via
+    #   breathe
+    #   myst-parser
+    #   pydata-sphinx-theme
+    #   sphinx
+doxysphinx==3.3.9
+    # via rocm-docs-core
+fastjsonschema==2.20.0
+    # via rocm-docs-core
+gitdb==4.0.11
+    # via gitpython
+gitpython==3.1.43
+    # via rocm-docs-core
+idna==3.7
+    # via requests
+imagesize==1.4.1
+    # via sphinx
+jinja2==3.1.4
+    # via
+    #   myst-parser
+    #   sphinx
+libsass==0.22.0
+    # via doxysphinx
+lxml==4.9.4
+    # via doxysphinx
+markdown-it-py==3.0.0
+    # via
+    #   mdit-py-plugins
+    #   myst-parser
+markupsafe==2.1.5
+    # via jinja2
+mdit-py-plugins==0.4.1
+    # via myst-parser
+mdurl==0.1.2
+    # via markdown-it-py
+mpire==2.10.2
+    # via doxysphinx
+myst-parser==3.0.1
+    # via rocm-docs-core
+numpy==1.26.4
+    # via doxysphinx
+packaging==24.1
+    # via
+    #   pydata-sphinx-theme
+    #   sphinx
+pycparser==2.22
+    # via cffi
+pydata-sphinx-theme==0.15.4
+    # via
+    #   rocm-docs-core
+    #   sphinx-book-theme
+pygithub==2.3.0
+    # via rocm-docs-core
+pygments==2.18.0
+    # via
+    #   accessible-pygments
+    #   mpire
+    #   pydata-sphinx-theme
+    #   sphinx
+pyjson5==1.6.6
+    # via doxysphinx
+pyjwt[crypto]==2.8.0
+    # via pygithub
+pynacl==1.5.0
+    # via pygithub
+pyparsing==3.1.2
+    # via doxysphinx
+pyyaml==6.0.1
+    # via
+    #   myst-parser
+    #   rocm-docs-core
+    #   sphinx-external-toc
+requests==2.32.3
+    # via
+    #   pygithub
+    #   sphinx
+rocm-docs-core[api_reference]==1.4.1
+    # via -r requirements.in
+smmap==5.0.1
+    # via gitdb
+snowballstemmer==2.2.0
+    # via sphinx
+soupsieve==2.5
+    # via beautifulsoup4
+sphinx==7.3.7
+    # via
+    #   breathe
+    #   myst-parser
+    #   pydata-sphinx-theme
+    #   rocm-docs-core
+    #   sphinx-book-theme
+    #   sphinx-copybutton
+    #   sphinx-design
+    #   sphinx-external-toc
+    #   sphinx-notfound-page
+sphinx-book-theme==1.1.3
+    # via rocm-docs-core
+sphinx-copybutton==0.5.2
+    # via rocm-docs-core
+sphinx-design==0.6.0
+    # via rocm-docs-core
+sphinx-external-toc==1.0.1
+    # via rocm-docs-core
+sphinx-notfound-page==1.0.2
+    # via rocm-docs-core
+sphinxcontrib-applehelp==1.0.8
+    # via sphinx
+sphinxcontrib-devhelp==1.0.6
+    # via sphinx
+sphinxcontrib-htmlhelp==2.0.5
+    # via sphinx
+sphinxcontrib-jsmath==1.0.1
+    # via sphinx
+sphinxcontrib-qthelp==1.0.7
+    # via sphinx
+sphinxcontrib-serializinghtml==1.1.10
+    # via sphinx
+tomli==2.0.1
+    # via sphinx
+tqdm==4.66.4
+    # via mpire
+typing-extensions==4.12.2
+    # via
+    #   pydata-sphinx-theme
+    #   pygithub
+urllib3==2.2.2
+    # via
+    #   pygithub
+    #   requests
+wrapt==1.16.0
+    # via deprecated
@@ -0,0 +1,35 @@
+.. meta::
+   :description: Omnitrace documentation and reference
+   :keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
+
+****************************************************
+Video tutorials
+****************************************************
+
+Installing a binary release
+========================================
+
+.. raw:: html
+
+   <p align="center"><iframe width="560" height="315" src="https://www.youtube.com/embed/gKtNCKf1IXA?modestbranding=1" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>
+
+Instrumenting a binary
+========================================
+
+.. raw:: html
+
+   <p align="center"><iframe width="560" height="315" src="https://www.youtube.com/embed/2B0gRr3FygQ?modestbranding=1" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>
+
+Writing an Omnitrace configuration file
+========================================
+
+.. raw:: html
+
+   <p align="center"><iframe width="560" height="315" src="https://www.youtube.com/embed/oG_fPYx9_gs?modestbranding=1" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>
+
+Visualization and features of Perfetto traces
+=============================================
+
+.. raw:: html
+
+   <p align="center"><iframe width="560" height="315" src="https://www.youtube.com/embed/7WN3N1hnCbI?modestbranding=1" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>
@@ -0,0 +1,28 @@
+.. meta::
+   :description: Omnitrace documentation and reference
+   :keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
+
+******************
+What is Omnitrace?
+******************
+
+Omnitrace is designed for the high-level profiling and comprehensive tracing
+of applications running on the CPU or the CPU and GPU. It supports dynamic binary
+instrumentation, call-stack sampling, and various other features for determining
+which function and line number are currently executing.
+
+The comprehensive Omnitrace results can be visualized in any modern web browser.
+Upload the Perfetto (``.proto``) output files produced by Omnitrace to
+`ui.perfetto.dev <https://ui.perfetto.dev/>`_ to see the details.
+
+Aggregated high-level results are available as human-readable text files and 
+JSON files for programmatic analysis. The JSON output files are compatible with the 
+`hatchet <https://github.com/hatchet/hatchet>`_ Python package. Hatchet converts
+the performance data into pandas data frames and facilitates multi-run comparisons, filtering, 
+and visualization in Jupyter notebooks.
+
+To use Omnitrace for instrumentation, follow these two configuration steps:
+
+#. Indicate the functions and modules to :doc:`instrument <./how-to/instrumenting-rewriting-binary-application>` in the target binaries, including the executable and any libraries
+#. Specify the :doc:`instrumentation parameters <./how-to/configuring-runtime-options>` to use when the instrumented binaries are launched
+