Omnitrace docs refactoring (#353)
* Add Sphinx and Read the Docs configs
* Add documentation workflow configurations
* Changed macros verbprintf and verbprintf_bare so they write to stdout… (#346)
Flush stdout when listing keys + bump verbose level for GPU count
* Removing static version asserts. (#347)
It is causing failures on our internal builds
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
* Check for an empty vector before popping (#350)
Protect from possible seg. fault
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
* Add release links to installation.md (#351)
* Initial infrastructure rework for Omnitrace refactoring and a rewrite of the What is file
* Add files in conceptual section, along with images and infrastructure changes.
* Formatting and style fixes for files in conceptual directory
* Add quick start install guide and fix spelling errors in other files
* Add install document and fix code tags. Infrastructure changes
* Add two how-to guides along with infra changes and spelling fixes
* Add two new how to files and fix errors in the last commit
* Fix spelling mistakes
* Add new how to file on causal profiling and infra changes.
* Add how to file on interpreting Omnitrace output, fixes, and images
* Add remaining how-to guides and reference materials along with fixes and infrastructure
* Add YouTube file and fix spelling and formatting
* Fix a few loose ends and add link to license page
* Add Sphinx and Doxygen infrastructure and some additional corrections
* Update rocm-docs-core
* Fix Doxyfile
* Fix path to API header files
* Run doxysphinx in conf.py
* Add back custom css for doxygen
* Remove doxygenlayout
* Add api to toc
* Update Doxyfile
Generate from source .in
* Proofreading edits and other changes
* Add .gitignore for Doxygen and remove deprecated words and typos
* Fix one additional typo
* Turn off dot
* Update doxyfile strip from path
* Workflow, submodules, and thread info Updates (#352)
* Update CI workflows
- use node20 workflow packages
* Update tests/source/CMakeLists.txt
- Use OMNITRACE_TRACE and OMNTRACE_PROFILE instead of perfetto/timemory
* Update timemory submodule
- argparse: requires -> required
- parse callbacks
* Update thread_info.cpp
- fix causal::delay::get_local usage
* Update timemory submodule
* Update kokkos submodule
- release 3.7.02
* Revert opensuse.yml and ubuntu-bionic.yml to use node16 workflows
* Update docs.yml
* ROCm 6.1 Installers (#349)
* Add ROCm 6.1 to packages
* Bump version to 1.11.3
* Add 6.1 support to the docker build support.
Simplified this by adding 6.* to case statements, now that repo links have been standardized.
* Update timemory submodule (#354)
- fix argparse::argument::required template deduction
* Build omnitrace-rt library (#355)
* Build omnitrace-rt library
- Explicitly build dyninstAPI_RT as omnitrace-rt so that the SONAME in the ELF is omnitrace-rt instead of dyninstAPI_RT
- Create symbolic link lib/omnitrace/libdyninstAPI_RT.so which points to lib/libomnitrace-rt.so
- Simplify build tree location of libomnitrace-rt.so since it is ../lib from the bin directory even in the build tree
- Update dyninst submodule with minor tweaks to dyninstAPI_RT/CMakeLists.txt
* Update source/lib/omnitrace-rt/cmake/platform.cmake
* Use ftpmirror.gnu.org instead of ftp.gnu.org
- in timemory and dyninst submodules
- minor .clang-tidy tweak
* Executables append omnitrace library directory to LD_LIBRARY_PATH (#356)
- omnitrace-run, omnitrace-sample, and omnitrace-causal now automatically append the directory containing the omnitrace libraries to LD_LIBRARY_PATH
- this helps ensure that binary rewritten exes can resolve omnitrace-rt library location
* Fix a few typos and formatting issues
* Additional fixes and minor formatting changes.
* More fixes and minor formatting changes.
* Complete second proofreading with fixes and minor formatting changes.
* Make changes to table of contents and disable linting
* Update links in the README doc to reflect the new structure.
* Align intro on the Omnitrace index page with the first paragraph of the what-is page
* Changes and edits based on review comments
* Additional changes and edits based on external review
* Additional updates and changes from the external review of Omnitrace
* Additional changes based on the external review
* New round of edits based on the external review
* Additional edits based on the external review
* Changes to address comments from the internal review
* Correction to the RHEL SELinux note in the troubleshooting guide
* One additional change to the development guide code example
* Move troubleshooting to post-install of install.rst and other minor edits.
* Remove troubleshooting page and modify new post-install troubleshooting section on install.rst
* Refactor the how Omnitrace works page into separate topics and redo infrastructure
* API ToC changes
* Additional API and ToC changes
* Back out API and ToC changes and update requirements.txt
* Additional API and ToC changes
* Add commit for signing purposes
* Add ElfUtils and BinUtils Download URL Overrides (#358)
* Add CMake CACHE Variable ElfUtils_DOWNLOAD_URL
Used to override the default URL to download ElfUtils from.
Useful for internal builds
Also, include a mirror to fall back to if the override URL fails.
* Update timemory submodule
Updating to include the BINUTIL_DOWNLOAD_URL override cmake
variable.
---------
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
* Remove Ubuntu 18.04 and SUSE 15.2
* Update checkout action to v4
* Add `docs/**` to `paths-ignore`
Document location is being refactored.
* Modified submodules dyninst and timemory. (#361)
---------
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
Co-authored-by: Peter Jun Park <peter.park@amd.com>
Co-authored-by: ajanicijamd <Aleksandar.Janicijevic@amd.com>
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
Co-authored-by: Jonathan R. Madsen <jrmadsen@users.noreply.github.com>
Co-authored-by: Sam Wu <22262939+samjwu@users.noreply.github.com>
[ROCm/rocprofiler-systems commit: 0689797736]
@@ -4,3 +4,4 @@
|
||||
docs/* @ROCm/rocm-documentation
|
||||
*.md @ROCm/rocm-documentation
|
||||
*.rst @ROCm/rocm-documentation
|
||||
.readthedocs.yaml @ROCm/rocm-documentation
|
||||
|
||||
@@ -9,3 +9,14 @@ updates:
|
||||
directory: "/" # Location of package manifests
|
||||
schedule:
|
||||
interval: "weekly"
|
||||
|
||||
- package-ecosystem: "pip" # See documentation for possible values
|
||||
directory: "/docs/sphinx" # Location of package manifests
|
||||
open-pull-requests-limit: 10
|
||||
schedule:
|
||||
interval: "daily"
|
||||
labels:
|
||||
- "documentation"
|
||||
- "dependencies"
|
||||
reviewers:
|
||||
- "samjwu"
|
||||
|
||||
@@ -37,6 +37,10 @@
|
||||
# Python cache files
|
||||
*.pyc
|
||||
|
||||
# Documentation artifacts
|
||||
/_build
|
||||
_toc.yml
|
||||
|
||||
/build*
|
||||
/.vscode
|
||||
/.cache
|
||||
|
||||
@@ -0,0 +1,18 @@
|
||||
# Read the Docs configuration file
|
||||
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
|
||||
|
||||
version: 2
|
||||
|
||||
build:
|
||||
os: ubuntu-22.04
|
||||
tools:
|
||||
python: "3.10"
|
||||
|
||||
python:
|
||||
install:
|
||||
- requirements: docs/sphinx/requirements.txt
|
||||
|
||||
sphinx:
|
||||
configuration: docs/conf.py
|
||||
|
||||
formats: []
|
||||
@@ -7,8 +7,6 @@
|
||||
[](https://github.com/ROCm/omnitrace/actions/workflows/cpack.yml)
|
||||
[](https://github.com/ROCm/omnitrace/actions/workflows/docs.yml)
|
||||
|
||||
> ***[Omnitrace](https://github.com/ROCm/omnitrace) is an AMD open source research project and is not supported as part of the ROCm software stack.***
|
||||
|
||||
## Overview
|
||||
|
||||
AMD Research is seeking to improve observability and performance analysis for software running on AMD heterogeneous systems.
|
||||
@@ -86,8 +84,8 @@ such as the memory usage, page-faults, and context-switches, and thread-level me
|
||||
|
||||
## Documentation
|
||||
|
||||
The full documentation for [omnitrace](https://github.com/ROCm/omnitrace) is available at [rocm.github.io/omnitrace](https://rocm.github.io/omnitrace/).
|
||||
See the [Getting Started documentation](https://rocm.github.io/omnitrace/getting_started) for general tips and a detailed discussion about sampling vs. binary instrumentation.
|
||||
The full documentation for [omnitrace](https://github.com/ROCm/omnitrace) is available at [the ROCm Omnitrace documentation repository](https://rocm.docs.amd.com/projects/omnitrace/en/latest/index.html).
|
||||
See the [Getting Started documentation](https://rocm.docs.amd.com/projects/omnitrace/en/conceptual/how-omnitrace-works.html) for general tips and a detailed discussion about sampling vs. binary instrumentation.
|
||||
|
||||
## Quick Start
|
||||
|
||||
@@ -108,7 +106,7 @@ wget https://github.com/ROCm/omnitrace/releases/latest/download/omnitrace-instal
|
||||
python3 ./omnitrace-install.py --prefix /opt/omnitrace/rocm-5.4 --rocm 5.4
|
||||
```
|
||||
|
||||
See the [Installation Documentation](https://rocm.github.io/omnitrace/installation) for detailed information.
|
||||
See the [Installation Documentation](https://rocm.docs.amd.com/projects/omnitrace/en/install/install.html) for detailed information.
|
||||
|
||||
### Setup
|
||||
|
||||
@@ -297,13 +295,13 @@ for `foo` via the direct call within `spam`. There will be no entries for `bar`
|
||||
- Select "Open trace file" from panel on the left
|
||||
- Locate the omnitrace perfetto output (extension: `.proto`)
|
||||
|
||||

|
||||

|
||||
|
||||

|
||||

|
||||
|
||||

|
||||

|
||||
|
||||

|
||||

|
||||
|
||||
## Using Perfetto tracing with System Backend
|
||||
|
||||
|
||||
@@ -0,0 +1,2 @@
|
||||
_build/
|
||||
_doxygen/
|
||||
@@ -0,0 +1,146 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
**********************
|
||||
Data collection modes
|
||||
**********************
|
||||
|
||||
Omnitrace supports several modes of recording trace and profiling data for your application.
|
||||
|
||||
.. note::
|
||||
|
||||
For an explanation of the terms used in this topic, see
|
||||
the :doc:`Omnitrace glossary <../reference/omnitrace-glossary>`.
|
||||
|
||||
+-----------------------------+---------------------------------------------------------+
| Mode                        | Description                                             |
+=============================+=========================================================+
| Binary Instrumentation      | Locates functions (and loops, if desired) in the binary |
|                             | and inserts snippets at the entry and exit              |
+-----------------------------+---------------------------------------------------------+
| Statistical Sampling        | Periodically pauses application at specified intervals  |
|                             | and records various metrics for the given call stack    |
+-----------------------------+---------------------------------------------------------+
| Callback APIs               | Parallelism frameworks such as ROCm, OpenMP, and Kokkos |
|                             | make callbacks into Omnitrace to provide information    |
|                             | about the work the API is performing                    |
+-----------------------------+---------------------------------------------------------+
| Dynamic Symbol Interception | Wrap function symbols defined in a position independent |
|                             | dynamic library/executable, like ``pthread_mutex_lock`` |
|                             | in ``libpthread.so`` or ``MPI_Init`` in the MPI library |
+-----------------------------+---------------------------------------------------------+
| User API                    | User-defined regions and controls for Omnitrace         |
+-----------------------------+---------------------------------------------------------+
|
||||
|
||||
The two most generic and important modes are binary instrumentation and statistical sampling.
|
||||
It is important to understand their advantages and disadvantages.
|
||||
Binary instrumentation and statistical sampling can be performed with the ``omnitrace-instrument``
|
||||
executable. For statistical sampling, it's highly recommended to use the
|
||||
``omnitrace-sample`` executable instead if binary instrumentation isn't required or needed.
|
||||
Callback APIs and dynamic symbol interception can be utilized with either tool.
|
||||
|
||||
Binary instrumentation
|
||||
-----------------------------------
|
||||
|
||||
Binary instrumentation lets you record deterministic measurements for
|
||||
every single invocation of a given function.
|
||||
Binary instrumentation effectively adds instructions to the target application to
|
||||
collect the required information. It therefore has the potential to cause performance
|
||||
changes which might, in some cases, lead to inaccurate results. The effect depends on
|
||||
the information being collected and which features are activated in Omnitrace.
|
||||
For example, collecting only the wall-clock timing data
|
||||
has less of an effect than collecting the wall-clock timing, CPU-clock timing,
|
||||
memory usage, cache-misses, and number of instructions that were run. Similarly,
|
||||
collecting a flat profile has less overhead than a hierarchical profile
|
||||
and collecting a trace OR a profile has less overhead than collecting a
|
||||
trace AND a profile.
|
||||
|
||||
In Omnitrace, the primary heuristic for controlling the overhead with binary
|
||||
instrumentation is the minimum number of instructions for selecting functions
|
||||
for instrumentation.
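
As a rough illustration of this heuristic, the following Python sketch filters a set of functions by a minimum instruction count. The function names, the instruction counts, and the ``select_for_instrumentation`` helper are hypothetical and are not part of Omnitrace; the actual selection is performed by ``omnitrace-instrument`` on the binary itself.

```python
# Hypothetical instruction counts per function (assumed values for illustration).
functions = {"main": 180, "run": 64, "fib": 20, "solve_system": 4096}


def select_for_instrumentation(instruction_counts, min_instructions):
    """Return the names of functions with at least `min_instructions` instructions."""
    return sorted(
        name
        for name, count in instruction_counts.items()
        if count >= min_instructions
    )


# A high threshold keeps only large functions, which limits the overhead.
print(select_for_instrumentation(functions, min_instructions=1024))
# A lower threshold instruments more functions at the cost of more overhead.
print(select_for_instrumentation(functions, min_instructions=50))
```

Raising the threshold trades detail for lower overhead: small, frequently called functions drop out of the selection first.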
|
||||
|
||||
Statistical sampling
|
||||
-----------------------------------
|
||||
|
||||
Statistical call-stack sampling periodically interrupts the application at
|
||||
regular intervals using operating system interrupts.
|
||||
Sampling is typically less numerically accurate and specific, but the
|
||||
target program runs at nearly full speed.
|
||||
In contrast to the data derived from binary instrumentation, the resulting
|
||||
data is not exact but is instead a statistical approximation.
|
||||
However, sampling often provides a more accurate picture of the application
|
||||
execution because it is less intrusive to the target application and has fewer
|
||||
side effects on memory caches or instruction decoding pipelines. Furthermore,
|
||||
because sampling does not affect the execution speed as much, it is
|
||||
relatively immune to over-evaluating the cost of small, frequently called
|
||||
functions or "tight" loops.
|
||||
|
||||
In Omnitrace, the overhead for statistical sampling depends on the
|
||||
sampling rate and whether the samples are taken with respect to the CPU time
|
||||
and/or real time.
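
To make the trade-off between sampling rate and accuracy concrete, here is a minimal Python sketch of interval-based sampling over a made-up execution timeline. The timeline, the function names, and both helpers are assumptions for illustration only; this is not how Omnitrace implements sampling internally.

```python
from collections import Counter

# Hypothetical timeline: (function name, duration in microseconds), in run order.
timeline = [("compute", 7000), ("tiny_helper", 400), ("compute", 2600)]


def exact_times(segments):
    """Exact time per function, as instrumentation would measure it."""
    totals = Counter()
    for name, duration in segments:
        totals[name] += duration
    return dict(totals)


def sampled_times(segments, interval):
    """Estimate time per function by crediting `interval` to whichever
    function is running each time a sample is taken."""
    estimates = Counter()
    sample_time = 0
    segment_end = 0
    for name, duration in segments:
        segment_end += duration
        while sample_time < segment_end:
            estimates[name] += interval
            sample_time += interval
    return dict(estimates)


print("exact:        ", exact_times(timeline))
print("10 samples:   ", sampled_times(timeline, interval=1000))
print("100 samples:  ", sampled_times(timeline, interval=100))
```

With the coarse interval the short ``tiny_helper`` segment is over-credited, while the finer interval recovers the exact totals for this particular timeline at the cost of ten times as many interruptions.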
|
||||
|
||||
Binary instrumentation vs. statistical sampling example
|
||||
-------------------------------------------------------
|
||||
|
||||
Consider the following code:
|
||||
|
||||
.. code-block:: c++

   #include <cstdio>
   #include <cstdlib>

   // Naive recursive Fibonacci: a very small function that is called many times.
   long fib(long n)
   {
       if(n < 2) return n;
       return fib(n - 1) + fib(n - 2);
   }

   // Compute fib(n) and report the result for iteration i.
   void run(long i, long n)
   {
       long result = fib(n);
       printf("[%li] fibonacci(%li) = %li\n", i, n, result);
   }

   int main(int argc, char** argv)
   {
       long nfib = 30;
       long nitr = 10;
       if(argc > 1) nfib = atol(argv[1]);
       if(argc > 2) nitr = atol(argv[2]);

       for(long i = 0; i < nitr; ++i)
           run(i, nfib);

       return 0;
   }
|
||||
|
||||
Binary instrumentation of the ``fib`` function will record **every single invocation**
|
||||
of the function. For a very small function
|
||||
such as ``fib``, this results in **significant** overhead since this simple function
|
||||
takes about 20 instructions, whereas the entry and
|
||||
exit snippets are ~1024 instructions. Therefore, you generally want to avoid
|
||||
instrumenting functions where the instrumented function has significantly fewer
|
||||
instructions than entry and exit instrumentation. (Note that many of the
|
||||
instructions in entry and exit functions are either logging functions or
|
||||
depend on the runtime settings and thus might never run). However,
|
||||
due to the number of potential instructions in the entry and exit snippets,
|
||||
the default behavior of ``omnitrace-instrument`` is to only instrument functions
|
||||
which contain at least 1024 instructions.
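
The scale of this overhead can be estimated with some quick arithmetic. The Python sketch below counts how many times the naive ``fib`` is invoked and applies the approximate figures quoted above (about 20 instructions for ``fib``, up to roughly 1024 for the entry and exit snippets). It is a worst-case estimate that assumes every snippet instruction runs on every call, which, as noted above, is often not true in practice.

```python
def fib_calls(n):
    """Number of invocations made by the naive recursive fib(n),
    including the top-level call."""
    if n < 2:
        return 1
    calls_prev2, calls_prev1 = 1, 1  # invocation counts for fib(0) and fib(1)
    for _ in range(2, n + 1):
        # fib(k) calls fib(k - 1) and fib(k - 2), plus the call to itself.
        calls_prev2, calls_prev1 = calls_prev1, calls_prev1 + calls_prev2 + 1
    return calls_prev1


FUNCTION_INSTRUCTIONS = 20    # approximate size of fib, from the text
SNIPPET_INSTRUCTIONS = 1024   # approximate entry + exit snippet size, from the text

calls = fib_calls(30)
original = calls * FUNCTION_INSTRUCTIONS
instrumented = calls * (FUNCTION_INSTRUCTIONS + SNIPPET_INSTRUCTIONS)

print(f"invocations of fib for fib(30): {calls}")
print(f"worst-case instruction inflation: {instrumented / original:.1f}x")
```

Each iteration of the example program therefore triggers millions of instrumented calls, which is why a minimum instruction count is a useful filter for functions like ``fib``.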
|
||||
|
||||
However, recording every single invocation of the function can be extremely
|
||||
useful for detecting anomalies, such as profiles that show minimum or maximum values much smaller or larger
|
||||
than the average or a high standard deviation. In this case, the traces help you
|
||||
identify exactly when and where those instances deviated from the norm.
|
||||
Compare the level of detail in the following traces. In the top image,
|
||||
every instance of the ``fib`` function is instrumented, while in the bottom image,
|
||||
the ``fib`` call-stack is derived via sampling.
|
||||
|
||||
Binary instrumentation of the Fibonacci function
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
.. image:: ../data/fibonacci-instrumented.png
|
||||
:alt: Visualization of the output of a binary instrumentation of the Fibonacci function
|
||||
|
||||
Statistical sampling of the Fibonacci function
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
.. image:: ../data/fibonacci-sampling.png
|
||||
:alt: Visualization of the output of a statistical sample of the Fibonacci function
|
||||
@@ -0,0 +1,137 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
***************************************
|
||||
The Omnitrace feature set and use cases
|
||||
***************************************
|
||||
|
||||
`Omnitrace <https://github.com/ROCm/omnitrace>`_ is designed to be highly extensible.
|
||||
Internally, it leverages the `Timemory performance analysis toolkit <https://github.com/NERSC/timemory>`_
|
||||
to manage extensions, resources, data, and other items. It supports the following features,
|
||||
modes, metrics, and APIs.
|
||||
|
||||
Data collection modes
|
||||
========================================
|
||||
|
||||
* Dynamic instrumentation

  * Runtime instrumentation: Instrument executables and shared libraries at runtime
  * Binary rewriting: Generate a new executable and/or library with instrumentation built-in

* Statistical sampling: Periodic software interrupts per-thread
* Process-level sampling: A background thread records process-, system- and device-level metrics while the application runs
* Causal profiling: Quantifies the potential impact of optimizations in parallel code
|
||||
|
||||
.. note::
|
||||
|
||||
Critical trace support was removed in Omnitrace v1.11.0.
|
||||
It was replaced by the causal profiling feature.
|
||||
|
||||
Data analysis
|
||||
========================================
|
||||
|
||||
* High-level summary profiles with mean, min, max, and standard deviation statistics

  * Low overhead and memory efficient
  * Ideal for running at scale

* Comprehensive traces for every individual event and measurement
* Application speed-up predictions resulting from potential optimizations in functions and lines of code based on causal profiling
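
The relationship between a summary profile and a full trace can be sketched in a few lines of Python. The durations below are invented sample data, and the ``outliers`` helper with its two-standard-deviation cutoff is an assumption for illustration rather than anything Omnitrace computes in this form. The summary reveals that something is unusual; the per-invocation data shows which call it was.

```python
import statistics

# Hypothetical per-invocation wall-clock durations (milliseconds) from a trace.
durations_ms = [10.1, 9.9, 10.0, 10.2, 9.8, 30.5, 10.0, 10.1]

# Summary profile: the statistics listed above.
summary = {
    "mean": statistics.mean(durations_ms),
    "min": min(durations_ms),
    "max": max(durations_ms),
    "stddev": statistics.stdev(durations_ms),
}


def outliers(values, threshold=2.0):
    """Return (index, value) pairs more than `threshold` sample standard
    deviations away from the mean."""
    mean = statistics.mean(values)
    stddev = statistics.stdev(values)
    return [(i, v) for i, v in enumerate(values) if abs(v - mean) > threshold * stddev]


print(summary)
print(outliers(durations_ms))
```

A maximum far above the mean, together with a large standard deviation, is the cue to open the trace and inspect the specific invocation.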
|
||||
|
||||
Parallelism API support
|
||||
========================================
|
||||
|
||||
* HIP
|
||||
* HSA
|
||||
* Pthreads
|
||||
* MPI
|
||||
* Kokkos-Tools (KokkosP)
|
||||
* OpenMP-Tools (OMPT)
|
||||
|
||||
GPU metrics
|
||||
========================================
|
||||
|
||||
* GPU hardware counters
* HIP API tracing
* HIP kernel tracing
* HSA API tracing
* HSA operation tracing
* System-level sampling (via rocm-smi)

  * Memory usage
  * Power usage
  * Temperature
  * Utilization
|
||||
|
||||
CPU metrics
|
||||
========================================
|
||||
|
||||
* CPU hardware counters sampling and profiles
* CPU frequency sampling
* Various timing metrics

  * Wall time
  * CPU time (process and thread)
  * CPU utilization (process and thread)
  * User CPU time
  * Kernel CPU time

* Various memory metrics

  * High-water mark (sampling and profiles)
  * Memory page allocation
  * Virtual memory usage

* Network statistics
* I/O metrics
* Many others
|
||||
|
||||
Third-party API support
|
||||
========================================
|
||||
|
||||
* TAU
|
||||
* LIKWID
|
||||
* Caliper
|
||||
* CrayPAT
|
||||
* VTune
|
||||
* NVTX
|
||||
* ROCTX
|
||||
|
||||
Omnitrace use cases
|
||||
========================================
|
||||
|
||||
When analyzing the performance of an application, do NOT
|
||||
assume you know where the performance bottlenecks are
|
||||
and why they are happening. Omnitrace is a tool for analyzing the entire
|
||||
application and its performance. It is
|
||||
ideal for characterizing where optimization would have the greatest impact
|
||||
on an end-to-end run of the application and for
|
||||
viewing what else is happening on the system during a performance bottleneck.
|
||||
|
||||
When GPUs are involved, there is a tendency to assume that
|
||||
the quickest path to performance improvement is minimizing
|
||||
the runtime of the GPU kernels. This is a highly flawed assumption.
|
||||
If you optimize the runtime of a kernel from 1 millisecond
|
||||
to 1 microsecond (1000x speed-up) but the original application never
|
||||
spent time waiting for kernels to complete,
|
||||
there would be no statistically significant reduction in the end-to-end
|
||||
runtime of your application. In other words, it does not matter
|
||||
how fast or slow the code on the GPU is if the application is not
bottlenecked waiting on the GPU.
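
The arithmetic behind this point can be shown with a deliberately simplified model. The Python sketch below assumes the kernel runs asynchronously and fully overlaps with CPU work, so the application waits on the GPU only when the kernel outlasts the CPU work. The ``end_to_end_ms`` helper and all timings are hypothetical.

```python
def end_to_end_ms(cpu_ms, kernel_ms):
    """End-to-end time when the kernel overlaps completely with CPU work.
    The slower of the two determines the total (simplified model)."""
    return max(cpu_ms, kernel_ms)


# CPU-bound case: the application never waits on the kernel.
cpu_bound_before = end_to_end_ms(cpu_ms=50.0, kernel_ms=1.0)    # 1 millisecond kernel
cpu_bound_after = end_to_end_ms(cpu_ms=50.0, kernel_ms=0.001)   # 1 microsecond kernel

# GPU-bound case: the application does wait on the kernel.
gpu_bound_before = end_to_end_ms(cpu_ms=0.5, kernel_ms=1.0)
gpu_bound_after = end_to_end_ms(cpu_ms=0.5, kernel_ms=0.001)

print(f"CPU-bound: {cpu_bound_before} ms -> {cpu_bound_after} ms")
print(f"GPU-bound: {gpu_bound_before} ms -> {gpu_bound_after} ms")
```

In the CPU-bound case the 1000x kernel speed-up changes nothing end to end; only in the GPU-bound case does it shorten the run, and even then the CPU work becomes the new limit.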
|
||||
|
||||
Use Omnitrace to obtain a high-level view of the entire application. Use it
|
||||
to determine where the performance bottlenecks are and
|
||||
obtain clues to why these bottlenecks are happening. Rather than worrying about kernel
|
||||
performance, start your investigation with Omnitrace, which characterizes the
|
||||
broad picture.
|
||||
|
||||
.. note::
|
||||
|
||||
For insight into the execution of individual kernels on the GPU,
|
||||
use `Omniperf <https://github.com/rocm/omniperf>`_.
|
||||
|
||||
In terms of CPU analysis, Omnitrace does not target any specific vendor.
|
||||
It works just as well on AMD and non-AMD CPUs.
|
||||
With regard to the GPU, Omnitrace is currently restricted to HIP and HSA APIs
|
||||
and kernels running on AMD GPUs.
|
||||
@@ -0,0 +1,56 @@
|
||||
# MIT License
|
||||
|
||||
# Copyright (c) 2023 - 2024 Advanced Micro Devices, Inc. All rights reserved.
|
||||
|
||||
# Permission is hereby granted, free of charge, to any person obtaining a copy
|
||||
# of this software and associated documentation files (the "Software"), to deal
|
||||
# in the Software without restriction, including without limitation the rights
|
||||
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||
# copies of the Software, and to permit persons to whom the Software is
|
||||
# furnished to do so, subject to the following conditions:
|
||||
|
||||
# The above copyright notice and this permission notice shall be included in all
|
||||
# copies or substantial portions of the Software.
|
||||
|
||||
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
||||
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
||||
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
||||
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
||||
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
||||
# SOFTWARE.
|
||||
|
||||
# Configuration file for the Sphinx documentation builder.
|
||||
#
|
||||
# This file only contains a selection of the most common options. For a full
|
||||
# list see the documentation:
|
||||
# https://www.sphinx-doc.org/en/master/usage/configuration.html
|
||||
|
||||
import re
|
||||
|
||||
from rocm_docs import ROCmDocs
|
||||
|
||||
with open("../VERSION", encoding="utf-8") as f:
|
||||
match = re.search(r"([0-9.]+)[^0-9.]+", f.read())
|
||||
if not match:
|
||||
raise ValueError("VERSION not found!")
|
||||
version_number = match[1]
|
||||
|
||||
external_projects_current_project = "omnitrace"
|
||||
|
||||
project = "omnitrace"
|
||||
author = "Advanced Micro Devices, Inc."
|
||||
copyright = "Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved."
|
||||
version = version_number
|
||||
release = version_number
|
||||
html_title = f"Omnitrace {version} documentation"
|
||||
|
||||
external_toc_path = "./sphinx/_toc.yml"
|
||||
|
||||
docs_core = ROCmDocs(html_title)
|
||||
docs_core.setup()
|
||||
docs_core.run_doxygen(doxygen_root="doxygen", doxygen_path="doxygen/xml")
|
||||
docs_core.enable_api_reference()
|
||||
|
||||
for sphinx_var in ROCmDocs.SPHINX_VARS:
|
||||
globals()[sphinx_var] = getattr(docs_core, sphinx_var)
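
The version lookup in this ``conf.py`` can be exercised on its own. The sketch below reuses the same regular expression; the ``parse_version`` helper and the sample strings are assumptions for illustration. Note that the pattern requires at least one character after the version that is neither a digit nor a dot, such as a trailing newline.

```python
import re

VERSION_PATTERN = r"([0-9.]+)[^0-9.]+"  # same pattern as used in conf.py


def parse_version(text):
    """Extract the leading version number, mirroring the conf.py logic."""
    match = re.search(VERSION_PATTERN, text)
    if not match:
        raise ValueError("VERSION not found!")
    return match[1]


# A VERSION file that ends with a newline parses as expected.
print(parse_version("1.11.3\n"))

# Without any trailing non-version character, the pattern does not match.
try:
    parse_version("1.11.3")
except ValueError as err:
    print(f"error: {err}")
```

If the ``VERSION`` file might be written without a trailing newline, relaxing the trailing group would avoid the error; whether that matters depends on how the file is generated.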
|
||||
@@ -0,0 +1,3 @@
|
||||
html/
|
||||
latex/
|
||||
xml/
|
||||
@@ -0,0 +1,373 @@
|
||||
# Doxyfile 1.8.20
|
||||
|
||||
#---------------------------------------------------------------------------
|
||||
# Project related configuration options
|
||||
#---------------------------------------------------------------------------
|
||||
DOXYFILE_ENCODING = UTF-8
|
||||
PROJECT_NAME = omnitrace
|
||||
PROJECT_NUMBER = 1.11.3
|
||||
PROJECT_BRIEF = "High-level and comprehensive application tracing and profiling on both the CPU and GPU"
|
||||
PROJECT_LOGO =
|
||||
OUTPUT_DIRECTORY = .
|
||||
CREATE_SUBDIRS = NO
|
||||
ALLOW_UNICODE_NAMES = YES
|
||||
OUTPUT_LANGUAGE = English
|
||||
OUTPUT_TEXT_DIRECTION = None
|
||||
BRIEF_MEMBER_DESC = YES
|
||||
REPEAT_BRIEF = YES
|
||||
ABBREVIATE_BRIEF =
|
||||
ALWAYS_DETAILED_SEC = YES
|
||||
INLINE_INHERITED_MEMB = YES
|
||||
FULL_PATH_NAMES = YES
|
||||
STRIP_FROM_PATH = /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-omnitrace/checkouts/
|
||||
STRIP_FROM_INC_PATH = /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-omnitrace/checkouts/
|
||||
SHORT_NAMES = NO
|
||||
JAVADOC_AUTOBRIEF = NO
|
||||
JAVADOC_BANNER = NO
|
||||
QT_AUTOBRIEF = NO
|
||||
MULTILINE_CPP_IS_BRIEF = YES
|
||||
PYTHON_DOCSTRING = YES
|
||||
INHERIT_DOCS = YES
|
||||
SEPARATE_MEMBER_PAGES = NO
|
||||
TAB_SIZE = 4
|
||||
ALIASES =
|
||||
OPTIMIZE_OUTPUT_FOR_C = NO
|
||||
OPTIMIZE_OUTPUT_JAVA = NO
|
||||
OPTIMIZE_FOR_FORTRAN = NO
|
||||
OPTIMIZE_OUTPUT_VHDL = NO
|
||||
OPTIMIZE_OUTPUT_SLICE = NO
|
||||
EXTENSION_MAPPING = hpp=C++ \
|
||||
cpp=C++ \
|
||||
hh=C++ \
|
||||
cc=C++ \
|
||||
h=C \
|
||||
c=C \
|
||||
py=Python
|
||||
MARKDOWN_SUPPORT = YES
|
||||
TOC_INCLUDE_HEADINGS = 2
|
||||
AUTOLINK_SUPPORT = YES
|
||||
BUILTIN_STL_SUPPORT = YES
|
||||
CPP_CLI_SUPPORT = NO
|
||||
SIP_SUPPORT = NO
|
||||
IDL_PROPERTY_SUPPORT = YES
|
||||
DISTRIBUTE_GROUP_DOC = NO
|
||||
GROUP_NESTED_COMPOUNDS = YES
|
||||
SUBGROUPING = YES
|
||||
INLINE_GROUPED_CLASSES = NO
|
||||
INLINE_SIMPLE_STRUCTS = YES
|
||||
TYPEDEF_HIDES_STRUCT = NO
|
||||
LOOKUP_CACHE_SIZE = 5
|
||||
NUM_PROC_THREADS = 0
|
||||
#---------------------------------------------------------------------------
|
||||
# Build related configuration options
|
||||
#---------------------------------------------------------------------------
|
||||
EXTRACT_ALL = YES
|
||||
EXTRACT_PRIVATE = NO
|
||||
EXTRACT_PRIV_VIRTUAL = NO
|
||||
EXTRACT_PACKAGE = NO
|
||||
EXTRACT_STATIC = NO
|
||||
EXTRACT_LOCAL_CLASSES = YES
|
||||
EXTRACT_LOCAL_METHODS = NO
|
||||
EXTRACT_ANON_NSPACES = NO
|
||||
HIDE_UNDOC_MEMBERS = NO
|
||||
HIDE_UNDOC_CLASSES = YES
|
||||
HIDE_FRIEND_COMPOUNDS = NO
|
||||
HIDE_IN_BODY_DOCS = NO
|
||||
INTERNAL_DOCS = NO
|
||||
CASE_SENSE_NAMES = NO
|
||||
HIDE_SCOPE_NAMES = NO
|
||||
HIDE_COMPOUND_REFERENCE= NO
|
||||
SHOW_INCLUDE_FILES = YES
|
||||
SHOW_GROUPED_MEMB_INC = NO
|
||||
FORCE_LOCAL_INCLUDES = YES
|
||||
INLINE_INFO = YES
|
||||
SORT_MEMBER_DOCS = YES
|
||||
SORT_BRIEF_DOCS = NO
|
||||
SORT_MEMBERS_CTORS_1ST = YES
|
||||
SORT_GROUP_NAMES = NO
|
||||
SORT_BY_SCOPE_NAME = NO
|
||||
STRICT_PROTO_MATCHING = NO
|
||||
GENERATE_TODOLIST = NO
|
||||
GENERATE_TESTLIST = NO
|
||||
GENERATE_BUGLIST = NO
|
||||
GENERATE_DEPRECATEDLIST= NO
|
||||
ENABLED_SECTIONS =
|
||||
MAX_INITIALIZER_LINES = 30
|
||||
SHOW_USED_FILES = YES
|
||||
SHOW_FILES = YES
|
||||
SHOW_NAMESPACES = YES
|
||||
FILE_VERSION_FILTER =
|
||||
LAYOUT_FILE =
|
||||
CITE_BIB_FILES =
|
||||
#---------------------------------------------------------------------------
|
||||
# Configuration options related to warning and progress messages
|
||||
#---------------------------------------------------------------------------
|
||||
QUIET = NO
|
||||
WARNINGS = YES
|
||||
WARN_IF_UNDOCUMENTED = YES
|
||||
WARN_IF_DOC_ERROR = YES
|
||||
WARN_NO_PARAMDOC = YES
|
||||
WARN_AS_ERROR = YES
|
||||
WARN_FORMAT = "---> WARNING! $file:$line: $text"
|
||||
WARN_LOGFILE = doc/warnings.log
|
||||
#---------------------------------------------------------------------------
|
||||
# Configuration options related to the input files
|
||||
#---------------------------------------------------------------------------
|
||||
INPUT = ../../README.md \
|
||||
../../source/lib/omnitrace-user/omnitrace/types.h \
|
||||
../../source/lib/omnitrace-user/omnitrace/categories.h \
|
||||
../../source/lib/omnitrace-user/omnitrace/user.h \
|
||||
../../source/lib/omnitrace-user/omnitrace/causal.h
|
||||
INPUT_ENCODING = UTF-8
|
||||
FILE_PATTERNS = *.h \
|
||||
*.hh \
|
||||
*.hpp \
|
||||
*.c \
|
||||
*.cc \
|
||||
*.cxx \
|
||||
*.cpp \
|
||||
*.c++ \
|
||||
*.icc \
|
||||
*.tcc \
|
||||
*.py
|
||||
RECURSIVE = YES
|
||||
EXCLUDE =
|
||||
EXCLUDE_SYMLINKS = YES
|
||||
EXCLUDE_PATTERNS = */.git/* \
|
||||
../../external/* \
|
||||
../../examples/* \
|
||||
../../tests/*
|
||||
EXCLUDE_SYMBOLS = "std::*" \
|
||||
"OMNITRACE_ATTRIBUTE" \
|
||||
"OMNITRACE_VISIBILITY" \
|
||||
"OMNITRACE_PUBLIC_API" \
|
||||
"OMNITRACE_HIDDEN_API" \
|
||||
"SpaceHandle" \
|
||||
"KokkosPDevice*"
|
||||
EXAMPLE_PATH = ../../examples
|
||||
EXAMPLE_PATTERNS = *.h \
|
||||
*.hh \
|
||||
*.hpp \
|
||||
*.c \
|
||||
*.cc \
|
||||
*.cpp \
|
||||
*.py \
|
||||
*.txt
|
||||
EXAMPLE_RECURSIVE = YES
|
||||
IMAGE_PATH =
|
||||
INPUT_FILTER =
|
||||
FILTER_PATTERNS =
|
||||
FILTER_SOURCE_FILES = NO
|
||||
FILTER_SOURCE_PATTERNS =
|
||||
USE_MDFILE_AS_MAINPAGE = ../../README.md
|
||||
#---------------------------------------------------------------------------
|
||||
# Configuration options related to source browsing
|
||||
#---------------------------------------------------------------------------
|
||||
SOURCE_BROWSER = YES
|
||||
INLINE_SOURCES = YES
|
||||
STRIP_CODE_COMMENTS = NO
|
||||
REFERENCED_BY_RELATION = YES
|
||||
REFERENCES_RELATION = YES
|
||||
REFERENCES_LINK_SOURCE = YES
|
||||
SOURCE_TOOLTIPS = YES
|
||||
USE_HTAGS = NO
|
||||
VERBATIM_HEADERS = YES
|
||||
#---------------------------------------------------------------------------
|
||||
# Configuration options related to the alphabetical class index
|
||||
#---------------------------------------------------------------------------
|
||||
ALPHABETICAL_INDEX = YES
|
||||
COLS_IN_ALPHA_INDEX = 5
|
||||
IGNORE_PREFIX =
|
||||
#---------------------------------------------------------------------------
|
||||
# Configuration options related to the HTML output
|
||||
#---------------------------------------------------------------------------
|
||||
GENERATE_HTML = YES
|
||||
HTML_OUTPUT = html
|
||||
HTML_FILE_EXTENSION = .html
|
||||
HTML_HEADER = ../_doxygen/header.html
|
||||
HTML_FOOTER = ../_doxygen/footer.html
|
||||
HTML_STYLESHEET = ../_doxygen/stylesheet.css
|
||||
HTML_EXTRA_STYLESHEET = ../_doxygen/extra_stylesheet.css
|
||||
HTML_EXTRA_FILES =
|
||||
HTML_COLORSTYLE_HUE = 220
|
||||
HTML_COLORSTYLE_SAT = 100
|
||||
HTML_COLORSTYLE_GAMMA = 80
|
||||
HTML_TIMESTAMP = YES
|
||||
HTML_DYNAMIC_MENUS = YES
|
||||
HTML_DYNAMIC_SECTIONS = YES
|
||||
HTML_INDEX_NUM_ENTRIES = 1000
|
||||
GENERATE_DOCSET = NO
|
||||
DOCSET_FEEDNAME = "Doxygen generated docs"
|
||||
DOCSET_BUNDLE_ID = org.doxygen.omnitrace
|
||||
DOCSET_PUBLISHER_ID = org.doxygen.amdresearch
|
||||
DOCSET_PUBLISHER_NAME = "Audacious Software Group"
|
||||
GENERATE_HTMLHELP = NO
|
||||
CHM_FILE =
|
||||
HHC_LOCATION =
|
||||
GENERATE_CHI = NO
|
||||
CHM_INDEX_ENCODING =
|
||||
BINARY_TOC = NO
|
||||
TOC_EXPAND = YES
|
||||
GENERATE_QHP = NO
|
||||
QCH_FILE =
|
||||
QHP_NAMESPACE =
|
||||
QHP_VIRTUAL_FOLDER = doc
|
||||
QHP_CUST_FILTER_NAME =
|
||||
QHP_CUST_FILTER_ATTRS =
|
||||
QHP_SECT_FILTER_ATTRS =
|
||||
QHG_LOCATION =
|
||||
GENERATE_ECLIPSEHELP = NO
|
||||
ECLIPSE_DOC_ID = org.doxygen.omnitrace
|
||||
DISABLE_INDEX = NO
|
||||
GENERATE_TREEVIEW = NO
|
||||
ENUM_VALUES_PER_LINE = 1
|
||||
TREEVIEW_WIDTH = 300
|
||||
EXT_LINKS_IN_WINDOW = YES
|
||||
HTML_FORMULA_FORMAT = png
|
||||
FORMULA_FONTSIZE = 12
|
||||
FORMULA_TRANSPARENT = YES
|
||||
FORMULA_MACROFILE =
|
||||
USE_MATHJAX = NO
|
||||
MATHJAX_FORMAT = HTML-CSS
|
||||
MATHJAX_RELPATH = http://cdn.mathjax.org/mathjax/latest
|
||||
MATHJAX_EXTENSIONS =
|
||||
MATHJAX_CODEFILE =
|
||||
SEARCHENGINE = NO
|
||||
SERVER_BASED_SEARCH = NO
|
||||
EXTERNAL_SEARCH = NO
|
||||
SEARCHENGINE_URL =
|
||||
SEARCHDATA_FILE = searchdata.xml
|
||||
EXTERNAL_SEARCH_ID =
|
||||
EXTRA_SEARCH_MAPPINGS =
|
||||
#---------------------------------------------------------------------------
|
||||
# Configuration options related to the LaTeX output
|
||||
#---------------------------------------------------------------------------
|
||||
GENERATE_LATEX = NO
|
||||
LATEX_OUTPUT = latex
|
||||
LATEX_CMD_NAME = latex
|
||||
MAKEINDEX_CMD_NAME = makeindex
|
||||
LATEX_MAKEINDEX_CMD = makeindex
|
||||
COMPACT_LATEX = NO
|
||||
PAPER_TYPE = a4wide
|
||||
EXTRA_PACKAGES = float
|
||||
LATEX_HEADER =
|
||||
LATEX_FOOTER =
|
||||
LATEX_EXTRA_STYLESHEET =
|
||||
LATEX_EXTRA_FILES =
|
||||
PDF_HYPERLINKS = YES
|
||||
USE_PDFLATEX = YES
|
||||
LATEX_BATCHMODE = YES
|
||||
LATEX_HIDE_INDICES = NO
|
||||
LATEX_SOURCE_CODE = YES
|
||||
LATEX_BIB_STYLE = plain
|
||||
LATEX_TIMESTAMP = NO
|
||||
LATEX_EMOJI_DIRECTORY =
|
||||
#---------------------------------------------------------------------------
|
||||
# Configuration options related to the RTF output
|
||||
#---------------------------------------------------------------------------
|
||||
GENERATE_RTF = NO
|
||||
RTF_OUTPUT = rtf
|
||||
COMPACT_RTF = NO
|
||||
RTF_HYPERLINKS = NO
|
||||
RTF_STYLESHEET_FILE =
|
||||
RTF_EXTENSIONS_FILE =
|
||||
RTF_SOURCE_CODE = NO
|
||||
#---------------------------------------------------------------------------
|
||||
# Configuration options related to the man page output
|
||||
#---------------------------------------------------------------------------
|
||||
GENERATE_MAN = NO
|
||||
MAN_OUTPUT = man
|
||||
MAN_EXTENSION = .3
|
||||
MAN_SUBDIR =
|
||||
MAN_LINKS = YES
|
||||
#---------------------------------------------------------------------------
|
||||
# Configuration options related to the XML output
|
||||
#---------------------------------------------------------------------------
|
||||
GENERATE_XML = YES
|
||||
XML_OUTPUT = xml
|
||||
XML_PROGRAMLISTING = YES
|
||||
XML_NS_MEMB_FILE_SCOPE = YES
|
||||
#---------------------------------------------------------------------------
|
||||
# Configuration options related to the DOCBOOK output
|
||||
#---------------------------------------------------------------------------
|
||||
GENERATE_DOCBOOK = NO
|
||||
DOCBOOK_OUTPUT = docbook
|
||||
DOCBOOK_PROGRAMLISTING = NO
|
||||
#---------------------------------------------------------------------------
|
||||
# Configuration options for the AutoGen Definitions output
|
||||
#---------------------------------------------------------------------------
|
||||
GENERATE_AUTOGEN_DEF = NO
|
||||
#---------------------------------------------------------------------------
|
||||
# Configuration options related to the Perl module output
|
||||
#---------------------------------------------------------------------------
|
||||
GENERATE_PERLMOD = NO
|
||||
PERLMOD_LATEX = NO
|
||||
PERLMOD_PRETTY = YES
|
||||
PERLMOD_MAKEVAR_PREFIX =
|
||||
#---------------------------------------------------------------------------
|
||||
# Configuration options related to the preprocessor
|
||||
#---------------------------------------------------------------------------
|
||||
ENABLE_PREPROCESSING = YES
|
||||
MACRO_EXPANSION = YES
|
||||
EXPAND_ONLY_PREDEF = NO
|
||||
SEARCH_INCLUDES = YES
|
||||
INCLUDE_PATH = ../../source/lib/omnitrace-user
|
||||
INCLUDE_FILE_PATTERNS = *.h \
|
||||
*.hpp
|
||||
PREDEFINED = OMNITRACE_PUBLIC_API= \
|
||||
OMNITRACE_HIDDEN_API= \
|
||||
"OMNITRACE_ATTRIBUTE(...)=" \
|
||||
"OMNITRACE_VISIBILITY(...)=" \
|
||||
"__attribute__(x)=" \
|
||||
"__declspec(x)=" \
|
||||
"size_t=unsigned long" \
|
||||
"uintptr_t=unsigned long" \
|
||||
DOXYGEN_SHOULD_SKIP_THIS
|
||||
EXPAND_AS_DEFINED =
|
||||
SKIP_FUNCTION_MACROS = NO
|
||||
#---------------------------------------------------------------------------
|
||||
# Configuration options related to external references
|
||||
#---------------------------------------------------------------------------
|
||||
TAGFILES =
|
||||
GENERATE_TAGFILE = html/tagfile.xml
|
||||
ALLEXTERNALS = NO
|
||||
EXTERNAL_GROUPS = YES
|
||||
EXTERNAL_PAGES = YES
|
||||
#---------------------------------------------------------------------------
|
||||
# Configuration options related to the dot tool
|
||||
#---------------------------------------------------------------------------
|
||||
CLASS_DIAGRAMS = YES
|
||||
DIA_PATH =
|
||||
HIDE_UNDOC_RELATIONS = NO
|
||||
HAVE_DOT = NO
|
||||
DOT_NUM_THREADS = 0
|
||||
DOT_FONTNAME = Helvetica
|
||||
DOT_FONTSIZE = 12
|
||||
DOT_FONTPATH =
|
||||
CLASS_GRAPH = NO
|
||||
COLLABORATION_GRAPH = YES
|
||||
GROUP_GRAPHS = YES
|
||||
UML_LOOK = YES
|
||||
UML_LIMIT_NUM_FIELDS = 10
|
||||
TEMPLATE_RELATIONS = YES
|
||||
INCLUDE_GRAPH = YES
|
||||
INCLUDED_BY_GRAPH = YES
|
||||
CALL_GRAPH = NO
|
||||
CALLER_GRAPH = NO
|
||||
GRAPHICAL_HIERARCHY = YES
|
||||
DIRECTORY_GRAPH = YES
|
||||
DOT_IMAGE_FORMAT = svg
|
||||
INTERACTIVE_SVG = YES
|
||||
DOT_PATH = /usr/bin/dot
|
||||
DOTFILE_DIRS =
|
||||
MSCFILE_DIRS =
|
||||
DIAFILE_DIRS =
|
||||
PLANTUML_JAR_PATH =
|
||||
PLANTUML_CFG_FILE =
|
||||
PLANTUML_INCLUDE_PATH =
|
||||
DOT_GRAPH_MAX_NODES = 50
|
||||
MAX_DOT_GRAPH_DEPTH = 0
|
||||
DOT_TRANSPARENT = NO
|
||||
DOT_MULTI_TARGETS = YES
|
||||
GENERATE_LEGEND = YES
|
||||
DOT_CLEANUP = YES
|
||||
@@ -0,0 +1,71 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
****************************************************
|
||||
Configuring and validating the environment
|
||||
****************************************************
|
||||
|
||||
After installing `Omnitrace <https://github.com/ROCm/omnitrace>`_, additional steps are required to set up
|
||||
and validate the environment.
|
||||
|
||||
.. note::
|
||||
|
||||
The following instructions use the installation path ``/opt/omnitrace``. If
|
||||
Omnitrace is installed elsewhere, substitute the actual installation path.
|
||||
|
||||
Configuring the environment
|
||||
========================================
|
||||
|
||||
After Omnitrace is installed, source the ``setup-env.sh`` script to prepend the Omnitrace directories to
``PATH``, ``LD_LIBRARY_PATH``, and other environment variables:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
source /opt/omnitrace/share/omnitrace/setup-env.sh
|
||||
|
||||
Alternatively, if environment modules are supported, add the ``<prefix>/share/modulefiles`` directory
|
||||
to ``MODULEPATH``:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
module use /opt/omnitrace/share/modulefiles
|
||||
|
||||
.. note::
|
||||
|
||||
As an alternative, the above line can be added to the ``${HOME}/.modulerc`` file.
|
||||
|
||||
After the Omnitrace modulefiles directory has been added to ``MODULEPATH``, Omnitrace can be loaded
using ``module load omnitrace/<VERSION>`` and unloaded using ``module unload omnitrace/<VERSION>``:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
module load omnitrace/1.0.0
|
||||
module unload omnitrace/1.0.0
|
||||
|
||||
.. note::
|
||||
|
||||
You might also need to add the path to the ROCm libraries to ``LD_LIBRARY_PATH``,
|
||||
for example, ``export LD_LIBRARY_PATH=/opt/rocm/lib:${LD_LIBRARY_PATH}``.
|
||||
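The sourcing step above can be made safer in login scripts by checking that the script exists first. The following is a minimal sketch, not part of Omnitrace: the ``load_omnitrace_env`` function name is hypothetical, and the default prefix ``/opt/omnitrace`` is the same assumption used throughout this page.

```shell
# Hypothetical helper (not shipped with Omnitrace): source setup-env.sh
# only if it exists under the given installation prefix.
load_omnitrace_env() {
    prefix="${1:-/opt/omnitrace}"    # assumed default installation path
    script="${prefix}/share/omnitrace/setup-env.sh"
    if [ ! -f "${script}" ]; then
        echo "setup-env.sh not found under ${prefix}" >&2
        return 1
    fi
    # "." is the POSIX spelling of "source"
    . "${script}"
}

load_omnitrace_env /opt/omnitrace || echo "Omnitrace environment not loaded"
```

If Omnitrace is installed elsewhere, pass the actual prefix as the first argument.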
|
||||
Validating the environment configuration
|
||||
========================================
|
||||
|
||||
If the following commands all run successfully with the expected output,
|
||||
then you are ready to use Omnitrace:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
which omnitrace
|
||||
which omnitrace-avail
|
||||
which omnitrace-sample
|
||||
omnitrace-instrument --help
|
||||
omnitrace-avail --all
|
||||
omnitrace-sample --help
|
||||
|
||||
If Omnitrace was built with Python support, validate these additional commands:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
which omnitrace-python
|
||||
omnitrace-python --help
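The ``which`` checks above can also be scripted so that a missing executable is reported explicitly. This is a hedged sketch only: ``check_tools`` is a hypothetical helper name, and the list of executables is taken from the commands shown in this section.

```shell
# Hypothetical helper: report which of the named executables are on PATH.
# Returns 0 if all were found and 1 if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "${tool}" >/dev/null 2>&1; then
            echo "found:   ${tool}"
        else
            echo "missing: ${tool}"
            missing=1
        fi
    done
    return "${missing}"
}

check_tools omnitrace-instrument omnitrace-avail omnitrace-sample \
    || echo "Some Omnitrace executables are not on PATH"
```

For a build with Python support, ``omnitrace-python`` can be appended to the argument list.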
|
||||
@@ -0,0 +1,60 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
**********************************
|
||||
General tips for using Omnitrace
|
||||
**********************************
|
||||
|
||||
Follow these general guidelines when using Omnitrace. For an explanation of the terms used in this topic, see
|
||||
the :doc:`Omnitrace glossary <../reference/omnitrace-glossary>`.
|
||||
|
||||
* Use ``omnitrace-avail`` to look up configuration settings, hardware counters, and data collection components
|
||||
|
||||
* Use the ``-d`` flag for descriptions
|
||||
|
||||
* Generate a default configuration with ``omnitrace-avail -G ${HOME}/.omnitrace.cfg`` and adjust it
|
||||
to the desired default behavior
|
||||
* **Decide whether binary instrumentation, statistical sampling, or both** provides the desired performance data (for non-Python applications)
|
||||
* Compile code with optimization enabled (``-O2`` or higher), disable asserts (``-DNDEBUG``), and include debug info (``-g1`` at a minimum)
|
||||
|
||||
* Compiling with debug info does not slow down the code; it only increases compile time and the size of the binary
|
||||
* In CMake, this is generally done with either ``CMAKE_BUILD_TYPE=RelWithDebInfo``, or ``CMAKE_BUILD_TYPE=Release`` combined with ``CMAKE_<LANG>_FLAGS=-g1``
|
||||
|
||||
* **Use binary instrumentation for characterizing the performance of every invocation of specific functions**
|
||||
* **Use statistical sampling to characterize the performance of the entire application while minimizing overhead**
|
||||
* Enable statistical sampling after binary instrumentation to help "fill in the gaps" between instrumented regions
|
||||
* Use the user API to create custom regions and enable/disable Omnitrace for specific processes, threads, and regions
|
||||
* Dynamic symbol interception, callback APIs, and the user API are always available with binary instrumentation and sampling
|
||||
|
||||
* Dynamic symbol interception and callback APIs are (generally) controlled through ``OMNITRACE_USE_<API>``
|
||||
options, for example, ``OMNITRACE_USE_KOKKOSP`` and ``OMNITRACE_USE_OMPT`` enable Kokkos-Tools and OpenMP-Tools
|
||||
callbacks, respectively
|
||||
|
||||
* When generically seeking regions for performance improvement:
|
||||
|
||||
* **Start off by collecting a flat profile**
|
||||
* Look for functions with high call counts, large cumulative runtimes/values, or large standard deviations
|
||||
|
||||
* When call counts are high, improving the performance of the function or "inlining" it can result in quick and easy performance improvements
|
||||
* When the standard deviation is high, collect a hierarchical profile and see if the high variation can be attributed to the calling context.
|
||||
In this scenario, consider creating a specialized version of the function for the longer-running contexts
|
||||
|
||||
* **Collect a hierarchical profile** and verify the functions that are part of the "critical path" of your
|
||||
application, as indicated in the flat profile
|
||||
|
||||
* For example, functions with high call counts but which are part of a "setup" or "post-processing"
|
||||
phase that does not consume much time relative to the overall time are generally a lower priority for optimization
|
||||
|
||||
* **Use the information from the profiles when analyzing detailed traces**
|
||||
* When using binary instrumentation in "trace" mode, **binary rewrites are preferable to runtime instrumentation**.
|
||||
|
||||
* Binary rewrites only instrument the functions defined in the target binary, whereas runtime instrumentation might instrument functions defined in the shared libraries which are linked into the target binary
|
||||
|
||||
* When using binary instrumentation with MPI, avoid runtime instrumentation
|
||||
|
||||
* Runtime instrumentation requires a fork and a ``ptrace``, which is generally incompatible with how MPI applications spawn processes
|
||||
* Perform a binary rewrite of the MPI executable (and, optionally, the libraries it uses) and run
the generated instrumented executable using ``omnitrace-run`` instead of the original.
|
||||
For example, instead of ``mpirun -n 2 ./myexe``, use ``mpirun -n 2 omnitrace-run -- ./myexe.inst``, where
|
||||
``myexe.inst`` is the instrumented ``myexe`` executable that was generated.
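The MPI guidance above can be sketched as a two-step workflow: rewrite the binary once, then launch the instrumented copy under ``omnitrace-run``. The ``run`` wrapper and the ``DRY_RUN`` variable below are hypothetical conveniences, and ``./myexe`` stands in for your own executable. By default the sketch only prints the commands, so it can be tried without Omnitrace or MPI installed.

```shell
# Hypothetical dry-run wrapper: print each command instead of executing it
# unless DRY_RUN=0 is set in the environment.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

exe="./myexe"          # placeholder for the MPI executable
inst="${exe}.inst"     # name chosen for the instrumented copy

# Step 1: binary rewrite (no fork/ptrace involved at launch time)
run omnitrace-instrument -o "${inst}" -- "${exe}"
# Step 2: launch the instrumented executable under MPI
run mpirun -n 2 omnitrace-run -- "${inst}"
```

Setting ``DRY_RUN=0`` executes the commands for real, which assumes that ``omnitrace-instrument``, ``omnitrace-run``, and ``mpirun`` are all on ``PATH``.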
|
||||
@@ -0,0 +1,942 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
****************************************************
|
||||
Instrumenting and rewriting a binary application
|
||||
****************************************************
|
||||
|
||||
There are three ways to perform instrumentation with the ``omnitrace-instrument`` executable:
|
||||
|
||||
* Runtime instrumentation
|
||||
* Attaching to an already running process
|
||||
* Binary rewrite
|
||||
|
||||
Here is a comparison of the three modes:
|
||||
|
||||
* Runtime instrumentation of the application using the ``omnitrace-instrument`` executable
|
||||
(analogous to ``gdb --args <program> <args>``)
|
||||
|
||||
* This mode is the default if neither the ``-p`` nor the ``-o`` command-line option is used
|
||||
* Runtime instrumentation supports instrumenting not only the target executable but also
|
||||
the shared libraries loaded by the target executable. Consequently, this mode consumes more memory,
|
||||
takes longer to perform the instrumentation, and tends to add more significant overhead to the
|
||||
runtime of the application.
|
||||
* This mode is recommended if you want to analyze not only the performance of your executable and/or
|
||||
libraries but also the performance of the library dependencies
|
||||
|
||||
* Attaching to a process that is currently running (analogous to ``gdb -p <PID>``)
|
||||
|
||||
* This mode is activated using ``-p <PID>``
|
||||
* The same caveats as for runtime instrumentation apply with respect to memory and overhead
|
||||
|
||||
.. note::
|
||||
|
||||
Attaching to a running process is an alpha feature, and detaching from the target process
without ending the target process is not currently supported.
|
||||
|
||||
* Binary rewrite to generate a new executable or library with the instrumentation built-in
|
||||
|
||||
* This mode is activated through the ``-o <output-file>`` option
|
||||
* Binary rewriting is limited to the text section of the target executable or library. It does not instrument
|
||||
the dynamically-linked libraries. Consequently, this mode performs the
|
||||
instrumentation significantly faster
|
||||
and has a much lower overhead when running the instrumented executable and libraries.
|
||||
* Binary rewriting is the recommended mode when the target executable uses
|
||||
process-level parallelism (for example, MPI)
|
||||
* If the target executable has a minimal ``main`` routine and the bulk of your
|
||||
application is in one specific dynamic library,
|
||||
see :ref:`binary-rewriting-library-label` for help
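The rule for which mode applies can be summarized in a few lines of shell. This is an illustration of the selection rule described above, not the actual ``omnitrace-instrument`` implementation. The ``instrument_mode`` function is hypothetical, and its behavior when both ``-p`` and ``-o`` are given (the last one wins) is an assumption of the sketch.

```shell
# Hypothetical illustration: report which instrumentation mode a given
# omnitrace-instrument command line would select.
instrument_mode() {
    mode="runtime instrumentation"    # default when neither -p nor -o is used
    for arg in "$@"; do
        case "${arg}" in
            --) break ;;              # everything after "--" belongs to the application
            -p|--pid) mode="attach to running process" ;;
            -o|--output) mode="binary rewrite" ;;
        esac
    done
    printf '%s\n' "${mode}"
}

instrument_mode -- ls -l
instrument_mode -p 1234 -- ls
instrument_mode -o ls.inst -- ls -l
```

Options that appear after the ``--`` separator are ignored by the sketch because they belong to the target application.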
|
||||
|
||||
The omnitrace-instrument executable
|
||||
========================================
|
||||
|
||||
Instrumentation is performed with the ``omnitrace-instrument`` executable. For more details, use the ``-h`` or ``--help`` option to
|
||||
view the help menu.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omnitrace-instrument --help
|
||||
[omnitrace-instrument] Usage: omnitrace-instrument [ --help (count: 0, dtype: bool)
|
||||
--version (count: 0, dtype: bool)
|
||||
--verbose (max: 1, dtype: bool)
|
||||
--error (max: 1, dtype: boolean)
|
||||
--debug (max: 1, dtype: bool)
|
||||
--log (count: 1)
|
||||
--log-file (count: 1)
|
||||
--simulate (max: 1, dtype: boolean)
|
||||
--print-format (min: 1, dtype: string)
|
||||
--print-dir (count: 1, dtype: string)
|
||||
--print-available (count: 1)
|
||||
--print-instrumented (count: 1)
|
||||
--print-coverage (count: 1)
|
||||
--print-excluded (count: 1)
|
||||
--print-overlapping (count: 1)
|
||||
--print-instructions (max: 1, dtype: bool)
|
||||
--output (min: 0, dtype: string)
|
||||
--pid (count: 1, dtype: int)
|
||||
--mode (count: 1)
|
||||
--force (max: 1, dtype: bool)
|
||||
--command (count: 1)
|
||||
--prefer (count: 1)
|
||||
--library (count: unlimited)
|
||||
--main-function (count: 1)
|
||||
--load (count: unlimited, dtype: string)
|
||||
--load-instr (count: unlimited, dtype: filepath)
|
||||
--init-functions (count: unlimited, dtype: string)
|
||||
--fini-functions (count: unlimited, dtype: string)
|
||||
--all-functions (max: 1, dtype: boolean)
|
||||
--function-include (count: unlimited)
|
||||
--function-exclude (count: unlimited)
|
||||
--function-restrict (count: unlimited)
|
||||
--caller-include (count: unlimited)
|
||||
--module-include (count: unlimited)
|
||||
--module-exclude (count: unlimited)
|
||||
--module-restrict (count: unlimited)
|
||||
--internal-function-include (count: unlimited)
|
||||
--internal-module-include (count: unlimited)
|
||||
--instruction-exclude (count: unlimited)
|
||||
--internal-library-deps (min: 0, dtype: boolean)
|
||||
--internal-library-append (count: unlimited)
|
||||
--internal-library-remove (count: unlimited)
|
||||
--linkage (min: 1)
|
||||
--visibility (min: 1)
|
||||
--label (count: unlimited, dtype: string)
|
||||
--config (min: 1, dtype: string)
|
||||
--default-components (count: unlimited, dtype: string)
|
||||
--env (count: unlimited)
|
||||
--mpi (max: 1, dtype: bool)
|
||||
--instrument-loops (max: 1, dtype: boolean)
|
||||
--min-instructions (count: 1, dtype: int)
|
||||
--min-address-range (count: 1, dtype: int)
|
||||
--min-instructions-loop (count: 1, dtype: int)
|
||||
--min-address-range-loop (count: 1, dtype: int)
|
||||
--coverage (max: 1, dtype: bool)
|
||||
--dynamic-callsites (max: 1, dtype: boolean)
|
||||
--traps (max: 1, dtype: boolean)
|
||||
--loop-traps (max: 1, dtype: boolean)
|
||||
--allow-overlapping (max: 1, dtype: bool)
|
||||
--parse-all-modules (max: 1, dtype: bool)
|
||||
--batch-size (count: 1, dtype: int)
|
||||
--dyninst-rt (min: 1, dtype: filepath)
|
||||
--dyninst-options (count: unlimited)
|
||||
] -- <CMD> <ARGS>
|
||||
|
||||
Options:
|
||||
-h, -?, --help Shows this page
|
||||
--version Prints the version and exit
|
||||
|
||||
[DEBUG OPTIONS]
|
||||
|
||||
-v, --verbose Verbose output
|
||||
-e, --error All warnings produce runtime errors
|
||||
--debug Debug output
|
||||
--log Number of log entries to display after an error. Any value < 0 will emit the entire log
|
||||
--log-file Write the log out the specified file during the run
|
||||
--simulate Exit after outputting diagnostic {available,instrumented,excluded,overlapping} module
|
||||
function lists, e.g. available.txt
|
||||
--print-format [ json | txt | xml ]
|
||||
Output format for diagnostic {available,instrumented,excluded,overlapping} module
|
||||
function lists, e.g. {print-dir}/available.txt
|
||||
--print-dir Output directory for diagnostic {available,instrumented,excluded,overlapping} module
|
||||
function lists, e.g. {print-dir}/available.txt
|
||||
--print-available [ functions | functions+ | modules | pair | pair+ ]
|
||||
Print the available entities for instrumentation (functions, modules, or module-function
|
||||
pair) to stdout after applying regular expressions
|
||||
--print-instrumented [ functions | functions+ | modules | pair | pair+ ]
|
||||
Print the instrumented entities (functions, modules, or module-function pair) to stdout
|
||||
after applying regular expressions
|
||||
--print-coverage [ functions | functions+ | modules | pair | pair+ ]
|
||||
Print the instrumented coverage entities (functions, modules, or module-function pair) to
|
||||
stdout after applying regular expressions
|
||||
--print-excluded [ functions | functions+ | modules | pair | pair+ ]
|
||||
Print the entities for instrumentation (functions, modules, or module-function pair)
|
||||
which are excluded from the instrumentation to stdout after applying regular expressions
|
||||
--print-overlapping [ functions | functions+ | modules | pair | pair+ ]
|
||||
Print the entities for instrumentation (functions, modules, or module-function pair)
|
||||
which overlap other function calls or have multiple entry points to stdout after applying
|
||||
regular expressions
|
||||
--print-instructions Print the instructions for each basic-block in the JSON/XML outputs
|
||||
|
||||
[MODE OPTIONS]
|
||||
|
||||
-o, --output Enable generation of a new executable (binary-rewrite). If a filename is not provided,
|
||||
omnitrace will use the basename and output to the cwd, unless the target binary is in the
|
||||
cwd. In the latter case, omnitrace will either use ${PWD}/<basename>.inst (non-libraries)
|
||||
or ${PWD}/instrumented/<basename> (libraries)
|
||||
-p, --pid Connect to running process
|
||||
-M, --mode [ coverage | sampling | trace ]
|
||||
Instrumentation mode. \'trace\' mode instruments the selected functions, \'sampling\' mode
|
||||
only instruments the main function to start and stop the sampler.
|
||||
-f, --force Force the command-line argument configuration, i.e. don't get cute. Useful for forcing
|
||||
runtime instrumentation of an executable that [A] Dyninst thinks is a library after
|
||||
reading ELF and [B] whose name makes it look like a library (e.g. starts with 'lib'
|
||||
and/or ends in \'.so\', \'.so.*\', or \'.a\')
|
||||
-c, --command Input executable and arguments (if \'-- <CMD>\' not provided)
|
||||
|
||||
[LIBRARY OPTIONS]
|
||||
|
||||
--prefer [ shared | static ] Prefer this library types when available
|
||||
-L, --library Libraries with instrumentation routines (default: "libomnitrace-dl")
|
||||
-m, --main-function The primary function to instrument around, e.g. \'main\'
|
||||
--load Supplemental instrumentation library names w/o extension (e.g. \'libinstr\' for
|
||||
\'libinstr.so\' or \'libinstr.a\')
|
||||
--load-instr Load {available,instrumented,excluded,overlapping}-instr JSON or XML file(s) and override
|
||||
what is read from the binary
|
||||
--init-functions Initialization function(s) for supplemental instrumentation libraries (see \'--load\'
|
||||
option)
|
||||
--fini-functions Finalization function(s) for supplemental instrumentation libraries (see \'--load\' option)
|
||||
--all-functions When finding functions, include the functions which are not instrumentable. This is
|
||||
purely diagnostic for the available/excluded functions output
|
||||
|
||||
[SYMBOL SELECTION OPTIONS]
|
||||
|
||||
-I, --function-include Regex(es) for including functions (despite heuristics)
|
||||
-E, --function-exclude Regex(es) for excluding functions (always applied)
|
||||
-R, --function-restrict Regex(es) for restricting functions only to those that match the provided
|
||||
regular-expressions
|
||||
--caller-include Regex(es) for including functions that call the listed functions (despite heuristics)
|
||||
-MI, --module-include Regex(es) for selecting modules/files/libraries (despite heuristics)
|
||||
-ME, --module-exclude Regex(es) for excluding modules/files/libraries (always applied)
|
||||
-MR, --module-restrict Regex(es) for restricting modules/files/libraries only to those that match the provided
|
||||
regular-expressions
|
||||
--internal-function-include Regex(es) for including functions which are (likely) utilized by omnitrace itself. Use
|
||||
this option with care.
|
||||
--internal-module-include Regex(es) for including modules/libraries which are (likely) utilized by omnitrace
|
||||
itself. Use this option with care.
|
||||
--instruction-exclude Regex(es) for excluding functions containing certain instructions
|
||||
--internal-library-deps Treat the libraries linked to the internal libraries as internal libraries. This increase
|
||||
the internal library processing time and consume more memory (so use with care) but may
|
||||
be useful when the application uses Boost libraries and Dyninst is dynamically linked
|
||||
against the same boost libraries
|
||||
--internal-library-append Append to the list of libraries which omnitrace treats as being used internally, e.g.
|
||||
OmniTrace will find all the symbols in this library and prevent them from being
|
||||
instrumented.
|
||||
--internal-library-remove [ ld-linux-x86-64.so.2
|
||||
libBrokenLocale.so.1
|
||||
libanl.so.1
|
||||
libbfd.so
|
||||
libbz2.so
|
||||
libc.so.6
|
||||
libcaliper.so
|
||||
libcommon.so
|
||||
libcrypt.so.1
|
||||
libdl.so.2
|
||||
libdw.so
|
||||
libdwarf.so
|
||||
libdyninstAPI_RT.so
|
||||
libelf.so
|
||||
libgcc_s.so.1
|
||||
libgotcha.so
|
||||
liblikwid.so
|
||||
liblzma.so
|
||||
libnsl.so.1
|
||||
libnss_compat.so.2
|
||||
libnss_db.so.2
|
||||
libnss_dns.so.2
|
||||
libnss_files.so.2
|
||||
libnss_hesiod.so.2
|
||||
libnss_ldap.so.2
|
||||
libnss_nis.so.2
|
||||
libnss_nisplus.so.2
|
||||
libnss_test1.so.2
|
||||
libnss_test2.so.2
|
||||
libpapi.so
|
||||
libpfm.so
|
||||
libprofiler.so
|
||||
libpthread.so.0
|
||||
libresolv.so.2
|
||||
librocm_smi64.so
|
||||
librocmtools.so
|
||||
librocprofiler64.so
|
||||
libroctracer64.so
|
||||
libroctx64.so
|
||||
librt.so.1
|
||||
libstdc++.so.6
|
||||
libtbb.so
|
||||
libtbbmalloc.so
|
||||
libtbbmalloc_proxy.so
|
||||
libtcmalloc.so
|
||||
libtcmalloc_and_profiler.so
|
||||
libtcmalloc_debug.so
|
||||
libtcmalloc_minimal.so
|
||||
libtcmalloc_minimal_debug.so
|
||||
libthread_db.so.1
|
||||
libunwind-coredump.so
|
||||
libunwind-generic.so
|
||||
libunwind-ptrace.so
|
||||
libunwind-setjmp.so
|
||||
libunwind-x86_64.so
|
||||
libunwind.so
|
||||
libutil.so.1
|
||||
libz.so
|
||||
libzstd.so ]
|
||||
Remove the specified libraries from being treated as being used internally, e.g.
|
||||
OmniTrace will permit all the symbols in these libraries to be eligible for
|
||||
instrumentation.
|
||||
--linkage [ global | local | unique | unknown | weak ]
|
||||
Only instrument functions with specified linkage (default: global, local, unique)
|
||||
--visibility [ default | hidden | internal | protected | unknown ]
|
||||
Only instrument functions with specified visibility (default: default, internal, hidden,
|
||||
protected)
|
||||
|
||||
[RUNTIME OPTIONS]
|
||||
|
||||
--label [ args | file | line | return ]
|
||||
Labeling info for functions. By default, just the function name is recorded. Use these
|
||||
options to gain more information about the function signature or location of the
|
||||
functions
|
||||
-C, --config Read in a configuration file and encode these values as the defaults in the executable
|
||||
-d, --default-components Default components to instrument (only useful when timemory is enabled in omnitrace
|
||||
library)
|
||||
--env Environment variables to add to the runtime in form VARIABLE=VALUE. E.g. use \'--env
|
||||
OMNITRACE_PROFILE=ON\' to default to using timemory instead of perfetto
|
||||
--mpi Enable MPI support (requires omnitrace built w/ full or partial MPI support). NOTE: this
|
||||
will automatically be activated if MPI_Init, MPI_Init_thread, MPI_Finalize,
|
||||
MPI_Comm_rank, or MPI_Comm_size are found in the symbol table of target
|
||||
|
||||
[GRANULARITY OPTIONS]
|
||||
|
||||
-l, --instrument-loops Instrument at the loop level
|
||||
-i, --min-instructions If the number of instructions in a function is less than this value, exclude it from
|
||||
instrumentation
|
||||
-r, --min-address-range If the address range of a function is less than this value, exclude it from
|
||||
instrumentation
|
||||
--min-instructions-loop If the number of instructions in a function containing a loop is less than this value,
|
||||
exclude it from instrumentation
|
||||
--min-address-range-loop If the address range of a function containing a loop is less than this value, exclude it
|
||||
from instrumentation
|
||||
--coverage [ basic_block | function | none ]
|
||||
Enable recording the code coverage. If instrumenting in coverage mode (\'-M converage\'),
|
||||
this simply specifies the granularity. If instrumenting in trace or sampling mode, this
|
||||
enables recording code-coverage in addition to the instrumentation of that mode (if any).
|
||||
--dynamic-callsites Force instrumentation if a function has dynamic callsites (e.g. function pointers)
|
||||
--traps Instrument points which require using a trap. On the x86 architecture, because
|
||||
instructions are of variable size, the instruction at a point may be too small for
|
||||
Dyninst to replace it with the normal code sequence used to call instrumentation. Also,
|
||||
when instrumentation is placed at points other than subroutine entry, exit, or call
|
||||
points, traps may be used to ensure the instrumentation fits. In this case, Dyninst
|
||||
replaces the instruction with a single-byte instruction that generates a trap.
|
||||
--loop-traps Instrument points within a loop which require using a trap (only relevant when
|
||||
--instrument-loops is enabled).
|
||||
--allow-overlapping Allow dyninst to instrument either multiple functions which overlap (share part of same
|
||||
function body) or single functions with multiple entry points. For more info, see Section
|
||||
2 of the DyninstAPI documentation.
|
||||
--parse-all-modules By default, omnitrace simply requests Dyninst to provide all the procedures in the
|
||||
application image. If this option is enabled, omnitrace will iterate over all the modules
|
||||
and extract the functions. Theoretically, it should be the same but the data is slightly
|
||||
different, possibly due to weak binding scopes. In general, enabling option will probably
|
||||
have no visible effect
|
||||
|
||||
[DYNINST OPTIONS]
|
||||
|
||||
-b, --batch-size Dyninst supports batch insertion of multiple points during runtime instrumentation. If
|
||||
one large batch insertion fails, this value will be used to create smaller batches.
|
||||
Larger batches generally decrease the instrumentation time
|
||||
--dyninst-rt Path(s) to the dyninstAPI_RT library
|
||||
--dyninst-options [ BaseTrampDeletion
|
||||
DebugParsing
|
||||
DelayedParsing
|
||||
InstrStackFrames
|
||||
MergeTramp
|
||||
SaveFPR
|
||||
TrampRecursive
|
||||
TypeChecking ]
|
||||
Advanced dyninst options: BPatch::set<OPTION>(bool), e.g. bpatch->setTrampRecursive(true)
|
||||
|
||||
``omnitrace-instrument`` uses a syntax similar to LLVM's to separate command-line arguments from the
|
||||
application's arguments. It uses a standalone
|
||||
double-hyphen (``--``) as a separator.
|
||||
All arguments preceding the double-hyphen
|
||||
are interpreted as belonging to Omnitrace and all arguments following the
|
||||
double-hyphen are interpreted as being part of the
|
||||
application and its arguments. In binary rewrite mode, all application arguments after the first argument
|
||||
are ignored. As an example, ``./omnitrace-instrument -o ls.inst -- ls -l`` interprets ``ls`` as
|
||||
the target to instrument, ignoring the ``-l`` argument,
|
||||
and generates a ``ls.inst`` executable that you can subsequently run using the
|
||||
``omnitrace-run -- ls.inst -l`` command.
|
||||
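The separator rule can be illustrated with a small shell sketch. The ``split_at_double_hyphen`` function is a hypothetical helper written for this illustration, not part of Omnitrace; it only shows which arguments fall on each side of the first standalone ``--``.

```shell
# Hypothetical helper: print the arguments that precede and follow the first
# standalone "--". Arguments are joined with spaces for display only.
split_at_double_hyphen() {
    tool_args=""
    app_args=""
    seen_separator=0
    for arg in "$@"; do
        if [ "$seen_separator" -eq 0 ] && [ "$arg" = "--" ]; then
            seen_separator=1
        elif [ "$seen_separator" -eq 0 ]; then
            tool_args="${tool_args}${tool_args:+ }${arg}"
        else
            app_args="${app_args}${app_args:+ }${arg}"
        fi
    done
    echo "tool: ${tool_args}"
    echo "app: ${app_args}"
}

split_at_double_hyphen -o ls.inst -- ls -l
```

For the command line from the text, the sketch reports ``-o ls.inst`` on the tool side and ``ls -l`` on the application side.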
|
||||
Runtime instrumentation example
|
||||
========================================
|
||||
|
||||
The following example shows how to enable runtime instrumentation.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-instrument <omnitrace-options> -- <exe> [<exe-options>...]
|
||||
|
||||
Attaching to a running process
|
||||
========================================
|
||||
|
||||
Use the following command to attach to an active process.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-instrument <omnitrace-options> -p <PID> -- <exe-name>
|
||||
|
||||
Binary rewrite
|
||||
========================================
|
||||
|
||||
This example demonstrates how to rewrite a binary.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-instrument <omnitrace-options> -o <name-of-new-exe-or-library> -- <exe-or-library>
|
||||
|
||||
.. _binary-rewriting-library-label:
|
||||
|
||||
Binary rewrite of a library
|
||||
-----------------------------------
|
||||
|
||||
Many applications bundle the bulk of their functionality into one or more
|
||||
dynamic libraries and have a relatively simple ``main``
|
||||
which links to these libraries and serves as the "driver" for
|
||||
setting up the workflow. If you perform a binary rewrite of an
|
||||
executable like this and find there is insufficient information, you
|
||||
can either switch to runtime instrumentation or perform a
|
||||
binary rewrite on the relevant libraries.
|
||||
|
||||
Support for stand-alone binary rewriting of a dynamic library without a binary rewrite of
|
||||
the executable is a beta feature.
|
||||
In general, it is supported as long as the library contains the ``_init`` and
|
||||
``_fini`` symbols but these symbols are not
|
||||
standardized to the extent of ``main`` in an executable.
|
||||
|
||||
Here is the recommended workflow for the binary rewrite of a library:
|
||||
|
||||
#. Determine the names of the dynamically linked libraries of interest using ``ldd``
|
||||
#. Generate a binary rewrite of the executable
|
||||
#. Generate a binary rewrite of the desired libraries with the same base name as the
|
||||
original library, for example, ``libfoo.so.2`` instead of ``libfoo.so``, and output the instrumented
|
||||
library into a different folder than the original library.
|
||||
|
||||
#. Prefix the ``LD_LIBRARY_PATH`` environment variable with the output folder from the previous step
|
||||
#. Use ``ldd`` to verify that the instrumented executable can resolve the location of the instrumented library
|
||||
|
||||
Binary rewrite of a library example
|
||||
-----------------------------------
|
||||
|
||||
The ``foo`` executable is dynamically linked to ``libfoo.so.2``:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ pwd
|
||||
/home/user
|
||||
$ which foo
|
||||
/usr/local/bin/foo
|
||||
$ ldd /usr/local/bin/foo
|
||||
...
|
||||
libfoo.so.2 => /usr/local/lib/libfoo.so.2 (...)
|
||||
...
|
||||
|
||||
Generate binary rewrites of ``foo`` and ``libfoo.so.2``:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-instrument -o ./foo.inst -- foo
|
||||
omnitrace-instrument -o ./libfoo.so.2 -- /usr/local/lib/libfoo.so.2
|
||||
|
||||
At this point, the instrumented ``foo.inst`` executable still dynamically loads the
|
||||
original ``libfoo.so.2`` in ``/usr/local/lib``:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ ldd ./foo.inst
|
||||
...
|
||||
libfoo.so.2 => /usr/local/lib/libfoo.so.2 (...)
|
||||
...
|
||||
|
||||
Prefix the ``LD_LIBRARY_PATH`` environment variable with the folder containing
|
||||
the instrumented ``libfoo.so.2``:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
export LD_LIBRARY_PATH=/home/user:${LD_LIBRARY_PATH}
|
||||
|
||||
``foo.inst`` now loads the instrumented library when it runs:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ ldd ./foo.inst
|
||||
...
|
||||
libfoo.so.2 => /home/user/libfoo.so.2 (...)
|
||||
...
|
||||
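The final verification step can also be scripted. The following is a minimal sketch, assuming ``ldd`` prints lines of the form ``name => path (...)`` as in the output above; ``resolves_under`` is a hypothetical helper, not an Omnitrace command.

```shell
# Hypothetical helper: read ldd-style output on stdin and succeed only if
# library $1 resolves to a path under directory $2.
resolves_under() {
    awk -v lib="$1" -v dir="$2" '
        $1 == lib && $2 == "=>" { found = (index($3, dir "/") == 1) }
        END { exit(found ? 0 : 1) }
    '
}

# Typical use: ldd ./foo.inst | resolves_under libfoo.so.2 /home/user
# Here a sample line stands in for real ldd output.
printf '\tlibfoo.so.2 => /home/user/libfoo.so.2 (...)\n' |
    resolves_under libfoo.so.2 /home/user && echo "instrumented library found"
```

A non-zero exit status means the executable still resolves the original library, so ``LD_LIBRARY_PATH`` needs to be checked.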
|
||||
Selective instrumentation
|
||||
========================================
|
||||
|
||||
By default, ``omnitrace-instrument`` does not instrument every symbol in the binary.
|
||||
The default rules are:
|
||||
|
||||
* Skip instrumenting dynamic call-sites (such as function pointers)
|
||||
|
||||
* The ``--dynamic-callsites`` option forces instrumentation for all dynamic call-sites
|
||||
|
||||
* The cost of a function can be loosely approximated by the number of
|
||||
instructions. By default, ``omnitrace-instrument`` only instruments functions
|
||||
with at least 1024 instructions
|
||||
|
||||
* The ``--min-instructions`` option modifies this heuristic for all functions which do not contain loops
|
||||
* The ``--min-instructions-loop`` option modifies this heuristic for functions which contain loops.
|
||||
|
||||
* The cost of a function can also be loosely approximated by the size of the function
|
||||
in the binary so this heuristic can be used in lieu of or in addition to the
|
||||
minimum number of instructions
|
||||
|
||||
* The ``--min-address-range`` option modifies this heuristic for all functions which do not contain loops
|
||||
* The ``--min-address-range-loop`` option modifies this heuristic for functions which contain loops
|
||||
|
||||
* Skip instrumentation points which require using a trap
|
||||
|
||||
* See the description for the ``--traps`` and ``--loop-traps`` options for more information
|
||||
|
||||
* Skip instrumenting loops within the body of a function
|
||||
|
||||
* The ``--instrument-loops`` option enables instrumentation of these loops
|
||||
|
||||
* Skip instrumenting functions with overlapping function bodies and single
|
||||
functions with multiple entry points
|
||||
|
||||
* These behaviors arise from various optimizations. Enable instrumenting for these functions
|
||||
by using the ``--allow-overlapping`` option
|
||||
|
||||
.. note::
|
||||
|
||||
The separate loop options ``--min-instructions-loop`` and ``--min-address-range-loop``
|
||||
are provided because functions with loops can be compact in the binary while still being costly to execute.
|
||||
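As an illustration, the options above can be combined on one command line. This sketch only prints the command so it can be reviewed first. The threshold values are arbitrary placeholders rather than recommendations, and ``print_instrument_command`` is a hypothetical helper.

```shell
# Hypothetical helper: print (not run) a command that lowers the instruction
# thresholds and opts in to loop and dynamic call-site instrumentation.
# $1 = threshold for functions without loops, $2 = threshold for functions with loops.
print_instrument_command() {
    echo omnitrace-instrument --simulate \
        --min-instructions "$1" \
        --min-instructions-loop "$2" \
        --instrument-loops \
        --dynamic-callsites \
        -o foo.inst -- foo
}

# Placeholder thresholds; suitable values depend on the application.
print_instrument_command 256 64
```

Because the printed command includes ``--simulate``, running it would report what gets instrumented without generating ``foo.inst``.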
|
||||
Viewing the available, instrumented, excluded, and overlapping functions
|
||||
-------------------------------------------------------------------------
|
||||
|
||||
Whenever ``omnitrace-instrument`` runs with a verbosity of zero or higher,
|
||||
it generates files that detail which functions
|
||||
were available for instrumentation (along with the module they were defined in), were actually instrumented,
|
||||
were excluded, or contained overlapping function bodies.
|
||||
By default, these files are saved to the ``omnitrace-<NAME>-output`` folder
|
||||
where ``<NAME>`` is the base name of the targeted binary (or
|
||||
the base name of the resulting executable in the case of binary rewrite). For example,
|
||||
``omnitrace-instrument -- ls`` outputs these files to ``omnitrace-ls-output``
|
||||
whereas ``omnitrace-instrument -o ls.inst -- ls`` places them in ``omnitrace-ls.inst-output``.
|
||||
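The naming rule can be sketched as a small shell function. ``default_output_dir`` is a hypothetical helper that only mirrors the rule described above; it does not query Omnitrace.

```shell
# Hypothetical helper: derive the default output folder name.
# $1 = target binary, $2 = optional value passed to -o (binary rewrite).
default_output_dir() {
    target=$1
    rewrite_output=${2:-}
    if [ -n "$rewrite_output" ]; then
        name=$(basename "$rewrite_output")
    else
        name=$(basename "$target")
    fi
    echo "omnitrace-${name}-output"
}

default_output_dir ls            # runtime instrumentation of ls
default_output_dir ls ls.inst    # binary rewrite with -o ls.inst
```

The two calls print ``omnitrace-ls-output`` and ``omnitrace-ls.inst-output``, matching the examples in the text.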
|
||||
To generate these files without running or generating an
|
||||
executable, use the ``--simulate`` option:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-instrument --simulate -- foo
|
||||
omnitrace-instrument --simulate -o foo.inst -- foo
|
||||
|
||||
Excluding and including modules and functions
|
||||
----------------------------------------------
|
||||
|
||||
Omnitrace has a set of six command-line options which each accept one or more
|
||||
regular expressions for customizing the scope of which modules and/or functions are
|
||||
instrumented. Multiple regex patterns per option are treated as an OR operation,
|
||||
for example, ``--module-include libfoo libbar`` is effectively the same as ``--module-include 'libfoo|libbar'``.
|
||||
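The OR behavior can be previewed outside Omnitrace. This sketch joins the patterns with ``|`` and uses ``grep -E`` as a stand-in regex engine, which is an assumption: Omnitrace's own regex dialect may differ in details. The module names are hypothetical.

```shell
# Join all arguments with "|", the OR combination described above.
join_patterns() {
    joined=""
    for pattern in "$@"; do
        joined="${joined}${joined:+|}${pattern}"
    done
    echo "$joined"
}

combined=$(join_patterns libfoo libbar)
echo "$combined"

# Hypothetical module names; only those matching either pattern are printed.
printf '%s\n' libfoo.so.2 libbar.so libbaz.so | grep -E -- "$combined"
```

The same approach can be used to check a pattern against real module names before passing it to ``--module-include``.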
|
||||
To force the inclusion of certain modules and/or functions
|
||||
without changing any of the heuristics, use the ``--module-include`` and/or ``--function-include`` options.
|
||||
These options do not exclude modules or functions which do
|
||||
not satisfy their regular expression.
|
||||
|
||||
To narrow the scope of the instrumentation to a specific set
|
||||
of libraries and/or functions, use the ``--module-restrict`` and ``--function-restrict`` options.
|
||||
These options let you exclusively select the union of one or more
|
||||
regular expressions, regardless of whether or not the functions satisfy the
|
||||
previously mentioned default heuristics. Any function or module that is not within
|
||||
the union of these regular expressions is excluded from instrumentation.
|
||||
|
||||
To avoid instrumenting a set of modules and/or functions,
|
||||
use the ``--module-exclude`` and ``--function-exclude`` options.
|
||||
These options are always applied, even if the module or function
|
||||
satisfies the "restrict" or "include" regular expression.
|
||||
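The way the three option families interact can be modeled with a short decision function. This is a sketch of the rules as described in this section, not Omnitrace's implementation. The ``decide`` and ``matches`` helpers are hypothetical, ``grep -E`` stands in for the regex engine, and the order in which "restrict" and "include" are checked is an assumption.

```shell
# Succeeds when name $1 matches extended regex $2.
matches() { printf '%s\n' "$1" | grep -Eq -- "$2"; }

# decide NAME PASSES_HEURISTICS(yes|no) INCLUDE RESTRICT EXCLUDE
# An empty pattern means the corresponding option was not given.
decide() {
    name=$1 heuristics=$2 include=$3 restrict=$4 exclude=$5
    if [ -n "$exclude" ] && matches "$name" "$exclude"; then
        echo skip          # "exclude" is always applied
    elif [ -n "$restrict" ] && ! matches "$name" "$restrict"; then
        echo skip          # outside the restricted set
    elif [ -n "$restrict" ] || { [ -n "$include" ] && matches "$name" "$include"; }; then
        echo instrument    # selected regardless of the default heuristics
    elif [ "$heuristics" = yes ]; then
        echo instrument    # no option applies, the heuristics decide
    else
        echo skip
    fi
}

decide CommSend no Comm "" ""        # forced in by "include"
decide CommSend yes "" "^Domain" ""  # dropped by "restrict"
decide CommSend no Comm "" Send      # "exclude" overrides "include"
```

In this model a function that matches an "exclude" pattern is never instrumented, which matches the statement that these options are always applied.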
|
||||
.. _available-module-function-output:
|
||||
|
||||
An example of the available module and function info output
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-instrument -o lulesh.inst --label file line args --simulate -- lulesh
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
AddressRange Module Function FunctionSignature
|
||||
9165 ../examples/lulesh/lulesh-comm.cc CommMonoQ CommMonoQ(domain) [lulesh-comm.cc:1891]
|
||||
3396 ../examples/lulesh/lulesh-comm.cc CommRecv CommRecv(domain, int, Index_t, Index_t, Index_t, Index_t, bool, bool) [lulesh...
|
||||
8666 ../examples/lulesh/lulesh-comm.cc CommSBN CommSBN(domain, int, Domain_member *) [lulesh-comm.cc:926]
|
||||
10212 ../examples/lulesh/lulesh-comm.cc CommSend CommSend(domain, int, Index_t, Domain_member *, Index_t, Index_t, Index_t, bo...
|
||||
6823 ../examples/lulesh/lulesh-comm.cc CommSyncPosVel CommSyncPosVel(domain) [lulesh-comm.cc:1404]
|
||||
126 ../examples/lulesh/lulesh-comm.cc _GLOBAL__sub_I_lulesh_comm.cc _GLOBAL__sub_I_lulesh_comm.cc() [lulesh-comm.cc]
|
||||
308 ../examples/lulesh/lulesh-init.cc .omp_outlined..26 .omp_outlined..26(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
|
||||
628 ../examples/lulesh/lulesh-init.cc .omp_outlined..34 .omp_outlined..34(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
|
||||
656 ../examples/lulesh/lulesh-init.cc .omp_outlined..41 .omp_outlined..41(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
|
||||
662 ../examples/lulesh/lulesh-init.cc .omp_outlined..45 .omp_outlined..45(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
|
||||
550 ../examples/lulesh/lulesh-init.cc .omp_outlined..55 .omp_outlined..55(const , const , const ParallelFor<Kokkos::Impl::ViewFill<Ko...
|
||||
556 ../examples/lulesh/lulesh-init.cc .omp_outlined..57 .omp_outlined..57(const , const , const ParallelFor<Kokkos::Impl::ViewFill<Ko...
|
||||
550 ../examples/lulesh/lulesh-init.cc .omp_outlined..78 .omp_outlined..78(const , const , const ParallelFor<Kokkos::Impl::ViewFill<Ko...
|
||||
640 ../examples/lulesh/lulesh-init.cc .omp_outlined..84 .omp_outlined..84(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
|
||||
646 ../examples/lulesh/lulesh-init.cc .omp_outlined..88 .omp_outlined..88(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
|
||||
1840 ../examples/lulesh/lulesh-init.cc Domain::AllocateElemPersistent Domain::AllocateElemPersistent(Domain *, Int_t) [lulesh-init.cc:94]
|
||||
1384 ../examples/lulesh/lulesh-init.cc Domain::AllocateNodePersistent Domain::AllocateNodePersistent(Domain *, Int_t) [lulesh-init.cc:94]
|
||||
1264 ../examples/lulesh/lulesh-init.cc Domain::BuildMesh Domain::BuildMesh(Domain *, Int_t, Int_t, Int_t) [lulesh-init.cc:308]
|
||||
2312 ../examples/lulesh/lulesh-init.cc Domain::CreateRegionIndexSets Domain::CreateRegionIndexSets(Domain *, Int_t, Int_t) [lulesh-init.cc:409]
|
||||
7109 ../examples/lulesh/lulesh-init.cc Domain::Domain Domain::Domain(Domain *, Int_t, Index_t, Index_t, Index_t, Index_t, int, int,...
|
||||
2458 ../examples/lulesh/lulesh-init.cc Domain::SetupBoundaryConditions Domain::SetupBoundaryConditions(Domain *, Int_t) [lulesh-init.cc:409]
|
||||
956 ../examples/lulesh/lulesh-init.cc Domain::SetupCommBuffers Domain::SetupCommBuffers(Domain *, Int_t) [lulesh-init.cc]
|
||||
1456 ../examples/lulesh/lulesh-init.cc Domain::SetupElementConnectivities Domain::SetupElementConnectivities(Domain *, Int_t) [lulesh-init.cc:409]
|
||||
721 ../examples/lulesh/lulesh-init.cc Domain::SetupSymmetryPlanes Domain::SetupSymmetryPlanes(Domain *, Int_t) [lulesh-init.cc:409]
|
||||
1591 ../examples/lulesh/lulesh-init.cc Domain::SetupThreadSupportStructures Domain::SetupThreadSupportStructures(Domain *) [lulesh-init.cc:376]
|
||||
1644 ../examples/lulesh/lulesh-init.cc Domain::~Domain Domain::~Domain(Domain *) [lulesh-init.cc:286]
|
||||
218 ../examples/lulesh/lulesh-init.cc InitMeshDecomp InitMeshDecomp(Int_t, Int_t, Int_t *, Int_t *, Int_t *, Int_t *) [lulesh-init...
|
||||
260 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::CommonSubview<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokk... Kokkos::Impl::CommonSubview<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokk...
|
||||
1786 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::HostIterateTile<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::R... Kokkos::Impl::HostIterateTile<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::R...
|
||||
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int**... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int**...
|
||||
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int**... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int**...
|
||||
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int*,... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int*,...
|
||||
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int*,... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int*,...
|
||||
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl...
|
||||
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl...
|
||||
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl...
|
||||
522 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::... Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::...
|
||||
232 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::... Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::...
|
||||
49 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewVal... Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewVal...
|
||||
1476 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::Tile_Loop_Type<2, false, int, void, void>::apply<Kokkos::Impl::... Kokkos::Impl::Tile_Loop_Type<2, false, int, void, void>::apply<Kokkos::Impl::...
|
||||
555 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::LayoutRight, Kokkos::Devic... Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::LayoutRight, Kokkos::Devic...
|
||||
613 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::LayoutRight, Kokkos::Devic... Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::LayoutRight, Kokkos::Devic...
|
||||
603 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCopy<Kokkos::View<int*, Kokkos::LayoutLeft, Kokkos::Device<... Kokkos::Impl::ViewCopy<Kokkos::View<int*, Kokkos::LayoutLeft, Kokkos::Device<...
|
||||
604 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCopy<Kokkos::View<int*, Kokkos::LayoutLeft, Kokkos::Device<... Kokkos::Impl::ViewCopy<Kokkos::View<int*, Kokkos::LayoutLeft, Kokkos::Device<...
|
||||
281 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
|
||||
281 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
|
||||
281 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
|
||||
281 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
|
||||
281 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
|
||||
524 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev... Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev...
|
||||
525 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev... Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev...
|
||||
524 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev... Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev...
|
||||
583 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<int* [8], Kokkos::LayoutRight>, ... SharedAllocationRecord<void, void> * Kokkos::Impl::ViewMapping<Kokkos::ViewTr...
|
||||
529 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<int*, Kokkos::HostSpace>, void>:... SharedAllocationRecord<void, void> * Kokkos::Impl::ViewMapping<Kokkos::ViewTr...
|
||||
529 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<int*>, void>::allocate_shared<st... SharedAllocationRecord<void, void> * Kokkos::Impl::ViewMapping<Kokkos::ViewTr...
|
||||
203 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewRemap<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::... Kokkos::Impl::ViewRemap<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::...
|
||||
331 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewRemap<Kokkos::View<int*>, Kokkos::View<int*>, Kokkos::OpenM... Kokkos::Impl::ViewRemap<Kokkos::View<int*>, Kokkos::View<int*>, Kokkos::OpenM...
|
||||
461 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpa... enable_if_t<std::is_trivial<int>::value && std::is_trivially_copy_assignable<...
|
||||
353 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::contiguous_fill<Kokkos::OpenMP, double*> Kokkos::Impl::contiguous_fill<Kokkos::OpenMP, double*>(exec_space, dst, value...
|
||||
139 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::contiguous_fill<Kokkos::OpenMP, double, Kokkos::LayoutRight, Ko... Kokkos::Impl::contiguous_fill<Kokkos::OpenMP, double, Kokkos::LayoutRight, Ko...
|
||||
824 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight, Kokkos::D... Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight, Kokkos::D...
|
||||
824 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight, Kokkos::D... Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight, Kokkos::D...
|
||||
824 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::... Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::...
|
||||
824 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::... Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::...
|
||||
697 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::view_copy<Kokkos::View<int*, Kokkos::LayoutRight, Kokkos::Devic... Kokkos::Impl::view_copy<Kokkos::View<int*, Kokkos::LayoutRight, Kokkos::Devic...
|
||||
697 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::view_copy<Kokkos::View<int*>, Kokkos::View<int*> > Kokkos::Impl::view_copy<Kokkos::View<int*>, Kokkos::View<int*> >(dst, src) [l...
|
||||
2036 ../examples/lulesh/lulesh-init.cc Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::Schedule<Kokkos::Static>, int>::R... Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::Schedule<Kokkos::Static>, int>::R...
|
||||
2506 ../examples/lulesh/lulesh-init.cc Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::Schedule<Kokkos::Static>, long>::... Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::Schedule<Kokkos::Static>, long>::...
|
||||
271 ../examples/lulesh/lulesh-init.cc Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor... Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor...
|
||||
470 ../examples/lulesh/lulesh-init.cc Kokkos::View<int* [8], Kokkos::LayoutRight>::View<std::__cxx11::basic_string<... Kokkos::View<int* [8], Kokkos::LayoutRight>::View<std::__cxx11::basic_string<...
|
||||
323 ../examples/lulesh/lulesh-init.cc Kokkos::View<int* [8], Kokkos::LayoutRight>::View<std::__cxx11::basic_string<... Kokkos::View<int* [8], Kokkos::LayoutRight>::View<std::__cxx11::basic_string<...
|
||||
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*, Kokkos::HostSpace>::View<char [10]> Kokkos::View<int*, Kokkos::HostSpace>::View<char [10]>(View<int *, Kokkos::Ho...
|
||||
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*, Kokkos::HostSpace>::View<char [14]> Kokkos::View<int*, Kokkos::HostSpace>::View<char [14]>(View<int *, Kokkos::Ho...
|
||||
462 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*, Kokkos::HostSpace>::View<std::__cxx11::basic_string<char, ... Kokkos::View<int*, Kokkos::HostSpace>::View<std::__cxx11::basic_string<char, ...
|
||||
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<char [16]> Kokkos::View<int*>::View<char [16]>(View<int *> *, arg_label, type, const siz...
|
||||
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<char [19]> Kokkos::View<int*>::View<char [19]>(View<int *> *, arg_label, type, const siz...
|
||||
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<char [21]> Kokkos::View<int*>::View<char [21]>(View<int *> *, arg_label, type, const siz...
|
||||
462 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<std::__cxx11::basic_string<char, std::char_traits<ch... Kokkos::View<int*>::View<std::__cxx11::basic_string<char, std::char_traits<ch...
|
||||
323 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<std::__cxx11::basic_string<char, std::char_traits<ch... Kokkos::View<int*>::View<std::__cxx11::basic_string<char, std::char_traits<ch...
|
||||
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<double*, , double*, Kokkos::LayoutRight, Kokkos::Device<Kok... Kokkos::deep_copy<double*, , double*, Kokkos::LayoutRight, Kokkos::Device<Kok...
|
||||
1052 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<double*> Kokkos::deep_copy<double*>(dst, value) [lulesh-init.cc]
|
||||
1050 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<double, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP,... Kokkos::deep_copy<double, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP,...
|
||||
7686 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenM... Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenM...
|
||||
7686 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, int* [8], Kokkos::LayoutRigh... Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, int* [8], Kokkos::LayoutRigh...
|
||||
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int*, , int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::O... Kokkos::deep_copy<int*, , int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::O...
|
||||
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Ko... Kokkos::deep_copy<int*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Ko...
|
||||
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP, K... Kokkos::deep_copy<int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP, K...
|
||||
863 ../examples/lulesh/lulesh-init.cc Kokkos::impl_resize<, int* [8], Kokkos::LayoutRight> type Kokkos::impl_resize<, int* [8], Kokkos::LayoutRight>(v, const size_t, co...
|
||||
854 ../examples/lulesh/lulesh-init.cc Kokkos::impl_resize<, int*> type Kokkos::impl_resize<, int*>(v, const size_t, const size_t, const size_t,...
|
||||
697 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (... Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (...
|
||||
706 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (... Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (...
|
||||
912 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
|
||||
791 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
|
||||
791 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
|
||||
944 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
|
||||
839 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
|
||||
126 ../examples/lulesh/lulesh-init.cc _GLOBAL__sub_I_lulesh_init.cc _GLOBAL__sub_I_lulesh_init.cc() [lulesh-init.cc]
|
||||
6589 ../examples/lulesh/lulesh-util.cc Kokkos::deep_copy<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP... Kokkos::deep_copy<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP...
|
||||
1345 ../examples/lulesh/lulesh-util.cc ParseCommandLineOptions ParseCommandLineOptions(int, char * *, int, cmdLineOpts *) [lulesh-util.cc:67]
|
||||
171 ../examples/lulesh/lulesh-util.cc PrintCommandLineOptions PrintCommandLineOptions(char *, int) [lulesh-util.cc:31]
|
||||
67 ../examples/lulesh/lulesh-util.cc StrToInt int StrToInt(const char *, int *) [lulesh-util.cc:13]
|
||||
706 ../examples/lulesh/lulesh-util.cc VerifyAndWriteFinalOutput VerifyAndWriteFinalOutput(Real_t, locDom, Int_t, Int_t) [lulesh-util.cc:222]
|
||||
126 ../examples/lulesh/lulesh-util.cc _GLOBAL__sub_I_lulesh_util.cc _GLOBAL__sub_I_lulesh_util.cc() [lulesh-util.cc]
|
||||
17 ../examples/lulesh/lulesh-viz.cc DumpToVisit DumpToVisit(domain, int, int, int) [lulesh-viz.cc:415]
|
||||
126 ../examples/lulesh/lulesh-viz.cc _GLOBAL__sub_I_lulesh_viz.cc _GLOBAL__sub_I_lulesh_viz.cc() [lulesh-viz.cc]
|
||||
451 ../examples/lulesh/lulesh.cc .omp_outlined..103 .omp_outlined..103(const , const , const ParallelReduce<(lambda at ../example...
|
||||
796 ../examples/lulesh/lulesh.cc .omp_outlined..109 .omp_outlined..109(const , const , const ParallelFor<(lambda at ../examples/l...
|
||||
394 ../examples/lulesh/lulesh.cc .omp_outlined..111 .omp_outlined..111(const , const , const ParallelFor<(lambda at ../examples/l...
|
||||
402 ../examples/lulesh/lulesh.cc .omp_outlined..113 .omp_outlined..113(const , const , const ParallelFor<(lambda at ../examples/l...
|
||||
427 ../examples/lulesh/lulesh.cc .omp_outlined..115 .omp_outlined..115(const , const , const ParallelReduce<(lambda at ../example...
|
||||
859 ../examples/lulesh/lulesh.cc .omp_outlined..119 .omp_outlined..119(const , const , const ParallelFor<(lambda at ../examples/l...
|
||||
243 ../examples/lulesh/lulesh.cc .omp_outlined..122 .omp_outlined..122(const , const , const ParallelFor<(lambda at ../examples/l...
|
||||
426 ../examples/lulesh/lulesh.cc .omp_outlined..124 .omp_outlined..124(const , const , const ParallelFor<(lambda at ../examples/l...
|
||||
529 ../examples/lulesh/lulesh.cc .omp_outlined..127 .omp_outlined..127(const , const , const ParallelFor<(lambda at ../examples/l...
|
||||
865 ../examples/lulesh/lulesh.cc .omp_outlined..130 .omp_outlined..130(const , const , const ParallelFor<(lambda at ../examples/l...
|
||||
539 ../examples/lulesh/lulesh.cc .omp_outlined..132 .omp_outlined..132(const , const , const ParallelReduce<(lambda at ../example...
|
||||
456 ../examples/lulesh/lulesh.cc .omp_outlined..134 .omp_outlined..134(const , const , const ParallelReduce<(lambda at ../example...
|
||||
252 ../examples/lulesh/lulesh.cc .omp_outlined..20 .omp_outlined..20(const , const , const ParallelFor<(lambda at ../examples/lu...
|
||||
870 ../examples/lulesh/lulesh.cc .omp_outlined..35 .omp_outlined..35(const , const , const ParallelFor<(lambda at ../examples/lu...
|
||||
473 ../examples/lulesh/lulesh.cc .omp_outlined..42 .omp_outlined..42(const , const , const ParallelFor<(lambda at ../examples/lu...
|
||||
252 ../examples/lulesh/lulesh.cc .omp_outlined..46 .omp_outlined..46(const , const , const ParallelFor<(lambda at ../examples/lu...
|
||||
1101 ../examples/lulesh/lulesh.cc .omp_outlined..48 .omp_outlined..48(const , const , const ParallelFor<(lambda at ../examples/lu...
|
||||
427 ../examples/lulesh/lulesh.cc .omp_outlined..55 .omp_outlined..55(const , const , const ParallelReduce<(lambda at ../examples...
|
||||
1326 ../examples/lulesh/lulesh.cc .omp_outlined..57 .omp_outlined..57(const , const , const ParallelReduce<(lambda at ../examples...
|
||||
243 ../examples/lulesh/lulesh.cc .omp_outlined..61 .omp_outlined..61(const , const , const ParallelFor<(lambda at ../examples/lu...
|
||||
1101 ../examples/lulesh/lulesh.cc .omp_outlined..63 .omp_outlined..63(const , const , const ParallelFor<(lambda at ../examples/lu...
|
||||
372 ../examples/lulesh/lulesh.cc .omp_outlined..66 .omp_outlined..66(const , const , const ParallelFor<(lambda at ../examples/lu...
|
||||
499 ../examples/lulesh/lulesh.cc .omp_outlined..71 .omp_outlined..71(const , const , const ParallelFor<(lambda at ../examples/lu...
|
||||
499 ../examples/lulesh/lulesh.cc .omp_outlined..73 .omp_outlined..73(const , const , const ParallelFor<(lambda at ../examples/lu...
|
||||
499 ../examples/lulesh/lulesh.cc .omp_outlined..75 .omp_outlined..75(const , const , const ParallelFor<(lambda at ../examples/lu...
|
||||
465 ../examples/lulesh/lulesh.cc .omp_outlined..78 .omp_outlined..78(const , const , const ParallelFor<(lambda at ../examples/lu...
|
||||
396 ../examples/lulesh/lulesh.cc .omp_outlined..81 .omp_outlined..81(const , const , const ParallelFor<(lambda at ../examples/lu...
|
||||
656 ../examples/lulesh/lulesh.cc .omp_outlined..85 .omp_outlined..85(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
|
||||
662 ../examples/lulesh/lulesh.cc .omp_outlined..89 .omp_outlined..89(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
|
||||
443 ../examples/lulesh/lulesh.cc .omp_outlined..93 .omp_outlined..93(const , const , const ParallelReduce<(lambda at ../examples...
|
||||
243 ../examples/lulesh/lulesh.cc .omp_outlined..96 .omp_outlined..96(const , const , const ParallelFor<(lambda at ../examples/lu...
|
||||
243 ../examples/lulesh/lulesh.cc .omp_outlined..99 .omp_outlined..99(const , const , const ParallelFor<(lambda at ../examples/lu...
|
||||
13367 ../examples/lulesh/lulesh.cc ApplyMaterialPropertiesForElems ApplyMaterialPropertiesForElems(domain) [lulesh.cc:409]
|
||||
1530 ../examples/lulesh/lulesh.cc CalcElemCharacteristicLength Real_t CalcElemCharacteristicLength(const Real_t *, const Real_t *, const Rea...
|
||||
982 ../examples/lulesh/lulesh.cc CalcElemFBHourglassForce CalcElemFBHourglassForce(const Real_t *, const Real_t[] *, coefficient, Real_...
|
||||
2428 ../examples/lulesh/lulesh.cc CalcElemNodeNormals CalcElemNodeNormals(Real_t *, Real_t *, Real_t *, const Real_t *, const Real_...
|
||||
853 ../examples/lulesh/lulesh.cc CalcElemShapeFunctionDerivatives CalcElemShapeFunctionDerivatives(const Real_t *, const Real_t *, const Real_t...
|
||||
1097 ../examples/lulesh/lulesh.cc CalcElemVolumeDerivative CalcElemVolumeDerivative(i, dvdx, dvdy, dvdz, const Real_t *, const Real_t *,...
|
||||
1054 ../examples/lulesh/lulesh.cc CalcKinematicsForElems CalcKinematicsForElems(domain, Real_t, Index_t) [lulesh.cc]
|
||||
14160 ../examples/lulesh/lulesh.cc CalcVolumeForceForElems CalcVolumeForceForElems(domain) [lulesh.cc:409]
|
||||
366 ../examples/lulesh/lulesh.cc Domain::AllocateGradients Domain::AllocateGradients(Domain *, Int_t, Int_t) [lulesh.cc:214]
|
||||
475 ../examples/lulesh/lulesh.cc Domain::DeallocateGradients Domain::DeallocateGradients(Domain *) [lulesh.cc:105]
|
||||
250 ../examples/lulesh/lulesh.cc Domain::DeallocateStrains Domain::DeallocateStrains(Domain *) [lulesh.cc:105]
|
||||
4356 ../examples/lulesh/lulesh.cc Domain::Domain Domain::Domain(Domain *) [lulesh.cc:78]
|
||||
15 ../examples/lulesh/lulesh.cc Domain::delv_eta Domain::delv_eta(const Domain *, const Index_t) [lulesh.cc:371]
|
||||
15 ../examples/lulesh/lulesh.cc Domain::delv_xi Domain::delv_xi(const Domain *, const Index_t) [lulesh.cc:368]
|
||||
15 ../examples/lulesh/lulesh.cc Domain::delv_zeta Domain::delv_zeta(const Domain *, const Index_t) [lulesh.cc:374]
|
||||
15 ../examples/lulesh/lulesh.cc Domain::fx Domain::fx(const Domain *, const Index_t) [lulesh.cc:303]
|
||||
15 ../examples/lulesh/lulesh.cc Domain::fy Domain::fy(const Domain *, const Index_t) [lulesh.cc:306]
|
||||
15 ../examples/lulesh/lulesh.cc Domain::fz Domain::fz(const Domain *, const Index_t) [lulesh.cc:309]
|
||||
15 ../examples/lulesh/lulesh.cc Domain::nodalMass Domain::nodalMass(const Domain *, const Index_t) [lulesh.cc:314]
|
||||
15 ../examples/lulesh/lulesh.cc Domain::x Domain::x(const Domain *, const Index_t) [lulesh.cc:257]
|
||||
15 ../examples/lulesh/lulesh.cc Domain::xd Domain::xd(const Domain *, const Index_t) [lulesh.cc:272]
|
||||
15 ../examples/lulesh/lulesh.cc Domain::y Domain::y(const Domain *, const Index_t) [lulesh.cc:258]
|
||||
15 ../examples/lulesh/lulesh.cc Domain::yd Domain::yd(const Domain *, const Index_t) [lulesh.cc:275]
|
||||
15 ../examples/lulesh/lulesh.cc Domain::z Domain::z(const Domain *, const Index_t) [lulesh.cc:259]
|
||||
15 ../examples/lulesh/lulesh.cc Domain::zd Domain::zd(const Domain *, const Index_t) [lulesh.cc:278]
|
||||
330 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<doubl... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<doubl...
|
||||
330 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<doubl... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<doubl...
|
||||
1508 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelFor<CalcEnergyForElems(double*, double*, double*, doubl... type Kokkos::Impl::ParallelFor<CalcEnergyForElems(double*, double*, double*, ...
|
||||
3606 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelFor<CalcFBHourglassForceForElems(Domain&, double*, Kokk... type Kokkos::Impl::ParallelFor<CalcFBHourglassForceForElems(Domain&, double*,...
|
||||
2917 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelFor<CalcKinematicsForElems(Domain&, double, int)::$_0, ... type Kokkos::Impl::ParallelFor<CalcKinematicsForElems(Domain&, double, int)::...
|
||||
3119 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelFor<CalcMonotonicQGradientsForElems(Domain&)::{lambda(i... type Kokkos::Impl::ParallelFor<CalcMonotonicQGradientsForElems(Domain&)::{lam...
|
||||
1969 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelFor<CalcMonotonicQRegionForElems(Domain&, int, double):... type Kokkos::Impl::ParallelFor<CalcMonotonicQRegionForElems(Domain&, int, dou...
|
||||
1265 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelFor<IntegrateStressForElems(Domain&, double*, double*, ... type Kokkos::Impl::ParallelFor<IntegrateStressForElems(Domain&, double*, doub...
|
||||
49 ../examples/lulesh/lulesh.cc Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewVal... Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewVal...
|
||||
1497 ../examples/lulesh/lulesh.cc Kokkos::Impl::TeamPolicyInternal<Kokkos::OpenMP>::TeamPolicyInternal Kokkos::Impl::TeamPolicyInternal<Kokkos::OpenMP>::TeamPolicyInternal(TeamPoli...
|
||||
603 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewCopy<Kokkos::View<double*, Kokkos::LayoutLeft, Kokkos::Devi... Kokkos::Impl::ViewCopy<Kokkos::View<double*, Kokkos::LayoutLeft, Kokkos::Devi...
|
||||
604 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewCopy<Kokkos::View<double*, Kokkos::LayoutLeft, Kokkos::Devi... Kokkos::Impl::ViewCopy<Kokkos::View<double*, Kokkos::LayoutLeft, Kokkos::Devi...
|
||||
281 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
|
||||
281 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
|
||||
521 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<double*>, void>::allocate_shared... SharedAllocationRecord<void, void> * Kokkos::Impl::ViewMapping<Kokkos::ViewTr...
|
||||
331 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewRemap<Kokkos::View<double*>, Kokkos::View<double*>, Kokkos:... Kokkos::Impl::ViewRemap<Kokkos::View<double*>, Kokkos::View<double*>, Kokkos:...
|
||||
461 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpa... enable_if_t<std::is_trivial<double>::value && std::is_trivially_copy_assignab...
|
||||
1609 ../examples/lulesh/lulesh.cc Kokkos::Impl::runtime_check_rank_host Kokkos::Impl::runtime_check_rank_host(const size_t, const bool, const size_t,...
|
||||
697 ../examples/lulesh/lulesh.cc Kokkos::Impl::view_copy<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::De... Kokkos::Impl::view_copy<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::De...
|
||||
697 ../examples/lulesh/lulesh.cc Kokkos::Impl::view_copy<Kokkos::View<double*>, Kokkos::View<double*> > Kokkos::Impl::view_copy<Kokkos::View<double*>, Kokkos::View<double*> >(dst, s...
|
||||
2250 ../examples/lulesh/lulesh.cc Kokkos::RangePolicy<Kokkos::OpenMP>::RangePolicy Kokkos::RangePolicy<Kokkos::OpenMP>::RangePolicy(RangePolicy<Kokkos::OpenMP> ...
|
||||
213 ../examples/lulesh/lulesh.cc Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor... Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor...
|
||||
410 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::View<char [6]> Kokkos::View<double*>::View<char [6]>(View<double *> *, arg_label, type, cons...
|
||||
410 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::View<char [7]> Kokkos::View<double*>::View<char [7]>(View<double *> *, arg_label, type, cons...
|
||||
462 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::View<std::__cxx11::basic_string<char, std::char_traits... Kokkos::View<double*>::View<std::__cxx11::basic_string<char, std::char_traits...
|
||||
323 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::View<std::__cxx11::basic_string<char, std::char_traits... Kokkos::View<double*>::View<std::__cxx11::basic_string<char, std::char_traits...
|
||||
25 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::~View Kokkos::View<double*>::~View(View<double *> *) [lulesh.cc:409]
|
||||
840 ../examples/lulesh/lulesh.cc Kokkos::abort Kokkos::abort(const const char *, const const char *) [lulesh.cc:202]
|
||||
854 ../examples/lulesh/lulesh.cc Kokkos::impl_resize<, double*> type Kokkos::impl_resize<, double*>(v, const size_t, const size_t, const size...
|
||||
928 ../examples/lulesh/lulesh.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
|
||||
960 ../examples/lulesh/lulesh.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
|
||||
21470 ../examples/lulesh/lulesh.cc LagrangeLeapFrog LagrangeLeapFrog(domain) [lulesh.cc]
|
||||
226 ../examples/lulesh/lulesh.cc ResizeBuffer ResizeBuffer(const size_t) [lulesh.cc:23]
|
||||
169 ../examples/lulesh/lulesh.cc _GLOBAL__sub_I_lulesh.cc _GLOBAL__sub_I_lulesh.cc() [lulesh.cc]
|
||||
1836 ../examples/lulesh/lulesh.cc main int main(int, char * *) [lulesh.cc]
|
||||
63 ../examples/lulesh/lulesh.cc std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::a... std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::a...
|
||||
20 ../examples/lulesh/lulesh.cc std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::alloca... std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::alloca...
|
||||
160 ../examples/lulesh/lulesh.cc std::operator+<char, std::char_traits<char>, std::allocator<char> > basic_string<char, std::char_traits<char>, std::allocator<char> > std::operat...
|
||||
187 ../examples/lulesh/lulesh.cc std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::alloc... std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::alloc...
|
||||
11 lulesh __clang_call_terminate __clang_call_terminate() [lulesh]
|
||||
33 lulesh __do_global_dtors_aux __do_global_dtors_aux() [lulesh]
|
||||
5 lulesh __libc_csu_fini __libc_csu_fini() [lulesh]
|
||||
101 lulesh __libc_csu_init __libc_csu_init() [lulesh]
|
||||
5 lulesh _dl_relocate_static_pie _dl_relocate_static_pie() [lulesh]
|
||||
13 lulesh _fini _fini() [lulesh]
|
||||
27 lulesh _init _init() [lulesh]
|
||||
47 lulesh _start _start() [lulesh]
|
||||
6 lulesh frame_dummy frame_dummy() [lulesh]
|
||||
|
||||
An example of instrumented module and function info output
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-instrument -o lulesh.inst --label file line args --simulate -- lulesh
|
||||
|
||||
After the heuristics are applied based on the pattern in :ref:`available-module-function-output`,
|
||||
the selected module and functions are:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
AddressRange Module Function FunctionSignature
|
||||
9165 ../examples/lulesh/lulesh-comm.cc CommMonoQ CommMonoQ(domain) [lulesh-comm.cc:1891]
|
||||
3396 ../examples/lulesh/lulesh-comm.cc CommRecv CommRecv(domain, int, Index_t, Index_t, Index_t, Index_t, bool, bool) [lulesh...
|
||||
8666 ../examples/lulesh/lulesh-comm.cc CommSBN CommSBN(domain, int, Domain_member *) [lulesh-comm.cc:926]
|
||||
10212 ../examples/lulesh/lulesh-comm.cc CommSend CommSend(domain, int, Index_t, Domain_member *, Index_t, Index_t, Index_t, bo...
|
||||
6823 ../examples/lulesh/lulesh-comm.cc CommSyncPosVel CommSyncPosVel(domain) [lulesh-comm.cc:1404]
|
||||
1840 ../examples/lulesh/lulesh-init.cc Domain::AllocateElemPersistent Domain::AllocateElemPersistent(Domain *, Int_t) [lulesh-init.cc:94]
|
||||
1384 ../examples/lulesh/lulesh-init.cc Domain::AllocateNodePersistent Domain::AllocateNodePersistent(Domain *, Int_t) [lulesh-init.cc:94]
|
||||
1264 ../examples/lulesh/lulesh-init.cc Domain::BuildMesh Domain::BuildMesh(Domain *, Int_t, Int_t, Int_t) [lulesh-init.cc:308]
|
||||
2312 ../examples/lulesh/lulesh-init.cc Domain::CreateRegionIndexSets Domain::CreateRegionIndexSets(Domain *, Int_t, Int_t) [lulesh-init.cc:409]
|
||||
7109 ../examples/lulesh/lulesh-init.cc Domain::Domain Domain::Domain(Domain *, Int_t, Index_t, Index_t, Index_t, Index_t, int, int,...
|
||||
2458 ../examples/lulesh/lulesh-init.cc Domain::SetupBoundaryConditions Domain::SetupBoundaryConditions(Domain *, Int_t) [lulesh-init.cc:409]
|
||||
956 ../examples/lulesh/lulesh-init.cc Domain::SetupCommBuffers Domain::SetupCommBuffers(Domain *, Int_t) [lulesh-init.cc]
|
||||
1456 ../examples/lulesh/lulesh-init.cc Domain::SetupElementConnectivities Domain::SetupElementConnectivities(Domain *, Int_t) [lulesh-init.cc:409]
|
||||
721 ../examples/lulesh/lulesh-init.cc Domain::SetupSymmetryPlanes Domain::SetupSymmetryPlanes(Domain *, Int_t) [lulesh-init.cc:409]
|
||||
1591 ../examples/lulesh/lulesh-init.cc Domain::SetupThreadSupportStructures Domain::SetupThreadSupportStructures(Domain *) [lulesh-init.cc:376]
|
||||
1644 ../examples/lulesh/lulesh-init.cc Domain::~Domain Domain::~Domain(Domain *) [lulesh-init.cc:286]
|
||||
271 ../examples/lulesh/lulesh-init.cc Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor... Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor...
|
||||
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*, Kokkos::HostSpace>::View<char [10]> Kokkos::View<int*, Kokkos::HostSpace>::View<char [10]>(View<int *, Kokkos::Ho...
|
||||
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*, Kokkos::HostSpace>::View<char [14]> Kokkos::View<int*, Kokkos::HostSpace>::View<char [14]>(View<int *, Kokkos::Ho...
|
||||
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<char [16]> Kokkos::View<int*>::View<char [16]>(View<int *> *, arg_label, type, const siz...
|
||||
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<char [19]> Kokkos::View<int*>::View<char [19]>(View<int *> *, arg_label, type, const siz...
|
||||
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<char [21]> Kokkos::View<int*>::View<char [21]>(View<int *> *, arg_label, type, const siz...
|
||||
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<double*, , double*, Kokkos::LayoutRight, Kokkos::Device<Kok... Kokkos::deep_copy<double*, , double*, Kokkos::LayoutRight, Kokkos::Device<Kok...
|
||||
1052 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<double*> Kokkos::deep_copy<double*>(dst, value) [lulesh-init.cc]
|
||||
1050 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<double, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP,... Kokkos::deep_copy<double, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP,...
|
||||
7686 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenM... Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenM...
|
||||
7686 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, int* [8], Kokkos::LayoutRigh... Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, int* [8], Kokkos::LayoutRigh...
|
||||
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int*, , int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::O... Kokkos::deep_copy<int*, , int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::O...
|
||||
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Ko... Kokkos::deep_copy<int*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Ko...
|
||||
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP, K... Kokkos::deep_copy<int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP, K...
|
||||
697 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (... Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (...
|
||||
706 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (... Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (...
|
||||
912 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
|
||||
791 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
|
||||
791 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
|
||||
944 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
|
||||
839 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
|
||||
6589 ../examples/lulesh/lulesh-util.cc Kokkos::deep_copy<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP... Kokkos::deep_copy<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP...
|
||||
1345 ../examples/lulesh/lulesh-util.cc ParseCommandLineOptions ParseCommandLineOptions(int, char * *, int, cmdLineOpts *) [lulesh-util.cc:67]
|
||||
706 ../examples/lulesh/lulesh-util.cc VerifyAndWriteFinalOutput VerifyAndWriteFinalOutput(Real_t, locDom, Int_t, Int_t) [lulesh-util.cc:222]
|
||||
13367 ../examples/lulesh/lulesh.cc ApplyMaterialPropertiesForElems ApplyMaterialPropertiesForElems(domain) [lulesh.cc:409]
|
||||
982 ../examples/lulesh/lulesh.cc CalcElemFBHourglassForce CalcElemFBHourglassForce(const Real_t *, const Real_t[] *, coefficient, Real_...
|
||||
2428 ../examples/lulesh/lulesh.cc CalcElemNodeNormals CalcElemNodeNormals(Real_t *, Real_t *, Real_t *, const Real_t *, const Real_...
|
||||
853 ../examples/lulesh/lulesh.cc CalcElemShapeFunctionDerivatives CalcElemShapeFunctionDerivatives(const Real_t *, const Real_t *, const Real_t...
|
||||
1054 ../examples/lulesh/lulesh.cc CalcKinematicsForElems CalcKinematicsForElems(domain, Real_t, Index_t) [lulesh.cc]
|
||||
14160 ../examples/lulesh/lulesh.cc CalcVolumeForceForElems CalcVolumeForceForElems(domain) [lulesh.cc:409]
|
||||
366 ../examples/lulesh/lulesh.cc Domain::AllocateGradients Domain::AllocateGradients(Domain *, Int_t, Int_t) [lulesh.cc:214]
|
||||
475 ../examples/lulesh/lulesh.cc Domain::DeallocateGradients Domain::DeallocateGradients(Domain *) [lulesh.cc:105]
|
||||
4356 ../examples/lulesh/lulesh.cc Domain::Domain Domain::Domain(Domain *) [lulesh.cc:78]
|
||||
410 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::View<char [6]> Kokkos::View<double*>::View<char [6]>(View<double *> *, arg_label, type, cons...
|
||||
410 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::View<char [7]> Kokkos::View<double*>::View<char [7]>(View<double *> *, arg_label, type, cons...
|
||||
928 ../examples/lulesh/lulesh.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
|
||||
960 ../examples/lulesh/lulesh.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
|
||||
21470 ../examples/lulesh/lulesh.cc LagrangeLeapFrog LagrangeLeapFrog(domain) [lulesh.cc]
|
||||
1836 ../examples/lulesh/lulesh.cc main int main(int, char * *) [lulesh.cc]
|
||||
|
||||
Sampling
|
||||
========================================
|
||||
|
||||
.. note::
|
||||
|
||||
This capability has been deprecated in favor of :doc:`Call stack sampling <./sampling-call-stack>`.
|
||||
|
||||
By default, ``omnitrace-instrument`` uses ``--mode trace`` for instrumentation. The ``--mode sampling`` option
|
||||
only instruments ``main`` in an executable. It activates both CPU call-stack sampling and
|
||||
background system-level thread sampling by default.
|
||||
Tracing capabilities which do not rely on instrumentation, such as the HIP API and kernel tracing
|
||||
(which is collected by roctracer), are still available.
|
||||
|
||||
The Omnitrace sampling capabilities are always available, even in trace mode, but are deactivated by default.
|
||||
To activate sampling in trace mode, set ``OMNITRACE_USE_SAMPLING=ON`` in the environment
|
||||
or in an Omnitrace configuration file.
|
||||
|
||||
Embedding a default configuration
|
||||
========================================
|
||||
|
||||
Use the ``--env`` option to embed a default configuration into the target. Although this option
|
||||
works for runtime instrumentation, it is most useful when generating new binaries because the generated
|
||||
binary can be used later on in a different login session when the environment might have changed.
|
||||
|
||||
For example, if the following commands are run,
the configuration settings are not preserved for subsequent sessions:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-instrument -o ./foo.inst -- ./foo
|
||||
export OMNITRACE_USE_SAMPLING=ON
|
||||
export OMNITRACE_SAMPLING_FREQ=5
|
||||
omnitrace-run -- ./foo.inst
|
||||
|
||||
Whereas the following command preserves those environment variables:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-instrument -o ./foo.samp --env OMNITRACE_USE_SAMPLING=ON OMNITRACE_SAMPLING_FREQ=5 -- ./foo
|
||||
|
||||
They can now be used in future sessions.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
# will sample 5x per second
|
||||
omnitrace-run -- ./foo.samp
|
||||
|
||||
Even though the environment variables are preserved, subsequent sessions can still override those defaults:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
# will sample 100x per second
|
||||
export OMNITRACE_SAMPLING_FREQ=100
|
||||
omnitrace-run -- ./foo.samp
|
||||
|
||||
.. _rpath-troubleshooting:
|
||||
|
||||
Troubleshooting
|
||||
----------------------------------------------
|
||||
|
||||
Checking for RPATH
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
If ``ldd ./foo.inst`` from the :ref:`binary-rewriting-library-label`
|
||||
section still returns ``/usr/local/lib/libfoo.so.2``, the executable could have
|
||||
an rpath encoded in the binary.
|
||||
This ELF entry results in the dynamic linker ignoring ``LD_LIBRARY_PATH`` if
|
||||
it finds ``libfoo.so.2`` in the rpath.
|
||||
Using the ``objdump`` tool, perform the following query:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
objdump -p <exe-or-library> | grep -E 'RPATH|RUNPATH'
|
||||
|
||||
If this produces output similar to the following:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
RUNPATH $ORIGIN:$ORIGIN/../lib
|
||||
|
||||
Remove or modify the rpath to get ``foo.inst`` to resolve
|
||||
to the instrumented ``libfoo.so.2`` as explained in the next section.
|
||||
|
||||
Modifying an RPATH
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
The code snippet below uses the ``patchelf`` tool to change the rpath of the given executable
or library to ``/home/user``, the directory where the instrumented libraries are located.
|
||||
|
||||
.. note::
|
||||
|
||||
This functionality requires the ``patchelf`` package.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
patchelf --remove-rpath <exe-or-library>
|
||||
patchelf --set-rpath '/home/user' <exe-or-library>
|
||||
@@ -0,0 +1,630 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
****************************************************
|
||||
Performing causal profiling
|
||||
****************************************************
|
||||
|
||||
The process of causal profiling can be summarized as:
|
||||
|
||||
*If you speed up a given block of code by X%, the application will run Y% faster*.
|
||||
|
||||
Causal profiling directs parallel application developers to where they should focus their optimization
|
||||
efforts by quantifying the potential impact of optimizations. Causal profiling is rooted in the concept
|
||||
that *software execution speed is relative*. Speeding up a block of code by X% is mathematically equivalent
|
||||
to that block of code running at its current speed if all the other code is running slower by X%.
|
||||
Thus, causal profiling works by performing experiments on blocks of code during program execution which
|
||||
insert pauses to slow down all other concurrently running code. During post-processing, these experiments
|
||||
are translated into calculations for the potential impact of speeding up this block of code.
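
The equivalence described above can be checked with a small arithmetic model. The following
sketch is illustrative only and is not the Omnitrace implementation. It assumes two threads that
run fully concurrently, and it assumes that the pauses inserted during an experiment are
subtracted from the measured run time during post-processing. The function names
``real_runtime`` and ``virtual_runtime`` are hypothetical.

.. code-block:: cpp

   #include <algorithm>

   // Simplified two-thread model (assumption: both threads run fully concurrently).
   // Times are in arbitrary integer units; speedup_pct is the speed-up X in percent.

   // Run time if the selected thread really ran X% faster.
   constexpr long real_runtime(long selected, long other, long speedup_pct)
   {
       const long saved = selected * speedup_pct / 100;
       return std::max(selected - saved, other);
   }

   // Run time of the equivalent experiment: the selected thread keeps its speed,
   // the other thread is paused for X% of the selected thread's run time, and
   // (assumption) the inserted pause is subtracted again afterwards.
   constexpr long virtual_runtime(long selected, long other, long speedup_pct)
   {
       const long pause = selected * speedup_pct / 100;
       return std::max(selected, other + pause) - pause;
   }

   // Both views agree whether the selected thread stays the bottleneck (20%) or not (50%).
   static_assert(real_runtime(1000, 700, 20) == virtual_runtime(1000, 700, 20), "");
   static_assert(real_runtime(1000, 700, 50) == virtual_runtime(1000, 700, 50), "");

In this model, a 20% speed-up of a 1000-unit thread running next to a 700-unit thread gives
800 units in both views, and a 50% speed-up gives 700 units in both views because the other
thread becomes the bottleneck.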
|
||||
|
||||
.. note::
|
||||
|
||||
Causal profiling supersedes the original critical trace feature, which was removed in Omnitrace v1.11.0.
|
||||
|
||||
Consider the following C++ code executing ``foo`` and ``bar`` concurrently in two different threads
|
||||
where ``foo`` is ideally 30% faster than ``bar``:
|
||||
|
||||
.. code-block:: cpp

   #include <cstddef>
   #include <thread>

   // foo performs 30% fewer loop iterations than bar
   constexpr size_t FOO_N = 7 * 1000000000UL;
   constexpr size_t BAR_N = 10 * 1000000000UL;

   void foo()
   {
       for(volatile size_t i = 0; i < FOO_N; ++i) {}
   }

   void bar()
   {
       for(volatile size_t i = 0; i < BAR_N; ++i) {}
   }

   int main()
   {
       // run foo and bar concurrently, each in its own thread
       std::thread _threads[] = { std::thread{ foo },
                                  std::thread{ bar } };

       for(auto& itr : _threads)
           itr.join();
   }
|
||||
|
||||
No matter how many optimizations are applied to ``foo``, the application always requires
the same amount of time because the end-to-end performance is limited by ``bar``.
However, a 5% speed-up in ``bar`` results in the end-to-end performance improving by 5%.
This trend continues linearly, with a 10% speed-up in ``bar`` yielding a 10% speed-up in
end-to-end performance, and so on, up to a 30% speed-up, at which point ``bar`` runs as fast as ``foo``.
Any speed-up to ``bar`` beyond 30% still only yields an end-to-end performance improvement
of 30% because the application is now limited by the performance of ``foo``, as demonstrated
below in the causal profiling visualization:
|
||||
|
||||
.. image:: ../data/causal-foobar.png
|
||||
:alt: Visualization of the performance improvements for two functions with causal profiling
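
The shape of this curve can be reproduced with a few lines of arithmetic. The sketch below is
a simplified model, not measured data. It assumes that the run time of each function is
proportional to its iteration count (70 units for ``foo``, 100 units for ``bar``) and that both
threads run fully concurrently. The helper names ``faster`` and ``improvement_pct`` are hypothetical.

.. code-block:: cpp

   #include <algorithm>

   constexpr int foo_time = 70;   // relative run time of foo (7e9 iterations)
   constexpr int bar_time = 100;  // relative run time of bar (10e9 iterations)

   // Run time of a function after it is sped up by pct percent.
   constexpr int faster(int time, int pct) { return time * (100 - pct) / 100; }

   // Percent improvement in end-to-end run time for the given per-thread run times.
   // The application finishes when the slower of the two threads finishes.
   constexpr int improvement_pct(int foo, int bar)
   {
       return (std::max(foo_time, bar_time) - std::max(foo, bar)) * 100 /
              std::max(foo_time, bar_time);
   }

   // Speeding up foo does not help: bar remains the bottleneck.
   static_assert(improvement_pct(faster(foo_time, 50), bar_time) == 0, "");
   // Speeding up bar helps linearly until it is as fast as foo.
   static_assert(improvement_pct(foo_time, faster(bar_time, 5)) == 5, "");
   static_assert(improvement_pct(foo_time, faster(bar_time, 30)) == 30, "");
   // Beyond 30%, foo limits the application and the improvement saturates.
   static_assert(improvement_pct(foo_time, faster(bar_time, 60)) == 30, "");

The compile-time checks mirror the description above: no gain from optimizing ``foo``, a linear
gain from optimizing ``bar`` up to 30%, and no further gain beyond that point.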
|
||||
|
||||
The full details of the causal profiling methodology can be found in the paper
|
||||
`Coz: Finding Code that Counts with Causal Profiling <http://arxiv.org/pdf/1608.03676v1.pdf>`_.
|
||||
The authors' implementation is publicly available on `GitHub <https://github.com/plasma-umass/coz>`_.
|
||||
|
||||
Getting started
|
||||
========================================
|
||||
|
||||
To effectively use causal profiling, it is important to understand a few key
|
||||
concepts, such as progress points.
|
||||
|
||||
Progress points
|
||||
-----------------------------------
|
||||
|
||||
Causal profiling requires "progress points" to track progress through the code
|
||||
in between samples. Progress points must be triggered in a deterministic manner via instrumentation.
|
||||
This can happen in three different ways:
|
||||
|
||||
* `Omnitrace <https://github.com/ROCm/omnitrace>`_ can leverage the callbacks from tools such as
  Kokkos-Tools, OpenMP-Tools, and roctracer, and the wrappers around functions for libraries
  such as MPI, NUMA, and RCCL, to act as progress points.
* Users can leverage the :doc:`runtime instrumentation capabilities <./instrumenting-rewriting-binary-application>`
  to insert progress points.
* Users can leverage :doc:`User APIs <../how-to/using-omnitrace-api>`,
  such as ``OMNITRACE_CAUSAL_PROGRESS``.
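
As a rough illustration of the third option, the sketch below marks one progress point per
processed item. It rests on assumptions: the header name ``omnitrace/causal.h`` is assumed and
should be checked against the User API documentation, and the ``process_items`` function is
hypothetical. When the header is not found, the macro falls back to a no-op so that the example
still compiles without Omnitrace installed.

.. code-block:: cpp

   // Assumption: the user API header is available as <omnitrace/causal.h>.
   #if defined(__has_include)
   #    if __has_include(<omnitrace/causal.h>)
   #        include <omnitrace/causal.h>
   #    endif
   #endif

   // Fallback so the sketch builds without Omnitrace: the progress point does nothing.
   #ifndef OMNITRACE_CAUSAL_PROGRESS
   #    define OMNITRACE_CAUSAL_PROGRESS
   #endif

   #include <vector>

   // Hypothetical work loop: one unit of progress is one processed item.
   inline double process_items(const std::vector<double>& items)
   {
       double sum = 0.0;
       for(double value : items)
       {
           sum += value * value;
           // Triggered deterministically once per completed item.
           OMNITRACE_CAUSAL_PROGRESS;
       }
       return sum;
   }

Placing the progress point at the end of each loop iteration ties it to a unit of useful work,
so the rate at which it is reached reflects the throughput of the loop.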
|
||||
|
||||
.. note::
|
||||
|
||||
Binary rewrite to insert progress points is not supported. When a rewritten binary
|
||||
runs, Dyninst translates the instruction pointer address in order to perform
|
||||
the instrumentation. As a result, call stack samples never return instruction
|
||||
pointer addresses within the valid Omnitrace range.
|
||||
|
||||
Key concepts
|
||||
-----------------------------------
|
||||
|
||||
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
| Concept          | Setting                             | Options                          | Description                                |
+==================+=====================================+==================================+============================================+
| Backend          | ``OMNITRACE_CAUSAL_BACKEND``        | ``perf``, ``timer``              | Backend for recording samples required     |
|                  |                                     |                                  | to calculate the virtual speed-up          |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
| Mode             | ``OMNITRACE_CAUSAL_MODE``           | ``function``, ``line``           | Select an entire function or individual    |
|                  |                                     |                                  | line of code for causal experiments        |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
| End-to-end       | ``OMNITRACE_CAUSAL_END_TO_END``     | Boolean                          | Perform a single experiment during the     |
|                  |                                     |                                  | entire run (does not require               |
|                  |                                     |                                  | progress points)                           |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
| Fixed speed-up   | ``OMNITRACE_CAUSAL_FIXED_SPEEDUP``  | one or more values from [0, 100] | Virtual speed-up or pool of virtual        |
|                  |                                     |                                  | speed-ups to randomly select               |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
| Binary scope     | ``OMNITRACE_CAUSAL_BINARY_SCOPE``   | regular expression(s)            | Dynamic binaries containing code for       |
|                  |                                     |                                  | experiments                                |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
| Source scope     | ``OMNITRACE_CAUSAL_SOURCE_SCOPE``   | regular expression(s)            | ``<file>`` and/or ``<file>:<line>``        |
|                  |                                     |                                  | containing code to include in experiments  |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
| Function scope   | ``OMNITRACE_CAUSAL_FUNCTION_SCOPE`` | regular expression(s)            | Restricts experiments to matching          |
|                  |                                     |                                  | functions (function mode) or lines of      |
|                  |                                     |                                  | code within matching functions (line mode) |
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
|
||||
|
||||
.. note::
|
||||
|
||||
* Binary scope defaults to ``%MAIN%`` (in the executable), but the scope can be expanded to include linked libraries.
|
||||
* ``<file>`` and ``<file>:<line>`` support requires debug info (for example, the code must be compiled with ``-g`` or, preferably, with ``-g3``)
|
||||
* Function mode does not require debug info but does not support stripped binaries
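The settings in the table above are ordinary environment variables, so a launch script can assemble them before starting the target. The following is a minimal sketch, not part of Omnitrace: the helper name ``causal_env`` is hypothetical, and spelling the Boolean as ``ON``/``OFF`` is an assumption.

```python
import os


def causal_env(mode="function", backend="perf", end_to_end=False, base=None):
    """Return a copy of the environment with causal profiling settings applied.

    Hypothetical helper for illustration; only the variable names come from
    the table above.
    """
    if mode not in ("function", "line"):
        raise ValueError(f"unsupported mode: {mode!r}")
    if backend not in ("perf", "timer", "auto"):
        raise ValueError(f"unsupported backend: {backend!r}")

    # Start from the given mapping, or from the current process environment.
    env = dict(os.environ if base is None else base)
    env["OMNITRACE_CAUSAL_MODE"] = mode
    env["OMNITRACE_CAUSAL_BACKEND"] = backend
    # Assumption: the Boolean setting is written as ON/OFF.
    env["OMNITRACE_CAUSAL_END_TO_END"] = "ON" if end_to_end else "OFF"
    return env
```

The returned mapping can be passed as the ``env`` argument of ``subprocess.run`` when launching the instrumented application.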
|
||||
|
||||
Backends
|
||||
-----------------------------------
|
||||
|
||||
There are two backends to choose from: ``perf`` and ``timer``.
Both record the samples required to calculate the virtual speed-up, and
both interrupt each thread 1000 times per second (of CPU time) to apply the virtual speed-ups.
They differ in how the samples are recorded. There are three key differences:

* The ``perf`` backend requires Linux Perf and elevated security privileges.
* The ``perf`` backend interrupts the application less frequently, whereas the ``timer`` backend
  interrupts the application 1000 times per second of real time.
* The ``timer`` backend has less accurate call stacks due to instruction pointer skid.

In general, the ``perf`` backend is preferred over the ``timer`` backend when sufficient
security privileges permit its usage.
|
||||
If ``OMNITRACE_CAUSAL_BACKEND`` is set to ``auto``, Omnitrace falls back
|
||||
to using the ``timer`` backend only if
|
||||
the ``perf`` backend fails. If ``OMNITRACE_CAUSAL_BACKEND`` is
|
||||
set to ``perf`` and using this backend fails, Omnitrace aborts.
|
||||
|
||||
Instruction pointer skid
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Instruction pointer (IP) skid measures how many instructions run after the event of interest
|
||||
before the program actually stops. The IP skid is calculated by subtracting
|
||||
the location of the IP at the point of interest from the location of the IP
|
||||
when the kernel finally stops the application.
|
||||
For the ``timer`` backend, this translates to the
difference in the IP between when the timer generated a signal and when the
signal was actually handled. Although IP skid still occurs with the ``perf`` backend,
|
||||
it is much more pronounced with the ``timer`` backend due to the overhead of pausing the entire thread.
|
||||
This means the ``timer`` backend tends to have a lower resolution than the ``perf`` backend,
|
||||
especially in ``line`` mode.
|
||||
|
||||
Installing Linux Perf
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Linux Perf is built into the kernel and may already be installed
|
||||
(for instance, it is included in the default kernel for OpenSUSE).
|
||||
The official method of checking whether Linux Perf is installed is
|
||||
checking for the existence of the file
|
||||
``/proc/sys/kernel/perf_event_paranoid``. If the file exists, the kernel has Perf installed.
|
||||
|
||||
If this file does not exist, as with Debian-based systems like Ubuntu, run the following command as superuser:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
apt-get install linux-tools-common linux-tools-generic linux-tools-$(uname -r)
|
||||
|
||||
and reboot your computer. In order to use the ``perf`` backend, the value
|
||||
of ``/proc/sys/kernel/perf_event_paranoid``
|
||||
should be less than or equal to 2. If the value in this file is greater than 2, you can't
|
||||
use the ``perf`` backend.
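The check described above can be scripted. The sketch below only tests the documented condition (the file exists and its value is at most 2). It does not attempt to open a perf event, so it can't detect other restrictions. The function name ``perf_backend_usable`` is hypothetical.

```python
from pathlib import Path

PARANOID_FILE = Path("/proc/sys/kernel/perf_event_paranoid")


def perf_backend_usable(paranoid_file=PARANOID_FILE, max_level=2):
    """Return True if the paranoid file exists and its value is <= max_level."""
    try:
        level = int(Path(paranoid_file).read_text().strip())
    except (OSError, ValueError):
        # Missing file (Perf not installed) or unexpected contents.
        return False
    return level <= max_level


if __name__ == "__main__":
    # Suggest a backend value based on the documented condition only.
    print("perf" if perf_backend_usable() else "timer")
```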
|
||||
|
||||
To update the paranoid level temporarily until the system is rebooted, run
|
||||
one of the following commands
|
||||
as a superuser (where ``PARANOID_LEVEL=<N>`` has a value of ``<N>`` in the range ``[-1, 2]``):
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
echo ${PARANOID_LEVEL} | sudo tee /proc/sys/kernel/perf_event_paranoid
|
||||
|
||||
or
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
sysctl kernel.perf_event_paranoid=${PARANOID_LEVEL}
|
||||
|
||||
To make the paranoid level persistent after a reboot, add ``kernel.perf_event_paranoid=<N>``
|
||||
(where ``<N>`` is the desired paranoid level) to the ``/etc/sysctl.conf`` file.
|
||||
|
||||
Speed-up prediction variability and the omnitrace-causal executable
|
||||
-----------------------------------------------------------------------
|
||||
|
||||
Causal profiling typically requires running the application several times in
|
||||
order to adequately sample all the code domains, experiment
|
||||
with speed-ups and other techniques, and resolve statistical fluctuations.
|
||||
The ``omnitrace-causal`` executable is designed to simplify this procedure:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omnitrace-causal --help
|
||||
[omnitrace-causal] Usage: ./bin/omnitrace-causal [ --help (count: 0, dtype: bool)
|
||||
--version (count: 0, dtype: bool)
|
||||
--monochrome (max: 1, dtype: bool)
|
||||
--debug (max: 1, dtype: bool)
|
||||
--verbose (count: 1)
|
||||
--config (min: 0, dtype: filepath)
|
||||
--launcher (count: 1, dtype: executable)
|
||||
--generate-configs (min: 0, dtype: folder)
|
||||
--no-defaults (min: 0, dtype: bool)
|
||||
--mode (count: 1, dtype: string)
|
||||
--output-name (min: 1, dtype: filename)
|
||||
--reset (max: 1, dtype: bool)
|
||||
--end-to-end (max: 1, dtype: bool)
|
||||
--wait (count: 1, dtype: seconds)
|
||||
--duration (count: 1, dtype: seconds)
|
||||
--iterations (count: 1, dtype: int)
|
||||
--speedups (min: 0, dtype: integers)
|
||||
--binary-scope (min: 0, dtype: integers)
|
||||
--source-scope (min: 0, dtype: integers)
|
||||
--function-scope (min: 0, dtype: regex-list)
|
||||
--binary-exclude (min: 0, dtype: integers)
|
||||
--source-exclude (min: 0, dtype: integers)
|
||||
--function-exclude (min: 0, dtype: regex-list)
|
||||
]
|
||||
|
||||
Causal profiling usually requires multiple runs to reliably resolve the speedup estimates.
|
||||
This executable is designed to streamline that process.
|
||||
For example (assume all commands end with '-- <exe> <args>'):
|
||||
|
||||
omnitrace-causal -n 5 -- <exe> # runs <exe> 5x with causal profiling enabled
|
||||
|
||||
omnitrace-causal -s 0 5,10,15,20 # runs <exe> 2x with virtual speedups:
|
||||
# - 0
|
||||
# - randomly selected from 5, 10, 15, and 20
|
||||
|
||||
omnitrace-causal -F func_A func_B func_(A|B) # runs <exe> 3x with the function scope limited to:
|
||||
# 1. func_A
|
||||
# 2. func_B
|
||||
# 3. func_A or func_B
|
||||
General tips:
|
||||
- Insert progress points at hotspots in your code or use omnitrace's runtime instrumentation
|
||||
- Note: binary rewrite will produce a incompatible new binary
|
||||
- Run omnitrace-causal in "function" mode first (does not require debug info)
|
||||
- Run omnitrace-causal in "line" mode when you are targeting one function (requires debug info)
|
||||
- Preferably, use predictions from the "function" mode to determine which function to target
|
||||
- Limit the virtual speedups to a smaller pool, e.g., 0,5,10,25,50, to get reliable predictions quicker
|
||||
- Make use of the binary, source, and function scope to limit the functions/lines selected for experiments
|
||||
- Note: source scope requires debug info
|
||||
|
||||
|
||||
Options:
|
||||
-h, -?, --help Shows this page
|
||||
--version Prints the version and exit
|
||||
|
||||
[DEBUG OPTIONS]
|
||||
|
||||
--monochrome Disable colorized output
|
||||
--debug Debug output
|
||||
-v, --verbose Verbose output
|
||||
|
||||
[GENERAL OPTIONS]
|
||||
|
||||
-c, --config Base configuration file
|
||||
-l, --launcher When running MPI jobs, omnitrace-causal needs to be *before* the executable which launches the MPI processes (i.e.
|
||||
before `mpirun`, `srun`, etc.). Pass the name of the target executable (or a regex for matching to the name of the
|
||||
target) for causal profiling, e.g., `omnitrace-causal -l foo -- mpirun -n 4 foo`. This ensures that the omnitrace
|
||||
library is LD_PRELOADed on the proper target
|
||||
-g, --generate-configs Generate config files instead of passing environment variables directly. If no arguments are provided, the config files
|
||||
will be placed in ${PWD}/omnitrace-causal-config folder
|
||||
--no-defaults Do not activate default features which are recommended for causal profiling. For example: PID-tagging of output files
|
||||
and timestamped subdirectories are disabled by default. Kokkos tools support is added by default
|
||||
(OMNITRACE_USE_KOKKOSP=ON) because, for Kokkos applications, the Kokkos-Tools callbacks are used for progress points.
|
||||
Activation of OpenMP tools support is similar
|
||||
|
||||
[CAUSAL PROFILING OPTIONS (General)]
|
||||
(These settings will be applied to all causal profiling runs)
|
||||
|
||||
-m, --mode [ function (func) | line ]
|
||||
Causal profiling mode
|
||||
-o, --output-name Output filename of causal profiling data w/o extension
|
||||
-r, --reset Overwrite any existing experiment results during the first run
|
||||
-e, --end-to-end Single causal experiment for the entire application runtime
|
||||
-w, --wait Set the wait time (i.e. delay) before starting the first causal experiment (in seconds)
|
||||
-d, --duration Set the length of time (in seconds) to perform causal experimentationafter the first experiment is started. Once this
|
||||
amount of time has elapsed, no more causal experiments will be started but any currently running experiment will be
|
||||
allowed to finish.
|
||||
-n, --iterations Number of times to repeat the combination of run configurations
|
||||
|
||||
[CAUSAL PROFILING OPTIONS (Combinatorial)]
|
||||
(Each individual argument to these options will multiply the number runs by the number of arguments and the number of
|
||||
iterations. E.g. -n 2 -B "MAIN" -F "foo" "bar" will produce 4 runs: 2 iterations x 1 binary scope x 2 function scopes
|
||||
(MAIN+foo, MAIN+bar, MAIN+foo, MAIN+bar))
|
||||
|
||||
-s, --speedups Pool of virtual speedups to sample from during experimentation. Each space designates a group and multiple speedups can
|
||||
be grouped together by commas, e.g. -s 0 0,10,20-50 is two groups: group #1 is '0' and group #2 is '0 10 20 25 30 35 40
45 50'
|
||||
-B, --binary-scope Restricts causal experiments to the binaries matching the list of regular expressions. Each space designates a group
|
||||
and multiple scopes can be grouped together with a semi-colon
|
||||
-S, --source-scope Restricts causal experiments to the source files or source file + lineno pairs (i.e. <file> or <file>:<line>) matching
|
||||
the list of regular expressions. Each space designates a group and multiple scopes can be grouped together with a
|
||||
semi-colon
|
||||
-F, --function-scope Restricts causal experiments to the functions matching the list of regular expressions. Each space designates a group
|
||||
and multiple scopes can be grouped together with a semi-colon
|
||||
-BE, --binary-exclude Excludes causal experiments from being performed on the binaries matching the list of regular expressions. Each space
|
||||
designates a group and multiple excludes can be grouped together with a semi-colon
|
||||
-SE, --source-exclude Excludes causal experiments from being performed on the code from the source files or source file + lineno pair (i.e.
|
||||
<file> or <file>:<line>) matching the list of regular expressions. Each space designates a group and multiple excludes
|
||||
can be grouped together with a semi-colon
|
||||
-FE, --function-exclude Excludes causal experiments from being performed on the functions matching the list of regular expressions. Each space
|
||||
designates a group and multiple excludes can be grouped together with a semi-colon
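The grouping rule for ``--speedups`` in the help text above can be illustrated with a short sketch. This is not the parser used by ``omnitrace-causal``: the function name is hypothetical, and the step of 5 for ranges such as ``20-50`` is inferred from the single example in the help text.

```python
def parse_speedup_groups(args, step=5):
    """Expand space-separated speedup arguments into groups of integers.

    Each argument is one group; commas separate entries within a group.
    Assumption: a range "A-B" expands in increments of `step`, inclusive.
    """
    groups = []
    for arg in args:
        values = []
        for item in arg.split(","):
            if "-" in item:
                low, high = (int(part) for part in item.split("-"))
                values.extend(range(low, high + 1, step))
            else:
                values.append(int(item))
        groups.append(values)
    return groups
```

With the arguments from the help text, ``parse_speedup_groups(["0", "0,10,20-50"])`` yields two groups, which means the number of runs is multiplied by two.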
|
||||
|
||||
Examples
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
#!/bin/bash -e
|
||||
|
||||
module load omnitrace
|
||||
|
||||
N=20
|
||||
I=3
|
||||
|
||||
# when providing speedups to omnitrace-causal, speedup
|
||||
# groups are separated by a space so "0,10" results in
|
||||
# one speedup group where omnitrace samples from
|
||||
# the speedup set of {0, 10}. Passing "0 10" (without
|
||||
# quotes) to omnitrace-causal multiplies the
|
||||
# number of runs by 2, where the first half of the
|
||||
# runs instruct omnitrace to only use 0 as the
|
||||
# speedup and the second half of the runs instruct
|
||||
# omnitrace to only use 10 as the speedup.
|
||||
SPEEDUPS="0,0,0,10,20,30,40,50,50,75,75,75,90,90,90"
|
||||
# thus, -s ${SPEEDUPS} only multiplies the number
|
||||
# of runs by 1 whereas -s ${SPEEDUPS_E2E} multiplies
|
||||
# the number of runs by 15:
|
||||
# - 3 runs with speedup of 0
|
||||
# - 1 run for each of the speedups 10, 20, 30, and 40
|
||||
# - 2 runs with speedup of 50
|
||||
# - 3 runs with speedup of 75
|
||||
# - 3 runs with speedup of 90
|
||||
SPEEDUPS_E2E=$(echo "${SPEEDUPS}" | sed 's/,/ /g')
|
||||
|
||||
|
||||
# 20 iterations in function mode with 1 speedup group
|
||||
# and source scope set to .cpp files
|
||||
#
|
||||
# outputs to files:
|
||||
# - causal/experiments.func.coz
|
||||
# - causal/experiments.func.json
|
||||
#
|
||||
# total executions: 20
|
||||
#
|
||||
omnitrace-causal \
|
||||
-n ${N} \
|
||||
-s ${SPEEDUPS} \
|
||||
-m function \
|
||||
-o experiments.func \
|
||||
-S ".*\\.cpp" \
|
||||
-- \
|
||||
./causal-omni-cpu "${@}"
|
||||
|
||||
|
||||
# 20 iterations in line mode with 1 speedup group
|
||||
# and source scope restricted to lines 100 and 110
|
||||
# in the causal.cpp file.
|
||||
#
|
||||
# outputs to files:
|
||||
# - causal/experiments.line.coz
|
||||
# - causal/experiments.line.json
|
||||
#
|
||||
# total executions: 20
|
||||
#
|
||||
omnitrace-causal \
|
||||
-n ${N} \
|
||||
-s ${SPEEDUPS} \
|
||||
-m line \
|
||||
-o experiments.line \
|
||||
-S "causal\\.cpp:(100|110)" \
|
||||
-- \
|
||||
./causal-omni-cpu "${@}"
|
||||
|
||||
|
||||
# 3 iterations in function mode of 15 singular speedups
|
||||
# in end-to-end mode with 2 different function scopes
|
||||
# where one is restricted to "cpu_slow_func" and
|
||||
# another is restricted to "cpu_fast_func".
|
||||
#
|
||||
# outputs to files:
|
||||
# - causal/experiments.func.e2e.coz
|
||||
# - causal/experiments.func.e2e.json
|
||||
#
|
||||
# total executions: 90
|
||||
#
|
||||
omnitrace-causal \
|
||||
-n ${I} \
|
||||
-s ${SPEEDUPS_E2E} \
|
||||
-m func \
|
||||
-e \
|
||||
-o experiments.func.e2e \
|
||||
-F "cpu_slow_func" \
|
||||
"cpu_fast_func" \
|
||||
-- \
|
||||
./causal-omni-cpu "${@}"
|
||||
|
||||
# 3 iterations in line mode of 15 singular speedups
|
||||
# in end-to-end mode with 2 different source scopes
|
||||
# where one is restricted to line 100 in causal.cpp
|
||||
# and another is restricted to line 110 in causal.cpp.
|
||||
#
|
||||
# outputs to files:
|
||||
# - causal/experiments.line.e2e.coz
|
||||
# - causal/experiments.line.e2e.json
|
||||
#
|
||||
# total executions: 90
|
||||
#
|
||||
omnitrace-causal \
|
||||
-n ${I} \
|
||||
-s ${SPEEDUPS_E2E} \
|
||||
-m line \
|
||||
-e \
|
||||
-o experiments.line.e2e \
|
||||
-S "causal\\.cpp:100" \
|
||||
"causal\\.cpp:110" \
|
||||
-- \
|
||||
./causal-omni-cpu "${@}"
|
||||
|
||||
|
||||
export OMP_NUM_THREADS=8
|
||||
export OMP_PROC_BIND=spread
|
||||
export OMP_PLACES=threads
|
||||
|
||||
# set number of iterations to 5
|
||||
N=5
|
||||
|
||||
# 5 iterations in function mode of 1 speedup
|
||||
# group with the source scope restricted
|
||||
# to files containing "lulesh" in their filename
|
||||
# and exclude functions which start with "Kokkos::"
|
||||
# or "std::enable_if".
|
||||
#
|
||||
# outputs to files:
|
||||
# - causal/experiments.func.coz
|
||||
# - causal/experiments.func.json
|
||||
#
|
||||
# total executions: 5
|
||||
#
|
||||
# First of 5 executions overwrites any
|
||||
# existing causal/experiments.func.(coz|json)
|
||||
# file due to "--reset" argument
|
||||
#
|
||||
omnitrace-causal \
|
||||
--reset \
|
||||
-n ${N} \
|
||||
-s ${SPEEDUPS} \
|
||||
-m func \
|
||||
-o experiments.func \
|
||||
-S "lulesh.*" \
|
||||
-FE "^(Kokkos::|std::enable_if)" \
|
||||
-- \
|
||||
./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p
|
||||
|
||||
|
||||
# 5 iterations in line mode of 1 speedup
|
||||
# group with the source scope restricted
|
||||
# to files containing "lulesh" in their filename
|
||||
# and exclude functions which start with "exec_range"
|
||||
# or "execute" and which contain either
|
||||
# "construct_shared_allocation" or "._omp_fn." in
|
||||
# the function name.
|
||||
#
|
||||
# outputs to files:
|
||||
# - causal/experiments.line.coz
|
||||
# - causal/experiments.line.json
|
||||
#
|
||||
# total executions: 5
|
||||
#
|
||||
# First of 5 executions overwrites any
|
||||
# existing causal/experiments.line.(coz|json)
|
||||
# file due to "--reset" argument
|
||||
#
|
||||
omnitrace-causal \
|
||||
--reset \
|
||||
-n ${N} \
|
||||
-s ${SPEEDUPS} \
|
||||
-m line \
|
||||
-o experiments.line \
|
||||
-S "lulesh.*" \
|
||||
-FE "^(exec_range|execute);construct_shared_allocation;\\._omp_fn\\." \
|
||||
-- \
|
||||
./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p
|
||||
|
||||
|
||||
# 5 iterations in line mode of 1 speedup
|
||||
# group with the source scope restricted
|
||||
# to files whose basename is "lulesh.cc"
|
||||
# for 3 different functions:
|
||||
# - ApplyMaterialPropertiesForElems
|
||||
# - CalcHourglassControlForElems
|
||||
# - CalcVolumeForceForElems
|
||||
#
|
||||
# outputs to files:
|
||||
# - causal/experiments.line.targeted.coz
|
||||
# - causal/experiments.line.targeted.json
|
||||
#
|
||||
# total executions: 15
|
||||
#
|
||||
# First of 15 executions overwrites any
# existing causal/experiments.line.targeted.(coz|json)
|
||||
# file due to "--reset" argument
|
||||
#
|
||||
omnitrace-causal \
|
||||
--reset \
|
||||
-n ${N} \
|
||||
-s ${SPEEDUPS} \
|
||||
-m line \
|
||||
-o experiments.line.targeted \
|
||||
-F "ApplyMaterialPropertiesForElems" \
|
||||
"CalcHourglassControlForElems" \
|
||||
"CalcVolumeForceForElems" \
|
||||
-S "lulesh\\.cc" \
|
||||
-- \
|
||||
./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p
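The "total executions" figures in the script comments follow from the combinatorial rule in the help text: the iteration count multiplied by the number of groups passed to each combinatorial option. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
from math import prod


def total_runs(iterations, *option_groups):
    """Iterations times the number of groups given to each combinatorial option.

    An option that is not passed (an empty list) contributes a factor of 1.
    """
    return iterations * prod(max(1, len(groups)) for groups in option_groups)


# End-to-end example: 3 iterations x 15 speedup groups x 2 function scopes.
speedups_e2e = "0 0 0 10 20 30 40 50 50 75 75 75 90 90 90".split()
functions = ["cpu_slow_func", "cpu_fast_func"]
print(total_runs(3, speedups_e2e, functions))
```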
|
||||
|
||||
Using omnitrace-causal with other launchers like mpirun
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
The ``omnitrace-causal`` executable is intended to assist with application replay
|
||||
and is designed to always be at the start of the command line as the primary process.
|
||||
``omnitrace-causal`` typically adds an ``LD_PRELOAD`` of the Omnitrace libraries
|
||||
into the environment before launching the command to inject the functionality
|
||||
required to start the causal profiling tooling. However, this is problematic
|
||||
when the target application for causal profiling uses a launcher, in which case
|
||||
it is listed as an argument rather than as the main application. For example,
|
||||
``foo`` is the target application for profiling, but the command to run it is
|
||||
``mpirun -n 2 foo``. Running the command ``omnitrace-causal -- mpirun -n 2 foo``
|
||||
applies the causal profiling to ``mpirun`` instead of ``foo``.
|
||||
|
||||
``omnitrace-causal`` remedies this by providing a command-line option, ``-l`` / ``--launcher``,
|
||||
to indicate the target application is using a launcher script/executable. The
|
||||
argument to the command-line option is the name of, or regular expression for, the target application
|
||||
on the command line. When ``--launcher`` is used, ``omnitrace-causal`` generates
|
||||
all the replay configurations and runs them but delays adding the ``LD_PRELOAD``. Instead it
|
||||
inserts a call to itself into the command line right before the target
|
||||
application. This recursive call inherits the configuration from
|
||||
the parent ``omnitrace-causal`` executable, inserts an ``LD_PRELOAD`` into the environment,
|
||||
and calls ``execv`` to replace itself with the new process launched by the target
|
||||
application.
|
||||
|
||||
In other words, the following command:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-causal -l foo -n 3 -- mpirun -n 2 foo
|
||||
|
||||
Effectively results in:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
mpirun -n 2 omnitrace-causal -- foo
|
||||
mpirun -n 2 omnitrace-causal -- foo
|
||||
mpirun -n 2 omnitrace-causal -- foo
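The command-line rewrite described above can be sketched in a few lines. This only illustrates the idea and is not the actual implementation: the function name is hypothetical, and using a regex search against each argument after the launcher is an assumption about how the target is matched.

```python
import re


def wrap_target(command, target_regex, wrapper=("omnitrace-causal", "--")):
    """Insert `wrapper` immediately before the first argument matching the target.

    `command` is the launcher command line as a list, for example
    ["mpirun", "-n", "2", "foo"]. The launcher itself (index 0) is skipped.
    """
    command = list(command)
    pattern = re.compile(target_regex)
    for index, arg in enumerate(command):
        if index > 0 and pattern.search(arg):
            return command[:index] + list(wrapper) + command[index:]
    raise ValueError(f"no argument matches {target_regex!r}")
```

For example, ``wrap_target(["mpirun", "-n", "2", "foo"], "foo")`` returns the ``mpirun -n 2 omnitrace-causal -- foo`` form shown above.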
|
||||
|
||||
Visualizing the causal output
|
||||
-------------------------------------------------------------------------
|
||||
|
||||
Omnitrace generates ``causal/experiments.json`` and ``causal/experiments.coz`` in
|
||||
``${OMNITRACE_OUTPUT_PATH}/${OMNITRACE_OUTPUT_PREFIX}``. Visit
|
||||
`plasma-umass.org/coz <https://plasma-umass.org/coz/>`_ to open the ``*.coz`` file.
|
||||
|
||||
Omnitrace versus Coz
|
||||
=======================================
|
||||
|
||||
This comparison is intended for readers who are familiar with the
|
||||
`Coz profiler <https://github.com/plasma-umass/coz>`_.
|
||||
Omnitrace provides several additional features and utilities for causal profiling:
|
||||
|
||||
.. csv-table::
|
||||
:header: "Feature", "Coz", "Omnitrace", "Notes"
|
||||
:widths: 20, 60, 60, 30
|
||||
|
||||
"Debug info", "requires debug info in DWARF v3 format (``-gdwarf-3``)", "optional, supports any DWARF format version", "See Note #1 below"
|
||||
"Experiment selection", "``<file>:<line>``", "``<function>`` or ``<file>:<line>``", "See Note #2 below"
|
||||
"Experiment speed-ups", "Randomly samples between 0 and 100 in increments of 5 or one fixed speed-up", "Supports specifying smaller subset", "See Note #3 below"
|
||||
"Scope options", "Supports binary and source scopes", "Supports binary, source, and function scopes", "See Note #4, #5, and #6 below"
|
||||
"Scope inclusion", "Uses ``%`` as a wildcard for binary and source scopes", "Full regex support for binary, source, and function scopes", ""
|
||||
"Scope exclusion", "Not supported", "Supports regexes for excluding binary/source/function", "See Note #7 below"
|
||||
"Call-stack sampling", "Linux Perf", "Linux Perf, libunwind", "See Note #8 below"
|
||||
|
||||
.. note::
|
||||
|
||||
#. Omnitrace supports a "function" mode which does not require debug info.
|
||||
#. Omnitrace supports selecting an entire range of instruction pointers for a function instead
|
||||
of an instruction pointer for one line. In large code bases, "function" mode
|
||||
can resolve in fewer iterations. After a target function is identified, you can
|
||||
switch to line mode and limit the function scope to the target function.
|
||||
#. Omnitrace supports randomly sampling from subsets, for example, { 0, 0, 5, 10 },
   where 0% is selected 50% of the time and 5% and 10% are each selected 25% of the time.
|
||||
#. Omnitrace and COZ have the same definition for binary scope, which is the binaries
|
||||
loaded at runtime (the executable and linked libraries).
|
||||
#. Omnitrace "source scope" supports both ``<file>`` and ``<file>:<line>`` formats
|
||||
in contrast to the COZ "source scope" which requires ``<file>:<line>`` format.
|
||||
#. Omnitrace supports a "function" scope which narrows the function and lines
|
||||
which are eligible for causal experiments to those within the matching functions.
|
||||
#. Omnitrace supports a second filter on scopes for removing any binary, source, or function
   caught by an inclusive match. For example, ``BINARY_SCOPE=.*`` and ``BINARY_EXCLUDE=libmpi.*``
   initially includes all binaries, but the exclude regex then removes the MPI libraries.
|
||||
#. In Omnitrace, the Linux Perf backend is preferred over libunwind. However,
|
||||
Linux Perf usage can be restricted for security reasons.
|
||||
Omnitrace falls back to using a second POSIX timer and libunwind if
|
||||
Linux Perf is not available.
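The weighting in Note #3 follows from repeating a value in the pool. Assuming each pool entry is equally likely to be drawn, the selection probabilities can be computed as in this sketch (the function name is hypothetical):

```python
from collections import Counter
from fractions import Fraction


def selection_probabilities(pool):
    """Probability of each speed-up when pool entries are drawn uniformly."""
    counts = Counter(pool)
    return {speedup: Fraction(n, len(pool)) for speedup, n in counts.items()}


# The pool from Note #3: 0 appears twice, so it is twice as likely as 5 or 10.
print(selection_probabilities([0, 0, 5, 10]))
```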
|
||||
@@ -0,0 +1,334 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
****************************************************
|
||||
Profiling Python scripts
|
||||
****************************************************
|
||||
|
||||
`Omnitrace <https://github.com/ROCm/omnitrace>`_ supports profiling Python code at the
|
||||
source level and the script level.
|
||||
Python support is enabled via the ``OMNITRACE_USE_PYTHON`` and
``OMNITRACE_PYTHON_VERSION=<MAJOR>.<MINOR>`` CMake options.
Alternatively, to build for multiple Python versions, use
``OMNITRACE_PYTHON_VERSIONS="<MAJOR>.<MINOR>;[<MAJOR>.<MINOR>]"``
and ``OMNITRACE_PYTHON_ROOT_DIRS="/path/to/version;[/path/to/version]"`` instead of ``OMNITRACE_PYTHON_VERSION``.
When building for multiple Python versions, the ``OMNITRACE_PYTHON_VERSIONS``
and ``OMNITRACE_PYTHON_ROOT_DIRS`` lists must be the same length.
|
||||
|
||||
.. note::
|
||||
|
||||
When using Omnitrace with Python programs, the Python interpreter major and minor version (e.g. 3.7)
|
||||
must match the interpreter major and minor version
|
||||
used when compiling the Python bindings. When building Omnitrace,
|
||||
the shared object file ``libpyomnitrace.<IMPL>-<VERSION>-<ARCH>-<OS>-<ABI>.so`` is generated
|
||||
where ``IMPL`` is the Python implementation, ``VERSION`` is the major and minor
|
||||
version, ``ARCH`` is the architecture,
|
||||
``OS`` is the operating system, and ``ABI`` is the application binary interface,
|
||||
for example, ``libpyomnitrace.cpython-38-x86_64-linux-gnu.so``.
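To see which binding file name the running interpreter would need, the suffix can be read from the standard ``sysconfig`` module. This is a sketch for checking a version mismatch; the helper name is hypothetical, and it assumes the bindings use the interpreter's standard extension-module suffix, as the example file name above suggests.

```python
import sysconfig


def expected_binding_name(prefix="libpyomnitrace"):
    """Build the shared object name matching the running interpreter.

    EXT_SUFFIX is, for example, ".cpython-38-x86_64-linux-gnu.so" for
    CPython 3.8 on x86_64 Linux.
    """
    suffix = sysconfig.get_config_var("EXT_SUFFIX")
    return f"{prefix}{suffix}"


if __name__ == "__main__":
    print(expected_binding_name())
```

If the printed name does not match any ``libpyomnitrace.*.so`` file in the installation, the interpreter version likely differs from the one used to build the bindings.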
|
||||
|
||||
Getting Started
|
||||
========================================
|
||||
|
||||
The Omnitrace Python package is installed in ``lib/pythonX.Y/site-packages/omnitrace``.
|
||||
To ensure the Python interpreter can find the Omnitrace package,
|
||||
add this path to the ``PYTHONPATH`` environment variable, as in the following example:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
export PYTHONPATH=/opt/omnitrace/lib/python3.8/site-packages:${PYTHONPATH}
|
||||
|
||||
Both the ``share/omnitrace/setup-env.sh`` script and the module file in
|
||||
``share/modulefiles/omnitrace`` automatically handle the prefixing of the ``PYTHONPATH``
|
||||
environment variable.
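A quick way to confirm that ``PYTHONPATH`` is set correctly is to ask the interpreter whether it can locate the package without importing it. A minimal sketch using the standard library (the function name is hypothetical):

```python
import importlib.util


def omnitrace_importable():
    """Return True if the interpreter can locate the `omnitrace` package."""
    return importlib.util.find_spec("omnitrace") is not None


if __name__ == "__main__":
    if omnitrace_importable():
        print("omnitrace package found")
    else:
        print("omnitrace package not found; check PYTHONPATH")
```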
|
||||
|
||||
Running Omnitrace on a Python script
|
||||
========================================
|
||||
|
||||
Omnitrace provides an ``omnitrace-python`` helper bash script which
|
||||
ensures ``PYTHONPATH`` is properly set and the correct Python interpreter is used.
|
||||
This means the following commands are effectively equivalent:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-python --help
|
||||
|
||||
and
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
export PYTHONPATH=/opt/omnitrace/lib/python3.8/site-packages:${PYTHONPATH}
|
||||
python3.8 -m omnitrace --help
|
||||
|
||||
.. note::
|
||||
|
||||
``omnitrace-python`` and ``python -m omnitrace`` use the same command-line syntax
|
||||
as the other ``omnitrace`` executables (``omnitrace-python <OMNITRACE_ARGS> -- <SCRIPT> <SCRIPT_ARGS>``)
|
||||
and have similar options.
|
||||
|
||||
Command line options
|
||||
-----------------------------------
|
||||
|
||||
Use ``omnitrace-python --help`` to view the available options:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
usage: omnitrace [-h] [-v VERBOSITY] [-b] [-c FILE] [-s FILE] [-F [BOOL]] [--label [{args,file,line} [{args,file,line} ...]]] [-I FUNC [FUNC ...]] [-E FUNC [FUNC ...]] [-R FUNC [FUNC ...]] [-MI FILE [FILE ...]] [-ME FILE [FILE ...]] [-MR FILE [FILE ...]] [--trace-c [BOOL]]
|
||||
|
||||
optional arguments:
|
||||
-h, --help show this help message and exit
|
||||
-v VERBOSITY, --verbosity VERBOSITY
|
||||
Logging verbosity
|
||||
-b, --builtin Put 'profile' in the builtins. Use '@profile' to decorate a single function, or 'with profile:' to profile a single section of code.
|
||||
-c FILE, --config FILE
|
||||
OmniTrace configuration file
|
||||
-s FILE, --setup FILE
|
||||
Code to execute before the code to profile
|
||||
-F [BOOL], --full-filepath [BOOL]
|
||||
Encode the full function filename (instead of basename)
|
||||
--label [{args,file,line} [{args,file,line} ...]]
|
||||
Encode the function arguments, filename, and/or line number into the profiling function label
|
||||
-I FUNC [FUNC ...], --function-include FUNC [FUNC ...]
|
||||
Include any entries with these function names
|
||||
-E FUNC [FUNC ...], --function-exclude FUNC [FUNC ...]
|
||||
Filter out any entries with these function names
|
||||
-R FUNC [FUNC ...], --function-restrict FUNC [FUNC ...]
|
||||
Select only entries with these function names
|
||||
-MI FILE [FILE ...], --module-include FILE [FILE ...]
|
||||
Include any entries from these files
|
||||
-ME FILE [FILE ...], --module-exclude FILE [FILE ...]
|
||||
Filter out any entries from these files
|
||||
-MR FILE [FILE ...], --module-restrict FILE [FILE ...]
|
||||
Select only entries from these files
|
||||
--trace-c [BOOL] Enable profiling C functions
|
||||
|
||||
usage: python3 -m omnitrace <OMNITRACE_ARGS> -- <SCRIPT> <SCRIPT_ARGS>
|
||||
|
||||
.. note::
|
||||
|
||||
The ``--trace-c`` option does not incorporate Omnitrace's dynamic instrumentation support.
|
||||
It only enables profiling the underlying C function call within the Python interpreter.
|
||||
|
||||
Selective instrumentation
|
||||
-----------------------------------
|
||||
|
||||
Similar to the ``omnitrace-instrument`` executable, command-line options exist for restricting,
|
||||
including, and excluding certain functions and modules, for example, ``--function-exclude "^__init__$"``.
|
||||
Alternatively, add the ``@profile`` decorator to the primary function of interest
|
||||
in your program and use the ``-b`` / ``--builtin`` command-line option to narrow the scope of the
|
||||
instrumentation to this function and its children.
|
||||
|
||||
Consider the following Python code (``example.py``):
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
import sys
|
||||
|
||||
def fib(n):
|
||||
return n if n < 2 else (fib(n - 1) + fib(n - 2))
|
||||
|
||||
|
||||
def inefficient(n):
|
||||
a = 0
|
||||
for i in range(n):
|
||||
a += i
|
||||
for j in range(n):
|
||||
a += j
|
||||
return a
|
||||
|
||||
|
||||
def run(n):
|
||||
return fib(n) + inefficient(n)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
run(20)
|
||||
|
||||
Running ``omnitrace-python ./example.py`` with ``OMNITRACE_PROFILE=ON`` and
|
||||
``OMNITRACE_TIMEMORY_COMPONENTS=trip_count`` produces the following:
|
||||
|
||||
.. code-block:: shell

   |-------------------------------------------------------------------------------------------|
   |                               COUNTS NUMBER OF INVOCATIONS                                |
   |-------------------------------------------------------------------------------------------|
   |                       LABEL                       | COUNT  | DEPTH  |   METRIC   |  SUM   |
   |---------------------------------------------------|--------|--------|------------|--------|
   | |0>>> run                                         |      1 |      0 | trip_count |      1 |
   | |0>>> |_fib                                       |      1 |      1 | trip_count |      1 |
   | |0>>>   |_fib                                     |      2 |      2 | trip_count |      2 |
   | |0>>>     |_fib                                   |      4 |      3 | trip_count |      4 |
   | |0>>>       |_fib                                 |      8 |      4 | trip_count |      8 |
   | |0>>>         |_fib                               |     16 |      5 | trip_count |     16 |
   | |0>>>           |_fib                             |     32 |      6 | trip_count |     32 |
   | |0>>>             |_fib                           |     64 |      7 | trip_count |     64 |
   | |0>>>               |_fib                         |    128 |      8 | trip_count |    128 |
   | |0>>>                 |_fib                       |    256 |      9 | trip_count |    256 |
   | |0>>>                   |_fib                     |    512 |     10 | trip_count |    512 |
   | |0>>>                     |_fib                   |   1024 |     11 | trip_count |   1024 |
   | |0>>>                       |_fib                 |   2026 |     12 | trip_count |   2026 |
   | |0>>>                         |_fib               |   3632 |     13 | trip_count |   3632 |
   | |0>>>                           |_fib             |   5020 |     14 | trip_count |   5020 |
   | |0>>>                             |_fib           |   4760 |     15 | trip_count |   4760 |
   | |0>>>                               |_fib         |   2942 |     16 | trip_count |   2942 |
   | |0>>>                                 |_fib       |   1152 |     17 | trip_count |   1152 |
   | |0>>>                                   |_fib     |    274 |     18 | trip_count |    274 |
   | |0>>>                                     |_fib   |     36 |     19 | trip_count |     36 |
   | |0>>>                                       |_fib |      2 |     20 | trip_count |      2 |
   | |0>>> |_inefficient                               |      1 |      1 | trip_count |      1 |
   |-------------------------------------------------------------------------------------------|
|
||||
|
||||
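The per-depth ``COUNT`` values in the table above follow from the shape of the recursion, so they can be cross-checked without Omnitrace. The following is a minimal pure-Python sketch that counts how many ``fib`` calls occur at each call depth, with ``fib(20)`` placed at depth 1 to mirror its position below ``run``. The helper name ``fib_calls_per_depth`` is hypothetical and not part of Omnitrace.

```python
from collections import Counter


def fib_calls_per_depth(n, depth=1, counts=None):
    """Count fib() invocations per call depth for the naive recursion."""
    if counts is None:
        counts = Counter()
    counts[depth] += 1
    if n >= 2:
        # Mirror fib(n) = fib(n - 1) + fib(n - 2): both children sit one level deeper.
        fib_calls_per_depth(n - 1, depth + 1, counts)
        fib_calls_per_depth(n - 2, depth + 1, counts)
    return counts


counts = fib_calls_per_depth(20)
for depth in sorted(counts):
    print(f"depth {depth:2d}: {counts[depth]:5d} calls")
```

The counts double at each level up to depth 11 and then taper off, because the shorter ``fib(n - 2)`` branches start reaching the base cases.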
If the ``inefficient`` function is decorated with ``@profile`` as follows:
|
||||
|
||||
.. code-block:: python

   @profile
   def inefficient(n):
       # ...
|
||||
|
||||
When the script is then run using the command ``omnitrace-python -b -- ./example.py``, Omnitrace produces this output:
|
||||
|
||||
.. code-block:: shell

   |-----------------------------------------------------------|
   |               COUNTS NUMBER OF INVOCATIONS                |
   |-----------------------------------------------------------|
   |       LABEL       | COUNT  | DEPTH  |   METRIC   |  SUM   |
   |-------------------|--------|--------|------------|--------|
   | |0>>> inefficient |      1 |      0 | trip_count |      1 |
   |-----------------------------------------------------------|
|
||||
|
||||
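A script decorated with a bare ``@profile`` raises ``NameError`` under the plain Python interpreter, because the name only exists when a profiler provides it. A common workaround, borrowed from other profilers that inject a ``profile`` built-in, is to install a pass-through fallback. This sketch assumes that ``omnitrace-python -b`` makes the name available through the ``builtins`` module; ``_noop_profile`` is a hypothetical helper.

```python
import builtins

# Install a pass-through decorator only when no profiler has provided one.
# Assumption: a profiler started with -b/--builtin defines builtins.profile.
if not hasattr(builtins, "profile"):

    def _noop_profile(func):
        return func

    builtins.profile = _noop_profile


def fib(n):
    return n if n < 2 else (fib(n - 1) + fib(n - 2))


@profile  # noqa: F821 (resolved through builtins at run time)
def inefficient(n):
    a = 0
    for i in range(n):
        a += i
        for j in range(n):
            a += j
    return a


def run(n):
    return fib(n) + inefficient(n)


result = run(20)
print(result)
```

With this guard in place the same file can be run either directly with ``python3`` or through the profiler.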
Omnitrace Python source instrumentation
|
||||
========================================
|
||||
|
||||
Starting with the unmodified ``example.py`` script above, import the ``omnitrace`` module:
|
||||
|
||||
.. code-block:: python

   import sys
   import omnitrace  # newly added import

   def fib(n):
       # ... etc. ...
|
||||
|
||||
Next, add ``@omnitrace.profile()`` to the ``run`` function:
|
||||
|
||||
.. code-block:: python

   @omnitrace.profile()
   def run(n):
       # ...
|
||||
|
||||
Alternatively, use ``omnitrace.profile()`` as a context-manager around ``run(20)``:
|
||||
|
||||
.. code-block:: python

   if __name__ == "__main__":
       with omnitrace.profile():
           run(20)
|
||||
|
||||
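Because ``omnitrace.profile()`` is used above both as a decorator and as a context manager, a script can stay runnable where the ``omnitrace`` module is not installed by substituting a stand-in that supports the same two forms. This is a sketch only: ``NullProfile`` is a hypothetical helper name, and the fallback relies solely on the standard-library ``contextlib.ContextDecorator``.

```python
from contextlib import ContextDecorator

try:
    from omnitrace import profile  # real profiler, when available
except ImportError:

    class NullProfile(ContextDecorator):
        """No-op stand-in usable as ``@profile()`` or ``with profile():``."""

        def __enter__(self):
            return self

        def __exit__(self, *exc_info):
            return False  # never swallow exceptions

    profile = NullProfile


def fib(n):
    return n if n < 2 else (fib(n - 1) + fib(n - 2))


@profile()
def run(n):
    return fib(n)


with profile():
    total = run(10)

print(total)
```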
The results for both of the source-level instrumentation modes are identical to the
|
||||
original ``omnitrace-python ./example.py`` results:
|
||||
|
||||
.. code-block:: shell

   |-------------------------------------------------------------------------------------------|
   |                               COUNTS NUMBER OF INVOCATIONS                                |
   |-------------------------------------------------------------------------------------------|
   |                       LABEL                       | COUNT  | DEPTH  |   METRIC   |  SUM   |
   |---------------------------------------------------|--------|--------|------------|--------|
   | |0>>> run                                         |      1 |      0 | trip_count |      1 |
   | |0>>> |_fib                                       |      1 |      1 | trip_count |      1 |
   | |0>>>   |_fib                                     |      2 |      2 | trip_count |      2 |
   | |0>>>     |_fib                                   |      4 |      3 | trip_count |      4 |
   | |0>>>       |_fib                                 |      8 |      4 | trip_count |      8 |
   | |0>>>         |_fib                               |     16 |      5 | trip_count |     16 |
   | |0>>>           |_fib                             |     32 |      6 | trip_count |     32 |
   | |0>>>             |_fib                           |     64 |      7 | trip_count |     64 |
   | |0>>>               |_fib                         |    128 |      8 | trip_count |    128 |
   | |0>>>                 |_fib                       |    256 |      9 | trip_count |    256 |
   | |0>>>                   |_fib                     |    512 |     10 | trip_count |    512 |
   | |0>>>                     |_fib                   |   1024 |     11 | trip_count |   1024 |
   | |0>>>                       |_fib                 |   2026 |     12 | trip_count |   2026 |
   | |0>>>                         |_fib               |   3632 |     13 | trip_count |   3632 |
   | |0>>>                           |_fib             |   5020 |     14 | trip_count |   5020 |
   | |0>>>                             |_fib           |   4760 |     15 | trip_count |   4760 |
   | |0>>>                               |_fib         |   2942 |     16 | trip_count |   2942 |
   | |0>>>                                 |_fib       |   1152 |     17 | trip_count |   1152 |
   | |0>>>                                   |_fib     |    274 |     18 | trip_count |    274 |
   | |0>>>                                     |_fib   |     36 |     19 | trip_count |     36 |
   | |0>>>                                       |_fib |      2 |     20 | trip_count |      2 |
   | |0>>> |_inefficient                               |      1 |      1 | trip_count |      1 |
   |-------------------------------------------------------------------------------------------|
|
||||
|
||||
.. note::

   When ``omnitrace-python`` is used without built-ins, the profiling results can be cluttered by the
   numerous functions called when more complex modules are imported, such as ``import numpy``.
|
||||
|
||||
Omnitrace Python source instrumentation configuration
|
||||
-------------------------------------------------------------
|
||||
|
||||
Within the Python source code, the profiler can be configured by directly
|
||||
modifying the ``omnitrace.profiler.config`` data fields.
|
||||
|
||||
.. code-block:: python

   import sys

   def fib(n):
       return n if n < 2 else (fib(n - 1) + fib(n - 2))


   def inefficient(n):
       a = 0
       for i in range(n):
           a += i
           for j in range(n):
               a += j
       return a


   def run(n):
       return fib(n) + inefficient(n)


   if __name__ == "__main__":
       from omnitrace.profiler import config
       from omnitrace import profile

       config.include_args = True
       config.include_filename = False
       config.include_line = False
       config.restrict_functions += ["fib", "run"]

       with profile():
           run(5)
|
||||
|
||||
Executing this script produces the following:
|
||||
|
||||
.. code-block:: shell

   |------------------------------------------------------------------|
   |                   COUNTS NUMBER OF INVOCATIONS                   |
   |------------------------------------------------------------------|
   |          LABEL           | COUNT  | DEPTH  |   METRIC   |  SUM   |
   |--------------------------|--------|--------|------------|--------|
   | |0>>> run(n=5)           |      1 |      0 | trip_count |      1 |
   | |0>>> |_fib(n=5)         |      1 |      1 | trip_count |      1 |
   | |0>>>   |_fib(n=4)       |      1 |      2 | trip_count |      1 |
   | |0>>>     |_fib(n=3)     |      1 |      3 | trip_count |      1 |
   | |0>>>       |_fib(n=2)   |      1 |      4 | trip_count |      1 |
   | |0>>>         |_fib(n=1) |      1 |      5 | trip_count |      1 |
   | |0>>>         |_fib(n=0) |      1 |      5 | trip_count |      1 |
   | |0>>>       |_fib(n=1)   |      1 |      4 | trip_count |      1 |
   | |0>>>     |_fib(n=2)     |      1 |      3 | trip_count |      1 |
   | |0>>>       |_fib(n=1)   |      1 |      4 | trip_count |      1 |
   | |0>>>       |_fib(n=0)   |      1 |      4 | trip_count |      1 |
   | |0>>>   |_fib(n=3)       |      1 |      2 | trip_count |      1 |
   | |0>>>     |_fib(n=2)     |      1 |      3 | trip_count |      1 |
   | |0>>>       |_fib(n=1)   |      1 |      4 | trip_count |      1 |
   | |0>>>       |_fib(n=0)   |      1 |      4 | trip_count |      1 |
   | |0>>>     |_fib(n=1)     |      1 |      3 | trip_count |      1 |
   |------------------------------------------------------------------|
|
||||
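To make the table above easier to read, the following standard-library sketch reproduces the same sequence of labels and depths with ``sys.setprofile``. It is not how Omnitrace is implemented: ``trace_calls`` and its ``restrict`` parameter are hypothetical names that only imitate the visible effect of ``include_args = True`` and ``restrict_functions``.

```python
import sys


def fib(n):
    return n if n < 2 else (fib(n - 1) + fib(n - 2))


def inefficient(n):
    a = 0
    for i in range(n):
        a += i
        for j in range(n):
            a += j
    return a


def run(n):
    return fib(n) + inefficient(n)


def trace_calls(func, *args, restrict=("fib", "run")):
    """Record (label, depth) for each call to a function named in restrict."""
    records = []
    depth = 0

    def hook(frame, event, arg):
        nonlocal depth
        code = frame.f_code
        if code.co_name not in restrict:
            return  # ignore everything outside the restricted set
        if event == "call":
            names = code.co_varnames[: code.co_argcount]
            shown = ", ".join(f"{k}={frame.f_locals[k]!r}" for k in names)
            records.append((f"{code.co_name}({shown})", depth))
            depth += 1
        elif event == "return":
            depth -= 1

    sys.setprofile(hook)
    try:
        func(*args)
    finally:
        sys.setprofile(None)  # always remove the hook
    return records


records = trace_calls(run, 5)
for label, depth in records:
    print(f"{'  ' * depth}{label}  (depth {depth})")
```

Because ``inefficient`` is not in the restricted set, it is skipped entirely, which matches its absence from the table.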
@@ -0,0 +1,404 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
****************************************************
|
||||
Sampling the call stack
|
||||
****************************************************
|
||||
|
||||
`Omnitrace <https://github.com/ROCm/omnitrace>`_ can use call-stack sampling
on a binary instrumented with either the ``omnitrace-instrument`` executable
or the ``omnitrace-sample`` executable.
|
||||
For example, all of the following commands are effectively equivalent:
|
||||
|
||||
* Binary rewrite with only the instrumentation necessary to start and stop sampling

  .. code-block:: shell

     omnitrace-instrument -M sampling -o foo.inst -- foo
     omnitrace-run -- ./foo.inst

* Runtime instrumentation with only the instrumentation necessary to start and stop sampling

  .. code-block:: shell

     omnitrace-instrument -M sampling -- foo

* No instrumentation required

  .. code-block:: shell

     omnitrace-sample -- foo
|
||||
|
||||
.. note::

   Set ``OMNITRACE_USE_SAMPLING=ON`` to activate call-stack sampling when executing an instrumented binary.
|
||||
|
||||
All ``omnitrace-instrument -M sampling`` (subsequently referred to as "instrumented-sampling")
|
||||
does is wrap the ``main`` of the executable with initialization
|
||||
before ``main`` starts and finalization after ``main`` ends.
|
||||
This can be accomplished without instrumentation through an ``LD_PRELOAD``
|
||||
of a library containing a dynamic symbol wrapper around ``__libc_start_main``.
|
||||
|
||||
The use of ``omnitrace-sample`` is **recommended** over
|
||||
``omnitrace-instrument -M sampling`` when binary instrumentation
|
||||
is not necessary. This is for a number of reasons:
|
||||
|
||||
* ``omnitrace-sample`` provides command-line options for controlling the Omnitrace feature set instead of
  requiring configuration files or environment variables.
* Despite the fact that instrumented-sampling only requires inserting snippets
  around one function (``main``), Dyninst
  does not have a feature for specifying that parsing and processing all the
  other symbols in the binary is unnecessary.
  In the best-case scenario, when the target binary is relatively small,
  instrumented-sampling has a slightly slower launch time,
  but in the worst-case scenario it requires a significant amount of time and memory to launch.
* ``omnitrace-sample`` is fully compatible with MPI. For example,
  the command ``mpirun -n 2 omnitrace-sample -- foo`` is valid,
  whereas ``mpirun -n 2 omnitrace-instrument -M sampling -- foo``
  is incompatible with some MPI distributions (particularly OpenMPI). This is because
  MPI prohibits forking within an MPI rank.

* When MPI and binary instrumentation are both involved, two steps are required:
  performing a binary rewrite of the executable and then using the instrumented executable
  in lieu of the original executable. ``omnitrace-sample`` is therefore much easier to use with MPI.
|
||||
|
||||
The omnitrace-sample executable
|
||||
========================================
|
||||
|
||||
View the help menu of ``omnitrace-sample`` with the ``-h`` / ``--help`` option:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omnitrace-sample --help
|
||||
[omnitrace-sample] Usage: omnitrace-sample [ --help (count: 0, dtype: bool)
|
||||
--version (count: 0, dtype: bool)
|
||||
--monochrome (max: 1, dtype: bool)
|
||||
--debug (max: 1, dtype: bool)
|
||||
--verbose (count: 1)
|
||||
--config (min: 0, dtype: filepath)
|
||||
--output (min: 1)
|
||||
--trace (max: 1, dtype: bool)
|
||||
--profile (max: 1, dtype: bool)
|
||||
--flat-profile (max: 1, dtype: bool)
|
||||
--host (max: 1, dtype: bool)
|
||||
--device (max: 1, dtype: bool)
|
||||
--wait (count: 1)
|
||||
--duration (count: 1)
|
||||
--trace-file (count: 1, dtype: filepath)
|
||||
--trace-buffer-size (count: 1, dtype: KB)
|
||||
--trace-fill-policy (count: 1)
|
||||
--trace-wait (count: 1)
|
||||
--trace-duration (count: 1)
|
||||
--trace-periods (min: 1)
|
||||
--trace-clock-id (count: 1)
|
||||
--profile-format (min: 1)
|
||||
--profile-diff (min: 1)
|
||||
--process-freq (count: 1)
|
||||
--process-wait (count: 1)
|
||||
--process-duration (count: 1)
|
||||
--cpus (count: unlimited, dtype: int or range)
|
||||
--gpus (count: unlimited, dtype: int or range)
|
||||
--freq (count: 1)
|
||||
--sampling-wait (count: 1)
|
||||
--sampling-duration (count: 1)
|
||||
--tids (min: 1)
|
||||
--cputime (min: 0)
|
||||
--realtime (min: 0)
|
||||
--include (count: unlimited)
|
||||
--exclude (count: unlimited)
|
||||
--cpu-events (count: unlimited)
|
||||
--gpu-events (count: unlimited)
|
||||
--inlines (max: 1, dtype: bool)
|
||||
--hsa-interrupt (count: 1, dtype: int)
|
||||
]
|
||||
Options:
|
||||
-h, -?, --help Shows this page (count: 0, dtype: bool)
|
||||
--version Prints the version and exit (count: 0, dtype: bool)
|
||||
|
||||
[DEBUG OPTIONS]
|
||||
|
||||
--monochrome Disable colorized output (max: 1, dtype: bool)
|
||||
--debug Debug output (max: 1, dtype: bool)
|
||||
-v, --verbose Verbose output (count: 1)
|
||||
|
||||
[GENERAL OPTIONS] These are options which are ubiquitously applied
|
||||
|
||||
-c, --config Configuration file (min: 0, dtype: filepath)
|
||||
-o, --output Output path. Accepts 1-2 parameters corresponding to the output path and the output prefix (min: 1)
|
||||
-T, --trace Generate a detailed trace (perfetto output) (max: 1, dtype: bool)
|
||||
-P, --profile Generate a call-stack-based profile (conflicts with --flat-profile) (max: 1, dtype: bool)
|
||||
-F, --flat-profile Generate a flat profile (conflicts with --profile) (max: 1, dtype: bool)
|
||||
-H, --host Enable sampling host-based metrics for the process. E.g. CPU frequency, memory usage, etc. (max: 1, dtype: bool)
|
||||
-D, --device Enable sampling device-based metrics for the process. E.g. GPU temperature, memory usage, etc. (max: 1, dtype: bool)
|
||||
-w, --wait This option is a combination of '--trace-wait' and '--sampling-wait'. See the descriptions for those two options.
|
||||
(count: 1)
|
||||
-d, --duration This option is a combination of '--trace-duration' and '--sampling-duration'. See the descriptions for those two
|
||||
options. (count: 1)
|
||||
|
||||
[TRACING OPTIONS] Specific options controlling tracing (i.e. deterministic measurements of every event)
|
||||
|
||||
--trace-file Specify the trace output filename. Relative filepath will be with respect to output path and output prefix. (count: 1,
|
||||
dtype: filepath)
|
||||
--trace-buffer-size Size limit for the trace output (in KB) (count: 1, dtype: KB)
|
||||
--trace-fill-policy [ discard | ring_buffer ]
|
||||
|
||||
Policy for new data when the buffer size limit is reached:
|
||||
- discard : new data is ignored
|
||||
- ring_buffer : new data overwrites oldest data (count: 1)
|
||||
--trace-wait Set the wait time (in seconds) before collecting trace and/or profiling data(in seconds). By default, the duration is
|
||||
in seconds of realtime but that can changed via --trace-clock-id. (count: 1)
|
||||
--trace-duration Set the duration of the trace and/or profile data collection (in seconds). By default, the duration is in seconds of
|
||||
realtime but that can changed via --trace-clock-id. (count: 1)
|
||||
--trace-periods More powerful version of specifying trace delay and/or duration. Format is one or more groups of: <DELAY>:<DURATION>,
|
||||
<DELAY>:<DURATION>:<REPEAT>, and/or <DELAY>:<DURATION>:<REPEAT>:<CLOCK_ID>. (min: 1)
|
||||
--trace-clock-id [ 0 (realtime|CLOCK_REALTIME)
|
||||
1 (monotonic|CLOCK_MONOTONIC)
|
||||
2 (cputime|CLOCK_PROCESS_CPUTIME_ID)
|
||||
4 (monotonic_raw|CLOCK_MONOTONIC_RAW)
|
||||
5 (realtime_coarse|CLOCK_REALTIME_COARSE)
|
||||
6 (monotonic_coarse|CLOCK_MONOTONIC_COARSE)
|
||||
7 (boottime|CLOCK_BOOTTIME) ]
|
||||
Set the default clock ID for for trace delay/duration. Note: "cputime" is the *process* CPU time and might need to be
|
||||
scaled based on the number of threads, i.e. 4 seconds of CPU-time for an application with 4 fully active threads would
|
||||
equate to ~1 second of realtime. If this proves to be difficult to handle in practice, please file a feature request
|
||||
for omnitrace to auto-scale based on the number of threads. (count: 1)
|
||||
|
||||
[PROFILE OPTIONS] Specific options controlling profiling (i.e. deterministic measurements which are aggregated into a summary)
|
||||
|
||||
--profile-format [ console | json | text ]
|
||||
Data formats for profiling results (min: 1)
|
||||
--profile-diff Generate a diff output b/t the profile collected and an existing profile from another run Accepts 1-2 parameters
|
||||
corresponding to the input path and the input prefix (min: 1)
|
||||
|
||||
[HOST/DEVICE (PROCESS SAMPLING) OPTIONS]
|
||||
Process sampling is background measurements for resources available to the entire process. These samples are not tied
|
||||
to specific lines/regions of code
|
||||
|
||||
--process-freq Set the default host/device sampling frequency (number of interrupts per second) (count: 1)
|
||||
--process-wait Set the default wait time (i.e. delay) before taking first host/device sample (in seconds of realtime) (count: 1)
|
||||
--process-duration Set the duration of the host/device sampling (in seconds of realtime) (count: 1)
|
||||
--cpus CPU IDs for frequency sampling. Supports integers and/or ranges (count: unlimited, dtype: int or range)
|
||||
--gpus GPU IDs for SMI queries. Supports integers and/or ranges (count: unlimited, dtype: int or range)
|
||||
|
||||
[GENERAL SAMPLING OPTIONS] General options for timer-based sampling per-thread
|
||||
|
||||
-f, --freq Set the default sampling frequency (number of interrupts per second) (count: 1)
|
||||
--sampling-wait Set the default wait time (i.e. delay) before taking first sample (in seconds). This delay time is based on the clock
|
||||
of the sampler, i.e., a delay of 1 second for CPU-clock sampler may not equal 1 second of realtime (count: 1)
|
||||
--sampling-duration Set the duration of the sampling (in seconds of realtime). I.e., it is possible (currently) to set a CPU-clock time
|
||||
delay that exceeds the real-time duration... resulting in zero samples being taken (count: 1)
|
||||
-t, --tids Specify the default thread IDs for sampling, where 0 (zero) is the main thread and each thread created by the target
|
||||
application is assigned an atomically incrementing value. (min: 1)
|
||||
|
||||
[SAMPLING TIMER OPTIONS] These options determine the heuristic for deciding when to take a sample
|
||||
|
||||
--cputime Sample based on a CPU-clock timer (default). Accepts zero or more arguments:
|
||||
0. Enables sampling based on CPU-clock timer.
|
||||
1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of CPU-time.
|
||||
2. Delay (in seconds of CPU-clock time). I.e., how long each thread should wait before taking first sample.
|
||||
3+ Thread IDs to target for sampling, starting at 0 (the main thread).
|
||||
May be specified as index or range, e.g., '0 2-4' will be interpreted as:
|
||||
sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads (min: 0)
|
||||
--realtime Sample based on a real-clock timer. Accepts zero or more arguments:
|
||||
0. Enables sampling based on real-clock timer.
|
||||
1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of realtime.
|
||||
2. Delay (in seconds of real-clock time). I.e., how long each thread should wait before taking first sample.
|
||||
3+ Thread IDs to target for sampling, starting at 0 (the main thread).
|
||||
May be specified as index or range, e.g., '0 2-4' will be interpreted as:
|
||||
sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads
|
||||
When sampling with a real-clock timer, please note that enabling this will cause threads which are typically "idle"
|
||||
to consume more resources since, while idle, the real-clock time increases (and therefore triggers taking samples)
|
||||
whereas the CPU-clock time does not. (min: 0)
|
||||
|
||||
[BACKEND OPTIONS] These options control region information captured w/o sampling or instrumentation
|
||||
|
||||
-I, --include [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
|
||||
Include data from these backends (count: unlimited)
|
||||
-E, --exclude [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
|
||||
Exclude data from these backends (count: unlimited)
|
||||
|
||||
[HARDWARE COUNTER OPTIONS] See also: omnitrace-avail -H
|
||||
|
||||
-C, --cpu-events Set the CPU hardware counter events to record (ref: `omnitrace-avail -H -c CPU`) (count: unlimited)
|
||||
-G, --gpu-events Set the GPU hardware counter events to record (ref: `omnitrace-avail -H -c GPU`) (count: unlimited)
|
||||
|
||||
[MISCELLANEOUS OPTIONS]
|
||||
|
||||
-i, --inlines Include inline info in output when available (max: 1, dtype: bool)
|
||||
--hsa-interrupt [ 0 | 1 ] Set the value of the HSA_ENABLE_INTERRUPT environment variable.
|
||||
ROCm version 5.2 and older have a bug which will cause a deadlock if a sample is taken while waiting for the signal
|
||||
that a kernel completed -- which happens when sampling with a real-clock timer. We require this option to be set to
|
||||
when --realtime is specified to make users aware that, while this may fix the bug, it can have a negative impact on
|
||||
performance.
|
||||
Values:
|
||||
0 avoid triggering the bug, potentially at the cost of reduced performance
|
||||
1 do not modify how ROCm is notified about kernel completion (count: 1, dtype: int)
|
||||
|
||||
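Two conventions in the help text above are easy to misread: the sampling frequency is given in interrupts per second, and thread IDs can be given as single indices or as inclusive ranges. The following sketch illustrates both. ``sampling_interval_ms`` and ``expand_thread_ids`` are hypothetical helpers written for this illustration and are not part of Omnitrace.

```python
def sampling_interval_ms(interrupts_per_second):
    """Convert a sampling frequency into the time between samples in milliseconds."""
    return 1000.0 / interrupts_per_second


def expand_thread_ids(spec):
    """Expand a spec such as '0 2-4' into a sorted list of thread IDs."""
    tids = set()
    for token in spec.split():
        if "-" in token:
            first, last = (int(part) for part in token.split("-", 1))
            tids.update(range(first, last + 1))  # ranges are inclusive
        else:
            tids.add(int(token))
    return sorted(tids)


print(sampling_interval_ms(100))
print(expand_thread_ids("0 2-4"))
```

Thread ``1`` (the first child thread) is absent from the expanded list, which matches the interpretation given in the help text.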
The general syntax for separating Omnitrace command-line arguments from the
|
||||
following application arguments
|
||||
is consistent with the LLVM style of using a stand-alone double hyphen (``--``).
|
||||
All arguments preceding the double hyphen
|
||||
are interpreted as belonging to Omnitrace and all arguments following it
|
||||
are interpreted as the
|
||||
application and its arguments. The double hyphen is only necessary when passing
|
||||
command-line arguments to a target
|
||||
which also uses hyphens. For example, you can run ``omnitrace-sample ls``, but
|
||||
to run ``ls -la``, use ``omnitrace-sample -- ls -la``.
|
||||
|
||||
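The separator convention described above can be sketched in a few lines of Python. ``split_command_line`` is a hypothetical helper, and its handling of input without a separator (treating everything as the target command) is an assumption made for this illustration, not a description of the real ``omnitrace-sample`` argument parser.

```python
def split_command_line(argv):
    """Split argv at the first stand-alone '--' into (tool_args, target_command)."""
    if "--" in argv:
        index = argv.index("--")
        return argv[:index], argv[index + 1:]
    # Assumption for this sketch: without a separator, everything is the target command.
    return [], list(argv)


print(split_command_line(["-PTDH", "--", "ls", "-la"]))
print(split_command_line(["ls"]))
```

Only the first ``--`` is treated as the separator, so a target that itself uses ``--`` still receives it unchanged.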
:doc:`Configuring the Omnitrace runtime options <./configuring-runtime-options>`
|
||||
establishes the precedence of environment variable values over values specified
|
||||
in the configuration files. This enables
|
||||
you to configure the Omnitrace runtime to your preferred default behavior
|
||||
in a file such as ``~/.omnitrace.cfg`` and then easily override
|
||||
those settings in the command line, for example, ``OMNITRACE_ENABLED=OFF omnitrace-sample -- foo``.
|
||||
Similarly, the command-line arguments passed to ``omnitrace-sample`` take precedence
|
||||
over environment variables.
|
||||
|
||||
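The precedence order described above (command line over environment variables over configuration file) behaves like a layered lookup. The following sketch models it with the standard-library ``collections.ChainMap``. The setting values are made up for illustration, and this is not a claim about how Omnitrace resolves settings internally.

```python
from collections import ChainMap

# Hypothetical values for illustration only.
config_file = {"OMNITRACE_ENABLED": "ON", "OMNITRACE_PROFILE": "OFF"}
environment = {"OMNITRACE_ENABLED": "OFF"}
command_line = {"OMNITRACE_PROFILE": "ON"}

# ChainMap searches its mappings left to right, so the first match wins.
effective = ChainMap(command_line, environment, config_file)

for name in sorted(effective):
    print(f"{name}={effective[name]}")
```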
All of the command-line options above correlate to one or more configuration
|
||||
settings, for example, ``--cpu-events`` correlates to the ``OMNITRACE_PAPI_EVENTS`` configuration variable.
|
||||
``omnitrace-sample`` processes the arguments and outputs a summary of its configuration
|
||||
before running the target application.
|
||||
|
||||
The following snippets show how ``omnitrace-sample`` runs with various environment updates.
|
||||
|
||||
* This snippet shows the environment updates when ``omnitrace-sample`` is invoked with no arguments:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omnitrace-sample -- ./parallel-overhead-locks 30 4 100
|
||||
|
||||
HSA_TOOLS_LIB=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
|
||||
HSA_TOOLS_REPORT_LOAD_FAILURE=1
|
||||
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
|
||||
OMNITRACE_USE_PROCESS_SAMPLING=false
|
||||
OMNITRACE_USE_SAMPLING=true
|
||||
OMP_TOOL_LIBRARIES=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
|
||||
ROCP_TOOL_LIB=/opt/omnitrace/lib/libomnitrace.so.1.7.1
|
||||
|
||||
* The next snippet shows the environment updates when ``omnitrace-sample`` enables
|
||||
profiling, tracing, host process-sampling, device process-sampling, and all the available backends:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omnitrace-sample -PTDH -I all -- ./parallel-overhead-locks 30 4 100
|
||||
|
||||
HSA_TOOLS_LIB=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
|
||||
HSA_TOOLS_REPORT_LOAD_FAILURE=1
|
||||
KOKKOS_PROFILE_LIBRARY=/opt/omnitrace/lib/libomnitrace.so.1.7.1
|
||||
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
|
||||
OMNITRACE_CPU_FREQ_ENABLED=true
|
||||
OMNITRACE_TRACE_THREAD_LOCKS=true
|
||||
OMNITRACE_TRACE_THREAD_RW_LOCKS=true
|
||||
OMNITRACE_TRACE_THREAD_SPIN_LOCKS=true
|
||||
OMNITRACE_USE_KOKKOSP=true
|
||||
OMNITRACE_USE_MPIP=true
|
||||
OMNITRACE_USE_OMPT=true
|
||||
OMNITRACE_TRACE=true
|
||||
OMNITRACE_USE_PROCESS_SAMPLING=true
|
||||
OMNITRACE_USE_RCCLP=true
|
||||
OMNITRACE_USE_ROCM_SMI=true
|
||||
OMNITRACE_USE_ROCPROFILER=true
|
||||
OMNITRACE_USE_ROCTRACER=true
|
||||
OMNITRACE_USE_ROCTX=true
|
||||
OMNITRACE_USE_SAMPLING=true
|
||||
OMNITRACE_PROFILE=true
|
||||
OMP_TOOL_LIBRARIES=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
|
||||
ROCP_TOOL_LIB=/opt/omnitrace/lib/libomnitrace.so.1.7.1
|
||||
...
|
||||
|
||||
* The final snippet shows the environment updates when ``omnitrace-sample`` enables
|
||||
profiling, tracing, host process-sampling, and device process-sampling,
|
||||
sets the output path to ``omnitrace-output`` and the output prefix to ``%tag%``, and disables
|
||||
all the available backends:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omnitrace-sample -PTDH -E all -o omnitrace-output %tag% -- ./parallel-overhead-locks 30 4 100
|
||||
|
||||
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
|
||||
OMNITRACE_CPU_FREQ_ENABLED=true
|
||||
OMNITRACE_OUTPUT_PATH=omnitrace-output
|
||||
OMNITRACE_OUTPUT_PREFIX=%tag%
|
||||
OMNITRACE_TRACE_THREAD_LOCKS=false
|
||||
OMNITRACE_TRACE_THREAD_RW_LOCKS=false
|
||||
OMNITRACE_TRACE_THREAD_SPIN_LOCKS=false
|
||||
OMNITRACE_USE_KOKKOSP=false
|
||||
OMNITRACE_USE_MPIP=false
|
||||
OMNITRACE_USE_OMPT=false
|
||||
OMNITRACE_TRACE=true
|
||||
OMNITRACE_USE_PROCESS_SAMPLING=true
|
||||
OMNITRACE_USE_RCCLP=false
|
||||
OMNITRACE_USE_ROCM_SMI=false
|
||||
OMNITRACE_USE_ROCPROFILER=false
|
||||
OMNITRACE_USE_ROCTRACER=false
|
||||
OMNITRACE_USE_ROCTX=false
|
||||
OMNITRACE_USE_SAMPLING=true
|
||||
OMNITRACE_PROFILE=true
|
||||
...
|
||||
|
||||
An omnitrace-sample example
|
||||
========================================
|
||||
|
||||
Here is the full output from the previous
``omnitrace-sample -PTDH -E all -o omnitrace-output %tag% -- ./parallel-overhead-locks 30 4 100`` command,
run here with an additional ``-c`` option:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omnitrace-sample -PTDH -E all -o omnitrace-output %tag% -c -- ./parallel-overhead-locks 30 4 100
|
||||
|
||||
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.11.3
|
||||
OMNITRACE_CONFIG_FILE=
|
||||
OMNITRACE_CPU_FREQ_ENABLED=true
|
||||
OMNITRACE_OUTPUT_PATH=omnitrace-output
|
||||
OMNITRACE_OUTPUT_PREFIX=%tag%
|
||||
OMNITRACE_PROFILE=true
|
||||
OMNITRACE_TRACE=true
|
||||
OMNITRACE_TRACE_THREAD_LOCKS=false
|
||||
OMNITRACE_TRACE_THREAD_RW_LOCKS=false
|
||||
OMNITRACE_TRACE_THREAD_SPIN_LOCKS=false
|
||||
OMNITRACE_USE_KOKKOSP=false
|
||||
OMNITRACE_USE_MPIP=false
|
||||
OMNITRACE_USE_OMPT=false
|
||||
OMNITRACE_USE_PROCESS_SAMPLING=true
|
||||
OMNITRACE_USE_RCCLP=false
|
||||
OMNITRACE_USE_ROCM_SMI=false
|
||||
OMNITRACE_USE_ROCPROFILER=false
|
||||
OMNITRACE_USE_ROCTRACER=false
|
||||
OMNITRACE_USE_ROCTX=false
|
||||
OMNITRACE_USE_SAMPLING=true
|
||||
[omnitrace][dl][1785877] omnitrace_main
|
||||
[omnitrace][1785877][omnitrace_init_tooling] Instrumentation mode: Sampling
|
||||
______ .___ ___. .__ __. __ .___________..______ ___ ______ _______
|
||||
/ __ \ | \/ | | \ | | | | | || _ \ / \ / || ____|
|
||||
| | | | | \ / | | \| | | | `---| |----`| |_) | / ^ \ | ,----'| |__
|
||||
| | | | | |\/| | | . ` | | | | | | / / /_\ \ | | | __|
|
||||
| `--' | | | | | | |\ | | | | | | |\ \----./ _____ \ | `----.| |____
|
||||
\______/ |__| |__| |__| \__| |__| |__| | _| `._____/__/ \__\ \______||_______|
|
||||
omnitrace v1.11.2 (rev: 2586b74db8bf335742600010b8d9f1ce8da9cf89, compiler: GNU v11.4.1, rocm: v6.1.x)
|
||||
[988.958] perfetto.cc:58649 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
|
||||
[parallel-overhead-locks] Threads: 4
|
||||
[parallel-overhead-locks] Iterations: 100
|
||||
[parallel-overhead-locks] fibonacci(30)...
|
||||
[1] number of iterations: 100
|
||||
[2] number of iterations: 100
|
||||
[3] number of iterations: 100
|
||||
[4] number of iterations: 100
|
||||
[parallel-overhead-locks] fibonacci(30) x 4 = 409221992
|
||||
[parallel-overhead-locks] number of mutex locks = 400
|
||||
[omnitrace][1785877][0][omnitrace_finalize] finalizing...
|
||||
[omnitrace][1785877][0][omnitrace_finalize]
|
||||
[omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877 : 0.294342 sec wall_clock, 4.776 MB peak_rss, 3.170 MB page_rss, 0.990000 sec cpu_clock, 336.3 % cpu_util [laps: 1]
|
||||
[omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/0 : 0.291535 sec wall_clock, 0.002619 sec thread_cpu_clock, 0.9 % thread_cpu_util, 4.776 MB peak_rss [laps: 1]
|
||||
[omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/1 : 0.271353 sec wall_clock, 0.222572 sec thread_cpu_clock, 82.0 % thread_cpu_util, 4.200 MB peak_rss [laps: 1]
|
||||
[omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/2 : 0.238218 sec wall_clock, 0.206405 sec thread_cpu_clock, 86.6 % thread_cpu_util, 3.432 MB peak_rss [laps: 1]
|
||||
[omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/3 : 0.209459 sec wall_clock, 0.193415 sec thread_cpu_clock, 92.3 % thread_cpu_util, 2.472 MB peak_rss [laps: 1]
|
||||
[omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/4 : 0.212029 sec wall_clock, 0.211694 sec thread_cpu_clock, 99.8 % thread_cpu_util, 1.152 MB peak_rss [laps: 1]
|
||||
[omnitrace][1785877][0][omnitrace_finalize]
|
||||
[omnitrace][1785877][0][omnitrace_finalize] Finalizing perfetto...
|
||||
[omnitrace][1785877][perfetto]> Outputting '/home/user/code/omnitrace/build-release/omnitrace-output/2024-07-15_16.21/parallel-overhead-locksperfetto-trace-1785877.proto' (39.12 KB / 0.04 MB / 0.00 GB)... Done
|
||||
[omnitrace][1785877][wall_clock]> Outputting 'omnitrace-output/2024-07-15_16.21/parallel-overhead-lockswall_clock-1785877.json'
|
||||
[omnitrace][1785877][wall_clock]> Outputting 'omnitrace-output/2024-07-15_16.21/parallel-overhead-lockswall_clock-1785877.txt'
|
||||
[omnitrace][1785877][metadata]> Outputting 'omnitrace-output/2024-07-15_16.21/parallel-overhead-locksmetadata-1785877.json' and 'omnitrace-output/2024-07-15_16.21/parallel-overhead-locksfunctions-1785877.json'
|
||||
[omnitrace][1785877][0][omnitrace_finalize] Finalized: 0.054582 sec wall_clock, 0.000 MB peak_rss, -1.798 MB page_rss, 0.040000 sec cpu_clock, 73.3 % cpu_util
|
||||
[989.312] perfetto.cc:60128 Tracing session 1 ended, total sessions:0
|
||||
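The ``cpu_util`` figures in the log above are consistent with the ratio of CPU time to wall-clock time, expressed as a percentage, which is why a multithreaded process can report more than 100%. The following is a quick check of that reading. The helper ``cpu_util_percent`` is hypothetical, and the formula is inferred from the numbers in the log rather than taken from the Omnitrace source.

```python
def cpu_util_percent(cpu_seconds, wall_seconds):
    """CPU time as a percentage of elapsed wall-clock time."""
    return 100.0 * cpu_seconds / wall_seconds


# Values copied from the process and thread/4 lines of the log.
process = cpu_util_percent(0.990000, 0.294342)
thread_4 = cpu_util_percent(0.211694, 0.212029)
print(f"process: {process:.1f} %, thread 4: {thread_4:.1f} %")
```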
@@ -0,0 +1,938 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
****************************************************
|
||||
Understanding the Omnitrace output
|
||||
****************************************************
|
||||
|
||||
The general output form of `Omnitrace <https://github.com/ROCm/omnitrace>`_ is
|
||||
``<OUTPUT_PATH>[/<TIMESTAMP>]/[<PREFIX>]<DATA_NAME>[-<OUTPUT_SUFFIX>].<EXT>``.
|
||||
|
||||
For example, starting with the following base configuration:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
export OMNITRACE_OUTPUT_PATH=omnitrace-example-output
|
||||
export OMNITRACE_TIME_OUTPUT=OFF
|
||||
export OMNITRACE_USE_PID=OFF
|
||||
export OMNITRACE_PROFILE=ON
|
||||
export OMNITRACE_TRACE=ON
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omnitrace-instrument -- ./foo
|
||||
...
|
||||
[omnitrace] Outputting 'omnitrace-example-output/perfetto-trace.proto'...
|
||||
|
||||
[omnitrace] Outputting 'omnitrace-example-output/wall-clock.txt'...
|
||||
[omnitrace] Outputting 'omnitrace-example-output/wall-clock.json'...
|
||||
|
||||
If the ``OMNITRACE_USE_PID`` option is enabled, then running a non-MPI executable
|
||||
with a PID of ``63453`` results in the following output:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ export OMNITRACE_USE_PID=ON
|
||||
$ omnitrace-instrument -- ./foo
|
||||
...
|
||||
[omnitrace] Outputting 'omnitrace-example-output/perfetto-trace-63453.proto'...
|
||||
|
||||
[omnitrace] Outputting 'omnitrace-example-output/wall-clock-63453.txt'...
|
||||
[omnitrace] Outputting 'omnitrace-example-output/wall-clock-63453.json'...
|
||||
|
||||
If ``OMNITRACE_TIME_OUTPUT`` is enabled, then a job that started on January 31, 2022 at 12:30 PM
|
||||
generates the following:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ export OMNITRACE_TIME_OUTPUT=ON
|
||||
$ omnitrace-instrument -- ./foo
|
||||
...
|
||||
[omnitrace] Outputting 'omnitrace-example-output/2022-01-31_12.30_PM/perfetto-trace-63453.proto'...
|
||||
|
||||
[omnitrace] Outputting 'omnitrace-example-output/2022-01-31_12.30_PM/wall-clock-63453.txt'...
|
||||
[omnitrace] Outputting 'omnitrace-example-output/2022-01-31_12.30_PM/wall-clock-63453.json'...
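
The three cases above all follow the general output form. As a minimal sketch (not Omnitrace's implementation), the hypothetical helper ``output_filename`` below composes a path from the same pieces. The ``strftime`` format string is an assumption chosen only to reproduce the ``2022-01-31_12.30_PM`` folder name shown above.

```python
import posixpath
from datetime import datetime


def output_filename(output_path, data_name, ext, *, timestamp=None,
                    prefix="", suffix=None):
    """Compose <OUTPUT_PATH>[/<TIMESTAMP>]/[<PREFIX>]<DATA_NAME>[-<OUTPUT_SUFFIX>].<EXT>.

    Hypothetical helper for illustration only.
    """
    parts = [output_path]
    if timestamp is not None:      # folder added when OMNITRACE_TIME_OUTPUT is enabled
        parts.append(timestamp)
    name = f"{prefix}{data_name}"
    if suffix is not None:         # PID (or MPI rank) added when OMNITRACE_USE_PID is enabled
        name += f"-{suffix}"
    parts.append(f"{name}.{ext}")
    return posixpath.join(*parts)


# Assumed strftime-style format; it only mirrors the folder name in the example above.
stamp = datetime(2022, 1, 31, 12, 30).strftime("%Y-%m-%d_%I.%M_%p")

print(output_filename("omnitrace-example-output", "perfetto-trace", "proto"))
print(output_filename("omnitrace-example-output", "wall-clock", "txt", suffix=63453))
print(output_filename("omnitrace-example-output", "perfetto-trace", "proto",
                      timestamp=stamp, suffix=63453))
```

Each optional piece is added only when its setting is enabled, which is why the same run can produce three different paths for the same data.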
|
||||
|
||||
Metadata
|
||||
========================================
|
||||
|
||||
Omnitrace outputs a ``metadata.json`` file. This metadata file contains the following
information about the settings, environment variables, output files, the system,
and the run:
|
||||
|
||||
* Hardware cache sizes
|
||||
* Physical CPUs
|
||||
* Hardware concurrency
|
||||
* CPU model, frequency, vendor, and features
|
||||
* Launch date and time
|
||||
* Memory maps (for example, shared libraries)
|
||||
* Output files
|
||||
* Environment variables
|
||||
* Configuration settings
|
||||
|
||||
Metadata JSON Sample
|
||||
-----------------------------------------------------------------------
|
||||
|
||||
.. code-block:: json
|
||||
|
||||
{
|
||||
"omnitrace": {
|
||||
"metadata": {
|
||||
"info": {
|
||||
"HW_L1_CACHE_SIZE": 32768,
|
||||
"HW_L2_CACHE_SIZE": 524288,
|
||||
"HW_L3_CACHE_SIZE": 16777216,
|
||||
"HW_PHYSICAL_CPU": 12,
|
||||
"HW_CONCURRENCY": 24,
|
||||
"LAUNCH_TIME": "02:04",
|
||||
"LAUNCH_DATE": "05/08/22",
|
||||
"TIMEMORY_GIT_REVISION": "52e7034fd419ff296506cdef43084f6071dbaba1",
|
||||
"TIMEMORY_VERSION": "3.3.0rc4",
|
||||
"TIMEMORY_API": "tim::project::timemory",
|
||||
"TIMEMORY_GIT_DESCRIBE": "v3.2.0-263-g52e7034f",
|
||||
"PWD": "/home/jrmadsen/devel/c++/AARInternal/hosttrace-dyninst/build-vscode",
|
||||
"USER": "jrmadsen",
|
||||
"HOME": "/home/jrmadsen",
|
||||
"SHELL": "/bin/bash",
|
||||
"CPU_MODEL": "AMD Ryzen Threadripper PRO 3945WX 12-Cores",
|
||||
"CPU_FREQUENCY": 2400,
|
||||
"CPU_VENDOR": "AuthenticAMD",
|
||||
"CPU_FEATURES": [
|
||||
"fpu",
|
||||
"msr",
|
||||
"sse",
|
||||
"sse2",
|
||||
"constant_tsc",
|
||||
"ssse3",
|
||||
"fma",
|
||||
"sse4_1",
|
||||
"sse4_2",
|
||||
"popcnt",
|
||||
"avx2",
|
||||
"... etc. ..."
|
||||
],
|
||||
"memory_maps": [
|
||||
{
|
||||
"end_address": "7f4013797000",
|
||||
"start_address": "7f4012e58000",
|
||||
"pathname": "/opt/rocm-5.0.0/hip/lib/libamdhip64.so.5.0.50000",
|
||||
"offset": "34a000",
|
||||
"device": "103:05",
|
||||
"inode": 4331165,
|
||||
"permissions": "rw-p"
|
||||
},
|
||||
{
|
||||
"end_address": "7f4013902000",
|
||||
"start_address": "7f4013901000",
|
||||
"pathname": "/usr/lib/x86_64-linux-gnu/libm-2.31.so",
|
||||
"offset": "14d000",
|
||||
"device": "103:05",
|
||||
"inode": 42078854,
|
||||
"permissions": "rwxp"
|
||||
},
|
||||
{
|
||||
"end_address": "7f4013919000",
|
||||
"start_address": "7f4013908000",
|
||||
"pathname": "/usr/lib/x86_64-linux-gnu/libpthread-2.31.so",
|
||||
"offset": "6000",
|
||||
"device": "103:05",
|
||||
"inode": 42078874,
|
||||
"permissions": "r-xp"
|
||||
},
|
||||
{
|
||||
"...": "etc."
|
||||
}
|
||||
],
|
||||
"memory_maps_files": [
|
||||
"/opt/rocm-5.0.0/hip/lib/libamdhip64.so.5.0.50000",
|
||||
"/opt/rocm-5.0.0/hsa-amd-aqlprofile/lib/libhsa-amd-aqlprofile64.so.1.0.50000",
|
||||
"/opt/rocm-5.0.0/lib/libamd_comgr.so.2.4.50000",
|
||||
"/opt/rocm-5.0.0/lib/libhsa-runtime64.so.1.5.50000",
|
||||
"/opt/rocm-5.0.0/rocm_smi/lib/librocm_smi64.so.5.0.50000",
|
||||
"/opt/rocm-5.0.0/roctracer/lib/libroctracer64.so.1.0.50000",
|
||||
"/usr/lib/x86_64-linux-gnu/ld-2.31.so",
|
||||
"/usr/lib/x86_64-linux-gnu/libc-2.31.so",
|
||||
"/usr/lib/x86_64-linux-gnu/libdl-2.31.so",
|
||||
"... etc. ..."
|
||||
]
|
||||
},
|
||||
"output": {
|
||||
"text": [
|
||||
{
|
||||
"value": [
|
||||
"omnitrace-tests-output/parallel-overhead-binary-rewrite/roctracer.txt"
|
||||
],
|
||||
"key": "roctracer"
|
||||
},
|
||||
{
|
||||
"value": [
|
||||
"omnitrace-tests-output/parallel-overhead-binary-rewrite/wall_clock.txt"
|
||||
],
|
||||
"key": "wall_clock"
|
||||
}
|
||||
],
|
||||
"json": [
|
||||
{
|
||||
"value": [
|
||||
"omnitrace-tests-output/parallel-overhead-binary-rewrite/roctracer.json",
|
||||
"omnitrace-tests-output/parallel-overhead-binary-rewrite/roctracer.tree.json"
|
||||
],
|
||||
"key": "roctracer"
|
||||
},
|
||||
{
|
||||
"value": [
|
||||
"omnitrace-tests-output/parallel-overhead-binary-rewrite/wall_clock.json",
|
||||
"omnitrace-tests-output/parallel-overhead-binary-rewrite/wall_clock.tree.json"
|
||||
],
|
||||
"key": "wall_clock"
|
||||
}
|
||||
]
|
||||
},
|
||||
"environment": [
|
||||
{
|
||||
"value": "/home/jrmadsen",
|
||||
"key": "HOME"
|
||||
},
|
||||
{
|
||||
"value": "/bin/bash",
|
||||
"key": "SHELL"
|
||||
},
|
||||
{
|
||||
"value": "jrmadsen",
|
||||
"key": "USER"
|
||||
},
|
||||
{
|
||||
"value": "true",
|
||||
"key": "... etc. ..."
|
||||
}
|
||||
],
|
||||
"settings": {
|
||||
"OMNITRACE_JSON_OUTPUT": {
|
||||
"count": -1,
|
||||
"environ_updated": false,
|
||||
"name": "json_output",
|
||||
"data_type": "bool",
|
||||
"initial": true,
|
||||
"enabled": true,
|
||||
"value": true,
|
||||
"max_count": 1,
|
||||
"cmdline": [
|
||||
"--omnitrace-json-output"
|
||||
],
|
||||
"environ": "OMNITRACE_JSON_OUTPUT",
|
||||
"config_updated": false,
|
||||
"categories": [
|
||||
"io",
|
||||
"json",
|
||||
"native"
|
||||
],
|
||||
"description": "Write json output files"
|
||||
},
|
||||
"... etc. ...": {
|
||||
"etc.": true
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
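
The nesting in the sample above can be walked with any JSON library. The following is a minimal sketch, assuming the layout shown in the sample (``omnitrace`` > ``metadata`` > ``output`` and ``environment``). The helper names ``output_files`` and ``environment`` are hypothetical, and ``SAMPLE`` is a trimmed copy of the sample rather than a real file.

```python
import json

# Trimmed copy of the sample above; a real file would be read with json.load().
SAMPLE = json.loads("""
{
  "omnitrace": {
    "metadata": {
      "info": {"HW_PHYSICAL_CPU": 12, "HW_CONCURRENCY": 24},
      "output": {
        "text": [
          {"value": ["omnitrace-tests-output/parallel-overhead-binary-rewrite/wall_clock.txt"],
           "key": "wall_clock"}
        ],
        "json": [
          {"value": ["omnitrace-tests-output/parallel-overhead-binary-rewrite/wall_clock.json",
                     "omnitrace-tests-output/parallel-overhead-binary-rewrite/wall_clock.tree.json"],
           "key": "wall_clock"}
        ]
      },
      "environment": [
        {"value": "/bin/bash", "key": "SHELL"}
      ]
    }
  }
}
""")


def output_files(doc, fmt):
    """Map each component key to its output files for one format ("text" or "json")."""
    entries = doc["omnitrace"]["metadata"]["output"].get(fmt, [])
    return {entry["key"]: entry["value"] for entry in entries}


def environment(doc):
    """Turn the list of key/value records into a plain dictionary."""
    return {e["key"]: e["value"] for e in doc["omnitrace"]["metadata"]["environment"]}


print(output_files(SAMPLE, "json"))
print(environment(SAMPLE)["SHELL"])
```

Because ``output`` and ``environment`` are stored as lists of ``key``/``value`` records, converting them to dictionaries first makes lookups by component or variable name straightforward.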
|
||||
|
||||
Configuring the Omnitrace output
|
||||
========================================
|
||||
|
||||
Omnitrace includes a core set of options for controlling the format
|
||||
and contents of the output files. For additional information, see the guide on
|
||||
:doc:`configuring runtime options <./configuring-runtime-options>`.
|
||||
|
||||
Core configuration settings
|
||||
-----------------------------------
|
||||
|
||||
.. csv-table::
|
||||
:header: "Setting", "Value", "Description"
|
||||
:widths: 30, 30, 100
|
||||
|
||||
"``OMNITRACE_OUTPUT_PATH``", "Any valid path", "Path to folder where output files should be placed"
|
||||
"``OMNITRACE_OUTPUT_PREFIX``", "String", "Useful for multiple runs with different arguments. See the next section on output prefix keys."
|
||||
"``OMNITRACE_OUTPUT_FILE``", "Any valid filepath", "Specific location for the Perfetto output file"
|
||||
"``OMNITRACE_TIME_OUTPUT``", "Boolean", "Place all output in a timestamped folder, timestamp format controlled via ``OMNITRACE_TIME_FORMAT``"
|
||||
"``OMNITRACE_TIME_FORMAT``", "String", "See ``strftime`` man pages for valid identifiers"
|
||||
"``OMNITRACE_USE_PID``", "Boolean", "Append either the PID or the MPI rank to all output files (before the extension)"
|
||||
|
||||
Output prefix keys
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Output prefix keys have many uses but are most helpful when dealing with multiple
|
||||
profiling runs or large MPI jobs.
|
||||
They are included in Omnitrace because they were introduced into Timemory
for `compile-time-perf <https://github.com/jrmadsen/compile-time-perf>`_,
which needed to create different output files for a generic wrapper around
compilation commands while still overwriting the output from the last time
a file was compiled.
|
||||
|
||||
When doing scaling studies and specifying options via the command line,
|
||||
the recommended process is to
|
||||
use a common ``OMNITRACE_OUTPUT_PATH``, disable ``OMNITRACE_TIME_OUTPUT``,
|
||||
set ``OMNITRACE_OUTPUT_PREFIX="%argt%-"``, and let Omnitrace cleanly organize the output.
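
As a minimal sketch of this recommendation, the hypothetical helper ``scaling_study_env`` below builds the environment for each run of a scaling study. Only the three setting names and the ``%argt%-`` prefix come from this guide. The output folder name and the way the runs are launched are assumptions.

```python
import os


def scaling_study_env(output_path, base_env=None):
    """Return an environment that applies the recommended settings to every run."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "OMNITRACE_OUTPUT_PATH": output_path,    # common folder for all runs
        "OMNITRACE_TIME_OUTPUT": "OFF",          # no per-run timestamped subfolder
        "OMNITRACE_OUTPUT_PREFIX": "%argt%-",    # prefix derived from the command line
    })
    return env


env = scaling_study_env("scaling-study-output")
# A launcher could then pass env to each run, for example (not executed here):
#   subprocess.run(["omnitrace-instrument", "--", "./foo", "4"], env=env)
print(env["OMNITRACE_OUTPUT_PREFIX"])
```

With a shared output path and a prefix derived from the command line, runs with different arguments write to different files in one folder, while repeating a run overwrites its earlier output.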
|
||||
|
||||
.. csv-table::
|
||||
:header: "String", "Encoding"
|
||||
:widths: 20, 120
|
||||
|
||||
"``%argv%``", "Entire command-line condensed into a single string"
"``%argt%``", "Similar to ``%argv%``, except that the first command-line argument is reduced to its basename"
|
||||
"``%args%``", "All command line arguments condensed into a single string"
|
||||
"``%tag%``", "Basename of first command line argument"
|
||||
"``%arg<N>%``", "Command line argument at position ``<N>`` (zero indexed), e.g. ``%arg0%`` for first argument"
|
||||
"``%argv_hash%``", "MD5 sum of ``%argv%``"
"``%argt_hash%``", "MD5 sum of ``%argt%``"
|
||||
"``%args_hash%``", "MD5 sum of ``%args%``"
|
||||
"``%tag_hash%``", "MD5 sum of ``%tag%``"
|
||||
"``%arg<N>_hash%``", "MD5 sum of ``%arg<N>%``"
|
||||
"``%pid%``", "Process identifier (i.e. ``getpid()``)"
|
||||
"``%ppid%``", "Parent process identifier (i.e. ``getppid()``)"
|
||||
"``%pgid%``", "Process group identifier (i.e. ``getpgid(getpid())``)"
|
||||
"``%psid%``", "Process session identifier (i.e. ``getsid(getpid())``)"
"``%psize%``", "Number of sibling processes (from reading ``/proc/<PPID>/tasks/<PPID>/children``)"
|
||||
"``%job%``", "Value of ``SLURM_JOB_ID`` environment variable if exists, else ``0``"
"``%rank%``", "Value of ``SLURM_PROCID`` environment variable if it exists, else ``MPI_Comm_rank`` (or ``0`` if non-MPI)"
"``%size%``", "``MPI_Comm_size``, or ``1`` if non-MPI"
|
||||
"``%nid%``", "``%rank%`` if possible, otherwise ``%pid%``"
|
||||
"``%launch_time%``", "Launch date and time (uses ``OMNITRACE_TIME_FORMAT``)"
|
||||
"``%env{NAME}%``", "Value of environment variable ``NAME`` (i.e. ``getenv(NAME)``)"
|
||||
"``%cfg{NAME}%``", "Value of configuration variable ``NAME`` (e.g. ``%cfg{OMNITRACE_SAMPLING_FREQ}%`` would resolve to sampling frequency)"
|
||||
"``$env{NAME}``", "Alternative syntax to ``%env{NAME}%``"
|
||||
"``$cfg{NAME}``", "Alternative syntax to ``%cfg{NAME}%``"
|
||||
"``%m``", "Shorthand for ``%argt_hash%``"
|
||||
"``%p``", "Shorthand for ``%pid%``"
|
||||
"``%j``", "Shorthand for ``%job%``"
|
||||
"``%r``", "Shorthand for ``%rank%``"
|
||||
"``%s``", "Shorthand for ``%size%``"
|
||||
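
To make the table more concrete, the following sketch expands a small subset of the keys. It is an illustration, not the Timemory implementation: the function name ``expand_prefix`` is hypothetical, and an unset environment variable expanding to an empty string and the hash being written as a hexadecimal digest are both assumptions.

```python
import hashlib
import os
import re


def expand_prefix(template, argv, environ=None):
    """Expand a few output prefix keys (illustrative subset only)."""
    environ = os.environ if environ is None else environ
    tag = os.path.basename(argv[0]) if argv else ""
    simple = {
        "%pid%": str(os.getpid()),
        "%ppid%": str(os.getppid()),
        "%tag%": tag,
        # Assumption: the MD5 sum is rendered as a hexadecimal digest.
        "%tag_hash%": hashlib.md5(tag.encode()).hexdigest(),
    }
    out = template
    for key, value in simple.items():
        out = out.replace(key, value)
    # Handle %env{NAME}% and the alternative $env{NAME} syntax.
    # Assumption: an unset variable expands to an empty string.
    return re.sub(
        r"%env\{(\w+)\}%|\$env\{(\w+)\}",
        lambda m: environ.get(m.group(1) or m.group(2), ""),
        out,
    )


print(expand_prefix("%tag%-%env{RUN}%-", ["/usr/bin/foo", "-n", "4"], {"RUN": "a1"}))
```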
|
||||
.. note::
|
||||
|
||||
In any output prefix key which contains a ``/`` character, the ``/`` characters
|
||||
are replaced with ``_`` and any leading underscores are stripped. For example,
|
||||
an ``%arg0%`` of ``/usr/bin/foo`` translates to ``usr_bin_foo``. Additionally, any ``%arg<N>%`` keys which
|
||||
do not have a command line argument at position ``<N>`` are ignored.
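
The slash handling described in this note can be sketched in one line. The function name ``sanitize_prefix_value`` is hypothetical. The behavior mirrors the note: replace each ``/`` with ``_``, then strip leading underscores.

```python
def sanitize_prefix_value(value):
    """Replace '/' with '_' and strip any leading underscores."""
    return value.replace("/", "_").lstrip("_")


print(sanitize_prefix_value("/usr/bin/foo"))
```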
|
||||
|
||||
Perfetto output
|
||||
========================================
|
||||
|
||||
Use the ``OMNITRACE_OUTPUT_FILE`` setting to specify the location of the Perfetto
output file. If this is an absolute path, then ``OMNITRACE_OUTPUT_PATH`` and similar
settings are ignored. To view the trace, visit `ui.perfetto.dev <https://ui.perfetto.dev>`_
and open this file.
|
||||
|
||||
.. image:: ../data/omnitrace-perfetto.png
|
||||
:alt: Visualization of a performance graph in Perfetto
|
||||
|
||||
.. image:: ../data/omnitrace-rocm.png
|
||||
:alt: Visualization of ROCm data in Perfetto
|
||||
|
||||
.. image:: ../data/omnitrace-rocm-flow.png
|
||||
:alt: Visualization of ROCm flow data in Perfetto
|
||||
|
||||
.. image:: ../data/omnitrace-user-api.png
|
||||
:alt: Visualization of ROCm API calls in Perfetto
|
||||
|
||||
Timemory output
|
||||
========================================
|
||||
|
||||
Use ``omnitrace-avail --components --filename`` to view the base filename for each component, as follows:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omnitrace-avail wall_clock -C -f
|
||||
|---------------------------------|---------------|------------------------|
|
||||
| COMPONENT | AVAILABLE | FILENAME |
|
||||
|---------------------------------|---------------|------------------------|
|
||||
| wall_clock | true | wall_clock |
|
||||
| sampling_wall_clock | true | sampling_wall_clock |
|
||||
|---------------------------------|---------------|------------------------|
|
||||
|
||||
The ``OMNITRACE_COLLAPSE_THREADS`` and ``OMNITRACE_COLLAPSE_PROCESSES`` settings are
|
||||
only valid when full `MPI support is enabled <../install/install.html#mpi-support-within-omnitrace>`_.
|
||||
When they are set, Timemory combines the per-thread and per-rank data (respectively) of
|
||||
identical call stacks.
|
||||
|
||||
The ``OMNITRACE_FLAT_PROFILE`` setting removes all call stack hierarchy.
|
||||
Using ``OMNITRACE_FLAT_PROFILE=ON`` in combination
|
||||
with ``OMNITRACE_COLLAPSE_THREADS=ON`` is a useful configuration for identifying
|
||||
min/max measurements regardless of the calling context.
|
||||
The ``OMNITRACE_TIMELINE_PROFILE`` setting (with ``OMNITRACE_FLAT_PROFILE=OFF``)
generates data similar to that found in Perfetto. Enabling both timeline and flat
profiling generates data similar to ``strace``. However, while Timemory generally
requires significantly less memory than Perfetto, this is not the case in timeline
mode, so use this setting with caution.
|
||||
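
The effect of combining a flat profile with collapsed threads can be pictured as merging all entries that share a label, regardless of call stack or thread. The sketch below is a conceptual illustration only and does not reflect how Timemory stores or merges data. The ``Row`` type and ``flatten`` function are hypothetical, and the values are copied from the ``ompt_work_loop`` rows for threads 0 and 11 in the text output example in this document.

```python
from dataclasses import dataclass


@dataclass
class Row:
    thread: int
    label: str
    count: int
    total: float   # SUM column, in seconds
    lo: float      # MIN column
    hi: float      # MAX column


ROWS = [
    Row(0, "ompt_work_loop", 156, 0.000812, 0.000001, 0.000212),
    Row(0, "ompt_work_loop", 7904, 7.420265, 0.000005, 0.006974),
    Row(11, "ompt_work_loop", 156, 0.000852, 0.000002, 0.000230),
    Row(11, "ompt_work_loop", 7904, 8.253555, 0.000005, 0.008021),
]


def flatten(rows, collapse_threads=False):
    """Merge rows that share a label, ignoring call-stack depth.

    With collapse_threads=True, rows from different threads are merged as well.
    """
    merged = {}
    for r in rows:
        key = r.label if collapse_threads else (r.thread, r.label)
        m = merged.setdefault(
            key, {"count": 0, "total": 0.0, "min": r.lo, "max": r.hi}
        )
        m["count"] += r.count
        m["total"] += r.total
        m["min"] = min(m["min"], r.lo)
        m["max"] = max(m["max"], r.hi)
    return merged


print(flatten(ROWS, collapse_threads=True)["ompt_work_loop"])
```

After merging, a single entry per label carries the overall minimum and maximum, which is what makes this combination convenient for spotting extreme measurements independent of calling context.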
|
||||
Timemory text output
|
||||
-----------------------------------------------------------------------
|
||||
|
||||
Timemory text output files are meant for human consumption (while JSON formats are for analysis),
|
||||
so some fields such as the ``LABEL`` might be truncated for readability.
|
||||
The truncation width can be changed through the ``OMNITRACE_MAX_WIDTH`` setting.
|
||||
|
||||
.. note::
|
||||
|
||||
The generation of text output is configurable via ``OMNITRACE_TEXT_OUTPUT``.
|
||||
|
||||
.. _text-output-example-label:
|
||||
|
||||
Timemory text output example
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
In the following example, the ``NN`` field in ``|NN>>>`` is the thread ID. If MPI support is enabled,
|
||||
this becomes ``|MM|NN>>>`` where ``MM`` is the rank.
|
||||
If ``OMNITRACE_COLLAPSE_THREADS=ON`` and ``OMNITRACE_COLLAPSE_PROCESSES=ON`` are configured,
|
||||
neither the ``MM`` nor the ``NN`` are present unless the
|
||||
component explicitly sets type traits. Type traits specify that the data is only
|
||||
relevant per-thread or per-process, such as the ``thread_cpu_clock`` clock component.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| REAL-CLOCK TIMER (I.E. WALL-CLOCK TIMER) |
|
||||
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| LABEL | COUNT | DEPTH | METRIC | UNITS | SUM | MEAN | MIN | MAX | VAR | STDDEV | % SELF |
|
||||
|--------------------------------------------------------------|--------|--------|------------|--------|-----------|-----------|-----------|-----------|----------|----------|--------|
|
||||
| |00>>> main | 1 | 0 | wall_clock | sec | 13.360265 | 13.360265 | 13.360265 | 13.360265 | 0.000000 | 0.000000 | 18.2 |
|
||||
| |00>>> |_ompt_thread_initial | 1 | 1 | wall_clock | sec | 10.924161 | 10.924161 | 10.924161 | 10.924161 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |00>>> |_ompt_implicit_task | 1 | 2 | wall_clock | sec | 10.923050 | 10.923050 | 10.923050 | 10.923050 | 0.000000 | 0.000000 | 0.1 |
|
||||
| |00>>> |_ompt_parallel [parallelism=12] | 1 | 3 | wall_clock | sec | 10.915026 | 10.915026 | 10.915026 | 10.915026 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |00>>> |_ompt_implicit_task | 1 | 4 | wall_clock | sec | 10.647951 | 10.647951 | 10.647951 | 10.647951 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |00>>> |_ompt_work_loop | 156 | 5 | wall_clock | sec | 0.000812 | 0.000005 | 0.000001 | 0.000212 | 0.000000 | 0.000018 | 100.0 |
|
||||
| |00>>> |_ompt_work_single_executor | 40 | 5 | wall_clock | sec | 0.000016 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |00>>> |_ompt_sync_region_barrier_implicit | 308 | 5 | wall_clock | sec | 0.000629 | 0.000002 | 0.000001 | 0.000017 | 0.000000 | 0.000002 | 100.0 |
|
||||
| |00>>> |_conj_grad | 76 | 5 | wall_clock | sec | 10.641165 | 0.140015 | 0.131894 | 0.155099 | 0.000017 | 0.004080 | 1.0 |
|
||||
| |00>>> |_ompt_work_single_executor | 803 | 6 | wall_clock | sec | 0.000292 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |00>>> |_ompt_work_loop | 7904 | 6 | wall_clock | sec | 7.420265 | 0.000939 | 0.000005 | 0.006974 | 0.000003 | 0.001613 | 100.0 |
|
||||
| |00>>> |_ompt_sync_region_barrier_implicit | 6004 | 6 | wall_clock | sec | 0.283160 | 0.000047 | 0.000001 | 0.004087 | 0.000000 | 0.000303 | 100.0 |
|
||||
| |00>>> |_ompt_sync_region_barrier_implementation | 3952 | 6 | wall_clock | sec | 2.829252 | 0.000716 | 0.000007 | 0.009005 | 0.000001 | 0.000985 | 99.7 |
|
||||
| |00>>> |_ompt_sync_region_reduction | 15808 | 7 | wall_clock | sec | 0.009142 | 0.000001 | 0.000000 | 0.000007 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |00>>> |_ompt_work_single_other | 1249 | 6 | wall_clock | sec | 0.000270 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |00>>> |_ompt_work_single_other | 114 | 5 | wall_clock | sec | 0.000024 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |00>>> |_ompt_sync_region_barrier_implementation | 76 | 5 | wall_clock | sec | 0.000876 | 0.000012 | 0.000008 | 0.000025 | 0.000000 | 0.000003 | 84.4 |
|
||||
| |00>>> |_ompt_sync_region_reduction | 304 | 6 | wall_clock | sec | 0.000136 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |00>>> |_ompt_master | 226 | 5 | wall_clock | sec | 0.001978 | 0.000009 | 0.000000 | 0.000038 | 0.000000 | 0.000012 | 100.0 |
|
||||
| |11>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.656145 | 10.656145 | 10.656145 | 10.656145 | 0.000000 | 0.000000 | 0.1 |
|
||||
| |11>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649183 | 10.649183 | 10.649183 | 10.649183 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |11>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000852 | 0.000005 | 0.000002 | 0.000230 | 0.000000 | 0.000019 | 100.0 |
|
||||
| |11>>> |_ompt_work_single_other | 149 | 6 | wall_clock | sec | 0.000035 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |11>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004135 | 0.000013 | 0.000001 | 0.001233 | 0.000000 | 0.000070 | 100.0 |
|
||||
| |11>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641302 | 0.140017 | 0.131896 | 0.155102 | 0.000017 | 0.004080 | 0.6 |
|
||||
| |11>>> |_ompt_work_single_other | 2023 | 7 | wall_clock | sec | 0.000458 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |11>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 8.253555 | 0.001044 | 0.000005 | 0.008021 | 0.000003 | 0.001790 | 100.0 |
|
||||
| |11>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.263840 | 0.000044 | 0.000001 | 0.004087 | 0.000000 | 0.000297 | 100.0 |
|
||||
| |11>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.059823 | 0.000521 | 0.000007 | 0.009508 | 0.000001 | 0.000863 | 100.0 |
|
||||
| |11>>> |_ompt_work_single_executor | 29 | 7 | wall_clock | sec | 0.000011 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |11>>> |_ompt_work_single_executor | 5 | 6 | wall_clock | sec | 0.000002 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |11>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000975 | 0.000013 | 0.000008 | 0.000024 | 0.000000 | 0.000003 | 100.0 |
|
||||
| |10>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.681664 | 10.681664 | 10.681664 | 10.681664 | 0.000000 | 0.000000 | 0.3 |
|
||||
| |10>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649158 | 10.649158 | 10.649158 | 10.649158 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |10>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000863 | 0.000006 | 0.000002 | 0.000231 | 0.000000 | 0.000019 | 100.0 |
|
||||
| |10>>> |_ompt_work_single_other | 140 | 6 | wall_clock | sec | 0.000037 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |10>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004149 | 0.000013 | 0.000001 | 0.001221 | 0.000000 | 0.000070 | 100.0 |
|
||||
| |10>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641288 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 0.7 |
|
||||
| |10>>> |_ompt_work_single_other | 1883 | 7 | wall_clock | sec | 0.000487 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |10>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 8.174545 | 0.001034 | 0.000005 | 0.006899 | 0.000003 | 0.001766 | 100.0 |
|
||||
| |10>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.268808 | 0.000045 | 0.000001 | 0.004087 | 0.000000 | 0.000299 | 100.0 |
|
||||
| |10>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.126988 | 0.000538 | 0.000007 | 0.009843 | 0.000001 | 0.000872 | 99.9 |
|
||||
| |10>>> |_ompt_sync_region_reduction | 3952 | 8 | wall_clock | sec | 0.002574 | 0.000001 | 0.000000 | 0.000014 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |10>>> |_ompt_work_single_executor | 169 | 7 | wall_clock | sec | 0.000072 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |10>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000954 | 0.000013 | 0.000009 | 0.000023 | 0.000000 | 0.000003 | 95.9 |
|
||||
| |10>>> |_ompt_sync_region_reduction | 76 | 7 | wall_clock | sec | 0.000039 | 0.000001 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |10>>> |_ompt_work_single_executor | 14 | 6 | wall_clock | sec | 0.000006 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |09>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.686552 | 10.686552 | 10.686552 | 10.686552 | 0.000000 | 0.000000 | 0.3 |
|
||||
| |09>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649151 | 10.649151 | 10.649151 | 10.649151 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |09>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000880 | 0.000006 | 0.000002 | 0.000258 | 0.000000 | 0.000021 | 100.0 |
|
||||
| |09>>> |_ompt_work_single_other | 148 | 6 | wall_clock | sec | 0.000034 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |09>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004129 | 0.000013 | 0.000001 | 0.001210 | 0.000000 | 0.000069 | 100.0 |
|
||||
| |09>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641308 | 0.140017 | 0.131895 | 0.155102 | 0.000017 | 0.004080 | 0.7 |
|
||||
| |09>>> |_ompt_work_single_other | 2043 | 7 | wall_clock | sec | 0.000473 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |09>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 7.977001 | 0.001009 | 0.000005 | 0.007325 | 0.000003 | 0.001732 | 100.0 |
|
||||
| |09>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.242996 | 0.000040 | 0.000001 | 0.004087 | 0.000000 | 0.000284 | 100.0 |
|
||||
| |09>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.350895 | 0.000595 | 0.000007 | 0.008689 | 0.000001 | 0.000926 | 100.0 |
|
||||
| |09>>> |_ompt_work_single_executor | 9 | 7 | wall_clock | sec | 0.000004 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |09>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000973 | 0.000013 | 0.000008 | 0.000025 | 0.000000 | 0.000003 | 100.0 |
|
||||
| |09>>> |_ompt_work_single_executor | 6 | 6 | wall_clock | sec | 0.000002 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |08>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.721622 | 10.721622 | 10.721622 | 10.721622 | 0.000000 | 0.000000 | 0.7 |
|
||||
| |08>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649135 | 10.649135 | 10.649135 | 10.649135 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |08>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000839 | 0.000005 | 0.000001 | 0.000231 | 0.000000 | 0.000019 | 100.0 |
|
||||
| |08>>> |_ompt_work_single_other | 141 | 6 | wall_clock | sec | 0.000030 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |08>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004114 | 0.000013 | 0.000001 | 0.001198 | 0.000000 | 0.000069 | 100.0 |
|
||||
| |08>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641294 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 0.6 |
|
||||
| |08>>> |_ompt_work_single_other | 1742 | 7 | wall_clock | sec | 0.000392 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |08>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 8.306388 | 0.001051 | 0.000005 | 0.007886 | 0.000003 | 0.001795 | 100.0 |
|
||||
| |08>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.274358 | 0.000046 | 0.000001 | 0.004090 | 0.000000 | 0.000302 | 100.0 |
|
||||
| |08>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 1.991251 | 0.000504 | 0.000007 | 0.008694 | 0.000001 | 0.000844 | 99.8 |
|
||||
| |08>>> |_ompt_sync_region_reduction | 7904 | 8 | wall_clock | sec | 0.003816 | 0.000000 | 0.000000 | 0.000017 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |08>>> |_ompt_work_single_executor | 310 | 7 | wall_clock | sec | 0.000112 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |08>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000955 | 0.000013 | 0.000009 | 0.000026 | 0.000000 | 0.000003 | 93.7 |
|
||||
| |08>>> |_ompt_sync_region_reduction | 152 | 7 | wall_clock | sec | 0.000060 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |08>>> |_ompt_work_single_executor | 13 | 6 | wall_clock | sec | 0.000005 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |07>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.747282 | 10.747282 | 10.747282 | 10.747282 | 0.000000 | 0.000000 | 0.9 |
|
||||
| |07>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649093 | 10.649093 | 10.649093 | 10.649093 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |07>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000923 | 0.000006 | 0.000002 | 0.000231 | 0.000000 | 0.000019 | 100.0 |
|
||||
| |07>>> |_ompt_work_single_other | 152 | 6 | wall_clock | sec | 0.000048 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |07>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.003981 | 0.000013 | 0.000001 | 0.001186 | 0.000000 | 0.000068 | 100.0 |
|
||||
| |07>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641295 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 0.7 |
|
||||
| |07>>> |_ompt_work_single_other | 2043 | 7 | wall_clock | sec | 0.000648 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |07>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 7.978811 | 0.001009 | 0.000005 | 0.006728 | 0.000003 | 0.001732 | 100.0 |
|
||||
| |07>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.199939 | 0.000033 | 0.000001 | 0.004086 | 0.000000 | 0.000255 | 100.0 |
|
||||
| |07>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.385843 | 0.000604 | 0.000009 | 0.009039 | 0.000001 | 0.000938 | 100.0 |
|
||||
| |07>>> |_ompt_work_single_executor | 9 | 7 | wall_clock | sec | 0.000004 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |07>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000905 | 0.000012 | 0.000010 | 0.000025 | 0.000000 | 0.000003 | 100.0 |
|
||||
| |07>>> |_ompt_work_single_executor | 2 | 6 | wall_clock | sec | 0.000001 | 0.000001 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |06>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.772278 | 10.772278 | 10.772278 | 10.772278 | 0.000000 | 0.000000 | 1.1 |
|
||||
| |06>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649092 | 10.649092 | 10.649092 | 10.649092 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |06>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000888 | 0.000006 | 0.000002 | 0.000236 | 0.000000 | 0.000020 | 100.0 |
|
||||
| |06>>> |_ompt_work_single_other | 153 | 6 | wall_clock | sec | 0.000037 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |06>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004090 | 0.000013 | 0.000001 | 0.001175 | 0.000000 | 0.000067 | 100.0 |
|
||||
| |06>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641317 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 0.8 |
|
||||
| |06>>> |_ompt_work_single_other | 2041 | 7 | wall_clock | sec | 0.000476 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |06>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 7.467961 | 0.000945 | 0.000005 | 0.010712 | 0.000003 | 0.001627 | 100.0 |
|
||||
| |06>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.250883 | 0.000042 | 0.000001 | 0.004087 | 0.000000 | 0.000285 | 100.0 |
|
||||
| |06>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.838733 | 0.000718 | 0.000009 | 0.009015 | 0.000001 | 0.001015 | 99.9 |
|
||||
| |06>>> |_ompt_sync_region_reduction | 3952 | 8 | wall_clock | sec | 0.003334 | 0.000001 | 0.000000 | 0.000025 | 0.000000 | 0.000001 | 100.0 |
|
||||
| |06>>> |_ompt_work_single_executor | 11 | 7 | wall_clock | sec | 0.000005 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |06>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000940 | 0.000012 | 0.000009 | 0.000025 | 0.000000 | 0.000003 | 95.4 |
|
||||
| |06>>> |_ompt_sync_region_reduction | 76 | 7 | wall_clock | sec | 0.000044 | 0.000001 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |06>>> |_ompt_work_single_executor | 1 | 6 | wall_clock | sec | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |05>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.797950 | 10.797950 | 10.797950 | 10.797950 | 0.000000 | 0.000000 | 1.4 |
|
||||
| |05>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649072 | 10.649072 | 10.649072 | 10.649072 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |05>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000879 | 0.000006 | 0.000001 | 0.000248 | 0.000000 | 0.000021 | 100.0 |
|
||||
| |05>>> |_ompt_work_single_other | 142 | 6 | wall_clock | sec | 0.000034 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |05>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004062 | 0.000013 | 0.000002 | 0.001163 | 0.000000 | 0.000067 | 100.0 |
|
||||
| |05>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641291 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 0.7 |
|
||||
| |05>>> |_ompt_work_single_other | 2038 | 7 | wall_clock | sec | 0.000500 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |05>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 8.279191 | 0.001047 | 0.000005 | 0.006596 | 0.000003 | 0.001792 | 100.0 |
|
||||
| |05>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.250939 | 0.000042 | 0.000001 | 0.004090 | 0.000000 | 0.000286 | 100.0 |
|
||||
| |05>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.039013 | 0.000516 | 0.000009 | 0.008689 | 0.000001 | 0.000855 | 100.0 |
|
||||
| |05>>> |_ompt_work_single_executor | 14 | 7 | wall_clock | sec | 0.000005 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |05>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000926 | 0.000012 | 0.000009 | 0.000023 | 0.000000 | 0.000003 | 100.0 |
|
||||
| |05>>> |_ompt_work_single_executor | 12 | 6 | wall_clock | sec | 0.000005 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |04>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.825935 | 10.825935 | 10.825935 | 10.825935 | 0.000000 | 0.000000 | 1.6 |
|
||||
| |04>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649068 | 10.649068 | 10.649068 | 10.649068 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |04>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000884 | 0.000006 | 0.000002 | 0.000245 | 0.000000 | 0.000020 | 100.0 |
|
||||
| |04>>> |_ompt_work_single_other | 150 | 6 | wall_clock | sec | 0.000034 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |04>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004069 | 0.000013 | 0.000001 | 0.001151 | 0.000000 | 0.000066 | 100.0 |
|
||||
| |04>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641300 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 1.1 |
|
||||
| |04>>> |_ompt_work_single_other | 2041 | 7 | wall_clock | sec | 0.000448 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |04>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 7.438393 | 0.000941 | 0.000005 | 0.007090 | 0.000003 | 0.001624 | 100.0 |
|
||||
| |04>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.270654 | 0.000045 | 0.000001 | 0.004090 | 0.000000 | 0.000295 | 100.0 |
|
||||
| |04>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.819165 | 0.000713 | 0.000009 | 0.008379 | 0.000001 | 0.001013 | 99.9 |
|
||||
| |04>>> |_ompt_sync_region_reduction | 7904 | 8 | wall_clock | sec | 0.003932 | 0.000000 | 0.000000 | 0.000015 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |04>>> |_ompt_work_single_executor | 11 | 7 | wall_clock | sec | 0.000005 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |04>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000936 | 0.000012 | 0.000009 | 0.000025 | 0.000000 | 0.000003 | 93.2 |
|
||||
| |04>>> |_ompt_sync_region_reduction | 152 | 7 | wall_clock | sec | 0.000064 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |04>>> |_ompt_work_single_executor | 4 | 6 | wall_clock | sec | 0.000001 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |03>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.849322 | 10.849322 | 10.849322 | 10.849322 | 0.000000 | 0.000000 | 1.8 |
|
||||
| |03>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649075 | 10.649075 | 10.649075 | 10.649075 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |03>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000861 | 0.000006 | 0.000002 | 0.000238 | 0.000000 | 0.000020 | 100.0 |
|
||||
| |03>>> |_ompt_work_single_other | 120 | 6 | wall_clock | sec | 0.000028 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |03>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.003993 | 0.000013 | 0.000001 | 0.001138 | 0.000000 | 0.000065 | 100.0 |
|
||||
| |03>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641302 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 0.8 |
|
||||
| |03>>> |_ompt_work_single_other | 1756 | 7 | wall_clock | sec | 0.000426 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |03>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 8.005617 | 0.001013 | 0.000005 | 0.011500 | 0.000003 | 0.001741 | 100.0 |
|
||||
| |03>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.231485 | 0.000039 | 0.000001 | 0.004086 | 0.000000 | 0.000277 | 100.0 |
|
||||
| |03>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.320428 | 0.000587 | 0.000009 | 0.010868 | 0.000001 | 0.000912 | 100.0 |
|
||||
| |03>>> |_ompt_work_single_executor | 296 | 7 | wall_clock | sec | 0.000120 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |03>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000967 | 0.000013 | 0.000010 | 0.000023 | 0.000000 | 0.000003 | 100.0 |
|
||||
| |03>>> |_ompt_work_single_executor | 34 | 6 | wall_clock | sec | 0.000013 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |02>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.876387 | 10.876387 | 10.876387 | 10.876387 | 0.000000 | 0.000000 | 2.1 |
|
||||
| |02>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649050 | 10.649050 | 10.649050 | 10.649050 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |02>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000924 | 0.000006 | 0.000001 | 0.000241 | 0.000000 | 0.000020 | 100.0 |
|
||||
| |02>>> |_ompt_work_single_other | 139 | 6 | wall_clock | sec | 0.000040 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |02>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.003972 | 0.000013 | 0.000001 | 0.001127 | 0.000000 | 0.000064 | 100.0 |
|
||||
| |02>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641287 | 0.140017 | 0.131895 | 0.155101 | 0.000017 | 0.004080 | 0.7 |
|
||||
| |02>>> |_ompt_work_single_other | 1902 | 7 | wall_clock | sec | 0.000553 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |02>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 7.906688 | 0.001000 | 0.000005 | 0.007068 | 0.000003 | 0.001713 | 100.0 |
|
||||
| |02>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.261367 | 0.000044 | 0.000001 | 0.004088 | 0.000000 | 0.000295 | 100.0 |
|
||||
| |02>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.402362 | 0.000608 | 0.000009 | 0.010399 | 0.000001 | 0.000944 | 99.9 |
|
||||
| |02>>> |_ompt_sync_region_reduction | 3952 | 8 | wall_clock | sec | 0.002937 | 0.000001 | 0.000000 | 0.000021 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |02>>> |_ompt_work_single_executor | 150 | 7 | wall_clock | sec | 0.000073 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |02>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000895 | 0.000012 | 0.000009 | 0.000026 | 0.000000 | 0.000003 | 95.2 |
|
||||
| |02>>> |_ompt_sync_region_reduction | 76 | 7 | wall_clock | sec | 0.000043 | 0.000001 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |02>>> |_ompt_work_single_executor | 15 | 6 | wall_clock | sec | 0.000007 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |01>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.901650 | 10.901650 | 10.901650 | 10.901650 | 0.000000 | 0.000000 | 2.3 |
|
||||
| |01>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649017 | 10.649017 | 10.649017 | 10.649017 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |01>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000863 | 0.000006 | 0.000001 | 0.000231 | 0.000000 | 0.000019 | 100.0 |
|
||||
| |01>>> |_ompt_work_single_other | 146 | 6 | wall_clock | sec | 0.000033 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |01>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004012 | 0.000013 | 0.000001 | 0.001115 | 0.000000 | 0.000064 | 100.0 |
|
||||
| |01>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641316 | 0.140017 | 0.131895 | 0.155101 | 0.000017 | 0.004080 | 0.8 |
|
||||
| |01>>> |_ompt_work_single_other | 1811 | 7 | wall_clock | sec | 0.000403 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |01>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 7.410337 | 0.000938 | 0.000005 | 0.010556 | 0.000003 | 0.001610 | 100.0 |
|
||||
| |01>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.202494 | 0.000034 | 0.000001 | 0.003521 | 0.000000 | 0.000256 | 100.0 |
|
||||
| |01>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.943604 | 0.000745 | 0.000008 | 0.009033 | 0.000001 | 0.001024 | 100.0 |
|
||||
| |01>>> |_ompt_work_single_executor | 241 | 7 | wall_clock | sec | 0.000093 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |01>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000917 | 0.000012 | 0.000009 | 0.000026 | 0.000000 | 0.000003 | 100.0 |
|
||||
| |01>>> |_ompt_work_single_executor | 8 | 6 | wall_clock | sec | 0.000004 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |00>>> |_c_print_results | 1 | 2 | wall_clock | sec | 0.000049 | 0.000049 | 0.000049 | 0.000049 | 0.000000 | 0.000000 | 100.0 |
|
||||
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
|
||||
Timemory JSON output
-------------------------------------------------------------------------

Timemory represents the data within the JSON output in two forms:
a flat structure and a hierarchical structure.

The flat JSON data is similar to the text files: the hierarchy is conveyed
by the indentation of the ``prefix`` field and by the ``depth`` field.
All the data entries for the flat structure are in a single JSON array, so it is easier to
write a simple Python script for post-processing with this format than with the hierarchical structure.

The hierarchical JSON contains additional information about
inclusive and exclusive values. However,
its structure must be processed using recursion. This section of the JSON output supports analysis
by `hatchet <https://github.com/hatchet/hatchet>`_.
|
||||
|
||||
.. note::

   The generation of flat JSON output is configurable via ``OMNITRACE_JSON_OUTPUT``.
   The generation of hierarchical JSON data is configurable via ``OMNITRACE_TREE_OUTPUT``.
|
||||
|
||||
Timemory JSON output sample
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the following JSON data, the flat data starts at ``["timemory"]["wall_clock"]["ranks"]``
and the hierarchical data starts at ``["timemory"]["wall_clock"]["graph"]``.
To access the name (or prefix) of the Nth entry in the flat data layout for the first rank, use
``["timemory"]["wall_clock"]["ranks"][0]["graph"][<N>]["prefix"]``. When full MPI
support is enabled, the per-rank data in flat layout is represented
as an entry in the ``ranks`` array. In the hierarchical data structure,
the per-rank data is represented as an entry in the ``mpi`` array. However, ``graph``
is used in lieu of ``mpi`` when full MPI support is enabled.
In the hierarchical layout, all data for the process is a child of a dummy
root node, which has the name ``unknown-hash=0``.
|
||||
|
||||
.. code-block:: json
|
||||
|
||||
{
|
||||
"timemory": {
|
||||
"wall_clock": {
|
||||
"properties": {
|
||||
"cereal_class_version": 0,
|
||||
"value": 78,
|
||||
"enum": "WALL_CLOCK",
|
||||
"id": "wall_clock",
|
||||
"ids": [
|
||||
"real_clock",
|
||||
"virtual_clock",
|
||||
"wall_clock"
|
||||
]
|
||||
},
|
||||
"type": "wall_clock",
|
||||
"description": "Real-clock timer (i.e. wall-clock timer)",
|
||||
"unit_value": 1000000000,
|
||||
"unit_repr": "sec",
|
||||
"thread_scope_only": false,
|
||||
"thread_count": 2,
|
||||
"mpi_size": 1,
|
||||
"upcxx_size": 1,
|
||||
"process_count": 1,
|
||||
"num_ranks": 1,
|
||||
"concurrency": 2,
|
||||
"ranks": [
|
||||
{
|
||||
"rank": 0,
|
||||
"graph_size": 112,
|
||||
"graph": [
|
||||
{
|
||||
"hash": 17481650134347108265,
|
||||
"prefix": "|0>>> main",
|
||||
"depth": 0,
|
||||
"entry": {
|
||||
"cereal_class_version": 0,
|
||||
"laps": 1,
|
||||
"value": 894743517,
|
||||
"accum": 894743517,
|
||||
"repr_data": 0.894743517,
|
||||
"repr_display": 0.894743517
|
||||
},
|
||||
"stats": {
|
||||
"cereal_class_version": 0,
|
||||
"sum": 0.894743517,
|
||||
"count": 1,
|
||||
"min": 0.894743517,
|
||||
"max": 0.894743517,
|
||||
"sqr": 0.8005659612135293,
|
||||
"mean": 0.894743517,
|
||||
"stddev": 0.0
|
||||
},
|
||||
"rolling_hash": 17481650134347108265
|
||||
},
|
||||
{
|
||||
"hash": 3455444288293231339,
|
||||
"prefix": "|0>>> |_read_input",
|
||||
"depth": 1,
|
||||
"entry": {
|
||||
"laps": 1,
|
||||
"value": 9808,
|
||||
"accum": 9808,
|
||||
"repr_data": 9.808e-06,
|
||||
"repr_display": 9.808e-06
|
||||
},
|
||||
"stats": {
|
||||
"sum": 9.808e-06,
|
||||
"count": 1,
|
||||
"min": 9.808e-06,
|
||||
"max": 9.808e-06,
|
||||
"sqr": 9.6196864e-11,
|
||||
"mean": 9.808e-06,
|
||||
"stddev": 0.0
|
||||
},
|
||||
"rolling_hash": 2490350348930787988
|
||||
},
|
||||
{
|
||||
"hash": 8456966793631718807,
|
||||
"prefix": "|0>>> |_setcoeff",
|
||||
"depth": 1,
|
||||
"entry": {
|
||||
"laps": 1,
|
||||
"value": 922,
|
||||
"accum": 922,
|
||||
"repr_data": 9.22e-07,
|
||||
"repr_display": 9.22e-07
|
||||
},
|
||||
"stats": {
|
||||
"sum": 9.22e-07,
|
||||
"count": 1,
|
||||
"min": 9.22e-07,
|
||||
"max": 9.22e-07,
|
||||
"sqr": 8.50084e-13,
|
||||
"mean": 9.22e-07,
|
||||
"stddev": 0.0
|
||||
},
|
||||
"rolling_hash": 7491872854269275456
|
||||
},
|
||||
{
|
||||
"hash": 6107876127803219007,
|
||||
"prefix": "|0>>> |_ompt_thread_initial",
|
||||
"depth": 1,
|
||||
"entry": {
|
||||
"laps": 1,
|
||||
"value": 896506392,
|
||||
"accum": 896506392,
|
||||
"repr_data": 0.896506392,
|
||||
"repr_display": 0.896506392
|
||||
},
|
||||
"stats": {
|
||||
"sum": 0.896506392,
|
||||
"count": 1,
|
||||
"min": 0.896506392,
|
||||
"max": 0.896506392,
|
||||
"sqr": 0.8037237108968578,
|
||||
"mean": 0.896506392,
|
||||
"stddev": 0.0
|
||||
},
|
||||
"rolling_hash": 5142782188440775656
|
||||
},
|
||||
{
|
||||
"hash": 15402802091993617561,
|
||||
"prefix": "|0>>> |_ompt_implicit_task",
|
||||
"depth": 2,
|
||||
"entry": {
|
||||
"laps": 1,
|
||||
"value": 896479111,
|
||||
"accum": 896479111,
|
||||
"repr_data": 0.896479111,
|
||||
"repr_display": 0.896479111
|
||||
},
|
||||
"stats": {
|
||||
"sum": 0.896479111,
|
||||
"count": 1,
|
||||
"min": 0.896479111,
|
||||
"max": 0.896479111,
|
||||
"sqr": 0.8036747964593504,
|
||||
"mean": 0.896479111,
|
||||
"stddev": 0.0
|
||||
},
"rolling_hash": 2098840206724841601
},
|
||||
{
|
||||
"..." : "... etc. ..."
|
||||
}
|
||||
]
|
||||
}
|
||||
],
|
||||
"graph": [
|
||||
[
|
||||
{
|
||||
"cereal_class_version": 0,
|
||||
"node": {
|
||||
"hash": 0,
|
||||
"prefix": "unknown-hash=0",
|
||||
"tid": [
|
||||
0
|
||||
],
|
||||
"pid": [
|
||||
2539175
|
||||
],
|
||||
"depth": 0,
|
||||
"is_dummy": false,
|
||||
"inclusive": {
|
||||
"entry": {
|
||||
"laps": 0,
|
||||
"value": 0,
|
||||
"accum": 0,
|
||||
"repr_data": 0.0,
|
||||
"repr_display": 0.0
|
||||
},
|
||||
"stats": {
|
||||
"sum": 0.0,
|
||||
"count": 0,
|
||||
"min": 0.0,
|
||||
"max": 0.0,
|
||||
"sqr": 0.0,
|
||||
"mean": 0.0,
|
||||
"stddev": 0.0
|
||||
}
|
||||
},
|
||||
"exclusive": {
|
||||
"entry": {
|
||||
"laps": 0,
|
||||
"value": -894743517,
|
||||
"accum": -894743517,
|
||||
"repr_data": -0.894743517,
|
||||
"repr_display": -0.894743517
|
||||
},
|
||||
"stats": {
|
||||
"sum": 0.0,
|
||||
"count": 0,
|
||||
"min": 0.0,
|
||||
"max": 0.0,
|
||||
"sqr": 0.0,
|
||||
"mean": 0.0,
|
||||
"stddev": 0.0
|
||||
}
|
||||
}
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"node": {
|
||||
"hash": 17481650134347108265,
|
||||
"prefix": "main",
|
||||
"tid": [
|
||||
0
|
||||
],
|
||||
"pid": [
|
||||
2539175
|
||||
],
|
||||
"depth": 1,
|
||||
"is_dummy": false,
|
||||
"inclusive": {
|
||||
"entry": {
|
||||
"laps": 1,
|
||||
"value": 894743517,
|
||||
"accum": 894743517,
|
||||
"repr_data": 0.894743517,
|
||||
"repr_display": 0.894743517
|
||||
},
|
||||
"stats": {
|
||||
"sum": 0.894743517,
|
||||
"count": 1,
|
||||
"min": 0.894743517,
|
||||
"max": 0.894743517,
|
||||
"sqr": 0.8005659612135293,
|
||||
"mean": 0.894743517,
|
||||
"stddev": 0.0
|
||||
}
|
||||
},
|
||||
"exclusive": {
|
||||
"entry": {
|
||||
"laps": 1,
|
||||
"value": -1773605,
|
||||
"accum": -1773605,
|
||||
"repr_data": -0.001773605,
|
||||
"repr_display": -0.001773605
|
||||
},
|
||||
"stats": {
|
||||
"sum": -0.001773605,
|
||||
"count": 1,
|
||||
"min": 9.22e-07,
|
||||
"max": 0.896506392,
|
||||
"sqr": -0.0031577497803754,
|
||||
"mean": -0.001773605,
|
||||
"stddev": 0.0
|
||||
}
|
||||
}
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"..." : "... etc. ..."
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
]
|
||||
}
|
||||
}
|
||||
}
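
The hierarchical layout shown above lends itself to a recursive traversal. The following is a
minimal sketch and not part of Omnitrace or Timemory: the ``walk`` and ``flatten`` helpers and the
inline ``sample`` data are hypothetical, and the code relies only on the ``node`` and ``children``
keys that appear in the sample. It assumes the outer ``graph`` list holds one list of root nodes per rank.

.. code-block:: python

   import json

   # Hypothetical, heavily trimmed stand-in for the "graph" section of a
   # wall_clock JSON file; only keys visible in the sample above are used.
   sample = json.loads("""
   {"timemory": {"wall_clock": {"graph": [[
     {"node": {"prefix": "unknown-hash=0", "depth": 0,
               "inclusive": {"entry": {"repr_data": 0.0}}},
      "children": [
        {"node": {"prefix": "main", "depth": 1,
                  "inclusive": {"entry": {"repr_data": 0.894743517}}},
         "children": []}
      ]}
   ]]}}}
   """)


   def walk(tree):
       """Recursively yield (depth, prefix, inclusive value) for a node and its children."""
       node = tree["node"]
       yield node["depth"], node["prefix"], node["inclusive"]["entry"]["repr_data"]
       for child in tree.get("children", []):
           yield from walk(child)


   def flatten(metric_data):
       """Collect rows from every tree stored under the metric's "graph" key."""
       rows = []
       for trees in metric_data["graph"]:  # assumed: one list of root nodes per rank
           for tree in trees:
               rows.extend(walk(tree))
       return rows


   rows = flatten(sample["timemory"]["wall_clock"])
   for depth, prefix, value in rows:
       print(f"{'  ' * depth}{prefix}: {value:.6f}")

The dummy root node ``unknown-hash=0`` appears as the first row at depth 0, so a script that only
cares about real functions can skip rows whose depth is 0.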
|
||||
|
||||
Timemory JSON output Python post-processing example
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
.. code-block:: python

   #!/usr/bin/env python3

   import sys
   import json


   def read_json(inp):
       with open(inp, "r") as f:
           return json.load(f)


   def find_max(data):
       """Find the entry with the largest mean among functions called multiple times"""
       max_entry = None
       for itr in data:
           # skip functions that were only called once
           if itr["entry"]["laps"] == 1:
               continue
           if max_entry is None:
               max_entry = itr
           else:
               if itr["stats"]["mean"] > max_entry["stats"]["mean"]:
                   max_entry = itr
       return max_entry


   def strip_name(name):
       """Return everything after |_ if it exists"""
       # str.find returns -1 when "|_" is absent (str.index would raise ValueError)
       idx = name.find("|_")
       return name if idx < 0 else name[(idx + 2) :]


   if __name__ == "__main__":

       input_data = [[x, read_json(x)] for x in sys.argv[1:]]

       for file, data in input_data:
           for metric, metric_data in data["timemory"].items():

               print(f"[{file}] Found metric: {metric}")

               for n, itr in enumerate(metric_data["ranks"]):

                   max_entry = find_max(itr["graph"])
                   if max_entry is None:
                       # no function on this rank was called more than once
                       continue
                   print(
                       "[{}] Maximum value: '{}' at depth {} was called {}x :: {:.3f} {} (mean = {:.3e} {})".format(
                           file,
                           strip_name(max_entry["prefix"]),
                           max_entry["depth"],
                           max_entry["entry"]["laps"],
                           max_entry["entry"]["repr_data"],
                           metric_data["unit_repr"],
                           max_entry["stats"]["mean"],
                           metric_data["unit_repr"],
                       )
                   )
|
||||
|
||||
The result of applying this script to the corresponding JSON output from the :ref:`text-output-example-label`
section is as follows:

.. code-block:: shell

   [openmp-cg.inst-wall_clock.json] Found metric: wall_clock
   [openmp-cg.inst-wall_clock.json] Maximum value: 'conj_grad' at depth 6 was called 76x :: 10.641 sec (mean = 1.400e-01 sec)
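
Because the flat layout is a single array per rank, other summaries need only a sort or a filter.
As a further sketch under the same assumptions, the hypothetical ``top_entries`` helper below ranks
the flat entries of one rank by ``stats.sum``. The inline ``sample`` reuses a few values from the
JSON sample earlier in this section and is not a complete output file.

.. code-block:: python

   # Hypothetical, trimmed flat data reusing values from the JSON sample.
   sample = {
       "timemory": {"wall_clock": {"unit_repr": "sec", "ranks": [{"rank": 0, "graph": [
           {"prefix": "|0>>> main", "depth": 0, "stats": {"sum": 0.894743517}},
           {"prefix": "|0>>> |_read_input", "depth": 1, "stats": {"sum": 9.808e-06}},
           {"prefix": "|0>>> |_setcoeff", "depth": 1, "stats": {"sum": 9.22e-07}},
       ]}]}}
   }


   def top_entries(metric_data, rank=0, count=2):
       """Return the `count` flat entries with the largest stats sum for one rank."""
       entries = metric_data["ranks"][rank]["graph"]
       return sorted(entries, key=lambda e: e["stats"]["sum"], reverse=True)[:count]


   metric = sample["timemory"]["wall_clock"]
   top = top_entries(metric)
   for entry in top:
       total = entry["stats"]["sum"]
       print(f'{entry["prefix"]} (depth {entry["depth"]}): {total:.6f} {metric["unit_repr"]}')

With a real output file, ``sample`` would be replaced by the dictionary returned from ``json.load``.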
|
||||
@@ -0,0 +1,294 @@
|
||||
.. meta::
   :description: Omnitrace documentation and reference
   :keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD

****************************************************
Using the Omnitrace API
****************************************************

The following example shows how a program can use the Omnitrace API for run-time analysis.

Omnitrace user API example program
========================================

You can use the Omnitrace API to define custom regions to profile and trace.
The following C++ program demonstrates this technique by calling several functions from the
Omnitrace API, such as ``omnitrace_user_push_region`` and
``omnitrace_user_stop_thread_trace``.
|
||||
|
||||
.. note::

   By default, when Omnitrace detects any ``omnitrace_user_start_*`` or
   ``omnitrace_user_stop_*`` function, instrumentation
   is disabled at startup, which means ``omnitrace_user_stop_trace()`` is not
   required at the beginning of ``main``. This behavior
   can be manually controlled by using the ``OMNITRACE_INIT_ENABLED`` environment variable.
   User-defined regions are always
   recorded, regardless of whether ``omnitrace_user_start_*`` or
   ``omnitrace_user_stop_*`` has been called.
|
||||
|
||||
.. code-block:: cpp

   #include <omnitrace/categories.h>
   #include <omnitrace/types.h>
   #include <omnitrace/user.h>

   #include <algorithm>
   #include <atomic>
   #include <cassert>
   #include <cerrno>
   #include <cstdint>
   #include <cstdio>
   #include <cstdlib>
   #include <cstring>
   #include <sstream>
   #include <string>
   #include <thread>
   #include <vector>

   std::atomic<long> total{ 0 };

   long
   fib(long n) __attribute__((noinline));

   void
   run(size_t nitr, long) __attribute__((noinline));

   int
   custom_push_region(const char* name);

   namespace
   {
   omnitrace_user_callbacks_t custom_callbacks   = OMNITRACE_USER_CALLBACKS_INIT;
   omnitrace_user_callbacks_t original_callbacks = OMNITRACE_USER_CALLBACKS_INIT;
   }  // namespace

   int
   main(int argc, char** argv)
   {
       // route push_region through the custom callback and keep the originals
       custom_callbacks.push_region = &custom_push_region;
       omnitrace_user_configure(OMNITRACE_USER_UNION_CONFIG, custom_callbacks,
                                &original_callbacks);

       omnitrace_user_push_region(argv[0]);
       omnitrace_user_push_region("initialization");
       size_t nthread = std::min<size_t>(16, std::thread::hardware_concurrency());
       size_t nitr    = 50000;
       long   nfib    = 10;
       if(argc > 1) nfib = atol(argv[1]);
       if(argc > 2) nthread = atol(argv[2]);
       if(argc > 3) nitr = atol(argv[3]);
       omnitrace_user_pop_region("initialization");

       printf("[%s] Threads: %zu\n[%s] Iterations: %zu\n[%s] fibonacci(%li)...\n", argv[0],
              nthread, argv[0], nitr, argv[0], nfib);

       omnitrace_user_push_region("thread_creation");
       std::vector<std::thread> threads{};
       threads.reserve(nthread);
       // disable instrumentation for child threads
       omnitrace_user_stop_thread_trace();
       for(size_t i = 0; i < nthread; ++i)
       {
           threads.emplace_back(&run, nitr, nfib);
       }
       // re-enable instrumentation
       omnitrace_user_start_thread_trace();
       omnitrace_user_pop_region("thread_creation");

       omnitrace_user_push_region("thread_wait");
       for(auto& itr : threads)
           itr.join();
       omnitrace_user_pop_region("thread_wait");

       run(nitr, nfib);

       printf("[%s] fibonacci(%li) x %zu = %li\n", argv[0], nfib, nthread, total.load());
       omnitrace_user_pop_region(argv[0]);

       return 0;
   }

   long
   fib(long n)
   {
       return (n < 2) ? n : fib(n - 1) + fib(n - 2);
   }

   #define RUN_LABEL                                                                        \
       std::string{ std::string{ __FUNCTION__ } + "(" + std::to_string(n) + ") x " +        \
                    std::to_string(nitr) }                                                  \
           .c_str()

   void
   run(size_t nitr, long n)
   {
       omnitrace_user_push_region(RUN_LABEL);
       long local = 0;
       for(size_t i = 0; i < nitr; ++i)
           local += fib(n);
       total += local;
       omnitrace_user_pop_region(RUN_LABEL);
   }

   int
   custom_push_region(const char* name)
   {
       if(!original_callbacks.push_region || !original_callbacks.push_annotated_region)
           return OMNITRACE_USER_ERROR_NO_BINDING;

       printf("Pushing custom region :: %s\n", name);

       if(original_callbacks.push_annotated_region)
       {
           // annotate the region with the current errno value and its message
           int32_t _err = errno;
           char*   _msg = nullptr;
           char    _buff[1024];
           if(_err != 0) _msg = strerror_r(_err, _buff, sizeof(_buff));

           omnitrace_annotation_t _annotations[] = {
               { "errno", OMNITRACE_INT32, &_err }, { "strerror", OMNITRACE_STRING, _msg }
           };

           errno = 0;  // reset errno
           return (*original_callbacks.push_annotated_region)(
               name, _annotations, sizeof(_annotations) / sizeof(omnitrace_annotation_t));
       }

       return (*original_callbacks.push_region)(name);
   }
|
||||
|
||||
Linking the Omnitrace libraries to another program
=======================================================

To link the ``omnitrace-user-library`` to another program,
use the following CMake and ``g++`` directives.

CMake
-------------------------------------------------------

.. code-block:: cmake

   find_package(omnitrace REQUIRED COMPONENTS user)
   add_executable(foo foo.cpp)
   target_link_libraries(foo PRIVATE omnitrace::omnitrace-user-library)
|
||||
|
||||
g++ compilation
-------------------------------------------------------

Assuming Omnitrace is installed in ``/opt/omnitrace``, with its headers under ``include``
and its libraries under ``lib``, use the ``g++`` compiler
to build the application.

.. code-block:: shell

   g++ -I/opt/omnitrace/include -L/opt/omnitrace/lib foo.cpp -o foo -lomnitrace-user
|
||||
|
||||
Output from the API example program
|
||||
========================================
|
||||
|
||||
First, instrument and run the program.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omnitrace-instrument -l --min-instructions=8 -E custom_push_region -o -- ./user-api
|
||||
...
|
||||
$ omnitrace-run --profile --use-pid off --time-output off -- ./user-api.inst 20 4 100
|
||||
Pushing custom region :: ./user-api.inst
|
||||
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Trace
|
||||
|
||||
|
||||
______ .___ ___. .__ __. __ .___________..______ ___ ______ _______
|
||||
/ __ \ | \/ | | \ | | | | | || _ \ / \ / || ____|
|
||||
| | | | | \ / | | \| | | | `---| |----`| |_) | / ^ \ | ,----'| |__
|
||||
| | | | | |\/| | | . ` | | | | | | / / /_\ \ | | | __|
|
||||
| `--' | | | | | | |\ | | | | | | |\ \----./ _____ \ | `----.| |____
|
||||
\______/ |__| |__| |__| \__| |__| |__| | _| `._____/__/ \__\ \______||_______|
|
||||
|
||||
|
||||
|
||||
Pushing custom region :: initialization
|
||||
[./user-api.inst] Threads: 4
|
||||
[./user-api.inst] Iterations: 100
|
||||
[./user-api.inst] fibonacci(20)...
|
||||
Pushing custom region :: thread_creation
|
||||
Pushing custom region :: thread_wait
|
||||
Pushing custom region :: run(20) x 100
|
||||
Pushing custom region :: run(20) x 100
|
||||
Pushing custom region :: run(20) x 100
|
||||
Pushing custom region :: run(20) x 100
|
||||
Pushing custom region :: run(20) x 100
|
||||
[./user-api.inst] fibonacci(20) x 4 = 3382500
|
||||
[omnitrace][86267][0][omnitrace_finalize] finalizing...
|
||||
|
||||
|
||||
[omnitrace][86267][0] omnitrace : 5.190895 sec wall_clock, 2.748 mb peak_rss, 6.330000 sec cpu_clock, 121.9 % cpu_util [laps: 1]
|
||||
[omnitrace][86267][0] user-api.inst/thread-0 : 5.078713 sec wall_clock, 4.722415 sec thread_cpu_clock, 93.0 % thread_cpu_util, 1.276 mb peak_rss [laps: 1]
|
||||
[omnitrace][86267][0] user-api.inst/thread-1 : 0.322248 sec wall_clock, 0.322191 sec thread_cpu_clock, 100.0 % thread_cpu_util, 1.000 mb peak_rss [laps: 1]
|
||||
[omnitrace][86267][0] user-api.inst/thread-2 : 0.323255 sec wall_clock, 0.323194 sec thread_cpu_clock, 100.0 % thread_cpu_util, 0.000 mb peak_rss [laps: 1]
|
||||
[omnitrace][86267][0] user-api.inst/thread-3 : 0.323569 sec wall_clock, 0.323484 sec thread_cpu_clock, 100.0 % thread_cpu_util, 1.092 mb peak_rss [laps: 1]
|
||||
[omnitrace][86267][0] user-api.inst/thread-4 : 0.324178 sec wall_clock, 0.324057 sec thread_cpu_clock, 100.0 % thread_cpu_util, 1.184 mb peak_rss [laps: 1]
|
||||
[omnitrace][86267][0] Post-processing 51 cpu frequency and memory usage entries...
|
||||
|
||||
[omnitrace][wall_clock]|0> Outputting 'omnitrace-user-api.inst-output/wall_clock.json'...
|
||||
[omnitrace][wall_clock]|0> Outputting 'omnitrace-user-api.inst-output/wall_clock.tree.json'...
|
||||
[omnitrace][wall_clock]|0> Outputting 'omnitrace-user-api.inst-output/wall_clock.txt'...
|
||||
|
||||
[omnitrace][manager::finalize][metadata]> Outputting 'omnitrace-user-api.inst-output/metadata.json' and 'omnitrace-user-api.inst-output/functions.json'...
|
||||
[omnitrace][86267][0][omnitrace_finalize] Finalized
|
||||
|
||||
Then review the output.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ cat omnitrace-user-api.inst-output/wall_clock.txt
|
||||
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| REAL-CLOCK TIMER (I.E. WALL-CLOCK TIMER) |
|
||||
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| LABEL | COUNT | DEPTH | METRIC | UNITS | SUM | MEAN | MIN | MAX | VAR | STDDEV | % SELF |
|
||||
|---------------------------------------------------------------------------------|--------|--------|------------|--------|----------|----------|----------|----------|----------|----------|--------|
|
||||
| |0>>> ./user-api.inst | 1 | 0 | wall_clock | sec | 5.078521 | 5.078521 | 5.078521 | 5.078521 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |0>>> |_initialization | 1 | 1 | wall_clock | sec | 0.000004 | 0.000004 | 0.000004 | 0.000004 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_thread_creation | 1 | 1 | wall_clock | sec | 0.000159 | 0.000159 | 0.000159 | 0.000159 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_thread_wait | 1 | 1 | wall_clock | sec | 0.355307 | 0.355307 | 0.355307 | 0.355307 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |0>>> |_std::vector<std::thread, std::allocator<std::thread> >::begin | 1 | 2 | wall_clock | sec | 0.000001 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_std::vector<std::thread, std::allocator<std::thread> >::end | 1 | 2 | wall_clock | sec | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_pthread_join | 4 | 2 | wall_clock | sec | 0.355257 | 0.088814 | 0.000001 | 0.333144 | 0.026559 | 0.162970 | 100.0 |
|
||||
| |2>>> |_start_thread | 1 | 3 | wall_clock | sec | 0.000032 | 0.000032 | 0.000032 | 0.000032 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |1>>> |_start_thread | 1 | 3 | wall_clock | sec | 0.000036 | 0.000036 | 0.000036 | 0.000036 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |3>>> |_start_thread | 1 | 3 | wall_clock | sec | 0.000034 | 0.000034 | 0.000034 | 0.000034 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |4>>> |_start_thread | 1 | 3 | wall_clock | sec | 0.000039 | 0.000039 | 0.000039 | 0.000039 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_run | 1 | 1 | wall_clock | sec | 4.722993 | 4.722993 | 4.722993 | 4.722993 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |0>>> |_std::char_traits<char>::length | 1 | 2 | wall_clock | sec | 0.000001 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_std::distance<char const*> | 1 | 2 | wall_clock | sec | 0.000001 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_std::operator+<char, std::char_traits<char>, std::allocator<char> > | 2 | 2 | wall_clock | sec | 0.000002 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_run(20) x 100 | 1 | 2 | wall_clock | sec | 4.722951 | 4.722951 | 4.722951 | 4.722951 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |0>>> |_run [{94,25}-{96,25}] | 1 | 3 | wall_clock | sec | 4.722925 | 4.722925 | 4.722925 | 4.722925 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |0>>> |_fib | 100 | 4 | wall_clock | sec | 4.722718 | 0.047227 | 0.046713 | 0.051987 | 0.000000 | 0.000625 | 0.0 |
|
||||
| |0>>> |_fib | 200 | 5 | wall_clock | sec | 4.722302 | 0.023612 | 0.017827 | 0.034091 | 0.000032 | 0.005627 | 0.0 |
|
||||
| |0>>> |_fib | 400 | 6 | wall_clock | sec | 4.721485 | 0.011804 | 0.006790 | 0.023003 | 0.000016 | 0.004024 | 0.0 |
|
||||
| |0>>> |_fib | 800 | 7 | wall_clock | sec | 4.719858 | 0.005900 | 0.002564 | 0.016078 | 0.000006 | 0.002498 | 0.1 |
|
||||
| |0>>> |_fib | 1600 | 8 | wall_clock | sec | 4.716572 | 0.002948 | 0.000977 | 0.011849 | 0.000002 | 0.001465 | 0.1 |
|
||||
| |0>>> |_fib | 3200 | 9 | wall_clock | sec | 4.709918 | 0.001472 | 0.000371 | 0.008246 | 0.000001 | 0.000831 | 0.3 |
|
||||
| |0>>> |_fib | 6400 | 10 | wall_clock | sec | 4.696775 | 0.000734 | 0.000140 | 0.005111 | 0.000000 | 0.000461 | 0.6 |
|
||||
| |0>>> |_fib | 12800 | 11 | wall_clock | sec | 4.670093 | 0.000365 | 0.000050 | 0.003166 | 0.000000 | 0.000253 | 1.1 |
|
||||
| |0>>> |_fib | 25600 | 12 | wall_clock | sec | 4.617496 | 0.000180 | 0.000017 | 0.001959 | 0.000000 | 0.000137 | 2.3 |
|
||||
| |0>>> |_fib | 51200 | 13 | wall_clock | sec | 4.512671 | 0.000088 | 0.000004 | 0.001212 | 0.000000 | 0.000074 | 4.6 |
|
||||
| |0>>> |_fib | 102400 | 14 | wall_clock | sec | 4.304142 | 0.000042 | 0.000000 | 0.000752 | 0.000000 | 0.000039 | 9.6 |
|
||||
| |0>>> |_fib | 202600 | 15 | wall_clock | sec | 3.892580 | 0.000019 | 0.000000 | 0.000469 | 0.000000 | 0.000021 | 19.0 |
|
||||
| |0>>> |_fib | 363200 | 16 | wall_clock | sec | 3.151143 | 0.000009 | 0.000000 | 0.000293 | 0.000000 | 0.000011 | 33.2 |
|
||||
| |0>>> |_fib | 502000 | 17 | wall_clock | sec | 2.105217 | 0.000004 | 0.000000 | 0.000183 | 0.000000 | 0.000006 | 49.1 |
|
||||
| |0>>> |_fib | 476000 | 18 | wall_clock | sec | 1.071652 | 0.000002 | 0.000000 | 0.000114 | 0.000000 | 0.000004 | 63.6 |
|
||||
| |0>>> |_fib | 294200 | 19 | wall_clock | sec | 0.390193 | 0.000001 | 0.000000 | 0.000071 | 0.000000 | 0.000003 | 75.3 |
|
||||
| |0>>> |_fib | 115200 | 20 | wall_clock | sec | 0.096190 | 0.000001 | 0.000000 | 0.000043 | 0.000000 | 0.000002 | 84.4 |
|
||||
| |0>>> |_fib | 27400 | 21 | wall_clock | sec | 0.015020 | 0.000001 | 0.000000 | 0.000025 | 0.000000 | 0.000001 | 91.1 |
|
||||
| |0>>> |_fib | 3600 | 22 | wall_clock | sec | 0.001336 | 0.000000 | 0.000000 | 0.000013 | 0.000000 | 0.000001 | 96.3 |
|
||||
| |0>>> |_fib | 200 | 23 | wall_clock | sec | 0.000050 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_std::char_traits<char>::length | 1 | 3 | wall_clock | sec | 0.000001 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_std::distance<char const*> | 1 | 3 | wall_clock | sec | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_std::operator+<char, std::char_traits<char>, std::allocator<char> > | 2 | 3 | wall_clock | sec | 0.000001 | 0.000001 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_std::operator& | 1 | 1 | wall_clock | sec | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> std::vector<std::thread, std::allocator<std::thread> >::~vector | 1 | 0 | wall_clock | sec | 0.000045 | 0.000045 | 0.000045 | 0.000045 | 0.000000 | 0.000000 | 32.7 |
|
||||
| |0>>> |_std::thread::~thread | 4 | 1 | wall_clock | sec | 0.000030 | 0.000007 | 0.000007 | 0.000009 | 0.000000 | 0.000001 | 31.2 |
|
||||
| |0>>> |_std::thread::joinable | 4 | 2 | wall_clock | sec | 0.000021 | 0.000005 | 0.000005 | 0.000006 | 0.000000 | 0.000001 | 89.4 |
|
||||
| |0>>> |_std::thread::id::id | 4 | 3 | wall_clock | sec | 0.000001 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_std::operator== | 4 | 3 | wall_clock | sec | 0.000001 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_std::allocator_traits<std::allocator<std::thread> >::deallocate | 1 | 1 | wall_clock | sec | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_std::allocator<std::thread>::~allocator | 1 | 1 | wall_clock | sec | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
@@ -0,0 +1,67 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
***********************
|
||||
Omnitrace documentation
|
||||
***********************
|
||||
|
||||
Omnitrace is designed for the high-level profiling and comprehensive tracing
|
||||
of applications running on the CPU or the CPU and GPU. It supports dynamic binary
|
||||
instrumentation, call-stack sampling, and various other features for determining
|
||||
which function and line number are currently executing. To learn more, see :doc:`what-is-omnitrace`.
|
||||
|
||||
The code is open and hosted at `<https://github.com/ROCm/omnitrace>`_.
|
||||
|
||||
|
||||
.. grid:: 2
|
||||
:gutter: 3
|
||||
|
||||
.. grid-item-card:: Install
|
||||
|
||||
* :doc:`Quick start <./install/quick-start>`
|
||||
* :doc:`Omnitrace installation <./install/install>`
|
||||
|
||||
|
||||
The documentation is structured as follows:
|
||||
|
||||
.. grid:: 2
|
||||
:gutter: 3
|
||||
|
||||
.. grid-item-card:: Tutorials
|
||||
|
||||
* `GitHub examples <https://github.com/ROCm/omnitrace/tree/main/examples>`_
|
||||
* :doc:`Video tutorials <./tutorials/video-tutorials>`
|
||||
|
||||
.. grid-item-card:: How to
|
||||
|
||||
* :doc:`Configuring and validating the Omnitrace environment <./how-to/configuring-validating-environment>`
|
||||
* :doc:`Configuring runtime options <./how-to/configuring-runtime-options>`
|
||||
* :doc:`Sampling the call stack <./how-to/sampling-call-stack>`
|
||||
* :doc:`Instrumenting and rewriting a binary application <./how-to/instrumenting-rewriting-binary-application>`
|
||||
* :doc:`Performing causal profiling <./how-to/performing-causal-profiling>`
|
||||
* :doc:`Understanding the Omnitrace output <./how-to/understanding-omnitrace-output>`
|
||||
* :doc:`Profiling Python scripts <./how-to/profiling-python-scripts>`
|
||||
* :doc:`Using the Omnitrace API <./how-to/using-omnitrace-api>`
|
||||
* :doc:`General tips for using Omnitrace <./how-to/general-tips-using-omnitrace>`
|
||||
|
||||
|
||||
.. grid-item-card:: Conceptual
|
||||
|
||||
* :doc:`Data collection modes <./conceptual/data-collection-modes>`
|
||||
* :doc:`The Omnitrace feature set <./conceptual/omnitrace-feature-set>`
|
||||
|
||||
.. grid-item-card:: Reference
|
||||
|
||||
* :doc:`Development guide <./reference/development-guide>`
|
||||
* :doc:`Omnitrace glossary <./reference/omnitrace-glossary>`
|
||||
* :doc:`API library <./doxygen/html/files>`
|
||||
* :doc:`Class member functions <./doxygen/html/functions>`
|
||||
* :doc:`Globals <./doxygen/html/globals>`
|
||||
* :doc:`Classes, structures, and interfaces <./doxygen/html/annotated>`
|
||||
|
||||
To contribute to the documentation, refer to
|
||||
`Contributing to ROCm <https://rocm.docs.amd.com/en/latest/contribute/contributing.html>`_.
|
||||
|
||||
You can find licensing information on the
|
||||
`Licensing <https://rocm.docs.amd.com/en/latest/about/license.html>`_ page.
|
||||
@@ -0,0 +1,410 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
*************************************
|
||||
Omnitrace installation
|
||||
*************************************
|
||||
|
||||
The following information builds on the guidelines in the :doc:`Quick start <./quick-start>` guide.
|
||||
It covers how to install `Omnitrace <https://github.com/ROCm/omnitrace>`_ from source or a binary distribution,
|
||||
as well as the :ref:`post-installation-steps`.
|
||||
|
||||
If you have problems using Omnitrace after installation,
|
||||
consult the :ref:`post-installation-troubleshooting` section.
|
||||
|
||||
Release links
|
||||
========================================
|
||||
|
||||
To review and install either the current Omnitrace release or earlier releases, use these links:
|
||||
|
||||
* Latest Omnitrace Release: `<https://github.com/ROCm/omnitrace/releases/latest>`_
|
||||
* All Omnitrace Releases: `<https://github.com/ROCm/omnitrace/releases>`_
|
||||
|
||||
Operating system support
|
||||
========================================
|
||||
|
||||
Omnitrace is only supported on Linux. The following distributions are tested in the Omnitrace GitHub workflows:
|
||||
|
||||
* Ubuntu 20.04
|
||||
* Ubuntu 22.04
|
||||
* OpenSUSE 15.3
|
||||
* OpenSUSE 15.4
|
||||
* Red Hat 8.7
|
||||
* Red Hat 9.0
|
||||
* Red Hat 9.1
|
||||
|
||||
Other OS distributions might function but are not supported or tested.
|
||||
|
||||
Identifying the operating system
|
||||
-----------------------------------
|
||||
|
||||
If you are unsure of the operating system and version, the ``/etc/os-release`` and
|
||||
``/usr/lib/os-release`` files contain operating system identification data for Linux systems.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ cat /etc/os-release
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
NAME="Ubuntu"
|
||||
VERSION="20.04.4 LTS (Focal Fossa)"
|
||||
ID=ubuntu
|
||||
...
|
||||
VERSION_ID="20.04"
|
||||
...
|
||||
|
||||
The relevant fields are ``ID`` and the ``VERSION_ID``.
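
Because ``os-release`` files use shell-compatible variable assignments, the two fields can be read directly from a shell. The following is a minimal sketch; ``print_os_id`` is a hypothetical helper name and is not part of Omnitrace.

```shell
# print_os_id is a hypothetical helper, not an Omnitrace command.
# The function body runs in a subshell, so the variables defined by the
# os-release file do not leak into the calling shell.
print_os_id() (
    . "${1:-/etc/os-release}"
    printf '%s %s\n' "$ID" "$VERSION_ID"
)

# Print the distribution ID and version of the current system, if available.
if [ -r /etc/os-release ]; then
    print_os_id
fi
```

Passing a path as the first argument, for example ``print_os_id /usr/lib/os-release``, reads that file instead of the default.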
|
||||
|
||||
Architecture
|
||||
========================================
|
||||
|
||||
With regard to instrumentation, only AMD64 (x86_64) architectures are currently tested. However,
|
||||
Dyninst supports several more architectures and Omnitrace instrumentation may support other
|
||||
CPU architectures such as aarch64 and ppc64.
|
||||
Other modes of use, such as sampling and causal profiling, are not dependent on Dyninst and therefore
|
||||
might be more portable.
|
||||
|
||||
Installing Omnitrace from binary distributions
|
||||
================================================
|
||||
|
||||
Every Omnitrace release provides binary installer scripts of the form:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-{VERSION}-{OS_DISTRIB}-{OS_VERSION}[-ROCm-{ROCM_VERSION}][-{EXTRA}].sh
|
||||
|
||||
For example,
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-1.0.0-ubuntu-18.04-OMPT-PAPI-Python3.sh
|
||||
omnitrace-1.0.0-ubuntu-18.04-ROCm-405000-OMPT-PAPI-Python3.sh
|
||||
...
|
||||
omnitrace-1.0.0-ubuntu-20.04-ROCm-50000-OMPT-PAPI-Python3.sh
|
||||
|
||||
Any of the ``EXTRA`` fields with a CMake build option
|
||||
(for example, PAPI, as referenced in a following section) or
|
||||
with no link requirements (such as OMPT) have
|
||||
self-contained support for these packages.
|
||||
|
||||
To install Omnitrace using a binary installer script, follow these steps:
|
||||
|
||||
#. Download the appropriate binary distribution
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
wget https://github.com/ROCm/omnitrace/releases/download/v<VERSION>/<SCRIPT>
|
||||
|
||||
#. Create the target installation directory
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
mkdir /opt/omnitrace
|
||||
|
||||
#. Run the installer script
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
./omnitrace-1.0.0-ubuntu-18.04-ROCm-405000-OMPT-PAPI.sh --prefix=/opt/omnitrace --exclude-subdir
|
||||
|
||||
Installing Omnitrace from source
|
||||
========================================
|
||||
|
||||
Omnitrace needs a GCC compiler with full support for C++17 and CMake v3.16 or higher.
|
||||
The Clang compiler may be used in lieu of the GCC compiler if `Dyninst <https://github.com/dyninst/dyninst>`_
|
||||
is already installed.
|
||||
|
||||
Build requirements
|
||||
-----------------------------------
|
||||
|
||||
* GCC compiler v7+
|
||||
|
||||
* Older GCC compilers may be supported but are not tested
|
||||
* Clang compilers are generally supported for Omnitrace but not Dyninst
|
||||
|
||||
* `CMake <https://cmake.org/>`_ v3.16+
|
||||
|
||||
.. note::
|
||||
|
||||
* If the installed version of CMake is too old, you can install a newer version using one of several methods.
|
||||
* One of the easiest options is to use the Python ``pip`` utility, as follows:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
pip install --user 'cmake==3.18.4'
|
||||
export PATH=${HOME}/.local/bin:${PATH}
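
To check whether the installed CMake meets the v3.16 requirement before configuring, a sketch along these lines might help. ``version_at_least`` is a hypothetical helper, and it assumes the ``-V`` (version sort) option of GNU ``sort`` is available.

```shell
# version_at_least is a hypothetical helper: it succeeds when $1 >= $2,
# comparing dotted version strings with GNU sort's -V (version sort).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

if command -v cmake >/dev/null 2>&1; then
    # "cmake --version" prints "cmake version X.Y.Z" on its first line.
    found=$(cmake --version | head -n 1 | awk '{print $3}')
    if version_at_least "$found" 3.16; then
        echo "CMake $found is new enough"
    else
        echo "CMake $found is older than 3.16"
    fi
else
    echo "cmake not found in PATH"
fi
```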
|
||||
|
||||
Required third-party packages
|
||||
-----------------------------------
|
||||
|
||||
* `Dyninst <https://github.com/dyninst/dyninst>`_ for dynamic or static instrumentation.
|
||||
Dyninst uses the following required and optional components.
|
||||
|
||||
* `TBB <https://github.com/oneapi-src/oneTBB>`_ (required)
|
||||
* `Elfutils <https://sourceware.org/elfutils/>`_ (required)
|
||||
* `Libiberty <https://github.com/gcc-mirror/gcc/tree/master/libiberty>`_ (required)
|
||||
* `Boost <https://www.boost.org/>`_ (required)
|
||||
* `OpenMP <https://www.openmp.org/>`_ (optional)
|
||||
|
||||
* `libunwind <https://www.nongnu.org/libunwind/>`_ for call-stack sampling
|
||||
|
||||
Any of the third-party packages required by Dyninst, along with Dyninst itself, can be built and installed
|
||||
during the Omnitrace build. The following list indicates the package, the version,
|
||||
the application that requires the package (for example, Omnitrace requires Dyninst
|
||||
while Dyninst requires TBB), and the CMake option to build the package alongside Omnitrace:
|
||||
|
||||
.. csv-table::
|
||||
:header: "Third-Party Library", "Minimum Version", "Required By", "CMake Option"
|
||||
:widths: 15, 10, 12, 40
|
||||
|
||||
"Dyninst", "12.0", "Omnitrace", "``OMNITRACE_BUILD_DYNINST`` (default: OFF)"
|
||||
"Libunwind", "", "Omnitrace", "``OMNITRACE_BUILD_LIBUNWIND`` (default: ON)"
|
||||
"TBB", "2018.6", "Dyninst", "``DYNINST_BUILD_TBB`` (default: OFF)"
|
||||
"ElfUtils", "0.178", "Dyninst", "``DYNINST_BUILD_ELFUTILS`` (default: OFF)"
|
||||
"LibIberty", "", "Dyninst", "``DYNINST_BUILD_LIBIBERTY`` (default: OFF)"
|
||||
"Boost", "1.67.0", "Dyninst", "``DYNINST_BUILD_BOOST`` (default: OFF)"
|
||||
"OpenMP", "4.x", "Dyninst", ""
|
||||
|
||||
Optional third-party packages
|
||||
-----------------------------------
|
||||
|
||||
* `ROCm <https://rocm.docs.amd.com/projects/install-on-linux/en/latest>`_
|
||||
|
||||
* HIP
|
||||
* Roctracer for HIP API and kernel tracing
|
||||
* ROCM-SMI for GPU monitoring
|
||||
* Rocprofiler for GPU hardware counters
|
||||
|
||||
* `PAPI <https://icl.utk.edu/papi/>`_
|
||||
* MPI
|
||||
|
||||
* ``OMNITRACE_USE_MPI`` enables full MPI support
|
||||
* ``OMNITRACE_USE_MPI_HEADERS`` enables wrapping of the dynamically-linked MPI C function calls.
|
||||
(By default, if Omnitrace cannot find an OpenMPI MPI distribution, it uses a local copy
|
||||
of the OpenMPI ``mpi.h``.)
|
||||
|
||||
* Several optional third-party profiling tools supported by Timemory
|
||||
(for example, `Caliper <https://github.com/LLNL/Caliper>`_, `TAU <https://www.cs.uoregon.edu/research/tau/home.php>`_, CrayPAT, and others)
|
||||
|
||||
.. csv-table::
|
||||
:header: "Third-Party Library", "CMake Enable Option", "CMake Build Option"
|
||||
:widths: 15, 45, 40
|
||||
|
||||
"PAPI", "``OMNITRACE_USE_PAPI`` (default: ON)", "``OMNITRACE_BUILD_PAPI`` (default: ON)"
|
||||
"MPI", "``OMNITRACE_USE_MPI`` (default: OFF)", ""
|
||||
"MPI (header-only)", "``OMNITRACE_USE_MPI_HEADERS`` (default: ON)", ""
|
||||
|
||||
Installing Dyninst
|
||||
-----------------------------------
|
||||
|
||||
The easiest way to install Dyninst is alongside Omnitrace, but it can also be installed using Spack.
|
||||
|
||||
Building Dyninst alongside Omnitrace
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
To install Dyninst alongside Omnitrace, configure Omnitrace with ``OMNITRACE_BUILD_DYNINST=ON``.
|
||||
Depending on the version of Ubuntu, the ``apt`` package manager might have current enough
|
||||
versions of the Dyninst Boost, TBB, and LibIberty dependencies
|
||||
(use ``apt-get install libtbb-dev libiberty-dev libboost-dev``).
|
||||
However, it is possible to request Dyninst to install
|
||||
its dependencies via ``DYNINST_BUILD_<DEP>=ON``, as follows:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
git clone https://github.com/ROCm/omnitrace.git omnitrace-source
|
||||
cmake -B omnitrace-build -DOMNITRACE_BUILD_DYNINST=ON -DDYNINST_BUILD_{TBB,ELFUTILS,BOOST,LIBIBERTY}=ON omnitrace-source
|
||||
|
||||
where ``-DDYNINST_BUILD_{TBB,BOOST,ELFUTILS,LIBIBERTY}=ON`` is expanded by
|
||||
the shell to ``-DDYNINST_BUILD_TBB=ON -DDYNINST_BUILD_BOOST=ON ...``.
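
To preview the expansion without running CMake, the same argument can be passed to ``echo``. This sketch assumes ``bash`` (or another shell with brace expansion); plain POSIX ``sh`` passes the braces through unchanged. The variable name ``expanded`` is only for illustration.

```shell
#!/usr/bin/env bash
# Brace expansion is performed by bash (and zsh), not by plain POSIX sh.
# echo prints the arguments exactly as cmake would receive them.
expanded=$(echo -DDYNINST_BUILD_{TBB,ELFUTILS,BOOST,LIBIBERTY}=ON)
echo "$expanded"
```

Under ``bash``, this prints one ``-DDYNINST_BUILD_<DEP>=ON`` argument per dependency, in the order listed inside the braces.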
|
||||
|
||||
Installing Dyninst via Spack
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
`Spack <https://github.com/spack/spack>`_ is another option to install Dyninst and its dependencies:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
git clone https://github.com/spack/spack.git
|
||||
source ./spack/share/spack/setup-env.sh
|
||||
spack compiler find
|
||||
spack external find --all --not-buildable
|
||||
spack spec -I --reuse dyninst
|
||||
spack install --reuse dyninst
|
||||
spack load -r dyninst
|
||||
|
||||
Installing Omnitrace
|
||||
-----------------------------------
|
||||
|
||||
Omnitrace has CMake configuration options for MPI support (``OMNITRACE_USE_MPI`` or
|
||||
``OMNITRACE_USE_MPI_HEADERS``), HIP kernel tracing (``OMNITRACE_USE_ROCTRACER``),
|
||||
ROCm device sampling (``OMNITRACE_USE_ROCM_SMI``), OpenMP-Tools (``OMNITRACE_USE_OMPT``),
|
||||
hardware counters via PAPI (``OMNITRACE_USE_PAPI``), among other features.
|
||||
Various additional features can be enabled via the
|
||||
``TIMEMORY_USE_*`` `CMake options <https://timemory.readthedocs.io/en/develop/installation.html#cmake-options>`_.
|
||||
Any ``OMNITRACE_USE_<VAL>`` option which has a corresponding ``TIMEMORY_USE_<VAL>``
|
||||
option means that the Timemory support for this feature has been integrated
|
||||
into Perfetto support for Omnitrace. For example, ``OMNITRACE_USE_PAPI=<VAL>`` also configures
|
||||
``TIMEMORY_USE_PAPI=<VAL>``. This means the data that Timemory is able to collect via this package
|
||||
is passed along to Perfetto and is displayed when the ``.proto`` file is visualized
|
||||
in `the Perfetto UI <https://ui.perfetto.dev>`_.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
git clone https://github.com/ROCm/omnitrace.git omnitrace-source
|
||||
cmake \
|
||||
-B omnitrace-build \
|
||||
-D CMAKE_INSTALL_PREFIX=/opt/omnitrace \
|
||||
-D OMNITRACE_USE_HIP=ON \
|
||||
-D OMNITRACE_USE_ROCM_SMI=ON \
|
||||
-D OMNITRACE_USE_ROCTRACER=ON \
|
||||
-D OMNITRACE_USE_PYTHON=ON \
|
||||
-D OMNITRACE_USE_OMPT=ON \
|
||||
-D OMNITRACE_USE_MPI_HEADERS=ON \
|
||||
-D OMNITRACE_BUILD_PAPI=ON \
|
||||
-D OMNITRACE_BUILD_LIBUNWIND=ON \
|
||||
-D OMNITRACE_BUILD_DYNINST=ON \
|
||||
-D DYNINST_BUILD_TBB=ON \
|
||||
-D DYNINST_BUILD_BOOST=ON \
|
||||
-D DYNINST_BUILD_ELFUTILS=ON \
|
||||
-D DYNINST_BUILD_LIBIBERTY=ON \
|
||||
omnitrace-source
|
||||
cmake --build omnitrace-build --target all --parallel 8
|
||||
cmake --build omnitrace-build --target install
|
||||
source /opt/omnitrace/share/omnitrace/setup-env.sh
|
||||
|
||||
.. _mpi-support-omnitrace:
|
||||
|
||||
MPI support within Omnitrace
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Omnitrace can have full (``OMNITRACE_USE_MPI=ON``) or partial (``OMNITRACE_USE_MPI_HEADERS=ON``) MPI support.
|
||||
The only difference between these two modes is whether or not the results collected
|
||||
via Timemory and/or Perfetto can be aggregated into a single
|
||||
output file during finalization. When full MPI support is enabled, combining the
|
||||
Timemory results always occurs, whereas combining the Perfetto
|
||||
results is configurable via the ``OMNITRACE_PERFETTO_COMBINE_TRACES`` setting.
|
||||
|
||||
The primary benefits of partial or full MPI support are the automatic wrapping
|
||||
of MPI functions and the ability
|
||||
to label output with suffixes which correspond to the ``MPI_COMM_WORLD`` rank ID
|
||||
instead of having to use the system process identifier (i.e. ``PID``).
|
||||
In general, it's recommended to use partial MPI support with the OpenMPI
|
||||
headers as this is the most portable configuration.
|
||||
If full MPI support is selected, make sure your target application is built
|
||||
against the same MPI distribution as Omnitrace.
|
||||
For example, do not build Omnitrace with MPICH and use it on a target application built against OpenMPI.
|
||||
If partial support is selected, the OpenMPI headers are recommended over the MPICH headers
|
||||
because ``MPI_COMM_WORLD`` in OpenMPI is a pointer to ``ompi_communicator_t`` (8 bytes),
|
||||
whereas ``MPI_COMM_WORLD`` in MPICH is an ``int`` (4 bytes). Building Omnitrace with partial MPI support
|
||||
and the MPICH headers and then using
|
||||
Omnitrace on an application built against OpenMPI causes a segmentation fault.
|
||||
This happens because the value of the ``MPI_COMM_WORLD`` is truncated
|
||||
during the function wrapping before being passed along to the underlying MPI function.
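
The effect of the truncation can be illustrated with plain shell arithmetic. The address below is a made-up value standing in for an OpenMPI ``MPI_COMM_WORLD`` pointer, and the sketch assumes a typical 64-bit platform where storing the pointer in a 4-byte ``int`` keeps only the low 32 bits.

```shell
# Hypothetical 64-bit pointer value standing in for OpenMPI's MPI_COMM_WORLD.
ptr=$(( 0x7f3a12345678 ))
# Keep only the low 32 bits, as happens when the value is stored in a 4-byte int.
truncated=$(( ptr & 0xFFFFFFFF ))
printf 'original:  0x%x\n' "$ptr"
printf 'truncated: 0x%x\n' "$truncated"
```

The truncated value no longer refers to the original communicator object, which is why the underlying MPI function fails when it receives it.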
|
||||
|
||||
.. _post-installation-steps:
|
||||
|
||||
Post-installation steps
|
||||
========================================
|
||||
|
||||
After installation, you can optionally configure the Omnitrace environment.
|
||||
You should also test the executables to confirm Omnitrace is correctly installed.
|
||||
|
||||
Configure the environment
|
||||
-----------------------------------
|
||||
|
||||
If environment modules are available and preferred, add them using these commands:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
module use /opt/omnitrace/share/modulefiles
|
||||
module load omnitrace/1.0.0
|
||||
|
||||
Alternatively, you can directly source the ``setup-env.sh`` script:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
source /opt/omnitrace/share/omnitrace/setup-env.sh
|
||||
|
||||
Test the executables
|
||||
-----------------------------------
|
||||
|
||||
Successful execution of these commands confirms that the installation does not have any
|
||||
issues locating the installed libraries:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-instrument --help
|
||||
omnitrace-avail --help
|
||||
|
||||
.. note::
|
||||
|
||||
If ROCm support is enabled, you might have to add the path to the ROCm libraries to ``LD_LIBRARY_PATH``,
|
||||
for example, ``export LD_LIBRARY_PATH=/opt/rocm/lib:${LD_LIBRARY_PATH}``.
|
||||
|
||||
.. _post-installation-troubleshooting:
|
||||
|
||||
Post-installation troubleshooting
|
||||
========================================
|
||||
|
||||
This section explains how to resolve certain issues that might happen when you first use Omnitrace.
|
||||
|
||||
Issues with RHEL and SELinux
|
||||
----------------------------------------------------
|
||||
|
||||
RHEL (Red Hat Enterprise Linux) and related distributions of Linux automatically enable a security feature
|
||||
named SELinux (Security-Enhanced Linux) that prevents Omnitrace from running.
|
||||
This issue applies to any Linux distribution with SELinux installed, including RHEL,
|
||||
CentOS, Fedora, and Rocky Linux. The problem can happen with any GPU, or even without a GPU.
|
||||
|
||||
The problem occurs after you instrument a program and try to
|
||||
run ``omnitrace-run`` with the instrumented program.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
g++ hello.cpp -o hello
|
||||
omnitrace-instrument -M sampling -o hello.instr -- ./hello
|
||||
omnitrace-run -- ./hello.instr
|
||||
|
||||
Instead of successfully running the binary with call-stack sampling,
|
||||
Omnitrace crashes with a segmentation fault.
|
||||
|
||||
.. note::
|
||||
|
||||
If you are physically logged in on the system (not using SSH or a remote connection),
|
||||
the operating system might display an SELinux pop-up warning in the notifications.
|
||||
|
||||
To work around this problem, either disable SELinux or configure it to use a more
|
||||
permissive setting.
|
||||
|
||||
To avoid this problem for the duration of the current session, run this command
|
||||
from the shell:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
sudo setenforce 0
|
||||
|
||||
For a permanent workaround, edit the SELinux configuration file using the command
|
||||
``sudo vim /etc/sysconfig/selinux`` and change the ``SELINUX`` setting to
|
||||
either ``Permissive`` or ``Disabled``.
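
Before and after making a change, it can help to compare the mode currently in effect with the configured one. The following is a minimal sketch; ``selinux_config_mode`` is a hypothetical helper, and it assumes the configuration file contains a single uncommented ``SELINUX=`` line.

```shell
# selinux_config_mode is a hypothetical helper: it prints the value of the
# uncommented SELINUX= line in the given configuration file.
selinux_config_mode() {
    sed -n 's/^SELINUX=//p' "${1:-/etc/sysconfig/selinux}"
}

# getenforce reports the mode currently in effect
# (Enforcing, Permissive, or Disabled).
if command -v getenforce >/dev/null 2>&1; then
    echo "current mode: $(getenforce)"
fi
if [ -r /etc/sysconfig/selinux ]; then
    echo "configured mode: $(selinux_config_mode)"
fi
```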
|
||||
|
||||
.. note::
|
||||
|
||||
Permanently changing the SELinux settings can have security implications.
|
||||
Ensure you review your system security settings before making any changes.
|
||||
|
||||
Modifying RPATH details
|
||||
----------------------------------------------------
|
||||
|
||||
If you're experiencing problems loading your application with an instrumented library,
|
||||
then you might have to check and modify the RPATH specified in your application.
|
||||
See the section on `troubleshooting RPATHs <../how-to/instrumenting-rewriting-binary-application.html#rpath-troubleshooting>`_
|
||||
for further details.
|
||||
|
||||
Configuring PAPI to collect hardware counters
|
||||
----------------------------------------------------
|
||||
|
||||
To use PAPI to collect the majority of hardware counters, ensure
|
||||
the ``/proc/sys/kernel/perf_event_paranoid`` setting has a value less than or equal to ``2``.
|
||||
For more information, see the :ref:`omnitrace_papi_events` section.
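
The current value can be checked from a shell, as in the following sketch. ``paranoid_ok`` is a hypothetical helper, and the ``sysctl`` command in the comment requires root privileges and only lasts until the next reboot.

```shell
# paranoid_ok is a hypothetical helper: it succeeds when the given
# perf_event_paranoid value is less than or equal to 2.
paranoid_ok() {
    [ "$1" -le 2 ]
}

paranoid_file=/proc/sys/kernel/perf_event_paranoid
if [ -r "$paranoid_file" ]; then
    level=$(cat "$paranoid_file")
    if paranoid_ok "$level"; then
        echo "perf_event_paranoid=$level: OK for most PAPI counters"
    else
        echo "perf_event_paranoid=$level: too restrictive"
        # To lower the value for the current boot:
        #   sudo sysctl -w kernel.perf_event_paranoid=2
    fi
fi
```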
|
||||
@@ -0,0 +1,30 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
*************************************
|
||||
Omnitrace quick start
|
||||
*************************************
|
||||
|
||||
To install Omnitrace, download the `Omnitrace installer <https://github.com/ROCm/omnitrace/releases/latest/download/omnitrace-install.py>`_
|
||||
and specify ``--prefix <install-directory>``. The script attempts to auto-detect
|
||||
the appropriate OS distribution and version. To include AMD ROCm Software support,
|
||||
specify ``--rocm X.Y``, where ``X`` is the ROCm major
|
||||
version and ``Y`` is the ROCm minor version, for example, ``--rocm 6.2``.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
wget https://github.com/ROCm/omnitrace/releases/latest/download/omnitrace-install.py
|
||||
python3 ./omnitrace-install.py --prefix /opt/omnitrace --rocm 6.2
|
||||
|
||||
This script supports installation on Ubuntu, OpenSUSE, Red Hat, Debian, CentOS, and Fedora.
|
||||
If the target OS is compatible with one of the operating system versions listed in
|
||||
the comprehensive :doc:`Installation guidelines <./install>`,
|
||||
specify ``-d <DISTRO> -v <VERSION>``. For example, if the OS is compatible with Ubuntu 22.04, pass
|
||||
``-d ubuntu -v 22.04`` to the script.
|
||||
|
||||
.. note::
|
||||
|
||||
If you have ROCm version 6.2 or higher installed, you can use the
|
||||
package manager to install a pre-built copy of Omnitrace using
|
||||
``apt install omnitrace`` or ``dnf install omnitrace``.
|
||||
@@ -0,0 +1,4 @@
|
||||
# License
|
||||
|
||||
```{include} ../LICENSE
|
||||
```
|
||||
@@ -0,0 +1,412 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
****************************************************
|
||||
Development guide
|
||||
****************************************************
|
||||
|
||||
This guide discusses the `Omnitrace <https://github.com/ROCm/omnitrace>`_ design.
|
||||
It includes a list of the executables and libraries, along with a discussion of the application's
|
||||
memory, sampling, and time-window constraint models.
|
||||
|
||||
Executables
|
||||
========================================
|
||||
|
||||
This section lists the Omnitrace executables.
|
||||
|
||||
omnitrace-avail: `source/bin/omnitrace-avail <https://github.com/ROCm/omnitrace/tree/main/source/bin/omnitrace-avail>`_
|
||||
-------------------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
The ``main`` routine of ``omnitrace-avail`` has three important sections:
|
||||
|
||||
* Printing components
|
||||
* Printing options
|
||||
* Printing hardware counters
|
||||
|
||||
omnitrace-sample: `source/bin/omnitrace-sample <https://github.com/ROCm/omnitrace/tree/main/source/bin/omnitrace-sample>`_
|
||||
-------------------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
* Requires a command-line format of ``omnitrace-sample <options> -- <command> <command-args>``
|
||||
* Translates command-line options into environment variables
|
||||
* Adds ``libomnitrace-dl.so`` to ``LD_PRELOAD``
|
||||
* Launches ``<command> <command-args>`` using ``execvpe`` with the modified environment
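
The launcher pattern in the list above can be sketched in shell. This is an illustration only, not the ``omnitrace-sample`` implementation: ``sample_like_launch``, the ``--verbose`` option, and the ``EXAMPLE_VERBOSE`` variable are all hypothetical.

```shell
# sample_like_launch is a hypothetical stand-in for the launcher logic.
# Everything before "--" is treated as an option; everything after it is
# the command to run. The body runs in a subshell so the caller's
# environment is left untouched.
sample_like_launch() (
    while [ "$#" -gt 0 ] && [ "$1" != "--" ]; do
        case "$1" in
            # Hypothetical option-to-variable mapping, for illustration only.
            --verbose) export EXAMPLE_VERBOSE=1 ;;
        esac
        shift
    done
    [ "$#" -gt 0 ] && shift   # drop the "--" separator
    # Prepend the preload library, keeping any existing LD_PRELOAD entries.
    export LD_PRELOAD="libomnitrace-dl.so${LD_PRELOAD:+:$LD_PRELOAD}"
    # Replace the subshell with the command, much as execvpe does in C.
    exec "$@"
)
```

If ``libomnitrace-dl.so`` cannot be found, the dynamic loader typically prints a warning and runs the command without it.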
|
||||
|
||||
omnitrace-causal: `source/bin/omnitrace-causal <https://github.com/ROCm/omnitrace/tree/main/source/bin/omnitrace-causal>`_
|
||||
-------------------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
When there is exactly one causal profiling configuration variant (which enables debugging),
|
||||
``omnitrace-causal`` has a nearly identical design to ``omnitrace-sample``.
|
||||
|
||||
When the command-line options produce more than one causal profiling configuration variant,
|
||||
the following actions take place for each variant:
|
||||
|
||||
* ``omnitrace-causal`` calls ``fork()``
|
||||
* the child process launches ``<command> <command-args>`` using ``execvpe`` with an environment modified for the variant
|
||||
* the parent process waits for the child process to finish
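
The per-variant loop above can be sketched in shell, where a background subshell plays the role of the forked child. This is an illustration only, not the ``omnitrace-causal`` implementation: ``run_variants`` and the ``EXAMPLE_MODE`` variable are hypothetical, and each variant is reduced to a single ``NAME=VALUE`` assignment.

```shell
# run_variants is a hypothetical sketch of the fork/exec/wait loop.
# Each argument before "--" is one variant, given as a NAME=VALUE
# environment assignment; the remaining arguments are the command to run.
run_variants() {
    variants=""
    while [ "$#" -gt 0 ] && [ "$1" != "--" ]; do
        variants="$variants $1"
        shift
    done
    [ "$#" -gt 0 ] && shift   # drop the "--" separator
    for variant in $variants; do
        # The subshell stands in for the forked child: it modifies its own
        # environment and then replaces itself with the command.
        ( export "$variant"; exec "$@" ) &
        # The parent waits for the child before starting the next variant.
        wait "$!"
    done
}

run_variants EXAMPLE_MODE=function EXAMPLE_MODE=line -- \
    sh -c 'echo "variant: $EXAMPLE_MODE"'
```

Because the parent waits after each launch, the variants run one at a time and not concurrently.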
|
||||
|
||||
omnitrace-instrument: `source/bin/omnitrace-instrument <https://github.com/ROCm/omnitrace/tree/main/source/bin/omnitrace-instrument>`_
|
||||
-------------------------------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
* Requires a command-line format of ``omnitrace-instrument <options> -- <command> <command-args>``
|
||||
* Allows the user to provide options specifying whether to perform runtime instrumentation, use binary rewrite, or
|
||||
attach to a process
|
||||
* Either opens the instrumentation target (for binary rewrite), launches the target and stops it
|
||||
before it starts executing ``main``, or attaches to a running executable and pauses it
|
||||
* Finds all functions in the targets
|
||||
* Finds ``libomnitrace-dl`` and locates the instrumentation functions
|
||||
* Iterates over and instruments all the functions, provided they satisfy the
|
||||
defined criteria (such as a minimum number of instructions)
|
||||
|
||||
* See the ``module_function`` class
|
||||
|
||||
* Until this point, the workflow has been the same for the different options,
|
||||
but it diverges after instrumentation is complete:
|
||||
|
||||
* For a binary rewrite: it produces a new instrumented binary and exits
|
||||
* For runtime instrumentation or attaching to a process: it instructs the application
|
||||
to resume and then waits for it to exit
|
||||
|
||||
Libraries
|
||||
========================================
|
||||
|
||||
Common library: `source/lib/common <https://github.com/ROCm/omnitrace/tree/main/source/lib/common>`_
|
||||
--------------------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
* General header-only functionality used in multiple executables and/or libraries.
|
||||
* Not installed or exported outside of the build tree.
|
||||
|
||||
Core library: `source/lib/core <https://github.com/ROCm/omnitrace/tree/main/source/lib/core>`_
|
||||
--------------------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
* Static PIC library with functionality that does not depend on any components.
|
||||
* Not installed or exported outside of the build tree.
|
||||
|
||||
Binary library: `source/lib/binary <https://github.com/ROCm/omnitrace/tree/main/source/lib/binary>`_
|
||||
--------------------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
* Static PIC library with functionality for reading/analyzing binary info.
|
||||
* Mostly used by the causal profiling sections of ``libomnitrace``.
|
||||
* Not installed or exported outside of the build tree.
|
||||
|
||||
libomnitrace: `source/lib/omnitrace <https://github.com/ROCm/omnitrace/tree/main/source/lib/omnitrace>`_
|
||||
--------------------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
This is the main library encapsulating all the capabilities.
|
||||
|
||||
libomnitrace-dl: `source/lib/omnitrace-dl <https://github.com/ROCm/omnitrace/tree/main/source/lib/omnitrace-dl>`_
|
||||
--------------------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
This is a lightweight, front-end library for ``libomnitrace`` which serves three primary purposes:
|
||||
|
||||
* Dramatically speeds up instrumentation compared to using ``libomnitrace`` directly. Dyninst must parse
  an entire library to find the instrumentation functions, and ``libomnitrace-dl`` is lightweight
  (a ``dlopen`` call is made on ``libomnitrace`` when the instrumentation functions get called)
|
||||
* Prevents re-entry if ``libomnitrace`` calls an instrumented function internally
|
||||
* Coordinates communication between ``libomnitrace-user`` and ``libomnitrace``
|
||||
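
The re-entry protection described in the second bullet can be pictured as a thread-local guard. This is a minimal sketch under stated assumptions, not the actual ``libomnitrace-dl`` code: ``reentry_guard``, ``instrumented_entry``, ``inside_tool``, and ``recorded`` are hypothetical names chosen for illustration.

.. code-block:: cpp

    #include <cassert>

    namespace sketch
    {
    // Hypothetical flag: true while the current thread is inside the tool.
    thread_local bool inside_tool = false;
    int recorded = 0;  // counts callbacks that were allowed through

    struct reentry_guard
    {
        bool acquired;
        reentry_guard() : acquired(!inside_tool) { if(acquired) inside_tool = true; }
        ~reentry_guard() { if(acquired) inside_tool = false; }
    };

    // Stand-in for an instrumentation callback. `depth` simulates the tool
    // internally calling another instrumented function.
    void instrumented_entry(int depth)
    {
        reentry_guard guard{};
        if(!guard.acquired) return;  // re-entrant call: skip it
        ++recorded;
        if(depth > 0) instrumented_entry(depth - 1);  // nested call is ignored
    }
    }  // namespace sketch

Calling ``sketch::instrumented_entry(3)`` records a single entry because the nested calls see the guard already held.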
|
||||
libomnitrace-user: `source/lib/omnitrace-user <https://github.com/ROCm/omnitrace/tree/main/source/lib/omnitrace-user>`_
|
||||
--------------------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
* Provides a set of functions and types for the users to add to their code, for example,
|
||||
disabling data collection globally or on a specific thread or
|
||||
user-defined region
|
||||
* If ``libomnitrace-dl`` is not loaded, the user API is effectively a set of no-op function calls.
|
||||
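
The "no-op unless ``libomnitrace-dl`` is loaded" behavior can be approximated with a table of function pointers that stays empty until a front-end registers itself. The sketch below is an assumption about the general pattern, not the real user API: ``callbacks``, ``user_push_region``, and the return values ``0`` and ``1`` are hypothetical.

.. code-block:: cpp

    #include <cassert>

    namespace sketch
    {
    // Hypothetical callback table; a front-end library would fill it in when loaded.
    struct callbacks
    {
        int (*push_region)(const char*) = nullptr;
    };

    callbacks& get_callbacks()
    {
        static callbacks instance{};
        return instance;
    }

    // User-facing function: forwards if a callback is registered, otherwise does nothing.
    int user_push_region(const char* name)
    {
        auto* fn = get_callbacks().push_region;
        return (fn != nullptr) ? fn(name) : 0;
    }

    int calls = 0;
    int counting_push(const char*) { ++calls; return 1; }  // stand-in for the real implementation
    }  // namespace sketch

Before ``push_region`` is assigned, ``sketch::user_push_region("x")`` returns without side effects. After assigning ``&sketch::counting_push``, the same call is forwarded.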
|
||||
Testing tools
|
||||
========================================
|
||||
|
||||
* `CDash Testing Dashboard <https://my.cdash.org/index.php?project=Omnitrace>`_ (requires a login)
|
||||
|
||||
Components
|
||||
========================================
|
||||
|
||||
Most measurements and capabilities are encapsulated into a "component" with the following definitions:
|
||||
|
||||
Measurement
|
||||
A recording of some data relevant to performance, for instance, the current call-stack,
|
||||
hardware counter values, current memory usage, or timestamp
|
||||
|
||||
Capability
|
||||
Handles the implementation or orchestration of some feature which is used
|
||||
to collect measurements, for example, a component which handles setting up function wrappers
|
||||
around various functions such as ``pthread_create`` or ``MPI_Init``.
|
||||
|
||||
Components are designed to either hold no data at all or only the data for both an instantaneous
|
||||
measurement and a phase measurement.
|
||||
|
||||
Components which store data typically implement a static ``record()`` function
|
||||
for getting a record of the measurement,
|
||||
``start()`` and ``stop()`` member functions for calculating a phase measurement,
|
||||
and a ``sample()`` member function for storing an
|
||||
instantaneous measurement. In reality, there are several more "standard" functions
|
||||
but these are the most commonly used ones.
|
||||
|
||||
Components which do not store data might also have ``start()``, ``stop()``, and ``sample()``
|
||||
functions. However, components which
|
||||
implement function wrappers typically provide a call operator or ``audit(...)``
|
||||
functions. These are invoked with the
|
||||
wrapped function's arguments before the wrapped function gets called and with the return value
|
||||
after the wrapped function gets called.
|
||||
|
||||
.. note::
|
||||
|
||||
   The goal of this design is to provide relatively small, reusable, lightweight objects
   for recording measurements and implementing capabilities.
|
||||
|
||||
Wall-clock component example
|
||||
--------------------------------------
|
||||
|
||||
A component for computing the elapsed wall-clock time looks like this:
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
    #include <chrono>
    #include <cstdint>

    struct wall_clock
    {
        using value_type = int64_t;

        // take a raw reading of the monotonic clock (in clock ticks)
        static value_type record() noexcept
        {
            return std::chrono::steady_clock::now().time_since_epoch().count();
        }

        // store an instantaneous measurement
        void sample() noexcept
        {
            value = record();
        }

        // begin a phase measurement
        void start() noexcept
        {
            value = record();
        }

        // end a phase measurement and accumulate the elapsed ticks
        void stop() noexcept
        {
            auto _start_value = value;
            value = record();
            accum += (value - _start_value);
        }

    private:
        int64_t value = 0;
        int64_t accum = 0;
    };
|
||||
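
To show how ``start()`` and ``stop()`` accumulate a phase measurement over repeated phases, here is a small sketch. It is a simplified variant of the component above, not Omnitrace code: ``phase_timer``, its ``get()`` accessor, and ``time_two_phases`` are hypothetical additions, and the reading is converted to nanoseconds explicitly.

.. code-block:: cpp

    #include <cassert>
    #include <chrono>
    #include <cstdint>
    #include <thread>

    namespace sketch
    {
    struct phase_timer
    {
        using clock = std::chrono::steady_clock;

        // current monotonic time in nanoseconds
        static int64_t record() noexcept
        {
            return std::chrono::duration_cast<std::chrono::nanoseconds>(
                       clock::now().time_since_epoch())
                .count();
        }

        void    start() noexcept { value = record(); }
        void    stop() noexcept { accum += (record() - value); }
        int64_t get() const noexcept { return accum; }  // hypothetical accessor

    private:
        int64_t value = 0;
        int64_t accum = 0;
    };

    // Times two separate 2 ms phases and returns the accumulated nanoseconds.
    int64_t time_two_phases()
    {
        phase_timer timer{};
        for(int i = 0; i < 2; ++i)
        {
            timer.start();
            std::this_thread::sleep_for(std::chrono::milliseconds(2));
            timer.stop();
        }
        return timer.get();
    }
    }  // namespace sketch

Because the timer only adds the time between each ``start()`` and ``stop()`` pair, anything that runs between phases is excluded from the total.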
|
||||
Function wrapper component example
|
||||
--------------------------------------
|
||||
|
||||
A component which implements wrappers around ``fork()`` and ``exit(int)`` (and stores no data)
|
||||
could look like this:
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
    struct function_wrapper
    {
        pid_t operator()(const gotcha_data&, pid_t (*real_fork)())
        {
            // disable all collection before forking
            categories::disable_categories(config::get_enabled_categories());

            auto _pid_v = real_fork();

            // only re-enable collection on parent process
            if(_pid_v != 0)
                categories::enable_categories(config::get_enabled_categories());

            return _pid_v;
        }

        void operator()(const gotcha_data&, void (*real_exit)(int), int _exit_code)
        {
            // catch the call to exit and finalize before truly exiting
            omnitrace_finalize();

            real_exit(_exit_code);
        }
    };
|
||||
|
||||
Component member functions
|
||||
--------------------------------------
|
||||
|
||||
There are no real restrictions or requirements on the member functions a component needs to provide.
|
||||
Unless the component is being used directly, the invocation of component member functions via a "component bundler"
|
||||
(provided by Timemory) makes extensive use of template metaprogramming concepts. This finds the best match, if any,
|
||||
for calling a component's member function. This is a bit easier to demonstrate using an example:
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
    struct foo
    {
        void sample() { puts("foo::sample()"); }
    };

    struct bar
    {
        void sample(int) { puts("bar::sample(int)"); }
    };

    struct spam
    {
        void start() { puts("spam::start()"); }
        void stop() { puts("spam::stop()"); }
    };

    int main()
    {
        auto _bundle = component_tuple<foo, bar, spam>{ "main" };

        puts("A");
        _bundle.start();

        puts("B");
        _bundle.sample(10);

        puts("C");
        _bundle.sample();

        puts("D");
        _bundle.stop();
    }
|
||||
|
||||
When the preceding code runs, the following messages are printed:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
    A
    spam::start()
    B
    foo::sample()
    bar::sample(int)
    C
    foo::sample()
    D
    spam::stop()
|
||||
|
||||
In section A, the bundle determined that only the ``spam`` object has a ``start`` function. Since this is determined
|
||||
via template metaprogramming instead of dynamic polymorphism, this effectively omits any code related to
|
||||
the ``foo`` or ``bar`` objects. In section B, because the integer ``10`` is passed to the bundle,
|
||||
the bundle forwards this value to ``bar::sample(int)`` after it invokes ``foo::sample()``. ``foo::sample()`` is
|
||||
invoked because the bundle recognizes that the call to the ``sample`` member function is still possible without
|
||||
the argument.
|
||||
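
The "best match" selection described above can be imitated in a few lines of C++17. The sketch below is not Timemory's implementation. It is a simplified illustration in which ``has_sample``, ``try_sample``, and ``bundle`` are hypothetical names and only ``sample`` is dispatched.

.. code-block:: cpp

    #include <cassert>
    #include <string>
    #include <tuple>
    #include <type_traits>
    #include <utility>

    namespace sketch
    {
    std::string log;  // collects the names of the invoked member functions

    struct foo  { void sample()    { log += "foo::sample() "; } };
    struct bar  { void sample(int) { log += "bar::sample(int) "; } };
    struct spam { void start()     { log += "spam::start() "; } };

    // Detects at compile time whether `obj.sample(args...)` is a valid expression.
    template <typename T, typename Enable, typename... Args>
    struct has_sample_impl : std::false_type {};

    template <typename T, typename... Args>
    struct has_sample_impl<
        T, std::void_t<decltype(std::declval<T&>().sample(std::declval<Args>()...))>,
        Args...> : std::true_type {};

    template <typename T, typename... Args>
    constexpr bool has_sample = has_sample_impl<T, void, Args...>::value;

    // Prefer the overload taking the arguments, fall back to the no-argument
    // overload, and generate no call at all when neither exists.
    template <typename T, typename... Args>
    void try_sample(T& obj, Args&&... args)
    {
        if constexpr(has_sample<T, Args...>)
            obj.sample(std::forward<Args>(args)...);
        else if constexpr(has_sample<T>)
            obj.sample();
    }

    template <typename... Ts>
    struct bundle
    {
        std::tuple<Ts...> data;

        template <typename... Args>
        void sample(Args&&... args)
        {
            std::apply([&](auto&... obj) { (try_sample(obj, args...), ...); }, data);
        }
    };
    }  // namespace sketch

With ``sketch::bundle<sketch::foo, sketch::bar, sketch::spam>``, calling ``sample(10)`` reaches ``foo::sample()`` and ``bar::sample(int)``, while ``sample()`` reaches only ``foo::sample()``. No call is generated for ``spam`` in either case.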
|
||||
Memory model
|
||||
========================================
|
||||
|
||||
Collected data is generally handled in one of the three following ways:
|
||||
|
||||
* It is handed directly to, and stored by, Perfetto
|
||||
* It is managed implicitly by Timemory and accessed as needed
|
||||
* As thread-local data
|
||||
|
||||
In general, only instrumentation for relatively simple data is directly passed to
|
||||
Perfetto and/or Timemory during runtime.
|
||||
For example, the callbacks from binary instrumentation, user API instrumentation,
|
||||
and roctracer directly invoke
|
||||
calls to Perfetto or Timemory's storage model. Otherwise, the data is stored
|
||||
by Omnitrace in the thread-data model
|
||||
which is more persistent than simply using ``thread_local`` static data, which gets deleted
|
||||
when the thread stops.
|
||||
|
||||
Thread identification
|
||||
--------------------------------------
|
||||
|
||||
Each CPU thread is assigned two integral identifiers. One identifier, the ``internal_value``, is
|
||||
atomically incremented every time a new thread is created.
|
||||
The other identifier, known as the ``sequent_value``, tries to account for the fact that Omnitrace, Perfetto, ROCm, and other applications
|
||||
start background threads. When a thread is created as a by-product of Omnitrace,
|
||||
the index is offset by a large value. This serves
|
||||
two purposes:
|
||||
|
||||
* Accessing the data for threads created by the user is closer in memory
|
||||
* When log messages are printed, the index approximately correlates to the order of thread creation from the user's perspective.
|
||||
|
||||
The ``sequent_value`` identifier is typically used to access the thread-data.
|
||||
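
The two-identifier scheme can be sketched as follows. This is an illustration under assumptions rather than the Omnitrace implementation: ``assign_ids``, the three counters, the zero-based numbering, and the offset of ``1000000`` are all hypothetical.

.. code-block:: cpp

    #include <atomic>
    #include <cassert>
    #include <cstdint>

    namespace sketch
    {
    constexpr int64_t internal_offset = 1000000;  // hypothetical "large value"

    struct thread_ids
    {
        int64_t internal_value;  // incremented for every thread
        int64_t sequent_value;   // dense for user threads, offset for tool threads
    };

    std::atomic<int64_t> total_count{ 0 };
    std::atomic<int64_t> user_count{ 0 };
    std::atomic<int64_t> tool_count{ 0 };

    thread_ids assign_ids(bool created_by_tool)
    {
        thread_ids ids{};
        ids.internal_value = total_count++;
        ids.sequent_value =
            created_by_tool ? internal_offset + tool_count++ : user_count++;
        return ids;
    }
    }  // namespace sketch

User threads receive consecutive ``sequent_value`` indices even when tool threads are created in between, which keeps their data adjacent and their log indices in creation order.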
|
||||
Thread-data class
|
||||
--------------------------------------
|
||||
|
||||
Currently, most thread data is effectively stored in a static
``std::array<std::unique_ptr<T>, OMNITRACE_MAX_THREADS>`` instance.
``OMNITRACE_MAX_THREADS`` is a value defined at compile time and set to ``2048``
for release builds. During finalization, Omnitrace iterates through the thread-data
and transforms that data into something that can be passed along to Perfetto and/or Timemory.
The downside of the current model is that if the application creates more than
``OMNITRACE_MAX_THREADS`` threads, a segmentation fault occurs. To fix this issue,
a new model is being adopted which has all the benefits of this model
but permits dynamic expansion.
|
||||
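
A fixed-size store of this kind can be sketched as shown below. This is an assumption-laden illustration, not the Omnitrace thread-data class: ``thread_data``, ``instances``, and ``get`` are hypothetical names, and the bounds check is added here to show where the out-of-range access would otherwise happen.

.. code-block:: cpp

    #include <array>
    #include <cassert>
    #include <cstddef>
    #include <memory>
    #include <vector>

    namespace sketch
    {
    constexpr std::size_t max_threads = 2048;  // mirrors the release-build value quoted above

    template <typename T>
    struct thread_data
    {
        // one lazily created slot per thread index
        static std::array<std::unique_ptr<T>, max_threads>& instances()
        {
            static std::array<std::unique_ptr<T>, max_threads> data{};
            return data;
        }

        // returns nullptr instead of indexing past the end of the array
        static T* get(std::size_t sequent_value)
        {
            if(sequent_value >= max_threads) return nullptr;
            auto& slot = instances()[sequent_value];
            if(!slot) slot.reset(new T{});
            return slot.get();
        }
    };
    }  // namespace sketch

Because the array is static rather than ``thread_local``, a slot remains readable during finalization after its owning thread has exited.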
|
||||
Sampling model
|
||||
========================================
|
||||
|
||||
The general structure for the sampling is within Timemory (``source/timemory/sampling``).
|
||||
Currently, all sampling is done per-thread
|
||||
via POSIX timers. Omnitrace supports both a real-time timer and a CPU-time timer.
|
||||
Both have adjustable frequencies, delays, and durations.
|
||||
By default, only CPU-time sampling is enabled. Initial settings are inherited from
|
||||
the settings starting with ``OMNITRACE_SAMPLING_``.
|
||||
|
||||
For each type of timer, timer-specific settings can be used to
|
||||
override the common and inherited timer settings.
|
||||
These settings begin with ``OMNITRACE_SAMPLING_CPUTIME`` for the CPU-time sampler
|
||||
and ``OMNITRACE_SAMPLING_REALTIME`` for
|
||||
the real-time sampler. For example, ``OMNITRACE_SAMPLING_FREQ=500`` initially sets the
|
||||
sampling frequency to 500 interrupts per second. Adding the setting ``OMNITRACE_SAMPLING_REALTIME_FREQ=10``
|
||||
lowers the sampling frequency for the real-time sampler
|
||||
to 10 interrupts per second of real-time.
|
||||
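
The override rule can be expressed as a two-step lookup: the timer-specific setting wins, otherwise the common setting applies. The sketch below uses an in-memory map in place of the real configuration system. ``resolve`` and ``settings`` are hypothetical names, and the key layout simply follows the prefixes quoted above.

.. code-block:: cpp

    #include <cassert>
    #include <map>
    #include <string>

    namespace sketch
    {
    using settings = std::map<std::string, double>;

    // Looks up "OMNITRACE_SAMPLING_<timer>_<name>" first and falls back to
    // "OMNITRACE_SAMPLING_<name>", then to the supplied default.
    double resolve(const settings& cfg, const std::string& timer,
                   const std::string& name, double fallback)
    {
        const std::string prefix = "OMNITRACE_SAMPLING_";
        auto specific = cfg.find(prefix + timer + "_" + name);
        if(specific != cfg.end()) return specific->second;
        auto common = cfg.find(prefix + name);
        return (common != cfg.end()) ? common->second : fallback;
    }
    }  // namespace sketch

With ``OMNITRACE_SAMPLING_FREQ`` set to 500 and ``OMNITRACE_SAMPLING_REALTIME_FREQ`` set to 10, the lookup yields 10 for the ``REALTIME`` timer and 500 for the ``CPUTIME`` timer.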
|
||||
The Omnitrace-specific implementation can be found in
|
||||
`source/lib/omnitrace/library/sampling.cpp <https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/sampling.cpp>`_.
|
||||
Within `sampling.cpp <https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/sampling.cpp>`_,
|
||||
there is a bundle of three sampling components:
|
||||
|
||||
* `backtrace_timestamp <https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/components/backtrace_timestamp.hpp>`_ simply
|
||||
records the wall-clock time of the sample.
|
||||
* `backtrace <https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/components/backtrace.hpp>`_
|
||||
records the call-stack via libunwind.
|
||||
* `backtrace_metrics <https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/components/backtrace_metrics.hpp>`_
|
||||
records the sample metrics, such as peak RSS and the hardware counters.
|
||||
|
||||
These three components are bundled together in
|
||||
a tuple-like ``struct`` (``tuple<backtrace_timestamp, backtrace, backtrace_metrics>``).
|
||||
A buffer of at least 1024 instances of this tuple is mapped using ``mmap``
|
||||
per-thread. When this buffer is full,
|
||||
the sampler hands the buffer off to its allocator thread and maps a new buffer with ``mmap``
|
||||
before taking the next sample. The allocator thread takes this data
|
||||
and either dynamically stores it in memory or writes it to a file depending on the
|
||||
value of ``OMNITRACE_USE_TEMPORARY_FILES``.
|
||||
This schema avoids all allocations in the signal handler, lets the data grow
|
||||
dynamically, avoids potentially slow I/O within the signal handler, and also enables
|
||||
the capability of avoiding I/O altogether.
|
||||
The maximum number of samplers handled by each allocator is governed by the
|
||||
``OMNITRACE_SAMPLING_ALLOCATOR_SIZE`` setting (the default is eight). Whenever an allocator
|
||||
has reached its limit,
|
||||
a new internal thread is created to handle the new samplers.
|
||||
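
The buffer hand-off can be illustrated with a fixed-size buffer that is swapped out when full. This sketch only models the bookkeeping: heap allocation stands in for ``mmap``, a plain vector stands in for the allocator thread, and ``sampler``, ``sample``, and ``handed_off`` are hypothetical names, so it is not signal-safe and not the Omnitrace implementation.

.. code-block:: cpp

    #include <array>
    #include <cassert>
    #include <cstddef>
    #include <memory>
    #include <utility>
    #include <vector>

    namespace sketch
    {
    constexpr std::size_t buffer_size = 1024;

    struct sample
    {
        long timestamp;
    };

    using buffer = std::array<sample, buffer_size>;

    struct sampler
    {
        std::unique_ptr<buffer>              current{ new buffer{} };
        std::size_t                          count = 0;
        std::vector<std::unique_ptr<buffer>> handed_off;  // stands in for the allocator thread

        void record(long timestamp)
        {
            if(count == buffer_size)
            {
                // buffer is full: hand it off and start a fresh one
                handed_off.push_back(std::move(current));
                current.reset(new buffer{});
                count = 0;
            }
            (*current)[count++] = sample{ timestamp };
        }
    };
    }  // namespace sketch

Recording 2500 samples hands off two full buffers and leaves 452 samples in the active one.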
|
||||
Time-window constraint model
|
||||
========================================
|
||||
|
||||
With the recent introduction of tracing delay and duration, the
|
||||
`constraint namespace <https://github.com/ROCm/omnitrace/blob/main/source/lib/core/constraint.hpp>`_
|
||||
was introduced to improve the management of delays and duration limits for
|
||||
data collection. The ``spec`` class accepts a clock identifier, a delay value, a duration value, and an
|
||||
integer indicating how many times to repeat the delay and duration cycle. It is therefore
|
||||
possible to perform tasks such as periodically enabling tracing for brief periods
|
||||
of time in between long periods without data collection while the application runs. The
|
||||
syntax follows the format ``clock_identifier:delay:capture_duration:cycles``, so a value of
|
||||
``10:1:3`` for the last three parameters represents the following sequence of operations:
|
||||
|
||||
* Ten seconds where no data is collected, then one second where it is
|
||||
* Ten seconds where no data is collected, then one second where it is
|
||||
* Ten seconds where no data is collected, then one second where it is
|
||||
* Stop
|
||||
|
||||
As another example, ``OMNITRACE_TRACE_PERIODS = realtime:10:1:5 process_cputime:10:2:20`` translates
|
||||
to this sequence:
|
||||
|
||||
* Five cycles of: no data collection for ten seconds of real-time followed by one second of data collection
|
||||
* Twenty cycles of: no data collection for ten seconds of process CPU time followed by two CPU-time seconds of data collection
|
||||
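
The ``clock_identifier:delay:capture_duration:cycles`` format can be parsed with a few lines of standard C++. This is a sketch that assumes all four fields are always present. ``spec``, ``parse_spec``, and ``total_time`` are hypothetical names here and do not mirror the ``constraint`` namespace.

.. code-block:: cpp

    #include <cassert>
    #include <sstream>
    #include <stdexcept>
    #include <string>
    #include <vector>

    namespace sketch
    {
    struct spec
    {
        std::string clock;
        double      delay;
        double      duration;
        long        cycles;
    };

    // Splits on ':' and converts the three numeric fields.
    spec parse_spec(const std::string& text)
    {
        std::vector<std::string> fields;
        std::stringstream        stream{ text };
        for(std::string field; std::getline(stream, field, ':');)
            fields.push_back(field);
        if(fields.size() != 4)
            throw std::invalid_argument("expected clock:delay:duration:cycles");
        return spec{ fields[0], std::stod(fields[1]), std::stod(fields[2]),
                     std::stol(fields[3]) };
    }

    // Time covered by all cycles, in seconds of the selected clock.
    double total_time(const spec& s) { return s.cycles * (s.delay + s.duration); }
    }  // namespace sketch

For ``realtime:10:1:5`` this gives five cycles spanning 55 seconds of real-time, and for ``process_cputime:10:2:20`` it gives twenty cycles spanning 240 seconds of process CPU time.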
|
||||
Eventually, the goal is to migrate all subsets of data collection which currently support
|
||||
more rudimentary models of time window constraints, such as process sampling and causal profiling,
|
||||
to this model.
|
||||
@@ -0,0 +1,102 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
*******************
|
||||
Omnitrace Glossary
|
||||
*******************
|
||||
|
||||
This topic explains the terminology necessary to use Omnitrace.
|
||||
The list below provides a basic glossary for those who
|
||||
are new to binary instrumentation. It also clarifies ambiguities
|
||||
when certain terms have different
|
||||
contextual meanings, for example, the Omnitrace meaning of the term "module"
|
||||
when instrumenting Python.
|
||||
|
||||
**Binary**
|
||||
A file written in the Executable and Linkable Format (ELF). This is the standard file
|
||||
format for executable files, shared libraries, etc.
|
||||
|
||||
**Binary instrumentation**
|
||||
Inserting callbacks to instrumentation into an existing binary. This can be performed
|
||||
statically or dynamically.
|
||||
|
||||
**Static binary instrumentation**
|
||||
Loads an existing binary, determines instrumentation points, and generates a new binary
|
||||
with instrumentation directly embedded. It is applicable to executables and libraries but
|
||||
limited to only the functions defined in the binary. This is also known as **Binary rewrite**.
|
||||
|
||||
**Dynamic binary instrumentation**
|
||||
Loads an existing binary into memory, inserts instrumentation, and runs the binary.
|
||||
It is limited to executables but is capable of instrumenting linked libraries.
|
||||
This is also known as **Runtime instrumentation**.
|
||||
|
||||
**Statistical sampling**
|
||||
At periodic intervals, the application is paused and the current call-stack of the CPU
|
||||
is recorded along with various other metrics. It uses timers that measure either
|
||||
(A) real clock time or (B) the CPU time used by the current thread and the CPU time
|
||||
expended on behalf of the thread by the system. This is also known as simply **sampling**.
|
||||
|
||||
**Sampling rate**
|
||||
* The frequency at which (A) or (B) is triggered (in units of ``# interrupts / second``)
|
||||
* Higher values increase the number of samples
|
||||
|
||||
**Sampling delay**
|
||||
* How long to wait before (A) and (B) begin triggering at their designated rate
|
||||
|
||||
**Sampling duration**
|
||||
* The amount of time (in real-time) after the start of the application to record samples.
|
||||
* After this time limit has been reached, no more samples are recorded.
|
||||
|
||||
**Process sampling**
|
||||
At periodic (real-time) intervals, a background thread records global metrics without
|
||||
interrupting the current process. These metrics include, but are not limited to:
|
||||
CPU frequency, CPU memory high-water mark (i.e. peak memory usage), GPU temperature,
|
||||
and GPU power usage.
|
||||
|
||||
**Sampling rate**
|
||||
* The real-time frequency for recording metrics (in units of ``# measurements / second``)
|
||||
* Higher values increase the number of samples
|
||||
|
||||
**Sampling delay**
|
||||
* How long to wait (in real-time) before recording samples
|
||||
|
||||
**Sampling duration**
|
||||
* The amount of time (in real-time) after the start of the application to record samples.
|
||||
* After this time limit has been reached, no more samples are recorded.
|
||||
|
||||
**Module**
|
||||
With respect to binary instrumentation, a module is defined as either the filename
|
||||
(such as ``foo.c``) or library name (``libfoo.so``) which contains the definition
|
||||
of one or more functions.
|
||||
|
||||
With respect to Python instrumentation, a module is defined as the **file** which contains
|
||||
the definition of one or more functions. The full path to this file typically contains the
|
||||
name of the "Python module".
|
||||
|
||||
**Basic block**
|
||||
A straight-line code sequence with no branches in (except for the entry) and
|
||||
no branches out (except for the exit).
|
||||
|
||||
**Address range**
|
||||
The instructions for a function in a binary start at a certain address within the ELF file
and end at a certain address. The range is ``end - start``.
|
||||
|
||||
The address range is a decent approximation for the "cost" of a function.
|
||||
For example, a larger address range approximately equates to more instructions.
|
||||
|
||||
**Instrumentation traps**
|
||||
On the x86 architecture, because instructions are of variable size, an instruction
|
||||
might be too small for Dyninst to replace it with the normal code sequence
|
||||
used to call instrumentation. When instrumentation is placed at points other
|
||||
than subroutine entry, exit, or call points, traps may be used to ensure
|
||||
the instrumentation fits. (By default, ``omnitrace-instrument`` avoids instrumentation
|
||||
which requires a trap.)
|
||||
|
||||
**Overlapping functions**
|
||||
Due to language constructs or compiler optimizations, it might be possible for
|
||||
multiple functions to overlap (that is, share part of the same function body)
|
||||
or for a single function to have multiple entry points. In practice, it's
|
||||
impossible to determine the difference between multiple overlapping functions
|
||||
and a single function with multiple entry points. (By default, ``omnitrace-instrument``
|
||||
avoids instrumenting overlapping functions.)
|
||||
@@ -0,0 +1,70 @@
|
||||
# Anywhere {branch} is used, the branch name will be substituted.
|
||||
# These comments will also be removed.
|
||||
defaults:
|
||||
numbered: False
|
||||
maxdepth: 6
|
||||
root: index
|
||||
subtrees:
|
||||
- entries:
|
||||
- file: what-is-omnitrace.rst
|
||||
|
||||
- caption: Install
|
||||
entries:
|
||||
- file: install/quick-start.rst
|
||||
title: Omnitrace quick start
|
||||
- file: install/install.rst
|
||||
title: Omnitrace installation guide
|
||||
|
||||
- caption: Tutorials
|
||||
entries:
|
||||
- url: https://github.com/ROCm/omnitrace/tree/main/examples
|
||||
title: GitHub examples
|
||||
- file: tutorials/video-tutorials.rst
|
||||
title: Video tutorials
|
||||
|
||||
- caption: How to
|
||||
entries:
|
||||
- file: how-to/configuring-validating-environment.rst
|
||||
title: Configuring and validating the environment
|
||||
- file: how-to/configuring-runtime-options.rst
|
||||
title: Configuring runtime options
|
||||
- file: how-to/sampling-call-stack.rst
|
||||
title: Sampling the call stack
|
||||
- file: how-to/instrumenting-rewriting-binary-application.rst
|
||||
title: Instrumenting and rewriting a binary application
|
||||
- file: how-to/performing-causal-profiling.rst
|
||||
title: Performing causal profiling
|
||||
- file: how-to/understanding-omnitrace-output.rst
|
||||
title: Understanding the Omnitrace output
|
||||
- file: how-to/profiling-python-scripts.rst
|
||||
title: Profiling Python scripts
|
||||
- file: how-to/using-omnitrace-api.rst
|
||||
title: Using the Omnitrace API
|
||||
- file: how-to/general-tips-using-omnitrace.rst
|
||||
title: General tips for using Omnitrace
|
||||
|
||||
- caption: Conceptual
|
||||
entries:
|
||||
- file: conceptual/data-collection-modes.rst
|
||||
title: Data collection modes
|
||||
- file: conceptual/omnitrace-feature-set.rst
|
||||
title: The Omnitrace feature set and use cases
|
||||
|
||||
- caption: Reference
|
||||
entries:
|
||||
- file: reference/development-guide.rst
|
||||
title: Development guide
|
||||
- file: reference/omnitrace-glossary.rst
|
||||
title: Omnitrace glossary
|
||||
- file: doxygen/html/files
|
||||
title: API library
|
||||
- file: doxygen/html/functions
|
||||
title: Class member functions
|
||||
- file: doxygen/html/globals
|
||||
title: Globals
|
||||
- file: doxygen/html/annotated
|
||||
title: Classes, structures, and interfaces
|
||||
|
||||
- caption: About
|
||||
entries:
|
||||
- file: license.md
|
||||
@@ -0,0 +1 @@
|
||||
rocm-docs-core[api_reference]==1.4.1
|
||||
@@ -0,0 +1,169 @@
|
||||
#
|
||||
# This file is autogenerated by pip-compile with Python 3.10
|
||||
# by the following command:
|
||||
#
|
||||
# pip-compile requirements.in
|
||||
#
|
||||
accessible-pygments==0.0.5
|
||||
# via pydata-sphinx-theme
|
||||
alabaster==0.7.16
|
||||
# via sphinx
|
||||
babel==2.15.0
|
||||
# via
|
||||
# pydata-sphinx-theme
|
||||
# sphinx
|
||||
beautifulsoup4==4.12.3
|
||||
# via pydata-sphinx-theme
|
||||
breathe==4.35.0
|
||||
# via rocm-docs-core
|
||||
certifi==2024.6.2
|
||||
# via requests
|
||||
cffi==1.16.0
|
||||
# via
|
||||
# cryptography
|
||||
# pynacl
|
||||
charset-normalizer==3.3.2
|
||||
# via requests
|
||||
click==8.1.7
|
||||
# via
|
||||
# click-log
|
||||
# doxysphinx
|
||||
# sphinx-external-toc
|
||||
click-log==0.4.0
|
||||
# via doxysphinx
|
||||
cryptography==42.0.8
|
||||
# via pyjwt
|
||||
deprecated==1.2.14
|
||||
# via pygithub
|
||||
docutils==0.21.2
|
||||
# via
|
||||
# breathe
|
||||
# myst-parser
|
||||
# pydata-sphinx-theme
|
||||
# sphinx
|
||||
doxysphinx==3.3.9
|
||||
# via rocm-docs-core
|
||||
fastjsonschema==2.20.0
|
||||
# via rocm-docs-core
|
||||
gitdb==4.0.11
|
||||
# via gitpython
|
||||
gitpython==3.1.43
|
||||
# via rocm-docs-core
|
||||
idna==3.7
|
||||
# via requests
|
||||
imagesize==1.4.1
|
||||
# via sphinx
|
||||
jinja2==3.1.4
|
||||
# via
|
||||
# myst-parser
|
||||
# sphinx
|
||||
libsass==0.22.0
|
||||
# via doxysphinx
|
||||
lxml==4.9.4
|
||||
# via doxysphinx
|
||||
markdown-it-py==3.0.0
|
||||
# via
|
||||
# mdit-py-plugins
|
||||
# myst-parser
|
||||
markupsafe==2.1.5
|
||||
# via jinja2
|
||||
mdit-py-plugins==0.4.1
|
||||
# via myst-parser
|
||||
mdurl==0.1.2
|
||||
# via markdown-it-py
|
||||
mpire==2.10.2
|
||||
# via doxysphinx
|
||||
myst-parser==3.0.1
|
||||
# via rocm-docs-core
|
||||
numpy==1.26.4
|
||||
# via doxysphinx
|
||||
packaging==24.1
|
||||
# via
|
||||
# pydata-sphinx-theme
|
||||
# sphinx
|
||||
pycparser==2.22
|
||||
# via cffi
|
||||
pydata-sphinx-theme==0.15.4
|
||||
# via
|
||||
# rocm-docs-core
|
||||
# sphinx-book-theme
|
||||
pygithub==2.3.0
|
||||
# via rocm-docs-core
|
||||
pygments==2.18.0
|
||||
# via
|
||||
# accessible-pygments
|
||||
# mpire
|
||||
# pydata-sphinx-theme
|
||||
# sphinx
|
||||
pyjson5==1.6.6
|
||||
# via doxysphinx
|
||||
pyjwt[crypto]==2.8.0
|
||||
# via pygithub
|
||||
pynacl==1.5.0
|
||||
# via pygithub
|
||||
pyparsing==3.1.2
|
||||
# via doxysphinx
|
||||
pyyaml==6.0.1
|
||||
# via
|
||||
# myst-parser
|
||||
# rocm-docs-core
|
||||
# sphinx-external-toc
|
||||
requests==2.32.3
|
||||
# via
|
||||
# pygithub
|
||||
# sphinx
|
||||
rocm-docs-core[api_reference]==1.4.1
|
||||
# via -r requirements.in
|
||||
smmap==5.0.1
|
||||
# via gitdb
|
||||
snowballstemmer==2.2.0
|
||||
# via sphinx
|
||||
soupsieve==2.5
|
||||
# via beautifulsoup4
|
||||
sphinx==7.3.7
|
||||
# via
|
||||
# breathe
|
||||
# myst-parser
|
||||
# pydata-sphinx-theme
|
||||
# rocm-docs-core
|
||||
# sphinx-book-theme
|
||||
# sphinx-copybutton
|
||||
# sphinx-design
|
||||
# sphinx-external-toc
|
||||
# sphinx-notfound-page
|
||||
sphinx-book-theme==1.1.3
|
||||
# via rocm-docs-core
|
||||
sphinx-copybutton==0.5.2
|
||||
# via rocm-docs-core
|
||||
sphinx-design==0.6.0
|
||||
# via rocm-docs-core
|
||||
sphinx-external-toc==1.0.1
|
||||
# via rocm-docs-core
|
||||
sphinx-notfound-page==1.0.2
|
||||
# via rocm-docs-core
|
||||
sphinxcontrib-applehelp==1.0.8
|
||||
# via sphinx
|
||||
sphinxcontrib-devhelp==1.0.6
|
||||
# via sphinx
|
||||
sphinxcontrib-htmlhelp==2.0.5
|
||||
# via sphinx
|
||||
sphinxcontrib-jsmath==1.0.1
|
||||
# via sphinx
|
||||
sphinxcontrib-qthelp==1.0.7
|
||||
# via sphinx
|
||||
sphinxcontrib-serializinghtml==1.1.10
|
||||
# via sphinx
|
||||
tomli==2.0.1
|
||||
# via sphinx
|
||||
tqdm==4.66.4
|
||||
# via mpire
|
||||
typing-extensions==4.12.2
|
||||
# via
|
||||
# pydata-sphinx-theme
|
||||
# pygithub
|
||||
urllib3==2.2.2
|
||||
# via
|
||||
# pygithub
|
||||
# requests
|
||||
wrapt==1.16.0
|
||||
# via deprecated
|
||||
@@ -0,0 +1,35 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
****************************************************
|
||||
Video tutorials
|
||||
****************************************************
|
||||
|
||||
Installing a binary release
|
||||
========================================
|
||||
|
||||
.. raw:: html
|
||||
|
||||
<p align="center"><iframe width="560" height="315" src="https://www.youtube.com/embed/gKtNCKf1IXA?modestbranding=1" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>
|
||||
|
||||
Instrumenting a binary
|
||||
========================================
|
||||
|
||||
.. raw:: html
|
||||
|
||||
<p align="center"><iframe width="560" height="315" src="https://www.youtube.com/embed/2B0gRr3FygQ?modestbranding=1" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>
|
||||
|
||||
Writing an Omnitrace configuration file
|
||||
========================================
|
||||
|
||||
.. raw:: html
|
||||
|
||||
<p align="center"><iframe width="560" height="315" src="https://www.youtube.com/embed/oG_fPYx9_gs?modestbranding=1" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>
|
||||
|
||||
Visualization and features of Perfetto traces
|
||||
=============================================
|
||||
|
||||
.. raw:: html
|
||||
|
||||
<p align="center"><iframe width="560" height="315" src="https://www.youtube.com/embed/7WN3N1hnCbI?modestbranding=1" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>
|
||||
@@ -0,0 +1,28 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
******************
|
||||
What is Omnitrace?
|
||||
******************
|
||||
|
||||
Omnitrace is designed for the high-level profiling and comprehensive tracing
|
||||
of applications running on the CPU or the CPU and GPU. It supports dynamic binary
|
||||
instrumentation, call-stack sampling, and various other features for determining
|
||||
which function and line number are currently executing.
|
||||
|
||||
A visualization of the comprehensive Omnitrace results can be observed in any modern
|
||||
web browser. Upload the Perfetto (``.proto``) output files produced by Omnitrace at
|
||||
`ui.perfetto.dev <https://ui.perfetto.dev/>`_ to see the details.
|
||||
|
||||
Aggregated high-level results are available as human-readable text files and
|
||||
JSON files for programmatic analysis. The JSON output files are compatible with the
|
||||
`hatchet <https://github.com/hatchet/hatchet>`_ Python package. Hatchet converts
|
||||
the performance data into pandas data frames and facilitates multi-run comparisons, filtering,
|
||||
and visualization in Jupyter notebooks.
|
||||
|
||||
To use Omnitrace for instrumentation, follow these two configuration steps:
|
||||
|
||||
#. Indicate the functions and modules to :doc:`instrument <./how-to/instrumenting-rewriting-binary-application>` in the target binaries, including the executable and any libraries
|
||||
#. Specify the :doc:`instrumentation parameters <./how-to/configuring-runtime-options>` to use when the instrumented binaries are launched
|
||||
|
||||