Update branding to ROCm Systems Profiler in documentation (#2)
* Update branding in docs
* Rename image used in documentation
* Update names of code samples. In the code snippets, the "-" is not valid, e.g., rocprof-sys_ --> rocprofsys_
* Update ASCII art
* Update Doxyfile strip_from_path
* Add a "Formerly known as" message.
* Fixed typo in product name: ROCm Systems Profiler, not ROCm Profiler System
* Add "Omnitrace" back to the metadata keywords
* Update "install via package manager" section
* Update paths to user API files
* Rename configuration and environment settings
* Update Doxyfiles. Update publisher name & ID to "AMD". Update bundle ID to "rocprofiler-systems"
* Update docs/what-is-rocprof-sys.rst
  Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Update docs/conceptual/data-collection-modes.rst
  Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Update docs/tutorials/video-tutorials.rst
  Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Update docs/conceptual/rocprof-sys-feature-set.rst
  Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Update docs/how-to/configuring-runtime-options.rst
  Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Update docs/how-to/configuring-validating-environment.rst
  Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Update docs/how-to/general-tips-using-rocprof-sys.rst
  Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Update docs/reference/rocprof-sys-glossary.rst
  Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Update docs/reference/development-guide.rst
  Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Update docs/how-to/instrumenting-rewriting-binary-application.rst
  Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Update docs/install/quick-start.rst
  Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Note that videos were recorded using the "Omnitrace" name.
* Rebase and update some file paths
* Update paths to doc images
* Update Omnitrace references in code snippets
* Rename examples still using the "omni" prefix.
* Update docs/how-to/performing-causal-profiling.rst
  Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Update docs/how-to/profiling-python-scripts.rst
  Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Update docs/how-to/sampling-call-stack.rst
  Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Update docs/how-to/understanding-rocprof-sys-output.rst
  Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
* Update docs/install/install.rst
  Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

---------

Co-authored-by: Peter Park <peter.park@amd.com>
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
@@ -20,7 +20,7 @@ In addition to runtimes, ROCm Systems Profiler supports the collection of system
|
||||
such as the memory usage, page-faults, and context-switches, and thread-level metrics such as memory usage, CPU time, and numerous hardware counters.
|
||||
|
||||
> [!NOTE]
|
||||
> Full documentation is available at [ROCm Systems Profiler documentation](https://rocm.docs.amd.com/projects/omnitrace/en/latest/index.html) in an organized, easy-to-read, searchable format.
|
||||
> Full documentation is available at [ROCm Systems Profiler documentation](https://rocm.docs.amd.com/projects/rocprofiler-systems/en/latest/index.html) in an organized, easy-to-read, searchable format.
|
||||
The documentation source files reside in the [`/docs`](/docs) folder of this repository. For information on contributing to the documentation, see
|
||||
[Contribute to ROCm documentation](https://rocm.docs.amd.com/en/latest/contribute/contributing.html)
|
||||
|
||||
@@ -99,14 +99,14 @@ The documentation source files reside in the [`/docs`](/docs) folder of this rep
|
||||
If the above recommendation is not desired, download the `rocprofiler-systems-install.py` and specify `--prefix <install-directory>` when
|
||||
executing it. This script will attempt to auto-detect a compatible OS distribution and version.
|
||||
If ROCm support is desired, specify `--rocm X.Y` where `X` is the ROCm major version and `Y`
|
||||
is the ROCm minor version, e.g. `--rocm 5.4`.
|
||||
is the ROCm minor version, e.g. `--rocm 6.2`.
|
||||
|
||||
```console
|
||||
wget https://github.com/ROCm/rocprofiler-systems/releases/latest/download/rocprofiler-systems-install.py
|
||||
python3 ./rocprofiler-systems-install.py --prefix /opt/rocprofiler-systems/rocm-5.4 --rocm 5.4
|
||||
python3 ./rocprofiler-systems-install.py --prefix /opt/rocprofiler-systems --rocm 6.2
|
||||
```
|
||||
|
||||
See the [ROCm Systems Profiler installation guide](https://rocm.docs.amd.com/projects/omnitrace/en/latest/install/install.html) for detailed information.
|
||||
See the [ROCm Systems Profiler installation guide](https://rocm.docs.amd.com/projects/rocprofiler-systems/en/latest/install/install.html) for detailed information.
|
||||
|
||||
### Setup
|
||||
|
||||
@@ -295,13 +295,13 @@ for `foo` via the direct call within `spam`. There will be no entries for `bar`
|
||||
- Select "Open trace file" from the panel on the left
|
||||
- Locate the rocprofiler-systems perfetto output (extension: `.proto`)
|
||||
|
||||

|
||||

|
||||
|
||||

|
||||

|
||||
|
||||

|
||||

|
||||
|
||||

|
||||

|
||||
|
||||
## Using Perfetto tracing with System Backend
|
||||
|
||||
@@ -331,6 +331,7 @@ Configure rocprofiler-systems to use the perfetto system backend via the `--perf
|
||||
```shell
|
||||
# enable sampling on the uninstrumented binary
|
||||
rocprof-sys-run --sample --trace --perfetto-backend=system -- ./myapp
|
||||
|
||||
# instrument the binary, then trace it
|
||||
rocprof-sys-instrument -o ./myapp.inst -- ./myapp
|
||||
rocprof-sys-run --trace --perfetto-backend=system -- ./myapp.inst
|
||||
|
||||
@@ -1,17 +1,17 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
:description: ROCm Systems Profiler data collection modes documentation
|
||||
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, profiler, data collection, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
**********************
|
||||
Data collection modes
|
||||
**********************
|
||||
|
||||
Omnitrace supports several modes of recording trace and profiling data for your application.
|
||||
ROCm Systems Profiler supports several modes of recording trace and profiling data for your application.
|
||||
|
||||
.. note::
|
||||
|
||||
For an explanation of the terms used in this topic, see
|
||||
the :doc:`Omnitrace glossary <../reference/omnitrace-glossary>`.
|
||||
|
||||
For an explanation of the terms used in this topic, see
|
||||
the :doc:`ROCm Systems Profiler glossary <../reference/rocprof-sys-glossary>`.
|
||||
|
||||
+-----------------------------+---------------------------------------------------------+
|
||||
| Mode | Description |
|
||||
@@ -23,61 +23,62 @@ Omnitrace supports several modes of recording trace and profiling data for your
|
||||
| | and records various metrics for the given call stack |
|
||||
+-----------------------------+---------------------------------------------------------+
|
||||
| Callback APIs | Parallelism frameworks such as ROCm, OpenMP, and Kokkos |
|
||||
| | make callbacks into Omnitrace to provide information |
|
||||
| | about the work the API is performing |
|
||||
| | make callbacks into ROCm Systems Profiler to provide |
|
||||
| | information about the work the API is performing |
|
||||
+-----------------------------+---------------------------------------------------------+
|
||||
| Dynamic Symbol Interception | Wrap function symbols defined in a position independent |
|
||||
| | dynamic library/executable, like ``pthread_mutex_lock`` |
|
||||
| | in ``libpthread.so`` or ``MPI_Init`` in the MPI library |
|
||||
+-----------------------------+---------------------------------------------------------+
|
||||
| User API | User-defined regions and controls for Omnitrace |
|
||||
| User API | User-defined regions and controls for ROCm Systems |
|
||||
| | Profiler |
|
||||
+-----------------------------+---------------------------------------------------------+
|
||||
|
||||
The two most generic and important modes are binary instrumentation and statistical sampling.
|
||||
The two most generic and important modes are binary instrumentation and statistical sampling.
|
||||
It is important to understand their advantages and disadvantages.
|
||||
Binary instrumentation and statistical sampling can be performed with the ``omnitrace-instrument``
|
||||
Binary instrumentation and statistical sampling can be performed with the ``rocprof-sys-instrument``
|
||||
executable. For statistical sampling, it's highly recommended to use the
|
||||
``omnitrace-sample`` executable instead if binary instrumentation isn't required or needed.
|
||||
``rocprof-sys-sample`` executable instead if binary instrumentation isn't required or needed.
|
||||
Callback APIs and dynamic symbol interception can be utilized with either tool.
|
||||
|
||||
Binary instrumentation
|
||||
-----------------------------------
|
||||
|
||||
Binary instrumentation lets you record deterministic measurements for
|
||||
Binary instrumentation lets you record deterministic measurements for
|
||||
every single invocation of a given function.
|
||||
Binary instrumentation effectively adds instructions to the target application to
|
||||
collect the required information. It therefore has the potential to cause performance
|
||||
changes which might, in some cases, lead to inaccurate results. The effect depends on
|
||||
the information being collected and which features are activated in Omnitrace.
|
||||
Binary instrumentation effectively adds instructions to the target application to
|
||||
collect the required information. It therefore has the potential to cause performance
|
||||
changes which might, in some cases, lead to inaccurate results. The effect depends on
|
||||
the information being collected and which features are activated in ROCm Systems Profiler.
|
||||
For example, collecting only the wall-clock timing data
|
||||
has less of an effect than collecting the wall-clock timing, CPU-clock timing,
|
||||
memory usage, cache-misses, and number of instructions that were run. Similarly,
|
||||
collecting a flat profile has less overhead than a hierarchical profile
|
||||
and collecting a trace OR a profile has less overhead than collecting a
|
||||
has less of an effect than collecting the wall-clock timing, CPU-clock timing,
|
||||
memory usage, cache-misses, and number of instructions that were run. Similarly,
|
||||
collecting a flat profile has less overhead than a hierarchical profile
|
||||
and collecting a trace OR a profile has less overhead than collecting a
|
||||
trace AND a profile.
|
||||
|
||||
In Omnitrace, the primary heuristic for controlling the overhead with binary
|
||||
instrumentation is the minimum number of instructions for selecting functions
|
||||
In ROCm Systems Profiler, the primary heuristic for controlling the overhead with binary
|
||||
instrumentation is the minimum number of instructions for selecting functions
|
||||
for instrumentation.
|
||||
|
||||
Statistical sampling
|
||||
-----------------------------------
|
||||
|
||||
Statistical call-stack sampling periodically interrupts the application at
|
||||
Statistical call-stack sampling periodically interrupts the application at
|
||||
regular intervals using operating system interrupts.
|
||||
Sampling is typically less numerically accurate and specific, but the
|
||||
Sampling is typically less numerically accurate and specific, but the
|
||||
target program runs at nearly full speed.
|
||||
In contrast to the data derived from binary instrumentation, the resulting
|
||||
In contrast to the data derived from binary instrumentation, the resulting
|
||||
data is not exact but is instead a statistical approximation.
|
||||
However, sampling often provides a more accurate picture of the application
|
||||
However, sampling often provides a more accurate picture of the application
|
||||
execution because it is less intrusive to the target application and has fewer
|
||||
side effects on memory caches or instruction decoding pipelines. Furthermore,
|
||||
side effects on memory caches or instruction decoding pipelines. Furthermore,
|
||||
because sampling does not affect the execution speed as much, it is
|
||||
relatively immune to over-evaluating the cost of small, frequently called
|
||||
relatively immune to over-evaluating the cost of small, frequently called
|
||||
functions or "tight" loops.
|
||||
|
||||
In Omnitrace, the overhead for statistical sampling depends on the
|
||||
sampling rate and whether the samples are taken with respect to the CPU time
|
||||
In ROCm Systems Profiler, the overhead for statistical sampling depends on the
|
||||
sampling rate and whether the samples are taken with respect to the CPU time
|
||||
and/or real time.
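
To get a feel for how the sampling rate drives the amount of collected data, the following minimal sketch estimates the number of call-stack samples in a run. It assumes, for illustration only, that the rate is expressed in interrupts per second per thread. The helper ``estimated_samples`` and its parameters are hypothetical and are not part of ROCm Systems Profiler.

```python
def estimated_samples(freq_hz: float, duration_s: float, num_threads: int = 1) -> int:
    """Rough number of samples taken over a run (hypothetical helper).

    Assumes one sample per interrupt, per thread, at a constant rate.
    """
    if freq_hz < 0 or duration_s < 0 or num_threads < 0:
        raise ValueError("inputs must be non-negative")
    return int(freq_hz * duration_s * num_threads)


# Example: 50 interrupts per second, a 10 second run, 4 threads.
print(estimated_samples(50, 10, num_threads=4))
```

Doubling the rate doubles the estimate, which is why a higher sampling rate generally means more overhead and larger output.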
|
||||
|
||||
Binary instrumentation vs. statistical sampling example
|
||||
@@ -112,24 +113,24 @@ Consider the following code:
|
||||
return 0;
|
||||
}
|
||||
|
||||
Binary instrumentation of the ``fib`` function will record **every single invocation**
|
||||
Binary instrumentation of the ``fib`` function will record **every single invocation**
|
||||
of the function. For a very small function
|
||||
such as ``fib``, this results in **significant** overhead since this simple function
|
||||
such as ``fib``, this results in **significant** overhead since this simple function
|
||||
takes about 20 instructions, whereas the entry and
|
||||
exit snippets are ~1024 instructions. Therefore, you generally want to avoid
|
||||
exit snippets are ~1024 instructions. Therefore, you generally want to avoid
|
||||
instrumenting functions where the instrumented function has significantly fewer
|
||||
instructions than entry and exit instrumentation. (Note that many of the
|
||||
instructions than entry and exit instrumentation. (Note that many of the
|
||||
instructions in entry and exit functions are either logging functions or
|
||||
depend on the runtime settings and thus might never run). However,
|
||||
depend on the runtime settings and thus might never run). However,
|
||||
due to the number of potential instructions in the entry and exit snippets,
|
||||
the default behavior of ``omnitrace-instrument`` is to only instrument functions
|
||||
the default behavior of ``rocprof-sys-instrument`` is to only instrument functions
|
||||
which contain at least 1024 instructions.
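
The arithmetic behind this default can be sketched as follows. The figures (about 20 instructions for ``fib``, about 1024 for the entry and exit snippets, and a 1024-instruction minimum) come from the text above. The function names are hypothetical, and the real selection logic might take other criteria into account.

```python
def relative_overhead(function_instructions: int, snippet_instructions: int) -> float:
    """Ratio of instrumentation instructions to the function's own instructions."""
    if function_instructions <= 0:
        raise ValueError("function_instructions must be positive")
    return snippet_instructions / function_instructions


def should_instrument(function_instructions: int, min_instructions: int = 1024) -> bool:
    """Size-based filter: keep only functions at or above the minimum (assumed rule)."""
    return function_instructions >= min_instructions


# A ~20-instruction function wrapped in ~1024 instructions of entry/exit snippets.
print(relative_overhead(20, 1024))
print(should_instrument(20), should_instrument(1024))
```

In this simplified view, the worst case is that the snippets dominate a tiny function by a factor of roughly fifty, so it is filtered out by default.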
|
||||
|
||||
However, recording every single invocation of the function can be extremely
|
||||
However, recording every single invocation of the function can be extremely
|
||||
useful for detecting anomalies, such as profiles that show minimum or maximum values much smaller or larger
|
||||
than the average or a high standard deviation. In this case, the traces help you
|
||||
than the average or a high standard deviation. In this case, the traces help you
|
||||
identify exactly when and where those instances deviated from the norm.
|
||||
Compare the level of detail in the following traces. In the top image,
|
||||
Compare the level of detail in the following traces. In the top image,
|
||||
every instance of the ``fib`` function is instrumented, while in the bottom image,
|
||||
the ``fib`` call-stack is derived via sampling.
|
||||
|
||||
|
||||
@@ -1,14 +1,14 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
:description: ROCm Systems Profiler feature set documentation and reference
|
||||
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, profiler, feature set, use cases, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
***************************************
|
||||
The Omnitrace feature set and use cases
|
||||
The ROCm Systems Profiler feature set and use cases
|
||||
***************************************
|
||||
|
||||
`Omnitrace <https://github.com/ROCm/omnitrace>`_ is designed to be highly extensible.
|
||||
Internally, it leverages the `Timemory performance analysis toolkit <https://github.com/NERSC/timemory>`_
|
||||
to manage extensions, resources, data, and other items. It supports the following features,
|
||||
`ROCm Systems Profiler <https://github.com/ROCm/rocprofiler-systems>`_ is designed to be highly extensible.
|
||||
Internally, it leverages the `Timemory performance analysis toolkit <https://github.com/NERSC/timemory>`_
|
||||
to manage extensions, resources, data, and other items. It supports the following features,
|
||||
modes, metrics, and APIs.
|
||||
|
||||
Data collection modes
|
||||
@@ -22,11 +22,6 @@ Data collection modes
|
||||
* Statistical sampling: Periodic software interrupts per-thread
|
||||
* Process-level sampling: A background thread records process-, system- and device-level metrics while the application runs
|
||||
* Causal profiling: Quantifies the potential impact of optimizations in parallel code
|
||||
|
||||
.. note::
|
||||
|
||||
Critical trace support was removed in Omnitrace v1.11.0.
|
||||
It was replaced by the causal profiling feature.
|
||||
|
||||
Data analysis
|
||||
========================================
|
||||
@@ -98,40 +93,40 @@ Third-party API support
|
||||
* NVTX
|
||||
* ROCTX
|
||||
|
||||
Omnitrace use cases
|
||||
ROCm Systems Profiler use cases
|
||||
========================================
|
||||
|
||||
When analyzing the performance of an application, do NOT
|
||||
When analyzing the performance of an application, do NOT
|
||||
assume you know where the performance bottlenecks are
|
||||
and why they are happening. Omnitrace is a tool for analyzing the entire
|
||||
and why they are happening. ROCm Systems Profiler is a tool for analyzing the entire
|
||||
application and its performance. It is
|
||||
ideal for characterizing where optimization would have the greatest impact
|
||||
ideal for characterizing where optimization would have the greatest impact
|
||||
on an end-to-end run of the application and for
|
||||
viewing what else is happening on the system during a performance bottleneck.
|
||||
|
||||
When GPUs are involved, there is a tendency to assume that
|
||||
When GPUs are involved, there is a tendency to assume that
|
||||
the quickest path to performance improvement is minimizing
|
||||
the runtime of the GPU kernels. This is a highly flawed assumption.
|
||||
the runtime of the GPU kernels. This is a highly flawed assumption.
|
||||
If you optimize the runtime of a kernel from one millisecond
|
||||
to 1 microsecond (1000x speed-up) but the original application never
|
||||
to 1 microsecond (1000x speed-up) but the original application never
|
||||
spent time waiting for kernels to complete,
|
||||
there would be no statistically significant reduction in the end-to-end
|
||||
there would be no statistically significant reduction in the end-to-end
|
||||
runtime of your application. In other words, it does not matter
|
||||
how fast or slow the code on GPU is if the application has a
|
||||
how fast or slow the code on the GPU is if the application is not
bottlenecked waiting on the GPU.
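
The kernel example above can be made concrete with a toy model. This is a simplified sketch, not a measurement: it assumes the run is either fully overlapped (the host never waits for kernels) or fully serialized, and the 100 ms of host work is an invented figure.

```python
def end_to_end_ms(cpu_ms: float, kernel_ms: float, overlapped: bool) -> float:
    """Toy runtime model: overlapped work costs max(), serialized work costs the sum."""
    return max(cpu_ms, kernel_ms) if overlapped else cpu_ms + kernel_ms


# Kernel optimized from 1 ms to 1 microsecond (0.001 ms), host work fixed at 100 ms.
overlapped_before = end_to_end_ms(100.0, 1.0, overlapped=True)
overlapped_after = end_to_end_ms(100.0, 0.001, overlapped=True)
serialized_before = end_to_end_ms(100.0, 1.0, overlapped=False)
serialized_after = end_to_end_ms(100.0, 0.001, overlapped=False)

print(overlapped_before, overlapped_after)    # no change when the host never waits
print(serialized_before, serialized_after)    # about 1% gain even when fully serialized
```

In both cases the 1000x kernel speed-up barely moves the end-to-end time, because the kernel was never the dominant cost.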
|
||||
|
||||
Use Omnitrace to obtain a high-level view of the entire application. Use it
|
||||
Use ROCm Systems Profiler to obtain a high-level view of the entire application. Use it
|
||||
to determine where the performance bottlenecks are and
|
||||
obtain clues to why these bottlenecks are happening. Rather than worrying about kernel
|
||||
performance, start your investigation with Omnitrace, which characterizes the
|
||||
performance, start your investigation with ROCm Systems Profiler, which characterizes the
|
||||
broad picture.
|
||||
|
||||
.. note::
|
||||
|
||||
For insight into the execution of individual kernels on the GPU,
|
||||
use `Omniperf <https://github.com/rocm/omniperf>`_.
|
||||
For insight into the execution of individual kernels on the GPU,
|
||||
use `ROCm Compute Profiler <https://github.com/rocm/rocprofiler-compute>`_.
|
||||
|
||||
In terms of CPU analysis, Omnitrace does not target any specific vendor.
|
||||
In terms of CPU analysis, ROCm Systems Profiler does not target any specific vendor.
|
||||
It works just as well on AMD and non-AMD CPUs.
|
||||
With regard to the GPU, Omnitrace is currently restricted to HIP and HSA APIs
|
||||
With regard to the GPU, ROCm Systems Profiler is currently restricted to HIP and HSA APIs
|
||||
and kernels running on AMD GPUs.
|
||||
@@ -36,14 +36,14 @@ with open("../VERSION", encoding="utf-8") as f:
|
||||
raise ValueError("VERSION not found!")
|
||||
version_number = match[1]
|
||||
|
||||
external_projects_current_project = "omnitrace"
|
||||
external_projects_current_project = "rocprofiler-systems"
|
||||
|
||||
project = "omnitrace"
|
||||
project = "rocprofiler-systems"
|
||||
author = "Advanced Micro Devices, Inc."
|
||||
copyright = "Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved."
|
||||
version = version_number
|
||||
release = version_number
|
||||
html_title = f"Omnitrace {version} documentation"
|
||||
html_title = f"ROCm Systems Profiler {version} documentation"
|
||||
|
||||
external_toc_path = "./sphinx/_toc.yml"
|
||||
|
||||
|
||||
|
@@ -4,7 +4,7 @@
|
||||
# Project related configuration options
|
||||
#---------------------------------------------------------------------------
|
||||
DOXYFILE_ENCODING = UTF-8
|
||||
PROJECT_NAME = omnitrace
|
||||
PROJECT_NAME = rocprofiler-systems
|
||||
PROJECT_NUMBER = 1.11.3
|
||||
PROJECT_BRIEF = "High-level and comprehensive application tracing and profiling on both the CPU and GPU"
|
||||
PROJECT_LOGO =
|
||||
@@ -19,8 +19,8 @@ ABBREVIATE_BRIEF =
|
||||
ALWAYS_DETAILED_SEC = YES
|
||||
INLINE_INHERITED_MEMB = YES
|
||||
FULL_PATH_NAMES = YES
|
||||
STRIP_FROM_PATH = /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-omnitrace/checkouts/
|
||||
STRIP_FROM_INC_PATH = /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-omnitrace/checkouts/
|
||||
STRIP_FROM_PATH = /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-rocprofiler-systems/checkouts/
|
||||
STRIP_FROM_INC_PATH = /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-rocprofiler-systems/checkouts/
|
||||
SHORT_NAMES = NO
|
||||
JAVADOC_AUTOBRIEF = NO
|
||||
JAVADOC_BANNER = NO
|
||||
@@ -114,10 +114,10 @@ WARN_LOGFILE = doc/warnings.log
|
||||
# Configuration options related to the input files
|
||||
#---------------------------------------------------------------------------
|
||||
INPUT = ../../README.md \
|
||||
../../source/lib/omnitrace-user/omnitrace/types.h \
|
||||
../../source/lib/omnitrace-user/omnitrace/categories.h \
|
||||
../../source/lib/omnitrace-user/omnitrace/user.h \
|
||||
../../source/lib/omnitrace-user/omnitrace/causal.h
|
||||
../../source/lib/rocprof-sys-user/rocprofiler-systems/types.h \
|
||||
../../source/lib/rocprof-sys-user/rocprofiler-systems/categories.h \
|
||||
../../source/lib/rocprof-sys-user/rocprofiler-systems/user.h \
|
||||
../../source/lib/rocprof-sys-user/rocprofiler-systems/causal.h
|
||||
INPUT_ENCODING = UTF-8
|
||||
FILE_PATTERNS = *.h \
|
||||
*.hh \
|
||||
@@ -198,9 +198,9 @@ HTML_DYNAMIC_SECTIONS = YES
|
||||
HTML_INDEX_NUM_ENTRIES = 1000
|
||||
GENERATE_DOCSET = NO
|
||||
DOCSET_FEEDNAME = "Doxygen generated docs"
|
||||
DOCSET_BUNDLE_ID = org.doxygen.omnitrace
|
||||
DOCSET_PUBLISHER_ID = org.doxygen.amdresearch
|
||||
DOCSET_PUBLISHER_NAME = "Audacious Software Group"
|
||||
DOCSET_BUNDLE_ID = org.doxygen.rocprofiler-systems
|
||||
DOCSET_PUBLISHER_ID = org.doxygen.amd
|
||||
DOCSET_PUBLISHER_NAME = "Advanced Micro Devices, Inc."
|
||||
GENERATE_HTMLHELP = NO
|
||||
CHM_FILE =
|
||||
HHC_LOCATION =
|
||||
@@ -217,7 +217,7 @@ QHP_CUST_FILTER_ATTRS =
|
||||
QHP_SECT_FILTER_ATTRS =
|
||||
QHG_LOCATION =
|
||||
GENERATE_ECLIPSEHELP = NO
|
||||
ECLIPSE_DOC_ID = org.doxygen.omnitrace
|
||||
ECLIPSE_DOC_ID = org.doxygen.rocprofiler-systems
|
||||
DISABLE_INDEX = NO
|
||||
GENERATE_TREEVIEW = NO
|
||||
ENUM_VALUES_PER_LINE = 1
|
||||
@@ -311,7 +311,7 @@ ENABLE_PREPROCESSING = YES
|
||||
MACRO_EXPANSION = YES
|
||||
EXPAND_ONLY_PREDEF = NO
|
||||
SEARCH_INCLUDES = YES
|
||||
INCLUDE_PATH = ../../source/lib/omnitrace-user
|
||||
INCLUDE_PATH = ../../source/lib/rocprof-sys-user
|
||||
INCLUDE_FILE_PATTERNS = *.h \
|
||||
*.hpp
|
||||
PREDEFINED = ROCPROFSYS_PUBLIC_API= \
|
||||
|
||||
@@ -1,131 +1,133 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
:description: ROCm Systems Profiler runtime options documentation and reference
|
||||
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, runtime options, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
****************************************************
|
||||
Configuring runtime options
|
||||
****************************************************
|
||||
|
||||
The ``omnitrace.cfg`` file maintains a list of the `Omnitrace <https://github.com/ROCm/omnitrace>`_ runtime options. To create this configuration
|
||||
file and view the current runtime options, use the ``omnitrace-avail`` executable.
|
||||
The ``rocprof-sys.cfg`` file maintains a list of the
|
||||
`ROCm Systems Profiler <https://github.com/ROCm/rocprofiler-systems>`_ runtime
|
||||
options. To create this configuration
|
||||
file and view the current runtime options, use the ``rocprof-sys-avail`` executable.
|
||||
|
||||
The omnitrace-avail executable
|
||||
The rocprof-sys-avail executable
|
||||
========================================
|
||||
|
||||
The ``omnitrace-avail`` executable provides information about the runtime settings,
|
||||
The ``rocprof-sys-avail`` executable provides information about the runtime settings,
|
||||
data collection capabilities, and, when built with PAPI support, the
|
||||
available hardware counters. The executable is effectively
|
||||
self-updating. As new capabilities and settings are added to the Omnitrace source code, they are
|
||||
propagated to ``omnitrace-avail``. ``omnitrace-avail`` should be viewed as the ultimate authority
|
||||
self-updating. As new capabilities and settings are added to the ROCm Systems Profiler source code, they are
|
||||
propagated to ``rocprof-sys-avail``. ``rocprof-sys-avail`` should be viewed as the ultimate authority
|
||||
in the event of any conflicts with this documentation.
|
||||
|
||||
It is recommended that you create a default configuration file in
|
||||
``${HOME}/.omnitrace.cfg``. This can be done by
|
||||
running the command ``omnitrace-avail -G ~/.omnitrace.cfg``. Alternatively,
|
||||
use the ``omnitrace-avail -G ~/.omnitrace.cfg --all`` option
|
||||
It is recommended that you create a default configuration file in
|
||||
``${HOME}/.rocprof-sys.cfg``. This can be done by
|
||||
running the command ``rocprof-sys-avail -G ~/.rocprof-sys.cfg``. Alternatively,
|
||||
use the ``rocprof-sys-avail -G ~/.rocprof-sys.cfg --all`` option
|
||||
for a verbose configuration file with descriptions, categories, and additional information.
|
||||
|
||||
Modify ``${HOME}/.omnitrace.cfg`` as required. For example, enable `Perfetto <https://perfetto.dev/>`_,
|
||||
Modify ``${HOME}/.rocprof-sys.cfg`` as required. For example, enable `Perfetto <https://perfetto.dev/>`_,
|
||||
`Timemory <https://github.com/NERSC/timemory>`_, sampling, and process-level sampling by default
|
||||
and tweak the default sampling values.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
# ...
|
||||
OMNITRACE_TRACE = true
|
||||
OMNITRACE_PROFILE = true
|
||||
OMNITRACE_USE_SAMPLING = true
|
||||
OMNITRACE_USE_PROCESS_SAMPLING = true
|
||||
ROCPROFSYS_TRACE = true
|
||||
ROCPROFSYS_PROFILE = true
|
||||
ROCPROFSYS_USE_SAMPLING = true
|
||||
ROCPROFSYS_USE_PROCESS_SAMPLING = true
|
||||
# ...
|
||||
OMNITRACE_SAMPLING_FREQ = 50
|
||||
OMNITRACE_SAMPLING_CPUS = all
|
||||
OMNITRACE_SAMPLING_GPUS = $env:HIP_VISIBLE_DEVICES
|
||||
ROCPROFSYS_SAMPLING_FREQ = 50
|
||||
ROCPROFSYS_SAMPLING_CPUS = all
|
||||
ROCPROFSYS_SAMPLING_GPUS = $env:HIP_VISIBLE_DEVICES
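
To sanity-check a configuration file before a run, a short script can read the ``KEY = value`` lines shown above. The sketch below is not the parser used by ROCm Systems Profiler. It assumes that lines starting with ``#`` are comments and that a value of the form ``$env:NAME`` refers to the environment variable ``NAME``. The function ``parse_config`` is hypothetical.

```python
import os


def parse_config(text: str, env=None) -> dict:
    """Parse 'KEY = value' lines into a dict (illustrative only).

    Assumptions: '#' starts a comment line, and '$env:NAME' is replaced by the
    value of the environment variable NAME (empty string if it is unset).
    """
    env = os.environ if env is None else env
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY = value line
        value = value.strip()
        if value.startswith("$env:"):
            value = env.get(value[len("$env:"):], "")
        settings[key.strip()] = value
    return settings


sample = """
# ...
ROCPROFSYS_TRACE = true
ROCPROFSYS_SAMPLING_FREQ = 50
ROCPROFSYS_SAMPLING_GPUS = $env:HIP_VISIBLE_DEVICES
"""
cfg = parse_config(sample, env={"HIP_VISIBLE_DEVICES": "0,1"})
print(cfg)
```

Passing ``env`` explicitly keeps the example deterministic. Omitting it falls back to the current process environment.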
|
||||
|
||||
Exploring runtime settings
|
||||
-----------------------------------
|
||||
|
||||
Use the following command to view the list of the available runtime settings, their current values, and descriptions
|
||||
Use the following command to view the list of the available runtime settings, their current values, and descriptions
|
||||
for each setting:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-avail --description
|
||||
rocprof-sys-avail --description
|
||||
|
||||
.. note::
|
||||
|
||||
Use ``--brief`` to suppress printing the current value and/or ``-c 0`` to suppress truncation of the descriptions.
|
||||
|
||||
Any Boolean setting (``omnitrace-avail --settings --value --brief --filter bool``)
|
||||
accepts a case insensitive match for nearly all common Boolean logic expressions:
|
||||
Any Boolean setting (``rocprof-sys-avail --settings --value --brief --filter bool``)
|
||||
accepts a case insensitive match for nearly all common Boolean logic expressions:
|
||||
``ON``, ``OFF``, ``YES``, ``NO``, ``TRUE``, ``FALSE``, ``0``, ``1``, etc.
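
The case-insensitive matching described above can be illustrated with a small sketch. It covers only the spellings listed in the text; the tool accepts more ("etc."), and ``parse_bool`` is a hypothetical helper rather than part of ROCm Systems Profiler.

```python
_TRUE_VALUES = {"on", "yes", "true", "1"}
_FALSE_VALUES = {"off", "no", "false", "0"}


def parse_bool(value: str) -> bool:
    """Case-insensitive Boolean parsing for the spellings listed above."""
    normalized = value.strip().lower()
    if normalized in _TRUE_VALUES:
        return True
    if normalized in _FALSE_VALUES:
        return False
    raise ValueError(f"not a recognized Boolean value: {value!r}")


print([parse_bool(v) for v in ("ON", "yes", "True", "1")])
print([parse_bool(v) for v in ("OFF", "no", "False", "0")])
```

Unrecognized spellings raise ``ValueError`` in this sketch so that typos are caught early. How the tool itself treats them is not specified here.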
|
||||
|
||||
Exploring components
|
||||
-----------------------------------
|
||||
|
||||
Omnitrace uses `Timemory <https://github.com/NERSC/timemory>`_ extensively to provide
|
||||
ROCm Systems Profiler uses `Timemory <https://github.com/NERSC/timemory>`_ extensively to provide
|
||||
various capabilities and manage
|
||||
data and resources. By default, with ``OMNITRACE_PROFILE=ON``, Omnitrace only collects wall-clock
|
||||
timing values. However, by modifying the ``OMNITRACE_TIMEMORY_COMPONENTS`` setting,
|
||||
Omnitrace can be configured to
|
||||
data and resources. By default, with ``ROCPROFSYS_PROFILE=ON``, ROCm Systems Profiler only collects wall-clock
|
||||
timing values. However, by modifying the ``ROCPROFSYS_TIMEMORY_COMPONENTS`` setting,
|
||||
ROCm Systems Profiler can be configured to
|
||||
collect hardware counters, CPU-clock timers, memory usage, context switches, page faults, network statistics,
|
||||
and much more. Omnitrace can even be used as a dynamic instrumentation vehicle
|
||||
and much more. ROCm Systems Profiler can even be used as a dynamic instrumentation vehicle
|
||||
for other third-party profiling
|
||||
APIs such as `Caliper <https://github.com/LLNL/Caliper>`_ and `LIKWID <https://github.com/RRZE-HPC/likwid>`_.
|
||||
To leverage this capability, build Omnitrace from source with the CMake
|
||||
To leverage this capability, build ROCm Systems Profiler from source with the CMake
|
||||
options ``TIMEMORY_USE_CALIPER=ON`` or ``TIMEMORY_USE_LIKWID=ON`` and then add
|
||||
``caliper_marker``, ``likwid_marker``, or both to ``OMNITRACE_TIMEMORY_COMPONENTS``.
|
||||
``caliper_marker``, ``likwid_marker``, or both to ``ROCPROFSYS_TIMEMORY_COMPONENTS``.
|
||||
|
||||
To view all possible components and their descriptions:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-avail --components --description
|
||||
rocprof-sys-avail --components --description
|
||||
|
||||
To restrict the output to available components and view the string identifiers for ``OMNITRACE_TIMEMORY_COMPONENTS``:
|
||||
To restrict the output to available components and view the string identifiers for ``ROCPROFSYS_TIMEMORY_COMPONENTS``:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-avail --components --available --string --brief
|
||||
rocprof-sys-avail --components --available --string --brief
|
||||
|
||||
Exploring hardware counters
|
||||
-----------------------------------
|
||||
|
||||
Omnitrace supports hardware counter collection via PAPI and ROCm.
|
||||
ROCm Systems Profiler supports hardware counter collection via PAPI and ROCm.
|
||||
Generally, PAPI is used to collect CPU-based hardware counters and ROCm is used to collect GPU-based hardware
|
||||
counters. Although it is possible to install PAPI with ROCm support and use it to
|
||||
collect GPU-based hardware counters, this is not recommended because PAPI
|
||||
counters. Although it is possible to install PAPI with ROCm support and use it to
|
||||
collect GPU-based hardware counters, this is not recommended because PAPI
|
||||
cannot simultaneously collect CPU and GPU hardware counters.
|
||||
|
||||
To view all possible hardware counters and their descriptions, use the following command:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-avail --hw-counters --description
|
||||
rocprof-sys-avail --hw-counters --description
|
||||
|
||||
Appending the ``-c CPU`` option restricts the list of hardware counters to
|
||||
Appending the ``-c CPU`` option restricts the list of hardware counters to
|
||||
those available through PAPI, while ``-c GPU`` limits the list to those available from ROCm.
|
||||
|
||||
Enabling hardware counters
|
||||
-----------------------------------
|
||||
|
||||
PAPI hardware counters are configured with the ``OMNITRACE_PAPI_EVENTS`` configuration variable.
|
||||
ROCm hardware counters are configured with the ``OMNITRACE_ROCM_EVENTS`` configuration variable.
|
||||
ROCm hardware counters also require the ``OMNITRACE_USE_ROCPROFILER`` configuration
|
||||
variable to be enabled using ``OMNITRACE_USE_ROCPROFILER=ON``.
|
||||
PAPI hardware counters are configured with the ``ROCPROFSYS_PAPI_EVENTS`` configuration variable.
|
||||
ROCm hardware counters are configured with the ``ROCPROFSYS_ROCM_EVENTS`` configuration variable.
|
||||
ROCm hardware counters also require the ``ROCPROFSYS_USE_ROCPROFILER`` configuration
|
||||
variable to be enabled using ``ROCPROFSYS_USE_ROCPROFILER=ON``.
|
||||
|
||||
Here is a sample configuration for hardware counters:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
# using papi identifiers
|
||||
OMNITRACE_PAPI_EVENTS = PAPI_TOT_CYC PAPI_TOT_INS
|
||||
ROCPROFSYS_PAPI_EVENTS = PAPI_TOT_CYC PAPI_TOT_INS
|
||||
|
||||
# using perf identifiers
|
||||
OMNITRACE_PAPI_EVENTS = perf::INSTRUCTIONS perf::CACHE-REFERENCES perf::CACHE-MISSES
|
||||
ROCPROFSYS_PAPI_EVENTS = perf::INSTRUCTIONS perf::CACHE-REFERENCES perf::CACHE-MISSES
|
||||
|
||||
.. _omnitrace_papi_events:
|
||||
.. _rocprof-sys_papi_events:
|
||||
|
||||
OMNITRACE_PAPI_EVENTS
|
||||
ROCPROFSYS_PAPI_EVENTS
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
In order to collect the majority of hardware counters via PAPI, ensure the ``/proc/sys/kernel/perf_event_paranoid``
|
||||
@@ -135,18 +137,18 @@ has a value <= 2. If you have ``sudo`` access, use the following command to modi
|
||||
|
||||
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid
|
||||
|
||||
However, this value is not retained upon reboot.
|
||||
However, this value is not retained upon reboot.
|
||||
Use the following command to preserve this setting after a reboot:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
echo 'kernel.perf_event_paranoid=0' | sudo tee -a /etc/sysctl.conf
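Before changing the setting, it can help to confirm whether the current value already satisfies the ``<= 2`` requirement described above. The following is a minimal POSIX shell sketch; the function name ``paranoid_ok`` is hypothetical and is not part of ROCm Systems Profiler.

```shell
# Sketch: report whether a perf_event_paranoid value permits PAPI collection.
# The threshold (<= 2) comes from the text above; paranoid_ok is a hypothetical helper.
paranoid_ok() {
    # Strip one optional leading minus sign, then require digits only.
    paranoid_digits="${1#-}"
    case "$paranoid_digits" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    # Exit status 0 when the value is no greater than 2.
    [ "$1" -le 2 ]
}

paranoid_file=/proc/sys/kernel/perf_event_paranoid
if [ -r "$paranoid_file" ]; then
    current="$(cat "$paranoid_file")"
    if paranoid_ok "$current"; then
        echo "perf_event_paranoid=${current}: OK"
    else
        echo "perf_event_paranoid=${current}: too restrictive"
    fi
fi
```

On systems without ``/proc/sys/kernel/perf_event_paranoid``, the sketch prints nothing.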
|
||||
|
||||
PAPI events use a concept similar to a namespace. All specified hardware
|
||||
PAPI events use a concept similar to a namespace. All specified hardware
|
||||
counters must be from the same namespace.
|
||||
Hardware counters starting with the ``PAPI_`` prefix are high-level
|
||||
Hardware counters starting with the ``PAPI_`` prefix are high-level
|
||||
aggregates of multiple hardware counters.
|
||||
Otherwise, most events use two or three colons (``::`` or ``:::``) between the
|
||||
Otherwise, most events use two or three colons (``::`` or ``:::``) between the
|
||||
component name and the counter name, for example,
|
||||
``amd64_rapl::RAPL_ENERGY_PKG`` and ``perf::PERF_COUNT_HW_CPU_CYCLES``.
|
||||
|
||||
@@ -154,33 +156,33 @@ For example, the following is a valid configuration:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
OMNITRACE_PAPI_EVENTS = perf::INSTRUCTIONS perf::CACHE-REFERENCES perf::CACHE-MISSES
|
||||
ROCPROFSYS_PAPI_EVENTS = perf::INSTRUCTIONS perf::CACHE-REFERENCES perf::CACHE-MISSES
|
||||
|
||||
However, the following roughly equivalent set of hardware counters is an invalid configuration because it mixes
|
||||
PAPI components from different namespaces:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
OMNITRACE_PAPI_EVENTS = PAPI_TOT_INS perf::CACHE-REFERENCES perf::CACHE-MISSES
|
||||
ROCPROFSYS_PAPI_EVENTS = PAPI_TOT_INS perf::CACHE-REFERENCES perf::CACHE-MISSES
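The same-namespace rule can be checked before a run. The sketch below is only an illustration: the helper names ``papi_namespace`` and ``same_papi_namespace`` are hypothetical, and treating every ``PAPI_``-prefixed preset as one namespace called ``PAPI`` is an assumption drawn from the examples above, not from the PAPI library itself.

```shell
# Sketch: derive the namespace of a PAPI event name (hypothetical helper).
papi_namespace() {
    case "$1" in
        PAPI_*) echo "PAPI" ;;            # assumption: presets share one namespace
        *::*)   echo "${1%%::*}" ;;       # text before the first "::" (or ":::")
        *)      echo "unknown" ;;
    esac
}

# Sketch: succeed only when every event given as an argument shares one namespace.
same_papi_namespace() {
    first_ns=""
    for event in "$@"; do
        ns="$(papi_namespace "$event")"
        if [ -z "$first_ns" ]; then
            first_ns="$ns"
        elif [ "$ns" != "$first_ns" ]; then
            return 1
        fi
    done
    return 0
}

if same_papi_namespace PAPI_TOT_INS perf::CACHE-REFERENCES perf::CACHE-MISSES; then
    echo "events share one namespace"
else
    echo "events mix namespaces"
fi
```

For the invalid configuration shown above, the sketch prints ``events mix namespaces``.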
|
||||
|
||||
.. note::
|
||||
|
||||
If Omnitrace was configured with the default ``OMNITRACE_BUILD_PAPI=ON`` setting,
|
||||
If ROCm Systems Profiler was configured with the default ``ROCPROFSYS_BUILD_PAPI=ON`` setting,
|
||||
standard PAPI command-line tools such as
|
||||
``papi_avail`` and ``papi_event_chooser`` are not able to provide information
|
||||
about the PAPI library used by Omnitrace
|
||||
(because Omnitrace statically links to ``libpapi``). However, all of these tools are
|
||||
installed with the prefix ``omnitrace-`` with
|
||||
underscores replaced with hyphens, for example, ``papi_avail`` becomes ``omnitrace-papi-avail``.
|
||||
``papi_avail`` and ``papi_event_chooser`` are not able to provide information
|
||||
about the PAPI library used by ROCm Systems Profiler
|
||||
(because ROCm Systems Profiler statically links to ``libpapi``). However, all of these tools are
|
||||
installed with the prefix ``rocprof-sys-`` with
|
||||
underscores replaced with hyphens, for example, ``papi_avail`` becomes ``rocprof-sys-papi-avail``.
|
||||
|
||||
OMNITRACE_ROCM_EVENTS
|
||||
ROCPROFSYS_ROCM_EVENTS
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Omnitrace reads the ROCm events from the ``${ROCM_PATH}/lib/rocprofiler/metrics.xml``
|
||||
ROCm Systems Profiler reads the ROCm events from the ``${ROCM_PATH}/lib/rocprofiler/metrics.xml``
|
||||
file. Use the ``ROCP_METRICS`` environment
|
||||
variable to point Omnitrace to a different XML metrics file, for example,
|
||||
variable to point ROCm Systems Profiler to a different XML metrics file, for example,
|
||||
``export ROCP_METRICS=${PWD}/custom_metrics.xml``.
|
||||
``omnitrace-avail -H -c GPU`` shows event names with a suffix of ``:device=N``
|
||||
``rocprof-sys-avail -H -c GPU`` shows event names with a suffix of ``:device=N``
|
||||
where ``N`` is the device number.
|
||||
For example, if you have two devices, the output is:
|
||||
|
||||
@@ -190,7 +192,7 @@ For example, if you have two devices, the output is:
|
||||
...
|
||||
| Wavefronts:device=1 | Derived counter: SQ_WAVES |
|
||||
|
||||
To collect the event on all devices, specify the event,
|
||||
To collect the event on all devices, specify the event,
|
||||
such as ``Wavefronts``, without the ``:device=`` suffix.
|
||||
To collect the event only on specific devices, use the ``:device=`` suffix.
|
||||
|
||||
@@ -202,12 +204,12 @@ The following example:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
OMNITRACE_ROCM_EVENTS = GPUBusy SQ_WAVES:device=0 SQ_INSTS_VALU:device=1
|
||||
ROCPROFSYS_ROCM_EVENTS = GPUBusy SQ_WAVES:device=0 SQ_INSTS_VALU:device=1
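To make the ``:device=`` convention concrete, the following sketch splits each entry of such a list into an event name and a device selector. The helper names are hypothetical, and the label ``all`` for entries without a suffix is only this sketch's way of expressing "collected on all devices".

```shell
# Sketch: event name without any ":device=N" suffix (hypothetical helper).
rocm_event_name() {
    echo "${1%%:device=*}"
}

# Sketch: device number from the suffix, or "all" when no suffix is present.
rocm_event_device() {
    case "$1" in
        *:device=*) echo "${1##*:device=}" ;;
        *)          echo "all" ;;
    esac
}

for spec in GPUBusy SQ_WAVES:device=0 SQ_INSTS_VALU:device=1; do
    echo "$(rocm_event_name "$spec") -> device $(rocm_event_device "$spec")"
done
```

Run against the list above, the loop reports ``GPUBusy`` for all devices, ``SQ_WAVES`` for device 0, and ``SQ_INSTS_VALU`` for device 1.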
|
||||
|
||||
omnitrace-avail examples
|
||||
rocprof-sys-avail examples
|
||||
-----------------------------------
|
||||
|
||||
The following examples demonstrate how to use ``omnitrace-avail`` to perform several common
|
||||
The following examples demonstrate how to use ``rocprof-sys-avail`` to perform several common
|
||||
configuration tasks.
|
||||
|
||||
Generating a default configuration file
|
||||
@@ -215,96 +217,96 @@ Generating a default configuration file
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omnitrace-avail -G ~/.omnitrace.cfg
|
||||
[omnitrace-avail] Outputting text configuration file '/home/user/.omnitrace.cfg'...
|
||||
$ cat ~/.omnitrace.cfg
|
||||
# auto-generated by omnitrace-avail (version 1.2.0) on 2022-06-27 @ 19:15
|
||||
$ rocprof-sys-avail -G ~/.rocprof-sys.cfg
|
||||
[rocprof-sys-avail] Outputting text configuration file '/home/user/.rocprof-sys.cfg'...
|
||||
$ cat ~/.rocprof-sys.cfg
|
||||
# auto-generated by rocprof-sys-avail (version 1.2.0) on 2022-06-27 @ 19:15
|
||||
|
||||
OMNITRACE_CONFIG_FILE =
|
||||
OMNITRACE_MODE = trace
|
||||
OMNITRACE_TRACE = true
|
||||
OMNITRACE_PROFILE = false
|
||||
OMNITRACE_USE_SAMPLING = false
|
||||
OMNITRACE_USE_PROCESS_SAMPLING = true
|
||||
OMNITRACE_USE_ROCTRACER = true
|
||||
OMNITRACE_USE_ROCM_SMI = true
|
||||
OMNITRACE_USE_KOKKOSP = false
|
||||
OMNITRACE_USE_CODE_COVERAGE = false
|
||||
OMNITRACE_USE_PID = true
|
||||
OMNITRACE_OUTPUT_PATH = omnitrace-%tag%-output
|
||||
OMNITRACE_OUTPUT_PREFIX =
|
||||
OMNITRACE_CI = false
|
||||
OMNITRACE_THREAD_POOL_SIZE = 8
|
||||
OMNITRACE_DEBUG = false
|
||||
OMNITRACE_DL_VERBOSE = 0
|
||||
OMNITRACE_INSTRUMENTATION_INTERVAL = 1
|
||||
OMNITRACE_KOKKOSP_KERNEL_LOGGER = false
|
||||
OMNITRACE_PAPI_EVENTS = PAPI_TOT_CYC
|
||||
OMNITRACE_PERFETTO_BACKEND = inprocess
|
||||
OMNITRACE_PERFETTO_BUFFER_SIZE_KB = 1024000
|
||||
OMNITRACE_PERFETTO_COMBINE_TRACES = false
|
||||
OMNITRACE_PERFETTO_FILE = perfetto-trace.proto
|
||||
OMNITRACE_PERFETTO_FILL_POLICY = discard
|
||||
OMNITRACE_PERFETTO_SHMEM_SIZE_HINT_KB = 4096
|
||||
OMNITRACE_ROCTRACER_HSA_ACTIVITY = false
|
||||
OMNITRACE_ROCTRACER_HSA_API = false
|
||||
OMNITRACE_ROCTRACER_HSA_API_TYPES =
|
||||
OMNITRACE_SAMPLING_CPUS =
|
||||
OMNITRACE_SAMPLING_DELAY = 0.5
|
||||
OMNITRACE_SAMPLING_FREQ = 10
|
||||
OMNITRACE_SAMPLING_GPUS = all
|
||||
OMNITRACE_TIME_OUTPUT = true
|
||||
OMNITRACE_TIMEMORY_COMPONENTS = wall_clock
|
||||
OMNITRACE_TRACE_THREAD_LOCKS = false
|
||||
OMNITRACE_VERBOSE = 0
|
||||
OMNITRACE_COLLAPSE_PROCESSES = false
|
||||
OMNITRACE_COLLAPSE_THREADS = false
|
||||
OMNITRACE_COUT_OUTPUT = false
|
||||
OMNITRACE_CPU_AFFINITY = false
|
||||
OMNITRACE_DIFF_OUTPUT = false
|
||||
OMNITRACE_ENABLE_SIGNAL_HANDLER = true
|
||||
OMNITRACE_ENABLED = true
|
||||
OMNITRACE_FILE_OUTPUT = true
|
||||
OMNITRACE_FLAT_PROFILE = false
|
||||
OMNITRACE_INPUT_EXTENSIONS = json,xml
|
||||
OMNITRACE_INPUT_PATH =
|
||||
OMNITRACE_INPUT_PREFIX =
|
||||
OMNITRACE_JSON_OUTPUT = true
|
||||
OMNITRACE_MAX_DEPTH = 65535
|
||||
OMNITRACE_MAX_WIDTH = 120
|
||||
OMNITRACE_MEMORY_PRECISION = -1
|
||||
OMNITRACE_MEMORY_SCIENTIFIC = false
|
||||
OMNITRACE_MEMORY_UNITS = MB
|
||||
OMNITRACE_MEMORY_WIDTH = -1
|
||||
OMNITRACE_NETWORK_INTERFACE =
|
||||
OMNITRACE_NODE_COUNT = 0
|
||||
OMNITRACE_PAPI_FAIL_ON_ERROR = false
|
||||
OMNITRACE_PAPI_MULTIPLEXING = false
|
||||
OMNITRACE_PAPI_OVERFLOW = 0
|
||||
OMNITRACE_PAPI_QUIET = false
|
||||
OMNITRACE_PAPI_THREADING = true
|
||||
OMNITRACE_PRECISION = -1
|
||||
OMNITRACE_SCIENTIFIC = false
|
||||
OMNITRACE_STRICT_CONFIG = true
|
||||
OMNITRACE_SUPPRESS_CONFIG = true
|
||||
OMNITRACE_SUPPRESS_PARSING = true
|
||||
OMNITRACE_TEXT_OUTPUT = true
|
||||
OMNITRACE_TIME_FORMAT = %F_%H.%M
|
||||
OMNITRACE_TIMELINE_PROFILE = false
|
||||
OMNITRACE_TIMING_PRECISION = 6
|
||||
OMNITRACE_TIMING_SCIENTIFIC = false
|
||||
OMNITRACE_TIMING_UNITS = sec
|
||||
OMNITRACE_TIMING_WIDTH = -1
|
||||
OMNITRACE_TREE_OUTPUT = true
|
||||
OMNITRACE_WIDTH = -1
|
||||
ROCPROFSYS_CONFIG_FILE =
|
||||
ROCPROFSYS_MODE = trace
|
||||
ROCPROFSYS_TRACE = true
|
||||
ROCPROFSYS_PROFILE = false
|
||||
ROCPROFSYS_USE_SAMPLING = false
|
||||
ROCPROFSYS_USE_PROCESS_SAMPLING = true
|
||||
ROCPROFSYS_USE_ROCTRACER = true
|
||||
ROCPROFSYS_USE_ROCM_SMI = true
|
||||
ROCPROFSYS_USE_KOKKOSP = false
|
||||
ROCPROFSYS_USE_CODE_COVERAGE = false
|
||||
ROCPROFSYS_USE_PID = true
|
||||
ROCPROFSYS_OUTPUT_PATH = rocprof-sys-%tag%-output
|
||||
ROCPROFSYS_OUTPUT_PREFIX =
|
||||
ROCPROFSYS_CI = false
|
||||
ROCPROFSYS_THREAD_POOL_SIZE = 8
|
||||
ROCPROFSYS_DEBUG = false
|
||||
ROCPROFSYS_DL_VERBOSE = 0
|
||||
ROCPROFSYS_INSTRUMENTATION_INTERVAL = 1
|
||||
ROCPROFSYS_KOKKOSP_KERNEL_LOGGER = false
|
||||
ROCPROFSYS_PAPI_EVENTS = PAPI_TOT_CYC
|
||||
ROCPROFSYS_PERFETTO_BACKEND = inprocess
|
||||
ROCPROFSYS_PERFETTO_BUFFER_SIZE_KB = 1024000
|
||||
ROCPROFSYS_PERFETTO_COMBINE_TRACES = false
|
||||
ROCPROFSYS_PERFETTO_FILE = perfetto-trace.proto
|
||||
ROCPROFSYS_PERFETTO_FILL_POLICY = discard
|
||||
ROCPROFSYS_PERFETTO_SHMEM_SIZE_HINT_KB = 4096
|
||||
ROCPROFSYS_ROCTRACER_HSA_ACTIVITY = false
|
||||
ROCPROFSYS_ROCTRACER_HSA_API = false
|
||||
ROCPROFSYS_ROCTRACER_HSA_API_TYPES =
|
||||
ROCPROFSYS_SAMPLING_CPUS =
|
||||
ROCPROFSYS_SAMPLING_DELAY = 0.5
|
||||
ROCPROFSYS_SAMPLING_FREQ = 10
|
||||
ROCPROFSYS_SAMPLING_GPUS = all
|
||||
ROCPROFSYS_TIME_OUTPUT = true
|
||||
ROCPROFSYS_TIMEMORY_COMPONENTS = wall_clock
|
||||
ROCPROFSYS_TRACE_THREAD_LOCKS = false
|
||||
ROCPROFSYS_VERBOSE = 0
|
||||
ROCPROFSYS_COLLAPSE_PROCESSES = false
|
||||
ROCPROFSYS_COLLAPSE_THREADS = false
|
||||
ROCPROFSYS_COUT_OUTPUT = false
|
||||
ROCPROFSYS_CPU_AFFINITY = false
|
||||
ROCPROFSYS_DIFF_OUTPUT = false
|
||||
ROCPROFSYS_ENABLE_SIGNAL_HANDLER = true
|
||||
ROCPROFSYS_ENABLED = true
|
||||
ROCPROFSYS_FILE_OUTPUT = true
|
||||
ROCPROFSYS_FLAT_PROFILE = false
|
||||
ROCPROFSYS_INPUT_EXTENSIONS = json,xml
|
||||
ROCPROFSYS_INPUT_PATH =
|
||||
ROCPROFSYS_INPUT_PREFIX =
|
||||
ROCPROFSYS_JSON_OUTPUT = true
|
||||
ROCPROFSYS_MAX_DEPTH = 65535
|
||||
ROCPROFSYS_MAX_WIDTH = 120
|
||||
ROCPROFSYS_MEMORY_PRECISION = -1
|
||||
ROCPROFSYS_MEMORY_SCIENTIFIC = false
|
||||
ROCPROFSYS_MEMORY_UNITS = MB
|
||||
ROCPROFSYS_MEMORY_WIDTH = -1
|
||||
ROCPROFSYS_NETWORK_INTERFACE =
|
||||
ROCPROFSYS_NODE_COUNT = 0
|
||||
ROCPROFSYS_PAPI_FAIL_ON_ERROR = false
|
||||
ROCPROFSYS_PAPI_MULTIPLEXING = false
|
||||
ROCPROFSYS_PAPI_OVERFLOW = 0
|
||||
ROCPROFSYS_PAPI_QUIET = false
|
||||
ROCPROFSYS_PAPI_THREADING = true
|
||||
ROCPROFSYS_PRECISION = -1
|
||||
ROCPROFSYS_SCIENTIFIC = false
|
||||
ROCPROFSYS_STRICT_CONFIG = true
|
||||
ROCPROFSYS_SUPPRESS_CONFIG = true
|
||||
ROCPROFSYS_SUPPRESS_PARSING = true
|
||||
ROCPROFSYS_TEXT_OUTPUT = true
|
||||
ROCPROFSYS_TIME_FORMAT = %F_%H.%M
|
||||
ROCPROFSYS_TIMELINE_PROFILE = false
|
||||
ROCPROFSYS_TIMING_PRECISION = 6
|
||||
ROCPROFSYS_TIMING_SCIENTIFIC = false
|
||||
ROCPROFSYS_TIMING_UNITS = sec
|
||||
ROCPROFSYS_TIMING_WIDTH = -1
|
||||
ROCPROFSYS_TREE_OUTPUT = true
|
||||
ROCPROFSYS_WIDTH = -1
|
||||
|
||||
When creating a new configuration file, the following recommendations apply:
|
||||
|
||||
* Use the ``--all`` option to view all descriptions, choices, and other information in the configuration file.
|
||||
* To create a new configuration without inheriting from an existing ``${HOME}/.omnitrace.cfg`` file,
|
||||
set ``OMNITRACE_SUPPRESS_CONFIG=ON`` in the environment beforehand.
|
||||
* To create a new configuration without inheriting from an existing ``${HOME}/.rocprof-sys.cfg`` file,
|
||||
set ``ROCPROFSYS_SUPPRESS_CONFIG=ON`` in the environment beforehand.
|
||||
* To create a new configuration that makes minor changes to an existing configuration,
|
||||
set ``OMNITRACE_CONFIG_FILE=/path/to/existing/file`` and define the changes as environment
|
||||
set ``ROCPROFSYS_CONFIG_FILE=/path/to/existing/file`` and define the changes as environment
|
||||
variables before generating it.
|
||||
|
||||
Viewing the setting descriptions
|
||||
@@ -312,89 +314,89 @@ Viewing the setting descriptions
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omnitrace-avail -S -bd
|
||||
$ rocprof-sys-avail -S -bd
|
||||
|-----------------------------------------|-----------------------------------------|
|
||||
| ENVIRONMENT VARIABLE | DESCRIPTION |
|
||||
|-----------------------------------------|-----------------------------------------|
|
||||
| OMNITRACE_CI | Enable some runtime validation check... |
|
||||
| OMNITRACE_ADD_SECONDARY | Enable/disable components adding sec... |
|
||||
| OMNITRACE_COLLAPSE_PROCESSES | Enable/disable combining process-spe... |
|
||||
| OMNITRACE_COLLAPSE_THREADS | Enable/disable combining thread-spec... |
|
||||
| OMNITRACE_CONFIG_FILE | Configuration file for omnitrace |
|
||||
| OMNITRACE_COUT_OUTPUT | Write output to stdout |
|
||||
| OMNITRACE_CPU_AFFINITY | Enable pinning threads to CPUs (Linu... |
|
||||
| OMNITRACE_THREAD_POOL_SIZE | Number of threads to use when genera... |
|
||||
| OMNITRACE_DEBUG | Enable debug output |
|
||||
| OMNITRACE_DIFF_OUTPUT | Generate a difference output vs. a p... |
|
||||
| OMNITRACE_DL_VERBOSE | Verbosity within the omnitrace-dl li... |
|
||||
| OMNITRACE_ENABLED | Activation state of timemory |
|
||||
| OMNITRACE_ENABLE_SIGNAL_HANDLER | Enable signals in timemory_init |
|
||||
| OMNITRACE_FILE_OUTPUT | Write output to files |
|
||||
| OMNITRACE_FLAT_PROFILE | Set the label hierarchy mode to defa... |
|
||||
| OMNITRACE_INPUT_EXTENSIONS | File extensions used when searching ... |
|
||||
| OMNITRACE_INPUT_PATH | Explicitly specify the input folder ... |
|
||||
| OMNITRACE_INPUT_PREFIX | Explicitly specify the prefix for in... |
|
||||
| OMNITRACE_INSTRUMENTATION_INTERVAL | Instrumentation only takes measureme... |
|
||||
| OMNITRACE_JSON_OUTPUT | Write json output files |
|
||||
| OMNITRACE_KOKKOSP_KERNEL_LOGGER | Enables kernel logging |
|
||||
| OMNITRACE_MAX_DEPTH | Set the maximum depth of label hiera... |
|
||||
| OMNITRACE_MAX_THREAD_BOOKMARKS | Maximum number of times a worker thr... |
|
||||
| OMNITRACE_MAX_WIDTH | Set the maximum width for component ... |
|
||||
| OMNITRACE_MEMORY_PRECISION | Set the precision for components wit... |
|
||||
| OMNITRACE_MEMORY_SCIENTIFIC | Set the numerical reporting format f... |
|
||||
| OMNITRACE_MEMORY_UNITS | Set the units for components with u... |
|
||||
| OMNITRACE_MEMORY_WIDTH | Set the output width for components ... |
|
||||
| OMNITRACE_NETWORK_INTERFACE | Default network interface |
|
||||
| OMNITRACE_NODE_COUNT | Total number of nodes used in applic... |
|
||||
| OMNITRACE_OUTPUT_FILE | Perfetto filename |
|
||||
| OMNITRACE_OUTPUT_PATH | Explicitly specify the output folder... |
|
||||
| OMNITRACE_OUTPUT_PREFIX | Explicitly specify a prefix for all ... |
|
||||
| OMNITRACE_PAPI_EVENTS | PAPI presets and events to collect (... |
|
||||
| OMNITRACE_PAPI_FAIL_ON_ERROR | Configure PAPI errors to trigger a r... |
|
||||
| OMNITRACE_PAPI_MULTIPLEXING | Enable multiplexing when using PAPI |
|
||||
| OMNITRACE_PAPI_OVERFLOW | Value at which PAPI hw counters trig... |
|
||||
| OMNITRACE_PAPI_QUIET | Configure suppression of reporting P... |
|
||||
| OMNITRACE_PAPI_THREADING | Enable multithreading support when u... |
|
||||
| OMNITRACE_PERFETTO_BACKEND | Specify the perfetto backend to acti... |
|
||||
| OMNITRACE_PERFETTO_BUFFER_SIZE_KB | Size of perfetto buffer (in KB) |
|
||||
| OMNITRACE_PERFETTO_COMBINE_TRACES | Combine Perfetto traces. If not expl... |
|
||||
| OMNITRACE_PERFETTO_FILL_POLICY | Behavior when perfetto buffer is ful... |
|
||||
| OMNITRACE_PERFETTO_SHMEM_SIZE_HINT_KB | Hint for shared-memory buffer size i... |
|
||||
| OMNITRACE_PRECISION | Set the global output precision for ... |
|
||||
| OMNITRACE_ROCTRACER_HSA_ACTIVITY | Enable HSA activity tracing support |
|
||||
| OMNITRACE_ROCTRACER_HSA_API | Enable HSA API tracing support |
|
||||
| OMNITRACE_ROCTRACER_HSA_API_TYPES | HSA API type to collect |
|
||||
| OMNITRACE_SAMPLING_CPUS | CPUs to collect frequency informatio... |
|
||||
| OMNITRACE_SAMPLING_DELAY | Number of seconds to wait before the... |
|
||||
| OMNITRACE_SAMPLING_FREQ | Number of software interrupts per se... |
|
||||
| OMNITRACE_SAMPLING_GPUS | Devices to query when OMNITRACE_USE_... |
|
||||
| OMNITRACE_SCIENTIFIC | Set the global numerical reporting t... |
|
||||
| OMNITRACE_STRICT_CONFIG | Throw errors for unknown setting nam... |
|
||||
| OMNITRACE_SUPPRESS_CONFIG | Disable processing of setting config... |
|
||||
| OMNITRACE_SUPPRESS_PARSING | Disable parsing environment |
|
||||
| OMNITRACE_TEXT_OUTPUT | Write text output files |
|
||||
| OMNITRACE_TIMELINE_PROFILE | Set the label hierarchy mode to defa... |
|
||||
| OMNITRACE_TIMEMORY_COMPONENTS | List of components to collect via ti... |
|
||||
| OMNITRACE_TIME_FORMAT | Customize the folder generation when... |
|
||||
| OMNITRACE_TIME_OUTPUT | Output data to subfolder w/ a timest... |
|
||||
| OMNITRACE_TIMING_PRECISION | Set the precision for components wit... |
|
||||
| OMNITRACE_TIMING_SCIENTIFIC | Set the numerical reporting format f... |
|
||||
| OMNITRACE_TIMING_UNITS | Set the units for components with u... |
|
||||
| OMNITRACE_TIMING_WIDTH | Set the output width for components ... |
|
||||
| OMNITRACE_TRACE_THREAD_LOCKS | Enable tracking calls to pthread_mut... |
|
||||
| OMNITRACE_TREE_OUTPUT | Write hierarchical json output files |
|
||||
| OMNITRACE_USE_CODE_COVERAGE | Enable support for code coverage |
|
||||
| OMNITRACE_USE_KOKKOSP | Enable support for Kokkos Tools |
|
||||
| OMNITRACE_USE_OMPT | Enable support for OpenMP-Tools |
|
||||
| OMNITRACE_TRACE | Enable perfetto backend |
|
||||
| OMNITRACE_USE_PID | Enable tagging filenames with proces... |
|
||||
| OMNITRACE_USE_ROCM_SMI | Enable sampling GPU power, temp, uti... |
|
||||
| OMNITRACE_USE_ROCTRACER | Enable ROCM tracing |
|
||||
| OMNITRACE_USE_SAMPLING | Enable statistical sampling of call-... |
|
||||
| OMNITRACE_USE_PROCESS_SAMPLING | Enable a background thread which sam... |
|
||||
| OMNITRACE_PROFILE | Enable timemory backend |
|
||||
| OMNITRACE_VERBOSE | Verbosity level |
|
||||
| OMNITRACE_WIDTH | Set the global output width for comp... |
|
||||
| ROCPROFSYS_CI | Enable some runtime validation check... |
|
||||
| ROCPROFSYS_ADD_SECONDARY | Enable/disable components adding sec... |
|
||||
| ROCPROFSYS_COLLAPSE_PROCESSES | Enable/disable combining process-spe... |
|
||||
| ROCPROFSYS_COLLAPSE_THREADS | Enable/disable combining thread-spec... |
|
||||
| ROCPROFSYS_CONFIG_FILE | Configuration file for rocprof-sys |
|
||||
| ROCPROFSYS_COUT_OUTPUT | Write output to stdout |
|
||||
| ROCPROFSYS_CPU_AFFINITY | Enable pinning threads to CPUs (Linu... |
|
||||
| ROCPROFSYS_THREAD_POOL_SIZE | Number of threads to use when genera... |
|
||||
| ROCPROFSYS_DEBUG | Enable debug output |
|
||||
| ROCPROFSYS_DIFF_OUTPUT | Generate a difference output vs. a p... |
|
||||
| ROCPROFSYS_DL_VERBOSE | Verbosity within the rocprof-sys-dl ... |
|
||||
| ROCPROFSYS_ENABLED | Activation state of timemory |
|
||||
| ROCPROFSYS_ENABLE_SIGNAL_HANDLER | Enable signals in timemory_init |
|
||||
| ROCPROFSYS_FILE_OUTPUT | Write output to files |
|
||||
| ROCPROFSYS_FLAT_PROFILE | Set the label hierarchy mode to defa... |
|
||||
| ROCPROFSYS_INPUT_EXTENSIONS | File extensions used when searching ... |
|
||||
| ROCPROFSYS_INPUT_PATH | Explicitly specify the input folder ... |
|
||||
| ROCPROFSYS_INPUT_PREFIX | Explicitly specify the prefix for in... |
|
||||
| ROCPROFSYS_INSTRUMENTATION_INTERVAL | Instrumentation only takes measureme... |
|
||||
| ROCPROFSYS_JSON_OUTPUT | Write json output files |
|
||||
| ROCPROFSYS_KOKKOSP_KERNEL_LOGGER | Enables kernel logging |
|
||||
| ROCPROFSYS_MAX_DEPTH | Set the maximum depth of label hiera... |
|
||||
| ROCPROFSYS_MAX_THREAD_BOOKMARKS | Maximum number of times a worker thr... |
|
||||
| ROCPROFSYS_MAX_WIDTH | Set the maximum width for component ... |
|
||||
| ROCPROFSYS_MEMORY_PRECISION | Set the precision for components wit... |
|
||||
| ROCPROFSYS_MEMORY_SCIENTIFIC | Set the numerical reporting format f... |
|
||||
| ROCPROFSYS_MEMORY_UNITS | Set the units for components with u... |
|
||||
| ROCPROFSYS_MEMORY_WIDTH | Set the output width for components ... |
|
||||
| ROCPROFSYS_NETWORK_INTERFACE | Default network interface |
|
||||
| ROCPROFSYS_NODE_COUNT | Total number of nodes used in applic... |
|
||||
| ROCPROFSYS_OUTPUT_FILE | Perfetto filename |
|
||||
| ROCPROFSYS_OUTPUT_PATH | Explicitly specify the output folder... |
|
||||
| ROCPROFSYS_OUTPUT_PREFIX | Explicitly specify a prefix for all ... |
|
||||
| ROCPROFSYS_PAPI_EVENTS | PAPI presets and events to collect (... |
|
||||
| ROCPROFSYS_PAPI_FAIL_ON_ERROR | Configure PAPI errors to trigger a r... |
|
||||
| ROCPROFSYS_PAPI_MULTIPLEXING | Enable multiplexing when using PAPI |
|
||||
| ROCPROFSYS_PAPI_OVERFLOW | Value at which PAPI hw counters trig... |
|
||||
| ROCPROFSYS_PAPI_QUIET | Configure suppression of reporting P... |
|
||||
| ROCPROFSYS_PAPI_THREADING | Enable multithreading support when u... |
|
||||
| ROCPROFSYS_PERFETTO_BACKEND | Specify the perfetto backend to acti... |
|
||||
| ROCPROFSYS_PERFETTO_BUFFER_SIZE_KB | Size of perfetto buffer (in KB) |
|
||||
| ROCPROFSYS_PERFETTO_COMBINE_TRACES | Combine Perfetto traces. If not expl... |
|
||||
| ROCPROFSYS_PERFETTO_FILL_POLICY | Behavior when perfetto buffer is ful... |
|
||||
| ROCPROFSYS_PERFETTO_SHMEM_SIZE_HINT_KB | Hint for shared-memory buffer size i... |
|
||||
| ROCPROFSYS_PRECISION | Set the global output precision for ... |
|
||||
| ROCPROFSYS_ROCTRACER_HSA_ACTIVITY | Enable HSA activity tracing support |
|
||||
| ROCPROFSYS_ROCTRACER_HSA_API | Enable HSA API tracing support |
|
||||
| ROCPROFSYS_ROCTRACER_HSA_API_TYPES | HSA API type to collect |
|
||||
| ROCPROFSYS_SAMPLING_CPUS | CPUs to collect frequency informatio... |
|
||||
| ROCPROFSYS_SAMPLING_DELAY | Number of seconds to wait before the... |
|
||||
| ROCPROFSYS_SAMPLING_FREQ | Number of software interrupts per se... |
|
||||
| ROCPROFSYS_SAMPLING_GPUS | Devices to query when ROCPROFSYS_USE_... |
|
||||
| ROCPROFSYS_SCIENTIFIC | Set the global numerical reporting t... |
|
||||
| ROCPROFSYS_STRICT_CONFIG | Throw errors for unknown setting nam... |
|
||||
| ROCPROFSYS_SUPPRESS_CONFIG | Disable processing of setting config... |
|
||||
| ROCPROFSYS_SUPPRESS_PARSING | Disable parsing environment |
|
||||
| ROCPROFSYS_TEXT_OUTPUT | Write text output files |
|
||||
| ROCPROFSYS_TIMELINE_PROFILE | Set the label hierarchy mode to defa... |
|
||||
| ROCPROFSYS_TIMEMORY_COMPONENTS | List of components to collect via ti... |
|
||||
| ROCPROFSYS_TIME_FORMAT | Customize the folder generation when... |
|
||||
| ROCPROFSYS_TIME_OUTPUT | Output data to subfolder w/ a timest... |
|
||||
| ROCPROFSYS_TIMING_PRECISION | Set the precision for components wit... |
|
||||
| ROCPROFSYS_TIMING_SCIENTIFIC | Set the numerical reporting format f... |
|
||||
| ROCPROFSYS_TIMING_UNITS | Set the units for components with u... |
|
||||
| ROCPROFSYS_TIMING_WIDTH | Set the output width for components ... |
|
||||
| ROCPROFSYS_TRACE_THREAD_LOCKS | Enable tracking calls to pthread_mut... |
|
||||
| ROCPROFSYS_TREE_OUTPUT | Write hierarchical json output files |
|
||||
| ROCPROFSYS_USE_CODE_COVERAGE | Enable support for code coverage |
|
||||
| ROCPROFSYS_USE_KOKKOSP | Enable support for Kokkos Tools |
|
||||
| ROCPROFSYS_USE_OMPT | Enable support for OpenMP-Tools |
|
||||
| ROCPROFSYS_TRACE | Enable perfetto backend |
|
||||
| ROCPROFSYS_USE_PID | Enable tagging filenames with proces... |
|
||||
| ROCPROFSYS_USE_ROCM_SMI | Enable sampling GPU power, temp, uti... |
|
||||
| ROCPROFSYS_USE_ROCTRACER | Enable ROCM tracing |
|
||||
| ROCPROFSYS_USE_SAMPLING | Enable statistical sampling of call-... |
|
||||
| ROCPROFSYS_USE_PROCESS_SAMPLING | Enable a background thread which sam... |
|
||||
| ROCPROFSYS_PROFILE | Enable timemory backend |
|
||||
| ROCPROFSYS_VERBOSE | Verbosity level |
|
||||
| ROCPROFSYS_WIDTH | Set the global output width for comp... |
|
||||
|-----------------------------------------|-----------------------------------------|
|
||||
|
||||
Viewing components
|
||||
@@ -402,7 +404,7 @@ Viewing components
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omnitrace-avail -C -bd
|
||||
$ rocprof-sys-avail -C -bd
|
||||
|-----------------------------------|----------------------------------------------|
|
||||
| COMPONENT | DESCRIPTION |
|
||||
|-----------------------------------|----------------------------------------------|
|
||||
@@ -460,7 +462,7 @@ Viewing components
|
||||
| wall_clock | Real-clock timer (i.e. wall-clock timer). |
|
||||
| written_bytes | Number of bytes sent to the storage layer. |
|
||||
| written_char | Number of bytes which this task has cause... |
|
||||
| omnitrace | Invokes instrumentation functions omnitr... |
|
||||
| rocprof-sys | Invokes instrumentation functions rocprof... |
|
||||
| roctracer | High-precision ROCm API and kernel tracing. |
|
||||
| sampling_wall_clock | Wall-clock timing. Derived from statistic... |
|
||||
| sampling_cpu_clock | CPU-clock timing. Derived from statistica... |
|
||||
@@ -476,7 +478,7 @@ Viewing hardware counters
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omnitrace-avail -H -bd
|
||||
$ rocprof-sys-avail -H -bd
|
||||
|---------------------------------------|---------------------------------------|
|
||||
| HARDWARE COUNTER | DESCRIPTION |
|
||||
|---------------------------------------|---------------------------------------|
|
||||
@@ -1197,17 +1199,17 @@ Viewing hardware counters
|
||||
Creating a configuration file
|
||||
========================================
|
||||
|
||||
Omnitrace supports three configuration file formats: JSON, XML, and plain text.
|
||||
Use ``omnitrace-avail -G <filename> -F txt json xml`` to generate default
|
||||
ROCm Systems Profiler supports three configuration file formats: JSON, XML, and plain text.
|
||||
Use ``rocprof-sys-avail -G <filename> -F txt json xml`` to generate default
|
||||
configuration files in each format. Optionally
|
||||
include the ``--all`` flag to include full descriptions and other information.
|
||||
Configuration files are specified by the ``OMNITRACE_CONFIG_FILE`` environment variable
|
||||
which by default looks for ``${HOME}/.omnitrace.cfg`` and ``${HOME}/.omnitrace.json``.
|
||||
Configuration files are specified by the ``ROCPROFSYS_CONFIG_FILE`` environment variable,
|
||||
which by default looks for ``${HOME}/.rocprof-sys.cfg`` and ``${HOME}/.rocprof-sys.json``.
|
||||
Multiple configuration files can be concatenated using the ``:`` symbol, for example:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
export OMNITRACE_CONFIG_FILE=~/.config/omnitrace.cfg:~/.config/omnitrace.json
|
||||
export ROCPROFSYS_CONFIG_FILE=~/.config/rocprof-sys.cfg:~/.config/rocprof-sys.json
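When several files are joined with ``:``, it can be useful to list the individual entries, for instance to check that each one exists. The sketch below assumes only the ``:`` separator described above; ``list_config_files`` is a hypothetical helper, not a ROCm Systems Profiler command.

```shell
# Sketch: print each entry of a colon-separated configuration file list on its own line.
list_config_files() {
    old_ifs="$IFS"
    IFS=':'
    set -f                      # disable globbing while the unquoted value is split
    for cfg_path in $1; do
        [ -n "$cfg_path" ] && echo "$cfg_path"
    done
    set +f                      # note: re-enables globbing unconditionally
    IFS="$old_ifs"
}

# Report which of the configured files are actually present.
list_config_files "${ROCPROFSYS_CONFIG_FILE:-}" | while IFS= read -r cfg_path; do
    if [ -f "$cfg_path" ]; then
        echo "found:   $cfg_path"
    else
        echo "missing: $cfg_path"
    fi
done
```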
|
||||
|
||||
If a configuration variable is specified in both a configuration file and in the environment,
|
||||
the environment variable takes precedence.
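The precedence rule can be pictured with a small lookup helper. This is a sketch of the stated behavior only, not the tool's actual implementation: ``resolve_setting`` is a hypothetical function, it handles just the plain-text ``NAME = value`` format, and it expects the setting name to be a valid shell identifier.

```shell
# Sketch: resolve a setting, preferring the environment over a text configuration file.
resolve_setting() {
    setting_name="$1"
    config_file="$2"
    # Indirect lookup of the environment variable named by $setting_name.
    eval "is_set=\${$setting_name+yes}"
    eval "env_value=\${$setting_name-}"
    if [ "$is_set" = "yes" ]; then
        echo "$env_value"
        return 0
    fi
    # Otherwise fall back to the first "NAME = value" line in the file.
    sed -n "s/^[[:space:]]*${setting_name}[[:space:]]*=[[:space:]]*//p" "$config_file" | head -n 1
}
```

For example, if the file contains ``ROCPROFSYS_VERBOSE = 1`` and the environment sets ``ROCPROFSYS_VERBOSE=3``, the helper prints ``3``; with the variable unset, it prints ``1``.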
|
||||
@@ -1220,7 +1222,7 @@ Variables are created when an lvalue starts with a ``$`` and are
|
||||
de-referenced when they appear as rvalues.
|
||||
|
||||
Entries in the text configuration file which do not match a known setting
|
||||
in ``omnitrace-avail`` but are prefixed with ``OMNITRACE_`` are interpreted as
|
||||
in ``rocprof-sys-avail`` but are prefixed with ``ROCPROFSYS_`` are interpreted as
|
||||
environment variables. They are exported via ``setenv``
|
||||
but do not override an existing value for the environment variable.
|
||||
|
||||
@@ -1231,35 +1233,35 @@ but do not override an existing value for the environment variable.
|
||||
$SAMPLE = OFF
|
||||
|
||||
# use fields
|
||||
OMNITRACE_TRACE = $ENABLE
|
||||
OMNITRACE_PROFILE = $ENABLE
|
||||
OMNITRACE_USE_SAMPLING = $SAMPLE
|
||||
OMNITRACE_USE_PROCESS_SAMPLING = $SAMPLE
|
||||
ROCPROFSYS_TRACE = $ENABLE
|
||||
ROCPROFSYS_PROFILE = $ENABLE
|
||||
ROCPROFSYS_USE_SAMPLING = $SAMPLE
|
||||
ROCPROFSYS_USE_PROCESS_SAMPLING = $SAMPLE
|
||||
|
||||
# debug
|
||||
OMNITRACE_DEBUG = OFF
|
||||
OMNITRACE_VERBOSE = 1
|
||||
ROCPROFSYS_DEBUG = OFF
|
||||
ROCPROFSYS_VERBOSE = 1
|
||||
|
||||
# output fields
|
||||
OMNITRACE_OUTPUT_PATH = omnitrace-output
|
||||
OMNITRACE_OUTPUT_PREFIX = %tag%/
|
||||
OMNITRACE_TIME_OUTPUT = OFF
|
||||
OMNITRACE_USE_PID = OFF
|
||||
ROCPROFSYS_OUTPUT_PATH = rocprof-sys-output
|
||||
ROCPROFSYS_OUTPUT_PREFIX = %tag%/
|
||||
ROCPROFSYS_TIME_OUTPUT = OFF
|
||||
ROCPROFSYS_USE_PID = OFF
|
||||
|
||||
# timemory fields
|
||||
OMNITRACE_PAPI_EVENTS = PAPI_TOT_INS PAPI_FP_INS
|
||||
OMNITRACE_TIMEMORY_COMPONENTS = wall_clock peak_rss trip_count
|
||||
OMNITRACE_MEMORY_UNITS = MB
|
||||
OMNITRACE_TIMING_UNITS = sec
|
||||
ROCPROFSYS_PAPI_EVENTS = PAPI_TOT_INS PAPI_FP_INS
|
||||
ROCPROFSYS_TIMEMORY_COMPONENTS = wall_clock peak_rss trip_count
|
||||
ROCPROFSYS_MEMORY_UNITS = MB
|
||||
ROCPROFSYS_TIMING_UNITS = sec
|
||||
|
||||
# sampling fields
|
||||
OMNITRACE_SAMPLING_FREQ = 50
|
||||
OMNITRACE_SAMPLING_DELAY = 0.1
|
||||
OMNITRACE_SAMPLING_CPUS = 0-3
|
||||
OMNITRACE_SAMPLING_GPUS = $env:HIP_VISIBLE_DEVICES
|
||||
ROCPROFSYS_SAMPLING_FREQ = 50
|
||||
ROCPROFSYS_SAMPLING_DELAY = 0.1
|
||||
ROCPROFSYS_SAMPLING_CPUS = 0-3
|
||||
ROCPROFSYS_SAMPLING_GPUS = $env:HIP_VISIBLE_DEVICES
|
||||
|
||||
# misc env variables (see metadata JSON file after run)
|
||||
$env:OMNITRACE_SAMPLING_KEEP_DYNINST_SUFFIX = OFF
|
||||
$env:ROCPROFSYS_SAMPLING_KEEP_DYNINST_SUFFIX = OFF
|
||||
|
||||
Sample JSON configuration file
|
||||
-----------------------------------
|
||||
@@ -1269,9 +1271,9 @@ The full JSON specification for a configuration value contains a lot of informat
|
||||
.. code-block:: json
|
||||
|
||||
{
|
||||
"omnitrace": {
|
||||
"rocprof-sys": {
|
||||
"settings": {
|
||||
"OMNITRACE_ADD_SECONDARY": {
|
||||
"ROCPROFSYS_ADD_SECONDARY": {
|
||||
"count": -1,
|
||||
"name": "add_secondary",
|
||||
"data_type": "bool",
|
||||
@@ -1279,9 +1281,9 @@ The full JSON specification for a configuration value contains a lot of informat
|
||||
"value": true,
|
||||
"max_count": 1,
|
||||
"cmdline": [
|
||||
"--omnitrace-add-secondary"
|
||||
"--rocprof-sys-add-secondary"
|
||||
],
|
||||
"environ": "OMNITRACE_ADD_SECONDARY",
|
||||
"environ": "ROCPROFSYS_ADD_SECONDARY",
|
||||
"cereal_class_version": 1,
|
||||
"categories": [
|
||||
"component",
|
||||
@@ -1294,15 +1296,15 @@ The full JSON specification for a configuration value contains a lot of informat
|
||||
}
|
||||
}
|
||||
|
||||
However when writing an JSON configuration file, the following example is minimally acceptable
|
||||
for ``OMNITRACE_ADD_SECONDARY``:
|
||||
However, when writing a JSON configuration file, the following example is minimally acceptable
|
||||
for ``ROCPROFSYS_ADD_SECONDARY``:
|
||||
|
||||
.. code-block:: json
|
||||
|
||||
{
|
||||
"omnitrace": {
|
||||
"rocprof-sys": {
|
||||
"settings": {
|
||||
"OMNITRACE_ADD_SECONDARY": {
|
||||
"ROCPROFSYS_ADD_SECONDARY": {
|
||||
"value": true
|
||||
}
|
||||
}
|
||||
@@ -1318,19 +1320,19 @@ The full XML specification for a configuration value contains the same informati
|
||||
|
||||
<?xml version="1.0" encoding="utf-8"?>
|
||||
<timemory_xml>
|
||||
<omnitrace>
|
||||
<rocprofiler-systems>
|
||||
<settings>
|
||||
<cereal_class_version>2</cereal_class_version>
|
||||
<!-- Full setting specification -->
|
||||
<OMNITRACE_ADD_SECONDARY>
|
||||
<ROCPROFSYS_ADD_SECONDARY>
|
||||
<cereal_class_version>1</cereal_class_version>
|
||||
<name>add_secondary</name>
|
||||
<environ>OMNITRACE_ADD_SECONDARY</environ>
|
||||
<environ>ROCPROFSYS_ADD_SECONDARY</environ>
|
||||
<description>...</description>
|
||||
<count>-1</count>
|
||||
<max_count>1</max_count>
|
||||
<cmdline>
|
||||
<value0>--omnitrace-add-secondary</value0>
|
||||
<value0>--rocprof-sys-add-secondary</value0>
|
||||
</cmdline>
|
||||
<categories>
|
||||
<value0>component</value0>
|
||||
@@ -1340,24 +1342,24 @@ The full XML specification for a configuration value contains the same informati
|
||||
<data_type>bool</data_type>
|
||||
<initial>true</initial>
|
||||
<value>true</value>
|
||||
</OMNITRACE_ADD_SECONDARY>
|
||||
</ROCPROFSYS_ADD_SECONDARY>
|
||||
<!-- etc. -->
|
||||
</settings>
|
||||
</omnitrace>
|
||||
</rocprofiler-systems>
|
||||
</timemory_xml>
|
||||
|
||||
However, when writing an XML configuration file, it is minimally acceptable
|
||||
to set ``OMNITRACE_ADD_SECONDARY=false``:
|
||||
However, when writing an XML configuration file, it is minimally acceptable
|
||||
to set ``ROCPROFSYS_ADD_SECONDARY=false``:
|
||||
|
||||
.. code-block:: xml
|
||||
|
||||
<?xml version="1.0" encoding="utf-8"?>
|
||||
<timemory_xml>
|
||||
<omnitrace>
|
||||
<rocprofiler-systems>
|
||||
<settings>
|
||||
<OMNITRACE_ADD_SECONDARY>
|
||||
<ROCPROFSYS_ADD_SECONDARY>
|
||||
<value>false</value>
|
||||
</OMNITRACE_ADD_SECONDARY>
|
||||
</ROCPROFSYS_ADD_SECONDARY>
|
||||
</settings>
|
||||
</omnitrace>
|
||||
</rocprofiler-systems>
|
||||
</timemory_xml>
|
||||
|
||||
@@ -1,47 +1,47 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
:description: ROCm Systems Profiler environment validation documentation and reference
|
||||
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, profiler, environment, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
****************************************************
|
||||
Configuring and validating the environment
|
||||
****************************************************
|
||||
|
||||
After installing `Omnitrace <https://github.com/ROCm/omnitrace>`_, additional steps are required to set up
|
||||
After installing `ROCm Systems Profiler <https://github.com/ROCm/rocprofiler-systems>`_, additional steps are required to set up
|
||||
and validate the environment.
|
||||
|
||||
.. note::
|
||||
|
||||
The following instructions use the installation path ``/opt/omnitrace``. If
|
||||
Omnitrace is installed elsewhere, substitute the actual installation path.
|
||||
The following instructions use the installation path ``/opt/rocprofiler-systems``. If
|
||||
ROCm Systems Profiler is installed elsewhere, substitute the actual installation path.
|
||||
|
||||
Configuring the environment
|
||||
========================================
|
||||
|
||||
After Omnitrace is installed, source the ``setup-env.sh`` script to prefix the
|
||||
After ROCm Systems Profiler is installed, source the ``setup-env.sh`` script to prefix the
|
||||
``PATH``, ``LD_LIBRARY_PATH``, and other environment variables:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
source /opt/omnitrace/share/omnitrace/setup-env.sh
|
||||
source /opt/rocprofiler-systems/share/rocprofiler-systems/setup-env.sh
|
||||
|
||||
Alternatively, if environment modules are supported, add the ``<prefix>/share/modulefiles`` directory
|
||||
to ``MODULEPATH``:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
module use /opt/omnitrace/share/modulefiles
|
||||
module use /opt/rocprofiler-systems/share/modulefiles
|
||||
|
||||
.. note::
|
||||
|
||||
|
||||
As an alternative, the above line can be added to the ``${HOME}/.modulerc`` file.
|
||||
|
||||
After Omnitrace has been added to the ``MODULEPATH``, it can be loaded
|
||||
using ``module load omnitrace/<VERSION>`` and unloaded using ``module unload omnitrace/<VERSION>``.
|
||||
After ROCm Systems Profiler has been added to the ``MODULEPATH``, it can be loaded
|
||||
using ``module load rocprofiler-systems/<VERSION>`` and unloaded using ``module unload rocprofiler-systems/<VERSION>``.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
module load omnitrace/1.0.0
|
||||
module unload omnitrace/1.0.0
|
||||
module load rocprofiler-systems/1.0.0
|
||||
module unload rocprofiler-systems/1.0.0
|
||||
|
||||
.. note::
|
||||
|
||||
@@ -51,21 +51,21 @@ using ``module load omnitrace/<VERSION>`` and unloaded using ``module unload omn
|
||||
Validating the environment configuration
|
||||
========================================
|
||||
|
||||
If the following commands all run successfully with the expected output,
|
||||
then you are ready to use Omnitrace:
|
||||
If the following commands all run successfully with the expected output,
|
||||
then you are ready to use ROCm Systems Profiler:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
which omnitrace
|
||||
which omnitrace-avail
|
||||
which omnitrace-sample
|
||||
omnitrace-instrument --help
|
||||
omnitrace-avail --all
|
||||
omnitrace-sample --help
|
||||
which rocprof-sys
|
||||
which rocprof-sys-avail
|
||||
which rocprof-sys-sample
|
||||
rocprof-sys-instrument --help
|
||||
rocprof-sys-avail --all
|
||||
rocprof-sys-sample --help
|
||||
|
||||
If Omnitrace was built with Python support, validate these additional commands:
|
||||
If ROCm Systems Profiler was built with Python support, validate these additional commands:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
which omnitrace-python
|
||||
omnitrace-python --help
|
||||
which rocprof-sys-python
|
||||
rocprof-sys-python --help
|
||||
|
||||
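The checks above can also be scripted. The following is a rough sketch rather than part of the product: ``check_tools`` is a hypothetical helper name, and it only tests whether each executable is found on ``PATH``, not whether its output is as expected.

```shell
# Report which of the expected executables are found on PATH.
# check_tools is a hypothetical helper, not a ROCm Systems Profiler command.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "${tool}" > /dev/null 2>&1; then
            printf 'found:   %s\n' "${tool}"
        else
            printf 'missing: %s\n' "${tool}"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every tool was found.
    [ "${missing}" -eq 0 ]
}

check_tools rocprof-sys rocprof-sys-avail rocprof-sys-sample rocprof-sys-instrument \
    || echo "Source setup-env.sh or load the module, then retry."
```

If Python support was built, ``rocprof-sys-python`` can be appended to the argument list.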
@@ -1,19 +1,19 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
:description: ROCm Systems Profiler general tips and usage documentation and reference
|
||||
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, tips, how to, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
**************************************************
|
||||
General tips for using Omnitrace
|
||||
General tips for using ROCm Systems Profiler
|
||||
**************************************************
|
||||
|
||||
Follow these general guidelines when using Omnitrace. For an explanation of the terms used in this topic, see
|
||||
the :doc:`Omnitrace glossary <../reference/omnitrace-glossary>`.
|
||||
Follow these general guidelines when using ROCm Systems Profiler. For an explanation of the terms used in this topic, see
|
||||
the :doc:`ROCm Systems Profiler glossary <../reference/rocprof-sys-glossary>`.
|
||||
|
||||
* Use ``omnitrace-avail`` to look up configuration settings, hardware counters, and data collection components
|
||||
* Use ``rocprof-sys-avail`` to look up configuration settings, hardware counters, and data collection components
|
||||
|
||||
* Use the ``-d`` flag for descriptions
|
||||
|
||||
* Generate a default configuration with ``omnitrace-avail -G ${HOME}/.omnitrace.cfg`` and adjust it
|
||||
* Generate a default configuration with ``rocprof-sys-avail -G ${HOME}/.rocprof-sys.cfg`` and adjust it
|
||||
to the desired default behavior
|
||||
* **Decide whether binary instrumentation, statistical sampling, or both** provides the desired performance data (for non-Python applications)
|
||||
* Compile code with optimization enabled (``-O2`` or higher), disable asserts (i.e. ``-DNDEBUG``), and include debug info (for instance, ``-g1`` at a minimum)
|
||||
@@ -24,26 +24,26 @@ the :doc:`Omnitrace glossary <../reference/omnitrace-glossary>`.
|
||||
* **Use binary instrumentation for characterizing the performance of every invocation of specific functions**
|
||||
* **Use statistical sampling to characterize the performance of the entire application while minimizing overhead**
|
||||
* Enable statistical sampling after binary instrumentation to help "fill in the gaps" between instrumented regions
|
||||
* Use the user API to create custom regions and enable/disable Omnitrace for specific processes, threads, and regions
|
||||
* Use the user API to create custom regions and enable/disable ROCm Systems Profiler for specific processes, threads, and regions
|
||||
* Dynamic symbol interception, callback APIs, and the user API are always available with binary instrumentation and sampling
|
||||
|
||||
* Dynamic symbol interception and callback APIs are (generally) controlled through ``OMNITRACE_USE_<API>``
|
||||
options, for example, ``OMNITRACE_USE_KOKKOSP`` and ``OMNITRACE_USE_OMPT`` enable Kokkos-Tools and OpenMP-Tools
|
||||
* Dynamic symbol interception and callback APIs are (generally) controlled through ``ROCPROFSYS_USE_<API>``
|
||||
options, for example, ``ROCPROFSYS_USE_KOKKOSP`` and ``ROCPROFSYS_USE_OMPT`` enable Kokkos-Tools and OpenMP-Tools
|
||||
callbacks, respectively
|
||||
|
||||
* When generically seeking regions for performance improvement:
|
||||
|
||||
* **Start off by collecting a flat profile**
|
||||
* Look for functions with high call counts, large cumulative runtimes/values, or large standard deviations
|
||||
|
||||
|
||||
* When call counts are high, improving the performance of this function or "inlining" the function can result in quick and easy performance improvements
|
||||
* When the standard deviation is high, collect a hierarchical profile and see if the high variation can be attributable to the calling context.
|
||||
* When the standard deviation is high, collect a hierarchical profile and see if the high variation can be attributed to the calling context.
|
||||
In this scenario, consider creating a specialized version of the function for the longer-running contexts
|
||||
|
||||
* **Collect a hierarchical profile** and verify the functions that are part of the "critical path" of your
|
||||
* **Collect a hierarchical profile** and verify the functions that are part of the "critical path" of your
|
||||
application, as indicated in the flat profile
|
||||
|
||||
* For example, functions with high call counts but which are part of a "setup" or "post-processing"
|
||||
* For example, functions with high call counts but which are part of a "setup" or "post-processing"
|
||||
phase that does not consume much time relative to the overall time are generally a lower priority for optimization
|
||||
|
||||
* **Use the information from the profiles when analyzing detailed traces**
|
||||
@@ -54,7 +54,7 @@ the :doc:`Omnitrace glossary <../reference/omnitrace-glossary>`.
|
||||
* When using binary instrumentation with MPI, avoid runtime instrumentation
|
||||
|
||||
* Runtime instrumentation requires a fork and a ``ptrace``, which is generally incompatible with how MPI applications spawn processes
|
||||
* Perform a binary rewrite of the executable (and optionally, libraries used by the executable) using MPI and run
|
||||
the generated instrumented executable using ``omnitrace-run`` instead of the original.
|
||||
For example, instead of ``mpirun -n 2 ./myexe``, use ``mpirun -n 2 omnitrace-run -- ./myexe.inst``, where
|
||||
* Perform a binary rewrite of the executable (and optionally, libraries used by the executable) using MPI and run
|
||||
the generated instrumented executable using ``rocprof-sys-run`` instead of the original.
|
||||
For example, instead of ``mpirun -n 2 ./myexe``, use ``mpirun -n 2 rocprof-sys-run -- ./myexe.inst``, where
|
||||
``myexe.inst`` is the instrumented ``myexe`` executable that was generated.
|
||||
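The MPI guidance above amounts to two steps: a binary rewrite, then a launch through ``rocprof-sys-run``. As a hedged sketch, the snippet below only assembles and prints the two commands. The executable name ``./myexe`` and the rank count are illustrative.

```shell
# Illustrative values; replace with your own MPI application and rank count.
exe=./myexe
ranks=2

# Step 1: binary rewrite, producing the instrumented executable.
rewrite_cmd="rocprof-sys-instrument -o ${exe}.inst -- ${exe}"

# Step 2: launch the instrumented executable under MPI via rocprof-sys-run.
run_cmd="mpirun -n ${ranks} rocprof-sys-run -- ${exe}.inst"

# Print the commands for review instead of executing them.
echo "${rewrite_cmd}"
echo "${run_cmd}"
```

Printing first makes it easy to confirm the command lines before running them on a cluster.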
@@ -1,12 +1,12 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
:description: ROCm Systems Profiler binary instrumentation and rewrite documentation and reference
|
||||
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, binary instrumentation, binary rewrite, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
****************************************************
|
||||
Instrumenting and rewriting a binary application
|
||||
****************************************************
|
||||
|
||||
There are three ways to perform instrumentation with the ``omnitrace-instrument`` executable:
|
||||
There are three ways to perform instrumentation with the ``rocprof-sys-instrument`` executable:
|
||||
|
||||
* Runtime instrumentation
|
||||
* Attaching to an already running process
|
||||
@@ -14,11 +14,11 @@ There are three ways to perform instrumentation with the ``omnitrace-instrument`
|
||||
|
||||
Here is a comparison of the three modes:
|
||||
|
||||
* Runtime instrumentation of the application using the ``omnitrace-instrument`` executable
|
||||
* Runtime instrumentation of the application using the ``rocprof-sys-instrument`` executable
|
||||
(analogous to ``gdb --args <program> <args>``)
|
||||
|
||||
* This mode is the default if neither the ``-p`` nor ``-o`` command-line options are used
|
||||
* Runtime instrumentation supports instrumenting not only the target executable but also
|
||||
* Runtime instrumentation supports instrumenting not only the target executable but also
|
||||
the shared libraries loaded by the target executable. Consequently, this mode consumes more memory,
|
||||
takes longer to perform the instrumentation, and tends to add more significant overhead to the
|
||||
runtime of the application.
|
||||
@@ -26,7 +26,7 @@ Here is a comparison of the three modes:
|
||||
libraries but also the performance of the library dependencies
|
||||
|
||||
* Attaching to a process that is currently running (analogous to ``gdb -p <PID>``)
|
||||
|
||||
|
||||
* This mode is activated using ``-p <PID>``
|
||||
* The same caveats from the first example apply with respect to memory and overhead
|
||||
|
||||
@@ -39,25 +39,25 @@ Here is a comparison of the three modes:
|
||||
|
||||
* This mode is activated through the ``-o <output-file>`` option
|
||||
* Binary rewriting is limited to the text section of the target executable or library. It does not instrument
|
||||
the dynamically-linked libraries. Consequently, this mode performs the
|
||||
the dynamically-linked libraries. Consequently, this mode performs the
|
||||
instrumentation significantly faster
|
||||
and has a much lower overhead when running the instrumented executable and libraries.
|
||||
* Binary rewriting is the recommended mode when the target executable uses
|
||||
* Binary rewriting is the recommended mode when the target executable uses
|
||||
process-level parallelism (for example, MPI)
|
||||
* If the target executable has a minimal ``main`` routine and the bulk of your
|
||||
* If the target executable has a minimal ``main`` routine and the bulk of your
|
||||
application is in one specific dynamic library,
|
||||
see :ref:`binary-rewriting-library-label` for help
|
||||
|
||||
The omnitrace-instrument executable
|
||||
The rocprof-sys-instrument executable
|
||||
========================================
|
||||
|
||||
Instrumentation is performed with the ``omnitrace-instrument`` executable. For more details, use the ``-h`` or ``--help`` option to
|
||||
Instrumentation is performed with the ``rocprof-sys-instrument`` executable. For more details, use the ``-h`` or ``--help`` option to
|
||||
view the help menu.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omnitrace-instrument --help
|
||||
[omnitrace-instrument] Usage: omnitrace-instrument [ --help (count: 0, dtype: bool)
|
||||
$ rocprof-sys-instrument --help
|
||||
[rocprof-sys-instrument] Usage: rocprof-sys-instrument [ --help (count: 0, dtype: bool)
|
||||
--version (count: 0, dtype: bool)
|
||||
--verbose (max: 1, dtype: bool)
|
||||
--error (max: 1, dtype: boolean)
|
||||
@@ -161,8 +161,8 @@ view the help menu.
|
||||
[MODE OPTIONS]
|
||||
|
||||
-o, --output Enable generation of a new executable (binary-rewrite). If a filename is not provided,
|
||||
omnitrace will use the basename and output to the cwd, unless the target binary is in the
|
||||
cwd. In the latter case, omnitrace will either use ${PWD}/<basename>.inst (non-libraries)
|
||||
rocprof-sys will use the basename and output to the cwd, unless the target binary is in the
|
||||
cwd. In the latter case, rocprof-sys will either use ${PWD}/<basename>.inst (non-libraries)
|
||||
or ${PWD}/instrumented/<basename> (libraries)
|
||||
-p, --pid Connect to running process
|
||||
-M, --mode [ coverage | sampling | trace ]
|
||||
@@ -177,7 +177,7 @@ view the help menu.
|
||||
[LIBRARY OPTIONS]
|
||||
|
||||
--prefer [ shared | static ] Prefer this library types when available
|
||||
-L, --library Libraries with instrumentation routines (default: "libomnitrace-dl")
|
||||
-L, --library Libraries with instrumentation routines (default: "librocprof-sys-dl")
|
||||
-m, --main-function The primary function to instrument around, e.g. \'main\'
|
||||
--load Supplemental instrumentation library names w/o extension (e.g. \'libinstr\' for
|
||||
\'libinstr.so\' or \'libinstr.a\')
|
||||
@@ -200,17 +200,17 @@ view the help menu.
|
||||
-ME, --module-exclude Regex(es) for excluding modules/files/libraries (always applied)
|
||||
-MR, --module-restrict Regex(es) for restricting modules/files/libraries only to those that match the provided
|
||||
regular-expressions
|
||||
--internal-function-include Regex(es) for including functions which are (likely) utilized by omnitrace itself. Use
|
||||
--internal-function-include Regex(es) for including functions which are (likely) utilized by rocprof-sys itself. Use
|
||||
this option with care.
|
||||
--internal-module-include Regex(es) for including modules/libraries which are (likely) utilized by omnitrace
|
||||
--internal-module-include Regex(es) for including modules/libraries which are (likely) utilized by rocprof-sys
|
||||
itself. Use this option with care.
|
||||
--instruction-exclude Regex(es) for excluding functions containing certain instructions
|
||||
--internal-library-deps Treat the libraries linked to the internal libraries as internal libraries. This increase
|
||||
the internal library processing time and consume more memory (so use with care) but may
|
||||
be useful when the application uses Boost libraries and Dyninst is dynamically linked
|
||||
against the same boost libraries
|
||||
--internal-library-append Append to the list of libraries which omnitrace treats as being used internally, e.g.
|
||||
OmniTrace will find all the symbols in this library and prevent them from being
|
||||
--internal-library-append Append to the list of libraries which rocprof-sys treats as being used internally, e.g.
|
||||
ROCm Systems Profiler will find all the symbols in this library and prevent them from being
|
||||
instrumented.
|
||||
--internal-library-remove [ ld-linux-x86-64.so.2
|
||||
libBrokenLocale.so.1
|
||||
@@ -272,7 +272,7 @@ view the help menu.
|
||||
libz.so
|
||||
libzstd.so ]
|
||||
Remove the specified libraries from being treated as being used internally, e.g.
|
||||
OmniTrace will permit all the symbols in these libraries to be eligible for
|
||||
ROCm Systems Profiler will permit all the symbols in these libraries to be eligible for
|
||||
instrumentation.
|
||||
--linkage [ global | local | unique | unknown | weak ]
|
||||
Only instrument functions with specified linkage (default: global, local, unique)
|
||||
@@ -287,11 +287,11 @@ view the help menu.
|
||||
options to gain more information about the function signature or location of the
|
||||
functions
|
||||
-C, --config Read in a configuration file and encode these values as the defaults in the executable
|
||||
-d, --default-components Default components to instrument (only useful when timemory is enabled in omnitrace
|
||||
-d, --default-components Default components to instrument (only useful when timemory is enabled in rocprof-sys
|
||||
library)
|
||||
--env Environment variables to add to the runtime in form VARIABLE=VALUE. E.g. use \'--env
|
||||
OMNITRACE_PROFILE=ON\' to default to using timemory instead of perfetto
|
||||
--mpi Enable MPI support (requires omnitrace built w/ full or partial MPI support). NOTE: this
|
||||
ROCPROFSYS_PROFILE=ON\' to default to using timemory instead of perfetto
|
||||
--mpi Enable MPI support (requires rocprof-sys built w/ full or partial MPI support). NOTE: this
|
||||
will automatically be activated if MPI_Init, MPI_Init_thread, MPI_Finalize,
|
||||
MPI_Comm_rank, or MPI_Comm_size are found in the symbol table of target
|
||||
|
||||
@@ -322,8 +322,8 @@ view the help menu.
|
||||
--allow-overlapping Allow dyninst to instrument either multiple functions which overlap (share part of same
|
||||
function body) or single functions with multiple entry points. For more info, see Section
|
||||
2 of the DyninstAPI documentation.
|
||||
--parse-all-modules By default, omnitrace simply requests Dyninst to provide all the procedures in the
|
||||
application image. If this option is enabled, omnitrace will iterate over all the modules
|
||||
--parse-all-modules By default, rocprof-sys simply requests Dyninst to provide all the procedures in the
|
||||
application image. If this option is enabled, rocprof-sys will iterate over all the modules
|
||||
and extract the functions. Theoretically, it should be the same but the data is slightly
|
||||
different, possibly due to weak binding scopes. In general, enabling option will probably
|
||||
have no visible effect
|
||||
@@ -344,17 +344,17 @@ view the help menu.
|
||||
TypeChecking ]
|
||||
Advanced dyninst options: BPatch::set<OPTION>(bool), e.g. bpatch->setTrampRecursive(true)
|
||||
|
||||
``omnitrace-instrument`` uses a similar syntax as LLVM to separate command-line arguments from the
|
||||
application's arguments. It uses a standalone
|
||||
double-hyphen (``--``) as a separator.
|
||||
``rocprof-sys-instrument`` uses a syntax similar to that of LLVM to separate command-line arguments from the
|
||||
application's arguments. It uses a standalone
|
||||
double-hyphen (``--``) as a separator.
|
||||
All arguments preceding the double-hyphen
|
||||
are interpreted as belonging to Omnitrace and all arguments following the
|
||||
are interpreted as belonging to ROCm Systems Profiler and all arguments following the
|
||||
double-hyphen are interpreted as being part of the
|
||||
application and its arguments. In binary rewrite mode, all application arguments after the first argument
|
||||
are ignored. As an example, ``./omnitrace-instrument -o ls.inst -- ls -l`` interprets ``ls`` as
|
||||
are ignored. As an example, ``./rocprof-sys-instrument -o ls.inst -- ls -l`` interprets ``ls`` as
|
||||
the target to instrument, ignoring the ``-l`` argument,
|
||||
and generates a ``ls.inst`` executable that you can subsequently run using the
|
||||
``omnitrace-run -- ls.inst -l`` command.
|
||||
and generates a ``ls.inst`` executable that you can subsequently run using the
|
||||
``rocprof-sys-run -- ls.inst -l`` command.
|
||||
|
||||
Runtime instrumentation example
|
||||
========================================
|
||||
@@ -363,7 +363,7 @@ The following example shows how to enable runtime instrumentation.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-instrument <omnitrace-options> -- <exe> [<exe-options>...]
|
||||
rocprof-sys-instrument <rocprof-sys-options> -- <exe> [<exe-options>...]
|
||||
|
||||
Attaching to a running process
|
||||
========================================
|
||||
@@ -372,7 +372,7 @@ Use the following command to attach to an active process.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-instrument <omnitrace-options> -p <PID> -- <exe-name>
|
||||
rocprof-sys-instrument <rocprof-sys-options> -p <PID> -- <exe-name>
|
||||
|
||||
Binary rewrite
|
||||
========================================
|
||||
@@ -381,24 +381,24 @@ This example demonstrates how to rewrite a binary.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-instrument <omnitrace-options> -o <name-of-new-exe-or-library> -- <exe-or-library>
|
||||
rocprof-sys-instrument <rocprof-sys-options> -o <name-of-new-exe-or-library> -- <exe-or-library>
|
||||
|
||||
.. _binary-rewriting-library-label:
|
||||
|
||||
Binary rewrite of a library
|
||||
-----------------------------------
|
||||
|
||||
Many applications bundle the bulk of their functionality into one or more
|
||||
Many applications bundle the bulk of their functionality into one or more
|
||||
dynamic libraries and have a relatively simple ``main``
|
||||
which links to these libraries and serves as the "driver" for
|
||||
which links to these libraries and serves as the "driver" for
|
||||
setting up the workflow. If you perform a binary rewrite of an
|
||||
executable like this and find there is insufficient information, you
|
||||
executable like this and find there is insufficient information, you
|
||||
can either switch to runtime instrumentation or perform a
|
||||
binary rewrite on the relevant libraries.
|
||||
|
||||
Support for stand-alone binary rewriting of a dynamic library without a binary rewrite of
|
||||
Support for stand-alone binary rewriting of a dynamic library without a binary rewrite of
|
||||
the executable is a beta feature.
|
||||
In general, it is supported as long as the library contains the ``_init`` and
|
||||
In general, it is supported as long as the library contains the ``_init`` and
|
||||
``_fini`` symbols but these symbols are not
|
||||
standardized to the extent of ``main`` in an executable.
|
||||
|
||||
@@ -406,8 +406,8 @@ Here is the recommended workflow for the binary rewrite of a library:
|
||||
|
||||
#. Determine the names of the dynamically linked libraries of interest using ``ldd``
|
||||
#. Generate a binary rewrite of the executable
|
||||
#. Generate a binary rewrite of the desired libraries with the same base name as the
|
||||
original library, for example, ``libfoo.so.2`` instead of ``libfoo.so``, and output the instrumented
|
||||
#. Generate a binary rewrite of the desired libraries with the same base name as the
|
||||
original library, for example, ``libfoo.so.2`` instead of ``libfoo.so``, and output the instrumented
|
||||
library into a different folder than the original library.
|
||||
|
||||
#. Prefix the ``LD_LIBRARY_PATH`` environment variable with the output folder from the previous step
|
||||
@@ -433,10 +433,10 @@ Generate binary rewrites of ``foo`` and ``libfoo.so.2``:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-instrument -o ./foo.inst -- foo
|
||||
omnitrace-instrument -o ./libfoo.so.2 -- /usr/local/lib/libfoo.so.2
|
||||
rocprof-sys-instrument -o ./foo.inst -- foo
|
||||
rocprof-sys-instrument -o ./libfoo.so.2 -- /usr/local/lib/libfoo.so.2
|
||||
|
||||
At this point, the instrumented ``foo.inst`` executable still dynamically loads the
|
||||
At this point, the instrumented ``foo.inst`` executable still dynamically loads the
|
||||
original ``libfoo.so.2`` in ``/usr/local/lib``:
|
||||
|
||||
.. code-block:: shell
|
||||
@@ -446,7 +446,7 @@ original ``libfoo.so.2`` in ``/usr/local/lib``:
|
||||
libfoo.so.2 => /usr/local/lib/libfoo.so.2 (...)
|
||||
...
|
||||
|
||||
Prefix the ``LD_LIBRARY_PATH`` environment variable with the folder containing
|
||||
Prefix the ``LD_LIBRARY_PATH`` environment variable with the folder containing
|
||||
the instrumented ``libfoo.so.2``:
|
||||
|
||||
.. code-block:: shell
|
||||
@@ -465,90 +465,90 @@ the instrumented ``libfoo.so.2``:
|
||||
Selective instrumentation
|
||||
========================================
|
||||
|
||||
The default behavior of ``omnitrace-instrument`` does not instrument every symbol in the binary.
|
||||
The default behavior of ``rocprof-sys-instrument`` does not instrument every symbol in the binary.
|
||||
The default rules are:
|
||||
|
||||
* Skip instrumenting dynamic call-sites (such as function pointers)
|
||||
|
||||
* The ``--dynamic-callsites`` option forces instrumentation for all dynamic call-sites
|
||||
|
||||
* The cost of a function can be loosely approximated by the number of
|
||||
instructions. By default, ``omnitrace-instrument`` only instruments functions
|
||||
* The cost of a function can be loosely approximated by the number of
|
||||
instructions. By default, ``rocprof-sys-instrument`` only instruments functions
|
||||
with at least 1024 instructions
|
||||
|
||||
* The ``--min-instructions`` option modifies this heuristic for all functions which do not contain loops
|
||||
* The ``--min-instructions-loop`` option modifies this heuristic for functions which contain loops.
|
||||
|
||||
* The cost of a function can also be loosely approximated by the size of the function
|
||||
in the binary so this heuristic can be used in lieu of or in addition to the
|
||||
* The cost of a function can also be loosely approximated by the size of the function
|
||||
in the binary so this heuristic can be used in lieu of or in addition to the
|
||||
minimum number of instructions
|
||||
|
||||
* The ``--min-address-range`` option modifies this heuristic for all functions which do not contain loops
|
||||
* The ``--min-address-range-loop`` option modifies this heuristic for functions which contain loops
|
||||
* The ``--min-address-range-loop`` option modifies this heuristic for functions which contain loops
|
||||
|
||||
* Skip instrumentation points which require using a trap
|
||||
|
||||
|
||||
* See the description for the ``--traps`` and ``--loop-traps`` options for more information
|
||||
|
||||
* Skip instrumenting loops within the body of a function
|
||||
|
||||
* The ``--instrument-loops`` option enables this behavior
|
||||
|
||||
* Skip instrumenting functions with overlapping function bodies and single
|
||||
* Skip instrumenting functions with overlapping function bodies and single
|
||||
functions with multiple entry points
|
||||
|
||||
* These behaviors arise from various optimizations. Enable instrumenting for these functions
|
||||
* These behaviors arise from various optimizations. Enable instrumenting for these functions
|
||||
by using the ``--allow-overlapping`` option
|
||||
|
||||
.. note::
|
||||
|
||||
The separate loop options ``--min-instructions-loop`` and ``--min-address-range-loop``
|
||||
The separate loop options ``--min-instructions-loop`` and ``--min-address-range-loop``
|
||||
are provided because functions with loops can be compact in the binary while also being costly
|
||||
|
||||
Viewing the available, instrumented, excluded, and overlapping functions
|
||||
-------------------------------------------------------------------------
|
||||
|
||||
Whenever ``omnitrace-instrument`` runs with a verbosity of zero or higher,
|
||||
it generates files that detail which functions
|
||||
were available for instrumentation (along with the module they were defined in), actually instrumented,
|
||||
Whenever ``rocprof-sys-instrument`` runs with a verbosity of zero or higher,
|
||||
it generates files that detail which functions
|
||||
were available for instrumentation (along with the module they were defined in), actually instrumented,
|
||||
excluded, and which contained overlapping function bodies.
|
||||
By default, these files are saved to the ``omnitrace-<NAME>-output`` folder
|
||||
By default, these files are saved to the ``rocprof-sys-<NAME>-output`` folder
|
||||
where ``<NAME>`` is the base name of the targeted binary (or
|
||||
the base name of the resulting executable in the case of binary rewrite). For example,
|
||||
``omnitrace-instrument -- ls`` outputs these files to ``omnitrace-ls-output``
|
||||
whereas ``omnitrace-instrument -o ls.inst -- ls`` places them in ``omnitrace-ls.inst-output``.
|
||||
``rocprof-sys-instrument -- ls`` outputs these files to ``rocprof-sys-ls-output``
|
||||
whereas ``rocprof-sys-instrument -o ls.inst -- ls`` places them in ``rocprof-sys-ls.inst-output``.
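The naming rule can be sketched as a small shell helper. This is only an illustration of the ``rocprof-sys-<NAME>-output`` convention described above; the ``output_dir`` function is hypothetical and is not part of ROCm Systems Profiler.

```shell
# output_dir TARGET: print the default output folder for TARGET, following
# the rocprof-sys-<NAME>-output convention (hypothetical helper).
output_dir() {
    printf 'rocprof-sys-%s-output\n' "$(basename "$1")"
}

output_dir ls        # target binary, no -o option
output_dir ls.inst   # name passed to -o for a binary rewrite
```

Pass the target binary when no ``-o`` option is used, or the name given to ``-o`` for a binary rewrite.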
|
||||
|
||||
To generate these files without running or generating an
|
||||
To generate these files without running or generating an
|
||||
executable, use the ``--simulate`` option:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-instrument --simulate -- foo
|
||||
omnitrace-instrument --simulate -o foo.inst -- foo
|
||||
rocprof-sys-instrument --simulate -- foo
|
||||
rocprof-sys-instrument --simulate -o foo.inst -- foo
|
||||
|
||||
Excluding and including modules and functions
|
||||
----------------------------------------------
|
||||
|
||||
Omnitrace has a set of six command-line options which each accept one or more
|
||||
ROCm Systems Profiler has a set of six command-line options which each accept one or more
|
||||
regular expressions for customizing the scope of which modules and/or functions are
|
||||
instrumented. Multiple regex patterns per option are treated as an OR operation,
|
||||
instrumented. Multiple regex patterns per option are treated as an OR operation,
|
||||
for example, ``--module-include libfoo libbar`` is effectively the same as ``--module-include 'libfoo|libbar'``.
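The OR behavior can be illustrated with ``grep -E``, used here purely as a stand-in. It is an assumption that extended regular expressions behave like the tool's regex engine for a case this simple, and the module names are hypothetical.

```shell
# Hypothetical module names, used only to illustrate the OR semantics.
modules=$(printf '%s\n' libfoo.so libbar.so libbaz.so)

# Several patterns, one per -e flag: a line matches if ANY pattern matches.
multi=$(printf '%s\n' "$modules" | grep -E -e 'libfoo' -e 'libbar')

# A single pattern that uses alternation.
single=$(printf '%s\n' "$modules" | grep -E 'libfoo|libbar')

printf '%s\n' "$multi"
```

Both invocations select the same two names, which mirrors the equivalence between passing two patterns and passing one alternation.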
|
||||
|
||||
To force the inclusion of certain modules and/or functions
|
||||
To force the inclusion of certain modules and/or functions
|
||||
without changing any of the heuristics, use the ``--module-include`` and/or ``--function-include`` options.
|
||||
These options do not exclude modules or functions which do
|
||||
These options do not exclude modules or functions which do
|
||||
not satisfy their regular expression.
|
||||
|
||||
To narrow the scope of the instrumentation to a specific set
|
||||
To narrow the scope of the instrumentation to a specific set
|
||||
of libraries and/or functions, use the ``--module-restrict`` and ``--function-restrict`` options.
|
||||
These options let you exclusively select the union of one or more
|
||||
These options let you exclusively select the union of one or more
|
||||
regular expressions, regardless of whether or not the functions satisfy the
|
||||
previously-mentioned default heuristics. Any function or module that is not within
|
||||
previously-mentioned default heuristics. Any function or module that is not within
|
||||
the union of these regular expressions is excluded from instrumentation.
|
||||
|
||||
To avoid instrumenting a set of modules and/or functions,
|
||||
To avoid instrumenting a set of modules and/or functions,
|
||||
use the ``--module-exclude`` and ``--function-exclude`` options.
|
||||
These options are always applied, even if the module or function
|
||||
These options are always applied, even if the module or function
|
||||
satisfies the "restrict" or "include" regular expression.
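The way these three option families interact can be modeled with a short shell sketch. This is one reading of the rules described above, not the tool's implementation: the ``matches`` and ``decide`` helpers and the ``INCLUDE``, ``RESTRICT``, and ``EXCLUDE`` variables are hypothetical, and ``grep -E`` stands in for the tool's regex engine.

```shell
# matches NAME REGEX: succeed when REGEX is non-empty and NAME matches it.
matches() { [ -n "$2" ] && printf '%s\n' "$1" | grep -Eq -- "$2"; }

# decide NAME PASSES_HEURISTICS(yes|no): print "instrument" or "skip".
# Reads the hypothetical variables EXCLUDE, RESTRICT, and INCLUDE.
decide() {
    # "exclude" is always applied, even over "restrict" and "include".
    if matches "$1" "$EXCLUDE"; then echo skip; return; fi
    # "restrict" selects exclusively and ignores the default heuristics.
    if [ -n "$RESTRICT" ]; then
        if matches "$1" "$RESTRICT"; then echo instrument; else echo skip; fi
        return
    fi
    # "include" forces inclusion but never excludes anything else.
    if matches "$1" "$INCLUDE"; then echo instrument; return; fi
    # Otherwise the default heuristics decide.
    if [ "$2" = yes ]; then echo instrument; else echo skip; fi
}

EXCLUDE='^std::'
RESTRICT=''
INCLUDE='^small_'
decide small_helper no    # forced in by "include"
decide std::sort yes      # removed by "exclude"
decide tiny no            # left to the heuristics
```

The second argument stands for the outcome of the default heuristics, such as the minimum instruction count.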
|
||||
|
||||
.. _available-module-function-output:
|
||||
@@ -558,7 +558,7 @@ An example of the available module and function info output
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-instrument -o lulesh.inst --label file line args --simulate -- lulesh
|
||||
rocprof-sys-instrument -o lulesh.inst --label file line args --simulate -- lulesh
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
@@ -779,7 +779,7 @@ An example of instrumented module and function info output
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-instrument -o lulesh.inst --label file line args --simulate -- lulesh
|
||||
rocprof-sys-instrument -o lulesh.inst --label file line args --simulate -- lulesh
|
||||
|
||||
After the heuristics are applied based on the pattern in :ref:`available-module-function-output`,
|
||||
the selected module and functions are:
|
||||
@@ -850,15 +850,15 @@ Sampling
|
||||
|
||||
This capability has been deprecated in favor of :doc:`Call stack sampling <./sampling-call-stack>`.
|
||||
|
||||
By default, ``omnitrace-instrument`` uses ``--mode trace`` for instrumentation. The ``--mode sampling`` option
|
||||
By default, ``rocprof-sys-instrument`` uses ``--mode trace`` for instrumentation. The ``--mode sampling`` option
|
||||
only instruments ``main`` in an executable. It activates both CPU call-stack sampling and
|
||||
background system-level thread sampling by default.
|
||||
Tracing capabilities which do not rely on instrumentation, such as the HIP API and kernel tracing
|
||||
(which is collected by roctracer), are still available.
|
||||
|
||||
The Omnitrace sampling capabilities are always available, even in trace mode, but are deactivated by default.
|
||||
To activate sampling in trace mode, set ``OMNITRACE_USE_SAMPLING=ON`` in the environment
|
||||
or in an Omnitrace configuration file.
|
||||
The ROCm Systems Profiler sampling capabilities are always available, even in trace mode, but are deactivated by default.
|
||||
To activate sampling in trace mode, set ``ROCPROFSYS_USE_SAMPLING=ON`` in the environment
|
||||
or in a ROCm Systems Profiler configuration file.
|
||||
|
||||
Embedding a default configuration
|
||||
========================================
|
||||
@@ -872,31 +872,31 @@ the configuration settings are not be preserved for subsequent sessions:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-instrument -o ./foo.inst -- ./foo
|
||||
export OMNITRACE_USE_SAMPLING=ON
|
||||
export OMNITRACE_SAMPLING_FREQ=5
|
||||
omnitrace-run -- ./foo.inst
|
||||
rocprof-sys-instrument -o ./foo.inst -- ./foo
|
||||
export ROCPROFSYS_USE_SAMPLING=ON
|
||||
export ROCPROFSYS_SAMPLING_FREQ=5
|
||||
rocprof-sys-run -- ./foo.inst
|
||||
|
||||
Whereas the following command preserves those environment variables:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-instrument -o ./foo.samp --env OMNITRACE_USE_SAMPLING=ON OMNITRACE_SAMPLING_FREQ=5 -- ./foo
|
||||
rocprof-sys-instrument -o ./foo.samp --env ROCPROFSYS_USE_SAMPLING=ON ROCPROFSYS_SAMPLING_FREQ=5 -- ./foo
|
||||
|
||||
They can now be used in future sessions.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
# will sample 5x per second
|
||||
omnitrace-run -- ./foo.samp
|
||||
rocprof-sys-run -- ./foo.samp
|
||||
|
||||
Even though the environment variables are preserved, subsequent sessions can still override those defaults:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
# will sample 100x per second
|
||||
export OMNITRACE_SAMPLING_FREQ=100
|
||||
omnitrace-run -- ./foo.samp
|
||||
export ROCPROFSYS_SAMPLING_FREQ=100
|
||||
rocprof-sys-run -- ./foo.samp
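The override behavior resembles the shell's own default-value expansion. The sketch below is only an analogy under that assumption; it does not show how ROCm Systems Profiler stores or applies the embedded values, and ``effective_freq`` is a hypothetical helper.

```shell
# effective_freq: print the sampling frequency a session would see if an
# embedded default of 5 applied only when the variable is unset.
effective_freq() {
    printf '%s\n' "${ROCPROFSYS_SAMPLING_FREQ:-5}"
}

unset ROCPROFSYS_SAMPLING_FREQ
effective_freq            # the embedded default is used

export ROCPROFSYS_SAMPLING_FREQ=100
effective_freq            # the environment overrides the default
```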
|
||||
|
||||
.. _rpath-troubleshooting:
|
||||
|
||||
@@ -906,10 +906,10 @@ Troubleshooting
|
||||
Checking for RPATH
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
If ``ldd ./foo.inst`` from the :ref:`binary-rewriting-library-label`
|
||||
section still returns ``/usr/local/lib/libfoo.so.2``, the executable could have
|
||||
If ``ldd ./foo.inst`` from the :ref:`binary-rewriting-library-label`
|
||||
section still returns ``/usr/local/lib/libfoo.so.2``, the executable could have
|
||||
an rpath encoded in the binary.
|
||||
This ELF entry results in the dynamic linker ignoring ``LD_LIBRARY_PATH`` if
|
||||
This ELF entry results in the dynamic linker ignoring ``LD_LIBRARY_PATH`` if
|
||||
it finds ``libfoo.so.2`` in the rpath.
|
||||
Using the ``objdump`` tool, perform the following query:
|
||||
|
||||
@@ -923,13 +923,13 @@ If this produces output that appears similar to this output.:
|
||||
|
||||
RUNPATH $ORIGIN:$ORIGIN/../lib
|
||||
|
||||
Remove or modify the rpath to get ``foo.inst`` to resolve
|
||||
Remove or modify the rpath to get ``foo.inst`` to resolve
|
||||
to the instrumented ``libfoo.so.2`` as explained in the next section.
|
||||
|
||||
Modifying an RPATH
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
This code snippet uses the ``patchelf`` tool to modify the rpath of the given executable
|
||||
This code snippet uses the ``patchelf`` tool to modify the rpath of the given executable
|
||||
or library to ``/home/user``, which is where the instrumented libraries are located.
|
||||
|
||||
.. note::
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
:description: ROCm Systems Profiler causal profiling documentation and reference
|
||||
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, causal profiling, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
****************************************************
|
||||
Performing causal profiling
|
||||
@@ -18,10 +18,6 @@ Thus, causal profiling works by performing experiments on blocks of code during
|
||||
insert pauses to slow down all other concurrently running code. During post-processing, these experiments
|
||||
are translated into calculations for the potential impact of speeding up this block of code.
|
||||
|
||||
.. note::
|
||||
|
||||
Causal profiling supersedes the original critical trace feature, which was removed in Omnitrace v1.11.0.
|
||||
|
||||
Consider the following C++ code executing ``foo`` and ``bar`` concurrently in two different threads
|
||||
where ``foo`` is ideally 30% faster than ``bar``:
|
||||
|
||||
@@ -51,52 +47,52 @@ where ``foo`` is ideally 30% faster than ``bar``:
|
||||
itr.join();
|
||||
}
|
||||
|
||||
No matter how many optimizations are applied to ``foo``, the application will always
|
||||
No matter how many optimizations are applied to ``foo``, the application will always
|
||||
require the same amount of time
|
||||
because the end-to-end performance is limited by ``bar``. However, a 5% speed-up
|
||||
because the end-to-end performance is limited by ``bar``. However, a 5% speed-up
|
||||
in ``bar`` results in the
|
||||
end-to-end performance improving by 5%. This trend continues linearly, with a 10% speed-up
|
||||
end-to-end performance improving by 5%. This trend continues linearly, with a 10% speed-up
|
||||
in ``bar`` yielding a 10% speed-up in
|
||||
end-to-end performance, and so on, up to a 30% speed-up, at which point ``bar`` runs as fast as ``foo``.
|
||||
Any speed-up to ``bar`` beyond 30% still only yields an end-to-end performance
|
||||
Any speed-up to ``bar`` beyond 30% still only yields an end-to-end performance
|
||||
improvement of 30% because the application
|
||||
is now limited by performance of ``foo``, as demonstrated below in the causal
|
||||
is now limited by performance of ``foo``, as demonstrated below in the causal
|
||||
profiling visualization:
|
||||
|
||||
.. image:: ../data/causal-foobar.png
|
||||
:alt: Visualization of the performance improvements for two functions with causal profiling
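The relationship described above amounts to capping the end-to-end gain at the 30% gap between ``foo`` and ``bar``. A minimal sketch of that arithmetic, assuming whole-number percentages and a hypothetical ``end_to_end_gain`` helper:

```shell
# end_to_end_gain BAR_SPEEDUP: print the end-to-end improvement (percent)
# for a given speed-up of bar (percent, integer), capped at the 30% gap to foo.
end_to_end_gain() {
    if [ "$1" -lt 30 ]; then echo "$1"; else echo 30; fi
}

for s in 0 5 10 30 50; do
    printf 'bar %s%% faster -> end-to-end %s%% faster\n' "$s" "$(end_to_end_gain "$s")"
done
```

Speeding up ``foo`` is not modeled because, in this example, it never changes the end-to-end time.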
|
||||
|
||||
The full details of the causal profiling methodology can be found in the paper
|
||||
The full details of the causal profiling methodology can be found in the paper
|
||||
`Coz: Finding Code that Counts with Causal Profiling <http://arxiv.org/pdf/1608.03676v1.pdf>`_.
|
||||
The authors' implementation is publicly available on `GitHub <https://github.com/plasma-umass/coz>`_.
|
||||
|
||||
Getting started
|
||||
========================================
|
||||
|
||||
To effectively use causal profiling, it is important to understand a few key
|
||||
To effectively use causal profiling, it is important to understand a few key
|
||||
concepts, such as progress points.
|
||||
|
||||
Progress points
|
||||
-----------------------------------
|
||||
|
||||
Causal profiling requires "progress points" to track progress through the code
|
||||
Causal profiling requires "progress points" to track progress through the code
|
||||
in between samples. Progress points must be triggered in a deterministic manner via instrumentation.
|
||||
This can happen in three different ways:
|
||||
|
||||
* `Omnitrace <https://github.com/ROCm/omnitrace>`_ can leverage the callbacks from
|
||||
Kokkos-Tools, OpenMP-Tools, roctracer, etc. and the wrappers around functions for
|
||||
* `ROCm Systems Profiler <https://github.com/ROCm/rocprofiler-systems>`_ can leverage the callbacks from
|
||||
Kokkos-Tools, OpenMP-Tools, roctracer, etc. and the wrappers around functions for
|
||||
MPI, NUMA, RCCL, etc. to act as progress points
|
||||
* Users can leverage the :doc:`runtime instrumentation capabilities <./instrumenting-rewriting-binary-application>`
|
||||
* Users can leverage the :doc:`runtime instrumentation capabilities <./instrumenting-rewriting-binary-application>`
|
||||
to insert progress points
|
||||
* Users can leverage :doc:`User APIs <../how-to/using-omnitrace-api>`,
|
||||
such as ``OMNITRACE_CAUSAL_PROGRESS``
|
||||
* Users can leverage :doc:`User APIs <../how-to/using-rocprof-sys-api>`,
|
||||
such as ``ROCPROFSYS_CAUSAL_PROGRESS``
|
||||
|
||||
.. note::
|
||||
|
||||
Binary rewrite to insert progress points is not supported. When a rewritten binary
|
||||
runs, Dyninst translates the instruction pointer address in order to perform
|
||||
the instrumentation. As a result, call stack samples never return instruction
|
||||
pointer addresses within the valid Omnitrace range.
|
||||
Binary rewrite to insert progress points is not supported. When a rewritten binary
|
||||
runs, Dyninst translates the instruction pointer address in order to perform
|
||||
the instrumentation. As a result, call stack samples never return instruction
|
||||
pointer addresses within the valid ROCm Systems Profiler range.
|
||||
|
||||
Key concepts
|
||||
-----------------------------------
|
||||
@@ -104,26 +100,26 @@ Key concepts
|
||||
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
|
||||
| Concept | Setting | Options | Description |
|
||||
+==================+=====================================+==================================+============================================+
|
||||
| Backend | ``OMNITRACE_CAUSAL_BACKEND`` | ``perf``, ``timer`` | Backend for recording samples required |
|
||||
| Backend | ``ROCPROFSYS_CAUSAL_BACKEND`` | ``perf``, ``timer`` | Backend for recording samples required |
|
||||
| | | | to calculate the virtual speed-up |
|
||||
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
|
||||
| Mode | ``OMNITRACE_CAUSAL_MODE`` | ``function``, ``line`` | Select an entire function or individual |
|
||||
| Mode | ``ROCPROFSYS_CAUSAL_MODE`` | ``function``, ``line`` | Select an entire function or individual |
|
||||
| | | | line of code for causal experiments |
|
||||
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
|
||||
| End-to-end | ``OMNITRACE_CAUSAL_END_TO_END`` | Boolean | Perform a single experiment during the |
|
||||
| End-to-end | ``ROCPROFSYS_CAUSAL_END_TO_END`` | Boolean | Perform a single experiment during the |
|
||||
| | | | entire run (does not require |
|
||||
| | | | progress points) |
|
||||
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
|
||||
| Fixed speed-up | ``OMNITRACE_CAUSAL_FIXED_SPEEDUP`` | one or more values from [0, 100] | Virtual speed-up or pool of virtual |
|
||||
| Fixed speed-up | ``ROCPROFSYS_CAUSAL_FIXED_SPEEDUP`` | one or more values from [0, 100] | Virtual speed-up or pool of virtual |
|
||||
| | | | speed-ups to randomly select |
|
||||
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
|
||||
| Binary scope | ``OMNITRACE_CAUSAL_BINARY_SCOPE`` | regular expression(s) | Dynamic binaries containing code for |
|
||||
| Binary scope | ``ROCPROFSYS_CAUSAL_BINARY_SCOPE`` | regular expression(s) | Dynamic binaries containing code for |
|
||||
| | | | experiments |
|
||||
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
|
||||
| Source scope | ``OMNITRACE_CAUSAL_SOURCE_SCOPE`` | regular expression(s) | ``<file>`` and/or ``<file>:<line>`` |
|
||||
| Source scope | ``ROCPROFSYS_CAUSAL_SOURCE_SCOPE`` | regular expression(s) | ``<file>`` and/or ``<file>:<line>`` |
|
||||
| | | | containing code to include in experiments |
|
||||
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
|
||||
| Function scope | ``OMNITRACE_CAUSAL_FUNCTION_SCOPE`` | regular expression(s) | Restricts experiments to matching |
|
||||
| Function scope | ``ROCPROFSYS_CAUSAL_FUNCTION_SCOPE`` | regular expression(s) | Restricts experiments to matching |
|
||||
| | | | functions (function mode) or lines of |
|
||||
| | | | code within matching functions (line mode) |
|
||||
+------------------+-------------------------------------+----------------------------------+--------------------------------------------+
|
||||
@@ -137,30 +133,30 @@ Key concepts
|
||||
Backends
|
||||
-----------------------------------
|
||||
|
||||
There are two backends to choose from: ``perf`` and ``timer``.
|
||||
They are used to record the samples required to calculate the virtual speedup.
|
||||
There are two backends to choose from: ``perf`` and ``timer``.
|
||||
They are used to record the samples required to calculate the virtual speedup.
|
||||
Both backends interrupt each thread 1000 times per second (of CPU-time) to apply the virtual speed-ups.
|
||||
The difference between each backend is how the samples are recorded.
|
||||
There are three key differences between the two backends:
|
||||
|
||||
* the ``perf`` backend requires Linux Perf and elevated security privileges
|
||||
* the ``perf`` backend interrupts the application less frequently whereas the ``timer`` backend
|
||||
* the ``perf`` backend interrupts the application less frequently whereas the ``timer`` backend
|
||||
interrupts the application 1000 times per second of realtime
|
||||
* the ``timer`` backend has less accurate call stacks due to instruction pointer skid
|
||||
|
||||
In general, the ``perf`` backend is preferred over the ``timer`` backend when sufficient
|
||||
In general, the ``perf`` backend is preferred over the ``timer`` backend when sufficient
|
||||
security privileges permit its usage.
|
||||
If ``OMNITRACE_CAUSAL_BACKEND`` is set to ``auto``, Omnitrace falls back
|
||||
If ``ROCPROFSYS_CAUSAL_BACKEND`` is set to ``auto``, ROCm Systems Profiler falls back
|
||||
to using the ``timer`` backend only if
|
||||
the ``perf`` backend fails. If ``OMNITRACE_CAUSAL_BACKEND`` is
|
||||
set to ``perf`` and using this backend fails, Omnitrace aborts.
|
||||
the ``perf`` backend fails. If ``ROCPROFSYS_CAUSAL_BACKEND`` is
|
||||
set to ``perf`` and using this backend fails, ROCm Systems Profiler aborts.
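The fallback rules can be summarized as a small decision function. This is a sketch of the behavior described above, not the profiler's code; ``select_backend`` is a hypothetical helper, and a non-zero return status stands in for an abort.

```shell
# select_backend REQUESTED PERF_OK(yes|no)
# Prints the backend that would be used; returns non-zero to model an abort.
select_backend() {
    case "$1" in
        timer) echo timer ;;
        perf)  if [ "$2" = yes ]; then echo perf; else return 1; fi ;;
        auto)  if [ "$2" = yes ]; then echo perf; else echo timer; fi ;;
        *)     return 2 ;;   # unknown setting
    esac
}

select_backend auto no    # falls back to the timer backend
select_backend perf no || echo "abort: perf backend requested but unavailable"
```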
|
||||
|
||||
Instruction pointer skid
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Instruction pointer (IP) skid measures how many instructions run after the event of interest
|
||||
before the program actually stops. The IP skid is calculated by subtracting
|
||||
the location of the IP at the point of interest from the location of the IP
|
||||
the location of the IP at the point of interest from the location of the IP
|
||||
when the kernel finally stops the application.
|
||||
For the ``timer`` backend, this translates to the
|
||||
difference in the IP between when the timer generated a signal and when the
|
||||
@@ -172,9 +168,9 @@ especially in ``line`` mode.
|
||||
Installing Linux Perf
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Linux Perf is built into the kernel and may already be installed
|
||||
Linux Perf is built into the kernel and may already be installed
|
||||
(for instance, it is included in the default kernel for OpenSUSE).
|
||||
The official method of checking whether Linux Perf is installed is
|
||||
The official method of checking whether Linux Perf is installed is
|
||||
checking for the existence of the file
|
||||
``/proc/sys/kernel/perf_event_paranoid``. If the file exists, the kernel has Perf installed.
|
||||
|
||||
@@ -184,12 +180,12 @@ If this file does not exist, as with Debian-based systems like Ubuntu, run the f
|
||||
|
||||
apt-get install linux-tools-common linux-tools-generic linux-tools-$(uname -r)
|
||||
|
||||
and reboot your computer. In order to use the ``perf`` backend, the value
|
||||
and reboot your computer. In order to use the ``perf`` backend, the value
|
||||
of ``/proc/sys/kernel/perf_event_paranoid``
|
||||
should be less than or equal to 2. If the value in this file is greater than 2, you can't
|
||||
should be less than or equal to 2. If the value in this file is greater than 2, you can't
|
||||
use the ``perf`` backend.
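Both checks, that the file exists and that its value is at most 2, can be combined in a short script. This is a minimal sketch; the ``perf_backend_usable`` helper is hypothetical, and its optional argument exists only so the check can be tried against a different file.

```shell
# perf_backend_usable [FILE]: succeed when FILE exists and holds a value <= 2.
# FILE defaults to /proc/sys/kernel/perf_event_paranoid.
perf_backend_usable() {
    file="${1:-/proc/sys/kernel/perf_event_paranoid}"
    [ -r "$file" ] || return 1
    level=$(cat "$file")
    [ "$level" -le 2 ]
}

if perf_backend_usable; then
    echo "perf backend should be usable"
else
    echo "perf backend is unavailable; consider the timer backend"
fi
```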
|
||||
|
||||
To update the paranoid level temporarily until the system is rebooted, run
|
||||
To update the paranoid level temporarily until the system is rebooted, run
|
||||
one of the following commands
|
||||
as a superuser (where ``PARANOID_LEVEL=<N>`` has a value of ``<N>`` in the range ``[-1, 2]``):
|
||||
|
||||
@@ -206,18 +202,18 @@ or
|
||||
To make the paranoid level persistent after a reboot, add ``kernel.perf_event_paranoid=<N>``
|
||||
(where ``<N>`` is the desired paranoid level) to the ``/etc/sysctl.conf`` file.
|
||||
|
||||
Speed-up prediction variability and the omnitrace-causal executable
|
||||
Speed-up prediction variability and the rocprof-sys-causal executable
|
||||
-----------------------------------------------------------------------
|
||||
|
||||
Causal profiling typically requires running the application several times in
|
||||
order to adequately sample all the code domains, experiment
|
||||
Causal profiling typically requires running the application several times in
|
||||
order to adequately sample all the code domains, experiment
|
||||
with speed-ups and other techniques, and resolve statistical fluctuations.
|
||||
The ``omnitrace-causal`` executable is designed to simplify this procedure:
|
||||
The ``rocprof-sys-causal`` executable is designed to simplify this procedure:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omnitrace-causal --help
|
||||
[omnitrace-causal] Usage: ./bin/omnitrace-causal [ --help (count: 0, dtype: bool)
|
||||
$ rocprof-sys-causal --help
|
||||
[rocprof-sys-causal] Usage: ./bin/rocprof-sys-causal [ --help (count: 0, dtype: bool)
|
||||
--version (count: 0, dtype: bool)
|
||||
--monochrome (max: 1, dtype: bool)
|
||||
--debug (max: 1, dtype: bool)
|
||||
@@ -246,21 +242,21 @@ The ``omnitrace-causal`` executable is designed to simplify this procedure:
|
||||
This executable is designed to streamline that process.
|
||||
For example (assume all commands end with \'-- <exe> <args>\'):
|
||||
|
||||
omnitrace-causal -n 5 -- <exe> # runs <exe> 5x with causal profiling enabled
|
||||
rocprof-sys-causal -n 5 -- <exe> # runs <exe> 5x with causal profiling enabled
|
||||
|
||||
omnitrace-causal -s 0 5,10,15,20 # runs <exe> 2x with virtual speedups:
|
||||
rocprof-sys-causal -s 0 5,10,15,20 # runs <exe> 2x with virtual speedups:
|
||||
# - 0
|
||||
# - randomly selected from 5, 10, 15, and 20
|
||||
|
||||
omnitrace-causal -F func_A func_B func_(A|B) # runs <exe> 3x with the function scope limited to:
|
||||
rocprof-sys-causal -F func_A func_B func_(A|B) # runs <exe> 3x with the function scope limited to:
|
||||
# 1. func_A
|
||||
# 2. func_B
|
||||
# 3. func_A or func_B
|
||||
General tips:
|
||||
- Insert progress points at hotspots in your code or use omnitrace\'s runtime instrumentation
|
||||
- Insert progress points at hotspots in your code or use rocprof-sys\'s runtime instrumentation
|
||||
- Note: binary rewrite will produce a incompatible new binary
|
||||
- Run omnitrace-causal in "function" mode first (does not require debug info)
|
||||
- Run omnitrace-causal in "line" mode when you are targeting one function (requires debug info)
|
||||
- Run rocprof-sys-causal in "function" mode first (does not require debug info)
|
||||
- Run rocprof-sys-causal in "line" mode when you are targeting one function (requires debug info)
|
||||
- Preferably, use predictions from the "function" mode to determine which function to target
|
||||
- Limit the virtual speedups to a smaller pool, e.g., 0,5,10,25,50, to get reliable predictions quicker
|
||||
- Make use of the binary, source, and function scope to limit the functions/lines selected for experiments
|
||||
@@ -280,15 +276,15 @@ The ``omnitrace-causal`` executable is designed to simplify this procedure:
|
||||
[GENERAL OPTIONS]
|
||||
|
||||
-c, --config Base configuration file
|
||||
-l, --launcher When running MPI jobs, omnitrace-causal needs to be *before* the executable which launches the MPI processes (i.e.
|
||||
-l, --launcher When running MPI jobs, rocprof-sys-causal needs to be *before* the executable which launches the MPI processes (i.e.
|
||||
before `mpirun`, `srun`, etc.). Pass the name of the target executable (or a regex for matching to the name of the
|
||||
target) for causal profiling, e.g., `omnitrace-causal -l foo -- mpirun -n 4 foo`. This ensures that the omnitrace
|
||||
target) for causal profiling, e.g., `rocprof-sys-causal -l foo -- mpirun -n 4 foo`. This ensures that the rocprof-sys
|
||||
library is LD_PRELOADed on the proper target
|
||||
-g, --generate-configs Generate config files instead of passing environment variables directly. If no arguments are provided, the config files
|
||||
will be placed in ${PWD}/omnitrace-causal-config folder
|
||||
will be placed in ${PWD}/rocprof-sys-causal-config folder
|
||||
--no-defaults Do not activate default features which are recommended for causal profiling. For example: PID-tagging of output files
|
||||
and timestamped subdirectories are disabled by default. Kokkos tools support is added by default
|
||||
(OMNITRACE_USE_KOKKOSP=ON) because, for Kokkos applications, the Kokkos-Tools callbacks are used for progress points.
|
||||
(ROCPROFSYS_USE_KOKKOSP=ON) because, for Kokkos applications, the Kokkos-Tools callbacks are used for progress points.
|
||||
Activation of OpenMP tools support is similar
|
||||
|
||||
[CAUSAL PROFILING OPTIONS (General)]
|
||||
@@ -335,20 +331,20 @@ Examples
|
||||
|
||||
#!/bin/bash -e
|
||||
|
||||
module load omnitrace
|
||||
module load rocprofiler-systems
|
||||
|
||||
N=20
|
||||
I=3
|
||||
|
||||
# when providing speedups to omnitrace-causal, speedup
|
||||
# when providing speedups to rocprof-sys-causal, speedup
|
||||
# groups are separated by a space so "0,10" results in
|
||||
# one speedup group where omnitrace samples from
|
||||
# one speedup group where rocprof-sys samples from
|
||||
# the speedup set of {0, 10}. Passing "0 10" (without
|
||||
# quotes) to omnitrace-causal multiplies the
|
||||
# quotes) to rocprof-sys-causal multiplies the
|
||||
# number of runs by 2, where the first half of the
|
||||
# runs instruct omnitrace to only use 0 as the
|
||||
# runs instruct rocprof-sys to only use 0 as the
|
||||
# speedup and the second half of the runs instruct
|
||||
# omnitrace to only use 10 as the speedup.
|
||||
# rocprof-sys to only use 10 as the speedup.
|
||||
SPEEDUPS="0,0,0,10,20,30,40,50,50,75,75,75,90,90,90"
|
||||
# thus, -s ${SPEEDUPS} only multiplies the number
|
||||
# of runs by 1 whereas -S ${SPEEDUPS_E2E} multiplies
|
||||
@@ -370,14 +366,14 @@ Examples
|
||||
#
|
||||
# total executions: 20
|
||||
#
|
||||
omnitrace-causal \
|
||||
rocprof-sys-causal \
|
||||
-n ${N} \
|
||||
-s ${SPEEDUPS} \
|
||||
-m function \
|
||||
-o experiments.func \
|
||||
-S ".*\\.cpp" \
|
||||
-- \
|
||||
./causal-omni-cpu "${@}"
|
||||
./causal-rocprofsys-cpu "${@}"
|
||||
|
||||
|
||||
# 20 iterations in line mode with 1 speedup group
|
||||
@@ -390,14 +386,14 @@ Examples
|
||||
#
|
||||
# total executions: 20
|
||||
#
|
||||
omnitrace-causal \
|
||||
rocprof-sys-causal \
|
||||
-n ${N} \
|
||||
-s ${SPEEDUPS} \
|
||||
-m line \
|
||||
-o experiments.line \
|
||||
-S "causal\\.cpp:(100|110)" \
|
||||
-- \
|
||||
./causal-omni-cpu "${@}"
|
||||
./causal-rocprofsys-cpu "${@}"
|
||||
|
||||
|
||||
# 3 iterations in function mode of 15 singular speedups
|
||||
@@ -411,7 +407,7 @@ Examples
|
||||
#
|
||||
# total executions: 90
|
||||
#
|
||||
omnitrace-causal \
|
||||
rocprof-sys-causal \
|
||||
-n ${I} \
|
||||
-s ${SPEEDUPS_E2E} \
|
||||
-m func \
|
||||
@@ -420,7 +416,7 @@ Examples
|
||||
-F "cpu_slow_func" \
|
||||
"cpu_fast_func" \
|
||||
-- \
|
||||
./causal-omni-cpu "${@}"
|
||||
./causal-rocprofsys-cpu "${@}"
|
||||
|
||||
# 3 iterations in line mode of 15 singular speedups
|
||||
# in end-to-end mode with 2 different source scopes
|
||||
@@ -433,7 +429,7 @@ Examples
|
||||
#
|
||||
# total executions: 90
|
||||
#
|
||||
omnitrace-causal \
|
||||
rocprof-sys-causal \
|
||||
-n ${I} \
|
||||
-s ${SPEEDUPS_E2E} \
|
||||
-m line \
|
||||
@@ -442,7 +438,7 @@ Examples
|
||||
-S "causal\\.cpp:100" \
|
||||
"causal\\.cpp:110" \
|
||||
-- \
|
||||
./causal-omni-cpu "${@}"
|
||||
./causal-rocprofsys-cpu "${@}"
|
||||
|
||||
|
||||
export OMP_NUM_THREADS=8
|
||||
@@ -468,7 +464,7 @@ Examples
|
||||
# existing causal/experiments.func.(coz|json)
|
||||
# file due to "--reset" argument
|
||||
#
|
||||
omnitrace-causal \
|
||||
rocprof-sys-causal \
|
||||
--reset \
|
||||
-n ${N} \
|
||||
-s ${SPEEDUPS} \
|
||||
@@ -477,7 +473,7 @@ Examples
|
||||
-S "lulesh.*" \
|
||||
-FE "^(Kokkos::|std::enable_if)" \
|
||||
-- \
|
||||
./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p
|
||||
./lulesh-rocprofsys -i 50 -s 200 -r 20 -b 5 -c 5 -p
|
||||
|
||||
|
||||
# 5 iterations in line mode of 1 speedup
|
||||
@@ -498,7 +494,7 @@ Examples
|
||||
# existing causal/experiments.line.(coz|json)
|
||||
# file due to "--reset" argument
|
||||
#
|
||||
omnitrace-causal \
|
||||
rocprof-sys-causal \
|
||||
--reset \
|
||||
-n ${N} \
|
||||
-s ${SPEEDUPS} \
|
||||
@@ -507,7 +503,7 @@ Examples
|
||||
-S "lulesh.*" \
|
||||
-FE "^(exec_range|execute);construct_shared_allocation;\\._omp_fn\\." \
|
||||
-- \
|
||||
./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p
|
||||
./lulesh-rocprofsys -i 50 -s 200 -r 20 -b 5 -c 5 -p
|
||||
|
||||
|
||||
# 5 iterations in line mode of 1 speedup
|
||||
@@ -528,7 +524,7 @@ Examples
|
||||
# existing causal/experiments.line.(coz|json)
|
||||
# file due to "--reset" argument
|
||||
#
|
||||
omnitrace-causal \
|
||||
rocprof-sys-causal \
|
||||
--reset \
|
||||
-n ${N} \
|
||||
-s ${SPEEDUPS} \
|
||||
@@ -539,30 +535,30 @@ Examples
|
||||
"CalcVolumeForceForElems" \
|
||||
-S "lulesh\\.cc" \
|
||||
-- \
|
||||
./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p
|
||||
./lulesh-rocprofsys -i 50 -s 200 -r 20 -b 5 -c 5 -p
|
||||
|
||||
Using omnitrace-causal with other launchers like mpirun
|
||||
Using rocprof-sys-causal with other launchers like mpirun
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
The ``omnitrace-causal`` executable is intended to assist with application replay
|
||||
The ``rocprof-sys-causal`` executable is intended to assist with application replay
|
||||
and is designed to always be at the start of the command line as the primary process.
|
||||
``omnitrace-causal`` typically adds a ``LD_PRELOAD`` of the Omnitrace libraries
|
||||
``rocprof-sys-causal`` typically adds a ``LD_PRELOAD`` of the ROCm Systems Profiler libraries
|
||||
into the environment before launching the command to inject the functionality
|
||||
required to start the causal profiling tooling. However, this is problematic
|
||||
when the target application for causal profiling uses a launcher, in which case
|
||||
it is listed as an argument rather than as the main application. For example,
|
||||
``foo`` is the target application for profiling, but the command to run it is
|
||||
``mpirun -n 2 foo``. Running the command ``omnitrace-causal -- mpirun -n 2 foo``
|
||||
applies the causal profiling to ``mpirun`` instead of ``foo``.
|
||||
required to start the causal profiling tooling. However, this is problematic
|
||||
when the target application for causal profiling uses a launcher, in which case
|
||||
it is listed as an argument rather than as the main application. For example,
|
||||
``foo`` is the target application for profiling, but the command to run it is
|
||||
``mpirun -n 2 foo``. Running the command ``rocprof-sys-causal -- mpirun -n 2 foo``
|
||||
applies the causal profiling to ``mpirun`` instead of ``foo``.
|
||||
|
||||
``omnitrace-causal`` remedies this by providing a command-line option ``-l` / `--launcher``
|
||||
to indicate the target application is using a launcher script/executable. The
|
||||
``rocprof-sys-causal`` remedies this by providing a command-line option ``-l`` / ``--launcher``
|
||||
to indicate the target application is using a launcher script/executable. The
|
||||
argument to the command-line option is the name of, or regular expression for, the target application
|
||||
on the command line. When ``--launcher`` is used, ``omnitrace-causal`` generates
|
||||
on the command line. When ``--launcher`` is used, ``rocprof-sys-causal`` generates
|
||||
all the replay configurations and runs them but delays adding the ``LD_PRELOAD``. Instead it
|
||||
inserts a call to itself into the command line right before the target
|
||||
inserts a call to itself into the command line right before the target
|
||||
application. This recursive call inherits the configuration from
|
||||
the parent ``omnitrace-causal`` executable, inserts an ``LD_PRELOAD`` into the environment,
|
||||
the parent ``rocprof-sys-causal`` executable, inserts an ``LD_PRELOAD`` into the environment,
|
||||
and calls ``execv`` to replace itself with the new process launched by the target
|
||||
application.
|
||||
|
||||
@@ -570,32 +566,32 @@ In other words, the following command:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-causal -l foo -n 3 -- mpirun -n 2 foo`
|
||||
rocprof-sys-causal -l foo -n 3 -- mpirun -n 2 foo
|
||||
|
||||
Effectively results in:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
mpirun -n 2 omnitrace-causal -- foo
|
||||
mpirun -n 2 omnitrace-causal -- foo
|
||||
mpirun -n 2 omnitrace-causal -- foo
|
||||
mpirun -n 2 rocprof-sys-causal -- foo
|
||||
mpirun -n 2 rocprof-sys-causal -- foo
|
||||
mpirun -n 2 rocprof-sys-causal -- foo
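
The launcher handling described above can be pictured as a small rewrite of the argument list. The following Python sketch is illustrative only and is not the tool's implementation: the helper name ``insert_before_target`` is hypothetical, and comparing the ``--launcher`` value against each argument with ``re.search`` is an assumption.

```python
import re
from typing import List


def insert_before_target(command: List[str], launcher_pattern: str,
                         wrapper: List[str]) -> List[str]:
    """Return a copy of `command` with `wrapper` inserted before the first
    argument matching `launcher_pattern` (hypothetical helper)."""
    regex = re.compile(launcher_pattern)
    for index, argument in enumerate(command):
        if regex.search(argument):
            return command[:index] + wrapper + command[index:]
    # No argument matched: return the command unchanged.
    return list(command)


rewritten = insert_before_target(
    ["mpirun", "-n", "2", "foo"], "foo", ["rocprof-sys-causal", "--"]
)
print(" ".join(rewritten))
```

In this sketch the wrapper lands immediately before ``foo``, which mirrors the ``mpirun -n 2 rocprof-sys-causal -- foo`` form shown above.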
|
||||
|
||||
Visualizing the causal output
|
||||
-------------------------------------------------------------------------
|
||||
|
||||
Omnitrace generates ``causal/experiments.json`` and ``causal/experiments.coz`` in
|
||||
``${OMNITRACE_OUTPUT_PATH}/${OMNITRACE_OUTPUT_PREFIX}``. Visit
|
||||
ROCm Systems Profiler generates ``causal/experiments.json`` and ``causal/experiments.coz`` in
|
||||
``${ROCPROFSYS_OUTPUT_PATH}/${ROCPROFSYS_OUTPUT_PREFIX}``. Visit
|
||||
`plasma-umass.org/coz <https://plasma-umass.org/coz/>`_ to open the ``*.coz`` file.
|
||||
|
||||
Omnitrace versus Coz
|
||||
ROCm Systems Profiler versus Coz
|
||||
=======================================
|
||||
|
||||
This comparison is intended for readers who are familiar with the
|
||||
This comparison is intended for readers who are familiar with the
|
||||
`Coz profiler <https://github.com/plasma-umass/coz>`_.
|
||||
Omnitrace provides several additional features and utilities for causal profiling:
|
||||
ROCm Systems Profiler provides several additional features and utilities for causal profiling:
|
||||
|
||||
.. csv-table::
|
||||
:header: "Feature", "Coz", "Omnitrace", "Notes"
|
||||
.. csv-table::
|
||||
:header: "Feature", "Coz", "ROCm Systems Profiler", "Notes"
|
||||
:widths: 20, 60, 60, 30
|
||||
|
||||
"Debug info", "requires debug info in DWARF v3 format (``-gdwarf-3``)", "optional, supports any DWARF format version", "See Note #1 below"
|
||||
@@ -608,23 +604,23 @@ Omnitrace provides several additional features and utilities for causal profilin
|
||||
|
||||
.. note::
|
||||
|
||||
#. Omnitrace supports a "function" mode which does not require debug info.
|
||||
#. Omnitrace supports selecting an entire range of instruction pointers for a function instead
|
||||
#. ROCm Systems Profiler supports a "function" mode which does not require debug info.
|
||||
#. ROCm Systems Profiler supports selecting an entire range of instruction pointers for a function instead
|
||||
of an instruction pointer for one line. In large code bases, "function" mode
|
||||
can resolve in fewer iterations. After a target function is identified, you can
|
||||
can resolve in fewer iterations. After a target function is identified, you can
|
||||
switch to line mode and limit the function scope to the target function.
|
||||
#. Omnitrace supports randomly sampling from subsets, e.g. { 0, 0, 5, 10 }
|
||||
#. ROCm Systems Profiler supports randomly sampling from subsets, e.g. { 0, 0, 5, 10 }
|
||||
where 0% is randomly selected 50% of the time and 5% and 10% are each randomly selected 25% of the time.
|
||||
#. Omnitrace and COZ have the same definition for binary scope, which is the binaries
|
||||
#. ROCm Systems Profiler and COZ have the same definition for binary scope, which is the binaries
|
||||
loaded at runtime (the executable and linked libraries).
|
||||
#. Omnitrace "source scope" supports both ``<file>`` and ``<file>:<line>`` formats
|
||||
#. ROCm Systems Profiler "source scope" supports both ``<file>`` and ``<file>:<line>`` formats
|
||||
in contrast to the COZ "source scope" which requires ``<file>:<line>`` format.
|
||||
#. Omnitrace supports a "function" scope which narrows the function and lines
|
||||
#. ROCm Systems Profiler supports a "function" scope which narrows the function and lines
|
||||
which are eligible for causal experiments to those within the matching functions.
|
||||
#. Omnitrace supports a second filter on scopes for removing binary/source/function
|
||||
#. ROCm Systems Profiler supports a second filter on scopes for removing binary/source/function
|
||||
caught by an inclusive match. For example ``BINARY_SCOPE=.*`` and ``BINARY_EXCLUDE=libmpi.*``
|
||||
initially includes all binaries, but the exclude regex then removes the MPI libraries.
|
||||
#. In Omnitrace, the Linux Perf backend is preferred over use libunwind. However,
|
||||
#. In ROCm Systems Profiler, the Linux Perf backend is preferred over using libunwind. However,
|
||||
Linux Perf usage can be restricted for security reasons.
|
||||
Omnitrace falls back to using a second POSIX timer and libunwind if
|
||||
ROCm Systems Profiler falls back to using a second POSIX timer and libunwind if
|
||||
Linux Perf is not available.
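
Two of the notes above lend themselves to a short illustration: sampling from a speedup subset with repeated entries, and the include-then-exclude scope filter. The Python sketch below is a hedged illustration, not the profiler's code. The function names are hypothetical, a uniform draw over the listed entries is assumed (it reproduces the percentages quoted in the note), and the binary names are placeholders.

```python
import re
from collections import Counter
from fractions import Fraction


def selection_probabilities(speedups):
    """Probability of each speedup when one list entry is drawn uniformly."""
    counts = Counter(speedups)
    total = len(speedups)
    return {value: Fraction(count, total) for value, count in counts.items()}


def apply_scope(names, include, exclude=None):
    """Keep names matching `include`, then drop those matching `exclude`."""
    kept = [name for name in names if re.search(include, name)]
    if exclude is not None:
        kept = [name for name in kept if not re.search(exclude, name)]
    return kept


# Repeating 0 doubles its weight relative to 5 and 10.
probabilities = selection_probabilities([0, 0, 5, 10])

# Inclusive match on everything, then remove the MPI libraries.
binaries = apply_scope(["foo", "libfoo.so", "libmpi.so.40"], ".*", "libmpi.*")
```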
|
||||
|
||||
@@ -1,80 +1,80 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
:description: ROCm Systems Profiler Python profiling documentation and reference
|
||||
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, Python, profiling Python, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
****************************************************
|
||||
Profiling Python scripts
|
||||
****************************************************
|
||||
|
||||
`Omnitrace <https://github.com/ROCm/omnitrace>`_ supports profiling Python code at the
|
||||
`ROCm Systems Profiler <https://github.com/ROCm/rocprofiler-systems>`_ supports profiling Python code at the
|
||||
source level and the script level.
|
||||
Python support is enabled via the ``OMNITRACE_USE_PYTHON`` and the
|
||||
``OMNITRACE_PYTHON_VERSIONS="<MAJOR>.<MINOR>`` CMake options.
|
||||
Alternatively, to build multiple Python versions, use
|
||||
``OMNITRACE_PYTHON_VERSIONS="<MAJOR>.<MINOR>;[<MAJOR>.<MINOR>]"``,
|
||||
and ``OMNITRACE_PYTHON_ROOT_DIRS="/path/to/version;[/path/to/version]"`` instead of ``OMNITRACE_PYTHON_VERSION``.
|
||||
When building multiple Python versions, the length of the ``OMNITRACE_PYTHON_VERSIONS``
|
||||
and ``OMNITRACE_PYTHON_ROOT_DIRS`` lists must
|
||||
Python support is enabled via the ``ROCPROFSYS_USE_PYTHON`` and the
|
||||
``ROCPROFSYS_PYTHON_VERSIONS="<MAJOR>.<MINOR>"`` CMake options.
|
||||
Alternatively, to build multiple Python versions, use
|
||||
``ROCPROFSYS_PYTHON_VERSIONS="<MAJOR>.<MINOR>;[<MAJOR>.<MINOR>]"``,
|
||||
and ``ROCPROFSYS_PYTHON_ROOT_DIRS="/path/to/version;[/path/to/version]"`` instead of ``ROCPROFSYS_PYTHON_VERSION``.
|
||||
When building multiple Python versions, the length of the ``ROCPROFSYS_PYTHON_VERSIONS``
|
||||
and ``ROCPROFSYS_PYTHON_ROOT_DIRS`` lists must
|
||||
be the same size.
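
The equal-length requirement can be checked before configuring a build. The sketch below is only an illustration in Python of that constraint and is not part of the build system. The helper name and the example root directories are hypothetical, and pairing entries by position is an assumption.

```python
def pair_versions_with_roots(versions: str, root_dirs: str):
    """Split two semicolon-separated lists and pair them by position."""
    version_list = [entry for entry in versions.split(";") if entry]
    root_list = [entry for entry in root_dirs.split(";") if entry]
    if len(version_list) != len(root_list):
        raise ValueError(
            f"{len(version_list)} versions but {len(root_list)} root directories"
        )
    return list(zip(version_list, root_list))


# Placeholder paths, used only to show the expected shape of the lists.
pairs = pair_versions_with_roots("3.8;3.10", "/opt/python/3.8;/opt/python/3.10")
```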
|
||||
|
||||
.. note::
|
||||
|
||||
When using Omnitrace with Python programs, the Python interpreter major and minor version (e.g. 3.7)
|
||||
When using ROCm Systems Profiler with Python programs, the Python interpreter major and minor version (e.g. 3.7)
|
||||
must match the interpreter major and minor version
|
||||
used when compiling the Python bindings. When building Omnitrace,
|
||||
the shared object file ``libpyomnitrace.<IMPL>-<VERSION>-<ARCH>-<OS>-<ABI>.so`` is generated
|
||||
where ``IMPL`` is the Python implementation, ``VERSION`` is the major and minor
|
||||
used when compiling the Python bindings. When building ROCm Systems Profiler,
|
||||
the shared object file ``libpyrocprofsys.<IMPL>-<VERSION>-<ARCH>-<OS>-<ABI>.so`` is generated
|
||||
where ``IMPL`` is the Python implementation, ``VERSION`` is the major and minor
|
||||
version, ``ARCH`` is the architecture,
|
||||
``OS`` is the operating system, and ``ABI`` is the application binary interface,
|
||||
for example, ``libpyomnitrace.cpython-38-x86_64-linux-gnu.so``.
|
||||
``OS`` is the operating system, and ``ABI`` is the application binary interface,
|
||||
for example, ``libpyrocprofsys.cpython-38-x86_64-linux-gnu.so``.
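
To check whether a given bindings file matches the running interpreter, the file name can be split into the fields described above. This is a minimal sketch under the assumption that the name always follows the ``<IMPL>-<VERSION>-<ARCH>-<OS>-<ABI>`` pattern. The helper names are hypothetical.

```python
import re
import sys

_BINDING_NAME = re.compile(
    r"^libpyrocprofsys\.(?P<impl>[^-]+)-(?P<version>\d+)-"
    r"(?P<arch>.+)-(?P<os>[^-]+)-(?P<abi>[^-]+)\.so$"
)


def parse_binding_name(filename: str) -> dict:
    """Split a bindings file name into IMPL, VERSION, ARCH, OS, and ABI."""
    match = _BINDING_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected bindings file name: {filename}")
    return match.groupdict()


def matches_running_interpreter(filename: str) -> bool:
    """True if IMPL and VERSION agree with the interpreter running this code."""
    fields = parse_binding_name(filename)
    running = f"{sys.version_info.major}{sys.version_info.minor}"
    return fields["impl"] == sys.implementation.name and fields["version"] == running


fields = parse_binding_name("libpyrocprofsys.cpython-38-x86_64-linux-gnu.so")
```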
|
||||
|
||||
Getting Started
|
||||
========================================
|
||||
|
||||
The Omnitrace Python package is installed in ``lib/pythonX.Y/site-packages/omnitrace``.
|
||||
To ensure the Python interpreter can find the Omnitrace package,
|
||||
The ROCm Systems Profiler Python package is installed in ``lib/pythonX.Y/site-packages/rocprofsys``.
|
||||
To ensure the Python interpreter can find the ROCm Systems Profiler package,
|
||||
add this path to the ``PYTHONPATH`` environment variable, as in the following example:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
export PYTHONPATH=/opt/omnitrace/lib/python3.8/site-packages:${PYTHONPATH}
|
||||
export PYTHONPATH=/opt/rocprofiler-systems/lib/python3.8/site-packages:${PYTHONPATH}
|
||||
|
||||
Both the ``share/omnitrace/setup-env.sh`` script and the module file in
|
||||
``share/modulefiles/omnitrace`` automatically handle the prefixing of the ``PYTHONPATH``
|
||||
Both the ``share/rocprofiler-systems/setup-env.sh`` script and the module file in
|
||||
``share/modulefiles/rocprofiler-systems`` automatically handle the prefixing of the ``PYTHONPATH``
|
||||
environment variable.
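
A quick way to confirm that ``PYTHONPATH`` is set correctly is to ask the interpreter whether it can locate the package without importing it. The sketch below uses only the standard library. The helper name ``package_available`` is hypothetical.

```python
import importlib.util


def package_available(name: str) -> bool:
    """Return True if the interpreter can locate the top-level package."""
    return importlib.util.find_spec(name) is not None


if not package_available("rocprofsys"):
    print("rocprofsys not found; check PYTHONPATH")
```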
|
||||
|
||||
Running Omnitrace on a Python script
|
||||
Running ROCm Systems Profiler on a Python script
|
||||
================================================
|
||||
|
||||
Omnitrace provides an ``omnitrace-python`` helper bash script which
|
||||
ROCm Systems Profiler provides a ``rocprof-sys-python`` helper bash script which
|
||||
ensures ``PYTHONPATH`` is properly set and the correct Python interpreter is used.
|
||||
This means the following commands are effectively equivalent:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-python --help
|
||||
rocprof-sys-python --help
|
||||
|
||||
and
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
export PYTHONPATH=/opt/omnitrace/lib/python3.8/site-packages:${PYTHONPATH}
|
||||
python3.8 -m omnitrace --help
|
||||
export PYTHONPATH=/opt/rocprofiler-systems/lib/python3.8/site-packages:${PYTHONPATH}
|
||||
python3.8 -m rocprofsys --help
|
||||
|
||||
.. note::
|
||||
|
||||
``omnitrace-python`` and ``python -m omnitrace`` use the same command-line syntax
|
||||
as the other ``omnitrace`` executables (``omnitrace-python <OMNITRACE_ARGS> -- <SCRIPT> <SCRIPT_ARGS>``)
|
||||
``rocprof-sys-python`` and ``python -m rocprofsys`` use the same command-line syntax
|
||||
as the other ``rocprof-sys`` executables (``rocprof-sys-python <ROCPROFSYS_ARGS> -- <SCRIPT> <SCRIPT_ARGS>``)
|
||||
and have similar options.
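
The ``<ARGS> -- <SCRIPT> <SCRIPT_ARGS>`` convention can be sketched as a split at the first ``--``. This is an illustration of the syntax only, not the tool's argument parser. The helper name is hypothetical, the trailing ``20`` is a placeholder script argument, and how the real tools treat a command line without ``--`` is not modelled here.

```python
def split_command(argv):
    """Split an argument list at the first "--" separator."""
    if "--" not in argv:
        # No explicit separator: treat everything as profiler arguments.
        return list(argv), []
    index = argv.index("--")
    return argv[:index], argv[index + 1:]


profiler_args, script_command = split_command(["-b", "--", "./example.py", "20"])
```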
|
||||
|
||||
Command line options
|
||||
-----------------------------------
|
||||
|
||||
Use ``omnitrace-python --help`` to view the available options:
|
||||
Use ``rocprof-sys-python --help`` to view the available options:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
usage: omnitrace [-h] [-v VERBOSITY] [-b] [-c FILE] [-s FILE] [-F [BOOL]] [--label [{args,file,line} [{args,file,line} ...]]] [-I FUNC [FUNC ...]] [-E FUNC [FUNC ...]] [-R FUNC [FUNC ...]] [-MI FILE [FILE ...]] [-ME FILE [FILE ...]] [-MR FILE [FILE ...]] [--trace-c [BOOL]]
|
||||
usage: rocprof-sys [-h] [-v VERBOSITY] [-b] [-c FILE] [-s FILE] [-F [BOOL]] [--label [{args,file,line} [{args,file,line} ...]]] [-I FUNC [FUNC ...]] [-E FUNC [FUNC ...]] [-R FUNC [FUNC ...]] [-MI FILE [FILE ...]] [-ME FILE [FILE ...]] [-MR FILE [FILE ...]] [--trace-c [BOOL]]
|
||||
|
||||
optional arguments:
|
||||
-h, --help show this help message and exit
|
||||
@@ -82,7 +82,7 @@ Use ``omnitrace-python --help`` to view the available options:
|
||||
Logging verbosity
|
||||
-b, --builtin Put 'profile' in the builtins. Use '@profile' to decorate a single function, or 'with profile:' to profile a single section of code.
|
||||
-c FILE, --config FILE
|
||||
OmniTrace configuration file
|
||||
ROCm Systems Profiler configuration file
|
||||
-s FILE, --setup FILE
|
||||
Code to execute before the code to profile
|
||||
-F [BOOL], --full-filepath [BOOL]
|
||||
@@ -103,19 +103,19 @@ Use ``omnitrace-python --help`` to view the available options:
|
||||
Select only entries from these files
|
||||
--trace-c [BOOL] Enable profiling C functions
|
||||
|
||||
usage: python3 -m omnitrace <OMNITRACE_ARGS> -- <SCRIPT> <SCRIPT_ARGS>
|
||||
usage: python3 -m rocprofsys <ROCPROFSYS_ARGS> -- <SCRIPT> <SCRIPT_ARGS>
|
||||
|
||||
.. note::
|
||||
|
||||
The ``--trace-c`` option does not incorporate Omnitrace's dynamic instrumentation support.
|
||||
The ``--trace-c`` option does not incorporate ROCm Systems Profiler's dynamic instrumentation support.
|
||||
It only enables profiling the underlying C function call within the Python interpreter.
|
||||
|
||||
Selective instrumentation
|
||||
-----------------------------------
|
||||
|
||||
Similar to the ``omnitrace-instrument`` executable, command-line options exist for restricting,
|
||||
Similar to the ``rocprof-sys-instrument`` executable, command-line options exist for restricting,
|
||||
including, and excluding certain functions and modules, for example, ``--function-exclude "^__init__$"``.
|
||||
Alternatively, add the ``@profile`` decorator to the primary function of interest
|
||||
Alternatively, add the ``@profile`` decorator to the primary function of interest
|
||||
in your program and use the ``-b`` / ``--builtin`` command-line option to narrow the scope of the
|
||||
instrumentation to this function and its children.
|
||||
|
||||
@@ -145,8 +145,8 @@ Consider the following Python code (``example.py``):
|
||||
if __name__ == "__main__":
|
||||
run(20)
|
||||
|
||||
Running ``omnitrace-python ./example.py`` with ``OMNITRACE_PROFILE=ON`` and
|
||||
``OMNITRACE_TIMEMORY_COMPONENTS=trip_count`` produces the following:
|
||||
Running ``rocprof-sys-python ./example.py`` with ``ROCPROFSYS_PROFILE=ON`` and
|
||||
``ROCPROFSYS_TIMEMORY_COMPONENTS=trip_count`` produces the following:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
@@ -187,7 +187,7 @@ If the ``inefficient`` function is decorated with ``@profile`` as follows:
|
||||
def inefficient(n):
|
||||
# ...
|
||||
|
||||
And then run using the command ``omnitrace-python -b -- ./example.py``, Omnitrace produces this output:
|
||||
And then run using the command ``rocprof-sys-python -b -- ./example.py``, ROCm Systems Profiler produces this output:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
@@ -199,37 +199,37 @@ And then run using the command ``omnitrace-python -b -- ./example.py``, Omnitrac
|
||||
| |0>>> inefficient | 1 | 0 | trip_count | 1 |
|
||||
|-----------------------------------------------------------|
|
||||
|
||||
Omnitrace Python source instrumentation
|
||||
ROCm Systems Profiler Python source instrumentation
|
||||
====================================================
|
||||
|
||||
Starting with the unmodified ``example.py`` script above, import the ``omnitrace`` module:
|
||||
Starting with the unmodified ``example.py`` script above, import the ``rocprofsys`` module:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
import sys
|
||||
import omnitrace # import omnitrace
|
||||
import rocprofsys # import rocprofsys
|
||||
|
||||
def fib(n):
|
||||
# ... etc. ...
|
||||
|
||||
Next, add ``@omnitrace.profile()`` to the ``run`` function:
|
||||
Next, add ``@rocprofsys.profile()`` to the ``run`` function:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
@omnitrace.profile()
|
||||
@rocprofsys.profile()
|
||||
def run(n):
|
||||
# ...
|
||||
|
||||
Alternatively, use ``omnitrace.profile()`` as a context-manager around ``run(20)``:
|
||||
Alternatively, use ``rocprofsys.profile()`` as a context-manager around ``run(20)``:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
if __name__ == "__main__":
|
||||
with omnitrace.profile():
|
||||
with rocprofsys.profile():
|
||||
run(20)
|
||||
|
||||
The results for both of the source-level instrumentation modes are identical to the
|
||||
original ``omnitrace-python ./example.py`` results:
|
||||
The results for both of the source-level instrumentation modes are identical to the
|
||||
original ``rocprof-sys-python ./example.py`` results:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
@@ -264,14 +264,14 @@ original ``omnitrace-python ./example.py`` results:
|
||||
|
||||
.. note::
|
||||
|
||||
When ``omnitrace-python`` is used without built-ins, the profiling results can be cluttered by the
|
||||
When ``rocprof-sys-python`` is used without built-ins, the profiling results can be cluttered by the
|
||||
numerous functions called when more complex modules are imported, such as ``import numpy``.
|
||||
|
||||
Omnitrace Python source instrumentation configuration
|
||||
ROCm Systems Profiler Python source instrumentation configuration
|
||||
------------------------------------------------------------------
|
||||
|
||||
Within the Python source code, the profiler can be configured by directly
|
||||
modifying the ``omnitrace.profiler.config`` data fields.
|
||||
Within the Python source code, the profiler can be configured by directly
|
||||
modifying the ``rocprofsys.profiler.config`` data fields.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
@@ -295,8 +295,8 @@ modifying the ``omnitrace.profiler.config`` data fields.
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
from omnitrace.profiler import config
|
||||
from omnitrace import profile
|
||||
from rocprofsys.profiler import config
|
||||
from rocprofsys import profile
|
||||
|
||||
config.include_args = True
|
||||
config.include_filename = False
|
||||
|
||||
@@ -1,77 +1,77 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
:description: ROCm Systems Profiler call stack sampling documentation and reference
|
||||
:keywords: rocprofiler-systems, rocprofsys, ROCm, profiler, sampling, call stack, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
****************************************************
|
||||
Sampling the call stack
|
||||
****************************************************
|
||||
|
||||
`Omnitrace <https://github.com/ROCm/omnitrace>`_ can use call-stack sampling
|
||||
on a binary instrumented with either the ``omnitrace`` executable
|
||||
or the ``omnitrace-sample`` executable.
|
||||
`ROCm Systems Profiler <https://github.com/ROCm/rocprofiler-systems>`_ can use call-stack sampling
|
||||
on a binary instrumented with either the ``rocprof-sys`` executable
|
||||
or the ``rocprof-sys-sample`` executable.
|
||||
For example, all of the following commands are effectively equivalent:
|
||||
|
||||
* Binary rewrite with only the instrumentation necessary to start and stop sampling
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-instrument -M sampling -o foo.inst -- foo
|
||||
omnitrace-run -- ./foo.inst
|
||||
rocprof-sys-instrument -M sampling -o foo.inst -- foo
|
||||
rocprof-sys-run -- ./foo.inst
|
||||
|
||||
* Runtime instrumentation with only the instrumentation necessary to start and stop sampling
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-instrument -M sampling -- foo
|
||||
rocprof-sys-instrument -M sampling -- foo
|
||||
|
||||
* No instrumentation required
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-sample -- foo
|
||||
rocprof-sys-sample -- foo
|
||||
|
||||
.. note::
|
||||
|
||||
Set ``OMNITRACE_USE_SAMPLING=ON`` to activate call-stack sampling when executing an instrumented binary.
|
||||
Set ``ROCPROFSYS_USE_SAMPLING=ON`` to activate call-stack sampling when executing an instrumented binary.
|
||||
|
||||
All ``omnitrace-instrument -M sampling`` (subsequently referred to as "instrumented-sampling")
|
||||
All ``rocprof-sys-instrument -M sampling`` (subsequently referred to as "instrumented-sampling")
|
||||
does is wrap the ``main`` of the executable with initialization
|
||||
before ``main`` starts and finalization after ``main`` ends.
|
||||
This can be accomplished without instrumentation through a ``LD_PRELOAD``
|
||||
This can be accomplished without instrumentation through a ``LD_PRELOAD``
|
||||
of a library containing a dynamic symbol wrapper around ``__libc_start_main``.
|
||||
|
||||
The use of ``omnitrace-sample`` is **recommended** over
|
||||
``omnitrace-instrument -M sampling`` when binary instrumentation
|
||||
The use of ``rocprof-sys-sample`` is **recommended** over
|
||||
``rocprof-sys-instrument -M sampling`` when binary instrumentation
|
||||
is not necessary. This is for a number of reasons:
|
||||
|
||||
* ``omnitrace-sample`` provides command-line options for controlling the Omnitrace feature set instead of
|
||||
* ``rocprof-sys-sample`` provides command-line options for controlling the ROCm Systems Profiler feature set instead of
|
||||
requiring configuration files or environment variables
|
||||
* Despite the fact that instrumented-sampling only requires inserting snippets
|
||||
* Despite the fact that instrumented-sampling only requires inserting snippets
|
||||
around one function (``main``), Dyninst
|
||||
does not have a feature for specifying that parsing and processing all the
|
||||
does not have a feature for specifying that parsing and processing all the
|
||||
other symbols in the binary is unnecessary.
|
||||
In the best-case scenario when the target binary is relatively small,
|
||||
In the best-case scenario when the target binary is relatively small,
|
||||
instrumented-sampling has a slightly slower launch time,
|
||||
but in worst-case scenarios it requires a significant amount of time and memory to launch.
|
||||
* ``omnitrace-sample`` is fully compatible with MPI. For example,
|
||||
the command ``mpirun -n 2 omnitrace-sample -- foo`` is valid,
|
||||
whereas ``mpirun -n 2 omnitrace-instrument -M sampling -- foo``
|
||||
* ``rocprof-sys-sample`` is fully compatible with MPI. For example,
|
||||
the command ``mpirun -n 2 rocprof-sys-sample -- foo`` is valid,
|
||||
whereas ``mpirun -n 2 rocprof-sys-instrument -M sampling -- foo``
|
||||
is incompatible with some MPI distributions (particularly OpenMPI). This is because
|
||||
MPI prohibits forking within an MPI rank.
|
||||
|
||||
* When MPI and binary instrumentation are both involved, two steps are required:
|
||||
performing a binary rewrite of the executable and then using the instrumented executable
|
||||
in lieu of the original executable. ``omnitrace-sample`` is therefore much easier to use with MPI.
|
||||
performing a binary rewrite of the executable and then using the instrumented executable
|
||||
in lieu of the original executable. ``rocprof-sys-sample`` is therefore much easier to use with MPI.
|
||||
|
||||
The omnitrace-sample executable
|
||||
The rocprof-sys-sample executable
|
||||
========================================
|
||||
|
||||
View the help menu of ``omnitrace-sample`` with the ``-h`` / ``--help`` option:
|
||||
View the help menu of ``rocprof-sys-sample`` with the ``-h`` / ``--help`` option:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omnitrace-sample --help
|
||||
[omnitrace-sample] Usage: omnitrace-sample [ --help (count: 0, dtype: bool)
|
||||
$ rocprof-sys-sample --help
|
||||
[rocprof-sys-sample] Usage: rocprof-sys-sample [ --help (count: 0, dtype: bool)
|
||||
--version (count: 0, dtype: bool)
|
||||
--monochrome (max: 1, dtype: bool)
|
||||
--debug (max: 1, dtype: bool)
|
||||
@@ -111,47 +111,47 @@ View the help menu of ``omnitrace-sample`` with the ``-h`` / ``--help`` option:
|
||||
--gpu-events (count: unlimited)
|
||||
--inlines (max: 1, dtype: bool)
|
||||
--hsa-interrupt (count: 1, dtype: int)
|
||||
]
|
||||
]
|
||||
Options:
|
||||
-h, -?, --help Shows this page (count: 0, dtype: bool)
|
||||
--version Prints the version and exit (count: 0, dtype: bool)
|
||||
|
||||
[DEBUG OPTIONS]
|
||||
|
||||
--monochrome Disable colorized output (max: 1, dtype: bool)
|
||||
--debug Debug output (max: 1, dtype: bool)
|
||||
-v, --verbose Verbose output (count: 1)
|
||||
|
||||
[GENERAL OPTIONS] These are options which are ubiquitously applied
|
||||
|
||||
-c, --config Configuration file (min: 0, dtype: filepath)
|
||||
-o, --output Output path. Accepts 1-2 parameters corresponding to the output path and the output prefix (min: 1)
|
||||
-T, --trace Generate a detailed trace (perfetto output) (max: 1, dtype: bool)
|
||||
-P, --profile Generate a call-stack-based profile (conflicts with --flat-profile) (max: 1, dtype: bool)
|
||||
-F, --flat-profile Generate a flat profile (conflicts with --profile) (max: 1, dtype: bool)
|
||||
-H, --host Enable sampling host-based metrics for the process. E.g. CPU frequency, memory usage, etc. (max: 1, dtype: bool)
|
||||
-D, --device Enable sampling device-based metrics for the process. E.g. GPU temperature, memory usage, etc. (max: 1, dtype: bool)
|
||||
-w, --wait This option is a combination of '--trace-wait' and '--sampling-wait'. See the descriptions for those two options.
|
||||
(count: 1)
|
||||
-d, --duration This option is a combination of '--trace-duration' and '--sampling-duration'. See the descriptions for those two
|
||||
options. (count: 1)
|
||||
|
||||
[TRACING OPTIONS] Specific options controlling tracing (i.e. deterministic measurements of every event)
|
||||
|
||||
--trace-file Specify the trace output filename. Relative filepath will be with respect to output path and output prefix. (count: 1,
|
||||
dtype: filepath)
|
||||
--trace-buffer-size Size limit for the trace output (in KB) (count: 1, dtype: KB)
|
||||
-h, -?, --help Shows this page (count: 0, dtype: bool)
|
||||
--version Prints the version and exit (count: 0, dtype: bool)
|
||||
|
||||
[DEBUG OPTIONS]
|
||||
|
||||
--monochrome Disable colorized output (max: 1, dtype: bool)
|
||||
--debug Debug output (max: 1, dtype: bool)
|
||||
-v, --verbose Verbose output (count: 1)
|
||||
|
||||
[GENERAL OPTIONS] These are options which are ubiquitously applied
|
||||
|
||||
-c, --config Configuration file (min: 0, dtype: filepath)
|
||||
-o, --output Output path. Accepts 1-2 parameters corresponding to the output path and the output prefix (min: 1)
|
||||
-T, --trace Generate a detailed trace (perfetto output) (max: 1, dtype: bool)
|
||||
-P, --profile Generate a call-stack-based profile (conflicts with --flat-profile) (max: 1, dtype: bool)
|
||||
-F, --flat-profile Generate a flat profile (conflicts with --profile) (max: 1, dtype: bool)
|
||||
-H, --host Enable sampling host-based metrics for the process. E.g. CPU frequency, memory usage, etc. (max: 1, dtype: bool)
|
||||
-D, --device Enable sampling device-based metrics for the process. E.g. GPU temperature, memory usage, etc. (max: 1, dtype: bool)
|
||||
-w, --wait This option is a combination of '--trace-wait' and '--sampling-wait'. See the descriptions for those two options.
|
||||
(count: 1)
|
||||
-d, --duration This option is a combination of '--trace-duration' and '--sampling-duration'. See the descriptions for those two
|
||||
options. (count: 1)
|
||||
|
||||
[TRACING OPTIONS] Specific options controlling tracing (i.e. deterministic measurements of every event)
|
||||
|
||||
--trace-file Specify the trace output filename. Relative filepath will be with respect to output path and output prefix. (count: 1,
|
||||
dtype: filepath)
|
||||
--trace-buffer-size Size limit for the trace output (in KB) (count: 1, dtype: KB)
|
||||
--trace-fill-policy [ discard | ring_buffer ]
|
||||
|
||||
|
||||
Policy for new data when the buffer size limit is reached:
|
||||
- discard : new data is ignored
|
||||
- ring_buffer : new data overwrites oldest data (count: 1)
|
||||
--trace-wait Set the wait time (in seconds) before collecting trace and/or profiling data(in seconds). By default, the duration is
|
||||
in seconds of realtime but that can changed via --trace-clock-id. (count: 1)
|
||||
--trace-duration Set the duration of the trace and/or profile data collection (in seconds). By default, the duration is in seconds of
|
||||
realtime but that can changed via --trace-clock-id. (count: 1)
|
||||
--trace-periods More powerful version of specifying trace delay and/or duration. Format is one or more groups of: <DELAY>:<DURATION>,
|
||||
<DELAY>:<DURATION>:<REPEAT>, and/or <DELAY>:<DURATION>:<REPEAT>:<CLOCK_ID>. (min: 1)
|
||||
--trace-wait Set the wait time (in seconds) before collecting trace and/or profiling data(in seconds). By default, the duration is
|
||||
in seconds of realtime but that can changed via --trace-clock-id. (count: 1)
|
||||
--trace-duration Set the duration of the trace and/or profile data collection (in seconds). By default, the duration is in seconds of
|
||||
realtime but that can changed via --trace-clock-id. (count: 1)
|
||||
--trace-periods More powerful version of specifying trace delay and/or duration. Format is one or more groups of: <DELAY>:<DURATION>,
|
||||
<DELAY>:<DURATION>:<REPEAT>, and/or <DELAY>:<DURATION>:<REPEAT>:<CLOCK_ID>. (min: 1)
|
||||
--trace-clock-id [ 0 (realtime|CLOCK_REALTIME)
|
||||
1 (monotonic|CLOCK_MONOTONIC)
|
||||
2 (cputime|CLOCK_PROCESS_CPUTIME_ID)
|
||||
@@ -159,40 +159,40 @@ View the help menu of ``omnitrace-sample`` with the ``-h`` / ``--help`` option:
|
||||
5 (realtime_coarse|CLOCK_REALTIME_COARSE)
|
||||
6 (monotonic_coarse|CLOCK_MONOTONIC_COARSE)
|
||||
7 (boottime|CLOCK_BOOTTIME) ]
|
||||
Set the default clock ID for for trace delay/duration. Note: "cputime" is the *process* CPU time and might need to be
|
||||
scaled based on the number of threads, i.e. 4 seconds of CPU-time for an application with 4 fully active threads would
|
||||
equate to ~1 second of realtime. If this proves to be difficult to handle in practice, please file a feature request
|
||||
for omnitrace to auto-scale based on the number of threads. (count: 1)
|
||||
|
||||
[PROFILE OPTIONS] Specific options controlling profiling (i.e. deterministic measurements which are aggregated into a summary)
|
||||
|
||||
Set the default clock ID for for trace delay/duration. Note: "cputime" is the *process* CPU time and might need to be
|
||||
scaled based on the number of threads, i.e. 4 seconds of CPU-time for an application with 4 fully active threads would
|
||||
equate to ~1 second of realtime. If this proves to be difficult to handle in practice, please file a feature request
|
||||
for rocprof-sys to auto-scale based on the number of threads. (count: 1)
|
||||
|
||||
[PROFILE OPTIONS] Specific options controlling profiling (i.e. deterministic measurements which are aggregated into a summary)
|
||||
|
||||
--profile-format [ console | json | text ]
|
||||
Data formats for profiling results (min: 1)
|
||||
--profile-diff Generate a diff output b/t the profile collected and an existing profile from another run Accepts 1-2 parameters
|
||||
corresponding to the input path and the input prefix (min: 1)
|
||||
|
||||
Data formats for profiling results (min: 1)
|
||||
--profile-diff Generate a diff output b/t the profile collected and an existing profile from another run Accepts 1-2 parameters
|
||||
corresponding to the input path and the input prefix (min: 1)
|
||||
|
||||
[HOST/DEVICE (PROCESS SAMPLING) OPTIONS]
|
||||
Process sampling is background measurements for resources available to the entire process. These samples are not tied
|
||||
to specific lines/regions of code
|
||||
|
||||
--process-freq Set the default host/device sampling frequency (number of interrupts per second) (count: 1)
|
||||
--process-wait Set the default wait time (i.e. delay) before taking first host/device sample (in seconds of realtime) (count: 1)
|
||||
--process-duration Set the duration of the host/device sampling (in seconds of realtime) (count: 1)
|
||||
--cpus CPU IDs for frequency sampling. Supports integers and/or ranges (count: unlimited, dtype: int or range)
|
||||
--gpus GPU IDs for SMI queries. Supports integers and/or ranges (count: unlimited, dtype: int or range)
|
||||
|
||||
[GENERAL SAMPLING OPTIONS] General options for timer-based sampling per-thread
|
||||
|
||||
-f, --freq Set the default sampling frequency (number of interrupts per second) (count: 1)
|
||||
--sampling-wait Set the default wait time (i.e. delay) before taking first sample (in seconds). This delay time is based on the clock
|
||||
of the sampler, i.e., a delay of 1 second for CPU-clock sampler may not equal 1 second of realtime (count: 1)
|
||||
--sampling-duration Set the duration of the sampling (in seconds of realtime). I.e., it is possible (currently) to set a CPU-clock time
|
||||
delay that exceeds the real-time duration... resulting in zero samples being taken (count: 1)
|
||||
-t, --tids Specify the default thread IDs for sampling, where 0 (zero) is the main thread and each thread created by the target
|
||||
application is assigned an atomically incrementing value. (min: 1)
|
||||
|
||||
[SAMPLING TIMER OPTIONS] These options determine the heuristic for deciding when to take a sample
|
||||
|
||||
Process sampling is background measurements for resources available to the entire process. These samples are not tied
|
||||
to specific lines/regions of code
|
||||
|
||||
--process-freq Set the default host/device sampling frequency (number of interrupts per second) (count: 1)
|
||||
--process-wait Set the default wait time (i.e. delay) before taking first host/device sample (in seconds of realtime) (count: 1)
|
||||
--process-duration Set the duration of the host/device sampling (in seconds of realtime) (count: 1)
|
||||
--cpus CPU IDs for frequency sampling. Supports integers and/or ranges (count: unlimited, dtype: int or range)
|
||||
--gpus GPU IDs for SMI queries. Supports integers and/or ranges (count: unlimited, dtype: int or range)
|
||||
|
||||
[GENERAL SAMPLING OPTIONS] General options for timer-based sampling per-thread
|
||||
|
||||
-f, --freq Set the default sampling frequency (number of interrupts per second) (count: 1)
|
||||
--sampling-wait Set the default wait time (i.e. delay) before taking first sample (in seconds). This delay time is based on the clock
|
||||
of the sampler, i.e., a delay of 1 second for CPU-clock sampler may not equal 1 second of realtime (count: 1)
|
||||
--sampling-duration Set the duration of the sampling (in seconds of realtime). I.e., it is possible (currently) to set a CPU-clock time
|
||||
delay that exceeds the real-time duration... resulting in zero samples being taken (count: 1)
|
||||
-t, --tids Specify the default thread IDs for sampling, where 0 (zero) is the main thread and each thread created by the target
|
||||
application is assigned an atomically incrementing value. (min: 1)
|
||||
|
||||
[SAMPLING TIMER OPTIONS] These options determine the heuristic for deciding when to take a sample
|
||||
|
||||
--cputime Sample based on a CPU-clock timer (default). Accepts zero or more arguments:
|
||||
0. Enables sampling based on CPU-clock timer.
|
||||
1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of CPU-time.
|
||||
@@ -210,22 +210,22 @@ View the help menu of ``omnitrace-sample`` with the ``-h`` / ``--help`` option:
|
||||
When sampling with a real-clock timer, please note that enabling this will cause threads which are typically "idle"
|
||||
to consume more resources since, while idle, the real-clock time increases (and therefore triggers taking samples)
|
||||
whereas the CPU-clock time does not. (min: 0)
|
||||
|
||||
[BACKEND OPTIONS] These options control region information captured w/o sampling or instrumentation
|
||||
|
||||
|
||||
[BACKEND OPTIONS] These options control region information captured w/o sampling or instrumentation
|
||||
|
||||
-I, --include [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
|
||||
Include data from these backends (count: unlimited)
|
||||
Include data from these backends (count: unlimited)
|
||||
-E, --exclude [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
|
||||
Exclude data from these backends (count: unlimited)
|
||||
|
||||
[HARDWARE COUNTER OPTIONS] See also: omnitrace-avail -H
|
||||
|
||||
-C, --cpu-events Set the CPU hardware counter events to record (ref: `omnitrace-avail -H -c CPU`) (count: unlimited)
|
||||
-G, --gpu-events Set the GPU hardware counter events to record (ref: `omnitrace-avail -H -c GPU`) (count: unlimited)
|
||||
|
||||
[MISCELLANEOUS OPTIONS]
|
||||
|
||||
-i, --inlines Include inline info in output when available (max: 1, dtype: bool)
|
||||
Exclude data from these backends (count: unlimited)
|
||||
|
||||
[HARDWARE COUNTER OPTIONS] See also: rocprof-sys-avail -H
|
||||
|
||||
-C, --cpu-events Set the CPU hardware counter events to record (ref: `rocprof-sys-avail -H -c CPU`) (count: unlimited)
|
||||
-G, --gpu-events Set the GPU hardware counter events to record (ref: `rocprof-sys-avail -H -c GPU`) (count: unlimited)
|
||||
|
||||
[MISCELLANEOUS OPTIONS]
|
||||
|
||||
-i, --inlines Include inline info in output when available (max: 1, dtype: bool)
|
||||
--hsa-interrupt [ 0 | 1 ] Set the value of the HSA_ENABLE_INTERRUPT environment variable.
|
||||
ROCm version 5.2 and older have a bug which will cause a deadlock if a sample is taken while waiting for the signal
|
||||
that a kernel completed -- which happens when sampling with a real-clock timer. We require this option to be set to
|
||||
@@ -235,147 +235,148 @@ View the help menu of ``omnitrace-sample`` with the ``-h`` / ``--help`` option:
|
||||
0 avoid triggering the bug, potentially at the cost of reduced performance
|
||||
1 do not modify how ROCm is notified about kernel completion (count: 1, dtype: int)
|
||||
|
||||
The general syntax for separating Omnitrace command-line arguments from the
|
||||
following application arguments
|
||||
is consistent with the LLVM style of using a stand-alone double hyphen (``--``).
|
||||
The general syntax for separating ROCm Systems Profiler command-line arguments from the
|
||||
following application arguments
|
||||
is consistent with the LLVM style of using a stand-alone double hyphen (``--``).
|
||||
All arguments preceding the double hyphen
|
||||
are interpreted as belonging to Omnitrace and all arguments following it
|
||||
are interpreted as belonging to ROCm Systems Profiler and all arguments following it
|
||||
are interpreted as the
|
||||
application and its arguments. The double hyphen is only necessary when passing
|
||||
application and its arguments. The double hyphen is only necessary when passing
|
||||
command-line arguments to a target
|
||||
which also uses hyphens. For example, you can run ``omnitrace-sample ls``, but
|
||||
to run ``ls -la``, use ``omnitrace-sample -- ls -la``.
|
||||
which also uses hyphens. For example, you can run ``rocprof-sys-sample ls``, but
|
||||
to run ``ls -la``, use ``rocprof-sys-sample -- ls -la``.
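
The separator convention described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the parser that ``rocprof-sys-sample`` uses; the helper name ``split_at_double_hyphen`` is hypothetical, and treating an argument list without a separator as belonging entirely to the application is a simplifying assumption.

```python
def split_at_double_hyphen(argv):
    """Split an argument list at the first stand-alone "--".

    Hypothetical helper for illustration only; it is not part of
    rocprof-sys. Returns (tool_args, app_args).
    """
    if "--" in argv:
        idx = argv.index("--")
        # Everything before "--" goes to the profiler, everything after to the app
        return argv[:idx], argv[idx + 1:]
    # Assumption: with no separator, the whole list is the application command
    return [], list(argv)


if __name__ == "__main__":
    print(split_at_double_hyphen(["-PTDH", "--", "ls", "-la"]))
    print(split_at_double_hyphen(["ls"]))
```

Only the first stand-alone ``--`` is treated as the separator here, so any later ``--`` is passed through to the application unchanged.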
|
||||
|
||||
:doc:`Configuring the Omnitrace runtime options <./configuring-runtime-options>`
|
||||
establishes the precedence of environment variable values over values specified
|
||||
:doc:`Configuring the ROCm Systems Profiler runtime options <./configuring-runtime-options>`
|
||||
establishes the precedence of environment variable values over values specified
|
||||
in the configuration files. This enables
|
||||
you to configure the Omnitrace runtime to your preferred default behavior
|
||||
in a file such as ``~/.omnitrace.cfg`` and then easily override
|
||||
those settings in the command line, for example, ``OMNITRACE_ENABLED=OFF omnitrace-sample -- foo``.
|
||||
Similarly, the command-line arguments passed to ``omnitrace-sample`` take precedence
|
||||
you to configure the ROCm Systems Profiler runtime to your preferred default behavior
|
||||
in a file such as ``~/.rocprof-sys.cfg`` and then easily override
|
||||
those settings in the command line, for example, ``ROCPROFSYS_ENABLED=OFF rocprof-sys-sample -- foo``.
|
||||
Similarly, the command-line arguments passed to ``rocprof-sys-sample`` take precedence
|
||||
over environment variables.
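
The precedence order described above (command line over environment variables, environment variables over configuration files) can be modeled as a simple lookup. The sketch below is a conceptual illustration under that stated ordering, not the tool's implementation; ``resolve_setting`` and the dictionary inputs are hypothetical.

```python
def resolve_setting(name, config_file, environment, command_line):
    """Return the effective value of a setting.

    Hypothetical model of the documented precedence:
    command line > environment variable > configuration file.
    Each source is a plain dict of setting name to value.
    """
    for source in (command_line, environment, config_file):
        if name in source:
            return source[name]
    return None  # Setting not specified anywhere


if __name__ == "__main__":
    config_file = {"ROCPROFSYS_ENABLED": "ON"}   # e.g. from ~/.rocprof-sys.cfg
    environment = {"ROCPROFSYS_ENABLED": "OFF"}  # set on the command line prefix
    print(resolve_setting("ROCPROFSYS_ENABLED", config_file, environment, {}))
```

In this model the environment value ``OFF`` wins over the configuration file value ``ON``, which mirrors the ``ROCPROFSYS_ENABLED=OFF rocprof-sys-sample -- foo`` example.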
|
||||
|
||||
All of the command-line options above correlate to one or more configuration
|
||||
settings, for example, ``--cpu-events`` correlates to the ``OMNITRACE_PAPI_EVENTS`` configuration variable.
|
||||
``omnitrace-sample`` processes the arguments and outputs a summary of its configuration
|
||||
before running the target application.
|
||||
All of the command-line options above correlate to one or more configuration
|
||||
settings, for example, ``--cpu-events`` correlates to the ``ROCPROFSYS_PAPI_EVENTS`` configuration variable.
|
||||
``rocprof-sys-sample`` processes the arguments and outputs a summary of its configuration
|
||||
before running the target application.
|
||||
|
||||
The following snippets show how ``omnitrace-sample`` runs with various environment updates.
|
||||
The following snippets show how ``rocprof-sys-sample`` runs with various environment updates.
|
||||
|
||||
* This snippet shows the environment updates when ``omnitrace-sample`` is invoked with no arguments:
|
||||
* This snippet shows the environment updates when ``rocprof-sys-sample`` is invoked with no arguments:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omnitrace-sample -- ./parallel-overhead-locks 30 4 100
|
||||
$ rocprof-sys-sample -- ./parallel-overhead-locks 30 4 100
|
||||
|
||||
HSA_TOOLS_LIB=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
|
||||
HSA_TOOLS_LIB=/opt/rocprofiler-systems/lib/librocprof-sys-dl.so.1.7.1
|
||||
HSA_TOOLS_REPORT_LOAD_FAILURE=1
|
||||
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
|
||||
OMNITRACE_USE_PROCESS_SAMPLING=false
|
||||
OMNITRACE_USE_SAMPLING=true
|
||||
OMP_TOOL_LIBRARIES=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
|
||||
ROCP_TOOL_LIB=/opt/omnitrace/lib/libomnitrace.so.1.7.1
|
||||
LD_PRELOAD=/opt/rocprofiler-systems/lib/librocprof-sys-dl.so.1.7.1
|
||||
ROCPROFSYS_USE_PROCESS_SAMPLING=false
|
||||
ROCPROFSYS_USE_SAMPLING=true
|
||||
OMP_TOOL_LIBRARIES=/opt/rocprofiler-systems/lib/librocprof-sys-dl.so.1.7.1
|
||||
ROCP_TOOL_LIB=/opt/rocprofiler-systems/lib/librocprof-sys.so.1.7.1
|
||||
|
||||
* The next snippet shows the environment updates when ``omnitrace-sample`` enables
|
||||
* The next snippet shows the environment updates when ``rocprof-sys-sample`` enables
|
||||
profiling, tracing, host process-sampling, device process-sampling, and all the available backends:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omnitrace-sample -PTDH -I all -- ./parallel-overhead-locks 30 4 100
|
||||
$ rocprof-sys-sample -PTDH -I all -- ./parallel-overhead-locks 30 4 100
|
||||
|
||||
HSA_TOOLS_LIB=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
|
||||
HSA_TOOLS_LIB=/opt/rocprofiler-systems/lib/librocprof-sys-dl.so.1.7.1
|
||||
HSA_TOOLS_REPORT_LOAD_FAILURE=1
|
||||
KOKKOS_PROFILE_LIBRARY=/opt/omnitrace/lib/libomnitrace.so.1.7.1
|
||||
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
|
||||
OMNITRACE_CPU_FREQ_ENABLED=true
|
||||
OMNITRACE_TRACE_THREAD_LOCKS=true
|
||||
OMNITRACE_TRACE_THREAD_RW_LOCKS=true
|
||||
OMNITRACE_TRACE_THREAD_SPIN_LOCKS=true
|
||||
OMNITRACE_USE_KOKKOSP=true
|
||||
OMNITRACE_USE_MPIP=true
|
||||
OMNITRACE_USE_OMPT=true
|
||||
OMNITRACE_TRACE=true
|
||||
OMNITRACE_USE_PROCESS_SAMPLING=true
|
||||
OMNITRACE_USE_RCCLP=true
|
||||
OMNITRACE_USE_ROCM_SMI=true
|
||||
OMNITRACE_USE_ROCPROFILER=true
|
||||
OMNITRACE_USE_ROCTRACER=true
|
||||
OMNITRACE_USE_ROCTX=true
|
||||
OMNITRACE_USE_SAMPLING=true
|
||||
OMNITRACE_PROFILE=true
|
||||
OMP_TOOL_LIBRARIES=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
|
||||
ROCP_TOOL_LIB=/opt/omnitrace/lib/libomnitrace.so.1.7.1
|
||||
KOKKOS_PROFILE_LIBRARY=/opt/rocprofiler-systems/lib/librocprof-sys.so.1.7.1
|
||||
LD_PRELOAD=/opt/rocprofiler-systems/lib/librocprof-sys-dl.so.1.7.1
|
||||
ROCPROFSYS_CPU_FREQ_ENABLED=true
|
||||
ROCPROFSYS_TRACE_THREAD_LOCKS=true
|
||||
ROCPROFSYS_TRACE_THREAD_RW_LOCKS=true
|
||||
ROCPROFSYS_TRACE_THREAD_SPIN_LOCKS=true
|
||||
ROCPROFSYS_USE_KOKKOSP=true
|
||||
ROCPROFSYS_USE_MPIP=true
|
||||
ROCPROFSYS_USE_OMPT=true
|
||||
ROCPROFSYS_TRACE=true
|
||||
ROCPROFSYS_USE_PROCESS_SAMPLING=true
|
||||
ROCPROFSYS_USE_RCCLP=true
|
||||
ROCPROFSYS_USE_ROCM_SMI=true
|
||||
ROCPROFSYS_USE_ROCPROFILER=true
|
||||
ROCPROFSYS_USE_ROCTRACER=true
|
||||
ROCPROFSYS_USE_ROCTX=true
|
||||
ROCPROFSYS_USE_SAMPLING=true
|
||||
ROCPROFSYS_PROFILE=true
|
||||
OMP_TOOL_LIBRARIES=/opt/rocprofiler-systems/lib/librocprof-sys-dl.so.1.7.1
|
||||
ROCP_TOOL_LIB=/opt/rocprofiler-systems/lib/librocprof-sys.so.1.7.1
|
||||
...
|
||||
|
||||
* The final snippet shows the environment updates when ``omnitrace-sample`` enables
|
||||
* The final snippet shows the environment updates when ``rocprof-sys-sample`` enables
|
||||
profiling, tracing, host process-sampling, and device process-sampling,
|
||||
sets the output path to ``omnitrace-output`` and the output prefix to ``%tag%``, and disables
|
||||
sets the output path to ``rocprof-sys-output`` and the output prefix to ``%tag%``, and disables
|
||||
all the available backends:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omnitrace-sample -PTDH -E all -o omnitrace-output %tag% -- ./parallel-overhead-locks 30 4 100
|
||||
$ rocprof-sys-sample -PTDH -E all -o rocprof-sys-output %tag% -- ./parallel-overhead-locks 30 4 100
|
||||
|
||||
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
|
||||
OMNITRACE_CPU_FREQ_ENABLED=true
|
||||
OMNITRACE_OUTPUT_PATH=omnitrace-output
|
||||
OMNITRACE_OUTPUT_PREFIX=%tag%
|
||||
OMNITRACE_TRACE_THREAD_LOCKS=false
|
||||
OMNITRACE_TRACE_THREAD_RW_LOCKS=false
|
||||
OMNITRACE_TRACE_THREAD_SPIN_LOCKS=false
|
||||
OMNITRACE_USE_KOKKOSP=false
|
||||
OMNITRACE_USE_MPIP=false
|
||||
OMNITRACE_USE_OMPT=false
|
||||
OMNITRACE_TRACE=true
|
||||
OMNITRACE_USE_PROCESS_SAMPLING=true
|
||||
OMNITRACE_USE_RCCLP=false
|
||||
OMNITRACE_USE_ROCM_SMI=false
|
||||
OMNITRACE_USE_ROCPROFILER=false
|
||||
OMNITRACE_USE_ROCTRACER=false
|
||||
OMNITRACE_USE_ROCTX=false
|
||||
OMNITRACE_USE_SAMPLING=true
|
||||
OMNITRACE_PROFILE=true
|
||||
LD_PRELOAD=/opt/rocprofiler-systems/lib/librocprof-sys-dl.so.1.7.1
|
||||
ROCPROFSYS_CPU_FREQ_ENABLED=true
|
||||
ROCPROFSYS_OUTPUT_PATH=rocprof-sys-output
|
||||
ROCPROFSYS_OUTPUT_PREFIX=%tag%
|
||||
ROCPROFSYS_TRACE_THREAD_LOCKS=false
|
||||
ROCPROFSYS_TRACE_THREAD_RW_LOCKS=false
|
||||
ROCPROFSYS_TRACE_THREAD_SPIN_LOCKS=false
|
||||
ROCPROFSYS_USE_KOKKOSP=false
|
||||
ROCPROFSYS_USE_MPIP=false
|
||||
ROCPROFSYS_USE_OMPT=false
|
||||
ROCPROFSYS_TRACE=true
|
||||
ROCPROFSYS_USE_PROCESS_SAMPLING=true
|
||||
ROCPROFSYS_USE_RCCLP=false
|
||||
ROCPROFSYS_USE_ROCM_SMI=false
|
||||
ROCPROFSYS_USE_ROCPROFILER=false
|
||||
ROCPROFSYS_USE_ROCTRACER=false
|
||||
ROCPROFSYS_USE_ROCTX=false
|
||||
ROCPROFSYS_USE_SAMPLING=true
|
||||
ROCPROFSYS_PROFILE=true
|
||||
...
|
||||
|
||||
An omnitrace-sample example
|
||||
A rocprof-sys-sample example
|
||||
========================================
|
||||
|
||||
Here is the full output from the previous
|
||||
``omnitrace-sample -PTDH -E all -o omnitrace-output %tag% -- ./parallel-overhead-locks 30 4 100`` command:
|
||||
Here is the full output from the previous
|
||||
``rocprof-sys-sample -PTDH -E all -o rocprof-sys-output %tag% -- ./parallel-overhead-locks 30 4 100`` command:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omnitrace-sample -PTDH -E all -o omnitrace-output %tag% -c -- ./parallel-overhead-locks 30 4 100
|
||||
$ rocprof-sys-sample -PTDH -E all -o rocprof-sys-output %tag% -c -- ./parallel-overhead-locks 30 4 100
|
||||
|
||||
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.11.3
|
||||
OMNITRACE_CONFIG_FILE=
|
||||
OMNITRACE_CPU_FREQ_ENABLED=true
|
||||
OMNITRACE_OUTPUT_PATH=omnitrace-output
|
||||
OMNITRACE_OUTPUT_PREFIX=%tag%
|
||||
OMNITRACE_PROFILE=true
|
||||
OMNITRACE_TRACE=true
|
||||
OMNITRACE_TRACE_THREAD_LOCKS=false
|
||||
OMNITRACE_TRACE_THREAD_RW_LOCKS=false
|
||||
OMNITRACE_TRACE_THREAD_SPIN_LOCKS=false
|
||||
OMNITRACE_USE_KOKKOSP=false
|
||||
OMNITRACE_USE_MPIP=false
|
||||
OMNITRACE_USE_OMPT=false
|
||||
OMNITRACE_USE_PROCESS_SAMPLING=true
|
||||
OMNITRACE_USE_RCCLP=false
|
||||
OMNITRACE_USE_ROCM_SMI=false
|
||||
OMNITRACE_USE_ROCPROFILER=false
|
||||
OMNITRACE_USE_ROCTRACER=false
|
||||
OMNITRACE_USE_ROCTX=false
|
||||
OMNITRACE_USE_SAMPLING=true
|
||||
[omnitrace][dl][1785877] omnitrace_main
|
||||
[omnitrace][1785877][omnitrace_init_tooling] Instrumentation mode: Sampling
|
||||
______ .___ ___. .__ __. __ .___________..______ ___ ______ _______
|
||||
/ __ \ | \/ | | \ | | | | | || _ \ / \ / || ____|
|
||||
| | | | | \ / | | \| | | | `---| |----`| |_) | / ^ \ | ,----'| |__
|
||||
| | | | | |\/| | | . ` | | | | | | / / /_\ \ | | | __|
|
||||
| `--' | | | | | | |\ | | | | | | |\ \----./ _____ \ | `----.| |____
|
||||
\______/ |__| |__| |__| \__| |__| |__| | _| `._____/__/ \__\ \______||_______|
|
||||
omnitrace v1.11.2 (rev: 2586b74db8bf335742600010b8d9f1ce8da9cf89, compiler: GNU v11.4.1, rocm: v6.1.x)
|
||||
LD_PRELOAD=/opt/rocprofiler-systems/lib/librocprof-sys-dl.so.1.11.3
|
||||
ROCPROFSYS_CONFIG_FILE=
|
||||
ROCPROFSYS_CPU_FREQ_ENABLED=true
|
||||
ROCPROFSYS_OUTPUT_PATH=rocprof-sys-output
|
||||
ROCPROFSYS_OUTPUT_PREFIX=%tag%
|
||||
ROCPROFSYS_PROFILE=true
|
||||
ROCPROFSYS_TRACE=true
|
||||
ROCPROFSYS_TRACE_THREAD_LOCKS=false
|
||||
ROCPROFSYS_TRACE_THREAD_RW_LOCKS=false
|
||||
ROCPROFSYS_TRACE_THREAD_SPIN_LOCKS=false
|
||||
ROCPROFSYS_USE_KOKKOSP=false
|
||||
ROCPROFSYS_USE_MPIP=false
|
||||
ROCPROFSYS_USE_OMPT=false
|
||||
ROCPROFSYS_USE_PROCESS_SAMPLING=true
|
||||
ROCPROFSYS_USE_RCCLP=false
|
||||
ROCPROFSYS_USE_ROCM_SMI=false
|
||||
ROCPROFSYS_USE_ROCPROFILER=false
|
||||
ROCPROFSYS_USE_ROCTRACER=false
|
||||
ROCPROFSYS_USE_ROCTX=false
|
||||
ROCPROFSYS_USE_SAMPLING=true
|
||||
[rocprof-sys][dl][1785877] rocprofsys_main
|
||||
[rocprof-sys][1785877][rocprofsys_init_tooling] Instrumentation mode: Sampling
|
||||
__
|
||||
_ __ ___ ___ _ __ _ __ ___ / _| ___ _ _ ___
|
||||
| '__| / _ \ / __| | '_ \ | '__| / _ \ | |_ _____ / __| | | | | / __|
|
||||
| | | (_) | | (__ | |_) | | | | (_) | | _| |_____| \__ \ | |_| | \__ \
|
||||
|_| \___/ \___| | .__/ |_| \___/ |_| |___/ \__, | |___/
|
||||
|_| |___/
|
||||
|
||||
rocprof-sys v1.11.2 (rev: 2586b74db8bf335742600010b8d9f1ce8da9cf89, compiler: GNU v11.4.1, rocm: v6.1.x)
|
||||
[988.958] perfetto.cc:58649 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
|
||||
[parallel-overhead-locks] Threads: 4
|
||||
[parallel-overhead-locks] Iterations: 100
|
||||
@@ -386,19 +387,19 @@ Here is the full output from the previous
|
||||
[4] number of iterations: 100
|
||||
[parallel-overhead-locks] fibonacci(30) x 4 = 409221992
|
||||
[parallel-overhead-locks] number of mutex locks = 400
|
||||
[omnitrace][1785877][0][omnitrace_finalize] finalizing...
|
||||
[omnitrace][1785877][0][omnitrace_finalize]
|
||||
[omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877 : 0.294342 sec wall_clock, 4.776 MB peak_rss, 3.170 MB page_rss, 0.990000 sec cpu_clock, 336.3 % cpu_util [laps: 1]
|
||||
[omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/0 : 0.291535 sec wall_clock, 0.002619 sec thread_cpu_clock, 0.9 % thread_cpu_util, 4.776 MB peak_rss [laps: 1]
|
||||
[omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/1 : 0.271353 sec wall_clock, 0.222572 sec thread_cpu_clock, 82.0 % thread_cpu_util, 4.200 MB peak_rss [laps: 1]
|
||||
[omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/2 : 0.238218 sec wall_clock, 0.206405 sec thread_cpu_clock, 86.6 % thread_cpu_util, 3.432 MB peak_rss [laps: 1]
|
||||
[omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/3 : 0.209459 sec wall_clock, 0.193415 sec thread_cpu_clock, 92.3 % thread_cpu_util, 2.472 MB peak_rss [laps: 1]
|
||||
[omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/4 : 0.212029 sec wall_clock, 0.211694 sec thread_cpu_clock, 99.8 % thread_cpu_util, 1.152 MB peak_rss [laps: 1]
|
||||
[omnitrace][1785877][0][omnitrace_finalize]
|
||||
[omnitrace][1785877][0][omnitrace_finalize] Finalizing perfetto...
|
||||
[omnitrace][1785877][perfetto]> Outputting '/home/user/code/omnitrace/build-release/omnitrace-output/2024-07-15_16.21/parallel-overhead-locksperfetto-trace-1785877.proto' (39.12 KB / 0.04 MB / 0.00 GB)... Done
|
||||
[omnitrace][1785877][wall_clock]> Outputting 'omnitrace-output/2024-07-15_16.21/parallel-overhead-lockswall_clock-1785877.json'
|
||||
[omnitrace][1785877][wall_clock]> Outputting 'omnitrace-output/2024-07-15_16.21/parallel-overhead-lockswall_clock-1785877.txt'
|
||||
[omnitrace][1785877][metadata]> Outputting 'omnitrace-output/2024-07-15_16.21/parallel-overhead-locksmetadata-1785877.json' and 'omnitrace-output/2024-07-15_16.21/parallel-overhead-locksfunctions-1785877.json'
|
||||
[omnitrace][1785877][0][omnitrace_finalize] Finalized: 0.054582 sec wall_clock, 0.000 MB peak_rss, -1.798 MB page_rss, 0.040000 sec cpu_clock, 73.3 % cpu_util
|
||||
[rocprof-sys][1785877][0][rocprofsys_finalize] finalizing...
|
||||
[rocprof-sys][1785877][0][rocprofsys_finalize]
|
||||
[rocprof-sys][1785877][0][rocprofsys_finalize] rocprof-sys/process/1785877 : 0.294342 sec wall_clock, 4.776 MB peak_rss, 3.170 MB page_rss, 0.990000 sec cpu_clock, 336.3 % cpu_util [laps: 1]
|
||||
[rocprof-sys][1785877][0][rocprofsys_finalize] rocprof-sys/process/1785877/thread/0 : 0.291535 sec wall_clock, 0.002619 sec thread_cpu_clock, 0.9 % thread_cpu_util, 4.776 MB peak_rss [laps: 1]
|
||||
[rocprof-sys][1785877][0][rocprofsys_finalize] rocprof-sys/process/1785877/thread/1 : 0.271353 sec wall_clock, 0.222572 sec thread_cpu_clock, 82.0 % thread_cpu_util, 4.200 MB peak_rss [laps: 1]
|
||||
[rocprof-sys][1785877][0][rocprofsys_finalize] rocprof-sys/process/1785877/thread/2 : 0.238218 sec wall_clock, 0.206405 sec thread_cpu_clock, 86.6 % thread_cpu_util, 3.432 MB peak_rss [laps: 1]
|
||||
[rocprof-sys][1785877][0][rocprofsys_finalize] rocprof-sys/process/1785877/thread/3 : 0.209459 sec wall_clock, 0.193415 sec thread_cpu_clock, 92.3 % thread_cpu_util, 2.472 MB peak_rss [laps: 1]
|
||||
[rocprof-sys][1785877][0][rocprofsys_finalize] rocprof-sys/process/1785877/thread/4 : 0.212029 sec wall_clock, 0.211694 sec thread_cpu_clock, 99.8 % thread_cpu_util, 1.152 MB peak_rss [laps: 1]
|
||||
[rocprof-sys][1785877][0][rocprofsys_finalize]
|
||||
[rocprof-sys][1785877][0][rocprofsys_finalize] Finalizing perfetto...
|
||||
[rocprof-sys][1785877][perfetto]> Outputting '/home/user/code/rocprofiler-systems/build-release/rocprof-sys-output/2024-07-15_16.21/parallel-overhead-locksperfetto-trace-1785877.proto' (39.12 KB / 0.04 MB / 0.00 GB)... Done
|
||||
[rocprof-sys][1785877][wall_clock]> Outputting 'rocprof-sys-output/2024-07-15_16.21/parallel-overhead-lockswall_clock-1785877.json'
|
||||
[rocprof-sys][1785877][wall_clock]> Outputting 'rocprof-sys-output/2024-07-15_16.21/parallel-overhead-lockswall_clock-1785877.txt'
|
||||
[rocprof-sys][1785877][metadata]> Outputting 'rocprof-sys-output/2024-07-15_16.21/parallel-overhead-locksmetadata-1785877.json' and 'rocprof-sys-output/2024-07-15_16.21/parallel-overhead-locksfunctions-1785877.json'
|
||||
[rocprof-sys][1785877][0][rocprofsys_finalize] Finalized: 0.054582 sec wall_clock, 0.000 MB peak_rss, -1.798 MB page_rss, 0.040000 sec cpu_clock, 73.3 % cpu_util
|
||||
[989.312] perfetto.cc:60128 Tracing session 1 ended, total sessions:0
|
||||
|
||||
@@ -1,63 +1,63 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
:description: ROCm Systems Profiler system output documentation and reference
|
||||
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, profiler, system output, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
****************************************************
|
||||
Understanding the Omnitrace output
|
||||
Understanding the ROCm Systems Profiler output
|
||||
****************************************************
|
||||
|
||||
The general output form of `Omnitrace <https://github.com/ROCm/omnitrace>`_ is
|
||||
The general output form of `ROCm Systems Profiler <https://github.com/ROCm/rocprofiler-systems>`_ is
|
||||
``<OUTPUT_PATH>[/<TIMESTAMP>]/[<PREFIX>]<DATA_NAME>[-<OUTPUT_SUFFIX>].<EXT>``.
|
||||
|
||||
For example, starting with the following base configuration:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
export OMNITRACE_OUTPUT_PATH=omnitrace-example-output
|
||||
export OMNITRACE_TIME_OUTPUT=ON
|
||||
export OMNITRACE_USE_PID=OFF
|
||||
export OMNITRACE_PROFILE=ON
|
||||
export OMNITRACE_TRACE=ON
|
||||
export ROCPROFSYS_OUTPUT_PATH=rocprof-sys-example-output
|
||||
export ROCPROFSYS_TIME_OUTPUT=ON
|
||||
export ROCPROFSYS_USE_PID=OFF
|
||||
export ROCPROFSYS_PROFILE=ON
|
||||
export ROCPROFSYS_TRACE=ON
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omnitrace-instrument -- ./foo
|
||||
$ rocprof-sys-instrument -- ./foo
|
||||
...
|
||||
[omnitrace] Outputting 'omnitrace-example-output/perfetto-trace.proto'...
|
||||
[rocprof-sys] Outputting 'rocprof-sys-example-output/perfetto-trace.proto'...
|
||||
|
||||
[omnitrace] Outputting 'omnitrace-example-output/wall-clock.txt'...
|
||||
[omnitrace] Outputting 'omnitrace-example-output/wall-clock.json'...
|
||||
[rocprof-sys] Outputting 'rocprof-sys-example-output/wall-clock.txt'...
|
||||
[rocprof-sys] Outputting 'rocprof-sys-example-output/wall-clock.json'...
|
||||
|
||||
If the ``OMNITRACE_USE_PID`` option is enabled, then running a non-MPI executable
|
||||
If the ``ROCPROFSYS_USE_PID`` option is enabled, then running a non-MPI executable
|
||||
with a PID of ``63453`` results in the following output:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ export OMNITRACE_USE_PID=ON
|
||||
$ omnitrace-instrument -- ./foo
|
||||
$ export ROCPROFSYS_USE_PID=ON
|
||||
$ rocprof-sys-instrument -- ./foo
|
||||
...
|
||||
[omnitrace] Outputting 'omnitrace-example-output/perfetto-trace-63453.proto'...
|
||||
[rocprof-sys] Outputting 'rocprof-sys-example-output/perfetto-trace-63453.proto'...
|
||||
|
||||
[omnitrace] Outputting 'omnitrace-example-output/wall-clock-63453.txt'...
|
||||
[omnitrace] Outputting 'omnitrace-example-output/wall-clock-63453.json'...
|
||||
[rocprof-sys] Outputting 'rocprof-sys-example-output/wall-clock-63453.txt'...
|
||||
[rocprof-sys] Outputting 'rocprof-sys-example-output/wall-clock-63453.json'...
|
||||
|
||||
If ``OMNITRACE_TIME_OUTPUT`` is enabled, then a job that started on January 31, 2022 at 12:30 PM
|
||||
If ``ROCPROFSYS_TIME_OUTPUT`` is enabled, then a job that started on January 31, 2022 at 12:30 PM
|
||||
generates the following:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ export OMNITRACE_TIME_OUTPUT=ON
|
||||
$ omnitrace-instrument -- ./foo
|
||||
$ export ROCPROFSYS_TIME_OUTPUT=ON
|
||||
$ rocprof-sys-instrument -- ./foo
|
||||
...
|
||||
[omnitrace] Outputting 'omnitrace-example-output/2022-01-31_12.30_PM/perfetto-trace-63453.proto'...
|
||||
[rocprof-sys] Outputting 'rocprof-sys-example-output/2022-01-31_12.30_PM/perfetto-trace-63453.proto'...
|
||||
|
||||
[omnitrace] Outputting 'omnitrace-example-output/2022-01-31_12.30_PM/wall-clock-63453.txt'...
|
||||
[omnitrace] Outputting 'omnitrace-example-output/2022-01-31_12.30_PM/wall-clock-63453.json'...
|
||||
[rocprof-sys] Outputting 'rocprof-sys-example-output/2022-01-31_12.30_PM/wall-clock-63453.txt'...
|
||||
[rocprof-sys] Outputting 'rocprof-sys-example-output/2022-01-31_12.30_PM/wall-clock-63453.json'...
|
||||
|
||||
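The naming rules shown in the three listings above can be summarized in a short Python sketch. It is illustrative only and is not the profiler's implementation: the ``output_file`` helper and its parameters are hypothetical, and the default ``time_format`` is an assumption chosen to reproduce the ``2022-01-31_12.30_PM`` folder name in the example.

.. code-block:: python

   from datetime import datetime
   from pathlib import PurePosixPath


   def output_file(output_path, name, ext, *, use_pid=False, pid=None,
                   time_output=False, launch_time=None,
                   time_format="%Y-%m-%d_%I.%M_%p"):
       """Compose an output file name the way the listings above suggest.

       Hypothetical helper: the argument names are not profiler settings.
       """
       directory = PurePosixPath(output_path)
       if time_output:
           # All output is placed in a folder named after the launch time.
           directory = directory / launch_time.strftime(time_format)
       # The PID is appended before the extension when use_pid is set.
       stem = f"{name}-{pid}" if use_pid else name
       return str(directory / f"{stem}.{ext}")


   if __name__ == "__main__":
       started = datetime(2022, 1, 31, 12, 30)
       print(output_file("rocprof-sys-example-output", "wall-clock", "txt"))
       print(output_file("rocprof-sys-example-output", "wall-clock", "txt",
                         use_pid=True, pid=63453))
       print(output_file("rocprof-sys-example-output", "wall-clock", "txt",
                         use_pid=True, pid=63453,
                         time_output=True, launch_time=started))

The third call combines both options because, in the listing above, ``ROCPROFSYS_USE_PID`` is still enabled when ``ROCPROFSYS_TIME_OUTPUT`` is turned on.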
Metadata
|
||||
========================================
|
||||
|
||||
Omnitrace outputs a ``metadata.json`` file. This metadata file contains
|
||||
ROCm Systems Profiler outputs a ``metadata.json`` file. This metadata file contains
|
||||
information about the settings, environment variables, output files, and info
|
||||
about the system and the run, as follows:
|
||||
|
||||
@@ -77,7 +77,7 @@ Metadata JSON Sample
|
||||
.. code-block:: json
|
||||
|
||||
{
|
||||
"omnitrace": {
|
||||
"rocprof-sys": {
|
||||
"metadata": {
|
||||
"info": {
|
||||
"HW_L1_CACHE_SIZE": 32768,
|
||||
@@ -161,13 +161,13 @@ Metadata JSON Sample
|
||||
"text": [
|
||||
{
|
||||
"value": [
|
||||
"omnitrace-tests-output/parallel-overhead-binary-rewrite/roctracer.txt"
|
||||
"rocprof-sys-tests-output/parallel-overhead-binary-rewrite/roctracer.txt"
|
||||
],
|
||||
"key": "roctracer"
|
||||
},
|
||||
{
|
||||
"value": [
|
||||
"omnitrace-tests-output/parallel-overhead-binary-rewrite/wall_clock.txt"
|
||||
"rocprof-sys-tests-output/parallel-overhead-binary-rewrite/wall_clock.txt"
|
||||
],
|
||||
"key": "wall_clock"
|
||||
}
|
||||
@@ -175,15 +175,15 @@ Metadata JSON Sample
|
||||
"json": [
|
||||
{
|
||||
"value": [
|
||||
"omnitrace-tests-output/parallel-overhead-binary-rewrite/roctracer.json",
|
||||
"omnitrace-tests-output/parallel-overhead-binary-rewrite/roctracer.tree.json"
|
||||
"rocprof-sys-tests-output/parallel-overhead-binary-rewrite/roctracer.json",
|
||||
"rocprof-sys-tests-output/parallel-overhead-binary-rewrite/roctracer.tree.json"
|
||||
],
|
||||
"key": "roctracer"
|
||||
},
|
||||
{
|
||||
"value": [
|
||||
"omnitrace-tests-output/parallel-overhead-binary-rewrite/wall_clock.json",
|
||||
"omnitrace-tests-output/parallel-overhead-binary-rewrite/wall_clock.tree.json"
|
||||
"rocprof-sys-tests-output/parallel-overhead-binary-rewrite/wall_clock.json",
|
||||
"rocprof-sys-tests-output/parallel-overhead-binary-rewrite/wall_clock.tree.json"
|
||||
],
|
||||
"key": "wall_clock"
|
||||
}
|
||||
@@ -208,7 +208,7 @@ Metadata JSON Sample
|
||||
}
|
||||
],
|
||||
"settings": {
|
||||
"OMNITRACE_JSON_OUTPUT": {
|
||||
"ROCPROFSYS_JSON_OUTPUT": {
|
||||
"count": -1,
|
||||
"environ_updated": false,
|
||||
"name": "json_output",
|
||||
@@ -218,9 +218,9 @@ Metadata JSON Sample
|
||||
"value": true,
|
||||
"max_count": 1,
|
||||
"cmdline": [
|
||||
"--omnitrace-json-output"
|
||||
"--rocprof-sys-json-output"
|
||||
],
|
||||
"environ": "OMNITRACE_JSON_OUTPUT",
|
||||
"environ": "ROCPROFSYS_JSON_OUTPUT",
|
||||
"config_updated": false,
|
||||
"categories": [
|
||||
"io",
|
||||
@@ -237,10 +237,10 @@ Metadata JSON Sample
|
||||
}
|
||||
}
|
||||
|
||||
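As a rough illustration of how the metadata can be consumed, the following Python sketch collects the output files listed in a parsed ``metadata.json``. It relies only on the ``{"key": ..., "value": [...]}`` entries visible in the sample above. Because the sample is abbreviated, the sketch searches the whole document recursively rather than assuming a fixed nesting. The ``collect_output_files`` helper and the ``output`` key in the inline sample are hypothetical placeholders.

.. code-block:: python

   import json


   def collect_output_files(node):
       """Return {component: [paths]} from every {"key", "value"} entry."""
       found = {}
       if isinstance(node, dict):
           key, value = node.get("key"), node.get("value")
           if isinstance(key, str) and isinstance(value, list):
               paths = [v for v in value if isinstance(v, str)]
               found.setdefault(key, []).extend(paths)
           children = node.values()
       elif isinstance(node, list):
           children = node
       else:
           return found
       for child in children:
           for name, paths in collect_output_files(child).items():
               found.setdefault(name, []).extend(paths)
       return found


   # Abbreviated stand-in for a real metadata.json file. The "output" level
   # is a placeholder for the part of the structure the sample elides.
   SAMPLE = json.loads("""
   {
     "rocprof-sys": {
       "metadata": {
         "output": {
           "text": [
             {"value": ["out/wall_clock.txt"], "key": "wall_clock"}
           ],
           "json": [
             {"value": ["out/wall_clock.json", "out/wall_clock.tree.json"],
              "key": "wall_clock"}
           ]
         }
       }
     }
   }
   """)

   if __name__ == "__main__":
       for component, paths in collect_output_files(SAMPLE).items():
           print(component, paths)

For a real run, replace ``SAMPLE`` with the result of ``json.load`` on the generated ``metadata.json`` file.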
Configuring the Omnitrace output
|
||||
Configuring the ROCm Systems Profiler output
|
||||
============================================
|
||||
|
||||
Omnitrace includes a core set of options for controlling the format
|
||||
ROCm Systems Profiler includes a core set of options for controlling the format
|
||||
and contents of the output files. For additional information, see the guide on
|
||||
:doc:`configuring runtime options <./configuring-runtime-options>`.
|
||||
|
||||
@@ -251,19 +251,19 @@ Core configuration settings
|
||||
:header: "Setting", "Value", "Description"
|
||||
:widths: 30, 30, 100
|
||||
|
||||
"``OMNITRACE_OUTPUT_PATH``", "Any valid path", "Path to folder where output files should be placed"
|
||||
"``OMNITRACE_OUTPUT_PREFIX``", "String", "Useful for multiple runs with different arguments. See the next section on output prefix keys."
|
||||
"``OMNITRACE_OUTPUT_FILE``", "Any valid filepath", "Specific location for the Perfetto output file"
|
||||
"``OMNITRACE_TIME_OUTPUT``", "Boolean", "Place all output in a timestamped folder, timestamp format controlled via ``OMNITRACE_TIME_FORMAT``"
|
||||
"``OMNITRACE_TIME_FORMAT``", "String", "See ``strftime`` man pages for valid identifiers"
|
||||
"``OMNITRACE_USE_PID``", "Boolean", "Append either the PID or the MPI rank to all output files (before the extension)"
|
||||
"``ROCPROFSYS_OUTPUT_PATH``", "Any valid path", "Path to folder where output files should be placed"
|
||||
"``ROCPROFSYS_OUTPUT_PREFIX``", "String", "Useful for multiple runs with different arguments. See the next section on output prefix keys."
|
||||
"``ROCPROFSYS_OUTPUT_FILE``", "Any valid filepath", "Specific location for the Perfetto output file"
|
||||
"``ROCPROFSYS_TIME_OUTPUT``", "Boolean", "Place all output in a timestamped folder, timestamp format controlled via ``ROCPROFSYS_TIME_FORMAT``"
|
||||
"``ROCPROFSYS_TIME_FORMAT``", "String", "See ``strftime`` man pages for valid identifiers"
|
||||
"``ROCPROFSYS_USE_PID``", "Boolean", "Append either the PID or the MPI rank to all output files (before the extension)"
|
||||
|
||||
Output prefix keys
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Output prefix keys have many uses but are most helpful when dealing with multiple
|
||||
profiling runs or large MPI jobs.
|
||||
They are included in Omnitrace because they were introduced into Timemory
|
||||
They are included in ROCm Systems Profiler because they were introduced into Timemory
|
||||
for `compile-time-perf <https://github.com/jrmadsen/compile-time-perf>`_.
|
||||
They are needed to create different output files for a generic wrapper around
|
||||
compilation commands while still
|
||||
@@ -271,8 +271,8 @@ overwriting the output from the last time a file was compiled.
|
||||
|
||||
When doing scaling studies and specifying options via the command line,
|
||||
the recommended process is to
|
||||
use a common ``OMNITRACE_OUTPUT_PATH``, disable ``OMNITRACE_TIME_OUTPUT``,
|
||||
set ``OMNITRACE_OUTPUT_PREFIX="%argt%-"``, and let Omnitrace cleanly organize the output.
|
||||
use a common ``ROCPROFSYS_OUTPUT_PATH``, disable ``ROCPROFSYS_TIME_OUTPUT``,
|
||||
set ``ROCPROFSYS_OUTPUT_PREFIX="%argt%-"``, and let ROCm Systems Profiler cleanly organize the output.
|
||||
|
||||
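The recommended scaling-study setup can be expressed as a set of environment variables. The Python sketch below only builds that environment. The ``scaling_study_env`` helper is hypothetical, and how the resulting dictionary is handed to the profiled program (for example through the ``env`` argument of ``subprocess.run``) is left to the caller.

.. code-block:: python

   import os


   def scaling_study_env(output_path, base_env=None):
       """Return a copy of the environment with the recommended settings."""
       env = dict(os.environ if base_env is None else base_env)
       env.update({
           # One common folder for every run in the study.
           "ROCPROFSYS_OUTPUT_PATH": output_path,
           # No timestamped subfolders, so reruns overwrite cleanly.
           "ROCPROFSYS_TIME_OUTPUT": "OFF",
           # Distinguish runs by their command-line arguments.
           "ROCPROFSYS_OUTPUT_PREFIX": "%argt%-",
       })
       return env


   if __name__ == "__main__":
       for name, value in sorted(scaling_study_env("scaling-output", {}).items()):
           print(f"{name}={value}")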
.. csv-table::
|
||||
:header: "String", "Encoding"
|
||||
@@ -297,9 +297,9 @@ set ``OMNITRACE_OUTPUT_PREFIX="%argt%-"``, and let Omnitrace cleanly organize th
|
||||
"``%rank%``", "Value of ``SLURM_PROCID`` environment variable if exists, else ``MPI_Comm_rank`` (or ``0`` non-mpi)"
|
||||
"``%size%``", "``MPI_Comm_size`` or ``1`` if non-mpi"
|
||||
"``%nid%``", "``%rank%`` if possible, otherwise ``%pid%``"
|
||||
"``%launch_time%``", "Launch date and time (uses ``OMNITRACE_TIME_FORMAT``)"
|
||||
"``%launch_time%``", "Launch date and time (uses ``ROCPROFSYS_TIME_FORMAT``)"
|
||||
"``%env{NAME}%``", "Value of environment variable ``NAME`` (i.e. ``getenv(NAME)``)"
|
||||
"``%cfg{NAME}%``", "Value of configuration variable ``NAME`` (e.g. ``%cfg{OMNITRACE_SAMPLING_FREQ}%`` would resolve to sampling frequency)"
|
||||
"``%cfg{NAME}%``", "Value of configuration variable ``NAME`` (e.g. ``%cfg{ROCPROFSYS_SAMPLING_FREQ}%`` would resolve to sampling frequency)"
|
||||
"``$env{NAME}``", "Alternative syntax to ``%env{NAME}%``"
|
||||
"``$cfg{NAME}``", "Alternative syntax to ``%cfg{NAME}%``"
|
||||
"``%m``", "Shorthand for ``%argt_hash%``"
|
||||
@@ -318,8 +318,8 @@ set ``OMNITRACE_OUTPUT_PREFIX="%argt%-"``, and let Omnitrace cleanly organize th
|
||||
Perfetto output
|
||||
========================================
|
||||
|
||||
Use the ``OMNITRACE_OUTPUT_FILE`` to specify a specific location. If this is an
|
||||
absolute path, then all ``OMNITRACE_OUTPUT_PATH`` and similar
|
||||
Use the ``ROCPROFSYS_OUTPUT_FILE`` to specify a specific location. If this is an
|
||||
absolute path, then all ``ROCPROFSYS_OUTPUT_PATH`` and similar
|
||||
settings are ignored. Visit `ui.perfetto.dev <https://ui.perfetto.dev>`_ and open
|
||||
this file.
|
||||
|
||||
@@ -328,26 +328,26 @@ this file.
|
||||
If you are experiencing problems viewing your trace in the latest version of `Perfetto <http://ui.perfetto.dev>`_,
|
||||
then try using `Perfetto UI v46.0 <https://ui.perfetto.dev/v46.0-35b3d9845/#!/>`_.
|
||||
|
||||
.. image:: ../data/omnitrace-perfetto.png
|
||||
.. image:: ../data/rocprof-sys-perfetto.png
|
||||
:alt: Visualization of a performance graph in Perfetto
|
||||
|
||||
.. image:: ../data/omnitrace-rocm.png
|
||||
.. image:: ../data/rocprof-sys-rocm.png
|
||||
:alt: Visualization of ROCm data in Perfetto
|
||||
|
||||
.. image:: ../data/omnitrace-rocm-flow.png
|
||||
.. image:: ../data/rocprof-sys-rocm-flow.png
|
||||
:alt: Visualization of ROCm flow data in Perfetto
|
||||
|
||||
.. image:: ../data/omnitrace-user-api.png
|
||||
.. image:: ../data/rocprof-sys-user-api.png
|
||||
:alt: Visualization of ROCm API calls in Perfetto
|
||||
|
||||
Timemory output
|
||||
========================================
|
||||
|
||||
Use ``omnitrace-avail --components --filename`` to view the base filename for each component, as follows:
|
||||
Use ``rocprof-sys-avail --components --filename`` to view the base filename for each component, as follows:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omnitrace-avail wall_clock -C -f
|
||||
$ rocprof-sys-avail wall_clock -C -f
|
||||
|---------------------------------|---------------|------------------------|
|
||||
| COMPONENT | AVAILABLE | FILENAME |
|
||||
|---------------------------------|---------------|------------------------|
|
||||
@@ -355,16 +355,16 @@ Use ``omnitrace-avail --components --filename`` to view the base filename for ea
|
||||
| sampling_wall_clock | true | sampling_wall_clock |
|
||||
|---------------------------------|---------------|------------------------|
|
||||
|
||||
The ``OMNITRACE_COLLAPSE_THREADS`` and ``OMNITRACE_COLLAPSE_PROCESSES`` settings are
|
||||
only valid when full `MPI support is enabled <../install/install.html#mpi-support-within-omnitrace>`_.
|
||||
The ``ROCPROFSYS_COLLAPSE_THREADS`` and ``ROCPROFSYS_COLLAPSE_PROCESSES`` settings are
|
||||
only valid when full `MPI support is enabled <../install/install.html#mpi-support-within-rocprof-sys>`_.
|
||||
When they are set, Timemory combines the per-thread and per-rank data (respectively) of
|
||||
identical call stacks.
|
||||
|
||||
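To make the idea of combining data from identical call stacks concrete, here is a minimal Python sketch. It is a conceptual illustration and not Timemory's implementation: the ``collapse_threads`` helper, the record layout, and the choice to sum the measurements are all assumptions.

.. code-block:: python

   from collections import defaultdict


   def collapse_threads(records):
       """Merge (thread_id, call_stack, seconds) records by call stack.

       The thread ID is discarded, so entries from different threads that
       share a call stack end up in a single row.
       """
       merged = defaultdict(lambda: {"seconds": 0.0, "laps": 0})
       for _thread_id, call_stack, seconds in records:
           entry = merged[tuple(call_stack)]
           entry["seconds"] += seconds  # assumption: values are summed
           entry["laps"] += 1
       return dict(merged)


   if __name__ == "__main__":
       per_thread = [
           (0, ("main", "run"), 0.5),
           (1, ("main", "run"), 0.25),
           (1, ("main", "fib"), 1.0),
       ]
       for stack, entry in collapse_threads(per_thread).items():
           print(" > ".join(stack), entry)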
The ``OMNITRACE_FLAT_PROFILE`` setting removes all call stack hierarchy.
|
||||
Using ``OMNITRACE_FLAT_PROFILE=ON`` in combination
|
||||
with ``OMNITRACE_COLLAPSE_THREADS=ON`` is a useful configuration for identifying
|
||||
The ``ROCPROFSYS_FLAT_PROFILE`` setting removes all call stack hierarchy.
|
||||
Using ``ROCPROFSYS_FLAT_PROFILE=ON`` in combination
|
||||
with ``ROCPROFSYS_COLLAPSE_THREADS=ON`` is a useful configuration for identifying
|
||||
min/max measurements regardless of the calling context.
|
||||
The ``OMNITRACE_TIMELINE_PROFILE`` setting (with ``OMNITRACE_FLAT_PROFILE=OFF``) effectively
|
||||
The ``ROCPROFSYS_TIMELINE_PROFILE`` setting (with ``ROCPROFSYS_FLAT_PROFILE=OFF``) effectively
|
||||
generates similar data to that found
|
||||
in Perfetto. Enabling timeline and flat profiling effectively generates
|
||||
similar data to ``strace``. However, while Timemory generally
|
||||
@@ -376,11 +376,11 @@ Timemory text output
|
||||
|
||||
Timemory text output files are meant for human consumption (while JSON formats are for analysis),
|
||||
so some fields such as the ``LABEL`` might be truncated for readability.
|
||||
The truncation settings can be changed through the ``OMNITRACE_MAX_WIDTH`` setting.
|
||||
The truncation settings can be changed through the ``ROCPROFSYS_MAX_WIDTH`` setting.
|
||||
|
||||
.. note::
|
||||
|
||||
The generation of text output is configurable via ``OMNITRACE_TEXT_OUTPUT``.
|
||||
The generation of text output is configurable via ``ROCPROFSYS_TEXT_OUTPUT``.
|
||||
|
||||
.. _text-output-example-label:
|
||||
|
||||
@@ -389,7 +389,7 @@ Timemory text output example
|
||||
|
||||
In the following example, the ``NN`` field in ``|NN>>>`` is the thread ID. If MPI support is enabled,
|
||||
this becomes ``|MM|NN>>>`` where ``MM`` is the rank.
|
||||
If ``OMNITRACE_COLLAPSE_THREADS=ON`` and ``OMNITRACE_COLLAPSE_PROCESSES=ON`` are configured,
|
||||
If ``ROCPROFSYS_COLLAPSE_THREADS=ON`` and ``ROCPROFSYS_COLLAPSE_PROCESSES=ON`` are configured,
|
||||
neither the ``MM`` nor the ``NN`` are present unless the
|
||||
component explicitly sets type traits. Type traits specify that the data is only
|
||||
relevant per-thread or per-process, such as the ``thread_cpu_clock`` clock component.
|
||||
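When post-processing the text output, the ``|NN>>>`` and ``|MM|NN>>>`` prefixes described above can be split off with a regular expression. The following Python sketch assumes the prefix appears at the start of the label and that ``MM`` and ``NN`` are decimal integers. The ``parse_label`` helper is hypothetical.

.. code-block:: python

   import re

   # Optional "|MM" rank part, then "|NN>>>" thread part, then the label.
   _PREFIX = re.compile(
       r"^\|(?:(?P<rank>\d+)\|)?(?P<thread>\d+)>>>\s*(?P<label>.*)$"
   )


   def parse_label(text):
       """Return (rank, thread, label); rank/thread are None when absent."""
       text = text.strip()
       match = _PREFIX.match(text)
       if match is None:
           # Collapsed output carries no rank or thread prefix.
           return None, None, text
       rank = int(match["rank"]) if match["rank"] is not None else None
       return rank, int(match["thread"]), match["label"]


   if __name__ == "__main__":
       for sample in ("|0>>> main", "|1|3>>> run(20) x 100", "main"):
           print(parse_label(sample))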
@@ -592,8 +592,8 @@ write a simple Python script for post-processing using this format than with the
|
||||
|
||||
.. note::
|
||||
|
||||
The generation of flat JSON output is configurable via ``OMNITRACE_JSON_OUTPUT``.
|
||||
The generation of hierarchical JSON data is configurable via ``OMNITRACE_TREE_OUTPUT``.
|
||||
The generation of flat JSON output is configurable via ``ROCPROFSYS_JSON_OUTPUT``.
|
||||
The generation of hierarchical JSON data is configurable via ``ROCPROFSYS_TREE_OUTPUT``.
|
||||
|
||||
Timemory JSON output sample
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
@@ -1,37 +1,38 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
:description: ROCm Systems Profiler documentation and reference
|
||||
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
****************************************************
|
||||
Using the Omnitrace API
|
||||
Using the ROCm Systems Profiler API
|
||||
****************************************************
|
||||
|
||||
The following example shows how a program can use the Omnitrace API for run-time analysis.
|
||||
The following example shows how a program can use the ROCm Systems Profiler API
|
||||
for run-time analysis.
|
||||
|
||||
Omnitrace user API example program
|
||||
ROCm Systems Profiler user API example program
|
||||
==============================================
|
||||
|
||||
You can use the Omnitrace API to define custom regions to profile and trace.
|
||||
The following C++ program demonstrates this technique by calling several functions from the
|
||||
Omnitrace API, such as ``omnitrace_user_push_region`` and
|
||||
``omnitrace_user_stop_thread_trace``.
|
||||
You can use the ROCm Systems Profiler API to define custom regions to profile and trace.
|
||||
The following C++ program demonstrates this technique by calling several functions from the
|
||||
ROCm Systems Profiler API, such as ``rocprofsys_user_push_region`` and
|
||||
``rocprofsys_user_stop_thread_trace``.
|
||||
|
||||
.. note::
|
||||
|
||||
By default, when Omnitrace detects any ``omnitrace_user_start_*`` or
|
||||
``omnitrace_user_stop_*`` function, instrumentation
|
||||
is disabled at start up, which means ``omnitrace_user_stop_trace()`` is not
|
||||
By default, when ROCm Systems Profiler detects any ``rocprofsys_user_start_*`` or
|
||||
``rocprofsys_user_stop_*`` function, instrumentation
|
||||
is disabled at start up, which means ``rocprofsys_user_stop_trace()`` is not
|
||||
required at the beginning of ``main``. This behavior
|
||||
can be manually controlled by using the ``OMNITRACE_INIT_ENABLED`` environment variable.
|
||||
can be manually controlled by using the ``ROCPROFSYS_INIT_ENABLED`` environment variable.
|
||||
User-defined regions are always
|
||||
recorded, regardless of whether ``omnitrace_user_start_*`` or
|
||||
``omnitrace_user_stop_*`` has been called.
|
||||
recorded, regardless of whether ``rocprofsys_user_start_*`` or
|
||||
``rocprofsys_user_stop_*`` has been called.
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
#include <omnitrace/categories.h>
|
||||
#include <omnitrace/types.h>
|
||||
#include <omnitrace/user.h>
|
||||
#include <rocprofiler-systems/categories.h>
|
||||
#include <rocprofiler-systems/types.h>
|
||||
#include <rocprofiler-systems/user.h>
|
||||
|
||||
#include <atomic>
|
||||
#include <cassert>
|
||||
@@ -56,52 +57,52 @@ Omnitrace API, such as ``omnitrace_user_push_region`` and
|
||||
|
||||
namespace
|
||||
{
|
||||
omnitrace_user_callbacks_t custom_callbacks = OMNITRACE_USER_CALLBACKS_INIT;
|
||||
omnitrace_user_callbacks_t original_callbacks = OMNITRACE_USER_CALLBACKS_INIT;
|
||||
rocprofsys_user_callbacks_t custom_callbacks = ROCPROFSYS_USER_CALLBACKS_INIT;
|
||||
rocprofsys_user_callbacks_t original_callbacks = ROCPROFSYS_USER_CALLBACKS_INIT;
|
||||
} // namespace
|
||||
|
||||
int
|
||||
main(int argc, char** argv)
|
||||
{
|
||||
custom_callbacks.push_region = &custom_push_region;
|
||||
omnitrace_user_configure(OMNITRACE_USER_UNION_CONFIG, custom_callbacks,
|
||||
rocprofsys_user_configure(ROCPROFSYS_USER_UNION_CONFIG, custom_callbacks,
|
||||
&original_callbacks);
|
||||
|
||||
omnitrace_user_push_region(argv[0]);
|
||||
omnitrace_user_push_region("initialization");
|
||||
rocprofsys_user_push_region(argv[0]);
|
||||
rocprofsys_user_push_region("initialization");
|
||||
size_t nthread = std::min<size_t>(16, std::thread::hardware_concurrency());
|
||||
size_t nitr = 50000;
|
||||
long nfib = 10;
|
||||
if(argc > 1) nfib = atol(argv[1]);
|
||||
if(argc > 2) nthread = atol(argv[2]);
|
||||
if(argc > 3) nitr = atol(argv[3]);
|
||||
omnitrace_user_pop_region("initialization");
|
||||
rocprofsys_user_pop_region("initialization");
|
||||
|
||||
printf("[%s] Threads: %zu\n[%s] Iterations: %zu\n[%s] fibonacci(%li)...\n", argv[0],
|
||||
nthread, argv[0], nitr, argv[0], nfib);
|
||||
|
||||
omnitrace_user_push_region("thread_creation");
|
||||
rocprofsys_user_push_region("thread_creation");
|
||||
std::vector<std::thread> threads{};
|
||||
threads.reserve(nthread);
|
||||
// disable instrumentation for child threads
|
||||
omnitrace_user_stop_thread_trace();
|
||||
rocprofsys_user_stop_thread_trace();
|
||||
for(size_t i = 0; i < nthread; ++i)
|
||||
{
|
||||
threads.emplace_back(&run, nitr, nfib);
|
||||
}
|
||||
// re-enable instrumentation
|
||||
omnitrace_user_start_thread_trace();
|
||||
omnitrace_user_pop_region("thread_creation");
|
||||
rocprofsys_user_start_thread_trace();
|
||||
rocprofsys_user_pop_region("thread_creation");
|
||||
|
||||
omnitrace_user_push_region("thread_wait");
|
||||
rocprofsys_user_push_region("thread_wait");
|
||||
for(auto& itr : threads)
|
||||
itr.join();
|
||||
omnitrace_user_pop_region("thread_wait");
|
||||
rocprofsys_user_pop_region("thread_wait");
|
||||
|
||||
run(nitr, nfib);
|
||||
|
||||
printf("[%s] fibonacci(%li) x %lu = %li\n", argv[0], nfib, nthread, total.load());
|
||||
omnitrace_user_pop_region(argv[0]);
|
||||
rocprofsys_user_pop_region(argv[0]);
|
||||
|
||||
return 0;
|
||||
}
|
||||
@@ -120,19 +121,19 @@ Omnitrace API, such as ``omnitrace_user_push_region`` and
|
||||
void
|
||||
run(size_t nitr, long n)
|
||||
{
|
||||
omnitrace_user_push_region(RUN_LABEL);
|
||||
rocprofsys_user_push_region(RUN_LABEL);
|
||||
long local = 0;
|
||||
for(size_t i = 0; i < nitr; ++i)
|
||||
local += fib(n);
|
||||
total += local;
|
||||
omnitrace_user_pop_region(RUN_LABEL);
|
||||
rocprofsys_user_pop_region(RUN_LABEL);
|
||||
}
|
||||
|
||||
int
|
||||
custom_push_region(const char* name)
|
||||
{
|
||||
if(!original_callbacks.push_region || !original_callbacks.push_annotated_region)
|
||||
return OMNITRACE_USER_ERROR_NO_BINDING;
|
||||
return ROCPROFSYS_USER_ERROR_NO_BINDING;
|
||||
|
||||
printf("Pushing custom region :: %s\n", name);
|
||||
|
||||
@@ -143,22 +144,22 @@ Omnitrace API, such as ``omnitrace_user_push_region`` and
|
||||
char _buff[1024];
|
||||
if(_err != 0) _msg = strerror_r(_err, _buff, sizeof(_buff));
|
||||
|
||||
omnitrace_annotation_t _annotations[] = {
|
||||
{ "errno", OMNITRACE_INT32, &_err }, { "strerror", OMNITRACE_STRING, _msg }
|
||||
rocprofsys_annotation_t _annotations[] = {
|
||||
{ "errno", ROCPROFSYS_INT32, &_err }, { "strerror", ROCPROFSYS_STRING, _msg }
|
||||
};
|
||||
|
||||
errno = 0; // reset errno
|
||||
return (*original_callbacks.push_annotated_region)(
|
||||
name, _annotations, sizeof(_annotations) / sizeof(omnitrace_annotation_t));
|
||||
name, _annotations, sizeof(_annotations) / sizeof(rocprofsys_annotation_t));
|
||||
}
|
||||
|
||||
return (*original_callbacks.push_region)(name);
|
||||
}
|
||||
|
||||
Linking the Omnitrace libraries to another program
|
||||
Linking the ROCm Systems Profiler libraries to another program
|
||||
==============================================================
|
||||
|
||||
To link the ``omnitrace-user-library`` to another program,
|
||||
To link the ``rocprofiler-systems-user-library`` to another program,
|
||||
use the following CMake and ``g++`` directives.
|
||||
|
||||
CMake
|
||||
@@ -166,19 +167,19 @@ CMake
|
||||
|
||||
.. code-block:: cmake
|
||||
|
||||
find_package(omnitrace REQUIRED COMPONENTS user)
|
||||
find_package(rocprofiler-systems REQUIRED COMPONENTS user)
|
||||
add_executable(foo foo.cpp)
|
||||
target_link_libraries(foo PRIVATE omnitrace::omnitrace-user-library)
|
||||
target_link_libraries(foo PRIVATE rocprofiler-systems::rocprofiler-systems-user-library)
|
||||
|
||||
g++ compilation
|
||||
-------------------------------------------------------
|
||||
|
||||
Assuming Omnitrace is installed in ``/opt/omnitrace``, use the ``g++`` compiler
|
||||
Assuming ROCm Systems Profiler is installed in ``/opt/rocprofiler-systems``, use the ``g++`` compiler
|
||||
to build the application.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
g++ -I/opt/omnitrace foo.cpp -o foo -lomnitrace-user
|
||||
g++ -I/opt/rocprofiler-systems foo.cpp -o foo -lrocprofiler-systems-user
|
||||
|
||||
Output from the API example program
|
||||
========================================
|
||||
@@ -187,19 +188,19 @@ First, instrument and run the program.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omnitrace-instrument -l --min-instructions=8 -E custom_push_region -o -- ./user-api
|
||||
$ rocprof-sys-instrument -l --min-instructions=8 -E custom_push_region -o -- ./user-api
|
||||
...
|
||||
$ omnitrace-run --profile --use-pid off --time-output off -- ./user-api.inst 20 4 100
|
||||
$ rocprof-sys-run --profile --use-pid off --time-output off -- ./user-api.inst 20 4 100
|
||||
Pushing custom region :: ./user-api.inst
|
||||
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Trace
|
||||
[rocprof-sys][rocprofsys_init_tooling] Instrumentation mode: Trace
|
||||
|
||||
|
||||
______ .___ ___. .__ __. __ .___________..______ ___ ______ _______
|
||||
/ __ \ | \/ | | \ | | | | | || _ \ / \ / || ____|
|
||||
| | | | | \ / | | \| | | | `---| |----`| |_) | / ^ \ | ,----'| |__
|
||||
| | | | | |\/| | | . ` | | | | | | / / /_\ \ | | | __|
|
||||
| `--' | | | | | | |\ | | | | | | |\ \----./ _____ \ | `----.| |____
|
||||
\______/ |__| |__| |__| \__| |__| |__| | _| `._____/__/ \__\ \______||_______|
|
||||
__
|
||||
_ __ ___ ___ _ __ _ __ ___ / _| ___ _ _ ___
|
||||
| '__| / _ \ / __| | '_ \ | '__| / _ \ | |_ _____ / __| | | | | / __|
|
||||
| | | (_) | | (__ | |_) | | | | (_) | | _| |_____| \__ \ | |_| | \__ \
|
||||
|_| \___/ \___| | .__/ |_| \___/ |_| |___/ \__, | |___/
|
||||
|_| |___/
|
||||
|
||||
|
||||
|
||||
@@ -215,29 +216,29 @@ First, instrument and run the program.
|
||||
Pushing custom region :: run(20) x 100
|
||||
Pushing custom region :: run(20) x 100
|
||||
[./user-api.inst] fibonacci(20) x 4 = 3382500
|
||||
[omnitrace][86267][0][omnitrace_finalize] finalizing...
|
||||
[rocprof-sys][86267][0][rocprofsys_finalize] finalizing...
|
||||
|
||||
|
||||
[omnitrace][86267][0] omnitrace : 5.190895 sec wall_clock, 2.748 mb peak_rss, 6.330000 sec cpu_clock, 121.9 % cpu_util [laps: 1]
|
||||
[omnitrace][86267][0] user-api.inst/thread-0 : 5.078713 sec wall_clock, 4.722415 sec thread_cpu_clock, 93.0 % thread_cpu_util, 1.276 mb peak_rss [laps: 1]
|
||||
[omnitrace][86267][0] user-api.inst/thread-1 : 0.322248 sec wall_clock, 0.322191 sec thread_cpu_clock, 100.0 % thread_cpu_util, 1.000 mb peak_rss [laps: 1]
|
||||
[omnitrace][86267][0] user-api.inst/thread-2 : 0.323255 sec wall_clock, 0.323194 sec thread_cpu_clock, 100.0 % thread_cpu_util, 0.000 mb peak_rss [laps: 1]
|
||||
[omnitrace][86267][0] user-api.inst/thread-3 : 0.323569 sec wall_clock, 0.323484 sec thread_cpu_clock, 100.0 % thread_cpu_util, 1.092 mb peak_rss [laps: 1]
|
||||
[omnitrace][86267][0] user-api.inst/thread-4 : 0.324178 sec wall_clock, 0.324057 sec thread_cpu_clock, 100.0 % thread_cpu_util, 1.184 mb peak_rss [laps: 1]
|
||||
[omnitrace][86267][0] Post-processing 51 cpu frequency and memory usage entries...
|
||||
[rocprof-sys][86267][0] rocprof-sys : 5.190895 sec wall_clock, 2.748 mb peak_rss, 6.330000 sec cpu_clock, 121.9 % cpu_util [laps: 1]
|
||||
[rocprof-sys][86267][0] user-api.inst/thread-0 : 5.078713 sec wall_clock, 4.722415 sec thread_cpu_clock, 93.0 % thread_cpu_util, 1.276 mb peak_rss [laps: 1]
|
||||
[rocprof-sys][86267][0] user-api.inst/thread-1 : 0.322248 sec wall_clock, 0.322191 sec thread_cpu_clock, 100.0 % thread_cpu_util, 1.000 mb peak_rss [laps: 1]
|
||||
[rocprof-sys][86267][0] user-api.inst/thread-2 : 0.323255 sec wall_clock, 0.323194 sec thread_cpu_clock, 100.0 % thread_cpu_util, 0.000 mb peak_rss [laps: 1]
|
||||
[rocprof-sys][86267][0] user-api.inst/thread-3 : 0.323569 sec wall_clock, 0.323484 sec thread_cpu_clock, 100.0 % thread_cpu_util, 1.092 mb peak_rss [laps: 1]
|
||||
[rocprof-sys][86267][0] user-api.inst/thread-4 : 0.324178 sec wall_clock, 0.324057 sec thread_cpu_clock, 100.0 % thread_cpu_util, 1.184 mb peak_rss [laps: 1]
|
||||
[rocprof-sys][86267][0] Post-processing 51 cpu frequency and memory usage entries...
|
||||
|
||||
[omnitrace][wall_clock]|0> Outputting 'omnitrace-user-api.inst-output/wall_clock.json'...
|
||||
[omnitrace][wall_clock]|0> Outputting 'omnitrace-user-api.inst-output/wall_clock.tree.json'...
|
||||
[omnitrace][wall_clock]|0> Outputting 'omnitrace-user-api.inst-output/wall_clock.txt'...
|
||||
[rocprof-sys][wall_clock]|0> Outputting 'rocprof-sys-user-api.inst-output/wall_clock.json'...
|
||||
[rocprof-sys][wall_clock]|0> Outputting 'rocprof-sys-user-api.inst-output/wall_clock.tree.json'...
|
||||
[rocprof-sys][wall_clock]|0> Outputting 'rocprof-sys-user-api.inst-output/wall_clock.txt'...
|
||||
|
||||
[omnitrace][manager::finalize][metadata]> Outputting 'omnitrace-user-api.inst-output/metadata.json' and 'omnitrace-user-api.inst-output/functions.json'...
|
||||
[omnitrace][86267][0][omnitrace_finalize] Finalized
|
||||
[rocprof-sys][manager::finalize][metadata]> Outputting 'rocprof-sys-user-api.inst-output/metadata.json' and 'rocprof-sys-user-api.inst-output/functions.json'...
|
||||
[rocprof-sys][86267][0][rocprofsys_finalize] Finalized
|
||||
|
||||
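The per-thread summary lines in the log above follow a regular pattern, so they can be extracted with a few lines of Python. This sketch assumes the ``[rocprof-sys][PID][RANK] label : N sec wall_clock, ...`` layout shown in this particular log. The ``wall_clock_by_label`` helper is hypothetical, and the pattern might need adjusting for other runs.

.. code-block:: python

   import re

   _SUMMARY = re.compile(
       r"^\[rocprof-sys\]\[(?P<pid>\d+)\]\[(?P<rank>\d+)\]\s+"
       r"(?P<label>\S+)\s+:\s+(?P<wall>\d+(?:\.\d+)?) sec wall_clock"
   )


   def wall_clock_by_label(log_text):
       """Map each summary label to its wall-clock time in seconds."""
       result = {}
       for line in log_text.splitlines():
           match = _SUMMARY.match(line.strip())
           if match:
               result[match["label"]] = float(match["wall"])
       return result


   if __name__ == "__main__":
       log = "\n".join([
           "[rocprof-sys][86267][0][rocprofsys_finalize] finalizing...",
           "[rocprof-sys][86267][0] user-api.inst/thread-1 : 0.322248 sec "
           "wall_clock, 0.322191 sec thread_cpu_clock",
           "[rocprof-sys][86267][0] Post-processing 51 cpu frequency and "
           "memory usage entries...",
       ])
       print(wall_clock_by_label(log))

Lines that do not match the summary layout, such as the ``finalizing...`` and ``Post-processing`` messages, are skipped.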
Then review the output.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ cat omnitrace-example-output/wall_clock.txt
|
||||
$ cat rocprof-sys-example-output/wall_clock.txt
|
||||
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| REAL-CLOCK TIMER (I.E. WALL-CLOCK TIMER) |
|
||||
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
@@ -1,17 +1,17 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
:description: ROCm Systems Profiler documentation and reference
|
||||
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
***********************************
|
||||
Omnitrace documentation
|
||||
ROCm Systems Profiler documentation
|
||||
***********************************
|
||||
|
||||
Omnitrace is designed for the high-level profiling and comprehensive tracing
|
||||
ROCm Systems Profiler, formerly known as "Omnitrace", is designed for the high-level profiling and comprehensive tracing
|
||||
of applications running on the CPU or the CPU and GPU. It supports dynamic binary
|
||||
instrumentation, call-stack sampling, and various other features for determining
|
||||
which function and line number are currently executing. To learn more, see :doc:`what-is-omnitrace`
|
||||
which function and line number are currently executing. To learn more, see :doc:`what-is-rocprof-sys`
|
||||
|
||||
The code is open and hosted at `<https://github.com/ROCm/omnitrace>`_.
|
||||
The code is open and hosted at `<https://github.com/ROCm/rocprofiler-systems>`_.
|
||||
|
||||
|
||||
.. grid:: 2
|
||||
@@ -20,7 +20,7 @@ The code is open and hosted at `<https://github.com/ROCm/omnitrace>`_.
|
||||
.. grid-item-card:: Install
|
||||
|
||||
* :doc:`Quick start <./install/quick-start>`
|
||||
* :doc:`Omnitrace installation <./install/install>`
|
||||
* :doc:`ROCm Systems Profiler installation <./install/install>`
|
||||
|
||||
|
||||
The documentation is structured as follows:
|
||||
@@ -30,31 +30,30 @@ The documentation is structured as follows:
|
||||
|
||||
.. grid-item-card:: Tutorials
|
||||
|
||||
* `GitHub examples <https://github.com/ROCm/omnitrace/tree/amd-mainline/examples>`_
|
||||
* `GitHub examples <https://github.com/ROCm/rocprofiler-systems/tree/amd-mainline/examples>`_
|
||||
* :doc:`Video tutorials <./tutorials/video-tutorials>`
|
||||
|
||||
.. grid-item-card:: How to
|
||||
|
||||
* :doc:`Configuring and validating the Omnitrace environment <./how-to/configuring-validating-environment>`
|
||||
* :doc:`Configuring and validating the ROCm Systems Profiler environment <./how-to/configuring-validating-environment>`
|
||||
* :doc:`Configuring runtime options <./how-to/configuring-runtime-options>`
|
||||
* :doc:`Sampling the call stack <./how-to/sampling-call-stack>`
|
||||
* :doc:`Instrumenting and rewriting a binary application <./how-to/instrumenting-rewriting-binary-application>`
|
||||
* :doc:`Performing causal profiling <./how-to/performing-causal-profiling>`
|
||||
* :doc:`Understanding the Omnitrace output <./how-to/understanding-omnitrace-output>`
|
||||
* :doc:`Understanding the ROCm Systems Profiler output <./how-to/understanding-rocprof-sys-output>`
|
||||
* :doc:`Profiling Python scripts <./how-to/profiling-python-scripts>`
|
||||
* :doc:`Using the Omnitrace API <./how-to/using-omnitrace-api>`
|
||||
* :doc:`General tips for using Omnitrace <./how-to/general-tips-using-omnitrace>`
|
||||
|
||||
* :doc:`Using the ROCm Systems Profiler API <./how-to/using-rocprof-sys-api>`
|
||||
* :doc:`General tips for using ROCm Systems Profiler <./how-to/general-tips-using-rocprof-sys>`
|
||||
|
||||
.. grid-item-card:: Conceptual
|
||||
|
||||
* :doc:`Data collection modes <./conceptual/data-collection-modes>`
|
||||
* :doc:`The Omnitrace feature set <./conceptual/omnitrace-feature-set>`
|
||||
|
||||
* :doc:`The ROCm Systems Profiler feature set <./conceptual/rocprof-sys-feature-set>`
|
||||
|
||||
.. grid-item-card:: Reference
|
||||
|
||||
* :doc:`Development guide <./reference/development-guide>`
|
||||
* :doc:`Omnitrace glossary <./reference/omnitrace-glossary>`
|
||||
* :doc:`ROCm Systems Profiler glossary <./reference/rocprof-sys-glossary>`
|
||||
* :doc:`API library <./doxygen/html/files>`
|
||||
* :doc:`Class member functions <./doxygen/html/functions>`
|
||||
* :doc:`Globals <./doxygen/html/globals>`
|
||||
|
||||
@@ -1,38 +1,41 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
:description: ROCm Systems Profiler installation documentation and reference
|
||||
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, installation, installer, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
*************************************
|
||||
Omnitrace installation
|
||||
ROCm Systems Profiler installation
|
||||
*************************************
|
||||
|
||||
The following information builds on the guidelines in the :doc:`Quick start <./quick-start>` guide.
|
||||
It covers how to install `Omnitrace <https://github.com/ROCm/omnitrace>`_ from source or a binary distribution,
|
||||
as well as the :ref:`post-installation-steps`.
|
||||
It covers how to install `ROCm Systems Profiler <https://github.com/ROCm/rocprofiler-systems>`_ from
|
||||
source or a binary distribution, as well as the :ref:`post-installation-steps`.
|
||||
|
||||
If you have problems using Omnitrace after installation,
|
||||
If you have problems using ROCm Systems Profiler after installation,
|
||||
consult the :ref:`post-installation-troubleshooting` section.
|
||||
|
||||
Release links
|
||||
========================================
|
||||
|
||||
To review and install either the current Omnitrace release or earlier releases, use these links:
|
||||
To review and install either the current ROCm Systems Profiler release or earlier releases, use these links:
|
||||
|
||||
* Latest Omnitrace Release: `<https://github.com/ROCm/omnitrace/releases/latest>`_
|
||||
* All Omnitrace Releases: `<https://github.com/ROCm/omnitrace/releases>`_
|
||||
* Latest ROCm Systems Profiler Release: `<https://github.com/ROCm/rocprofiler-systems/releases/latest>`_
|
||||
* All ROCm Systems Profiler Releases: `<https://github.com/ROCm/rocprofiler-systems/releases>`_
|
||||
|
||||
Operating system support
|
||||
========================================
|
||||
|
||||
Omnitrace is only supported on Linux. The following distributions are tested in the Omnitrace GitHub workflows:
|
||||
ROCm Systems Profiler is only supported on Linux. The following distributions are tested in the ROCm Systems Profiler GitHub workflows:
|
||||
|
||||
* Ubuntu 20.04
|
||||
* Ubuntu 22.04
|
||||
* OpenSUSE 15.3
|
||||
* OpenSUSE 15.4
|
||||
* Red Hat 8.7
|
||||
* Red Hat 9.0
|
||||
* Red Hat 9.1
|
||||
* OpenSUSE 15.5
|
||||
* OpenSUSE 15.6
|
||||
* Red Hat 8.8
|
||||
* Red Hat 8.9
|
||||
* Red Hat 8.10
|
||||
* Red Hat 9.2
|
||||
* Red Hat 9.3
|
||||
* Red Hat 9.4
|
||||
|
||||
Other OS distributions might function but are not supported or tested.
|
||||
|
||||
@@ -61,58 +64,58 @@ Architecture
|
||||
========================================
|
||||
|
||||
With regard to instrumentation, at present only AMD64 (x86_64) architectures are tested. However,
|
||||
Dyninst supports several more architectures and Omnitrace instrumentation may support other
|
||||
Dyninst supports several more architectures and ROCm Systems Profiler instrumentation may support other
|
||||
CPU architectures such as aarch64 and ppc64.
|
||||
Other modes of use, such as sampling and causal profiling, are not dependent on Dyninst and therefore
|
||||
might be more portable.
|
||||
|
||||
Installing Omnitrace from binary distributions
|
||||
Installing ROCm Systems Profiler from binary distributions
|
||||
================================================
|
||||
|
||||
Every Omnitrace release provides binary installer scripts of the form:
|
||||
Every ROCm Systems Profiler release provides binary installer scripts of the form:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-{VERSION}-{OS_DISTRIB}-{OS_VERSION}[-ROCm-{ROCM_VERSION}[-{EXTRA}]].sh
|
||||
rocprof-sys-{VERSION}-{OS_DISTRIB}-{OS_VERSION}[-ROCm-{ROCM_VERSION}[-{EXTRA}]].sh
|
||||
|
||||
For example,
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-1.0.0-ubuntu-18.04-OMPT-PAPI-Python3.sh
|
||||
omnitrace-1.0.0-ubuntu-18.04-ROCm-405000-OMPT-PAPI-Python3.sh
|
||||
rocprof-sys-1.0.0-ubuntu-18.04-OMPT-PAPI-Python3.sh
|
||||
rocprof-sys-1.0.0-ubuntu-18.04-ROCm-405000-OMPT-PAPI-Python3.sh
|
||||
...
|
||||
omnitrace-1.0.0-ubuntu-20.04-ROCm-50000-OMPT-PAPI-Python3.sh
|
||||
rocprof-sys-1.0.0-ubuntu-20.04-ROCm-50000-OMPT-PAPI-Python3.sh
|
||||
|
||||
Any of the ``EXTRA`` fields with a CMake build option
|
||||
(for example, PAPI, as referenced in a following section) or
|
||||
with no link requirements (such as OMPT) have
|
||||
self-contained support for these packages.
|
||||
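As a rough illustration of the naming scheme above, the following sketch assembles an installer file name from its parts. The values are placeholders taken from the examples in this section. Appending the optional ROCm and ``EXTRA`` fields independently of each other is an assumption based on those examples, not a documented rule.

```shell
# Placeholder values taken from the examples above; adjust to the release you need.
VERSION="1.0.0"
OS_DISTRIB="ubuntu"
OS_VERSION="20.04"
ROCM_VERSION="50000"          # leave empty for a build without ROCm support
EXTRA="OMPT-PAPI-Python3"     # leave empty if the build has no extra fields

# Mandatory part of the name.
SCRIPT="rocprof-sys-${VERSION}-${OS_DISTRIB}-${OS_VERSION}"

# Optional fields are appended only when they are set (assumed behavior).
if [ -n "${ROCM_VERSION}" ]; then
    SCRIPT="${SCRIPT}-ROCm-${ROCM_VERSION}"
fi
if [ -n "${EXTRA}" ]; then
    SCRIPT="${SCRIPT}-${EXTRA}"
fi
SCRIPT="${SCRIPT}.sh"

printf '%s\n' "${SCRIPT}"
```

The resulting name is the kind of value that stands in for ``<SCRIPT>`` in a release download URL.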
|
||||
To install Omnitrace using a binary installer script, follow these steps:
|
||||
To install ROCm Systems Profiler using a binary installer script, follow these steps:
|
||||
|
||||
#. Download the appropriate binary distribution
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
wget https://github.com/ROCm/omnitrace/releases/download/v<VERSION>/<SCRIPT>
|
||||
wget https://github.com/ROCm/rocprofiler-systems/releases/download/v<VERSION>/<SCRIPT>
|
||||
|
||||
#. Create the target installation directory
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
mkdir /opt/omnitrace
|
||||
mkdir /opt/rocprofiler-systems
|
||||
|
||||
#. Run the installer script
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
./omnitrace-1.0.0-ubuntu-18.04-ROCm-405000-OMPT-PAPI.sh --prefix=/opt/omnitrace --exclude-subdir
|
||||
./rocprofiler-systems-1.0.0-ubuntu-18.04-ROCm-405000-OMPT-PAPI.sh --prefix=/opt/rocprofiler-systems --exclude-subdir
|
||||
|
||||
Installing Omnitrace from source
|
||||
Installing ROCm Systems Profiler from source
|
||||
========================================
|
||||
|
||||
Omnitrace needs a GCC compiler with full support for C++17 and CMake v3.16 or higher.
|
||||
ROCm Systems Profiler needs a GCC compiler with full support for C++17 and CMake v3.16 or higher.
|
||||
The Clang compiler may be used in lieu of the GCC compiler if `Dyninst <https://github.com/dyninst/dyninst>`_
|
||||
is already installed.
|
||||
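Before configuring a source build, it can help to confirm that the installed CMake meets the v3.16 minimum. The sketch below is one possible check and is not part of the project: the ``version_ge`` helper is a hypothetical name, and it assumes a ``sort`` that supports version sorting with ``-V`` (as GNU coreutils does).

```shell
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Relies on `sort -V` (version sort); the smaller version sorts first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

REQUIRED_CMAKE="3.16"

if command -v cmake >/dev/null 2>&1; then
    # The first line of the output has the form "cmake version X.Y.Z".
    FOUND_CMAKE="$(cmake --version | head -n 1 | awk '{print $3}')"
    if version_ge "${FOUND_CMAKE}" "${REQUIRED_CMAKE}"; then
        echo "CMake ${FOUND_CMAKE} is new enough"
    else
        echo "CMake ${FOUND_CMAKE} is older than ${REQUIRED_CMAKE}"
    fi
else
    echo "cmake was not found on PATH"
fi
```

The same helper could be reused for the GCC requirement by comparing the output of ``gcc -dumpversion`` against ``7``.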
|
||||
@@ -122,7 +125,7 @@ Build requirements
|
||||
* GCC compiler v7+
|
||||
|
||||
* Older GCC compilers may be supported but are not tested
|
||||
* Clang compilers are generally supported for Omnitrace but not Dyninst
|
||||
* Clang compilers are generally supported for ROCm Systems Profiler but not Dyninst
|
||||
|
||||
* `CMake <https://cmake.org/>`_ v3.16+
|
||||
|
||||
@@ -151,16 +154,16 @@ Required third-party packages
|
||||
* `libunwind <https://www.nongnu.org/libunwind/>`_ for call-stack sampling
|
||||
|
||||
Any of the third-party packages required by Dyninst, along with Dyninst itself, can be built and installed
|
||||
during the Omnitrace build. The following list indicates the package, the version,
|
||||
the application that requires the package (for example, Omnitrace requires Dyninst
|
||||
while Dyninst requires TBB), and the CMake option to build the package alongside Omnitrace:
|
||||
during the ROCm Systems Profiler build. The following list indicates the package, the version,
|
||||
the application that requires the package (for example, ROCm Systems Profiler requires Dyninst
|
||||
while Dyninst requires TBB), and the CMake option to build the package alongside ROCm Systems Profiler:
|
||||
|
||||
.. csv-table::
|
||||
:header: "Third-Party Library", "Minimum Version", "Required By", "CMake Option"
|
||||
:widths: 15, 10, 12, 40
|
||||
|
||||
"Dyninst", "12.0", "Omnitrace", "``OMNITRACE_BUILD_DYNINST`` (default: OFF)"
|
||||
"Libunwind", "", "Omnitrace", "``OMNITRACE_BUILD_LIBUNWIND`` (default: ON)"
|
||||
"Dyninst", "12.0", "ROCm Systems Profiler", "``ROCPROFSYS_BUILD_DYNINST`` (default: OFF)"
|
||||
"Libunwind", "", "ROCm Systems Profiler", "``ROCPROFSYS_BUILD_LIBUNWIND`` (default: ON)"
|
||||
"TBB", "2018.6", "Dyninst", "``DYNINST_BUILD_TBB`` (default: OFF)"
|
||||
"ElfUtils", "0.178", "Dyninst", "``DYNINST_BUILD_ELFUTILS`` (default: OFF)"
|
||||
"LibIberty", "", "Dyninst", "``DYNINST_BUILD_LIBIBERTY`` (default: OFF)"
|
||||
@@ -180,9 +183,9 @@ Optional third-party packages
|
||||
* `PAPI <https://icl.utk.edu/papi/>`_
|
||||
* MPI
|
||||
|
||||
* ``OMNITRACE_USE_MPI`` enables full MPI support
|
||||
* ``OMNITRACE_USE_MPI_HEADERS`` enables wrapping of the dynamically-linked MPI C function calls.
|
||||
(By default, if Omnitrace cannot find an OpenMPI MPI distribution, it uses a local copy
|
||||
* ``ROCPROFSYS_USE_MPI`` enables full MPI support
|
||||
* ``ROCPROFSYS_USE_MPI_HEADERS`` enables wrapping of the dynamically-linked MPI C function calls.
|
||||
(By default, if ROCm Systems Profiler cannot find an OpenMPI MPI distribution, it uses a local copy
|
||||
of the OpenMPI ``mpi.h``.)
|
||||
|
||||
* Several optional third-party profiling tools supported by Timemory
|
||||
@@ -192,19 +195,19 @@ Optional third-party packages
|
||||
:header: "Third-Party Library", "CMake Enable Option", "CMake Build Option"
|
||||
:widths: 15, 45, 40
|
||||
|
||||
"PAPI", "``OMNITRACE_USE_PAPI`` (default: ON)", "``OMNITRACE_BUILD_PAPI`` (default: ON)"
|
||||
"MPI", "``OMNITRACE_USE_MPI`` (default: OFF)", ""
|
||||
"MPI (header-only)", "``OMNITRACE_USE_MPI_HEADERS`` (default: ON)", ""
|
||||
"PAPI", "``ROCPROFSYS_USE_PAPI`` (default: ON)", "``ROCPROFSYS_BUILD_PAPI`` (default: ON)"
|
||||
"MPI", "``ROCPROFSYS_USE_MPI`` (default: OFF)", ""
|
||||
"MPI (header-only)", "``ROCPROFSYS_USE_MPI_HEADERS`` (default: ON)", ""
|
||||
|
||||
Installing Dyninst
|
||||
-----------------------------------
|
||||
|
||||
The easiest way to install Dyninst is alongside Omnitrace, but it can also be installed using Spack.
|
||||
The easiest way to install Dyninst is alongside ROCm Systems Profiler, but it can also be installed using Spack.
|
||||
|
||||
Building Dyninst alongside Omnitrace
|
||||
Building Dyninst alongside ROCm Systems Profiler
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
To install Dyninst alongside Omnitrace, configure Omnitrace with ``OMNITRACE_BUILD_DYNINST=ON``.
|
||||
To install Dyninst alongside ROCm Systems Profiler, configure ROCm Systems Profiler with ``ROCPROFSYS_BUILD_DYNINST=ON``.
|
||||
Depending on the version of Ubuntu, the ``apt`` package manager might have current enough
|
||||
versions of the Dyninst Boost, TBB, and LibIberty dependencies
|
||||
(use ``apt-get install libtbb-dev libiberty-dev libboost-dev``).
|
||||
@@ -213,8 +216,8 @@ its dependencies via ``DYNINST_BUILD_<DEP>=ON``, as follows:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
git clone https://github.com/ROCm/omnitrace.git omnitrace-source
|
||||
cmake -B omnitrace-build -DOMNITRACE_BUILD_DYNINST=ON -DDYNINST_BUILD_{TBB,ELFUTILS,BOOST,LIBIBERTY}=ON omnitrace-source
|
||||
git clone https://github.com/ROCm/rocprofiler-systems.git rocprof-sys-source
|
||||
cmake -B rocprof-sys-build -DROCPROFSYS_BUILD_DYNINST=ON -DDYNINST_BUILD_{TBB,ELFUTILS,BOOST,LIBIBERTY}=ON rocprof-sys-source
|
||||
|
||||
where ``-DDYNINST_BUILD_{TBB,BOOST,ELFUTILS,LIBIBERTY}=ON`` is expanded by
|
||||
the shell to ``-DDYNINST_BUILD_TBB=ON -DDYNINST_BUILD_BOOST=ON ...``
|
||||
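To see the expansion without running CMake, the brace expression can be printed directly. Brace expansion is a bash feature and is not guaranteed in other POSIX shells, so this sketch invokes ``bash`` explicitly. The ``EXPANDED`` variable name is only for illustration.

```shell
# Let bash expand the braces; printf prints each resulting argument on its own line.
EXPANDED="$(bash -c 'printf "%s\n" -DDYNINST_BUILD_{TBB,BOOST,ELFUTILS,LIBIBERTY}=ON')"

printf '%s\n' "${EXPANDED}"
```

If the configure command is run from a shell without brace expansion, the four ``-DDYNINST_BUILD_<DEP>=ON`` options need to be written out individually.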
@@ -234,18 +237,18 @@ Installing Dyninst via Spack
|
||||
spack install --reuse dyninst
|
||||
spack load -r dyninst
|
||||
|
||||
Installing Omnitrace
|
||||
Installing ROCm Systems Profiler
|
||||
-----------------------------------
|
||||
|
||||
Omnitrace has CMake configuration options for MPI support (``OMNITRACE_USE_MPI`` or
|
||||
``OMNITRACE_USE_MPI_HEADERS``), HIP kernel tracing (``OMNITRACE_USE_ROCTRACER``),
|
||||
ROCm device sampling (``OMNITRACE_USE_ROCM_SMI``), OpenMP-Tools (``OMNITRACE_USE_OMPT``),
|
||||
hardware counters via PAPI (``OMNITRACE_USE_PAPI``), among other features.
|
||||
ROCm Systems Profiler has CMake configuration options for MPI support (``ROCPROFSYS_USE_MPI`` or
|
||||
``ROCPROFSYS_USE_MPI_HEADERS``), HIP kernel tracing (``ROCPROFSYS_USE_ROCTRACER``),
|
||||
ROCm device sampling (``ROCPROFSYS_USE_ROCM_SMI``), OpenMP-Tools (``ROCPROFSYS_USE_OMPT``),
|
||||
hardware counters via PAPI (``ROCPROFSYS_USE_PAPI``), among other features.
|
||||
Various additional features can be enabled via the
|
||||
``TIMEMORY_USE_*`` `CMake options <https://timemory.readthedocs.io/en/develop/installation.html#cmake-options>`_.
|
||||
Any ``OMNITRACE_USE_<VAL>`` option which has a corresponding ``TIMEMORY_USE_<VAL>``
|
||||
Any ``ROCPROFSYS_USE_<VAL>`` option which has a corresponding ``TIMEMORY_USE_<VAL>``
|
||||
option means that the Timemory support for this feature has been integrated
|
||||
into Perfetto support for Omnitrace, for example, ``OMNITRACE_USE_PAPI=<VAL>`` also configures
|
||||
into Perfetto support for ROCm Systems Profiler, for example, ``ROCPROFSYS_USE_PAPI=<VAL>`` also configures
|
||||
``TIMEMORY_USE_PAPI=<VAL>``. This means the data that Timemory is able to collect via this package
|
||||
is passed along to Perfetto and is displayed when the ``.proto`` file is visualized
|
||||
in `the Perfetto UI <https://ui.perfetto.dev>`_.
|
||||
@@ -257,39 +260,39 @@ in `the Perfetto UI <https://ui.perfetto.dev>`_.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
git clone https://github.com/ROCm/omnitrace.git omnitrace-source
|
||||
git clone https://github.com/ROCm/rocprofiler-systems.git rocprof-sys-source
|
||||
cmake \
|
||||
-B omnitrace-build \
|
||||
-D CMAKE_INSTALL_PREFIX=/opt/omnitrace \
|
||||
-D OMNITRACE_USE_HIP=ON \
|
||||
-D OMNITRACE_USE_ROCM_SMI=ON \
|
||||
-D OMNITRACE_USE_ROCTRACER=ON \
|
||||
-D OMNITRACE_USE_PYTHON=ON \
|
||||
-D OMNITRACE_USE_OMPT=ON \
|
||||
-D OMNITRACE_USE_MPI_HEADERS=ON \
|
||||
-D OMNITRACE_BUILD_PAPI=ON \
|
||||
-D OMNITRACE_BUILD_LIBUNWIND=ON \
|
||||
-D OMNITRACE_BUILD_DYNINST=ON \
|
||||
-B rocprof-sys-build \
|
||||
-D CMAKE_INSTALL_PREFIX=/opt/rocprofiler-systems \
|
||||
-D ROCPROFSYS_USE_HIP=ON \
|
||||
-D ROCPROFSYS_USE_ROCM_SMI=ON \
|
||||
-D ROCPROFSYS_USE_ROCTRACER=ON \
|
||||
-D ROCPROFSYS_USE_PYTHON=ON \
|
||||
-D ROCPROFSYS_USE_OMPT=ON \
|
||||
-D ROCPROFSYS_USE_MPI_HEADERS=ON \
|
||||
-D ROCPROFSYS_BUILD_PAPI=ON \
|
||||
-D ROCPROFSYS_BUILD_LIBUNWIND=ON \
|
||||
-D ROCPROFSYS_BUILD_DYNINST=ON \
|
||||
-D DYNINST_BUILD_TBB=ON \
|
||||
-D DYNINST_BUILD_BOOST=ON \
|
||||
-D DYNINST_BUILD_ELFUTILS=ON \
|
||||
-D DYNINST_BUILD_LIBIBERTY=ON \
|
||||
omnitrace-source
|
||||
cmake --build omnitrace-build --target all --parallel 8
|
||||
cmake --build omnitrace-build --target install
|
||||
source /opt/omnitrace/share/omnitrace/setup-env.sh
|
||||
rocprof-sys-source
|
||||
cmake --build rocprof-sys-build --target all --parallel 8
|
||||
cmake --build rocprof-sys-build --target install
|
||||
source /opt/rocprofiler-systems/share/rocprofiler-systems/setup-env.sh
|
||||
|
||||
.. _mpi-support-omnitrace:
|
||||
.. _mpi-support-rocprof-sys:
|
||||
|
||||
MPI support within Omnitrace
|
||||
MPI support within ROCm Systems Profiler
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Omnitrace can have full (``OMNITRACE_USE_MPI=ON``) or partial (``OMNITRACE_USE_MPI_HEADERS=ON``) MPI support.
|
||||
ROCm Systems Profiler can have full (``ROCPROFSYS_USE_MPI=ON``) or partial (``ROCPROFSYS_USE_MPI_HEADERS=ON``) MPI support.
|
||||
The only difference between these two modes is whether or not the results collected
|
||||
via Timemory and/or Perfetto can be aggregated into a single
|
||||
output file during finalization. When full MPI support is enabled, combining the
|
||||
Timemory results always occurs, whereas combining the Perfetto
|
||||
results is configurable via the ``OMNITRACE_PERFETTO_COMBINE_TRACES`` setting.
|
||||
results is configurable via the ``ROCPROFSYS_PERFETTO_COMBINE_TRACES`` setting.
|
||||
|
||||
The primary benefits of partial or full MPI support are the automatic wrapping
|
||||
of MPI functions and the ability
|
||||
@@ -298,13 +301,13 @@ instead of having to use the system process identifier (i.e. ``PID``).
|
||||
In general, it's recommended to use partial MPI support with the OpenMPI
|
||||
headers as this is the most portable configuration.
|
||||
If full MPI support is selected, make sure your target application is built
|
||||
against the same MPI distribution as Omnitrace.
|
||||
For example, do not build Omnitrace with MPICH and use it on a target application built against OpenMPI.
|
||||
against the same MPI distribution as ROCm Systems Profiler.
|
||||
For example, do not build ROCm Systems Profiler with MPICH and use it on a target application built against OpenMPI.
|
||||
If partial support is selected, the OpenMPI headers are recommended instead of the MPICH headers
|
||||
because the ``MPI_COMM_WORLD`` in OpenMPI is a pointer to ``ompi_communicator_t`` (8 bytes),
|
||||
whereas ``MPI_COMM_WORLD`` in MPICH is an ``int`` (4 bytes). Building Omnitrace with partial MPI support
|
||||
whereas ``MPI_COMM_WORLD`` in MPICH is an ``int`` (4 bytes). Building ROCm Systems Profiler with partial MPI support
|
||||
and the MPICH headers and then using
|
||||
Omnitrace on an application built against OpenMPI causes a segmentation fault.
|
||||
ROCm Systems Profiler on an application built against OpenMPI causes a segmentation fault.
|
||||
This happens because the value of the ``MPI_COMM_WORLD`` is truncated
|
||||
during the function wrapping before being passed along to the underlying MPI function.
|
||||
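The truncation described above can be illustrated with plain shell arithmetic. This is not how the function wrappers are implemented. It only shows what happens to an 8-byte value when just its low 4 bytes survive. The pointer value is hypothetical and chosen for readability.

```shell
# Hypothetical 8-byte pointer value, chosen only for illustration.
HANDLE=$(( 0x00007f3a12345678 ))

# Keeping only the low 32 bits mimics storing the pointer in a 4-byte int.
TRUNCATED=$(( ${HANDLE} & 0xffffffff ))

printf 'original:  0x%016x\n' "${HANDLE}"
printf 'truncated: 0x%016x\n' "${TRUNCATED}"
```

The upper half of the address is lost, so the truncated value no longer refers to the communicator object, which is consistent with the segmentation fault described above.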
|
||||
@@ -313,8 +316,8 @@ during the function wrapping before being passed along to the underlying MPI fun
|
||||
Post-installation steps
|
||||
========================================
|
||||
|
||||
After installation, you can optionally configure the Omnitrace environment.
|
||||
You should also test the executables to confirm Omnitrace is correctly installed.
|
||||
After installation, you can optionally configure the ROCm Systems Profiler environment.
|
||||
You should also test the executables to confirm ROCm Systems Profiler is correctly installed.
|
||||
|
||||
Configure the environment
|
||||
-----------------------------------
|
||||
@@ -323,14 +326,14 @@ If environment modules are available and preferred, add them using these command
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
module use /opt/omnitrace/share/modulefiles
|
||||
module load omnitrace/1.0.0
|
||||
module use /opt/rocprofiler-systems/share/modulefiles
|
||||
module load rocprofiler-systems/1.0.0
|
||||
|
||||
Alternatively, you can directly source the ``setup-env.sh`` script:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
source /opt/omnitrace/share/omnitrace/setup-env.sh
|
||||
source /opt/rocprofiler-systems/share/rocprofiler-systems/setup-env.sh
|
||||
|
||||
Test the executables
|
||||
-----------------------------------
|
||||
@@ -340,8 +343,8 @@ issues locating the installed libraries:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
omnitrace-instrument --help
|
||||
omnitrace-avail --help
|
||||
rocprof-sys-instrument --help
|
||||
rocprof-sys-avail --help
|
||||
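If either command above fails with "command not found", a quick first check is whether the executables are on ``PATH`` at all. The following sketch uses a hypothetical helper named ``check_tools`` and only reports what the shell can locate. It does not verify that the installed libraries load correctly.

```shell
# Report whether each executable can be found on PATH.
check_tools() {
    for tool in rocprof-sys-instrument rocprof-sys-avail; do
        if command -v "${tool}" >/dev/null 2>&1; then
            echo "found: ${tool}"
        else
            echo "missing: ${tool}"
        fi
    done
}

check_tools
```

A "missing" result usually means the environment has not been configured yet, for example because ``setup-env.sh`` was not sourced in the current shell.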
|
||||
.. note::
|
||||
|
||||
@@ -353,27 +356,27 @@ issues locating the installed libraries:
|
||||
Post-installation troubleshooting
|
||||
========================================
|
||||
|
||||
This section explains how to resolve certain issues that might happen when you first use Omnitrace.
|
||||
This section explains how to resolve certain issues that might happen when you first use ROCm Systems Profiler.
|
||||
|
||||
Issues with RHEL and SELinux
|
||||
----------------------------------------------------
|
||||
|
||||
RHEL (Red Hat Enterprise Linux) and related distributions of Linux automatically enable a security feature
|
||||
named SELinux (Security-Enhanced Linux) that prevents Omnitrace from running.
|
||||
named SELinux (Security-Enhanced Linux) that prevents ROCm Systems Profiler from running.
|
||||
This issue applies to any Linux distribution with SELinux installed, including RHEL,
|
||||
CentOS, Fedora, and Rocky Linux. The problem can happen with any GPU, or even without a GPU.
|
||||
|
||||
The problem occurs after you instrument a program and try to
|
||||
run ``omnitrace-run`` with the instrumented program.
|
||||
run ``rocprof-sys-run`` with the instrumented program.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
g++ hello.cpp -o hello
|
||||
omniperf-instrument -M sampling -o hello.instr -- ./hello
|
||||
omnitrace-run -- ./hello.instr
|
||||
rocprof-sys-instrument -M sampling -o hello.instr -- ./hello
|
||||
rocprof-sys-run -- ./hello.instr
|
||||
|
||||
Instead of successfully running the binary with call-stack sampling,
|
||||
Omnitrace crashes with a segmentation fault.
|
||||
ROCm Systems Profiler crashes with a segmentation fault.
|
||||
|
||||
.. note::
|
||||
|
||||
@@ -412,4 +415,4 @@ Configuring PAPI to collect hardware counters
|
||||
|
||||
To use PAPI to collect the majority of hardware counters, ensure
|
||||
the ``/proc/sys/kernel/perf_event_paranoid`` setting has a value less than or equal to ``2``.
|
||||
For more information, see the :ref:`omnitrace_papi_events` section.
|
||||
For more information, see the :ref:`rocprof-sys_papi_events` section.
|
||||
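The current setting can be inspected with a short script such as the one below. It is a minimal sketch: the ``paranoid_ok`` helper is a hypothetical name, and the threshold of ``2`` is taken from the text above.

```shell
# paranoid_ok VALUE: succeeds when VALUE is less than or equal to 2.
paranoid_ok() {
    [ "$1" -le 2 ]
}

PARANOID_FILE="/proc/sys/kernel/perf_event_paranoid"

if [ -r "${PARANOID_FILE}" ]; then
    CURRENT="$(cat "${PARANOID_FILE}")"
    if paranoid_ok "${CURRENT}"; then
        echo "perf_event_paranoid=${CURRENT}: OK"
    else
        echo "perf_event_paranoid=${CURRENT}: too restrictive"
    fi
else
    echo "${PARANOID_FILE} is not readable on this system"
fi
```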
@@ -1,21 +1,22 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
:description: ROCm Systems Profiler quick start documentation and reference
|
||||
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, profiler, quick start, getting started, quick install, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
*************************************
|
||||
Omnitrace quick start
|
||||
ROCm Systems Profiler quick start
|
||||
*************************************
|
||||
|
||||
To install Omnitrace, download the `Omnitrace installer <https://github.com/ROCm/omnitrace/releases/latest/download/omnitrace-install.py>`_
|
||||
and specify ``--prefix <install-directory>``. The script attempts to auto-detect
|
||||
the appropriate OS distribution and version. To include AMD ROCm Software support,
|
||||
To install ROCm Systems Profiler, download the
|
||||
`ROCm Systems Profiler installer <https://github.com/ROCm/rocprofiler-systems/releases/latest/download/rocprofiler-systems-install.py>`_
|
||||
and specify ``--prefix <install-directory>``. The script attempts to auto-detect
|
||||
the appropriate OS distribution and version. To include AMD ROCm Software support,
|
||||
specify ``--rocm X.Y``, where ``X`` is the ROCm major
|
||||
version and ``Y`` is the ROCm minor version, for example, ``--rocm 6.2``.
|
||||
version and ``Y`` is the ROCm minor version, for example, ``--rocm 6.3``.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
wget https://github.com/ROCm/omnitrace/releases/latest/download/omnitrace-install.py
|
||||
python3 ./omnitrace-install.py --prefix /opt/omnitrace --rocm 6.2
|
||||
wget https://github.com/ROCm/rocprofiler-systems/releases/latest/download/rocprofiler-systems-install.py
|
||||
python3 ./rocprofiler-systems-install.py --prefix /opt/rocprofiler-systems --rocm 6.3
|
||||
|
||||
This script supports installation on Ubuntu, OpenSUSE, Red Hat, Debian, CentOS, and Fedora.
|
||||
If the target OS is compatible with one of the operating system versions listed in
|
||||
@@ -23,8 +24,28 @@ the comprehensive :doc:`Installation guidelines <./install>`,
|
||||
specify ``-d <DISTRO> -v <VERSION>``. For example, if the OS is compatible with Ubuntu 22.04, pass
|
||||
``-d ubuntu -v 22.04`` to the script.
|
||||
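When it is unclear which values to pass, the distribution name and version reported by the system can be read from ``/etc/os-release``. The sketch below prints them in the ``-d <DISTRO> -v <VERSION>`` form. The ``installer_flags`` helper is hypothetical, and it is an assumption that the ``ID`` string matches what the installer expects. On compatible but different distributions, the values still need to be chosen manually.

```shell
# installer_flags FILE: print "-d <ID> -v <VERSION_ID>" from an os-release file.
installer_flags() {
    (
        # Source in a subshell so the variables do not leak into the caller.
        . "$1"
        printf '%s %s %s %s\n' -d "${ID}" -v "${VERSION_ID}"
    )
}

if [ -r /etc/os-release ]; then
    installer_flags /etc/os-release
else
    echo "/etc/os-release is not available on this system"
fi
```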
|
||||
.. note::
|
||||
Install via package manager
|
||||
============================
|
||||
|
||||
If you have ROCm version 6.2 or higher installed, you can use the
|
||||
package manager to install a pre-built copy of Omnitrace using
|
||||
``apt install omnitrace`` or ``dnf install omnitrace``.
|
||||
If you have ROCm version 6.3 or higher installed, you can use the
|
||||
package manager to install a pre-built copy of ROCm Systems Profiler.
|
||||
|
||||
.. tab-set::
|
||||
|
||||
.. tab-item:: Ubuntu
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ sudo apt install rocprofiler-systems
|
||||
|
||||
.. tab-item:: Red Hat Enterprise Linux
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ sudo dnf install rocprofiler-systems
|
||||
|
||||
.. tab-item:: SUSE Linux Enterprise Server
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ sudo zypper install rocprofiler-systems
|
||||
|
||||
@@ -1,122 +1,122 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
:description: ROCm Systems Profiler development documentation and reference
|
||||
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, development, developers guide, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
****************************************************
|
||||
Development guide
|
||||
****************************************************
|
||||
|
||||
This guide discusses the `Omnitrace <https://github.com/ROCm/omnitrace>`_ design.
|
||||
It includes a list of the executables and libraries, along with a discussion of the application's
|
||||
This guide discusses the `ROCm Systems Profiler <https://github.com/ROCm/rocprofiler-systems>`_ design.
|
||||
It includes a list of the executables and libraries, along with a discussion of the application's
|
||||
memory, sampling, and time-window constraint models.
|
||||
|
||||
Executables
|
||||
========================================
|
||||
|
||||
This section lists the Omnitrace executables.
|
||||
This section lists the ROCm Systems Profiler executables.
|
||||
|
||||
omnitrace-avail: `source/bin/omnitrace-avail <https://github.com/ROCm/omnitrace/tree/amd-mainline/source/bin/omnitrace-avail>`_
|
||||
rocprof-sys-avail: `source/bin/rocprof-sys-avail <https://github.com/ROCm/rocprofiler-systems/tree/amd-mainline/source/bin/rocprof-sys-avail>`_
|
||||
-------------------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
The ``main`` routine of ``omnitrace-avail`` has three important sections:
|
||||
The ``main`` routine of ``rocprof-sys-avail`` has three important sections:
|
||||
|
||||
* Printing components
|
||||
* Printing options
|
||||
* Printing hardware counters
|
||||
|
||||
omnitrace-sample: `source/bin/omnitrace-sample <https://github.com/ROCm/omnitrace/tree/amd-mainline/source/bin/omnitrace-sample>`_
|
||||
rocprof-sys-sample: `source/bin/rocprof-sys-sample <https://github.com/ROCm/rocprofiler-systems/tree/amd-mainline/source/bin/rocprof-sys-sample>`_
|
||||
----------------------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
* Requires a command-line format of ``omnitrace-sample <options> -- <command> <command-args>``
|
||||
* Requires a command-line format of ``rocprof-sys-sample <options> -- <command> <command-args>``
|
||||
* Translates command-line options into environment variables
|
||||
* Adds ``libomnitrace-dl.so`` to ``LD_PRELOAD``
|
||||
* Adds ``librocprof-sys-dl.so`` to ``LD_PRELOAD``
|
||||
* Is launched by using ``execvpe`` with ``<command> <command-args>`` and a modified environment
|
||||
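The launch sequence above can be approximated in shell terms as follows. This is a sketch of the idea only, not the tool's implementation. The library path is a guess at a typical install location, and whether the library is prepended or appended to an existing ``LD_PRELOAD`` is an assumption. The command is printed rather than executed.

```shell
# Hypothetical library location; adjust to the actual installation prefix.
LIB="/opt/rocprofiler-systems/lib/librocprof-sys-dl.so"

# prepend_preload LIBRARY EXISTING: print the new LD_PRELOAD value.
# An existing value is kept and separated from the new entry with a colon.
prepend_preload() {
    if [ -n "$2" ]; then
        printf '%s:%s\n' "$1" "$2"
    else
        printf '%s\n' "$1"
    fi
}

NEW_PRELOAD="$(prepend_preload "${LIB}" "${LD_PRELOAD:-}")"

# Show the command that would be launched with the modified environment.
echo "would run: env LD_PRELOAD=${NEW_PRELOAD} <command> <command-args>"
```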
|
||||
omnitrace-casual: `source/bin/omnitrace-causal <https://github.com/ROCm/omnitrace/tree/amd-mainline/source/bin/omnitrace-causal>`_
|
||||
rocprof-sys-causal: `source/bin/rocprof-sys-causal <https://github.com/ROCm/rocprofiler-systems/tree/amd-mainline/source/bin/rocprof-sys-causal>`_
|
||||
----------------------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
When there is exactly one causal profiling configuration variant (which enables debugging),
|
||||
``omnitrace-casual`` has a nearly identical design to ``omnitrace-sample``
|
||||
``rocprof-sys-causal`` has a nearly identical design to ``rocprof-sys-sample``.
|
||||
|
||||
When the command-line options produce more than one causal profiling configuration variant,
|
||||
the following actions take place for each variant:
|
||||
|
||||
* ``omnitrace-causal`` calls ``fork()``
|
||||
* ``rocprof-sys-causal`` calls ``fork()``
|
||||
* the child process launches ``<command> <command-args>`` using ``execvpe``, which modifies the environment for the variant
|
||||
* the parent process waits for the child process to finish
|
||||
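The per-variant loop above follows the usual fork, exec, and wait pattern. The shell sketch below mirrors that pattern with a subshell standing in for ``fork()``. The ``CAUSAL_VARIANT`` variable and the variant names are hypothetical and are not settings of the tool.

```shell
# run_variants: launch one child process per variant and wait for each in turn.
# CAUSAL_VARIANT and the variant strings are hypothetical, for illustration only.
run_variants() {
    for variant in "variant-a" "variant-b" "variant-c"; do
        (
            # Child: adjust the environment, then replace itself with the command.
            CAUSAL_VARIANT="${variant}"
            export CAUSAL_VARIANT
            exec sh -c 'echo "child running with ${CAUSAL_VARIANT}"'
        ) &
        # Parent: wait for this child to finish before starting the next one.
        wait "$!"
    done
}

run_variants
```

Because the parent waits after each launch, the variants run one at a time rather than concurrently.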
|
||||
omnitrace-instrument: `source/bin/omnitrace-instrument <https://github.com/ROCm/omnitrace/tree/amd-mainline/source/bin/omnitrace-instrument>`_
|
||||
rocprof-sys-instrument: `source/bin/rocprof-sys-instrument <https://github.com/ROCm/rocprofiler-systems/tree/amd-mainline/source/bin/rocprof-sys-instrument>`_
|
||||
----------------------------------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
* Requires a command-line format of ``omnitrace-instrument <options> -- <command> <command-args>``
|
||||
* Allows the user to provide options specifying whether to perform runtime instrumentation, use binary rewrite, or
|
||||
* Requires a command-line format of ``rocprof-sys-instrument <options> -- <command> <command-args>``
|
||||
* Allows the user to provide options specifying whether to perform runtime instrumentation, use binary rewrite, or
|
||||
attach to process
|
||||
* Either opens the instrumentation target (for binary rewrite), launches the target and stops it
|
||||
before it starts executing ``main``, or attaches to a running executable and pauses it
|
||||
* Finds all functions in the targets
|
||||
* Finds ``libomnitrace-dl`` and locates the functions
|
||||
* Iterates over and instruments all the functions, provided they satisfy the
|
||||
* Finds ``librocprof-sys-dl`` and locates the functions
|
||||
* Iterates over and instruments all the functions, provided they satisfy the
|
||||
defined criteria (such as a minimum number of instructions)
|
||||
|
||||
* See the ``module_function`` class
|
||||
|
||||
* Until this point, the workflow has been the same for the different options,
|
||||
* Until this point, the workflow has been the same for the different options,
|
||||
but it diverges after instrumentation is complete:
|
||||
|
||||
* For a binary rewrite: it produces a new instrumented binary and exits
|
||||
* For runtime instrumentation or attaching to a process: it instructs the application
|
||||
* For runtime instrumentation or attaching to a process: it instructs the application
|
||||
to resume and then waits for it to exit
|
||||
|
||||
Libraries
|
||||
========================================
|
||||
|
||||
Common library: `source/lib/common <https://github.com/ROCm/omnitrace/tree/amd-mainline/source/lib/common>`_
|
||||
Common library: `source/lib/common <https://github.com/ROCm/rocprofiler-systems/tree/amd-mainline/source/lib/common>`_
|
||||
--------------------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
* General header-only functionality used in multiple executables and/or libraries.
|
||||
* General header-only functionality used in multiple executables and/or libraries.
|
||||
* Not installed or exported outside of the build tree.
|
||||
|
||||
Core library: `source/lib/core <https://github.com/ROCm/omnitrace/tree/amd-mainline/source/lib/core>`_
|
||||
Core library: `source/lib/core <https://github.com/ROCm/rocprofiler-systems/tree/amd-mainline/source/lib/core>`_
|
||||
--------------------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
* Static PIC library with functionality that does not depend on any components.
|
||||
* Static PIC library with functionality that does not depend on any components.
|
||||
* Not installed or exported outside of the build tree.
|
||||
|
||||
Binary library: `source/lib/binary <https://github.com/ROCm/omnitrace/tree/amd-mainline/source/lib/binary>`_
|
||||
Binary library: `source/lib/binary <https://github.com/ROCm/rocprofiler-systems/tree/amd-mainline/source/lib/binary>`_
|
||||
--------------------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
* Static PIC library with functionality for reading/analyzing binary info.
|
||||
* Mostly used by the causal profiling sections of ``libomnitrace``.
|
||||
* Mostly used by the causal profiling sections of ``librocprof-sys``.
|
||||
* Not installed or exported outside of the build tree.
|
||||
|
||||
libomnitrace: `source/lib/omnitrace <https://github.com/ROCm/omnitrace/tree/amd-mainline/source/lib/omnitrace>`_
|
||||
librocprof-sys: `source/lib/rocprof-sys <https://github.com/ROCm/rocprofiler-systems/tree/amd-mainline/source/lib/rocprof-sys>`_
|
||||
--------------------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
This is the main library encapsulating all the capabilities.
|
||||
|
||||
libomnitrace-dl: `source/lib/omnitrace-dl <https://github.com/ROCm/omnitrace/tree/amd-mainline/source/lib/omnitrace-dl>`_
|
||||
librocprof-sys-dl: `source/lib/rocprof-sys-dl <https://github.com/ROCm/rocprofiler-systems/tree/amd-mainline/source/lib/rocprof-sys-dl>`_
|
||||
--------------------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
This is a lightweight, front-end library for ``libomnitrace`` which serves three primary purposes:
|
||||
This is a lightweight, front-end library for ``librocprof-sys`` which serves three primary purposes:
|
||||
|
||||
* Dramatically speeds up instrumentation time compared to using ``libomnitrace`` directly because
|
||||
Dyninst must parse the entire library in order to find the instrumentation functions
|
||||
(a ``dlopen`` call is made on ``libomnitrace`` when the instrumentation functions get called)
|
||||
* Prevents re-entry if ``libomnitrace`` calls an instrumented function internally
|
||||
* Coordinates communication between ``libomnitrace-user`` and ``libomnitrace``
|
||||
* Dramatically speeds up instrumentation time compared to using ``librocprof-sys`` directly because
|
||||
Dyninst must parse the entire library in order to find the instrumentation functions
|
||||
(a ``dlopen`` call is made on ``librocprof-sys`` when the instrumentation functions get called)
|
||||
* Prevents re-entry if ``librocprof-sys`` calls an instrumented function internally
|
||||
* Coordinates communication between ``librocprof-sys-user`` and ``librocprof-sys``
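The re-entry protection mentioned in the list above can be pictured with a thread-local guard. This is only a minimal sketch under stated assumptions: ``reentry_guard``, ``on_function_entry``, and ``g_forwarded`` are hypothetical names invented for illustration and are not taken from the ``librocprof-sys-dl`` sources, which may implement the check differently.

```cpp
// Hypothetical sketch: a thread-local flag that marks "already inside the tool".
struct reentry_guard
{
    // The outermost guard on a thread becomes the owner and sets the flag.
    reentry_guard()
    : m_owner{ !active() }
    {
        if(m_owner) active() = true;
    }

    // Only the owner clears the flag, so nested guards leave it untouched.
    ~reentry_guard()
    {
        if(m_owner) active() = false;
    }

    reentry_guard(const reentry_guard&) = delete;
    reentry_guard& operator=(const reentry_guard&) = delete;

    // True when this guard is the outermost one on the current thread.
    explicit operator bool() const { return m_owner; }

private:
    static bool& active()
    {
        thread_local bool value = false;
        return value;
    }

    bool m_owner;
};

// Counts how many calls were forwarded to the (imaginary) main library.
inline int g_forwarded = 0;

// Stand-in for an instrumentation entry point (assumed name).
inline void
on_function_entry()
{
    reentry_guard guard{};
    if(!guard) return;  // nested call on this thread: do not forward again
    ++g_forwarded;
    // Simulate the main library calling an instrumented function internally.
    on_function_entry();
}
```

Because the flag is ``thread_local``, a nested call is only suppressed on the thread that is already inside the tool; other threads are unaffected.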
|
||||
|
||||
libomnitrace-user: `source/lib/omnitrace-user <https://github.com/ROCm/omnitrace/tree/amd-mainline/source/lib/omnitrace-user>`_
|
||||
librocprof-sys-user: `source/lib/rocprof-sys-user <https://github.com/ROCm/rocprofiler-systems/tree/amd-mainline/source/lib/rocprof-sys-user>`_
|
||||
--------------------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
* Provides a set of functions and types for the users to add to their code, for example,
|
||||
disabling data collection globally or on a specific thread or
|
||||
user-defined region
|
||||
* If ``libomnitrace-dl`` is not loaded, the user API is effectively a set of no-op function calls.
|
||||
* If ``librocprof-sys-dl`` is not loaded, the user API is effectively a set of no-op function calls.
|
||||
|
||||
Testing tools
|
||||
========================================
|
||||
|
||||
* `CDash Testing Dashboard <https://my.cdash.org/index.php?project=Omnitrace>`_ (requires a login)
|
||||
* `CDash Testing Dashboard <https://my.cdash.org/index.php?project=rocprofiler-systems>`_ (requires a login)
|
||||
|
||||
Components
|
||||
========================================
|
||||
@@ -124,34 +124,34 @@ Components
|
||||
Most measurements and capabilities are encapsulated into a "component" with the following definitions:
|
||||
|
||||
Measurement
|
||||
A recording of some data relevant to performance, for instance, the current call-stack,
|
||||
A recording of some data relevant to performance, for instance, the current call-stack,
|
||||
hardware counter values, current memory usage, or timestamp
|
||||
|
||||
Capability
|
||||
Handles the implementation or orchestration of some feature which is used
|
||||
to collect measurements, for example, a component which handles setting up function wrappers
|
||||
Handles the implementation or orchestration of some feature which is used
|
||||
to collect measurements, for example, a component which handles setting up function wrappers
|
||||
around various functions such as ``pthread_create`` or ``MPI_Init``.
|
||||
|
||||
Components are designed to either hold no data at all or only the data for both an instantaneous
|
||||
Components are designed to either hold no data at all or only the data for both an instantaneous
|
||||
measurement and a phase measurement.
|
||||
|
||||
Components which store data typically implement a static ``record()`` function
|
||||
Components which store data typically implement a static ``record()`` function
|
||||
for getting a record of the measurement,
|
||||
``start()`` and ``stop()`` member functions for calculating a phase measurement,
|
||||
``start()`` and ``stop()`` member functions for calculating a phase measurement,
|
||||
and a ``sample()`` member function for storing an
|
||||
instantaneous measurement. In reality, there are several more "standard" functions
|
||||
instantaneous measurement. In reality, there are several more "standard" functions
|
||||
but these are the most commonly-used ones.
|
||||
|
||||
Components which do not store data might also have ``start()``, ``stop()``, and ``sample()``
|
||||
Components which do not store data might also have ``start()``, ``stop()``, and ``sample()``
|
||||
functions. However, components which
|
||||
implement function wrappers typically provide a call operator or ``audit(...)``
|
||||
implement function wrappers typically provide a call operator or ``audit(...)``
|
||||
functions. These are invoked with the
|
||||
wrapped function's arguments before the wrapped function gets called and with the return value
|
||||
wrapped function's arguments before the wrapped function gets called and with the return value
|
||||
after the wrapped function gets called.
|
||||
|
||||
.. note::
|
||||
|
||||
The goal of this design is to provide relatively small and reusable lightweight objects
|
||||
The goal of this design is to provide relatively small and reusable lightweight objects
|
||||
for recording measurements and implementing capabilities.
|
||||
|
||||
Wall-clock component example
|
||||
@@ -195,7 +195,7 @@ A component for computing the elapsed wall-clock time looks like this:
|
||||
Function wrapper component example
|
||||
--------------------------------------
|
||||
|
||||
A component which implements wrappers around ``fork()`` and ``exit(int)`` (and stores no data)
|
||||
A component which implements wrappers around ``fork()`` and ``exit(int)`` (and stores no data)
|
||||
could look like this:
|
||||
|
||||
.. code-block:: cpp
|
||||
@@ -219,7 +219,7 @@ could look like this:
|
||||
void operator()(const gotcha_data&, void (*real_exit)(int), int _exit_code)
|
||||
{
|
||||
// catch the call to exit and finalize before truly exiting
|
||||
omnitrace_finalize();
|
||||
rocprofsys_finalize();
|
||||
|
||||
real_exit(_exit_code);
|
||||
}
|
||||
@@ -298,22 +298,22 @@ Collected data is generally handled in one of the three following ways:
|
||||
* It is managed implicitly by Timemory and accessed as needed
|
||||
* As thread-local data
|
||||
|
||||
In general, only instrumentation for relatively simple data is directly passed to
|
||||
In general, only instrumentation for relatively simple data is directly passed to
|
||||
Perfetto and/or Timemory during runtime.
|
||||
For example, the callbacks from binary instrumentation, user API instrumentation,
|
||||
For example, the callbacks from binary instrumentation, user API instrumentation,
|
||||
and roctracer directly invoke
|
||||
calls to Perfetto or Timemory's storage model. Otherwise, the data is stored
|
||||
by Omnitrace in the thread-data model
|
||||
calls to Perfetto or Timemory's storage model. Otherwise, the data is stored
|
||||
by ROCm Systems Profiler in the thread-data model
|
||||
which is more persistent than simply using ``thread_local`` static data, which gets deleted
|
||||
when the thread stops.
|
||||
|
||||
Thread identification
|
||||
--------------------------------------
|
||||
|
||||
Each CPU thread is assigned two integral identifiers. One identifier, the ``internal_value``, is
|
||||
Each CPU thread is assigned two integral identifiers. One identifier, the ``internal_value``, is
|
||||
atomically incremented every time a new thread is created.
|
||||
The other identifier, known as the ``sequent_value``, tries to account for the fact that Omnitrace, Perfetto, ROCm, and other applications
|
||||
start background threads. When a thread is created as a by-product of Omnitrace,
|
||||
The other identifier, known as the ``sequent_value``, tries to account for the fact that ROCm Systems Profiler, Perfetto, ROCm, and other applications
|
||||
start background threads. When a thread is created as a by-product of ROCm Systems Profiler,
|
||||
the index is offset by a large value. This serves
|
||||
two purposes:
|
||||
|
||||
@@ -325,88 +325,88 @@ The ``sequent_value`` identifier is typically used to access the thread-data.
|
||||
Thread-data class
|
||||
--------------------------------------
|
||||
|
||||
Currently, most thread data is effectively stored in a static
|
||||
``std::array<std::unique_ptr<T>, OMNITRACE_MAX_THREADS>`` instance.
|
||||
``OMNITRACE_MAX_THREADS`` is a value defined at compile-time and set to ``2048``
|
||||
Currently, most thread data is effectively stored in a static
|
||||
``std::array<std::unique_ptr<T>, ROCPROFSYS_MAX_THREADS>`` instance.
|
||||
``ROCPROFSYS_MAX_THREADS`` is a value defined at compile-time and set to ``2048``
|
||||
for release builds. During finalization,
|
||||
Omnitrace iterates through the thread-data and transforms that data
|
||||
ROCm Systems Profiler iterates through the thread-data and transforms that data
|
||||
into something that can be passed along to Perfetto and/or Timemory.
|
||||
The downside of the current model is that if the user exceeds ``OMNITRACE_MAX_THREADS``,
|
||||
The downside of the current model is that if the user exceeds ``ROCPROFSYS_MAX_THREADS``,
|
||||
a segmentation fault occurs. To fix this issue,
|
||||
a new model is being adopted which has all the benefits of this model
|
||||
a new model is being adopted which has all the benefits of this model
|
||||
but permits dynamic expansion.
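To make the storage layout described above concrete, here is a minimal sketch of a fixed-size, per-thread slot array. The names ``thread_data_store`` and ``max_threads`` are hypothetical stand-ins (the latter for ``ROCPROFSYS_MAX_THREADS``), and the bounds check is an assumption added for illustration; the text above says the current model faults when the limit is exceeded.

```cpp
#include <array>
#include <cstddef>
#include <memory>

// Hypothetical stand-in for ROCPROFSYS_MAX_THREADS (2048 in release builds).
constexpr std::size_t max_threads = 2048;

// Hypothetical illustration only: not the project's actual thread-data class.
template <typename T>
struct thread_data_store
{
    // Returns the slot for a thread index, creating it on first use.
    // Returns nullptr rather than indexing past the end of the array.
    static T* get(std::size_t sequent_value)
    {
        if(sequent_value >= max_threads) return nullptr;
        auto& slot = instances()[sequent_value];
        if(!slot) slot = std::make_unique<T>();
        return slot.get();
    }

private:
    // One static array per data type T, shared by all threads.
    static std::array<std::unique_ptr<T>, max_threads>& instances()
    {
        static std::array<std::unique_ptr<T>, max_threads> data{};
        return data;
    }
};
```

Each thread is expected to touch only its own slot, so the sketch does no locking. Because the array is static rather than ``thread_local``, the slots outlive the threads that filled them, which is what allows the data to be walked during finalization.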
|
||||
|
||||
Sampling model
|
||||
========================================
|
||||
|
||||
The general structure for the sampling is within Timemory (``source/timemory/sampling``).
|
||||
The general structure for the sampling is within Timemory (``source/timemory/sampling``).
|
||||
Currently, all sampling is done per-thread
|
||||
via POSIX timers. Omnitrace supports both a real-time timer and a CPU-time timer.
|
||||
via POSIX timers. ROCm Systems Profiler supports both a real-time timer and a CPU-time timer.
|
||||
Both have adjustable frequencies, delays, and durations.
|
||||
By default, only CPU-time sampling is enabled. Initial settings are inherited from
|
||||
the settings starting with ``OMNITRACE_SAMPLING_``.
|
||||
By default, only CPU-time sampling is enabled. Initial settings are inherited from
|
||||
the settings starting with ``ROCPROFSYS_SAMPLING_``.
|
||||
|
||||
For each type of timer, timer-specific settings can be used to
|
||||
override the common and inherited timer settings.
|
||||
These settings begin with ``OMNITRACE_SAMPLING_CPUTIME`` for the CPU-time sampler
|
||||
and ``OMNITRACE_SAMPLING_REALTIME`` for
|
||||
the real-time sampler. For example, ``OMNITRACE_SAMPLING_FREQ=500`` initially sets the
|
||||
sampling frequency to 500 interrupts per second. Adding the setting ``OMNITRACE_SAMPLING_REALTIME_FREQ=10``
|
||||
For each type of timer, timer-specific settings can be used to
|
||||
override the common and inherited timer settings.
|
||||
These settings begin with ``ROCPROFSYS_SAMPLING_CPUTIME`` for the CPU-time sampler
|
||||
and ``ROCPROFSYS_SAMPLING_REALTIME`` for
|
||||
the real-time sampler. For example, ``ROCPROFSYS_SAMPLING_FREQ=500`` initially sets the
|
||||
sampling frequency to 500 interrupts per second. Adding the setting ``ROCPROFSYS_SAMPLING_REALTIME_FREQ=10``
|
||||
lowers the sampling frequency for the real-time sampler
|
||||
to 10 interrupts per second of real-time.
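The override order described above (timer-specific setting first, then the common setting) could be sketched as follows. ``settings_map`` and ``effective_freq`` are hypothetical names for this example, the settings are modeled as a plain map rather than the tool's real configuration system, and the ``ROCPROFSYS_SAMPLING_<TIMER>_FREQ`` key pattern is assumed by analogy with the documented ``ROCPROFSYS_SAMPLING_REALTIME_FREQ``.

```cpp
#include <map>
#include <string>

// Hypothetical model of the configuration: setting name -> numeric value.
using settings_map = std::map<std::string, double>;

// Returns the timer-specific frequency when it is set, otherwise the common
// ROCPROFSYS_SAMPLING_FREQ value, otherwise the caller-supplied fallback.
// `timer` is expected to be "CPUTIME" or "REALTIME" (assumed key pattern).
inline double
effective_freq(const settings_map& settings, const std::string& timer, double fallback)
{
    auto specific = settings.find("ROCPROFSYS_SAMPLING_" + timer + "_FREQ");
    if(specific != settings.end()) return specific->second;

    auto common = settings.find("ROCPROFSYS_SAMPLING_FREQ");
    if(common != settings.end()) return common->second;

    return fallback;
}
```

With ``ROCPROFSYS_SAMPLING_FREQ=500`` and ``ROCPROFSYS_SAMPLING_REALTIME_FREQ=10`` in the map, the real-time sampler resolves to 10 while the CPU-time sampler inherits 500, matching the example in the text.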
|
||||
|
||||
The Omnitrace-specific implementation can be found in
|
||||
`source/lib/omnitrace/library/sampling.cpp <https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/sampling.cpp>`_.
|
||||
Within `sampling.cpp <https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/sampling.cpp>`_,
|
||||
The ROCm Systems Profiler-specific implementation can be found in
|
||||
`source/lib/rocprof-sys/library/sampling.cpp <https://github.com/ROCm/rocprofiler-systems/blob/main/source/lib/rocprof-sys/library/sampling.cpp>`_.
|
||||
Within `sampling.cpp <https://github.com/ROCm/rocprofiler-systems/blob/main/source/lib/rocprof-sys/library/sampling.cpp>`_,
|
||||
there is a bundle of three sampling components:
|
||||
|
||||
* `backtrace_timestamp <https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/components/backtrace_timestamp.hpp>`_ simply
|
||||
* `backtrace_timestamp <https://github.com/ROCm/rocprofiler-systems/blob/main/source/lib/rocprof-sys/library/components/backtrace_timestamp.hpp>`_ simply
|
||||
records the wall-clock time of the sample.
|
||||
* `backtrace <https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/components/backtrace.hpp>`_
|
||||
* `backtrace <https://github.com/ROCm/rocprofiler-systems/blob/main/source/lib/rocprof-sys/library/components/backtrace.hpp>`_
|
||||
records the call-stack via libunwind.
|
||||
* `backtrace_metrics <https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/components/backtrace_metrics.hpp>`_
|
||||
* `backtrace_metrics <https://github.com/ROCm/rocprofiler-systems/blob/main/source/lib/rocprof-sys/library/components/backtrace_metrics.hpp>`_
|
||||
records the sample metrics, such as peak RSS and the hardware counters.
|
||||
|
||||
These three components are bundled together in
|
||||
These three components are bundled together in
|
||||
a tuple-like ``struct`` (``tuple<backtrace_timestamp, backtrace, backtrace_metrics>``).
|
||||
A buffer of at least 1024 instances of this tuple is mapped using ``mmap``
|
||||
per-thread. When this buffer is full,
|
||||
A buffer of at least 1024 instances of this tuple is mapped using ``mmap``
|
||||
per-thread. When this buffer is full,
|
||||
the sampler hands the buffer off to its allocator thread and maps a new buffer with ``mmap``
|
||||
before taking the next sample. The allocator thread takes this data
|
||||
and either dynamically stores it in memory or writes it to a file depending on the
|
||||
value of ``OMNITRACE_USE_TEMPORARY_FILES``.
|
||||
This schema avoids all allocations in the signal handler, lets the data grow
|
||||
dynamically, avoids potentially slow I/O within the signal handler, and also enables
|
||||
before taking the next sample. The allocator thread takes this data
|
||||
and either dynamically stores it in memory or writes it to a file depending on the
|
||||
value of ``ROCPROFSYS_USE_TEMPORARY_FILES``.
|
||||
This schema avoids all allocations in the signal handler, lets the data grow
|
||||
dynamically, avoids potentially slow I/O within the signal handler, and also enables
|
||||
the capability of avoiding I/O altogether.
|
||||
The maximum number of samplers handled by each allocator is governed by the
|
||||
``OMNITRACE_SAMPLING_ALLOCATOR_SIZE`` setting (the default is eight). Whenever an allocator
|
||||
The maximum number of samplers handled by each allocator is governed by the
|
||||
``ROCPROFSYS_SAMPLING_ALLOCATOR_SIZE`` setting (the default is eight). Whenever an allocator
|
||||
has reached its limit,
|
||||
a new internal thread is created to handle the new samplers.
|
||||
|
||||
Time-window constraint model
|
||||
========================================
|
||||
|
||||
With the recent introduction of tracing delay and duration, the
|
||||
`constraint namespace <https://github.com/ROCm/omnitrace/blob/main/source/lib/core/constraint.hpp>`_
|
||||
was introduced to improve the management of delays and duration limits for
|
||||
With the recent introduction of tracing delay and duration, the
|
||||
`constraint namespace <https://github.com/ROCm/rocprofiler-systems/blob/main/source/lib/core/constraint.hpp>`_
|
||||
was introduced to improve the management of delays and duration limits for
|
||||
data collection. The ``spec`` class accepts a clock identifier, a delay value, a duration value, and an
|
||||
integer indicating how many times to repeat the delay and duration cycle. It is therefore
|
||||
integer indicating how many times to repeat the delay and duration cycle. It is therefore
|
||||
possible to perform tasks such as periodically enabling tracing for brief periods
|
||||
of time in between long periods without data collection while the application runs. The
|
||||
syntax follows the format ``clock_identifier:delay:capture_duration:cycles``, so a value of
|
||||
syntax follows the format ``clock_identifier:delay:capture_duration:cycles``, so a value of
|
||||
``10:1:3`` for the last three parameters represents the following sequence of operations:
|
||||
|
||||
* Ten seconds where no data is collected, then one second where it is
|
||||
* Ten seconds where no data is collected, then one second where it is
|
||||
* Ten seconds where no data is collected, then one second where it is
|
||||
* Ten seconds where no data is collected, then one second where it is
|
||||
* Ten seconds where no data is collected, then one second where it is
|
||||
* Stop
|
||||
|
||||
As another example, ``OMNITRACE_TRACE_PERIODS = realtime:10:1:5 process_cputime:10:2:20`` translates
|
||||
As another example, ``ROCPROFSYS_TRACE_PERIODS = realtime:10:1:5 process_cputime:10:2:20`` translates
|
||||
to this sequence:
|
||||
|
||||
* Five cycles of: no data collection for ten seconds of real-time followed by one second of data collection
|
||||
* Twenty cycles of: no data collection for ten seconds of process CPU time followed by two CPU-time seconds of data collection
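As a rough illustration of the ``clock_identifier:delay:capture_duration:cycles`` format, the following sketch parses a single entry into a plain struct. ``period_spec`` and ``parse_period`` are hypothetical names chosen for this example; they are not the ``spec`` class from ``constraint.hpp``, and the real parser may accept forms that this sketch rejects.

```cpp
#include <exception>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical illustration only: not the actual constraint spec class.
struct period_spec
{
    std::string clock_id;  // for example "realtime" or "process_cputime"
    double      delay;     // seconds without data collection
    double      duration;  // seconds with data collection
    unsigned    cycles;    // number of delay + duration repetitions
};

// Parses one "clock_identifier:delay:capture_duration:cycles" entry.
// Returns std::nullopt when the entry does not have exactly four fields
// or when a numeric field cannot be converted.
inline std::optional<period_spec>
parse_period(const std::string& entry)
{
    std::vector<std::string> fields;
    std::istringstream       stream{ entry };
    for(std::string field; std::getline(stream, field, ':');)
        fields.push_back(field);
    if(fields.size() != 4) return std::nullopt;

    try
    {
        return period_spec{ fields[0], std::stod(fields[1]), std::stod(fields[2]),
                            static_cast<unsigned>(std::stoul(fields[3])) };
    } catch(const std::exception&)
    {
        // std::stod and std::stoul throw on non-numeric or out-of-range input
        return std::nullopt;
    }
}
```

A setting that holds several space-separated entries, such as the ``realtime:10:1:5 process_cputime:10:2:20`` example, would be split on whitespace first and each piece passed to ``parse_period``.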
|
||||
|
||||
Eventually, the goal is to migrate all subsets of data collection which currently support
|
||||
Eventually, the goal is to migrate all subsets of data collection which currently support
|
||||
more rudimentary models of time window constraints, such as process sampling and causal profiling,
|
||||
to this model.
|
||||
|
||||
@@ -1,40 +1,40 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
:description: ROCm Systems Profiler glossary and reference
|
||||
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, glossary, terminology, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
*******************
|
||||
Omnitrace Glossary
|
||||
ROCm Systems Profiler Glossary
|
||||
*******************
|
||||
|
||||
This topic explains the terminology necessary to use Omnitrace.
|
||||
The list below provides a basic glossary for those who
|
||||
are new to binary instrumentation. It also clarifies ambiguities
|
||||
when certain terms have different
|
||||
contextual meanings, for example, the Omnitrace meaning of the term "module"
|
||||
This topic explains the terminology necessary to use ROCm Systems Profiler.
|
||||
The list below provides a basic glossary for those who
|
||||
are new to binary instrumentation. It also clarifies ambiguities
|
||||
when certain terms have different
|
||||
contextual meanings, for example, the ROCm Systems Profiler meaning of the term "module"
|
||||
when instrumenting Python.
|
||||
|
||||
**Binary**
|
||||
A file written in the Executable and Linkable Format (ELF). This is the standard file
|
||||
A file written in the Executable and Linkable Format (ELF). This is the standard file
|
||||
format for executable files, shared libraries, etc.
|
||||
|
||||
**Binary instrumentation**
|
||||
Inserting callbacks to instrumentation into an existing binary. This can be performed
|
||||
Inserting callbacks to instrumentation into an existing binary. This can be performed
|
||||
statically or dynamically.
|
||||
|
||||
**Static binary instrumentation**
|
||||
Loads an existing binary, determines instrumentation points, and generates a new binary
|
||||
with instrumentation directly embedded. It is applicable to executables and libraries but
|
||||
Loads an existing binary, determines instrumentation points, and generates a new binary
|
||||
with instrumentation directly embedded. It is applicable to executables and libraries but
|
||||
limited to only the functions defined in the binary. This is also known as **Binary rewrite**.
|
||||
|
||||
**Dynamic binary instrumentation**
|
||||
Loads an existing binary into memory, inserts instrumentation, and runs the binary.
|
||||
It is limited to executables but is capable of instrumenting linked libraries.
|
||||
Loads an existing binary into memory, inserts instrumentation, and runs the binary.
|
||||
It is limited to executables but is capable of instrumenting linked libraries.
|
||||
This is also known as **Runtime instrumentation**.
|
||||
|
||||
**Statistical sampling**
|
||||
At periodic intervals, the application is paused and the current call-stack of the CPU
|
||||
is recorded along with various other metrics. It uses timers that measure either
|
||||
(A) real clock time or (B) the CPU time used by the current thread and the CPU time
|
||||
**Statistical sampling**
|
||||
At periodic intervals, the application is paused and the current call-stack of the CPU
|
||||
is recorded along with various other metrics. It uses timers that measure either
|
||||
(A) real clock time or (B) the CPU time used by the current thread and the CPU time
|
||||
expended on behalf of the thread by the system. This is also known as simply **sampling**.
|
||||
|
||||
**Sampling rate**
|
||||
@@ -45,12 +45,12 @@ when instrumenting Python.
|
||||
* How long to wait before (A) and (B) begin triggering at their designated rate
|
||||
|
||||
**Sampling duration**
|
||||
* The amount of time (in real-time) after the start of the application to record samples.
|
||||
* The amount of time (in real-time) after the start of the application to record samples.
|
||||
* After this time limit has been reached, no more samples are recorded.
|
||||
|
||||
**Process sampling**
|
||||
At periodic (real-time) intervals, a background thread records global metrics without
|
||||
interrupting the current process. These metrics include, but are not limited to:
|
||||
At periodic (real-time) intervals, a background thread records global metrics without
|
||||
interrupting the current process. These metrics include, but are not limited to:
|
||||
CPU frequency, CPU memory high-water mark (i.e. peak memory usage), GPU temperature,
|
||||
and GPU power usage.
|
||||
|
||||
@@ -62,41 +62,41 @@ when instrumenting Python.
|
||||
* How long to wait (in real-time) before recording samples
|
||||
|
||||
**Sampling duration**
|
||||
* The amount of time (in real-time) after the start of the application to record samples.
|
||||
* The amount of time (in real-time) after the start of the application to record samples.
|
||||
* After this time limit has been reached, no more samples are recorded.
|
||||
|
||||
**Module**
|
||||
With respect to binary instrumentation, a module is defined as either the filename
|
||||
(such as ``foo.c``) or library name (``libfoo.so``) which contains the definition
|
||||
With respect to binary instrumentation, a module is defined as either the filename
|
||||
(such as ``foo.c``) or library name (``libfoo.so``) which contains the definition
|
||||
of one or more functions.
|
||||
|
||||
With respect to Python instrumentation, a module is defined as the **file** which contains
|
||||
the definition of one or more functions. The full path to this file typically contains the
|
||||
With respect to Python instrumentation, a module is defined as the **file** which contains
|
||||
the definition of one or more functions. The full path to this file typically contains the
|
||||
name of the "Python module".
|
||||
|
||||
**Basic block**
|
||||
A straight-line code sequence with no branches in (except for the entry) and
|
||||
A straight-line code sequence with no branches in (except for the entry) and
|
||||
no branches out (except for the exit).
|
||||
|
||||
**Address range**
|
||||
The instructions for a function in a binary start at a certain address within the ELF file
|
||||
The instructions for a function in a binary start at a certain address within the ELF file
|
||||
and end at a certain address. The range is ``end - start``.
|
||||
|
||||
The address range is a decent approximation for the "cost" of a function.
|
||||
The address range is a decent approximation for the "cost" of a function.
|
||||
For example, a larger address range approximately equates to more instructions.
|
||||
|
||||
**Instrumentation traps**
|
||||
On the x86 architecture, because instructions are of variable size, an instruction
|
||||
might be too small for Dyninst to replace it with the normal code sequence
|
||||
used to call instrumentation. When instrumentation is placed at points other
|
||||
than subroutine entry, exit, or call points, traps may be used to ensure
|
||||
the instrumentation fits. (By default, ``omnitrace-instrument`` avoids instrumentation
|
||||
On the x86 architecture, because instructions are of variable size, an instruction
|
||||
might be too small for Dyninst to replace it with the normal code sequence
|
||||
used to call instrumentation. When instrumentation is placed at points other
|
||||
than subroutine entry, exit, or call points, traps may be used to ensure
|
||||
the instrumentation fits. (By default, ``rocprof-sys-instrument`` avoids instrumentation
|
||||
which requires a trap.)
|
||||
|
||||
**Overlapping functions**
|
||||
Due to language constructs or compiler optimizations, it might be possible for
|
||||
multiple functions to overlap (that is, share part of the same function body)
|
||||
or for a single function to have multiple entry points. In practice, it's
|
||||
impossible to determine the difference between multiple overlapping functions
|
||||
and a single function with multiple entry points. (By default, ``omnitrace-instrument``
|
||||
Due to language constructs or compiler optimizations, it might be possible for
|
||||
multiple functions to overlap (that is, share part of the same function body)
|
||||
or for a single function to have multiple entry points. In practice, it's
|
||||
impossible to determine the difference between multiple overlapping functions
|
||||
and a single function with multiple entry points. (By default, ``rocprof-sys-instrument``
|
||||
avoids instrumenting overlapping functions.)
|
||||
@@ -6,18 +6,18 @@ defaults:
|
||||
root: index
|
||||
subtrees:
|
||||
- entries:
|
||||
- file: what-is-omnitrace.rst
|
||||
- file: what-is-rocprof-sys.rst
|
||||
|
||||
- caption: Install
|
||||
entries:
|
||||
- file: install/quick-start.rst
|
||||
title: Omnitrace quick start
|
||||
title: ROCm Systems Profiler quick start
|
||||
- file: install/install.rst
|
||||
title: Omnitrace installation guide
|
||||
title: ROCm Systems Profiler installation guide
|
||||
|
||||
- caption: Tutorials
|
||||
entries:
|
||||
- url: https://github.com/ROCm/omnitrace/tree/amd-mainline/examples
|
||||
- url: https://github.com/ROCm/rocprofiler-systems/tree/amd-mainline/examples
|
||||
title: GitHub examples
|
||||
- file: tutorials/video-tutorials.rst
|
||||
title: Video tutorials
|
||||
@@ -25,37 +25,37 @@ subtrees:
|
||||
- caption: How to
|
||||
entries:
|
||||
- file: how-to/configuring-validating-environment.rst
|
||||
title: Configuring and validating the environment
|
||||
title: Configuring and validating the environment
|
||||
- file: how-to/configuring-runtime-options.rst
|
||||
title: Configuring runtime options
|
||||
title: Configuring runtime options
|
||||
- file: how-to/sampling-call-stack.rst
|
||||
title: Sampling the call stack
|
||||
title: Sampling the call stack
|
||||
- file: how-to/instrumenting-rewriting-binary-application.rst
|
||||
title: Instrumenting and rewriting a binary application
|
||||
- file: how-to/performing-causal-profiling.rst
|
||||
title: Performing causal profiling
|
||||
- file: how-to/understanding-omnitrace-output.rst
|
||||
title: Understanding the Omnitrace output
|
||||
title: Performing causal profiling
|
||||
- file: how-to/understanding-rocprof-sys-output.rst
|
||||
title: Understanding the ROCm Systems Profiler output
|
||||
- file: how-to/profiling-python-scripts.rst
|
||||
title: Profiling Python scripts
|
||||
- file: how-to/using-omnitrace-api.rst
|
||||
title: Using the Omnitrace API
|
||||
- file: how-to/general-tips-using-omnitrace.rst
|
||||
title: General tips for using Omnitrace
|
||||
title: Profiling Python scripts
|
||||
- file: how-to/using-rocprof-sys-api.rst
|
||||
title: Using the ROCm Systems Profiler API
|
||||
- file: how-to/general-tips-using-rocprof-sys.rst
|
||||
title: General tips for using ROCm Systems Profiler
|
||||
|
||||
- caption: Conceptual
|
||||
entries:
|
||||
- file: conceptual/data-collection-modes.rst
|
||||
title: Data collection modes
|
||||
- file: conceptual/omnitrace-feature-set.rst
|
||||
title: The Omnitrace feature set and use cases
|
||||
- file: conceptual/rocprof-sys-feature-set.rst
|
||||
title: The ROCm Systems Profiler feature set and use cases
|
||||
|
||||
- caption: Reference
|
||||
entries:
|
||||
- file: reference/development-guide.rst
|
||||
title: Development guide
|
||||
- file: reference/omnitrace-glossary.rst
|
||||
title: Omnitrace glossary
|
||||
- file: reference/rocprof-sys-glossary.rst
|
||||
title: ROCm Systems Profiler glossary
|
||||
- file: doxygen/html/files
|
||||
title: API library
|
||||
- file: doxygen/html/functions
|
||||
|
||||
@@ -1,11 +1,14 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
:description: ROCm Systems Profiler video documentation and reference
|
||||
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, profiler, video, tutorial, demonstration, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
****************************************************
|
||||
Video tutorials
|
||||
****************************************************
|
||||
|
||||
The following video tutorials provide a visual guide to using ROCm Systems Profiler.
|
||||
They were recorded using the former name of the tool, Omnitrace, but the content is still applicable.
|
||||
|
||||
Installing a binary release
|
||||
========================================
|
||||
|
||||
@@ -20,7 +23,7 @@ Instrumenting a binary
|
||||
|
||||
<p align="center"><iframe width="560" height="315" src="https://www.youtube.com/embed/2B0gRr3FygQ?modestbranding=1" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>
|
||||
|
||||
Writing an Omnitrace configuration file
|
||||
Writing a ROCm Systems Profiler configuration file
|
||||
========================================
|
||||
|
||||
.. raw:: html
|
||||
|
||||
@@ -1,18 +1,18 @@
|
||||
.. meta::
|
||||
:description: Omnitrace documentation and reference
|
||||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
:description: ROCm Systems Profiler introduction, explanation, and reference
|
||||
:keywords: rocprof-sys, rocprofiler-systems, Omnitrace, ROCm, profiler, explanation, introduction, what is, tracking, visualization, tool, Instinct, accelerator, AMD
|
||||
|
||||
******************
|
||||
What is Omnitrace?
|
||||
What is ROCm Systems Profiler?
|
||||
******************
|
||||
|
||||
Omnitrace is designed for the high-level profiling and comprehensive tracing
|
||||
ROCm Systems Profiler is designed for the high-level profiling and comprehensive tracing
|
||||
of applications running on the CPU or the CPU and GPU. It supports dynamic binary
|
||||
instrumentation, call-stack sampling, and various other features for determining
|
||||
which function and line number are currently executing.
|
||||
|
||||
A visualization of the comprehensive Omnitrace results can be observed in any modern
|
||||
web browser. Upload the Perfetto (``.proto``) output files produced by Omnitrace at
|
||||
A visualization of the comprehensive ROCm Systems Profiler results can be observed in any modern
|
||||
web browser. Upload the Perfetto (``.proto``) output files produced by ROCm Systems Profiler at
|
||||
`ui.perfetto.dev <https://ui.perfetto.dev/>`_ to see the details.
|
||||
|
||||
.. important::
|
||||
@@ -26,7 +26,7 @@ JSON files for programmatic analysis. The JSON output files are compatible with
|
||||
the performance data into pandas data frames and facilitates multi-run comparisons, filtering,
|
||||
and visualization in Jupyter notebooks.
|
||||
|
||||
To use Omnitrace for instrumentation, follow these two configuration steps:
|
||||
To use ROCm Systems Profiler for instrumentation, follow these two configuration steps:
|
||||
|
||||
#. Indicate the functions and modules to :doc:`instrument <./how-to/instrumenting-rewriting-binary-application>` in the target binaries, including the executable and any libraries
|
||||
#. Specify the :doc:`instrumentation parameters <./how-to/configuring-runtime-options>` to use when the instrumented binaries are launched
|
||||
@@ -1,5 +0,0 @@
|
||||
/build*
|
||||
/_build
|
||||
/_doxygen
|
||||
/.gitinfo
|
||||
/omnitrace.dox
|
||||
@@ -1,20 +0,0 @@
|
||||
# Minimal makefile for Sphinx documentation
|
||||
#
|
||||
|
||||
# You can set these variables from the command line, and also
|
||||
# from the environment for the first two.
|
||||
SPHINXOPTS ?=
|
||||
SPHINXBUILD ?= sphinx-build
|
||||
SOURCEDIR = .
|
||||
BUILDDIR = _build
|
||||
|
||||
# Put it first so that "make" without argument is like "make help".
|
||||
help:
|
||||
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
|
||||
|
||||
.PHONY: help Makefile
|
||||
|
||||
# Catch-all target: route all unknown targets to Sphinx using the new
|
||||
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
|
||||
%: Makefile
|
||||
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
|
||||
@@ -1,53 +0,0 @@
|
||||
# About
|
||||
|
||||
```eval_rst
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 4
|
||||
```
|
||||
|
||||
## Overview
|
||||
|
||||
> ***[OmniTrace](https://github.com/ROCm/omnitrace) is an AMD open source research project and is not supported as part of the ROCm software stack.***
|
||||
|
||||
[Browse OmniTrace source code on Github](https://github.com/ROCm/omnitrace)
|
||||
|
||||
[OmniTrace](https://github.com/ROCm/omnitrace) is designed for both high-level profiling and
|
||||
comprehensive tracing of applications running on the CPU or the CPU+GPU via dynamic binary instrumentation,
|
||||
call-stack sampling, and various other means for determining currently executing function and line information.
|
||||
|
||||
Visualization of the comprehensive omnitrace results can be viewed in any modern web browser by visiting
|
||||
[ui.perfetto.dev](https://ui.perfetto.dev/) and loading the perfetto output (`.proto` files) produced by omnitrace.
|
||||
|
||||
Aggregated high-level results are available in text files for human consumption and JSON files for programmatic analysis.
|
||||
The JSON output files are compatible with the python package [hatchet](https://github.com/hatchet/hatchet) which converts
|
||||
the performance data into pandas dataframes and facilitates multi-run comparisons, filtering, visualization in Jupyter notebooks,
|
||||
and much more.
|
||||
|
||||
[OmniTrace](https://github.com/ROCm/omnitrace) has two distinct configuration steps when instrumenting:
|
||||
|
||||
1. Configuring which functions and modules are instrumented in the target binaries (i.e. executable and/or libraries)
|
||||
- [Instrumenting with OmniTrace](instrumenting.md)
|
||||
2. Configuring what the instrumentation does when the instrumented binaries are executed
|
||||
- [Customizing OmniTrace Runtime](runtime.md)
|
||||
|
||||
## OmniTrace Use Cases
|
||||
|
||||
When analyzing the performance of an application, ***it is always best to NOT assume you know where the performance bottlenecks are***
|
||||
***and why they are happening.*** OmniTrace is a ***tool for the entire execution of the application***. It is the sort of tool which is
|
||||
ideal for *characterizing* where optimization would have the greatest impact on the end-to-end execution of the application and/or
|
||||
viewing what else is happening on the system during a performance bottleneck.
|
||||
|
||||
Especially when GPUs are involved, there is a tendency to assume that the quickest path to performance improvement is minimizing
|
||||
the runtime of the GPU kernels. This is a highly flawed assumption: if you optimize the runtime of a kernel from 1 millisecond
|
||||
to 1 microsecond (1000x speed-up) but the original application *never spent time waiting* for kernel(s) to complete,
|
||||
you will see zero statistically significant speed-up in end-to-end runtime of your application. In other words, it does not matter
|
||||
how fast or slow the code on GPU is if the application is not bottlenecked waiting on the GPU.
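The argument above can be checked with a toy model. The sketch below assumes the CPU and GPU work overlap completely, so the slower side alone sets the end-to-end runtime; the function name and the 50 ms CPU figure are hypothetical and chosen only for illustration.

```python
def end_to_end_ms(cpu_ms: float, gpu_ms: float) -> float:
    """Toy model: CPU and GPU work fully overlap, so the slower side sets the runtime."""
    return max(cpu_ms, gpu_ms)


# The application spends 50 ms on the CPU and never waits on the kernel.
before = end_to_end_ms(cpu_ms=50.0, gpu_ms=1.0)    # kernel takes 1 millisecond
after = end_to_end_ms(cpu_ms=50.0, gpu_ms=0.001)   # kernel takes 1 microsecond

print(f"before: {before} ms, after: {after} ms")
```

Under this assumption the 1000x kernel speed-up leaves the end-to-end runtime unchanged, which is the situation a whole-application profile is meant to reveal.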
|
||||
|
||||
Use OmniTrace to obtain a high-level view of the entire application. Use it to determine where the performance bottlenecks are and
|
||||
obtain clues to why these bottlenecks are happening. If you want ***extensive*** insight into the execution of individual kernels
|
||||
on the GPU, AMD Research is working on another tool for this but you should start with the tool which characterizes the
|
||||
broad picture: OmniTrace.
|
||||
|
||||
With regard to the CPU, OmniTrace does not target any specific vendor; it works just as well with non-AMD CPUs as with AMD CPUs.
|
||||
With regard to the GPU, OmniTrace is currently restricted to the HIP and HSA APIs and kernels executing on AMD GPUs.
|
||||
@@ -1,535 +0,0 @@
|
||||
# Causal Profiling
|
||||
|
||||
```eval_rst
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 3
|
||||
```
|
||||
|
||||
## What is "Causal Profiling"?
|
||||
|
||||
> ***If you speed up a given block of code by X%, the application will execute Y% faster***
|
||||
|
||||
Causal profiling directs parallel application developers to where they should focus their optimization
|
||||
efforts by quantifying the potential impact of optimizations. Causal profiling is rooted in the concept
|
||||
that *software execution speed is relative*: speeding up a block of code by X% is mathematically equivalent
|
||||
to that block of code running at its current speed while all the other code runs slower by X%.
|
||||
Thus, causal profiling works by performing experiments on blocks of code during program execution which
|
||||
insert pauses to slow down all other concurrently running code. During post-processing, these experiments
|
||||
are translated into calculations for the potential impact of speeding up this block of code.
|
||||
|
||||
Consider the following C++ code executing `foo` and `bar` concurrently in two different threads
|
||||
where `foo` is 30% faster than `bar` (ideally):
|
||||
|
||||
```cpp
#include <cstddef>
#include <thread>
constexpr size_t FOO_N = 7 * 1000000000UL;
constexpr size_t BAR_N = 10 * 1000000000UL;

void foo()
{
    for(volatile size_t i = 0; i < FOO_N; ++i) {}
}

void bar()
{
    for(volatile size_t i = 0; i < BAR_N; ++i) {}
}

int main()
{
    std::thread _threads[] = { std::thread{ foo },
                               std::thread{ bar } };

    for(auto& itr : _threads)
        itr.join();
}
```
|
||||
|
||||
No matter how many optimizations are applied to `foo`, the application will always require the same amount of time
|
||||
because the end-to-end performance is limited by `bar`. However, a 5% speedup in `bar` will result in the
|
||||
end-to-end performance improving by 5% and this trend will continue linearly (10% speedup in `bar` yields 10% speedup in
|
||||
end-to-end performance, and so on) up to 30% speedup, at which point, `bar` executes as fast as `foo`;
|
||||
any speedup to `bar` beyond 30% will still only yield an end-to-end performance speedup of 30% since the application
|
||||
will be limited by performance of `foo`, as demonstrated below in the causal profiling visualization:
|
||||
|
||||

|
||||
|
||||
The full details of the causal profiling methodology can be found in the paper [Coz: Finding Code that Counts with Causal Profiling](http://arxiv.org/pdf/1608.03676v1.pdf).
|
||||
The authors' implementation is publicly available on [GitHub](https://github.com/plasma-umass/coz).
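The `foo`/`bar` scenario described earlier can be reproduced with a small model. This is a minimal sketch, not part of OmniTrace: it assumes each function's runtime is proportional to its loop count (7 and 10 units) and that the two threads run fully in parallel. `program_speedup` is a hypothetical helper name.

```python
FOO_WORK = 7.0   # relative cost of foo (7e9 loop iterations)
BAR_WORK = 10.0  # relative cost of bar (10e9 loop iterations)


def program_speedup(foo_speedup_pct: float = 0.0, bar_speedup_pct: float = 0.0) -> float:
    """Return the end-to-end speedup (in percent) for the given per-function speedups."""
    foo = FOO_WORK * (1.0 - foo_speedup_pct / 100.0)
    bar = BAR_WORK * (1.0 - bar_speedup_pct / 100.0)
    baseline = max(FOO_WORK, BAR_WORK)  # the slower thread sets the runtime
    return 100.0 * (1.0 - max(foo, bar) / baseline)


for pct in (0, 5, 10, 30, 50):
    print(f"speed up foo by {pct:>2}%: program {program_speedup(foo_speedup_pct=pct):5.1f}% faster | "
          f"speed up bar by {pct:>2}%: program {program_speedup(bar_speedup_pct=pct):5.1f}% faster")
```

In this model, speeding up `foo` never helps, while speeding up `bar` helps linearly until `bar` becomes as fast as `foo`, after which the curve flattens.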
|
||||
|
||||
## Getting Started
|
||||
|
||||
### Progress Points
|
||||
|
||||
Causal profiling requires "progress points" to track progress through the code in between samples. Progress points must be triggered deterministically via instrumentation.
|
||||
This can happen in three different ways:
|
||||
|
||||
1. OmniTrace can leverage the callbacks from Kokkos-Tools, OpenMP-Tools, roctracer, etc. and the wrappers around functions for MPI, NUMA, RCCL, etc. to act as progress-points
|
||||
2. User can leverage the [runtime instrumentation capabilities](instrumenting.md#runtime-instrumentation) to insert progress-points (NOTE: binary rewrite to insert progress-points is not supported)
|
||||
3. User can leverage the [User API](user_api.md), e.g. `OMNITRACE_CAUSAL_PROGRESS`
|
||||
|
||||
Please note with regard to #2, binary rewrite to insert progress-points is not supported: when a rewritten binary is executed, Dyninst translates the instruction pointer address in order
|
||||
to execute the instrumentation and, as a result, call-stack samples never return instruction pointer addresses in the ranges defined as valid by OmniTrace. Hopefully, a work-around will
|
||||
be found in the future.
|
||||
|
||||
### Key Concepts
|
||||
|
||||
| Concept          | Setting                           | Options                          | Description                                                                                                        |
|------------------|-----------------------------------|----------------------------------|--------------------------------------------------------------------------------------------------------------------|
| Backend          | `OMNITRACE_CAUSAL_BACKEND`        | `perf`, `timer`                  | Backend for recording samples required to calculate the virtual speed-up                                           |
| Mode             | `OMNITRACE_CAUSAL_MODE`           | `function`, `line`               | Select entire function or individual line of code for causal experiments                                           |
| End-to-End       | `OMNITRACE_CAUSAL_END_TO_END`     | boolean                          | Perform a single experiment during the entire run (does not require progress-points)                               |
| Fixed speedup(s) | `OMNITRACE_CAUSAL_FIXED_SPEEDUP`  | one or more values from [0, 100] | Virtual speedup or pool of virtual speedups to randomly select                                                     |
| Binary scope     | `OMNITRACE_CAUSAL_BINARY_SCOPE`   | regular expression(s)            | Dynamic binaries containing code for experiments                                                                   |
| Source scope     | `OMNITRACE_CAUSAL_SOURCE_SCOPE`   | regular expression(s)            | `<file>` and/or `<file>:<line>` containing code to include in experiments                                          |
| Function scope   | `OMNITRACE_CAUSAL_FUNCTION_SCOPE` | regular expression(s)            | Restricts experiments to matching functions (function mode) or lines of code within matching functions (line mode) |
|
||||
|
||||
#### Notes
|
||||
|
||||
1. Binary scope defaults to `%MAIN%` (executable). Scope can be expanded to include linked libraries
|
||||
2. `<file>` and `<file>:<line>` support requires debug info (i.e. code was compiled with `-g` or, preferably, `-g3`)
|
||||
3. Function mode does not require debug info but does not support stripped binaries
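As a rough illustration of how these settings fit together, the sketch below validates a few values and assembles them into an environment dictionary. The helper name is hypothetical, and two details are assumptions rather than documented behavior: that multiple fixed speedups are passed as a comma-separated string, and that booleans are written as `ON`/`OFF`.

```python
from typing import Dict, Iterable

BACKENDS = {"perf", "timer", "auto"}
MODES = {"function", "line"}


def causal_environment(backend: str, mode: str, speedups: Iterable[int],
                       end_to_end: bool = False) -> Dict[str, str]:
    """Validate causal profiling settings and return them as environment variables."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    pool = list(speedups)
    if any(not 0 <= value <= 100 for value in pool):
        raise ValueError("virtual speedups must lie in [0, 100]")
    return {
        "OMNITRACE_CAUSAL_BACKEND": backend,
        "OMNITRACE_CAUSAL_MODE": mode,
        # Assumption: booleans are spelled ON/OFF.
        "OMNITRACE_CAUSAL_END_TO_END": "ON" if end_to_end else "OFF",
        # Assumption: a pool of speedups is passed as a comma-separated list.
        "OMNITRACE_CAUSAL_FIXED_SPEEDUP": ",".join(str(value) for value in pool),
    }


env = causal_environment("perf", "function", [0, 10, 25])
for name, value in env.items():
    print(f"{name}={value}")
```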
|
||||
|
||||
### Backends
|
||||
|
||||
Both causal profiling backends interrupt each thread 1000x per second of CPU-time to apply virtual speedups.
|
||||
The difference between the backends is how the samples which are responsible for calculating the virtual speedup are recorded.
|
||||
There are 3 key differences between the two backends:
|
||||
|
||||
1. `perf` backend requires Linux Perf and elevated security privileges
|
||||
2. `perf` backend interrupts the application less frequently whereas the `timer` backend will interrupt the application 1000x per second of real time
|
||||
3. `timer` backend has less accurate call-stacks due to instruction pointer skid
|
||||
|
||||
In general, the `"perf"` backend is preferred over the `"timer"` backend when sufficient security privileges permit its usage.
|
||||
If `"OMNITRACE_CAUSAL_BACKEND"` is set to `"auto"`, Omnitrace will fall back to using the `"timer"` backend only if
|
||||
using the `"perf"` backend fails; if `"OMNITRACE_CAUSAL_BACKEND"` is set to `"perf"` and using this backend fails, Omnitrace
|
||||
will abort.
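The selection rule described above can be written as a small decision function. This is only a sketch of the documented behavior, not the actual implementation; `choose_backend` and its `perf_works` flag are hypothetical names.

```python
def choose_backend(requested: str, perf_works: bool) -> str:
    """Return the backend that would be used for the requested setting."""
    if requested == "timer":
        return "timer"
    if requested == "perf":
        if not perf_works:
            # Mirrors the documented behavior: an explicit "perf" request aborts on failure.
            raise RuntimeError("perf backend requested but unavailable")
        return "perf"
    if requested == "auto":
        # "auto" prefers perf and falls back to timer only when perf fails.
        return "perf" if perf_works else "timer"
    raise ValueError(f"unknown backend: {requested!r}")


print(choose_backend("auto", perf_works=True))
print(choose_backend("auto", perf_works=False))
```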
|
||||
|
||||
#### Instruction Pointer Skid
|
||||
|
||||
Instruction pointer (IP) skid is how many instructions execute between an event of interest
|
||||
happening and where the IP is when the kernel is able to stop the application.
|
||||
For the `"timer"` backend, this translates to the
|
||||
difference between the IP when the timer generated a signal and the IP when the
|
||||
signal was actually generated. Although IP skid does still occur with the `"perf"` backend,
|
||||
the overhead of pausing the entire thread with the `"timer"` backend makes this much more pronounced
|
||||
and, as such, the `"timer"` backend tends to have a lower resolution than the `"perf"` backend,
|
||||
especially in `"line"` mode.
|
||||
|
||||
#### Installing Linux Perf
|
||||
|
||||
Linux Perf is built into the kernel and may already be installed (e.g., included in the default kernel for OpenSUSE).
|
||||
The official method of checking whether Linux Perf is installed is checking for the existence of the file
|
||||
`/proc/sys/kernel/perf_event_paranoid` -- if the file exists, the kernel has Perf installed.
|
||||
|
||||
If this file does not exist, on Debian-based systems like Ubuntu, install (as superuser):
|
||||
|
||||
```console
|
||||
apt-get install linux-tools-common linux-tools-generic linux-tools-$(uname -r)
|
||||
```
|
||||
|
||||
and reboot your computer. In order to use the `"perf"` backend, the value of `/proc/sys/kernel/perf_event_paranoid`
|
||||
should be <= 2. If the value in this file is greater than 2, you will likely be unable to use the perf backend.
|
||||
|
||||
To update the paranoid level temporarily (until the system is rebooted), run one of the following methods
|
||||
as a superuser (where `PARANOID_LEVEL=<N>` with `<N>` in the range `[-1, 2]`):
|
||||
|
||||
```console
|
||||
echo ${PARANOID_LEVEL} | sudo tee /proc/sys/kernel/perf_event_paranoid
|
||||
sysctl kernel.perf_event_paranoid=${PARANOID_LEVEL}
|
||||
```
|
||||
|
||||
To make the paranoid level persistent after a reboot, add `kernel.perf_event_paranoid=<N>`
|
||||
(where `<N>` is the desired paranoid level) to the `/etc/sysctl.conf` file.
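The checks described in this section can also be scripted. The sketch below reads the paranoid level and applies the `<= 2` rule stated above; the helper names are hypothetical, and a missing or unreadable file is simply reported as `None`.

```python
from pathlib import Path
from typing import Optional

PARANOID_FILE = Path("/proc/sys/kernel/perf_event_paranoid")


def perf_paranoid_level(path: Path = PARANOID_FILE) -> Optional[int]:
    """Return the paranoid level, or None if the file is missing or unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def perf_backend_likely_usable(level: Optional[int]) -> bool:
    """Apply the rule above: the perf backend expects a level of 2 or lower."""
    return level is not None and level <= 2


level = perf_paranoid_level()
if level is None:
    print("perf_event_paranoid not found: Linux Perf does not appear to be available")
else:
    print(f"perf_event_paranoid = {level}; perf backend likely usable: "
          f"{perf_backend_likely_usable(level)}")
```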
|
||||
|
||||
### Speedup Prediction Variability and `omnitrace-causal` Executable
|
||||
|
||||
Causal profiling typically requires executing the application several times in order to adequately sample all the domains of executing code, experiment speedups, etc. and resolve statistical fluctuations.
|
||||
The `omnitrace-causal` executable is designed to simplify running this procedure:
|
||||
|
||||
```console
|
||||
$ omnitrace-causal --help
|
||||
[omnitrace-causal] Usage: ./bin/omnitrace-causal [ --help (count: 0, dtype: bool)
|
||||
--version (count: 0, dtype: bool)
|
||||
--monochrome (max: 1, dtype: bool)
|
||||
--debug (max: 1, dtype: bool)
|
||||
--verbose (count: 1)
|
||||
--config (min: 0, dtype: filepath)
|
||||
--launcher (count: 1, dtype: executable)
|
||||
--generate-configs (min: 0, dtype: folder)
|
||||
--no-defaults (min: 0, dtype: bool)
|
||||
--mode (count: 1, dtype: string)
|
||||
--output-name (min: 1, dtype: filename)
|
||||
--reset (max: 1, dtype: bool)
|
||||
--end-to-end (max: 1, dtype: bool)
|
||||
--wait (count: 1, dtype: seconds)
|
||||
--duration (count: 1, dtype: seconds)
|
||||
--iterations (count: 1, dtype: int)
|
||||
--speedups (min: 0, dtype: integers)
|
||||
--binary-scope (min: 0, dtype: integers)
|
||||
--source-scope (min: 0, dtype: integers)
|
||||
--function-scope (min: 0, dtype: regex-list)
|
||||
--binary-exclude (min: 0, dtype: integers)
|
||||
--source-exclude (min: 0, dtype: integers)
|
||||
--function-exclude (min: 0, dtype: regex-list)
|
||||
]
|
||||
|
||||
Causal profiling usually requires multiple runs to reliably resolve the speedup estimates.
|
||||
This executable is designed to streamline that process.
|
||||
For example (assume all commands end with '-- <exe> <args>'):
|
||||
|
||||
omnitrace-causal -n 5 -- <exe> # runs <exe> 5x with causal profiling enabled
|
||||
|
||||
omnitrace-causal -s 0 5,10,15,20 # runs <exe> 2x with virtual speedups:
|
||||
# - 0
|
||||
# - randomly selected from 5, 10, 15, and 20
|
||||
|
||||
omnitrace-causal -F func_A func_B func_(A|B) # runs <exe> 3x with the function scope limited to:
|
||||
# 1. func_A
|
||||
# 2. func_B
|
||||
# 3. func_A or func_B
|
||||
General tips:
|
||||
- Insert progress points at hotspots in your code or use omnitrace's runtime instrumentation
|
||||
- Note: binary rewrite will produce a incompatible new binary
|
||||
- Run omnitrace-causal in "function" mode first (does not require debug info)
|
||||
- Run omnitrace-causal in "line" mode when you are targeting one function (requires debug info)
|
||||
- Preferably, use predictions from the "function" mode to determine which function to target
|
||||
- Limit the virtual speedups to a smaller pool, e.g., 0,5,10,25,50, to get reliable predictions quicker
|
||||
- Make use of the binary, source, and function scope to limit the functions/lines selected for experiments
|
||||
- Note: source scope requires debug info
|
||||
|
||||
|
||||
Options:
|
||||
-h, -?, --help Shows this page
|
||||
--version Prints the version and exit
|
||||
|
||||
[DEBUG OPTIONS]
|
||||
|
||||
--monochrome Disable colorized output
|
||||
--debug Debug output
|
||||
-v, --verbose Verbose output
|
||||
|
||||
[GENERAL OPTIONS]
|
||||
|
||||
-c, --config Base configuration file
|
||||
-l, --launcher When running MPI jobs, omnitrace-causal needs to be *before* the executable which launches the MPI processes (i.e.
|
||||
before `mpirun`, `srun`, etc.). Pass the name of the target executable (or a regex for matching to the name of the
|
||||
target) for causal profiling, e.g., `omnitrace-causal -l foo -- mpirun -n 4 foo`. This ensures that the omnitrace
|
||||
library is LD_PRELOADed on the proper target
|
||||
-g, --generate-configs Generate config files instead of passing environment variables directly. If no arguments are provided, the config files
|
||||
will be placed in ${PWD}/omnitrace-causal-config folder
|
||||
--no-defaults Do not activate default features which are recommended for causal profiling. For example: PID-tagging of output files
|
||||
and timestamped subdirectories are disabled by default. Kokkos tools support is added by default
|
||||
(OMNITRACE_USE_KOKKOSP=ON) because, for Kokkos applications, the Kokkos-Tools callbacks are used for progress points.
|
||||
Activation of OpenMP tools support is similar
|
||||
|
||||
[CAUSAL PROFILING OPTIONS (General)]
|
||||
(These settings will be applied to all causal profiling runs)
|
||||
|
||||
-m, --mode [ function (func) | line ]
|
||||
Causal profiling mode
|
||||
-o, --output-name Output filename of causal profiling data w/o extension
|
||||
-r, --reset Overwrite any existing experiment results during the first run
|
||||
-e, --end-to-end Single causal experiment for the entire application runtime
|
||||
-w, --wait Set the wait time (i.e. delay) before starting the first causal experiment (in seconds)
|
||||
-d, --duration Set the length of time (in seconds) to perform causal experimentationafter the first experiment is started. Once this
|
||||
amount of time has elapsed, no more causal experiments will be started but any currently running experiment will be
|
||||
allowed to finish.
|
||||
-n, --iterations Number of times to repeat the combination of run configurations
|
||||
|
||||
[CAUSAL PROFILING OPTIONS (Combinatorial)]
|
||||
(Each individual argument to these options will multiply the number runs by the number of arguments and the number of
|
||||
iterations. E.g. -n 2 -B "MAIN" -F "foo" "bar" will produce 4 runs: 2 iterations x 1 binary scope x 2 function scopes
|
||||
(MAIN+foo, MAIN+bar, MAIN+foo, MAIN+bar))
|
||||
|
||||
-s, --speedups Pool of virtual speedups to sample from during experimentation. Each space designates a group and multiple speedups can
|
||||
be grouped together by commas, e.g. -s 0 0,10,20-50 is two groups: group #1 is '0' and group #2 is '0 10 20 25 30 35 40
|
||||
45 50'
|
||||
-B, --binary-scope Restricts causal experiments to the binaries matching the list of regular expressions. Each space designates a group
|
||||
and multiple scopes can be grouped together with a semi-colon
|
||||
-S, --source-scope Restricts causal experiments to the source files or source file + lineno pairs (i.e. <file> or <file>:<line>) matching
|
||||
the list of regular expressions. Each space designates a group and multiple scopes can be grouped together with a
|
||||
semi-colon
|
||||
-F, --function-scope Restricts causal experiments to the functions matching the list of regular expressions. Each space designates a group
|
||||
and multiple scopes can be grouped together with a semi-colon
|
||||
-BE, --binary-exclude Excludes causal experiments from being performed on the binaries matching the list of regular expressions. Each space
|
||||
designates a group and multiple excludes can be grouped together with a semi-colon
|
||||
-SE, --source-exclude Excludes causal experiments from being performed on the code from the source files or source file + lineno pair (i.e.
|
||||
<file> or <file>:<line>) matching the list of regular expressions. Each space designates a group and multiple excludes
|
||||
can be grouped together with a semi-colon
|
||||
-FE, --function-exclude Excludes causal experiments from being performed on the functions matching the list of regular expressions. Each space
|
||||
designates a group and multiple excludes can be grouped together with a semi-colon
|
||||
```
|
||||
|
||||
#### Examples
|
||||
|
||||
```bash
|
||||
#!/bin/bash -e
|
||||
|
||||
module load omnitrace
|
||||
|
||||
N=20
|
||||
I=3
|
||||
|
||||
# when providing speedups to omnitrace-causal, speedup
|
||||
# groups are separated by a space so "0,10" results in
|
||||
# one speedup group where omnitrace samples from
|
||||
# the speedup set of {0, 10}. Passing "0 10" (without
|
||||
# quotes) to omnitrace-causal multiplies the
|
||||
# number of runs by 2, where the first half of the
|
||||
# runs instruct omnitrace to only use 0 as the
|
||||
# speedup and the second half of the runs instruct
|
||||
# omnitrace to only use 10 as the speedup.
|
||||
SPEEDUPS="0,0,0,10,20,30,40,50,50,75,75,75,90,90,90"
|
||||
# thus, -s ${SPEEDUPS} only multiplies the number
|
||||
# of runs by 1 whereas -s ${SPEEDUPS_E2E} multiplies
|
||||
# the number of runs by 15:
|
||||
# - 3 runs with speedup of 0
|
||||
# - 1 run for each of the speedups 10, 20, 30, and 40
|
||||
# - 2 runs with speedup of 50
|
||||
# - 3 runs with speedup of 75
|
||||
# - 3 runs with speedup of 90
|
||||
SPEEDUPS_E2E=$(echo "${SPEEDUPS}" | sed 's/,/ /g')
|
||||
|
||||
|
||||
# 20 iterations in function mode with 1 speedup group
|
||||
# and source scope set to .cpp files
|
||||
#
|
||||
# outputs to files:
|
||||
# - causal/experiments.func.coz
|
||||
# - causal/experiments.func.json
|
||||
#
|
||||
# total executions: 20
|
||||
#
|
||||
omnitrace-causal \
|
||||
-n ${N} \
|
||||
-s ${SPEEDUPS} \
|
||||
-m function \
|
||||
-o experiments.func \
|
||||
-S ".*\\.cpp" \
|
||||
-- \
|
||||
./causal-omni-cpu "${@}"
|
||||
|
||||
|
||||
# 20 iterations in line mode with 1 speedup group
|
||||
# and source scope restricted to lines 100 and 110
|
||||
# in the causal.cpp file.
|
||||
#
|
||||
# outputs to files:
|
||||
# - causal/experiments.line.coz
|
||||
# - causal/experiments.line.json
|
||||
#
|
||||
# total executions: 20
|
||||
#
|
||||
omnitrace-causal \
|
||||
-n ${N} \
|
||||
-s ${SPEEDUPS} \
|
||||
-m line \
|
||||
-o experiments.line \
|
||||
-S "causal\\.cpp:(100|110)" \
|
||||
-- \
|
||||
./causal-omni-cpu "${@}"
|
||||
|
||||
|
||||
# 3 iterations in function mode of 15 singular speedups
|
||||
# in end-to-end mode with 2 different function scopes
|
||||
# where one is restricted to "cpu_slow_func" and
|
||||
# another is restricted to "cpu_fast_func".
|
||||
#
|
||||
# outputs to files:
|
||||
# - causal/experiments.func.e2e.coz
|
||||
# - causal/experiments.func.e2e.json
|
||||
#
|
||||
# total executions: 90
|
||||
#
|
||||
omnitrace-causal \
|
||||
-n ${I} \
|
||||
-s ${SPEEDUPS_E2E} \
|
||||
-m func \
|
||||
-e \
|
||||
-o experiments.func.e2e \
|
||||
-F "cpu_slow_func" \
|
||||
"cpu_fast_func" \
|
||||
-- \
|
||||
./causal-omni-cpu "${@}"
|
||||
|
||||
# 3 iterations in line mode of 15 singular speedups
|
||||
# in end-to-end mode with 2 different source scopes
|
||||
# where one is restricted to line 100 in causal.cpp
|
||||
# and another is restricted to line 110 in causal.cpp.
|
||||
#
|
||||
# outputs to files:
|
||||
# - causal/experiments.line.e2e.coz
|
||||
# - causal/experiments.line.e2e.json
|
||||
#
|
||||
# total executions: 90
|
||||
#
|
||||
omnitrace-causal \
|
||||
-n ${I} \
|
||||
-s ${SPEEDUPS_E2E} \
|
||||
-m line \
|
||||
-e \
|
||||
-o experiments.line.e2e \
|
||||
-S "causal\\.cpp:100" \
|
||||
"causal\\.cpp:110" \
|
||||
-- \
|
||||
./causal-omni-cpu "${@}"
|
||||
|
||||
|
||||
export OMP_NUM_THREADS=8
|
||||
export OMP_PROC_BIND=spread
|
||||
export OMP_PLACES=threads
|
||||
|
||||
# set number of iterations to 5
|
||||
N=5
|
||||
|
||||
# 5 iterations in function mode of 1 speedup
|
||||
# group with the source scope restricted
|
||||
# to files containing "lulesh" in their filename
|
||||
# and exclude functions which start with "Kokkos::"
|
||||
# or "std::enable_if".
|
||||
#
|
||||
# outputs to files:
|
||||
# - causal/experiments.func.coz
|
||||
# - causal/experiments.func.json
|
||||
#
|
||||
# total executions: 5
|
||||
#
|
||||
# First of 5 executions overwrites any
|
||||
# existing causal/experiments.func.(coz|json)
|
||||
# file due to "--reset" argument
|
||||
#
|
||||
omnitrace-causal \
|
||||
--reset \
|
||||
-n ${N} \
|
||||
-s ${SPEEDUPS} \
|
||||
-m func \
|
||||
-o experiments.func \
|
||||
-S "lulesh.*" \
|
||||
-FE "^(Kokkos::|std::enable_if)" \
|
||||
-- \
|
||||
./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p
|
||||
|
||||
|
||||
# 5 iterations in line mode of 1 speedup
|
||||
# group with the source scope restricted
|
||||
# to files containing "lulesh" in their filename
|
||||
# and exclude functions which start with "exec_range"
|
||||
# or "execute" and which contain either
|
||||
# "construct_shared_allocation" or "._omp_fn." in
|
||||
# the function name.
|
||||
#
|
||||
# outputs to files:
|
||||
# - causal/experiments.line.coz
|
||||
# - causal/experiments.line.json
|
||||
#
|
||||
# total executions: 5
|
||||
#
|
||||
# First of 5 executions overwrites any
|
||||
# existing causal/experiments.line.(coz|json)
|
||||
# file due to "--reset" argument
|
||||
#
|
||||
omnitrace-causal \
|
||||
--reset \
|
||||
-n ${N} \
|
||||
-s ${SPEEDUPS} \
|
||||
-m line \
|
||||
-o experiments.line \
|
||||
-S "lulesh.*" \
|
||||
-FE "^(exec_range|execute);construct_shared_allocation;\\._omp_fn\\." \
|
||||
-- \
|
||||
./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p
|
||||
|
||||
|
||||
# 5 iterations in line mode of 1 speedup
|
||||
# group with the source scope restricted
|
||||
# to files whose basename is "lulesh.cc"
|
||||
# for 3 different functions:
|
||||
# - ApplyMaterialPropertiesForElems
|
||||
# - CalcHourglassControlForElems
|
||||
# - CalcVolumeForceForElems
|
||||
#
|
||||
# outputs to files:
|
||||
# - causal/experiments.line.targeted.coz
|
||||
# - causal/experiments.line.targeted.json
|
||||
#
|
||||
# total executions: 15
|
||||
#
|
||||
# First of 15 executions overwrites any
|
||||
# existing causal/experiments.line.targeted.(coz|json)
|
||||
# file due to "--reset" argument
|
||||
#
|
||||
omnitrace-causal \
|
||||
--reset \
|
||||
-n ${N} \
|
||||
-s ${SPEEDUPS} \
|
||||
-m line \
|
||||
-o experiments.line.targeted \
|
||||
-F "ApplyMaterialPropertiesForElems" \
|
||||
"CalcHourglassControlForElems" \
|
||||
"CalcVolumeForceForElems" \
|
||||
-S "lulesh\\.cc" \
|
||||
-- \
|
||||
./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p
|
||||
```
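The "total executions" figures in the script comments follow from the combinatorial rule in the help text: the iteration count is multiplied by the number of space-separated groups given to each combinatorial option. The sketch below reproduces that arithmetic. The helper names are hypothetical, and the step of 5 used to expand a range such as `20-50` is inferred from the single example in the help output, not from a documented rule.

```python
from math import prod
from typing import List, Sequence


def parse_speedup_group(group: str, step: int = 5) -> List[int]:
    """Expand one comma-separated speedup group such as "0,10,20-50"."""
    values: List[int] = []
    for item in group.split(","):
        if "-" in item:
            first, last = (int(part) for part in item.split("-"))
            # Assumption: ranges expand in increments of 5, as in the help example.
            values.extend(range(first, last + 1, step))
        else:
            values.append(int(item))
    return values


def total_runs(iterations: int, *combinatorial_options: Sequence[str]) -> int:
    """Iterations multiplied by the number of groups passed to each option."""
    return iterations * prod(max(1, len(groups)) for groups in combinatorial_options)


SPEEDUPS = "0,0,0,10,20,30,40,50,50,75,75,75,90,90,90"
SPEEDUPS_E2E = SPEEDUPS.replace(",", " ").split()  # same effect as the sed call above

print(parse_speedup_group("0,10,20-50"))
print(total_runs(20, [SPEEDUPS], [r".*\.cpp"]))
print(total_runs(3, SPEEDUPS_E2E, ["cpu_slow_func", "cpu_fast_func"]))
```

The last two calls correspond to the first function-mode example and the end-to-end function-mode example in the script.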
|
||||
|
||||
#### Using `omnitrace-causal` with other launchers (e.g. `mpirun`)
|
||||
|
||||
The `omnitrace-causal` executable is intended to assist with application replay and is designed to always be at the start of the command-line (i.e. the primary process).
|
||||
`omnitrace-causal` typically adds a `LD_PRELOAD` of the OmniTrace libraries into the environment before launching the command in order to inject the functionality
|
||||
required to start the causal profiling tooling. However, this is problematic when the target application for causal profiling requires another command-line
|
||||
tool in order to run, e.g. `foo` is the target application but executing `foo` requires `mpirun -n 2 foo`. If one were to simply do `omnitrace-causal -- mpirun -n 2 foo`,
|
||||
then the causal profiling would be applied to `mpirun` instead of `foo`. `omnitrace-causal` remedies this by providing a command-line option `-l` / `--launcher`
|
||||
to indicate the target application is using a launcher script/executable. The argument to the command-line option is the name of (or regex for) the target application
|
||||
on the command-line. When `--launcher` is used, `omnitrace-causal` will generate all the replay configurations and execute them but delay adding the `LD_PRELOAD`; instead, it
|
||||
will inject a call to itself into the command-line right before the target application. This recursive call to itself will inherit the configuration from
|
||||
the parent `omnitrace-causal` executable, insert an `LD_PRELOAD` into the environment, and then invoke an `execv` to replace itself with the new process launched by the target
|
||||
application.
|
||||
|
||||
In other words, the following command:
|
||||
|
||||
```console
|
||||
omnitrace-causal -l foo -n 3 -- mpirun -n 2 foo
|
||||
```
|
||||
|
||||
Effectively results in:
|
||||
|
||||
```console
|
||||
mpirun -n 2 omnitrace-causal -- foo
|
||||
mpirun -n 2 omnitrace-causal -- foo
|
||||
mpirun -n 2 omnitrace-causal -- foo
|
||||
```
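
The command-line rewrite described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual implementation: the function name `inject_launcher` is hypothetical, and the use of an unanchored regex search to find the target is an assumption.

```python
import re


def inject_launcher(command, target_regex, profiler="omnitrace-causal"):
    """Insert `profiler --` before the first argument matching `target_regex`.

    Hypothetical helper: the real tool may match the target differently.
    Returns an unchanged copy of the command when nothing matches.
    """
    pattern = re.compile(target_regex)
    for index, arg in enumerate(command):
        if pattern.search(arg):
            return command[:index] + [profiler, "--"] + command[index:]
    return list(command)
```

For example, `inject_launcher(["mpirun", "-n", "2", "foo"], "foo")` yields the `mpirun -n 2 omnitrace-causal -- foo` form shown above.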
|
||||
|
||||
### Visualizing the Causal Output
|
||||
|
||||
OmniTrace generates a `causal/experiments.json` and `causal/experiments.coz` in `${OMNITRACE_OUTPUT_PATH}/${OMNITRACE_OUTPUT_PREFIX}`. A standalone GUI for viewing the causal profiling
|
||||
results is under development. Until it is available, visit [plasma-umass.org/coz/](https://plasma-umass.org/coz/) and open the `*.coz` file.
|
||||
|
||||
## OmniTrace vs. Coz
|
||||
|
||||
This section is intended for readers who are familiar with the [Coz profiler](https://github.com/plasma-umass/coz).
|
||||
OmniTrace provides several additional features and utilities for causal profiling:
|
||||
|
||||
| | [Coz](https://github.com/plasma-umass/coz) | [OmniTrace](https://github.com/ROCm/omnitrace) | Notes |
|
||||
|----------------------|:-------------------------------------------------------------------:|:----------------------------------------------------------:|-------------------------------|
|
||||
| Debug info | requires debug info in DWARF v3 format (`-gdwarf-3`) | optional, supports any DWARF format version | See Note #1 below |
|
||||
| Experiment selection | `<file>:<line>` | `<function>` or `<file>:<line>` | See Note #2 below |
|
||||
| Experiment speedups | Randomly samples between 0..100 in increments of 5 or one fixed speedup | Supports specifying smaller subset | See Note #3 below |
|
||||
| Scope options | Supports binary and source scopes | Supports binary, source, and function scopes | See Note #4, #5, and #6 below |
|
||||
| Scope inclusion | Uses `%` as wildcard for binary and source scopes | Full regex support for binary, source, and function scopes | |
|
||||
| Scope exclusion | Not supported | Supports regexes for excluding binary/source/function | See Note #7 below |
|
||||
| Call-stack sampling | Linux perf | Linux perf, libunwind | See Note #8 below |
|
||||
|
||||
1. OmniTrace supports a "function" mode which does not require debug info
|
||||
2. OmniTrace supports selecting the entire range of instruction pointers for a function instead of the instruction pointer for one line. In large codes, "function" mode
|
||||
can resolve in fewer iterations and once a target function is identified, one can switch to line mode and limit the function scope to the target function
|
||||
3. OmniTrace supports randomly sampling from subsets, e.g. { 0, 0, 5, 10 } where 0% is randomly selected 50% of the time and 5% and 10% are each randomly selected 25% of the time
|
||||
4. OmniTrace and COZ have the same definition for binary scope: the binaries loaded at runtime (e.g. executable and linked libraries)
|
||||
5. OmniTrace "source scope" supports both `<file>` and `<file>:<line>` formats in contrast to COZ "source scope" which requires `<file>:<line>` format
|
||||
6. OmniTrace supports a "function" scope which narrows the functions/lines which are eligible for causal experiments to those within the matching functions
|
||||
7. OmniTrace supports a second filter on scopes for removing binary/source/function caught by inclusive match, e.g. `BINARY_SCOPE=.*` + `BINARY_EXCLUDE=libmpi.*`
|
||||
initially includes all binaries but exclude regex removes MPI libraries
|
||||
8. In Omnitrace, the Linux perf backend is preferred over libunwind. However, Linux perf usage can be restricted for security reasons.
|
||||
Omnitrace will fall back to using a second POSIX timer and libunwind if Linux perf is not available.
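
Notes 3 and 7 can be illustrated with a short Python sketch. The helper names are hypothetical, and using an unanchored regex search for scope matching is an assumption rather than a statement about how OmniTrace evaluates its scope settings.

```python
import random
import re
from collections import Counter


def selection_probabilities(speedups):
    """Probability of each speedup when one list entry is picked uniformly."""
    counts = Counter(speedups)
    return {value: count / len(speedups) for value, count in counts.items()}


def pick_speedup(speedups):
    """Pick one entry uniformly; repeated entries are proportionally likelier."""
    return random.choice(speedups)


def in_scope(name, include, exclude=None):
    """Keep `name` if it matches `include` and does not match `exclude`."""
    if not re.search(include, name):
        return False
    return not (exclude and re.search(exclude, name))
```

With the subset `[0, 0, 5, 10]`, `selection_probabilities` returns `{0: 0.5, 5: 0.25, 10: 0.25}`, and `in_scope("libmpi.so.40", ".*", "libmpi.*")` is `False` because the exclude pattern removes what the include pattern admitted.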
|
||||
@@ -1,169 +0,0 @@
|
||||
# Configuration file for the Sphinx documentation builder.
|
||||
#
|
||||
# This file only contains a selection of the most common options. For a full
|
||||
# list see the documentation:
|
||||
# http://www.sphinx-doc.org/en/master/config
|
||||
|
||||
# -- Path setup --------------------------------------------------------------
|
||||
|
||||
# If extensions (or modules to document with autodoc) are in another directory,
|
||||
# add these directories to sys.path here. If the directory is relative to the
|
||||
# documentation root, use os.path.abspath to make it absolute, like shown here.
|
||||
#
|
||||
# import os
|
||||
# sys.path.insert(0, os.path.abspath('.'))
|
||||
|
||||
import os
|
||||
import sys
|
||||
import subprocess as sp
|
||||
|
||||
# If extensions (or modules to document with autodoc) are in another directory,
|
||||
# add these directories to sys.path here. If the directory is relative to the
|
||||
# documentation root, use os.path.abspath to make it absolute, like shown here.
|
||||
sys.path.insert(0, os.path.abspath(".."))
|
||||
|
||||
|
||||
def install(package):
|
||||
sp.call([sys.executable, "-m", "pip", "install", package])
|
||||
|
||||
|
||||
# Check if we're running on Read the Docs' servers
|
||||
read_the_docs_build = os.environ.get("READTHEDOCS", None) == "True"
|
||||
|
||||
|
||||
# -- Project information -----------------------------------------------------
|
||||
project = "omnitrace"
|
||||
copyright = "2022, Advanced Micro Devices, Inc."
|
||||
author = "Audacious Software Group"
|
||||
|
||||
project_root = os.path.normpath(os.path.join(os.getcwd(), "..", ".."))
|
||||
version = open(os.path.join(project_root, "VERSION")).read().strip()
|
||||
# The full version, including alpha/beta/rc tags
|
||||
release = version
|
||||
|
||||
_docdir = os.path.realpath(os.getcwd())
|
||||
_srcdir = os.path.realpath(os.path.join(os.getcwd(), ".."))
|
||||
_sitedir = os.path.realpath(os.path.join(os.getcwd(), "..", "site"))
|
||||
_staticdir = os.path.realpath(os.path.join(_docdir, "_static"))
|
||||
_templatedir = os.path.realpath(os.path.join(_docdir, "_templates"))
|
||||
|
||||
if not os.path.exists(_staticdir):
|
||||
os.makedirs(_staticdir)
|
||||
|
||||
if not os.path.exists(_templatedir):
|
||||
os.makedirs(_templatedir)
|
||||
|
||||
|
||||
# -- General configuration ---------------------------------------------------
|
||||
|
||||
install("sphinx_rtd_theme")
|
||||
|
||||
# Add any Sphinx extension module names here, as strings. They can be
|
||||
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
|
||||
# ones.
|
||||
extensions = [
|
||||
"sphinx.ext.autodoc",
|
||||
"sphinx.ext.doctest",
|
||||
"sphinx.ext.todo",
|
||||
"sphinx.ext.viewcode",
|
||||
"sphinx.ext.githubpages",
|
||||
"sphinx.ext.mathjax",
|
||||
"sphinx.ext.autosummary",
|
||||
"sphinx.ext.napoleon",
|
||||
"sphinx_markdown_tables",
|
||||
"recommonmark",
|
||||
"breathe",
|
||||
]
|
||||
|
||||
source_suffix = {
|
||||
".rst": "restructuredtext",
|
||||
".md": "markdown",
|
||||
}
|
||||
|
||||
from recommonmark.parser import CommonMarkParser
|
||||
|
||||
source_parsers = {".md": CommonMarkParser}
|
||||
|
||||
# Add any paths that contain templates here, relative to this directory.
|
||||
templates_path = ["_templates"]
|
||||
|
||||
# The master toctree document.
|
||||
master_doc = "index"
|
||||
|
||||
# List of patterns, relative to source directory, that match files and
|
||||
# directories to ignore when looking for source files.
|
||||
# This pattern also affects html_static_path and html_extra_path.
|
||||
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
|
||||
|
||||
default_role = None
|
||||
|
||||
# -- Options for HTML output -------------------------------------------------
|
||||
|
||||
# The theme to use for HTML and HTML Help pages. See the documentation for
|
||||
# a list of builtin themes.
|
||||
#
|
||||
html_theme = "sphinx_rtd_theme"
|
||||
|
||||
# Add any paths that contain custom static files (such as style sheets) here,
|
||||
# relative to this directory. They are copied after the builtin static files,
|
||||
# so a file named "default.css" will overwrite the builtin "default.css".
|
||||
html_static_path = ["_static"]
|
||||
|
||||
html_theme_options = {
|
||||
"analytics_id": "G-1HLBBRSTT9", # Provided by Google in your dashboard
|
||||
"analytics_anonymize_ip": False,
|
||||
"logo_only": False,
|
||||
"display_version": True,
|
||||
"prev_next_buttons_location": "bottom",
|
||||
"style_external_links": False,
|
||||
"vcs_pageview_mode": "",
|
||||
# 'style_nav_header_background': 'white',
|
||||
# Toc options
|
||||
"collapse_navigation": True,
|
||||
"sticky_navigation": True,
|
||||
"navigation_depth": 4,
|
||||
"includehidden": True,
|
||||
"titles_only": False,
|
||||
}
|
||||
|
||||
# Breathe Configuration
|
||||
breathe_projects = {"omnitrace": "_doxygen/xml"}
|
||||
breathe_default_project = "omnitrace"
|
||||
breathe_default_members = ("members",)
|
||||
breathe_projects_source = {
|
||||
"omnitrace": (
|
||||
os.path.join(project_root, "source", "lib", "omnitrace-user"),
|
||||
[
|
||||
"omnitrace/types.h",
|
||||
"omnitrace/categories.h",
|
||||
"omnitrace/user.h",
|
||||
"omnitrace/causal.h",
|
||||
],
|
||||
)
|
||||
}
|
||||
|
||||
from pygments.styles import get_all_styles
|
||||
|
||||
# The name of the Pygments (syntax highlighting) style to use.
|
||||
styles = list(get_all_styles())
|
||||
preferences = ("emacs", "pastie", "colorful")
|
||||
for pref in preferences:
|
||||
if pref in styles:
|
||||
pygments_style = pref
|
||||
break
|
||||
|
||||
from recommonmark.transform import AutoStructify
|
||||
|
||||
|
||||
# app setup hook
|
||||
def setup(app):
|
||||
app.add_config_value(
|
||||
"recommonmark_config",
|
||||
{
|
||||
"auto_toc_tree_section": "Contents",
|
||||
"enable_eval_rst": True,
|
||||
"enable_auto_doc_ref": False,
|
||||
},
|
||||
True,
|
||||
)
|
||||
app.add_transform(AutoStructify)
|
||||
@@ -1,10 +0,0 @@
|
||||
# Critical Trace Support
|
||||
|
||||
```eval_rst
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 4
|
||||
```
|
||||
|
||||
Critical trace support has been superseded by causal profiling support.
|
||||
Critical trace support was removed in Omnitrace v1.11.0 due to incomplete implementation.
|
||||
@@ -1,307 +0,0 @@
|
||||
# Development Guide
|
||||
|
||||
## Miscellaneous Info
|
||||
|
||||
- [CDash Testing Dashboard](https://my.cdash.org/index.php?project=Omnitrace)
|
||||
- requires login to view
|
||||
|
||||
## Executables
|
||||
|
||||
### omnitrace-avail: [source/bin/omnitrace-avail](https://github.com/ROCm/omnitrace/tree/amd-mainline/source/bin/omnitrace-avail)
|
||||
|
||||
The main of `omnitrace-avail` has three important sections:
|
||||
|
||||
1. Printing components
|
||||
2. Printing options
|
||||
3. Printing hardware counters
|
||||
|
||||
### omnitrace-sample: [source/bin/omnitrace-sample](https://github.com/ROCm/omnitrace/tree/amd-mainline/source/bin/omnitrace-sample)
|
||||
|
||||
General design:
|
||||
|
||||
- Requires a command-line format of `omnitrace-sample <options> -- <command> <command-args>`
|
||||
- Translates command line options into environment variables
|
||||
- Adds `libomnitrace-dl.so` to `LD_PRELOAD`
|
||||
- Application is launched via `execvpe` with `<command> <command-args>` and modified environment
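
A rough Python sketch of this design follows. It is not the tool's source: the function names are hypothetical, and prepending (rather than appending) the library to an existing `LD_PRELOAD` is an assumption.

```python
import os


def build_environment(base_env, settings, preload_lib="libomnitrace-dl.so"):
    """Merge translated settings into a copy of the environment and add the preload.

    `settings` stands in for command-line options already translated into
    environment variables; the translation table itself is not shown here.
    """
    env = dict(base_env)
    env.update(settings)
    existing = env.get("LD_PRELOAD", "")
    env["LD_PRELOAD"] = f"{preload_lib}:{existing}" if existing else preload_lib
    return env


def launch(command, env):
    """Replace the current process with `command`, using the modified environment."""
    os.execvpe(command[0], command, env)
```

Calling `launch(["./app", "arg"], build_environment(os.environ, {"OMNITRACE_SAMPLING_FREQ": "500"}))` would not return on success, because `execvpe` replaces the calling process.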
|
||||
|
||||
### omnitrace-causal: [source/bin/omnitrace-causal](https://github.com/ROCm/omnitrace/tree/amd-mainline/source/bin/omnitrace-causal)
|
||||
|
||||
Nearly identical design to [omnitrace-sample](#omnitrace-sample-sourcebinomnitrace-sample) when
|
||||
there is exactly one causal profiling configuration variant (this enables debugging).
|
||||
|
||||
When more than one causal profiling configuration variant is produced from command-line options,
|
||||
for each variant:
|
||||
|
||||
- `omnitrace-causal` calls `fork()`
|
||||
- child process launches `<command> <command-args>` via `execvpe` with the modified environment for that variant
|
||||
- parent process waits for child process to finish
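
The fork, exec, and wait loop above can be sketched in Python as follows. This is a minimal illustration with a hypothetical function name; each variant is modeled as a dictionary of environment overrides, which is an assumption about how a variant is represented.

```python
import os


def run_variants(command, variant_envs):
    """Run `command` once per variant, sequentially, and collect exit codes."""
    exit_codes = []
    for variant in variant_envs:
        env = dict(os.environ, **variant)
        pid = os.fork()
        if pid == 0:
            # Child: replace this process with the target command.
            try:
                os.execvpe(command[0], command, env)
            finally:
                # Only reached if exec failed; never return into the parent's loop.
                os._exit(127)
        # Parent: wait for this variant to finish before starting the next one.
        _, status = os.waitpid(pid, 0)
        exit_codes.append(os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1)
    return exit_codes
```

For instance, `run_variants(["sh", "-c", "exit $CODE"], [{"CODE": "0"}, {"CODE": "3"}])` runs the shell twice with a different environment each time.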
|
||||
|
||||
### omnitrace-instrument: [source/bin/omnitrace-instrument](https://github.com/ROCm/omnitrace/tree/amd-mainline/source/bin/omnitrace-instrument)
|
||||
|
||||
- Requires a command-line format of `omnitrace-instrument <options> -- <command> <command-args>`
|
||||
- User specifies in options whether they want to do runtime instrumentation, binary rewrite, or attach to process
|
||||
- Either opens the instrumentation target (binary rewrite), launches the target and stops it before it starts executing main (runtime), or
|
||||
attaches to running executable and pauses it
|
||||
- Finds all functions in target(s)
|
||||
- Finds `libomnitrace-dl` and finds the instrumentation functions
|
||||
- Iterates over all the functions and instruments them as long as they satisfy the defined criteria (minimum number of instructions, etc.)
|
||||
- See the `module_function` class
|
||||
- Up to this point the workflow is the same for all modes, but once the instrumentation is complete, it diverges
|
||||
- For a binary rewrite: outputs new instrumented binary and exits
|
||||
- For runtime instrumentation or attaching to a process: instructs the application to resume executing and then waits for the application to exit
|
||||
|
||||
## Libraries
|
||||
|
||||
### Common Library: [source/lib/common](https://github.com/ROCm/omnitrace/tree/amd-mainline/source/lib/common)
|
||||
|
||||
General header-only functionality used in multiple executables and/or libraries. Not installed or exported outside of the build tree.
|
||||
|
||||
### Core Library: [source/lib/core](https://github.com/ROCm/omnitrace/tree/amd-mainline/source/lib/core)
|
||||
|
||||
Static PIC library with functionality that does not depend on any components. Not installed or exported outside of the build tree.
|
||||
|
||||
### Binary Library: [source/lib/binary](https://github.com/ROCm/omnitrace/tree/amd-mainline/source/lib/binary)
|
||||
|
||||
Static PIC library with functionality for reading/analyzing binary info. Mostly used by the causal profiling sections
|
||||
of [libomnitrace](#libomnitrace-sourcelibomnitrace). Not installed or exported outside of the build tree.
|
||||
|
||||
### libomnitrace: [source/lib/omnitrace](https://github.com/ROCm/omnitrace/tree/amd-mainline/source/lib/omnitrace)
|
||||
|
||||
This is the main library encapsulating all the capabilities.
|
||||
|
||||
### libomnitrace-dl: [source/lib/omnitrace-dl](https://github.com/ROCm/omnitrace/tree/amd-mainline/source/lib/omnitrace-dl)
|
||||
|
||||
Lightweight, front-end library for [libomnitrace](#libomnitrace-sourcelibomnitrace) which serves 3 primary purposes:
|
||||
|
||||
1. Dramatically speeds up instrumentation time vs. using [libomnitrace](#libomnitrace-sourcelibomnitrace) directly since Dyninst must parse the entire library in order to find the instrumentation functions ([libomnitrace](#libomnitrace-sourcelibomnitrace) is dlopen'ed when the instrumentation functions get called)
|
||||
2. Prevents re-entry if [libomnitrace](#libomnitrace-sourcelibomnitrace) calls an instrumented function internally
|
||||
3. Coordinates communication between [libomnitrace-user](#libomnitrace-user-sourcelibomnitrace-user) and [libomnitrace](#libomnitrace-sourcelibomnitrace)
|
||||
|
||||
### libomnitrace-user: [source/lib/omnitrace-user](https://github.com/ROCm/omnitrace/tree/amd-mainline/source/lib/omnitrace-user)
|
||||
|
||||
Provides a set of functions and types for the users to add to their code, e.g. disabling data collection globally or on a specific thread,
|
||||
user-defined regions, etc. If [libomnitrace-dl](#libomnitrace-dl-sourcelibomnitrace-dl) is not loaded, the user API is effectively no-op
|
||||
function calls.
|
||||
|
||||
## Concepts
|
||||
|
||||
### Component
|
||||
|
||||
Most measurements and capabilities are encapsulated into a "component" with the following definitions:
|
||||
|
||||
- Measurement: recording of some data relevant to performance, e.g. current call-stack, hardware counter values, current memory usage, timestamp
|
||||
- Capability: handles the implementation or orchestration of some feature which is used to collect measurements, e.g. a component which handles setting up function wrappers around various functions such as `pthread_create`, `MPI_Init`, etc.
|
||||
|
||||
Components are designed to hold no data at all or only the data for both an instantaneous measurement and a phase measurement.
|
||||
|
||||
Components which store data typically implement a static `record()` function (for getting a record of the measurement),
|
||||
`start()` + `stop()` member functions for calculating a phase measurement, and a `sample()` member function for storing an
|
||||
instantaneous measurement. In reality, there are several more "standard" functions but these are the most often used ones.
|
||||
|
||||
Components which do not store data may also have `start()`, `stop()`, and `sample()` functions but for components which
|
||||
implement function wrappers, they typically provide a call operator or `audit(...)` functions which are invoked with the
|
||||
wrappee function's arguments before the wrappee gets called and with the return value after the wrappee gets called.
|
||||
|
||||
***The goal of this design is to provide relatively small and reusable lightweight objects for recording measurements
|
||||
and/or implementing capabilities.***
|
||||
|
||||
#### Wall-Clock Component Example
|
||||
|
||||
A component for computing the elapsed wall-clock time looks like this:
|
||||
|
||||
```cpp
|
||||
struct wall_clock
|
||||
{
|
||||
using value_type = int64_t;
|
||||
|
||||
static value_type record() noexcept
|
||||
{
|
||||
return std::chrono::steady_clock::now().time_since_epoch().count();
|
||||
}
|
||||
|
||||
void sample() noexcept
|
||||
{
|
||||
value = record();
|
||||
}
|
||||
|
||||
void start() noexcept
|
||||
{
|
||||
value = record();
|
||||
}
|
||||
|
||||
void stop() noexcept
|
||||
{
|
||||
auto _start_value = value;
|
||||
value = record();
|
||||
accum += (value - _start_value);
|
||||
}
|
||||
|
||||
private:
|
||||
int64_t value = 0;
|
||||
int64_t accum = 0;
|
||||
};
|
||||
```
|
||||
|
||||
#### Function Wrapper Component Example
|
||||
|
||||
A component which implements wrappers around `fork()` and `exit(int)` (and stores no data) may look like this:
|
||||
|
||||
```cpp
|
||||
struct function_wrapper
|
||||
{
|
||||
pid_t operator()(const gotcha_data&, pid_t (*real_fork)())
|
||||
{
|
||||
// disable all collection before forking
|
||||
categories::disable_categories(config::get_enabled_categories());
|
||||
|
||||
auto _pid_v = real_fork();
|
||||
|
||||
// only re-enable collection on parent process
|
||||
if(_pid_v != 0)
|
||||
categories::enable_categories(config::get_enabled_categories());
|
||||
|
||||
return _pid_v;
|
||||
}
|
||||
|
||||
void operator()(const gotcha_data&, void (*real_exit)(int), int _exit_code)
|
||||
{
|
||||
// catch the call to exit and finalize before truly exiting
|
||||
omnitrace_finalize();
|
||||
|
||||
real_exit(_exit_code);
|
||||
}
|
||||
};
|
||||
```
|
||||
|
||||
#### Component Member Functions
|
||||
|
||||
There are no real restrictions or requirements on the member functions a component needs to provide.
|
||||
Unless the component is being directly used, invocation of component member functions via "component bundlers"
|
||||
(provided via timemory) makes extensive use of template metaprogramming concepts to find the best match (if any)
|
||||
for calling a component's member function. This is a bit easier to demonstrate via example:
|
||||
|
||||
```cpp
|
||||
struct foo
|
||||
{
|
||||
void sample() { puts("foo::sample()"); }
|
||||
};
|
||||
|
||||
struct bar
|
||||
{
|
||||
void sample(int) { puts("bar::sample(int)"); }
|
||||
};
|
||||
|
||||
struct spam
|
||||
{
|
||||
void start() { puts("spam::start()"); }
|
||||
void stop() { puts("spam::stop()"); }
|
||||
};
|
||||
|
||||
int main()
|
||||
{
|
||||
auto _bundle = component_tuple<foo, bar, spam>{ "main" };
|
||||
|
||||
puts("A");
|
||||
_bundle.start();
|
||||
|
||||
puts("B");
|
||||
_bundle.sample(10);
|
||||
|
||||
puts("C");
|
||||
_bundle.sample();
|
||||
|
||||
puts("D");
|
||||
_bundle.stop();
|
||||
}
|
||||
```
|
||||
|
||||
In the above, this would be the message printed:
|
||||
|
||||
```console
|
||||
A
|
||||
spam::start()
|
||||
B
|
||||
foo::sample()
|
||||
bar::sample(int)
|
||||
C
|
||||
foo::sample()
|
||||
D
|
||||
spam::stop()
|
||||
```
|
||||
|
||||
In section A, the bundle determined only the `spam` object had a `start` function. Since this is determined
|
||||
via template metaprogramming instead of dynamic polymorphism, this effectively elides any code related to
|
||||
the `foo` or `bar` objects. In section B, since an integer of `10` was passed to the bundle,
|
||||
the bundle forwards that value onto `bar::sample(int)` after it invokes `foo::sample()` -- which
|
||||
is invoked because the bundle recognizes that the call to the `sample` member function is still possible without
|
||||
the arguments.
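
The matching rule can be mimicked at runtime in Python. This is only an analogue: the bundlers resolve calls at compile time through template metaprogramming, whereas this sketch uses introspection, and the `Bundle` class and its `invoke` method are hypothetical names.

```python
import inspect


class Foo:
    def sample(self):
        return "foo::sample()"


class Bar:
    def sample(self, value):
        return "bar::sample(int)"


class Spam:
    def start(self):
        return "spam::start()"

    def stop(self):
        return "spam::stop()"


class Bundle:
    """Runtime stand-in for a component bundler (illustration only)."""

    def __init__(self, *components):
        self.components = components

    def invoke(self, name, *args):
        calls = []
        for component in self.components:
            method = getattr(component, name, None)
            if method is None:
                continue  # this component has no such member function
            # Prefer a call with the given arguments, then one without any.
            for candidate in (args, ()):
                try:
                    inspect.signature(method).bind(*candidate)
                except TypeError:
                    continue
                calls.append(method(*candidate))
                break
        return calls
```

With `Bundle(Foo(), Bar(), Spam())`, `invoke("sample", 10)` reaches both `Foo.sample()` and `Bar.sample(10)`, while `invoke("sample")` reaches only `Foo.sample()`.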
|
||||
|
||||
## Memory Model
|
||||
|
||||
Collected data is generally stored in one of the following 3 places:
|
||||
|
||||
1. Perfetto (i.e. data is handed directly to perfetto)
|
||||
2. Managed implicitly by timemory and accessed as needed
|
||||
3. Thread-local data
|
||||
|
||||
In general, only instrumentation for relatively simple data is directly passed to Perfetto and/or timemory during runtime.
|
||||
For example, the callbacks from binary instrumentation, user API instrumentation, and roctracer directly invoke
|
||||
calls to Perfetto and/or timemory's storage model. Otherwise, the data is stored by omnitrace in the thread-data model
|
||||
which is more persistent than simply using `thread_local` static data (which is problematic because the data gets deleted
|
||||
when a thread terminates).
|
||||
|
||||
### Thread Identification
|
||||
|
||||
Each CPU thread is assigned two integral identifiers. One identifier is simply an atomic increment every time a new thread is created
|
||||
(called `internal_value`).
|
||||
The other identifier (called `sequent_value`) tries to account for the fact that OmniTrace, Perfetto, ROCm, etc. start background threads. When a thread is created as a byproduct of OmniTrace, the index is offset by a large value. This serves
|
||||
two purposes: (1) accessing the data for threads created by the user is closer in memory and (2) when log messages are printed,
|
||||
the index more-or-less correlates to the order of thread creation to the user's knowledge.
|
||||
|
||||
The `sequent_value` is typically the one used to access the thread-data.
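
A small Python sketch of the two-identifier scheme follows. The class name, the lock-based counter, and the offset of 1024 are all assumptions made for illustration; the source only says that the offset is "a large value".

```python
import itertools
import threading

INTERNAL_OFFSET = 1024  # hypothetical; stands in for the "large value"


class ThreadIds:
    """Hands out (internal_value, sequent_value) pairs for new threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._internal = itertools.count()
        self._user_sequent = itertools.count()
        self._background_sequent = itertools.count(INTERNAL_OFFSET)

    def assign(self, is_background):
        with self._lock:
            # internal_value increases for every thread, regardless of origin.
            internal_value = next(self._internal)
            # sequent_value stays dense for user threads; tool threads are offset.
            counter = self._background_sequent if is_background else self._user_sequent
            return internal_value, next(counter)
```

A user thread created after one background thread still receives the next small `sequent_value`, which keeps user thread data close together and keeps log indices in the order the user expects.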
|
||||
|
||||
### Thread-Data Class
|
||||
|
||||
Currently, most thread data is effectively stored in a static `std::array<std::unique_ptr<T>, OMNITRACE_MAX_THREADS>` instance.
|
||||
`OMNITRACE_MAX_THREADS` is a value defined at compile time and set to 2048 for release builds. During finalization,
|
||||
omnitrace iterates over all the thread-data and then transforms that data into something that is passed to perfetto and/or timemory.
|
||||
The downside of the current model is that if the user exceeds `OMNITRACE_MAX_THREADS`, omnitrace segfaults. To fix this issue,
|
||||
a new model is being adopted which has all the benefits of this model but permits dynamic expansion.
|
||||
|
||||
## Sampling Model
|
||||
|
||||
The general structure for the sampling is within timemory (`source/timemory/sampling`). Currently, all sampling is done per-thread
|
||||
via POSIX timers. Omnitrace supports using a realtime timer and a CPU-time timer. Both have adjustable frequencies, delays, and durations.
|
||||
By default, only CPU-time sampling is enabled. Initial settings are inherited from the settings starting with `OMNITRACE_SAMPLING_`.
|
||||
For each type of timer, there exists timer-specific settings that can be used to override the common/inherited settings for that timer
|
||||
specifically. For the CPU-time sampler, these settings start with `OMNITRACE_SAMPLING_CPUTIME` and `OMNITRACE_SAMPLING_REALTIME` for
|
||||
the realtime sampler. For example, `OMNITRACE_SAMPLING_FREQ=500` initially sets the sampling frequency to 500 interrupts per second
|
||||
(based on their clock). Setting `OMNITRACE_SAMPLING_REALTIME_FREQ=10` will lower the sampling frequency for the realtime sampler
|
||||
to 10 interrupts per second of realtime.
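
The override rule can be expressed as a simple lookup. The function below is a hypothetical sketch; it only assumes that a timer-specific variable takes precedence over the common `OMNITRACE_SAMPLING_` variable with the same suffix.

```python
def effective_setting(env, timer, key, default=None):
    """Return the timer-specific value if set, else the common sampling value."""
    specific = f"OMNITRACE_SAMPLING_{timer.upper()}_{key.upper()}"
    common = f"OMNITRACE_SAMPLING_{key.upper()}"
    return env.get(specific, env.get(common, default))
```

With `OMNITRACE_SAMPLING_FREQ=500` and `OMNITRACE_SAMPLING_REALTIME_FREQ=10` in `env`, the realtime sampler resolves to `10` while the CPU-time sampler keeps the inherited `500`.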
|
||||
|
||||
The omnitrace-specific implementation can be found in [source/lib/omnitrace/library/sampling.cpp](https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/sampling.cpp).
|
||||
Within [sampling.cpp](https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/sampling.cpp), you will find a bundle of 3 sampling components:
|
||||
`backtrace_timestamp`, `backtrace`, and `backtrace_metrics`.
|
||||
The first component [backtrace_timestamp](https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/components/backtrace_timestamp.hpp) simply
|
||||
records the wall-clock time of the sample.
|
||||
The second component [backtrace](https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/components/backtrace.hpp) records the call-stack via libunwind.
|
||||
The last component [backtrace_metrics](https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/components/backtrace_metrics.hpp) is responsible for recording the
|
||||
metrics for that sample, e.g. peak RSS, HW counters, etc. These 3 components are bundled together in a tuple-like struct (e.g. `tuple<backtrace_timestamp, backtrace, backtrace_metrics>`)
|
||||
and a buffer of at least 1024 instances of this tuple is mmap'ed per-thread. When this buffer is full, before taking the next sample, the sampler will hand the buffer
|
||||
off to its allocator thread and mmap a new buffer. The allocator thread takes this data and either dynamically stores it in memory or writes it to a file depending on the value of `OMNITRACE_USE_TEMPORARY_FILES`.
|
||||
This scheme avoids all allocations in the signal handler, allows the data to grow dynamically, avoids potentially slow I/O within the signal handler, and also enables the capability to avoid I/O altogether.
|
||||
The maximum number of samplers handled by each allocator is governed by the `OMNITRACE_SAMPLING_ALLOCATOR_SIZE` setting (the default is 8) -- whenever an allocator has reached its limit,
|
||||
a new internal thread is created to handle the new samplers.
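
The buffer hand-off can be modeled in Python as below. This is a conceptual sketch only: a queue stands in for the hand-off to the allocator thread, plain lists stand in for the mmap'ed buffers, and the class and function names are hypothetical.

```python
import queue


class Sampler:
    """Fills a fixed-capacity buffer and hands it off when it is full."""

    def __init__(self, capacity, handoff):
        self.capacity = capacity
        self.handoff = handoff  # queue consumed by the allocator side
        self.buffer = []

    def take_sample(self, sample):
        # When the buffer is full, hand it off before taking the next sample.
        if len(self.buffer) == self.capacity:
            self.handoff.put(self.buffer)
            self.buffer = []
        self.buffer.append(sample)


def drain(handoff, storage):
    """Allocator side: move every handed-off buffer into long-term storage."""
    while True:
        try:
            storage.extend(handoff.get_nowait())
        except queue.Empty:
            return
```

The sampler never grows its own buffer or performs I/O; all storage decisions happen on the consuming side, which mirrors the separation between signal handler and allocator thread described above.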
|
||||
|
||||
## Time-Window Constraint Model
|
||||
|
||||
Recently with the introduction of tracing delay/duration/etc., the [constraint namespace](https://github.com/ROCm/omnitrace/blob/main/source/lib/core/constraint.hpp)
|
||||
was introduced to improve the management of delays and/or duration limits of data collection. The `spec` class takes a clock identifier, a delay value, a duration value, and an
|
||||
integer indicating how many times to repeat the delay + duration. Thus, it is possible to perform tasks such as periodically enabling tracing for brief periods
|
||||
of time in between long periods without data collection during the application, e.g. `OMNITRACE_TRACE_PERIODS = realtime:10:1:5 process_cputime:10:2:20` would enable
|
||||
five periods of no data collection for 10 seconds of realtime followed by 1 second of data collection + twenty periods of no data collection for 10 seconds
|
||||
of process CPU time followed by 2 CPU-time seconds of data collection.
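
As an illustration, the example value can be parsed and evaluated with the Python sketch below. The names `Spec`, `parse_trace_periods`, and `is_collecting` are hypothetical, and the field order (clock, delay, duration, repeat) is inferred from the description above rather than taken from the implementation.

```python
from typing import NamedTuple


class Spec(NamedTuple):
    clock: str
    delay: float
    duration: float
    repeat: int


def parse_trace_periods(text):
    """Parse space-separated `clock:delay:duration:repeat` entries."""
    specs = []
    for entry in text.split():
        clock, delay, duration, repeat = entry.split(":")
        specs.append(Spec(clock, float(delay), float(duration), int(repeat)))
    return specs


def is_collecting(spec, elapsed):
    """True if `elapsed` (on the spec's clock) falls inside a collection window.

    Assumes delay + duration is greater than zero.
    """
    cycle = spec.delay + spec.duration
    index, offset = divmod(elapsed, cycle)
    return index < spec.repeat and offset >= spec.delay
```

For `realtime:10:1:5`, collection is active at 10.5 seconds (inside the first window) and inactive at 5 seconds (still in the delay) or at 65.5 seconds (all five periods are over).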
|
||||
|
||||
Eventually, the goal is for all subsets of data collection which currently support more rudimentary models of time window constraints, such as process sampling and causal profiling,
|
||||
to be migrated to this model.
|
||||
@@ -1,196 +0,0 @@
|
||||
name: omnitrace-docs
|
||||
channels:
|
||||
- conda-forge
|
||||
- defaults
|
||||
dependencies:
|
||||
- _libgcc_mutex=0.1=conda_forge
|
||||
- _openmp_mutex=4.5=1_gnu
|
||||
- alabaster=0.7.12=py_0
|
||||
- alsa-lib=1.2.3=h516909a_0
|
||||
- argh=0.26.2=pyh9f0ad1d_1002
|
||||
- atk-1.0=2.36.0=h3371d22_4
|
||||
- babel=2.9.1=pyh44b312d_0
|
||||
- breathe=4.29.2=pyhd8ed1ab_0
|
||||
- brotli=1.0.9=h7f98852_6
|
||||
- brotli-bin=1.0.9=h7f98852_6
|
||||
- brotlipy=0.7.0=py39h3811e60_1003
|
||||
- bzip2=1.0.8=h7f98852_4
|
||||
- c-ares=1.18.1=h7f98852_0
|
||||
- ca-certificates=2021.10.8=ha878542_0
|
||||
- cairo=1.16.0=ha00ac49_1009
|
||||
- certifi=2021.10.8=py39hf3d152e_1
|
||||
- cffi=1.15.0=py39h4bc2ebd_0
|
||||
- charset-normalizer=2.0.12=pyhd8ed1ab_0
|
||||
- click=8.0.4=py39hf3d152e_0
|
||||
- cmake=3.22.2=h1021d11_0
|
||||
- colorama=0.4.4=pyh9f0ad1d_0
|
||||
- commonmark=0.9.1=py_0
|
||||
- cryptography=36.0.1=py39h95dcef6_0
|
||||
- curl=7.81.0=h2574ce0_0
|
||||
- cycler=0.11.0=pyhd8ed1ab_0
|
||||
- dbus=1.13.6=h5008d03_3
|
||||
- docutils=0.16=py39hf3d152e_3
|
||||
- doxygen=1.9.2=hb166930_0
|
||||
- expat=2.4.4=h9c3ff4c_0
|
||||
- font-ttf-dejavu-sans-mono=2.37=hab24e00_0
|
||||
- font-ttf-inconsolata=3.000=h77eed37_0
|
||||
- font-ttf-source-code-pro=2.038=h77eed37_0
|
||||
- font-ttf-ubuntu=0.83=hab24e00_0
|
||||
- fontconfig=2.13.96=ha180cfb_0
|
||||
- fonts-conda-ecosystem=1=0
|
||||
- fonts-conda-forge=1=0
|
||||
- fonttools=4.29.1=py39h3811e60_0
|
||||
- freetype=2.10.4=h0708190_1
|
||||
- fribidi=1.0.10=h36c2ea0_0
|
||||
- future=0.18.2=py39hf3d152e_4
|
||||
- gdk-pixbuf=2.42.6=h04a7f16_0
|
||||
- gettext=0.19.8.1=h73d1719_1008
|
||||
- ghp-import=2.0.2=pyhd8ed1ab_0
|
||||
- giflib=5.2.1=h36c2ea0_2
|
||||
- git=2.35.0=pl5321hc30692c_0
|
||||
- graphite2=1.3.13=h58526e2_1001
|
||||
- graphviz=2.50.0=h8e749b2_2
|
||||
- gst-plugins-base=1.18.5=hf529b03_3
|
||||
- gstreamer=1.18.5=h9f60fe5_3
|
||||
- gtk2=2.24.33=h90689f9_2
|
||||
- gts=0.7.6=h64030ff_2
|
||||
- harfbuzz=3.4.0=hb4a5f5f_0
|
||||
- icu=69.1=h9c3ff4c_0
|
||||
- idna=3.3=pyhd8ed1ab_0
|
||||
- imagesize=1.3.0=pyhd8ed1ab_0
|
||||
- importlib-metadata=4.11.1=py39hf3d152e_0
|
||||
- jbig=2.1=h7f98852_2003
|
||||
- jinja2=3.0.3=pyhd8ed1ab_0
|
||||
- jpeg=9e=h7f98852_0
|
||||
- kiwisolver=1.3.2=py39h1a9c180_1
|
||||
- krb5=1.19.2=hcc1bbae_3
|
||||
- lcms2=2.12=hddcbb42_0
|
||||
- ld_impl_linux-64=2.36.1=hea4e1c9_2
|
||||
- lerc=3.0=h9c3ff4c_0
|
||||
- libblas=3.9.0=13_linux64_openblas
|
||||
- libbrotlicommon=1.0.9=h7f98852_6
|
||||
- libbrotlidec=1.0.9=h7f98852_6
|
||||
- libbrotlienc=1.0.9=h7f98852_6
|
||||
- libcblas=3.9.0=13_linux64_openblas
|
||||
- libclang=13.0.1=default_hc23dcda_0
|
||||
- libcurl=7.81.0=h2574ce0_0
|
||||
- libdeflate=1.10=h7f98852_0
|
||||
- libedit=3.1.20191231=he28a2e2_2
|
||||
- libev=4.33=h516909a_1
|
||||
- libevent=2.1.10=h9b69904_4
|
||||
- libffi=3.4.2=h7f98852_5
|
||||
- libgcc-ng=11.2.0=h1d223b6_12
|
||||
- libgd=2.3.3=h3cfcdeb_1
|
||||
- libgfortran-ng=11.2.0=h69a702a_12
|
||||
- libgfortran5=11.2.0=h5c6108e_12
|
||||
- libglib=2.70.2=h174f98d_4
|
||||
- libgomp=11.2.0=h1d223b6_12
|
||||
- libiconv=1.16=h516909a_0
|
||||
- liblapack=3.9.0=13_linux64_openblas
|
||||
- libllvm13=13.0.1=hf817b99_0
|
||||
- libnghttp2=1.46.0=h812cca2_0
|
||||
- libnsl=2.0.0=h7f98852_0
|
||||
- libogg=1.3.4=h7f98852_1
|
||||
- libopenblas=0.3.18=pthreads_h8fe5266_0
|
||||
- libopus=1.3.1=h7f98852_1
|
||||
- libpng=1.6.37=h21135ba_2
|
||||
- libpq=14.2=hd57d9b9_0
|
||||
- librsvg=2.52.5=h0a9e6e8_2
|
||||
- libssh2=1.10.0=ha56f1ee_2
|
||||
- libstdcxx-ng=11.2.0=he4da1e4_12
|
||||
- libtiff=4.3.0=h542a066_3
|
||||
- libtool=2.4.6=h9c3ff4c_1008
|
||||
- libuuid=2.32.1=h7f98852_1000
|
||||
- libuv=1.43.0=h7f98852_0
|
||||
- libvorbis=1.3.7=h9c3ff4c_0
|
||||
- libwebp=1.2.2=h3452ae3_0
|
||||
- libwebp-base=1.2.2=h7f98852_1
|
||||
- libxcb=1.13=h7f98852_1004
|
||||
- libxkbcommon=1.0.3=he3ba5ed_0
|
||||
- libxml2=2.9.12=h885dcf4_1
|
||||
- libzlib=1.2.11=h36c2ea0_1013
|
||||
- lz4-c=1.9.3=h9c3ff4c_1
|
||||
- markdown=3.3.6=pyhd8ed1ab_0
|
||||
- markupsafe=2.1.0=py39hb9d737c_0
|
||||
- matplotlib=3.5.1=py39hf3d152e_0
|
||||
- matplotlib-base=3.5.1=py39h2fa2bec_0
|
||||
- mergedeep=1.3.4=pyhd8ed1ab_0
|
||||
- mkdocs=1.2.3=pyhd8ed1ab_0
|
||||
- munkres=1.1.4=pyh9f0ad1d_0
|
||||
- mysql-common=8.0.28=ha770c72_0
|
||||
- mysql-libs=8.0.28=hfa10184_0
|
||||
- ncurses=6.3=h9c3ff4c_0
|
||||
- nspr=4.32=h9c3ff4c_1
|
||||
- nss=3.74=hb5efdd6_0
|
||||
- numpy=1.22.2=py39h91f2184_0
|
||||
- openjpeg=2.4.0=hb52868f_1
|
||||
- openssl=1.1.1l=h7f98852_0
|
||||
- packaging=21.3=pyhd8ed1ab_0
|
||||
- pango=1.50.3=h9967ed3_0
|
||||
- pcre=8.45=h9c3ff4c_0
|
||||
- pcre2=10.37=h032f7d1_0
|
||||
- perl=5.32.1=2_h7f98852_perl5
|
||||
- pillow=9.0.1=py39hae2aec6_2
|
||||
- pip=22.0.3=pyhd8ed1ab_0
|
||||
- pixman=0.40.0=h36c2ea0_0
|
||||
- pthread-stubs=0.4=h36c2ea0_1001
|
||||
- pycparser=2.21=pyhd8ed1ab_0
|
||||
- pygments=2.11.2=pyhd8ed1ab_0
|
||||
- pyopenssl=22.0.0=pyhd8ed1ab_0
|
||||
- pyparsing=3.0.7=pyhd8ed1ab_0
|
||||
- pyqt=5.12.3=py39hf3d152e_8
|
||||
- pyqt-impl=5.12.3=py39hde8b62d_8
|
||||
- pyqt5-sip=4.19.18=py39he80948d_8
|
||||
- pyqtchart=5.12=py39h0fcd23e_8
|
||||
- pyqtwebengine=5.12.1=py39h0fcd23e_8
|
||||
- pysocks=1.7.1=py39hf3d152e_4
|
||||
- python=3.9.10=h85951f9_2_cpython
|
||||
- python-dateutil=2.8.2=pyhd8ed1ab_0
|
||||
- python_abi=3.9=2_cp39
|
||||
- pytz=2021.3=pyhd8ed1ab_0
|
||||
- pyyaml=6.0=py39h3811e60_3
|
||||
- pyyaml-env-tag=0.1=pyhd8ed1ab_0
|
||||
- qt=5.12.9=ha98a1a1_5
|
||||
- readline=8.1=h46c0cb4_0
|
||||
- recommonmark=0.7.1=pyhd8ed1ab_0
|
||||
- requests=2.27.1=pyhd8ed1ab_0
|
||||
- rhash=1.4.1=h7f98852_0
|
||||
- setuptools=60.9.3=py39hf3d152e_0
|
||||
- six=1.16.0=pyh6c4a22f_0
|
||||
- snowballstemmer=2.2.0=pyhd8ed1ab_0
|
||||
- sphinx=3.5.4=pyh44b312d_0
|
||||
- sphinx-markdown-tables=0.0.15=pyhd3deb0d_0
|
||||
- sphinxcontrib-applehelp=1.0.2=py_0
|
||||
- sphinxcontrib-devhelp=1.0.2=py_0
|
||||
- sphinxcontrib-htmlhelp=2.0.0=pyhd8ed1ab_0
|
||||
- sphinxcontrib-jsmath=1.0.1=py_0
|
||||
- sphinxcontrib-qthelp=1.0.3=py_0
|
||||
- sphinxcontrib-serializinghtml=1.1.5=pyhd8ed1ab_1
|
||||
- sqlite=3.37.0=h9cd32fc_0
|
||||
- tk=8.6.12=h27826a3_0
|
||||
- tornado=6.1=py39h3811e60_2
|
||||
- tzdata=2021e=he74cb21_0
|
||||
- unicodedata2=14.0.0=py39h3811e60_0
|
||||
- urllib3=1.26.8=pyhd8ed1ab_1
|
||||
- watchdog=2.1.6=py39hf3d152e_1
|
||||
- wheel=0.37.1=pyhd8ed1ab_0
|
||||
- xorg-kbproto=1.0.7=h7f98852_1002
|
||||
- xorg-libice=1.0.10=h7f98852_0
|
||||
- xorg-libsm=1.2.3=hd9c2040_1000
|
||||
- xorg-libx11=1.7.2=h7f98852_0
|
||||
- xorg-libxau=1.0.9=h7f98852_0
|
||||
- xorg-libxdmcp=1.1.3=h7f98852_0
|
||||
- xorg-libxext=1.3.4=h7f98852_1
|
||||
- xorg-libxrender=0.9.10=h7f98852_1003
|
||||
- xorg-renderproto=0.11.1=h7f98852_1002
|
||||
- xorg-xextproto=7.3.0=h7f98852_1002
|
||||
- xorg-xproto=7.0.31=h7f98852_1007
|
||||
- xz=5.2.5=h516909a_1
|
||||
- yaml=0.2.5=h7f98852_2
|
||||
- zipp=3.7.0=pyhd8ed1ab_1
|
||||
- zlib=1.2.11=h36c2ea0_1013
|
||||
- zstd=1.5.2=ha95c52a_0
|
||||
- pip:
|
||||
- py-gfm==1.0.2
|
||||
- sphinx-markdown==1.0.2
|
||||
- sphinx-rtd-theme==1.0.0
|
||||
@@ -1,86 +0,0 @@
|
||||
# Features
|
||||
|
||||
```eval_rst
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 4
|
||||
```
|
||||
|
||||
## Overview
|
||||
|
||||
[OmniTrace](https://github.com/ROCm/omnitrace) is designed to be highly extensible. Internally, it leverages the
|
||||
[timemory performance analysis toolkit](https://github.com/NERSC/timemory) to
|
||||
manage extensions, resources, data, etc.
|
||||
|
||||
### Data Collection Modes
|
||||
|
||||
- Dynamic instrumentation
|
||||
- Runtime instrumentation
|
||||
- Instrument executable and shared libraries at runtime
|
||||
- Binary rewriting
|
||||
- Generate a new executable and/or library with instrumentation built-in
|
||||
- Statistical sampling
|
||||
- Periodic software interrupts per-thread
|
||||
- Process-level sampling
|
||||
- Background thread records process-, system- and device-level metrics while the application executes
|
||||
- Causal profiling
|
||||
- Quantifies the potential impact of optimizations in parallel codes
|
||||
|
||||
### Data Analysis
|
||||
|
||||
- High-level summary profiles with mean/min/max/stddev statistics
|
||||
- Low overhead, memory efficient
|
||||
- Ideal for running at scale
|
||||
- Comprehensive traces
|
||||
- Every individual event/measurement
|
||||
- Application speedup predictions resulting from potential optimizations in functions and lines of code (causal profiling)
|
||||
|
||||
### Parallelism API Support
|
||||
|
||||
- HIP
|
||||
- HSA
|
||||
- Pthreads
|
||||
- MPI
|
||||
- Kokkos-Tools (KokkosP)
|
||||
- OpenMP-Tools (OMPT)
|
||||
|
||||
### GPU Metrics
|
||||
|
||||
- GPU hardware counters
|
||||
- HIP API tracing
|
||||
- HIP kernel tracing
|
||||
- HSA API tracing
|
||||
- HSA operation tracing
|
||||
- System-level sampling (via rocm-smi)
|
||||
- Memory usage
|
||||
- Power usage
|
||||
- Temperature
|
||||
- Utilization
|
||||
|
||||
### CPU Metrics
|
||||
|
||||
- CPU hardware counters sampling and profiles
|
||||
- CPU frequency sampling
|
||||
- Various timing metrics
|
||||
- Wall time
|
||||
- CPU time (process and/or thread)
|
||||
- CPU utilization (process and/or thread)
|
||||
- User CPU time
|
||||
- Kernel CPU time
|
||||
- Various memory metrics
|
||||
- High-water mark (sampling and profiles)
|
||||
- Memory page allocation
|
||||
- Virtual memory usage
|
||||
- Network statistics
|
||||
- I/O metrics
|
||||
- ... many more
|
||||
|
||||
### Third-party API support
|
||||
|
||||
- TAU
|
||||
- LIKWID
|
||||
- Caliper
|
||||
- CrayPAT
|
||||
- VTune
|
||||
- NVTX
|
||||
- ROCTX
|
||||
@@ -1,19 +0,0 @@
|
||||
if(NOT DEFINED SOURCE_DIR)
    message(FATAL_ERROR "Please define SOURCE_DIR")
endif()

get_filename_component(SOURCE_DIR "${SOURCE_DIR}" ABSOLUTE)

find_program(DOT_EXECUTABLE NAMES dot)

if(NOT DOT_EXECUTABLE)
    message(FATAL_ERROR "Please install dot and/or specify DOT_EXECUTABLE")
endif()

file(READ "${SOURCE_DIR}/VERSION" FULL_VERSION_STRING LIMIT_COUNT 1)
string(REGEX REPLACE "(\n|\r)" "" FULL_VERSION_STRING "${FULL_VERSION_STRING}")
string(REGEX REPLACE "([0-9]+)\\.([0-9]+)\\.([0-9]+)(.*)" "\\1.\\2.\\3" OMNITRACE_VERSION
                     "${FULL_VERSION_STRING}")

configure_file(${SOURCE_DIR}/source/docs/omnitrace.dox.in
               ${SOURCE_DIR}/source/docs/omnitrace.dox @ONLY)
|
||||
@@ -1,189 +0,0 @@
|
||||
# Getting Started
|
||||
|
||||
```eval_rst
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 3
|
||||
```
|
||||
|
||||
<style>
|
||||
em { color: Green; }
|
||||
</style>
|
||||
|
||||
## Nomenclature
|
||||
|
||||
The list provided below is intended to (A) provide a basic glossary for those who are not familiar with binary instrumentation, etc. and (B)
|
||||
provide clarification to ambiguities when certain terms have different contextual meanings,
|
||||
e.g., omnitrace's meaning of the term "module" when instrumenting Python.
|
||||
|
||||
- **Binary**
|
||||
- File written in the Executable and Linkable Format (ELF)
|
||||
- Standard file format for executable files, shared libraries, etc.
|
||||
- **Binary Instrumentation**
|
||||
- Inserting callbacks to instrumentation into an existing binary. This can be performed statically or dynamically
|
||||
- **Static Binary Instrumentation**
|
||||
- Loads an existing binary, determines instrumentation points, and generates a new binary with instrumentation directly embedded
|
||||
- Applicable to executables and libraries but limited to only the functions defined in the binary
|
||||
- Also known as: **Binary Rewrite**
|
||||
- **Dynamic Binary Instrumentation**
|
||||
- Loads an existing binary into memory, inserts instrumentation, executes binary
|
||||
- Limited to executables but capable of instrumenting linked libraries
|
||||
- Also known as: **Runtime Instrumentation**
|
||||
- **Statistical Sampling**
|
||||
- Also known as (simply) "sampling"
|
||||
- At periodic intervals, the application is paused and the current call-stack of the CPU is recorded along with various other metrics
|
||||
- Uses timers that measure either (A) real clock time or (B) the CPU time used by the current thread and the CPU time expended on behalf of the thread by the system
|
||||
- **Sampling Rate**
|
||||
- The rate at which (A) or (B) is triggered (in units of `# interrupts / second`)
|
||||
- Higher values increase the number of samples
|
||||
- **Sampling Delay**
|
||||
- How long to wait before (A) and (B) begin triggering at their designated rate
|
||||
- **Sampling Duration**
|
||||
- The time (in realtime) after the start of the application to record samples. Once this time limit has been reached, no more samples will be recorded.
|
||||
- **Process Sampling**
|
||||
- At periodic (realtime) intervals, a background thread records global metrics without interrupting the current process. These metrics include, but are not limited to: CPU frequency,
|
||||
CPU memory high-water mark (i.e. peak memory usage), GPU Temperature, GPU Power usage, etc.
|
||||
- **Sampling Rate**
|
||||
- The realtime period for recording metrics (in units of `# measurements / second`)
|
||||
- Higher values increase the number of samples
|
||||
- **Sampling Delay**
|
||||
- How long to wait (in realtime) before recording samples
|
||||
- **Sampling Duration**
|
||||
- The time (in realtime) after the start of the application to record samples. Once this time limit has been reached, no more samples will be recorded.
|
||||
- **Module**
|
||||
- With respect to binary instrumentation, a module is defined as either the filename (e.g. `foo.c`) or library name (`libfoo.so`) which contains the definition of one or more functions
|
||||
- With respect to Python instrumentation, a module is defined as the *file* which contains the definition of one or more functions.
|
||||
- The full path to this file *typically* contains the name of the "Python module"
|
||||
- **Basic Block**
|
||||
- Straight-line code sequence with:
|
||||
- No branches in (except for the entry)
|
||||
- No branches out (except for the exit)
|
||||
- **Address Range**
|
||||
- The instructions for a function in a binary start at a certain address within the ELF file and end at a certain address; the range is `end - start`
|
||||
- The address range is a decent approximation for the "cost" of a function, i.e., a larger address range approx. equates to more instructions
|
||||
- **Instrumentation Traps**
|
||||
- On the x86 architecture, because instructions are of variable size, the instruction at a point may be too small for Dyninst to replace it with the normal code sequence used to call instrumentation
|
||||
- Also, when instrumentation is placed at points other than subroutine entry, exit, or call points, traps may be used to ensure the instrumentation fits
|
||||
- By default, omnitrace-instrument avoids instrumentation which requires using a trap
|
||||
- **Overlapping functions**
|
||||
- Due to language constructs or compiler optimizations, it may be possible for multiple functions to overlap (that is, share part of the same function body) or for a single function to have multiple entry points
|
||||
- In practice, it is impossible to determine the difference between multiple overlapping functions and a single function with multiple entry points
|
||||
- By default, omnitrace-instrument avoids instrumenting overlapping functions
|
||||
|
||||
## General Tips
|
||||
|
||||
- ***Use `omnitrace-avail` to look up configuration settings***, hardware counters, and data collection components
|
||||
- Use `-d` flag for descriptions
|
||||
- Generate a default configuration with `omnitrace-avail -G ${HOME}/.omnitrace.cfg` and tweak accordingly to the desired default behavior
|
||||
- ***Decide whether binary instrumentation, statistical sampling, or both*** will provide the desired performance data (for non-Python applications)
|
||||
- Compile code with optimization enabled (e.g. `-O2` or higher), disable asserts (i.e. `-DNDEBUG`), and include debug info (i.e. `-g1` at a minimum)
|
||||
- NOTE: compiling with debug info does not slow down the code, it only increases compile time and the size of the binary
|
||||
- In CMake, this is generally as easy as setting `CMAKE_BUILD_TYPE=RelWithDebInfo` or `CMAKE_BUILD_TYPE=Release` and `CMAKE_<LANG>_FLAGS=-g1`
|
||||
- Use ***binary instrumentation for characterizing the performance of every invocation of specific functions***
|
||||
- Use ***statistical sampling to characterize the performance of the entire application while minimizing overhead***
|
||||
- Enable statistical sampling after binary instrumentation to help "fill in the gaps" between instrumented regions
|
||||
- Use the user API to create custom regions and to enable/disable omnitrace for specific processes, threads, and/or regions
|
||||
- Dynamic symbol interception, callback APIs, and the user API are always available with binary instrumentation and sampling
|
||||
- Dynamic symbol interception and callback APIs are (generally) controlled through `OMNITRACE_USE_<API>` options, e.g. `OMNITRACE_USE_KOKKOSP`, `OMNITRACE_USE_OMPT` enable Kokkos-Tools and OpenMP-Tools callbacks, respectively
|
||||
- When generically seeking regions for performance improvement:
|
||||
- ***Start off collecting a flat profile***
|
||||
- Look for functions with high call counts, large cumulative runtimes/values, and/or large standard deviations
|
||||
- When call-counts are high, improving the performance of this function or "inlining" the function can be quick and easy performance improvements
|
||||
- When the standard deviation is high, collect a hierarchical profile and see if the high variation can be attributed to the calling context. In this scenario, consider creating a specialized version of the function for the longer-running contexts
|
||||
- ***Collect a hierarchical profile*** and, keeping the flat-profiling data in mind, verify the functions noted in the flat profile are part of the "critical path" of your application
|
||||
- E.g., functions with high call counts that are part of a "setup" or "post-processing" phase which consumes little time relative to the overall runtime are generally a lower priority for optimization
|
||||
- ***Use the information from the profiles when analyzing detailed traces***
|
||||
- When using binary instrumentation in the "trace" mode, the ***binary rewrites are preferable to runtime instrumentation***.
|
||||
- Binary rewrites only instrument the functions defined in the target binary, whereas runtime instrumentation can/will instrument functions defined in the shared libraries which are linked into the target binary
|
||||
- When using binary instrumentation with MPI, avoid runtime instrumentation
|
||||
- Runtime instrumentation requires a fork + ptrace, which is generally incompatible with how MPI applications spawn their processes
|
||||
- Binary rewrite the executable using MPI (and, optionally, libraries used by the executable) and execute the generated instrumented executable via `omnitrace-run` instead of the original, e.g. `mpirun -n 2 ./myexe` should be `mpirun -n 2 omnitrace-run -- ./myexe.inst` where `myexe.inst` is the generated instrumented `myexe` executable.
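
The flat-profile tip above (look for high call counts, large cumulative values, and large standard deviations) can be sketched as follows. This is a minimal illustration, not omnitrace code: the `Summary` struct, the `summarize` and `high_variation` functions, and the relative threshold are all hypothetical names and values chosen for this example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical summary of the per-call measurements of one function.
struct Summary
{
    std::size_t count;
    double      min;
    double      max;
    double      mean;
    double      stddev;
};

// Compute count/min/max/mean/stddev (population standard deviation).
Summary summarize(const std::vector<double>& samples)
{
    Summary s{ samples.size(), 0.0, 0.0, 0.0, 0.0 };
    if(samples.empty()) return s;

    s.min = *std::min_element(samples.begin(), samples.end());
    s.max = *std::max_element(samples.begin(), samples.end());

    double sum = 0.0;
    for(double v : samples)
        sum += v;
    s.mean = sum / static_cast<double>(samples.size());

    double sq = 0.0;
    for(double v : samples)
        sq += (v - s.mean) * (v - s.mean);
    s.stddev = std::sqrt(sq / static_cast<double>(samples.size()));
    return s;
}

// Flag a function whose standard deviation is large relative to its mean.
// The threshold is an assumption; pick a value that suits the application.
bool high_variation(const Summary& s, double rel_threshold)
{
    return s.mean > 0.0 && (s.stddev / s.mean) > rel_threshold;
}
```

A function flagged by `high_variation` is a candidate for the hierarchical profile described above, since the calling context may explain the spread.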
|
||||
|
||||
## Data Collection Mode(s)
|
||||
|
||||
OmniTrace supports several modes of recording trace and profiling data for your application:
|
||||
|
||||
| Mode                        | Description                                                                                                                                                   |
|-----------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Binary Instrumentation      | Locates functions (and loops, if desired) in binary and inserts snippets at the entry and exit                                                                |
| Statistical Sampling        | Periodically pauses application at specified intervals and records various metrics for the given call-stack                                                   |
| Callback APIs               | Parallelism frameworks such as ROCm, OpenMP, and Kokkos will make callbacks into omnitrace to provide information about the work the API is performing        |
| Dynamic Symbol Interception | Wrap function symbols defined in position independent dynamic library/executable, e.g. `pthread_mutex_lock` in libpthread.so or `MPI_Init` in the MPI library |
| User API                    | User-defined regions and controls for omnitrace                                                                                                               |
|
||||
|
||||
The two most generic, important modes are binary instrumentation and statistical sampling. It is important to understand the advantages and disadvantages.
|
||||
Binary instrumentation and statistical sampling can be performed with the `omnitrace` executable but for statistical sampling, it is highly recommended to use the
|
||||
`omnitrace-sample` executable instead if no binary instrumentation is required/desired. With either tool, the callback APIs and dynamic symbol interception can be
|
||||
utilized.
|
||||
|
||||
### Binary Instrumentation
|
||||
|
||||
Binary instrumentation will allow one to deterministically record measurements for every single invocation of a given function.
|
||||
Binary instrumentation effectively adds instructions to the target application to collect the required information and, thus, has the potential to cause performance changes which may,
|
||||
in some cases, lead to inaccurate results. The effect depends on what information is being collected and which features are activated in omnitrace. For example, collecting only the wall-clock timing data
|
||||
will have less effect than collecting the wall-clock timing, cpu-clock timing, memory usage, cache-misses, and number of instructions executed. Similarly, collecting a flat profile will have
|
||||
less overhead than a hierarchical profile and collecting a trace OR a profile will have less overhead than collecting a trace AND a profile.
|
||||
|
||||
In omnitrace, the primary heuristic for controlling the overhead with binary instrumentation is the minimum number of instructions for selecting functions for instrumentation.
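
A minimum-instruction filter of this kind can be sketched as below. This is only an illustration of the idea: `FunctionInfo`, `select_for_instrumentation`, and the `min_instructions` parameter are hypothetical names, and the real selection logic in omnitrace considers more than the instruction count.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of a function found in a binary.
struct FunctionInfo
{
    std::string name;
    std::size_t num_instructions;
};

// Keep only the functions that contain at least min_instructions instructions,
// so that very small functions are not dominated by instrumentation overhead.
std::vector<FunctionInfo>
select_for_instrumentation(const std::vector<FunctionInfo>& funcs,
                           std::size_t                      min_instructions)
{
    std::vector<FunctionInfo> selected;
    for(const auto& f : funcs)
    {
        if(f.num_instructions >= min_instructions) selected.push_back(f);
    }
    return selected;
}
```

Raising `min_instructions` instruments fewer functions and lowers overhead; lowering it records more functions at a higher cost.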
|
||||
|
||||
### Statistical Sampling
|
||||
|
||||
Statistical call-stack sampling interrupts the application at regular intervals using operating system interrupts.
|
||||
Sampling is typically less numerically accurate and specific, but allows the target program to run at near full speed.
|
||||
In contrast to the data derived from binary instrumentation, the resulting data is not exact but, instead, a statistical approximation.
|
||||
However, sampling often provides a more accurate picture of the application execution because it is less intrusive to the target application and has fewer
|
||||
side effects on memory caches or instruction decoding pipelines. Furthermore, since sampling does not affect the execution speed as significantly, it is
|
||||
relatively immune to over-evaluating the cost of small, frequently called functions or "tight" loops.
|
||||
|
||||
In omnitrace, the overhead for statistical sampling is a factor of the sampling rate and whether the samples are taken with respect to the CPU time and/or real time.
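
To illustrate why sampled data is a statistical approximation, the sketch below converts a list of sampled function names into estimated time shares. The function `estimate_time_share` and the flat list-of-names input are assumptions made for this example; they do not reflect how omnitrace stores or processes samples.

```cpp
#include <map>
#include <string>
#include <vector>

// Estimate the fraction of time spent in each function from the function
// observed at the top of the call-stack in each sample.
std::map<std::string, double>
estimate_time_share(const std::vector<std::string>& sampled_functions)
{
    std::map<std::string, double> share;
    if(sampled_functions.empty()) return share;

    // count how often each function was observed
    for(const auto& name : sampled_functions)
        share[name] += 1.0;

    // normalize the counts by the total number of samples
    for(auto& entry : share)
        entry.second /= static_cast<double>(sampled_functions.size());

    return share;
}
```

With more samples (a higher sampling rate or a longer run), the estimated shares converge toward the true distribution, at the cost of more interrupts.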
|
||||
|
||||
### Binary Instrumentation vs. Statistical Sampling Example
|
||||
|
||||
Consider the following code:
|
||||
|
||||
```cpp
#include <cstdio>
#include <cstdlib>

long fib(long n)
{
    if(n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}

// i is the iteration index, nfib is the Fibonacci input
void run(long i, long nfib)
{
    long result = fib(nfib);
    printf("[%li] fibonacci(%li) = %li\n", i, nfib, result);
}

int main(int argc, char** argv)
{
    long nfib = 30;
    long nitr = 10;
    if(argc > 1) nfib = atol(argv[1]);
    if(argc > 2) nitr = atol(argv[2]);

    for(long i = 0; i < nitr; ++i)
        run(i, nfib);

    return 0;
}
```
|
||||
|
||||
Binary instrumentation of the `fib` function will record ***every single invocation*** of the function -- which for a very small function
|
||||
such as `fib`, will result in *significant* overhead since this simple function tends to be less than 20 or so instructions, whereas the entry and
|
||||
exit snippets are ~1024 instructions. Thus, ***we generally want to avoid instrumenting functions where the instrumented function has significantly fewer
|
||||
instructions than entry + exit instrumentation*** (please note, however, that many of the instructions in the entry/exit functions are either logging functions or
|
||||
depend on the runtime settings and thus may never be executed). However, due to the number of potentially executed instructions in the entry/exit snippets,
|
||||
the default behavior of omnitrace-instrument is to only instrument functions which contain at least 1024 instructions.
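
To get a feel for how many invocations are involved, the sketch below counts how often `fib` is entered for a given input. The helper `count_fib_calls` is a hypothetical function written for this example and is not part of omnitrace.

```cpp
#include <cstdint>

// Number of times fib() is entered when fib(n) is evaluated with the
// naive recursion shown above (the initial call is included).
std::uint64_t count_fib_calls(long n)
{
    if(n < 2) return 1;
    return 1 + count_fib_calls(n - 1) + count_fib_calls(n - 2);
}
```

For the default `nfib = 30`, `count_fib_calls(30)` returns 2692537, so each call to `run` would execute the entry and exit snippets roughly 2.7 million times if `fib` were instrumented.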
|
||||
|
||||
However, recording every single invocation of the function can be extremely useful for detecting anomalies: profiles will show min/max values much smaller/larger
|
||||
than the average and/or high standard deviation and traces will allow you to identify exactly when and where those instances deviated from the norm.
|
||||
Consider the level of details in the following traces where, in the top image, every instance of the `fib` function was instrumented vs. the bottom image
|
||||
where the `fib` call-stack was derived via sampling:
|
||||
|
||||
#### Binary Instrumentation of Fibonacci Function
|
||||
|
||||

|
||||
|
||||
#### Statistical Sampling of Fibonacci Function
|
||||
|
||||

|
||||
|
@@ -1,24 +0,0 @@
|
||||
# Welcome to the [OmniTrace](https://github.com/ROCm/omnitrace) Documentation!
|
||||
|
||||
```eval_rst
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 4
|
||||
:caption: Table of Contents
|
||||
|
||||
about
|
||||
features
|
||||
installation
|
||||
setup
|
||||
getting_started
|
||||
runtime
|
||||
sampling
|
||||
instrumenting
|
||||
causal_profiling
|
||||
critical_trace
|
||||
output
|
||||
user_api
|
||||
python
|
||||
youtube
|
||||
development
|
||||
```
|
||||
@@ -1,281 +0,0 @@
|
||||
# Installation
|
||||
|
||||
```eval_rst
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 4
|
||||
```
|
||||
|
||||
## Release Links
|
||||
|
||||
- [Latest Omnitrace Release](https://github.com/ROCm/omnitrace/releases/latest)
|
||||
- [All Omnitrace Releases](https://github.com/ROCm/omnitrace/releases)
|
||||
|
||||
## Quick Start (Latest Release, Binary Installer)
|
||||
|
||||
Download the [omnitrace-install.py](https://github.com/ROCm/omnitrace/releases/latest/download/omnitrace-install.py)
|
||||
and specify `--prefix <install-directory>`. This script will attempt to auto-detect the appropriate OS
|
||||
distribution and OS version. If ROCm support is desired, specify `--rocm X.Y` where `X` is the ROCm major
|
||||
version and `Y` is the ROCm minor version, e.g. `--rocm 6.0`.
|
||||
|
||||
```shell
|
||||
wget https://github.com/ROCm/omnitrace/releases/latest/download/omnitrace-install.py
|
||||
python3 ./omnitrace-install.py --prefix /opt/omnitrace --rocm 6.0
|
||||
```
|
||||
|
||||
This script supports installation on Ubuntu, OpenSUSE, RedHat, Debian, CentOS, and Fedora.
|
||||
If the target OS is compatible with one of the [operating system versions](#operating-system-support) below,
|
||||
specify `-d <DISTRO> -v <VERSION>`, e.g. if the OS is compatible with Ubuntu 18.04, pass
|
||||
`-d ubuntu -v 18.04` to the script.
|
||||
|
||||
## Operating System Support
|
||||
|
||||
OmniTrace is only supported on Linux. The following distributions are tested:
|
||||
|
||||
- Ubuntu 18.04
|
||||
- Ubuntu 20.04
|
||||
- Ubuntu 22.04
|
||||
- OpenSUSE 15.2
|
||||
- OpenSUSE 15.3
|
||||
- OpenSUSE 15.4
|
||||
- RedHat 8.7
|
||||
- RedHat 9.0
|
||||
- RedHat 9.1
|
||||
|
||||
Other OS distributions may be supported but are not tested.
|
||||
|
||||
### Identifying the Operating System
|
||||
|
||||
If you are unsure of the operating system and version, the `/etc/os-release` and `/usr/lib/os-release` files contain operating system identification data for Linux systems.
|
||||
|
||||
```shell
|
||||
$ cat /etc/os-release
|
||||
NAME="Ubuntu"
|
||||
VERSION="20.04.4 LTS (Focal Fossa)"
|
||||
ID=ubuntu
|
||||
...
|
||||
VERSION_ID="20.04"
|
||||
...
|
||||
```
|
||||
|
||||
The relevant fields are `ID` and `VERSION_ID`.
|
||||
|
||||
## Architecture
|
||||
|
||||
With regard to instrumentation, at present only amd64 (x86_64) architectures are tested; however,
|
||||
Dyninst supports several more architectures and thus, omnitrace instrumentation may support other
|
||||
CPU architectures such as aarch64, ppc64, etc.
|
||||
Other modes of use, such as sampling and causal profiling, are not dependent on Dyninst and therefore
|
||||
may be more portable.
|
||||
|
||||
## Installing omnitrace from binary distributions
|
||||
|
||||
Every omnitrace release provides binary installer scripts of the form:
|
||||
|
||||
```shell
|
||||
omnitrace-{VERSION}-{OS_DISTRIB}-{OS_VERSION}[-ROCm-{ROCM_VERSION}[-{EXTRA}]].sh
|
||||
```
|
||||
|
||||
E.g.:
|
||||
|
||||
```shell
|
||||
omnitrace-1.0.0-ubuntu-18.04-OMPT-PAPI-Python3.sh
|
||||
omnitrace-1.0.0-ubuntu-18.04-ROCm-405000-OMPT-PAPI-Python3.sh
|
||||
...
|
||||
omnitrace-1.0.0-ubuntu-20.04-ROCm-50000-OMPT-PAPI-Python3.sh
|
||||
```
|
||||
|
||||
Any of the EXTRA fields with a cmake build option (e.g. PAPI, see below) or no link requirements (e.g. OMPT) have
|
||||
self-contained support for these packages.
|
||||
|
||||
### Download the appropriate binary distribution
|
||||
|
||||
```shell
|
||||
wget https://github.com/ROCm/omnitrace/releases/download/v<VERSION>/<SCRIPT>
|
||||
```
|
||||
|
||||
### Create the target installation directory
|
||||
|
||||
```shell
|
||||
mkdir /opt/omnitrace
|
||||
```
|
||||
|
||||
### Run the installer script
|
||||
|
||||
```shell
|
||||
./omnitrace-1.0.0-ubuntu-18.04-ROCm-405000-OMPT-PAPI.sh --prefix=/opt/omnitrace --exclude-subdir
|
||||
```
|
||||
|
||||
## Installing OmniTrace from source
|
||||
|
||||
### Build Requirements
|
||||
|
||||
OmniTrace needs a GCC compiler with full support for C++17 and CMake v3.16 or higher.
|
||||
The Clang compiler may be used in lieu of the GCC compiler if Dyninst is already installed.
|
||||
|
||||
- GCC compiler v7+
|
||||
- Older GCC compilers may be supported but are not tested
|
||||
- Clang compilers are generally supported for [OmniTrace](https://github.com/ROCm/omnitrace) but not Dyninst
|
||||
- [CMake](https://cmake.org/) v3.16+
|
||||
|
||||
> ***If the system installed cmake is too old, installing a new version of cmake can be done through several methods.***
|
||||
> ***One of the easiest options is to use PyPi (i.e. python's pip):***
|
||||
>
|
||||
> ```shell
|
||||
> pip install --user 'cmake==3.18.4'
|
||||
> export PATH=${HOME}/.local/bin:${PATH}
|
||||
> ```
|
||||
|
||||
### Required Third-Party Packages
|
||||
|
||||
- [DynInst](https://github.com/dyninst/dyninst) for dynamic or static instrumentation
|
||||
- [TBB](https://github.com/oneapi-src/oneTBB) required by Dyninst
|
||||
- [ElfUtils](https://sourceware.org/elfutils/) required by Dyninst
|
||||
- [LibIberty](https://github.com/gcc-mirror/gcc/tree/master/libiberty) required by Dyninst
|
||||
- [Boost](https://www.boost.org/) required by Dyninst
|
||||
- [OpenMP](https://www.openmp.org/) optionally used by Dyninst
|
||||
- [libunwind](https://www.nongnu.org/libunwind/) for call-stack sampling
|
||||
|
||||
All of the third-party packages required by [DynInst](https://github.com/dyninst/dyninst) and
|
||||
[DynInst](https://github.com/dyninst/dyninst) itself can be built and installed
|
||||
during the build of omnitrace itself. In the list below, we list the package, the version,
|
||||
which package requires the package (i.e. omnitrace requires Dyninst
|
||||
and Dyninst requires TBB), and the CMake option to build the package alongside omnitrace:
|
||||
|
||||
| Third-Party Library | Minimum Version | Required By | CMake Option                              |
|---------------------|-----------------|-------------|-------------------------------------------|
| Dyninst             | 12.0            | OmniTrace   | `OMNITRACE_BUILD_DYNINST` (default: OFF)  |
| Libunwind           |                 | OmniTrace   | `OMNITRACE_BUILD_LIBUNWIND` (default: ON) |
| TBB                 | 2018.6          | Dyninst     | `DYNINST_BUILD_TBB` (default: OFF)        |
| ElfUtils            | 0.178           | Dyninst     | `DYNINST_BUILD_ELFUTILS` (default: OFF)   |
| LibIberty           |                 | Dyninst     | `DYNINST_BUILD_LIBIBERTY` (default: OFF)  |
| Boost               | 1.67.0          | Dyninst     | `DYNINST_BUILD_BOOST` (default: OFF)      |
| OpenMP              | 4.x             | Dyninst     |                                           |
|
||||
|
||||
### Optional Third-Party Packages
|
||||
|
||||
- [ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/latest)
|
||||
- HIP
|
||||
- Roctracer for HIP API and kernel tracing
|
||||
- ROCM-SMI for GPU monitoring
|
||||
- Rocprofiler for GPU hardware counters
|
||||
- [PAPI](https://icl.utk.edu/papi/)
|
||||
- MPI
|
||||
- `OMNITRACE_USE_MPI` will enable full MPI support
|
||||
- `OMNITRACE_USE_MPI_HEADERS` will enable wrapping of the dynamically-linked MPI C function calls
|
||||
- By default, if an OpenMPI MPI distribution cannot be found, omnitrace will use a local copy of the OpenMPI mpi.h
|
||||
- Several optional third-party profiling tools supported by timemory (e.g. [Caliper](https://github.com/LLNL/Caliper), [TAU](https://www.cs.uoregon.edu/research/tau/home.php), CrayPAT, etc.)
|
||||
|
||||
| Third-Party Library | CMake Enable Option                        | CMake Build Option                   |
|---------------------|--------------------------------------------|--------------------------------------|
| PAPI                | `OMNITRACE_USE_PAPI` (default: ON)         | `OMNITRACE_BUILD_PAPI` (default: ON) |
| MPI                 | `OMNITRACE_USE_MPI` (default: OFF)         |                                      |
| MPI (header-only)   | `OMNITRACE_USE_MPI_HEADERS` (default: ON)  |                                      |
|
||||
|
||||
### Installing DynInst
|
||||
|
||||
#### Building Dyninst alongside OmniTrace
|
||||
|
||||
The easiest way to install Dyninst is to configure omnitrace with `OMNITRACE_BUILD_DYNINST=ON`. Depending on the version of Ubuntu, the apt package manager may have current enough
|
||||
versions of Dyninst's Boost, TBB, and LibIberty dependencies (i.e. `apt-get install libtbb-dev libiberty-dev libboost-dev`); however, it is possible to request Dyninst to install
|
||||
its dependencies via `DYNINST_BUILD_<DEP>=ON`, e.g.:
|
||||
|
||||
```shell
|
||||
git clone https://github.com/ROCm/omnitrace.git omnitrace-source
|
||||
cmake -B omnitrace-build -DOMNITRACE_BUILD_DYNINST=ON -DDYNINST_BUILD_{TBB,ELFUTILS,BOOST,LIBIBERTY}=ON omnitrace-source
|
||||
```
|
||||
|
||||
where `-DDYNINST_BUILD_{TBB,BOOST,ELFUTILS,LIBIBERTY}=ON` is expanded by the shell to `-DDYNINST_BUILD_TBB=ON -DDYNINST_BUILD_BOOST=ON ...`
|
||||
|
||||
#### Installing Dyninst via Spack
|
||||
|
||||
[Spack](https://github.com/spack/spack) is another option to install Dyninst and its dependencies:
|
||||
|
||||
```shell
git clone https://github.com/spack/spack.git
source ./spack/share/spack/setup-env.sh
spack compiler find
spack external find --all --not-buildable
spack spec -I --reuse dyninst
spack install --reuse dyninst
spack load -r dyninst
```
|
||||
|
||||
### Installing omnitrace
|
||||
|
||||
OmniTrace has cmake configuration options for supporting MPI (`OMNITRACE_USE_MPI` or `OMNITRACE_USE_MPI_HEADERS`), HIP kernel tracing (`OMNITRACE_USE_ROCTRACER`),
|
||||
sampling ROCm devices (`OMNITRACE_USE_ROCM_SMI`), OpenMP-Tools (`OMNITRACE_USE_OMPT`), hardware counters via PAPI (`OMNITRACE_USE_PAPI`), among others.
|
||||
Various additional features can be enabled via the [`TIMEMORY_USE_*` CMake options](https://timemory.readthedocs.io/en/develop/installation.html#cmake-options).
|
||||
Any `OMNITRACE_USE_<VAL>` option which has a corresponding `TIMEMORY_USE_<VAL>` option means that the support within timemory for this feature has been integrated
|
||||
into omnitrace's perfetto support, e.g. `OMNITRACE_USE_PAPI=<VAL>` forces `TIMEMORY_USE_PAPI=<VAL>` and the data that timemory is able to collect via this package
|
||||
is passed along to perfetto and will be displayed when the `.proto` file is visualized in [ui.perfetto.dev](https://ui.perfetto.dev).
|
||||
|
||||
```shell
|
||||
git clone https://github.com/ROCm/omnitrace.git omnitrace-source
|
||||
cmake \
|
||||
-B omnitrace-build \
|
||||
-D CMAKE_INSTALL_PREFIX=/opt/omnitrace \
|
||||
-D OMNITRACE_USE_HIP=ON \
|
||||
-D OMNITRACE_USE_ROCM_SMI=ON \
|
||||
-D OMNITRACE_USE_ROCTRACER=ON \
|
||||
-D OMNITRACE_USE_PYTHON=ON \
|
||||
-D OMNITRACE_USE_OMPT=ON \
|
||||
-D OMNITRACE_USE_MPI_HEADERS=ON \
|
||||
-D OMNITRACE_BUILD_PAPI=ON \
|
||||
-D OMNITRACE_BUILD_LIBUNWIND=ON \
|
||||
-D OMNITRACE_BUILD_DYNINST=ON \
|
||||
-D DYNINST_BUILD_TBB=ON \
|
||||
-D DYNINST_BUILD_BOOST=ON \
|
||||
-D DYNINST_BUILD_ELFUTILS=ON \
|
||||
-D DYNINST_BUILD_LIBIBERTY=ON \
|
||||
omnitrace-source
|
||||
cmake --build omnitrace-build --target all --parallel 8
|
||||
cmake --build omnitrace-build --target install
|
||||
source /opt/omnitrace/share/omnitrace/setup-env.sh
|
||||
```
|
||||
|
||||
#### MPI Support within OmniTrace
|
||||
|
||||
[OmniTrace](https://github.com/ROCm/omnitrace) can have full (`OMNITRACE_USE_MPI=ON`) or partial (`OMNITRACE_USE_MPI_HEADERS=ON`) MPI support.
|
||||
The only difference between these two modes is whether or not the results collected via timemory and/or perfetto can be aggregated into a single
|
||||
output file during finalization. When full MPI support is enabled, combining the timemory results always occurs whereas combining the perfetto
|
||||
results is configurable via the `OMNITRACE_PERFETTO_COMBINE_TRACES` setting.
|
||||
|
||||
The primary benefits of partial or full MPI support are the automatic wrapping of MPI functions and the ability
|
||||
to label output with suffixes which correspond to the `MPI_COMM_WORLD` rank ID instead of using the system process identifier (i.e. PID).
|
||||
In general, it is recommended to use partial MPI support with the OpenMPI headers as this is the most portable configuration.
|
||||
If full MPI support is selected, make sure your target application is built against the same MPI distribution as omnitrace,
|
||||
i.e. do not build omnitrace with MPICH and use it on a target application built against OpenMPI.
|
||||
If partial support is selected, the reason the OpenMPI headers are recommended instead of the MPICH headers is
|
||||
because `MPI_COMM_WORLD` in OpenMPI is a pointer to `ompi_communicator_t` (8 bytes), whereas in MPICH
|
||||
it is an `int` (4 bytes). Building omnitrace with partial MPI support and the MPICH headers and then using
|
||||
omnitrace on an application built against OpenMPI will cause a segmentation fault due to the value of the `MPI_COMM_WORLD` being narrowed
|
||||
during the function wrapping before being passed along to the underlying MPI function.
|
||||
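The narrowing described above can be illustrated with shell arithmetic. This is only a sketch: the address is a made-up 64-bit value standing in for an OpenMPI communicator pointer, and the mask models what survives when that value passes through a 4-byte `int`.

```shell
# Hypothetical 64-bit pointer value (not a real communicator address)
ptr=$(( 0x7f12345678a0 ))

# Keep only the low 32 bits, as happens when the value is stored in a 4-byte int
narrowed=$(( ptr & 0xffffffff ))

printf 'original: 0x%x\nnarrowed: 0x%x\n' "$ptr" "$narrowed"
```

The narrowed value no longer points at the communicator object, which is why the underlying MPI call faults when it dereferences it.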
|
||||
## Post-Installation Steps
|
||||
|
||||
### Configure the environment
|
||||
|
||||
If environment modules are available and preferred:
|
||||
|
||||
```shell
|
||||
module use /opt/omnitrace/share/modulefiles
|
||||
module load omnitrace/1.0.0
|
||||
```
|
||||
|
||||
Alternatively, one can directly source the `setup-env.sh` script:
|
||||
|
||||
```shell
|
||||
source /opt/omnitrace/share/omnitrace/setup-env.sh
|
||||
```
|
||||
|
||||
### Test the executables
|
||||
|
||||
Successful execution of these commands indicates that the installation does not have any issues locating the installed libraries:
|
||||
|
||||
```shell
|
||||
omnitrace-instrument --help
|
||||
omnitrace-avail --help
|
||||
```
|
||||
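Before running the `--help` checks, it can be useful to confirm that the executables are actually on `PATH`. The following is a minimal sketch; `check_on_path` is a hypothetical helper written for this example, not something shipped with omnitrace:

```shell
# check_on_path <exe>...: report each executable and return the number that are missing
check_on_path() {
    missing=0
    for exe in "$@"; do
        if command -v "$exe" >/dev/null 2>&1; then
            echo "found: $exe"
        else
            echo "missing: $exe"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

check_on_path omnitrace-instrument omnitrace-avail ||
    echo "load the omnitrace module or source setup-env.sh first"
```

A non-zero return value means at least one executable was not found, which usually indicates that the environment has not been configured yet.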
|
||||
> ***NOTE: If ROCm support was enabled, you may have to add the path to the ROCm libraries to `LD_LIBRARY_PATH`, e.g. `export LD_LIBRARY_PATH=/opt/rocm/lib:${LD_LIBRARY_PATH}`***
|
||||
@@ -1,835 +0,0 @@
|
||||
# Binary Instrumentation
|
||||
|
||||
```eval_rst
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 4
|
||||
```
|
||||
|
||||
## omnitrace-instrument Executable
|
||||
|
||||
> ***NOTE: With the introduction of `omnitrace-sample`, in future versions of omnitrace, the current `omnitrace` executable***
|
||||
> ***noted below will likely be renamed to `omnitrace-instrument` and a new `omnitrace` executable will serve as a common***
|
||||
> ***executable for multiple executables, e.g. `omnitrace-instrument sample ...`, `omnitrace run ...`, `omnitrace rewrite ...`, etc.***
|
||||
|
||||
Instrumentation is performed with the `omnitrace` executable. View the help menu with the `-h` / `--help` option:
|
||||
|
||||
```console
|
||||
$ omnitrace-instrument --help
|
||||
[omnitrace-instrument] Usage: omnitrace-instrument [ --help (count: 0, dtype: bool)
|
||||
--version (count: 0, dtype: bool)
|
||||
--verbose (max: 1, dtype: bool)
|
||||
--error (max: 1, dtype: boolean)
|
||||
--debug (max: 1, dtype: bool)
|
||||
--log (count: 1)
|
||||
--log-file (count: 1)
|
||||
--simulate (max: 1, dtype: boolean)
|
||||
--print-format (min: 1, dtype: string)
|
||||
--print-dir (count: 1, dtype: string)
|
||||
--print-available (count: 1)
|
||||
--print-instrumented (count: 1)
|
||||
--print-coverage (count: 1)
|
||||
--print-excluded (count: 1)
|
||||
--print-overlapping (count: 1)
|
||||
--print-instructions (max: 1, dtype: bool)
|
||||
--output (min: 0, dtype: string)
|
||||
--pid (count: 1, dtype: int)
|
||||
--mode (count: 1)
|
||||
--force (max: 1, dtype: bool)
|
||||
--command (count: 1)
|
||||
--prefer (count: 1)
|
||||
--library (count: unlimited)
|
||||
--main-function (count: 1)
|
||||
--load (count: unlimited, dtype: string)
|
||||
--load-instr (count: unlimited, dtype: filepath)
|
||||
--init-functions (count: unlimited, dtype: string)
|
||||
--fini-functions (count: unlimited, dtype: string)
|
||||
--all-functions (max: 1, dtype: boolean)
|
||||
--function-include (count: unlimited)
|
||||
--function-exclude (count: unlimited)
|
||||
--function-restrict (count: unlimited)
|
||||
--caller-include (count: unlimited)
|
||||
--module-include (count: unlimited)
|
||||
--module-exclude (count: unlimited)
|
||||
--module-restrict (count: unlimited)
|
||||
--internal-function-include (count: unlimited)
|
||||
--internal-module-include (count: unlimited)
|
||||
--instruction-exclude (count: unlimited)
|
||||
--internal-library-deps (min: 0, dtype: boolean)
|
||||
--internal-library-append (count: unlimited)
|
||||
--internal-library-remove (count: unlimited)
|
||||
--linkage (min: 1)
|
||||
--visibility (min: 1)
|
||||
--label (count: unlimited, dtype: string)
|
||||
--config (min: 1, dtype: string)
|
||||
--default-components (count: unlimited, dtype: string)
|
||||
--env (count: unlimited)
|
||||
--mpi (max: 1, dtype: bool)
|
||||
--instrument-loops (max: 1, dtype: boolean)
|
||||
--min-instructions (count: 1, dtype: int)
|
||||
--min-address-range (count: 1, dtype: int)
|
||||
--min-instructions-loop (count: 1, dtype: int)
|
||||
--min-address-range-loop (count: 1, dtype: int)
|
||||
--coverage (max: 1, dtype: bool)
|
||||
--dynamic-callsites (max: 1, dtype: boolean)
|
||||
--traps (max: 1, dtype: boolean)
|
||||
--loop-traps (max: 1, dtype: boolean)
|
||||
--allow-overlapping (max: 1, dtype: bool)
|
||||
--parse-all-modules (max: 1, dtype: bool)
|
||||
--batch-size (count: 1, dtype: int)
|
||||
--dyninst-rt (min: 1, dtype: filepath)
|
||||
--dyninst-options (count: unlimited)
|
||||
] -- <CMD> <ARGS>
|
||||
|
||||
Options:
|
||||
-h, -?, --help Shows this page
|
||||
--version Prints the version and exit
|
||||
|
||||
[DEBUG OPTIONS]
|
||||
|
||||
-v, --verbose Verbose output
|
||||
-e, --error All warnings produce runtime errors
|
||||
--debug Debug output
|
||||
--log Number of log entries to display after an error. Any value < 0 will emit the entire log
|
||||
--log-file Write the log out the specified file during the run
|
||||
--simulate Exit after outputting diagnostic {available,instrumented,excluded,overlapping} module
|
||||
function lists, e.g. available.txt
|
||||
--print-format [ json | txt | xml ]
|
||||
Output format for diagnostic {available,instrumented,excluded,overlapping} module
|
||||
function lists, e.g. {print-dir}/available.txt
|
||||
--print-dir Output directory for diagnostic {available,instrumented,excluded,overlapping} module
|
||||
function lists, e.g. {print-dir}/available.txt
|
||||
--print-available [ functions | functions+ | modules | pair | pair+ ]
|
||||
Print the available entities for instrumentation (functions, modules, or module-function
|
||||
pair) to stdout after applying regular expressions
|
||||
--print-instrumented [ functions | functions+ | modules | pair | pair+ ]
|
||||
Print the instrumented entities (functions, modules, or module-function pair) to stdout
|
||||
after applying regular expressions
|
||||
--print-coverage [ functions | functions+ | modules | pair | pair+ ]
|
||||
Print the instrumented coverage entities (functions, modules, or module-function pair) to
|
||||
stdout after applying regular expressions
|
||||
--print-excluded [ functions | functions+ | modules | pair | pair+ ]
|
||||
Print the entities for instrumentation (functions, modules, or module-function pair)
|
||||
which are excluded from the instrumentation to stdout after applying regular expressions
|
||||
--print-overlapping [ functions | functions+ | modules | pair | pair+ ]
|
||||
Print the entities for instrumentation (functions, modules, or module-function pair)
|
||||
which overlap other function calls or have multiple entry points to stdout after applying
|
||||
regular expressions
|
||||
--print-instructions Print the instructions for each basic-block in the JSON/XML outputs
|
||||
|
||||
[MODE OPTIONS]
|
||||
|
||||
-o, --output Enable generation of a new executable (binary-rewrite). If a filename is not provided,
|
||||
omnitrace will use the basename and output to the cwd, unless the target binary is in the
|
||||
cwd. In the latter case, omnitrace will either use ${PWD}/<basename>.inst (non-libraries)
|
||||
or ${PWD}/instrumented/<basename> (libraries)
|
||||
-p, --pid Connect to running process
|
||||
-M, --mode [ coverage | sampling | trace ]
|
||||
Instrumentation mode. 'trace' mode instruments the selected functions, 'sampling' mode
|
||||
only instruments the main function to start and stop the sampler.
|
||||
-f, --force Force the command-line argument configuration, i.e. don't get cute. Useful for forcing
|
||||
runtime instrumentation of an executable that [A] Dyninst thinks is a library after
|
||||
reading ELF and [B] whose name makes it look like a library (e.g. starts with 'lib'
|
||||
and/or ends in '.so', '.so.*', or '.a')
|
||||
-c, --command Input executable and arguments (if '-- <CMD>' not provided)
|
||||
|
||||
[LIBRARY OPTIONS]
|
||||
|
||||
--prefer [ shared | static ] Prefer this library types when available
|
||||
-L, --library Libraries with instrumentation routines (default: "libomnitrace-dl")
|
||||
-m, --main-function The primary function to instrument around, e.g. 'main'
|
||||
--load Supplemental instrumentation library names w/o extension (e.g. 'libinstr' for
|
||||
'libinstr.so' or 'libinstr.a')
|
||||
--load-instr Load {available,instrumented,excluded,overlapping}-instr JSON or XML file(s) and override
|
||||
what is read from the binary
|
||||
--init-functions Initialization function(s) for supplemental instrumentation libraries (see '--load'
|
||||
option)
|
||||
--fini-functions Finalization function(s) for supplemental instrumentation libraries (see '--load' option)
|
||||
--all-functions When finding functions, include the functions which are not instrumentable. This is
|
||||
purely diagnostic for the available/excluded functions output
|
||||
|
||||
[SYMBOL SELECTION OPTIONS]
|
||||
|
||||
-I, --function-include Regex(es) for including functions (despite heuristics)
|
||||
-E, --function-exclude Regex(es) for excluding functions (always applied)
|
||||
-R, --function-restrict Regex(es) for restricting functions only to those that match the provided
|
||||
regular-expressions
|
||||
--caller-include Regex(es) for including functions that call the listed functions (despite heuristics)
|
||||
-MI, --module-include Regex(es) for selecting modules/files/libraries (despite heuristics)
|
||||
-ME, --module-exclude Regex(es) for excluding modules/files/libraries (always applied)
|
||||
-MR, --module-restrict Regex(es) for restricting modules/files/libraries only to those that match the provided
|
||||
regular-expressions
|
||||
--internal-function-include Regex(es) for including functions which are (likely) utilized by omnitrace itself. Use
|
||||
this option with care.
|
||||
--internal-module-include Regex(es) for including modules/libraries which are (likely) utilized by omnitrace
|
||||
itself. Use this option with care.
|
||||
--instruction-exclude Regex(es) for excluding functions containing certain instructions
|
||||
--internal-library-deps Treat the libraries linked to the internal libraries as internal libraries. This increase
|
||||
the internal library processing time and consume more memory (so use with care) but may
|
||||
be useful when the application uses Boost libraries and Dyninst is dynamically linked
|
||||
against the same boost libraries
|
||||
--internal-library-append Append to the list of libraries which omnitrace treats as being used internally, e.g.
|
||||
OmniTrace will find all the symbols in this library and prevent them from being
|
||||
instrumented.
|
||||
--internal-library-remove [ ld-linux-x86-64.so.2
|
||||
libBrokenLocale.so.1
|
||||
libanl.so.1
|
||||
libbfd.so
|
||||
libbz2.so
|
||||
libc.so.6
|
||||
libcaliper.so
|
||||
libcommon.so
|
||||
libcrypt.so.1
|
||||
libdl.so.2
|
||||
libdw.so
|
||||
libdwarf.so
|
||||
libdyninstAPI_RT.so
|
||||
libelf.so
|
||||
libgcc_s.so.1
|
||||
libgotcha.so
|
||||
liblikwid.so
|
||||
liblzma.so
|
||||
libnsl.so.1
|
||||
libnss_compat.so.2
|
||||
libnss_db.so.2
|
||||
libnss_dns.so.2
|
||||
libnss_files.so.2
|
||||
libnss_hesiod.so.2
|
||||
libnss_ldap.so.2
|
||||
libnss_nis.so.2
|
||||
libnss_nisplus.so.2
|
||||
libnss_test1.so.2
|
||||
libnss_test2.so.2
|
||||
libpapi.so
|
||||
libpfm.so
|
||||
libprofiler.so
|
||||
libpthread.so.0
|
||||
libresolv.so.2
|
||||
librocm_smi64.so
|
||||
librocmtools.so
|
||||
librocprofiler64.so
|
||||
libroctracer64.so
|
||||
libroctx64.so
|
||||
librt.so.1
|
||||
libstdc++.so.6
|
||||
libtbb.so
|
||||
libtbbmalloc.so
|
||||
libtbbmalloc_proxy.so
|
||||
libtcmalloc.so
|
||||
libtcmalloc_and_profiler.so
|
||||
libtcmalloc_debug.so
|
||||
libtcmalloc_minimal.so
|
||||
libtcmalloc_minimal_debug.so
|
||||
libthread_db.so.1
|
||||
libunwind-coredump.so
|
||||
libunwind-generic.so
|
||||
libunwind-ptrace.so
|
||||
libunwind-setjmp.so
|
||||
libunwind-x86_64.so
|
||||
libunwind.so
|
||||
libutil.so.1
|
||||
libz.so
|
||||
libzstd.so ]
|
||||
Remove the specified libraries from being treated as being used internally, e.g.
|
||||
OmniTrace will permit all the symbols in these libraries to be eligible for
|
||||
instrumentation.
|
||||
--linkage [ global | local | unique | unknown | weak ]
|
||||
Only instrument functions with specified linkage (default: global, local, unique)
|
||||
--visibility [ default | hidden | internal | protected | unknown ]
|
||||
Only instrument functions with specified visibility (default: default, internal, hidden,
|
||||
protected)
|
||||
|
||||
[RUNTIME OPTIONS]
|
||||
|
||||
--label [ args | file | line | return ]
|
||||
Labeling info for functions. By default, just the function name is recorded. Use these
|
||||
options to gain more information about the function signature or location of the
|
||||
functions
|
||||
-C, --config Read in a configuration file and encode these values as the defaults in the executable
|
||||
-d, --default-components Default components to instrument (only useful when timemory is enabled in omnitrace
|
||||
library)
|
||||
--env Environment variables to add to the runtime in form VARIABLE=VALUE. E.g. use '--env
|
||||
OMNITRACE_PROFILE=ON' to default to using timemory instead of perfetto
|
||||
--mpi Enable MPI support (requires omnitrace built w/ full or partial MPI support). NOTE: this
|
||||
will automatically be activated if MPI_Init, MPI_Init_thread, MPI_Finalize,
|
||||
MPI_Comm_rank, or MPI_Comm_size are found in the symbol table of target
|
||||
|
||||
[GRANULARITY OPTIONS]
|
||||
|
||||
-l, --instrument-loops Instrument at the loop level
|
||||
-i, --min-instructions If the number of instructions in a function is less than this value, exclude it from
|
||||
instrumentation
|
||||
-r, --min-address-range If the address range of a function is less than this value, exclude it from
|
||||
instrumentation
|
||||
--min-instructions-loop If the number of instructions in a function containing a loop is less than this value,
|
||||
exclude it from instrumentation
|
||||
--min-address-range-loop If the address range of a function containing a loop is less than this value, exclude it
|
||||
from instrumentation
|
||||
--coverage [ basic_block | function | none ]
|
||||
Enable recording the code coverage. If instrumenting in coverage mode ('-M converage'),
|
||||
this simply specifies the granularity. If instrumenting in trace or sampling mode, this
|
||||
enables recording code-coverage in addition to the instrumentation of that mode (if any).
|
||||
--dynamic-callsites Force instrumentation if a function has dynamic callsites (e.g. function pointers)
|
||||
--traps Instrument points which require using a trap. On the x86 architecture, because
|
||||
instructions are of variable size, the instruction at a point may be too small for
|
||||
Dyninst to replace it with the normal code sequence used to call instrumentation. Also,
|
||||
when instrumentation is placed at points other than subroutine entry, exit, or call
|
||||
points, traps may be used to ensure the instrumentation fits. In this case, Dyninst
|
||||
replaces the instruction with a single-byte instruction that generates a trap.
|
||||
--loop-traps Instrument points within a loop which require using a trap (only relevant when
|
||||
--instrument-loops is enabled).
|
||||
--allow-overlapping Allow dyninst to instrument either multiple functions which overlap (share part of same
|
||||
function body) or single functions with multiple entry points. For more info, see Section
|
||||
2 of the DyninstAPI documentation.
|
||||
--parse-all-modules By default, omnitrace simply requests Dyninst to provide all the procedures in the
|
||||
application image. If this option is enabled, omnitrace will iterate over all the modules
|
||||
and extract the functions. Theoretically, it should be the same but the data is slightly
|
||||
different, possibly due to weak binding scopes. In general, enabling option will probably
|
||||
have no visible effect
|
||||
|
||||
[DYNINST OPTIONS]
|
||||
|
||||
-b, --batch-size Dyninst supports batch insertion of multiple points during runtime instrumentation. If
|
||||
one large batch insertion fails, this value will be used to create smaller batches.
|
||||
Larger batches generally decrease the instrumentation time
|
||||
--dyninst-rt Path(s) to the dyninstAPI_RT library
|
||||
--dyninst-options [ BaseTrampDeletion
|
||||
DebugParsing
|
||||
DelayedParsing
|
||||
InstrStackFrames
|
||||
MergeTramp
|
||||
SaveFPR
|
||||
TrampRecursive
|
||||
TypeChecking ]
|
||||
Advanced dyninst options: BPatch::set<OPTION>(bool), e.g. bpatch->setTrampRecursive(true)
|
||||
```
|
||||
|
||||
There are three ways to perform instrumentation:
|
||||
|
||||
1. Running the application via the omnitrace-instrument executable (analogous to `gdb --args <program> <args>`)
|
||||
- This mode is the default if neither the `-p` nor `-o` command-line options are used
|
||||
- Runtime instrumentation supports instrumenting not only the target executable but also the
|
||||
shared libraries loaded by the target executable. Consequently, this mode consumes more memory,
|
||||
takes longer to perform the instrumentation, and tends to have a more significant overhead on the
|
||||
runtime of the application
|
||||
- This mode is recommended if you want to analyze not only the performance of your executable and/or
|
||||
libraries but also the performance of the library dependencies
|
||||
2. Attaching to a process that is currently running (analogous to `gdb -p <PID>`)
|
||||
- This mode is activated via the `-p <PID>` option
|
||||
- Same caveats as 1. with respect to memory and overhead
|
||||
3. Generating a new executable or library with the instrumentation built-in (binary rewrite)
|
||||
- This mode is activated via the `-o <output-file>` option
|
||||
- Binary rewriting is limited to the text section of the target executable or library: it will not instrument
|
||||
the dynamically-linked libraries. Consequently, this mode performs the instrumentation significantly faster
|
||||
and has a much lower overhead when running the instrumented executable and/or libraries
|
||||
- Binary rewriting is the recommended mode when the target executable uses process-level parallelism (e.g. MPI)
|
||||
- If your target executable has a minimal main and the bulk of your application is in one specific dynamic library,
|
||||
see [Binary Rewriting a Library](#binary-rewriting-a-library) for help
|
||||
|
||||
|
||||
> ***Attaching to a running process is an alpha feature and support for detaching from the target process***
|
||||
> ***without ending the target process is not currently supported.***
|
||||
|
||||
The general syntax for separating omnitrace command line arguments from the application arguments
|
||||
is consistent with the LLVM style of using a standalone double-hyphen (`--`). All arguments preceding the double-hyphen
|
||||
are interpreted as belonging to omnitrace and all arguments following the double-hyphen are interpreted as the
|
||||
application and its arguments. In binary rewrite mode, all application arguments after the first argument
|
||||
are ignored, i.e. `./omnitrace-instrument -o ls.inst -- ls -l` interprets `ls` as the target to instrument (ignores the `-l` argument)
|
||||
and generates an `ls.inst` executable that you can subsequently run with `omnitrace-run -- ls.inst -l`.
|
||||
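The double-hyphen convention can be sketched in a few lines of shell. This is an illustration of the general idea only, not omnitrace's own argument parser, and `split_at_double_hyphen` is a hypothetical name:

```shell
# split_at_double_hyphen <args>...: print the arguments before and after the first standalone '--'
split_at_double_hyphen() {
    tool_args=""
    while [ "$#" -gt 0 ]; do
        if [ "$1" = "--" ]; then
            shift   # drop the separator itself
            break
        fi
        # Joined with spaces for display only; this loses the original quoting
        tool_args="${tool_args}${tool_args:+ }$1"
        shift
    done
    echo "tool args: ${tool_args}"
    echo "app args: $*"
}

split_at_double_hyphen -o ls.inst -- ls -l
```

Everything before the separator is attributed to the tool and everything after it to the application, which is why `-l` ends up on the application side.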
|
||||
## Runtime Instrumentation
|
||||
|
||||
```shell
|
||||
omnitrace-instrument <omnitrace-options> -- <exe> [<exe-options>...]
|
||||
```
|
||||
|
||||
## Attaching to Running Process
|
||||
|
||||
```shell
|
||||
omnitrace-instrument <omnitrace-options> -p <PID> -- <exe-name>
|
||||
```
|
||||
|
||||
## Binary Rewrite
|
||||
|
||||
```shell
|
||||
omnitrace-instrument <omnitrace-options> -o <name-of-new-exe-or-library> -- <exe-or-library>
|
||||
```
|
||||
|
||||
### Binary Rewriting a Library
|
||||
|
||||
Many applications bundle the bulk of their functionality into one or more dynamic libraries and have a relatively simple main
|
||||
which links to these libraries and simply serves as the "driver" for setting up the workflow. If you binary rewrite your
|
||||
executable and find there is insufficient info because of this, you can either switch to runtime instrumentation or
|
||||
binary rewrite the libraries of interest.
|
||||
|
||||
Support for standalone binary rewriting of a dynamic library without binary rewriting the executable is a beta feature.
|
||||
In general, it is supported as long as the library contains the `_init` and `_fini` symbols but these symbols are not
|
||||
standardized to the extent of `main` in an executable.
|
||||
The recommended workflow is as follows:
|
||||
|
||||
1. Determine the names of the dynamically linked libraries of interest via `ldd`
|
||||
2. Generate a binary rewrite of the executable
|
||||
3. Generate a binary rewrite of the desired libraries with the same base name as the original library, e.g. `libfoo.so.2` instead of `libfoo.so`
|
||||
- Output the instrumented library into a different folder than the original library
|
||||
4. Prefix the `LD_LIBRARY_PATH` environment variable with the output folder from step 3
|
||||
5. Verify via `ldd` that the instrumented executable resolves the location of the instrumented library
|
||||
|
||||
### Binary Rewriting a Library Example
|
||||
|
||||
The `foo` executable is dynamically linked to `libfoo.so.2`:
|
||||
|
||||
```shell
|
||||
$ pwd
|
||||
/home/user
|
||||
$ which foo
|
||||
/usr/local/bin/foo
|
||||
$ ldd /usr/local/bin/foo
|
||||
...
|
||||
libfoo.so.2 => /usr/local/lib/libfoo.so.2 (...)
|
||||
...
|
||||
```
|
||||
|
||||
Generate binary rewrites of `foo` and `libfoo.so.2`:
|
||||
|
||||
```shell
|
||||
omnitrace-instrument -o ./foo.inst -- foo
|
||||
omnitrace-instrument -o ./libfoo.so.2 -- /usr/local/lib/libfoo.so.2
|
||||
```
|
||||
|
||||
At this point, the instrumented `foo.inst` executable will still dynamically load the original `libfoo.so.2` in `/usr/local/lib`:
|
||||
|
||||
```shell
|
||||
$ ldd ./foo.inst
|
||||
...
|
||||
libfoo.so.2 => /usr/local/lib/libfoo.so.2 (...)
|
||||
...
|
||||
```
|
||||
|
||||
Prefix the `LD_LIBRARY_PATH` environment variable with the folder containing the instrumented `libfoo.so.2`:
|
||||
|
||||
```shell
|
||||
export LD_LIBRARY_PATH=/home/user:${LD_LIBRARY_PATH}
|
||||
```
|
||||
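If `LD_LIBRARY_PATH` is unset or empty, the export above leaves a trailing `:`, and the dynamic loader treats an empty entry as the current working directory. A minimal sketch that avoids this is shown below; `prepend_ld_library_path` is a hypothetical helper, not part of omnitrace:

```shell
# prepend_ld_library_path <dir>: put <dir> first without leaving an empty entry behind
prepend_ld_library_path() {
    # ${VAR:+...} expands only when VAR is set and non-empty
    LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
    export LD_LIBRARY_PATH
}

prepend_ld_library_path /home/user
echo "$LD_LIBRARY_PATH"
```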
|
||||
When `foo.inst` is executed, it will now load the instrumented library:
|
||||
|
||||
```shell
|
||||
$ ldd ./foo.inst
|
||||
...
|
||||
libfoo.so.2 => /home/user/libfoo.so.2 (...)
|
||||
...
|
||||
```
|
||||
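The final verification step can also be scripted. The sketch below extracts the resolved path of one library from `ldd`-style output; `resolved_path` is a hypothetical helper, and the sample line mirrors the example output above rather than coming from a real run:

```shell
# resolved_path <library-name>: read ldd output on stdin and print where the library resolves
resolved_path() {
    awk -v lib="$1" '$1 == lib && $2 == "=>" { print $3 }'
}

# Sample input; in practice, pipe in the output of `ldd ./foo.inst`
printf '\tlibfoo.so.2 => /home/user/libfoo.so.2 (...)\n' | resolved_path libfoo.so.2
```

Comparing the printed path against the folder that holds the instrumented library confirms that the loader will pick up the rewritten copy.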
|
||||
## Selective Instrumentation
|
||||
|
||||
By default, omnitrace-instrument does not instrument every symbol in the binary. The default rules are:
|
||||
|
||||
- Skip instrumenting dynamic call-sites (i.e. function pointers)
|
||||
- Option `--dynamic-callsites` will force instrumentation for all dynamic call-sites
|
||||
- The cost of a function can be loosely approximated by the number of instructions, so by default omnitrace-instrument only instruments functions with at least 1024 instructions
|
||||
- Option `--min-instructions` will modify this heuristic for all functions which do not contain loops
|
||||
- Option `--min-instructions-loop` will modify this heuristic for functions which contain loops
|
||||
- This separate loop option is provided because functions with loops can be compact in the binary while also being costly
|
||||
- The cost of a function can also be loosely approximated by the size of the function in the binary, so this heuristic can be used in lieu of or in addition to the minimum number of instructions
|
||||
- Option `--min-address-range` will modify this heuristic for all functions which do not contain loops
|
||||
- Option `--min-address-range-loop` will modify this heuristic for functions which contain loops
|
||||
- This separate loop option is provided because functions with loops can be compact in the binary while also being costly
|
||||
- Skip instrumentation points which require using a trap
|
||||
- See the description for the `--traps` and `--loop-traps` options for more information
|
||||
- Skip instrumenting loops within the body of a function
|
||||
- Option `--instrument-loops` will enable loop-level instrumentation
|
||||
- Skip instrumenting functions with overlapping function bodies and single functions with multiple entry points
|
||||
- These arise from various optimizations and instrumenting these functions can be enabled via the `--allow-overlapping` option
|
||||
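The instruction-count heuristic can be pictured as a simple threshold filter. The sketch below is not how omnitrace implements it; the function names and instruction counts are invented for illustration, and `filter_by_min_instructions` is a hypothetical helper:

```shell
# filter_by_min_instructions <min>: read "<function> <instruction-count>" lines on stdin
# and print only the functions that meet the threshold
filter_by_min_instructions() {
    awk -v min="$1" '$2 + 0 >= min + 0 { print $1 }'
}

# Made-up sample data, filtered with the default threshold of 1024 instructions
printf '%s\n' 'small_helper 12' 'solve_step 4096' 'tight_loop 300' |
    filter_by_min_instructions 1024
```

With the default threshold only `solve_step` survives; lowering the value passed to the filter (the analogue of `--min-instructions`) would admit `tight_loop` as well.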
|
||||
### Viewing the Available, Instrumented, Excluded, and Overlapping Functions
|
||||
|
||||
Whenever omnitrace-instrument is executed with a verbosity of zero or higher, it emits files which detail which functions (and which module they were defined in)
|
||||
were available for instrumentation, which functions were instrumented, which functions were excluded, and which functions contained overlapping function bodies.
|
||||
By default, these files are written to an `omnitrace-<NAME>-output` folder, where `<NAME>` is the basename of the targeted binary or
|
||||
(in the case of binary rewrite) the basename of the resulting executable, e.g.
|
||||
`omnitrace-instrument -- ls` will output its files to `omnitrace-ls-output` whereas `omnitrace-instrument -o ls.inst -- ls` will output to `omnitrace-ls.inst-output`.
|
||||
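The naming rule can be reproduced with `basename`. This is a small sketch of the rule as described here, assuming no other setting changes the output path; `output_dir` is a hypothetical helper:

```shell
# output_dir <binary-or-rewrite-target>: derive the default output folder name
output_dir() {
    echo "omnitrace-$(basename "$1")-output"
}

output_dir /usr/bin/ls   # runtime instrumentation of ls
output_dir ./ls.inst     # binary rewrite written to ls.inst
```

The two calls print `omnitrace-ls-output` and `omnitrace-ls.inst-output`, matching the folders named in the text.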
|
||||
If you would like to generate these files without executing or generating an executable, use the `--simulate` option:
|
||||
|
||||
```shell
|
||||
omnitrace-instrument --simulate -- foo
|
||||
omnitrace-instrument --simulate -o foo.inst -- foo
|
||||
```
|
||||
|
||||
### Excluding and Including Modules and Functions
|
||||
|
||||
[OmniTrace](https://github.com/ROCm/omnitrace) has six command-line options, each of which accepts one or more regular expressions, for customizing which modules and/or functions are
|
||||
instrumented. Multiple regexes per option are treated as an OR operation, e.g. `--module-include libfoo libbar` is effectively the same as `--module-include 'libfoo|libbar'`.
|
||||
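The OR semantics can be demonstrated with `grep -E`, used here purely as a stand-in for the regex matching that omnitrace performs internally (its regex dialect may differ). The module names are invented for the example:

```shell
modules='libfoo.so
libbar.so
libbaz.so'

# Two separate patterns, analogous to passing two regexes to one option
separate=$(printf '%s\n' "$modules" | grep -E -e 'libfoo' -e 'libbar')

# One combined pattern using alternation
combined=$(printf '%s\n' "$modules" | grep -E 'libfoo|libbar')

echo "$separate"
[ "$separate" = "$combined" ] && echo "both forms select the same modules"
```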
|
||||
If you would like to force the inclusion of certain modules and/or functions without changing any of the heuristics, use the `--module-include` and/or `--function-include` options.
|
||||
Note that these options will not exclude modules and/or functions which do not satisfy their regular expression.
|
||||
|
||||
If you would like to narrow the scope of the instrumentation to a specific set of libraries and/or functions, use the `--module-restrict` and `--function-restrict` options.
|
||||
Applying these options allows you to exclusively select the union of one or more regular expressions, regardless of whether or not the functions satisfy the
|
||||
aforementioned default heuristics. Any function or module that is not within the union of these regular expressions will be excluded from instrumentation.
|
||||
|
||||
If you would like to avoid instrumenting a set of modules and/or functions, use the `--module-exclude` and `--function-exclude` options.
|
||||
These options are always applied regardless of whether the module or function satisfied the "restrict" or "include" regular expression.
|
||||
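The interaction between the three option families can be summarized as a decision procedure. The sketch below is one reading of the rules described above, not omnitrace's actual implementation; the patterns, the `matches` and `decide` helpers, and the yes/no heuristics flag are all assumptions made for illustration:

```shell
# Hypothetical patterns; an empty string means the option was not given
exclude_re='^CommRecv$'
restrict_re=''
include_re='^Comm'

# matches <name> <regex>: true when the regex is non-empty and matches the name
matches() { [ -n "$2" ] && printf '%s\n' "$1" | grep -Eq -- "$2"; }

# decide <function-name> <passes-default-heuristics: yes|no>
decide() {
    if matches "$1" "$exclude_re"; then
        echo "excluded"        # exclude is always applied
    elif [ -n "$restrict_re" ] && ! matches "$1" "$restrict_re"; then
        echo "excluded"        # outside the restricted set
    elif [ -n "$restrict_re" ] || matches "$1" "$include_re"; then
        echo "instrumented"    # selected regardless of the heuristics
    elif [ "$2" = yes ]; then
        echo "instrumented"    # accepted by the default heuristics
    else
        echo "excluded"
    fi
}

decide CommSend no
decide CommRecv yes
decide InitMeshDecomp no
```

Here `CommSend` is instrumented because it matches the include pattern despite failing the heuristics, `CommRecv` is excluded because the exclude pattern always wins, and `InitMeshDecomp` is excluded because nothing selects it.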
|
||||
#### Example Available Module and Function Info Output
|
||||
|
||||
> ***`omnitrace-instrument -o lulesh.inst --label file line args --simulate -- lulesh`***
|
||||
|
||||
```console
|
||||
AddressRange Module Function FunctionSignature
|
||||
9165 ../examples/lulesh/lulesh-comm.cc CommMonoQ CommMonoQ(domain) [lulesh-comm.cc:1891]
|
||||
3396 ../examples/lulesh/lulesh-comm.cc CommRecv CommRecv(domain, int, Index_t, Index_t, Index_t, Index_t, bool, bool) [lulesh...
|
||||
8666 ../examples/lulesh/lulesh-comm.cc CommSBN CommSBN(domain, int, Domain_member *) [lulesh-comm.cc:926]
|
||||
10212 ../examples/lulesh/lulesh-comm.cc CommSend CommSend(domain, int, Index_t, Domain_member *, Index_t, Index_t, Index_t, bo...
|
||||
6823 ../examples/lulesh/lulesh-comm.cc CommSyncPosVel CommSyncPosVel(domain) [lulesh-comm.cc:1404]
|
||||
126 ../examples/lulesh/lulesh-comm.cc _GLOBAL__sub_I_lulesh_comm.cc _GLOBAL__sub_I_lulesh_comm.cc() [lulesh-comm.cc]
|
||||
308 ../examples/lulesh/lulesh-init.cc .omp_outlined..26 .omp_outlined..26(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
|
||||
628 ../examples/lulesh/lulesh-init.cc .omp_outlined..34 .omp_outlined..34(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
|
||||
656 ../examples/lulesh/lulesh-init.cc .omp_outlined..41 .omp_outlined..41(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
|
||||
662 ../examples/lulesh/lulesh-init.cc .omp_outlined..45 .omp_outlined..45(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
|
||||
550 ../examples/lulesh/lulesh-init.cc .omp_outlined..55 .omp_outlined..55(const , const , const ParallelFor<Kokkos::Impl::ViewFill<Ko...
|
||||
556 ../examples/lulesh/lulesh-init.cc .omp_outlined..57 .omp_outlined..57(const , const , const ParallelFor<Kokkos::Impl::ViewFill<Ko...
|
||||
550 ../examples/lulesh/lulesh-init.cc .omp_outlined..78 .omp_outlined..78(const , const , const ParallelFor<Kokkos::Impl::ViewFill<Ko...
|
||||
640 ../examples/lulesh/lulesh-init.cc .omp_outlined..84 .omp_outlined..84(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
646 ../examples/lulesh/lulesh-init.cc .omp_outlined..88 .omp_outlined..88(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
1840 ../examples/lulesh/lulesh-init.cc Domain::AllocateElemPersistent Domain::AllocateElemPersistent(Domain *, Int_t) [lulesh-init.cc:94]
1384 ../examples/lulesh/lulesh-init.cc Domain::AllocateNodePersistent Domain::AllocateNodePersistent(Domain *, Int_t) [lulesh-init.cc:94]
1264 ../examples/lulesh/lulesh-init.cc Domain::BuildMesh Domain::BuildMesh(Domain *, Int_t, Int_t, Int_t) [lulesh-init.cc:308]
2312 ../examples/lulesh/lulesh-init.cc Domain::CreateRegionIndexSets Domain::CreateRegionIndexSets(Domain *, Int_t, Int_t) [lulesh-init.cc:409]
7109 ../examples/lulesh/lulesh-init.cc Domain::Domain Domain::Domain(Domain *, Int_t, Index_t, Index_t, Index_t, Index_t, int, int,...
2458 ../examples/lulesh/lulesh-init.cc Domain::SetupBoundaryConditions Domain::SetupBoundaryConditions(Domain *, Int_t) [lulesh-init.cc:409]
956 ../examples/lulesh/lulesh-init.cc Domain::SetupCommBuffers Domain::SetupCommBuffers(Domain *, Int_t) [lulesh-init.cc]
1456 ../examples/lulesh/lulesh-init.cc Domain::SetupElementConnectivities Domain::SetupElementConnectivities(Domain *, Int_t) [lulesh-init.cc:409]
721 ../examples/lulesh/lulesh-init.cc Domain::SetupSymmetryPlanes Domain::SetupSymmetryPlanes(Domain *, Int_t) [lulesh-init.cc:409]
1591 ../examples/lulesh/lulesh-init.cc Domain::SetupThreadSupportStructures Domain::SetupThreadSupportStructures(Domain *) [lulesh-init.cc:376]
1644 ../examples/lulesh/lulesh-init.cc Domain::~Domain Domain::~Domain(Domain *) [lulesh-init.cc:286]
218 ../examples/lulesh/lulesh-init.cc InitMeshDecomp InitMeshDecomp(Int_t, Int_t, Int_t *, Int_t *, Int_t *, Int_t *) [lulesh-init...
260 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::CommonSubview<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokk... Kokkos::Impl::CommonSubview<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokk...
1786 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::HostIterateTile<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::R... Kokkos::Impl::HostIterateTile<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::R...
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int**... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int**...
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int**... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int**...
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int*,... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int*,...
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int*,... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<int*,...
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl...
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl...
330 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewFill<Kokkos::View<doubl...
522 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::... Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::...
232 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::... Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::...
49 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewVal... Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewVal...
1476 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::Tile_Loop_Type<2, false, int, void, void>::apply<Kokkos::Impl::... Kokkos::Impl::Tile_Loop_Type<2, false, int, void, void>::apply<Kokkos::Impl::...
555 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::LayoutRight, Kokkos::Devic... Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::LayoutRight, Kokkos::Devic...
613 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::LayoutRight, Kokkos::Devic... Kokkos::Impl::ViewCopy<Kokkos::View<int**, Kokkos::LayoutRight, Kokkos::Devic...
603 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCopy<Kokkos::View<int*, Kokkos::LayoutLeft, Kokkos::Device<... Kokkos::Impl::ViewCopy<Kokkos::View<int*, Kokkos::LayoutLeft, Kokkos::Device<...
604 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCopy<Kokkos::View<int*, Kokkos::LayoutLeft, Kokkos::Device<... Kokkos::Impl::ViewCopy<Kokkos::View<int*, Kokkos::LayoutLeft, Kokkos::Device<...
281 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
281 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
281 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
281 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
281 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
524 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev... Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev...
525 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev... Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev...
524 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev... Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Dev...
583 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<int* [8], Kokkos::LayoutRight>, ... SharedAllocationRecord<void, void> * Kokkos::Impl::ViewMapping<Kokkos::ViewTr...
529 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<int*, Kokkos::HostSpace>, void>:... SharedAllocationRecord<void, void> * Kokkos::Impl::ViewMapping<Kokkos::ViewTr...
529 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<int*>, void>::allocate_shared<st... SharedAllocationRecord<void, void> * Kokkos::Impl::ViewMapping<Kokkos::ViewTr...
203 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewRemap<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::... Kokkos::Impl::ViewRemap<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::...
331 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewRemap<Kokkos::View<int*>, Kokkos::View<int*>, Kokkos::OpenM... Kokkos::Impl::ViewRemap<Kokkos::View<int*>, Kokkos::View<int*>, Kokkos::OpenM...
461 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpa... enable_if_t<std::is_trivial<int>::value && std::is_trivially_copy_assignable<...
353 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::contiguous_fill<Kokkos::OpenMP, double*> Kokkos::Impl::contiguous_fill<Kokkos::OpenMP, double*>(exec_space, dst, value...
139 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::contiguous_fill<Kokkos::OpenMP, double, Kokkos::LayoutRight, Ko... Kokkos::Impl::contiguous_fill<Kokkos::OpenMP, double, Kokkos::LayoutRight, Ko...
824 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight, Kokkos::D... Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight, Kokkos::D...
824 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight, Kokkos::D... Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight, Kokkos::D...
824 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::... Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::...
824 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::... Kokkos::Impl::view_copy<Kokkos::View<int* [8], Kokkos::LayoutRight>, Kokkos::...
697 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::view_copy<Kokkos::View<int*, Kokkos::LayoutRight, Kokkos::Devic... Kokkos::Impl::view_copy<Kokkos::View<int*, Kokkos::LayoutRight, Kokkos::Devic...
697 ../examples/lulesh/lulesh-init.cc Kokkos::Impl::view_copy<Kokkos::View<int*>, Kokkos::View<int*> > Kokkos::Impl::view_copy<Kokkos::View<int*>, Kokkos::View<int*> >(dst, src) [l...
2036 ../examples/lulesh/lulesh-init.cc Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::Schedule<Kokkos::Static>, int>::R... Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::Schedule<Kokkos::Static>, int>::R...
2506 ../examples/lulesh/lulesh-init.cc Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::Schedule<Kokkos::Static>, long>::... Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::Schedule<Kokkos::Static>, long>::...
271 ../examples/lulesh/lulesh-init.cc Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor... Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor...
470 ../examples/lulesh/lulesh-init.cc Kokkos::View<int* [8], Kokkos::LayoutRight>::View<std::__cxx11::basic_string<... Kokkos::View<int* [8], Kokkos::LayoutRight>::View<std::__cxx11::basic_string<...
323 ../examples/lulesh/lulesh-init.cc Kokkos::View<int* [8], Kokkos::LayoutRight>::View<std::__cxx11::basic_string<... Kokkos::View<int* [8], Kokkos::LayoutRight>::View<std::__cxx11::basic_string<...
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*, Kokkos::HostSpace>::View<char [10]> Kokkos::View<int*, Kokkos::HostSpace>::View<char [10]>(View<int *, Kokkos::Ho...
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*, Kokkos::HostSpace>::View<char [14]> Kokkos::View<int*, Kokkos::HostSpace>::View<char [14]>(View<int *, Kokkos::Ho...
462 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*, Kokkos::HostSpace>::View<std::__cxx11::basic_string<char, ... Kokkos::View<int*, Kokkos::HostSpace>::View<std::__cxx11::basic_string<char, ...
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<char [16]> Kokkos::View<int*>::View<char [16]>(View<int *> *, arg_label, type, const siz...
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<char [19]> Kokkos::View<int*>::View<char [19]>(View<int *> *, arg_label, type, const siz...
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<char [21]> Kokkos::View<int*>::View<char [21]>(View<int *> *, arg_label, type, const siz...
462 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<std::__cxx11::basic_string<char, std::char_traits<ch... Kokkos::View<int*>::View<std::__cxx11::basic_string<char, std::char_traits<ch...
323 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<std::__cxx11::basic_string<char, std::char_traits<ch... Kokkos::View<int*>::View<std::__cxx11::basic_string<char, std::char_traits<ch...
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<double*, , double*, Kokkos::LayoutRight, Kokkos::Device<Kok... Kokkos::deep_copy<double*, , double*, Kokkos::LayoutRight, Kokkos::Device<Kok...
1052 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<double*> Kokkos::deep_copy<double*>(dst, value) [lulesh-init.cc]
1050 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<double, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP,... Kokkos::deep_copy<double, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP,...
7686 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenM... Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenM...
7686 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, int* [8], Kokkos::LayoutRigh... Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, int* [8], Kokkos::LayoutRigh...
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int*, , int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::O... Kokkos::deep_copy<int*, , int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::O...
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Ko... Kokkos::deep_copy<int*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Ko...
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP, K... Kokkos::deep_copy<int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP, K...
863 ../examples/lulesh/lulesh-init.cc Kokkos::impl_resize<, int* [8], Kokkos::LayoutRight> type Kokkos::impl_resize<, int* [8], Kokkos::LayoutRight>(v, const size_t, co...
854 ../examples/lulesh/lulesh-init.cc Kokkos::impl_resize<, int*> type Kokkos::impl_resize<, int*>(v, const size_t, const size_t, const size_t,...
697 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (... Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (...
706 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (... Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (...
912 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
791 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
791 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
944 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
839 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
126 ../examples/lulesh/lulesh-init.cc _GLOBAL__sub_I_lulesh_init.cc _GLOBAL__sub_I_lulesh_init.cc() [lulesh-init.cc]
6589 ../examples/lulesh/lulesh-util.cc Kokkos::deep_copy<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP... Kokkos::deep_copy<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP...
1345 ../examples/lulesh/lulesh-util.cc ParseCommandLineOptions ParseCommandLineOptions(int, char * *, int, cmdLineOpts *) [lulesh-util.cc:67]
171 ../examples/lulesh/lulesh-util.cc PrintCommandLineOptions PrintCommandLineOptions(char *, int) [lulesh-util.cc:31]
67 ../examples/lulesh/lulesh-util.cc StrToInt int StrToInt(const char *, int *) [lulesh-util.cc:13]
706 ../examples/lulesh/lulesh-util.cc VerifyAndWriteFinalOutput VerifyAndWriteFinalOutput(Real_t, locDom, Int_t, Int_t) [lulesh-util.cc:222]
126 ../examples/lulesh/lulesh-util.cc _GLOBAL__sub_I_lulesh_util.cc _GLOBAL__sub_I_lulesh_util.cc() [lulesh-util.cc]
17 ../examples/lulesh/lulesh-viz.cc DumpToVisit DumpToVisit(domain, int, int, int) [lulesh-viz.cc:415]
126 ../examples/lulesh/lulesh-viz.cc _GLOBAL__sub_I_lulesh_viz.cc _GLOBAL__sub_I_lulesh_viz.cc() [lulesh-viz.cc]
451 ../examples/lulesh/lulesh.cc .omp_outlined..103 .omp_outlined..103(const , const , const ParallelReduce<(lambda at ../example...
796 ../examples/lulesh/lulesh.cc .omp_outlined..109 .omp_outlined..109(const , const , const ParallelFor<(lambda at ../examples/l...
394 ../examples/lulesh/lulesh.cc .omp_outlined..111 .omp_outlined..111(const , const , const ParallelFor<(lambda at ../examples/l...
402 ../examples/lulesh/lulesh.cc .omp_outlined..113 .omp_outlined..113(const , const , const ParallelFor<(lambda at ../examples/l...
427 ../examples/lulesh/lulesh.cc .omp_outlined..115 .omp_outlined..115(const , const , const ParallelReduce<(lambda at ../example...
859 ../examples/lulesh/lulesh.cc .omp_outlined..119 .omp_outlined..119(const , const , const ParallelFor<(lambda at ../examples/l...
243 ../examples/lulesh/lulesh.cc .omp_outlined..122 .omp_outlined..122(const , const , const ParallelFor<(lambda at ../examples/l...
426 ../examples/lulesh/lulesh.cc .omp_outlined..124 .omp_outlined..124(const , const , const ParallelFor<(lambda at ../examples/l...
529 ../examples/lulesh/lulesh.cc .omp_outlined..127 .omp_outlined..127(const , const , const ParallelFor<(lambda at ../examples/l...
865 ../examples/lulesh/lulesh.cc .omp_outlined..130 .omp_outlined..130(const , const , const ParallelFor<(lambda at ../examples/l...
539 ../examples/lulesh/lulesh.cc .omp_outlined..132 .omp_outlined..132(const , const , const ParallelReduce<(lambda at ../example...
456 ../examples/lulesh/lulesh.cc .omp_outlined..134 .omp_outlined..134(const , const , const ParallelReduce<(lambda at ../example...
252 ../examples/lulesh/lulesh.cc .omp_outlined..20 .omp_outlined..20(const , const , const ParallelFor<(lambda at ../examples/lu...
870 ../examples/lulesh/lulesh.cc .omp_outlined..35 .omp_outlined..35(const , const , const ParallelFor<(lambda at ../examples/lu...
473 ../examples/lulesh/lulesh.cc .omp_outlined..42 .omp_outlined..42(const , const , const ParallelFor<(lambda at ../examples/lu...
252 ../examples/lulesh/lulesh.cc .omp_outlined..46 .omp_outlined..46(const , const , const ParallelFor<(lambda at ../examples/lu...
1101 ../examples/lulesh/lulesh.cc .omp_outlined..48 .omp_outlined..48(const , const , const ParallelFor<(lambda at ../examples/lu...
427 ../examples/lulesh/lulesh.cc .omp_outlined..55 .omp_outlined..55(const , const , const ParallelReduce<(lambda at ../examples...
1326 ../examples/lulesh/lulesh.cc .omp_outlined..57 .omp_outlined..57(const , const , const ParallelReduce<(lambda at ../examples...
243 ../examples/lulesh/lulesh.cc .omp_outlined..61 .omp_outlined..61(const , const , const ParallelFor<(lambda at ../examples/lu...
1101 ../examples/lulesh/lulesh.cc .omp_outlined..63 .omp_outlined..63(const , const , const ParallelFor<(lambda at ../examples/lu...
372 ../examples/lulesh/lulesh.cc .omp_outlined..66 .omp_outlined..66(const , const , const ParallelFor<(lambda at ../examples/lu...
499 ../examples/lulesh/lulesh.cc .omp_outlined..71 .omp_outlined..71(const , const , const ParallelFor<(lambda at ../examples/lu...
499 ../examples/lulesh/lulesh.cc .omp_outlined..73 .omp_outlined..73(const , const , const ParallelFor<(lambda at ../examples/lu...
499 ../examples/lulesh/lulesh.cc .omp_outlined..75 .omp_outlined..75(const , const , const ParallelFor<(lambda at ../examples/lu...
465 ../examples/lulesh/lulesh.cc .omp_outlined..78 .omp_outlined..78(const , const , const ParallelFor<(lambda at ../examples/lu...
396 ../examples/lulesh/lulesh.cc .omp_outlined..81 .omp_outlined..81(const , const , const ParallelFor<(lambda at ../examples/lu...
656 ../examples/lulesh/lulesh.cc .omp_outlined..85 .omp_outlined..85(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
662 ../examples/lulesh/lulesh.cc .omp_outlined..89 .omp_outlined..89(const , const , const ParallelFor<Kokkos::Impl::ViewCopy<Ko...
443 ../examples/lulesh/lulesh.cc .omp_outlined..93 .omp_outlined..93(const , const , const ParallelReduce<(lambda at ../examples...
243 ../examples/lulesh/lulesh.cc .omp_outlined..96 .omp_outlined..96(const , const , const ParallelFor<(lambda at ../examples/lu...
243 ../examples/lulesh/lulesh.cc .omp_outlined..99 .omp_outlined..99(const , const , const ParallelFor<(lambda at ../examples/lu...
13367 ../examples/lulesh/lulesh.cc ApplyMaterialPropertiesForElems ApplyMaterialPropertiesForElems(domain) [lulesh.cc:409]
1530 ../examples/lulesh/lulesh.cc CalcElemCharacteristicLength Real_t CalcElemCharacteristicLength(const Real_t *, const Real_t *, const Rea...
982 ../examples/lulesh/lulesh.cc CalcElemFBHourglassForce CalcElemFBHourglassForce(const Real_t *, const Real_t[] *, coefficient, Real_...
2428 ../examples/lulesh/lulesh.cc CalcElemNodeNormals CalcElemNodeNormals(Real_t *, Real_t *, Real_t *, const Real_t *, const Real_...
853 ../examples/lulesh/lulesh.cc CalcElemShapeFunctionDerivatives CalcElemShapeFunctionDerivatives(const Real_t *, const Real_t *, const Real_t...
1097 ../examples/lulesh/lulesh.cc CalcElemVolumeDerivative CalcElemVolumeDerivative(i, dvdx, dvdy, dvdz, const Real_t *, const Real_t *,...
1054 ../examples/lulesh/lulesh.cc CalcKinematicsForElems CalcKinematicsForElems(domain, Real_t, Index_t) [lulesh.cc]
14160 ../examples/lulesh/lulesh.cc CalcVolumeForceForElems CalcVolumeForceForElems(domain) [lulesh.cc:409]
366 ../examples/lulesh/lulesh.cc Domain::AllocateGradients Domain::AllocateGradients(Domain *, Int_t, Int_t) [lulesh.cc:214]
475 ../examples/lulesh/lulesh.cc Domain::DeallocateGradients Domain::DeallocateGradients(Domain *) [lulesh.cc:105]
250 ../examples/lulesh/lulesh.cc Domain::DeallocateStrains Domain::DeallocateStrains(Domain *) [lulesh.cc:105]
4356 ../examples/lulesh/lulesh.cc Domain::Domain Domain::Domain(Domain *) [lulesh.cc:78]
15 ../examples/lulesh/lulesh.cc Domain::delv_eta Domain::delv_eta(const Domain *, const Index_t) [lulesh.cc:371]
15 ../examples/lulesh/lulesh.cc Domain::delv_xi Domain::delv_xi(const Domain *, const Index_t) [lulesh.cc:368]
15 ../examples/lulesh/lulesh.cc Domain::delv_zeta Domain::delv_zeta(const Domain *, const Index_t) [lulesh.cc:374]
15 ../examples/lulesh/lulesh.cc Domain::fx Domain::fx(const Domain *, const Index_t) [lulesh.cc:303]
15 ../examples/lulesh/lulesh.cc Domain::fy Domain::fy(const Domain *, const Index_t) [lulesh.cc:306]
15 ../examples/lulesh/lulesh.cc Domain::fz Domain::fz(const Domain *, const Index_t) [lulesh.cc:309]
15 ../examples/lulesh/lulesh.cc Domain::nodalMass Domain::nodalMass(const Domain *, const Index_t) [lulesh.cc:314]
15 ../examples/lulesh/lulesh.cc Domain::x Domain::x(const Domain *, const Index_t) [lulesh.cc:257]
15 ../examples/lulesh/lulesh.cc Domain::xd Domain::xd(const Domain *, const Index_t) [lulesh.cc:272]
15 ../examples/lulesh/lulesh.cc Domain::y Domain::y(const Domain *, const Index_t) [lulesh.cc:258]
15 ../examples/lulesh/lulesh.cc Domain::yd Domain::yd(const Domain *, const Index_t) [lulesh.cc:275]
15 ../examples/lulesh/lulesh.cc Domain::z Domain::z(const Domain *, const Index_t) [lulesh.cc:259]
15 ../examples/lulesh/lulesh.cc Domain::zd Domain::zd(const Domain *, const Index_t) [lulesh.cc:278]
330 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<doubl... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<doubl...
330 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<doubl... Kokkos::Impl::ParallelConstructName<Kokkos::Impl::ViewCopy<Kokkos::View<doubl...
1508 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelFor<CalcEnergyForElems(double*, double*, double*, doubl... type Kokkos::Impl::ParallelFor<CalcEnergyForElems(double*, double*, double*, ...
3606 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelFor<CalcFBHourglassForceForElems(Domain&, double*, Kokk... type Kokkos::Impl::ParallelFor<CalcFBHourglassForceForElems(Domain&, double*,...
2917 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelFor<CalcKinematicsForElems(Domain&, double, int)::$_0, ... type Kokkos::Impl::ParallelFor<CalcKinematicsForElems(Domain&, double, int)::...
3119 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelFor<CalcMonotonicQGradientsForElems(Domain&)::{lambda(i... type Kokkos::Impl::ParallelFor<CalcMonotonicQGradientsForElems(Domain&)::{lam...
1969 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelFor<CalcMonotonicQRegionForElems(Domain&, int, double):... type Kokkos::Impl::ParallelFor<CalcMonotonicQRegionForElems(Domain&, int, dou...
1265 ../examples/lulesh/lulesh.cc Kokkos::Impl::ParallelFor<IntegrateStressForElems(Domain&, double*, double*, ... type Kokkos::Impl::ParallelFor<IntegrateStressForElems(Domain&, double*, doub...
49 ../examples/lulesh/lulesh.cc Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewVal... Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewVal...
1497 ../examples/lulesh/lulesh.cc Kokkos::Impl::TeamPolicyInternal<Kokkos::OpenMP>::TeamPolicyInternal Kokkos::Impl::TeamPolicyInternal<Kokkos::OpenMP>::TeamPolicyInternal(TeamPoli...
603 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewCopy<Kokkos::View<double*, Kokkos::LayoutLeft, Kokkos::Devi... Kokkos::Impl::ViewCopy<Kokkos::View<double*, Kokkos::LayoutLeft, Kokkos::Devi...
604 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewCopy<Kokkos::View<double*, Kokkos::LayoutLeft, Kokkos::Devi... Kokkos::Impl::ViewCopy<Kokkos::View<double*, Kokkos::LayoutLeft, Kokkos::Devi...
281 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
281 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<... Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<...
521 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<double*>, void>::allocate_shared... SharedAllocationRecord<void, void> * Kokkos::Impl::ViewMapping<Kokkos::ViewTr...
331 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewRemap<Kokkos::View<double*>, Kokkos::View<double*>, Kokkos:... Kokkos::Impl::ViewRemap<Kokkos::View<double*>, Kokkos::View<double*>, Kokkos:...
461 ../examples/lulesh/lulesh.cc Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpa... enable_if_t<std::is_trivial<double>::value && std::is_trivially_copy_assignab...
1609 ../examples/lulesh/lulesh.cc Kokkos::Impl::runtime_check_rank_host Kokkos::Impl::runtime_check_rank_host(const size_t, const bool, const size_t,...
697 ../examples/lulesh/lulesh.cc Kokkos::Impl::view_copy<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::De... Kokkos::Impl::view_copy<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::De...
697 ../examples/lulesh/lulesh.cc Kokkos::Impl::view_copy<Kokkos::View<double*>, Kokkos::View<double*> > Kokkos::Impl::view_copy<Kokkos::View<double*>, Kokkos::View<double*> >(dst, s...
2250 ../examples/lulesh/lulesh.cc Kokkos::RangePolicy<Kokkos::OpenMP>::RangePolicy Kokkos::RangePolicy<Kokkos::OpenMP>::RangePolicy(RangePolicy<Kokkos::OpenMP> ...
213 ../examples/lulesh/lulesh.cc Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor... Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor...
410 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::View<char [6]> Kokkos::View<double*>::View<char [6]>(View<double *> *, arg_label, type, cons...
410 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::View<char [7]> Kokkos::View<double*>::View<char [7]>(View<double *> *, arg_label, type, cons...
462 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::View<std::__cxx11::basic_string<char, std::char_traits... Kokkos::View<double*>::View<std::__cxx11::basic_string<char, std::char_traits...
323 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::View<std::__cxx11::basic_string<char, std::char_traits... Kokkos::View<double*>::View<std::__cxx11::basic_string<char, std::char_traits...
25 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::~View Kokkos::View<double*>::~View(View<double *> *) [lulesh.cc:409]
840 ../examples/lulesh/lulesh.cc Kokkos::abort Kokkos::abort(const const char *, const const char *) [lulesh.cc:202]
854 ../examples/lulesh/lulesh.cc Kokkos::impl_resize<, double*> type Kokkos::impl_resize<, double*>(v, const size_t, const size_t, const size...
928 ../examples/lulesh/lulesh.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
960 ../examples/lulesh/lulesh.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
21470 ../examples/lulesh/lulesh.cc LagrangeLeapFrog LagrangeLeapFrog(domain) [lulesh.cc]
226 ../examples/lulesh/lulesh.cc ResizeBuffer ResizeBuffer(const size_t) [lulesh.cc:23]
169 ../examples/lulesh/lulesh.cc _GLOBAL__sub_I_lulesh.cc _GLOBAL__sub_I_lulesh.cc() [lulesh.cc]
1836 ../examples/lulesh/lulesh.cc main int main(int, char * *) [lulesh.cc]
63 ../examples/lulesh/lulesh.cc std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::a... std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::a...
20 ../examples/lulesh/lulesh.cc std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::alloca... std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::alloca...
160 ../examples/lulesh/lulesh.cc std::operator+<char, std::char_traits<char>, std::allocator<char> > basic_string<char, std::char_traits<char>, std::allocator<char> > std::operat...
187 ../examples/lulesh/lulesh.cc std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::alloc... std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::alloc...
11 lulesh __clang_call_terminate __clang_call_terminate() [lulesh]
33 lulesh __do_global_dtors_aux __do_global_dtors_aux() [lulesh]
5 lulesh __libc_csu_fini __libc_csu_fini() [lulesh]
101 lulesh __libc_csu_init __libc_csu_init() [lulesh]
5 lulesh _dl_relocate_static_pie _dl_relocate_static_pie() [lulesh]
13 lulesh _fini _fini() [lulesh]
27 lulesh _init _init() [lulesh]
47 lulesh _start _start() [lulesh]
6 lulesh frame_dummy frame_dummy() [lulesh]
|
||||
```
|
||||
|
||||
#### Example Instrumented Module and Function Info Output
|
||||
|
||||
> ***`omnitrace-instrument -o lulesh.inst --label file line args --simulate -- lulesh`***
|
||||
|
||||
After the heuristics are applied in [Example Available Module and Function Info Output](#example-available-module-and-function-info-output),
|
||||
the selected module/functions are:
|
||||
|
||||
```console
|
||||
AddressRange Module Function FunctionSignature
|
||||
9165 ../examples/lulesh/lulesh-comm.cc CommMonoQ CommMonoQ(domain) [lulesh-comm.cc:1891]
|
||||
3396 ../examples/lulesh/lulesh-comm.cc CommRecv CommRecv(domain, int, Index_t, Index_t, Index_t, Index_t, bool, bool) [lulesh...
|
||||
8666 ../examples/lulesh/lulesh-comm.cc CommSBN CommSBN(domain, int, Domain_member *) [lulesh-comm.cc:926]
|
||||
10212 ../examples/lulesh/lulesh-comm.cc CommSend CommSend(domain, int, Index_t, Domain_member *, Index_t, Index_t, Index_t, bo...
|
||||
6823 ../examples/lulesh/lulesh-comm.cc CommSyncPosVel CommSyncPosVel(domain) [lulesh-comm.cc:1404]
|
||||
1840 ../examples/lulesh/lulesh-init.cc Domain::AllocateElemPersistent Domain::AllocateElemPersistent(Domain *, Int_t) [lulesh-init.cc:94]
|
||||
1384 ../examples/lulesh/lulesh-init.cc Domain::AllocateNodePersistent Domain::AllocateNodePersistent(Domain *, Int_t) [lulesh-init.cc:94]
|
||||
1264 ../examples/lulesh/lulesh-init.cc Domain::BuildMesh Domain::BuildMesh(Domain *, Int_t, Int_t, Int_t) [lulesh-init.cc:308]
|
||||
2312 ../examples/lulesh/lulesh-init.cc Domain::CreateRegionIndexSets Domain::CreateRegionIndexSets(Domain *, Int_t, Int_t) [lulesh-init.cc:409]
|
||||
7109 ../examples/lulesh/lulesh-init.cc Domain::Domain Domain::Domain(Domain *, Int_t, Index_t, Index_t, Index_t, Index_t, int, int,...
|
||||
2458 ../examples/lulesh/lulesh-init.cc Domain::SetupBoundaryConditions Domain::SetupBoundaryConditions(Domain *, Int_t) [lulesh-init.cc:409]
|
||||
956 ../examples/lulesh/lulesh-init.cc Domain::SetupCommBuffers Domain::SetupCommBuffers(Domain *, Int_t) [lulesh-init.cc]
|
||||
1456 ../examples/lulesh/lulesh-init.cc Domain::SetupElementConnectivities Domain::SetupElementConnectivities(Domain *, Int_t) [lulesh-init.cc:409]
|
||||
721 ../examples/lulesh/lulesh-init.cc Domain::SetupSymmetryPlanes Domain::SetupSymmetryPlanes(Domain *, Int_t) [lulesh-init.cc:409]
|
||||
1591 ../examples/lulesh/lulesh-init.cc Domain::SetupThreadSupportStructures Domain::SetupThreadSupportStructures(Domain *) [lulesh-init.cc:376]
|
||||
1644 ../examples/lulesh/lulesh-init.cc Domain::~Domain Domain::~Domain(Domain *) [lulesh-init.cc:286]
|
||||
271 ../examples/lulesh/lulesh-init.cc Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor... Kokkos::StaticCrsGraph<int, Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::Memor...
|
||||
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*, Kokkos::HostSpace>::View<char [10]> Kokkos::View<int*, Kokkos::HostSpace>::View<char [10]>(View<int *, Kokkos::Ho...
|
||||
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*, Kokkos::HostSpace>::View<char [14]> Kokkos::View<int*, Kokkos::HostSpace>::View<char [14]>(View<int *, Kokkos::Ho...
|
||||
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<char [16]> Kokkos::View<int*>::View<char [16]>(View<int *> *, arg_label, type, const siz...
|
||||
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<char [19]> Kokkos::View<int*>::View<char [19]>(View<int *> *, arg_label, type, const siz...
|
||||
410 ../examples/lulesh/lulesh-init.cc Kokkos::View<int*>::View<char [21]> Kokkos::View<int*>::View<char [21]>(View<int *> *, arg_label, type, const siz...
|
||||
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<double*, , double*, Kokkos::LayoutRight, Kokkos::Device<Kok... Kokkos::deep_copy<double*, , double*, Kokkos::LayoutRight, Kokkos::Device<Kok...
|
||||
1052 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<double*> Kokkos::deep_copy<double*>(dst, value) [lulesh-init.cc]
|
||||
1050 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<double, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP,... Kokkos::deep_copy<double, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP,...
|
||||
7686 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenM... Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenM...
|
||||
7686 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, int* [8], Kokkos::LayoutRigh... Kokkos::deep_copy<int* [8], Kokkos::LayoutRight, int* [8], Kokkos::LayoutRigh...
|
||||
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int*, , int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::O... Kokkos::deep_copy<int*, , int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::O...
|
||||
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Ko... Kokkos::deep_copy<int*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Ko...
|
||||
6589 ../examples/lulesh/lulesh-init.cc Kokkos::deep_copy<int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP, K... Kokkos::deep_copy<int*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP, K...
|
||||
697 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (... Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (...
|
||||
706 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (... Kokkos::parallel_for<Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<2u, (...
|
||||
912 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
|
||||
791 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
|
||||
791 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
|
||||
944 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
|
||||
839 ../examples/lulesh/lulesh-init.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
|
||||
6589 ../examples/lulesh/lulesh-util.cc Kokkos::deep_copy<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP... Kokkos::deep_copy<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP...
|
||||
1345 ../examples/lulesh/lulesh-util.cc ParseCommandLineOptions ParseCommandLineOptions(int, char * *, int, cmdLineOpts *) [lulesh-util.cc:67]
|
||||
706 ../examples/lulesh/lulesh-util.cc VerifyAndWriteFinalOutput VerifyAndWriteFinalOutput(Real_t, locDom, Int_t, Int_t) [lulesh-util.cc:222]
|
||||
13367 ../examples/lulesh/lulesh.cc ApplyMaterialPropertiesForElems ApplyMaterialPropertiesForElems(domain) [lulesh.cc:409]
|
||||
982 ../examples/lulesh/lulesh.cc CalcElemFBHourglassForce CalcElemFBHourglassForce(const Real_t *, const Real_t[] *, coefficient, Real_...
|
||||
2428 ../examples/lulesh/lulesh.cc CalcElemNodeNormals CalcElemNodeNormals(Real_t *, Real_t *, Real_t *, const Real_t *, const Real_...
|
||||
853 ../examples/lulesh/lulesh.cc CalcElemShapeFunctionDerivatives CalcElemShapeFunctionDerivatives(const Real_t *, const Real_t *, const Real_t...
|
||||
1054 ../examples/lulesh/lulesh.cc CalcKinematicsForElems CalcKinematicsForElems(domain, Real_t, Index_t) [lulesh.cc]
|
||||
14160 ../examples/lulesh/lulesh.cc CalcVolumeForceForElems CalcVolumeForceForElems(domain) [lulesh.cc:409]
|
||||
366 ../examples/lulesh/lulesh.cc Domain::AllocateGradients Domain::AllocateGradients(Domain *, Int_t, Int_t) [lulesh.cc:214]
|
||||
475 ../examples/lulesh/lulesh.cc Domain::DeallocateGradients Domain::DeallocateGradients(Domain *) [lulesh.cc:105]
|
||||
4356 ../examples/lulesh/lulesh.cc Domain::Domain Domain::Domain(Domain *) [lulesh.cc:78]
|
||||
410 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::View<char [6]> Kokkos::View<double*>::View<char [6]>(View<double *> *, arg_label, type, cons...
|
||||
410 ../examples/lulesh/lulesh.cc Kokkos::View<double*>::View<char [7]> Kokkos::View<double*>::View<char [7]>(View<double *> *, arg_label, type, cons...
|
||||
928 ../examples/lulesh/lulesh.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<in...
|
||||
960 ../examples/lulesh/lulesh.cc Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo... Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<lo...
|
||||
21470 ../examples/lulesh/lulesh.cc LagrangeLeapFrog LagrangeLeapFrog(domain) [lulesh.cc]
|
||||
1836 ../examples/lulesh/lulesh.cc main int main(int, char * *) [lulesh.cc]
|
||||
```
|
||||
|
||||
## Sampling
|
||||
|
||||
> ***NOTE: This capability has been deprecated in favor of [omnitrace-sample](sampling.md)***
|
||||
|
||||
By default, omnitrace-instrument uses `--mode trace` for instrumentation. The `--mode sampling` option
|
||||
will only instrument `main` in an executable and will activate both CPU call-stack sampling and
|
||||
background system-level thread sampling by default.
|
||||
Tracing capabilities which do not rely on instrumentation, such as the HIP API and kernel tracing
|
||||
(which is collected via roctracer), will still be available.
|
||||
|
||||
[OmniTrace](https://github.com/ROCm/omnitrace)'s sampling capabilities are always available, even in trace mode, but are deactivated by default.
To activate sampling in trace mode, set `OMNITRACE_USE_SAMPLING=ON` in the environment
or in an omnitrace configuration file.
|
||||
|
||||
## Embedding a Default Configuration
|
||||
|
||||
Using the `--env` option, a default configuration can be embedded into the target. Although this option
works for runtime instrumentation, it is most useful when generating new binaries, since the generated
binary may be used later in a different login session where the environment may have changed.
|
||||
|
||||
For example, if the following sequence of commands is run:
|
||||
|
||||
```shell
|
||||
omnitrace-instrument -o ./foo.inst -- ./foo
|
||||
export OMNITRACE_USE_SAMPLING=ON
|
||||
export OMNITRACE_SAMPLING_FREQ=5
|
||||
omnitrace-run -- ./foo.inst
|
||||
```
|
||||
|
||||
These configuration settings will not be preserved in another session, whereas:
|
||||
|
||||
```shell
|
||||
omnitrace-instrument -o ./foo.samp --env OMNITRACE_USE_SAMPLING=ON OMNITRACE_SAMPLING_FREQ=5 -- ./foo
|
||||
```
|
||||
|
||||
will preserve those environment variables:
|
||||
|
||||
```shell
|
||||
# will sample 5x per second
|
||||
omnitrace-run -- ./foo.samp
|
||||
```
|
||||
|
||||
while still allowing the subsequent session to override those defaults:
|
||||
|
||||
```shell
|
||||
# will sample 100x per second
|
||||
export OMNITRACE_SAMPLING_FREQ=100
|
||||
omnitrace-run -- ./foo.samp
|
||||
```
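
The precedence described above (values embedded with `--env` act as defaults, and the environment of a later session overrides them) can be modelled in a few lines of Python. This is only an illustrative sketch of the behaviour, not how omnitrace implements it; `resolve_settings` and its arguments are hypothetical names.

```python
import os


def resolve_settings(embedded_defaults, environ=None):
    """Merge embedded defaults with the runtime environment.

    Hypothetical helper: a value found in the environment wins over
    the default that was embedded at instrumentation time.
    """
    environ = os.environ if environ is None else environ
    resolved = dict(embedded_defaults)
    for name in embedded_defaults:
        if name in environ:
            resolved[name] = environ[name]
    return resolved


# Defaults embedded via --env in the example above
embedded = {"OMNITRACE_USE_SAMPLING": "ON", "OMNITRACE_SAMPLING_FREQ": "5"}

# A later session that exports OMNITRACE_SAMPLING_FREQ=100
later_session = {"OMNITRACE_SAMPLING_FREQ": "100"}

print(resolve_settings(embedded, environ={}))
print(resolve_settings(embedded, environ=later_session))
```

With an empty environment the embedded frequency of 5 is kept; with the later session's environment it is replaced by 100, while `OMNITRACE_USE_SAMPLING` keeps its embedded value.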
|
||||
|
||||
### Troubleshooting
|
||||
|
||||
#### Checking for RPATH
|
||||
|
||||
If `ldd ./foo.inst` from the [Binary Rewriting a Library Example](#binary-rewriting-a-library-example) section still returns `/usr/local/lib/libfoo.so.2`, your executable may have an rpath encoded in the binary.
This ELF entry causes the dynamic linker to ignore `LD_LIBRARY_PATH` if it finds a `libfoo.so.2` in the rpath.
You can use the `objdump` tool to check for it:
|
||||
|
||||
```shell
|
||||
objdump -p <exe-or-library> | grep -E 'RPATH|RUNPATH'
|
||||
```
|
||||
|
||||
If this produces output, e.g.:
|
||||
|
||||
```shell
|
||||
RUNPATH $ORIGIN:$ORIGIN/../lib
|
||||
```
|
||||
|
||||
You will have to remove or modify the rpath in order to get `foo.inst` to resolve to the instrumented `libfoo.so.2`.
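
When many binaries need checking, the same query can be scripted. The sketch below parses text in the format printed by `objdump -p` and extracts any `RPATH` or `RUNPATH` entries. The helper name `find_rpath_entries` and the sample text are illustrative assumptions; in practice the input would be captured from `objdump -p <exe-or-library>`.

```python
import re


def find_rpath_entries(objdump_output):
    """Return (tag, [paths]) pairs for RPATH/RUNPATH lines in `objdump -p` output."""
    entries = []
    for line in objdump_output.splitlines():
        match = re.match(r"\s*(RPATH|RUNPATH)\s+(\S+)", line)
        if match:
            tag, value = match.groups()
            # Entries are colon-separated search directories
            entries.append((tag, value.split(":")))
    return entries


# Illustrative excerpt of a dynamic section (not taken from a real binary)
sample = """
Dynamic Section:
  NEEDED               libfoo.so.2
  RUNPATH              $ORIGIN:$ORIGIN/../lib
"""

print(find_rpath_entries(sample))
```

An empty result means no rpath is encoded, so `LD_LIBRARY_PATH` should be honored.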
|
||||
|
||||
#### Modifying RPATH
|
||||
|
||||
> ***Requires `patchelf` package***
|
||||
|
||||
```shell
|
||||
patchelf --remove-rpath <exe-or-library>
|
||||
patchelf --set-rpath '/home/user' <exe-or-library>
|
||||
```
|
||||
@@ -1,35 +0,0 @@
|
||||
@ECHO OFF
|
||||
|
||||
pushd %~dp0
|
||||
|
||||
REM Command file for Sphinx documentation
|
||||
|
||||
if "%SPHINXBUILD%" == "" (
|
||||
set SPHINXBUILD=sphinx-build
|
||||
)
|
||||
set SOURCEDIR=.
|
||||
set BUILDDIR=_build
|
||||
|
||||
if "%1" == "" goto help
|
||||
|
||||
%SPHINXBUILD% >NUL 2>NUL
|
||||
if errorlevel 9009 (
|
||||
echo.
|
||||
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
|
||||
echo.installed, then set the SPHINXBUILD environment variable to point
|
||||
echo.to the full path of the 'sphinx-build' executable. Alternatively you
|
||||
echo.may add the Sphinx directory to PATH.
|
||||
echo.
|
||||
echo.If you don't have Sphinx installed, grab it from
|
||||
echo.http://sphinx-doc.org/
|
||||
exit /b 1
|
||||
)
|
||||
|
||||
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
|
||||
goto end
|
||||
|
||||
:help
|
||||
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
|
||||
|
||||
:end
|
||||
popd
|
||||
@@ -1,888 +0,0 @@
|
||||
# OmniTrace Output
|
||||
|
||||
```eval_rst
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 3
|
||||
```
|
||||
|
||||
## Overview
|
||||
|
||||
The general output form of omnitrace is `<OUTPUT_PATH>[/<TIMESTAMP>]/[<PREFIX>]<DATA_NAME>[-<OUTPUT_SUFFIX>].<EXT>`.
|
||||
|
||||
E.g. with the base configuration:
|
||||
|
||||
```shell
|
||||
export OMNITRACE_OUTPUT_PATH=omnitrace-example-output
|
||||
export OMNITRACE_TIME_OUTPUT=OFF
|
||||
export OMNITRACE_USE_PID=OFF
|
||||
export OMNITRACE_PROFILE=ON
|
||||
export OMNITRACE_TRACE=ON
|
||||
```
|
||||
|
||||
```shell
|
||||
$ omnitrace-instrument -- ./foo
|
||||
...
|
||||
[omnitrace] Outputting 'omnitrace-example-output/perfetto-trace.proto'...
|
||||
|
||||
[omnitrace] Outputting 'omnitrace-example-output/wall-clock.txt'...
|
||||
[omnitrace] Outputting 'omnitrace-example-output/wall-clock.json'...
|
||||
```
|
||||
|
||||
If we enable the `OMNITRACE_USE_PID` option and our non-MPI executable runs with a PID of 63453, the PID is appended to each output file name:
|
||||
|
||||
```shell
|
||||
$ export OMNITRACE_USE_PID=ON
|
||||
$ omnitrace-instrument -- ./foo
|
||||
...
|
||||
[omnitrace] Outputting 'omnitrace-example-output/perfetto-trace-63453.proto'...
|
||||
|
||||
[omnitrace] Outputting 'omnitrace-example-output/wall-clock-63453.txt'...
|
||||
[omnitrace] Outputting 'omnitrace-example-output/wall-clock-63453.json'...
|
||||
```
|
||||
|
||||
If we enable `OMNITRACE_TIME_OUTPUT`, then the output of a job started on January 31, 2022 at 12:30 PM is placed in a timestamped folder:
|
||||
|
||||
```shell
|
||||
$ export OMNITRACE_TIME_OUTPUT=ON
|
||||
$ omnitrace-instrument -- ./foo
|
||||
...
|
||||
[omnitrace] Outputting 'omnitrace-example-output/2022-01-31_12.30_PM/perfetto-trace-63453.proto'...
|
||||
|
||||
[omnitrace] Outputting 'omnitrace-example-output/2022-01-31_12.30_PM/wall-clock-63453.txt'...
|
||||
[omnitrace] Outputting 'omnitrace-example-output/2022-01-31_12.30_PM/wall-clock-63453.json'...
|
||||
```
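
The general output form can be reproduced with a small Python sketch. This is an illustration of the naming pattern only, not omnitrace code: `output_filename` is a hypothetical helper, and the default `time_format` is an assumption chosen to match the timestamp shown in the example above.

```python
from datetime import datetime


def output_filename(output_path, data_name, ext, prefix="", suffix="",
                    launch_time=None, time_format="%Y-%m-%d_%I.%M_%p"):
    """Build <OUTPUT_PATH>[/<TIMESTAMP>]/[<PREFIX>]<DATA_NAME>[-<OUTPUT_SUFFIX>].<EXT>."""
    parts = [output_path]
    if launch_time is not None:
        # Timestamped folder, as with OMNITRACE_TIME_OUTPUT=ON
        parts.append(launch_time.strftime(time_format))
    name = f"{prefix}{data_name}"
    if suffix:
        # e.g. the PID when OMNITRACE_USE_PID=ON
        name += f"-{suffix}"
    parts.append(f"{name}.{ext}")
    return "/".join(parts)


base = output_filename("omnitrace-example-output", "perfetto-trace", "proto")
with_pid = output_filename("omnitrace-example-output", "perfetto-trace", "proto",
                           suffix="63453")
with_time = output_filename("omnitrace-example-output", "perfetto-trace", "proto",
                            suffix="63453",
                            launch_time=datetime(2022, 1, 31, 12, 30))
print(base)
print(with_pid)
print(with_time)
```

The three printed paths correspond to the three perfetto file names shown in the console examples above.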
|
||||
|
||||
## Metadata
|
||||
|
||||
[OmniTrace](https://github.com/ROCm/omnitrace) will output a metadata.json file. This metadata file will contain
|
||||
information about the settings, environment variables, output files, and info about the system and the run:
|
||||
|
||||
- Hardware cache sizes
|
||||
- Physical CPUs
|
||||
- Hardware concurrency
|
||||
- CPU model, frequency, vendor, and features
|
||||
- Launch date and time
|
||||
- Memory maps (e.g. shared libraries)
|
||||
- Output files
|
||||
- Environment Variables
|
||||
- Configuration Settings
|
||||
|
||||
### Metadata JSON Sample
|
||||
|
||||
```json
|
||||
{
|
||||
"omnitrace": {
|
||||
"metadata": {
|
||||
"info": {
|
||||
"HW_L1_CACHE_SIZE": 32768,
|
||||
"HW_L2_CACHE_SIZE": 524288,
|
||||
"HW_L3_CACHE_SIZE": 16777216,
|
||||
"HW_PHYSICAL_CPU": 12,
|
||||
"HW_CONCURRENCY": 24,
|
||||
"LAUNCH_TIME": "02:04",
|
||||
"LAUNCH_DATE": "05/08/22",
|
||||
"TIMEMORY_GIT_REVISION": "52e7034fd419ff296506cdef43084f6071dbaba1",
|
||||
"TIMEMORY_VERSION": "3.3.0rc4",
|
||||
"TIMEMORY_API": "tim::project::timemory",
|
||||
"TIMEMORY_GIT_DESCRIBE": "v3.2.0-263-g52e7034f",
|
||||
"PWD": "/home/jrmadsen/devel/c++/AARInternal/hosttrace-dyninst/build-vscode",
|
||||
"USER": "jrmadsen",
|
||||
"HOME": "/home/jrmadsen",
|
||||
"SHELL": "/bin/bash",
|
||||
"CPU_MODEL": "AMD Ryzen Threadripper PRO 3945WX 12-Cores",
|
||||
"CPU_FREQUENCY": 2400,
|
||||
"CPU_VENDOR": "AuthenticAMD",
|
||||
"CPU_FEATURES": [
|
||||
"fpu",
|
||||
"msr",
|
||||
"sse",
|
||||
"sse2",
|
||||
"constant_tsc",
|
||||
"ssse3",
|
||||
"fma",
|
||||
"sse4_1",
|
||||
"sse4_2",
|
||||
"popcnt",
|
||||
"avx2",
|
||||
"... etc. ..."
|
||||
],
|
||||
"memory_maps": [
|
||||
{
|
||||
"end_address": "7f4013797000",
|
||||
"start_address": "7f4012e58000",
|
||||
"pathname": "/opt/rocm-5.0.0/hip/lib/libamdhip64.so.5.0.50000",
|
||||
"offset": "34a000",
|
||||
"device": "103:05",
|
||||
"inode": 4331165,
|
||||
"permissions": "rw-p"
|
||||
},
|
||||
{
|
||||
"end_address": "7f4013902000",
|
||||
"start_address": "7f4013901000",
|
||||
"pathname": "/usr/lib/x86_64-linux-gnu/libm-2.31.so",
|
||||
"offset": "14d000",
|
||||
"device": "103:05",
|
||||
"inode": 42078854,
|
||||
"permissions": "rwxp"
|
||||
},
|
||||
{
|
||||
"end_address": "7f4013919000",
|
||||
"start_address": "7f4013908000",
|
||||
"pathname": "/usr/lib/x86_64-linux-gnu/libpthread-2.31.so",
|
||||
"offset": "6000",
|
||||
"device": "103:05",
|
||||
"inode": 42078874,
|
||||
"permissions": "r-xp"
|
||||
},
|
||||
{
|
||||
"...": "etc."
|
||||
}
|
||||
],
|
||||
"memory_maps_files": [
|
||||
"/opt/rocm-5.0.0/hip/lib/libamdhip64.so.5.0.50000",
|
||||
"/opt/rocm-5.0.0/hsa-amd-aqlprofile/lib/libhsa-amd-aqlprofile64.so.1.0.50000",
|
||||
"/opt/rocm-5.0.0/lib/libamd_comgr.so.2.4.50000",
|
||||
"/opt/rocm-5.0.0/lib/libhsa-runtime64.so.1.5.50000",
|
||||
"/opt/rocm-5.0.0/rocm_smi/lib/librocm_smi64.so.5.0.50000",
|
||||
"/opt/rocm-5.0.0/roctracer/lib/libroctracer64.so.1.0.50000",
|
||||
"/usr/lib/x86_64-linux-gnu/ld-2.31.so",
|
||||
"/usr/lib/x86_64-linux-gnu/libc-2.31.so",
|
||||
"/usr/lib/x86_64-linux-gnu/libdl-2.31.so",
|
||||
"... etc. ..."
|
||||
]
|
||||
},
|
||||
"output": {
|
||||
"text": [
|
||||
{
|
||||
"value": [
|
||||
"omnitrace-tests-output/parallel-overhead-binary-rewrite/roctracer.txt"
|
||||
],
|
||||
"key": "roctracer"
|
||||
},
|
||||
{
|
||||
"value": [
|
||||
"omnitrace-tests-output/parallel-overhead-binary-rewrite/wall_clock.txt"
|
||||
],
|
||||
"key": "wall_clock"
|
||||
}
|
||||
],
|
||||
"json": [
|
||||
{
|
||||
"value": [
|
||||
"omnitrace-tests-output/parallel-overhead-binary-rewrite/roctracer.json",
|
||||
"omnitrace-tests-output/parallel-overhead-binary-rewrite/roctracer.tree.json"
|
||||
],
|
||||
"key": "roctracer"
|
||||
},
|
||||
{
|
||||
"value": [
|
||||
"omnitrace-tests-output/parallel-overhead-binary-rewrite/wall_clock.json",
|
||||
"omnitrace-tests-output/parallel-overhead-binary-rewrite/wall_clock.tree.json"
|
||||
],
|
||||
"key": "wall_clock"
|
||||
}
|
||||
]
|
||||
},
|
||||
"environment": [
|
||||
{
|
||||
"value": "/home/jrmadsen",
|
||||
"key": "HOME"
|
||||
},
|
||||
{
|
||||
"value": "/bin/bash",
|
||||
"key": "SHELL"
|
||||
},
|
||||
{
|
||||
"value": "jrmadsen",
|
||||
"key": "USER"
|
||||
},
|
||||
{
|
||||
"value": "true",
|
||||
"key": "... etc. ..."
|
||||
}
|
||||
],
|
||||
"settings": {
|
||||
"OMNITRACE_JSON_OUTPUT": {
|
||||
"count": -1,
|
||||
"environ_updated": false,
|
||||
"name": "json_output",
|
||||
"data_type": "bool",
|
||||
"initial": true,
|
||||
"enabled": true,
|
||||
"value": true,
|
||||
"max_count": 1,
|
||||
"cmdline": [
|
||||
"--omnitrace-json-output"
|
||||
],
|
||||
"environ": "OMNITRACE_JSON_OUTPUT",
|
||||
"config_updated": false,
|
||||
"categories": [
|
||||
"io",
|
||||
"json",
|
||||
"native"
|
||||
],
|
||||
"description": "Write json output files"
|
||||
},
|
||||
"... etc. ...": {
|
||||
"etc.": true
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
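
Because the metadata is plain JSON, it is straightforward to post-process. The sketch below collects the output files per component from the `output` section, assuming the layout shown in the sample above. `list_output_files` is a hypothetical helper and the inline document is a trimmed-down stand-in with shortened paths; a real run would load `metadata.json` from the output folder instead.

```python
import json


def list_output_files(metadata):
    """Map each component key to every output file recorded for it."""
    outputs = metadata["omnitrace"]["metadata"]["output"]
    files = {}
    for entries in outputs.values():  # one list per format, e.g. "text", "json"
        for entry in entries:
            files.setdefault(entry["key"], []).extend(entry["value"])
    return files


# Trimmed-down stand-in for a real metadata.json (paths are illustrative)
sample = json.loads("""
{
  "omnitrace": {
    "metadata": {
      "output": {
        "text": [{"value": ["out/wall_clock.txt"], "key": "wall_clock"}],
        "json": [{"value": ["out/wall_clock.json", "out/wall_clock.tree.json"],
                  "key": "wall_clock"}]
      }
    }
  }
}
""")

print(list_output_files(sample))
```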
|
||||
|
||||
## Configuring Output
|
||||
|
||||
### Core Configuration Settings
|
||||
|
||||
> ***See also: [Customizing OmniTrace Runtime](runtime.md)***
|
||||
|
||||
| Setting | Value | Description |
|
||||
|---------------------------|--------------------|---------------------------------------------------------------------------------------------------|
|
||||
| `OMNITRACE_OUTPUT_PATH` | Any valid path | Path to folder where output files should be placed |
|
||||
| `OMNITRACE_OUTPUT_PREFIX` | String | Useful for multiple runs with different arguments. See [Output Prefix Keys](#output-prefix-keys) |
|
||||
| `OMNITRACE_OUTPUT_FILE` | Any valid filepath | Specific location for perfetto output file. |
|
||||
| `OMNITRACE_TIME_OUTPUT` | Boolean | Place all output in a timestamped folder, timestamp format controlled via `OMNITRACE_TIME_FORMAT` |
|
||||
| `OMNITRACE_TIME_FORMAT` | String | See `strftime` man pages for valid identifiers |
|
||||
| `OMNITRACE_USE_PID` | Boolean | Append either the PID or the MPI rank to all output files (before the extension) |
|
||||
|
||||
#### Output Prefix Keys
|
||||
|
||||
Output prefix keys have many uses but are most useful when dealing with multiple profiling runs or large MPI jobs.
|
||||
Their inclusion in omnitrace stems from their introduction into timemory for [compile-time-perf](https://github.com/jrmadsen/compile-time-perf)
|
||||
which needed to be able to create different output files for a generic wrapper around compilation commands while still
|
||||
overwriting the output from the last time a file was compiled.
|
||||
|
||||
If you are ever doing scaling studies and specifying options via the command line, it is highly recommended to just
|
||||
use a common `OMNITRACE_OUTPUT_PATH`, disable `OMNITRACE_TIME_OUTPUT`,
|
||||
set `OMNITRACE_OUTPUT_PREFIX="%argt%-"` and let omnitrace cleanly organize the output.
|
||||
|
||||
| String | Encoding |
|
||||
|-----------------|--------------------------------------------------------------------------------------------------------------------|
|
||||
| `%argv%` | Entire command-line condensed into a single string |
|
||||
| `%argt%` | Similar to `%argv%` except basename of first command line argument |
|
||||
| `%args%` | All command line arguments condensed into a single string |
|
||||
| `%tag%` | Basename of first command line argument |
|
||||
| `%arg<N>%` | Command line argument at position `<N>` (zero indexed), e.g. `%arg0%` for first argument. |
|
||||
| `%argv_hash%` | MD5 sum of `%argv%` |
|
||||
| `%argt_hash%` | MD5 sum of `%argt%` |
|
||||
| `%args_hash%` | MD5 sum of `%args%` |
|
||||
| `%tag_hash%` | MD5 sum of `%tag%` |
|
||||
| `%arg<N>_hash%` | MD5 sum of `%arg<N>%` |
|
||||
| `%pid%` | Process identifier (i.e. `getpid()`) |
|
||||
| `%ppid%` | Parent process identifier (i.e. `getppid()`) |
|
||||
| `%pgid%` | Process group identifier (i.e. `getpgid(getpid())`) |
|
||||
| `%psid%` | Process session identifier (i.e. `getsid(getpid())`) |
|
||||
| `%psize%` | Number of sibling processes (from reading `/proc/<PPID>/task/<PPID>/children`) |
|
||||
| `%job%` | Value of `SLURM_JOB_ID` environment variable if exists, else `0` |
|
||||
| `%rank%` | Value of `SLURM_PROCID` environment variable if exists, else `MPI_Comm_rank` (or `0` non-mpi) |
|
||||
| `%size%` | `MPI_Comm_size` or `1` if non-mpi |
|
||||
| `%nid%` | `%rank%` if possible, otherwise `%pid%` |
|
||||
| `%launch_time%` | Launch date and time (uses `OMNITRACE_TIME_FORMAT`) |
|
||||
| `%env{NAME}%` | Value of environment variable `NAME` (i.e. `getenv(NAME)`) |
|
||||
| `%cfg{NAME}%` | Value of configuration variable `NAME` (e.g. `%cfg{OMNITRACE_SAMPLING_FREQ}%` would resolve to sampling frequency) |
|
||||
| `$env{NAME}` | Alternative syntax to `%env{NAME}%` |
|
||||
| `$cfg{NAME}` | Alternative syntax to `%cfg{NAME}%` |
|
||||
| `%m` | Shorthand for `%argt_hash%` |
|
||||
| `%p` | Shorthand for `%pid%` |
|
||||
| `%j` | Shorthand for `%job%` |
|
||||
| `%r` | Shorthand for `%rank%` |
|
||||
| `%s` | Shorthand for `%size%` |
|
||||
|
||||
> ***Any output prefix key which contains a `/` will have the `/` characters***
> ***replaced with `_` and any leading underscores will be stripped, e.g. if `%arg0%` is `/usr/bin/foo`, this***
> ***will translate to `usr_bin_foo`. Additionally, any `%arg<N>%` keys which do not have a command line argument***
> ***at position `<N>` will be ignored.***
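
The sanitizing rule in the note above can be sketched in Python as follows. Both helper names are hypothetical, and the handling of a missing `%arg<N>%` (expanding it to nothing) is an assumed reading of "ignored", not confirmed omnitrace behaviour.

```python
import re


def sanitize_key_value(value):
    """Replace '/' with '_' and strip any leading underscores."""
    return value.replace("/", "_").lstrip("_")


def expand_arg_keys(prefix, argv):
    """Expand %arg<N>% keys in an output prefix using the command line `argv`."""
    def substitute(match):
        index = int(match.group(1))
        if index < len(argv):
            return sanitize_key_value(argv[index])
        return ""  # assumption: a key without a matching argument expands to nothing

    return re.sub(r"%arg(\d+)%", substitute, prefix)


print(sanitize_key_value("/usr/bin/foo"))
print(expand_arg_keys("%arg0%-%arg1%-", ["/usr/bin/foo", "10"]))
```

The first call prints `usr_bin_foo`, matching the example in the note; the second prints `usr_bin_foo-10-`.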
|
||||
|
||||
## Perfetto Output
|
||||
|
||||
Use the `OMNITRACE_OUTPUT_FILE` setting to specify a specific location for the perfetto output. If this is an absolute path, then `OMNITRACE_OUTPUT_PATH` and the other
output path settings will be ignored. Visit [ui.perfetto.dev](https://ui.perfetto.dev) and open this file.
|
||||
|
||||

|
||||
|
||||

|
||||
|
||||

|
||||
|
||||

|
||||
|
||||
## Timemory Output
|
||||
|
||||
Use `omnitrace-avail --components --filename` to view the base filename for each component. E.g.
|
||||
|
||||
```shell
|
||||
$ omnitrace-avail wall_clock -C -f
|
||||
|---------------------------------|---------------|------------------------|
|
||||
| COMPONENT | AVAILABLE | FILENAME |
|
||||
|---------------------------------|---------------|------------------------|
|
||||
| wall_clock | true | wall_clock |
|
||||
| sampling_wall_clock | true | sampling_wall_clock |
|
||||
|---------------------------------|---------------|------------------------|
|
||||
```
|
||||
|
||||
When `OMNITRACE_COLLAPSE_THREADS=ON` and/or `OMNITRACE_COLLAPSE_PROCESSES=ON` (only valid with full MPI support) is set, the timemory output
will combine the per-thread and/or per-rank data which have identical call-stacks.
|
||||
|
||||
The `OMNITRACE_FLAT_PROFILE` setting will remove all call stack hierarchy. Using `OMNITRACE_FLAT_PROFILE=ON` in combination
with `OMNITRACE_COLLAPSE_THREADS=ON` is a useful configuration for identifying min/max measurements regardless of calling context.
The `OMNITRACE_TIMELINE_PROFILE` setting (with `OMNITRACE_FLAT_PROFILE=OFF`) will effectively generate data similar to what can be found
in perfetto. Enabling timeline and flat profiling will effectively generate data similar to `strace`. However, while timemory in general
requires significantly less memory than perfetto, this is not the case in timeline mode, so activate this setting with caution.
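
To make the effect of combining a flat profile with collapsed threads concrete, here is a conceptual Python sketch. It is not timemory's implementation: the record layout `(thread_id, call_path, seconds)` and the helper name are assumptions made for illustration. Each measurement is keyed only by its leaf function, so the thread and the calling context no longer separate entries.

```python
from collections import defaultdict


def flat_collapsed_profile(records):
    """Aggregate (thread_id, call_path, seconds) records by leaf function name."""
    stats = defaultdict(
        lambda: {"count": 0, "sum": 0.0, "min": float("inf"), "max": 0.0}
    )
    for _thread_id, call_path, seconds in records:
        entry = stats[call_path[-1]]  # drop the hierarchy: keep only the leaf
        entry["count"] += 1
        entry["sum"] += seconds
        entry["min"] = min(entry["min"], seconds)
        entry["max"] = max(entry["max"], seconds)
    return dict(stats)


# Hypothetical measurements from two threads with different call stacks
records = [
    (0, ("main", "conj_grad"), 0.25),
    (1, ("main", "worker", "conj_grad"), 0.5),
    (0, ("main",), 2.0),
]

profile = flat_collapsed_profile(records)
print(profile["conj_grad"])
```

Both `conj_grad` measurements land in a single entry, which is what makes the min/max comparison independent of where the function was called from.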
|
||||
|
||||
### Timemory Text Output
|
||||
|
||||
> ***Hint: the generation of text output is configurable via `OMNITRACE_TEXT_OUTPUT`***
|
||||
|
||||
Timemory text output files are meant for human consumption (use JSON formats for analysis)
and as such, some fields such as the `LABEL` fields may be truncated for readability.
The truncation width can be changed via the `OMNITRACE_MAX_WIDTH` setting.
|
||||
|
||||
#### Timemory Text Output Example
|
||||
|
||||
In the example below, the `NN` field in `|NN>>>` is the thread ID. If MPI support is enabled, this will be `|MM|NN>>>` and `MM` will be the rank.
If `OMNITRACE_COLLAPSE_THREADS=ON` and `OMNITRACE_COLLAPSE_PROCESSES=ON`, neither the `MM` nor the `NN` will be present unless the
component explicitly sets type-traits which specify that the data is only relevant per-thread or per-process, e.g. the `thread_cpu_clock` component.
|
||||
|
||||
```console
|
||||
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| REAL-CLOCK TIMER (I.E. WALL-CLOCK TIMER) |
|
||||
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| LABEL | COUNT | DEPTH | METRIC | UNITS | SUM | MEAN | MIN | MAX | VAR | STDDEV | % SELF |
|
||||
|--------------------------------------------------------------|--------|--------|------------|--------|-----------|-----------|-----------|-----------|----------|----------|--------|
|
||||
| |00>>> main | 1 | 0 | wall_clock | sec | 13.360265 | 13.360265 | 13.360265 | 13.360265 | 0.000000 | 0.000000 | 18.2 |
|
||||
| |00>>> |_ompt_thread_initial | 1 | 1 | wall_clock | sec | 10.924161 | 10.924161 | 10.924161 | 10.924161 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |00>>> |_ompt_implicit_task | 1 | 2 | wall_clock | sec | 10.923050 | 10.923050 | 10.923050 | 10.923050 | 0.000000 | 0.000000 | 0.1 |
|
||||
| |00>>> |_ompt_parallel [parallelism=12] | 1 | 3 | wall_clock | sec | 10.915026 | 10.915026 | 10.915026 | 10.915026 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |00>>> |_ompt_implicit_task | 1 | 4 | wall_clock | sec | 10.647951 | 10.647951 | 10.647951 | 10.647951 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |00>>> |_ompt_work_loop | 156 | 5 | wall_clock | sec | 0.000812 | 0.000005 | 0.000001 | 0.000212 | 0.000000 | 0.000018 | 100.0 |
|
||||
| |00>>> |_ompt_work_single_executor | 40 | 5 | wall_clock | sec | 0.000016 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |00>>> |_ompt_sync_region_barrier_implicit | 308 | 5 | wall_clock | sec | 0.000629 | 0.000002 | 0.000001 | 0.000017 | 0.000000 | 0.000002 | 100.0 |
|
||||
| |00>>> |_conj_grad | 76 | 5 | wall_clock | sec | 10.641165 | 0.140015 | 0.131894 | 0.155099 | 0.000017 | 0.004080 | 1.0 |
|
||||
| |00>>> |_ompt_work_single_executor | 803 | 6 | wall_clock | sec | 0.000292 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |00>>> |_ompt_work_loop | 7904 | 6 | wall_clock | sec | 7.420265 | 0.000939 | 0.000005 | 0.006974 | 0.000003 | 0.001613 | 100.0 |
|
||||
| |00>>> |_ompt_sync_region_barrier_implicit | 6004 | 6 | wall_clock | sec | 0.283160 | 0.000047 | 0.000001 | 0.004087 | 0.000000 | 0.000303 | 100.0 |
|
||||
| |00>>> |_ompt_sync_region_barrier_implementation | 3952 | 6 | wall_clock | sec | 2.829252 | 0.000716 | 0.000007 | 0.009005 | 0.000001 | 0.000985 | 99.7 |
|
||||
| |00>>> |_ompt_sync_region_reduction | 15808 | 7 | wall_clock | sec | 0.009142 | 0.000001 | 0.000000 | 0.000007 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |00>>> |_ompt_work_single_other | 1249 | 6 | wall_clock | sec | 0.000270 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |00>>> |_ompt_work_single_other | 114 | 5 | wall_clock | sec | 0.000024 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |00>>> |_ompt_sync_region_barrier_implementation | 76 | 5 | wall_clock | sec | 0.000876 | 0.000012 | 0.000008 | 0.000025 | 0.000000 | 0.000003 | 84.4 |
|
||||
| |00>>> |_ompt_sync_region_reduction | 304 | 6 | wall_clock | sec | 0.000136 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |00>>> |_ompt_master | 226 | 5 | wall_clock | sec | 0.001978 | 0.000009 | 0.000000 | 0.000038 | 0.000000 | 0.000012 | 100.0 |
|
||||
| |11>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.656145 | 10.656145 | 10.656145 | 10.656145 | 0.000000 | 0.000000 | 0.1 |
|
||||
| |11>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649183 | 10.649183 | 10.649183 | 10.649183 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |11>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000852 | 0.000005 | 0.000002 | 0.000230 | 0.000000 | 0.000019 | 100.0 |
|
||||
| |11>>> |_ompt_work_single_other | 149 | 6 | wall_clock | sec | 0.000035 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |11>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004135 | 0.000013 | 0.000001 | 0.001233 | 0.000000 | 0.000070 | 100.0 |
|
||||
| |11>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641302 | 0.140017 | 0.131896 | 0.155102 | 0.000017 | 0.004080 | 0.6 |
|
||||
| |11>>> |_ompt_work_single_other | 2023 | 7 | wall_clock | sec | 0.000458 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |11>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 8.253555 | 0.001044 | 0.000005 | 0.008021 | 0.000003 | 0.001790 | 100.0 |
|
||||
| |11>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.263840 | 0.000044 | 0.000001 | 0.004087 | 0.000000 | 0.000297 | 100.0 |
|
||||
| |11>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.059823 | 0.000521 | 0.000007 | 0.009508 | 0.000001 | 0.000863 | 100.0 |
|
||||
| |11>>> |_ompt_work_single_executor | 29 | 7 | wall_clock | sec | 0.000011 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |11>>> |_ompt_work_single_executor | 5 | 6 | wall_clock | sec | 0.000002 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |11>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000975 | 0.000013 | 0.000008 | 0.000024 | 0.000000 | 0.000003 | 100.0 |
|
||||
| |10>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.681664 | 10.681664 | 10.681664 | 10.681664 | 0.000000 | 0.000000 | 0.3 |
|
||||
| |10>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649158 | 10.649158 | 10.649158 | 10.649158 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |10>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000863 | 0.000006 | 0.000002 | 0.000231 | 0.000000 | 0.000019 | 100.0 |
|
||||
| |10>>> |_ompt_work_single_other | 140 | 6 | wall_clock | sec | 0.000037 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |10>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004149 | 0.000013 | 0.000001 | 0.001221 | 0.000000 | 0.000070 | 100.0 |
|
||||
| |10>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641288 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 0.7 |
|
||||
| |10>>> |_ompt_work_single_other | 1883 | 7 | wall_clock | sec | 0.000487 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |10>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 8.174545 | 0.001034 | 0.000005 | 0.006899 | 0.000003 | 0.001766 | 100.0 |
|
||||
| |10>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.268808 | 0.000045 | 0.000001 | 0.004087 | 0.000000 | 0.000299 | 100.0 |
|
||||
| |10>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.126988 | 0.000538 | 0.000007 | 0.009843 | 0.000001 | 0.000872 | 99.9 |
|
||||
| |10>>> |_ompt_sync_region_reduction | 3952 | 8 | wall_clock | sec | 0.002574 | 0.000001 | 0.000000 | 0.000014 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |10>>> |_ompt_work_single_executor | 169 | 7 | wall_clock | sec | 0.000072 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |10>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000954 | 0.000013 | 0.000009 | 0.000023 | 0.000000 | 0.000003 | 95.9 |
|
||||
| |10>>> |_ompt_sync_region_reduction | 76 | 7 | wall_clock | sec | 0.000039 | 0.000001 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |10>>> |_ompt_work_single_executor | 14 | 6 | wall_clock | sec | 0.000006 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |09>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.686552 | 10.686552 | 10.686552 | 10.686552 | 0.000000 | 0.000000 | 0.3 |
|
||||
| |09>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649151 | 10.649151 | 10.649151 | 10.649151 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |09>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000880 | 0.000006 | 0.000002 | 0.000258 | 0.000000 | 0.000021 | 100.0 |
|
||||
| |09>>> |_ompt_work_single_other | 148 | 6 | wall_clock | sec | 0.000034 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |09>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004129 | 0.000013 | 0.000001 | 0.001210 | 0.000000 | 0.000069 | 100.0 |
|
||||
| |09>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641308 | 0.140017 | 0.131895 | 0.155102 | 0.000017 | 0.004080 | 0.7 |
|
||||
| |09>>> |_ompt_work_single_other | 2043 | 7 | wall_clock | sec | 0.000473 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |09>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 7.977001 | 0.001009 | 0.000005 | 0.007325 | 0.000003 | 0.001732 | 100.0 |
|
||||
| |09>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.242996 | 0.000040 | 0.000001 | 0.004087 | 0.000000 | 0.000284 | 100.0 |
|
||||
| |09>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.350895 | 0.000595 | 0.000007 | 0.008689 | 0.000001 | 0.000926 | 100.0 |
|
||||
| |09>>> |_ompt_work_single_executor | 9 | 7 | wall_clock | sec | 0.000004 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |09>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000973 | 0.000013 | 0.000008 | 0.000025 | 0.000000 | 0.000003 | 100.0 |
|
||||
| |09>>> |_ompt_work_single_executor | 6 | 6 | wall_clock | sec | 0.000002 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |08>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.721622 | 10.721622 | 10.721622 | 10.721622 | 0.000000 | 0.000000 | 0.7 |
|
||||
| |08>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649135 | 10.649135 | 10.649135 | 10.649135 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |08>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000839 | 0.000005 | 0.000001 | 0.000231 | 0.000000 | 0.000019 | 100.0 |
|
||||
| |08>>> |_ompt_work_single_other | 141 | 6 | wall_clock | sec | 0.000030 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |08>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004114 | 0.000013 | 0.000001 | 0.001198 | 0.000000 | 0.000069 | 100.0 |
|
||||
| |08>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641294 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 0.6 |
|
||||
| |08>>> |_ompt_work_single_other | 1742 | 7 | wall_clock | sec | 0.000392 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |08>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 8.306388 | 0.001051 | 0.000005 | 0.007886 | 0.000003 | 0.001795 | 100.0 |
|
||||
| |08>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.274358 | 0.000046 | 0.000001 | 0.004090 | 0.000000 | 0.000302 | 100.0 |
|
||||
| |08>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 1.991251 | 0.000504 | 0.000007 | 0.008694 | 0.000001 | 0.000844 | 99.8 |
|
||||
| |08>>> |_ompt_sync_region_reduction | 7904 | 8 | wall_clock | sec | 0.003816 | 0.000000 | 0.000000 | 0.000017 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |08>>> |_ompt_work_single_executor | 310 | 7 | wall_clock | sec | 0.000112 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |08>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000955 | 0.000013 | 0.000009 | 0.000026 | 0.000000 | 0.000003 | 93.7 |
|
||||
| |08>>> |_ompt_sync_region_reduction | 152 | 7 | wall_clock | sec | 0.000060 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |08>>> |_ompt_work_single_executor | 13 | 6 | wall_clock | sec | 0.000005 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |07>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.747282 | 10.747282 | 10.747282 | 10.747282 | 0.000000 | 0.000000 | 0.9 |
|
||||
| |07>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649093 | 10.649093 | 10.649093 | 10.649093 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |07>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000923 | 0.000006 | 0.000002 | 0.000231 | 0.000000 | 0.000019 | 100.0 |
|
||||
| |07>>> |_ompt_work_single_other | 152 | 6 | wall_clock | sec | 0.000048 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |07>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.003981 | 0.000013 | 0.000001 | 0.001186 | 0.000000 | 0.000068 | 100.0 |
|
||||
| |07>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641295 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 0.7 |
|
||||
| |07>>> |_ompt_work_single_other | 2043 | 7 | wall_clock | sec | 0.000648 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |07>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 7.978811 | 0.001009 | 0.000005 | 0.006728 | 0.000003 | 0.001732 | 100.0 |
|
||||
| |07>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.199939 | 0.000033 | 0.000001 | 0.004086 | 0.000000 | 0.000255 | 100.0 |
|
||||
| |07>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.385843 | 0.000604 | 0.000009 | 0.009039 | 0.000001 | 0.000938 | 100.0 |
|
||||
| |07>>> |_ompt_work_single_executor | 9 | 7 | wall_clock | sec | 0.000004 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |07>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000905 | 0.000012 | 0.000010 | 0.000025 | 0.000000 | 0.000003 | 100.0 |
|
||||
| |07>>> |_ompt_work_single_executor | 2 | 6 | wall_clock | sec | 0.000001 | 0.000001 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |06>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.772278 | 10.772278 | 10.772278 | 10.772278 | 0.000000 | 0.000000 | 1.1 |
|
||||
| |06>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649092 | 10.649092 | 10.649092 | 10.649092 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |06>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000888 | 0.000006 | 0.000002 | 0.000236 | 0.000000 | 0.000020 | 100.0 |
|
||||
| |06>>> |_ompt_work_single_other | 153 | 6 | wall_clock | sec | 0.000037 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |06>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004090 | 0.000013 | 0.000001 | 0.001175 | 0.000000 | 0.000067 | 100.0 |
|
||||
| |06>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641317 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 0.8 |
|
||||
| |06>>> |_ompt_work_single_other | 2041 | 7 | wall_clock | sec | 0.000476 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |06>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 7.467961 | 0.000945 | 0.000005 | 0.010712 | 0.000003 | 0.001627 | 100.0 |
|
||||
| |06>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.250883 | 0.000042 | 0.000001 | 0.004087 | 0.000000 | 0.000285 | 100.0 |
|
||||
| |06>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.838733 | 0.000718 | 0.000009 | 0.009015 | 0.000001 | 0.001015 | 99.9 |
|
||||
| |06>>> |_ompt_sync_region_reduction | 3952 | 8 | wall_clock | sec | 0.003334 | 0.000001 | 0.000000 | 0.000025 | 0.000000 | 0.000001 | 100.0 |
|
||||
| |06>>> |_ompt_work_single_executor | 11 | 7 | wall_clock | sec | 0.000005 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |06>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000940 | 0.000012 | 0.000009 | 0.000025 | 0.000000 | 0.000003 | 95.4 |
|
||||
| |06>>> |_ompt_sync_region_reduction | 76 | 7 | wall_clock | sec | 0.000044 | 0.000001 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |06>>> |_ompt_work_single_executor | 1 | 6 | wall_clock | sec | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |05>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.797950 | 10.797950 | 10.797950 | 10.797950 | 0.000000 | 0.000000 | 1.4 |
|
||||
| |05>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649072 | 10.649072 | 10.649072 | 10.649072 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |05>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000879 | 0.000006 | 0.000001 | 0.000248 | 0.000000 | 0.000021 | 100.0 |
|
||||
| |05>>> |_ompt_work_single_other | 142 | 6 | wall_clock | sec | 0.000034 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |05>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004062 | 0.000013 | 0.000002 | 0.001163 | 0.000000 | 0.000067 | 100.0 |
|
||||
| |05>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641291 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 0.7 |
|
||||
| |05>>> |_ompt_work_single_other | 2038 | 7 | wall_clock | sec | 0.000500 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |05>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 8.279191 | 0.001047 | 0.000005 | 0.006596 | 0.000003 | 0.001792 | 100.0 |
|
||||
| |05>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.250939 | 0.000042 | 0.000001 | 0.004090 | 0.000000 | 0.000286 | 100.0 |
|
||||
| |05>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.039013 | 0.000516 | 0.000009 | 0.008689 | 0.000001 | 0.000855 | 100.0 |
|
||||
| |05>>> |_ompt_work_single_executor | 14 | 7 | wall_clock | sec | 0.000005 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |05>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000926 | 0.000012 | 0.000009 | 0.000023 | 0.000000 | 0.000003 | 100.0 |
|
||||
| |05>>> |_ompt_work_single_executor | 12 | 6 | wall_clock | sec | 0.000005 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |04>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.825935 | 10.825935 | 10.825935 | 10.825935 | 0.000000 | 0.000000 | 1.6 |
|
||||
| |04>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649068 | 10.649068 | 10.649068 | 10.649068 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |04>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000884 | 0.000006 | 0.000002 | 0.000245 | 0.000000 | 0.000020 | 100.0 |
|
||||
| |04>>> |_ompt_work_single_other | 150 | 6 | wall_clock | sec | 0.000034 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |04>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004069 | 0.000013 | 0.000001 | 0.001151 | 0.000000 | 0.000066 | 100.0 |
|
||||
| |04>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641300 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 1.1 |
|
||||
| |04>>> |_ompt_work_single_other | 2041 | 7 | wall_clock | sec | 0.000448 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |04>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 7.438393 | 0.000941 | 0.000005 | 0.007090 | 0.000003 | 0.001624 | 100.0 |
|
||||
| |04>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.270654 | 0.000045 | 0.000001 | 0.004090 | 0.000000 | 0.000295 | 100.0 |
|
||||
| |04>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.819165 | 0.000713 | 0.000009 | 0.008379 | 0.000001 | 0.001013 | 99.9 |
|
||||
| |04>>> |_ompt_sync_region_reduction | 7904 | 8 | wall_clock | sec | 0.003932 | 0.000000 | 0.000000 | 0.000015 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |04>>> |_ompt_work_single_executor | 11 | 7 | wall_clock | sec | 0.000005 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |04>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000936 | 0.000012 | 0.000009 | 0.000025 | 0.000000 | 0.000003 | 93.2 |
|
||||
| |04>>> |_ompt_sync_region_reduction | 152 | 7 | wall_clock | sec | 0.000064 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |04>>> |_ompt_work_single_executor | 4 | 6 | wall_clock | sec | 0.000001 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |03>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.849322 | 10.849322 | 10.849322 | 10.849322 | 0.000000 | 0.000000 | 1.8 |
|
||||
| |03>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649075 | 10.649075 | 10.649075 | 10.649075 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |03>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000861 | 0.000006 | 0.000002 | 0.000238 | 0.000000 | 0.000020 | 100.0 |
|
||||
| |03>>> |_ompt_work_single_other | 120 | 6 | wall_clock | sec | 0.000028 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |03>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.003993 | 0.000013 | 0.000001 | 0.001138 | 0.000000 | 0.000065 | 100.0 |
|
||||
| |03>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641302 | 0.140017 | 0.131896 | 0.155101 | 0.000017 | 0.004080 | 0.8 |
|
||||
| |03>>> |_ompt_work_single_other | 1756 | 7 | wall_clock | sec | 0.000426 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |03>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 8.005617 | 0.001013 | 0.000005 | 0.011500 | 0.000003 | 0.001741 | 100.0 |
|
||||
| |03>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.231485 | 0.000039 | 0.000001 | 0.004086 | 0.000000 | 0.000277 | 100.0 |
|
||||
| |03>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.320428 | 0.000587 | 0.000009 | 0.010868 | 0.000001 | 0.000912 | 100.0 |
|
||||
| |03>>> |_ompt_work_single_executor | 296 | 7 | wall_clock | sec | 0.000120 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |03>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000967 | 0.000013 | 0.000010 | 0.000023 | 0.000000 | 0.000003 | 100.0 |
|
||||
| |03>>> |_ompt_work_single_executor | 34 | 6 | wall_clock | sec | 0.000013 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |02>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.876387 | 10.876387 | 10.876387 | 10.876387 | 0.000000 | 0.000000 | 2.1 |
|
||||
| |02>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649050 | 10.649050 | 10.649050 | 10.649050 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |02>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000924 | 0.000006 | 0.000001 | 0.000241 | 0.000000 | 0.000020 | 100.0 |
|
||||
| |02>>> |_ompt_work_single_other | 139 | 6 | wall_clock | sec | 0.000040 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |02>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.003972 | 0.000013 | 0.000001 | 0.001127 | 0.000000 | 0.000064 | 100.0 |
|
||||
| |02>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641287 | 0.140017 | 0.131895 | 0.155101 | 0.000017 | 0.004080 | 0.7 |
|
||||
| |02>>> |_ompt_work_single_other | 1902 | 7 | wall_clock | sec | 0.000553 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |02>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 7.906688 | 0.001000 | 0.000005 | 0.007068 | 0.000003 | 0.001713 | 100.0 |
|
||||
| |02>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.261367 | 0.000044 | 0.000001 | 0.004088 | 0.000000 | 0.000295 | 100.0 |
|
||||
| |02>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.402362 | 0.000608 | 0.000009 | 0.010399 | 0.000001 | 0.000944 | 99.9 |
|
||||
| |02>>> |_ompt_sync_region_reduction | 3952 | 8 | wall_clock | sec | 0.002937 | 0.000001 | 0.000000 | 0.000021 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |02>>> |_ompt_work_single_executor | 150 | 7 | wall_clock | sec | 0.000073 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |02>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000895 | 0.000012 | 0.000009 | 0.000026 | 0.000000 | 0.000003 | 95.2 |
|
||||
| |02>>> |_ompt_sync_region_reduction | 76 | 7 | wall_clock | sec | 0.000043 | 0.000001 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |02>>> |_ompt_work_single_executor | 15 | 6 | wall_clock | sec | 0.000007 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |01>>> |_ompt_thread_worker | 1 | 4 | wall_clock | sec | 10.901650 | 10.901650 | 10.901650 | 10.901650 | 0.000000 | 0.000000 | 2.3 |
|
||||
| |01>>> |_ompt_implicit_task | 1 | 5 | wall_clock | sec | 10.649017 | 10.649017 | 10.649017 | 10.649017 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |01>>> |_ompt_work_loop | 156 | 6 | wall_clock | sec | 0.000863 | 0.000006 | 0.000001 | 0.000231 | 0.000000 | 0.000019 | 100.0 |
|
||||
| |01>>> |_ompt_work_single_other | 146 | 6 | wall_clock | sec | 0.000033 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |01>>> |_ompt_sync_region_barrier_implicit | 308 | 6 | wall_clock | sec | 0.004012 | 0.000013 | 0.000001 | 0.001115 | 0.000000 | 0.000064 | 100.0 |
|
||||
| |01>>> |_conj_grad | 76 | 6 | wall_clock | sec | 10.641316 | 0.140017 | 0.131895 | 0.155101 | 0.000017 | 0.004080 | 0.8 |
|
||||
| |01>>> |_ompt_work_single_other | 1811 | 7 | wall_clock | sec | 0.000403 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |01>>> |_ompt_work_loop | 7904 | 7 | wall_clock | sec | 7.410337 | 0.000938 | 0.000005 | 0.010556 | 0.000003 | 0.001610 | 100.0 |
|
||||
| |01>>> |_ompt_sync_region_barrier_implicit | 6004 | 7 | wall_clock | sec | 0.202494 | 0.000034 | 0.000001 | 0.003521 | 0.000000 | 0.000256 | 100.0 |
|
||||
| |01>>> |_ompt_sync_region_barrier_implementation | 3952 | 7 | wall_clock | sec | 2.943604 | 0.000745 | 0.000008 | 0.009033 | 0.000001 | 0.001024 | 100.0 |
|
||||
| |01>>> |_ompt_work_single_executor | 241 | 7 | wall_clock | sec | 0.000093 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |01>>> |_ompt_sync_region_barrier_implementation | 76 | 6 | wall_clock | sec | 0.000917 | 0.000012 | 0.000009 | 0.000026 | 0.000000 | 0.000003 | 100.0 |
|
||||
| |01>>> |_ompt_work_single_executor | 8 | 6 | wall_clock | sec | 0.000004 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |00>>> |_c_print_results | 1 | 2 | wall_clock | sec | 0.000049 | 0.000049 | 0.000049 | 0.000049 | 0.000000 | 0.000000 | 100.0 |
|
||||
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
```

### Timemory JSON Output

> ***Hint: the generation of flat JSON output is configurable via `OMNITRACE_JSON_OUTPUT`.***
> ***The generation of hierarchical JSON data is configurable via `OMNITRACE_TREE_OUTPUT`.***

Timemory represents the data within the JSON output in two forms: a flat structure and a hierarchical structure.
The flat JSON data mirrors the text files: the hierarchy is encoded by
the indentation of the `"prefix"` field and by the `"depth"` field.
The hierarchical JSON additionally records inclusive and exclusive values; however,
its structure must be processed recursively. This section of the JSON supports analysis
by [hatchet](https://github.com/hatchet/hatchet).
All the data entries for the flat structure are in a single JSON array,
which makes the flat format the easier one to post-process with a simple Python script.

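
As a minimal sketch of how the `"depth"` field encodes the hierarchy in the flat layout, the snippet below pairs each flat entry with its parent. It assumes the entries are listed depth-first, as they are in the sample data; the function name `attach_parents` and the trimmed sample list are illustrative and not part of timemory or omnitrace.

```python
def attach_parents(entries):
    """Pair each flat entry's prefix with the prefix of its parent.

    Assumption: entries are listed depth-first, so the parent of an entry
    is the most recent earlier entry with a smaller depth (None for roots).
    """
    stack = []  # ancestors of the entry currently being processed
    result = []
    for entry in entries:
        # Discard entries that are siblings or deeper; they cannot be ancestors.
        while stack and stack[-1]["depth"] >= entry["depth"]:
            stack.pop()
        parent = stack[-1]["prefix"] if stack else None
        result.append((entry["prefix"], parent))
        stack.append(entry)
    return result


# Trimmed, hand-written stand-in for ["timemory"]["wall_clock"]["ranks"][0]["graph"]
sample = [
    {"prefix": "|0>>> main", "depth": 0},
    {"prefix": "|0>>> |_read_input", "depth": 1},
    {"prefix": "|0>>> |_setcoeff", "depth": 1},
    {"prefix": "|0>>> |_ompt_thread_initial", "depth": 1},
    {"prefix": "|0>>> |_ompt_implicit_task", "depth": 2},
]

for prefix, parent in attach_parents(sample):
    print(f"{prefix!r} <- {parent!r}")
```

If the entries in a real output file are not ordered depth-first, this reconstruction does not apply and the hierarchical layout should be used instead.
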
#### Timemory JSON Output Sample

In the JSON below, the flat data starts at `["timemory"]["wall_clock"]["ranks"]`
and the hierarchical data starts at `["timemory"]["wall_clock"]["graph"]`.
For example, the name (prefix) of the Nth entry in the flat data layout is accessed via
`["timemory"]["wall_clock"]["ranks"][0]["graph"][<N>]["prefix"]`. When full MPI
support is enabled, the per-rank data in the flat layout is represented
as an entry in the "ranks" array; in the hierarchical data structure,
the per-rank data is represented as an entry in the "mpi" array (but "graph"
is used in lieu of "mpi" when full MPI support is not enabled).
In the hierarchical layout, all data for the process is a child of a (dummy)
root node (which has the name `unknown-hash=0`).

```json
{
    "timemory": {
        "wall_clock": {
            "properties": {
                "cereal_class_version": 0,
                "value": 78,
                "enum": "WALL_CLOCK",
                "id": "wall_clock",
                "ids": [
                    "real_clock",
                    "virtual_clock",
                    "wall_clock"
                ]
            },
            "type": "wall_clock",
            "description": "Real-clock timer (i.e. wall-clock timer)",
            "unit_value": 1000000000,
            "unit_repr": "sec",
            "thread_scope_only": false,
            "thread_count": 2,
            "mpi_size": 1,
            "upcxx_size": 1,
            "process_count": 1,
            "num_ranks": 1,
            "concurrency": 2,
            "ranks": [
                {
                    "rank": 0,
                    "graph_size": 112,
                    "graph": [
                        {
                            "hash": 17481650134347108265,
                            "prefix": "|0>>> main",
                            "depth": 0,
                            "entry": {
                                "cereal_class_version": 0,
                                "laps": 1,
                                "value": 894743517,
                                "accum": 894743517,
                                "repr_data": 0.894743517,
                                "repr_display": 0.894743517
                            },
                            "stats": {
                                "cereal_class_version": 0,
                                "sum": 0.894743517,
                                "count": 1,
                                "min": 0.894743517,
                                "max": 0.894743517,
                                "sqr": 0.8005659612135293,
                                "mean": 0.894743517,
                                "stddev": 0.0
                            },
                            "rolling_hash": 17481650134347108265
                        },
                        {
                            "hash": 3455444288293231339,
                            "prefix": "|0>>> |_read_input",
                            "depth": 1,
                            "entry": {
                                "laps": 1,
                                "value": 9808,
                                "accum": 9808,
                                "repr_data": 9.808e-06,
                                "repr_display": 9.808e-06
                            },
                            "stats": {
                                "sum": 9.808e-06,
                                "count": 1,
                                "min": 9.808e-06,
                                "max": 9.808e-06,
                                "sqr": 9.6196864e-11,
                                "mean": 9.808e-06,
                                "stddev": 0.0
                            },
                            "rolling_hash": 2490350348930787988
                        },
                        {
                            "hash": 8456966793631718807,
                            "prefix": "|0>>> |_setcoeff",
                            "depth": 1,
                            "entry": {
                                "laps": 1,
                                "value": 922,
                                "accum": 922,
                                "repr_data": 9.22e-07,
                                "repr_display": 9.22e-07
                            },
                            "stats": {
                                "sum": 9.22e-07,
                                "count": 1,
                                "min": 9.22e-07,
                                "max": 9.22e-07,
                                "sqr": 8.50084e-13,
                                "mean": 9.22e-07,
                                "stddev": 0.0
                            },
                            "rolling_hash": 7491872854269275456
                        },
                        {
                            "hash": 6107876127803219007,
                            "prefix": "|0>>> |_ompt_thread_initial",
                            "depth": 1,
                            "entry": {
                                "laps": 1,
                                "value": 896506392,
                                "accum": 896506392,
                                "repr_data": 0.896506392,
                                "repr_display": 0.896506392
                            },
                            "stats": {
                                "sum": 0.896506392,
                                "count": 1,
                                "min": 0.896506392,
                                "max": 0.896506392,
                                "sqr": 0.8037237108968578,
                                "mean": 0.896506392,
                                "stddev": 0.0
                            },
                            "rolling_hash": 5142782188440775656
                        },
                        {
                            "hash": 15402802091993617561,
                            "prefix": "|0>>> |_ompt_implicit_task",
                            "depth": 2,
                            "entry": {
                                "laps": 1,
                                "value": 896479111,
                                "accum": 896479111,
                                "repr_data": 0.896479111,
                                "repr_display": 0.896479111
                            },
                            "stats": {
                                "sum": 0.896479111,
                                "count": 1,
                                "min": 0.896479111,
                                "max": 0.896479111,
                                "sqr": 0.8036747964593504,
                                "mean": 0.896479111,
                                "stddev": 0.0
                            },
                            "rolling_hash": 2098840206724841601
                        },
                        {
                            "..." : "... etc. ..."
                        }
                    ]
                }
            ],
            "graph": [
                [
                    {
                        "cereal_class_version": 0,
                        "node": {
                            "hash": 0,
                            "prefix": "unknown-hash=0",
                            "tid": [
                                0
                            ],
                            "pid": [
                                2539175
                            ],
                            "depth": 0,
                            "is_dummy": false,
                            "inclusive": {
                                "entry": {
                                    "laps": 0,
                                    "value": 0,
                                    "accum": 0,
                                    "repr_data": 0.0,
                                    "repr_display": 0.0
                                },
                                "stats": {
                                    "sum": 0.0,
                                    "count": 0,
                                    "min": 0.0,
                                    "max": 0.0,
                                    "sqr": 0.0,
                                    "mean": 0.0,
                                    "stddev": 0.0
                                }
                            },
                            "exclusive": {
                                "entry": {
                                    "laps": 0,
                                    "value": -894743517,
                                    "accum": -894743517,
                                    "repr_data": -0.894743517,
                                    "repr_display": -0.894743517
                                },
                                "stats": {
                                    "sum": 0.0,
                                    "count": 0,
                                    "min": 0.0,
                                    "max": 0.0,
                                    "sqr": 0.0,
                                    "mean": 0.0,
                                    "stddev": 0.0
                                }
                            }
                        },
                        "children": [
                            {
                                "node": {
                                    "hash": 17481650134347108265,
                                    "prefix": "main",
                                    "tid": [
                                        0
                                    ],
                                    "pid": [
                                        2539175
                                    ],
                                    "depth": 1,
                                    "is_dummy": false,
                                    "inclusive": {
                                        "entry": {
                                            "laps": 1,
                                            "value": 894743517,
                                            "accum": 894743517,
                                            "repr_data": 0.894743517,
                                            "repr_display": 0.894743517
                                        },
                                        "stats": {
                                            "sum": 0.894743517,
                                            "count": 1,
                                            "min": 0.894743517,
                                            "max": 0.894743517,
                                            "sqr": 0.8005659612135293,
                                            "mean": 0.894743517,
                                            "stddev": 0.0
                                        }
                                    },
                                    "exclusive": {
                                        "entry": {
                                            "laps": 1,
                                            "value": -1773605,
                                            "accum": -1773605,
                                            "repr_data": -0.001773605,
                                            "repr_display": -0.001773605
                                        },
                                        "stats": {
                                            "sum": -0.001773605,
                                            "count": 1,
                                            "min": 9.22e-07,
                                            "max": 0.896506392,
                                            "sqr": -0.0031577497803754,
                                            "mean": -0.001773605,
                                            "stddev": 0.0
                                        }
                                    }
                                },
                                "children": [
                                    {
                                        "..." : "... etc. ..."
                                    }
                                ]
                            }
                        ]
                    }
                ]
            ]
        }
    }
}
```

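
The hierarchical layout in the sample can be walked with a short recursive function. The sketch below is illustrative only: `flatten_tree` is a hypothetical helper name, the `tiny_graph` input is a hand-trimmed copy of the sample, and it assumes every item carries a `"node"` object plus an optional `"children"` list, as the sample suggests.

```python
def flatten_tree(graph):
    """Return (depth, prefix, inclusive, exclusive) rows from hierarchical data.

    `graph` is the value at ["timemory"][<metric>]["graph"]: an outer list
    whose items are lists of root nodes (assumed layout, taken from the sample).
    """
    rows = []

    def visit(item):
        node = item["node"]
        rows.append(
            (
                node["depth"],
                node["prefix"],
                node["inclusive"]["entry"]["repr_data"],
                node["exclusive"]["entry"]["repr_data"],
            )
        )
        # Recurse into the children, if any, in the order they are stored.
        for child in item.get("children", []):
            visit(child)

    for roots in graph:
        for root in roots:
            visit(root)
    return rows


# Hand-trimmed stand-in for the "graph" section of the sample
tiny_graph = [
    [
        {
            "node": {
                "prefix": "unknown-hash=0",
                "depth": 0,
                "inclusive": {"entry": {"repr_data": 0.0}},
                "exclusive": {"entry": {"repr_data": -0.894743517}},
            },
            "children": [
                {
                    "node": {
                        "prefix": "main",
                        "depth": 1,
                        "inclusive": {"entry": {"repr_data": 0.894743517}},
                        "exclusive": {"entry": {"repr_data": -0.001773605}},
                    },
                    "children": [],
                }
            ],
        }
    ]
]

for depth, prefix, inclusive, exclusive in flatten_tree(tiny_graph):
    print(f"{'  ' * depth}{prefix}: inclusive={inclusive}, exclusive={exclusive}")
```

Note that the elided `{"...": "... etc. ..."}` placeholders in the sample have no `"node"` key, so the sample must be trimmed or replaced by real output before it is passed to a function like this.
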
#### Timemory JSON Output Python Post-Processing Example

```python
#!/usr/bin/env python3

import sys
import json


def read_json(inp):
    with open(inp, "r") as f:
        return json.load(f)


def find_max(data):
    """Find the entry with the largest mean among functions called multiple times"""
    max_entry = None
    for itr in data:
        if itr["entry"]["laps"] == 1:
            continue
        if max_entry is None:
            max_entry = itr
        else:
            if itr["stats"]["mean"] > max_entry["stats"]["mean"]:
                max_entry = itr
    return max_entry


def strip_name(name):
    """Return everything after |_ if it exists"""
    # str.find returns -1 when the marker is absent (str.index would raise)
    idx = name.find("|_")
    return name if idx < 0 else name[(idx + 2) :]


if __name__ == "__main__":

    input_data = [[x, read_json(x)] for x in sys.argv[1:]]

    for file, data in input_data:
        for metric, metric_data in data["timemory"].items():

            print(f"[{file}] Found metric: {metric}")

            for n, itr in enumerate(metric_data["ranks"]):

                max_entry = find_max(itr["graph"])
                if max_entry is None:
                    # no function in this rank was called more than once
                    continue
                print(
                    "[{}] Maximum value: '{}' at depth {} was called {}x :: {:.3f} {} (mean = {:.3e} {})".format(
                        file,
                        strip_name(max_entry["prefix"]),
                        max_entry["depth"],
                        max_entry["entry"]["laps"],
                        max_entry["entry"]["repr_data"],
                        metric_data["unit_repr"],
                        max_entry["stats"]["mean"],
                        metric_data["unit_repr"],
                    )
                )
```

Applying this script to the corresponding JSON output from [Text Output Example](#timemory-text-output-example) produces:

```console
[openmp-cg.inst-wall_clock.json] Found metric: wall_clock
[openmp-cg.inst-wall_clock.json] Maximum value: 'conj_grad' at depth 6 was called 76x :: 10.641 sec (mean = 1.400e-01 sec)
```
@@ -1,297 +0,0 @@
|
||||
# Python Support
|
||||
|
||||
```eval_rst
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 3
|
||||
```
|
||||
|
||||
[OmniTrace](https://github.com/ROCm/omnitrace) supports profiling Python code at the source-level and/or the script-level.
|
||||
Python support is enabled via the `OMNITRACE_USE_PYTHON` and the `OMNITRACE_PYTHON_VERSIONS="<MAJOR>.<MINOR>"` CMake options.
|
||||
Alternatively, to build multiple Python versions, use `OMNITRACE_PYTHON_VERSIONS="<MAJOR>.<MINOR>;[<MAJOR>.<MINOR>]"`,
|
||||
and `OMNITRACE_PYTHON_ROOT_DIRS="/path/to/version;[/path/to/version]"` instead of `OMNITRACE_PYTHON_VERSION`.
|
||||
When building multiple Python versions, the `OMNITRACE_PYTHON_VERSIONS` and `OMNITRACE_PYTHON_ROOT_DIRS` lists must
|
||||
be the same length.
|
||||
|
||||
> ***When using omnitrace for Python, the Python interpreter major and minor version (e.g. 3.7) must match the interpreter major and minor version***
|
||||
> ***used when compiling the Python bindings, i.e. when building omnitrace, a `libpyomnitrace.<IMPL>-<VERSION>-<ARCH>-<OS>-<ABI>.so` will be generated***
|
||||
> ***where `IMPL` is the Python implementation, `VERSION` is the major and minor version, `ARCH` is the architecture,***
|
||||
> ***`OS` is the operating system, and `ABI` is the application binary interface; Example: `libpyomnitrace.cpython-38-x86_64-linux-gnu.so`.***
|
||||
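To check whether a given bindings library matches the running interpreter, its name can be compared against the interpreter's own extension-module suffix. The sketch below assumes the bindings follow the standard CPython extension naming that the example filename suggests; `bindings_match_interpreter` is a hypothetical helper, not part of omnitrace.

```python
import sysconfig


def bindings_match_interpreter(library_name):
    """Return True if the library name ends with this interpreter's extension suffix."""
    # On CPython/Linux, EXT_SUFFIX looks like ".cpython-38-x86_64-linux-gnu.so"
    suffix = sysconfig.get_config_var("EXT_SUFFIX")
    return suffix is not None and library_name.endswith(suffix)


if __name__ == "__main__":
    print(bindings_match_interpreter("libpyomnitrace.cpython-38-x86_64-linux-gnu.so"))
```

A `False` result for every installed `libpyomnitrace.*.so` would indicate that the bindings were built for a different Python version than the one currently in use.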
|
||||
## Getting Started
|
||||
|
||||
The omnitrace Python package is installed in `lib/pythonX.Y/site-packages/omnitrace`. In order to ensure the Python interpreter can find the omnitrace package,
|
||||
add this path to the `PYTHONPATH` environment variable, e.g.:
|
||||
|
||||
```bash
|
||||
export PYTHONPATH=/opt/omnitrace/lib/python3.8/site-packages:${PYTHONPATH}
|
||||
```
|
||||
|
||||
If using either the `share/omnitrace/setup-env.sh` script or the modulefile in `share/modulefiles/omnitrace`, prefixing the `PYTHONPATH`
|
||||
environment variable is automatically handled.
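A quick way to confirm that the interpreter can locate the package is to query the import system directly. This is a minimal sketch: `omnitrace_importable` is a hypothetical helper, and inserting into `sys.path` only has a similar effect to prefixing `PYTHONPATH` for the current process.

```python
import importlib.util
import sys


def omnitrace_importable(site_packages=None):
    """Return True if the `omnitrace` package can be found on the module search path."""
    if site_packages is not None and site_packages not in sys.path:
        # Similar to prefixing PYTHONPATH, but only affects this process
        sys.path.insert(0, site_packages)
    return importlib.util.find_spec("omnitrace") is not None


if __name__ == "__main__":
    # The path below is the example install location used in this document
    print(omnitrace_importable("/opt/omnitrace/lib/python3.8/site-packages"))
```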
|
||||
|
||||
## Running OmniTrace on a Python Script
|
||||
|
||||
OmniTrace provides an `omnitrace-python` helper bash script which ensures `PYTHONPATH` is properly set and the correct Python interpreter is used.
|
||||
Thus the following are effectively equivalent:
|
||||
|
||||
```bash
|
||||
omnitrace-python --help
|
||||
|
||||
export PYTHONPATH=/opt/omnitrace/lib/python3.8/site-packages:${PYTHONPATH}
|
||||
python3.8 -m omnitrace --help
|
||||
```
|
||||
|
||||
> ***`omnitrace-python` / `python -m omnitrace` uses the same command-line syntax as the `omnitrace` executable (i.e. `omnitrace-python <OMNITRACE_ARGS> -- <SCRIPT> <SCRIPT_ARGS>`) and has similar options.***
|
||||
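When launching a profiled script from another Python program, the same syntax can be assembled as an argument list. The sketch below only builds the command described above; `omnitrace_python_command` is a hypothetical helper, and it assumes `PYTHONPATH` already points at the omnitrace package.

```python
import sys


def omnitrace_python_command(script, script_args=(), omnitrace_args=()):
    """Build the `python -m omnitrace <OMNITRACE_ARGS> -- <SCRIPT> <SCRIPT_ARGS>` argument list."""
    return [sys.executable, "-m", "omnitrace", *omnitrace_args, "--", script, *script_args]


if __name__ == "__main__":
    # Example: pass the `-b` option to omnitrace and `20` to the script
    print(omnitrace_python_command("./example.py", ["20"], ["-b"]))
```

The resulting list can be handed to `subprocess.run`, which avoids shell quoting issues around the standalone `--`.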
|
||||
### Command Line Options
|
||||
|
||||
Use `omnitrace-python --help` to view the available options:
|
||||
|
||||
```console
|
||||
usage: omnitrace [-h] [-v VERBOSITY] [-b] [-c FILE] [-s FILE] [-F [BOOL]] [--label [{args,file,line} [{args,file,line} ...]]] [-I FUNC [FUNC ...]] [-E FUNC [FUNC ...]] [-R FUNC [FUNC ...]] [-MI FILE [FILE ...]] [-ME FILE [FILE ...]] [-MR FILE [FILE ...]] [--trace-c [BOOL]]
|
||||
|
||||
optional arguments:
|
||||
-h, --help show this help message and exit
|
||||
-v VERBOSITY, --verbosity VERBOSITY
|
||||
Logging verbosity
|
||||
-b, --builtin Put 'profile' in the builtins. Use '@profile' to decorate a single function, or 'with profile:' to profile a single section of code.
|
||||
-c FILE, --config FILE
|
||||
OmniTrace configuration file
|
||||
-s FILE, --setup FILE
|
||||
Code to execute before the code to profile
|
||||
-F [BOOL], --full-filepath [BOOL]
|
||||
Encode the full function filename (instead of basename)
|
||||
--label [{args,file,line} [{args,file,line} ...]]
|
||||
Encode the function arguments, filename, and/or line number into the profiling function label
|
||||
-I FUNC [FUNC ...], --function-include FUNC [FUNC ...]
|
||||
Include any entries with these function names
|
||||
-E FUNC [FUNC ...], --function-exclude FUNC [FUNC ...]
|
||||
Filter out any entries with these function names
|
||||
-R FUNC [FUNC ...], --function-restrict FUNC [FUNC ...]
|
||||
Select only entries with these function names
|
||||
-MI FILE [FILE ...], --module-include FILE [FILE ...]
|
||||
Include any entries from these files
|
||||
-ME FILE [FILE ...], --module-exclude FILE [FILE ...]
|
||||
Filter out any entries from these files
|
||||
-MR FILE [FILE ...], --module-restrict FILE [FILE ...]
|
||||
Select only entries from these files
|
||||
--trace-c [BOOL] Enable profiling C functions
|
||||
|
||||
usage: python3 -m omnitrace <OMNITRACE_ARGS> -- <SCRIPT> <SCRIPT_ARGS>
|
||||
```
|
||||
|
||||
> ***The `--trace-c` option does not incorporate omnitrace's dynamic instrumentation support, rather it just enables profiling the underlying C function call within the Python interpreter.***
|
||||
|
||||
### Selective Instrumentation
|
||||
|
||||
Similar to the `omnitrace` executable, command-line options exist for restricting, including, and excluding the desired functions and modules, e.g. `--function-exclude "^__init__$"`.
|
||||
Alternatively, adding the `@profile` decorator to the primary function of interest in combination with the `-b` / `--builtin` option will narrow the scope of the
|
||||
instrumentation to these function(s) and their children.
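The anchors in a pattern such as `^__init__$` matter. The sketch below uses Python's `re` module to show the difference between an anchored and an unanchored pattern; it assumes the options accept regular expressions (as the example pattern suggests) and does not reproduce how omnitrace applies them internally. The function names in the list are made up for illustration.

```python
import re

# Hypothetical function names
names = ["__init__", "__init__subclass", "my__init__", "run"]

# Anchored: matches only the exact name
anchored = [n for n in names if re.search(r"^__init__$", n)]

# Unanchored: matches every name containing the substring
unanchored = [n for n in names if re.search(r"__init__", n)]

print(anchored)
print(unanchored)
```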
|
||||
|
||||
Consider the following Python code (`example.py`):
|
||||
|
||||
```python
|
||||
import sys
|
||||
|
||||
def fib(n):
|
||||
return n if n < 2 else (fib(n - 1) + fib(n - 2))
|
||||
|
||||
|
||||
def inefficient(n):
|
||||
a = 0
|
||||
for i in range(n):
|
||||
a += i
|
||||
for j in range(n):
|
||||
a += j
|
||||
return a
|
||||
|
||||
|
||||
def run(n):
|
||||
return fib(n) + inefficient(n)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
run(20)
|
||||
```
|
||||
|
||||
Using `omnitrace-python ./example.py` with `OMNITRACE_PROFILE=ON` and `OMNITRACE_TIMEMORY_COMPONENTS=trip_count` would produce:
|
||||
|
||||
```console
|
||||
|-------------------------------------------------------------------------------------------|
|
||||
| COUNTS NUMBER OF INVOCATIONS |
|
||||
|-------------------------------------------------------------------------------------------|
|
||||
| LABEL | COUNT | DEPTH | METRIC | SUM |
|
||||
|---------------------------------------------------|--------|--------|------------|--------|
|
||||
| |0>>> run | 1 | 0 | trip_count | 1 |
|
||||
| |0>>> |_fib | 1 | 1 | trip_count | 1 |
|
||||
| |0>>> |_fib | 2 | 2 | trip_count | 2 |
|
||||
| |0>>> |_fib | 4 | 3 | trip_count | 4 |
|
||||
| |0>>> |_fib | 8 | 4 | trip_count | 8 |
|
||||
| |0>>> |_fib | 16 | 5 | trip_count | 16 |
|
||||
| |0>>> |_fib | 32 | 6 | trip_count | 32 |
|
||||
| |0>>> |_fib | 64 | 7 | trip_count | 64 |
|
||||
| |0>>> |_fib | 128 | 8 | trip_count | 128 |
|
||||
| |0>>> |_fib | 256 | 9 | trip_count | 256 |
|
||||
| |0>>> |_fib | 512 | 10 | trip_count | 512 |
|
||||
| |0>>> |_fib | 1024 | 11 | trip_count | 1024 |
|
||||
| |0>>> |_fib | 2026 | 12 | trip_count | 2026 |
|
||||
| |0>>> |_fib | 3632 | 13 | trip_count | 3632 |
|
||||
| |0>>> |_fib | 5020 | 14 | trip_count | 5020 |
|
||||
| |0>>> |_fib | 4760 | 15 | trip_count | 4760 |
|
||||
| |0>>> |_fib | 2942 | 16 | trip_count | 2942 |
|
||||
| |0>>> |_fib | 1152 | 17 | trip_count | 1152 |
|
||||
| |0>>> |_fib | 274 | 18 | trip_count | 274 |
|
||||
| |0>>> |_fib | 36 | 19 | trip_count | 36 |
|
||||
| |0>>> |_fib | 2 | 20 | trip_count | 2 |
|
||||
| |0>>> |_inefficient | 1 | 1 | trip_count | 1 |
|
||||
|-------------------------------------------------------------------------------------------|
|
||||
```
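The `COUNT` column for `fib` can be cross-checked without a profiler by counting the recursive calls at each call-stack depth. This is a plain-Python sketch that is independent of omnitrace; `fib_calls_per_depth` is a hypothetical helper that mirrors the recursion in `example.py`.

```python
from collections import Counter


def fib_calls_per_depth(n, depth=1, counts=None):
    """Count how many times fib is called at each call-stack depth."""
    counts = Counter() if counts is None else counts
    counts[depth] += 1
    if n >= 2:
        # fib(n) calls fib(n - 1) and fib(n - 2), one level deeper
        fib_calls_per_depth(n - 1, depth + 1, counts)
        fib_calls_per_depth(n - 2, depth + 1, counts)
    return counts


if __name__ == "__main__":
    # fib(20) is first called from run(), so its outermost call sits at depth 1
    for depth, count in sorted(fib_calls_per_depth(20).items()):
        print(depth, count)
```

The counts double at each level until the shorter branches of the recursion start terminating, which is why the values stop being powers of two at depth 12.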
|
||||
|
||||
If the `inefficient` function were decorated with `@profile`:
|
||||
|
||||
```python
|
||||
@profile
|
||||
def inefficient(n):
|
||||
# ...
|
||||
```
|
||||
|
||||
And executed with `omnitrace-python -b -- ./example.py`, omnitrace would produce:
|
||||
|
||||
```console
|
||||
|-----------------------------------------------------------|
|
||||
| COUNTS NUMBER OF INVOCATIONS |
|
||||
|-----------------------------------------------------------|
|
||||
| LABEL | COUNT | DEPTH | METRIC | SUM |
|
||||
|-------------------|--------|--------|------------|--------|
|
||||
| |0>>> inefficient | 1 | 0 | trip_count | 1 |
|
||||
|-----------------------------------------------------------|
|
||||
```
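A script decorated with a bare `@profile` raises a `NameError` when it is run without the `-b` / `--builtin` option, because nothing defines `profile`. One common workaround, borrowed from other Python profilers rather than prescribed by omnitrace, is a no-op fallback. The function `work` below is a hypothetical stand-in for the function of interest.

```python
import builtins

# When the script is not launched with `-b` / `--builtin`, `profile` is not in the builtins.
# Fall back to a decorator that returns the function unchanged.
if not hasattr(builtins, "profile"):

    def profile(func):
        return func


@profile
def work(n):
    return sum(range(n))


if __name__ == "__main__":
    print(work(5))
```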
|
||||
|
||||
## OmniTrace Python Source Instrumentation
|
||||
|
||||
Starting from the unmodified `example.py` script above, first import the `omnitrace` module:
|
||||
|
||||
```python
|
||||
import sys
|
||||
import omnitrace # import omnitrace
|
||||
|
||||
def fib(n):
|
||||
# ... etc. ...
|
||||
```
|
||||
|
||||
Then, we can add `@omnitrace.profile()` to the `run` function:
|
||||
|
||||
```python
|
||||
@omnitrace.profile()
|
||||
def run(n):
|
||||
# ...
|
||||
```
|
||||
|
||||
Or we can use `omnitrace.profile()` as a context-manager around `run(20)`:
|
||||
|
||||
```python
|
||||
if __name__ == "__main__":
|
||||
with omnitrace.profile():
|
||||
run(20)
|
||||
```
|
||||
|
||||
The results for both of the source-level instrumentation modes are identical to the original `omnitrace-python ./example.py` results:
|
||||
|
||||
```console
|
||||
|-------------------------------------------------------------------------------------------|
|
||||
| COUNTS NUMBER OF INVOCATIONS |
|
||||
|-------------------------------------------------------------------------------------------|
|
||||
| LABEL | COUNT | DEPTH | METRIC | SUM |
|
||||
|---------------------------------------------------|--------|--------|------------|--------|
|
||||
| |0>>> run | 1 | 0 | trip_count | 1 |
|
||||
| |0>>> |_fib | 1 | 1 | trip_count | 1 |
|
||||
| |0>>> |_fib | 2 | 2 | trip_count | 2 |
|
||||
| |0>>> |_fib | 4 | 3 | trip_count | 4 |
|
||||
| |0>>> |_fib | 8 | 4 | trip_count | 8 |
|
||||
| |0>>> |_fib | 16 | 5 | trip_count | 16 |
|
||||
| |0>>> |_fib | 32 | 6 | trip_count | 32 |
|
||||
| |0>>> |_fib | 64 | 7 | trip_count | 64 |
|
||||
| |0>>> |_fib | 128 | 8 | trip_count | 128 |
|
||||
| |0>>> |_fib | 256 | 9 | trip_count | 256 |
|
||||
| |0>>> |_fib | 512 | 10 | trip_count | 512 |
|
||||
| |0>>> |_fib | 1024 | 11 | trip_count | 1024 |
|
||||
| |0>>> |_fib | 2026 | 12 | trip_count | 2026 |
|
||||
| |0>>> |_fib | 3632 | 13 | trip_count | 3632 |
|
||||
| |0>>> |_fib | 5020 | 14 | trip_count | 5020 |
|
||||
| |0>>> |_fib | 4760 | 15 | trip_count | 4760 |
|
||||
| |0>>> |_fib | 2942 | 16 | trip_count | 2942 |
|
||||
| |0>>> |_fib | 1152 | 17 | trip_count | 1152 |
|
||||
| |0>>> |_fib | 274 | 18 | trip_count | 274 |
|
||||
| |0>>> |_fib | 36 | 19 | trip_count | 36 |
|
||||
| |0>>> |_fib | 2 | 20 | trip_count | 2 |
|
||||
| |0>>> |_inefficient | 1 | 1 | trip_count | 1 |
|
||||
|-------------------------------------------------------------------------------------------|
|
||||
```
|
||||
|
||||
> ***When `omnitrace-python` is used without built-ins, the profiling results will likely be cluttered by***
|
||||
> ***numerous functions called during the importing of more complex modules, e.g. `import numpy`.***
|
||||
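Source-level instrumentation makes the script depend on the `omnitrace` package being importable. If the script also needs to run where omnitrace is not installed, a stand-in can be substituted. The sketch below is an assumption-laden workaround, not an omnitrace feature: the fallback `profile` class is hypothetical and only mimics the decorator and context-manager usage shown above.

```python
import contextlib

try:
    from omnitrace import profile
except ImportError:

    class profile(contextlib.ContextDecorator):
        """No-op stand-in used when the omnitrace package is unavailable."""

        def __enter__(self):
            return self

        def __exit__(self, *exc_info):
            return False  # never swallow exceptions


@profile()
def run(n):
    return n * 2


if __name__ == "__main__":
    with profile():
        print(run(20))
```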
|
||||
### OmniTrace Python Source Instrumentation Configuration
|
||||
|
||||
Within the Python source code, the profiler can be configured by directly modifying the `omnitrace.profiler.config` data fields.
|
||||
|
||||
```python
|
||||
import sys
|
||||
|
||||
def fib(n):
|
||||
return n if n < 2 else (fib(n - 1) + fib(n - 2))
|
||||
|
||||
|
||||
def inefficient(n):
|
||||
a = 0
|
||||
for i in range(n):
|
||||
a += i
|
||||
for j in range(n):
|
||||
a += j
|
||||
return a
|
||||
|
||||
|
||||
def run(n):
|
||||
return fib(n) + inefficient(n)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
from omnitrace.profiler import config
|
||||
from omnitrace import profile
|
||||
|
||||
config.include_args = True
|
||||
config.include_filename = False
|
||||
config.include_line = False
|
||||
config.restrict_functions += ["fib", "run"]
|
||||
|
||||
with profile():
|
||||
run(5)
|
||||
```
|
||||
|
||||
Executing this script would produce:
|
||||
|
||||
```console
|
||||
|------------------------------------------------------------------|
|
||||
| COUNTS NUMBER OF INVOCATIONS |
|
||||
|------------------------------------------------------------------|
|
||||
| LABEL | COUNT | DEPTH | METRIC | SUM |
|
||||
|--------------------------|--------|--------|------------|--------|
|
||||
| |0>>> run(n=5) | 1 | 0 | trip_count | 1 |
|
||||
| |0>>> |_fib(n=5) | 1 | 1 | trip_count | 1 |
|
||||
| |0>>> |_fib(n=4) | 1 | 2 | trip_count | 1 |
|
||||
| |0>>> |_fib(n=3) | 1 | 3 | trip_count | 1 |
|
||||
| |0>>> |_fib(n=2) | 1 | 4 | trip_count | 1 |
|
||||
| |0>>> |_fib(n=1) | 1 | 5 | trip_count | 1 |
|
||||
| |0>>> |_fib(n=0) | 1 | 5 | trip_count | 1 |
|
||||
| |0>>> |_fib(n=1) | 1 | 4 | trip_count | 1 |
|
||||
| |0>>> |_fib(n=2) | 1 | 3 | trip_count | 1 |
|
||||
| |0>>> |_fib(n=1) | 1 | 4 | trip_count | 1 |
|
||||
| |0>>> |_fib(n=0) | 1 | 4 | trip_count | 1 |
|
||||
| |0>>> |_fib(n=3) | 1 | 2 | trip_count | 1 |
|
||||
| |0>>> |_fib(n=2) | 1 | 3 | trip_count | 1 |
|
||||
| |0>>> |_fib(n=1) | 1 | 4 | trip_count | 1 |
|
||||
| |0>>> |_fib(n=0) | 1 | 4 | trip_count | 1 |
|
||||
| |0>>> |_fib(n=1) | 1 | 3 | trip_count | 1 |
|
||||
|------------------------------------------------------------------|
|
||||
```
|
||||
@@ -1,353 +0,0 @@
|
||||
# Call-Stack Sampling
|
||||
|
||||
```eval_rst
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 4
|
||||
```
|
||||
|
||||
> ***NOTE: Set `OMNITRACE_USE_SAMPLING=ON` to activate call-stack sampling when executing an instrumented binary***
|
||||
|
||||
Call-stack sampling can be activated with either a binary instrumented via the `omnitrace-instrument` executable or via the `omnitrace-sample` executable.
|
||||
***Effectively***, all of the commands below are equivalent:
|
||||
|
||||
- Binary rewrite with only instrumentation necessary to start/stop sampling
|
||||
|
||||
```console
|
||||
omnitrace-instrument -M sampling -o foo.inst -- foo
|
||||
omnitrace-run -- ./foo.inst
|
||||
```
|
||||
|
||||
- Runtime instrumentation with only instrumentation necessary to start/stop sampling
|
||||
|
||||
```console
|
||||
omnitrace-instrument -M sampling -- foo
|
||||
```
|
||||
|
||||
- No instrumentation required
|
||||
|
||||
```console
|
||||
omnitrace-sample -- foo
|
||||
```
|
||||
|
||||
All `omnitrace-instrument -M sampling` (referred to as "instrumented-sampling" henceforth) does is wrap the `main` of the executable with initialization
|
||||
before `main` starts and finalization after `main` ends.
|
||||
This can be easily accomplished without instrumentation via an `LD_PRELOAD` of a library containing a dynamic symbol wrapper around `__libc_start_main`.
|
||||
Thus, whenever binary instrumentation is unnecessary, using `omnitrace-sample` is recommended over `omnitrace-instrument -M sampling` for several reasons:
|
||||
|
||||
1. `omnitrace-sample` provides command-line options for controlling features of omnitrace instead of *requiring* configuration files or environment variables
|
||||
2. Despite the fact that instrumented-sampling only requires inserting snippets around one function (`main`), Dyninst
|
||||
does not have a feature for specifying that parsing and processing all the other symbols in the binary is unnecessary,
|
||||
thus, in the best case scenario, instrumented-sampling has a slightly slower launch time when the target binary is relatively small
|
||||
but, in the worst case scenarios, requires a significant amount of time and memory to launch
|
||||
3. `omnitrace-sample` is fully compatible with MPI, e.g. `mpirun -n 2 omnitrace-sample -- foo`, whereas `mpirun -n 2 omnitrace-instrument -M sampling -- foo`
|
||||
is incompatible with some MPI distributions (particularly OpenMPI) because of MPI restrictions against forking within an MPI rank
|
||||
- Recall that when MPI and binary instrumentation are combined, two steps are involved: (1) do a binary rewrite of the executable
|
||||
and (2) use the instrumented executable in lieu of the original executable. `omnitrace-sample` is thus much easier to use with MPI.
|
||||
|
||||
## omnitrace-sample Executable
|
||||
|
||||
View the help menu of `omnitrace-sample` with the `-h` / `--help` option:
|
||||
|
||||
```console
|
||||
$ omnitrace-sample --help
|
||||
[omnitrace-sample] Usage: omnitrace-sample [ --help (count: 0, dtype: bool)
|
||||
--monochrome (max: 1, dtype: bool)
|
||||
--debug (max: 1, dtype: bool)
|
||||
--verbose (count: 1)
|
||||
--config (min: 0, dtype: filepath)
|
||||
--output (min: 1)
|
||||
--trace (max: 1, dtype: bool)
|
||||
--profile (max: 1, dtype: bool)
|
||||
--flat-profile (max: 1, dtype: bool)
|
||||
--host (max: 1, dtype: bool)
|
||||
--device (max: 1, dtype: bool)
|
||||
--trace-file (count: 1, dtype: filepath)
|
||||
--trace-buffer-size (count: 1, dtype: KB)
|
||||
--trace-fill-policy (count: 1)
|
||||
--profile-format (min: 1)
|
||||
--profile-diff (min: 1)
|
||||
--process-freq (count: 1)
|
||||
--process-wait (count: 1)
|
||||
--process-duration (count: 1)
|
||||
--cpus (count: unlimited, dtype: int or range)
|
||||
--gpus (count: unlimited, dtype: int or range)
|
||||
--freq (count: 1)
|
||||
--wait (count: 1)
|
||||
--duration (count: 1)
|
||||
--tids (min: 1)
|
||||
--cputime (min: 0)
|
||||
--realtime (min: 0)
|
||||
--include (count: unlimited)
|
||||
--exclude (count: unlimited)
|
||||
--cpu-events (count: unlimited)
|
||||
--gpu-events (count: unlimited)
|
||||
--inlines (max: 1, dtype: bool)
|
||||
--hsa-interrupt (count: 1, dtype: int)
|
||||
]
|
||||
|
||||
Options:
|
||||
-h, -?, --help Shows this page
|
||||
|
||||
[DEBUG OPTIONS]
|
||||
|
||||
--monochrome Disable colorized output
|
||||
--debug Debug output
|
||||
-v, --verbose Verbose output
|
||||
|
||||
[GENERAL OPTIONS]
|
||||
|
||||
-c, --config Configuration file
|
||||
-o, --output Output path. Accepts 1-2 parameters corresponding to the output path and the output prefix
|
||||
-T, --trace Generate a detailed trace (perfetto output)
|
||||
-P, --profile Generate a call-stack-based profile (conflicts with --flat-profile)
|
||||
-F, --flat-profile Generate a flat profile (conflicts with --profile)
|
||||
-H, --host Enable sampling host-based metrics for the process. E.g. CPU frequency, memory usage, etc.
|
||||
-D, --device Enable sampling device-based metrics for the process. E.g. GPU temperature, memory usage, etc.
|
||||
|
||||
[TRACING OPTIONS]
|
||||
|
||||
--trace-file Specify the trace output filename. Relative filepath will be with respect to output path and output prefix.
|
||||
--trace-buffer-size Size limit for the trace output (in KB)
|
||||
--trace-fill-policy [ discard | ring_buffer ]
|
||||
|
||||
Policy for new data when the buffer size limit is reached:
|
||||
- discard : new data is ignored
|
||||
- ring_buffer : new data overwrites oldest data
|
||||
|
||||
[PROFILE OPTIONS]
|
||||
|
||||
--profile-format [ console | json | text ]
|
||||
Data formats for profiling results
|
||||
--profile-diff Generate a diff output b/t the profile collected and an existing profile from another run Accepts 1-2 parameters
|
||||
corresponding to the input path and the input prefix
|
||||
|
||||
[HOST/DEVICE (PROCESS SAMPLING) OPTIONS]
|
||||
|
||||
|
||||
--process-freq Set the default host/device sampling frequency (number of interrupts per second)
|
||||
--process-wait Set the default wait time (i.e. delay) before taking first host/device sample (in seconds of realtime)
|
||||
--process-duration Set the duration of the host/device sampling (in seconds of realtime)
|
||||
--cpus CPU IDs for frequency sampling. Supports integers and/or ranges
|
||||
--gpus GPU IDs for SMI queries. Supports integers and/or ranges
|
||||
|
||||
[GENERAL SAMPLING OPTIONS]
|
||||
|
||||
-f, --freq Set the default sampling frequency (number of interrupts per second)
|
||||
-w, --wait Set the default wait time (i.e. delay) before taking first sample (in seconds). This delay time is based on the clock
|
||||
of the sampler, i.e., a delay of 1 second for CPU-clock sampler may not equal 1 second of realtime
|
||||
-d, --duration Set the duration of the sampling (in seconds of realtime). I.e., it is possible (currently) to set a CPU-clock time
|
||||
delay that exceeds the real-time duration... resulting in zero samples being taken
|
||||
-t, --tids Specify the default thread IDs for sampling, where 0 (zero) is the main thread and each thread created by the target
|
||||
application is assigned an atomically incrementing value.
|
||||
|
||||
[SAMPLING TIMER OPTIONS]
|
||||
|
||||
--cputime Sample based on a CPU-clock timer (default). Accepts zero or more arguments:
|
||||
0. Enables sampling based on CPU-clock timer.
|
||||
1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of CPU-time.
|
||||
2. Delay (in seconds of CPU-clock time). I.e., how long each thread should wait before taking first sample.
|
||||
3+ Thread IDs to target for sampling, starting at 0 (the main thread).
|
||||
May be specified as index or range, e.g., '0 2-4' will be interpreted as:
|
||||
sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads
|
||||
--realtime Sample based on a real-clock timer. Accepts zero or more arguments:
|
||||
0. Enables sampling based on real-clock timer.
|
||||
1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of realtime.
|
||||
2. Delay (in seconds of real-clock time). I.e., how long each thread should wait before taking first sample.
|
||||
3+ Thread IDs to target for sampling, starting at 0 (the main thread).
|
||||
May be specified as index or range, e.g., '0 2-4' will be interpreted as:
|
||||
sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads
|
||||
When sampling with a real-clock timer, please note that enabling this will cause threads which are typically "idle"
|
||||
to consume more resources since, while idle, the real-clock time increases (and therefore triggers taking samples)
|
||||
whereas the CPU-clock time does not.
|
||||
|
||||
[BACKEND OPTIONS] (These options control region information captured w/o sampling or instrumentation)
|
||||
|
||||
-I, --include [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
|
||||
Include data from these backends
|
||||
-E, --exclude [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
|
||||
Exclude data from these backends
|
||||
|
||||
[HARDWARE COUNTER OPTIONS]
|
||||
|
||||
-C, --cpu-events Set the CPU hardware counter events to record (ref: `omnitrace-avail -H -c CPU`)
|
||||
-G, --gpu-events Set the GPU hardware counter events to record (ref: `omnitrace-avail -H -c GPU`)
|
||||
|
||||
[MISCELLANEOUS OPTIONS]
|
||||
|
||||
-i, --inlines Include inline info in output when available
|
||||
--hsa-interrupt [ 0 | 1 ] Set the value of the HSA_ENABLE_INTERRUPT environment variable.
|
||||
ROCm version 5.2 and older have a bug which will cause a deadlock if a sample is taken while waiting for the signal
|
||||
that a kernel completed -- which happens when sampling with a real-clock timer. We require this option to be set to
|
||||
when --realtime is specified to make users aware that, while this may fix the bug, it can have a negative impact on
|
||||
performance.
|
||||
Values:
|
||||
0 avoid triggering the bug, potentially at the cost of reduced performance
|
||||
1 do not modify how ROCm is notified about kernel completion
|
||||
```
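Two conventions in the help text above are easy to misread: the frequency is given in interrupts per second, and thread IDs may be given as indices or ranges. The sketch below restates both using the examples from the help text (`100` and `'0 2-4'`); the helper names are hypothetical and this is not the parser used by `omnitrace-sample`.

```python
def sampling_interval_ms(freq):
    """Convert interrupts per second into the time between samples, in milliseconds."""
    return 1000.0 / freq


def expand_thread_ids(spec):
    """Expand a spec such as '0 2-4' into an explicit list of thread IDs."""
    tids = []
    for token in spec.split():
        if "-" in token:
            # A range such as '2-4' is inclusive on both ends
            first, last = (int(v) for v in token.split("-", 1))
            tids.extend(range(first, last + 1))
        else:
            tids.append(int(token))
    return tids


if __name__ == "__main__":
    print(sampling_interval_ms(100))
    print(expand_thread_ids("0 2-4"))
```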
|
||||
|
||||
The general syntax for separating omnitrace command line arguments from the application arguments
|
||||
is consistent with the LLVM style of using a standalone double-hyphen (`--`). All arguments preceding the double-hyphen
|
||||
are interpreted as belonging to omnitrace and all arguments following the double-hyphen are interpreted as the
|
||||
application and its arguments. The double-hyphen is only necessary when passing command line arguments to the target
|
||||
which also use hyphens. E.g. `omnitrace-sample ls` works but, in order to run `ls -la`, use `omnitrace-sample -- ls -la`.
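The double-hyphen convention can be sketched as a split of the argument list at the first standalone `--`. This illustrates the convention only; `split_at_double_hyphen` is a hypothetical helper, and treating every argument as the application command when no `--` is present is a simplification rather than a description of the actual parser.

```python
def split_at_double_hyphen(argv):
    """Split arguments into (tool_args, app_command) at the first standalone '--'."""
    if "--" in argv:
        idx = argv.index("--")
        return argv[:idx], argv[idx + 1 :]
    # Simplification: with no '--', treat everything as the application command
    return [], list(argv)


if __name__ == "__main__":
    print(split_at_double_hyphen(["-P", "--", "ls", "-la"]))
    print(split_at_double_hyphen(["ls"]))
```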
|
||||
|
||||
[Configuring OmniTrace Runtime](runtime.md) establishes the precedence of environment variable values over values specified in the configuration files. This enables
|
||||
the user to configure the omnitrace runtime to their preferred default behavior in a file such as `~/.omnitrace.cfg` and then easily override
|
||||
those settings via something like `OMNITRACE_ENABLED=OFF omnitrace-sample -- foo`.
|
||||
Similarly, the command line arguments passed to `omnitrace-sample` take precedence over environment variables.
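The resulting precedence (command line over environment variables over configuration file) can be modeled as a lookup through three layers. This is a conceptual sketch using `collections.ChainMap`, not omnitrace's implementation; the setting values are illustrative.

```python
from collections import ChainMap


def resolve_settings(config_file, environment, command_line):
    """Merge settings so that command line > environment > configuration file."""
    # ChainMap returns the value from the first mapping that contains the key
    return ChainMap(command_line, environment, config_file)


if __name__ == "__main__":
    settings = resolve_settings(
        config_file={"OMNITRACE_ENABLED": "ON", "OMNITRACE_PROFILE": "false"},
        environment={"OMNITRACE_ENABLED": "OFF"},
        command_line={"OMNITRACE_PROFILE": "true"},
    )
    print(settings["OMNITRACE_ENABLED"], settings["OMNITRACE_PROFILE"])
```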
|
||||
|
||||
All of the command-line options above correlate to one or more configuration settings, e.g. `--cpu-events` correlates to the `OMNITRACE_PAPI_EVENTS` configuration variable.
|
||||
After the command-line arguments to `omnitrace-sample` have been processed but before the target application is executed, `omnitrace-sample` will emit a log
|
||||
of which environment variables were set and/or modified.
|
||||
|
||||
The snippet below shows the environment updates when `omnitrace-sample` is invoked with no arguments:
|
||||
|
||||
```console
|
||||
$ omnitrace-sample -- ./parallel-overhead-locks 30 4 100
|
||||
|
||||
HSA_TOOLS_LIB=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
|
||||
HSA_TOOLS_REPORT_LOAD_FAILURE=1
|
||||
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
|
||||
OMNITRACE_USE_PROCESS_SAMPLING=false
|
||||
OMNITRACE_USE_SAMPLING=true
|
||||
OMP_TOOL_LIBRARIES=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
|
||||
ROCP_TOOL_LIB=/opt/omnitrace/lib/libomnitrace.so.1.7.1
|
||||
|
||||
...
|
||||
```
|
||||
|
||||
The snippet below shows the environment updates when `omnitrace-sample` enables profiling, tracing, host process-sampling, device process-sampling, and all the available backends:
|
||||
|
||||
```console
|
||||
$ omnitrace-sample -PTDH -I all -- ./parallel-overhead-locks 30 4 100
|
||||
|
||||
HSA_TOOLS_LIB=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
|
||||
HSA_TOOLS_REPORT_LOAD_FAILURE=1
|
||||
KOKKOS_PROFILE_LIBRARY=/opt/omnitrace/lib/libomnitrace.so.1.7.1
|
||||
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
|
||||
OMNITRACE_CPU_FREQ_ENABLED=true
|
||||
OMNITRACE_TRACE_THREAD_LOCKS=true
|
||||
OMNITRACE_TRACE_THREAD_RW_LOCKS=true
|
||||
OMNITRACE_TRACE_THREAD_SPIN_LOCKS=true
|
||||
OMNITRACE_USE_KOKKOSP=true
|
||||
OMNITRACE_USE_MPIP=true
|
||||
OMNITRACE_USE_OMPT=true
|
||||
OMNITRACE_TRACE=true
|
||||
OMNITRACE_USE_PROCESS_SAMPLING=true
|
||||
OMNITRACE_USE_RCCLP=true
|
||||
OMNITRACE_USE_ROCM_SMI=true
|
||||
OMNITRACE_USE_ROCPROFILER=true
|
||||
OMNITRACE_USE_ROCTRACER=true
|
||||
OMNITRACE_USE_ROCTX=true
|
||||
OMNITRACE_USE_SAMPLING=true
|
||||
OMNITRACE_PROFILE=true
|
||||
OMP_TOOL_LIBRARIES=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
|
||||
ROCP_TOOL_LIB=/opt/omnitrace/lib/libomnitrace.so.1.7.1
|
||||
|
||||
...
|
||||
```
|
||||
|
||||
The snippet below shows the environment updates when `omnitrace-sample` enables profiling, tracing, host process-sampling, device process-sampling,
|
||||
sets the output path to `omnitrace-output`, the output prefix to `%tag%` and disables all the available backends:
|
||||
|
||||
```console
|
||||
$ omnitrace-sample -PTDH -E all -o omnitrace-output %tag% -- ./parallel-overhead-locks 30 4 100
|
||||
|
||||
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
|
||||
OMNITRACE_CPU_FREQ_ENABLED=true
|
||||
OMNITRACE_OUTPUT_PATH=omnitrace-output
|
||||
OMNITRACE_OUTPUT_PREFIX=%tag%
|
||||
OMNITRACE_TRACE_THREAD_LOCKS=false
|
||||
OMNITRACE_TRACE_THREAD_RW_LOCKS=false
|
||||
OMNITRACE_TRACE_THREAD_SPIN_LOCKS=false
|
||||
OMNITRACE_USE_KOKKOSP=false
|
||||
OMNITRACE_USE_MPIP=false
|
||||
OMNITRACE_USE_OMPT=false
|
||||
OMNITRACE_TRACE=true
|
||||
OMNITRACE_USE_PROCESS_SAMPLING=true
|
||||
OMNITRACE_USE_RCCLP=false
|
||||
OMNITRACE_USE_ROCM_SMI=false
|
||||
OMNITRACE_USE_ROCPROFILER=false
|
||||
OMNITRACE_USE_ROCTRACER=false
|
||||
OMNITRACE_USE_ROCTX=false
|
||||
OMNITRACE_USE_SAMPLING=true
|
||||
OMNITRACE_PROFILE=true
|
||||
|
||||
...
|
||||
```
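Comparing these logs by eye is tedious, so it can help to parse the `NAME=value` lines and diff them. The sketch below assumes the log format shown in the snippets above; `parse_env_updates` and `changed_settings` are hypothetical helpers.

```python
import re

# Matches lines such as "OMNITRACE_USE_SAMPLING=true" (the value may be empty)
ENV_LINE = re.compile(r"^([A-Z_][A-Z0-9_]*)=(.*)$")


def parse_env_updates(log_text):
    """Collect NAME=value lines from the log into a dict, skipping everything else."""
    updates = {}
    for line in log_text.splitlines():
        match = ENV_LINE.match(line.strip())
        if match:
            updates[match.group(1)] = match.group(2)
    return updates


def changed_settings(before, after):
    """Return the variables in `after` that are new or have a different value."""
    return {k: v for k, v in after.items() if before.get(k) != v}
```

Feeding the captured output of two invocations through `parse_env_updates` and then `changed_settings` lists exactly which settings a set of command-line options altered.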
|
||||
|
||||
## omnitrace-sample Example
|
||||
|
||||
```console
|
||||
$ omnitrace-sample -PTDH -E all -o omnitrace-output %tag% -c -- ./parallel-overhead-locks 30 4 100
|
||||
|
||||
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
|
||||
OMNITRACE_CONFIG_FILE=
|
||||
OMNITRACE_CPU_FREQ_ENABLED=true
|
||||
OMNITRACE_OUTPUT_PATH=omnitrace-output
|
||||
OMNITRACE_OUTPUT_PREFIX=%tag%
|
||||
OMNITRACE_TRACE_THREAD_LOCKS=false
|
||||
OMNITRACE_TRACE_THREAD_RW_LOCKS=false
|
||||
OMNITRACE_TRACE_THREAD_SPIN_LOCKS=false
|
||||
OMNITRACE_USE_KOKKOSP=false
|
||||
OMNITRACE_USE_MPIP=false
|
||||
OMNITRACE_USE_OMPT=false
|
||||
OMNITRACE_TRACE=true
|
||||
OMNITRACE_USE_PROCESS_SAMPLING=true
|
||||
OMNITRACE_USE_RCCLP=false
|
||||
OMNITRACE_USE_ROCM_SMI=false
|
||||
OMNITRACE_USE_ROCPROFILER=false
|
||||
OMNITRACE_USE_ROCTRACER=false
|
||||
OMNITRACE_USE_ROCTX=false
|
||||
OMNITRACE_USE_SAMPLING=true
|
||||
OMNITRACE_PROFILE=true
|
||||
|
||||
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Sampling
|
||||
|
||||
|
||||
______ .___ ___. .__ __. __ .___________..______ ___ ______ _______
|
||||
/ __ \ | \/ | | \ | | | | | || _ \ / \ / || ____|
|
||||
| | | | | \ / | | \| | | | `---| |----`| |_) | / ^ \ | ,----'| |__
|
||||
| | | | | |\/| | | . ` | | | | | | / / /_\ \ | | | __|
|
||||
| `--' | | | | | | |\ | | | | | | |\ \----./ _____ \ | `----.| |____
|
||||
\______/ |__| |__| |__| \__| |__| |__| | _| `._____/__/ \__\ \______||_______|
|
||||
|
||||
|
||||
[759.689] perfetto.cc:55903 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
|
||||
|
||||
[parallel-overhead-locks] Threads: 4
|
||||
[parallel-overhead-locks] Iterations: 100
|
||||
[parallel-overhead-locks] fibonacci(30)...
|
||||
[1] number of iterations: 100
|
||||
[2] number of iterations: 100
|
||||
[3] number of iterations: 100
|
||||
[4] number of iterations: 100
|
||||
[parallel-overhead-locks] fibonacci(30) x 4 = 394644873
|
||||
[parallel-overhead-locks] number of mutex locks = 400
|
||||
[omnitrace][107157][0][omnitrace_finalize]
|
||||
[omnitrace][107157][0][omnitrace_finalize] finalizing...
|
||||
[omnitrace][107157][0][omnitrace_finalize]
|
||||
[omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157 : 0.610427 sec wall_clock, 2.248 MB peak_rss, 2.265 MB page_rss, 2.560000 sec cpu_clock, 419.4 % cpu_util [laps: 1]
|
||||
[omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157/thread/0 : 0.608866 sec wall_clock, 0.000677 sec thread_cpu_clock, 0.1 % thread_cpu_util, 2.248 MB peak_rss [laps: 1]
|
||||
[omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157/thread/1 : 0.608237 sec wall_clock, 0.603553 sec thread_cpu_clock, 99.2 % thread_cpu_util, 2.204 MB peak_rss [laps: 1]
|
||||
[omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157/thread/2 : 0.601430 sec wall_clock, 0.598378 sec thread_cpu_clock, 99.5 % thread_cpu_util, 1.156 MB peak_rss [laps: 1]
|
||||
[omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157/thread/3 : 0.570223 sec wall_clock, 0.568713 sec thread_cpu_clock, 99.7 % thread_cpu_util, 0.772 MB peak_rss [laps: 1]
|
||||
[omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157/thread/4 : 0.557637 sec wall_clock, 0.557198 sec thread_cpu_clock, 99.9 % thread_cpu_util, 0.156 MB peak_rss [laps: 1]
|
||||
[omnitrace][107157][0][omnitrace_finalize]
|
||||
[omnitrace][107157][0][omnitrace_finalize] Finalizing perfetto...
|
||||
[omnitrace][107157][perfetto]> Outputting '/home/user/data/omnitrace-output/2022-10-19_02.46/parallel-overhead-locksperfetto-trace-107157.proto' (842.90 KB / 0.84 MB / 0.00 GB)... Done
|
||||
[omnitrace][107157][trip_count]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockstrip_count-107157.json'
|
||||
[omnitrace][107157][trip_count]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockstrip_count-107157.txt'
|
||||
[omnitrace][107157][sampling_percent]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_percent-107157.json'
|
||||
[omnitrace][107157][sampling_percent]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_percent-107157.txt'
|
||||
[omnitrace][107157][sampling_cpu_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_cpu_clock-107157.json'
|
||||
[omnitrace][107157][sampling_cpu_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_cpu_clock-107157.txt'
|
||||
[omnitrace][107157][sampling_wall_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_wall_clock-107157.json'
|
||||
[omnitrace][107157][sampling_wall_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_wall_clock-107157.txt'
|
||||
[omnitrace][107157][wall_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockswall_clock-107157.json'
|
||||
[omnitrace][107157][wall_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockswall_clock-107157.txt'
|
||||
[omnitrace][107157][metadata]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-locksmetadata-107157.json' and 'omnitrace-output/2022-10-19_02.46/parallel-overhead-locksfunctions-107157.json'
|
||||
[omnitrace][107157][0][omnitrace_finalize] Finalized
|
||||
[761.584] perfetto.cc:57382 Tracing session 1 ended, total sessions:0
|
||||
```
|
||||
@@ -1,49 +0,0 @@
|
||||
# Setup and Validation
|
||||
|
||||
```eval_rst
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 3
|
||||
```
|
||||
|
||||
## Configuring Environment
|
||||
|
||||
Once omnitrace is installed, source the `setup-env.sh` script to prepend the installation directories to `PATH`, `LD_LIBRARY_PATH`, and related environment variables:
|
||||
|
||||
```bash
|
||||
source /opt/omnitrace/share/omnitrace/setup-env.sh
|
||||
```
|
||||
|
||||
Alternatively, if environment modules are supported, add the `<prefix>/share/modulefiles` directory to `MODULEPATH`:
|
||||
|
||||
```bash
|
||||
module use /opt/omnitrace/share/modulefiles
|
||||
```
|
||||
|
||||
> ***Alternatively, the above line can be added to the `${HOME}/.modulerc` file.***
|
||||
|
||||
Once the omnitrace modulefiles directory is in `MODULEPATH`, omnitrace can be loaded via `module load omnitrace/<VERSION>` and unloaded via `module unload omnitrace/<VERSION>`, e.g.:
|
||||
|
||||
```bash
|
||||
module load omnitrace/1.0.0
|
||||
module unload omnitrace/1.0.0
|
||||
```
|
||||
|
||||
> ***You may need to also add the path to the ROCm libraries to `LD_LIBRARY_PATH`, e.g. `export LD_LIBRARY_PATH=/opt/rocm/lib:${LD_LIBRARY_PATH}`***
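
The note above can be applied repeatedly (for example, from a shell startup file) without growing the variable each time. The following is a minimal sketch, assuming a POSIX-compatible shell; `prepend_path` is a hypothetical helper written for illustration and is not shipped with omnitrace:

```bash
# Hypothetical helper (not part of omnitrace): prepend a directory to a
# colon-separated path variable only if it is not already listed.
prepend_path() {
    _pp_var=$1
    _pp_dir=$2
    # Read the current value of the variable whose name is in _pp_var.
    eval "_pp_cur=\${$_pp_var:-}"
    case ":${_pp_cur}:" in
        *":${_pp_dir}:"*) ;;  # already present; leave the variable unchanged
        *) eval "export $_pp_var=\"\${_pp_dir}\${_pp_cur:+:\${_pp_cur}}\"" ;;
    esac
}

# Example from the note above: make the ROCm libraries visible.
prepend_path LD_LIBRARY_PATH /opt/rocm/lib
```

Calling `prepend_path` a second time with the same directory leaves the variable untouched, which keeps repeated sourcing harmless.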
|
||||
|
||||
## Validating Environment Configuration
|
||||
|
||||
If all the following commands execute successfully with output, then you are ready to use omnitrace:
|
||||
|
||||
```bash
|
||||
which omnitrace
|
||||
which omnitrace-avail
|
||||
which omnitrace-sample
|
||||
omnitrace-instrument --help
|
||||
omnitrace-avail --all
|
||||
omnitrace-sample --help
|
||||
|
||||
# if built with python support
|
||||
which omnitrace-python
|
||||
omnitrace-python --help
|
||||
```
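
The `which` checks above can also be run in one pass. This is a sketch under stated assumptions, not an omnitrace utility: `check_tools` is a hypothetical function that only tests whether each name resolves on `PATH`, and it does not run the `--help` or `--all` commands.

```bash
# Hypothetical helper: report which of the given commands are on PATH.
# Returns 0 only when every command is found.
check_tools() {
    _ct_missing=0
    for _ct_tool in "$@"; do
        if command -v "$_ct_tool" > /dev/null 2>&1; then
            printf 'found:   %s\n' "$_ct_tool"
        else
            printf 'missing: %s\n' "$_ct_tool"
            _ct_missing=1
        fi
    done
    return "$_ct_missing"
}

# Usage, with the tool names taken from the commands above:
#   check_tools omnitrace omnitrace-avail omnitrace-sample omnitrace-instrument
```

A non-zero return status indicates that at least one tool was not found, which usually means the environment script or module has not been loaded in the current shell.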
|
||||
@@ -1,36 +0,0 @@
|
||||
#!/bin/bash -e
|
||||
|
||||
message()
|
||||
{
|
||||
echo -e "\n\n##### ${@}... #####\n"
|
||||
}
|
||||
|
||||
WORK_DIR=$(cd $(dirname ${BASH_SOURCE[0]}) && pwd)
|
||||
SOURCE_DIR=$(cd ${WORK_DIR}/../.. &> /dev/null && pwd)
|
||||
|
||||
message "Working directory is ${WORK_DIR}"
|
||||
message "Source directory is ${SOURCE_DIR}"
|
||||
|
||||
message "Changing directory to ${WORK_DIR}"
|
||||
cd ${WORK_DIR}
|
||||
|
||||
message "Generating omnitrace.dox"
|
||||
cmake -DSOURCE_DIR=${SOURCE_DIR} -P ${WORK_DIR}/generate-doxyfile.cmake
|
||||
|
||||
message "Generating doxygen xml files"
|
||||
doxygen omnitrace.dox
|
||||
doxygen omnitrace.dox
|
||||
|
||||
message "Building html documentation"
|
||||
make html SPHINXOPTS="-W --keep-going -n"
|
||||
|
||||
if [ -d ${SOURCE_DIR}/docs ]; then
|
||||
message "Removing stale documentation in ${SOURCE_DIR}/docs/"
|
||||
rm -rf ${SOURCE_DIR}/docs/*
|
||||
|
||||
message "Adding nojekyll to docs/"
|
||||
cp -r ${WORK_DIR}/.nojekyll ${SOURCE_DIR}/docs/.nojekyll
|
||||
|
||||
message "Copying source/docs/_build/html/* to docs/"
|
||||
cp -r ${WORK_DIR}/_build/html/* ${SOURCE_DIR}/docs/
|
||||
fi
|
||||
@@ -1,9 +0,0 @@
|
||||
#!/bin/bash -e
|
||||
|
||||
WORK_DIR=$(dirname ${BASH_SOURCE[0]})
|
||||
|
||||
SOURCE_DIR=$(cd ${WORK_DIR}/../.. &> /dev/null && pwd)
|
||||
|
||||
cmake -DSOURCE_DIR=${SOURCE_DIR} -P generate-doxyfile.cmake
|
||||
|
||||
doxygen omnitrace.dox
|
||||
@@ -1,270 +0,0 @@
|
||||
# User API
|
||||
|
||||
```eval_rst
|
||||
.. doxygenfile:: omnitrace/types.h
|
||||
.. doxygenfile:: omnitrace/categories.h
|
||||
.. doxygenfile:: omnitrace/user.h
|
||||
.. doxygenfile:: omnitrace/causal.h
|
||||
```
|
||||
|
||||
By default, when omnitrace detects any `omnitrace_user_start_*` or `omnitrace_user_stop_*` function, instrumentation
is disabled at start-up, so `omnitrace_user_stop_trace()` is not required at the beginning of main. This
behavior can be manually controlled via the `OMNITRACE_INIT_ENABLED` environment variable. User-defined regions are always
recorded, regardless of whether `omnitrace_user_start_*` or `omnitrace_user_stop_*` has been called.
|
||||
|
||||
## Example
|
||||
|
||||
### Compilation
|
||||
|
||||
#### CMake
|
||||
|
||||
```cmake
|
||||
find_package(omnitrace REQUIRED COMPONENTS user)
|
||||
|
||||
add_executable(foo foo.cpp)
|
||||
|
||||
target_link_libraries(foo PRIVATE omnitrace::omnitrace-user-library)
|
||||
```
|
||||
|
||||
#### General
|
||||
|
||||
Assuming omnitrace is installed in `/opt/omnitrace`:
|
||||
|
||||
```bash
|
||||
g++ -I/opt/omnitrace/include foo.cpp -o foo -L/opt/omnitrace/lib -lomnitrace-user
|
||||
```
|
||||
|
||||
### User API Implementation
|
||||
|
||||
```cpp
|
||||
#include <omnitrace/categories.h>
|
||||
#include <omnitrace/types.h>
|
||||
#include <omnitrace/user.h>
|
||||
|
||||
#include <algorithm>
#include <atomic>
|
||||
#include <cassert>
|
||||
#include <cerrno>
|
||||
#include <cstdio>
|
||||
#include <cstdlib>
|
||||
#include <cstring>
|
||||
#include <sstream>
#include <string>
|
||||
#include <thread>
|
||||
#include <vector>
|
||||
|
||||
std::atomic<long> total{ 0 };
|
||||
|
||||
long
|
||||
fib(long n) __attribute__((noinline));
|
||||
|
||||
void
|
||||
run(size_t nitr, long) __attribute__((noinline));
|
||||
|
||||
int
|
||||
custom_push_region(const char* name);
|
||||
|
||||
namespace
|
||||
{
|
||||
omnitrace_user_callbacks_t custom_callbacks = OMNITRACE_USER_CALLBACKS_INIT;
|
||||
omnitrace_user_callbacks_t original_callbacks = OMNITRACE_USER_CALLBACKS_INIT;
|
||||
} // namespace
|
||||
|
||||
int
|
||||
main(int argc, char** argv)
|
||||
{
|
||||
custom_callbacks.push_region = &custom_push_region;
|
||||
omnitrace_user_configure(OMNITRACE_USER_UNION_CONFIG, custom_callbacks,
|
||||
&original_callbacks);
|
||||
|
||||
omnitrace_user_push_region(argv[0]);
|
||||
omnitrace_user_push_region("initialization");
|
||||
size_t nthread = std::min<size_t>(16, std::thread::hardware_concurrency());
|
||||
size_t nitr = 50000;
|
||||
long nfib = 10;
|
||||
if(argc > 1) nfib = atol(argv[1]);
|
||||
if(argc > 2) nthread = atol(argv[2]);
|
||||
if(argc > 3) nitr = atol(argv[3]);
|
||||
omnitrace_user_pop_region("initialization");
|
||||
|
||||
printf("[%s] Threads: %zu\n[%s] Iterations: %zu\n[%s] fibonacci(%li)...\n", argv[0],
|
||||
nthread, argv[0], nitr, argv[0], nfib);
|
||||
|
||||
omnitrace_user_push_region("thread_creation");
|
||||
std::vector<std::thread> threads{};
|
||||
threads.reserve(nthread);
|
||||
// disable instrumentation for child threads
|
||||
omnitrace_user_stop_thread_trace();
|
||||
for(size_t i = 0; i < nthread; ++i)
|
||||
{
|
||||
threads.emplace_back(&run, nitr, nfib);
|
||||
}
|
||||
// re-enable instrumentation
|
||||
omnitrace_user_start_thread_trace();
|
||||
omnitrace_user_pop_region("thread_creation");
|
||||
|
||||
omnitrace_user_push_region("thread_wait");
|
||||
for(auto& itr : threads)
|
||||
itr.join();
|
||||
omnitrace_user_pop_region("thread_wait");
|
||||
|
||||
run(nitr, nfib);
|
||||
|
||||
printf("[%s] fibonacci(%li) x %zu = %li\n", argv[0], nfib, nthread, total.load());
|
||||
omnitrace_user_pop_region(argv[0]);
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
long
|
||||
fib(long n)
|
||||
{
|
||||
return (n < 2) ? n : fib(n - 1) + fib(n - 2);
|
||||
}
|
||||
|
||||
#define RUN_LABEL \
|
||||
std::string{ std::string{ __FUNCTION__ } + "(" + std::to_string(n) + ") x " + \
|
||||
std::to_string(nitr) } \
|
||||
.c_str()
|
||||
|
||||
void
|
||||
run(size_t nitr, long n)
|
||||
{
|
||||
omnitrace_user_push_region(RUN_LABEL);
|
||||
long local = 0;
|
||||
for(size_t i = 0; i < nitr; ++i)
|
||||
local += fib(n);
|
||||
total += local;
|
||||
omnitrace_user_pop_region(RUN_LABEL);
|
||||
}
|
||||
|
||||
int
|
||||
custom_push_region(const char* name)
|
||||
{
|
||||
if(!original_callbacks.push_region || !original_callbacks.push_annotated_region)
|
||||
return OMNITRACE_USER_ERROR_NO_BINDING;
|
||||
|
||||
printf("Pushing custom region :: %s\n", name);
|
||||
|
||||
if(original_callbacks.push_annotated_region)
|
||||
{
|
||||
int32_t _err = errno;
|
||||
char* _msg = nullptr;
|
||||
char _buff[1024];
|
||||
if(_err != 0) _msg = strerror_r(_err, _buff, sizeof(_buff));
|
||||
|
||||
omnitrace_annotation_t _annotations[] = {
|
||||
{ "errno", OMNITRACE_INT32, &_err }, { "strerror", OMNITRACE_STRING, _msg }
|
||||
};
|
||||
|
||||
errno = 0; // reset errno
|
||||
return (*original_callbacks.push_annotated_region)(
|
||||
name, _annotations, sizeof(_annotations) / sizeof(omnitrace_annotation_t));
|
||||
}
|
||||
|
||||
return (*original_callbacks.push_region)(name);
|
||||
}
|
||||
```
|
||||
|
||||
### User API Output
|
||||
|
||||
```console
|
||||
$ omnitrace-instrument -l --min-instructions=8 -E custom_push_region -o -- ./user-api
|
||||
...
|
||||
$ omnitrace-run --profile --use-pid off --time-output off -- ./user-api.inst 20 4 100
|
||||
Pushing custom region :: ./user-api.inst
|
||||
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Trace
|
||||
|
||||
|
||||
______ .___ ___. .__ __. __ .___________..______ ___ ______ _______
|
||||
/ __ \ | \/ | | \ | | | | | || _ \ / \ / || ____|
|
||||
| | | | | \ / | | \| | | | `---| |----`| |_) | / ^ \ | ,----'| |__
|
||||
| | | | | |\/| | | . ` | | | | | | / / /_\ \ | | | __|
|
||||
| `--' | | | | | | |\ | | | | | | |\ \----./ _____ \ | `----.| |____
|
||||
\______/ |__| |__| |__| \__| |__| |__| | _| `._____/__/ \__\ \______||_______|
|
||||
|
||||
|
||||
|
||||
Pushing custom region :: initialization
|
||||
[./user-api.inst] Threads: 4
|
||||
[./user-api.inst] Iterations: 100
|
||||
[./user-api.inst] fibonacci(20)...
|
||||
Pushing custom region :: thread_creation
|
||||
Pushing custom region :: thread_wait
|
||||
Pushing custom region :: run(20) x 100
|
||||
Pushing custom region :: run(20) x 100
|
||||
Pushing custom region :: run(20) x 100
|
||||
Pushing custom region :: run(20) x 100
|
||||
Pushing custom region :: run(20) x 100
|
||||
[./user-api.inst] fibonacci(20) x 4 = 3382500
|
||||
[omnitrace][86267][0][omnitrace_finalize] finalizing...
|
||||
|
||||
|
||||
[omnitrace][86267][0] omnitrace : 5.190895 sec wall_clock, 2.748 mb peak_rss, 6.330000 sec cpu_clock, 121.9 % cpu_util [laps: 1]
|
||||
[omnitrace][86267][0] user-api.inst/thread-0 : 5.078713 sec wall_clock, 4.722415 sec thread_cpu_clock, 93.0 % thread_cpu_util, 1.276 mb peak_rss [laps: 1]
|
||||
[omnitrace][86267][0] user-api.inst/thread-1 : 0.322248 sec wall_clock, 0.322191 sec thread_cpu_clock, 100.0 % thread_cpu_util, 1.000 mb peak_rss [laps: 1]
|
||||
[omnitrace][86267][0] user-api.inst/thread-2 : 0.323255 sec wall_clock, 0.323194 sec thread_cpu_clock, 100.0 % thread_cpu_util, 0.000 mb peak_rss [laps: 1]
|
||||
[omnitrace][86267][0] user-api.inst/thread-3 : 0.323569 sec wall_clock, 0.323484 sec thread_cpu_clock, 100.0 % thread_cpu_util, 1.092 mb peak_rss [laps: 1]
|
||||
[omnitrace][86267][0] user-api.inst/thread-4 : 0.324178 sec wall_clock, 0.324057 sec thread_cpu_clock, 100.0 % thread_cpu_util, 1.184 mb peak_rss [laps: 1]
|
||||
[omnitrace][86267][0] Post-processing 51 cpu frequency and memory usage entries...
|
||||
|
||||
[omnitrace][wall_clock]|0> Outputting 'omnitrace-user-api.inst-output/wall_clock.json'...
|
||||
[omnitrace][wall_clock]|0> Outputting 'omnitrace-user-api.inst-output/wall_clock.tree.json'...
|
||||
[omnitrace][wall_clock]|0> Outputting 'omnitrace-user-api.inst-output/wall_clock.txt'...
|
||||
|
||||
[omnitrace][manager::finalize][metadata]> Outputting 'omnitrace-user-api.inst-output/metadata.json' and 'omnitrace-user-api.inst-output/functions.json'...
|
||||
[omnitrace][86267][0][omnitrace_finalize] Finalized
|
||||
$ cat omnitrace-user-api.inst-output/wall_clock.txt
|
||||
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| REAL-CLOCK TIMER (I.E. WALL-CLOCK TIMER) |
|
||||
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| LABEL | COUNT | DEPTH | METRIC | UNITS | SUM | MEAN | MIN | MAX | VAR | STDDEV | % SELF |
|
||||
|---------------------------------------------------------------------------------|--------|--------|------------|--------|----------|----------|----------|----------|----------|----------|--------|
|
||||
| |0>>> ./user-api.inst | 1 | 0 | wall_clock | sec | 5.078521 | 5.078521 | 5.078521 | 5.078521 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |0>>> |_initialization | 1 | 1 | wall_clock | sec | 0.000004 | 0.000004 | 0.000004 | 0.000004 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_thread_creation | 1 | 1 | wall_clock | sec | 0.000159 | 0.000159 | 0.000159 | 0.000159 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_thread_wait | 1 | 1 | wall_clock | sec | 0.355307 | 0.355307 | 0.355307 | 0.355307 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |0>>> |_std::vector<std::thread, std::allocator<std::thread> >::begin | 1 | 2 | wall_clock | sec | 0.000001 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_std::vector<std::thread, std::allocator<std::thread> >::end | 1 | 2 | wall_clock | sec | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_pthread_join | 4 | 2 | wall_clock | sec | 0.355257 | 0.088814 | 0.000001 | 0.333144 | 0.026559 | 0.162970 | 100.0 |
|
||||
| |2>>> |_start_thread | 1 | 3 | wall_clock | sec | 0.000032 | 0.000032 | 0.000032 | 0.000032 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |1>>> |_start_thread | 1 | 3 | wall_clock | sec | 0.000036 | 0.000036 | 0.000036 | 0.000036 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |3>>> |_start_thread | 1 | 3 | wall_clock | sec | 0.000034 | 0.000034 | 0.000034 | 0.000034 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |4>>> |_start_thread | 1 | 3 | wall_clock | sec | 0.000039 | 0.000039 | 0.000039 | 0.000039 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_run | 1 | 1 | wall_clock | sec | 4.722993 | 4.722993 | 4.722993 | 4.722993 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |0>>> |_std::char_traits<char>::length | 1 | 2 | wall_clock | sec | 0.000001 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_std::distance<char const*> | 1 | 2 | wall_clock | sec | 0.000001 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_std::operator+<char, std::char_traits<char>, std::allocator<char> > | 2 | 2 | wall_clock | sec | 0.000002 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_run(20) x 100 | 1 | 2 | wall_clock | sec | 4.722951 | 4.722951 | 4.722951 | 4.722951 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |0>>> |_run [{94,25}-{96,25}] | 1 | 3 | wall_clock | sec | 4.722925 | 4.722925 | 4.722925 | 4.722925 | 0.000000 | 0.000000 | 0.0 |
|
||||
| |0>>> |_fib | 100 | 4 | wall_clock | sec | 4.722718 | 0.047227 | 0.046713 | 0.051987 | 0.000000 | 0.000625 | 0.0 |
|
||||
| |0>>> |_fib | 200 | 5 | wall_clock | sec | 4.722302 | 0.023612 | 0.017827 | 0.034091 | 0.000032 | 0.005627 | 0.0 |
|
||||
| |0>>> |_fib | 400 | 6 | wall_clock | sec | 4.721485 | 0.011804 | 0.006790 | 0.023003 | 0.000016 | 0.004024 | 0.0 |
|
||||
| |0>>> |_fib | 800 | 7 | wall_clock | sec | 4.719858 | 0.005900 | 0.002564 | 0.016078 | 0.000006 | 0.002498 | 0.1 |
|
||||
| |0>>> |_fib | 1600 | 8 | wall_clock | sec | 4.716572 | 0.002948 | 0.000977 | 0.011849 | 0.000002 | 0.001465 | 0.1 |
|
||||
| |0>>> |_fib | 3200 | 9 | wall_clock | sec | 4.709918 | 0.001472 | 0.000371 | 0.008246 | 0.000001 | 0.000831 | 0.3 |
|
||||
| |0>>> |_fib | 6400 | 10 | wall_clock | sec | 4.696775 | 0.000734 | 0.000140 | 0.005111 | 0.000000 | 0.000461 | 0.6 |
|
||||
| |0>>> |_fib | 12800 | 11 | wall_clock | sec | 4.670093 | 0.000365 | 0.000050 | 0.003166 | 0.000000 | 0.000253 | 1.1 |
|
||||
| |0>>> |_fib | 25600 | 12 | wall_clock | sec | 4.617496 | 0.000180 | 0.000017 | 0.001959 | 0.000000 | 0.000137 | 2.3 |
|
||||
| |0>>> |_fib | 51200 | 13 | wall_clock | sec | 4.512671 | 0.000088 | 0.000004 | 0.001212 | 0.000000 | 0.000074 | 4.6 |
|
||||
| |0>>> |_fib | 102400 | 14 | wall_clock | sec | 4.304142 | 0.000042 | 0.000000 | 0.000752 | 0.000000 | 0.000039 | 9.6 |
|
||||
| |0>>> |_fib | 202600 | 15 | wall_clock | sec | 3.892580 | 0.000019 | 0.000000 | 0.000469 | 0.000000 | 0.000021 | 19.0 |
|
||||
| |0>>> |_fib | 363200 | 16 | wall_clock | sec | 3.151143 | 0.000009 | 0.000000 | 0.000293 | 0.000000 | 0.000011 | 33.2 |
|
||||
| |0>>> |_fib | 502000 | 17 | wall_clock | sec | 2.105217 | 0.000004 | 0.000000 | 0.000183 | 0.000000 | 0.000006 | 49.1 |
|
||||
| |0>>> |_fib | 476000 | 18 | wall_clock | sec | 1.071652 | 0.000002 | 0.000000 | 0.000114 | 0.000000 | 0.000004 | 63.6 |
|
||||
| |0>>> |_fib | 294200 | 19 | wall_clock | sec | 0.390193 | 0.000001 | 0.000000 | 0.000071 | 0.000000 | 0.000003 | 75.3 |
|
||||
| |0>>> |_fib | 115200 | 20 | wall_clock | sec | 0.096190 | 0.000001 | 0.000000 | 0.000043 | 0.000000 | 0.000002 | 84.4 |
|
||||
| |0>>> |_fib | 27400 | 21 | wall_clock | sec | 0.015020 | 0.000001 | 0.000000 | 0.000025 | 0.000000 | 0.000001 | 91.1 |
|
||||
| |0>>> |_fib | 3600 | 22 | wall_clock | sec | 0.001336 | 0.000000 | 0.000000 | 0.000013 | 0.000000 | 0.000001 | 96.3 |
|
||||
| |0>>> |_fib | 200 | 23 | wall_clock | sec | 0.000050 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_std::char_traits<char>::length | 1 | 3 | wall_clock | sec | 0.000001 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_std::distance<char const*> | 1 | 3 | wall_clock | sec | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_std::operator+<char, std::char_traits<char>, std::allocator<char> > | 2 | 3 | wall_clock | sec | 0.000001 | 0.000001 | 0.000000 | 0.000001 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_std::operator& | 1 | 1 | wall_clock | sec | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> std::vector<std::thread, std::allocator<std::thread> >::~vector | 1 | 0 | wall_clock | sec | 0.000045 | 0.000045 | 0.000045 | 0.000045 | 0.000000 | 0.000000 | 32.7 |
|
||||
| |0>>> |_std::thread::~thread | 4 | 1 | wall_clock | sec | 0.000030 | 0.000007 | 0.000007 | 0.000009 | 0.000000 | 0.000001 | 31.2 |
|
||||
| |0>>> |_std::thread::joinable | 4 | 2 | wall_clock | sec | 0.000021 | 0.000005 | 0.000005 | 0.000006 | 0.000000 | 0.000001 | 89.4 |
|
||||
| |0>>> |_std::thread::id::id | 4 | 3 | wall_clock | sec | 0.000001 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_std::operator== | 4 | 3 | wall_clock | sec | 0.000001 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_std::allocator_traits<std::allocator<std::thread> >::deallocate | 1 | 1 | wall_clock | sec | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
| |0>>> |_std::allocator<std::thread>::~allocator | 1 | 1 | wall_clock | sec | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 100.0 |
|
||||
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
```
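
For a long call tree such as the one above, it can help to filter the text report down to the rows where time is actually spent. The following is a rough sketch, assuming the `wall_clock.txt` layout shown above (eleven data columns after the label); `self_time_rows` is a hypothetical helper and is not part of omnitrace. Columns are counted from the right because labels themselves contain `|` characters.

```bash
# Hypothetical helper: list wall_clock rows whose "% SELF" value is at
# least a threshold. Usage: self_time_rows <min-percent> <wall_clock.txt>
self_time_rows() {
    _st_threshold=$1
    _st_file=$2
    awk -F'|' -v min="$_st_threshold" '
        # Data rows have the METRIC column nine fields before the last one.
        NF > 12 && $(NF-9) ~ /wall_clock/ && ($(NF-1) + 0) >= (min + 0) {
            # Rebuild the label from every field left of the COUNT column.
            label = $2
            for (i = 3; i <= NF - 12; i++) label = label "|" $i
            gsub(/^ +| +$/, "", label)
            printf "%s\tcount=%d\tself=%.1f%%\n", label, $(NF-11), $(NF-1)
        }
    ' "$_st_file"
}

# Usage, with the output path from the run above:
#   self_time_rows 50 omnitrace-user-api.inst-output/wall_clock.txt
```

The JSON outputs written alongside the text report are likely a better source for anything beyond a quick look, since they do not depend on column positions.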
|
||||
@@ -1,23 +0,0 @@
|
||||
# YouTube Tutorials
|
||||
|
||||
```eval_rst
|
||||
.. toctree::
|
||||
:glob:
|
||||
:maxdepth: 3
|
||||
```
|
||||
|
||||
## Installing a binary release
|
||||
|
||||
<p align="center"><iframe width="560" height="315" src="https://www.youtube.com/embed/gKtNCKf1IXA?modestbranding=1" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>
|
||||
|
||||
## Instrumenting a binary
|
||||
|
||||
<p align="center"><iframe width="560" height="315" src="https://www.youtube.com/embed/2B0gRr3FygQ?modestbranding=1" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>
|
||||
|
||||
## Writing an omnitrace configuration file
|
||||
|
||||
<p align="center"><iframe width="560" height="315" src="https://www.youtube.com/embed/oG_fPYx9_gs?modestbranding=1" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>
|
||||
|
||||
## Visualization and Features of Perfetto Traces
|
||||
|
||||
<p align="center"><iframe width="560" height="315" src="https://www.youtube.com/embed/7WN3N1hnCbI?modestbranding=1" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>
|
||||