Omnitrace sample documentation (#179)

* Documentation for omnitrace-sample
* Improve omnitrace-sample
  - improve the printing of the env updates
  - remove env settings when something is deactivated
  - restore env settings when something is deactivated

[ROCm/rocprofiler-systems commit: 67f7471253]
2022-10-19 03:30:00 -05:00
@@ -85,9 +85,43 @@ such as the memory usage, page-faults, and context-switches, and thread-level me
 ## Documentation

 The full documentation for [omnitrace](https://github.com/AMDResearch/omnitrace) is available at [amdresearch.github.io/omnitrace](https://amdresearch.github.io/omnitrace/).
+See the [Getting Started documentation](https://amdresearch.github.io/omnitrace/getting_started) for general tips and a detailed discussion about sampling vs. binary instrumentation.

 ## Quick Start

+### Installation
+
+- Visit the [Releases](https://github.com/AMDResearch/omnitrace/releases) page
+- Select the appropriate installer (recommendation: the `.sh` scripts do not require super-user privileges, unlike the DEB/RPM installers)
+  - If targeting a ROCm application, find the installer script with the matching ROCm version
+  - If you are unsure about your Linux distro, check `/etc/os-release`
+  - If no installer script matches your target OS, try one of the Ubuntu 18.04 `*.sh` installers
+    - This installation may be built against older library versions supported on your distro via backwards compatibility
+
+### Setup
+
+> NOTE: Replace `/opt/omnitrace` below with the installation prefix as necessary.
+
+- Option 1: Source `setup-env.sh` script
+
+```bash
+source /opt/omnitrace/share/omnitrace/setup-env.sh
+```
+
+- Option 2: Load modulefile
+
+```bash
+module use /opt/omnitrace/share/modulefiles
+module load omnitrace
+```
+
+- Option 3: Manual
+
+```bash
+export PATH=/opt/omnitrace/bin:${PATH}
+export LD_LIBRARY_PATH=/opt/omnitrace/lib:${LD_LIBRARY_PATH}
+```
+
 ### Omnitrace Settings

 Generate an omnitrace configuration file using `omnitrace-avail -G omnitrace.cfg`. Optionally, use `omnitrace-avail -G omnitrace.cfg --all` for
@@ -111,9 +145,23 @@ Once the configuration file is adjusted to your preferences, either export the p
 or place this file in `${HOME}/.omnitrace.cfg` to ensure these values are always read as the default. If you wish to change any of these settings,
 you can override them via environment variables or by specifying an alternative `OMNITRACE_CONFIG_FILE`.

-### Omnitrace Executable
+### Call-Stack Sampling

-The `omnitrace` executable is used to instrument an existing binary.
+The `omnitrace-sample` executable is used to execute call-stack sampling on a target application without binary instrumentation.
+Use a double-hyphen (`--`) to separate the command-line arguments for `omnitrace-sample` from the target application and its arguments.
+
+```shell
+omnitrace-sample --help
+omnitrace-sample <omnitrace-options> -- <exe> <exe-options>
+omnitrace-sample -f 1000 -- ls -la
+```
+
+### Binary Instrumentation
+
+The `omnitrace` executable is used to instrument an existing binary. Call-stack sampling can be enabled alongside
+the execution of an instrumented binary, to help "fill in the gaps" between the instrumentation, by setting the `OMNITRACE_USE_SAMPLING`
+configuration variable to `ON`.
+Similar to `omnitrace-sample`, use a double-hyphen (`--`) to separate the command-line arguments for `omnitrace` from the target application and its arguments.

 ```shell
 omnitrace --help
@@ -183,9 +231,57 @@ omnitrace -ME '^(libhsa-runtime64|libz\\.so)' -- /path/to/app
 omnitrace -E 'rocr::atomic|rocr::core|rocr::HSA' --  /path/to/app
 ```

-### Visualizing Perfetto Results
+### Python Profiling and Tracing

-Visit [ui.perfetto.dev](https://ui.perfetto.dev) in your browser and open up the `.proto` file(s) created by omnitrace.
+Use the `omnitrace-python` script to profile/trace Python interpreter function calls.
+Use a double-hyphen (`--`) to separate the command-line arguments for `omnitrace-python` from the target script and its arguments.
+
+```shell
+omnitrace-python --help
+omnitrace-python <omnitrace-options> -- <python-script> <script-args>
+omnitrace-python -- ./script.py
+```
+
+Please note, the first argument after the double-hyphen *must be a Python script*, e.g. `omnitrace-python -- ./script.py`.
+
+If you need to select a specific Python interpreter version, use `omnitrace-python-X.Y` where `X.Y` is the Python
+major and minor version:
+
+```shell
+omnitrace-python-3.8 -- ./script.py
+```
+
+If you need to specify the full path to a Python interpreter, set the `PYTHON_EXECUTABLE` environment variable:
+
+```shell
+PYTHON_EXECUTABLE=/opt/conda/bin/python omnitrace-python -- ./script.py
+```
+
+If you want to restrict the data collection to specific function(s) and their callees, pass the `-b` / `--builtin` option after decorating the
+function(s) with `@profile`. Use the `@noprofile` decorator to exclude/ignore function(s) and their callees:
+
+```python
+def foo():
+    pass
+
+@noprofile
+def bar():
+    foo()
+
+@profile
+def spam():
+    foo()
+    bar()
+```
+
+Each time `spam` is called during profiling, the profiling results will include 1 entry for `spam` and 1 entry
+for `foo` via the direct call within `spam`. There will be no entries for `bar` or the `foo` invocation within it.
+
+### Trace Visualization
+
+- Visit [ui.perfetto.dev](https://ui.perfetto.dev) in your web browser
+- Select "Open trace file" from the panel on the left
+- Locate the omnitrace perfetto output (extension: `.proto`)

 ![omnitrace-perfetto](source/docs/images/omnitrace-perfetto.png)
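
As an illustration of the `@profile` / `@noprofile` rules described in the README hunk above, here is a toy model built on Python's standard `sys.setprofile` hook. It is only a sketch of the described inclusion rules, not how `omnitrace-python` is implemented; the `profile` and `noprofile` decorators, `counts`, and `_enabled` below are local stand-ins defined for this example.

```python
import sys
from collections import Counter

counts = Counter()   # function name -> number of recorded entries
_enabled = [False]   # stack; the top says whether calls are currently recorded


def _scoped(flag):
    """Build a decorator that switches recording on/off inside a function."""
    def decorator(func):
        def _wrapper(*args, **kwargs):
            _enabled.append(flag)
            try:
                return func(*args, **kwargs)
            finally:
                _enabled.pop()
        return _wrapper
    return decorator


profile = _scoped(True)     # stand-in for @profile
noprofile = _scoped(False)  # stand-in for @noprofile


def _on_event(frame, event, arg):
    # Count Python-level calls only while recording is on; skip the wrappers.
    if event == "call" and _enabled[-1] and frame.f_code.co_name != "_wrapper":
        counts[frame.f_code.co_name] += 1


def foo():
    pass


@noprofile
def bar():
    foo()


@profile
def spam():
    foo()
    bar()


def run_once():
    """Call spam() once with the toy hook installed, then remove the hook."""
    sys.setprofile(_on_event)
    try:
        spam()
    finally:
        sys.setprofile(None)


run_once()
print(dict(counts))
```

In this model a single call to `spam` records one entry for `spam` and one for `foo`, and nothing for `bar` or the `foo` call inside it, which matches the behavior the README describes.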

@@ -52,14 +52,17 @@
 #endif

 namespace color = tim::log::color;
-using tim::log::stream;
 using namespace timemory::join;
 using tim::get_env;
+using tim::log::colorized;
+using tim::log::stream;

 namespace
 {
-int verbose = 0;
-}
+int  verbose       = 0;
+auto updated_envs  = std::set<std::string_view>{};
+auto original_envs = std::set<std::string>{};
+}  // namespace

 std::string
 get_command(const char* _argv0)
@@ -92,7 +95,11 @@ get_initial_environment()
    {
        int idx = 0;
        while(environ[idx] != nullptr)
-            _env.emplace_back(strdup(environ[idx++]));
+        {
+            auto* _v = environ[idx++];
+            original_envs.emplace(_v);
+            _env.emplace_back(strdup(_v));
+        }
    }

    update_env(_env, "LD_PRELOAD",
@@ -106,22 +113,25 @@ get_initial_environment()
    update_env(_env, "OMNITRACE_USE_SAMPLING", true);
    update_env(_env, "OMNITRACE_CRITICAL_TRACE", false);
    update_env(_env, "OMNITRACE_USE_PROCESS_SAMPLING", false);
+
    // update_env(_env, "OMNITRACE_USE_PID", false);
    // update_env(_env, "OMNITRACE_TIME_OUTPUT", false);
    // update_env(_env, "OMNITRACE_OUTPUT_PATH", "omnitrace-output/%tag%/%launch_time%");

 #if defined(OMNITRACE_USE_ROCTRACER) || defined(OMNITRACE_USE_ROCPROFILER)
    update_env(_env, "HSA_TOOLS_LIB", _dl_libpath);
-    update_env(_env, "HSA_TOOLS_REPORT_LOAD_FAILURE", "1");
+    if(!getenv("HSA_TOOLS_REPORT_LOAD_FAILURE"))
+        update_env(_env, "HSA_TOOLS_REPORT_LOAD_FAILURE", "1");
 #endif

 #if defined(OMNITRACE_USE_ROCPROFILER)
    update_env(_env, "ROCP_TOOL_LIB", _omni_libpath);
-    update_env(_env, "ROCP_HSA_INTERCEPT", "1");
+    if(!getenv("ROCP_HSA_INTERCEPT")) update_env(_env, "ROCP_HSA_INTERCEPT", "1");
 #endif

 #if defined(OMNITRACE_USE_OMPT)
-    update_env(_env, "OMP_TOOL_LIBRARIES", _dl_libpath);
+    if(!getenv("OMP_TOOL_LIBRARIES"))
+        update_env(_env, "OMP_TOOL_LIBRARIES", _dl_libpath, true);
 #endif

    free(_dl_libpath);
@@ -140,11 +150,58 @@ get_internal_libpath(const std::string& _lib)
    return omnitrace::common::join("/", _dir, "..", "lib", _lib);
 }

+void
+print_updated_environment(std::vector<char*> _env)
+{
+    std::sort(_env.begin(), _env.end(), [](auto* _lhs, auto* _rhs) {
+        if(!_lhs) return false;
+        if(!_rhs) return true;
+        return std::string_view{ _lhs } < std::string_view{ _rhs };
+    });
+
+    std::vector<char*> _updates = {};
+    std::vector<char*> _general = {};
+
+    for(auto* itr : _env)
+    {
+        if(itr == nullptr) continue;
+
+        auto _is_omni = (std::string_view{ itr }.find("OMNITRACE") == 0);
+        auto _updated = false;
+        for(const auto& vitr : updated_envs)
+        {
+            if(std::string_view{ itr }.find(vitr) == 0)
+            {
+                _updated = true;
+                break;
+            }
+        }
+
+        if(_updated)
+            _updates.emplace_back(itr);
+        else if(verbose >= 1 && _is_omni)
+            _general.emplace_back(itr);
+    }
+
+    if(_general.size() + _updates.size() == 0 || verbose < 0) return;
+
+    std::cerr << std::endl;
+
+    for(auto& itr : _general)
+        stream(std::cerr, color::source()) << itr << "\n";
+    for(auto& itr : _updates)
+        stream(std::cerr, color::source()) << itr << "\n";
+
+    std::cerr << std::endl;
+}
+
 template <typename Tp>
 void
 update_env(std::vector<char*>& _environ, std::string_view _env_var, Tp&& _env_val,
           bool _append)
 {
+    updated_envs.emplace(_env_var);
+
    auto _key = join("", _env_var, "=");
    for(auto& itr : _environ)
    {
@@ -153,11 +210,13 @@ update_env(std::vector<char*>& _environ, std::string_view _env_var, Tp&& _env_va
        {
            if(_append)
            {
-                auto _val = std::string{ itr }.substr(_key.length());
-                free(itr);
-                itr = strdup(
-                    omnitrace::common::join('=', _env_var, join(":", _env_val, _val))
-                        .c_str());
+                if(std::string_view{ itr }.find(join("", _env_val)) ==
+                   std::string_view::npos)
+                {
+                    auto _val = std::string{ itr }.substr(_key.length());
+                    free(itr);
+                    itr = strdup(join('=', _env_var, join(":", _env_val, _val)).c_str());
+                }
            }
            else
            {
@@ -171,6 +230,22 @@ update_env(std::vector<char*>& _environ, std::string_view _env_var, Tp&& _env_va
        strdup(omnitrace::common::join('=', _env_var, _env_val).c_str()));
 }

+void
+remove_env(std::vector<char*>& _environ, std::string_view _env_var)
+{
+    auto _key   = join("", _env_var, "=");
+    auto _match = [&_key](auto itr) { return std::string_view{ itr }.find(_key) == 0; };
+
+    _environ.erase(std::remove_if(_environ.begin(), _environ.end(), _match),
+                   _environ.end());
+
+    for(const auto& itr : original_envs)
+    {
+        if(std::string_view{ itr }.find(_key) == 0)
+            _environ.emplace_back(strdup(itr.c_str()));
+    }
+}
+
 std::vector<char*>
 parse_args(int argc, char** argv, std::vector<char*>& _env)
 {
@@ -200,6 +275,11 @@ parse_args(int argc, char** argv, std::vector<char*>& _env)
        exit(_pec);
    };

+    auto* _dl_libpath =
+        realpath(get_internal_libpath("libomnitrace-dl.so").c_str(), nullptr);
+    auto* _omni_libpath =
+        realpath(get_internal_libpath("libomnitrace.so").c_str(), nullptr);
+
    auto parser = parser_t(argv[0]);

    parser.on_error([](parser_t&, const parser_err_t& _err) {
@@ -273,6 +353,7 @@ parse_args(int argc, char** argv, std::vector<char*>& _env)
        .dtype("bool")
        .action([&](parser_t& p) {
            auto _colorized = !p.get<bool>("monochrome");
+            colorized()     = _colorized;
            p.set_use_color(_colorized);
            update_env(_env, "OMNITRACE_COLORIZED_LOG", (_colorized) ? "1" : "0");
            update_env(_env, "COLORIZED_LOG", (_colorized) ? "1" : "0");
@@ -599,6 +680,12 @@ parse_args(int argc, char** argv, std::vector<char*>& _env)
            _update("OMNITRACE_TRACE_THREAD_LOCKS", _v.count("mutex-locks") > 0);
            _update("OMNITRACE_TRACE_THREAD_RW_LOCKS", _v.count("rw-locks") > 0);
            _update("OMNITRACE_TRACE_THREAD_SPIN_LOCKS", _v.count("spin-locks") > 0);
+
+            if(_v.count("all") > 0 || _v.count("ompt") > 0)
+                update_env(_env, "OMP_TOOL_LIBRARIES", _dl_libpath, true);
+
+            if(_v.count("all") > 0 || _v.count("kokkosp") > 0)
+                update_env(_env, "KOKKOS_PROFILE_LIBRARY", _omni_libpath, true);
        });

    parser.add_argument({ "-E", "--exclude" }, "Exclude data from these backends")
@@ -619,6 +706,25 @@ parse_args(int argc, char** argv, std::vector<char*>& _env)
            _update("OMNITRACE_TRACE_THREAD_LOCKS", _v.count("mutex-locks") > 0);
            _update("OMNITRACE_TRACE_THREAD_RW_LOCKS", _v.count("rw-locks") > 0);
            _update("OMNITRACE_TRACE_THREAD_SPIN_LOCKS", _v.count("spin-locks") > 0);
+
+            if(_v.count("all") > 0 ||
+               (_v.count("roctracer") > 0 && _v.count("rocprofiler") > 0))
+            {
+                remove_env(_env, "HSA_TOOLS_LIB");
+                remove_env(_env, "HSA_TOOLS_REPORT_LOAD_FAILURE");
+            }
+
+            if(_v.count("all") > 0 || _v.count("rocprofiler") > 0)
+            {
+                remove_env(_env, "ROCP_TOOL_LIB");
+                remove_env(_env, "ROCP_HSA_INTERCEPT");
+            }
+
+            if(_v.count("all") > 0 || _v.count("ompt") > 0)
+                remove_env(_env, "OMP_TOOL_LIBRARIES");
+
+            if(_v.count("all") > 0 || _v.count("kokkosp") > 0)
+                remove_env(_env, "KOKKOS_PROFILE_LIBRARY");
        });

    _add_separator("HARDWARE COUNTER OPTIONS", "");
@@ -626,7 +732,6 @@ parse_args(int argc, char** argv, std::vector<char*>& _env)
        .add_argument({ "-C", "--cpu-events" },
                      "Set the CPU hardware counter events to record (ref: "
                      "`omnitrace-avail -H -c CPU`)")
-        .set_default(std::set<std::string>{})
        .action([&](parser_t& p) {
            auto _events =
                join(array_config{ "," }, p.get<std::vector<std::string>>("cpu-events"));
@@ -638,7 +743,6 @@ parse_args(int argc, char** argv, std::vector<char*>& _env)
        .add_argument({ "-G", "--gpu-events" },
                      "Set the GPU hardware counter events to record (ref: "
                      "`omnitrace-avail -H -c GPU`)")
-        .set_default(std::set<std::string>{})
        .action([&](parser_t& p) {
            auto _events =
                join(array_config{ "," }, p.get<std::vector<std::string>>("gpu-events"));
@@ -695,5 +799,8 @@ parse_args(int argc, char** argv, std::vector<char*>& _env)
        throw std::runtime_error(
            "Error! '--profile' argument conflicts with '--flat-profile' argument");

+    free(_dl_libpath);
+    free(_omni_libpath);
+
    return _outv;
 }
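
The environment handling added in the hunks above can be summarized with a small Python model. This is a sketch under assumptions: it mirrors the visible lines of `update_env` and `remove_env` (prepend with a duplicate check when appending, and restore the original value on removal), it assumes that an existing entry is updated in place and a missing one is added at the end, and the library paths are hypothetical.

```python
def update_env(env, name, value, append=False):
    """Set NAME=value in env; with append=True, prepend value to an existing entry."""
    key = f"{name}="
    for i, entry in enumerate(env):
        if entry.startswith(key):
            if append:
                # Skip the update when the value is already present in the entry.
                if str(value) not in entry:
                    env[i] = f"{key}{value}:{entry[len(key):]}"
            else:
                env[i] = f"{key}{value}"
            return
    env.append(f"{key}{value}")


def remove_env(env, original, name):
    """Drop NAME from env, then restore the value it had at startup (if any)."""
    key = f"{name}="
    env[:] = [e for e in env if not e.startswith(key)]
    env.extend(e for e in original if e.startswith(key))


# Hypothetical starting environment and library paths.
original = ["OMP_TOOL_LIBRARIES=/usr/lib/libtool.so", "PATH=/bin"]
env = list(original)

update_env(env, "OMP_TOOL_LIBRARIES", "/opt/lib/libomnitrace-dl.so", append=True)
update_env(env, "OMP_TOOL_LIBRARIES", "/opt/lib/libomnitrace-dl.so", append=True)
update_env(env, "KOKKOS_PROFILE_LIBRARY", "/opt/lib/libomnitrace.so", append=True)
after_update = list(env)

remove_env(env, original, "OMP_TOOL_LIBRARIES")
remove_env(env, original, "KOKKOS_PROFILE_LIBRARY")
print(after_update)
print(env)
```

The second `update_env` call is a no-op because the value is already present, and removing a variable that the user had set before launch puts the user's value back rather than deleting it.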
@@ -52,15 +52,7 @@ main(int argc, char** argv)
            _argv.emplace_back(argv[i]);
    }

-    std::sort(_env.begin(), _env.end(), [](auto* _lhs, auto* _rhs) {
-        if(!_lhs) return false;
-        if(!_rhs) return true;
-        return std::string_view{ _lhs } < std::string_view{ _rhs };
-    });
-
-    for(auto* itr : _env)
-        if(itr != nullptr && std::string_view{ itr }.find("OMNITRACE") == 0)
-            std::cout << itr << "\n";
+    print_updated_environment(_env);

    if(!_argv.empty())
    {
@@ -35,6 +35,8 @@ get_realpath(const std::string&);
 void
 print_command(const std::vector<char*>& _argv);

+void print_updated_environment(std::vector<char*>);
+
 std::vector<char*>
 get_initial_environment();

@@ -45,5 +47,8 @@ template <typename Tp>
 void
 update_env(std::vector<char*>&, std::string_view, Tp&&, bool _append = false);

+void
+remove_env(std::vector<char*>&, std::string_view);
+
 std::vector<char*>
 parse_args(int argc, char** argv, std::vector<char*>&);
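
For readers following the new `print_updated_environment` declaration, the selection logic shown in the implementation hunk can be modelled in a few lines of Python. This is a sketch, not the tool's code: `select_env_lines` is a hypothetical name, and the environment entries are made up for the example.

```python
def select_env_lines(env, updated_names, verbose=0):
    """Return the entries that would be printed, in print order."""
    updates, general = [], []
    for entry in sorted(e for e in env if e is not None):
        # An entry counts as updated when it starts with a recorded variable name.
        if any(entry.startswith(name) for name in updated_names):
            updates.append(entry)
        elif verbose >= 1 and entry.startswith("OMNITRACE"):
            general.append(entry)
    if verbose < 0 or not (general or updates):
        return []
    # Untouched OMNITRACE settings come first, then the updated variables.
    return general + updates


env = [
    "PATH=/bin",
    "OMNITRACE_USE_SAMPLING=true",
    "OMNITRACE_VERBOSE=1",
    "LD_PRELOAD=/opt/lib/libomnitrace-dl.so",
]
updated = {"LD_PRELOAD", "OMNITRACE_USE_SAMPLING"}

for level in (-1, 0, 1):
    print(level, select_env_lines(env, updated, verbose=level))
```

Note that the match is a prefix match on the variable name without the trailing `=`, so an updated name also selects any other variable whose name begins with it.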
@@ -4,10 +4,186 @@
 .. toctree::
   :glob:
   :maxdepth: 3
-
-   setup
-   nomenclature
-   instrumenting
-   runtime
-   critical_trace
 ```
+
+<style>
+em { color: Green; }
+</style>
+
+## Nomenclature
+
+The list provided below is intended to (A) provide a basic glossary for those who are not familiar with binary instrumentation, etc. and (B)
+provide clarification to ambiguities when certain terms have different contextual meanings,
+e.g., omnitrace's meaning of the term "module" when instrumenting Python.
+
+- **Binary**
+  - File written in the Executable and Linkable Format (ELF)
+  - Standard file format for executable files, shared libraries, etc.
+- **Binary Instrumentation**
+  - Inserting callbacks to instrumentation into an existing binary. This can be performed statically or dynamically
+- **Static Binary Instrumentation**
+  - Loads an existing binary, determines instrumentation points, and generates a new binary with instrumentation directly embedded
+  - Applicable to executables and libraries but limited to only the functions defined in the binary
+  - Also known as: **Binary Rewrite**
+- **Dynamic Binary Instrumentation**
+  - Loads an existing binary into memory, inserts instrumentation, executes binary
+  - Limited to executables but capable of instrumenting linked libraries
+  - Also known as: **Runtime Instrumentation**
+- **Statistical Sampling**
+  - Also known as (simply) "sampling"
+  - At periodic intervals, the application is paused and the current call-stack of the CPU is recorded alongside various other metrics
+  - Uses timers that measure either (A) real clock time or (B) the CPU time used by the current thread and the CPU time expended on behalf of the thread by the system
+  - **Sampling Rate**
+    - The frequency at which (A) or (B) are triggered (in units of `# interrupts / second`)
+    - Higher values increase the number of samples
+  - **Sampling Delay**
+    - How long to wait before (A) and (B) begin triggering at their designated rate
+  - **Sampling Duration**
+    - The time (in realtime) after the start of the application to record samples. Once this time limit has been reached, no more samples will be recorded.
+- **Process Sampling**
+  - At periodic (realtime) intervals, a background thread records global metrics without interrupting the current process. These metrics include, but are not limited to: CPU frequency,
+    CPU memory high-water mark (i.e. peak memory usage), GPU Temperature, GPU Power usage, etc.
+  - **Sampling Rate**
+    - The realtime period for recording metrics (in units of `# measurements / second`)
+    - Higher values increase the number of samples
+  - **Sampling Delay**
+    - How long to wait (in realtime) before recording samples
+  - **Sampling Duration**
+    - The time (in realtime) after the start of the application to record samples. Once this time limit has been reached, no more samples will be recorded.
+- **Module**
+  - With respect to binary instrumentation, a module is defined as either the filename (e.g. `foo.c`) or library name (`libfoo.so`) which contains the definition of one or more functions
+  - With respect to Python instrumentation, a module is defined as the *file* which contains the definition of one or more functions.
+    - The full path to this file *typically* contains the name of the "Python module"
+- **Basic Block**
+  - Straight-line code sequence with:
+    - No branches in (except for the entry)
+    - No branches out (except for the exit)
+- **Address Range**
+  - The instructions for a function in a binary start at a certain address within the ELF file and end at a certain address; the range is `end - start`
+  - The address range is a decent approximation for the "cost" of a function, i.e., a larger address range approx. equates to more instructions
+- **Instrumentation Traps**
+  - On the x86 architecture, because instructions are of variable size, the instruction at a point may be too small for Dyninst to replace it with the normal code sequence used to call instrumentation
+    - Also, when instrumentation is placed at points other than subroutine entry, exit, or call points, traps may be used to ensure the instrumentation fits
+  - By default, omnitrace avoids instrumentation which requires using a trap
+- **Overlapping functions**
+  - Due to language constructs or compiler optimizations, it may be possible for multiple functions to overlap (that is, share part of the same function body) or for a single function to have multiple entry points
+  - In practice, it is impossible to determine the difference between multiple overlapping functions and a single function with multiple entry points
+  - By default, omnitrace avoids instrumenting overlapping functions
+
+## General Tips
+
+- ***Use `omnitrace-avail` to lookup configuration settings***, hardware counters, and data collection components
+  - Use `-d` flag for descriptions
+- Generate a default configuration with `omnitrace-avail -G ${HOME}/.omnitrace.cfg` and tweak accordingly to the desired default behavior
+- ***Decide whether binary instrumentation, statistical sampling, or both*** will provide the desired performance data (for non-Python applications)
+- Compile code with optimization enabled (e.g. `-O2` or higher), disable asserts (i.e. `-DNDEBUG`), and include debug info (i.e. `-g1` at a minimum)
+  - NOTE: compiling with debug info does not slow down the code, it only increases compile time and the size of the binary
+  - In CMake, this is generally as easy as setting `CMAKE_BUILD_TYPE=RelWithDebInfo` or `CMAKE_BUILD_TYPE=Release` and `CMAKE_<LANG>_FLAGS=-g1`
+- Use ***binary instrumentation for characterizing the performance of every invocation of specific functions***
+- Use ***statistical sampling to characterize the performance of the entire application while minimizing overhead***
+- Enable statistical sampling after binary instrumentation to help "fill in the gaps" between instrumented regions
+- Use the user API to create custom regions and to enable/disable omnitrace for specific processes, threads, and/or regions
+- Dynamic symbol interception, callback APIs, and the user API are always available with binary instrumentation and sampling
+  - Dynamic symbol interception and callback APIs are (generally) controlled through `OMNITRACE_USE_<API>` options, e.g. `OMNITRACE_USE_KOKKOSP`, `OMNITRACE_USE_OMPT` enable Kokkos-Tools and OpenMP-Tools callbacks, respectively
+- When generically seeking regions for performance improvement:
+  - ***Start off collecting a flat profile***
+  - Look for functions with high call counts, large cumulative runtimes/values, and/or large standard deviations
+    - When call-counts are high, improving the performance of this function or "inlining" the function can be quick and easy performance improvements
+    - When the standard-deviation is high, collect a hierarchical profile and see if the high variation can be attributed to the calling context. In this scenario, consider creating a specialized version of the function for the longer-running contexts
+  - ***Collect a hierarchical profile*** and, keeping the flat-profiling data in mind, verify the functions noted in the flat profile are part of the "critical path" of your application
+    - E.g., function(s) with high call counts, etc. that are part of a "setup" or "post-processing" phase which does not consume much time relative to the overall runtime are, generally, a lower priority for optimization
+- ***Use the information from the profiles when analyzing detailed traces***
+- When using binary instrumentation in the "trace" mode, the ***binary rewrites are preferable to runtime instrumentation***.
+  - Binary rewrites only instrument the functions defined in the target binary, whereas runtime instrumentation can/will instrument functions defined in the shared libraries which are linked into the target binary
+- When using binary instrumentation with MPI, avoid runtime instrumentation
+  - Runtime instrumentation requires a fork + ptrace, which is generally incompatible with how MPI applications spawn their processes
+  - Binary rewrite the executable using MPI (and, optionally, libraries used by the executable) and execute the generated instrumented executable instead of the original, e.g. `mpirun -n 2 ./myexe` should be `mpirun -n 2 ./myexe.inst` where `myexe.inst` is the generated instrumented `myexe` executable.
+
+## Data Collection Mode(s)
+
+Omnitrace supports several modes of recording trace and profiling data for your application:
+
+| Mode                        | Descriptions                                                                                                                                                  |
+|-----------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Binary Instrumentation      | Locates functions (and loops, if desired) in binary and inserts snippets at the entry and exit                                                                |
+| Statistical Sampling        | Periodically pauses application at specified intervals and records various metrics for the given call-stack                                                   |
+| Callback APIs               | Parallelism frameworks such as ROCm, OpenMP, and Kokkos will make callbacks into omnitrace to provide information about the work the API is performing        |
+| Dynamic Symbol Interception | Wrap function symbols defined in position independent dynamic library/executable, e.g. `pthread_mutex_lock` in libpthread.so or `MPI_Init` in the MPI library |
+| User API                    | User-defined regions and controls for omnitrace                                                                                                               |
+
+The two most generic, important modes are binary instrumentation and statistical sampling. It is important to understand the advantages and disadvantages.
+Binary instrumentation and statistical sampling can be performed with the `omnitrace` executable but for statistical sampling, it is highly recommended to use the
+`omnitrace-sample` executable instead if no binary instrumentation is required/desired. With either tool, the callback APIs and dynamic symbol interception can be
+utilized.
+
+### Binary Instrumentation
+
+Binary instrumentation will allow one to deterministically record measurements for every single invocation of a given function.
+Binary instrumentation effectively adds instructions to the target application to collect the required information and, thus, has the potential to cause performance changes which may,
+in some cases, lead to inaccurate results. The effect depends on what information is being collected and which features are activated in omnitrace. For example, collecting only the wall-clock timing data
+will have less effect than collecting the wall-clock timing, cpu-clock timing, memory usage, cache-misses, and number of instructions executed. Similarly, collecting a flat profile will have
+less overhead than a hierarchical profile and collecting a trace OR a profile will have less overhead than collecting a trace AND a profile.
+
+In omnitrace, the primary heuristic for controlling the overhead with binary instrumentation is the minimum number of instructions for selecting functions for instrumentation.
+
+### Statistical Sampling
+
+Statistical call-stack sampling periodically interrupts the application at regular intervals using operating system interrupts.
+Sampling is typically less numerically accurate and specific, but allows the target program to run at near full speed.
+In contrast to the data derived from binary instrumentation, the resulting data is not exact but, instead, a statistical approximation.
+However, sampling often provides a more accurate picture of the application execution because it is less intrusive to the target application and has fewer
+side effects on memory caches or instruction decoding pipelines. Furthermore, since sampling does not affect the execution speed as significantly, it is
+relatively immune to over-evaluating the cost of small, frequently called functions or "tight" loops.
+
+In omnitrace, the overhead for statistical sampling is a factor of the sampling rate and whether the samples are taken with respect to the CPU time and/or real time.
+
+### Binary Instrumentation vs. Statistical Sampling Example
+
+Consider the following code:
+
+```cpp
+#include <cstdio>
+#include <cstdlib>
+
+long fib(long n)
+{
+    if(n < 2) return n;
+    return fib(n - 1) + fib(n - 2);
+}
+
+// i: iteration index, nfib: Fibonacci number to compute
+void run(long i, long nfib)
+{
+    long result = fib(nfib);
+    printf("[%li] fibonacci(%li) = %li\n", i, nfib, result);
+}
+
+int main(int argc, char** argv)
+{
+    long nfib = 30;
+    long nitr = 10;
+    if(argc > 1) nfib = atol(argv[1]);
+    if(argc > 2) nitr = atol(argv[2]);
+
+    for(long i = 0; i < nitr; ++i)
+        run(i, nfib);
+
+    return 0;
+}
+```
+
+Binary instrumentation of the `fib` function will record ***every single invocation*** of the function -- which, for a very small function
+such as `fib`, will result in *significant* overhead since this simple function tends to be less than 20 or so instructions, whereas the entry and
+exit snippets are ~1024 instructions. Thus, ***we generally want to avoid instrumenting functions where the instrumented function has significantly fewer
+instructions than entry + exit instrumentation*** (please note, however, that many of the instructions in the entry/exit functions are either logging functions or
+depend on the runtime settings and thus may never be executed). Due to the number of potentially executed instructions in the entry/exit snippets,
+the default behavior of omnitrace is to only instrument functions which contain at least 1024 instructions.
+
+However, recording every single invocation of the function can be extremely useful for detecting anomalies: profiles will show min/max values much smaller/larger
+than the average and/or high standard deviation and traces will allow you to identify exactly when and where those instances deviated from the norm.
+Consider the level of details in the following traces where, in the top image, every instance of the `fib` function was instrumented vs. the bottom image
+where the `fib` call-stack was derived via sampling:
+
+#### Binary Instrumentation of Fibonacci Function
+
+![instrumented-fibonacci-trace](images/fibonacci-instrumented.png)
+
+#### Statistical Sampling of Fibonacci Function
+
+![sampled-fibonacci-trace](images/fibonacci-sampling.png)
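
To put rough numbers on the Fibonacci comparison above, the following Python sketch counts how often the recursive `fib` is entered and contrasts that with an idealized sample count. The helper names `fib_calls` and `expected_samples` are hypothetical, the 100 Hz rate and 2 s runtime are made-up inputs, and the instruction figures are the approximate ones quoted in the text, so the ratio is a worst-case estimate rather than a measurement.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib_calls(n: int) -> int:
    """Number of times the recursive fib is entered when computing fib(n)."""
    if n < 2:
        return 1  # base case: a single call, no recursion
    return 1 + fib_calls(n - 1) + fib_calls(n - 2)


def expected_samples(rate_hz: float, runtime_s: float, delay_s: float = 0.0) -> float:
    """Idealized sample count: one sample per interrupt once the delay has passed."""
    return rate_hz * max(0.0, runtime_s - delay_s)


calls_per_run = fib_calls(30)     # default nfib in the example
total_calls = 10 * calls_per_run  # default nitr in the example

# Approximate figures from the text: ~20 instructions per fib call versus
# ~1024 instructions for the entry + exit snippets (worst case).
worst_case_ratio = 1024 / 20

print(f"fib(30) enters fib {calls_per_run:,} times")
print(f"10 iterations -> {total_calls:,} instrumented entries")
print(f"worst-case instruction inflation: ~{worst_case_ratio:.0f}x")
# Hypothetical sampling setup: 100 samples/s over a 2 s run.
print(f"samples at 100 Hz over 2 s: {expected_samples(100, 2):.0f}")
```

Instrumenting `fib` therefore records millions of entries for the default inputs, while sampling records a number of samples that depends only on the rate and the runtime, which is why the sampled trace is far coarser but also far cheaper to collect.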
@@ -9,7 +9,12 @@
   about
   features
   installation
+   setup
   getting_started
+   runtime
+   sampling
+   instrumenting
+   critical_trace
   output
   user_api
   python
@@ -1,4 +1,4 @@
-# Instrumenting with Omnitrace
+# Binary Instrumentation

 ```eval_rst
 .. toctree::
@@ -8,9 +8,13 @@

 ## omnitrace Executable

+> ***NOTE: With the introduction of `omnitrace-sample`, in future versions of omnitrace, the current `omnitrace` executable***
+> ***noted below will likely be renamed to `omnitrace-instrument` and a new `omnitrace` executable will serve as a common***
+> ***executable for multiple executables, e.g. `omnitrace sample ...`, `omnitrace run ...`, `omnitrace rewrite ...`, etc.***
+
 Instrumentation is performed with the `omnitrace` executable. View the help menu with the `-h` / `--help` option:

-```shell
+```console
 $ omnitrace --help
 [omnitrace] Usage: omnitrace   [ --help (count: 0, dtype: bool)
                                 --debug (max: 1, dtype: bool)
@@ -1,51 +0,0 @@
-# Nomenclature
-
-```eval_rst
-.. toctree::
-   :glob:
-   :maxdepth: 3
-```
-
-The list provided below is intended to (A) provide a basic glossary for those who are not familiar with binary instrumentation and (B) provide clarification to ambiguities when certain terms
-have different contextual meanings, e.g., omnitrace's meaning of the term "module" when instrumenting Python.
-
- **Binary**
-  - File written in the Executable and Linkable Format (ELF)
-  - Standard file format for executable files, shared libraries, etc.
- **Binary Instrumentation**
-  - Inserting callbacks to instrumentation into an existing binary. This can be performed statically or dynamically
- **Static Binary Instrumentation**
-  - Loads an existing binary, determines instrumentation points, and generates a new binary with instrumentation directly embedded
-  - Applicable to executables and libraries but limited to only the functions defined in the binary
-  - Also known as: **Binary Rewrite**
- **Dynamic Binary Instrumentation**
-  - Loads an existing binary into memory, inserts instrumentation, executes binary
-  - Limited to executables but capable of instrumenting linked libraries
-  - Also known as: **Runtime Instrumentation**
- **Sampling**
-  - At periodic intervals, the application is paused and the current call-stack of the CPU is recorded alongside with various other metrics
-  - Uses timers that measure either (A) real clock time or (B) the CPU time used by the current thread and the CPU time expended on behalf of the thread by the system
-  - **Sampling Rate**
-    - The period at which (A) or (B) are triggered (in units of `# interrupts / second`)
-    - Higher values increase the number of samples
-  - **Sampling Delay**
-    - How long to wait before (A) and (B) begin triggering at their designated rate
- **Module**
-  - With respect to binary instrumentation, a module is defined as either the filename (e.g. `foo.c`) or library name (`libfoo.so`) which contains the definition of one or more functions
-  - With respect to Python instrumentation, a module is defined as the _file_ which contains the definition of one or more functions.
-    - The full path to this file _typically_ contains the name of the "Python module"
- **Basic Block**
-  - Straight-line code sequence with:
-    - No branches in (except for the entry)
-    - No branches out (except for the exit)
- **Address Range**
-  - The instructions for a function in a binary start at certain address with the ELF file and end at a certain address, the range is `end - start`
-  - The address range is a decent approximation for the "cost" of a function, i.e., a larger address range approx. equates to more instructions
- **Instrumentation Traps**
-  - On the x86 architecture, because instructions are of variable size, the instruction at a point may be too small for Dyninst to replace it with the normal code sequence used to call instrumentation
-    - Also, when instrumentation is placed at points other than subroutine entry, exit, or call points, traps may be used to ensure the instrumentation fits
-  - By default, omnitrace avoids instrumentation which requires using a trap
- **Overlapping functions**
-  - Due to language constructs or compiler optimizations, it may be possible for multiple functions to overlap (that is, share part of the same function body) or for a single function to have multiple entry points
-  - In practice, it is impossible to determine the difference between multiple overlapping functions and a single function with multiple entry points
-  - By default, omnitrace avoids instrumenting overlapping functions
@@ -1,4 +1,4 @@
-# Customizing Omnitrace Runtime
+# Configuring Omnitrace Runtime

 ```eval_rst
 .. toctree::
@@ -0,0 +1,357 @@
+# Call-Stack Sampling
+
+```eval_rst
+.. toctree::
+   :glob:
+   :maxdepth: 4
+```
+
+> ***NOTE: Set `OMNITRACE_USE_SAMPLING=ON` to activate call-stack sampling when executing an instrumented binary***
+
+Call-stack sampling can be activated with either a binary instrumented via the `omnitrace` executable or via the `omnitrace-sample` executable.
+***Effectively***, all of the commands below are equivalent:
+
+- Binary rewrite with only instrumentation necessary to start/stop sampling
+
+```console
+omnitrace -M sampling -o foo.inst -- foo
+./foo.inst
+```
+
+- Runtime instrumentation with only instrumentation necessary to start/stop sampling
+
+```console
+omnitrace -M sampling -- foo
+```
+
+- No instrumentation required
+
+```console
+omnitrace-sample -- foo
+```
+
+All that `omnitrace -M sampling` (referred to as "instrumented-sampling" henceforth) does is wrap the `main` of the executable with initialization
+before `main` starts and finalization after `main` ends.
+This can be easily accomplished without instrumentation via an `LD_PRELOAD` of a library containing a dynamic symbol wrapper around `__libc_start_main`.
+Thus, whenever binary instrumentation is unnecessary, using `omnitrace-sample` is recommended over `omnitrace -M sampling` for several reasons:
+
+1. `omnitrace-sample` provides command-line options for controlling features of omnitrace instead of *requiring* configuration files or environment variables
+2. Although instrumented-sampling only requires inserting snippets around one function (`main`), Dyninst
+   has no feature for specifying that parsing and processing all the other symbols in the binary is unnecessary.
+   Thus, in the best-case scenario, instrumented-sampling has a slightly slower launch time when the target binary is relatively small
+   but, in the worst-case scenarios, it requires a significant amount of time and memory to launch
+3. `omnitrace-sample` is fully compatible with MPI, e.g. `mpirun -n 2 omnitrace-sample -- foo`, whereas `mpirun -n 2 omnitrace -M sampling -- foo`
+   is incompatible with some MPI distributions (particularly OpenMPI) because of MPI restrictions against forking within an MPI rank
+    - Recall that combining MPI with binary instrumentation requires two steps: (1) perform a binary rewrite of the executable
+      and (2) use the instrumented executable in lieu of the original executable. `omnitrace-sample` is thus much easier to use with MPI.
+
+## omnitrace-sample Executable
+
+View the help menu of `omnitrace-sample` with the `-h` / `--help` option:
+
+```console
+$ omnitrace-sample --help
+[omnitrace-sample] Usage: omnitrace-sample [ --help (count: 0, dtype: bool)
+                                             --monochrome (max: 1, dtype: bool)
+                                             --debug (max: 1, dtype: bool)
+                                             --verbose (count: 1)
+                                             --config (min: 0, dtype: filepath)
+                                             --output (min: 1)
+                                             --trace (max: 1, dtype: bool)
+                                             --profile (max: 1, dtype: bool)
+                                             --flat-profile (max: 1, dtype: bool)
+                                             --host (max: 1, dtype: bool)
+                                             --device (max: 1, dtype: bool)
+                                             --trace-file (count: 1, dtype: filepath)
+                                             --trace-buffer-size (count: 1, dtype: KB)
+                                             --trace-fill-policy (count: 1)
+                                             --profile-format (min: 1)
+                                             --profile-diff (min: 1)
+                                             --process-freq (count: 1)
+                                             --process-wait (count: 1)
+                                             --process-duration (count: 1)
+                                             --cpus (count: unlimited, dtype: int or range)
+                                             --gpus (count: unlimited, dtype: int or range)
+                                             --freq (count: 1)
+                                             --wait (count: 1)
+                                             --duration (count: 1)
+                                             --tids (min: 1)
+                                             --cputime (min: 0)
+                                             --realtime (min: 0)
+                                             --include (count: unlimited)
+                                             --exclude (count: unlimited)
+                                             --cpu-events (count: unlimited)
+                                             --gpu-events (count: unlimited)
+                                             --inlines (max: 1, dtype: bool)
+                                             --hsa-interrupt (count: 1, dtype: int)
+                                           ]
+
+Options:
+    -h, -?, --help                 Shows this page
+
+    [DEBUG OPTIONS]
+
+    --monochrome                   Disable colorized output
+    --debug                        Debug output
+    -v, --verbose                  Verbose output
+
+    [GENERAL OPTIONS]
+
+    -c, --config                   Configuration file
+    -o, --output                   Output path. Accepts 1-2 parameters corresponding to the output path and the output prefix
+    -T, --trace                    Generate a detailed trace (perfetto output)
+    -P, --profile                  Generate a call-stack-based profile (conflicts with --flat-profile)
+    -F, --flat-profile             Generate a flat profile (conflicts with --profile)
+    -H, --host                     Enable sampling host-based metrics for the process. E.g. CPU frequency, memory usage, etc.
+    -D, --device                   Enable sampling device-based metrics for the process. E.g. GPU temperature, memory usage, etc.
+
+    [TRACING OPTIONS]
+
+    --trace-file                   Specify the trace output filename. Relative filepath will be with respect to output path and output prefix.
+    --trace-buffer-size            Size limit for the trace output (in KB)
+    --trace-fill-policy [ discard | ring_buffer ]
+
+                                   Policy for new data when the buffer size limit is reached:
+                                       - discard     : new data is ignored
+                                       - ring_buffer : new data overwrites oldest data
+
+    [PROFILE OPTIONS]
+
+    --profile-format [ console | json | text ]
+                                   Data formats for profiling results
+    --profile-diff                 Generate a diff output b/t the profile collected and an existing profile from another run Accepts 1-2 parameters
+                                   corresponding to the input path and the input prefix
+
+    [HOST/DEVICE (PROCESS SAMPLING) OPTIONS]
+
+
+    --process-freq                 Set the default host/device sampling frequency (number of interrupts per second)
+    --process-wait                 Set the default wait time (i.e. delay) before taking first host/device sample (in seconds of realtime)
+    --process-duration             Set the duration of the host/device sampling (in seconds of realtime)
+    --cpus                         CPU IDs for frequency sampling. Supports integers and/or ranges
+    --gpus                         GPU IDs for SMI queries. Supports integers and/or ranges
+
+    [GENERAL SAMPLING OPTIONS]
+
+    -f, --freq                     Set the default sampling frequency (number of interrupts per second)
+    -w, --wait                     Set the default wait time (i.e. delay) before taking first sample (in seconds). This delay time is based on the clock
+                                   of the sampler, i.e., a delay of 1 second for CPU-clock sampler may not equal 1 second of realtime
+    -d, --duration                 Set the duration of the sampling (in seconds of realtime). I.e., it is possible (currently) to set a CPU-clock time
+                                   delay that exceeds the real-time duration... resulting in zero samples being taken
+    -t, --tids                     Specify the default thread IDs for sampling, where 0 (zero) is the main thread and each thread created by the target
+                                   application is assigned an atomically incrementing value.
+
+    [SAMPLING TIMER OPTIONS]
+
+    --cputime                      Sample based on a CPU-clock timer (default). Accepts zero or more arguments:
+                                       0. Enables sampling based on CPU-clock timer.
+                                       1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of CPU-time.
+                                       2. Delay (in seconds of CPU-clock time). I.e., how long each thread should wait before taking first sample.
+                                       3+ Thread IDs to target for sampling, starting at 0 (the main thread).
+                                          May be specified as index or range, e.g., '0 2-4' will be interpreted as:
+                                             sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads
+    --realtime                     Sample based on a real-clock timer. Accepts zero or more arguments:
+                                       0. Enables sampling based on real-clock timer.
+                                       1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of realtime.
+                                       2. Delay (in seconds of real-clock time). I.e., how long each thread should wait before taking first sample.
+                                       3+ Thread IDs to target for sampling, starting at 0 (the main thread).
+                                          May be specified as index or range, e.g., '0 2-4' will be interpreted as:
+                                             sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads
+                                          When sampling with a real-clock timer, please note that enabling this will cause threads which are typically "idle"
+                                          to consume more resources since, while idle, the real-clock time increases (and therefore triggers taking samples)
+                                          whereas the CPU-clock time does not.
+
+    [BACKEND OPTIONS]  (These options control region information captured w/o sampling or instrumentation)
+
+    -I, --include [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
+                                   Include data from these backends
+    -E, --exclude [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
+                                   Exclude data from these backends
+
+    [HARDWARE COUNTER OPTIONS]
+
+    -C, --cpu-events               Set the CPU hardware counter events to record (ref: `omnitrace-avail -H -c CPU`)
+    -G, --gpu-events               Set the GPU hardware counter events to record (ref: `omnitrace-avail -H -c GPU`)
+
+    [MISCELLANEOUS OPTIONS]
+
+    -i, --inlines                  Include inline info in output when available
+    --hsa-interrupt [ 0 | 1 ]      Set the value of the HSA_ENABLE_INTERRUPT environment variable.
+                                     ROCm version 5.2 and older have a bug which will cause a deadlock if a sample is taken while waiting for the signal
+                                     that a kernel completed -- which happens when sampling with a real-clock timer. We require this option to be set to
+                                     when --realtime is specified to make users aware that, while this may fix the bug, it can have a negative impact on
+                                     performance.
+                                     Values:
+                                       0     avoid triggering the bug, potentially at the cost of reduced performance
+                                       1     do not modify how ROCm is notified about kernel completion
+```
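
The thread-ID arguments shown in the help output above (for `--cputime` and `--realtime`) mix single indices and ranges such as `0 2-4`. The sketch below is only an illustration of how such a specification expands into individual thread IDs; `expand_tids` is a hypothetical helper written for this example and is not part of omnitrace, whose actual parsing may differ.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of omnitrace): expand a thread-ID
# specification such as "0 2-4" into the individual IDs it selects.
expand_tids() {
    local spec id lo hi
    local out=()
    for spec in "$@"; do
        if [[ "$spec" =~ ^([0-9]+)-([0-9]+)$ ]]; then
            # A range "lo-hi" selects every ID from lo through hi (inclusive)
            lo=${BASH_REMATCH[1]}
            hi=${BASH_REMATCH[2]}
            for ((id = lo; id <= hi; id++)); do
                out+=("$id")
            done
        elif [[ "$spec" =~ ^[0-9]+$ ]]; then
            # A single index selects exactly one thread
            out+=("$spec")
        else
            echo "invalid thread ID spec: $spec" >&2
            return 1
        fi
    done
    echo "${out[*]}"
}

# Main thread (0) plus the 2nd, 3rd, and 4th child threads
expand_tids 0 2-4
```

Here thread 1 (the first child thread) is absent from the expanded list, which matches the interpretation given in the help text.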
+
+The general syntax for separating omnitrace command-line arguments from the application arguments
+is consistent with the LLVM style of using a standalone double-hyphen (`--`). All arguments preceding the double-hyphen
+are interpreted as belonging to omnitrace and all arguments following the double-hyphen are interpreted as the
+application and its arguments. The double-hyphen is only necessary when passing command-line arguments to the target
+which also use hyphens. E.g. `omnitrace-sample ls` works but, in order to run `ls -la`, use `omnitrace-sample -- ls -la`.
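
As a rough illustration of the double-hyphen convention described above, the following sketch splits an argument list at the first standalone `--`. The function `split_args` and the arrays `tool_args` and `target_cmd` are hypothetical names chosen for this example; this is not omnitrace's argument parser, and it does not handle the case where the `--` is omitted.

```bash
#!/usr/bin/env bash
# Illustrative only: split an argument list at the first standalone "--".
# Everything before it is treated as tool options, everything after it as
# the target command and its arguments.
split_args() {
    tool_args=()
    target_cmd=()
    while (($# > 0)); do
        if [[ "$1" == "--" ]]; then
            shift
            # Everything after the first "--" belongs to the target,
            # including hyphenated arguments and any later "--"
            target_cmd=("$@")
            return 0
        fi
        tool_args+=("$1")
        shift
    done
}

split_args -P -T -- ls -la
echo "tool options : ${tool_args[*]}"
echo "target       : ${target_cmd[*]}"
```

Because the split happens at the first `--` only, hyphenated arguments such as `-la` reach the target untouched instead of being mistaken for tool options.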
+
+[Configuring Omnitrace Runtime](runtime.md) establishes the precedence of environment variable values over values specified in the configuration files. This enables
+the user to configure the omnitrace runtime to their preferred default behavior in a file such as `~/.omnitrace.cfg` and then easily override
+those settings via something like `OMNITRACE_ENABLED=OFF omnitrace-sample -- foo`.
+Similarly, the command line arguments passed to `omnitrace-sample` take precedence over environment variables.
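
The precedence described above (configuration file, then environment variable, then command-line argument) can be sketched as a simple "last non-empty value wins" rule. `resolve_setting` is a hypothetical helper used only for illustration, with an empty string standing in for "not specified"; it makes no claim about how omnitrace implements this internally.

```bash
#!/usr/bin/env bash
# Illustrative only: resolve one setting with the precedence
#   configuration file < environment variable < command-line argument.
# resolve_setting is a hypothetical helper, not an omnitrace function.
resolve_setting() {
    local config_value=$1 env_value=$2 cli_value=$3
    local result=$config_value
    # An environment variable, when present, overrides the config file
    if [[ -n "$env_value" ]]; then
        result=$env_value
    fi
    # A command-line argument, when present, overrides both
    if [[ -n "$cli_value" ]]; then
        result=$cli_value
    fi
    echo "$result"
}

resolve_setting ON "" ""    # only the configuration file sets a value
resolve_setting ON OFF ""   # the environment variable overrides the file
resolve_setting ON OFF ON   # the command-line argument overrides both
```

The three calls print `ON`, `OFF`, and `ON` respectively, mirroring the case of a default in `~/.omnitrace.cfg` that is overridden first by the environment and then by a command-line option.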
+
+All of the command-line options above correlate to one or more configuration settings, e.g. `--cpu-events` correlates to the `OMNITRACE_PAPI_EVENTS` configuration variable.
+After the command-line arguments to `omnitrace-sample` have been processed but before the target application is executed, `omnitrace-sample` will emit a log
+of which environment variables were set and/or modified.
+
+The snippet below shows the environment updates when `omnitrace-sample` is invoked with no options of its own:
+
+```console
+$ omnitrace-sample -- ./parallel-overhead-locks 30 4 100
+
+HSA_TOOLS_LIB=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
+HSA_TOOLS_REPORT_LOAD_FAILURE=1
+LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
+OMNITRACE_CRITICAL_TRACE=false
+OMNITRACE_USE_PROCESS_SAMPLING=false
+OMNITRACE_USE_SAMPLING=true
+OMP_TOOL_LIBRARIES=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
+ROCP_TOOL_LIB=/opt/omnitrace/lib/libomnitrace.so.1.7.1
+
+...
+```
+
+The snippet below shows the environment updates when `omnitrace-sample` enables profiling, tracing, host process-sampling, device process-sampling, and all the available backends:
+
+```console
+$ omnitrace-sample -PTDH -I all -- ./parallel-overhead-locks 30 4 100
+
+HSA_TOOLS_LIB=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
+HSA_TOOLS_REPORT_LOAD_FAILURE=1
+KOKKOS_PROFILE_LIBRARY=/opt/omnitrace/lib/libomnitrace.so.1.7.1
+LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
+OMNITRACE_CPU_FREQ_ENABLED=true
+OMNITRACE_CRITICAL_TRACE=false
+OMNITRACE_TRACE_THREAD_LOCKS=true
+OMNITRACE_TRACE_THREAD_RW_LOCKS=true
+OMNITRACE_TRACE_THREAD_SPIN_LOCKS=true
+OMNITRACE_USE_KOKKOSP=true
+OMNITRACE_USE_MPIP=true
+OMNITRACE_USE_OMPT=true
+OMNITRACE_USE_PERFETTO=true
+OMNITRACE_USE_PROCESS_SAMPLING=true
+OMNITRACE_USE_RCCLP=true
+OMNITRACE_USE_ROCM_SMI=true
+OMNITRACE_USE_ROCPROFILER=true
+OMNITRACE_USE_ROCTRACER=true
+OMNITRACE_USE_ROCTX=true
+OMNITRACE_USE_SAMPLING=true
+OMNITRACE_USE_TIMEMORY=true
+OMP_TOOL_LIBRARIES=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
+ROCP_TOOL_LIB=/opt/omnitrace/lib/libomnitrace.so.1.7.1
+
+...
+```
+
+The snippet below shows the environment updates when `omnitrace-sample` enables profiling, tracing, host process-sampling, device process-sampling,
+sets the output path to `omnitrace-output`, the output prefix to `%tag%` and disables all the available backends:
+
+```console
+$ omnitrace-sample -PTDH -E all -o omnitrace-output %tag% -- ./parallel-overhead-locks 30 4 100
+
+LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
+OMNITRACE_CPU_FREQ_ENABLED=true
+OMNITRACE_CRITICAL_TRACE=false
+OMNITRACE_OUTPUT_PATH=omnitrace-output
+OMNITRACE_OUTPUT_PREFIX=%tag%
+OMNITRACE_TRACE_THREAD_LOCKS=false
+OMNITRACE_TRACE_THREAD_RW_LOCKS=false
+OMNITRACE_TRACE_THREAD_SPIN_LOCKS=false
+OMNITRACE_USE_KOKKOSP=false
+OMNITRACE_USE_MPIP=false
+OMNITRACE_USE_OMPT=false
+OMNITRACE_USE_PERFETTO=true
+OMNITRACE_USE_PROCESS_SAMPLING=true
+OMNITRACE_USE_RCCLP=false
+OMNITRACE_USE_ROCM_SMI=false
+OMNITRACE_USE_ROCPROFILER=false
+OMNITRACE_USE_ROCTRACER=false
+OMNITRACE_USE_ROCTX=false
+OMNITRACE_USE_SAMPLING=true
+OMNITRACE_USE_TIMEMORY=true
+
+...
+```
+
+## omnitrace-sample Example
+
+```console
+$ omnitrace-sample -PTDH -E all -o omnitrace-output %tag% -c -- ./parallel-overhead-locks 30 4 100
+
+LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1
+OMNITRACE_CONFIG_FILE=
+OMNITRACE_CPU_FREQ_ENABLED=true
+OMNITRACE_CRITICAL_TRACE=false
+OMNITRACE_OUTPUT_PATH=omnitrace-output
+OMNITRACE_OUTPUT_PREFIX=%tag%
+OMNITRACE_TRACE_THREAD_LOCKS=false
+OMNITRACE_TRACE_THREAD_RW_LOCKS=false
+OMNITRACE_TRACE_THREAD_SPIN_LOCKS=false
+OMNITRACE_USE_KOKKOSP=false
+OMNITRACE_USE_MPIP=false
+OMNITRACE_USE_OMPT=false
+OMNITRACE_USE_PERFETTO=true
+OMNITRACE_USE_PROCESS_SAMPLING=true
+OMNITRACE_USE_RCCLP=false
+OMNITRACE_USE_ROCM_SMI=false
+OMNITRACE_USE_ROCPROFILER=false
+OMNITRACE_USE_ROCTRACER=false
+OMNITRACE_USE_ROCTX=false
+OMNITRACE_USE_SAMPLING=true
+OMNITRACE_USE_TIMEMORY=true
+
+[omnitrace][omnitrace_init_tooling] Instrumentation mode: Sampling
+
+
+      ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
+     /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
+    |  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
+    |  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
+    |  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
+     \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|
+
+
+[759.689]       perfetto.cc:55903 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
+
+[parallel-overhead-locks] Threads: 4
+[parallel-overhead-locks] Iterations: 100
+[parallel-overhead-locks] fibonacci(30)...
+[1] number of iterations: 100
+[2] number of iterations: 100
+[3] number of iterations: 100
+[4] number of iterations: 100
+[parallel-overhead-locks] fibonacci(30) x 4 = 394644873
+[parallel-overhead-locks] number of mutex locks = 400
+[omnitrace][107157][0][omnitrace_finalize]
+[omnitrace][107157][0][omnitrace_finalize] finalizing...
+[omnitrace][107157][0][omnitrace_finalize]
+[omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157 : 0.610427 sec wall_clock,    2.248 MB peak_rss,    2.265 MB page_rss, 2.560000 sec cpu_clock,  419.4 % cpu_util [laps: 1]
+[omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157/thread/0 : 0.608866 sec wall_clock, 0.000677 sec thread_cpu_clock,    0.1 % thread_cpu_util,    2.248 MB peak_rss [laps: 1]
+[omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157/thread/1 : 0.608237 sec wall_clock, 0.603553 sec thread_cpu_clock,   99.2 % thread_cpu_util,    2.204 MB peak_rss [laps: 1]
+[omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157/thread/2 : 0.601430 sec wall_clock, 0.598378 sec thread_cpu_clock,   99.5 % thread_cpu_util,    1.156 MB peak_rss [laps: 1]
+[omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157/thread/3 : 0.570223 sec wall_clock, 0.568713 sec thread_cpu_clock,   99.7 % thread_cpu_util,    0.772 MB peak_rss [laps: 1]
+[omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157/thread/4 : 0.557637 sec wall_clock, 0.557198 sec thread_cpu_clock,   99.9 % thread_cpu_util,    0.156 MB peak_rss [laps: 1]
+[omnitrace][107157][0][omnitrace_finalize]
+[omnitrace][107157][0][omnitrace_finalize] Finalizing perfetto...
+[omnitrace][107157][perfetto]> Outputting '/home/user/data/omnitrace-output/2022-10-19_02.46/parallel-overhead-locksperfetto-trace-107157.proto' (842.90 KB / 0.84 MB / 0.00 GB)... Done
+[omnitrace][107157][trip_count]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockstrip_count-107157.json'
+[omnitrace][107157][trip_count]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockstrip_count-107157.txt'
+[omnitrace][107157][sampling_percent]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_percent-107157.json'
+[omnitrace][107157][sampling_percent]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_percent-107157.txt'
+[omnitrace][107157][sampling_cpu_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_cpu_clock-107157.json'
+[omnitrace][107157][sampling_cpu_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_cpu_clock-107157.txt'
+[omnitrace][107157][sampling_wall_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_wall_clock-107157.json'
+[omnitrace][107157][sampling_wall_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_wall_clock-107157.txt'
+[omnitrace][107157][wall_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockswall_clock-107157.json'
+[omnitrace][107157][wall_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockswall_clock-107157.txt'
+[omnitrace][107157][metadata]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-locksmetadata-107157.json' and 'omnitrace-output/2022-10-19_02.46/parallel-overhead-locksfunctions-107157.json'
+[omnitrace][107157][0][omnitrace_finalize] Finalized
+[761.584]       perfetto.cc:57382 Tracing session 1 ended, total sessions:0
+```
@@ -1,4 +1,4 @@
-# Setup
+# Setup and Validation

 ```eval_rst
 .. toctree::
@@ -8,13 +8,13 @@

 ## Configuring Environment

-Source the `setup-env.sh` script to prefix the `PATH`, `LD_LIBRARY_PATH`, etc. environment variables:
+Once omnitrace is installed, source the `setup-env.sh` script to prefix the `PATH`, `LD_LIBRARY_PATH`, etc. environment variables:

 ```bash
 source /opt/omnitrace/share/omnitrace/setup-env.sh
 ```

-Alternatively, if environment modules are supported, add the `<prefix>/share/modulefiles` directory to `MODULEPATH` via:
+Alternatively, if environment modules are supported, add the `<prefix>/share/modulefiles` directory to `MODULEPATH`:

 ```bash
 module use /opt/omnitrace/share/modulefiles
@@ -38,6 +38,12 @@ If all the following commands execute successfully with output, then you are rea
 ```bash
 which omnitrace
 which omnitrace-avail
+which omnitrace-sample
 omnitrace --help
 omnitrace-avail --all
+omnitrace-sample --help
+
+# if built with python support
+which omnitrace-python
+omnitrace-python --help
 ```
@@ -580,7 +580,7 @@ omnitrace_finalize_hidden(void)
        return;
    }

-    OMNITRACE_VERBOSE_F(0, "\n");
+    if(get_verbose() >= 0 || get_debug()) fprintf(stderr, "\n");
    OMNITRACE_VERBOSE_F(0, "finalizing...\n");

    sampling::block_samples();