Lightweight, front-end library for [libomnitrace](#libomnitrace-sourcelibomnitrace) which serves 3 primary purposes:
1. Dramatically speeds up instrumentation time vs. using [libomnitrace](#libomnitrace-sourcelibomnitrace) directly, since Dyninst must parse the entire library in order to find the instrumentation functions ([libomnitrace](#libomnitrace-sourcelibomnitrace) is dlopen'ed when the instrumentation functions get called)
2. Prevents re-entry if [libomnitrace](#libomnitrace-sourcelibomnitrace) calls an instrumented function internally
3. Coordinates communication between [libomnitrace-user](#libomnitrace-user-sourcelibomnitrace-user) and [libomnitrace](#libomnitrace-sourcelibomnitrace)
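
The re-entry protection in item 2 can be illustrated with a thread-local guard. This is a minimal sketch of the general technique, not the libomnitrace-dl implementation; `inside_backend`, `backend_calls`, `reentry_guard`, and `traced_call` are hypothetical names introduced only for illustration.

```cpp
#include <utility>

// Hypothetical per-thread flag: true while this thread is inside the tracing backend.
thread_local bool inside_backend = false;

// Hypothetical counter standing in for work forwarded to the backend.
thread_local int backend_calls = 0;

// RAII guard: acquires the flag only if the thread is not already inside the backend.
class reentry_guard
{
public:
    reentry_guard()
    : m_owner{ !inside_backend }
    {
        if(m_owner) inside_backend = true;
    }

    ~reentry_guard()
    {
        if(m_owner) inside_backend = false;
    }

    reentry_guard(const reentry_guard&) = delete;
    reentry_guard& operator=(const reentry_guard&) = delete;

    // true when this guard acquired the flag, i.e. the call is not re-entrant
    explicit operator bool() const { return m_owner; }

private:
    bool m_owner;
};

// Forwards to the backend unless the call originates from the backend itself.
template <typename FuncT>
void traced_call(FuncT&& backend_func)
{
    reentry_guard guard{};
    if(!guard) return;  // re-entrant call: skip instrumentation
    ++backend_calls;
    std::forward<FuncT>(backend_func)();
}
```

If the backend function itself triggers `traced_call` on the same thread, the nested call returns immediately because the guard does not own the flag.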
Provides a set of functions and types for users to add to their code, e.g. disabling data collection globally or on a specific thread,
user-defined regions, etc. If [libomnitrace-dl](#libomnitrace-dl-sourcelibomnitrace-dl) is not loaded, the user API calls are effectively
no-ops.
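
One common way to get this "no-op unless the front end is loaded" behavior is a table of function pointers that stays null until another library installs callbacks. The sketch below shows that pattern only; `user_callbacks`, `get_callbacks`, `user_push_region`, `user_pop_region`, `region_log`, and `install_callbacks` are hypothetical names and are not the omnitrace user API.

```cpp
#include <string>
#include <vector>

// Hypothetical callback table; every entry is null until a front end installs it.
struct user_callbacks
{
    int (*push_region)(const char*) = nullptr;
    int (*pop_region)(const char*)  = nullptr;
};

// Returns the process-wide callback table.
inline user_callbacks& get_callbacks()
{
    static user_callbacks instance{};
    return instance;
}

// User-facing calls: forward when a callback is installed, otherwise do nothing.
inline int user_push_region(const char* name)
{
    auto func = get_callbacks().push_region;
    return (func) ? func(name) : 0;
}

inline int user_pop_region(const char* name)
{
    auto func = get_callbacks().pop_region;
    return (func) ? func(name) : 0;
}

// Stand-in for the data a real backend would record.
inline std::vector<std::string>& region_log()
{
    static std::vector<std::string> log{};
    return log;
}

// Stand-in for what a front-end library would do once it is loaded.
inline void install_callbacks()
{
    get_callbacks().push_region = [](const char* name) {
        region_log().push_back(std::string{ "push:" } + name);
        return 0;
    };
    get_callbacks().pop_region = [](const char* name) {
        region_log().push_back(std::string{ "pop:" } + name);
        return 0;
    };
}
```

Before `install_callbacks()` runs, `user_push_region("compute")` costs one pointer check and records nothing; afterwards the same call is forwarded to the installed callback.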
## Concepts
### Component
Most measurements and capabilities are encapsulated into a "component" with the following definitions:
- Measurement: recording of some data relevant to performance, e.g. current call-stack, hardware counter values, current memory usage, timestamp
- Capability: handles the implementation or orchestration of some feature which is used to collect measurements, e.g. a component which handles setting up function wrappers around various functions such as `pthread_create`, `MPI_Init`, etc.
Components are designed to hold no data at all or only the data for both an instantaneous measurement and a phase measurement.
Components which store data typically implement a static `record()` function (for getting a record of the measurement),
`start()` + `stop()` member functions for calculating a phase measurement, and a `sample()` member function for storing an
instantaneous measurement. In reality, there are several more "standard" functions but these are the most often used ones.
Components which do not store data may also have `start()`, `stop()`, and `sample()` functions but for components which
implement function wrappers, they typically provide a call operator or `audit(...)` functions which are invoked with the
wrappee function's arguments before the wrappee gets called and with the return value after the wrappee gets called.
***The goal of this design is to provide relatively small and reusable lightweight objects for recording measurements
and/or implementing capabilities.***
#### Wall-Clock Component Example
A component for computing the elapsed wall-clock time looks like this:
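
As a minimal sketch under the conventions described above (a static `record()`, `start()` + `stop()` for a phase measurement, and `sample()` for an instantaneous one), a wall-clock component could be written as follows. The struct layout, the `get()` and `get_accum()` accessors, and the choice of `std::chrono::steady_clock` are assumptions made for illustration, not the actual omnitrace or timemory definition.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical component following the record()/start()/stop()/sample() conventions.
struct wall_clock
{
    using value_type = std::int64_t;

    // Record of the measurement: current steady-clock time in nanoseconds.
    static value_type record() noexcept
    {
        auto now = std::chrono::steady_clock::now().time_since_epoch();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(now).count();
    }

    // Begin a phase measurement.
    void start() noexcept { m_value = record(); }

    // End a phase measurement: store the elapsed time since start() and accumulate it.
    void stop() noexcept
    {
        m_value = record() - m_value;
        m_accum += m_value;
    }

    // Store an instantaneous measurement.
    void sample() noexcept { m_value = record(); }

    // Most recent measurement (elapsed time after stop(), timestamp after sample()).
    value_type get() const noexcept { return m_value; }

    // Sum of all completed phase measurements.
    value_type get_accum() const noexcept { return m_accum; }

private:
    value_type m_value = 0;
    value_type m_accum = 0;
};
```

The object holds only two integers, which matches the stated goal of small, reusable measurement objects.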
The omnitrace-specific implementation can be found in [source/lib/omnitrace/library/sampling.cpp](https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/sampling.cpp).
Within [sampling.cpp](https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/sampling.cpp), you will find a bundle of 3 sampling components:
The first component [backtrace_timestamp](https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/components/backtrace_timestamp.hpp) simply
The second component [backtrace](https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/components/backtrace.hpp) records the call-stack via libunwind.
The last component [backtrace_metrics](https://github.com/ROCm/omnitrace/blob/main/source/lib/omnitrace/library/components/backtrace_metrics.hpp) is responsible for recording the
metrics for that sample, e.g. peak RSS, HW counters, etc. These 3 components are bundled together in a tuple-like struct (e.g. `tuple<backtrace_timestamp, backtrace, backtrace_metrics>`) and
a buffer of at least 1024 instances of this tuple is mmap'ed per-thread. When this buffer is full, before taking the next sample, the sampler will hand the buffer
off to its allocator thread and mmap a new buffer. The allocator thread takes this data and either dynamically stores it in memory or writes it to a file depending on the value of `OMNITRACE_USE_TEMPORARY_FILES`.
This schema avoids all allocations in the signal handler, allows the data to grow dynamically, avoids potentially slow I/O within the signal handler, and also enables the capability to avoid I/O altogether.
The maximum number of samplers handled by each allocator is governed by the `OMNITRACE_SAMPLING_ALLOCATOR_SIZE` setting (the default is 8); whenever an allocator has reached its limit,
a new internal thread is created to handle the new samplers.
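
The hand-off logic described above can be sketched as a fixed-capacity buffer that passes its full storage to a sink and starts a fresh one. This is an illustration of the control flow only: `sample_t`, `sample_buffer`, and the sink callback are hypothetical names, the sink stands in for the allocator thread, and `std::make_unique` and `std::function` stand in for the mmap-based storage (unlike mmap, they are not safe to use inside a signal handler).

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical sample type standing in for
// tuple<backtrace_timestamp, backtrace, backtrace_metrics>.
struct sample_t
{
    long timestamp = 0;
};

template <std::size_t N>
class sample_buffer
{
public:
    using storage_t = std::array<sample_t, N>;
    // The sink receives ownership of a buffer plus the number of valid entries.
    using sink_t = std::function<void(std::unique_ptr<storage_t>, std::size_t)>;

    explicit sample_buffer(sink_t sink)
    : m_sink{ std::move(sink) }
    , m_data{ std::make_unique<storage_t>() }
    {}

    // Before storing a sample: if the buffer is full, hand it off and start a new one.
    void push(const sample_t& value)
    {
        if(m_size == N) flush();
        (*m_data)[m_size++] = value;
    }

    // Hand the current contents to the sink and replace the storage.
    void flush()
    {
        if(m_size == 0) return;
        auto full = std::exchange(m_data, std::make_unique<storage_t>());
        m_sink(std::move(full), std::exchange(m_size, 0));
    }

private:
    sink_t                     m_sink;
    std::unique_ptr<storage_t> m_data;
    std::size_t                m_size = 0;
};
```

Because the sink takes ownership of the full buffer, the producer never copies samples or waits on I/O; what happens to the data afterwards (keep it in memory or write it to a file) is entirely the sink's decision.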
Recently, with the introduction of tracing delay/duration/etc., the [constraint namespace](https://github.com/ROCm/omnitrace/blob/main/source/lib/core/constraint.hpp)
was introduced to improve the management of delays and/or duration limits of data collection. The `spec` class takes a clock identifier, a delay value, a duration value, and an
integer indicating how many times to repeat the delay + duration. Thus, it is possible to perform tasks such as periodically enabling tracing for brief periods
of time in between long periods without data collection during the application, e.g. `OMNITRACE_TRACE_PERIODS = realtime:10:1:5 process_cputime:10:2:20` would configure
five periods of 10 seconds of realtime without data collection, each followed by 1 second of data collection, plus twenty periods of 10 seconds
of process CPU time without data collection, each followed by 2 CPU-time seconds of data collection.
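
To make the four fields concrete, the sketch below parses a value such as the one in the example into clock, delay, duration, and repeat. It assumes the field order shown in that example and that all four fields are present; `period_spec`, `parse_spec`, and `parse_periods` are hypothetical names, and this is not the parser used by the `constraint` namespace.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical mirror of the fields described above: <clock>:<delay>:<duration>:<repeat>
struct period_spec
{
    std::string   clock;
    double        delay    = 0.0;  // seconds without data collection
    double        duration = 0.0;  // seconds of data collection
    unsigned long repeat   = 1;    // number of delay + duration cycles
};

// Parses one colon-separated entry, e.g. "realtime:10:1:5".
inline period_spec parse_spec(const std::string& entry)
{
    std::vector<std::string> fields{};
    std::istringstream       iss{ entry };
    for(std::string field; std::getline(iss, field, ':');)
        fields.push_back(field);

    if(fields.size() != 4)
        throw std::invalid_argument("expected <clock>:<delay>:<duration>:<repeat>: " + entry);

    period_spec spec{};
    spec.clock    = fields[0];
    spec.delay    = std::stod(fields[1]);
    spec.duration = std::stod(fields[2]);
    spec.repeat   = std::stoul(fields[3]);
    return spec;
}

// Parses a whitespace-separated list of entries.
inline std::vector<period_spec> parse_periods(const std::string& value)
{
    std::vector<period_spec> specs{};
    std::istringstream       iss{ value };
    for(std::string entry; iss >> entry;)
        specs.push_back(parse_spec(entry));
    return specs;
}
```

Under these assumptions, parsing `realtime:10:1:5 process_cputime:10:2:20` yields two specs: the first covers 5 × (10 + 1) = 55 seconds of realtime, the second 20 × (10 + 2) = 240 seconds of process CPU time.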
Eventually, the goal is to have all subsets of data collection which currently support more rudimentary models of time window constraints, such as process sampling and causal profiling,