2
0
Ficheiros
rocm-systems/source/docs/user_api.md
T
Jonathan R. Madsen 9618ddefba Causal profiling (#229)
* Addition of basic structure

* Reworked categories

* More causal integration additions

* Causal implementation

* Update examples

* delete virtual_speedup files

* Update perfetto submodule to v31.0

* Update dyninst submodule

* Update timemory submodule

* ElfUtils build for libdw

* OMNITRACE_LIKELY and OMNITRACE_UNLIKELY

* Update common lib join

* Examples updates for causal profiling

* config updates with causal options

- OMNITRACE_CAUSAL_FIXED_LINE
- OMNITRACE_CAUSAL_FIXED_SPEEDUP
- OMNITRACE_CAUSAL_FILE
- OMNITRACE_CAUSAL_BINARY_SCOPE
- OMNITRACE_CAUSAL_SOURCE_SCOPE
- version info in banner
- support increments in parse_numeric_range
- fix occasional deadlock in first call to get_config

* PTL general task group

* Always include PID in debug/verbose messages

* Add blocking/unblocking gotchas to runtime init bundle

* CausalState

* thread_data updates

- generic component_bundle_cache

* Improve handling of causal in category_region

* components updates

- backtrace_causal component
- backtrace::get_data member func
- decrease ignore_depth in backtrace::sample(int)
- handle "omnitrace_main" in backtrace::filter_and_patch(...)
- tweak internal thread state scope for pthread_mutex_gotcha wrappers

* simplify tracing get_instrumentation_bundles usage

* sampling updates

- include backtrace_causal component
- disable backtrace_metrics if using causal and not using perfetto
- disable backtrace and backtrace_timestamp when using causal
- post_process_causal

* causal updates

- more checks in blocking_gotcha and unblocking_gotcha start/stop
- miscellaneous overhaul of data
- experiment update

* Remove virtual speedup

* libomnitrace code_object

* causal-profiling test

* libomnitrace library.cpp updates

- handle causal profiling
- fini_bundle

* Disable causal profiling by default

* Updated causal code and example

- example: three execution variants: cpu + rng, cpu, rng
- example: three instrumentation variants: none, omni, coz
- fix blocking gotcha credit
- rework perform_experiment_impl
- get_eligible_address_ranges
- compute_eligible_lines
- support fixed lines/speedups/functions
- update selected_entry to support function mode
- fix causal::delay
- experiment updates

* omnitrace_progress / omnitrace_user_progress

- with accompanying omnitrace_annotated_progress / omnitrace_user_annotated_progress

* Update timemory submodule

* CausalMode

- mode indicated whether causal predictions source be at line-level or function-level

* code_object, config, runtime, sampling, thread_data

- code_object: address_range
- code_object: basic::line_info serialize(), name(), hash()
- config updates
- two signals for causal sampling
- thread_data init fixes

* pthread updates

- pthread_create_gotcha processes delays
- pthread_mutex_gotcha does not wrap pthread_join in causal mode

* backtrace_causal update

- dynamic delay period stats

* main wrapper uses basename of argv[0]

* update elfio submodule

* perf support (currently unused)

* Fix experiment JSON serialization

- static_vector.hpp (unused)

* causal executable + config options updates

- omnitrace-causal exe simplifies running multiple causal configs
- changed the causal config option names

* Support both throughput and latency points

* process-causal-json.py script

- will be used later for testing

* stable_vector

* Rework thread_data

* Improve omnitrace-causal exe

- better verbosity handling
- correct diagnosis of status for child process
- execvpe when only one iteration (debugging)

* Update timemory submodule

* exe --version

- omnitrace, omnitrace-avail, and omnitrace-sample all support --version on command-line

* OMNITRACE_INTERNAL_API + OMNITRACE_{LIKELY,UNLIKELY}

* omnitrace-causal cmake format

* omnitrace config update

- OMNITRACE_CAUSAL_FILE_CLOBBER

* custom exception

- wraps STL exception and gets stacktrace during construction

* exit_gotcha supports _Exit

* use global construct_on_init + max threads

- add some safety when exceeding max # of threads

* update code_object binary filter

- exclude dyninst and tbbmalloc library

* containers: c_array, static_vector, stable_vector

- moved utility::c_array to container::c_array
- created static_vector: std::vector bound to std::array
- created stable_vector: vector with stable references

* grow thread_data when new thread created

* causal updates

- data: improve compute_eligible_lines to ignore lambdas
- data: use new thread_data
- delay: use new thread_data
- experiment: properly support latency points
- experiment: support file clobber
- experiment: ensure non-zero experiment time
- progress_point: use new thread_data
- backtrace_causal: use new thread_data

* Update causal-profiling tests

* fix omnitrace-causal backslash escaping

* process-causal-json script

* restructure causal implementation

- update verbose messages for omnitrace-causal diagnose_status
- migrated causal implementation in sampling.cpp to causal/sampling.cpp
- OMNITRACE_USE_CAUSAL does not require OMNITRACE_USE_SAMPLING
- added Mode::Causal
- causal sampling uses same signals as regular sampling
- moved tracing::thread_init to implementation file
- combined tracing::thread_init and tracing::thread_init_sampling
- added causal/components folder
- pthread_create_gotcha::wrapper_config
- omnitrace_preload checks OMNITRACE_USE_CAUSAL
  - updates mode accordingly

* update timemory submodule

* update timemory submodule

* causal example updates

- causal for lulesh

* perf code + utility - helpers

- relocated causal perf code
- placement new when generating unique ptr trait for potentially allocating during sampling
- additions to utility header
- removed previously added helpers.hpp

* update timemory submodule

* Default env variables for omnitrace-causal

- activate OMNITRACE_USE_KOKKOSP, etc.

* update stable_vector and static_vector

- static vector can use atomic for size tracking for thread-safe situations

* update causal example header

- CAUSAL_PROGRESS_NAMED
- use CAUSAL_ prefix for some macros

* Tweak lulesh example

- use CAUSAL_PROGRESS instead of CAUSAL_BEGIN and CAUSAL_END

* omnitrace-sample support for causal mode

- set OMNITRACE_USE_SAMPLING to off when OMNITRACE_MODE=causal

* refactor and cleanup code_object

- scope filter
- fixes to address_range

* overhaul causal data + causal config options

- full support for function and line mode
- support static vector of instruction pointers
- improve line info mapping resolution
- remove thread-locality from miscellanous functions where unnecessary
- causal options for {binary,source,function,fileline} exclusion

* causal experiment, sampling, and backtrace updates

- is_selected + unwind address array
- experiment warning about progress points
- increased buffer size for backtrace_casual sampler
- backtrace_causal only stores IP addresses instead of full unwind info

* category_region updates

- minor refactor
- local_category_region::mark

* Update causal tests

* Bump version to 1.8.0

* omnitrace-causal args + CLOBBER -> RESET

- renamed OMNITRACE_CAUSAL_FILE_CLOBBER to OMNITRACE_CAUSAL_FILE_RESET
- updated omnitrace-causal exe to support recently added configuration options
- other miscellaneous tweaks to data.cpp, experiment.cpp, and sampling.cpp

* Refactor causal and code_object

- code_object.hpp and code_object.cpp moved into binary folder
- causal components namespaced into omnitrace::causal::component
- moved sample_data out of backtrace_causal and into own file
- renamed backtrace_causal to causal::component::backtrace

* preload omnitrace_init + OMNITRACE_DEBUG_MARK

- env OMNITRACE_DEBUG_MARK
- fix omnitrace_init call when LD_PRELOAD-ing omnitrace

* Fix fileline support + line-info output names + experiment log

- line-info log files are prefixed with experiment name
- don't print experiment duration when E2E
- account for fileline scope in analysis

* KokkosP: OMNITRACE_KOKKOSP_NAME_LENGTH_MAX

- config option to limit the name of kokkos tool callbacks
- remove [kokkos] from KokkosP names

* Update causal example

- minor tweaks to decrease probability of overlapping regions in binary

* omnitrace-causal update

- prefix N / Ntot in environment printout

* Miscellaneous updates

- causal::finish_experimenting()
- OMNITRACE_CAUSAL_RANDOM_SEED
- KokkosP causal updates
  - exclude some callbacks, make some callbacks unique, etc.
- address_range::operator+=(address_range)
- combine contiguous ranges in binary/analysis.cpp when file, func, line is same and address range is contiguous
- bfd_line_info reads inline info
- wait for perform_experiment_impl to complete
- causal::delay updates
  - delay::process checks if experiment is active
  - uses threading::get_id()
- experiment scales duration up for larger speedup experiments
- line info samples includes excluded lines
- sampler uses CLOCK_REALTIME
- blocking_gotcha updates
  - is no longer fully static
  - adds audit routine which sets the postblock value to zero if try/timed routine fails
- category::host was added to causal_throughput_categories_t
- pthread_create_gotcha sets new threads local parent delay
  - was using internal value, now uses sequent value

* Causal improvements to KokkosP

* Updates to experiment time scaling

- use stats instead of just max

* binary/link_map.{hpp,cpp}

* update process-causal-json.py

* Folded fileline scope into source scope

* Update documentation

- Add documentation for causal profiling
- Replace 'Omnitrace' with 'OmniTrace' everywhere

* Update causal-helpers.cmake + omnitrace-testing.cmake

- split tests/CMakeLists.txt partially into omnitrace-testing.cmake

* omnitrace/causal.h

- OMNITRACE_CAUSAL_PROGRESS
- OMNITRACE_CAUSAL_PROGRESS_NAMED
- OMNITRACE_CAUSAL_BEGIN
- OMNITRACE_CAUSAL_END

* selected_entry + remove default filters for lambdas and operator()

- selected entry stores range and binary load address

* update process-causal-json.py

* format examples/lulesh/CMakeLists.txt

* causal-helpers find_package(Threads)

* OMNITRACE_KOKKOSP_KERNEL_LOGGER

- was OMNITRACE_KOKKOS_KERNEL_LOGGER

* quiet find of coz-profiler

* Fix rocm_smi exception handling

* Update timemory submodule (binutils)

- fix binutls compile error on some systems
- bump binutils to v2.40

* Fix miscellaneous tests

* OMNITRACE_KOKKOSP_PREFIX

* revert rocm_smi handling

* ElfUtils updates

- default to download version 0.188
- add -Wno-error=null-dereference due to GCC 12 compiler error

* Update causal example

* Remove OMNITRACE_VERBOSE from global workflow envs

* Reliable causal test

* disable compilation of causal perf files

* Remove set_current_selection with unwind stack

* update timemory submodule

* fix for segfault on bionic

- locking in TLS dtor was causing segfault

* remove experiment::is_selected(unwind_stack_t)

* update default init of selected_entry

* Fix for when IP is not offset by load address

* Update CMakeLists.txt

* Miscellaneous updates

- OMNITRACE_WARNING_OR_CI_THROW
- OMNITRACE_REQUIRE
- OMNITRACE_PREFER
- fixed issues with no ASLR
-  added load address variable and ipaddr() func to basic/bfd line info
- removed get_basic() from dwarf_line_info
- TIMEMORY_PREFER -> OMNITRACE_PREFER
- removed previously added binary_address and range variables from selected_entry

* Removed superfluous CausalState

* Additional causal tests (lulesh + kokkos)

* filter, prefer, analysis ASLR handling

- removed default filter on cold functions
- fixed OMNITRACE_PREFER
- fixed analysis ASLR handling

* Tweak line-info output

* Removed some superfluous code

- causal/delay
- causal/selected_entry

* Exclude main.cold in function mode

* Update validate-perfetto-proto.py

- account for occasional http errors

* Add sampling test disabling tmp files

* argparser for process-causal-json

- support validation
- support filtering

* Avoid pthread_{lock,unlock} in sampling offload

- use homemade atomic_mutex/atomic_lock since contention will be low and using pthread tools might trigger our wrappers

* Rename process-causal-json.py

- validate-causal-json.py

* rework omnitrace_add_causal_test

- capable of performing validation
- added validation tests

* Fix kokkosp_begin_deep_copy + causal

* Tweak address range in bfd_line_info::read_pc

* Tweak analysis and data IP handling

- look for gaps

* Disable scaling experiment time by speedup

* Revert change in max threads during CI

* binary updates

- significant overhaul of binary analysis implementation
- removed "basic_line_info" and "bfd_line_info" in lieu of "symbol" class
  - symbol class has basic BFD info + vector of inlines + vector of dwarf info

* Updated causal to use new binary analysis

- Fix symbol.cpp includes

* Updated formatting target

- include *.cmake files

* Updated causal tests

- causal tests should be stable now

* Update timemory and dyninst submodules

- TPLs are stripped + built w/o debug info

* Increase tolerance for causal validation speedups

- higher speedups have more variance (increased to +/- 5 from 3)

* Support causal output for MPI

- i.e. tag with MPI rank

* omnitrace-causal launcher argument

* improve experiment sampling output

* causal data updates

- call compute lines once
- fixed filtered cached binary info
- debugging info when experiment fails to start

* Tweaked causal validation tests

* dwarf_entry ranges

* CI updates

- increase max threads to 64

* Tweak causal E2E validation tests

- more threads
- shorter thread runtime
- more iterations

* Fix shadowed variable

* fix symbol read_bfd last PC calculation

* fix maybe-uninitialized warning

* omnitrace-causal launcher update

- only inject "omnitrace-causal --" once
- throw error if no matches found

* Update causal profiling docs for launcher

* fix address range boundaries
2023-01-24 18:53:23 -06:00

18 KiB

User API

.. doxygenfile:: omnitrace/types.h
.. doxygenfile:: omnitrace/categories.h
.. doxygenfile:: omnitrace/user.h
.. doxygenfile:: omnitrace/causal.h

By default, when omnitrace detects any omnitrace_user_start_* or omnitrace_user_stop_* function, instrumentation is disabled at start-up -- thus, omnitrace_user_stop_trace() is not required at the beginning of main. This is can be manually controlled via the OMNITRACE_INIT_ENABLED environment variable. User-defined regions are always recorded, regardless of whether whether omnitrace_user_start_* or omnitrace_user_stop_* has been called.

Example

Compilation

CMake

find_package(omnitrace REQUIRED COMPONENTS user)

add_executable(foo foo.cpp)

target_link_libraries(foo PRIVATE omnitrace::omnitrace-user-library)

General

Assuming omnitrace installed in /opt/omnitrace:

g++ -I/opt/omnitrace foo.cpp -o foo -lomnitrace-user

User API Implementation

#include <omnitrace/categories.h>
#include <omnitrace/types.h>
#include <omnitrace/user.h>

#include <atomic>
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <sstream>
#include <thread>
#include <vector>

std::atomic<long> total{ 0 };

long
fib(long n) __attribute__((noinline));

void
run(size_t nitr, long) __attribute__((noinline));

int
custom_push_region(const char* name);

namespace
{
omnitrace_user_callbacks_t custom_callbacks   = OMNITRACE_USER_CALLBACKS_INIT;
omnitrace_user_callbacks_t original_callbacks = OMNITRACE_USER_CALLBACKS_INIT;
}  // namespace

int
main(int argc, char** argv)
{
    custom_callbacks.push_region = &custom_push_region;
    omnitrace_user_configure(OMNITRACE_USER_UNION_CONFIG, custom_callbacks,
                             &original_callbacks);

    omnitrace_user_push_region(argv[0]);
    omnitrace_user_push_region("initialization");
    size_t nthread = std::min<size_t>(16, std::thread::hardware_concurrency());
    size_t nitr    = 50000;
    long   nfib    = 10;
    if(argc > 1) nfib = atol(argv[1]);
    if(argc > 2) nthread = atol(argv[2]);
    if(argc > 3) nitr = atol(argv[3]);
    omnitrace_user_pop_region("initialization");

    printf("[%s] Threads: %zu\n[%s] Iterations: %zu\n[%s] fibonacci(%li)...\n", argv[0],
           nthread, argv[0], nitr, argv[0], nfib);

    omnitrace_user_push_region("thread_creation");
    std::vector<std::thread> threads{};
    threads.reserve(nthread);
    // disable instrumentation for child threads
    omnitrace_user_stop_thread_trace();
    for(size_t i = 0; i < nthread; ++i)
    {
        threads.emplace_back(&run, nitr, nfib);
    }
    // re-enable instrumentation
    omnitrace_user_start_thread_trace();
    omnitrace_user_pop_region("thread_creation");

    omnitrace_user_push_region("thread_wait");
    for(auto& itr : threads)
        itr.join();
    omnitrace_user_pop_region("thread_wait");

    run(nitr, nfib);

    printf("[%s] fibonacci(%li) x %lu = %li\n", argv[0], nfib, nthread, total.load());
    omnitrace_user_pop_region(argv[0]);

    return 0;
}

long
fib(long n)
{
    return (n < 2) ? n : fib(n - 1) + fib(n - 2);
}

#define RUN_LABEL                                                                        \
    std::string{ std::string{ __FUNCTION__ } + "(" + std::to_string(n) + ") x " +        \
                 std::to_string(nitr) }                                                  \
        .c_str()

void
run(size_t nitr, long n)
{
    omnitrace_user_push_region(RUN_LABEL);
    long local = 0;
    for(size_t i = 0; i < nitr; ++i)
        local += fib(n);
    total += local;
    omnitrace_user_pop_region(RUN_LABEL);
}

int
custom_push_region(const char* name)
{
    if(!original_callbacks.push_region || !original_callbacks.push_annotated_region)
        return OMNITRACE_USER_ERROR_NO_BINDING;

    printf("Pushing custom region :: %s\n", name);

    if(original_callbacks.push_annotated_region)
    {
        int32_t _err = errno;
        char*   _msg = nullptr;
        char    _buff[1024];
        if(_err != 0) _msg = strerror_r(_err, _buff, sizeof(_buff));

        omnitrace_annotation_t _annotations[] = {
            { "errno", OMNITRACE_INT32, &_err }, { "strerror", OMNITRACE_STRING, _msg }
        };

        errno = 0;  // reset errno
        return (*original_callbacks.push_annotated_region)(
            name, _annotations, sizeof(_annotations) / sizeof(omnitrace_annotation_t));
    }

    return (*original_callbacks.push_region)(name);
}

User API Output

$ omnitrace -l --min-instructions=8 -E custom_push_region -o -- ./user-api
...
$ export OMNITRACE_USE_TIMEMORY=ON
$ export OMNITRACE_USE_PID=OFF
$ export OMNITRACE_TIME_OUTPUT=OFF
$ export OMNITRACE_OUTPUT_PATH=omnitrace-example-output
$ ./user-api.inst 20 4 100
Pushing custom region :: ./user-api.inst
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Trace


      ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
     /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
    |  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
    |  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
    |  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
     \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|



Pushing custom region :: initialization
[./user-api.inst] Threads: 4
[./user-api.inst] Iterations: 100
[./user-api.inst] fibonacci(20)...
Pushing custom region :: thread_creation
Pushing custom region :: thread_wait
Pushing custom region :: run(20) x 100
Pushing custom region :: run(20) x 100
Pushing custom region :: run(20) x 100
Pushing custom region :: run(20) x 100
Pushing custom region :: run(20) x 100
[./user-api.inst] fibonacci(20) x 4 = 3382500
[omnitrace][86267][0][omnitrace_finalize] finalizing...


[omnitrace][86267][0] omnitrace : 5.190895 sec wall_clock,    2.748 mb peak_rss, 6.330000 sec cpu_clock,  121.9 % cpu_util [laps: 1]
[omnitrace][86267][0] user-api.inst/thread-0 : 5.078713 sec wall_clock, 4.722415 sec thread_cpu_clock,   93.0 % thread_cpu_util,    1.276 mb peak_rss [laps: 1]
[omnitrace][86267][0] user-api.inst/thread-1 : 0.322248 sec wall_clock, 0.322191 sec thread_cpu_clock,  100.0 % thread_cpu_util,    1.000 mb peak_rss [laps: 1]
[omnitrace][86267][0] user-api.inst/thread-2 : 0.323255 sec wall_clock, 0.323194 sec thread_cpu_clock,  100.0 % thread_cpu_util,    0.000 mb peak_rss [laps: 1]
[omnitrace][86267][0] user-api.inst/thread-3 : 0.323569 sec wall_clock, 0.323484 sec thread_cpu_clock,  100.0 % thread_cpu_util,    1.092 mb peak_rss [laps: 1]
[omnitrace][86267][0] user-api.inst/thread-4 : 0.324178 sec wall_clock, 0.324057 sec thread_cpu_clock,  100.0 % thread_cpu_util,    1.184 mb peak_rss [laps: 1]
[omnitrace][86267][0] Post-processing 51 cpu frequency and memory usage entries...

[omnitrace][wall_clock]|0> Outputting 'omnitrace-user-api.inst-output/wall_clock.json'...
[omnitrace][wall_clock]|0> Outputting 'omnitrace-user-api.inst-output/wall_clock.tree.json'...
[omnitrace][wall_clock]|0> Outputting 'omnitrace-user-api.inst-output/wall_clock.txt'...

[omnitrace][manager::finalize][metadata]> Outputting 'omnitrace-user-api.inst-output/metadata.json' and 'omnitrace-user-api.inst-output/functions.json'...
[omnitrace][86267][0][omnitrace_finalize] Finalized
$ cat omnitrace-example-output/wall_clock.txt
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|                                                                              REAL-CLOCK TIMER (I.E. WALL-CLOCK TIMER)                                                                              |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|                                     LABEL                                       | COUNT  | DEPTH  |   METRIC   | UNITS  |   SUM    |   MEAN   |   MIN    |   MAX    |   VAR    | STDDEV   | % SELF |
|---------------------------------------------------------------------------------|--------|--------|------------|--------|----------|----------|----------|----------|----------|----------|--------|
| |0>>> ./user-api.inst                                                           |      1 |      0 | wall_clock | sec    | 5.078521 | 5.078521 | 5.078521 | 5.078521 | 0.000000 | 0.000000 |    0.0 |
| |0>>> |_initialization                                                          |      1 |      1 | wall_clock | sec    | 0.000004 | 0.000004 | 0.000004 | 0.000004 | 0.000000 | 0.000000 |  100.0 |
| |0>>> |_thread_creation                                                         |      1 |      1 | wall_clock | sec    | 0.000159 | 0.000159 | 0.000159 | 0.000159 | 0.000000 | 0.000000 |  100.0 |
| |0>>> |_thread_wait                                                             |      1 |      1 | wall_clock | sec    | 0.355307 | 0.355307 | 0.355307 | 0.355307 | 0.000000 | 0.000000 |    0.0 |
| |0>>>   |_std::vector<std::thread, std::allocator<std::thread> >::begin         |      1 |      2 | wall_clock | sec    | 0.000001 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 |  100.0 |
| |0>>>   |_std::vector<std::thread, std::allocator<std::thread> >::end           |      1 |      2 | wall_clock | sec    | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |  100.0 |
| |0>>>   |_pthread_join                                                          |      4 |      2 | wall_clock | sec    | 0.355257 | 0.088814 | 0.000001 | 0.333144 | 0.026559 | 0.162970 |  100.0 |
| |2>>>     |_start_thread                                                        |      1 |      3 | wall_clock | sec    | 0.000032 | 0.000032 | 0.000032 | 0.000032 | 0.000000 | 0.000000 |  100.0 |
| |1>>>     |_start_thread                                                        |      1 |      3 | wall_clock | sec    | 0.000036 | 0.000036 | 0.000036 | 0.000036 | 0.000000 | 0.000000 |  100.0 |
| |3>>>     |_start_thread                                                        |      1 |      3 | wall_clock | sec    | 0.000034 | 0.000034 | 0.000034 | 0.000034 | 0.000000 | 0.000000 |  100.0 |
| |4>>>     |_start_thread                                                        |      1 |      3 | wall_clock | sec    | 0.000039 | 0.000039 | 0.000039 | 0.000039 | 0.000000 | 0.000000 |  100.0 |
| |0>>> |_run                                                                     |      1 |      1 | wall_clock | sec    | 4.722993 | 4.722993 | 4.722993 | 4.722993 | 0.000000 | 0.000000 |    0.0 |
| |0>>>   |_std::char_traits<char>::length                                        |      1 |      2 | wall_clock | sec    | 0.000001 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 |  100.0 |
| |0>>>   |_std::distance<char const*>                                            |      1 |      2 | wall_clock | sec    | 0.000001 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 |  100.0 |
| |0>>>   |_std::operator+<char, std::char_traits<char>, std::allocator<char> >   |      2 |      2 | wall_clock | sec    | 0.000002 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 |  100.0 |
| |0>>>   |_run(20) x 100                                                         |      1 |      2 | wall_clock | sec    | 4.722951 | 4.722951 | 4.722951 | 4.722951 | 0.000000 | 0.000000 |    0.0 |
| |0>>>     |_run [{94,25}-{96,25}]                                               |      1 |      3 | wall_clock | sec    | 4.722925 | 4.722925 | 4.722925 | 4.722925 | 0.000000 | 0.000000 |    0.0 |
| |0>>>       |_fib                                                               |    100 |      4 | wall_clock | sec    | 4.722718 | 0.047227 | 0.046713 | 0.051987 | 0.000000 | 0.000625 |    0.0 |
| |0>>>         |_fib                                                             |    200 |      5 | wall_clock | sec    | 4.722302 | 0.023612 | 0.017827 | 0.034091 | 0.000032 | 0.005627 |    0.0 |
| |0>>>           |_fib                                                           |    400 |      6 | wall_clock | sec    | 4.721485 | 0.011804 | 0.006790 | 0.023003 | 0.000016 | 0.004024 |    0.0 |
| |0>>>             |_fib                                                         |    800 |      7 | wall_clock | sec    | 4.719858 | 0.005900 | 0.002564 | 0.016078 | 0.000006 | 0.002498 |    0.1 |
| |0>>>               |_fib                                                       |   1600 |      8 | wall_clock | sec    | 4.716572 | 0.002948 | 0.000977 | 0.011849 | 0.000002 | 0.001465 |    0.1 |
| |0>>>                 |_fib                                                     |   3200 |      9 | wall_clock | sec    | 4.709918 | 0.001472 | 0.000371 | 0.008246 | 0.000001 | 0.000831 |    0.3 |
| |0>>>                   |_fib                                                   |   6400 |     10 | wall_clock | sec    | 4.696775 | 0.000734 | 0.000140 | 0.005111 | 0.000000 | 0.000461 |    0.6 |
| |0>>>                     |_fib                                                 |  12800 |     11 | wall_clock | sec    | 4.670093 | 0.000365 | 0.000050 | 0.003166 | 0.000000 | 0.000253 |    1.1 |
| |0>>>                       |_fib                                               |  25600 |     12 | wall_clock | sec    | 4.617496 | 0.000180 | 0.000017 | 0.001959 | 0.000000 | 0.000137 |    2.3 |
| |0>>>                         |_fib                                             |  51200 |     13 | wall_clock | sec    | 4.512671 | 0.000088 | 0.000004 | 0.001212 | 0.000000 | 0.000074 |    4.6 |
| |0>>>                           |_fib                                           | 102400 |     14 | wall_clock | sec    | 4.304142 | 0.000042 | 0.000000 | 0.000752 | 0.000000 | 0.000039 |    9.6 |
| |0>>>                             |_fib                                         | 202600 |     15 | wall_clock | sec    | 3.892580 | 0.000019 | 0.000000 | 0.000469 | 0.000000 | 0.000021 |   19.0 |
| |0>>>                               |_fib                                       | 363200 |     16 | wall_clock | sec    | 3.151143 | 0.000009 | 0.000000 | 0.000293 | 0.000000 | 0.000011 |   33.2 |
| |0>>>                                 |_fib                                     | 502000 |     17 | wall_clock | sec    | 2.105217 | 0.000004 | 0.000000 | 0.000183 | 0.000000 | 0.000006 |   49.1 |
| |0>>>                                   |_fib                                   | 476000 |     18 | wall_clock | sec    | 1.071652 | 0.000002 | 0.000000 | 0.000114 | 0.000000 | 0.000004 |   63.6 |
| |0>>>                                     |_fib                                 | 294200 |     19 | wall_clock | sec    | 0.390193 | 0.000001 | 0.000000 | 0.000071 | 0.000000 | 0.000003 |   75.3 |
| |0>>>                                       |_fib                               | 115200 |     20 | wall_clock | sec    | 0.096190 | 0.000001 | 0.000000 | 0.000043 | 0.000000 | 0.000002 |   84.4 |
| |0>>>                                         |_fib                             |  27400 |     21 | wall_clock | sec    | 0.015020 | 0.000001 | 0.000000 | 0.000025 | 0.000000 | 0.000001 |   91.1 |
| |0>>>                                           |_fib                           |   3600 |     22 | wall_clock | sec    | 0.001336 | 0.000000 | 0.000000 | 0.000013 | 0.000000 | 0.000001 |   96.3 |
| |0>>>                                             |_fib                         |    200 |     23 | wall_clock | sec    | 0.000050 | 0.000000 | 0.000000 | 0.000001 | 0.000000 | 0.000000 |  100.0 |
| |0>>>     |_std::char_traits<char>::length                                      |      1 |      3 | wall_clock | sec    | 0.000001 | 0.000001 | 0.000001 | 0.000001 | 0.000000 | 0.000000 |  100.0 |
| |0>>>     |_std::distance<char const*>                                          |      1 |      3 | wall_clock | sec    | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |  100.0 |
| |0>>>     |_std::operator+<char, std::char_traits<char>, std::allocator<char> > |      2 |      3 | wall_clock | sec    | 0.000001 | 0.000001 | 0.000000 | 0.000001 | 0.000000 | 0.000000 |  100.0 |
| |0>>> |_std::operator&                                                          |      1 |      1 | wall_clock | sec    | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |  100.0 |
| |0>>> std::vector<std::thread, std::allocator<std::thread> >::~vector           |      1 |      0 | wall_clock | sec    | 0.000045 | 0.000045 | 0.000045 | 0.000045 | 0.000000 | 0.000000 |   32.7 |
| |0>>> |_std::thread::~thread                                                    |      4 |      1 | wall_clock | sec    | 0.000030 | 0.000007 | 0.000007 | 0.000009 | 0.000000 | 0.000001 |   31.2 |
| |0>>>   |_std::thread::joinable                                                 |      4 |      2 | wall_clock | sec    | 0.000021 | 0.000005 | 0.000005 | 0.000006 | 0.000000 | 0.000001 |   89.4 |
| |0>>>     |_std::thread::id::id                                                 |      4 |      3 | wall_clock | sec    | 0.000001 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |  100.0 |
| |0>>>     |_std::operator==                                                     |      4 |      3 | wall_clock | sec    | 0.000001 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |  100.0 |
| |0>>> |_std::allocator_traits<std::allocator<std::thread> >::deallocate         |      1 |      1 | wall_clock | sec    | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |  100.0 |
| |0>>> |_std::allocator<std::thread>::~allocator                                 |      1 |      1 | wall_clock | sec    | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |  100.0 |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|