rocm-systems/source/docs/causal_profiling.md
Jonathan R. Madsen 9618ddefba Causal profiling (#229)
2023-01-24 18:53:23 -06:00


Causal Profiling

.. toctree::
   :glob:
   :maxdepth: 3

What is "Causal Profiling"?

If you speed up a given block of code by X%, the application will execute Y% faster

Causal profiling directs parallel application developers to where they should focus their optimization efforts by quantifying the potential impact of optimizations. Causal profiling is rooted in the concept that software execution speed is relative: speeding up a block of code by X% is mathematically equivalent to that block of code running at its current speed while all other code runs slower by X%. Thus, causal profiling works by performing experiments on blocks of code during program execution which insert pauses to slow down all other concurrently running code. During post-processing, these experiments are translated into predictions of the potential impact of speeding up each block of code.

Consider the following C++ code executing foo and bar concurrently in two different threads where foo is 30% faster than bar (ideally):

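#include <array>
#include <cstddef>
#include <thread>

constexpr size_t FOO_N =  7 * 1000000000UL;
constexpr size_t BAR_N = 10 * 1000000000UL;

// busy loop: 7 billion iterations
void foo()
{
    for(volatile size_t i = 0; i < FOO_N; ++i) {}
}

// busy loop: 10 billion iterations
void bar()
{
    for(volatile size_t i = 0; i < BAR_N; ++i) {}
}

int main()
{
    // std::array instead of a braced initializer list: the elements of an
    // initializer list are const, so join() could not be called on them
    std::array<std::thread, 2> _threads = { std::thread{ foo },
                                            std::thread{ bar } };

    for(auto& itr : _threads)
        itr.join();
}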

No matter how many optimizations are applied to foo, the application will always require the same amount of time because the end-to-end performance is limited by bar. However, a 5% speedup in bar improves the end-to-end performance by 5%, and this trend continues linearly (a 10% speedup in bar yields a 10% end-to-end speedup, and so on) up to a 30% speedup, at which point bar executes as fast as foo. Any speedup to bar beyond 30% still yields only a 30% end-to-end speedup because the application is then limited by the performance of foo, as demonstrated in the causal profiling visualization below:

foobar-causal-plot
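The trend described above can be reproduced with a small idealized model. The sketch below is not part of OmniTrace; the function and constant names are hypothetical. It assumes that an "X% speedup" means the thread's runtime shrinks by X% and that the end-to-end runtime is simply the runtime of the slower thread.

```cpp
#include <algorithm>

// Relative amounts of work, mirroring FOO_N and BAR_N in the example
// (7 and 10 billion loop iterations). Names are hypothetical.
constexpr double foo_work = 7.0;
constexpr double bar_work = 10.0;

// Predicted end-to-end speedup (as a fraction in [0, 1]) when foo and bar
// are sped up by the given fractions. Assumes the two threads run fully
// concurrently, so the slower thread determines the total runtime.
double predicted_speedup(double foo_speedup, double bar_speedup)
{
    const double baseline = std::max(foo_work, bar_work);
    const double modified = std::max(foo_work * (1.0 - foo_speedup),
                                     bar_work * (1.0 - bar_speedup));
    return 1.0 - modified / baseline;
}
```

Under these assumptions, `predicted_speedup(0.5, 0.0)` is zero (optimizing foo does not help), `predicted_speedup(0.0, 0.05)` is about 0.05, and any bar speedup of 30% or more saturates at about 0.30.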

The full details of the causal profiling methodology can be found in the paper Coz: Finding Code that Counts with Causal Profiling. The authors' implementation is publicly available on GitHub.

Getting Started

Progress Points

Causal profiling requires "progress points" to track progress through the code in between samples. Progress points must be triggered deterministically via instrumentation. This can happen in three different ways:

  1. OmniTrace can leverage the callbacks from Kokkos-Tools, OpenMP-Tools, roctracer, etc. and the wrappers around functions for MPI, NUMA, RCCL, etc. to act as progress points
  2. Users can leverage the runtime instrumentation capabilities to insert progress points (NOTE: binary rewrite to insert progress points is not supported)
  3. Users can leverage the User API, e.g. OMNITRACE_CAUSAL_PROGRESS

Please note with regard to #2, binary rewrite to insert progress-points is not supported: when a rewritten binary is executed, Dyninst translates the instruction pointer address in order to execute the instrumentation and, as a result, call-stack samples never return instruction pointer addresses in the ranges defined as valid by OmniTrace. Hopefully, a work-around will be found in the future.
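As a rough illustration of where a progress point belongs, the sketch below marks the completion of each unit of work inside a worker loop. `EXAMPLE_PROGRESS_POINT` is a hypothetical stand-in that only increments a counter, so the sketch compiles without OmniTrace. In a real application, that line would be replaced by a User API progress point such as OMNITRACE_CAUSAL_PROGRESS; its exact usage should be checked against the OmniTrace headers.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a progress-point macro (NOT the OmniTrace API):
// it counts how many times the progress point was visited.
std::atomic<std::size_t> progress_visits{ 0 };
#define EXAMPLE_PROGRESS_POINT() \
    progress_visits.fetch_add(1, std::memory_order_relaxed)

// Each loop iteration is one unit of useful work; the progress point is
// triggered deterministically once per completed unit.
void process_items(std::size_t nitems)
{
    for(std::size_t i = 0; i < nitems; ++i)
    {
        volatile std::size_t sink = i * i;  // placeholder for real work
        (void) sink;
        EXAMPLE_PROGRESS_POINT();
    }
}

// Runs nthreads workers and returns the total number of progress-point visits.
std::size_t run_workers(std::size_t nthreads, std::size_t nitems)
{
    progress_visits.store(0);
    std::vector<std::thread> workers;
    workers.reserve(nthreads);
    for(std::size_t t = 0; t < nthreads; ++t)
        workers.emplace_back(process_items, nitems);
    for(auto& w : workers)
        w.join();
    return progress_visits.load();
}
```

Placing the progress point at the end of each unit of work means the visit rate reflects throughput, which is the quantity a causal experiment compares across virtual speedups.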

Key Concepts

| Concept | Setting | Options | Description |
|---------|---------|---------|-------------|
| Mode | `OMNITRACE_CAUSAL_MODE` | `function`, `line` | Select entire function or individual line of code for causal experiments |
| End-to-End | `OMNITRACE_CAUSAL_END_TO_END` | boolean | Perform a single experiment during the entire run (does not require progress-points) |
| Fixed speedup(s) | `OMNITRACE_CAUSAL_FIXED_SPEEDUP` | one or more values from [0, 100] | Virtual speedup or pool of virtual speedups to randomly select |
| Binary scope | `OMNITRACE_CAUSAL_BINARY_SCOPE` | regular expression(s) | Dynamic binaries containing code for experiments |
| Source scope | `OMNITRACE_CAUSAL_SOURCE_SCOPE` | regular expression(s) | `<file>` and/or `<file>:<line>` containing code to include in experiments |
| Function scope | `OMNITRACE_CAUSAL_FUNCTION_SCOPE` | regular expression(s) | Restricts experiments to matching functions (function mode) or lines of code within matching functions (line mode) |

Notes

  1. Binary scope defaults to %MAIN% (executable). Scope can be expanded to include linked libraries
  2. <file> and <file>:<line> support requires debug info (i.e. code was compiled with -g or, preferably, -g3)
  3. Function mode does not require debug info but does not support stripped binaries
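To make the regular-expression scopes more concrete, the sketch below shows one way a function name could be tested against a list of scope patterns and a list of exclusion patterns. It is an illustration and not OmniTrace's implementation: the helper names are hypothetical, and it assumes partial matching (`std::regex_search`) and that an empty scope list places everything in scope.

```cpp
#include <regex>
#include <string>
#include <vector>

// True when `name` matches at least one of the regular expressions.
// Partial (search) matching is an assumption of this sketch.
bool matches_any(const std::string& name, const std::vector<std::string>& patterns)
{
    for(const auto& pattern : patterns)
    {
        if(std::regex_search(name, std::regex{ pattern })) return true;
    }
    return false;
}

// A name is eligible when it is in scope (an empty scope list is treated as
// "everything") and does not match any exclusion pattern.
bool is_eligible(const std::string&              name,
                 const std::vector<std::string>& scope,
                 const std::vector<std::string>& exclude)
{
    const bool in_scope = scope.empty() || matches_any(name, scope);
    return in_scope && !matches_any(name, exclude);
}
```

For example, with the scope `{ "cpu_slow_func" }` only names containing `cpu_slow_func` are eligible, while the exclusion pattern `^(Kokkos::|std::enable_if)` rejects any name beginning with `Kokkos::` or `std::enable_if`.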

Speedup Prediction Variability and omnitrace-causal Executable

Causal profiling typically requires executing the application several times in order to adequately sample all the domains of executing code, experiment speedups, etc. and resolve statistical fluctuations. The omnitrace-causal executable is designed to simplify running this procedure:

$ omnitrace-causal --help
[omnitrace-causal] Usage: ./bin/omnitrace-causal [ --help (count: 0, dtype: bool)
                                                   --version (count: 0, dtype: bool)
                                                   --monochrome (max: 1, dtype: bool)
                                                   --debug (max: 1, dtype: bool)
                                                   --verbose (count: 1)
                                                   --config (min: 0, dtype: filepath)
                                                   --launcher (count: 1, dtype: executable)
                                                   --generate-configs (min: 0, dtype: folder)
                                                   --no-defaults (min: 0, dtype: bool)
                                                   --mode (count: 1, dtype: string)
                                                   --output-name (min: 1, dtype: filename)
                                                   --reset (max: 1, dtype: bool)
                                                   --end-to-end (max: 1, dtype: bool)
                                                   --wait (count: 1, dtype: seconds)
                                                   --duration (count: 1, dtype: seconds)
                                                   --iterations (count: 1, dtype: int)
                                                   --speedups (min: 0, dtype: integers)
                                                   --binary-scope (min: 0, dtype: integers)
                                                   --source-scope (min: 0, dtype: integers)
                                                   --function-scope (min: 0, dtype: regex-list)
                                                   --binary-exclude (min: 0, dtype: integers)
                                                   --source-exclude (min: 0, dtype: integers)
                                                   --function-exclude (min: 0, dtype: regex-list)
                                                 ]

    Causal profiling usually requires multiple runs to reliably resolve the speedup estimates.
    This executable is designed to streamline that process.
    For example (assume all commands end with '-- <exe> <args>'):

        omnitrace-causal -n 5 -- <exe>                  # runs <exe> 5x with causal profiling enabled

        omnitrace-causal -s 0 5,10,15,20                # runs <exe> 2x with virtual speedups:
                                                        #   - 0
                                                        #   - randomly selected from 5, 10, 15, and 20

        omnitrace-causal -F func_A func_B func_(A|B)    # runs <exe> 3x with the function scope limited to:
                                                        #   1. func_A
                                                        #   2. func_B
                                                        #   3. func_A or func_B
    General tips:
    - Insert progress points at hotspots in your code or use omnitrace's runtime instrumentation
        - Note: binary rewrite will produce a incompatible new binary
    - Run omnitrace-causal in "function" mode first (does not require debug info)
    - Run omnitrace-causal in "line" mode when you are targeting one function (requires debug info)
        - Preferably, use predictions from the "function" mode to determine which function to target
    - Limit the virtual speedups to a smaller pool, e.g., 0,5,10,25,50, to get reliable predictions quicker
    - Make use of the binary, source, and function scope to limit the functions/lines selected for experiments
        - Note: source scope requires debug info


Options:
    -h, -?, --help                 Shows this page
    --version                      Prints the version and exit

    [DEBUG OPTIONS]

    --monochrome                   Disable colorized output
    --debug                        Debug output
    -v, --verbose                  Verbose output

    [GENERAL OPTIONS]

    -c, --config                   Base configuration file
    -l, --launcher                 When running MPI jobs, omnitrace-causal needs to be *before* the executable which launches the MPI processes (i.e.
                                   before `mpirun`, `srun`, etc.). Pass the name of the target executable (or a regex for matching to the name of the
                                   target) for causal profiling, e.g., `omnitrace-causal -l foo -- mpirun -n 4 foo`. This ensures that the omnitrace
                                   library is LD_PRELOADed on the proper target
    -g, --generate-configs         Generate config files instead of passing environment variables directly. If no arguments are provided, the config files
                                   will be placed in ${PWD}/omnitrace-causal-config folder
    --no-defaults                  Do not activate default features which are recommended for causal profiling. For example: PID-tagging of output files
                                   and timestamped subdirectories are disabled by default. Kokkos tools support is added by default
                                   (OMNITRACE_USE_KOKKOSP=ON) because, for Kokkos applications, the Kokkos-Tools callbacks are used for progress points.
                                   Activation of OpenMP tools support is similar

    [CAUSAL PROFILING OPTIONS (General)]
                                   (These settings will be applied to all causal profiling runs)

    -m, --mode [ function (func) | line ]
                                   Causal profiling mode
    -o, --output-name              Output filename of causal profiling data w/o extension
    -r, --reset                    Overwrite any existing experiment results during the first run
    -e, --end-to-end               Single causal experiment for the entire application runtime
    -w, --wait                     Set the wait time (i.e. delay) before starting the first causal experiment (in seconds)
    -d, --duration                 Set the length of time (in seconds) to perform causal experimentationafter the first experiment is started. Once this
                                   amount of time has elapsed, no more causal experiments will be started but any currently running experiment will be
                                   allowed to finish.
    -n, --iterations               Number of times to repeat the combination of run configurations

    [CAUSAL PROFILING OPTIONS (Combinatorial)]
                                   (Each individual argument to these options will multiply the number runs by the number of arguments and the number of
                                   iterations. E.g. -n 2 -B "MAIN" -F "foo" "bar" will produce 4 runs: 2 iterations x 1 binary scope x 2 function scopes
                                   (MAIN+foo, MAIN+bar, MAIN+foo, MAIN+bar))

    -s, --speedups                 Pool of virtual speedups to sample from during experimentation. Each space designates a group and multiple speedups can
                                   be grouped together by commas, e.g. -s 0 0,10,20-50 is two groups: group #1 is '0' and group #2 is '0 10 20 25 30 35 40
                                   45 50'
    -B, --binary-scope             Restricts causal experiments to the binaries matching the list of regular expressions. Each space designates a group
                                   and multiple scopes can be grouped together with a semi-colon
    -S, --source-scope             Restricts causal experiments to the source files or source file + lineno pairs (i.e. <file> or <file>:<line>) matching
                                   the list of regular expressions. Each space designates a group and multiple scopes can be grouped together with a
                                   semi-colon
    -F, --function-scope           Restricts causal experiments to the functions matching the list of regular expressions. Each space designates a group
                                   and multiple scopes can be grouped together with a semi-colon
    -BE, --binary-exclude          Excludes causal experiments from being performed on the binaries matching the list of regular expressions. Each space
                                   designates a group and multiple excludes can be grouped together with a semi-colon
    -SE, --source-exclude          Excludes causal experiments from being performed on the code from the source files or source file + lineno pair (i.e.
                                   <file> or <file>:<line>) matching the list of regular expressions. Each space designates a group and multiple excludes
                                   can be grouped together with a semi-colon
    -FE, --function-exclude        Excludes causal experiments from being performed on the functions matching the list of regular expressions. Each space
                                   designates a group and multiple excludes can be grouped together with a semi-colon
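The grouping rules in the help text can be checked with a little arithmetic. The sketch below is not taken from omnitrace-causal, and the function names are hypothetical. It expands one comma-separated speedup group and multiplies out the number of runs. The range step of 5 is an assumption inferred from the help example, where `20-50` expands to `20 25 30 35 40 45 50`.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Expands one speedup group, e.g. "0,10,20-50", into individual values.
// The default step for ranges (5) is an assumption inferred from the help text.
std::vector<int> expand_speedup_group(const std::string& group, int step = 5)
{
    if(step < 1) step = 1;  // guard against a non-terminating range loop
    std::vector<int>   values;
    std::istringstream stream{ group };
    std::string        token;
    while(std::getline(stream, token, ','))
    {
        if(token.empty()) continue;
        const auto dash = token.find('-');
        if(dash == std::string::npos)
        {
            values.push_back(std::stoi(token));
        }
        else
        {
            const int first = std::stoi(token.substr(0, dash));
            const int last  = std::stoi(token.substr(dash + 1));
            for(int v = first; v <= last; v += step)
                values.push_back(v);
        }
    }
    return values;
}

// Total executions: iterations multiplied by the number of groups passed to
// each combinatorial option. An option with zero groups counts as one.
std::size_t total_runs(std::size_t iterations,
                       const std::vector<std::size_t>& groups_per_option)
{
    std::size_t total = iterations;
    for(auto ngroups : groups_per_option)
        total *= (ngroups == 0) ? 1 : ngroups;
    return total;
}
```

With these definitions, `total_runs(2, { 1, 2 })` reproduces the 4 runs from the help text (2 iterations x 1 binary scope x 2 function scopes).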

Examples

#!/bin/bash -e

module load omnitrace

N=20
I=3

# when providing speedups to omnitrace-causal, speedup
# groups are separated by a space so "0,10" results in
# one speedup group where omnitrace samples from
# the speedup set of {0, 10}. Passing "0 10" (without
# quotes) to omnitrace-causal multiplies the
# number of runs by 2, where the first half of the
# runs instruct omnitrace to only use 0 as the
# speedup and the second half of the runs instruct
# omnitrace to only use 10 as the speedup.
SPEEDUPS="0,0,0,10,20,30,40,50,50,75,75,75,90,90,90"
# thus, -s ${SPEEDUPS} only multiplies the number
# of runs by 1 whereas -s ${SPEEDUPS_E2E} multiplies
# the number of runs by 15:
#   - 3 runs with speedup of 0
#   - 1 run for each of the speedups 10, 20, 30, and 40
#   - 2 runs with speedup of 50
#   - 3 runs with speedup of 75
#   - 3 runs with speedup of 90
SPEEDUPS_E2E=$(echo "${SPEEDUPS}" | sed 's/,/ /g')


# 20 iterations in function mode with 1 speedup group
# and source scope set to .cpp files
#
# outputs to files:
#   - causal/experiments.func.coz
#   - causal/experiments.func.json
#
# total executions: 20
#
omnitrace-causal        \
    -n ${N}             \
    -s ${SPEEDUPS}      \
    -m function         \
    -o experiments.func \
    -S ".*\\.cpp"       \
    --                  \
    ./causal-omni-cpu "${@}"


# 20 iterations in line mode with 1 speedup group
# and source scope restricted to lines 155 and 165
# in the causal.cpp file.
#
# outputs to files:
#   - causal/experiments.line.coz
#   - causal/experiments.line.json
#
# total executions: 20
#
omnitrace-causal                \
    -n ${N}                     \
    -s ${SPEEDUPS}              \
    -m line                     \
    -o experiments.line         \
    -S "causal\\.cpp:(155|165)" \
    --                          \
    ./causal-omni-cpu "${@}"


# 3 iterations in function mode of 15 singular speedups
# in end-to-end mode with 2 different function scopes
# where one is restricted to "cpu_slow_func" and
# another is restricted to "cpu_fast_func".
#
# outputs to files:
#   - causal/experiments.func.e2e.coz
#   - causal/experiments.func.e2e.json
#
# total executions: 90
#
omnitrace-causal            \
    -n ${I}                 \
    -s ${SPEEDUPS_E2E}      \
    -m func                 \
    -e                      \
    -o experiments.func.e2e \
    -F "cpu_slow_func"      \
       "cpu_fast_func"      \
    --                      \
    ./causal-omni-cpu "${@}"

# 3 iterations in line mode of 15 singular speedups
# in end-to-end mode with 2 different source scopes
# where one is restricted to line 155 in causal.cpp
# and another is restricted to line 165 in causal.cpp.
#
# outputs to files:
#   - causal/experiments.line.e2e.coz
#   - causal/experiments.line.e2e.json
#
# total executions: 90
#
omnitrace-causal            \
    -n ${I}                 \
    -s ${SPEEDUPS_E2E}      \
    -m line                 \
    -e                      \
    -o experiments.line.e2e \
    -S "causal\\.cpp:155"   \
       "causal\\.cpp:165"   \
    --                      \
    ./causal-omni-cpu "${@}"


export OMP_NUM_THREADS=8
export OMP_PROC_BIND=spread
export OMP_PLACES=threads

# set number of iterations to 5
N=5

# 5 iterations in function mode of 1 speedup
# group with the source scope restricted
# to files containing "lulesh" in their filename
# and exclude functions which start with "Kokkos::"
# or "std::enable_if".
#
# outputs to files:
#   - causal/experiments.func.coz
#   - causal/experiments.func.json
#
# total executions: 5
#
# First of 5 executions overwrites any
# existing causal/experiments.func.(coz|json)
# file due to "--reset" argument
#
omnitrace-causal                            \
    --reset                                 \
    -n ${N}                                 \
    -s ${SPEEDUPS}                          \
    -m func                                 \
    -o experiments.func                     \
    -S "lulesh.*"                           \
    -FE "^(Kokkos::|std::enable_if)"        \
    --                                      \
    ./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p


# 5 iterations in line mode of 1 speedup
# group with the source scope restricted
# to files containing "lulesh" in their filename
# and exclude functions which start with "exec_range"
# or "execute", or which contain either
# "construct_shared_allocation" or "._omp_fn." in
# the function name.
#
# outputs to files:
#   - causal/experiments.line.coz
#   - causal/experiments.line.json
#
# total executions: 5
#
# First of 5 executions overwrites any
# existing causal/experiments.line.(coz|json)
# file due to "--reset" argument
#
omnitrace-causal                            \
    --reset                                 \
    -n ${N}                                 \
    -s ${SPEEDUPS}                          \
    -m line                                 \
    -o experiments.line                     \
    -S "lulesh.*"                           \
    -FE "^(exec_range|execute);construct_shared_allocation;\\._omp_fn\\." \
    --                                      \
    ./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p


# 5 iterations in line mode of 1 speedup
# group with the source scope restricted
# to files whose basename is "lulesh.cc"
# for 3 different functions:
#   - ApplyMaterialPropertiesForElems
#   - CalcHourglassControlForElems
#   - CalcVolumeForceForElems
#
# outputs to files:
#   - causal/experiments.line.targeted.coz
#   - causal/experiments.line.targeted.json
#
# total executions: 15
#
# First of 15 executions overwrites any
# existing causal/experiments.line.targeted.(coz|json)
# file due to "--reset" argument
#
omnitrace-causal                            \
    --reset                                 \
    -n ${N}                                 \
    -s ${SPEEDUPS}                          \
    -m line                                 \
    -o experiments.line.targeted            \
    -F "ApplyMaterialPropertiesForElems"    \
       "CalcHourglassControlForElems"       \
       "CalcVolumeForceForElems"            \
    -S "lulesh\\.cc"                        \
    --                                      \
    ./lulesh-omni -i 50 -s 200 -r 20 -b 5 -c 5 -p
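
The "total executions" figures in the comments above are consistent with multiplying the number of iterations (-n) by the number of speedup entries (-s) and by the number of scope values passed to -F or -S. The sketch below only illustrates that arithmetic; the helper name total_executions is hypothetical and is not part of omnitrace-causal.

```shell
# Hypothetical helper (not part of OmniTrace): reproduces the
# "total executions" arithmetic used in the script comments.
#   $1: iterations                (value given to -n)
#   $2: number of speedup entries (values given to -s)
#   $3: number of scope values    (values given to -F or -S)
total_executions() {
    echo $(( $1 * $2 * $3 ))
}

# 3 iterations, 15 singular speedups, 2 scopes (the end-to-end examples)
total_executions 3 15 2

# 5 iterations, 1 speedup group, 3 functions (the targeted lulesh example)
total_executions 5 1 3
```

Estimating this product before launching a run helps, since every execution is a full replay of the application.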

Using omnitrace-causal with other launchers (e.g. mpirun)

The omnitrace-causal executable is intended to assist with application replay and is designed to always be at the start of the command line (i.e. the primary process). omnitrace-causal typically adds an LD_PRELOAD of the OmniTrace libraries to the environment before launching the command, which injects the functionality required to start the causal profiling tooling. However, this is problematic when the target application requires another command-line tool in order to run, e.g. foo is the target application but executing foo requires mpirun -n 2 foo. Simply running omnitrace-causal -- mpirun -n 2 foo would apply the causal profiling to mpirun instead of foo.

omnitrace-causal remedies this with the command-line option -l / --launcher, which indicates that the target application is started through a launcher script/executable. The argument to this option is the name of (or a regex for) the target application on the command line. When --launcher is used, omnitrace-causal still generates all the replay configurations and executes them, but it delays adding the LD_PRELOAD. Instead, it injects a call to itself into the command line right before the target application. This recursive call inherits the configuration from the parent omnitrace-causal executable, inserts an LD_PRELOAD into the environment, and then invokes execv to replace itself with the target application.

In other words, the following command:

omnitrace-causal -l foo -n 3 -- mpirun -n 2 foo

Effectively results in:

mpirun -n 2 omnitrace-causal -- foo
mpirun -n 2 omnitrace-causal -- foo
mpirun -n 2 omnitrace-causal -- foo
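
The rewrite shown above can be mimicked with a few lines of shell. This is only an illustrative sketch of the idea, not how omnitrace-causal is implemented: the function name inject_before_target is hypothetical, it assumes the launcher pattern is an extended regex tested against each argument, and it injects before the first match only.

```shell
# Hypothetical illustration: print the command line that results from
# placing "omnitrace-causal --" in front of the first argument that
# matches the launcher pattern.
#   $1:   pattern for the target application (extended regex, assumed)
#   rest: the original command line
inject_before_target() {
    pattern="$1"; shift
    injected=0
    out=""
    for arg in "$@"; do
        # only the first matching argument is treated as the target
        if [ "$injected" -eq 0 ] && printf '%s\n' "$arg" | grep -Eq -- "$pattern"; then
            out="${out:+$out }omnitrace-causal --"
            injected=1
        fi
        out="${out:+$out }$arg"
    done
    printf '%s\n' "$out"
}

inject_before_target 'foo' mpirun -n 2 foo
# prints: mpirun -n 2 omnitrace-causal -- foo
```

With -n 3, the rewritten command line is simply executed three times, once per replay.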

Visualizing the Causal Output

OmniTrace generates causal/experiments.json and causal/experiments.coz in ${OMNITRACE_OUTPUT_PATH}/${OMNITRACE_OUTPUT_PREFIX}. A standalone GUI for viewing the causal profiling results is under development. Until it is available, visit plasma-umass.org/coz/ and open the *.coz file.

OmniTrace vs. Coz

This section is intended for readers who are familiar with the Coz profiler. OmniTrace provides several additional features and utilities for causal profiling:

| | Coz | OmniTrace | Notes |
|---|---|---|---|
| Debug info | Requires debug info in DWARF v3 format (`-gdwarf-3`) | Optional, supports any DWARF format version | See Note #1 below |
| Experiment selection | `<file>:<line>` | `<function>` or `<file>:<line>` | See Note #2 below |
| Experiment speedups | Randomly samples between 0..100 in increments of 5 or one fixed speedup | Supports specifying smaller subset | See Note #3 below |
| Scope options | Supports binary and source scopes | Supports binary, source, and function scopes | See Note #4, #5, and #6 below |
| Scope inclusion | Uses `%` as wildcard for binary and source scopes | Full regex support for binary, source, and function scopes | |
| Scope exclusion | Not supported | Supports regexes for excluding binary/source/function | See Note #7 below |
| Call-stack sampling | Linux perf | libunwind | See Note #8 below |
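
For readers migrating existing Coz scope patterns, the difference in the "Scope inclusion" row can be bridged with a small conversion. The sketch below is an assumption-laden illustration rather than an OmniTrace utility: the function name coz_scope_to_regex is hypothetical, and it assumes that the Coz % wildcard means "any sequence of characters" and that every other character should match literally.

```shell
# Hypothetical helper (not part of OmniTrace or Coz): convert a Coz-style
# scope using % as a wildcard into an equivalent regular expression.
coz_scope_to_regex() {
    # 1) backslash-escape regex metacharacters so they match literally
    # 2) translate the % wildcard into ".*"
    printf '%s\n' "$1" | sed -e 's/[][\.*^$+?(){}|]/\\&/g' -e 's/%/.*/g'
}

coz_scope_to_regex 'src/%.cpp'
# prints: src/.*\.cpp
```

The result could then be passed as a source or binary scope regex, e.g. as the argument to -S in the script examples earlier on this page.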

Notes

  1. OmniTrace supports a "function" mode which does not require debug info
  2. OmniTrace supports selecting the entire range of instruction pointers for a function instead of the instruction pointer for one line. In large codes, "function" mode can resolve in fewer iterations. Once a target function is identified, one can switch to line mode and limit the function scope to the target function
  3. OmniTrace supports randomly sampling from subsets, e.g. { 0, 0, 5, 10 } where 0% is selected 50% of the time and 5% and 10% are each selected 25% of the time
  4. OmniTrace and Coz have the same definition for binary scope: the binaries loaded at runtime (e.g. executable and linked libraries)
  5. OmniTrace "source scope" supports both <file> and <file>:<line> formats, in contrast to Coz "source scope", which requires the <file>:<line> format
  6. OmniTrace supports a "function" scope which narrows the functions/lines which are eligible for causal experiments to those within the matching functions
  7. OmniTrace supports a second filter on scopes for removing binaries/sources/functions caught by the inclusive match, e.g. BINARY_SCOPE=.* + BINARY_EXCLUDE=libmpi.* initially includes all binaries but the exclude regex then removes the MPI libraries